CN110705351A - Video conference sign-in method and system

Video conference sign-in method and system

Info

Publication number
CN110705351A
Authority
CN
China
Prior art keywords
video
facial feature
face
feature vector
comprehensive
Prior art date
Legal status
Pending
Application number
CN201910804717.4A
Other languages
Chinese (zh)
Inventor
He Weiwei (何巍巍)
Peng Qingtai (彭庆太)
Han Jie (韩杰)
Wang Yanhui (王艳辉)
Current Assignee
Visionvera Information Technology Co Ltd
Original Assignee
Visionvera Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Visionvera Information Technology Co Ltd
Priority to CN201910804717.4A
Priority to CN202210824953.4A (published as CN115311706A)
Publication of CN110705351A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/7747 Organisation of the process, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G07 CHECKING-DEVICES
    • G07C TIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
    • G07C1/00 Registering, indicating or recording the time of events or elapsed time, e.g. time-recorders for work people
    • G07C1/10 Registering, indicating or recording the time of events or elapsed time, e.g. time-recorders for work people together with the recording, indicating or registering of other data, e.g. of signs of identity

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention provides a video conference sign-in method and system. The method comprises the following steps: after detecting that the face of a participant appears in the captured video, a video networking terminal acquires face images of the face at a plurality of different angles; the terminal extracts the facial features corresponding to each face image and combines them into a comprehensive facial feature vector of the face; the terminal uploads the comprehensive facial feature vector to a video networking server based on the video networking protocol; and after determining that a reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference, the server determines that the participant has signed in successfully. By adopting face recognition for sign-in, the invention achieves high sign-in efficiency and high sign-in accuracy, and the recognition scheme combining several different angles further improves the accuracy of face recognition.

Description

Video conference sign-in method and system
Technical Field
The invention relates to the technical field of video networking, in particular to a sign-in method and a sign-in system for a video conference.
Background
With the rapid development of network technology, two-way communications such as video conferencing, video teaching and video telephony have become widespread in users' daily life, work and study.
Video conferencing refers to a meeting in which people at two or more locations hold a face-to-face conversation through communication devices and a network. According to the number of participating sites, video conferences can be divided into point-to-point conferences and multipoint conferences. Individuals in daily life, who have no particular requirements on the confidentiality of the conversation, the quality of the conference or its scale, can simply use video software for video chat. Commercial video conferences held by government organs, enterprises and institutions, however, require a stable and secure network, reliable conference quality and a formal conference environment, so a dedicated video conference system is built with professional video conference equipment.
Before a video conference begins, the participants need to sign in. The existing sign-in methods are generally card punching or manual registration, which are inefficient and not very accurate.
Disclosure of Invention
In view of the above, embodiments of the present invention are proposed to provide a sign-in method and system for a video conference that overcome, or at least partially solve, the above problems.
In a first aspect, an embodiment of the invention discloses a sign-in method for a video conference, applied to a video conference involving a video networking server and a plurality of video networking terminals participating in the conference; the method comprises the following steps:
after detecting that the face of a participant appears in the captured video, the video networking terminal acquires face images of the face at a plurality of different angles;
the video networking terminal extracts the facial features corresponding to each face image and combines the facial features into a comprehensive facial feature vector of the face;
the video networking terminal uploads the comprehensive facial feature vector to the video networking server based on the video networking protocol;
and after determining that a reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference, the video networking server determines that the participant has signed in successfully.
Optionally, the step of acquiring face images of the face at a plurality of different angles includes: the video networking terminal acquires the motion track of a key point in the face; and after determining that the motion track conforms to the preset motion track of a certain angle, the video networking terminal extracts a frame of image from the video as the face image of the face at that angle.
Optionally, the step of combining the facial features into a comprehensive facial feature vector of the face includes: acquiring the preset weight value of the face image at each angle; and performing, according to the weight values, a weighted combination of the facial features corresponding to the face images at the different angles to obtain the comprehensive facial feature vector of the face.
Optionally, the plurality of different angles comprises at least two of: a frontal face angle, a left side face angle, a right side face angle, a head-up angle and a head-down angle, wherein the face image at the frontal face angle has the largest weight value.
Optionally, after the step of the video networking terminal uploading the comprehensive facial feature vector to the video networking server based on the video networking protocol, the method further includes: after determining that no reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference, the video networking server stores the comprehensive facial feature vector as the reference comprehensive facial feature vector of the participant.
Optionally, after the step of the video networking terminal uploading the comprehensive facial feature vector to the video networking server based on the video networking protocol, the method further includes: after detecting that the face of the participant has disappeared, if the video networking terminal determines that the disappearance duration exceeds a preset duration, it marks the comprehensive facial feature vector of the face as disappeared and uploads the marked comprehensive facial feature vector to the video networking server; and after determining that a reference comprehensive facial feature vector matching the marked comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to the participants who have signed in successfully, the video networking server determines that the participant has left early.
In a second aspect, an embodiment of the invention discloses a sign-in system for a video conference, comprising a video networking server and a plurality of video networking terminals participating in the video conference;
the video networking terminal includes:
an acquiring module, used for acquiring face images of the face at a plurality of different angles after detecting that the face of a participant appears in the captured video;
a combining module, used for extracting the facial features corresponding to each face image and combining the facial features into a comprehensive facial feature vector of the face;
a first uploading module, used for uploading the comprehensive facial feature vector to the video networking server based on the video networking protocol;
the video networking server comprises:
a first determining module, used for determining that the participant has signed in successfully after determining that a reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference.
Optionally, the obtaining module includes: a track obtaining unit, used for obtaining the motion track of a key point in the face; and an image extraction unit, used for extracting a frame of image from the video as the face image of the face at a certain angle after determining that the motion track conforms to the preset motion track of that angle.
Optionally, the combining module comprises: a weight obtaining unit, used for obtaining the preset weight value of the face image at each angle; and a feature combining unit, used for performing, according to the weight values, a weighted combination of the facial features corresponding to the face images at the different angles to obtain the comprehensive facial feature vector of the face.
Optionally, the plurality of different angles comprises at least two of: a frontal face angle, a left side face angle, a right side face angle, a head-up angle and a head-down angle, wherein the face image at the frontal face angle has the largest weight value.
Optionally, the video networking server further comprises: a storage module, used for storing the comprehensive facial feature vector as the reference comprehensive facial feature vector of the participant after determining that no reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference.
Optionally, the video networking terminal further includes: a second uploading module, used for marking the comprehensive facial feature vector of the face as disappeared and uploading the marked comprehensive facial feature vector to the video networking server if, after detecting that the face of the participant has disappeared, the disappearance duration is determined to exceed a preset duration; and the video networking server further comprises: a second determining module, used for determining that the participant has left early after determining that a reference comprehensive facial feature vector matching the marked comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to the participants who have signed in successfully.
In the embodiment of the invention, after detecting that the face of a participant appears in the captured video, the video networking terminal acquires face images of the face at a plurality of different angles; the terminal extracts the facial features corresponding to each face image and combines them into a comprehensive facial feature vector of the face; the terminal uploads the comprehensive facial feature vector to the video networking server based on the video networking protocol; and after determining that a matching reference comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference, the server determines that the participant has signed in successfully. Therefore, on one hand, the video networking terminal of each participant can capture the face image of that participant, and sign-in by face recognition is both efficient and accurate; on the other hand, because the comprehensive facial feature vector is obtained from face images at several different angles and matching is performed on that combined vector, the scheme can ensure that what stands in front of the terminal is a real, moving participant rather than a static object such as a photo, and the multi-angle recognition further improves the accuracy of face recognition.
Drawings
FIG. 1 is a schematic networking diagram of the video network of the present invention;
FIG. 2 is a schematic diagram of the hardware structure of a node server of the present invention;
FIG. 3 is a schematic diagram of the hardware structure of an access switch of the present invention;
FIG. 4 is a schematic diagram of the hardware structure of an Ethernet protocol conversion gateway of the present invention;
FIG. 5 is a flowchart of the steps of a sign-in method for a video conference according to the first embodiment of the present invention;
FIG. 6 is a flowchart of the steps of a sign-in method for a video conference according to the second embodiment of the present invention;
FIG. 7 is a diagram of device interaction according to the second embodiment of the present invention;
FIG. 8 is a block diagram of a sign-in system for a video conference according to the third embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The embodiment of the invention provides a sign-in scheme for a video conference that follows the video networking protocol and performs sign-in through face recognition technology.
In a video networking video conference, the devices involved may include a video networking server and a plurality of video networking terminals participating in the conference. The video networking server manages the various services in the video network in a unified way; every device that carries out video networking services must register with the server before it can operate normally, and after successful registration the server assigns the device information such as a video networking number and a video networking MAC (Media Access Control) address. The video networking number is the identification number of a device registered in the video network and identifies a real or virtual terminal that can initiate video networking services. A video networking terminal is a terminal that carries out services based on the video networking protocol; it can do so once it has registered with the video networking server.
Example one
Referring to FIG. 5, a flowchart of the steps of a sign-in method for a video conference according to the first embodiment of the present invention is shown.
The video conference sign-in method of the embodiment of the invention may comprise the following steps:
step 501, after detecting that the faces of the participants appear in the collected video, the video network terminal respectively acquires face images of the faces at a plurality of different angles.
The terminal of the video networking can be various Set Top Boxes (STB) based on the video networking protocol, and the like. The video network terminal can be connected with external equipment such as a camera and a microphone, videos can be collected through the camera, and audios can be collected through the microphone. In the embodiment of the present invention, each video network terminal participating in a video conference can execute corresponding steps in the check-in method of the video conference, and the embodiment of the present invention is described by taking one video network terminal as an example.
The camera may capture video before the video conference begins, or at the very beginning of the video conference. The video network terminal can detect whether the human face of the meeting personnel appears in the video collected by the camera in real time, and respectively acquire the human face images of the human face at a plurality of different angles after the human face of the meeting personnel appears.
Step 502: the video networking terminal extracts the facial features corresponding to each face image and combines the facial features into a comprehensive facial feature vector of the face.
For the captured face images, the video networking terminal extracts the facial features corresponding to each image, obtaining several groups of facial features, and then combines these groups into a comprehensive facial feature vector of the face.
Step 503: the video networking terminal uploads the comprehensive facial feature vector to the video networking server based on the video networking protocol.
Step 504: after determining that a reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference, the video networking server determines that the participant has signed in successfully.
The video networking terminal uploads the combined comprehensive facial feature vector to the video networking server based on the video networking protocol. After receiving it, the server retrieves the reference comprehensive facial feature vectors corresponding to all participants of the video conference and determines whether one of them matches the uploaded vector; if so, it determines that the participant corresponding to the uploaded vector has signed in successfully.
On one hand, in the embodiment of the invention, the video networking terminal of each participant can capture the face image of that participant, and sign-in by face recognition is both efficient and accurate; on the other hand, because the comprehensive facial feature vector is obtained from face images at several different angles and matching is performed on that combined vector, the scheme can ensure that what stands in front of the terminal is a real, moving participant rather than a static object such as a photo, and the multi-angle recognition further improves the accuracy of face recognition.
Example two
Referring to FIG. 6, a flowchart of the steps of a sign-in method for a video conference according to the second embodiment of the present invention is shown.
The video conference sign-in method of the embodiment of the invention may comprise the following steps:
step 601, after detecting that the faces of the participants appear in the acquired video, the video network terminal respectively acquires face images of the faces at a plurality of different angles.
The video network terminal detects whether the human faces of the participants appear in the video collected by the camera in real time. In an alternative embodiment, whether the faces of the participants appear in the video can be detected according to a fuzzy recognition algorithm. In an implementation, a Viola-Jones face detector may be utilized to detect faces in a video. Firstly, organizing a sample set, solving Haar characteristics (human faces and non-human faces) of the sample set, and training a classifier by using the characteristics: and (4) modifying the weight of the sample according to whether the sample classification in each layer of classifier is correct, sending the modified sample to the next layer of classifier for training, and fusing each layer of classifier to be used as a final Adaboost decision classifier. And extracting Haar characteristics from the video during detection, combining a cascade Adaboost algorithm to obtain a face detection rate, and determining the faces of the participants appearing in the video when the face detection rate reaches the standard.
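For illustration only, the following is a minimal Python sketch of this detection step using OpenCV's bundled Haar cascade, which implements the Viola-Jones cascaded-Adaboost approach described above; the camera index and the detection parameters are assumptions, not values from the patent.

```python
import cv2

# OpenCV ships a pre-trained Viola-Jones (Haar features + cascaded Adaboost)
# frontal-face detector; loading it stands in for the training described above.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

def detect_faces(frame):
    """Return bounding boxes of faces found in one video frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # scaleFactor/minNeighbors are typical values, not values from the patent.
    return detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

cap = cv2.VideoCapture(0)  # assumed camera index
ok, frame = cap.read()
if ok and len(detect_faces(frame)) > 0:
    print("participant face detected")
cap.release()
```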
After the face of a participant appears in the video, face images of the face at a plurality of different angles are acquired. In an alternative embodiment, the plurality of different angles includes at least two of: a frontal face angle, a left side face angle, a right side face angle, a head-up angle and a head-down angle.
The step of acquiring face images of the face at a plurality of different angles may include: the video networking terminal acquires the motion track of a key point in the face; and after determining that the motion track conforms to the preset motion track of a certain angle, the terminal extracts a frame of image from the video as the face image of the face at that angle.
The key point in the face can be any point located on the vertical centre line of the face, such as the nose or the mouth. A motion track is preset for each angle. For example, taking the centre of the video picture as the coordinate origin: for the frontal face angle, the key point stays roughly stationary on the vertical axis; for the left side face angle, the key point's horizontal coordinate moves to the left of the vertical axis; for the right side face angle, it moves to the right of the vertical axis; for the head-up angle, the key point's vertical coordinate moves above the horizontal axis; and for the head-down angle, it moves below the horizontal axis. The video networking terminal locks onto and tracks the key point, obtains its motion track, and judges whether the track conforms to the preset track of some angle; if so, it extracts a frame of image from the video as the face image of the face at that angle. In an alternative embodiment, the frame can be extracted from the video with the multimedia video processing tool FFmpeg.
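To make the trajectory rules concrete, here is a hedged sketch mapping a tracked nose-keypoint trajectory (with the frame centre as origin, as described above) to one of the five preset angles; the tolerance value and the direction rules are illustrative assumptions. A single frame can then be pulled from the stream with FFmpeg, as the text suggests.

```python
def classify_angle(track, tol=5):
    """Map a keypoint track [(x, y), ...] (origin at the frame centre) to a
    pose label. Assumed rules mirroring the description: roughly stationary
    on the vertical axis -> frontal; horizontal drift -> left/right side
    face; vertical drift -> head-up/head-down."""
    dx = track[-1][0] - track[0][0]
    dy = track[-1][1] - track[0][1]
    if abs(dx) <= tol and abs(dy) <= tol:
        return "frontal"
    if abs(dx) >= abs(dy):
        return "left" if dx < 0 else "right"
    return "head_up" if dy < 0 else "head_down"  # image y grows downward

# Example: a nose keypoint drifting left across successive frames.
print(classify_angle([(0, 0), (-4, 1), (-9, 0), (-15, 2)]))  # -> "left"
```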
Step 602: the video networking terminal extracts the facial features corresponding to each face image and combines the facial features into a comprehensive facial feature vector of the face.
Having acquired face images at several different angles, the video networking terminal extracts the facial features corresponding to each image. In an alternative embodiment, the facial features can be extracted with a pre-trained Convolutional Neural Network (CNN) model. The CNN model comprises an input layer, convolutional layers, pooling layers and a fully connected layer: a face image is fed in through the input layer, convolved by the convolutional layers, compressed by the pooling layers, and finally mapped by the fully connected layer into a facial feature vector of multiple dimensions.
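The patent does not disclose the network's architecture or dimensions, so the following is only a minimal PyTorch sketch with the layer types named above (input, convolution, pooling, fully connected) producing a fixed-length facial feature vector; every size is an illustrative assumption.

```python
import torch
import torch.nn as nn

class FaceFeatureCNN(nn.Module):
    """Toy CNN: convolution -> pooling -> convolution -> pooling -> fully
    connected head that outputs a facial feature vector."""
    def __init__(self, feature_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 32x32 -> 16x16
        )
        self.head = nn.Linear(32 * 16 * 16, feature_dim)

    def forward(self, x):
        return self.head(self.backbone(x).flatten(1))

model = FaceFeatureCNN()
face = torch.randn(1, 3, 64, 64)  # one 64x64 RGB face image (assumed size)
print(model(face).shape)          # -> torch.Size([1, 128])
```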
The video networking terminal then combines the facial features corresponding to the several face images into a comprehensive facial feature vector of the face. In an alternative embodiment, the step of combining the facial features into a comprehensive facial feature vector may comprise: acquiring the preset weight value of the face image at each angle; and performing, according to the weight values, a weighted combination of the facial features corresponding to the face images at the different angles to obtain the comprehensive facial feature vector of the face.
The weight value of the face image at each angle is set in advance, and the weight values sum to 1. Considering that the face image at the frontal face angle contributes the most, its weight value may be set to be the largest. As for the specific assignment of the weight values, a person skilled in the art may choose any suitable values according to practical experience; the embodiment of the present invention is not limited in this respect. For example, if the angles are the frontal face angle, the left side face angle, the right side face angle, the head-up angle and the head-down angle, the weight of the frontal face image may be set to 0.4 and the weights of the other four face images to 0.15 each.
The facial features corresponding to the face images at the different angles are then weighted and combined according to these weight values. For example, let the frontal face image have weight a and facial feature A, the left side face image weight b and feature B, the right side face image weight c and feature C, the head-up image weight d and feature D, and the head-down image weight e and feature E, where A, B, C, D and E may be feature vectors of the same dimension. The weighted combination then yields the comprehensive facial feature vector of the face as a·A + b·B + c·C + d·D + e·E.
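A minimal sketch of this weighted combination, assuming the per-angle features are equal-length vectors and using the example weights from the preceding paragraph:

```python
import numpy as np

# Example weights a..e from the text (they sum to 1) and stand-in features
# A..E, assumed here to be vectors of one common dimension.
weights = {"frontal": 0.4, "left": 0.15, "right": 0.15,
           "head_up": 0.15, "head_down": 0.15}
features = {angle: np.random.rand(128) for angle in weights}

# Comprehensive facial feature vector: a*A + b*B + c*C + d*D + e*E
combined = sum(weights[angle] * features[angle] for angle in weights)
print(combined.shape)  # (128,)
```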
Step 603: the video networking terminal uploads the comprehensive facial feature vector to the video networking server based on the video networking protocol.
The video networking terminal and the video networking server interact based on the video networking protocol. After obtaining the comprehensive facial feature vector corresponding to the participant's face in the video, the terminal packages the vector into a video networking protocol data packet based on the video networking protocol and uploads the packet to the video networking server through the video network. The packet may also carry information such as the video networking number of the terminal (the source device), the video networking MAC address of the terminal, the video networking number of the server (the destination device), and the video networking MAC address of the server.
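The wire format of the video networking protocol is proprietary and not disclosed here, so the sketch below only illustrates the general idea of wrapping the feature vector in a header that carries the source and destination video networking numbers and MAC addresses; every field name, width and ordering is a hypothetical assumption, not the real protocol.

```python
import struct
import numpy as np

def pack_checkin_packet(src_num, src_mac, dst_num, dst_mac, vector):
    """Hypothetical layout: header (numbers + MAC addresses + payload length)
    followed by the feature vector as float32 bytes. NOT the real protocol."""
    payload = np.asarray(vector, dtype=np.float32).tobytes()
    header = struct.pack("!I6sI6sI", src_num, src_mac,
                         dst_num, dst_mac, len(payload))
    return header + payload

pkt = pack_checkin_packet(0x1001, b"\x00\x11\x22\x33\x44\x55",
                          0x0001, b"\xaa\xbb\xcc\xdd\xee\xff",
                          np.random.rand(128))
print(len(pkt))  # 24-byte header + 512-byte payload = 536
```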
In step 604, the video networking server determines whether a reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference. If yes, step 605 is executed; if not, step 606 is executed.
After receiving the comprehensive facial feature vector uploaded by the video networking terminal, the video networking server retrieves the reference comprehensive facial feature vectors corresponding to all participants of the video conference. In implementation, face images of every participant's face at a plurality of different angles can be collected in advance, the comprehensive facial feature vector of each face obtained by the method described above and used as that participant's reference comprehensive facial feature vector, and the reference vectors of all participants stored in a database. The video networking server can then fetch them from the database.
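As an illustration of pre-registering reference vectors, a sketch using SQLite as the database; the schema and the byte-level serialization are assumptions.

```python
import sqlite3
import numpy as np

db = sqlite3.connect("conference.db")
db.execute("""CREATE TABLE IF NOT EXISTS reference_vectors (
                  participant_id TEXT PRIMARY KEY,
                  vector BLOB NOT NULL)""")

def store_reference(pid, vec):
    """Register one participant's reference comprehensive facial feature vector."""
    db.execute("INSERT OR REPLACE INTO reference_vectors VALUES (?, ?)",
               (pid, np.asarray(vec, dtype=np.float32).tobytes()))
    db.commit()

def load_references():
    """Fetch all reference vectors, as the server does before matching."""
    rows = db.execute("SELECT participant_id, vector FROM reference_vectors")
    return {pid: np.frombuffer(blob, dtype=np.float32) for pid, blob in rows}

store_reference("participant-001", np.random.rand(128))
print(list(load_references()))  # -> ['participant-001']
```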
FIG. 7 is a schematic diagram of device interaction according to the second embodiment of the present invention. As shown in FIG. 7, the video networking terminal sends data to the video networking server, the server queries the database, and the database returns data to the server. Concretely, the terminal sends the comprehensive facial feature vector to the server, the server queries the database for the reference comprehensive facial feature vectors corresponding to all participants of the video conference, and the database returns them.
The video networking server matches the uploaded comprehensive facial feature vector one by one against each reference comprehensive facial feature vector corresponding to the participants, and determines whether a matching reference vector exists.
In an alternative embodiment, a distance between the comprehensive facial feature vector and a reference comprehensive facial feature vector may be computed, and when the distance is less than a preset threshold, the two vectors may be determined to match. As for the specific threshold value, a person skilled in the art may set any suitable value according to the actual situation; the embodiment of the present invention is not limited in this respect.
The distance between two vectors can be measured by, for example, the cosine distance or the Euclidean distance between them; for the specific processing, a person skilled in the art may proceed according to practical experience, and the embodiment of the present invention does not discuss it in detail here.
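Under the distance reading above (a cosine or Euclidean distance below a threshold counts as a match), the test reduces to a few lines; the threshold value here is an assumption.

```python
import numpy as np

def cosine_distance(u, v):
    """1 minus the cosine similarity; 0 means identical direction."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def is_match(query, reference, threshold=0.3):  # illustrative threshold
    return cosine_distance(query, reference) < threshold

q = np.random.randn(128)
print(is_match(q, q))                     # identical vectors -> True
print(is_match(q, np.random.randn(128)))  # unrelated vectors -> almost surely False
```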
Step 605: after determining that a matching reference comprehensive facial feature vector exists, the video networking server determines that the participant has signed in successfully.
That is, if the video networking server determines that a reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference vectors corresponding to all participants of the video conference, it can determine that the participant corresponding to the uploaded vector (i.e., the participant corresponding to the matching reference vector) has signed in successfully.
The video networking server can also preset a late-arrival time threshold. After determining that a participant has signed in successfully, it records the sign-in time and compares it with the threshold; if the sign-in time is later than the threshold, the server determines and records that the participant arrived late.
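A small sketch of the late-arrival rule just described: record the sign-in time and compare it with the preset threshold; the concrete times are assumptions.

```python
from datetime import datetime

LATE_THRESHOLD = datetime(2024, 1, 10, 9, 0)  # assumed latest on-time sign-in

def record_checkin(participant_id, when):
    """Record a successful sign-in and flag it as late when past the threshold."""
    status = "late" if when > LATE_THRESHOLD else "on time"
    print(f"{participant_id} signed in at {when:%H:%M} ({status})")
    return status

record_checkin("participant-001", datetime(2024, 1, 10, 9, 12))  # -> late
```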
Step 606: after determining that no matching reference comprehensive facial feature vector exists, the video networking server stores the comprehensive facial feature vector as the reference comprehensive facial feature vector of the participant.
If the video networking server determines that none of the reference comprehensive facial feature vectors corresponding to all participants of the video conference matches the uploaded vector, the participant may be someone added at short notice who was never recorded among the conference's registered participants; the server can therefore store the uploaded comprehensive facial feature vector as that participant's reference comprehensive facial feature vector, for example in the database.
If, conversely, some reference comprehensive facial feature vectors corresponding to the conference's participants are never matched by any uploaded vector, the participants corresponding to those unmatched reference vectors are determined to have failed to sign in.
Step 607: after detecting that the face of the participant has disappeared, if the disappearance duration is determined to exceed a preset duration, the video networking terminal marks the comprehensive facial feature vector of the face as disappeared and uploads the marked vector to the video networking server.
In the embodiment of the invention, whether participants leave early can also be monitored. After uploading the comprehensive facial feature vector of a participant's face to the video networking server, the terminal can detect in real time whether that face has disappeared from the video and, if so, time how long it has been absent. The terminal compares the disappearance duration with the preset duration; if it exceeds the preset duration, the terminal marks the comprehensive facial feature vector of the face as disappeared and uploads the marked vector to the server.
As for the specific value of the preset duration, a person skilled in the art may choose any suitable value according to practical experience; the embodiment of the present invention is not limited in this respect. For example, the preset duration may be set to 1 hour, 1.5 hours, and so on.
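A sketch of this early-leave monitoring loop: time how long the face has been absent, reset when it reappears, and mark the vector as disappeared once the preset duration is exceeded; the polling interval and the demo values are assumptions.

```python
import time

def monitor_absence(face_present, vector, preset_seconds):
    """face_present: callable returning True while the face is in the video.
    Returns the marked vector once the face has been gone longer than preset."""
    gone_since = None
    while True:
        if face_present():
            gone_since = None                    # face is back: reset the timer
        elif gone_since is None:
            gone_since = time.monotonic()        # face just disappeared
        elif time.monotonic() - gone_since > preset_seconds:
            return {"vector": vector, "marked": "disappeared"}  # upload this
        time.sleep(1.0)                          # assumed polling interval

# Demo: a detector that never sees the face and a 2-second preset duration.
print(monitor_absence(lambda: False, vector=[0.1] * 4, preset_seconds=2))
```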
Step 608: after determining that a reference comprehensive facial feature vector matching the marked comprehensive facial feature vector exists among the reference vectors corresponding to the participants who signed in successfully, the video networking server determines that the participant has left early.
After receiving the marked comprehensive facial feature vector uploaded by the terminal, the server matches it one by one against the reference comprehensive facial feature vectors corresponding to the participants who signed in successfully, and determines whether a matching reference vector exists. As before, a distance between the marked vector and a reference vector may be computed, and when the distance is less than the preset threshold the two may be determined to match; the distance can be measured by, for example, the cosine distance or the Euclidean distance between the vectors.
If a matching reference comprehensive facial feature vector exists among those of the successfully signed-in participants, the video networking server determines that the participant corresponding to the marked vector has left early. If not, no processing is performed.
In the embodiment of the invention, sign-in can be performed automatically through face recognition, and situations such as late arrival and early departure can also be recorded, making the whole process more comprehensive.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Example three
Referring to FIG. 8, a block diagram of a sign-in system for a video conference according to the third embodiment of the present invention is shown. The video conference sign-in system of the embodiment of the invention may comprise a video networking server 801 and a plurality of video networking terminals 802 participating in the video conference.
The video networking terminal 802 includes:
The acquiring module 8021, used for acquiring face images of the face at a plurality of different angles after detecting that the face of a participant appears in the captured video.
The combining module 8022, used for extracting the facial features corresponding to each face image and combining the facial features into a comprehensive facial feature vector of the face.
The first uploading module 8023, used for uploading the comprehensive facial feature vector to the video networking server based on the video networking protocol.
The video networking server 801 includes:
The first determining module 8011, used for determining that the participant has signed in successfully after determining that a reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference.
In an alternative embodiment, the obtaining module 8021 includes: a track obtaining unit, used for obtaining the motion track of a key point in the face; and an image extraction unit, used for extracting a frame of image from the video as the face image of the face at a certain angle after determining that the motion track conforms to the preset motion track of that angle.
In an alternative embodiment, the combining module 8022 includes: a weight obtaining unit, used for obtaining the preset weight value of the face image at each angle; and a feature combining unit, used for performing, according to the weight values, a weighted combination of the facial features corresponding to the face images at the different angles to obtain the comprehensive facial feature vector of the face.
In an alternative embodiment, the plurality of different angles includes at least two of: a frontal face angle, a left side face angle, a right side face angle, a head-up angle and a head-down angle, wherein the face image at the frontal face angle has the largest weight value.
In an alternative embodiment, the video networking server 801 further comprises: a storage module, used for storing the comprehensive facial feature vector as the reference comprehensive facial feature vector of the participant after determining that no reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference.
In an alternative embodiment, the video networking terminal 802 further includes: a second uploading module, used for marking the comprehensive facial feature vector of the face as disappeared and uploading the marked vector to the video networking server if, after detecting that the face of the participant has disappeared, the disappearance duration is determined to exceed a preset duration. The video networking server 801 further comprises: a second determining module, used for determining that the participant has left early after determining that a reference comprehensive facial feature vector matching the marked vector exists among the reference comprehensive facial feature vectors corresponding to the participants who signed in successfully.
On one hand, in the embodiment of the invention, the video networking terminal of each participant can capture the face image of that participant, and sign-in by face recognition is both efficient and accurate; on the other hand, because the comprehensive facial feature vector is obtained from face images at several different angles and matching is performed on that combined vector, the scheme can ensure that what stands in front of the terminal is a real, moving participant rather than a static object such as a photo, and the multi-angle recognition further improves the accuracy of face recognition.
Since the device embodiment is basically similar to the method embodiment, its description is brief; for relevant details, refer to the description of the method embodiment.
The video network is an important milestone in network development. It is a real-time network that can realize real-time transmission of high-definition video, pushing many internet applications toward high-definition, face-to-face interaction.
Using real-time high-definition video switching technology, the video network can integrate dozens of required services, such as video, voice, pictures, text, communication and data, on one network platform: high-definition video conferencing, video surveillance, intelligent surveillance analysis, emergency command, digital broadcast television, time-shifted television, network teaching, live broadcast, VOD on demand, television mail, Personal Video Recorder (PVR), intranet (self-office) channels, intelligent video broadcast control, information distribution and so on, realizing high-definition-quality video playback through a television or a computer.
To better understand the embodiments of the present invention, the video network is introduced below.
Some of the technologies applied in the video network are as follows:
Network Technology
Network technology innovation in the video network improves the traditional Ethernet to cope with the potentially huge video traffic on the network. Unlike pure network Packet Switching or network Circuit Switching, the video networking technology adopts packet switching while satisfying the requirements of streaming. The video networking technology has the flexibility, simplicity and low cost of packet switching as well as the quality and security guarantees of circuit switching, realizing network-wide switched virtual circuits and the seamless connection of data formats.
Switching Technology
The video network adopts the two advantages of Ethernet, asynchrony and packet switching, and eliminates Ethernet's defects on the premise of full compatibility: it provides end-to-end seamless connection across the whole network, connects directly to user terminals, and directly carries IP data packets. User data requires no format conversion anywhere in the network. The video network is a higher-level form of Ethernet, a real-time exchange platform that can realize the network-wide, large-scale, real-time transmission of high-definition video that the existing internet cannot, pushing many network video applications toward high definition and unification.
Server Technology
The server technology of the video network and the unified video platform differs from traditional server technology: its streaming media transmission is built on a connection-oriented basis, its data processing capability is independent of traffic and communication time, and a single network layer can contain both signaling and data transmission. For voice and video services, streaming media processing on the video network and the unified video platform is much simpler than data processing, and its efficiency is improved more than a hundredfold compared with a traditional server.
Storage Technology
To adapt to media content of very large capacity and very large traffic, the ultra-high-speed storage technology of the unified video platform adopts the most advanced real-time operating system: the program information in a server instruction is mapped to specific hard disk space, and the media content no longer passes through the server but is sent directly and instantly to the user terminal, with a typical user waiting time of less than 0.2 second. The optimized sector layout greatly reduces the mechanical movement of hard disk head seeking; resource consumption is only 20% of that of an IP internet of the same grade, while generating concurrent traffic 3 times larger than that of a traditional hard disk array, for an overall efficiency improvement of more than 10 times.
Network Security Technology
The structural design of the video network eliminates, at the structural level, the network security problems that trouble the internet, through measures such as independent permission control for each service and complete isolation of devices and user data. It generally needs no antivirus programs or firewalls, is immune to hacker and virus attacks, and provides users with a structurally worry-free secure network.
Service Innovation Technology
The unified video platform integrates services with transmission: whether for a single user, a private-network user or a whole network aggregate, it connects automatically in one step. The user terminal, set-top box or PC connects directly to the unified video platform and obtains a variety of multimedia video services. The unified video platform adopts a menu-style configuration table instead of traditional complex application programming, so complex applications can be realized with very little code, enabling endless innovation of new services.
Networking of the video network is as follows:
the video network is a centralized control network structure, and the network can be a tree network, a star network, a ring network and the like, but on the basis of the centralized control node, the whole network is controlled by the centralized control node in the network.
As shown in FIG. 1, the video network is divided into an access network and a metropolitan area network.
The devices of the access network part can be mainly classified into 3 types: node server, access switch, terminal (including various set-top boxes, coding boards, memories, etc.). The node server is connected to an access switch, which may be connected to a plurality of terminals and may be connected to an ethernet network.
The node server is a node which plays a centralized control function in the access network and can control the access switch and the terminal. The node server can be directly connected with the access switch or directly connected with the terminal.
Similarly, devices of the metropolitan network portion may also be classified into 3 types: a metropolitan area server, a node switch and a node server. The metro server is connected to a node switch, which may be connected to a plurality of node servers.
The node server here is the node server of the access network part; that is, the node server belongs to both the access network part and the metropolitan area network part.
The metropolitan area server is a node which plays a centralized control function in the metropolitan area network and can control a node switch and a node server. The metropolitan area server can be directly connected with the node switch or directly connected with the node server.
Therefore, the whole video network is a network structure with layered centralized control, and the network controlled by the node server and the metropolitan area server can be in various structures such as tree, star and ring.
The access network part can form a unified video platform (the part inside the dotted circle in FIG. 1), and several unified video platforms can form the video network; the unified video platforms can be interconnected through metropolitan-area and wide-area video networking.
Video networking device classification
1.1 The devices in the video network of the embodiment of the present invention can be mainly classified into 3 types: servers, switches (including Ethernet protocol conversion gateways), and terminals (including various set-top boxes, coding boards, memories, etc.). The video network as a whole can be divided into a metropolitan area network (or national network, global network, etc.) and an access network.
1.2 The devices of the access network part can be mainly classified into 3 types: node servers, access switches (including Ethernet protocol conversion gateways), and terminals (including various set-top boxes, coding boards, memories, etc.).
The specific hardware structure of each access network device is as follows:
a node server:
as shown in fig. 2, the node server mainly includes a network interface module 201, a switching engine module 202, a CPU module 203, and a disk array module 204;
Packets from the network interface module 201, the CPU module 203, and the disk array module 204 all enter the switching engine module 202. The switching engine module 202 looks up each incoming packet in the address table 205 to obtain the packet's steering information, and stores the packet in the queue of the corresponding packet buffer 206 according to that steering information; if the queue of the packet buffer 206 is nearly full, the packet is discarded. The switching engine module 202 polls all packet buffer queues and forwards from a queue when the following conditions are met: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero. The disk array module 204 mainly implements control over the hard disks, including initialization, reads and writes, and other operations. The CPU module 203 is mainly responsible for protocol processing with the access switches and terminals (not shown in the figure), for configuring the address table 205 (including the downlink protocol packet address table, the uplink protocol packet address table, and the data packet address table), and for configuring the disk array module 204.
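The queueing behaviour just described reduces to a drop-on-nearly-full enqueue plus a polling loop with two forwarding conditions. The following is a minimal Python sketch of those stated rules only; the class names, the queue limit, and the Port abstraction are our own assumptions, not the patent's implementation:

    from collections import deque

    QUEUE_LIMIT = 1024                     # assumed "nearly full" threshold

    class Port:
        """Stand-in for a physical port with a bounded send buffer."""
        def __init__(self, capacity=64):   # capacity is illustrative
            self.out = deque()
            self.capacity = capacity
        def send_buffer_full(self):
            return len(self.out) >= self.capacity
        def send(self, packet):
            self.out.append(packet)

    class SwitchingEngine:
        def __init__(self, address_table, ports):
            self.address_table = address_table          # DA -> port name
            self.ports = ports                          # port name -> Port
            self.queues = {p: deque() for p in ports}   # packet buffer 206

        def receive(self, packet):
            # Look up the address table 205 to get the steering information,
            # then enqueue; a nearly full queue discards the packet.
            port = self.address_table[packet["da"]]
            queue = self.queues[port]
            if len(queue) >= QUEUE_LIMIT:
                return
            queue.append(packet)

        def poll(self):
            # Forward from a queue only when 1) the port send buffer is
            # not full and 2) the queue packet counter is greater than zero.
            for name, queue in self.queues.items():
                if not self.ports[name].send_buffer_full() and len(queue) > 0:
                    self.ports[name].send(queue.popleft())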
The access switch:
as shown in fig. 3, the access switch mainly includes a network interface module (a downlink network interface module 301 and an uplink network interface module 302), a switching engine module 303, and a CPU module 304;
A packet (uplink data) arriving from the downlink network interface module 301 enters the packet detection module 305. The packet detection module 305 checks whether the Destination Address (DA), Source Address (SA), packet type, and packet length of the packet meet the requirements; if so, it allocates a corresponding stream identifier (stream-id) and passes the packet to the switching engine module 303, otherwise the packet is discarded. A packet (downlink data) arriving from the uplink network interface module 302 enters the switching engine module 303 directly, as do packets from the CPU module 304. The switching engine module 303 looks up each incoming packet in the address table 306 to obtain the packet's steering information. If a packet entering the switching engine module 303 is going from a downlink network interface to an uplink network interface, it is stored in the queue of the corresponding packet buffer 307 in association with its stream-id; otherwise it is stored in the queue of the corresponding packet buffer 307 according to its steering information. In either case, if the queue of the packet buffer 307 is nearly full, the packet is discarded.
The switching engine module 303 polls all packet buffer queues, distinguishing two cases:
if the queue is going from a downlink network interface to an uplink network interface, forwarding requires that: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero; 3) a token generated by the rate control module has been obtained;
if the queue is not going from a downlink network interface to an uplink network interface, forwarding requires that: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero.
The rate control module 308 is configured by the CPU module 304 and, at programmable intervals, generates tokens for all packet buffer queues going from downlink network interfaces to uplink network interfaces, so as to control the rate of uplink forwarding.
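This token mechanism behaves like a token bucket applied only to the upstream (downlink-to-uplink) queues. The sketch below adds the third forwarding condition on top of the two ordinary ones; the interval, burst cap, and function names are illustrative assumptions:

    import time

    class RateController:
        """Sketch of rate control module 308: tokens are produced at a
        programmable interval and consumed one per forwarded packet."""
        def __init__(self, interval_s=0.001, burst=8):  # assumed values
            self.interval = interval_s
            self.burst = burst
            self.tokens = 0
            self.last = time.monotonic()

        def try_consume(self):
            now = time.monotonic()
            fresh = int((now - self.last) / self.interval)
            if fresh:
                self.tokens = min(self.burst, self.tokens + fresh)
                self.last += fresh * self.interval
            if self.tokens > 0:
                self.tokens -= 1
                return True
            return False

    def may_forward_upstream(queue, port, rate_controller):
        # All three conditions from the text must hold for a queue going
        # from a downlink network interface to an uplink network interface
        # (port is the Port stand-in from the previous sketch).
        return (not port.send_buffer_full()         # 1) send buffer not full
                and len(queue) > 0                  # 2) packet counter > 0
                and rate_controller.try_consume())  # 3) token obtained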
The CPU module 304 is mainly responsible for protocol processing with the node server, configuration of the address table 306, and configuration of the rate control module 308.
Ethernet protocol conversion gateway
As shown in fig. 4, the apparatus mainly includes a network interface module (a downlink network interface module 401 and an uplink network interface module 402), a switching engine module 403, a CPU module 404, a packet detection module 405, a rate control module 408, an address table 406, a packet buffer 407, a MAC adding module 409, and a MAC deleting module 410.
A data packet arriving from the downlink network interface module 401 enters the packet detection module 405. The packet detection module 405 checks whether the Ethernet MAC DA, Ethernet MAC SA, Ethernet length or frame type, video network destination address DA, video network source address SA, video network packet type, and packet length of the packet meet the requirements; if so, it allocates a corresponding stream identifier (stream-id), the MAC deletion module 410 strips the MAC DA, MAC SA, and length or frame type (2 bytes), and the packet enters the corresponding receive buffer; otherwise the packet is discarded;
the downlink network interface module 401 monitors the send buffer of its port; if a packet is present, it obtains the Ethernet MAC DA of the corresponding terminal according to the video network destination address DA of the packet, prepends the terminal's Ethernet MAC DA, the MAC SA of the Ethernet protocol conversion gateway, and the Ethernet length or frame type, and sends the packet.
The other modules of the Ethernet protocol conversion gateway function similarly to those of the access switch.
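The MAC deletion and MAC addition steps are plain header manipulation. A minimal sketch, assuming standard Ethernet framing (6-byte MAC DA, 6-byte MAC SA, 2-byte length/frame type); the function names are our own:

    ETH_HEADER_LEN = 14  # 6-byte MAC DA + 6-byte MAC SA + 2-byte type/length

    def strip_ethernet_header(frame: bytes) -> bytes:
        """Uplink direction (MAC deletion module 410): remove MAC DA,
        MAC SA, and the length/frame type, leaving the bare video
        network packet."""
        return frame[ETH_HEADER_LEN:]

    def add_ethernet_header(packet: bytes, terminal_mac: bytes,
                            gateway_mac: bytes, ethertype: bytes) -> bytes:
        """Downlink direction (MAC adding module 409): prepend the
        terminal's MAC as DA and the gateway's MAC as SA."""
        assert len(terminal_mac) == 6 and len(gateway_mac) == 6
        return terminal_mac + gateway_mac + ethertype + packet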
A terminal:
a terminal mainly comprises a network interface module, a service processing module, and a CPU module. For example, a set-top box mainly comprises a network interface module, a video/audio encoding and decoding engine module, and a CPU module; an encoding board mainly comprises a network interface module, a video/audio encoding engine module, and a CPU module; a storage device mainly comprises a network interface module, a CPU module, and a disk array module.
1.3 The devices of the metropolitan area network part can be mainly classified into 3 types: node servers, node switches, and metropolitan area servers. The node switch mainly comprises a network interface module, a switching engine module, and a CPU module; the metropolitan area server mainly comprises a network interface module, a switching engine module, and a CPU module.
2. Video networking packet definition
2.1 Access network packet definition
The data packet of the access network mainly includes the following parts: Destination Address (DA), Source Address (SA), reserved bytes, payload (PDU), and CRC, laid out as follows:
DA | SA | Reserved | Payload | CRC
wherein:
the Destination Address (DA) consists of 8 bytes: the first byte indicates the packet type (e.g., protocol packet, multicast data packet, unicast data packet, etc.), giving at most 256 possibilities; the second through sixth bytes form the metropolitan area network address; and the seventh and eighth bytes form the access network address;
the Source Address (SA) also consists of 8 bytes and is defined in the same way as the Destination Address (DA);
the reserved field consists of 2 bytes;
the payload part (PDU) has a different length depending on the type of datagram: 64 bytes for the various protocol packets, and 32 + 1024 = 1056 bytes for a unicast data packet; the length is of course not limited to these two cases;
the CRC consists of 4 bytes and is calculated in accordance with the standard Ethernet CRC algorithm.
2.2 metropolitan area network packet definition
The topology of a metropolitan area network is a graph, and there may be 2 or even more connections between two devices; that is, there may be more than 2 connections between a node switch and a node server, or between two node switches. However, the metropolitan area network address of each metropolitan network device is unique. To describe the connection relationships between metropolitan network devices accurately, the embodiment of the present invention introduces a parameter, the label, to uniquely describe the connections of a metropolitan area network device.
In this specification, the label is defined similarly to a label in MPLS (Multi-Protocol Label Switching). Assuming there are two connections between device A and device B, a packet going from device A to device B has 2 possible labels, and a packet going from device B to device A likewise has 2 possible labels. Labels are divided into incoming labels and outgoing labels: assuming the label of a packet entering device A (the incoming label) is 0x0000, the label of the packet when it leaves device A (the outgoing label) may become 0x0001. The network access process of the metropolitan area network is one of centralized control; that is, both address allocation and label allocation in the metropolitan area network are directed by the metropolitan area server, and the node switches and node servers execute passively. This differs from label allocation in MPLS, where labels are the result of mutual negotiation between the switch and the server.
As shown in the following table, the data packet of the metropolitan area network mainly includes the following parts:
DA | SA | Reserved | Label | Payload | CRC
namely Destination Address (DA), Source Address (SA), reserved bytes (Reserved), label, payload (PDU), and CRC. The format of the label may be defined as follows: the label is 32 bits, with the upper 16 bits reserved and only the lower 16 bits used; it sits between the reserved bytes and the payload of the packet.
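Because label allocation is directed by the metropolitan area server, a device only needs to apply a pre-installed incoming-label to outgoing-label table at the position just described. A minimal sketch; the offset follows the access network layout above, and the table contents are invented (cf. the 0x0000 to 0x0001 example):

    LABEL_OFFSET = 18   # after 8-byte DA + 8-byte SA + 2 reserved bytes

    def swap_label(packet: bytes, label_table: dict) -> bytes:
        """Replace the incoming label with the outgoing label assigned in
        advance by the metropolitan area server."""
        field = int.from_bytes(packet[LABEL_OFFSET:LABEL_OFFSET + 4], "big")
        in_label = field & 0xFFFF            # only the low 16 bits are used
        out_label = label_table[in_label]    # table installed by the server
        new_field = (field & 0xFFFF0000) | out_label
        return (packet[:LABEL_OFFSET]
                + new_field.to_bytes(4, "big")
                + packet[LABEL_OFFSET + 4:])

    # Example table pushed down by the metropolitan area server.
    label_table = {0x0000: 0x0001}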
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises the element.
The video conference sign-in method and system provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific implementation and scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
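As a final aid to understanding, the sign-in computation at the heart of the claims below (a weighted combination of multi-angle facial features, then matching against stored reference vectors) can be sketched as follows. The specific weights, the similarity measure, and the threshold are editorial assumptions; the patent fixes only that the front face angle carries the largest weight:

    import numpy as np

    # Illustrative per-angle weights; only "front is largest" is required.
    WEIGHTS = {"front": 0.4, "left": 0.15, "right": 0.15,
               "up": 0.15, "down": 0.15}
    MATCH_THRESHOLD = 0.9   # assumed similarity threshold

    def combine_features(features_by_angle: dict) -> np.ndarray:
        """Weighted combination of per-angle feature vectors into one
        comprehensive facial feature vector (unit-normalized)."""
        vec = sum(WEIGHTS[a] * np.asarray(f, dtype=float)
                  for a, f in features_by_angle.items())
        return vec / np.linalg.norm(vec)

    def check_in(comprehensive: np.ndarray, references: dict):
        """Return the participant whose reference comprehensive vector
        matches, or None (in which case the server would store the new
        vector as that participant's reference)."""
        for participant, ref in references.items():
            similarity = float(np.dot(comprehensive, ref))  # cosine, since
            if similarity >= MATCH_THRESHOLD:               # both are unit
                return participant
        return None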

Claims (12)

1. A video conference sign-in method, characterized in that the method is applied to a video conference, wherein the video conference uses a video networking server and a plurality of video networking terminals participating in the video conference; the method comprises the following steps:
after detecting that the face of a participant appears in the captured video, the video networking terminal acquires face images of the face at a plurality of different angles;
the video networking terminal extracts the facial features corresponding to each face image and combines the facial features into a comprehensive facial feature vector of the face;
the video networking terminal uploads the comprehensive facial feature vector to the video networking server based on a video networking protocol;
and after determining that a reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference, the video networking server determines that the participant has signed in successfully.
2. The method of claim 1, wherein the step of acquiring face images of the face at a plurality of different angles comprises:
the video networking terminal acquires the motion trajectory of key points in the face;
and after determining that the motion trajectory conforms to a preset motion trajectory for a certain angle, the video networking terminal extracts a frame of image from the video as the face image of the face at that angle.
3. The method of claim 1, wherein the step of combining the facial features into a comprehensive facial feature vector of the face comprises:
acquiring a preset weight value for the face image at each angle;
and performing a weighted combination of the facial features corresponding to the face images at the different angles according to the weight values, to obtain the comprehensive facial feature vector of the face.
4. The method of claim 3, wherein the plurality of different angles comprises at least two of: a front face angle, a left face angle, a right face angle, a head-up angle, and a head-down angle, wherein the weight value of the face image at the front face angle is the largest.
5. The method of claim 1, further comprising, after the step of the video networking terminal uploading the integrated facial feature vector to the video networking server based on a video networking protocol:
and after determining that no reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference, the video networking server stores the comprehensive facial feature vector as the reference comprehensive facial feature vector of the participant.
6. The method of claim 1, further comprising, after the step of the video networking terminal uploading the integrated facial feature vector to the video networking server based on a video networking protocol:
after detecting that the face of the participant has disappeared, if the video networking terminal determines that the duration of the disappearance exceeds a preset duration, the video networking terminal marks the comprehensive facial feature vector of the face as disappeared and uploads the marked comprehensive facial feature vector to the video networking server;
and after determining that a reference comprehensive facial feature vector matching the marked comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to the participants who have successfully signed in, the video networking server determines that the participant has left early.
7. A video conference sign-in system, characterized by comprising a video networking server and a plurality of video networking terminals participating in a video conference;
the video networking terminal includes:
an acquisition module, configured to acquire face images of a participant's face at a plurality of different angles after detecting that the face appears in the captured video;
a combination module, configured to extract the facial features corresponding to each face image and combine the facial features into a comprehensive facial feature vector of the face;
and a first uploading module, configured to upload the comprehensive facial feature vector to the video networking server based on a video networking protocol;
the video networking server comprises:
a first determining module, configured to determine that a participant has signed in successfully after determining that a reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference.
8. The system of claim 7, wherein the acquisition module comprises:
a trajectory acquisition unit, configured to acquire the motion trajectory of key points in the face;
and an image extraction unit, configured to extract a frame of image from the video as the face image of the face at a certain preset angle after determining that the motion trajectory conforms to the preset motion trajectory for that angle.
9. The system of claim 7, wherein the combining module comprises:
a weight acquisition unit, configured to acquire the preset weight value for the face image at each angle;
and a feature combination unit, configured to perform a weighted combination of the facial features corresponding to the face images at the different angles according to the weight values, to obtain the comprehensive facial feature vector of the face.
10. The system of claim 9, wherein the plurality of different angles comprises at least two of: a front face angle, a left face angle, a right face angle, a head-up angle, and a head-down angle, wherein the weight value of the face image at the front face angle is the largest.
11. The system of claim 7, wherein the video networking server further comprises:
a storage module, configured to store the comprehensive facial feature vector as the reference comprehensive facial feature vector of the participant after determining that no reference comprehensive facial feature vector matching the comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to all participants of the video conference.
12. The system of claim 7,
the video networking terminal further comprises: a second uploading module, configured to, after detecting that the face of the participant has disappeared and determining that the duration of the disappearance exceeds a preset duration, mark the comprehensive facial feature vector of the face as disappeared and upload the marked comprehensive facial feature vector to the video networking server;
the video networking server further comprises: a second determining module, configured to determine that the participant has left early after determining that a reference comprehensive facial feature vector matching the marked comprehensive facial feature vector exists among the reference comprehensive facial feature vectors corresponding to the participants who have successfully signed in.
CN201910804717.4A 2019-08-28 2019-08-28 Video conference sign-in method and system Pending CN110705351A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910804717.4A CN110705351A (en) 2019-08-28 2019-08-28 Video conference sign-in method and system
CN202210824953.4A CN115311706A (en) 2019-08-28 2019-08-28 Personnel identification method, device, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910804717.4A CN110705351A (en) 2019-08-28 2019-08-28 Video conference sign-in method and system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202210824953.4A Division CN115311706A (en) 2019-08-28 2019-08-28 Personnel identification method, device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110705351A true CN110705351A (en) 2020-01-17

Family

ID=69193706

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910804717.4A Pending CN110705351A (en) 2019-08-28 2019-08-28 Video conference sign-in method and system
CN202210824953.4A Pending CN115311706A (en) 2019-08-28 2019-08-28 Personnel identification method, device, terminal equipment and storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202210824953.4A Pending CN115311706A (en) 2019-08-28 2019-08-28 Personnel identification method, device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (2) CN110705351A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926513A (en) * 2021-03-25 2021-06-08 建信金融科技有限责任公司 Conference sign-in method and device, electronic equipment and storage medium
CN113079339A (en) * 2021-03-23 2021-07-06 游密科技(深圳)有限公司 Character identification method for video network conference

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117456584B (en) * 2023-11-13 2024-06-21 江苏创斯达智能科技有限公司 Face recognition equipment applied to intelligent safe

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799901A (en) * 2012-07-10 2012-11-28 辉路科技(北京)有限公司 Method for multi-angle face detection
CN107066983A (en) * 2017-04-20 2017-08-18 腾讯科技(上海)有限公司 A kind of auth method and device
CN109670394A (en) * 2018-10-25 2019-04-23 平安科技(深圳)有限公司 A kind of video conference based on biological characteristic similarity is registered method and relevant device
CN109658533A (en) * 2018-11-23 2019-04-19 深圳市沃特沃德股份有限公司 Method of registering, system and the intelligent terminal of video conference

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Yiqin (王忆勤) et al.: "Facial Diagnosis in Traditional Chinese Medicine and Computer-Aided Diagnosis" (《中医面诊与计算机辅助诊断》), 30 November 2010, Shanghai Scientific and Technical Publishers (上海科学技术出版社) *

Also Published As

Publication number Publication date
CN115311706A (en) 2022-11-08

Similar Documents

Publication Publication Date Title
CN110557597A (en) video conference sign-in method, server, electronic equipment and storage medium
CN110166728B (en) Video networking conference opening method and device
CN110049271B (en) Video networking conference information display method and device
CN109862308B (en) Batch processing method and system
CN110572607A (en) Video conference method, system and device and storage medium
CN110705351A (en) Video conference sign-in method and system
CN109040656B (en) Video conference processing method and system
CN111541859A (en) Video conference processing method and device, electronic equipment and storage medium
CN109768957B (en) Method and system for processing monitoring data
CN109743284B (en) Video processing method and system based on video network
CN108965783B (en) Video data processing method and video network recording and playing terminal
CN108632075B (en) Method and device for programming video network terminal
CN110798648A (en) Video conference processing method and system
CN111131751B (en) Information display method, device and system for video network conference
CN110446058B (en) Video acquisition method, system, device and computer readable storage medium
CN110519546B (en) Method and device for pushing business card information based on video conference
CN111405230A (en) Conference information processing method and device, electronic equipment and storage medium
CN108874844B (en) Form data processing method and video network server
CN111432157B (en) Conference processing method, device, equipment and storage medium based on video networking
CN111666810A (en) Method and device for recognizing violations
CN110493553B (en) File storage method and device and storage medium
CN109698756B (en) Video conference reservation method and device
CN110381038B (en) Information verification method and system based on video network
CN110324578B (en) Monitoring video processing method, device and storage medium
CN111259729A (en) Expression recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200117