CN115706773A - Network video conference processing method and device, electronic equipment and storage medium


Info

Publication number
CN115706773A
Authority
CN
China
Prior art keywords
conference
participant
image
objects
scene
Legal status
Pending
Application number
CN202111623524.2A
Other languages
Chinese (zh)
Inventor
陈仲华
李峰
曾维亿
王自昊
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Publication of CN115706773A

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present application provides a network video conference processing method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: receiving a video code stream of a network video conference, where the video code stream includes a plurality of real-time participant object images obtained by separately capturing images of a plurality of participant objects of the network video conference; and displaying a virtual conference scene, and displaying the plurality of participant object images at positions in the virtual conference scene corresponding to the plurality of participant objects. By simulating a real conference scene, the present application achieves a good immersive experience during the network video conference.

Description

Network video conference processing method and device, electronic equipment and storage medium
This application claims priority to Chinese patent application No. 202110924572.9, filed on August 12, 2021 and entitled "Network video conference processing method and device, electronic equipment and storage medium".
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method and an apparatus for processing a network video conference, an electronic device, and a computer-readable storage medium.
Background
The development of Internet technology has accelerated the digital transformation of enterprises, and large enterprises have begun to adopt network video conferences in place of traditional meetings. A network video conference can be implemented based on cloud technology, which greatly reduces the complexity of organizing a conference and the threshold for participation. For enterprises it is a more economical and flexible choice: the preparation period is short, the cost is low, and it is hardly limited by time or space.
However, the network video conference provided by the related art basically remains a multi-user video mode. For example, fig. 1 shows a scene in which four users participating in a network video conference conduct a video call. As can be seen from fig. 1, the scheme provided by the related art simply pieces the portrait from each end together in the same picture. As a result, the experience holds little interest for users, the displayed conference picture is far from an actual conference scene, and users' willingness to use the network video conference is reduced.
Disclosure of Invention
The embodiment of the application provides a network video conference processing method and device, electronic equipment and a computer readable storage medium, which can realize good immersive experience in the network video conference process by simulating a real conference scene.
The technical solutions of the embodiments of the present application are implemented as follows:
the embodiment of the application provides a network video conference processing method, which comprises the following steps:
receiving a video code stream of a network video conference, where the video code stream includes a plurality of real-time participant object images, the participant object images being obtained by separately capturing images of a plurality of participant objects of the network video conference;
displaying a virtual conference scene; and
displaying the plurality of participant object images at positions in the virtual conference scene corresponding to the plurality of participant objects.
An embodiment of the present application provides a network video conference processing apparatus, including:
the receiving module is used for receiving a video code stream of the network video conference, wherein the video code stream comprises a plurality of real-time participant object images, and the participant object images are obtained by respectively carrying out image acquisition on a plurality of participant objects of the network video conference;
a display module, configured to display a virtual conference scene, and
to display the plurality of participant object images at positions in the virtual conference scene corresponding to the plurality of participant objects.
In the above scheme, the apparatus further includes an allocation module, configured to allocate, according to a sequence in which the multiple conference objects join the network video conference, corresponding positions to the multiple conference objects in the virtual conference scene, respectively, where a position ordering of the multiple conference objects in the virtual conference scene corresponds to the sequence; the display module is further configured to display the images of the multiple participating objects according to the corresponding positions respectively allocated to the multiple participating objects.
In the above scheme, the allocating module is further configured to allocate corresponding positions to the multiple participant objects in the virtual conference scene according to the speaking sequence of the multiple participant objects in the network video conference, where the position sequence of the multiple participant objects in the virtual conference scene corresponds to the speaking sequence; the display module is further configured to display the images of the multiple participating objects according to the corresponding positions respectively allocated to the multiple participating objects.
In the foregoing solution, the allocating module is further configured to allocate, according to the identity information sequence of the multiple conference participants in the network video conference, corresponding positions to the multiple conference participants in the virtual conference scene, respectively, where the position sequence of the multiple conference participants in the virtual conference scene corresponds to the identity information sequence; the display module is further configured to display the images of the multiple participating objects according to the corresponding positions respectively allocated to the multiple participating objects.
In the above scheme, the display module is further configured to display a virtual conference scene adapted to the theme of the network video conference according to the video code stream; the device also comprises an updating module used for updating the virtual conference scene to be adapted to the changed theme when the theme of the network video conference is determined to be changed according to the video code stream.
In the above solution, the apparatus further includes a decoding module, configured to decode the video code stream to obtain a plurality of video frames; the apparatus further includes a theme recognition module, configured to call a theme recognition model to perform theme recognition processing on the plurality of video frames to obtain a theme recognition result for each video frame, and to determine the theme recognition result with the highest repetition frequency as the theme of the network video conference; the display module is further configured to display a virtual conference scene adapted to the theme of the network video conference.
In the above solution, the apparatus further includes a moving module, configured to, when it is recognized that a target participant object currently speaking exists among the multiple participant objects, move the participant object image corresponding to the target participant object from its originally allocated position to a specific position in the virtual conference scene, where the specific position is more prominent than the originally allocated position; and the moving module is further configured to, when it is recognized that the target participant object has finished speaking, move the participant object image corresponding to the target participant object from the specific position back to the originally allocated position.
In the above solution, the video code stream further includes a plurality of mask images in one-to-one correspondence with the plurality of participant object images, the plurality of mask images being obtained by separately performing object recognition on the plurality of participant object images; the apparatus further includes a mask module, configured to perform the following processing for each participant object image: performing mask processing on the participant object image based on the corresponding mask image to obtain the participant object image with the background removed; the display module is further configured to display the plurality of background-removed participant object images at positions in the virtual conference scene corresponding to the plurality of participant objects.
In the above solution, the value range of the pixel values of the mask image is smaller than the value range of the pixel values of the participant object image; the apparatus further includes a mapping module, configured to map the mask image so that the value range of its pixel values is consistent with that of the participant object image.
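As an illustration of this mapping, the following is a minimal sketch (not taken from this application), assuming an 8-bit RGB participant object image and a single-channel mask whose pixel values lie in {0, 1}:

```python
import numpy as np

def remove_background(participant_img: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # The mask's value range ({0, 1}) is smaller than the image's ({0, ..., 255}),
    # so first map the mask to the image's value range.
    if mask.max() <= 1:
        mask = mask * 255
    alpha = (mask.astype(np.float32) / 255.0)[..., np.newaxis]
    # Pixels where the mask is 0 (the background) are zeroed out.
    return (participant_img * alpha).astype(np.uint8)
```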
In the above solution, the receiving module is further configured to receive a video code stream of the network video conference generated by the server in the following manner: when any participant object among the multiple participant objects exits the network video conference or its connection is abnormal, moving that participant object out of the network video conference; and receiving the participant object images of the remaining participant objects of the network video conference, and generating the video code stream of the network video conference according to the participant object images of the remaining participant objects.
In the above solution, the positions corresponding to the multiple participant objects are selected from the positions in the virtual conference scene that are in an unoccupied state; after that participant object is moved out of the network video conference, the updating module is further configured to update the position corresponding to that participant object in the virtual conference scene from the occupied state to the unoccupied state.
In the above scheme, the video code stream further includes an image of the virtual conference scene; the decoding module is further configured to decode the video code stream to obtain video pixel data; the device further comprises a rendering module, which is used for rendering according to the video pixel data obtained by decoding so as to display the image of the virtual conference scene in a human-computer interaction interface and display the images of the multiple conference objects in the positions, corresponding to the multiple conference objects, in the image of the virtual conference scene.
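For illustration only, the decode step of this solution might look like the sketch below, which uses the PyAV library as one possible decoder (an assumption; this application does not prescribe a decoder):

```python
import av

def decode_to_pixel_data(stream_path: str):
    # Decode the received video code stream into per-frame pixel data that
    # can be handed to the renderer.
    container = av.open(stream_path)
    for frame in container.decode(video=0):
        yield frame.to_ndarray(format="rgb24")
```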
In the above scheme, the apparatus further includes an image segmentation module, configured to perform image segmentation processing on each of the participant object images to obtain a mask image corresponding to the participant object image; the mask module is further used for performing mask processing on the corresponding participant object images based on each mask image to obtain a plurality of participant object images with backgrounds removed; the display module is further configured to display the multiple conference object images with the backgrounds removed at positions in the image of the virtual conference scene corresponding to the multiple conference objects, respectively.
In the foregoing solution, the image segmentation module is further configured to perform the following processing for each participant object image: calling an image segmentation model to identify the participant object in the participant object image, taking the region outside the participant object as the background, and generating a mask image corresponding to the background; the image segmentation model is trained based on sample images and the objects marked in the sample images.
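A hypothetical sketch of this segmentation step follows; `person_segmenter` stands in for any trained image segmentation model, which this application does not name:

```python
import numpy as np

def build_mask(participant_img: np.ndarray, person_segmenter) -> np.ndarray:
    # The model yields a per-pixel probability that the pixel belongs to the
    # participant object; shape (H, W), values in [0, 1].
    prob = person_segmenter(participant_img)
    # Regions outside the participant object are treated as background (0).
    return (prob > 0.5).astype(np.uint8)
```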
In the above scheme, the receiving module is further configured to receive a video stream of the network video conference, which is generated by the server in the following manner: acquiring a plurality of conference object images obtained by respectively carrying out image acquisition on a plurality of conference objects of the network video conference and a mask image corresponding to each conference object image; performing mask processing on the corresponding participant object images based on each mask image to obtain a plurality of participant object images with backgrounds removed; acquiring an image of a virtual conference scene adapted to the theme of the network video conference; filling the plurality of participant object images with the backgrounds removed in positions corresponding to the plurality of participant objects in the image of the virtual conference scene to obtain a combined image; and coding the combined images respectively corresponding to different moments to obtain a video code stream of the network video conference.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the network video conference processing method provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the present application provides a computer-readable storage medium storing executable instructions that, when executed by a processor, implement the network video conference processing method provided in the embodiments of the present application.
The embodiment of the present application provides a computer program product, where the computer program product includes computer-executable instructions that, when executed by a processor, implement the network video conference processing method provided in the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
By displaying a virtual conference scene and displaying the plurality of participant object images at the positions in the virtual conference scene corresponding to the plurality of participant objects, a real conference scene can be simulated, giving the user a realistic sense of being in a meeting and improving the immersive experience of using the network video conference.
Drawings
Fig. 1 is a schematic view of an application scenario of a network video conference processing method provided in the related art;
fig. 2 is a schematic architecture diagram of a network video conference processing system 100 provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device 500 provided in an embodiment of the present application;
fig. 4 is a schematic flowchart of a network video conference processing method provided in an embodiment of the present application;
fig. 5A to fig. 5C are schematic flow diagrams of a network video conference processing method according to an embodiment of the present application;
fig. 6A to fig. 6C are schematic application scenarios of a network video conference processing method provided in the embodiment of the present application;
fig. 7 is a block diagram of an overall framework for interaction between a plurality of terminals and a server according to an embodiment of the present application;
fig. 8 is a schematic flowchart of interaction between a single terminal and a server according to an embodiment of the present application;
fig. 9 is a schematic diagram of a scenario in which a single web conference client interacts with a server according to an embodiment of the present application;
fig. 10 is a schematic flowchart of a pre-encoding process performed on a human image mask map according to an embodiment of the present application;
fig. 11 is a schematic view of a scenario in which multiple network conference clients interact with a server according to an embodiment of the present application;
fig. 12 is a schematic flowchart of post-decoding processing performed on a human image mask image obtained by decoding according to an embodiment of the present application;
fig. 13 is a schematic flowchart of a network video conference processing method according to an embodiment of the present application;
fig. 14 is a schematic flowchart of a network video conference processing method provided in an embodiment of the present application;
fig. 15 is a schematic application scenario diagram of a network video conference processing method provided in an embodiment of the present application;
fig. 16 is a schematic application scenario diagram of a network video conference processing method according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Network video conference: an interaction mode that uses a network (such as the Internet or a local area network) as the communication medium. Multimedia data, in forms such as voice and video, of any participant object can be synchronized to the other participant objects in real time, thereby breaking through the spatial limits on communication between participants.
2) Video code stream (data rate): the data flow rate used by a video file per unit time, also called the bit rate, sampling rate, or code flow rate. It is the most important element of picture-quality control in video coding, and the units commonly used are kb/s or Mb/s. The larger the code stream, the higher the sampling rate per unit time and the higher the accuracy of the data stream; the processed file is closer to the original file and the picture quality is better and clearer, although correspondingly higher decoding capability is required of the playback device.
3) H.264: a commonly used video coding standard. At the system level, H.264 introduces a new concept, making a conceptual separation between the Video Coding Layer (VCL), which represents the core compressed video content, and the Network Abstraction Layer (NAL), which represents the form in which that content is delivered over a specific type of network. This structure facilitates the encapsulation of information and better priority control over information.
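For illustration, in the Annex B byte-stream format of H.264, NAL units are delimited by 0x000001 start codes, and the low five bits of the first byte of each unit give the NAL unit type. A deliberately simplified parser sketch (ignoring emulation-prevention bytes and other edge cases) follows:

```python
def split_nal_units(bitstream: bytes):
    # Walk an Annex B H.264 byte stream and yield (nal_unit_type, payload).
    for chunk in bitstream.split(b"\x00\x00\x01"):
        chunk = chunk.rstrip(b"\x00")    # drop leftover start-code prefix bytes
        if chunk:
            yield chunk[0] & 0x1F, chunk
```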
4) The virtual meeting scene is a simulation environment for bearing meeting objects, such as a meeting room, a classroom and the like.
At present, the network video conference (also called an online video conference) provided by the related art is basically a multi-person video mode that merely pieces the portraits from each end together in the same picture. For example, fig. 1 shows a scene in which four users conduct a video call simultaneously. As can be seen from fig. 1, the scheme provided by the related art simply stitches the portraits from each terminal together, at most additionally allowing each terminal to enable a virtual background or the like.
However, the applicant has found in the course of carrying out the embodiments of the present application that: the solutions provided by the related art have several obvious problems:
1. the number of terminal connections is limited, usually no more than 10;
2. if the virtual background is enabled, the algorithm runs entirely locally on the terminal, and after prolonged use the terminal exhibits obvious power-consumption problems (such as overheating and rapid battery drain);
3. the scene is monotonous (pictures are simply stitched together), so users find it uninteresting; the displayed conference picture is far from a real conference scene, and in practical applications most users are unwilling to turn on video and instead choose a voice-only conference call.
In view of the foregoing technical problems, embodiments of the present application provide a network video conference processing method, apparatus, electronic device, and computer-readable storage medium, which can achieve good immersive experience in a network video conference process by simulating a real conference scene.
The following describes an exemplary application of the electronic device provided in the embodiment of the present application, and the network video conference processing method provided in the embodiment of the present application may be implemented by various electronic devices, for example, various user terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, and a portable game device), and may also be implemented by a server and a terminal in cooperation. The following description will take an example in which a server and a terminal cooperate to implement the network video conference processing method provided in the embodiment of the present application.
Referring to fig. 2, fig. 2 is a schematic architecture diagram of a network video conference processing system 100 provided in the embodiment of the present application, used to support an application that simulates a real conference scene. The network video conference processing system 100 includes: the server 200, the network 300, and the terminals (terminal 400-1 to terminal 400-N are shown, where N is a positive integer greater than 1; for example, N may be 2, 3, 5, etc.), which are described separately below.
The server 200 is the background server of the clients 410-1 to 410-N, and is configured to send the video code stream of the network video conference to the terminals 400-1 to 400-N respectively associated with participant objects 1 to N of the network video conference. The video code stream sent by the server 200 includes N participant object images, which are obtained by capturing images of the N participant objects of the network video conference. For example, the server 200 first receives participant object images 1 to N sent by terminals 1 to N, where participant object image 1 is obtained by terminal 1 calling its camera to capture an image of participant object 1, and participant object image N is obtained by terminal N calling its camera to capture an image of participant object N; the server 200 then encodes the received participant object images 1 to N to obtain the video code stream of the network video conference to be sent to terminals 1 to N.
The network 300 is used as a medium for communication between the server 200 and the terminals 400-1 to 400-N, and the network 300 may be a wide area network or a local area network, or a combination of both.
The terminals 400-1 to 400-N run the clients 410-1 to 410-N, respectively. The clients 410-1 to 410-N are clients of the same type, such as a network conference client or an instant messaging client. Taking the client 410-1 as an example, after receiving the video code stream of the network video conference sent by the server 200, the client 410-1 displays a virtual conference scene on the human-computer interaction interface according to the received video code stream, and displays the N participant object images at the positions in the virtual conference scene corresponding to the N participant objects. Compared with the related art, in which the participant object images captured by each terminal are simply pieced together in the same picture, the embodiment of the present application can simulate a real conference scene by displaying a virtual conference scene and displaying each participant object image at the position corresponding to that participant object, thereby providing the user with a realistic conference feeling, effectively increasing the interest of the network video conference, and achieving a good immersive experience during the network video conference.
In some embodiments, the embodiments of the present application may be implemented by Cloud Technology (Cloud Technology), which refers to a hosting Technology for unifying resources of hardware, software, network, and the like in a wide area network or a local area network to implement computation, storage, processing, and sharing of data.
Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, and the like that are applied based on the cloud computing business model; it can form a resource pool to be used on demand, flexibly and conveniently. Cloud computing technology will become an important support for such services. For example, the service interaction function between the server 200 and the terminals 400-1 to 400-N described above may be implemented by cloud technology.
For example, the server 200 shown in fig. 2 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminals 400-1 to 400-N may be, but are not limited to, smart phones, tablet computers, notebook computers, desktop computers, smart speakers, smart watches, and the like. The terminals 400-1 to 400-N and the server 200 may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
By way of example, the network video conference provided by the embodiment of the present application may also be a cloud conference, which is an efficient, convenient, and low-cost conference form based on cloud computing technology. Users need only perform simple, easy-to-use operations through an Internet interface to quickly and efficiently share voice, data files, and video with teams and customers around the world, while complex technologies such as data transmission and processing within the conference are operated by the cloud conference service provider on the user's behalf.
At present, domestic cloud conferences mainly focus on service content in the Software as a Service (SaaS) mode, including service forms such as telephone, network, and video; a video conference based on cloud computing is called a cloud conference.
In the cloud conference era, data transmission, processing and storage are all processed by computer resources of video conference manufacturers, so that users do not need to purchase expensive hardware and install complicated software, and can carry out efficient teleconference only by opening a browser and logging in a corresponding interface.
In other embodiments, taking the terminal 400-1 as an example, the terminal 400-1 may also implement the network video conference processing method provided in the embodiment of the present application by running a computer program, where the computer program may be the client 410-1 shown in fig. 2. For example, the computer program may be a native program or a software module in an operating system; a native application (APP), i.e., a program that needs to be installed in an operating system to run, such as a network conference client or an instant messaging client; an applet, i.e., a program that only needs to be downloaded into a browser environment to run; or an applet that can be embedded in any APP, where the applet can be run or shut down under user control. In general, the computer program described above may be any form of application, module, or plug-in.
The following describes the structure of the electronic device provided in the embodiment of the present application. Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device 500 provided in an embodiment of the present application. The electronic device 500 shown in fig. 3 includes: at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable connection and communication among these components. In addition to a data bus, the bus system 540 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 540 in fig. 3.
The processor 510 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor, any conventional processor, or the like.
The user interface 530 includes one or more output devices 531 that enable presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 530 also includes one or more input devices 532, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552, configured to communicate with other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;
a presentation module 553 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
an input processing module 554 to detect one or more user inputs or interactions from one of the one or more input devices 532 and to translate the detected inputs or interactions.
In some embodiments, the network video conference processing apparatus provided in the embodiments of the present application may be implemented in software. Fig. 3 shows a network video conference processing apparatus 555 stored in the memory 550, which may be software in the form of programs and plug-ins, and includes the following software modules: the receiving module 5551, the display module 5552, the allocation module 5553, the update module 5554, the decoding module 5555, the theme recognition module 5556, the moving module 5557, the masking module 5558, the mapping module 5559, the rendering module 55510, and the image segmentation module 55511. These modules are logical, and thus may be arbitrarily combined or further divided according to the functions implemented. It should be noted that, for convenience of description, all of the modules are shown in fig. 3 at once, but this should not be taken to rule out an implementation of the network video conference processing apparatus 555 that includes only the receiving module 5551 and the display module 5552. The functions of the respective modules are described below.
In other embodiments, the network video conference processing apparatus provided in the embodiments of the present application may be implemented in hardware. As an example, the apparatus may be a processor in the form of a hardware decoding processor, programmed to execute the network video conference processing method provided in the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may be one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The network video conference processing method provided by the embodiment of the present application will be described below with reference to exemplary applications and implementations of the terminal provided by the embodiment of the present application. For example, referring to fig. 4, fig. 4 is a schematic flowchart of a network video conference processing method provided in an embodiment of the present application, and will be described with reference to the steps shown in fig. 4.
It should be noted that the method shown in fig. 4 can be executed by various forms of computer programs running on the terminal 400-1 shown in fig. 2, and is not limited to the client 410-1 described above; it may also be the operating system 551, a software module, or a script described above. Therefore, the client described below should not be considered as limiting the embodiments of the present application.
In step S101, a video stream of the network video conference is received.
Here, the video code stream may include a plurality of images of the participants in real time, and the images of the participants are obtained by image capturing of the multiple participants of the network video conference.
For example, take the case in which the multiple participant objects of the network video conference are users A to D. The server first receives the face images captured in real time by the cameras called by the terminals respectively associated with users A to D (that is, the server receives the face image of user A captured by the camera called by terminal A associated with user A, the face image of user B captured by terminal B associated with user B, the face image of user C captured by the camera called by terminal C associated with user C, and the face image of user D captured by terminal D associated with user D). The server then encodes the received face images of users A to D to obtain the video code streams of the network video conference to be sent to terminals A to D, where the video code streams include the face images of users A to D.
It should be noted that, in practical applications, the terminal may also receive a synchronized audio code stream while receiving the video code stream. The terminal then decodes the audio code stream to obtain audio sample data, and plays the audio sample data in synchronization with the video pixel data while rendering and displaying the video pixel data obtained by decoding the video code stream, thereby achieving synchronized playback of video and audio.
In some embodiments, receiving the video code stream of the network video conference may be implemented as follows: receiving a video code stream of the network video conference generated by the server in the following manner: when any participant object among the plurality of participant objects exits the network video conference or its connection is abnormal, moving that participant object out of the network video conference; and receiving the participant object images of the remaining participant objects of the network video conference, and generating the video code stream of the network video conference according to the participant object images of the remaining participant objects.
In an example, still taking users A to D as the multiple participant objects of the network video conference: when user A exits the network video conference (for example, a click operation by user A on a button for ending the conference displayed in the human-computer interaction interface is received) or the connection is abnormal (for example, the network connection of the terminal associated with user A is abnormal, disconnected, or restricted), the server moves user A out of the network video conference. The server then receives the face image of user B captured by the camera called by terminal B associated with user B, the face image of user C captured by the camera called by terminal C associated with user C, and the face image of user D captured by the camera called by terminal D associated with user D. The server then generates the video code stream of the network video conference from the received face images of users B to D (at this time, the video code stream includes only the face images of users B to D), so that when the terminal displays the virtual conference scene according to the video code stream, the face image of user A is correspondingly removed from the virtual conference scene (that is, the face image of user A is no longer displayed in the virtual conference scene). This allows the conference initiator to learn the status of the participants in real time, for example, whether a user has exited the network video conference.
In other embodiments, the positions corresponding to the plurality of participant objects are selected from the unoccupied positions in the virtual conference scene; after any participant object is moved out of the network video conference, the following processing may be performed: updating the position corresponding to that participant object in the virtual conference scene from the occupied state to the unoccupied state.
For example, as described above, the positions corresponding to users A to D (i.e., the positions for displaying the face images of users A to D) are selected from the positions in the virtual conference scene that are in the unoccupied state. Assume that the position corresponding to user A is position 1 in the virtual conference scene (position 1 is then in the occupied state). When the server moves user A, who has exited the network video conference or whose connection is abnormal, out of the network video conference, position 1 may also be updated from the occupied state to the unoccupied state. At this time, other participants in the network video conference (for example, users B to D, or users E and F who newly join the network video conference) can occupy position 1 in the virtual conference scene. In this way, positions in a failed state (i.e., those of users whose connection is abnormal or who have exited the network video conference) are recycled, saving the server's system resources.
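The occupancy bookkeeping described above could be sketched as follows; the class and method names are illustrative and not taken from this application:

```python
class ScenePositions:
    def __init__(self, num_positions: int):
        self.free = list(range(num_positions))   # positions in the unoccupied state
        self.taken = {}                          # participant id -> position index

    def allocate(self, participant_id: str) -> int:
        # Pick the first position still in the unoccupied state.
        pos = self.free.pop(0)
        self.taken[participant_id] = pos
        return pos

    def release(self, participant_id: str) -> None:
        # Called when a participant exits or the connection is abnormal: the
        # position returns to the unoccupied state and can be reused.
        pos = self.taken.pop(participant_id)
        self.free.append(pos)
        self.free.sort()
```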
In other embodiments, receiving the video code stream of the network video conference may be implemented as follows: receiving a video code stream of the network video conference generated by the server in the following manner: acquiring a plurality of participant object images obtained by separately capturing images of the plurality of participant objects of the network video conference, and a mask image corresponding to each participant object image; performing mask processing on each participant object image based on the corresponding mask image to obtain participant object images with the backgrounds removed; acquiring an image of a virtual conference scene adapted to the theme of the network video conference; filling the background-removed participant object images into the positions in the image of the virtual conference scene corresponding to the plurality of participant objects to obtain a combined image; and encoding the combined images respectively corresponding to different moments to obtain the video code stream of the network video conference.
By way of example, take users A to D as the multiple participant objects of the network video conference. The server first receives the face images of users A to D and the corresponding mask images sent by the terminals respectively associated with users A to D (i.e., the server receives the face image of user A and mask image A sent by terminal A associated with user A, the face image of user B and mask image B sent by terminal B, the face image of user C and mask image C sent by terminal C, and the face image of user D and mask image D sent by terminal D, where mask image A is obtained by terminal A calling an object recognition model to perform object recognition on the face image of user A, and mask images B to D are obtained in the same way by terminals B to D). The server then performs mask processing on each face image based on the corresponding mask image to obtain a background-removed face image; for example, the server masks the face image of user A based on mask image A to obtain user A's background-removed face image, and likewise for users B to D. Next, the server obtains an image of a virtual conference scene adapted to the theme of the current network video conference (for example, a background image selected by the host of the network video conference, or a background image automatically selected by the server according to the theme of the network video conference). After obtaining the image of the virtual conference scene, the server fills the background-removed face images of users A to D into the positions in the image of the virtual conference scene respectively corresponding to users A to D, to obtain a combined image. Finally, the server encodes the combined images respectively corresponding to different moments to obtain the video code streams to be sent to terminals A to D. In this way, after receiving the video code stream sent by the server, terminals A to D can directly display the combined image according to the video code stream; that is, each terminal is only responsible for image capture and for displaying the combined image, which greatly reduces the terminal's computation and thus its power consumption.
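The server-side compositing just described can be pictured with the sketch below; the helper names are placeholders, since this application does not prescribe a specific API, and the final encoding step is only indicated by a comment:

```python
import numpy as np

def compose_frame(scene_img, participant_imgs, masks, positions):
    """Fill background-removed participant images into the virtual scene image."""
    frame = scene_img.copy()
    for img, mask, (y, x) in zip(participant_imgs, masks, positions):
        h, w = img.shape[:2]
        fg = mask.astype(bool)[..., np.newaxis]              # participant pixels
        region = frame[y:y + h, x:x + w]
        frame[y:y + h, x:x + w] = np.where(fg, img, region)  # scene stays as background
    return frame

# One combined image is produced per capture instant; the sequence of combined
# images is then encoded (e.g., with H.264) into the conference video code stream.
```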
In step S102, a virtual meeting scene is displayed.
In some embodiments, displaying the virtual meeting scene described above may be achieved by: displaying a virtual conference scene adaptive to the theme of the network video conference according to the video code stream; and when the theme of the network video conference is determined to be changed according to the video code stream, updating the virtual conference scene to be adapted to the changed theme.
For example, displaying a virtual conference scene adapted to the theme of the network video conference according to the video code stream may be implemented as follows: decoding the video code stream to obtain a plurality of video frames; calling a theme recognition model to perform theme recognition processing on the plurality of video frames to obtain a theme recognition result for each video frame, and determining the theme recognition result with the highest repetition frequency as the theme of the network video conference; and obtaining and displaying a virtual conference scene adapted to the determined theme of the network video conference.
For example, assume that 10 video frames are obtained after decoding the current video code stream. The terminal calls the theme recognition model to perform theme recognition processing on the 10 video frames to obtain a theme recognition result for each video frame, and then determines the theme recognition result with the highest repetition frequency as the theme of the current network video conference; for example, if the theme recognition result of 6 of the 10 video frames is "academic conference", the academic conference is determined as the theme of the current network video conference. Finally, the terminal obtains and displays a virtual conference scene adapted to an academic conference (such as a virtual classroom scene).
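A sketch of this highest-repetition-frequency rule, with `theme_model` standing in for the theme recognition model (an assumption, since the model is not specified here):

```python
from collections import Counter

def infer_conference_theme(video_frames, theme_model) -> str:
    # One theme recognition result per decoded video frame.
    results = [theme_model(frame) for frame in video_frames]
    # The result with the highest repetition frequency is the conference theme.
    theme, _count = Counter(results).most_common(1)[0]
    return theme   # e.g., "academic conference" when 6 of 10 frames agree
```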
In other embodiments, the virtual conference scene may also be set manually by the conference initiator. For example, multiple candidate virtual conference scenes are presented in the human-computer interaction interface of the terminal associated with the conference initiator for the conference initiator to choose from, so that when the conference initiator selects a certain virtual conference scene (for example, virtual conference scene 1) from the multiple candidates, virtual conference scene 1 is displayed on the human-computer interaction interface of the terminal associated with each participant object in the network video conference.
It should be noted that, in practical applications, the virtual conference scenes displayed on the human-computer interaction interfaces of the terminals associated with different participant objects of the network video conference may also differ. That is, a plurality of candidate virtual conference scenes may be displayed on the human-computer interaction interface of the terminal associated with each participant object for that participant object to choose from. For example, if participant object A selects virtual conference scene 1, virtual conference scene 1 is displayed on the human-computer interaction interface of the terminal associated with participant object A; if participant object B selects virtual conference scene 2, virtual conference scene 2 is displayed on the human-computer interaction interface of the terminal associated with participant object B, thereby meeting users' personalized needs. In addition, the position ordering of the multiple participant objects is consistent across different virtual conference scenes; that is, the human-computer interaction interfaces of the terminals associated with different participant objects merely display different types of virtual conference scenes, and the position ordering of the multiple participant objects of the network video conference remains the same across those different types of virtual conference scenes.
In addition, when the terminal determines, according to the video code stream, that the theme of the network video conference has changed, the virtual conference scene can be automatically updated to adapt to the changed theme. For example, when the terminal calls the theme recognition model to perform theme recognition processing on the plurality of video frames obtained by decoding a subsequent video code stream, and the theme recognition result with the highest repetition frequency has changed to a symposium, the terminal automatically obtains and displays a virtual conference scene adapted to a symposium (for example, a virtual conference room scene). In this way, the virtual conference scene is automatically updated as the theme of the network video conference changes, improving the user experience.
It should be noted that the theme recognition model may be a neural network model (e.g., a convolutional neural network, a deep convolutional neural network, or a fully connected neural network), a decision tree model, a gradient boosting tree, a multi-layer perceptron, a support vector machine, etc.; the embodiment of the present application does not specifically limit the type of the theme recognition model.
In other embodiments, the theme of the network video conference may also be determined according to the conference schedule, shared files, and the like. For example, the terminal may determine the theme of the network video conference based on information such as the conference name and conference content carried in the conference schedule of the network video conference; of course, the terminal may also determine the theme of the network video conference according to files shared by participant objects in the network video conference, for example, according to the file names and file contents of the shared files.
In some embodiments, the virtual conference scene may also be updated manually. For example, the conference initiator has the right to update the virtual conference scene: assuming that the current virtual conference scene is a virtual classroom scene, when the next meeting is a relatively casual one, the conference initiator may manually update the virtual conference scene to a virtual conference room scene in advance.
In step S103, a plurality of participant object images are displayed at positions corresponding to the respective plurality of participant objects in the virtual conference scene.
In some embodiments, step S103 shown in fig. 4 may be implemented by steps S1031A to S1032A shown in fig. 5A, which will be described in conjunction with the steps shown in fig. 5A.
In step S1031A, corresponding positions are allocated to the multiple participant objects in the virtual conference scene according to the sequence of the multiple participant objects joining the network video conference.
Here, the position ordering of the multiple participating objects in the virtual conference scene corresponds to the sequence of the multiple participating objects joining the network video conference.
In some embodiments, allocating corresponding positions to the multiple participant objects in the virtual conference scene according to the sequence in which they joined the network video conference may be implemented as follows: sorting the multiple participant objects by the time at which they joined the network video conference; taking the first position in the virtual conference scene as the position corresponding to the participant object ranked first in the sorting result (namely, the participant object that joined the network video conference earliest); and sequentially taking the other positions after the first position in the virtual conference scene as the positions corresponding to the other participant objects after the first-ranked participant object in the sorting result. For example, the second position in the virtual conference scene is taken as the position corresponding to the participant object ranked second in the sorting result, and so on, so that the position ordering of the multiple participant objects in the virtual conference scene corresponds to the sequence in which they joined the network video conference, thereby improving users' enthusiasm for joining the network video conference.
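A minimal sketch of this ordering rule (all names are illustrative, not taken from this application):

```python
def assign_positions_by_join_time(join_times: dict) -> dict:
    """join_times maps participant id -> time of joining the conference."""
    ordered = sorted(join_times, key=join_times.get)      # earliest joiner first
    # The first position goes to the earliest joiner, and so on.
    return {pid: pos for pos, pid in enumerate(ordered)}

# Example: assign_positions_by_join_time({"A": 3.0, "B": 1.0}) -> {"B": 0, "A": 1}
```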
In step S1032A, a plurality of participant object images are displayed according to the corresponding positions allocated to the respective plurality of participant objects.
In some embodiments, according to the sequence in which the multiple participant objects join the network video conference, after the multiple participant objects are respectively allocated with corresponding positions in the virtual conference scene, multiple participant object images may be displayed according to the corresponding positions allocated to the multiple participant objects, for example, assuming that the position allocated to the participant object a in the virtual conference scene is position 1, the participant object image of the participant object a (for example, the face image of the participant object a) is displayed at position 1, and the position allocated to the participant object B is position 2, the participant object image of the participant object B (for example, the face image of the participant object B) is displayed at position 2, and so on.
It should be noted that, in the virtual conference scene, the corresponding positions respectively allocated to the multiple conference objects may be continuous or intermittent (for example, an empty position is spaced between two adjacent conference objects), as long as the position ordering corresponds to the sequence of the multiple conference objects joining the network video conference, which is not specifically limited in this embodiment of the present application.
For example, referring to fig. 6A, fig. 6A is a schematic diagram of an application scene of the network video conference processing method provided in this embodiment of the present application. As shown in fig. 6A, a plurality of continuously arranged participant object images (for example, participant object image 602, participant object image 603, participant object image 604, and participant object image 605) are displayed in a virtual conference scene 601, and the position ordering of the plurality of participant object images corresponds to the sequence in which the plurality of participant objects joined the network video conference: the participant object (for example, user A) corresponding to the leftmost participant object image 602 joined the network video conference earliest, the participant object (for example, user B) corresponding to participant object image 603 joined second, the participant object (for example, user C) corresponding to participant object image 604 joined third, and the participant object (for example, user D) corresponding to the rightmost participant object image 605 joined latest. In this way, the position ordering visibly corresponds to the join order, which encourages users to join the network video conference promptly.
For example, referring to fig. 6B, fig. 6B is a schematic diagram of an application scene of the network video conference processing method provided in this embodiment of the present application. As shown in fig. 6B, multiple participant object images (e.g., participant object images 606, 607, 608, and 609) are displayed at intervals in a virtual conference scene 610, and the position ordering of the multiple participant objects corresponds to the sequence in which they joined the network video conference: the participant object (e.g., user A) corresponding to the leftmost participant object image 606 joined the network video conference earliest, the participant object (e.g., user B) corresponding to participant object image 607 joined second, the participant object (e.g., user C) corresponding to participant object image 608 joined third, and the participant object (e.g., user D) corresponding to the rightmost participant object image 609 joined latest, likewise encouraging users to join the network video conference promptly.
In other embodiments, step S103 shown in fig. 4 can also be implemented by step S1031B to step S1032B shown in fig. 5B, which will be described with reference to the step shown in fig. 5B.
In step S1031B, corresponding positions are allocated to the multiple conference objects in the virtual conference scene according to the speaking orders of the multiple conference objects in the network video conference.
Here, the position ordering of the plurality of conference objects in the virtual conference scene corresponds to the speaking order of the plurality of conference objects.
In some embodiments, the speaking sequence of the multiple participant objects in the network video conference may be obtained in advance. For example, a conference schedule of the network video conference may be obtained (the conference schedule records the conference content or speaking time of each participant object), and the speaking sequence of the multiple participant objects in the network video conference is determined according to the conference schedule; then, corresponding positions are respectively allocated to the multiple participant objects in the virtual conference scene according to the speaking sequence. For example, if participant object A speaks first, the first position in the virtual conference scene may be taken as the position corresponding to participant object A; if participant object B speaks second, the second position in the virtual conference scene may be taken as the position corresponding to participant object B; and so on.
In step S1032B, a plurality of participant object images are displayed according to the corresponding positions allocated to the respective plurality of participant objects.
In some embodiments, after the plurality of conference objects are respectively allocated with corresponding positions in the virtual conference scene according to the speaking sequence of the plurality of conference objects in the network video conference, a plurality of conference object images may be displayed according to the corresponding positions respectively allocated for the plurality of conference objects, for example, assuming that the position allocated to conference object a (i.e., the conference object speaking earliest) in the virtual conference scene is position 1, the conference object image of conference object a (e.g., the face image of conference object a) is displayed at position 1, the position allocated to conference object B (i.e., the conference object speaking second) is position 2, the conference object image of conference object B (e.g., the face image of conference object B) is displayed at position 2, and so on, the plurality of conference object images are displayed according to the speaking sequence, and the efficiency of the network video conference is improved.
It should be noted that, in the virtual conference scene, the corresponding positions respectively allocated to the multiple conference objects may be continuous or intermittent (for example, an empty position is spaced between two adjacent conference objects), as long as the position ordering corresponds to the speaking sequence of the multiple conference objects in the network video conference, which is not specifically limited in this embodiment of the present application.
In some embodiments, step S103 shown in fig. 4 may also be implemented by step S1031C to step S1032C shown in fig. 5C, which will be described in conjunction with the steps shown in fig. 5C.
In step S1031C, according to the identity information sequence of the multiple conference participants in the network video conference, corresponding positions are respectively allocated to the multiple conference participants in the virtual conference scene.
Here, the position ranks of the multiple conference objects in the virtual conference scene correspond to the identity information ranks of the multiple conference objects, where the identity information ranks may be participant role ranks (e.g., host, speaker, and listener), job position ranks, account number ranks, department ranks, and the like.
In some embodiments, the terminal may send the participant object image to the server and, at the same time, send the bound account to the server, so that the server obtains the identity information of the participant object (e.g., participant role, job position, account level, department, etc.) according to the account; the server then ranks the multiple participant objects according to their identity information to obtain the identity information ordering of the multiple participant objects in the network video conference.
It should be noted that, in practical application, a corresponding area may be pre-allocated in the virtual conference scene for each type of identity information, with different types of identity information corresponding to different areas in the virtual conference scene; for example, the higher a type of identity information ranks, the closer its corresponding area is to the center of the virtual conference scene.
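As a hedged illustration of this area pre-allocation, the following Python sketch maps participant roles to rows of seats, with higher-ranked roles nearer the front; the role names and row numbers are hypothetical, not prescribed by the patent.

```python
# Sketch: rows are pre-allocated per role; row 0 is the row nearest the
# center/front of the virtual conference scene. Unknown roles fall back
# to the listener area. All concrete values here are assumptions.
ROLE_ROW = {"host": 0, "speaker": 1, "listener": 4}

def allocate_position(role, next_free_col):
    """next_free_col: dict mapping row -> next unoccupied column in that row."""
    row = ROLE_ROW.get(role, ROLE_ROW["listener"])
    col = next_free_col[row]
    next_free_col[row] = col + 1  # mark the column as occupied
    return (row, col)

free = {row: 0 for row in range(5)}
print(allocate_position("host", free))     # (0, 0)
print(allocate_position("speaker", free))  # (1, 0)
```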
In step S1032C, a plurality of participant object images are displayed according to the corresponding positions allocated to the respective plurality of participant objects.
In some embodiments, after the multiple participant objects are respectively allocated corresponding positions in the virtual conference scene according to the identity information ordering of the multiple participant objects in the network video conference, the multiple participant object images may be displayed according to the corresponding positions respectively allocated to the multiple participant objects. Taking participant role ordering as an example: assuming that the position allocated to the host of the network video conference in the virtual conference scene is position 1 in the first row, the image of the host (for example, the face image of the host) is displayed at position 1; the position allocated to speaker A is position 2 in the second row, so the image of speaker A (for example, the face image of speaker A) is displayed at position 2; and the position allocated to listener B is position 3 in the fifth row, so the image of listener B (for example, the face image of listener B) is displayed at position 3. In this way, the position ordering of the multiple participant objects corresponds to their participant role ordering, which facilitates management of the network video conference.
In still other embodiments, the terminal may further perform the following processing: when it is identified that a target participant object that is currently speaking exists among the multiple participant objects, moving the participant object image corresponding to the target participant object from its originally allocated position to a specific position in the virtual conference scene, wherein the significance degree of the specific position is greater than that of the originally allocated position of the target participant object; and when it is recognized that the target participant object has finished speaking, moving the participant object image corresponding to the target participant object from the specific position back to the originally allocated position.
For example, referring to fig. 6C, fig. 6C is a schematic diagram of an application scenario of the network video conference processing method provided in this embodiment. As shown in fig. 6C, participant object images of multiple participant objects are displayed in a virtual conference scene 611. When the terminal recognizes that a target participant object (for example, participant object A) currently speaking exists among the multiple participant objects, it may move the participant object image 612 corresponding to participant object A from the originally allocated position 613 to a specific position 614 in virtual conference scene 611 (specific position 614 may be, for example, the middle position of the first row in virtual conference scene 611); then, when it is recognized that participant object A has finished speaking, the participant object image 612 corresponding to participant object A may be moved from the specific position 614 back to the originally allocated position 613. In this way, the participant object image of whoever is currently speaking is moved to a prominent position in the virtual conference scene, drawing the attention of the other participant objects and improving the user experience.
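A minimal sketch of this move-and-restore behavior follows; the event hooks, the choice of prominent seat, and the scene structure are assumptions for illustration.

```python
# Sketch: remember each participant's originally allocated position, move the
# active speaker to a prominent seat, and restore the original seat afterward.
class SceneLayout:
    PROMINENT = (0, 2)  # e.g., middle seat of the first row (assumed)

    def __init__(self):
        self.position = {}  # participant -> currently displayed position
        self.home = {}      # participant -> originally allocated position

    def place(self, participant, pos):
        self.position[participant] = pos
        self.home[participant] = pos

    def on_speaking_started(self, participant):
        self.position[participant] = self.PROMINENT

    def on_speaking_ended(self, participant):
        self.position[participant] = self.home[participant]
```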
In other embodiments, the video code stream may further include a plurality of mask images corresponding to the plurality of participant object images one to one, and the plurality of mask images are obtained by respectively performing object identification on the plurality of participant object images; the above-mentioned displaying of the images of the multiple conference objects at the positions corresponding to the multiple conference objects in the virtual conference scene may be implemented by: the following processing is performed for each of the participant object images: performing mask processing on the participant object image based on the mask image corresponding to the participant object image to obtain the participant object image with the background removed; and displaying the plurality of participant object images from which the background is removed at positions corresponding to the plurality of participant objects in the virtual conference scene.
For example, taking multiple conference participants of a network video conference as users a to D as an example, a video code stream sent by a server includes face images of the users a to D and a mask image corresponding to each face image, a terminal calls a decoder to perform decoding processing after receiving the video code stream to obtain the face images of the users a to D and the mask image corresponding to each face image, and then the terminal may perform the following processing for each face image: performing mask processing on the face image based on the mask image corresponding to the face image to obtain a face image with a background removed, for example, taking the face image of the user a as an example, performing mask processing on the face image of the user a by using the mask image corresponding to the face image of the user a to obtain the face image with the background removed of the user a; finally, the terminal can display the face images of the users A to D with the background removed at the positions corresponding to the users A to D respectively in the virtual conference scene, so that the mask processing is carried out on the face images to remove the background of the face images, the conference scene displayed on the human-computer interaction interface is more consistent with the real conference scene, and the use experience of the users is improved.
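The mask processing step can be illustrated with the following NumPy sketch, assuming 8-bit images; the array shapes and the per-pixel weighting scheme are assumptions for the example, not a prescribed implementation.

```python
# Sketch: treat the mask as a per-pixel foreground weight, so background
# pixels (mask value 0) are suppressed and foreground pixels (255) are kept.
import numpy as np

def remove_background(face_img: np.ndarray, mask_img: np.ndarray) -> np.ndarray:
    """face_img: HxWx3 uint8; mask_img: HxW uint8 with values in [0, 255]."""
    alpha = mask_img.astype(np.float32) / 255.0       # [0, 255] -> [0, 1]
    return (face_img.astype(np.float32) * alpha[..., None]).astype(np.uint8)
```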
In some embodiments, to reduce the bandwidth required for transmitting the video bitstream, the pixel values of the mask image may have a value range that is smaller than the value range of the pixel values of the participant object image (i.e., the mask image is compressed), and then the following processing may be performed before the participant object image is masked based on the mask image corresponding to the participant object image: and mapping the mask image to make the value range of the pixel value of the mask image obtained after mapping consistent with the value range of the pixel value of the image of the participant.
For example, assuming that the value range of the pixel values of the mask image obtained by decoding is [0, 128] and the value range of the pixel values of the participant object image is [0, 255], before the participant object image is masked by using the mask image, the mask image needs to be mapped first to expand the value range of its pixel values to [0, 255]; the participant object image is then masked based on the mapped mask image to obtain the participant object image with the background removed.
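For example, the mapping step could be sketched as follows, assuming NumPy arrays and the [0, 128] to [0, 255] bounds from the example above:

```python
# Sketch: expand the compressed mask range back to the participant image's
# pixel range before masking. Bounds follow the example in the text.
import numpy as np

def remap_mask(mask_decoded: np.ndarray) -> np.ndarray:
    """mask_decoded: uint8 array with values in [0, 128]."""
    expanded = mask_decoded.astype(np.float32) / 128.0 * 255.0
    return np.clip(expanded, 0, 255).astype(np.uint8)
```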
In other embodiments, the video code stream may further include an image of the virtual conference scene, and displaying the virtual conference scene according to the video code stream and displaying the multiple participant object images at the positions corresponding to the multiple participant objects in the virtual conference scene may be implemented in the following manner: decoding the video code stream to obtain the video pixel data of the video frames; and performing rendering according to the decoded video pixel data, so as to display the image of the virtual conference scene in the human-computer interaction interface and display the multiple participant object images at the positions corresponding to the multiple participant objects in the image of the virtual conference scene.
In some embodiments, following the above example, the above displaying of the images of the multiple participants at the positions corresponding to the multiple participants in the image of the virtual conference scene may be implemented as follows: performing image segmentation processing on each participant object image to obtain a mask image corresponding to each participant object image; performing mask processing on the corresponding participant object images based on each mask image to obtain a plurality of participant object images with the backgrounds removed; and displaying a plurality of conference object images with the background removed at positions corresponding to the plurality of conference objects in the image of the virtual conference scene respectively, so that the conference object images displayed in the virtual conference scene are background-removed, thereby being more in line with the real conference scene and improving the visual experience of users.
For example, the mask image corresponding to each participant object image may be obtained by performing image segmentation processing on each participant object image in the following manner: the following processing is performed for each participant object image: calling an image segmentation model based on the participant object image to identify the participant object in the participant object image, taking the region outside the participant object as the background, and generating a mask image corresponding to the background, wherein the image segmentation model is obtained through training based on sample images and the objects labeled in the sample images.
It should be noted that the image segmentation model may be a neural network model (e.g., a convolutional neural network, a deep convolutional neural network, a fully-connected neural network, etc.), a decision tree model, a gradient boosting tree, a multi-layer perceptron, a support vector machine, etc., and the embodiment of the present application does not specifically limit the type of the image segmentation model.
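Purely as an illustration, a segmentation step of this kind might look like the sketch below; `segmentation_model` stands in for whatever trained model is deployed, and the 0.5 threshold is an assumption.

```python
# Sketch: obtain a binary mask from an (unspecified) segmentation model that
# returns per-pixel foreground probabilities; the region outside the
# participant object becomes the background (mask value 0).
import numpy as np

def make_mask(participant_img: np.ndarray, segmentation_model) -> np.ndarray:
    fg_prob = segmentation_model(participant_img)  # assumed: HxW floats in [0, 1]
    return (fg_prob > 0.5).astype(np.uint8) * 255
```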
According to the network video conference processing method, the virtual conference scene is displayed, and the plurality of participant images are displayed at the positions corresponding to the plurality of participant objects in the virtual conference scene respectively, so that a real conference scene can be simulated, a user can feel an in-place conference feeling, and good immersive experience in the network video conference process is realized.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
The embodiment of the application provides a network video conference processing method that offers a variety of engaging virtual conference scenes, which can better enhance the interest of the network video conference, and can provide an immersive conference feeling for the user in a virtual conference scene that simulates an on-site conference, thereby improving the user's initiative in using the network video conference.
Specifically, according to the embodiment of the application, the portraits corresponding to users on different terminals can be gathered into the picture of the same virtual conference scene, and the gathered picture can be rendered and displayed on the respective terminals of the different users, creating an on-site conference atmosphere, thereby narrowing the gap between the network video conference and the on-site conference experience and improving the experience of the network video conference. In addition, the network video conference processing method provided by the embodiment of the application can support the simultaneous connection of a large number of (for example, 100) terminals; through a system design in which the terminals are decoupled from the network conference server, the terminals are only responsible for data acquisition and display, which greatly reduces the computation load of the terminals and effectively reduces power consumption.
The following describes a network video conference processing method provided in the embodiment of the present application in detail.
For example, referring to fig. 7, fig. 7 is a schematic view of an overall framework of interaction between multiple terminals and a network conference server provided in an embodiment of the present application, and as shown in fig. 7, each terminal (for example, a mobile phone) performs data acquisition and reports the acquired data to the network conference server, so that the network conference server processes the data acquired by each terminal, merges the data and the position arrangement of a virtual conference scene, and returns the processed data to each terminal for display.
For example, referring to fig. 8, fig. 8 is a schematic flowchart of a process of interaction between a single terminal and a network conference server provided in an embodiment of the present application, where as shown in fig. 8, the terminal invokes a camera to perform image acquisition and image segmentation on a user (i.e., a participant), outputs a portrait image (i.e., a source image) and a segmented portrait mask image to the network conference server, and the network conference server performs position arrangement, image synthesis, and the like according to the source image and the portrait mask image input by each terminal, and finally returns a synthesized conference image to the terminal for display.
First, a process of transmitting an h.264 code stream obtained by encoding a source image and a portrait mask image to a server by a terminal will be described.
For example, referring to fig. 9, fig. 9 is a schematic view of a scene in which a single network conference client interacts with a network conference server according to an embodiment of the present disclosure, as shown in fig. 9, a terminal (e.g., a mobile phone) acquires a portrait image by calling a camera, transmits the acquired portrait image to a network conference APP (referred to as a network conference in fig. 9 for short), the network conference APP receives the portrait image and then calls a feed-forward neural network inference engine (XNN) to perform artificial intelligence algorithm processing, outputs a source map and a portrait mask map of specified sizes, and then performs range mapping on the portrait mask map by using a precoding module, and compresses a range of pixel values of the portrait mask map, so that in a subsequent encoding process (i.e., compression), fewer code streams are used for encoding.
For example, referring to fig. 10, fig. 10 is a schematic flowchart of the pre-coding processing performed on the portrait mask map according to an embodiment of the present application. As shown in fig. 10, assuming that the original value range of the pixel values of the portrait mask map output from XNN is [0, 255], the mask map is first normalized so that its pixel values fall in [0, 1], then multiplied by 128 and rounded, mapping the values to the range [0, 128] and yielding the pre-coded portrait mask map. By compressing the value range of the pixel values of the portrait mask map from [0, 255] to [0, 128], the portrait mask map can subsequently be encoded using a smaller code stream, thereby saving system resources.
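A short sketch of this pre-coding mapping follows, assuming the mask arrives as a uint8 NumPy array:

```python
# Sketch: normalize [0, 255] -> [0, 1], multiply by 128, round; the resulting
# [0, 128] mask costs fewer bits to encode.
import numpy as np

def precode_mask(mask_xnn: np.ndarray) -> np.ndarray:
    """mask_xnn: uint8 array in [0, 255], as output by the segmentation stage."""
    normalized = mask_xnn.astype(np.float32) / 255.0
    return np.round(normalized * 128.0).astype(np.uint8)
```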
With reference to fig. 9, after receiving the source image and the pre-coded portrait mask image, an encoder in the network conference APP performs h.264 coding on the source image and the pre-coded portrait mask image to obtain a corresponding h.264 code stream, and finally the network conference APP calls a sending module to send the h.264 code stream obtained by coding to the network conference server.
It should be noted that, in practical applications, other video coding algorithms may also be adopted, for example, H.265, H.266, Advanced Video Coding (AVC), and the like, and this is not limited in the embodiments of the present application.
The following description continues with the process of the terminal receiving the h.264 code stream obtained by encoding the rendered image sent by the network conference server.
Continuing to refer to fig. 9, the network conference APP receives an h.264 code stream obtained by encoding the rendered image from the network conference server by using the receiving module, then calls a decoder to decode the received h.264 code stream to obtain an image rendered by the server, and finally displays the rendered image on a screen of the terminal by using the network conference APP.
The following describes a network video conference processing method provided by the embodiment of the present application from a server side.
For example, referring to fig. 11, fig. 11 is a schematic view of a scenario in which multiple network conference clients interact with the network conference server according to an embodiment of the present application. As shown in fig. 11, the network conference server receives, through a receiving module, the H.264 code streams (which may also be H.265 code streams, H.266 code streams, and the like, depending on the video coding algorithm used during encoding) sent by multiple network conference APPs (simply referred to as network conferences in fig. 11, including, for example, network conference 1 to network conference n); the network conference server then calls a decoder to decode the H.264 code streams sent by the multiple network conference APPs to obtain the corresponding multiple source maps (for example, source map 1 to source map n) and portrait mask maps. For the multiple portrait mask maps obtained by decoding, the network conference server may further invoke a post-decoding module to perform further pixel value mapping on them to obtain multiple mapped portrait mask maps (e.g., portrait mask map 1 to portrait mask map n, where each source map corresponds to one portrait mask map; for example, source map 1 corresponds to portrait mask map 1, and source map n corresponds to portrait mask map n).
For example, referring to fig. 12, fig. 12 is a schematic flowchart of the post-decoding processing performed on the decoded portrait mask map according to an embodiment of the present application. As shown in fig. 12, assuming that the value range of the pixel values of the decoded portrait mask map is [0, 128], the following arithmetic operation is performed on each pixel of the mask map:

mask pixel value / 128.0 × 255

In this way, the value range of the pixel values of the decoded portrait mask map is remapped to [0, 255].
In addition, a background map selection interface can be presented in the network conference APP of the conference initiator or host, so that the host of the conference can select a desired background map from a plurality of candidate background maps. After the host selects the background map, the host's network conference APP sends a corresponding notification message to the network conference server to notify it of the background map selected by the host; the network conference server then calls a picture layout server (Layerout-server) to arrange the layout of the portraits according to the input information, adjusts the "seats" in the background map according to the conference-joining sequence, and outputs the rendered image after the adjustment is completed.
First, a process of joining a network video conference will be described below.
For example, referring to fig. 13, fig. 13 is a schematic flowchart of a network video conference processing method provided in an embodiment of the present application, and as shown in fig. 13, after a conference application is issued by a network conference APP, a network conference server imports a source image and a portrait mask image, so that a picture layout server checks whether the input source image and portrait mask image are normal, and when the picture layout server determines that the input source image and portrait mask image are not normal, an error code is returned; when the image layout server judges that the input source image and the human figure mask image are normal, the coordinate position of the human figure in the background image is assigned according to the conference entering sequence, then the image layout server fuses the source image and the human figure mask image to the specified position in the background image (namely, the human figure with the background removed is displayed at the specified position), and finally the image layout server outputs the fused final image.
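The fusion of a source image and a portrait mask image into the background map can be illustrated by the following alpha-blending sketch; the array shapes, the (row, col) anchor convention, and the assumption that the portrait fits within the background bounds are all illustrative.

```python
# Sketch: alpha-blend a background-removed portrait into the background map at
# its assigned coordinates. Assumes the portrait fits within the background.
import numpy as np

def fuse(background, source, mask, top_left):
    """background: HxWx3 uint8; source: hxwx3 uint8; mask: hxw uint8 in [0, 255]."""
    r, c = top_left
    h, w = mask.shape
    alpha = (mask.astype(np.float32) / 255.0)[..., None]   # hxwx1 foreground weight
    region = background[r:r + h, c:c + w].astype(np.float32)
    blended = source.astype(np.float32) * alpha + region * (1.0 - alpha)
    background[r:r + h, c:c + w] = blended.astype(np.uint8)
    return background
```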
The following describes the process of exiting the network video conference.
For example, referring to fig. 14, fig. 14 is a schematic flowchart of a network video conference processing method provided in an embodiment of the present application, and as shown in fig. 14, when a certain network conference APP sends a conference ending request, or a picture layout server detects that a connection of the certain network conference APP is abnormal, an algorithm is called to remove a portrait display at a corresponding position in a background image, and a resource module is called to recover resources at the position, and finally a final image with the portrait removed is output.
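The resource-recovery step might be sketched as a simple seat pool; the data structure and names are assumptions for illustration.

```python
# Sketch: when a client exits or its connection drops, clear its seat so the
# position returns to the unoccupied state and can be reassigned.
class SeatPool:
    def __init__(self, n_seats: int):
        self.occupant = {seat: None for seat in range(n_seats)}

    def take(self, seat: int, participant: str):
        self.occupant[seat] = participant  # seat becomes occupied

    def release_for(self, participant: str):
        for seat, who in self.occupant.items():
            if who == participant:
                self.occupant[seat] = None  # seat becomes unoccupied again
```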
And finally, the network conference server can encode the rendered image and distribute the encoded H.264 code stream to each network conference APP through the sending module.
For example, referring to fig. 15, fig. 15 is a schematic view of an application scenario of the network video conference processing method provided in this embodiment of the present application. As shown in fig. 15, a plurality of background-removed portraits 1502 are displayed in a background map 1501 (for example, a virtual classroom scene), and the position ordering of the plurality of portraits 1502 in the background map 1501 is determined according to the conference-joining order; for example, the leftmost portrait 1502 joined the network video conference earliest, and the rightmost portrait 1502 joined latest.
For example, referring to fig. 16, fig. 16 is a schematic view of an application scenario of the network video conference processing method provided in the embodiment of the present application. As shown in fig. 16, a plurality of background-removed person images 1602 are displayed in a background image 1601 (e.g., a virtual seating meeting scene), and the position ordering of the plurality of person images 1602 in the background image 1601 is determined according to the conference-joining order; for example, the leftmost person image 1602 joined the network video conference earliest, and the rightmost person image 1602 joined latest.
The network video conference processing method provided by the embodiment of the application provides various interesting virtual conference scenes, can better improve the interest of the network video conference, provides an on-the-spot conference feeling for a user in the virtual scene of a simulated on-the-spot conference, and improves the initiative of the user in using the network video conference.
Continuing with the exemplary structure of the network video conference processing apparatus 555 provided in the embodiment of the present application implemented as software modules, in some embodiments, as shown in fig. 3, the software modules stored in the network video conference processing apparatus 555 in the memory 550 may include: a reception module 5551, and a display module 5552.
The receiving module 5551 is configured to receive a video code stream of the network video conference, where the video code stream includes a plurality of real-time images of participating objects, and the images of the participating objects are obtained by respectively performing image acquisition on the plurality of participating objects of the network video conference; the display module 5552 is configured to display a virtual meeting scene, and display a plurality of images of the participants at positions corresponding to the plurality of participants in the virtual meeting scene.
In some embodiments, the network video conference processing apparatus 555 further includes an allocating module 5553, configured to allocate, according to a sequence in which the multiple participant objects join the network video conference, corresponding positions to the multiple participant objects in the virtual conference scene, where a position sequence of the multiple participant objects in the virtual conference scene corresponds to the sequence; the display module 5552 is further configured to display a plurality of participant object images according to the corresponding positions allocated to the plurality of participant objects, respectively.
In some embodiments, the allocating module 5553 is further configured to allocate, according to a speaking sequence of the multiple participant objects in the network video conference, corresponding positions to the multiple participant objects in the virtual conference scene, where a position sequence of the multiple participant objects in the virtual conference scene corresponds to the speaking sequence; the display module 5552 is further configured to display a plurality of participant object images according to the corresponding positions respectively allocated to the plurality of participant objects.
In some embodiments, the allocating module 5553 is further configured to allocate, according to the identity information rankings of the multiple participant objects in the network video conference, corresponding positions for the multiple participant objects in the virtual conference scene, respectively, where the position rankings of the multiple participant objects in the virtual conference scene correspond to the identity information rankings; the display module 5552 is further configured to display a plurality of participant object images according to the corresponding positions allocated to the plurality of participant objects, respectively.
In some embodiments, the display module 5552 is further configured to display a virtual meeting scene adapted to the topic of the network video meeting according to the video code stream; the network video conference processing apparatus 555 further includes an updating module 5554, configured to update the virtual conference scene to adapt to the changed theme when it is determined that the theme of the network video conference is changed according to the video code stream.
In some embodiments, the network video conference processing apparatus 555 further includes a decoding module 5555, configured to decode the video code stream to obtain a plurality of video frames; the network video conference processing device 555 further comprises a theme recognition module 5556, configured to invoke a theme recognition model to perform theme recognition processing on the multiple video frames, obtain a theme recognition result of each video frame, and determine the theme recognition result with the highest repetition number as the theme of the network video conference; the display module 5552 is further configured to display a virtual conference scene adapted to the topic of the network video conference.
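The per-frame vote performed by the theme recognition module can be sketched as follows; `recognize_topic` is a placeholder for the (unspecified) topic identification model.

```python
# Sketch: run topic recognition on each decoded video frame and take the
# result with the highest repetition count as the conference theme.
from collections import Counter

def conference_theme(frames, recognize_topic) -> str:
    results = [recognize_topic(frame) for frame in frames]  # one label per frame
    return Counter(results).most_common(1)[0][0]
```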
In some embodiments, the network video conference processing device 555 further includes a moving module 5557, configured to, when it is identified that a target participant currently speaking exists in the multiple participants, move a participant image corresponding to the target participant from an originally allocated position to a specific position in the virtual conference scene, where the specific position is more significant than the originally allocated position; the moving module 5557 is further configured to, when recognizing that the target participant finishes speaking, move the participant image corresponding to the target participant from the specific position to the originally assigned position.
In some embodiments, the video code stream further includes a plurality of mask images corresponding to the plurality of participant object images one to one, and the plurality of mask images are obtained by respectively performing object identification on the plurality of participant object images; the network video conference processing apparatus 555 further includes a mask module 5558 for performing the following processing for each participant image: performing mask processing on the participant object image based on the mask image corresponding to the participant object image to obtain the participant object image without the background; the display module 5552 is further configured to display the plurality of conference object images with the background removed at positions corresponding to the plurality of conference objects in the virtual conference scene, respectively.
In some embodiments, the range of values of the pixel values of the mask image is less than the range of values of the pixel values of the participant object image; the network video conference processing apparatus 555 further includes a mapping module 5559, configured to perform mapping processing on the mask image, so that a value range of a pixel value of the mask image is consistent with a value range of a pixel value of the conference object image.
In some embodiments, the receiving module 5551 is further configured to receive the video code stream of the network video conference, where the video code stream is generated by the server in the following manner: when any participant object among the multiple participant objects exits the network video conference or its connection becomes abnormal, moving that participant object out of the network video conference; and receiving the participant object images of the remaining participant objects of the network video conference, and generating the video code stream of the network video conference according to the participant object images of the remaining participant objects.
In some embodiments, the respective corresponding positions of the plurality of participant objects are selected from positions in the virtual conference scene that are in an unoccupied state; after any participant object is moved out of the network video conference, the updating module 5554 is further configured to update the corresponding position of any participant object in the virtual conference scene from an occupied state to an unoccupied state.
In some embodiments, the video code stream further includes images of the virtual conference scene; the decoding module 5555 is further configured to decode the video code stream to obtain video pixel data; the network video conference processing apparatus 555 further includes a rendering module 55510, configured to perform rendering according to the decoded video pixel data, so as to display the image of the virtual conference scene in the human-computer interaction interface, and display the multiple participant object images at the positions corresponding to the multiple participant objects in the image of the virtual conference scene.
In some embodiments, the network video conference processing apparatus 555 further includes an image segmentation module 55511, configured to perform image segmentation on each participant image to obtain a mask image corresponding to the participant image; the mask module 5558 is further configured to perform mask processing on the corresponding participant object images based on each mask image to obtain a plurality of participant object images from which the background is removed; the display module 5552 is further configured to display a plurality of conference object images with the background removed at positions corresponding to the plurality of conference objects in the image of the virtual conference scene.
In some embodiments, the image segmentation module 55511 is further configured to perform the following for each participant image: calling an image segmentation model based on the image of the participant object to identify the participant object in the image of the participant object, and generating a mask image corresponding to the background by taking an area outside the participant object as the background; the image segmentation model is obtained by training an object labeled in the sample image based on the sample image.
In some embodiments, the receiving module 5551 is further configured to receive the video code stream of the network video conference, where the video code stream is generated by the server in the following manner: acquiring a plurality of participant object images obtained by respectively performing image acquisition on the multiple participant objects of the network video conference, and a mask image corresponding to each participant object image; performing mask processing on the corresponding participant object images based on each mask image to obtain a plurality of participant object images with the backgrounds removed; acquiring an image of a virtual conference scene adapted to the theme of the network video conference; filling the plurality of background-removed participant object images into the positions corresponding to the multiple participant objects in the image of the virtual conference scene to obtain combined images; and encoding the combined images respectively corresponding to different moments to obtain the video code stream of the network video conference.
It should be noted that the description of the apparatus in the embodiment of the present application is similar to the implementation of the network video conference processing method described above and has similar beneficial effects, and therefore details are not repeated. Technical details not exhaustively described for the network video conference processing apparatus provided in the embodiment of the present application can be understood from the descriptions of fig. 4 and fig. 5A to fig. 5C.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the network video conference processing method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, where the executable instructions are stored, and when executed by a processor, the executable instructions cause the processor to execute a method provided by embodiments of the present application, for example, a network video conference processing method as shown in fig. 4 or any one of fig. 5A to 5C.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of a program, software module, script, or code written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiment of the present application, by displaying the virtual meeting scene and displaying the images of the multiple meeting objects at the positions corresponding to the multiple meeting objects in the virtual meeting scene, a real meeting scene can be simulated, and an immersive meeting feeling is provided for the user, so that the initiative and the interestingness of using the network video meeting by the user are improved.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (18)

1. A method for processing a network video conference, the method comprising:
receiving a video code stream of a network video conference, wherein the video code stream comprises a plurality of real-time conference object images, and the conference object images are obtained by respectively carrying out image acquisition on a plurality of conference objects of the network video conference;
displaying a virtual meeting scene, an
And displaying the images of the plurality of participant objects at positions in the virtual conference scene corresponding to the plurality of participant objects respectively.
2. The method of claim 1, wherein displaying the plurality of participant object images in the virtual meeting scene at locations corresponding to the respective plurality of participant objects comprises:
according to the sequence of the plurality of participant objects joining the network video conference, respectively allocating corresponding positions for the plurality of participant objects in the virtual conference scene, wherein the position sequence of the plurality of participant objects in the virtual conference scene corresponds to the sequence;
and displaying the images of the plurality of participant objects according to the corresponding positions respectively allocated to the plurality of participant objects.
3. The method of claim 1, wherein displaying the plurality of participant object images in the virtual meeting scene at locations corresponding to the respective plurality of participant objects comprises:
according to the speaking sequence of the multiple participant objects in the network video conference, corresponding positions are respectively allocated to the multiple participant objects in the virtual conference scene, wherein the position sequence of the multiple participant objects in the virtual conference scene corresponds to the speaking sequence;
and displaying the images of the plurality of participant objects according to the corresponding positions respectively allocated to the plurality of participant objects.
4. The method of claim 1, wherein displaying the plurality of participant object images in the virtual meeting scene at locations corresponding to the respective plurality of participant objects comprises:
according to the identity information sequence of the multiple conference objects in the network video conference, distributing corresponding positions for the multiple conference objects in the virtual conference scene respectively, wherein the position sequence of the multiple conference objects in the virtual conference scene corresponds to the identity information sequence;
and displaying the images of the plurality of participant objects according to the corresponding positions respectively allocated to the plurality of participant objects.
5. The method of claim 1, wherein displaying the virtual meeting scene comprises:
displaying a virtual conference scene adaptive to the theme of the network video conference according to the video code stream;
and when the theme of the network video conference is determined to be changed according to the video code stream, updating the virtual conference scene to be adapted to the changed theme.
6. The method of claim 5, wherein the displaying the virtual conference scene adapted to the topic of the network video conference according to the video code stream comprises:
decoding the video code stream to obtain a plurality of video frames;
calling a theme recognition model to perform theme recognition processing on the plurality of video frames to obtain a theme recognition result of each video frame, and determining the theme recognition result with the highest repetition frequency as the theme of the network video conference;
and displaying the virtual meeting scene adaptive to the theme of the network video meeting.
7. The method according to any one of claims 1 to 6, further comprising:
when a target participant object which is speaking currently exists in the plurality of participant objects is identified, moving the participant object image corresponding to the target participant object from an originally allocated position to a specific position in the virtual conference scene, wherein the significance degree of the specific position is greater than that of the originally allocated position;
and when recognizing that the target participant object finishes speaking, moving the participant object image corresponding to the target participant object from the specific position to the originally distributed position.
8. The method of claim 1,
the video code stream further comprises a plurality of mask images corresponding to the plurality of participant object images one by one, and the plurality of mask images are obtained by respectively carrying out object identification on the plurality of participant object images;
the displaying the plurality of participant object images at positions in the virtual conference scene corresponding to the plurality of participant objects, respectively, includes:
performing the following processing for each of the participant object images:
performing mask processing on the participant object image based on the mask image corresponding to the participant object image to obtain the participant object image with the background removed;
displaying the plurality of participant object images with the background removed at positions corresponding to the plurality of participant objects in the virtual conference scene, respectively.
9. The method of claim 8,
the value range of the pixel values of the mask image is smaller than the value range of the pixel values of the participant object image;
before the participant object image is masked based on the mask image corresponding to the participant object image, the method further comprises:
mapping the mask image so that the value range of the pixel values of the mask image is consistent with the value range of the pixel values of the participant object image.
10. The method of claim 1, wherein receiving the video bitstream of the network video conference comprises:
receiving a video code stream of the network video conference generated by a server in the following way:
when any participant object in the multiple participant objects exits the network video conference or the connection is abnormal, moving the any participant object out of the network video conference;
and receiving the conference object images of the remaining conference objects of the network video conference, and generating the video code stream of the network video conference according to the conference object images of the remaining conference objects.
11. The method of claim 10,
the positions corresponding to the multiple participant objects are selected from the positions in the virtual conference scene in an unoccupied state;
after moving the any participant object out of the network video conference, the method further comprises:
and updating the corresponding position of any participant object in the virtual conference scene from an occupied state to an unoccupied state.
12. The method of claim 1,
the video code stream also comprises an image of the virtual conference scene;
the displaying a virtual conference scene and displaying the multiple participant object images at positions in the virtual conference scene corresponding to the multiple participant objects, respectively, includes:
decoding the video code stream to obtain video pixel data;
rendering according to the video pixel data obtained by decoding to display the image of the virtual conference scene in a human-computer interaction interface, and
and displaying the images of the plurality of participant objects at positions in the images of the virtual conference scene corresponding to the plurality of participant objects respectively.
13. The method of claim 12, wherein displaying the plurality of participant-object images in the image of the virtual meeting scene at locations corresponding to the respective plurality of participant objects comprises:
performing image segmentation processing on each participant object image to obtain a mask image corresponding to the participant object image;
performing mask processing on the corresponding participant object images based on each mask image to obtain a plurality of participant object images with backgrounds removed;
and displaying the plurality of participant object images with the backgrounds removed at positions corresponding to the plurality of participant objects in the image of the virtual conference scene.
14. The method according to claim 13, wherein the performing image segmentation processing on each of the images of the participant to obtain a mask image corresponding to the image of the participant comprises:
performing the following processing for each of the participant object images:
calling an image segmentation model based on the image of the participant object to identify the participant object in the image of the participant object, taking a region outside the participant object as a background, and generating a mask image corresponding to the background;
the image segmentation model is obtained by training an object labeled in a sample image based on the sample image.
15. The method of claim 1, wherein receiving the video bitstream of the network video conference comprises:
receiving a video code stream of the network video conference generated by a server in the following way:
acquiring a plurality of conference object images obtained by respectively carrying out image acquisition on a plurality of conference objects of the network video conference and a mask image corresponding to each conference object image;
performing mask processing on the corresponding participant object images based on each mask image to obtain a plurality of participant object images with backgrounds removed;
acquiring an image of a virtual conference scene adaptive to the theme of the network video conference;
filling the plurality of participant object images with the backgrounds removed in positions corresponding to the plurality of participant objects in the image of the virtual conference scene to obtain a combined image;
and coding the combined images respectively corresponding to different moments to obtain a video code stream of the network video conference.
16. A network video conference processing apparatus, the apparatus comprising:
a receiving module, configured to receive a video code stream of a network video conference, wherein the video code stream comprises a plurality of real-time participant object images, and the participant object images are obtained by respectively carrying out image acquisition on a plurality of participant objects of the network video conference;
a display module, configured to display a virtual conference scene, and to display the plurality of participant object images at positions in the virtual conference scene respectively corresponding to the plurality of participant objects.
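Purely as an illustration, claim 16's module structure might map onto a small class like the one below, reusing the render_frame helper sketched after claim 12; all names here are invented for the example.

    import cv2

    class NetworkVideoConferenceProcessor:
        """Illustrative counterpart of the receiving and display modules."""

        def __init__(self, scene_bgr):
            self.scene = scene_bgr

        def receive(self, stream_url):
            # Receiving module: open the conference video code stream.
            return cv2.VideoCapture(stream_url)

        def display(self, participant_images):
            # Display module: show participants seated in the virtual scene.
            cv2.imshow("virtual conference scene",
                       render_frame(self.scene, participant_images))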
17. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor, configured to execute the executable instructions stored in the memory to implement the network video conference processing method according to any one of claims 1 to 15.
18. A computer-readable storage medium having executable instructions stored thereon, wherein the executable instructions, when executed by a processor, implement the network video conference processing method of any one of claims 1 to 15.
CN202111623524.2A 2021-08-12 2021-12-28 Network video conference processing method and device, electronic equipment and storage medium Pending CN115706773A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021109245729 2021-08-12
CN202110924572 2021-08-12

Publications (1)

Publication Number Publication Date
CN115706773A (en) 2023-02-17

Family

ID=85180630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111623524.2A Pending CN115706773A (en) 2021-08-12 2021-12-28 Network video conference processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115706773A (en)

Similar Documents

Publication Publication Date Title
CN110889381B (en) Face changing method and device, electronic equipment and storage medium
CN103493479B (en) The system and method for the low latency H.264 anti-error code of Video coding
CN103270750B (en) Across the system and method for the real-time multimedia of multi-standard and special equipment communication
US11670015B2 (en) Method and apparatus for generating video
CN113209632B (en) Cloud game processing method, device, equipment and storage medium
CN107632824B (en) A kind of generation method of augmented reality module, generating means and generate system
CN114115518A (en) System and method for enabling interaction in a virtual environment
CN105637472B (en) The frame of screen content shared system with the description of broad sense screen
US11568646B2 (en) Real-time video dimensional transformations of video for presentation in mixed reality-based virtual spaces
CN111402399A (en) Face driving and live broadcasting method and device, electronic equipment and storage medium
CN107979763A (en) A kind of virtual reality device generation video, playback method, apparatus and system
US11451858B2 (en) Method and system of processing information flow and method of displaying comment information
CN112423022A (en) Video generation and display method, device, equipment and medium
CN114125522A (en) Spatial video based presence
CN112839196B (en) Method, device and storage medium for realizing online conference
CN114115519A (en) System and method for delivering applications in a virtual environment
CN112423013B (en) Online interaction method, client, server, computing device and storage medium
CN114201037A (en) User authentication system and method using graphical representation
CN111464828A (en) Virtual special effect display method, device, terminal and storage medium
CN114119264A (en) Ad hoc virtual communication between proximate user graphical representations
CN114201288A (en) System and method for providing cloud computing-based virtual computing resources within a virtual environment
CN115706773A (en) Network video conference processing method and device, electronic equipment and storage medium
CN116016837A (en) Immersive virtual network conference method and device
CN116193197A (en) Data processing method, device, equipment and readable storage medium
JP2023527624A (en) Computer program and avatar expression method

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40080497

Country of ref document: HK

SE01 Entry into force of request for substantive examination