WO2021093882A1 - Video meeting method, meeting terminal, server, and storage medium - Google Patents

Video meeting method, meeting terminal, server, and storage medium

Info

Publication number
WO2021093882A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
code stream
conference
server
decoding
Prior art date
Application number
PCT/CN2020/129049
Other languages
French (fr)
Chinese (zh)
Inventor
曹泊
Original Assignee
ZTE Corporation (中兴通讯股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corporation (中兴通讯股份有限公司)
Publication of WO2021093882A1 publication Critical patent/WO2021093882A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/02 Details
    • H04L12/16 Arrangements for providing special services to substations
    • H04L12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L12/1827 Network arrangements for conference optimisation or adaptation
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343 Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234309 Processing of video elementary streams involving reformatting operations by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4 or from Quicktime to Realvideo
    • H04N21/234345 Processing of video elementary streams where the reformatting operation is performed only on part of the stream, e.g. a region of the image or a time segment
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431 Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312 Generation of visual interfaces involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440218 Processing of video elementary streams involving reformatting operations by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4

Definitions

  • This application relates to the technical field of video conferencing, such as a video conferencing method, a conference terminal, a server, and a storage medium.
  • In a video conference in a cloud conference system, multiple conference terminals encode their local video data and send it to the MCU (Multipoint Control Unit) server.
  • The MCU server receives the video code streams from the multiple conference terminals, decodes them, performs multi-picture synthesis, and then encodes the video data corresponding to the multi-picture layout and sends it to every conference terminal participating in the video conference.
  • Each conference terminal receives the video code stream from the MCU server, decodes the data, and then displays the picture.
  • If many conference terminals participate in a video conference, the MCU server needs to process the video code streams of multiple video sources at the same time during the conference: it decodes the video code streams received from the multiple video sources, synthesizes the video pictures, and then encodes the synthesized picture.
  • Therefore, this processing requires the MCU server to have high processing capability; once the MCU server's performance is insufficient, the video picture on the conference terminal side is delayed, which affects the conference experience of the users participating in the conference.
  • The video conference method, conference terminal, server, and storage medium provided by the embodiments of the present application avoid the situation in which insufficient processing performance of the MCU server causes severe video picture delay on the conference terminal side and a poor conference experience for the users participating in the conference.
  • the embodiment of the present application provides a video conference method, which is applied to a conference terminal, and includes:
  • receiving a composite code stream and at least one independent code stream sent by the server, where the composite code stream is formed by the server performing decoding, picture synthesis, and re-encoding on the video code streams of some of the video sources in the video conference, and each independent code stream is a video code stream that the server received from a video source other than those video sources and forwarded to the conference terminal;
  • decoding the composite code stream and the at least one independent code stream respectively; and displaying the video pictures corresponding to the composite code stream and the at least one independent code stream.
  • The embodiment of the present application also provides a video conference method, including: receiving the video code streams sent by the video sources in the video conference; performing decoding, picture synthesis, and re-encoding on the video code streams of some of the video sources to obtain a composite code stream and sending it to each conference terminal in the video conference; and forwarding the video code streams of the video sources other than those video sources to each conference terminal in the video conference as independent code streams.
  • An embodiment of the present application also provides a conference terminal, which includes a first processor, a first memory, and a first communication bus;
  • the first communication bus is configured to enable connection and communication between the first processor and the first memory;
  • the first processor is configured to execute at least one program stored in the first memory, so as to implement the above-mentioned first video conference method.
  • An embodiment of the present application also provides a server, which includes a second processor, a second memory, and a second communication bus;
  • the second communication bus is configured to enable connection and communication between the second processor and the second memory;
  • the second processor is configured to execute at least one program stored in the second memory to implement the second video conference method described above.
  • An embodiment of the present application also provides a storage medium.
  • the storage medium stores at least one of a first video conference program and a second video conference program;
  • the first video conference program can be executed by at least one processor to implement the first video conference method described above;
  • the second video conference program can be executed by at least one processor to implement the second video conference method described above.
  • According to the video conference method, conference terminal, server, and storage medium provided by the embodiments of the present application, during the video conference, after the server receives the video code streams sent by multiple video sources in the video conference, it performs decoding, picture synthesis, and re-encoding only on the video code streams of some of the video sources.
  • The result is a composite code stream that is sent to each conference terminal in the video conference.
  • At the same time, the server sends the video code streams of the video sources other than those video sources to each conference terminal in the video conference as independent code streams, so that the conference terminals decode and display the composite code stream and the at least one independent code stream.
  • In this conference solution, the server does not need to perform decoding, picture synthesis, and re-encoding on the video code streams of all video sources, which reduces the requirement on server-side encoding and decoding capability.
  • The video code streams beyond the server's processing capacity are sent directly to the conference terminals, which makes full use of the processing resources on the conference terminal side, reduces the delay of the video picture on the conference terminal side, improves the smoothness of the video conference, and enhances the user experience.
  • FIG. 1 is an interactive flowchart of the video conference method provided in Embodiment 1 of this application;
  • FIG. 3 is a schematic diagram of the conference terminal provided in the second embodiment of the application displaying video images of multiple video sources;
  • FIG. 4 is another schematic diagram of the conference terminal provided in the second embodiment of the application displaying video images of multiple video sources;
  • FIG. 5 is a flow chart of the server shown in the second embodiment of the application processing the video code streams of some video sources to form a composite code stream;
  • FIG. 6 is an interactive flowchart of the video conference method provided in Embodiment 3 of this application.
  • FIG. 7 is a schematic diagram of the video conference screen layout of the conference terminal provided in the third embodiment of the application.
  • FIG. 8 is another schematic diagram of the video conference screen layout of the conference terminal provided in the third embodiment of the application.
  • FIG. 9 is a schematic diagram of a hardware structure of a conference terminal provided in Embodiment 4 of this application.
  • FIG. 10 is a schematic diagram of a hardware structure of the server provided in the fourth embodiment of the application.
  • FIG. 11 is a schematic diagram of a video conference system provided in Embodiment 4 of this application.
  • This embodiment provides a video conference method. Please refer to an interactive flowchart of the video conference method shown in FIG. 1.
  • S102 The server receives the video code stream sent by the video source in the video conference.
  • the server may be an MCU server.
  • The MCU server is essentially a multimedia information switch: it performs multipoint calling and connection, realizes functions such as video broadcasting, video selection, audio mixing, and data broadcasting, and completes the concatenation and switching of the signals of multiple terminals.
  • The server notifies the corresponding conference terminals to allow these conference terminals to join the video conference.
  • A media collection device on the side of each conference terminal, such as a camera, can collect image information on that conference terminal's side to form a video.
  • After the conference terminal encodes the video collected by the media collection device, the video code stream of that conference terminal is formed and then sent to the server.
  • In some examples, the media collection device on the conference terminal side may include, in addition to a camera, a microphone or the like, and the microphone is configured to collect the audio information of the users participating in the conference.
  • In this case, the video collected by the conference terminal includes both image information and audio information.
  • The server therefore receives the video code streams sent by multiple conference terminals in the video conference.
  • These conference terminals that provide video code streams are the video sources.
  • Of course, some participants may turn off their cameras, that is, they do not provide the image information of their conference terminal during the conference; in this case, that conference terminal is not considered a video source.
  • S104 The server performs decoding, picture synthesis, and re-encoding on the video code streams of some of the video sources to obtain a composite code stream and sends it to each conference terminal, and forwards the video code streams of the video sources other than those video sources to each conference terminal as independent code streams.
  • After the server receives the video code streams sent by the video sources, it can perform decoding, picture synthesis, and re-encoding on only part of the video code streams; for the video code streams of the video sources other than that part, the server performs no decoding, picture synthesis, or other processing. For example, assuming that a video conference contains four video sources, the server can decode, synthesize, and re-encode the video code streams of only three of the video sources to form a composite code stream, while the remaining video code stream continues to stand on its own. Of course, in other examples, the server may decode, synthesize, and re-encode the video code streams of only two of the video sources, so that the remaining two video code streams continue to stand on their own.
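As a rough illustration of the split just described, the following Python sketch models the server's handling of received streams. It is not the application's implementation: the helper names (decode, synthesize_pictures, encode) and the four-source setup are assumptions introduced only to show the data flow.

```python
# Minimal sketch: some sources are decoded, picture-synthesized and
# re-encoded into one composite stream; the rest are forwarded untouched
# as independent streams. The helpers below are placeholders, not a real codec.

def decode(stream: bytes) -> str:
    return f"picture({stream.decode()})"

def synthesize_pictures(pictures: list) -> str:
    return "+".join(pictures)

def encode(picture: str) -> bytes:
    return picture.encode()

def handle_streams(streams: dict, synthesized_sources: set):
    """streams maps source id -> received video code stream (bytes)."""
    decoded = [decode(s) for src, s in streams.items() if src in synthesized_sources]
    composite_stream = encode(synthesize_pictures(decoded))
    independent_streams = {src: s for src, s in streams.items()
                           if src not in synthesized_sources}
    # Both the composite stream and every independent stream are then
    # sent to every conference terminal.
    return composite_stream, independent_streams

received = {"a1": b"A", "b1": b"B", "c1": b"C", "d1": b"D"}
composite, independent = handle_streams(received, {"a1", "c1", "d1"})
print(composite, independent)  # b'picture(A)+picture(C)+picture(D)' {'b1': b'B'}
```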
  • A video code stream that has not been synthesized by the server contains only the video picture of a single conference terminal side, whereas the composite code stream, having undergone picture synthesis, contains the video pictures of at least two conference terminal sides.
  • A video code stream that has not been decoded, picture-synthesized, and re-encoded by the server can therefore be called an "independent code stream".
  • After processing the received video code streams to obtain the composite code stream, the server can send the composite code stream to each conference terminal so that each conference terminal displays the video picture corresponding to the composite code stream.
  • For each independent code stream, the server also needs to send it to each conference terminal so that each conference terminal can display the video picture corresponding to that independent code stream.
  • Based on the composite code stream and the independent code streams, the conference terminal can display the image information of multiple video sources. It is worth noting that there is no strict timing relationship between the server's action of sending the composite code stream and its action of sending the independent code streams: in some scenarios the server sends the composite code stream first and then the independent code streams, in other scenarios the independent code streams are transmitted before the composite code stream, and in some examples the composite code stream and the independent code streams are even transmitted to the conference terminal side at the same time.
  • In order to reduce the picture delay on the conference terminal side and allow the conference terminal to process the video code streams in time and display the pictures as soon as possible, once the server receives a video code stream that is to be forwarded as an independent code stream, it can transmit it immediately without waiting for the other video code streams.
  • For example, while the server is decoding, synthesizing, and re-encoding the received video code streams a1, c1, and d1, it also receives the video code stream b1; because the server does not need to perform additional processing on b1, it can directly send the video code stream b1 to each conference terminal side as an independent code stream, and then transmit the composite code stream to the conference terminals after it has been generated. For another example, five parties a2, b2, c2, d2, and e2 conduct a video conference, and according to the settings the server decodes, synthesizes, and re-encodes the video code streams of the three parties b2, c2, and d2, while the video code streams of the other parties are treated as independent code streams. If the first video code stream received by the server is that of a2, the server can directly send that video code stream to each conference terminal side as an independent code stream. Subsequently, the server receives the video code streams of b2, c2, e2, and d2 in turn. After receiving the video code streams of b2 and c2, the server can decode these two video code streams first; after receiving the video code stream of e2, it forwards that video code stream to the conference terminals; and after receiving the video code stream of d2, the server decodes it, performs picture synthesis with the decoding results of the b2 and c2 video code streams, re-encodes the synthesized picture to obtain the composite code stream, and sends the composite code stream to the conference terminal side.
  • the conference terminal respectively decodes the composite code stream and at least one independent code stream.
  • After receiving the video code streams sent by the server, the conference terminal can decode the received video code streams. It should be understood that a conference terminal in the related art only needs to decode one video code stream, namely the composite code stream containing the video pictures of all video sources, so its decoding method simply corresponds to the encoding method on the server side and is fixed and unique. In this embodiment, the conference terminal needs to decode at least two video code streams (one composite code stream and at least one independent code stream), and the independent code streams and the composite code stream are encoded by different entities: the composite code stream is encoded by the server, while each independent code stream is encoded by the corresponding conference terminal.
  • If the conference terminals and the server use the same encoding method, the conference terminal can use the same decoding method when decoding the independent code streams and the composite code stream received from the server; that is, the conference terminal does not need to distinguish between different video code streams and adopt different decoding methods.
  • However, in some examples the encoding methods of the conference terminals and the server may differ, and even the encoding methods adopted by different conference terminals may not be exactly the same.
  • In this case, when the conference terminal decodes the video code streams it receives, it needs to adopt different decoding methods for different video code streams.
  • In some examples, the server and the multiple conference terminals can agree in advance on the encoding and decoding method used for each video code stream.
  • For example, the server and the conference terminals agree to use decoding method one for decoding the composite code stream and decoding method two for decoding an independent code stream.
  • When the server sends the composite code stream to a conference terminal, it only needs to carry the corresponding identification information in the video code stream.
  • After the conference terminal receives the video code stream, it can determine from the identification information carried in it that, for example, the video code stream needs to be decoded using the first decoding method.
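To make this identification-based lookup concrete, here is a minimal sketch of a pre-agreed table from stream identifier to decoding method. The identifiers and method names are illustrative assumptions, not values mandated by the application.

```python
# Hypothetical pre-agreed mapping from the identification information carried
# in a received stream to the decoding method the terminal should apply.
DECODE_METHOD_BY_ID = {
    "1": "decoding method one",   # e.g. agreed for the composite code stream
    "2": "decoding method two",   # e.g. agreed for an independent code stream
}

def pick_decoding_method(stream_id: str) -> str:
    if stream_id not in DECODE_METHOD_BY_ID:
        raise ValueError(f"no decoding method agreed for stream id {stream_id!r}")
    return DECODE_METHOD_BY_ID[stream_id]

print(pick_decoding_method("1"))  # decoding method one
```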
  • the conference terminal displays the video picture corresponding to the composite code stream and the at least one independent code stream.
  • After the conference terminal completes the decoding of a video code stream it has received, it can display the corresponding video picture on the screen.
  • In some examples, the conference terminal can display the video picture corresponding to a video code stream immediately after that stream is decoded; it does not need to wait until both the independent code streams and the composite code stream have all been decoded before displaying.
  • In the video conference method provided in this embodiment, the server performs decoding, picture synthesis, and re-encoding only on the video code streams of some of the video sources and does not need to perform this processing on the video code streams of all video sources.
  • The video code streams that are not processed can be sent directly to each conference terminal, which decodes and displays them itself; this makes full use of the processing resources on the conference terminal side, reduces the processing burden of the server, reduces the delay of the video picture, and improves the quality of the video conference.
  • As described above, the server in the embodiments of the present application may perform decoding, picture synthesis, and re-encoding on the video code streams of only some of the video sources to form a composite code stream, and then send the composite code stream to each conference terminal.
  • For the video code streams of the other video sources, the server sends them directly to each conference terminal.
  • Which video sources are processed in this way, and how the streams are decoded and displayed, can be determined in several ways: if the server and the conference terminals belong to the same manufacturer, the designer can build these settings into the conference terminals and the server before the equipment leaves the factory.
  • Alternatively, programmers can write these settings into an upgrade package and push it to the server side and the conference terminal side respectively by way of an equipment upgrade.
  • The above settings can also be determined by the server itself according to the situation of the current video conference.
  • the following describes this video conference method with reference to the flowchart shown in FIG. 2.
  • S202 The conference terminal sends the video codec capability parameter of the conference terminal to the server.
  • After the conference terminal enters the video conference, it can send its own video codec capability parameters to the server. It should be understood that the conference terminal can either actively report the video codec capability parameters to the server or send them to the server after receiving the server's request.
  • The video codec capability parameters may represent the video encoding and decoding capability of the conference terminal.
  • In some examples, the video codec capability parameters include encoding parameters and decoding parameters.
  • The encoding parameters include the encoding capability of the conference terminal.
  • The decoding parameters include at least one of the decoding capability of the conference terminal, the conference rate, the frame rate, and format information.
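The application only loosely enumerates these fields, so the record below is an assumed shape used purely to visualize what a terminal might report in this step; the field names and sample values are not taken from the source.

```python
from dataclasses import dataclass

@dataclass
class CodecCapability:
    """Assumed layout of a conference terminal's capability report."""
    encoding_capability: str    # what the terminal can encode (illustrative)
    decoding_capability: str    # what the terminal can decode
    conference_rate_kbps: int   # conference (bit) rate
    frame_rate_fps: int
    video_format: str           # e.g. a resolution label

report = CodecCapability(
    encoding_capability="1 stream up to 1080p30",
    decoding_capability="2 streams up to 1080p30",
    conference_rate_kbps=2048,
    frame_rate_fps=30,
    video_format="1080p",
)
print(report)
```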
  • the server determines the decoding and display strategy of the video conference according to the video codec capability parameter of each conference terminal and the video codec capability of the server.
  • After the server obtains the video codec capability parameters of each conference terminal, it can determine the decoding and display strategy of the video conference based on the codec capabilities of these conference terminals and the video codec capability of the server itself. Based on the number of conference terminals in this video conference, the way each conference terminal encodes its local video, and so on, the server can determine the processing requirement for processing the video code streams of all video sources in this video conference into a composite code stream. Exemplarily, the server can determine whether its own video codec capability meets this processing requirement. If the server determines that its own video codec capability meets the processing requirement, this indicates that the server can perform decoding, picture synthesis, and re-encoding on the video code streams of all video sources, that is, the server can process the video code streams of all video sources into a composite code stream.
  • However, if the server determines that its video codec capability is lower than the processing requirement for processing the video code streams of all video sources in the video conference into a composite code stream, that is, the server's processing capacity is not enough to decode, picture-synthesize, and re-encode the video code streams of all video sources, the server determines that in the subsequent course of the video conference it will process only part of the video code streams into the composite code stream.
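A compact way to express this decision is to compare an estimated processing load with the server's capacity, as in the sketch below. The load model (one decode unit per source plus one re-encode of the synthesized picture) is an assumption made only for illustration.

```python
def must_offload(num_sources: int, server_capacity_units: int) -> bool:
    """Return True when the server cannot synthesize all sources itself.

    Assumed load model: one decode unit per video source plus one unit
    for re-encoding the synthesized picture.
    """
    required_units = num_sources + 1
    return required_units > server_capacity_units

# With six sources and capacity for five units, some streams must be
# forwarded to the terminals as independent code streams.
print(must_offload(6, 5))  # True
print(must_offload(3, 5))  # False
```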
  • What the server needs to ensure is that, for the video code streams it sends to each conference terminal, including the composite code stream and the at least one independent code stream, the encoding methods used are supported by that conference terminal; otherwise, if a conference terminal cannot decode a video code stream it receives, the conference terminal will be unable to display at least one video picture on its side.
  • The decoding and display strategy determined by the server includes a decoding instruction, and the decoding instruction is used to indicate to the conference terminal how to decode the multiple video code streams it will receive, that is, the composite code stream and the at least one independent code stream.
  • For example, the server instructs the conference terminal to decode the video code stream carrying the identification information "1" using decoding method one, the video code stream carrying the identification information "2" using decoding method two, and the video code stream carrying the identification information "3" using decoding method three.
  • In some examples, the decoding and display strategy also includes a display indication, which is used to inform the conference terminal of the mapping between each of the code streams sent by the server, that is, the composite code stream and each of the at least one independent code stream, and a display area.
  • In some examples, the display indication is not necessary for the decoding and display strategy, because in these examples the conference terminal can set a corresponding number of display areas on its display screen according to the number of video code streams it will receive. For example, suppose that after negotiating with the server the conference terminal determines that the number of video code streams it will receive in this conference is k; then the conference terminal side can set k display areas, and whenever a video code stream is received and decoded, the conference terminal randomly selects, from the display areas not yet filled with a picture, one area in which to display the video picture of that video code stream.
  • However, because the video picture corresponding to the composite code stream contains the video pictures of at least two conference terminal sides at the same time, if the composite code stream is not guaranteed to be displayed in a large display area, the people and other details in the picture will be very small and the user will struggle to see them.
  • For example, as shown in FIG. 3, there are three video sources a3, b3, and c3 in a video conference.
  • The server processes the video code streams of a3 and b3 to form a composite code stream, and c3 continues to be an independent code stream.
  • In this case, only two display areas need to be set on each conference terminal side.
  • If the composite code stream occupies one display area and the independent code stream occupies the other, the video pictures corresponding to a3 and b3 have to share one display area, which makes the video pictures of the a3 and b3 sides only half the size of the video picture of the c3 side; this not only makes it difficult for users to see the pictures of the a3 and b3 sides, but also does not conform to users' video conference viewing habits.
  • Therefore, in some examples the decoding and display strategy includes the mapping relationship between the multiple video code streams and multiple display areas.
  • Through this mapping, the server can control how the video pictures of the multiple video sources are laid out on the conference terminal side, for example ensuring that each video source is displayed in a display area of the same size, or ensuring that the video pictures of the multiple video sources can be spliced together and displayed in one contiguous region. As shown in FIG. 4, six conference terminals a4, b4, c4, d4, e4, and f4 participate in the conference, and each conference terminal has its camera turned on, so there are six video sources in total.
  • The server processes the video code streams of a4, b4, c4, and d4 into a composite code stream, and treats the video code streams of the other two video sources as independent code streams.
  • The server notifies each conference terminal of the display area corresponding to each video code stream.
  • The first area 401 is used to display the video picture corresponding to the composite code stream, whose internal arrangement of the a4, b4, c4, and d4 pictures is decided by the server; the second area 402 is used to display the video picture corresponding to e4, and the third area 403 is used to display the video picture corresponding to f4.
  • In this way, the pictures of the six video sources are displayed in one contiguous region and are not scattered across the screen.
  • In addition, the video picture dimensions of the multiple video sources are consistent, which conforms to users' viewing habits.
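The display indication for the FIG. 4 example can be pictured as a table from stream to screen region, as in the sketch below. Only the idea that the composite stream occupies the large area 401 while e4 and f4 occupy areas 402 and 403 comes from the description; the canvas size and rectangle coordinates are invented for illustration.

```python
# Hypothetical display indication for the FIG. 4 layout: each entry maps a
# stream to a rectangle (x, y, width, height) on an assumed 1920x1080 canvas.
DISPLAY_MAP = {
    "composite": (0,    0,   1280, 1080),  # area 401: synthesized a4+b4+c4+d4 picture
    "e4":        (1280, 0,    640,  540),  # area 402
    "f4":        (1280, 540,  640,  540),  # area 403
}

def region_for(stream_id: str):
    """Look up the display rectangle assigned to a stream, if any."""
    return DISPLAY_MAP.get(stream_id)

print(region_for("e4"))  # (1280, 0, 640, 540)
```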
  • S206 The server sends the decoding and display strategy to each conference terminal.
  • After the server determines the decoding and display strategy, it can send the strategy to each conference terminal so that each conference terminal knows how this video conference will be carried out.
  • The server decodes and picture-synthesizes the video code streams of m of the n video sources in the video conference, and re-encodes the synthesized picture according to the encoding method corresponding to the decoding method in the decoding and display strategy to form a composite code stream.
  • In some examples, the server decides during negotiation with the conference terminals that it will process the video code streams of m video sources to form the composite code stream, and that the video code stream of each remaining video source will be sent to each conference terminal as an independent code stream, where m is less than n.
  • After the server receives the video code streams used to form the composite code stream, it can perform decoding, picture synthesis, and re-encoding on these video code streams to form the composite code stream.
  • In some examples, the server and the multiple conference terminals determine during the negotiation phase (that is, the phase in which the decoding and display strategy is determined) which video sources' code streams will form the composite code stream; when generating the composite code stream, the server must wait until the video code streams of those video sources have all been received.
  • In other examples, the decoding and display strategy does not specify which video sources' code streams constitute the composite code stream, so the server can decide on the fly, according to the actual situation of the video conference, which video sources' code streams to synthesize together.
  • For example, the server may select the first m video code streams to form the composite code stream according to the order in which it receives video code streams from the multiple video sources. The flowchart shown in FIG. 5 illustrates how the server decodes, picture-synthesizes, and re-encodes the video code streams of some video sources to obtain the composite code stream.
  • S502 Obtain video code streams of the first m video sources according to the sequence of receiving video code streams from multiple video sources.
  • S504 Decode video code streams of m video sources.
  • S506 Perform picture synthesis on the decoding results corresponding to the m video code streams to obtain a synthesized picture.
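Steps S502 to S506, together with the re-encoding step described above, can be sketched as follows. The helper functions stand in for a real codec and, like the arrival-ordered list of streams, are assumptions used only to show the order of operations.

```python
# Sketch of S502-S506 plus the final re-encoding: take the first m streams
# in arrival order, decode them, synthesize one picture, re-encode it.

def decode(stream: bytes) -> str:
    return stream.decode()

def synthesize(pictures: list) -> str:
    return "|".join(pictures)

def encode(picture: str) -> bytes:
    return picture.encode()

def build_composite(arrived_streams: list, m: int) -> bytes:
    first_m = arrived_streams[:m]            # S502: first m streams by arrival order
    decoded = [decode(s) for s in first_m]   # S504: decode the m streams
    synthesized = synthesize(decoded)        # S506: synthesize one picture
    return encode(synthesized)               # re-encode into the composite code stream

print(build_composite([b"b2", b"c2", b"d2", b"e2"], m=3))  # b'b2|c2|d2'
```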
  • S210 The server sends the composite code stream, and the remaining n-m video code streams received from the video sources as independent code streams, to each conference terminal.
  • the server can send the composite code stream generated by its own processing to each conference terminal.
  • the server can also send the video code stream that it receives as an independent code stream to each conference terminal.
  • the sending sequence has been described in more detail in the foregoing embodiment, and will not be repeated here.
  • S212 The conference terminal decodes and displays the video code stream according to the decoding and display strategy.
  • After receiving the video code streams sent by the server, the conference terminal can decode the composite code stream and the at least one independent code stream according to the decoding methods indicated by the decoding instruction in the decoding and display strategy. Subsequently, the conference terminal fills the video pictures of the composite code stream and the at least one independent code stream into the corresponding display areas for display according to the display indication in the decoding and display strategy.
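The terminal-side behaviour in S212 can be sketched as below: each received stream is decoded with the method named in the decoding instruction and its picture is placed in the area named in the display indication. The decoder functions and the strategy layout are illustrative assumptions, not the application's data format.

```python
# Illustrative decoders keyed by the method name in the decoding instruction.
DECODERS = {
    "method one": lambda data: f"decoded1({data})",
    "method two": lambda data: f"decoded2({data})",
}

def render(strategy: dict, received: dict) -> dict:
    """Decode every received stream and map its picture to a display area."""
    screen = {}
    for stream_id, data in received.items():
        method = strategy["decode"][stream_id]   # decoding instruction
        area = strategy["display"][stream_id]    # display indication
        screen[area] = DECODERS[method](data)
    return screen

strategy = {
    "decode": {"composite": "method one", "e4": "method two"},
    "display": {"composite": "area 401", "e4": "area 402"},
}
print(render(strategy, {"composite": "C", "e4": "E"}))
# {'area 401': 'decoded1(C)', 'area 402': 'decoded2(E)'}
```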
  • In the video conference method provided in this embodiment, the server can obtain the video codec capability parameters of each conference terminal and then, based on the server's own video codec capability and the video codec capabilities of the multiple conference terminals, determine whether it can process the video code streams of all video sources in the video conference to form a composite code stream.
  • If the server determines that it can process the video code streams of all video sources in the video conference to form a composite code stream, then in the subsequent course of the conference it can still process the video code streams of the multiple video sources according to the video conference solution provided in the related art; but if the server determines that its own video codec capability is lower than what is required to process the video code streams of all video sources in the video conference into a composite code stream, it processes only the video code streams of some of the video sources, and at the same time makes full use of the decoding resources on the side of each conference terminal, thereby reducing the delay of the video picture on the side of each conference terminal and enhancing the conference experience of the users participating in the video conference.
  • the conference initiator creates a conference on the MCU management platform.
  • the conference initiator is actually a conference terminal in the video conference, and the conference terminal is usually held by the conference host.
  • the conference initiator creates a conference on the MCU management platform, which is equivalent to opening a network "meeting room" on the management platform.
  • S604 The MCU notifies the conference terminal that needs to participate in the conference to enter the video conference.
  • the conference terminal reports its own video coding and decoding capability parameters to the MCU.
  • S608 The MCU determines its own optimal decoding capability and the optimal decoding capability of the conference terminal, and notifies the conference terminal.
  • the MCU will select the screen layout by default.
  • the MCU will tell the conference terminal the screen layout.
  • In some examples, the control message sent by the MCU to the conference terminal includes the number of pictures in the multi-picture layout, the multi-picture layout mode, and the content of each sub-picture (such as the main video source, the auxiliary video source, the first channel of main video decoding, the second channel of main video decoding, the first channel of auxiliary video decoding, and so on).
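The fields listed in this control message could be carried in a structure like the one below; the field names and concrete values are assumptions used only to make the list tangible, not a format defined by the application.

```python
# Hypothetical shape of the MCU's multi-picture layout control message.
layout_control_message = {
    "sub_picture_count": 6,
    "layout_mode": "2x3 grid",
    "sub_pictures": [
        {"slot": 1, "content": "main video source"},
        {"slot": 2, "content": "auxiliary video source"},
        {"slot": 3, "content": "main video decoding, channel 1"},
        {"slot": 4, "content": "main video decoding, channel 2"},
        {"slot": 5, "content": "auxiliary video decoding, channel 1"},
        {"slot": 6, "content": "reserved"},
    ],
}
print(layout_control_message["layout_mode"])  # 2x3 grid
```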
  • the conference terminal encodes the local video according to the negotiated encoding method and sends it to the MCU.
  • S612 The MCU processes the received video code stream and sends it to each conference terminal.
  • After receiving the video code streams, the MCU decodes them, performs picture synthesis, re-encodes the result, and then sends it to the conference terminals. As shown in FIG. 7, when there are only three far-end (Far) video code streams and one local (Near) video code stream, the decoding capability of the MCU is entirely sufficient; therefore, the MCU can be responsible for decoding all the video code streams, performing picture synthesis, and re-encoding to form a composite code stream, which is then sent to each conference terminal. In this case the video code stream sent by the server consists only of a composite code stream, without any independent code stream.
  • In other cases, the MCU selects, according to the negotiated decoding capabilities and by default in the order in which it receives the streams, some of the video code streams to form the composite code stream: it decodes them, performs picture synthesis after the decoding is completed, and then encodes the synthesized picture data to form a composite code stream, which it sends to each conference terminal. For the remaining video code streams, the MCU simply attaches the identification tags and sends them to each conference terminal. As shown in FIG. 8, a video conference with a multi-picture layout includes six video sources; after negotiation, the MCU processes the video code streams of four of the video sources, and each conference terminal processes the video code streams of the remaining two video sources.
  • The video code streams of Far1, Far2, Far3, and Far4 are decoded by the MCU, synthesized, and then encoded to form a composite code stream that is sent to the conference terminals; the two code streams of Far5 and Near are received by the MCU, labeled with the corresponding identification tags, and then sent to each conference terminal.
  • S614 The conference terminal decodes and displays the received video code stream.
  • On the one hand, the conference terminal decodes the composite code stream sent by the MCU and displays it in the corresponding area; on the other hand, the conference terminal decodes each independent code stream after receiving it, and then displays it in the corresponding area according to the corresponding identification tag.
  • the conference terminal fills the video pictures corresponding to the composite code stream into the designated area, and fills the video pictures corresponding to other independent code streams into the designated area in the multi-picture layout according to the corresponding tags.
  • the storage medium can store at least one computer program that can be read, compiled, and executed by at least one processor.
  • In this embodiment, the storage medium can store at least one of a first video conference program and a second video conference program, where the first video conference program can be used by at least one processor to execute the conference-terminal-side process that implements any one of the video conference methods introduced in the foregoing embodiments.
  • the second video conference program can be used by at least one processor to execute the server-side process for implementing any of the video conference methods introduced in the foregoing embodiments.
  • The conference terminal 90 includes a first processor 91, a first memory 92, and a first communication bus 93 configured to connect the first processor 91 and the first memory 92.
  • The first memory 92 may be the aforementioned storage medium storing the first video conference program, and the first processor 91 can read, compile, and execute the first video conference program to carry out the conference-terminal-side process of the video conference method introduced in the foregoing embodiments.
  • The first processor 91 receives the composite code stream and at least one independent code stream sent by the server, where the composite code stream is formed by the server performing decoding, picture synthesis, and re-encoding on the video code streams of some of the video sources in the video conference, and each independent code stream is a video code stream that the server received from a video source other than those video sources and forwarded to the conference terminal.
  • The first processor 91 decodes the composite code stream and the at least one independent code stream respectively, and displays the video pictures corresponding to the composite code stream and the at least one independent code stream.
  • In some examples, before receiving the composite code stream and the at least one independent code stream sent by the server, the first processor 91 first sends the video codec capability parameters of the conference terminal to the server and receives the decoding and display strategy sent by the server.
  • The decoding and display strategy is determined by the server according to the server's own video codec capability and the video codec capabilities of each conference terminal in the video conference.
  • The decoding and display strategy is used to indicate to the conference terminal the decoding and display mode for the composite code stream and the at least one independent code stream.
  • the video encoding and decoding capability parameters sent to the server include encoding parameters and decoding parameters.
  • The encoding parameters include the encoding capability of the conference terminal; the decoding parameters include at least one of the decoding capability of the conference terminal, the conference rate, the frame rate, and format information.
  • In some examples, the decoding and display strategy includes a decoding instruction and a display indication, where the decoding instruction is used to indicate how the conference terminal decodes the composite code stream and the at least one independent code stream, and the display indication is used to indicate the mapping between each of the composite code stream and the at least one independent code stream sent by the server and a display area.
  • the first processor 91 may decode the composite code stream and the at least one independent code stream according to the decoding mode indicated by the decoding instruction.
  • The first processor 91 fills the video pictures of the composite code stream and the at least one independent code stream into the corresponding display areas for display according to the display indication.
  • the server 100 includes a second processor 101, a second memory 102, and a second communication bus 103 configured to connect the second processor 101 and the second memory 102
  • The second memory 102 may be the aforementioned storage medium storing the second video conference program, and the second processor 101 can read, compile, and execute the second video conference program to carry out the server-side process of the video conference method introduced in the foregoing embodiments.
  • The second processor 101 receives the video code streams sent by the video sources in the video conference, then performs decoding, picture synthesis, and re-encoding on the video code streams of some of the video sources to obtain a composite code stream and sends it to each conference terminal, and forwards the video code streams of the video sources other than those video sources to each conference terminal as independent code streams.
  • In some examples, the second processor 101 may also first obtain the video codec capability parameters of each conference terminal in the video conference, and then determine the decoding and display strategy of the video conference according to the video codec capability parameters of each conference terminal and the video codec capability of the server, where the decoding and display strategy is used to indicate the decoding and display mode of each conference terminal for the composite code stream and the at least one independent code stream.
  • Then the second processor 101 sends the decoding and display strategy to each conference terminal.
  • In some examples, before the second processor 101 performs decoding, picture synthesis, and re-encoding on the video code streams of some video sources, it also first determines, according to the video codec capability parameters of each conference terminal and the video codec capability of the server, that the codec capability of this server is lower than the processing requirement for processing the video code streams of all video sources in the video conference into a composite code stream.
  • the composite code stream is formed by video code streams of m video sources, and m is greater than or equal to 2.
  • When the second processor 101 performs decoding, picture synthesis, and re-encoding on the video code streams of some video sources to obtain the composite code stream, it obtains the video code streams of the first m video sources in the order in which the video code streams are received from the multiple video sources, then decodes the video code streams of the m video sources, and then performs picture synthesis on the decoding results corresponding to the m video code streams to obtain a synthesized picture.
  • the second processor 101 encodes the composite picture to obtain a composite code stream in accordance with the encoding method corresponding to the decoding method in the decoding display strategy.
  • the video conference system 11 includes a server 100 and a plurality of conference terminals 90.
  • the server 100 may be an MCU server, and the conference terminal may be implemented in various forms.
  • For example, the conference terminal may include mobile terminals such as mobile phones, tablet computers, notebook computers, personal digital assistants (PDAs), navigation devices, wearable devices, smart bracelets, and pedometers, as well as fixed terminals such as digital TVs and desktop computers.
  • According to the video conference system provided by the embodiments of the present application, during the video conference, after the server receives the video code streams sent by multiple video sources in the video conference, it performs decoding, picture synthesis, and re-encoding only on the video code streams of some of the video sources, forming a composite code stream that it sends to the conference terminals.
  • The server sends the video code streams of the video sources other than those video sources to each conference terminal as independent code streams, so that the conference terminals can decode and display the composite code stream and the at least one independent code stream.
  • In this solution the server does not need to perform decoding, picture synthesis, and re-encoding on the video code streams of all video sources; therefore, the requirement on server-side codec capability is reduced.
  • A computer-readable medium may include a computer storage medium (or non-transitory medium) and a communication medium (or transitory medium).
  • The term computer storage medium includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data).
  • Computer storage media include random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer.
  • In addition, communication media usually contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media. Therefore, this application is not limited to any specific combination of hardware and software.

Abstract

Provided in the embodiments of the present application are a video meeting method, a meeting terminal, a server, and a storage medium. In the process of a video meeting, a server receives the video code streams sent by a plurality of video sources in the video meeting, performs decoding, picture synthesis, and re-encoding on the code streams of only some of the video sources to form a merged code stream, and sends it to the meeting terminals; the server also sends the video code streams of the video sources other than said video sources to each meeting terminal as independent code streams; and the meeting terminal decodes and displays the merged code stream and the at least one independent code stream.

Description

Video conference method, conference terminal, server and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 14, 2019 with application number 201911115565.3, the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the technical field of video conferencing, such as a video conference method, a conference terminal, a server, and a storage medium.
Background
In a video conference in a cloud conference system, multiple conference terminals first encode their local video data and send it to the MCU (Multipoint Control Unit) server. After receiving the video code streams from the multiple conference terminals, the MCU server first decodes these video code streams, then performs multi-picture synthesis, and finally encodes the video data corresponding to the multi-picture layout and sends it to every conference terminal participating in the video conference. Each conference terminal receives the video code stream from the MCU server, decodes the data, and then displays the picture.
If many conference terminals participate in a single video conference, the MCU server needs to process the video code streams of multiple video sources at the same time during the conference: it decodes the video code streams received from the multiple video sources, synthesizes the video pictures, and then encodes the synthesized picture. This process requires the MCU server to have high processing capability; once the MCU server's performance is insufficient, the video picture on the conference terminal side is delayed, which affects the conference experience of the users participating in the conference.
Summary of the invention
The video conference method, conference terminal, server, and storage medium provided by the embodiments of the present application avoid the situation in which insufficient MCU server processing performance causes severe video picture delay on the conference terminal side and a poor conference experience for the participating users.
An embodiment of the present application provides a video conference method, applied to a conference terminal, including:
receiving a composite code stream and at least one independent code stream sent by a server, where the composite code stream is formed by the server performing decoding, picture synthesis, and re-encoding on the video code streams of some of the video sources in the video conference, and each independent code stream is a video code stream that the server received from a video source other than those video sources and forwarded to the conference terminal;
decoding the composite code stream and the at least one independent code stream respectively; and
displaying the video pictures corresponding to the composite code stream and the at least one independent code stream.
An embodiment of the present application further provides a video conference method, including:
receiving the video code streams sent by the video sources in a video conference; and
performing decoding, picture synthesis, and re-encoding on the video code streams of some of the video sources to obtain a composite code stream, sending the composite code stream to each conference terminal in the video conference, and forwarding the video code streams of the video sources other than those video sources to each conference terminal in the video conference as independent code streams.
An embodiment of the present application further provides a conference terminal. The conference terminal includes a first processor, a first memory, and a first communication bus;
the first communication bus is configured to implement connection and communication between the first processor and the first memory; and
the first processor is configured to execute at least one program stored in the first memory, so as to implement the first video conference method described above.
An embodiment of the present application further provides a server. The server includes a second processor, a second memory, and a second communication bus;
the second communication bus is configured to implement connection and communication between the second processor and the second memory; and
the second processor is configured to execute at least one program stored in the second memory, so as to implement the second video conference method described above.
An embodiment of the present application further provides a storage medium. The storage medium stores at least one of a first video conference program and a second video conference program. The first video conference program can be executed by at least one processor to implement the first video conference method described above, and the second video conference program can be executed by at least one processor to implement the second video conference method described above.
According to the video conference method, conference terminal, server, and storage medium provided by the embodiments of the present application, during a video conference, after the server receives the video code streams sent by multiple video sources in the video conference, it decodes, composites, and re-encodes the video code streams of only a part of the video sources to form a composite code stream and sends it to each conference terminal in the video conference; at the same time, the server also sends the video code streams of the video sources other than the part of the video sources to each conference terminal in the video conference as independent code streams, and the conference terminal decodes and displays the composite code stream and the at least one independent code stream. In this conference solution, the server does not need to decode, composite, and re-encode the video code streams of all the video sources, which reduces the requirement on the encoding and decoding capability of the server side. The video code streams beyond the processing capability of the server are sent directly to the conference terminals, which makes full use of the processing resources on the conference terminal side, reduces the delay of the video pictures on the conference terminal side, improves the smoothness of the video conference, and enhances the user experience.
Brief Description of the Drawings
FIG. 1 is an interaction flowchart of the video conference method provided in Embodiment 1 of this application;
FIG. 2 is an interaction flowchart of the video conference method provided in Embodiment 2 of this application;
FIG. 3 is a schematic diagram of a conference terminal provided in Embodiment 2 of this application displaying video pictures of multiple video sources;
FIG. 4 is another schematic diagram of a conference terminal provided in Embodiment 2 of this application displaying video pictures of multiple video sources;
FIG. 5 is a flowchart of the server described in Embodiment 2 of this application processing the video code streams of a part of the video sources to form a composite code stream;
FIG. 6 is an interaction flowchart of the video conference method provided in Embodiment 3 of this application;
FIG. 7 is a schematic diagram of a video conference picture layout of a conference terminal provided in Embodiment 3 of this application;
FIG. 8 is another schematic diagram of a video conference picture layout of a conference terminal provided in Embodiment 3 of this application;
FIG. 9 is a schematic diagram of a hardware structure of the conference terminal provided in Embodiment 4 of this application;
FIG. 10 is a schematic diagram of a hardware structure of the server provided in Embodiment 4 of this application;
FIG. 11 is a schematic diagram of the video conference system provided in Embodiment 4 of this application.
Detailed Description of the Embodiments
Embodiment 1:
In the related art, insufficient processing performance of the MCU server makes the MCU server inefficient when decoding, compositing, and re-encoding the video code streams from multiple video sources, which in turn causes severe delay of the video pictures displayed on the multiple conference terminals in a video conference and degrades the conference experience of the participating users. To avoid this situation, this embodiment provides a video conference method; refer to the interaction flowchart of the video conference method shown in FIG. 1.
S102: A server receives video code streams sent by video sources in a video conference.
In this embodiment, the server may be an MCU server. The MCU server is essentially a multimedia information switch that performs multipoint calls and connections, implements functions such as video broadcasting, video selection, audio mixing, and data broadcasting, and completes the convergence and switching of signals from multiple terminals.
It can be understood that after the initiator of the video conference initiates the video conference to the server through its own conference terminal, the server notifies the corresponding conference terminals so that these conference terminals join the video conference. During the video conference, a media capture device on the side of each conference terminal, for example a camera, can capture image information on that conference terminal side to form a video. After the conference terminal encodes the video captured by the media capture device, the video code stream of this conference terminal is formed, and the video code stream is then sent to the server.
It should be noted that, in general, the media capture device on the conference terminal side may include a microphone and the like in addition to the camera, and the microphone is configured to capture audio information of the participating user. The video captured on the conference terminal side therefore contains both image information and audio information.
In a video conference there are usually multiple participating users, so the server receives video code streams sent by multiple conference terminals in the video conference. From the perspective of the server, the conference terminals that provide video code streams are the video sources. In some conference scenarios, some participating users may turn off their cameras, that is, they do not provide image information of their conference terminals during the conference; in this case, such a conference terminal can be regarded as not being a video source.
S104: The server decodes, composites, and re-encodes the video code streams of a part of the video sources to obtain a composite code stream, sends the composite code stream to each conference terminal, and forwards the video code streams of the video sources other than the part of the video sources to each conference terminal as independent code streams.
After receiving the video code streams sent by the video sources, the server may decode, composite, and re-encode only a part of these video code streams; for the video code streams of the remaining video sources, the server does not perform decoding, picture composition, or other processing. For example, assuming a video conference contains four video sources, the server may decode, composite, and re-encode the video code streams of only three of them to form the composite code stream, while the remaining video code stream stays independent. Of course, in other examples, the server may decode, composite, and re-encode the video code streams of only two of the video sources, leaving the remaining two video code streams independent.
A video code stream that has not undergone picture composition at the server contains only the video picture of a single conference terminal side, whereas the composite code stream has undergone picture composition and contains the video pictures of at least two conference terminal sides. Relative to the composite code stream, a video code stream that is not decoded, composited, and re-encoded by the server is therefore referred to here as an "independent code stream".
After processing the received video code streams to obtain the composite code stream, the server can send the composite code stream to each conference terminal so that each conference terminal displays the video picture corresponding to the composite code stream. On the other hand, the server also needs to send the independent code streams to each conference terminal so that each conference terminal displays the video pictures corresponding to the independent code streams. Based on the composite code stream and the independent code streams, a conference terminal can display the image information of multiple video sources. It is worth noting that there is no strict timing relationship between the server sending the composite code stream and sending the independent code streams: in some scenarios the server sends the composite code stream first and then the independent code streams; in other scenarios the independent code streams are transmitted before the composite code stream; and in some examples the composite code stream and the independent code streams are transmitted to the conference terminal side at the same time. In fact, to reduce the picture delay on the conference terminal side and allow the conference terminal to process the video code streams in time and display the pictures as early as possible, the server can transmit a video code stream as soon as it is ready for transmission to the conference terminal side, without waiting for other video code streams.
For example, in one example, while the server is decoding, compositing, and re-encoding the received video code streams a1, c1, and d1, the server also receives the video code stream b1. Because the server does not need to perform additional processing on b1, it can directly send b1 to each conference terminal side as an independent code stream; then, after the composite code stream has been generated, the composite code stream is transmitted to the conference terminals. As another example, five parties a2, b2, c2, d2, and e2 hold a video conference, and it is configured that the server decodes, composites, and re-encodes the video code streams of the three parties b2, c2, and d2, while the video code streams of a2 and e2 are kept as independent code streams. If the first video code stream received by the server is that of a2, the server can directly send this video code stream to each conference terminal side as an independent code stream. Subsequently, the server receives the video code streams of b2, c2, e2, and d2 in turn: after receiving the video code streams of b2 and c2, the server may first decode these two streams; after receiving the video code stream of e2, it forwards this stream to the conference terminals; and after receiving the video code stream of d2, the server decodes it, composites the decoding result with the decoding results of the video code streams of b2 and c2, re-encodes the composite picture to obtain the composite code stream, and sends the composite code stream to the conference terminal side.
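By way of a non-limiting illustration only (this sketch is not part of the original disclosure), the forwarding and compositing split described above could look roughly as follows; the Server class, the decode/compose/encode placeholders, and the stream labels are assumptions introduced for this example.

```python
# Minimal sketch of the server-side split: streams selected for the composite
# picture are decoded and held until all of them have arrived, while the other
# streams are forwarded to every terminal as soon as they are received.
# decode/compose/encode are placeholders standing in for a real codec library.

def decode(bitstream):
    return {"frame_of": bitstream}            # placeholder "decoded frame"

def compose(frames):
    return {"composite_of": sorted(frames)}   # placeholder composite picture

def encode(picture):
    return f"encoded({picture})"              # placeholder re-encoded stream

class Server:
    def __init__(self, composited_sources, terminals):
        self.composited_sources = set(composited_sources)
        self.terminals = terminals
        self.decoded = {}

    def on_stream(self, source_id, bitstream):
        if source_id in self.composited_sources:
            self.decoded[source_id] = decode(bitstream)
            if set(self.decoded) == self.composited_sources:
                # All selected sources have arrived: composite, re-encode, send.
                composite = encode(compose(self.decoded))
                self.send_to_all("composite", composite)
        else:
            # Independent streams are forwarded immediately, untouched.
            self.send_to_all(source_id, bitstream)

    def send_to_all(self, label, payload):
        for terminal in self.terminals:
            terminal.append((label, payload))

# Usage matching the example above: b2/c2/d2 are composited, a2/e2 stay independent.
terminals = [[], []]
server = Server({"b2", "c2", "d2"}, terminals)
for src in ["a2", "b2", "c2", "e2", "d2"]:
    server.on_stream(src, f"stream_{src}")
```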
S106: The conference terminals decode the composite code stream and the at least one independent code stream respectively.
After receiving the video code streams sent by the server, a conference terminal can decode the received video code streams. It can be understood that a conference terminal in the related art only needs to decode one kind of video code stream, namely the composite code stream containing the video pictures of all the video sources, so its decoding method only has to correspond to the encoding method on the server side and is fixed and unique. In this embodiment, however, the conference terminal needs to decode at least two video code streams (one composite code stream and at least one independent code stream), and the independent code streams and the composite code stream are encoded by different entities: the composite code stream is encoded by the server, while each independent code stream is encoded by its corresponding conference terminal. In some examples, the conference terminals and the server use the same encoding method, so a conference terminal can use the same decoding method when decoding the independent code streams and the composite code stream received from the server; that is, the conference terminal does not need to distinguish between different video code streams and use different decoding methods.
In more cases, however, the encoding methods of the conference terminals and the server may differ, and the encoding methods used by different conference terminals may not even be exactly the same as one another. In these cases, when a conference terminal decodes the video code streams it receives, it needs to use different decoding methods for different video code streams.
In some examples of this embodiment, the server and the multiple conference terminals may agree in advance on the encoding and decoding method for each video code stream. In this way, when the server transmits a video code stream to a conference terminal, it only needs to carry a piece of identification information in that video code stream to indicate to the conference terminal which video code stream it is. For example, the server and the conference terminals agree that the composite code stream is decoded with decoding method 1 and the independent code streams are decoded with decoding method 2. Then, when the server sends the composite code stream to a conference terminal, it only needs to carry the corresponding identification information in that video code stream; after receiving the video code stream, the conference terminal can look up, according to the carried identification information, that this video code stream needs to be decoded with decoding method 1.
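As a minimal sketch of such a pre-agreed mapping (the label values and decoder names below are illustrative assumptions, not taken from the embodiments), the terminal-side lookup could be written as:

```python
# Sketch of label-based decoder selection: the server and the terminals agree
# in advance on which decoding method each label maps to, so the terminal only
# needs to read the label carried in the received stream.

DECODER_BY_LABEL = {
    "1": "decoder_method_1",   # e.g. the composite stream encoded by the server
    "2": "decoder_method_2",   # e.g. an independent stream from one terminal
    "3": "decoder_method_3",   # e.g. an independent stream from another terminal
}

def decode_received_stream(label, bitstream):
    decoder = DECODER_BY_LABEL[label]
    # A real implementation would invoke the corresponding codec here.
    return f"{decoder}({bitstream})"

print(decode_received_stream("1", "composite_bitstream"))
```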
S108: The conference terminals display the video pictures corresponding to the composite code stream and the at least one independent code stream.
After finishing decoding a received video code stream, a conference terminal can display the corresponding video picture on its screen. In some examples of this embodiment, the conference terminal can display the video picture corresponding to a video code stream immediately after that stream has been decoded, without waiting for the decoding of both the independent code streams and the composite code stream to be completed.
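A minimal sketch of this per-stream decode-and-display behaviour, with function and label names that are purely assumptions of this example, might look as follows:

```python
# Sketch only: each received stream is decoded and rendered right away,
# without waiting for the other streams of the same conference.

def decode(bitstream):
    return f"picture_of_{bitstream}"

def render(region, picture):
    print(f"region {region}: {picture}")

def on_stream_received(label, bitstream, region_by_label):
    # Decode and display this stream immediately; do not wait for other streams.
    render(region_by_label.get(label, "default"), decode(bitstream))

regions = {"composite": "area-1", "far5": "area-2", "near": "area-3"}
for label, data in [("composite", "mix_bits"), ("far5", "f5_bits"), ("near", "n_bits")]:
    on_stream_received(label, data, regions)
```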
In the video conference method provided by this embodiment, the server may decode, composite, and re-encode the video code streams of only a part of the video sources instead of performing this processing on the video code streams of all the video sources; the video code streams it does not process can be sent directly to each conference terminal for the conference terminal to decode and display. This makes full use of the processing resources on the conference terminal side, reduces the processing burden of the server, reduces the delay of the video pictures, and improves the quality of the video conference.
Embodiment 2:
As described in the foregoing embodiment, in the embodiments of the present application the server may decode, composite, and re-encode the video code streams of only a part of the video sources to form a composite code stream and then send the composite code stream to each conference terminal; the video code streams other than those of the part of the video sources are sent by the server directly to each conference terminal.
It can be understood that how many video code streams are included in the composite code stream generated by the server, and the encoding and decoding methods of the composite code stream and the independent code streams, can all be set in advance; even which video sources' video code streams are included in the composite code stream generated by the server can be set in advance. For example, in some examples the server and the conference terminals belong to the same manufacturer, and designers can fix these settings in the conference terminals and the server before the devices leave the factory; alternatively, programmers can write these settings into an upgrade program by way of a device upgrade and push them to the server side and the conference terminal side respectively.
Of course, in some examples of this embodiment, the above can also be decided by the server itself according to the situation of the current video conference. This video conference method is described below with reference to the flowchart shown in FIG. 2.
S202: A conference terminal sends the video encoding and decoding capability parameters of this conference terminal to the server.
In this embodiment, after a conference terminal enters the video conference, it can send its own video encoding and decoding capability parameters to the server. It can be understood that the conference terminal may report the video encoding and decoding capability parameters to the server on its own initiative, or may send them to the server after receiving a request from the server.
The video encoding and decoding capability parameters can characterize the conference terminal's own video encoding and decoding capability. In an example of this embodiment, the video encoding and decoding capability parameters include encoding parameters and decoding parameters; the encoding parameters include the encoding capability of this conference terminal, and the decoding parameters include at least one of the decoding capability of this conference terminal, the speed of joining the conference, the frame rate, and format information.
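A possible, purely illustrative representation of such capability parameters (the field names and values are assumptions, not a defined message format) is sketched below:

```python
# Sketch of the capability parameters a terminal might report in S202, based on
# the fields listed above (encoding capability; decoding capability, speed of
# joining the conference, frame rate, format information).

from dataclasses import dataclass

@dataclass
class VideoCodecCapability:
    encode_capability: str   # e.g. "H.264 up to 1080p30"
    decode_capability: str   # e.g. "H.264/H.265, up to 4 streams of 720p"
    join_speed: str          # how quickly the terminal can join the conference
    frame_rate: int          # maximum supported frame rate, in fps
    formats: tuple           # picture formats the terminal can handle

capability = VideoCodecCapability(
    encode_capability="H.264@1080p30",
    decode_capability="H.264@720p x4",
    join_speed="fast",
    frame_rate=30,
    formats=("720p", "1080p"),
)
```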
S204: The server determines a decoding and display strategy for the video conference according to the video encoding and decoding capability parameters of each conference terminal and the video encoding and decoding capability of the server itself.
After obtaining the video encoding and decoding capability parameters of each conference terminal, the server can determine the decoding and display strategy for this video conference based on the encoding and decoding capabilities of these conference terminals and the server's own video encoding and decoding capability. Based on the number of conference terminals in this video conference, the way each conference terminal encodes its local video, and so on, the server can determine the processing requirement for processing the video code streams of all the video sources in this video conference into a composite code stream. Exemplarily, the server can judge whether its own video encoding and decoding capability meets this processing requirement. If the server judges that its own capability meets the requirement, the server can decode, composite, and re-encode the video code streams of all the video sources, that is, it can process the video code streams of all the video sources into a composite code stream. However, if the server determines that its video encoding and decoding capability is lower than the processing requirement for processing the video code streams of all the video sources in the video conference into a composite code stream, that is, the processing capability of the server is insufficient to decode, composite, and re-encode the video code streams of all the video sources, the server will determine that in the subsequent course of the video conference it will process only a part of the video code streams into the composite code stream. How many and which video code streams are processed, and the encoding method of the resulting composite code stream, need to be determined in combination with the video encoding and decoding capability of each conference terminal. What the server needs to ensure is that the encoding methods used for the video code streams it sends to each conference terminal, including the composite code stream and the at least one independent code stream, are all supported by the conference terminal; otherwise, if a conference terminal cannot decode a video code stream it receives, that conference terminal will be unable to display the video picture of at least one conference terminal side.
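By way of illustration only, the capability check could be approximated with a simplified cost model as sketched below; the cost figures and the rule of keeping at least one stream independent are assumptions of this sketch, not requirements of the embodiments:

```python
# Sketch of the decision described above: the server estimates the cost of
# compositing all n sources and, if that exceeds its own capability, composites
# only as many sources as it can handle, leaving the rest as independent streams.

def plan_composite(n_sources, per_stream_cost, server_capacity):
    total_cost = n_sources * per_stream_cost
    if total_cost <= server_capacity:
        return n_sources              # composite everything, as in the related art
    # Otherwise composite only the part the server can afford (at least two
    # streams are needed for a composite picture to make sense).
    m = max(2, server_capacity // per_stream_cost)
    return min(m, n_sources - 1)      # keep at least one stream independent

print(plan_composite(n_sources=6, per_stream_cost=10, server_capacity=45))  # -> 4
```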
In some examples of this embodiment, the decoding and display strategy determined by the server includes a decoding indication, and the decoding indication is used to indicate how a conference terminal decodes the multiple video code streams it receives, that is, the composite code stream and the at least one independent code stream. For example, the server instructs the conference terminals to decode a video code stream carrying the identification information "1" with decoding method 1, to decode a video code stream carrying the identification information "2" with decoding method 2, and to decode a video code stream carrying the identification information "3" with decoding method 3. Then, when the server sends a video code stream to each conference terminal in the subsequent process, it only needs to carry the corresponding identification information, and the conference terminal can determine the decoding method for this video code stream in combination with the decoding and display strategy.
In some other examples of this embodiment, the decoding and display strategy further includes a display indication, and the display indication is used to inform the conference terminals of the mapping relationship between each of the composite code stream and the at least one independent code stream sent by the server and a display area. In this way, after a conference terminal receives a video code stream and decodes it, it can determine, based on the display indication in the decoding and display strategy, in which area of the screen the corresponding video picture should be displayed.
In some examples, the display indication is not a mandatory part of the decoding and display strategy, because in these examples the conference terminal can set up a corresponding number of display areas on its display screen according to the number of video code streams it will receive. For example, assuming that after negotiating with the server the conference terminal determines that the number of video code streams it will receive in this conference is k, the conference terminal side can set up k display areas; whenever one video code stream has been received and decoded, the conference terminal can randomly select, from the display areas not yet filled with a picture, one area in which to display the video picture of that video code stream. However, since the video picture corresponding to the composite code stream contains the video pictures of at least two conference terminal sides at the same time, if it cannot be guaranteed that the composite code stream is displayed in a display area with a relatively large area, the persons and other content in that video picture will be very small and the user will struggle to see the details. As shown in FIG. 3, there are three video sources a3, b3, and c3 in a video conference; the server processes the video code streams of a3 and b3 to form the composite code stream, while c3 remains an independent code stream. Under the display scheme just described, each conference terminal side only needs to set up two display areas, with the composite code stream occupying one display area and the independent code stream occupying the other. In this way, the video pictures corresponding to a3 and b3 have to share one display area, which makes the video pictures of a3 and b3 only half the size of the video picture of c3. This not only makes it difficult for the user to see the pictures of a3 and b3 clearly, but also does not conform to the user's video conferencing habits.
In more examples of this embodiment, the decoding and display strategy includes the mapping relationship between the multiple video code streams and multiple display areas. In this way, the server can guarantee the display effect of the video pictures of the multiple video sources on the conference terminal side, for example, guarantee that each video source is displayed in a display area of the same size, or guarantee that the video pictures of the multiple video sources can be spliced and displayed within one overall area, as shown in FIG. 4. Six conference terminals a4, b4, c4, d4, e4, and f4 participate in the video conference, and each conference terminal has turned on its own camera, so there are six video sources in total. The server processes the video code streams of a4, b4, c4, and d4 into the composite code stream, and the video code streams of the other two video sources serve as independent code streams. The server specifies for each conference terminal the display area corresponding to each video code stream: the first area 401 is used to display the video picture corresponding to the composite code stream, with the arrangement of the sub-pictures inside that video picture decided by the server; the second area 402 is used to display the video picture corresponding to e4; and the third area 403 is used to display the video picture corresponding to f4. Through this spliced display, the pictures of the six video sources are displayed together in one area instead of being scattered across the screen, and in FIG. 4 the video pictures of the multiple video sources have the same size, which conforms to the user's viewing habits for video.
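A minimal sketch of such a stream-to-region mapping for the FIG. 4 example (the region identifiers are illustrative assumptions) could be:

```python
# Sketch of the display mapping carried by the decoding and display strategy:
# the composite stream is assigned the large first region, and the two
# independent streams are assigned the second and third regions.

DISPLAY_MAPPING = {
    "composite(a4,b4,c4,d4)": "region-401",  # composite picture area
    "e4": "region-402",                      # first independent stream
    "f4": "region-403",                      # second independent stream
}

def region_for(stream_label):
    return DISPLAY_MAPPING[stream_label]

print(region_for("e4"))  # -> region-402
```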
S206: The server sends the decoding and display strategy to each conference terminal.
After determining the decoding and display strategy, the server can send the decoding and display strategy to each conference terminal, so that each conference terminal knows how this video conference will be implemented.
S208: The server decodes and composites the video code streams of m of the n video sources in the video conference, and re-encodes the composite picture according to the encoding method corresponding to the decoding method in the decoding and display strategy to form the composite code stream.
In this embodiment, it is assumed that there are n video sources in total and that, during the negotiation with the conference terminals, the server decides that it will process the video code streams of m of them to form the composite code stream, while the remaining n-m video code streams will be sent to each conference terminal as independent code streams, where m is less than n.
Therefore, after receiving the video code streams used to form the composite code stream, the server can decode, composite, and re-encode these video code streams to form the composite code stream. In some examples of this embodiment, the server and the multiple conference terminals determine in the negotiation stage (that is, the stage in which the decoding and display strategy is determined) which video sources' video code streams form the composite code stream, in which case the server must wait until the video code streams of all these video sources have been received before generating the composite code stream. In some other examples of this embodiment, however, the decoding and display strategy does not specify which video sources' video code streams constitute the composite code stream, so the server can decide on the fly, according to the actual situation of the video conference, which video sources' video code streams are composited together. For example, in some examples of this embodiment, the server can select the first m video code streams, in the order in which it receives video code streams from the multiple video sources, to form the composite code stream. Refer to the flowchart shown in FIG. 5 of the server decoding, compositing, and re-encoding the video code streams of a part of the video sources to obtain the composite code stream.
S502: Obtain the video code streams of the first m video sources according to the order in which the video code streams are received from the multiple video sources.
S504: Decode the video code streams of the m video sources.
S506: Perform picture composition on the decoding results corresponding to the m video code streams to obtain a composite picture.
S508: Encode the composite picture according to the encoding method corresponding to the decoding method in the decoding and display strategy to obtain the composite code stream.
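For illustration only, the four steps above can be sketched as a small pipeline; the decode/compose/encode helpers are placeholders standing in for a real codec, and the codec name is an assumption:

```python
# Sketch of S502-S508: the first m streams, in arrival order, are decoded,
# their pictures composited, and the composite picture re-encoded with the
# encoding method that matches the negotiated decoding method.

def decode(bitstream):
    return f"frame({bitstream})"

def compose(frames):
    return " | ".join(frames)

def encode(picture, codec):
    return f"{codec}:{picture}"

def build_composite(arrived_streams, m, negotiated_codec):
    selected = arrived_streams[:m]             # S502: first m by arrival order
    frames = [decode(s) for s in selected]     # S504: decode the m streams
    picture = compose(frames)                  # S506: composite the pictures
    return encode(picture, negotiated_codec)   # S508: re-encode the composite

print(build_composite(["b2", "c2", "d2", "e2"], m=3, negotiated_codec="H.264"))
```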
S210: The server sends the composite code stream and the remaining n-m independent code streams received from the video sources to each conference terminal.
The server can send the composite code stream generated by its own processing to each conference terminal; on the other hand, the server can also send the video code streams it has received that serve as independent code streams to each conference terminal. The sending order of these video code streams has already been described in detail in the foregoing embodiment and will not be repeated here.
S212: The conference terminals decode and display the video code streams according to the decoding and display strategy.
After receiving the video code streams sent by the server, a conference terminal can decode the composite code stream and the at least one independent code stream according to the decoding methods indicated by the decoding indication in the decoding and display strategy. Subsequently, the conference terminal fills the video pictures of the composite code stream and the at least one independent code stream into the corresponding display areas for display according to the display indication in the decoding and display strategy.
In the video conference method provided by this embodiment, before the video conference formally starts, the server can obtain the video encoding and decoding capability parameters of each conference terminal and then, based on the server's own video encoding and decoding capability and the video encoding and decoding capabilities of the multiple conference terminals, determine whether it can process the video code streams of all the video sources in the video conference to form the composite code stream. If the server determines that it can process the video code streams of all the video sources in the video conference to form the composite code stream, the server can still process the video code streams of the multiple video sources in the subsequent process according to the video conference solution provided in the related art. However, if the server determines that its video encoding and decoding capability is lower than the processing requirement for processing the video code streams of all the video sources in this video conference into a composite code stream, it can process only the video code streams of a part of the video sources and at the same time make full use of the decoding resources on the side of each conference terminal, thereby reducing the delay of the video pictures on the side of each conference terminal and improving the conference experience of the users participating in the video conference.
Embodiment 3:
To give those skilled in the art a clearer understanding of the features and details of the video conference method provided by the embodiments of the present application, this embodiment describes the video conference method in detail with reference to an example; see FIG. 6.
S602: A conference initiating end creates a conference on an MCU management platform.
It should be understood that the conference initiating end is actually also a conference terminal in the video conference, and this conference terminal is usually held by the conference host. The conference initiating end creating a conference on the MCU management platform is equivalent to opening a network "conference room" on the management platform.
S604: The MCU notifies the conference terminals that need to participate in the conference to enter the video conference.
S606: The conference terminals report their own video encoding and decoding capability parameters to the MCU.
S608: The MCU determines its own optimal decoding capability and the optimal decoding capabilities of the conference terminals, and notifies the conference terminals.
When the conference starts, the MCU selects a picture layout by default and tells the conference terminals the picture layout. The control message sent by the MCU to a conference terminal contains the number of sub-pictures in the multi-picture layout, the multi-picture layout mode, and the content of each sub-picture (such as the main video source, the auxiliary video source, the first main video decoding, the second main video decoding, the first auxiliary video decoding, and so on).
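A purely illustrative sketch of such a control message (the field names are assumptions; the actual message format is not specified here) could be:

```python
# Sketch of the layout control message described above: number of sub-pictures,
# layout mode, and the content assigned to each sub-picture.

layout_control_message = {
    "picture_count": 6,
    "layout_mode": "grid-3x2",
    "sub_pictures": [
        {"position": 1, "content": "main video source"},
        {"position": 2, "content": "auxiliary video source"},
        {"position": 3, "content": "first main video decoding"},
        {"position": 4, "content": "second main video decoding"},
        {"position": 5, "content": "first auxiliary video decoding"},
        {"position": 6, "content": "local (Near) video"},
    ],
}
```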
S610: The conference terminals encode their local video according to the negotiated encoding method and send it to the MCU.
S612: The MCU processes the received video code streams and sends them to each conference terminal.
If the layout is a single picture, or a multi-picture layout (two pictures, three pictures) that the decoding capability of the MCU can fully satisfy, then after receiving the video code streams the MCU decodes the video code streams, composites the pictures, re-encodes them, and sends the result to the conference terminals. As shown in FIG. 7, there are only three far-end (Far) video code streams and one local (Near) video code stream, and the decoding capability of the MCU is entirely sufficient; the MCU can therefore be responsible for decoding, compositing, and re-encoding all the video code streams to form the composite code stream and then send it to each conference terminal. In this case, the video code streams sent out by the server consist of the composite code stream only and contain no independent code streams.
If the picture layout is a multi-picture layout that exceeds the decoding capability of the MCU, the MCU selects, according to the negotiated decoding capability and by default in the order in which it receives the code streams, a part of the video code streams to form the composite code stream: the MCU first decodes these video code streams, composites the pictures after decoding is completed, then encodes the composite picture data to form the composite code stream and sends it to each conference terminal. The remaining video code streams are sent by the MCU directly to each conference terminal with identification labels attached. FIG. 8 shows a multi-picture layout video conference that includes six video sources; after negotiation, the MCU processes the video code streams of four of the video sources, and each conference terminal processes the video of the remaining two video sources. The video code streams of Far1, Far2, Far3, and Far4 are decoded by the MCU, composited, and re-encoded to form the composite code stream, which is then sent to the conference terminals; after receiving the two code streams Far5 and Near, the MCU attaches the corresponding identification labels to them and then sends them to each conference terminal.
S614: The conference terminals decode and display the received video code streams.
A conference terminal decodes the composite code stream sent by the MCU after receiving it and displays it in the corresponding area; on the other hand, the conference terminal also decodes the independent code streams after receiving them and then displays them in the corresponding areas according to the corresponding identification labels.
In the multi-picture mode scenario, according to the code stream information, the conference terminal fills the video picture corresponding to the composite code stream into the designated area and fills the video pictures corresponding to the other independent code streams into the designated areas of the multi-picture layout according to the corresponding labels.
Embodiment 4:
This embodiment provides a storage medium. The storage medium can store at least one computer program that can be read, compiled, and executed by at least one processor. In this embodiment, the storage medium can store at least one of a first video conference program and a second video conference program, where the first video conference program can be executed by at least one processor to implement the conference-terminal-side procedure of any of the video conference methods introduced in the foregoing embodiments, and the second video conference program can be executed by at least one processor to implement the server-side procedure of any of the video conference methods introduced in the foregoing embodiments.
This embodiment also provides a conference terminal, as shown in FIG. 9. The conference terminal 90 includes a first processor 91, a first memory 92, and a first communication bus 93 configured to connect the first processor 91 and the first memory 92, where the first memory 92 may be the aforementioned storage medium storing the first video conference program, and the first processor 91 can read the first video conference program, compile it, and execute it to implement the conference-terminal-side procedure of the video conference methods introduced in the foregoing embodiments.
The first processor 91 receives the composite code stream and at least one independent code stream sent by the server, where the composite code stream is formed by the server by decoding, compositing, and re-encoding the video code streams of a part of the video sources in the video conference, and the at least one independent code stream is a video code stream received by the server from a video source other than the part of the video sources and forwarded to this conference terminal. Subsequently, the first processor 91 decodes the composite code stream and the at least one independent code stream respectively, and displays the video pictures corresponding to the composite code stream and the at least one independent code stream.
In an example of this embodiment, before receiving the composite code stream and the at least one independent code stream sent by the server, the first processor 91 also first sends the video encoding and decoding capability parameters of this conference terminal to the server and receives the decoding and display strategy sent by the server, where the decoding and display strategy is determined by the server according to the server's own video encoding and decoding capability and the video encoding and decoding capability parameters of each conference terminal in the video conference, and the decoding and display strategy is used to indicate how this conference terminal decodes and displays the composite code stream and the at least one independent code stream.
In an example of this embodiment, the video encoding and decoding capability parameters sent to the server include encoding parameters and decoding parameters; the encoding parameters include the encoding capability of this conference terminal, and the decoding parameters include at least one of the decoding capability of this conference terminal, the speed of joining the conference, the frame rate, and format information.
Optionally, the decoding and display strategy includes a decoding indication and a display indication, where the decoding indication is used to indicate how this conference terminal decodes the composite code stream and the at least one independent code stream, and the display indication is used to indicate the mapping relationship between each of the composite code stream and the at least one independent code stream sent by the server and a display area. When decoding the composite code stream and the at least one independent code stream respectively, the first processor 91 can decode the composite code stream and the at least one independent code stream according to the decoding methods indicated by the decoding indication. When displaying the video pictures corresponding to the composite code stream and the at least one independent code stream, the first processor 91 fills the video pictures of the composite code stream and the at least one independent code stream into the corresponding display areas for display according to the display indication.
This embodiment also provides a server, as shown in FIG. 10. The server 100 includes a second processor 101, a second memory 102, and a second communication bus 103 configured to connect the second processor 101 and the second memory 102, where the second memory 102 may be the aforementioned storage medium storing the second video conference program, and the second processor 101 can read the second video conference program, compile it, and execute it to implement the server-side procedure of the video conference methods introduced in the foregoing embodiments.
The second processor 101 receives the video code streams sent by the video sources in the video conference, then decodes, composites, and re-encodes the video code streams of a part of the video sources to obtain the composite code stream and sends it to each conference terminal, and forwards the video code streams of the video sources other than the part of the video sources to each conference terminal as independent code streams.
Optionally, before receiving the video code streams sent by the video sources in the video conference, the second processor 101 can also first obtain the video encoding and decoding capability parameters of each conference terminal in the video conference, and then determine the decoding and display strategy for the video conference according to the video encoding and decoding capability parameters of each conference terminal and the video encoding and decoding capability of this server, where the decoding and display strategy is used to indicate how each conference terminal decodes and displays the composite code stream and the at least one independent code stream. Subsequently, the second processor 101 sends the decoding and display strategy to each conference terminal.
In an example of this embodiment, before decoding, compositing, and re-encoding the video code streams of the part of the video sources, the second processor 101 also first determines, according to the video encoding and decoding capability parameters of each conference terminal and the video encoding and decoding capability of this server, that the encoding and decoding capability of this server is lower than the processing requirement for processing the video code streams of all the video sources in the video conference into a composite code stream.
In an example of this embodiment, the composite code stream is formed from the video code streams of m video sources, where m is greater than or equal to 2. When decoding, compositing, and re-encoding the video code streams of the part of the video sources to obtain the composite code stream, the second processor 101 obtains the video code streams of the first m video sources according to the order in which the video code streams are received from the multiple video sources, then decodes the video code streams of the m video sources, and performs picture composition on the decoding results corresponding to the m video code streams to obtain a composite picture. Subsequently, the second processor 101 encodes the composite picture according to the encoding method corresponding to the decoding method in the decoding and display strategy to obtain the composite code stream.
本实施例还提供一种视频会议系统,请参见图11所示,在该视频会议系统11当中,包括服务器100以及多个会议终端90。服务器100可以为MCU服务器,会议终端可以以各种形式来实施。例如,可以包括诸如手机、平板电脑、笔记本电脑、掌上电脑、PDA(Personal Digital Assistant,个人数字助理)、导航装置、可穿戴设备、智能手环、计步器等移动终端,以及诸如数字TV、台式计算机等固定终端。This embodiment also provides a video conference system. As shown in FIG. 11, the video conference system 11 includes a server 100 and a plurality of conference terminals 90. The server 100 may be an MCU server, and the conference terminal may be implemented in various forms. For example, it may include mobile terminals such as mobile phones, tablet computers, notebook computers, PDAs, PDAs (Personal Digital Assistants), navigation devices, wearable devices, smart bracelets, pedometers, etc., as well as mobile terminals such as digital TV, Fixed terminals such as desktop computers.
本实施例提供的会议终端、服务器,在视频会议过程中,服务器接收到视频会议中多个视频源发送的视频码流后,仅对部分视频源的视频码流进行解码、画面合成及重新编码,形成合成码流并发送给会议终端,同时服务器将除所述部分视频源之外的视频源的视频码流作为独立码流也发送给每个会议终端,让会议终端对合成码流与至少一个独立码流进行解码显示,这种会议方案中服务器因为不需要对全部视频源的视频码流进行解码、画面合成以及重新编码,因此,降低了对服务器侧编解码能力的要求。对于服务器处理能力范围之外的其他视频码流,直接发送给会议终端,从而充分利用了会议终端侧的处理资源,降低了会议终端侧视频画面的延时,提升了视频会议的流畅度,增强了用户体验。In the conference terminal and server provided in this embodiment, during the video conference, after the server receives the video code streams sent by multiple video sources in the video conference, it only decodes, synthesizes and re-encodes the video code streams of some video sources , Forming a composite code stream and sending it to the conference terminal. At the same time, the server sends the video code stream of the video source other than the part of the video source as an independent code stream to each conference terminal, so that the conference terminal can compare the composite code stream with at least An independent code stream is decoded and displayed. In this kind of conference solution, the server does not need to decode, synthesize and re-encode the video code streams of all video sources. Therefore, the requirement on the server-side codec capability is reduced. For other video streams outside the processing capacity of the server, they are directly sent to the conference terminal, thereby making full use of the processing resources on the conference terminal side, reducing the delay of the video screen on the conference terminal side, improving the smoothness of the video conference, and enhancing Improve the user experience.
Those skilled in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and apparatuses, may be implemented as software (which can be realized as program code executable by a computing device), firmware, hardware, or an appropriate combination thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that described here. Computer-readable media may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media. Therefore, this application is not limited to any particular combination of hardware and software.

Claims (11)

  1. A video conference method, applied to a conference terminal, the method comprising:
    receiving a composite code stream and at least one independent code stream sent by a server, wherein the composite code stream is formed by the server decoding, compositing, and re-encoding the video code streams of some of the video sources in a video conference, and the at least one independent code stream is a video code stream that the server receives from a video source other than those video sources and forwards to the conference terminal;
    decoding the composite code stream and the at least one independent code stream respectively; and
    displaying the video pictures corresponding to the composite code stream and the at least one independent code stream.
  2. The method according to claim 1, wherein before the receiving the composite code stream and the at least one independent code stream sent by the server, the method further comprises:
    sending the video codec capability parameters of the conference terminal to the server; and
    receiving a decoding display strategy sent by the server, wherein the decoding display strategy is determined by the server according to the server's own video codec capability and the video codec capability parameters of each conference terminal in the video conference, and the decoding display strategy is used to instruct the conference terminal how to decode and display the composite code stream and the at least one independent code stream.
  3. The method according to claim 2, wherein the video codec capability parameters comprise encoding parameters and decoding parameters, the encoding parameters comprise the encoding capability of the conference terminal, and the decoding parameters comprise at least one of the decoding capability of the conference terminal, the conference access speed, the frame rate, and format information.
  4. The method according to claim 2, wherein the decoding display strategy comprises a decoding instruction and a display instruction, the decoding instruction is used to instruct the conference terminal how to decode the composite code stream and the at least one independent code stream, and the display instruction is used to indicate a mapping relationship between each of the composite code stream and the at least one independent code stream sent by the server and a display area;
    the decoding the composite code stream and the at least one independent code stream respectively comprises:
    decoding the composite code stream and the at least one independent code stream in the decoding manner indicated by the decoding instruction; and
    the displaying the video pictures corresponding to the composite code stream and the at least one independent code stream comprises:
    filling the video pictures of the composite code stream and the at least one independent code stream into the corresponding display areas for display according to the display instruction.
  5. A video conference method, applied to a server, the method comprising:
    receiving video code streams sent by video sources in a video conference; and
    decoding, compositing, and re-encoding the video code streams of some of the video sources to obtain a composite code stream, sending the composite code stream to each conference terminal in the video conference, and forwarding the video code streams of the video sources other than those video sources to each conference terminal in the video conference as independent code streams.
  6. The method according to claim 5, wherein before the receiving the video code streams sent by the video sources in the video conference, the method further comprises:
    acquiring the video codec capability parameters of each conference terminal in the video conference;
    determining a decoding display strategy of the video conference according to the video codec capability parameters of each conference terminal and the video codec capability of the server, wherein the decoding display strategy is used to instruct each conference terminal how to decode and display the composite code stream and the at least one independent code stream; and
    sending the decoding display strategy to each conference terminal.
  7. The method according to claim 6, wherein before the decoding, compositing, and re-encoding the video code streams of some of the video sources, the method further comprises:
    determining, according to the video codec capability parameters of each conference terminal and the video codec capability of the server, that the codec capability of the server itself is below the processing requirement for processing the video code streams of all video sources in the video conference into a composite code stream.
  8. The method according to either claim 6 or claim 7, wherein the composite code stream is formed from the video code streams of m video sources, m being greater than or equal to 2, and the decoding, compositing, and re-encoding the video code streams of some of the video sources to obtain the composite code stream comprises:
    acquiring the video code streams of the first m video sources in the order in which the video code streams are received from the multiple video sources;
    decoding the video code streams of the m video sources;
    compositing the decoded results corresponding to the m video code streams to obtain a composite picture; and
    encoding the composite picture in the encoding manner corresponding to the decoding manner in the decoding display strategy to obtain the composite code stream.
  9. A conference terminal, comprising a first processor, a first memory, and a first communication bus;
    wherein the first communication bus is configured to implement connection and communication between the first processor and the first memory; and
    the first processor is configured to execute at least one program stored in the first memory to implement the video conference method according to any one of claims 1 to 4.
  10. A server, comprising a second processor, a second memory, and a second communication bus;
    wherein the second communication bus is configured to implement connection and communication between the second processor and the second memory; and
    the second processor is configured to execute at least one program stored in the second memory to implement the video conference method according to any one of claims 5 to 8.
  11. A storage medium storing at least one of a first video conference program and a second video conference program, wherein the first video conference program is executable by at least one processor to implement the video conference method according to any one of claims 1 to 4, and the second video conference program is executable by at least one processor to implement the video conference method according to any one of claims 5 to 8.
PCT/CN2020/129049 2019-11-14 2020-11-16 Video meeting method, meeting terminal, server, and storage medium WO2021093882A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911115565.3A CN112804471A (en) 2019-11-14 2019-11-14 Video conference method, conference terminal, server and storage medium
CN201911115565.3 2019-11-14

Publications (1)

Publication Number Publication Date
WO2021093882A1 (en)

Family

ID=75803917

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/129049 WO2021093882A1 (en) 2019-11-14 2020-11-16 Video meeting method, meeting terminal, server, and storage medium

Country Status (2)

Country Link
CN (1) CN112804471A (en)
WO (1) WO2021093882A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114679548A (en) * 2022-03-25 2022-06-28 航天国盛科技有限公司 Multi-picture synthesis method, system and device based on AVC framework

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101257607A (en) * 2008-03-12 2008-09-03 中兴通讯股份有限公司 Multiple-picture processing system and method for video conference
CN101860715A (en) * 2010-05-14 2010-10-13 中兴通讯股份有限公司 Multi-picture synthesis method and system and media processing device
CN102572368A (en) * 2010-12-16 2012-07-11 中兴通讯股份有限公司 Processing method and system of distributed video and multipoint control unit
US20130093835A1 (en) * 2011-10-18 2013-04-18 Avaya Inc. Defining active zones in a traditional multi-party video conference and associating metadata with each zone
US20130106988A1 (en) * 2011-10-28 2013-05-02 Joseph Davis Compositing of videoconferencing streams
US20130282820A1 (en) * 2012-04-23 2013-10-24 Onmobile Global Limited Method and System for an Optimized Multimedia Communications System

Also Published As

Publication number Publication date
CN112804471A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
EP3562163B1 (en) Audio-video synthesis method and system
KR101224097B1 (en) Controlling method and device of multi-point meeting
JP4585479B2 (en) Server apparatus and video distribution method
JP5216303B2 (en) Composite video distribution apparatus and method and program thereof
US20220279028A1 (en) Segmented video codec for high resolution and high frame rate video
WO2017080175A1 (en) Multi-camera used video player, playing system and playing method
US11089343B2 (en) Capability advertisement, configuration and control for video coding and decoding
JP2002330440A (en) Image transmission method, program for the image transmission method, recording medium for recording the program for the image transmission method, and image transmitter
US10666903B1 (en) Combining encoded video streams
WO2021168649A1 (en) Multifunctional receiving device and conference system
WO2021093882A1 (en) Video meeting method, meeting terminal, server, and storage medium
US20190141352A1 (en) Tile-based 360 vr video encoding method and tile-based 360 vr video decoding method
CN112752058A (en) Method and device for adjusting attribute of video stream
CN113141352B (en) Multimedia data transmission method and device, computer equipment and storage medium
CN112817913B (en) Data transmission method and device, electronic equipment and storage medium
KR20160087225A (en) System for cloud streaming service, method of image cloud streaming service to provide a multi-view screen, and apparatus for the same
CN105812922A (en) Multimedia file data processing method, system, player and client
US20220303596A1 (en) System and method for dynamic bitrate switching of media streams in a media broadcast production
CN112738056B (en) Encoding and decoding method and system
TWI531244B (en) Method and system for processing video data of meeting
CN113038183B (en) Video processing method, system, device and medium based on multiprocessor system
US11967345B2 (en) System and method for rendering key and fill video streams for video processing
CN112738565B (en) Interactive bandwidth optimization method, device, computer equipment and storage medium
US20220335976A1 (en) System and method for rendering key and fill video streams for video processing
CN117859321A (en) Method and apparatus for zoom-in and zoom-out during video conferencing using Artificial Intelligence (AI)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20888166

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20888166

Country of ref document: EP

Kind code of ref document: A1