WO2020258976A1 - Conference recording method and apparatus, and conference recording system - Google Patents

Conference recording method and apparatus, and conference recording system

Info

Publication number
WO2020258976A1
WO2020258976A1 (PCT/CN2020/083402)
Authority
WO
WIPO (PCT)
Prior art keywords
recorded
audio
video
conference
stream
Prior art date
Application number
PCT/CN2020/083402
Other languages
English (en)
French (fr)
Inventor
庄松海
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to EP20832662.9A (published as EP3979630A4)
Publication of WO2020258976A1
Priority to US17/563,859 (granted as US11974067B2)

Classifications

    • H04N7/15 — Conference systems
    • H04N7/152 — Multipoint control units therefor
    • H04N7/155 — Conference systems involving storage of or access to video conference sessions
    • H04N5/76 — Television signal recording
    • H04N5/765 — Interface circuits between an apparatus for recording and another apparatus
    • H04N5/265 — Mixing (studio circuits, e.g. for switching-over or special effects)
    • H04N19/30 — Coding/decoding of digital video signals using hierarchical techniques, e.g. scalability
    • H04L65/60 — Network streaming of media packets
    • G06V10/74 — Image or video pattern matching; proximity measures in feature spaces
    • G06V10/82 — Image or video recognition or understanding using neural networks
    • G06V40/10 — Recognition of human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands

Definitions

  • This application relates to the field of multimedia technology, and in particular to a conference recording method, device, and conference recording system.
  • Conference recording is an important function in the field of multimedia technology.
  • Typically, the content of a conference held by a multipoint control unit (MCU) is recorded through a recording server for purposes such as conference rebroadcast, replay, and post-production.
  • Conference recording during video conferencing has therefore become a frequent and important requirement.
  • Conventionally, conference recording captures the audio and video of the entire conference; when a particular person's speech or picture needs to be found later, the recording must be browsed and edited manually. This approach incurs high labor costs and low recording efficiency.
  • In view of this, this application provides a conference recording method, device, and conference recording system that can automatically filter the audio and video streams to be recorded, improving the efficiency of conference recording and saving labor costs.
  • this application provides a conference recording method, including the following steps:
  • the multipoint control unit screens the audio and video streams to be recorded from the audio and video streams sent by the terminals of each venue according to the characteristic information of the person to be recorded.
  • the feature information may include: image information or sound information.
  • The multipoint control unit sends the audio and video streams screened in the previous step to the recording server, and the recording server performs the conference recording.
  • Because the audio and video streams are filtered by the characteristic information of the person to be recorded, they correspond accurately to that person. Using this method, the streams that need to be recorded can therefore be filtered out automatically for recording.
  • The manual screening process is omitted, so the labor cost of conference recording is reduced.
  • Automatic screening also effectively improves the efficiency of conference recording, makes conference recording more convenient, and provides an efficient implementation basis for extending conference recording functions.
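As a loose illustration of the screening step described above, the following Python sketch filters per-site streams by whether a target person's image or sound features were detected in them. All names (`SiteStream`, `screen_streams`, the detection labels) are hypothetical and not from the patent; a real MCU would match against decoded media rather than pre-computed labels.

```python
from dataclasses import dataclass, field

@dataclass
class SiteStream:
    site_id: str
    face_ids: set = field(default_factory=set)   # persons detected in the video stream
    voice_ids: set = field(default_factory=set)  # persons detected in the audio stream

def screen_streams(streams, target, feature="image"):
    """Keep only the streams that contain the person to be recorded."""
    if feature == "image":
        return [s for s in streams if target in s.face_ids]
    return [s for s in streams if target in s.voice_ids]

streams = [
    SiteStream("site1", {"alice"}, {"alice"}),
    SiteStream("site2", {"bob", "carol"}, {"bob"}),
    SiteStream("site3", {"dave"}, {"dave"}),
]
selected = screen_streams(streams, "bob", feature="image")  # only site2's stream survives
```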
  • recording requirements are diverse.
  • The recording requirement may be (1) to record the entire audio and video stream of the conference site where the person to be recorded is located, or (2) to record only the personal audio and video stream of the person to be recorded.
  • the following describes the specific screening methods for the above two different recording requirements.
  • For requirement (1), the multipoint control unit first screens for the site terminal of the person to be recorded; in specific implementation, it screens the site terminal corresponding to the person to be recorded according to the person's characteristic information and the audio and video streams sent by each site terminal. The multipoint control unit then selects the entire audio and video stream sent by that site terminal as the stream to be recorded. This meets recording requirement (1).
  • For requirement (2), the multipoint control unit likewise first screens for the site terminal of the person to be recorded; this step is implemented as in requirement (1). Thereafter, the multipoint control unit filters, from the audio and video stream sent by the selected site terminal, the individual audio and video stream of the person to be recorded according to that person's characteristic information. For example, suppose a person to be recorded speaks three times during a meeting, with an interval between the speeches; if the characteristic information is sound information, the method can filter out exactly the audio and video stream segments of those three speeches and use them as the stream to be recorded. This meets recording requirement (2).
  • The conference recording method provided in this application can thus meet a variety of recording requirements, giving it strong applicability across conference recording scenarios. In particular, requirement (2) needs no manual editing, which makes subsequent use of the recorded conference more convenient.
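The two recording requirements can be contrasted with a minimal sketch, assuming each site stream has been annotated with per-segment speaker labels (an assumption made purely for illustration):

```python
def record_whole_site(segments, person):
    """Requirement (1): if the person appears anywhere, keep the whole stream."""
    if any(seg["speaker"] == person for seg in segments):
        return list(segments)
    return []

def record_person_only(segments, person):
    """Requirement (2): keep only the segments in which the person speaks."""
    return [seg for seg in segments if seg["speaker"] == person]

# One site's stream, split into labeled segments (times in seconds, labels assumed):
site_stream = [
    {"start": 0, "end": 5, "speaker": "zhuang"},
    {"start": 5, "end": 9, "speaker": "other"},
    {"start": 9, "end": 12, "speaker": "zhuang"},
]
```

Under requirement (1) all three segments are kept once the person is found anywhere; under requirement (2) only the two segments where the person speaks survive, which is why no post-production cutting is needed.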
  • Specifically, the multipoint control unit decodes the audio and video streams sent by each conference site terminal to obtain decoded video streams and audio streams. If the feature information includes image information, the multipoint control unit performs feature matching between the image information and the decoded video streams to determine the site terminal corresponding to the person to be recorded; if the feature information includes sound information, it performs feature matching between the sound information and the decoded audio streams to determine that site terminal.
  • In this way, the site terminal corresponding to the person to be recorded, that is, the terminal configured in the site where that person is located, can be uniquely and accurately determined. Since the person to be recorded cannot be at any other site terminal, the audio and video streams sent by those other terminals can be efficiently discarded, reducing the analysis and processing burden on the multipoint control unit.
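One plausible way to realize the feature matching described above is to compare feature embeddings with cosine similarity; the embeddings, the `find_site` helper, and the 0.8 threshold are illustrative assumptions, not details from the patent.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def find_site(target_embedding, site_embeddings, threshold=0.8):
    """Return the site terminal whose decoded stream best matches the target, or None."""
    best_site, best_score = None, threshold
    for site, emb in site_embeddings.items():
        score = cosine(target_embedding, emb)
        if score > best_score:
            best_site, best_score = site, score
    return best_site

# One embedding per site, extracted from the decoded video (or audio) stream:
site_embeddings = {
    "site1": [1.0, 0.0, 0.0],
    "site2": [0.6, 0.8, 0.0],
}
```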
  • The conference site terminals in a conference scenario may uniformly be advanced video coding (AVC) terminals, or uniformly scalable video coding (SVC) terminals.
  • the following describes the specific implementation of the method of the present application for different conference terminals.
  • the conference terminals are all AVC site terminals.
  • In this case, when filtering the audio and video streams, the multipoint control unit selects the streams that need to be recorded, according to the characteristic information of the person to be recorded, from the audio and video streams sent by at least two different AVC site terminals.
  • If the streams to be recorded include both video streams and audio streams, the multipoint control unit synthesizes the video streams to obtain a composite picture and sends the composite picture to the recording server, and it mixes the audio streams and sends the mixed audio to the recording server.
  • With the conference recording method provided in this embodiment, picture synthesis and audio mixing are performed by the multipoint control unit before sending to the recording server, which reduces network bandwidth usage and saves storage space on the recording server.
  • Recording is performed in real time for the designated persons to be recorded, which avoids manual post-production cutting of the audio and video streams, saves labor costs, and improves the efficiency of conference recording.
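The MCU-side composition for AVC terminals can be sketched as follows, with a label standing in for each decoded video frame and integer lists standing in for PCM audio samples; `compose_picture` (two panes per row) and `mix_audio` (sample averaging) are hypothetical simplifications of real compositing and mixing.

```python
def compose_picture(frames):
    """Tile one pane per selected video stream, two panes per row (layout assumed)."""
    return [frames[i:i + 2] for i in range(0, len(frames), 2)]

def mix_audio(tracks):
    """Average the time-aligned samples of all tracks (a simple additive mix)."""
    return [sum(samples) / len(samples) for samples in zip(*tracks)]

picture = compose_picture(["site1", "site2", "site3"])
mixed = mix_audio([[2, 4], [6, 0]])
```

Because the MCU sends one composite picture and one mixed track instead of every selected stream, the bandwidth and storage savings claimed above follow directly.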
  • the conference terminal is an SVC site terminal.
  • In this case, when filtering the audio and video streams, the multipoint control unit first notifies all SVC site terminals of the stream format acceptable to the recording server; it then receives, from at least two different SVC site terminals, audio and video streams in that format; finally, according to the characteristic information of the person to be recorded, it filters the streams that need to be recorded from the received streams.
  • As a result, the stream format of the filtered audio and video streams can be received and processed by the recording server.
  • The multipoint control unit sends the video streams to be recorded from at least two different SVC site terminals to the recording server, and the recording server performs picture synthesis on these video streams to obtain a composite picture.
  • Audio mixing, however, is performed by the multipoint control unit, which mixes the audio streams to be recorded from at least two different SVC site terminals and sends the result to the recording server.
  • the conference recording method provided in this embodiment is used to mix the audio by the multipoint control unit and send it to the recording server.
  • The MCU screens the streams that need to be recorded for the multiple persons to be recorded before the conference is finally recorded, which greatly reduces network bandwidth usage and saves storage space on the recording server.
  • this embodiment performs real-time recording according to the designated person to be recorded, avoids the manual post-production audio stream and video stream cutting process, saves labor costs, and improves the efficiency of conference recording.
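The SVC format-negotiation step might look like the following sketch, where each terminal offers several scalable layers and the MCU picks the layer in the format the recording server accepts; the layer descriptions and format names are assumptions for illustration.

```python
def select_layer(offered_layers, server_format):
    """Pick the terminal's scalable layer matching the recording server's format."""
    for layer in offered_layers:
        if layer["format"] == server_format:
            return layer
    return None  # the terminal offers no layer the server can process

# Layers one SVC terminal might offer (formats and bitrates assumed):
terminal_layers = [
    {"format": "360p15", "bitrate_kbps": 400},
    {"format": "720p30", "bitrate_kbps": 1500},
    {"format": "1080p30", "bitrate_kbps": 4000},
]
chosen = select_layer(terminal_layers, "720p30")
```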
  • the multipoint control unit filters the audio and video streams to be recorded from the audio and video streams according to the characteristic information of the person to be recorded, which specifically includes:
  • the multipoint control unit uses a pre-trained neural network model to screen the audio and video streams to be recorded from the audio and video streams according to the characteristic information of the persons to be recorded.
  • Using a neural network model to filter the audio and video streams improves the efficiency of stream screening, which increases the overall speed of the conference recording process and improves the user's conference recording experience.
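A hedged sketch of the neural-network screening step: assume a pre-trained model has already scored each decoded frame for the presence of the target person, and a stream is kept when enough frames match. The 0.9 score threshold and 0.5 vote ratio are illustrative choices, not values from the patent.

```python
def stream_contains_person(frame_scores, score_threshold=0.9, vote_ratio=0.5):
    """A stream matches if at least half of its frames score above the threshold."""
    hits = sum(1 for s in frame_scores if s >= score_threshold)
    return hits / len(frame_scores) >= vote_ratio

def screen_with_model(per_site_scores):
    """Keep the sites whose streams the model says contain the person."""
    return [site for site, scores in per_site_scores.items()
            if stream_contains_person(scores)]

# Stubbed per-frame model confidences for each site's decoded video:
per_site_scores = {
    "site1": [0.10, 0.20, 0.15, 0.05],  # the person is rarely detected
    "site2": [0.95, 0.97, 0.40, 0.92],  # the person is detected in most frames
}
```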
  • this application provides a conference recording device, which includes a code stream screening module and a code stream sending module.
  • the code stream filtering module filters the audio and video code streams that need to be recorded from the audio and video code streams sent by each venue terminal according to the characteristic information of the person to be recorded; the code stream sending module sends the audio and video code streams that need to be recorded to the recording server, So that the recording server can record the meeting.
  • the device uses the characteristic information to accurately screen out the audio and video streams that need to be recorded, and the audio and video streams are matched with the persons to be recorded to realize automatic screening of the audio and video streams. Compared with manual screening and recording, the efficiency is greatly improved and labor costs are saved.
  • the stream filtering module specifically includes:
  • the site terminal screening unit is used to filter the site terminal corresponding to the person to be recorded according to the characteristic information of the person to be recorded and the audio and video stream sent by each site terminal;
  • the first code stream screening unit is used to take the entire audio and video stream sent by the selected site terminal as the audio and video stream to be recorded.
  • the stream filtering module specifically includes:
  • the site terminal screening unit is used to filter the site terminal corresponding to the person to be recorded according to the characteristic information of the person to be recorded and the audio and video stream sent by each site terminal;
  • the second code stream screening unit is used to screen the individual audio and video code streams of the person to be recorded from the selected audio and video code streams sent by the conference site terminal as the audio and video code streams to be recorded according to the characteristic information of the person to be recorded.
  • the terminal screening unit of the conference site specifically includes:
  • the decoding subunit is used to decode the audio and video code streams sent by each conference site terminal to obtain decoded video code streams and audio code streams;
  • the site terminal determination subunit is used to perform feature matching between the image information of the person to be recorded and the decoded video streams, or between the sound information of the person to be recorded and the decoded audio streams, to determine the site terminal corresponding to the person to be recorded.
  • the conference terminal is an advanced video coding AVC site terminal
  • the code stream screening module specifically includes:
  • the third code stream screening unit is used to screen the audio and video code streams to be recorded from the audio and video code streams sent by at least two different AVC conference site terminals according to the characteristic information of the person to be recorded.
  • the audio and video code streams that need to be recorded include the video code streams that need to be recorded and the audio code streams that need to be recorded;
  • the code stream sending module specifically includes:
  • the picture synthesis unit is used to synthesize the video stream to be recorded to obtain a synthesized picture
  • the screen sending unit is used to send the composite screen to the recording server
  • the first mixing unit is used to mix the audio stream to be recorded
  • the first audio sending unit is used to send the mixed audio to the recording server.
  • the conference terminal is a scalable video coding SVC site terminal.
  • the code stream screening module specifically includes:
  • the stream format notification unit is used to notify all SVC venue terminals of the stream format applicable to the recording server;
  • a code stream receiving unit configured to receive audio and video code streams in a code stream format suitable for the recording server and sent by at least two different SVC site terminals;
  • the fourth code stream screening unit is used to screen the audio and video streams to be recorded, according to the characteristic information of the person to be recorded, from the audio and video streams in the stream format suitable for the recording server.
  • the audio and video code streams that need to be recorded include the video code streams that need to be recorded and the audio code streams that need to be recorded;
  • the code stream sending module specifically includes:
  • the video code stream sending unit is used to send the video streams to be recorded from at least two different SVC site terminals to the recording server, so that the recording server performs picture synthesis on these video streams to obtain a composite picture;
  • the second audio mixing unit is used for mixing audio code streams corresponding to at least two different SVC conference site terminals that need to be recorded;
  • the second audio sending unit is used to send the mixed audio to the recording server.
  • the code stream filtering module specifically includes:
  • the fifth code stream screening unit is used to screen the audio and video streams to be recorded from the audio and video streams, using the pre-trained neural network model, according to the characteristic information of the person to be recorded.
  • this application provides a conference recording system, including a multipoint control unit, a recording server, and at least two conference site terminals;
  • the venue terminal is used to send audio and video streams to the multipoint control unit;
  • the multipoint control unit is used to screen the audio and video streams to be recorded from the streams sent by each site terminal according to the characteristic information of the person to be recorded, the characteristic information including image information or sound information, and to send the streams to be recorded to the recording server;
  • the recording server is used to record the conference according to the audio and video streams that need to be recorded.
  • the system uses characteristic information to accurately screen out the audio and video streams that need to be recorded from the numerous audio and video streams provided by the venue terminals, and the audio and video streams are matched with the person to be recorded to achieve automatic screening of the audio and video streams. Compared with manual screening and recording, the efficiency is greatly improved and labor costs are saved.
  • In this system, the multipoint control unit MCU can, according to the characteristic information of the person to be recorded, accurately filter out the audio and video streams that need to be recorded from the streams sent by each site terminal. The MCU then sends the streams to be recorded to the recording server, and the recording server records the conference according to the streams received from the MCU.
  • the method uses the characteristic information of the person to be recorded to realize automatic screening of the audio and video streams to be recorded, thereby eliminating the need for manual screening, saving labor costs for conference recording, and greatly improving conference recording efficiency. This method improves the convenience of conference recording and promotes the wide application of video conference functions.
  • FIG. 1 is a schematic diagram of a conference recording scene provided by an embodiment of the application
  • FIG. 2 is a flowchart of a method for recording a meeting according to an embodiment of the application
  • FIG. 3 is a flowchart of a multipoint control unit provided by this embodiment to obtain an audio and video stream to be recorded;
  • FIG. 4 is a flowchart of another multipoint control unit provided by an embodiment of the application to obtain an audio and video stream to be recorded;
  • FIG. 5 is a signaling diagram of a conference recording method provided by an embodiment of the application.
  • FIG. 6 is a signaling diagram of another method for conference recording provided by an embodiment of the application.
  • FIG. 7 is a schematic structural diagram of a conference recording device provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of a conference recording system provided by an embodiment of the application.
  • FIG. 9 is a schematic structural diagram of another conference recording system provided by an embodiment of the application.
  • the inventor provides a conference recording method, device and conference recording system.
  • the feature information of the person to be recorded is used to filter the audio and video streams of multiple venues. Since each person's feature information (including image information or sound information) is different from each other, the feature of the person to be recorded is used The information can accurately identify the audio and video stream containing the characteristic information of the person to be recorded, thereby realizing automatic screening of the audio and video stream.
  • the MCU screens out the audio and video code streams that need to be recorded and sends them to the recording server, and the server can record the conference according to the audio and video code streams received from the MCU.
  • FIG. 1 is a schematic diagram of a conference recording scene provided by an embodiment of the application.
  • the application scenario of the conference recording method provided by the present application includes: a multipoint control unit MCU, a recording server, a conference application server (application server, AS), and multiple conference site terminals.
  • the number of conference site terminals may be two or more.
  • FIG. 1 only three conference site terminals are taken as an example. This embodiment does not limit the specific number of conference site terminals in a conference recording scenario.
  • the three conference site terminals in FIG. 1 are conference site terminal 1, conference site terminal 2, and conference site terminal 3, and different conference site terminals belong to different conference sites.
  • the MCU and the recording server are located in the same LAN.
  • The conference AS serves as the video conferencing service management platform; the user reserves the conference through the conference AS and provides or specifies materials containing the characteristic information of the person to be recorded.
  • the feature information may include image information or sound information. That is, the feature information may include only image information, may include only sound information, or may include both image information and sound information.
  • the user may upload to the conference AS materials containing characteristic information of the person to be recorded, such as a picture or audio file of the person to be recorded.
  • the meeting AS processes the picture to obtain the image information of the person to be recorded.
  • the image information can be specifically image features, such as facial features;
  • The conference AS processes the audio file to obtain the sound information of the person to be recorded; the sound information can specifically be voiceprint features. It is understandable that the characteristic information of the person to be recorded differs from that of other persons; that is, the characteristic information corresponds uniquely to the person to be recorded, and the person can be uniquely determined from it.
  • Alternatively, materials corresponding to multiple candidate persons, such as a picture or audio file of each person, are stored in the conference AS, which has pre-processed them to obtain the image information and sound information of each person.
  • the user can select the material containing the characteristic information of the person to be recorded from the materials of the plurality of persons to be selected according to requirements, for example, select a picture or audio file of the person to be recorded.
  • the conference AS can determine the person to be recorded designated by the user according to the user's selection.
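The conference AS flow above can be sketched as follows; `extract_features` is a stand-in for real face-feature or voiceprint extraction, and the file-extension heuristic is purely illustrative.

```python
def extract_features(material):
    """Derive feature information from an uploaded material (stubbed extraction)."""
    kind = "image" if material["file"].endswith((".jpg", ".png")) else "sound"
    return {"person": material["person"], "type": kind}

# Materials the user uploads (or selects) for the person to be recorded:
uploaded = [
    {"person": "zhuang", "file": "zhuang.jpg"},  # yields image information
    {"person": "zhuang", "file": "zhuang.wav"},  # yields sound information
]
features = [extract_features(m) for m in uploaded]  # later handed to the MCU
```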
  • the conference AS convenes a conference to the multipoint control unit MCU, and sends the characteristic information of the person to be recorded to the MCU when convening.
  • the MCU calls the conference site terminal 1, the conference site terminal 2 and the conference site terminal 3 to join the conference. After each terminal joins the conference, it sends audio and video streams to the MCU.
  • the MCU screens the audio and video streams that need to be recorded from the audio and video streams sent by the terminals of each conference site according to the characteristic information of the person to be recorded issued by the conference AS.
  • Suppose the person to be recorded participates in the conference at site terminal 2. Then, in the audio and video stream sent by site terminal 2 to the MCU, the video stream contains the image information of the person to be recorded and the audio stream contains that person's sound information. If the characteristic information of the person to be recorded is image information, the MCU can match it against the video streams sent by each site terminal to determine that the person participates at site terminal 2, filter out the audio and video stream sent by site terminal 2, and send it to the recording server so that the recording server can record the conference.
  • Likewise, if the characteristic information is sound information, the MCU can match it against the audio streams sent by each site terminal to determine that the person participates at site terminal 2, filter out the stream sent by site terminal 2, and send it to the recording server so that the recording server can record the conference.
  • Fig. 2 is a flowchart of the conference recording method provided in this embodiment. The method is applied to the multipoint control unit MCU in the conference recording scenario.
  • the conference recording method provided in this embodiment includes:
  • Step 201 The multipoint control unit filters the audio and video code streams to be recorded from the audio and video code streams sent by the terminals of each venue according to the characteristic information of the person to be recorded.
  • the characteristic information includes: image information or sound information.
  • In specific implementation, the MCU can filter the audio and video streams to be recorded based on the image information of the person to be recorded, based on the sound information of the person to be recorded, or based on both together.
  • It is understandable that comprehensively using the image information and sound information of the person to be recorded to screen the streams can improve the accuracy of the screening and reduce its error rate.
  • The number of persons to be recorded may be one or more. If all the characteristic information received by the MCU from the conference AS belongs to the same person, there is only one person to be recorded; if it belongs to multiple different persons, there are multiple persons to be recorded. In practical applications, multiple persons to be recorded may be located in the same conference site, jointly corresponding to one site terminal, or in different sites, each corresponding to a different site terminal.
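The single-versus-multiple-person distinction can be expressed with a small sketch; the dictionary shapes are assumptions made for illustration only.

```python
def count_persons(feature_items):
    """Number of distinct persons the received feature-information items belong to."""
    return len({item["person"] for item in feature_items})

def group_by_site(person_to_site):
    """Map each site terminal to the persons to be recorded located there."""
    sites = {}
    for person, site in person_to_site.items():
        sites.setdefault(site, []).append(person)
    return sites

# Feature items the MCU received from the conference AS (shapes assumed):
items = [{"person": "a", "type": "image"},
         {"person": "a", "type": "sound"},
         {"person": "b", "type": "image"}]
```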
  • the actual recording requirement may be: recording a multi-screen meeting, where each screen corresponds to a different person to be recorded.
  • the audio and video streams to be recorded are also different.
  • the selected audio and video code stream is specifically the entire audio and video code stream of the conference room where the person to be recorded is located during the meeting.
  • the selected audio and video stream is specifically the audio and video stream when the person to be recorded speaks.
  • the MCU in this embodiment can use a pre-trained neural network model to filter the audio and video streams to be recorded from the audio and video streams according to the characteristic information of the person to be recorded.
  • the neural network model is obtained by training using the characteristic information of a large number of different persons and materials containing the characteristic information of different persons (for example, pictures or audio files of persons to be recorded).
  • training a neural network model that can accurately identify a video code stream containing certain image information, or an audio code stream containing certain sound information, is a relatively mature technology, so this embodiment does not describe the specific training process of the neural network model in detail.
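For illustration only, one way such a pre-trained model could be applied is to compare embedding vectors produced by the model against a reference embedding of the person to be recorded; the embeddings and threshold below are hypothetical stand-ins for actual model outputs:

```python
# Minimal sketch of embedding-based stream screening: a pre-trained
# model (not shown) maps faces/voices to vectors, and a stream is
# selected when its embedding is close enough to the reference one.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_streams(reference, stream_embeddings, threshold=0.9):
    """Return the ids of streams whose embedding matches the reference."""
    return [sid for sid, emb in stream_embeddings.items()
            if cosine_similarity(reference, emb) >= threshold]

reference = [0.6, 0.8, 0.0]           # embedding of the person to record
streams = {
    "site1": [0.59, 0.81, 0.02],      # near-duplicate -> match
    "site2": [0.0, 0.1, 0.99],        # unrelated -> no match
}
print(select_streams(reference, streams))  # ['site1']
```

The threshold trades recall against precision; in practice it would be tuned on the training material mentioned above.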
  • Step 202 The multipoint control unit sends the audio and video stream to be recorded to the recording server, so that the recording server can perform conference recording.
  • the recording server is a server that can work with the MCU and the venue terminals, and it can record video, audio, and computer screen signals simultaneously. Therefore, in this step, after the MCU sends the audio and video streams that need to be recorded to the recording server, the recording server can record the meeting according to those streams.
  • in the conference recording method described above, the multipoint control unit MCU can accurately screen out the audio and video streams that need to be recorded from the audio and video streams sent by each venue according to the characteristic information of the person to be recorded. Thereafter, the MCU sends the screened streams to the recording server, and the recording server records the meeting according to the streams received from the MCU.
  • the method uses the characteristic information of the person to be recorded to realize automatic screening of the audio and video streams to be recorded, thereby eliminating the need for manual screening, saving labor costs for conference recording, and greatly improving conference recording efficiency.
  • This method improves the convenience of conference recording and promotes the wide application of video conference functions.
  • recording requirement (1) means that the audio and video stream to be recorded is the entire audio and video stream of the venue where the person to be recorded is located; requirements (2) and (3) mean that the audio and video stream to be recorded is the personal audio and video stream of the person to be recorded.
  • in conjunction with FIG. 3, the following describes the detailed flow of step 201 when the audio and video stream to be recorded is the entire audio and video stream of the venue where the person to be recorded is located; in conjunction with FIG. 4, it describes the detailed flow of step 201 when the audio and video stream to be recorded is the personal audio and video stream of the person to be recorded.
  • FIG. 3 is a flowchart of a multipoint control unit provided by this embodiment to obtain audio and video streams to be recorded.
  • the multipoint control unit obtains the audio and video stream to be recorded specifically including:
  • Step 301 The multipoint control unit decodes the audio and video code streams sent by each conference site terminal to obtain decoded video code streams and audio code streams.
  • the audio and video code stream sent by the site terminal to the MCU may specifically be an audio and video real-time transport protocol (RTP) code stream.
  • to enable subsequent feature matching, the audio and video code streams need to be decoded in advance.
  • decoding of audio and video code streams is a relatively mature technology, so the decoding process will not be described in detail here.
  • the MCU obtains a video stream and an audio stream that can be processed separately. It is understandable that there is a time sequence relationship between the video code stream and the audio code stream decoded from the audio and video code stream.
  • the MCU decodes the video code stream and audio code stream of the site where the site terminal 1 is located from time T1 to time T2.
  • Step 302 The multipoint control unit performs feature matching between the image information of the person to be recorded and the decoded video code streams to determine the venue terminal corresponding to the person to be recorded, or performs feature matching between the voice information of the person to be recorded and the decoded audio code streams to determine the venue terminal corresponding to the person to be recorded.
  • if site terminal 1 belongs to the venue where the person to be recorded is located, then the venue terminal that sends the audio and video stream carrying the characteristic information of the person to be recorded must be site terminal 1 and cannot be any other venue terminal. Therefore, to screen the audio and video streams that need to be recorded, this embodiment only needs to determine which venue terminal transmits the characteristic information of the person to be recorded.
  • if the characteristic information of the person to be recorded that the MCU receives from the conference AS includes only image information, then in this step the MCU performs feature matching between that image information and the decoded video code streams to determine the venue terminal corresponding to the person to be recorded. If the characteristic information includes only sound information, then in this step the MCU performs feature matching between that sound information and the decoded audio code streams to determine the venue terminal corresponding to the person to be recorded.
  • if the characteristic information of the person to be recorded that the MCU receives from the conference AS includes both image information and sound information, the two kinds of information can be used together to match the venue terminal corresponding to the person to be recorded, which improves the accuracy and credibility of the matching result and reduces the error rate.
  • the above steps 301-302 implement the screening, by the multipoint control unit, of the venue terminal corresponding to the person to be recorded. That is, the venue terminal finally determined by matching in step 302 is obtained by the multipoint control unit from the multiple venue terminals according to the characteristic information of the person to be recorded and the audio and video streams sent by each venue terminal.
  • Step 303 Use all the audio and video streams sent by the screened venue terminal as the audio and video streams to be recorded.
  • if the recording requirement is the aforementioned requirement (1), that is, to record the complete meeting attended by the person to be recorded, then all the audio and video streams sent by the venue terminal screened in step 302 are directly used as the audio and video streams to be recorded.
  • the MCU filters the entire audio and video stream of the conference room where the person to be recorded is located, and sends the entire audio and video stream to the recording server for conference recording to meet the aforementioned recording requirements (1).
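For illustration only, the flow of steps 301-303 (decode, match, then keep the matched terminal's whole stream) could be sketched as follows, where `contains_feature` and the stream payloads are hypothetical stand-ins for a real image or voice matcher and real decoded media:

```python
# Sketch of steps 301-303: pick the venue terminal whose decoded
# stream contains the person's characteristic information, then keep
# that terminal's entire stream for recording.

def find_terminal(decoded_streams, contains_feature):
    """decoded_streams maps terminal id -> decoded stream payload;
    returns the first terminal whose stream matches, else None."""
    for terminal_id, stream in decoded_streams.items():
        if contains_feature(stream):
            return terminal_id
    return None

decoded = {
    "site1": {"speakers": ["alice", "bob"]},
    "site2": {"speakers": ["carol"]},
}
# Match on a hypothetical voiceprint label; the whole stream of the
# matched terminal then becomes the stream to be recorded.
terminal = find_terminal(decoded, lambda s: "carol" in s["speakers"])
print(terminal)  # site2
```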
  • FIG. 4 is a flowchart of another multipoint control unit provided in this embodiment to obtain the audio and video stream to be recorded.
  • the multipoint control unit obtains the audio and video stream to be recorded specifically including:
  • Step 401 The multipoint control unit decodes the audio and video code streams sent by each conference terminal to obtain decoded video code streams and audio code streams.
  • Step 402 The multipoint control unit performs feature matching between the image information of the person to be recorded and the decoded video code streams to determine the venue terminal corresponding to the person to be recorded, or performs feature matching between the voice information of the person to be recorded and the decoded audio code streams to determine the venue terminal corresponding to the person to be recorded.
  • steps 401-402 are implemented in the same manner as the foregoing steps 301-302, and related descriptions of steps 401-402 can refer to steps 301-302, which will not be repeated here.
  • Step 403 From the audio and video streams sent by the screened venue terminal, screen the personal audio and video stream of the person to be recorded according to the characteristic information of the person to be recorded, and use it as the audio and video stream to be recorded.
  • in this case, the actual recording requirement may be the aforementioned requirement (2), that is, to record only the portions of the meeting in which the person to be recorded speaks.
  • accordingly, the voice information of the person to be recorded can be used to determine, from the audio code stream from time T1 to time T2 decoded in step 401, the audio code stream from time T3 to time T4 that contains the voice information of the person to be recorded (the meeting start time T1 is earlier than the meeting end time T2; the speech start time T3 of the person to be recorded is earlier than the speech end time T4; T3 is later than or equal to T1 and earlier than T2; and T4 is earlier than or equal to T2).
  • because the video code stream obtained by decoding the audio and video code stream is related in time sequence to the audio code stream, the video code stream from time T3 to time T4 can also be obtained according to the audio code stream from time T3 to time T4.
  • the audio code stream and the video code stream from time T3 to time T4 are collectively referred to as the personal audio and video code stream of the person to be recorded, and this stream meets recording requirement (2).
  • alternatively, the actual recording requirement may be the aforementioned requirement (3), that is, to record only the portions of the meeting in which the person to be recorded appears.
  • accordingly, the image information of the person to be recorded can be used to determine, from the video code stream from time T1 to time T2 decoded in step 401, the video code stream from time T5 to time T6 that contains the image information of the person to be recorded (the meeting start time T1 is earlier than the meeting end time T2; the person to be recorded appears from time T5 to time T6; T5 is earlier than T6 and later than or equal to T1; and T6 is earlier than or equal to T2).
  • the audio code stream from time T5 to time T6 can also be obtained according to the video code stream from time T5 to time T6.
  • the audio code stream and the video code stream from time T5 to time T6 are collectively referred to as the personal audio and video code stream of the person to be recorded, and this stream meets recording requirement (3).
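For illustration only, the segment extraction described above (cutting the same [T3, T4] or [T5, T6] window from both the audio and the video streams of the shared timeline) could be sketched as follows; the timestamped sample lists are simplified stand-ins for real decoded media:

```python
# Sketch of personal-segment extraction: once matching finds the
# window in which the person speaks (or appears), the same window is
# cut from both decoded streams, since they share a common timeline.

def clamp_window(t1, t2, t_start, t_end):
    """Clip the detected window to the meeting's duration [t1, t2]."""
    return max(t1, t_start), min(t2, t_end)

def cut_segment(samples, window):
    """samples: list of (timestamp, payload); keep those in the window."""
    lo, hi = window
    return [(ts, p) for ts, p in samples if lo <= ts <= hi]

audio = [(t, f"a{t}") for t in range(0, 10)]  # meeting runs T1=0..T2=9
video = [(t, f"v{t}") for t in range(0, 10)]
window = clamp_window(0, 9, 3, 6)             # person speaks T3=3..T4=6
print(cut_segment(audio, window))             # audio for T3..T4
print([p for _, p in cut_segment(video, window)])  # matching video
```

The same two helpers cover requirement (3) by feeding in the appearance window [T5, T6] instead of the speech window.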
  • the following exemplarily provides an application scenario of the method in this embodiment.
  • there are multiple persons to be recorded and multiple persons to be recorded are located in the same conference site, that is, multiple persons to be recorded correspond to the same conference site terminal.
  • the specific recording requirement is to record only the meeting when the person to be recorded speaks; the characteristic information of the person to be recorded obtained by the multipoint control unit includes: the voice information of each person to be recorded.
  • in this application scenario, the multipoint control unit MCU decodes the audio and video streams sent by each venue terminal to obtain decoded video streams and audio streams; it matches the voice information of each person to be recorded with the decoded audio code streams to determine the same venue terminal corresponding to all the persons to be recorded; then, according to the voice information of each person to be recorded, the MCU screens each person's personal audio and video stream from the streams sent by the screened venue terminal, and these streams are collectively used as the audio and video streams to be recorded. The MCU finally sends these streams to the recording server for conference recording.
  • venue terminals can be divided into two categories: single-stream advanced video coding (AVC) venue terminals and multi-stream scalable video coding (SVC) venue terminals. In practical applications, the venue terminals in a conference recording scene may all be AVC venue terminals, or may all be SVC venue terminals.
  • for these two types of venue terminals, the operations performed by the multipoint control unit MCU in the conference recording method provided in the embodiments of the present application differ correspondingly.
  • the following two embodiments respectively describe the conference recording method in the AVC site terminal scenario and the conference recording method in the SVC site terminal scenario.
  • FIG. 5 is a signaling diagram of a conference recording method provided in this embodiment.
  • the conference recording scene shown in FIG. 5 includes a multipoint control unit MCU, a conference AS, a recording server, and multiple AVC venue terminals, namely AVC 1, AVC 2, and AVC 3.
  • Step 501 The conference AS instructs the multipoint control unit MCU to convene a conference according to the user's reservation, and delivers the characteristic information of the persons to be recorded.
  • the person to be recorded corresponds to at least two different AVC site terminals, that is, the characteristic information of the person to be recorded issued by the conference AS belongs to at least two persons to be recorded.
  • the conference AS delivers the characteristic information of the person to be recorded Role1 and the characteristic information of the person to be recorded Role2.
  • Step 502 The multipoint control unit calls all AVC site terminals to join the conference, including: AVC 1, AVC 2, and AVC 3.
  • Step 503 AVC 1, AVC 2, and AVC 3 join the conference and send audio and video streams to the MCU respectively.
  • Step 504 The MCU calls the recording server to join the conference.
  • Step 505 The MCU decodes the audio and video code streams sent by AVC 1, AVC 2, and AVC 3 respectively.
  • the MCU decodes the audio and video code streams sent by AVC 1, AVC 2, and AVC 3 respectively, and the decoding can obtain the audio code streams and video code streams sent by AVC 1, AVC 2, and AVC 3 respectively.
  • the audio code stream and the video code stream are obtained through decoding, so that the feature information of the persons to be recorded Role1 and Role2 can be used to screen the audio and video code streams.
  • Step 506 The MCU screens the audio and video streams to be recorded from the audio and video streams sent by AVC 1, AVC 2, and AVC 3 according to the characteristic information of the person to be recorded.
  • the venue terminal corresponding to the person to be recorded can be selected according to the characteristic information of the person to be recorded and the decoded audio code stream and video code stream of each venue terminal.
  • the specific screening process can refer to the foregoing embodiment, which will not be repeated here.
  • the site terminal corresponding to Role1 is finally selected as AVC 1 according to the characteristic information of Role1
  • the site terminal corresponding to Role2 is selected as AVC 2 according to the characteristic information of Role2.
  • when screening, the MCU can filter by specifying the video source. For example, the MCU specifies the video source name of AVC 1 and filters out the audio code streams and video code streams that do not match that video source name; the audio code stream and video code stream that remain after filtering are used as the audio code stream and video code stream that need to be recorded for Role1. Similarly, the MCU specifies the video source name of AVC 2 and filters out the audio code streams and video code streams that do not match; the remaining streams are used as the audio code stream and video code stream that need to be recorded for Role2.
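For illustration only, filtering by a specified video source name could be sketched as follows; the shape of the stream records is hypothetical, since the patent does not define it:

```python
# Sketch of video-source-name filtering: only the streams tagged with
# the designated source survive, and they become the streams to be
# recorded for the corresponding person.

def filter_by_source(streams, source_name):
    """Keep only streams whose video source matches source_name."""
    return [s for s in streams if s["source"] == source_name]

streams = [
    {"source": "AVC 1", "kind": "audio"},
    {"source": "AVC 1", "kind": "video"},
    {"source": "AVC 2", "kind": "audio"},
]
role1_streams = filter_by_source(streams, "AVC 1")
print(len(role1_streams))  # 2
```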
  • Step 507 The MCU performs picture synthesis on the video stream to be recorded to obtain a synthesized picture, sends the synthesized picture to the recording server, and mixes the audio stream to be recorded and sends it to the recording server.
  • the MCU has the functions of synthesizing the pictures of the video streams sent by the AVC venue terminals and of mixing the audio streams they send. Since in this embodiment the persons to be recorded correspond to at least two different AVC venue terminals, the MCU can synthesize into one picture the video code streams that need to be recorded for the multiple persons to be recorded (for example, Role1 and Role2) obtained in the previous step, and can mix the corresponding audio code streams that need to be recorded.
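For illustration only, the MCU-side picture synthesis and audio mixing could be sketched as follows; the frame and sample representations are toy stand-ins for real decoded media:

```python
# Toy sketch of the MCU-side composition described above: video frames
# from two terminals are placed side by side into one picture, and
# audio samples are mixed by summing with clipping.

def compose_side_by_side(left_frame, right_frame):
    """Each frame is a list of rows; rows are concatenated pairwise."""
    return [l + r for l, r in zip(left_frame, right_frame)]

def mix_audio(a, b, limit=32767):
    """Sum 16-bit PCM samples, clipping to the valid range."""
    return [max(-limit - 1, min(limit, x + y)) for x, y in zip(a, b)]

frame1 = [[1, 1], [1, 1]]   # Role1's picture
frame2 = [[2, 2], [2, 2]]   # Role2's picture
print(compose_side_by_side(frame1, frame2))    # [[1, 1, 2, 2], [1, 1, 2, 2]]
print(mix_audio([1000, 30000], [500, 10000]))  # [1500, 32767]
```

Sending one synthesized picture and one mixed audio track, instead of every original stream, is what reduces the bandwidth to the recording server.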
  • Step 508 The recording server performs conference recording after receiving the code stream.
  • the person to be recorded corresponds to at least two AVC site terminals, that is, there are at least two persons to be recorded.
  • the conference AS realizes the designation of the persons to be recorded for conference recording by issuing the characteristic information of the persons to be recorded.
  • in the conference recording method provided by this embodiment, the MCU synthesizes the picture, mixes the audio, and then sends the result to the recording server, which greatly reduces the required network bandwidth and saves the storage space of the recording server.
  • in addition, this embodiment performs real-time recording according to designated persons to be recorded, which avoids manually cutting the audio and video streams in post-production, saves labor costs, and improves conference recording efficiency.
  • FIG. 6 is a signaling diagram of another method for conference recording provided in this embodiment.
  • the conference recording scene shown in FIG. 6 includes a multipoint control unit MCU, a conference AS, a recording server, and multiple SVC venue terminals, namely SVC 1, SVC 2, and SVC 3.
  • Step 601 The conference AS instructs the multipoint control unit MCU to convene a conference according to the user's reservation, and delivers the characteristic information of the persons to be recorded.
  • the person to be recorded corresponds to at least two different SVC site terminals, that is, the characteristic information of the person to be recorded issued by the conference AS belongs to at least two persons to be recorded.
  • the conference AS delivers the characteristic information of the person to be recorded Role3 and the characteristic information of the person to be recorded Role4.
  • Step 602 The multipoint control unit calls all SVC site terminals to join the conference, including SVC 1, SVC 2, and SVC 3, and the multipoint control unit notifies SVC 1, SVC 2, and SVC 3 of the stream formats applicable to the recording server.
  • each SVC site terminal can provide the MCU with audio and video streams in different stream formats.
  • the recording server can usually record a conference based on audio and video streams in only one of these stream formats; streams in the other formats are not applicable to it. Therefore, each SVC venue terminal needs to be notified in advance of the stream format that is applicable to the recording server.
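For illustration only, the format notification of step 602 could be sketched as a simple capability check; the format names are hypothetical, not taken from the patent:

```python
# Sketch of the stream-format notification: the MCU tells each SVC
# terminal which of its available formats the recording server
# accepts, and the terminal then sends only that format.

def choose_format(terminal_formats, server_format):
    """Return the format to send, or None if the terminal lacks it."""
    return server_format if server_format in terminal_formats else None

svc1_formats = {"720p30", "1080p30", "360p15"}
print(choose_format(svc1_formats, "1080p30"))  # 1080p30
print(choose_format({"360p15"}, "1080p30"))    # None
```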
  • Step 603 SVC 1, SVC 2, and SVC 3 join the conference, and respectively send audio and video streams in a stream format suitable for the recording server to the MCU.
  • Step 604 The MCU calls the recording server to join the conference.
  • Step 605 The MCU decodes the audio and video code streams sent by SVC 1, SVC 2, and SVC 3 respectively.
  • the MCU decodes the audio and video code streams sent by SVC 1, SVC 2, and SVC 3 respectively, and the decoding can obtain the audio code streams and video code streams sent by SVC 1, SVC 2 and SVC 3 respectively.
  • the audio code stream and the video code stream are obtained by decoding, so that the feature information of Role3 and Role4 of the persons to be recorded can be used to screen the audio and video code streams.
  • Step 606 According to the characteristic information of the persons to be recorded, the MCU screens the audio and video code streams to be recorded from the audio and video code streams, in the stream format applicable to the recording server, sent by SVC 1, SVC 2, and SVC 3.
  • the venue terminal corresponding to the person to be recorded can be selected according to the characteristic information of the person to be recorded and the decoded audio code stream and video code stream of each venue terminal.
  • the specific screening process can refer to the foregoing embodiment, which will not be repeated here.
  • the site terminal corresponding to Role3 is selected as SVC 3 according to the feature information of Role3
  • the site terminal corresponding to Role4 is selected as SVC 2 according to the feature information of Role4.
  • when screening, the MCU can filter by specifying the video source. For example, the MCU specifies the video source name of SVC 3 and filters out the audio code streams and video code streams that do not match that video source name; the remaining audio code stream and video code stream are used as the streams that need to be recorded for Role3. Similarly, the MCU specifies the video source name of SVC 2 and filters out the streams that do not match; the remaining streams are used as the audio code stream and video code stream that need to be recorded for Role4.
  • Step 607 The MCU sends the video code streams to be recorded corresponding to the at least two different SVC venue terminals to the recording server, and mixes the audio code streams to be recorded corresponding to the at least two different SVC venue terminals before sending them to the recording server.
  • that is, in this step the video streams that need to be recorded, screened in the previous step from those sent by SVC 3 and SVC 2, are sent to the recording server.
  • the recording server has the function of synthesizing the pictures of the video streams of multiple SVC venue terminals, while the MCU has the function of mixing the audio code streams sent by the SVC venue terminals; therefore, the audio code streams sent by SVC 3 and SVC 2 that need to be recorded are mixed by the MCU.
  • Step 608 The recording server performs picture synthesis on the video code streams to be recorded corresponding to at least two different SVC conference site terminals to obtain a synthesized picture, and performs conference recording according to the synthesized picture and the mixed audio code stream.
  • the persons to be recorded correspond to at least two SVC site terminals, that is, there are at least two persons to be recorded.
  • the conference AS realizes the designation of the persons to be recorded for conference recording by issuing the characteristic information of the persons to be recorded.
  • in the conference recording method provided by this embodiment, the MCU mixes the audio and sends it to the recording server, and the recording server synthesizes into one picture the video code streams, screened by the MCU, that need to be recorded for the multiple persons to be recorded, and finally records the meeting. This greatly reduces the required network bandwidth and saves the storage space of the recording server.
  • in addition, this embodiment performs real-time recording according to designated persons to be recorded, which avoids manually cutting the audio and video streams in post-production, saves labor costs, and improves conference recording efficiency.
  • the present application also provides a conference recording device.
  • the specific implementation of the device will be described below in conjunction with embodiments and drawings.
  • FIG. 7 shows a schematic structural diagram of the conference recording apparatus provided by this embodiment.
  • the device includes:
  • the code stream screening module 701 is used for screening the audio and video code streams to be recorded from the audio and video code streams sent by the terminals of each venue according to the characteristic information of the person to be recorded;
  • the code stream sending module 702 is configured to send the audio and video code streams to be recorded to the recording server so that the recording server can perform conference recording; the characteristic information includes: image information or sound information.
  • the conference recording device uses the characteristic information of the person to be recorded to realize automatic screening of the audio and video streams to be recorded, thereby eliminating the need for manual screening, saving labor costs for conference recording, and greatly improving conference recording efficiency.
  • the application of this device can improve the convenience of conference recording and promote the wide application of video conference functions.
  • the audio and video stream to be recorded can be the entire audio and video stream of the venue where the person to be recorded is located, or the personal audio and video stream of the person to be recorded.
  • the following describes the implementation of the code stream filtering module 701 based on these two situations.
  • the stream filtering module 701 specifically includes:
  • the site terminal screening unit is used to filter the site terminal corresponding to the person to be recorded according to the characteristic information of the person to be recorded and the audio and video stream sent by each site terminal;
  • the first code stream screening unit is used to take all the audio and video code streams sent by the screened venue terminal as the audio and video code streams to be recorded.
  • the stream filtering module 701 specifically includes:
  • the site terminal screening unit is used to filter the site terminal corresponding to the person to be recorded according to the characteristic information of the person to be recorded and the audio and video stream sent by each site terminal;
  • the second code stream screening unit is used to screen the individual audio and video code streams of the person to be recorded from the selected audio and video code streams sent by the conference site terminal as the audio and video code streams to be recorded according to the characteristic information of the person to be recorded.
  • specifically, the venue terminal screening unit can first determine, through matching, the venue terminal corresponding to the person to be recorded.
  • the venue terminal screening unit specifically includes:
  • the decoding subunit is used to decode the audio and video code streams sent by each conference site terminal to obtain decoded video code streams and audio code streams;
  • the venue terminal determination subunit is used to perform feature matching between the image information of the person to be recorded and the decoded video code streams to determine the venue terminal corresponding to the person to be recorded, or to perform feature matching between the voice information of the person to be recorded and the decoded audio code streams to determine the venue terminal corresponding to the person to be recorded.
  • venue terminals can be divided into two categories: single-stream advanced video coding (AVC) venue terminals and multi-stream scalable video coding (SVC) venue terminals. In practical applications, the venue terminals in a conference recording scene may all be AVC venue terminals, or may all be SVC venue terminals.
  • for these two types of venue terminals, the specific implementation of the conference recording device provided in the embodiments of the present application also differs correspondingly. The following describes the implementation of the conference recording device in the AVC venue terminal scenario and in the SVC venue terminal scenario.
  • the code stream screening module 701 specifically includes:
  • the third code stream screening unit is used to screen the audio and video code streams to be recorded from the audio and video code streams sent by at least two different AVC conference site terminals according to the characteristic information of the person to be recorded.
  • the audio and video stream to be recorded includes the video stream to be recorded and the audio stream to be recorded;
  • the stream sending module 702 specifically includes:
  • the picture synthesis unit is used to synthesize the video stream to be recorded to obtain a synthesized picture
  • the screen sending unit is used to send the composite screen to the recording server
  • the first mixing unit is used to mix the audio stream to be recorded
  • the first audio sending unit is used to send the mixed audio to the recording server.
  • the person to be recorded corresponds to at least two AVC site terminals, that is, there are at least two persons to be recorded.
  • the conference AS realizes the designation of the persons to be recorded for conference recording by issuing the characteristic information of the persons to be recorded.
  • in the conference recording device provided by this embodiment, the MCU synthesizes the picture, mixes the audio, and then sends the result to the recording server, which greatly reduces the required network bandwidth and saves the storage space of the recording server.
  • the application of the device provided in this embodiment can perform real-time recording according to the designated person to be recorded, avoiding the manual post-production audio stream and video stream cutting process, saving labor costs, and improving conference recording efficiency.
  • the code stream screening module 701 specifically includes:
  • the stream format notification unit is used to notify all SVC venue terminals of the stream format applicable to the recording server;
  • a code stream receiving unit configured to receive audio and video code streams in a code stream format suitable for the recording server and sent by at least two different SVC site terminals;
  • the fourth code stream screening unit is used to screen the audio and video code streams to be recorded from the audio and video code streams in the stream format applicable to the recording server according to the characteristic information of the person to be recorded.
  • the audio and video stream to be recorded includes the video stream to be recorded and the audio stream to be recorded;
  • the stream sending module 702 specifically includes:
  • the video code stream sending unit is used to send the video code streams that need to be recorded corresponding to at least two different SVC venue terminals to the recording server, so that the recording server performs picture synthesis on these video code streams to obtain a synthesized picture;
  • the second audio mixing unit is used for mixing audio code streams corresponding to at least two different SVC conference site terminals that need to be recorded;
  • the second audio sending unit is used to send the mixed audio to the recording server.
  • the persons to be recorded correspond to at least two SVC site terminals, that is, there are at least two persons to be recorded.
  • the conference AS realizes the designation of the persons to be recorded for conference recording by issuing the characteristic information of the persons to be recorded.
  • in the conference recording device provided in this embodiment, audio mixing is performed by the MCU and the result is sent to the recording server, while the recording server performs picture synthesis on the video streams to be recorded that the MCU screened for the multiple persons to be recorded, and finally records the conference; this greatly reduces network bandwidth and saves the storage space of the recording server.
  • the device provided by this embodiment can record in real time according to a designated person to be recorded, avoiding manual post-production cutting of audio and video streams, saving labor costs, and improving conference recording efficiency.
  • the code stream filtering module 701 in this embodiment specifically includes:
  • the fifth screening unit of the code stream is used to screen the audio and video code streams to be recorded from the audio and video code streams by using the pre-trained neural network model according to the characteristic information of the person to be recorded.
  • the neural network model is obtained by training using the characteristic information of a large number of different persons and materials containing the characteristic information of different persons (for example, pictures or audio files of persons to be recorded).
  • training a neural network model that can accurately identify a video stream carrying certain image information, or an audio stream carrying certain sound information, is a relatively mature technique, so this embodiment does not describe the specific training process of the neural network model in detail.
  • the present application also provides a conference recording system.
  • the specific implementation of the system will be described below in conjunction with the drawings and embodiments.
  • FIG. 8 is a schematic structural diagram of a conference recording system provided by an embodiment of the application.
  • the conference recording system includes: a multipoint control unit MCU, a recording server 801 and at least two conference site terminals.
  • the multipoint control unit MCU may specifically execute the conference recording method provided in the foregoing embodiment.
  • the venue terminals are used to send audio and video streams to the multipoint control unit MCU;
  • the multipoint control unit MCU is used to screen the audio and video streams to be recorded from the audio and video streams sent by the terminals of each venue according to the characteristic information of the person to be recorded (the characteristic information including image information or sound information), and to send the screened streams to the recording server 801;
  • the recording server 801 is used for conference recording according to the audio and video streams to be recorded.
  • the number of conference site terminals is at least two; all of them may uniformly be AVC site terminals, or uniformly SVC site terminals.
  • Each conference site terminal may be a terminal using the Session Initiation Protocol (SIP), or may be a terminal using the H.323 protocol.
  • no restriction is placed here on the communication protocol used by a conference site terminal.
  • the venue terminal 811 uses the SIP protocol to communicate with the MCU
  • the venue terminals 812 and 813 use the H.323 protocol to communicate with the MCU, respectively.
  • the conference recording system uses the characteristic information of the person to be recorded to realize automatic screening of the audio and video streams to be recorded, thereby eliminating the need for manual screening, saving labor costs for conference recording, and greatly improving conference recording efficiency.
  • the system enhances the convenience of conference recording and promotes the wide application of video conference functions.
  • the MCU also has the function of forwarding audio and video streams; the following describes an implementation scenario of this function in conjunction with Figure 8.
  • a conference site terminal requests the MCU to rebroadcast the conference of another conference site terminal.
  • the venue terminal 813 is used to send a request for rebroadcasting the meeting of the venue terminal 812 to the MCU.
  • the format of the audio and video stream that the venue terminal 813 can play may be one of the multiple stream formats that the venue terminal 812 can provide; therefore, the request sent by the venue terminal 813 carries the stream format of the audio and video stream that the venue terminal 813 can play.
  • the MCU is used to send a notification to the venue terminal 812 according to the code stream format in the request, so that the venue terminal 812 sends the audio and video code stream of the format to the MCU according to the notification.
  • the MCU is also used to forward the audio and video code stream in the code stream format that the venue terminal 813 can play sent by the venue terminal 812 to the venue terminal 813, so that the venue terminal 813 can play according to the audio and video code stream. Therefore, all participants in the conference site where the conference site terminal 813 is located can watch the conference held by the conference site terminal 812.
  • the recording server 801 also has the function of forwarding audio and video streams.
  • the following describes an implementation scenario of this function.
  • an example scenario in which the recording server 801 forwards audio and video streams: another server requests an on-demand or live conference from the recording server.
  • the recording server 801 is also used to receive on-demand requests or live broadcast requests from other servers.
  • the format of the audio and video streams that another server can play may be one of the multiple stream formats that each venue terminal can provide; therefore, the on-demand or live request sent by the other server can carry the stream format that the other server can play.
  • the recording server 801 is also used to notify the MCU of that stream format, so that the MCU notifies each venue terminal of it; when the MCU receives the audio and video streams conforming to the stream format sent by each conference site terminal, it sends them to the recording server.
  • the recording server 801 is also used to forward, based on the on-demand or live request and the audio and video streams sent by the MCU, the streams to the server that made the request.
  • for example, another server requests on-demand playback of the conference of site terminal 811 from the recording server 801, and the recording server 801 forwards to that server the audio and video stream from site terminal 811 that conforms to that server's stream format.
  • as another example, another server requests the recording server 801 to broadcast live the conferences of all participating site terminals, and the recording server 801 forwards to that server the audio and video streams from each site terminal that conform to that server's stream format; if the stream format suitable for the recording server 801 matches the stream format that the requesting server can play, the recording server can also forward while recording, in which case the forwarded content is the multi-picture (i.e. composite-picture) audio and video stream.
  • both the MCU and the recording server 801 can provide audio and video stream forwarding services, thereby enriching the overall functions of the conference recording system and improving the user experience.
  • the conference recording system provided in this embodiment may further include: a conference application server (conference AS) 802.
  • see FIG. 9, a schematic structural diagram of another conference recording system provided by this embodiment.
  • the dotted line connecting a site terminal and the conference AS 802 indicates that the site terminal is registered with the conference AS 802; the dotted line connecting the recording server 801 and the conference AS 802 indicates that the recording server 801 is registered with the conference AS 802; the dotted line connecting the conference AS 802 and the MCU indicates that the conference AS 802 delivers the characteristic information of the persons to be recorded to the MCU.
  • the conference AS 802 is used to provide users with the function of uploading materials carrying the characteristic information of the persons to be recorded (such as pictures or audio files of those persons), to process the materials to obtain the characteristic information, and to send the characteristic information of the persons to be recorded to the MCU.
  • the conference AS 802 is also used to store the materials corresponding to multiple persons and to process them to obtain each person's characteristic information; after a user logs in, the materials corresponding to each person are provided to the user, who selects the materials of the persons to be recorded, and according to the selection message the characteristic information of the persons corresponding to the selected materials is sent to the MCU.
  • the conference AS 802 may send the characteristic information of the persons to be recorded to the MCU when it convenes a conference with the MCU.
  • the conference AS 802 can provide users with the service of designating the persons to be recorded, which increases the convenience of subsequent conference recording; since only the designated persons are recorded, the storage space of the recording server 801 is saved, communication bandwidth is reduced, and the users' conference recording experience is improved.
  • "at least one (item)" refers to one or more, and "multiple" refers to two or more.
  • "at least one of the following items" or similar expressions refer to any combination of these items, including any combination of single items or plural items.
  • for example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.


Abstract

This application discloses a conference recording method, an apparatus, and a conference recording system. The method of this application includes: a multipoint control unit screens, according to characteristic information of a person to be recorded, an audio and video stream to be recorded from the audio and video streams sent by the site terminals, the characteristic information including image information or sound information; the multipoint control unit sends the audio and video stream to be recorded to a recording server, so that the recording server performs conference recording. Because the characteristic information of a person to be recorded (image information or sound information) differs from that of the other participants, the method uses that characteristic information to screen the audio and video streams to be recorded automatically, eliminating manual screening, saving the labor cost of conference recording, and greatly improving conference recording efficiency. The method makes conference recording more convenient and promotes wide application of the video conferencing function.

Description

Conference recording method and apparatus, and conference recording system — Technical field
This application relates to the field of multimedia technologies, and in particular to a conference recording method, an apparatus, and a conference recording system.
Background
Conference recording is an important function in the multimedia field: a recording server records the content of a conference convened by a multipoint control unit (MCU), for use in conference rebroadcast, replay, post-production, and so on. As video conferencing becomes ever more widely used, recording a video conference has become a frequent and important requirement.
At present, conference recording generally records the audio and video of the entire conference, but when a particular person's speech or picture needs to be found later, the recording must be browsed and edited manually. This approach incurs high labor costs and low recording efficiency.
Summary
To solve the above technical problem, this application provides a conference recording method, an apparatus, and a conference recording system that can automatically screen the audio and video streams to be recorded, improving conference recording efficiency and saving labor costs.
In a first aspect, this application provides a conference recording method including the following steps:
First, the multipoint control unit screens, according to the characteristic information of a person to be recorded, the audio and video streams to be recorded from the audio and video streams sent by the site terminals; the characteristic information may include image information or sound information. Then, the multipoint control unit sends the screened audio and video streams to be recorded to the recording server, and the recording server performs the conference recording.
Because each person's characteristic information differs from everyone else's, the streams screened using the characteristic information of a person to be recorded correspond accurately to that person. Applying this method, the streams to be recorded can therefore be screened out automatically for recording. The manual screening step is eliminated, so the labor cost of conference recording is reduced; automatic screening also effectively improves recording efficiency, makes conference recording more convenient, and provides an efficient basis for extending the application of the recording function.
In practice, recording requirements vary. For example, the requirement may be (1) to record the entire audio and video stream of the venue where the person to be recorded is located, or (2) to record only that person's own audio and video stream. The specific screening for these two requirements is described below.
For requirement (1), the multipoint control unit first screens the site terminal of the person to be recorded: it selects, according to the person's characteristic information and the streams sent by the site terminals, the site terminal corresponding to the person; it then takes all the audio and video streams sent by that terminal, i.e. the entire stream, as the streams to be recorded, thereby meeting requirement (1).
For requirement (2), the multipoint control unit likewise first screens the person's site terminal, as in requirement (1). It then screens, from the streams sent by that terminal and according to the person's characteristic information, the person's own audio and video streams. For example, if a person to be recorded speaks three times during a conference with intervals between the speeches, and the characteristic information is sound information, this application can screen out the three stream segments in which the person speaks and take them as the streams to be recorded, thereby meeting requirement (2).
It can be seen that the conference recording method provided by this application satisfies multiple recording requirements and is therefore widely applicable in conference recording scenarios. In particular, requirement (2) needs no manual editing, which makes later use of the recorded conference more convenient.
When screening site terminals, the following approach may be used:
First, the multipoint control unit decodes the streams sent by the site terminals to obtain decoded video streams and audio streams. If the characteristic information includes image information, the multipoint control unit performs feature matching between the image information and the decoded video streams to determine the site terminal corresponding to the person to be recorded; if the characteristic information includes sound information, the multipoint control unit performs feature matching between the sound information and the decoded audio streams to determine that site terminal.
The characteristic information thus determines uniquely and accurately the site terminal corresponding to the person to be recorded, i.e. the terminal configured at the person's venue. It follows that the person cannot be at any other site terminal, so the streams sent by the other terminals can be filtered out efficiently, reducing the analysis and processing load on the multipoint control unit.
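The terminal-determination step above can be illustrated as a similarity match between a person's feature vector and per-terminal features extracted from the decoded streams. This is a minimal sketch only: the patent does not specify an algorithm, and `match_terminal`, the embedding representation, and the 0.8 threshold are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_terminal(person_feature, terminal_features, threshold=0.8):
    """Return the id of the site terminal whose decoded stream best matches
    the person's feature vector (face or voiceprint embedding), or None if
    no terminal reaches the threshold.

    terminal_features maps terminal id -> feature vector extracted from that
    terminal's decoded video frames or audio."""
    best_id, best_score = None, threshold
    for term_id, feat in terminal_features.items():
        score = cosine(person_feature, feat)
        if score >= best_score:
            best_id, best_score = term_id, score
    return best_id
```

Once one terminal is matched, the streams of all other terminals can be discarded without further analysis, which is the load reduction described above.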
In practice, the multiple conference terminals in a conference scenario may uniformly be advanced video coding (AVC) site terminals, or uniformly scalable video coding (SVC) site terminals. The specific implementation of the method of this application for each type of terminal is described below.
Optionally, when the conference terminals are all AVC site terminals and the persons to be recorded correspond to at least two different AVC site terminals, the multipoint control unit first screens the streams to be recorded from the streams sent by the at least two different AVC terminals according to the persons' characteristic information. If the streams to be recorded include video streams to be recorded and audio streams to be recorded, the multipoint control unit performs picture synthesis on the video streams to obtain a composite picture, sends the composite picture to the recording server, and mixes the audio streams and sends the mixed audio to the recording server.
When the streams to be recorded are the personal streams of multiple persons to be recorded, with the conference recording method of this embodiment the multipoint control unit performs picture synthesis and audio mixing before sending to the recording server, greatly reducing network bandwidth and saving the recording server's storage space. Moreover, recording in real time according to the designated persons avoids manual post-production cutting of audio and video streams, saving labor costs and improving conference recording efficiency.
Optionally, when the conference terminals are SVC site terminals and the persons to be recorded correspond to at least two different SVC site terminals, the multipoint control unit first notifies all SVC site terminals of the stream format suitable for the recording server; it then receives from the at least two different SVC terminals audio and video streams in the format suitable for the recording server; finally, it screens the streams to be recorded from those streams according to the persons' characteristic information. In other words, the screened streams are in a format that the recording server can receive and process. If the streams to be recorded include video streams to be recorded and audio streams to be recorded, the multipoint control unit sends the video streams to be recorded corresponding to the at least two different SVC terminals to the recording server, which performs picture synthesis on them to obtain a composite picture; audio mixing is done by the multipoint control unit, which mixes the audio streams to be recorded of the at least two different SVC terminals and sends the mixed audio to the recording server.
When the streams to be recorded are the personal streams of multiple persons to be recorded, with the conference recording method of this embodiment the multipoint control unit performs audio mixing and sends the result to the recording server, while the recording server performs picture synthesis on the video streams screened by the MCU for the multiple persons and finally records the conference, greatly reducing network bandwidth and saving the recording server's storage space. In addition, this embodiment records in real time according to the designated persons, avoiding manual post-production cutting of audio and video streams, saving labor costs, and improving conference recording efficiency.
Optionally, the multipoint control unit screening the streams to be recorded from the audio and video streams according to the persons' characteristic information specifically includes:
the multipoint control unit screening the streams to be recorded from the audio and video streams by using a pre-trained neural network model, according to the characteristic information of the persons to be recorded.
Using a neural network model for stream screening improves screening efficiency and thus the overall speed of the conference recording process, improving the user's recording experience.
In a second aspect, this application provides a conference recording apparatus comprising a stream screening module and a stream sending module. The stream screening module screens, according to the characteristic information of the persons to be recorded, the streams to be recorded from the audio and video streams sent by the site terminals; the stream sending module sends the streams to be recorded to the recording server, so that the recording server performs the conference recording.
The apparatus uses the characteristic information to accurately screen out the streams to be recorded, which match the persons to be recorded, thereby realizing automatic screening. Compared with manual screening and recording, efficiency is greatly improved and labor costs are saved.
Optionally, when the streams to be recorded are the entire streams of the venue where the person to be recorded is located, the stream screening module specifically includes:
a site terminal screening unit, configured to screen the site terminal corresponding to the person to be recorded according to the person's characteristic information and the streams sent by each site terminal;
a first stream screening unit, configured to take all the streams sent by the screened site terminal as the streams to be recorded.
Optionally, when the streams to be recorded are the person's own streams, the stream screening module specifically includes:
a site terminal screening unit, configured to screen the site terminal corresponding to the person to be recorded according to the person's characteristic information and the streams sent by each site terminal;
a second stream screening unit, configured to screen, from the streams sent by the screened site terminal and according to the person's characteristic information, the person's own streams as the streams to be recorded.
Optionally, the site terminal screening unit specifically includes:
a decoding subunit, configured to decode the streams sent by each site terminal to obtain decoded video streams and audio streams;
a site terminal determining subunit, configured to perform feature matching between the person's image information and the decoded video streams, or between the person's sound information and the decoded audio streams, to determine the site terminal corresponding to the person to be recorded.
Optionally, the conference terminals are advanced video coding (AVC) site terminals; when the persons to be recorded correspond to at least two different AVC site terminals, the stream screening module specifically includes:
a third stream screening unit, configured to screen the streams to be recorded from the streams sent by the at least two different AVC site terminals respectively, according to the persons' characteristic information.
Optionally, the streams to be recorded include video streams to be recorded and audio streams to be recorded; the stream sending module specifically includes:
a picture synthesis unit, configured to perform picture synthesis on the video streams to be recorded to obtain a composite picture;
a picture sending unit, configured to send the composite picture to the recording server;
a first mixing unit, configured to mix the audio streams to be recorded;
a first audio sending unit, configured to send the mixed audio to the recording server.
Optionally, the conference terminals are scalable video coding (SVC) site terminals; when the persons to be recorded correspond to at least two different SVC site terminals, the stream screening module specifically includes:
a stream format notification unit, configured to notify all SVC site terminals of the stream format suitable for the recording server;
a stream receiving unit, configured to receive, from the at least two different SVC site terminals, streams in the format suitable for the recording server;
a fourth stream screening unit, configured to screen the streams to be recorded from those streams according to the persons' characteristic information.
Optionally, the streams to be recorded include video streams to be recorded and audio streams to be recorded; the stream sending module specifically includes:
a video stream sending unit, configured to send the video streams to be recorded corresponding to the at least two different SVC site terminals to the recording server, so that the recording server performs picture synthesis on them to obtain a composite picture;
a second mixing unit, configured to mix the audio streams to be recorded corresponding to the at least two different SVC site terminals;
a second audio sending unit, configured to send the mixed audio to the recording server.
Optionally, the stream screening module specifically includes:
a fifth stream screening unit, configured to screen the streams to be recorded from the audio and video streams by using a pre-trained neural network model, according to the persons' characteristic information.
In a third aspect, this application provides a conference recording system including a multipoint control unit, a recording server, and at least two site terminals;
the site terminals being configured to send audio and video streams to the multipoint control unit;
the multipoint control unit being configured to screen, according to the characteristic information of the persons to be recorded, the streams to be recorded from the streams sent by the site terminals, the characteristic information including image information or sound information, and to send the streams to be recorded to the recording server;
the recording server being configured to perform the conference recording according to the streams to be recorded.
The system uses the characteristic information to accurately screen the streams to be recorded from the many streams provided by the site terminals; the screened streams match the persons to be recorded, realizing automatic screening. Compared with manual screening and recording, efficiency is greatly improved and labor costs are saved.
It can be seen from the above technical solutions that the embodiments of this application have the following advantages:
Because the characteristic information of a person to be recorded (image or sound information) differs from that of the other participants — that is, it corresponds uniquely to that person — the multipoint control unit MCU can accurately screen the streams to be recorded from the streams sent by the venues according to that information. The MCU then sends the streams to be recorded to the recording server, which records the conference from the streams received from the MCU. The method uses the persons' characteristic information to screen the streams to be recorded automatically, eliminating manual screening, saving the labor cost of conference recording, and greatly improving recording efficiency; it makes conference recording more convenient and promotes wide application of the video conferencing function.
Brief description of the drawings
Figure 1 is a schematic diagram of a conference recording scenario provided by an embodiment of this application;
Figure 2 is a flowchart of a conference recording method provided by an embodiment of this application;
Figure 3 is a flowchart of a multipoint control unit obtaining the audio and video streams to be recorded, provided by this embodiment;
Figure 4 is a flowchart of another way for a multipoint control unit to obtain the audio and video streams to be recorded, provided by an embodiment of this application;
Figure 5 is a signaling diagram of a conference recording method provided by an embodiment of this application;
Figure 6 is a signaling diagram of another conference recording method provided by an embodiment of this application;
Figure 7 is a schematic structural diagram of a conference recording apparatus provided by an embodiment of this application;
Figure 8 is a schematic structural diagram of a conference recording system provided by an embodiment of this application;
Figure 9 is a schematic structural diagram of another conference recording system provided by an embodiment of this application.
Detailed description of the embodiments
When recording a video conference today, it is often necessary to manually search the conference streams for a particular person. If that person appears or speaks only at the end of the conference, the entire conference must be searched manually to confirm that the person attended. Current conference recording methods thus rely heavily on manual operation, which consumes considerable labor and is inefficient.
In view of this problem, the inventor, after research, provides a conference recording method, an apparatus, and a conference recording system. In this application, the streams of multiple venues are screened using the characteristic information of the persons to be recorded. Because each person's characteristic information (image information or sound information) is distinct, the streams containing a person's characteristic information can be identified accurately, realizing automatic screening of the audio and video streams. The MCU screens out the streams to be recorded and sends them to the recording server, which records the conference from the streams received from the MCU. The technical solution provided by this application saves labor costs while improving conference recording efficiency.
To facilitate understanding of the technical solution, the application scenario of the conference recording method is described below with reference to the drawings. See Figure 1, a schematic diagram of a conference recording scenario provided by an embodiment of this application.
As shown in Figure 1, the application scenario includes a multipoint control unit (MCU), a recording server, a conference application server (AS), and multiple site terminals. In practice there may be two or more site terminals; Figure 1 shows three only as an example, and this embodiment does not limit the number of site terminals in a recording scenario. The three terminals in Figure 1 are site terminal 1, site terminal 2, and site terminal 3, each belonging to a different venue.
The MCU and the recording server are in the same local area network. In the conference recording scenario, the conference AS acts as the video service management platform: a user books a conference through the conference AS and provides or designates materials containing the characteristic information of the persons to be recorded. Here, the characteristic information may include image information or sound information — that is, image information only, sound information only, or both.
As one example, the user may upload to the conference AS materials containing the person's characteristic information, such as pictures or audio files of the person to be recorded. The conference AS processes a picture to obtain the person's image information, which may specifically be image features such as facial features; it processes an audio file to obtain the person's sound information, which may specifically be voiceprint features. It can be understood that the characteristic information of a person to be recorded differs from that of other persons — it corresponds uniquely to the person and can uniquely identify the person to be recorded.
As another example, the conference AS stores materials corresponding to multiple persons, such as each person's pictures or audio files, and has pre-processed them to obtain each person's image and sound information. On the conference AS the user can select, as needed, from the candidate persons' materials those containing the characteristic information of the persons to be recorded, for example by selecting a person's picture or audio file. From the user's selection the conference AS determines the designated persons to be recorded.
The conference AS convenes a conference with the multipoint control unit MCU and, when convening it, delivers the characteristic information of the persons to be recorded to the MCU. The MCU calls site terminals 1, 2, and 3 into the conference; after joining, each terminal sends audio and video streams to the MCU. According to the characteristic information delivered by the conference AS, the MCU screens the streams to be recorded from the streams sent by the site terminals.
For example, suppose the person to be recorded actually attends at site terminal 2, so in the streams terminal 2 sends to the MCU, the video stream carries the person's image information and the audio stream carries the person's sound information. If the characteristic information is image information, the MCU can determine, by matching the image information against the video streams sent by the terminals, that the person attends at terminal 2, screen out the streams sent by terminal 2, and send them to the recording server so that the recording server records the conference. If the characteristic information is sound information, the MCU determines this in the same way by matching against the audio streams sent by the terminals.
The conference recording method provided by this application is described below with reference to the drawings and embodiments.
Method embodiment one
See Figure 2, a flowchart of the conference recording method provided by this embodiment. The method applies to the multipoint control unit MCU in a conference recording scenario.
As shown in Figure 2, the conference recording method of this embodiment includes:
Step 201: the multipoint control unit screens, according to the characteristic information of the persons to be recorded, the streams to be recorded from the streams sent by the site terminals.
As introduced in the scenario description above, in this embodiment the characteristic information includes image information or sound information. That is, the MCU can screen the streams to be recorded using the person's image information alone, using the sound information alone, or using both together. Understandably, using both the image information and the sound information together improves the accuracy of stream screening and reduces the screening error rate.
In practice, there may be one or more persons to be recorded. If the characteristic information the MCU receives from the conference AS belongs to one person, there is only one person to be recorded; if it belongs to multiple different persons, there are multiple. In practice, the multiple persons may be at the same venue, jointly corresponding to one site terminal, or at different venues, corresponding to different site terminals.
When there are multiple persons to be recorded, the actual requirement may be to record a multi-picture conference in which each picture corresponds to a different person to be recorded.
When there is only one person to be recorded, there may be various requirements; several examples follow:
(1) record the complete conference the person to be recorded attends;
(2) record the conference only while the person to be recorded speaks;
(3) record the conference while the person to be recorded appears.
Understandably, different recording requirements yield different screened streams in practice. As one example, for requirement (1), the screened streams are specifically the entire streams of the person's venue during the conference. As another example, for requirement (2), the screened streams are specifically the streams while the person to be recorded speaks.
As one possible implementation, in this embodiment the MCU may use a pre-trained neural network model to screen the streams to be recorded from the audio and video streams according to the persons' characteristic information. The neural network model is obtained by training with the characteristic information of a large number of different persons and with materials containing that information (for example, pictures or audio files of persons to be recorded). For those skilled in the art, training a neural network model that can accurately identify a video stream carrying certain image information, or an audio stream carrying certain sound information, is a relatively mature technique, so this embodiment does not detail the specific training process.
Step 202: the multipoint control unit sends the streams to be recorded to the recording server, so that the recording server performs the conference recording.
Through screening, the MCU has obtained the streams to be recorded. The recording server is a server used together with the MCU and the site terminals that can synchronously record video, audio, and computer screen signals; so in this step, after the MCU sends the streams to be recorded to the recording server, the recording server records the conference from them.
The above is the conference recording method provided by the embodiments of this application. Because the characteristic information of a person to be recorded (image or sound information) differs from that of the other participants — that is, it corresponds uniquely to the person — the multipoint control unit MCU can accurately screen the streams to be recorded from the streams sent by the venues. It then sends the streams to be recorded to the recording server, which records the conference from them. The method uses the persons' characteristic information to screen the streams to be recorded automatically, eliminating manual screening, saving the labor cost of conference recording, and greatly improving recording efficiency; it makes conference recording more convenient and promotes wide application of the video conferencing function.
In practice, requirement (1) above means that the streams to be recorded are the entire streams of the venue where the person to be recorded is located; requirements (2) and (3) mean that they are the person's own streams. Step 201 is implemented differently for these different streams to be recorded. The detailed flow of step 201 when the streams to be recorded are the entire venue streams is described below with Figure 3, and when they are the person's own streams, with Figure 4.
See Figure 3, a flowchart of a multipoint control unit obtaining the streams to be recorded, provided by this embodiment.
As shown in Figure 3, when the streams to be recorded are the entire streams of the person's venue, the multipoint control unit obtaining the streams to be recorded specifically includes:
Step 301: the multipoint control unit decodes the streams sent by each site terminal to obtain decoded video streams and audio streams.
In this embodiment, the streams a site terminal sends to the MCU may specifically be real-time transport protocol (RTP) audio and video streams. To facilitate subsequent screening of the streams to be recorded, the streams are decoded in advance. For those skilled in the art, decoding audio and video streams is a mature technique, so the decoding process is not detailed here. Through decoding, the MCU obtains separately processable video streams and audio streams. Understandably, the video and audio streams obtained by decoding are linked in time: for example, if site terminal 1 sends the MCU the streams of its venue from time T1 to time T2, the MCU decodes the video stream and the audio stream of that venue from T1 to T2.
Step 302: the multipoint control unit performs feature matching between the person's image information and the decoded video streams to determine the site terminal corresponding to the person to be recorded, or performs feature matching between the person's sound information and the decoded audio streams to determine that site terminal.
Understandably, different venues are at different geographic locations, and during the conference a given person is at one fixed venue and cannot appear at another. As an example, if site terminal 1 belongs to the venue of the person to be recorded, then the terminal sending streams carrying that person's characteristic information must be site terminal 1 and cannot be another terminal. Therefore, to screen the streams to be recorded, this embodiment only needs to determine the terminal transmitting the person's characteristic information.
If the characteristic information the MCU receives from the conference AS includes only image information, in this step the MCU matches the person's image information against the decoded video streams to determine the person's site terminal. If it includes only sound information, the MCU matches the person's sound information against the decoded audio streams to determine the terminal. Understandably, if the characteristic information includes both image and sound information, using both together to match the person's terminal improves the accuracy and reliability of the match and reduces the error rate.
Steps 301-302 above realize the multipoint controller's screening of the site terminal corresponding to the person to be recorded. In other words, the terminal finally matched in step 302 is screened from the multiple site terminals by the multipoint control unit according to the person's characteristic information and the streams sent by the terminals.
Step 303: take all the streams sent by the screened site terminal as the streams to be recorded.
Since the requirement is the aforementioned requirement (1), i.e. recording the complete conference the person attends, when screening the streams to be recorded, all the streams sent by the terminal screened in step 302 are taken directly as the streams to be recorded.
By executing the flow of Figure 3, the MCU screens out the entire streams of the person's venue and sends them to the recording server for conference recording, satisfying the aforementioned requirement (1).
See Figure 4, a flowchart of another way for a multipoint control unit to obtain the streams to be recorded, provided by this embodiment.
As shown in Figure 4, when the streams to be recorded are the person's own streams, the multipoint control unit obtaining the streams to be recorded specifically includes:
Step 401: the multipoint control unit decodes the streams sent by each site terminal to obtain decoded video streams and audio streams.
Step 402: the multipoint control unit performs feature matching between the person's image information and the decoded video streams to determine the site terminal corresponding to the person to be recorded, or performs feature matching between the person's sound information and the decoded audio streams to determine that site terminal.
In this embodiment, steps 401-402 are implemented in the same way as the foregoing steps 301-302; for their description refer to steps 301-302, not repeated here.
Step 403: screen, from the streams sent by the screened site terminal and according to the person's characteristic information, the person's own streams as the streams to be recorded.
If the characteristic information the MCU receives from the conference AS includes only sound information, the actual requirement is likely requirement (2): recording the conference only while the person speaks. For this requirement, the person's sound information can be used to determine, from the audio stream from time T1 to time T2 decoded in step 401, the audio stream from time T3 to time T4 that contains the person's sound information (conference start T1 is earlier than conference end T2; speech start T3 is earlier than speech end T4; T3 is no earlier than T1 and earlier than T2; T4 is no later than T2). Because the decoded video and audio streams are linked in time, the video stream from T3 to T4 can be obtained correspondingly from the audio stream from T3 to T4. The audio and video streams from T3 to T4 are collectively the person's own streams, and they satisfy requirement (2).
If the characteristic information the MCU receives from the conference AS includes only image information, the actual requirement is likely requirement (3): recording the conference while the person appears. For this requirement, the person's image information can be used to determine, from the video stream from T1 to T2 decoded in step 401, the video stream from time T5 to time T6 that contains the person's image information (T1 is earlier than T2; the person appears from T5 to T6; T5 is earlier than T6; T5 is no earlier than T1; T6 is no later than T2). Because the decoded video and audio streams are linked in time, the audio stream from T5 to T6 can be obtained correspondingly from the video stream from T5 to T6. The audio and video streams from T5 to T6 are collectively the person's own streams, and they satisfy requirement (3).
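The T3-T4 / T5-T6 selection above can be illustrated as two small operations: collapse per-interval match results into time segments, then use the shared timeline to slice the other stream. This is a sketch under assumed data shapes (boolean match flags per fixed interval, frames as `(timestamp, payload)` pairs); the helper names are hypothetical.

```python
def matched_segments(match_flags, t0=0, step=1):
    """Collapse a per-interval sequence of booleans (True where the person's
    feature matched) into [start, end) time segments.

    match_flags[i] covers time [t0 + i*step, t0 + (i+1)*step)."""
    segments, start = [], None
    for i, hit in enumerate(match_flags):
        t = t0 + i * step
        if hit and start is None:
            start = t
        elif not hit and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, t0 + len(match_flags) * step))
    return segments

def slice_stream(frames, segments):
    """Keep only frames whose timestamp falls inside a matched segment.
    Because the decoded audio and video share one timeline, the same
    segments select both streams."""
    return [(t, p) for t, p in frames if any(s <= t < e for s, e in segments)]
```

For requirement (2) the flags come from voiceprint matching on the audio and the segments also slice the video; for requirement (3) the flags come from image matching on the video and the segments also slice the audio.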
An example application scenario of the method of this embodiment follows. In this scenario there are multiple persons to be recorded, all at the same venue, i.e. corresponding to the same site terminal. The specific requirement is to record the conference only while the persons to be recorded speak; the characteristic information the multipoint control unit obtains includes each person's sound information. In implementation, the multipoint control unit MCU decodes the streams sent by each site terminal to obtain decoded video streams and audio streams; matches each person's sound information against the decoded audio streams to determine the single site terminal corresponding to all the persons; screens, from the streams sent by that terminal and according to each person's sound information, each person's own streams, which together form the streams to be recorded; and finally sends these streams to be recorded to the recording server for conference recording.
By single-stream versus multi-stream conference capability, site terminals fall into two classes: single-stream advanced video coding (AVC) site terminals and multi-stream scalable video coding (SVC) site terminals. In practice, the site terminals in a conference recording scenario may uniformly be AVC terminals or uniformly SVC terminals. Since the conference capabilities of AVC and SVC site terminals differ, the operations performed by the multipoint control unit MCU in the method of this application differ accordingly. Two embodiments below describe the conference recording method in the AVC site terminal scenario and in the SVC site terminal scenario respectively.
Method embodiment two (AVC site terminal scenario)
See Figure 5, a signaling diagram of a conference recording method provided by this embodiment. The scenario of Figure 5 includes a multipoint control unit MCU, a conference AS, a recording server, and multiple AVC site terminals: AVC 1, AVC 2, and AVC 3.
The conference recording method of Figure 5 includes the following steps:
Step 501: the conference AS convenes a conference with the multipoint control unit MCU according to the user's booking and delivers the characteristic information of the persons to be recorded.
In this embodiment, the persons to be recorded correspond to at least two different AVC site terminals; that is, the characteristic information delivered by the conference AS belongs to at least two persons to be recorded. As an example, the conference AS delivers the characteristic information of person Role1 and of person Role2.
Step 502: the multipoint control unit calls all AVC site terminals into the conference: AVC 1, AVC 2, and AVC 3.
Step 503: AVC 1, AVC 2, and AVC 3 join the conference and each sends audio and video streams to the MCU.
Step 504: the MCU calls the recording server into the conference.
Step 505: the MCU decodes the streams sent by AVC 1, AVC 2, and AVC 3 respectively.
In this step the MCU decodes the streams sent by AVC 1, AVC 2, and AVC 3 to obtain the audio streams and video streams sent by each, so that the streams can subsequently be screened using the characteristic information of Role1 and Role2.
Step 506: the MCU screens the streams to be recorded from the streams sent by AVC 1, AVC 2, and AVC 3 according to the persons' characteristic information.
In implementation, this step can screen the site terminal corresponding to each person according to the persons' characteristic information and the decoded audio and video streams of the terminals; for the specific screening process refer to the foregoing embodiments, not repeated here.
As an example, according to Role1's characteristic information the terminal screened for Role1 is AVC 1, and according to Role2's characteristic information the terminal screened for Role2 is AVC 2. Understandably, to screen the streams to be recorded, the MCU may screen by designating a video source. For example, the MCU designates AVC 1's video source name and filters out the non-matching audio and video streams by that name; the streams finally obtained are the audio and video streams to be recorded for Role1. The MCU designates AVC 2's video source name and filters out the non-matching streams likewise; the streams finally obtained are the audio and video streams to be recorded for Role2.
Step 507: the MCU performs picture synthesis on the video streams to be recorded to obtain a composite picture, sends the composite picture to the recording server, and mixes the audio streams to be recorded and sends the mixed audio to the recording server.
In practice, the MCU has the function of performing picture synthesis on the video streams sent by AVC site terminals and the function of mixing the audio streams sent by AVC site terminals. Since in this embodiment the persons to be recorded correspond to at least two different AVC terminals, the MCU can synthesize the video streams to be recorded obtained in the previous step for the multiple persons (e.g. Role1 and Role2) and mix the audio streams to be recorded obtained in the previous step for the multiple persons.
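Step 507's two operations — mixing the audio streams and composing the video streams into one picture — can be sketched on toy data. These are assumptions for illustration only: audio is represented as lists of 16-bit PCM samples and video frames as equal-sized 2-D pixel lists; a real MCU operates on decoded media buffers, not Python lists.

```python
def mix_audio(streams):
    """Mix equal-rate PCM sample lists by summing samples and clipping to
    the 16-bit range, truncating to the shortest stream."""
    n = min(len(s) for s in streams)
    mixed = []
    for i in range(n):
        v = sum(s[i] for s in streams)
        mixed.append(max(-32768, min(32767, v)))
    return mixed

def compose_grid(frames, cols=2):
    """Arrange per-terminal frames (2-D pixel lists of equal size) into a
    row-major grid, forming one composite picture."""
    h = len(frames[0])
    rows = []
    for r in range(0, len(frames), cols):
        group = frames[r:r + cols]
        for y in range(h):
            row = []
            for f in group:
                row.extend(f[y])  # place the frames of this group side by side
            rows.append(row)
    return rows
```

The composite picture and the mixed audio are then what the MCU sends to the recording server in step 507.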
Step 508: the recording server records the conference after receiving the streams.
In this embodiment, the persons to be recorded correspond to at least two AVC site terminals, i.e. there are at least two persons to be recorded. By delivering the persons' characteristic information, the conference AS designates the persons to be recorded for the conference recording. When the streams to be recorded are the personal streams of multiple persons, with the conference recording method of this embodiment the MCU performs picture synthesis and audio mixing and sends the results to the recording server, greatly reducing network bandwidth and saving the recording server's storage space. In addition, this embodiment records in real time according to the designated persons, avoiding manual post-production cutting of audio and video streams, saving labor costs, and improving conference recording efficiency.
Method embodiment three (SVC site terminal scenario)
See Figure 6, a signaling diagram of another conference recording method provided by this embodiment. The scenario of Figure 6 includes a multipoint control unit MCU, a conference AS, a recording server, and multiple SVC site terminals: SVC 1, SVC 2, and SVC 3.
The conference recording method of Figure 6 includes the following steps:
Step 601: the conference AS convenes a conference with the multipoint control unit MCU according to the user's booking and delivers the characteristic information of the persons to be recorded.
In this embodiment, the persons to be recorded correspond to at least two different SVC site terminals; that is, the characteristic information delivered by the conference AS belongs to at least two persons to be recorded. As an example, the conference AS delivers the characteristic information of person Role3 and of person Role4.
Step 602: the multipoint control unit calls all SVC site terminals into the conference, including SVC 1, SVC 2, and SVC 3, and notifies SVC 1, SVC 2, and SVC 3 of the stream format suitable for the recording server.
In practice, each SVC site terminal can provide the MCU with streams in different formats. However, the recording server can usually record from streams in only one of those formats, and streams in other formats are unsuitable for it. To improve transmission efficiency and prevent the SVC terminals from sending the MCU streams in formats unsuitable for the recording server, in this embodiment each SVC terminal must be notified in advance of the format suitable for the recording server.
Step 603: SVC 1, SVC 2, and SVC 3 join the conference and each sends the MCU streams in the format suitable for the recording server.
Step 604: the MCU calls the recording server into the conference.
Step 605: the MCU decodes the streams sent by SVC 1, SVC 2, and SVC 3 respectively.
In this step the MCU decodes the streams sent by SVC 1, SVC 2, and SVC 3 to obtain the audio streams and video streams sent by each, so that the streams can subsequently be screened using the characteristic information of Role3 and Role4.
Step 606: the MCU screens the streams to be recorded from the streams, in the format suitable for the recording server, sent by SVC 1, SVC 2, and SVC 3, according to the persons' characteristic information.
In implementation, this step can screen the site terminal corresponding to each person according to the persons' characteristic information and the decoded audio and video streams of the terminals; for the specific screening process refer to the foregoing embodiments, not repeated here.
As an example, according to Role3's characteristic information the terminal screened for Role3 is SVC 3, and according to Role4's characteristic information the terminal screened for Role4 is SVC 2. Understandably, to screen the streams to be recorded, the MCU may screen by designating a video source. For example, the MCU designates SVC 3's video source name and filters out the non-matching audio and video streams by that name; the streams finally obtained are the audio and video streams to be recorded for Role3. The MCU designates SVC 2's video source name and filters out the non-matching streams likewise; the streams finally obtained are the audio and video streams to be recorded for Role4.
Step 607: the MCU sends the video streams to be recorded corresponding to the at least two different SVC site terminals to the recording server, and mixes the audio streams to be recorded corresponding to the at least two different SVC site terminals and sends the mixed audio to the recording server.
Continuing the example, this step sends the video streams to be recorded from SVC 3 and SVC 2 in the previous step to the recording server, because the recording server has the function of synthesizing the video streams of multiple SVC terminals into a picture. The MCU has the function of mixing the audio streams sent by SVC terminals, so the audio streams to be recorded from SVC 3 and SVC 2 are mixed by the MCU.
Step 608: the recording server performs picture synthesis on the video streams to be recorded corresponding to the at least two different SVC site terminals to obtain a composite picture, and records the conference from the composite picture and the mixed audio stream.
In this embodiment, the persons to be recorded correspond to at least two SVC site terminals, i.e. there are at least two persons to be recorded. By delivering the persons' characteristic information, the conference AS designates the persons to be recorded for the conference recording. When the streams to be recorded are the personal streams of multiple persons, with the conference recording method of this embodiment the MCU performs audio mixing and sends the result to the recording server, while the recording server performs picture synthesis on the video streams screened by the MCU for the multiple persons and finally records the conference, greatly reducing network bandwidth and saving the recording server's storage space. In addition, this embodiment records in real time according to the designated persons, avoiding manual post-production cutting of audio and video streams, saving labor costs, and improving conference recording efficiency.
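The division of labor between the two embodiments (with AVC terminals the MCU both composes the picture and mixes the audio; with SVC terminals the MCU only mixes and the recording server composes) can be stated as a small dispatch table. The function name and dictionary shape are illustrative assumptions.

```python
def plan_recording(terminal_type):
    """Return which component performs picture synthesis and which performs
    audio mixing, following method embodiments two and three."""
    if terminal_type == "AVC":
        return {"compose": "MCU", "mix": "MCU"}
    if terminal_type == "SVC":
        return {"compose": "recording_server", "mix": "MCU"}
    raise ValueError("unknown terminal type: " + terminal_type)
```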
Based on the conference recording method provided by the foregoing embodiments, this application correspondingly also provides a conference recording apparatus. Its specific implementation is described below in conjunction with the embodiments and the drawings.
Apparatus embodiment
See Figure 7, a schematic structural diagram of the conference recording apparatus provided by this embodiment. As shown in Figure 7, the apparatus includes:
a stream screening module 701, configured to screen, according to the characteristic information of the persons to be recorded, the streams to be recorded from the streams sent by each site terminal;
a stream sending module 702, configured to send the streams to be recorded to the recording server so that the recording server performs the conference recording; the characteristic information includes image information or sound information.
In this embodiment, the conference recording apparatus uses the persons' characteristic information to screen the streams to be recorded automatically, eliminating manual screening, saving the labor cost of conference recording, and greatly improving recording efficiency. Applying the apparatus makes conference recording more convenient and promotes wide application of the video conferencing function.
In practice, depending on the recording requirement, the streams to be recorded may be the entire streams of the venue where the person to be recorded is located, or the person's own streams. The implementation of the stream screening module 701 in each of these cases is described below.
When the streams to be recorded are the entire streams of the person's venue, the stream screening module 701 specifically includes:
a site terminal screening unit, configured to screen the site terminal corresponding to the person to be recorded according to the person's characteristic information and the streams sent by each site terminal;
a first stream screening unit, configured to take all the streams sent by the screened site terminal as the streams to be recorded.
When the streams to be recorded are the person's own streams, the stream screening module 701 specifically includes:
a site terminal screening unit, configured to screen the site terminal corresponding to the person to be recorded according to the person's characteristic information and the streams sent by each site terminal;
a second stream screening unit, configured to screen, from the streams sent by the screened site terminal and according to the person's characteristic information, the person's own streams as the streams to be recorded.
From the above description, as one possible implementation, whether the streams to be recorded are the entire venue streams or the person's own streams, the apparatus's site terminal screening unit may first determine the person's venue by matching. Specifically, the site terminal screening unit includes:
a decoding subunit, configured to decode the streams sent by each site terminal to obtain decoded video streams and audio streams;
a site terminal determining subunit, configured to perform feature matching between the person's image information and the decoded video streams, or between the person's sound information and the decoded audio streams, to determine the site terminal corresponding to the person to be recorded.
By single-stream versus multi-stream conference capability, site terminals fall into two classes: single-stream advanced video coding (AVC) site terminals and multi-stream scalable video coding (SVC) site terminals. In practice, the terminals in a conference recording scenario may uniformly be AVC terminals or uniformly SVC terminals. Since the conference capabilities of AVC and SVC site terminals differ, the specific implementation of the conference recording apparatus provided by this embodiment differs accordingly. The implementations of the apparatus in the AVC scenario and in the SVC scenario are described below.
When the persons to be recorded correspond to at least two different advanced video coding (AVC) site terminals, the stream screening module 701 specifically includes:
a third stream screening unit, configured to screen the streams to be recorded from the streams sent by the at least two different AVC site terminals respectively, according to the persons' characteristic information.
The streams to be recorded include video streams to be recorded and audio streams to be recorded; the stream sending module 702 specifically includes:
a picture synthesis unit, configured to perform picture synthesis on the video streams to be recorded to obtain a composite picture;
a picture sending unit, configured to send the composite picture to the recording server;
a first mixing unit, configured to mix the audio streams to be recorded;
a first audio sending unit, configured to send the mixed audio to the recording server.
In this embodiment, the persons to be recorded correspond to at least two AVC site terminals, i.e. there are at least two persons to be recorded. By delivering the persons' characteristic information, the conference AS designates the persons to be recorded for the conference recording. When the streams to be recorded are the personal streams of multiple persons, with the conference recording apparatus of this embodiment the MCU performs picture synthesis and audio mixing and sends the results to the recording server, greatly reducing network bandwidth and saving the recording server's storage space. In addition, applying this apparatus, recording can be performed in real time according to the designated persons, avoiding manual post-production cutting of audio and video streams, saving labor costs, and improving conference recording efficiency.
When the persons to be recorded correspond to at least two different scalable video coding (SVC) site terminals, the stream screening module 701 specifically includes:
a stream format notification unit, configured to notify all SVC site terminals of the stream format suitable for the recording server;
a stream receiving unit, configured to receive, from the at least two different SVC site terminals, streams in the format suitable for the recording server;
a fourth stream screening unit, configured to screen the streams to be recorded from the streams in the format suitable for the recording server, according to the persons' characteristic information.
The streams to be recorded include video streams to be recorded and audio streams to be recorded; the stream sending module 702 specifically includes:
a video stream sending unit, configured to send the video streams to be recorded corresponding to the at least two different SVC site terminals to the recording server, so that the recording server performs picture synthesis on them to obtain a composite picture;
a second mixing unit, configured to mix the audio streams to be recorded corresponding to the at least two different SVC site terminals;
a second audio sending unit, configured to send the mixed audio to the recording server.
In this embodiment, the persons to be recorded correspond to at least two SVC site terminals, i.e. there are at least two persons to be recorded. By delivering the persons' characteristic information, the conference AS designates the persons to be recorded for the conference recording. When the streams to be recorded are the personal streams of multiple persons, with the conference recording apparatus of this embodiment the MCU performs audio mixing and sends the result to the recording server, while the recording server performs picture synthesis on the video streams screened by the MCU for the multiple persons and finally records the conference, greatly reducing network bandwidth and saving the recording server's storage space. In addition, applying this apparatus, recording can be performed in real time according to the designated persons, avoiding manual post-production cutting of audio and video streams, saving labor costs, and improving conference recording efficiency.
As one possible implementation, the stream screening module 701 in this embodiment specifically includes:
a fifth stream screening unit, configured to screen the streams to be recorded from the audio and video streams by using a pre-trained neural network model, according to the persons' characteristic information.
The neural network model is obtained by training with the characteristic information of a large number of different persons and with materials containing that information (for example, pictures or audio files of persons to be recorded).
For those skilled in the art, training a neural network model that can accurately identify a video stream carrying certain image information, or an audio stream carrying certain sound information, is a relatively mature technique, so this embodiment does not detail the specific training process.
Based on the conference recording method and apparatus provided by the foregoing embodiments, this application correspondingly also provides a conference recording system. Its specific implementation is described below with reference to the drawings and embodiments.
System embodiment
See Figure 8, a schematic structural diagram of a conference recording system provided by an embodiment of this application.
As shown in Figure 8, the conference recording system of this embodiment includes a multipoint control unit MCU, a recording server 801, and at least two site terminals.
In this embodiment, the multipoint control unit MCU may specifically execute the conference recording method provided by the foregoing embodiments.
The site terminals are used to send audio and video streams to the multipoint control unit MCU;
the multipoint control unit MCU is used to screen, according to the characteristic information of the persons to be recorded, the streams to be recorded from the streams sent by the site terminals, the characteristic information including image information or sound information, and to send the streams to be recorded to the recording server 801;
the recording server 801 is used to perform the conference recording according to the streams to be recorded.
In this embodiment there are at least two site terminals. All of them may uniformly be AVC site terminals, or uniformly SVC site terminals. Each site terminal may be a terminal using the Session Initiation Protocol (SIP) or a terminal using the H.323 protocol; the communication protocol used by the site terminals is not limited here. As shown in Figure 8, site terminal 811 communicates with the MCU via SIP, and site terminals 812 and 813 each communicate with the MCU via H.323.
The conference recording system of this embodiment uses the persons' characteristic information to screen the streams to be recorded automatically, eliminating manual screening, saving the labor cost of conference recording, and greatly improving recording efficiency. The system makes conference recording more convenient and promotes wide application of the video conferencing function.
Optionally, in the system of this embodiment, the MCU also has the function of forwarding audio and video streams. An implementation scenario of this function is described below with Figure 8.
An example scenario of MCU forwarding: one site terminal requests the MCU to rebroadcast another site terminal's conference. Site terminal 813 sends the MCU a request to rebroadcast the conference of site terminal 812. In practice, the stream format that terminal 813 can play may be one of the multiple formats terminal 812 can provide, so the request sent by terminal 813 carries the stream format that terminal 813 can play. The MCU sends terminal 812 a notification according to the format in the request, so that terminal 812 sends the MCU streams in that format. The MCU then forwards the streams sent by terminal 812, in the format terminal 813 can play, to terminal 813, so that terminal 813 can play them; thus all participants at terminal 813's venue can watch the conference held by terminal 812.
Optionally, in the system of this embodiment, the recording server 801 also has the function of forwarding audio and video streams. An implementation scenario of this function is described below.
An example scenario of recording server 801 forwarding: another server requests on-demand or live playback of a conference from the recording server. The recording server 801 receives the on-demand or live request from the other server. In practice, the format the other server can play may be one of the multiple formats the site terminals can provide, so the on-demand or live request may carry the stream format the other server can play. The recording server 801 notifies the MCU of that format so that the MCU notifies the site terminals of it; when the MCU receives streams conforming to the format from the terminals, it sends them to the recording server. The recording server 801 then forwards, based on the on-demand or live request and the streams sent by the MCU, to the server that made the request. As one example, if another server requests on-demand playback of site terminal 811's conference, recording server 801 forwards to that server the streams from terminal 811 that conform to that server's stream format. As another example, if another server requests a live broadcast of the conferences of all joined site terminals, recording server 801 forwards to that server the streams from each terminal that conform to that server's stream format; if the format suitable for recording server 801 matches the format the requesting server can play, the recording server can also forward while recording, the forwarded content being the multi-picture (i.e. composite-picture) audio and video stream.
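The forwarding selection described above — matching available streams against the stream format carried in an on-demand or live request, optionally restricted to particular terminals — can be sketched as a filter. The request and stream dictionary shapes, and the codec names in the example, are assumptions for illustration.

```python
def select_forward_streams(request, streams):
    """Pick the streams matching the stream format the requesting server
    says it can play. request carries that format; an optional "terminals"
    list restricts the selection (absent means all joined terminals, as in
    the live-broadcast example)."""
    fmt = request["format"]
    wanted = request.get("terminals")  # None -> forward from all terminals
    return [s for s in streams
            if s["format"] == fmt and (wanted is None or s["terminal"] in wanted)]
```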
The above scenarios show that in the conference recording system of this embodiment, both the MCU and the recording server 801 can provide stream forwarding services, enriching the overall functions of the system and improving the user experience.
Optionally, the conference recording system of this embodiment may further include a conference application server (conference AS) 802. See Figure 9, a schematic structural diagram of another conference recording system provided by this embodiment.
In the system of Figure 9, the dotted line between a site terminal and conference AS 802 indicates that the site terminal is registered with conference AS 802; the dotted line between recording server 801 and conference AS 802 indicates that recording server 801 is registered with conference AS 802; the dotted line between conference AS 802 and the MCU indicates that conference AS 802 delivers the characteristic information of the persons to be recorded to the MCU. Conference AS 802 provides users with the function of uploading materials carrying the persons' characteristic information (for example, pictures or audio files of the persons to be recorded), processes the materials to obtain the characteristic information, and delivers it to the MCU. Conference AS 802 is also used to store materials corresponding to multiple persons and to process them to obtain each person's characteristic information; after a user logs in to conference AS 802, the materials corresponding to each person are provided to the user, the user's selection of the materials of the persons to be recorded is received, and according to the selection message the characteristic information of the persons corresponding to the selected materials is delivered to the MCU. In practice, conference AS 802 may deliver the characteristic information of the persons to be recorded when convening a conference with the MCU.
From the above description, in the conference recording system of this embodiment, conference AS 802 can provide users with the service of designating the persons to be recorded, increasing the convenience of subsequent conference recording; recording only the designated persons saves the storage space of recording server 801, reduces communication bandwidth, and improves the user's conference recording experience.
It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
The above embodiments are merely intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some technical features therein, and that such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (19)

  1. 一种会议录制方法,其特征在于,包括以下步骤:
    多点控制单元根据待录制人员的特征信息从各个会场终端发送的音视频码流中筛选需要录制的音视频码流;所述特征信息包括:图像信息或声音信息;
    所述多点控制单元将所述需要录制的音视频码流发送给录播服务器,以使所述录播服务器进行会议录制。
  2. 根据权利要求1所述的会议录制方法,其特征在于,当所述需要录制的音视频码流为所述待录制人员所在会场的整个音视频码流时,所述多点控制单元根据待录制人员的特征信息从各个会场终端发送的音视频码流中筛选需要录制的音视频码流,具体包括:
    所述多点控制单元根据所述待录制人员的特征信息和所述各个会场终端发送的音视频码流筛选所述待录制人员对应的会场终端;
    将筛选出的会场终端发送的音视频码流全部作为所述需要录制的音视频码流。
  3. The conference recording method according to claim 1, wherein when the audio/video stream that needs to be recorded is the personal audio/video stream of the person to be recorded, the screening, by the multipoint control unit according to the feature information of the person to be recorded, the audio/video stream that needs to be recorded from the audio/video streams sent by the site terminals specifically comprises:
    screening, by the multipoint control unit according to the feature information of the person to be recorded and the audio/video streams sent by the site terminals, the site terminal corresponding to the person to be recorded; and
    screening, from the audio/video streams sent by the screened-out site terminal according to the feature information of the person to be recorded, the personal audio/video stream of the person to be recorded as the audio/video stream that needs to be recorded.
  4. The conference recording method according to claim 2 or 3, wherein the screening, by the multipoint control unit according to the feature information of the person to be recorded and the audio/video streams sent by the site terminals, the site terminal corresponding to the person to be recorded specifically comprises:
    decoding, by the multipoint control unit, the audio/video streams sent by the site terminals to obtain decoded video streams and audio streams; and
    performing, by the multipoint control unit, feature matching between the image information of the person to be recorded and the decoded video streams to determine the site terminal corresponding to the person to be recorded, or performing feature matching between the voice information of the person to be recorded and the decoded audio streams to determine the site terminal corresponding to the person to be recorded.
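The feature-matching step in claim 4 can be illustrated with a cosine-similarity comparison between the target person's embedding and an embedding computed from each terminal's decoded stream. The embeddings, the threshold, and the function names are invented for the example; the claim does not prescribe a particular similarity measure.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is zero.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_terminal(target_embedding, terminal_embeddings, threshold=0.9):
    """Return the site terminal whose decoded stream best matches the
    person to be recorded, or None if no terminal passes the threshold."""
    best, best_score = None, threshold
    for terminal, emb in terminal_embeddings.items():
        score = cosine(target_embedding, emb)
        if score >= best_score:
            best, best_score = terminal, score
    return best

# Hypothetical per-terminal embeddings derived from decoded video frames.
terminals = {
    "site-811": [0.9, 0.1, 0.0],
    "site-812": [0.1, 0.9, 0.1],
}
hit = match_terminal([1.0, 0.1, 0.0], terminals)
```

In a real system the embeddings would come from a face-recognition model (image information) or a voiceprint model (voice information); the thresholded argmax shown here is one common way to turn similarities into a terminal decision.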
  5. The conference recording method according to any one of claims 1 to 4, wherein the site terminals are advanced video coding (AVC) site terminals, and when the person to be recorded corresponds to at least two different AVC site terminals, the screening, by the multipoint control unit according to the feature information of the person to be recorded, the audio/video stream that needs to be recorded from the audio/video streams sent by the site terminals specifically comprises:
    screening, by the multipoint control unit according to the feature information of the person to be recorded, the audio/video streams that need to be recorded from the audio/video streams sent by the at least two different AVC site terminals respectively.
  6. The conference recording method according to claim 5, wherein the audio/video stream that needs to be recorded comprises a video stream that needs to be recorded and an audio stream that needs to be recorded; and the sending, by the multipoint control unit, the audio/video stream that needs to be recorded to the recording server specifically comprises:
    performing, by the multipoint control unit, picture composition on the video streams that need to be recorded to obtain a composite picture, sending the composite picture to the recording server, and mixing the audio streams that need to be recorded before sending them to the recording server.
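The audio-mixing half of claim 6 can be sketched as a sample-wise sum with clipping; the picture composition is omitted here because it is codec-specific. The 16-bit PCM range and the plain-sum strategy are assumptions for illustration, not requirements of the claim.

```python
def mix_audio(streams, lo=-32768, hi=32767):
    """Mix equal-length PCM sample lists by summing sample-wise and
    clipping to the 16-bit signed range, one common mixing strategy
    (the claim does not prescribe a particular one)."""
    if not streams:
        return []
    length = len(streams[0])
    assert all(len(s) == length for s in streams), "streams must align"
    # zip(*streams) walks the streams sample-by-sample.
    return [max(lo, min(hi, sum(samples))) for samples in zip(*streams)]

mixed = mix_audio([[100, -200, 30000], [50, -100, 10000]])
```

The third mixed sample saturates at the positive limit rather than wrapping, which is why the clamp is applied after the sum.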
  7. The conference recording method according to any one of claims 1 to 4, wherein the site terminals are scalable video coding (SVC) site terminals, and when the person to be recorded corresponds to at least two different SVC site terminals, the screening, by the multipoint control unit according to the feature information of the person to be recorded, the audio/video stream that needs to be recorded from the audio/video streams sent by the site terminals specifically comprises:
    notifying, by the multipoint control unit, all the SVC site terminals of the stream format applicable to the recording server;
    receiving, by the multipoint control unit, the audio/video streams in the stream format applicable to the recording server sent by the at least two different SVC site terminals; and
    screening, by the multipoint control unit according to the feature information of the person to be recorded, the audio/video streams that need to be recorded from the audio/video streams in the stream format applicable to the recording server.
  8. The conference recording method according to claim 7, wherein the audio/video stream that needs to be recorded comprises a video stream that needs to be recorded and an audio stream that needs to be recorded; and the sending, by the multipoint control unit, the audio/video stream that needs to be recorded for the person to be recorded to the recording server specifically comprises:
    sending, by the multipoint control unit, the video streams that need to be recorded corresponding to the at least two different SVC site terminals to the recording server, so that the recording server performs picture composition on the video streams that need to be recorded corresponding to the at least two different SVC site terminals to obtain a composite picture; and
    mixing the audio streams that need to be recorded corresponding to the at least two different SVC site terminals before sending them to the recording server.
  9. The conference recording method according to any one of claims 1 to 8, wherein the screening, by the multipoint control unit according to the feature information of the person to be recorded, the audio/video stream that needs to be recorded from the audio/video streams specifically comprises:
    screening, by the multipoint control unit according to the feature information of the person to be recorded, the audio/video stream that needs to be recorded from the audio/video streams by using a pre-trained neural network model.
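Claim 9 leaves the neural network model unspecified; the sketch below stands in a trivial tag-overlap scorer for the pre-trained model, only to show where the model slots into the screening step. Every name and the threshold are hypothetical.

```python
def screen_streams(streams, feature_info, model, threshold=0.5):
    """Keep only the streams the (pre-trained) model scores as containing
    the person to be recorded. `model(feature_info, stream)` is assumed
    to return a confidence in [0, 1]; a real MCU would run face or
    voiceprint inference here instead."""
    return [s for s in streams if model(feature_info, s) >= threshold]

def toy_model(feature_info, stream):
    # Stand-in "model": fraction of the target's tags found on the
    # stream, normalised to [0, 1]. Not a neural network.
    return len(feature_info & stream["tags"]) / max(len(feature_info), 1)

streams = [
    {"site": "site-811", "tags": {"alice", "podium"}},
    {"site": "site-812", "tags": {"bob"}},
]
kept = screen_streams(streams, {"alice"}, toy_model)
```

Swapping `toy_model` for a real pre-trained classifier changes nothing about the surrounding screening logic, which is the point of the claim's model-agnostic phrasing.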
  10. A conference recording apparatus, comprising:
    a stream screening module, configured to screen, according to feature information of a person to be recorded, an audio/video stream that needs to be recorded from audio/video streams sent by site terminals; and
    a stream sending module, configured to send the audio/video stream that needs to be recorded to a recording server, so that the recording server performs conference recording, wherein the feature information comprises image information or voice information.
  11. The conference recording apparatus according to claim 10, wherein when the audio/video stream that needs to be recorded is the entire audio/video stream of the site where the person to be recorded is located, the stream screening module specifically comprises:
    a site terminal screening unit, configured to screen, according to the feature information of the person to be recorded and the audio/video streams sent by the site terminals, the site terminal corresponding to the person to be recorded; and
    a first stream screening unit, configured to use all the audio/video streams sent by the screened-out site terminal as the audio/video stream that needs to be recorded.
  12. The conference recording apparatus according to claim 10, wherein when the audio/video stream that needs to be recorded is the personal audio/video stream of the person to be recorded, the stream screening module specifically comprises:
    a site terminal screening unit, configured to screen, according to the feature information of the person to be recorded and the audio/video streams sent by the site terminals, the site terminal corresponding to the person to be recorded; and
    a second stream screening unit, configured to screen, from the audio/video streams sent by the screened-out site terminal according to the feature information of the person to be recorded, the personal audio/video stream of the person to be recorded as the audio/video stream that needs to be recorded.
  13. The conference recording apparatus according to claim 11 or 12, wherein the site terminal screening unit specifically comprises:
    a decoding subunit, configured to decode the audio/video streams sent by the site terminals to obtain decoded video streams and audio streams; and
    a site terminal determining subunit, configured to perform feature matching between the image information of the person to be recorded and the decoded video streams to determine the site terminal corresponding to the person to be recorded, or perform feature matching between the voice information of the person to be recorded and the decoded audio streams to determine the site terminal corresponding to the person to be recorded.
  14. The conference recording apparatus according to any one of claims 10 to 13, wherein the site terminals are advanced video coding (AVC) site terminals, and when the person to be recorded corresponds to at least two different AVC site terminals, the stream screening module specifically comprises:
    a third stream screening unit, configured to screen, according to the feature information of the person to be recorded, the audio/video streams that need to be recorded from the audio/video streams sent by the at least two different AVC site terminals respectively.
  15. The conference recording apparatus according to claim 14, wherein the audio/video stream that needs to be recorded comprises a video stream that needs to be recorded and an audio stream that needs to be recorded; and the stream sending module specifically comprises:
    a picture composition unit, configured to perform picture composition on the video streams that need to be recorded to obtain a composite picture;
    a picture sending unit, configured to send the composite picture to the recording server;
    a first audio mixing unit, configured to mix the audio streams that need to be recorded; and
    a first audio sending unit, configured to send the mixed audio to the recording server.
  16. The conference recording apparatus according to any one of claims 10 to 13, wherein the site terminals are scalable video coding (SVC) site terminals, and when the person to be recorded corresponds to at least two different SVC site terminals, the stream screening module specifically comprises:
    a stream format notification unit, configured to notify all the SVC site terminals of the stream format applicable to the recording server;
    a stream receiving unit, configured to receive the audio/video streams in the stream format applicable to the recording server sent by the at least two different SVC site terminals; and
    a fourth stream screening unit, configured to screen, according to the feature information of the person to be recorded, the audio/video streams that need to be recorded from the audio/video streams in the stream format applicable to the recording server.
  17. The conference recording apparatus according to claim 16, wherein the audio/video stream that needs to be recorded comprises a video stream that needs to be recorded and an audio stream that needs to be recorded; and the stream sending module specifically comprises:
    a video stream sending unit, configured to send the video streams that need to be recorded corresponding to the at least two different SVC site terminals to the recording server, so that the recording server performs picture composition on the video streams that need to be recorded corresponding to the at least two different SVC site terminals to obtain a composite picture;
    a second audio mixing unit, configured to mix the audio streams that need to be recorded corresponding to the at least two different SVC site terminals; and
    a second audio sending unit, configured to send the mixed audio to the recording server.
  18. The conference recording apparatus according to any one of claims 10 to 17, wherein the stream screening module specifically comprises:
    a fifth stream screening unit, configured to screen, according to the feature information of the person to be recorded, the audio/video stream that needs to be recorded from the audio/video streams by using a pre-trained neural network model.
  19. A conference recording system, comprising a multipoint control unit, a recording server, and at least two site terminals, wherein
    the site terminals are configured to send audio/video streams to the multipoint control unit;
    the multipoint control unit is configured to screen, according to feature information of a person to be recorded, an audio/video stream that needs to be recorded from the audio/video streams sent by the site terminals, wherein the feature information comprises image information or voice information, and send the audio/video stream that needs to be recorded to the recording server; and
    the recording server is configured to perform conference recording according to the audio/video stream that needs to be recorded.
PCT/CN2020/083402 2019-06-28 2020-04-05 Conference recording method and apparatus, and conference recording system WO2020258976A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20832662.9A EP3979630A4 (en) 2019-06-28 2020-04-05 CONFERENCE RECORDING METHOD AND APPARATUS AND CONFERENCE RECORDING SYSTEM
US17/563,859 US11974067B2 (en) 2019-06-28 2021-12-28 Conference recording method and apparatus, and conference recording system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910577597.9 2019-06-28
CN201910577597.9A CN112153321B (zh) Conference recording method and apparatus, and conference recording system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/563,859 Continuation US11974067B2 (en) 2019-06-28 2021-12-28 Conference recording method and apparatus, and conference recording system

Publications (1)

Publication Number Publication Date
WO2020258976A1 (zh)

Family

ID=73869495

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/083402 WO2020258976A1 (zh) 2019-06-28 2020-04-05 一种会议录制方法、装置及会议录制系统

Country Status (4)

Country Link
US (1) US11974067B2 (zh)
EP (1) EP3979630A4 (zh)
CN (1) CN112153321B (zh)
WO (1) WO2020258976A1 (zh)



Also Published As

Publication number Publication date
EP3979630A4 (en) 2022-08-03
US20220124280A1 (en) 2022-04-21
CN112153321A (zh) 2020-12-29
CN112153321B (zh) 2022-04-05
EP3979630A1 (en) 2022-04-06
US11974067B2 (en) 2024-04-30


Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (ref document number: 20832662; country of ref document: EP; kind code of ref document: A1)
NENP: Non-entry into the national phase (ref country code: DE)
ENP: Entry into the national phase (ref document number: 2020832662; country of ref document: EP; effective date: 20220103)