WO2014079302A1 - Low-bitrate video conferencing system and method, transmitting-end device, and receiving-end device - Google Patents

Low-bitrate video conferencing system and method, transmitting-end device, and receiving-end device Download PDF

Info

Publication number
WO2014079302A1
WO2014079302A1 (PCT/CN2013/086009)
Authority
WO
WIPO (PCT)
Prior art keywords
video
audio
identity
speaker
data
Prior art date
Application number
PCT/CN2013/086009
Other languages
English (en)
French (fr)
Inventor
李霞
付贤会
张凯
修岩
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Priority to US14/647,259 priority Critical patent/US20150341565A1/en
Priority to EP13856801.9A priority patent/EP2924985A4/en
Publication of WO2014079302A1 publication Critical patent/WO2014079302A1/zh

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals

Definitions

  • the present invention relates to the field of multimedia communications, and in particular to a low-bitrate video conferencing system, a low-bitrate video conference data transmission method, a transmitting-end device, and a receiving-end device.
  • Background Art
  • the video conferencing system is used for remote, multi-point and real-time conferences to realize the transmission and interaction of video and sound between multiple points.
  • the video conferencing system is mainly composed of a terminal and a Micro Controller Unit (MCU).
  • in a small video conferencing system, a plurality of terminals are usually centrally connected to one MCU to form a star topology network.
  • the terminal is a client device equipped with multimedia components such as a display, a camera, a speaker, a microphone, etc.
  • the MCU is a system-side device that centrally exchanges and processes multimedia information of each terminal.
  • the video conferencing system can be said to be a system that integrates network, video and audio.
  • the network requirements are very high.
  • Network bandwidth is actually the basis of the entire video conference, and its use in video conferencing is also complicated, because different requirements generate different bandwidth requirements.
  • For example, the number of participants, the number of speakers, and the image size all drive bandwidth: many users want the largest possible image resolution, yet 640 x 480 carries four times the data of 320 x 240, and 20 venues produce twice the data of 10 venues.
  • Many conferences also require the screen to be shared with branch offices. Although this feature is very valuable, a 1024 x 768 screen is a very large image and generates a great deal of traffic.
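  • The resolution arithmetic above can be checked directly, assuming raw data volume is proportional to pixel count at equal frame rate and color depth (a quick illustrative check, not part of the original disclosure):

```python
# Raw data volume scales with pixel count at equal frame rate and color depth.
assert 640 * 480 == 4 * (320 * 240)   # 307,200 vs 76,800 pixels: 4x the data
assert 1024 * 768 > 640 * 480         # a shared 1024 x 768 screen is larger still
```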
  • the main purpose of the embodiments of the present invention is to provide a low-bitrate video conferencing system and method, a transmitting-end device, and a receiving-end device that save bandwidth, so that the bandwidth of an IP network can meet the ever-growing demands of video conferencing services.
  • An embodiment of the present invention provides a low-bitrate video conferencing system, where the system includes: a transmitting end and a receiving end;
  • the transmitting end is configured to acquire audio data and video data, form an audio feature map and a video feature map respectively, and obtain a local dynamic image; and to transmit the audio data and the local dynamic image to the receiving end;
  • the receiving end is configured to synthesize the original video data from the audio features and video features extracted from its local audio feature map and video feature map together with the received local dynamic image, and to play the audio data.
  • the transmitting end includes: a collecting unit, an identifying unit, a feature mapping unit, and a sending unit;
  • the receiving end includes: a receiving unit, a feature extraction comparison unit, and a data synthesis output unit;
  • the collecting unit is configured to collect audio data and video data, and send the collected audio data and video data to the identifying unit;
  • the identification unit is configured to identify the speaker identity, perform voice recognition on the collected audio data to acquire audio features, perform image recognition on the collected video data to acquire video features and a local dynamic image, and send the audio features, video features, and local dynamic image to the feature mapping unit;
  • the feature mapping unit is configured to query whether an audio feature map and a video feature map already exist, and if not, generate an audio feature map and a video feature map according to the audio feature and the video feature respectively;
  • the sending unit is configured to send the audio data and the local dynamic image, where the encoding of the audio data carries the speaker identity;
  • the receiving unit is configured to receive the audio data and the local dynamic image;
  • the feature extraction comparison unit is configured to extract the speaker identity from the encoding of the audio data, query the existing audio feature map and video feature map, and, according to the speaker identity, extract the audio features from the audio feature map and the video features from the video feature map;
  • the data synthesis output unit is configured to restore the original video data by synthesizing the extracted video features with the received local dynamic image, and to output the audio data and the original video data in combination with the audio features.
  • the identifying unit is configured to identify the speaker identity and the number of the conference in which the speaker is currently participating, form an identity code from the speaker identity and the conference number, and use the identity code to identify the identity feature corresponding to the collected audio data and video data; or to identify the identity feature by the speaker identity alone.
  • the feature mapping unit is configured to perform the query both locally at the transmitting end and in a network database: if found locally, the local audio feature map and video feature map are used; if found in the network database, the audio feature map and video feature map are downloaded from the network database to the local end; if found in neither, the audio feature map and video feature map are generated locally.
  • the audio feature map is composed of a speaker identity and the audio feature corresponding to the speaker identity; or the audio feature map is composed of an identity code and the audio feature corresponding to the identity code, where the identity code is formed from the speaker identity and the conference number.
  • the video feature map is composed of a speaker identity and the video feature corresponding to the speaker identity; or of an identity code and the video feature corresponding to the identity code, where the identity code is formed from the speaker identity and the conference number (a minimal illustration follows).
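  • The sketch below illustrates one way the identity code and the map keying just described could look; the code format and container types are assumptions for illustration, not specified by the disclosure:

```python
# Hypothetical sketch: form the identity code from the speaker identity and
# conference number, and use it as the index key of both feature maps.

def make_identity_code(speaker_id: str, conference_no: str) -> str:
    """Combine speaker identity and conference number into one identity code."""
    return f"{speaker_id}:{conference_no}"

# The maps may be keyed by the speaker identity alone or by the identity code.
audio_feature_map: dict[str, bytes] = {}
video_feature_map: dict[str, bytes] = {}

code = make_identity_code("alice", "conf-042")
audio_feature_map[code] = b"<audio feature template>"
video_feature_map[code] = b"<video feature template>"
```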
  • the local dynamic image includes: trajectory image information of at least one of the speaker's head movements, eye movements, gestures, and contour movements.
  • An embodiment of the present invention further provides a low-bitrate video conference data transmission method, where the method includes:
  • the transmitting end acquires audio data and video data, forms an audio feature map and a video feature map respectively, obtains a local dynamic image, and transmits the audio data and the local dynamic image to the receiving end;
  • the receiving end synthesizes the original video data from the audio features and video features extracted from its local audio feature map and video feature map together with the received local dynamic image, and plays the audio data.
  • forming the audio feature map includes:
  • after the speaker identity is identified, forming the audio feature map with the speaker identity as the index key, the audio feature map being composed of the speaker identity and the audio feature corresponding to the speaker identity; or
  • after the speaker identity and the conference number are identified, forming the audio feature map with the speaker identity and the conference number as a combined index key, the audio feature map being composed of an identity code and the audio feature corresponding to the identity code;
  • the identity code is formed from the speaker identity and the conference number.
  • forming the video feature map includes:
  • after the speaker identity is identified, forming the video feature map with the speaker identity as the index key, the video feature map being composed of the speaker identity and the video feature corresponding to the speaker identity; or
  • after the speaker identity and the conference number are identified, forming the video feature map with the speaker identity and the conference number as a combined index key, the video feature map being composed of an identity code and the video feature corresponding to the identity code;
  • the identity code is formed from the speaker identity and the conference number.
  • before the audio feature map and the video feature map are formed, the method further includes: performing the query both locally at the transmitting end and in a network database; if found locally, using the local audio feature map and video feature map; if found in the network database, downloading the audio feature map and video feature map from the network database to the local end; if found in neither, forming the audio feature map and video feature map locally. A compact sketch of this three-tier query follows.
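  • The following is a minimal sketch of that query order, assuming dict-like interfaces for the local store and the network database (both are illustrative stand-ins, not the disclosure's API):

```python
# Query order: local store, then network database (downloading a hit to the
# local store), and only generate a new feature map if both miss.
def find_or_create_feature_map(key, local_store, network_db, generate):
    feature_map = local_store.get(key)
    if feature_map is not None:
        return feature_map                    # found locally
    feature_map = network_db.get(key)
    if feature_map is not None:
        local_store[key] = feature_map        # download to the local end
        return feature_map
    feature_map = generate()                  # generate from recognized features
    local_store[key] = feature_map
    network_db[key] = feature_map             # upload for subsequent queries
    return feature_map
```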
  • the local dynamic image includes: trajectory image information of at least one of the speaker's head movements, eye movements, gestures, and contour movements.
  • the embodiment of the present invention further provides a transmitting-end device of a low-bitrate video conferencing system, where the device is configured to acquire audio data and video data, form an audio feature map and a video feature map respectively, and obtain a local dynamic image; and to transmit the audio data and the local dynamic image to the receiving end.
  • the device includes: a collecting unit, an identification unit, a feature mapping unit, and a sending unit;
  • the collecting unit is configured to collect audio data and video data, and send the collected audio data and video data to the identifying unit;
  • the identification unit is configured to identify the speaker identity, perform voice recognition on the collected audio data to acquire audio features, perform image recognition on the collected video data to acquire video features and a local dynamic image, and send the audio features, video features, and local dynamic image to the feature mapping unit;
  • the feature mapping unit is configured to query whether an audio feature map and a video feature map already exist, and if not, generate an audio feature map and a video feature map according to the audio feature and the video feature respectively;
  • the sending unit is configured to send audio data and a local dynamic image, where the encoding of the audio data carries the speaker identity.
  • the embodiment of the present invention further provides a receiving-end device of a low-bitrate video conferencing system, where the device is configured to synthesize the original video data from the audio features and video features extracted from the local audio feature map and video feature map together with the local dynamic image received from the transmitting end, and to play the audio data.
  • the device includes: a receiving unit, a feature extraction comparison unit, and a data synthesis output unit;
  • the receiving unit is configured to receive audio data and a local dynamic image
  • the feature extraction comparison unit is configured to extract the speaker identity from the encoding of the audio data, query the existing audio feature map and video feature map, and, according to the speaker identity, extract the audio features from the audio feature map and the video features from the video feature map;
  • the data synthesis output unit is configured to restore the original video data by synthesizing the extracted video features with the received local dynamic image, and to output the audio data and the original video data in combination with the audio features.
  • the system of the embodiment of the present invention acquires audio data and video data at the transmitting end, forms an audio feature map and a video feature map respectively, and obtains a local dynamic image; the transmitting end transmits the audio data and the local dynamic image to the receiving end, and the receiving end synthesizes the original video data from the audio features and video features extracted from its local feature maps together with the received local dynamic image, and plays the audio data.
  • FIG. 1 is a schematic diagram of the structure of a system according to an embodiment of the present invention;
  • FIG. 2 is a schematic flowchart of the implementation of a method according to an embodiment of the present invention;
  • FIG. 3 is a schematic diagram of an application example of identity establishment according to an embodiment of the present invention;
  • FIG. 4 is a schematic diagram of an application example of audio map establishment according to an embodiment of the present invention;
  • FIG. 5 is a schematic diagram of an application example of video map establishment according to an embodiment of the present invention;
  • FIG. 6 is a schematic diagram of an application example of dynamic image acquisition according to an embodiment of the present invention;
  • FIG. 7 is a schematic diagram of an application example of the audio processing flow at the transmitting end according to an embodiment of the present invention;
  • FIG. 8 is a schematic diagram of an application example of the video processing flow at the transmitting end according to an embodiment of the present invention;
  • FIG. 9 is a schematic diagram of an application example of the video integration processing flow at the receiving end according to an embodiment of the present invention.
  • in the embodiment of the present invention, audio data and video data are acquired at the transmitting end, an audio feature map and a video feature map are formed respectively, and a local dynamic image is obtained; the transmitting end transmits the audio data and the local dynamic image to the receiving end, and the receiving end synthesizes the original video data from the audio features and video features extracted from its local feature maps together with the received local dynamic image, and plays the audio data.
  • considering that video data accounts for the vast majority of the bandwidth a video conference needs, and that for a company or organization video conferencing has distinctive characteristics (the participants are largely fixed, and the focus of the meeting is on the speaker, in particular the speaker's eyes, mouth shape, and gestures), the following conclusion is drawn: video data is not transmitted directly in the video conference; instead, the video data is split at the transmitting end, and the receiving end integrates and processes it to restore the original video data.
  • in this way, since the video data is not transmitted directly, the amount of transmitted data is reduced compared with the prior art, which reduces the bandwidth occupied during video data transmission; nor is there any need to sacrifice quality by substituting low-resolution video data for high-resolution video data out of concern that high-resolution video would occupy more bandwidth. Because the embodiment of the present invention splits the video data rather than transmitting it directly, there is no need to worry about heavy bandwidth occupation: the bandwidth stays within a controllable range, and within that range high-resolution video data with the best display effect can still be obtained.
  • FIG. 1 shows a low-bitrate video conferencing system according to an embodiment of the present invention, where the system includes: a transmitting end 1 and a receiving end 2;
  • the transmitting end 1 is configured to collect audio data and video data, form an audio feature map and a video feature map respectively, and obtain a local dynamic image; and to transmit the audio data and the local dynamic image to the receiving end 2;
  • the receiving end 2 is configured to synthesize the original video data from the audio features and video features extracted from its local audio feature map and video feature map together with the received local dynamic image, and to play the audio data.
  • the transmitting end 1 includes: an acquiring unit 11, an identifying unit 12, a feature mapping unit 13, and a transmitting unit 14, where:
  • the collecting unit 11 is configured to collect audio data and video data, and send the collected audio data and video data to the identification unit.
  • the identification unit 12 is configured to identify the identity of the speaker, perform voice recognition on the collected audio data to acquire audio features, perform image recognition on the collected video data to acquire video features and a local dynamic image, and send the audio features, video features, and local dynamic image to the feature mapping unit 13.
  • here, in addition to the speaker identity, the number of the conference in which the speaker participates can also be identified, and the identity code is generated from the speaker identity and the conference number.
  • the video feature includes: a background image feature of the conference and an image feature of the speaker.
  • the local dynamic image includes: trajectory image information of at least one of the speaker's head movements, eye movements, gestures, and contour movements.
  • the identification unit 12 may further be divided into a voice recognition subunit, configured to perform voice recognition on the collected audio data and acquire audio features, and an image recognition subunit, configured to perform image recognition on the collected video data and acquire video features and the local dynamic image.
  • the feature mapping unit 13 is configured to query, in the local or a network database, whether an audio feature map and a video feature map already exist; if not, to generate an audio feature map from the speaker identity and the received audio features and a video feature map from the speaker identity and the received video features, and to store the audio feature map and video feature map locally, or upload them to the network database for storage and subsequent queries.
  • both the audio feature mapping and the video feature mapping may use the speaker identity as the mapping index key, and the mapping may further include a conference number, using the speaker identity and the conference number as the combined mapping index key.
  • the feature mapping unit 13 can also be divided into an audio feature mapping subunit and a video feature mapping subunit.
  • the audio feature mapping subunit is configured to query whether the audio feature map already exists in the local or network database and, if not, to generate an audio feature map from the speaker identity and the received audio features and store it locally or upload it to the network database for subsequent queries; the video feature mapping subunit is configured to query whether the video feature map already exists in the local or network database and, if not, to generate a video feature map from the speaker identity and the received video features and store it locally or upload it to the network database for subsequent queries.
  • the sending unit 14 is configured to send the audio data and the local dynamic image, where the encoding of the audio data carries the speaker identity or the identity code.
  • if the audio data itself is sent, its features need not be extracted; the receiving end only needs to extract the video features from the video feature map according to the speaker identity for use in collation and synthesis. Alternatively, when only the local dynamic image is sent, the receiving end also needs to extract the audio features from the audio feature map according to the speaker identity.
  • when the sending unit sends the identity code, the identity code is composed of the speaker identity and the conference number. At the receiving end, the identity code is matched to the corresponding audio features, video features, and local dynamic graphics so that they can be collated and merged to restore the original video data and play the audio data; through this interaction between the transmitting end and the receiving end, the receiving end can vividly restore the expressions, mouth shapes, gestures, posture, and the like of the speakers in the current conference.
  • moreover, since only the local dynamic graphics need to be transmitted, complete video data need not be sent: the audio/video features of previously collected audio/video data are stored at both the sender and the receiver, with a backup in the network database. Restoring the original video data and playing the audio data then only requires extracting the corresponding audio/video features from the audio/video feature maps in the receiving end's local or network database according to the speaker identity and synthesizing them with the received local dynamic graphics, which is simple and easy to operate, reduces the amount of transmitted data, and saves bandwidth, with no concern about being unable to transmit and display high-resolution video data.
  • the above is actually the various functional units included in the transmitting device of the system.
  • the following describes the functional units included in the receiving device of the system.
  • the receiving end 2 includes: a receiving unit 21, a feature extraction comparison unit 22, and a data synthesis output unit 23, where:
  • the receiving unit 21 is configured to receive the audio data and the local dynamic image.
  • the feature extraction comparison unit 22 is configured to extract the speaker identity from the audio data, query an existing audio feature map and a video feature map in a local or network database, and extract audio from the audio feature map according to the speaker identity. Feature, extracting video features from the video feature map based on the identity of the speaker.
  • here, when the audio data carries the speaker identity, the speaker identity is used as the index key to query the audio feature map and the video feature map. If the audio data carries not the speaker identity but an identity code consisting of the speaker identity and the conference number, the identity code is used as the combined index key to query the audio feature map and the video feature map.
  • the feature extraction comparison unit 22 can also be divided into an audio feature extraction comparison subunit and a video feature extraction comparison subunit. The audio feature extraction comparison subunit is configured to extract the speaker identity from the audio data, query the existing audio feature map in the local or network database, and extract the audio features from the audio feature map according to the speaker identity; the video feature extraction comparison subunit is configured to extract the video features from the video feature map according to the speaker identity.
  • the data synthesis output unit 23 is configured to restore the original video data by synthesizing the extracted video features with the received local dynamic image, and to output the audio data and the original video data in combination with the audio features.
  • in practical applications, the collection unit 11, the identification unit 12, the feature mapping unit 13, the transmitting unit 14, the receiving unit 21, the feature extraction comparison unit 22, and the data synthesis output unit 23 may all be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like; the CPU, DSP, and FPGA can be built into the video conferencing system.
  • FIG. 2 shows a low-bitrate video conference data transmission method according to an embodiment of the present invention, including the following steps:
  • Step 101: Collect audio data and video data, identify the speaker identity, perform voice recognition on the collected audio data to acquire audio features, and perform image recognition on the collected video data to acquire video features and a local dynamic image.
  • Step 102: Send the audio data and the local dynamic image, where the encoding of the audio data carries the speaker identity.
  • Step 103: Receive the audio data and the local dynamic image, extract the speaker identity from the encoding of the audio data, query the existing audio feature map and video feature map in the local or network database, and, according to the speaker identity, extract the audio features from the audio feature map and the video features from the video feature map.
  • Step 104: Restore the original video data by synthesizing the extracted video features with the received local dynamic image, and output the audio data and the original video data in combination with the audio features. A minimal end-to-end sketch of these four steps follows.
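  • In the sketch below, the recognizers and the synthesis step are stubbed out with byte slices; the function names and stand-in logic are assumptions for illustration, since the disclosure does not specify the speech- and image-recognition implementations:

```python
audio_feature_map: dict = {}
video_feature_map: dict = {}

def recognize_audio(audio: bytes) -> bytes:
    return audio[:16]                    # stub for the real audio feature

def recognize_video(video: bytes) -> tuple:
    return video[:16], video[16:32]      # stub: (video feature, local dynamic image)

def transmit(audio: bytes, video: bytes, speaker_id: str) -> dict:
    # Step 101: collect and recognize; record the feature maps once per speaker.
    audio_feature_map.setdefault(speaker_id, recognize_audio(audio))
    feature, motion = recognize_video(video)
    video_feature_map.setdefault(speaker_id, feature)
    # Step 102: only the audio and the local dynamic image cross the network.
    return {"speaker": speaker_id, "audio": audio, "motion": motion}

def receive(packet: dict) -> tuple:
    # Step 103: extract the identity and look up the stored features.
    sid = packet["speaker"]
    video_feature = video_feature_map[sid]
    audio_feature = audio_feature_map[sid]   # used when outputting the audio
    # Step 104: synthesize the original video from stored feature + received motion.
    video = video_feature + packet["motion"]  # stub for the real synthesis
    return packet["audio"], video
```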
  • the embodiment of the present invention further provides a transmitting-end device of a low-bitrate video conferencing system. The transmitting-end device has the same structure and functions as the transmitting end 1 in the foregoing system and includes: a collecting unit, an identification unit, a feature mapping unit, and a sending unit.
  • the collecting unit is configured to collect audio data and video data, and send the collected audio data and video data to the identifying unit.
  • the identification unit is configured to identify the speaker identity, perform voice recognition on the collected audio data to acquire audio features, perform image recognition on the collected video data to acquire video features and a local dynamic image, and send the audio features, video features, and local dynamic image to the feature mapping unit.
  • a feature mapping unit configured to query, in the local or network database, whether an audio feature mapping and a video feature mapping already exist, and if not, generate an audio feature mapping according to the speaker identity and the received audio feature, according to the speaker identity and The received video feature generates a video feature map, and stores the audio feature map and the video feature map locally, or uploads the audio feature map and the video feature map to a network database for storage for subsequent query use.
  • the sending unit is configured to send the audio data and the local dynamic image, where the encoding of the audio data carries the speaker identity or the identity code.
  • if the audio data itself is sent, its features need not be extracted; the receiving end only needs to extract the video features from the video feature map according to the speaker identity for use in collation and synthesis. Alternatively, when only the local dynamic image is sent, the receiving end needs to extract the audio features from the audio feature map according to the speaker identity.
  • when the sending unit sends the identity code, the identity code is composed of the speaker identity and the conference number. At the receiving end, the identity code is matched to the corresponding audio features, video features, and local dynamic graphics so that they can be collated and merged to restore the original video data and play the audio data; through this interaction between the transmitting end and the receiving end, the receiving end can vividly restore the expressions, mouth shapes, gestures, posture, and the like of the speakers in the current conference.
  • moreover, since only the local dynamic graphics need to be transmitted, complete video data need not be sent: the audio/video features of previously collected audio/video data are stored at both the transmitting end and the receiving end, with a backup in the network database. Restoring the original video data and playing the audio data then only requires extracting the corresponding audio/video features from the audio/video feature maps in the receiving end's local or network database according to the speaker identity and synthesizing them with the received local dynamic graphics, which is simple and easy to operate, reduces the amount of transmitted data, and saves bandwidth, with no concern about being unable to transmit and display high-resolution video data.
  • in practical applications, the collecting unit, the identification unit, the feature mapping unit, and the sending unit can all be implemented by a CPU, DSP, or FPGA; the CPU, DSP, and FPGA can all be built into the video conferencing system.
  • the embodiment of the present invention further provides a receiving-end device of a low-bitrate video conferencing system. The receiving-end device has the same structure and functions as the receiving end 2 in the foregoing system and includes: a receiving unit, a feature extraction comparison unit, and a data synthesis output unit, where:
  • the receiving unit is configured to receive the audio data and the local dynamic image.
  • a feature extraction comparison unit configured to extract the speaker identity from the audio data, query an existing audio feature map and a video feature map in a local or network database, and extract audio features from the audio feature map according to the speaker identity And extracting video features from the video feature map according to the identity of the speaker.
  • the data synthesis output unit is configured to restore the original video data by synthesizing the extracted video features with the received local dynamic image, and to output the audio data and the original video data in combination with the audio features.
  • the receiving unit, the feature extraction comparison unit, and the data synthesis output unit may be implemented by a CPU, a DSP, an FPGA, or the like; the CPU, the DSP, and the FPGA may all be built in the video conference system.
  • FIG. 3 is a schematic diagram of an identity establishment application example according to an embodiment of the present invention.
  • the identity establishment process includes: acquiring the speaker identity and the venue number, and generating an identity code from the speaker identity and the conference number to determine a unique identity.
  • FIG. 4 is a schematic diagram of an audio mapping establishment application example according to an embodiment of the present invention.
  • the audio map establishment process includes: after performing voice recognition on the audio data, the transmitting end identifies the speaker identity and audio characteristics, and stores the speaker identity and the audio features.
  • the speaker identity and the audio feature corresponding to the speaker identity form an audio feature map as a mapping relationship; the audio feature map may be stored in the form of an audio feature template.
  • the audio feature mapping relationship in the audio feature template may use the speaker identity as a key value index to the audio feature corresponding to the speaker identity.
  • FIG. 5 is a schematic diagram of an application example of video mapping establishment according to an embodiment of the present invention.
  • the video map establishment process includes: after performing image recognition on the video data, the transmitting end identifies the speaker identity and video characteristics, and stores the speaker identity and the video features.
  • the speaker identity, the video feature corresponding to the speaker identity form a video feature map in a mapping relationship; the video feature map may be stored in the form of a video feature template.
  • the video feature mapping relationship in the video feature template may use the speaker identity as a key value index to the video feature corresponding to the speaker identity.
  • FIG. 6 is a schematic diagram of an example of a dynamic image acquisition application according to an embodiment of the present invention.
  • the dynamic image acquisition process includes: acquiring the local dynamic image by capturing the speaker's contour motions such as head movements, eye movements, gestures, and bending.
  • the local dynamic image includes: trajectory image information of at least one of the speaker's head movements, eye movements, gestures, and contour movements; one possible representation is sketched below.
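  • The structure below is an assumed representation for illustration only: a small bundle of trajectories rather than full video frames, covering the motions the disclosure enumerates:

```python
from dataclasses import dataclass, field

@dataclass
class LocalDynamicImage:
    """Trajectory data for the enumerated motions; any subset may be present."""
    head_motion: list = field(default_factory=list)     # e.g. (x, y) sample points
    eye_movement: list = field(default_factory=list)
    gestures: list = field(default_factory=list)
    contour_motion: list = field(default_factory=list)  # bending, posture changes
```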
  • FIG. 7 is a schematic diagram of an application example of a sending end audio processing flow according to an embodiment of the present invention.
  • the process includes: at the transmitting end, the terminal collects the audio input source signal through a microphone and performs audio encoding and voice recognition; it extracts the audio features and queries locally whether an audio feature mapping template already exists. If one exists locally, the audio is output and transmitted to the receiving end; if not, the network database is queried for the audio feature mapping template. If it exists there, the template is downloaded directly to the local end, and the audio is output and transmitted to the receiving end; if it does not exist in the network database either, an audio feature mapping template is established and stored in both the local and network databases.
  • FIG. 8 is a schematic diagram of an application example of a video processing process at a sending end according to an embodiment of the present invention.
  • the process includes: at the transmitting end, the terminal collects the video input source signal and performs video encoding; it extracts the video features, formed from the background image features and the speaker image features; it then queries locally whether a video feature mapping template already exists. If one exists locally, local dynamic images such as the speaker's head movements, eye movements, and gestures are collected, and the local dynamic image is output and transmitted to the receiving end; if not, the network database is queried for the video feature mapping template. If it exists there, the template is downloaded directly to the local end, the local dynamic images such as the speaker's head movements, eye movements, and gestures are collected, and the local dynamic image is output and transmitted to the receiving end; if it does not exist in the network database either, a video feature mapping template is established and stored in both the local and network databases.
  • the receiving end processing flow of the embodiment of the present invention includes: receiving audio, extracting an audio feature template; extracting a video feature template, synthesizing the video feature and the local dynamic image to restore the original video data; and audio/video output.
  • the video integration process of the embodiment of the present invention is described as follows:
  • FIG. 9 is a schematic diagram of an application example of the video integration processing flow at the receiving end according to an embodiment of the present invention. The flow includes: receiving the audio signal, audio encoding, and identity recognition (by means of the identity code composed of the speaker identity and the conference number); determining whether the local video feature mapping template exists and, if not, downloading the video feature mapping template from the network database, or, if it exists, extracting the video features from the local video feature mapping template; receiving the local dynamic image; restoring the original video data, namely the venue environment and the speaker image, in particular lip shapes and gestures, from the audio features and video features extracted from the audio/video feature mapping templates in the local or network database and the received local dynamic image; and outputting the audio signal and the synthesized video signal. The template-lookup decision in this flow is sketched below.
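  • The sketch below covers only the template-lookup decision, under the same assumed dict-like local and network stores as the earlier sketches; the synthesis step is a stub:

```python
def integrate(identity_code, motion, local_templates, network_db):
    # Use the local video feature mapping template if present; otherwise
    # download it from the network database before synthesis.
    template = local_templates.get(identity_code)
    if template is None:
        template = network_db[identity_code]
        local_templates[identity_code] = template
    return synthesize(template, motion)

def synthesize(video_feature, motion):
    # Stub: a real system would render the venue background and speaker image,
    # animated by the received lip-shape and gesture trajectories.
    return (video_feature, motion)
```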
  • the low-bitrate video conferencing system and method provided by the embodiments of the present invention acquire audio data and video data at the transmitting end, form an audio feature map and a video feature map respectively, and acquire a local dynamic image; the audio data and the local dynamic image are transmitted to the receiving end. The transmitting end does not need to transmit complete video data and only needs to transmit the local dynamic image to the receiving end; the receiving end collates and synthesizes the original video data from the extracted audio and video features and the received local dynamic image, and plays the audio data. In this way, the amount of transmitted data is brought under control and effectively reduced, thereby saving bandwidth and meeting the needs of video conferencing services.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Image Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present invention discloses a low-bitrate video conferencing method: a transmitting end acquires audio data and video data, forms an audio feature map and a video feature map respectively, acquires a local dynamic image, and transmits the audio data and the local dynamic image to a receiving end; the receiving end collates and synthesizes the original video data from the audio features and video features extracted from its local audio feature map and video feature map and the received local dynamic image, and plays the audio data. The present invention also discloses a low-bitrate video conferencing data transmission system, a transmitting-end device, and a receiving-end device. The present invention saves bandwidth, so as to meet the ever-growing demands of video conferencing services.

Description

Low-bitrate video conferencing system and method, transmitting-end device, receiving-end device

Technical Field

The present invention relates to the field of multimedia communications, and in particular to a low-bitrate video conferencing system, a low-bitrate video conference data transmission method, a transmitting-end device, and a receiving-end device.

Background Art

A video conferencing system is used to hold remote, multi-point, real-time conferences, realizing the transmission and interaction of video and sound between multiple points. A video conferencing system mainly consists of terminals and a Micro Controller Unit (MCU). In a small video conferencing system, multiple terminals are usually centrally connected to one MCU to form a star topology network. A terminal is a client-side device equipped with multimedia components such as a display, camera, speaker, and microphone; the MCU is a system-side device that centrally exchanges and processes the multimedia information of each terminal.

A video conferencing system can be said to integrate network, video, and audio into one system, and it places very high demands on the network. Network bandwidth is in fact the foundation of the entire video conference, and its use in video conferencing is rather complicated, because different requirements generate different bandwidth demands, for example the number of participants, the number of speakers, and the image size. Many users want to use the largest possible image resolution, yet compared with 320 x 240, a 640 x 480 resolution increases the data volume fourfold, and 20 venues produce twice the data of 10 venues. Many conferences require screen sharing with branch offices; although this feature is very valuable, a 1024 x 768 screen is a very large image and generates a great deal of traffic. Therefore, without sufficient bandwidth, the video we see will jitter and the sound we hear will carry noise, preventing the video conference from proceeding normally. At present many enterprises use dedicated-line networks, which can basically guarantee the network bandwidth a video conferencing system needs, but dedicated lines are very costly.

In summary, the transmission of video data occupies a large amount of bandwidth, and the higher the resolution of the transmitted video data needed for the best display effect, the more bandwidth is occupied. The prior art offers no effective solution to the problem of heavy bandwidth occupation when transmitting video data.

Summary of the Invention
In view of this, the main purpose of the embodiments of the present invention is to provide a low-bitrate video conferencing system and method, a transmitting-end device, and a receiving-end device that save bandwidth, so that the bandwidth of an IP network can meet the ever-growing demands of video conferencing services.

To achieve the above purpose, the technical solution of the embodiments of the present invention is implemented as follows.
An embodiment of the present invention provides a low-bitrate video conferencing system, the system including a transmitting end and a receiving end, where:

the transmitting end is configured to acquire audio data and video data, form an audio feature map and a video feature map respectively, and acquire a local dynamic image; and to transmit the audio data and the local dynamic image to the receiving end;

the receiving end is configured to collate and synthesize the original video data from the audio features and video features extracted from its local audio feature map and video feature map and the received local dynamic image, and to play the audio data.

The transmitting end includes: a collecting unit, an identification unit, a feature mapping unit, and a sending unit; the receiving end includes: a receiving unit, a feature extraction comparison unit, and a data synthesis output unit; where:

the collecting unit is configured to collect audio data and video data and send them to the identification unit;

the identification unit is configured to identify the speaker identity, perform voice recognition on the collected audio data to acquire audio features, perform image recognition on the collected video data to acquire video features and a local dynamic image, and send the audio features, video features, and local dynamic image to the feature mapping unit;

the feature mapping unit is configured to query whether an audio feature map and a video feature map already exist, and if not, to generate them from the audio features and the video features respectively;

the sending unit is configured to send the audio data and the local dynamic image, the encoding of the audio data carrying the speaker identity;

the receiving unit is configured to receive the audio data and the local dynamic image;

the feature extraction comparison unit is configured to extract the speaker identity from the encoding of the audio data, query the existing audio feature map and video feature map, and, according to the speaker identity, extract the audio features from the audio feature map and the video features from the video feature map;

the data synthesis output unit is configured to restore the original video data by synthesizing the extracted video features with the received local dynamic image, and to output the audio data and the original video data in combination with the audio features.

In the above solution, the identification unit is configured to identify the speaker identity and the number of the conference in which the speaker is currently participating, form an identity code from the speaker identity and the conference number, and use the identity code to identify the identity feature corresponding to the collected audio data and video data; or to identify the identity feature by the speaker identity alone.

In the above solution, the feature mapping unit is configured to perform the query both locally at the transmitting end and in a network database: if found locally, the local audio feature map and video feature map are used; if found in the network database, they are downloaded from the network database to the local end; if found in neither, they are generated locally.

In the above solution, the audio feature map is composed of a speaker identity and the audio feature corresponding to the speaker identity; or of an identity code and the audio feature corresponding to the identity code, the identity code being formed from the speaker identity and the conference number. In the above solution, the video feature map is composed of a speaker identity and the video feature corresponding to the speaker identity; or of an identity code and the video feature corresponding to the identity code, the identity code being formed from the speaker identity and the conference number.

In the above solution, the local dynamic image includes trajectory image information of at least one of the speaker's head movements, eye movements, gestures, and contour movements.
An embodiment of the present invention further provides a low-bitrate video conference data transmission method, the method including:

the transmitting end acquires audio data and video data, forms an audio feature map and a video feature map respectively, acquires a local dynamic image, and transmits the audio data and the local dynamic image to the receiving end;

the receiving end collates and synthesizes the original video data from the audio features and video features extracted from its local audio feature map and video feature map and the received local dynamic image, and plays the audio data.

In the above solution, forming the audio feature map includes:

after the speaker identity is identified, forming the audio feature map with the speaker identity as the index key, the audio feature map being composed of the speaker identity and the audio feature corresponding to the speaker identity; or

after the speaker identity and the conference number are identified, forming the audio feature map with the speaker identity and the conference number as a combined index key, the audio feature map being composed of an identity code and the audio feature corresponding to the identity code; the identity code being formed from the speaker identity and the conference number.

In the above solution, forming the video feature map includes:

after the speaker identity is identified, forming the video feature map with the speaker identity as the index key, the video feature map being composed of the speaker identity and the video feature corresponding to the speaker identity; or

after the speaker identity and the conference number are identified, forming the video feature map with the speaker identity and the conference number as a combined index key, the video feature map being composed of an identity code and the video feature corresponding to the identity code; the identity code being formed from the speaker identity and the conference number.

In the above solution, before the audio feature map and the video feature map are formed, the method further includes: performing the query both locally at the transmitting end and in a network database; if found locally, using the local audio feature map and video feature map; if found in the network database, downloading them from the network database to the local end; if found in neither, forming them locally.

In the above solution, the local dynamic image includes trajectory image information of at least one of the speaker's head movements, eye movements, gestures, and contour movements.
An embodiment of the present invention further provides a transmitting-end device for a low-bitrate video conferencing system, the device being configured to acquire audio data and video data, form an audio feature map and a video feature map respectively, and acquire a local dynamic image; and to transmit the audio data and the local dynamic image to the receiving end.

In the above solution, the device includes: a collecting unit, an identification unit, a feature mapping unit, and a sending unit; where:

the collecting unit is configured to collect audio data and video data and send them to the identification unit;

the identification unit is configured to identify the speaker identity, perform voice recognition on the collected audio data to acquire audio features, perform image recognition on the collected video data to acquire video features and a local dynamic image, and send the audio features, video features, and local dynamic image to the feature mapping unit;

the feature mapping unit is configured to query whether an audio feature map and a video feature map already exist, and if not, to generate them from the audio features and the video features respectively;

the sending unit is configured to send the audio data and the local dynamic image, the encoding of the audio data carrying the speaker identity.

An embodiment of the present invention further provides a receiving-end device for a low-bitrate video conferencing system, the device being configured to collate and synthesize the original video data from the audio features and video features extracted from the local audio feature map and video feature map and the local dynamic image received from the transmitting end, and to play the audio data.

In the above solution, the device includes: a receiving unit, a feature extraction comparison unit, and a data synthesis output unit; where:

the receiving unit is configured to receive the audio data and the local dynamic image;

the feature extraction comparison unit is configured to extract the speaker identity from the encoding of the audio data, query the existing audio feature map and video feature map, and, according to the speaker identity, extract the audio features from the audio feature map and the video features from the video feature map;

the data synthesis output unit is configured to restore the original video data by synthesizing the extracted video features with the received local dynamic image, and to output the audio data and the original video data in combination with the audio features.
In the system of the embodiments of the present invention, audio data and video data are acquired at the transmitting end, an audio feature map and a video feature map are formed respectively, and a local dynamic image is acquired; the transmitting end transmits the audio data and the local dynamic image to the receiving end, and the receiving end collates and synthesizes the original video data from the audio features and video features extracted from its local audio feature map and video feature map and the received local dynamic image, and plays the audio data.

Since complete video data is not transmitted, only the local dynamic image, and the receiving end collates and synthesizes the original video data from the extracted audio features, video features, and the received local dynamic image and plays the audio data, the amount of transmitted data is brought under control and reduced, thereby saving bandwidth and meeting the needs of video conferencing services.

Brief Description of the Drawings
FIG. 1 is a schematic diagram of the structure of a system according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of the implementation of a method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an application example of identity establishment according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an application example of audio map establishment according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an application example of video map establishment according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of an application example of dynamic image acquisition according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an application example of the audio processing flow at the transmitting end according to an embodiment of the present invention; FIG. 8 is a schematic diagram of an application example of the video processing flow at the transmitting end according to an embodiment of the present invention; FIG. 9 is a schematic diagram of an application example of the video integration processing flow at the receiving end according to an embodiment of the present invention.
Detailed Description

In the embodiments of the present invention, audio data and video data are acquired at the transmitting end, an audio feature map and a video feature map are formed respectively, and a local dynamic image is acquired; the transmitting end transmits the audio data and the local dynamic image to the receiving end, and the receiving end collates and synthesizes the original video data from the audio features and video features extracted from its local audio feature map and video feature map and the received local dynamic image, and plays the audio data.

Considering that video data accounts for the vast majority of the bandwidth a video conference needs, and that for a company or organization video conferencing has distinctive characteristics (the participants are largely fixed, and the focus of the meeting is on the speaker, in particular the speaker's eyes, mouth shape, and gestures), the following conclusion is drawn from analysis: to improve bandwidth occupation, video data is not transmitted directly in the video conference; instead, the video data is split at the transmitting end and then integrated at the receiving end to restore the original video data. In this way, since the video data is not transmitted directly, the amount of transmitted data is reduced compared with the prior art, which reduces the bandwidth occupied during video data transmission; nor is there any need to sacrifice video quality, replacing high-resolution video data with low-resolution video data, out of concern that transmitting high-resolution video would occupy more bandwidth. Because the embodiments of the present invention split the video data rather than transmitting it directly, there is no need to worry about heavy bandwidth occupation: the bandwidth remains within a controllable range, and within that range high-resolution video data with the best display effect can still be obtained.

The implementation of the technical solution is described in further detail below with reference to the accompanying drawings.
FIG. 1 shows a low-bitrate video conferencing system according to an embodiment of the present invention. The system includes a transmitting end 1 and a receiving end 2, where:

the transmitting end 1 is configured to collect audio data and video data, form an audio feature map and a video feature map respectively, and acquire a local dynamic image; and to transmit the audio data and the local dynamic image to the receiving end 2;

the receiving end 2 is configured to collate and synthesize the original video data from the audio features and video features extracted from its local audio feature map and video feature map and the received local dynamic image, and to play the audio data.

Preferably, the transmitting end 1 includes: a collecting unit 11, an identification unit 12, a feature mapping unit 13, and a sending unit 14, where:

the collecting unit 11 is configured to collect audio data and video data and send them to the identification unit.

The identification unit 12 is configured to identify the speaker identity, perform voice recognition on the collected audio data to acquire audio features, perform image recognition on the collected video data to acquire video features and a local dynamic image, and send the audio features, video features, and local dynamic image to the feature mapping unit 13.

Here, in addition to the speaker identity, the number of the conference in which the speaker participates can also be identified, and an identity code is generated from the speaker identity and the conference number.

Here, the video features include the background image features of the conference and the image features of the speaker. The local dynamic image includes trajectory image information of at least one of the speaker's head movements, eye movements, gestures, and contour movements.

Here, the identification unit 12 can also be divided into a voice recognition subunit and an image recognition subunit: the voice recognition subunit is configured to perform voice recognition on the collected audio data and acquire audio features; the image recognition subunit is configured to perform image recognition on the collected video data and acquire video features and the local dynamic image.

The feature mapping unit 13 is configured to query, locally or in a network database, whether an audio feature map and a video feature map already exist; if not, to generate an audio feature map from the speaker identity and the received audio features and a video feature map from the speaker identity and the received video features, and to store the audio feature map and video feature map locally or upload them to the network database for storage and subsequent queries.

Here, both the audio feature map and the video feature map may use the speaker identity as the map index key; the maps may further include the conference number, using the speaker identity and the conference number as a combined map index key.

Here, the feature mapping unit 13 can also be divided into an audio feature mapping subunit and a video feature mapping subunit. The audio feature mapping subunit is configured to query, locally or in the network database, whether the audio feature map already exists and, if not, to generate an audio feature map from the speaker identity and the received audio features and store it locally or upload it to the network database for subsequent queries; the video feature mapping subunit is configured to query, locally or in the network database, whether the video feature map already exists and, if not, to generate a video feature map from the speaker identity and the received video features and store it locally or upload it to the network database for subsequent queries.

The sending unit 14 is configured to send the audio data and the local dynamic image, the encoding of the audio data carrying the speaker identity or the identity code.

If the audio data itself is sent, its features need not be extracted; only the video features need to be extracted from the video feature map according to the speaker identity for use in collation and synthesis. Alternatively, when only the local dynamic image is sent, the receiving end needs to extract the audio features from the audio feature map according to the speaker identity. When the sending unit sends the identity code, the identity code is composed of the speaker identity and the conference number. At the receiving end, the identity code is matched to the corresponding audio features, video features, and local dynamic graphics so that they can be collated and merged to restore the original video data and play the audio data; through this interaction between the transmitting end and the receiving end, the receiving end can vividly restore the expressions, mouth shapes, gestures, posture, and the like of the speakers in the current conference. Moreover, since only the local dynamic graphics need to be transmitted, complete video data need not be sent; the audio/video features of previously collected audio/video data are stored at both the transmitting end and the receiving end, with a backup in the network database. Restoring the original video data and playing the audio data then only requires extracting the corresponding audio/video features from the audio/video feature maps in the receiving end's local or network database according to the speaker identity and synthesizing them with the received local dynamic graphics, which is simple and easy to operate, reduces the amount of transmitted data, and saves bandwidth. Nor is there any concern about being unable to transmit and display high-resolution video data.

The above are the functional units of the transmitting-end device of the system; the functional units of the receiving-end device are described below.

The receiving end 2 includes: a receiving unit 21, a feature extraction comparison unit 22, and a data synthesis output unit 23, where:

the receiving unit 21 is configured to receive the audio data and the local dynamic image.

The feature extraction comparison unit 22 is configured to extract the speaker identity from the audio data, query the existing audio feature map and video feature map locally or in the network database, and, according to the speaker identity, extract the audio features from the audio feature map and the video features from the video feature map.

Here, when the audio data carries the speaker identity, the speaker identity is used as the index key to query the audio feature map and the video feature map. If the audio data carries not the speaker identity but an identity code consisting of the speaker identity and the conference number, the identity code is used as the combined index key for the query.

Here, the feature extraction comparison unit 22 can also be divided into an audio feature extraction comparison subunit and a video feature extraction comparison subunit. The audio feature extraction comparison subunit is configured to extract the speaker identity from the audio data, query the existing audio feature map locally or in the network database, and extract the audio features from the audio feature map according to the speaker identity; the video feature extraction comparison subunit is configured to extract the video features from the video feature map according to the speaker identity.

The data synthesis output unit 23 is configured to restore the original video data by synthesizing the extracted video features with the received local dynamic image, and to output the audio data and the original video data in combination with the audio features.

In practical applications, the collecting unit 11, identification unit 12, feature mapping unit 13, sending unit 14, receiving unit 21, feature extraction comparison unit 22, and data synthesis output unit 23 may all be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like; the CPU, DSP, and FPGA can all be built into the video conferencing system.
FIG. 2 shows a low-bitrate video conference data transmission method according to an embodiment of the present invention, including the following steps:

Step 101: Collect audio data and video data, identify the speaker identity, perform voice recognition on the collected audio data to acquire audio features, and perform image recognition on the collected video data to acquire video features and a local dynamic image.

Step 102: Send the audio data and the local dynamic image, the encoding of the audio data carrying the speaker identity.

Step 103: Receive the audio data and the local dynamic image, extract the speaker identity from the encoding of the audio data, query the existing audio feature map and video feature map locally or in the network database, and, according to the speaker identity, extract the audio features from the audio feature map and the video features from the video feature map.

Step 104: Restore the original video data by synthesizing the extracted video features with the received local dynamic image, and output the audio data and the original video data in combination with the audio features.
Meanwhile, an embodiment of the present invention further provides a transmitting-end device for a low-bitrate video conferencing system. The transmitting-end device has the same structure and functions as the transmitting end 1 in the foregoing system and includes: a collecting unit, an identification unit, a feature mapping unit, and a sending unit; where the collecting unit is configured to collect audio data and video data and send them to the identification unit.

The identification unit is configured to identify the speaker identity, perform voice recognition on the collected audio data to acquire audio features, perform image recognition on the collected video data to acquire video features and a local dynamic image, and send the audio features, video features, and local dynamic image to the feature mapping unit.

The feature mapping unit is configured to query, locally or in a network database, whether an audio feature map and a video feature map already exist; if not, to generate an audio feature map from the speaker identity and the received audio features and a video feature map from the speaker identity and the received video features, and to store them locally or upload them to the network database for storage and subsequent queries.

The sending unit is configured to send the audio data and the local dynamic image, the encoding of the audio data carrying the speaker identity or the identity code.

If the audio data itself is sent, its features need not be extracted; only the video features need to be extracted from the video feature map according to the speaker identity for use in collation and synthesis. Alternatively, when only the local dynamic image is sent, the receiving end needs to extract the audio features from the audio feature map according to the speaker identity. When the sending unit sends the identity code, the identity code is composed of the speaker identity and the conference number. At the receiving end, the identity code is matched to the corresponding audio features, video features, and local dynamic graphics so that they can be collated and merged to restore the original video data and play the audio data; through this interaction between the transmitting end and the receiving end, the receiving end can vividly restore the expressions, mouth shapes, gestures, posture, and the like of the speakers in the current conference. Moreover, since only the local dynamic graphics need to be transmitted, complete video data need not be sent; the audio/video features of previously collected audio/video data are stored at both the transmitting end and the receiving end, with a backup in the network database. Restoring the original video data and playing the audio data then only requires extracting the corresponding audio/video features from the audio/video feature maps in the receiving end's local or network database according to the speaker identity and synthesizing them with the received local dynamic graphics, which is simple and easy to operate, reduces the amount of transmitted data, and saves bandwidth. Nor is there any concern about being unable to transmit and display high-resolution video data.

In practical applications, the collecting unit, identification unit, feature mapping unit, and sending unit may all be implemented by a CPU, DSP, or FPGA; the CPU, DSP, and FPGA can all be built into the video conferencing system.

Meanwhile, an embodiment of the present invention further provides a receiving-end device for a low-bitrate video conferencing system. The receiving-end device has the same structure and functions as the receiving end 2 in the foregoing system and includes: a receiving unit, a feature extraction comparison unit, and a data synthesis output unit; where:

the receiving unit is configured to receive the audio data and the local dynamic image.

The feature extraction comparison unit is configured to extract the speaker identity from the audio data, query the existing audio feature map and video feature map locally or in the network database, and, according to the speaker identity, extract the audio features from the audio feature map and the video features from the video feature map.

The data synthesis output unit is configured to restore the original video data by synthesizing the extracted video features with the received local dynamic image, and to output the audio data and the original video data in combination with the audio features. In practical applications, the receiving unit, feature extraction comparison unit, and data synthesis output unit may all be implemented by a CPU, DSP, or FPGA; the CPU, DSP, and FPGA can all be built into the video conferencing system.
FIG. 3 is a schematic diagram of an application example of identity establishment according to an embodiment of the present invention. The identity establishment process includes: acquiring the speaker identity and the venue number, and generating an identity code from the speaker identity and the conference number to determine a unique identity.

FIG. 4 is a schematic diagram of an application example of audio map establishment according to an embodiment of the present invention. The audio map establishment process includes: after performing voice recognition on the audio data, the transmitting end identifies the speaker identity and audio characteristics and stores the speaker identity and the audio features; the speaker identity and the audio feature corresponding to the speaker identity form an audio feature map as a mapping relationship. The audio feature map may be stored in the form of an audio feature template. Here, the audio feature mapping relationship in the audio feature template may use the speaker identity as the key indexing the audio feature corresponding to that speaker identity.

FIG. 5 is a schematic diagram of an application example of video map establishment according to an embodiment of the present invention. The video map establishment process includes: after performing image recognition on the video data, the transmitting end identifies the speaker identity and video characteristics and stores the speaker identity and the video features; the speaker identity and the video feature corresponding to the speaker identity form a video feature map as a mapping relationship. The video feature map may be stored in the form of a video feature template. Here, the video feature mapping relationship in the video feature template may use the speaker identity as the key indexing the video feature corresponding to that speaker identity.

FIG. 6 is a schematic diagram of an application example of dynamic image acquisition according to an embodiment of the present invention. The dynamic image acquisition process includes: acquiring the local dynamic image by capturing the speaker's contour motions such as head movements, eye movements, gestures, and bending. The local dynamic image includes trajectory image information of at least one of the speaker's head movements, eye movements, gestures, and contour movements.

The transmitting-end processing flow of the embodiment of the present invention includes: audio/video collection; voice recognition of the collected audio data; establishment of audio/video feature templates; and sending the audio, and collecting and sending the dynamic feature image. Specifically, the audio and video processing at the transmitting end is described as follows.

FIG. 7 is a schematic diagram of an application example of the audio processing flow at the transmitting end according to an embodiment of the present invention. The flow includes: at the transmitting end, the terminal collects the audio input source signal through a microphone and performs audio encoding and voice recognition; it extracts the audio features and queries locally whether an audio feature mapping template already exists. If one exists locally, the audio is output and transmitted to the receiving end; if not, the network database is queried for the audio feature mapping template. If it exists there, the template is downloaded directly to the local end and the audio is output and transmitted to the receiving end; if it does not exist in the network database either, an audio feature mapping template is established and stored in both the local and network databases.

FIG. 8 is a schematic diagram of an application example of the video processing flow at the transmitting end according to an embodiment of the present invention. The flow includes: at the transmitting end, the terminal collects the video input source signal and performs video encoding; it extracts the video features, formed from the background image features and the speaker image features; it then queries locally whether a video feature mapping template already exists. If one exists locally, local dynamic images such as the speaker's head movements, eye movements, and gestures are collected, and the local dynamic image is output and transmitted to the receiving end; if not, the network database is queried for the video feature mapping template. If it exists there, the template is downloaded directly to the local end, the local dynamic images such as the speaker's head movements, eye movements, and gestures are collected, and the local dynamic image is output and transmitted to the receiving end; if it does not exist in the network database either, a video feature mapping template is established and stored in both the local and network databases.

The receiving-end processing flow of the embodiment of the present invention includes: receiving the audio and extracting the audio feature template; extracting the video feature template and synthesizing the video features with the local dynamic image to restore the original video data; and audio/video output. Specifically, the video integration processing of the embodiment of the present invention is described as follows.

FIG. 9 is a schematic diagram of an application example of the video integration processing flow at the receiving end according to an embodiment of the present invention. The flow includes: receiving the audio signal, audio encoding, and identity recognition (by means of the identity code composed of the speaker identity and the conference number); determining whether the local video feature mapping template exists and, if not, downloading it from the network database, or, if it exists, extracting the video features from the local video feature mapping template; receiving the local dynamic image; restoring the original video data, namely the venue environment and the speaker image, in particular lip shapes and gestures, from the audio features and video features extracted from the audio/video feature mapping templates in the local or network database and the received local dynamic image; and outputting the audio signal and the synthesized video signal.
The above are only preferred embodiments of the present invention and are not intended to limit the protection scope of the present invention.

Industrial Applicability

The low-bitrate video conferencing system and method provided by the embodiments of the present invention acquire audio data and video data at the transmitting end, form an audio feature map and a video feature map respectively, and acquire a local dynamic image; the audio data and the local dynamic image are transmitted to the receiving end. With the technical solution of the embodiments of the present invention, the transmitting end does not need to transmit complete video data and only needs to transmit the local dynamic image to the receiving end; the receiving end collates and synthesizes the original video data from the extracted audio and video features and the received local dynamic image and plays the audio data. In this way, the amount of transmitted data is brought under control and effectively reduced, thereby saving bandwidth and meeting the needs of video conferencing services.

Claims

1. A low-bitrate video conferencing system, the system comprising a transmitting end and a receiving end, wherein:

the transmitting end is configured to acquire audio data and video data, form an audio feature map and a video feature map respectively, and acquire a local dynamic image; and to transmit the audio data and the local dynamic image to the receiving end;

the receiving end is configured to collate and synthesize the original video data from the audio features and video features extracted from its local audio feature map and video feature map and the received local dynamic image, and to play the audio data.

2. The system according to claim 1, wherein the transmitting end comprises: a collecting unit, an identification unit, a feature mapping unit, and a sending unit;

the receiving end comprises: a receiving unit, a feature extraction comparison unit, and a data synthesis output unit; wherein:

the collecting unit is configured to collect audio data and video data and send them to the identification unit;

the identification unit is configured to identify the speaker identity, perform voice recognition on the collected audio data to acquire audio features, perform image recognition on the collected video data to acquire video features and a local dynamic image, and send the audio features, video features, and local dynamic image to the feature mapping unit;

the feature mapping unit is configured to query whether an audio feature map and a video feature map already exist, and if not, to generate them from the audio features and the video features respectively;

the sending unit is configured to send the audio data and the local dynamic image, the encoding of the audio data carrying the speaker identity; the receiving unit is configured to receive the audio data and the local dynamic image;

the feature extraction comparison unit is configured to extract the speaker identity from the encoding of the audio data, query the existing audio feature map and video feature map, and, according to the speaker identity, extract the audio features from the audio feature map and the video features from the video feature map;

the data synthesis output unit is configured to restore the original video data by synthesizing the extracted video features with the received local dynamic image, and to output the audio data and the original video data in combination with the audio features.

3. The system according to claim 2, wherein the identification unit is configured to identify the speaker identity and the number of the conference in which the speaker is currently participating, form an identity code from the speaker identity and the conference number, and use the identity code to identify the identity feature corresponding to the collected audio data and video data; or to identify the identity feature by the speaker identity alone.

4. The system according to claim 2, wherein the feature mapping unit is configured to perform the query both locally at the transmitting end and in a network database: if found locally, the local audio feature map and video feature map are used; if found in the network database, they are downloaded from the network database to the local end; if found in neither, they are generated locally.

5. The system according to claim 2, wherein the audio feature map is composed of a speaker identity and the audio feature corresponding to the speaker identity; or the audio feature map is composed of an identity code and the audio feature corresponding to the identity code, the identity code being formed from the speaker identity and the conference number.

6. The system according to claim 2, wherein the video feature map is composed of a speaker identity and the video feature corresponding to the speaker identity; or the video feature map is composed of an identity code and the video feature corresponding to the identity code, the identity code being formed from the speaker identity and the conference number.

7. The system according to any one of claims 1 to 6, wherein the local dynamic image comprises trajectory image information of at least one of the speaker's head movements, eye movements, gestures, and contour movements.

8. A low-bitrate video conference data transmission method, the method comprising:

a transmitting end acquires audio data and video data, forms an audio feature map and a video feature map respectively, acquires a local dynamic image, and transmits the audio data and the local dynamic image to a receiving end;

the receiving end collates and synthesizes the original video data from the audio features and video features extracted from its local audio feature map and video feature map and the received local dynamic image, and plays the audio data.

9. The method according to claim 8, wherein forming the audio feature map comprises: after the speaker identity is identified, forming the audio feature map with the speaker identity as the index key, the audio feature map being composed of the speaker identity and the audio feature corresponding to the speaker identity; or

after the speaker identity and the conference number are identified, forming the audio feature map with the speaker identity and the conference number as a combined index key, the audio feature map being composed of an identity code and the audio feature corresponding to the identity code; the identity code being formed from the speaker identity and the conference number.

10. The method according to claim 8, wherein forming the video feature map comprises:

after the speaker identity is identified, forming the video feature map with the speaker identity as the index key, the video feature map being composed of the speaker identity and the video feature corresponding to the speaker identity; or

after the speaker identity and the conference number are identified, forming the video feature map with the speaker identity and the conference number as a combined index key, the video feature map being composed of an identity code and the video feature corresponding to the identity code; the identity code being formed from the speaker identity and the conference number.

11. The method according to claim 8, wherein before the audio feature map and the video feature map are formed, the method further comprises:

performing the query both locally at the transmitting end and in a network database; if found locally, using the local audio feature map and video feature map; if found in the network database, downloading them from the network database to the local end; if found in neither, forming them locally.

12. The method according to any one of claims 8 to 11, wherein the local dynamic image comprises trajectory image information of at least one of the speaker's head movements, eye movements, gestures, and contour movements.

13. A transmitting-end device for a low-bitrate video conferencing system, the device being configured to acquire audio data and video data, form an audio feature map and a video feature map respectively, and acquire a local dynamic image; and to transmit the audio data and the local dynamic image to a receiving end.

14. The device according to claim 13, wherein the device comprises: a collecting unit, an identification unit, a feature mapping unit, and a sending unit; wherein:

the collecting unit is configured to collect audio data and video data and send them to the identification unit;

the identification unit is configured to identify the speaker identity, perform voice recognition on the collected audio data to acquire audio features, perform image recognition on the collected video data to acquire video features and a local dynamic image, and send the audio features, video features, and local dynamic image to the feature mapping unit;

the feature mapping unit is configured to query whether an audio feature map and a video feature map already exist, and if not, to generate them from the audio features and the video features respectively;

the sending unit is configured to send the audio data and the local dynamic image, the encoding of the audio data carrying the speaker identity.

15. A receiving-end device for a low-bitrate video conferencing system, the device being configured to collate and synthesize the original video data from the audio features and video features extracted from the local audio feature map and video feature map and the local dynamic image received from a transmitting end, and to play the audio data.

16. The device according to claim 15, wherein the device comprises: a receiving unit, a feature extraction comparison unit, and a data synthesis output unit; wherein:

the receiving unit is configured to receive the audio data and the local dynamic image;

the feature extraction comparison unit is configured to extract the speaker identity from the encoding of the audio data, query the existing audio feature map and video feature map, and, according to the speaker identity, extract the audio features from the audio feature map and the video features from the video feature map;

the data synthesis output unit is configured to restore the original video data by synthesizing the extracted video features with the received local dynamic image, and to output the audio data and the original video data in combination with the audio features.
PCT/CN2013/086009 2012-11-23 2013-10-25 Low-bitrate video conferencing system and method, transmitting-end device, receiving-end device WO2014079302A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US14/647,259 US20150341565A1 (en) 2012-11-23 2013-10-25 Low data-rate video conference system and method, sender equipment and receiver equipment
EP13856801.9A EP2924985A4 (en) 2012-11-23 2013-10-25 SYSTEM AND METHOD FOR LOW BINARY RATE VIDEO CONFERENCE, SENDING END DEVICE, AND RECEIVING END DEVICE

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210480773.5A 2012-11-23 Low-bitrate video conferencing system and method, transmitting-end device, receiving-end device
CN201210480773.5 2012-11-23

Publications (1)

Publication Number Publication Date
WO2014079302A1 true WO2014079302A1 (zh) 2014-05-30

Family

ID=50775511

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/086009 WO2014079302A1 (zh) Low-bitrate video conferencing system and method, transmitting-end device, receiving-end device

Country Status (4)

Country Link
US (1) US20150341565A1 (zh)
EP (1) EP2924985A4 (zh)
CN (1) CN103841358B (zh)
WO (1) WO2014079302A1 (zh)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106559636A (zh) * 2015-09-25 2017-04-05 中兴通讯股份有限公司 Video communication method, apparatus and system
EP3437321B1 (en) 2016-07-26 2023-09-13 Hewlett-Packard Development Company, L.P. Teleconference transmission
CN108537508A (zh) * 2018-03-30 2018-09-14 上海爱优威软件开发有限公司 Conference recording method and system
US11527265B2 (en) * 2018-11-02 2022-12-13 BriefCam Ltd. Method and system for automatic object-aware video or audio redaction
CN112702556A (zh) * 2020-12-18 2021-04-23 厦门亿联网络技术股份有限公司 Auxiliary stream data transmission method and system, storage medium, and terminal device
CN114866192A (zh) * 2022-05-31 2022-08-05 电子科技大学 Signal transmission method based on features and related information

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5995518A (en) * 1997-05-01 1999-11-30 Hughes Electronics Corporation System and method for communication of information using channels of different latency
US6072494A (en) * 1997-10-15 2000-06-06 Electric Planet, Inc. Method and apparatus for real-time gesture recognition
WO2008091485A2 (en) * 2007-01-23 2008-07-31 Euclid Discoveries, Llc Systems and methods for providing personal video services
CN101677389A * 2008-09-17 2010-03-24 深圳富泰宏精密工业有限公司 Picture transmission system and method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100241432A1 (en) * 2009-03-17 2010-09-23 Avaya Inc. Providing descriptions of visually presented information to video teleconference participants who are not video-enabled
US20110069142A1 (en) * 2009-09-24 2011-03-24 Microsoft Corporation Mapping psycho-visual characteristics in measuring sharpness feature and blurring artifacts in video streams
CN101951494A * 2010-10-14 2011-01-19 上海紫南信息技术有限公司 Method for fusing traditional telephone with video conference display images
CN102271241A * 2011-09-02 2011-12-07 北京邮电大学 Image communication method and system based on facial expression/action recognition
CN102427533A * 2011-11-22 2012-04-25 苏州科雷芯电子科技有限公司 Video transmission apparatus and method
CN102572356A * 2012-01-16 2012-07-11 华为技术有限公司 Method for recording a conference and conference system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2924985A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105704421A (zh) * 2016-03-16 2016-06-22 国网山东省电力公司信息通信公司 Networking architecture and method for main and branch venues of a video conference
CN105704421B (zh) * 2016-03-16 2019-01-01 国网山东省电力公司信息通信公司 Networking system and method for main and branch venues of a video conference

Also Published As

Publication number Publication date
CN103841358A (zh) 2014-06-04
EP2924985A4 (en) 2015-11-25
CN103841358B (zh) 2017-12-26
US20150341565A1 (en) 2015-11-26
EP2924985A1 (en) 2015-09-30

Similar Documents

Publication Publication Date Title
WO2014079302A1 (zh) Low-bitrate video conferencing system and method, transmitting-end device, receiving-end device
CN108055496B (zh) Live broadcast method and system for video conferences
US10057542B2 System for immersive telepresence
WO2021143315A1 (zh) Scene interaction method and apparatus, electronic device, and computer storage medium
JP6179834B1 (ja) Video conference device
WO2014180371A1 (zh) Conference control method, apparatus, and conference system
US20200106708A1 Load Balancing Multimedia Conferencing System, Device, and Methods
KR101480116B1 (ko) Method and system for implementing a video conference, and broadband mobile hotspot device
CN103051864B (zh) Mobile video conference method
CN102550019A (zh) Managing shared content in a virtual collaboration system
WO2012109956A1 (zh) Method and device for processing conference information in a video conference
EP3005690B1 Method and system for associating an external device to a video conference session
WO2014094461A1 (zh) Method, apparatus, and system for processing video and audio information in a video conference
WO2023125350A1 (zh) Audio data pushing method, apparatus, and system, electronic device, and storage medium
WO2014106430A1 (zh) Conference scheduling method, device, and system
WO2013178188A1 (zh) Video conference display method and apparatus
WO2014173091A1 (zh) Method and apparatus for displaying conference materials in a video conference
TW201012222A Method for producing internet video images
WO2016206471A1 (zh) Multimedia service processing method, system, and apparatus
CN108320331B (zh) Method and device for generating augmented-reality video information of a user scene
CN103051858A (zh) Real-time screen interaction apparatus, method, and system for video communication
CN113612759A (zh) High-performance, high-concurrency intelligent broadcasting system based on the SIP protocol and implementation method
JP2003339037A (ja) Network conference system, network conference method, and network conference program
JP2001268078A (ja) Communication control device, method thereof, provision medium thereof, and communication device
CN109640030A (zh) Audio/video peripheral expansion apparatus and method for a video conferencing system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13856801

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 14647259

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 2013856801

Country of ref document: EP