CN107205131A

CN107205131A - Method, device and system for realizing a video call

Info

Publication number: CN107205131A
Application number: CN201610161286.0A
Authority: CN
Inventors: 程岑
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2016-03-18
Filing date: 2016-03-18
Publication date: 2017-09-26
Also published as: WO2017157168A1

Abstract

A method, device and system for realizing a video call, including: a first terminal separately collects a digital audio signal and a digital video signal; the first terminal converts the digital audio signal into text information, encapsulates the text information into a text packet, encapsulates the digital audio signal into an audio packet, and encapsulates the digital video signal into a video packet; the first terminal sends the text packet, the audio packet and the video packet to a second terminal separately.

Description

Method, device and system for realizing a video call

Technical field

This document relates to, but is not limited to, the field of video calls, and in particular to a method, device and system for realizing a video call.

Background

With the rapid development of mobile and Internet broadband technology, visual communication value-added services have spread rapidly among ordinary users. Technology based on these services makes value-added services such as face-to-face communication and online video teaching available. If synchronized subtitles for the audio are added to a visual communication service, the service can not only better serve users with impaired hearing but also usefully supplement the actual audio when network conditions are poor.

In the related art, a method for adding speech subtitles to a video call generally includes:

A first terminal separately collects a digital audio signal and a digital video signal; performs speech encoding on the collected digital audio signal and encapsulates the speech-encoded digital audio signal into an audio packet; converts the collected digital audio signal into text information by speech recognition, superimposes the text information on the collected digital video signal, performs video encoding on the composited signal, and encapsulates the video-encoded digital video signal into a video packet; and sends the audio packet and the video packet to a second terminal separately;

The second terminal receives the audio packet and the video packet, performs speech decoding on the speech-encoded digital audio signal in the audio packet to obtain the digital audio signal and plays it, and performs video decoding on the video-encoded digital video signal in the video packet to obtain the digital video signal and displays it.

In the above method, video packets are relatively large, so when network conditions are poor they are more likely to suffer packet loss and jitter. Because the text information is carried inside the video packets, it is lost together with them, which causes information loss during the video call.

Summary of the invention

Embodiments of the present invention propose a method, device and system for realizing a video call, which can reduce information loss during a video call when network conditions are poor.

An embodiment of the present invention proposes a method for realizing a video call, including:

A first terminal separately collects a digital audio signal and a digital video signal;

The first terminal converts the digital audio signal into text information, encapsulates the text information into a text packet, encapsulates the digital audio signal into an audio packet, and encapsulates the digital video signal into a video packet;

The first terminal sends the text packet, the audio packet and the video packet to a second terminal separately.

Optionally, before encapsulating the digital audio signal into an audio packet, the method further includes: the first terminal performs speech encoding on the digital audio signal;

Encapsulating the digital audio signal into an audio packet includes: the first terminal encapsulates the speech-encoded digital audio signal into the audio packet.

Optionally, before encapsulating the digital video signal into a video packet, the method further includes: the first terminal performs video encoding on the digital video signal;

Encapsulating the digital video signal into a video packet includes: the first terminal encapsulates the video-encoded digital video signal into the video packet.

An embodiment of the present invention also proposes a method for realizing a video call, including:

A second terminal receives a text packet from a first terminal;

When the second terminal determines that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, it displays the text information in those text packets, among the received text packet and the cached text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

Optionally, when the second terminal determines that the time corresponding to the timestamp in the received text packet is greater than the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, the method further includes:

The second terminal caches the received text packet.

Optionally, when the second terminal receives neither an audio packet nor a video packet within a preset time after receiving the text packet, the method further includes:

The second terminal displays the text information in the cached text packets.

Optionally, after the second terminal receives the text packet from the first terminal, and before the second terminal determines that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, the method further includes:

The second terminal determines that the subtitle display function is turned on.

An embodiment of the present invention also proposes a first terminal, including:

An acquisition module, configured to separately collect a digital audio signal and a digital video signal;

A first processing module, configured to convert the digital audio signal into text information, encapsulate the text information into a text packet, encapsulate the digital audio signal into an audio packet, and encapsulate the digital video signal into a video packet;

A sending module, configured to send the text packet, the audio packet and the video packet to a second terminal separately.

Optionally, the first processing module is specifically configured to:

Convert the digital audio signal into text information, perform speech encoding on the digital audio signal, encapsulate the text information into a text packet, encapsulate the speech-encoded digital audio signal into the audio packet, perform video encoding on the digital video signal, and encapsulate the video-encoded digital video signal into the video packet.

An embodiment of the present invention also proposes a second terminal, including:

A receiving module, configured to receive a text packet from a first terminal;

A second processing module, configured to, upon determining that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, display the text information in those text packets, among the received text packet and the cached text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

Optionally, the second processing module is further configured to:

Cache the received text packet upon determining that the time corresponding to the timestamp in the received text packet is greater than the time corresponding to the timestamp of the audio packet being played or the video packet being displayed.

Optionally, the second processing module is further configured to:

Display the text information in the cached text packets when neither an audio packet nor a video packet is received within a preset time after the text packet is received.

An embodiment of the present invention also proposes a system for realizing a video call, including:

A first terminal, configured to separately collect a digital audio signal and a digital video signal; convert the digital audio signal into text information, encapsulate the text information into a text packet, encapsulate the digital audio signal into an audio packet, and encapsulate the digital video signal into a video packet; and send the text packet, the audio packet and the video packet to a second terminal separately;

A second terminal, configured to receive the text packet from the first terminal; and, upon determining that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, display the text information in those text packets, among the received text packet and the cached text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

Optionally, the second terminal is further configured to:

Compared with the related art, the technical solution of the embodiments of the present invention includes: a first terminal separately collects a digital audio signal and a digital video signal; the first terminal converts the digital audio signal into text information, encapsulates the text information into a text packet, encapsulates the digital audio signal into an audio packet, and encapsulates the digital video signal into a video packet; the first terminal sends the text packet, the audio packet and the video packet to a second terminal separately. Because the first terminal sends the text packet, the audio packet and the video packet separately, losing video packets under poor network conditions does not cause the text to be lost, which reduces information loss during the video call.

Brief description of the drawings

The accompanying drawings of the embodiments of the present invention are described below. The drawings are intended to provide a further understanding of the present invention and, together with the specification, serve to explain the present invention; they do not limit its protection scope.

Fig. 1 is a flowchart of the method for realizing a video call at the sending end according to an embodiment of the present invention;

Fig. 2 is a flowchart of the method for realizing a video call at the receiving end according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the first terminal according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of the second terminal according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of the system for realizing a video call according to an embodiment of the present invention.

Detailed description

To aid understanding by those skilled in the art, the invention is further described below with reference to the accompanying drawings; this description is not to be used to limit the protection scope of the invention. It should be noted that, where no conflict arises, the embodiments of this application and the various implementations within them may be combined with one another.

Referring to Fig. 1, an embodiment of the present invention proposes a method for realizing a video call, including:

Step 100: the first terminal separately collects a digital audio signal and a digital video signal.

In this step, the first terminal may collect the digital audio signal at the collection interval specified in G.711 (an audio coding standard defined by the International Telecommunication Union) and collect the digital video signal according to a preset video frame rate. For example, one digital audio signal is collected every 10 milliseconds (ms) and one digital video signal is collected every 40 ms.

Step 101: the first terminal converts the digital audio signal into text information, encapsulates the text information into a text packet, encapsulates the digital audio signal into an audio packet, and encapsulates the digital video signal into a video packet.

In this step, the first terminal may convert the digital audio signal into text information using speech recognition technology.

In this step, the text packet, the audio packet and the video packet may each be encapsulated according to the packet format of the Real-time Transport Protocol (RTP).

The format of the RTP packet header is shown in Table 1.

Table 1

In Table 1, V represents the protocol version, 2 bits.

P represents the padding bit, 1 bit. When P is set, the end of the RTP packet contains additional padding bytes.

X is the extension bit, 1 bit. When X is set, an extension header follows the RTP packet header.

CC represents the number of contributing source (CSRC) identifiers, 4 bits.

M is the marker bit, 1 bit.

PT is the payload type (Payload Type), 7 bits. For a text packet, a type that is unused in the related art can be used, for example 20.

Sequence number, 16 bits. The sequence number increases by 1 for each RTP packet sent. In the embodiment of the present invention, text packets, audio packets and video packets are numbered independently.

Timestamp, 32 bits, records the sampling instant of the first byte in the RTP packet. For an audio packet or a video packet, the timestamp is the time at which collection started; for a text packet, the timestamp is the time at which collection of the corresponding audio packet started.

Synchronization source identifier (SSRC, Synchronization Source Identifier), 32 bits, identifies the source of the RTP packet. No two SSRC values may be identical within the same RTP session.

CSRC, 0 to 15 items of 32 bits each. This field is not a mandatory part of the RTP packet header.

The timestamp field in a text packet is the time at which the digital audio signal or the digital video signal was collected, and the Payload Type is the speech text information type (an undefined value can be used, for example 20).
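As an illustration only, the header fields described above could be packed as in the following Python sketch. The function names `pack_rtp_header` and `make_text_packet`, the UTF-8 payload encoding and the constant `TEXT_PAYLOAD_TYPE` are assumptions introduced for this example; the description itself only states that an otherwise unused payload type, such as 20, may be chosen for text packets.

```python
import struct

# Assumption: the example value the description suggests for text packets.
TEXT_PAYLOAD_TYPE = 20


def pack_rtp_header(payload_type, sequence, timestamp, ssrc,
                    marker=False, padding=False, extension=False, csrcs=()):
    """Pack the 12-byte fixed RTP header followed by any CSRC entries."""
    if not 0 <= payload_type <= 0x7F:
        raise ValueError("payload type must fit in 7 bits")
    if len(csrcs) > 15:
        raise ValueError("at most 15 CSRC identifiers are allowed")
    version = 2
    # First byte: V (2 bits), P (1 bit), X (1 bit), CC (4 bits).
    byte0 = (version << 6) | (int(padding) << 5) | (int(extension) << 4) | len(csrcs)
    # Second byte: M (1 bit), PT (7 bits).
    byte1 = (int(marker) << 7) | payload_type
    header = struct.pack("!BBHII", byte0, byte1, sequence & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + b"".join(struct.pack("!I", c) for c in csrcs)


def make_text_packet(text, sequence, audio_timestamp, ssrc):
    """Build a text packet whose timestamp is that of the corresponding audio packet."""
    header = pack_rtp_header(TEXT_PAYLOAD_TYPE, sequence, audio_timestamp, ssrc)
    return header + text.encode("utf-8")  # assumption: UTF-8 payload
```

In this sketch the caller keeps a separate sequence counter for each of the text, audio and video streams, which matches the independent numbering described above.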

Step 102: the first terminal sends the text packet, the audio packet and the video packet to the second terminal separately.

In this step, different types of packets can be sent according to different strategies. For example, audio packets are sent according to the audio coding sampling frequency, video packets are sent at the interval of the agreed frame rate, and text packets are sent according to the audio coding sampling frequency.
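The sending strategy above can be sketched as a simple schedule. This is a minimal illustration, assuming the example intervals given for step 100 (10 ms for audio, 40 ms for video) and assuming text packets follow the audio interval; the function name `send_schedule` is hypothetical.

```python
def send_schedule(duration_ms, audio_interval_ms=10, video_interval_ms=40):
    """Return (time_ms, kind, sequence) tuples in the order they would be sent."""
    intervals = {
        "audio": audio_interval_ms,
        "video": video_interval_ms,
        "text": audio_interval_ms,  # text follows the audio interval
    }
    events = []
    for kind, interval in intervals.items():
        # Each stream has its own sequence counter starting at 0.
        for seq, t in enumerate(range(0, duration_ms, interval)):
            events.append((t, kind, seq))
    events.sort(key=lambda e: e[0])  # stable sort keeps audio, video, text order per tick
    return events
```

Over a 40 ms window this yields four audio packets, four text packets and one video packet, each stream numbered independently.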

Optionally, before encapsulating the digital audio signal into an audio packet, the method further includes: the first terminal performs speech encoding on the digital audio signal; correspondingly,

Encapsulating the digital audio signal into an audio packet includes: the first terminal encapsulates the speech-encoded digital audio signal into the audio packet.

Optionally, before encapsulating the digital video signal into a video packet, the method further includes: the first terminal performs video encoding on the digital video signal; correspondingly, encapsulating the digital video signal into a video packet includes: the first terminal encapsulates the video-encoded digital video signal into the video packet.

With the solution of the embodiment of the present invention, the first terminal sends the text packet, the audio packet and the video packet to the second terminal separately, so that losing video packets under poor network conditions does not cause the text to be lost, which reduces information loss during the video call.

Referring to Fig. 2, an embodiment of the present invention also proposes a method for realizing a video call, including:

Step 200: the second terminal receives a text packet from the first terminal.

Step 201: when the second terminal determines that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, it displays the text information in those text packets, among the received text packet and the cached text packets, whose timestamp corresponds to a time less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed.

In this step, the text information can be displayed according to a preset display area and/or font size. Specifically, the maximum number of characters that the screen can display at once can be determined from the display area and/or font size; the number of times the text information of one text packet must be displayed is then calculated; the dwell time of each display is determined from the collection interval of the audio packet corresponding to the text packet; and the text is displayed according to that dwell time.

For example, if the audio packet corresponding to a text packet is collected once every 20 ms, the text packet contains 100 characters in total, and 10 characters can be displayed at a time, then the text must be displayed 10 times, and the dwell time of each display is 2 ms.
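The arithmetic in this example can be sketched as follows. The function name `subtitle_pages` and its parameters are hypothetical; the sketch assumes the dwell time is the audio collection interval divided evenly by the number of screens of text.

```python
def subtitle_pages(text, chars_per_screen, audio_interval_ms):
    """Split text into screen-sized pages and compute the dwell time per page."""
    pages = [text[i:i + chars_per_screen]
             for i in range(0, len(text), chars_per_screen)]
    # Each page stays on screen for an equal share of the audio interval.
    dwell_ms = audio_interval_ms / len(pages) if pages else 0.0
    return pages, dwell_ms


# The example from the description: 100 characters, 10 per screen, 20 ms interval.
pages, dwell_ms = subtitle_pages("x" * 100, 10, 20)
print(len(pages), dwell_ms)  # 10 2.0
```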

In this step, the text information can be displayed on the graphics layer of the screen, that is, superimposed on the video layer that displays the digital video signal.

Optionally, between step 200 and step 201 the method further includes:

The second terminal determines that the subtitle display function is turned on.

When the second terminal determines that the subtitle display function is turned off, this flow ends.

The method further includes:

When the second terminal determines that the time corresponding to the timestamp in the received text packet is greater than the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, it caches the received text packet.

The method further includes:

When the second terminal receives neither an audio packet nor a video packet within a preset time, it displays the text information in the cached text packets.
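The receiving-side rules above (display when the text timestamp does not exceed the playback timestamp, cache otherwise, and show the cached text when no audio or video arrives within the preset time) can be sketched as follows. The class `SubtitleSync`, its method names and the use of millisecond arrival times are assumptions made for illustration; in particular, measuring the preset time from the arrival of the oldest cached text packet is one possible reading of the description.

```python
class SubtitleSync:
    """Decide when received text packets should be shown as subtitles."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.playback_ts = None    # timestamp of the audio/video packet now playing
        self.last_media_ms = None  # arrival time of the latest audio/video packet
        self.cache = []            # (timestamp, text, arrival_ms) entries

    def _release_due(self):
        # Show every cached text whose timestamp is <= the playback timestamp.
        if self.playback_ts is None:
            return []
        due = sorted((c for c in self.cache if c[0] <= self.playback_ts),
                     key=lambda c: c[0])
        self.cache = [c for c in self.cache if c[0] > self.playback_ts]
        return [text for _, text, _ in due]

    def on_media(self, timestamp, now_ms):
        """An audio packet starts playing or a video packet is displayed."""
        self.playback_ts = timestamp
        self.last_media_ms = now_ms
        return self._release_due()

    def on_text(self, timestamp, text, now_ms):
        """A text packet arrives; it is shown now or stays cached."""
        self.cache.append((timestamp, text, now_ms))
        return self._release_due()

    def poll(self, now_ms):
        """Flush the cache if no audio/video arrived within the preset time."""
        if not self.cache:
            return []
        oldest = min(arrival for _, _, arrival in self.cache)
        media_since = self.last_media_ms is not None and self.last_media_ms >= oldest
        if media_since or now_ms - oldest < self.timeout_ms:
            return []
        flushed = [text for _, text, _ in sorted(self.cache, key=lambda c: c[0])]
        self.cache = []
        return flushed
```

Each method returns the list of text strings that should be displayed at that moment, so a caller can pass the result directly to whatever draws the subtitle layer.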

In the above method, after receiving an audio packet and/or a video packet, the second terminal can play or display it according to the rules agreed in the audio/video decoding protocol standard.

Specifically, after receiving an audio packet, the second terminal can play it according to the rules agreed in the audio decoding protocol standard (such as G.711); after receiving a video packet, the second terminal can display it according to the rules agreed in the video decoding protocol (such as H.264).

Referring to Fig. 3, an embodiment of the present invention also proposes a first terminal, including:

In the first terminal of the embodiment of the present invention, the first processing module is specifically configured to:

Convert the digital audio signal into text information, perform speech encoding on the digital audio signal, encapsulate the text information into a text packet, encapsulate the speech-encoded digital audio signal into an audio packet, perform video encoding on the digital video signal, and encapsulate the video-encoded digital video signal into a video packet.

Referring to Fig. 4, an embodiment of the present invention also proposes a second terminal, including:

A receiving module, configured to receive a text packet from a first terminal;

In the second terminal of the embodiment of the present invention, the second processing module is further configured to:

Cache the received text packet upon determining that the time corresponding to the timestamp in the received text packet is greater than the time corresponding to the timestamp of the audio packet being played or the video packet being displayed.

Display the text information in the cached text packets when neither an audio packet nor a video packet is received within a preset time after the text packet is received.

Referring to Fig. 5, an embodiment of the present invention also proposes a system for realizing a video call, including:

A second terminal, configured to receive a text packet from a first terminal; and, upon determining that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, display the text information in those text packets, among the received text packet and the cached text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

In the system of the embodiment of the present invention, the second terminal is further configured to:

It should be noted that the embodiments described above are provided only to aid understanding by those skilled in the art and are not intended to limit the protection scope of the present invention. Any obvious substitution or improvement made to the present invention by those skilled in the art without departing from its inventive concept falls within the protection scope of the present invention.

Claims

1. A method for realizing a video call, characterized by comprising:

2. The method according to claim 1, characterized in that before encapsulating the digital audio signal into an audio packet, the method further comprises: the first terminal performs speech encoding on the digital audio signal;

Encapsulating the digital audio signal into an audio packet comprises: the first terminal encapsulates the speech-encoded digital audio signal into the audio packet.

3. The method according to claim 1, characterized in that before encapsulating the digital video signal into a video packet, the method further comprises: the first terminal performs video encoding on the digital video signal;

Encapsulating the digital video signal into a video packet comprises: the first terminal encapsulates the video-encoded digital video signal into the video packet.

4. A method for realizing a video call, characterized by comprising:

A second terminal receives a text packet from a first terminal;

When the second terminal determines that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, it displays the text information in those text packets, among the received text packet and the cached text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

5. The method according to claim 4, characterized in that when the second terminal determines that the time corresponding to the timestamp in the received text packet is greater than the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, the method further comprises:

The second terminal caches the received text packet.

6. The method according to claim 5, characterized in that when the second terminal receives neither an audio packet nor a video packet within a preset time after receiving the text packet, the method further comprises:

The second terminal displays the text information in the cached text packets.

7. The method according to claim 4, characterized in that after the second terminal receives the text packet from the first terminal, and before the second terminal determines that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, the method further comprises:

The second terminal determines that the subtitle display function is turned on.

8. A first terminal, characterized by comprising:

A first processing module, configured to convert a digital audio signal into text information, encapsulate the text information into a text packet, encapsulate the digital audio signal into an audio packet, and encapsulate a digital video signal into a video packet;

9. The first terminal according to claim 8, characterized in that the first processing module is specifically configured to:

Convert the digital audio signal into text information, perform speech encoding on the digital audio signal, encapsulate the text information into a text packet, encapsulate the speech-encoded digital audio signal into the audio packet, perform video encoding on the digital video signal, and encapsulate the video-encoded digital video signal into the video packet.

10. A second terminal, characterized by comprising:

A receiving module, configured to receive a text packet from a first terminal;

A second processing module, configured to, upon determining that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, display the text information in those text packets, among the received text packet and the cached text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

11. The second terminal according to claim 10, characterized in that the second processing module is further configured to:

Cache the received text packet upon determining that the time corresponding to the timestamp in the received text packet is greater than the time corresponding to the timestamp of the audio packet being played or the video packet being displayed.

12. The second terminal according to claim 11, characterized in that the second processing module is further configured to:

Display the text information in the cached text packets when neither an audio packet nor a video packet is received within a preset time after the text packet is received.

13. A system for realizing a video call, characterized by comprising:

A first terminal, configured to separately collect a digital audio signal and a digital video signal; convert the digital audio signal into text information, encapsulate the text information into a text packet, encapsulate the digital audio signal into an audio packet, and encapsulate the digital video signal into a video packet; and send the text packet, the audio packet and the video packet to a second terminal separately;

A second terminal, configured to receive the text packet from the first terminal; and, upon determining that the time corresponding to the timestamp in the received text packet is less than or equal to the time corresponding to the timestamp of the audio packet being played or the video packet being displayed, display the text information in those text packets, among the received text packet and the cached text packets, whose timestamp field corresponds to a time less than or equal to the time corresponding to the timestamp field of the audio packet being played or the video packet being displayed.

14. The system according to claim 13, characterized in that the second terminal is further configured to:

15. The system according to claim 14, characterized in that the second terminal is further configured to: