CN111447396A - Audio and video transmission method and device, electronic equipment and storage medium

Info

Publication number
CN111447396A
CN111447396A
Authority
CN
China
Prior art keywords: audio, video data, video, data packet, audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010153193.XA
Other languages
Chinese (zh)
Inventor
谢文龙 (Xie Wenlong)
李云鹏 (Li Yunpeng)
吕亚亚 (Lyu Yaya)
杨春晖 (Yang Chunhui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Visionvera Information Technology Co Ltd
Original Assignee
Visionvera Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Visionvera Information Technology Co Ltd filed Critical Visionvera Information Technology Co Ltd
Priority to CN202010153193.XA priority Critical patent/CN111447396A/en
Publication of CN111447396A publication Critical patent/CN111447396A/en
Pending legal-status Critical Current

Classifications

    • H04N 7/15: Conference systems (H ELECTRICITY; H04 ELECTRIC COMMUNICATION TECHNIQUE; H04N PICTORIAL COMMUNICATION, e.g. TELEVISION; H04N 7/00 Television systems; H04N 7/14 Systems for two-way working)
    • H04N 21/242: Synchronization processes, e.g. processing of PCR [Program Clock References] (H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]; H04N 21/20 Servers specifically adapted for the distribution of content; H04N 21/23 Processing of content or additional data)
    • H04N 21/8547: Content authoring involving timestamps for synchronizing content (H04N 21/80 Generation or processing of content by the content creator; H04N 21/85 Assembly of content; H04N 21/854 Content authoring)
    • H04N 7/18: Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast (H04N 7/00 Television systems)

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application provides an audio and video transmission method and apparatus, an electronic device, and a storage medium. Applied to a data transmission device, the method comprises: receiving video data sent by an external video capture device; acquiring audio data captured by an audio capture component of the data transmission device, wherein the audio data and the video data have a temporal association relationship; synchronizing the video data and the audio data according to the temporal association relationship to obtain synchronized audio and video data; and sending the synchronized audio and video data to a receiving end. In this process, externally captured video data is synchronized with the audio data of the user at the data transmission device, so that the receiving end receives, in real time together with the video data, the audio data corresponding to it; the user at the receiving end can thus follow the video content more easily, and the user experience is improved.

Description

Audio and video transmission method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to an audio/video transmission method and apparatus, an electronic device, and a storage medium.
Background
The video network is a real-time, high-bandwidth transmission network based on Ethernet hardware; it achieves high-speed transmission of high-definition video by adopting a dedicated protocol. An image transmission device is a device that receives a video stream or an encoded video stream and forwards it onward. Conventional image transmission devices are mainly used to interface with video sources such as unmanned aerial vehicles, surveillance cameras, selfie sticks, and VR devices, and to feed these video resources into a corresponding service system, such as a conference system, a live broadcast system, or another service system. In actual service, the image transmission device transmits the audio data and the video data unidirectionally and independently; when the audio data and the video data are correlated, this correlation cannot be handled well, so the receiving end has to re-associate the audio data with the video data, and the user experience suffers.
Disclosure of Invention
The embodiments of the present application provide an audio and video transmission method and apparatus, an electronic device, and a storage medium, aiming to achieve synchronized processing of audio data and video data.
A first aspect of the embodiments of the present application provides an audio and video transmission method, applied to a data transmission device, the method comprising:
receiving video data sent by an external video capture device;
acquiring audio data captured by an audio capture component of the data transmission device, wherein the audio data and the video data have a temporal association relationship;
synchronizing the video data and the audio data according to the temporal association relationship to obtain synchronized audio and video data;
and sending the synchronized audio and video data to a receiving end.
Optionally, synchronizing the video data and the audio data according to the temporal association relationship to obtain synchronized audio and video data comprises:
processing the audio data through a first thread to obtain an audio processing result;
processing the video data through a second thread to obtain a video processing result;
synchronizing, through a third thread, the video data and the audio data according to the audio processing result, the video processing result, and the temporal association relationship, to obtain the synchronized audio and video data;
sending the synchronized audio and video data to a receiving end comprises:
sending the synchronized audio and video data to the receiving end through the third thread;
wherein the third thread is either of the first thread and the second thread, or the third thread is different from both the first thread and the second thread.
Optionally, processing the audio data to obtain an audio processing result comprises:
encapsulating the audio data into audio data packets, each audio data packet carrying a timestamp;
processing the video data to obtain a video processing result comprises:
encapsulating the video data into video data packets, each video data packet carrying a timestamp;
synchronizing the video data and the audio data according to the audio processing result, the video processing result, and the temporal association relationship comprises:
determining a target audio data packet and a target video data packet according to the timestamps of the audio data packets, the timestamps of the video data packets, the time interval value between adjacent video data packets, and a preset threshold;
and synchronizing the target audio data packet and the target video data packet in timestamp order.
Optionally, determining a target audio data packet and a target video data packet according to the timestamps of the audio data packets, the timestamps of the video data packets, the time interval value between adjacent video data packets, and a preset threshold comprises:
determining the latest video data packet as the target video data packet;
determining, among all the audio data packets, those whose timestamps are greater than a preset difference as first-type audio data packets, wherein the preset difference is the difference between the timestamp of the target video data packet and the preset threshold;
and determining the target audio data packet from the first-type audio data packets according to the magnitude relationship between the time interval value and the preset threshold.
Optionally, determining the target audio data packet from the first-type audio data packets according to the magnitude relationship between the time interval value and the preset threshold comprises:
when the time interval value is smaller than the preset threshold, determining the audio data packets whose timestamps are smaller than a first sum among the first-type audio data packets as the target audio data packets, wherein the first sum is the sum of the timestamp of the target video data packet and the time interval value;
and when the time interval value is not smaller than the preset threshold, determining the audio data packets whose timestamps are smaller than a second sum among the first-type audio data packets as the target audio data packets, wherein the second sum is the sum of the timestamp of the target video data packet and the preset threshold.
Optionally, synchronizing the target audio data packet and the target video data packet in timestamp order comprises:
sorting the target audio data packets and the target video data packet in timestamp order;
sending the synchronized audio and video data to a receiving end comprises:
sending, in sequence, the target audio data packets whose timestamps are smaller than that of the target video data packet, the target video data packet, and the target audio data packets whose timestamps are not smaller than that of the target video data packet to the receiving end.
Optionally, the method further comprises:
determining, among all the audio data packets, those whose timestamps are not greater than the preset difference as second-type audio data packets;
and deleting the second-type audio data packets.
Optionally, after sending the synchronized audio/video data to a receiving end, the method further includes:
receiving audio response data returned by the receiving end;
and playing the audio response data.
Optionally, after the audio data is encapsulated into audio data packets, the method further comprises:
buffering the audio data packets into an audio send queue;
periodically detecting the number of audio packets contained in the audio send queue;
when the number of audio packets is greater than a preset number, determining that the current network quality is poor;
and displaying prompt information indicating that the current network quality is poor.
A second aspect of the embodiments of the present application provides an audio and video transmission apparatus, applied to a data transmission device, comprising:
a first receiving module, configured to receive video data sent by an external video capture device;
an acquisition module, configured to acquire audio data captured by an audio capture component of the data transmission device, the audio data and the video data having a temporal association relationship;
a processing module, configured to synchronize the video data and the audio data according to the temporal association relationship to obtain synchronized audio and video data;
and a sending module, configured to send the synchronized audio and video data to a receiving end.
Optionally, the processing module comprises:
a first processing module, configured to process the audio data through a first thread to obtain an audio processing result;
a second processing module, configured to process the video data through a second thread to obtain a video processing result;
a third processing module, configured to synchronize, through a third thread, the video data and the audio data according to the audio processing result, the video processing result, and the temporal association relationship, to obtain the synchronized audio and video data;
the sending module comprises:
a first sending submodule, configured to send the synchronized audio and video data to the receiving end through the third thread;
wherein the third thread is either of the first thread and the second thread, or the third thread is different from both the first thread and the second thread.
Optionally, the first processing module comprises:
a first encapsulation module, configured to encapsulate the audio data into audio data packets, each audio data packet carrying a timestamp;
the second processing module comprises:
a second encapsulation module, configured to encapsulate the video data into video data packets, each video data packet carrying a timestamp;
the third processing module comprises:
a first determining module, configured to determine a target audio data packet and a target video data packet according to the timestamps of the audio data packets, the timestamps of the video data packets, the time interval value between adjacent video data packets, and a preset threshold;
and a first processing submodule, configured to synchronize the target audio data packet and the target video data packet in timestamp order.
Optionally, the first determining module comprises:
a first determining submodule, configured to determine the latest video data packet as the target video data packet;
a second determining submodule, configured to determine, among all the audio data packets, those whose timestamps are greater than a preset difference as first-type audio data packets, wherein the preset difference is the difference between the timestamp of the target video data packet and the preset threshold;
and a third determining submodule, configured to determine the target audio data packet from the first-type audio data packets according to the magnitude relationship between the time interval value and the preset threshold.
Optionally, the third determining submodule comprises:
a fourth determining submodule, configured to determine, when the time interval value is smaller than the preset threshold, the audio data packets whose timestamps are smaller than a first sum among the first-type audio data packets as the target audio data packets, wherein the first sum is the sum of the timestamp of the target video data packet and the time interval value;
and a fifth determining submodule, configured to determine, when the time interval value is not smaller than the preset threshold, the audio data packets whose timestamps are smaller than a second sum among the first-type audio data packets as the target audio data packets, wherein the second sum is the sum of the timestamp of the target video data packet and the preset threshold.
Optionally, the first processing submodule comprises:
a second processing submodule, configured to sort the target audio data packets and the target video data packet in timestamp order;
the sending module comprises:
a second sending submodule, configured to send, in sequence, the target audio data packets whose timestamps are smaller than that of the target video data packet, the target video data packet, and the target audio data packets whose timestamps are not smaller than that of the target video data packet to the receiving end.
Optionally, the apparatus further comprises:
a second determining module, configured to determine, among all the audio data packets, those whose timestamps are not greater than the preset difference as second-type audio data packets;
and a deleting module, configured to delete the second-type audio data packets.
Optionally, the apparatus further comprises:
a second receiving module, configured to receive audio response data returned by the receiving end;
and a playing module, configured to play the audio response data.
Optionally, the apparatus further comprises:
a buffer module, configured to buffer the audio data packets into an audio send queue;
a detection module, configured to periodically detect the number of audio packets contained in the audio send queue;
a third determining module, configured to determine that the current network quality is poor when the number of audio packets is greater than a preset number;
and a display module, configured to display prompt information indicating that the current network quality is poor.
A third aspect of the embodiments of the present application provides an electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to the first aspect of the present application.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the method according to the first aspect of the present application.
According to the audio and video transmission method of the present application, the data transmission device receives video data sent by an external video capture device and acquires audio data through its own audio capture component, the audio data and the video data having a temporal association relationship; it then synchronizes the video data and the audio data according to the temporal association relationship to obtain synchronized audio and video data, and finally sends the synchronized audio and video data to the receiving end. In this process, externally captured video data is synchronized with the audio data of the user at the data transmission device, so that the receiving end receives, in real time together with the video data, the audio data corresponding to it; the user at the receiving end can thus follow the video content more easily, and the user experience is improved.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an implementation environment according to an embodiment of the present application;
fig. 2 is a flowchart of an audio and video transmission method according to an embodiment of the present application;
fig. 3 is an overall flowchart of an audio/video transmission method according to an embodiment of the present application;
fig. 4 is a structural block diagram of an audio/video transmission apparatus according to an embodiment of the present application;
FIG. 5 is a schematic networking diagram of a video network according to an embodiment of the present application;
fig. 6 is a schematic diagram of the hardware structure of a node server according to an embodiment of the present application;
fig. 7 is a schematic diagram of the hardware structure of an access switch according to an embodiment of the present application;
fig. 8 is a schematic diagram of the hardware structure of an Ethernet protocol conversion gateway according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic diagram of an implementation environment according to an embodiment of the present application. In fig. 1, an operator operates the image transmission device together with an unmanned aerial vehicle, a surveillance camera, a selfie stick, a VR device, or the like. The operator's voice data can be input into the image transmission device through its audio capture component, while the video data captured by the unmanned aerial vehicle, surveillance camera, selfie stick, or VR device is input into the image transmission device through a designated port. After the image transmission device synchronizes the audio data and the video data, it sends the resulting synchronized audio and video data to the receiving device of the related service, which may be a live broadcast system, a conference system, or the like. The image transmission device can also receive audio data returned by the receiving device and play it, which facilitates the operator's next step.
The present application provides an audio and video transmission method applied to the image transmission device in fig. 1. Fig. 2 is a flowchart of an audio and video transmission method according to an embodiment of the present application. Referring to fig. 2, the audio/video transmission method of the present application may include the following steps:
step S11: and receiving video data sent by external video acquisition equipment.
In this embodiment, the external video capture device may be any device having a video capture component, such as an unmanned aerial vehicle, a surveillance camera, a selfie stick, or a VR device; this application does not specifically limit it. The data transmission device is a device with audio/video input and output functions, for example the image transmission device in fig. 1, and can be configured to receive the video data sent by the external video capture device.
Step S12: acquiring audio data captured by an audio capture component of the data transmission device, wherein the audio data and the video data have a temporal association relationship.
In this embodiment, the data transmission device itself also has an audio capture component, which can be used to capture audio data, for example the voice data of an operator. The data transmission device captures the audio data while it receives the video data. Audio data having a temporal association relationship with the video data may be, for example, the voice data produced while the operator narrates the video data being received in real time, so that when the receiving end plays the video data, it can better understand the video content from the audio data.
Step S13: synchronizing the video data and the audio data according to the temporal association relationship to obtain synchronized audio and video data.
In this embodiment, if the data transmission device sent the received video data and the captured audio data directly to the receiving end, the audio data and the video data arriving at the receiving end would in general not be synchronized, owing to network conditions and similar factors. It is therefore necessary to synchronize the video data and the audio data, ensuring that the audio heard at the receiving end matches the video pictures watched there.
Step S14: sending the synchronized audio and video data to a receiving end.
In this embodiment, after obtaining the synchronized audio and video data, the data transmission device may send them to the receiving end, either in real time or at a specified time; the sending manner is not specifically limited in this embodiment.
Illustratively, user A is currently using an image transmission device to feed video data captured by a surveillance camera into video networking conference X for viewing by its members. To help the conference members follow the video more easily, user A can input a voice narration of the video content into the image transmission device through its audio capture component, and the narration is transmitted to the receiving end of the video networking conference system synchronously with the video. For example, the video content may span several time periods, and user A may narrate the video content of each period so that the conference members understand it better. To ensure that the audio data and video data the image transmission device sends to the receiving end are synchronized, the video data and the audio data need to be synchronized according to the temporal association relationship (for example, the aforementioned correspondence between the narration of each period and the video content of that period) to obtain synchronized audio and video data, which are then sent to the receiving end of the video networking conference system.
In this embodiment, the data transmission device receives video data sent by an external video capture device and acquires audio data through its own audio capture component, the audio data and the video data having a temporal association relationship; it then synchronizes the video data and the audio data according to the temporal association relationship to obtain synchronized audio and video data, and finally sends them to the receiving end. In this process, externally captured video data is synchronized with the audio data of the user at the data transmission device, so that the receiving end receives, in real time together with the video data, the corresponding audio data; the user at the receiving end can thus follow the video content more easily, which effectively improves the user experience.
In combination with the above embodiment, in one implementation, the following steps may also be performed after step S14:
receiving audio response data returned by the receiving end;
and playing the audio response data.
In this embodiment, the data transmission device can not only send the synchronized audio and video data to the receiving end but also receive audio data returned by the receiving end and play it, so that after hearing it, the user at the data transmission device can take the next step in time, for example adjusting the video being sent or the sending time. Illustratively, when a user at the receiving end is confused by the audio while watching a piece of video content, the confusion can be recorded into the receiving end by voice at any time; the receiving end sends this voice data to the data transmission device, and the user there can explain the point of confusion by voice and send the explanation back to the receiving end through the data transmission device.
In this embodiment, the data transmission device and the receiving end can transmit audio to each other, realizing real-time voice interaction during video transmission; this effectively meets users' communication needs in actual scenarios and improves the user experience.
In combination with the above embodiments, in one implementation, step S13 may include:
processing the audio data through a first thread to obtain an audio processing result;
processing the video data through a second thread to obtain a video processing result;
synchronizing, through a third thread, the video data and the audio data according to the audio processing result, the video processing result, and the temporal association relationship, to obtain the synchronized audio and video data;
accordingly, step S14 may include:
sending the synchronized audio and video data to the receiving end through the third thread;
wherein the third thread is either of the first thread and the second thread, or the third thread is different from both the first thread and the second thread.
In this embodiment, the processing of the audio data, of the video data, and of the synchronized audio and video data can each be handled by one of three threads. Specifically, the audio data can be processed through the first thread to obtain an audio processing result, the video data can be processed through the second thread to obtain a video processing result, and the video data and the audio data can then be synchronized through the third thread according to the audio processing result, the video processing result, and the temporal association relationship to obtain the synchronized audio and video data. The third thread may be the first thread, the second thread, or any other thread different from both; this may be set according to actual requirements, and the application does not specifically limit it.
In this embodiment, the video data and the audio data are synchronized using multiple threads, which speeds up processing and enables rapid transmission of data from the data transmission device to the receiving end, improving both the viewing experience of the user at the receiving end and the interactive experience of the users at the two ends.
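As a concrete illustration of this three-thread arrangement, the following is a minimal Python sketch. It is an assumed structure rather than the patented implementation; the capture_audio, capture_video, synchronize, and send callbacks are hypothetical placeholders for the device's actual routines.

    import queue
    import threading

    audio_results: queue.Queue = queue.Queue()  # output of the first thread
    video_results: queue.Queue = queue.Queue()  # output of the second thread

    def first_thread(capture_audio, stop: threading.Event) -> None:
        """First thread: turn captured audio into timestamped audio packets."""
        while not stop.is_set():
            audio_results.put(capture_audio())

    def second_thread(capture_video, stop: threading.Event) -> None:
        """Second thread: turn received video into timestamped video packets."""
        while not stop.is_set():
            video_results.put(capture_video())

    def third_thread(synchronize, send, stop: threading.Event) -> None:
        """Third thread: merge both results by timestamp and send the output."""
        while not stop.is_set():
            video_pkt = video_results.get()
            for pkt in synchronize(video_pkt, audio_results):
                send(pkt)

Here the third thread is shown as a separate thread; as stated above, it could equally be the first or the second thread.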
With reference to the foregoing embodiments, in one implementation, processing the audio data to obtain an audio processing result comprises:
encapsulating the audio data into audio data packets, each audio data packet carrying a timestamp;
processing the video data to obtain a video processing result comprises:
encapsulating the video data into video data packets, each video data packet carrying a timestamp;
synchronizing the video data and the audio data according to the audio processing result, the video processing result, and the temporal association relationship comprises:
determining a target audio data packet and a target video data packet according to the timestamps of the audio data packets, the timestamps of the video data packets, the time interval value between adjacent video data packets, and a preset threshold;
and synchronizing the target audio data packet and the target video data packet in timestamp order.
In this embodiment, processing the audio data specifically means encapsulating it into audio data packets, each marked with the current timestamp at encapsulation; processing the video data specifically means encapsulating it into video data packets, each likewise marked with the current timestamp at encapsulation. On this basis, the target audio data packet and the target video data packet can be determined from the timestamps of the audio data packets, the timestamps of the video data packets, the time interval value t3 between adjacent video data packets, and a preset threshold t1. Here t3 can be calculated from the Fps of the captured video, while t1 represents the time range within which the human eye cannot perceive audio/video asynchrony; its specific value can generally be chosen within 300-500 ms according to actual requirements. Once the target audio data packet and the target video data packet are obtained, they are synchronized in the order of the timestamps they carry, yielding the synchronized audio and video data.
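In code form, the quantities just introduced can be sketched as follows (a Python sketch under assumed names, not the device's actual implementation; 400 ms is simply one value inside the 300-500 ms window cited above):

    import time
    from dataclasses import dataclass, field

    @dataclass
    class Packet:
        """An encapsulated audio or video data packet carrying its timestamp."""
        payload: bytes
        kind: str  # "audio" or "video"
        timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))

    def video_interval_ms(fps: float) -> float:
        """Time interval value t3 between adjacent video packets, from the capture Fps."""
        return 1000.0 / fps

    T1_MS = 400  # preset threshold t1: one choice within the 300-500 ms range above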
Specifically, determining a target audio data packet and a target video data packet according to the timestamps of the audio data packets, the timestamps of the video data packets, the time interval value between adjacent video data packets, and a preset threshold comprises:
determining the latest video data packet as the target video data packet;
determining, among all the audio data packets, those whose timestamps are greater than a preset difference as first-type audio data packets, wherein the preset difference is the difference between the timestamp of the target video data packet and the preset threshold;
and determining the target audio data packet from the first-type audio data packets according to the magnitude relationship between the time interval value and the preset threshold.
In this embodiment, video data packets are processed one at a time: whenever a video data packet (i.e., a target video data packet) is generated, the corresponding audio data packets (i.e., the target audio data packets) are obtained, and the video data packet and its corresponding audio data packets are then processed into synchronized audio and video data and sent to the receiving end.
In implementation, the newly generated video data packet may be determined as the target video data packet, its timestamp denoted by T. Then, among all the audio data packets, those whose timestamps are greater than a preset difference are determined as first-type audio data packets, the preset difference being the difference between the timestamp T of the target video data packet and the preset threshold t1, i.e., T − t1. After the first-type audio data packets are obtained, the target audio data packets are selected from them according to the magnitude relationship between the time interval value t3 and the preset threshold t1.
Specifically, determining the target audio data packet from the first-type audio data packets according to the magnitude relationship between the time interval value and the preset threshold comprises:
when the time interval value is smaller than the preset threshold, determining the audio data packets whose timestamps are smaller than a first sum among the first-type audio data packets as the target audio data packets, wherein the first sum is the sum of the timestamp of the target video data packet and the time interval value;
and when the time interval value is not smaller than the preset threshold, determining the audio data packets whose timestamps are smaller than a second sum among the first-type audio data packets as the target audio data packets, wherein the second sum is the sum of the timestamp of the target video data packet and the preset threshold.
In this embodiment, when the time interval value t3 is smaller than the preset threshold t1, the audio data packets whose timestamps are smaller than a first sum are taken from the first-type audio data packets and determined as target audio data packets, the first sum being the sum of the timestamp T of the target video data packet and the time interval value t3, i.e., T + t3;
when the time interval value t3 is not smaller than the preset threshold t1, the audio data packets whose timestamps are smaller than a second sum are taken from the first-type audio data packets and determined as target audio data packets, the second sum being the sum of the timestamp T of the target video data packet and the preset threshold t1, i.e., T + t1.
In the present embodiment, each time a target video data packet is generated, the audio data packets with timestamps greater than T − t1 (i.e., the first-type audio data packets) are first obtained from all generated but unsent audio data packets; then t1 and t3 are compared: if t3 < t1, the audio data packets with timestamps less than T + t3 are selected from the first-type audio data packets as target audio data packets, and if t3 ≥ t1, the audio data packets with timestamps less than T + t1 are selected from the first-type audio data packets as target audio data packets.
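Pulling the preceding rules together, a sketch of the per-video-packet selection, reusing the hypothetical Packet type from the earlier sketch, might read as follows; the ordered sending and the deletion of stale packets anticipate the paragraphs that follow:

    def select_packets(pending_audio: list, video_pkt: Packet,
                       t1_ms: float, t3_ms: float) -> tuple:
        """Split the generated-but-unsent audio packets for the newest video packet.

        Returns (targets, stale): 'targets' are the target audio data packets,
        'stale' the second-type packets (timestamp <= T - t1) to be deleted.
        """
        T = video_pkt.timestamp_ms
        first_type = [a for a in pending_audio if a.timestamp_ms > T - t1_ms]
        stale = [a for a in pending_audio if a.timestamp_ms <= T - t1_ms]
        upper = T + (t3_ms if t3_ms < t1_ms else t1_ms)  # first sum or second sum
        targets = [a for a in first_type if a.timestamp_ms < upper]
        return targets, stale

    def send_in_order(targets: list, video_pkt: Packet, send) -> None:
        """Send earlier audio, then the video packet, then the remaining audio."""
        T = video_pkt.timestamp_ms
        before = sorted((a for a in targets if a.timestamp_ms < T),
                        key=lambda p: p.timestamp_ms)
        after = sorted((a for a in targets if a.timestamp_ms >= T),
                       key=lambda p: p.timestamp_ms)
        for pkt in before + [video_pkt] + after:
            send(pkt)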
With reference to the foregoing embodiment, in one implementation, synchronizing the target audio data packet and the target video data packet in timestamp order comprises:
sorting the target audio data packets and the target video data packet in timestamp order;
sending the synchronized audio and video data to a receiving end comprises:
sending, in sequence, the target audio data packets whose timestamps are smaller than that of the target video data packet, the target video data packet, and the target audio data packets whose timestamps are not smaller than that of the target video data packet to the receiving end.
In this embodiment, after the target audio data packets and the target video data packet are obtained, they are sorted in the order of their timestamps; the sorted result is: the target audio data packets whose timestamps are smaller than the timestamp T of the target video data packet, the target video data packet, and the target audio data packets whose timestamps are not smaller than T. The former are the audio data packets generated shortly before the target video data packet, the latter those generated shortly after it, and the timestamps of both groups lie within the time range in which the human eye cannot perceive asynchrony.
In this embodiment, by checking whether the timestamp t of an audio data packet satisfies the two conditions t > T − t1 and t < T + t3 (or t < T + t1), the audio data packets synchronized with the target video data packet can be screened out and sent to the receiving end in timestamp order. The timestamp differences between the audio data packets and the video data packet received by the receiving end then all lie within the time range in which the human eye cannot perceive asynchrony, achieving an equivalent synchronization of the audio and video data and improving the user experience.
In combination with the above embodiments, in one implementation, the method further comprises:
determining, among all the audio data packets, those whose timestamps are not greater than the preset difference as second-type audio data packets;
and deleting the second-type audio data packets.
In this embodiment, audio data packets whose timestamps are not greater than the preset difference fall outside the time range within which the human eye cannot perceive asynchrony, so they do not need to be sent to the receiving end and can be deleted to avoid occupying memory.
With reference to the foregoing embodiment, in one implementation, after the audio data is encapsulated into audio data packets, the method further comprises:
buffering the audio data packets into an audio send queue;
periodically detecting the number of audio packets contained in the audio send queue;
when the number of audio packets is greater than a preset number, determining that the current network quality is poor;
and displaying prompt information indicating that the current network quality is poor.
In this embodiment, all generated audio data packets may be buffered in an audio send queue whose packet count is detected periodically. If the number of audio packets is greater than a preset number, the current network quality is poor, and information to that effect can be shown on the display screen of the data transmission device, prompting the user at the data transmission device to check the network connection in time.
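A sketch of this queue-depth check follows; the preset number and detection period are assumed values, and show_prompt stands in for whatever display routine the device provides:

    import queue
    import threading

    PRESET_COUNT = 50  # assumed preset number of buffered audio packets
    PERIOD_S = 1.0     # assumed detection period in seconds

    def monitor_audio_send_queue(audio_send_queue: queue.Queue,
                                 show_prompt, stop: threading.Event) -> None:
        """Periodically check the audio send queue; a backlog above the preset
        number is treated as a sign of poor current network quality."""
        while not stop.wait(PERIOD_S):
            if audio_send_queue.qsize() > PRESET_COUNT:
                show_prompt("Current network quality is poor")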
Fig. 3 is an overall flowchart of an audio/video transmission method according to an embodiment of the present application. The audio/video transmission method of the present application is described below as a whole with reference to fig. 3 in a specific embodiment.
In fig. 3, the image transmission device is provided with an audio hardware device. Through an A thread it captures the voice data of the operator (i.e., the user at the image transmission device), encodes it into a number of audio packets, and stores them in a queue A, each audio packet carrying a timestamp A.t. The time t to capture one audio packet can then be calculated from the sampling rate of the audio hardware device. Since the human eye cannot perceive asynchrony when the audio/video time difference is within 300-500 ms, the minimum audio/video time difference t1 can be obtained accordingly.
Meanwhile, through a B thread, the image transmission device interfaces with the unmanned aerial vehicle, surveillance camera, selfie stick, VR device, or the like, collects the incoming video data, and encodes it into video packets carrying a timestamp T. The time interval t3 between adjacent video packets can be calculated from the Fps of the captured video, with t3 < t1. In this embodiment the B thread serves as the sending thread, so the B thread visits each audio packet in queue A and determines in turn whether T − A.t < t1 holds; if not, the corresponding audio packet is deleted; if so, it further determines whether A.t < T + t3 holds, and if so, selects that audio packet for sending.
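Expressed in the same sketch style, one B-thread step from fig. 3 (an assumed rendering of the two checks T − A.t < t1 and A.t < T + t3, not the device's actual code, reusing the Packet type and send_in_order from the earlier sketches) is:

    def b_thread_step(queue_a: list, video_pkt: Packet,
                      t1_ms: float, t3_ms: float, send) -> None:
        """For a new video packet with timestamp T, walk queue A: delete audio
        packets failing T - A.t < t1, select those also satisfying A.t < T + t3,
        and send the selection interleaved with the video packet by timestamp."""
        T = video_pkt.timestamp_ms
        selected = []
        for audio_pkt in list(queue_a):
            if T - audio_pkt.timestamp_ms >= t1_ms:
                queue_a.remove(audio_pkt)   # too old: delete
            elif audio_pkt.timestamp_ms < T + t3_ms:
                selected.append(audio_pkt)  # inside the sync window: send
                queue_a.remove(audio_pkt)
        send_in_order(selected, video_pkt, send)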
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art will recognize that the present invention is not limited by the order of actions described, as some steps may be performed in other orders or concurrently according to the embodiments of the present invention. Furthermore, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the present invention.
The present application further provides an audio/video transmission apparatus 400, applied to a data transmission device, as shown in fig. 4. Fig. 4 is a structural block diagram of an audio/video transmission apparatus according to an embodiment of the present application. Referring to fig. 4, the audio/video transmission apparatus 400 of the present application may include:
a first receiving module 401, configured to receive video data sent by an external video capture device;
an acquisition module 402, configured to acquire audio data captured by an audio capture component of the data transmission device, the audio data and the video data having a temporal association relationship;
a processing module 403, configured to synchronize the video data and the audio data according to the temporal association relationship to obtain synchronized audio and video data;
and a sending module 404, configured to send the synchronized audio and video data to a receiving end.
Optionally, the processing module 403 comprises:
a first processing module, configured to process the audio data through a first thread to obtain an audio processing result;
a second processing module, configured to process the video data through a second thread to obtain a video processing result;
a third processing module, configured to synchronize, through a third thread, the video data and the audio data according to the audio processing result, the video processing result, and the temporal association relationship, to obtain the synchronized audio and video data;
the sending module 404 comprises:
a first sending submodule, configured to send the synchronized audio and video data to the receiving end through the third thread;
wherein the third thread is either of the first thread and the second thread, or the third thread is different from both the first thread and the second thread.
Optionally, the first processing module comprises:
a first encapsulation module, configured to encapsulate the audio data into audio data packets, each audio data packet carrying a timestamp;
the second processing module comprises:
a second encapsulation module, configured to encapsulate the video data into video data packets, each video data packet carrying a timestamp;
the third processing module comprises:
a first determining module, configured to determine a target audio data packet and a target video data packet according to the timestamps of the audio data packets, the timestamps of the video data packets, the time interval value between adjacent video data packets, and a preset threshold;
and a first processing submodule, configured to synchronize the target audio data packet and the target video data packet in timestamp order.
Optionally, the first determining module comprises:
a first determining submodule, configured to determine the latest video data packet as the target video data packet;
a second determining submodule, configured to determine, among all the audio data packets, those whose timestamps are greater than a preset difference as first-type audio data packets, wherein the preset difference is the difference between the timestamp of the target video data packet and the preset threshold;
and a third determining submodule, configured to determine the target audio data packet from the first-type audio data packets according to the magnitude relationship between the time interval value and the preset threshold.
Optionally, the third determining submodule comprises:
a fourth determining submodule, configured to determine, when the time interval value is smaller than the preset threshold, the audio data packets whose timestamps are smaller than a first sum among the first-type audio data packets as the target audio data packets, wherein the first sum is the sum of the timestamp of the target video data packet and the time interval value;
and a fifth determining submodule, configured to determine, when the time interval value is not smaller than the preset threshold, the audio data packets whose timestamps are smaller than a second sum among the first-type audio data packets as the target audio data packets, wherein the second sum is the sum of the timestamp of the target video data packet and the preset threshold.
Optionally, the first processing submodule comprises:
a second processing submodule, configured to sort the target audio data packets and the target video data packet in timestamp order;
the sending module 404 comprises:
a second sending submodule, configured to send, in sequence, the target audio data packets whose timestamps are smaller than that of the target video data packet, the target video data packet, and the target audio data packets whose timestamps are not smaller than that of the target video data packet to the receiving end.
Optionally, the apparatus 400 further comprises:
a second determining module, configured to determine, among all the audio data packets, those whose timestamps are not greater than the preset difference as second-type audio data packets;
and a deleting module, configured to delete the second-type audio data packets.
Optionally, the apparatus 400 further comprises:
a second receiving module, configured to receive audio response data returned by the receiving end;
and a playing module, configured to play the audio response data.
Optionally, the apparatus 400 further comprises:
a buffer module, configured to buffer the audio data packets into an audio send queue;
a detection module, configured to periodically detect the number of audio packets contained in the audio send queue;
a third determining module, configured to determine that the current network quality is poor when the number of audio packets is greater than a preset number;
and a display module, configured to display prompt information indicating that the current network quality is poor.
Based on the same inventive concept, the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the steps in the audio/video transmission method according to any of the above embodiments of the present application are implemented.
Based on the same inventive concept, the present application provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps in the audio/video transmission method according to any of the above embodiments of the present application.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The video network is an important milestone in network development. It is a real-time network that enables real-time transmission of high-definition video, pushing numerous Internet applications toward high definition and face-to-face interaction in high-definition video.
The video network adopts real-time high-definition video switching technology and can integrate dozens of required services, such as video, voice, pictures, text, communication, and data, onto a single system platform over the network, including high-definition video conferencing, video surveillance, intelligent surveillance analysis, emergency command, digital broadcast television, delayed television, network teaching, live broadcast, VOD on demand, television mail, personal video recording (PVR), intranet (self-office) channels, intelligent video broadcast control, and information distribution, realizing high-definition-quality video broadcast through a television or a computer.
To better understand the embodiments of the present invention, the video network is introduced below:
some of the technologies applied in the video networking are as follows:
Network Technology (Network Technology)
Network technology innovation in the video network improves on traditional Ethernet to cope with the potentially enormous video traffic on the network. Unlike pure network packet switching or network circuit switching, the video networking technology adopts packet switching to satisfy streaming requirements. The video networking technology combines the flexibility, simplicity, and low cost of packet switching with the quality and security guarantees of circuit switching, realizing seamless, whole-network switched virtual-circuit connection and a unified data format.
Switching Technology (Switching Technology)
The video network adopts the two advantages of Ethernet, asynchrony and packet switching, and eliminates Ethernet's defects on the premise of full compatibility; it offers end-to-end seamless connection across the whole network, communicates directly with user terminals, and directly carries IP data packets. User data requires no format conversion anywhere in the network. The video network is a higher-level form of Ethernet: it is a real-time switching platform that enables the whole-network, large-scale, real-time transmission of high-definition video that the current Internet cannot achieve, pushing numerous network video applications toward high definition and unification.
Server Technology (Server Technology)
The server technology of the video network and unified video platform differs from that of a traditional server: its streaming media transmission is built on a connection-oriented basis, its data processing capability is independent of traffic and communication time, and a single network layer can carry both signaling and data transmission. For voice and video services, streaming media processing on the video network and unified video platform is much simpler than general data processing, and efficiency is improved more than a hundredfold over a traditional server.
Storage Technology (Storage Technology)
To handle media content of very large capacity and very high traffic, the ultra-high-speed storage technology of the unified video platform adopts the most advanced real-time operating system. The program information in a server instruction is mapped to a specific hard disk space, and the media content no longer passes through the server but is sent instantly and directly to the user terminal; the user's typical waiting time is less than 0.2 second. Optimized sector allocation greatly reduces the mechanical seek movement of the hard-disk head; resource consumption is only 20% of that of an IP Internet system of the same grade, yet concurrent throughput three times that of a traditional hard disk array is generated, for an overall efficiency improvement of more than tenfold.
Network Security Technology (Network Security Technology)
The structural design of the video network eliminates, at the structural level, the network security problems that plague the Internet, through per-service independent permission control, complete isolation of devices and user data, and similar means. It generally needs no antivirus programs or firewalls, is immune to hacker and virus attacks, and provides users with a structurally worry-free secure network.
Service Innovation Technology (Service Innovation Technology)
The unified video platform integrates services with transmission: whether for a single user, a private-network user, or a network aggregate, only one automatic connection is needed. A user terminal, set-top box, or PC connects directly to the unified video platform to obtain a rich variety of multimedia video services. The unified video platform replaces traditional complex application programming with a menu-style configuration table, allowing complex applications to be realized with very little code and enabling endless new service innovation.
Networking of the video network is as follows:
The video network has a centrally controlled network structure. The network can be a tree, star, ring, or other topology, but in every case a centralized control node in the network controls the whole network.
Fig. 5 is a networking diagram of a video network according to an embodiment of the present application. As shown in fig. 5, the video network is divided into an access network and a metropolitan network.
The devices of the access network part can be mainly classified into 3 types: node servers, access switches, and terminals (including various set-top boxes, coding boards, memories, etc.). The node server is connected to access switches, and an access switch may be connected to multiple terminals and to an Ethernet network.

The node server is the node that performs the centralized control function in the access network and can control the access switches and terminals. The node server can be directly connected to an access switch or directly connected to a terminal.

Similarly, the devices of the metropolitan network part may also be classified into 3 types: metropolitan area servers, node switches, and node servers. The metropolitan area server is connected to node switches, and a node switch may be connected to multiple node servers.

The node server here is the same node server as in the access network part; that is, the node server belongs both to the access network part and to the metropolitan area network part.

The metropolitan area server is the node that performs the centralized control function in the metropolitan area network and can control the node switches and node servers. The metropolitan area server can be directly connected to a node switch or directly connected to a node server.

Therefore, the entire video network is a hierarchically, centrally controlled network structure, and the networks controlled by the node servers and the metropolitan area servers can have various topologies such as tree, star, and ring.

The access network part can form a unified video platform (the part inside the dotted circle), and multiple unified video platforms can form a video network; each unified video platform can be interconnected via metropolitan area and wide area video networking.
1. Video networking device classification
1.1 The devices in the video network of the embodiment of the present invention can be mainly classified into 3 types: servers, switches (including Ethernet gateways), and terminals (including various set-top boxes, coding boards, memories, etc.). The video network as a whole can be divided into a metropolitan area network (or a national network, a global network, etc.) and an access network.
1.2 The devices of the access network part can be mainly classified into 3 types: node servers, access switches (including Ethernet gateways), and terminals (including various set-top boxes, coding boards, memories, etc.).
The specific hardware structure of each access network device is as follows:
A node server:
Fig. 6 is a schematic diagram illustrating a hardware structure of a node server according to an embodiment of the present application. As shown in fig. 6, the node server mainly includes a network interface module 601, a switching engine module 602, a CPU module 603, and a disk array module 604.
The network interface module 601, the CPU module 603, and the disk array module 604 all feed into the switching engine module 602. The switching engine module 602 looks up the address table 605 for each incoming packet to obtain the packet's direction information, and stores the packet in the queue of the corresponding packet buffer 606 according to that direction information; if the queue of the packet buffer 606 is nearly full, the packet is discarded. The switching engine module 602 polls all packet buffer queues and forwards a packet if the following conditions are met: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero. The disk array module 604 mainly implements control over the hard disks, including initialization, reading, and writing; the CPU module 603 is mainly responsible for protocol processing with the access switches and terminals (not shown in the figure), for configuring the address table 605 (which includes a downlink protocol packet address table, an uplink protocol packet address table, and a data packet address table), and for configuring the disk array module 604.
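For illustration only (this sketch is not part of the patent disclosure), the enqueue-and-poll behavior just described can be modeled in a few lines of Python; the queue capacity, the callables, and all names are assumptions:

```python
from collections import deque

class PacketQueue:
    """One packet buffer queue inside the switching engine (illustrative)."""
    def __init__(self, capacity=1024):
        self.packets = deque()
        self.capacity = capacity

    def enqueue(self, packet) -> bool:
        # Mirror the described behaviour: discard when the queue is nearly full.
        if len(self.packets) >= self.capacity:
            return False          # packet discarded
        self.packets.append(packet)
        return True

def poll_and_forward(queues, send_buffer_has_room, transmit):
    """Poll every queue and forward one packet when 1) the port send
    buffer is not full and 2) the queue packet counter is greater than zero."""
    for q in queues:
        if send_buffer_has_room() and len(q.packets) > 0:
            transmit(q.packets.popleft())
```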
The access switch:
Fig. 7 is a schematic diagram illustrating a hardware structure of an access switch according to an embodiment of the present application. As shown in fig. 7, the access switch mainly includes network interface modules (a downlink network interface module 701 and an uplink network interface module 702), a switching engine module 703, and a CPU module 704.
A packet (uplink data) arriving from the downlink network interface module 701 enters the packet detection module 705. The packet detection module 705 checks whether the Destination Address (DA), Source Address (SA), packet type, and packet length of the packet meet the requirements; if so, it allocates a corresponding stream identifier (stream-id) and passes the packet to the switching engine module 703, and otherwise the packet is discarded. A packet (downlink data) arriving from the uplink network interface module 702 enters the switching engine module 703 directly, as does a packet arriving from the CPU module 704. The switching engine module 703 looks up the address table 706 for each incoming packet to obtain the packet's direction information. If a packet entering the switching engine module 703 is going from the downlink network interface to the uplink network interface, it is stored in the queue of the corresponding packet buffer 707 according to its stream-id; if that queue is nearly full, the packet is discarded. If a packet entering the switching engine module 703 is not going from the downlink network interface to the uplink network interface, it is stored in the queue of the corresponding packet buffer 707 according to its direction information; if that queue is nearly full, the packet is discarded.
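As a hedged sketch of the packet detection step (the concrete type codes and the minimum length are assumptions introduced purely for illustration):

```python
VALID_TYPES = {0x01, 0x02, 0x03}   # protocol / multicast / unicast: assumed codes

def detect_packet(pkt: bytes, stream_table: dict):
    """Mirror the packet detection module: check DA, SA, packet type and
    packet length, then allocate a stream identifier (stream-id)."""
    if len(pkt) < 18:              # must at least hold DA(8) + SA(8) + Reserved(2)
        return None                # requirements not met: the packet is discarded
    if pkt[0] not in VALID_TYPES:  # the first DA byte encodes the packet type
        return None
    key = (pkt[:8], pkt[8:16])     # (DA, SA) identifies the stream here
    return stream_table.setdefault(key, len(stream_table))
```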
The switching engine module 703 polls all packet buffer queues, distinguishing two cases in this embodiment of the present invention:
if the queue is going from the downlink network interface to the uplink network interface, a packet is forwarded when the following conditions are met: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero; 3) a token generated by the rate control module is obtained;
if the queue is not going from the downlink network interface to the uplink network interface, a packet is forwarded when the following conditions are met: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero.
The rate control module 708 is configured by the CPU module 704 and generates tokens at programmable intervals for all packet buffer queues going from a downlink network interface to the uplink network interface, so as to control the rate of uplink forwarding.
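The token mechanism behaves like a simple rate limiter. A minimal Python sketch, assuming one token per programmed interval and an arbitrary token cap (both values are placeholders, not taken from the patent):

```python
import threading

class RateControl:
    """Token generator for the downlink-to-uplink queues; the interval
    is programmed by the CPU module (values here are illustrative)."""
    def __init__(self, interval_s: float = 0.001, max_tokens: int = 64):
        self.interval_s = interval_s
        self.max_tokens = max_tokens
        self.tokens = 0
        self.lock = threading.Lock()

    def start(self):
        self._tick()

    def _tick(self):
        # Generate one token per interval, capped to avoid unbounded bursts.
        with self.lock:
            self.tokens = min(self.tokens + 1, self.max_tokens)
        threading.Timer(self.interval_s, self._tick).start()

    def try_take(self) -> bool:
        """Condition 3) for uplink forwarding: a token must be obtained."""
        with self.lock:
            if self.tokens > 0:
                self.tokens -= 1
                return True
            return False
```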
The CPU module 704 is mainly responsible for protocol processing with the node server, for configuring the address table 706, and for configuring the rate control module 708.
Ethernet protocol conversion gateway
Fig. 8 is a schematic diagram of a hardware structure of an Ethernet protocol conversion gateway according to an embodiment of the present application. As shown in fig. 8, the gateway mainly includes network interface modules (a downlink network interface module 801 and an uplink network interface module 802), a switching engine module 803, a CPU module 804, a packet detection module 805, a rate control module 808, an address table 806, a packet buffer 807, a MAC adding module 809, and a MAC deleting module 810.
A data packet arriving from the downlink network interface module 801 enters the packet detection module 805. The packet detection module 805 checks whether the Ethernet MAC DA, Ethernet MAC SA, Ethernet length or frame type, video network destination address (DA), video network source address (SA), video network packet type, and packet length of the packet meet the requirements; if so, a corresponding stream identifier (stream-id) is allocated, the MAC deleting module 810 strips the MAC DA, MAC SA, and length or frame type (2 bytes), and the packet enters the corresponding receive buffer; otherwise the packet is discarded.
The downlink network interface module 801 monitors the send buffer of its port; if a packet is present, it obtains the Ethernet MAC DA of the corresponding terminal according to the video network destination address (DA) of the packet, prepends the terminal's Ethernet MAC DA, the Ethernet protocol conversion gateway's MAC SA, and the Ethernet length or frame type, and sends the packet.
The other modules of the Ethernet protocol conversion gateway function similarly to those of the access switch.
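The MAC deletion and addition steps amount to stripping and prepending a 14-byte Ethernet header. A minimal sketch, assuming the standard 6 + 6 + 2 byte layout; the frame-type value 0x0800 is only a placeholder:

```python
import struct

ETH_HEADER = struct.Struct("!6s6sH")   # MAC DA, MAC SA, length or frame type

def strip_ethernet_header(frame: bytes) -> bytes:
    """Uplink direction (MAC deleting module): remove MAC DA, MAC SA and
    the 2-byte length/frame-type field, leaving the video-network packet."""
    return frame[ETH_HEADER.size:]

def add_ethernet_header(packet: bytes, terminal_mac: bytes,
                        gateway_mac: bytes, frame_type: int = 0x0800) -> bytes:
    """Downlink direction (MAC adding module): prepend the terminal's MAC DA,
    the gateway's MAC SA, and the length/frame type before sending."""
    return ETH_HEADER.pack(terminal_mac, gateway_mac, frame_type) + packet
```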
A terminal:
A terminal mainly includes a network interface module, a service processing module, and a CPU module. For example, a set-top box mainly includes a network interface module, a video/audio encoding and decoding engine module, and a CPU module; a coding board mainly includes a network interface module, a video/audio encoding engine module, and a CPU module; a memory mainly includes a network interface module, a CPU module, and a disk array module.
1.3 The devices of the metropolitan area network part can be mainly classified into 2 types: node switches and metropolitan area servers. A node switch mainly includes a network interface module, a switching engine module, and a CPU module; a metropolitan area server mainly includes a network interface module, a switching engine module, and a CPU module.
2. Video networking packet definition
2.1 Access network packet definition
The data packet of the access network mainly includes the following parts: Destination Address (DA), Source Address (SA), reserved bytes, Payload (PDU), and CRC, laid out as follows:

DA | SA | Reserved | Payload | CRC
wherein:
the Destination Address (DA) consists of 8 bytes: the first byte indicates the packet type (for example, a protocol packet, multicast data packet, or unicast data packet), giving at most 256 possibilities; bytes 2 through 6 form the metropolitan area network address; and bytes 7 and 8 form the access network address;
the Source Address (SA) also consists of 8 bytes and is defined in the same way as the Destination Address (DA);
the reserved field consists of 2 bytes;
the payload has a different length depending on the type of datagram: it is 64 bytes for the various protocol packets and 32 + 1024 = 1056 bytes for a unicast data packet, although the length is of course not limited to these 2 cases;
the CRC consists of 4 bytes and is calculated according to the standard Ethernet CRC algorithm.
2.2 Metropolitan area network packet definition
The topology of a metropolitan area network is a graph, and there may be 2 or even more connections between two devices; that is, there may be more than 2 connections between a node switch and a node server, or between two node switches. However, the metropolitan network address of each metropolitan network device is unique, so in order to describe the connection relationships between metropolitan network devices accurately, the embodiment of the present invention introduces a parameter: the label, which uniquely describes a metropolitan network device.
In this specification, the definition of the label is similar to that of an MPLS (Multi-Protocol Label Switching) label. Assuming there are two connections between device A and device B, a packet going from device A to device B has 2 labels, and a packet going from device B to device A likewise has 2 labels. Labels are divided into incoming labels and outgoing labels: assuming the label of a packet entering device A (the incoming label) is 0x0000, the label of that packet when it leaves device A (the outgoing label) may become 0x0001.
As shown below, the data packet of the metropolitan network mainly includes the following parts:

DA | SA | Reserved | Label | Payload | CRC

namely Destination Address (DA), Source Address (SA), reserved bytes (Reserved), Label, Payload (PDU), and CRC. The label format may be defined as follows: the label is 32 bits, with the upper 16 bits reserved and only the lower 16 bits used, and it is located between the reserved bytes and the payload of the packet.
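A hedged sketch of label insertion and MPLS-style label swapping (the label table is hypothetical, and CRC recomputation is omitted for brevity):

```python
import struct

HEADER_LEN = 8 + 8 + 2   # DA + SA + Reserved, per the access packet layout

def add_label(packet: bytes, out_label: int) -> bytes:
    """Insert the 32-bit label (upper 16 bits reserved, lower 16 bits
    used) between the reserved bytes and the payload."""
    label = struct.pack("!I", out_label & 0xFFFF)
    return packet[:HEADER_LEN] + label + packet[HEADER_LEN:]

def swap_label(packet: bytes, label_table: dict) -> bytes:
    """Replace the incoming label with the outgoing label for the next
    hop, as in the 0x0000 -> 0x0001 example above."""
    (in_label,) = struct.unpack_from("!I", packet, HEADER_LEN)
    out = struct.pack("!I", label_table[in_label] & 0xFFFF)
    return packet[:HEADER_LEN] + out + packet[HEADER_LEN + 4:]
```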
Based on the above characteristics of the video network, one of the core concepts of the embodiment of the present invention is proposed: the data transmission device receives video data sent by an external video capture device and collects audio data through its own audio collection component, where the audio data and the video data have a time sequence incidence relation; the video data and the audio data are then synchronized according to that relation to obtain synchronous audio and video data; and finally the synchronous audio and video data are sent to the receiving end.
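At its core this is a merge of two timestamped streams. A minimal sketch of that idea, assuming each packet carries a timestamp field named ts (all names are illustrative):

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Packet:
    ts: int                                # timestamp carried by the packet
    kind: str = field(compare=False)       # "audio" or "video"
    data: bytes = field(compare=False, default=b"")

def synchronize(video_packets, audio_packets):
    """Merge the two timestamped streams into a single stream ordered by
    timestamp, ready to be sent packet by packet to the receiving end."""
    return list(heapq.merge(sorted(video_packets), sorted(audio_packets)))
```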
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises the element.
The audio and video transmission method and device, the electronic device, and the storage medium provided by the present invention have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the embodiments is only intended to help in understanding the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific implementation and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. An audio and video transmission method is applied to data transmission equipment and comprises the following steps:
receiving video data sent by external video acquisition equipment;
acquiring audio data acquired by an audio acquisition component of the data transmission equipment, wherein the audio data and the video data have a time sequence incidence relation;
according to the time sequence incidence relation, the video data and the audio data are synchronously processed to obtain synchronous audio and video data;
and sending the synchronous audio and video data to a receiving end.
2. The method according to claim 1, wherein synchronizing the video data and the audio data according to the time sequence incidence relation to obtain the synchronous audio and video data comprises:
processing the audio data through a first thread to obtain an audio processing result;
processing the video data through a second thread to obtain a video processing result;
through a third thread, according to the audio processing result, the video processing result and the time sequence incidence relation, the video data and the audio data are synchronously processed, and the synchronous audio and video data are obtained;
sending the synchronous audio and video data to a receiving end, comprising:
sending the synchronous audio and video data to a receiving end through the third thread;
wherein the third thread is any one of the first thread and the second thread, or the third thread is different from the first thread and the second thread.
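For illustration only (this sketch is not part of the claims), the three-thread arrangement of claim 2 might look as follows in Python; the one-audio-packet-per-video-packet pairing is a simplification, and all names are assumptions:

```python
import queue
import threading

audio_q: queue.Queue = queue.Queue()
video_q: queue.Queue = queue.Queue()

def first_thread(audio_source):
    """Process (package and timestamp) the audio data."""
    for pkt in audio_source:
        audio_q.put(pkt)

def second_thread(video_source):
    """Process (package and timestamp) the video data."""
    for pkt in video_source:
        video_q.put(pkt)

def third_thread(send, n_video: int):
    """Synchronize the two processing results by timestamp and send them
    to the receiving end; per the claim, this work may also run on the
    first or second thread instead of a dedicated one."""
    for _ in range(n_video):
        a, v = audio_q.get(), video_q.get()
        for pkt in sorted((a, v), key=lambda p: p.ts):
            send(pkt)

# Example wiring of the dedicated third thread:
# threading.Thread(target=third_thread, args=(print, 100), daemon=True).start()
```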
3. The method of claim 2, wherein processing the audio data to obtain audio processing results comprises:
packaging the audio data to obtain audio data packets, wherein each audio data packet carries a timestamp;
processing the video data to obtain a video processing result, comprising:
packaging the video data to obtain video data packets, wherein each video data packet carries a timestamp;
according to the audio processing result, the video processing result and the time sequence incidence relation, the video data and the audio data are synchronously processed, and the method comprises the following steps:
determining a target audio data packet and a target video data packet according to the time stamp of the audio data packet, the time stamp of the video data packet, the time interval value of the adjacent video data packets and a preset threshold value;
and carrying out synchronous processing on the target audio data packet and the target video data packet according to the time stamp sequence.
4. The method of claim 3, wherein determining the target audio data packet and the target video data packet according to the time stamp of the audio data packet, the time stamp of the video data packet, the time interval value of the adjacent video data packet, and a preset threshold comprises:
determining the latest video data packet as the target video data packet;
determining the audio data packets with the time stamps larger than a preset difference value in all the audio data packets as the audio data packets of the first type, wherein the preset difference value is the difference value between the time stamp of the target video data packet and the preset threshold value;
and determining the target audio data packet from the first type of audio data packets according to the size relation between the time interval value and the preset threshold value.
5. The method according to claim 4, wherein determining the target audio data packet from the first type of audio data packets according to the magnitude relationship between the time interval value and the preset threshold value comprises:
when the time interval value is smaller than the preset threshold value, determining an audio data packet with a timestamp smaller than a first sum value in the first type of audio data packets as the target audio data packet, wherein the first sum value is the sum of the timestamp of the target video data packet and the time interval value;
and when the time interval value is not smaller than the preset threshold value, determining the audio data packet with the timestamp smaller than a second sum value in the first type of audio data packet as the target audio data packet, wherein the second sum value is the sum value of the timestamp of the target video data packet and the preset threshold value.
6. The method of claim 4, wherein synchronizing the target audio data packet and the target video data packet according to the time stamp precedence order comprises:
sequencing the target audio data packets and the target video data packets according to the time stamp sequence;
sending the synchronous audio and video data to a receiving end, comprising:
and sequentially sending the target audio data packet with the timestamp less than that of the target video data packet, the target video data packet and the target audio data packet with the timestamp not less than that of the target video data packet to the receiving end.
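For illustration only (not part of the claims), the selection logic of claims 4 to 6 can be condensed into one function; packet objects with a ts attribute, and all names, are assumptions:

```python
def select_targets(audio_pkts, video_pkts, interval, threshold):
    """Claim 4: the latest video packet is the target video packet, and
    audio packets with timestamps above (video ts - threshold) form the
    first type. Claim 5: the target audio packets are those of the first
    type below video ts + min(interval, threshold). Claim 6: everything
    is ordered by timestamp before sending."""
    target_video = max(video_pkts, key=lambda p: p.ts)
    first_type = [a for a in audio_pkts if a.ts > target_video.ts - threshold]
    bound = target_video.ts + (interval if interval < threshold else threshold)
    target_audio = [a for a in first_type if a.ts < bound]
    return sorted(target_audio + [target_video], key=lambda p: p.ts)
```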
7. The method according to claim 1, wherein after sending the synchronized audio/video data to a receiving end, the method further comprises:
receiving audio response data returned by the receiving end;
and playing the audio response data.
8. An audio and video transmission device is applied to data transmission equipment and comprises:
the first receiving module is used for receiving video data sent by external video acquisition equipment;
the acquisition module is used for acquiring audio data acquired by an audio acquisition component of the data transmission equipment, and the audio data and the video data have a time sequence incidence relation;
the processing module is used for carrying out synchronous processing on the video data and the audio data according to the time sequence incidence relation to obtain synchronous audio and video data;
and the sending module is used for sending the synchronous audio and video data to a receiving end.
9. An electronic device, comprising:
one or more processors; and
one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the electronic device to perform the steps in the audio-visual transmission method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that it stores a computer program causing a processor to execute the steps in the audio-visual transmission method according to any of claims 1-7.
CN202010153193.XA 2020-03-06 2020-03-06 Audio and video transmission method and device, electronic equipment and storage medium Pending CN111447396A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010153193.XA CN111447396A (en) 2020-03-06 2020-03-06 Audio and video transmission method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010153193.XA CN111447396A (en) 2020-03-06 2020-03-06 Audio and video transmission method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111447396A (en) 2020-07-24

Family

ID=71627255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010153193.XA Pending CN111447396A (en) 2020-03-06 2020-03-06 Audio and video transmission method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111447396A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160332747A1 (en) * 2015-05-11 2016-11-17 Lily Robotics, Inc. External microphone for an unmanned aerial vehicle
CN106231226A (en) * 2015-09-21 2016-12-14 零度智控(北京)智能科技有限公司 Audio-visual synthetic method, Apparatus and system
CN107872605A (en) * 2016-09-26 2018-04-03 青柠优视科技(北京)有限公司 A kind of UAS and unmanned plane audio/video processing method
CN108496372A (en) * 2017-06-30 2018-09-04 深圳市大疆创新科技有限公司 The video and synchronous method of audio, equipment and unmanned plane
CN107404599A (en) * 2017-07-17 2017-11-28 歌尔股份有限公司 Audio, video data synchronous method, apparatus and system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112218115A (en) * 2020-09-25 2021-01-12 深圳市捷视飞通科技股份有限公司 Control method and device for streaming media audio and video synchronization and computer equipment
CN113132672A (en) * 2021-03-24 2021-07-16 联想(北京)有限公司 Data processing method and video conference equipment
CN113132672B (en) * 2021-03-24 2022-07-26 联想(北京)有限公司 Data processing method and video conference equipment

Similar Documents

Publication Publication Date Title
CN108881815B (en) Video data transmission method and device
CN109803111B (en) Method and device for watching video conference after meeting
CN111107299A (en) Method and device for synthesizing multi-channel video
CN108881948B (en) Method and system for video inspection network polling monitoring video
CN109587002B (en) State detection method and system for video network monitoring equipment
CN110769310B (en) Video processing method and device based on video network
CN110049273B (en) Video networking-based conference recording method and transfer server
CN110719425A (en) Video data playing method and device
CN108965930B (en) Video data processing method and device
CN109905616B (en) Method and device for switching video pictures
CN109743284B (en) Video processing method and system based on video network
CN110769179B (en) Audio and video data stream processing method and system
CN108965783B (en) Video data processing method and video network recording and playing terminal
CN111447396A (en) Audio and video transmission method and device, electronic equipment and storage medium
CN110769297A (en) Audio and video data processing method and system
CN111212255B (en) Monitoring resource obtaining method and device and computer readable storage medium
CN110086773B (en) Audio and video data processing method and system
CN109698953B (en) State detection method and system for video network monitoring equipment
CN111629277A (en) Video data transmission method, device and computer readable storage medium
CN111447407A (en) Monitoring resource transmission method and device
CN110113555B (en) Video conference processing method and system based on video networking
CN110418105B (en) Video monitoring method and system
CN110392233B (en) Monitoring video processing method, device and storage medium
CN110474934B (en) Data processing method and video networking monitoring platform
CN110149306B (en) Media data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination