WO2016151852A1

WO2016151852A1 - Audio reproduction device, image display device and audio reproduction method thereof

Info

Publication number: WO2016151852A1
Application number: PCT/JP2015/059430
Authority: WO
Inventors: 栄作石井
Original assignee: Ｎｅｃディスプレイソリューションズ株式会社
Priority date: 2015-03-26
Filing date: 2015-03-26
Publication date: 2016-09-29

Abstract

An audio reproduction device for reproducing audio represented by audio data received via a network comprises: a data processing device that adjusts, on the basis of an amount of deviation from an ideal arrival time for each piece of aforementioned audio data received via the network, a reproduction start timing of the audio represented by the received audio data; and an audio output device that reproduces the audio on the basis of the reproduction start timing of the audio as adjusted by the data processing device.

Description

Audio reproduction apparatus, image display apparatus and audio reproduction method therefor

The present invention relates to an audio reproduction device, an image display device, and an audio reproduction method for reproducing audio indicated by audio data received via a network.

With recent advances in communication technology, it has become possible to transmit and receive moving image data, which includes image data and audio data, via networks that use packet communication, such as the Internet, intranets, and wireless LANs. In a network, the data transmission speed changes from moment to moment according to line congestion and the communication environment, so a device that plays back moving image data is provided with a buffer that accumulates the received data so that video and audio are not interrupted during playback. This buffer normally holds enough moving image data to reproduce several seconds or more of video.

Such packet communication is expected to be used for transmitting and receiving various kinds of data. For example, a technology in which moving image data is transmitted from a computer or the like to a projector via a network and reproduced by the projector has been put into practical use. In such an image reproduction system, one conceivable usage mode is to install an image display device such as a projector near an image transmission device such as a computer (for example, within visual range) and to control the image and audio reproduced on the image display device from the image transmission device at hand.
In that case, if the image display device accumulates several seconds or more of moving image data in its buffer before playing it back, it takes several seconds or longer for video or audio playback to start on the image display device, and operability from the image transmission device is significantly reduced.

Therefore, for image reproduction systems of this kind, there is a configuration that improves the operability of the image transmission device by transmitting video as continuous still image data (image data) and transmitting only the audio, as audio data, separately from the image data. When image data and audio data are transmitted separately in this way, the time until image reproduction (display) starts can be shortened, for example by changing the transmission rate of the image data sent from the image transmission device according to the transmission speed of the network, and thereby changing the image quality and frame rate reproduced by the image display device. In addition, if audio data is transmitted with priority over image data, the time from reception of audio data to the start of reproduction of the audio indicated by that audio data (hereinafter referred to as the audio delay time) can be shortened.

However, as described above, the data transmission speed of the network changes from moment to moment, so the data transmission time is subject to unstable fluctuation (delay fluctuation) of several tens to several hundreds of milliseconds. For this reason, even if the audio data is transmitted with priority over the image data, the audio is likely to be interrupted if the audio delay time is reduced to about several hundred milliseconds.
On the other hand, taking into account the delay fluctuation, for example, if audio data of about 1 to several seconds is stored in a buffer and audio reproduction is started, audio interruption is suppressed. However, in this case, since the audio delay time becomes long, the operability of the image transmission device is degraded.
Therefore, in the image display device, it is desirable not to set the audio delay time to an unnecessarily long time while suppressing the occurrence of audio interruption due to the delay fluctuation.

Although not intended to suppress interruption of the reproduced audio or to reduce the audio delay time, Patent Document 1 (Japanese Patent Application Laid-Open No. 2004-104701), for example, describes a method of correcting the clock of the receiving-side device in order to prevent overflow and underflow of a buffer that stores moving image data, which would otherwise be caused by the difference between the clocks on the transmission side and the reception side.

JP 2004-104701 A

An object of the present invention is to provide an audio reproduction device, an image display device, and an audio reproduction method thereof that suppress the occurrence of audio interruption caused by delay fluctuation.

In order to achieve the above object, an audio reproduction device of the present invention is an audio reproduction device that reproduces audio indicated by audio data received via a network, and comprises:
a data processing device that adjusts the reproduction start timing of the audio indicated by the received audio data based on the amount of deviation from the ideal arrival time for each piece of audio data received via the network; and
an audio output device that reproduces the audio based on the reproduction start timing of the audio adjusted by the data processing device.

Alternatively, an audio reproduction device of the present invention is an audio reproduction device that reproduces audio indicated by audio data received via a network, and comprises:
a data processing device that, when the data transmission rate of the network decreases from the current state, adjusts the audio data received via the network so as to extend the playback time of the audio it indicates, and, when the data transmission rate of the network increases from the current state, adjusts the audio data received via the network so as to shorten the playback time of the audio it indicates; and
an audio output device that reproduces the audio based on the playback time of the audio adjusted by the data processing device.

An image display device according to the present invention comprises the above audio reproduction device and an image output device that displays an image indicated by image data received via the network, and receives the audio data transmitted in correspondence with the image data.

An audio reproduction method of the present invention is an audio reproduction method performed by an image display device that reproduces an image and audio indicated by image data and audio data received via a network, the method comprising:
adjusting the reproduction start timing of the audio indicated by the received audio data based on the amount of deviation from the ideal arrival time for each piece of audio data received via the network; and
reproducing the audio based on the adjusted reproduction start timing of the audio.

According to the present invention, it is possible to suppress the occurrence of audio interruption caused by delay fluctuation.

FIG. 1 is a block diagram illustrating a configuration example of an image reproduction system according to the present invention.
FIG. 2 is a block diagram showing a configuration example of the image display device shown in FIG. 1.
FIG. 3A is a schematic diagram illustrating an ideal transmission example of audio data in a network.
FIG. 3B is a schematic diagram illustrating an actual transmission example of audio data in the network.
FIG. 4A is a diagram showing an example of frequency distribution information created by the data processing device shown in FIG. 2, and is a histogram showing an example in which the deviation amounts of the audio packets are distributed over positive values.
FIG. 4B is a diagram showing an example of frequency distribution information created by the data processing device shown in FIG. 2, and is a histogram showing an example in which the deviation amounts of the audio packets are distributed over negative values.
FIG. 5 is a schematic diagram showing an example of the relationship among the maximum playback time T_B of the audio that can be stored in the audio buffer shown in FIG. 2, the audio delay time T_P, and the audio playback time T_D corresponding to the amount of audio data remaining in the audio buffer.
FIG. 6A is a histogram showing an example of frequency distribution information when the network has a stable data transmission rate.
FIG. 6B is a histogram showing an example of frequency distribution information when the network has an unstable data transmission rate.
FIG. 7 is a flowchart showing an example of the processing procedure of the image display device of the present invention.

Next, the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram illustrating a configuration example of an image reproduction system according to the present invention.
As shown in FIG. 1, the image reproduction system of the present invention includes an image transmission device 2 that transmits image data and audio data, and an image display device 1 that reproduces the images and audio indicated by the image data and audio data received from the image transmission device 2. The image transmission device 2 and the image display device 1 are connected via a network 3.

The image transmission device 2 transmits the video to be reproduced by the image display device 1 as image data composed of continuous still image data, and transmits the audio data corresponding to the image data separately from the image data. The image transmission device 2 also transmits the audio data corresponding to the image data with priority over the image data. In addition, the image transmission device 2 changes the transmission rate of the image data according to the transmission speed of the network, thereby changing the image quality and frame rate of the image reproduced by the image display device 1. Image data and audio data are transmitted as image packets and audio packets containing the respective data. The image data and the audio data may be associated with information such as an identifier indicating the image transmission device 2 that transmits the data, a time stamp corresponding to the timing at which the data was captured, a time stamp corresponding to the timing at which the data is to be reproduced or displayed, and a content name corresponding to the image data and audio data. The image packets and audio packets may include such information. Furthermore, when the image transmission device 2 starts transmitting image data and the corresponding audio data, it transmits audio related information, which is the information necessary for reproducing the audio indicated by the audio data and includes the sampling frequency of the audio data, the number of sampling bits, the number of channels (monaural, stereo, etc.), the audio packet transmission interval (audio transmission unit time) T, and the like. The audio packet transmission interval corresponds to the ideal audio packet arrival interval.
The image display device 1 receives the image data, sound data, and sound related information that the image transmission device 2 instructs to reproduce, and reproduces the image and sound indicated by the image data and sound data.
The image transmission device 2 can be realized by an information processing apparatus (computer) that includes a communication device that transmits image data and audio data via the network 3, a CPU (Central Processing Unit) that executes processing according to a program, and a memory that stores the data and programs processed by the CPU. The communication device may have any known configuration, wired or wireless, as long as it can transmit data via the network 3.
The network 3 is a known data transmission path including a network device (not shown) that relays packets transmitted and received between the image transmission device 2 and the image display device 1. Since the network 3 has a configuration in which a number of data transmission paths are formed by a number of network devices as is well known, the network 3 is shown in a cloud shape in FIG.

FIG. 2 is a block diagram illustrating a configuration example of the image display device 1 illustrated in FIG. 1.
As shown in FIG. 2, the image display device 1 includes: a communication device 11 that receives image data and audio data from the image transmission device 2 via the network 3; a data processing device 12 that performs required processing on the received image data and audio data; a storage device 13 that holds information generated by the data processing device 12, the received image data, and the received audio data or the audio data processed by the data processing device 12 (audio correction data); an audio output device 14 that reproduces and outputs the audio indicated by the received audio data or the audio correction data; and an image output device 15 that reproduces and displays the image indicated by the image data.
The audio reproduction device 4 of the present invention includes the communication device 11, the data processing device 12, the storage device 13, and the audio output device 14 shown in FIG. 2.

The image display device 1 can be realized by a projector, a display, a monitor, or the like that can reproduce images and audio and that has the functions of the communication device 11, the data processing device 12, the storage device 13, the audio output device 14, and the image output device 15 shown in FIG. 2. The functions of the data processing device 12 and the storage device 13 can be realized by an information processing device (computer) that includes a CPU (Central Processing Unit) that executes processing according to a program and a memory that stores the data and programs processed by the CPU.
The communication device 11 includes a data receiving unit 111 that sequentially receives image data and audio data from the image transmission device 2 via the network 3 and outputs the received image data and audio data to the image / audio dividing unit 121. The communication device 11 may have any known configuration regardless of the wired method or the wireless method as long as data can be transmitted via the network 3. Data transmission is performed using, for example, well-known packet communication.
The storage device 13 includes a video memory 131 that holds image data, an audio correction data storage memory 132 that holds information used for audio data correction, and an audio buffer 133 that holds audio data or audio correction data.

The data processing device 12 includes an image / audio dividing unit 121, an audio transmission processing unit 122, and an audio data processing unit 123.
The image / audio dividing unit 121 performs necessary processing, such as decoding, on the image data received from the communication device 11, and writes the processed image data to the video memory 131 included in the storage device 13. In addition, the image / audio dividing unit 121 outputs the audio data and the audio related information received from the communication device 11 to the audio transmission processing unit 122.
The audio transmission processing unit 122 determines a predetermined reference time from the arrival time of an audio packet containing audio data received by the communication device 11. Based on the audio related information, it then detects, for each audio packet, the amount of deviation from the ideal arrival time, in other words from the scheduled arrival time derived from the reference time and the predetermined arrival interval, and generates deviation amount information, which is information relating to the deviation amount. The deviation amount information includes, for example, information indicating the distribution of the deviation amounts (frequency distribution information), their average, maximum, or minimum value, or the deviation amount values themselves. Hereinafter, an example in which frequency distribution information is mainly used as the deviation amount information will be described. The audio transmission processing unit 122 writes the determined reference time, the generated frequency distribution information, and the audio related information to the audio correction data storage memory 132 included in the storage device 13, and outputs the audio data received from the communication device 11 to the audio data processing unit 123.

The reference time is a reference value for the arrival time of the audio data (audio packets) received from the image transmission device 2 (not shown). Its initial value is set to the arrival time of the first audio data (the first audio packet received) among the received audio data (audio packets). The audio data may be audio data corresponding to the received image data. In that case, the audio data and the image data are transmitted in association with each other.
When the reference time is set, the voice transmission processing unit 122 detects a deviation amount from the reference time for each voice data (voice packet), and starts generating deviation amount information.
When the audio packet containing the first audio data arrives, the image display device 1 reproduces silence from the reference time until a predetermined time has elapsed. Alternatively, audio playback may be started when the predetermined time has elapsed from the reference time. In other words, the reference time serves as the reference for audio playback. Since the audio packets arrive at the image display device 1 at a substantially constant cycle, the image display device 1 holds the audio data received in each cycle in the audio buffer 133, and sequentially reads out and plays back the audio data held in the audio buffer 133.

The audio data processing unit 123 adjusts the reproduction start timing of the audio indicated by the audio data based on the deviation amount of the audio data. The reproduction start timing is adjusted by, for example, extending or shortening the reproduced audio. Specifically, the audio data processing unit 123 reads the frequency distribution information from the audio correction data storage memory 132 and, based on that information, applies processing to the audio data that extends or shortens the reproduced audio over a predetermined adjustment time. The processed audio data (audio correction data) is written to the audio buffer 133.
The audio output device 14 reproduces audio based on the audio reproduction start timing adjusted by the data processing device 12. Specifically, the audio output device 14 includes an audio output unit 141 that sequentially reads audio data or audio correction data from the audio buffer 133 and reproduces / outputs the audio.
The image output device 15 includes an image output unit 151 that sequentially reads image data from the video memory 131 and displays an image.

Next, the operation of the image display apparatus 1 shown in FIG. 2 will be described with reference to the drawings.
FIG. 3A is a schematic diagram showing an ideal transmission example of voice data in the network 3, and FIG. 3B is a schematic diagram showing an actual transmission example of voice data in the network 3. FIG. 4A is a diagram showing an example of frequency distribution information created by the data processing device 12 shown in FIG. 2, and is a histogram showing an example in which the amount of deviation of each voice packet is distributed in a positive value. FIG. 4B is a diagram illustrating an example of frequency distribution information created by the data processing device 12 illustrated in FIG. 2, and is a histogram illustrating an example in which the deviation amount of each voice packet is distributed in a negative value.
FIG. 3A shows a state in which audio packets are ideally transmitted via the network 3, in which audio packets arrive at the image display device 1 at regular intervals (audio transmission unit time T). When voice packets are ideally transmitted via the network 3, the arrival interval is, for example, voice transmission unit time T = 100 ms (milliseconds). That is, the ideal arrival interval is predetermined as the voice transmission unit time T.

When no problem has occurred in the image transmission device 2, the image display device 1, or the network 3, and audio packets are transmitted from the image transmission device 2 at regular intervals (audio transmission unit time T), an audio packet is expected to arrive at the image display device 1 every audio transmission unit time T, as shown in FIG. 3A. In practice, however, the audio packet arrival intervals vary, as indicated by T_0 to T_6 in FIG. 3B, for example, because of unstable fluctuation of the data transmission time (delay fluctuation) in the network 3.

The deviation amount of the actual arrival time (reception completion time) with respect to the ideal arrival time (reception completion time) of the voice packet can be expressed by the following equation (1).

Equation (1) shows the amount of deviation ΔTn from the ideal arrival time of the nth voice packet. ΔTn takes a positive value when the voice packet arrives later than ideal, and takes a negative value when the voice packet arrives earlier than ideal.
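Equation (1) itself is not reproduced in this text. One form that is consistent with the surrounding definitions is given below, purely as an assumption: t_n is a symbol introduced here for the actual arrival time of the n-th audio packet, t_1 is the arrival time of the first packet (the reference time), and T is the audio transmission unit time.

```latex
\Delta T_n = t_n - \bigl( t_1 + (n - 1)\,T \bigr)
```

Under this form, ΔT_n is positive when the packet arrives later than its ideal arrival time and negative when it arrives earlier, and ΔT_1 is zero by construction.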
The actual audio packet arrival intervals T_0 to T_n can be detected from the arrival time of each audio packet, for example the timing at which the packet arrives. That is, time counting starts from the arrival of the first audio packet at the image display device 1 as the base point (reference time), and the deviation amount for each audio packet may be detected from the timing at which each subsequent audio packet arrives, the base point (reference time), and the audio transmission unit time T.
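The detection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the use of integer milliseconds, and the sample arrival times are all assumptions made for the example.

```python
def deviation_amounts(arrival_times_ms, unit_time_ms):
    """Return the deviation of each packet from its ideal arrival time.

    The first arrival is taken as the reference time, so the ideal arrival
    time of the n-th packet (n starting at 1) is reference + (n - 1) * T.
    Positive values mean "later than ideal", negative values "earlier".
    """
    reference = arrival_times_ms[0]
    return [
        arrival - (reference + index * unit_time_ms)
        for index, arrival in enumerate(arrival_times_ms)
    ]


# Hypothetical arrival times with T = 100 ms.
arrivals = [0, 120, 190, 310, 400]
print(deviation_amounts(arrivals, 100))  # [0, 20, -10, 10, 0]
```

Only the first arrival time and T are needed in addition to each packet's own arrival time, which matches the text's remark that time counting starts from the first packet.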

Normally, each packet includes information indicating which part of the overall data it contains. Therefore, the image display device 1 can determine whether a received audio packet contains the first audio data, contains the last audio data, or which position in the sequence its audio data occupies. Further, the value of the ideal audio packet arrival interval (audio transmission unit time T) is notified from the image transmission device 2 to the image display device 1 in advance, so it is known to the image display device 1. In this case, to calculate ΔTn for each audio packet using equation (1) above, it is only necessary to know the arrival time of the audio packet that first arrived at the image display device 1.

As described above, in the present invention, the arrival time of the first audio data (audio packet) among the audio data newly transmitted from the image transmission device 2 for playback is set as the "reference time". However, since the actual audio packet arrival interval varies due to the delay fluctuation of the network 3, the first audio packet itself is also likely to deviate from its ideal arrival time. Therefore, in the present invention, the audio transmission processing unit 122 detects the deviation of the current reference time (initial value) based on the frequency distribution information indicating the distribution of ΔTn.
For example, as shown in FIG. 4A, when ΔTn of the audio packets is distributed over positive values, that is, when the audio packets tend to arrive later than the ideal arrival time, it can be determined that the current reference time is early (the first audio packet arrived earlier than ideal).
On the other hand, as shown in FIG. 4B, when ΔTn of the audio packets is distributed over negative values, that is, when the audio packets tend to arrive earlier than the ideal arrival time, it can be determined that the current reference time is late (the first audio packet arrived later than ideal).

FIG. 5 shows an example of the relationship among the maximum audio playback time T_B corresponding to the memory capacity of the audio buffer 133 (that is, the amount of audio that can be stored in the audio buffer 133), the time T_P from the arrival of the first audio data (audio packet) until reproduction of the audio indicated by that audio data starts (the audio delay time), and the audio playback time T_D corresponding to the amount of previously received audio data remaining in the audio buffer 133.
As shown in FIG. 5, the maximum playback time T_B [seconds] is set to a value sufficiently larger than the audio delay time T_P. The relationship between the memory capacity M_B [bytes] of the audio buffer 133 and the maximum playback time T_B of the audio corresponding to the amount of audio data that can be stored in the audio buffer 133 can be expressed by the following formula (2).

Here, f is the sampling frequency of the audio data [Hz], N_B is the number of sampling bits of the audio data [bit], and N_C is the number of channels of the audio data.
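Formula (2) is not reproduced in this text. Given the stated units (M_B in bytes, T_B in seconds, N_B in bits), the usual relationship for uncompressed PCM audio would be the following; it is offered as an assumed reconstruction, not as the formula printed in the publication.

```latex
M_B = f \times \frac{N_B}{8} \times N_C \times T_B
```

As an illustration with hypothetical values, f = 48000 Hz, N_B = 16 bits, N_C = 2 channels, and T_B = 2 seconds give M_B = 48000 × 2 × 2 × 2 = 384000 bytes.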

If each audio packet is transmitted ideally via the network 3, the audio playback time T_D satisfies T_P + T ≥ T_D ≥ T_P, and each audio packet contains audio data with a playback time corresponding to T. However, the last audio packet does not necessarily contain audio data with a playback time corresponding to T. The image display device 1 receives an audio packet, stores the audio data it contains in the audio buffer 133, and starts reproducing the audio indicated by the n-th audio data (n − 1)T + T_P after the reference time.
In its initial state, the audio buffer 133 stores silent audio data corresponding to T_P, and reproduction starts from the time the first audio data is received (the reference time). With this operation, the first audio data is reproduced with a delay of T_P. Further, by sequentially reproducing the audio indicated by the audio data stored in the audio buffer 133, reproduction of the audio indicated by the n-th audio data (playback time = T) starts automatically (n − 1)T + T_P after the reference time.
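The playback schedule that results from pre-filling the buffer with silence can be sketched as below. The function name and the sample values (T = 100 ms, T_P = 150 ms) are assumptions chosen for illustration.

```python
def playback_start_times(packet_count, unit_time_ms, delay_ms):
    """Return when each packet starts playing, measured from the reference time.

    Silence lasting delay_ms (T_P) is played first, and every packet carries
    unit_time_ms (T) of audio, so packet n starts at (n - 1) * T + T_P.
    """
    return [(n - 1) * unit_time_ms + delay_ms for n in range(1, packet_count + 1)]


print(playback_start_times(4, 100, 150))  # [150, 250, 350, 450]
```

Because playback is strictly sequential, no per-packet scheduling is needed beyond the initial silent period; the later start times follow from the packet durations.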

The audio delay time T_P may be set in advance to a value satisfying T_P ≥ T + ΔTn_MAX, taking into account the maximum value (ΔTn_MAX) of the deviation amount ΔTn from the ideal arrival time of the audio packets caused by delay fluctuation. For example, settings such as T_P = 1.5T, T_P = 2T, or T_P = 3T are conceivable.
In many cases, however, ΔTn_MAX of the network 3 is unknown, and the audio delay time T_P that a user can tolerate differs depending on the type of audio being reproduced. For example, the audio delay time T_P should be as short as possible for conversation and the like, whereas a long audio delay time T_P is often not a problem for BGM (background music) and the like. Therefore, the image display device 1 or the image transmission device 2 may be provided with an adjustment mechanism for T_P so that the user can set T_P freely.

After the image display device 1 starts playing the audio, if the next audio packet arrives within T_P, the next audio data is already stored in the audio buffer 133 when reproduction of the previously received audio data is completed, so the audio is played back without interruption.
On the other hand, if the next audio packet arrives later than T_P, the next audio data is not yet stored in the audio buffer 133 when reproduction of the previously received audio data is completed. In that case, a silent state continues until the next audio data is received.
Here, when the audio packets as a whole tend to arrive later than the ideal arrival time, that is, when the current reference time is early, a silent state in which the image display device 1 cannot reproduce audio is highly likely to occur.
When the audio packets as a whole tend to arrive earlier than the ideal arrival time, that is, when the current reference time is late, no silent state occurs in the image display device 1. In that case, by correcting the reference time, the audio delay time T_P can be set shorter than its current value. However, even when reproduction of the previously received audio data is completed, audio data corresponding to a time longer than T_P remains in the audio buffer 133, so the audio buffer 133 may overflow.
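The condition for a silent gap can be checked with a short sketch. It assumes, as an interpretation of the text, that packet n is due to start playing at (n − 1)T + T_P after the reference time, so a gap occurs when its deviation ΔTn exceeds T_P; the function name and sample values are hypothetical.

```python
def late_packets(deviations_ms, delay_ms):
    """Return 1-based indices of packets that arrive after their playback start.

    A packet with deviation d arrives at (n - 1) * T + d and is due to play at
    (n - 1) * T + T_P, so it is too late exactly when d > T_P.
    """
    return [n for n, d in enumerate(deviations_ms, start=1) if d > delay_ms]


# With T_P = 150 ms, the third and fifth packets would cause a silent gap.
print(late_packets([0, 20, 160, -10, 151], 150))  # [3, 5]
```

A similar check against T_B could flag the overflow case mentioned above, but that is left out to keep the sketch short.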

Therefore, the data processing device 12 adjusts the reproduction start timing of the audio indicated by the audio data contained in the received audio packets based on the deviation amount ΔTn for each audio packet. Specifically, the audio transmission processing unit 122 detects a correction deviation amount based on the deviation amount information, which is information on the deviation amount ΔTn for each audio packet, for example the frequency distribution information. The audio data processing unit 123 then adjusts the audio reproduction start timing based on the detected correction deviation amount.
When the average value of ΔTn of the audio packets is positive, the audio packets may be judged to tend to arrive later than the ideal arrival time; when the average value is negative, they may be judged to tend to arrive earlier than the ideal arrival time.
Note that the frequency distribution information includes a frequency distribution in which ΔTn is classified for each predetermined range, as shown in FIG. 4A or 4B.
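The average-based judgment described above can be sketched as follows. The function name and the returned labels are assumptions; the mapping from the sign of the average to "reference early" or "reference late" follows the explanation of FIG. 4A and FIG. 4B.

```python
from statistics import mean


def reference_time_tendency(deviations_ms):
    """Classify the current reference time from the average deviation.

    Positive average: packets tend to arrive late, so the reference is early.
    Negative average: packets tend to arrive early, so the reference is late.
    """
    average = mean(deviations_ms)
    if average > 0:
        return "reference_early"
    if average < 0:
        return "reference_late"
    return "reference_ok"


print(reference_time_tendency([20, 30, 40, 10]))     # reference_early
print(reference_time_tendency([-50, -45, -55, -5]))  # reference_late
```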

Generation of the deviation amount information starts when the reference time is set. The deviation amount information may be generated for each predetermined period, for the period from the first reception of audio data to the last, or for each period in which the audio reproduction start timing is adjusted. The correction deviation amount is detected based on the deviation amount ΔTn for each audio packet. For example, the most frequently occurring deviation amount ΔTn is used as the correction deviation amount. Alternatively, the average value of the deviation amounts ΔTn may be used. The most frequent deviation amount and the average deviation amount are both examples of the correction deviation amount.
When using frequency distribution information in which the deviation amount ΔTn is classified for each predetermined range, the correction deviation amount may be a value corresponding to the predetermined range, for example, a median value, a maximum value, or a minimum value of the predetermined range.
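Deriving a correction deviation amount from binned frequency distribution information might look like the sketch below. The 10 ms bin width, the choice of the bin's central value, the function name, and the sample data are assumptions for illustration; the text equally allows the maximum or minimum of the range, or the plain average.

```python
from collections import Counter


def correction_deviation_ms(deviations_ms, bin_width_ms=10):
    """Return the central value of the most frequent deviation bin.

    Bin k covers [k * width, (k + 1) * width). Floor division keeps negative
    deviations in the correct bin. On a tie, the bin seen first wins.
    """
    counts = Counter(d // bin_width_ms for d in deviations_ms)
    most_frequent_bin = max(counts, key=counts.get)
    return most_frequent_bin * bin_width_ms + bin_width_ms // 2


# Most deviations fall in the [-60, -50) ms bin, whose central value is -55 ms.
print(correction_deviation_ms([-52, -48, -55, -47, -51, -20, 5]))  # -55
```

Using the most frequent bin rather than the mean makes the result less sensitive to a few outlying packets, which may matter on an unstable network.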

To adjust (shift) the audio reproduction start timing, the playback time of the audio may be extended or shortened using audio data corresponding to a predetermined adjustment time (about 10 to several tens of seconds). The correction deviation amount is used as the extension time or shortening time. Methods of extending or shortening the audio playback time include, for example, resampling the received audio data at a sampling frequency different from the one used when the audio data was sampled, deleting part of the audio data, or inserting silent data.
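The resampling option can be illustrated with a minimal sketch. Linear interpolation is used here only because it is the simplest technique to show; the source does not specify an interpolation method, a practical resampler would normally use band-limited filtering, and the function name is hypothetical.

```python
def resample_linear(samples, target_length):
    """Stretch or shrink a list of samples to target_length by linear interpolation.

    Played back at the original sampling frequency, a longer result extends the
    playback time and a shorter result reduces it.
    """
    if len(samples) < 2 or target_length < 2:
        raise ValueError("need at least two input and two output samples")
    step = (len(samples) - 1) / (target_length - 1)
    output = []
    for i in range(target_length):
        position = i * step
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        fraction = position - left
        output.append(samples[left] * (1 - fraction) + samples[right] * fraction)
    return output


print(resample_linear([0.0, 10.0], 3))  # [0.0, 5.0, 10.0]
```

Note that this kind of resampling also shifts the pitch slightly, which is one reason for keeping the change in length small.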

Further, when the audio playback time is extended or shortened, the extension time or shortening time is set to about 0.5% of the playback time of the audio data used for adjustment so that the user of the image display device 1 does not feel uncomfortable with the reproduced audio. For example, when the correction deviation amount is −50 ms (milliseconds), the subsequent 10 seconds (= 50 ms × (100 / 0.5)) is set as the adjustment time, and the audio data for those 10 seconds is shortened to 9.95 seconds (= 10 s − 50 ms) for playback. When the correction deviation amount is +50 ms, the playback time of the subsequent 10 seconds of audio data is extended to 10.05 seconds (= 10 s + 50 ms) for playback.
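The arithmetic in this example can be checked with a short sketch. The function name `plan_adjustment` and its parameters are hypothetical; only the 0.5% ratio and the 50 ms figures come from the text.

```python
def plan_adjustment(correction_ms, ratio_percent=0.5):
    """Return (adjustment_time_ms, adjusted_playback_ms) for one correction."""
    if ratio_percent <= 0:
        raise ValueError("ratio_percent must be positive")
    # Same arithmetic as the text: 50 ms x (100 / 0.5) = 10 s.
    adjustment_ms = abs(correction_ms) * (100 / ratio_percent)
    # A negative correction shortens playback; a positive one extends it.
    return adjustment_ms, adjustment_ms + correction_ms
```

Calling `plan_adjustment(-50)` gives an adjustment time of 10000 ms played back in 9950 ms, and `plan_adjustment(50)` gives 10000 ms played back in 10050 ms.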

The extension time and shortening time may be any length that does not make the user of the image display device 1 feel uncomfortable with the reproduced audio, and may be set to 0.5% or less of the playback time of the audio data used for adjustment. Specifically, for example, assuming that the playback time of the audio data included in each voice packet is 100 ms and the correction deviation amount is −50 ms (or +50 ms), the number of pieces of audio data corresponding to the 10-second adjustment time needed to absorb the correction deviation amount is 100. Therefore, the voice data processing unit 123 shortens each piece of audio data to 99.5 ms (or extends it to 100.5 ms). The shortened (or extended) audio data is stored in the sound buffer 133 as audio correction data. The stored audio correction data is reproduced sequentially. By repeating this process 100 times, the playback start timing can be adjusted automatically without interrupting the audio.
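One way to shorten or extend a single piece of audio data is the resampling approach mentioned earlier. The following is a rough sketch only: it uses plain linear interpolation, which is a simplification chosen here and not a method stated in the embodiment, and the function names and the 48 kHz figure in the note below are assumptions.

```python
def stretched_length(sample_count, packet_ms, target_ms):
    """Number of samples needed so the packet plays for target_ms."""
    return round(sample_count * target_ms / packet_ms)


def resample_linear(samples, new_length):
    """Stretch or shrink a list of samples by linear interpolation."""
    if new_length < 2 or len(samples) < 2:
        raise ValueError("need at least two input and two output samples")
    step = (len(samples) - 1) / (new_length - 1)
    out = []
    for i in range(new_length):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out
```

For instance, a 100 ms packet sampled at an assumed 48 kHz holds 4800 samples; shortening it to 99.5 ms means resampling it to 4776 samples, which is then played at the original rate.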

It is also desirable to correct the reference time in addition to adjusting the playback start timing of the audio indicated by the audio data included in the received voice packets. Correcting the reference time makes it possible to detect the deviation amount of the audio data relative to the corrected reference time even after the audio playback start timing has been adjusted, so that the playback timing can be adjusted appropriately again. Note that it is desirable to reset the deviation amount information when correcting the reference time. Resetting the deviation amount information at an appropriate timing allows, for example, deviation amount information that reflects the latest transmission state of the network to be generated.

The data processing device 12 adjusts the playback start timing of the voice indicated by the voice data and corrects the reference time based on the deviation amount ΔTn for each voice packet. The correction amount of the reference time and the adjustment amount of the reproduction start timing are detected based on the deviation amount ΔTn for each audio packet.
Specifically, the voice transmission processing unit 122 corrects the reference time based on the correction deviation amount. For example, as described above, when the playback time of the audio data included in each voice packet is 100 ms and the correction deviation amount is −50 ms (or +50 ms), the number of pieces of audio data corresponding to the 10-second adjustment time for adjusting the audio playback timing is 100. Therefore, the audio data processing unit 123 shortens each piece of audio data to 99.5 ms (or extends it to 100.5 ms), and the audio transmission processing unit 122 shifts the reference time by −0.5 ms (or +0.5 ms). The shortened (or extended) audio data is stored in the sound buffer 133 as audio correction data. The stored audio correction data is reproduced sequentially. By repeating this process 100 times, the audio playback start timing can be adjusted without interrupting the audio, and the reference time can be corrected by the correction amount of −50 ms (or +50 ms).
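The gradual, per-packet procedure in this example can be sketched as a generator that yields, for each piece of audio data, its adjusted playback length together with the accumulated shift of the reference time. The name `gradual_steps` and its parameters are hypothetical.

```python
def gradual_steps(correction_ms, packet_ms, adjustment_ms):
    """Yield (adjusted_packet_ms, accumulated_reference_shift_ms) per packet."""
    packets = round(adjustment_ms / packet_ms)  # 10 s / 100 ms = 100 packets
    step_ms = correction_ms / packets           # -50 ms / 100 = -0.5 ms each
    shift_ms = 0.0
    for _ in range(packets):
        shift_ms += step_ms
        yield packet_ms + step_ms, shift_ms
```

With `gradual_steps(-50, 100, 10000)`, each of the 100 steps yields a 99.5 ms packet, and the accumulated reference shift reaches −50 ms on the last step.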

Since the correction amount of the reference time is determined based on the correction deviation amount, adjusting the audio playback start timing based on the correction deviation amount includes adjusting it based on the correction amount of the reference time. Likewise, since the adjustment amount of the audio playback start timing is determined based on the correction deviation amount, correcting the reference time based on the correction deviation amount includes correcting it based on the adjustment amount of the audio playback start timing.

As described above, the transmission rate of the network 3 changes from moment to moment. When the data transmission rate of the network 3 decreases from the current state, each voice packet tends to arrive later than the ideal arrival time, and the values of ΔTn of the voice packets are distributed mostly on the positive side. When the data transmission rate of the network 3 increases from the current state, each voice packet tends to arrive earlier than the ideal arrival time, and the values of ΔTn of the voice packets are distributed mostly on the negative side. In the present invention, the audio playback start timing can be set appropriately by adjusting the audio playback time.
That is, when the data transmission rate of the network 3 decreases from the current state, the data processing device 12 extends, during the adjustment period, the playback time of the audio indicated by the audio data received via the network 3. When the data transmission rate of the network 3 increases from the current state, the data processing device 12 shortens, during the adjustment period, the playback time of the audio indicated by the audio data received via the network 3. The audio output device reproduces the audio based on the audio playback time adjusted by the data processing device. Note that extending or shortening the audio playback time amounts to adjusting the audio playback start timing.

In the above description, an example is shown in which the correction amount of the reference time and the adjustment amount of the audio playback start timing are set to the same amount and both are corrected or adjusted gradually. However, the present invention is not limited to such processing. For example, the reference time may be corrected all at once (by −50 ms or +50 ms). In this case, it is desirable to stop the deviation amount detection process during the adjustment time of the audio playback start timing, because detecting the deviation amount appropriately during this adjustment period would require taking into account the elapsed time from the start of adjustment and the adjustment amount of the playback start timing. In summary, when each voice packet tends to arrive later than the ideal arrival time (when the correction deviation amount is positive), the current reference time is corrected so as to be delayed, and the audio playback time is adjusted so as to be extended. When each voice packet tends to arrive earlier than the ideal arrival time (when the correction deviation amount is negative), the current reference time is corrected so as to be advanced, and the audio playback time is adjusted so as to be shortened.

By adjusting the audio playback time (by adjusting the audio playback start timing) in this way, the audio data amount held in the audio buffer 133 becomes an appropriate value, and audio interruptions and the like can be suppressed. In addition, since the overflow of the audio data can be suppressed, the capacity of the audio buffer 133 can be reduced. Furthermore, by adjusting the reference time, it is possible to readjust according to the state of the network even after adjusting the audio reproduction.

Note that a threshold T_MAX may be set in advance, and the correction deviation amount may be calculated using only values of ΔTn that satisfy ΔTn < T_MAX, so that a voice packet that arrives extremely late compared with other voice packets is not used for calculating (detecting) the correction deviation amount. At this time, the deviation amount information may be generated as information on the values of ΔTn that satisfy ΔTn < T_MAX. In that case, a more appropriate correction deviation amount can be calculated.
Further, a predetermined threshold value T_TH may be provided for the correction deviation amount, and the reference time may be corrected and the audio playback start timing adjusted (the audio data processed) when the absolute value of the correction deviation amount exceeds the threshold value T_TH. Alternatively, the audio playback start timing may be adjusted every predetermined period. In these cases, the processing load on the data processing device 12 is reduced because the reference time and the audio playback start timing are not changed frequently. The threshold value may be set to a different value depending on whether the correction deviation amount is positive or negative. This makes it possible to choose whether to give priority to a short audio delay time or to suppressing audio interruptions.
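The two thresholds could be applied as in the following sketch. It assumes the average is used as the correction deviation amount; the function names and the separate positive and negative threshold parameters are illustrative, not taken from the embodiment.

```python
def filtered_correction(deviations_ms, t_max_ms):
    """Average of ΔTn using only values that satisfy ΔTn < T_MAX."""
    usable = [d for d in deviations_ms if d < t_max_ms]
    if not usable:
        return None  # nothing usable yet, so no correction can be computed
    return sum(usable) / len(usable)


def needs_adjustment(correction_ms, t_th_pos_ms, t_th_neg_ms=None):
    """True when |correction| exceeds T_TH; the sign selects the threshold."""
    if t_th_neg_ms is None:
        t_th_neg_ms = t_th_pos_ms  # same threshold for both signs by default
    if correction_ms >= 0:
        return correction_ms > t_th_pos_ms
    return -correction_ms > t_th_neg_ms
```

For example, `filtered_correction([10, 20, 900], 500)` ignores the 900 ms outlier, and `needs_adjustment(-20, 30, 10)` triggers an adjustment only because the negative-side threshold is set lower than the positive-side one.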

As described above, by correcting the reference time based on the frequency distribution information, the ΔTn values of voice packets received thereafter are distributed in the vicinity of zero (0). In such a case, subsequent voice packets arrive within T + ΔTn_MAX, so by adjusting the audio playback start timing based on the correction deviation amount, the occurrence of audio interruptions is suppressed even without setting the audio delay time T_P to a value sufficiently larger than T + ΔTn_MAX. Therefore, audio interruptions caused by delay fluctuation can be suppressed without setting the audio delay time T_P to an unnecessarily long time.

Incidentally, as described above, the audio delay time T_P may be set so as to satisfy T_P ≧ T + ΔTn_MAX. As can be seen from the histograms shown in FIGS. 4A and 4B, the value of ΔTn_MAX can be obtained from the deviation amount information (frequency distribution information).
FIG. 6A is a histogram showing an example of frequency distribution information when the network 3 has a stable data transmission rate, and FIG. 6B is a histogram showing an example of frequency distribution information when the network 3 has an unstable data transmission rate. FIGS. 6A and 6B show examples of frequency distribution information after the reference time is corrected.
For example, when the network 3 is in a stable data transmission state, the frequency distribution of ΔTn of the voice packets is relatively narrow and ΔTn_MAX is a small value, as shown in FIG. 6A. On the other hand, when the network 3 is in an unstable data transmission state, the frequency distribution of ΔTn of the voice packets is relatively wide and ΔTn_MAX is a large value, as shown in FIG. 6B.

Therefore, in the present invention, after the reference time is corrected, the extent of the frequency distribution in the time axis direction, that is, the maximum deviation amount of the voice packets from the ideal arrival time (ΔTn_MAX: the maximum value in the positive region), is detected, and the audio delay time T_P is adjusted according to ΔTn_MAX. For example, the audio delay time T_P is set to T_P = T + ΔTn_MAX. T_P may also be set to T_P = T + ΔTn_MAX + α to allow for an error α in T and ΔTn_MAX. At this time, the deviation amount information that is a frequency distribution may be generated as information on ΔTn that satisfies ΔTn < T_MAX. The same processing can be applied when the average value and the maximum value of the deviation amounts are used as the deviation amount information.
By setting the audio delay time T_P in this way, T_P can be set short when the network 3 is in a stable data transmission state as shown in FIG. 6A. Even when the network 3 is in an unstable data transmission state as shown in FIG. 6B, T_P can be set as short as possible within the range in which audio interruptions can be suppressed.
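A minimal sketch of the relation T_P = T + ΔTn_MAX + α follows. The function and parameter names are hypothetical, and treating ΔTn_MAX as zero when no positive deviation has been observed is an assumption made here.

```python
def audio_delay_ms(deviations_ms, interval_ms, margin_ms=0.0, t_max_ms=None):
    """Audio delay time T_P = T + ΔTn_MAX + α, all values in milliseconds."""
    # Optionally ignore extreme outliers, keeping only ΔTn < T_MAX.
    usable = [d for d in deviations_ms if t_max_ms is None or d < t_max_ms]
    # ΔTn_MAX is the largest deviation in the positive region (0 if none).
    delta_max = max((d for d in usable if d > 0), default=0.0)
    return interval_ms + delta_max + margin_ms
```

With T = 100 ms, a narrow distribution such as `[-5, 0, 3, 8]` gives T_P = 108 ms, while a wide one such as `[-40, 0, 60, 120]` gives T_P = 220 ms, mirroring the contrast between the stable and unstable cases.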

As described above, the playback start time of each piece of audio data is (n−1)T + T_P after the reference time and therefore also depends on T_P. Accordingly, when the audio delay time T_P is changed, the audio playback start timing may be adjusted based on the adjusted delay time T_P using the audio data processing method described above. For example, when the audio delay time T_P is shortened, the audio data for a predetermined adjustment time is shortened by the amount of the reduction, and when the audio delay time T_P is lengthened, the audio data for a predetermined adjustment time is extended by the amount of the increase. Since the value of ΔTn_MAX can be obtained from the frequency distribution of ΔTn once the correction deviation amount is known, the correction of the reference time and the adjustment of the audio delay time T_P may be performed simultaneously. In that case, the audio playback start timing may be adjusted by processing the audio data based on both the correction deviation amount and the adjusted audio delay time T_P.

As described above, by adjusting the audio delay time T_P based on the frequency distribution of ΔTn, the audio delay time T_P can be set to an optimum value according to the data transmission state of the network 3 while suppressing audio interruptions caused by delay fluctuation.

FIG. 7 is a flowchart showing an example of a processing procedure of the image display apparatus 1 of the present invention.
As shown in FIG. 7, when reception of the image data and audio data starts and the audio transmission processing unit 122 receives the audio related information from the image transmitting apparatus 2 via the data receiving unit 111 and the image audio dividing unit 121, it stores the audio related information in the voice correction data storage memory 132 (step A1).

When the audio transmission processing unit 122 receives the audio data from the image / audio dividing unit 121 (step A2), it determines whether the audio data is the first audio data (step A3). In the case of the first audio data, the voice transmission processing unit 122 stores the arrival time of the voice packet including the audio data in the voice correction data storage memory 132 as the initial value of the reference time (step A4), proceeds to the process of step A10, and the audio data is written to the audio buffer 133.
If the received voice data is not the first voice data, the voice transmission processing unit 122 reads the reference time and the voice transmission unit time T included in the voice related information from the voice correction data storage memory 132, and calculates the deviation amount ΔTn of the voice data from the ideal arrival time (step A5).
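Step A5 can be illustrated with a small sketch. It assumes that the ideal arrival time of the n-th voice packet is the reference time plus (n−1)T, which is an assumption consistent with the playback start time (n−1)T + T_P used elsewhere in this description; the function name is hypothetical.

```python
def deviation_ms(arrival_ms, reference_ms, n, interval_ms):
    """ΔTn: actual arrival time minus the assumed ideal arrival time."""
    # Assumed ideal arrival of the n-th packet: reference + (n - 1) * T.
    ideal_ms = reference_ms + (n - 1) * interval_ms
    return arrival_ms - ideal_ms
```

For example, with a reference time of 1000 ms and T = 100 ms, a third packet arriving at 1230 ms has ΔTn = +30 ms (late), and one arriving at 1190 ms has ΔTn = −10 ms (early).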

The voice transmission processing unit 122 generates deviation amount information including the ΔTn, and calculates a corrected deviation amount based on the deviation amount information (step A6).
Next, the audio transmission processing unit 122 determines whether or not a predetermined adjustment condition is satisfied (step A7). For example, the audio transmission processing unit 122 determines whether the audio data needs to be processed (that is, whether the audio playback start timing needs to be adjusted) by comparing the correction deviation amount calculated in step A6 with a preset threshold value T_TH. If the correction deviation amount exceeds the threshold value T_TH, the audio transmission processing unit 122 determines that the audio data needs to be processed, updates the reference time based on the correction deviation amount, and instructs the audio data processing unit 123 to process the audio data. The need to adjust the audio playback start timing does not have to be determined using the threshold value T_TH. For example, the audio transmission processing unit 122 may instead determine whether a predetermined time has elapsed since the previous adjustment of the audio playback start timing.

When the audio data processing is instructed, the audio data processing unit 123 processes the audio data so as to expand the audio to be reproduced when the correction deviation amount is a positive value. If the correction deviation amount is a negative value, the audio data processing unit 123 processes the audio data so as to shorten the audio to be reproduced (step A8). The audio data processing unit 123 writes the processed audio data (audio correction data) in the audio buffer 133 (step A9).

If, in step A7, the correction deviation amount does not exceed the threshold value T_TH, the voice transmission processing unit 122 determines that processing of the audio data is unnecessary, and the audio data processing unit 123 writes the audio data received in step A2 to the audio buffer 133 without processing it (step A9).
Finally, the voice transmission processing unit 122 determines whether or not the voice data received in step A2 is the last voice data (step A10). If it is not the last voice data, the process returns to step A2, the next voice data is received from the image sound dividing unit 121, and the processing from step A2 to step A10 is repeated. If the voice data received in step A2 is the last voice data, the process is terminated.
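The control flow of steps A2 to A10 can be sketched as a simple loop. This is a simplified illustration under several assumptions: the average of ΔTn is used as the correction deviation amount, the reference time is corrected all at once, the deviation amount information is reset on each correction, and the actual audio processing is reduced to an action label. The function name `run_flow` is hypothetical.

```python
def run_flow(arrivals_ms, interval_ms, t_th_ms):
    """Return one action per packet: 'store', 'extend' or 'shorten'."""
    reference_ms = None
    deviations = []
    actions = []
    for n, arrival_ms in enumerate(arrivals_ms, start=1):   # A2
        if reference_ms is None:                             # A3
            reference_ms = arrival_ms                        # A4
            actions.append("store")
            continue
        ideal_ms = reference_ms + (n - 1) * interval_ms
        deviations.append(arrival_ms - ideal_ms)             # A5
        correction_ms = sum(deviations) / len(deviations)    # A6
        if abs(correction_ms) > t_th_ms:                     # A7
            reference_ms += correction_ms  # update the reference time
            deviations.clear()             # reset the deviation information
            # A8 and A9: process the audio data, then write it to the buffer.
            actions.append("extend" if correction_ms > 0 else "shorten")
        else:
            actions.append("store")                          # A9, unprocessed
    return actions                                           # A10: no more data
```

With arrivals at 0, 100, 260 and 360 ms, T = 100 ms and a 20 ms threshold, the first two packets are stored as they are and the last two are marked for extension, because the packets have started to arrive late.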

The flowchart shown in FIG. 7 shows a processing example in which the reference time is corrected and the audio playback start timing is adjusted based on the correction deviation amount. When the audio delay time T_P is adjusted as well, the adjusted value of T_P may be determined from the deviation amount information and the correction deviation amount obtained in the process of step A6.

According to the present invention, the deviation amount ΔTn from the ideal arrival time is calculated for each voice packet, and the reference time is corrected based on the frequency distribution of ΔTn, so that the ΔTn values of voice packets received thereafter are distributed in the vicinity of ΔTn = zero (0). Further, by adjusting the audio playback start timing based on the correction deviation amount, the occurrence of audio interruptions can be suppressed without setting the audio delay time T_P to a value sufficiently larger than T + ΔTn_MAX.
Further, by adjusting the audio delay time T_P based on the frequency distribution of ΔTn of the voice packets, the audio delay time T_P can be set appropriately according to the data transmission state of the network 3 while suppressing audio interruptions.
Considering the delay fluctuation, audio interruptions can be suppressed if, for example, audio data of about 1 second to several seconds is accumulated in the audio buffer 133 before audio playback is started. However, in that case, the operability of the image transmission device 2 is degraded. In particular, when audio data and image data are reproduced (displayed) in correspondence with each other, the audio is reproduced with a significant delay from the image. If the playback start timings of the image data and the corresponding audio data are shifted from each other, the viewer will feel uncomfortable with the reproduced audio. By adjusting the audio delay time T_P based on the frequency distribution of ΔTn of the voice packets as in the present invention, T_P can be kept short, which reduces such discomfort.

In other words, when the data transmission rate of the network decreases from the current state, the arrival of voice packets is delayed and the distribution of the deviation amount shifts in the positive direction. In this case, the audio data held in the audio buffer 133 decreases, and if the data transmission rate deteriorates further, the audio data may run out. For this reason, in the present invention, the playback time of the audio indicated by the audio data that has arrived (the audio corresponding to the voice packet) is extended. When the data transmission rate increases from the current state, the arrival of voice packets is advanced, and the distribution of the deviation amount shifts in the negative direction. In this case, the audio data stored in the audio buffer increases. Therefore, in the present invention, the playback time of the audio corresponding to the voice packet that has arrived is shortened. By adjusting the audio playback time in this way, the amount of audio data held in the audio buffer 133 becomes an appropriate value, and audio interruptions and audio data overflows can be suppressed.
Furthermore, in the present invention, since the sound reproduction time is adjusted and the reference time is corrected, the distribution of the deviation amount after the adjustment processing is close to zero, and appropriate sound reproduction is performed. Further, even when the data transmission speed of the network 3 changes after the adjustment process, the same process can be performed, and appropriate sound reproduction can be continued.

Although the present invention has been described above with reference to the embodiment, the present invention is not limited to the embodiment. Various modifications that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

[Appendix 1]
An audio playback device for playing back audio represented by audio data received via a network,
A data processing device that adjusts the reproduction start timing of the audio indicated by the received audio data based on the amount of deviation from the ideal arrival time for each of the audio data received via the network;
An audio output device for reproducing the audio based on the reproduction start timing of the audio adjusted by the data processing device;
An audio reproducing apparatus having
[Appendix 2]
The audio playback device according to appendix 1, wherein
A storage device;
The data processing device adjusts the reproduction start timing of the audio by processing the audio data based on the deviation amount to generate audio correction data,
The storage device temporarily holds the audio data and the audio correction data,
The audio output device reads the audio data or the audio correction data held in the storage device, and reproduces the audio indicated by the read audio data or the audio correction data.
[Appendix 3]
The audio playback device according to appendix 1 or 2,
The data processing device adjusts the reproduction start timing of the audio and corrects the reference time based on the deviation amount.
[Appendix 4]
The audio reproduction device according to appendix 3, wherein
When the average value of the deviation amount or the value of the deviation amount with the highest occurrence frequency of the deviation amount is a corrected deviation amount,
The audio reproduction start timing is adjusted based on the correction deviation amount,
The audio reproduction device, wherein the reference time is corrected based on the correction deviation amount.
[Appendix 5]
The audio playback device according to appendix 4, wherein
The data processing device includes:
If the amount of correction deviation is positive, adjust to extend the playback time of the sound, correct to delay the reference time,
An audio reproducing apparatus that adjusts the audio reproduction time to be shortened and corrects the reference time to be advanced when the correction deviation is negative.
[Appendix 6]
The audio playback device according to any one of appendices 1 to 5,
The reference time is set as an initial value of an arrival time at which the first audio data of the received audio data arrives.
[Appendix 7]
The audio playback device according to any one of appendices 1 to 6,
The data processing device includes:
An audio reproducing apparatus that gradually adjusts the audio reproduction start timing based on an audio reproduction time used for adjusting the audio reproduction start timing.
[Appendix 8]
The audio playback device according to any one of appendices 1 to 7,
The ideal playback time is an audio playback device that is scheduled based on a predetermined transmission interval of the audio data and a reference time serving as a reference when playing back the audio.
[Appendix 9]
The audio playback device according to appendix 8, wherein
The data processing device includes:
Based on the transmission interval of the audio data and the maximum value indicated by the deviation amount information, which is information relating to the deviation amount, the time from the arrival of the audio data to the start of reproduction of the audio indicated by the audio data An audio playback device that adjusts the audio delay time.
[Appendix 10]
The audio playback device according to any one of appendices 5 to 9,
The data processing device includes:
An audio reproduction apparatus, wherein the audio expansion time or shortening time is set to 0.5% or less of an audio reproduction time used for adjusting the audio reproduction start timing.
[Appendix 11]
An audio playback device for playing back audio represented by audio data received via a network,
When the data transmission rate of the network decreases from the current state, the audio data received via the network is adjusted to extend the playback time of the voice indicated by the network, and the data transmission rate of the network increases from the current state. A data processing device for adjusting so as to shorten the playback time of the voice indicated by the voice data received via the network;
An audio output device that reproduces the audio based on the reproduction time of the audio adjusted by the data processing device;
An audio reproducing apparatus having
[Appendix 12]
The sound reproducing device according to any one of appendices 1 to 11,
An image output device for displaying an image indicated by image data received via the network;
With
An image display device that receives audio data transmitted corresponding to the image data.
[Appendix 13]
An image transmission device that transmits video as image data composed of continuous still image data, and transmits audio data corresponding to the image data separately from the image data;
The image display device according to appendix 12, which is connected to the image transmission device via a network so as to be capable of data transmission and which reproduces the image and audio indicated by the image data and the audio data received from the image transmission device;
An image reproduction system.
[Appendix 14]
An audio reproduction method by an image display device that reproduces an image and audio indicated by image data and audio data received via a network,
Based on the amount of deviation from the ideal arrival time for each audio data received via the network, adjust the audio playback start timing indicated by the received audio data,
An audio reproduction method for reproducing the audio based on the reproduction start timing of the audio adjusted by the data processing device.

Claims

An audio playback device for playing back audio represented by audio data received via a network,
A data processing device that adjusts the reproduction start timing of the audio indicated by the received audio data based on the amount of deviation from the ideal arrival time for each of the audio data received via the network;
An audio output device for reproducing the audio based on the reproduction start timing of the audio adjusted by the data processing device;
An audio reproducing apparatus having
The audio playback device according to claim 1,
A storage device;
The data processing device adjusts the reproduction start timing of the audio by processing the audio data based on the deviation amount to generate audio correction data,
The storage device temporarily holds the audio data and the audio correction data,
The audio output device reads the audio data or the audio correction data held in the storage device, and reproduces the audio indicated by the read audio data or the audio correction data.
The sound reproducing device according to claim 1 or 2,
The data processing device adjusts the reproduction start timing of the audio and corrects the reference time based on the deviation amount.
The audio playback device according to claim 3,
When the average value of the deviation amount or the value of the deviation amount with the highest occurrence frequency of the deviation amount is a corrected deviation amount,
The audio reproduction start timing is adjusted based on the correction deviation amount,
The audio reproduction device, wherein the reference time is corrected based on the correction deviation amount.
The audio playback device according to claim 4,
The data processing device includes:
If the amount of correction deviation is positive, adjust to extend the playback time of the sound, correct to delay the reference time,
An audio reproducing apparatus that adjusts the audio reproduction time to be shortened and corrects the reference time to be advanced when the correction deviation is negative.
The audio playback device according to any one of claims 1 to 5,
The reference time is set as an initial value of an arrival time at which the first audio data of the received audio data arrives.
The audio playback device according to any one of claims 1 to 6,
The data processing device includes:
An audio reproducing apparatus that gradually adjusts the audio reproduction start timing based on an audio reproduction time used for adjusting the audio reproduction start timing.
The audio playback device according to any one of claims 1 to 7,
The ideal playback time is an audio playback device that is scheduled based on a predetermined transmission interval of the audio data and a reference time serving as a reference when playing back the audio.
The sound reproducing device according to claim 8,
The data processing device includes:
Based on the transmission interval of the audio data and the maximum value indicated by the deviation amount information, which is information relating to the deviation amount, the time from the arrival of the audio data to the start of reproduction of the audio indicated by the audio data An audio playback device that adjusts the audio delay time.
The sound reproduction device according to any one of claims 5 to 9,
The data processing device includes:
An audio reproduction apparatus, wherein the audio expansion time or shortening time is set to 0.5% or less of an audio reproduction time used for adjusting the audio reproduction start timing.
An audio playback device for playing back audio represented by audio data received via a network,
When the data transmission rate of the network decreases from the current state, the audio data received via the network is adjusted to extend the playback time of the voice indicated by the network, and the data transmission rate of the network increases from the current state. A data processing device for adjusting so as to shorten the playback time of the voice indicated by the voice data received via the network;
An audio output device that reproduces the audio based on the reproduction time of the audio adjusted by the data processing device;
An audio reproducing apparatus having
The sound reproduction device according to any one of claims 1 to 11,
An image output device for displaying an image indicated by image data received via the network;
With
An image display device that receives audio data transmitted corresponding to the image data.
An image transmission device that transmits video as image data composed of continuous still image data, and transmits audio data corresponding to the image data separately from the image data;
The image display device according to claim 12, wherein the image display device reproduces the image and sound indicated by the image data and the audio data received from the image transmission device connected to the image transmission device via a network so as to be capable of data transmission.
An image reproduction system.
An audio reproduction method by an image display device that reproduces an image and audio indicated by image data and audio data received via a network,
Based on the amount of deviation from the ideal arrival time for each audio data received via the network, adjust the audio playback start timing indicated by the received audio data,
An audio reproduction method for reproducing the audio based on the reproduction start timing of the audio adjusted by the data processing device.