WO2016151852A1 - Audio reproduction device, image display device and audio reproduction method thereof - Google Patents

Audio reproduction device, image display device and audio reproduction method thereof

Info

Publication number
WO2016151852A1
WO2016151852A1 PCT/JP2015/059430 JP2015059430W WO2016151852A1 WO 2016151852 A1 WO2016151852 A1 WO 2016151852A1 JP 2015059430 W JP2015059430 W JP 2015059430W WO 2016151852 A1 WO2016151852 A1 WO 2016151852A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
data
time
image
reproduction
Prior art date
Application number
PCT/JP2015/059430
Other languages
French (fr)
Japanese (ja)
Inventor
栄作 石井
Original Assignee
Necディスプレイソリューションズ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Necディスプレイソリューションズ株式会社
Priority to PCT/JP2015/059430
Publication of WO2016151852A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L7/00 Arrangements for synchronising receiver with transmitter

Definitions

  • the present invention relates to an audio reproduction device, an image display device, and an audio reproduction method for reproducing audio indicated by audio data received via a network.
  • the packet communication described above is considered to be used for transmission and reception of various data.
  • a technology for transmitting moving image data from a computer or the like to a projector via a network and reproducing the moving image data received by the projector is put into practical use.
  • In one usage mode, an image display device such as a projector is installed near an image transmission device such as a computer (for example, within visual range), and the user operates the image and audio reproduced on the image display device from the image transmission device at hand. In that case, if the image display device accumulates several seconds or more of moving image data in its buffer before playing it back, it takes several seconds or longer for video or audio playback to start on the image display device, and operability from the image transmission device is significantly reduced.
  • video is transmitted as continuous still image data (image data), and only audio is transmitted as audio data separately from the image data, thereby improving the operability of the image transmission apparatus.
  • When image data and audio data are transmitted separately in this way, the transmission rate of the image data sent from the image transmission device can, for example, be changed according to the transmission speed of the network, which changes the image quality and frame rate reproduced by the image display device. This makes it possible to shorten the time until image reproduction (display) starts.
  • The audio delay time is the time from reception of audio data to the start of reproduction of the audio indicated by that audio data.
  • Patent Document 1 (Japanese Patent Application Laid-Open No. 2004-104701) describes a method for correcting the clock of the receiving-side device in order to prevent overflow and underflow of a buffer that stores moving image data, caused by the difference between the clocks on the transmission side and the reception side.
  • An object of the present invention is to provide an audio reproduction device, an image display device, and an audio reproduction method thereof that suppress the occurrence of audio interruption caused by delay fluctuation.
  • An audio reproduction device of the present invention reproduces audio indicated by audio data received via a network and comprises: a data processing device that adjusts the reproduction start timing of the audio indicated by the received audio data based on the amount of deviation from the ideal arrival time for each piece of audio data received via the network; and an audio output device that reproduces the audio based on the reproduction start timing adjusted by the data processing device.
  • an audio playback device that plays back audio represented by audio data received via a network
  • the audio data received via the network is adjusted to extend the playback time of the voice indicated by the network, and the data transmission rate of the network increases from the current state.
  • An audio output device that reproduces the audio based on the reproduction time of the audio adjusted by the data processing device;
  • An image display device of the present invention includes the above audio reproduction device and an image output device that displays an image indicated by image data received via the network, and receives the audio data transmitted in correspondence with the image data.
  • The audio reproduction method of the present invention is performed by an image display device that reproduces the image and audio indicated by image data and audio data received via a network. The method adjusts the reproduction start timing of the audio indicated by the received audio data based on the amount of deviation from the ideal arrival time for each piece of audio data received via the network, and reproduces the audio based on the adjusted reproduction start timing.
  • FIG. 1 is a block diagram illustrating a configuration example of an image reproduction system according to the present invention.
  • FIG. 2 is a block diagram showing a configuration example of the image display apparatus shown in FIG.
  • FIG. 3A is a schematic diagram illustrating an ideal transmission example of audio data in a network.
  • FIG. 3B is a schematic diagram illustrating an actual transmission example of audio data in the network.
  • FIG. 4A is a diagram showing an example of frequency distribution information created by the data processing apparatus shown in FIG. 2, and is a histogram showing an example in which the amount of deviation of each voice packet is distributed in a positive value.
  • FIG. 4B is a diagram illustrating an example of frequency distribution information created by the data processing device 12 illustrated in FIG. 2, and is a histogram illustrating an example in which the deviation amount of each voice packet is distributed in a negative value.
  • FIG. 5 is a schematic diagram showing an example of the relationship among the maximum audio duration T_B that the audio buffer shown in FIG. 2 can hold, the audio delay time T_P, and the audio reproduction time T_D corresponding to the amount of audio data remaining in the audio buffer.
  • FIG. 6A is a histogram showing an example of frequency distribution information when the network has a stable data transmission rate.
  • FIG. 6B is a histogram showing an example of frequency distribution information when the network 3 has an unstable data transmission rate.
  • FIG. 7 is a flowchart showing an example of the processing procedure of the image display apparatus of the present invention.
  • The image reproduction system of the present invention includes an image transmission device 2 that transmits image data and audio data, and an image display device 1 that reproduces the images and audio indicated by the image data and audio data received from the image transmission device 2. The image transmission device 2 and the image display device 1 are connected via a network 3.
  • the image transmission device 2 transmits video to be reproduced by the image display device 1 as image data composed of continuous still image data, and transmits audio data corresponding to the image data separately from the image data. Further, the image transmission device 2 transmits the audio data corresponding to the image data with priority over the image data. Further, the image transmission device 2 changes the transmission rate of the image data according to the transmission speed of the network, and changes the image quality and frame rate of the image to be reproduced by the image display device 1. Note that image data and audio data are transmitted as image packets and audio packets including the respective data.
  • The image data and the audio data may be associated with information such as an identifier indicating the image transmission device 2 that transmits the data, a time stamp corresponding to the timing when the data was captured, a time stamp corresponding to the timing when the data is to be reproduced or displayed, and a content name corresponding to the image data and audio data.
  • the image packet and the audio packet may include such information.
  • Audio related information, which is the information necessary to reproduce the audio indicated by the audio data, is transmitted. It includes the sampling frequency of the audio data, the number of sampling bits, the number of channels (monaural, stereo, etc.), the audio packet transmission interval (audio transmission unit time) T, and the like.
  • the voice packet transmission interval corresponds to an ideal voice packet arrival interval.
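  • As a minimal sketch of how the audio related information might be held on the receiving side, the following container derives the uncompressed payload size of one audio packet. The class name, field names, and the example values (48 kHz, 16 bit, stereo, T = 100 ms) are assumptions for illustration and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioRelatedInfo:
    """Hypothetical container for the audio related information."""
    sampling_hz: int       # sampling frequency of the audio data
    bits_per_sample: int   # number of sampling bits
    channels: int          # 1 = monaural, 2 = stereo
    unit_time_ms: int      # audio transmission unit time T

    def samples_per_packet(self) -> int:
        # Samples per channel carried by one packet covering T milliseconds.
        return self.sampling_hz * self.unit_time_ms // 1000

    def bytes_per_packet(self) -> int:
        # Uncompressed PCM payload size, assuming no header or padding.
        return self.samples_per_packet() * self.channels * self.bits_per_sample // 8


# Example values are assumptions, not values from the application.
info = AudioRelatedInfo(sampling_hz=48000, bits_per_sample=16, channels=2, unit_time_ms=100)
print(info.samples_per_packet(), info.bytes_per_packet())
```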
  • the image display device 1 receives the image data, sound data, and sound related information that the image transmission device 2 instructs to reproduce, and reproduces the image and sound indicated by the image data and sound data.
  • The image transmission device 2 can be realized by an information processing apparatus (computer) that includes a communication device that transmits image data and audio data via the network 3, a CPU (Central Processing Unit) that executes processing according to a program, and a memory that stores the data and programs processed by the CPU.
  • the communication device may have any known configuration regardless of the wired method or the wireless method as long as data can be transmitted via the network 3.
  • the network 3 is a known data transmission path including a network device (not shown) that relays packets transmitted and received between the image transmission device 2 and the image display device 1. Since the network 3 has a configuration in which a number of data transmission paths are formed by a number of network devices as is well known, the network 3 is shown in a cloud shape in FIG.
  • FIG. 2 is a block diagram illustrating a configuration example of the image display apparatus 1 illustrated in FIG.
  • The image display device 1 includes a communication device 11 that receives image data and audio data from the image transmission device 2 via the network 3; a data processing device 12 that performs required processing on the received image data and audio data; a storage device 13 that holds information generated by the data processing device 12, the received image data, and either the received voice data or the voice data processed by the data processing device 12 (voice correction data); an audio output device 14 that reproduces and outputs the audio indicated by the received audio data or audio correction data; and an image output device 15 that reproduces and displays the image indicated by the image data.
  • the audio reproduction device 4 of the present invention is configured to include the communication device 11, the data processing device 12, the storage device 13, and the audio output device 14 shown in FIG.
  • The image display apparatus 1 can be realized by a projector, a display, or a monitor capable of reproducing images and sound that provides the functions of the communication device 11, the data processing device 12, the storage device 13, the sound output device 14, and the image output device 15 shown in FIG. 2.
  • the functions of the data processing device 12 and the storage device 13 can be realized by an information processing device (computer) that includes a CPU (Central Processing Unit) that executes processing according to a program and a memory that stores data and programs processed by the CPU.
  • the communication device 11 includes a data receiving unit 111 that sequentially receives image data and audio data from the image transmission device 2 via the network 3 and outputs the received image data and audio data to the image / audio dividing unit 121.
  • the communication device 11 may have any known configuration regardless of the wired method or the wireless method as long as data can be transmitted via the network 3. Data transmission is performed using, for example, well-known packet communication.
  • the storage device 13 includes a video memory 131 that holds image data, an audio correction data storage memory 132 that holds information used for audio data correction, and an audio buffer 133 that holds audio data or audio correction data.
  • the data processing device 12 includes an image / audio dividing unit 121, an audio transmission processing unit 122, and an audio data processing unit 123.
  • the image / audio dividing unit 121 performs necessary processing such as decoding processing on the image data received from the communication device 11, and writes the processed image data in the video memory 131 included in the storage device 13.
  • the image / audio dividing unit 121 outputs the audio data and the audio related information received from the communication device 11 to the audio transmission processing unit 122.
  • The voice transmission processing unit 122 determines a reference time from the arrival time of a voice packet containing voice data received by the communication device 11. Then, using the voice related information, it detects for each voice packet the amount of deviation from the ideal arrival time, in other words the scheduled arrival time given by the reference time and the predetermined arrival interval, and generates deviation amount information, which is information relating to that deviation amount.
  • the shift amount information includes, for example, information indicating the distribution of the shift amount (frequency distribution information), an average value, a maximum value, a minimum value, or a value of the shift amount.
  • In the following, an example in which frequency distribution information is mainly used as the shift amount information will be described.
  • the audio transmission processing unit 122 writes the determined reference time in the audio correction data storage memory 132 included in the storage device 13. In addition, the voice transmission processing unit 122 writes the generated frequency distribution information in the voice correction data storage memory 132 included in the storage device 13 and outputs the voice data received from the communication device 11 to the voice data processing unit 123. Further, the voice transmission processing unit 122 writes the voice related information in the voice correction data storage memory 132 provided in the storage device 13.
  • The reference time is a reference value for the arrival time of the audio data (audio packets) received from the image transmission device 2 (not shown). The arrival time of the first audio data among the received audio data, that is, of the first audio packet received, is set as its initial value.
  • the audio data may be audio data corresponding to the received image data. In that case, the audio data and the image data are transmitted in association with each other.
  • the voice transmission processing unit 122 detects a deviation amount from the reference time for each voice data (voice packet), and starts generating deviation amount information.
  • The image display device 1 reproduces silent audio from the reference time until a predetermined time has elapsed.
  • audio playback may be started when a predetermined time has elapsed from the reference time.
  • The reference time serves as the reference for playing back sound. Since the audio packets arrive at the image display device 1 at a substantially constant cycle, the image display device 1 holds the audio data received in each cycle in the audio buffer 133 and sequentially reads and plays the audio data held in the audio buffer 133.
  • the audio data processing unit 123 adjusts the reproduction start timing of the audio indicated by the audio data based on the deviation amount of the audio data.
  • the sound reproduction start timing is adjusted by, for example, expanding or shortening the reproduced sound.
  • The audio data processing unit 123 reads the frequency distribution information from the audio correction data storage memory 132 and, based on the frequency distribution information, applies processing to the audio data that expands or shortens the reproduced audio over a predetermined adjustment time.
  • the processed audio data (audio correction data) is written in the audio buffer 133.
  • the audio output device 14 reproduces audio based on the audio reproduction start timing adjusted by the data processing device 12.
  • the audio output device 14 includes an audio output unit 141 that sequentially reads audio data or audio correction data from the audio buffer 133 and reproduces / outputs the audio.
  • the image output device 15 includes an image output unit 151 that sequentially reads image data from the video memory 131 and displays an image.
  • FIG. 3A is a schematic diagram showing an ideal transmission example of voice data in the network 3
  • FIG. 3B is a schematic diagram showing an actual transmission example of voice data in the network 3.
  • FIG. 4A is a diagram showing an example of frequency distribution information created by the data processing device 12 shown in FIG. 2, and is a histogram showing an example in which the amount of deviation of each voice packet is distributed in a positive value.
  • FIG. 4B is a diagram illustrating an example of frequency distribution information created by the data processing device 12 illustrated in FIG. 2, and is a histogram illustrating an example in which the deviation amount of each voice packet is distributed in a negative value.
  • FIG. 3A shows a state in which audio packets are ideally transmitted via the network 3, in which audio packets arrive at the image display device 1 at regular intervals (audio transmission unit time T).
  • When no problem has occurred in the image transmission device 2, the image display device 1, or the network 3, and voice packets are transmitted from the image transmission device 2 at regular intervals (voice transmission unit time T), a voice packet is expected to arrive at the image display device 1 every voice transmission unit time T, as shown in FIG. 3A. However, the actual voice packet arrival intervals vary, as indicated by T_0 to T_6 in FIG. 3B, for example because of unstable fluctuation of the data transmission time over the network 3 (delay fluctuation).
  • Equation (1) shows the amount of deviation ΔTn from the ideal arrival time of the nth voice packet.
  • ΔTn takes a positive value when the voice packet arrives later than ideal, and takes a negative value when the voice packet arrives earlier than ideal.
  • the actual voice packet arrival intervals T_0 to T_n can be detected using the arrival time of each voice packet.
  • the arrival time includes, for example, the timing when the voice packet arrives.
  • Time counting may be started with the arrival of the first voice packet at the image display device 1 as a base point (reference time), and the amount of deviation for each voice packet may be detected based on the timing at which each subsequent voice packet arrives, the base point (reference time), and the voice transmission unit time T.
  • Each packet includes information indicating which part of the data it carries. Therefore, the image display apparatus 1 can determine whether a received audio packet contains the first audio data, the last audio data, or audio data at some other position in the sequence. Further, the value of the ideal voice packet arrival interval (voice transmission unit time T) is notified to the image display device 1 from the image transmission device 2 in advance, so it is known to the image display device 1. In this case, in order to calculate ΔTn for each voice packet using the above equation (1), it is only necessary to know the arrival time of the voice packet that first arrived at the image display device 1.
  • the arrival time of the first voice data is set as the “reference time” among the voice data newly transmitted for playback transmitted from the image transmission device 2.
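  • The deviation calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the application's equation (1): it assumes the ideal arrival time of the nth packet is the reference time plus (n−1)T, with the first arrival taken as the reference time. The function name and the sample arrival times are hypothetical.

```python
def deviation_amounts(arrival_times_ms, unit_time_ms):
    """Return the deviation of each packet from its ideal arrival time.

    The first arrival is used as the reference time, and packet n (1-based)
    is assumed to be due at reference + (n - 1) * T.
    """
    if not arrival_times_ms:
        return []
    reference = arrival_times_ms[0]
    return [
        arrival - (reference + index * unit_time_ms)
        for index, arrival in enumerate(arrival_times_ms)
    ]


# Hypothetical arrival times in milliseconds with T = 100 ms.
arrivals = [1000, 1104, 1197, 1310, 1400]
print(deviation_amounts(arrivals, 100))  # [0, 4, -3, 10, 0]
```

  • Positive entries correspond to packets that arrived later than ideal and negative entries to packets that arrived earlier, matching the sign convention stated for ΔTn.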
  • the voice transmission processing unit 122 detects a deviation in the current reference time (initial value) based on the frequency distribution information indicating the distribution of ΔTn. For example, as shown in FIG.
  • When the image display apparatus 1 receives the nth voice packet, it stores the audio data contained in the packet in the audio buffer 133 and starts reproducing the voice represented by that audio data (n−1)T + T_P after the reference time.
  • The voice delay time T_P may be set in advance to a value satisfying T_P ≥ T + ΔTn_MAX in consideration of the maximum value (ΔTn_MAX) of the deviation amount ΔTn from the ideal arrival time of the voice packet due to delay fluctuation.
  • However, ΔTn_MAX of the network 3 is unknown, and the voice delay time T_P that the user can tolerate differs depending on the type of sound to be reproduced.
  • For conversation and the like, the voice delay time T_P should be as short as possible, whereas for BGM (background music) and the like a long T_P often causes no problem. Therefore, an adjustment mechanism for T_P may be provided in the image display apparatus 1 or the image transmission apparatus 2 so that the user can set T_P arbitrarily.
  • After the image display device 1 starts playing the voice, if the next voice packet arrives within T_P, the next audio data is already stored in the audio buffer 133 when reproduction of the previously received audio data is completed, so the audio is played back without interruption.
  • If the next voice packet arrives later than T_P, the next audio data is not yet stored in the audio buffer 133 when reproduction of the previously received audio data is completed. In that case, a silent state continues until the next audio data is received.
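  • The condition for a silent gap can be sketched as below. It assumes that packet n is scheduled to start playing at the reference time plus (n−1)T + T_P and arrives at the reference time plus (n−1)T + ΔTn, so its audio is late exactly when ΔTn exceeds T_P; decoding and buffering overhead are ignored. The function name and sample values are hypothetical.

```python
def playback_gaps(deviations_ms, delay_ms):
    """Return the 1-based numbers of packets that arrive after their playback slot."""
    return [
        number
        for number, deviation in enumerate(deviations_ms, start=1)
        if deviation > delay_ms
    ]


# With an audio delay time of 50 ms, only the packet that is 60 ms late causes silence.
print(playback_gaps([0, 4, -3, 60, 0], delay_ms=50))  # [4]
```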
  • When the voice packets as a whole tend to arrive later than the ideal arrival time, that is, when the current reference time is too early, a silent state in which the image display device 1 cannot reproduce voice is likely to occur.
  • When the voice packets as a whole tend to arrive earlier than the ideal arrival time, that is, when the current reference time is too late, no silent state occurs in the image display apparatus 1. In that case, adjusting the reference time to the correct time allows the voice delay time T_P to be set shorter than its current value. However, even when reproduction of the previously received audio data is completed, audio data corresponding to a time longer than T_P remains in the audio buffer 133, so the audio buffer 133 may overflow.
  • the data processing device 12 adjusts the playback start timing of the voice indicated by the voice data included in the received voice packet based on the shift amount ΔTn for each voice packet.
  • the voice transmission processing unit 122 detects the correction deviation amount based on deviation amount information, which is information on the deviation amount ΔTn for each voice packet, for example the frequency distribution information.
  • the audio data processing unit 123 adjusts the audio reproduction start timing based on the detected correction deviation amount.
  • When the average value of ΔTn over the voice packets is positive, the voice packets can be assumed to tend to arrive later than the ideal arrival time; when the average value is negative, they can be assumed to tend to arrive earlier than the ideal arrival time.
  • the frequency distribution information includes a frequency distribution in which ΔTn is classified for each predetermined range, as shown in FIG. 4A or 4B.
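  • A frequency distribution of this kind could be built as in the following sketch. The class width of 10 ms, the choice of the class lower bound as the dictionary key, and the function name are assumptions for illustration; the application does not specify them.

```python
from collections import Counter


def frequency_distribution(deviations_ms, class_width_ms):
    """Count deviation values per class; each key is the lower bound of a class."""
    counts = Counter(
        (deviation // class_width_ms) * class_width_ms for deviation in deviations_ms
    )
    return dict(sorted(counts.items()))


# Floor division places -3 ms in the class [-10, 0) and 25 ms in [20, 30).
print(frequency_distribution([0, 4, -3, 10, 12, 18, 25], 10))
```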
  • the generation of the deviation amount information is started when the reference time is set.
  • The deviation amount information may be generated for every predetermined period, for the period from the first reception of the audio data to the last reception, or for every period in which the sound reproduction start timing is adjusted.
  • The correction deviation amount is detected based on the deviation amount ΔTn for each voice packet. For example, the most frequently occurring deviation amount ΔTn is used as the correction deviation amount.
  • An average value of the deviation amounts ΔTn may also be used as the correction deviation amount.
  • The most frequently occurring deviation amount ΔTn and the average value of the deviation amounts ΔTn are examples of the correction deviation amount.
  • When ΔTn is classified into predetermined ranges, the correction deviation amount may be a value corresponding to the predetermined range, for example, the median value, maximum value, or minimum value of that range.
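  • Two of the options above can be sketched as follows: the median of the most frequent class, and the plain average. The dictionary layout (class lower bound mapped to count), the tie-breaking rule, and the function names are assumptions for illustration only.

```python
from statistics import mean


def correction_from_mode(distribution, class_width_ms):
    """Return the median of the most frequent class.

    `distribution` maps the lower bound of each class to its count.
    Ties are resolved in favour of the lowest class (an arbitrary choice).
    """
    lower_bound = max(sorted(distribution), key=lambda bound: distribution[bound])
    return lower_bound + class_width_ms / 2


def correction_from_mean(deviations_ms):
    """Return the average deviation as the correction deviation amount."""
    return mean(deviations_ms)


histogram = {-10: 1, 0: 2, 10: 3, 20: 1}    # hypothetical counts per 10 ms class
print(correction_from_mode(histogram, 10))  # 15.0
print(correction_from_mean([10, 20, 30]))
```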
  • audio data corresponding to a predetermined adjustment time may be used to extend or shorten the audio playback time.
  • the corrected deviation amount is used for the sound expansion time or shortening time.
  • Methods of extending or shortening the audio playback time include, for example, resampling the received audio data at a sampling frequency different from the one used when the audio data was sampled, or deleting part of the audio data or inserting silent data.
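  • As a rough illustration of the resampling approach, the sketch below stretches or shrinks a block of mono samples to a target length by linear interpolation. Linear interpolation is a stand-in chosen for brevity and is not stated in the application; a practical implementation would normally use a proper resampling filter. The function name is hypothetical.

```python
def resample_linear(samples, target_length):
    """Stretch or shrink a list of mono samples to target_length values."""
    if target_length <= 0 or not samples:
        return []
    if target_length == 1 or len(samples) == 1:
        return [float(samples[0])] * target_length
    # Map output index i onto a fractional position in the input block.
    step = (len(samples) - 1) / (target_length - 1)
    resampled = []
    for i in range(target_length):
        position = i * step
        left = min(int(position), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        fraction = position - left
        resampled.append(samples[left] * (1 - fraction) + samples[right] * fraction)
    return resampled


# Shortening a 4800-sample block by 0.5% yields 4776 samples.
shortened = resample_linear(list(range(4800)), 4776)
print(len(shortened))
```

  • Played back at the original sampling frequency, the shortened block lasts 0.5% less time, at the cost of a correspondingly small pitch shift.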
  • The expansion time or shortening time is set to about 0.5% of the playback time of the audio data used for adjustment so that the user of the video display device 1 does not find the playback audio unnatural.
  • The expansion time and shortening time may be any length at which the user of the video display device 1 does not find the playback sound unnatural, and may be set to 0.5% or less of the playback time of the sound data used for adjustment.
  • For example, suppose the reproduction time of the audio data included in each audio packet is 100 ms and the correction deviation amount is −50 ms (or +50 ms). At 0.5%, the adjustment time for absorbing the −50 ms (or +50 ms) is 10 seconds, which corresponds to 100 pieces of audio data. Therefore, the voice data processing unit 123 shortens each piece of audio data to 99.5 ms (or extends it to 100.5 ms).
  • the shortened (or expanded) sound data is stored in the sound buffer 133 as sound correction data.
  • the stored audio correction data is sequentially reproduced. By repeating this process 100 times, the reproduction start timing can be automatically adjusted without interruption of the sound.
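  • The arithmetic of this example can be checked with a short sketch. The function name and the `ratio_percent` parameter are hypothetical; the 0.5% limit, 100 ms packet length, and ±50 ms correction are the values given in the text.

```python
import math


def adjustment_plan(correction_ms, packet_ms, ratio_percent=0.5):
    """Return (packet_count, adjusted_packet_ms, adjustment_time_ms).

    Each packet is changed by at most ratio_percent of its playback time,
    and the correction is spread evenly over the packets needed.
    """
    max_step_ms = packet_ms * ratio_percent / 100
    packet_count = math.ceil(abs(correction_ms) / max_step_ms)
    if packet_count == 0:
        return 0, packet_ms, 0
    step_ms = correction_ms / packet_count
    return packet_count, packet_ms + step_ms, packet_count * packet_ms


print(adjustment_plan(-50, 100))  # (100, 99.5, 10000)
print(adjustment_plan(+50, 100))  # (100, 100.5, 10000)
```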
  • the data processing device 12 adjusts the playback start timing of the voice indicated by the voice data and corrects the reference time based on the deviation amount ΔTn for each voice packet.
  • the correction amount of the reference time and the adjustment amount of the reproduction start timing are detected based on the deviation amount ΔTn for each audio packet.
  • The voice transmission processing unit 122 corrects the reference time based on the correction deviation amount. For example, as described above, when the reproduction time of the audio data included in each audio packet is 100 ms and the correction deviation amount is −50 ms (or +50 ms), the adjustment time of 10 seconds for adjusting the audio reproduction timing corresponds to 100 pieces of audio data.
  • The audio data processing unit 123 shortens each piece of audio data to 99.5 ms (or expands it to 100.5 ms), and the audio transmission processing unit 122 shifts the reference time by −0.5 ms (or +0.5 ms).
  • the shortened (or expanded) sound data is stored in the sound buffer 133 as sound correction data.
  • the stored audio correction data is sequentially reproduced. By repeating this process 100 times, the sound reproduction start timing can be adjusted without interruption of the sound, and the reference time can be corrected by the correction amount of −50 ms (or +50 ms).
  • Adjusting the audio reproduction start timing based on the correction deviation amount includes adjusting it based on the correction amount of the reference time. Likewise, since the adjustment amount of the audio reproduction start timing is determined based on the correction deviation amount, adjusting the reference time based on the correction deviation amount includes adjusting it based on the adjustment amount of the audio reproduction start timing.
  • the sound reproduction start timing can be appropriately set by adjusting the sound reproduction time.
  • the data processing device 12 adjusts the playback time of the voice indicated by the voice data received via the network 3 to be extended in the adjustment period
  • the audio reproduction time indicated by the audio data received via the network 3 is adjusted to be shortened in the adjustment period.
  • the audio output device reproduces audio based on the audio reproduction time adjusted by the data processing device. Note that extending or shortening the audio reproduction time adjusts the audio reproduction start timing.
  • the correction amount of the reference time and the adjustment amount of the audio reproduction start timing are set to the same amount, and the process of gradually correcting or adjusting is performed.
  • the present invention is not limited to such processing.
  • the correction of the reference time may be performed at once (−50 ms or +50 ms). In this case, it is desirable to stop the shift amount detection process during the adjustment time of the audio reproduction start timing. In order to appropriately detect the shift amount during this adjustment period, it is necessary to take into account the elapsed time from the start of adjustment and the adjustment amount of the reproduction start timing.
  • If each voice packet tends to arrive later than the ideal arrival time (the correction deviation amount is positive), the current reference time is corrected to be later and the voice playback time is adjusted to be extended. If each voice packet tends to arrive earlier than the ideal arrival time (the correction deviation amount is negative), the current reference time is corrected to be earlier and the voice playback time is adjusted to be shortened.
  • By adjusting the audio playback time (that is, the audio playback start timing) in this way, the amount of audio data held in the audio buffer 133 becomes an appropriate value, and audio interruptions and the like can be suppressed. In addition, since overflow of the audio data can be suppressed, the capacity of the audio buffer 133 can be reduced. Furthermore, by adjusting the reference time, it is possible to readjust according to the state of the network even after adjusting the audio reproduction.
  • a threshold T_MAX is set in advance so that only ΔTn satisfying ΔTn ≤ T_MAX is used, and a voice packet that arrives extremely late compared to other voice packets is not used for calculation (detection) of the correction deviation amount.
  • the correction deviation amount may be calculated using only ΔTn.
  • the deviation amount information may be generated as information on ΔTn that satisfies ΔTn ≤ T_MAX. In that case, a more appropriate correction deviation amount can be calculated.
  • a predetermined threshold value T_TH is provided for the correction deviation amount, and when the absolute value of the correction deviation amount exceeds the threshold value T_TH, the correction of the reference time and the adjustment of the sound reproduction start timing (processing of the sound data) may be performed.
  • the adjustment of the sound reproduction start timing may be performed every predetermined period. In this case, the processing load on the data processing device 12 is reduced because the reference time and the playback start timing of the audio are not frequently changed.
  • the threshold value may be set to a different value depending on whether the correction deviation amount is positive or negative. In that case, it is possible to set whether or not to place importance on the delay time of the sound, or whether or not to place importance on the suppression of the occurrence of the sound interruption.
  • the frequency of ΔTn of each voice packet received thereafter is corrected to be distributed in the vicinity of zero (0).
  • the occurrence of voice interruptions is suppressed even without setting the audio delay time T_P to a value sufficiently larger than T + ΔTn_MAX. Therefore, it is not necessary to set the audio delay time T_P to an unnecessarily long time while suppressing the occurrence of audio interruption due to delay fluctuation.
  • the audio delay time T_P may be set so as to satisfy T_P ≥ T + ΔTn_MAX.
  • the value of ΔTn_MAX is the shift amount information.
  • FIG. 6A is a histogram showing an example of frequency distribution information when the network 3 has a stable data transmission rate
  • FIG. 6B is a histogram showing an example of frequency distribution information when the network 3 has an unstable data transmission rate. FIGS. 6A and 6B show examples of frequency distribution information after the reference time is corrected.
  • the frequency distribution of ΔTn of each voice packet is relatively narrow and ΔTn_MAX is a small value, as shown in FIG. 6A.
  • the frequency distribution of ΔTn of each voice packet is relatively wide and ΔTn_MAX is a large value, as shown in FIG. 6B.
  • the length of the frequency distribution in the time axis direction, that is, the maximum deviation amount ΔTn from the ideal arrival time of the voice packet (ΔTn_MAX: the maximum value in the positive region), is detected, and the audio delay time T_P is adjusted according to ΔTn_MAX.
  • the deviation amount information that is a frequency distribution may be generated as information on ΔTn that satisfies ΔTn ≤ T_MAX.
  • the same processing can be applied when using the average value and the maximum value of the deviation amounts as the deviation amount information.
  • the audio reproduction start timing may be adjusted based on the adjusted audio delay time T_P. For example, when the audio delay time T_P is shortened, the audio data for a predetermined adjustment time is shortened according to the amount of shortening; when the audio delay time T_P is lengthened, the audio data for a predetermined adjustment time is extended according to the amount of lengthening.
  • since both the correction deviation amount and the value of ΔTn_MAX can be obtained from the frequency distribution of ΔTn, the correction of the reference time and the adjustment of the audio delay time T_P may be performed simultaneously. In that case, the audio reproduction start timing may be adjusted by processing the audio data based on both the correction deviation amount and the adjusted audio delay time T_P.
  • the audio delay time T_P can be set to an optimum value according to the data transmission state of the network 3.
  • FIG. 7 is a flowchart showing an example of a processing procedure of the image display apparatus 1 of the present invention.
  • when the audio transmission processing unit 122 receives the audio related information from the image transmitting apparatus 2 via the data receiving unit 111 and the image audio dividing unit 121 at the start of reception of the image data and the audio data, it stores the audio related information in the voice correction data storage memory 132 (step A1).
  • the audio transmission processing unit 122 determines whether the audio data is the first audio data (step A3). In the case of the first voice data, the voice transmission processing unit 122 stores the arrival time of the voice packet including the voice data in the voice correction data storage memory 132 as the initial value of the reference time (step A4), the process proceeds to step A10, and the audio data is written to the audio buffer 133. If the received voice data is not the first voice data, the voice transmission processing unit 122 reads the reference time and the voice transmission unit time T included in the voice related information from the voice correction data storage memory 132, and calculates the deviation amount ΔTn of the voice data from the ideal arrival time (step A5).
  • the voice transmission processing unit 122 generates deviation amount information including the ΔTn, and calculates a correction deviation amount based on the deviation amount information (step A6).
  • the audio transmission processing unit 122 determines whether or not a predetermined adjustment condition is satisfied (step A7). For example, the audio transmission processing unit 122 determines whether or not the audio data needs to be processed (adjustment of the audio reproduction start timing) by comparing the correction deviation amount calculated in step A6 with a preset threshold value T_TH. If the correction deviation amount exceeds the threshold value T_TH, the voice transmission processing unit 122 determines that the voice data needs to be processed, updates the reference time based on the correction deviation amount, and instructs the audio data processing unit 123 to process the audio data.
  • the necessity of adjusting the sound reproduction start timing is not limited to being determined using the threshold value T_TH.
  • the audio transmission processing unit 122 may determine whether or not a predetermined time has elapsed since the previously executed adjustment processing of the audio reproduction start timing.
  • the audio data processing unit 123 processes the audio data so as to expand the audio to be reproduced when the correction deviation amount is a positive value. If the correction deviation amount is a negative value, the audio data processing unit 123 processes the audio data so as to shorten the audio to be reproduced (step A8). The audio data processing unit 123 writes the processed audio data (audio correction data) in the audio buffer 133 (step A9).
  • when, in step A7, the correction amount of the reference time does not exceed the threshold value T_TH, the voice transmission processing unit 122 determines that processing of the audio data is unnecessary, and the audio data processing unit 123 writes the audio data received in step A2 to the audio buffer 133 without processing it (step A9).
  • the voice transmission processing unit 122 determines whether or not the voice data received in step A2 is the last voice data (step A10). If the voice data is not the last voice data, the process returns to step A2. The next sound data is received from the image sound dividing unit 121, and the processing from step A2 to step A10 is repeated. If the audio data received in step A2 is the last audio data, the process is terminated.
  • the flowchart shown in FIG. 7 shows a processing example in which the reference time is corrected and the audio reproduction start timing is adjusted based on the correction deviation amount.
  • the adjusted value of the audio delay time T_P may be determined from the deviation amount information and the correction deviation amount obtained in the process of step A6.
  • the amount of deviation ΔTn from the ideal arrival time for each voice packet is calculated, and the reference time is corrected from the frequency distribution of ΔTn, so that ΔTn of each voice packet received thereafter is calculated.
  • the audio delay time T_P can be set appropriately according to the data transmission state of the network 3.
  • the delay fluctuation for example, if audio data of about 1 second to several seconds is accumulated in the audio buffer 133 and audio reproduction is started, audio interruption can be suppressed.
  • the operability of the image transmission device 2 is degraded.
  • the audio data and image data are reproduced (displayed) correspondingly, the audio is reproduced with a significant delay from the image.
  • the playback start timing of the image data and the audio data corresponding to the image data are shifted, the viewer will feel uncomfortable with the playback audio.
  • when the audio delay time T_P is set based on the frequency distribution of ΔTn of each voice packet as in the present invention, the audio delay time T_P can be kept short, and such discomfort can be reduced.
  • the arrival time of the voice packet is delayed and the distribution of the deviation amount is shifted in the positive direction.
  • the audio data held in the audio buffer 133 is reduced and the data transmission rate is further deteriorated, the audio data may be lost.
  • the reproduction time of the voice indicated by the arrived voice data (voice corresponding to the voice packet) is extended.
  • the arrival time of the voice packet is advanced, and the deviation distribution is shifted in the negative direction. In this case, the audio data stored in the audio buffer increases.
  • the playback time of the voice corresponding to the voice packet that has arrived is shortened.
  • the amount of audio data held in the audio buffer 133 becomes an appropriate value, and audio interruptions or audio data overflows can be suppressed.
  • the sound reproduction time is adjusted and the reference time is corrected, the distribution of the deviation amount after the adjustment processing is close to zero, and appropriate sound reproduction is performed. Further, even when the data transmission speed of the network 3 changes after the adjustment process, the same process can be performed, and appropriate sound reproduction can be continued.
  • [Appendix 3] The audio reproduction device according to appendix 1 or 2, wherein the data processing device adjusts the reproduction start timing of the audio and corrects the reference time based on the deviation amount.
  • [Appendix 4] The audio reproduction device according to appendix 3, wherein, when the average value of the deviation amounts or the value of the deviation amount with the highest frequency of occurrence is taken as a correction deviation amount, the audio reproduction start timing is adjusted based on the correction deviation amount and the reference time is corrected based on the correction deviation amount.
  • the audio reproduction device according to appendix 4, wherein the data processing device, when the correction deviation amount is positive, adjusts the audio reproduction time to be extended and corrects the reference time to be delayed, and, when the correction deviation amount is negative, adjusts the audio reproduction time to be shortened and corrects the reference time to be advanced.
  • the audio playback device according to any one of appendices 1 to 5, The reference time is set as an initial value of an arrival time at which the first audio data of the received audio data arrives.
  • the audio playback device according to any one of appendices 1 to 6,
  • the data processing device includes: An audio reproducing apparatus that gradually adjusts the audio reproduction start timing based on an audio reproduction time used for adjusting the audio reproduction start timing.
  • the audio playback device according to any one of appendices 1 to 7,
  • the ideal playback time is an audio playback device that is scheduled based on a predetermined transmission interval of the audio data and a reference time serving as a reference when playing back the audio.
  • the audio playback device according to appendix 8, wherein The data processing device includes: Based on the transmission interval of the audio data and the maximum value indicated by the deviation amount information, which is information relating to the deviation amount, the time from the arrival of the audio data to the start of reproduction of the audio indicated by the audio data An audio playback device that adjusts the audio delay time.
  • the audio playback device according to any one of appendices 5 to 9, The data processing device includes: An audio reproduction apparatus, wherein the audio expansion time or shortening time is set to 0.5% or less of an audio reproduction time used for adjusting the audio reproduction start timing.
  • a data processing device for adjusting so as to shorten the playback time of the voice indicated by the voice data received via the network;
  • An audio output device that reproduces the audio based on the reproduction time of the audio adjusted by the data processing device;
  • An audio reproducing apparatus having the above. [Appendix 12] An image display device comprising: the audio reproduction device according to any one of appendices 1 to 11; and an image output device that displays an image indicated by image data received via the network, wherein the image display device receives audio data transmitted in correspondence with the image data.
  • An image transmission device that transmits video as image data composed of continuous still image data, and transmits audio data corresponding to the image data separately from the image data;
  • the image display device according to appendix 12, which is connected to the image transmission device via a network so as to be capable of data transmission, and which reproduces the image and audio indicated by the image data and the audio data received from the image transmission device.
  • An image reproduction system.
  • [Appendix 14] An audio reproduction method performed by an image display device that reproduces an image and audio indicated by image data and audio data received via a network, the method comprising: adjusting the reproduction start timing of the audio indicated by the received audio data, based on the amount of deviation from the ideal arrival time for each piece of audio data received via the network; and reproducing the audio based on the reproduction start timing of the audio adjusted by the data processing device.
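The calculation of the correction deviation amount, the threshold decision, and the audio delay time described in the items above might be sketched as follows. This is only an illustrative sketch: the function names, the use of milliseconds, the sample values, and the choice of the average (rather than the most frequent value) as the correction deviation amount are assumptions, as is the use of "less than or equal to" when comparing against T_MAX.

```python
from statistics import mean


def usable_deviations(deviations_ms, t_max_ms):
    """Drop packets that arrived extremely late (deviation above T_MAX)."""
    return [d for d in deviations_ms if d <= t_max_ms]


def correction_deviation(deviations_ms, t_max_ms):
    """Correction deviation amount, taken here as the average deviation."""
    usable = usable_deviations(deviations_ms, t_max_ms)
    return mean(usable) if usable else 0


def needs_adjustment(correction_ms, t_th_ms):
    """True when the absolute correction deviation amount exceeds T_TH."""
    return abs(correction_ms) > t_th_ms


def audio_delay_time(interval_ms, deviations_ms, t_max_ms):
    """Smallest T_P that satisfies T_P >= T + dTn_MAX."""
    usable = usable_deviations(deviations_ms, t_max_ms)
    # dTn_MAX is the largest deviation in the positive region (0 if none).
    d_max = max((d for d in usable if d > 0), default=0)
    return interval_ms + d_max


# Hypothetical sample: deviations in milliseconds, with one extreme outlier.
deviations = [10, 20, 30, 500]
correction = correction_deviation(deviations, t_max_ms=200)
reference_time_ms = 1_000
if needs_adjustment(correction, t_th_ms=15):
    # Positive correction: packets tend to arrive late, so delay the reference time.
    reference_time_ms += correction
delay = audio_delay_time(20, deviations, t_max_ms=200)
```

In this sample the 500 ms outlier is excluded by T_MAX, so it influences neither the correction deviation amount nor the delay time.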

Abstract

An audio reproduction device for reproducing audio represented by audio data received via a network comprises: a data processing device that adjusts, on the basis of an amount of deviation from an ideal arrival time for each piece of aforementioned audio data received via the network, a reproduction start timing of the audio represented by the received audio data; and an audio output device that reproduces the audio on the basis of the reproduction start timing of the audio as adjusted by the data processing device.

Description

Audio reproduction apparatus, image display apparatus and audio reproduction method therefor
The present invention relates to an audio reproduction device that reproduces audio indicated by audio data received via a network, an image display device, and an audio reproduction method thereof.
With recent advances in communication technology, it has become possible to transmit and receive moving image data, including image data and audio data, via networks that use packet communication, such as the Internet, intranets, and wireless LANs. Because the data transmission rate of a network changes from moment to moment according to line congestion and the communication environment, a device that reproduces moving image data has a buffer for accumulating the received moving image data so that video and audio are not interrupted during reproduction. This buffer normally accumulates enough moving image data to reproduce several seconds or more of video.
The use of the packet communication described above for transmitting and receiving various kinds of data is being studied. For example, a technology has been put into practical use in which moving image data is transmitted from a computer or the like to a projector via a network and the projector reproduces the received moving image data. In such an image reproduction system, one conceivable usage is to install an image display device such as a projector near an image transmission device such as a computer (for example, within visible distance) and to use the image transmission device at hand to control the video and audio reproduced by the image display device.
In that case, if the buffer of the image display device accumulates several seconds or more of moving image data before reproduction, it takes several seconds or more before the image display device starts reproducing video and audio, and the operability of the image transmission device is significantly reduced.
Therefore, some image reproduction systems of this kind improve the operability of the image transmission device by transmitting video as continuous still image data (image data) and transmitting only the audio, separately from the image data, as audio data. When image data and audio data are transmitted separately in this way, the time until image reproduction (display) starts can be shortened by, for example, changing the transmission rate of the image data sent from the image transmission device according to the transmission rate of the network and thereby changing the image quality and frame rate reproduced by the image display device. In addition, if audio data is transmitted with priority over image data, the time from reception of audio data to the start of reproduction of the audio indicated by that audio data (hereinafter referred to as the audio delay time) can also be shortened.
However, as described above, the data transmission rate of a network changes from moment to moment, so the data transmission time contains an unstable fluctuation (delay fluctuation) of about several tens to several hundreds of milliseconds. Consequently, even if audio data is transmitted with priority over image data, audio interruptions become likely when the audio delay time is shortened to about several hundred milliseconds.
On the other hand, if, in consideration of the delay fluctuation, about one to several seconds of audio data is accumulated in the buffer before audio reproduction starts, audio interruption is suppressed. In that case, however, the audio delay time becomes long and the operability of the image transmission device is degraded.
Therefore, it is desirable for the image display device to suppress audio interruptions caused by the delay fluctuation without setting the audio delay time longer than necessary.
Although it is not intended to suppress interruption of the reproduced audio or to shorten the audio delay time, Patent Document 1 (Japanese Patent Application Laid-Open No. 2004-104701), for example, describes a technique for correcting the clock of a receiving-side device in order to prevent overflow and underflow of a buffer that accumulates moving image data, caused by the difference between the clocks of the transmitting side and the receiving side.
JP 2004-104701 A
An object of the present invention is to provide an audio reproduction device, an image display device, and an audio reproduction method thereof that suppress the occurrence of audio interruption caused by delay fluctuation.
In order to achieve the above object, an audio reproduction device of the present invention is an audio reproduction device that reproduces audio indicated by audio data received via a network, and has:
a data processing device that adjusts the reproduction start timing of the audio indicated by the received audio data, based on the amount of deviation from the ideal arrival time for each piece of audio data received via the network; and
an audio output device that reproduces the audio based on the reproduction start timing of the audio adjusted by the data processing device.
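As a minimal sketch of the deviation amount used by the data processing device, the following Python code assumes that the ideal arrival time of the n-th audio packet is the reference time plus n times the audio packet transmission interval T, and that the arrival time of the first packet is the initial reference time. The class and method names are hypothetical and are not taken from this specification.

```python
from dataclasses import dataclass


@dataclass
class DeviationTracker:
    """Tracks how far audio packets arrive from their ideal arrival times.

    Assumption: the ideal arrival time of packet n is
    reference_time + n * interval.
    """
    interval: float              # audio packet transmission interval T, in seconds
    reference_time: float = 0.0  # set from the first packet's arrival time
    count: int = -1              # index of the most recently received packet

    def on_packet(self, arrival_time: float) -> float:
        """Return the deviation in seconds; a positive value means a late packet."""
        self.count += 1
        if self.count == 0:
            # The first packet's arrival time becomes the initial reference time.
            self.reference_time = arrival_time
            return 0.0
        ideal = self.reference_time + self.count * self.interval
        return arrival_time - ideal


tracker = DeviationTracker(interval=0.02)
first = tracker.on_packet(10.0)     # first packet defines the reference time
late = tracker.on_packet(10.025)    # about 5 ms later than ideal
early = tracker.on_packet(10.035)   # about 5 ms earlier than ideal
```

The sign convention (positive for late arrival) matches the description, in which a positive correction deviation leads to a later reference time and a longer reproduction time.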
Alternatively, the audio reproduction device is an audio reproduction device that reproduces audio indicated by audio data received via a network, and has:
a data processing device that, when the data transmission rate of the network falls below its current state, adjusts the reproduction time of the audio indicated by the audio data received via the network so as to extend it and, when the data transmission rate of the network rises above its current state, adjusts the reproduction time of that audio so as to shorten it; and
an audio output device that reproduces the audio based on the reproduction time of the audio adjusted by the data processing device.
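The extension and shortening of the reproduction time might be sketched as follows. This specification does not state here how the audio data is processed, so the linear-interpolation resampling below is only an assumption chosen for brevity (it also shifts the pitch slightly). The function names are hypothetical, and the 0.5% cap on the extension or shortening time follows the upper limit mentioned in the appendices.

```python
def stretch_ratio(correction_ms, period_ms, max_fraction=0.005):
    """Ratio by which period_ms of audio is stretched to absorb correction_ms.

    The applied change is capped at max_fraction of the adjustment period.
    A ratio above 1 extends the reproduction time; below 1 shortens it.
    """
    limit = period_ms * max_fraction
    applied = max(-limit, min(limit, correction_ms))
    return 1.0 + applied / period_ms


def stretch(samples, ratio):
    """Resample samples by linear interpolation to about len(samples) * ratio."""
    if not samples:
        return []
    n_out = max(1, round(len(samples) * ratio))
    if n_out == 1 or len(samples) == 1:
        return [samples[0]] * n_out
    step = (len(samples) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# A +50 ms correction spread over 10 s of audio stays within the 0.5% cap.
ratio_up = stretch_ratio(50, 10_000)
# A -50 ms correction over only 5 s is capped at -25 ms for this period.
ratio_down = stretch_ratio(-50, 5_000)
longer = stretch([0.0, 1.0, 2.0, 3.0], 1.5)
shorter = stretch([0.0, 1.0, 2.0, 3.0, 4.0], 0.6)
```

With this cap, a correction that exceeds 0.5% of one adjustment period is applied only partially, and the remainder would have to be absorbed in later periods.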
An image display device of the present invention includes:
the audio reproduction device described above; and
an image output device that displays an image indicated by image data received via the network,
and is configured to receive audio data transmitted in correspondence with the image data.
Meanwhile, an audio reproduction method of the present invention is an audio reproduction method performed by an image display device that reproduces an image and audio indicated by image data and audio data received via a network, the method including:
adjusting the reproduction start timing of the audio indicated by the received audio data, based on the amount of deviation from the ideal arrival time for each piece of audio data received via the network; and
reproducing the audio based on the reproduction start timing of the audio adjusted by the data processing device.
According to the present invention, it is possible to suppress the occurrence of audio interruption caused by delay fluctuation.
FIG. 1 is a block diagram showing a configuration example of the image reproduction system of the present invention.
FIG. 2 is a block diagram showing a configuration example of the image display device shown in FIG. 1.
FIG. 3A is a schematic diagram showing an example of ideal transmission of audio data in a network.
FIG. 3B is a schematic diagram showing an example of actual transmission of audio data in a network.
FIG. 4A is a diagram showing an example of the frequency distribution information created by the data processing device shown in FIG. 2, and is a histogram showing an example in which the deviation amounts of the audio packets are distributed mostly over positive values.
FIG. 4B is a diagram showing an example of the frequency distribution information created by the data processing device 12 shown in FIG. 2, and is a histogram showing an example in which the deviation amounts of the audio packets are distributed mostly over negative values.
FIG. 5 is a schematic diagram showing an example of the relationship among the maximum audio reproduction time T_B that can be stored in the audio buffer shown in FIG. 2, the audio delay time T_P, and the audio reproduction time T_D corresponding to the amount of audio data remaining in the audio buffer.
FIG. 6A is a histogram showing an example of frequency distribution information when the network has a stable data transmission rate.
FIG. 6B is a histogram showing an example of frequency distribution information when the network 3 has an unstable data transmission rate.
FIG. 7 is a flowchart showing an example of the processing procedure of the image display device of the present invention.
Next, the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing a configuration example of the image reproduction system of the present invention.
As shown in FIG. 1, the image reproduction system of the present invention has an image transmission device 2 that transmits image data and audio data, and an image display device 1 that reproduces the image and audio indicated by the image data and audio data received from the image transmission device 2; the image transmission device 2 and the image display device 1 are connected via a network 3.
The image transmission device 2 transmits the video to be reproduced by the image display device 1 as image data composed of continuous still image data, and transmits the audio data corresponding to the image data separately from the image data. The image transmission device 2 also transmits the audio data corresponding to image data with priority over that image data. Further, the image transmission device 2 changes the transmission rate of the image data according to the transmission rate of the network, thereby changing the image quality and frame rate of the image reproduced by the image display device 1. The image data and audio data are transmitted as image packets and audio packets containing the respective data. The image data and audio data may be associated with information such as an identifier indicating the image transmission device 2 that transmits the data, a time stamp corresponding to the timing at which the data was captured, a time stamp corresponding to the timing at which the data is to be reproduced or displayed, and/or a content name corresponding to the image data and audio data. The image packets and audio packets may contain this information. Furthermore, when the image transmission device 2 starts transmitting image data and the corresponding audio data, it transmits audio related information, which is the information necessary for reproducing the audio indicated by the audio data and includes the sampling frequency of the audio data, the number of sampling bits, the number of channels (monaural, stereo, etc.), the audio packet transmission interval (audio transmission unit time) T, and the like. The audio packet transmission interval corresponds to the ideal arrival interval of the audio packets.
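To illustrate how the items of the audio related information relate to one another, the following sketch computes the payload size of one audio packet. It assumes uncompressed PCM audio, which this specification does not state, and the field names and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioRelatedInfo:
    sampling_hz: int      # sampling frequency of the audio data
    bits_per_sample: int  # number of sampling bits
    channels: int         # 1 = monaural, 2 = stereo
    interval_ms: int      # audio packet transmission interval T

    def samples_per_packet(self) -> int:
        """Samples per channel that cover one transmission interval T."""
        return self.sampling_hz * self.interval_ms // 1000

    def bytes_per_packet(self) -> int:
        """Payload size of one packet, assuming uncompressed PCM."""
        return self.samples_per_packet() * self.channels * self.bits_per_sample // 8


# Hypothetical values: 48 kHz, 16-bit, stereo, one packet every 20 ms.
info = AudioRelatedInfo(sampling_hz=48_000, bits_per_sample=16, channels=2, interval_ms=20)
payload = info.bytes_per_packet()
```

With these example values each packet carries 960 samples per channel, or 3840 bytes of payload.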
The image display device 1 receives the image data, audio data, and audio related information whose reproduction is instructed by the image transmission device 2, and reproduces the image and audio indicated by the image data and audio data.
The image transmission device 2 can be realized by an information processing device (computer) that includes a communication device that transmits image data and audio data via the network 3, a CPU (Central Processing Unit) that executes processing according to a program, and a memory that stores the programs and data processed by the CPU. The communication device may have any well-known configuration, wired or wireless, as long as it can transmit data via the network 3.
The network 3 is a well-known data transmission path that includes network devices (not shown) that relay the packets transmitted and received between the image transmission device 2 and the image display device 1. Because, as is well known, the network 3 consists of many data transmission paths formed by many network devices, it is drawn as a cloud in FIG. 1.
FIG. 2 is a block diagram showing a configuration example of the image display device 1 shown in FIG. 1.
As shown in FIG. 2, the image display device 1 includes: a communication device 11 that receives image data and audio data from the image transmission device 2 via the network 3; a data processing device 12 that performs the required processing on the received image data and audio data; a storage device 13 that holds the information generated by the data processing device 12, the received image data, and either the received audio data or the audio data processed by the data processing device 12 (audio correction data); an audio output device 14 that reproduces and outputs the audio indicated by the received audio data or the audio correction data; and an image output device 15 that reproduces and displays the image indicated by the image data.
The audio reproduction device 4 of the present invention comprises the communication device 11, the data processing device 12, the storage device 13, and the audio output device 14 shown in FIG. 2.
The image display device 1 can be realized by a projector, display, monitor, or the like that is capable of reproducing images and audio and has the functions of the communication device 11, the data processing device 12, the storage device 13, the audio output device 14, and the image output device 15 shown in FIG. 2. The functions of the data processing device 12 and the storage device 13 can be realized by an information processing apparatus (computer) that includes a CPU (Central Processing Unit) that executes processing according to a program and a memory that stores the data and programs processed by the CPU.
The communication device 11 includes a data receiving unit 111 that sequentially receives image data and audio data from the image transmission device 2 via the network 3 and outputs the received image data and audio data to the image/audio dividing unit 121. The communication device 11 may have any known configuration, wired or wireless, as long as it can transmit data via the network 3. Data is transmitted using, for example, well-known packet communication.
The storage device 13 includes a video memory 131 that holds image data, an audio correction data storage memory 132 that holds information used for audio data correction, and an audio buffer 133 that holds audio data or audio correction data.
The data processing device 12 includes an image/audio dividing unit 121, an audio transmission processing unit 122, and an audio data processing unit 123.
The image/audio dividing unit 121 performs required processing, such as decoding, on the image data received from the communication device 11 and writes the processed image data to the video memory 131 of the storage device 13. It also outputs the audio data and audio-related information received from the communication device 11 to the audio transmission processing unit 122.
The audio transmission processing unit 122 determines a reference time from the arrival time of an audio packet containing audio data received by the communication device 11. Based on the audio-related information, it also detects, for each audio packet, the amount of deviation from the ideal arrival time, in other words, from the arrival time expected from the reference time and the predetermined arrival interval, and generates deviation amount information, which is information about these deviation amounts. The deviation amount information includes, for example, information indicating the distribution of the deviation amounts (frequency distribution information), the mean, maximum, or minimum of the deviation amounts, or the deviation amount values themselves. The description below mainly uses frequency distribution information as the deviation amount information. The audio transmission processing unit 122 writes the determined reference time to the audio correction data storage memory 132 of the storage device 13. It also writes the generated frequency distribution information to the audio correction data storage memory 132 and outputs the audio data received from the communication device 11 to the audio data processing unit 123. In addition, it writes the audio-related information to the audio correction data storage memory 132.
The reference time is the reference value for the arrival times of the audio data (audio packets) received from the image transmission device 2 (not shown). Its initial value is set to the arrival time of the first audio data (the audio packet containing the first audio data) among the received audio data (audio packets). The audio data may be audio data corresponding to received image data, in which case the audio data and the image data are transmitted in association with each other.
Once the reference time is set, the audio transmission processing unit 122 detects, for each audio data item (audio packet), the amount of deviation relative to the reference time and starts generating the deviation amount information.
When the audio packet containing the first audio data arrives, the image display device 1 starts reproducing silent audio, which continues from the reference time until a predetermined time has elapsed. Alternatively, audio reproduction may be started once a predetermined time has elapsed from the reference time. In other words, the reference time serves as the reference for audio reproduction. Because audio packets arrive at the image display device 1 at a substantially constant period, the image display device 1 stores the audio data received in each period in the audio buffer 133, and sequentially reads out and reproduces the audio data held in the audio buffer 133.
The audio data processing unit 123 adjusts the reproduction start timing of the audio indicated by the audio data based on the deviation amounts of the audio data. The reproduction start timing is adjusted by, for example, extending or shortening the reproduced audio. Specifically, the audio data processing unit 123 reads the frequency distribution information from the audio correction data storage memory 132, processes the audio data so that the audio reproduced over a predetermined adjustment time is extended or shortened based on that information, and writes the processed audio data (audio correction data) to the audio buffer 133.
The audio output device 14 reproduces audio based on the audio reproduction start timing adjusted by the data processing device 12. Specifically, the audio output device 14 includes an audio output unit 141 that sequentially reads the audio data or audio correction data from the audio buffer 133 and reproduces and outputs the audio.
The image output device 15 includes an image output unit 151 that sequentially reads image data from the video memory 131 and displays an image.
Next, the operation of the image display device 1 shown in FIG. 2 with this configuration will be described with reference to the drawings.
FIG. 3A is a schematic diagram showing an example of ideal transmission of audio data over the network 3, and FIG. 3B is a schematic diagram showing an example of actual transmission of audio data over the network 3. FIG. 4A shows an example of the frequency distribution information created by the data processing device 12 shown in FIG. 2, as a histogram in which the deviation amounts of the audio packets are distributed mostly over positive values. FIG. 4B shows another example of the frequency distribution information created by the data processing device 12, as a histogram in which the deviation amounts of the audio packets are distributed mostly over negative values.
FIG. 3A shows the state in which audio packets are transmitted ideally via the network 3, arriving at the image display device 1 at a constant interval (the audio transmission unit time T). In this case the arrival interval is, for example, T = 100 ms (milliseconds). That is, the ideal arrival interval is predetermined as the audio transmission unit time T.
When no problem has occurred in the image transmission device 2, the image display device 1, or the network 3, and the image transmission device 2 transmits audio packets at a constant interval (the audio transmission unit time T), an audio packet is expected to arrive at the image display device 1 every audio transmission unit time T, as shown in FIG. 3A. In practice, however, the arrival intervals of the audio packets vary, as indicated for example by T_0 to T_6 in FIG. 3B, because of unstable fluctuation in the data transmission time of the network 3 (delay jitter).
The deviation of an audio packet's actual arrival time (the time at which reception completes) from its ideal arrival time (the time at which reception would ideally complete) can be expressed by equation (1) below.
[Equation (1): image JPOXMLDOC01-appb-M000001 in the original publication]
Equation (1) gives the deviation ΔTn of the n-th audio packet from its ideal arrival time. ΔTn is positive when the audio packet arrives later than ideal and negative when it arrives earlier than ideal.
The actual audio packet arrival intervals T_0 to T_n can be detected from the arrival time of each audio packet. The arrival time includes, for example, the timing at which the audio packet arrives. That is, time counting starts from the arrival of the first audio packet at the image display device 1, which serves as the base point (reference time), and the deviation of each audio packet is detected from the timing at which that packet arrives, the base point (reference time), and the audio transmission unit time T.
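Equation (1) survives in this text only as an image reference. One form that is consistent with the surrounding description, but is an assumption rather than the published formula, writes the deviation in terms of arrival times. Here t_n is a hypothetical symbol for the actual arrival time of the n-th audio packet, and t_1, the arrival time of the first packet, is the initial reference time:

```latex
\Delta T_n = t_n - \bigl(t_1 + (n-1)\,T\bigr), \qquad n = 1, 2, \dots
```

Under this reading ΔT_1 = 0 by construction, and ΔT_n is positive for a packet that arrives later than its ideal time, as the text states.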
Normally, each packet contains information indicating which part of the overall data its payload represents. The image display device 1 can therefore determine whether a received audio packet contains the first audio data, the last audio data, or the audio data at some given position in the sequence. The value of the ideal audio packet arrival interval (the audio transmission unit time T) is known to the image display device 1 because the image transmission device 2 notifies it in advance. In that case, to calculate ΔTn for each audio packet using equation (1), it is sufficient to know the arrival time of the audio packet that first arrived at the image display device 1.
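The deviation calculation described above can be sketched as follows. This is a minimal illustration, not the implementation from the publication: the function name and the millisecond arrival-time list are hypothetical, and the first arrival is taken as the reference time, as in the text.

```python
def arrival_deviations(arrival_times_ms, unit_time_ms):
    """Return the deviation of each packet from its ideal arrival time.

    The first arrival is used as the reference time (assumption taken from
    the text), so the ideal arrival of the packet at index n (0-based) is
    reference + n * unit_time_ms.
    """
    reference = arrival_times_ms[0]
    return [
        t - (reference + n * unit_time_ms)
        for n, t in enumerate(arrival_times_ms)
    ]


# Positive values mean the packet arrived late, negative values early.
deviations = arrival_deviations([0, 110, 195, 300], unit_time_ms=100)
```
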
As described above, in the present invention the arrival time of the first audio data (audio packet), among the audio data transmitted from the image transmission device 2 whose reproduction has been newly instructed, is set as the "reference time". However, because the actual arrival intervals of audio packets vary with the delay jitter of the network 3, the first audio packet is itself likely to deviate from its ideal arrival time. In the present invention, therefore, the audio transmission processing unit 122 detects the error in the current reference time (initial value) based on the frequency distribution information indicating the distribution of ΔTn.
For example, as shown in FIG. 4A, when the ΔTn values of the audio packets are distributed mostly over positive values, that is, when the audio packets tend to arrive later than their ideal arrival times, it can be judged that the current reference time is early (the first audio packet arrived earlier than ideal).
Conversely, as shown in FIG. 4B, when the ΔTn values of the audio packets are distributed mostly over negative values, that is, when the audio packets tend to arrive earlier than their ideal arrival times, it can be judged that the current reference time is late (the first audio packet arrived later than ideal).
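The judgement from the histogram can be sketched as below. The bin width, the simple majority rule, and the "ok" outcome for a balanced histogram are assumptions made for illustration; the text only says that a mostly positive distribution indicates an early reference time and a mostly negative one indicates a late reference time.

```python
import math
from collections import Counter


def deviation_histogram(deviations_ms, bin_width_ms=10):
    """Count deviations per bin; bin k is centred on k * bin_width_ms."""
    return Counter(math.floor(d / bin_width_ms + 0.5) for d in deviations_ms)


def judge_reference_time(histogram):
    """Return 'early', 'late', or 'ok' for the current reference time."""
    late_arrivals = sum(c for k, c in histogram.items() if k > 0)
    early_arrivals = sum(c for k, c in histogram.items() if k < 0)
    if late_arrivals > early_arrivals:
        return "early"  # packets mostly late: the reference time is early
    if early_arrivals > late_arrivals:
        return "late"   # packets mostly early: the reference time is late
    return "ok"
```
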
FIG. 5 shows an example of the relationship among the maximum audio reproduction time T_B corresponding to the memory capacity of the audio buffer 133 (that is, the longest audio that can be stored in the audio buffer 133), the time T_P from the arrival of the first audio data (audio packet) until reproduction of the audio indicated by that audio data starts (the audio delay time), and the audio reproduction time T_D corresponding to the amount of previously received audio data remaining in the audio buffer 133.
As shown in FIG. 5, the maximum reproduction time T_B [seconds] is set to a value sufficiently larger than the audio delay time T_P. The relationship between the memory capacity M_B [bytes] of the audio buffer 133 and the maximum reproduction time T_B of the audio that can be stored in the audio buffer 133 can be expressed by equation (2) below.
$$M_B = \frac{f \cdot N_B \cdot N_C}{8}\, T_B \qquad (2)$$
Here, f is the sampling frequency of the audio data [Hz], N_B is the number of sampling bits of the audio data [bits], and N_C is the number of channels of the audio data.
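As a worked illustration of the relationship between buffer capacity and maximum reproduction time, the sketch below converts a capacity in bytes into seconds of uncompressed audio. The function name and the example format (48 kHz, 16 bits, stereo) are hypothetical and are not taken from the publication.

```python
def max_playback_seconds(buffer_bytes, sample_rate_hz, bits_per_sample, channels):
    """Longest uncompressed audio, in seconds, that fits in the buffer."""
    bytes_per_second = sample_rate_hz * (bits_per_sample / 8) * channels
    return buffer_bytes / bytes_per_second


# Hypothetical example: a 384,000-byte buffer holding 48 kHz, 16-bit stereo
# audio (192,000 bytes per second).
t_b = max_playback_seconds(384_000, 48_000, 16, 2)
```
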
When the audio packets are transmitted ideally via the network 3, the audio reproduction time T_D satisfies T_P + T ≥ T_D ≥ T_P, and each audio packet contains audio data with a reproduction time equal to T. The last audio packet, however, does not necessarily contain audio data with a reproduction time equal to T. When the image display device 1 receives the n-th audio packet and stores its audio data in the audio buffer 133, reproduction of the audio indicated by that audio data starts (n-1)T + T_P after the reference time.
In its initial state the audio buffer 133 holds silent audio data of duration T_P, and reproduction starts at the time the first audio data is received (the reference time). This delays reproduction of the first audio data by T_P. Furthermore, by sequentially reproducing the audio indicated by the audio data stored in the audio buffer 133, reproduction of the audio indicated by the n-th audio data (reproduction time = T) starts automatically (n-1)T + T_P after the reference time.
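The reproduction schedule that results from pre-filling the buffer with silence of duration T_P can be written down directly. The sketch below is illustrative only; the function and parameter names are hypothetical, and times are in milliseconds.

```python
def playback_start_ms(n, unit_time_ms, delay_ms, reference_ms=0):
    """Start time of the n-th audio data (n counts from 1).

    Implements (n-1)T + T_P measured from the reference time, which is the
    schedule stated in the text for ideally spaced packets.
    """
    if n < 1:
        raise ValueError("n counts from 1")
    return reference_ms + (n - 1) * unit_time_ms + delay_ms


# With T = 100 ms and T_P = 200 ms, the first data starts 200 ms after the
# reference time and the third data starts 400 ms after it.
first = playback_start_ms(1, unit_time_ms=100, delay_ms=200)
third = playback_start_ms(3, unit_time_ms=100, delay_ms=200)
```
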
The audio delay time T_P may be set in advance to a value satisfying T_P ≥ T + ΔTn_MAX, taking into account the maximum value ΔTn_MAX of the deviation ΔTn of audio packets from their ideal arrival times caused by delay jitter. For example, T_P = 1.5T, T_P = 2T, or T_P = 3T may be used.
In many cases, ΔTn_MAX of the network 3 is unknown, and the audio delay time T_P that a user can tolerate depends on the type of audio being reproduced. For conversation, for example, the audio delay time T_P should be as short as possible, whereas for background music (BGM) a long audio delay time T_P is often not a problem. Therefore, the image display device 1 or the image transmission device 2 may be provided with an adjustment mechanism for T_P so that the user can set T_P arbitrarily.
Once the image display device 1 has started reproducing audio, if the next audio packet arrives within T_P, the next audio data is already stored in the audio buffer 133 when reproduction of the previously received audio data completes, so the audio is reproduced without interruption.
Conversely, if the next audio packet arrives later than T_P, the next audio data has not yet been stored in the audio buffer 133 when reproduction of the previously received audio data completes. In that case, there is silence until the next audio data is received.
Here, when the audio packets as a whole tend to arrive later than their ideal arrival times, that is, when the current reference time is early, the image display device 1 is more likely to enter a silent state in which no audio can be reproduced.
When the audio packets as a whole tend to arrive earlier than their ideal arrival times, that is, when the current reference time is late, no silent state occurs in the image display device 1. In that case, adjusting the reference time to an appropriate time allows the audio delay time T_P to be set shorter than at present. However, even when reproduction of the previously received audio data completes, audio data longer than the duration T_P remains in the audio buffer 133, so the audio buffer 133 may overflow.
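The silence condition can be illustrated with a small check. This sketch rests on a simplifying assumption: a packet is treated as arriving in time when its deviation from the ideal arrival time does not exceed T_P, which is one reading of "arrives within T_P". The function name is hypothetical.

```python
def packets_causing_silence(deviations_ms, delay_ms):
    """Return the 1-based indices of packets that arrive too late.

    A packet is flagged when its deviation from the ideal arrival time
    exceeds the audio delay time T_P (delay_ms), so the buffer has already
    run dry by the time the packet arrives.
    """
    return [n for n, d in enumerate(deviations_ms, start=1) if d > delay_ms]


# With T_P = 200 ms, only the third packet (250 ms late) causes a gap.
late_packets = packets_causing_silence([0, 50, 250, -20], delay_ms=200)
```
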
Therefore, the data processing device 12 adjusts the reproduction start timing of the audio indicated by the audio data contained in the received audio packets, based on the deviation ΔTn of each audio packet. Specifically, the audio transmission processing unit 122 detects a correction deviation amount based on the deviation amount information, that is, information about the per-packet deviations ΔTn, such as the frequency distribution information described above. The audio data processing unit 123 then adjusts the audio reproduction start timing based on the detected correction deviation amount.
When the mean of the ΔTn values of the audio packets is positive, the audio packets may be regarded as tending to arrive later than their ideal arrival times; when the mean is negative, they may be regarded as tending to arrive earlier than their ideal arrival times.
As shown in FIG. 4A or FIG. 4B, the frequency distribution information includes a frequency distribution in which the ΔTn values are classified into predetermined ranges.
Generation of the deviation amount information starts when the reference time is set. The deviation amount information may be generated for each predetermined period, for the period from the first to the last reception of audio data, or for each period in which the audio reproduction start timing is adjusted. The correction deviation amount is detected based on the per-packet deviations ΔTn. For example, the most frequently occurring deviation ΔTn is used as the correction deviation amount. The mean of the deviations ΔTn may be used instead. The most frequent deviation ΔTn and the mean deviation ΔTn are each examples of the correction deviation amount.
When frequency distribution information in which the deviations ΔTn are classified into predetermined ranges is used, the correction deviation amount may be a value representing the relevant range, such as its center value, maximum, or minimum.
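The two example choices of correction deviation amount, the most frequent range and the mean, can be sketched as follows. The bin width of 10 ms and the use of the range midpoint as its representative value are assumptions for illustration; the text also allows the maximum or minimum of the range.

```python
import math
from collections import Counter
from statistics import mean


def correction_from_mode(deviations_ms, bin_width_ms=10):
    """Midpoint of the most frequent range [k*w, (k+1)*w) of deviations."""
    bins = Counter(math.floor(d / bin_width_ms) for d in deviations_ms)
    most_common_bin, _count = bins.most_common(1)[0]
    return (most_common_bin + 0.5) * bin_width_ms


def correction_from_mean(deviations_ms):
    """Mean deviation, the alternative mentioned in the text."""
    return mean(deviations_ms)
```

The mode-based value ignores outliers such as a single packet that arrives unusually late, whereas the mean is pulled toward them.
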
To adjust (shift) the audio reproduction start timing, the reproduction time of the audio data corresponding to a predetermined adjustment time (roughly ten to several tens of seconds) is extended or shortened. The correction deviation amount is used as the amount by which the audio is extended or shortened. Methods of extending or shortening the audio reproduction time include, for example, resampling the received audio data at a sampling frequency different from the one used when the audio data was sampled, deleting part of the audio data, or inserting silent data.
When the audio reproduction time is extended or shortened, the extension or shortening is set to about 0.5% of the reproduction time of the audio data used for the adjustment, so that the user of the image display device 1 does not perceive the reproduced audio as unnatural. For example, when the correction deviation amount is -50 ms (milliseconds), the following 10 seconds (= 50 ms × (100/0.5)) is taken as the adjustment time, and those 10 seconds of audio data are shortened to 9.95 seconds (= 10 s - 50 ms) for reproduction. When the correction deviation amount is +50 ms, the following 10 seconds of audio data are extended to 10.05 seconds (= 10 s + 50 ms) for reproduction.
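The arithmetic in this example can be reproduced with a short sketch. The function name and return shape are hypothetical; the 0.5% figure and the ±50 ms corrections come from the text.

```python
def adjustment_plan(correction_ms, percent=0.5):
    """Return (adjustment_time_ms, adjusted_playback_ms).

    The adjustment time is chosen so that the correction equals `percent`
    percent of it; the audio covering that time is then reproduced in
    adjustment_time + correction milliseconds.
    """
    adjustment_ms = abs(correction_ms) * 100 / percent
    return adjustment_ms, adjustment_ms + correction_ms


# -50 ms: 10 s of audio is reproduced in 9.95 s; +50 ms: in 10.05 s.
shorten = adjustment_plan(-50)
extend = adjustment_plan(+50)
```
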
The extension or shortening need only be small enough that the user of the image display device 1 does not perceive the reproduced audio as unnatural, and may be set to 0.5% or less of the reproduction time of the audio data used for the adjustment. Specifically, suppose for example that the audio data in each audio packet has a reproduction time of 100 ms and the correction deviation amount is -50 ms (or +50 ms). The adjustment time of 10 seconds over which the correction deviation amount of -50 ms (or +50 ms) is absorbed then corresponds to 100 audio data items. The audio data processing unit 123 therefore shortens each audio data item to 99.5 ms (or extends it to 100.5 ms). The shortened (or extended) audio data is stored in the audio buffer 133 as audio correction data, and the stored audio correction data is reproduced sequentially. Repeating this process 100 times adjusts the reproduction start timing automatically without interrupting the audio.
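One way to shorten or extend a single audio data item is to resample it to a different number of samples and reproduce the result at the original rate. The sketch below uses plain linear interpolation on a mono chunk; this is an illustrative stand-in for the resampling method mentioned in the text, not the method of the publication, and the 48 kHz rate in the example is an assumption. Resampling in this way also shifts the pitch slightly, by the same ratio as the length change.

```python
def target_length(n_samples, packet_ms, change_ms):
    """Number of samples after changing the packet duration by change_ms."""
    return round(n_samples * (packet_ms + change_ms) / packet_ms)


def resample_linear(samples, new_len):
    """Stretch or shrink a mono chunk to new_len samples by linear interpolation."""
    if new_len < 2 or len(samples) < 2:
        raise ValueError("need at least two input and two output samples")
    step = (len(samples) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * step
        lo = min(int(pos), len(samples) - 2)  # keep lo + 1 inside the chunk
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[lo + 1] * frac)
    return out


# Hypothetical 100 ms chunk at 48 kHz (4800 samples) shortened by 0.5 ms.
chunk = list(range(4800))
shortened = resample_linear(chunk, target_length(len(chunk), 100, -0.5))
```
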
It is also desirable to correct the reference time together with adjusting the reproduction start timing of the audio indicated by the audio data in the received audio packets. Correcting the reference time makes it possible to continue detecting the deviation of audio data relative to the corrected reference time even after the audio reproduction start timing has been adjusted, so the reproduction timing can be readjusted appropriately when, for example, the transmission state of the network changes. The deviation amount information is desirably reset when the reference time is corrected. Resetting the deviation amount information at an appropriate time makes it possible, for example, to generate deviation amount information that reflects the latest transmission state of the network.
The data processing device 12 adjusts the reproduction start timing of the audio indicated by the audio data and corrects the reference time, based on the deviation ΔTn of each audio packet. The correction amount of the reference time and the adjustment amount of the reproduction start timing are detected based on the per-packet deviations ΔTn.
Specifically, the audio transmission processing unit 122 corrects the reference time based on the correction deviation amount. For example, as described above, if the audio data in each audio packet has a reproduction time of 100 ms and the correction deviation amount is -50 ms (or +50 ms), the adjustment time of 10 seconds for adjusting the audio reproduction timing corresponds to 100 audio data items. The audio data processing unit 123 therefore shortens each audio data item to 99.5 ms (or extends it to 100.5 ms), while the audio transmission processing unit 122 shifts the reference time by -0.5 ms (or +0.5 ms). The shortened (or extended) audio data is stored in the audio buffer 133 as audio correction data, and the stored audio correction data is reproduced sequentially. Repeating this process 100 times adjusts the audio reproduction start timing without interrupting the audio and corrects the reference time by the correction amount of -50 ms (or +50 ms).
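The gradual correction of the reference time can be sketched as a sequence of equal steps, one per audio data item. The generator below is a hypothetical illustration of the 100-step example in the text; names and units (milliseconds) are assumptions.

```python
def gradual_reference_times(reference_ms, correction_ms, n_packets):
    """Yield the reference time after each of n_packets equal correction steps."""
    step = correction_ms / n_packets
    for k in range(1, n_packets + 1):
        # Computed from the start value each time to avoid accumulating error.
        yield reference_ms + step * k


# -50 ms spread over 100 packets: -0.5 ms per packet, -50 ms in total.
steps = list(gradual_reference_times(0, -50, 100))
```
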
Because the correction amount of the reference time is determined from the correction deviation amount, adjusting the audio reproduction start timing based on the correction deviation amount includes adjusting it based on the correction amount of the reference time. Likewise, because the adjustment amount of the audio reproduction start timing is determined from the correction deviation amount, correcting the reference time based on the correction deviation amount includes correcting it based on the adjustment amount of the audio reproduction start timing.
As described above, the transmission rate of the network 3 changes from moment to moment. When the data transmission rate of the network 3 falls below its current level, audio packets tend to arrive later than their ideal arrival times, and the ΔTn values of the audio packets are distributed mostly on the positive side. When the data transmission rate of the network 3 rises above its current level, audio packets tend to arrive earlier than their ideal arrival times, and the ΔTn values are distributed mostly on the negative side. In the present invention, the audio playback start timing can be set appropriately by adjusting the audio playback time.
That is, when the data transmission rate of the network 3 falls below its current level, the data processing device 12 extends, during the adjustment period, the playback time of the audio indicated by the audio data received via the network 3. When the data transmission rate of the network 3 rises above its current level, the data processing device 12 shortens that playback time during the adjustment period. The audio output device plays back the audio based on the playback time adjusted by the data processing device. Note that extending or shortening the audio playback time amounts to adjusting the audio playback start timing.
In the above description, the correction amount of the reference time and the adjustment amount of the audio playback start timing are set to the same amount and are corrected or adjusted gradually, but the processing is not limited to this. For example, the reference time may be corrected all at once (by −50 ms or +50 ms). In this case, it is desirable to suspend the deviation amount detection during the adjustment time of the audio playback start timing. To detect the deviation amount properly during this adjustment period, the processing must take into account the time elapsed since the start of the adjustment and the adjustment amount of the playback start timing. In short, when audio packets tend to arrive later than their ideal arrival times (the correction deviation amount is positive), the current reference time is corrected to be later and the audio playback time is extended. When audio packets tend to arrive earlier than their ideal arrival times (the correction deviation amount is negative), the current reference time is corrected to be earlier and the audio playback time is shortened.
Adjusting the audio playback time in this way (that is, adjusting the audio playback start timing) keeps the amount of audio data held in the audio buffer 133 at an appropriate level and suppresses audio interruptions and the like. Because overflow of audio data and the like can also be suppressed, the capacity of the audio buffer 133 can be reduced. Furthermore, because the reference time is corrected, readjustment according to the network state remains possible even after the audio playback has been adjusted.
To keep audio packets that arrive extremely late compared with other audio packets from being used in calculating (detecting) the correction deviation amount, a threshold T_MAX may be set in advance, and the correction deviation amount may be calculated using only the ΔTn values that satisfy ΔTn < T_MAX. In that case, the deviation amount information may be generated as information on the ΔTn values that satisfy ΔTn < T_MAX. This yields a more appropriate correction deviation amount.
A predetermined threshold T_TH may also be set for the correction deviation amount, so that the reference time is corrected and the audio playback start timing is adjusted (the audio data is processed) only when the absolute value of the correction deviation amount exceeds T_TH. Alternatively, the audio playback start timing may be adjusted at predetermined intervals. Either way, the reference time and the audio playback start timing are not changed frequently, which reduces the processing load on the data processing device 12. The threshold may be set to different values for positive and negative correction deviation amounts. This makes it possible to choose whether to prioritize the audio delay time or the suppression of audio interruptions.
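The two thresholds can be illustrated with a short sketch. The function names, the 10 ms histogram bin width, and the use of separate positive and negative thresholds as two parameters are assumptions made for illustration only. The specification states that the correction deviation amount is the average of the deviation amounts or their most frequent value, but does not prescribe an implementation.

```python
from collections import Counter
from statistics import mean


def correction_deviation(deltas_ms, t_max_ms, use_mode=False, bin_ms=10.0):
    """Correction deviation amount from the ΔTn values that satisfy ΔTn < T_MAX.

    use_mode=False: average of the kept values.
    use_mode=True:  centre of the most frequent histogram bin (bin_ms is an
                    assumed bin width, not a value from the specification).
    """
    kept = [d for d in deltas_ms if d < t_max_ms]   # drop extremely late packets
    if not kept:
        return 0.0
    if not use_mode:
        return mean(kept)
    bins = Counter(round(d / bin_ms) for d in kept)
    most_common_bin, _count = bins.most_common(1)[0]
    return most_common_bin * bin_ms


def needs_adjustment(correction_ms, t_th_pos_ms, t_th_neg_ms):
    """True when |correction| exceeds the threshold for its sign (T_TH)."""
    if correction_ms >= 0:
        return correction_ms > t_th_pos_ms
    return -correction_ms > t_th_neg_ms


deltas = [40.0, 60.0, 50.0, 900.0]          # the 900 ms outlier is ignored
c = correction_deviation(deltas, t_max_ms=500.0)
print(c, needs_adjustment(c, t_th_pos_ms=20.0, t_th_neg_ms=30.0))
```

Choosing a smaller threshold on the positive side makes the device react sooner to late packets, which favors suppressing interruptions. A smaller threshold on the negative side favors keeping the delay short.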
As described above, correcting the reference time based on the frequency distribution information causes the ΔTn values of the audio packets received afterwards to be distributed around zero (0). Subsequent audio packets then arrive within T + ΔTn_MAX, so if the audio playback start timing is adjusted based on the correction deviation amount, audio interruptions are suppressed even if the audio delay time T_P is not set to a value much larger than T + ΔTn_MAX. Audio interruptions caused by delay fluctuation can therefore be suppressed without setting the audio delay time T_P longer than necessary.
Incidentally, as described above, the audio delay time T_P need only be set so as to satisfy T_P ≥ T + ΔTn_MAX. As can be seen from the histograms shown in FIGS. 4A and 4B, the value of ΔTn_MAX can be obtained from the deviation amount information (frequency distribution information).
FIG. 6A is a histogram showing an example of the frequency distribution information when the network 3 has a stable data transmission rate, and FIG. 6B is a histogram showing an example of the frequency distribution information when the network 3 has an unstable data transmission rate. FIGS. 6A and 6B each show an example of the frequency distribution information after the reference time has been corrected.
For example, when the network 3 is in a stable data transmission state, the frequency distribution of ΔTn of the audio packets is relatively narrow and ΔTn_MAX is small, as shown in FIG. 6A. When the network 3 is in an unstable data transmission state, the frequency distribution of ΔTn of the audio packets is relatively wide and ΔTn_MAX is large, as shown in FIG. 6B.
Therefore, in the present invention, after the reference time is corrected, the extent of the frequency distribution along the time axis, that is, the maximum value of the deviation amount ΔTn from the ideal arrival time of the audio packets (ΔTn_MAX: the maximum value in the positive region), is detected, and the audio delay time T_P is adjusted according to ΔTn_MAX. For example, the audio delay time T_P is set to T_P = T + ΔTn_MAX. T_P may instead be set to T_P = T + ΔTn_MAX + α, where α allows for errors in T, ΔTn_MAX, and the like. The deviation amount information, which is a frequency distribution, may be generated as information on the ΔTn values that satisfy ΔTn < T_MAX. The same processing applies when the average and maximum values of the deviation amount are used as the deviation amount information.
Setting the audio delay time T_P in this way allows a shorter T_P when the network 3 is in a stable data transmission state, as shown in FIG. 6A. Even when the network 3 is in an unstable data transmission state, as shown in FIG. 6B, T_P can be set as short as possible within the range in which audio interruptions are suppressed.
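The relation T_P = T + ΔTn_MAX + α can be written out as a small sketch. The function name `audio_delay_time` and the sample values are hypothetical. The optional T_MAX filter follows the ΔTn < T_MAX condition described above, and treating ΔTn_MAX as zero when no positive deviation exists is an assumption.

```python
def audio_delay_time(deltas_ms, t_ms, t_max_ms=None, alpha_ms=0.0):
    """Return T_P = T + ΔTn_MAX + α from observed deviation amounts.

    deltas_ms: ΔTn values measured after the reference time was corrected.
    t_ms:      audio transmission unit time T.
    t_max_ms:  optional threshold T_MAX; values with ΔTn >= T_MAX are ignored.
    alpha_ms:  margin α for errors in T and ΔTn_MAX.
    """
    kept = [d for d in deltas_ms if t_max_ms is None or d < t_max_ms]
    # ΔTn_MAX is the maximum in the positive region; 0 if none is positive.
    delta_max = max((d for d in kept if d > 0), default=0.0)
    return t_ms + delta_max + alpha_ms


stable = [-5.0, 0.0, 3.0, 8.0]            # narrow distribution (cf. FIG. 6A)
unstable = [-40.0, 10.0, 75.0, 600.0]     # wide distribution (cf. FIG. 6B)
print(audio_delay_time(stable, t_ms=100.0))
print(audio_delay_time(unstable, t_ms=100.0, t_max_ms=500.0, alpha_ms=5.0))
```

The stable case yields a shorter T_P than the unstable case, which matches the behavior described for FIGS. 6A and 6B.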
When the audio delay time T_P is changed, the playback start time of each piece of audio data is, as described above, (n−1)T + T_P after the reference time and therefore also depends on T_P. The audio playback start timing can thus be adjusted based on the adjusted audio delay time T_P, using the audio data processing technique described above. For example, to shorten the audio delay time T_P, the audio data in a predetermined adjustment time is shortened by the amount of the reduction; to lengthen T_P, the audio data in a predetermined adjustment time is extended by the amount of the increase. Since the value of ΔTn_MAX can be obtained from the frequency distribution of ΔTn once the correction deviation amount is known, the correction of the reference time and the adjustment of the audio delay time T_P may be performed at the same time. In that case, the audio data is processed to adjust the audio playback start timing based on both the correction deviation amount and the adjusted audio delay time T_P.
As described above, adjusting the audio delay time T_P based on the frequency distribution of ΔTn makes it possible to set T_P to an optimum value for the data transmission state of the network 3 while suppressing audio interruptions caused by delay fluctuation.
FIG. 7 is a flowchart showing an example of the processing procedure of the image display device 1 of the present invention.
As shown in FIG. 7, at the start of reception of image data and audio data, when the audio transmission processing unit 122 receives the audio-related information from the image transmission device 2 via the data receiving unit 111 and the image/audio dividing unit 121, it stores the audio-related information in the audio correction data storage memory 132 (step A1).
When the audio transmission processing unit 122 receives audio data from the image/audio dividing unit 121 (step A2), it determines whether the audio data is the first audio data (step A3). If it is the first audio data, the audio transmission processing unit 122 stores the arrival time of the audio packet containing the audio data in the audio correction data storage memory 132 as the initial value of the reference time (step A4), proceeds to step A10, and writes the audio data to the audio buffer 133.
If the received audio data is not the first audio data, the audio transmission processing unit 122 reads the reference time and the audio transmission unit time T contained in the audio-related information from the audio correction data storage memory 132, and calculates the deviation amount ΔTn of the audio data from its ideal arrival time (step A5).
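Step A5 can be illustrated with a minimal sketch. It assumes that the ideal arrival time of the n-th piece of audio data is the reference time plus (n−1)T; this formula is inferred from the statement that the ideal arrival time is based on the transmission interval and the reference time, and is not quoted from the specification. The function name and the sample values are hypothetical.

```python
def deviation_ms(arrival_ms, reference_ms, n, t_ms):
    """Deviation amount ΔTn of the n-th piece of audio data (n starts at 1).

    Assumption: ideal arrival time = reference time + (n - 1) * T.
    Positive result: the packet arrived late; negative: it arrived early.
    """
    ideal_ms = reference_ms + (n - 1) * t_ms
    return arrival_ms - ideal_ms


# Third packet, T = 100 ms, reference time 1000 ms, actual arrival 1230 ms.
print(deviation_ms(arrival_ms=1230.0, reference_ms=1000.0, n=3, t_ms=100.0))  # 30.0
```

Under this assumption the first piece of audio data always has a deviation of zero, because its arrival time is used as the initial reference time.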
The audio transmission processing unit 122 generates deviation amount information that includes this ΔTn, and calculates the correction deviation amount based on the deviation amount information (step A6).
Next, the audio transmission processing unit 122 determines whether a predetermined adjustment condition is satisfied (step A7). For example, the audio transmission processing unit 122 compares the correction deviation amount calculated in step A6 with a preset threshold T_TH to determine whether the audio data needs to be processed (that is, whether the audio playback start timing needs to be adjusted). If the correction deviation amount exceeds the threshold T_TH, the audio transmission processing unit 122 determines that the audio data needs to be processed, updates the reference time based on the correction deviation amount, and instructs the audio data processing unit 123 to process the audio data. The determination of whether the audio playback start timing needs to be adjusted is not limited to the method using the threshold T_TH. For example, the audio transmission processing unit 122 may make the determination based on whether a predetermined time has elapsed since the previous adjustment of the audio playback start timing.
When instructed to process the audio data, the audio data processing unit 123 processes the audio data so as to extend the audio to be played back if the correction deviation amount is positive, and so as to shorten the audio to be played back if the correction deviation amount is negative (step A8). The audio data processing unit 123 writes the processed audio data (audio correction data) to the audio buffer 133 (step A9).
If the correction amount of the reference time does not exceed the threshold T_TH in step A7, the audio transmission processing unit 122 determines that the audio data does not need to be processed and writes the audio data received in step A2 to the audio buffer 133 without having the audio data processing unit 123 process it (step A9).
Finally, the audio transmission processing unit 122 determines whether the audio data received in step A2 is the last audio data (step A10). If it is not, the process returns to step A2, the next audio data is received from the image/audio dividing unit 121, and steps A2 to A10 are repeated. If the audio data received in step A2 is the last audio data, the process ends.
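The flow of steps A2 to A10 can be sketched as a single loop. This is a simplified illustration under several assumptions that are not taken from the specification: audio data is represented only by its duration, the correction deviation amount is the average of the collected ΔTn values, the ideal arrival time is the reference time plus (n−1)T, each piece of audio data is stretched by at most 0.5% of its duration, and deviation detection is suspended while an adjustment is in progress. All identifiers are hypothetical.

```python
from statistics import mean

MAX_STRETCH = 0.005  # assumed limit: at most 0.5% per piece of audio data


def receive_loop(packets, t_ms, t_th_ms):
    """packets: (arrival_ms, duration_ms) tuples in arrival order.

    Returns the durations written to the buffer and the final reference time.
    """
    buffer, deltas = [], []       # buffer stands in for the audio buffer 133
    reference_ms = None
    pending_ms = 0.0              # part of the correction not yet applied
    for n, (arrival_ms, duration_ms) in enumerate(packets, start=1):   # A2
        if reference_ms is None:                  # A3: first audio data?
            reference_ms = arrival_ms             # A4: initial reference time
            buffer.append(duration_ms)
            continue
        if pending_ms == 0.0:                     # detect only when idle
            deltas.append(arrival_ms - (reference_ms + (n - 1) * t_ms))  # A5
            correction_ms = mean(deltas)          # A6
            if abs(correction_ms) > t_th_ms:      # A7
                pending_ms = correction_ms
                deltas.clear()
        limit_ms = MAX_STRETCH * duration_ms
        step_ms = max(-limit_ms, min(limit_ms, pending_ms))   # A8
        pending_ms -= step_ms
        reference_ms += step_ms                   # shift reference by the same step
        buffer.append(duration_ms + step_ms)      # A9
    return buffer, reference_ms                   # loop exhausted: A10 "last data"


# Every packet after the first arrives 50 ms late (T = 100 ms, T_TH = 20 ms).
packets = [(0.0, 100.0)] + [((n - 1) * 100.0 + 50.0, 100.0) for n in range(2, 103)]
buffered, reference = receive_loop(packets, t_ms=100.0, t_th_ms=20.0)
print(buffered[1], buffered[-1], reference)
```

In this run the 50 ms correction is spread over 100 pieces of audio data of 100.5 ms each, after which the reference time has moved by 50 ms and later packets are written unmodified.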
The flowchart in FIG. 7 shows an example of processing in which the reference time is corrected and the audio playback start timing is adjusted based on the correction deviation amount. When the audio delay time T_P is adjusted as well, the adjusted value of T_P can be determined from the deviation amount information and the correction deviation amount obtained in step A6.
According to the present invention, the deviation amount ΔTn from the ideal arrival time is calculated for each audio packet, and the reference time is corrected based on the frequency distribution of ΔTn, so that the ΔTn values of the audio packets received afterwards are distributed around ΔTn = zero (0). In addition, adjusting the audio playback start timing based on the correction deviation amount suppresses audio interruptions even if the audio delay time T_P is not set to a value much larger than T + ΔTn_MAX.
Furthermore, adjusting the audio delay time T_P based on the frequency distribution of ΔTn of the audio packets allows T_P to be set appropriately for the data transmission state of the network 3 while suppressing audio interruptions.
Audio interruptions could also be suppressed, in view of the delay fluctuation, by accumulating, for example, about one to several seconds of audio data in the audio buffer 133 before starting playback. In that case, however, the operability of the image transmission device 2 deteriorates. In particular, when audio data and image data are played back (displayed) in correspondence with each other, the audio is played back significantly later than the image. If the playback start timings of the image data and the corresponding audio data diverge, the viewer finds the played-back audio unnatural. Adjusting the audio delay time T_P based on the frequency distribution of ΔTn of the audio packets, as in the present invention, keeps T_P short and thus reduces this unnaturalness.
To restate this in terms of the communication state of the network: when the data transmission rate of the network falls below its current level, audio packets arrive later and the distribution of the deviation amount shifts in the positive direction. In this case, the audio data held in the audio buffer 133 decreases, and if the data transmission rate deteriorates further, the buffer may run out of audio data. The present invention therefore extends the playback time of the audio indicated by the arrived audio data (the audio corresponding to the audio packets). When the data transmission rate rises above its current level, audio packets arrive earlier and the distribution of the deviation amount shifts in the negative direction. In this case, the audio data stored in the audio buffer increases. The present invention therefore shortens the playback time of the audio corresponding to the arrived audio packets. Adjusting the audio playback time in this way keeps the amount of audio data held in the audio buffer 133 at an appropriate level and suppresses audio interruptions, overflow of audio data, and the like.
Furthermore, because the present invention corrects the reference time in addition to adjusting the audio playback time, the distribution of the deviation amount after the adjustment is centered near zero and the audio is played back properly. Even if the data transmission rate of the network 3 changes after the adjustment, the same processing can be performed again and proper audio playback can continue.
The present invention has been described above with reference to the embodiment, but the present invention is not limited to the embodiment. Various modifications that can be understood by those skilled in the art may be made to the configuration and details of the present invention within its scope.
[Appendix 1]
An audio playback device that plays back audio indicated by audio data received via a network, the audio playback device comprising:
a data processing device that adjusts the playback start timing of the audio indicated by the received audio data, based on the deviation amount from the ideal arrival time of each piece of audio data received via the network; and
an audio output device that plays back the audio based on the playback start timing of the audio adjusted by the data processing device.
[Appendix 2]
The audio playback device according to appendix 1, further comprising a storage device, wherein
the data processing device adjusts the playback start timing of the audio by processing the audio data based on the deviation amount to generate audio correction data,
the storage device temporarily holds the audio data and the audio correction data, and
the audio output device reads the audio data or the audio correction data held in the storage device and plays back the audio indicated by the audio data or audio correction data thus read.
[Appendix 3]
The audio playback device according to appendix 1 or 2, wherein
the data processing device adjusts the playback start timing of the audio and corrects the reference time based on the deviation amount.
[Appendix 4]
The audio playback device according to appendix 3, wherein,
where the correction deviation amount is the average value of the deviation amounts or the most frequently occurring value of the deviation amounts,
the playback start timing of the audio is adjusted based on the correction deviation amount, and
the reference time is corrected based on the correction deviation amount.
[Appendix 5]
The audio playback device according to appendix 4, wherein the data processing device:
when the correction deviation amount is positive, extends the playback time of the audio and corrects the reference time to be later; and
when the correction deviation amount is negative, shortens the playback time of the audio and corrects the reference time to be earlier.
[Appendix 6]
The audio playback device according to any one of appendices 1 to 5, wherein
the reference time is initially set to the arrival time of the first audio data among the received audio data.
[Appendix 7]
The audio playback device according to any one of appendices 1 to 6, wherein the data processing device
gradually adjusts the playback start timing of the audio based on the audio playback time used for adjusting the playback start timing of the audio.
[Appendix 8]
The audio playback device according to any one of appendices 1 to 7, wherein
the ideal arrival time is the arrival time expected from the predetermined transmission interval of the audio data and a reference time that serves as the reference for playing back the audio.
[Appendix 9]
The audio playback device according to appendix 8, wherein the data processing device
adjusts an audio delay time, which is the time from the arrival of the audio data until the start of playback of the audio indicated by that audio data, based on the transmission interval of the audio data and the maximum value indicated by deviation amount information, which is information on the deviation amount.
[Appendix 10]
The audio playback device according to any one of appendices 5 to 9, wherein the data processing device
sets the extension time or shortening time of the audio to 0.5% or less of the audio playback time used for adjusting the playback start timing of the audio.
[Appendix 11]
An audio playback device that plays back audio indicated by audio data received via a network, the audio playback device comprising:
a data processing device that extends the playback time of the audio indicated by the audio data received via the network when the data transmission rate of the network falls below its current level, and shortens the playback time of the audio indicated by the audio data received via the network when the data transmission rate of the network rises above its current level; and
an audio output device that plays back the audio based on the playback time of the audio adjusted by the data processing device.
[Appendix 12]
An image display device comprising:
the audio playback device according to any one of appendices 1 to 11; and
an image output device that displays an image indicated by image data received via the network,
wherein the image display device receives audio data transmitted in correspondence with the image data.
[Appendix 13]
An image playback system comprising:
an image transmission device that transmits video as image data composed of consecutive still image data and transmits audio data corresponding to the image data separately from the image data; and
the image display device according to claim 12, which is connected to the image transmission device via a network so as to be capable of data transmission and plays back the image and audio indicated by the image data and audio data received from the image transmission device.
[Appendix 14]
An audio playback method performed by an image display device that plays back an image and audio indicated by image data and audio data received via a network, the method comprising:
adjusting the playback start timing of the audio indicated by the received audio data, based on the deviation amount from the ideal arrival time of each piece of audio data received via the network; and
playing back the audio based on the playback start timing of the audio adjusted by the data processing device.

Claims (14)

  1.  An audio playback device that plays back audio indicated by audio data received via a network, the audio playback device comprising:
     a data processing device that adjusts the playback start timing of the audio indicated by the received audio data, based on the deviation amount from the ideal arrival time of each piece of audio data received via the network; and
     an audio output device that plays back the audio based on the playback start timing of the audio adjusted by the data processing device.
  2.  The audio playback device according to claim 1, further comprising a storage device, wherein
     the data processing device adjusts the playback start timing of the audio by processing the audio data based on the deviation amount to generate audio correction data,
     the storage device temporarily holds the audio data and the audio correction data, and
     the audio output device reads the audio data or the audio correction data held in the storage device and plays back the audio indicated by the audio data or audio correction data thus read.
  3.  The audio reproduction device according to claim 1 or 2, wherein
    the data processing device adjusts the reproduction start timing of the audio and also corrects the reference time based on the deviation amount.
  4.  The audio reproduction device according to claim 3, wherein,
    where a corrected deviation amount is defined as either the average value of the deviation amounts or the deviation amount value that occurs most frequently,
    the reproduction start timing of the audio is adjusted based on the corrected deviation amount, and
    the reference time is corrected based on the corrected deviation amount.
  5.  The audio reproduction device according to claim 4, wherein the data processing device:
    when the corrected deviation amount is positive, adjusts the audio so as to extend its reproduction time and corrects the reference time so as to delay it; and
    when the corrected deviation amount is negative, adjusts the audio so as to shorten its reproduction time and corrects the reference time so as to advance it.
  6.  The audio reproduction device according to any one of claims 1 to 5, wherein
    the reference time is set with, as its initial value, the arrival time of the first piece of audio data among the audio data received.
  7.  The audio reproduction device according to any one of claims 1 to 6, wherein
    the data processing device gradually adjusts the reproduction start timing of the audio based on the reproduction time of the audio used for adjusting the reproduction start timing.
  8.  The audio reproduction device according to any one of claims 1 to 7, wherein
    the ideal arrival time is an arrival time scheduled based on a predetermined transmission interval of the audio data and a reference time that serves as a reference when the audio is reproduced.
  9.  The audio reproduction device according to claim 8, wherein
    the data processing device adjusts an audio delay time, which is the time from the arrival of a piece of audio data until reproduction of the audio indicated by that audio data starts, based on the transmission interval of the audio data and the maximum value indicated by deviation amount information, which is information on the deviation amounts.
  10.  The audio reproduction device according to any one of claims 5 to 9, wherein
    the data processing device sets the extension time or shortening time of the audio to 0.5% or less of the reproduction time of the audio used for adjusting the reproduction start timing.
  11.  An audio reproduction device that reproduces audio indicated by audio data received via a network, comprising:
    a data processing device that, when the data transmission rate of the network falls from its current state, adjusts the audio indicated by the audio data received via the network so as to extend its reproduction time, and, when the data transmission rate of the network rises from its current state, adjusts that audio so as to shorten its reproduction time; and
    an audio output device that reproduces the audio based on the reproduction time of the audio adjusted by the data processing device.
  12.  An image display device comprising:
    the audio reproduction device according to any one of claims 1 to 11; and
    an image output device that displays an image indicated by image data received via the network,
    wherein the image display device receives audio data transmitted in correspondence with the image data.
  13.  An image reproduction system comprising:
    an image transmission device that transmits video as image data composed of consecutive still image data, and transmits audio data corresponding to the image data separately from the image data; and
    the image display device according to claim 12, which is connected to the image transmission device via a network so as to be capable of data transmission, and which reproduces the image and audio indicated by the image data and audio data received from the image transmission device.
  14.  An audio reproduction method performed by an image display device that reproduces an image and audio indicated by image data and audio data received via a network, the method comprising:
    adjusting the reproduction start timing of the audio indicated by the received audio data, based on the amount of deviation of each piece of audio data received via the network from its ideal arrival time; and
    reproducing the audio based on the reproduction start timing of the audio adjusted by the data processing device.
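
The interplay of claims 4, 5 and 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the claimed implementation: the names `corrected_deviation`, `adjust` and `MAX_STRETCH_RATIO` are hypothetical, times are in milliseconds, and shifting the reference time by the same clamped amount that is applied to the reproduction time is an assumption the claims do not spell out.

```python
from collections import Counter
from statistics import mean
from typing import Sequence, Tuple

MAX_STRETCH_RATIO = 0.005  # the 0.5 % ceiling stated in claim 10


def corrected_deviation(deviations_ms: Sequence[float], use_mode: bool = False) -> float:
    """Mean of the deviations, or the most frequent value when use_mode is set."""
    if use_mode:
        return Counter(deviations_ms).most_common(1)[0][0]
    return mean(deviations_ms)


def adjust(reference_ms: float, playback_ms: float,
           deviations_ms: Sequence[float], use_mode: bool = False) -> Tuple[float, float]:
    """Return (new reproduction time, new reference time), both in ms."""
    if not deviations_ms:
        return playback_ms, reference_ms
    corrected = corrected_deviation(deviations_ms, use_mode)
    limit = playback_ms * MAX_STRETCH_RATIO
    # Positive deviation: extend the audio and delay the reference time.
    # Negative deviation: shorten the audio and advance the reference time.
    step = max(-limit, min(limit, corrected))
    # Assumption: the reference time moves by the amount actually applied.
    return playback_ms + step, reference_ms + step


if __name__ == "__main__":
    # Mean deviation of +6 ms on 2000 ms of audio stays within the 10 ms limit.
    print(adjust(1000.0, 2000.0, [4.0, 6.0, 8.0]))
    # Most frequent deviation of -20 ms on 1000 ms of audio is clamped to the limit.
    print(adjust(0.0, 1000.0, [-20.0, -20.0, -5.0], use_mode=True))
```

Clamping each step means a large deviation is absorbed over several adjustment rounds rather than in one audible jump, which is one way to read the gradual adjustment of claim 7.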
PCT/JP2015/059430 2015-03-26 2015-03-26 Audio reproduction device, image display device and audio reproduction method thereof WO2016151852A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2015/059430 WO2016151852A1 (en) 2015-03-26 2015-03-26 Audio reproduction device, image display device and audio reproduction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2015/059430 WO2016151852A1 (en) 2015-03-26 2015-03-26 Audio reproduction device, image display device and audio reproduction method thereof

Publications (1)

Publication Number Publication Date
WO2016151852A1 (en) 2016-09-29

Family

ID=56978179

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/059430 WO2016151852A1 (en) 2015-03-26 2015-03-26 Audio reproduction device, image display device and audio reproduction method thereof

Country Status (1)

Country Link
WO (1) WO2016151852A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0630047A (en) * 1992-07-10 1994-02-04 Matsushita Electric Ind Co Ltd Packet delay fluctuation control circuit
JP2003258894A (en) * 2002-03-05 2003-09-12 Matsushita Electric Ind Co Ltd Method for receiving and reproducing data and data communication equipment

Similar Documents

Publication Publication Date Title
US20220263423A9 (en) Controlling a jitter buffer
US7424026B2 (en) Method and apparatus providing continuous adaptive control of voice packet buffer at receiver terminal
US8937963B1 (en) Integrated adaptive jitter buffer
US7457282B2 (en) Method and apparatus providing smooth adaptive management of packets containing time-ordered content at a receiving terminal
US6985501B2 (en) Device and method for reducing delay jitter in data transmission
JP4462996B2 (en) Packet receiving method and packet receiving apparatus
US9948578B2 (en) De-jitter buffer update
JP2006135974A (en) Audio receiver having adaptive buffer delay
US8594184B2 (en) Method and apparatus for controlling video-audio data playing
JP2007511939A5 (en)
US7738772B2 (en) Apparatus and method for synchronizing video data and audio data having different predetermined frame lengths
JP4076981B2 (en) Communication terminal apparatus and buffer control method
WO2016151852A1 (en) Audio reproduction device, image display device and audio reproduction method thereof
JP5186094B2 (en) Communication terminal, multimedia playback control method, and program
JP2013005423A (en) Video reproducer, video reproduction method and program
JP2017204700A (en) Image reproduction apparatus, image reproduction method, and image reproduction program
JP2016126037A (en) Signal processing device, signal processing method, and program
US8572273B2 (en) Method and apparatus for reproducing multimedia data by controlling reproducing speed
JP2010136159A (en) Data receiver
JP2007318283A (en) Packet communication system, data receiver
JP2005064873A (en) Jitter buffer control method, and ip telephone set
JP2007274536A (en) Receiver and transmission/reception method
JP2008199361A (en) Stream data receiving and reproducing device
JP2005229168A (en) Medium output system, synchronous error control system thereof and program
JP2005101818A (en) Apparatus and method for decoding and reproducing, and decoding and reproducing program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15886404; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 15886404; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)