WO2011148519A1

WO2011148519A1 - Dwelling unit device for interphone system for residential complex

Info

Publication number: WO2011148519A1
Application number: PCT/JP2010/062581
Authority: WO
Inventors: 実福島; 恵一 ▲吉▼田; 哲平鷲; 幸夫岡田; 和生土橋; 克彦木村
Original assignee: パナソニック電工株式会社
Priority date: 2010-05-24
Filing date: 2010-07-27
Publication date: 2011-12-01
Also published as: TWI442759B; JP5544012B2; CN102918825B; CN102918825A; JPWO2011148519A1; TW201143350A

Abstract

In a dwelling unit device (A), a call processing unit (2) executes first software when the call terminal at the other end of the call uses an analog transmission method, and executes second software when that terminal uses a packet transmission method. Call processing suited to each transmission method can thus be executed selectively.

Description

Intercom system dwelling unit for apartment houses

The present invention relates to a dwelling unit device that is used in an intercom system for an apartment house and installed in each dwelling unit of the apartment house.

Conventionally, an intercom system for collective housing has been provided that includes a common unit device (lobby intercom) installed at the common entrance of the apartment house, a dwelling unit device installed inside each dwelling unit of the apartment house, and a doorphone slave unit installed at the outside entrance of each dwelling unit. A signal trunk line is connected to the common unit device, and each dwelling unit device is connected to a dwelling unit line that branches from the signal trunk line. In each dwelling unit, the dwelling unit device inside the dwelling and the doorphone slave unit at the outside entrance are connected by a slave unit connection line. Furthermore, another dwelling unit device may be connected to a dwelling unit device by an in-house connection line. In that case, the dwelling unit device connected to the dwelling unit line is called the dwelling unit main unit, and the dwelling unit device connected to the dwelling unit main unit by the in-house connection line is called the dwelling unit sub-master unit. Japanese Patent Publication No. 2010-28771 describes an intercom system for collective housing in which voice is transmitted over the signal trunk line and the dwelling unit lines by a packet transmission method, which enables calls between a dwelling unit device and another dwelling unit device (calls between two dwelling units).

A dwelling unit device performs various kinds of call processing, for example call direction switching and echo suppression for hands-free (loudspeaker) calls. Furthermore, when the common unit device and the plurality of dwelling unit devices communicate digitally, as in the conventional example described in the above-mentioned prior document, and voice data is transmitted as digital data over the signal trunk line and the dwelling unit lines that connect the common unit device to each dwelling unit device, call processing that compensates for voice loss caused by the packet loss, delay, and fluctuation (jitter) that accompany packet transmission is necessary in order to improve call quality.

On the other hand, a conventional inexpensive device, that is, a device that transmits voice by an analog transmission method, is sometimes used as the doorphone slave unit or the dwelling unit sub-master unit. In this case, an analog transmission method is adopted for voice transmission between the dwelling unit device (dwelling unit main unit) and the doorphone slave unit, or between the dwelling unit main unit and the dwelling unit sub-master unit. Call direction switching and echo suppression for hands-free (loudspeaker) calls are still necessary with the analog transmission method. However, the voice loss compensation processing that is essential for the packet transmission method, used when digital data is transmitted through the signal trunk line as described above, is unnecessary for the analog transmission method.

The dwelling unit device (dwelling unit main unit) must therefore execute call processing for both the analog transmission method and the packet transmission method. If these kinds of call processing are implemented with separate hardware (dedicated call processing circuits), the circuit configuration becomes complicated and the cost increases.

Therefore, an object of the present invention is to provide a dwelling unit device for an apartment house intercom system that can use a packet transmission method for voice transmission via the signal trunk line and an analog transmission method for voice transmission in and around the home that does not pass through the signal trunk line, and that can improve call quality while suppressing circuit complexity and cost increases.

The dwelling unit device of the present invention is used in an intercom system for collective housing. The system comprises a common unit device installed at the common entrance of the collective housing, a dwelling unit device installed in each dwelling unit of the collective housing, a doorphone slave unit installed at the outside entrance of each dwelling unit, a signal trunk line connected to the common unit device, dwelling unit lines that branch from the signal trunk line and connect to the respective dwelling unit devices, and a slave unit connection line that connects each dwelling unit device to its doorphone slave unit. Call voice is transmitted between the common unit device and a dwelling unit device, and between dwelling unit devices, by a packet transmission method via the signal trunk line and the dwelling unit lines. Call voice is transmitted between a dwelling unit device and its doorphone slave unit by an analog transmission method via the slave unit connection line.

The dwelling unit device comprises: a microphone and a speaker; a transmission processing unit that transmits voice packets containing voice data for a call and control packets containing control data for call control via the dwelling unit line and the signal trunk line; an analog signal transmission unit that transmits an analog audio signal via the slave unit connection line; a first conversion processing unit that converts the analog audio signal output from the microphone into audio data, and converts audio data into an analog audio signal and outputs it to the speaker; a second conversion processing unit that converts the analog audio signal received by the analog signal transmission unit into audio data, and converts audio data into an analog audio signal and outputs it to the analog signal transmission unit; a call processing unit that performs predetermined call processing on voice data; a doorphone call detection unit that detects a call from the doorphone slave unit; a storage unit that stores first software for call processing of voice data transmitted by the analog transmission method and second software for call processing of voice data transmitted by the packet transmission method; and a control unit that instructs the call processing unit to execute call processing.

In the first feature of the present invention, the control unit instructs the call processing unit to execute the first software when the doorphone call detection unit detects a call, and instructs the call processing unit to execute the second software when control data for call control is received from the common unit device or another dwelling unit device. In other words, the call processing unit executes the first software when the other party's call terminal uses the analog transmission method and executes the second software when that terminal uses the packet transmission method. A packet transmission method can therefore be used for voice transmission via the signal trunk line and an analog transmission method for voice transmission in and around the home that does not pass through the signal trunk line, and call quality can be improved while circuit complexity and cost increases are suppressed.
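The selection rule of the first feature can be sketched in Python as follows. This is only an illustrative sketch: the names `CallSource`, `FIRST_SOFTWARE`, `SECOND_SOFTWARE`, and `select_software`, as well as the stage labels inside the tuples, are hypothetical and do not appear in the specification.

```python
from enum import Enum
from typing import Tuple


class CallSource(Enum):
    """How the incoming call was noticed (hypothetical labels)."""
    DOORPHONE_CALL = "doorphone_call"    # detected by the doorphone call detection unit
    CONTROL_PACKET = "control_packet"    # call-control data from the trunk-line side


# Illustrative stage names only; the real contents of each software are
# described in the embodiments, not defined here.
FIRST_SOFTWARE: Tuple[str, ...] = (
    "call_direction_switching",
    "echo_suppression",
)
SECOND_SOFTWARE: Tuple[str, ...] = (
    "call_direction_switching",
    "echo_suppression",
    "jitter_absorption",
    "loss_compensation",
)


def select_software(source: CallSource) -> Tuple[str, ...]:
    """Return the software the control unit would ask the call processing unit to run."""
    if source is CallSource.DOORPHONE_CALL:
        return FIRST_SOFTWARE    # analog transmission method
    return SECOND_SOFTWARE       # packet transmission method
```

Because both sets of processing run on the same call processing unit, only the loaded software changes between an analog call and a packet call.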

In one embodiment, the second software includes a program for acoustic echo suppression processing, which suppresses the acoustic echo generated by acoustic coupling between the microphone and the speaker, and a program for residual echo suppression processing, which suppresses the residual echo that the acoustic echo suppression processing cannot suppress. Because the second software includes both programs, call quality with the packet transmission method can be further improved.

In one embodiment, it is preferable that the second software includes a fluctuation absorption processing program for absorbing fluctuations in transmission delay in the transmission processing unit. In the present invention, since the second software includes a fluctuation absorption processing program, the call quality in the packet transmission method can be further improved.

In one embodiment, the dwelling unit device includes a fluctuation absorbing buffer that accumulates the voice data contained in the voice packets received by the transmission processing unit. The fluctuation absorption processing program preferably causes the call processing unit to perform a counting step, in which the number of voice data packets stored in the fluctuation absorbing buffer is counted at a period no longer than the packetization period of the voice packets to calculate a packet count value, and a buffer size changing step, in which a packet is inserted into or deleted from the fluctuation absorbing buffer based on the packet count value calculated in the counting step. Because packets are inserted or deleted based on the packet count value, call delay can be reduced and call quality can be further improved.

In one embodiment, the fluctuation absorption processing program preferably causes the call processing unit, in the buffer size changing step, to calculate a representative value of the packet count value from the past history of the packet count value, to delete a packet from the fluctuation absorbing buffer when the representative value is larger than a predetermined reference value, and to insert a packet into the fluctuation absorbing buffer when the representative value is smaller than the reference value. Packet depletion can thus be prevented and call delay reduced with higher accuracy.
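The comparison against the reference value can be sketched as below. The function name `buffer_action` and the string results are hypothetical, and the "keep" result for the case where the two values are equal is an assumption, since the text only states the larger and smaller cases.

```python
def buffer_action(representative: float, reference: float) -> str:
    """Decide how the buffer size changing step should adjust the buffer."""
    if representative > reference:
        return "delete"   # more packets queued than needed: shorten the delay
    if representative < reference:
        return "insert"   # too few packets queued: guard against depletion
    return "keep"         # assumption: no change when the values are equal
```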

In one embodiment, the fluctuation absorption processing program preferably causes the call processing unit to record the reception time of the latest packet and, in the counting step, to calculate the packet count value as follows: the count value of the latest packet is set to the difference between the calculation time of the packet count value and the recorded reception time, divided by the packetization period, and the count value of each packet other than the latest packet is set to 1. Because every packet other than the latest one counts as 1, the reception time needs to be recorded only for the latest packet, which saves capacity in the recording medium that holds the reception times.
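A short sketch of this calculation follows. The function name `packet_count` and its parameter names are hypothetical, all times are assumed to be in the same unit, and returning 0 for an empty buffer is an assumption.

```python
def packet_count(num_stored: int, latest_rx_time: float,
                 calc_time: float, packet_period: float) -> float:
    """Packet count value with a fractional contribution from the latest packet."""
    if num_stored == 0:
        return 0.0  # assumption: an empty buffer yields a count of zero
    # The latest packet contributes (calculation time - reception time) / period.
    latest = (calc_time - latest_rx_time) / packet_period
    # Every other stored packet contributes exactly 1.
    return (num_stored - 1) + latest
```

For example, with three stored packets, a 20 ms packetization period, and the latest packet received 10 ms before the calculation time, the count is 2 + 10/20 = 2.5.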

In one embodiment, the fluctuation absorption processing program preferably causes the call processing unit to hold the past N packet count values (N is a positive integer) in the counting step and, in the buffer size changing step, to use the n-th smallest of the past N packet count values (n is a positive integer less than N) as the representative value. Packet depletion can thus be prevented and call delay reduced with higher accuracy.
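The n-th smallest selection over a sliding history can be sketched with a bounded deque. The class name `CountHistory` is hypothetical, and returning `None` until at least n values have been recorded is an assumption.

```python
from collections import deque
from typing import Optional


class CountHistory:
    """Keeps the past N packet count values and returns the n-th smallest."""

    def __init__(self, size_n: int, rank_n: int) -> None:
        if not 0 < rank_n < size_n:
            raise ValueError("rank_n must satisfy 0 < rank_n < size_n")
        self._values = deque(maxlen=size_n)  # the oldest value drops out automatically
        self._rank = rank_n

    def add(self, count_value: float) -> None:
        self._values.append(count_value)

    def representative(self) -> Optional[float]:
        # Assumption: no representative value until n values are available.
        if len(self._values) < self._rank:
            return None
        return sorted(self._values)[self._rank - 1]
```

Choosing a small n biases the representative value toward the low end of recent counts, which is the end that matters for avoiding packet depletion.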

In one embodiment, the fluctuation absorption processing program preferably causes the call processing unit, in the counting step, to determine whether a spike delay has occurred based on the past N packet count values and, when it determines that a spike delay has occurred, to extract the past M packet count values (M is a positive integer with M < N) from the past N packet count values. In the buffer size changing step, the program preferably causes the call processing unit to calculate, as the representative value, the m-th packet count value (m is an integer less than M) among the past M packet count values extracted in the counting step. The representative value can thus be calculated while excluding spike delays, which occur only rarely.

In one embodiment, when the packet count value is zero consecutively in the counting step, the fluctuation absorption processing program preferably causes the call processing unit to calculate, as the packet count value, a negative value whose absolute value increases with the number of consecutive zeros. The packet count value then distinguishes the case in which packets are being received periodically but the number of stored packets happens to be zero at the calculation time from the case in which packets are not being received regularly. As a result, packets are less likely to be deleted in the latter case than in the former.
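One way to realize this behavior is sketched below. The class name `ZeroRunCounter` is hypothetical, and the linear growth of the negative value (one `step` per additional consecutive zero) is an assumption; the text only requires that the absolute value increase with the number of consecutive zeros.

```python
class ZeroRunCounter:
    """Replaces repeated zero counts with increasingly negative values."""

    def __init__(self, step: float = 1.0) -> None:
        self._step = step      # assumed linear growth per extra consecutive zero
        self._zero_run = 0

    def adjust(self, raw_count: float) -> float:
        if raw_count > 0:
            self._zero_run = 0
            return raw_count
        self._zero_run += 1
        if self._zero_run == 1:
            return 0.0         # a single zero may just be a timing coincidence
        return -self._step * (self._zero_run - 1)
```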

In one embodiment, the second software preferably includes a program for voice data loss compensation processing that, when all or part of the voice data contained in a voice packet received by the transmission processing unit is missing, compensates for all or part of the missing voice data by using voice data that is not missing. Because the missing portion is compensated from voice data that is not missing, call quality with the packet transmission method can be further improved.

In one embodiment, the dwelling unit device includes a fluctuation absorbing buffer that accumulates the voice data contained in the voice packets received by the transmission processing unit, and the fluctuation absorption processing program includes a counting step of counting the number of voice data packets stored in the fluctuation absorbing buffer to calculate the packet count value, and a buffer size changing step of inserting a packet into or deleting a packet from the fluctuation absorbing buffer based on the packet count value calculated in the counting step. When one packet is to be deleted from the fluctuation absorbing buffer in the buffer size changing step and there are two or more consecutive valid packets containing voice data, the program preferably causes the call processing unit to delete one packet by overlap-adding two consecutive valid packets located in the middle of the run of consecutive valid packets. Because two consecutive valid packets located in the middle are overlap-added, the voice deterioration due to the packet loss concealment process can be reduced.
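The deletion by overlap addition can be sketched as follows. The function names are hypothetical, packets are modelled as equal-length lists of samples with `None` standing for an invalid packet, and the linear cross-fade weights and the choice of the first qualifying run are assumptions; the specification does not state the window shape here.

```python
from typing import List, Optional

Packet = Optional[List[float]]   # None models an invalid packet (no voice data)


def overlap_add_merge(first: List[float], second: List[float]) -> List[float]:
    """Cross-fade two equal-length packets into a single packet."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("packets must have equal length of at least 2")
    n = len(first)
    merged = []
    for i in range(n):
        w = i / (n - 1)          # fade-in weight for the second packet
        merged.append((1.0 - w) * first[i] + w * second[i])
    return merged


def delete_one_packet(buffer: List[Packet]) -> List[Packet]:
    """Shorten the buffer by one packet using the middle pair of a valid run."""
    i = 0
    while i < len(buffer):
        if buffer[i] is None:
            i += 1
            continue
        j = i
        while j < len(buffer) and buffer[j] is not None:
            j += 1
        run_len = j - i
        if run_len >= 2:
            k = i + (run_len - 2) // 2   # left index of the middle pair
            merged = overlap_add_merge(buffer[k], buffer[k + 1])
            return buffer[:k] + [merged] + buffer[k + 2:]
        i = j
    return list(buffer)              # no run of two valid packets: unchanged
```

The merged packet starts like the first packet and ends like the second, so its boundaries with the neighbouring packets stay continuous.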

In one embodiment, when a packet is to be inserted into the fluctuation absorbing buffer in the buffer size changing step and there are two consecutive valid packets, the fluctuation absorption processing program preferably causes the call processing unit to insert an invalid packet containing no voice between these two valid packets. Because the invalid packet is inserted between two consecutive valid packets, the voice deterioration caused by the insertion can be kept small.
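The insertion rule can be sketched in the same list-based model, where `None` stands for an invalid packet. The function name `insert_invalid_packet` is hypothetical, and leaving the buffer unchanged when no pair of consecutive valid packets exists is an assumption, since that case is not covered in this paragraph.

```python
from typing import List, Optional

Packet = Optional[List[float]]   # None models an invalid packet (no voice data)


def insert_invalid_packet(buffer: List[Packet]) -> List[Packet]:
    """Insert one invalid packet between the first pair of consecutive valid packets."""
    for i in range(len(buffer) - 1):
        if buffer[i] is not None and buffer[i + 1] is not None:
            return buffer[:i + 1] + [None] + buffer[i + 1:]
    return list(buffer)          # assumption: no change when no such pair exists
```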

In one embodiment, the second software includes a program for audio data loss detection processing that detects the loss of all or part of the audio data output from the transmission processing unit, a program for pitch detection processing that detects the pitch of the voice from the audio data, and a program for audio data loss compensation processing that, when the audio data loss detection processing detects a loss, compensates for the missing audio data based on the pitch detected by the pitch detection processing. The pitch detection processing program preferably causes the call processing unit to perform a process of setting, as a reference signal, the portion of the audio signal that extends a given time width into the past from the current time; a process of detecting the pitch of the audio signal by sliding the reference signal from the current time toward the past relative to the audio signal and obtaining the correlation between the reference signal and the audio signal; and a process of increasing the time width of the reference signal as the slide amount of the reference signal increases. Because the time width of the reference signal increases with the slide amount, the pitch of the audio signal immediately before the time at which the loss occurred can be detected accurately.

In one embodiment, the pitch detection processing program preferably causes the call processing unit to keep the time width of the reference signal at a predetermined initial time width until the slide amount of the reference signal reaches a predetermined slide reference value. Even when the slide amount is small, the reference signal then has at least a certain time width, and the correlation between the reference signal and the audio signal can be obtained more accurately.

In one embodiment, it is preferable that the pitch detection processing program causes the call processing unit to perform processing for obtaining a correlation between the reference signal and the voice signal by an average amplitude difference function method. According to the present invention, the correlation between the reference signal and the audio signal can be obtained with high accuracy with a relatively small amount of calculation.

In one embodiment, it is preferable that the pitch detection processing program causes the call processing unit to perform a process of obtaining a correlation between the reference signal and the voice signal using an average amplitude difference function of Expression (1).

Here, φ(τ) is the correlation value, N is the time width of the reference signal, x(j) is the reference signal, x(j−τ) is the audio signal, k+1 is the starting point of the reference signal, a is a predetermined coefficient, and τ is the slide amount of the reference signal. In the present invention, the correlation between the reference signal and the audio signal can be obtained with higher accuracy by using Expression (1).
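Since Expression (1) itself is not reproduced in this text, the following Python sketch uses the textbook average magnitude difference function, φ(τ) = (1/N) Σ |x(j) − x(j−τ)|, rather than the patent's exact formula. The rule N = max(initial width, growth × τ) for widening the reference signal, the parameter `growth`, and all function names are assumptions made for illustration; the role of the coefficient a is not modelled.

```python
from typing import List, Optional


def window_width(tau: int, initial_width: int, growth: float) -> int:
    """Reference-signal width: fixed at first, then growing with the slide amount."""
    return max(initial_width, int(round(growth * tau)))


def amdf(signal: List[float], tau: int, width: int) -> float:
    """Average magnitude difference between the newest samples and the signal tau earlier."""
    end = len(signal)
    start = end - width
    if start - tau < 0:
        raise ValueError("not enough history for this lag and width")
    total = sum(abs(signal[j] - signal[j - tau]) for j in range(start, end))
    return total / width


def detect_pitch(signal: List[float], min_lag: int, max_lag: int,
                 initial_width: int, growth: float = 1.0) -> Optional[int]:
    """Return the lag (in samples) with the smallest AMDF value, or None."""
    best_lag, best_val = None, float("inf")
    for tau in range(min_lag, max_lag + 1):
        width = window_width(tau, initial_width, growth)
        if len(signal) - width - tau < 0:
            break                         # history exhausted for larger lags
        val = amdf(signal, tau, width)
        if val < best_val:                # strict: the shortest best lag wins
            best_lag, best_val = tau, val
    return best_lag
```

A small AMDF value means a strong match, so the detected pitch period is the lag that minimizes it. Restricting the search to `min_lag`..`max_lag` corresponds to searching only a predetermined pitch range.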

In the second aspect of the present invention, the second software includes a program for audio data loss detection processing that detects the loss of all or part of the audio data output from the transmission processing unit, a program for pitch detection processing that detects the pitch of the voice from the audio data, a program for audio data loss compensation processing that, when the audio data loss detection processing detects a loss, compensates for the missing audio data based on the pitch detected by the pitch detection processing, and a program for speech speed conversion processing that expands or compresses the audio data using the pitch detected by the pitch detection processing. Because the pitch detected by the pitch detection processing is shared between the audio data loss compensation processing and the speech speed conversion processing, less memory is consumed for loading programs than in a configuration in which the loss compensation program and the speech speed conversion program each include their own pitch detection program.

In one embodiment, the pitch detection process counts a predetermined detection cycle and repeatedly detects the pitch in synchronization with that cycle. When the audio data loss detection process detects a loss of audio data, the pitch detection process preferably detects the pitch at the time the loss is detected and restarts the detection cycle from that time. This maintains the quality of the voice after the audio data loss compensation process.

In one embodiment, it is preferable that the pitch detection process detects only a pitch in a predetermined frequency range. In the present invention, since the pitch detection in an unnecessary frequency range is not performed, the processing load can be reduced.

In one embodiment, it is preferable that the speech speed conversion process detects a voice section of the voice data and converts only the voice data of the voice section. In the present invention, since the speech speed conversion process is not performed in a section other than the voice section (for example, a silent section), the processing load in the speech speed conversion process can be reduced.

In one embodiment, the audio data loss detection process preferably detects loss at a first time interval, obtained by dividing the time length of one packet of audio data by a positive integer, in synchronization with the input timing of the audio data, and the pitch detection process preferably detects the pitch at a detection period, obtained by multiplying the first time interval by a positive integer, in synchronization with the first time interval. Because the detection period is an integer multiple of the first time interval and synchronized with it, control becomes simple.

In one embodiment, when speech speed conversion is performed at a time when the audio data loss detection process has detected a loss of audio data, the speech speed conversion process preferably performs the conversion using the pitch that the pitch detection process detected immediately before. This suppresses the deterioration in voice quality caused by the speech speed conversion process.

In one embodiment, when speech speed conversion is performed at a time when the audio data loss detection process has detected a loss of audio data, the speech speed conversion process preferably performs the conversion using the pitch that the pitch detection process detects from the audio data compensated by the audio data loss compensation process. Even when speech speed conversion starts while audio data is missing, the pitch detection process only needs to run at a constant detection cycle, which simplifies control.

In one embodiment, it is preferable that the pitch detection process discriminates between a voice section and a non-voice section of the voice data, and makes the detection period in the non-voice section longer than the detection period in the voice section. In the present invention, since the pitch detection is performed with a relatively short detection period in the voice section, the quality of speech speed conversion processing is ensured, and the pitch detection is performed with a relatively long detection period in the non-voice section. Therefore, the processing load can be reduced.

In the third aspect of the present invention, the second software includes a voice switch processing program that suppresses howling by reducing the loop gain of the closed loop formed by the acoustic echo path, which arises from acoustic coupling between the microphone and the speaker. The voice switch processing program preferably causes the call processing unit to perform the following processes: estimating the feedback gain of the acoustic echo path; calculating, based on the estimated feedback gain, the sum of a reception-side attenuation amount, which attenuates the received voice data output from the transmission processing unit, and a transmission-side attenuation amount, which attenuates the transmitted voice data input to the transmission processing unit; monitoring the transmitted and received voice data to estimate the call state; determining the distribution between the transmission-side attenuation amount and the reception-side attenuation amount according to the estimated call state and the calculated sum; and reducing the sum according to the amount by which the estimated feedback gain decreases. Because the distribution of attenuation follows the estimated call state and the sum is reduced as the estimated feedback gain decreases, call quality with the packet transmission method can be further improved.
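The relationship between the feedback gain estimate, the total attenuation, and its distribution can be sketched as below. Everything specific here is an assumption for illustration: the decibel representation, the 6 dB `margin_db`, the state labels, the all-or-nothing and even splits, and the function names are not taken from the specification.

```python
from typing import Tuple


def total_attenuation_db(feedback_gain_db: float, margin_db: float = 6.0) -> float:
    """Total loss needed to keep the loop gain below 0 dB by an assumed margin."""
    # A smaller feedback gain estimate yields a smaller total, never below zero.
    return max(0.0, feedback_gain_db + margin_db)


def distribute(total_db: float, state: str) -> Tuple[float, float]:
    """Split the total into (transmission-side, reception-side) attenuation."""
    if state == "receiving":       # far end talking: attenuate the transmit path
        share_tx = 1.0
    elif state == "transmitting":  # near end talking: attenuate the receive path
        share_tx = 0.0
    else:                          # idle or both talking: assumed even split
        share_tx = 0.5
    return total_db * share_tx, total_db * (1.0 - share_tx)
```

With an estimated feedback gain of 4 dB, the sketch gives a total of 10 dB, all of it placed on the transmission side while the far end is talking.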

In a fourth aspect of the present invention, the dwelling unit device preferably includes an extension connection line, to which a call device installed in the home is connected, and an extension analog signal transmission unit that transmits an analog voice signal through the extension connection line, and the voice data processed by the call processing unit executing the first software is preferably transmitted from the extension analog signal transmission unit to the call device via the extension connection line. An extension call with the call device can thus be made by the analog transmission method.

In the fifth aspect of the present invention, the first software preferably includes a speech speed conversion processing program that detects the pitch of the voice from a digital voice signal, obtained by A/D converting the analog voice signal, and uses that pitch to expand or compress the digital voice signal. Because the first software includes a speech speed conversion program, the speech of the other party can be made faster or slower even in a call that uses the analog transmission method.

Preferred embodiments of the invention are described in further detail. Other features and advantages of the present invention will be better understood with reference to the following detailed description and accompanying drawings.
It is a block diagram which shows the dwelling unit of Embodiment 1 of this invention, and the system block diagram of the collective housing intercom system containing the said dwelling unit. It is a block diagram when the call processing part of Embodiment 1 of this invention is running 1st software. It is a flowchart for demonstrating the process of the audio | voice switch of Embodiment 1 of this invention. FIG. 4A is a block diagram for explaining an operation during an intercom call with the door phone slave unit according to the first embodiment of the present invention, and FIG. 4B illustrates an operation during an extension call with the sub master unit according to the first embodiment of the present invention. It is a block diagram for doing. FIG. 5A is a block diagram for explaining an operation during an interphone call with the lobby interphone according to the first embodiment of the present invention, and FIG. 5B explains an operation during an interphone call with the management room device according to the first embodiment of the present invention. FIG. 5C is a block diagram for explaining an operation during an interphone call with another dwelling unit according to Embodiment 1 of the present invention, and FIG. 5D is a lobby interphone or management room device according to Embodiment 1 of the present invention. FIG. 6 is a block diagram for explaining an operation during an interphone call between the mobile phone and the sub-master. It is a block diagram when the call processing part of Embodiment 1 of this invention is running 2nd software. It is a flowchart for demonstrating the process of the echo suppressor of Embodiment 1 of this invention. It is a block diagram which shows the audio | voice data missing compensation processing part of Embodiment 1 of this invention. 
It is a wave form diagram of a voice signal (received voice signal) for explaining a basic principle of voice data loss compensation processing of Embodiment 1 of the present invention. It is a wave form diagram of a received voice signal for demonstrating the process of the template setting part and pitch detection part of Embodiment 1 of this invention. It is the graph which showed the calculation result of the correlation value of a template when using the conventional template, and an incoming voice signal. It is a figure explaining the process of the template setting part and pitch detection part of Embodiment 1 of this invention. The graph of the correlation value of Embodiment 1 of the present invention is shown. It is a flowchart which shows the audio | voice data missing compensation process of Embodiment 1 of this invention. It is a block diagram which shows the fluctuation | variation absorption process part of Embodiment 1 of this invention. It is explanatory drawing of the calculation process of the packet count value by the count part of Embodiment 1 of this invention. It is a figure for demonstrating the role of the jitter buffer of Embodiment 1 of this invention. It is a figure which shows an example of the transmission delay characteristic which shows the relationship between transmission delay and occurrence frequency. It is a figure for demonstrating the optimal buffer size of the jitter buffer of Embodiment 1 of this invention. It is a flowchart which shows the fluctuation | variation absorption process of Embodiment 1 of this invention. It is a flowchart which shows the detail of the calculation process of the packet count value of Embodiment 1 of this invention. It is the graph which showed the relationship between the packet count value of Embodiment 1 of this invention, and the calculation time of a packet count value. FIG. 
23A is a schematic diagram showing processing at the time of packet insertion by the buffer size changing unit, and FIG. 23B is a schematic diagram showing processing at the time of packet deletion by the buffer size changing unit. It is explanatory drawing of another calculation method of the packet count value of Embodiment 1 of this invention. It is a flowchart which shows another calculation process of the packet count value of Embodiment 1 of this invention. It is a graph for demonstrating the determination process of the presence or absence of spike delay of Embodiment 1 of this invention. It is a graph which shows the relationship between the packet count value and index when the spike delay of Embodiment 1 of this invention has generate | occur | produced. 28A and 28B are diagrams illustrating the processing of the count unit according to the first embodiment of the present invention. 29A, 29B, and 29C are explanatory diagrams of processing in which the buffer size changing unit deletes one packet by overlap addition. 30A and 30B are explanatory diagrams of processing in which the buffer size changing unit deletes one invalid packet. 31A and 31B are explanatory diagrams of processing in which the buffer size changing unit inserts one packet by overlap addition. 32A and 32B are diagrams for explaining processing when five packets are inserted into the jitter buffer at one time. 33A, 33B, and 33C are diagrams for explaining processing when a valid packet corresponding to a deleted invalid packet is received after the invalid packet is deleted. FIGS. 34A and 34B are diagrams illustrating processing when the buffer size changing unit inserts a concealed packet in place of an invalid packet into the jitter buffer. It is the flowchart which showed the deletion process by the buffer size change part. It is the flowchart which showed the insertion process by the buffer size change part. 
It is a block diagram of the call processing unit of Embodiment 2 of the present invention, in which the voice pitch is shared by the voice data loss compensation processing unit and the speech rate conversion processing unit. It is an operation explanatory diagram of the pitch detection unit of Embodiment 2 of the present invention. FIGS. 39A and 39B are explanatory diagrams of the operations of the voice data loss detection unit and the pitch detection unit according to Embodiment 3 of the present invention. It is an operation explanatory diagram of Embodiment 3 of the present invention. It is an operation explanatory diagram of Embodiment 3 of the present invention. It is an operation explanatory diagram of Embodiment 3 of the present invention.

(Embodiment 1)
Hereinafter, Embodiment 1 of the present invention will be described in detail with reference to the drawings. First, an intercom system for an apartment house that includes the dwelling unit according to the present invention will be described.

As shown in FIG. 1, the intercom system for an apartment house in this embodiment includes a common unit device (lobby interphone) LI installed at the common entrance (lobby) of the apartment house, a dwelling unit A (only one shown) installed in each dwelling unit of the apartment house, a door phone slave unit B installed at the entrance of each dwelling unit, a signal trunk line Ls connected to the lobby interphone LI, a dwelling unit line Ld that branches from the signal trunk line Ls and is connected to the dwelling unit A, and a slave unit connection line Lb that connects the dwelling unit A and the door phone slave unit B. The system further includes a control device CT connected to the dwelling unit A and the lobby interphone LI via the signal trunk line Ls and the dwelling unit line Ld, and a management room device X that exchanges voice information and the like with the lobby interphone LI and each dwelling unit A. In addition, one or more (two in the illustrated example) communication devices (secondary master units) C are installed in the dwelling unit, and the dwelling unit (master unit) A and the secondary master units C are connected by the extension connection line Lc.

The door phone slave unit B includes a microphone and a speaker, a call button that accepts a visitor's call operation, and a communication unit that transmits a call signal and transmits and receives voice signals (analog transmission) to and from the dwelling unit A via the slave unit connection line Lb. When the door phone slave unit B is equipped with a camera, the visitor image captured by the camera is analog-transmitted from the door phone slave unit B to the dwelling unit A via the slave unit connection line Lb. The dwelling unit A transfers the video transmitted from the door phone slave unit B to the secondary master unit C via the extension connection line Lc. In the dwelling unit A and the secondary master unit C, the video transmitted from the door phone slave unit B is displayed on the monitor (display unit 3). If the response button of the dwelling unit A is pressed, a call can be made between the dwelling unit A and the door phone slave unit B, and if the response button of the secondary master unit C is pressed, a call can be made between the secondary master unit C and the door phone slave unit B.

The secondary master unit C includes a microphone and a speaker, a call button for receiving an extension call operation, and a communication unit that transmits a call signal and transmits and receives voice signals (analog transmission) via the extension connection line Lc.

The lobby interphone LI includes an imaging device that captures an image of the visitor, a microphone and a speaker, a numeric keypad or touch panel with which the visitor enters the dwelling unit number of the residence being visited, a transmission unit that packet-transmits voice information and video information via the signal trunk line Ls, and the like. When the numeric keypad or touch panel is operated and the lobby interphone LI accepts the input of the dwelling unit number of any dwelling unit, a packet storing the dwelling unit number in its data field and a packet storing the video of the visitor captured by the imaging device (video information) in its data field are transmitted (packet transmission) from the transmission unit to the address of the control device CT via the signal trunk line Ls.

The management room device X includes a microphone and a speaker, a numeric keypad or touch panel with which an administrator enters the dwelling unit number of the contact destination, a transmission unit that transmits voice information via the signal trunk line Ls, and the like. When the numeric keypad or touch panel is operated and the management room device X accepts the input of the dwelling unit number of any dwelling unit, a packet storing the dwelling unit number in its data field is transmitted from the transmission unit to the address of the control device CT via the signal trunk line Ls.

The control device CT stores the correspondence between the address assigned to the dwelling unit A of each dwelling unit and the dwelling unit number of that dwelling unit. It compares the dwelling unit number stored in the data field of a packet received from the lobby interphone LI or the management room device X with this correspondence and converts it into an address. It then sends to the signal trunk line Ls a packet in which that address is stored in the destination address field and a call command notifying the call from the lobby interphone LI or the management room device X is stored in the data field, together with a packet storing the video information in its data field. Since the lobby interphone LI, the management room device X, and the control device CT described above are conventionally known, detailed illustration and description thereof are omitted.
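The address conversion performed by the control device CT can be illustrated with a minimal sketch. The packet layout, the `kind` tag, the table contents, and the names `Packet`, `UNIT_TO_ADDR`, and `route_call` are all hypothetical and chosen only for illustration; the description does not specify them, nor what happens for an unknown dwelling unit number (here nothing is sent, as an assumption).

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Packet:
    dest_addr: int   # destination address field
    kind: str        # hypothetical tag: "control", "video" or "voice"
    data: object     # data field


# Hypothetical correspondence table: dwelling unit number -> address of dwelling unit A
UNIT_TO_ADDR = {"101": 0x11, "102": 0x12, "201": 0x21}
CT_ADDR = 0x01  # hypothetical address of the control device CT


def route_call(request: Packet, video: list[Packet], caller: str) -> list[Packet]:
    """Convert the unit number in the data field to an address and build the outgoing packets."""
    addr = UNIT_TO_ADDR.get(request.data)
    if addr is None:
        return []  # assumed behaviour for an unknown dwelling unit number
    # Packet carrying the call command, addressed to the dwelling unit A
    out = [Packet(addr, "control", {"command": "call", "from": caller})]
    # Video packets are re-addressed to the same dwelling unit A
    out += [Packet(addr, "video", v.data) for v in video]
    return out
```

For example, a request from the lobby interphone LI for unit "102" would yield one call-command packet and the forwarded video packets, all addressed to 0x12 in this hypothetical table.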

The dwelling unit A includes a control unit 1, a microphone 2a and a speaker 2b, a call processing unit 2, a display unit 3, a video processing unit 4, a storage unit 5, a call detection unit 6, a transmission processing unit 7, a secondary master communication processing unit 8, an analog signal transmission unit 9, a first conversion processing unit 10, a second conversion processing unit 11, a first switching unit 12, a second switching unit 13, a third switching unit 14, and the like.

The analog voice signal (transmitted voice signal) output from the microphone 2a is amplified by the amplifier AMP1, converted into a digital voice signal (transmitted voice data) by the A/D converter 10a of the first conversion processing unit 10, and input to the call processing unit 2. The digital voice signal (received voice signal) that has undergone call processing in the call processing unit 2 is converted into an analog received voice signal by the D/A converter 10b of the first conversion processing unit 10, amplified by the amplifier AMP2, and output to the speaker 2b.

On the other hand, in the case of a door phone call or an extension call described later, the digital transmitted voice signal (transmitted voice data) processed by the call processing unit 2 is converted into an analog transmitted voice signal by the D/A converter 11a of the second conversion processing unit 11, amplified by the amplifier AMP3, and output to the analog signal transmission unit 9. In the case of an interphone call or a dwelling unit call described later, however, the digital transmitted voice signal processed by the call processing unit 2 is output directly to the transmission processing unit 7. Further, the analog received voice signal output from the analog signal transmission unit 9 is amplified by the amplifier AMP4, converted into a digital received voice signal (received voice data) by the A/D converter 11b of the second conversion processing unit 11, and input to the call processing unit 2, whereas the digital received voice signal output from the transmission processing unit 7 is input directly to the call processing unit 2. The analog signal transmission unit 9 is composed of a conventionally known 2-wire/4-wire converter (hybrid transformer).

The first switching unit 12 is connected to the two-wire side of the analog signal transmission unit 9. The first switching unit 12 selectively switches between a state in which the two-wire side of the analog signal transmission unit 9 is connected to the slave unit connection line Lb and a state in which it is connected to the second switching unit 13. The second switching unit 13 selectively switches between a state in which the first switching unit 12 is connected to the extension connection line Lc and a state in which it is not connected. Further, the third switching unit 14 selectively switches between a state in which the slave unit connection line Lb and the extension connection line Lc are connected and a state in which they are not connected. Note that the switching of the first to third switching units 12, 13, and 14 is all controlled by the control unit 1.

The control unit 1 has a microcomputer as a main component and controls the entire dwelling unit A including the switching control. The display unit 3 includes a display device such as a liquid crystal display, a driver circuit that drives the display device, a touch panel as an input device, and the like. As will be described later, the video processing unit 4 performs signal processing on the video signal received from the transmission processing unit 7 and displays the video on the display unit 3. Specifically, a video (still image or moving image) of a visitor packet-transmitted from the lobby interphone LI is displayed on the display unit 3.

The call processing unit 2 includes a microprocessor, an ASIC (Application Specific Integrated Circuit), or a DSP (Digital Signal Processor), performs various controls and calculations for call processing, and applies various signal processing (call processing) to the voice signals (transmitted voice data and received voice data). The storage unit 5 includes an electrically rewritable nonvolatile semiconductor memory (flash memory or the like) and stores first software and second software. The first software is a collection of programs for performing various call processing on voice signals transmitted by the analog transmission method through the analog signal transmission unit 9. The second software is a collection of programs for performing various call processing on voice signals transmitted by the packet transmission method through the transmission processing unit 7. Details of each program will be described later.

The transmission processing unit 7 performs packet transmission with the control device CT and other dwelling units A via the signal trunk line Ls (including the dwelling unit line Ld; the same applies hereinafter). The transmission processing unit 7 divides the control signal (control data) created by the control unit 1 to create packets (control packets), and likewise divides the transmitted voice signal (transmitted voice data) created by the call processing unit 2 to create packets (voice packets). Further, the transmission processing unit 7 encodes the control packets and the voice packets, converts (modulates) the encoded bit string into an electric signal, and sends the electric signal to the signal trunk line Ls. The transmission processing unit 7 also converts (demodulates) an electric signal flowing through the signal trunk line Ls into a bit string and decodes packets (voice packets, control packets, video packets) from the demodulated bit string. The transmission processing unit 7 discards a decoded packet if its address does not match its own address (the address of the dwelling unit A). If the address matches, the data contained in the data field of the packet is output to the video processing unit 4 if it is video data (video signal), to the control unit 1 if it is control data (control signal), and to the call processing unit 2 if it is voice data (voice signal).
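The receive-side handling described above (address check followed by routing on the type of data) can be sketched as follows. The dictionary-based packet, the `kind` tag, and the destination names are hypothetical; the description does not specify how the data type is encoded in the packet.

```python
OWN_ADDR = 0x12  # hypothetical address assigned to this dwelling unit A

# Hypothetical mapping from data type to the unit that receives the data field
ROUTES = {
    "video": "video_processing_unit_4",
    "control": "control_unit_1",
    "voice": "call_processing_unit_2",
}


def dispatch(packet: dict, own_addr: int = OWN_ADDR):
    """Return the destination unit for a decoded packet, or None if it is discarded."""
    if packet["dest"] != own_addr:
        return None  # address mismatch: the packet is discarded
    return ROUTES.get(packet["kind"])  # unknown types are also dropped (assumption)
```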

The secondary master communication processing unit 8 encodes and frequency-modulates the control data for the secondary master unit created by the control unit 1 and transmits it to the secondary master unit C via the extension connection line Lc. It also frequency-demodulates and decodes a control signal transmitted from the secondary master unit C via the extension connection line Lc and passes the resulting control data to the control unit 1.

Next, the operation of the intercom system for an apartment house in this embodiment will be described. First, the door phone call between the dwelling unit A and the door phone slave unit B will be described. When the call button of the door phone slave unit B is operated by a visitor, a call signal is transmitted from the door phone slave unit B via the slave unit connection line Lb. In the dwelling unit A, the call detection unit 6 that has detected the call signal outputs a call detection signal to the control unit 1. On receiving the call detection signal, the control unit 1 sounds a ringing tone from the speaker 2b. When the door phone slave unit B is equipped with a camera, the camera is activated after the call button is operated, images the visitor, and the captured video is transmitted from the door phone slave unit B via the slave unit connection line Lb. In the dwelling unit A, the video transmitted through the slave unit connection line Lb is displayed on the display unit 3 by the video processing unit 4. When a resident who has heard the ringing tone confirms the video of the visitor displayed on the display unit 3 and operates a response button (not shown) provided on the dwelling unit A, the control unit 1 controls the first switching unit 12 so that the two-wire side of the analog signal transmission unit 9 is connected to the slave unit connection line Lb, switches the third switching unit 14 to the disconnected state, and instructs the call processing unit 2 to load and execute the first software stored in the storage unit 5. Then, as shown in FIG. 4A, the call processing unit 2 executes the first software to perform the call processing, so that the resident of the dwelling unit and the visitor can make a door phone call using the dwelling unit A and the door phone slave unit B.

Here, the control unit 1 that has received the call detection signal causes the secondary master communication processing unit 8 to transmit a door phone call control signal, and switches the third switching unit 14 to the connected state so that the slave unit connection line Lb and the extension connection line Lc are connected and the video transmitted via the slave unit connection line Lb is forwarded to the secondary master unit C. In the secondary master unit C that has received the control signal, a ringing tone is sounded from the speaker and the video of the visitor is displayed on the monitor. When a resident who has heard the ringing tone confirms the video of the visitor displayed on the monitor and operates the response button of the secondary master unit C, a door phone response control signal is transmitted from the secondary master unit C to the dwelling unit A via the extension connection line Lc. In the dwelling unit A, the door phone response control signal (control data) is output from the secondary master communication processing unit 8 to the control unit 1, and the control unit 1 that has received the control data keeps the third switching unit 14 in the connected state. As a result, the resident of the dwelling unit and the visitor can make a door phone call using the secondary master unit C and the door phone slave unit B. In this case, the call processing unit 2 of the dwelling unit A does not perform any call processing.

Next, an extension call between the dwelling unit A and the secondary master unit C will be described. When the extension call button of the secondary master unit C is operated by a resident, an extension call control signal is transmitted from the secondary master unit C via the extension connection line Lc. In the dwelling unit A, the extension call control signal (control data) is output from the secondary master communication processing unit 8 to the control unit 1. On receiving the extension call control data, the control unit 1 sounds a ringing tone from the speaker 2b. When another resident who has heard the ringing tone operates the response button provided on the dwelling unit A, the control unit 1 controls the first switching unit 12 so that the two-wire side of the analog signal transmission unit 9 is connected to the second switching unit 13, and controls the second switching unit 13 so that the first switching unit 12 is connected to the extension connection line Lc. Further, the control unit 1 instructs the call processing unit 2 to load and execute the first software stored in the storage unit 5. Then, as shown in FIG. 4B, the call processing unit 2 executes the first software to perform the call processing, so that residents in the same dwelling unit can make an extension call using the dwelling unit A and the secondary master unit C.

Note that the extension call control signal transmitted from one secondary master unit C is received not only by the dwelling unit A but also by the other secondary master unit C. When the response button is operated on the other secondary master unit C that has received the control signal, a communication path is formed between the two secondary master units C via the extension connection line Lc, and residents of the same dwelling unit can make an extension call using the two secondary master units C.

Here, the call processing performed by the call processing unit 2 executing the first software will be described. The first software includes a voice switch processing program for switching the call direction, an acoustic side echo canceller processing program for suppressing acoustic echo, a line side echo canceller processing program for suppressing line echo, and a speech speed conversion processing program for slowing down or speeding up the speed (speech speed) of the other party's voice output from the speaker 2b.

As shown in FIG. 2, the call processing unit 2 executing the first software includes a voice switch VS, an acoustic side echo canceller EC1, a line side echo canceller EC2, and a speech rate conversion processing unit SE. The voice switch VS, the acoustic side echo canceller EC1, the line side echo canceller EC2, and the speech rate conversion processing unit SE are realized by a signal processing circuit such as the DSP constituting the call processing unit 2 executing the voice switch processing program, the acoustic side echo canceller processing program, the line side echo canceller processing program, and the speech rate conversion processing program, respectively. In FIG. 2, the first and second conversion processing units 10 and 11 are not shown.

The acoustic side echo canceller EC1 has a conventionally known configuration comprising an adaptive filter ADF1 and a subtractor SUB1. The adaptive filter ADF1 adaptively identifies the impulse response of the feedback path (acoustic echo path) H_AC formed by the acoustic coupling between the speaker 2b and the microphone 2a, and the echo component (acoustic echo) estimated from the reference signal (the output signal to the first conversion processing unit 10) is subtracted by the subtractor SUB1 from the input signal from the first conversion processing unit 10 (the transmitted voice signal), whereby the echo component is suppressed. The line side echo canceller EC2 likewise has a conventionally known configuration comprising an adaptive filter ADF2 and a subtractor SUB2. The adaptive filter ADF2 adaptively identifies the impulse response of the feedback path (line echo path) H_LIN formed by reflection due to impedance mismatch between the analog signal transmission unit 9 and the transmission path (slave unit connection line Lb or extension connection line Lc) and by acoustic coupling between the speaker and the microphone of the other party's call terminal (door phone slave unit B or secondary master unit C), and the echo component (line echo) estimated from the reference signal (the output signal to the second conversion processing unit 11, that is, the transmitted voice signal) is subtracted from the received voice signal by the subtractor SUB2, whereby the echo component is suppressed.
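The combination of an adaptive filter and a subtractor can be illustrated with a short sketch. The description only states that the configuration is conventionally known and does not name the adaptation algorithm; the normalized LMS (NLMS) update used here is an assumption chosen for illustration, as are the function names, the tap count, and the step size.

```python
import random


def simulate_echo(reference, path):
    """Echo seen at the subtractor input: the reference convolved with the echo path."""
    return [
        sum(h * reference[n - k] for k, h in enumerate(path) if n - k >= 0)
        for n in range(len(reference))
    ]


def nlms_echo_cancel(reference, observed, taps=3, mu=0.5, eps=1e-8):
    """Adaptive filter (ADF) plus subtractor (SUB); NLMS is an assumed update rule."""
    w = [0.0] * taps      # current estimate of the echo path impulse response
    buf = [0.0] * taps    # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, observed):
        buf = [x] + buf[:-1]
        echo_est = sum(wi * bi for wi, bi in zip(w, buf))  # estimated echo component
        e = d - echo_est                                   # subtractor output
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        residual.append(e)
    return residual, w


# Usage: identify a hypothetical 3-tap echo path from a noise-like reference signal
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_path = [0.5, -0.3, 0.1]
res, w_est = nlms_echo_cancel(ref, simulate_echo(ref, true_path))
```

In this idealized setting (no near-end speech, no noise, filter length equal to the path length) the coefficient estimate approaches the true path and the residual echo decays toward zero; a real canceller would also need double-talk handling, which is outside this sketch.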

A voice switch VS is provided between the acoustic side echo canceller EC1 and the line side echo canceller EC2. The voice switch VS includes a transmission side attenuator 100 that attenuates the transmitted voice signal, a reception side attenuator 101 that attenuates the received voice signal, and an insertion loss amount control unit 102 that controls the attenuation amounts (insertion loss amounts) of the transmission side and reception side attenuators 100 and 101. The insertion loss amount control unit 102 includes a total loss amount calculation unit 103 and an insertion loss amount distribution processing unit 104. The total loss amount calculation unit 103 estimates the acoustic side feedback gain α of the path that returns from the output point Rout of the reception side attenuator 101 to the input point Tin of the transmission side attenuator 100 via the acoustic echo path H_AC (hereinafter referred to as the "acoustic side feedback path"), and the line side feedback gain β of the path that returns from the output point Tout of the transmission side attenuator 100 to the input point Rin of the reception side attenuator 101 via the line echo path H_LIN (hereinafter referred to as the "line side feedback path"). Based on the estimated values α′ and β′ of the acoustic side and line side feedback gains α and β, it calculates the total amount of loss to be inserted into the closed loop, that is, the sum of the attenuation amount (insertion loss amount) of the transmission side attenuator 100 and the attenuation amount (insertion loss amount) of the reception side attenuator 101. The insertion loss amount distribution processing unit 104 monitors the transmitted voice signal and the received voice signal to estimate the call state and, according to the estimation result and the value calculated by the total loss amount calculation unit 103, determines how the attenuation amounts (insertion loss amounts) are distributed between the transmission side attenuator 100 and the reception side attenuator 101.

The total loss amount calculation unit 103 estimates the short-time time-average power of the input signal (transmitted voice signal) of the transmission side attenuator 100 using a rectifier smoother, a low-pass filter, or the like, and likewise estimates the short-time time-average power of the output signal (received voice signal) of the reception side attenuator 101. It then obtains the minimum of the estimated time-average power of the output signal of the reception side attenuator 101 within the maximum delay time assumed for the acoustic echo path H_AC, and takes the value obtained by dividing the estimated time-average power of the input signal of the transmission side attenuator 100 by this minimum as the estimated value α′ of the acoustic side feedback gain α. Similarly, the total loss amount calculation unit 103 estimates the short-time time-average power of the input signal (received voice signal) of the reception side attenuator 101 and of the output signal (transmitted voice signal) of the transmission side attenuator 100 using a rectifier smoother, a low-pass filter, or the like, obtains the minimum of the estimated time-average power of the output signal of the transmission side attenuator 100 within the maximum delay time assumed for the line echo path H_LIN, and takes the value obtained by dividing the estimated time-average power of the input signal (received voice signal) of the reception side attenuator 101 by this minimum as the estimated value β′ of the line side feedback gain β. The total loss amount calculation unit 103 then calculates, from the estimated values α′ and β′ of the acoustic side feedback gain α and the line side feedback gain β, the total loss amount Lt necessary to obtain a desired gain margin MG, and outputs the value Lt to the insertion loss amount distribution processing unit 104.
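The feedback gain estimation described above can be sketched as follows. The one-pole smoother standing in for the rectifier smoother and low-pass filter, the smoothing constant, the window length, and the class and parameter names are all assumptions made for illustration; the description gives no numerical values.

```python
from collections import deque


class FeedbackGainEstimator:
    """Estimate a feedback gain as (returned power) / (minimum emitted power in a window)."""

    def __init__(self, max_delay_samples, smooth=0.1, floor=1e-12):
        self.smooth = smooth      # assumed one-pole low-pass coefficient
        self.floor = floor        # guards against division by zero during silence
        self.p_returned = 0.0     # short-time power at the attenuator input
        self.p_emitted = 0.0      # short-time power at the other attenuator's output
        # Emitted-power estimates covering the maximum delay assumed for the echo path
        self.history = deque(maxlen=max_delay_samples)

    def update(self, returned_sample, emitted_sample):
        a = self.smooth
        self.p_returned += a * (returned_sample ** 2 - self.p_returned)
        self.p_emitted += a * (emitted_sample ** 2 - self.p_emitted)
        self.history.append(self.p_emitted)
        return self.p_returned / max(min(self.history), self.floor)


# Usage: a constant emitted level of 1.0 that returns at half amplitude
est = FeedbackGainEstimator(max_delay_samples=8)
gain = 0.0
for _ in range(500):
    gain = est.update(returned_sample=0.5, emitted_sample=1.0)
```

For the acoustic side estimate α′, `emitted_sample` would be taken at the output of the reception side attenuator 101 and `returned_sample` at the input of the transmission side attenuator 100; for β′ the roles of the two attenuators are exchanged.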

The insertion loss amount distribution processing unit 104 monitors the input and output signals of the transmission side attenuator 100 and of the reception side attenuator 101, and estimates the call state (receiving state, transmitting state, and so on) from information such as the power levels of these signals and the presence or absence of speech. It then adjusts the attenuation amounts (insertion loss amounts) of the attenuators 100 and 101 so that the total loss amount Lt is distributed between the transmission side attenuator 100 and the reception side attenuator 101 at a ratio that depends on the estimated call state.

The total loss amount calculation unit 103 has two operation modes: an update mode in which the total loss amount to be inserted into the closed loop is calculated and adaptively updated based on the estimated values α′ and β′ of the feedback gains α and β as described above, and a fixed mode in which the total loss amount is fixed to a predetermined initial value. The total loss amount calculation unit 103 operates in the fixed mode from the start of a call with the other party's call terminal until the acoustic side and line side echo cancellers EC1 and EC2 have sufficiently converged, and operates in the update mode thereafter. Specifically, when both the estimated value α′ of the acoustic side feedback gain α and the estimated value β′ of the line side feedback gain β remain below a predetermined threshold ε (for example, a value 10 dB to 15 dB smaller than the estimated values α′ and β′ at the start of the call) continuously for a predetermined time (several hundred milliseconds) after the start of the call, the total loss amount calculation unit 103 regards the acoustic side and line side echo cancellers EC1 and EC2 as sufficiently converged and switches its operation mode from the fixed mode to the update mode, in which the total loss amount is adaptively updated based on the estimated values α′ and β′. Note that the initial value of the total loss amount in the fixed mode is set to a value sufficiently larger than the total loss amount updated as needed in the update mode.
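The switch from the fixed mode to the update mode can be sketched as a small state machine. The 12 dB drop and the hold length counted in samples are placeholder values within the spirit of the ranges mentioned in the text, and the class and parameter names are hypothetical.

```python
class TotalLossModeController:
    """Fixed mode until both gain estimates stay below their thresholds long enough."""

    def __init__(self, alpha_start_db, beta_start_db, drop_db=12.0, hold_samples=3):
        # Threshold: a fixed number of dB below the estimates at the start of the call
        self.alpha_threshold_db = alpha_start_db - drop_db
        self.beta_threshold_db = beta_start_db - drop_db
        self.hold_samples = hold_samples  # stands in for "several hundred milliseconds"
        self.below_count = 0
        self.mode = "fixed"

    def step(self, alpha_db, beta_db):
        if self.mode == "fixed":
            if alpha_db < self.alpha_threshold_db and beta_db < self.beta_threshold_db:
                self.below_count += 1
            else:
                self.below_count = 0  # the condition must hold continuously
            if self.below_count >= self.hold_samples:
                self.mode = "update"
        return self.mode
```

Once the controller has entered the update mode it stays there for the rest of the call in this sketch; the description does not mention a return to the fixed mode.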

Thus, while the acoustic side and line side echo cancellers EC1 and EC2 have not yet sufficiently converged immediately after the start of the call, the total loss amount calculation unit 103 operating in the fixed mode inserts a sufficiently large initial total loss amount into the closed loop, so that unpleasant echoes (acoustic echo and line echo) and howling are suppressed and a stable half-duplex call is realized. Once the acoustic side and line side echo cancellers EC1 and EC2 have sufficiently converged after the start of the call, the operation mode of the total loss amount calculation unit 103 switches from the fixed mode to the update mode and the total loss amount inserted into the closed loop decreases to a value sufficiently lower than the initial value, so that a two-way simultaneous call can be realized.

Here, the specific operation of the total loss amount calculation unit 103 in the update mode will be described with reference to the flowchart of FIG.

The total loss calculation unit 103 executes an estimation process of the acoustic side feedback gain α and the line side feedback gain β at a predetermined sampling period from the time when the fixed mode is changed to the update mode, and the estimated value α ′ (n), β ′ (n) is calculated (step 1), and the gain margin of the closed loop is maintained at MG [dB] from the product of these two estimated values α ′ (n) and β ′ (n) and the gain margin MG. The desired total loss amount Lr (n) required for the above is calculated by the following equation (step 2).

Lr (n) = 20log | α '(n) · β' (n) | + MG [dB]
Note that α′(n), β′(n), and Lr(n) denote, respectively, the estimated values of the feedback gains and the desired total loss amount calculated at the n-th sampling from the transition to the update mode. The total loss amount calculation unit 103 then compares the n-th desired total loss amount Lr(n) calculated from the above formula with the previous ((n−1)-th) total loss amount Lt(n−1), that is, the total loss amount determined in the previous processing and actually inserted. When the desired total loss amount Lr(n) calculated this time is larger than the previous total loss amount Lt(n−1), the value obtained by adding a slight increment Δi [dB] to Lt(n−1) is taken as the current total loss amount, Lt(n) = Lt(n−1) + Δi (steps 3 and 4). When Lr(n) is smaller than Lt(n−1), the value obtained by subtracting a slight decrement Δd [dB] from Lt(n−1) is taken as the current total loss amount, Lt(n) = Lt(n−1) − Δd (steps 5 and 6).
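Steps 2 to 6 can be written compactly as below. This is a minimal sketch: the base-10 logarithm is assumed from the dB unit, the step sizes of 0.5 dB are placeholders, and the behaviour when Lr(n) equals Lt(n−1) (left unchanged here) is an assumption because the text does not state it.

```python
import math


def desired_total_loss(alpha_est, beta_est, margin_db):
    """Step 2: Lr(n) = 20 log|alpha'(n) * beta'(n)| + MG [dB] (base-10 log assumed)."""
    return 20.0 * math.log10(abs(alpha_est * beta_est)) + margin_db


def update_total_loss(prev_loss_db, desired_db, inc_db=0.5, dec_db=0.5):
    """Steps 3 to 6: move Lt by at most one small step toward the desired value."""
    if desired_db > prev_loss_db:
        return prev_loss_db + inc_db   # Lt(n) = Lt(n-1) + delta_i
    if desired_db < prev_loss_db:
        return prev_loss_db - dec_db   # Lt(n) = Lt(n-1) - delta_d
    return prev_loss_db                # equal case: assumed unchanged


# Usage: one update with hypothetical estimates and a 6 dB gain margin
lr = desired_total_loss(1.0, 1.0, 6.0)
lt = update_total_loss(10.0, lr)
```

Because each update changes Lt by at most one step, a sudden jump in the estimates is followed only gradually.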

By limiting the increase or decrease of the total loss amount in the total loss amount calculation unit 103 to the small values Δi and Δd in this way, audible discomfort can be avoided even when the acoustic side feedback gain α and the line side feedback gain β change sharply, for example immediately after the start of a call with the other party's call terminal (door phone slave unit B or secondary master unit C), when the acoustic side and line side echo cancellers EC1 and EC2 are actively updating their coefficients toward convergence.

The speech rate conversion processing unit SE converts the speech rate of the original speech by expanding or compressing the speech (received voice). For example, based on the conventionally known speech rate conversion algorithm called PICOLA (Pointer Interval Controlled OverLap and Add), the speech rate is converted (made faster or slower) by inserting or deleting waveforms in pitch units. Here, "pitch" refers to the height of the voice determined by the vibration period of the vocal cords: if the vibration period of the vocal cords is short, the voice is high, and if the vibration period is long, the voice is low. Therefore, if the speech rate conversion processing unit SE performs the speech rate conversion processing during a door phone call with the door phone slave unit B or an extension call with the secondary master unit C, the speech of the other party output from the speaker 2b of the dwelling unit A can be made faster or slower than the rate at which the other party actually spoke.
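The idea of deleting a waveform in pitch units can be illustrated with a simplified sketch in the spirit of PICOLA's compression step: two consecutive pitch periods are cross-faded into one. The pitch period is assumed to be already known (pitch detection is not shown), and the function name and linear fade are illustrative choices rather than the method of this embodiment.

```python
def remove_one_pitch_period(samples, pitch):
    """Merge the first two pitch periods into one by a linear cross-fade."""
    if pitch <= 0 or len(samples) < 2 * pitch:
        return list(samples)  # not enough material: return the input unchanged
    first = samples[:pitch]
    second = samples[pitch:2 * pitch]
    # The first period fades out while the second fades in
    merged = [
        first[i] * (1 - i / pitch) + second[i] * (i / pitch)
        for i in range(pitch)
    ]
    return merged + list(samples[2 * pitch:])


# Usage: a perfectly periodic signal keeps its waveform but loses one period
periodic = [0, 1, 2, 3] * 3
shorter = remove_one_pitch_period(periodic, pitch=4)
```

Applying the merge repeatedly at suitable intervals shortens the signal, which corresponds to faster speech; slowing down would instead insert a cross-faded period.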

Next, the interphone call between the dwelling unit A and the lobby interphone LI will be described. When a visitor operates the numeric keypad or touch panel and the lobby interphone LI accepts the input of the dwelling unit number of any dwelling unit, a packet storing the dwelling unit number in its data field and a packet storing the video of the visitor captured by the imaging device (video data) in its data field are transmitted (packet transmission) from the transmission unit to the address of the control device CT via the signal trunk line Ls. The control device CT sends to the signal trunk line Ls a packet storing in its data field a call command notifying the call from the lobby interphone LI, and a packet storing the video data in its data field.

In the dwelling unit A installed in the dwelling unit with that dwelling unit number, when the transmission processing unit 7 receives the packets via the dwelling unit line Ld, it outputs the call command (control signal) stored in the data field of the packet to the control unit 1 and outputs the video data stored in the data field to the video processing unit 4. On receiving the call command, the control unit 1 sounds a ringing tone from the speaker 2b. The video processing unit 4 processes the video signal received from the transmission processing unit 7 and causes the display unit 3 to display the video of the visitor. When a resident who has heard the ringing tone confirms the video of the visitor displayed on the display unit 3 of the dwelling unit A and then operates the response button, the control unit 1 instructs the call processing unit 2 to load and execute the second software stored in the storage unit 5. Then, as shown in FIG. 5A, the call processing unit 2 executes the second software to perform the call processing, so that the resident of the dwelling unit and the visitor can make an interphone call using the dwelling unit A and the lobby interphone LI. Here, as shown on the left side of FIG. 5A, the lobby interphone LI has almost the same configuration as the dwelling unit A shown on the right side of the same figure, except for the speech rate conversion processing unit SE. Components having functions in common with those of the dwelling unit A are therefore given the same reference numerals.

Subsequently, an intercom call between the dwelling unit A and the management room device X will be described. In the management room device X, when the manager operates the numeric keypad or the touch panel and an operation input of the dwelling unit number of any dwelling unit is accepted, a packet storing the dwelling unit number in its data field is transmitted from the transmission unit, via the signal trunk line Ls, to the address of the control device CT (packet transmission). The control device CT sends to the signal trunk line Ls a packet whose data field stores a call command notifying a call from the management room device X.

In the dwelling unit A installed in the dwelling unit with that dwelling unit number, when the transmission processing unit 7 receives the packet via the dwelling unit line Ld, it outputs the call command (control signal) stored in the data field of the packet to the control unit 1. When the control unit 1 receives the call command, the control unit 1 causes the speaker 2b to sound a ringing tone. When the resident who hears the ringing tone operates the response button, the control unit 1 instructs the call processing unit 2 to load and execute the second software stored in the storage unit 5. Then, as shown in FIG. 5B, the call processing unit 2 executes the second software to perform the call processing, so that the resident of the dwelling unit and the manager can make an intercom call using the dwelling unit A and the management room device X. Here, as shown on the left side of FIG. 5B, the management room device X has substantially the same configuration as the dwelling unit A on the right side of FIG. 5B. Therefore, parts having functions in common with those of the dwelling unit A are given the same reference numerals.

However, the secondary master unit C can also respond to a call from the lobby intercom LI or the management room device X. When the secondary master unit C responds to such a call, the call processing unit 2 of the dwelling unit A executes the second software, as shown in the figure, so that the resident of the dwelling unit and the visitor or manager can make an intercom call using the secondary master unit C and the lobby intercom LI or the management room device X.

Furthermore, an intercom call between dwelling units A installed in different dwelling units will be described. In the dwelling unit A, when the resident operates the numeric keypad and an operation input of the dwelling unit number of another dwelling unit is accepted, a packet storing that dwelling unit number in its data field is transmitted from the transmission unit, via the signal trunk line Ls, to the address of the control device CT (packet transmission). The control device CT sends to the signal trunk line Ls a packet whose data field stores a call command notifying the call from the dwelling unit A.

In the other dwelling unit A installed in the dwelling unit with that dwelling unit number, when the transmission processing unit 7 receives the packet via the dwelling unit line Ld, the call command (control signal) stored in the data field of the packet is output to the control unit 1. When the control unit 1 receives the call command, the control unit 1 causes the speaker 2b to sound a ringing tone. When the resident who hears the ringing tone operates the response button, the control unit 1 instructs the call processing unit 2 to load and execute the second software stored in the storage unit 5. Then, as shown in FIG. 5C, the call processing unit 2 in the dwelling unit A of each dwelling unit executes the second software to perform call processing, so that residents of different dwelling units can make an intercom call using their dwelling units A.

Here, the call processing performed by the call processing unit 2 by executing the second software will be described. The second software includes a voice switch processing program for switching the call direction, an acoustic echo canceller processing program for suppressing acoustic echo, an echo suppressor processing program for suppressing residual echo, a voice data loss compensation processing program for compensating for loss of voice data caused by packet loss associated with packet transmission, a fluctuation absorption processing program for absorbing the delay fluctuation (jitter) associated with packet transmission, and a speech speed conversion processing program for decreasing or increasing the speed (speech speed) of the other party's voice output from the speaker 2b.

As shown in FIG. 6, the call processing unit 2 executing the second software includes a voice switch VS, an acoustic echo canceller EC1, an echo suppressor ES, a speech speed conversion processing unit SE, a voice data loss compensation processing unit VC, and a fluctuation absorption processing unit JA. The voice switch VS, the acoustic echo canceller EC1, the echo suppressor ES, the speech speed conversion processing unit SE, the voice data loss compensation processing unit VC, and the fluctuation absorption processing unit JA are realized by a signal processing circuit such as a DSP constituting the call processing unit 2 executing, respectively, the voice switch processing program, the acoustic echo canceller processing program, the echo suppressor processing program, the speech speed conversion processing program, the voice data loss compensation processing program, and the fluctuation absorption processing program. In FIG. 6, the first and second conversion processing units 10 and 11 are not shown.

Since the acoustic echo canceller EC1 has the same configuration as the acoustic echo canceller EC1 used when the first software is executed, a detailed illustration of its configuration is omitted. The voice switch VS likewise has the same configuration as the voice switch VS used when the first software is executed, so a detailed illustration of its configuration is also omitted. However, the voice switch VS in the second software differs from the voice switch VS in the first software in that the total loss amount calculated by the total loss amount calculation unit 103 is reduced in accordance with the reduction in the estimated value α′ of the acoustic side feedback gain α. That is, in the voice switch VS in the first software, which corresponds to the analog transmission method, the total loss amount calculation unit 103 must calculate the total loss amount taking into account two feedback gains, the acoustic side feedback gain α and the line side feedback gain β. In the packet transmission method, on the other hand, no feedback path is formed on the line side, so the line side feedback gain β need not be considered. Therefore, in the voice switch VS in the second software, reducing the total loss amount calculated by the total loss amount calculation unit 103 in accordance with the reduction in the estimated value α′ of the acoustic side feedback gain α, as described above, allows a two-way simultaneous call to be realized more reliably.

The echo suppressor ES is provided between the transmission processing unit 7 and the voice switch VS in the signal path of the transmission voice signal, and attenuates residual echo (acoustic echo that could not be suppressed by the acoustic echo canceller EC1; the same applies hereinafter). In the packet transmission method, in which voice data is divided into packets and transmitted, the transmission delay is longer than in the analog transmission method, and a residual echo that cannot be suppressed by the acoustic echo canceller EC1 occurs. It is therefore necessary to increase the amount of echo suppression by means of the echo suppressor ES. Note that the echo suppressor ES must attenuate the residual echo effectively while leaving the voice signal to be transmitted (transmission voice signal) unattenuated.

The echo suppressor ES attenuates the transmission voice signal in conjunction with the voice switch VS, and specifically operates as shown in the flowchart. That is, the echo suppressor ES constantly monitors the state of the voice switch VS (the call state, receiving or transmitting, estimated by the insertion loss distribution processing unit 104) (step 1). When the voice switch VS is in the receiving state, the echo suppressor ES assumes that there is no transmission voice signal to be sent on the signal path, and attenuates the input signal by multiplying it by an attenuation coefficient (step 2). On the other hand, when the voice switch VS is not in the receiving state, the echo suppressor ES determines that there is no residual echo to be cancelled or that there is a transmission voice signal to be sent, does not apply the attenuation coefficient to the input signal, and outputs the input signal as it is without attenuation (step 3).
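The three steps above can be sketched as follows. This is only an illustrative sketch: the names `CallState` and `echo_suppress` and the default attenuation coefficient of 0.1 are assumptions introduced here and do not appear in the description.

```python
from enum import Enum


class CallState(Enum):
    """Call state as estimated on the voice switch side (hypothetical type)."""
    RECEIVING = "receiving"
    TRANSMITTING = "transmitting"


def echo_suppress(samples, state, attenuation=0.1):
    """Attenuate the input only while the voice switch is in the receiving state.

    samples: one block of the signal on the transmission path.
    state: current state of the voice switch (step 1: monitor the state).
    attenuation: assumed attenuation coefficient, 0 <= attenuation <= 1.
    """
    if state is CallState.RECEIVING:
        # Step 2: no near-end speech is assumed, so what is present is treated
        # as residual echo and multiplied by the attenuation coefficient.
        return [s * attenuation for s in samples]
    # Step 3: pass the signal through unchanged so near-end speech keeps its level.
    return list(samples)
```

Gating the attenuation on the receiving state is what keeps the near-end speaker's level unchanged while that speaker is talking.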

Thus, even when a transmission delay occurs in the voice exchanged with the other party's call device (the lobby intercom LI, the management room device X, or another dwelling unit A), the residual echo generated in the signal path of the transmission voice signal because of that delay can be attenuated by the echo suppressor ES. As a result, a two-way simultaneous call can be reliably realized even with the packet transmission method. If the echo suppressor ES were to attenuate the transmission voice signal when the voice switch VS is not in the receiving state, for example in the transmitting state, the voice of the near-end speaker (the resident talking on the dwelling unit A) might be attenuated inadvertently, causing the volume of the near-end speaker heard from the other party's call device to fluctuate. In this embodiment, however, the echo suppressor ES attenuates the input signal when the voice switch VS is in the receiving state and does not attenuate it when the voice switch VS is not in the receiving state, so that only the unpleasant echo (residual echo) during a call is attenuated without causing such fluctuation. Note that the speech speed conversion processing unit SE is realized by executing the same program as the speech speed conversion processing program included in the first software, and thus its description is omitted.

FIG. 9 is a waveform diagram of a voice signal for explaining the basic principle of the voice data loss compensation processing (hereinafter abbreviated as "compensation processing"). In FIG. 9, the vertical axis indicates the intensity of the received voice signal input from the transmission processing unit 7 to the call processing unit 2, and the horizontal axis indicates time. When reception of a voice packet fails and a packet loss (voice data loss) occurs, the voice data loss compensation processing unit VC sets the received voice signal of a predetermined period immediately before the packet loss as a reference signal (template).

Next, the template is slid toward the past from the time at which the packet loss occurred, the correlation between the template and the received voice signal is calculated, and the basic period (pitch) of the received voice signal immediately before the packet loss is detected. Then, the received voice signal for one pitch is extracted going back from the time of the packet loss and is applied repeatedly to the loss period (the period in which voice data is missing; the same applies hereinafter), whereby the loss period is compensated. The loss period is compensated with the received voice signal for one pitch for the following reason. For example, when the speaker utters the voice "A", the voice "A" is divided (packetized) into segments of about 20 msec, each of which is carried in one voice packet, so the received voice signal for one pitch immediately before the packet loss is likely to be repeated in the loss period.

The audio data loss compensation processing unit VC includes a delay fluctuation absorbing buffer (jitter buffer) 20, a timer 21, a packet loss detection unit 22, a detection processing unit 23, and a compensation processing unit 24 as shown in FIG. However, each of these units is realized by executing a voice data loss compensation processing program by the DSP of the call processing unit 2.

Here, the header of each voice packet stores a number (sequence number) assigned in order when the original voice signal is divided (packetized), and the original voice signal can be restored by joining the voice data (received voice signal) of the voice packets in sequence-number order. The transmission processing unit 7 therefore outputs the received voice signal (received voice data) to the jitter buffer 20 in chronological order according to the sequence number. The voice packet header includes a time stamp in addition to the sequence number. The sequence number indicates the transmission order of the voice packets, and the time stamp indicates the relative position of the voice signal in the original voice waveform.

The jitter buffer 20 temporarily holds the received voice data output from the transmission processing unit 7, delays it for a predetermined time, and outputs it to the detection processing unit 23 to absorb the delay fluctuation of the voice packet.

The timer 21 is used when the packet loss detection unit 22 detects a packet loss. The packet loss detection unit 22 starts the timer 21 when the jitter buffer 20 outputs received voice data to the detection processing unit 23. If the time measured by the timer 21 exceeds a predetermined time, beyond which a packet loss is assumed to have occurred, before the jitter buffer 20 outputs the next received voice data, the packet loss detection unit 22 determines that a packet loss has occurred.
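The timeout rule can be illustrated with a small offline sketch. The function name `detect_losses`, the use of a list of output times, and the millisecond units are assumptions made for illustration. The unit described above makes the same decision in real time, when the timer expires.

```python
def detect_losses(output_times, timeout):
    """Return the indices i for which the gap between output i and output i + 1
    exceeds `timeout`, i.e. where a packet loss would be declared.

    output_times: times at which the jitter buffer output received voice data,
                  in ascending order (assumed to be in milliseconds).
    timeout: the predetermined time after which a loss is assumed.
    """
    losses = []
    for i in range(len(output_times) - 1):
        elapsed = output_times[i + 1] - output_times[i]  # what the timer measured
        if elapsed > timeout:
            losses.append(i)
    return losses
```

For example, with outputs at 0, 20, 40, 100 and 120 msec and a 30 msec timeout, a loss is declared after the output at index 2.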

When a packet loss is detected by the packet loss detection unit 22, the detection processing unit 23 performs a basic period (pitch) detection process on the received voice data output from the jitter buffer 20, and the packet loss detection unit 22 If no packet loss is detected, nothing is performed on the received voice data. The detection processing unit 23 holds received voice data for a certain period in the past.

Here, the detection processing unit 23 includes a template setting unit 23a and a pitch detection unit 23b. The template setting unit 23a sets received voice data having a predetermined time width as a template from the loss occurrence time to the past when the packet loss has occurred. Here, the template setting unit 23a increases the time width of the template as the pitch detection unit 23b increases the slide amount of the template.

The pitch detection unit 23b slides the template set by the template setting unit 23a toward the past from the loss occurrence time with respect to the received voice data, obtains the cross-correlation between the template and the received voice data, and detects the pitch of the received voice signal immediately before the loss occurrence time from the slide amount at which the correlation peak with the strongest correlation between the template and the received voice data appears.

FIG. 10 is a waveform diagram of a received voice signal for explaining the processing of the template setting unit 23a and the pitch detection unit 23b. The vertical axis in FIG. 10 indicates the intensity of the received voice signal, and the horizontal axis indicates time as a number of samples. The template TJ shown in FIG. 10 is the template used in the conventional compensation process.

When a packet loss occurs, conventionally, for example, a received voice signal for a predetermined period in the past from the loss occurrence time RT is set as a template TJ. Then, by sliding the template TJ toward the past from the loss occurrence time RT with respect to the received voice signal, the cross-correlation between the received voice signal and the template TJ is obtained, and the template TJ when the strongest correlation peak is obtained. The pitch of the received voice signal was detected from the slide amount.

FIG. 11 is a graph showing the calculation result of the correlation value between the template TJ and the received voice signal when the conventional template TJ is used. In FIG. 11, the correlation value is calculated using the conventionally known average magnitude difference function (AMDF). The vertical axis indicates the correlation value, and the horizontal axis indicates time as a number of samples, with the loss occurrence time RT taken as 0. Since FIG. 11 shows the correlation value obtained by AMDF, the smaller the value, the stronger the correlation between the received voice signal and the template TJ.

In FIG. 11, a downward-convex correlation peak PK1 first appears at 37 samples, then a downward-convex correlation peak PK2 appears at 47 samples, and thereafter downward-convex correlation peaks appear repeatedly at a period of approximately 37 samples. The value of the correlation peak PK1 is smaller than that of the correlation peak PK2. Therefore, in the conventional method, 37 samples is detected as the pitch of the received voice signal.

On the other hand, as shown in FIG. 10, the pitch of the received voice signal immediately before the loss occurrence time RT is 47 samples. Therefore, it can be seen that in the conventional method, the pitch of the received voice signal immediately before the loss occurrence time RT is not accurately detected.

This is considered to be because the time width of the template TJ is much larger than 47 samples: the template TJ contains only one period of the received voice signal with the 47-sample pitch that is to be detected, but three periods of the received voice signal with the 37-sample pitch that is not to be detected, so that a strong correlation peak appears at 37 samples.

In this case, 37 samples of received voice signals are extracted retroactively from the loss occurrence time RT, and compensation processing is performed by repeatedly applying the received voice signals to the loss period.

Therefore, it is difficult to smoothly connect the waveform of the loss period and the waveform other than the loss period, and it is difficult to perform the compensation process with high accuracy.

On the other hand, if the template time width is smaller than 47 samples, the pitch of 47 samples cannot be detected.

Therefore, in the detection processing unit 23 in the present embodiment, the time width of the template TM is increased as the slide amount of the template TM is increased as shown in FIG.

Therefore, for example, when the template TM has been slid to some extent, as with the template TM shown in the third row of FIG. 10, the template contains only the 47 samples of the received voice signal that are to be detected. On the other hand, the template TM in the fourth row of FIG. 10 contains a received voice signal with a pitch of 37 samples in addition to the received voice signal with a pitch of 47 samples. Therefore, the correlation between the third-row template TM and the received voice signal is stronger than the correlation between the fourth-row template TM and the received voice signal, and the pitch of the received voice signal immediately before the loss occurrence time RT can be detected with high accuracy.

Here, it is preferable that the pitch detection unit 23b adopts, for example, AMDF shown in the equation (1) as the correlation calculation.

Here, φ(τ) is the correlation value, N is the time width of the template TM, x(j) is the template TM, x(j−τ) is the received voice signal, k+1 is the starting point of the template TM, a is a predetermined coefficient, τ is the slide amount of the template TM, and j is the sampling number of each sampling point of the received voice signal.
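Equation (1) itself does not appear in this text. One form that is consistent with the symbol definitions given here and with the standard definition of AMDF is sketched below. It is a reconstruction, not the original equation. In particular, the relation N = aτ is an assumption inferred from the later statement that the time width is set to N = τ when a = 1, and it would apply only once τ has reached the slide reference value.

```latex
\phi(\tau) \;=\; \frac{1}{N} \sum_{j=k+1}^{k+N} \bigl|\, x(j) - x(j-\tau) \,\bigr| ,
\qquad N = a\,\tau
\tag{1}
```

With this form, φ(τ) approaches 0 when the template and the signal shifted by τ samples coincide, which matches the statement that a smaller AMDF value means a stronger correlation.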

Further, it is preferable that the template setting unit 23a sets the time width of the template TM to a predetermined initial time width until the slide amount of the template TM reaches a predetermined slide reference value.

In this way, when the slide amount of the template TM is relatively small, the time width of the template TM is set to the initial time width. Even when the slide amount is small, the time width of the template TM is therefore kept at or above a certain size, and the correlation between the template TM and the received voice signal (input signal) can be obtained with higher accuracy.

Furthermore, since the time width of the template TM is held at the initial time width until the slide amount of the template TM reaches the slide reference value, the amount of calculation can be reduced by making the initial time width relatively short.

Note that, as the initial time width, it is preferable to adopt the assumed minimum value of the pitch of the received voice signal. As the slide reference value, for example, an initial time width may be adopted.

FIG. 12 is a diagram for explaining processing of the template setting unit 23a and the pitch detection unit 23b. Each point on the straight line shown in FIG. 12 indicates a sampling point of the received voice signal. The rightmost sampling point indicates a loss occurrence time RT, and each sampling point indicates a past sampling point toward the left. The loss occurrence time RT is set as the 0th sampling point. The pitch of the received voice signal is about 3 msec in a short case, and if the sampling frequency is 8 kHz, it corresponds to 24 samples. Therefore, the initial time width may be 24 samples, for example. In FIG. 12, for convenience of explanation, the initial time width of the template TM is set to 4, a = 1, and the slide reference value is set to 5.

First, when a packet loss occurs, the pitch detection unit 23b sets τ = 0. Since the initial time width of the template TM is 4, the fourth sampling point to the left of the loss occurrence time RT is set as the reference sampling point k. A sampling number is then assigned to each sampling point so that it increases by 1 from k toward the loss occurrence time RT and decreases by 1 from k toward the past.

Then, the template setting unit 23a sets the reception voice signals x (k + 1) to x (k + 4) as the template TM0.

Then, the pitch detection unit 23b calculates a correlation value φ (0) between the template TM0 and the received voice signal x (j-0) using the equation (1). In this case, the template TM0 is applied to the audio signals x (k + 1) to x (k + 4).

Next, the pitch detection unit 23b sets τ = 1 and, as for τ = 0, calculates the correlation value φ(1) between the template TM0 and the audio signal x(j−1) using Equation (1). In this case, the template TM0 is applied to the audio signals x(k) to x(k+3).

Thereafter, the template TM0 is slid toward the past with respect to the received voice signal until τ = 4, and φ(2), φ(3), and φ(4) are calculated using Equation (1).

Next, when τ = 5, τ ≧ slide reference value (= 5) holds, and the pitch detection unit 23b therefore sets the fifth sampling point to the left of the loss occurrence time RT as the reference sampling point k. Then, the template setting unit 23a sets the audio signals x(k+1) to x(k+5) as the template TM5. The pitch detection unit 23b then obtains the correlation value φ(5) between the template TM5 and the audio signal x(j−5) using Equation (1). In this case, the template TM5 is applied to the audio signals x(k−4) to x(k).

Next, the pitch detection unit 23b sets τ = 6, and sets the sixth sampling point to the left from the loss occurrence time RT as the reference sampling point k. Then, the template setting unit 23a sets the received voice signals x (k + 1) to x (k + 6) as the template TM6. Then, the pitch detection unit 23b obtains a correlation value φ (6) between the template TM6 and the received voice signal x (j-6) using Expression (1). In this case, the template TM6 is applied to the audio signals x (k-5) to x (k).

Thereafter, the pitch detection unit 23b repeats the above processing until τ reaches the maximum slide amount τmax, and obtains φ (τ). As a result, the time width of the template TM is increased as the slide amount increases.
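The index ranges in this worked example can be reproduced with a short sketch. The function `template_windows` and its return format are assumptions for illustration. The default values (initial time width 4, slide reference value 5, a = 1) are those used in the explanation of FIG. 12.

```python
def template_windows(tau, initial_width=4, slide_ref=5, a=1):
    """Return ((t_first, t_last), (s_first, s_last)) as offsets relative to the
    reference sampling point k.

    The template TM covers x(k + t_first) .. x(k + t_last); it is compared with
    the received voice signal x(k + s_first) .. x(k + s_last).
    """
    # The time width stays at the initial width until tau reaches the slide
    # reference value, and grows with tau afterwards.
    width = initial_width if tau < slide_ref else int(a * tau)
    template = (1, width)
    compared = (1 - tau, width - tau)
    return template, compared
```

For τ = 5 this gives a template of x(k+1) to x(k+5) compared against x(k−4) to x(k), as stated above.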

FIG. 13 shows a graph of the correlation value φ (τ) when the correlation value φ (τ) is obtained for the received voice signal shown in FIG. 10 using the method according to the present embodiment. In FIG. 13, the vertical axis indicates the correlation value φ (τ), and the horizontal axis indicates time in terms of the number of samples. In FIG. 13, the correlation value φ (τ) is calculated by AMDF. Therefore, as in FIG. 11, the correlation peak with the lower correlation value has a stronger correlation between the received voice signal and the template TM.

In FIG. 13, a downward-convex correlation peak PK1 appears at approximately 47 samples from the loss occurrence time RT (= 0), a downward-convex correlation peak PK2 appears approximately 37 samples after the correlation peak PK1, and thereafter a downward-convex correlation peak appears approximately every 37 samples. Further, the value of the correlation peak increases with time, that is, the correlation between the template TM and the received voice signal weakens. If the sampling frequency is 8 kHz, 37 samples correspond to 37 × 0.125 msec = 4.625 msec, and 47 samples correspond to 47 × 0.125 msec = 5.875 msec.

That is, among the correlation peaks shown in FIG. 13, the correlation peak PK1 when the template TM is shifted by 47 samples is the smallest.

Therefore, the pitch detection unit 23b detects 47 samples, the time at which the minimum correlation peak PK1 appears, as the pitch of the received voice signal immediately before the loss occurrence time RT. It can thus be seen that the pitch detection unit 23b correctly detects 47 samples, which is the pitch of the received voice signal immediately before the loss occurrence time RT shown in the figure.

The compensation processing unit 24 extracts the received voice signal for one pitch, as detected by the pitch detection unit 23b, going back from the loss occurrence time RT, and uses the extracted received voice signal to compensate for the loss period in which the packet loss occurred.

Here, for example, if the received voice signal shown in FIG. 10 is input to the compensation processing unit 24 and the pitch detection unit 23b detects 47 samples as the pitch, the 47 samples of the received voice signal going back from the loss occurrence time RT are extracted, and the extracted received voice signal is applied repeatedly up to the end of the loss period to compensate for the loss period.
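The repetition step can be sketched as follows. The function name `fill_loss` and the list-based signal representation are assumptions for illustration. Any smoothing at the boundaries of the loss period is not covered by this sketch.

```python
def fill_loss(history, pitch, loss_len):
    """Fill a loss period by repeating the last pitch period of the signal.

    history: received voice samples up to the loss occurrence time RT (oldest first).
    pitch: detected pitch in samples.
    loss_len: length of the loss period in samples.
    """
    if not 0 < pitch <= len(history):
        raise ValueError("pitch must be between 1 and len(history)")
    cycle = history[-pitch:]  # one pitch period immediately before RT
    # Repeat the extracted period until the end of the loss period.
    return [cycle[i % pitch] for i in range(loss_len)]
```

With a pitch of 47 samples and a 160-sample loss period, the extracted period is repeated a little more than three times.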

FIG. 14 is a flowchart showing the procedure of the operation (audio data loss compensation processing) of the audio data loss compensation processing unit VC. In the flowchart of FIG. 14, a = 1 is set for convenience of explanation. First, in step S1, when the packet loss detection unit 22 detects a packet loss (step S1), the pitch detection unit 23b sets τ = 0 (step S2).

Next, the template setting unit 23a sets, from the received voice signal, a template TM having a time width corresponding to the value of τ (step S3). At this time, the template setting unit 23a sets the time width of the template TM to the initial time width if τ < slide reference value, and to N = τ if τ ≧ slide reference value.

Next, the pitch detection unit 23b sets a reference sampling point k so that k + 1 becomes the starting point of the template TM, and assigns a sampling number to each sampling point (step S4).

Next, the pitch detection unit 23b calculates a correlation value between the template TM and the received voice signal using the equation (1) (step S5).

Next, the pitch detection unit 23b sets τ = τ + 1 (step S6). If τ ≧ slide reference value (step S7), that is, if the slide amount of the template TM has reached the slide reference value, the pitch detection unit 23b advances the process to step S8. If τ < slide reference value (step S7), the process returns to step S5. By repeating steps S5 to S7, the template TM having the initial time width is slid toward the past with respect to the received voice signal until the slide amount reaches the slide reference value.

In step S8, if τ <τmax (step S8), the process returns to step S3, and the processes of steps S3 to S8 are repeated until τ ≧ τmax. Thereby, the time width of the template TM is increased as τ which is the slide amount increases.

In step S8, when τ ≧ τmax (step S8), the pitch detector 23b detects a correlation peak from the correlation value calculated in step S5, and among the detected correlation peaks, the template TM and the received voice signal The slide amount of the correlation peak with the strongest correlation is identified, and the pitch is detected from the identified slide amount (step S9). Here, when Equation (1) is adopted, the correlation peak indicating the minimum correlation value indicates the strongest correlation between the template TM and the received voice signal.

Further, the pitch detection unit 23b may calculate the pitch by multiplying the specified slide amount by the sampling period of the audio signal.

Next, the compensation processing unit 24 extracts the received voice signal according to the pitch detected in step S9, and compensates for the loss period using the extracted received voice signal (step S10).
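The pitch detection part of this procedure (steps S2 to S9, with a = 1) can be sketched as below. This is an illustrative sketch only. The function names, the AMDF normalisation by the template width, and the rule for picking correlation peaks (a strict local minimum of φ(τ), with the first one taken on ties) are assumptions not spelled out in the description.

```python
def amdf(history, tau, width):
    """Average magnitude difference between the last `width` samples (the
    template) and the same-length segment located `tau` samples earlier."""
    n = len(history)
    template = history[n - width:]
    shifted = history[n - width - tau:n - tau]
    return sum(abs(p - q) for p, q in zip(template, shifted)) / width


def detect_pitch(history, initial_width, slide_ref, tau_max):
    """Return the slide amount (in samples) of the strongest correlation peak,
    or None if no peak is found. `history` ends at the loss occurrence time."""
    if initial_width < 1 or slide_ref < 1:
        raise ValueError("initial_width and slide_ref must be at least 1")
    if len(history) < max(2 * tau_max, initial_width + tau_max):
        raise ValueError("history is too short for the requested tau_max")
    phi = []
    for tau in range(tau_max + 1):
        # Steps S3 and S7: initial width below the slide reference value,
        # width N = tau (a = 1) at or above it.
        width = initial_width if tau < slide_ref else tau
        phi.append(amdf(history, tau, width))  # step S5
    # Step S9: with AMDF, a correlation peak is a local minimum of phi.
    peaks = [t for t in range(1, tau_max)
             if phi[t] < phi[t - 1] and phi[t] < phi[t + 1]]
    if not peaks:
        return None
    return min(peaks, key=lambda t: phi[t])
```

Applied to a signal that repeats every 5 samples, the sketch returns 5. The detected value would then be used to extract one pitch period for the compensation in step S10.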

In the description of FIG. 12, the template setting unit 23a sets a = 1. However, the present invention is not limited to this. The value of a may be set in the range 1 ≦ a < 2 until the slide amount of the template TM exceeds a predetermined change reference value and, once the slide amount exceeds the change reference value, may be gradually decreased so as to approach 1 as the slide amount approaches the maximum slide amount (τmax). As the change reference value, for example, the above-described slide reference value can be adopted.
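One possible schedule for a is sketched below. The description only requires that a stay within 1 ≦ a < 2 and approach 1 gradually. The linear decrease, the function name `coefficient_a`, and the starting value 1.5 are assumptions chosen for illustration.

```python
def coefficient_a(tau, change_ref, tau_max, a_start=1.5):
    """Return the coefficient a for slide amount tau.

    a is held at a_start up to the change reference value, then decreased
    linearly (an assumed schedule) so that it reaches 1 at tau_max.
    """
    if not 1.0 <= a_start < 2.0:
        raise ValueError("a_start must satisfy 1 <= a_start < 2")
    if tau_max <= change_ref:
        raise ValueError("tau_max must be larger than change_ref")
    if tau <= change_ref:
        return a_start
    if tau >= tau_max:
        return 1.0
    fraction = (tau - change_ref) / (tau_max - change_ref)
    return a_start - (a_start - 1.0) * fraction
```

With a change reference value of 5 and τmax = 25, a is 1.5 up to τ = 5, 1.25 at τ = 15, and 1 at τ = 25.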

As a result, when the slide amount is small, the time width of the template TM can be set larger than the slide amount, and when the slide amount is large, the time width of the template TM can be set to a value approximately equal to the slide amount. Therefore, when the slide amount is small, the accuracy of the correlation calculation can be prevented from deteriorating because the time width of the template TM has become too small.

Further, as the correlation calculation, a conventionally known method such as cross-correlation or the average squared difference function (ASDF) may be employed instead of the AMDF shown in Equation (1).

As described above, according to the voice data loss compensation processing unit VC in the present embodiment, the received voice signal having a time width from the packet loss occurrence time point RT to the past is set as the template TM. Then, the set template TM is slid toward the past from the present time with respect to the received voice signal. Then, the correlation between the template TM and the received voice signal is obtained, and the pitch of the received voice signal is detected.

Here, the time width of the template TM increases as the slide amount increases. Therefore, at a relatively early stage in which the slide amount is still small, there is a timing at which the template TM consists of almost exactly one pitch of the received voice signal immediately before the current time. At this timing, a strong correlation peak appears between the template TM and the received voice signal. When the slide amount increases further, the time width of the template TM increases accordingly, and the template TM comes to contain a plurality of frequency components. Consequently, no correlation peak stronger than the one obtained at the above timing is obtained. The pitch of the received voice signal immediately before the current time can therefore be detected accurately.

As shown in FIG. 15, the fluctuation absorption processing unit JA includes a jitter buffer 30, a counting unit 31, a buffer size changing unit 32, a reception time recording unit 33, a reference value storage unit 34, a concealment processing unit 35, an output unit 36, and an observation history holding unit 37. These units are realized by the DSP of the call processing unit 2 executing a fluctuation absorption processing program in the second software. The jitter buffer 30 is shared with the jitter buffer 20 of the voice data loss compensation processing unit VC.

The reception time recording unit 33 records the time (time stamp) when the transmission processing unit 7 receives the voice packet (received voice packet) in association with the sequence number of the received packet.

The jitter buffer 30 is configured by, for example, a ring buffer, and accumulates packets received by the transmission processing unit 7 in chronological order. As a result, fluctuations in the transmission delay of the voice packet transmitted via the signal trunk line Ls are absorbed. As the size of the jitter buffer 30, a size larger than a reference value described later is adopted.

The counting unit 31 calculates a packet count value by counting the number of packets accumulated in the jitter buffer 30 at a predetermined period (count period) that is equal to or less than the period in which voice is packetized (packetization period). The packet count value calculated by the counting unit 31 is held in the observation history holding unit 37. The observation history holding unit 37 is composed of, for example, a volatile semiconductor memory, and holds the past N packet count values (N is a positive integer) calculated by the counting unit 31.

FIG. 16 is an explanatory diagram of packet count value calculation processing by the count unit 31. As shown in FIG. 16, the count unit 31 calculates a packet count value at the count cycle Tb.

Here, the counting unit 31 calculates the packet count value as follows. For a packet PS received within the packetization period Ta before the calculation time Tk, which is the calculation timing of the packet count value, the count value is set to ΔT / Ta, where ΔT is the difference between the calculation time Tk and the reception time of the packet. For a packet PL received more than the packetization period Ta before the calculation time Tk, the count value is set to 1. That is, the count value of the packet PS decreases as its reception time approaches the calculation time Tk and the difference ΔT decreases.

Here, for the packet PS, since the reception time is used in calculating the packet count value, it is necessary to hold the reception time. On the other hand, for the packet PL, since the reception time is not necessary for calculating the packet count value, it is not necessary to record the reception time.

Therefore, when the packet count value calculation process ends, the counting unit 31 deletes from the reception time recording unit 33 the reception times of packets received earlier than the difference (= Ta − Tb) between the packetization period Ta and the count period Tb before the calculation time Tk.

As a result, at the time Tk+1, which is the next calculation time of the packet count value, the reception times of the packets received within the packetization period Ta before the time Tk+1 are still held in the reception time recording unit 33, so the counting unit 31 can acquire them at the time Tk+1. In this way, the capacity of the reception time recording unit 33 can be saved.

The buffer size changing unit 32 reads the past N packet count values calculated by the counting unit 31 from the observation history holding unit 37, and takes the nth smallest of the N packet count values read as the representative value of the packet count value. If the calculated representative value is larger than a predetermined reference value, it deletes packets stored in the jitter buffer 30; if the representative value is smaller than the reference value, it inserts packets into the jitter buffer 30. The reference value is stored in the reference value storage unit 34.

Here, when the representative value is smaller than the reference value, the buffer size changing unit 32 may insert a packet into the jitter buffer 30 so that the representative value is not less than the reference value and less than the reference value + 1. For example, when the representative value is 2.1 and the reference value is 4, two packets are inserted into the jitter buffer 30 so that the representative value is 4.1. In addition, when the representative value is larger than the reference value, the buffer size changing unit 32 may delete the packet from the jitter buffer 30 so that the representative value is not less than the reference value and less than the reference value + 1. For example, when the representative value is 4.2 and the reference value is 2, two packets are deleted from the jitter buffer 30 so that the representative value is 2.2.
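The two numerical examples above can be reproduced with a short sketch. It assumes that a single signed integer k, chosen so that reference value ≦ representative value + k < reference value + 1, describes both cases (positive k for insertion, negative k for deletion); the function name `packets_to_adjust` is hypothetical and does not appear in the original description.

```python
import math

def packets_to_adjust(representative, reference):
    """Signed number of packets to insert (+) or delete (-).

    Returns the only integer k that satisfies
    reference <= representative + k < reference + 1.
    (Hypothetical helper, not named in the source.)
    """
    return math.ceil(reference - representative)

# Insertion example from the text: 2.1 -> 4.1 with reference value 4.
k_insert = packets_to_adjust(2.1, 4)
# Deletion example from the text: 4.2 -> 2.2 with reference value 2.
k_delete = packets_to_adjust(4.2, 2)
print(k_insert, k_delete)
```

When the representative value already lies between the reference value and the reference value + 1, the same expression yields 0, so no packet is inserted or deleted.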

In addition, as n, it is preferable to adopt a value obtained by rounding N × α to an integer. As the reference value, a value determined in advance based on the call delay time allowed by the intercom system for collective housing in an interphone call (a call using the packet transmission method) is adopted. That is, if the number of packets stored in the jitter buffer 30 is larger than the reference value, the number of packets waiting for output in the jitter buffer 30 increases, so that a call delay occurs. Therefore, as described above, when the representative value, which is the nth smallest packet count value, is larger than the reference value, call delay can be prevented by deleting packets from the jitter buffer 30.

On the other hand, when the representative value which is the nth packet count value is smaller than the reference value, the packet is inserted into the jitter buffer 30. As a result, the probability that the number of stored packets is equal to or less than the reference value can be set to α (= n / N)%.

The concealment processing unit 35 performs packet loss concealment processing on invalid packets (packets that do not include voice; the same applies hereinafter) inserted into the jitter buffer 30, and also performs packet loss concealment processing when packets are depleted in the jitter buffer 30. Here, as the packet loss concealment processing, the following technique may be adopted, for example. The pitch of the received voice signal is detected from the received voice signal preceding the invalid packet. Then, from the received voice signal of the valid packet (packet including voice; the same applies hereinafter) immediately before the invalid packet, the voice waveform of the section extending one pitch back from its end is extracted, and a voice waveform obtained by repeating this waveform for the packetization period (for example, 20 msec) is generated as the received voice signal of the invalid packet. For the pitch detection, a method common to the pitch detection process in the voice data loss compensation process described above may be employed.
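The waveform repetition described above can be illustrated with a minimal sketch. It assumes that the pitch has already been detected and is given as a number of samples, and that the voice signal is a plain list of samples; the function name `conceal_by_pitch_repetition` and its parameters are hypothetical.

```python
def conceal_by_pitch_repetition(previous_samples, pitch, packet_len):
    """Fill an invalid packet by repeating the last pitch period.

    previous_samples: samples of the valid packet(s) just before the gap
    pitch:            detected pitch period in samples (assumed given)
    packet_len:       number of samples in one packetization period
    """
    if not 0 < pitch <= len(previous_samples):
        raise ValueError("pitch must fit inside the previous samples")
    # Section extending one pitch back from the end of the valid signal.
    last_period = previous_samples[-pitch:]
    # Repeat that section until one packet length is filled.
    return [last_period[i % pitch] for i in range(packet_len)]

filled = conceal_by_pitch_repetition([0, 1, 2, 3, 4, 5], pitch=3, packet_len=8)
print(filled)  # [3, 4, 5, 3, 4, 5, 3, 4]
```

A practical implementation would usually smooth the boundary between the repeated waveform and the neighbouring packets; that step is outside this sketch.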

When the number of packets stored in the jitter buffer 30 exceeds the reference value, the output unit 36 reads packets (received voice data) from the jitter buffer 30 in chronological order in synchronization with the packetization period Ta and outputs them to the received voice signal path. Here, when the packet read from the jitter buffer 30 is an invalid packet that does not include voice, the output unit 36 causes the concealment processing unit 35 to execute the packet loss concealment process and outputs the resulting voice data.

The observation history holding unit 37 is configured by, for example, a non-volatile semiconductor memory, and holds the packet count value of the past N times calculated by the counting unit 31.

FIG. 17 is a diagram for explaining the role of the jitter buffer 30. As shown in FIG. 17, a packet including a received voice signal is transmitted from the other party's call terminal (lobby interphone LI, management room device X, or other dwelling unit) at a packetization period (20 msec in the illustrated example). FIG. 17 shows a situation in which 8 packets with numbers 1 to 8 (sequence numbers) are transmitted at intervals of 20 msec.

The packet transmitted from the other party's call terminal is received by the dwelling unit A via the signal trunk line Ls. Here, since a large number of packets (voice packets, video packets, and control packets) are multiplexed and transmitted via the signal trunk line Ls, the time it takes a voice packet transmitted from the other party's call terminal at the packetization period to reach the dwelling unit A (transmission delay) differs greatly from packet to packet, and so-called transmission delay fluctuation occurs. Therefore, voice packets are received by the dwelling unit A at unequal intervals.

Therefore, a jitter buffer 30 is provided to absorb this transmission delay fluctuation. In FIG. 17, the buffer size of the jitter buffer 30 is three packets. Further, the output unit 36 starts the output by performing the decoding process and the D / A conversion process on the first packet at the time T1 when the delay time Td has elapsed since the reception of the first packet.

In the case of FIG. 17, at time T2, which is the output time of the second packet 20 msec after time T1, the second packet is already stored in the jitter buffer 30. Therefore, the output unit 36 can output the second packet at time T2.

On the other hand, since the third packet has an extremely large transmission delay, it does not reach the dwelling unit A at the time T3 and the jitter buffer 30 is depleted. For this reason, the output unit 36 cannot output the third packet at time T3, and sound loss (voice data loss) occurs.

The third to seventh packets reach the dwelling unit A continuously in a short time after the congestion is eliminated. When the seventh packet reaches the dwelling unit A, the jitter buffer 30 holds the fifth and sixth packets. However, since the jitter buffer 30 still has free space, the seventh packet is stored in the jitter buffer 30 without being discarded. Therefore, the seventh packet is output from the output unit 36 at time T7.

As described above, since the characteristics of the transmission delay fluctuation change dynamically, if the buffer size of the jitter buffer 30 is fixed, it must be made sufficiently larger than the assumed transmission delay fluctuation. Moreover, if the buffer size of the jitter buffer 30 and the delay time Td are made sufficiently long, the occurrence of sound omission can be prevented, but a long delay time Td increases the number of packets waiting for output in the jitter buffer 30 and causes call delay.

FIG. 18 shows an example of a transmission delay characteristic graph showing the relationship between the transmission delay and the frequency of occurrence of the transmission delay. In FIG. 18, the vertical axis indicates the occurrence frequency, and the horizontal axis indicates the transmission delay. FIG. 19 is a diagram for explaining an optimum buffer size of the jitter buffer 30. In FIG. 18, dmin represents the minimum transmission delay, and dmax represents the maximum transmission delay. In FIG. 19, the transmission delay of the (k-1) th packet is dmin, the transmission delay of the kth packet is d, and the transmission delay of the (k + 1) th packet is dmax.

In this case, the optimum output waiting time by the output unit 36 is as follows. i) Packets received with dmax are output immediately. ii) Wait for dmax-dmin before outputting packets that arrive at dmin. iii) The packet arrived at d is output after waiting dmax-d.

Therefore, in order to avoid packet depletion in the jitter buffer 30, the buffer size buf of the jitter buffer 30 may be set to buf ≧ dmax − dmin. However, when dmax of the transmission delay characteristic becomes extremely large, that is, when the tail at the right end of the graph of FIG. 18 becomes extremely long, the buffer size buf increases. Further, as shown in the graph of FIG. 18, since the frequency of occurrence decreases as the transmission delay increases, it is necessary to observe the transmission delay of a huge number of packets in order to observe the true dmax. For this reason, in the graph of FIG. 18, not the true dmax but a value obtained by cutting off the upper few percent of the transmission delay distribution is regarded as dmax. In this case, when a transmission delay exceeding the value regarded as dmax occurs, packet depletion occurs.
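As a rough illustration of the relation buf ≧ dmax − dmin, the following sketch estimates a buffer size from a list of observed transmission delays after discarding the upper few percent of the distribution. The function names, the use of milliseconds, and the integer `discard_percent` parameter are assumptions made for the example; the description does not prescribe how the cut-off is computed.

```python
def estimate_buffer_size(delays_ms, discard_percent=2):
    """Return (value regarded as dmax) - dmin for observed delays.

    The largest discard_percent percent of the observations are ignored,
    so rare extreme delays do not inflate the buffer size.
    """
    ordered = sorted(delays_ms)
    discard = len(ordered) * discard_percent // 100
    keep = max(len(ordered) - discard, 1)
    d_min = ordered[0]
    d_max_regarded = ordered[keep - 1]
    return d_max_regarded - d_min

def output_wait(delay_ms, d_max_ms):
    """Optimum wait before output for a packet that arrived with delay d."""
    return max(d_max_ms - delay_ms, 0)

# 98 ordinary delays (10..107 ms) plus two rare outliers.
delays = list(range(10, 108)) + [500, 900]
print(estimate_buffer_size(delays, discard_percent=2))
print(estimate_buffer_size(delays, discard_percent=0))
```

With the two outliers discarded, the estimate follows the bulk of the distribution; with nothing discarded, a single extreme delay dominates the buffer size, which is the trade-off discussed above.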

Therefore, in order to prevent packet depletion, it is preferable to set a large value as the value regarded as dmax. Conversely, if the value regarded as dmax is too large, the buffer size buf increases, the number of packets waiting for output in the jitter buffer 30 increases, and an output delay occurs. Such an output delay appears as a call delay in an interphone call using the packet transmission method and is preferably kept as low as possible. Therefore, by executing the above-described processing, packet depletion is prevented and, at the same time, call delay is prevented.

FIG. 20 is a flowchart showing the fluctuation absorption processing of the fluctuation absorption processing unit JA. First, in step S1, the counting unit 31 determines whether or not the packet count value calculation timing has come, that is, whether the count period Tb has elapsed since the previous packet count value calculation. If the counting unit 31 determines that the packet count value calculation timing has come (YES in step S1), the counting unit 31 counts the number of packets currently accumulated in the jitter buffer 30 (step S2). On the other hand, when determining that the packet count value calculation timing has not come (NO in step S1), the counting unit 31 returns the process to step S1.

Next, the count unit 31 executes a packet count value calculation process to calculate a packet count value (step S3).

FIG. 21 is a flowchart showing details of the packet count value calculation processing. First, the counting unit 31 specifies the current time as the packet count value calculation time (step S21). Here, since the control unit 1 of the dwelling unit A has a clock function, the calculation time can be specified using the clock function.

Next, among the packets stored in the jitter buffer 30, the counting unit 31 specifies the reception time of each packet received within the packetization period Ta before the calculation time Tk, as shown in FIG. 16 (step S22). In this case, the counting unit 31 specifies the reception time of each packet by referring to the sequence numbers associated with the reception times recorded in the reception time recording unit 33.

Next, for each packet received within the packetization period Ta before the calculation time Tk, the counting unit 31 calculates the difference ΔT between the calculation time Tk and the reception time (step S23). Next, the counting unit 31 calculates ΔT / Ta for each of these packets and sets this ΔT / Ta as the count value of the packet (step S24).

Next, among the packets stored in the jitter buffer 30, the counting unit 31 sets the count value to 1 for packets received more than the packetization period Ta before the calculation time Tk (step S25).

Next, the counting unit 31 calculates the packet count value by counting the packets stored in the jitter buffer 30 using the count values set in steps S24 and S25 (step S26). For example, suppose that one packet was received more than the packetization period Ta before the calculation time Tk and two packets were received within the packetization period Ta before the calculation time Tk. When the reception times of the latter two packets are Ti and Tj, the packet count value is 1 + (Tk − Ti) / Ta + (Tk − Tj) / Ta.

Next, the counting unit 31 deletes from the reception time recording unit 33 the reception times of packets received earlier than Ta − Tb before the calculation time Tk (step S27).
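Steps S21 to S27 can be sketched as follows. The sketch assumes that the reception time recording unit 33 is modelled as a dictionary from sequence number to reception time and that all times are in milliseconds; the function names and data layout are hypothetical and only serve to illustrate the calculation.

```python
def packet_count_value(stored_seqs, rx_times, t_k, t_a):
    """Steps S22 to S26: weighted count of packets in the jitter buffer.

    stored_seqs: sequence numbers of the packets currently stored
    rx_times:    dict {sequence number: reception time} (may be pruned)
    """
    total = 0.0
    for seq in stored_seqs:
        rx = rx_times.get(seq)
        if rx is not None and t_k - rx < t_a:
            total += (t_k - rx) / t_a   # packet PS: count value is dT / Ta
        else:
            total += 1.0                # packet PL: count value is 1
    return total

def prune_reception_times(rx_times, t_k, t_a, t_b):
    """Step S27: drop times of packets received earlier than Tk - (Ta - Tb)."""
    cutoff = t_k - (t_a - t_b)
    return {seq: rx for seq, rx in rx_times.items() if rx >= cutoff}

Ta, Tb, Tk = 20, 10, 100
rx = {1: 70, 2: 85, 3: 95}
print(packet_count_value([1, 2, 3], rx, Tk, Ta))   # 1 + 15/20 + 5/20
rx = prune_reception_times(rx, Tk, Ta, Tb)
print(rx)                                           # only packet 3 remains
print(packet_count_value([2, 3], rx, Tk + Tb, Ta))  # next calculation time
```

At the next calculation time, packet 2 is older than Ta and counts as 1 even though its reception time has been deleted, which is why the pruning in step S27 does not change the result.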

Referring back to the flowchart of FIG. 20, in step S4, the counting unit 31 causes the observation history holding unit 37 to hold the packet count value at the calculation time Tk. In this case, the count unit 31 deletes the oldest packet count value from the observation history holding unit 37 so that the number of packet count values held in the observation history holding unit 37 is N.

Next, the buffer size changing unit 32 specifies the nth smallest packet count value among the N packet count values stored in the observation history holding unit 37 as a representative value (step S5).

FIG. 22 is a schematic diagram showing the relationship between the packet count value and the calculation time of the packet count value. The vertical axis shows the packet count value, and the horizontal axis shows the calculation time of the packet count value. In FIG. 22, N = 9 and n = 3. Therefore, since the packet count value at the second time Tk-7 from the left end shown in FIG. 22 is the third smallest, the buffer size changing unit 32 specifies the packet count value at the time Tk-7 as a representative value.

Next, the buffer size changing unit 32 determines whether or not the representative value is equal to or larger than the reference value + 1 (step S6). If representative value ≧ reference value + 1 (YES in step S6), the buffer size changing unit 32 deletes from the jitter buffer 30 the number of packets that brings the representative value to at least the reference value and less than the reference value + 1 (step S7).

Next, the buffer size changing unit 32 subtracts the number of packets deleted in step S7 from each of the N packet count values held in the observation history holding unit 37, and updates the N packet count values. The observation history is updated (step S8). For example, assuming that the number of deleted packets is 1, 1 is subtracted from all N packet count values. Thereby, the fact that the packet is deleted from the jitter buffer 30 is reflected in the observation history.

On the other hand, when the representative value is less than the reference value + 1 in step S6 (NO in step S6) and the representative value is equal to or larger than the reference value (NO in step S9), the buffer size changing unit 32 neither deletes nor inserts packets in the jitter buffer 30 (step S10).

On the other hand, if representative value < reference value (YES in step S9), the buffer size changing unit 32 inserts into the jitter buffer 30 the number of packets that brings the representative value to at least the reference value and less than the reference value + 1 (step S11).

Next, the buffer size changing unit 32 adds the number of packets inserted in step S11 to each of the N packet count values held in the observation history holding unit 37, and updates the N packet count values. Then, the observation history is updated (step S12). For example, if the number of inserted packets is 1, 1 is added to all N packet count values. Thereby, the fact that the packet is inserted into the jitter buffer 30 is reflected in the observation history.

Then, when the process of step S8, S10 or S12 is completed, the process returns to step S1, and when the next packet count value calculation time comes, the processes after step S2 are executed.
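The decision flow of steps S5 to S12 can be summarised in a short sketch. It assumes that the observation history is a list of the past N packet count values and returns the signed number of packets to insert (positive) or delete (negative) together with the updated history; the function name `control_step` is hypothetical, and the actual insertion or deletion of packets in the jitter buffer 30 is not modelled.

```python
import math

def control_step(history, reference, n):
    """One pass through steps S5 to S12 of FIG. 20 (illustrative only).

    history:   past N packet count values
    reference: reference value
    n:         rank of the representative value (nth smallest)
    """
    representative = sorted(history)[n - 1]                # step S5
    if representative >= reference + 1:                    # step S6: YES
        delta = -math.floor(representative - reference)    # step S7: delete
    elif representative >= reference:                      # step S9: NO
        delta = 0                                          # step S10
    else:                                                  # step S9: YES
        delta = math.ceil(reference - representative)      # step S11: insert
    # Steps S8 / S12: reflect the change in every held packet count value.
    return delta, [value + delta for value in history]

history = [5.0, 4.2, 6.1, 4.5, 7.0, 4.8, 5.5, 6.0, 5.2]   # N = 9
delta, updated = control_step(history, reference=2, n=3)
print(delta, sorted(updated)[2])
```

Shifting all N held values by the same amount keeps the history consistent with the new buffer contents, so the next representative value is not biased by counts taken before the adjustment.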

FIG. 23A is a schematic diagram showing processing at the time of packet insertion by the buffer size changing unit 32, and FIG. 23B is a schematic diagram showing processing at the time of packet deletion by the buffer size changing unit 32. In the example of FIG. 23A, the buffer size changing unit 32 inserts an invalid packet between the fourth packet and the fifth packet, which are valid packets. In the example of FIG. 23B, the buffer size changing unit 32 overlaps the fourth packet and the fifth packet, which are valid packets, so that two packet lengths become one packet length, thereby deleting one packet.

Thus, in the fluctuation absorption processing unit JA, the packet count value is calculated from the number of packets stored in the jitter buffer 30, and the nth smallest packet count value among the past N packet count values is specified as the representative value. If the specified representative value is larger than the reference value, packets are deleted from the jitter buffer 30. Therefore, if the past history of the packet count value shows that the number of packets stored in the jitter buffer 30 tends to be larger than the reference value, so that output delay occurs, packets are deleted from the jitter buffer 30 and the output delay is reduced. On the other hand, if the past history of the packet count value shows that the number of packets stored in the jitter buffer 30 tends to be smaller than the reference value, so that packets are likely to be depleted, packets are inserted into the jitter buffer 30, and packet depletion can be prevented.

Next, another method for calculating the packet count value in the fluctuation absorbing process will be described. Here, in the reception time recording unit 33, only the reception time of the latest packet is recorded.

The counting unit 31 calculates the packet count value by setting the count value of the latest packet to ΔT / Ta, where ΔT is the difference between the calculation time Tk and the reception time of the latest packet, and setting the count value of the other packets to 1.

As shown in FIG. 24, when packets received within the packetization period Ta before the calculation time Tk are stored in the jitter buffer 30, the counting unit 31 identifies, among those packets, the packet PS having the latest reception time and sets the count value of this latest packet PS to ΔT / Ta. On the other hand, the counting unit 31 uniformly sets the count value to 1 for the packets PL1 and PL2 other than the latest packet PS among the packets stored in the jitter buffer 30. In this case, the counting unit 31 only needs to know the reception time of the latest packet PS received within the packetization period Ta before the calculation time Tk, so after the packet count value calculation process is completed, the reception time recorded in the reception time recording unit 33 is deleted.

The packet count value calculation process will be described in detail with reference to the flowchart of FIG. 25. Steps S31, S33, S34, and S36 in FIG. 25 are the same as steps S21, S23, S24, and S26 in FIG. 21. In step S32 in FIG. 25, the counting unit 31 specifies the reception time of the latest packet among the packets in the jitter buffer 30 that were received within the packetization period Ta before the calculation time Tk. Further, the counting unit 31 uniformly sets the count value to 1 for packets other than the latest packet (step S35). In step S37, the counting unit 31 deletes the reception time of the latest packet from the reception time recording unit 33.

If the packet count value is calculated by the above-described method, it is only necessary to record the reception time for only the latest packet, so that the capacity of the reception time recording unit 33 can be further saved.
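A minimal sketch of this alternative calculation is shown below. It assumes that only the number of stored packets and the reception time of the latest packet are available, with all times in milliseconds; the function name `packet_count_value_latest_only` is hypothetical.

```python
def packet_count_value_latest_only(n_stored, latest_rx, t_k, t_a):
    """Alternative calculation: only the latest packet is weighted.

    n_stored:  number of packets stored in the jitter buffer
    latest_rx: reception time of the latest packet, or None if not recorded
    """
    if n_stored == 0:
        return 0.0
    if latest_rx is not None and t_k - latest_rx < t_a:
        # Latest packet PS counts dT / Ta; every other packet counts 1.
        return (n_stored - 1) + (t_k - latest_rx) / t_a
    return float(n_stored)

print(packet_count_value_latest_only(3, latest_rx=95, t_k=100, t_a=20))
print(packet_count_value_latest_only(3, latest_rx=70, t_k=100, t_a=20))
```

Compared with weighting every recent packet, this variant trades some precision for needing a single stored reception time.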

Incidentally, in voice transmission by the packet transmission method, sound interruption of 500 msec or more may occur due to spike-like delay variation (spike delay) caused by sudden accumulation of packets in the transmission path. Therefore, it is preferable that the fluctuation absorption processing unit JA determine whether or not a spike delay has occurred and, if a spike delay has occurred, shorten the window width of the past packet count values to be referred to and calculate the representative value from the packet count values within the shortened window width.

Therefore, the counting unit 31 stores the calculated packet count value in the observation history holding unit 37 in association with an index indicating the time-series order of each packet count value. Specifically, since the observation history holding unit 37 holds the past N packet count values, the counting unit 31 assigns indexes to the past N packet count values so that the index increases as the calculation time becomes newer, with the index of the latest packet count value being N and the index of the oldest packet count value being 1. The counting unit 31 determines the presence or absence of a spike delay based on the past N packet count values held in the observation history holding unit 37 and, when it determines that a spike delay has occurred, extracts the past M (M < N) packet count values from the past N packet count values.

Here, the counting unit 31 determines the presence or absence of a spike delay as follows. FIG. 26 is a graph for explaining the determination processing for the presence or absence of spike delay. In FIG. 26, the vertical axis indicates the packet count value, and the horizontal axis indicates the index. Also, N = 100.

First, the count unit 31 specifies a packet count value that is equal to or less than the reference value. In the example of FIG. 26, the packet count values at points PP1 to PP6 are below the reference value. Next, the count unit 31 specifies the smallest index, that is, the oldest point, and the largest index, that is, the latest point among packet count values equal to or less than the reference value. In the example of FIG. 26, the counting unit 31 specifies the points PP1 and PP6.

Next, the count unit 31 obtains a difference ΔI between the minimum index and the maximum index. The counting unit 31 determines that a spike delay has occurred if the difference ΔI is smaller than a predetermined threshold, and determines that no spike delay has occurred if the difference ΔI is larger than the threshold.

FIG. 27 is a graph showing the relationship between the packet count value and the index when spike delay occurs. In FIG. 27, the vertical axis represents the packet count value, and the horizontal axis represents the index. In the example of FIG. 27, the packet count values at points PP1 to PP5 are equal to or less than the reference value. The point PP1 has the smallest index, and the point PP5 has the largest index. The difference ΔI between the index of the point PP1 and the index of the point PP5 is smaller than the threshold value. Therefore, the count unit 31 determines that a spike delay has occurred.

When the counting unit 31 determines that a spike delay has occurred as shown in FIG. 27, it extracts the past M packet count values counted back from the calculation time Tk. Here, as M, a value obtained by multiplying ΔI by a predetermined coefficient β (0 < β ≦ 1) (= β · ΔI) and rounding the result to an integer can be adopted.

Then, the buffer size changing unit 32 calculates the m-th smallest packet count value among the past M packet count values as the representative value. Thereafter, the buffer size changing unit 32 compares the representative value with the reference value, and inserts or deletes packets in the jitter buffer 30. Here, as m, a value obtained by rounding M × α to an integer can be adopted.
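The spike delay determination and the shortened window can be sketched as follows. Several details are assumptions not fixed by the description: rounding is done half-up, M and m are clamped to at least 1, the case ΔI equal to the threshold is treated as no spike, and a history without any value at or below the reference value is treated as no spike. All function names are hypothetical.

```python
def round_half_up(x):
    """Rounding rule assumed for this sketch (the source only says 'rounded')."""
    return int(x + 0.5)

def spike_width(history, reference, threshold):
    """Return delta I if a spike delay is judged to have occurred, else None.

    history[0] has index 1 (oldest); history[-1] has index N (latest).
    """
    low = [i for i, v in enumerate(history, start=1) if v <= reference]
    if not low:
        return None
    delta_i = max(low) - min(low)
    return delta_i if delta_i < threshold else None

def representative_value(history, reference, threshold, alpha, beta):
    """m-th smallest value in the (possibly shortened) window."""
    delta_i = spike_width(history, reference, threshold)
    if delta_i is None:
        window = list(history)                           # full window of N values
    else:
        m_window = max(1, round_half_up(beta * delta_i)) # M = beta * delta I
        window = list(history[-m_window:])               # latest M values
    m = max(1, round_half_up(len(window) * alpha))       # m = M * alpha
    return sorted(window)[m - 1]

# N = 20: a short burst of low values at indexes 10..13, normal values elsewhere.
history = [5.0] * 9 + [1.0] * 4 + [5.0] * 7
print(representative_value(history, 2.0, threshold=10, alpha=0.2, beta=1.0))
print(representative_value(history, 2.0, threshold=2, alpha=0.2, beta=1.0))
```

In the first call the burst is judged to be a spike and the representative value comes from the latest values only; in the second call the full history is used and the burst pulls the representative value down.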

In this way, when a spike delay occurs, the window width of the past packet count value to be referred to is narrowed, and a packet is inserted into or deleted from the jitter buffer 30. Therefore, the representative value can be calculated in such a manner that spike delays that rarely occur are eliminated.

Further, in the fluctuation absorption processing unit JA, when a stored packet count of 0 occurs consecutively, it is preferable to calculate the packet count value as follows.

Specifically, when a stored packet count of 0 occurs consecutively, the counting unit 31 calculates, as the packet count value, a negative value whose absolute value increases as the number of consecutive times the stored packet count is 0 increases.

FIGS. 28A and 28B are diagrams for explaining the processing of the counting unit 31. In FIG. 28A, packets are received immediately after the packet count value calculation times Tk-4, Tk-3, Tk-2, and Tk-1 in each section of the count cycle Tb. Further, in each section, the output unit 36 reads the packet (received voice data) from the jitter buffer 30 before the next packet count value calculation time Tk-3, Tk-2, Tk-1, or Tk arrives. For example, a packet received immediately after the calculation time Tk-4 is read out before the next calculation time Tk-3 arrives. Therefore, at each calculation time Tk-4, Tk-3, Tk-2, Tk-1, Tk, the number of stored packets in the jitter buffer 30 is zero. Therefore, the counting unit 31 calculates the packet count value as 0 at each of the calculation times Tk-4, Tk-3, Tk-2, Tk-1, and Tk.

On the other hand, in FIG. 28B, no packet has been received since one packet was received slightly before the calculation time Tk-4. Note that a packet received shortly before the calculation time Tk-4 is read out after the calculation time Tk-4 has elapsed and until the next calculation time Tk-3 has elapsed. Even in this case, the number of stored packets at the calculation time Tk-4 is 1, but the number of stored packets at other calculation times Tk-3, Tk-2, Tk-1, and Tk is 0. The counting unit 31 calculates the packet count value as 0 at each of the calculation times Tk-3, Tk-2, Tk-1, and Tk.

However, the situation of the signal trunk line Ls differs greatly between FIG. 28A and FIG. 28B. That is, in FIG. 28A, packets periodically reach the dwelling unit A, and the output unit 36 can output them continuously. In FIG. 28B, however, packets do not reach the dwelling unit A, and therefore the output unit 36 cannot output continuously.

In order to distinguish these situations, the counting unit 31 performs the following processing. First, the difference between the calculation time (current time) and the latest packet reception time is compared with the count cycle Tb. If the difference is smaller than the count cycle Tb, it is determined that the situation is that of FIG. 28A. On the other hand, if the difference is greater than the count cycle Tb, it is determined that no packet has been received since the previous calculation time, that is, that the situation is that of FIG. 28B, and the following processing is performed. As shown in FIG. 28B, the number of stored packets is 0 at the calculation time Tk-3 and is again 0 at the calculation time Tk-2, so the number of consecutive times at the calculation time Tk-2 is one. In this case, the counting unit 31 calculates 0 as the packet count value at the calculation time Tk-2.

In addition, at the calculation time Tk-1, the number of consecutive times the stored packet count is 0 is two. Therefore, the counting unit 31 calculates −1, which is the value obtained by subtracting 1 from the number of consecutive times (2) and multiplying the result by −1, as the packet count value at the calculation time Tk-1. At the calculation time Tk, the number of consecutive times is 3, so the counting unit 31 calculates −2, which is the value obtained by subtracting 1 from the number of consecutive times (3) and multiplying the result by −1, as the packet count value at the calculation time Tk. In other words, the counting unit 31 calculates (number of consecutive times − 1) · (−1) as the packet count value.
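The rule (number of consecutive times − 1) · (−1) and the comparison with the count cycle Tb can be written as a small sketch. The function names are hypothetical, and how the number of consecutive times is tracked between calculation times is left to the caller.

```python
def no_packet_since_previous(t_k, latest_rx, t_b):
    """True when the situation is judged to be that of FIG. 28B.

    The difference between the calculation time and the latest reception
    time is compared with the count cycle Tb.
    """
    return (t_k - latest_rx) > t_b

def zero_run_packet_count(consecutive_times):
    """Packet count value while a stored packet count of 0 continues."""
    if consecutive_times < 1:
        raise ValueError("consecutive_times must be 1 or more")
    return (consecutive_times - 1) * (-1)

# Values at the calculation times Tk-2, Tk-1 and Tk in FIG. 28B.
print([zero_run_packet_count(c) for c in (1, 2, 3)])  # [0, -1, -2]
```

Because the value becomes more negative the longer no packet arrives, the representative value drops below the reference value and packets tend to be inserted rather than deleted.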

Thus, the packet count value can be calculated in a way that distinguishes the case of FIG. 28A, in which packets are received periodically but the number of stored packets happens to be zero at the calculation time, from the case of FIG. 28B, in which packets are not received. Therefore, in the case of FIG. 28B, packets are less likely to be deleted from the jitter buffer 30 than in the case of FIG. 28A.

Next, the process of inserting or deleting a packet in the jitter buffer 30 will be specifically described. When the buffer size changing unit 32 deletes one packet from the jitter buffer 30 and there are two or more consecutive valid packets containing voice, it overlap-adds two consecutive valid packets located in the middle of that run of valid packets, thereby deleting one packet.

FIGS. 29A, 29B, and 29C are explanatory diagrams of processing in which the buffer size changing unit 32 deletes one packet by overlap addition. FIG. 29A shows the jitter buffer 30 before deletion, and FIG. 29B shows the jitter buffer 30 after deletion.

In FIGS. 29A, 29B, and 29C, the read pointer RP indicates the start address of the jitter buffer 30, which has a ring buffer structure, and the write pointer WP indicates the end address of the jitter buffer 30. In FIG. 29, each cell indicates one packet, and the number in each cell indicates the time-series order of the packet. A white cell indicates an invalid packet, and a gray cell indicates a valid packet.

In the case of FIG. 29A, the fifth and sixth valid packets, which are located in the section of the fourth to seventh valid packets rather than the section of the first to second valid packets, are combined into one packet by overlap addition as shown in FIG. 29B, so that one packet is deleted.

Here, if overlap addition were performed in the first to second valid packet section shown in FIG. 29A, an invalid packet would follow the one packet generated by the overlap addition, so voice degradation could increase when packet loss concealment processing is performed. On the other hand, if the fifth valid packet and the sixth valid packet are overlap-added, the packets before and after the one packet generated by overlap addition are both valid packets, so that voice deterioration due to the packet loss concealment process can be reduced.

In other words, one packet can be deleted by overlap addition wherever two or more valid packets are consecutive, but performing the overlap addition in a section with many consecutive valid packets reduces the voice deterioration that occurs when packet loss concealment processing is performed.

Therefore, in the jitter buffer 30, when there are a plurality of sections in which valid packets are continuous, overlap addition is performed using a valid packet in the middle of a section in which the number of consecutive valid packets is large.
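The selection rule can be sketched as follows. This is only an illustration under stated assumptions: the jitter buffer is modeled as a list of booleans (True for a valid packet, False for an invalid one), the helper names `longest_valid_run` and `overlap_add_pair` are hypothetical, and when two sections have the same length the earlier one is assumed to be chosen.

```python
def longest_valid_run(valid):
    """Return (start, length) of the longest run of True values, (0, 0) if none."""
    best_start, best_len = 0, 0
    start = None
    # A trailing False acts as a sentinel that closes the final run.
    for i, v in enumerate(list(valid) + [False]):
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return best_start, best_len


def overlap_add_pair(valid):
    """Return indices (i, i + 1) of the two valid packets to merge, or None."""
    start, length = longest_valid_run(valid)
    if length < 2:
        return None
    # Take the pair in the middle of the longest section.
    first = start + (length - 2) // 2
    return first, first + 1
```

For a buffer laid out as described for FIG. 29A (packets 1 and 2 valid, packet 3 invalid, packets 4 to 7 valid), `overlap_add_pair` returns the zero-based indices 4 and 5, that is, the fifth and sixth packets.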

Here, overlap addition using triangular window functions RF1 and RF2 can be adopted as shown in FIG. 29C. Specifically, the buffer size changing unit 32 applies the triangular window function RF1 to the audio signal of the fifth packet and the triangular window function RF2 to the audio signal of the sixth packet, adds the two windowed audio signals to generate one audio signal, and stores this audio signal as one packet, thereby performing the overlap addition.

Here, as the triangular window function RF1, a linear function having a time width of 20 msec, a maximum value of 1 and a minimum value of 0 and decreasing in value as time passes can be adopted. As the triangular window function RF2, a linear function having a time width of 20 msec, a maximum value of 1 and a minimum value of 0 and increasing in value as time passes can be adopted.
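A minimal sketch of this windowed overlap addition is given below. The function name `overlap_add` is hypothetical, each packet is assumed to be a list of equally many samples (for example 160 samples if a 20 msec packet were sampled at 8 kHz, which the description does not state), and the windows are assumed to reach exactly 0 and 1 at the first and last samples.

```python
def overlap_add(frame_a, frame_b):
    """Merge two equal-length frames into one using linear cross-fade windows."""
    if len(frame_a) != len(frame_b) or len(frame_a) < 2:
        raise ValueError("frames must have the same length (at least 2 samples)")
    n = len(frame_a)
    out = []
    for i in range(n):
        rf2 = i / (n - 1)   # RF2: rises linearly from 0 to 1 over the frame
        rf1 = 1.0 - rf2     # RF1: falls linearly from 1 to 0 over the frame
        out.append(rf1 * frame_a[i] + rf2 * frame_b[i])
    return out
```

Because RF1 and RF2 sum to 1 at every sample, a signal that is identical in both packets passes through unchanged, while differing signals are cross-faded from the earlier packet to the later one.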

Further, when deleting a packet from the jitter buffer 30, the buffer size changing unit 32 deletes the invalid packet if there is an invalid packet inserted in the past.

FIGS. 30A and 30B are explanatory diagrams of processing in which the buffer size changing unit 32 deletes one invalid packet. FIG. 30A shows the jitter buffer 30 before deletion, and FIG. 30B shows the jitter buffer 30 after deletion.

In FIG. 30A, the third and fourth packets are invalid packets. Therefore, the buffer size changing unit 32 deletes one packet by deleting either the third or the fourth packet. Here, when there are a plurality of invalid packets in the jitter buffer 30, one invalid packet may, for example, be selected at random and deleted. Alternatively, when two or more invalid packets are present in succession, the buffer size changing unit 32 may preferentially extract the invalid packets in the continuous area, randomly select one invalid packet from the extracted invalid packets, and delete it.

In addition, when inserting a packet into the jitter buffer 30, the buffer size changing unit 32 inserts an invalid packet between these two valid packets if there are two consecutive valid packets.

FIGS. 31A and 31B are explanatory diagrams of processing in which the buffer size changing unit 32 inserts one packet. FIG. 31A shows the jitter buffer 30 before insertion, and FIG. 31B shows the jitter buffer 30 after insertion.

As shown in FIGS. 31A and 31B, one invalid packet is inserted between the fifth valid packet and the sixth valid packet. This is because inserting the invalid packet between the fifth and sixth valid packets leaves a larger number of consecutive valid packets than inserting it elsewhere.

For example, if an invalid packet were inserted between the first valid packet and the second valid packet, there would still be a valid packet before and after the inserted invalid packet, so packet loss concealment processing could be performed.

However, the packets before and after the second valid packet would then both be invalid packets, so the number of consecutive valid packets would become small. On the other hand, if an invalid packet is inserted between the fifth valid packet and the sixth valid packet, every valid packet remains adjacent to another valid packet. Here, when packet loss concealment processing is performed, voice deterioration is reduced as the number of consecutive valid packets increases. Therefore, when there are a plurality of sections of consecutive valid packets in the jitter buffer 30, the buffer size changing unit 32 inserts the invalid packet in the middle of a section in which the number of consecutive valid packets is large.

The buffer size changing unit 32 has a predetermined upper limit value for the number of packets that can be inserted or deleted at a time.

FIGS. 32A and 32B are diagrams for explaining processing when five packets are inserted into the jitter buffer 30 at once. FIG. 32A shows the jitter buffer 30 before insertion, and FIG. 32B shows the jitter buffer 30 after insertion. In FIGS. 32A and 32B, five invalid packets are inserted between the first valid packet and the second valid packet. In this case, invalid packets occur in succession, so there is a risk that voice deterioration will increase. Therefore, an upper limit is set on the number of invalid packets inserted. Here, "at once" refers to one process executed when the above-described count cycle Tb has elapsed.

For example, if the upper limit value is set to 3 in FIG. 32A, even if it is necessary to insert five invalid packets, only three invalid packets are inserted.

This prevents the number of consecutive invalid packets from exceeding a certain number and reduces voice deterioration due to packet loss concealment processing.

In addition, when a valid packet corresponding to a deleted invalid packet is received after that invalid packet has been deleted, the buffer size changing unit 32 replaces another invalid packet with the received valid packet.

FIGS. 33A, 33B, and 33C are diagrams for explaining processing when a valid packet corresponding to a deleted invalid packet is received after the invalid packet has been deleted. FIG. 33A shows the jitter buffer 30 before deletion, FIG. 33B shows the jitter buffer 30 after deletion, and FIG. 33C shows the jitter buffer 30 after replacement.

As shown in FIGS. 33A and 33B, the third invalid packet has been deleted. Thereafter, as shown in FIG. 33C, the third valid packet corresponding to the third invalid packet is received.

In this case, since the fourth packet next to the third invalid packet is an invalid packet, the buffer size changing unit 32 replaces the fourth invalid packet with the received third valid packet. As a result, the third valid packet can be restored, and voice deterioration can be reduced.

Here, when a packet is accumulated in the jitter buffer 30, the buffer size changing unit 32 determines whether or not an invalid packet corresponding to the accumulated packet is stored in the jitter buffer 30. If the corresponding invalid packet is stored in the jitter buffer 30, the buffer size changing unit 32 determines whether another invalid packet is stored next to it. If so, the buffer size changing unit 32 deletes that next invalid packet and inserts the received valid packet at the deleted location, so that the next invalid packet is exchanged for the received valid packet.

On the other hand, when no invalid packet corresponding to the packet accumulated in the jitter buffer 30 is stored in the jitter buffer 30, or when no invalid packet is stored next to the corresponding invalid packet, the buffer size changing unit 32 does not perform the above replacement. The buffer size changing unit 32 may determine that a valid packet corresponding to an invalid packet has been received when a packet having the same sequence number as that of the invalid packet is accumulated in the jitter buffer 30.

Also, when inserting a packet between two consecutive valid packets, the buffer size changing unit 32 may cause the concealment processing unit 35 to execute packet loss concealment processing using the preceding valid packet, thereby generating a concealed packet, and may insert the concealed packet into the jitter buffer 30.

FIGS. 34A and 34B are diagrams for explaining processing when the buffer size changing unit 32 inserts a concealed packet into the jitter buffer 30 in place of an invalid packet. FIG. 34A shows the jitter buffer 30 before insertion, and FIG. 34B shows the jitter buffer 30 after insertion.

In FIGS. 34A and 34B, a concealed packet is inserted between the third valid packet and the fourth valid packet.

Thus, when the output unit 36 reads a packet (voice data) from the jitter buffer 30, it is not necessary to execute the packet loss concealment process, and the processing delay of the packet loss concealment process at the time of output can be reduced.

Note that when inserting an invalid packet, the buffer size changing unit 32 preferably inserts an invalid packet between two consecutive packets including vowel sounds. Thereby, the voice generated by executing the packet loss concealment process on the inserted invalid packet is continuously connected to the voice included in the preceding and succeeding packets, and voice deterioration can be reduced.

FIG. 35 is a flowchart showing the deletion process by the buffer size changing unit 32.

First, in step S51, the buffer size changing unit 32 determines whether or not the number of packet deletion requests is equal to or less than a predetermined maximum packet deletion number (upper limit value). If the number of deletion requests is equal to or less than the upper limit value (YES in step S51), the deletion count value DN is set to the number of deletion requests (step S52). On the other hand, when the number of deletion requests is larger than the upper limit value (NO in step S51), the deletion count value DN is set to the upper limit value (step S53).

Next, when the maximum number of consecutive valid packets in the jitter buffer 30 is 2 or more (2 or more in step S54), the buffer size changing unit 32 determines whether or not the maximum consecutive number is at least twice the deletion count value DN (step S55). The comparison is made against twice the deletion count value DN because two packets are overlap-added to delete one packet, so deleting DN packets requires twice DN consecutive valid packets.

When the buffer size changing unit 32 determines that the maximum consecutive number is at least twice the deletion count value DN (YES in step S55), it deletes as many packets as the deletion count value DN by overlap addition and updates the deletion count value DN by subtracting the number of deleted packets from it (step S58).

On the other hand, when the maximum consecutive number is less than twice the deletion count value DN (NO in step S55), the buffer size changing unit 32 deletes as many packets as can be deleted by overlap addition, updates the deletion count value DN by subtracting the number of deleted packets from it (step S56), and returns the process to step S54.

For example, when the maximum consecutive number is 7 and twice the deletion count value DN (= 4) is 8, three packets are deleted by overlap-adding six of the seven consecutive valid packets. The deletion count value DN is then updated to DN = 1 (= 4 - 3).

On the other hand, if the maximum number of consecutive valid packets is 1 or less in step S54 (1 or less in step S54), invalid packets are deleted, and the deletion count value DN is updated by subtracting the number of deleted packets from it (step S57).

For example, if the deletion count value DN is 4 and the number of invalid packets is 3, then 3 invalid packets are deleted and updated to DN = 1 (= 4-3).

In step S59, the buffer size changing unit 32 determines whether or not the deletion count value DN is 0. If the deletion count value DN is 0 (YES in step S59), the process ends.

On the other hand, if the deletion count value DN is not 0 in step S59 (NO in step S59) and there is a valid packet (YES in step S60), the buffer size changing unit 32 deletes the valid packet (step S61) and ends the process. In this case, since the valid packet to be deleted is not continuous with other valid packets, it is simply deleted without overlap addition. If there is no valid packet (NO in step S60), the process ends as it is.
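The flow of FIG. 35 can be sketched roughly as follows. This is a simplified model, not the actual implementation: the buffer is a list of booleans (True for valid, False for invalid), overlap-adding a pair is modeled as removing one packet from the middle of the longest section of valid packets, and the names `longest_run` and `delete_packets` are hypothetical. The description does not say how many isolated valid packets step S61 removes, so this sketch assumes it removes up to the remaining DN.

```python
def longest_run(buf):
    """Return (start, length) of the longest run of valid (True) packets."""
    best, start = (0, 0), None
    for i, v in enumerate(list(buf) + [False]):  # sentinel closes the final run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start > best[1]:
                best = (start, i - start)
            start = None
    return best


def delete_packets(buf, requested, max_delete):
    """Apply the deletion flow to a copy of buf and return the result."""
    buf = list(buf)
    dn = min(requested, max_delete)            # S51 to S53: clamp to the upper limit
    while dn > 0:
        start, run = longest_run(buf)          # S54
        if run < 2:
            # S57: delete up to DN invalid packets (highest index first).
            invalid = [i for i, v in enumerate(buf) if not v][:dn]
            for i in reversed(invalid):
                del buf[i]
            dn -= len(invalid)
            break
        # S55: enough consecutive valid packets for DN merges? S58 if so, else S56.
        k = dn if run >= 2 * dn else run // 2
        mid = start + (run - 2 * k) // 2       # centre the 2k merged packets in the run
        del buf[mid:mid + k]                   # each merged pair loses one packet
        dn -= k
    while dn > 0 and True in buf:              # S59 to S61 (assumed: up to DN packets)
        buf.remove(True)
        dn -= 1
    return buf
```

With seven consecutive valid packets and four deletion requests, the first pass removes three packets (DN becomes 1) and the second pass removes the last one, which mirrors the worked example in the description.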

FIG. 36 is a flowchart showing the insertion processing by the buffer size changing unit 32.

First, in step S71, the buffer size changing unit 32 determines whether or not the number of packet insertion requests is equal to or less than a predetermined maximum packet insertion number (upper limit value). If the number of insertion requests is equal to or less than the maximum number of insertions (YES in step S71), the number of insertions is set to the number of insertion requests (step S72). On the other hand, if the number of insertion requests is larger than the maximum number of insertions (NO in step S71), the number of insertions is set to the maximum number of insertions (step S73).

Next, when the maximum number of consecutive valid packets in the jitter buffer 30 is 0 (0 in step S74), the buffer size changing unit 32 inserts as many invalid packets as the number of insertions from the beginning of the jitter buffer 30 (step S75) and ends the process.

When the maximum number of consecutive valid packets in the jitter buffer 30 is 2 or more (2 or more in step S74), the buffer size changing unit 32 inserts as many invalid packets as the number of insertions in the middle of the section of consecutive valid packets (step S76) and ends the process.

When the maximum number of consecutive valid packets in the jitter buffer 30 is 1 (1 in step S74), the buffer size changing unit 32 inserts as many invalid packets as the number of insertions immediately after the valid packet (step S77) and ends the process.
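The three branches of FIG. 36 can be sketched as follows. As before, this is an illustrative model only: the buffer is a list of booleans (True for valid, False for invalid), the names `longest_run` and `insert_packets` are hypothetical, and when several isolated valid packets exist the sketch assumes the invalid packets go after the first one.

```python
def longest_run(buf):
    """Return (start, length) of the longest run of valid (True) packets."""
    best, start = (0, 0), None
    for i, v in enumerate(list(buf) + [False]):  # sentinel closes the final run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start > best[1]:
                best = (start, i - start)
            start = None
    return best


def insert_packets(buf, requested, max_insert):
    """Apply the insertion flow to a copy of buf and return the result."""
    buf = list(buf)
    count = min(requested, max_insert)   # S71 to S73: clamp to the upper limit
    start, run = longest_run(buf)        # S74
    if run == 0:
        pos = 0                          # S75: insert from the beginning
    elif run >= 2:
        pos = start + run // 2           # S76: middle of the longest section
    else:
        pos = start + 1                  # S77: immediately after the valid packet
    buf[pos:pos] = [False] * count
    return buf
```

For a buffer laid out as described for FIG. 31A (packets 1 and 2 valid, packet 3 invalid, packets 4 to 7 valid), one insertion lands between the fifth and sixth packets. With two valid packets, five requests, and an upper limit of 3, only three invalid packets are inserted.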

As described above, when one packet is deleted from the jitter buffer 30, one packet is generated by overlapping and adding two packets located in the middle of a section where two or more valid packets are continuous. Therefore, voice quality degradation can be reduced.

Further, when a packet is inserted into the jitter buffer 30, if there are two consecutive valid packets, an invalid packet is inserted between these two valid packets. When packet loss concealment processing is executed for this invalid packet, the invalid packet can be concealed from the preceding and succeeding valid packets, the continuity of the voice is maintained, and the voice can be reproduced smoothly.

Note that the packet loss concealment processing performed by the concealment processing unit 35 of the fluctuation absorption processing unit JA can be replaced by the voice data loss compensation processing by the voice data loss compensation processing unit VC described above.

As described above, in the dwelling unit A of the present embodiment, the call processing unit 2 executes the first software when the other party's call terminal uses the analog transmission method and executes the second software when the other party's call terminal uses the packet transmission method, so that call processing suited to each transmission method can be selectively executed. As a result, call quality can be improved while suppressing circuit complexity and cost increases, with the packet transmission method used for voice transmission via the signal trunk line Ls and the analog transmission method used for voice transmission in the vicinity of the dwelling that does not pass through the signal trunk line Ls.

(Embodiment 2)
Hereinafter, the second embodiment of the present invention will be described in detail with reference to FIGS. For the sake of clarity, elements that are the same as those in the intercom system for multi-dwelling houses of Embodiment 1 are assigned the same reference numerals, and the description thereof is omitted.

Since both the voice data loss compensation process and the speech speed conversion process in the first embodiment described above use the pitch of the voice, a pitch detection process for detecting the pitch of the voice is necessary. However, if the voice data loss compensation processing program and the speech speed conversion processing program each include their own pitch detection processing program (program module), the memory into which the programs are loaded is wasted. Therefore, this embodiment is characterized in that the pitch detection processing program for detecting the pitch of the voice is made independent of the voice data loss compensation processing program and the speech speed conversion processing program, and the pitch detected by the pitch detection processing is shared by the voice data loss compensation processing and the speech speed conversion processing. This reduces wasteful consumption of memory.

Hereinafter, the call processing unit 2 of the present embodiment will be described. Note that the speech speed conversion processing unit SE of the present embodiment may execute processing other than speech speed conversion processing, such as voice quality conversion processing, speech segment detection processing, speech enhancement processing, speaker discrimination processing, or speech recognition processing.

As shown in FIG. 37, the call processing unit 2 of the present embodiment includes an acoustic echo canceller EC1, a voice switch VS, a voice data loss detection unit 15, a pitch detection unit 16, a voice data loss compensation processing unit VC, and a speech speed conversion processing unit SE. The voice data loss detection unit 15 detects the loss of voice data output from the transmission processing unit 7, and sets a detection flag when the voice data output from the jitter buffer of the transmission processing unit 7 is not continuous. Note that the causes of voice data loss include packet loss, delay, and jitter (fluctuation) associated with transmission, as described in the first embodiment.

Based on the detection flag from the voice data loss detection unit 15 and a counter inside the pitch detection unit 16, the pitch detection unit 16 detects the pitch of the voice from the output voice data (voice data that has undergone loss compensation or voice data that has not; the same applies hereinafter). As a specific method of pitch detection, for example, a method of calculating the autocorrelation of the voice while changing the frame length and estimating the frame length having the highest correlation as the pitch of the voice may be used. When the voice data loss detection unit 15 detects a loss of voice data (when the detection flag is set), the voice data loss compensation processing unit VC compensates for the loss based on the pitch detected by the pitch detection unit 16. Specifically, the voice data loss compensation processing unit VC extracts voice data for one pitch from the past voice data held in the buffer and fills the gap with it so that the voice is not interrupted. If there is no loss of voice data, the voice data loss compensation processing unit VC outputs the input voice data as it is, without loss compensation.
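The autocorrelation approach and the one-pitch fill can be sketched as follows. This is a minimal illustration under assumptions: the names `detect_pitch_lag` and `fill_gap` are hypothetical, the candidate frame lengths are expressed as lags in samples, normalized autocorrelation is used as the correlation measure, and the gap is filled by simply repeating the last pitch period.

```python
import math


def detect_pitch_lag(samples, min_lag, max_lag):
    """Return the lag (in samples) with the highest normalized autocorrelation.

    Returns None if the signal carries no energy in the searched range.
    """
    if not 1 <= min_lag <= max_lag < len(samples):
        raise ValueError("need 1 <= min_lag <= max_lag < len(samples)")
    best_lag, best_score = None, -math.inf
    for lag in range(min_lag, max_lag + 1):
        a = samples[:-lag]   # signal
        b = samples[lag:]    # same signal shifted by the candidate lag
        energy = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        if energy == 0.0:
            continue
        score = sum(x * y for x, y in zip(a, b)) / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def fill_gap(history, lag, length):
    """Fill a gap of `length` samples by repeating the last pitch period."""
    period = history[-lag:]
    return [period[i % lag] for i in range(length)]
```

Restricting `min_lag` and `max_lag` limits the search to a chosen frequency range, which also bounds the amount of computation per detection.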

The speech speed conversion processing unit SE converts the speech speed of the original voice by expanding or compressing the voice data output from the voice data loss compensation processing unit VC. For example, the speech speed is converted (made faster or slower) by inserting or deleting waveforms in units of pitch based on a conventionally known speech speed conversion algorithm called PICOLA (Pointer Interval Controlled OverLap and Add). These units are realized by causing a DSP (Digital Signal Processor) to execute a predetermined program.

Here, if the voice data loss compensation processing unit VC and the speech speed conversion processing unit SE each performed the pitch detection processing individually, the processing load would increase when the voice data loss compensation processing and the speech speed conversion processing are executed simultaneously in the call processing unit 2. In contrast, the call processing unit 2 of the present embodiment has only one pitch detection unit 16, and both the voice data loss compensation processing unit VC and the speech speed conversion processing unit SE use the pitch detected by this common pitch detection unit 16. Because both units share the pitch detected by the pitch detection unit 16, an increase in the processing load on the DSP can be suppressed when the voice data loss compensation processing and the speech speed conversion processing are executed simultaneously.

As shown in FIG. 38, the pitch detection unit 16 in the present embodiment counts a predetermined detection cycle Tx and repeatedly detects the pitch in synchronization with the detection cycle Tx. When the voice data loss detection unit 15 detects that voice data is missing, the pitch detection unit 16 detects the pitch at the detection time point t1 of the loss and restarts the detection cycle Tx from the detection time point t1. That is, because the pitch detection unit 16 repeatedly detects the pitch in synchronization with a fixed detection cycle Tx, the pitch detected by the pitch detection unit 16 stays close to the pitch of the speech section on which the speech speed conversion processing unit SE executes the speech speed conversion process, so the quality of the voice after speech speed conversion can be maintained. It is desirable to set the detection cycle Tx to a time during which the voice can be regarded as steady, for example, about 10 milliseconds.
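The timing behavior can be sketched as a simple schedule generator. This is an illustrative model only: the function name `pitch_detection_times` is hypothetical, times are integer milliseconds measured from an assumed first detection at 0, and a loss that coincides with a scheduled detection is assumed to need no extra detection.

```python
def pitch_detection_times(tx_ms, end_ms, loss_times_ms=()):
    """List the times in [0, end_ms] at which pitch detection runs."""
    if tx_ms <= 0:
        raise ValueError("detection cycle Tx must be positive")
    losses = sorted(loss_times_ms)
    times = []
    t = 0
    while t <= end_ms:
        times.append(t)
        nxt = t + tx_ms
        # A loss detected before the next scheduled time triggers detection
        # immediately, and the cycle Tx restarts from that time point.
        pending = [x for x in losses if t < x < nxt]
        t = pending[0] if pending else nxt
    return times
```

With Tx = 10 ms and a loss detected at 14 ms, the schedule becomes 0, 10, 14, 24, 34 rather than 0, 10, 20, 30, 40.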

On the other hand, in the voice data loss compensation process, a longer interval must be compensated as compared to the speech speed conversion process, so that more accurate pitch detection is required. Therefore, when the audio data loss detection unit 15 detects the audio data loss, the pitch detection unit 16 immediately detects the pitch regardless of the detection cycle Tx, so that the audio data loss compensation processing unit VC performs the audio data loss compensation. Quality in processing can be maintained.

Here, it is desirable that the pitch detection unit 16 detects only pitches in a predetermined frequency range. Since the frequency of the voice waveform in a normal voice call lies within a range from a little over one hundred hertz to several hundred hertz, detecting only pitches in that frequency range avoids pitch detection in unnecessary frequency ranges and thereby reduces the processing load.

Also, it is desirable that the speech speed conversion processing unit SE detects the speech section of the speech data and converts only the speech data in the speech section. That is, the processing load in the speech speed conversion process can be reduced by not performing the speech speed conversion process in a section other than the speech section (for example, a silent section).

(Embodiment 3)
Hereinafter, the third embodiment of the present invention will be described in detail with reference to FIGS. 39A to 42. For the sake of clarity, similar elements are assigned the same reference numerals as those for the intercom system for collective housing of the second embodiment, and description thereof is omitted.

The voice data loss detection unit 15 in this embodiment detects the loss of voice data in synchronization with a first time interval T1 (= τ/m), obtained by dividing the time length τ of the voice data for one packet by a positive integer m, and with the voice data input timing. The pitch detection unit 16 in this embodiment detects the pitch in synchronization with the detection cycle Tx (= n × τ/m), obtained by multiplying the first time interval T1 by a positive integer n, and with the first time interval T1.

Here, the execution timing of the voice data loss detection process and the pitch detection process when m = n = 4 will be described with reference to FIGS. 39A and 39B. As shown in FIG. 39A, the voice data loss detection unit 15 and the pitch detection unit 16 perform the voice data loss detection process and the pitch detection process at intervals of τ/4. Then, as shown in FIG. 39B, if the start of the speech speed conversion process is instructed at time t = t0, the speech speed conversion processing unit SE executes the speech speed conversion process using the latest pitch detected by the pitch detection unit 16 immediately before that time (time t = t0).

As described above, synchronizing the timing at which the voice data loss detection process is executed with the timing at which the pitch detection process is executed has the advantage of simplifying control of the timing at which the pitch detection unit 16 executes the pitch detection process.

Also, as shown in FIG. 40, when a loss of voice data is detected at the time when the start of the speech speed conversion process is instructed (time t = t0), the speech speed conversion processing unit SE can suppress deterioration in voice quality due to the speech speed conversion process by performing the speech speed conversion using the pitch detected by the pitch detection unit 16 immediately before the loss of voice data was detected.

Alternatively, as shown in FIG. 41, when a loss of voice data is detected at the time when the start of the speech speed conversion process is instructed (time t = t0), the speech speed conversion processing unit SE may perform the speech speed conversion using the pitch detected by the pitch detection unit 16 from the voice data compensated by the voice data loss compensation processing unit VC. In this way, even when the speech speed conversion process is started while voice data is missing, the pitch detection unit 16 only needs to execute the pitch detection process at a constant detection cycle Tx, which has the advantage of simplifying control of the timing at which the pitch detection unit 16 executes the pitch detection process.

Here, consider a case where the dwelling unit A according to the present embodiment has a recording unit (not shown) that can record the voice data output from the voice data loss compensation processing unit VC, and the speech speed conversion processing unit SE performs the speech speed conversion process on the recorded voice data. In the case of recording and playback, ease of listening is improved by performing the speech speed conversion process not only on speech sections but also on non-speech sections. On the other hand, if the speech speed conversion process is performed on non-speech sections during a normal call, the delay due to the speech speed conversion process increases, which hinders natural conversation. When the speech speed conversion process is also performed on non-speech sections in this way, it is desirable to make the detection cycle Tx2 in the non-speech section longer than the detection cycle Tx1 in the speech section (Tx1 < Tx2), as shown in FIG. As a result, pitch detection is performed with the relatively short detection cycle Tx1 in the speech section, so the quality of the speech speed conversion process is ensured, while pitch detection is performed with the relatively long detection cycle Tx2 in the non-speech section, so the processing load can be reduced.

While the invention has been described in terms of several preferred embodiments, various modifications and variations can be made by those skilled in the art without departing from the true spirit and scope of the invention, ie, the claims.

Claims

1. A dwelling unit of an intercom system for apartment houses, the intercom system comprising a common unit device installed in the common entrance of the apartment house, a dwelling unit installed in each dwelling unit of the apartment building, a door phone slave unit installed in the outer entrance of the apartment building, a signal trunk line connected to the common unit device, a dwelling unit line branched from the signal trunk line and connected to each dwelling unit, and a slave unit connection line connecting the dwelling unit and the door phone slave unit, wherein call voice is transmitted between the common unit device and the dwelling units and between the dwelling units by a packet transmission method via the signal trunk line and the dwelling unit line, and call voice is transmitted between the dwelling unit and the door phone slave unit by an analog transmission method via the slave unit connection line, the dwelling unit comprising:
a microphone and a speaker; a transmission processing unit that transmits a voice packet including voice data for a call and a control packet including control data for call control via the dwelling unit line and the signal trunk line; an analog signal transmission unit that transmits an analog audio signal via the slave unit connection line; a first conversion processing unit that converts an analog audio signal output from the microphone into voice data, and converts voice data into an analog audio signal and outputs it to the speaker; a second conversion processing unit that converts an analog audio signal received by the analog signal transmission unit into voice data, and converts voice data into an analog audio signal and outputs it to the analog signal transmission unit; a call processing unit that performs predetermined call processing on voice data; a door phone call detection unit that detects a call from the door phone slave unit; a storage unit that stores first software for call processing of voice data transmitted by the analog transmission method and second software for call processing of voice data transmitted by the packet transmission method; and a control unit that instructs the call processing unit to execute call processing,
wherein the control unit instructs the call processing unit to execute the first software when the door phone call detection unit detects the call, and instructs the call processing unit to execute the second software when control data for call control is received from the common unit device or a dwelling unit.
The dwelling unit device of the intercom system for an apartment house according to claim 1, wherein the second software includes an acoustic echo suppression processing program for suppressing acoustic echo generated by acoustic coupling between the microphone and the speaker, and a residual echo suppression processing program for suppressing residual echo that the acoustic echo suppression processing cannot suppress.
The dwelling unit device of the intercom system for an apartment house according to claim 1 or 2, wherein the second software includes a fluctuation absorption processing program for absorbing fluctuations in transmission delay in the transmission processing unit.
The dwelling unit device of the intercom system for an apartment house according to claim 3, further comprising a fluctuation absorbing buffer that accumulates the voice data contained in the voice packets received by the transmission processing unit,
wherein the fluctuation absorption processing program causes the call processing unit to perform: a counting step of counting, at a period not longer than the packetization period of the voice packets, the number of packets of voice data stored in the fluctuation absorbing buffer to calculate a packet count value; and a buffer size changing step of inserting a packet into, or deleting a packet from, the fluctuation absorbing buffer based on the packet count value calculated in the counting step.
The dwelling unit device of the intercom system for an apartment house according to claim 4, wherein, in the buffer size changing step, the fluctuation absorption processing program causes the call processing unit to calculate a representative value of the packet count value based on the past history of the packet count value, to delete a packet from the fluctuation absorbing buffer when the calculated representative value is larger than a predetermined reference value, and to insert a packet into the fluctuation absorbing buffer when the representative value is smaller than the reference value.
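The buffer size changing decision can be sketched as follows. The claim only requires a representative value derived from the past history; the use of the median here is an assumption made for illustration, and the function name `buffer_action` is hypothetical.

```python
from statistics import median


def buffer_action(count_history, reference):
    """Decide how to change the fluctuation absorbing buffer.

    count_history: past packet count values from the counting step.
    reference:     predetermined reference value.
    Returns "delete", "insert" or "keep".
    """
    # Assumption: the median stands in for the representative value.
    representative = median(count_history)
    if representative > reference:
        return "delete"   # buffer holds more packets than needed
    if representative < reference:
        return "insert"   # buffer is running short
    return "keep"
```

For example, a history of `[4, 5, 6]` against a reference of 3 yields a deletion, while `[1, 1, 2]` yields an insertion.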
The dwelling unit device of the intercom system for an apartment house according to claim 4 or 5, wherein the fluctuation absorption processing program causes the call processing unit to record the reception time of the latest packet and, in the counting step, to calculate the packet count value by setting the count value of the latest packet to the difference between the calculation time of the packet count value and the reception time, divided by the packetization period, and setting the count value of each packet other than the latest packet to 1.
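A minimal sketch of this counting rule follows. The function name `packet_count` and the use of integer milliseconds are assumptions for illustration only.

```python
def packet_count(num_packets, now_ms, latest_rx_ms, packet_period_ms):
    """Packet count value with a fractional count for the latest packet.

    Every packet other than the latest counts as 1; the latest packet
    counts as (calculation time - reception time) / packetization period.
    """
    if num_packets == 0:
        return 0.0
    latest_count = (now_ms - latest_rx_ms) / packet_period_ms
    return (num_packets - 1) + latest_count
```

With three stored packets, a 20 ms packetization period, and a calculation made 10 ms after the latest reception, the packet count value is 2.5.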
The dwelling unit device of the intercom system for an apartment house according to claim 5, wherein the fluctuation absorption processing program causes the call processing unit to hold, in the counting step, the packet count values of the past N times (N is a positive integer), and, in the buffer size changing step, to set as the representative value the n-th smallest packet count value (n is a positive integer less than N) among the packet count values of the past N times.
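Holding the past N packet count values and taking the n-th smallest as the representative value can be sketched as below. The class name `CountHistory` is hypothetical.

```python
from collections import deque


class CountHistory:
    """Holds the packet count values of the past N counting steps."""

    def __init__(self, n_history):
        self._values = deque(maxlen=n_history)  # oldest value drops out automatically

    def add(self, count_value):
        self._values.append(count_value)

    def representative(self, n):
        """Return the n-th smallest held value (n is 1-based)."""
        ordered = sorted(self._values)
        if not 1 <= n <= len(ordered):
            raise ValueError("n must be between 1 and the number of held values")
        return ordered[n - 1]
```

Choosing a small n biases the representative value toward the emptier states of the buffer, so short bursts of arrivals do not by themselves trigger deletions.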
The dwelling unit device of the intercom system for an apartment house according to claim 5, wherein the fluctuation absorption processing program causes the call processing unit, in the counting step, to determine the presence or absence of a spike delay based on the packet count values of the past N times and, when it determines that a spike delay has occurred, to extract the packet count values of the past M times (M is a positive integer satisfying M < N) from the packet count values of the past N times, and, in the buffer size changing step, to calculate as the representative value the m-th packet count value (m is an integer less than M) among the packet count values of the past M times extracted in the counting step.
The dwelling unit device of the intercom system for an apartment house according to any one of claims 4 to 8, wherein, when the packet count value is zero consecutively in the counting step, the fluctuation absorption processing program causes the call processing unit to calculate, as the packet count value, a negative value whose absolute value increases as the number of consecutive zeros increases.
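The handling of consecutive zero counts can be sketched as follows. The claim does not state how fast the absolute value grows or whether the first zero is already replaced; this sketch assumes that the first zero is kept as 0 and that each further consecutive zero lowers the value by 1. The class name `ZeroRunCounter` is hypothetical.

```python
class ZeroRunCounter:
    """Replaces runs of zero packet counts with increasingly negative values."""

    def __init__(self):
        self.zero_run = 0  # number of consecutive zero counts seen so far

    def adjust(self, raw_count):
        if raw_count == 0:
            self.zero_run += 1
            # Assumption: linear growth, 0 for the first zero, then -1, -2, ...
            return -(self.zero_run - 1)
        self.zero_run = 0
        return raw_count
```

A negative count value lets the buffer size changing step distinguish a buffer that has just emptied from one that has been starved for several counting periods.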
The dwelling unit device of the intercom system for an apartment house according to any one of claims 1 to 9, wherein the second software includes a voice data loss compensation processing program that, when all or part of the voice data contained in a voice packet received by the transmission processing unit is missing, compensates for all or part of the missing voice data using voice data that is not missing.
The dwelling unit device of the intercom system for an apartment house according to claim 3, further comprising a fluctuation absorbing buffer that accumulates the voice data contained in the voice packets received by the transmission processing unit,
wherein the fluctuation absorption processing program causes the call processing unit to perform: a counting step of counting the number of packets of voice data stored in the fluctuation absorbing buffer to calculate a packet count value; and a buffer size changing step of inserting a packet into, or deleting a packet from, the fluctuation absorbing buffer based on the packet count value calculated in the counting step, and wherein, when one packet is to be deleted from the fluctuation absorbing buffer in the buffer size changing step and two or more valid packets containing voice data are stored consecutively, the call processing unit overlaps two consecutive valid packets located in the middle of those consecutive valid packets, thereby deleting one packet.
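Deleting one packet by overlapping two consecutive valid packets can be sketched as below. The linear cross-fade, the representation of an invalid packet as `None`, and the names `overlap_merge` and `delete_one_packet` are assumptions for illustration; the claim does not prescribe the overlap weighting.

```python
def overlap_merge(first, second):
    """Cross-fade two equal-length packets into a single packet."""
    if len(first) != len(second):
        raise ValueError("packets must have the same length")
    n = len(first)
    if n == 1:
        return [(first[0] + second[0]) / 2]
    merged = []
    for i in range(n):
        w = i / (n - 1)  # weight moves from 0 (first packet) to 1 (second packet)
        merged.append((1 - w) * first[i] + w * second[i])
    return merged


def delete_one_packet(packets):
    """Remove one packet from the first run of two or more valid packets.

    packets: list of sample lists (valid) or None (invalid packet).
    The middle pair of the run is overlapped into one packet.
    """
    start = None
    for i in range(len(packets) + 1):
        valid = i < len(packets) and packets[i] is not None
        if valid and start is None:
            start = i
        elif not valid and start is not None:
            if i - start >= 2:
                mid = start + (i - start) // 2 - 1  # left index of the middle pair
                merged = overlap_merge(packets[mid], packets[mid + 1])
                return packets[:mid] + [merged] + packets[mid + 2:]
            start = None
    return packets  # no run of two or more valid packets; this case is outside the sketch
```

Overlapping in the middle of a run of valid packets keeps the packets at both ends of the run unchanged, so the joins with the neighbouring packets are not disturbed.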
The dwelling unit device of the intercom system for an apartment house according to claim 11, wherein, when a packet is to be inserted into the fluctuation absorbing buffer in the buffer size changing step and two valid packets are consecutive, the fluctuation absorption processing program causes the call processing unit to insert, between the two valid packets, an invalid packet that contains no voice data.
The dwelling unit device of the intercom system for an apartment house according to any one of claims 1 to 12, wherein the second software includes: a voice data loss detection processing program that detects loss of all or part of the voice data output from the transmission processing unit; a pitch detection processing program that detects a pitch from the voice data; and a voice data loss compensation processing program that, when the voice data loss detection processing detects a loss of voice data, compensates for the missing voice data based on the pitch detected by the pitch detection processing,
and wherein the pitch detection processing program causes the call processing unit to perform: a process of setting, as a reference signal, a voice signal having a time width extending from the present time into the past; a process of sliding the reference signal from the present time toward the past relative to the voice signal; and a process of increasing the time width of the reference signal as the slide amount of the reference signal increases.
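Pitch-based loss compensation can be illustrated with a simplified sketch that fills the missing samples by repeating the most recent pitch period. Periodic repetition is an assumption chosen for illustration; the claim only requires that the compensation be based on the detected pitch. The function name `conceal_loss` is hypothetical.

```python
def conceal_loss(history, pitch, missing_len):
    """Generate substitute samples for a lost segment.

    history:     samples received before the loss.
    pitch:       detected pitch period in samples.
    missing_len: number of samples to generate.
    """
    if not 0 < pitch <= len(history):
        raise ValueError("pitch must be between 1 and len(history)")
    last_period = history[-pitch:]
    # Assumption: the last pitch period is repeated cyclically.
    return [last_period[i % pitch] for i in range(missing_len)]
```

A practical implementation would typically also smooth the boundaries between received and generated samples, which this sketch omits.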
The dwelling unit device of the intercom system for an apartment house according to claim 13, wherein the pitch detection processing program causes the call processing unit to set the time width of the reference signal to a predetermined initial time width until the slide amount of the reference signal reaches a predetermined slide reference value.
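The relation between the slide amount and the time width of the reference signal can be sketched as follows. The linear growth beyond the slide reference value and the `growth` parameter are assumptions; the claims state only that the width stays at the initial value up to the slide reference value and increases with the slide amount.

```python
def reference_width(slide, initial_width, slide_reference, growth=1.0):
    """Time width (in samples) of the reference signal for a given slide amount."""
    if slide <= slide_reference:
        return initial_width  # fixed initial width for small slide amounts
    # Assumption: the width grows linearly once the slide reference value is exceeded.
    return initial_width + int(growth * (slide - slide_reference))
```

A wider reference signal at large slide amounts means that long pitch periods are compared over a correspondingly longer stretch of signal.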
The dwelling unit device of the intercom system for an apartment house according to claim 13 or 14, wherein the pitch detection processing program causes the call processing unit to obtain the correlation between the reference signal and the voice signal by an average amplitude difference function method.
The dwelling unit device of the intercom system for an apartment house, wherein the pitch detection processing program causes the call processing unit to obtain the correlation between the reference signal and the voice signal using the average amplitude difference function of Expression (1).
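Pitch detection with an average amplitude difference function can be sketched as below. The fixed reference width, the lag search range, and the names `amdf` and `detect_pitch` are assumptions for illustration; the lag with the smallest average difference is taken as the pitch period.

```python
def amdf(signal, end, width, lag):
    """Average absolute difference between the reference signal and the
    signal slid `lag` samples into the past.

    The reference signal is signal[end - width:end].
    """
    start = end - width
    if start - lag < 0:
        raise ValueError("not enough past samples for this lag")
    total = sum(abs(signal[j] - signal[j - lag]) for j in range(start, end))
    return total / width


def detect_pitch(signal, width, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the smallest AMDF value."""
    end = len(signal)
    return min(range(min_lag, max_lag + 1),
               key=lambda lag: amdf(signal, end, width, lag))
```

For a signal that repeats every 8 samples, the function value is zero at a lag of 8, so that lag is selected.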

where φ(τ) is the correlation value, N is the time width of the reference signal, x(j) is the reference signal, x(j−τ) is the voice signal, k+1 is the starting point of the reference signal, a is a predetermined coefficient, and τ is the slide amount of the reference signal.
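Expression (1) itself does not appear in this text. One form that is consistent with the symbol definitions above is the usual average amplitude difference function, shown here under the assumption that the coefficient a scales the normalized sum; the placement of a in the original expression cannot be confirmed from the text.

```latex
\phi(\tau) = \frac{a}{N} \sum_{j=k+1}^{k+N} \left| x(j) - x(j-\tau) \right|
```

Under this form, φ(τ) approaches zero when the slide amount τ matches the pitch period of the voice signal.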
The dwelling unit device of the intercom system for an apartment house according to claim 3, wherein the second software includes: a voice data loss detection processing program that detects loss of all or part of the voice data output from the transmission processing unit; a pitch detection processing program that detects a pitch from the voice data; a voice data loss compensation processing program that, when the voice data loss detection processing detects a loss of voice data, compensates for the missing voice data based on the pitch detected by the pitch detection processing; and a speech speed conversion processing program that expands or compresses the voice data using the pitch detected by the pitch detection processing.
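Expanding or compressing voice data in units of the detected pitch can be illustrated with a deliberately simplified sketch. Repeating or dropping exactly one pitch period at the end of the data, without any smoothing, is an assumption for illustration, and the function names are hypothetical.

```python
def expand_by_one_period(samples, pitch):
    """Lengthen the data by repeating its last pitch period once."""
    if not 0 < pitch <= len(samples):
        raise ValueError("pitch must be between 1 and len(samples)")
    return samples + samples[-pitch:]


def compress_by_one_period(samples, pitch):
    """Shorten the data by dropping its last pitch period."""
    if not 0 < pitch <= len(samples):
        raise ValueError("pitch must be between 1 and len(samples)")
    return samples[:-pitch]
```

Working in whole pitch periods changes the duration of the voice data while leaving the waveform within each period, and hence the perceived pitch, unchanged.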
The dwelling unit device of the intercom system for an apartment house according to claim 17, wherein the pitch detection processing counts a predetermined detection period and repeatedly detects the pitch in synchronization with the detection period, and, when the voice data loss detection processing detects a loss of voice data, detects the pitch at the time of that detection and restarts counting of the detection period from that time.
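The restart of the detection period on a detected loss can be sketched as a small scheduler. The tick-based timing and the class name `PitchScheduler` are assumptions for illustration.

```python
class PitchScheduler:
    """Decides at each tick whether the pitch should be detected."""

    def __init__(self, period_ticks):
        self.period = period_ticks  # detection period, in ticks
        self.counter = 0            # ticks counted since the last detection

    def tick(self, loss_detected):
        """Return True if the pitch is to be detected at this tick."""
        if loss_detected:
            self.counter = 0        # restart counting from the loss detection time
            return True
        self.counter += 1
        if self.counter >= self.period:
            self.counter = 0
            return True
        return False
```

With a detection period of 3 ticks, a loss on the fifth tick triggers an immediate detection, and the next periodic detection follows 3 ticks after it.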
The dwelling unit device of the intercom system for an apartment house according to claim 17 or 18, wherein the pitch detection processing detects only pitches within a predetermined frequency range.
The dwelling unit device of the intercom system for an apartment house according to claim 17, wherein the speech speed conversion processing detects a voice section of the voice data and converts the speech speed of only the voice data in the voice section.
The dwelling unit device of the intercom system for an apartment house according to claim 18, wherein the voice data loss detection processing detects a loss of voice data in synchronization with the input timing of the voice data and with a first time interval obtained by dividing the time length of one packet of voice data by a positive integer, and the pitch detection processing detects the pitch in synchronization with the first time interval and with the detection period, the detection period being obtained by multiplying the first time interval by a positive integer.
The dwelling unit device of the intercom system for an apartment house according to claim 17, wherein, when the speech speed conversion processing performs speech speed conversion while the voice data loss detection processing has detected a loss of voice data, the speech speed conversion is performed using the pitch that the pitch detection processing detected immediately before the voice data loss detection processing detected the loss.
The dwelling unit device of the intercom system for an apartment house according to claim 17, wherein, when the speech speed conversion processing performs speech speed conversion while the voice data loss detection processing has detected a loss of voice data, the speech speed conversion is performed using the pitch that the pitch detection processing detected from the voice data compensated by the voice data loss compensation processing.
The dwelling unit device of the intercom system for an apartment house, wherein the pitch detection processing determines voice sections and non-voice sections of the voice data, and makes the detection period in a non-voice section longer than the detection period in a voice section.
The dwelling unit device of the intercom system for an apartment house according to any one of claims 1 to 24, wherein the second software includes a voice switch processing program that suppresses howling by reducing the loop gain of a closed loop formed by an acoustic echo path generated by acoustic coupling between the microphone and the speaker, and the voice switch processing program causes the call processing unit to perform processing of: estimating a feedback gain of the acoustic echo path; calculating, based on the estimated value of the feedback gain, the sum of a reception-side attenuation amount for attenuating received voice data output from the transmission processing unit and a transmission-side attenuation amount for attenuating transmitted voice data input to the transmission processing unit; estimating the call state by monitoring the transmitted and received voice data; determining the distribution of the transmission-side attenuation amount and the reception-side attenuation amount according to the estimated call state and the calculated value of the sum; and decreasing the sum as the estimated value of the feedback gain decreases.
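The calculation of the total attenuation and its distribution can be sketched as follows. The decibel units, the 6 dB margin, the three call-state labels, and the all-or-nothing distribution are assumptions for illustration; the claim specifies only that the sum follows the estimated feedback gain and that the distribution depends on the call state and the sum.

```python
def total_attenuation_db(feedback_gain_db, margin_db=6.0):
    """Sum of transmission-side and reception-side attenuation, in dB.

    Assumption: the sum is chosen so that the loop gain stays margin_db
    below 0 dB; it decreases as the estimated feedback gain decreases.
    """
    return max(0.0, feedback_gain_db + margin_db)


def distribute(total_db, call_state):
    """Split the total into (transmission-side, reception-side) attenuation.

    call_state: "transmitting" (near end talking), "receiving" (far end
    talking), or anything else for an undetermined state.
    """
    if call_state == "transmitting":
        return (0.0, total_db)   # keep the transmit path open
    if call_state == "receiving":
        return (total_db, 0.0)   # keep the receive path open
    return (total_db / 2, total_db / 2)
```

With an estimated feedback gain of 10 dB the sketch yields a total of 16 dB, all of which is placed on the transmission side while the far end is talking.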
The dwelling unit device of the intercom system for an apartment house, further comprising: an extension connection line to which a call device installed in the dwelling is connected; and an extension analog signal transmission unit that transmits an analog voice signal via the extension connection line, wherein voice data that has undergone call processing by the call processing unit executing the first software is transmitted from the extension analog signal transmission unit to the call device via the extension connection line.
The dwelling unit device of the intercom system for an apartment house according to any one of claims 1 to 26, wherein the first software includes a speech speed conversion processing program that detects a voice pitch from a digital voice signal obtained by A/D converting the analog voice signal and expands or compresses the digital voice signal using the pitch.