JP2009122204A

JP2009122204A - Sound volume control unit, method, and program

Info

Publication number: JP2009122204A
Application number: JP2007293743A
Authority: JP
Inventors: Tasuku Shinozaki; 翼篠崎; Yoshiaki Noda; 喜昭野田; Taichi Asami; 太一浅見
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-11-12
Filing date: 2007-11-12
Publication date: 2009-06-04
Anticipated expiration: 2027-11-12
Also published as: JP4814861B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique for controlling sound volume while changing predetermined feature amounts of the sound, such as frequency, as little as possible.
SOLUTION: The input sound is divided into frames of a fixed time length. An outer shape value, which is a feature amount representing the loudness of the sound in a frame, is calculated for each frame. A sound section that consists of at least a preset number B_2 of frames and lies between runs of at least a preset number B_1 of consecutive silent frames is defined as a first sound section. Some of the outer shape values of the frames constituting the first sound section are removed, starting from the largest, and the maximum of the remaining values is taken as the outer shape value of the first sound section. A first volume control instruction means determines and outputs information for adjusting the volume of the input sound (hereinafter, first volume control information) so that the obtained outer shape value of the first sound section falls within a preset range, and the volume is adjusted using the output first volume control information.
COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a volume adjusting device, method, and program that automatically adjust the volume of sound input from a sound input device such as a telephone or a microphone, for purposes such as speech recognition.

A conventional automatic volume control device will be described with reference to FIG. 7. The purpose of the conventional device is to adjust sounds such as music and speech to a volume that is easy for a person to listen to.
The dash-dotted line in FIG. 7 represents the input peak (maximum allowable value) of the device that receives the sound output by the automatic volume control device (that is, the automatically adjusted sound).
(1) If the volume of the input sound exceeds the input peak, the gain is lowered quickly until the volume no longer exceeds the input peak.
(2) The short-term (10 to 30 seconds) average power of the input sound is calculated, and the gain is gradually raised or lowered so that the average power approaches a preset target value. Here, the short-time power of the sound (the power of each frame) is calculated to distinguish sound sections (sections containing sound) from silent sections, and the average power is calculated using only the sound in the sound sections, so that the volume can be adjusted appropriately.

The conventional automatic volume control device combines methods (1) and (2) above to adjust the volume automatically, controlling loud and quiet input sounds so that they are always at a constant volume that is easy for a person to hear (see, for example, Patent Document 1).
Patent Document 1: JP 58-141018 A

The conventional automatic volume control device constantly adjusts the volume little by little by raising and lowering the gain. When the input sound is adjusted continuously in this way, the waveform is distorted and predetermined feature amounts of the sound, such as frequency, are easily lost.
A specific example of volume adjustment method (2) will be described with reference to FIG. 8. As shown in FIG. 8, the average value P1 (average power P1) of the power L1 of a sound with small amplitude variance is brought close to the target value P^* of the average power. The power of that sound after the adjustment is denoted L2, and its average value is denoted P2 (average power P2). If a sound L3 with large amplitude variance is then input (the vertical axis of L3 represents amplitude), it intermittently exceeds the input peak, and the predetermined feature amounts of the sound change in the portions that exceed the input peak.

An object of the present invention is to provide a volume adjusting device, method, and program that adjust the volume of a target sound so that it is at or below a predetermined volume (the peak of the input sound) while preserving predetermined feature amounts of the sound.

The volume adjusting device according to the present invention comprises: frame dividing means for dividing input sound into frames of a fixed time length; frame outer shape value determining means for obtaining, for each frame, an outer shape value, which is a feature amount representing the loudness of the sound contained in the frame; first sound section outer shape value determining means for taking, as a first sound section, a sound section that is sandwiched between runs of at least a predetermined number B_1 of consecutive silent frames and is composed of at least a predetermined number B_2 of frames, excluding a plurality of outer shape values, starting from the largest, from the outer shape values of the frames constituting the first sound section, and obtaining the maximum of the remaining outer shape values as the outer shape value of the first sound section; first volume adjustment instruction means for determining and outputting information for adjusting the volume of the input sound (hereinafter, first volume adjustment information) so that the obtained outer shape value of the first sound section falls within a predetermined range; and first volume adjusting means for adjusting the volume of the input sound using the output first volume adjustment information.

The volume of a sound can be kept at or below a predetermined volume (the peak of the input sound) while predetermined feature amounts of the sound are preserved.

An automatic volume control device 1 according to an embodiment of the present invention will be described with reference to FIG. 1.
Sound is input from the input unit 11. The input unit 11 is, for example, a microphone. A transmission/reception adapter installed between a telephone and its handset or headset may also be used as the input unit 11 in order to extract either the transmitted voice or the received voice of the telephone, or a mix of both. The sound input from the input unit 11 is converted into an electrical signal and output to the volume adjustment unit 12.
The volume adjustment unit 12 adjusts the volume of the input sound based on volume adjustment information described later, and outputs the adjusted sound. Part of the output sound is input to the AD conversion unit 13. The volume adjustment unit 12 may be analog or digital. Details of its processing are described later.

The AD conversion unit 13 digitizes the analog sound signal by quantizing it at a predetermined sampling frequency, and sends the result to the frame division unit 14. As indicated by the dotted line in FIG. 1, the AD conversion unit 13 may instead be placed before the volume adjustment unit 12, in which case the volume adjustment unit 12 is digital. Hereinafter, the digitized sound signal is referred to as the sound signal.
The frame division unit 14 divides the input sound into frames of a fixed time length. For example, the length of one frame is set to 100 ms (1600 samples per frame when the sampling frequency is 16 kHz). Making the frame sufficiently longer than, for example, the fundamental period of a male voice waveform and of power supply noise allows the volume to be adjusted stably regardless of voice pitch and power supply noise. The framed sound signal is sent to the buffer 15.
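The frame division step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_into_frames` and the choice to drop a trailing partial frame are assumptions made here.

```python
def split_into_frames(samples, sampling_rate_hz=16000, frame_ms=100):
    """Split a sample sequence into consecutive fixed-length frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    # 16 kHz and 100 ms give 1600 samples per frame, as in the text.
    frame_len = sampling_rate_hz * frame_ms // 1000
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```

With 4000 samples at the default settings, the sketch yields two frames of 1600 samples and discards the remaining 800.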

The buffer 15 temporarily stores a predetermined number A_1 (one or more) of frames.
The DC bias calculation unit 16 reads the framed sound signal stored in the buffer 15 and calculates the average amplitude of the sound signal observed over a long time. This average value, that is, the value of the DC component, is sent to the subtraction unit 17.
The subtraction unit 17 subtracts the DC component value calculated by the DC bias calculation unit 16 from the sound signal read from the buffer 15 to generate an unbiased sound signal. The generated sound signal is sent to the start/end determination unit 18, the outer shape value determination unit 19, and the second volume adjustment instruction unit 26. Hereinafter, unless otherwise noted, "sound signal" means this unbiased sound signal.
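A minimal sketch of the DC bias removal described above. The name `remove_dc_bias` is hypothetical, and the "long time" average is approximated here by the mean over all frames passed in, which is an assumption of this sketch.

```python
def remove_dc_bias(frames):
    """Subtract the mean amplitude (DC component) from every sample.

    Returns the unbiased frames and the estimated DC value.
    """
    total = sum(sum(frame) for frame in frames)
    count = sum(len(frame) for frame in frames)
    dc = total / count if count else 0.0
    unbiased = [[s - dc for s in frame] for frame in frames]
    return unbiased, dc
```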

The start/end determination unit 18 determines when sound generation starts and ends by observing the average of the absolute values of the sound signal in each frame. The sound section from the start to the end of sound generation is defined as a sound generation. When the sound is speech such as a telephone call, the start and end of sound generation are the beginning and end of the call, and a sound generation corresponds to a so-called call section.
Specifically, the average value calculation unit 181 of the start/end determination unit 18 calculates, for each frame, the average of the absolute values of the amplitude of the input sound signal. The start/end determination unit 18 then checks frame by frame whether this average is larger than a predetermined threshold A_2. When it is, the unit determines that sound generation has started and sends a signal to that effect to each part of the automatic volume control device 1, including the end-time volume adjustment unit 33. Alternatively, when the average is determined to be larger than the threshold A_2, sound generation may be deemed to have started a fixed time (for example, 0.5 seconds) before the moment of that determination.
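The start determination can be sketched as below, assuming non-empty frames of unbiased samples. The function name `detect_start_frame` and the parameter `lookback_frames` are illustrative; with 100 ms frames, a lookback of 0.5 seconds corresponds to 5 frames.

```python
def detect_start_frame(frames, threshold_a2, lookback_frames=0):
    """Return the frame index at which sound generation is judged to start.

    The first frame whose mean absolute amplitude exceeds A_2 triggers the
    decision; the start may be moved back by `lookback_frames` frames.
    Returns None when no frame exceeds the threshold.
    """
    for i, frame in enumerate(frames):
        mean_abs = sum(abs(s) for s in frame) / len(frame)
        if mean_abs > threshold_a2:
            return max(0, i - lookback_frames)
    return None
```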

The start/end determination unit 18 also determines that sound generation has ended when the calculated average of the absolute amplitude values stays below a predetermined threshold A_3 (A_3 is smaller than A_2) for a predetermined length of time or for a predetermined number A_4 of frames, and sends a signal to that effect to each part of the automatic volume control device 1, including the end-time volume adjustment unit 33.
On receiving the signal indicating that sound generation has started, the outer shape value determination unit 19 obtains, for each frame, an outer shape value, which is a feature amount representing the loudness of the sound in the frame. The outer shape value is, for example, the maximum absolute value of the amplitude of the sound signal; in other words, it is the maximum over the values of the samples constituting the frame. The outer shape value obtained for each frame is sent to the sound/silent frame determination unit 20 and the first volume adjustment instruction unit 25. FIGS. 3A and 3B show a specific example of outer shape value extraction. FIG. 3A shows the waveform of an unbiased sound signal. FIG. 3B plots the maximum absolute amplitude (outer shape value) obtained for each frame from the waveform in FIG. 3A.
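The per-frame outer shape value described above, taken as the maximum absolute sample value, can be sketched in one line. The function name `frame_outer_shape_values` is hypothetical, and frames are assumed to be non-empty.

```python
def frame_outer_shape_values(frames):
    """Return the maximum absolute sample value of each frame."""
    return [max(abs(s) for s in frame) for frame in frames]
```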

The description returns to FIG. 1. The sound/silent frame determination unit 20 compares the outer shape value with a predetermined threshold A_5. If the outer shape value is larger, the frame is determined to be a sound frame; otherwise, it is determined to be a silent frame. Instead of a fixed value, the threshold A_5 may be changed dynamically, for example by setting it to a constant multiple (for example, three times) of the minimum outer shape value of the silent frames in the past 10 seconds. Information on whether each frame is a sound frame or a silent frame is sent to the sound/silent section determination unit 21.
When silent frames continue for at least a predetermined number A_6 (for example, 5, so that the duration is 0.5 seconds), the sound/silent section determination unit 21 determines that the sound section composed of those consecutive frames is a silent section, and that sound sections composed of the other frames are sound sections. Information about the sound sections and silent sections is sent to the first sound section extraction unit 22 of the first volume adjustment instruction unit 25.
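The frame and section classification above can be sketched as follows. The function names are hypothetical and the threshold A_5 is treated as fixed. Following the text literally, any stretch of frames that is not part of a silent run of at least A_6 frames is reported as a sound section, so short silent gaps stay inside the surrounding sound section.

```python
def classify_frames(outer_values, threshold_a5):
    """True for a sound frame (outer shape value above A_5), else False."""
    return [v > threshold_a5 for v in outer_values]


def find_sound_sections(is_sound, min_silent_run_a6=5):
    """Return (start, end) frame index pairs, end exclusive, of sound sections."""
    n = len(is_sound)
    in_silent_section = [False] * n
    i = 0
    while i < n:
        if is_sound[i]:
            i += 1
            continue
        # Measure the run of consecutive silent frames starting at i.
        j = i
        while j < n and not is_sound[j]:
            j += 1
        if j - i >= min_silent_run_a6:
            for k in range(i, j):
                in_silent_section[k] = True
        i = j
    # Everything outside a silent section belongs to a sound section.
    sections = []
    start = None
    for idx in range(n):
        if not in_silent_section[idx] and start is None:
            start = idx
        elif in_silent_section[idx] and start is not None:
            sections.append((start, idx))
            start = None
    if start is not None:
        sections.append((start, n))
    return sections
```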

The first volume adjustment instruction unit 25 will now be described with reference to FIG. 2. The first sound section extraction unit 22 of the first volume adjustment instruction unit 25 takes a sound section determined above as a first sound section when the section is longer than a predetermined time length A_7 (for example, 2 seconds), or when the number of frames constituting it is larger than a predetermined number A_8 (for example, 20 frames). When the input sound is speech such as a telephone call, the first sound section corresponds to a so-called utterance section, that is, a section of sound that a person produces in one breath. Extracting first sound sections in this way cuts out sound sections whose length is close to human intuition, such as "Hello" or "I have a quick question." FIG. 3B shows a specific example of first sound section extraction: as shown there, silent sections of 0.5 seconds or longer are used to extract a block of sound lasting 2 seconds or longer as a first sound section.
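A minimal sketch of the first sound section selection by frame count. The name `extract_first_sound_sections` and the (start, end) pair representation with an exclusive end index are assumptions of this sketch; the strict "larger than" comparison follows the wording of the paragraph above.

```python
def extract_first_sound_sections(sound_sections, min_frames_a8=20):
    """Keep only sound sections made of more than `min_frames_a8` frames."""
    return [(s, e) for (s, e) in sound_sections if e - s > min_frames_a8]
```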

The first sound section extraction unit 22 sends, for example, information on the frames constituting the first sound section and on their outer shape values to the first sound section outer shape value extraction unit 23. For the outer shape values of these frames, it uses the frame outer shape value information that it received from the outer shape value determination unit 19.
The exclusion unit 231 of the first sound section outer shape value extraction unit 23 excludes a plurality of outer shape values, starting from the largest, from the outer shape values of the frames constituting the first sound section. The number of excluded values should increase with the number of frames in the first sound section. For example, the number of frames in the first sound section is multiplied by a preset ratio A_9 (for example, 10 to 30%; 20% here), the result is rounded down, rounded to the nearest integer, or rounded up, and that many outer shape values are excluded. Alternatively, a predetermined number A_10 of outer shape values may be excluded. The outer shape values that remain are sent to the maximum value determination unit 232.

The maximum value determination unit 232 obtains the maximum of the remaining outer shape values and stores it as the outer shape value of the first sound section. The outer shape value of the first sound section is sent to the first gain determination unit 24.
The first gain determination unit 24 determines information for adjusting the input sound (hereinafter, first volume adjustment information) so that the outer shape value of the first sound section falls within a predetermined range, and sends it to the volume adjustment unit 12. For example, the input peak is supplied to the first gain determination unit 24, which determines the gain so that the outer shape value of the first sound section falls within the range obtained by multiplying the input peak by a predetermined ratio A_11 (for example, 10% to 25%). In this case, the gain is the first volume adjustment information.

Once the first volume adjustment information has been determined, the first volume adjustment instruction unit 25 does not perform the above processing for the frames corresponding to the delay time of the buffer 15.
A specific example will be described with reference to FIG. 3C. The exclusion unit 231 excludes a predetermined number (seven in this example) of the largest outer shape values from the outer shape values of the frames constituting the first sound section. The excluded values are shown in white in FIG. 3C. The maximum value determination unit 232 selects the largest of the remaining outer shape values as the outer shape value of the first sound section. The remaining values are shown in black and with hatching in FIG. 3C, and the hatched value, their maximum, is the outer shape value of the first sound section.
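The exclusion and maximum steps can be sketched together as below. The function name, the `rounding` parameter, and the rule of always keeping at least one value are assumptions made for illustration; the ratio-based count follows the A_9 rule described earlier, with 20% as the default.

```python
import math


def first_section_outer_value(outer_values, exclude_ratio_a9=0.2, rounding="floor"):
    """Drop the largest outer shape values, then return the largest one left."""
    if not outer_values:
        raise ValueError("the first sound section must contain at least one frame")
    raw = len(outer_values) * exclude_ratio_a9
    if rounding == "floor":
        n_exclude = math.floor(raw)
    elif rounding == "ceil":
        n_exclude = math.ceil(raw)
    else:
        # Round half up; Python's built-in round() rounds halves to even.
        n_exclude = math.floor(raw + 0.5)
    # Always leave at least one value (an assumption of this sketch).
    n_exclude = min(n_exclude, len(outer_values) - 1)
    ranked = sorted(outer_values, reverse=True)
    return ranked[n_exclude]
```

For ten frames and a 20% ratio, the two largest values are discarded and the third largest becomes the outer shape value of the section, so an isolated spike such as a cough does not set the level.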

If the predetermined range in which the outer shape value of the first sound section should fall is 3000 to 8000, the outer shape value of the first sound section in this example is outside that range. The first gain determination unit 24 calculates the difference between the outer shape value of the first sound section and the range, and determines the gain so that the outer shape value falls within the range. If the outer shape value of the first sound section is already within the range, no processing is performed.
Another specific example: suppose the outer shape value of the first sound section is 5% of the input peak and the predetermined range is 10% to 25% of the input peak. The first gain determination unit 24 then determines the gain so that the outer shape value of the first sound section becomes 10% of the input peak. Determining the gain in this way, so that the outer shape value of the first sound section after adjustment equals whichever of the upper and lower limits of the range is closer to its value before adjustment, minimizes the amount of volume adjustment and therefore the change in the predetermined feature amounts of the sound.
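The nearest-bound rule above can be sketched as follows, assuming the gain is a multiplicative factor applied to the amplitude. The function name `determine_gain` and the use of `None` to mean "leave the gain unchanged" are choices made for this sketch.

```python
def determine_gain(section_outer_value, lower, upper):
    """Return the gain that moves the value to the nearest bound of the range.

    Returns None when the value is already inside [lower, upper], in which
    case the current gain is left as it is.
    """
    if section_outer_value <= 0:
        raise ValueError("section outer shape value must be positive")
    if section_outer_value < lower:
        return lower / section_outer_value
    if section_outer_value > upper:
        return upper / section_outer_value
    return None
```

With the range 3000 to 8000, a section value of 1500 gives a gain of 2.0, a value of 16000 gives 0.5, and a value of 5000 leaves the gain untouched.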

Furthermore, because a predetermined range is provided and the gain calculation above is skipped when the outer shape value of the first sound section is already within it, the gain is changed less often. The sound waveform is therefore distorted less often, which reduces the change in the predetermined feature amounts of the sound.
This method bases volume adjustment not on short sound sections with unstable volume, such as "yes," "ah," or "um," but on sound sections of a certain length with stable volume, such as "Thank you for calling." or "There is something I would like to ask." In addition, the gain is adjusted using the outer shape value of the first sound section, obtained by excluding a plurality of large outer shape values from those of the frames constituting the section and taking the maximum of the values that remain.

As a result, the adjustment is less affected by sudden noise such as coughs and sneezes, and the adjusted volume does not exceed the input peak regardless of whether the amplitude variance of the target sound is large or small.
In the above example, the largest 20% of the outer shape values of the frames constituting the first sound section are excluded, and the predetermined range for the outer shape value of the first sound section is 10% to 20% of the input peak. This is because experiments showed that, excluding sudden noise, the input peak was less than about four times the outer shape value of the first sound section.
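One way to read the reasoning above, offered as a sketch rather than as the patent's own derivation: assume that "input peak" in the last sentence refers to the largest amplitude m of the input sound once sudden noise is excluded, let v be the outer shape value of the first sound section after adjustment, and let P be the maximum allowable input of the downstream device. The symbols m, v, and P are introduced here for illustration only.

```latex
m < 4v, \qquad v \le 0.2\,P \quad\Longrightarrow\quad m < 4 \times 0.2\,P = 0.8\,P < P
```

Under that reading, keeping v at or below 20% of P leaves the peaks of the non-noise sound below the allowable maximum.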

The description returns to FIG. 1. The first volume adjustment unit 121 of the volume adjustment unit 12 adjusts the volume of the input sound using the first volume adjustment information (for example, a gain) determined by the first volume adjustment instruction unit 25, and outputs the result. Until new first volume adjustment information arrives from the first volume adjustment instruction unit 25, the first volume adjustment unit 121 continues to adjust the volume based on the information already received.
Thus, compared with conventional volume adjusting devices, the present invention adjusts the volume based on the same first volume adjustment information for a long time. Predetermined feature amounts of the sound are therefore less likely to be lost than in the prior art, where the gain used for volume adjustment changes frequently.

The device may also include a second volume adjustment instruction unit 26 and a second volume adjustment unit 122, described below, which adjust the volume on the basis of a sound section shorter than the first sound section (a second sound section).
The second volume adjustment instruction unit 26 will be described with reference to FIG. 4. The sound signal output from the subtraction unit 17 is input to the excessive input sample number determination unit 27 of the second volume adjustment instruction unit 26. For each frame, the excessive input sample number determination unit 27 determines the number of samples larger than a predetermined value A_12 (for example, 90% of the upper limit of the values a sample can represent); this number is hereinafter called the excessive input sample count. The count determined for each frame is sent to the excessive input frame determination unit 28 and the storage unit 29.

The excessive input frame determination unit 28 determines, for each frame, whether the excessive input sample count is larger than a predetermined number A_13 (30% of the number of samples in one frame). Hereinafter, a frame whose excessive input sample count is larger than A_13 is called an excessive input frame. Information about excessive input frames (for example, a flag indicating that the frame is an excessive input frame) is sent to the storage unit 29.
The second sound section excessive input sample number determination unit 30 takes, as a second sound section, a sound section composed of a number A_14 of frames (for example, 10, or 1 second in duration) that is smaller than the number of frames constituting the first sound section, calculates the total excessive input sample count over the frames of that second sound section, and sends the total to the second gain determination unit 32. Specifically, when the second sound section is the past 10 frames, the excessive input sample counts of the past 10 frames are read from the storage unit 29 and added together to obtain the total.

The second sound section excessive input frame number determination unit 31 determines the number of excessive input frames among the frames constituting the second sound section and sends that number to the second gain determination unit 32. Specifically, when the second sound section is the past 10 frames, the information about excessive input frames for the past 10 frames is read from the storage unit 29 and the number of excessive input frames is determined.
When the total excessive input sample count is larger than a predetermined number A_15 (for example, 20% of the total number of samples in the second sound section) and the number of excessive input frames is larger than a predetermined value A_16 (for example, 3 when the second sound section is 10 frames), the second gain determination unit 32 sends information for lowering the volume of the input sound by a predetermined amount (hereinafter, second volume adjustment information) to the volume adjustment unit 12. The second volume adjustment information may be a specific gain value (for example, 0.7, or 3 dB in terms of volume), or it may simply instruct that the volume be lowered, without a specific numerical value.
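The second volume adjustment decision can be sketched as below. The function names and default ratios are illustrative, and comparing the magnitude `abs(s)` against A_12 is an assumption of this sketch, since the text only says "samples larger than" the value.

```python
def count_excessive_samples(frame, limit_a12):
    """Number of samples in one frame whose magnitude exceeds A_12."""
    return sum(1 for s in frame if abs(s) > limit_a12)


def should_lower_volume(frames, limit_a12, frame_ratio_a13=0.3,
                        total_ratio_a15=0.2, min_excessive_frames_a16=3):
    """Decide whether the volume should be lowered for a second sound section.

    `frames` holds the frames of the second sound section, for example the
    latest 10 frames. Both conditions from the text must hold.
    """
    counts = [count_excessive_samples(f, limit_a12) for f in frames]
    total_samples = sum(len(f) for f in frames)
    # A frame is an excessive input frame when its count exceeds A_13.
    excessive_frames = sum(
        1 for f, c in zip(frames, counts) if c > frame_ratio_a13 * len(f)
    )
    return (sum(counts) > total_ratio_a15 * total_samples
            and excessive_frames > min_excessive_frames_a16)
```

Requiring both a large total count and several excessive input frames means that a single very loud frame is not enough to trigger a gain reduction.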

The second volume adjustment unit 122 of the volume adjustment unit 12 lowers the volume of the input sound based on the second volume adjustment information. When the gain has been lowered, the second volume adjustment instruction unit 26 sets a short-time volume adjustment flag on the frame and thereafter performs no processing on the frames corresponding to the delay time of the buffer 15.
As a result, the volume can be lowered by lowering the gain while avoiding those sudden noises that have a relatively short duration.
After the start/end determination unit 18 detects the start of an utterance, the volume is adjusted according to the instructions of the first volume adjustment instruction unit 25 and the second volume adjustment instruction unit 26 as described above. When the start/end determination unit 18 detects the end of the utterance, information indicating that the utterance has ended is sent to the end-time volume adjustment unit 33.

Upon receiving the information that the utterance has ended, the end-time volume adjustment unit 33 reads the gain set in the volume adjustment unit 12 at the end of the utterance and stores it in the storage unit 331 of the end-time volume adjustment unit 33. The end-time volume adjustment unit 33 then reads from the storage unit 331 the gains at the end of a predetermined number A₁₇ of past utterances, counting back from the most recent utterance, obtains their average value, and sets that average value in the volume adjustment unit 12.
When the current gain value cannot be obtained from the volume adjustment unit 12, the end-time volume adjustment unit 33 sets the gain in the volume adjustment unit 12 as follows. The current gain value cannot be obtained when, for example, the volume adjustment unit 12 offers only relative gain control, such as raising the volume by 3 dB or lowering it by 3 dB, and has no means of reporting that the adjustment range of the device has been exceeded or that the adjustment could not be made.

1. When the gain was not changed to adjust the volume by an instruction from the first volume adjustment instruction unit 25, the end-time volume adjustment unit 33 does nothing.
2. When the gain was lowered to lower the volume by an instruction from the first volume adjustment instruction unit 25, the end-time volume adjustment unit 33 sets in the volume adjustment unit 12 a gain that is lower than the current gain by a preset value A₁₈.
3. When the gain was raised to raise the volume by an instruction from the first volume adjustment instruction unit 25, the end-time volume adjustment unit 33 performs the following processing.

3-1. When the gain was lowered to lower the volume by an instruction from the second volume adjustment instruction unit 26, the end-time volume adjustment unit 33 does nothing.
3-2. In cases other than 3-1, the end-time volume adjustment unit 33 sets in the volume adjustment unit 12 a gain that is higher than the current gain by a preset value A₁₉.
By adjusting the volume at the end of an utterance in this way, the volume at the start of the next utterance can be brought close to an appropriate value, and the volume can be adjusted appropriately by following changes in sound collection conditions such as the speaker, the microphone position, and the voice volume.
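The two end-of-utterance behaviours described above (averaging the stored end-time gains when the gain is readable, and the rule list 1 to 3-2 when it is not) can be sketched as follows. This is an illustrative sketch under assumptions: the class and function names, the string labels for the first instruction, and the use of a signed step as the return value are all hypothetical, and the text does not state the units of A₁₈ and A₁₉.

```python
from collections import deque
from typing import Deque


class EndTimeGainAverager:
    """Keeps the end-of-utterance gains of the most recent utterances."""

    def __init__(self, history_len: int) -> None:
        # history_len plays the role of A17
        self.history: Deque[float] = deque(maxlen=history_len)

    def on_utterance_end(self, current_gain: float) -> float:
        """Store the gain at the end of the utterance and return the average to set."""
        self.history.append(current_gain)
        return sum(self.history) / len(self.history)


def relative_end_adjustment(
    first_action: str,     # hypothetical labels: "none", "lowered" or "raised"
    second_lowered: bool,  # True if the second instruction unit lowered the gain
    step_down: float,      # plays the role of A18
    step_up: float,        # plays the role of A19
) -> float:
    """Signed change to apply to the current gain when it cannot be read back."""
    if first_action == "none":
        return 0.0          # rule 1: do nothing
    if first_action == "lowered":
        return -step_down   # rule 2: lower further by A18
    if first_action == "raised":
        if second_lowered:
            return 0.0      # rule 3-1: do nothing
        return step_up      # rule 3-2: raise further by A19
    raise ValueError(f"unknown first_action: {first_action!r}")
```

Using a bounded `deque` means that gains from utterances older than the last A₁₇ are dropped automatically, which is what lets the average follow a change of speaker or microphone position.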

As illustrated in FIG. 9, the sound input from the input unit 11 may be input to both the AD conversion unit 13 and the volume adjustment unit 12. In that case, the volume adjustment information is determined from the sound input to the AD conversion unit 13 in the same manner as above, and the volume adjustment unit 12 adjusts the volume based on the determined volume adjustment information.
For example, when operators change seats from day to day at a call center, the same sound collection conditions continue for a certain time, but the conditions change daily. In such an environment, the volume can be adjusted within the short time of a few calls to suit the sound collection conditions of each operator, such as voice volume and microphone position, and even when the operator is replaced partway through, the volume can be adjusted appropriately by following the change.

The second volume adjustment instruction unit 26 and the second volume adjustment unit 122 may be omitted. The end-time volume adjustment unit 33 may also be omitted.
FIG. 5 shows a system that records conversations between an operator and a user at a call center using the automatic volume adjustment device 1.
The operator wears a headset 35 connected to a telephone 34 and talks with the user. A handset branch adapter 36 having the volume adjustment unit 12 is connected between the headset 35 and the telephone 34, and the voice is captured into a PC 37 through an audio input or USB. The operator and user voices captured into the PC pass through an echo canceling unit, which suppresses the operator voice that enters the user voice side as sidetone. As shown in FIG. 6, when the handset separation adapter is equipped with an echo canceling unit 38, this processing of the echo canceling unit is bypassed.

When the start/end determination unit 18 detects the start of a call based on the respective voices sent from the echo canceling unit 38, the transmission-side automatic volume adjustment device 1a adjusts the volume of the operator voice in the same manner as the automatic volume adjustment device 1 described above. Likewise, the reception-side automatic volume adjustment device 1b adjusts the volume of the user voice in the same manner as the automatic volume adjustment device 1 described above. Neither the transmission-side automatic volume adjustment device 1a nor the reception-side automatic volume adjustment device 1b has its own volume adjustment unit 12 or start/end determination unit 18; instead, the volume adjustment unit 12 of the handset branch adapter 36 and the start/end determination unit 18 of the PC 37 function as the volume adjustment unit 12 and the start/end determination unit 18 of both devices. In all other respects, the two devices are the same as the automatic volume adjustment device 1.

Because the sound collection conditions for the operator voice are almost the same as long as the operator is the same, the operator voice can be adjusted to an appropriate volume within a few calls. The user voice, however, comes through a different telephone, transmission path, and so on for each call. For this reason, the reception-side automatic volume adjustment device 1b does not issue volume adjustment instructions from the end-time volume adjustment unit 33.
When the start/end determination unit 18 detects the end of the call, the volume-adjusted voice is stored on the disk 40 of the PC 37 through the recording unit 39.
When the above configuration is realized by a computer, the processing contents of the functions that each device should have are described by a program. The processing functions are realized on the computer by executing this program on the computer.

The program describing this processing can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.
As a mode of execution different from the embodiment described above, a computer may read the program directly from the portable recording medium and execute processing according to the program. Alternatively, each time the program is transferred from the server computer to the computer, the computer may sequentially execute processing according to the received program. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may be realized in hardware.
The various processes described above are not only executed in time series as described, but may also be executed in parallel or individually depending on the processing capability of the device that executes them, or as needed. Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.

FIG. 1 is a diagram illustrating the functional configuration of the automatic volume adjustment device 1, which is one embodiment of the present invention.
FIG. 2 is a diagram illustrating the functional configuration of the first volume adjustment instruction unit 25.
FIG. 3: A is a diagram illustrating the waveform of a sound signal. B is a diagram illustrating a first sound section (speech section). C is a diagram illustrating the outer shape value of the first sound section.
FIG. 4 is a diagram illustrating the functional configuration of the second volume adjustment instruction unit 26.
FIG. 5 is a diagram illustrating a system that records a conversation between an operator and a user.
FIG. 6 is a diagram illustrating a system that records a conversation between an operator and a user.
FIG. 7 is a diagram illustrating conventional automatic volume adjustment.
FIG. 8 is a diagram for explaining a problem of conventional automatic volume adjustment.
FIG. 9 is a diagram illustrating the functional configuration of the automatic volume adjustment device 1'.

Explanation of symbols

1 Automatic volume adjustment device
1a Transmission-side automatic volume adjustment device
1b Reception-side automatic volume adjustment device
11 Input unit
12 Volume adjustment unit
13 Conversion unit
14 Frame division unit
15 Buffer
16 DC bias calculation unit
17 Subtraction unit
18 Start/end determination unit
19 Outer shape value determination unit
20 Voiced/silent frame determination unit
21 Voiced/silent section determination unit
22 First sound section extraction unit
23 First sound section outer shape value extraction unit
24 First gain determination unit
25 First volume adjustment instruction unit
26 Second volume adjustment instruction unit
27 Excessive input sample number determination unit
28 Excessive input frame determination unit
29 Storage unit
30 Second sound section excessive input sample number determination unit
31 Second sound section excessive input frame number determination unit
32 Second gain determination unit
33 End-time volume adjustment unit
34 Telephone
35 Headset
36 Handset branch adapter
38 Echo canceling unit
39 Recording unit
40 Disk
121 First volume adjustment unit
122 Second volume adjustment unit
181 Average value calculation unit
231 Exclusion unit
232 Maximum value determination unit
331 Storage unit

Claims

A volume adjustment device comprising:
frame dividing means for dividing an input sound into frames of a certain time length;
frame outer shape value determining means for obtaining, for each frame, an outer shape value, which is a feature amount representing the loudness of the sound contained in the frame;
first sound section outer shape value determining means for taking, as a first sound section, a sound section that is composed of a predetermined number B₂ or more of frames and is sandwiched between runs of a predetermined number B₁ or more of consecutive silent frames, excluding a plurality of outer shape values, starting from the largest, from the outer shape values of the frames constituting the first sound section, and taking the maximum of the remaining outer shape values that were not excluded as the outer shape value of the first sound section;
first volume adjustment instruction means for determining and outputting information for adjusting the volume of the input sound (hereinafter referred to as first volume adjustment information) so that the obtained outer shape value of the first sound section falls within a predetermined range; and
first volume adjustment means for adjusting the volume of the input sound using the output first volume adjustment information.

The volume adjustment device according to claim 1,
wherein the outer shape value of a frame is the maximum absolute value of the sample values contained in the frame.

The volume adjustment device according to claim 1 or 2, comprising:
voiced/silent frame determination means for determining a frame to be a voiced frame if the outer shape value of the frame is larger than a predetermined threshold B₃, and to be a silent frame otherwise;
voiced/silent section determination means for determining a sound section composed of a predetermined number B₁ or more of consecutive silent frames to be a silent section and determining the other sound sections to be voiced sections; and
first sound section extraction means for taking, as the first sound section, a voiced section longer than a predetermined time length among the determined voiced sections.
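As an informal illustration only, and not part of the claims, the processing recited in the three claims above can be sketched as follows. All function names are hypothetical. Two points are assumptions of this sketch rather than statements of the source: sections at the very start or end of the input are accepted even if they are not bounded by silence on both sides, and the gain rule in `first_gain` (scale to the nearest bound of the target range) is just one simple way of bringing the section value into the range.

```python
from typing import List, Sequence, Tuple


def frame_envelope(frame: Sequence[float]) -> float:
    """Outer shape value of a frame: maximum absolute sample value."""
    return max(abs(s) for s in frame)


def first_sound_sections(envelopes: Sequence[float], voiced_threshold: float,
                         min_silence: int, min_length: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, of first sound sections."""
    n = len(envelopes)
    in_silence = [False] * n
    i = 0
    while i < n:
        if envelopes[i] > voiced_threshold:      # voiced frame (threshold B3)
            i += 1
            continue
        j = i
        while j < n and envelopes[j] <= voiced_threshold:
            j += 1
        if j - i >= min_silence:                 # silent section: B1 or more frames
            for k in range(i, j):
                in_silence[k] = True
        i = j
    sections: List[Tuple[int, int]] = []
    start = None
    for idx in range(n + 1):
        inside = idx < n and not in_silence[idx]
        if inside and start is None:
            start = idx
        elif not inside and start is not None:
            if idx - start >= min_length:        # keep sections of B2 or more frames
                sections.append((start, idx))
            start = None
    return sections


def section_envelope(envelopes: Sequence[float], n_excluded: int) -> float:
    """Drop the n_excluded largest values and return the maximum of the rest."""
    remaining = sorted(envelopes, reverse=True)[n_excluded:]
    if not remaining:
        raise ValueError("n_excluded must be smaller than the number of frames")
    return remaining[0]


def first_gain(section_value: float, low: float, high: float) -> float:
    """Assumed rule: gain that moves the section value to the nearest range bound."""
    if section_value <= 0:
        return 1.0
    if section_value > high:
        return high / section_value
    if section_value < low:
        return low / section_value
    return 1.0
```

Discarding the largest few values before taking the maximum keeps a single short spike inside a section from dictating the gain for the whole section.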

The volume adjustment device according to any one of claims 1 to 3, comprising:
excessive input sample number determining means for determining, for each frame, the number of samples whose absolute sample value is larger than a predetermined threshold B₄ (hereinafter referred to as the number of excessive input samples);
excessive input frame determination means for determining whether the number of excessive input samples is larger than a predetermined number B₅ (hereinafter, a frame whose number of excessive input samples is larger than a predetermined number B₆ is referred to as an excessive input frame);
second volume adjustment instruction means for taking, as a second sound section, a sound section composed of a number of frames smaller than the number of frames constituting the first sound section, and outputting information for lowering the volume of the input sound by a predetermined amount (hereinafter referred to as second volume adjustment information) when the total of the determined numbers of excessive input samples for the frames constituting the second sound section is larger than a predetermined number B₇ and the number of excessive input frames among the frames constituting the second sound section is larger than a predetermined number B₈; and
second volume adjustment means for lowering the volume of the input sound using the output second volume adjustment information.

The volume adjustment device according to any one of claims 1 to 4, comprising:
means for calculating, for each frame, the average of the absolute values of the amplitude of the input sound;
means for determining that an utterance has started when a frame whose average value is larger than a predetermined threshold B₉ is detected, and determining that the utterance has ended only when a predetermined number B₁₁ of frames whose average value is smaller than a predetermined threshold B₁₀ continue consecutively; and
end-time volume adjustment means for, when it is determined that the utterance has ended, storing in storage means the first volume adjustment information and/or the second volume adjustment information at the end of the utterance, reading from the storage means the first volume adjustment information and/or the second volume adjustment information at the end of a predetermined number B₁₂ of past utterances counting back from the most recent utterance, obtaining their average value, and setting that average value in the first volume adjustment means and/or the second volume adjustment means.
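As an informal illustration only, and not part of the claims, the start and end determination recited above can be sketched as a small state machine. The function name and the choice to report each utterance as a pair of frame indices are assumptions of this sketch, and frames are assumed to be non-empty.

```python
from typing import List, Sequence, Tuple


def detect_utterances(frames: Sequence[Sequence[float]], start_threshold: float,
                      end_threshold: float, end_run: int) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) index pairs for each detected utterance.

    start_threshold, end_threshold and end_run play the roles of B9, B10 and B11.
    """
    events: List[Tuple[int, int]] = []
    active = False
    start = 0
    quiet_run = 0
    for idx, frame in enumerate(frames):
        # Per-frame average of the absolute amplitude
        level = sum(abs(s) for s in frame) / len(frame)
        if not active:
            if level > start_threshold:
                active = True
                start = idx
                quiet_run = 0
        elif level < end_threshold:
            quiet_run += 1
            if quiet_run >= end_run:   # end only after end_run consecutive quiet frames
                events.append((start, idx))
                active = False
        else:
            quiet_run = 0              # a louder frame interrupts the quiet run
    return events
```

Requiring a run of consecutive quiet frames before declaring the end keeps a brief pause inside an utterance from splitting it in two.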

A volume adjustment method comprising:
a frame dividing step of dividing an input sound into frames of a certain time length;
a frame outer shape value determining step of obtaining, for each frame, an outer shape value, which is a feature amount representing the loudness of the sound contained in the frame;
a first sound section outer shape value determining step of taking, as a first sound section, a sound section that is composed of a predetermined number B₂ or more of frames and is sandwiched between runs of a predetermined number B₁ or more of consecutive silent frames, excluding a plurality of outer shape values, starting from the largest, from the outer shape values of the frames constituting the first sound section, and taking the maximum of the remaining outer shape values that were not excluded as the outer shape value of the first sound section;
a first volume adjustment instruction step of determining and outputting information for adjusting the volume of the input sound (hereinafter referred to as first volume adjustment information) so that the obtained outer shape value of the first sound section falls within a predetermined range; and
a first volume adjustment step of adjusting the volume of the input sound using the output first volume adjustment information.

A sound volume adjusting program for causing a computer to function as each means of the sound volume adjusting device according to claim 1.