WO1997009713A1 - Audio signal processing method for faithful, variable-speed reproduction - Google Patents

Audio signal processing method for faithful, variable-speed reproduction

Info

Publication number
WO1997009713A1
WO1997009713A1 (PCT/CN1996/000074, CN9600074W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
extreme value
value
processing method
extreme
Prior art date
Application number
PCT/CN1996/000074
Other languages
English (en)
French (fr)
Inventor
Yong Su
Original Assignee
Shen, Xueliang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Family has litigation
First worldwide family litigation filed. The "Global patent litigation dataset" by Darts-ip is licensed under a Creative Commons Attribution 4.0 International License.
Application filed by Shen, Xueliang filed Critical Shen, Xueliang
Priority to AU68689/96A priority Critical patent/AU6868996A/en
Publication of WO1997009713A1 publication Critical patent/WO1997009713A1/zh

Links

Classifications

    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/02Analogue recording or reproducing
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/0091Means for obtaining special acoustic effects
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H7/00Instruments in which the tones are synthesised from a data store, e.g. computer organs
    • G10H7/02Instruments in which the tones are synthesised from a data store, e.g. computer organs in which amplitudes at successive sample points of a tone waveform are stored in one or more memories
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B15/00Driving, starting or stopping record carriers of filamentary or web form; Driving both such record carriers and heads; Guiding such record carriers or containers therefor; Control thereof; Control of operating function
    • G11B15/18Driving; Starting; Stopping; Arrangements for control or regulation thereof
    • G11B15/1808Driving of both record carrier and head
    • G11B15/1875Driving of both record carrier and head adaptations for special effects or editing
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/005Reproducing at a different information rate from the information rate of recording

Definitions

  • The present invention relates generally to a method for variable-speed processing of audio signals, and more particularly to a fidelity-preserving variable-speed processing method, including a fidelity-preserving slow-down method and a fidelity-preserving speed-up method.
  • Ordinary playback systems for recorded signals (such as tape recorders) usually play at a standard speed and output speech or sound at its normal rate.
  • In everyday work and study, however, it is sometimes desirable to change the speed of the reproduced speech by speeding it up or slowing it down; slowing the speech rate in particular is very helpful to foreign-language learners.
  • Traditionally, this has been achieved only by changing the tape-transport speed of the player.
  • The shortcoming of that method is obvious: a change in transport speed changes the frequency of the signal output by the playback head. Although the playback speed changes, the frequency of the sound changes with it, altering pitch and timbre.
  • The speech quality deteriorates, and in severe cases the speech content becomes unintelligible.
  • The purpose of the present invention is to provide an audio signal fidelity variable-speed processing method.
  • The method slows down or speeds up the delivery of the signal content while leaving the original signal's relative loudness, frequency, pitch, timbre and other characteristics unchanged.
  • Any natural sound-producing behavior involves a vibration process: a force must set an object vibrating, thereby generating sound waves.
  • This force is intermittent or non-constant.
  • After the force deforms the object, the object acquires the ability and/or tendency to return to its original form, producing a motion back toward the original state.
  • This generates vibration, and the vibration has the character of a damped vibration.
  • The audio signal is the electrical manifestation of this mechanical vibration, and its waveform corresponds to it; over a very small time interval, the signal can therefore be regarded as composed of tiny damped vibration waves.
  • Such a tiny damped vibration wave may be simple, i.e. a complete damped vibration wave of strictly decreasing amplitude, or composite, i.e. a damped vibration wave whose amplitude is not strictly decreasing, or which is incomplete, produced under the influence of noise, waveform superposition and other factors.
  • In electroacoustic technology, recording converts the mechanical vibration of sound into an electrical signal (an audio signal) with a corresponding waveform, or synthesizes an electronic audio signal directly; playback converts the audio signal containing the sound content back into mechanical vibration. An audio signal containing a sound signal is therefore itself a vibration signal, and it too contains damped vibration waves.
  • Audio signals are complex and diverse, and audio signals with different content have different signal characteristics.
  • Microscopically, these damped vibration waves are the basic units of the audio signal; in this application we call such a damped vibration wave a phoneme.
  • A phoneme is the most basic structural unit of an audio signal, containing an independent and complete elementary unit of information. (According to the experiments and observations of the present invention, a phoneme is usually no longer than 50 milliseconds and usually contains 2 to 24 extreme points, including peaks and valleys.)
  • Accordingly, the present invention provides an audio signal fidelity variable-speed processing method comprising the following steps: cutting the audio signal into small segments; and inserting at least one information unit after some or all of the segments, so as to lengthen the audio signal.
  • The processing method given above is a fidelity slow-down method.
  • The audio signal fidelity variable-speed processing method of the present invention further includes a fidelity speed-up method, which comprises the following steps: cutting the audio signal into small segments; and deleting some of the segments at intervals and splicing the remaining segments together, so as to shorten the audio signal.
  • The cutting may use a time interval as the basic cutting unit, the number of zero points or extreme points in the audio signal, or the number of phonemes described above. Using the number of phonemes as the basic cutting unit is particularly preferred.
  • When a time interval is used, its length is generally 0.1 to 400 milliseconds, with 1 to 20 milliseconds being best.
  • When the number of zero points or extreme points in the audio signal is used as the basic cutting unit, that number is generally 2 to 80, with 3 to 24 being best.
  • When the number of phonemes is used, each segment generally contains 1 to 10 phonemes, with 1 to 3 being best.
  • The inserted information unit has the basic characteristics of the small segment it follows, and its length is generally less than 400 milliseconds. It may be all or part of the signal preceding the insertion point, all or part of that signal after amplitude correction, or a blank signal. Within one audio signal, a single kind of information unit may be inserted, or any two, or all three, in combination.
  • Because the present invention lengthens an audio signal by inserting information units, or shortens it by deleting certain signal segments, the amount of information reproduced per unit time remains unchanged.
  • When the audio signal processed in this way is played back, the signal frequency is unchanged and the original pitch and timbre are preserved.
  • Conventional variable-speed processing methods do not increase or decrease the amount of information in the sound; they change the playback speed so that all of the original information is replayed over a longer or shorter period, which changes the amount of information replayed per unit time. When that change exceeds a certain level, severe distortion results. The processing method of the present invention is therefore a fidelity-preserving variable-speed method. This technology is applicable not only to language learning but also has broad application prospects in speech synthesis, speech recognition, spectrum analysis, music transcription, music learning, and performance evaluation in musical equipment and audio products.
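As an illustration of the slow-down idea just described, the following sketch cuts a signal into fixed-length segments and re-inserts copies of some of them. The segment length, the uniform insertion pattern, and the use of the preceding segment itself as the information unit are illustrative choices, not the patent's only options (it also allows non-uniform insertion and amplitude-corrected or blank information units).

```python
import numpy as np

def slow_down(signal, rate, seg_ms=10, factor=1.5):
    """Stretch `signal` by `factor` without changing pitch, by cutting it
    into short segments and re-inserting copies of some segments.
    A crude sketch of the patent's idea: the 10 ms segment length and the
    uniform duplication pattern are illustrative assumptions."""
    seg_len = max(1, int(rate * seg_ms / 1000))   # samples per segment
    segments = [signal[i:i + seg_len] for i in range(0, len(signal), seg_len)]
    extra = factor - 1.0                          # fraction of segments to duplicate
    out, debt = [], 0.0
    for seg in segments:
        out.append(seg)
        debt += extra
        if debt >= 1.0:                           # re-insert a copy of this segment
            out.append(seg)
            debt -= 1.0
    return np.concatenate(out)
```

With `factor=1.5` this corresponds to extending the sound by 1/2 (one minute of audio becomes one and a half minutes) while each individual segment is still replayed at its original rate.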
  • Figure 1 is a schematic diagram of an audio signal
  • Figure 2 is a waveform of a damped vibration wave
  • Figure 3 is a schematic diagram of the cutting points of an audio signal
  • Figure 4 is a flowchart of a phoneme segmentation method according to Embodiment 3 of the present invention
  • FIG. 5 is a flowchart of a phoneme segmentation method according to Embodiment 4 of the present invention.
  • FIG. 6 is a flowchart of a phoneme segmentation method according to Embodiment 5 of the present invention.
  • FIG. 7 is a flowchart of a phoneme segmentation method according to Embodiment 6 of the present invention.
  • Figure 8 is a schematic diagram of a section of damped vibration wave and its damped vibration envelope
  • FIGS. 9A and 9B are flowcharts of a phoneme segmentation method according to Embodiment 7 of the present invention.
  • FIG. 10 is a flowchart of a phoneme segmentation method according to Embodiment 8 of the present invention.
  • FIG. 11 is a flowchart of a phoneme segmentation method according to Embodiment 9 of the present invention.
  • FIG. 12 shows the damped vibration envelope before and after correction
  • FIG. 13 is a flowchart of a method for deleting small segments with similar traits according to Embodiment 11 of the present invention.
  • FIG. 14 is a flowchart of the improved deletion method that builds on Embodiment 11.
  • FIG. 15 is a block diagram of a computer system implementing the audio signal fidelity variable-speed processing method of the present invention.

Embodiments of the invention
  • Any audio signal is composed of phonemes.
  • The phonemes themselves are in a continuous process of arising, growing, developing, evolving, or dying away.
  • Figure 1 shows a section of an audio signal containing three phonemes. From the foregoing description, a phoneme is the sound unit produced by an object's damped vibration when a force acts on it.
  • In the ideal case, the damped vibration wave converges gradually; that is, within a damped vibration wave, the absolute value of each extreme value (peak or valley) is always smaller than the absolute value of the preceding one (see Figure 2).
  • The extreme values generally converge in a way that can be described by a damped vibration envelope equation.
  • The present invention also recognizes that different audio signals have different phoneme compositions, and that the differences between phonemes are related to the content of the signal.
  • The more repetitions of identical or similar phonemes that are connected in sequence, the longer the sound expressing a given content lasts.
  • The fewer such sequentially connected phonemes with identical or similar traits, the shorter the information expressing that content lasts. The audio signal fidelity variable-speed processing method of the present invention therefore artificially increases or decreases these sequentially connected information units with identical or similar traits, so that the information expressing the same content lasts longer or shorter, thereby achieving fidelity-preserving speed change.
  • The first things to consider are where to insert or delete sound information, and what kind of information to insert or delete.
  • Audio signal fidelity variable-speed processing has two aspects: fidelity slow-down and fidelity speed-up. Consider the slow-down method first. The audio signal is first cut into small segments, each 0.1 to 400 milliseconds long; an information unit is then inserted after some or all of the segments.
  • The hearing range of the human ear is generally 20 Hz to 20 kHz; sounds with frequencies in this range are audible. According to the inventors' experiments, for variable-speed processing over the entire audible range with good results, each segment is preferably 0.1 to 400 milliseconds long. Since the frequency range of speech signals is generally 200 to 4000 Hz, the preferred segment length for speech is 1 to 20 milliseconds.
  • Having determined where sound information is inserted, it is necessary to determine how much to insert, according to the degree of speed change the user requires. For example, to extend the sound by 1/2, so that material normally played in 1 minute now takes 1.5 minutes, 1/2 times the original amount of sound information must be inserted into the audio signal.
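The required amount of inserted information follows directly from the desired stretch; a trivial helper makes the arithmetic explicit:

```python
def inserted_fraction(original_seconds, target_seconds):
    """Fraction of extra sound information to insert for a desired
    slow-down: stretching 1 minute of audio to 1.5 minutes requires
    inserting 1/2 of the original amount of information."""
    return target_seconds / original_seconds - 1.0
```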
  • There are several insertion methods:
  • The former inserts an information unit after every small segment, while the latter inserts an information unit after only some of the segments. The insertion may be uniform, or it may be non-uniform.
  • the information units inserted above can be as follows:
  • the so-called amplitude correction refers to amplifying or attenuating the signal amplitude.
  • The above three types of information units may be used alone, or any two or all three may be combined.
  • For the speed-up method, the cutting is the same as in the fidelity slow-down method: the audio signal is cut into small segments, each 0.1 to 400 milliseconds long.
  • If the audio signal needs to be shortened by 1/4, this can be done by deleting one small segment after every four cutting points.
  • This deletes segments at uniform intervals; deletion may also be non-uniform, for example deleting one segment after 3 cutting points and the next after 5, but in general the total deleted should equal 1/4 of the whole audio signal. After deletion, the remaining segments are spliced back together.
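A minimal sketch of the deletion scheme above, assuming fixed-length segments and uniform deletion (deleting one segment out of every four shortens the signal by 1/4):

```python
import numpy as np

def speed_up(signal, rate, seg_ms=10, delete_every=4):
    """Cut the signal into fixed-length segments, delete every
    `delete_every`-th segment, and splice the rest back together.
    Uniform deletion is one option; the patent also allows non-uniform
    deletion as long as the total deleted fraction is the same."""
    seg_len = max(1, int(rate * seg_ms / 1000))
    segments = [signal[i:i + seg_len] for i in range(0, len(signal), seg_len)]
    kept = [s for i, s in enumerate(segments) if (i + 1) % delete_every != 0]
    return np.concatenate(kept)
```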
  • A cutting interval of 1 to 20 milliseconds is the preferred case.
  • The segment length may be chosen anywhere within 0.1 to 400 milliseconds. Within one cutting pass, the segment lengths may be equal or unequal, as long as each lies within 0.1 to 400 milliseconds.
  • the signals processed in this embodiment are all digital signals. If the audio signals are analog before processing, analog / digital conversion should be performed first.
  • When cutting is based on a length of time, the cutting point may fall at any position in the signal; as shown in Figure 3, it may fall at point A, B, C, or D. Clearly, when the cutting point is at A, B, or C, a smooth connection between the two adjacent segments cannot be guaranteed after information units are inserted or segments are deleted; the resulting abrupt transitions degrade the sound. If the cutting points are located at zero points (point D in Figure 3), the discontinuity between adjacent segments is reduced and the distortion decreases. (For a continuous analog signal, the zero point referred to here is the point of zero amplitude; for a sampled signal, the sample at which the value is zero or changes sign.)
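Locating cutting points at zero points (point D) can be sketched as follows for a sampled signal; treating both exact zero samples and sign changes between adjacent samples as zero points is an implementation choice, not the patent's prescription.

```python
import numpy as np

def zero_crossing_points(x):
    """Indices where a sampled signal crosses zero (exact zero sample or
    sign change between adjacent samples). These are candidate cutting
    points: splicing at zero crossings avoids amplitude jumps at joins."""
    x = np.asarray(x, dtype=float)
    exact = np.flatnonzero(x == 0.0)
    sign_change = np.flatnonzero(np.signbit(x[:-1]) != np.signbit(x[1:])) + 1
    return np.unique(np.concatenate([exact, sign_change]))
```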
  • Alternatively, the number of zero points or extreme points in the audio signal is used as the basic cutting unit.
  • The audio signal is divided into small segments at its zero points, each segment being 0.1 to 400 milliseconds long or containing 2 to 80 zero or extreme points.
  • The preferred range is 1 to 20 milliseconds per segment, or 3 to 24 zero or extreme points per segment.
  • A phoneme is the basic unit of an audio signal.
  • When the audio signal is simply divided into segments of 0.1 to 400 milliseconds, the cutting points often split phonemes, which may damage their integrity to some extent.
  • In this embodiment, segmentation therefore uses the phoneme as the basic cutting unit; each segment contains 1 to 10 phonemes, and 1 to 3 is particularly preferred.
  • A phoneme is the sound unit produced by an object's damped vibration when a force is applied to it; the first peak (extreme value) of the phoneme is therefore usually the largest. We call it the maximum extreme point.
  • The maximum extreme point can be determined by comparing the extreme points within the phoneme, either by comparing the absolute values of all extreme points, or by a unilateral comparison.
  • A unilateral comparison compares the positive extreme values (peaks) in the phoneme with each other, or compares the absolute values of the negative extreme values (valleys) with each other. The two comparisons can be used together, or either one alone. For convenience in actually searching for phonemes, this embodiment uses the positive-value (peak) comparison. Based on the characteristics of damped vibration, phoneme segmentation proceeds as follows.
  • The process starts at 100 by setting the number of phonemes (S) to be included in each small segment.
  • The number of phonemes per segment is set between 1 and 10, preferably 1 to 3.
  • In step 101, the positive samples between two adjacent zero points are compared; in step 102, the largest of them is taken as an extreme value Ao.
  • In step 103, a counter X is set to zero. Step 103A checks whether all the data have been processed; if so, the process ends at 114; otherwise the next set of positive samples between two adjacent zero points is compared (104).
  • In step 105, the largest of these is taken as an extreme value. The process then enters 106 and compares the two most recently obtained extreme values.
  • If the later extreme value is not greater, the flow returns to step 103A.
  • The next set of positive samples between two adjacent zero points is then compared at 104, the largest is taken as an extreme value at step 105, and the process enters 106 again to compare the two most recently obtained extreme values.
  • The small segments cut out in this embodiment all contain one or several complete phonemes; no cutting point falls inside a phoneme.
  • Cutting in this way before inserting or deleting gives better results than Embodiments 1 and 2.
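A sketch of the extreme-value comparison segmentation of Embodiment 3 (and, with a nonzero `margin`, the noise-tolerant variant of Embodiment 4): the maximum positive sample between successive zero points is taken as that half-cycle's extreme value; within one damped phoneme these extremes decrease, so an extreme that exceeds its predecessor marks the maximum extreme value of a new phoneme. All parameter choices here are illustrative assumptions.

```python
import numpy as np

def phoneme_boundaries(x, margin=0.0):
    """Return sample indices where new phonemes begin, using the
    unilateral (positive-peak) extreme-value comparison. `margin` is the
    'predetermined amount' of Embodiment 4; with margin=0 this is the
    simple comparison of Embodiment 3. Sketch only."""
    x = np.asarray(x, dtype=float)
    z = np.flatnonzero(np.signbit(x[:-1]) != np.signbit(x[1:])) + 1
    z = np.concatenate([[0], z, [len(x)]])
    starts, prev = [], None
    for a, b in zip(z[:-1], z[1:]):
        chunk = x[a:b]
        if chunk.max() <= 0:          # negative half-cycle: skipped (unilateral)
            continue
        m = chunk.max()
        if prev is not None and m > prev + margin:
            starts.append(a)          # new phoneme's maximum extreme value
        prev = m
    return starts
```

Two identical damped bursts in a row produce exactly one detected boundary, at the start of the second burst, since its first peak exceeds the decayed final peak of the first.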
  • Embodiment 3 assumes a fairly ideal state, without accounting for noise interference, waveform superposition and similar factors. Under their influence, the extreme values within one phoneme are sometimes not strictly decreasing.
  • FIG. 5 shows a phoneme segmentation method for this embodiment that takes these factors into account.
  • The method of FIG. 5 is basically the same as that of FIG. 4, except that step 107 is replaced by step 107A: the later extreme value is compared with the earlier one, and only when the later value exceeds the earlier one by more than a predetermined amount does the process enter 108 and take the later value as the maximum extreme value of the next phoneme; otherwise the flow returns to 103A.
  • The predetermined amount is chosen according to the noise interference, waveform superposition and similar factors in the audio signal.
  • The advantage of this embodiment over Embodiment 3 is that the influence of such factors on phoneme segmentation can be eliminated.
  • Embodiment 5 is a modification of Embodiment 3.
  • The segmentation method of FIG. 6 is basically the same as that of FIG. 4, except that a step 108A is added after step 107: when step 107 finds the later extreme value greater than the earlier one, the process enters 108A and compares the later extreme value with the maximum extreme value Ao of the phoneme to which the earlier extreme value belongs. If the later extreme value is greater than 60% of Ao, the process goes to 108 and takes it as the maximum extreme value of the next phoneme; otherwise the flow returns to 103A. If no maximum extreme value has yet been determined when the program starts, the first extreme value obtained is used as the maximum extreme value for comparison.
  • This embodiment improves on Embodiments 4 and 5.
  • The method of FIG. 7 differs from that of FIG. 5 in that steps 107B-107J are added after 107A: the later extreme value (call it M1) is taken as a new maximum extreme value either when M1 is greater than 60% of the maximum extreme value Ao of the phoneme to which the earlier extreme value belongs, or when the two extreme values following M1 are both smaller than M1.
  • Specifically, when the condition in 107A of FIG. 5 is not satisfied, the flow enters 107B and compares M1 with the maximum extreme value Ao of the phoneme to which the earlier extreme value belongs. If M1 is greater than 60% of Ao, the flow proceeds to 108; otherwise:
  • at 107C, the next set of positive samples between two adjacent zero points is compared;
  • at 107D, the largest value from 107C is taken as the extreme value M2. The flow then proceeds to 107E and compares M1 and M2. If M1 > M2, it goes to 107F.
  • At 107F, the positive samples between the next two adjacent zero points are compared; at 107G, the largest is taken as the extreme value M3. The flow then goes to 107H and compares M1 and M3. If M1 > M3, the process proceeds to 108, takes M1 as the maximum extreme value, and continues to 109.
  • In this way, phonemes with relatively complicated shapes can also be segmented.
  • The cutting methods described so far are extreme-value comparison methods: the maximum extreme value is found by comparing extreme values, thereby determining the cutting point.
  • In this embodiment, the maximum extreme value of the phoneme is instead determined by the damped-vibration-envelope-equation method: the extreme points of the phoneme are substituted into the envelope equation, and phonemes are found according to whether a discrimination condition is satisfied.
  • The substituted extreme points may be all extreme points taken by absolute value, both positive and negative, or unilateral extreme points, i.e. only the positive extremes (peaks) or only the absolute values of the negative extremes (valleys). The two can be used together or separately. In this embodiment, for convenience, the positive unilateral extreme points are substituted into the damped vibration envelope equation.
  • In step 201, the positive samples of the audio signal over a certain period (generally the length of one phoneme, within 50 milliseconds) are compared, and the largest is set as the maximum extreme value Ao. The flow then goes to 202, where a counter X is set to zero, and the time t corresponding to the maximum extreme value Ao is set to 0 (203).
  • The flow enters 204, and the next set of positive samples between two adjacent zero points is compared.
  • The largest of these is taken as the extreme value m.
  • When the discrimination condition is no longer satisfied, the current extreme value is determined at 214 to be the maximum extreme value Ao of the next phoneme.
  • Segments are then inserted or deleted by the same methods as before, according to actual need, to lengthen or shorten the audio signal. The process then returns to 202 to search for the next cutting point.
  • Embodiment 7 assumes a fairly ideal state and does not consider noise interference, waveform superposition and similar factors. Under their influence, the extreme values within one phoneme sometimes do not decrease strictly according to the damped vibration envelope equation.
  • FIG. 10 illustrates a phoneme segmentation method for this embodiment that takes these factors into account.
  • The method of FIG. 10 is basically the same as that of FIG. 9, except that a correction coefficient is applied to the amplitude of the damped vibration envelope.
  • The correction coefficient k is generally 1.0 to 1.4, preferably 1.3.
  • Another way to modify the amplitude of the damped vibration envelope is to add an amplitude correction amount to the equation: in step 208', the damped vibration envelope equation of the current phoneme is determined with an added term C, where C is the amplitude correction amount.
  • The correction amount C is chosen according to the noise interference and waveform superposition in the audio signal; it is generally 0 to 40% of Ao, preferably 30% of Ao.
  • the effect is shown in Figure 12B.
  • FIG. 11 shows a flowchart of the method. It is basically the same as the method of FIG. 9, except that a correction amount is added to the damping coefficient of the damped vibration envelope.
  • The correction amount D is chosen according to the influence of noise and other factors in the audio signal; it is generally 0 to -25% of the damping coefficient, preferably -3% to -8%.
  • The advantage of this embodiment over Embodiment 7 is that the influence of the above factors on phoneme segmentation can be eliminated.
  • This embodiment mainly concerns the audio signal fidelity speed-up method.
  • The first step is to cut the audio signal, which can be done with the phoneme as the basic cutting unit, as described in Embodiments 3-10.
  • This embodiment mainly discusses how to delete small segments so as to shorten the audio signal.
  • In Embodiment 1, a method of deleting some small segments at intervals was described.
  • Here a condition is added to the deletion: only small segments with similar traits are deleted.
  • The following takes small segments containing only one phoneme as an example; the method extends by analogy to segments containing several phonemes. The specific method is shown in Figure 13.
  • The predetermined amount E is generally set to 5%-20% of the maximum extreme value of the earlier of the two adjacent phonemes, or 5%-20% of that of the later one.
  • The predetermined amount F is 5%-20% of an extreme value of the earlier of the two adjacent phonemes, or 5%-20% of the corresponding extreme value of the later one.
  • The predetermined amount G is 5%-20% of the length of the earlier of the two adjacent phonemes, or 5%-20% of the length of the later one.
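The similar-trait test of Embodiment 11 can be sketched as follows. Representing a phoneme as a `(max_extreme, extremes, length)` tuple, and using a single 10% threshold in place of the separate predetermined amounts E, F, and G (each of which the patent sets at 5%-20%), are illustrative assumptions.

```python
def similar(ph1, ph2, ratio=0.1):
    """Decide whether two adjacent phonemes have 'similar traits': their
    maximum extreme values, corresponding extreme values, and lengths all
    differ by less than the fraction `ratio`. ph = (max_extreme, extremes,
    length). Deleting one of two similar phonemes shortens the signal with
    little audible change. Sketch only."""
    a0_1, ext1, len1 = ph1
    a0_2, ext2, len2 = ph2
    if abs(a0_1 - a0_2) > ratio * a0_1:          # condition E: max extremes
        return False
    if abs(len1 - len2) > ratio * len1:          # condition G: lengths
        return False
    for e1, e2 in zip(ext1, ext2):               # condition F: pairwise extremes
        if abs(e1 - e2) > ratio * max(abs(e1), 1e-12):
            return False
    return True
```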
  • This embodiment also concerns the audio signal fidelity speed-up method and is a further improvement of Embodiment 11.
  • Figure 14 shows the method of this embodiment. It differs from Embodiment 11 (Figure 13) in that at 301' the maximum extreme value and the extreme values of one phoneme are taken, and at 302' the maximum extreme value and the extreme values of the next phoneme are taken; steps 306A and 306B are inserted between 306 and 307. That is, at 306, when the differences lie within the predetermined amounts F and G, the flow proceeds to 306A and compares the corresponding extreme values of the two adjacent phonemes.
  • Depending on the result of the comparison, the flow returns to 301A or proceeds to 307.
  • The predetermined amount is generally set at 5%-20% of one of the two extreme values being compared.
  • The audio signal fidelity variable-speed processing method of the present invention has been described in detail above. It can be implemented with computer technology; for an ordinary technician in the computer field, implementing the above method on a computer is not difficult. A computer structure for doing so is briefly described below.
  • FIG. 15 is a block diagram of a computer system for implementing the audio signal fidelity shift processing method of the present invention.
  • the computer system includes a central processing unit CPU, program memory PRAM, data memory DRAM, and the like.
  • If the audio signal is analog (such as the output of a tape recorder),
  • it is first input to attenuator 1, converted to a digital signal by A/D converter 2, stored by the CPU in the data memory DRAM via the bus, and processed there as described above.
  • If the audio signal is digital (such as the output of a CD player), it can be sent directly over the bus through serial/parallel interface 3, stored by the CPU in the data memory DRAM, and processed.
  • the program memory PRAM stores a program for implementing the method of the present invention, and the CPU calls the program from the program memory PRAM to run.
  • the CPU records the processed data through a parallel / serial interface 4 to a digitally recorded medium such as a hard disk or a laser disc, or converts it to an analog signal after converting it to a D / A converter 5 and records it on a magnetic tape or the like Media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Stereophonic System (AREA)

Description

Audio Signal Fidelity Variable-Speed Processing Method

Technical Field

The present invention relates generally to a method for variable-speed processing of an audio signal, and more particularly to a method for fidelity variable-speed processing of an audio signal, including fidelity slow-down processing and fidelity speed-up processing.
Background Art

An ordinary recorded-sound playback system (such as a tape recorder) generally plays back at a standard speed and outputs speech or sound at the normal rate. In daily work and study, however, it is sometimes desirable to change the speed of the reproduced speech, speeding it up or, especially, slowing it down, which is of great help to learners of foreign languages. Traditionally, the playback speed has been changed simply by changing the tape transport speed of the player. The drawback of this method is obvious: changing the transport speed changes the frequency of the signal output by the playback head, so that although the playback speed changes, the frequency of the sound changes correspondingly, altering pitch and timbre, degrading the speech, and in severe cases making the spoken content unintelligible.
The object of the present invention is to provide an audio signal fidelity variable-speed processing method that slows down or speeds up the rate at which the signal content is expressed while keeping the relative loudness, frequency, pitch, timbre and other characteristics of the original audio signal unchanged.
As is well known, every natural sound-producing act involves a vibration process: a force must act on a body to set it vibrating and thereby produce sound waves, and this force is intermittent or non-constant. After the force acts on it, the body deforms and acquires the capacity and/or tendency to return to its original shape, producing a motion back toward the original state; vibration thus arises, and this vibration has the character of damped vibration. An audio signal is the electrical expression of such mechanical vibration, its waveform corresponding to the mechanical vibration, so that within a very short time interval it can be regarded as composed of tiny damped vibration waves. Such a tiny damped vibration wave may be of the simple type, a complete damped vibration wave whose amplitude decreases strictly, or of the composite type, a damped vibration wave whose amplitude does not decrease strictly, or which is incomplete, owing to noise interference, waveform superposition and other factors. In electro-acoustic technology, recording converts the mechanical vibration of sound into a corresponding electrical signal (the audio signal), or synthesizes an electronic audio signal directly by electronic means; playback converts the audio signal carrying the sound content back into mechanical vibration. An audio signal containing a sound signal is therefore itself a vibration signal and likewise contains damped vibration waves.
Audio signals are complex and varied, and audio signals with different content have different signal characteristics. As stated above, however, since sound signals are composed of tiny damped vibration waves differing in frequency, amplitude and so on, these damped vibration waves are, viewed microscopically, the basic units of which an audio signal is built. In this application we call such a damped vibration wave a sound element. A sound element is the most basic structural unit of an audio signal, carrying an independent and complete elementary item of information. (According to the experiments and observations underlying the present invention, a sound element is usually no longer than 50 milliseconds, and one sound element usually contains 2 to 24 extreme points, counting peaks and valleys.)
Summary of the Invention
Accordingly, the present invention provides an audio signal fidelity variable-speed processing method comprising the following steps:

cutting the audio signal into small segments;

inserting at least one information unit after some or all of the segments, so as to lengthen the audio signal.
The method provided above is a fidelity slow-down method. The fidelity variable-speed processing method of the present invention also includes a fidelity speed-up method, which comprises the following steps:

cutting the audio signal into small segments;

deleting some of the segments at intervals and splicing the remaining segments together, so as to shorten the audio signal.
The cutting performed here may take a time interval as the basic cutting unit, or the number of zero points or extreme points in the audio signal, or the number of sound elements defined above, the last of these being preferred.

When a time interval is the basic cutting unit, its length is generally 0.1-400 milliseconds, 1-20 milliseconds being preferred.

When the number of zero points or extreme points in the audio signal is the basic cutting unit, 2-80 such points are generally taken, 3-24 being preferred.

When the number of sound elements is the basic cutting unit, 1-10 sound elements are generally taken, 1-3 being preferred.

The information unit inserted here has the basic characteristics of the small segment after which it is inserted, and its duration is generally less than 400 milliseconds. It may be all or part of the segment preceding the insertion point, all or part of that segment with its amplitude corrected, or a stretch of blank signal. In processing one and the same audio signal, any one of these information units may be inserted, or any combination of two or of all three of them.
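The three kinds of information unit above can be sketched in a few lines (a minimal illustration; the helper name `information_unit` and the parameters `gain` and `blank_len` are our own, not terms from the patent):

```python
def information_unit(prev_segment, kind="copy", gain=0.8, blank_len=None):
    """Build an information unit from the segment before the insertion point.

    kind: "blank"  -> a silent stretch (defaulting here to the segment's length),
          "copy"   -> a verbatim copy of the preceding segment,
          "scaled" -> an amplitude-corrected (amplified or attenuated) copy.
    """
    if kind == "blank":
        n = blank_len if blank_len is not None else len(prev_segment)
        return [0.0] * n
    if kind == "scaled":
        return [gain * x for x in prev_segment]
    return list(prev_segment)
```

Any of the three kinds, or a mix of them, can then be spliced in after chosen cutting points.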
As can be seen from the above, although the present invention lengthens the audio signal by inserting information units, or shortens it by deleting some signal segments, the method keeps the amount of information reproduced per unit time unchanged; consequently, replaying the audio signal processed in this way does not change the signal frequency, and the original pitch and timbre are preserved. The various traditional variable-speed methods, by contrast, do not add or remove sound information but replay all the original information in a longer or shorter time by changing the playback speed or similar means, thereby changing the amount of information reproduced per unit time; when this change exceeds a certain degree, severe distortion results. The processing method of the present invention is therefore a fidelity variable-speed processing method. Beyond language learning, this technique has broad application prospects in speech synthesis, speech recognition, spectrum analysis, music-score recording, music learning, and performance evaluation of musical equipment and audio products.
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings; other objects and advantages of the invention will emerge from the description.
Brief Description of the Drawings

Figure 1 is a schematic diagram of a stretch of audio signal;

Figure 2 shows the waveform of a damped vibration wave; Figure 3 is a schematic diagram of cutting points of an audio signal;

Figure 4 is a flowchart of the sound-element division method of Embodiment 3 of the present invention;

Figure 5 is a flowchart of the sound-element division method of Embodiment 4;

Figure 6 is a flowchart of the sound-element division method of Embodiment 5;

Figure 7 is a flowchart of the sound-element division method of Embodiment 6;

Figure 8 is a schematic diagram of a damped vibration wave and its damped vibration envelope;

Figures 9A and 9B are flowcharts of the sound-element division method of Embodiment 7;

Figure 10 is a flowchart of the sound-element division method of Embodiment 8;

Figure 11 is a flowchart of the sound-element division method of Embodiment 9;

Figure 12 shows the damped vibration envelope before and after correction;

Figure 13 is a flowchart of the method of Embodiment 11 for deleting segments of similar character;

Figure 14 is a flowchart of the method of Embodiment 12 for deleting segments of similar character;

Figure 15 is a block diagram of a computer system implementing the audio signal fidelity variable-speed processing method of the present invention.

Modes for Carrying Out the Invention
Before the embodiments of the present invention are described, the sound element mentioned above is discussed first.

As stated earlier, the present invention regards every audio signal as a concatenation of sound elements; at the different stages of expressing the actual information content, the sound elements themselves are continually arising, growing, developing, evolving or dying away. Figure 1 shows a stretch of audio signal containing three sound elements. From the earlier description, a sound element is the sound unit produced by the damped vibration of a body when a force acts on it. According to the theory of damped vibration, in the ideal case a damped vibration wave converges gradually: within one damped vibration wave, the absolute value of each later extreme value (peak or valley) is always smaller than that of the preceding one (as shown in Figure 2). Within one sound element the extreme values converge overall, and this can be described by a damped vibration envelope equation.

The present invention further holds that different audio signals are composed of different sound elements, the differences between sound elements being related to the signal content. The more often sound elements of identical or similar character repeat in sequence, the longer the sound expressing the same content lasts in time; conversely, the fewer such sequentially connected identical or similar sound elements there are, the shorter the information expressing that content lasts. The fidelity variable-speed processing method of the present invention therefore artificially adds to, or removes from, the audio signal such sequentially connected information units of identical or similar character, so that the information expressing a given content lasts longer or shorter in time, thereby achieving fidelity variable-speed processing.
Embodiment 1

To add a certain amount of sound information to, or delete it from, an audio signal, the first questions are where to insert or delete sound information, and what information to insert or delete.

Fidelity variable-speed processing of an audio signal has two aspects: fidelity slow-down processing and fidelity speed-up processing. The slow-down method is discussed first. The audio signal is first cut into small segments, each 0.1-400 milliseconds long. An information unit is inserted after some or all of the segments.

As is well known, the range of human hearing is generally 20 Hz to 20 kHz; sounds within this range are audible. According to the inventor's experiments, if the present invention is to perform variable-speed processing over the whole audible range with good results, each segment is best 0.1-400 milliseconds long. Since the frequency band of speech signals generally lies between 200 and 4000 Hz, for speech signals a preferred segment length is 1-20 milliseconds.

Once the insertion positions have been determined, the amount of sound information to insert must be determined. This depends on the degree of speed change the user requires. For example, to lengthen the sound by 1/2, so that content normally played in 1 minute now plays in 1.5 minutes, 1/2 as much sound information must be inserted into the original audio signal. This can be done in the following ways:

1. at every cutting point, insert an information unit half as long as the segment preceding that cutting point;

2. at every other cutting point, insert an information unit equal in length to the segment preceding that cutting point.

The first method inserts an information unit after every segment, while the second inserts one after only some of the segments, uniformly; non-uniform insertion is of course also possible.

As another example, to double the length of the sound (audio signal), so that content normally played in 1 minute now plays in 2 minutes, an equal amount of sound information must be inserted into the original audio signal, which can be done in the following ways:

1. at every cutting point, insert an information unit equal in length to the segment preceding that cutting point;

2. at every cutting point, insert an information unit shorter than 400 milliseconds, the total length of the inserted units equalling the length of audio signal to be inserted, in this example 1 minute;

3. at every other cutting point, insert two information units each shorter than 400 milliseconds, the total length of the inserted units equalling the length of audio signal to be inserted, here 1 minute.

When the sound is to be lengthened by more times, the number of information units inserted after each cutting point increases correspondingly. The inserted information units may be of the following kinds:

1. a blank signal;

2. all or part of the segment preceding the insertion point;

3. all or part of the segment preceding the insertion point, with its amplitude corrected.

Since the human ear, like the eye with its persistence of vision, exhibits persistence of hearing, inserting a blank signal after a segment is feasible. According to experiment, a blank of about 50 milliseconds is good, but it should not exceed 100 milliseconds. In the third kind of information unit above, amplitude correction means amplifying or attenuating the signal amplitude. The three kinds of information unit may be used singly, in pairs, or all together.

Now consider the fidelity speed-up method. The cutting is the same as for slow-down: the audio signal is cut into segments each 0.1-400 milliseconds long. Suppose the audio signal is to be shortened by 1/4. One segment can be deleted at every fourth cutting point, which deletes segments at uniform intervals; deletion may also be non-uniform, e.g. deleting one segment after 3 cutting points and the next after 5, but overall the total of deleted segments should equal 1/4 of the whole audio signal. After the deletions, the undeleted segment signals are spliced tightly together.

In this embodiment the cutting interval is 1-20 milliseconds, which is a preferred case. In general the segment length may be chosen within 0.1-400 milliseconds; within one cutting pass the segment lengths may be equal or unequal, provided each lies within 0.1-400 milliseconds.

The signals processed in this embodiment are all digital; if the audio signal is analog beforehand, analog-to-digital conversion should be performed first.
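The fixed-length cutting, insertion and deletion of Embodiment 1 can be sketched on a digital signal as follows (a minimal illustration in samples rather than milliseconds; the function names and the uniform-interval defaults are our own):

```python
def cut(signal, seg_len):
    """Cut a sample list into consecutive segments of seg_len samples."""
    return [signal[i:i + seg_len] for i in range(0, len(signal), seg_len)]

def stretch(signal, seg_len, insert_every=2):
    """Slow-down: after every insert_every-th segment, insert an information
    unit equal to that segment (here, a repeat of it), lengthening the signal."""
    out = []
    for k, seg in enumerate(cut(signal, seg_len)):
        out.extend(seg)
        if (k + 1) % insert_every == 0:
            out.extend(seg)  # information unit = copy of the preceding segment
    return out

def shrink(signal, seg_len, delete_every=4):
    """Speed-up: delete every delete_every-th segment and splice the rest."""
    out = []
    for k, seg in enumerate(cut(signal, seg_len)):
        if (k + 1) % delete_every != 0:
            out.extend(seg)
    return out
```

Because whole segments are repeated or removed, the per-unit-time information rate of what remains is unchanged, which is the fidelity property the text describes.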
Embodiment 2

In Embodiment 1 the cutting is by time length, so a cutting point may fall anywhere in the signal. As shown in Figure 3, a cutting point may fall at point A, B, C or D. Clearly, when the cutting point lies at A, B or C, a smooth join between the two neighbouring segments cannot be guaranteed after information units are inserted or segments deleted; a discontinuity arises, and this discontinuity degrades the sound. If, however, every cutting point lies at a zero point (point D in Figure 3), the neighbouring segments join smoothly and distortion is reduced. (The zero point meant here is, for a continuous analog signal, the instant at which the amplitude is zero; for a discrete digital signal there may be no sample of exactly zero amplitude within a given stretch, in which case either of the two adjacent samples of opposite polarity, or the one of the two with the smaller absolute amplitude, may be taken as the zero point.) In this embodiment, therefore, the number of zero points or extreme points in the audio signal is the basic cutting unit: the audio signal is divided into segments at its zero points, each segment being 0.1-400 milliseconds long or containing 2-80 zero points or extreme points; preferably each segment is 1-20 milliseconds long or contains 3-24 zero points or extreme points. The insertion and deletion after division are the same as in Embodiment 1 and are not repeated here.
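For discrete samples, the zero-point rule just described (take an exact zero, or at a sign change the neighbour with the smaller absolute amplitude) might be sketched like this (the helper name is our own):

```python
def zero_points(samples):
    """Return candidate cut indices: exact zeros, plus, at every sign change
    between non-zero neighbours, the neighbour with the smaller |amplitude|."""
    zs = []
    for i in range(1, len(samples)):
        a, b = samples[i - 1], samples[i]
        if a == 0:
            zs.append(i - 1)                      # an exact zero sample
        elif (a > 0) != (b > 0) and b != 0:
            zs.append(i - 1 if abs(a) <= abs(b) else i)  # smaller-|value| neighbour
    return zs
```

Cutting only at these indices keeps the spliced segments joined near zero amplitude, avoiding the audible discontinuity described above.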
Embodiment 3

As stated earlier in this specification, the sound element is the basic unit of the audio signal. In Embodiments 1 and 2 the audio signal is divided into segments 0.1-400 milliseconds long, but the cutting points often split sound elements apart and may to some extent destroy their integrity.

In this embodiment the sound element is the basic cutting unit: each resulting segment contains 1-10 sound elements, 1-3 being preferred.

As stated above, a sound element is the sound unit produced by the damped vibration of a body acted on by a force; the first peak (extreme value) of a sound element is therefore usually the largest, and we call it the maximum extreme point. When dividing by sound elements, if the cutting point is placed at the zero point immediately preceding the maximum extreme point, the cutting point is guaranteed not to split a sound element, so complete sound elements are cut out.

The maximum extreme point can be determined by comparing the extreme points within a sound element, either by comparing the absolute values of all extreme points or by one-sided comparison. One-sided comparison means comparing the positive extreme values (peaks) of the sound element among themselves, or comparing the absolute values of the negative extreme values (valleys) among themselves. The two comparisons may be used together, or either may be used alone. For convenience in locating sound elements, this embodiment uses the positive-extreme-value form of one-sided comparison to find the maximum extreme value. Based on the characteristics of damped vibration, sound elements are divided as follows.

As shown in Figure 4, the flow starts at 100, and the number of sound elements per segment (S) is set, usually 1-10, preferably 1-3. In step 101 the positive sample values between two adjacent zero points are compared; in step 102 the maximum among them is taken as the extreme value A0. In step 103 a counter X is set to zero, and at 103A it is judged whether the current data have been exhausted; if so the flow returns at 114, otherwise the positive sample values between the next pair of adjacent zero points are compared (104). In step 105 the maximum among them is taken as an extreme value. At 106 the two most recently obtained extreme values are compared; if at 107 the later extreme value (the one obtained in 105) is not greater than the earlier one, the two belong to the same sound element, and the flow returns to 103A: after judging whether the data are exhausted, the positive sample values between the next pair of adjacent zero points are compared at 104 and their maximum is taken as an extreme value in 105. The flow enters 106 again and compares the two most recent extreme values; at 107, if the later extreme value is still not greater than the earlier one, the flow again returns to 103A. If the later extreme value is greater than the earlier one, a new sound element has begun, and that extreme value is the maximum extreme value A0 of the following sound element (108). At 109 the counter is incremented (X = X + 1), then X is compared with S (110); if X < S the flow returns to 103A, otherwise it enters 111, where the zero point preceding that maximum extreme value is taken as a cutting point. Then, at 112, the audio signal is lengthened or shortened as actually required, by the same insertion of information units or deletion of segments as described in Embodiment 1 or 2. Thereafter, at step 113, it is judged whether the data are exhausted; if so the flow returns at 114, otherwise it goes back to 103 to search for the next cutting point.

As can be seen from the above, every segment cut out in this embodiment contains one or several complete sound elements, and no cutting point falls inside a sound element. Cutting in this way and then inserting or deleting gives better results than Embodiments 1 and 2.
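The core test of the Figure 4 flow, that a new sound element begins where a half-wave peak exceeds its predecessor, might be sketched as follows (a simplification and our own naming: it operates on a pre-extracted sequence of positive half-wave peaks rather than on raw samples and zero points):

```python
def element_starts(peaks):
    """Indices into the peak sequence where a new sound element begins,
    i.e. where a peak is greater than the peak immediately before it."""
    starts = []
    for i in range(1, len(peaks)):
        if peaks[i] > peaks[i - 1]:   # the strict comparison of step 107
            starts.append(i)
    return starts
```

In the full method, the cutting point would then be the zero point just before each such peak.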
Embodiment 4

Embodiment 3 assumes a fairly ideal state and does not take noise interference, waveform superposition and similar factors into account. Under the influence of such factors, however, the extreme values within one sound element are sometimes not strictly decreasing.

Figure 5 shows the sound-element division method of this embodiment, which allows for these factors. The method is basically the same as in Figure 4, the difference being that step 107A replaces step 107 of Figure 4: the later extreme value is compared with the earlier one, and only when the later extreme value exceeds the earlier one by more than a predetermined amount does the flow enter 108 and take the later extreme value as the maximum extreme value of the following sound element; otherwise the flow returns to 103A. The predetermined amount can be set according to the noise interference, waveform superposition and so on in the audio signal; by experiment it is generally 20% to 40% of the earlier extreme value, i.e. the later extreme value is taken as a maximum extreme value only when it exceeds the earlier one by more than 20% to 40% of the earlier one. A preferred predetermined amount is 30% of the earlier extreme value.

The advantage of this embodiment over Embodiment 3 is that the influence of noise interference, waveform superposition and similar factors on sound-element division can be eliminated.
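Embodiment 4's tolerance amounts to a one-line change to the comparison: a new element starts only when a peak exceeds its predecessor by more than a chosen fraction of it (a sketch with our own name; the default 0.3 follows the preferred 30% in the text):

```python
def element_starts_tolerant(peaks, ratio=0.3):
    """Like the strict comparison, but a new element starts only when a peak
    exceeds its predecessor by more than ratio * predecessor (20%-40% typical),
    so that noise and waveform superposition do not trigger false starts."""
    return [i for i in range(1, len(peaks))
            if peaks[i] - peaks[i - 1] > ratio * peaks[i - 1]]
```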
Embodiment 5

This embodiment is a variation of Embodiment 3. The division method shown in Figure 6 is basically the same as that of Figure 4, the difference being that the method of Figure 6 adds a step 108A after step 107 of Figure 4: when the later extreme value is judged at 107 to be greater than the earlier one, the flow enters 108A, where the later extreme value is further compared with the maximum extreme value A0 of the sound element to which the earlier extreme value belongs. If the later extreme value exceeds 60% of the maximum extreme value A0, the flow enters 108 and the later extreme value is taken as the maximum extreme value of the following sound element; otherwise the flow returns to 103A. If, at the very start of the program, no maximum extreme value has yet been determined, the first extreme value obtained at the start is used as the maximum extreme value for the comparison.

The advantage of this embodiment over Embodiment 3 is that it allows for the influence of noise interference, waveform superposition and similar factors on sound-element division, making the cutting more accurate.
Embodiment 6

This embodiment is an improvement on Embodiments 4 and 5. As shown in Figure 7, the method of Figure 7 differs from that of Figure 5 in that steps 107B-107J are added after 107A of Figure 5. That is, when the later extreme value (call it M1 for convenience) exceeds 60% of the maximum extreme value A0 of the sound element containing the preceding extreme value, or when M1 is followed by two consecutive extreme values smaller than M1, M1 is taken as a maximum extreme value. The specific steps are: when the condition in 107A of Figure 5 is not met, the flow enters 107B, where the later extreme value M1 is compared with the maximum extreme value A0 of the sound element containing the preceding extreme value; if M1 exceeds 60% of A0 the flow enters 108, otherwise it enters 107C. At 107C the positive sample values between the next pair of adjacent zero points are compared, and at 107D their maximum is taken as extreme value M2. The flow then enters 107E and compares M1 with M2. If M1 > M2 it enters 107F, where the positive sample values between the following pair of adjacent zero points are compared, and at 107G their maximum is taken as extreme value M3. The flow then enters 107H and compares M1 with M3. If M1 > M3, the flow enters 108, takes the later extreme value M1 as a maximum extreme value, and proceeds to 109; the subsequent steps are as in Figure 5. If the condition in 107E is not met, the flow enters 107I and takes M2 as the maximum extreme value A0; if the condition in 107H is not met, it enters 107J, takes M3 as the maximum extreme value, and then proceeds directly to 109.

This embodiment can divide out sound elements of relatively complex shape.
实施例 7
在实施例 3到 6中, 叙述的切割方法属于极值比较法, 即通过比较各极值来寻 找最大极值, 从而确定切割点。
前面已经揭示, 每个音元实际上是一段阻尼振动波, 其包络线可用阻尼振动包 络线方程 Y(t)=A。e^ 1 (如图 8所示)来描述, 其中 AQ为最大极值, P为阻尼系数。 阻尼振动波的所有极值点均应落在该方程所描述的包络线上或其内。本实施例即根 据这一原理, 用阻尼振动包络线方程法来确定音元的最大极值。 即, 将音元中的极 值点代入方程, 根据是否满足判别条件来寻找音元。所代入的极值点可以是包含有 正极值点和负极值的绝对值的所有极值点, 也可以是单边极值点, 即只用正的极值 (峰)或只用负的极值的绝对值 (峰谷)。 这二者可以同时使用, 也可以择其一种使用。 本实施例出于便利等原因的考虑,选用单边极值点中的正极值代入阻尼振动包络线 方程。
图 9A和 9B示出了本实施例所述方法的流程图。 流程从 200开始, 并且设置小 段包含的音元个数 (S), 通常, 将一个小段中包含的音元个数设置成 1 - 10个, 较 佳的个数为 1 - 3个。 在步骤 201, 取音频信号起始一段时间 (一般取一个音元的 长度, 50毫秒以内)的各正样值进行比较,将比较得到的其中一个最大值定为最大 极值 A。。 然后进入 202 , 将计数器 X置零。 然后, 将该最大极值 A。对应的时间 t 置为 0(203)。 流程进入 204 , 取下一组两相邻零点之间的各正样值进行比较。 在 205 , 将其中一个最大值定为极值 m。 然后将极值 m、 最大极值 AQ和极值 m所对 应的时间 tm代入方程 Y(t)=Aoe— β ι(206), 成为 πι=Α。^ m , 求出阻尼系数 P (207)。 在 207A判别, 若 < 0, 至 214, 确定 m为 AQ , 否则, 求出 P后, 即可确定当 前音元的阻尼振动包络线方程 (208)。 然后, 取再下一组两相邻零点之间的各正样 值进行比较 (209), 在 210将其中一个最大值定为极值 n , 并在 211, 确定该极值 所对应的时间 tn。 将^代入方程 Y(t)=A。e—e t , 即可求出 Y(tn)(212)。 流程进人 213 , 比较极值 n和 Y(tn), 如果 n<=Y(tn), 则说明该极值 n仍属于最大极值 A。所表征的 音元, 流程返回 209, 寻找下一个极值。 如果 n〉Y(tn), 则在 214确定当前极值为 下一音元的最大极值 AQ , 在 215 , 计数器 X加 1(X=X+1), 然后比较 X和 S(216), 如果 XoS, 则流程返回 203 , 否则, 流程进 217 , 把该最大极值 AQ的前一个零 点作为切割点, 然后在 218 , 用如在实施例 1或 2中所述的插人信息单元或删除小 段相同的方法, 根据实际需要, 延长或缩短音频信号。 此后流程返回到 202, 进行 下一切割点的寻找。 另外, 在每次取数据进行比较样值之前, 即在步骤 204和 209 之前, 有一判别步骤 (图中未示出), 判别是否还有未处理的数据。 如有, 则流程进 行下去; 如无, 则流程返回上一层程序。
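The two computations at the heart of the Figure 9 flow, solving m = A0·e^(-β·tm) for β (steps 206-207) and testing a later extremum against the envelope (steps 212-213), can be sketched as (function names are ours):

```python
import math

def damping_coefficient(a0, m, tm):
    """Solve m = a0 * exp(-beta * tm) for beta (steps 206-207).
    beta < 0 would mean m > a0, the case handled at 207A."""
    return math.log(a0 / m) / tm

def within_envelope(a0, beta, n, tn):
    """Steps 212-213: extremum n still belongs to the current sound element
    while n <= Y(tn) = a0 * exp(-beta * tn); once it exceeds the envelope,
    a new sound element with maximum extreme value n begins."""
    return n <= a0 * math.exp(-beta * tn)
```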
Embodiment 8

Embodiment 7 assumes a fairly ideal state and does not consider noise interference, waveform superposition and similar factors, under whose influence the extreme values within one sound element sometimes do not decrease strictly according to the damped vibration envelope equation.

Figure 10 shows the sound-element division method of this embodiment, which allows for these factors. The method is basically the same as in Figure 9, the difference being that a correction coefficient is applied to the amplitude of the damped vibration envelope: as shown in Figure 10, in step 208' the envelope equation of the current sound element is determined as Y(t) = k·A0·e^(-βt), where k is an amplitude correction coefficient, generally 1.0-1.4 and preferably 1.3. Another way to correct the envelope amplitude is to add an amplitude correction amount to the equation, i.e. to determine in step 208' the envelope equation of the current sound element as Y(t) = (A0 + C)·e^(-βt), where C is the amplitude correction amount. The correction amount C should be set according to the noise interference, waveform superposition and so on in the audio signal, generally 0 to 40% of A0, preferably 30% of A0; its effect is shown in Figure 12B.
The advantage of this embodiment over Embodiment 7 is that the influence of noise interference, waveform superposition and the like on sound-element division can be eliminated.

Embodiment 9

This embodiment describes another sound-element division method that allows for noise interference, waveform superposition and similar factors. Figure 11 shows its flowchart. The method is basically the same as that of Figure 9, the difference being that a correction amount is applied to the damping coefficient of the envelope: as shown in Figure 11, in step 208'' the damped vibration envelope equation of the current sound element is determined as Y(t) = A0·e^(-(β+D)t), where D is a damping-coefficient correction value that slows the convergence of the envelope; its effect is shown in Figure 12A. The correction amount D should be set according to the severity of noise and other factors in the audio signal, generally 0 to -25% of β, preferably -3% to -8% of β.

Likewise, the advantage of this embodiment over Embodiment 7 is that the influence of the factors above on sound-element division can be eliminated.
Embodiment 10

This embodiment combines Embodiments 8 and 9: in determining the damped vibration envelope equation, both an amplitude correction amount (or correction coefficient) and a damping-coefficient correction amount are applied, the envelope equation being determined as Y(t) = k·(A0 + C)·e^(-(β+D)t), where C is the amplitude correction amount, D the damping-coefficient correction amount, and k the amplitude correction coefficient. C is generally 0 to 40% of A0; k is generally 1.0 to 1.4; D is generally -25% to +25% of β, preferably -6% to +6%.
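Embodiment 10's combined correction might be sketched as follows (a sketch with our own naming; the defaults follow the preferred values in the text: k = 1.3, C = 30% of A0, D = -6% of β):

```python
import math

def corrected_envelope(a0, beta, t, k=1.3, c_ratio=0.3, d_ratio=-0.06):
    """Y(t) = k * (A0 + C) * exp(-(beta + D) * t), with C = c_ratio * A0 and
    D = d_ratio * beta, per the combined correction of Embodiment 10.
    Widening the envelope keeps noisy in-element extrema from being
    mistaken for the start of a new sound element."""
    c = c_ratio * a0
    d = d_ratio * beta
    return k * (a0 + c) * math.exp(-(beta + d) * t)
```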
Embodiment 11

This embodiment mainly concerns the fidelity speed-up method. The audio signal is first cut, which may be done with the sound element as the basic cutting unit as described in Embodiments 3-10; this embodiment mainly discusses how to delete segments so as to shorten the audio signal. Embodiment 1 described a method of deleting some segments at intervals. This embodiment adds a condition to the deletion: only segments of similar character are deleted. The following takes a segment containing a single sound element as an example; segments containing several sound elements can be treated analogously. The specific method is shown in Figure 13.

The flow starts at 300. First, at 301, the maximum extreme value of one sound element is taken; at 301A it is judged whether unprocessed sound elements remain, and if not the flow returns at 309; otherwise, at 302, the maximum extreme value of the next sound element is taken. At 303 the maximum extreme values of the two adjacent sound elements are compared; if the absolute value of their difference ΔA exceeds a predetermined amount E (304), the two sound elements are dissimilar in character and the flow returns to 301A. If |ΔA| <= E, the flow enters 305 and compares the numbers of extreme values, or the lengths, of the two adjacent sound elements; if the difference in the number of extreme values |ΔN| exceeds a predetermined amount F, or the difference in length |ΔT| exceeds a predetermined amount G (306), the two sound elements are dissimilar in character and the flow returns to 301A. If |ΔN| <= F, or |ΔT| <= G, the two adjacent sound elements are similar in character: at 307 the later sound element is deleted, and at 308 it is judged whether unprocessed sound elements remain; if not the flow returns at 309, otherwise it goes back to 301.

In this embodiment the predetermined amount E is generally set at 5%-20% of the maximum extreme value of the earlier of the two adjacent sound elements, or 5%-20% of that of the later one. The predetermined amount F is 5%-20% of the number of extreme values of the earlier sound element, or 5%-20% of that of the later one. The predetermined amount G is 5%-20% of the length of the earlier sound element, or 5%-20% of that of the later one.

Clearly, since only segments (sound elements) of similar character are deleted in this embodiment, the shortened audio signal obtained with this deletion method sounds better on playback.
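The similarity tests and deletion of Figure 13 could be sketched over per-element features (a sketch: dictionaries with keys `peak`, `n_extrema`, `length` are our own representation, the thresholds default to 10% of the earlier element's values, and we read the two tests at step 306 conjunctively, matching the dissimilarity branch of the text):

```python
def similar(a, b, e=0.1, f=0.1, g=0.1):
    """True when two adjacent sound elements match in maximum extreme value,
    extreme-value count and length, within fractions e, f, g of a's values."""
    return (abs(a["peak"] - b["peak"]) <= e * a["peak"]
            and abs(a["n_extrema"] - b["n_extrema"]) <= f * a["n_extrema"]
            and abs(a["length"] - b["length"]) <= g * a["length"])

def delete_similar(elements):
    """Keep the first element; drop each later element similar to the last
    kept one, splicing the survivors tightly together."""
    kept = [elements[0]]
    for el in elements[1:]:
        if not similar(kept[-1], el):
            kept.append(el)
    return kept
```

Deleting only similar neighbours removes repeated information units, which is why the speed-up sounds better than blind interval deletion.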
Embodiment 12

This embodiment mainly concerns the fidelity speed-up method and is a further improvement on Embodiment 11. Figure 14 shows the method of this embodiment. It differs from Embodiment 11 (Figure 13) in that at 301' the maximum extreme value and the other extreme values of one sound element are taken out; at 302' the maximum extreme value and the other extreme values of the next sound element are taken out; and steps 306A and 306B are inserted between 306 and 307. That is, at 306, when |ΔN| <= F or |ΔT| <= G, the flow enters 306A, where the corresponding extreme values of the two adjacent sound elements are compared; at 306B, if the absolute values of the differences between corresponding extreme values all exceed a predetermined amount, the two adjacent sound elements are dissimilar and the flow returns to 301A; otherwise it proceeds to 307. The predetermined amount is generally set at 5%-20% of one of the two extreme values being compared.

The effect of this embodiment is better than that of Embodiment 11.
The audio signal fidelity variable-speed processing method of the present invention has been described in detail above. It can be implemented with computer technology, which has by now developed to a considerable level; for a person of ordinary skill in the computer field, implementing the above method on a computer is not difficult. A computer structure for implementing the method is briefly described below.

Figure 15 is a block diagram of a computer system implementing the audio signal fidelity variable-speed processing method of the present invention. As shown in Figure 15, the computer system includes a central processing unit CPU, a program memory PRAM, a data memory DRAM and the like. If the audio signal is an analog signal (such as the output of a tape recorder), it is first fed to the attenuator 1, converted to a digital signal by the A/D converter 2, stored by the CPU in the data memory DRAM over the bus BUS, and processed there by the method above. If the audio signal is a digital signal (such as the output of a CD player), it can be sent directly to the bus BUS through the serial/parallel interface 3, stored by the CPU in the data memory DRAM, and processed. The program memory PRAM stores a program implementing the method of the present invention, which the CPU loads from the PRAM and runs. The CPU records the processed data through the parallel/serial interface 4 onto a digitally recorded medium such as a hard disk or an optical disc, or converts it to an analog signal with the D/A converter 5 and records it on an analog medium such as magnetic tape.

Claims

Claims
1. An audio signal fidelity variable-speed processing method, characterized by comprising the following steps:

cutting the audio signal into small segments;

inserting at least one information unit after some or all of the segments, so as to lengthen the audio signal.
2. An audio signal fidelity variable-speed processing method, characterized by comprising the following steps:

cutting the audio signal into small segments;

deleting some of the segments at intervals and splicing the remaining segments together, so as to shorten the audio signal.

3. The audio signal fidelity variable-speed processing method of claim 1 or 2, characterized in that the division into segments takes a time interval as the basic cutting unit, the resulting segments being 0.1-400 milliseconds long.
4. The audio signal fidelity variable-speed processing method of claim 1 or 2, characterized in that the division into segments takes the number of zero points or extreme points in the audio signal as the basic cutting unit, the resulting segments containing 2-80 zero points or extreme points.

5. The audio signal fidelity variable-speed processing method of claim 1 or 2, characterized in that the division into segments takes the number of sound elements in the audio signal as the basic cutting unit, the resulting segments containing 1-10 sound elements.

6. The audio signal fidelity variable-speed processing method of claim 1, characterized in that the information unit is all or part of the segment of audio signal preceding the insertion point, all or part of that segment with its amplitude corrected, and/or a blank signal shorter than 100 milliseconds.

7. The audio signal fidelity variable-speed processing method of claim 5, characterized in that the division method taking the number of sound elements as the basic cutting unit compares adjacent extreme values and, if the later extreme value is greater than the earlier one, determines the first zero point before the later extreme value as a cutting point.
8. The audio signal fidelity variable-speed processing method of claim 7, characterized in that the division method taking the number of sound elements as the basic cutting unit comprises:

(a1) determining the number of sound elements the segment is to contain;

(a2) comparing the sample values between two adjacent zero points and taking the one with the largest absolute value as an extreme value;

(a3) comparing the two adjacent extreme values: when the comparison shows the earlier extreme value to be greater than the later one, taking the next zero point and returning to step (a2); otherwise taking the later extreme value as a maximum extreme value;

(a4) counting the sound elements: if the number of sound elements contained in the segment equals the number determined in step (a1), taking the first zero point before the maximum extreme value as a cutting point, restarting the count and returning to step (a2); otherwise incrementing the counter, taking the next zero point and returning to step (a2).

9. The audio signal fidelity variable-speed processing method of claim 8, characterized in that in the extreme-value comparison step (a3), the later extreme value is taken as a maximum extreme value when the comparison shows it to exceed the earlier one by a predetermined amount.
10. The audio signal fidelity variable-speed processing method of claim 9, characterized in that the predetermined amount is 20% to 40% of the earlier extreme value.

11. The audio signal fidelity variable-speed processing method of claim 8, characterized in that in step (a3), when the later extreme value is judged greater than the earlier one, the later extreme value is compared with the maximum extreme value of the sound element to which the earlier extreme value belongs; if the later extreme value exceeds 60% of said maximum extreme value, the later extreme value is determined to be a maximum extreme value, otherwise the flow returns to step (a2).

12. The audio signal fidelity variable-speed processing method of claim 5, characterized in that the division method taking the number of sound elements as the basic cutting unit uses the reference damped vibration envelope equation Y(t) = A0·e^(-βt) to determine the extreme values in a sound element.
13. The audio signal fidelity variable-speed processing method of claim 12, characterized in that the division method taking the number of sound elements as the basic cutting unit comprises:

(b1) determining the number of sound elements the segment is to contain; comparing the absolute sample values over an initial stretch of the audio signal and taking the largest among them as the maximum extreme value A0;

(b2) setting the time t to zero;

(b3) comparing the absolute sample values between the next pair of adjacent zero points and taking the largest among them as extreme value m;

(b4) finding the damping coefficient of the damped vibration envelope equation from the maximum extreme value A0, the extreme value m and the time tm corresponding to that extreme value, and determining the damped vibration envelope equation characterized by the maximum extreme value as Y(t) = A0·e^(-βt), where A0 is the maximum extreme value and β is the damping coefficient;

(b5) comparing the absolute sample values between the next pair of adjacent zero points, taking the largest among them as extreme value n, and substituting the time tn corresponding to that extreme value into the damped vibration envelope equation to obtain the envelope value Y(tn) at that time;

(b6) comparing the extreme value n with the envelope value Y(tn): if Y(tn) > n, returning to step (b3); otherwise determining the extreme value n to be the maximum extreme value A0 of the next sound element;

(b7) counting the sound elements: if the number of sound elements contained in the segment equals the number determined in step (b1), taking the first zero point before the maximum extreme value as a cutting point, restarting the count and returning to step (b2); otherwise incrementing the counter, the flow returning to step (b2).
14. The audio signal fidelity variable-speed processing method of claim 13, characterized in that in step (b4) the envelope equation is determined as Y(t) = k·(A0 + C)·e^(-(β+D)t), where k is an amplitude correction coefficient, C is an amplitude correction amount, and D is a damping-coefficient correction amount.

15. The audio signal fidelity variable-speed processing method of claim 14, characterized in that the amplitude correction coefficient k lies between 1.0 and 1.4.

16. The audio signal fidelity variable-speed processing method of claim 14, characterized in that the amplitude correction amount C lies between 0 and 40% of A0.

17. The audio signal fidelity variable-speed processing method of claim 14, characterized in that the damping-coefficient correction amount D lies between -25% of β and +25% of β.

18. The audio signal fidelity variable-speed processing method of claim 7 or 12, characterized in that the extreme values are the positive extreme values and/or the absolute values of the negative extreme values, or both the positive extreme values and the absolute values of the negative extreme values.

19. The audio signal fidelity variable-speed processing method of claim 1, characterized in that at least one information unit is inserted after some of the segments at uniform intervals.

20. The audio signal fidelity variable-speed processing method of claim 2, characterized in that some of the segments are deleted at uniform intervals.

21. The audio signal fidelity variable-speed processing method of claim 2, characterized in that in the deletion step, segments of similar character are deleted.
22. The audio signal fidelity variable-speed processing method of claim 21, characterized in that the deletion of segments of similar character comprises the following steps:

comparing the maximum extreme values of two adjacent sound elements, and judging the two adjacent sound elements dissimilar if the absolute value of the difference of their maximum extreme values exceeds a first predetermined amount; otherwise,

comparing the numbers of extreme values of the two adjacent sound elements, or the lengths of the two adjacent sound elements, and judging the two adjacent sound elements dissimilar if the absolute value of the difference in the number of extreme values exceeds a second predetermined amount, or the absolute value of the difference in length exceeds a third predetermined amount; otherwise,

deleting the later sound element and splicing the sound elements before and after the deleted one tightly together.

23. The audio signal fidelity variable-speed processing method of claim 22, characterized in that the step of deleting segments of similar character adds, after the comparison of the numbers of extreme values or of the lengths of the two adjacent sound elements, the following step:

comparing the corresponding extreme values of the two adjacent sound elements, and judging the two adjacent sound elements similar and deleting the later one if the absolute values of the differences between corresponding extreme values are all smaller than a fourth predetermined amount.

24. The audio signal fidelity variable-speed processing method of claim 22, characterized in that the first predetermined amount is 5%-20% of the maximum extreme value of the earlier sound element or of the maximum extreme value of the later sound element, the second predetermined amount is 5%-20% of the number of extreme values of the earlier sound element or of the number of extreme values of the later sound element, and the third predetermined amount is 5%-20% of the length of the earlier sound element or of the length of the later sound element.

25. The audio signal fidelity variable-speed processing method of claim 23, characterized in that the fourth predetermined amount is 5%-20% of one of the two extreme values being compared.
PCT/CN1996/000074 1995-09-01 1996-09-02 Procede de traitement de signal audio en vue d'une reproduction fidele et a vitesse variable WO1997009713A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU68689/96A AU6868996A (en) 1995-09-01 1996-09-02 A method of processing audio signal for fidelity varying-speed replaying

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN 95115914 CN1145519A (zh) 1995-09-01 1995-09-01 音频信号保真变速处理方法
CN95115914.3 1995-09-01

Publications (1)

Publication Number Publication Date
WO1997009713A1 true WO1997009713A1 (fr) 1997-03-13

Family

ID=5080693

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN1996/000074 WO1997009713A1 (fr) 1995-09-01 1996-09-02 Procede de traitement de signal audio en vue d'une reproduction fidele et a vitesse variable

Country Status (3)

Country Link
CN (1) CN1145519A (zh)
AU (1) AU6868996A (zh)
WO (1) WO1997009713A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7136571B1 (en) 2000-10-11 2006-11-14 Koninklijke Philips Electronics N.V. System and method for fast playback of video with selected audio
CN102855883A (zh) * 2011-06-28 2013-01-02 清华大学 Digital audio extension method based on audio features

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6625387B1 (en) * 2002-03-01 2003-09-23 Thomson Licensing S.A. Gated silence removal during video trick modes
CN101901612B (zh) * 2009-05-27 2013-07-24 珠海扬智电子有限公司 Method and device for playing sound at variable speed without pitch change
CN114566164A (zh) * 2022-02-23 2022-05-31 成都智元汇信息技术股份有限公司 Public-transport-based adaptive method for manually announced audio, display terminal and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0376342A2 (en) * 1988-12-29 1990-07-04 Casio Computer Company Limited Data processing apparatus for electronic musical instruments
CN1021091C (zh) * 1989-11-09 1993-06-02 庄明 Incremental modulation method for the tone-source waveform of an electronic piano and circuit therefor
CN1023353C (zh) * 1989-05-22 1993-12-29 株式会社精工舍 Method and apparatus for sound recording and playback

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0376342A2 (en) * 1988-12-29 1990-07-04 Casio Computer Company Limited Data processing apparatus for electronic musical instruments
CN1023353C (zh) * 1989-05-22 1993-12-29 株式会社精工舍 Method and apparatus for sound recording and playback
CN1021091C (zh) * 1989-11-09 1993-06-02 庄明 Incremental modulation method for the tone-source waveform of an electronic piano and circuit therefor


Also Published As

Publication number Publication date
CN1145519A (zh) 1997-03-19
AU6868996A (en) 1997-03-27

Similar Documents

Publication Publication Date Title
Arons Techniques, perception, and applications of time-compressed speech
KR101334366B1 (ko) 오디오 배속 재생 방법 및 장치
JP5367932B2 (ja) オーディオ速度変換を可能にするシステムおよび方法
JP4965371B2 (ja) 音声再生装置
JP3308567B2 (ja) ディジタル音声処理装置及びディジタル音声処理方法
US6085157A (en) Reproducing velocity converting apparatus with different speech velocity between voiced sound and unvoiced sound
WO1997009713A1 (fr) Procede de traitement de signal audio en vue d'une reproduction fidele et a vitesse variable
JPS5982608A (ja) 音声の再生速度制御方式
JP3373933B2 (ja) 話速変換装置
JP4580297B2 (ja) 音声再生装置、音声録音再生装置、およびそれらの方法、記録媒体、集積回路
WO1998044483A1 (en) Time scale modification of audiovisual playback and teaching listening comprehension
JP2009075280A (ja) コンテンツ再生装置
JP2000081897A (ja) 音声情報の記録方法、音声情報記録媒体、並びに音声情報の再生方法及び再生装置
JPH09138698A (ja) 音声記録再生装置
JP3081469B2 (ja) 話速変換装置
JPH04367898A (ja) 音声再生装置
JPH0573089A (ja) 音声再生方法
KR100372576B1 (ko) 오디오신호 가공방법
JPH07192392A (ja) 話速変換装置
JPH0854895A (ja) 再生装置
JP4648183B2 (ja) 連続メディアデータ短縮再生方法、複合メディアデータ短縮再生方法及び装置及びプログラム及びコンピュータ読み取り可能な記録媒体
JP2962777B2 (ja) 音声信号の時間軸伸長圧縮装置
JPH09146587A (ja) 話速変換装置
JPH05303400A (ja) 音声再生装置と音声再生方法
CN1074849C (zh) Audio signal fidelity variable-speed processing method

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BB BG BR BY CA CH CN CZ DE DK EE ES FI GB GE HU IL IS JP KE KG KP KR KZ LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG US UZ VN

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA