CN104079247B - Equalizer controller and control method and audio reproducing system - Google Patents
Equalizer controller and control method and audio reproducing system
- Publication number
- CN104079247B (application CN201310100401.XA)
- Authority
- CN
- China
- Prior art keywords
- audio
- term
- short
- balanced
- music
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03G—CONTROL OF AMPLIFICATION
- H03G5/00—Tone control or bandwidth control in amplifiers
- H03G5/16—Automatic control
- H03G5/165—Equalizers; Volume or gain control in limited frequency bands
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/02—Means for controlling the tone frequencies, e.g. attack or decay; Means for producing special musical effects, e.g. vibratos or glissandos
- G10H1/06—Circuits for establishing the harmonic content of tones, or other arrangements for changing the tone colour
- G10H1/12—Circuits for establishing the harmonic content of tones, or other arrangements for changing the tone colour by filtering complex waveforms
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/46—Volume control
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03G—CONTROL OF AMPLIFICATION
- H03G5/00—Tone control or bandwidth control in amplifiers
- H03G5/005—Tone control or bandwidth control in amplifiers of digital signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- H04R3/04—Circuits for transducers, loudspeakers or microphones for correcting frequency response
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/307—Frequency adjustment, e.g. tone control
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/036—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal of musical genre, i.e. analysing the style of musical pieces, usually for selection, filtering or classification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/046—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for differentiation between music and non-music signals, based on the identification of musical parameters, e.g. based on tempo detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/155—Musical effects
- G10H2210/265—Acoustic effect simulation, i.e. volume, spatial, resonance or reverberation effects added to a musical sound, usually by appropriate filtering or delays
- G10H2210/295—Spatial effects, musical uses of multiple audio channels, e.g. stereo
- G10H2210/301—Soundscape or sound field simulation, reproduction or control for musical purposes, e.g. surround or 3D sound; Granular synthesis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2240/00—Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
- G10H2240/075—Musical metadata derived from musical analysis or for use in electrophonic musical instruments
- G10H2240/081—Genre classification, i.e. descriptive metadata for classification or selection of musical pieces according to style
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/81—Detection of presence or absence of voice signals for discriminating voice from music
Abstract
An equalizer controller, an equalizer control method, and an audio reproducing system are disclosed. In one embodiment, the equalizer controller includes: an audio classifier for identifying the audio type of an audio signal in real time; and an adjustment unit for adjusting an equalizer in a continuous manner based on the confidence value of the identified audio type. The audio classifier is configured to classify the audio signal into multiple audio types with respective confidence values, and the adjustment unit is configured to take into account at least some of the multiple audio types by weighting their confidence values based on the importance of the multiple audio types.
Description
Technical field
The present application relates generally to audio signal processing. Specifically, embodiments of the application relate to apparatus and methods for audio classification and audio processing, and in particular to the control of a dialog enhancer, a surround sound virtualizer, a volume leveler, and an equalizer.
Background technology
To improve the overall quality of audio, and correspondingly the user experience, various audio improving devices are used to modify audio signals in the time domain or in the spectral domain. Audio improving devices have been developed for various purposes; some typical examples include:
Dialog enhancer: In movies and radio or TV programs, dialog is the most important component for understanding the story. Dialog enhancement methods have been developed to improve the clarity and intelligibility of dialog, especially for the elderly with declining hearing.
Surround sound virtualizer: A surround sound virtualizer makes it possible to render a surround (multi-channel) sound signal over the built-in loudspeakers of a personal computer (PC) or over headphones. That is, with stereo devices such as loudspeakers and headphones, the surround sound virtualizer creates a virtual surround effect for the user and provides a cinematic experience.
Volume leveler: A volume leveler aims to tune the volume of the audio content being played back so that it remains almost consistent over the timeline, based on a target loudness value.
Equalizer: An equalizer provides consistency of spectral balance, known as "tone" or "timbre", and enables users to configure the overall pattern (curve or shape) of the frequency response (gain) on each individual frequency band in order to emphasize some sounds or remove undesired sounds. In a traditional equalizer, different equalizer presets may be provided for different sounds, such as different music genres. Once a preset is selected, or an equalization pattern is set, the same equalization gains are applied to the signal until the pattern is modified manually. In contrast, a dynamic equalizer achieves spectral balance consistency by continuously monitoring the spectral balance of the audio, comparing it with a desired tone, and dynamically adjusting the equalization filters so as to transform the original tone of the audio into the desired tone.
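As a rough illustration of the dynamic equalizer described above, the per-band gains can be derived from the gap between the observed spectral balance and a desired tone, with temporal smoothing so the filter adapts continuously rather than jumping. This is a minimal sketch under assumed band counts, smoothing factor, and tone curve, not the implementation claimed here.

```python
def dynamic_eq_gains(band_energies_db, desired_tone_db, prev_gains_db, alpha=0.9):
    """Pull the observed spectral balance toward the desired tone.
    One-pole smoothing (alpha) keeps the gain trajectory continuous."""
    return [alpha * g + (1.0 - alpha) * (d - e)
            for g, d, e in zip(prev_gains_db, desired_tone_db, band_energies_db)]

# Toy example: five bands, a bass-heavy signal, and a flat desired tone.
current = [6.0, 3.0, 0.0, -2.0, -4.0]
desired = [0.0] * 5
gains = dynamic_eq_gains(current, desired, prev_gains_db=[0.0] * 5)
```

With alpha = 0.9 and zero previous gains, each band moves one tenth of the way toward the full correction, so repeated calls converge gradually toward the desired tone instead of switching abruptly.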
In general, an audio improving device has its own application scenario or context. That is, an audio improving device may be suitable only for a specific set of content and not for all possible audio signals, since different content may need to be processed in different ways. For example, a dialog enhancement method is usually applied to movie content; if it were applied to music containing no dialog, it might falsely boost some frequency sub-bands and introduce heavy timbre change and perceptual inconsistency. Similarly, if a noise suppression method were applied to a music signal, strong distortion would be audible.
However, for an audio processing system that usually comprises a set of audio improving devices, the input may inevitably be an audio signal of any possible type. For example, an audio processing system integrated into a PC will receive audio content from a variety of sources, including movies, music, VoIP, and games. Therefore, in order to apply the better algorithms, or the better parameters of each algorithm, to the corresponding content, it is important to identify or differentiate the content being processed.
To differentiate audio content and correspondingly apply the better parameters or the better audio improving algorithms, traditional systems usually pre-design a set of presets and require users to select a preset for the content to be played. A preset usually encodes a set of audio improving algorithms and/or their best parameters to be applied, such as a "movie" preset and a "music" preset specifically designed for movie or music playback.
However, manual selection is inconvenient for users. Users do not usually switch frequently among the predefined presets, but instead keep using one preset for all content. Moreover, in some automatic solutions, the parameters or the set of algorithms in a preset are usually discrete (for example, a specific algorithm is switched on or off for certain content) and cannot be tuned in a continuous manner based on the content.
Summary of the invention
A first aspect of the present application is to automatically configure audio improving devices in a continuous manner based on the audio content being played back. With such an "automatic" mode, users are freed from the trouble of selecting among different presets and can simply enjoy their content. On the other hand, continuous tuning is important for avoiding audible distortion at transition points.
According to an embodiment of the first aspect, an audio processing apparatus includes: an audio classifier for classifying an audio signal into at least one audio type in real time; an audio improving device for improving the experience of the audience; and an adjustment unit for adjusting at least one parameter of the audio improving device in a continuous manner based on the confidence value of the at least one audio type.
The audio improving device may be any of a dialog enhancer, a surround sound virtualizer, a volume leveler, and an equalizer.
Correspondingly, an audio processing method includes: classifying an audio signal into at least one audio type in real time; and adjusting at least one parameter for audio improvement in a continuous manner based on the confidence value of the at least one audio type.
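The two steps of the method above — classify in real time, then adjust a parameter continuously from the confidence values — can be sketched as a confidence-weighted interpolation between per-type target parameter values. The type names and target values below are illustrative assumptions only, not values prescribed by the application.

```python
def adjust_parameter(confidences, targets, default=0.5):
    """Interpolate an improving-device parameter from audio-type confidences.

    confidences: {audio_type: confidence in [0, 1]}
    targets:     {audio_type: preferred parameter value for that type}
    The result varies continuously with the confidences, so the device
    never jumps between discrete presets.
    """
    total = sum(confidences.values())
    if total == 0.0:
        return default
    return sum(confidences[t] * targets[t] for t in confidences) / total

# Example: a signal that looks mostly like movie, slightly like music.
conf = {"movie": 0.8, "music": 0.2}
tgt = {"movie": 1.0, "music": 0.0}     # e.g. a dialog-enhancement level
level = adjust_parameter(conf, tgt)
```

As the classifier's confidence drifts from "movie" toward "music", the parameter slides smoothly between the two targets rather than snapping at a decision boundary.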
According to another embodiment of the first aspect, a volume leveler controller includes: an audio content classifier for identifying the content type of an audio signal in real time; and an adjustment unit for adjusting a volume leveler in a continuous manner based on the identified content type. The adjustment unit may be configured to positively correlate the dynamic gain of the volume leveler with informative content types of the audio signal, and to negatively correlate the dynamic gain of the volume leveler with interfering content types of the audio signal.
An audio processing apparatus comprising the above volume leveler controller is also disclosed.
Correspondingly, a volume leveler control method includes: identifying the content type of an audio signal in real time; and adjusting a volume leveler in a continuous manner based on the identified content type, by positively correlating the dynamic gain of the volume leveler with informative content types of the audio signal and negatively correlating the dynamic gain of the volume leveler with interfering content types of the audio signal.
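A minimal sketch of the correlation rule above: informative content types push the volume leveler's dynamic gain up, while interfering types pull it down. The specific type names and the particular combination formula are assumptions for illustration, not the claimed implementation.

```python
def leveler_dynamic_gain(confidences,
                         informative=("speech", "music"),
                         interfering=("noise", "background")):
    """Dynamic-gain scale in [0, 1]: positively correlated with informative
    content types, negatively correlated with interfering content types."""
    pos = sum(confidences.get(t, 0.0) for t in informative)
    neg = sum(confidences.get(t, 0.0) for t in interfering)
    return max(0.0, min(1.0, pos * (1.0 - neg)))

g_speech = leveler_dynamic_gain({"speech": 1.0})               # full leveling
g_noisy = leveler_dynamic_gain({"speech": 1.0, "noise": 1.0})  # leveling held back
```

With this shape, confident noise suppresses the leveler even when speech is also present, so background noise is not boosted to the target loudness.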
According to a further embodiment of the first aspect, an equalizer controller includes: an audio classifier for identifying the audio type of an audio signal in real time; and an adjustment unit for adjusting an equalizer in a continuous manner based on the identified audio type.
An audio processing apparatus comprising the above equalizer controller is also disclosed.
Correspondingly, an equalizer control method includes: identifying the audio type of an audio signal in real time; and adjusting an equalizer in a continuous manner based on the identified audio type.
The present application also provides a computer-readable medium having computer program instructions recorded thereon which, when executed by a processor, enable the processor to carry out the above audio processing method, volume leveler control method, or equalizer control method.
According to the embodiments of the first aspect, an audio improving device, which may be one of a dialog enhancer, a surround sound virtualizer, a volume leveler, and an equalizer, can be continuously adjusted according to the type of the audio signal and/or the confidence value of the type.
A second aspect of the present application is to develop content identification components for identifying multiple audio types, and to use the detection results to steer or guide the behavior of the various audio improving devices by finding better parameters in a continuous manner.
According to an embodiment of the second aspect, an audio classifier includes: a short-term feature extractor for extracting short-term features from short-term audio segments, each comprising a sequence of audio frames; a short-term classifier for classifying a sequence of short-term audio segments in a long-term audio segment into short-term audio types using the respective short-term features; a statistics extractor for calculating statistics of the results of the short-term classifier with respect to the sequence of short-term segments in the long-term audio segment, as long-term features; and a long-term classifier for classifying the long-term audio segment into long-term audio types using the long-term features.
An audio processing apparatus comprising the above audio classifier is also disclosed.
Correspondingly, an audio classification method includes: extracting short-term features from short-term audio segments, each comprising a sequence of audio frames; classifying a sequence of short-term segments in a long-term audio segment into short-term audio types using the respective short-term features; calculating statistics of the short-term classification results with respect to the sequence of short-term segments in the long-term audio segment, as long-term features; and classifying the long-term audio segment into long-term audio types using the long-term features.
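The pipeline above — short-term classification, then statistics of those results serving as long-term features — can be sketched as follows. Both classifiers here are trivial stand-ins (a threshold on one scalar feature and a majority-style rule), not the trained models the application would use.

```python
def classify_short(feature):
    """Stand-in short-term classifier: confidence that a short segment is
    speech, here simply the thresholded value of one scalar feature."""
    return {"speech": 1.0 if feature > 0.5 else 0.0}

def classify_long_term(short_features):
    """Statistics (here the mean) of the short-term results over the
    long-term segment serve as long-term features for a second classifier."""
    results = [classify_short(f) for f in short_features]
    mean_speech = sum(r["speech"] for r in results) / len(results)
    label = "speech-dominated" if mean_speech > 0.5 else "other"
    return label, mean_speech

# A long-term segment made of four short segments, three of them speech-like.
label, long_feat = classify_long_term([0.9, 0.8, 0.2, 0.7])
```

The point of the two-stage design is that the long-term classifier never sees raw audio; it operates only on statistics of the short-term decisions, which makes the long-term feature cheap to compute and robust to momentary fluctuations.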
According to another embodiment of the second aspect, an audio classifier includes: an audio content classifier for identifying the content type of a short-term segment of an audio signal; and an audio context classifier for identifying the context type of the short-term segment based at least partly on the content type identified by the audio content classifier.
An audio processing apparatus comprising the above audio classifier is also disclosed.
Correspondingly, an audio classification method includes: identifying the content type of a short-term segment of an audio signal; and identifying the context type of the short-term segment based at least partly on the identified content type.
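One way to read this second classifier structure: the content-type confidences of a short segment become (part of) the input features of the context classifier. The hand-set rule below is a hypothetical stand-in for a trained context classifier, and the type names are assumptions.

```python
def classify_context(content_confidences):
    """Identify the context type of a short-term segment based at least
    partly on the content-type confidences produced for that segment."""
    c = content_confidences
    if c.get("speech", 0.0) > 0.6 and c.get("noise", 0.0) > 0.2:
        return "VoIP"        # speech plus channel noise suggests a call
    if c.get("music", 0.0) > 0.6:
        return "long-term music"
    return "movie-like media"

ctx = classify_context({"speech": 0.8, "music": 0.05, "noise": 0.3})
```

A trained model would replace the if-chain, but the interface is the same: content-type confidences in, context type out.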
The present disclosure also provides a computer-readable medium having computer program instructions recorded thereon which, when executed by a processor, enable the processor to carry out the above audio classification method.
According to one embodiment, an equalizer controller is provided, including: an audio classifier for continuously identifying the audio type of an audio signal; and an adjustment unit for adjusting an equalizer in a continuous manner based on the confidence value of the identified audio type, wherein the audio classifier is configured to classify the audio signal into multiple audio types with respective confidence values, and the adjustment unit is configured to take into account at least some of the multiple audio types by weighting the confidence values of the multiple audio types based on the importance of the multiple audio types.
According to another embodiment, an equalizer controller is provided, including: an audio classifier for continuously identifying the audio type of an audio signal; and an adjustment unit for adjusting an equalizer in a continuous manner based on the confidence value of the identified audio type, wherein the audio classifier is configured to classify the audio signal into multiple audio types with respective confidence values, and the adjustment unit is configured to take into account at least some of the multiple audio types by weighting the effects of the multiple audio types based on the confidence values.
According to another embodiment, an audio reproducing system is provided, comprising the equalizer controller according to the above embodiments.
According to one embodiment, an equalizer control method is provided, including: identifying the audio type of an audio signal in real time; and adjusting an equalizer in a continuous manner based on the confidence value of the identified audio type, wherein the audio signal is classified into multiple audio types with respective confidence values, and the adjusting operation is configured to take into account at least some of the multiple audio types by weighting the confidence values of the multiple audio types based on the importance of the multiple audio types.
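The importance weighting in this method can be sketched as follows: each type's confidence is scaled by an importance weight, and the normalized weights mix per-type equalizer presets into one continuous curve. The importance values and presets below are illustrative assumptions.

```python
def weighted_eq_curve(confidences, importance, presets):
    """Mix per-type equalizer presets using importance-weighted confidences.

    confidences: {type: confidence value}
    importance:  {type: importance weight}
    presets:     {type: list of per-band gains in dB}
    """
    weights = {t: confidences[t] * importance[t] for t in confidences}
    total = sum(weights.values())
    n_bands = len(next(iter(presets.values())))
    if total == 0.0:
        return [0.0] * n_bands          # no confident type: flat curve
    return [sum(weights[t] * presets[t][b] for t in weights) / total
            for b in range(n_bands)]

conf = {"music": 0.5, "speech": 0.5}
imp = {"music": 1.0, "speech": 3.0}     # speech outweighs music when tied
eqs = {"music": [4.0, 0.0, 2.0], "speech": [0.0, 4.0, 0.0]}
curve = weighted_eq_curve(conf, imp, eqs)
```

With equal confidences, the higher-importance "speech" preset dominates the mix, yet both types still contribute, so the resulting curve moves continuously as the confidences change.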
According to another embodiment, an equalizer control method is provided, including: identifying the audio type of an audio signal in real time; and adjusting an equalizer in a continuous manner based on the confidence value of the identified audio type, wherein the audio signal is classified into multiple audio types with respective confidence values, and the adjusting operation is configured to take into account at least some of the multiple audio types by weighting the effects of the multiple audio types based on the confidence values.
According to the embodiments of the second aspect, an audio signal may be classified into different long-term types or context types, where the long-term types or context types are different from the short-term types or content types. The type of the audio signal and/or the confidence value of the type may also be used to adjust an audio improving device, such as a dialog enhancer, a surround sound virtualizer, a volume leveler, or an equalizer.
Brief description of the drawings
The present application is illustrated in the accompanying drawings by way of example and not limitation, in which like reference numerals refer to similar elements, and in which:
Fig. 1 illustrates an audio processing apparatus according to an embodiment of the application;
Figs. 2 and 3 illustrate variations of the embodiment shown in Fig. 1;
Figs. 4 to 6 illustrate possible architectures of a classifier for identifying multiple audio types and calculating confidence values;
Figs. 7 to 9 illustrate further embodiments of the audio processing apparatus of the application;
Fig. 10 illustrates the transition delay between different audio types;
Figs. 11 to 14 are flow charts of audio processing methods according to embodiments of the application;
Fig. 15 illustrates a dialog enhancer controller according to an embodiment of the application;
Figs. 16 and 17 are flow charts of using the audio processing method of the application in the control of a dialog enhancer;
Fig. 18 illustrates a surround sound virtualizer controller according to an embodiment of the application;
Fig. 19 is a flow chart of using the audio processing method of the application in the control of a surround sound virtualizer;
Fig. 20 illustrates a volume leveler controller according to an embodiment of the application;
Fig. 21 illustrates the effect of the volume leveler controller of the application;
Fig. 22 illustrates an equalizer controller according to an embodiment of the application;
Fig. 23 shows some examples of desired spectral balance presets;
Fig. 24 illustrates an audio classifier according to an embodiment of the application;
Figs. 25 and 26 illustrate some features used in the audio classifier of the application;
Figs. 27 to 29 illustrate further embodiments of the audio classifier of the application;
Figs. 30 to 33 are flow charts of audio classification methods according to embodiments of the application;
Fig. 34 illustrates an audio classifier according to another embodiment of the application;
Fig. 35 illustrates an audio classifier according to a further embodiment of the application;
Fig. 36 illustrates a heuristic rule used in the audio classifier of the application;
Figs. 37 and 38 illustrate further embodiments of the audio classifier of the application;
Figs. 39 and 40 are flow charts of audio classification methods according to embodiments of the application; and
Fig. 41 is a block diagram of an example system for implementing embodiments of the application.
Detailed description of embodiments
Embodiments of the present application are described below with reference to the accompanying drawings. It should be noted that, for the sake of clarity, representations and descriptions of components and processes that are known to those skilled in the art and are not necessary for understanding the application are omitted from the drawings and the description.
Those skilled in the art will understand that aspects of the present application may be embodied as a system, a device (e.g., a cellular phone, a portable media player, a personal computer, a server, a TV set-top box, a digital video recorder, or any other media player), a method, or a computer program product. Accordingly, aspects of the application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects, which may all generally be referred to herein as a "circuit", "module", or "system". Furthermore, aspects of the application may take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied thereon.
Any combination of one or more computer-readable media may be utilized. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof.
A computer-readable signal medium may be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied in a computer-readable medium may be transmitted using any appropriate medium, including, but not limited to, a wireless link, a wire line, an optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present application are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions that implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed thereon so as to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Embodiments of the present application are described in detail below. For the sake of clarity, the description is organized according to the following structure:
Part 1: Audio processing apparatus and methods
Section 1.1 Audio types
Section 1.2 Confidence values of audio types and architectures of classifiers
Section 1.3 Smoothing the confidence values of audio types
Section 1.4 Parameter adjustment
Section 1.5 Parameter smoothing
Section 1.6 Transition of audio types
Section 1.7 Combinations of embodiments and application scenarios
Section 1.8 Audio processing methods
Part 2: Dialog enhancer controller and control methods
Section 2.1 Level of dialog enhancement
Section 2.2 Thresholds for determining the frequency bands to be enhanced
Section 2.3 Adjustment of the background sound level
Section 2.4 Combinations of embodiments and application scenarios
Section 2.5 Dialog enhancer control methods
Part 3: Surround sound virtualizer controller and control methods
Section 3.1 Amount of surround boost
Section 3.2 Start frequency
Section 3.3 Combinations of embodiments and application scenarios
Section 3.4 Surround sound virtualizer control methods
Part 4: Volume leveler controller and control methods
Section 4.1 Informative content types and interfering content types
Section 4.2 Content types in different contexts
Section 4.3 Context types
Section 4.4 Combinations of embodiments and application scenarios
Section 4.5 Volume leveler control methods
Part 5: Equalizer controller and control methods
Section 5.1 Control based on content types
Section 5.2 Dominant sources in music
Section 5.3 Equalizer presets
Section 5.4 Control based on context types
Section 5.5 Combinations of embodiments and application scenarios
Section 5.6 Equalizer control methods
Part 6: Audio classifiers and classification methods
Section 6.1 Context classifier based on content type classification
Section 6.2 Extraction of long-term features
Section 6.3 Extraction of short-term features
Section 6.4 Combinations of embodiments and application scenarios
Section 6.5 Audio classification methods
Part 7: VoIP classifiers and classification methods
Section 7.1 Context quantization based on short-term segments
Section 7.2 Classification using VoIP speech and VoIP noise
Section 7.3 Smoothing the fluctuations
Section 7.4 Combinations of embodiments and application scenarios
Section 7.5 VoIP classification methods
Part 1: Audio processing apparatus and method
Fig. 1 shows the overall framework of a content-adaptive audio processing apparatus 100, which supports automatically configuring at least one audio improving device 400 with improved parameters based on the audio content being played back. The overall framework comprises three main parts: an audio classifier 200, an adjusting unit 300, and an audio improving device 400.
The audio classifier 200 classifies the audio signal into at least one audio type in real time, automatically identifying the audio type of the content being played back. Any audio classification technique, for example one realized through signal processing, machine learning, or pattern recognition, may be applied to identify the audio content. Usually a confidence value is estimated at the same time; the confidence value represents the probability that the audio content belongs to a set of predefined target audio types.
The audio improving device 400 improves the listener's experience by processing the audio signal, and will be described in detail later.
The adjusting unit 300 adjusts at least one parameter of the audio improving device in a continuous manner based on the confidence value of the at least one audio type. It is designed to steer the behavior of the audio improving device 400, estimating the optimal parameters of the respective audio improving device based on the results obtained from the audio classifier 200.
Various audio improving devices may be applied in this apparatus. Fig. 2 shows an exemplary system including four audio improving devices: a dialog enhancer (DE) 402, a surround virtualizer (SV) 404, a volume leveler (VL) 406, and an equalizer (EQ) 408. Each audio improving device can be adjusted automatically in a continuous manner based on the results (audio types and/or confidence values) obtained from the audio classifier 200.
Of course, the audio processing apparatus need not include all kinds of audio improving devices, and may include only one or more of them. On the other hand, the audio improving devices are not limited to those given in this disclosure; further kinds of audio improving devices may be included and are also within the scope of the application. Moreover, the names of the audio improving devices discussed in this disclosure, including the dialog enhancer (DE) 402, the surround virtualizer (SV) 404, the volume leveler (VL) 406, and the equalizer (EQ) 408, shall not be understood as limiting; each of them shall be understood as covering any other device realizing the same or similar function.
1.1 Audio types
To properly control the various kinds of audio improving devices, the present application also provides a novel architecture of audio types, although the audio types of the prior art may be applied here as well.
Specifically, audio types of different semantic levels are modeled, including low-level audio elements representing the basic components of an audio signal, and high-level audio types representing the most common audio contents in users' real-life entertainment applications. The former may also be named "content types", and the basic audio content types may include speech, music (including song), background sound (or sound effects), and noise.
The meanings of speech and music are self-evident. Noise in this application means physical noise, not semantic noise. Physical noise may include noise from, for example, an air conditioner, and noise from technical reasons such as pink noise caused by the signal transmission path. In contrast, "background sounds" in this application are those auditory events occurring around the core target of the listener's attention. For example, in the audio signal of a telephone call, besides the caller's voice there may be some other unintended sounds, such as the voices of some other people unrelated to the call, keyboard sounds, footsteps, and so on. These unwanted sounds are referred to as "background sounds", rather than noise. In other words, a "background sound" may be defined as a sound that is not the target (the core of the listener's attention), or is even undesired, but still carries some semantic meaning; while "noise" may be defined as those unwanted sounds other than the target sound and the background sound.
Sometimes a background sound is not really "unwanted" but is intentionally produced and carries some useful information, such as the background sounds in a movie, a TV program, or a radio broadcast program. So background sounds may also sometimes be referred to as "sound effects". In the remainder of this disclosure, "background sound" is used just for conciseness and may be further abbreviated as "background".
Further, music may be classified into music without a dominant source and music with a dominant source. If one source (a voice or an instrument) in a music clip is much stronger than the other sources, the music is referred to as "music with a dominant source"; otherwise it is referred to as "music without a dominant source". For example, in polyphonic music with singing voices and various instruments, if it is harmonically balanced, or the energies of the several most important sources are comparable to one another, it is considered music without a dominant source; in contrast, if one source (for example, a voice) is much louder while the other sources are much quieter, it is considered to contain a dominant source. As another example, a single or salient instrumental tone is "music with a dominant source".
Music may also be classified into different types based on different criteria. It may be classified based on genre, such as rock, jazz, rap, and folk, but not limited thereto. It may also be classified based on instruments, such as vocal music and instrumental music. Instrumental music may include various music played with different instruments, such as piano music and guitar music. Other exemplary criteria include the rhythm, tempo, timbre, and/or any other musical attributes of the music, so that music can be classified based on the similarity of these attributes. For example, according to timbre, vocal music may be classified into tenor, baritone, bass, soprano, mezzo-soprano, and contralto.
The content type of an audio signal may be classified with respect to short-term audio segments each comprising, for example, multiple frames. Generally, the length of an audio frame is several milliseconds, such as 20 ms, and a short-term audio segment to be classified by the audio classifier may have a length from several hundred milliseconds to several seconds, such as 1 second.
To control the audio improving devices in a content-adaptive manner, the audio signal may be classified in real time. For the content types set forth above, the content type of the current short-term audio segment represents the content type of the current audio signal. Since the length of a short-term audio segment is not very long, the audio signal may be divided into successive non-overlapping short-term audio segments. However, the short-term audio segments may also be sampled continuously/semi-continuously along the timeline of the audio signal. That is, the short-term audio segments may be sampled with a window of a predetermined length (the desired short-term audio segment length) moving along the timeline of the audio signal with a step size of one or more frames.
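The windowed sampling described above can be sketched in a few lines; this is only an illustration (the function name and frame counts are not from the application), with the step size controlling whether segments overlap:

```python
def short_term_segments(num_frames, segment_len, step):
    """Return (start, end) frame indices of short-term segments sampled
    by sliding a fixed-length window along the signal's frame axis."""
    segments = []
    start = 0
    while start + segment_len <= num_frames:
        segments.append((start, start + segment_len))
        start += step
    return segments

# With 20 ms frames, a 1-second segment spans 50 frames. A step equal to
# segment_len gives the non-overlapping division; a smaller step gives
# the continuous/semi-continuous sampling.
```
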
A high-level audio type may also be named a "context type", since it indicates the long-term type of the audio signal and may be regarded as the environment or context of the instantaneous sound events of the content types discussed above. According to the application, the context types may include the most common audio applications, such as movie-like media, music (including song), game, and VoIP (voice over internet protocol).
The meanings of music, game, and VoIP are self-evident. Movie-like media may include movies, TV programs, radio broadcast programs, or any other audio media similar thereto. The main characteristic of movie-like media is a mixture of possible speech, music, and various kinds of background sounds (sound effects).
Note that both the content types and the context types include music (including song). Hereafter in this application, the words "short-term music" and "long-term music" are used to distinguish the two.
For some embodiments of the application, some other architectures of context types are also proposed.
For example, an audio signal may be classified as high-quality audio (such as movie-like media and music CDs) or low-quality audio (such as VoIP, low-bit-rate online streaming audio, and user-generated content), which may be collectively referred to as "audio quality types".
As another example, an audio signal may be classified as VoIP or non-VoIP, which may be regarded as a variation of the above architecture of 4 context types (VoIP, movie-like media, (long-term) music, and game). In connection with the VoIP or non-VoIP context, the audio signal may be classified into VoIP-related audio content types, for example VoIP speech, non-VoIP speech, VoIP noise, and non-VoIP noise. The architecture of VoIP audio content types is especially useful for distinguishing the VoIP and non-VoIP contexts, since the VoIP context is usually one of the most challenging application scenarios for a volume leveler (an audio improving device).
Generally, the context type of an audio signal may be classified with respect to long-term audio segments longer than the short-term audio segments. A long-term audio segment comprises many more frames than a short-term audio segment, and may also comprise multiple short-term audio segments. Generally, a long-term audio segment may have a length on the order of seconds, such as several seconds to tens of seconds, for example 10 seconds.
Similarly, to control the audio improving devices in an adaptive manner, the audio signal may be classified into context types in real time, the context type of the current long-term audio segment representing the context type of the current audio signal. Since the length of a long-term audio segment is relatively long, the long-term audio segments may be sampled continuously/semi-continuously along the timeline of the audio signal, to avoid abrupt changes of the context type and hence abrupt changes of the working parameters of the audio improving devices. That is, the long-term audio segments may be sampled with a window of a predetermined length (the desired long-term audio segment length) moving along the timeline of the audio signal with a step size of one or more frames, or with a step size of one or more short-term segments.
Both the content types and the context types have been described above. In embodiments of the application, the adjusting unit 300 may adjust at least one parameter of the audio improving device based on at least one content type among the various content types and/or at least one context type among the various context types. Therefore, as shown in Fig. 3, in a variation of the embodiment shown in Fig. 1, the audio classifier 200 may comprise either an audio content classifier 202 or an audio context classifier 204, or both.
Above, different audio types based on different criteria (such as the context types) have been mentioned, as well as different audio types based on different hierarchical levels (such as the content types). However, the criteria and hierarchical levels are provided just for convenience of description and are by no means limiting. In other words, in the present application, any two or more of the audio types mentioned above may be identified by the audio classifier 200 simultaneously and considered by the adjusting unit 300 simultaneously, as described hereinafter. That is, all the audio types at different hierarchical levels may be parallel to one another, or at the same level.
1.2 Confidence values of audio types and the architecture of the classifiers
The audio classifier 200 may output hard-decision results, or the adjusting unit 300 may regard the results of the audio classifier 200 as hard decisions. Even with hard decisions, multiple audio types may be assigned to an audio segment. For example, an audio segment may be labeled both "speech" and "short-term music", since it may be a mixed signal of speech and short-term music. The obtained labels may be used directly to steer the audio improving device 400. A simple example is enabling the dialog enhancer 402 when speech is present and turning it off when speech is absent. However, without a careful smoothing scheme (to be discussed later), such a hard-decision approach may introduce unnatural sounds at the transition points from one audio type to another.
To obtain more flexibility and to adjust the parameters of the audio improving devices in a continuous manner, the confidence value of each target audio type may be estimated (soft decision). A confidence value represents the matching level between the audio content to be identified and a target audio type, with a value from 0 to 1.
As stated before, many classification techniques can output confidence values directly. Confidence values may also be calculated with various methods, which can be regarded as part of the classifier. For example, if the audio models are trained with probabilistic modeling techniques such as Gaussian Mixture Models (GMM), the posterior probability may be used to indicate the confidence value, such as:
conf_i = p(c_i | x) = p(x | c_i)·p(c_i) / Σ_{j=1..N} p(x | c_j)·p(c_j)    (1)
where x is an audio segment, c_i is a target audio type, N is the number of target audio types, p(x | c_i) is the likelihood that the audio segment x belongs to the audio type c_i, and p(c_i | x) is the corresponding posterior probability.
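A minimal numeric sketch of this posterior computation follows, assuming the per-type likelihoods p(x|c_i) have already been evaluated by the trained models; the uniform priors are an assumption for illustration only:

```python
def posterior_confidences(likelihoods, priors=None):
    """Turn per-type likelihoods p(x|c_i) into posterior confidences
    p(c_i|x) via Bayes' rule, normalizing over the N target types."""
    n = len(likelihoods)
    if priors is None:
        priors = [1.0 / n] * n  # assumed uniform priors p(c_i)
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]
```

With uniform priors the posteriors are simply the likelihoods normalized to sum to 1, which matches the 0-to-1 confidence range required above.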
On the other hand, if the audio models are trained with discriminative methods such as Support Vector Machines (SVM) or adaBoost, only a score (a real value) is obtained from comparison against the model. In these cases, a sigmoid function is usually used to map the obtained score (theoretically from -∞ to ∞) to the desired confidence value (from 0 to 1):
conf = 1 / (1 + exp(A·y + B))    (2)
where y is the output score from SVM or adaBoost, and A and B are two parameters that need to be estimated from a training data set by using some well-known techniques.
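This sigmoid mapping can be sketched as follows. In practice A and B would be fit on a training set (for example by Platt scaling); the default values below are placeholders for illustration:

```python
import math

def score_to_confidence(y, A=-1.0, B=0.0):
    """Map a raw SVM/adaBoost score y in (-inf, inf) to a confidence
    in (0, 1) with a sigmoid. With A < 0, larger scores map to higher
    confidence values."""
    return 1.0 / (1.0 + math.exp(A * y + B))
```
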
For some embodiments of the application, the adjusting unit 300 may use more than two content types and/or more than two context types. Then, the audio content classifier 202 needs to identify more than two content types, and/or the audio context classifier 204 needs to identify more than two context types. In such a case, the audio content classifier 202 or the audio context classifier 204 may be a group of classifiers organized in a certain architecture.
For example, if the adjusting unit 300 needs all four context types (movie-like media, long-term music, game, and VoIP), the audio context classifier 204 may have the following different architectures:
First, the audio context classifier 204 may comprise: 6 one-to-one binary classifiers organized as shown in Fig. 4 (each classifier discriminating one target audio type from another target audio type); 3 one-to-others binary classifiers organized as shown in Fig. 5 (each classifier discriminating one target audio type from the other target audio types); or 4 one-to-others classifiers organized as shown in Fig. 6. There are also other architectures, such as the Decision Directed Acyclic Graph (DDAG) architecture. Note that in Figs. 4 to 6 and the corresponding description below, "movie" is used instead of "movie-like media" for brevity.
Each binary classifier gives a confidence score H(x) as its output (x representing an audio segment). After the output of each binary classifier is obtained, it needs to be mapped to the final confidence value of the identified context type.
In general, suppose the audio signal is to be classified into M context types (M being a positive integer). The traditional one-to-one architecture constructs M(M-1)/2 classifiers, each trained with data from two classes; each one-to-one classifier then casts one vote for its preferred class, and the final result is the class with the most votes among the M(M-1)/2 classifiers. Compared with the traditional one-to-one architecture, the hierarchical architecture in Fig. 4 also needs to construct M(M-1)/2 classifiers, but the test iterations can be shortened to M-1, because at each hierarchical level the segment x is determined to be or not to be in the respective class, and the total number of levels is M-1. The final confidence values for the various context types can be calculated from the binary classification confidences H_k(x) (k = 1, 2, ... 6, indexing the different binary classifiers), for example:
C_MOVIE = (1-H_1(x))·(1-H_3(x))·(1-H_6(x))
C_VOIP = H_1(x)·H_2(x)·H_4(x)
C_MUSIC = H_1(x)·(1-H_2(x))·(1-H_5(x)) + H_3(x)·(1-H_1(x))·(1-H_5(x)) + H_6(x)·(1-H_1(x))·(1-H_3(x))
C_GAME = H_1(x)·H_2(x)·(1-H_4(x)) + H_1(x)·H_5(x)·(1-H_2(x)) + H_3(x)·H_5(x)·(1-H_1(x))
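The hierarchical mapping above translates directly into code. The following sketch simply evaluates the four product formulas, with H indexed 1 to 6 as in the formulas (H[0] unused):

```python
def hierarchical_confidences(H):
    """Combine the six pairwise binary scores H[1]..H[6] into final
    per-context confidences, following the mapping formulas for the
    hierarchical one-to-one architecture (Fig. 4)."""
    movie = (1 - H[1]) * (1 - H[3]) * (1 - H[6])
    voip = H[1] * H[2] * H[4]
    music = (H[1] * (1 - H[2]) * (1 - H[5])
             + H[3] * (1 - H[1]) * (1 - H[5])
             + H[6] * (1 - H[1]) * (1 - H[3]))
    game = (H[1] * H[2] * (1 - H[4])
            + H[1] * H[5] * (1 - H[2])
            + H[3] * H[5] * (1 - H[1]))
    return {"movie": movie, "voip": voip, "music": music, "game": game}
```
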
In the architecture shown in Fig. 5, the mapping functions from the binary classification results H_k(x) to the final confidence values may be defined as in the following example:
C_MOVIE = H_1(x)
C_MUSIC = H_2(x)·(1-H_1(x))
C_VOIP = H_3(x)·(1-H_2(x))·(1-H_1(x))
C_GAME = (1-H_3(x))·(1-H_2(x))·(1-H_1(x))
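A corresponding sketch for this chain of one-to-others classifiers; note that by construction the four confidences always sum to 1, since each factor splits the remaining probability mass:

```python
def chain_confidences(H1, H2, H3):
    """Map the three one-to-others scores of the Fig. 5 architecture
    to final per-context confidence values."""
    return {
        "movie": H1,
        "music": H2 * (1 - H1),
        "voip": H3 * (1 - H2) * (1 - H1),
        "game": (1 - H3) * (1 - H2) * (1 - H1),
    }
```
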
In the architecture shown in Fig. 6, the final confidence values may be equal to the corresponding binary classification results H_k(x); or, if the confidence values of all classes are required to sum to 1, the final confidence values may be simply normalized based on the estimated H_k(x):
C_MOVIE = H_1(x)/(H_1(x)+H_2(x)+H_3(x)+H_4(x))
C_MUSIC = H_2(x)/(H_1(x)+H_2(x)+H_3(x)+H_4(x))
C_VOIP = H_3(x)/(H_1(x)+H_2(x)+H_3(x)+H_4(x))
C_GAME = H_4(x)/(H_1(x)+H_2(x)+H_3(x)+H_4(x))
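The normalization step is a one-liner; a sketch over an arbitrary set of per-class scores:

```python
def normalized_confidences(scores):
    """Normalize per-context one-to-others scores H_k(x) so that the
    final confidence values sum to 1 (Fig. 6 architecture)."""
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}
```
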
The one or more classes with the largest confidence values may be determined as the finally identified classes.
It should be noted that in the architectures shown in Figs. 4 to 6, the order of the different binary classifiers is not necessarily as shown in the figures; it may be any other order, which can be selected through manual assignment or automatic learning according to the different requirements of various applications.
The description above is for the audio context classifier 204; for the audio content classifier 202 the situation is similar.
Alternatively, either the audio content classifier 202 or the audio context classifier 204 may be implemented as one single classifier identifying all the content types/context types at the same time and giving the respective confidence values simultaneously. There are many existing techniques for doing this.
With confidence values, the output of the audio classifier 200 may be represented as a vector, each dimension representing the confidence value of one target audio type. For example, if the target audio types are, in sequence, speech, short-term music, noise, and background, an example output result may be (0.9, 0.5, 0.0, 0.0), indicating that the classifier is 90% sure the audio content is speech and 50% sure the audio is music. Note that the sum of all dimensions of the output vector is not necessarily 1 (for example, the result from Fig. 6 is not necessarily normalized), meaning that the audio signal may be a mixed signal of speech and short-term music.
In the subsequent Part 6 and Part 7, novel implementations of audio context classification and audio content classification will be discussed in detail.
1.3 Smoothing the confidence values of audio types
Optionally, after each audio segment has been classified into predefined audio types, an additional step is to smooth the classification results along the timeline, to avoid abrupt transitions from one type to another and to make the estimation of the parameters in the audio improving device smoother. For example, if a long excerpt is classified as movie-like media except for a single segment classified as VoIP, the abrupt VoIP decision can be revised to movie-like media by the smoothing.
Therefore, in a variation of the embodiment as shown in Fig. 7, a type smoothing unit 712 is further provided for smoothing, for each audio type, the confidence value of the current audio signal.
A conventional smoothing method is based on a weighted average, for example calculating a weighted sum of the current actual confidence value and the smoothed confidence value of the last time, as follows:
smoothConf(t) = β·smoothConf(t-1) + (1-β)·conf(t)    (3)
where t represents the current time (the current audio segment), t-1 represents the last time (the last audio segment), β is the weight, and conf and smoothConf are the confidence values before and after smoothing, respectively.
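Formula (3) is a standard exponential smoothing step; a direct sketch (the default β is an illustrative choice, not prescribed by the application):

```python
def smooth_confidence(conf_t, smooth_prev, beta=0.9):
    """Formula (3): weighted sum of the current raw confidence conf(t)
    and the previous smoothed value smoothConf(t-1)."""
    return beta * smooth_prev + (1 - beta) * conf_t
```

Larger β makes the smoothed confidence change more slowly; β = 0 disables the smoothing entirely.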
From the point of view of confidence values, the hard-decision results from the classifiers can also be represented with confidence values whose value is either 0 or 1: if a certain target audio type is chosen and assigned to a certain audio segment, the corresponding confidence is 1; otherwise it is 0. Therefore, even if the audio classifier 200 does not provide confidence values but only hard decisions about the audio types, continuous adjustment by the adjusting unit 300 is still possible through the smoothing operation of the type smoothing unit 712.
The smoothing algorithm can be made "asymmetric" by using different smoothing weights for different cases. For example, the weight for calculating the weighted sum may be adaptively changed based on the confidence value of the audio type of the audio signal: the larger the confidence value of the current segment, the larger its weight.
From another point of view, the weight for calculating the weighted sum may be adaptively changed based on the different transition pairs from one audio type to another, especially when the adjustment is based on multiple content types identified by the audio classifier 200, rather than on the presence or absence of a single content type. For example, for a transition from an audio type occurring relatively frequently in a certain context to another audio type occurring less frequently in that context, the confidence value of the latter may be smoothed so that it does not increase too fast, since it may be just an occasional interruption.
Another factor is the change (increase or decrease) trend, including the change rate. Suppose we care more about the delay of the appearance of an audio type (that is, when its confidence value increases); then the smoothing algorithm may be designed as follows:
smoothConf(t) = conf(t), if conf(t) ≥ smoothConf(t-1); smoothConf(t) = β·smoothConf(t-1) + (1-β)·conf(t), otherwise    (4)
The above formula lets the smoothed confidence value respond to the current state quickly when the confidence value increases, and fade away slowly when the confidence value decreases. Variants of the smoothing function can easily be designed in a similar way. For example, formula (4) may be modified so that the weight of conf(t) becomes larger when conf(t) ≥ smoothConf(t-1); in fact, in formula (4) it can be regarded that β = 0 and the weight of conf(t) becomes the largest, i.e. 1.
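In the spirit of formula (4), the asymmetric smoothing can be sketched as a fast-attack, slow-release filter (the default β is again only illustrative):

```python
def smooth_confidence_asym(conf_t, smooth_prev, beta=0.9):
    """Asymmetric smoothing: respond immediately when the confidence
    increases, decay slowly when it decreases."""
    if conf_t >= smooth_prev:
        return conf_t  # increasing branch: beta = 0, fast attack
    return beta * smooth_prev + (1 - beta) * conf_t  # slow release
```
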
From yet another point of view, considering the change trend of a certain audio type is a specific example of considering the different transition pairs of audio types. For example, an increase of the confidence value of type A can be regarded as a transition from non-A to A, and a decrease of the confidence value of type A can be regarded as a transition from A to non-A.
1.4 Parameter adjustment
The adjusting unit 300 is designed to estimate or adjust the proper parameters of the audio improving device 400 based on the results obtained from the audio classifier 200. Different adjusting algorithms can be designed for different audio improving devices, by using either the content types or the context types, or by using both content types and context types for a joint decision. For example, with context type information such as movie-like media and long-term music, the presets mentioned before can be automatically selected and applied to the corresponding content. With the available content type information, the parameters of each audio improving device can be adjusted in a finer manner, as will be introduced in the subsequent parts. The content type information and the context information can also be used jointly in the adjusting unit 300, to balance the long-term and short-term information. A specific adjusting algorithm for a specific audio improving device can be regarded as a separate adjusting unit, or the different adjusting algorithms can collectively be regarded as a unified adjusting unit.
That is, the adjusting unit 300 may be configured to adjust at least one parameter of the audio improving device based on the confidence value of at least one content type and/or the confidence value of at least one context type. For a specific audio improving device, some audio types are informative and some audio types are interfering. Hence, the parameter of the specific audio improving device may be positively correlated with the confidence values of the informative audio types, or negatively correlated with the confidence values of the interfering audio types. Here, "positively correlated" means the parameter increases or decreases, in a linear or non-linear manner, with the increase or decrease of the confidence value of the audio type. "Negatively correlated" means the parameter increases or decreases, in a linear or non-linear manner, with the decrease or increase, respectively, of the confidence value of the audio type.
Here, the decrease and increase of the confidence value are "transferred" directly to the parameter to be adjusted through the positive or negative correlation. Mathematically, such correlation or "transfer" may be embodied as a linear proportion or inverse proportion, an addition or subtraction operation, a multiplication or division operation, or a non-linear function. All these forms of correlation may be referred to as "transfer functions". To determine the increase or decrease of the confidence value, the current confidence value or a mathematical transformation thereof may also be compared with the last confidence value, or with multiple historical confidence values, or mathematical transformations thereof. In the context of the application, the term "compare" means a comparison through a subtraction operation or through a division operation; the increase or decrease can be determined by judging whether the difference is greater than 0 or whether the ratio is greater than 1.
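As one illustration of such a transfer function, a parameter can be driven up by an informative type and down by an interfering type; the linear form, the gain, and the clamping range here are assumptions for the sketch, not prescribed by the application:

```python
def adjust_parameter(base, conf_informative, conf_interfering, gain=1.0):
    """Sketch of a transfer function: the parameter is positively
    correlated with the confidence of an informative audio type and
    negatively correlated with that of an interfering audio type."""
    value = base + gain * conf_informative - gain * conf_interfering
    return min(1.0, max(0.0, value))  # clamp to an assumed valid range
```
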
In a specific implementation, the value of the parameter can be directly associated with the confidence value, or a ratio or difference thereof, through a proper algorithm (such as a transfer function), so that an "external observer" need not know explicitly whether a specific confidence value and/or a specific parameter has increased or decreased. Some specific examples will be given in the subsequent Parts 2 to 5 regarding the specific audio improving devices.
As described in the sections above, for the same audio segment, the classifier 200 may identify multiple audio types with respective confidence values, and the confidence values do not necessarily add up to 1, since the audio segment may comprise multiple components at the same time, such as music and speech and background sounds. In such a situation, the parameters of the audio improving device should be balanced among the different audio types. For example, the adjusting unit 300 may be configured to consider at least some of the multiple audio types by weighting the confidence values of the at least one audio type based on the importance of the at least one audio type. The more important a specific audio type is, the more the parameter is influenced by it.
The weight may also reflect the informative and interfering effects of the audio types. For example, an interfering audio type may be given a negative weight. Some specific examples will be given in the subsequent Parts 2 to 5 regarding the specific audio improving devices.
Please note that in the context of the application, "weight" has a broader meaning than the coefficients in a polynomial. Besides the form of coefficients in a polynomial, it may also take the form of an exponent or a power. When being the coefficients in a polynomial, the weighting coefficients may or may not be normalized. In brief, the weight just represents how much the weighted object influences the parameter to be adjusted.
In some other embodiments, for multiple audio types comprised in the same audio segment, their confidence values may be converted into weights through normalization; then the final parameter may be determined by calculating a sum of the parameter preset values predefined for each audio type, weighted with the confidence-based weights. That is, the adjusting unit 300 may be configured to consider the multiple audio types by weighting the effects of the multiple audio types based on the confidence values.
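A small sketch of this confidence-weighted blending of preset parameter values (the type names and preset values are illustrative only):

```python
def blend_presets(confidences, presets):
    """Normalize the confidence values of the audio types present in a
    segment into weights, then blend the per-type preset parameter
    values with those weights."""
    total = sum(confidences.values())
    if total == 0:
        raise ValueError("no active audio type")
    return sum(confidences[t] / total * presets[t] for t in confidences)
```
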
As a specific example of weighting, the adjusting unit may be configured to consider at least one dominant audio type based on the confidence values. An audio type with a low confidence value (lower than a threshold) may be disregarded, which is equivalent to setting the weights of those other audio types whose confidence values are lower than the threshold to zero. Some specific examples will be given in the subsequent Parts 2 to 5 regarding the specific audio improving devices.
Content type and context type can be considered together.In one embodiment, content type and context class
Type can be treated as can having corresponding weight in same rank and its confidence value.In another embodiment,
As shown by its name, " context type " is the context or environment residing for " content type ", therefore can be configured
Adjustment unit 200 with cause depending on audio signal context type and to different context types audio signal in
Hold type and distribute different weights.In general, any audio types may be constructed the context of another audio types, therefore
Adjustment unit 200 may be configured to change the weight of an audio types according to the confidence value of another audio types.
Some specific examples will be provided into the 5th part in the ensuing part 2 for improving device on specific audio.
In the linguistic context of the application, " parameter " has implication more wider than its literal meaning.Except with single value
Parameter, its can also refer to the set of foregoing preset including different parameter, the vector being made up of different parameters or
Pattern (profile).Specifically, following parameter will be discussed into the 5th part in ensuing part 2, but the application is not
It is limited to this:The rank of dialogue enhancing, the threshold value of the frequency band for determining to talk with enhancing, background sound level, surround sound enhancing amount, use
The dynamic gain of initial frequency, volume leveller in surround sound virtual machine or the scope of dynamic gain, represent audio signal
It is that the parameter of the degree of new discernable audio event, balanced rank, balanced mode and spectrum balance are preset.
1.5 Parameter smoothing
In subsection 1.3, smoothing the confidence values of the audio types was discussed as a way of avoiding abrupt changes therein, and consequently abrupt changes in the parameters of the audio improving device. Other approaches are also possible. One approach is to smooth the parameter adjusted based on the audio types, which will be discussed in this subsection; another approach is to configure the audio classifier and/or the adjustment unit to delay the change of the results of the audio classifier, which will be discussed in subsection 1.6.
In one embodiment, the parameter can be further smoothed to avoid quick changes that might introduce audible distortion at transition points, for example:
L̃(t) = τ·L̃(t−1) + (1 − τ)·L(t)
where L̃(t) is the smoothed parameter, L(t) is the non-smoothed parameter, τ is a coefficient representing a time constant, t is the current time, and t−1 is the last time.
That is, as shown in Fig. 8, the audio processing apparatus may comprise a parameter smoothing unit 814 for smoothing a parameter of an audio improving device adjusted by the adjustment unit 300 (such as at least one of the dialog enhancer 402, the surround virtualizer 404, the volume leveler 406 and the equalizer 408), by computing a weighted sum of the parameter value determined by the adjustment unit 300 at the current time and the smoothed parameter value of the last time.
The time constant τ can be a fixed value based on the specific requirements of the application and/or the implementation of the audio improving device 400. It may also change adaptively based on the audio type, and especially based on the type of the transition from one audio type to another, such as from music to speech or from speech to music.
Take the equalizer as an example (further details may be found in Part 5). Equalization is suitable for music content but not for speech content. Hence, to smooth the equalization level, the time constant can be relatively small when the audio signal transitions from music to speech, so that a smaller equalization level is applied to the speech content quickly. On the other hand, the time constant for the transition from speech to music can be relatively large, in order to avoid audible distortion at the transition point.
To estimate the transition type (e.g., from speech to music or from music to speech), the content classification results can be used directly. That is, since the audio content is classified as either music or speech, the transition type is straightforward to obtain. To estimate the transition in a more continuous manner, one may instead rely on the estimated non-smoothed equalization level, rather than directly comparing the hard decisions of the audio types. The general idea is: if the non-smoothed equalization level increases, this indicates a transition from speech to music (or toward more music-like content); otherwise, it is more likely a transition from music to speech (or toward more speech-like content). By distinguishing the different transition types, the time constant can be set accordingly; one example is:
τ(t) = τ1 if L(t) ≥ L̃(t−1), and τ(t) = τ2 otherwise
where τ(t) is the content-dependent time constant varying over time, and τ1 and τ2 are two preset time constant values, generally satisfying τ1 > τ2. Intuitively, the above function yields a relatively slow transition when the equalization level is increasing and a relatively fast transition when the equalization level is decreasing, although the application is not limited thereto. Moreover, the parameter is not limited to the equalization level and may be another parameter. That is, the parameter smoothing unit 814 can be configured so that the weights used for computing the weighted sum change adaptively based on the increasing or decreasing trend of the parameter determined by the adjustment unit 300.
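Under the two-valued rule above, the trend-dependent time constant can be sketched as follows; the τ1 and τ2 values are illustrative assumptions:

```python
# Sketch: pick the smoothing time constant from the trend of the
# non-smoothed parameter: large tau (slow change) when it increases,
# small tau (fast change) when it decreases.
TAU1, TAU2 = 0.95, 0.5  # illustrative preset values, TAU1 > TAU2

def adaptive_smooth(prev_smoothed, current):
    tau = TAU1 if current >= prev_smoothed else TAU2
    return tau * prev_smoothed + (1.0 - tau) * current

# Rising equalization level (speech -> music): small step toward 1.0.
rising = adaptive_smooth(0.0, 1.0)   # 0.05
# Falling equalization level (music -> speech): large step toward 0.0.
falling = adaptive_smooth(1.0, 0.0)  # 0.5
```

The asymmetry means the equalizer backs off quickly on speech but ramps up cautiously on music, matching the behavior described above.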
1.6 Transition of audio types
With reference to Fig. 9 and Fig. 10, another scheme will now be described for avoiding abrupt changes of the audio type, and thus abrupt changes in the parameters of the audio improving device.
As shown in Fig. 9, the audio processing apparatus 100 may further comprise a timer 916 for measuring the duration for which the audio classifier 200 continuously outputs the same new audio type, wherein the adjustment unit 300 may be configured to keep using the current audio type until the length of the duration of the new audio type reaches a threshold.
In other words, an observation period (or hold period) as shown in Fig. 10 is introduced. With the observation period (whose length corresponds to the threshold on the duration), the change of the audio type is further monitored over a continuous stretch of time, to confirm whether the audio type has really changed; only then can the adjustment unit 300 actually use the new audio type.
As shown in Fig. 10, arrow (1) shows the situation where the current state is type A and the result of the audio classifier 200 does not change.
If the current state is type A and the result of the audio classifier 200 becomes type B, the timer 916 starts timing, or, as shown in Fig. 10, the process enters the observation period (arrow (2)), and an initial value of a hold count cnt is set, representing the length of the observation period (equal to the threshold).
Then, if the audio classifier 200 continuously outputs type B, cnt decreases continuously (arrow (3)) until cnt equals 0 (that is, the length of the duration of the new type B reaches the threshold), whereupon the adjustment unit 300 can use the new audio type B (arrow (4)); in other words, only at this point can the audio type be regarded as having really changed to type B.
Otherwise, if the output of the audio classifier 200 changes back to the original type A before cnt becomes 0 (before the length of the duration reaches the threshold), the observation period ends, and the adjustment unit 300 still uses the original type A (arrow (5)).
The change from type B to type A can be handled similarly to the above process.
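The observation-period logic of Fig. 10 can be sketched as a small state machine; the frame-based counting and the class below are illustrative, not part of the claimed apparatus:

```python
# Sketch: delay audio-type switches until the new type has been
# observed for `threshold` consecutive classifier outputs.
class TypeSwitcher:
    def __init__(self, initial_type, threshold):
        self.current = initial_type   # type actually used for adjustment
        self.threshold = threshold
        self.candidate = None         # new type under observation
        self.cnt = threshold          # hold count of the observation period

    def update(self, observed_type):
        if observed_type == self.current:
            # Classifier fell back to the current type: the observation
            # period (if any) ends, arrow (5) in Fig. 10.
            self.candidate = None
            self.cnt = self.threshold
        elif observed_type == self.candidate:
            self.cnt -= 1             # arrow (3): keep counting down
            if self.cnt <= 0:         # duration reached the threshold
                self.current = observed_type   # arrow (4): real switch
                self.candidate = None
                self.cnt = self.threshold
        else:
            # A new type appears: enter a fresh observation period,
            # arrow (2) in Fig. 10.
            self.candidate = observed_type
            self.cnt = self.threshold - 1
        return self.current
```

For example, with a threshold of three frames, a two-frame burst of type B is ignored, while three consecutive B frames trigger the switch.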
In the above process, the threshold (or hold count) can be set based on application requirements. It may be predefined as a fixed value, or it may be set adaptively. In one variation, the threshold is different for different transition pairs from one audio type to another. For example, when changing from type A to type B, the threshold may be a first value, and when changing from type B to type A, the threshold may be a second value.
In another variation, the hold count (threshold) can be negatively correlated with the confidence value of the new audio type. The general idea is: if the confidence indicates that the two types are easily confused (e.g., when the confidence value is only around 0.5), the observation period needs to be long; otherwise, the observation period can be relatively short. Following this guideline, an exemplary hold count can be set by the following formula:
HangCnt = C·|0.5 − Conf| + D
where HangCnt is the hold count or threshold, and C and D are two parameters that can be set based on application requirements; usually C is negative and D is positive.
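A sketch of this formula with illustrative values for C and D (chosen only to satisfy C < 0 < D):

```python
# Sketch: hold count negatively correlated with the confidence of the
# new audio type. C and D are application-tuned; here C = -20, D = 20.
C, D = -20.0, 20.0

def hang_cnt(conf):
    return C * abs(0.5 - conf) + D

ambiguous = hang_cnt(0.5)   # 20.0: long observation period
confident = hang_cnt(1.0)   # 10.0: shorter observation period
```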
Incidentally, the timer 916 above (and hence the transition process described above) has been described as part of the audio processing apparatus, external to the audio classifier 200. In some other embodiments, as will be described in subsection 7.3, the timer 916 can instead be regarded as part of the audio classifier 200.
1.7 Combination of embodiments and application scenarios
All the embodiments discussed above and their variations may be implemented in any combination thereof, and any components mentioned in different parts/embodiments but having the same or similar functions may be implemented as the same component or as separate components.
In particular, when the embodiments and their variations were described above, components with reference signs similar to those of components already described in earlier embodiments or variations were omitted, and only the differing components were described. In fact, these differing components can either be combined with the components of other embodiments or variations, or constitute separate solutions on their own. For example, any two or more of the solutions described with reference to Fig. 1 to Fig. 10 can be combined with each other. As the most complete solution, the audio processing apparatus may comprise both the audio content classifier 202 and the audio context classifier 204, as well as the type smoothing unit 712, the parameter smoothing unit 814 and the timer 916.
As mentioned before, the audio improving devices 400 may comprise the dialog enhancer 402, the surround virtualizer 404, the volume leveler 406 and the equalizer 408. The audio processing apparatus 100 may comprise any one or more of them, together with the adjustment unit 300 adapted thereto. When multiple audio improving devices 400 are involved, the adjustment unit 300 can be regarded as comprising multiple sub-units 300A to 300D (Fig. 15, Fig. 18, Fig. 20 and Fig. 22), each dedicated to a corresponding audio improving device 400, or it can still be regarded as one unified adjustment unit. When dedicated to an audio improving device, the adjustment unit 300, together with the audio classifier 200 and possibly other components, can be regarded as the controller of that specific audio improving device, as will be discussed in detail in the following Parts 2 to 5.
In addition, the audio improving devices 400 are not limited to the examples already mentioned, and may comprise any other audio improving devices.
Moreover, any of the solutions discussed, or any combination thereof, may further be combined with any embodiment described or implied in the other parts of this disclosure. In particular, the embodiments of the audio classifier to be discussed in Part 6 and Part 7 can be used in the audio processing apparatus.
1.8 Audio processing methods
In describing the audio processing apparatus in the embodiments above, some processes and methods have obviously also been disclosed. Hereinafter, a summary of these methods is given without repeating the details already discussed. It shall be noted that, although the methods were disclosed in the course of describing the audio processing apparatus, they do not necessarily adopt the components described, nor are they necessarily performed by those components. For example, the embodiments of the audio processing apparatus may be realized partially or entirely in hardware and/or firmware, while the audio processing methods below may be realized entirely by a computer-executable program, although the methods may also adopt the hardware and/or firmware of the audio processing apparatus.
The methods are described below with reference to Fig. 11 to Fig. 14. Note that, corresponding to the streaming nature of the audio signal, the various operations are repeated when the methods are carried out in practice, and different operations are not necessarily directed to the same audio segment.
In an embodiment as shown in Fig. 11, an audio processing method is provided. First, the audio signal to be processed is classified in real time into at least one audio type (operation 1102). Based on the confidence value of the at least one audio type, at least one parameter for audio improvement can be adjusted continuously (operation 1104). The audio improvement may be dialog enhancement (operation 1106), surround virtualization (operation 1108), volume leveling (operation 1110) and/or equalization (operation 1112). Correspondingly, the at least one parameter may comprise at least one parameter for at least one of dialog enhancement processing, surround virtualization processing, volume leveling processing and equalization processing.
Here, "in real time" means that the audio type (and thus the parameter) changes in real time according to the specific content of the audio signal, and "continuously" means that the adjustment is a continuous adjustment based on the confidence value, rather than an abrupt or discrete adjustment.
The audio types may comprise content types and/or context types. Correspondingly, the adjusting operation 1104 can be configured to adjust the at least one parameter based on the confidence value of at least one content type and the confidence value of at least one context type. The content types may further comprise at least one of short-term music, speech, background sound and noise. The context types may further comprise at least one of long-term music, movie-like media, game and VoIP.
Other context type schemes can also be conceived, for example VoIP-related context types comprising VoIP and non-VoIP, and audio quality types comprising high-quality audio and low-quality audio.
The short-term music may be further divided into various sub-types according to different criteria. Depending on the presence of a dominant source, it may comprise music without a dominant source and music with a dominant source. In addition, the short-term music may comprise at least one genre-based cluster or at least one instrument-based cluster, or at least one music cluster classified based on rhythm, tempo, timbre and/or any other musical attributes.
When both the content type and the context type are identified, the importance of the content type can be determined by the context type in which it resides. That is, depending on the context type of the audio signal, different weights are assigned to the content types in audio signals of different context types. More generally, one audio type may influence another audio type, or one audio type may be a premise of another audio type. Accordingly, the adjusting operation 1104 can be configured to modify the weight of one audio type according to the confidence value of another audio type.
When the audio signal is classified into multiple audio types simultaneously (that is, for the same audio segment), the adjusting operation 1104 may take some or all of the identified audio types into account in adjusting the parameter for improving that audio segment. For example, the adjusting operation 1104 can be configured to weight the confidence values of the at least one audio type based on the importance of the at least one audio type. Alternatively, the adjusting operation 1104 may be configured to consider at least some of the audio types by weighting them based on their confidence values. As a special case, the adjusting operation 1104 can be configured to consider at least one dominant audio type based on the confidence values.
To avoid abrupt changes of the results, smoothing schemes can be introduced.
The adjusted parameter value can be smoothed (operation 1214 in Fig. 12). For example, the parameter value determined by the adjusting operation 1104 at the current time may be replaced by a weighted sum of the parameter value determined by the adjusting operation at the current time and the smoothed parameter value of the last time. Thus, through iterations of the smoothing operation, the parameter value is smoothed on the time line.
The weights for computing the weighted sum can change adaptively based on the audio type of the audio signal, or based on different transition pairs from one audio type to another. Alternatively, the weights for computing the weighted sum can change adaptively based on the increasing or decreasing trend of the parameter value determined by the adjusting operation.
Another smoothing scheme is shown in Fig. 13. That is, the method may further comprise, for each audio type, smoothing the confidence value of the current audio signal by computing a weighted sum of the actual confidence value at the current time and the smoothed confidence value of the last time (operation 1303). Similarly to the parameter smoothing operation 1214, the weights for computing the weighted sum can change adaptively based on the confidence value of the audio type of the audio signal, or based on different transition pairs from one audio type to another.
Yet another smoothing scheme is a buffering mechanism that delays the transition from one audio type to another, even when the output of the audio classifying operation 1102 has changed. That is, the adjusting operation 1104 does not use the new audio type immediately, but waits for the output of the audio classifying operation 1102 to stabilize.
Specifically, the method may comprise measuring the duration for which the classifying operation continuously outputs the same new audio type (operation 1403 in Fig. 14), wherein the adjusting operation 1104 is configured to keep using the current audio type ("N" in operation 14035 and operation 11041) until the length of the duration of the new audio type reaches a threshold ("Y" in operation 14035 and operation 11042). Specifically, when the audio type output by the audio classifying operation 1102 changes with respect to the current audio type used in the audio parameter adjusting operation 1104 ("Y" in operation 14031), the timing starts (operation 14032). If the audio classifying operation 1102 continues outputting the new audio type, that is, if the judgment in operation 14031 remains "Y", the timing continues (operation 14032). When the duration of the new audio type finally reaches the threshold ("Y" in operation 14035), the adjusting operation 1104 uses the new audio type (operation 11042) and the timing is reset (operation 14034), in preparation for the next transition of the audio type. Before the threshold is reached ("N" in operation 14035), the adjusting operation 1104 keeps using the current audio type (operation 11041).
Here, the timing may be realized through a timer mechanism (counting up or counting down). If, after the timing has started but before the threshold is reached, the output of the audio classifying operation 1102 changes back to the current audio type used in the adjusting operation 1104, this shall be regarded as no change with respect to the current audio type used in the adjusting operation 1104 ("N" in operation 14031). However, the current classification result (corresponding to the current audio segment to be classified in the audio signal) has changed with respect to the previous output of the audio classifying operation 1102 (corresponding to the previous audio segment to be classified in the audio signal) ("Y" in operation 14033); therefore, the timing is reset (operation 14034), to be started again at the next change ("Y" in operation 14031). Of course, if the classification result of the audio classifying operation 1102 has changed neither with respect to the current audio type used in the audio parameter adjusting operation 1104 ("N" in operation 14031), nor with respect to the previous classification result ("N" in operation 14033), this indicates that the audio classification is in a stable state and the current audio type continues to be used.
Threshold value used herein above can also be directed to the different conversions from an audio types to another audio types
Pair and it is different because when state is not very stable, may generally be more desirable to audio improve device be in its default conditions without
It is to be in other states.On the other hand, if the confidence value of the new audio types is of a relatively high, it is transformed into new audio
Type is safer.Therefore, the threshold value can be negatively correlated with the confidence value of new audio types.Confidence level is higher, then threshold value is got over
It is low, it is meant that audio types can quickly be transformed into new audio types.
With the embodiment of audio processing equipment similarly, on the one hand, the embodiment of audio-frequency processing method and embodiment party
Any combinations of the modification of formula are all feasible;On the other hand, the modification of the embodiment and embodiment of audio-frequency processing method
Each aspect also can be single solution.Especially, in all audio-frequency processing methods, such as the can be used
The audio frequency classification method discussed in 6 parts and the 7th part.
Part 2: Dialog enhancer controller and controlling methods
An example of an audio improving device is the dialog enhancer (DE), which aims at continuously monitoring the audio during playback, detecting the presence of dialog, and enhancing the dialog to increase its clarity and intelligibility (making the dialog easier to hear and understand), especially for elderly listeners with decreased hearing capability. In addition to detecting whether dialog is present, the frequencies most important for intelligibility are also detected if dialog is present, and are then enhanced accordingly (with dynamic spectral rebalancing). An example of a dialog enhancement method is given in H. Muesch, "Speech Enhancement in Entertainment Audio", published as WO 2008/106036 A2, the entirety of which is incorporated herein by reference.
A common manual configuration is to enable the dialog enhancer for movie-like media content and to disable it for music content, since dialog enhancement may trigger false alarms too frequently on music signals.
With the audio type information available, the level of dialog enhancement and other parameters can be adjusted based on the confidence values of the identified audio types. As a specific example of the audio processing apparatus and methods discussed before, the dialog enhancer controller can use any combination of all the embodiments discussed in Part 1. Specifically, in the case of controlling the dialog enhancer, the audio classifier 200 and the adjustment unit 300 of the audio processing apparatus 100 as shown in Fig. 1 to Fig. 10 can constitute the dialog enhancer controller 1500 as shown in Fig. 15. In this embodiment, since the adjustment unit is dedicated to the dialog enhancer, it can be referred to as 300A. Also, as discussed in the previous part, the audio classifier 200 may comprise at least one of the audio content classifier 202 and the audio context classifier 204, and the dialog enhancer controller 1500 may further comprise at least one of the type smoothing unit 712, the parameter smoothing unit 814 and the timer 916.
Therefore, in this part, the contents already described in the previous part will not be repeated, and only some specific examples thereof are given.
For the dialog enhancer, the parameters to be adjusted include, but are not limited to: the level of dialog enhancement, the background level, and the thresholds for determining the frequency bands to be enhanced. See H. Muesch, "Speech Enhancement in Entertainment Audio", published as WO 2008/106036 A2, the entirety of which is incorporated herein by reference.
2.1 Level of dialog enhancement
Regarding the level of dialog enhancement, the adjustment unit 300A can be configured to positively correlate the dialog enhancement level of the dialog enhancer with the confidence value of speech. Additionally or alternatively, the level can be negatively correlated with the confidence values of the other content types. Accordingly, the dialog enhancement level can be set proportional (linearly or non-linearly) to the speech confidence, so that the dialog enhancement is less effective on non-speech signals such as music and background sounds (sound effects).
For the context types, the adjustment unit 300A may be configured to positively correlate the dialog enhancement level of the dialog enhancer with the confidence value of movie-like media and/or VoIP, and/or to negatively correlate the dialog enhancement level of the dialog enhancer with the confidence value of long-term music and/or game. For example, the dialog enhancement level can be set proportional (linearly or non-linearly) to the confidence value of movie-like media. When the confidence value of movie-like media is 0 (e.g., for music content), the dialog enhancement level is also 0, which is equivalent to disabling the dialog enhancement.
As described in the previous part, the content type and the context type can be considered jointly.
2.2 Thresholds for determining frequency bands to be enhanced
During the operation of the dialog enhancer, there is for each frequency band a threshold (usually an energy or loudness threshold) used to determine whether that band is to be enhanced; that is, the frequency bands above their respective energy/loudness thresholds will be enhanced. To adjust the thresholds, the adjustment unit 300A can be configured to positively correlate the thresholds with the confidence value of short-term music and/or noise and/or background sounds, and/or to negatively correlate the thresholds with the confidence value of speech. For example, if the speech confidence is high (implying a more reliable speech detection), the thresholds can be lowered so that more frequency bands can be enhanced; on the other hand, when the confidence value of music is high, the thresholds can be raised so that fewer frequency bands are enhanced (and there is thus less distortion).
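A sketch of such a confidence-dependent threshold adjustment and band selection; the base threshold, the scale factor and the band energies are illustrative assumptions:

```python
# Sketch: raise the per-band threshold with the music confidence and
# lower it with the speech confidence, then enhance only the bands
# whose energy exceeds the adjusted threshold.
def adjust_threshold(base, conf_music, conf_speech, scale=6.0):
    return base + scale * conf_music - scale * conf_speech

def bands_to_enhance(band_energies, threshold):
    return [i for i, e in enumerate(band_energies) if e > threshold]

energies = [-20.0, -28.0, -40.0]               # per-band energies in dB (made up)
th_speech = adjust_threshold(-30.0, 0.0, 1.0)  # -36 dB: more bands enhanced
th_music = adjust_threshold(-30.0, 1.0, 0.0)   # -24 dB: fewer bands enhanced
```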
2.3 Adjustment of the background level
As shown in Fig. 15, another component involved in the dialog enhancer is the minimum tracking unit 4022, which is used to estimate the background level in the audio signal (the background level is used for SNR (signal-to-noise ratio) estimation and for the band threshold estimation mentioned in subsection 2.2). The background level can also be adjusted based on the confidence values of the audio content types. For example, if the speech confidence is high, the minimum tracking unit can set the background level to the current minimum with more assurance. If the music confidence is high, the background level is set higher than the current minimum; put another way, the background level is set to a weighted average of the current minimum and the energy of the current frame, with a large weight on the current minimum. If the confidence of noise and background is high, the background level can be set much higher than the current minimum; put another way, the background level is set to a weighted average of the current minimum and the energy of the current frame, with a small weight on the current minimum.
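The weighted-average rule above can be sketched as follows; the mapping from the dominant content type to the weight on the current minimum is an illustrative assumption:

```python
# Sketch: background level as a weighted average of the tracked minimum
# and the current frame energy, with the weight on the minimum chosen
# from the dominant content type (weights are hypothetical).
WEIGHT_ON_MIN = {"speech": 1.0, "music": 0.8, "noise": 0.3}

def background_level(current_min, frame_energy, confidences):
    dominant = max(confidences, key=confidences.get)
    w = WEIGHT_ON_MIN[dominant]
    return w * current_min + (1.0 - w) * frame_energy
```

With a tracked minimum of −60 dB and a current frame at −20 dB, a confident speech classification keeps the estimate at −60 dB, while a dominant noise classification lifts it to −32 dB.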
Therefore, the adjustment unit 300A may be configured to apply an adjustment amount to the background level estimated by the minimum tracking unit, wherein the adjustment unit is further configured to positively correlate the adjustment amount with the confidence value of short-term music and/or noise and/or background sounds, and/or to negatively correlate the adjustment amount with the confidence value of speech. In a variation, the adjustment unit 300A can be configured to positively correlate the adjustment amount more with the confidence value of noise and/or background sounds than with that of short-term music.
2.4 Combination of embodiments and application scenarios
Similarly to Part 1, all the embodiments discussed above and their variations may be implemented in any combination thereof, and any components mentioned in different parts/embodiments but having the same or similar functions may be implemented as the same component or as separate components.
For example, any two or more of the solutions described in subsections 2.1 to 2.3 can be combined with each other. And any of these combinations can be further combined with any embodiment described or implied in Part 1 and in the other parts to be described later. In particular, many formulas are actually applicable to every kind of audio improving device or method, but they are not necessarily recited or discussed in every part of this disclosure. In such a case, the respective parts of this disclosure may refer to one another, so that a specific formula discussed in one part can be applied in another part, with the relevant parameters, coefficients, exponents (powers) and weights appropriately adjusted according to the specific requirements of the specific application.
2.5 Dialog enhancer controlling methods
Similarly to Part 1, in describing the dialog enhancer controller in the embodiments above, some processes and methods have obviously also been disclosed. Hereinafter, a summary of these methods is given without repeating the details already discussed.
First, the embodiments of the audio processing method discussed in Part 1 can be used for the dialog enhancer, whose parameters are among the targets to be adjusted by the audio processing method. From this point of view, the audio processing method is also a dialog enhancer controlling method.
In this subsection, only those aspects specific to the control of the dialog enhancer will be discussed. For the general aspects of the controlling methods, reference may be made to Part 1.
According to an embodiment, audio-frequency processing method can also include dialogue enhancing processing, and adjust operation 1104
Including making the rank that dialogue strengthens and film class media and/or VoIP confidence value positive correlation, and/or strengthen dialogue
Rank and long-term music and/or game confidence value it is negatively correlated.That is, dialogue enhancing is mainly for context type
For the audio signal of film class media or VoIP.
More specifically, adjustment operation 1104 can include the rank of dialogue enhancing and the confidence of voice for making dialogue booster
Angle value positive correlation.
The application may also adjust the frequency bands to be enhanced in the dialogue enhancement processing. As shown in Figure 16, a threshold (typically of energy or loudness) may be adjusted based on the confidence value of the identified audio type (operation 1602); this threshold is used to determine whether a respective frequency band is to be enhanced. Then, in the dialogue enhancer, based on the adjusted threshold, the frequency bands above the respective thresholds are selected (operation 1604) and enhanced (operation 1606).
In particular, the adjusting operation 1104 may comprise positively correlating the threshold with the confidence value of short-term music and/or noise and/or background sound, and/or negatively correlating the threshold with the confidence value of speech.
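Operations 1602 to 1606 can be sketched as follows. The specific scaling factors are assumptions; only the direction of each correlation comes from the embodiment:

```python
def adjust_threshold(base_threshold, conf_music=0.0, conf_noise=0.0,
                     conf_bkg=0.0, conf_speech=0.0):
    """Operation 1602 (sketch): raise the band threshold with the short-term
    music / noise / background confidences, lower it with the speech
    confidence. The multiplicative form and the 0.5 factor are illustrative."""
    raise_factor = 1.0 + max(conf_music, conf_noise, conf_bkg)
    lower_factor = 1.0 - 0.5 * conf_speech
    return base_threshold * raise_factor * lower_factor

def bands_to_enhance(band_energies, threshold):
    """Operations 1604/1606: pick the bands whose energy exceeds the threshold."""
    return [i for i, e in enumerate(band_energies) if e > threshold]
```

With a high speech confidence the threshold drops, so more bands qualify for enhancement; with confident music or noise it rises and fewer bands are touched.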
The audio processing method (in particular the dialogue enhancement processing) generally also comprises estimating the background level in the audio signal. This is usually realized by a minimum tracking unit 4022, which is implemented in the dialogue enhancer 402 and is used for SNR estimation or frequency-band threshold estimation. The application may also be used to adjust the background level. In this case, as shown in Figure 17, after the background level is estimated (operation 1702), it is first adjusted based on the confidence values of the audio types (operation 1704), and the adjusted background level is then used for SNR estimation and/or frequency-band threshold estimation (operation 1706). In particular, the adjusting operation 1104 may be configured to apply an adjustment amount to the estimated background level, and may further be configured to positively correlate the adjustment amount with the confidence value of short-term music and/or noise and/or background sound, and/or negatively correlate the adjustment amount with the confidence value of speech.
More specifically, the adjusting operation 1104 may be configured to correlate the adjustment amount more positively with the confidence value of noise and/or background than with that of short-term music.
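One plausible realization of operation 1704, under stated assumptions (the dB scale, the 6 dB ceiling and the 0.5 music weight are invented for illustration; only the relative ordering of the correlations is from the embodiment):

```python
def background_level_adjustment(conf_noise=0.0, conf_bkg=0.0,
                                conf_music=0.0, conf_speech=0.0,
                                max_delta_db=6.0):
    """Illustrative adjustment (in dB) applied to the minimum-tracked
    background level: noise/background weigh more than short-term music,
    and a high speech confidence pulls the adjustment toward zero."""
    up = max(conf_noise, conf_bkg, 0.5 * conf_music)  # music weighted less
    return max_delta_db * up * (1.0 - conf_speech)
```

The adjusted level would then feed the SNR and band-threshold estimation of operation 1706.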
Similar to the embodiments of the audio processing apparatus, on the one hand, any combination of the embodiments of the audio processing method and their variations is feasible; on the other hand, each aspect of the embodiments and their variations may also be a separate solution. In addition, any two or more of the solutions described in this subsection may be combined with each other, and these combinations may further be combined with any embodiment described or implied in Part 1 and in the other parts to be described later.
Part 3: Surround Sound Virtualizer Controller and Control Method
A surround sound virtualizer makes it possible to render a surround sound signal (such as multi-channel 5.1 or 7.1) over the internal loudspeakers of a PC or over headphones. That is, over stereo devices such as built-in laptop loudspeakers or headphones, the surround sound virtualizer creates a virtual surround effect for the user and provides a cinematic experience. A surround sound virtualizer usually employs head-related transfer functions (HRTFs) to simulate, at the ears, the sound waves originating from the various loudspeaker positions associated with the multi-channel audio signal.
Although existing surround sound virtualizers work well on headphones, on built-in loudspeakers they behave differently for different content. Generally, movie-like media content enables the surround sound virtualizer for loudspeakers, whereas music does not, because music may sound too thin.
Because the same parameters of the surround sound virtualizer cannot produce a good sound image for both movie-like media content and music content, the parameters need to be adjusted more accurately based on the content. This can be done with the available audio type information — in particular the music confidence value and the speech confidence value — as well as some other content-type information and context information.
Similar to Part 2, as a specific example of the audio processing apparatus and methods discussed in Part 1, the surround sound virtualizer 404 may use any of the embodiments discussed in Part 1 and any combination of those embodiments. In particular, in the case of controlling the surround sound virtualizer 404, the audio classifier 200 and the adjusting unit 300 of the audio processing apparatus 100 as shown in Figures 1 to 10 may constitute a surround sound virtualizer controller 1800 as shown in Figure 18. In this embodiment, since the adjusting unit is dedicated to the surround sound virtualizer 404, it may be referred to as 300B. Also, similar to Part 2, the audio classifier 200 may comprise at least one of the audio content classifier 202 and the audio context classifier 204, and the surround sound virtualizer controller 1800 may further comprise at least one of the type smoothing unit 712, the parameter smoothing unit 814 and the timer 916.
Therefore, in this part, the contents already described in Part 1 will not be repeated, and only some specific examples thereof are given.
For the surround sound virtualizer, the adjustable parameters include, but are not limited to, the start frequency of the surround sound virtualizer 404 and the surround boost amount.
3.1 Surround Boost Amount
With regard to the surround boost amount, the adjusting unit 300B may be configured to positively correlate the surround boost amount of the surround sound virtualizer 404 with the confidence value of noise and/or background and/or speech, and/or negatively correlate the surround boost amount with the confidence value of short-term music.
Specifically, in order to modify the surround sound virtualizer 404 so that music (the content type) sounds acceptable, an example implementation of the adjusting unit 300B may adjust the surround boost amount based on the short-term music confidence value, for example:
SB ∝ (1 – Conf_music)   (5)
where SB denotes the surround boost amount and Conf_music is the confidence value of short-term music. This helps to weaken the surround boost for music and prevents it from sounding muddy.
Similarly, the speech confidence value may also be utilized, for example:
SB ∝ (1 – Conf_music) · Conf_speech^α   (6)
where Conf_speech is the confidence value of speech and α is a weighting coefficient in exponential form, whose range may be 1 to 2. This formula indicates that the surround boost amount is high only for pure speech (high speech confidence and low music confidence).
Alternatively, only the confidence value of speech may be considered:
SB ∝ Conf_speech   (7)
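Formulas (5) to (7) translate directly into code. The proportionality constant is taken as 1 and α = 1.5 here purely for illustration:

```python
def surround_boost_eq5(conf_music):
    # SB ∝ (1 - Conf_music)                          -- formula (5)
    return 1.0 - conf_music

def surround_boost_eq6(conf_music, conf_speech, alpha=1.5):
    # SB ∝ (1 - Conf_music) * Conf_speech ** alpha   -- formula (6), 1 <= alpha <= 2
    return (1.0 - conf_music) * conf_speech ** alpha

def surround_boost_eq7(conf_speech):
    # SB ∝ Conf_speech                               -- formula (7)
    return conf_speech
```

Formula (6) yields a high boost only when the speech confidence is high and the music confidence is low, exactly as the text describes.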
Various modifications can be designed in a similar manner. In particular, for noise or background sound, formulas similar to formulas (5) to (7) can be constructed. Furthermore, the effects of the four content types may be jointly considered in any combination. In such a case, noise and background sound are ambient sounds and can more safely take a large boost amount; speech, assuming the speaker is usually located in front of the screen, may take a medium boost amount; and music takes a smaller boost amount. Therefore, the adjusting unit 300B may be configured to correlate the surround boost amount more positively with the confidence value of noise and/or background than with that of the content type speech.
Assuming that a desired boost amount (that is, equivalent to a weight) is predefined for each content type, another alternative formula may also be applied:
SB_est = (a_music·Conf_music + a_speech·Conf_speech + a_noise·Conf_noise + a_bkg·Conf_bkg) / (Conf_music + Conf_speech + Conf_noise + Conf_bkg)   (8)
where SB_est is the estimated boost amount, a_ContentType is the expected/predefined boost amount (weight) of a content type, Conf_ContentType is the confidence value of a content type, and bkg denotes background sound. Depending on the circumstances, a_music can (but need not) be set to 0, indicating that the surround sound virtualizer 404 is to be disabled for pure music (the content type).
From another perspective, a_ContentType in formula (8) is the expected/predefined boost amount of a content type, and the quotient of the confidence value of the respective content type divided by the sum of the confidence values of all identified content types can be regarded as the normalized weight of the predefined/expected boost amount of the respective content type. That is, the adjusting unit 300B may be configured to consider at least some of the multiple content types by weighting the predefined boost amounts of the multiple content types based on the confidence values.
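The normalized weighting of formula (8) — which is reused with context types in formula (9) below — can be sketched generically. The predefined amounts in the example call (0.9 for noise/background, 0.5 for speech, 0 for music) are illustrative, except that 0 for music follows the text:

```python
def weighted_boost(predefined, confidences):
    """Formula (8)/(9) pattern: confidence-weighted average of the
    predefined boost amounts; each confidence divided by the sum of all
    confidences acts as a normalized weight."""
    total = sum(confidences.values())
    if total == 0.0:
        return 0.0  # no type identified; no boost (assumption)
    return sum(predefined[t] * c for t, c in confidences.items()) / total

# Illustrative predefined amounts per content type:
AMOUNTS = {"music": 0.0, "speech": 0.5, "noise": 0.9, "bkg": 0.9}
```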
For the context types, the adjusting unit 300B may be configured to positively correlate the surround boost amount of the surround sound virtualizer 404 with the confidence value of movie-like media and/or game, and/or negatively correlate the surround boost amount with the confidence value of long-term music and/or VoIP. Formulas similar to formulas (5) to (8) can then be constructed.
As a special example, the surround sound virtualizer 404 may be enabled for pure movie-like media and/or game, but disabled for music and/or VoIP. Meanwhile, the boost amounts of the surround sound virtualizer 404 may be set differently for movie-like media and game: movie-like media takes a higher boost amount, and game takes a smaller one. Therefore, the adjusting unit 300B may be configured to correlate the surround boost amount more positively with the confidence value of movie-like media than with that of game.
Similar to the content types, the boost amount for the audio signal may also be set to a weighted average over the confidence values of the context types:
SB_est = (a_MOVIE·Conf_MOVIE + a_MUSIC·Conf_MUSIC + a_VOIP·Conf_VOIP + a_GAME·Conf_GAME) / (Conf_MOVIE + Conf_MUSIC + Conf_VOIP + Conf_GAME)   (9)
where SB_est is the estimated boost amount, a_ContextType is the expected/predefined boost amount (weight) of a context type, and Conf_ContextType is the confidence value of a context type. Depending on the circumstances, a_MUSIC and a_VOIP can (but need not) be set to 0, indicating that the surround sound virtualizer 404 is to be disabled for pure music (long-term music) and/or pure VoIP.
Likewise, similar to the content types, a_ContextType in formula (9) is the expected/predefined boost amount of a context type, and the quotient of the confidence value of the respective context type divided by the sum of the confidence values of all identified context types can be regarded as the normalized weight of the predefined/expected boost amount of the respective context type. That is, the adjusting unit 300B may be configured to consider at least some of the multiple context types by weighting the predefined boost amounts of the multiple context types based on the confidence values.
3.2 Start Frequency
Other parameters, such as the start frequency, can also be modified in the surround sound virtualizer. Generally, the high-frequency components of an audio signal are more suitable for spatial rendering. For example, in music, if the bass is spatially rendered with a strong surround effect, it will sound strange. Therefore, for a specific audio signal, the surround sound virtualizer needs to determine a frequency threshold, spatially rendering the components above it while keeping the components below it. This frequency threshold is the start frequency.
According to an embodiment of the present application, the start frequency of the surround sound virtualizer may be increased for music content, so that more bass can be retained for music signals. Therefore, the adjusting unit 300B may be configured to positively correlate the start frequency of the surround sound virtualizer with the confidence value of short-term music.
3.3 Combination of Embodiments and Application Scenarios
Similar to Part 1, all of the embodiments discussed above and their variations may be realized in any combination thereof, and any component mentioned in different parts/embodiments but having the same or similar function may be implemented as the same component or as separate components.
For example, any two or more of the solutions described in subsections 3.1 to 3.2 may be combined with each other, and these combinations may further be combined with any embodiment described or implied in Part 1, Part 2 and the other parts to be described later.
3.4 Surround Sound Virtualizer Control Methods
As in Part 1, in describing the surround sound virtualizer controller in the embodiments above, some processes and methods are obviously also disclosed. A summary of these methods is given below without repeating the details already discussed.
First, the embodiments of the audio processing method discussed in Part 1 may be used for a surround sound virtualizer, whose parameters are among the targets to be adjusted by the audio processing method. From this point of view, the audio processing method is also a surround sound virtualizer control method.
In this subsection, only those aspects specific to the control of the surround sound virtualizer will be discussed. For general aspects of the control method, reference may be made to Part 1.
According to an embodiment, the audio processing method may further comprise surround sound virtualization processing, and the adjusting operation 1104 may be configured to positively correlate the surround boost amount of the surround sound virtualization processing with the confidence value of noise and/or background and/or speech, and/or negatively correlate the surround boost amount with the confidence value of short-term music.
Specifically, the adjusting operation 1104 may be configured to correlate the surround boost amount more positively with the confidence value of noise and/or background than with that of the content type speech.
Alternatively or additionally, the surround boost amount may also be adjusted based on the confidence value of the context. Specifically, the adjusting operation 1104 may be configured to positively correlate the surround boost amount of the surround sound virtualization processing with the confidence value of movie-like media and/or game, and/or negatively correlate the surround boost amount with the confidence value of long-term music and/or VoIP.
More specifically, the adjusting operation 1104 may be configured to correlate the surround boost amount more positively with the confidence value of movie-like media than with that of game.
Another parameter to be adjusted is the start frequency of the surround sound virtualization processing. As shown in Figure 19, the start frequency is first adjusted based on the confidence value of the audio type (operation 1902), and then the audio components above the start frequency are processed by the surround sound virtualization (operation 1904). Specifically, the adjusting operation 1104 may be configured to positively correlate the start frequency of the surround sound virtualization processing with the confidence value of short-term music.
Similar to the embodiments of the audio processing apparatus, on the one hand, any combination of the embodiments of the audio processing method and their variations is feasible; on the other hand, each aspect of the embodiments and their variations may also be a separate solution. In addition, any two or more of the solutions described in this subsection may be combined with each other, and these combinations may further be combined with any embodiment described or implied in the other parts of the present disclosure.
Part 4: Volume Leveler Controller and Control Method
The volume of different audio sources, or of different segments within the same audio source, sometimes varies greatly. This is troublesome because the user then has to adjust the volume frequently. A volume leveler (VL) aims to regulate the volume of the audio content being played back and to keep it consistent over the timeline based on a target loudness value. Example volume levelers are given in "Calculating and Adjusting the Perceived Loudness and/or the Perceived Spectral Balance of an Audio Signal" by A. J. Seefeldt et al., published as US 2009/0097676 A1; "Audio Gain Control Using Specific-Loudness-Based Auditory Event Detection" by B. G. Crockett et al., published as WO 2007/127023 A1; and "Audio Processing Using Auditory Scene Analysis and Spectral Skewness" by A. Seefeldt et al., published as WO 2009/011827 A1. The entire contents of these three documents are incorporated herein by reference.
A volume leveler continuously measures the loudness of the audio signal in some manner and then modifies the signal by a gain amount, which is a scaling factor for modifying the loudness of the audio signal and is typically a function of the measured loudness, the desired target loudness and some other factors. Multiple factors need to be considered to estimate a suitable gain, under the constraint of reaching the target loudness while preserving the dynamic range. A volume leveler generally comprises several sub-elements, such as automatic gain control (AGC), auditory event detection and dynamic range control (DRC).
A control signal is usually used in a volume leveler to control the "gain" of the audio signal. For example, the control signal may be an indication of amplitude changes of the audio signal derived by pure signal analysis. The control signal may also be an auditory event indication, derived through psychoacoustic analysis such as auditory scene analysis or specific-loudness-based auditory event detection, indicating whether a new auditory event occurs. Such a control signal is used in the volume leveler for gain control, for example by ensuring that the gain is nearly constant within an auditory event and by confining most of the gain change to the neighborhood of event boundaries, so as to reduce possible audible distortion caused by rapid gain changes in the audio signal.
However, the common methods of deriving the control signal cannot distinguish informative auditory events from non-informative (interfering) ones. Here, an informative auditory event denotes an audio event that contains meaningful information and may be of more interest to the user, such as dialogue and music, whereas a non-informative signal contains no information meaningful to the user, such as the noise in VoIP. As a result, a large gain may also be applied to a non-informative signal, raising it close to the target loudness. This would be unpleasant in some applications. For example, in a VoIP call, the noise occurring in conversation pauses is often boosted to a loud volume after being processed by the volume leveler, which is undesirable for the user.
To solve this problem at least in part, the present application proposes controlling the volume leveler based on the embodiments discussed in Part 1.
Similar to Parts 2 and 3, as a specific example of the audio processing apparatus and methods discussed in Part 1, the volume leveler 406 may use any of the embodiments discussed in Part 1 and any combination of those embodiments. In particular, in the case of controlling the volume leveler 406, the audio classifier 200 and the adjusting unit 300 of the audio processing apparatus 100 as shown in Figures 1 to 10 may constitute a controller 2000 of the volume leveler 406 as shown in Figure 20. In this embodiment, since the adjusting unit is dedicated to the volume leveler 406, it may be referred to as 300C.
That is, based on the disclosure of Part 1, the volume leveler controller 2000 may comprise: an audio classifier 200 for continuously identifying the audio type (such as content type and/or context type) of the audio signal; and an adjusting unit 300C for adjusting the volume leveler in a continuous manner based on the confidence value of the identified audio type. Similarly, the audio classifier 200 may comprise at least one of the audio content classifier 202 and the audio context classifier 204, and the volume leveler controller 2000 may further comprise at least one of the type smoothing unit 712, the parameter smoothing unit 814 and the timer 916.
Therefore, in this part, the contents already described in Part 1 will not be repeated, and only some specific examples thereof are given.
Different parameters of the volume leveler 406 can be adaptively adjusted based on the classification results. For example, the parameters directly related to the dynamic gain or to the range of the dynamic gain can be adjusted by reducing the gain of non-informative signals. A parameter indicating the degree to which the signal is a new perceivable audio event can also be adjusted, which in turn indirectly controls the dynamic gain (the gain will change slowly within an audio event but may change rapidly at the boundary between two audio events). In this application, several embodiments of parameter adjustment, or of volume leveler control mechanisms, are presented.
4.1 Informative and Interfering Content Types
As mentioned above, in relation to the control of the volume leveler, the audio content types may be classified into informative content types and interfering content types, and the adjusting unit 300C may be configured to positively correlate the dynamic gain of the volume leveler with the informative content types of the audio signal, and negatively correlate the dynamic gain of the volume leveler with the interfering content types of the audio signal.
As an example, considering that noise is interfering (non-informative) and that boosting noise to a loud volume is unpleasant, a parameter directly controlling the dynamic gain, or a parameter indicating a new audio event, can be set proportional to a decreasing function of the noise confidence value (Conf_noise), for example:
GainControl ∝ 1 – Conf_noise   (10)
Here, for simplicity, the symbol GainControl is used to denote all parameters related to gain control in the volume leveler, since different implementations of volume levelers may use different parameter names with different underlying meanings. Using the single term GainControl keeps the expression brief without losing generality. In essence, adjusting these parameters amounts to applying a linear or non-linear weight to the original gain. As one example, GainControl may be used directly to scale the gain, so that the gain is small if GainControl is small. As another specific example, as described in "Audio Gain Control Using Specific-Loudness-Based Auditory Event Detection" by B. G. Crockett et al., published as WO 2007/127023 A1 (the entire contents of which are incorporated herein by reference), the gain is indirectly controlled by using GainControl to scale the event control signal. In this case, when GainControl is small, the control of the gain of the volume leveler is modified to prevent the gain from changing significantly over time; when GainControl is large, the control is modified so that the gain of the leveler can change more freely.
With the gain control described in formula (10) (whether scaling the original gain directly or scaling the event control signal), the dynamic gain of the audio signal is related (linearly or non-linearly) to the noise confidence value. If the signal is noise with a high confidence value, the final gain will be small due to the factor (1 – Conf_noise). In this way, the noise signal is prevented from being boosted to an unpleasantly loud volume.
As an example variation of formula (10), if background sounds are also of no interest in the application (such as in VoIP), they can be treated similarly and a small gain applied to them as well. The control function may consider both the confidence value of noise (Conf_noise) and the confidence value of background (Conf_bkg), for example:
GainControl ∝ (1 – Conf_noise) · (1 – Conf_bkg)   (11)
In the above formula, since noise and background sounds are both undesired, GainControl is equally affected by the confidence value of noise and the confidence value of background, and noise and background sounds may be regarded as having the same weight. Depending on the circumstances, they may instead have different weights. For example, different coefficients or different exponents (α and γ) may be applied to the confidence value of noise and the confidence value of background sound (or to their differences from 1). That is, formula (11) may be rewritten as:
GainControl ∝ (1 – Conf_noise)^α · (1 – Conf_bkg)^γ   (12)
or
GainControl ∝ (1 – Conf_noise^α) · (1 – Conf_bkg^γ)   (13)
Alternatively, the adjusting unit 300C may be configured to consider at least one dominant content type based on the confidence values, for example:
GainControl ∝ 1 – max(Conf_noise, Conf_bkg)   (14)
Both formula (11) (and its variations) and formula (14) apply a small gain to noise signals and background sound signals, and retain the original behavior of the volume leveler (with GainControl close to 1) only when both the noise confidence and the background confidence are small, such as in speech signals and music signals.
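Formulas (10), (12) and (14) are direct to implement; the proportionality constant is taken as 1 here:

```python
def gain_control_eq10(conf_noise):
    # GainControl ∝ 1 - Conf_noise                              -- formula (10)
    return 1.0 - conf_noise

def gain_control_eq12(conf_noise, conf_bkg, alpha=1.0, gamma=1.0):
    # GainControl ∝ (1 - Conf_noise)^alpha * (1 - Conf_bkg)^gamma  -- formula (12)
    return (1.0 - conf_noise) ** alpha * (1.0 - conf_bkg) ** gamma

def gain_control_eq14(conf_noise, conf_bkg):
    # GainControl ∝ 1 - max(Conf_noise, Conf_bkg)               -- formula (14)
    return 1.0 - max(conf_noise, conf_bkg)
```

For a confident speech or music signal (both confidences near 0) every variant returns a value near 1, leaving the leveler's default behavior intact.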
The above examples consider the dominant interfering content type. Depending on the circumstances, the adjusting unit 300C may also be configured to consider the dominant informative content type based on the confidence values. More generally, the adjusting unit 300C may be configured to consider at least one dominant content type based on the confidence values, regardless of whether the identified audio types are/comprise informative audio types and/or interfering audio types.
As another example variation of formula (10), assuming that speech signals are the most informative content and require the least modification of the default behavior of the volume leveler, the control function may consider both the noise confidence value (Conf_noise) and the speech confidence value (Conf_speech), for example:
GainControl ∝ 1 – Conf_noise · (1 – Conf_speech)   (15)
With this function, GainControl is small only for signals with high noise confidence and low speech confidence (for example, pure noise), and if the speech confidence is high, GainControl will be close to 1 (thus maintaining the original behavior of the volume leveler). More generally, this can be regarded as modifying the weight of one content type (e.g. Conf_noise) according to the confidence value of at least one other content type (e.g. Conf_speech). In formula (15) above, the confidence of speech modifies the weighting coefficient of the noise confidence (a different kind of weight from those in formulas (12) and (13)). In other words, in formula (10) the coefficient of Conf_noise can be regarded as 1, whereas in formula (15) some other audio type (such as speech, but not limited thereto) influences the importance of the confidence value of noise, so that the weight of Conf_noise is modified by the confidence value of speech. In the context of the present disclosure, the term "weight" should be construed as covering this: it indicates the importance of a value but is not necessarily normalized. Reference may be made to subsection 1.4.
From another perspective, similar to formulas (12) and (13), weights in exponential form may be applied to the confidence values in the above function to represent the priority (or importance) of the different audio signals; for example, formula (15) may be changed into:
GainControl ∝ 1 – Conf_noise^α · (1 – Conf_speech)^γ   (16)
where α and γ are two weights, which can be set smaller if a faster response to the modification of the leveler parameters is desired.
Formulas (10) to (16) can be freely combined to form various control functions that may be suitable for different applications. The confidence values of other audio content types, such as the music confidence value, can also easily be incorporated into the control functions in a similar manner.
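The speech-weighted variants read as follows (again with proportionality constant 1):

```python
def gain_control_eq15(conf_noise, conf_speech):
    # GainControl ∝ 1 - Conf_noise * (1 - Conf_speech)  -- formula (15)
    # The speech confidence modifies the weight of the noise confidence.
    return 1.0 - conf_noise * (1.0 - conf_speech)

def gain_control_eq16(conf_noise, conf_speech, alpha=1.0, gamma=1.0):
    # GainControl ∝ 1 - Conf_noise^alpha * (1 - Conf_speech)^gamma  -- formula (16)
    return 1.0 - conf_noise ** alpha * (1.0 - conf_speech) ** gamma
```

Pure noise (high Conf_noise, low Conf_speech) yields a small GainControl, while a high speech confidence restores the leveler's default behavior even when the noise confidence is high.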
In the case where GainControl is used to adjust a parameter indicating the degree to which the signal is a new perceivable audio event — which then indirectly controls the dynamic gain (the gain changes slowly within an audio event but may change rapidly at the boundary between two audio events) — there can be considered to be a further transfer function between the confidence values of the content types and the final dynamic gain.
4.2 Content Types in Different Contexts
The control functions of formulas (10) to (16) consider the confidence values of audio content types, such as noise, background sound, short-term music and speech, but do not consider the audio context from which the sound originates, such as movie-like media and VoIP. The same audio content type, such as background sound, may need to be treated differently in different audio contexts. Background sounds comprise various sounds, such as car engines, explosions and applause. In VoIP, a background signal is probably meaningless, but in movie-like media it may be important. This indicates the need to identify the audio context of interest and to design different control functions for different audio contexts.
Accordingly, the adjusting unit 300C may be configured to regard a content type of the audio signal as informative or interfering based on the context type of the audio signal. For example, considering the noise confidence value and the background confidence value and distinguishing between the VoIP context and the non-VoIP context, an audio-context-dependent control function may be:
if the audio context is VoIP:
GainControl ∝ 1 – max(Conf_noise, Conf_bkg)
otherwise:   (17)
GainControl ∝ 1 – Conf_noise
That is, in the VoIP context, noise and background sounds are regarded as interfering content types, whereas in the non-VoIP context, background sound is regarded as an informative content type.
As another example, considering the confidence values of speech, noise and background and distinguishing between the VoIP context and the non-VoIP context, an audio-context-dependent control function may be:
if the audio context is VoIP:
GainControl ∝ 1 – max(Conf_noise, Conf_bkg)
otherwise:   (18)
GainControl ∝ 1 – Conf_noise · (1 – Conf_speech)
Here, speech is emphasized as an informative content type. Assuming that in the non-VoIP context music is also important informative content, the second part of formula (18) can be extended to:
GainControl ∝ 1 – Conf_noise · (1 – max(Conf_speech, Conf_music))   (19)
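The context switch of formulas (18) and (19) can be sketched as one function (the string label for the context is an assumption of this sketch):

```python
def gain_control_in_context(context, conf_noise, conf_bkg,
                            conf_speech=0.0, conf_music=0.0):
    """Formulas (18)/(19): in VoIP both noise and background sound are
    interfering; outside VoIP, speech (and, per formula (19), music) is
    treated as informative."""
    if context == "VoIP":
        return 1.0 - max(conf_noise, conf_bkg)                      # (18), VoIP branch
    return 1.0 - conf_noise * (1.0 - max(conf_speech, conf_music))  # (19)
```

In VoIP, confident background sound suppresses the gain; in a non-VoIP context the same background sound leaves the leveler untouched unless it is also noise-like.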
In fact, each of the control functions (10) to (16), or variations thereof, may be applied to a different/corresponding audio context. Therefore, a large number of combinations can be produced to form audio-context-dependent control functions.
Besides the VoIP and non-VoIP contexts distinguished and utilized in formulas (17) and (18), other audio contexts may be utilized in a similar manner, such as movie-like media, long-term music and game, or low-quality audio and high-quality audio.
4.3 Context Types
The context type can also be used directly to control the volume leveller, so that unpleasant sounds (such as noise) are not boosted too much. For example, the VoIP confidence value can be used to steer the volume leveller, making it less sensitive when the VoIP confidence value is high.

Specifically, given the VoIP confidence value Conf_VOIP, the level of the volume leveller can be set proportional to (1 - Conf_VOIP). That is, the volume leveller is almost deactivated for VoIP content (when the VoIP confidence value is high), which is consistent with the traditional manual setting (preset) of disabling the volume leveller for the VoIP context.
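As a minimal sketch (illustrative names, not the patent's implementation), scaling the leveller amount by (1 - Conf_VOIP) looks like:

```python
def leveller_amount(conf_voip, base_amount=1.0):
    """Scale the volume leveller amount by (1 - Conf_VOIP), so the leveller
    is almost deactivated when the VoIP confidence value is high."""
    return base_amount * (1.0 - conf_voip)
```

A confidence of 1 reproduces the traditional preset of disabling the leveller for VoIP, while intermediate confidences give a continuous transition instead of a hard switch.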
Alternatively, different dynamic gain ranges can be set for different contexts of the audio signal. In general, the VL (volume leveller) amount further adjusts the amount of gain applied to the audio signal, and can be viewed as another (non-linear) weight on the gain. In one embodiment, the setting can be:

Table 1

|           | Movie-like media | Long-term music | VoIP             | Game |
| VL amount | High             | Medium          | Off (or minimal) | Low  |
Further, suppose a desired VL amount is predefined for each context type. For example, the VL amount is set to 1 for movie-like media, 0 for VoIP, 0.6 for music, and 0.3 for game, although the application is not limited thereto. In this example, if the range of the dynamic gain for movie-like media is 100%, then the range of the dynamic gain for music is 60%, and so on. If the classification of the audio classifier 200 is based on hard decisions, the range of the dynamic gain can be set directly as in the above example. If the classification of the audio classifier 200 is based on soft decisions, the range of the dynamic gain can be adjusted based on the confidence value of the context type.
Similarly, when the audio classifier 200 identifies multiple context types from the audio signal, the adjusting unit 300C may be configured to adjust the range of the dynamic gain by weighting the confidence values of the multiple context types based on their importance.
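A sketch of the soft-decision case, using the example VL amounts from the text (movie-like media 1, VoIP 0, music 0.6, game 0.3); the dictionary keys and the 10 dB full range are illustrative assumptions:

```python
# Desired VL amounts per context type, taken from the example above.
VL_AMOUNT = {"movie": 1.0, "voip": 0.0, "music": 0.6, "game": 0.3}

def dynamic_gain_range(confidences, full_range_db=10.0):
    """Weight the per-context VL amounts by the classifier's confidence
    values (importance weighting) to get the allowed dynamic-gain range."""
    total = sum(confidences.values())
    if total == 0.0:
        return 0.0  # no confident context: apply no dynamic gain
    amount = sum(VL_AMOUNT[c] * conf for c, conf in confidences.items()) / total
    return amount * full_range_db
```

A hard decision is the special case in which one confidence value is 1 and the rest are 0.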
In general, for context types, functions similar to formulas (10) to (16), with the content types therein replaced by context types, can also be used here to adaptively set a suitable VL amount. In fact, Table 1 reflects the importance of the different context types.
From another perspective, the confidence values can be used to derive the normalized weights discussed in Section 1.4. Assuming a specific amount is predefined for each context type in Table 1, a formula similar to formula (9) can then be applied. Incidentally, a similar solution can also be applied to multiple content types and any other audio types.
4.4 Combination of embodiments and application scenarios

As in Part 1, all of the embodiments above and their variants may be implemented in any combination thereof, and any components mentioned in different parts/embodiments but having the same or similar functions may be implemented as the same component or as separate components. For example, any two or more of the solutions described in Sections 4.1 to 4.3 can be combined with one another. And these combinations can be further combined with any embodiment described or implied in Parts 1 to 3 and in the other parts to be described later.
Figure 21 shows the effect of the volume leveller controller proposed herein, by comparing an original short-term segment (Figure 21(A)), the short-term segment processed by a conventional volume leveller with unchanged parameters (Figure 21(B)), and the short-term segment processed by the volume leveller proposed in this application (Figure 21(C)). As can be seen, in the conventional volume leveller shown in Figure 21(B), the volume of the noise (the latter half of the audio signal) is boosted as well, which is unpleasant. By contrast, in the new volume leveller shown in Figure 21(C), the volume of the effective part of the audio signal is boosted without obviously boosting the volume of the noise, giving the audience a good experience.
4.5 Volume leveller control methods

As in Part 1, in describing the volume leveller controller in the embodiments above, some processes and methods are obviously also disclosed. Hereinafter, a summary of these methods is given without repeating the details already discussed.

First, the embodiments of the audio processing method discussed in Part 1 can be used for a volume leveller, whose parameters are among the targets to be adjusted by the audio processing method. In this respect, the audio processing method is also a volume leveller control method.

In this section, only those aspects specific to the control of the volume leveller are discussed. For general aspects of the control method, reference may be made to Part 1.

According to the application, a volume leveller control method is provided, comprising: identifying the content type of an audio signal in real time; and adjusting the volume leveller in a continuous manner based on the identified content type, by positively correlating the dynamic gain of the volume leveller with informative content types of the audio signal, and negatively correlating the dynamic gain of the volume leveller with interfering content types of the audio signal.
The content types may comprise speech, short-term music, noise and background sound. Generally, noise is regarded as an interfering content type.

When adjusting the dynamic gain of the volume leveller, the dynamic gain may be adjusted directly based on the confidence value of the content type, or through a transfer function of the confidence value of the content type.
As has been described, an audio signal may be classified into multiple audio types simultaneously. When multiple content types are involved, the adjusting operation 1104 may be configured to consider at least some of the multiple audio content types by weighting the confidence values of the multiple content types based on their importance, or by weighting the effects of the multiple content types based on the confidence values. In particular, the adjusting operation 1104 may be configured to consider at least one dominant content type based on the confidence values. When the audio signal comprises both interfering content types and informative content types, the adjusting operation may be configured to consider at least one dominant interfering content type and/or at least one dominant informative content type based on the confidence values.

Different audio types may influence one another. The adjusting operation 1104 may therefore be configured to modify the weight of one content type with the confidence value of at least one other content type.
As described in Part 1, the confidence values of the audio types of the audio signal may be smoothed. For details of the smoothing operation, refer to Part 1.

The method may further comprise identifying the context type of the audio signal, wherein the adjusting operation 1104 may be configured to adjust the range of the dynamic gain based on the confidence value of the context type.
The role of a content type is constrained by the context type in which it resides. Therefore, when both content-type information and context-type information are obtained for the audio signal (for the same audio segment) simultaneously, the content type of the audio signal may be determined to be informative or interfering based on the context type of the audio signal. In addition, depending on the context type of the audio signal, different weights may be assigned to a given content type in audio signals of different context types. From another perspective, different weights (larger or smaller, positive or negative) may be used to reflect the informative or interfering nature of a content type.

The context types of the audio signal may comprise VoIP, movie-like media, long-term music and game. In an audio signal of the VoIP context type, background sound is regarded as an interfering content type, whereas in an audio signal of a non-VoIP context type, background sound and/or speech and/or music are regarded as informative content types. Other context types may comprise high-quality audio or low-quality audio.
As with multiple content types, when the audio signal is simultaneously classified (for the same audio segment) into multiple context types with corresponding confidence values, the adjusting operation 1104 may be configured to consider at least some of the multiple context types by weighting their confidence values based on the importance of the multiple context types, or by weighting the effects of the multiple context types based on the confidence values. In particular, the adjusting operation may be configured to consider at least one dominant context type based on the confidence values.

Finally, the embodiments of the method described in this section may use the audio classification methods to be discussed in Parts 6 and 7; their detailed description is omitted here.
As with the embodiments of the audio processing apparatus, on the one hand, any combination of the embodiments of the audio processing method and their variants is feasible; on the other hand, each aspect of the embodiments of the audio processing method and their variants may be a separate solution. In addition, any two or more of the solutions described in this section may be combined with one another, and these combinations may be further combined with any embodiment described or implied in the other parts of the present disclosure.
Part 5: Equalizer controller and control methods

An equalizer is commonly applied to a music signal to adjust or modify its spectral balance, also known as "tone" or "timbre". A traditional equalizer allows a user to configure the overall profile (curve or shape) of the frequency response (gain) on each individual frequency band, in order to emphasize certain sounds or remove undesired sounds. Popular music players, such as Windows Media Player, commonly provide a graphic equalizer for adjusting the gain at each frequency band, and provide a set of presets for different music genres, such as rock, rap, jazz and folk, so as to obtain the best experience when listening to music of different genres. Once a preset is selected or a profile is set, the same equalization gains are applied to the signal until the profile is modified manually.

By contrast, a dynamic equalizer provides a way to automatically adjust the equalization gains at each frequency band, in order to maintain overall consistency of the spectral balance associated with a desired tone or timbre. This consistency is achieved by continuously monitoring the spectral balance of the audio, comparing it with a desired preset spectral balance, and dynamically adjusting the applied equalization gains so as to transform the original spectral balance of the audio into the desired spectral balance. The desired spectral balance is manually selected or preset before processing.
Both kinds of equalizer share a drawback: the optimal equalization profile, the desired spectral balance, or the related parameters must be selected manually, and they cannot be changed dynamically based on the audio content being played back. Distinguishing audio content types is very important for providing overall high quality for different kinds of audio signals. For example, different musical pieces need different equalization profiles, such as profiles for music of different genres.

In an equalizer system into which various audio signals (not only music) may be input, the equalizer parameters need to be adjusted based on the content type. For example, an equalizer is usually enabled for music signals but disabled for speech signals, because equalization may change the timbre of speech too much and correspondingly make the signal sound unnatural.

To address this problem at least in part, the present application proposes controlling the equalizer based on the embodiments discussed in Part 1.
As in Parts 2 to 4, as a specific example of the audio processing apparatus and methods discussed in Part 1, the equalizer 408 may use all of the embodiments discussed in Part 1 and any combinations of those embodiments. Specifically, in the case of controlling the equalizer 408, the audio classifier 200 and the adjusting unit 300 of the audio processing apparatus 100 shown in Figures 1 to 10 may constitute the controller 2200 of the equalizer 408 as shown in Figure 22. In this embodiment, since the adjusting unit is dedicated to the equalizer 408, it may be referred to as 300D.

That is, based on the disclosure of Part 1, the equalizer controller 2200 may comprise: an audio classifier 200 for continuously identifying the audio type of an audio signal; and an adjusting unit 300D for adjusting the equalizer in a continuous manner based on the confidence value of the identified audio type. Similarly, the audio classifier 200 may comprise at least one of an audio content classifier 202 and an audio context classifier 204, and the equalizer controller 2200 may further comprise at least one of a type smoothing unit 712, a parameter smoothing unit 814 and a timer 916.

Therefore, in this part, the contents already described in Part 1 are not repeated, and only some specific examples thereof are given.
5.1 Control based on content type

In general, for common audio content types such as music, speech, background sound and noise, the equalizer should be set differently for different content types. Similar to the traditional setting, the equalizer can be automatically enabled for music signals but disabled for speech; or, in a more continuous manner, a high equalization level can be set for music signals and a low equalization level for speech signals. In this way, the equalization level of the equalizer can be set automatically for the audio content.

Especially for music, it is observed that the equalizer does not work well on musical pieces with a dominant source, because if an unsuitable equalization is applied, the timbre of the dominant source may change significantly and sound unnatural. In view of this, it is preferable to set a low equalization level on musical pieces with a dominant source, while keeping a high equalization level on musical pieces without a dominant source. With this information, the equalizer can automatically set the equalization level for different music content.
Music can also be classified based on different attributes, such as genre, instrument, and general characteristics including rhythm, tempo and timbre. Just as different equalizer presets are used for different music genres, these music groups/types can each have their own optimal equalization profile or equalizer curve (in a traditional equalizer) or optimal desired spectral balance (in a dynamic equalizer).
As discussed before, the equalizer is usually enabled for music content but disabled for speech, because the equalizer may make dialogs sound bad due to timbre change. One way to achieve this automatically is to relate the equalization level to the content, specifically to the music confidence value and/or the speech confidence value obtained from the audio content classification module. Here, the equalization level can be interpreted as the weight of the applied equalization gains: the higher the level, the stronger the equalization applied. For this example, if the equalization level is 1, the full equalization profile is applied; if the equalization level is 0, all the gains are correspondingly 0 dB and thus no equalization is applied. The equalization level can be represented with different parameters in different implementations of the equalizer algorithm. One exemplary embodiment of this parameter is the equalizer weight as implemented in A. Seefeldt et al., US 2009/0097676 A1, "Calculating and Adjusting the Perceived Loudness and/or the Perceived Spectral Balance of an Audio Signal", the entire content of which is incorporated herein by reference.
Various control schemes can be designed to adjust the equalization level. For example, using the audio content type information, the speech confidence value or the music confidence value can be used to set the equalization level, such as:

L_eq ∝ Conf_music    (20)

or

L_eq ∝ 1 - Conf_speech    (21)

where L_eq is the equalization level, and Conf_music and Conf_speech represent the confidence values of music and speech.

That is, the adjusting unit 300D may be configured to positively correlate the equalization level with the confidence value of short-term music, or to negatively correlate the equalization level with the confidence value of speech.
The speech confidence value and the music confidence value can also be used jointly to set the equalization level. The general idea is that the equalization level should be high only when the music confidence value is high and the speech confidence value is low; otherwise the equalization level is low. For example:

L_eq = Conf_music·(1 - Conf_speech^α)    (22)

where the exponent α is applied to the speech confidence value in order to handle the non-zero speech confidence that may often occur in music signals. With this formula, a pure music signal without any speech component is fully equalized (the level equals 1). As stated in Part 1, α can be regarded as a weighting coefficient based on the importance of the content types, and can generally be set to 1 to 2.
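Formula (22) is a one-liner; the sketch below (illustrative names) shows the effect of the exponent α on the speech confidence value:

```python
def eq_level(conf_music, conf_speech, alpha=2.0):
    """Equalization level per formula (22):
    L_eq = Conf_music * (1 - Conf_speech ** alpha).
    alpha in [1, 2] de-emphasizes the small non-zero speech confidence
    that often occurs in music signals."""
    return conf_music * (1.0 - conf_speech ** alpha)
```

A pure music signal (conf_music = 1, conf_speech = 0) gets level 1; with alpha = 2, a residual speech confidence of 0.3 reduces the level by only 9% instead of 30%.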
If a larger weight is to be set on the confidence value of speech, the adjusting unit 300D may be configured to disable the equalizer 408 when the confidence value of the speech content type is greater than a threshold.
In the above description, the music and speech content types are used as examples. Alternatively or additionally, the confidence values of background sound and/or noise may also be considered. Specifically, the adjusting unit 300D may be configured to positively correlate the equalization level with the confidence value of background sound, and/or to negatively correlate the equalization level with the confidence value of noise.

As another embodiment, the confidence values can be used to derive normalized weights as described in Section 1.4. Assuming a desired equalization level is predefined for each content type (for example, 1 for music, 0 for speech, and 0.5 for noise and background sound), a formula similar to formula (8) can be applied.
The equalization level can also be smoothed, to avoid rapid changes that may introduce audible distortion at transition points. The parameter smoothing unit 814 as described in Section 1.5 can be used for this purpose.
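The smoothing itself can be a simple one-pole filter in the spirit of Section 1.5. This is a sketch with an assumed smoothing coefficient, not the exact behavior of the parameter smoothing unit 814:

```python
def smooth_level(prev_smoothed, new_level, alpha=0.9):
    """One-pole smoothing of the equalization level:
    smoothed(t) = alpha * smoothed(t-1) + (1 - alpha) * level(t).
    A larger alpha gives slower, less audible transitions."""
    return alpha * prev_smoothed + (1.0 - alpha) * new_level
```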
5.2 Probability of a dominant source in music

In order to avoid applying a high equalization level to music with a dominant source, the equalization level can also be related to the confidence value Conf_dom indicating whether a musical piece contains a dominant source, such as:

L_eq = 1 - Conf_dom    (23)

In this way, the equalization level is low for musical pieces with a dominant source and high for musical pieces without a dominant source.

Here, although the confidence value of music with a dominant source is described, the confidence value of music without a dominant source can also be used. That is, the adjusting unit 300D may be configured to positively correlate the equalization level with the confidence value of short-term music without a dominant source, and/or to negatively correlate the equalization level with the confidence value of short-term music with a dominant source.
As described in Section 1.1, although speech on the one hand, and music with or without a dominant source on the other hand, are content types on different hierarchical levels, they can be considered in parallel. By jointly considering the dominant-source confidence value together with the speech and music confidence values as described above, the equalization level can be set by combining at least one of formulas (20) to (21) with formula (23). One example merges all three formulas:

L_eq = Conf_music·(1 - Conf_speech)·(1 - Conf_dom)    (24)

To be more general, different weights based on the importance of the content types can also be applied to the different confidence values, for example in the manner of formula (22).
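Formula (24) can be sketched as follows (illustrative names); each factor can only pull the level down:

```python
def eq_level_with_dominant(conf_music, conf_speech, conf_dom):
    """Formula (24): L_eq = Conf_music * (1 - Conf_speech) * (1 - Conf_dom).
    Speech content or a dominant source in the music lowers the level."""
    return conf_music * (1.0 - conf_speech) * (1.0 - conf_dom)
```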
As another example, assuming that Conf_dom is computed only when the audio signal is music, a piecewise function can be set, such as:

L_eq = 1 - Conf_dom,                 if Conf_music > threshold
L_eq = Conf_music·(1 - Conf_speech), otherwise    (25)

That is, if the classification system is fairly certain that the audio is music (the music confidence value is greater than a threshold), the function sets the equalization level based on the dominant-source confidence value; otherwise, the equalization level is set based on the music and speech confidence values. In other words, the adjusting unit 300D may be configured to consider short-term music with/without a dominant source only when the confidence value of short-term music is greater than a threshold. Of course, the first or second half of formula (25) can be modified in the manner of formulas (20) to (24).
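Under the stated assumption that Conf_dom is only available when the signal is music, the piecewise function (25) can be sketched as follows (the threshold value is an illustrative choice):

```python
def eq_level_piecewise(conf_music, conf_speech, conf_dom, threshold=0.9):
    """A sketch of the piecewise function (25)."""
    if conf_music > threshold:
        # Fairly certain it is music: rely on the dominant-source confidence.
        return 1.0 - conf_dom
    # Otherwise fall back to the music/speech confidences, as in formula (22).
    return conf_music * (1.0 - conf_speech)
```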
The same smoothing scheme as described in Section 1.5 can also be applied, and the time constant α can further be set based on the transition type, such as a transition from music with a dominant source to music without a dominant source, or a transition from music without a dominant source to music with a dominant source. For this purpose, a formula similar to formula (4') can also be applied.
5.3 Equalizer presets

Besides adaptively adjusting the equalization level based on the confidence values of the audio content types, appropriate equalization profile presets or desired spectral balance presets can be automatically selected for different audio content, depending on its genre, instrument or other characteristics. Music of the same genre, music containing the same instrument, or music with the same musical attributes can share the same equalization profile preset or desired spectral balance preset.

For generality, the term "music type" is used here to represent a group of music with the same genre, the same instrument or similar musical attributes, and such groups can be regarded as another hierarchical level of the audio content types described in Section 1.1. An appropriate equalization profile, equalization level, and/or desired spectral balance preset can be associated with each music type. The equalization profile is the gain curve applied on the music signal, and it can be any one of the equalizer presets used for different music genres (such as classical, rock, jazz and folk). The desired spectral balance preset represents the desired timbre for each type. Figure 23 shows several examples of the desired spectral balance presets as implemented in Dolby Home Theater technology. Each example describes a desired spectral shape over the full audible frequency range. This shape is continuously compared with the spectral shape of the input audio, and the equalization gains are computed from this comparison so as to transform the spectral shape of the input audio into the preset spectral shape.
For a new musical piece, the closest type can be determined (hard decision), or a confidence value with respect to each music type can be computed (soft decision). Based on this information, the appropriate equalization profile or desired spectral balance preset can be determined for the given musical piece. The simplest way is to assign to it the corresponding profile of the best-matched type, such as:

P_eq = P_c*    (26)

where P_eq is the estimated equalization profile or desired spectral balance preset, and c* is the index of the best-matched music type (the dominant audio type), which can be obtained by selecting the type with the highest confidence value.
Furthermore, there may be more than one music type with a confidence value greater than 0, meaning that the musical piece has attributes more or less similar to those types. For example, a musical piece may feature multiple instruments, or it may have attributes of multiple genres. This inspires another way of estimating the appropriate equalization profile: considering all the types instead of only the single closest type. For example, a weighted sum can be used:

P_eq = Σ_{c=1}^{N} w_c·P_c    (27)

where N is the number of predefined types, P_c is the preset profile associated with each predefined music type (with index c), and w_c is the corresponding weight, the weights being normalized to 1 based on the respective confidence values. In this way, the estimated profile is a mixture of the profiles of the various music types. For example, for a musical piece having both jazz and rock attributes, the estimated profile is a profile in between the two.
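The weighted sum (27) blends whole gain curves band by band. The sketch below assumes each preset is a list of per-band gains in dB (illustrative data; real presets would come from curves like those in Figure 23):

```python
def blend_presets(presets, confidences):
    """Mix per-type equalization presets with weights normalized from the
    confidence values, as in formula (27): P_eq = sum_c w_c * P_c."""
    total = sum(confidences.values())
    if total == 0.0:
        raise ValueError("no confident music type")
    n_bands = len(next(iter(presets.values())))
    mixed = [0.0] * n_bands
    for c, conf in confidences.items():
        w = conf / total  # w_c, normalized so the weights sum to 1
        for b in range(n_bands):
            mixed[b] += w * presets[c][b]
    return mixed

# A piece that is half jazz, half rock gets a curve in between the two.
PRESETS = {"jazz": [2.0, 0.0, -1.0], "rock": [0.0, 2.0, 3.0]}
mid = blend_presets(PRESETS, {"jazz": 0.5, "rock": 0.5})
```

Keeping only the N' highest-confidence entries of `confidences` before the call gives the subset variant of formula (28).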
In some applications, it may be undesirable to involve all the types as in formula (27). Only a subset of the types, namely those most related to the current musical piece, needs to be considered, and formula (27) can be slightly revised to:

P_eq = Σ_{c'=1}^{N'} w_{c'}·P_{c'}    (28)

where N' is the number of types to be considered, and c' is the type index after sorting the types in descending order of their confidence values. By using the subset, the closest types are focused on and the less related types are excluded. In other words, the adjusting unit 300D is further configured to consider at least one dominant audio type based on the confidence values.
In the above description, music types are used as an example. In fact, the scheme is applicable to audio types on any hierarchical level as shown in Section 1.1. Therefore, in general, the adjusting unit 300D may be configured to assign an equalization level and/or an equalization profile and/or a spectral balance preset to each audio type.
5.4 Control based on context type

In the previous sections, the discussion focused on various content types. In the further embodiments discussed in this section, context types are considered alternatively or additionally.

Generally, the equalizer is enabled for music but disabled for movie-like media content, because the equalizer may make the dialogs in movie-like media sound bad due to obvious timbre changes. This indicates that the equalization level can be associated with the confidence value of long-term music and/or the confidence value of movie-like media:

L_eq ∝ Conf_MUSIC    (29)

or

L_eq ∝ 1 - Conf_MOVIE    (30)

where L_eq is the equalization level, and Conf_MUSIC and Conf_MOVIE represent the confidence values of long-term music and movie-like media.
That is, the adjusting unit 300D may be configured to positively correlate the equalization level with the confidence value of long-term music, or to negatively correlate the equalization level with the confidence value of movie-like media.

In other words, for a movie-like media signal, the movie-like media confidence value is high (or the music confidence value is low), so the equalization level is low; for a music signal, the movie-like media confidence value is low (or the music confidence value is high), so the equalization level is high.

The solutions shown in formulas (29) and (30) can be modified in the same manner as formulas (22) to (25), and/or formulas (29) and (30) can be combined with any one of the schemes shown in formulas (22) to (25).
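One possible combination of formulas (29) and (30), analogous in form to formula (22), can be sketched as follows (an illustrative sketch, not the only combination intended by the text):

```python
def eq_level_context(conf_music, conf_movie):
    """Context-based equalization level: high only when the long-term
    music confidence is high AND the movie-like media confidence is low."""
    return conf_music * (1.0 - conf_movie)
```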
Additionally or alternatively, the adjusting unit 300D may be configured to negatively correlate the equalization level with the confidence value of game.

As another embodiment, the confidence values can be used to derive normalized weights as described in Section 1.4. Assuming a desired equalization level/profile is predefined for each context type (the equalization profiles are shown in Table 2 below), a formula similar to formula (9) can also be applied.

Table 2:

|                      | Movie-like media | Long-term music | VoIP      | Game      |
| Equalization profile | Profile 1        | Profile 2       | Profile 3 | Profile 4 |

Here, all the gains in some of the profiles can be set to zero, as a way of disabling the equalizer for certain context types (such as movie-like media and game).
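A sketch of the Table 2 mapping with hard decisions; the five-band profiles are invented placeholders, and the all-zero curves show how setting every gain to zero disables the equalizer for a context:

```python
# Hypothetical per-context equalization profiles (per-band gains in dB).
# All-zero profiles effectively disable the equalizer for that context.
PROFILES = {
    "movie-like media": [0.0, 0.0, 0.0, 0.0, 0.0],  # Profile 1: disabled
    "long-term music":  [3.0, 1.0, 0.0, 1.0, 2.0],  # Profile 2
    "voip":             [0.0, 2.0, 3.0, 2.0, 0.0],  # Profile 3
    "game":             [0.0, 0.0, 0.0, 0.0, 0.0],  # Profile 4: disabled
}

def select_profile(confidences):
    """Hard decision: pick the profile of the dominant context type.
    A soft, formula-(9)-style blend of the profiles is equally possible."""
    dominant = max(confidences, key=confidences.get)
    return PROFILES[dominant]
```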
5.5 Combination of embodiments and application scenarios

As in Part 1, all of the embodiments above and their variants may be implemented in any combination thereof, and any components mentioned in different parts/embodiments but having the same or similar functions may be implemented as the same component or as separate components.

For example, any two or more of the solutions described in Sections 5.1 to 5.4 can be combined with one another. And these combinations can be further combined with any embodiment described or implied in Parts 1 to 4 and in the other parts to be described later.
5.6 Equalizer control methods

As in Part 1, in describing the equalizer controller in the embodiments above, some processes and methods are obviously also disclosed. Hereinafter, a summary of these methods is given without repeating the details already discussed.

First, the embodiments of the audio processing method discussed in Part 1 can be used for an equalizer, whose parameters are among the targets to be adjusted by the audio processing method. In this respect, the audio processing method is also an equalizer control method.

In this section, only those aspects specific to the control of the equalizer are discussed. For general aspects of the control method, reference may be made to Part 1.

According to some embodiments, an equalizer control method may comprise: identifying the audio type of an audio signal in real time; and adjusting the equalizer in a continuous manner based on the identified audio type.

As in the other parts of the application, when multiple audio types with respective confidence values are involved, the adjusting operation 1104 may be configured to consider at least some of the multiple audio types by weighting the confidence values of the multiple audio types based on their importance, or by weighting the effects of the multiple audio types based on the confidence values. In particular, the adjusting operation 1104 may be configured to consider at least one dominant audio type based on the confidence values.

As described in Part 1, the adjusted parameter values may be smoothed. Reference may be made to Sections 1.5 and 1.8, and the detailed description is omitted here.
The audio type may be a content type, a context type, or both. When content types are involved, the adjusting operation 1104 may be configured to positively correlate the equalization level with the confidence value of short-term music, and/or to negatively correlate the equalization level with the confidence value of speech. Additionally or alternatively, the adjusting operation may be configured to positively correlate the equalization level with the confidence value of background sound, and/or to negatively correlate the equalization level with the confidence value of noise.

When context types are involved, the adjusting operation 1104 may be configured to positively correlate the equalization level with the confidence value of long-term music, and/or to negatively correlate the equalization level with the confidence value of movie-like media and/or game.

For the content type of short-term music, the adjusting operation 1104 may be configured to positively correlate the equalization level with the confidence value of short-term music without a dominant source, and/or to negatively correlate the equalization level with the confidence value of short-term music with a dominant source. This may be done only when the confidence value of short-term music is greater than a threshold.
Besides the equalization level, other aspects of the equalizer may be adjusted based on the confidence values of the audio types of the audio signal. For example, the adjusting operation 1104 may be configured to assign an equalization level and/or an equalization profile and/or a spectral balance preset to each audio type.
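As a rough illustration only, the confidence-weighted combination of per-type equalization levels described above might be sketched as follows. The type names and preset levels below are assumptions for the example, not values from this specification.

```python
# Hypothetical sketch: combine per-content-type equalization levels into a
# single level, using the classifier confidence values as weights.

# Illustrative assumed preset levels: high for short-term music and
# background sound, low for speech and noise.
PRESET_LEVELS = {"music": 1.0, "background": 0.6, "speech": 0.0, "noise": 0.1}

def equalization_level(confidences):
    """Confidence-weighted average of the per-type preset levels."""
    total = sum(confidences.get(t, 0.0) for t in PRESET_LEVELS)
    if total == 0.0:
        return 0.0  # no known type detected: leave equalization off
    return sum(PRESET_LEVELS[t] * confidences.get(t, 0.0)
               for t in PRESET_LEVELS) / total

# A segment dominated by music yields a level close to the music preset:
level = equalization_level({"music": 0.9, "speech": 0.1})   # 0.9
```

Because the result varies continuously with the confidence values, this kind of combination also realizes the continuous adjustment mentioned above, rather than hard switching between presets.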
For specific examples of the audio types, reference may be made to Part 1.
Similar to the embodiments of the audio processing apparatus, on one hand, any combination of the embodiments of the audio processing method and their variations is feasible; on the other hand, every aspect of the embodiments and their variations may be a separate solution. In addition, any two or more solutions described in this subsection may be combined with each other, and these combinations may be further combined with any embodiment described or implied elsewhere in this disclosure.
Part 6: Audio Classifiers and Classification Methods
As stated in Sections 1.1 and 1.2, the audio types discussed in this application, including content types and context types of various hierarchical levels, may be classified or identified with existing classification schemes based on machine-learning methods. In this part and the following part, as mentioned in the previous parts, the present application proposes some novel aspects of classifiers and methods for classifying context types.
6.1 Context classification based on content-type classification
As mentioned in the previous parts, the audio classifier 200 is used to identify the content type of an audio signal and/or the context type of the audio signal. Accordingly, the audio classifier 200 may comprise an audio content classifier 202 and/or an audio context classifier 204. When implemented with prior-art techniques, the audio content classifier 202 and the audio context classifier 204 may be independent of each other, although they may share some features and thus some schemes for extracting the features.
In this part and the following Part 7, according to the novel aspects proposed in this application, the audio context classifier 204 may use the results of the audio content classifier 202. That is, the audio classifier 200 may comprise: an audio content classifier for identifying the content type of an audio signal; and an audio context classifier 204 for identifying the context type of the audio signal based on the results of the audio content classifier 202. In this way, the classification results of the audio content classifier 202 may be used both by the audio context classifier 204 and by the adjusting unit 300 (or the adjusting units 300A to 300D) as described in the parts above. However, although not shown in the drawings, the audio classifier 200 may also comprise two audio content classifiers 202, used respectively by the adjusting unit 300 and by the audio context classifier 204.
In addition, as described in Section 1.2, especially when classifying multiple audio types, the audio content classifier 202 or the audio context classifier 204 may comprise a group of cooperating classifiers, although each may also be implemented as a single classifier.
As described in Section 1.1, a content type is a kind of audio type with respect to short-term audio segments generally having a length on the order of several frames to tens of frames (such as 1 s), while a context type is a kind of audio type with respect to long-term audio segments generally having a length on the order of several seconds to tens of seconds (such as 10 s). Therefore, corresponding to "content type" and "context type", the terms "short-term" and "long-term" are used respectively where necessary. However, as will be described in the following Part 7, although a context type indicates a property of the audio signal on a relatively long time scale, it may also be identified based on features extracted from short-term audio segments.
Now, the structure of the audio content classifier 202 and the audio context classifier 204 is described with reference to Figure 24.
As shown in Figure 24, the audio content classifier 202 may comprise a short-term feature extractor 2022 for extracting short-term features from short-term audio segments each comprising a sequence of audio frames; and a short-term classifier 2024 for classifying a sequence of short-term segments in a long-term audio segment into short-term audio types using the respective short-term features. Both the short-term feature extractor 2022 and the short-term classifier 2024 may be implemented with prior-art techniques, but some improvements for the short-term feature extractor 2022 are also proposed in the following Section 6.3.
The short-term classifier 2024 may be configured to classify each short-term segment of the sequence of short-term segments into at least one of the following short-term audio types (content types) explained in Section 1.1: speech, short-term music, background sound and noise. Each content type may also be further classified into content types of lower hierarchical levels, such as those discussed in Section 1.1, but not limited thereto.
As known in the art, confidence values of the classified audio types may also be obtained by the short-term classifier 2024. In this application, whenever the operation of any classifier is mentioned, it shall be understood that, whether or not explicitly stated, the classifier also obtains confidence values at the same time if necessary. Examples of audio-type classification may be found in L. Lu, H.-J. Zhang and S. Li, "Content-based Audio Classification and Segmentation by Using Support Vector Machines", ACM Multimedia Systems Journal 8(6), pp. 482-492 (March 2003), the entire contents of which are incorporated herein by reference.
On the other hand, as shown in Figure 24, the audio context classifier 204 may comprise a statistics extractor 2042 for calculating, as long-term features, statistics of the results of the short-term classifier with respect to the sequence of short-term segments in the long-term audio segment; and a long-term classifier 2044 for classifying the long-term audio segment into long-term audio types using the long-term features. Similarly, both the statistics extractor 2042 and the long-term classifier 2044 may be implemented with prior-art techniques, but some improvements for the statistics extractor 2042 are also proposed in the following Section 6.2.
The long-term classifier 2044 may be configured to classify the long-term audio segment into at least one of the following long-term audio types (context types) explained in Section 1.1: movie-like media, long-term music, game and VoIP. Alternatively or additionally, the long-term classifier 2044 may be configured to classify the long-term audio segment into VoIP or non-VoIP as explained in Section 1.1. Alternatively or additionally, the long-term classifier 2044 may be configured to classify the long-term audio segment into high-quality audio or low-quality audio as explained in Section 1.1.
In practice, various target audio types may be selected and trained based on the needs of the application/system. For the meaning and selection of the short-term segments and long-term segments (as well as the frames to be discussed in Section 6.3), reference may be made to Section 1.1.
6.2 Extraction of long-term features
As shown in Figure 24, in one embodiment, only the statistics extractor 2042 is used to extract long-term features from the results of the short-term classifier 2024. As the long-term features, at least one of the following data may be calculated by the statistics extractor 2042: the mean and variance of the confidence values of the short-term audio types of the short-term segments in the long-term segment to be classified, the above mean and variance weighted by the importance levels of the short-term segments, the occurrence frequency of each short-term audio type in the long-term segment to be classified, and the frequency of transitions between different short-term audio types.
Figure 25 shows the means of the speech confidence values and of the short-term music confidence values in each short-term segment (of length 1 s). For contrast, the segments are extracted from three different audio contexts: movie-like media (Figure 25(A)), long-term music (Figure 25(B)) and VoIP (Figure 25(C)). It can be observed that, for the movie-like media context, high confidence values are obtained both for the speech type and for the music type, and the two audio types alternate frequently. In contrast, a segment of long-term music gives a stable and high short-term music confidence value and a relatively stable and low speech confidence value, whereas a VoIP segment gives a stable and low short-term music confidence value but a fluctuating speech confidence value due to the pauses during a VoIP conversation.
The variance of the confidence value of each audio type is also a feature for classifying different audio contexts. Figure 26 gives histograms of the variances of the confidence values of speech, short-term music, background and noise in the movie-like media (Figure 26(A)), long-term music (Figure 26(B)) and VoIP (Figure 26(C)) audio contexts (the abscissa is the variance of the confidence values in the data set, and the ordinate is the occurrence count in each bin of the variance values in the data set, which may be normalized to indicate the occurrence probability of each bin of the variance values). For movie-like media, the variances of the confidence values of speech, short-term music and background are all relatively high and widely distributed, indicating that the confidence values of these audio types change strongly; for long-term music, the variances of the confidence values of speech, short-term music and background are all relatively low and narrowly distributed, indicating that the confidence values of these audio types remain stable: the speech confidence value remains constantly low and the music confidence value remains constantly high; for VoIP, the variance of the confidence value of short-term music is relatively low and narrowly distributed, while the variance of the confidence value of speech is relatively widely distributed due to the frequent pauses during a VoIP conversation.
As to the weights for calculating the weighted mean and variance, they are determined based on the importance of each short-term segment. The importance of a short-term segment may be measured by its energy or loudness, which may be estimated with many prior-art techniques.
The occurrence frequency of each short-term audio type in the long-term segment to be classified is the count of occurrences of each audio type into which the short-term segments in the long-term segment are classified, normalized by the length of the long-term segment.
The frequency of transitions between different short-term audio types in the long-term segment to be classified is the count of audio-type changes between adjacent short-term segments in the long-term segment to be classified, normalized by the length of the long-term segment.
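Under these definitions, a brief sketch of both frequencies, normalizing by the number of short-term segments in the long-term segment:

```python
# Sketch: occurrence frequency of each short-term audio type and the
# frequency of transitions between different types, both normalized by the
# number of short-term segments in the long-term segment.
from collections import Counter

def occurrence_frequencies(labels):
    n = len(labels)
    return {t: c / n for t, c in Counter(labels).items()}

def transition_frequency(labels):
    # Count adjacent pairs whose audio types differ.
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes / len(labels)

labels = ["speech", "speech", "music", "speech"]
freqs = occurrence_frequencies(labels)   # speech: 0.75, music: 0.25
rate = transition_frequency(labels)      # 2 changes / 4 segments = 0.5
```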
When discussing the means and variances of the confidence values with reference to Figure 25, the occurrence frequency of each short-term audio type and the frequency of transitions between the different short-term audio types were actually also involved. These features are also highly related to the audio context classification. For example, long-term music mainly comprises short-term segments of the music type, so long-term music has a high occurrence frequency of short-term music, while VoIP mainly comprises speech and pauses, so VoIP has a high occurrence frequency of speech or noise. As another example, movie-like media transitions between different short-term audio types more frequently than long-term music or VoIP, so movie-like media generally has a higher transition frequency among short-term music, speech and background; and VoIP generally transitions between speech and noise more frequently than the other types, so VoIP generally has a higher transition frequency between speech and noise.
Generally, it is assumed that the long-term segments have the same length in the same application/system. If this is the case, the occurrence count of each short-term audio type and the count of transitions between different short-term audio types in a long-term segment may be used directly, without normalization. If the length of the long-term segments is variable, the occurrence frequency and the transition frequency as mentioned above shall be used. The claims of this application shall be construed to cover both cases.
Additionally or alternatively, the audio classifier 200 (or the audio context classifier 204) may further comprise a long-term feature extractor 2046 (Figure 27) for extracting further long-term features from the long-term audio segment based on the short-term features of the sequence of short-term segments in the long-term audio segment. In other words, the long-term feature extractor 2046 does not use the classification results of the short-term classifier 2024, but directly uses the short-term features extracted by the short-term feature extractor 2022 to obtain some long-term features to be used by the long-term classifier 2044. The long-term feature extractor 2046 and the statistics extractor 2042 may be used independently or in combination. In other words, the audio classifier 200 may comprise either the long-term feature extractor 2046 or the statistics extractor 2042, or both.
Any features may be extracted by the long-term feature extractor 2046. In this application, it is proposed to calculate, as the long-term features, at least one of the following statistics of the short-term features from the short-term feature extractor 2022: mean, variance, weighted mean, weighted variance, high average, low average, and the ratio between the high average and the low average (contrast).
Mean and variance: the mean and variance of the short-term features extracted from the short-term segments in the long-term segment to be classified;
Weighted mean and weighted variance: the weighted mean and weighted variance of the short-term features extracted from the short-term segments in the long-term segment to be classified. The short-term features are weighted based on the importance of each short-term segment, measured by its energy or loudness as just mentioned;
High average: the average of selected short-term features extracted from the short-term segments in the long-term segment to be classified. A short-term feature is selected when it meets at least one of the following conditions: it is greater than a threshold; or it is within a predetermined proportion of the highest short-term features, for example, the highest 10% of the short-term features;
Low average: the average of selected short-term features extracted from the short-term segments in the long-term segment to be classified. A short-term feature is selected when it meets at least one of the following conditions: it is less than a threshold; or it is within a predetermined proportion of the lowest short-term features, for example, the lowest 10% of the short-term features; and
Contrast: the ratio between the high average and the low average, representing the dynamics of the short-term features in the long-term segment.
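A minimal sketch of the high average, low average and contrast statistics, using the 10% selection rule given above as an example (the proportion is only one of the example conditions):

```python
# Sketch: high average, low average, and their ratio (contrast) over the
# short-term feature values of one long-term segment.

def high_low_contrast(values, ratio=0.1):
    s = sorted(values)
    k = max(1, int(len(s) * ratio))   # at least one value is selected
    low_avg = sum(s[:k]) / k          # mean of the lowest `ratio` of values
    high_avg = sum(s[-k:]) / k        # mean of the highest `ratio` of values
    contrast = high_avg / low_avg if low_avg else float("inf")
    return high_avg, low_avg, contrast

# Ten feature values: top-10% mean is 10.0, bottom-10% mean is 1.0.
hi, lo, c = high_low_contrast([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

A large contrast indicates highly dynamic short-term features within the long-term segment, in line with the definition above.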
The short-term feature extractor 2022 may be implemented with prior-art techniques, and thus any features may be extracted. Even so, some improvements for the short-term feature extractor 2022 are proposed in the following Section 6.3.
6.3 Extraction of short-term features
As shown in Figures 24 and 27, the short-term feature extractor 2022 may be configured to extract at least one of the following features directly from each short-term audio segment: rhythmic characteristics, interruptions/mutes, and short-term audio quality features.
The rhythmic characteristics may include rhythm strength, rhythm regularity, rhythm clarity (see L. Lu, D. Liu, and H.-J. Zhang, "Automatic mood detection and tracking of music audio signals", IEEE Transactions on Audio, Speech, and Language Processing, 14(1):5-18, 2006, the entire contents of which are incorporated herein by reference) and 2D subband modulation (M. F. McKinney and J. Breebaart, "Features for audio and music classification", Proc. ISMIR, 2003, the entire contents of which are incorporated herein by reference).
The interruptions/mutes may include speech interruptions, sudden sharp drops, mute lengths, unnatural silence, the mean of unnatural silence, the total energy of unnatural silence, and so on.
The short-term audio quality features are audio quality features with respect to the short-term segments, similar to the audio quality features extracted from audio frames, which will be discussed below.
Alternatively or additionally, as shown in Figure 28, the audio classifier 200 may comprise a frame-level feature extractor 2012 for extracting frame-level features from each frame of the sequence of audio frames comprised in a short-term segment, and the short-term feature extractor 2022 may be configured to calculate the short-term features based on the frame-level features extracted from the sequence of audio frames.
As pre-processing, the input audio signal may be down-mixed into a mono audio signal. The pre-processing is unnecessary if the audio signal is already a mono signal. The signal is then divided into frames with a predefined length (usually 10 to 25 milliseconds). Correspondingly, frame-level features are extracted from each frame.
The frame-level feature extractor 2012 may be configured to extract at least one of the following features: features characterizing the attributes of various short-term audio types, cut-off frequency, static signal-to-noise ratio (static SNR) characteristics, segmental signal-to-noise ratio (segmental SNR) characteristics, basic speech descriptors, and vocal tract characteristics.
The features characterizing the attributes of various short-term audio types (especially speech, short-term music, background sound and noise) may include at least one of the following features: frame energy, subband spectral distribution, spectral flux, Mel-frequency cepstral coefficients (MFCC), bass, residual information, chroma features and zero-crossing rate.
For details of MFCC, reference may be made to L. Lu, H.-J. Zhang and S. Li, "Content-based Audio Classification and Segmentation by Using Support Vector Machines", ACM Multimedia Systems Journal 8(6), pp. 482-492 (March 2003), the entire contents of which are incorporated herein by reference. For details of the chroma features, reference may be made to G. H. Wakefield, "Mathematical representation of joint time Chroma distributions", SPIE, 1999, the entire contents of which are incorporated herein by reference.
The cut-off frequency represents the highest frequency of an audio signal, above which the content energy is close to zero. It is designed to detect band-limited content, which is useful for audio context classification in this application. A cut-off frequency is usually caused by coding, since most encoders discard high-frequency components at low or medium bit rates. For example, the MP3 codec has a cut-off frequency of 16 kHz at 128 kbps; as another example, many popular VoIP codecs have a cut-off frequency of 8 kHz or 16 kHz.
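A rough sketch of detecting the cut-off frequency from a magnitude spectrum follows; the relative-energy threshold is an assumption for the example, as the specification does not fix a particular detector.

```python
# Sketch (assumed threshold): estimate the cut-off frequency as the highest
# frequency bin whose magnitude stays above a small fraction of the peak,
# since the content above the cut-off has energy close to zero.

def estimate_cutoff(spectrum, sample_rate, rel_threshold=0.01):
    """spectrum: magnitudes per bin from 0 Hz to Nyquist, evenly spaced."""
    peak = max(spectrum)
    nyquist = sample_rate / 2.0
    for i in range(len(spectrum) - 1, -1, -1):   # scan from the top down
        if spectrum[i] >= rel_threshold * peak:
            return nyquist * i / (len(spectrum) - 1)
    return 0.0

# A band-limited spectrum: energy only in the lower half of the bins.
spec = [1.0] * 50 + [0.0] * 50
cutoff = estimate_cutoff(spec, 32000)   # close to 8 kHz, as for a VoIP codec
```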
Besides the cut-off frequency, the signal degradation during the audio coding process is also exploited as another characteristic for distinguishing various audio contexts, such as VoIP contexts versus non-VoIP contexts, and high-quality audio contexts versus low-quality audio contexts. The features representing the audio quality may further be extracted at multiple levels, such as the objective speech quality assessment features (see Ludovic Malfait, Jens Berger, and Martin Kastner, "P.563-The ITU-T Standard for Single-Ended Speech Quality Assessment", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 6 (November 2006), the entire contents of which are incorporated herein by reference), to obtain richer features. Examples of the audio quality features include:
a) Static SNR characteristics, including the estimated background noise level, spectral clarity, and so on.
b) Segmental SNR characteristics, including the spectral level deviation, the spectral level range, the relative noise floor, and so on.
c) Basic speech descriptors, including the pitch average, the speech section level variation, the speech level, and so on.
d) Vocal tract characteristics, including robotization, the pitch cross power, and so on.
To derive the short-term features from the frame-level features, the short-term feature extractor 2022 may be configured to calculate statistics of the frame-level features as the short-term features.
Examples of such statistics include the mean and the standard deviation, which capture the rhythmic characteristics for distinguishing various audio types, such as short-term music, speech, background sound and noise. For example, speech usually alternates between voiced and unvoiced sounds at a syllabic rate, whereas music does not, indicating that the variation of the frame-level features of speech is usually larger than that of music.
Another example of the statistics is the weighted average of the frame-level features. For example, for the cut-off frequency, the weighted average of the cut-off frequencies extracted from the audio frames in a short-term segment, with the energy or loudness of each frame as the weight, will be the cut-off frequency of that short-term segment.
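For instance, the energy-weighted average just described can be sketched as follows (function name assumed):

```python
# Sketch: the short-term cut-off frequency as the energy-weighted average of
# the per-frame cut-off frequencies.

def weighted_cutoff(frame_cutoffs, frame_energies):
    total = sum(frame_energies)
    return sum(c * e for c, e in zip(frame_cutoffs, frame_energies)) / total

# Loud frames dominate: a quiet 4 kHz frame barely moves the estimate.
cut = weighted_cutoff([16000.0, 4000.0], [9.0, 1.0])   # 14800.0 Hz
```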
Alternatively or additionally, as shown in Figure 29, the audio classifier 200 may comprise: a frame-level feature extractor 2012 for extracting frame-level features from the audio frames; and a frame-level classifier 2014 for classifying each frame of the sequence of audio frames into frame-level audio types using the respective frame-level features, wherein the short-term feature extractor 2022 may be configured to calculate the short-term features based on the results of the frame-level classifier 2014 with respect to the sequence of audio frames.
In other words, besides the audio content classifier 202 and the audio context classifier 204, the audio classifier 200 may also comprise a frame classifier 201. In such an architecture, the audio content classifier 202 classifies the short-term segments based on the frame-level classification results of the frame classifier 201, and the audio context classifier 204 classifies the long-term segments based on the short-term classification results of the audio content classifier 202.
The frame-level classifier 2014 may be configured to classify each audio frame of the sequence of audio frames into any classes, which may be referred to as "frame-level audio types". In one embodiment, the frame-level audio types may have an architecture similar to that of the content types discussed above, and may also have meanings similar to those of the content types, the only difference being that the frame-level audio types and the content types are classified at different levels of the audio signal, namely the frame level and the short-term segment level respectively. For example, the frame-level classifier 2014 may be configured to classify each frame of the sequence of audio frames into at least one of the following frame-level audio types: speech, music, background sound and noise. On the other hand, the frame-level audio types may also have an architecture partly or entirely different from the architecture of the content types, more suitable for frame-level classification and more suitable for use as short-term features for the short-term classification. For example, the frame-level classifier 2014 may be configured to classify each frame of the sequence of audio frames into at least one of the following frame-level audio types: voiced, unvoiced and pause.
As to how to extract the short-term features from the frame-level classification results, a scheme similar to that described in Section 6.2 may be adopted.
Alternatively, the short-term classifier 2024 may use both the short-term features based on the results of the frame-level classifier 2014 and the short-term features directly based on the frame-level features obtained from the frame-level feature extractor 2012. Therefore, the short-term feature extractor 2022 may be configured to calculate the short-term features based on both the frame-level features extracted from the sequence of audio frames and the results of the frame-level classifier with respect to the sequence of audio frames.
In other words, the frame-level feature extractor 2012 may be configured to calculate both statistics similar to those discussed in Section 6.2 and the short-term features described in connection with Figure 28, including at least one of the following features: features characterizing the attributes of various short-term audio types, cut-off frequency, static signal-to-noise ratio characteristics, segmental signal-to-noise ratio characteristics, basic speech descriptors, and vocal tract characteristics.
To work in real time, in all the embodiments, the short-term feature extractor 2022 may be configured to operate on short-term audio segments formed by a sliding window that slides with a predetermined step on the time dimension of the long-term audio segment. For details of the sliding window for the short-term audio segments, as well as of the audio frames and the sliding window for the long-term audio segments, reference may be made to Section 1.1.
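The sliding-window sampling can be sketched as follows; the window length and step below are illustrative assumptions, and Section 1.1 discusses the actual choices.

```python
# Sketch: sample short-term segments from a long-term segment with a sliding
# window that advances by a predetermined step (all lengths in frames).

def sliding_windows(n_frames, window_len, step):
    """Start/end frame indices of each short-term window."""
    return [(start, start + window_len)
            for start in range(0, n_frames - window_len + 1, step)]

# A 10 s fragment of 1000 frames, 1 s (100-frame) windows, 0.5 s step:
wins = sliding_windows(1000, 100, 50)   # 19 overlapping windows
```

A step smaller than the window length yields overlapping short-term segments, which lowers the latency of updating the classification results.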
6.4 Combination of embodiments and application scenarios
Similar to Part 1, all the embodiments and variations thereof discussed above may be implemented in any combination thereof, and any components mentioned in different parts/embodiments but having the same or similar functions may be implemented as the same or separate components.
For example, any two or more of the solutions described in Sections 6.1 to 6.3 may be combined with each other. And these combinations may be further combined with any embodiment described or implied in Parts 1 to 5 and the other parts to be described later. In particular, the type smoothing unit 712 described in Part 1 may be used in this part as a component of the audio classifier 200, for smoothing the results of the frame classifier 2014, the audio content classifier 202 or the audio context classifier 204. In addition, the timer 916 may also serve as a component of the audio classifier 200, to avoid abrupt changes of the output of the audio classifier 200.
6.5 Audio classification methods
Similar to Part 1, in the course of describing the audio classifiers in the embodiments above, some processes or methods are obviously also disclosed. Hereinafter, a summary of these methods is given without repeating the details already discussed.
As shown in Figure 30, in one embodiment an audio classification method is provided. To identify the long-term audio type (that is, the context type) of a long-term audio segment comprising a sequence of short-term audio segments (overlapping with each other or not), the short-term audio segments are first classified into short-term audio types (operation 3004), that is, content types, and long-term features are obtained by calculating statistics of the results of the classifying operation with respect to the sequence of short-term segments in the long-term audio segment (operation 3006). Then, the long-term classification may be performed using the long-term features (operation 3008). A short-term audio segment may comprise a sequence of audio frames. Of course, to identify the short-term audio type of a short-term segment, short-term features need to be extracted from the short-term segment (operation 3002).
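A toy end-to-end sketch of the method of Figure 30 follows, with stub classifiers standing in for trained models; all function names, the stub logic and the threshold are assumptions for illustration only.

```python
# End-to-end sketch of Figure 30 with stub classifiers (names assumed).

def classify_short_term(segment):
    # Stub content-type classifier: returns (type, confidence).
    return ("music", 0.9) if sum(segment) > 0 else ("speech", 0.8)

def classify_long_term(segments):
    results = [classify_short_term(s) for s in segments]      # operation 3004
    # Long-term feature (operation 3006): occurrence frequency of music.
    music_freq = sum(1 for t, _ in results if t == "music") / len(results)
    # Long-term classification (operation 3008) via a simple threshold.
    return "long-term music" if music_freq > 0.5 else "other"

context = classify_long_term([[1.0], [1.0], [0.0]])   # "long-term music"
```

In practice the stubs would be replaced by trained classifiers, and the long-term stage would use the full set of statistics described above rather than a single frequency.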
The short-term audio types (content types) may include, but are not limited to, speech, short-term music, background sound and noise. The long-term features may include, but are not limited to: the mean and variance of the confidence values of the short-term audio types, the above mean and variance weighted by the importance of the short-term segments, the occurrence frequency of each short-term audio type, and the frequency of transitions between different short-term audio types.
As shown in Figure 31, in a variation, further long-term features may be obtained directly based on the short-term features of the sequence of short-term segments in the long-term audio segment. Such further long-term features may include, but are not limited to, the following statistics of the short-term features: mean, variance, weighted mean, weighted variance, high average, low average, and the ratio between the high average and the low average (contrast).
There are different ways to extract the short-term features. One way is to extract the short-term features directly from the short-term audio segment to be classified. Such features include, but are not limited to: rhythmic characteristics, interruptions/mutes, and short-term audio quality features.
The second way is to extract frame-level features from the audio frames comprised in each short-term segment (operation 3201 in Figure 32), and then calculate the short-term features based on the frame-level features, for example by calculating statistics of the frame-level features as the short-term features. The frame-level features may include, but are not limited to: features characterizing the attributes of various short-term audio types, cut-off frequency, static signal-to-noise ratio characteristics, segmental signal-to-noise ratio characteristics, basic speech descriptors, and vocal tract characteristics. The features characterizing the attributes of various short-term audio types may further include: frame energy, subband spectral distribution, spectral flux, Mel-frequency cepstral coefficients, bass, residual information, chroma features and zero-crossing rate.
The third way is to extract the short-term features in a manner similar to the extraction of the long-term features: after extracting the frame-level features from the audio frames in the short-term segment to be classified (operation 3201), each audio frame is classified into frame-level audio types using the respective frame-level features (operation 32011 in Figure 33); and the short-term features may be extracted by calculating them based on the frame-level audio types (optionally together with the confidence values) (operation 3002). The frame-level audio types may have attributes and an architecture similar to those of the short-term audio types (content types), and may also include speech, music, background sound and noise.
The second way and the third way may be combined, as shown by the dashed arrow in Figure 33.
As discussed in Part 1, both short-term audio segments and long-term audio segments may be sampled with sliding windows. That is, the operation of extracting short-term features (operation 3002) may be carried out on short-term audio segments formed by a window sliding with a predetermined step size along the time dimension of the long-term audio segment, and the operation of extracting long-term features (operation 3107) and the operation of computing statistics of the short-term audio types (operation 3006) may likewise be carried out on long-term audio segments formed by a window sliding with a predetermined step size along the time dimension of the audio signal.
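The sliding-window sampling described above can be sketched as follows; the window length and step size are placeholders, since the patent only requires a predetermined step size along the time dimension.

```python
def sliding_windows(signal, win_len, step):
    """Form (possibly overlapping) analysis windows of length win_len
    that slide along the signal with a predetermined step size."""
    return [signal[i:i + win_len]
            for i in range(0, len(signal) - win_len + 1, step)]
```

The same helper applies at both levels: short-term segments are windows over a long-term segment, and long-term segments are windows over the whole audio signal.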
Similar to the embodiments of the audio processing apparatus, on the one hand, any combination of the embodiments of the audio processing method and their variations is feasible; on the other hand, each aspect of the embodiments of the audio processing method and their variations may also be a standalone solution. In addition, any two or more of the solutions described in this subsection may be combined with one another, and these combinations may be further combined with any embodiment described or implied elsewhere in this disclosure. In particular, as discussed in subsection 6.4, the smoothing scheme and the transition scheme for audio types may form part of the audio classification method discussed here.
Part 7: VoIP classifiers and classification methods
Part 6 proposed a novel audio classifier for classifying an audio signal into audio context types based at least in part on the results of a content-type classifier. In the embodiments discussed in Part 6, long-term features are extracted from long-term segments of several seconds to tens of seconds in length, and the audio context classification may therefore incur a long latency. It is desirable to be able to classify the audio context also in real time or near real time, for example at the level of short-term segments.
7.1 Context classification based on short-term segments
To this end, as shown in Figure 34, an audio classifier 200A is provided, comprising: an audio content classifier 202A for identifying the content type of a short-term segment of the audio signal; and an audio context classifier 204A for identifying the context type of the short-term segment based at least in part on the content type identified by the audio content classifier.
Here, the audio content classifier 202A may adopt the techniques already mentioned in Part 6, but may also use different techniques, as will be discussed in subsection 7.2 below. Moreover, the audio context classifier 204A may use the techniques already mentioned in Part 6, the difference being that the context classifier 204A may directly use the results of the audio content classifier 202A rather than statistics of those results, because the audio context classifier 204A and the audio content classifier 202A classify the same short-term segment. In addition, similar to Part 6, besides the results of the audio content classifier 202A, the audio context classifier 204A may use other features extracted directly from the short-term segment. That is, the audio context classifier 204A may be configured to classify the short-term segment using a machine-learning model that takes, as features, the confidence values of the content types of the short-term segment together with other features extracted from the short-term segment. Regarding the features extracted from the short-term segment, reference may be made to Part 6.
The audio content classifier 202A may simultaneously label the short-term segment with multiple audio types beyond VoIP speech/noise and/or non-VoIP speech/noise (VoIP speech/noise and non-VoIP speech/noise will be discussed in subsection 7.2 below), and each of the multiple audio types may be provided with its own confidence value, as discussed in subsection 1.2. This can achieve better classification accuracy because richer information is captured. For example, the joint information of the confidence values of speech and short-term music indicates to what extent the audio content is likely to be a mixture of speech and background music, enabling it to be distinguished from pure VoIP content.
7.2 Classification using VoIP speech and VoIP noise
This aspect of the application is particularly useful in a VoIP/non-VoIP classification system, which needs to classify the current short-term segment with a short decision latency. For this purpose, as shown in Figure 34, the audio classifier 200A is designed specifically for VoIP/non-VoIP classification. For VoIP/non-VoIP classification, a VoIP speech classifier 2026 and/or a VoIP noise classifier 2028 is developed to generate intermediate results for the final robust VoIP/non-VoIP classification by the audio context classifier 204A.
A short-term VoIP segment comprises either VoIP speech or VoIP noise. It has been observed that classifying a short-term segment of speech as VoIP speech or non-VoIP speech can reach high accuracy, but classifying a short-term segment of noise as VoIP noise or non-VoIP noise cannot. It was therefore concluded that classifying a short-term segment directly as VoIP (comprising VoIP speech and VoIP noise, but without specifically identifying VoIP speech and VoIP noise) or non-VoIP, without considering the difference between speech and noise, would mix the features of the two content types (speech and noise) and thus reduce discriminative power.
It is reasonable for a classifier to achieve higher accuracy in VoIP speech/non-VoIP speech classification than in VoIP noise/non-VoIP noise classification, because speech contains more information than noise, and features such as cutoff frequency are more effective for classifying speech. According to the feature weights obtained from an AdaBoost training process, the top-weighted short-term features for VoIP/non-VoIP speech classification are: the standard deviation of log energy, the cutoff frequency, the standard deviation of rhythm strength, and the standard deviation of spectral flux. The standard deviation of log energy, the standard deviation of rhythm strength, and the standard deviation of spectral flux of VoIP speech are usually higher than those of non-VoIP speech. One possible reason is that, in non-VoIP contexts such as movie-like media or games, many short-term speech segments are mixed with other sounds, such as background music or sound effects, which generally have lower values of the above features. Meanwhile, the cutoff frequency of VoIP speech is usually lower than that of non-VoIP speech, which reflects the relatively low cutoff frequency introduced by many popular VoIP codecs.
Therefore, in one embodiment, the audio content classifier 202A may comprise: a VoIP speech classifier 2026 for classifying the short-term segment into the VoIP speech content type or the non-VoIP speech content type; and the audio context classifier 204A may be configured to classify the short-term segment into the VoIP context type or the non-VoIP context type based on the confidence values of VoIP speech and non-VoIP speech.
In another embodiment, the audio content classifier 202A may further comprise: a VoIP noise classifier 2028 for classifying the short-term segment into the VoIP noise content type or the non-VoIP noise content type; and the audio context classifier 204A may be configured to classify the short-term segment into the VoIP context type or the non-VoIP context type based on the confidence values of VoIP speech, non-VoIP speech, VoIP noise, and non-VoIP noise.
As discussed in Part 6, subsection 1.2, and subsection 7.1, the content types of VoIP speech, non-VoIP speech, VoIP noise, and non-VoIP noise may be identified with prior-art techniques.
Alternatively, the audio content classifier 202A may have a hierarchical structure as shown in Figure 35. That is, the result of a speech/noise classifier 2025 is used to first classify the short-term segment as speech or noise/background.
On the basis of the embodiment using only the VoIP speech classifier 2026: if a short-term segment is determined as speech by the speech/noise classifier 2025 (which is a speech classifier in this case), the VoIP speech classifier 2026 then proceeds to decide whether it is VoIP speech or non-VoIP speech and computes a binary classification result; otherwise, the confidence value of VoIP speech is considered low, or the decision regarding VoIP speech is considered uncertain.
On the basis of the embodiment using only the VoIP noise classifier 2028: if a short-term segment is determined as noise by the speech/noise classifier 2025 (which is a noise (background) classifier in this case), the VoIP noise classifier 2028 then proceeds to decide whether it is VoIP noise or non-VoIP noise and computes a binary classification result; otherwise, the confidence value of VoIP noise is considered low, or the decision regarding VoIP noise is considered uncertain.
Here, because speech is usually the informative content type while noise/background is an interfering content type, even if a short-term segment is not noise, it cannot be determined with certainty in the embodiment of the preceding paragraph that the short-term segment is not of the VoIP context type. By contrast, if a short-term segment is not speech, then in the embodiment using only the VoIP speech classifier 2026 it is probably not of the VoIP context type. Therefore, the embodiment using only the VoIP speech classifier 2026 can generally be implemented independently, whereas the embodiment using only the VoIP noise classifier 2028 may serve as a supplement, cooperating for example with the embodiment using the VoIP speech classifier 2026.
That is, both the VoIP speech classifier 2026 and the VoIP noise classifier 2028 may be used. If a short-term segment is determined as speech by the speech/noise classifier 2025, the VoIP speech classifier 2026 proceeds to decide whether it is VoIP speech or non-VoIP speech and computes a binary classification result. If the short-term segment is determined as noise by the speech/noise classifier 2025, the VoIP noise classifier 2028 proceeds to decide whether it is VoIP noise or non-VoIP noise and computes a binary classification result. Otherwise, the short-term segment may be classified as non-VoIP.
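The hierarchical structure of Figure 35 can be sketched as below. The three classifiers are passed in as placeholder callables, since the patent leaves their implementations open (any prior-art technique or the audio content classifier 202); only the routing logic is shown.

```python
def hierarchical_classify(segment, speech_noise_clf, voip_speech_clf, voip_noise_clf):
    """Route a short-term segment through the two-level hierarchy:
    speech/noise first, then the matching VoIP sub-classifier."""
    kind = speech_noise_clf(segment)  # "speech", "noise", or something else
    if kind == "speech":
        return voip_speech_clf(segment)   # "VoIP speech" / "non-VoIP speech"
    if kind == "noise":
        return voip_noise_clf(segment)    # "VoIP noise" / "non-VoIP noise"
    return "non-VoIP"                     # neither speech nor noise
```

The final fall-through branch mirrors the text: a segment classified as neither speech nor noise is treated as non-VoIP.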
The speech/noise classifier 2025, the VoIP speech classifier 2026, and the VoIP noise classifier 2028 may be implemented with any prior-art technique or with the audio content classifier 202 discussed in Parts 1 through 6.
If the audio content classifier 202A implemented as described above ultimately fails to classify a short-term segment into speech, noise, or background, or fails to classify it into VoIP speech, non-VoIP speech, VoIP noise, or non-VoIP noise, meaning that all the relevant confidence values are low, then the audio content classifier 202A (and the audio context classifier 204A) classifies the short-term segment as non-VoIP.
To classify the short-term segment into the VoIP or non-VoIP context type based on the results of the VoIP speech classifier 2026 and the VoIP noise classifier 2028, the audio context classifier 204A may use the machine-learning-based technique discussed in subsection 7.1, and, as an improvement, may use more features, including short-term features extracted directly from the short-term segment and/or the results of other audio content classifiers for content types other than the VoIP-related content types, as discussed in subsection 7.1.
Besides the machine-learning-based technique described above, an alternative approach to VoIP/non-VoIP classification may exploit domain knowledge and heuristic rules that use the classification results related to VoIP speech and VoIP noise. An example of such a heuristic rule follows.
If the current short-term segment at time t is determined to be VoIP speech or non-VoIP speech, that classification result is directly taken as the VoIP/non-VoIP classification result, because VoIP/non-VoIP speech classification is robust, as discussed above. That is, if the short-term segment is determined to be VoIP speech, it is of the VoIP context type; if it is determined to be non-VoIP speech, it is of the non-VoIP context type.
When the VoIP speech classifier 2026 makes a binary decision on VoIP speech/non-VoIP speech for the speech determined as above by the speech/noise classifier 2025, the confidence values of VoIP speech and non-VoIP speech are likely complementary, i.e. they sum to 1 (with 0 representing 100% "no" and 1 representing 100% "yes"), and the thresholds on the confidence values used for distinguishing VoIP speech and non-VoIP speech may actually represent the same point. If the VoIP speech classifier 2026 is not a binary classifier, the confidence values of VoIP speech and non-VoIP speech may not be complementary, and the thresholds on the confidence values for distinguishing VoIP speech and non-VoIP speech do not necessarily represent the same point.
However, when the confidence value of VoIP speech or non-VoIP speech is close to the threshold and fluctuates around it, the VoIP/non-VoIP classification result may switch too frequently. To avoid such fluctuation, a buffering scheme may be provided: both the threshold for VoIP speech and the threshold for non-VoIP speech may be set larger, so that switching from the current content type to the other content type becomes less easy. For ease of description, the confidence value of non-VoIP speech may be converted into a confidence value of VoIP speech. That is, if the confidence value is high, the short-term segment is considered close to VoIP speech, and if the confidence value is low, the short-term segment is considered close to non-VoIP speech. Although, for the non-binary classifier mentioned above, a high confidence value of non-VoIP speech does not necessarily mean a low confidence value of VoIP speech, this simplification reflects the essence of the solution well, and the appended claims, though described in the language of a binary classifier, shall be construed as covering equivalent solutions for non-binary classifiers.
The buffering scheme is shown in Figure 36. There is a buffer zone between two thresholds Th1 and Th2 (Th1 >= Th2). When the confidence value v(t) of VoIP speech falls in this zone, the context classification does not change, as shown by the left and right arrows in Figure 36. Only when the confidence value v(t) is greater than the larger threshold Th1 will the short-term segment be classified as VoIP (as shown by the lower arrow in Figure 36); and only when the confidence value is not greater than the smaller threshold Th2 will the short-term segment be classified as non-VoIP (as shown by the upper arrow in Figure 36).
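The buffering (hysteresis) scheme of Figure 36 can be sketched in a few lines; the specific threshold values are illustrative placeholders, the patent only requiring Th1 >= Th2.

```python
def classify_with_buffer(v, prev_label, th1=0.6, th2=0.4):
    """Two thresholds Th1 >= Th2 delimit a buffer zone; inside the
    buffer the previous decision is kept, avoiding rapid switching."""
    if v > th1:
        return "VoIP"        # confidence clearly above the upper threshold
    if v <= th2:
        return "non-VoIP"    # confidence clearly at or below the lower threshold
    return prev_label        # in the buffer zone: keep the previous decision
```

A confidence value oscillating between, say, 0.45 and 0.55 would flip the decision on every segment with a single threshold at 0.5, but leaves the output unchanged here.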
If the VoIP noise classifier 2028 is used instead, the situation is similar. To make the solution more robust, the VoIP speech classifier 2026 and the VoIP noise classifier 2028 may be used in combination. The audio context classifier 204A may then be configured to: classify the short-term segment as the VoIP context type if the confidence value of VoIP speech is greater than a first threshold or if the confidence value of VoIP noise is greater than a third threshold; classify the short-term segment as the non-VoIP context type if the confidence value of VoIP speech is not greater than a second threshold, where the second threshold is not greater than the first threshold, or if the confidence value of VoIP noise is not greater than a fourth threshold, where the fourth threshold is not greater than the third threshold; and otherwise classify the short-term segment as the context type of the previous short-term segment.
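The combined four-threshold rule just described can be sketched as follows; the threshold values are assumed placeholders satisfying the stated constraints (second <= first, fourth <= third).

```python
def classify_context(speech_conf, noise_conf, prev,
                     th1=0.7, th2=0.3, th3=0.8, th4=0.2):
    """Four-threshold heuristic combining VoIP speech and VoIP noise
    confidence values; falls back to the previous segment's context."""
    if speech_conf > th1 or noise_conf > th3:
        return "VoIP"
    if speech_conf <= th2 or noise_conf <= th4:
        return "non-VoIP"
    return prev  # both confidences in their buffer zones
```

Note the ordering: a strong VoIP indication from either classifier wins first, then a strong non-VoIP indication, and only if both confidences sit in their buffer zones is the previous context retained.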
Here, the first threshold may equal the second threshold, and the third threshold may equal the fourth threshold, in particular but not exclusively for a binary VoIP speech classifier and a binary VoIP noise classifier. However, because the VoIP noise classification result is usually less robust, it is better for the third and fourth thresholds to be unequal, and both should be far from 0.5 (where 0 represents high confidence of non-VoIP noise and 1 represents high confidence of VoIP noise).
7.3 Smoothing fluctuations
To avoid rapid fluctuation, another solution is to smooth the confidence values determined by the audio content classifier. Thus, as shown in Figure 37, a type smoothing unit 203A may be included in the audio classifier 200A. For the confidence value of any of the four VoIP-related content types discussed above, the smoothing scheme discussed in subsection 1.3 may be used.
Alternatively, similar to subsection 7.2, VoIP speech and non-VoIP speech may be regarded as a pair with complementary confidence values, and VoIP noise and non-VoIP noise may likewise be regarded as a pair with complementary confidence values. In this case, only one member of each pair needs to be smoothed, using the smoothing scheme discussed in subsection 1.3.
Taking the confidence value of VoIP speech as an example, formula (3) may be rewritten as:

v(t) = β·v(t-1) + (1-β)·voipSpeechConf(t)    (3″)

where v(t) is the smoothed VoIP speech confidence value at the current time t, v(t-1) is the smoothed VoIP speech confidence value at the previous time instance, voipSpeechConf(t) is the VoIP speech confidence value at the current time t before smoothing, and β is a weighting coefficient.
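Formula (3″) is a one-pole exponential smoother and can be written directly as code; the value of β here is an illustrative assumption (a larger β gives stronger smoothing).

```python
def smooth_conf(prev_v, conf, beta=0.9):
    """One step of formula (3''): v(t) = beta*v(t-1) + (1-beta)*voipSpeechConf(t)."""
    return beta * prev_v + (1 - beta) * conf
```

Applied over a sequence of segments, a single-segment spike in voipSpeechConf moves the smoothed value only slightly, which is exactly the anti-fluctuation behavior the text describes.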
In a variation, with the speech/noise classifier 2025 described above, if the speech confidence value of the short-term segment is low, the short-term segment cannot be robustly classified as VoIP speech, and voipSpeechConf(t) = v(t-1) may be set directly, without the VoIP speech classifier 2026 actually working. Alternatively, in that situation, voipSpeechConf(t) = 0.5 may be set (or some other value not greater than 0.5, such as 0.4-0.5), representing an uncertain case (here, confidence = 1 represents high confidence of VoIP and confidence = 0 represents high confidence of non-VoIP).
Thus, according to a variation shown in Figure 37, the audio classifier 200A may further include: the speech/noise classifier 2025 for identifying the speech content type of the short-term segment; and the type smoothing unit 203A, which may be configured to, where the confidence value of the speech content type classified by the speech/noise classifier is lower than a fifth threshold, set the pre-smoothing VoIP speech confidence value of the current short-term segment to a predetermined confidence value (such as 0.5, or another value such as 0.4-0.5) or to the smoothed confidence value of the previous short-term segment. In this case the VoIP speech classifier 2026 may or may not work. Alternatively, the confidence value may be set by the VoIP speech classifier 2026, which is equivalent to the solution in which the type smoothing unit 203A sets the confidence value, and the claims shall be construed as covering both cases. In addition, although the language "the confidence value of the speech content type classified by the speech/noise classifier is lower than a fifth threshold" is used here, the scope of protection is not limited thereto and is equivalent to the situation where the short-term segment is classified into a content type other than speech. For the confidence value of VoIP noise, the situation is similar and a detailed description is omitted here.
To avoid rapid fluctuation, another solution is to smooth the confidence values determined by the audio context classifier 204A, for which the smoothing scheme discussed in subsection 1.3 may be used.
To avoid rapid fluctuation, yet another solution is to delay the transition of the context type between VoIP and non-VoIP, using the same scheme as described in subsection 1.6. As described in subsection 1.6, the timer 916 may be external to the audio classifier or internal to it as a part thereof. Therefore, as shown in Figure 38, the audio classifier 200A may further include the timer 916, and the audio classifier may be configured to continue outputting the current context type until the duration of a new context type reaches a sixth threshold (the context type being an example of an audio type). With reference to subsection 1.6, a detailed description is omitted here.
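The timer-based transition delay can be sketched as below. Measuring the duration in counts of consecutive observations, rather than in seconds, is a simplifying assumption; the sixth threshold here is the `hold_time` parameter.

```python
class ContextHold:
    """Keep emitting the current context type until the newly observed
    type has persisted for hold_time consecutive observations."""

    def __init__(self, hold_time):
        self.hold_time = hold_time
        self.current = None    # context type currently being output
        self.candidate = None  # new type waiting for its timer to expire
        self.count = 0

    def update(self, observed):
        if self.current is None:
            self.current = observed
        elif observed == self.current:
            self.candidate, self.count = None, 0   # reset the timer
        elif observed == self.candidate:
            self.count += 1
            if self.count >= self.hold_time:       # duration reached threshold
                self.current = observed
                self.candidate, self.count = None, 0
        else:
            self.candidate, self.count = observed, 1
        return self.current
```

A single spurious segment of the other type therefore never flips the output; only a sustained run does.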
Alternatively or additionally, as another scheme for delaying the transition between VoIP and non-VoIP, the first and/or second thresholds for VoIP/non-VoIP classification described above may differ depending on the context type of the previous short-term segment. That is, when the context type of the new short-term segment differs from that of the previous short-term segment, the first and/or second thresholds become larger; when the context type of the new short-term segment is the same as that of the previous short-term segment, the first and/or second thresholds become smaller. In this way, the context type tends to remain at the current context type, thereby suppressing abrupt fluctuation of the context type to some extent.
7.4 Combination of embodiments and application scenarios
Similar to Part 1, all of the embodiments discussed above and their variations may be implemented in any combination thereof, and any components mentioned in different parts/embodiments that have the same or similar functions may be implemented as the same component or as separate components. For example, any two or more of the solutions described in subsections 7.1 to 7.3 may be combined with one another, and these combinations may be further combined with any embodiment described or implied in Parts 1 through 6. In particular, the embodiments discussed in this part, and any combination thereof, may be combined with the audio processing apparatus/method, or with the volume leveler controller/control method discussed in Part 4.
7.5 VoIP classification methods
Similar to Part 1, in describing the audio classifiers in the embodiments above, some processes and methods are evidently also disclosed. A summary of these methods is given below without repeating details already discussed. In an embodiment shown in Figure 39, an audio classification method comprises: identifying the content type of a short-term segment of an audio signal (operation 4004), and then identifying the context type of the short-term segment based at least in part on the identified content type (operation 4008).
For identifying the context type of the audio signal dynamically and rapidly, the audio classification method in this part is particularly useful for identifying the VoIP context type and the non-VoIP context type. In that case, the short-term segment may first be classified into the VoIP speech content type or the non-VoIP speech content type, and the operation of identifying the context type may be configured to classify the short-term segment into the VoIP context type or the non-VoIP context type based on the confidence values of VoIP speech and non-VoIP speech. Alternatively, the short-term segment may first be classified into the VoIP noise content type or the non-VoIP noise content type, and the operation of identifying the context type may be configured to classify the short-term segment into the VoIP context type or the non-VoIP context type based on the confidence values of VoIP noise and non-VoIP noise.
Speech and noise may be considered jointly. In this case, the operation of identifying the context type may be configured to classify the short-term segment into the VoIP context type or the non-VoIP context type based on the confidence values of VoIP speech, non-VoIP speech, VoIP noise, and non-VoIP noise.
For identifying the context type of the short-term segment, a machine-learning model may be used, taking as features both the confidence values of the content types of the short-term segment and other features extracted from the short-term segment.
The operation of identifying the context type may also be realized based on heuristic rules. When only VoIP speech and non-VoIP speech are involved, the heuristic rule is as follows: if the confidence value of VoIP speech is greater than a first threshold, the short-term segment is classified into the VoIP context type; if the confidence value of VoIP speech is not greater than a second threshold, where the second threshold is not greater than the first threshold, the short-term segment is classified into the non-VoIP context type; otherwise, the short-term segment is classified into the context type of the previous short-term segment.
The heuristic rule for the situation involving only VoIP noise and non-VoIP noise is similar.
When both speech and noise are involved, the heuristic rule is as follows: if the confidence value of VoIP speech is greater than the first threshold, or if the confidence value of VoIP noise is greater than a third threshold, the short-term segment is classified into the VoIP context type; if the confidence value of VoIP speech is not greater than the second threshold, where the second threshold is not greater than the first threshold, or if the confidence value of VoIP noise is not greater than a fourth threshold, where the fourth threshold is not greater than the third threshold, the short-term segment is classified into the non-VoIP context type; otherwise, the short-term segment is classified into the context type of the previous short-term segment.
The smoothing schemes discussed in subsections 1.3 and 1.8 may be used here, and a detailed description is omitted. As a variation of the smoothing scheme described in subsection 1.3, before the smoothing operation 4106, the method may further include identifying the speech content type of the short-term segment (operation 40040 in Figure 40), wherein, where the confidence value of the speech content type is lower than a fifth threshold ("N" in operation 40041), the pre-smoothing VoIP speech confidence value of the current short-term segment is set to a predetermined confidence value or to the smoothed confidence value of the previous short-term segment (operation 40044 in Figure 40). Otherwise, if the operation of identifying the speech content type robustly judges the short-term segment to be speech ("Y" in operation 40041), then before the smoothing operation 4106 the short-term segment is further classified into VoIP speech or non-VoIP speech (operation 40042).
In fact, even without the smoothing scheme, the method may first identify the speech content type and/or the noise content type; when the short-term segment is classified into speech or noise, further classification is performed to classify the short-term segment into one of VoIP speech and non-VoIP speech, or into one of VoIP noise and non-VoIP noise. The operation of identifying the context type then follows.
As mentioned in subsections 1.6 and 1.8, the transition scheme discussed therein may be used as part of the audio classification method described here, and its details are omitted. Briefly, the method may further include measuring the duration during which the operation of identifying the context type continuously outputs the same context type, wherein the audio classification method is configured to continue outputting the current context type until the duration of the new context type reaches a sixth threshold. Similarly, different sixth thresholds may be set for different transitions from one context type to another. Furthermore, the sixth threshold may be negatively correlated with the confidence value of the new context type.
As an improvement of the transition scheme in the audio classification method specific to VoIP/non-VoIP classification, one or more of the first through fourth thresholds for the current short-term segment may be set to differ depending on the context type of the previous short-term segment.
Similar to the embodiments of the audio processing apparatus, on the one hand, any combination of the embodiments of the audio processing method and their variations is feasible; on the other hand, each aspect of the embodiments of the audio processing method and their variations may also be a standalone solution. In addition, any two or more of the solutions described in this subsection may be combined with one another, and these combinations may be further combined with any embodiment described or implied elsewhere in this disclosure. In particular, the audio classification method described here may be used in the audio processing methods described earlier, especially the volume leveler control method.
As described at the beginning of the detailed description of this application, the embodiments of this application may be implemented as hardware or software, or both. Figure 41 is a block diagram showing an example system 4200 for implementing aspects of this application. In Figure 41, a central processing unit (CPU) 4201 performs various processing according to a program stored in a read-only memory (ROM) 4202 or a program loaded from a storage section 4208 into a random access memory (RAM) 4203. Data required when the CPU 4201 performs the various processing are also stored in the RAM 4203 as needed.
The CPU 4201, the ROM 4202, and the RAM 4203 are connected to one another via a bus 4204. An input/output interface 4205 is also connected to the bus 4204. The following components are connected to the input/output interface 4205: an input section 4206 including a keyboard, a mouse, and the like; an output section 4207 including a display such as a cathode-ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker, and the like; a storage section 4208 including a hard disk and the like; and a communication section 4209 including a network interface card such as a LAN card or a modem. The communication section 4209 performs communication processing via a network such as the Internet. As needed, a drive 4210 is also connected to the input/output interface 4205. A removable medium 4211, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 4210 as needed, so that a computer program read therefrom is installed into the storage section 4208 as needed.
Where the above components are implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 4211.
Note that the terminology used here is for the purpose of describing embodiments only and is not intended to limit this application. As used herein, the singular forms "a" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will further be understood that the term "comprising", when used in this application, specifies the presence of the stated features, integers, operations, steps, elements, and/or components, but does not preclude the presence or addition of one or more other features, integers, operations, steps, elements, components, and/or combinations thereof.
The corresponding structures, materials, acts, and equivalents of all means-or-step-plus-function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with the other claimed elements as specifically claimed. The description of this application has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the application in the form disclosed. Many modifications and variations will be apparent to persons of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand the application with respect to various embodiments, with various modifications, as are suited to the particular use contemplated.
In view of the above, it can be seen that the following exemplary embodiments (each denoted as an "EE") are described.
Apparatus embodiments:
EE.1. An equalizer controller, comprising:
an audio classifier for continuously identifying the audio type of an audio signal; and
an adjusting unit for adjusting an equalizer in a continuous manner based on the confidence value of the identified audio type.
EE.2. The equalizer controller according to EE 1, wherein the audio classifier is configured to classify the audio signal into multiple audio types with respective confidence values, and the adjusting unit is configured to consider at least some of the multiple audio types by weighting the confidence values of the multiple audio types based on the importance of the multiple audio types.
EE.3. The equalizer controller according to EE 1, wherein the audio classifier is configured to classify the audio signal into multiple audio types with respective confidence values, and the adjusting unit is configured to consider at least some of the multiple audio types by weighting the effects of the multiple audio types based on the confidence values.
EE.4. The equalizer controller according to EE 3, wherein the adjusting unit is configured to consider at least one dominant audio type based on the confidence values.
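EE.2 to EE.4 describe combining several simultaneously identified audio types by weighting their confidence values. The following is a minimal illustrative sketch, not the patented implementation: the importance weights, type names, and the normalization rule are assumptions chosen for the example.

```python
# Sketch of EE.2-EE.4: weight each audio type's confidence by an assumed
# importance value, normalize, and optionally pick the dominant type.

def combined_confidence(confidences, importance):
    """Weight confidences by importance and normalize to sum to 1."""
    weighted = {t: c * importance.get(t, 1.0) for t, c in confidences.items()}
    total = sum(weighted.values())
    if total == 0:
        return {t: 0.0 for t in weighted}
    return {t: w / total for t, w in weighted.items()}

def dominant_type(confidences):
    """EE.4: the audio type with the highest confidence value."""
    return max(confidences, key=confidences.get)

# Hypothetical classifier output and importance weights.
conf = {"short_term_music": 0.6, "speech": 0.3, "noise": 0.1}
imp = {"short_term_music": 1.0, "speech": 2.0, "noise": 0.5}
weights = combined_confidence(conf, imp)
print(dominant_type(conf))  # -> short_term_music
```

The normalized weights could then scale per-type equalizer settings; the embodiments leave the exact combination rule open.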
EE.5. The equalizer controller according to EE 1, further comprising a parameter smoothing unit for, with respect to a parameter of the equalizer adjusted by the adjusting unit, smoothing the parameter value determined by the adjusting unit at the current time based on a past parameter value.
EE.6. The equalizer controller according to EE 5, wherein the parameter smoothing unit is configured to determine the current smoothed parameter value by calculating a weighted sum of the parameter value determined by the adjusting unit at the current time and the last smoothed parameter value.
EE.7. The equalizer controller according to EE 6, wherein the weights for calculating the weighted sum adaptively change based on the audio type of the audio signal.
EE.8. The equalizer controller according to EE 6, wherein the weights for calculating the weighted sum adaptively change based on different transition pairs from one audio type to another audio type.
EE.9. The equalizer controller according to EE 6, wherein the weights for calculating the weighted sum adaptively change based on an increasing or decreasing trend of the parameter value determined by the adjusting unit.
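The smoothing of EE.5 to EE.9 is a one-pole recursion: the new smoothed value is a weighted sum of the currently determined value and the previous smoothed value. The sketch below is illustrative only; the specific alpha values and the rise/fall rule are assumptions standing in for the adaptive weighting the embodiments describe.

```python
# Sketch of EE.6 and EE.9: weighted-sum smoothing of an equalizer
# parameter, with the weight adapted to the parameter's trend so the
# value rises quickly but decays slowly (hypothetical alpha choices).

def smooth(current, last_smoothed, alpha):
    """Weighted sum of the current estimate and the last smoothed value."""
    return alpha * current + (1.0 - alpha) * last_smoothed

def adaptive_alpha(current, last_smoothed, alpha_up=0.9, alpha_down=0.2):
    """EE.9: react fast on an increasing trend, slowly on a decreasing one."""
    return alpha_up if current > last_smoothed else alpha_down

smoothed = 0.0
for raw in [1.0, 1.0, 0.0, 0.0]:  # raw values from the adjusting unit
    smoothed = smooth(raw, smoothed, adaptive_alpha(raw, smoothed))
print(round(smoothed, 4))  # -> 0.6336
```

Per EE.7 and EE.8, the same alpha could instead be looked up per audio type or per transition pair (for example, a slower alpha when switching from music to speech).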
EE.10. The equalizer controller according to any one of EE 1 to EE 9, wherein
the audio classifier comprises an audio content classifier for identifying the content type of the audio signal; and
the adjusting unit is configured to positively correlate the equalization level with the confidence value of short-term music, and/or negatively correlate the equalization level with the confidence value of speech.
EE.11. The equalizer controller according to any one of EE 1 to EE 9, wherein
the audio classifier comprises an audio context classifier for identifying the context type of the audio signal; and
the adjusting unit is configured to positively correlate the equalization level with the confidence value of long-term music, and/or negatively correlate the equalization level with the confidence value of movie-like media and/or game.
EE.12. The equalizer controller according to any one of EE 1 to EE 9, wherein
the audio classifier comprises an audio content classifier for identifying the content type of the audio signal; and
the adjusting unit is configured to positively correlate the equalization level with the confidence value of short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of short-term music with dominant sources.
EE.13. The equalizer controller according to EE 10 or EE 11, wherein the adjusting unit is configured to positively correlate the equalization level with the confidence value of the short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of the short-term music with dominant sources.
EE.14. The equalizer controller according to EE 13, wherein the adjusting unit is configured to consider the short-term music without/with dominant sources only when the confidence value of the short-term music is greater than a threshold.
EE.15. The equalizer controller according to any one of EE 1 to EE 9, wherein
the audio classifier comprises an audio content classifier for identifying the content type of the audio signal; and
the adjusting unit is configured to positively correlate the equalization level with the confidence value of background sound, and/or negatively correlate the equalization level with the confidence value of noise.
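EE.10 to EE.15 only require that the equalization level correlate positively with some confidence values and negatively with others. One possible combination rule, given purely as an assumption for illustration, is to scale the level up with the music confidence and down with the speech confidence:

```python
# Sketch of the positive/negative correlations in EE.10: the equalization
# level grows with the confidence of short-term music and shrinks with the
# confidence of speech. The product rule below is a hypothetical choice;
# any monotone combination would satisfy the embodiment.

def equalization_level(conf_music, conf_speech):
    level = conf_music * (1.0 - conf_speech)
    return max(0.0, min(1.0, level))  # clamp to [0, 1]

print(equalization_level(0.9, 0.1))  # confident music: high level
print(equalization_level(0.2, 0.8))  # speech dominates: low level
```

The same pattern transfers to the other pairs (long-term music vs. movie-like media/game, background sound vs. noise) by substituting the corresponding confidence values.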
EE.16. The equalizer controller according to any one of EE 1 to EE 9, wherein the adjusting unit is configured to assign an equalization level and/or an equalization profile and/or a spectral balance preset to each audio type.
EE.17. The equalizer controller according to EE 16, wherein the audio classifier comprises an audio content classifier for classifying the audio signal into short-term content types, the short-term content types including at least one of short-term music, speech, background sound and noise.
EE.18. The equalizer controller according to EE 17, wherein the short-term music includes at least one music type.
EE.19. The equalizer controller according to EE 18, wherein the at least one music type includes a genre-based type, and/or an instrument-based type, and/or a music type based on the rhythm, tempo, timbre and/or any other musical attribute of the music.
EE.20. The equalizer controller according to EE 16, wherein the audio classifier comprises an audio context classifier for classifying the audio signal into long-term context types, the long-term context types including at least one of movie-like media, long-term music, VoIP and game.
EE.21. An audio reproducing system, comprising the equalizer controller according to any one of EE 1 to EE 20.
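The per-type assignment of EE.16 to EE.20 can be pictured as a lookup table from audio type to preset. All preset values and profile names below are hypothetical placeholders, not values disclosed by the application:

```python
# Sketch of EE.16-EE.20: a preset (equalization level and profile) per
# short-term content type or long-term context type. Values are invented
# for illustration only.

PRESETS = {
    # short-term content types (EE.17)
    "short_term_music": {"level": 1.0, "profile": "music"},
    "speech":           {"level": 0.2, "profile": "flat"},
    "background_sound": {"level": 0.6, "profile": "flat"},
    "noise":            {"level": 0.0, "profile": "flat"},
    # long-term context types (EE.20)
    "movie_like_media": {"level": 0.3, "profile": "cinema"},
    "long_term_music":  {"level": 1.0, "profile": "music"},
    "voip":             {"level": 0.1, "profile": "flat"},
    "game":             {"level": 0.3, "profile": "cinema"},
}

def preset_for(audio_type):
    """Fall back to a neutral preset for unrecognized types."""
    return PRESETS.get(audio_type, {"level": 0.5, "profile": "flat"})

print(preset_for("speech")["level"])  # -> 0.2
```

Together with the confidence weighting of EE.2, the presets of several candidate types could be interpolated rather than switched hard.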
Method embodiments:
EE.1. An equalizer control method, comprising:
identifying the audio type of an audio signal in real time; and
adjusting an equalizer in a continuous manner based on the confidence value of the identified audio type.
EE.2. The equalizer control method according to EE 1, wherein the audio signal is classified into multiple audio types with respective confidence values, and the adjusting operation is configured to consider at least some of the multiple audio types by weighting the confidence values of the multiple audio types based on the importance of the multiple audio types.
EE.3. The equalizer control method according to EE 1, wherein the audio signal is classified into multiple audio types with respective confidence values, and the adjusting operation is configured to consider at least some of the multiple audio types by weighting the effects of the multiple audio types based on the confidence values.
EE.4. The equalizer control method according to EE 3, wherein the adjusting operation is configured to consider at least one dominant audio type based on the confidence values.
EE.5. The equalizer control method according to EE 1, further comprising, with respect to a parameter of the equalizer adjusted by the adjusting operation, smoothing the parameter value determined by the adjusting operation at the current time based on a past parameter value.
EE.6. The equalizer control method according to EE 5, wherein the smoothing operation is configured to determine the current smoothed parameter value by calculating a weighted sum of the parameter value determined by the adjusting operation at the current time and the last smoothed parameter value.
EE.7. The equalizer control method according to EE 6, wherein the weights for calculating the weighted sum adaptively change based on the audio type of the audio signal.
EE.8. The equalizer control method according to EE 6, wherein the weights for calculating the weighted sum adaptively change based on different transition pairs from one audio type to another audio type.
EE.9. The equalizer control method according to EE 6, wherein the weights for calculating the weighted sum adaptively change based on an increasing or decreasing trend of the parameter value determined by the adjusting operation.
EE.10. The equalizer control method according to any one of EE 1 to EE 9, wherein
the operation of identifying the audio type includes identifying the content type of the audio signal; and
the adjusting operation is configured to positively correlate the equalization level with the confidence value of short-term music, and/or negatively correlate the equalization level with the confidence value of speech.
EE.11. The equalizer control method according to any one of EE 1 to EE 9, wherein
the operation of identifying the audio type includes identifying the context type of the audio signal; and
the adjusting operation is configured to positively correlate the equalization level with the confidence value of long-term music, and/or negatively correlate the equalization level with the confidence value of movie-like media and/or game.
EE.12. The equalizer control method according to any one of EE 1 to EE 9, wherein
the operation of identifying the audio type includes identifying the content type of the audio signal; and
the adjusting operation is configured to positively correlate the equalization level with the confidence value of short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of short-term music with dominant sources.
EE.13. The equalizer control method according to EE 10 or EE 11, wherein the adjusting operation is configured to positively correlate the equalization level with the confidence value of the short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of the short-term music with dominant sources.
EE.14. The equalizer control method according to EE 13, wherein the adjusting operation is configured to consider the short-term music without/with dominant sources only when the confidence value of the short-term music is greater than a threshold.
EE.15. The equalizer control method according to any one of EE 1 to EE 9, wherein
the operation of identifying the audio type includes identifying the content type of the audio signal; and
the adjusting operation is configured to positively correlate the equalization level with the confidence value of background sound, and/or negatively correlate the equalization level with the confidence value of noise.
EE.16. The equalizer control method according to any one of EE 1 to EE 9, wherein the adjusting operation is configured to assign an equalization level and/or an equalization profile and/or a spectral balance preset to each audio type.
EE.17. The equalizer control method according to EE 16, wherein the operation of identifying the audio type includes classifying the audio signal into short-term content types, the short-term content types including at least one of short-term music, speech, background sound and noise.
EE.18. The equalizer control method according to EE 17, wherein the short-term music includes at least one music type.
EE.19. The equalizer control method according to EE 18, wherein the at least one music type includes a genre-based type, and/or an instrument-based type, and/or a music type based on the rhythm, tempo, timbre and/or any other musical attribute of the music.
EE.20. The equalizer control method according to EE 16, wherein the operation of identifying the audio type classifies the audio signal into long-term context types, the long-term context types including at least one of movie-like media, long-term music, VoIP and game.
EE.21. A computer-readable medium having computer program instructions recorded thereon, the instructions, when executed by a processor, enabling the processor to perform an equalizer control method, the equalizer control method comprising:
identifying the audio type of an audio signal in real time; and
adjusting an equalizer in a continuous manner based on the confidence value of the identified audio type.
Claims (69)
1. An equalizer controller, comprising:
an audio classifier for continuously identifying the audio type of an audio signal; and
an adjusting unit for adjusting an equalizer in a continuous manner based on the confidence value of the identified audio type, wherein
the audio classifier is configured to classify the audio signal into multiple audio types with respective confidence values, and the adjusting unit is configured to consider at least some of the multiple audio types by weighting the confidence values of the multiple audio types based on the importance of the multiple audio types.
2. The equalizer controller according to claim 1, further comprising a parameter smoothing unit for, with respect to a parameter of the equalizer adjusted by the adjusting unit, smoothing the parameter value determined by the adjusting unit at the current time based on a past parameter value.
3. The equalizer controller according to claim 2, wherein the parameter smoothing unit is configured to determine the current smoothed parameter value by calculating a weighted sum of the parameter value determined by the adjusting unit at the current time and the last smoothed parameter value.
4. The equalizer controller according to claim 3, wherein the weights for calculating the weighted sum adaptively change based on the audio type of the audio signal.
5. The equalizer controller according to claim 3, wherein the weights for calculating the weighted sum adaptively change based on different transition pairs from one audio type to another audio type.
6. The equalizer controller according to claim 3, wherein the weights for calculating the weighted sum adaptively change based on an increasing or decreasing trend of the parameter value determined by the adjusting unit.
7. The equalizer controller according to any one of claim 1 to claim 6, wherein
the audio classifier comprises an audio content classifier for identifying the content type of the audio signal; and
the adjusting unit is configured to positively correlate the equalization level with the confidence value of short-term music, and/or negatively correlate the equalization level with the confidence value of speech.
8. The equalizer controller according to any one of claim 1 to claim 6, wherein
the audio classifier comprises an audio context classifier for identifying the context type of the audio signal; and
the adjusting unit is configured to positively correlate the equalization level with the confidence value of long-term music, and/or negatively correlate the equalization level with the confidence value of movie-like media and/or game.
9. The equalizer controller according to any one of claim 1 to claim 6, wherein
the audio classifier comprises an audio content classifier for identifying the content type of the audio signal; and
the adjusting unit is configured to positively correlate the equalization level with the confidence value of short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of short-term music with dominant sources.
10. The equalizer controller according to claim 7, wherein the adjusting unit is configured to positively correlate the equalization level with the confidence value of the short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of the short-term music with dominant sources.
11. The equalizer controller according to claim 8, wherein the adjusting unit is configured to positively correlate the equalization level with the confidence value of the short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of the short-term music with dominant sources.
12. The equalizer controller according to claim 10, wherein the adjusting unit is configured to consider the short-term music without/with dominant sources only when the confidence value of the short-term music is greater than a threshold.
13. The equalizer controller according to claim 11, wherein the adjusting unit is configured to consider the short-term music without/with dominant sources only when the confidence value of the short-term music is greater than a threshold.
14. The equalizer controller according to any one of claim 1 to claim 6, wherein
the audio classifier comprises an audio content classifier for identifying the content type of the audio signal; and
the adjusting unit is configured to positively correlate the equalization level with the confidence value of background sound, and/or negatively correlate the equalization level with the confidence value of noise.
15. The equalizer controller according to any one of claim 1 to claim 6, wherein the adjusting unit is configured to assign an equalization level and/or an equalization profile and/or a spectral balance preset to each audio type.
16. The equalizer controller according to claim 15, wherein the audio classifier comprises an audio content classifier for classifying the audio signal into short-term content types, the short-term content types including at least one of short-term music, speech, background sound and noise.
17. The equalizer controller according to claim 16, wherein the short-term music includes at least one music type.
18. The equalizer controller according to claim 17, wherein the at least one music type includes a genre-based type, and/or an instrument-based type, and/or a music type based on the rhythm, tempo, timbre and/or any other musical attribute of the music.
19. The equalizer controller according to claim 15, wherein the audio classifier comprises an audio context classifier for classifying the audio signal into long-term context types, the long-term context types including at least one of movie-like media, long-term music, VoIP and game.
20. An equalizer controller, comprising:
an audio classifier for continuously identifying the audio type of an audio signal; and
an adjusting unit for adjusting an equalizer in a continuous manner based on the confidence value of the identified audio type, wherein
the audio classifier is configured to classify the audio signal into multiple audio types with respective confidence values, and the adjusting unit is configured to consider at least some of the multiple audio types by weighting the effects of the multiple audio types based on the confidence values.
21. The equalizer controller according to claim 20, wherein the adjusting unit is configured to consider at least one dominant audio type based on the confidence values.
22. The equalizer controller according to any one of claim 20 to claim 21, wherein
the audio classifier comprises an audio content classifier for identifying the content type of the audio signal; and
the adjusting unit is configured to positively correlate the equalization level with the confidence value of short-term music, and/or negatively correlate the equalization level with the confidence value of speech.
23. The equalizer controller according to any one of claim 20 to claim 21, wherein
the audio classifier comprises an audio context classifier for identifying the context type of the audio signal; and
the adjusting unit is configured to positively correlate the equalization level with the confidence value of long-term music, and/or negatively correlate the equalization level with the confidence value of movie-like media and/or game.
24. The equalizer controller according to any one of claim 20 to claim 21, wherein
the audio classifier comprises an audio content classifier for identifying the content type of the audio signal; and
the adjusting unit is configured to positively correlate the equalization level with the confidence value of short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of short-term music with dominant sources.
25. The equalizer controller according to claim 22, wherein the adjusting unit is configured to positively correlate the equalization level with the confidence value of the short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of the short-term music with dominant sources.
26. The equalizer controller according to claim 23, wherein the adjusting unit is configured to positively correlate the equalization level with the confidence value of the short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of the short-term music with dominant sources.
27. The equalizer controller according to claim 25, wherein the adjusting unit is configured to consider the short-term music without/with dominant sources only when the confidence value of the short-term music is greater than a threshold.
28. The equalizer controller according to claim 26, wherein the adjusting unit is configured to consider the short-term music without/with dominant sources only when the confidence value of the short-term music is greater than a threshold.
29. The equalizer controller according to any one of claim 20 to claim 21, wherein
the audio classifier comprises an audio content classifier for identifying the content type of the audio signal; and
the adjusting unit is configured to positively correlate the equalization level with the confidence value of background sound, and/or negatively correlate the equalization level with the confidence value of noise.
30. The equalizer controller according to any one of claim 20 to claim 21, wherein the adjusting unit is configured to assign an equalization level and/or an equalization profile and/or a spectral balance preset to each audio type.
31. The equalizer controller according to claim 30, wherein the audio classifier comprises an audio content classifier for classifying the audio signal into short-term content types, the short-term content types including at least one of short-term music, speech, background sound and noise.
32. The equalizer controller according to claim 31, wherein the short-term music includes at least one music type.
33. The equalizer controller according to claim 32, wherein the at least one music type includes a genre-based type, and/or an instrument-based type, and/or a music type based on the rhythm, tempo, timbre and/or any other musical attribute of the music.
34. The equalizer controller according to claim 30, wherein the audio classifier comprises an audio context classifier for classifying the audio signal into long-term context types, the long-term context types including at least one of movie-like media, long-term music, VoIP and game.
35. An audio reproducing system, comprising the equalizer controller according to any one of claim 1 to claim 34.
36. An equalizer control method, comprising:
identifying the audio type of an audio signal in real time; and
adjusting an equalizer in a continuous manner based on the confidence value of the identified audio type, wherein
the audio signal is classified into multiple audio types with respective confidence values, and the adjusting operation is configured to consider at least some of the multiple audio types by weighting the confidence values of the multiple audio types based on the importance of the multiple audio types.
37. The equalizer control method according to claim 36, further comprising, with respect to a parameter of the equalizer adjusted by the adjusting operation, smoothing the parameter value determined by the adjusting operation at the current time based on a past parameter value.
38. The equalizer control method according to claim 37, wherein the smoothing operation is configured to determine the current smoothed parameter value by calculating a weighted sum of the parameter value determined by the adjusting operation at the current time and the last smoothed parameter value.
39. The equalizer control method according to claim 38, wherein the weights for calculating the weighted sum adaptively change based on the audio type of the audio signal.
40. The equalizer control method according to claim 38, wherein the weights for calculating the weighted sum adaptively change based on different transition pairs from one audio type to another audio type.
41. The equalizer control method according to claim 38, wherein the weights for calculating the weighted sum adaptively change based on an increasing or decreasing trend of the parameter value determined by the adjusting operation.
42. The equalizer control method according to any one of claim 36 to claim 41, wherein
the operation of identifying the audio type includes identifying the content type of the audio signal; and
the adjusting operation is configured to positively correlate the equalization level with the confidence value of short-term music, and/or negatively correlate the equalization level with the confidence value of speech.
43. The equalizer control method according to any one of claim 36 to claim 41, wherein
the operation of identifying the audio type includes identifying the context type of the audio signal; and
the adjusting operation is configured to positively correlate the equalization level with the confidence value of long-term music, and/or negatively correlate the equalization level with the confidence value of movie-like media and/or game.
44. The equalizer control method according to any one of claim 36 to claim 41, wherein
the operation of identifying the audio type includes identifying the content type of the audio signal; and
the adjusting operation is configured to positively correlate the equalization level with the confidence value of short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of short-term music with dominant sources.
45. The equalizer control method according to claim 42, wherein the adjusting operation is configured to positively correlate the equalization level with the confidence value of the short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of the short-term music with dominant sources.
46. The equalizer control method according to claim 43, wherein the adjusting operation is configured to positively correlate the equalization level with the confidence value of the short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of the short-term music with dominant sources.
47. The equalizer control method according to claim 45, wherein the adjusting operation is configured to consider the short-term music without/with dominant sources only when the confidence value of the short-term music is greater than a threshold.
48. The equalizer control method according to claim 46, wherein the adjusting operation is configured to consider the short-term music without/with dominant sources only when the confidence value of the short-term music is greater than a threshold.
49. The equalizer control method according to any one of claim 36 to claim 41, wherein
the operation of identifying the audio type includes identifying the content type of the audio signal; and
the adjusting operation is configured to positively correlate the equalization level with the confidence value of background sound, and/or negatively correlate the equalization level with the confidence value of noise.
50. The equalizer control method according to any one of claim 36 to claim 41, wherein the adjusting operation is configured to assign an equalization level and/or an equalization profile and/or a spectral balance preset to each audio type.
51. The equalizer control method according to claim 50, wherein the operation of identifying the audio type includes classifying the audio signal into short-term content types, the short-term content types including at least one of short-term music, speech, background sound and noise.
52. The equalizer control method according to claim 51, wherein the short-term music includes at least one music type.
53. The equalizer control method according to claim 52, wherein the at least one music type includes a genre-based type, and/or an instrument-based type, and/or a music type based on the rhythm, tempo, timbre and/or any other musical attribute of the music.
54. The equalizer control method according to claim 50, wherein the operation of identifying the audio type classifies the audio signal into long-term context types, the long-term context types including at least one of movie-like media, long-term music, VoIP and game.
55. An equalizer control method, comprising: identifying an audio type of an audio signal in real time; and adjusting an equalizer in a continuous manner based on a confidence value of the identified audio type, wherein the audio signal is classified into multiple audio types with respective confidence values, and the adjusting operation is configured to consider at least some of the multiple audio types by weighting the influence of the multiple audio types based on their confidence values.
56. The equalizer control method according to claim 55, wherein the adjusting operation is configured to consider at least one dominant audio type based on the confidence values.
57. The equalizer control method according to any one of claims 55 to 56, wherein the operation of identifying the audio type comprises identifying a content type of the audio signal; and the adjusting operation is configured to positively correlate an equalization level with the confidence value of short-term music, and/or negatively correlate the equalization level with the confidence value of speech.
58. The equalizer control method according to any one of claims 55 to 56, wherein the operation of identifying the audio type comprises identifying a context type of the audio signal; and the adjusting operation is configured to positively correlate an equalization level with the confidence value of long-term music, and/or negatively correlate the equalization level with the confidence value of movie-like media and/or game.
59. The equalizer control method according to any one of claims 55 to 56, wherein the operation of identifying the audio type comprises identifying a content type of the audio signal; and the adjusting operation is configured to positively correlate an equalization level with the confidence value of short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of short-term music with dominant sources.
60. The equalizer control method according to claim 57, wherein the adjusting operation is configured to positively correlate the equalization level with the confidence value of short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of short-term music with dominant sources.
61. The equalizer control method according to claim 58, wherein the adjusting operation is configured to positively correlate the equalization level with the confidence value of short-term music without dominant sources, and/or negatively correlate the equalization level with the confidence value of short-term music with dominant sources.
62. The equalizer control method according to claim 60, wherein the adjusting operation is configured to consider the short-term music without/with dominant sources only when the confidence value of short-term music is greater than a threshold.
63. The equalizer control method according to claim 61, wherein the adjusting operation is configured to consider the short-term music without/with dominant sources only when the confidence value of short-term music is greater than a threshold.
64. The equalizer control method according to any one of claims 55 to 56, wherein the operation of identifying the audio type comprises identifying a content type of the audio signal; and the adjusting operation is configured to positively correlate an equalization level with the confidence value of background sound, and/or negatively correlate the equalization level with the confidence value of noise.
65. The equalizer control method according to any one of claims 55 to 56, wherein the adjusting operation is configured to assign an equalization level and/or an equalization profile and/or a spectral balance preset to each audio type.
66. The equalizer control method according to claim 65, wherein the operation of identifying the audio type comprises classifying the audio signal into short-term content types, the short-term content types including at least one of short-term music, speech, background sound and noise.
67. The equalizer control method according to claim 66, wherein the short-term music includes at least one music type.
68. The equalizer control method according to claim 67, wherein the at least one music type includes a genre-based type, and/or an instrument-based type, and/or a music type classified based on the rhythm, tempo, timbre and/or any other musical attribute of the music.
69. The equalizer control method according to claim 65, wherein the operation of identifying the audio type comprises classifying the audio signal into long-term context types, the long-term context types including at least one of movie-like media, long-term music, VoIP and game.
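The adjustment recited in claims 55 through 64 (weighting the equalizer by per-type classification confidence, correlating the equalization level positively with music confidence and negatively with speech confidence, gating the dominant-source distinction on a threshold, and adjusting in a continuous manner) can be sketched as follows. This is an illustrative sketch only, not the patented implementation; every function name, key name, weight and threshold below is a hypothetical assumption made for the example.

```python
# Illustrative sketch (not the patented implementation) of a
# confidence-driven equalization level. All names, weights and
# thresholds are hypothetical assumptions.

def equalization_level(conf, dominant_threshold=0.5):
    """Map per-type confidence values in [0, 1] to an EQ level in [0, 1].

    `conf` is a dict of classifier confidences, e.g. keyed by
    "short_term_music", "speech", "background_sound", "noise".
    """
    music = conf.get("short_term_music", 0.0)
    speech = conf.get("speech", 0.0)
    background = conf.get("background_sound", 0.0)
    noise = conf.get("noise", 0.0)

    # Positive correlation with music confidence, negative with speech
    # (claim 57); background sound raises and noise lowers the level
    # (claim 64). The 0.3 weight is an arbitrary illustrative choice.
    level = music * (1.0 - speech)
    level += 0.3 * background * (1.0 - noise)

    # Claims 60-63: distinguish music without/with dominant sources,
    # but only when the overall music confidence exceeds a threshold.
    if music > dominant_threshold:
        no_dom = conf.get("music_without_dominant_source", 0.0)
        with_dom = conf.get("music_with_dominant_source", 0.0)
        level *= max(0.0, min(1.0, 1.0 + 0.5 * (no_dom - with_dom)))

    return max(0.0, min(1.0, level))


def smooth_level(previous, target, alpha=0.9):
    """One-pole smoothing so the equalizer is adjusted 'in a continuous
    manner' (claim 55) rather than jumping between discrete settings."""
    return alpha * previous + (1.0 - alpha) * target
```

A caller would recompute `equalization_level` on each short-term classification frame and pass the result through `smooth_level` before applying it to the equalizer, so the applied level drifts gradually as the confidence values change.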
Priority Applications (10)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310100401.XA CN104079247B (en) | 2013-03-26 | 2013-03-26 | Equalizer controller and controlling method and audio reproducing system |
ES14724216.8T ES2630398T3 (en) | 2013-03-26 | 2014-03-17 | Control device and equalizer control method |
JP2016505490A JP6053984B2 (en) | 2013-03-26 | 2014-03-17 | Equalizer controller and control method |
EP14724216.8A EP2979359B1 (en) | 2013-03-26 | 2014-03-17 | Equalizer controller and controlling method |
PCT/US2014/030663 WO2014160548A1 (en) | 2013-03-26 | 2014-03-17 | Equalizer controller and controlling method |
US14/780,485 US9621124B2 (en) | 2013-03-26 | 2014-03-17 | Equalizer controller and controlling method |
EP17164545.0A EP3232567B1 (en) | 2013-03-26 | 2014-03-17 | Equalizer controller and controlling method |
JP2016230947A JP6325640B2 (en) | 2013-03-26 | 2016-11-29 | Equalizer controller and control method |
US15/433,486 US10044337B2 (en) | 2013-03-26 | 2017-02-15 | Equalizer controller and controlling method |
HK18103383.9A HK1244110A1 (en) | 2013-03-26 | 2018-03-12 | Equalizer controller and controlling method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310100401.XA CN104079247B (en) | 2013-03-26 | 2013-03-26 | Equalizer controller and controlling method and audio reproducing system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104079247A CN104079247A (en) | 2014-10-01 |
CN104079247B true CN104079247B (en) | 2018-02-09 |
Family
ID=51600326
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310100401.XA Active CN104079247B (en) | 2013-03-26 | 2013-03-26 | Equalizer controller and controlling method and audio reproducing system |
Country Status (7)
Country | Link |
---|---|
US (2) | US9621124B2 (en) |
EP (2) | EP2979359B1 (en) |
JP (2) | JP6053984B2 (en) |
CN (1) | CN104079247B (en) |
ES (1) | ES2630398T3 (en) |
HK (1) | HK1244110A1 (en) |
WO (1) | WO2014160548A1 (en) |
Families Citing this family (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9380383B2 (en) | 2013-09-06 | 2016-06-28 | Gracenote, Inc. | Modifying playback of content using pre-processed profile information |
US9716958B2 (en) * | 2013-10-09 | 2017-07-25 | Voyetra Turtle Beach, Inc. | Method and system for surround sound processing in a headset |
US9792952B1 (en) * | 2014-10-31 | 2017-10-17 | Kill the Cann, LLC | Automated television program editing |
US9729118B2 (en) | 2015-07-24 | 2017-08-08 | Sonos, Inc. | Loudness matching |
US9949057B2 (en) | 2015-09-08 | 2018-04-17 | Apple Inc. | Stereo and filter control for multi-speaker device |
CN108141684B (en) * | 2015-10-09 | 2021-09-24 | 索尼公司 | Sound output apparatus, sound generation method, and recording medium |
CN105263086A (en) * | 2015-10-27 | 2016-01-20 | 小米科技有限责任公司 | Adjustment method of equalizer, device and intelligent speaker |
WO2017079334A1 (en) | 2015-11-03 | 2017-05-11 | Dolby Laboratories Licensing Corporation | Content-adaptive surround sound virtualization |
EP3465681A1 (en) * | 2016-05-26 | 2019-04-10 | Telefonaktiebolaget LM Ericsson (PUBL) | Method and apparatus for voice or sound activity detection for spatial audio |
CN106601268B (en) * | 2016-12-26 | 2020-11-27 | 腾讯音乐娱乐(深圳)有限公司 | Multimedia data processing method and device |
US9860644B1 (en) | 2017-04-05 | 2018-01-02 | Sonos, Inc. | Limiter for bass enhancement |
US20210294845A1 (en) | 2017-04-28 | 2021-09-23 | Hewlett-Packard Development Company, L.P. | Audio classification with machine learning model using audio duration |
US11386913B2 (en) | 2017-08-01 | 2022-07-12 | Dolby Laboratories Licensing Corporation | Audio object classification based on location metadata |
CN107526568A (en) | 2017-08-18 | 2017-12-29 | 广东欧珀移动通信有限公司 | volume adjusting method, device, terminal device and storage medium |
JP6812381B2 (en) * | 2018-02-08 | 2021-01-13 | 日本電信電話株式会社 | Voice recognition accuracy deterioration factor estimation device, voice recognition accuracy deterioration factor estimation method, program |
EP3785453B1 (en) * | 2018-04-27 | 2022-11-16 | Dolby Laboratories Licensing Corporation | Blind detection of binauralized stereo content |
US11929091B2 (en) | 2018-04-27 | 2024-03-12 | Dolby Laboratories Licensing Corporation | Blind detection of binauralized stereo content |
CN110610702B (en) * | 2018-06-15 | 2022-06-24 | 惠州迪芬尼声学科技股份有限公司 | Method for sound control equalizer by natural language and computer readable storage medium |
US10991379B2 (en) * | 2018-06-22 | 2021-04-27 | Babblelabs Llc | Data driven audio enhancement |
US11430463B2 (en) * | 2018-07-12 | 2022-08-30 | Dolby Laboratories Licensing Corporation | Dynamic EQ |
CN109273010B (en) * | 2018-08-21 | 2020-08-11 | 深圳市声扬科技有限公司 | Voice data processing method and device, computer equipment and storage medium |
US10855241B2 (en) | 2018-11-29 | 2020-12-01 | Sony Corporation | Adjusting an equalizer based on audio characteristics |
WO2020247892A1 (en) * | 2019-06-07 | 2020-12-10 | Dts, Inc. | System and method for adaptive sound equalization in personal hearing devices |
JP7258228B2 (en) * | 2019-08-27 | 2023-04-14 | ドルビー ラボラトリーズ ライセンシング コーポレイション | Enhancing Dialogs with Adaptive Smoothing |
US10798484B1 (en) | 2019-11-26 | 2020-10-06 | Gracenote, Inc. | Methods and apparatus for audio equalization based on variant selection |
US11481628B2 (en) | 2019-11-26 | 2022-10-25 | Gracenote, Inc. | Methods and apparatus for audio equalization based on variant selection |
KR20210086086A (en) * | 2019-12-31 | 2021-07-08 | 삼성전자주식회사 | Equalizer for equalization of music signals and methods for the same |
EP3889958A1 (en) * | 2020-03-31 | 2021-10-06 | Moodagent A/S | Dynamic audio playback equalization using semantic features |
CN113763972A (en) * | 2020-06-02 | 2021-12-07 | 中国移动通信集团终端有限公司 | Method, device and equipment for adjusting audio parameters and computer storage medium |
JP2023539121A (en) * | 2020-08-18 | 2023-09-13 | ドルビー ラボラトリーズ ライセンシング コーポレイション | Audio content identification |
CN112967732B (en) * | 2021-02-25 | 2023-10-03 | 北京百度网讯科技有限公司 | Method, apparatus, device and computer readable storage medium for adjusting equalizer |
CN115691543A (en) * | 2021-07-28 | 2023-02-03 | 哈曼国际工业有限公司 | Adaptive equalization method and system for acoustic system |
TWI781714B (en) * | 2021-08-05 | 2022-10-21 | 晶豪科技股份有限公司 | Method for equalizing input signal to generate equalizer output signal and parametric equalizer |
CN115334349B (en) * | 2022-07-15 | 2024-01-02 | 北京达佳互联信息技术有限公司 | Audio processing method, device, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101569092A (en) * | 2006-12-21 | 2009-10-28 | 皇家飞利浦电子股份有限公司 | System for processing audio data |
CN102195581A (en) * | 2010-03-18 | 2011-09-21 | 承景科技股份有限公司 | Method for adjusting volume of digital audio signal |
GB2491002A (en) * | 2011-05-17 | 2012-11-21 | Fender Musical Instr Corp | Consumer audio system and method using adaptive intelligence to distinguish information content of audio signals and to control signal processing function |
CN102982804A (en) * | 2011-09-02 | 2013-03-20 | 杜比实验室特许公司 | Method and system of voice frequency classification |
Family Cites Families (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH07106883A (en) | 1993-10-01 | 1995-04-21 | Matsushita Electric Ind Co Ltd | Digital sound volume adjustment device and digital mixing device |
US5666430A (en) * | 1995-01-09 | 1997-09-09 | Matsushita Electric Corporation Of America | Method and apparatus for leveling audio output |
JPH08250944A (en) | 1995-03-13 | 1996-09-27 | Nippon Telegr & Teleph Corp <Ntt> | Automatic sound volume control method and device executing this method |
JPH1117472A (en) | 1997-06-20 | 1999-01-22 | Fujitsu General Ltd | Sound device |
US20050091066A1 (en) * | 2003-10-28 | 2005-04-28 | Manoj Singhal | Classification of speech and music using zero crossing |
JP4013906B2 (en) | 2004-02-16 | 2007-11-28 | ヤマハ株式会社 | Volume control device |
GB2413745A (en) * | 2004-04-30 | 2005-11-02 | Axeon Ltd | Classifying audio content by musical style/genre and generating an identification signal accordingly to adjust parameters of an audio system |
US20050251273A1 (en) * | 2004-05-05 | 2005-11-10 | Motorola, Inc. | Dynamic audio control circuit and method |
US8199933B2 (en) | 2004-10-26 | 2012-06-12 | Dolby Laboratories Licensing Corporation | Calculating and adjusting the perceived loudness and/or the perceived spectral balance of an audio signal |
WO2006056910A1 (en) | 2004-11-23 | 2006-06-01 | Koninklijke Philips Electronics N.V. | A device and a method to process audio data, a computer program element and computer-readable medium |
JP2006171458A (en) | 2004-12-16 | 2006-06-29 | Sharp Corp | Tone quality controller, content display device, program, and recording medium |
CN101099196A (en) * | 2005-01-04 | 2008-01-02 | 皇家飞利浦电子股份有限公司 | An apparatus for and a method of processing reproducible data |
US7464029B2 (en) * | 2005-07-22 | 2008-12-09 | Qualcomm Incorporated | Robust separation of speech signals in a noisy environment |
JP2007208407A (en) | 2006-01-31 | 2007-08-16 | Toshiba Corp | Information processing apparatus and sound control method thereof |
CU23572A1 (en) | 2006-03-31 | 2010-09-30 | Ct Ingenieria Genetica Biotech | PHARMACEUTICAL COMPOSITION INCLUDING PROTEIN NMB0938 |
CN102684628B (en) | 2006-04-27 | 2014-11-26 | 杜比实验室特许公司 | Method for modifying parameters of audio dynamic processor and device executing the method |
KR100832360B1 (en) * | 2006-09-25 | 2008-05-26 | 삼성전자주식회사 | Method for controlling equalizer in digital media player and system thereof |
BRPI0807703B1 (en) | 2007-02-26 | 2020-09-24 | Dolby Laboratories Licensing Corporation | METHOD FOR IMPROVING SPEECH IN ENTERTAINMENT AUDIO AND COMPUTER-READABLE NON-TRANSITIONAL MEDIA |
ES2377719T3 (en) | 2007-07-13 | 2012-03-30 | Dolby Laboratories Licensing Corporation | Audio processing using an analysis of auditory scenes and spectral obliqueness. |
JP2010016483A (en) * | 2008-07-01 | 2010-01-21 | Victor Co Of Japan Ltd | Sound signal correction apparatus |
JP5321263B2 (en) * | 2009-06-12 | 2013-10-23 | ソニー株式会社 | Signal processing apparatus and signal processing method |
US20100319015A1 (en) * | 2009-06-15 | 2010-12-16 | Richard Anthony Remington | Method and system for removing advertising content from television or radio content |
JP5695896B2 (en) * | 2010-12-22 | 2015-04-08 | 株式会社東芝 | SOUND QUALITY CONTROL DEVICE, SOUND QUALITY CONTROL METHOD, AND SOUND QUALITY CONTROL PROGRAM |
JP5426608B2 (en) | 2011-05-31 | 2014-02-26 | 東京瓦斯株式会社 | Abnormality detection apparatus and abnormality detection method |
US9401153B2 (en) * | 2012-10-15 | 2016-07-26 | Digimarc Corporation | Multi-mode audio recognition and auxiliary data encoding and decoding |
US9305559B2 (en) * | 2012-10-15 | 2016-04-05 | Digimarc Corporation | Audio watermark encoding with reversing polarity and pairwise embedding |
EP2936485B1 (en) * | 2012-12-21 | 2017-01-04 | Dolby Laboratories Licensing Corporation | Object clustering for rendering object-based audio content based on perceptual criteria |
US9344815B2 (en) * | 2013-02-11 | 2016-05-17 | Symphonic Audio Technologies Corp. | Method for augmenting hearing |
2013
- 2013-03-26 CN CN201310100401.XA patent/CN104079247B/en active Active

2014
- 2014-03-17 US US14/780,485 patent/US9621124B2/en active Active
- 2014-03-17 EP EP14724216.8A patent/EP2979359B1/en active Active
- 2014-03-17 WO PCT/US2014/030663 patent/WO2014160548A1/en active Application Filing
- 2014-03-17 ES ES14724216.8T patent/ES2630398T3/en active Active
- 2014-03-17 EP EP17164545.0A patent/EP3232567B1/en active Active
- 2014-03-17 JP JP2016505490A patent/JP6053984B2/en active Active

2016
- 2016-11-29 JP JP2016230947A patent/JP6325640B2/en active Active

2017
- 2017-02-15 US US15/433,486 patent/US10044337B2/en active Active

2018
- 2018-03-12 HK HK18103383.9A patent/HK1244110A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
US10044337B2 (en) | 2018-08-07 |
EP3232567A1 (en) | 2017-10-18 |
JP6053984B2 (en) | 2016-12-27 |
JP2016519493A (en) | 2016-06-30 |
JP6325640B2 (en) | 2018-05-16 |
CN104079247A (en) | 2014-10-01 |
EP2979359B1 (en) | 2017-05-03 |
US9621124B2 (en) | 2017-04-11 |
HK1244110A1 (en) | 2018-07-27 |
EP3232567B1 (en) | 2019-10-23 |
US20160056787A1 (en) | 2016-02-25 |
WO2014160548A1 (en) | 2014-10-02 |
EP2979359A1 (en) | 2016-02-03 |
ES2630398T3 (en) | 2017-08-21 |
JP2017073811A (en) | 2017-04-13 |
US20170230024A1 (en) | 2017-08-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104079247B (en) | Equalizer controller and controlling method and audio reproducing system | |
CN104080024B (en) | Volume leveler controller and controlling method and audio classifiers | |
US10803879B2 (en) | Apparatuses and methods for audio classifying and processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||