CN106782544A - Voice interaction device and output method thereof - Google Patents
Voice interaction device and output method thereof
- Publication number
- CN106782544A CN106782544A CN201710199965.1A CN201710199965A CN106782544A CN 106782544 A CN106782544 A CN 106782544A CN 201710199965 A CN201710199965 A CN 201710199965A CN 106782544 A CN106782544 A CN 106782544A
- Authority
- CN
- China
- Prior art keywords
- audio
- audio output
- input data
- parameter
- audio input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/165—Management of the audio stream, e.g. setting of volume, audio stream path
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Abstract
The present disclosure provides a voice interaction device and an output method therefor. The method includes: collecting audio input data through an audio collection apparatus; obtaining audio output data in response to the audio input data; obtaining a target audio output parameter determined by analysis; and controlling an audio output apparatus to output the audio output data according to the target audio output parameter.
Description
Technical field
The present invention relates generally to the field of electronic devices, and more particularly to a voice interaction device and an output method therefor.
Background

With the development of voice technology, devices with voice interaction functions, such as mobile phones, computers, and smart speakers, have become increasingly common. The output parameters of these devices, such as output volume, are typically user-settable, but problems can still arise. For example, late at night after everyone has fallen asleep and the room is quiet, suppose a user wakes up and asks the smart speaker a question (for example, what the weather will be like). If the speaker answers at the volume set during the daytime, the sound will seem very loud and may wake others up. Relying on the user to turn down the volume before going to sleep does not solve the problem either: if the user forgets to adjust it one day, the same problem occurs.

Accordingly, a mechanism capable of intelligently controlling the output of a voice interaction device is needed.
Summary of the invention

According to a first aspect of the present application, an output method is provided. The method includes: collecting audio input data through an audio collection apparatus; obtaining audio output data in response to the audio input data; obtaining a target audio output parameter determined by analysis; and controlling an audio output apparatus to output the audio output data according to the target audio output parameter.

According to a second aspect of the present application, a processing device is provided. The device includes: an audio collection apparatus; an audio output apparatus; and a processing unit. The processing unit is configured to: collect audio input data through the audio collection apparatus; obtain audio output data in response to the audio input data; obtain a target audio output parameter determined by analysis; and control the audio output apparatus to output the audio output data according to the target audio output parameter.
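The four steps of the claimed method can be sketched as a small Python class. This is an illustrative sketch only; the component names (`collector`, `responder`, `analyzer`, `output`) are hypothetical placeholders, not terms from the patent.

```python
# Illustrative sketch of the claimed output method; all component names
# are hypothetical placeholders, not the patent's own implementation.

class VoiceDevice:
    def __init__(self, collector, responder, analyzer, output):
        self.collector = collector    # audio collection apparatus
        self.responder = responder    # local or external response logic
        self.analyzer = analyzer      # determines target output parameters
        self.output = output          # audio output apparatus

    def run_once(self):
        audio_in = self.collector()               # collect audio input data
        audio_out = self.responder(audio_in)      # obtain responsive output data
        params = self.analyzer(audio_in)          # obtain target output parameter
        self.output(audio_out, params)            # output per target parameters
        return audio_out, params


# Minimal stand-ins to exercise the flow:
device = VoiceDevice(
    collector=lambda: {"volume": 0.2},
    responder=lambda a: "it is sunny tomorrow",
    analyzer=lambda a: {"volume": a["volume"]},
    output=lambda data, p: None,
)
out, params = device.run_once()
```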
Brief description of the drawings
The above and other aspects and advantages of exemplary embodiments of the present invention will become apparent to those of ordinary skill in the art from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows a flow chart of an output method according to an embodiment of the present invention.
Fig. 2 shows a flow chart of an output method according to one embodiment of the present invention.
Fig. 3 shows a flow chart of an output method according to another embodiment of the present invention.
Fig. 4 shows a flow chart of an output method according to a further embodiment of the present invention.
Fig. 5 shows a block diagram of a voice interaction device according to an embodiment of the present invention.
Fig. 6 shows a block diagram of a processing unit of a voice interaction device according to an embodiment of the present invention.
In the accompanying drawings, like reference numerals indicate identical or similar elements.
Detailed description
Other aspects, advantages, and salient features of the present disclosure will become apparent to those skilled in the art from the following detailed description of exemplary embodiments of the disclosure, taken with reference to the accompanying drawings.
In the present disclosure, the terms "include" and "comprise" and their derivatives are meant to be inclusive and non-limiting; the term "or" is inclusive, meaning and/or.
In this specification, the following descriptions of various embodiments, which explain the principles of the disclosure, are illustrative only and should not be construed in any way as limiting the scope of the disclosure. The following description, made with reference to the drawings, is intended to assist in a comprehensive understanding of exemplary embodiments of the disclosure as defined by the claims and their equivalents. The description includes various specific details to assist understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the disclosure. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Throughout the drawings, the same reference numerals refer to similar functions and operations.
Fig. 1 shows a flow chart of an output method 100 performed in a device 10 according to an embodiment of the present invention.
The device 10 in embodiments of the present invention may be any of various devices that can provide a voice interaction function (hereinafter also referred to as voice devices), such as a smart speaker, a responsive smart voice toy, a smartphone with a voice assistant, a computer, or a robot with a voice interaction function. Such voice devices generally include an audio collection apparatus, an audio output apparatus, a processing unit, and the like.
The method 100 starts when the user provides an audio input. In some embodiments, the device 10 may remain in a sound monitoring state after being powered on, and the method 100 starts when an audio input is detected. Preferably, in other embodiments, the user may notify the device of a subsequent voice input by means of a specific wake-up word (a preset phrase). In these embodiments, the method 100 starts only when the device 10 detects the wake-up word input by the user.
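The wake-word gating described above can be sketched as a simple state machine. The wake phrase below is a placeholder (the patent's example phrase is garbled in translation), and the text-matching detection is a stand-in for real keyword spotting.

```python
# Hedged sketch of wake-word gating: the method proceeds only for input
# that follows a detected wake word. The phrase and string matching are
# illustrative stand-ins for real keyword spotting.

WAKE_WORD = "hello speaker"  # placeholder wake phrase

def gated_inputs(utterances):
    """Yield only the utterances that follow a detected wake word."""
    awake = False
    for text in utterances:
        if awake:
            yield text       # this utterance is treated as the command
            awake = False    # require the wake word again next time
        elif text.strip().lower() == WAKE_WORD:
            awake = True

commands = list(gated_inputs([
    "background chatter",
    "hello speaker",
    "what is the weather tomorrow",
    "more chatter",
]))
```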
As shown in Fig. 1, in step S110, audio input data is collected through an audio collection apparatus.
According to embodiments of the present invention, the audio collection apparatus may be any of various devices, known or developed in the future, that have an audio collection function, such as a microphone. An audio collection apparatus may also be called a pickup, a monitoring head, or a pickup head, or may be any of various audio capture cards.
Preferably, in the case where the user notifies the device of a subsequent voice input by a specific wake-up word, the audio input data in step S110 generally refers to the audio input data collected by the audio collection apparatus after the wake-up word.
In step S120, audio output data responding to the collected audio input data is obtained.
Specifically, in step S120, the audio input data collected by the audio collection apparatus may be processed, the audio output data responding to the audio input data may be determined according to the processing result, and the determined audio output data may then be obtained.
The processing of the audio input data collected by the audio collection apparatus may be performed locally by the device 10, or may be performed by an external processing device.
Specifically, the device 10 may obtain the audio output data responding to the collected audio input data locally or from an external device. The device 10 may locally process the audio input data collected by the audio collection apparatus. For example, the device 10 may process the audio input data with a local processor, or the audio collection apparatus itself may even integrate a dedicated processor to process the audio input data.
Alternatively, the device 10 may send the audio input data collected by the audio collection apparatus to an external processing device for processing.
Whether performed locally in the device 10 or in an external processing device, the processing of the audio input data may include analysis of the content of the speech contained in the audio input data, and analysis of the audio attributes of the audio input data (such as duration in the time domain and/or spectral characteristics). Specifically, the processing of the audio input data may include, for example, noise filtering, speech recognition, and spectrum analysis.
According to the processing result of the audio input data, the audio output data responding to the audio input data may be determined. As one example, if the processing result of the audio input data indicates that the user is asking for the time, the current time may be used as the audio output data. As another example, when the user asks "What will the weather be like tomorrow?", the processing result obtained by performing speech recognition and semantic analysis on the audio input data indicates that the user wishes to know the weather; weather forecast information may then be searched for using Internet search technology, and the retrieved weather forecast information may be converted into an audio signal as the audio output data. As yet another example, when the user issues a "tell me a story" request, the processing result of the audio input data indicates that the user wishes to hear a story; a story may then be selected according to a predetermined policy, and the audio signal corresponding to the selected story may be used as the audio output data. The policy for selecting a story can take many forms. For example, a story may be randomly selected from a locally stored set of candidate stories and played. Alternatively, different candidate story subsets may be selected according to the gender/age of the speaker as determined by spectrum analysis of the audio input data, and a story may then be randomly selected from the chosen candidate story subset and played.
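The two-stage selection policy described above can be sketched as follows. The speaker-class labels and candidate story sets are illustrative assumptions; the actual gender/age classification from spectrum analysis is outside the scope of this sketch.

```python
# Sketch of the story-selection policy: pick a candidate subset based on
# an inferred speaker class, then choose randomly within that subset.
# The class labels and story titles are illustrative assumptions.
import random

STORY_SETS = {
    "child": ["The Tortoise and the Hare", "Little Red Riding Hood"],
    "adult": ["A detective story", "A history anecdote"],
}

def select_story(speaker_class, rng=None):
    """Random choice within the subset matching the inferred speaker class."""
    rng = rng or random.Random()
    subset = STORY_SETS.get(speaker_class, STORY_SETS["adult"])
    return rng.choice(subset)

story = select_story("child", rng=random.Random(0))
```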
The above operation of determining the audio output data responding to the audio input data may be performed locally by the device 10, or the device 10 may send the collected audio input data to an external processing device, which then performs the determination.
In addition, the device 10 may obtain the audio output data responding to the audio input data from local storage. Alternatively, the device 10 may receive the audio output data responding to the audio input data from an external device.
In the case where the device 10 sends the collected audio input data to an external processing device for processing, the external processing device may send the processing result of the audio input data back to the device 10, and the device 10 may then determine and obtain the audio output data responding to the audio input data. Alternatively, the external processing device may determine and obtain the audio output data responding to the audio input data according to the processing result of the audio input data, and then send the audio output data to the device 10.
In step S130, a target audio output parameter determined by analysis is obtained.
Optionally, the device 10 may determine the target audio output parameter by local analysis, thereby obtaining the target audio output parameter. It should be understood that the local processor that analytically determines the target audio output parameter and the local processor that processes the audio input data as described above may be the same processor, or may be different processors in the device 10.
Alternatively, the target audio output parameter may be determined by analysis in an external processing device. The device 10 may then receive the target audio output parameter from the external device. It should be understood that the external processing device that analytically determines the target audio output parameter and the external processing device that processes the audio input data as described above may be the same device, or may be different devices.
Whether the target audio output parameter is determined locally in the device 10 or in an external processing device, the analysis can be performed in various ways.
In some embodiments, the target audio output parameter matching the audio input data may be determined by analyzing the audio input data.
As an example, the target audio output parameter may be determined according to the audio parameters corresponding to the audio input data. The audio parameters corresponding to the audio input data may include, for example, at least one of the following: a volume parameter, a prosody parameter, and a timbre parameter. In particular, the volume parameter may represent how loud the input is. The prosody parameter may represent how fast the speech is. The timbre parameter may represent, for example, a male/female/child voice. These parameters can generally be obtained by performing power spectrum analysis and spectrum analysis on the audio input data.
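Two of these parameters can be crudely estimated directly from raw samples, as sketched below. This is a simplified stand-in for the power-spectrum and spectrum analysis mentioned in the text, not the patent's actual algorithm: RMS amplitude approximates the volume parameter, and zero-crossing rate serves as a rough frequency proxy relevant to timbre.

```python
# Simplified stand-ins for volume and timbre estimation from raw samples;
# real systems would use proper power-spectrum / spectral analysis.
import math

def rms_volume(samples):
    """Root-mean-square amplitude as a simple volume measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def zero_crossing_rate(samples, sample_rate):
    """Crude frequency proxy (Hz); higher values suggest a higher-pitched voice."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings * sample_rate / (2 * len(samples))

# A pure 200 Hz tone sampled at 8 kHz should measure close to 200 Hz
# and have RMS amplitude 1/sqrt(2):
rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(rate)]
f_est = zero_crossing_rate(tone, rate)
v = rms_volume(tone)
```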
In such implementations, the target audio output parameter may be consistent with, or opposite to, the audio parameters of the audio input data.
As an example, if analysis shows that the volume of the audio input data is low, the target audio output volume is determined to be low; conversely, if the volume of the audio input data is high, the target audio output volume is determined to be high. This suits scenarios in which the user whispers in a quiet environment and speaks loudly in a noisy one. In such an implementation, in a quiet environment such as late at night, the user whispers and the device outputs at a low volume without disturbing others; during the day, the user speaks loudly and the device outputs at a high volume, so that the user does not miss the answer.
As another example, if analysis shows that the volume of the audio input data is low, the target audio output volume is determined to be high; conversely, if the volume of the audio input data is high, the target audio output volume is determined to be low. This suits situations where a low input volume is caused by distance: a low input volume suggests that the user is far from the device, so a high output volume is needed for the user to hear, while a high input volume suggests that the user is near the device, so a low output volume suffices.
As an example, if analysis shows that the prosody of the audio input data indicates fast speech, the prosody parameter of the target audio output is determined to also indicate fast speech; conversely, if the prosody of the audio input data indicates slow speech, the prosody parameter of the target audio output is determined to also indicate slow speech. Alternatively, if the prosody of the audio input data indicates fast speech, the output duration of the target audio output uses a first duration; otherwise, it uses a second duration, where the first duration is shorter than the second. In this way, rapid input corresponds to rapid output, meeting the needs of a user in a hurry.
As another example, if the prosody of the audio input data indicates fast speech, the prosody parameter of the target audio output is determined to indicate slow speech. In this way, rapid input corresponds to slow output, which can soothe the user's mood.
As another example, if the timbre of the audio input data indicates a male/female voice, the timbre parameter of the target audio output is determined to indicate a female/male voice. Alternatively, if the timbre of the audio input data indicates a male/female/child voice, the timbre parameter of the target audio output is determined to correspondingly indicate a male/female/child voice.
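The matching rules above can be sketched as a small mapping function. The coarse labels ("low"/"high", "fast"/"slow") and the mode switch are illustrative assumptions, not terms from the patent.

```python
# Sketch of the matching rules: output parameters either mirror the input
# (consistency) or invert it (opposition). Labels and thresholds are
# illustrative assumptions.

def match_output_params(input_params, volume_mode="consistent"):
    """input_params: {'volume': 'low'|'high', 'rate': 'fast'|'slow',
    'timbre': 'male'|'female'|'child'} — illustrative labels only."""
    out = {}
    if volume_mode == "consistent":
        out["volume"] = input_params["volume"]      # whisper -> quiet reply
    else:
        # "opposite": low input implies a distant user, so answer loudly
        out["volume"] = "high" if input_params["volume"] == "low" else "low"
    out["rate"] = input_params["rate"]              # hurried user, quick reply
    # Opposite-gender timbre variant described in the text:
    out["timbre"] = {"male": "female", "female": "male"}.get(
        input_params["timbre"], input_params["timbre"])
    return out

params = match_output_params(
    {"volume": "low", "rate": "fast", "timbre": "male"}, volume_mode="opposite")
```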
In a preferred embodiment, the audio input data may be analyzed in combination with the source distance to determine the matching target audio output parameter. This takes into account that the audio parameters of the audio input data collected by the device 10 depend not only on the corresponding audio parameters of the sound source but are also affected by the source distance.
The source distance can be determined in several ways.
In some examples, waveform analysis may be performed on the audio input data collected by the audio collection apparatus to determine the distance of the sound source from the audio collection apparatus. For example, because high-frequency components and low-frequency components attenuate and are delayed differently with distance, the distance of the sound source from the audio collection apparatus may be determined by analyzing the waveform of the audio input data.
In other examples, the source distance may be estimated by a microphone array. In these examples, the audio collection apparatus of the device 10 includes a microphone array, and the source distance may be estimated through it. Preferably, in order for the estimation accuracy to be substantially the same in all directions, an equilateral-triangle microphone arrangement may be used, in which three microphones are placed at the three vertices of an equilateral triangle. Alternatively, a square microphone arrangement may be used, in which four microphones are placed at the four vertices of a square. Using the different delay times produced by the different distances from the sound source to each microphone of the array, the distance of the sound source can be estimated according to the specific arrangement (e.g., equilateral triangle or square) of the microphone array.
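The geometry behind this delay-based estimate can be sketched as a brute-force search: for each candidate source position, predict the inter-microphone arrival-time differences and keep the position that best matches the measured ones. Real systems estimate the delays with cross-correlation techniques; this sketch assumes the delays are already known and only illustrates the triangle-array geometry.

```python
# Hedged sketch of near-field localization with a small microphone array:
# grid-search the candidate position whose predicted inter-mic delays best
# match the measured time differences of arrival (TDOA).
import math

C = 343.0  # approximate speed of sound in air, m/s

def delays_from(pos, mics):
    """Arrival-time differences of each microphone relative to mic 0."""
    t = [math.dist(pos, m) / C for m in mics]
    return [ti - t[0] for ti in t]

def locate(measured_tdoa, mics, extent=3.0, step=0.05):
    """Return the grid position minimizing squared TDOA mismatch."""
    best, best_err = None, float("inf")
    x = -extent
    while x <= extent:
        y = -extent
        while y <= extent:
            pred = delays_from((x, y), mics)
            err = sum((p - m) ** 2 for p, m in zip(pred, measured_tdoa))
            if err < best_err:
                best, best_err = (x, y), err
            y += step
        x += step
    return best

# Equilateral-triangle array (side 0.1 m), as in the preferred arrangement:
mics = [(0.0, 0.0), (0.1, 0.0), (0.05, 0.1 * math.sqrt(3) / 2)]
true_src = (1.0, 0.5)
est = locate(delays_from(true_src, mics), mics)
```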
In still other examples, the source distance may be estimated by an image collection apparatus. In these examples, the device 10 includes an image collection apparatus (such as a camera). Preferably, the device 10 may include more than one camera, for example multiple cameras facing different directions. In some implementations, the source distance may be determined by detecting whether there is a user within the field of view of the image collection apparatus. For example, if a user is detected within the field of view of the image collection apparatus by face detection technology, the source distance is determined to be near; otherwise, it is determined to be far. In other implementations, one or more images may be collected by the image collection apparatus, and the collected images may be analyzed by a local processor or an external processing device to determine the source distance. For example, if the analysis result indicates that the same person appears in multiple images collected consecutively by the same camera, and the opening and closing of that person's mouth differs between the images, the person is regarded as the supplier of the audio input signal (i.e., the sound source), and the source distance is determined to be near; otherwise, the source distance is determined to be far.
In still other examples, the source distance may be estimated by an infrared rangefinder. In these examples, the device 10 includes an infrared rangefinder. The device 10 may first determine the direction of the sound source through the audio collection apparatus (e.g., a microphone), and then measure the distance of the sound source in that direction using the infrared rangefinder.
After the source distance is determined, the audio input data may be analyzed in combination with the source distance to determine the matching target audio output parameter.
The target output volume may simply be determined to be directly proportional to the volume of the audio input data and inversely proportional to the source distance. Sound power W can generally be used as the metric for measuring volume, with the watt (W) as its unit. For example, the target output volume may be determined by the following formula:
W_out = k0 × W_input / r (1),
where W_out represents the target output volume, W_input represents the volume of the audio input data, r represents the distance between the audio collection apparatus and the sound source, and k0 is a constant.
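The proportionality just described (output volume proportional to the measured input volume and inversely proportional to the source distance) can be written directly as code. The constant k0 is a tuning assumption.

```python
# The simple distance-aware volume rule as code: output proportional to
# measured input volume, inversely proportional to source distance.
# k0 is a tuning constant (an assumption, not specified by the text).

def target_volume_simple(w_input, r, k0=1.0):
    if r <= 0:
        raise ValueError("source distance must be positive")
    return k0 * w_input / r

w = target_volume_simple(w_input=0.008, r=2.0)
```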
Alternatively, the audio input data may be analyzed in combination with the source distance to determine the audio parameters of the sound source itself, and the target audio output parameter matching the audio parameters of the sound source may then be determined.
Taking volume as an example, the volume collected by the device 10 (that is, the volume corresponding to the collected audio input data) does not directly characterize the volume of the sound source, but is also affected by the distance of the sound source from the device 10. Sound intensity I can generally be used as the metric for measuring how strong the sound is at a point, with watts per square meter (W/m²) as its unit, and sound power W can be used as the metric for measuring the volume of the source, with the watt (W) as its unit. Considering that a human voice command is a point source, its propagation can be treated as a spherical wave. For a spherical wave, the sound intensity I is directly proportional to the sound power W of the point source and inversely proportional to the square of the distance r, as shown in the following formula:
I = W / (4πr²) (2).
Based on the acoustic inverse square law of formula (2), the sound power W of the sound source can be derived from the sound intensity I actually measured at the measurement point (the device 10) and the source distance r:
W = I × 4πr² (3).
Then, by analyzing the audio input data in combination with the source distance, the source volume W can be determined, and the target audio output volume can be set proportional to the source volume W. When the source volume is high, the target output volume is also high; conversely, when the source volume is low, the target output volume is also low. For example, the target output volume may be determined by the following formula:
W_out = k1 × W (4),
where W_out represents the target output volume, W represents the source volume, r represents the distance between the audio collection apparatus and the sound source, and k1 is a constant.
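The inverse-square-law derivation can be checked numerically: recover the source's sound power from the intensity measured at the device and the source distance, then set the output volume proportional to it. The constant k1 and the sample intensity values are assumptions.

```python
# Formulas (2)-(4) as code: estimate the source's sound power from the
# measured intensity and distance, then scale the output volume by it.
# k1 and the sample values are tuning assumptions.
import math

def source_power(intensity, r):
    """Formula (3): W = I * 4 * pi * r^2 (point source, spherical wave)."""
    return intensity * 4 * math.pi * r ** 2

def target_volume(intensity, r, k1=1.0):
    """Formula (4): output volume proportional to estimated source power."""
    return k1 * source_power(intensity, r)

# The same measured intensity implies a louder source when it is farther away:
near = target_volume(intensity=1e-6, r=0.5)
far = target_volume(intensity=1e-6, r=2.0)
```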
The target audio output parameter that matching is determined by combining source of sound distance analysis audio input data is described above
Some embodiments.These embodiments there may be change in implementing.
For example, in certain embodiments, the determination of target audio output parameter had always both considered the sound of audio input data
Frequency parameter, it is also considered that source of sound distance.
In other embodiments, the source distance is considered only when an audio parameter of the audio input data satisfies a certain condition (for example, when the volume of the audio input data is low). As an example, the audio input data is analyzed first. If the analysis indicates that the volume of the audio input data is high, the target audio output volume is determined to be high, and there is no need to detect the source distance. If the analysis indicates that the volume is low, the source distance is detected, and the target audio output volume is then determined from the combination of the input volume and the source distance (for example, using formula (1) above). With this implementation, the device can distinguish whether the user is whispering or speaking loudly, and in particular whether the user is whispering nearby or speaking loudly from far away, so it can adaptively adjust the output volume and improve the user experience.
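The decision flow just described, using the input volume directly when it is high and falling back to combining volume with the detected source distance only when it is low, can be sketched as follows. This is an illustrative sketch rather than the patented implementation: the threshold, the constants k1 and k2, and the linear combination standing in for formula (1) are all assumptions.

```python
from typing import Optional

def target_output_volume(input_volume: float,
                         source_distance: Optional[float],
                         loud_threshold: float = 60.0,
                         k1: float = 1.0,
                         k2: float = 2.0) -> float:
    """Choose a target output volume from the measured input volume,
    consulting the source distance only when the input is quiet."""
    if input_volume >= loud_threshold:
        # Loud input: output loudly; no need to detect the source distance.
        return k1 * input_volume
    # Quiet input: combine volume and distance, so that whispering far
    # away yields a somewhat higher output than whispering close by.
    distance = source_distance if source_distance is not None else 0.0
    return k1 * input_volume + k2 * distance
```

In this sketch a loud input skips distance detection entirely, mirroring the embodiment in which the distance sensor is consulted only for quiet inputs.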
In still other embodiments, an environment parameter may be analyzed to determine a target audio output parameter matching that environment parameter. The environment parameter may include, for example, the ambient brightness, which can be detected by a photodetector. When the detected ambient brightness is high, the target audio output parameter (such as volume) may be set higher; when the ambient brightness is low, it may be set lower. This implementation can adaptively output a high volume during the day and a low volume late at night, reducing disturbance to others.
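A minimal sketch of this brightness-to-volume mapping might look like the following. The lux scale, the volume steps, and the linear interpolation are illustrative assumptions, since the embodiment specifies only that brighter surroundings yield a higher volume.

```python
def volume_from_brightness(brightness_lux: float,
                           min_volume: int = 2,
                           max_volume: int = 10,
                           bright_lux: float = 500.0) -> int:
    """Map detected ambient brightness to an output volume step:
    brighter surroundings (daytime) -> higher volume,
    darker surroundings (late night) -> lower volume."""
    frac = max(0.0, min(1.0, brightness_lux / bright_lux))
    return round(min_volume + frac * (max_volume - min_volume))
```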
In some other embodiments, the behavior or attributes of the inputter of the audio input data may be analyzed to determine a target audio output parameter matching that behavior or those attributes.
In this example, the device 10 includes an image collection apparatus (such as a camera). The collected images can be analyzed with image recognition technology (such as face recognition) to determine the user's behavior or attributes.
The inputter's behavior may include, for example, gestures or actions indicating a low volume, and/or gestures or actions indicating a high volume. Gestures or actions indicating a low volume may include, for example, the user placing an index finger in front of the mouth, pressing a palm downward along the direction of gravity, or walking closer to the device while speaking. Gestures or actions indicating a high volume may include, for example, raising a palm upward, cupping both hands in front of the mouth like a horn, or holding a palm beside the mouth with the mouth and the device on the same side of the palm.
One or more images of the inputter are collected by the image collection apparatus, and the user's gesture or action is determined by image recognition technology.
For example, if the recognition result shows the user placing a finger (preferably the index finger) in front of the mouth, the volume of the target audio output parameter may be set low.
If the recognition result shows the user pressing a palm downward along the direction of gravity, or raising a palm upward, the volume of the target audio output parameter may be set low or high, respectively.
If the recognition result shows the user walking closer to the device while speaking, the volume of the target audio output parameter may be set low.
If the recognition result shows the user cupping both hands in front of the mouth like a horn, the volume of the target audio output parameter may be set high.
If the recognition result shows the user's mouth and the device on the same side of the palm, the volume of the target audio output parameter may be set high.
These implementations can adaptively adjust the audio output parameters according to the user's body movements, improving the user experience.
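The gesture rules above amount to a lookup table. A sketch follows, with hypothetical gesture labels; the patent does not define a label set for its recognizer, so the names and the fallback behavior are assumptions.

```python
# Hypothetical gesture labels, as an image-recognition module might emit them.
GESTURE_TO_VOLUME = {
    "finger_before_mouth":    "low",   # index finger in front of the mouth
    "palm_pressed_down":      "low",   # palm pressed downward along gravity
    "approach_while_talking": "low",   # walking closer to the device
    "palm_raised_up":         "high",  # raising the palm upward
    "hands_cupped_horn":      "high",  # both hands cupped like a horn
    "palm_beside_mouth":      "high",  # palm beside the mouth, facing the device
}

def volume_for_gesture(gesture: str, default: str = "unchanged") -> str:
    """Resolve a recognized gesture to a target volume level,
    leaving the volume unchanged for unrecognized gestures."""
    return GESTURE_TO_VOLUME.get(gesture, default)
```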
The inputter's attributes may include, for example, parameters indicating the inputter's sex, age, and so on. An image of the inputter can be collected by an image collection apparatus such as a camera, and the inputter's sex and/or age determined by image analysis technology (such as face recognition). The target audio output parameter is then determined according to the inputter's sex and/or age.
For example, if the recognition result shows that the inputter is elderly, the target audio output volume may be determined to be high.
If the recognition result shows that the inputter is male, female, or a child, the target audio output timbre may be determined to be a male, female, or child voice, respectively.
These implementations can adaptively adjust the audio output parameters according to the user's sex, age, and similar attributes, improving the user experience.
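These attribute rules can likewise be sketched as a small mapping. The age cutoffs (under 12 for a child voice, 65 and over for a high volume) and the profile field names are illustrative assumptions, as the patent does not fix numeric thresholds.

```python
from typing import Optional

def output_profile(sex: Optional[str], age: Optional[int]) -> dict:
    """Derive an output timbre and a volume hint from the inputter's
    recognized attributes (sex and/or age)."""
    timbre = {"male": "male_voice", "female": "female_voice"}.get(sex, "default")
    if age is not None and age < 12:
        timbre = "child_voice"  # a child gets a child voice regardless of sex
    # Elderly inputters get a high target output volume.
    volume = "high" if (age is not None and age >= 65) else "normal"
    return {"timbre": timbre, "volume": volume}
```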
It should be understood that the above are only illustrative examples of analyses that determine the target audio output parameter; the invention is not limited to them. For example, two or more of the audio input data, the source distance, the environment parameter, and/or the inputter's behavior or attributes may be considered together to determine the target audio output parameter.
It should also be understood that maximum and/or minimum thresholds may be set so that the target audio output parameter is confined to a threshold range.
In step S140, the audio output device is controlled, according to the target audio output parameter, to output the audio output data obtained in step S120.
The device 10 can adjust the audio output parameters of its audio output device according to the target audio output parameter. If the target audio output parameter indicates an adjustment amount, the current audio output parameter is adjusted accordingly. If the target audio output parameter indicates an absolute value, and the current audio output parameter differs from the target audio output parameter obtained in step S130, the current audio output parameter is set to the target value. Preferably, it should be ensured that the adjusted audio output parameter still lies within the threshold range.
The audio output device then outputs, under the audio output parameters so set, the audio output data obtained in step S120.
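Step S140's handling of the two kinds of target parameter, an adjustment amount versus an absolute value, together with the preferred threshold clamping, might be sketched as follows. The ("delta", x)/("abs", x) tagging and the [vmin, vmax] range are assumed for illustration.

```python
def apply_target(current: float, target,
                 vmin: float = 0.0, vmax: float = 100.0) -> float:
    """Apply a target audio output parameter to the current setting.
    A ("delta", x) target adjusts the current value by x; an ("abs", x)
    target replaces it (a no-op when it already matches). The result is
    clamped so the adjusted parameter stays within [vmin, vmax]."""
    kind, value = target
    new = current + value if kind == "delta" else float(value)
    return max(vmin, min(vmax, new))
```

Clamping after either branch ensures the adjusted parameter never leaves the threshold range, whichever form the target takes.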
Thus, the method according to embodiments of the present invention can adaptively adjust the audio output parameters of the voice interaction device, produce a suitable audio output, and improve the user experience.
Below, taking volume as the example audio parameter, the method 100 of the embodiments of the present invention is described in further detail with reference to Fig. 2 to Fig. 4.
Fig. 2 shows a flowchart of an output method 200 in the voice interaction device 10 according to an embodiment of the present invention. Method 200 is a specific embodiment of method 100. In this embodiment, the audio input data is analyzed in combination with the source distance to determine a target audio output volume matching the input volume.
As shown, method 200 starts at step S210.
In step S210, audio input data is collected by the audio collection device. Step S210 is a specific implementation of step S110. In particular, in this embodiment, the audio collection device includes a PU probe, and step S210 also includes measuring the sound pressure p and the direct sound particle velocity u of the input audio with the PU probe.
In step S212, the distance between the sound source (the inputter providing the audio input) and the device 10, referred to as the source distance r, is detected using a depth camera. The depth camera may be arranged in the device 10 as a part of it, or may be arranged near the device 10.
In step S220, audio output data responding to the collected audio input data is obtained.
In step S230, the target audio output parameter determined by analysis is obtained.
In this embodiment, in particular, the audio input data is analyzed in combination with the source distance to determine the matching target audio output parameter. Specifically, the sound intensity I of the input audio at the device 10 can be calculated from the sound pressure p and the direct sound particle velocity u measured in step S210. Then, according to formula (2), the sound power W of the source can be estimated from the calculated sound intensity I at the device 10 and the source distance r determined in step S212, and used as a measure of the source volume. The target audio output volume can then be set proportional to the source volume: when the source volume is high, the target output volume is also high; when the source volume is low, the target output volume is also low.
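The estimation in steps S210 to S230 can be sketched as follows under the standard free-field assumption that intensity falls off as 1/(4πr²) from a point source; whether formula (2) in the patent uses exactly this spherical-spreading model is an assumption here.

```python
import math

def source_power(p: float, u: float, r: float) -> float:
    """Estimate the sound power W of the source, used as a measure of
    the source volume.
    p: sound pressure [Pa] and u: particle velocity [m/s], both measured
    at the device by the PU probe; r: source distance [m] from the depth
    camera. The sound intensity at the device is I = p * u; assuming
    free-field spherical spreading, the source power is
    W = I * 4 * pi * r**2."""
    intensity = p * u  # sound intensity I at the device
    return intensity * 4.0 * math.pi * r ** 2
```

With W in hand, the target output volume can be set proportional to it, for example via W_out = k1 × W as in formula (4) above.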
As before, the above analysis that determines the target audio output parameter can be performed locally in the device 10, or can be performed by an external processing device. In addition, the depth camera can be communicatively coupled to the device 10 and/or to the external processing device that performs the analysis.
In step S240, the audio output device is controlled, according to the target audio output parameter, to output the audio output data obtained in step S220.
Method 200 can distinguish whether the user is whispering or speaking loudly, and in particular whether the user is whispering nearby or speaking loudly from far away, and can accordingly provide a suitable audio output volume, improving the user experience.
Steps S220 and S240 in method 200 are identical to steps S120 and S140 in method 100. Apart from the special features noted above, steps S210 and S230 in method 200 are similar to steps S110 and S130 in method 100. Parts of method 200 that are similar to method 100 are therefore not repeated here.
Fig. 3 shows a flowchart of an output method 300 in the voice interaction device 10 according to another embodiment of the present invention. Method 300 is a specific embodiment of method 100. In this embodiment, an environment parameter is analyzed to determine a target audio output parameter matching the environment parameter.
As shown, method 300 starts at step S310.
In step S310, audio input data is collected by the audio collection device.
In step S320, audio output data responding to the collected audio input data is obtained.
In step S330a, the environment parameter is detected by a detector such as a sensor; for example, the ambient brightness is detected by a light sensor. The sensor may be arranged in the device 10 as a part of it, or may be arranged near the device 10.
In step S330, the target audio output parameter determined by analysis is obtained.
In this embodiment, in particular, the environment parameter is analyzed to determine the target audio output parameter matching it. For example, if the ambient brightness detected in step S330a is high, the target audio output volume may be set higher; if the ambient brightness is low, it may be set lower.
As before, this analysis can be performed locally in the device 10 or by an external processing device. In addition, the detector that senses the environment can be communicatively coupled to the device 10 and/or to the external processing device that performs the analysis.
In step S340, the audio output device is controlled, according to the target audio output parameter, to output the audio output data obtained in step S320.
Method 300 can adaptively output a high volume during the day and a low volume late at night, reducing disturbance to others at night.
Steps S310, S320, and S340 in method 300 are identical to steps S110, S120, and S140 in method 100. Apart from the special features noted above, step S330 in method 300 is a specific implementation of step S130 in method 100. Parts of method 300 that are similar to method 100 are therefore not repeated here.
Fig. 4 shows a flowchart of an output method 400 in the voice interaction device 10 according to yet another embodiment of the present invention. Method 400 is a specific embodiment of method 100. In this embodiment, the behavior or attributes of the inputter of the audio input data are analyzed to determine a target audio output parameter matching that behavior or those attributes.
As shown, method 400 starts at step S410.
In step S410, audio input data is collected by the audio collection device.
In step S420, audio output data responding to the collected audio input data is obtained.
In step S430a, the gesture and/or attributes of the inputter are detected. The detector for the inputter's gesture and/or attributes may include an optical module (such as a camera) and an image processing module. For example, an image of the inputter can first be collected by an image collection apparatus such as a camera, and the inputter's gesture/action or attributes determined by image analysis technology (such as face recognition). The inputter's behavior may include, for example, gestures or actions indicating that the volume should be turned up (speaking loudly), such as raising a palm upward or speaking with the mouth wide open; or gestures or actions indicating that the volume should be turned down (whispering), such as speaking with barely moving lips, pressing a palm downward, or walking closer to the device while speaking. The inputter's attributes may include, for example, parameters indicating the inputter's sex, age, and so on.
In step S430, the target audio output parameter determined by analysis is obtained.
In this embodiment, in particular, the behavior or attributes of the inputter of the audio input data are analyzed to determine the target audio output parameter matching that behavior or those attributes. For example, if a gesture or action indicating loud speech (turn the volume up) or whispering (turn the volume down) is detected in step S430a, the target audio output volume can be correspondingly set high or low, or set to be turned up or down. Alternatively or additionally, if the inputter's sex detected in step S430a is male, female, or child, the target audio output timbre can be determined to be a male, female, or child voice, respectively. Optionally, if the inputter's age indicates an elderly person, the target audio output volume can be determined to be high (or further turned up).
As before, the above analysis that determines the target audio output parameter can be performed locally in the device 10, or can be performed by an external processing device communicatively coupled to the device 10.
In step S440, the audio output device is controlled, according to the target audio output parameter, to output the audio output data obtained in step S420.
Method 400 can adaptively adjust the audio output parameters according to the user's sex, age, and similar attributes, improving the user experience.
Steps S410, S420, and S440 in method 400 are identical to steps S110, S120, and S140 in method 100. Apart from the special features noted above, step S430 in method 400 is a specific implementation of step S130 in method 100. Parts of method 400 that are similar to method 100 are therefore not repeated here.
Fig. 5 shows a block diagram of an example implementation of the device 10 according to embodiments of the present invention.
The device 10 according to embodiments of the present invention may be any of various devices that can provide a voice interaction function, such as a smart speaker, a responsive intelligent voice toy, a smartphone with a voice assistant, a robot with a voice interaction function, a computer, and so on.
As shown, the device 10 includes an audio collection device 11, an audio output device 12, a processing device 13, and a memory 14.
The audio collection device 11 is configured to collect audio input data. The audio collection device 11 can be any known or future device with an audio collection function, such as a microphone. Optionally, the audio collection device 11 may have a built-in speech processing card that processes the collected audio input data, for example recognizing voice commands in the audio input data, or recognizing natural speech and performing semantic analysis. Alternatively or additionally, the audio collection device 11 may have a built-in PU probe for measuring the sound pressure, the direct sound particle velocity, and similar properties of the input audio.
The audio output device 12 is configured to output audio output data. The audio output device 12 can be any known or future device with an audio output function, such as a loudspeaker. Although the audio collection device 11 and the audio output device 12 are shown as separate devices in this specification, it should be understood that the two can be integrated and implemented as a single audio device with both audio transmitting and receiving functions.
The processing device 13 is configured to control the overall operation of the device 10. For example, the processing device 13 may be configured to: control the audio collection device to collect audio input data; obtain audio output data responding to the audio input data; obtain the target audio output parameter determined by analysis; and control the audio output device, according to the target audio output parameter, to output the audio output data.
Optionally, the processing device 13 may also be configured to process the audio input data collected by the audio collection device.
Optionally, the processing device 13 may also be configured to, in response to the audio input data, generate the corresponding audio output data or retrieve it locally or from outside (such as the Internet, an external cloud, or another device).
Optionally, the processing device 13 may also be configured to perform the analysis that determines the target audio output parameter.
The processing device 13 can be realized, individually or jointly, in hardware, software, firmware, or substantially any combination of them. Optionally, the processing device 13 may include one or more application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or other integrated devices. Alternatively or additionally, the processing device 13 may include a processor (or microprocessor) and a memory storing one or more computer programs executable by the processor.
In operation, the processing device 13 can perform control, communication, or data-processing operations involving the other components of the device 10 (for example, the audio collection device 11, the audio output device 12, and/or optional other components).
The memory 14 can store data and programs. The memory 14 can include various permanent or temporary storage media.
Optionally, the device 10 may also include a communication device 15. The communication device 15 is configured to communicate with the outside (external devices, including external clouds). For example, the communication device 15 may be configured to send the audio input data collected by the audio collection device 11 to an external device, and/or to receive from the external device the audio output data responding to the audio input data. Alternatively or additionally, the communication device 15 may send data detected by the device 10 to an external device, and/or receive the target audio output parameter from an external device. The communication device 15 can include a wireless or wired communication interface and can support various suitable communication standards; the embodiments of the present invention are not limited in this respect.
In some embodiments, the audio collection device 11 of the device 10 includes a microphone array. The processing device 13 is further configured to: estimate the source distance based on the microphone array, and obtain the target audio output parameter determined by analyzing the audio input data in combination with the source distance.
In some embodiments, the device 10 may also include at least one sensor configured to detect an environment parameter. The processing device is further configured to obtain the target audio output parameter, matching the environment parameter, determined by analyzing the environment parameter.
In some embodiments, the device 10 may also include an image collection apparatus for collecting images of the surroundings of the device 10. For example, the image collection apparatus can collect images of the inputter of the audio input data. The processing device 13 is further configured to obtain the target audio output parameter, matching the inputter's behavior or attributes, determined by analyzing the behavior or attributes of the inputter of the audio input data.
The device 10 can be used to perform the methods according to embodiments of the present invention, such as methods 100 to 400. For the specific operation of the device 10, reference may be made to the above description of methods 100 to 400, which is not repeated here.
Those skilled in the art will understand that Fig. 5 shows only the parts of the device 10 related to the present invention, to avoid obscuring it. However, those skilled in the art will also understand that the device 10 according to embodiments of the present invention may further include other basic units that make up a concrete voice interaction device, even though they are not shown in Fig. 5.
The voice interaction device according to embodiments of the present invention can adaptively adjust its audio output parameters, produce a suitable audio output, improve the user experience, and improve the competitiveness of the product.
Fig. 6 schematically shows a block diagram of an example of the processing device 13 for performing the processing of the methods described with reference to Figs. 1 to 4 according to embodiments of the present application. As shown in Fig. 6, the processing device 13 includes a processing unit or processor 136. The processor 136 can be a single unit or a combination of multiple units for performing the different steps of the methods. The processing device 13 may also include: an input unit 132 for receiving signals from other devices or components (for example, the audio collection device and sensors connected to it); and an output unit 134 for providing signals to other devices or components (for example, the audio output device and communication device connected to it). The input unit and the output unit can also be arranged as a single whole.
In addition, the processing device 13 includes a memory 138, in which a computer program 139 is stored.
The computer program 139 can include code/computer-executable instructions that, when executed by the processor 136, cause the processor 136 to perform, for example, the method flows described above in conjunction with Figs. 1 to 4 and any variations of them.
The computer program 139 can be configured to have computer program code including, for example, computer program modules. For example, in an exemplary embodiment, the code in the computer program 139 can include one or more program modules, for example module 139A, module 139B, and so on. It should be noted that the division and number of modules are not fixed; those skilled in the art can use suitable program modules, or combinations of program modules, according to the actual situation. When such a combination of program modules is executed by the processor 136, the processor 136 can perform, for example, the method flows described above in conjunction with Figs. 1 to 4 and any variations of them.
The present invention has been described above in conjunction with preferred embodiments. Those skilled in the art will understand that the apparatuses and methods shown above are merely exemplary. A device of the present invention may include more or fewer components than those shown. The methods of the present invention are not limited to the steps and order illustrated above. Those skilled in the art can make many changes and modifications according to the teaching of the illustrated embodiments.
The above methods, devices, units, and/or modules according to the embodiments of the present application can be realized by an electronic device with computing capability executing software containing computer instructions. The system can include a storage device to realize the various kinds of storage described above. The electronic device with computing capability can include, but is not limited to, a general-purpose processor, a digital signal processor, a dedicated processor, a reconfigurable processor, or any other device capable of executing computer instructions. Executing such instructions causes the electronic device to be configured to perform the above operations according to the present application. The above devices and/or modules can be realized in one electronic device, or in different electronic devices. The software can be stored in a computer-readable storage medium. The computer-readable storage medium stores one or more programs (software modules) comprising instructions that, when executed by one or more processors in an electronic device, cause the electronic device to perform the methods of the present application.
The software can be stored in the form of volatile memory or a non-volatile storage device (such as a ROM-like storage device), whether erasable or rewritable, or in the form of memory (such as RAM, a memory chip, a device, or an integrated circuit), or on an optically or magnetically readable medium (such as a CD, a DVD, a magnetic disk, or a magnetic tape). It should be understood that such storage devices and storage media are embodiments of machine-readable storage suited to storing one or more programs comprising instructions that, when executed, realize the embodiments of the present application. The embodiments provide programs, and machine-readable storage devices storing such programs, where the programs comprise code for realizing the apparatus or method described in any claim of the present application. Furthermore, these programs can be transmitted electronically via any medium (for example, a communication signal carried over a wired or wireless connection), and the embodiments suitably include them.
The methods, devices, units, and/or modules according to the embodiments of the present application can also be realized in hardware or firmware, for example a field-programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on a substrate, a system in a package, an application-specific integrated circuit (ASIC), or any other reasonable means of integrating or packaging a circuit; or by an appropriate combination of the three implementations of software, hardware, and firmware. The system can include a storage device to realize the storage described above. When realized in these ways, the software, hardware, and/or firmware used is programmed or designed to perform the corresponding methods, steps, and/or functions according to the present application. Those skilled in the art can, according to actual needs, realize one or more of these systems and modules, or one or more parts of them, using different implementations described above. All of these implementations fall within the protection scope of the present application.
Although the present application has been shown and described with reference to certain exemplary embodiments, those skilled in the art should understand that various changes in form and detail can be made without departing from the spirit and scope defined by the appended claims and their equivalents. Therefore, the scope of the present application should not be limited to the above embodiments, but should be determined not only by the appended claims but also by their equivalents.
Claims (10)
1. An output method, comprising:
collecting audio input data by an audio collection device;
obtaining audio output data responding to the audio input data;
obtaining a target audio output parameter determined by analysis; and
controlling an audio output device, according to the target audio output parameter, to output the audio output data.
2. The method according to claim 1, wherein the analysis comprises:
analyzing the audio input data to determine a target audio output parameter matching the audio input data.
3. The method according to claim 2, wherein analyzing the audio input data to determine the target audio output parameter matching the audio input data comprises:
determining the target audio output parameter according to an audio parameter corresponding to the audio input data.
4. The method according to claim 2, wherein analyzing the audio input data to determine the target audio output parameter matching the audio input data comprises:
analyzing the audio input data in combination with a source distance to determine the target audio output parameter.
5. The method according to claim 1, wherein the analysis comprises:
analyzing an environment parameter to determine a target audio output parameter matching the environment parameter.
6. The method according to claim 1, wherein the analysis comprises:
analyzing a behavior or attribute of an inputter of the audio input data to determine a target audio output parameter matching the behavior or attribute of the inputter.
7. A processing device, comprising:
an audio collection device;
an audio output device; and
a processing unit configured to:
collect audio input data by the audio collection device;
obtain audio output data responding to the audio input data;
obtain a target audio output parameter determined by analysis; and
control the audio output device, according to the target audio output parameter, to output the audio output data.
8. The processing device according to claim 7, wherein
the audio collection device comprises a microphone array, and
the processing unit is further configured to: estimate a source distance based on the microphone array, and obtain the target audio output parameter determined by analyzing the audio input data in combination with the source distance.
9. The processing device according to claim 7, further comprising: at least one sensor configured to detect an environment parameter;
wherein the processing unit is further configured to: obtain the target audio output parameter, matching the environment parameter, determined by analyzing the environment parameter.
10. The processing device according to claim 7, further comprising: an image collection apparatus for collecting an image of an inputter of the audio input data;
wherein the processing unit is configured to: obtain the target audio output parameter, matching the behavior or attribute of the inputter, determined by analyzing the behavior or attribute of the inputter of the audio input data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710199965.1A CN106782544A (en) | 2017-03-29 | 2017-03-29 | Interactive voice equipment and its output intent |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106782544A true CN106782544A (en) | 2017-05-31 |
Family
ID=58967992
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710199965.1A Pending CN106782544A (en) | 2017-03-29 | 2017-03-29 | Interactive voice equipment and its output intent |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106782544A (en) |
Patent Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102522961A (en) * | 2005-12-07 | 2012-06-27 | 苹果公司 | Portable audio device providing automated control of audio volume parameters for hearing protection |
CN101951235A (en) * | 2010-09-08 | 2011-01-19 | 成都千帆科技开发有限公司 | Method and device for automatically controlling voice volume of parking lot management system |
CN102413218A (en) * | 2011-08-03 | 2012-04-11 | 宇龙计算机通信科技(深圳)有限公司 | Method, device and communication terminal for automatically adjusting speaking tone |
CN102426838A (en) * | 2011-08-24 | 2012-04-25 | 华为终端有限公司 | Voice signal processing method and user equipment |
CN103543979A (en) * | 2012-07-17 | 2014-01-29 | 联想(北京)有限公司 | Voice outputting method, voice interaction method and electronic device |
CN103714824A (en) * | 2013-12-12 | 2014-04-09 | 小米科技有限责任公司 | Audio processing method, audio processing device and terminal equipment |
CN103731711A (en) * | 2013-12-27 | 2014-04-16 | 乐视网信息技术(北京)股份有限公司 | Method and system for executing operation of smart television |
CN104795067A (en) * | 2014-01-20 | 2015-07-22 | 华为技术有限公司 | Voice interaction method and device |
US20150213800A1 (en) * | 2014-01-28 | 2015-07-30 | Simple Emotion, Inc. | Methods for adaptive voice interaction |
CN103943106A (en) * | 2014-04-01 | 2014-07-23 | 北京豪络科技有限公司 | Intelligent wristband for gesture and voice recognition |
CN104335559A (en) * | 2014-04-04 | 2015-02-04 | 华为终端有限公司 | Method for adjusting volume automatically, volume adjusting apparatus and electronic apparatus |
CN105895095A (en) * | 2015-02-12 | 2016-08-24 | 哈曼国际工业有限公司 | Adaptive interactive voice system |
CN204795452U (en) * | 2015-07-07 | 2015-11-18 | 北京联合大学 | Television system supporting gesture and voice interaction |
CN105282345A (en) * | 2015-11-23 | 2016-01-27 | 小米科技有限责任公司 | Method and device for regulation of conversation volume |
CN105654950A (en) * | 2016-01-28 | 2016-06-08 | 百度在线网络技术(北京)有限公司 | Self-adaptive voice feedback method and device |
CN105895096A (en) * | 2016-03-30 | 2016-08-24 | 乐视控股(北京)有限公司 | Identity identification and voice interaction operating method and device |
CN106358120A (en) * | 2016-09-23 | 2017-01-25 | 成都创慧科达科技有限公司 | Audio play device with various regulation methods |
CN106453946A (en) * | 2016-11-15 | 2017-02-22 | 维沃移动通信有限公司 | Method for regulating output volume and mobile terminal |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107368280A (en) * | 2017-07-06 | 2017-11-21 | 北京小米移动软件有限公司 | Volume control method and device for voice interaction, and voice interaction device |
CN107396176A (en) * | 2017-07-18 | 2017-11-24 | 青岛海信电器股份有限公司 | Method and device for playing audio and video files |
CN107423021A (en) * | 2017-07-28 | 2017-12-01 | 联想(北京)有限公司 | Smart device and intelligent control method |
CN107274900B (en) * | 2017-08-10 | 2020-09-18 | 北京京东尚科信息技术有限公司 | Information processing method and system for a control terminal |
CN107274900A (en) * | 2017-08-10 | 2017-10-20 | 北京灵隆科技有限公司 | Information processing method and system for a control terminal |
CN107610705A (en) * | 2017-10-27 | 2018-01-19 | 成都常明信息技术有限公司 | Intelligent voice robot that adjusts timbre according to the user's age |
CN107621800A (en) * | 2017-10-27 | 2018-01-23 | 成都常明信息技术有限公司 | Intelligent voice robot that adjusts volume according to the user's age |
CN107657954A (en) * | 2017-10-27 | 2018-02-02 | 成都常明信息技术有限公司 | Voice robot with intelligent volume control |
CN108200527A (en) * | 2017-12-29 | 2018-06-22 | Tcl海外电子(惠州)有限公司 | Method and device for measuring sound source loudness, and computer-readable storage medium |
CN107895579A (en) * | 2018-01-02 | 2018-04-10 | 联想(北京)有限公司 | Speech recognition method and system |
CN108335700B (en) * | 2018-01-30 | 2021-07-06 | 重庆与展微电子有限公司 | Voice adjusting method and device, voice interaction equipment and storage medium |
CN109036388A (en) * | 2018-07-25 | 2018-12-18 | 李智彤 | Intelligent voice interaction method based on a conversational device |
CN108899012B (en) * | 2018-07-27 | 2021-04-20 | 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) | Voice interaction equipment evaluation method and system, computer equipment and storage medium |
CN108899012A (en) * | 2018-07-27 | 2018-11-27 | 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) | Voice interaction device evaluation method and system, computer device and storage medium |
CN109166580A (en) * | 2018-09-17 | 2019-01-08 | 珠海格力电器股份有限公司 | Voice feedback prompt control method and system, and air conditioner |
CN109509470A (en) * | 2018-12-11 | 2019-03-22 | 平安科技(深圳)有限公司 | Voice interaction method and device, computer-readable storage medium, and terminal device |
CN109509470B (en) * | 2018-12-11 | 2024-05-07 | 平安科技(深圳)有限公司 | Voice interaction method and device, computer readable storage medium and terminal equipment |
CN112634884A (en) * | 2019-09-23 | 2021-04-09 | 北京声智科技有限公司 | Method of controlling output audio, method of outputting audio, apparatus, electronic device, and computer-readable storage medium |
CN110677776A (en) * | 2019-09-26 | 2020-01-10 | 恒大智慧科技有限公司 | Volume adjusting method and device, intelligent sound box and storage medium |
CN110677776B (en) * | 2019-09-26 | 2021-08-17 | 星络智能科技有限公司 | Volume adjusting method and device, intelligent sound box and storage medium |
CN113763942A (en) * | 2020-06-03 | 2021-12-07 | 广东美的制冷设备有限公司 | Interaction method and interaction system of voice household appliances and computer equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106782544A (en) | Interactive voice equipment and its output intent | |
US9418665B2 (en) | Method for controlling device and device control system | |
CN110826358B (en) | Animal emotion recognition method and device and storage medium | |
CN105556595B (en) | Method and apparatus for adjusting the detection threshold for activating a voice assistant function | |
US6838994B2 (en) | Adaptive alarm system | |
CN110291489A (en) | Computationally efficient human-identifying intelligent assistant computer | |
CN109076310A (en) | Autonomous semantic labeling of physical locations | |
CN109346075A (en) | Method and system for recognizing user speech and controlling electronic devices through body vibration | |
CN109920419B (en) | Voice control method and device, electronic equipment and computer readable medium | |
CN105452822A (en) | Sound event detecting apparatus and operation method thereof | |
US10978093B1 (en) | Computer apparatus and method implementing sound detection to recognize an activity | |
JP2012518828A (en) | System, method and apparatus for placing apparatus in active mode | |
CN111124108B (en) | Model training method, gesture control method, device, medium and electronic equipment | |
KR20140144499A (en) | Method and apparatus for quality measurement of sleep using a portable terminal | |
CN107609501A (en) | Human proximity action recognition method and apparatus, storage medium, and electronic device | |
CN105615839B (en) | Human-body wearable device and detection method thereof | |
CN106464812A (en) | Lifelog camera and method of controlling same according to transitions in activity | |
CN105848061B (en) | Control method and electronic equipment | |
CN106873939A (en) | Electronic device and method of using the same | |
He et al. | An elderly care system based on multiple information fusion | |
CN109831817A (en) | Terminal control method, device, terminal and storage medium | |
CN109257490A (en) | Audio processing method and device, wearable device, and storage medium | |
CN112990429A (en) | Machine learning method, electronic device and related product | |
CN108230312A (en) | Image analysis method, device, and computer-readable storage medium | |
CN209606794U (en) | Wearable device, speaker device, and smart home control system |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170531 |