CN105719670B - Audio processing method and apparatus for an intelligent robot - Google Patents
- Publication number
- CN105719670B (application CN201610028052.9A)
- Authority
- CN
- China
- Prior art keywords
- individual character
- audio
- character time
- recording
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B19/00—Driving, starting, stopping record carriers not specifically of filamentary or web form, or of supports therefor; Control thereof; Control of operating function ; Driving both disc and head
- G11B19/02—Control of operating function, e.g. switching from recording to reproducing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B20/00—Signal processing not specific to the method of recording or reproducing; Circuits therefor
- G11B20/10—Digital recording or reproducing
- G11B20/10527—Audio or video recording; Data buffering arrangements
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B20/00—Signal processing not specific to the method of recording or reproducing; Circuits therefor
- G11B20/10—Digital recording or reproducing
- G11B20/10527—Audio or video recording; Data buffering arrangements
- G11B2020/10537—Audio or video recording
- G11B2020/10546—Audio or video recording specifically adapted for audio data
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Toys (AREA)
Abstract
The invention discloses an audio processing method and apparatus for an intelligent robot. The audio processing method includes: an audio information acquisition step, collecting the audio information input by a user; an audio information processing step, preprocessing the audio information to obtain recording time data, the recording time data including an average per-character time t3 and a maximum per-character time t4; a natural language understanding step, parsing the words in the audio information to obtain a natural language understanding result; and a recording time judgment step, evaluating the average per-character time t3, the maximum per-character time t4, the zero-volume duration t5, and the natural language understanding result. The invention optimizes the timing of the robot's answer output and improves response accuracy.
Description
Technical field
The present invention relates to the field of speech recognition and processing, and in particular to an audio processing method and apparatus for an intelligent robot.
Background technology
An intelligent robot is an aggregate of many new and advanced technologies. It integrates knowledge from many disciplines, including mechanics, electronics, sensors, computer hardware and software, and artificial intelligence, and involves technologies at many current disciplinary frontiers.
When an intelligent robot interacts with a user, a fixed time is usually preset, and during recording the robot detects whether the user's silent time has reached this preset fixed time. If it has, recording stops.
However, stopping recording by means of a preset fixed time can make the end-of-recording timing inaccurate, which in turn affects the timing of the intelligent robot's answer output and reduces response-time accuracy and user experience.
Summary of the invention
To solve the above problems, the invention provides an audio processing method and apparatus for an intelligent robot, which optimize the timing of the robot's answer output and improve response accuracy.
According to one aspect of the invention, an audio processing method for an intelligent robot is provided, including:

an audio information acquisition step: collecting the audio information input by a user;

an audio information processing step: preprocessing the audio information to obtain recording time data, the recording time data including an average per-character time t3 and a maximum per-character time t4;

a natural language understanding step: parsing the words in the audio information to obtain a natural language understanding result; and

a recording time judgment step: evaluating the average per-character time t3, the maximum per-character time t4, the zero-volume duration t5, and the natural language understanding result, and generating an end-recording instruction when the judgment result meets the end-recording conditions.
According to one embodiment of the invention, the recording time judgment step includes:

comparing the zero-volume duration t5 with a preset audio end time t0, and ending the recording when t5 > t0;

comparing the zero-volume duration t5 with the average per-character time t3, and ending the recording when t5 > t3 and the natural language understanding result indicates end of recording; and

comparing the zero-volume duration t5 with the maximum per-character time t4, ending the recording when t5 > t4, and adjusting the value of t0 toward the maximum per-character time t4.
According to one embodiment of the invention, obtaining the maximum per-character time t4 includes:

in a single recording, calculating the per-character time t2 of that recording from the voiced duration t1 and the number of characters obtained by speech recognition; and

obtaining the maximum per-character time t4 from the per-character times t2 of all single recordings in n consecutive recordings.
According to one embodiment of the invention, obtaining the average per-character time t3 includes:

in a single recording, calculating the per-character time t2 of that recording from the voiced duration t1 and the number of characters obtained by speech recognition; and

obtaining the average per-character time t3 from the per-character times t2 of all single recordings in n consecutive recordings.
According to one embodiment of the invention, the per-character time t2 is calculated by one of the following formulas:

t2 = t1/a or t2 = (t1/a + t1/(a-1))/2

where a is the number of characters recognized within the voiced duration t1.
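The two formulas above can be sketched as follows. This is an illustrative sketch only: the function and variable names are not from the patent, t1 is assumed to be in milliseconds, and formula (2) requires a > 1.

```python
def char_time(t1, a):
    """Per-character time t2 for one recording.

    t1: voiced duration of the recording (assumed milliseconds).
    a:  number of characters recognized within t1 (assumed > 1).
    Returns both variants: formula (1), t1/a, and formula (2),
    which averages t1/a with t1/(a-1).
    """
    simple = t1 / a                        # formula (1)
    refined = (t1 / a + t1 / (a - 1)) / 2  # formula (2)
    return simple, refined
```

For example, with t1 = 3000 ms and a = 3 recognized characters, formula (1) gives 1000 ms per character and formula (2) gives 1250 ms.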
According to another aspect of the invention, an audio processing apparatus for an intelligent robot is also provided, including:

an audio information acquisition module, which collects the audio information input by a user;

an audio information processing module, which preprocesses the audio information to obtain recording time data, the recording time data including an average per-character time t3 and a maximum per-character time t4;

a natural language understanding module, which parses the words in the audio information to obtain a natural language understanding result; and

a recording time judgment module, which evaluates the average per-character time t3, the maximum per-character time t4, the zero-volume duration t5, and the natural language understanding result, and generates an end-recording instruction when the judgment result meets the end-recording conditions.
According to one embodiment of the invention, the recording time judgment module is used to:

compare the zero-volume duration t5 with a preset audio end time t0, and end the recording when t5 > t0;

compare the zero-volume duration t5 with the average per-character time t3, and end the recording when t5 > t3 and the natural language understanding result indicates end of recording; and

compare the zero-volume duration t5 with the maximum per-character time t4, end the recording when t5 > t4, and adjust the value of t0 toward the maximum per-character time t4.
According to one embodiment of the invention, the audio information processing module includes:

a first per-character time calculating unit, which, in a single recording, calculates the per-character time t2 of that recording from the voiced duration t1 and the number of characters obtained by speech recognition; and

a maximum per-character time calculating unit, which obtains the maximum per-character time t4 from the per-character times t2 of all single recordings in n consecutive recordings.
According to one embodiment of the invention, the audio information processing module includes:

a second per-character time calculating unit, which, in a single recording, calculates the per-character time t2 of that recording from the voiced duration t1 and the number of characters obtained by speech recognition; and

an average per-character time calculating unit, which obtains the average per-character time t3 from the per-character times t2 of all single recordings in n consecutive recordings.
According to a further aspect of the invention, an audio processing apparatus for an intelligent robot is also provided, including:

an audio information acquisition circuit, which collects the audio information input by a user; and

a processor, which preprocesses the audio information to obtain recording time data, the recording time data including an average per-character time t3 and a maximum per-character time t4, parses the words in the audio information to obtain a natural language understanding result, and evaluates the average per-character time t3, the maximum per-character time t4, the zero-volume duration t5, and the natural language understanding result, generating an end-recording instruction when the judgment result meets the end-recording conditions,

wherein the processor's evaluation of the average per-character time t3, the maximum per-character time t4, the zero-volume duration t5, and the natural language understanding result includes:

comparing the zero-volume duration t5 with a preset audio end time t0, and ending the recording when t5 > t0;

comparing the zero-volume duration t5 with the average per-character time t3, and ending the recording when t5 > t3 and the natural language understanding result indicates end of recording; and

comparing the zero-volume duration t5 with the maximum per-character time t4, ending the recording when t5 > t4, and adjusting the value of t0 toward the maximum per-character time t4.
Beneficial effects of the invention:

The audio processing method and apparatus for an intelligent robot provided by the invention judge multiple parameters that characterize speaking speed and, through those judgments, accurately control the timing at which recording stops, while learning each individual user's speaking speed and word spacing, thereby optimizing the timing of the robot's answer output and improving response accuracy.
Other features and advantages of the invention will be set forth in the following description, and in part will become apparent from the description or be understood by practicing the invention. The objectives and other advantages of the invention can be realized and obtained through the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief description of the drawings
To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below:
Fig. 1 is a flowchart of a method according to an embodiment of the invention;

Fig. 2 is a flowchart of the steps for determining the average per-character time t3 according to an embodiment of the invention;

Fig. 3 is a flowchart of the steps for determining the maximum per-character time t4 according to an embodiment of the invention;

Fig. 4 is a schematic structural diagram of an audio processing apparatus for an intelligent robot according to an embodiment of the invention;

Fig. 5 is a schematic diagram of the structure for determining the maximum per-character time in the audio information processing module according to an embodiment of the invention;

Fig. 6 is a schematic diagram of the structure for determining the average per-character time in the audio information processing module according to an embodiment of the invention; and

Fig. 7 is a schematic structural diagram of the audio information processing module in an audio processing apparatus for an intelligent robot according to an embodiment of the invention.
Detailed description of the embodiments
The embodiments of the invention are described in detail below with reference to the drawings and examples, so that how the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented. It should be noted that, as long as no conflict arises, the embodiments of the invention and the features within each embodiment may be combined with one another, and the resulting technical solutions all fall within the protection scope of the invention.
Fig. 1 is a flowchart of an audio processing method for an intelligent robot according to an embodiment of the invention. The invention is described in detail below with reference to Fig. 1.
First, step S110, the audio information acquisition step: the audio information input by the user is collected. Specifically, in this step, when the user speaks, the intelligent robot starts collecting the user's voice information.
Next, step S120, the audio information processing step: the received audio information is preprocessed to obtain recording time data. The recording time data include the average per-character time t3 and the maximum per-character time t4.

The average per-character time t3 represents the time the user dwells on each character while speaking. It is determined by averaging over the user's n consecutive recorded messages.
Specifically, this includes the following steps, as shown in Fig. 2:

First, step S210: in a single recording, the per-character time t2 of that recording is calculated from the voiced duration t1 and the number of characters obtained by speech recognition. It should be noted that the above n recording processes are distinguished from one another by a preset end-of-speech (EOS) time. The EOS time is the interval between the moment the user finishes the last character and the moment the program stops recording. The EOS time can be preset to a fixed value t0, for example 3 seconds. Setting the EOS time makes it possible to separate the n consecutive recording processes.
After the EOS time t0 is set, each separated single recording can be processed. In a single recording, the voiced duration t1 (the time from the first appearance of volume to the last) is calculated from the concrete values of the volume over time, and the number of characters produced by continuous speech recognition within this time t1 is counted.
Volume here refers to the volume of the user's speech after noise reduction, excluding any residual noise left by the noise reduction algorithm; that is, noise reduction is assumed to be perfect, so that if the user is not speaking, the volume after noise reduction equals zero. Specifically, the concrete values of the volume over time in a single recording are logged, for example 30 decibels in the 1st millisecond, 70 decibels in the 2nd millisecond, and so on. From these detected values, the time t1 from the first appearance of volume to the last in this recording can be calculated. Meanwhile, continuous speech recognition is performed during the single recording, and the number of characters it produces is counted.
After the voiced duration t1 and the number of recognized characters are determined, the per-character time t2 of the single recording is calculated.

Specifically, the per-character time t2 of a single recording can be calculated by the following formula:

t2 = t1/a (1)

where t2 is the per-character time of the single recording, t1 is the voiced duration of the single recording, and a is the number of characters produced by continuous speech recognition within the voiced duration t1.
Alternatively, the per-character time t2 of a single recording can be calculated by the following formula:

t2 = (t1/a + t1/(a-1))/2 (2)

The average computed by formula (2) yields a per-character time closer to the user's actual speaking pace.
Next, step S220: the average per-character time t3 is obtained from the per-character times t2 of all single recordings in the n consecutive recordings.

From the per-character times of all single recordings in the n consecutive recordings, the user's average per-character time t3 is calculated. Specifically, after n consecutive recordings, the average value t3 of all the per-character times t2 in the n recordings is computed. Here n is preferably a preset value, for example 10.
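The averaging in step S220 can be sketched as follows, under the assumption that each single recording is summarized as a (t1, a) pair (voiced duration and recognized character count) and that each t2 uses formula (1); the names are illustrative, not from the patent.

```python
def average_char_time(recordings, n=10):
    """Average per-character time t3 over the last n recordings.

    recordings: list of (t1, a) pairs, one per single recording.
    Each per-recording time uses formula (1), t2 = t1 / a.
    """
    recent = recordings[-n:]
    t2_values = [t1 / a for t1, a in recent]
    return sum(t2_values) / len(t2_values)
```

For instance, two recordings of (1000 ms, 2 characters) and (3000 ms, 3 characters) give t2 values of 500 and 1000, so t3 = 750 ms.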
It should be noted that in a single recording, when the duration t5 for which the volume is zero reaches the predetermined value t0, the user is considered to have stopped speaking and the recording ends. Moreover, whenever the volume becomes zero during recording, t5 starts counting from zero at that moment; if the volume becomes zero again later in the recording, t5 restarts from zero at that new moment. This ensures the accuracy of both the zero-volume duration between output characters and the zero-volume duration after the user finishes speaking.
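The restart-from-zero behaviour of t5 described above can be sketched as a small state holder. This is a sketch under assumptions: the class and method names are illustrative, and volume is fed in as discrete samples each covering dt milliseconds.

```python
class ZeroVolumeTimer:
    """Tracks t5, the zero-volume duration, restarting it from zero
    every time the post-noise-reduction volume returns to nonzero."""

    def __init__(self):
        self.t5 = 0

    def update(self, volume, dt=1):
        """Feed one volume sample covering dt milliseconds; return t5."""
        if volume == 0:
            self.t5 += dt   # silence continues, t5 accumulates
        else:
            self.t5 = 0     # the user is speaking again, restart t5
        return self.t5
```

A caller would compare the returned t5 against t0 (and, once learned, against t3 and t4) after each sample.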
The maximum per-character time t4 is the largest per-character time across the user's n consecutive recordings. It is determined by collecting all the per-character times of the user's n consecutive recorded messages.

Specifically, this includes the following steps, as shown in Fig. 3:
First, step S310: in a single recording, the per-character time t2 of that recording is calculated from the voiced duration t1 and the number of characters obtained by speech recognition.
It should be noted that, as before, the n recording processes are distinguished by the preset end-of-speech (EOS) time, which can be set to a fixed value t0; setting the EOS time makes it possible to separate the n consecutive recording processes.

After the EOS time t0 is set, each separated single recording can be processed. In a single recording, the voiced duration t1 (the time from the first appearance of volume to the last) is calculated from the concrete values of the volume over time, and the number of characters produced by continuous speech recognition within this time t1 is counted. As in step S210, volume refers to the user's speech volume after noise reduction, which is assumed to be perfect, so that the volume is zero whenever the user is not speaking.
Then the per-character time t2 of the single recording is calculated; specifically, it can be computed by formula (1) or formula (2).
Finally, in step S320, the user's maximum per-character time t4 is obtained from the per-character times t2 of all single recordings in the n consecutive recordings.

Specifically, the largest of the per-character times t2 of all single recordings in the n consecutive recordings is selected as the maximum per-character time t4. Here n is preferably a preset value.
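Selecting t4 is then a plain maximum over the same per-recording values, sketched here with the same assumed (t1, a) representation and formula (1); the names are illustrative.

```python
def max_char_time(recordings, n=10):
    """Maximum per-character time t4 over the last n recordings;
    each recording is a (t1, a) pair and t2 = t1 / a (formula (1))."""
    return max(t1 / a for t1, a in recordings[-n:])
```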
Next, the natural language understanding step S130: the words in the audio information are parsed to obtain a natural language understanding result.

Finally, step S140, the recording time judgment step: the average per-character time t3, the maximum per-character time t4, the zero-volume duration t5, and the natural language understanding result are evaluated.
Specifically, in the (n+1)th recording: the zero-volume duration t5 is compared with the preset audio end time t0, and if t5 exceeds t0, the recording ends. If t5 exceeds the average per-character time t3 and the natural language understanding (NLU) result indicates end of recording, the recording ends. If t5 exceeds the average per-character time t3 but the NLU result does not indicate end of recording, t5 is compared with the maximum per-character time t4; the recording ends when t5 > t4, and the value of t0 is adjusted toward t4.
Specifically, in this step there are three cases in which the (n+1)th recording ends. The first is when the zero-volume duration t5 is detected to exceed the preset audio end time t0, in which case the recording ends. The preset audio end time t0 is set before the recording process starts.

The preset audio end time t0 may be set larger than the user's speaking speed and word spacing require. Therefore, to improve the timing of the response output, in the (n+1)th and subsequent detection processes the value of the preset audio end time t0 is gradually reduced so that it converges toward the maximum per-character time t4.
The second case is when the zero-volume duration t5 is detected to exceed the average per-character time t3 and the natural language understanding result indicates end of recording, in which case the recording ends. Specifically, during speech recognition in the (n+1)th recording, if the calculated zero-volume duration t5 in the (n+1)th recorded message exceeds the average per-character time t3, then whether to end the recording is judged from the natural language understanding result of the recorded message.

If the natural language understanding result concludes that the user has finished speaking, that is, a sentence-final modal particle or end-of-sentence punctuation is found in the user's speech, the recording ends.
Otherwise, if the natural language understanding result concludes that the user has not yet finished speaking, for example word segmentation finds that the last word is incomplete, the third case applies: the zero-volume duration t5 is compared with the maximum per-character time t4, the recording ends when t5 > t4, and the value of t0 is adjusted toward t4.
It should be noted that for different users the relationship between the preset audio end time t0 and the user's per-character time is indeterminate: t0 may be larger or smaller than the user's per-character time. During the first n consecutive recordings, since no maximum per-character time t4 or average per-character time t3 is yet available for reference, the recording ends when the zero-volume duration t5 is detected to exceed the preset audio end time t0.
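Putting the three cases and the cold-start fallback together, one possible sketch of the decision is below. The patent only says that t0 is adjusted to approach t4, so the smoothing factor ALPHA, the exact update rule, and all names here are assumptions for illustration.

```python
ALPHA = 0.5  # assumed smoothing factor; the patent does not specify one

def should_stop(t5, t0, t3, t4, nlu_done, learned):
    """Return (stop_recording, new_t0).

    t5: current zero-volume duration; t0: preset audio end time;
    t3/t4: average/maximum per-character time; nlu_done: whether the
    NLU result indicates the utterance is complete; learned: whether
    t3 and t4 are available yet (False during the first n recordings).
    """
    if not learned or t5 > t0:
        return (t5 > t0, t0)             # case 1, and cold-start fallback
    if t5 > t3 and nlu_done:
        return (True, t0)                # case 2: NLU confirms the end
    if t5 > t4:
        new_t0 = t0 + ALPHA * (t4 - t0)  # case 3: stop, pull t0 toward t4
        return (True, new_t0)
    return (False, t0)
```

With t0 = 3000 ms, t3 = 400 ms, t4 = 800 ms and a silence of 900 ms that the NLU does not consider complete, case 3 fires and t0 is pulled halfway toward t4, to 1900 ms.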
The audio processing method and apparatus for an intelligent robot provided by this embodiment judge multiple parameters that characterize speaking speed and, through those judgments, accurately control the timing at which recording stops, while learning each individual user's speaking speed and word spacing, thereby optimizing the timing of the robot's answer output and improving response accuracy.
According to another aspect of the invention, an audio processing apparatus for an intelligent robot is also provided. As shown in Fig. 4, the audio processing apparatus includes an audio information acquisition module, an audio information processing module, a natural language understanding module, and a recording time judgment module.

The audio information acquisition module collects the audio information input by the user;

the audio information processing module preprocesses the audio information to obtain recording time data, the recording time data including an average per-character time t3 and a maximum per-character time t4;

the natural language understanding module parses the words in the audio information to obtain a natural language understanding result; and

the recording time judgment module evaluates the average per-character time t3, the maximum per-character time t4, the zero-volume duration t5, and the natural language understanding result.
The recording time judgment module evaluates the average per-character time t3, the maximum per-character time t4, and the zero-volume duration t5, and generates an end-recording instruction in the following cases: the zero-volume duration t5 is compared with the preset audio end time t0, and the recording ends when t5 > t0; the zero-volume duration t5 is compared with the average per-character time t3, and the recording ends when t5 > t3 and the natural language understanding result indicates end of recording; and the zero-volume duration t5 is compared with the maximum per-character time t4, the recording ends when t5 > t4, and the value of t0 is adjusted toward t4.
In one embodiment of the invention, the audio information processing module includes a first per-character time calculating unit and a maximum per-character time calculating unit, as shown in Fig. 5.

The first per-character time calculating unit calculates, in a single recording, the per-character time t2 of that recording from the voiced duration t1 and the number of characters obtained by speech recognition; the maximum per-character time calculating unit obtains the maximum per-character time t4 from the per-character times t2 of all single recordings in n consecutive recordings.
In one embodiment of the invention, the audio information processing module includes a second per-character time calculating unit and an average per-character time calculating unit.

The second per-character time calculating unit calculates, in a single recording, the per-character time t2 of that recording from the voiced duration t1 and the number of characters obtained by speech recognition; the average per-character time calculating unit obtains the average per-character time t3 from the per-character times t2 of all single recordings in n consecutive recordings, as shown in Fig. 6.
As can be seen from the above analysis and from Figs. 5 and 6, the first per-character time calculating unit and the second per-character time calculating unit in the audio information processing module perform the same function. Therefore, in the design they can be merged into a single per-character time calculating unit, after which a maximum per-character time calculating unit and an average per-character time calculating unit are provided, as shown in Fig. 7.
According to a further aspect of the invention, an audio processing apparatus for an intelligent robot is also provided. The audio processing apparatus includes an audio information acquisition circuit and a processor.

The audio information acquisition circuit collects the recorded message input by the user;

the processor preprocesses the audio information to obtain recording time data, the recording time data including an average per-character time t3 and a maximum per-character time t4, and evaluates the average per-character time t3, the maximum per-character time t4, the zero-volume duration t5, and the natural language understanding result.
Specifically, when the processor evaluates the average individual-character time t3, the maximum individual-character time t4, and the zero-volume duration t5, the cases in which an end-of-recording instruction is generated include the following:
(1) the zero-volume duration t5 is compared with the preset audio end time t0; when t5 > t0, the recording ends;
(2) the zero-volume duration t5 is compared with the average individual-character time t3; when t5 > t3 and the natural language understanding result indicates end of recording, the recording ends;
(3) the zero-volume duration t5 is compared with the maximum individual-character time t4; when t5 > t4, the recording ends, and the value of t0 is adjusted so that it approaches t4.
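The three cases above can be sketched as follows. This is a minimal illustration with assumed names; in particular, the step by which t0 is moved toward t4 is an assumption, since the patent only requires that t0 gradually approach t4:

```python
# Illustrative sketch (assumed names) of the three end-of-recording
# conditions evaluated by the processor.

def should_end_recording(t5, t0, t3, t4, nlu_indicates_end):
    """Return (end, new_t0).

    t5: zero-volume (silence) duration observed so far
    t0: preset audio end time
    t3, t4: average / maximum individual-character time
    nlu_indicates_end: True if the natural language understanding
    result indicates the utterance is complete.
    """
    if t5 > t0:                        # case (1): silence exceeds preset limit
        return True, t0
    if t5 > t3 and nlu_indicates_end:  # case (2): silence exceeds average word time
        return True, t0                #           and NLU says the sentence is done
    if t5 > t4:                        # case (3): silence exceeds maximum word time
        new_t0 = t0 - 0.5 * (t0 - t4)  # move t0 halfway toward t4 (step assumed)
        return True, new_t0
    return False, t0
```

Because t0 shrinks toward t4 whenever case (3) fires, the preset timeout adapts to each user's measured speech rate over successive recordings.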
In summary, the audio processing method and apparatus for an intelligent robot provided by the invention evaluate multiple parameters characterizing speech rate, so that the moment at which recording stops is controlled accurately. By learning each individual user's speech rate and inter-word intervals, the method optimizes the timing of the robot's answer output and improves answer accuracy.
Although embodiments are disclosed above, the content described is merely an implementation adopted to facilitate understanding of the invention and does not limit it. Any person skilled in the art to which this invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed herein; however, the scope of patent protection of the invention shall still be subject to the appended claims.
Claims (10)
1. An audio processing method for an intelligent robot, comprising:
an audio information acquisition step of collecting audio information input by a user;
an audio information processing step of preprocessing the audio information to obtain recording duration data, the recording duration data comprising an average individual-character time t3 and a maximum individual-character time t4;
a natural language understanding step of parsing the words in the audio information to obtain a natural language understanding result; and
a recording duration judgment step of judging the average individual-character time t3, the maximum individual-character time t4, a zero-volume duration t5, and the natural language understanding result, and generating an end-of-recording instruction when the judgment result satisfies an end-of-recording condition.
2. The audio processing method according to claim 1, wherein the recording duration judgment step comprises:
comparing the zero-volume duration t5 with a preset audio end time t0, and ending the recording when t5 > t0;
comparing the zero-volume duration t5 with the average individual-character time t3, and ending the recording when t5 > t3 and the natural language understanding result indicates end of recording; and
comparing the zero-volume duration t5 with the maximum individual-character time t4, ending the recording when t5 > t4, and adjusting the value of t0 so that it gradually decreases toward the maximum individual-character time t4.
3. The audio processing method according to claim 2, wherein obtaining the maximum individual-character time t4 comprises:
in a single recording, calculating the individual-character time t2 of the single recording from the voiced duration t1 and the number of words obtained by speech recognition; and
obtaining the maximum individual-character time t4 from the individual-character times t2 of all single recordings within N consecutive recordings.
4. The audio processing method according to claim 2, wherein obtaining the average individual-character time t3 comprises:
in a single recording, calculating the individual-character time t2 of the single recording from the voiced duration t1 and the number of words obtained by speech recognition; and
obtaining the average individual-character time t3 from the individual-character times t2 of all single recordings within N consecutive recordings.
5. The audio processing method according to claim 3 or 4, wherein the individual-character time t2 is calculated by the following formula:
t2 = t1/a or t2 = (t1/a + t1/(a-1))/2
where a is the number of words recognized within the voiced duration t1.
6. An audio processing apparatus for an intelligent robot, comprising:
an audio information acquisition module that collects audio information input by a user;
an audio information processing module that preprocesses the audio information to obtain recording duration data, the recording duration data comprising an average individual-character time t3 and a maximum individual-character time t4;
a natural language understanding module that parses the words in the audio information to obtain a natural language understanding result; and
a recording duration judgment module that judges the average individual-character time t3, the maximum individual-character time t4, a zero-volume duration t5, and the natural language understanding result, and generates an end-of-recording instruction when the judgment result satisfies an end-of-recording condition.
7. The audio processing apparatus according to claim 6, wherein the recording duration judgment module is configured to:
compare the zero-volume duration t5 with a preset audio end time t0, and end the recording when t5 > t0;
compare the zero-volume duration t5 with the average individual-character time t3, and end the recording when t5 > t3 and the natural language understanding result indicates end of recording; and
compare the zero-volume duration t5 with the maximum individual-character time t4, end the recording when t5 > t4, and adjust the value of t0 so that it gradually decreases toward the maximum individual-character time t4.
8. The audio processing apparatus according to claim 7, wherein the audio information processing module comprises:
a first individual-character time calculating unit that, in a single recording, calculates the individual-character time t2 of the single recording from the voiced duration t1 and the number of words obtained by speech recognition; and
a maximum individual-character time calculating unit that obtains the maximum individual-character time t4 from the individual-character times t2 of all single recordings within N consecutive recordings.
9. The audio processing apparatus according to claim 7, wherein the audio information processing module comprises:
a second individual-character time calculating unit that, in a single recording, calculates the individual-character time t2 of the single recording from the voiced duration t1 and the number of words obtained by speech recognition; and
an average individual-character time calculating unit that obtains the average individual-character time t3 from the individual-character times t2 of all single recordings within N consecutive recordings.
10. An audio processing apparatus for an intelligent robot, comprising:
an audio information acquisition circuit that collects audio information input by a user; and
a processor that preprocesses the audio information to obtain recording duration data, the recording duration data comprising an average individual-character time t3 and a maximum individual-character time t4,
parses the words in the audio information to obtain a natural language understanding result, and
judges the average individual-character time t3, the maximum individual-character time t4, a zero-volume duration t5, and the natural language understanding result, generating an end-of-recording instruction when the judgment result satisfies an end-of-recording condition,
wherein the judging by the processor of the average individual-character time t3, the maximum individual-character time t4, the zero-volume duration t5, and the natural language understanding result comprises:
comparing the zero-volume duration t5 with a preset audio end time t0, and ending the recording when t5 > t0;
comparing the zero-volume duration t5 with the average individual-character time t3, and ending the recording when t5 > t3 and the natural language understanding result indicates end of recording; and
comparing the zero-volume duration t5 with the maximum individual-character time t4, ending the recording when t5 > t4, and adjusting the value of t0 so that it gradually decreases toward the maximum individual-character time t4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610028052.9A CN105719670B (en) | 2016-01-15 | 2016-01-15 | A kind of audio-frequency processing method and device towards intelligent robot |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105719670A CN105719670A (en) | 2016-06-29 |
CN105719670B true CN105719670B (en) | 2018-02-06 |
Family
ID=56147155
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610028052.9A Active CN105719670B (en) | 2016-01-15 | 2016-01-15 | A kind of audio-frequency processing method and device towards intelligent robot |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105719670B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101246687A (en) * | 2008-03-20 | 2008-08-20 | 北京航空航天大学 | Intelligent voice interaction system and method thereof |
CN104253902A (en) * | 2014-07-21 | 2014-12-31 | 宋婉毓 | Method for voice interaction with intelligent voice device |
CN105144286A (en) * | 2013-03-14 | 2015-12-09 | 托伊托克有限公司 | Systems and methods for interactive synthetic character dialogue |
CN105204628A (en) * | 2015-09-01 | 2015-12-30 | 涂悦 | Voice control method based on visual awakening |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||