CN110164431A - A kind of audio data processing method and device, storage medium - Google Patents
A kind of audio data processing method and device, storage medium
- Publication number
- CN110164431A (application number CN201811361659.4A)
- Authority
- CN
- China
- Prior art keywords
- detection
- speech
- default
- time point
- reset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Telephone Function (AREA)
- Telephonic Communication Services (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
Abstract
An embodiment of the invention provides an audio data processing method and device, and a storage medium. The method includes: obtaining a speech detection model, the speech detection model being a correspondence between audio data of at least one detection path having a historical-accumulation characteristic and a speech recognition result; determining a reference object based on the detected quantity of the at least one detection path, the reference object being the factor by which a reset operation is judged; determining a reset time point based on the reference object, the reset time point being the moment at which the historical accumulation in the speech detection model is initialized while speech recognition performance is guaranteed; and resetting the speech detection model when the reset time point is reached.
Description
Technical field
The present invention relates to speech recognition technology in the field of electronic applications, and in particular to an audio data processing method and device, and a storage medium.
Background art
With the development of smart speakers and their derivative products, voice interaction between human and machine, especially far-field voice interaction, has gradually become an important human-computer interface.

At present, the voice-interaction smart devices in the electronics field are mainly smart speakers, for example products such as smart TVs or TV boxes with a voice-control function. One or more wake-up words are generally set in such voice-interaction smart devices. Taking a smart speaker as an example: after the user says the wake-up word to the smart speaker and the speaker detects it, the voice data (audio data) the user says next is passed to the smart speaker as a voice command for speech recognition, thereby opening the voice-interaction function between human and machine. A long short-term memory model (LSTM, Long Short-Term Memory) is generally used as the wake-up detection model for wake-word detection.

However, an important feature of the LSTM is its historical-information accumulation characteristic: when speech recognition is performed with an LSTM, the detection result for a segment of voice data (for example, a wake-up word) is related not only to that segment itself, but is also strongly influenced by the audio data preceding it. Therefore, false wake-ups are unavoidable in wake-word detection, and after noise has accumulated for a period of time, the accumulated noise data affects the later detection performance for the wake-up word, causing the accuracy of wake-word speech recognition to decline.
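The historical-accumulation problem described above can be illustrated with a minimal recurrent-detector sketch (the model, decay factor, and all numeric values are illustrative assumptions, not the patent's actual LSTM):

```python
import math

class TinyRecurrentDetector:
    """Stand-in for an LSTM-style detector whose score depends on an
    accumulated internal state, so past audio influences later results."""

    def __init__(self):
        self.state = 0.0  # accumulated history

    def step(self, frame_energy):
        # Each frame folds into the state; a long run of noise therefore
        # keeps influencing the detection score of later frames.
        self.state = 0.9 * self.state + 0.1 * frame_energy
        return 1.0 / (1.0 + math.exp(-(frame_energy + self.state - 1.0)))

    def reset(self):
        # Initialize the historical accumulation.
        self.state = 0.0

det = TinyRecurrentDetector()
for _ in range(100):                  # a long run of background noise
    det.step(0.8)
score_with_history = det.step(1.0)    # wake-word frame after the noise...
det.reset()
score_after_reset = det.step(1.0)     # ...scores differently once reset
```

Here the same wake-word frame produces a different score depending on the accumulated noise, which is exactly the detection-performance drift the reset operation is meant to remove.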
Summary of the invention
The embodiments of the present invention provide an audio data processing method and device, and a storage medium, which can improve the accuracy of speech recognition.
The technical solution of the embodiments of the present invention is achieved as follows:
An embodiment of the present invention provides an audio data processing method, including:
obtaining a speech detection model, the speech detection model being a correspondence between audio data of at least one detection path having a historical-accumulation characteristic and a speech recognition result;
determining a reference object based on the detected quantity of the at least one detection path, the reference object being the factor by which a reset operation is judged;
determining a reset time point based on the reference object, the reset time point being the moment at which the historical accumulation in the speech detection model is initialized while speech recognition performance is guaranteed; and
resetting the speech detection model when the reset time point is reached.
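Interpreted as pseudocode, the claimed steps might look as follows (the function names, threshold, and reset period are illustrative assumptions, not values from the patent):

```python
def choose_reference_object(num_detection_paths):
    """The reference object depends on the detected quantity of
    detection paths: one path -> current detection result,
    more than one -> current time point."""
    return "detection_result" if num_detection_paths == 1 else "time_point"

def should_reset(reference_object, score=None, now=None,
                 reset_threshold=0.9, period=60.0, last_reset=0.0):
    """Judge whether the reset time point has been reached,
    based on the chosen reference object."""
    if reference_object == "detection_result":
        return score is not None and score >= reset_threshold
    return now is not None and (now - last_reset) >= period
```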
An embodiment of the present invention provides an audio data processing device, including:
an acquiring unit, configured to obtain a speech detection model, the speech detection model being a correspondence between audio data of at least one detection path having a historical-accumulation characteristic and a speech recognition result;
a determination unit, configured to determine a reference object based on the detected quantity of the at least one detection path, the reference object being the factor by which a reset operation is judged, and to determine a reset time point based on the reference object, the reset time point being the moment at which the historical accumulation in the speech detection model is initialized while speech recognition performance is guaranteed; and
a reset unit, configured to reset the speech detection model when the reset time point is reached.
In the above device, the determination unit is further configured to determine, when the detected quantity of detection paths is one, that the reference object is the current detection result.
Correspondingly, the acquiring unit is further configured to obtain audio data to be detected, and to recognize the audio data to be detected with the speech detection model to obtain the current detection result.
The determination unit is further specifically configured to determine, when the current detection result meets a preset reset threshold, that the current time point is the reset time point, where the preset reset threshold is greater than or equal to a preset wake-up threshold.
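The single-path rule above (the reset threshold is at least the wake-up threshold, so only a detection at least as confident as a wake-up triggers a reset) can be sketched as follows, with illustrative threshold values:

```python
WAKE_THRESHOLD = 0.80   # illustrative value
RESET_THRESHOLD = 0.90  # per the claim, must be >= WAKE_THRESHOLD

def handle_score(score):
    """Return (woke, reset_now) for one current detection result."""
    woke = score > WAKE_THRESHOLD
    reset_now = score >= RESET_THRESHOLD  # current time becomes reset point
    return woke, reset_now
```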
In the above device, the determination unit is further configured to determine, when the detected quantity of detection paths is greater than one, that the reference object is the current time point.
In the above device, the acquiring unit is further configured to obtain the history detection results before the current time point after the audio data to be detected has been recognized with the speech detection model and the current detection result obtained.
The determination unit is further configured to determine, when the variation range between the current detection result and the history detection results meets a preset false-wake-up range, that the current time point is the reset time point.
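The rule above compares the current detection result with the history detection results; one possible reading, with an illustrative false-wake-up band and a mean-of-history baseline as assumptions, is:

```python
def variation_triggers_reset(current_score, history_scores,
                             false_wake_low=0.3, false_wake_high=0.7):
    """Reset when the jump from the recent history baseline to the
    current score falls inside the preset false-wake-up range
    (band limits and baseline choice are illustrative assumptions)."""
    if not history_scores:
        return False
    baseline = sum(history_scores) / len(history_scores)
    variation = abs(current_score - baseline)
    return false_wake_low <= variation <= false_wake_high
```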
In the above device, the at least one detection path includes a backup detection path.
The acquiring unit is further configured to obtain the current time point.
The determination unit is further configured to determine, when the current time point reaches a preset preheating time point, the current time point as the reset time point of the backup detection path, where the preset preheating time point is the time point that precedes a preset reset time point by a preset preheating duration.
In the above device, the reset unit is specifically configured to reset and start the backup detection path when the current time point reaches the preset preheating time point.
In the above device, the at least one detection path further includes a main detection path, and the audio data processing device further includes a recognition unit and a closing unit.
The recognition unit is configured to perform speech recognition with the main detection path and the backup detection path after the backup detection path has been reset and started.
The reset unit is further specifically configured to reset the main detection path when, after the preset preheating duration has elapsed, the preset reset time point is reached.
The closing unit is configured to close the backup detection path when the preset preheating duration has elapsed since the preset reset time point.
The recognition unit is further configured to then perform speech recognition with the main detection path.
In the above device, the preset reset time points form a time series spaced by a preset duration;
the preset duration lies in the range between twice the preset preheating duration and a preset tolerable wake-up threshold value;
the preset tolerable wake-up threshold value lies between a preset optimal wake-up upper limit and a preset optimal false-wake-up lower limit; and
the preset preheating duration is greater than or equal to the duration of the preset wake-up word.
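The main/backup scheduling described above can be sketched as an event timeline: the backup path is reset and started one preheating duration before each preset reset time point, the main path is reset at the reset point, and the backup is closed one preheating duration after it (all duration values are illustrative):

```python
def handover_schedule(reset_period, preheat, horizon):
    """Build (time, action) events for the main/backup handover.
    The claim requires the reset period to be at least twice the
    preheating duration, so consecutive warm-ups never overlap."""
    assert reset_period >= 2 * preheat
    events = []
    t = reset_period
    while t <= horizon:
        events.append((t - preheat, "reset_and_start_backup"))
        events.append((t, "reset_main"))
        events.append((t + preheat, "close_backup"))
        t += reset_period
    return events
```

For example, with a 60-second period and a 2-second preheat, the backup starts at t = 58, the main path is reset at t = 60, and the backup is closed at t = 62.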
In the above device, the audio data processing device further includes a receiving unit and a comprehensive processing unit.
The receiving unit is configured to receive audio data to be detected.
The recognition unit is specifically configured to perform speech recognition on the audio data to be detected with the main detection path to obtain a main detection result, and, when the main detection result is greater than the preset wake-up threshold, to recognize the audio data to be detected as the wake-up word and start the wake-up function.
In the above device, the audio data processing device further includes a recognition unit, configured to perform speech recognition with the reset speech detection model after the speech detection model has been reset upon the reset time point being reached.
In the above device, the audio data processing device further includes a comprehensive processing unit.
The recognition unit is specifically configured, in speech detection based on at least one direction branch, to perform speech recognition on each of the at least one direction branch with the reset speech detection model, obtaining at least one current detection result.
The comprehensive processing unit is configured to perform comprehensive processing on the at least one current detection result to obtain a comprehensive detection result.
The recognition unit is further configured, when the comprehensive detection result is greater than the preset wake-up threshold, to recognize the wake-up word and start the wake-up function.
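The multi-direction-branch fusion above might be sketched as follows; taking the maximum over branch scores is an illustrative choice of comprehensive processing, not one the patent specifies:

```python
def comprehensive_result(branch_scores):
    """Fuse per-direction detection results into one comprehensive result."""
    return max(branch_scores)

def wake_decision(branch_scores, wake_threshold=0.8):
    """Start the wake-up function when the comprehensive result exceeds
    the preset wake-up threshold (threshold value illustrative)."""
    return comprehensive_result(branch_scores) > wake_threshold
```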
In the above device, the reset unit is specifically configured to initialize, when the reset time point is reached, the data having the historical-accumulation characteristic in the speech detection model, obtaining the reset speech detection model.
An embodiment of the present invention provides an audio data processing device, including:
a memory, configured to store executable audio data processing instructions; and
a processor, configured to implement the audio data processing method provided by the embodiments of the present invention when executing the executable audio data processing instructions stored in the memory.

An embodiment of the present invention provides a computer-readable storage medium storing executable audio data processing instructions which, when executed by a processor, implement the audio data processing method provided by the embodiments of the present invention.
The embodiments of the present invention have the following beneficial effects:
The embodiments of the present invention provide an audio data processing method and device, and a storage medium. A speech detection model is obtained, the speech detection model being a correspondence between audio data of at least one detection path having a historical-accumulation characteristic and a speech recognition result; a reference object is determined based on the detected quantity of the at least one detection path, the reference object being the factor by which a reset operation is judged; a reset time point is determined based on the reference object, the reset time point being the moment at which the historical accumulation in the speech detection model is initialized while speech recognition performance is guaranteed; and the speech detection model is reset when the reset time point is reached. With the above technical solution, the audio data processing device can determine, according to the quantity of detection paths of different speech detection models, how the reset operation in the speech detection model is to be judged, and can further determine the reset time point based on the reference object. That is, for the different detection paths of a speech detection model, the respective reset time points can be judged through the determination of different reference objects, and each reset time point is a moment at which the historical accumulation in the speech detection model is initialized while speech recognition performance is guaranteed. At the reset time point, once the speech detection model has been reset, it carries no historical trace. In this way, with the reset time point guaranteeing wake-up performance, and the model freed from the influence of long-term historical accumulation, the accuracy of wake-word speech recognition is improved.
Brief description of the drawings
Fig. 1 is an optional architecture diagram of the audio data processing system provided by an embodiment of the present invention;
Fig. 2 is an optional structural diagram of the terminal provided by an embodiment of the present invention;
Fig. 3 is an optional structural diagram of the audio data processing device provided by an embodiment of the present invention;
Fig. 4 is a first optional flow diagram of the audio data processing method provided by an embodiment of the present invention;
Fig. 5A is a first exemplary scenario diagram of wake-word detection provided by an embodiment of the present invention;
Fig. 5B is a second exemplary scenario diagram of wake-word detection provided by an embodiment of the present invention;
Fig. 6 is a structure chart of an exemplary LSTM memory cell provided by an embodiment of the present invention;
Fig. 7 is a second optional flow diagram of the audio data processing method provided by an embodiment of the present invention;
Fig. 8 is a third optional flow diagram of the audio data processing method provided by an embodiment of the present invention;
Fig. 9 is an exemplary speech recognition scenario diagram with at least two detection paths provided by an embodiment of the present invention;
Fig. 10 is an exemplary curve of first-time wake-up success rate versus standby time provided by an embodiment of the present invention;
Fig. 11 is an exemplary timing diagram of the main and backup detection paths provided by an embodiment of the present invention;
Fig. 12 is a fourth optional flow diagram of the audio data processing method provided by an embodiment of the present invention;
Fig. 13 is a first exemplary speech detection scenario diagram of multi-direction branches provided by an embodiment of the present invention;
Fig. 14 is a second exemplary speech detection scenario diagram of multi-direction branches provided by an embodiment of the present invention;
Fig. 15 is a third exemplary speech detection scenario diagram of multi-direction branches provided by an embodiment of the present invention;
Fig. 16 is a fourth exemplary speech detection scenario diagram of multi-direction branches provided by an embodiment of the present invention;
Fig. 17 is a first exemplary speech recognition scenario diagram provided by an embodiment of the present invention;
Fig. 18 is a second exemplary speech recognition scenario diagram provided by an embodiment of the present invention.
Specific embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present invention. The terms used herein are intended only to describe the embodiments of the present invention and are not intended to limit the present invention.
Before the embodiments of the present invention are further elaborated, the nouns and terms involved in the embodiments of the present invention are explained; they apply to the following interpretations.
1) Wake-up word: the keyword that starts a voice-interaction smart device; in the embodiments of the present invention, it refers to the voice signal corresponding to the keyword that starts the audio data processing device.
2) Feature extraction: converting raw features into a group of features with obvious physical significance (Gabor, geometric features [corner points, invariants], texture [LBP, HOG], etc.) or statistical significance, or kernel features. In the embodiments of the present invention, feature extraction refers to extracting the feature quantities of important audio information from audio data.
The following describes exemplary applications of the audio data processing device implementing the embodiments of the present invention. The audio data processing device provided by the embodiments of the present invention may be implemented as various types of user terminals with a speech recognition or audio data processing function, such as a smartphone, a tablet computer, a laptop, or a voice-interaction smart device (for example, a smart speaker), and may also be implemented as a server, the server here being a background server running an audio data processing or speech recognition application. In the following, an exemplary application covering the terminal is described for the case where the audio data processing device is implemented as a terminal.
Referring to Fig. 1, Fig. 1 is an optional architecture diagram of the audio data processing system 100 provided by an embodiment of the present invention, supporting an exemplary application. The terminal 400 (terminal 400-1 and terminal 400-2 are illustrated) is connected to the server 300 through the network 200; the network 200 may be a wide area network or a local area network, or a combination of the two, and uses wireless links for data transmission.

The terminal 400 is configured to: obtain a speech detection model, the speech detection model being a correspondence between audio data of at least one detection path having a historical-accumulation characteristic and a speech recognition result; determine a reference object based on the detected quantity of the at least one detection path, the reference object being the factor by which a reset operation is judged; determine a reset time point based on the reference object, the reset time point being the moment at which the historical accumulation in the speech detection model is initialized while speech recognition performance is guaranteed; reset the speech detection model when the reset time point is reached, obtaining a reset speech detection model; perform speech recognition on the obtained audio data to be detected with the reset speech detection model and determine whether to start the wake-up function; and, when the wake-up function is determined, receive audio data to be detected, perform speech recognition on the audio data to be detected to obtain a functional voice instruction, and send the functional voice instruction to the server 300.

The server 300 is configured to generate a function triggering instruction according to the functional voice instruction, and to control the terminal 400 or other terminals according to the function triggering instruction so as to realize the function triggered by the functional voice instruction.
The audio data processing device provided by the embodiments of the present invention may be implemented in hardware or in a combination of hardware and software. Various exemplary implementations of the device provided by the embodiments of the present invention are described below.
Referring to Fig. 2, Fig. 2 is an optional structural diagram of the terminal 400 provided by an embodiment of the present invention. The terminal 400 may be a mobile phone, a computer, a digital broadcast terminal, an audio data transceiver, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, and so on. From the structure of the terminal 400, an exemplary structure for implementing the audio data processing device as a terminal can be anticipated, so the structure described here should not be construed as limiting; for example, some of the components described below may be omitted, or components not described here may be added to meet the special requirements of certain applications.

The terminal 400 shown in Fig. 2 includes at least one processor 410, a memory 440, at least one network interface 420, and a user interface 430. The various components in the terminal 400 are coupled by a bus system 450. It can be understood that the bus system 450 is used to realize connection and communication between these components. In addition to a data bus, the bus system 450 includes a power bus, a control bus, and a status signal bus. For the sake of clarity, however, the various buses are all designated as the bus system 450 in Fig. 2.
The user interface 430 may include a display, a keyboard, a mouse, a trackball, a click wheel, keys, buttons, a touch pad, a touch screen, or the like.
The memory 440 may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM, Read-Only Memory), a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), a flash memory, and the like. The volatile memory may be a random access memory (RAM, Random Access Memory), used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory) and synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory). The memory 440 described in the embodiments of the present invention is intended to include these and any other suitable types of memory.
The memory 440 in the embodiments of the present invention can store data to support the operation of the terminal 400. Examples of such data include any computer program for running on the terminal 400, such as an operating system 442 and an executable program 441. The operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, for realizing various basic services and processing hardware-based tasks. The executable program may include various application programs, such as executable audio data processing instructions.
As an example of implementing the audio data processing method provided by the embodiments of the present invention with a combination of software and hardware, the audio data processing method provided by the embodiments of the present invention may be directly embodied as a combination of software modules executed by the processor 410. The software modules may be located in a storage medium, the storage medium is located in the memory 440, and the processor 410 reads the executable audio data processing instructions included in the software modules in the memory 440 and completes, in combination with the necessary hardware (for example, including the processor 410 and other components connected to the bus 450), the audio data processing method provided by the embodiments of the present invention.

As an example, the processor 410 may be an integrated circuit chip with signal processing capability, for example a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like, where the general-purpose processor may be a microprocessor or any conventional processor.
Illustratively, an embodiment of the present invention provides an audio data processing device, including at least:
a memory 440, configured to store executable audio data processing instructions; and
a processor 410, configured to implement the audio data processing method provided by the embodiments of the present invention when executing the executable audio data processing instructions stored in the memory 440.
The exemplary structure of the software modules is described below. In some embodiments, as shown in Fig. 3, the software modules in the audio data processing device 1 may include an acquiring unit 10, a determination unit 11, and a reset unit 12, where:
the acquiring unit 10 is configured to obtain a speech detection model, the speech detection model being a correspondence between audio data of at least one detection path having a historical-accumulation characteristic and a speech recognition result;
the determination unit 11 is configured to determine a reference object based on the detected quantity of the at least one detection path, the reference object being the factor by which a reset operation is judged, and to determine a reset time point based on the reference object, the reset time point being the moment at which the historical accumulation in the speech detection model is initialized while speech recognition performance is guaranteed; and
the reset unit 12 is configured to reset the speech detection model when the reset time point is reached.
In some embodiments of the present invention, the determination unit 11 is further configured to determine, when the detected quantity of detection paths is one, that the reference object is the current detection result.

In some embodiments of the present invention, the determination unit 11 is further configured to determine, when the detected quantity of detection paths is greater than one, that the reference object is the current time point.

In some embodiments of the present invention, the acquiring unit 10 is further configured to obtain audio data to be detected, and to recognize the audio data to be detected with the speech detection model to obtain the current detection result; the determination unit 11 is further specifically configured to determine, when the current detection result meets the preset reset threshold, that the current time point is the reset time point, where the preset reset threshold is greater than or equal to the preset wake-up threshold.

In some embodiments of the present invention, the acquiring unit 10 is further configured to obtain the history detection results before the current time point after the audio data to be detected has been recognized with the speech detection model and the current detection result obtained; the determination unit 11 is further configured to determine, when the variation range between the current detection result and the history detection results meets the preset false-wake-up range, that the current time point is the reset time point.
In some embodiments of the present invention, the at least one detection path includes a backup detection path; the acquiring unit 10 is further configured to obtain the current time point; and the determination unit 11 is further configured to determine, when the current time point reaches the preset preheating time point, the current time point as the reset time point of the backup detection path, where the preset preheating time point is the time point that precedes a preset reset time point by the preset preheating duration.

In some embodiments of the present invention, the reset unit 12 is specifically configured to reset and start the backup detection path when the current time point reaches the preset preheating time point.

In some embodiments of the present invention, the at least one detection path further includes a main detection path, and the audio data processing device 1 further includes a recognition unit 13 and a closing unit 14; the recognition unit 13 is configured to perform speech recognition with the main detection path and the backup detection path after the backup detection path has been reset and started; the reset unit 12 is further specifically configured to reset the main detection path when, after the preset preheating duration has elapsed, the preset reset time point is reached; the closing unit 14 is configured to close the backup detection path when the preset preheating duration has elapsed since the preset reset time point; and the recognition unit 13 is further configured to then perform speech recognition with the main detection path.

In some embodiments of the present invention, the preset reset time points form a time series spaced by a preset duration; the preset duration lies in the range between twice the preset preheating duration and the preset tolerable wake-up threshold value; the preset tolerable wake-up threshold value lies between the preset optimal wake-up upper limit and the preset optimal false-wake-up lower limit; and the preset preheating duration is greater than or equal to the duration of the preset wake-up word.
In some embodiments of the present invention, the audio data processing device 1 further includes a receiving unit 15 and a comprehensive processing unit 16; the receiving unit 15 is configured to receive audio data to be detected; and the recognition unit 13 is specifically configured to perform speech recognition on the audio data to be detected with the main detection path to obtain a main detection result, and, when the main detection result is greater than the preset wake-up threshold, to recognize the audio data to be detected as the wake-up word and start the wake-up function.
In some embodiments of the invention, the audio-frequency data processing device 1 further includes recognition unit 13;
The recognition unit 13, for described when the reset time point reaches, reset the speech detection model it
Afterwards, speech recognition is carried out using the speech detection model after resetting.
In some embodiments of the invention, the audio-frequency data processing device 1 further includes integrated treatment unit 16;
The recognition unit 13, specifically in the speech detection based at least one direction branch, according to described heavy
The speech detection model postponed carries out speech recognition at least one direction branch respectively, obtains at least one current detection knot
Fruit;
The integrated treatment unit 16 obtains comprehensive for carrying out integrated treatment at least one described current detection result
Close testing result;
The recognition unit 13 is identified also particularly useful for when the comprehensive detection result is greater than default wake-up thresholding
Word is waken up, arousal function is started.
In some embodiments of the invention, the reset cell 12 is specifically used for when the reset time point reaches,
Initialize the data with historical accumulation characteristic in the speech detection model, the speech detection model after being reset.
In practical applications, the acquiring unit 10, the determination unit 11, the reset cell 12, the identification are single
Member 13, closing unit 14 and the integrated treatment unit 16 can be realized by processor, and receiving unit 15 can then be connect by user
It mouthful realizes, the embodiment of the present invention is with no restriction.
As an example in which the audio data processing method provided in the embodiments of the present invention is implemented in hardware, the method may be executed directly by a processor 410 in the form of a hardware decoding processor, for example by one or more application-specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), field-programmable gate arrays (FPGA, Field-Programmable Gate Array) or other electronic components.
The audio data processing method of the embodiments of the present invention is described below in conjunction with the exemplary applications and implementations of the audio data processing apparatus described above.
Referring to Fig. 4, Fig. 4 is an optional flow diagram of the audio data processing method provided in the embodiments of the present invention; the steps shown in Fig. 4 are described as follows.
S101: Obtain a speech detection model, the speech detection model being a correspondence between the audio data of at least one detection path with a historical-accumulation characteristic and a speech recognition result.
S102: Determine a reference object based on the number of detected detection paths; the reference object is the factor used to judge whether to perform a reset operation.
S103: Based on the reference object, determine a reset time point; the reset time point is a moment at which the historical accumulation in the speech detection model is initialized while speech recognition performance is preserved.
S104: When the reset time point arrives, reset the speech detection model.
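Steps S101-S104 can be condensed into a short control-flow sketch. The class, method and attribute names below are illustrative assumptions chosen for exposition, not part of the disclosed implementation.

```python
# Minimal sketch of S101-S104, under assumed names: obtain a model with
# historical accumulation (S101), pick the reference object from the
# number of detection paths (S102/S103), and reset at the reset time
# point by clearing the accumulated history (S104).

class SpeechDetectionModel:
    def __init__(self, num_paths):
        self.num_paths = num_paths
        self.history = []      # data with the historical-accumulation characteristic

    def reset(self):
        self.history.clear()   # S104: initialize the historical accumulation

def choose_reference_object(model):
    # S102: one path -> judge by the current detection result;
    # more than one -> judge by the current time point.
    return "current_result" if model.num_paths == 1 else "current_time"

model = SpeechDetectionModel(num_paths=2)    # S101
print(choose_reference_object(model))        # -> current_time
model.history = [0.2, 0.5, 0.9]
model.reset()                                # S104
print(model.history)                         # -> []
```

The two branches of `choose_reference_object` correspond to the single-path and multi-path embodiments detailed further below.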
The audio data processing method provided in the embodiments of the present invention is applied in speech detection or speech recognition scenarios, such as wake-up word detection; the embodiment of the present invention imposes no restriction on this.
The method is illustrated below using a wake-up word detection scenario as an example.
In the wake-up word detection scenario shown in Fig. 5A, the audio data processing apparatus receives audio data to be detected in real time, inputs the received audio data into the wake-up word detection model (i.e., the speech detection model) for recognition, and finally outputs a wake-up word detection result; whether to wake the apparatus is decided according to this detection result.
Illustratively, the audio data to be detected may be a continuous monophonic signal (a continuous time-domain or continuous frequency-domain signal; the embodiment of the present invention imposes no restriction on this), which is typically fed into the wake-up word detection model frame by frame. After receiving each input frame, the wake-up word detection model detects/judges whether the predefined wake-up word appears within the latest time window T, i.e., identifies whether the input is the preset wake-up word. Finally, the wake-up word detection model outputs a detection result per frame.
It should be noted that the embodiments of the present invention do not restrict the output form of the detection result: it may be a specific score, a yes/no indication, or either of two identification forms of the wake-up word, such as a binary representation or a text result; the embodiment of the present invention imposes no restriction on this.
Illustratively, with a binary representation, an output of 1 indicates that the wake-up word was detected within the time window T, and an output of 0 indicates that it was not.
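The frame-by-frame binary output over a sliding window of T frames can be illustrated with a toy detector; the window length, scores and threshold below are invented for the example.

```python
# Toy illustration of the binary per-frame output: emit 1 for frame t if
# any score inside the latest window of T frames reaches the threshold,
# else 0. Scores stand in for a real wake-word detector's outputs.

def frame_outputs(scores, window, threshold):
    out = []
    for t in range(len(scores)):
        recent = scores[max(0, t - window + 1): t + 1]
        out.append(1 if max(recent) >= threshold else 0)
    return out

scores = [0.1, 0.2, 0.95, 0.3, 0.1, 0.1]
print(frame_outputs(scores, window=3, threshold=0.8))  # -> [0, 0, 1, 1, 1, 0]
```

A single high-scoring frame keeps the output at 1 for the following T-1 frames, matching the "detected within the latest time window T" convention described above.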
Based on the speech-recognition-like scenario shown in Fig. 5A, the embodiments of the present invention propose a method for choosing the moment at which the speech detection model is reset, so that during subsequent audio data detection the reset speech detection model maintains a high level of recognition accuracy.
Here, the audio data processing apparatus performs speech recognition using a speech detection model, which is a correspondence between the audio data of at least one detection path with a historical-accumulation characteristic and a speech recognition result. The apparatus first obtains the speech detection model. Since the model may contain one or more detection paths for speech recognition, the apparatus first detects the detection paths of the model; after detecting at least one detection path, it determines, based on the number of detection paths, the reference object corresponding to each case. The reference object is the factor used to judge whether to perform a reset operation, chosen so that resetting the model at the reset time point determined by this judgement preserves the data or characteristics that keep wake-up word detection accurate. Having obtained the reference object, the apparatus determines, for the two kinds of detection-path cases, the corresponding reset time point, where the reset time point is a moment at which the historical accumulation in the speech detection model is initialized while speech recognition performance is preserved. When the reset time point arrives, the speech detection model is reset, yielding a reset speech detection model.
In some embodiments of the invention, the specific reset process is: when the reset time point arrives, initialize the data with the historical-accumulation characteristic in the speech detection model, obtaining a reset speech detection model.
In some embodiments of the invention, when the number of detected detection paths is one, the reference object is determined to be the current detection result.
In some embodiments of the invention, when the number of detected detection paths is greater than one, the reference object is determined to be the current time point.
That is, the embodiments of the present invention distinguish the case of a single detection path from the case of at least two (more than one) detection paths. With a single detection path, the audio data processing apparatus judges the reset time point of the speech detection model based on the current detection result; with at least two detection paths, it judges the reset time point based on the current time point, or more precisely according to the current time point and a preset reset time condition, as detailed in the embodiments below.
It can be understood that, because the audio data processing apparatus can decide how to judge the reset operation from the number of detection paths of the speech detection model, and then further determine the reset time point based on the reference object, different detection-path configurations of the speech detection model lead, through different reference objects, to their respective reset time points. The reset time point is a moment at which the historical accumulation in the speech detection model is initialized while speech recognition performance is preserved; once the model has been reset at that point, it carries no historical trace. In this way, on the premise that wake-up performance is guaranteed at the reset time, and free from the influence of long-term historical accumulation, the speech detection model achieves higher accuracy when recognizing the wake-up word.
In some embodiments of the invention, after S104, the audio data processing apparatus can perform speech recognition with the reset speech detection model, and the recognition accuracy of the resulting detection results is correspondingly high.
It should be noted that, in the embodiments of the present invention, the speech detection model is a speech recognition model with a historical-accumulation characteristic, for example an LSTM.
An LSTM is a recurrent neural network over time that can selectively remember historical information (the historical-accumulation characteristic). It improves on the RNN model: replacing the hidden-layer nodes of an RNN with LSTM units yields an LSTM network.
The state of the memory cell (Memory Cell, Cell) of an LSTM unit (i.e., the core gate) is controlled by three gates: the input gate, the forget gate and the output gate.
The input gate selectively feeds the current data into the memory cell; the forget gate regulates the influence of historical information on the current memory cell state; the output gate selectively outputs the memory cell state. The three gates together with the independent memory cell give the LSTM unit the ability to save, read, reset and update long-range historical information. Illustratively, Fig. 6 shows the structure of one LSTM memory cell.
First, the input feature x_t at time t and the hidden-layer variable h_{t-1} at time t-1, under the joint action of the weight transfer matrices W and U and the bias vector b, generate the state quantities i_t, f_t and o_t at time t; see formulas (1) to (3). Then, with the aid of the core gate state c_{t-1} at time t-1, the core gate state c_t at time t is generated; see formula (4). Finally, under the action of the core gate state c_t and the output gate state o_t, the hidden-layer variable h_t at time t is generated, which in turn influences the internal change of the LSTM neuron at time t+1; see formula (5).
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)   (1)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)   (2)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)   (3)
c_t = f_t * c_{t-1} + i_t * φ(W_c x_t + U_c h_{t-1} + b_c)   (4)
h_t = o_t * φ(c_t)   (5)
Here, the two nonlinear activation functions are σ(x_t) = 1 / (1 + e^(-x_t)) and φ(x_t) = tanh(x_t).
i_t, f_t, o_t and c_t denote, respectively, the input gate state, forget gate state, output gate state and core gate state at time t. In the embodiments of the present invention, for each logic gate, W_i, W_f, W_o and W_c denote the weight transfer matrices corresponding to the input gate, forget gate, output gate and core gate; U_i, U_f, U_o and U_c denote the weight transfer matrices applied to the hidden-layer variable h_{t-1} at time t-1 for the input gate, forget gate, output gate and core gate; and b_i, b_f, b_o and b_c denote the bias vectors corresponding to the input gate, forget gate, output gate and core gate.
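Equations (1)-(5) can be transcribed directly into code. The sketch below uses plain Python lists so each term is explicit; the tiny dimensions and constant weights are illustrative stand-ins, not trained parameters.

```python
# One LSTM unit step, transcribed term by term from equations (1)-(5).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # σ in the equations

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b are dicts keyed by gate: 'i', 'f', 'o', 'c' (core)."""
    def gate(k, act):
        wx = matvec(W[k], x_t)
        uh = matvec(U[k], h_prev)
        return [act(a + u + bb) for a, u, bb in zip(wx, uh, b[k])]
    i_t = gate('i', sigmoid)                     # eq. (1)
    f_t = gate('f', sigmoid)                     # eq. (2)
    o_t = gate('o', sigmoid)                     # eq. (3)
    cand = gate('c', math.tanh)
    c_t = [f * cp + i * g                        # eq. (4)
           for f, cp, i, g in zip(f_t, c_prev, i_t, cand)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]   # eq. (5)
    return h_t, c_t

# Toy 2-dimensional demo with constant weights and zero initial state.
W = {k: [[0.5, 0.5], [0.5, 0.5]] for k in 'ifoc'}
U = {k: [[0.0, 0.0], [0.0, 0.0]] for k in 'ifoc'}
b = {k: [0.0, 0.0] for k in 'ifoc'}
h, c = lstm_step([1.0, 1.0], [0.0, 0.0], [0.0, 0.0], W, U, b)
```

Because c_t depends on c_{t-1} through the forget gate, the cell state carries exactly the kind of historical trace that the reset operation described in this document clears.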
Illustratively, because an LSTM has a historical trace (which can be understood as the historical-accumulation characteristic), speech detection or speech recognition on audio data to be detected is influenced by historical detection data, which in turn affects the output detection result. That historical trace is finite and cannot extend without bound; moreover, within the time span over which the historical trace exists, the false-wake-up rate climbs as the apparatus's standby time grows, i.e., the probability of false wake-up keeps increasing. The reset time point in the embodiments of the present invention is precisely a time point set within this finite historical-trace window: resetting the speech detection model at the reset time point restores good wake-up performance. Concretely, the reset process is that, at the reset time point, the audio data processing apparatus initializes and clears the data with historical memory stored in the speech detection model, so that the reset model is no longer affected by the historical trace of a long standby period.
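In a recurrent detector of this kind, the only data with a historical-accumulation characteristic is the carried-over state, so the "initialization cleaning" described above amounts to re-zeroing that state. A stand-in model (not the patented one) makes this concrete:

```python
# Sketch of the reset operation: the history trace lives in the
# carried-over state vectors, and resetting re-initializes them.
# The update rule here is a toy stand-in for a real detector.

class RecurrentDetector:
    def __init__(self, size):
        self.size = size
        self.h = [0.0] * size   # hidden state (history trace)
        self.c = [0.0] * size   # cell state (history trace)

    def step(self, frame_energy):
        # Toy update: the state leaks in a fraction of each frame's energy.
        self.c = [0.9 * c + 0.1 * frame_energy for c in self.c]
        self.h = [0.5 * c for c in self.c]

    def reset(self):
        # Initialization cleaning at the reset time point.
        self.h = [0.0] * self.size
        self.c = [0.0] * self.size

d = RecurrentDetector(size=2)
for energy in [1.0, 1.0, 1.0]:
    d.step(energy)
print(any(v != 0.0 for v in d.c))   # -> True (history has accumulated)
d.reset()
print(d.c, d.h)                     # -> [0.0, 0.0] [0.0, 0.0]
```

After `reset()` the detector's next outputs no longer depend on anything heard during standby, which is exactly the property the document attributes to the reset model.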
In some embodiments of the invention, when the number of detected detection paths is one, the reference object is the current detection result, and the audio data processing apparatus resets the model in the course of speech recognition. Referring to Fig. 7, Fig. 7 is an optional flow diagram of the audio data processing method provided in the embodiments of the present invention; after S102, S201-S205 may also be executed, as follows.
S201: Obtain audio data to be detected.
S202: Recognize the audio data to be detected using the speech detection model, obtaining a current detection result.
S203: When the current detection result meets the preset reset threshold, determine the current time point as the reset time point.
Here, the preset reset threshold is greater than or equal to the preset wake-up threshold.
In the embodiments of the present invention, when the number of detected detection paths is one, the reference object is the current detection result, and the audio data processing apparatus resets the model during speech recognition.
In S201, the audio data processing apparatus obtains or receives the audio data to be detected in real time.
Because the data is obtained in real time, the audio data to be detected may be external noise, or a continuous signal input by a user or another sounding device; the embodiment of the present invention imposes no restriction on this.
In S202, since the audio data processing apparatus is provided with the speech detection model, after receiving the audio data to be detected it can perform speech recognition on that data with the speech detection model and output a current detection result.
In the embodiments of the present invention, in the course of performing speech detection on the audio data to be detected, the audio data processing apparatus first extracts audio features from the audio data and feeds the audio features into the speech detector, which outputs the current detection result.
In some embodiments of the invention, the feature extraction methods include SPP feature extraction, mel-frequency cepstral coefficient features and the like; the embodiment of the present invention imposes no restriction on this.
It should be noted that the detection result in the embodiments of the present invention may be a score or identification information (for example, 0 or 1); the embodiment of the present invention imposes no restriction on this.
In S203, the preset reset threshold is a value of the same type as the current detection result, that is, data that can be compared with the current detection result. The audio data processing apparatus compares the current detection result with the preset reset threshold; when the current detection result meets the preset reset threshold, this indicates that the speech detection model can be reset, so the apparatus takes the current time point and determines it to be the reset time point. Here, the preset reset threshold is greater than or equal to the preset wake-up threshold.
In the embodiments of the present invention, the preset reset threshold may characterize a lower-limit value at which the speech detection model can be reset, or a value range within which it can be reset; when the current detection result meets that lower-limit value, or falls within that range, the speech detection model can be reset.
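The ordering constraint (preset reset threshold >= preset wake-up threshold) can be sketched in a few lines; the point values match the worked example given later in this description, and the function name is an assumption.

```python
# Single-path decision order: because RESET_THRESHOLD >= WAKE_THRESHOLD,
# a reset can only follow a completed wake-up judgement, never interrupt one.

WAKE_THRESHOLD = 80
RESET_THRESHOLD = 90   # must satisfy RESET_THRESHOLD >= WAKE_THRESHOLD

def judge(score):
    woken = score >= WAKE_THRESHOLD
    reset = score >= RESET_THRESHOLD   # evaluated after the wake judgement
    return woken, reset

print(judge(85))   # -> (True, False): wake, but keep the model's history
print(judge(95))   # -> (True, True): wake first, then reset the model
print(judge(70))   # -> (False, False)
```

Inverting the two thresholds would allow a reset to fire before the wake-up condition is reached, which is precisely the failure mode the description warns against.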
It should be noted that, when applied in the wake-up word detection scenario shown in Fig. 5B, the audio data processing apparatus obtains the audio data to be detected, recognizes it using the wake-up word detection model (the speech detection model) to obtain the current detection result, and judges whether to reset according to that result: when the current detection result meets the preset reset threshold, the current time point is determined to be the reset time point, and the wake-up word detection algorithm is reset at that point.
The preset reset threshold must be greater than the preset wake-up threshold. The preset wake-up threshold is the threshold on the detection result that determines whether the wake-up function of the audio data processing apparatus can be triggered.
It should be noted that, in the embodiments of the present invention, the current detection result is used both for the reset judgement and for the wake-up judgement.
In the embodiments of the present invention, the speech detection model (the wake-up word detection algorithm) is reset when the current detection result exceeds the preset reset threshold. It can be understood that, when the preset reset threshold is chosen to be greater than or equal to the wake-up threshold, the reset operation always follows the wake-up judgement. This avoids performing a reset in the middle of wake-up word detection, which would cause recognition errors and reduce accuracy.
Illustratively, suppose the audio data processing apparatus performs speech detection on audio data 1 and obtains a detection result of 85 points, with the preset reset threshold at 90 points and the preset wake-up threshold at 80 points. In this detection, the apparatus passes the wake-up judgement and is woken, but does not meet the reset threshold, so the speech detection model need not be reset. If, however, the detection result is 95 points, with the score output by the speech detection model climbing slowly until it finally reaches 95, then the wake-up judgement is made when the result rises past 80 points and the apparatus is woken, and only when the result keeps climbing past 90 points is a reset of the speech detection model judged necessary, by which time the wake-up judgement has already been completed. Were the preset reset threshold instead smaller than the preset wake-up threshold, the speech detection model would keep being reset before the wake-up condition was reached, producing spurious resets; keeping the reset threshold above the wake-up threshold avoids the risk of resetting in the middle of wake-up word detection.
It should be noted that the preset reset threshold and the preset wake-up threshold are of the same type; the embodiments of the present invention impose no restriction on their specific values.
It should be noted that the above setting of the reset time point is best suited to usage scenarios in which the user needs to perform multiple wake-up operations within a short time.
In the embodiments of the present invention, if the user needs to perform multiple wake-up operations within a short time, then once the score output by the speech detection model for the user's wake-up word (the audio data to be detected), i.e., the current detection result, has successfully exceeded the preset reset threshold once, every subsequent wake-up operation or wake-up judgement obtains the optimal wake-up performance response (because wake-up performance immediately after a reset is always optimal). Meanwhile, because the speech detection model has been reset, the next wake-up word more easily obtains a high score, and a high score in turn more easily reaches the preset reset threshold, i.e., more easily triggers another reset of the speech detection model.
Meanwhile, regarding false wake-up: if the preset reset threshold is sufficiently high (greater than or equal to the preset wake-up threshold used for waking), the probability that noise causes a reset of the speech detection model while the audio data processing apparatus is in standby is very small. Moreover, because the expected mean of the time span from an initialization of the speech detection model to its first false wake-up is far larger than the typical interval between resets, the probability of a noise-induced false wake-up or spurious reset before the false-wake-up behavior reaches its optimal state is very low. Therefore, even if the apparatus is falsely woken by noise and spuriously reset while in standby, its wake-up performance suffers no obvious damage, and the accuracy of speech recognition, e.g., of wake-up operations, is still improved.
S204: Obtain the history detection results prior to the current time point.
S205: When the variation range between the current detection result and the history detection results meets the preset false-wake-up range, determine the current time point as the reset time point.
In S204, the audio data processing apparatus obtains the audio data to be detected in real time, and can therefore perform speech detection or speech recognition in real time, accumulating many detection results. Before the current time point, the apparatus has already performed speech detection many times, so it can obtain the history detection results prior to the current time point.
Illustratively, before time t the audio data processing apparatus obtains the 50 history detection results of the 50 speech detections preceding time t.
In some embodiments of the invention, the audio data processing apparatus may instead take all detection results within a preset period before the current time point as the history detection results; the embodiment of the present invention imposes no restriction on the specific implementation.
In S205, based on the current detection result and the history detection results, the audio data processing apparatus can judge whether, across repeated detections, the detection result is varying violently or over a large range. When the detection result changes greatly, declining quickly and sharply, the speech detection model needs to be reset. That is, when the variation range between the current detection result and the history detection results meets the preset false-wake-up range, the current time point is determined to be the reset time point; the speech detection model is reset at that point, and speech recognition or speech detection then continues.
Here, the preset false-wake-up range characterizes the numerical range of a sharp decline in the detection result; within this range, the probability of false wake-up is very high.
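One way to realize the S205 judgement is to compare the current result against the recent average; the window and drop ratio below are illustrative tuning assumptions, since the description does not fix the exact form of the preset false-wake-up range.

```python
# Sketch of the S205 trigger: reset when the current detection result
# drops sharply relative to the recent history of results.

def should_reset(history, current, drop_ratio=0.5):
    """True when the current result falls below drop_ratio of the
    recent average, i.e. a quick and violent decline."""
    if not history:
        return False
    avg = sum(history) / len(history)
    return current < drop_ratio * avg

history = [62, 60, 61, 63, 59]        # stable scores in normal noise
print(should_reset(history, 58))      # -> False: slow, small decline
print(should_reset(history, 12))      # -> True: abrupt collapse, reset
```

A slow, small decline under familiar noise leaves the state intact; only the kind of abrupt collapse attributed above to loud or unseen noise triggers a reset.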
It should be noted that the speech detection model is reset when the detection result exhibits a quick and violent decline. It can be understood that, under normal noise (noise of a type contained in the training data of the speech detection model), the detection result of a speech detection model with a historical trace generally declines only slowly and slightly; only very loud noise, or a noise type the speech detection model never encountered during training, causes a quick and sharp decline of the detection result during speech detection, which in turn visibly degrades wake-up performance in the following period. Therefore, resetting the speech detection model when the audio data processing apparatus detects this kind of variation in the detection result avoids that problem, without significantly affecting wake-up performance, false-wake-up performance, memory or computation under ordinary usage scenarios, and also improves wake-up accuracy.
It should be noted that, in the embodiments of the present invention, S203 and S204-S205 are two optional implementations after S202; the audio data processing apparatus may execute either after S202 according to the actual situation, and the embodiment of the present invention imposes no restriction on this.
In some embodiments of the invention, when the quantity of the detection path detected is greater than one, references object
For current point in time, detection path at this moment includes: backup detection path and main detection path;Audio-frequency data processing device is in language
The reset process of progress model is audio data processing side provided in an embodiment of the present invention referring to Fig. 8, Fig. 8 during sound identifies
S301-S306 can also be performed after S102 in one optional flow diagram of method.It is as follows:
S301, current point in time is obtained.
S302, when current point in time reaches default preheating time point, current point in time is determined as backup detection path
Reset time point, wherein default preheating time point be since default reset time point before section of default preheating time
Time point.
S303, when current point in time reaches default preheating time point, reset and start backup detection path.
S304, speech recognition is carried out using main detection path and backup detection path.
S305, after by default preheating time section, when reaching the default reset time point, it is logical to reset main detection
Road.
S306, when since default reset time point using default preheating time section, close backup detection path, adopt
Speech recognition is carried out with main detection path.
In embodiments of the present invention, when the quantity of the detection path detected is greater than one, references object is current
Time point, and detection path includes: backup detection path and main detection path;Wherein, backup detection path and main detection path
The number embodiment of the present invention all with no restriction.
Illustratively, speech detection process schematic diagram shown in Figure 9, in audio-frequency data processing device, with one
It is illustrated for main detection path and a backup detection path, is provided among main detection path and backup detection path
Resetting and starting controller, the resetting and starting controller are used to control the resetting of main detection path, and control backup detection
The resetting and starting of access.Audio data to be detected can be detected after main detection path and backup detection path
As a result (main testing result and backup testing result), finally, final by being exported again after the progress integrated treatment of all testing results
Testing result, i.e., total testing result.
In embodiments of the present invention, references object is current point in time, specifically, audio-frequency data processing device is to be based on working as
Preceding time point and preset time condition carry out the determination of reset time point.
Wherein, the time parameter in preset time condition includes default reset time point, presets optimal wake-up upper limit value, is pre-
If best false wake-up lower limit value, default preheating time section and default wake-up word duration.Wherein, preset preheating time point be from
Preset the time point of the section of default preheating time before reset time point starts.
In this way, audio-frequency data processing device is after obtaining current point in time, when current point in time reaches default preheating time
When point, current point in time is determined as to the reset time point of backup detection path.When current point in time reaches default preheating time
When point, resets and start backup detection path.Speech recognition is carried out using main detection path and backup detection path.When by pre-
If when reaching the default reset time point, resetting main detection path after preheating time section.It is opened when from default reset time point
When beginning using default preheating time section, backup detection path is closed, speech recognition is carried out using main detection path.
The time parameters of the preset time condition satisfy the following relations:
the preset reset time points form a time series spaced by a preset time interval;
the preset time interval lies between twice the preset warm-up period and the preset tolerance wake-up threshold;
the preset tolerance wake-up threshold lies between the preset optimal wake-up upper limit and the preset optimal false-wake-up lower limit;
the preset warm-up period is greater than or equal to the preset wake-up-word duration.
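The relations among these time parameters can be sketched as a small validity check. This is an illustrative sketch only; the names (T0 for the optimal wake-up upper limit, T1 for the optimal false-wake-up lower limit, T2 for the tolerance wake-up threshold, K for the warm-up period, tau for the wake-up-word duration, D for the reset interval) are assumptions, not part of any real API.

```python
# Hypothetical check of the preset time condition described above.
# T0, T1, T2, K, tau, D are illustrative parameter names.

def time_condition_ok(T0, T1, T2, K, tau, D):
    """Return True if the preset time parameters satisfy the stated relations."""
    return (
        2 * K < D <= T2      # reset interval D lies between 2K and the tolerance T2
        and T0 <= T2 <= T1   # tolerance threshold between the two optimal limits
        and K >= tau         # warm-up period covers the wake-up-word duration
    )

print(time_condition_ok(T0=10.0, T1=60.0, T2=30.0, K=2.0, tau=1.5, D=20.0))  # True
```

A parameter set that violates any one relation (for example, a reset interval D not exceeding 2K) would fail this check.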
It should be noted that, for a speech detection model with a historical-accumulation characteristic, the wake-up success rate in a wake-up detection scenario changes over time.
Illustratively, Figure 10 shows the relation between the first-wake-up success rate and the standby time. After the audio data processing device has been in standby (i.e., has received no wake-up word from the user) for a time t satisfying t ≥ T_0, the wake-up success rate of the next one or first few wake-up operations drops markedly. The magnitude of the drop depends on t and on the intensity and characteristics of the ambient noise during the standby period t. Here T_0 denotes the lower limit of the history-insensitive time of the wake-up-word detection algorithm (i.e., of the speech detection model), that is, the preset optimal wake-up upper limit. When t ≤ T_0, the wake-up success rate does not drop significantly (provided the ambient noise in the standby period does not differ too much from the noise data used when training the model). The value of T_0 depends on the data configuration used during model training. Moreover, the historical tracking duration of the wake-up-word detection algorithm is limited; it is denoted T_1 (the preset optimal false-wake-up lower limit), and its value is determined by the model structure and tuning parameters of the speech detection model. Data accumulated earlier than T_1 has no effect (or a negligible effect) on the current result of the wake-up-word detection algorithm.
Therefore, in the embodiments of the present invention, t ≤ T_1 is the range within which the false-wake-up performance remains optimal.
In the embodiments of the present invention, when the user's wake-up operations are randomly distributed in time and the interval between two successive wake-up operations (the preset time interval) is long, a reset operation must be performed in the standby state to ensure that the standby time t before the user's next wake-up operation satisfies t ≤ T_1.
Illustratively, as shown in Figure 11, while the audio data processing device is in standby, the reset-and-start controller issues reset and start operations to the wake-up-word detection algorithm of the backup detection path at the moments {t_1 − K, t_2 − K, t_3 − K, …}. After the detection module of the backup detection path receives the reset-and-start command, it clears its internally accumulated historical data and begins receiving the input audio data to be detected. Here K is called the preset warm-up period; K must be greater than or equal to the preset wake-up-word duration τ, i.e. K ≥ τ, to guarantee that the backup detection path can correctly detect a wake-up word, which improves the accuracy of wake-up-word detection.
In addition, every period D (at each preset reset time point), the reset-and-start controller module issues a reset operation to the wake-up-word detection module of the main detection path. D may be a constant smaller than T_1, or a random number regenerated each time.
In the embodiments of the present invention, the preset reset time points are denoted {t_1, t_2, t_3, …}. The choice of reset time points must satisfy formula (6):

2K < t_{i+1} − t_i ≤ T_2  (6)

where K is the preset warm-up period and T_2 is the tolerable performance-degradation time (the preset tolerance wake-up threshold) chosen at system design, satisfying T_0 ≤ T_2 ≤ T_1.
At the moments {t_1 + K, t_2 + K, t_3 + K, …}, the reset-and-start controller issues a stop command to the wake-up-word detection algorithm of the backup detection path, and the backup detection path stops running or is closed. The running time of the backup detection path therefore extends from t_i − K to t_i + K.
It can be understood that if t_i falls within the audio data of some wake-up word, then at least the backup detection path receives the complete audio data of that wake-up word, so the wake-up word is still detected and the accuracy of wake-up-word detection is improved. Moreover, as long as formula (7) is satisfied:

T_0 / 2 ≥ K ≥ τ  (7)

any wake-up word occurring in the period from t_i − K to t_i + K will be answered by the backup detection path at its optimal wake-up performance, achieving optimal wake-up-word detection accuracy.
It should be noted that, in the embodiments of the present invention, the initial state of the backup detection path is closed; it is started only when the preset warm-up time point is reached.
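The reset schedule described above (reset and start the backup path at t_i − K, reset the main path at t_i, stop the backup path at t_i + K) can be sketched as an event timeline. The function and action names below are illustrative assumptions, not from the patent or any real API.

```python
# Illustrative sketch of the reset schedule. For each preset reset time point
# t_i, the reset-and-start controller:
#   - resets and starts the backup detection path at t_i - K,
#   - resets the main detection path at t_i,
#   - stops the backup detection path at t_i + K.

def reset_schedule(reset_points, K):
    """Return chronologically sorted (time, action) events for reset points t_i."""
    events = []
    for t in reset_points:
        events.append((t - K, "reset_and_start_backup"))
        events.append((t, "reset_main"))
        events.append((t + K, "stop_backup"))
    return sorted(events)

for time, action in reset_schedule([10.0, 30.0, 50.0], K=2.0):
    print(f"{time:5.1f}s  {action}")
```

With reset points spaced by more than 2K, as formula (6) requires, the backup path's running windows never overlap.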
In some embodiments of the present invention, the specific process by which the audio data processing device performs speech recognition in S304 is: receive the audio data to be detected; perform speech recognition on the audio data to be detected using the main detection path and the backup detection path respectively, obtaining a main detection result and a backup detection result; jointly process the main detection result and the backup detection result to obtain a total detection result; and, when the total detection result exceeds the preset wake-up threshold, identify the audio data to be detected as a wake-up word and start the wake-up function.
In some embodiments of the present invention, the specific process by which the audio data processing device performs speech recognition in S306 is: receive the audio data to be detected; perform speech recognition on the audio data to be detected using the main detection path, obtaining a main detection result; and, when the main detection result exceeds the preset wake-up threshold, identify the audio data to be detected as a wake-up word and start the wake-up function.
In the embodiments of the present invention, while the backup detection path is started, both the main detection path and the backup detection path perform speech detection, so a main detection result and a backup detection result are both available; the audio data processing device can then make the wake-up decision based on the combination of the main detection result and the backup detection result, i.e. the total detection result. When the backup detection path is stopped or closed, only the main detection path performs speech detection, so only the main detection result is available, and the audio data processing device makes the wake-up decision based on the main detection result. In this way, as the accuracy of speech recognition improves, the wake-up accuracy improves as well.
In the embodiments of the present invention, the wake-up-word detection results of the main detection path and the backup detection path are combined, and after joint processing the total detection result is output.
Illustratively, a simple joint processing of the detection results is: while the backup detection path is not running (t_{i−1} + K to t_i − K), use only the main detection result of the main detection path; while the main path and the backup path run simultaneously (t_i − K to t_i + K), use the higher of the detection results of the main detection path and the backup detection path. Denoting the main detection result by z(t), the backup detection result by b(t), and the total detection result after joint processing by s(t), this gives formula (8):

s(t) = z(t), t ∈ (t_{i−1} + K, t_i − K)
s(t) = max(z(t), b(t)), t ∈ (t_i − K, t_i + K)  (8)

It should be noted that the joint processing may also be an arithmetic mean, a geometric mean, a weighted combination, or the like; the embodiments of the present invention impose no restriction here.
In the embodiments of the present invention, once the audio data processing device has obtained the total detection result, it can compare it with the preset wake-up threshold and make the wake-up decision.
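A minimal sketch of the combination rule of formula (8), assuming the main and backup detection scores are available as functions of time; all names here are illustrative, not from the patent.

```python
# Sketch of formula (8): use the main score alone while the backup path is
# idle, and the larger of the two scores while both paths run.

def total_score(t, z, b, t_prev, t_i, K):
    """Combine main score z(t) and backup score b(t) per formula (8).

    t_prev is the previous reset point t_{i-1}; t_i is the next reset point.
    """
    if t_prev + K < t <= t_i - K:   # backup path not running: main result only
        return z(t)
    if t_i - K < t <= t_i + K:      # both paths running: take the larger score
        return max(z(t), b(t))
    raise ValueError("t outside the modelled interval")

z = lambda t: 0.4   # hypothetical main detection score
b = lambda t: 0.7   # hypothetical backup detection score
print(total_score(9.5, z, b, t_prev=0.0, t_i=10.0, K=2.0))  # 0.7 (both active)
```

Replacing `max` with a mean or a weighted sum gives the other joint-processing variants the text mentions.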
In some embodiments of the present invention, building on the speech-detection-model reset described in the previous embodiments, refer to Figure 12, which is an optional flow diagram of the audio data processing method provided by an embodiment of the present invention. Figure 12 shows that, after S104, the audio data processing device can perform speech recognition using the reset speech detection model; in a specific implementation, S105 to S107 may also be executed, as follows:
S105: in speech detection based on at least one direction branch, perform speech recognition on each of the at least one direction branch according to the reset speech detection model, obtaining at least one current detection result.
S106: jointly process the at least one current detection result to obtain a comprehensive detection result.
S107: when the comprehensive detection result exceeds the preset wake-up threshold, identify the wake-up word and start the wake-up function.
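Steps S105 to S107 can be sketched as follows, assuming each direction branch yields a numeric detection score; the function names and the use of the maximum as the joint processing are illustrative assumptions.

```python
# Sketch of S105-S107: per-branch detection scores are jointly processed
# (here: maximum) and compared with the preset wake-up threshold.

def multi_branch_wake(branch_scores, wake_threshold, combine=max):
    """Return True if the combined per-branch score exceeds the threshold."""
    combined = combine(branch_scores)   # S106: joint processing of branch results
    return combined > wake_threshold    # S107: wake-up decision

print(multi_branch_wake([0.2, 0.8, 0.5], wake_threshold=0.6))  # True
```

Passing a different `combine` callable (e.g. a mean) models the other joint-processing options.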
In the embodiments of the present invention, a speech detection architecture with multiple direction branches may be used; the previous embodiments all describe the speech detection model architecture on a single direction.
In some embodiments of the present invention, the speech detection architecture with multiple direction branches (at least one direction) can distribute the microphone-array signals to the different direction branches through a microphone array. After the audio data to be detected is input, it can be passed to the speech detection of the multiple direction branches; each direction branch performs speech detection and obtains one detection result, so the speech detection of the multiple direction branches yields at least one detection result (as shown in Figure 13).
In the embodiments of the present invention, each direction branch is provided with a single-channel speech detection model, and this single-channel speech detection model is exactly the speech detection model described in the above embodiments.
Therefore, in speech detection based on at least one direction branch, the audio data processing device performs speech recognition on each of the at least one direction branch using the reset speech detection model (the reset single-channel speech detection model), obtaining at least one current detection result; it jointly processes the at least one current detection result to obtain a comprehensive detection result, and makes the wake-up decision based on the comprehensive detection result and the preset wake-up threshold, i.e. when the comprehensive detection result exceeds the preset wake-up threshold, it identifies the wake-up word and starts the wake-up function.
The reset single-channel speech detection model in each direction branch is obtained in the same way as all of the reset processes of the speech detection model described in the preceding embodiments.
That is, at the reset time points of the preceding embodiments, resetting the speech detection model can be implemented simply and independently in each direction branch of Figure 13, i.e. each direction branch performs the reset operation on itself according to its own detection result; alternatively, the single-channel speech detection models in all direction branches can be reset uniformly according to the maximum of the detection results across the direction branches.
Illustratively, Figure 14 shows the process of wake-up-word detection and reset detection for the multiple direction branches of Figure 13 using a single detection path. Figure 15 shows, for one direction branch, the process of wake-up-word detection and reset detection for the multiple direction branches of Figure 13 using one main detection path (single-channel wake-up-word detection) and one backup detection path (backup single-channel wake-up-word detection). Figure 16 shows that, during wake-up-word detection for multiple direction branches, different direction branches may use a single detection path and at least two detection paths in combination.
It should be noted that each direction branch may use any of the reset decision schemes of Figures 14 to 16; the embodiments of the present invention do not limit which specific direction branch is reset. Detailed descriptions were given in the preceding embodiments and are not repeated here.
In some embodiments of the present invention, in the scenario of a main detection path and a backup detection path, the reset and backup operations are performed on all direction branches in turn: at an optional reset time point t_i, the (i % N)-th branch is reset, where N is the number of branches and "%" denotes the remainder operation. Alternatively, at any reset time point t_i, the branch with the lowest current detection result is selected and the reset and backup operations are performed on it at time t_{i+1}; the embodiments of the present invention impose no restriction here.
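Both branch-selection strategies described above can be sketched as follows; the function names are illustrative assumptions.

```python
# Round-robin strategy: at the i-th reset time point, branch i % N is reset,
# where N is the number of direction branches.

def branch_to_reset(i, num_branches):
    """Select which direction branch is reset at the i-th reset time point."""
    return i % num_branches

# Alternative strategy: reset the branch with the lowest current detection score.
def branch_to_reset_min(scores):
    """Return the index of the branch whose current detection score is lowest."""
    return min(range(len(scores)), key=lambda k: scores[k])

# With N = 4 branches, successive reset points cycle through branches 0..3:
print([branch_to_reset(i, 4) for i in range(6)])  # [0, 1, 2, 3, 0, 1]
print(branch_to_reset_min([0.9, 0.2, 0.6]))       # 1
```

The round-robin form guarantees every branch is reset within N reset intervals; the minimum-score form resets first the branch most likely to be degraded.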
In the following, an exemplary application of the embodiments of the present invention in a practical scenario, wake-up-word detection on a smart speaker, is described, taking the reset scheme with at least two detection paths as an example.
As shown in Figure 17, at moment 1 the user utters audio data 1 (the audio data to be detected) containing the wake-up word (e.g. "Xiao Si"). The audio data 1 is received by the smart speaker, which performs wake-up detection and reset detection on it. The smart speaker compares moment 1 with the preset warm-up time point and the preset reset time point and finds that moment 1 reaches the preset warm-up time point; it therefore resets and starts the backup detection path. In this case, the smart speaker performs wake-up recognition using the main detection path and the backup detection path, obtaining a main detection result and a backup detection result; it jointly processes the main detection result and the backup detection result to obtain a total detection result; and when the total detection result exceeds the preset wake-up threshold, it identifies the audio data to be detected as the wake-up word, starts the wake-up function, and outputs a voice prompt (e.g. "I am here") to the user. The user thus knows that the next voice instruction can be issued to control the smart speaker to perform some application function. In the embodiments of the present invention, the application function may be an application function of the smart speaker itself, or an application function of another terminal in the same local area network, controlled via a server.
Illustratively, as shown in Figure 18, after the smart speaker has been woken up, it receives audio data 2, "turn on the TV". After the reset detection and wake-up decision described above, the smart speaker starts the function of turning on the TV: it generates a TV power-on instruction and sends it to the server, which turns on the TV over the network according to the instruction, and the prompt "powering on" is displayed on the TV screen.
An embodiment of the present invention provides a computer-readable storage medium storing executable instructions. When the executable audio data processing instructions are executed by a processor, the processor is caused to execute the audio data processing method provided by the embodiments of the present invention.
In some embodiments, the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM; it may also be any device including one of, or any combination of, the above memories.
In some embodiments, the executable instructions may take the form of a program, software, a software module, a script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to a file in a file system; they may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g. files storing one or more modules, subprograms, or code sections).
As an example, the executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The above are merely embodiments of the present invention and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and scope of the present invention shall fall within the protection scope of the present invention.
Claims (15)
1. An audio data processing method, comprising:
obtaining a speech detection model, the speech detection model being a correspondence, with a historical-accumulation characteristic, between the audio data of at least one detection path and a speech recognition result;
determining a reference object based on the number of the at least one detection path detected, the reference object being the factor on which the reset-operation decision is based;
determining a reset time point based on the reference object, the reset time point being the moment at which the historical accumulation in the speech detection model is initialized while speech recognition performance is guaranteed;
and resetting the speech detection model when the reset time point is reached.
2. The method according to claim 1, wherein determining the reference object based on the number of the at least one detection path detected comprises:
when the number of detection paths detected is one, determining that the reference object is the current detection result;
and correspondingly, determining the reset time point based on the reference object comprises:
obtaining audio data to be detected;
recognizing the audio data to be detected using the speech detection model to obtain the current detection result;
and when the current detection result meets a preset reset threshold, determining the current time point as the reset time point;
wherein the preset reset threshold is greater than or equal to a preset wake-up threshold.
3. The method according to claim 1, wherein determining the reference object based on the number of the at least one detection path detected comprises:
when the number of detection paths detected is greater than one, determining that the reference object is the current time point;
and correspondingly, the at least one detection path comprises a backup detection path, and determining the reset time point based on the reference object comprises:
obtaining the current time point;
and when the current time point reaches a preset warm-up time point, determining the current time point as the reset time point of the backup detection path, wherein the preset warm-up time point is the time point one preset warm-up period before a preset reset time point.
4. The method according to claim 2, wherein, after recognizing the audio data to be detected using the speech detection model to obtain the current detection result, the method further comprises:
obtaining a history detection result from before the current time point;
and when the variation between the current detection result and the history detection result falls within a preset false-wake-up range, determining the current time point as the reset time point.
5. The method according to claim 3, wherein resetting the speech detection model when the reset time point is reached comprises:
when the current time point reaches the preset warm-up time point, resetting and starting the backup detection path.
6. The method according to claim 5, wherein the at least one detection path further comprises a main detection path, and after the resetting and starting of the backup detection path the method further comprises:
performing speech recognition using the main detection path and the backup detection path;
after the preset warm-up period has elapsed, when the preset reset time point is reached, resetting the main detection path;
and when the preset warm-up period has elapsed from the preset reset time point, closing the backup detection path and performing speech recognition using the main detection path.
7. The method according to any one of claims 3, 5, and 6, wherein:
the preset reset time points form a time series spaced by a preset time interval;
the preset time interval lies between twice the preset warm-up period and a preset tolerance wake-up threshold;
the preset tolerance wake-up threshold lies between a preset optimal wake-up upper limit and a preset optimal false-wake-up lower limit;
and the preset warm-up period is greater than or equal to a preset wake-up-word duration.
8. The method according to claim 6, wherein performing speech recognition using the main detection path and the backup detection path comprises:
receiving audio data to be detected;
performing speech recognition on the audio data to be detected using the main detection path and the backup detection path respectively, obtaining a main detection result and a backup detection result;
jointly processing the main detection result and the backup detection result to obtain a total detection result;
and when the total detection result exceeds a preset wake-up threshold, identifying the audio data to be detected as a wake-up word and starting the wake-up function.
9. The method according to claim 6, wherein performing speech recognition using the main detection path comprises:
receiving audio data to be detected;
performing speech recognition on the audio data to be detected using the main detection path, obtaining a main detection result;
and when the main detection result exceeds a preset wake-up threshold, identifying the audio data to be detected as a wake-up word and starting the wake-up function.
10. The method according to claim 1, wherein, after resetting the speech detection model when the reset time point is reached, the method further comprises:
performing speech recognition using the reset speech detection model.
11. The method according to claim 10, wherein performing speech recognition using the reset speech detection model comprises:
in speech detection based on at least one direction branch, performing speech recognition on each of the at least one direction branch according to the reset speech detection model, obtaining at least one current detection result;
jointly processing the at least one current detection result to obtain a comprehensive detection result;
and when the comprehensive detection result exceeds a preset wake-up threshold, identifying the wake-up word and starting the wake-up function.
12. The method according to claim 1, wherein resetting the speech detection model when the reset time point is reached comprises:
when the reset time point is reached, initializing the data with the historical-accumulation characteristic in the speech detection model to obtain the reset speech detection model.
13. An audio data processing device, comprising:
an obtaining unit, configured to obtain a speech detection model, the speech detection model being a correspondence, with a historical-accumulation characteristic, between the audio data of at least one detection path and a speech recognition result;
a determining unit, configured to determine a reference object based on the number of the at least one detection path detected, the reference object being the factor on which the reset-operation decision is based, and to determine a reset time point based on the reference object, the reset time point being the moment at which the historical accumulation in the speech detection model is initialized while speech recognition performance is guaranteed;
and a reset unit, configured to reset the speech detection model when the reset time point is reached.
14. An audio data processing device, comprising:
a memory, configured to store executable audio data processing instructions;
and a processor, configured to implement the method according to any one of claims 1 to 12 when executing the executable audio data processing instructions stored in the memory.
15. A computer-readable storage medium, storing executable audio data processing instructions which, when executed by a processor, implement the method according to any one of claims 1 to 12.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811361659.4A CN110164431B (en) | 2018-11-15 | 2018-11-15 | Audio data processing method and device and storage medium |
CN201910810103.7A CN110364162B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence resetting method and device and storage medium |
CN201910809813.8A CN110415698B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence data detection method and device and storage medium |
CN201910809694.6A CN110517680B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence data detection method and device and storage medium |
CN201910809323.8A CN110517679B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence audio data processing method and device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811361659.4A CN110164431B (en) | 2018-11-15 | 2018-11-15 | Audio data processing method and device and storage medium |
Related Child Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910809813.8A Division CN110415698B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence data detection method and device and storage medium |
CN201910810103.7A Division CN110364162B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence resetting method and device and storage medium |
CN201910809323.8A Division CN110517679B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence audio data processing method and device and storage medium |
CN201910809694.6A Division CN110517680B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence data detection method and device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110164431A true CN110164431A (en) | 2019-08-23 |
CN110164431B CN110164431B (en) | 2023-01-06 |
Family
ID=67645151
Family Applications (5)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910809694.6A Active CN110517680B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence data detection method and device and storage medium |
CN201910810103.7A Active CN110364162B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence resetting method and device and storage medium |
CN201910809323.8A Active CN110517679B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence audio data processing method and device and storage medium |
CN201811361659.4A Active CN110164431B (en) | 2018-11-15 | 2018-11-15 | Audio data processing method and device and storage medium |
CN201910809813.8A Active CN110415698B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence data detection method and device and storage medium |
Family Applications Before (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910809694.6A Active CN110517680B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence data detection method and device and storage medium |
CN201910810103.7A Active CN110364162B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence resetting method and device and storage medium |
CN201910809323.8A Active CN110517679B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence audio data processing method and device and storage medium |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910809813.8A Active CN110415698B (en) | 2018-11-15 | 2018-11-15 | Artificial intelligence data detection method and device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (5) | CN110517680B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111341297B (en) * | 2020-03-04 | 2023-04-07 | 开放智能机器(上海)有限公司 | Voice wake-up rate test system and method |
CN114039398B (en) * | 2022-01-07 | 2022-05-17 | 深圳比特微电子科技有限公司 | Control method and device of new energy camera equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1264888A (en) * | 1998-12-17 | 2000-08-30 | 索尼国际(欧洲)股份有限公司 | Semi-monitoring speaker self-adaption |
US20160180838A1 (en) * | 2014-12-22 | 2016-06-23 | Google Inc. | User specified keyword spotting using long short term memory neural network feature extractor |
CN107358948A (en) * | 2017-06-27 | 2017-11-17 | 上海交通大学 | Language in-put relevance detection method based on attention model |
CN107644642A (en) * | 2017-09-20 | 2018-01-30 | 广东欧珀移动通信有限公司 | Method for recognizing semantics, device, storage medium and electronic equipment |
CN107680597A (en) * | 2017-10-23 | 2018-02-09 | 平安科技(深圳)有限公司 | Audio recognition method, device, equipment and computer-readable recording medium |
CN107978311A (en) * | 2017-11-24 | 2018-05-01 | 腾讯科技(深圳)有限公司 | A kind of voice data processing method, device and interactive voice equipment |
CN108597519A (en) * | 2018-04-04 | 2018-09-28 | 百度在线网络技术(北京)有限公司 | A kind of bill classification method, apparatus, server and storage medium |
Family Cites Families (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3873418B2 (en) * | 1997-12-26 | 2007-01-24 | 三菱電機株式会社 | Voice spotting device |
FR2808917B1 (en) * | 2000-05-09 | 2003-12-12 | Thomson Csf | METHOD AND DEVICE FOR VOICE RECOGNITION IN FLUATING NOISE LEVEL ENVIRONMENTS |
JP2002366187A (en) * | 2001-06-08 | 2002-12-20 | Sony Corp | Device and method for recognizing voice, program and recording medium |
JP4316494B2 (en) * | 2002-05-10 | 2009-08-19 | 旭化成株式会社 | Voice recognition device |
US20040064314A1 (en) * | 2002-09-27 | 2004-04-01 | Aubert Nicolas De Saint | Methods and apparatus for speech end-point detection |
CN1996870A (en) * | 2006-01-04 | 2007-07-11 | 中兴通讯股份有限公司 | A method and device for automatic switching of the communication channels in the mutual-backup board |
CN101034390A (en) * | 2006-03-10 | 2007-09-12 | 日电(中国)有限公司 | Apparatus and method for verbal model switching and self-adapting |
CN101334998A (en) * | 2008-08-07 | 2008-12-31 | 上海交通大学 | Chinese speech recognition system based on heterogeneous model differentiated fusion |
US8700399B2 (en) * | 2009-07-06 | 2014-04-15 | Sensory, Inc. | Systems and methods for hands-free voice control and voice search |
JP2011180308A (en) * | 2010-02-26 | 2011-09-15 | Masatomo Okumura | Voice recognition device and recording medium |
US8428759B2 (en) * | 2010-03-26 | 2013-04-23 | Google Inc. | Predictive pre-recording of audio for voice input |
US8473287B2 (en) * | 2010-04-19 | 2013-06-25 | Audience, Inc. | Method for jointly optimizing noise reduction and voice quality in a mono or multi-microphone system |
US9110452B2 (en) * | 2011-09-19 | 2015-08-18 | Fisher-Rosemount Systems, Inc. | Inferential process modeling, quality prediction and fault detection using multi-stage data segregation |
CN102543071B (en) * | 2011-12-16 | 2013-12-11 | 安徽科大讯飞信息科技股份有限公司 | Speech recognition system and method for mobile devices |
US20140365221A1 (en) * | 2012-07-31 | 2014-12-11 | Novospeech Ltd. | Method and apparatus for speech recognition |
KR20140077422A (en) * | 2012-12-14 | 2014-06-24 | 한국전자통신연구원 | Voice recognition performance improvement method |
CN104167206B (en) * | 2013-05-17 | 2017-05-31 | 佳能株式会社 | Acoustic model merging method and equipment and audio recognition method and system |
US9299340B2 (en) * | 2013-10-07 | 2016-03-29 | Honeywell International Inc. | System and method for correcting accent induced speech in an aircraft cockpit utilizing a dynamic speech database |
KR101528518B1 (en) * | 2013-11-08 | 2015-06-12 | 현대자동차주식회사 | Vehicle and control method thereof |
US20150255068A1 (en) * | 2014-03-10 | 2015-09-10 | Microsoft Corporation | Speaker recognition including proactive voice model retrieval and sharing features |
KR101598948B1 (en) * | 2014-07-28 | 2016-03-02 | 현대자동차주식회사 | Speech recognition apparatus, vehicle having the same and speech recognition method |
CN104538028B (en) * | 2014-12-25 | 2017-10-17 | 清华大学 | Continuous speech recognition method based on deep long short-term memory recurrent neural networks |
US11080587B2 (en) * | 2015-02-06 | 2021-08-03 | Deepmind Technologies Limited | Recurrent neural networks for data item generation |
CN105096941B (en) * | 2015-09-02 | 2017-10-31 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device |
CN105355198B (en) * | 2015-10-20 | 2019-03-12 | 河海大学 | Speech recognition method based on multiple adaptive model compensation |
CN116229981A (en) * | 2015-11-12 | 2023-06-06 | 谷歌有限责任公司 | Generating a target sequence from an input sequence using partial conditions |
CN106856092B (en) * | 2015-12-09 | 2019-11-15 | 中国科学院声学研究所 | Chinese speech keyword retrieval method based on feedforward neural network language model |
CN105489222B (en) * | 2015-12-11 | 2018-03-09 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device |
CN107800860A (en) * | 2016-09-07 | 2018-03-13 | 中兴通讯股份有限公司 | Method of speech processing, device and terminal device |
CN108461080A (en) * | 2017-02-21 | 2018-08-28 | 中兴通讯股份有限公司 | Acoustic modeling method and apparatus based on HLSTM models |
CN107220845B (en) * | 2017-05-09 | 2021-06-29 | 北京星选科技有限公司 | User re-purchase probability prediction/user quality determination method and device and electronic equipment |
CN107544726B (en) * | 2017-07-04 | 2021-04-16 | 百度在线网络技术(北京)有限公司 | Speech recognition result error correction method and device based on artificial intelligence and storage medium |
CN108076224B (en) * | 2017-12-21 | 2021-06-29 | Oppo广东移动通信有限公司 | Application program control method and device, storage medium and mobile terminal |
US11043218B1 (en) * | 2019-06-26 | 2021-06-22 | Amazon Technologies, Inc. | Wakeword and acoustic event detection |
CN110415685A (en) * | 2019-08-20 | 2019-11-05 | 河海大学 | Speech recognition method |
- 2018
- 2018-11-15 CN CN201910809694.6A patent/CN110517680B/en active Active
- 2018-11-15 CN CN201910810103.7A patent/CN110364162B/en active Active
- 2018-11-15 CN CN201910809323.8A patent/CN110517679B/en active Active
- 2018-11-15 CN CN201811361659.4A patent/CN110164431B/en active Active
- 2018-11-15 CN CN201910809813.8A patent/CN110415698B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1264888A (en) * | 1998-12-17 | 2000-08-30 | 索尼国际(欧洲)股份有限公司 | Semi-supervised speaker adaptation |
US20160180838A1 (en) * | 2014-12-22 | 2016-06-23 | Google Inc. | User specified keyword spotting using long short term memory neural network feature extractor |
CN107358948A (en) * | 2017-06-27 | 2017-11-17 | 上海交通大学 | Language input relevance detection method based on attention model |
CN107644642A (en) * | 2017-09-20 | 2018-01-30 | 广东欧珀移动通信有限公司 | Semantic recognition method, apparatus, storage medium and electronic device |
CN107680597A (en) * | 2017-10-23 | 2018-02-09 | 平安科技(深圳)有限公司 | Speech recognition method, apparatus, device and computer-readable storage medium |
CN107978311A (en) * | 2017-11-24 | 2018-05-01 | 腾讯科技(深圳)有限公司 | Voice data processing method and apparatus, and voice interaction device |
CN108597519A (en) * | 2018-04-04 | 2018-09-28 | 百度在线网络技术(北京)有限公司 | Bill classification method, apparatus, server and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110415698B (en) | 2022-05-13 |
CN110364162B (en) | 2022-03-15 |
CN110517679A (en) | 2019-11-29 |
CN110164431B (en) | 2023-01-06 |
CN110364162A (en) | 2019-10-22 |
CN110517680B (en) | 2023-02-03 |
CN110517679B (en) | 2022-03-08 |
CN110517680A (en) | 2019-11-29 |
CN110415698A (en) | 2019-11-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107481718B (en) | Audio recognition method, device, storage medium and electronic equipment | |
CN107644642B (en) | Semantic recognition method and device, storage medium and electronic equipment | |
CN103065631B (en) | Speech recognition method and device | |
CN103971680B (en) | Speech recognition method and apparatus | |
CN109979438A (en) | Voice awakening method and electronic equipment | |
CN109192210B (en) | Voice recognition method, wake-up word detection method and device | |
CN110148405B (en) | Voice instruction processing method and device, electronic equipment and storage medium | |
CN104737101A (en) | Computing device with force-triggered non-visual responses | |
US11249645B2 (en) | Application management method, storage medium, and electronic apparatus | |
CN110544468B (en) | Application awakening method and device, storage medium and electronic equipment | |
US10911910B2 (en) | Electronic device and method of executing function of electronic device | |
CN108766438A (en) | Man-machine interaction method, device, storage medium and intelligent terminal | |
KR20190009488A (en) | An electronic device and system for deciding a duration of receiving voice input based on context information | |
CN109272991A (en) | Method, apparatus, equipment and the computer readable storage medium of interactive voice | |
US10950221B2 (en) | Keyword confirmation method and apparatus | |
CN112669822B (en) | Audio processing method and device, electronic equipment and storage medium | |
CN111722696B (en) | Voice data processing method and device for low-power-consumption equipment | |
CN110580897B (en) | Audio verification method and device, storage medium and electronic equipment | |
CN110164431A (en) | Audio data processing method and apparatus, and storage medium | |
CN112185382A (en) | Method, device, equipment and medium for generating and updating wake-up model | |
CN112269322A (en) | Awakening method and device of intelligent device, electronic device and medium | |
CN112037772A (en) | Multi-mode-based response obligation detection method, system and device | |
WO2020102991A1 (en) | Method and apparatus for waking up device, storage medium and electronic device | |
TWI748587B (en) | Acoustic event detection system and method | |
EP4276826A1 (en) | Electronic device providing operation state information of home appliance, and operation method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||