CN109994106B - Voice processing method and equipment - Google Patents

Info

Publication number: CN109994106B (application CN201711481463.4A)
Authority: CN (China)
Prior art keywords: voice, data, wake, word, tail
Priority / filing / prior art date: 2017-12-29
Legal status: Active (an assumption, not a legal conclusion)
Application number: CN201711481463.4A
Other languages: Chinese (zh)
Other versions: CN109994106A (en)
Inventors: 万玉龙, 高杰
Current and original assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd; priority to CN201711481463.4A
Publication of CN109994106A: 2019-07-09; application granted and publication of CN109994106B: 2023-06-23


Classifications

    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue (G: Physics › G10: Musical instruments; acoustics › G10L: Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding › G10L 15/00: Speech recognition)
    • G10L 2015/223 — Execution procedure of a spoken command
    • G10L 2015/225 — Feedback of the input speech
    • Y02D 30/70 — Reducing energy consumption in wireless communication networks (Y: General tagging of new technological developments › Y02: Technologies for mitigation or adaptation against climate change › Y02D: Climate change mitigation technologies in information and communication technologies › Y02D 30/00: Reducing energy consumption in communication networks)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the present application disclose a voice processing method and device. The method comprises the following steps: performing voice activity detection on audio data to obtain voice data; identifying a wake-up word from the voice data; determining the end time of the tail of the wake-up word in the voice data; and performing a second voice activity detection on the portion of the audio data that starts from the end time of the word tail, and recognizing a voice instruction. By performing voice recognition in the manner provided by the present application, the start time of the voice instruction in the voice data can be accurately determined, redundant information from the wake-up word is prevented from being carried into the subsequent voice instruction, and the accuracy of intelligent voice recognition is improved.

Description

Voice processing method and equipment
Technical Field
The present disclosure relates to the field of intelligent speech processing technologies, and in particular, to a speech processing method and apparatus.
Background
In recent years, intelligent voice interaction (Intelligent Speech Interaction) technology has developed rapidly. Built on technologies such as speech recognition, speech synthesis, and natural language understanding, it gives products an intelligent human-computer interaction experience of "listening, speaking, and understanding you" in a variety of real application scenarios. Intelligent voice interaction technology is suitable for scenarios such as intelligent question answering, intelligent quality inspection, real-time court trial recording, real-time speech captions, and interview recording and transcription, and has application cases in many fields such as finance, insurance, the judiciary, and electronic commerce.
In the process of intelligent voice interaction, a voice instruction can be sent to an intelligent device only after the device has been woken up by voice. In the voice wake-up process, a user speaks a preset wake-up word to wake up the device. For example, in one case, the wake-up word of a smart car is "hello, dong-dong"; only after the voice recognition system of the smart car recognizes that the user has spoken the correct wake-up word can it receive voice instructions issued by the user, such as playing music, navigating, or making a call. Depending on actual usage, a user will often speak the voice instruction immediately after the wake-up word; in the above example, the user may directly say "hello, dong-dong, how is the weather today". To meet this need, the prior art proposes an integrated recognition method combining wake-up word recognition with speech semantic understanding: after the wake-up word is recognized, the voice instruction that immediately follows it is recognized as well. This allows zero interval and zero delay between wake-up and voice control, reduces the user's voice control steps, and improves efficiency. However, in the prior art, the boundary between the wake-up word and the voice instruction is often not identified accurately enough, so the recognized voice instruction deviates from the actual voice instruction.
Therefore, there is a need in the art for a way to accurately identify the interval between the wake-up word and the voice instruction.
Disclosure of Invention
An object of the embodiments of the present application is to provide a voice processing method and device that can avoid carrying redundant information from the wake-up word into the subsequent voice instruction and thus improve the accuracy of intelligent voice recognition.
The voice processing method and device provided by the embodiments of the present application are specifically implemented as follows:
A voice processing method, the method comprising:
performing voice activity detection on audio data to obtain voice data;
identifying a wake-up word from the voice data;
determining the end time of the tail of the wake-up word in the voice data;
and performing a second voice activity detection on the portion of the audio data that starts from the end time of the word tail, and recognizing a voice instruction.
A voice processing device, comprising a microphone, a voice activity detection module, a wake-up word detection module, and a voice recognition module, wherein:
the microphone is used to acquire audio data;
the voice activity detection module is used to perform voice activity detection on the audio data to obtain voice data;
the wake-up word detection module is used to identify a wake-up word from the voice data and to determine the end time of the tail of the wake-up word in the voice data;
the voice activity detection module is further used to perform a second voice activity detection on the portion of the audio data that starts from the end time of the word tail;
the voice recognition module is used to recognize a voice instruction in the audio data obtained by the second voice activity detection.
A speech processing apparatus, comprising a processor and a memory for storing processor-executable instructions, wherein the processor implements the steps of the speech processing method when executing the instructions.
A computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the speech processing method.
An in-vehicle system, comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the speech processing method.
A conference system, comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the speech processing method.
The voice processing method and device provided by the present application can identify the end time of the tail of the wake-up word in the voice data, and recognize the voice instruction in the voice data starting from that end time. By performing voice recognition in this manner, the start time of the voice instruction in the voice data can be accurately determined, redundant information from the wake-up word is prevented from being carried into the subsequent voice instruction, and the accuracy of intelligent voice recognition is improved. In addition, after the end time of the tail of the wake-up word is determined, a second voice activity detection is performed on the portion of the audio data that starts from that end time; if non-speech noise exists between the wake-up word and the voice instruction, its interference with voice instruction recognition can be eliminated, further improving the accuracy of voice instruction recognition.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description are obviously only some of the embodiments described in the present application; a person skilled in the art may derive other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of an application scenario of the speech processing method provided in the present application;
Fig. 2 is a schematic diagram of an application scenario of the speech processing method provided in the present application;
Fig. 3 is a schematic diagram of an application scenario of the speech processing method provided in the present application;
Fig. 4 is a method flow diagram of an embodiment of the speech processing method provided in the present application;
Fig. 5 is a schematic block diagram of an embodiment of the speech processing device provided in the present application.
Detailed Description
In order to better understand the technical solutions in the present application, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
In order to facilitate understanding of the technical solutions provided by the embodiments of the present application by those skilled in the art, a technical environment in which the technical solutions are implemented is described below.
As noted above, the prior art proposes an integrated recognition method combining wake-up word recognition with speech semantic understanding: after the wake-up word is recognized, the voice instruction immediately following it is recognized as well. This allows zero interval and zero delay between wake-up and voice control, reduces the user's voice control steps, and improves efficiency. In one example, the wake-up word of the intelligent voice interaction device in a certain smart car is "hello, zebra"; when the user wakes the device with the wake-up word and issues a voice instruction, the user may say things like "hello, zebra, how is the weather today", "hello, zebra, please navigate to the gym", or "hello, zebra, call mom". In the prior art, when recognizing the wake-up word and the voice instruction in such speech, the end time of the wake-up word cannot be determined accurately, and redundant voice information is often attached to the voice instruction, so the voice instruction is recognized inaccurately or even cannot be recognized at all. The reasons are various. Pronunciation habits differ between users: for users who speak quickly, the intelligent voice interaction device can identify the end time of the wake-up word accurately. But some users like to drag out sounds; for example, when saying "zebra" ("ma"), they habitually stretch the tail word into a long "ma—", in which case the device may place the end time of the wake-up word earlier than it actually is and fold the dragged-out sound into the subsequent voice instruction. For instance, after the user says "hello, zebra, how is the weather today", the voice instruction recognized by the device may include the dragged sound after "zebra" carried in front of "how is the weather today". In other cases, some users habitually pause for an interval after the wake-up word before speaking the voice instruction, and the device may then fold ambient noise into the voice instruction.
Based on technical needs similar to those described above, the voice processing method provided by the present application can accurately identify the end time of the wake-up word, avoid carrying redundant information into the subsequent voice instruction, and improve the accuracy of intelligent voice recognition.
The following describes a specific implementation of the method of the present embodiment through several specific application scenarios.
Scene one
As shown in the scenario of fig. 1, a user's car has an intelligent voice recognition function, and the preset wake-up word is "hello, zebra". As long as the user speaks the wake-up word and a voice instruction, the intelligent voice recognition client on the vehicle can control the relevant equipment according to the voice instruction to meet the user's needs. For example, given the voice data "hello, zebra, how is the weather today", the intelligent voice recognition client on the vehicle can retrieve weather information from a search engine and broadcast it to the user. Based on this voice data, when the intelligent voice recognition client detects the wake-up word "hello, zebra" in the voice data, the client is woken up and recognition of the end time of the wake-up word's tail is triggered. For the wake-up word "hello, zebra", the last syllable is the final "a" of "ma"; by detecting the final "a", the end of the tail of the wake-up word "hello, zebra" can be determined. The voice instruction following the wake-up word can then be recognized, starting from the end time of the word tail.
Scene two
The technical solution provided by the present application can also be applied to devices such as intelligent ticket machines and intelligent vending machines in public places. For example, as shown in the scenario of fig. 2, when a user buys a ticket from an intelligent ticket machine, the user needs to speak a wake-up word to wake the machine; in one example, the user says something like "hello, love, to People's Square" to wake up the intelligent ticket machine. After the intelligent ticket machine acquires the voice data, it can identify its wake-up word "hello, love". In a real voice scenario, each user's pronunciation habits are affected by culture, region, and so on, so the end time of the wake-up word's tail needs to be determined to prevent redundant information from the wake-up word from being carried into the voice instruction and affecting recognition accuracy. Likewise, by detecting the syllable "ai" of the tail word "love", the end of the wake-up word "hello, love" can be determined. The voice instruction following the wake-up word can then be recognized, starting from the end time of the word tail.
Of course, the technical solution provided in the present application may also be applied to other scenarios, such as the conference scenario shown in fig. 3, in which the user says something like "hello, little honey, start the meeting recording" to wake the intelligent conference assistant and start recording the meeting. In addition, the technical solution provided by the present application can be applied to other intelligent clients, which may be electronic devices with a recording function: for example, a desktop computer, a tablet computer, a notebook computer, a smartphone, a digital assistant, a smart wearable device, a shopping guide terminal, a television, a smart speaker, or a microphone. Smart wearable devices include, but are not limited to, smart bracelets, smart watches, smart glasses, smart helmets, and smart necklaces. Alternatively, the client may be software that runs in an electronic device: for example, the electronic device provides a recording function, and the software records an audio file by calling that recording function.
The following describes the speech processing method of the present application in detail with reference to the accompanying drawings. Fig. 4 is a method flow diagram of an embodiment of the speech processing method provided in the present application. Although the present application provides the method operation steps shown in the following embodiments or drawings, more or fewer steps may be included in the method based on conventional or non-inventive work. For steps with no logically necessary causal relationship, the execution order is not limited to that provided in the embodiments of the present application. When the method is executed in an actual speech processing process or by a device, the steps may be executed sequentially or in parallel (for example, in a parallel-processor or multithreaded environment) according to the methods shown in the embodiments or drawings.
S401: Perform voice activity detection on the audio data to obtain voice data.
S403: Identify a wake-up word from the voice data.
S405: Determine the end time of the tail of the wake-up word in the voice data.
S407: Perform a second voice activity detection on the portion of the audio data that starts from the end time of the word tail, and recognize a voice instruction.
In this embodiment, audio data may first be acquired. The audio data may include not only sounds uttered by a person but also other non-voice sounds such as music, the sound of a moving car, game sounds, and animal sounds. As described above, the acquired audio data may be processed by an intelligent voice client: the client performs voice activity detection on the audio data and, when valid voice is identified in the audio data, obtains the voice data.
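By way of illustration only, the following minimal Python sketch shows one possible energy-based implementation of this first voice activity detection step; the sampling rate, frame length, threshold, and function name are assumptions chosen for the example, since the present application does not prescribe a particular VAD algorithm.

```python
# Illustrative energy-based voice activity detection (a sketch, not the
# patent's prescribed algorithm).
import numpy as np

def vad_segments(audio, sr=16000, frame_ms=20, threshold=0.02):
    """Return (start, end) sample indices of contiguous speech segments,
    where a frame counts as speech if its RMS energy exceeds the threshold."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame
    segments, start = [], None
    for i in range(n_frames):
        rms = np.sqrt(np.mean(audio[i * frame:(i + 1) * frame] ** 2))
        if rms > threshold and start is None:
            start = i * frame                    # speech onset
        elif rms <= threshold and start is not None:
            segments.append((start, i * frame))  # speech offset
            start = None
    if start is not None:
        segments.append((start, n_frames * frame))
    return segments
```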
In the intelligent voice client, voice recognition is started only when a wake-up word is recognized in the voice data. In this embodiment, wake-up word detection may be performed on the voice data. Specifically, the voice data may be matched against the wake-up word, and when data identical to the wake-up word is obtained from the voice data, it may be determined that the voice data includes the wake-up word. It should be noted that the wake-up word may be preset, and when the voice data is recognized as containing the wake-up word, the intelligent voice client is triggered to start. In the examples above, "hello, dong-dong" and "hello, zebra" are wake-up words relative to particular intelligent voice clients. The voice data further includes the voice instruction issued immediately after the wake-up word, which instructs the intelligent voice client to provide a corresponding service, such as querying the weather, checking a stock index, shopping online, or playing music. In the above examples, "how is the weather today", "please navigate to the gym", and "call mom" are voice instructions immediately following the wake-up word "hello, zebra". After recognizing the wake-up word and the voice instruction, the intelligent voice client can execute the voice instruction to meet the user's needs.
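As a minimal sketch of the matching just described — assuming, for illustration, that an upstream recognizer has already decoded the voice data into text — the wake-up word check can be as simple as the following; the wake word string and function name are hypothetical:

```python
# Hypothetical preset wake word; a real deployment would configure its own.
WAKE_WORD = "hello zebra"

def contains_wake_word(decoded_text: str) -> bool:
    """Match the decoded voice data against the preset wake word; an exact
    hit means the voice data includes the wake-up word."""
    return WAKE_WORD in decoded_text.lower()

# Example: contains_wake_word("Hello zebra how is the weather") -> True
```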
In this embodiment, after receiving the voice data, the intelligent voice client may identify the end time of the wake-up word's tail. Specifically, the client may identify the tail word of the wake-up word, i.e., its last word, such as "dong" in "hello, dong-dong" or "ma" ("horse") in "hello, zebra" above. The last syllable of the tail word marks the tail of the wake-up word; for example, the last syllables of the tail words of these two wake-up words are "ong" and "a" respectively. In Chinese, the pronunciation of a character usually consists of an initial followed by a final, so the last syllable of the tail word is generally its final, such as a, o, e, i, u, ai, ei, ui, ao, an, en, in, un, or ang. Of course, other languages have similar ways of determining the last syllable of the tail word; in English, for example, the last syllable of the tail word "Jack" of the wake-up word "Hi, Jack" is "k". In other embodiments, for example in Chinese, the last syllable of the tail word may not be the final in the character's pinyin but may instead depend on the user's pronunciation habits, in which case the last syllable of the tail word can be determined after the user's pronunciation habits are determined. In one embodiment, the user's pronunciation habits may be learned from the user's historical voice data, or the pronunciation habits of users in different regions and different groups may be obtained statistically; the present application is not limited in this respect.
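To make the tail-word logic concrete, here is a toy lookup under the assumption that the pronunciation lexicon is a small hand-written table; a real system would consult a full lexicon or a per-user habit model, as noted above.

```python
# Toy final-of-tail-character table covering the examples in the text;
# an assumption for this sketch, not a real lexicon.
TAIL_FINALS = {"马": "a", "咚": "ong"}  # "ma" of zebra (斑马), "dong" of dong-dong

def last_syllable_of_tail(wake_word: str) -> str:
    """The tail word is the wake word's last character; return its final,
    which serves as the last syllable used for end-time detection."""
    return TAIL_FINALS[wake_word[-1]]

# Example: last_syllable_of_tail("你好斑马") -> "a"
```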
In this embodiment, the end time of the last syllable of the tail word may be taken as the end time of the tail of the wake-up word. Before the end time of the last syllable is determined, wake-up word detection must first be performed on the voice data; once the wake-up word is detected, identification of the end time of the tail word is triggered. For example, when the wake-up word "hello, zebra" passes detection, the pronunciation of the last syllable of the tail word "ma" may not yet have finished. That is, the moment the wake-up word passes detection triggers the determination of the end time of the last syllable of the tail word.
When determining the end time of the last syllable, the starting pronunciation time of the last syllable may be acquired first. For example, the last syllable of the tail word "ma" in the wake-up word "hello, zebra" is "a". When the wake-up word passes detection, the moment at which syllable "a" begins to sound is taken as the time origin, and the phonemes of the voice data are sampled at a preset frequency within a preset time period; when the phoneme changes, the moment of the change is taken as the end time of the last syllable. A phoneme is the smallest unit of speech. During collection, phonemes may be sampled within a preset time period, for example 2 seconds, at a preset frequency, for example every 20 milliseconds. After several phonemes are collected, they can be compared; if the phoneme is found to change, the moment of the change is taken as the end of the tail word's pronunciation. For example, in one case, after the last syllable "a" of the wake-up word "hello, zebra" is collected, a phoneme sequence "a, a, a, a, a, [blank], j, in, t, ian" is generated. At some moment in this sequence the phoneme "a" becomes blank: here the user has finished pronouncing "ma" and, after a short pause, issues a voice instruction. The moment of the last occurrence of the phoneme "a" in the sequence may then be taken as the end time of the last syllable of the tail word.
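A minimal sketch of this end-time determination follows, assuming frame-level phoneme labels are available from an acoustic model; the 20 ms hop and 2 s window are the example values from the paragraph above, and the function name is hypothetical.

```python
def tail_end_time(phonemes, start_ms, tail_final="a", hop_ms=20, window_ms=2000):
    """phonemes: frame-level labels ('' for silence/blank) sampled every
    hop_ms, beginning at start_ms (onset of the last syllable). Returns the
    end time of the wake word's tail in milliseconds."""
    last_match_ms = start_ms
    for i, ph in enumerate(phonemes[: window_ms // hop_ms]):
        if ph == tail_final:
            last_match_ms = start_ms + i * hop_ms + hop_ms  # final still sounding
        else:
            return last_match_ms  # phoneme changed: the tail has ended
    return last_match_ms

# Toy example matching the "a, a, a, a, a, [blank], j, in, t, ian" sequence:
frames = ["a", "a", "a", "a", "a", "", "j", "in", "t", "ian"]
print(tail_end_time(frames, start_ms=1000))  # -> 1100 (end of the dragged "a")
```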
In this embodiment, after the end time of the wake-up word's tail in the voice data is determined, a second voice activity detection may be performed on the audio data following that end time, and the voice instruction may be recognized. Specifically, in one embodiment of the present application, during voice instruction recognition the second voice activity detection takes the end time of the wake-up word's tail as the time origin. In one embodiment, the data starting from the end time of the word tail can be cut out of the audio data and voice activity detection performed on it; when valid speech is detected in that data, speech recognition can be performed on the voice instruction it contains. This avoids starting voice instruction recognition the instant the wake-up word's pronunciation ends, which could introduce redundant voice data. The reason is that, in reality, many users pause after the wake-up word before issuing the voice instruction, and various non-voice sounds may fill the interval, such as the sound of the moving car, sounds of the external natural environment, or music. If this interval were recognized as part of the voice instruction, redundant information would be processed during recognition, reducing processing efficiency and possibly making the recognition inaccurate. Therefore, in this embodiment, voice activity detection is performed on the data starting from the end of the wake-up word's tail, and recognition of the voice instruction begins only when valid speech is detected. Unnecessary non-voice sounds are thus filtered out by the valid-speech detection, improving the efficiency and accuracy of voice instruction recognition.
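Continuing the sketch, the second detection pass can reuse the illustrative vad_segments() above: cut the audio at the tail end time, run VAD on the remainder so that non-speech noise between the wake-up word and the instruction is skipped, and hand only the detected speech to a recognizer. The recognizer is passed in as a callable here, since the present application does not tie the method to any particular ASR engine.

```python
def recognize_command(audio, sr, tail_end_sample, asr):
    """audio: full waveform; tail_end_sample: end of the wake word's tail in
    samples; asr: any callable mapping a waveform to text (a stand-in for
    the voice recognition module)."""
    remainder = audio[tail_end_sample:]      # data after the wake word's tail
    segments = vad_segments(remainder, sr)   # second voice activity detection
    if not segments:
        return None                          # no valid speech -> no instruction
    start, end = segments[0][0], segments[-1][1]
    return asr(remainder[start:end])         # recognize only detected speech
```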
In one embodiment of the present application, the voice data may include sub-audio data from a plurality of audio channels. In a smart car, for example, there may be several audio channels, each connected to a microphone device, and sub-audio data may be acquired from the microphone device connected to each channel. In this embodiment, wake-up word detection may be performed on each piece of sub-audio data; if the wake-up word appears in the sub-audio data of one channel, that channel is taken as the target audio channel and wake-up word detection on the sub-audio data of the other channels is stopped. In one specific example, a smart car has multiple audio channels, each recording voice data through a microphone device: microphones are installed in front of both the driver's and the front passenger's seats, each connected to the intelligent voice client through its own audio channel. During wake-up word detection, the client detects the sub-audio data recorded on all channels simultaneously, including the channels connected to the driver's and the passenger's microphones. If the client detects the wake-up word in the sub-audio data acquired on the driver's channel, the driver's channel can be taken as the target audio channel, and detection on the other channels can be stopped. Then, during voice instruction recognition, only the audio data acquired on the driver's channel needs to be processed, which improves the processing efficiency of the voice data.
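A sketch of this channel-selection strategy follows, with the per-channel detector passed in as a callable; this is again an assumption for illustration, and in practice detection would run on all channels concurrently until one of them fires.

```python
def pick_target_channel(channels, detect_wake_word):
    """channels: mapping of channel id (e.g. 'driver', 'passenger') to its
    sub-audio data; detect_wake_word: callable returning True when the
    preset wake word appears in the given sub-audio."""
    for channel_id, sub_audio in channels.items():
        if detect_wake_word(sub_audio):
            return channel_id  # target channel; detection stops on the others
    return None                # wake word not found on any channel
```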
In another aspect, the present application further provides a voice processing device. Fig. 5 is a schematic block diagram of an embodiment of the voice processing device provided in the present application. As shown in fig. 5, the device may include a microphone, a voice activity detection module, a wake-up word detection module, and a voice recognition module, wherein:
the microphone is used to acquire audio data;
the voice activity detection module is used to perform voice activity detection on the audio data to obtain voice data;
the wake-up word detection module is used to identify a wake-up word from the voice data and to determine the end time of the tail of the wake-up word in the voice data;
the voice activity detection module is further used to perform a second voice activity detection on the portion of the audio data that starts from the end time of the word tail;
the voice recognition module is used to recognize a voice instruction in the audio data obtained by the second voice activity detection.
Optionally, in an embodiment of the present application, determining the end time of the tail of the wake-up word in the voice data may include:
acquiring the tail word of the wake-up word and the last syllable of the tail word;
determining the end time of the last syllable;
and taking that end time as the end time of the tail of the wake-up word.
Optionally, in an embodiment of the present application, determining the end time of the last syllable may include:
acquiring the starting pronunciation time of the last syllable;
taking the starting pronunciation time as the origin, collecting phonemes of the voice data at a preset frequency within a preset time period;
and when the phoneme changes, taking the moment of the change as the end time of the last syllable.
Optionally, in an embodiment of the present application, performing the second voice activity detection on the portion of the audio data that starts from the end time of the word tail and recognizing the voice instruction may include:
intercepting, from the audio data, the data that starts from the end time of the word tail;
performing voice activity detection on that data;
and when valid speech is determined to be present in the data, recognizing the voice instruction in the data.
Optionally, in an embodiment of the present application, identifying the wake-up word from the voice data may include:
matching the voice data against the wake-up word;
and when data identical to the wake-up word is obtained from the voice data by matching, determining that the voice data includes the wake-up word.
Optionally, in an embodiment of the present application, if the voice data includes sub-audio data from a plurality of audio channels, identifying the wake-up word from the voice data may include:
performing wake-up word detection separately on the sub-audio data from the plurality of audio channels;
and if the wake-up word appears in the sub-audio data of one of the audio channels, taking that channel as the target audio channel and stopping wake-up word detection on the sub-audio data of the other channels.
The voice processing method and device provided by the present application can identify the end time of the tail of the wake-up word in the voice data, and recognize the voice instruction in the voice data starting from that end time. By performing voice recognition in this manner, the start time of the voice instruction in the voice data can be accurately determined, redundant information from the wake-up word is prevented from being carried into the subsequent voice instruction, and the accuracy of intelligent voice recognition is improved. In addition, after the end time of the tail of the wake-up word is determined, a second voice activity detection is performed on the portion of the audio data that starts from that end time; if non-speech noise exists between the wake-up word and the voice instruction, its interference with voice instruction recognition can be eliminated, further improving the accuracy of voice instruction recognition.
In this embodiment, the client may be an electronic device having a recording function. According to their data processing capability, clients may be divided into the following categories.
TABLE 1 (presented as an image in the original document; it classifies clients by data processing capability into primary network devices, simple network devices, intermediate network devices, intelligent network devices, and high-performance devices, as described below.)
In this embodiment, the hardware of a primary network device is relatively simple: it can record through a microphone to generate audio information and transmit the generated audio information to a server through a network communication module. A primary network device may include a microphone, a network communication unit, a sensor, and a speaker, and performs substantially no data processing itself. It may also be provided with other sensors for acquiring its operating parameters. A primary network device may be, for example, an internet-of-things device or an edge node device.
In this embodiment, a simple network device may mainly include a microphone, a network communication unit, a processor, a memory, a speaker, and so on. Compared with a primary network device, a simple network device has an enhanced ability to process data: it has a processor capable of simple logic operations, so after collecting data it can perform preliminary preprocessing, such as generating a feature matrix from the audio information. A simple network device may have a display module with simple display functions for feeding information back to the user. Specifically, a simple network device may be a smart wearable device or a POS (point of sale) machine, such as a smart bracelet, an entry-level smart watch, smart glasses, a settlement device at an offline shopping site (e.g., a POS machine), or a mobile settlement device (e.g., a handheld POS machine or a settlement module attached to a handheld device).
In this embodiment, an intermediate network device may mainly include a microphone, a network communication unit, a processor, a memory, a display, a speaker, and so on. The main frequency of its processor is typically below 2.0 GHz, its memory capacity typically below 2 GB, and its storage capacity typically below 128 GB. An intermediate network device can process recorded audio information to a certain degree, for example generating a feature matrix and performing endpoint detection, noise reduction, and speech recognition on it. Specifically, an intermediate network device may include: smart home appliances, smart home terminals, smart speakers, higher-end smart watches, entry-level smartphones (priced around 1,000 yuan, for example), and vehicle-mounted smart terminals.
In this embodiment, an intelligent network device may mainly include hardware such as a microphone, a network communication unit, a processor, a memory, a display, and a speaker, and can have relatively strong data processing capability. The main frequency of its processor is typically above 2.0 GHz, its memory capacity typically below 12 GB, and its storage capacity typically below 1 TB. After generating a feature matrix for the audio information, it can perform endpoint detection, noise reduction, speech recognition, and so on. Further, an intelligent network device can generate a speech feature vector from the audio information; in some cases the speech feature vector can be matched against user feature vectors to identify the user, but such matching is limited to a small number of user feature vectors, such as those of the members of one household. Specifically, an intelligent network device may include better-performing smartphones, tablet computers, desktop computers, notebook computers, and the like.
In this embodiment, a high-performance device may mainly include hardware such as a microphone, a network communication unit, a processor, a memory, a display, and a speaker. High-performance devices have large-scale data processing capability and can also provide powerful data storage. The main frequency of the processor is usually above 3.0 GHz, the memory capacity is usually above 12 GB, and the storage capacity can be 1 TB or more. A high-performance device can generate a feature matrix for audio information, perform endpoint detection, noise reduction, and speech recognition, generate speech feature vectors, and match the speech feature vectors against a large number of stored user feature vectors. Specifically, a high-performance device may be a workstation, a highly configured desktop computer, a kiosk, or the like.
Of course, the foregoing merely lists a few clients by way of example. With the progress of science and technology, the performance of hardware devices may improve, so that electronic devices that currently have weak data processing capability may come to have strong processing capability. Therefore, when the embodiments below refer to the hardware classes in Table 1 above, they do so by way of example only and not as a limitation.
It should be noted that the speech processing method above can be implemented on all five types of hardware shown in Table 1.
Although the present application provides the method operation steps described in the embodiments or flowcharts, more or fewer steps may be included based on conventional or non-inventive work. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only one. When an actual device or client product executes the method, the steps may be executed sequentially or in parallel (for example, in a parallel-processor or multi-threaded processing environment) according to the methods shown in the embodiments or drawings.
Those skilled in the art will also appreciate that, besides implementing the controller in pure computer-readable program code, the method steps can be logically programmed so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like, to achieve the same functionality. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component — or even as both software modules implementing the method and structures within the hardware component.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present application, or the part of them contributing to the prior art, may be embodied in the form of a software product stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, including several instructions that cause a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to perform the methods described in the various embodiments, or in parts of the embodiments, of the present application.
The various embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. The present application is operational with numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
Although the present application has been described through embodiments, those of ordinary skill in the art will recognize that many variations and modifications are possible without departing from the spirit of the present application, and it is intended that the appended claims encompass such variations and modifications.

Claims (14)

1. A method of speech processing, the method comprising:
performing voice activity detection on audio data to obtain voice data;
identifying a wake-up word from the voice data;
determining the end time of the tail of the wake-up word in the voice data;
performing a second voice activity detection on the portion of the audio data that starts from the end time of the word tail, and recognizing a voice instruction;
wherein determining the end time of the tail of the wake-up word in the voice data comprises:
determining the end time of the last syllable of the tail of the wake-up word according to the user's pronunciation habits;
and taking the end time of the last syllable of the tail as the end time of the tail of the wake-up word.
2. The method of claim 1, wherein determining the end time of the last syllable comprises:
acquiring the starting pronunciation time of the last syllable;
taking the starting pronunciation time as the origin, collecting phonemes of the voice data at a preset frequency within a preset time period;
and when the phoneme changes, taking the moment of the change as the end time of the last syllable.
3. The method of claim 1, wherein performing the second voice activity detection on the portion of the audio data that starts from the end time of the word tail and recognizing the voice instruction comprises:
intercepting, from the audio data, the data that starts from the end time of the word tail;
performing voice activity detection on that data;
and when valid speech is determined to be present in the data, recognizing the voice instruction in the data.
4. The method of claim 1, wherein identifying the wake-up word from the voice data comprises:
matching the voice data against the wake-up word;
and when data identical to the wake-up word is obtained from the voice data by matching, determining that the voice data includes the wake-up word.
5. The method of claim 1, wherein, if the voice data includes sub-audio data from a plurality of audio channels, identifying the wake-up word from the voice data comprises:
performing wake-up word detection separately on the sub-audio data from the plurality of audio channels;
and if the wake-up word appears in the sub-audio data of one of the audio channels, taking that channel as the target audio channel and stopping wake-up word detection on the sub-audio data of the other channels.
6. A voice processing device, comprising a microphone, a voice activity detection module, a wake-up word detection module, and a voice recognition module, wherein:
the microphone is used to acquire audio data;
the voice activity detection module is used to perform voice activity detection on the audio data to obtain voice data;
the wake-up word detection module is used to identify a wake-up word from the voice data and to determine the end time of the tail of the wake-up word in the voice data;
the voice activity detection module is further used to perform a second voice activity detection on the portion of the audio data that starts from the end time of the word tail;
the voice recognition module is used to recognize a voice instruction in the audio data obtained by the second voice activity detection;
wherein determining the end time of the tail of the wake-up word in the voice data comprises:
determining the end time of the last syllable of the tail of the wake-up word according to the user's pronunciation habits;
and taking the end time of the last syllable of the tail as the end time of the tail of the wake-up word.
7. The apparatus of claim 6, wherein determining the end time of the last syllable comprises:
acquiring the starting pronunciation time of the last syllable;
taking the starting pronunciation time as the origin, collecting phonemes of the voice data at a preset frequency within a preset time period;
and when the phoneme changes, taking the moment of the change as the end time of the last syllable.
8. The apparatus of claim 6, wherein performing the second voice activity detection on the portion of the audio data that starts from the end time of the word tail and recognizing the voice instruction comprises:
intercepting, from the audio data, the data that starts from the end time of the word tail;
performing voice activity detection on that data;
and when valid speech is determined to be present in the data, recognizing the voice instruction in the data.
9. The apparatus of claim 6, wherein identifying the wake-up word from the voice data comprises:
matching the voice data against the wake-up word;
and when data identical to the wake-up word is obtained from the voice data by matching, determining that the voice data includes the wake-up word.
10. The apparatus of claim 6, wherein, if the voice data includes sub-audio data from a plurality of audio channels, identifying the wake-up word from the voice data comprises:
performing wake-up word detection separately on the sub-audio data from the plurality of audio channels;
and if the wake-up word appears in the sub-audio data of one of the audio channels, taking that channel as the target audio channel and stopping wake-up word detection on the sub-audio data of the other channels.
11. A speech processing device, comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 5.
12. A computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 5.
13. An in-vehicle system, comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 5.
14. A conference system, comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 5.
CN201711481463.4A, filed 2017-12-29 (priority 2017-12-29) — Voice processing method and equipment — Active — CN109994106B (en)

Priority Applications (1)

CN201711481463.4A (CN109994106B) — priority date 2017-12-29, filing date 2017-12-29 — Voice processing method and equipment


Publications (2)

CN109994106A — published 2019-07-09
CN109994106B — granted and published 2023-06-23

Family

ID=67109159

Family Applications (1)

CN201711481463.4A — Active — CN109994106B (en)

Country Status (1)

Country Link
CN (1) CN109994106B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111128166B (en) * 2019-12-27 2022-11-25 思必驰科技股份有限公司 Optimization method and device for continuous awakening recognition function
CN111857638A (en) * 2020-06-01 2020-10-30 江西江铃集团新能源汽车有限公司 Voice interaction method and system based on face recognition and automobile
CN111833870A (en) * 2020-07-01 2020-10-27 中国第一汽车股份有限公司 Awakening method and device of vehicle-mounted voice system, vehicle and medium
CN112466304B (en) * 2020-12-03 2023-09-08 北京百度网讯科技有限公司 Offline voice interaction method, device, system, equipment and storage medium
CN112599133A (en) * 2020-12-15 2021-04-02 北京百度网讯科技有限公司 Vehicle-based voice processing method, voice processor and vehicle-mounted processor

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08211986A (en) * 1995-02-08 1996-08-20 Sharp Corp Interactive equipment with agreeing response
WO2000005709A1 (en) * 1998-07-23 2000-02-03 Siemens Aktiengesellschaft Method and device for recognizing predetermined key words in spoken language

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1841496A (en) * 2005-03-31 2006-10-04 株式会社东芝 Method and apparatus for measuring speech speed and recording apparatus therefor
TW200926140A (en) * 2007-12-11 2009-06-16 Inst Information Industry Method and system of generating and detecting confusion phones of pronunciation
JP5753769B2 (en) * 2011-11-18 2015-07-22 株式会社日立製作所 Voice data retrieval system and program therefor
KR20130133629A (en) * 2012-05-29 2013-12-09 삼성전자주식회사 Method and apparatus for executing voice command in electronic device
CN104538030A (en) * 2014-12-11 2015-04-22 科大讯飞股份有限公司 Control system and method for controlling household appliances through voice
US9613626B2 (en) * 2015-02-06 2017-04-04 Fortemedia, Inc. Audio device for recognizing key phrases and method thereof
EP3067884B1 (en) * 2015-03-13 2019-05-08 Samsung Electronics Co., Ltd. Speech recognition system and speech recognition method thereof
CN107450879A (en) * 2016-05-30 2017-12-08 中兴通讯股份有限公司 Terminal operation method and device
CN106649644B (en) * 2016-12-08 2020-02-07 腾讯音乐娱乐(深圳)有限公司 Lyric file generation method and device


Also Published As

Publication number Publication date
CN109994106A (en) 2019-07-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant