CN115527532A - Equipment awakening method and device, computer equipment and storage medium - Google Patents

Equipment awakening method and device, computer equipment and storage medium

Info

Publication number
CN115527532A
CN115527532A (application CN202211073943.8A)
Authority
CN
China
Prior art keywords
target
voice
frame
end point
classification information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211073943.8A
Other languages
Chinese (zh)
Inventor
李良斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202211073943.8A priority Critical patent/CN115527532A/en
Publication of CN115527532A publication Critical patent/CN115527532A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/4401 Bootstrapping
    • G06F9/4418 Suspend and resume; Hibernate and awake
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/027 Syllables being the recognition units
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a device wake-up method and apparatus, a computer device and a storage medium, belonging to the technical field of voice recognition. The method includes the following steps: classifying a plurality of voice frames in an acquired voice signal to obtain a plurality of classification information, wherein the classification information indicates the probability that the voice frame includes each syllable, each character or each word of a target phrase; determining a tail end point of the target phrase based on the plurality of classification information, wherein the tail end point indicates the moment at which the target phrase in the voice signal finishes playing; and waking up a target device based on the tail end point of the target phrase. With this technical solution, the moment at which the target phrase in the voice signal finishes playing is determined, and the target device is woken at that moment, so that the target device is woken only when the complete target phrase has been detected, false wake-ups are avoided, and wake-up accuracy is improved.

Description

Equipment awakening method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a device wake-up method and apparatus, a computer device, and a storage medium.
Background
With the development of voice recognition technology, devices with a voice wake-up function have become increasingly popular. A user can wake such a device by speaking a specific phrase, i.e., a wake-up word, bringing the device from a standby state into an operating state. How to improve the wake-up success rate is a problem to be solved.
At present, it is common to directly judge whether the collected voice signal contains a keyword or a key syllable of the wake-up word, and to wake the device as soon as the keyword or key syllable is detected, so as to improve the wake-up success rate.
The problem with this technical solution is that, because the device is woken as soon as it detects the keyword or key syllable, the probability of the device being woken by mistake increases and the wake-up accuracy decreases.
Disclosure of Invention
The embodiments of the present application provide a device wake-up method and apparatus, a computer device and a storage medium, which ensure that the target device is woken only when the complete target phrase has been detected, thereby avoiding false wake-ups and improving wake-up accuracy. The technical solution is as follows:
in one aspect, a device wake-up method is provided, and the method includes:
classifying a plurality of voice frames in the acquired voice signal to obtain a plurality of classification information, wherein the classification information is used for indicating the probability that each syllable, each character or each word in a target phrase is included in the voice frames;
determining a tail end point of the target phrase based on the plurality of classification information, wherein the tail end point is used for indicating the moment when the target phrase in the voice signal is played to the end;
waking up a target device based on a tail-end point of the target phrase.
In some embodiments, the classifying the multiple speech frames in the acquired speech signal to obtain multiple pieces of classification information includes:
for any voice frame in the voice signals, extracting the characteristics of the voice frame to obtain the voice characteristics of the voice frame;
classifying the voice features based on a neural network to obtain classification information of the voice frames, wherein the neural network is used for classifying syllables, characters or words.
In some embodiments, the method further comprises:
for a target voice frame, acquiring the classification information of an adjacent voice frame adjacent to the target voice frame, wherein the target voice frame is any one of the voice frames;
and based on the classification information and the smoothing coefficient of the adjacent speech frames, smoothing the classification information of the target speech frame, wherein the smoothing is used for smoothing the change trend of the probability of each syllable, each word or each word of the target phrase in the adjacent speech frames.
In some embodiments, said determining the tail-end point of the target phrase based on the plurality of classification information comprises:
determining a first target frame based on the plurality of classification information, wherein the first target frame is the voice frame that includes the target character or target word of the target phrase for the first time;
processing a plurality of first voice frames after the first target frame based on a rectangular sliding window to obtain a plurality of first information, wherein the plurality of first information indicates the change trend of the probability of the target character or target word, the length of the rectangular sliding window is a first number of frames, and the sliding step is one frame;
determining a tail-end point of the target phrase based on the plurality of first information.
In some embodiments, said determining the tail-end point of the target phrase based on the plurality of first information comprises:
determining a second target frame in response to the probability of the target character or target word in any two adjacent pieces of first information changing from greater than a first threshold to less than the first threshold, wherein the second target frame is the first voice frame past which the rectangular sliding window slides when the adjacent first information is determined;
and determining the starting time of the second target frame as the tail end point of the target phrase.
In some embodiments, said determining the tail-end point of the target phrase based on the plurality of classification information comprises:
determining a third target frame based on the plurality of classification information, wherein the third target frame is a voice frame comprising a target syllable in the target phrase for the first time;
processing a plurality of second voice frames behind the third target frame based on a triangular sliding window to obtain a plurality of second information, wherein the plurality of second information is used for indicating the change trend of the probability of the target syllable, the length of the triangular sliding window is a second number of frames, and the sliding step length is one frame;
determining a tail-end point of the target phrase based on the plurality of second information.
In some embodiments, said determining the tail-end point of the target phrase based on the plurality of second information comprises:
determining a variation of the probability of the target syllable in two adjacent second information based on the plurality of second information;
determining a fourth target frame in response to the variation being greater than a second threshold a third number of consecutive times, wherein the fourth target frame is the second voice frame past which the triangular sliding window slides when the variation first exceeds the second threshold;
and determining the starting time of the fourth target frame as the tail end point of the target phrase.
In some embodiments, the method further comprises:
acquiring the currently input voice signal;
and processing the voice signal based on a voice sliding window to obtain the plurality of voice frames, wherein the length of the voice sliding window is a first time length, the sliding step length is a second time length, and the second time length is less than the first time length.
In some embodiments, the method further comprises:
waking up the target device in response to the probability of a wake-up syllable, wake-up character or wake-up word of the target phrase in any piece of classification information being greater than a wake-up threshold;
the waking up the target device based on the tail end point of the target phrase comprises:
under the condition that the target equipment is awakened, acquiring a first voice signal from the voice signal, wherein the first voice signal is a voice signal behind the tail end point;
and inputting the first voice signal and the second voice signal into an automatic voice recognition model, wherein the second voice signal is a newly acquired voice signal, and the automatic voice recognition model is used for recognizing the voice signal as an interactive instruction.
In another aspect, an apparatus for device wake-up is provided, the apparatus comprising:
the classification module is used for classifying a plurality of voice frames in the acquired voice signals to obtain a plurality of classification information, wherein the classification information is used for indicating the probability that each syllable, each character or each word in a target phrase is included in the voice frames;
a determining module, configured to determine, based on the multiple pieces of classification information, a tail end point of the target phrase, where the tail end point is used to indicate a time when playing of the target phrase in the speech signal ends;
and the awakening module is used for awakening the target equipment based on the tail end point of the target phrase.
In some embodiments, the classification module is configured to perform feature extraction on any speech frame in the speech signal to obtain a speech feature of the speech frame; classifying the voice features based on a neural network to obtain classification information of the voice frames, wherein the neural network is used for classifying syllables, characters or words.
In some embodiments, the apparatus further comprises:
a first obtaining module, configured to obtain, for a target speech frame, classification information of an adjacent speech frame adjacent to the target speech frame, where the target speech frame is any one speech frame in the multiple speech frames;
and the smoothing module is used for smoothing the classification information of the target speech frame based on the classification information and the smoothing coefficient of the adjacent speech frame, wherein the smoothing processing is used for smoothing the change trend of the probability of each syllable, each character or each word of the target phrase in the adjacent speech frame.
In some embodiments, the determining module comprises:
a first determining unit, configured to determine a first target frame based on the plurality of classification information, where the first target frame is the speech frame that includes the target character or target word of the target phrase for the first time;
a first processing unit, configured to process, based on a rectangular sliding window, a plurality of first speech frames located after the first target frame to obtain a plurality of pieces of first information, where the first information indicates the change trend of the probability of the target character or target word, the length of the rectangular sliding window is a first number of frames, and the sliding step is one frame;
a second determining unit, configured to determine a tail end point of the target phrase based on the plurality of first information.
In some embodiments, the second determining unit is configured to determine a second target frame in response to the probability of the target character or target word in any two adjacent pieces of first information changing from greater than a first threshold to less than the first threshold, where the second target frame is the first speech frame past which the rectangular sliding window slides when the adjacent first information is determined; and determine the start time of the second target frame as the tail end point of the target phrase.
In some embodiments, the determining module comprises:
a third determining unit, configured to determine a third target frame based on the plurality of classification information, where the third target frame is a speech frame that includes a target syllable in the target phrase for the first time;
a second processing unit, configured to process, based on a triangular sliding window, a plurality of second speech frames after the third target frame to obtain a plurality of second information, where the plurality of second information is used to indicate a change trend of the probability of the target syllable, the length of the triangular sliding window is a second number of frames, and a sliding step length is one frame;
a fourth determining unit, configured to determine a tail end point of the target phrase based on the plurality of second information.
In some embodiments, the fourth determining unit is configured to determine, based on the plurality of second information, the variation of the probability of the target syllable between two adjacent pieces of second information; determine a fourth target frame in response to the variation being greater than a second threshold a third number of consecutive times, where the fourth target frame is the second speech frame past which the triangular sliding window slides when the variation first exceeds the second threshold; and determine the start time of the fourth target frame as the tail end point of the target phrase.
In some embodiments, the apparatus further comprises:
the second acquisition module is used for acquiring the currently input voice signal;
and the signal processing module is used for processing the voice signals based on a voice sliding window to obtain the plurality of voice frames, wherein the length of the voice sliding window is a first time length, the sliding step length is a second time length, and the second time length is less than the first time length.
In some embodiments, the wake module is further configured to wake the target device in response to the probability of a wake-up syllable, wake-up character or wake-up word of the target phrase in any piece of classification information being greater than a wake-up threshold; acquire, with the target device woken, a first voice signal from the voice signal, where the first voice signal is the voice signal after the tail end point; and input the first voice signal and a second voice signal into an automatic voice recognition model, where the second voice signal is a newly acquired voice signal, and the automatic voice recognition model is used to recognize voice signals as interactive instructions.
In another aspect, a computer device is provided, including a processor and a memory, where the memory is used to store at least one piece of computer program, and the at least one piece of computer program is loaded and executed by the processor to implement the above device wake-up method.
In another aspect, a computer-readable storage medium is provided, which is used for storing at least one piece of computer program, and the at least one piece of computer program is used for executing the device wake-up method.
In another aspect, a computer program product is provided, which when executed by a processor implements the above-described device wake-up method.
The embodiment of the application provides a device awakening method, wherein the probability that each voice frame comprises each syllable, each character or each word in a target phrase can be determined by classifying a plurality of voice frames in an acquired voice signal, so that the time when the target phrase is played in the voice signal can be determined based on the probability that each voice frame comprises each syllable, each character or each word in the target phrase, and finally the target device is awakened at the time when the target phrase is played, so that the target device can be awakened under the condition that the target phrase is completely detected, false awakening is avoided, and the awakening accuracy is improved.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is an implementation environment of a device wake-up method according to an embodiment of the present application;
fig. 2 is a flowchart of a device wake-up method according to an embodiment of the present application;
fig. 3 is a flowchart of another device wake-up method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a rectangular sliding window provided in accordance with an embodiment of the present application;
FIG. 5 is a schematic diagram of a triangular sliding window provided in accordance with an embodiment of the present application;
fig. 6 is a block diagram of a device wake-up apparatus provided in an embodiment of the present application;
fig. 7 is a block diagram of another device wake-up apparatus provided in an embodiment of the present application;
fig. 8 is a block diagram of a terminal according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a server provided according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution.
The term "at least one" in this application means one or more, and the meaning of "a plurality" means two or more.
It should be noted that information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are authorized by the user or sufficiently authorized by various parties, and the collection, use, and processing of the relevant data is required to comply with relevant laws and regulations and standards in relevant countries and regions. For example, the speech signals referred to in this application are all acquired with sufficient authorization.
The following explains terms related to the present application.
The Mel-frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the nonlinear Mel scale of sound frequency, and Mel-Frequency Cepstral Coefficients (MFCC) are the coefficients that make up the Mel-frequency cepstrum. They are derived from the cepstrum of an audio segment. The Mel-frequency cepstrum differs from the ordinary cepstrum in that its frequency bands are equally spaced on the Mel scale, which approximates the human auditory system more closely than the linearly spaced bands of the ordinary log cepstrum. Such a nonlinear representation can describe the sound signal better in several domains. MFCC features are widely used in speech recognition.
A Filter bank (FBank) feature is one kind of speech feature parameter; because its cepstrum-style extraction process conforms well to the principles of human hearing, it is among the most common and effective speech feature extraction algorithms. FBank feature extraction is equivalent to MFCC extraction with the final discrete cosine transform (a lossy transform) removed, so FBank features retain more of the original speech information than MFCC features. FBank features are likewise used in speech recognition.
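To make the relationship between the two features concrete, the following is a minimal sketch, assuming the librosa library and a 16 kHz mono recording (neither of which is specified in this application): FBank is the log Mel filter-bank energy, and MFCC adds a DCT on top of it.

```python
import librosa

def fbank_and_mfcc(wav_path, n_mels=40, n_mfcc=13):
    # 16 kHz mono is a common (assumed) setup for wake-word front ends.
    y, sr = librosa.load(wav_path, sr=16000, mono=True)
    # FBank: log-compressed Mel filter-bank energies, no DCT.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    fbank = librosa.power_to_db(mel)          # shape: (n_mels, n_frames)
    # MFCC: apply the final DCT step on top of the log-Mel energies.
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=n_mfcc)
    return fbank, mfcc
```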
The device awakening method provided by the embodiment of the application can be executed by computer equipment. In some embodiments, the computer device is a terminal or a server. First, taking a computer device as a terminal as an example, an implementation environment of the device wake-up method provided in the embodiment of the present application is described below, and fig. 1 is a schematic diagram of an implementation environment of the device wake-up method provided in the embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102.
The terminal 101 and the server 102 can be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
In some embodiments, the terminal 101 is a smartphone, a tablet, a laptop, a desktop computer, a smart speaker, a smart watch, and the like, but is not limited thereto. The terminal 101 is installed and operated with an application program supporting voice recognition.
In some embodiments, the server 102 is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms. The server 102 provides background services for the application that supports voice recognition. In some embodiments, the server 102 undertakes the primary computing work and the terminal 101 the secondary computing work; or the server 102 undertakes the secondary computing work and the terminal 101 the primary computing work; or the server 102 and the terminal 101 compute cooperatively using a distributed computing architecture.
Fig. 2 is a flowchart of a device wake-up method according to an embodiment of the present application, and as shown in fig. 2, the device wake-up method is described as being executed by a terminal in the embodiment of the present application. The equipment awakening method comprises the following steps:
201. the terminal classifies a plurality of voice frames in the acquired voice signal to obtain a plurality of classification information, wherein the classification information is used for indicating the probability that each syllable, each character or each word in the target phrase is included in the voice frames.
In this embodiment of the present application, the voice signal is a voice signal acquired by the terminal in real time, and the voice signal may be acquired by the terminal through an internal voice acquisition device or may be acquired by the terminal through an external voice acquisition device. The voice signal is used for waking up the device, and accordingly, the voice signal may contain information for waking up the device. And the terminal frames the acquired voice signal to obtain a plurality of voice frames. For any voice frame, the terminal can classify the voice frame according to the classification modes of syllables, characters or words and the like, and determine the probability of each syllable, each character or each word in the target phrase included in the voice frame, so as to obtain the classification information of the voice frame.
For example, take a speech signal whose content is "hello friend" (你好朋友) and classification by character: the classification information may indicate, for each voice frame obtained from the speech signal, the probability that the frame includes "你", the probability that it includes "好", the probability that it includes "朋", and the probability that it includes "友". Similarly, the classification information may instead indicate the probability that the voice frame includes each syllable or each word of the target phrase.
202. And the terminal determines the tail end point of the target phrase based on the plurality of classification information, wherein the tail end point is used for indicating the moment when the target phrase in the voice signal is played to be finished.
In this embodiment, the terminal may further process the plurality of classification information to determine, in the voice signal, the probability of occurrence of the target syllable, target character or target word of the target phrase, and thereby determine the moment at which the target phrase in the voice signal finishes playing. The target phrase is the wake-up word used to wake the device. Wake-up words are typically designed as four-character phrases: a wake-up word that is too short is prone to false wake-ups, while one that is too long harms the interaction experience. The target syllable may be the last syllable of the wake-up word, the target character the last character, and the target word the last word.
203. The terminal wakes up the target device based on the tail end point of the target phrase.
In this embodiment of the application, after determining the tail end point of the target phrase, the terminal may determine that the target phrase has been played completely, and at this time, awaken the target device through an awakening instruction. The target device may be the terminal itself or an intelligent device connected to the terminal. The target device can recognize the newly collected voice signal after awakening, and execute the interactive instruction obtained by recognition.
The embodiment of the application provides an equipment awakening method, wherein syllables, characters or words included in each voice frame can be determined by classifying a plurality of voice frames in an acquired voice signal, so that the time when a target phrase is played in the voice signal can be determined based on the syllables, the characters or the words included in each voice frame, and finally the target equipment is awakened at the time when the target phrase is played, so that the target equipment can be awakened under the condition that the target phrase is completely detected, false awakening is avoided, and the awakening accuracy is improved.
Fig. 2 exemplarily shows a main flow of the device wake-up method provided in the embodiment of the present application, and the device wake-up method is described in detail below based on an application scenario. Fig. 3 is a flowchart of another device wake-up method according to an embodiment of the present application, and as shown in fig. 3, the embodiment of the present application is described as being executed by a terminal as an example. The equipment awakening method comprises the following steps:
301. the terminal acquires a currently input voice signal.
In the embodiment of the application, the terminal can acquire the currently input voice signal in real time through the voice acquisition equipment inside the terminal and also through the voice acquisition equipment externally connected with the terminal. Alternatively, the speech signal may be real-time speech or a historical recording. Optionally, the voice signal may or may not include information for waking up the device, and the present application does not limit the voice signal. In the embodiment of the application, the terminal can collect the voice signal in real time and process the collected voice signal at the same time.
302. The terminal processes the voice signals based on the voice sliding window to obtain a plurality of voice frames, wherein the length of the voice sliding window is a first time length, the sliding step length is a second time length, and the second time length is smaller than the first time length.
In the embodiment of the application, the terminal can perform framing on the voice signal based on the voice sliding window. And the voice sliding window slides for a second time length each time, and a voice signal with the first time length is intercepted as a voice frame each time the voice sliding window slides for one time, so that a plurality of voice frames are obtained. For example, the first time period is 10 ms, 16 ms, or 20 ms, etc., and the second time period is 5 ms, 10 ms, or 15 ms, etc., and the length of the voice sliding window and the step size of the voice sliding window sliding are not limited in the embodiment of the present application.
In the process of processing the voice signals, because the second duration of the voice sliding window is less than the first duration, namely the sliding step length of the voice sliding window is less than the length of the voice sliding window, an overlapping part exists between adjacent voice frames, and the overlapping part can prevent information from being missed when the voice frames are processed, so that the accuracy of voice processing is improved.
In some embodiments, the terminal may not use the speech sliding window to process the speech signal, but directly divide the speech signal by the first duration, that is, intercept a section of the speech signal as a speech frame every first duration to obtain a plurality of speech frames. The embodiment of the application does not limit the length of each intercepted voice frame.
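As a concrete illustration of this framing step, here is a minimal sketch; the 16 kHz sample rate and the 20 ms / 10 ms window and step values are assumptions for illustration, not values fixed by this application:

```python
import numpy as np

def frame_signal(signal, sr=16000, win_ms=20, step_ms=10):
    """Split a 1-D voice signal into overlapping voice frames with a sliding window."""
    win = int(sr * win_ms / 1000)    # first duration, in samples
    step = int(sr * step_ms / 1000)  # second duration (< first duration)
    n_frames = 1 + max(0, (len(signal) - win) // step)
    # Adjacent frames overlap by (win - step) samples, so no information
    # at a frame boundary is missed during later processing.
    return np.stack([signal[i * step : i * step + win] for i in range(n_frames)])

frames = frame_signal(np.random.randn(16000))  # 1 s of audio -> 99 frames
```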
303. The terminal classifies the plurality of voice frames to obtain a plurality of classification information, and the classification information is used for indicating the probability that the voice frames comprise each syllable, each character or each word in the target phrase.
In the embodiment of the application, the terminal can classify the voice frames by adopting different classification modes, wherein the classification modes comprise syllable classification, character classification or word classification. If the terminal classifies the voice frame according to the syllables, the classification information of the voice frame is used for indicating the probability that the voice frame comprises the syllables in the target phrase; if the terminal classifies the voice frame according to the words, the classification information of the voice frame is used for indicating the probability that the voice frame comprises the words in the target phrase; if the terminal classifies the voice frame according to the words, the classification information of the voice frame is used for indicating the probability that the voice frame comprises the words in the target phrase.
In some embodiments, the terminal may classify the voice frames through a neural network trained on the target phrase, which classifies syllables, characters or words and determines the probability that the voice frame includes each syllable, each character or each word of the target phrase. Correspondingly, for any voice frame in the voice signal, the terminal performs feature extraction on the voice frame to obtain the voice features of the voice frame. Then, the terminal classifies the voice features based on the neural network to obtain the classification information of the voice frame, where the classification information indicates the probability that the voice frame includes each syllable, each character or each word of the target phrase. The voice features may be MFCC features or FBank features. The neural network may be an ANN (Artificial Neural Network), a CNN (Convolutional Neural Network), or another neural network capable of classifying syllables, characters or words; this application does not limit the type or operating form of the neural network. Classifying the voice frames through the neural network further processes them and determines the information contained in each voice frame, such as the probability of each syllable, each character or each word of the target phrase, from which the start and end times of the target phrase can then be determined.
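To make the per-frame classification step concrete, the following is a minimal sketch of such a classifier; the small PyTorch network, its layer sizes, and the "none" class are assumptions for illustration, since this application only requires some neural network that outputs per-unit probabilities:

```python
import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    """Maps one voice frame's features to probabilities over the units
    (syllables/characters/words) of the target phrase plus a 'none' class."""
    def __init__(self, feat_dim=40, n_units=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, n_units + 1),   # +1: "no target unit in this frame"
        )

    def forward(self, feats):              # feats: (batch, feat_dim)
        return torch.softmax(self.net(feats), dim=-1)

# Usage: classification information for one 40-dim FBank frame of a
# four-character target phrase.
clf = FrameClassifier(feat_dim=40, n_units=4)
probs = clf(torch.randn(1, 40))            # probs[0, :4]: per-character probs
```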
In some embodiments, the terminal may perform the following step 304 to smooth the plurality of classification information, and may also skip the step 304 to perform the step 305 and subsequent steps, that is, the step 304 is an optional step.
304. The terminal performs smoothing processing on the plurality of classification information.
In the embodiment of the present application, the terminal smooths the plurality of classification information with a smoothing coefficient, and different smoothing coefficients are set for different classification modes. If classification is by syllable, each syllable occupies very little time, so a smaller smoothing coefficient can be set to reduce the correction amplitude and avoid delaying the wake-up; if classification is by word, words take longer to pronounce, so a larger smoothing coefficient can be set to suppress noisy classifications from the neural network and improve classification sensitivity.
In some embodiments, the terminal may smooth the classification information of adjacent speech frames by a first order smoothing formula. Correspondingly, the terminal acquires the classification information of the adjacent voice frame connected with the target voice frame for the target voice frame, and then carries out smoothing processing on the classification information of the target voice frame based on the classification information and the smoothing coefficient of the adjacent voice frame. The smoothing process is used to smooth the variation trend of the probability of each syllable, each word or each word of the target phrase in the adjacent speech frames. Among them, the first-order smoothing formula is shown in the following formula (1).
X_i = a * X_{i-1} + (1 - a) * X_i    (1)
where X_i on the left is the smoothed classification information of the i-th voice frame, X_i on the right is the classification information of the i-th voice frame before smoothing, X_{i-1} is the classification information of the (i-1)-th voice frame, and a is the smoothing coefficient.
For example, taking the example that the neural network classifies according to syllables, the neural network classifies the ith speech frame to obtain the probability of each syllable in the target phrase included in the speech frame, wherein the probability of some syllables is 0, which indicates that the speech frame does not include the syllable. In the classification information of two adjacent voice frames, the probability difference of each syllable is smaller, and the probability in the classification information can be smoother through the first-order smoothing formula.
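A minimal sketch of this first-order smoothing applied to the per-frame probability vectors; the coefficient value 0.7 is an illustrative assumption (the application only says it should be smaller for syllable classification and larger for word classification):

```python
import numpy as np

def smooth_classifications(frame_probs, a=0.7):
    """First-order smoothing X_i = a * X_{i-1} + (1 - a) * X_i, in frame order.

    frame_probs: (n_frames, n_units) array, one probability vector per frame.
    """
    smoothed = np.array(frame_probs, dtype=float)
    for i in range(1, len(smoothed)):
        smoothed[i] = a * smoothed[i - 1] + (1 - a) * smoothed[i]
    return smoothed
```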
305. And the terminal determines the tail end point of the target phrase based on the plurality of classification information, wherein the tail end point is used for indicating the moment when the target phrase in the voice signal is played to be finished.
In the embodiment of the application, the terminal can determine the tail end point of the target phrase according to the probability of each syllable, each character or each word of the target phrase in the classification information. Optionally, the terminal may determine the tail end point of the target phrase in two ways according to different classification ways.
The first way: when the classification mode is classification by character or by word. Because the classification information has been smoothed, the voice frames may lag slightly, so the terminal can process the classification information through a rectangular sliding window to determine the tail end point of the target phrase. Correspondingly, the terminal determines a first target frame based on the plurality of classification information; then processes a plurality of first speech frames after the first target frame based on the rectangular sliding window to obtain a plurality of pieces of first information; and finally determines the tail end point of the target phrase based on the plurality of pieces of first information. Optionally, in response to the probability of the target character or target word in any two adjacent pieces of first information changing from greater than a first threshold to less than the first threshold, a second target frame is determined, and the start time of the second target frame is determined as the tail end point of the target phrase. The first target frame is the speech frame that includes the target character or target word of the target phrase for the first time. The length of the rectangular sliding window is a first number of frames, and the sliding step is one frame. The plurality of pieces of first information indicate the change trend of the probability of the target character or target word, which first rises and then falls, i.e., the trend of that probability across the classification information as the voice signal begins and then finishes playing the target character or target word. The target character is the last character in the target phrase, and the target word is the last word in the target phrase. The first threshold is used to judge whether the target character or target word marks the tail end point of the target phrase; the embodiment of the present application does not limit its value. A rectangular sliding window may also be called a rectangular filter. Through the change trend of the probability of the target character or target word in the target phrase, the tail end point of the target phrase can be determined accurately, i.e., the terminal confirms that the complete target phrase has been played, which effectively avoids false wake-ups and improves device wake-up accuracy.
For example, take a target phrase whose target character, i.e. its last character, is "学" (xue). The terminal first determines, based on the plurality of classification information, the first target frame in which "学" appears for the first time, and then obtains a plurality of first speech frames after the first target frame; the first target frame and the adjacent first speech frames after it all include "学". The terminal then determines the change trend of the probability of "学" across these first speech frames through the rectangular sliding window. For example, the rectangular sliding window is 10 frames long and slides 1 frame at a time. The rectangular sliding window sums the probabilities of "学" over its 10 frames as the window score, i.e. one piece of first information, and the sequence of first information represents the probability trend of "学". If, after the rectangular sliding window has slid some number of times, the score changes between the previous and the next piece of first information from greater than the first threshold to less than the first threshold, the frame the window has just slid past may contain the end of "学"; that frame is taken as the second target frame, and the start time of the second target frame is determined as the moment at which the target phrase finishes playing. Fig. 4 is a schematic diagram of a rectangular sliding window provided according to an embodiment of the present application. As shown in Fig. 4, the rectangular box represents the rectangular sliding window, 10 frames in length, sliding one frame at a time; the vertical bars represent the probability of the target character or target word in the classification information of the first speech frames, and the taller the bar, the higher the probability.
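A minimal sketch of this rectangular-sliding-window detection; the 10-frame window, the threshold value, and the choice of which frame becomes the second target frame are illustrative assumptions:

```python
import numpy as np

def rect_window_endpoint(target_probs, first_target, win=10, threshold=3.0):
    """Return the index of the second target frame (tail end point), or None.

    target_probs: smoothed per-frame probability of the target character/word.
    first_target: index of the frame where the target character first appears.
    """
    probs = np.asarray(target_probs, dtype=float)
    prev_score = None
    for start in range(first_target, len(probs) - win + 1):
        score = probs[start : start + win].sum()   # one piece of first information
        if prev_score is not None and prev_score > threshold >= score:
            # Window score crossed from above to below the first threshold:
            # treat the frame the window just slid past as the second target frame.
            return start + win - 1
        prev_score = score
    return None
```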
The second way: when the classification mode is classification by syllable. Because the classification information has been smoothed, it does not jump abruptly but changes gradually, so the terminal can process the voice frames through a triangular sliding window to determine the tail end point of the target phrase. The triangular sliding window is the left half of an isosceles triangle of the kind used in a Mel filter bank, split down the middle, i.e., a right triangle. Correspondingly, the terminal determines a third target frame based on the plurality of classification information; then processes a plurality of second speech frames after the third target frame based on the triangular sliding window to obtain a plurality of pieces of second information; and determines the tail end point of the target phrase based on the plurality of pieces of second information. Optionally, the terminal determines, based on the plurality of pieces of second information, the variation of the probability of the target syllable between two adjacent pieces of second information; in response to the variation being greater than a second threshold a third number of consecutive times, determines a fourth target frame, where the fourth target frame is the second speech frame past which the triangular sliding window slides when the variation first exceeds the second threshold; and determines the start time of the fourth target frame as the tail end point of the target phrase. The third target frame is the speech frame that includes the target syllable of the target phrase for the first time. The second information indicates the change trend of the probability of the target syllable, as explained in the first way, which is not repeated here. The length of the triangular sliding window is a second number of frames, and the sliding step is one frame. The second threshold is used to judge whether the target syllable marks the tail end point of the target phrase; the embodiment of the present application does not limit its value. The third number may be 2, 3, or the like. Through the relationship between the variation of the probability over the third number of times and the second threshold, the tail end point of the target phrase can be determined accurately, i.e., the terminal confirms that the complete target phrase has been played, which effectively avoids false wake-ups and improves device wake-up accuracy.
In some embodiments, the terminal may determine the second information by the following formula (2).
score_l = Σ_{f=1}^{l} (f / l) * s * F_f    (2)
where score_l denotes the second information; l denotes the length of the triangular sliding window; f denotes the f-th frame within the triangular sliding window; s denotes the maximum value of the triangular sliding window, a ratio typically set to 2; f/l is the ramp factor of the f-th frame, and its product with s, (f/l) * s, is the weight of the f-th frame in the triangular sliding window; F_f denotes the probability of the target syllable in the classification information corresponding to the f-th frame, and varies with f. Owing to the shape of the triangular sliding window, the second speech frames on the left are weighted less than those on the right, so formula (2) shrinks, preserves or amplifies the classification information of each second speech frame, which improves the accuracy of speech processing.
For example, fig. 5 is a schematic diagram of a triangular sliding window provided in an embodiment of the present application. As shown in fig. 5, the right triangle represents a triangular sliding window, with a length of a second number of frames, sliding one frame at a time. The vertical rectangles represent the probability of the target syllable in the classification information of the second speech frames, and the higher the height of the rectangles, the higher the probability of the target syllable.
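A minimal sketch of the triangular-window score of formula (2) together with the consecutive-increase test; the window length, s = 2, the second threshold, and the third number (3) are illustrative assumptions:

```python
import numpy as np

def tri_window_endpoint(syllable_probs, third_target, l=10, s=2.0,
                        delta=0.5, consecutive=3):
    """Return the index of the fourth target frame (tail end point), or None."""
    probs = np.asarray(syllable_probs, dtype=float)
    weights = s * np.arange(1, l + 1) / l          # (f / l) * s for f = 1..l
    scores = [float(weights @ probs[t : t + l])    # formula (2): second information
              for t in range(third_target, len(probs) - l + 1)]
    run, first_hit = 0, None
    for k in range(1, len(scores)):
        if scores[k] - scores[k - 1] > delta:      # variation > second threshold
            run += 1
            if run == 1:
                # Frame the window slides past at the first exceedance (assumed
                # to be the window's trailing frame at that step).
                first_hit = third_target + k + l - 1
            if run >= consecutive:
                return first_hit                   # fourth target frame
        else:
            run, first_hit = 0, None
    return None
```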
306. The terminal wakes up the target device based on the tail end point of the target phrase.
In the embodiment of the present application, upon determining the tail end point of the target phrase, the terminal can send a wake-up instruction to the target device, where the wake-up instruction instructs the target device to transition from the standby state to the operating state. The target device may be provided with an automatic voice recognition model; when the target device is in the operating state, the model can recognize an input voice signal as an interactive instruction, which the target device then executes. Because the device is woken only after the complete target phrase has been detected, false wake-ups are avoided and device wake-up accuracy is improved.
In some embodiments, the terminal may wake up the target device first but not yet send the voice signal to the automatic voice recognition model in the target device; instead, it processes the received voice signal based on the target phrase before sending it, so that the wake-up word in the voice signal does not enter the automatic voice recognition model and cause interactive-instruction recognition to fail or go wrong. Correspondingly, in response to the probability of a wake-up syllable, wake-up character or wake-up word of the target phrase in any piece of classification information being greater than a wake-up threshold, the terminal wakes the target device. Then, the terminal determines the tail end point of the target phrase based on the classification information. With the target device woken, the terminal acquires a first voice signal from the voice signal and inputs the first voice signal and a second voice signal into the automatic voice recognition model. The first voice signal is the voice signal after the tail end point, and the second voice signal is a newly acquired voice signal. The wake-up threshold is used to judge whether to wake the target device: when the probability of the wake-up syllable, wake-up character or wake-up word in the classification information is greater than the wake-up threshold, the terminal judges that the voice signal is playing the target phrase and wakes the target device at that moment. Judging first whether to wake the target device and then whether the tail end point of the target phrase has been reached shortens the wait before wake-up and improves wake-up efficiency, while inputting only the first voice signal after the tail end point and the newly acquired second voice signal into the automatic voice recognition model prevents partial syllables, characters or words of the target phrase from being recognized as an interactive instruction, improving device wake-up accuracy.
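A minimal sketch of this early-wake-then-gate logic; the class layout, the threshold value, and the tail_end_detector callback are assumptions for illustration:

```python
from typing import Callable, List, Optional

class WakeGate:
    """Wake early on a high wake-unit probability, but forward only speech
    after the tail end point to the automatic speech recognition model."""
    def __init__(self, wake_threshold: float = 0.8):
        self.wake_threshold = wake_threshold
        self.awake = False
        self.tail_end: Optional[int] = None
        self.frames: List[object] = []            # buffered voice frames

    def feed(self, frame, wake_prob: float,
             tail_end_detector: Callable[[int], Optional[int]]):
        self.frames.append(frame)
        i = len(self.frames) - 1
        if not self.awake and wake_prob > self.wake_threshold:
            self.awake = True                     # wake the target device now
        if self.awake and self.tail_end is None:
            self.tail_end = tail_end_detector(i)  # e.g. rect/tri window result
        if self.awake and self.tail_end is not None:
            # First voice signal: frames after the tail end point; newly
            # captured frames keep being appended and forwarded to ASR.
            return self.frames[self.tail_end + 1:]
        return None                               # nothing sent to ASR yet
```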
Fig. 6 is a block diagram of a device wake-up apparatus according to an embodiment of the present application. The apparatus is for performing the steps in the above method, see fig. 6, the apparatus comprising: a classification module 601, a determination module 602, and a wake-up module 603.
A classification module 601, configured to classify a plurality of voice frames in the acquired voice signal to obtain a plurality of classification information, where the classification information indicates the probability that the voice frame includes each syllable, each character or each word of the target phrase;
a determining module 602, configured to determine, based on the plurality of classification information, a tail end point of the target phrase, where the tail end point is used to indicate a time when playing of the target phrase in the voice signal is finished;
a wake module 603 configured to wake the target device based on the tail end point of the target phrase.
In some embodiments, the classification module 601 is configured to perform feature extraction on a speech frame for any speech frame in a speech signal to obtain speech features of the speech frame; the speech features are classified based on a neural network to obtain the classification information of the speech frames, and the neural network is used for classifying syllables, characters or words.
In some embodiments, fig. 7 is a block diagram of another device wake-up apparatus provided in accordance with an embodiment of the present application. Referring to fig. 7, the apparatus further comprises:
a first obtaining module 604, configured to obtain, for a target speech frame, classification information of an adjacent speech frame adjacent to the target speech frame;
and a smoothing module 605, configured to perform smoothing processing on the classification information of the target speech frame based on the classification information and the smoothing coefficient of the adjacent speech frame.
In some embodiments, referring to fig. 7, the determining module 602 includes:
a first determining unit 6021, configured to determine a first target frame based on the plurality of classification information, where the first target frame is the speech frame that includes the target character or target word of the target phrase for the first time;
a first processing unit 6022, configured to process, based on a rectangular sliding window, a plurality of first speech frames located after the first target frame to obtain a plurality of pieces of first information, where the first information indicates the change trend of the probability of the target character or target word, the length of the rectangular sliding window is a first number of frames, and the sliding step is one frame;
a second determining unit 6023 configured to determine a tail end point of the target phrase based on the plurality of first information.
In some embodiments, the second determining unit 6023 is configured to determine a second target frame in response to the probability of the target character or target word in any two adjacent pieces of first information changing from greater than the first threshold to less than the first threshold, the second target frame being the first speech frame past which the rectangular sliding window slides when the adjacent first information is determined; and determine the start time of the second target frame as the tail end point of the target phrase.
In some embodiments, referring to fig. 7, the determining module 602 includes:
a third determining unit 6024, configured to determine a third target frame based on the plurality of classification information, the third target frame being a speech frame including the target syllable in the target phrase for the first time;
a second processing unit 6025, configured to process, based on a triangular sliding window, a plurality of second speech frames located after the third target frame to obtain a plurality of second information, where the plurality of second information indicate the change trend of the probability of the target syllable, the length of the triangular sliding window is a second number of frames, and the sliding step is one frame;
a fourth determining unit 6026, configured to determine a tail end point of the target phrase based on the plurality of second information.
In some embodiments, the fourth determining unit 6026 is configured to determine, based on the plurality of second information, the variation of the probability of the target syllable between two adjacent pieces of second information; to determine a fourth target frame in response to the variation being greater than a second threshold a third number of times in succession, the fourth target frame being the second speech frame over which the triangular sliding window slides when the variation first exceeds the second threshold; and to determine the start time of the fourth target frame as the tail end point of the target phrase. A sketch of this variation-based detection follows.
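A matching sketch for the triangular-window variant; the triangular weighting (via numpy's Bartlett window), the use of absolute differences as the "variation", and the frame convention are all assumptions made for illustration:

```python
import numpy as np

def tri_window_info(probs, win_len):
    """Second information: triangularly weighted mean of the target-syllable
    probability over win_len frames, slid one frame at a time."""
    w = np.bartlett(win_len)
    w = w / w.sum()
    return [float(np.dot(probs[i:i + win_len], w))
            for i in range(len(probs) - win_len + 1)]

def find_tail_end_tri(probs, win_len, delta_threshold, n_times):
    """Tail end point per the triangular-window variant: the variation between
    adjacent second information must exceed delta_threshold n_times in
    succession; the returned frame corresponds to where the run began."""
    info = tri_window_info(probs, win_len)
    run_start, run_len = None, 0
    for i in range(1, len(info)):
        if abs(info[i] - info[i - 1]) > delta_threshold:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len >= n_times:
                return run_start + win_len - 1
        else:
            run_len = 0
    return None

probs = [0.1, 0.2, 0.8, 0.9, 0.3, 0.1, 0.1]
print(find_tail_end_tri(probs, win_len=3, delta_threshold=0.15, n_times=2))  # -> 5
```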
In some embodiments, referring to fig. 7, the apparatus further comprises:
a second obtaining module 606, configured to obtain a currently input speech signal;
and a signal processing module 607, configured to process the speech signal based on a speech sliding window to obtain the plurality of speech frames, where the length of the speech sliding window is a first duration, the sliding step is a second duration, and the second duration is less than the first duration, as sketched below.
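A minimal sketch of the framing step; the 25 ms window (first duration) and 10 ms step (second duration) are common defaults assumed here, not values fixed by the embodiment:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, hop_ms: float = 10.0):
    """Split a speech signal into overlapping speech frames with a sliding
    window whose length (frame_ms) exceeds its step (hop_ms)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop_len)]

frames = frame_signal(np.zeros(16000), 16000)  # 1 s of audio at 16 kHz
print(len(frames))  # 98 overlapping frames
```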
In some embodiments, the wake-up module 603 is further configured to wake up the target device in response to the probability of a wake-up syllable, wake-up character, or wake-up word of the target phrase in any piece of classification information being greater than a wake-up threshold; to acquire, with the target device awakened, a first speech signal from the speech signal, where the first speech signal is the portion of the speech signal after the tail end point; and to input the first speech signal and a second speech signal into an automatic speech recognition model, where the second speech signal is a newly acquired speech signal and the automatic speech recognition model recognizes speech signals as interaction instructions. A sketch of this wake-then-recognize flow follows.
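Finally, a structural sketch of the wake-then-recognize flow; `asr_model` stands in for any automatic speech recognition interface and `capture_new_audio` is a hypothetical helper for the newly acquired second speech signal, since the embodiment names neither:

```python
import numpy as np

def capture_new_audio(n_samples: int = 1600) -> np.ndarray:
    """Hypothetical stand-in for capturing the newly acquired second signal."""
    return np.zeros(n_samples)

def handle_wake(frame_probs, wake_threshold, tail_frame, frames, asr_model):
    """Wake on any unit probability above the threshold, then hand the audio
    after the tail end point, plus newly captured audio, to the ASR model."""
    if not any(float(np.max(p)) > wake_threshold for p in frame_probs):
        return None  # no wake-up syllable/character/word detected
    tail = frames[tail_frame + 1:]  # first speech signal: after the tail end point
    first_signal = np.concatenate(tail) if tail else np.zeros(0)
    second_signal = capture_new_audio()  # second speech signal (newly acquired)
    return asr_model(np.concatenate([first_signal, second_signal]))
```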
The embodiment of the present application provides a device wake-up method. By classifying a plurality of speech frames in an acquired speech signal, the probability that each speech frame includes each syllable, each character, or each word of a target phrase can be determined. Based on these probabilities, the moment at which the target phrase in the speech signal finishes playing can be determined, and the target device is then awakened at that moment. The target device is thus awakened only when the complete target phrase has been detected, which avoids false wake-ups and improves wake-up accuracy.
It should be noted that the device wake-up apparatus provided in the foregoing embodiment is illustrated only by the division of the above functional modules. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the device wake-up apparatus and the device wake-up method provided in the foregoing embodiments belong to the same concept; the specific implementation process is described in detail in the method embodiments and is not repeated here.
In this embodiment of the present application, the computer device can be configured as a terminal or a server. When the computer device is configured as a terminal, the terminal can serve as the execution subject implementing the technical solution provided in the embodiments of the present application; when the computer device is configured as a server, the server can serve as the execution subject. Alternatively, the technical solution can be implemented through interaction between the terminal and the server, which is not limited in this embodiment of the present application.
When the computer device is configured as a terminal, fig. 8 is a block diagram of a terminal 800 according to an embodiment of the present application. The terminal 800 may be a portable mobile terminal such as a smartphone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The terminal 800 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 800 includes: a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, for example a 4-core or an 8-core processor. The processor 801 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor: the main processor processes data in the awake state and is also called a CPU (Central Processing Unit); the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 801 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 802 may include one or more computer-readable storage media, which may be non-transitory. The memory 802 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 802 is used to store at least one computer program that is executed by the processor 801 to implement the device wake-up method provided by the method embodiments of the present application.
In some embodiments, the terminal 800 may optionally further include a peripheral interface 803 and at least one peripheral. The processor 801, the memory 802, and the peripheral interface 803 may be connected by buses or signal lines. Each peripheral may be connected to the peripheral interface 803 by a bus, a signal line, or a circuit board. Specifically, the peripherals include at least one of a radio frequency circuit 804, a display 805, a camera assembly 806, an audio circuit 807, and a power supply 808.
The peripheral interface 803 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 801 and the memory 802. In some embodiments, the processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The radio frequency circuit 804 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 804 communicates with communication networks and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. In some embodiments, the radio frequency circuit 804 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 804 may communicate with other terminals via at least one wireless communication protocol, including but not limited to: the World Wide Web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display 805 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, it also has the ability to capture touch signals on or above its surface. Such a touch signal may be input to the processor 801 as a control signal for processing. In this case, the display 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 805, disposed on the front panel of the terminal 800; in other embodiments, there may be at least two displays 805, respectively disposed on different surfaces of the terminal 800 or in a folded design; in still other embodiments, the display 805 may be a flexible display disposed on a curved or folded surface of the terminal 800. The display 805 may even be arranged in an irregular, non-rectangular shape, that is, a shaped screen. The display 805 may be an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display, or the like.
The camera assembly 806 is used to capture images or video. In some embodiments, the camera assembly 806 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 806 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash; a dual-color-temperature flash combines a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment, converts them into electrical signals, and inputs them to the processor 801 for processing or to the radio frequency circuit 804 for voice communication. For stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 800. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker converts electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The speaker may be a conventional diaphragm speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert an electrical signal not only into sound waves audible to humans but also into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 807 may also include a headphone jack.
The power supply 808 is used to power the various components in the terminal 800. The power supply 808 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 808 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery, charged through a wired line, or a wireless rechargeable battery, charged through a wireless coil. The rechargeable battery may also support fast-charge technology.
In some embodiments, the terminal 800 also includes one or more sensors 809. The one or more sensors 809 include, but are not limited to: acceleration sensor 810, gyro sensor 811, pressure sensor 812, optical sensor 813, and proximity sensor 814.
The acceleration sensor 810 can detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the terminal 800. For example, the acceleration sensor 810 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 801 may control the display 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 810. The acceleration sensor 810 may also be used for collection of motion data of a game or a user.
The gyro sensor 811 may detect a body direction and a rotation angle of the terminal 800, and the gyro sensor 811 may cooperate with the acceleration sensor 810 to collect a 3D motion of the user with respect to the terminal 800. The processor 801 may implement the following functions according to the data collected by the gyro sensor 811: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 812 may be disposed on a side frame of the terminal 800 and/or on a lower layer of the display 805. When the pressure sensor 812 is disposed on a side frame of the terminal 800, the user's holding signal on the terminal 800 can be detected, and the processor 801 performs left/right hand recognition or shortcut operations according to the holding signal collected by the pressure sensor 812. When the pressure sensor 812 is disposed on the lower layer of the display 805, the processor 801 controls the operability controls on the UI according to the user's pressure operation on the display 805. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 813 is used to collect the ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the display 805 based on the ambient light intensity collected by the optical sensor 813: when the ambient light intensity is high, the display brightness of the display 805 is increased; when the ambient light intensity is low, the display brightness of the display 805 is reduced. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 according to the ambient light intensity collected by the optical sensor 813.
The proximity sensor 814, also known as a distance sensor, is typically disposed on the front panel of the terminal 800 and is used to collect the distance between the user and the front surface of the terminal 800. In one embodiment, when the proximity sensor 814 detects that the distance between the user and the front surface of the terminal 800 gradually decreases, the processor 801 controls the display 805 to switch from the screen-on state to the screen-off state; when the proximity sensor 814 detects that the distance gradually increases, the processor 801 controls the display 805 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 8 does not limit the terminal 800; the terminal may include more or fewer components than those shown, combine certain components, or use a different arrangement of components.
When the computer device is configured as a server, fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application. The server 900 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 901 and one or more memories 902, where the memories 902 store at least one computer program that is loaded and executed by the processors 901 to implement the device wake-up method provided by each of the method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may include other components for implementing device functions, which are not described here again.
The embodiment of the present application further provides a computer-readable storage medium, where at least one segment of computer program is stored in the computer-readable storage medium, and the at least one segment of computer program is loaded and executed by a processor of a computer device to implement the operations performed by the computer device in the device wake-up method in the foregoing embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In some embodiments, the computer program according to the embodiments of the present application may be deployed to be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network; the multiple computer devices distributed across multiple sites and interconnected by a communication network may constitute a blockchain system.
Embodiments of the present application further provide a computer program product, which includes computer program code stored in a computer readable storage medium. The processor of the computer device reads the computer program code from the computer readable storage medium, and the processor executes the computer program code to cause the computer device to perform the device wake-up method provided in the various alternative implementations described above.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.
The above description is intended only to illustrate the alternative embodiments of the present application, and should not be construed as limiting the present application, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (13)

1. A method of device wakeup, the method comprising:
classifying a plurality of voice frames in the acquired voice signal to obtain a plurality of classification information, wherein the classification information is used for indicating the probability that the voice frames include each syllable, each character, or each word in a target phrase;
determining a tail end point of the target phrase based on the plurality of classification information, wherein the tail end point is used for indicating the moment when the target phrase in the voice signal is played to be finished;
waking up a target device based on a tail-end point of the target phrase.
2. The method of claim 1, wherein the classifying the plurality of speech frames in the acquired speech signal to obtain a plurality of classification information comprises:
for any voice frame in the voice signal, performing feature extraction on the voice frame to obtain the voice features of the voice frame;
and classifying the voice features based on a neural network to obtain the classification information of the voice frame, wherein the neural network is used for classifying syllables, characters or words.
3. The method of claim 1, further comprising:
for a target voice frame, acquiring the classification information of a voice frame adjacent to the target voice frame, wherein the target voice frame is any one of the plurality of voice frames;
and smoothing the classification information of the target voice frame based on the classification information of the adjacent voice frame and a smoothing coefficient, wherein the smoothing smooths the change trend of the probability of each syllable, each character, or each word of the target phrase across the adjacent voice frames.
4. The method of claim 1, wherein said determining a tail-end point of said target phrase based on said plurality of classification information comprises:
determining a first target frame based on the plurality of classification information, wherein the first target frame is the first voice frame that comprises a target character or a target word in the target phrase;
processing a plurality of first voice frames after the first target frame based on a rectangular sliding window to obtain a plurality of first information, wherein the plurality of first information are used for indicating the change trend of the probability of the target character or the target word, the length of the rectangular sliding window is a first number of frames, and the sliding step length is one frame;
determining a tail end point of the target phrase based on the plurality of first information.
5. The method of claim 4, wherein said determining a tail-end point of said target phrase based on said plurality of first information comprises:
in response to the probability of the target character or the target word changing, in any two adjacent pieces of first information, from greater than a first threshold to less than the first threshold, determining a second target frame, wherein the second target frame is the first voice frame over which the rectangular sliding window slides when the adjacent first information is determined;
and determining the starting time of the second target frame as the tail end point of the target phrase.
6. The method of claim 1, wherein said determining a tail-end point of said target phrase based on said plurality of classification information comprises:
determining a third target frame based on the plurality of classification information, wherein the third target frame is the first voice frame comprising a target syllable in the target phrase;
processing a plurality of second voice frames behind the third target frame based on a triangular sliding window to obtain a plurality of second information, wherein the plurality of second information is used for indicating the change trend of the probability of the target syllable, the length of the triangular sliding window is a second number of frames, and the sliding step length is one frame;
and determining a tail end point of the target phrase based on the plurality of second information.
7. The method of claim 6, wherein said determining a tail-end point of said target phrase based on said plurality of second information comprises:
determining the variation of the probability of the target syllable between two adjacent pieces of second information based on the plurality of second information;
in response to the variation being greater than a second threshold a third number of times in succession, determining a fourth target frame, wherein the fourth target frame is the second voice frame over which the triangular sliding window slides when the variation is greater than the second threshold for the first time;
and determining the starting time of the fourth target frame as the tail end point of the target phrase.
8. The method of claim 1, further comprising:
acquiring the currently input voice signal;
and processing the voice signal based on a voice sliding window to obtain the plurality of voice frames, wherein the length of the voice sliding window is a first time length, the sliding step length is a second time length, and the second time length is smaller than the first time length.
9. The method of claim 1, further comprising:
awakening the target device in response to the probability of an awakening syllable, an awakening character, or an awakening word of the target phrase in any classification information being greater than an awakening threshold;
the waking up the target device based on the tail end point of the target phrase comprises:
under the condition that the target equipment is awakened, acquiring a first voice signal from the voice signal, wherein the first voice signal is a voice signal behind the tail end point;
and inputting the first voice signal and the second voice signal into an automatic voice recognition model, wherein the second voice signal is a newly acquired voice signal, and the automatic voice recognition model is used for recognizing the voice signal as an interactive instruction.
10. An apparatus wake-up device, the apparatus comprising:
the classification module is used for classifying a plurality of voice frames in the acquired voice signal to obtain a plurality of classification information, wherein the classification information is used for indicating the probability that the voice frames include each syllable, each character, or each word in a target phrase;
a determining module, configured to determine, based on the plurality of classification information, a tail end point of the target phrase, where the tail end point is used to indicate a time when playing of the target phrase in the voice signal is finished;
and the awakening module is used for awakening the target equipment based on the tail end point of the target phrase.
11. A computer device comprising a processor and a memory, the memory being configured to store at least one piece of computer program, the at least one piece of computer program being loaded by the processor and being configured to perform the device wake-up method of any one of claims 1 to 9.
12. A computer-readable storage medium for storing at least one piece of computer program for executing the device wake-up method of any one of claims 1 to 9.
13. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the device wake-up method according to any of the claims 1 to 9.
CN202211073943.8A 2022-09-02 2022-09-02 Equipment awakening method and device, computer equipment and storage medium Pending CN115527532A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211073943.8A CN115527532A (en) 2022-09-02 2022-09-02 Equipment awakening method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN115527532A (en) 2022-12-27

Family

ID=84697072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211073943.8A Pending CN115527532A (en) 2022-09-02 2022-09-02 Equipment awakening method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115527532A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination