CN118173088A - Control command triggering method, device, equipment and storage medium of intelligent equipment - Google Patents

Control command triggering method, device, equipment and storage medium of intelligent equipment

Info

Publication number
CN118173088A
Authority
CN
China
Prior art keywords
command word
language
network
voice
recognition
Prior art date
Legal status
Pending
Application number
CN202211579605.1A
Other languages
Chinese (zh)
Inventor
龙良曲
郭士嘉
Current Assignee
Insta360 Innovation Technology Co Ltd
Original Assignee
Insta360 Innovation Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Insta360 Innovation Technology Co Ltd filed Critical Insta360 Innovation Technology Co Ltd
Priority to CN202211579605.1A priority Critical patent/CN118173088A/en
Publication of CN118173088A publication Critical patent/CN118173088A/en
Pending legal-status Critical Current


Abstract

The application relates to a control command triggering method and apparatus for an intelligent device, and to a corresponding computer device, storage medium and computer program product. The method comprises the following steps: performing feature extraction on at least two voice segments intercepted from a voice signal to obtain voice features; performing command word recognition on the voice features through a command word recognition sub-network of a recognition network to obtain a command word recognition result; performing language identification on the voice features through a language identification sub-network of the recognition network to obtain a language identification result; and when the command word recognition result and the language identification result satisfy trigger conditions, triggering a control command based on the command word recognition result. By adopting the method, false triggering of control commands can be avoided.

Description

Control command triggering method, device, equipment and storage medium of intelligent equipment
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for triggering a control command of an intelligent device.
Background
With the development of artificial intelligence technology, intelligent devices have become increasingly widely used. In use, control commands of an intelligent device are triggered by voice signals, and the device is controlled based on those commands. In the conventional art, when a voice signal is not in the target language, non-command words are easily recognized as command words, causing false triggering of voice commands.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a control command triggering method, apparatus, computer device, computer-readable storage medium, and computer program product for an intelligent device that can avoid such false triggering.
In a first aspect, the application provides a control command triggering method for an intelligent device. The method comprises the following steps:
performing feature extraction on at least two voice segments intercepted from a voice signal to obtain voice features;
performing command word recognition on the voice features through a command word recognition sub-network of a recognition network to obtain a command word recognition result;
performing language identification on the voice features through a language identification sub-network of the recognition network to obtain a language identification result;
and when the command word recognition result and the language identification result satisfy trigger conditions, triggering a control command based on the command word recognition result.
In a second aspect, the application further provides a control command triggering apparatus for an intelligent device. The apparatus comprises:
an extraction module, configured to perform feature extraction on at least two voice segments intercepted from a voice signal to obtain voice features;
a command word recognition module, configured to perform command word recognition on the voice features through a command word recognition sub-network of a recognition network to obtain a command word recognition result;
a language identification module, configured to perform language identification on the voice features through a language identification sub-network of the recognition network to obtain a language identification result;
and a triggering module, configured to trigger a control command based on the command word recognition result when the command word recognition result and the language identification result satisfy trigger conditions.
In one embodiment, the extraction module is further configured to:
determine a segment interception length and a sliding window movement step;
move the sliding window according to the sliding window movement step;
and intercept, based on the segment interception length, the voice signal within the moved sliding window to obtain at least two voice segments, and perform feature extraction on the voice segments.
In one embodiment, the trigger conditions include a command word condition and a language condition, and the triggering module is further configured to:
trigger a control command based on the command word recognition result if the command word recognition result satisfies the command word condition and the language identification result satisfies the language condition.
In one embodiment, the command word recognition module is further configured to:
perform command word recognition on the voice features through the command word recognition sub-network of the recognition network to obtain a command word recognition vector, the command word recognition vector comprising at least two vector elements, each vector element representing the probability that the voice segment contains a particular command word or the probability that it contains no command word;
sort the vector elements in the command word recognition vector;
and determine, based on the sorting result, the command word recognition result corresponding to the voice segment.
In one embodiment, the language identification module is further configured to:
perform language identification on the voice features through the language identification sub-network of the recognition network to obtain a language identification vector, the language identification vector comprising at least two vector elements, each vector element representing the probability that the voice segment is in the target language or in a language other than the target language;
sort the vector elements of the language identification vector;
and determine, based on the sorting result, the language identification result corresponding to the voice segment.
In one embodiment, the recognition network is obtained by training a pre-trained recognition network, and the pre-trained recognition network comprises a pre-trained command word recognition sub-network and a pre-trained language identification sub-network; the apparatus further comprises:
the extraction module, further configured to perform feature extraction on at least two voice segment samples intercepted from a voice signal sample to obtain voice sample features;
the command word recognition module, further configured to perform command word recognition on the voice sample features through the pre-trained command word recognition sub-network to obtain a command word training result;
the language identification module, further configured to perform language identification on the voice sample features through the pre-trained language identification sub-network to obtain a language training result;
a calculation module, configured to calculate a loss value based on the command word training result and the language training result;
and a training module, configured to train the pre-trained recognition network according to the loss value to obtain the recognition network.
In one embodiment, the calculation module is further configured to:
calculate a command word sub-network loss value based on the command word training result;
calculate a language sub-network loss value based on the language training result;
and calculate the loss value from the command word sub-network loss value and the language sub-network loss value according to a weighting parameter.
In a third aspect, the application further provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, performs the following steps:
performing feature extraction on at least two voice segments intercepted from a voice signal to obtain voice features;
performing command word recognition on the voice features through a command word recognition sub-network of a recognition network to obtain a command word recognition result;
performing language identification on the voice features through a language identification sub-network of the recognition network to obtain a language identification result;
and when the command word recognition result and the language identification result satisfy trigger conditions, triggering a control command based on the command word recognition result.
In a fourth aspect, the application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the following steps:
performing feature extraction on at least two voice segments intercepted from a voice signal to obtain voice features;
performing command word recognition on the voice features through a command word recognition sub-network of a recognition network to obtain a command word recognition result;
performing language identification on the voice features through a language identification sub-network of the recognition network to obtain a language identification result;
and when the command word recognition result and the language identification result satisfy trigger conditions, triggering a control command based on the command word recognition result.
In a fifth aspect, the application further provides a computer program product comprising a computer program which, when executed by a processor, implements the following steps:
performing feature extraction on at least two voice segments intercepted from a voice signal to obtain voice features;
performing command word recognition on the voice features through a command word recognition sub-network of a recognition network to obtain a command word recognition result;
performing language identification on the voice features through a language identification sub-network of the recognition network to obtain a language identification result;
and when the command word recognition result and the language identification result satisfy trigger conditions, triggering a control command based on the command word recognition result.
According to the control command triggering method, apparatus, computer device, storage medium and computer program product for an intelligent device, feature extraction is performed on at least two voice segments intercepted from a voice signal to obtain voice features. Command word recognition is performed on the voice features through a command word recognition sub-network of a recognition network to obtain a command word recognition result, and language identification is performed on the voice features through a language identification sub-network of the recognition network to obtain a language identification result. When the command word recognition result and the language identification result satisfy the trigger conditions, a control command is triggered based on the command word recognition result. The recognition network performs command word recognition and language identification on the voice segments simultaneously, and the control command is triggered only when both results satisfy the trigger conditions, so false triggering caused by voice signals in a non-target language can be effectively avoided. Moreover, the computer device obtains both the command word recognition result and the language identification result by running a single recognition model, which improves the recognition speed and thus the triggering efficiency of control commands.
Drawings
FIG. 1 is an application environment diagram of a control command triggering method of a smart device in one embodiment;
FIG. 2 is a flow chart of a control command triggering method of the smart device according to an embodiment;
FIG. 3 is a schematic diagram of a spectrum conversion process in one embodiment;
FIG. 4 is a schematic diagram of an identification network in one embodiment;
FIG. 5 is a schematic diagram of speech segment interception through a sliding window in one embodiment;
FIG. 6 is a flow diagram of a method of identifying network training in one embodiment;
FIG. 7 is a flow chart of a control command triggering method of the smart device according to another embodiment;
FIG. 8 is a block diagram of the control command trigger of the smart device in one embodiment;
FIG. 9 is a block diagram of another embodiment of a control command trigger device for a smart device;
FIG. 10 is an internal block diagram of a computer device in one embodiment;
FIG. 11 is an internal block diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The control command triggering method for an intelligent device provided by the embodiments of the application can be applied in the application environment shown in FIG. 1. The smart device 102 communicates with the server 104 via a network and obtains a trained recognition network from the server 104. A data storage system may store data that the server 104 needs to process; it may be integrated on the server 104 or located on a cloud or other network server. The smart device 102 performs feature extraction on at least two voice segments intercepted from a voice signal to obtain voice features; performs command word recognition on the voice features through a command word recognition sub-network of the recognition network to obtain a command word recognition result; performs language identification on the voice features through a language identification sub-network of the recognition network to obtain a language identification result; and, when the command word recognition result and the language identification result satisfy the trigger conditions, triggers a control command based on the command word recognition result. The smart device 102 may be a smartphone, tablet computer, Internet of Things device or portable wearable device; the Internet of Things device may be an image acquisition device, smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, or the like. The image acquisition device may be an ordinary camera, an action camera, a panoramic camera, a video camera, or the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a control command triggering method for an intelligent device is provided. The method is described, by way of illustration, as applied to the smart device in FIG. 1, and includes the following steps:
s202, extracting features of at least two voice fragments intercepted from the voice signal to obtain voice features.
A voice signal is a signal whose sound intensity changes over time. The voice signal may be in any of various languages, for example German, Japanese, French or Chinese, and may contain command words issued by the user to the smart device. For example, the voice signal may be a 5-second signal sent by the user to the smart device containing a command word such as "take a photo", "turn on" or "turn off". A voice segment is a segment intercepted from the voice signal; for example, from a 10-second voice signal, a 1.6-second voice segment may be intercepted. The voice features are features learned by a neural network and can be represented by feature vectors, for example a feature vector of length 512.
In one embodiment, as shown in FIG. 3, S202 specifically includes: when a voice signal is acquired, intercepting at least two voice segments from the voice signal and pre-emphasizing each voice segment to compensate its high-frequency components; windowing the pre-emphasized voice segment and performing a discrete Fourier transform on the windowed voice segment; filtering the result of the discrete Fourier transform with a Mel filter bank and taking the logarithm; and finally performing a discrete cosine transform on the logarithm to obtain a frequency spectrum corresponding to the voice segment, for example an MFCC (Mel Frequency Cepstrum Coefficient) spectrum. The smart device inputs the spectrum into a neural network, which extracts the voice features of the voice segment from it. The neural network may be, for example, a convolutional neural network or a residual convolutional neural network, such as a Conv2D (two-dimensional convolution) network.
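A minimal sketch of this spectrum-conversion pipeline is given below in Python with NumPy and SciPy. The sampling rate, frame length, hop, filter count and pre-emphasis coefficient are illustrative assumptions, not values fixed by the application.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_segment(segment, sr=16000, frame_len=400, hop=160,
                      n_fft=512, n_mels=26, n_mfcc=13, pre=0.97):
    # Pre-emphasis to compensate the high-frequency components.
    x = np.append(segment[0], segment[1:] - pre * segment[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)
    # Discrete Fourier transform -> power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular Mel filter bank.
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Logarithm of the filter-bank energies, then discrete cosine transform.
    log_e = np.log(power @ fbank.T + 1e-10)
    return dct(log_e, type=2, axis=1, norm='ortho')[:, :n_mfcc]

# e.g. a 1.6 s segment at 16 kHz yields a (frames, 13) MFCC spectrum:
# mfcc_from_segment(np.random.randn(25600)).shape -> (158, 13)
```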
S204, performing command word recognition on the voice features through a command word recognition sub-network of the recognition network to obtain a command word recognition result.
The recognition network is a multi-task joint learning model that performs both command word recognition and language identification; it may be a neural network model such as a convolutional neural network, a residual convolutional neural network or a feed-forward neural network. The command word recognition sub-network is a neural network that recognizes command words from the voice features and is a sub-network of the recognition network. A command word is a word instructing the smart device to execute a command, and may be in any of various languages such as English, Japanese, French or Chinese; for example, "turn on flash", "turn up" or "turn off sound" in Chinese, or "Power-up" or "Power-off" in English. The command word recognition result indicates the command word contained in the voice segment and may contain the probability corresponding to that command word, for example "command word: power on, probability: 0.7". When the command word recognition sub-network detects that the voice segment contains no command word, the result may be "no command word, probability: 0.75".
In one embodiment, S204 specifically includes: performing command word recognition on the voice features through the command word recognition sub-network of the recognition network to obtain a command word recognition vector, the command word recognition vector comprising at least two vector elements, each vector element representing the probability that the voice segment contains a particular command word or the probability that it contains no command word; sorting the vector elements in the command word recognition vector; and determining, based on the sorting result, the command word recognition result corresponding to the voice segment.
For example, the command word recognition vector is an (N+1)-dimensional vector containing N+1 vector elements: N vector elements represent the probabilities that the voice segment contains each of N command words, and the last vector element represents the probability that the voice segment contains no command word. That is, the command word recognition vector is (a1, a2, …, aN, aN+1), where a1 represents the probability that the voice segment contains command word 1, a2 the probability that it contains command word 2, and so on up to aN, while aN+1 represents the probability that the voice segment is speech background containing no command word. The smart device may sort the vector elements of the command word recognition vector by size, for example in descending order, and determine the command word recognition result based on the sorted result: for example, the largest vector element and its corresponding command word are taken as the command word recognition result, or the vector elements whose ranks reach a preset rank and their corresponding command words are taken as the command word recognition result.
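The decoding of such a recognition vector can be sketched as follows; the command word list and the assumption that the network output is already a softmax probability vector are illustrative. The same decoding applies to the language identification vector of S206 below.

```python
import numpy as np

COMMAND_WORDS = ["take a photo", "power on", "power off"]  # hypothetical list

def decode_command(recognition_vector):
    # Elements a1..aN are per-command-word probabilities; the last
    # element aN+1 is the probability that no command word is present.
    order = np.argsort(recognition_vector)[::-1]  # sort from large to small
    best = int(order[0])
    prob = float(recognition_vector[best])
    if best == len(COMMAND_WORDS):                # background element won
        return {"command_word": None, "probability": prob}
    return {"command_word": COMMAND_WORDS[best], "probability": prob}

# decode_command(np.array([0.1, 0.7, 0.1, 0.1]))
# -> {"command_word": "power on", "probability": 0.7}
```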
S206, performing language identification on the voice features through a language identification sub-network of the recognition network to obtain a language identification result.
The language identification sub-network is a neural network that performs language identification on the voice features and is a sub-network of the recognition network. The language identification result represents the language category of the voice signal and may include the language and its corresponding probability, for example "language: English, probability: 0.48".
In one embodiment, S206 specifically includes: performing language identification on the voice features through the language identification sub-network of the recognition network to obtain a language identification vector, the language identification vector comprising at least two vector elements, each vector element representing the probability that the voice segment is in the target language or in a language other than the target language; sorting the vector elements of the language identification vector; and determining, based on the sorting result, the language identification result corresponding to the voice segment.
For example, the language identification vector is an (M+1)-dimensional vector containing M+1 vector elements: M vector elements represent the probabilities that the voice segment is in each of M languages other than the target language, and the last vector element represents the probability that the voice segment is in the target language. The target language is the language currently set on the smart device; for example, if the device's language is set to Chinese, the target language is Chinese. That is, the language identification vector is (b1, b2, …, bM, bM+1), where b1 to bM represent the probabilities that the voice segment is in a language other than the target language, and bM+1 represents the probability that it is in the target language; when the target language is Chinese, b1 may represent the probability that the voice segment is English, and bM+1 the probability that it is Chinese.
The smart device may sort the vector elements of the language identification vector by size, for example in descending order, and determine the language identification result based on the sorted result: for example, the largest vector element and its corresponding language are taken as the language identification result, or the vector elements whose ranks reach a preset rank and their corresponding languages are taken as the language identification result.
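Continuing the sketch above, the language check reduces to testing whether the last element of the (M+1)-dimensional vector is the largest one (assuming the ordering described in the text, with the target language last):

```python
import numpy as np

def is_target_language(lang_vector):
    # b1..bM are non-target languages; bM+1 (the last element) is the
    # target language currently set on the device.
    return int(np.argmax(lang_vector)) == len(lang_vector) - 1
```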
In one embodiment, as shown in FIG. 4, the recognition network includes a feature extraction sub-network, a command word recognition sub-network and a language identification sub-network. The feature extraction sub-network performs feature extraction on the spectrum corresponding to the voice segment to obtain the voice features. The voice features are then fed into the command word recognition sub-network and the language identification sub-network respectively: the former performs command word recognition on the voice features to obtain the command word recognition result, and the latter performs language identification on the voice features to obtain the language identification result.
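A minimal PyTorch sketch of such a two-head network is shown below. The layer sizes, the 512-dimensional feature, and the counts N and M are assumptions for illustration; the application does not fix a concrete architecture.

```python
import torch
import torch.nn as nn

class RecognitionNetwork(nn.Module):
    def __init__(self, n_commands=10, n_languages=5, feat_dim=512):
        super().__init__()
        # Feature extraction sub-network over the MFCC spectrum.
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Command word recognition head: N command words + "no command word".
        self.command_head = nn.Linear(feat_dim, n_commands + 1)
        # Language identification head: M other languages + target language.
        self.language_head = nn.Linear(feat_dim, n_languages + 1)

    def forward(self, spectrum):  # spectrum: (batch, 1, frames, n_mfcc)
        feats = self.features(spectrum)
        # Raw logits; apply softmax at inference to obtain the command word
        # recognition vector and the language identification vector.
        return self.command_head(feats), self.language_head(feats)
```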
S208, when the command word recognition result and the language identification result satisfy the trigger conditions, triggering the control command based on the command word recognition result.
The trigger conditions are the conditions for judging, from the command word recognition result and the language identification result, whether to trigger the control command. In one embodiment, the trigger conditions include a command word condition and a language condition, and S208 specifically includes: triggering a control command based on the command word recognition result if the command word recognition result satisfies the command word condition and the language identification result satisfies the language condition.
The command word condition judges whether the command word recognition result meets the requirement for triggering the control command. For example, the command word condition may be that the number of voice segments in the voice signal containing a command word is greater than a preset value, or additionally that the probability value of the command word in each such segment's recognition result is greater than a preset probability value. The language condition judges whether the language identification result meets the requirement for triggering the control command. For example, the language condition may be that the language identification result of every voice segment in the voice signal is the target language currently set on the smart device, or that the number of voice segments identified as the target language is greater than a preset number.
In one embodiment, the smart device stores the command word recognition result and language identification result of each voice segment intercepted from the voice signal in a cache pool; the results of different segments may be the same or different. When the command word recognition results and language identification results in the cache pool satisfy the trigger conditions, the control command is triggered based on the command word recognition results. Specifically, the smart device may intercept a plurality of voice segments from a voice signal of preset duration, for example 10 voice segments from a 10-second voice signal, and store each segment's results in the cache pool. For example, when more than 8 segments are recognized as containing a command word, the probability value of each such command word is greater than a preset value, and the language identification results of all 10 segments are the target language, the control command is triggered based on the command word recognition results.
In one embodiment, S208 specifically includes: when the command word recognition result and the language identification result satisfy the trigger conditions, determining a target command word from the candidate command words based on the command word recognition result, and triggering the control command according to the target command word. For example, with 10 candidate command words, the smart device may select the command word with the largest probability value in the command word recognition results as the target command word, or the command word recognized in the largest number of voice segments, and trigger the control command accordingly; for example, when the target command word is "power on", a power-on command is triggered.
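The cache-pool check described above can be sketched as follows; the thresholds (8 confident command-word segments out of 10, minimum probability 0.5) are the illustrative values from the text, not limits fixed by the application.

```python
from collections import Counter

def check_trigger(results, min_segments=8, min_prob=0.5):
    """results: list of (command_word_or_None, probability, is_target_language)
    tuples, one per cached voice segment. Returns the target command word,
    or None if the trigger conditions are not satisfied."""
    # Language condition: every buffered segment is in the target language.
    if not all(is_target for _, _, is_target in results):
        return None
    # Command word condition: enough segments confidently contain a command word.
    hits = [w for w, p, _ in results if w is not None and p >= min_prob]
    if len(hits) < min_segments:
        return None
    # Target command word: the word recognized in the most segments.
    return Counter(hits).most_common(1)[0][0]
```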
When the command word recognition result satisfies the command word condition and the language identification result satisfies the language condition, the smart device triggers the control command based on the command word recognition result. The control command is therefore triggered only after the voice signal has been judged to be in the target language currently used by the smart device, which effectively avoids false triggering by voice signals in non-target languages and improves the accuracy of control command triggering.
In the above embodiment, feature extraction is performed on at least two voice segments intercepted from the voice signal to obtain voice features. Command word recognition is performed on the voice features through the command word recognition sub-network of the recognition network to obtain a command word recognition result, and language identification is performed on the voice features through the language identification sub-network of the recognition network to obtain a language identification result. When the command word recognition result and the language identification result satisfy the trigger conditions, the control command is triggered based on the command word recognition result. The recognition network performs command word recognition and language identification on the voice segments simultaneously, and the control command is triggered only when both results satisfy the trigger conditions, so false triggering caused by voice signals in a non-target language can be effectively avoided. Moreover, the computer device obtains both results by running a single recognition model, which improves the recognition speed and thus the triggering efficiency of control commands.
In one embodiment, S202 specifically includes: determining a segment interception length and a sliding window movement step; moving the sliding window according to the sliding window movement step; and intercepting, based on the segment interception length, the voice signal within the moved sliding window to obtain at least two voice segments, and performing feature extraction on the voice segments.
The segment interception length is the length of each intercepted voice segment, for example 1.6 seconds, 1 second or 500 milliseconds. The sliding window movement step is the distance the sliding window moves each time, for example 200, 100 or 300 milliseconds. The smart device moves the sliding window by the movement step, sets the window size to the segment interception length, and intercepts the voice signal within the moved window. For example, as shown in FIG. 5, with a movement step of 200 milliseconds the smart device moves the window 200 milliseconds each time and intercepts the voice signal within the 1.6-second window, obtaining a plurality of voice segments.
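A minimal sketch of this sliding-window interception, using the 1.6-second window and 200-millisecond step from the example above and an assumed 16 kHz sampling rate:

```python
def intercept_segments(signal, sr=16000, seg_len_s=1.6, step_s=0.2):
    # Convert the segment interception length and movement step to samples.
    seg_len, step = int(seg_len_s * sr), int(step_s * sr)
    # Slide the window over the signal and intercept a segment at each stop;
    # a sufficiently long signal yields at least two voice segments.
    return [signal[start:start + seg_len]
            for start in range(0, len(signal) - seg_len + 1, step)]
```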
In the above embodiment, the smart device determines the segment interception length and the sliding window movement step, moves the sliding window accordingly, intercepts the voice signal within the moved window based on the segment interception length, and performs feature extraction on the intercepted voice segments. Whether to trigger the control command can thus be judged from the command word recognition results and language identification results of a plurality of voice segments, avoiding false triggering caused by an inaccurate result on a single segment and improving the accuracy of control command triggering.
In one embodiment, the recognition network is obtained by training a pre-trained recognition network, which comprises a pre-trained command word recognition sub-network and a pre-trained language identification sub-network; as shown in FIG. 6, training the recognition network includes the following steps:
S602, performing feature extraction on at least two voice segment samples intercepted from the voice signal sample to obtain voice sample features.
A voice signal sample is a signal sample whose sound intensity changes over time. Voice signal samples may be in various languages, such as German, Japanese, French or Chinese, and contain command words issued by the user to the smart device, such as "take a photograph", "turn on" or "turn off"; for example, a voice signal sample may be a 10-second signal sent by the user to the smart device. A voice segment sample is a segment intercepted from a voice signal sample; for example, from a 10-second voice signal sample, a 1.6-second segment may be intercepted. The smart device labels each voice segment sample with its command word and language; for example, voice segment sample 1 may be labeled with the command word "turn on" and the language "Chinese". The voice sample features are features learned by a neural network and can be represented by feature vectors, for example of length 512 or 256.
In one embodiment, the smart device moves the sliding window according to the sliding window movement step, intercepts, based on the segment interception length, the voice signal sample within the moved sliding window to obtain at least two voice segment samples, and performs feature extraction on the voice segment samples.
S604, performing command word recognition on the voice sample features through the pre-trained command word recognition sub-network to obtain a command word training result.
The pre-trained recognition network is a multi-task joint learning model for command word recognition and language identification, and may be a neural network model such as a convolutional neural network, a residual convolutional neural network or a feed-forward neural network. The pre-trained command word recognition sub-network is a neural network that recognizes command words from voice features and is a sub-network of the pre-trained recognition network. The command word training result represents the command word contained in the voice segment sample and may contain the corresponding probability, for example "command word: power on, probability: 0.7". When the pre-trained command word recognition sub-network detects that a voice segment sample contains no command word, the result may be "no command word" together with the corresponding probability, for example "no command word, probability: 0.6".
In one embodiment, S604 specifically includes: performing command word recognition on the voice sample features through the pre-trained command word recognition sub-network to obtain a command word recognition vector sample, the vector sample comprising at least two vector elements, each representing the probability that the voice segment sample contains a particular command word or contains no command word; sorting the vector elements in the command word recognition vector sample; and determining, based on the sorted result, the command word training result corresponding to the voice segment sample.
S606, performing language identification on the voice sample features through the pre-trained language identification sub-network to obtain a language training result.
The pre-trained language identification sub-network is a neural network that performs language identification on the voice sample features and is a sub-network of the pre-trained recognition network. The language training result represents the language category of the voice signal sample and may include the language and its corresponding probability, for example "language: English, probability: 0.48".
In one embodiment, S606 specifically includes: performing language identification on the voice sample features through the pre-trained language identification sub-network to obtain a language training vector, the vector comprising at least two vector elements, each representing the probability that the voice segment sample is in the target language or in a language other than the target language; sorting the vector elements of the language training vector; and determining, based on the sorted result, the language training result corresponding to the voice segment sample.
S608, calculating a loss value based on the command word training result and the language training result.
The loss value is a numerical value representing the difference between the command word and language training results and the true values; it may be any integer, decimal, fraction or percentage, and may be, for example, a cross-entropy loss value or a mean-square-error loss value.
In one embodiment, S608 specifically includes: calculating a command word sub-network loss value based on the command word training result; calculating a language sub-network loss value based on the language training result; and calculating the loss value from the command word sub-network loss value and the language sub-network loss value according to a weighting parameter.
The weighting parameter adjusts the relative weights of the command word sub-network loss value and the language sub-network loss value, and may be any decimal, fraction or percentage, for example 0.3, 0.5 or 0.7. In one embodiment, the weight of the command word sub-network loss value is proportional to the weighting parameter: the larger the weighting parameter, the larger the share of the command word sub-network loss value in the loss value.
In one embodiment, the smart device performs a weighted calculation on the command word sub-network loss value and the language sub-network loss value according to formula (1), where a is the weighting parameter, L_kws is the command word sub-network loss value, and L_lang is the language sub-network loss value:
L = a × L_kws + (1 − a) × L_lang (1)
In one embodiment, the smart device may calculate the command word sub-network loss value according to formula (2) and the language sub-network loss value according to formula (3):
H_1 = −Σ_{i=1..n} p_1(x_i) log q_1(x_i) (2)
H_2 = −Σ_{i=1..n} p_2(x_i) log q_2(x_i) (3)
where H_1 is the command word sub-network loss value, p_1(x_i) is the probability value of the real command word corresponding to voice segment i, q_1(x_i) is the probability value of the command word training result corresponding to voice segment i, H_2 is the language sub-network loss value, p_2(x_i) is the probability value of the real language corresponding to voice segment i, q_2(x_i) is the probability value of the language training result corresponding to voice segment i, and n is the number of voice segments.
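A minimal PyTorch sketch of the joint loss in formulas (1)-(3), using cross entropy over the two heads of the network sketched earlier; the weighting parameter a = 0.5 is an illustrative default.

```python
import torch.nn.functional as F

def joint_loss(cmd_logits, cmd_targets, lang_logits, lang_targets, a=0.5):
    l_kws = F.cross_entropy(cmd_logits, cmd_targets)     # formula (2)
    l_lang = F.cross_entropy(lang_logits, lang_targets)  # formula (3)
    return a * l_kws + (1 - a) * l_lang                  # formula (1)
```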
The smart device thus calculates the command word sub-network loss value from the command word training result, calculates the language sub-network loss value from the language training result, and combines the two according to the weighting parameter to obtain the loss value. The pre-trained command word recognition sub-network and language identification sub-network can then be jointly trained according to this loss value, yielding a multi-task joint learning model, namely a recognition network capable of both command word recognition and language identification, which improves the recognition speed for voice signals.
S610, training the pre-trained recognition network according to the loss value to obtain the recognition network.
The smart device trains the pre-trained recognition network according to the loss value. Specifically, the smart device adjusts the parameters of the pre-trained recognition network so as to reduce the loss value; training is complete when the loss value falls below a preset value or reaches its minimum.
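A minimal training-loop sketch for S610, assuming the RecognitionNetwork and joint_loss sketches above and a DataLoader yielding (spectrum, command label, language label) batches; the epoch count and learning rate are illustrative.

```python
import torch

def train(model, loader, epochs=10, a=0.5, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for spectrum, cmd_y, lang_y in loader:
            cmd_logits, lang_logits = model(spectrum)
            loss = joint_loss(cmd_logits, cmd_y, lang_logits, lang_y, a)
            opt.zero_grad()   # adjust parameters to reduce the loss value
            loss.backward()
            opt.step()
    return model
```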
In the above embodiment, command word recognition is performed on the voice sample features through the pre-trained command word recognition sub-network to obtain a command word training result, and language identification is performed on the voice sample features through the pre-trained language identification sub-network to obtain a language training result. The pre-trained recognition network is then trained according to the loss value calculated from the two training results, yielding the recognition network. A single network usable for both command word recognition and language identification is thus obtained; compared with training a separate model for each task, this reduces the number of models to be trained and improves training efficiency. Moreover, the smart device can avoid false triggering by non-target languages while running only one model, which improves the recognition speed for voice signals.
In one embodiment, as shown in FIG. 7, the control command triggering method for the smart device includes the following steps:
S702, performing feature extraction on at least two voice segment samples intercepted from the voice signal sample to obtain voice sample features.
S704, performing command word recognition on the voice sample features through the pre-trained command word recognition sub-network to obtain a command word training result, and performing language identification on the voice sample features through the pre-trained language identification sub-network to obtain a language training result.
S706, calculating a command word sub-network loss value based on the command word training result, and calculating a language sub-network loss value based on the language training result.
S708, calculating the loss value from the command word sub-network loss value and the language sub-network loss value according to the weighting parameter, and training the pre-trained recognition network according to the loss value to obtain the recognition network.
S710, determining the segment interception length and the sliding window movement step, and moving the sliding window according to the sliding window movement step.
S712, intercepting, based on the segment interception length, the voice signal within the moved sliding window to obtain at least two voice segments, and performing feature extraction on the voice segments to obtain voice features.
S714, performing command word recognition on the voice features through the command word recognition sub-network of the recognition network to obtain a command word recognition vector, the vector comprising at least two vector elements, each representing the probability that the voice segment contains a particular command word or contains no command word.
S716, sorting the vector elements in the command word recognition vector, and determining, based on the sorting result, the command word recognition result corresponding to the voice segment.
S718, performing language identification on the voice features through the language identification sub-network of the recognition network to obtain a language identification vector, the vector comprising at least two vector elements, each representing the probability that the voice segment is in the target language or in a language other than the target language.
S720, sorting the vector elements of the language identification vector, and determining, based on the sorting result, the language identification result corresponding to the voice segment.
S722, if the command word recognition result satisfies the command word condition and the language identification result satisfies the language condition, triggering the control command based on the command word recognition result.
For the specific contents of S702 to S722, reference may be made to the implementation procedures described above.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps may comprise multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential: they may be performed in turn or alternately with at least some of the other steps, sub-steps or stages.
Based on the same inventive concept, an embodiment of the application further provides a control command triggering apparatus for an intelligent device that implements the control command triggering method described above. Since the solution provided by this apparatus is similar to that of the method, for the specific limitations of the apparatus embodiments below, reference may be made to the limitations of the control command triggering method above, which are not repeated here.
In one embodiment, as shown in FIG. 8, a control command triggering apparatus for an intelligent device is provided, including an extraction module 802, a command word recognition module 804, a language identification module 806 and a triggering module 808, wherein:
the extraction module 802 is configured to perform feature extraction on at least two voice segments intercepted from the voice signal to obtain voice features;
the command word recognition module 804 is configured to perform command word recognition on the voice features through the command word recognition sub-network of the recognition network to obtain a command word recognition result;
the language identification module 806 is configured to perform language identification on the voice features through the language identification sub-network of the recognition network to obtain a language identification result;
and the triggering module 808 is configured to trigger a control command based on the command word recognition result when the command word recognition result and the language identification result satisfy the trigger conditions.
In the above embodiment, feature extraction is performed on at least two voice segments intercepted from the voice signal to obtain voice features. Command word recognition is performed on the voice features through the command word recognition sub-network of the recognition network to obtain a command word recognition result, and language identification is performed on the voice features through the language identification sub-network of the recognition network to obtain a language identification result. When the command word recognition result and the language identification result satisfy the trigger conditions, the control command is triggered based on the command word recognition result. The recognition network performs command word recognition and language identification on the voice segments simultaneously, and the control command is triggered only when both results satisfy the trigger conditions, so false triggering caused by voice signals in a non-target language can be effectively avoided. Moreover, the computer device obtains both results by running a single recognition model, which improves the recognition speed and thus the triggering efficiency of control commands.
In one embodiment, the extraction module 802 is further configured to:
determine a segment interception length and a sliding window movement step;
move the sliding window according to the sliding window movement step;
and intercept, based on the segment interception length, the voice signal within the moved sliding window to obtain at least two voice segments, and perform feature extraction on the voice segments.
In one embodiment, the trigger conditions include a command word condition and a language condition, and the triggering module 808 is further configured to:
trigger a control command based on the command word recognition result if the command word recognition result satisfies the command word condition and the language identification result satisfies the language condition.
In one embodiment, the command word recognition module 804 is further configured to:
perform command word recognition on the voice features through the command word recognition sub-network of the recognition network to obtain a command word recognition vector, the vector comprising at least two vector elements, each representing the probability that the voice segment contains a particular command word or contains no command word;
sort the vector elements in the command word recognition vector;
and determine, based on the sorting result, the command word recognition result corresponding to the voice segment.
In one embodiment, the language identification module 806 is further configured to:
perform language identification on the voice features through the language identification sub-network of the recognition network to obtain a language identification vector, the vector comprising at least two vector elements, each representing the probability that the voice segment is in the target language or in a language other than the target language;
sort the vector elements of the language identification vector;
and determine, based on the sorting result, the language identification result corresponding to the voice segment.
In one embodiment, the recognition network is obtained by training a pre-trained recognition network, which comprises a pre-trained command word recognition sub-network and a pre-trained language identification sub-network; as shown in FIG. 9, the apparatus further includes:
the extraction module 802, further configured to perform feature extraction on at least two voice segment samples intercepted from the voice signal sample to obtain voice sample features;
the command word recognition module 804, further configured to perform command word recognition on the voice sample features through the pre-trained command word recognition sub-network to obtain a command word training result;
the language identification module 806, further configured to perform language identification on the voice sample features through the pre-trained language identification sub-network to obtain a language training result;
a calculation module 810, configured to calculate a loss value based on the command word training result and the language training result;
and a training module 812, configured to train the pre-trained recognition network according to the loss value to obtain the recognition network.
In one embodiment, the calculation module 810 is further configured to:
calculate a command word sub-network loss value based on the command word training result;
calculate a language sub-network loss value based on the language training result; and
combine the command word sub-network loss value and the language sub-network loss value according to weighting parameters to obtain the loss value.
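A minimal sketch of this weighted multi-task loss, assuming cross-entropy losses for both sub-networks and illustrative weight values (only the combination of the two sub-network loss values according to weighting parameters is specified above):

```python
import torch
import torch.nn.functional as F

def joint_loss(cmd_logits: torch.Tensor, cmd_targets: torch.Tensor,
               lang_logits: torch.Tensor, lang_targets: torch.Tensor,
               w_cmd: float = 1.0, w_lang: float = 0.5) -> torch.Tensor:
    """Combine the command word and language sub-network losses according
    to weighting parameters; loss type and weights are assumptions."""
    cmd_loss = F.cross_entropy(cmd_logits, cmd_targets)     # command word sub-network loss
    lang_loss = F.cross_entropy(lang_logits, lang_targets)  # language sub-network loss
    return w_cmd * cmd_loss + w_lang * lang_loss

# One joint training step over a batch of 8 voice segment samples,
# assuming 4 command word classes and 2 language classes.
cmd_logits = torch.randn(8, 4, requires_grad=True)
lang_logits = torch.randn(8, 2, requires_grad=True)
loss = joint_loss(cmd_logits, torch.randint(0, 4, (8,)),
                  lang_logits, torch.randint(0, 2, (8,)))
loss.backward()  # gradients flow back into both sub-networks
```

Because a single loss value drives training, the weighting parameters control how strongly the shared recognition network is optimized for command word accuracy versus language discrimination.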
All or part of the modules in the above control command triggering device of the intelligent device may be implemented by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or independent of, a processor in the computer device, or stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing control command trigger data of the intelligent device. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a control command triggering method for an intelligent device.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 11. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless mode can be realized through Wi-Fi, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a control command triggering method for an intelligent device. The display unit of the computer device is used to form a visual picture and may be a display screen, a projection device, or a virtual reality imaging device; the display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, mouse, or the like.
It will be appreciated by persons skilled in the art that the structures shown in figs. 10 and 11 are block diagrams of only part of the structures associated with the present application and do not limit the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) involved in the present application are information and data authorized by the user or fully authorized by all parties, and that the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the countries and regions concerned.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided in the present application may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, and the like, but are not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any combination of these technical features that contains no contradiction should be considered within the scope of this description.
The foregoing examples represent only a few embodiments of the application and are described in detail, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the application, all of which fall within the protection scope of the application. Accordingly, the protection scope of the application shall be subject to the appended claims.

Claims (11)

1. A control command triggering method for an intelligent device, the method comprising:
extracting features of at least two voice fragments intercepted from the voice signal to obtain voice features;
performing command word recognition on the voice features through a command word recognition sub-network of a recognition network to obtain a command word recognition result;
performing language identification on the voice features through a language identification sub-network of the recognition network to obtain a language identification result; and
when the command word recognition result and the language identification result meet the trigger conditions, triggering a control command based on the command word recognition result.
2. The method of claim 1, wherein the feature extraction of at least two voice fragments intercepted from the voice signal comprises:
determining a segment interception length and a sliding window movement step;
moving the sliding window according to the movement step; and
intercepting the voice signal within the moved sliding window based on the segment interception length to obtain at least two voice segments, and performing feature extraction on the voice segments.
3. The method of claim 1, wherein the trigger conditions include a command word condition and a language condition, and wherein triggering the control command based on the command word recognition result when the command word recognition result and the language identification result meet the trigger conditions comprises:
triggering a control command based on the command word recognition result if the command word recognition result meets the command word condition and the language identification result meets the language condition.
4. The method of claim 1, wherein performing command word recognition on the voice features through the command word recognition sub-network of the recognition network to obtain the command word recognition result comprises:
performing command word recognition on the voice features through the command word recognition sub-network of the recognition network to obtain a command word recognition vector, the command word recognition vector comprising at least two vector elements, each vector element representing the probability that the voice segment contains a particular command word or the probability that it contains no command word;
sorting the vector elements in the command word recognition vector; and
determining the command word recognition result corresponding to the voice segment based on the sorting result.
5. The method of claim 1, wherein performing language identification on the voice features through the language identification sub-network of the recognition network to obtain the language identification result comprises:
performing language identification on the voice features through the language identification sub-network of the recognition network to obtain a language identification vector, the language identification vector comprising at least two vector elements, each vector element representing the probability that the voice segment is in the target language or in a language other than the target language;
sorting the vector elements of the language identification vector; and
determining the language identification result corresponding to the voice segment based on the sorting result.
6. The method of claim 1, wherein the recognition network is obtained by training a pre-trained recognition network, the pre-trained recognition network comprising a pre-trained command word recognition sub-network and a pre-trained language identification sub-network, and wherein the training of the pre-trained recognition network comprises:
extracting features of at least two voice segment samples intercepted from a voice signal sample to obtain voice sample features;
performing command word recognition on the voice sample features through the pre-trained command word recognition sub-network to obtain a command word training result;
performing language identification on the voice sample features through the pre-trained language identification sub-network to obtain a language training result;
calculating based on the command word training result and the language training result to obtain a loss value; and
training the pre-trained recognition network according to the loss value to obtain the recognition network.
7. The method of claim 6, wherein the calculating based on the command word training result and the language training result to obtain the loss value comprises:
calculating a command word sub-network loss value based on the command word training result;
calculating a language sub-network loss value based on the language training result; and
combining the command word sub-network loss value and the language sub-network loss value according to weighting parameters to obtain the loss value.
8. A control command triggering device for an intelligent device, the device comprising:
an extraction module, configured to extract features of at least two voice fragments intercepted from the voice signal to obtain voice features;
a command word recognition module, configured to perform command word recognition on the voice features through a command word recognition sub-network of a recognition network to obtain a command word recognition result;
a language identification module, configured to perform language identification on the voice features through a language identification sub-network of the recognition network to obtain a language identification result; and
a triggering module, configured to trigger a control command based on the command word recognition result when the command word recognition result and the language identification result meet the trigger conditions.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
11. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.