WO2021114224A1 - Voice detection method, prediction model training method, apparatus, device, and medium - Google Patents

Voice detection method, prediction model training method, apparatus, device, and medium

Info

Publication number
WO2021114224A1
WO2021114224A1 (application PCT/CN2019/125121 / CN2019125121W)
Authority
WO
WIPO (PCT)
Prior art keywords
sample
face image
audio signal
text information
voice
Prior art date
Application number
PCT/CN2019/125121
Other languages
English (en)
French (fr)
Inventor
高益
聂为然
黄佑佳
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2019/125121 priority Critical patent/WO2021114224A1/zh
Priority to EP19956031.9A priority patent/EP4064284A4/en
Priority to CN201980052133.4A priority patent/CN112567457B/zh
Publication of WO2021114224A1 publication Critical patent/WO2021114224A1/zh
Priority to US17/838,500 priority patent/US20220310095A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/165Management of the audio stream, e.g. setting of volume, audio stream path
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • G10L15/05Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/87Detection of discrete points within a voice signal
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Definitions

  • This application relates to the field of voice interaction technology, and in particular to a voice detection method and to a training method, apparatus, device, and medium for a prediction model.
  • In voice interaction technology, to realize voice-based human-computer interaction, a device usually recognizes the voice start point and the voice end point in a segment of audio, intercepts the part between the two points as a voice command, and performs the corresponding operation according to that command.
  • The voice start point is usually triggered by an active user operation and is therefore easy to determine, for example the time point at which a wake-up word is collected or at which a voice interaction activation option is triggered. The voice end point, however, must be derived by the device through analysis of the audio. Accurately detecting the voice end point is therefore very important to voice interaction technology, and it is also a major technical difficulty.
  • In the related art, the voice detection method is usually: for every time window, collect the audio signal in the current window, detect the tail silence duration of the audio signal, and compare it with a silence duration threshold. If the tail silence duration is greater than the threshold, the audio signal is determined to be the voice end point; if it is less than or equal to the threshold, the audio signal is determined not to be the voice end point.
  • When background noise is present, the silence at the tail of the utterance is easily mistaken for speech, so the voice end point tends to be missed and the end of the voice interaction is detected too late. Conversely, once the user pauses while speaking, the pause is easily mistaken for the tail silence of a finished utterance, so the end of the voice interaction is detected too early. The accuracy of the voice end point detected by this method is therefore poor.
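  • As an illustration only (not part of the patent disclosure), the following Python sketch shows the purely acoustic, related-art detector described above; the frame length, the threshold value, and the helper frame_is_silent are assumptions.

```python
# Minimal sketch of the related-art, purely acoustic endpoint detector described above.
# FRAME_MS, SILENCE_THRESHOLD_MS and frame_is_silent are illustrative assumptions.

FRAME_MS = 20                  # length of one analysis time window, in milliseconds
SILENCE_THRESHOLD_MS = 800     # fixed silence-duration threshold

def tail_silence_ms(frames, frame_is_silent):
    """Duration of the run of silent frames at the tail of the audio."""
    silent = 0
    for frame in reversed(frames):
        if frame_is_silent(frame):
            silent += FRAME_MS
        else:
            break
    return silent

def is_voice_end_point(frames, frame_is_silent):
    # The related art declares the end point from trailing silence alone.
    return tail_silence_ms(frames, frame_is_silent) > SILENCE_THRESHOLD_MS
```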
  • The embodiments of the present application provide a voice detection method, a prediction model training method, an apparatus, a device, and a medium, which can improve the accuracy of detecting the voice end point.
  • In a first aspect, a voice detection method is provided.
  • In the method, an audio signal and a face image are acquired, the time point at which the face image is captured being the same as the time point at which the audio signal is collected.
  • The face image is input into a prediction model, which is used to predict whether the user has the intention to continue speaking; the prediction model processes the face image and outputs a prediction result. If the prediction result indicates that the user does not have the intention to continue speaking, the audio signal is determined to be the voice end point.
  • The above provides a multi-modal voice end point detection method: a model recognizes the captured face image to predict whether the user intends to continue speaking, and the prediction result is combined with the acoustic evidence to decide whether the collected audio signal is the voice end point.
  • Because the voice end point is detected based on both the acoustic features and the visual features of the face image, the face image can still be used to accurately determine whether the audio signal is the voice end point even when the background noise is strong or the user pauses while speaking.
  • This avoids the interference caused by background noise and speech pauses, and therefore avoids detecting the end of the voice interaction too late or too early. It improves the accuracy of detecting the voice end point and thereby improves the efficiency of voice interaction.
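  • The following sketch is an illustration under assumed names, not the claimed implementation: tail_silence_fn stands for any trailing-silence measurement (for example the helper in the previous sketch), and prediction_model.predict stands for the face-image prediction model.

```python
# Hedged sketch of the multi-modal end point decision of the first aspect.
# prediction_model.predict(face_image) returns True if the user intends to keep speaking.

def detect_voice_end_point(audio_frames, face_image, prediction_model,
                           tail_silence_fn, silence_threshold_ms=800):
    if tail_silence_fn(audio_frames) <= silence_threshold_ms:
        return False                          # acoustically, speech has not stopped long enough
    intends_to_continue = prediction_model.predict(face_image)   # visual modality
    return not intends_to_continue            # end point only if no intention to continue
```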
  • In one possible implementation, the key points contained in the face image are extracted; the key points are processed to obtain the action feature of the face image; the action feature is classified to obtain the confidence level corresponding to each category; and the prediction result is determined according to the confidence levels.
  • When a segment of speech contains pauses, syntactic analysis alone cannot distinguish whether an audio signal corresponds to a pause or to the end point.
  • By integrating the features of the facial key points with the action feature, the micro-expression contained in the face can be accurately recognized from the current facial action, the user's mental state can be inferred from the expression, and it can then be predicted whether the user intends to continue speaking.
  • This method uses visual information to assist the judgment, thereby solving problems that syntactic analysis alone cannot solve and reducing premature truncation of speech.
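  • A minimal sketch of this face-image branch is given below, assuming hypothetical helpers extract_keypoints and keypoints_to_action_feature and a simple linear softmax classifier; the actual model structure is not specified here.

```python
import numpy as np

# Illustrative pipeline: key points -> action feature -> class confidences -> prediction.
# extract_keypoints, keypoints_to_action_feature, and the classifier weights are assumptions.

CLASSES = ("continue_speaking", "not_continue_speaking")

def predict_intention(face_image, extract_keypoints, keypoints_to_action_feature,
                      weights, bias):
    keypoints = extract_keypoints(face_image)                  # facial key points
    action_feature = keypoints_to_action_feature(keypoints)    # e.g. lip, gaze, brow motion
    logits = weights @ action_feature + bias                   # one logit per class
    confidences = np.exp(logits - np.max(logits))
    confidences /= confidences.sum()                           # per-class confidence levels
    return CLASSES[int(np.argmax(confidences))], confidences
```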
  • In one possible implementation, the prediction model is obtained by training on a first sample face image and a second sample face image. The first sample face image is annotated with a first label, which indicates that the sample user has the intention to continue speaking; the first label is determined according to a first sample audio signal whose collection time point and collection object are the same as the shooting time point and shooting object of the first sample face image. The second sample face image is annotated with a second label, which indicates that the sample user does not have the intention to continue speaking; the second label is determined according to a second sample audio signal whose collection time point and collection object are the same as the shooting time point and shooting object of the second sample face image.
  • In this way, a model training method that realizes prediction of the user's intention is provided.
  • Model training is performed using sample face images in which the user intends to continue speaking and sample face images in which the user does not intend to continue speaking.
  • From the sample face images labelled with the intention to continue speaking, the model learns what the features of a face image look like when the user intends to continue speaking; from the sample face images labelled with no intention to continue speaking, it learns what the features look like when the user does not intend to continue speaking. The prediction model thus learns the mapping between facial image features and the user's intention.
  • The trained model can then predict, from a previously unseen face image, whether the current user intends to continue speaking, so that the user's intention represented by the face image can be used to accurately detect whether the current audio signal is the voice end point.
  • In one possible implementation, the first sample audio signal satisfies a first condition.
  • The first condition includes: the voice activity detection (Voice Activity Detection, VAD) result corresponding to the first sample audio signal is first updated from the speaking state to the silent state, and then updated from the silent state back to the speaking state.
  • Consider a scenario in which a user pauses briefly while speaking: for the audio before the pause, the VAD result is the speaking state; for the audio during the pause, the VAD result is the silent state; for the audio after the pause, the VAD result returns to the speaking state.
  • If a collected sample audio signal meets this first condition, it matches the audio during the pause in this scenario. Since the sample user continued to speak after the pause instead of ending the utterance, the intention of the sample user at the pause time was to continue speaking.
  • The sample face image taken at the pause time therefore reflects the user's intention to continue speaking. By marking this sample face image as a first sample face image, the model can learn from it the mapping relationship between the face image and the intention to continue speaking. In the model application stage, the model can then predict from an unseen face image whether the current user intends to continue speaking.
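  • A hedged sketch of this labelling rule follows; the per-frame VAD states and the state names are assumptions introduced for illustration.

```python
# Condition (1): the VAD result goes speaking -> silent -> speaking, i.e. a mid-utterance pause.
# vad_states is an assumed per-frame list such as ["speech", "speech", "silence", "speech"].

def satisfies_first_condition(vad_states):
    # collapse consecutive identical states into a sequence of transitions
    collapsed = [s for i, s in enumerate(vad_states) if i == 0 or s != vad_states[i - 1]]
    return any(collapsed[i:i + 3] == ["speech", "silence", "speech"]
               for i in range(len(collapsed) - 2))

def label_first_sample(face_image, vad_states):
    if satisfies_first_condition(vad_states):
        return face_image, "intends_to_continue_speaking"   # first label
    return face_image, None                                 # left for the other conditions
```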
  • In another possible implementation, the first condition includes: the tail silence duration of the first sample audio signal is less than a first threshold and greater than a second threshold, where the first threshold is greater than the second threshold.
  • In another possible implementation, the first condition includes: a first confidence level of a text information combination is greater than a second confidence level of first text information. The text information combination is the combination of the first text information and second text information; the first text information represents the semantics of the sample audio signal preceding the first sample audio signal, and the second text information represents the semantics of the sample audio signal following the first sample audio signal. The first confidence level represents the probability that the text information combination is a complete sentence, and the second confidence level represents the probability that the first text information alone is a complete sentence.
  • In another possible implementation, the first condition includes: the first confidence level of the text information combination is greater than a third confidence level of the second text information, where the third confidence level represents the probability that the second text information alone is a complete sentence.
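  • These alternative forms of the first condition can be sketched as follows; the thresholds and the confidence scores (for example from a sentence-completeness scorer) are assumed inputs, not values from the disclosure.

```python
# Conditions (2)-(4) for labelling a first sample face image, under assumed inputs.
# first_threshold > second_threshold; conf_* are sentence-completeness probabilities.

def satisfies_first_condition_alt(tail_silence, first_threshold, second_threshold,
                                  conf_combination, conf_first_text, conf_second_text):
    if second_threshold < tail_silence < first_threshold:   # (2) a pause, not a full stop
        return True
    if conf_combination > conf_first_text:                  # (3) both segments together read
        return True                                         #     more complete than the first alone
    if conf_combination > conf_second_text:                 # (4) ... or than the second alone
        return True
    return False
```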
  • The effect achieved at least includes the following. For a sentence containing a short pause, the related art uses the pause point as a dividing point and splits the complete sentence into two segments of speech. Because the user has not finished speaking, the electronic device decides prematurely that the voice end point has been reached. The electronic device then uses only the speech before the pause as the voice command and ignores the speech after the pause, so the recognized voice command is incomplete. If the electronic device processes the service directly according to the voice command before the pause, the accuracy of the service processing is inevitably affected.
  • In one possible implementation, the first sample face image satisfies a third condition.
  • The third condition includes: after the first sample face image is input into the first classifier in the prediction model and the second classifier in the prediction model, the probability output by the first classifier is greater than the probability output by the second classifier, where the first classifier is used to predict the probability that the face image contains an action and the second classifier is used to predict the probability that the face image does not contain an action.
  • In one possible implementation, the second sample audio signal satisfies a second condition, and the second condition includes at least one of the following: the VAD result corresponding to the second sample audio signal is updated from the speaking state to the silent state; or the tail silence duration of the second sample audio signal is greater than the first threshold.
  • Using the audio signal collected when the face image is taken to determine whether the face image reflects an intention not to continue speaking, that is, using information from the acoustic modality to label the training images, ensures that the label of each sample face image matches the user's actual intention of whether to continue speaking.
  • Because the model is trained on accurately labelled samples, its accuracy in predicting the user's intention is correspondingly high, which helps the model accurately detect the voice end point in the application stage.
  • In one possible implementation, the features of the text modality can also be combined for voice detection.
  • Specifically, speech recognition is performed on the audio signal to obtain third text information corresponding to the audio signal; syntactic analysis is performed on the third text information to obtain a first analysis result, which indicates whether the third text information is a complete sentence. If the first analysis result indicates that the third text information is not a complete sentence, it is determined that the audio signal is not the voice end point; if the first analysis result indicates that the third text information is a complete sentence, the step of inputting the face image into the prediction model is executed.
  • That the sentence composed of the current word and the preceding words is syntactically complete cannot by itself be the basis for treating the current word as the voice end point. In the methods provided by the related art, which rely solely on acoustic information, a temporary pause may be misjudged as the voice end point, so the voice command is segmented, the user's intention is misinterpreted, and the voice interaction task is handled incorrectly.
  • With the above method, the process of applying the prediction model to perform face recognition is triggered only when the audio signal has been found to be syntactically complete, so the prediction result further confirms whether the audio signal has indeed reached the voice end point.
  • Fusing the features of the visual modality avoids the misjudgments of syntactic analysis, greatly improves the accuracy of voice end point detection, and reduces the probability of premature truncation of voice commands.
  • The above syntactic analysis does not depend on a specific ASR engine or a specific scene; the detection of each modality can be performed independently and then judged comprehensively, which is easy to operate and highly practical.
  • In one possible implementation, the process of syntactic analysis includes: segmenting the third text information to obtain multiple words; for each of the words, performing syntactic analysis on the word to obtain a second analysis result, which indicates whether the word and the words before it constitute a complete sentence. If the second analysis result corresponding to any word indicates that a complete sentence is formed, the third text information is determined to be a complete sentence; if the second analysis result corresponding to every word indicates that no complete sentence is formed, the third text information is determined not to be a complete sentence.
  • The effect achieved at least includes: not only is the syntactic connection between each word and the preceding words considered comprehensively, but the N-best algorithm is also used, so that whenever a word is detected it is judged whether that word forms a complete sentence with the preceding words. As soon as the current word indicates that a complete sentence has been formed, the analyzed text information can be determined to be a complete sentence and the next detection process is executed. The moment at which the current audio signal may be the voice end point can therefore be detected in time, ensuring real-time detection of the voice end point and avoiding detecting it too late.
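  • A word-by-word sketch of this analysis is shown below; segment_words and is_complete_sentence stand in for a real tokenizer and syntactic analyser, which the text does not pin down.

```python
# Incremental completeness check in the spirit of the N-best analysis described above.

def text_is_complete(third_text, segment_words, is_complete_sentence):
    words = segment_words(third_text)       # word segmentation of the recognized text
    prefix = []
    for word in words:
        prefix.append(word)
        # as soon as the words up to the current one form a complete sentence,
        # the third text information is treated as a complete sentence
        if is_complete_sentence(" ".join(prefix)):
            return True
    return False
```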
  • In one possible implementation, the trigger condition for inputting the face image into the prediction model includes: detecting the tail silence duration of the audio signal and determining that the tail silence duration is greater than a third threshold.
  • In other words, once the tail silence duration exceeds the third threshold, the process of fusing the features of the face image for voice detection is performed.
  • The effect achieved by this method at least includes: once the silence duration exceeds this minimum threshold (the third threshold), the text modality and the image modality are combined, and the results of syntactic analysis and facial analysis are used to detect the voice end point. Through the fusion of multi-modal information, the voice end point is detected as quickly and accurately as possible, avoiding excessive delay.
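  • Combining this trigger with the syntactic and facial checks might look like the following sketch; all helpers are assumptions, and third_threshold_ms stands for the third threshold mentioned above.

```python
# Only once trailing silence exceeds the third threshold are the text and image
# modalities consulted; tail_silence_fn and is_complete_fn are assumed helpers.

def maybe_detect_end_point(audio_frames, third_text, face_image, prediction_model,
                           tail_silence_fn, is_complete_fn, third_threshold_ms):
    if tail_silence_fn(audio_frames) <= third_threshold_ms:
        return False                                   # too early: keep listening
    if not is_complete_fn(third_text):
        return False                                   # sentence not yet syntactically complete
    return not prediction_model.predict(face_image)    # end point if no intention to continue
```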
  • In one possible implementation, the above voice detection method is applied to a vehicle-mounted terminal. The vehicle-mounted terminal may also collect driving status information, which represents the driving status of the vehicle equipped with the vehicle-mounted terminal, and adjust the third threshold according to the driving status information.
  • The effect achieved at least includes: the specific application scenario of voice detection can be incorporated into endpoint detection. For example, in the vehicle scenario, the driving status can be used to adjust the threshold of the tail silence duration, so that the threshold adapts to the current driving conditions and the robustness of voice endpoint detection is improved.
  • In one possible implementation, the process of adjusting the third threshold includes: if the driving status information indicates that a sharp turn has occurred, adjusting the third threshold so that the adjusted third threshold is greater than the third threshold before adjustment; or, if the driving status information indicates that sudden braking has occurred, adjusting the third threshold so that the adjusted third threshold is greater than the third threshold before adjustment.
  • The effect achieved at least includes: if the vehicle turns sharply or brakes suddenly, the user's speech is likely to be interrupted by the sharp turn or sudden braking, so a pause is more likely to occur and the pause will also last longer. Raising the third threshold adapts the detection to the situation of sharp turns or sudden braking.
  • In this way, the specific application scenario of voice detection is incorporated into endpoint detection: the driving status during driving is used to adjust the threshold of the tail silence duration, so that the threshold adapts to the current driving situation, which improves the robustness of voice endpoint detection.
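  • One possible (purely illustrative) adjustment rule is sketched below; the scaling factor and the driving-status keys are assumptions, not values taken from the disclosure, which only states that the adjusted threshold is larger.

```python
# Lengthen the allowed pause (the third threshold) when the driver is likely to be interrupted.

def adjust_third_threshold(base_threshold_ms, driving_status):
    if driving_status.get("sharp_turn") or driving_status.get("sudden_braking"):
        return base_threshold_ms * 2       # assumed factor; the adjusted value is simply larger
    return base_threshold_ms
```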
  • In one possible implementation, the above voice detection method is applied to a vehicle-mounted terminal, and the vehicle-mounted terminal may also collect environmental information.
  • The environmental information indicates the environment in which the vehicle equipped with the vehicle-mounted terminal is located, and the parameters of the prediction model are adjusted according to the environmental information.
  • The effect achieved at least includes: while the vehicle is being driven, the environment outside the vehicle affects the driver's emotions, and changes in emotion affect the process of face recognition. Adjusting the parameters of the prediction model according to the environment outside the vehicle makes the face recognition performed by the prediction model match the current environment, thereby improving the accuracy of the prediction results.
  • In one possible implementation, the process of adjusting the parameters of the prediction model includes: if the environmental information indicates that traffic congestion has occurred, adjusting the decision threshold of a third classifier in the prediction model, where the third classifier judges that the user has the intention to continue speaking when its input data is higher than the decision threshold and judges that the user does not have that intention when its input data is lower than or equal to the decision threshold.
  • The effect achieved at least includes: the probability that a driver is anxious in a congested traffic scene is higher than in a smooth traffic scene, and changes in emotion affect face recognition. By adjusting the decision threshold, the face recognition performed by the prediction model matches the current traffic conditions, thereby improving the accuracy of the prediction results.
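  • An illustrative form of this adjustment is sketched below; the offset value and the direction of the shift are assumptions, since the disclosure only states that the decision threshold is adjusted under congestion.

```python
# Shift the third classifier's decision threshold when the environment indicates congestion.

def adjust_decision_threshold(base_threshold, environment):
    if environment.get("traffic_congestion"):
        return min(1.0, base_threshold + 0.1)   # assumed offset; tune on real data
    return base_threshold

def user_intends_to_continue(classifier_score, decision_threshold):
    # the third classifier's rule: above the threshold means "intends to continue speaking"
    return classifier_score > decision_threshold
```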
  • In a second aspect, a method for training a prediction model for voice detection is provided.
  • In the method, a sample audio signal set and a sample face image set to be labelled are obtained. According to a first sample audio signal in the sample audio signal set, a third sample face image in the sample face image set is processed to obtain a first sample face image; the first sample face image is annotated with a first label, the first label indicates that the sample user has the intention to continue speaking, and the shooting time point and shooting object of the first sample face image are the same as the collection time point and collection object of the first sample audio signal.
  • According to a second sample audio signal in the sample audio signal set, a fourth sample face image in the sample face image set is processed to obtain a second sample face image; the second sample face image is annotated with a second label, the second label indicates that the sample user does not have the intention to continue speaking, and the shooting time point and shooting object of the second sample face image are the same as the collection time point and collection object of the second sample audio signal. Model training is then performed using the first sample face image and the second sample face image to obtain a prediction model, and the prediction model is used to predict whether the user has the intention to continue speaking.
  • In one possible implementation, the first sample audio signal satisfies a first condition.
  • The first condition includes at least one of the following: the VAD result corresponding to the first sample audio signal is first updated from the speaking state to the silent state and then updated from the silent state back to the speaking state; or the tail silence duration of the first sample audio signal is less than a first threshold and greater than a second threshold, where the first threshold is greater than the second threshold; or
  • a first confidence level of a text information combination is greater than a second confidence level of first text information, where the text information combination is the combination of the first text information and second text information, the first text information represents the semantics of the sample audio signal preceding the first sample audio signal, the second text information represents the semantics of the sample audio signal following the first sample audio signal, the first confidence level represents the probability that the text information combination is a complete sentence, and the second confidence level represents the probability that the first text information is a complete sentence; or the first confidence level of the text information combination is greater than a third confidence level of the second text information, where the third confidence level represents the probability that the second text information is a complete sentence.
  • In one possible implementation, the second sample audio signal satisfies a second condition, and the second condition includes at least one of the following: the VAD result corresponding to the second sample audio signal is updated from the speaking state to the silent state; or the tail silence duration of the second sample audio signal is greater than the first threshold.
  • In one possible implementation, the first sample face image satisfies a third condition.
  • The third condition includes: after the first sample face image is input into the first classifier in the prediction model and the second classifier in the prediction model, the probability output by the first classifier is greater than the probability output by the second classifier, where the first classifier is used to predict the probability that the face image contains an action and the second classifier is used to predict the probability that the face image does not contain an action.
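  • Putting the labelling conditions and the training step together, a minimal end-to-end sketch of the second aspect might look as follows; the condition checks and the model object are placeholders for whatever implementations are actually used.

```python
# Label sample face images from the time-aligned sample audio signals, then train.

def build_training_set(samples, satisfies_first_condition, satisfies_second_condition):
    labelled = []
    for face_image, audio_signal in samples:        # same time point, same subject
        if satisfies_first_condition(audio_signal):
            labelled.append((face_image, 1))        # first label: intends to continue speaking
        elif satisfies_second_condition(audio_signal):
            labelled.append((face_image, 0))        # second label: does not intend to continue
    return labelled

def train_prediction_model(labelled_samples, model):
    images = [image for image, _ in labelled_samples]
    labels = [label for _, label in labelled_samples]
    model.fit(images, labels)                       # e.g. a keypoint-based or CNN classifier
    return model
```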
  • In a third aspect, a voice detection device is provided, and the voice detection device has the function of realizing the voice detection in the first aspect or any one of the optional manners of the first aspect.
  • The voice detection device includes at least one module, and the at least one module is used to implement the voice detection method provided in the first aspect or any one of the optional manners of the first aspect.
  • the trigger condition for inputting the face image into the prediction model includes: detecting a tail silence duration of the audio signal; determining that the tail silence duration is greater than a third threshold.
  • The device is applied to a vehicle-mounted terminal, and the device further includes: a first collection module, configured to collect driving status information, where the driving status information represents the driving status of the vehicle equipped with the vehicle-mounted terminal; and a first adjustment module, configured to adjust the third threshold according to the driving status information.
  • The first adjustment module is configured to adjust the third threshold if the driving status information indicates that a sharp turn has occurred, where the adjusted third threshold is greater than the third threshold before adjustment.
  • The first adjustment module is configured to adjust the third threshold if the driving status information indicates that sudden braking has occurred, where the adjusted third threshold is greater than the third threshold before adjustment.
  • The device is applied to a vehicle-mounted terminal, and the device further includes: a second collection module, configured to collect environmental information, where the environmental information indicates the environment in which the vehicle equipped with the vehicle-mounted terminal is located; and a second adjustment module, configured to adjust the parameters of the prediction model according to the environmental information.
  • The second adjustment module is configured to adjust the decision threshold of a third classifier in the prediction model if the environmental information indicates that traffic congestion has occurred, where the third classifier judges that the user has the intention to continue speaking when its input data is higher than the decision threshold and judges that the user does not have that intention when its input data is lower than or equal to the decision threshold.
  • In a fourth aspect, a training device for a prediction model for voice detection is provided, including:
  • an acquisition module, used to acquire a sample audio signal set and a sample face image set to be labelled;
  • a processing module, configured to process the third sample face image in the sample face image set according to the first sample audio signal in the sample audio signal set to obtain a first sample face image, where the first sample face image is annotated with a first label, the first label indicates that the sample user has the intention to continue speaking, and the shooting time point and shooting object of the first sample face image are the same as the collection time point and collection object of the first sample audio signal;
  • the processing module is further configured to process the fourth sample face image in the sample face image set according to the second sample audio signal in the sample audio signal set to obtain a second sample face image, where the second sample face image is annotated with a second label, the second label indicates that the sample user does not have the intention to continue speaking, and the shooting time point and shooting object of the second sample face image are the same as the collection time point and collection object of the second sample audio signal; and
  • a training module, used to perform model training using the first sample face image and the second sample face image to obtain a prediction model, where the prediction model is used to predict whether the user has the intention to continue speaking.
  • In one possible implementation, the first sample audio signal satisfies a first condition.
  • The first condition includes at least one of the following:
  • the VAD result corresponding to the first sample audio signal is first updated from the speaking state to the silent state, and then updated from the silent state back to the speaking state; or
  • the tail silence duration of the first sample audio signal is less than a first threshold and greater than a second threshold, where the first threshold is greater than the second threshold; or
  • a first confidence level of a text information combination is greater than a second confidence level of first text information, where the text information combination is a combination of the first text information and second text information, the first text information represents the semantics of the sample audio signal preceding the first sample audio signal, the second text information represents the semantics of the sample audio signal following the first sample audio signal, the first confidence level represents the probability that the text information combination is a complete sentence, and the second confidence level represents the probability that the first text information is a complete sentence; or
  • the first confidence level of the text information combination is greater than a third confidence level of the second text information, where the third confidence level represents the probability that the second text information is a complete sentence.
  • In one possible implementation, the second sample audio signal satisfies a second condition, and the second condition includes at least one of the following:
  • the VAD result corresponding to the second sample audio signal is updated from the speaking state to the silent state; or
  • the tail silence duration of the second sample audio signal is greater than the first threshold.
  • In one possible implementation, the first sample face image satisfies a third condition.
  • The third condition includes: after the first sample face image is input into the first classifier in the prediction model and the second classifier in the prediction model, the probability output by the first classifier is greater than the probability output by the second classifier, where the first classifier is used to predict the probability that the face image contains an action and the second classifier is used to predict the probability that the face image does not contain an action.
  • In a fifth aspect, an electronic device is provided. The electronic device includes a processor configured to execute instructions so that the electronic device executes the voice detection method provided in the first aspect or any one of the optional manners of the first aspect.
  • In a sixth aspect, an electronic device is provided. The electronic device includes a processor configured to execute instructions so that the electronic device executes the training method for the prediction model for voice detection provided in the second aspect or any one of the optional manners of the second aspect. For specific details of the electronic device provided in the sixth aspect, reference may be made to the second aspect or any of its optional manners, which will not be repeated here.
  • In another aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, and the instruction is read by a processor to make an electronic device execute the voice detection method provided in the first aspect or any one of the optional manners of the first aspect.
  • In another aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, and the instruction is read by a processor to make an electronic device execute the training method for the prediction model for voice detection provided in the second aspect or any one of the optional manners of the second aspect.
  • In another aspect, a computer program product is provided. When the computer program product runs on an electronic device, the electronic device executes the voice detection method provided in the first aspect or any one of the optional manners of the first aspect.
  • In another aspect, a computer program product is provided. When the computer program product runs on an electronic device, the electronic device executes the training method for the prediction model for voice detection provided in the second aspect or any one of the optional manners of the second aspect.
  • In another aspect, a chip is provided. When the chip runs on an electronic device, the electronic device executes the voice detection method provided in the first aspect or any one of the optional manners of the first aspect.
  • In another aspect, a chip is provided. When the chip runs on an electronic device, the electronic device executes the training method for the prediction model for voice detection provided in the second aspect or any one of the optional manners of the second aspect.
  • FIG. 1 is a schematic diagram of an implementation environment of a voice detection method provided by an embodiment of the present application
  • FIG. 2 is a flowchart of a method for training a prediction model for speech detection provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a condition that needs to be met for marking a first label provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of a condition that needs to be met for marking a second label provided by an embodiment of the present application
  • FIG. 5 is a schematic structural diagram of a prediction model provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of a voice detection method provided by an embodiment of the present application.
  • Fig. 7 is a schematic diagram of a syntactic analysis provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a syntax analysis provided by an embodiment of the present application.
  • FIG. 9 is a flowchart of a voice detection method provided by an embodiment of the present application.
  • FIG. 10 is a flowchart of a voice detection method in a vehicle-mounted scenario provided by an embodiment of the present application.
  • FIG. 11 is a software architecture diagram of a voice detection method provided by an embodiment of the present application.
  • FIG. 12 is a flowchart of a voice detection method provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a voice detection device provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a training device for a prediction model for speech detection provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a terminal 100 provided by an embodiment of the present application.
  • FIG. 16 is a functional architecture diagram of a terminal 100 provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • It should be understood that the size of the sequence numbers of the processes described below does not imply their order of execution; the execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application.
  • Determining B according to A does not mean that B is determined only according to A; B can also be determined according to A and/or other information.
  • Reference throughout this specification to "one embodiment" or "an embodiment" means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present application. Therefore, the appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification do not necessarily refer to the same embodiment. In addition, these specific features, structures, or characteristics can be combined in one or more embodiments in any suitable manner.
  • Voice endpoint detection refers to the technology of detecting the voice end point in audio. Audio usually comprises multiple audio signals; in the process of endpoint detection, each audio signal can be examined in turn to determine whether the current audio signal is the voice end point.
  • Voice endpoint detection technology is usually applied in voice interaction scenarios. After the user speaks, endpoint detection is performed on the audio to determine the voice start point and the voice end point, and the audio between the two points is intercepted as a voice command.
  • Voice interaction is usually initiated by the user.
  • For example, the trigger mode of voice interaction may be a push-to-talk (PTT) mode, in which the user initiates voice interaction by pressing a physical button or a virtual button; for another example, the trigger mode may be a voice trigger (VT) mode, in which the user initiates voice interaction by saying a wake-up word. This makes it relatively easy to accurately detect the voice start point.
  • The voice end point, in contrast, usually needs to be detected automatically by the machine.
  • Detection of the voice end point is usually achieved by relying only on automatic speech recognition (Auto Speech Recognition, ASR) and voice activity detection (Voice Activity Detection, VAD) technologies.
  • VAD is used to detect whether the audio signal in a certain time window is a voice signal.
  • The voice end point detection scheme that relies on VAD is: when VAD detects non-voice for a certain duration, it determines that the voice has ended. This duration is generally fixed, for example 800 milliseconds: if VAD detects non-voice for more than 800 milliseconds, it determines that the voice has ended and takes the currently detected audio signal as the voice endpoint. The trailing silence (Trailing Silence, TS) duration is thus an important parameter of this endpoint detection method. However, it is difficult for a fixed duration parameter to suit all scenarios and environments: if the duration parameter is set too large, the user experiences a long delay; if it is set too small, the user's voice is easily cut off.
  • ASR and VAD technologies have two main problems in detecting the voice end point: first, background noise easily causes the voice end point to be detected late; second, if the voice contains a pause in the middle, the detected voice end point is easily too early.
  • These two problems greatly affect the user experience. Because of the first problem, the machine takes a long time to decide that the voice command has ended, since the detected end time is later than the actual end time; the voice command is therefore executed only some time after it is finished, which introduces too much delay. From the user's point of view, the system makes the user wait a long time after the voice has been spoken. Because of the second problem, the voice command is easily truncated, so the user's intention is misinterpreted.
  • VAD, which relies solely on acoustic information, may therefore not be sufficient to accurately determine the voice end point in some scenarios.
  • In contrast, the method provided in this application does not depend on a specific ASR engine or a specific scene; the detection of each modality can be performed independently and judged comprehensively, which makes it easier to operate.
  • Fig. 1 is a schematic diagram of an implementation environment of a voice detection method provided by an embodiment of the present application.
  • the implementation environment includes: terminal and voice detection platform.
  • the terminal may be a vehicle-mounted terminal 101, a smart phone 102, a smart speaker 103, or a robot 104.
  • the terminal can also be other electronic devices that support voice detection, such as smart home devices, smart TVs, game consoles, desktop computers, tablet computers, e-book readers, and smart devices.
  • the terminal can run applications that support voice detection.
  • the application can be a navigation application, a voice assistant, an intelligent question answering application, etc.
  • The terminal is a terminal used by a user, a user account is logged in to an application program running on the terminal, and the user account may be registered on the voice detection platform in advance.
  • the terminal can be connected to the voice detection platform through a wireless network or a wired network.
  • the voice detection platform is used to provide background services for applications that support voice detection.
  • the voice detection platform may execute the following method embodiments, train to obtain a prediction model, and send the prediction model to the terminal so that the terminal can use the prediction model to perform voice detection.
  • the voice detection platform includes a server 201 and a database 202.
  • the server 201 may be one server or a cluster composed of multiple servers.
  • the database 202 may be used to store sample sets, such as a sample face image set containing a large number of sample face images, a sample audio signal set containing a large number of sample audio signals, and so on.
  • the server 201 can access the database 202 to obtain a sample set stored in the database 202, and obtain a prediction model through sample set training.
  • The number of the aforementioned terminals, servers, or databases may be larger or smaller: there may be only one terminal, server, or database, or there may be dozens, hundreds, or more, in which case the voice detection system also includes other terminals, other servers, or other databases.
  • the system architecture is exemplarily introduced above, and the following exemplarily introduces the method flow of voice detection based on the system architecture provided above.
  • the method flow of voice detection may include a model training stage and a model prediction stage.
  • the method flow of the model training phase is introduced through the embodiment of FIG. 2
  • the method flow of the model prediction phase is introduced through the embodiment of FIG. 6.
  • FIG. 2 is a flowchart of a method for training a prediction model for speech detection according to an embodiment of the present application.
  • The method can be applied to an electronic device; the electronic device may be the terminal in the system architecture shown in FIG. 1, or may be the voice detection platform in the system architecture shown in FIG. 1, such as the server 201.
  • the method includes the following steps:
  • Step 201 The electronic device obtains a sample audio signal set and a sample face image set to be labeled.
  • the sample audio signal set includes multiple sample audio signals
  • the sample face image set includes multiple sample face images.
  • The shooting time point and shooting object of each sample face image are the same as the collection time point and collection object of the corresponding sample audio signal, so the sample face image can be annotated according to the correspondence between the sample face image and the sample audio signal.
  • Step 202 The electronic device processes the third sample face image in the sample face image set according to the first sample audio signal in the sample audio signal set to obtain the first sample face image.
  • the electronic device may obtain a sample audio signal set, the sample audio signal set includes a plurality of sample audio signals, and there may be a correspondence between each sample audio signal and each sample face image.
  • the correspondence between the sample audio signal and the sample face image means that the collection time point of the sample audio signal is the same as the shooting time point of the sample face image.
  • For example, the sample face image taken at time X:Y corresponds to the sample audio signal collected at time X:Y.
  • the sample audio signal set can be acquired in multiple ways.
  • the electronic device may include a microphone, and the electronic device may receive a recording instruction, and in response to the recording instruction, collect audio sent by a sample user through the microphone to obtain a sample audio signal.
  • the recording instruction can be triggered by the user's operation.
  • the electronic device may request a sample audio signal set from the server through the network, and this embodiment does not limit how to obtain the sample audio signal set.
  • the third sample face image is an unlabeled sample face image, and the third sample face image may be any sample face image in the sample face image set.
  • the first sample audio signal and the third sample face image have a corresponding relationship, and the shooting time point and shooting object of the third sample face image are the same as the collection time point and collection object of the corresponding first sample audio signal.
  • the electronic device may obtain a sample face image set, the sample face image set includes a plurality of third sample face images, and the sample face image set may be obtained in multiple ways.
  • the electronic device may include a camera, and the electronic device may receive a shooting instruction, and in response to the shooting instruction, photograph the sample user through the camera to obtain a sample face image set.
  • the shooting instruction is used to instruct the electronic device to shoot, and the shooting instruction can be triggered by a user's operation.
  • the electronic device can read a pre-stored sample face image set.
  • the electronic device may request a sample face image set from the server via the network, and this embodiment does not limit how to obtain the sample face image set.
  • The first sample face image is a labelled sample face image that is obtained by adding a label to the third sample face image. Since the shooting time point and shooting object of the third sample face image are the same as those of the first sample face image, the shooting time point and shooting object of the first sample face image are also the same as the collection time point and collection object of the first sample audio signal.
  • The content of the first sample face image is the face of the sample user; it contains features indicating that the sample user has the intention to continue speaking, and it can be obtained by photographing the sample user with a camera.
  • the number of the first sample face images may be multiple, and the sample users corresponding to different first sample face images may be the same or different.
  • the first sample face image is annotated with the first label.
  • the first label indicates that the sample user has the intention to continue speaking.
  • the first label can be in any data format, such as numbers, letters, character strings, and so on.
  • the first tag may be "Think before speaking” (thinking state before speaking).
  • In a first acquisition manner, the electronic device acquires a sample face image set. For each third sample face image in the set, the electronic device determines the first sample audio signal corresponding to that third sample face image and determines whether the first sample audio signal satisfies the first condition. If it does, the first label is added to the third sample face image to obtain the first sample face image, which consists of the third sample face image and the first label. It can be seen from this process that, in the first acquisition manner, the first label is determined according to the first sample audio signal.
  • the first condition is used to determine whether the first sample audio signal contains the intention to continue speaking. If the first sample audio signal satisfies the first condition, the corresponding third sample face image can be marked as the first sample face image.
  • the first condition can be set according to experiment, experience or demand.
  • the first condition may include at least one of the following first condition (1) to first condition (4):
  • First condition (1): the VAD result corresponding to the first sample audio signal is first updated from the speaking state to the silent state, and then updated from the silent state back to the speaking state.
  • The sample face image represents the user intention of the sample user at time X:Y in the visual dimension, while the sample audio signal represents the user intention of the sample user at time X:Y in the acoustic dimension.
  • It can be seen that the sample face image and the sample audio signal reflect the same user intention from different modalities.
  • Therefore, the corresponding sample face image and sample audio signal can be used to mine the characteristics of the user's intention in the acoustic and visual modalities, and this association can be used to fuse multi-modal features to detect the voice end point.
  • the detection process for the first condition (1) may include: the electronic device may include a VAD unit, and the VAD unit is used to detect whether the audio signal of the current time window is a voice signal.
  • the VAD unit can be software, hardware, or a combination of software and hardware.
  • the input parameter of the VAD unit may include an audio signal
  • the output parameter of the VAD unit may include a VAD result of the audio signal
  • the VAD result may include a speaking state and a silent state.
  • the speaking state indicates that the audio signal is a voice signal. For example, the speaking state can be recorded as Speech in the program; the silent state indicates that the audio signal is not a voice signal, and the silent state can be recorded as Silence in the program.
  • the first sample audio signal may be input to the VAD unit, and the VAD unit is used to perform VAD processing on the first sample audio signal, and output the VAD result. If the VAD result is the speaking state first, then switched to the silent state, and then switched back to the speaking state, it can be determined that the first sample audio signal satisfies the first condition (1).
• for the audio before the pause, the VAD result is the speaking state; for the audio during the pause, the VAD result is the silent state; for the audio after the pause, the VAD result returns to the speaking state.
• if the collected sample audio signal meets the first condition (1), it indicates that the sample audio signal is consistent with the audio during the pause in this scene. Since the sample user continued to speak after the pause instead of ending the voice, the intention of the sample user at the pause time is to continue speaking.
• the sample face image taken at the pause time will therefore contain the user's intention to continue speaking. By marking this sample face image as the first sample face image, the model can learn, from the first sample face image, the mapping relationship between the face image and the user's intention to continue speaking. In the model application stage, the model can then be used to predict, based on an unknown face image, whether the current user has the intention to continue speaking. A sketch of the condition (1) check is given below.
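• As an illustration only (not the patent's implementation), the following Python sketch checks first condition (1), i.e. whether the VAD result of the first sample audio signal switches from the speaking state to the silent state and then back to the speaking state; the state names and the function name are assumptions.

```python
# Minimal sketch of first condition (1): the VAD result switches
# Speech -> Silence -> Speech, which indicates a pause inside the utterance.

SPEECH, SILENCE = "Speech", "Silence"

def satisfies_condition_1(vad_states):
    """vad_states: per-frame VAD results, e.g. ["Speech", "Silence", "Speech", ...]."""
    # Collapse consecutive identical states into a sequence of segments.
    segments = []
    for state in vad_states:
        if not segments or segments[-1] != state:
            segments.append(state)
    # The pattern Speech -> Silence -> Speech must appear in the segment sequence.
    for i in range(len(segments) - 2):
        if segments[i:i + 3] == [SPEECH, SILENCE, SPEECH]:
            return True
    return False

# Example: a pause inside an utterance satisfies the condition.
assert satisfies_condition_1(["Speech", "Speech", "Silence", "Silence", "Speech"])
```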
  • the first condition (2) the tail silence duration of the first sample audio signal is less than the first threshold and greater than the second threshold.
  • the trailing silence duration is also called Trailing Silence (TS), which refers to the total duration of the silence segment at the tail of the voice signal.
  • the threshold value can be used to detect whether the tail silence duration of the audio signal has met the first condition for the end of the voice.
  • the threshold corresponding to the tail silence duration may include a first threshold and a second threshold.
• the first threshold may be the maximum value of the thresholds corresponding to the tail silence duration, and the first threshold is greater than the second threshold. If the tail silence duration is greater than the first threshold, it can be determined that the audio signal is the end point of the voice. For example, the first threshold can be written as Dmax in the program.
  • the specific value of the first threshold can be configured according to experiment, experience or requirements, and the specific value of the first threshold is not limited in this embodiment.
• the second threshold may be the minimum value of the thresholds corresponding to the tail silence duration. If the tail silence duration is greater than the second threshold, it can be determined that the audio signal has a probability of being the end point of the voice, that is, the audio signal may or may not be the end point of the voice, and the characteristics of other modalities can be further used to determine whether the audio signal is the voice end point. For example, the second threshold can be written as Dmin in the program.
• the electronic device can detect the tail silence duration of the first sample audio signal and compare the tail silence duration with the first threshold and the second threshold. If the tail silence duration is less than the first threshold and greater than the second threshold, it can be determined that the first sample audio signal satisfies the first condition (2). A sketch of this check is given below.
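• The following minimal sketch checks first condition (2), namely that the tail silence duration lies strictly between the second threshold (Dmin) and the first threshold (Dmax); the concrete threshold values used here are assumptions for illustration.

```python
# Minimal sketch of first condition (2): Dmin < tail silence duration < Dmax.

D_MAX = 0.8   # first threshold (Dmax), seconds; assumed value
D_MIN = 0.2   # second threshold (Dmin), seconds; assumed value

def satisfies_condition_2(tail_silence_duration, d_max=D_MAX, d_min=D_MIN):
    return d_min < tail_silence_duration < d_max
```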
  • the text information combination is a combination of the first text information and the second text information.
  • the first text information represents the semantics of the previous sample audio signal of the first sample audio signal.
  • the second text information represents the semantics of the next sample audio signal of the first sample audio signal.
  • the text information combination may be an ordered combination, with the first text information in the front and the second text information in the back.
  • the text information combination can represent the semantics of the entire audio segment during the speech
  • the first text information can represent the semantics before the pause
  • the second text information can represent the semantics after the pause.
  • the user says "I'm going to Jinhai Road”, then pauses for a while, and then continues to say "Jinsui Road”.
  • the first sample audio signal may be a silent segment during the pause
  • the first text information is the text information corresponding to the audio signal before the pause, that is, "I want to go to Jinhai Road”.
  • the second text information is the text information corresponding to the audio signal after the pause, that is, "Jinsui Road”.
• the text information combination can be a combination of "I'm going to Jinhai Road" and "Jinsui Road", that is, "I'm going to Jinhai Road Jinsui Road".
  • the first degree of confidence indicates the probability that the text information is combined into a complete sentence.
• the greater the first confidence level, the higher the probability that the text information combination is a complete sentence, the higher the probability that the first sample audio signal is a pause rather than an ending, and therefore the higher the probability that the third sample face image corresponding to the first sample audio signal contains the intention to continue speaking.
  • the first confidence level may indicate the probability that "I am going to Jinhai Road and Jinsui Road" is a complete sentence.
• the first degree of confidence can be denoted as Conf_merge.
  • the second confidence degree indicates the probability that the first text information is a complete sentence.
• the greater the second confidence level, the higher the probability that the first text information is a complete sentence, the higher the probability that the first sample audio signal is an ending rather than a pause, and therefore the higher the probability that the third sample face image corresponding to the first sample audio signal contains the intention not to continue speaking.
  • the second confidence level may indicate the probability that "I am going to Jinhai Road" is a complete sentence.
• the second degree of confidence can be denoted as Conf_split1.
• the process by which the electronic device detects the first condition (3) may include the following steps:
  • Step 1 Perform voice recognition on the previous sample audio signal of the first sample audio signal to obtain first text information.
  • Step 2 Perform voice recognition on the next sample audio signal of the first sample audio signal to obtain second text information.
  • Step 3 Splicing the first text information and the second text information to obtain a text information combination.
  • Step 4 Perform syntactic analysis on the text information combination to obtain the first confidence.
  • Step 5 Perform a syntactic analysis on the first text information to obtain a second degree of confidence.
  • Step 6 The first confidence degree and the second confidence degree are compared. If the first confidence degree is greater than the second confidence degree, it can be determined that the sample face image satisfies the first condition (3).
  • the first condition (4) the first confidence level corresponding to the text information combination is greater than the third confidence level corresponding to the second text information.
  • the third degree of confidence indicates the probability that the second text information is a complete sentence.
  • the greater the third degree of confidence the higher the probability that the second text information is a complete sentence.
  • the third confidence level may indicate the probability that "Jinsui Road" is a complete sentence.
• the process by which the electronic device detects the first condition (4) may include the following steps:
  • Step 1 Perform voice recognition on the previous sample audio signal of the first sample audio signal to obtain first text information.
  • Step 2 Perform voice recognition on the next sample audio signal of the first sample audio signal to obtain second text information.
  • Step 3 Splicing the first text information and the second text information to obtain a text information combination.
  • Step 4 Perform syntactic analysis on the text information combination to obtain the first confidence.
  • Step 5 Perform a syntactic analysis on the second text information to obtain a third degree of confidence.
  • Step 6 The first confidence degree and the third confidence degree are compared. If the first confidence degree is greater than the third confidence degree, it can be determined that the sample face image satisfies the first condition (4).
  • first condition (3) and the first condition (4) can be combined, and the first condition (3) and the first condition (4) in the combination scheme can be in an AND relationship.
  • the combination scheme of the first condition (3) and the first condition (4) may be:
  • Step 1 Perform voice recognition on the previous sample audio signal of the first sample audio signal to obtain first text information.
  • Step 2 Perform voice recognition on the next sample audio signal of the first sample audio signal to obtain second text information.
  • Step 3 Splicing the first text information and the second text information to obtain a text information combination.
  • Step 4 Perform syntactic analysis on the text information combination to obtain the first confidence.
  • Step 5 Perform a syntactic analysis on the first text information to obtain a second degree of confidence.
  • Step 6 Perform a syntactic analysis on the second text information to obtain a third degree of confidence.
• Step 7 Compare the first confidence level with the second confidence level, and compare the first confidence level with the third confidence level. If the first confidence level is greater than the second confidence level and greater than the third confidence level, it can be determined that the sample face image satisfies the first condition. If the first confidence level is less than or equal to the second confidence level, or less than or equal to the third confidence level, it can be determined that the sample face image does not satisfy the first condition. A sketch of this combined check is given below.
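• A minimal sketch of the combined check of first condition (3) and first condition (4) follows; `sentence_completeness` is an assumed placeholder for the syntactic analysis that returns the probability that a text string is a complete sentence.

```python
# Minimal sketch of the combined scheme of first condition (3) and (4):
# Conf_merge must exceed both Conf_split1 and Conf_split2.

def satisfies_conditions_3_and_4(first_text, second_text, sentence_completeness):
    merged = first_text + second_text                  # text information combination
    conf_merge = sentence_completeness(merged)         # first confidence level
    conf_split1 = sentence_completeness(first_text)    # second confidence level
    conf_split2 = sentence_completeness(second_text)   # third confidence level
    return conf_merge > conf_split1 and conf_merge > conf_split2
```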
• the effect achieved can at least include the following: for a sentence containing a short pause, the related technology uses the pause as a dividing point and splits the complete sentence into two voice segments. For example, for "I'm going to Jinhai Road Jinsui Road", if the user pauses after saying "I'm going to Jinhai Road", the electronic device divides this sentence into "I'm going to Jinhai Road" and "Jinsui Road". When the user says the "Road" in "I'm going to Jinhai Road", the electronic device determines prematurely that the voice end point has been detected, which leads to premature detection of the voice end point.
• in that case the electronic device directly uses "I want to go to Jinhai Road" as the voice command and ignores the "Jinsui Road" that follows, resulting in incomplete recognition of the voice command. If the electronic device directly performs service processing in response to "I want to go to Jinhai Road", such as navigating to Jinhai Road, the accuracy of the service processing will undoubtedly be affected.
  • first condition (1) to first condition (4) can be combined in any manner.
  • only one first condition among the four first conditions may be used, or two or more first conditions among the four first conditions may be executed.
  • the logical relationship between the different first conditions can be an AND relationship or an OR relationship.
  • the situation that the first condition is satisfied may be as shown in FIG. 3.
• the time sequence for determining the different first conditions in the combined scheme is not limited. One first condition may be determined before or after another first condition, or multiple first conditions may be determined in parallel.
• the electronic device may also determine whether the third sample face image satisfies the third condition, and if the third sample face image satisfies the third condition, the first label is added to the third sample face image to obtain the first sample face image.
• the third condition can be combined with any one or more of the first condition (1) to the first condition (4). If the third condition is combined with the first condition (1) to the first condition (4), the logical relationship between the third condition and the first condition can be an AND relationship or an OR relationship.
• the third condition includes: after the first sample face image is respectively input to the first classifier in the prediction model and the second classifier in the prediction model, the probability output by the first classifier is greater than the probability output by the second classifier.
  • the first classifier is used to predict the probability that the face image contains the action
  • the output parameter of the first classifier may be the probability that the face image contains the action
  • the input parameter of the first classifier may be the feature of the key points of the face image.
  • the first classifier can be called an action unit.
  • the first classifier may be a part of the prediction model.
  • the first classifier may be a part of the action recognition layer in FIG. 5, and may include one or more AUs.
  • the second classifier is used to predict the probability that the face image does not contain an action
  • the output parameter of the second classifier can be the probability that the face image does not contain an action
• the input parameter of the second classifier can be the feature of the key points of the face image.
  • the second classifier can be called an action unit.
  • the first classifier and the second classifier can be used in combination. For example, if the probability of the output of the first classifier is greater than the probability of the output of the second classifier, it indicates that the output result of the first classifier is valid.
• there may be multiple first classifiers; each first classifier may be used to predict one action, or a combination of multiple first classifiers may be used to predict one action.
  • the sum of the probabilities output by the multiple first classifiers and the probabilities output by the second classifiers may be equal to one.
  • N there may be N first classifiers, and N is a positive integer.
• the i-th first classifier among the N first classifiers can be denoted as AU_i, and the probability output by AU_i can be denoted as PAU_i.
  • the second classifier can be denoted as NEU
  • the probability output by NEU can be denoted as PNEU.
• the sum of PAU_1, PAU_2, ..., PAU_N and PNEU is 1. If the output of the first classifier AU_i is greater than the output of the second classifier NEU, that is, if PAU_i is greater than PNEU, the current output result of the first classifier AU_i is valid; if the output of the first classifier AU_i is less than or equal to the output of the second classifier NEU, that is, if PAU_i is less than or equal to PNEU, the current output result of the first classifier AU_i is invalid.
  • N is a positive integer
  • i is a positive integer
  • i is less than N.
• the first condition (5) may specifically be: the probability output by any first classifier is greater than the probability output by the second classifier, that is, PAU_i > PNEU.
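• The following minimal sketch expresses this classifier-based check; the probability values are assumed inputs rather than outputs of an actual prediction model.

```python
# Minimal sketch: the condition holds if any first classifier's output PAU_i
# exceeds the second classifier's output PNEU.

def satisfies_classifier_condition(pau, pneu):
    """pau: list of PAU_1..PAU_N; pneu: PNEU."""
    return any(p > pneu for p in pau)

# Example with assumed probabilities (they sum to 1).
assert satisfies_classifier_condition([0.5, 0.2, 0.1], 0.2)
```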
  • Acquisition method 2 The electronic device sends an acquisition request to the database.
  • the acquisition request is used to request the acquisition of the first sample face image.
  • the database reads the first sample face image and returns it to the electronic device.
  • Acquisition method 3 The electronic device accesses the local disk and reads the first sample face image pre-stored in the disk.
  • acquisition method 1 to acquisition method 3 are only exemplary descriptions, and do not represent a mandatory implementation method of the first sample facial image acquisition function.
• other implementations may also be used to implement the function of acquiring the first sample face image; these other ways of implementing the first sample face image acquisition function are specific cases of step 202 and should also be covered by the protection scope of the embodiments of the present application.
  • Step 203 The electronic device processes the fourth sample face image in the sample face image set according to the second sample audio signal in the sample audio signal set to obtain the second sample face image.
  • the fourth sample face image is an unlabeled sample face image, and the fourth sample face image may be any sample face image in the sample face image set.
  • the second sample audio signal and the fourth sample face image have a corresponding relationship, and the shooting time point and the shooting object of the fourth sample face image are the same as the collection time point and collection object of the corresponding second sample audio signal.
• the second sample face image is a labeled sample face image, and the second sample face image can be obtained by adding a label to the fourth sample face image. Since the shooting time point and the shooting object of the fourth sample face image are the same as those of the second sample face image, the shooting time point and the shooting object of the second sample face image are also the same as the collection time point and the collection object of the second sample audio signal.
• the content of the second sample face image is the face of the sample user, the second sample face image contains the feature that the sample user does not have the intention to continue speaking, and the second sample face image can be obtained by photographing the sample user with a camera.
  • the number of the second sample face images may be multiple, and the sample users corresponding to different second sample face images may be the same or different.
  • the sample user corresponding to the second sample face image and the sample user corresponding to the first sample face image may be the same or different.
• the second sample face image is annotated with the second label.
  • the second label indicates that the sample user has no intention to continue speaking.
  • the second label can be in any data format, such as numbers, letters, character strings, and so on.
  • the second label may be "Neutral".
• the electronic device acquires a sample face image set. For each fourth sample face image in the sample face image set, the electronic device can determine the second sample audio signal corresponding to the fourth sample face image and determine whether the second sample audio signal satisfies the second condition. If the second sample audio signal satisfies the second condition, a second label is added to the fourth sample face image to obtain a second sample face image.
  • the second sample face image includes the fourth sample face image and the second label. It can be seen from this process that in the first acquisition method, the second label is determined based on the second sample audio signal.
  • the second condition is used to determine whether the corresponding second sample audio signal contains the intention not to continue speaking. If the second sample audio signal meets the second condition, the corresponding fourth sample face image can be marked as the second sample face image .
  • the second condition can be set according to experiment, experience or demand.
  • the second condition includes at least one of the following second condition (1) to second condition (2):
  • the second condition (1) The VAD result corresponding to the second sample audio signal is updated from the speaking state to the silent state.
  • the second condition (2) the tail silence duration of the second sample audio signal is greater than the first threshold.
• the electronic device can detect the tail silence duration of the second sample audio signal and compare the tail silence duration with the first threshold. If the tail silence duration is greater than the first threshold, since the tail silence duration exceeds the maximum threshold, it indicates that the second sample audio signal is not a pause but an ending, and it can be determined that the second sample audio signal satisfies the second condition (2).
• in this case, the label "Neutral" can be added to the sample face image to indicate that the sample face image corresponds to the user intention of not continuing to speak. A sketch of this automatic labeling is given below.
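• The following sketch combines the conditions above into one automatic labeling routine; it reuses the helper `satisfies_condition_1` from the earlier sketch and illustrative thresholds, both of which are assumptions rather than the patent's exact logic.

```python
# Minimal sketch of automatic labeling: a pause inside the utterance yields the
# first label, a long tail silence yields the second label ("Neutral").

FIRST_LABEL = "Think before speaking"
SECOND_LABEL = "Neutral"

def label_sample(face_image, vad_states, tail_silence_duration, d_max=0.8, d_min=0.2):
    # Second condition (2): tail silence longer than the maximum threshold -> "Neutral".
    if tail_silence_duration > d_max:
        return face_image, SECOND_LABEL
    # First condition (1) or (2): a pause inside the utterance -> "Think before speaking".
    if satisfies_condition_1(vad_states) or d_min < tail_silence_duration < d_max:
        return face_image, FIRST_LABEL
    # Otherwise leave the image unlabeled.
    return face_image, None
```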
• the effect achieved can at least include the following: the captured face images, the collected audio signals, the semantics of the text information and so on can be fused, so the training data is automatically labeled using global information. Since the information of each modality is comprehensively considered, it can be ensured that the label of the sample face image matches the user's intention of whether to continue to speak, so the accuracy of the labeled samples is high. After the model is trained on accurate samples, the accuracy of predicting the user's intention will also be high, which helps to accurately detect the voice end point in the model application stage.
  • Acquisition method 2 The electronic device sends an acquisition request to the database.
  • the acquisition request is used to request the acquisition of the second sample face image.
  • the database reads the second sample face image and returns it to the electronic device.
  • Acquisition method 3 The electronic device accesses the local disk and reads the second sample face image pre-stored in the disk.
  • step 202 and step 203 can be performed sequentially. For example, step 202 may be performed first, and then step 203; or step 203 may be performed first, and then step 202 may be performed. In other embodiments, step 202 and step 203 can also be performed in parallel, that is, step 202 and step 203 can be performed at the same time.
  • Step 204 The electronic device uses the first sample face image and the second sample face image to perform model training to obtain a prediction model.
  • the prediction model is used to predict whether the user has the intention to continue speaking.
  • the prediction model may be a two-classifier, and the prediction result of the prediction model may include the first value and the second value.
  • the first value of the prediction result indicates that the user has the intention to continue speaking.
  • the second value of the prediction result indicates that the user does not have the intention to continue speaking.
  • the first value and the second value can be any two different data.
  • the first value of the prediction result may be 1, and the second value of the prediction result may be 0; or, the first value of the prediction result may be 0, and the second value of the prediction result may be 1.
• if the prediction model predicts, based on the input face image, that the face image represents that the user has the intention to continue speaking, the prediction model can output 1; if the prediction model predicts, based on the input face image, that the face image represents that the user does not have the intention to continue speaking, the prediction model can output 0.
  • the prediction model may be an artificial intelligence (AI) model.
  • the specific types of prediction models can include multiple types.
  • the prediction model may include at least one of a neural network, a support vector machine, a linear regression model, a logistic regression model, a decision tree, or a random forest.
  • the predictive model can be a neural network.
  • the prediction model may be a convolutional neural network or a recurrent neural network.
  • each module in the predictive model can be a layer, or each module can be a network composed of multiple layers.
  • Each layer can include one or more nodes.
  • the prediction model includes an input layer, a first hidden layer, an action recognition layer, a second hidden layer, and an output layer.
  • the prediction model may include a key point extraction module (not shown in FIG. 5).
  • the connection here refers to data interaction.
  • the input layer can be connected to the first hidden layer
  • the first hidden layer is connected to the action recognition layer
  • the action recognition layer is connected to the second hidden layer
  • the second hidden layer is connected to the output layer.
  • the key point extraction module can be connected to the input layer. It should be understood that although not shown in FIG. 5, different modules in the prediction model may also have other connection relationships. For example, different layers can be connected across layers.
  • the key point extraction module is used to extract the features of the key points from the face image, and input the features of the key points to the input layer.
  • the input layer is used to output the features of the key points to the first hidden layer.
  • the first hidden layer is used to perform linear mapping and nonlinear mapping on the features of the key points to obtain the features of the mapped key points, and output the features of the mapped key points to the action recognition layer.
  • the action recognition layer is used to identify the features of the mapped key points to obtain the action features, and output the action features to the second hidden layer.
  • the second hidden layer is used to perform linear mapping and non-linear mapping on the action feature to obtain the mapped action feature, and output the mapped action feature to the output layer.
  • the output layer is used to classify the mapped action features to obtain the confidence levels corresponding to different categories; the prediction results are determined according to the confidence levels.
  • the input layer may include multiple nodes, and each node of the input layer is used to receive a feature of a key point.
  • the input layer may include FP1, FP2, FP3...FPn, FP1 is used to receive the features of key point 1 and sent to the hidden layer; FP2 is used to receive the features of key point 2 and sent to the hidden layer; FP3 It is used to receive the feature of key point 3 and send it to the hidden layer, and so on.
  • FPn is used to receive the feature of key point n and send it to the hidden layer.
  • the action recognition layer may include multiple first classifiers and second classifiers.
  • Each first classifier can receive the features of the mapped key points from the first hidden layer, and after performing action recognition, obtain the probability that the face image contains the action.
  • the second classifier may receive the features of the mapped key points from the first hidden layer, and after performing action recognition, obtain the probability that the face image does not contain the action. If the output result of the first classifier is greater than the output result of the second classifier, the output result of the first classifier can be sent to the second hidden layer.
• the action recognition layer may include N first classifiers, each of the N first classifiers may be called an action unit (Action Unit, AU), and the N first classifiers are respectively denoted as AU_1, AU_2, AU_3, ..., AU_N. A minimal sketch of this layer layout is given below.
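• The following PyTorch sketch illustrates the layer layout described above (input layer, first hidden layer, action recognition layer, second hidden layer, output layer); the layer sizes, the number of action units and the use of softmax are assumptions chosen only for illustration, not values fixed by this embodiment.

```python
# Minimal architecture sketch of the prediction model, assuming 68 facial key
# points with (x, y) coordinates and 12 action units.

import torch.nn as nn

class IntentPredictionModel(nn.Module):
    def __init__(self, num_keypoint_features=136, num_action_units=12, hidden=64):
        super().__init__()
        # First hidden layer: linear mapping followed by a nonlinear mapping.
        self.first_hidden = nn.Sequential(
            nn.Linear(num_keypoint_features, hidden), nn.ReLU())
        # Action recognition layer: N first classifiers (AU_1..AU_N) plus NEU,
        # normalized so that the N + 1 probabilities sum to 1.
        self.action_recognition = nn.Sequential(
            nn.Linear(hidden, num_action_units + 1), nn.Softmax(dim=-1))
        # Second hidden layer: linear mapping followed by a nonlinear mapping.
        self.second_hidden = nn.Sequential(
            nn.Linear(num_action_units + 1, hidden), nn.ReLU())
        # Output layer: two categories, intends to continue vs. does not intend.
        self.output = nn.Linear(hidden, 2)

    def forward(self, keypoint_features):
        h = self.first_hidden(keypoint_features)
        au_probs = self.action_recognition(h)
        h2 = self.second_hidden(au_probs)
        return self.output(h2)   # logits over the two categories
```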
  • model training can include multiple implementation methods.
  • model training may include a process of multiple iterations.
  • the process of each iteration can include the following steps (1.1) to (1.3):
  • Step (1.1) The electronic device inputs the first sample image into the prediction model, processes the first sample image through the prediction model, and outputs the prediction result.
  • Step (1.2) The electronic device calculates the first loss value through the loss function according to the prediction result and the first label.
• the first loss value represents the deviation between the prediction result and the first label: the larger the deviation between the prediction result and the first label, the larger the first loss value.
  • Step (1.3) The electronic device adjusts the parameters of the prediction model according to the first loss value.
  • the process of each iteration includes the following steps (2.1) to (2.3).
  • Step (2.1) The electronic device inputs the second sample image into the prediction model, processes the second sample image through the prediction model, and outputs the prediction result.
  • Step (2.2) The electronic device calculates a second loss value through a loss function based on the prediction result and the second label.
• the second loss value represents the deviation between the prediction result and the second label: the larger the deviation between the prediction result and the second label, the larger the second loss value.
  • Step (2.3) The electronic device adjusts the parameters of the prediction model according to the second loss value.
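• The iterative training of steps (1.1) to (2.3) can be sketched as follows; the optimizer, learning rate, loss function and label encoding (1 for the first label, 0 for the second label) are assumptions used only for illustration.

```python
# Minimal training-loop sketch: forward pass, loss against the label, parameter update.

import torch
import torch.nn as nn

def train(model, labeled_samples, epochs=10, lr=1e-3):
    """labeled_samples: iterable of (keypoint_features tensor, label) with label in {0, 1}."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                                    # termination: fixed iteration count
        for features, label in labeled_samples:
            prediction = model(features.unsqueeze(0))          # step (x.1): forward pass
            loss = loss_fn(prediction, torch.tensor([label]))  # step (x.2): loss value
            optimizer.zero_grad()
            loss.backward()                                    # step (x.3): adjust parameters
            optimizer.step()
    return model
```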
• the above shows one iteration of training. After each iteration, the electronic device can detect whether the training termination condition is currently met. When the training termination condition is not met, the electronic device executes the next iteration; when the training termination condition is met, the electronic device outputs the prediction model used in this iteration as the trained prediction model.
• the training termination condition can be that the number of iterations reaches the target number, that the loss function meets a preset condition, or that the model's ability does not improve over a period of time when verified on a validation data set.
  • the target number of times may be a preset number of iterations to determine the timing of the end of training and avoid wasting training resources.
• the preset condition can be that the loss function value remains unchanged or does not decrease for a period of time during the training process. At that point the training has achieved its effect, that is, the prediction model has the ability to recognize, according to the face image, whether the user has the intention to continue speaking.
  • the training process of the prediction model may include a first training phase and a second training phase.
• the first training phase is used to train the first classifier and the second classifier, and the second training phase is used to train the third classifier. The first classifier, the second classifier and the third classifier may all be parts of the prediction model.
  • the first classifier may be a part of the action recognition layer in FIG. 5, and may include one or more AUs.
  • the second classifier may also be a part of the action recognition layer in FIG. 5, and may include one or more AUs.
  • the third classifier can be the judger of the output layer in FIG. 5.
  • the fifth sample face image and the sixth sample face image can be used for model training in advance to obtain the first classifier and the second classifier.
• then the first classifier and the second classifier are used to annotate the sample face images.
  • the first classifier, the second classifier, and the third classifier to be trained are combined to obtain a prediction model to be trained.
  • the prediction model to be trained includes the first classifier and the untrained third classifier.
  • This embodiment provides a model training method for realizing the user's intention prediction function.
• model training is performed by using sample face images that contain the user's intention to continue speaking and sample face images that contain the user's intention not to continue speaking.
• in the training process, the prediction model can learn, from the sample face images and corresponding labels that contain the user's intention to continue speaking, what the features of a face image look like when the user's intention is to continue speaking, and learn, from the sample face images and corresponding labels that contain the user's intention not to continue speaking, what the features of a face image look like when the user's intention is not to continue speaking. The prediction model thus learns the correspondence between the user's intention and the face image features. In the model application stage, the model can be used to predict, based on an unknown face image, whether the current user has the intention to continue speaking, so that the user's intention expressed by the face image can be used to accurately detect whether the current voice signal is the end point of the speech.
  • the foregoing method embodiment introduces the training process of the prediction model.
  • the following describes the process of using the prediction model provided in the embodiment of FIG. 2 to perform voice endpoint detection through the embodiment of FIG. 6.
  • FIG. 6 is a flowchart of a voice detection method provided by an embodiment of the present application. This method is applied to electronic equipment.
  • the electronic device may be a terminal in the system architecture shown in FIG. 1, or may be a voice detection platform in the system architecture shown in FIG. 1, such as the server 201.
  • the electronic device implementing the embodiment in FIG. 6 and the electronic device implementing the embodiment in FIG. 2 may be the same electronic device or different electronic devices. If the electronic device that executes the embodiment in FIG. 6 is different from the electronic device that executes the embodiment in FIG. 2, the electronic devices in the two method embodiments can interact and cooperate to complete the task of voice detection.
  • the training step of the prediction model can be performed by the server, and the step of using the prediction model for detection can be performed by the terminal.
  • the training steps and detection steps of the prediction model can also be performed on the terminal side, or both can be performed on the server side.
  • the method includes the following steps:
  • Step 601 The electronic device obtains an audio signal and a face image.
  • the shooting time point of the face image is the same as the audio signal collection time point.
• the electronic device can collect the audio signal through a microphone and take the face image through a camera.
• the audio signal can indicate whether the user has the intention to continue speaking at time X, and the face image can also indicate whether the user has the intention to continue speaking at the same time X.
  • the electronic device may also receive a voice detection instruction from the terminal.
• the voice detection instruction carries the audio signal and the face image.
  • the electronic device can respond to the voice detection instruction, execute the following method flow according to the audio signal and the face image, and return the result of the voice detection to the terminal.
• the trigger condition of step 601 may include multiple situations. For example, this embodiment can be applied in a voice interaction scenario: if the terminal detects an audio signal containing a wake-up word, it can switch from the standby state to the working state, that is, the terminal is awakened, and the wake-up event of the terminal can trigger the execution of step 601.
  • Step 602 The electronic device performs voice recognition on the audio signal, obtains third text information corresponding to the audio signal, and detects the tail mute duration of the audio signal.
  • the text information corresponding to the audio signal obtained in step 601 is recorded as the third text information.
  • ASR may be performed on the audio signal obtained in step 601 to obtain the third text information.
  • the third text information may be "Call Teacher Zhang”, “Navigate to Century Avenue” and so on.
  • VAD can also be performed on the audio signal obtained in step 601 to obtain the tail mute duration.
  • this embodiment does not limit the timing of the voice recognition step and the detection step of the tail mute duration.
  • the step of speech recognition and the step of detecting the duration of tail silence may be performed sequentially.
  • the voice recognition step may be performed first, and then the tail silent duration detection step; or the tail silent duration detection step may be executed first, and then the voice recognition step may be executed.
  • the voice recognition step and the tail silence duration detection step can also be executed in parallel, that is, the speech recognition step and the tail silence duration detection step can be executed at the same time.
  • Step 603 The electronic device compares the tail silence duration with a corresponding threshold.
  • a third threshold may be used, and the third threshold may be the first threshold mentioned in the embodiment of FIG. 2 or the second threshold mentioned in the embodiment of FIG. 2. Or it may be a combination of the first threshold and the second threshold, or may be other thresholds than the first threshold and the second threshold.
  • the process of using the threshold value for comparison may specifically include the following steps:
  • Step (1) The electronic device can compare the tail silence duration with a first threshold, and if the tail silence duration is less than the first threshold, perform step (2). In addition, if the tail silence duration is greater than or equal to the first threshold, the electronic device determines that the voice signal is the end point of the voice.
  • Step (2) The electronic device may compare the tail silence duration with a third threshold, and if the tail silence duration is greater than the third threshold, step 604 is executed. If the tail mute duration is less than or equal to the third threshold, the electronic device continues to acquire the next audio signal and the face image corresponding to the next audio signal, and continues to perform steps 601 to 603 on the next audio signal.
  • the third threshold used in step (2) may be smaller than the first threshold used in step (1).
• the third threshold used in step (2) may be equal to the second threshold above, that is, the silence detection threshold used by the inference side and the silence detection threshold used by the training side can be the same.
  • the following voice detection process can be executed.
• the effect achieved by this method can at least include the following: once the silence duration is greater than the minimum threshold (the third threshold), the text modality and the image modality are combined, and the results of syntactic analysis and facial analysis are used to detect the voice end point, so that, through the fusion of multi-modal information, the voice end point is detected as quickly and accurately as possible and excessive delay is avoided. When the silence duration is greater than the maximum threshold (the first threshold), because the silence has lasted too long, the syntactic analysis process and the facial analysis process can be skipped, and it is directly determined that the voice end point has been detected.
  • Step 604 If the tail silence duration is greater than the third threshold, the electronic device performs syntactic analysis on the third text information to obtain a first analysis result.
  • the first analysis result is used to indicate whether the third text information is a complete sentence.
  • the first analysis result may include the first value and the second value.
  • the first value of the first analysis result indicates that the third text information is a complete sentence.
  • the second value of the first analysis result indicates that the third text information is not a complete sentence, but a sentence to be supplemented.
  • the first value and the second value of the first analysis result can be any two different data. For example, the first value of the first analysis result is 1, and the second value of the first analysis result is 0; or, the first value of the first analysis result is 0, and the second value of the first analysis result is 1.
  • the third text information may be regarded as a vocabulary sequence, and the first analysis result may be a sequence prediction result of the vocabulary sequence.
  • the syntactic analysis includes the following steps 1 to 5:
  • Step 1 The electronic device performs word segmentation on the third text information to obtain multiple words.
• for example, the third text information is "Call Teacher Zhang" (打电话给张老师). After word segmentation of "Call Teacher Zhang", the multiple words obtained are "打", "电", "话", "给", "张", "老" and "师".
• for another example, the third text information is "Navigate to Jinhai Road Jinsui Road" (导航到金海路金穗路). After word segmentation, the multiple words obtained are "导", "航", "到", "金", "海", "路", "金", "穗" and "路".
  • Step 2 For each vocabulary of the multiple vocabularies, the electronic device performs syntactic analysis on the vocabulary to obtain a second analysis result corresponding to the vocabulary.
  • the second analysis result is used to indicate whether the vocabulary and the vocabulary before the vocabulary constitute a complete sentence.
  • the second analysis result may include the first value and the second value.
  • the first value of the second analysis result indicates that the vocabulary and the vocabulary before the vocabulary constitute a complete sentence.
  • the second value of the second analysis result indicates that the vocabulary and the vocabulary before the vocabulary do not form a complete sentence.
  • the first value and the second value of the second analysis result can be any two different data. For example, the first value of the second analysis result is 1, and the second value of the second analysis result is 0; or, the first value of the second analysis result is 0, and the second value of the second analysis result is 1.
• for example, the second analysis result corresponding to "金" (the "Jin" in Jinhai Road) is 0, the second analysis result corresponding to "海" ("Hai") is 0, the second analysis result corresponding to "路" (the "Road" in Jinhai Road) is 1, the second analysis result corresponding to "金" (the "Jin" in Jinsui Road) is 0, the second analysis result corresponding to "穗" ("Sui") is 0, and the second analysis result corresponding to "路" (the "Road" in Jinsui Road) is 1.
  • a streaming detection method may be used for syntactic analysis.
• the specific process of streaming detection can include: the electronic device starts from the first word in the third text information, traverses each word, performs text analysis on the currently traversed word together with each previous word, and outputs the second analysis result corresponding to the currently traversed word.
• once the second analysis result corresponding to the currently traversed word indicates that a complete sentence is formed, the electronic device can determine that the third text information is a complete sentence and stop the traversal.
• for example, the words of the third text information are "打", "电", "话", "给", "张", "老" and "师".
• the process of streaming detection is as follows: when recognizing "打", predict that the syntax of "打" is incomplete and output 0; when recognizing "电", predict that the syntax of "打电" is incomplete and output 0; when recognizing "话", predict that the syntax of "打电话" is incomplete and output 0; when recognizing "给", predict that the syntax of "打电话给" is incomplete and output 0; when recognizing "张", predict that the syntax of "打电话给张" is incomplete and output 0; when recognizing "老", predict that the syntax of "打电话给张老" is incomplete and output 0; when recognizing "师", predict that the syntax of "打电话给张老师" ("Call Teacher Zhang") is complete and output 1.
• for another example, the words of the third text information are "导", "航", "到", "金", "海", "路", "金", "穗" and "路".
• the process of streaming detection is as follows: when recognizing "导", predict that the syntax of "导" is incomplete and output 0; when recognizing "航", predict that the syntax of "导航" is incomplete and output 0; when recognizing "到", predict that the syntax of "导航到" is incomplete and output 0; when recognizing "金", predict that the syntax of "导航到金" is incomplete and output 0; when recognizing "海", predict that the syntax of "导航到金海" is incomplete and output 0; when recognizing "路", predict that the syntax of "导航到金海路" ("Navigate to Jinhai Road") is complete and output 1; when recognizing "金", predict that the syntax of "导航到金海路金" is incomplete and output 0; when recognizing "穗", predict that the syntax of "导航到金海路金穗" is incomplete and output 0; when recognizing the final "路", predict that the syntax of "导航到金海路金穗路" is complete and output 1.
• Step 3 For each vocabulary of the multiple vocabularies, the electronic device determines whether the second analysis result corresponding to the vocabulary indicates that a complete sentence is formed. If the second analysis result corresponding to any vocabulary of the multiple vocabularies indicates that a complete sentence is formed, the following step 4 is performed; if the second analysis result corresponding to each of the multiple vocabularies indicates that no complete sentence is formed, the following step 5 is performed.
  • Step 4 The electronic device determines that the third text information is a complete sentence.
  • Step 5 The electronic device determines that the third text information is not a complete sentence.
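• A minimal sketch of the streaming detection in steps 1 to 5 is shown below; `is_complete_sentence` is an assumed placeholder for the syntactic analyzer described above.

```python
# Minimal sketch of streaming completeness detection: traverse the words, check
# whether each prefix forms a complete sentence, and stop at the first complete prefix.

def streaming_completeness(words, is_complete_sentence):
    prefix = []
    for word in words:
        prefix.append(word)
        # Second analysis result for this word: complete (True) or not (False).
        if is_complete_sentence("".join(prefix)):
            return True       # first analysis result: complete sentence, stop traversing
    return False              # no prefix was complete: not a complete sentence
```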
• the effects achieved can at least include the following: the syntactic connection between each vocabulary and the previous vocabulary is comprehensively considered, and the N-Best (N best) algorithm is used so that, whenever a vocabulary is detected, it is judged whether that vocabulary forms a complete sentence together with the previous vocabulary. Once the current vocabulary forms a complete sentence, it can be determined that the analyzed text information is a complete sentence, and the next detection process is executed. The moment at which the current audio signal has a probability of being the voice end point can therefore be detected in time, which ensures the real-time performance of voice end point detection and avoids detecting the voice end point too late.
  • Step 605 The electronic device determines whether the first analysis result indicates that the third text information is a complete sentence.
• if the first analysis result indicates that the third text information is not a complete sentence, the electronic device can determine that the audio signal is not the voice end point. If the first analysis result indicates that the third text information is a complete sentence, the electronic device can determine that the audio signal has a probability of being the voice end point, and execute step 606 to perform face recognition.
• for example, for "Navigate to Jinhai Road Jinsui Road", when the first "路" is recognized and 1 is output, steps 606 and 607 are executed; through the face recognition step, the prediction result output by the prediction model is 0, indicating that the user has the intention to continue speaking, and the traversal continues to the next vocabulary. When recognizing "金", 0 is output, it is determined that a complete sentence is not detected, and the traversal continues; when recognizing "穗", 0 is output, it is determined that a complete sentence is not detected, and the traversal continues; when recognizing the final "路", 1 is output, and steps 606 and 607 are executed again; this time, through the face recognition step, the prediction result output by the prediction model is 1, indicating that the user does not have the intention to continue speaking, and it is determined that the voice end point is detected.
• with the method provided in this embodiment, it is possible to trigger the process of applying the prediction model to perform face recognition under the condition that the audio signal has been detected to be syntactically complete, so that the prediction result is used to further determine whether the audio signal has indeed reached the voice end point. By fusing the characteristics of the visual modality, misjudgments of the syntactic analysis are avoided, the accuracy of voice end point detection is greatly improved, and the probability of the voice command being prematurely truncated is reduced.
  • the above-mentioned syntactic analysis method is simple and easy to implement and has high practicability.
  • Step 606 If the first analysis result indicates that the third text information is a complete sentence, the electronic device inputs the face image into the prediction model.
  • Step 607 The electronic device processes the face image through the prediction model, and outputs the prediction result.
• since the prediction model has used samples and labels to learn the mapping relationship between the face image and the user's intention, in step 607 the prediction model can recognize the face image based on the learned mapping relationship, determine the user's intention corresponding to the face image, and predict whether the user has the intention to continue speaking.
  • the specific process of processing by the prediction model may include the following steps one to four:
  • Step 1 Extract the key points contained in the face image.
  • Step 2 Process the key points to obtain the action characteristics of the face image.
  • the specific process of digging out the action features from the face image can include a variety of implementation methods.
  • the action characteristics can be acquired by executing the following (1) to (4).
  • the face image is input to the key point extraction module in the prediction model, and the key point feature is extracted from the face image through the key point extraction module.
  • the feature of the key point can be in any data form, including but not limited to a one-dimensional vector, a two-dimensional feature map, or a three-dimensional feature cube.
  • the number of key points in the face image can be multiple, and when step (1) is performed, the feature of each key point in the multiple key points can be extracted.
  • the feature of the key point can be input to the input layer, and the feature of the key point can be sent to the first hidden layer through the input layer.
• for example, the feature of key point 1, the feature of key point 2, the feature of key point 3, ..., and the feature of key point n can be input to the input layer. The node FP1 of the input layer receives the feature of key point 1 and sends it to the hidden layer; the node FP2 receives the feature of key point 2 and sends it to the hidden layer; the node FP3 receives the feature of key point 3 and sends it to the hidden layer; and so on, the node FPn receives the feature of key point n and sends it to the hidden layer.
• the action recognition layer may include N action units, which are respectively denoted as AU_1, AU_2, AU_3, ..., AU_N.
• after the action unit AU_1 recognizes the features of the mapped key points, it outputs PAU_1; if the output result of AU_1 is greater than the output result of NEU, that is, PAU_1 is greater than PNEU, the output result PAU_1 is valid. After the action unit AU_2 recognizes the features of the mapped key points, it outputs PAU_2; if the output result of AU_2 is greater than the output result of NEU, that is, PAU_2 is greater than PNEU, the output result PAU_2 is valid; and so on. After the unit NEU recognizes the features of the mapped key points, it outputs PNEU.
• PNEU can be compared with the output results of the other action units, the output results of the effective action units can be summed, and the sum obtained is the action feature. A sketch of this aggregation is given below.
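• The following minimal sketch expresses this aggregation; treating the action feature as a scalar sum of the valid outputs is a simplifying assumption made only for illustration.

```python
# Minimal sketch: keep the action units whose output exceeds PNEU and sum their outputs.

def action_feature(pau, pneu):
    """pau: list of PAU_1..PAU_N; pneu: PNEU. Returns the summed valid outputs."""
    valid = [p for p in pau if p > pneu]   # outputs of the effective action units
    return sum(valid)
```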
  • Each action unit in the action recognition layer can correspond to a key muscle point in the face, and each action unit can be identified when the corresponding key muscle point changes.
  • AU1 can recognize the muscles that lift the upper lip and the middle area.
  • AU2 can identify jaw drop
  • AU3 can identify mouth corner stretch
  • AU4 can identify eyebrows are depressed and gathered
  • AU5 can identify mouth corners are pulled down and tilted
  • AU6 can identify raised outer corners of eyebrows.
• the AU recognition result is indicated by the output probability. For example, the larger PAU_1 is, the higher the probability that the face lifts the upper lip and the muscles in the middle region.
• for different facial expressions, the probability output by each AU in the action recognition layer is also different. For example, if the user's current facial expression is joy, since the face usually raises the corners of the mouth during joy, PAU_1 will be larger, so PAU_1 can be used to identify it.
• Step 3 Classify the action features, and obtain the confidence levels corresponding to different categories.
  • Step 4. Determine the prediction result according to the confidence level.
  • the action features can be classified to obtain the confidence of the first category and the confidence of the second category.
  • the first category is that the user has the intention to continue speaking
  • the second category is that the user does not have the intention to continue speaking.
• the confidence level of the first category can be compared with the confidence level of the second category. If the confidence level of the first category is greater than the confidence level of the second category, output that the user has the intention to continue speaking as the prediction result; or, if the confidence level of the first category is not greater than the confidence level of the second category, output that the user does not have the intention to continue speaking as the prediction result.
  • the action feature can be input to the second hidden layer, and the action feature can be non-linearly mapped and linearly mapped through the second hidden layer to obtain the mapped action feature.
  • the mapped action features are classified through the output layer, and the resulting category can be the prediction result of the prediction model. If the category is intended to continue speaking, it indicates that the current audio signal has not reached the end of the voice. If the category is no intention to continue speaking, the currently recognized audio signal is used as the end point of the speech.
• the prediction model makes predictions by using the above steps 1 to 4.
  • Step 608 If the prediction result indicates that the user does not intend to continue speaking, the electronic device determines that the audio signal is the end point of the voice.
• when the electronic device determines that the audio signal is the voice end point, it can perform any service processing function corresponding to the end of the voice. For example, the voice detection result can be returned to the user, or the voice detection result can be output to a subsequent module. The electronic device can also intercept the part between the voice start point and the voice end point from the audio, parse the voice command, and perform service processing in response to the voice command.
  • the process of the voice detection method may be as shown in FIG. 9, including the following 5 steps:
  • Step 1 Perform voice recognition (ASR) on the audio signal to obtain the streaming N-best result and the tail mute duration.
  • Step 2 Compare the tail silence duration with the maximum silence duration threshold Dmax, if the tail silence duration is greater than Dmax, go to step 5, otherwise go to step 3;
  • Step 3 Compare the tail silence duration with the minimum silence duration threshold Dmin, if the tail silence duration is less than Dmin, go to step 1, otherwise go to step 4;
  • Step 4 Analyze the N-best results of voice recognition, facial action units and key points, and classify the audio signal. If the conditions corresponding to the voice end point are met, go to step 5, otherwise go to step 1;
  • Step 5 The end point of the voice is detected.
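  • To make the five steps above concrete, the following Python sketch shows one possible way to organize the loop; the function names run_asr and analyze_multimodal, the stub values, and the Dmax/Dmin threshold values are illustrative assumptions rather than the embodiment's actual interfaces.

```python
D_MAX = 0.8   # maximum silence duration threshold Dmax, in seconds (assumed value)
D_MIN = 0.2   # minimum silence duration threshold Dmin, in seconds (assumed value)

def detect_voice_end_point(audio_stream, run_asr, analyze_multimodal):
    """Return True once the voice end point is detected (illustrative sketch)."""
    for audio_signal in audio_stream:
        # Step 1: streaming ASR yields the N-best result and the tail silence duration.
        n_best, tail_silence = run_asr(audio_signal)

        # Step 2: tail silence longer than Dmax -> end point detected (step 5).
        if tail_silence > D_MAX:
            return True

        # Step 3: tail silence shorter than Dmin -> keep listening (back to step 1).
        if tail_silence < D_MIN:
            continue

        # Step 4: analyze the N-best results, facial action units and key points;
        # if the end-point conditions are met, go to step 5, otherwise keep listening.
        if analyze_multimodal(n_best, audio_signal):
            return True
    return False

# Tiny demo with stub functions (entirely made up):
demo_stream = [object()] * 3
print(detect_voice_end_point(
    demo_stream,
    run_asr=lambda signal: (["turn on the radio"], 0.9),   # stub: long tail silence
    analyze_multimodal=lambda n_best, signal: True,
))
```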
  • the driving situation may also be considered.
  • the driving situation is used for comprehensive judgment. For details, refer to the embodiment in FIG. 11 below.
  • This embodiment provides a multi-modal voice end point detection method.
  • the captured face image is recognized through the model, so as to predict whether the user has the intention to continue speaking, and the prediction result is combined to determine whether the collected audio signal is the voice end point. Because the visual modal features of the face image are combined with the acoustic features for detection, the face image can be used to accurately determine whether the audio signal is the voice end point even when the background noise is strong or the user pauses while speaking. This avoids the interference caused by background noise and speech pauses, which would otherwise cause the end of the voice interaction to be detected too late or too early, and thus improves the accuracy of detecting the voice end point.
  • the prediction model provided by the foregoing method embodiment can be applied to any scenario where voice detection is required.
  • the following uses an exemplary application scenario for illustration.
  • FIG. 10 is a flowchart of a voice detection method in a vehicle-mounted scene provided by an embodiment of the present application.
  • the interactive body of the method includes a vehicle-mounted terminal and a server, and includes the following steps:
  • Step 1001 The server obtains a sample audio signal set and a sample face image set to be labeled.
  • Step 1002 The server processes the third sample face image in the sample face image set according to the first sample audio signal in the sample audio signal set to obtain the first sample face image.
  • Step 1003 The server processes the fourth sample face image in the sample face image set according to the second sample audio signal in the sample audio signal set to obtain the second sample face image.
  • Step 1004 The server uses the first sample face image and the second sample face image to perform model training to obtain a prediction model.
  • Step 1005 The server sends the prediction model to the vehicle-mounted terminal.
  • Step 1006 The vehicle-mounted terminal receives the prediction model, and stores the prediction model.
  • Step 1007 The vehicle-mounted terminal obtains audio signals and facial images.
  • Step 1008 The vehicle-mounted terminal performs voice recognition on the audio signal, obtains the third text information corresponding to the audio signal, and detects the tail silence duration of the audio signal.
  • Step 1009 The vehicle-mounted terminal compares the tail silence duration with the corresponding threshold.
  • Step 1010 If the tail silence duration is greater than the third threshold, the vehicle-mounted terminal performs syntactic analysis on the third text information to obtain the first analysis result.
  • the driving condition of the vehicle can be considered to comprehensively detect the voice end point.
  • the vehicle-mounted terminal may collect driving condition information, and adjust the threshold corresponding to the tail silence duration according to the driving condition information, for example, adjust the third threshold.
  • the driving status information indicates the driving status of the vehicle equipped with the in-vehicle terminal.
  • the vehicle-mounted terminal can be equipped with sensors, and the driving condition information can be collected through the sensors.
  • the achieved effect can at least include: endpoint detection can be performed in combination with the specific application scenario of voice detection. For example, in the vehicle scenario, the driving conditions during driving can be used to adjust the threshold of the tail silence duration, so that the threshold can be adjusted adaptively according to the current driving conditions, improving the robustness of voice endpoint detection.
  • the specific meaning of the driving status information may include at least one type, which is illustrated by Method 1 and Method 2 below.
  • Method 1 If the driving condition information indicates that a sharp turn has occurred, the third threshold is adjusted, and the adjusted third threshold is greater than the third threshold before adjustment.
  • the vehicle-mounted terminal can be equipped with an accelerometer sensor, and the situation of a sharp turn can be collected through the accelerometer sensor.
  • Method 2 If the driving condition information indicates that sudden braking has occurred, the third threshold is adjusted so that the adjusted threshold is adapted to the situation of sudden braking.
  • the vehicle-mounted terminal can be equipped with an accelerometer sensor, and the situation of sudden braking can be collected through the accelerometer sensor.
  • Method 1 or Method 2 can be implemented alone, or Method 1 and Method 2 can be implemented in combination, as shown in the sketch below.
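  • The following is a minimal sketch of the threshold adjustment in Methods 1 and 2, assuming the in-vehicle terminal exposes accelerometer readings as lateral and longitudinal acceleration in m/s²; the detection thresholds and the factor by which the third threshold is enlarged are illustrative assumptions.

```python
SHARP_TURN_LATERAL_ACC = 4.0     # m/s^2, assumed detection threshold for a sharp turn
SUDDEN_BRAKE_LONG_ACC = -5.0     # m/s^2, assumed detection threshold for sudden braking

def adjust_third_threshold(base_threshold_ms, lateral_acc, longitudinal_acc):
    """Enlarge the tail-silence threshold when driving conditions suggest the
    user may pause while speaking (illustrative sketch, assumed factor of 1.5)."""
    threshold_ms = base_threshold_ms
    # Method 1: a sharp turn is detected -> the adjusted threshold is larger than before.
    if abs(lateral_acc) > SHARP_TURN_LATERAL_ACC:
        threshold_ms = max(threshold_ms, base_threshold_ms * 1.5)
    # Method 2: sudden braking is detected -> the adjusted threshold is larger than before.
    if longitudinal_acc < SUDDEN_BRAKE_LONG_ACC:
        threshold_ms = max(threshold_ms, base_threshold_ms * 1.5)
    return threshold_ms

# Example: base third threshold of 600 ms; the accelerometer reports a sharp turn.
print(adjust_third_threshold(600, lateral_acc=5.2, longitudinal_acc=-1.0))  # 900.0
```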
  • Step 1011 The vehicle-mounted terminal judges whether the first analysis result indicates that the third text information is a complete sentence.
  • Step 1012 If the first analysis result indicates that the third text information is a complete sentence, the vehicle-mounted terminal inputs the face image into the prediction model.
  • Step 1013 The vehicle-mounted terminal processes the face image through the prediction model, and outputs the prediction result.
  • Step 1014 If the prediction result indicates that the user does not intend to continue speaking, the vehicle-mounted terminal determines that the audio signal is the end point of the voice.
  • the environment outside the vehicle may be considered to comprehensively detect the voice end point.
  • the vehicle-mounted terminal may collect environmental information, and the environmental information indicates the environment in which the vehicle equipped with the vehicle-mounted terminal is located.
  • the vehicle-mounted terminal can adjust the parameters of the prediction model according to environmental information.
  • the vehicle-mounted terminal can be equipped with a driving recorder, and the environment outside the vehicle can be collected through the driving recorder.
  • the adjustment method based on environmental information may be model fine-tuning.
  • the vehicle-mounted terminal may adjust the decision threshold of the third classifier in the prediction model.
  • the third classifier is used to judge that the user has the intention to continue speaking when the input data is higher than the judgment threshold, and judge that the user does not have the intention to continue speaking when the input data is lower than or equal to the judgment threshold.
  • the third classifier can be a node of the output layer.
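  • The sketch below illustrates how the decision threshold of such a classifier could be adjusted when the environmental information indicates traffic congestion; the threshold values, the congestion flag, and the direction of the adjustment (raised here purely for illustration) are assumptions, since the embodiment does not specify them.

```python
DEFAULT_DECISION_THRESHOLD = 0.5     # assumed default decision threshold
CONGESTION_DECISION_THRESHOLD = 0.6  # assumed threshold when traffic is congested

def user_intends_to_continue(classifier_score, traffic_congested):
    """Judge whether the user intends to continue speaking; the decision
    threshold is adjusted according to the environment (illustrative sketch)."""
    threshold = (CONGESTION_DECISION_THRESHOLD if traffic_congested
                 else DEFAULT_DECISION_THRESHOLD)
    # Input higher than the decision threshold -> the user intends to continue speaking.
    return classifier_score > threshold

# The same classifier score can lead to different judgments in different environments.
print(user_intends_to_continue(0.55, traffic_congested=False))  # True
print(user_intends_to_continue(0.55, traffic_congested=True))   # False
```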
  • the effect achieved can at least include: during the driving of the vehicle, the environment outside the vehicle will have an impact on the driver's emotions. For example, the probability of a driver being anxious in a traffic congested scene is higher than the probability of a driver being anxious in a smooth traffic scene. The changes in emotions will affect the process of face recognition.
  • By adjusting the parameters of the prediction model in combination with the environment outside the vehicle, the face recognition process of the prediction model can be matched with the current environment outside the vehicle, thereby improving the accuracy of the prediction results of the prediction model.
  • the scene information can also be further used to detect the voice end point.
  • the sound source information or sound field information can be combined to detect the end point of the voice.
  • This embodiment provides a multi-modal voice end point detection method in a vehicle-mounted scene.
  • the captured face image is recognized through the model, so as to predict whether the user has the intention to continue speaking, and the prediction result is combined to determine whether the collected audio signal is the voice end point. Because the visual modal features of the face image are combined with the acoustic features for detection, the face image can be used to accurately determine whether the audio signal is the voice end point even when the background noise is strong or the user pauses while speaking. This avoids the interference caused by background noise and speech pauses, which would otherwise cause the end of the voice interaction to be detected too late or too early, and thus improves the accuracy of detecting the voice end point.
  • the speech detection method provided in this embodiment is described above, and the software architecture of the speech detection method is exemplarily introduced below.
  • the software architecture may include multiple functional modules, for example, may include a data acquisition module, a data processing module, and a decision-making module. Among them, each functional module can be realized by software.
  • the data acquisition module is used to collect audio streams in real time through a microphone, and shoot video streams in real time through a camera.
  • the data acquisition module can pass audio and video streams to the data processing module.
  • the data processing module can extract multiple modalities of information, such as acoustic information, semantic information, and visual information, from the audio stream and video stream by using the data processing and device control capabilities provided by the central processing unit, and transmit the multi-modal information to the decision-making module.
  • the decision module can fuse various modal information to decide whether the current audio signal is a voice endpoint.
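  • A skeletal Python rendering of these three modules is given below; the class and method names are invented for illustration, and the fusion rule in the decision module is a stand-in for the decision logic described above.

```python
class DataAcquisitionModule:
    """Collects the audio stream via the microphone and the video stream via the camera."""
    def __init__(self, microphone, camera):
        self.microphone, self.camera = microphone, camera

    def collect(self):
        return self.microphone.read_frame(), self.camera.read_frame()


class DataProcessingModule:
    """Extracts acoustic, semantic and visual modal information (placeholder values)."""
    def extract(self, audio_frame, video_frame):
        acoustic = {"tail_silence": 0.4}
        semantic = {"complete_sentence": True}
        visual = {"continue_speaking_prob": 0.2}
        return acoustic, semantic, visual


class DecisionModule:
    """Fuses the modal information to decide whether this is the voice end point."""
    def is_voice_end_point(self, acoustic, semantic, visual):
        return (acoustic["tail_silence"] > 0.3
                and semantic["complete_sentence"]
                and visual["continue_speaking_prob"] < 0.5)


# Demo of the processing and decision stages (acquisition needs real devices).
processing, decision = DataProcessingModule(), DecisionModule()
acoustic, semantic, visual = processing.extract(audio_frame=None, video_frame=None)
print(decision.is_voice_end_point(acoustic, semantic, visual))  # True
```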
  • Fig. 12 is a flowchart of the machine performing voice end point detection based on the above-mentioned software architecture.
  • automatic speech recognition can be performed on the audio signal to obtain the duration of silence at the end of the voice and the N-best result of the text information.
  • then, syntactic analysis is performed on the text information, and the analysis result is considered together with the relationship between the tail silence duration and the thresholds.
  • the current audio signal can be classified into the voice end point or non-voice end point.
  • the voice detection method of the embodiment of the present application is introduced above, and the voice detection device of the embodiment of the present application is introduced below. It should be understood that the voice detection device has any of the functions of the electronic device in the foregoing method embodiments.
  • FIG. 13 is a schematic structural diagram of a voice detection device provided by an embodiment of the present application.
  • the voice detection device includes: an acquisition module 1301, configured to perform step 601 or step 1007 in the foregoing method embodiments; an input module 1302, configured to perform step 606 or step 1012; a processing module 1303, configured to perform step 607 or step 1013; and a determination module 1304, configured to perform step 608 or step 1014.
  • the processing module includes:
  • the extraction sub-module is used to perform step one in step 607;
  • the processing sub-module is used to perform step two in step 607;
  • the classification sub-module is used to perform step three in step 607.
  • the acquisition module is further used to perform step 201; the device further includes: a training module is used to perform step 202.
  • the first sample face image satisfies the first condition.
  • the second sample face image satisfies the second condition.
  • the device further includes: a speech recognition module, configured to perform the speech recognition steps; a syntactic analysis module, configured to perform the syntactic analysis steps; and a determination module, further configured to determine that the audio signal is not the voice end point if the result of the syntactic analysis indicates that the sentence is not a complete sentence, or to trigger the input module 1302 to perform step 606 or step 1012 if the result of the syntactic analysis indicates a complete sentence.
  • the syntactic analysis module is used to perform steps one to five in the syntactic analysis.
  • the trigger condition for inputting the face image into the prediction model includes: detecting the tail silence duration of the audio signal; determining that the tail silence duration is greater than the third threshold.
  • the device is applied to a vehicle-mounted terminal, and the device further includes: a first collection module, configured to collect driving status information; and a first adjustment module, configured to adjust the third threshold according to the driving status information.
  • the first adjustment module is configured to adjust the third threshold if the driving condition information indicates that a sharp turn has occurred; if the driving condition information indicates that a sudden brake has occurred, adjust the third threshold.
  • the device is applied to a vehicle-mounted terminal, and the device further includes: a second collection module for collecting environmental information; and a second adjustment module for adjusting the parameters of the prediction model according to the environmental information.
  • the second adjustment module is configured to adjust the decision threshold of the third classifier in the prediction model if the environmental information indicates that traffic congestion has occurred.
  • the voice detection device provided in the embodiment of FIG. 13 corresponds to the electronic device in the foregoing method embodiments, and each module in the voice detection device and the other operations and/or functions described above are respectively used to implement the various steps and methods implemented by the electronic device in the method embodiments.
  • When the voice detection device provided in the embodiment of FIG. 13 detects voice, the division of the above functional modules is only used as an example for illustration. In actual applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the voice detection device is divided into different functional modules to complete all or part of the functions described above.
  • the voice detection device provided in the foregoing embodiment belongs to the same concept as the foregoing voice detection method embodiment, and its specific implementation process is detailed in the method embodiment, and will not be repeated here.
  • FIG. 14 is a schematic structural diagram of a training device for a prediction model for voice detection according to an embodiment of the present application.
  • the training device includes: an acquisition module 1401, configured to perform step 201 in the method embodiment of FIG. 2 or step 1001 in the embodiment of FIG. 10; a processing module 1402, configured to perform steps 202 and 203 in the method embodiment of FIG. 2, or steps 1002 and 1003 in the embodiment of FIG. 10; and a training module 1403, configured to perform step 204 in the method embodiment of FIG. 2 or step 1004 in the embodiment of FIG. 10.
  • the first sample audio signal satisfies the first condition.
  • the second sample audio signal satisfies the second condition.
  • the first sample face image satisfies the third condition.
  • the training device for the prediction model for voice detection provided in the embodiment of FIG. 14 corresponds to the electronic device in the method embodiment of FIG. 2, and the modules in the training device and the other operations and/or functions described above are respectively used to implement the various steps and methods implemented by the electronic device in the method embodiment of FIG. 2. For specific details, refer to the method embodiment of FIG. 2 described above; for brevity, details are not described herein again.
  • When the training device for the prediction model for voice detection provided in the embodiment of FIG. 14 trains the prediction model, the division of the above functional modules is only used as an example for illustration. In actual applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the training device is divided into different functional modules to complete all or part of the functions described above. In addition, the training device provided in the above embodiment belongs to the same concept as the above embodiment of the method for training the prediction model for voice detection; for the specific implementation process, refer to the method embodiment, which will not be repeated here.
  • the embodiments of the present application provide an electronic device, which includes a processor, and the processor is configured to execute instructions so that the electronic device executes the voice detection method provided in each of the foregoing method embodiments.
  • the processor may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, or one or more integrated circuits used to implement the solutions of the present application, for example, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • the processor can be a single-CPU processor or a multi-CPU processor. The number of processors can be one or more.
  • the electronic device may further include a memory.
  • the memory can be a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM) or other types of dynamic storage devices that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory and the processor can be set separately, and the memory and the processor can also be integrated.
  • the electronic device may further include a transceiver.
  • the transceiver is used to communicate with other devices or communication networks, and the way of network communication can be but not limited to Ethernet, wireless access network (RAN), wireless local area networks (WLAN), etc.
  • the electronic device that implements the embodiment in FIG. 2, the embodiment in FIG. 6 or the embodiment in FIG. 10 may be implemented as a terminal.
  • the hardware structure of the terminal is exemplarily described below.
  • FIG. 15 is a schematic structural diagram of a terminal 100 provided by an embodiment of the present application.
  • the terminal 100 may be a vehicle-mounted terminal 101, a smart phone 102, a smart speaker 103, or a robot 104 in the hardware environment shown in FIG. 1, of course, it may also be other types of terminals.
  • the terminal 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the terminal 100.
  • the terminal 100 may include more or fewer components than those shown in the figure, or combine certain components, or split certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching instructions and executing instructions.
  • a memory may also be provided in the processor 110 to store instructions and data.
  • the memory in the processor 110 is a cache memory.
  • the memory can store instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to use the instruction or data again, it can be directly called from the memory. Repeated accesses are avoided, the waiting time of the processor 110 is reduced, and the efficiency of the system is improved.
  • the processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the I2C interface is a bidirectional synchronous serial bus, which includes a serial data line (SDA) and a serial clock line (SCL).
  • the processor 110 may include multiple sets of I2C buses.
  • the processor 110 may be coupled to the touch sensor 180K, charger, flash, camera 193, etc., respectively through different I2C bus interfaces.
  • the processor 110 may couple the touch sensor 180K through an I2C interface, so that the processor 110 and the touch sensor 180K communicate through an I2C bus interface to implement the touch function of the terminal 100.
  • the I2S interface can be used for audio communication.
  • the processor 110 may include multiple sets of I2S buses.
  • the processor 110 may be coupled with the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170.
  • the audio module 170 may transmit audio signals to the wireless communication module 160 through an I2S interface, so as to realize the function of answering calls through a Bluetooth headset.
  • the PCM interface can also be used for audio communication to sample, quantize and encode analog signals.
  • the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface.
  • the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to realize the function of answering calls through the Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
  • the UART interface is a universal serial data bus used for asynchronous communication.
  • the bus can be a two-way communication bus. It converts the data to be transmitted between serial communication and parallel communication.
  • the UART interface is generally used to connect the processor 110 and the wireless communication module 160.
  • the processor 110 communicates with the Bluetooth module in the wireless communication module 160 through the UART interface to realize the Bluetooth function.
  • the audio module 170 may transmit audio signals to the wireless communication module 160 through a UART interface, so as to realize the function of playing music through a Bluetooth headset.
  • the MIPI interface can be used to connect the processor 110 with the display screen 194, the camera 193 and other peripheral devices.
  • the MIPI interface includes a camera serial interface (camera serial interface, CSI), a display serial interface (display serial interface, DSI), and so on.
  • the processor 110 and the camera 193 communicate through a CSI interface to implement the shooting function of the terminal 100.
  • the processor 110 and the display screen 194 communicate through a DSI interface to realize the display function of the terminal 100.
  • the GPIO interface can be configured through software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface can be used to connect the processor 110 with the camera 193, the display screen 194, the wireless communication module 160, the audio module 170, the sensor module 180, and so on.
  • the GPIO interface can also be configured as an I2C interface, I2S interface, UART interface, MIPI interface, etc.
  • the USB interface 130 is an interface that complies with the USB standard specification, and specifically may be a Mini USB interface, a Micro USB interface, a USB Type C interface, and so on.
  • the USB interface 130 can be used to connect a charger to charge the terminal 100, and can also be used to transfer data between the terminal 100 and peripheral devices. It can also be used to connect earphones and play audio through earphones. This interface can also be used to connect to other terminals, such as AR devices.
  • the interface connection relationship between the modules illustrated in the embodiment of the present application is merely a schematic description, and does not constitute a structural limitation of the terminal 100.
  • the terminal 100 may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the charging management module 140 may receive the charging input of the wired charger through the USB interface 130.
  • the charging management module 140 may receive the wireless charging input through the wireless charging coil of the terminal 100. While the charging management module 140 charges the battery 142, it can also supply power to the terminal through the power management module 141.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charge management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, and the wireless communication module 160.
  • the power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, and battery health status (leakage, impedance).
  • the power management module 141 may also be provided in the processor 110.
  • the power management module 141 and the charging management module 140 may also be provided in the same device.
  • the wireless communication function of the terminal 100 can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, and the baseband processor.
  • the antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in the terminal 100 can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • Antenna 1 can be multiplexed as a diversity antenna of a wireless local area network.
  • the antenna can be used in combination with a tuning switch.
  • the mobile communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G and the like applied to the terminal 100.
  • the mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like.
  • the mobile communication module 150 can receive electromagnetic waves by the antenna 1, and perform processing such as filtering, amplifying and transmitting the received electromagnetic waves to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor, and convert it into electromagnetic waves for radiation via the antenna 1.
  • at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110.
  • at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be provided in the same device.
  • the modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal.
  • the demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays an image or video through the display screen 194.
  • the modem processor may be an independent device.
  • the modem processor may be independent of the processor 110 and be provided in the same device as the mobile communication module 150 or other functional modules.
  • the wireless communication module 160 can provide wireless communication solutions applied to the terminal 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and so on.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110.
  • the wireless communication module 160 may also receive a signal to be sent from the processor 110, perform frequency modulation, amplify, and convert it into electromagnetic waves to radiate through the antenna 2.
  • the antenna 1 of the terminal 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the terminal 100 can communicate with the network and other devices through wireless communication technology.
  • the wireless communication technology may include the global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc.
  • the GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite-based augmentation systems (SBAS).
  • the terminal 100 implements a display function through a GPU, a display screen 194, and an application processor.
  • the GPU is an image processing microprocessor, which is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, and the like.
  • the display screen 194 includes a display panel.
  • the display panel may use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Miniled, a MicroLed, a Micro-oLed, a quantum dot light-emitting diode (QLED), etc.
  • the terminal 100 may include one or N display screens 194, and N is a positive integer greater than one.
  • the terminal 100 can implement a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, and an application processor.
  • the ISP is used to process the data fed back from the camera 193. For example, when taking a picture, the shutter is opened, the light is transmitted to the photosensitive element of the camera through the lens, the light signal is converted into an electrical signal, and the photosensitive element of the camera transmits the electrical signal to the ISP for processing and transforms it into an image visible to the naked eye.
  • ISP can also optimize the image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 193.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and is projected to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the terminal 100 may include one or N cameras 193, and N is a positive integer greater than one.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the terminal 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the terminal 100 may support one or more video codecs. In this way, the terminal 100 can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
  • NPU is a neural-network (NN) computing processor.
  • Through the NPU, applications such as intelligent cognition of the terminal 100 can be realized, for example, image recognition, face recognition, voice recognition, text understanding, and so on.
  • the external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, save music, video and other files in an external memory card.
  • the internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the internal memory 121 may store the prediction model described in the foregoing method embodiment.
  • the storage program area can store an operating system, an application program (such as a sound playback function, an image playback function, etc.) required by at least one function, and the like.
  • the data storage area can store data (such as audio data, phone book, etc.) created during the use of the terminal 100.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the terminal 100 by running instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • the terminal 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. For example, music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 170 can also be used to encode and decode audio signals.
  • the audio module 170 may be provided in the processor 110, or part of the functional modules of the audio module 170 may be provided in the processor 110.
  • the speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the terminal 100 can listen to music through the speaker 170A, or listen to a hands-free call.
  • the receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.
  • the terminal 100 answers a call or voice message, it can receive the voice by bringing the receiver 170B close to the human ear.
  • the microphone 170C, also called a "mic", is used to convert sound signals into electrical signals.
  • the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C.
  • the terminal 100 may be provided with at least one microphone 170C. In other embodiments, the terminal 100 may be provided with two microphones 170C, which can implement noise reduction functions in addition to collecting sound signals. In other embodiments, the terminal 100 may also be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions.
  • the earphone interface 170D is used to connect wired earphones.
  • the earphone interface 170D may be a USB interface 130, or a 3.5mm open mobile terminal platform (open mobile terminal platform, OMTP) standard interface, or a cellular telecommunications industry association (cellular telecommunications industry association of the USA, CTIA) standard interface.
  • the pressure sensor 180A is used to sense the pressure signal and can convert the pressure signal into an electrical signal.
  • the pressure sensor 180A may be provided on the display screen 194.
  • the capacitive pressure sensor may include at least two parallel plates with conductive materials. When a force is applied to the pressure sensor 180A, the capacitance between the electrodes changes.
  • the terminal 100 determines the strength of the pressure according to the change in capacitance.
  • the terminal 100 detects the intensity of the touch operation according to the pressure sensor 180A.
  • the terminal 100 may also calculate the touched position according to the detection signal of the pressure sensor 180A.
  • touch operations that act on the same touch position but have different touch operation strengths may correspond to different operation instructions. For example: when a touch operation whose intensity of the touch operation is less than the first pressure threshold is applied to the short message application icon, an instruction to view the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, an instruction to create a new short message is executed.
  • the gyro sensor 180B may be used to determine the movement posture of the terminal 100.
  • the angular velocity of the terminal 100 around three axes (i.e., the x, y, and z axes) can be determined by the gyro sensor 180B.
  • the gyro sensor 180B can be used for image stabilization.
  • the gyroscope sensor 180B detects the shake angle of the terminal 100, calculates the distance that the lens module needs to compensate according to the angle, and allows the lens to counteract the shake of the terminal 100 through a reverse movement to achieve anti-shake.
  • the gyro sensor 180B can also be used for navigation and somatosensory game scenes.
  • the air pressure sensor 180C is used to measure air pressure.
  • the terminal 100 calculates the altitude based on the air pressure value measured by the air pressure sensor 180C to assist positioning and navigation.
  • the magnetic sensor 180D includes a Hall sensor.
  • the terminal 100 may use the magnetic sensor 180D to detect the opening and closing of the flip holster.
  • the terminal 100 can detect the opening and closing of the flip according to the magnetic sensor 180D.
  • further, based on the detected opening or closing state of the flip cover, features such as automatic unlocking upon opening the flip cover can be set.
  • the acceleration sensor 180E can detect the magnitude of the acceleration of the terminal 100 in various directions (generally along three axes). When the terminal 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to recognize the terminal's posture and can be applied to horizontal/vertical screen switching, pedometers, and other applications.
  • the terminal 100 can measure the distance by infrared or laser. In some embodiments, when shooting a scene, the terminal 100 may use the distance sensor 180F to measure the distance to achieve fast focusing.
  • the proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector such as a photodiode.
  • the light emitting diode may be an infrared light emitting diode.
  • the terminal 100 emits infrared light to the outside through the light emitting diode.
  • the terminal 100 uses a photodiode to detect infrared reflected light from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the terminal 100. When insufficient reflected light is detected, the terminal 100 may determine that there is no object near the terminal 100.
  • the terminal 100 can use the proximity light sensor 180G to detect that the user holds the terminal 100 close to the ear to talk, so as to automatically turn off the screen to save power.
  • the proximity light sensor 180G can also be used in leather case mode and pocket mode to automatically unlock and lock the screen.
  • the ambient light sensor 180L is used to sense the brightness of the ambient light.
  • the terminal 100 can adaptively adjust the brightness of the display screen 194 according to the perceived brightness of the ambient light.
  • the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
  • the ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the terminal 100 is in a pocket to prevent accidental touch.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the terminal 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, access application locks, fingerprint photographs, fingerprint answering calls, and so on.
  • the temperature sensor 180J is used to detect temperature.
  • the terminal 100 uses the temperature detected by the temperature sensor 180J to execute a temperature processing strategy. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold value, the terminal 100 executes to reduce the performance of the processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection.
  • in other embodiments, when the temperature is lower than another threshold, the terminal 100 heats the battery 142 to avoid abnormal shutdown of the terminal 100 due to low temperature.
  • the terminal 100 boosts the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperature.
  • the touch sensor 180K is also called a "touch device".
  • the touch sensor 180K may be disposed on the display screen 194, and the touch screen is composed of the touch sensor 180K and the display screen 194, which is also called a “touch screen”.
  • the touch sensor 180K is used to detect touch operations acting on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • the visual output related to the touch operation can be provided through the display screen 194.
  • the touch sensor 180K may also be disposed on the surface of the terminal 100, which is different from the position of the display screen 194.
  • the bone conduction sensor 180M can acquire vibration signals.
  • the bone conduction sensor 180M can obtain the vibration signal of the vibrating bone mass of the human voice.
  • the bone conduction sensor 180M can also contact the human pulse and receive the blood pressure pulse signal.
  • the bone conduction sensor 180M may also be provided in the earphone, combined with the bone conduction earphone.
  • the audio module 170 can parse out the voice command based on the vibration signal of the vibrating bone block of the voice obtained by the bone conduction sensor 180M to realize the voice function.
  • the application processor can analyze the heart rate information based on the blood pressure beating signal obtained by the bone conduction sensor 180M, and realize the heart rate detection function.
  • the button 190 includes a power button, a volume button, and so on.
  • the button 190 may be a mechanical button. It can also be a touch button.
  • the terminal 100 may receive key input, and generate key signal input related to user settings and function control of the terminal 100.
  • the motor 191 can generate vibration prompts.
  • the motor 191 can be used for incoming call vibration notification, and can also be used for touch vibration feedback.
  • touch operations applied to different applications can correspond to different vibration feedback effects.
  • For touch operations acting on different areas of the display screen 194, the motor 191 can also correspond to different vibration feedback effects.
  • Different application scenarios (for example, time reminders, receiving information, alarm clocks, games, etc.) can also correspond to different vibration feedback effects.
  • the touch vibration feedback effect can also support customization.
  • the indicator 192 may be an indicator light, which may be used to indicate the charging status, power change, or to indicate messages, missed calls, notifications, and so on.
  • the SIM card interface 195 is used to connect to the SIM card.
  • the SIM card can be inserted into the SIM card interface 195 or pulled out from the SIM card interface 195 to achieve contact and separation with the terminal 100.
  • the terminal 100 may support 1 or N SIM card interfaces, and N is a positive integer greater than 1.
  • the SIM card interface 195 can support Nano SIM cards, Micro SIM cards, SIM cards, etc.
  • the same SIM card interface 195 can insert multiple cards at the same time. The types of the multiple cards can be the same or different.
  • the SIM card interface 195 can also be compatible with different types of SIM cards.
  • the SIM card interface 195 may also be compatible with external memory cards.
  • the terminal 100 interacts with the network through the SIM card to implement functions such as call and data communication.
  • the terminal 100 adopts an eSIM, that is, an embedded SIM card.
  • the eSIM card can be embedded in the terminal 100 and cannot be separated from the terminal 100.
  • the software system of the terminal 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • FIG. 16 is a functional architecture diagram of a terminal 100 provided by an embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Communication between layers through software interface.
  • the Android system is divided into four layers, from top to bottom, the application layer, the application framework layer, the Android runtime and system library, and the kernel layer.
  • the application layer can include a series of application packages.
  • the application package may include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and so on.
  • the window manager is used to manage window programs.
  • the window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, etc.
  • the content provider is used to store and retrieve data and make these data accessible to applications.
  • the data can include videos, images, audios, phone calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, and so on.
  • the view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
  • the phone manager is used to provide the communication function of the terminal 100. For example, the management of the call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and it can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also be a notification that appears in the status bar at the top of the system in the form of a chart or a scroll bar text, such as a notification of an application running in the background, or a notification that appears on the screen in the form of a dialog window. For example, text messages are prompted in the status bar, a prompt sound is emitted, the terminal vibrates, and the indicator light flashes.
  • Android Runtime includes core libraries and virtual machines. Android runtime is responsible for the scheduling and management of the Android system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and application framework layer run in a virtual machine.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support multiple audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, synthesis, and layer processing.
  • the 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.
  • the terminal 100 starts the microphone 170C through the audio driver, collects an audio signal through the microphone 170C, starts the camera driver, and captures a face image through the camera 193.
  • the terminal loads the prediction model into the internal memory 121, the processor 110 inputs the face image into the prediction model, processes the face image through the prediction model, and outputs a prediction result; if the prediction result indicates that the user does not intend to continue speaking, the processor 110 determines that the audio signal is the voice end point.
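The decision flow described in this bullet can be illustrated with a minimal sketch. The model interface (`predict_continue_probability`) and the 0.5 decision boundary are assumptions made only for this illustration, not details taken from the embodiment:

```python
from dataclasses import dataclass

@dataclass
class DummyPredictionModel:
    """Stand-in for the trained prediction model loaded into internal memory."""
    threshold: float = 0.5  # assumed decision boundary

    def predict_continue_probability(self, face_image) -> float:
        # A real model would run inference on the face image here;
        # this stub just returns a fixed probability.
        return 0.2

def is_voice_end_point(audio_signal, face_image, model) -> bool:
    """Return True if the current audio signal is judged to be the voice end point."""
    p_continue = model.predict_continue_probability(face_image)
    # If the user is predicted NOT to intend to continue speaking,
    # the audio signal is treated as the voice end point.
    return p_continue <= model.threshold

if __name__ == "__main__":
    print(is_voice_end_point(audio_signal=b"...", face_image=None,
                             model=DummyPredictionModel()))
```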
  • the electronic device that executes the embodiment in FIG. 2, the embodiment in FIG. 6 or the embodiment in FIG. 10 may be implemented as a computing device, and the computing device may be a server, a host, or a personal computer.
  • the computing device can be realized by a general bus architecture.
  • FIG. 17 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • the computing device may be configured as an electronic device in the foregoing method embodiment.
  • the computing device may be any device involved in all or part of the content described in the method embodiments.
  • the computing device includes at least one processor 1701, communication bus 1702, memory 1703, and at least one communication interface 1704.
  • the processor 1701 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, or may be one or more integrated circuits for implementing the solution of this application, for example, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • the communication bus 1702 is used to transfer information between the aforementioned components.
  • the communication bus 1702 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the memory 1703 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including a compressed disc, a laser disc, an optical disc, a digital versatile disc, a Blu-ray disc, and the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 1703 may exist independently and is connected to the processor 1701 through the communication bus 1702.
  • the memory 1703 may also be integrated with the processor 1701.
  • the communication interface 1704 uses any device such as a transceiver for communicating with other devices or communication networks.
  • the communication interface 1704 includes a wired communication interface, and may also include a wireless communication interface.
  • the wired communication interface may be, for example, an Ethernet interface.
  • the Ethernet interface can be an optical interface, an electrical interface, or a combination thereof.
  • the wireless communication interface may be a wireless local area network (WLAN) interface, a cellular network communication interface, or a combination thereof.
  • the processor 1701 may include one or more CPUs, such as CPU0 and CPU1 as shown in FIG. 3.
  • the computer device may include multiple processors, such as the processor 1701 and the processor 1705 as shown in FIG. 3.
  • each of these processors can be a single-core processor (single-CPU) or a multi-core processor (multi-CPU).
  • the processor here may refer to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
  • the computer device may further include an output device 1706 and an input device 1707.
  • the output device 1706 communicates with the processor 1701 and can display information in a variety of ways.
  • the output device 1706 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector, etc.
  • the input device 1707 communicates with the processor 1701, and can receive user input in a variety of ways.
  • the input device 1707 may be a mouse, a keyboard, a touch screen device, a sensor device, or the like.
  • the memory 1703 is used to store the program code 1710 for executing the solution of the present application, and the processor 1701 may execute the program code 1710 stored in the memory 1703. That is, the computing device can implement the method provided by the method embodiment through the processor 1701 and the program code 1710 in the memory 1703.
  • the computing device in this embodiment of the present application may correspond to the electronic device in the foregoing method embodiments, and the processor 1701, the transceiver, and the like in the computing device may implement the functions of, and/or the various steps and methods implemented by, the electronic device in the foregoing method embodiments. For brevity, details are not repeated here.
  • the electronic device that executes the embodiment in FIG. 2, the embodiment in FIG. 6 or the embodiment in FIG. 10 may also be implemented by a general-purpose processor.
  • the form of the general-purpose processor may be a chip.
  • a general-purpose processor that implements an electronic device includes a processing circuit and an input interface and an output interface that are internally connected and communicated with the processing circuit.
  • the input interface can input the audio signal and the face image into the processing circuit, the processing circuit is used to perform step 602 to step 608, and the processing circuit may output the voice detection result through the output interface.
  • the general-purpose processor may further include a storage medium, and the storage medium may store instructions executed by the processing circuit, and the processing circuit is configured to execute the instructions stored by the storage medium to execute the foregoing method embodiments.
  • the storage medium can also be used to cache the prediction model or to persistently store the prediction model.
  • the electronic device in the embodiment of FIG. 2, the embodiment of FIG. 6, or the embodiment of FIG. 10 can also be implemented by using one or more field-programmable gate arrays (FPGA), programmable logic devices (PLD), controllers, state machines, gate logic, discrete hardware components, any other suitable circuits, or any combination of circuits capable of performing the various functions described throughout this application.
  • the electronic device that executes the embodiment in FIG. 2, the embodiment in FIG. 6 or the embodiment in FIG. 10 may also be implemented by using a computer program product.
  • an embodiment of the present application provides a computer program product, which when the computer program product runs on an electronic device, causes the electronic device to execute the voice detection method in the foregoing method embodiment.
  • the above-mentioned electronic devices of various product forms such as the terminal 100 and the computing device 1600, respectively have any function of the electronic device in the above-mentioned method embodiment in FIG. 2, the embodiment in FIG. 6, or the embodiment in FIG. 10, and will not be repeated here.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiment described above is only illustrative.
  • the division of units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the unit described as a separate component may or may not be physically separated, and the component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may also be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of this application essentially, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods in the embodiments of the present application.
  • the aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • the computer program product includes one or more computer program instructions.
  • the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer program instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state drive).
  • the program can be stored in a computer-readable storage medium, as mentioned above.
  • the storage medium can be a read-only memory, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Geometry (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice detection method, a prediction model training method, an apparatus, a device, and a medium, belonging to the technical field of voice interaction. A multi-modal voice end-point detection method is provided: a captured face image is recognized by a model to predict whether the user intends to continue speaking, and the prediction result is combined to decide whether a collected audio signal is the voice end point. Since, on the basis of acoustic features, features of the visual modality, namely the face image, are also fused for the detection, the face image can be used to accurately decide whether the voice signal is the voice end point even when the background noise is strong or the user pauses while speaking. Interference caused by background noise and speaking pauses is therefore avoided, which avoids the problem that the end of voice interaction is detected too late or too early due to such interference, and improves the accuracy of detecting the voice end point.

Description

语音检测方法、预测模型的训练方法、装置、设备及介质 技术领域
本申请涉及语音交互技术领域,特别涉及一种语音检测方法、预测模型的训练方法、装置、设备及介质。
背景技术
在语音交互技术中,为了实现基于语音的人机交互功能,通常会识别一段语音中的语音起始点和语音结束点,截取语音起始点和语音结束点之间的部分,作为语音指令,基于语音指令来指示设备执行对应的操作。其中,语音起始点通常由用户的主动操作触发,很容易通过唤醒词的采集时间点、语音交互启动选项被触发操作的时间点等数据确定出来,而语音结束点则需要由设备通过对语音分析处理才能得出。由此可见,如何准确地检测出语音结束点,对于语音交互技术而言是至关重要的,同时也是一大技术难点。
相关技术中,语音检测方法通常是:每经过一个时间窗,采集当前时间窗内的音频信号,检测所述音频信号的尾部静音时长,对尾部静音时长与静音时长阈值进行比较,若所述尾部静音时长大于静音时长阈值,则确定音频信号为语音结束点,若所述尾部静音时长小于或等于静音时长阈值,则确定音频信号不为语音结束点。
采用上述方法检测语音结束点时,一旦背景噪音较强,就会造成检测到的音频信号的尾部静音时长比准确的尾部静音时长偏大,导致语音结束点容易被漏检测,进而导致过晚地检测出语音交互已处于结束状态;此外,一旦用户在说话期间进行停顿,就会造成检测到的音频信号的尾部静音时长比准确的尾部静音时长偏小,就会导致过早地检测出语音交互处于结束状态。由此可见,这种方法检测出的语音结束点准确性较差。
发明内容
本申请实施例提供了一种语音检测方法、预测模型的训练方法、装置、设备及介质,能够提高检测语音结束点的准确性。
第一方面,提供了一种语音检测方法,在该方法中,可以获取音频信号以及人脸图像,所述人脸图像的拍摄时间点和所述音频信号的采集时间点相同;将所述人脸图像输入预测模型,所述预测模型用于预测用户是否具有继续说话的意图;通过所述预测模型对所述人脸图像进行处理,输出预测结果;若所述预测结果表示所述用户不具有继续说话的意图,确定所述音频信号为语音结束点。
以上提供了一种多模态的语音结束点检测方法,通过模型对拍摄的人脸图像进行识别,从而预测出用户是否具有继续说话的意图,结合预测结果,来判决采集到的音频信号是否为语音结束点,由于在声学特征的基础上,还融合了人脸图像这种视觉模态的特征来进行检测,即使在背景噪声很强或者用户说话期间停顿的情况下,也能够利用人脸图像来准确判决语音信号是否为语音结束点,因此避免了背景噪声以及说话停顿造成的干扰,从而避免了背景噪声以及说话停顿的干扰会引发的过晚或者过早检测出语音交互处于结束状态的问题,提高了 检测语音结束点的准确性,进而提高语音交互的效率。此外,由于解决了语音交互时语音结束点检测不准确的问题,避免了语音结束点过晚检测会引发的响应时延过长的问题,从而缩短了语音交互的时延,提高了语音交互的流畅性,避免了语音结束点过早检测会引发的语音指令被过早截断的问题,从而避免用户意图理解有误的情况,提高了语音交互的准确性。
可选地,在预测模型处理人脸图像的过程中,可以提取所述人脸图像包含的关键点;对所述关键点进行处理,得到所述人脸图像的动作特征;对所述动作特征进行分类,得到不同类别分别对应的置信度;根据所述置信度确定所述预测结果。
若一段语音中包含停顿,对语音进行句法分析时,无法区分一个音频信号是停顿还是语音结束点。而通过这种可选方式,融合了人脸的关键点的特征以及动作特征,能够基于人脸当前进行的动作,精确地识别出面部包含的微表情,从而根据表情推理出用户的精神状态,进而预测出用户是否具有继续说话的意图。这种方法借助于视觉信息来进行辅助判断,从而解决了句法分析无法解决的问题,能够减少语音的过早截断。
可选地,所述预测模型是根据第一样本人脸图像以及第二样本人脸图像训练得到的;所述第一样本人脸图像标注了第一标签,所述第一标签表示样本用户具有继续说话的意图,所述第一标签是根据第一样本音频信号确定的,所述第一样本音频信号的采集时间点及采集对象和所述第一样本人脸图像的拍摄时间点及拍摄对象均相同;所述第二样本人脸图像标注了第二标签,所述第二标签表示样本用户不具有继续说话的意图,所述第二标签是根据第二样本音频信号确定的,所述第二样本音频信号的采集时间点及采集对象和所述第二样本人脸图像的拍摄时间点及拍摄对象均相同。
通过这种可选方式,提供了实现用户意图预测功能的模型训练方法,利用包含继续说话的用户意图的样本人脸图像以及包含不继续说话的用户意图的样本人脸图像,进行模型训练,预测模型可以通过训练的过程,从包含继续说话的用户意图的样本人脸图像和对应的标签中,学习出用户意图为继续说话时,人脸图像的特征会是怎么样的,从包含不继续说话的用户意图的样本人脸图像和对应的标签中,学习出用户意图为不继续说话时,人脸图像的特征又会是怎么样的,那么预测模型由于学习到用户意图与人脸图像特征之间的对应关系,在模型应用阶段,即可通过模型来根据一幅未知的人脸图像,预测出当前用户是否具有继续说话的意图,从而利用人脸图像表示的用户意图,准确地检测出当前的语音信号是否为语音结束点。
可选地,所述第一样本音频信号满足第一条件,所述第一条件包括:所述第一样本音频信号对应的语音活性检测(Voice Activity Detection,VAD)结果先从说话状态更新为沉默状态,再从所述沉默状态更新为所述说话状态。
如果样本用户在说话期间进行了短暂的停顿,那么这种场景下,对说话期间采集的音频进行VAD的过程中,对于停顿之前的音频而言,VAD结果是说话状态;对于停顿期间的音频而言,VAD结果是沉默状态;对于停顿之后的音频而言,VAD结果恢复为说话状态。那么如果采集的样本音频信号满足第一条件(1),表明样本音频信号与这种场景中停顿期间的音频吻合。由于样本用户停顿之后又继续进行了说话,而不是结束语音,因此停顿时间点样本用户的意图是继续进行说话,那么停顿时间点拍摄的样本人脸图像会包含继续说话的用户意图,那么通过将该样本人脸图像标注为第一人脸图像,后续即可让模型通过第一样本人脸图像,学习出人脸图像与继续说话的用户意图之间的映射关系,那么在模型应用阶段,即可使用模型来根据一幅未知的人脸图像,预测出当前用户是否具有继续说话的意图。
可选地,所述第一样本音频信号满足第一条件,所述第一条件包括:所述第一样本音频信号的尾部静音时长小于第一阈值且大于第二阈值,所述第一阈值大于所述第二阈值;
可选地,所述第一样本音频信号满足第一条件,所述第一条件包括:文本信息组合的第一置信度大于第一文本信息的第二置信度,所述文本信息组合为所述第一文本信息与第二文本信息的组合,所述第一文本信息表示所述第一样本音频信号的上一个样本音频信号的语义,所述第二文本信息表示所述第一样本音频信号的下一个样本音频信号的语义,所述第一置信度表示所述文本信息组合为完整语句的概率,所述第二置信度表示所述第一文本信息为完整语句的概率;
可选地,所述第一样本音频信号满足第一条件,所述第一条件包括:所述文本信息组合的第一置信度大于所述第二文本信息的第三置信度,所述第三置信度表示所述第二文本信息为完整语句的概率;
通过上述第一条件,达到的效果至少可以包括:对于包含短暂停的一句话而言,相关技术会以停顿点为分割点,将这一句完整的话割裂开来,切分为两段语音。由于用户还没说完时,电子设备提前判定已经检测到了语音结束点,导致语音结束点检测过早。那么,电子设备会直接将停顿之前的语音作为语音指令,而忽略掉停顿之后的语音,导致识别的语音指令不完整,如果电子设备直接根据停顿之前的语音指令来进行业务处理,无疑会影响业务处理的准确性。而通过上述方法,能够综合考虑前后两段音频信号:不仅对前后两段音频信号分别进行识别,得出两段音频信号对应的分句各自是完整语句的置信度,还对多段音频信号组成的整体进行识别,得到两个分句的整体是完整语句的置信度;若整体是完整语句的置信度大于两个分句各自是完整语句的置信度,则将两个分句之间的静音片段对应的样本人脸图像取出,标记为第一样本人脸图像,从而可以让模型通过标注好的第一样本人脸图像,学习出停顿时人脸图像会包含的特征。
可选地,所述第一样本人脸图像满足第三条件,所述第三条件包括:将所述第一样本人脸图像分别输入所述预测模型中的第一分类器以及所述预测模型中的第二分类器后,所述第一分类器输出的概率大于所述第二分类器输出的概率,所述第一分类器用于预测人脸图像包含动作的概率,所述第二分类器用于预测人脸图像不包含动作的概率。
通过上述第三条件,能够融合拍摄的人脸图像、采集的音频信号、文本信息的语义等多个模态的信息,从而结合全局信息来对训练数据进行自动标注,由于综合考虑了各个模态的信息,可以保证样本人脸图像的标签与是否继续说话的用户意图相匹配,那么由于标注得到的样本的准确性高,模型根据准确的样本进行训练后,预测用户意图的准确性也会较高,因此有助于在模型应用阶段,准确地检测出语音结束点。
可选地,所述第二样本音频信号满足第二条件,所述第二条件包括以下至少一项:所述第二样本音频信号对应的VAD结果从说话状态更新为沉默状态;或,所述第二样本音频信号的尾部静音时长大于第一阈值。
通过上述第二条件,能够利用拍摄人脸图像时采集的音频信号,来判断人脸图像是否包含不继续说话的用户意图,利用声学模态的信息来对训练图像进行标注,可以保证样本人脸图像的标签与是否继续说话的用户意图相匹配,那么由于标注得到的样本的准确性高,模型根据准确的样本进行训练后,预测用户意图的准确性也会较高,因此有助于在模型应用阶段,准确地检测出语音结束点。
可选地,还可以融合文本模态的特征进行语音检测。具体而言,可以对所述音频信号进行语音识别,得到所述音频信号对应的第三文本信息;对所述第三文本信息进行句法分析,得到第一分析结果,所述第一分析结果用于表示所述第三文本信息是否为完整语句;若所述第一分析结果表示为所述第三文本信息不为完整语句,确定所述音频信号不为语音结束点;或者,若所述第一分析结果表示为所述第三文本信息为完整语句,执行所述将所述人脸图像输入预测模型的步骤。
通过融合文本模态的特征进行语音检测,至少可以达到以下效果:当前词汇与之前的词汇组成的语句的句法完整,并不能成为当前词汇是语音结束点的唯一依据。如果实施相关技术提供的方法,单纯依赖声学信息,就可能在检测到暂时停顿时,就将停顿点误判为语音结束点,导致语音指令被分割,造成曲解了用户意图,致使语音交互的任务处理错误。而通过上述方法,可以在检测到音频信号已经句法完整的条件下,触发应用预测模型来进行人脸识别的流程,从而利用预测结果,进一步判断音频信号是否确实到达了语音结束点,从而通过融合视觉模态的特征,避免句法分析误判的情况,极大地提高语音结束点检测的准确性,降低语音指令被过早截断的概率。此外,上述句法分析的方法不依赖于特定的ASR引擎和特定场景,各个模态的检测可以独立执行、综合判断,可操作性更易,实用性高。
可选地,句法分析的过程可以包括:对所述第三文本信息进行分词,得到多个词汇;对于所述多个词汇中的每个词汇,对所述词汇进行句法分析,得到所述词汇对应的第二分析结果,所述第二分析结果用于表示所述词汇与所述词汇之前的词汇是否组成了完整语句;若所述多个词汇中任一词汇对应的第二分析结果表示组成了完整语句,确定所述第三文本信息为完整语句;或者,若所述多个词汇中每个词汇对应的第二分析结果均表示没有组成完整语句,确定所述第三文本信息不为完整语句。
通过执行上述步骤来进行句法分析,达到的效果至少可以包括:不仅综合考虑了每个词汇与之前词汇之间在句法上的联系,而且利用了N—Best(N条最优)算法,每当检测到一个词汇,则判断该词汇是否已经和之前的词汇组成了完整语句,一旦当前的词汇表示已经组成完整语句时,即可确定已分析的文本信息为完整语句,执行下一步的检测流程。那么,可以在当前音频信号具有是语音结束点的概率时,及时检测出来,从而保证语音结束点检测的实时性,避免语音结束点检测过晚。
可选地,所述将所述人脸图像输入预测模型的触发条件包括:检测所述音频信号的尾部静音时长;确定所述尾部静音时长大于第三阈值。
通过上述触发条件,可以在尾部静音时长处于第三阈值和第一阈值之间时,执行融合人脸图像的特征来进行语音检测的流程。这种方式达到的效果至少可以包括:一旦静音时长大于最小的阈值(第三阈值),就结合文本模态以及图像模态,利用句法分析的结果以及面部分析的结果来检测语音结束点,从而通过多模态信息的融合,尽可能快又准地检测到语音端点,避免延时过长的情况。
可选地,上述语音检测的方法可以应用于车载终端,车载终端还可以采集行车状况信息,所述行车状况信息表示搭载所述车载终端的车辆的行车状况;采集行车状况信息,所述行车状况信息表示搭载所述车载终端的车辆的行车状况;根据所述行车状况信息,对所述第三阈值进行调整。
通过上述方式,达到的效果至少可以包括:可以融合语音检测的具体应用场景来进行端 点检测,例如应用在车载场景下,可以利用驾驶过程中的行车状况,来调整尾部静音时长的阈值,使得阈值可以根据当前的行车状况自适应调整,提升语音端点检测的鲁棒性。
可选地,对所述第三阈值进行调整的过程可以包括:若所述行车状况信息表示发生了急转弯,对所述第三阈值进行调整,调整后的第三阈值大于调整前的第三阈值;或,若所述行车状况信息表示发生了急刹车,对所述第三阈值进行调整,调整后的第三阈值大于调整前的第三阈值。
通过上述方式,达到的效果至少可以包括:如果车辆发生急转弯或急刹车,用户的语音很可能由于发生急转弯或急刹车而产生中断,导致语音结束点出现的概率变大,语音的中断时长也会相应变长,此时,通过提高尾部静音时长的阈值,能够让调整后的阈值适应于急转弯或急刹车的情况。
可以融合语音检测的具体应用场景来进行端点检测,例如应用在车载场景下,可以利用驾驶过程中的行车状况,来调整尾部静音时长的阈值,使得阈值可以根据当前的行车状况自适应调整,提升语音端点检测的鲁棒性。
可选地,上述语音检测的方法可以应用于车载终端,车载终端还可以采集环境信息,所述环境信息表示搭载所述车载终端的车辆所处的环境;根据所述环境信息,对所述预测模型的参数进行调整。
通过结合车外环境进行调参,达到的效果至少可以包括:在车辆驾驶的过程中,车外环境会对驾驶员的情绪产生影响,而情绪的变化会影响到人脸识别的过程,那么通过结合车外环境来调整预测模型的参数,可以让预测模型进行人脸识别的过程与当前的车外环境匹配,从而提高预测模型预测结果的精确性。
可选地,对所述预测模型的参数进行调整的过程可以包括:若所述环境信息表示发生了交通拥堵,对所述预测模型中第三分类器的判决阈值进行调整,所述第三分类器用于在输入数据高于所述判决阈值时判决用户具有继续说话的意图,在输入数据低于或等于所述判决阈值时判决用户不具有继续说话的意图。
通过上述方式,达到的效果至少可以包括:交通拥塞的场景下驾驶员心情焦躁的概率,会比交通畅通的场景下驾驶员心情焦躁的概率更高,而情绪的变化会影响到人脸识别的过程,那么通过结合交通状况来调整预测模型的参数,可以让预测模型进行人脸识别的过程与当前的交通状况匹配,从而提高预测模型预测结果的精确性。
第二方面,提供了一种用于语音检测的预测模型的训练方法,在该方法中,可以获取样本音频信号集以及待标注的样本人脸图像集;根据所述样本音频信号集中的第一样本音频信号,对所述样本人脸图像集中的第三样本人脸图像进行处理,得到第一样本人脸图像,所述第一样本人脸图像标注了第一标签,所述第一标签表示样本用户具有继续说话的意图,所述第一样本人脸图像的拍摄时间点及拍摄对象和所述第一样本音频信号的采集时间点及采集对象均相同;根据所述样本音频信号集中的第二样本音频信号,对所述样本人脸图像集中的第四样本人脸图像进行处理,得到第二样本人脸图像,所述第二样本人脸图像标注了第二标签,所述第二标签表示样本用户不具有继续说话的意图,所述第二样本人脸图像的拍摄时间点及拍摄对象和所述第二样本音频信号的采集时间点及采集对象均相同;使用所述第一样本人脸图像以及所述第二样本人脸图像进行模型训练,得到预测模型,所述预测模型用于预测用户 是否具有继续说话的意图。
可选地,所述第一样本音频信号满足第一条件,所述第一条件包括以下至少一项:所述第一样本音频信号对应的VAD结果先从说话状态更新为沉默状态,再从所述沉默状态更新为所述说话状态;或,所述第一样本音频信号的尾部静音时长小于第一阈值且大于第二阈值,所述第一阈值大于所述第二阈值;或,文本信息组合的第一置信度大于第一文本信息的第二置信度,所述文本信息组合为所述第一文本信息与第二文本信息的组合,所述第一文本信息表示所述第一样本音频信号的上一个样本音频信号的语义,所述第二文本信息表示所述第一样本音频信号的下一个样本音频信号的语义,所述第一置信度表示所述文本信息组合为完整语句的概率,所述第二置信度表示所述第一文本信息为完整语句的概率;或,所述文本信息组合的第一置信度大于所述第二文本信息的第三置信度,所述第三置信度表示所述第二文本信息为完整语句的概率。
可选地,所述第二样本音频信号满足第二条件,所述第二条件包括以下至少一项:所述第二样本音频信号对应的VAD结果从说话状态更新为沉默状态;或,所述第二样本音频信号的尾部静音时长大于第一阈值。
可选地,所述第一样本人脸图像满足第三条件,所述第三条件包括:将所述第一样本人脸图像输入所述预测模型中的第一分类器以及所述预测模型中的第二分类器后,所述第一分类器输出的概率大于所述第二分类器输出的概率,所述第一分类器用于预测人脸图像包含动作的概率,所述第二分类器用于预测人脸图像不包含动作的概率。
第三方面,提供了一种语音检测装置,该语音检测装置具有实现上述第一方面或第一方面任一种可选方式中语音检测的功能。该语音检测装置包括至少一个模块,至少一个模块用于实现上述第一方面或第一方面任一种可选方式所提供的语音检测方法。
可选地,所述将所述人脸图像输入预测模型的触发条件包括:检测所述音频信号的尾部静音时长;确定所述尾部静音时长大于第三阈值。
可选地,所述装置应用于车载终端,所述装置还包括:第一采集模块,用于采集行车状况信息,所述行车状况信息表示搭载所述车载终端的车辆的行车状况;第一调整模块,用于根据所述行车状况信息,对所述第三阈值进行调整。
可选地,所述第一调整模块,用于若所述行车状况信息表示发生了急转弯,对所述第三阈值进行调整,调整后的第三阈值大于调整前的第三阈值。
可选地,所述第一调整模块,用于若所述行车状况信息表示发生了急刹车,对所述第三阈值进行调整,调整后的第三阈值大于调整前的第三阈值。
可选地,所述装置应用于车载终端,所述装置还包括:第二采集模块,用于采集环境信息,所述环境信息表示搭载所述车载终端的车辆所处的环境;第二调整模块,用于根据所述环境信息,对所述预测模型的参数进行调整。
可选地,所述第二调整模块,用于若所述环境信息表示发生了交通拥堵,对所述预测模型中第三分类器的判决阈值进行调整,所述第三分类器用于在输入数据高于所述判决阈值时判决用户具有继续说话的意图,在输入数据低于或等于所述判决阈值时判决用户不具有继续说话的意图。
第四方面,提供了一种用于语音检测的预测模型的训练装置,该装置包括:
获取模块,用于获取样本音频信号集以及待标注的样本人脸图像集;
处理模块,用于根据所述样本音频信号集中的第一样本音频信号,对所述样本人脸图像集中的第三样本人脸图像进行处理,得到第一样本人脸图像,所述第一样本人脸图像标注了第一标签,所述第一标签表示样本用户具有继续说话的意图,所述第一样本人脸图像的拍摄时间点及拍摄对象和所述第一样本音频信号的采集时间点及采集对象均相同;
所述处理模块,还用于根据所述样本音频信号集中的第二样本音频信号,对所述样本人脸图像集中的第四样本人脸图像进行处理,得到第二样本人脸图像,所述第二样本人脸图像标注了第二标签,所述第二标签表示样本用户不具有继续说话的意图,所述第二样本人脸图像的拍摄时间点及拍摄对象和所述第二样本音频信号的采集时间点及采集对象均相同;
训练模块,用于使用所述第一样本人脸图像以及所述第二样本人脸图像进行模型训练,得到预测模型,所述预测模型用于预测用户是否具有继续说话的意图。
可选地,所述第一样本音频信号满足第一条件,所述第一条件包括以下至少一项:
所述第一样本音频信号对应的VAD结果先从说话状态更新为沉默状态,再从所述沉默状态更新为所述说话状态;或,
所述第一样本音频信号的尾部静音时长小于第一阈值且大于第二阈值,所述第一阈值大于所述第二阈值;或,
文本信息组合的第一置信度大于第一文本信息的第二置信度,所述文本信息组合为所述第一文本信息与第二文本信息的组合,所述第一文本信息表示所述第一样本音频信号的上一个样本音频信号的语义,所述第二文本信息表示所述第一样本音频信号的下一个样本音频信号的语义,所述第一置信度表示所述文本信息组合为完整语句的概率,所述第二置信度表示所述第一文本信息为完整语句的概率;或,
所述文本信息组合的第一置信度大于所述第二文本信息的第三置信度,所述第三置信度表示所述第二文本信息为完整语句的概率。
可选地,所述第二样本音频信号满足第二条件,所述第二条件包括以下至少一项:
所述第二样本音频信号对应的VAD结果从说话状态更新为沉默状态;或,
所述第二样本音频信号的尾部静音时长大于第一阈值。
可选地,所述第一样本人脸图像满足第三条件,所述第三条件包括:
将所述第一样本人脸图像输入所述预测模型中的第一分类器以及所述预测模型中的第二分类器后,所述第一分类器输出的概率大于所述第二分类器输出的概率,所述第一分类器用于预测人脸图像包含动作的概率,所述第二分类器用于预测人脸图像不包含动作的概率。
第五方面,提供了一种电子设备,该电子设备包括处理器,该处理器用于执行指令,使得该电子设备执行上述第一方面或第一方面任一种可选方式所提供的语音检测方法。第五方面提供的电子设备的具体细节可参见上述第一方面或第一方面任一种可选方式,此处不再赘述。
第六方面,提供了一种电子设备,该电子设备包括处理器,该处理器用于执行指令,使 得该电子设备执行上述第二方面或第二方面任一种可选方式所提供的用于语音检测的预测模型的训练方法。第六方面提供的电子设备的具体细节可参见上述第二方面或第二方面任一种可选方式,此处不再赘述。
第七方面,提供了一种计算机可读存储介质,该存储介质中存储有至少一条指令,该指令由处理器读取以使电子设备执行上述第一方面或第一方面任一种可选方式所提供的语音检测方法。
第八方面,提供了一种计算机可读存储介质,该存储介质中存储有至少一条指令,该指令由处理器读取以使电子设备执行上述第二方面或第二方面任一种可选方式所提供的用于语音检测的预测模型的训练方法。
第九方面,提供了一种计算机程序产品,当该计算机程序产品在电子设备上运行时,使得电子设备执行上述第一方面或第一方面任一种可选方式所提供的语音检测方法。
第十方面,提供了一种计算机程序产品,当该计算机程序产品在电子设备上运行时,使得电子设备执行上述第二方面或第二方面任一种可选方式所提供的用于语音检测的预测模型的训练方法。
第十一方面,提供了一种芯片,当该芯片在电子设备上运行时,使得电子设备执行上述第一方面或第一方面任一种可选方式所提供的语音检测方法。
第十二方面,提供了一种芯片,当该芯片在电子设备上运行时,使得电子设备执行上述第二方面或第二方面任一种可选方式所提供的用于语音检测的预测模型的训练方法。
附图说明
图1是本申请实施例提供的一种语音检测方法的实施环境的示意图;
图2是本申请实施例提供的一种用于语音检测的预测模型的训练方法的流程图;
图3是本申请实施例提供的一种标注第一标签所需满足的条件的示意图;
图4是本申请实施例提供的一种标注第二标签所需满足的条件的示意图;
图5是本申请实施例提供的一种预测模型的结构示意图;
图6是本申请实施例提供的一种语音检测方法的流程图;
图7是本申请实施例提供的一种句法分析的示意图;
图8是本申请实施例提供的一种句法分析的示意图;
图9是本申请实施例提供的一种语音检测方法的流程图;
图10是本申请实施例提供的一种车载场景下语音检测方法的流程图;
图11是本申请实施例提供的一种语音检测方法的软件架构图;
图12是本申请实施例提供的一种语音检测方法的流程图;
图13是本申请实施例提供的一种语音检测装置的结构示意图;
图14是本申请实施例提供的一种用于语音检测的预测模型的训练装置的结构示意图;
图15是本申请实施例提供的一种终端100的结构示意图;
图16是本申请实施例提供的一种终端100的功能架构图;
图17是本申请实施例提供的一种计算设备的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
本申请中术语“至少一个”的含义是指一个或多个,本申请中术语“多个”的含义是指两个或两个以上,例如,多个第二报文是指两个或两个以上的第二报文。本文中术语“系统”和“网络”经常可互换使用。
本申请中术语“第一”“第二”等字样用于对作用和功能基本相同的相同项或相似项进行区分,应理解,“第一”、“第二”、“第n”之间不具有逻辑或时序上的依赖关系,也不对数量和执行顺序进行限定。
应理解,在本申请的各个实施例中,各个过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
应理解,根据A确定B并不意味着仅仅根据A确定B,还可以根据A和/或其它信息确定B。
应理解,本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
应理解,说明书通篇中提到的“一个实施例”或“一实施例”意味着与实施例有关的特定特征、结构或特性包括在本申请的至少一个实施例中。因此,在整个说明书各处出现的“在一个实施例中”或“在一实施例中”未必一定指相同的实施例。此外,这些特定的特征、结构或特性可以任意适合的方式结合在一个或多个实施例中。
以下,对本申请涉及的术语进行解释。
语音端点(Endpoint)检测:是指对音频中的语音结束点进行检测的技术。具体而言,音频通常包括多个音频信号,在端点检测的过程中,可以依次检测每个音频信号,判断当前的音频信号是否为语音结束点。
语音端点检测技术通常应用在语音交互的场景中,当用户说话后,通过对音频进行语音端点检测,确定语音起始点和语音结束点,截取语音起始点语音和结束点之间的音频,作为一条语音指令。对于语音起始点而言,由于语音交互通常由用户主动发起。例如,语音交互的触发方式可以是一按即说(Push To Talk,PTT)的方式。比如说,用户可以通过按压一个实体的按键或者虚拟的按键,来启动语音交互;又如,语音交互的触发方式可以是语音唤醒(Voice Trigger,VT)的方式。比如说,用户可以通过说出唤醒词,来启动语音交互。这就使得语音起始点比较容易准确检测。而对于语音结束点而言,通常需要机器自动检测。
相关技术中,语音结束点通常仅是依赖自动语音识别(Auto Speech Recognition,ASR)以及语音活性检测(Voice Activity Detection,VAD)技术实现。
VAD:用于检测一定时间窗内的音频信号是否是语音信号。依赖于VAD技术的语音结束点检测方案是:当VAD检测到一定时长的非语音,则确定语音结束。这个时长一般是一个固定的时长,比如800毫秒。若VAD检测到超过800毫秒的非语音,则会确定语音结束,将当前检测的音频信号作为语音端点。其中,语音尾部静音(Trailing Silence,TS)是这种端点检测方法的重要参数。但是,很难设置一个固定的时长参数来适配所有的场景和环境,例如,如果设置的时长参数过大,则用户感受到的时延会越长。如果设置的时长参数过小,则用户的语音容易被截断。
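As a rough illustration of the VAD-based scheme with a fixed trailing-silence threshold described above, the following sketch assumes a 20 ms frame length, the 800 ms threshold used as an example, and a placeholder `is_speech` VAD function:

```python
FRAME_MS = 20              # assumed frame length
TRAILING_SILENCE_MS = 800  # fixed trailing-silence threshold used as an example above

def detect_end_point(frames, is_speech):
    """Return the index of the frame judged to be the voice end point, or None.

    `frames` is a sequence of audio frames and `is_speech(frame)` is a VAD
    function returning True for speech frames (a stand-in for a real VAD).
    """
    silence_ms = 0
    started = False
    for i, frame in enumerate(frames):
        if is_speech(frame):
            started = True
            silence_ms = 0
        elif started:
            silence_ms += FRAME_MS
            if silence_ms >= TRAILING_SILENCE_MS:
                return i  # trailing silence long enough: treat as end point
    return None
```

With a fixed threshold of this kind, strong background noise keeps the silence counter from accumulating and delays the end point, while a pause longer than the threshold truncates the utterance, which is the limitation that the multi-modal method in the following embodiments addresses.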
ASR技术以及VAD技术检测语音结束点存在两个主要问题:第一,背景噪音容易导致检测语音结束点偏晚;第二,若语音中间包含停顿,容易导致检测出的语音结束点偏早。而这两个问题会极大影响用户的使用体验:由于存在第一个问题,机器会很长时间后才检测到语音指令已经结束,由于语音指令的实际结束时间比检测到的结束时间来说更长,导致语音指令结束后,经过一段时间后才会执行语音指令,这就造成执行语音指令的时延过大,从使用者的角度来说,说出语音后要等待很长时间后系统才进行反馈,无疑产生了卡顿的现象,影响用户的体验。由于存在第二个问题,用户的语音尚未结束,就已经被系统提前截断,那么系统根据过早截断的语音所解析出的语音指令就会不完整,导致系统根据语音指令识别出的用户意图与实际用户意图相比出现严重偏差,进而导致语音交互业务处理错误。由此可见,单独依赖声学信息的VAD,会在有些场景下,不足以准确判断语音端点的状态。
而通过下述方法实施例,能够结合声学信息、文本信息和视觉信息进行综合决策,从而实现多模态的语音结束点检测,这种方法检测到的语音结束点更准确,因此可以有效解决延迟过长和过早截断的两个问题,从而克服了VAD方案的缺陷,能够大幅提升用户体验。此外,该方法可以不依赖于特定的ASR引擎和特定场景,各个模态的检测可以独立执行、综合判断,可操作性更易。
以下,示例性介绍本申请的硬件环境。
图1是本申请实施例提供的一种语音检测方法的实施环境的示意图。该实施环境包括:终端和语音检测平台。
参见图1,终端可以是车载终端101、智能手机102、智能音箱103或者机器人104。当然,图1所示的几种终端仅是举例,终端也可以是其他支持语音检测功能的电子设备,例如智能家居设备、智能电视、游戏主机、台式计算机、平板电脑、电子书阅读器、智能电视、MP3(moving picture experts group audio layer III,动态影像专家压缩标准音频层面3)播放器或MP4(moving picture experts group audio layer IV,动态影像专家压缩标准音频层面4)播放器和膝上型便携计算机等等,本实施例对终端的设备类型不做限定。
终端可以运行有支持语音检测的应用程序。该应用程序可以是导航应用、语音助手、智能问答应用等。示例性的,终端是用户使用的终端,终端运行的应用程序内登录有用户账号,该用户账号可以预先在语音检测平台中注册。终端可以通过无线网络或有线网络与语音检测平台相连。
语音检测平台用于为支持语音检测的应用程序提供后台服务。例如,语音检测平台可以执行下述方法实施例,训练得到预测模型,将预测模型发送给终端,以便终端利用预测模型来进行语音检测。
语音检测平台包括服务器201以及数据库202。服务器201可以是一台服务器,也可以是多台服务器组成的集群。数据库202中可以用于存储样本集,例如包含大量样本人脸图像的样本人脸图像集、包含大量样本音频信号的样本音频信号集等。服务器201可以访问数据库202,得到数据库202存储的样本集,通过样本集训练得到预测模型。
本领域技术人员可以知晓,上述终端、服务器或者数据库的数量可以更多或更少。比如上述终端、服务器或者数据库可以仅为一个,或者上述为几十个或几百个,或者更多数量,此时虽图中未示出,语音检测系统还包括其他终端、其他服务器或者其他数据库。
以上示例性介绍了系统架构,以下示例性介绍基于上文提供的系统架构进行语音检测的方法流程。
语音检测的方法流程可以包括模型训练阶段以及模型预测阶段。以下,通过图2实施例,对模型训练阶段的方法流程进行介绍,通过图6实施例,对模型预测阶段的方法流程进行介绍。
参见图2,图2是本申请实施例提供的一种用于语音检测的预测模型的训练方法的流程图,如图2所示,该方法可以应用在电子设备,该电子设备可以为图1所示系统架构中的终端,也可以是图1所示系统架构中的语音检测平台,比如是服务器201。该方法包括以下步骤:
步骤201、电子设备获取样本音频信号集以及待标注的样本人脸图像集。
样本音频信号集包括多个样本音频信号,样本人脸图像集包括多个样本人脸图像,每个样本人脸图像的拍摄时间点及拍摄对象和对应的样本音频信号的采集时间点及采集对象均相同,可以根据样本人脸图像和样本音频信号之间的对应关系,对样本人脸图像进行标注。
步骤202、电子设备根据样本音频信号集中的第一样本音频信号,对样本人脸图像集中的第三样本人脸图像进行处理,得到第一样本人脸图像。
电子设备可以获取样本音频信号集,该样本音频信号集包括多个样本音频信号,每个样本音频信号和每个样本人脸图像之间可以存在对应关系。样本音频信号和样本人脸图像之间的对应关系是指样本音频信号的采集时间点和样本人脸图像的拍摄时间点相同。比如说,X时Y刻拍摄的样本人脸图像对应于X时Y刻采集的样本音频信号。其中,样本音频信号集的获取方式可以包括多种。例如,电子设备可以包括麦克风,电子设备可以接收录音指令,响应于录音指令,通过麦克风采集样本用户发出的音频,得到样本音频信号。其中,录音指令可以由用户的操作触发。又如,电子设备可以通过网络向服务器请求样本音频信号集,本实施例对如何获取样本音频信号集不做限定。
第三样本人脸图像为未标注的样本人脸图像,第三样本人脸图像可以是样本人脸图像集中的任一个样本人脸图像。第一样本音频信号和第三样本人脸图像具有对应关系,第三样本人脸图像的拍摄时间点及拍摄对象和对应的第一样本音频信号的采集时间点及采集对象均相同。
电子设备可以获取样本人脸图像集,样本人脸图像集包括多个第三样本人脸图像,样本人脸图像集的获取方式可以包括多种。例如,电子设备可以包括摄像头,电子设备可以接收拍摄指令,响应于拍摄指令,通过摄像头对样本用户进行拍摄,得到样本人脸图像集。其中,拍摄指令用于指示电子设备进行拍摄,拍摄指令可以由用户的操作触发。又如,电子设备可 以读取预先存储的样本人脸图像集。再如,电子设备可以通过网络向服务器请求样本人脸图像集,本实施例对如何获取样本人脸图像集不做限定。
第一样本人脸图像为已标注的样本人脸图像,第一样本人脸图像可以由第三样本人脸图像添加标签后得到。由于第三样本人脸图像的拍摄时间点及拍摄对象和第一样本人脸图像的拍摄时间点及拍摄对象均相同,则第三样本人脸图像的拍摄时间点及拍摄对象和第一样本音频信号的采集时间点及采集对象也均相同。第一样本人脸图像的内容为样本用户的人脸,第一样本人脸图像包含样本用户具有继续说话的意图的特征,第一样本人脸图像可以由摄像头对样本用户进行拍摄得到。第一样本人脸图像的数量可以是多个,不同第一样本人脸图像对应的样本用户可以相同或者不同。第一样本人脸图像标注了第一标签。
第一标签表示样本用户具有继续说话的意图。第一标签可以是任意数据格式,例如数字、字母、字符串等。例如,第一标签可以是“Think before speaking”(说话之前的思考状态)。
获取第一样本人脸图像的方式可以包括多种,以下通过获取方式一至获取方式三举例说明。
获取方式一、电子设备获取样本人脸图像集。对于样本人脸图像集中的每个第三样本人脸图像,电子设备可以确定该第三样本人脸图像对应的第一样本音频信号,判断第一样本音频信号是否满足第一条件,若第一样本音频信号满足第一条件,则对第三样本人脸图像添加第一标签,得到第一样本人脸图像,该第一样本人脸图像包含第三样本人脸图像以及第一标签。通过该流程可见,在获取方式一中,第一标签是根据第一样本音频信号确定的。
第一条件用于判断第一样本音频信号是否包含继续说话意图,若第一样本音频信号满足第一条件,可以将对应的第三样本人脸图像标注为第一样本人脸图像。第一条件可以根据实验、经验或需求设置。例如,第一条件可以包括以下第一条件(1)至第一条件(4)中的至少一项:
第一条件(1)第一样本音频信号对应的VAD结果先从说话状态更新为沉默状态,再从沉默状态更新为说话状态。
由于样本人脸图像是以视觉的维度表征X时Y刻的样本用户的用户意图,样本音频信号是以声学的维度表征X时Y刻的样本用户的用户意图,可见样本人脸图像和样本音频信号是从不同的模态反映了相同的用户意图,基于这一构思,可以利用相互对应的样本人脸图像和样本音频信号,来挖掘出用户意图在声学模态的特征与视觉模态的特征之间的关联关系,那么在模型预测阶段,即可利用关联关系融合多模态的特征,进行语音结束点的检测。
在一些实施例中,针对第一条件(1)检测的过程可以包括:电子设备可以包括VAD单元,该VAD单元用于检测当前时间窗的音频信号是否为语音信号。该VAD单元可以是软件,也可以是硬件,或者是软件和硬件的组合。该VAD单元的输入参数可以包括音频信号,该VAD单元的输出参数可以包括音频信号的VAD结果,该VAD结果可以包括说话状态以及沉默状态。说话状态表示音频信号为语音信号,例如,说话状态在程序中可以记录为Speech(说话的);沉默状态表示音频信号不为语音信号,沉默状态在程序中可以记录为Silence(沉默的)。在标注样本人脸图像的过程中,可以将第一样本音频信号输入VAD单元,通过VAD单元对第一样本音频信号进行VAD处理,输出VAD结果。如果VAD结果首先是说话状态,之后切换为沉默状态,再之后又切换回说话状态,可以确定第一样本音频信号满足第一条件(1)。
以下结合一个示例性场景,对设置第一条件(1)的效果进行说明:
如果样本用户在说话期间进行了短暂的停顿,那么这种场景下,对说话期间采集的音频进行VAD的过程中,对于停顿之前的音频而言,VAD结果是说话状态;对于停顿期间的音频而言,VAD结果是沉默状态;对于停顿之后的音频而言,VAD结果恢复为说话状态。那么如果采集的样本音频信号满足第一条件(1),表明样本音频信号与这种场景中停顿期间的音频吻合。由于样本用户停顿之后又继续进行了说话,而不是结束语音,因此停顿时间点样本用户的意图是继续进行说话,那么停顿时间点拍摄的样本人脸图像会包含继续说话的用户意图,那么通过将该样本人脸图像标注为第一人脸图像,后续即可让模型通过第一样本人脸图像,学习出人脸图像与继续说话的用户意图之间的映射关系,那么在模型应用阶段,即可使用模型来根据一幅未知的人脸图像,预测出当前用户是否具有继续说话的意图。
第一条件(2)第一样本音频信号的尾部静音时长小于第一阈值且大于第二阈值。
尾部静音时长也称语音尾部静音(Trailing Silence,TS),是指语音信号的尾部的静音片段持续的总时长。音频信号的尾部静音时长越长,表明音频信号是语音结束点的概率越大。本实施例中,可以通过阈值检测音频信号的尾部静音时长是否已经满足语音结束的第一条件。具体地,尾部静音时长对应的阈值可以包括第一阈值以及第二阈值。
第一阈值可以是尾部静音时长对应的阈值中的最大值,第一阈值大于第二阈值。若尾部静音时长大于第一阈值,可以确定音频信号是语音结束点。例如,第一阈值可以在程序中记为D max。第一阈值的具体数值可以根据实验、经验或需求配置,本实施例在第一阈值的具体数值不做限定。
第二阈值可以是尾部静音时长对应的阈值中的最小值,若尾部静音时长大于第二阈值,可以确定音频信号具有是语音结束点的概率,即,音频信号可能是语音结束点,也可能不是语音结束点,可以利用其他模态的特征来进一步判定音频信号是否是语音结束点。例如,第二阈值可以在程序中记为D min
具体而言,电子设备可以检测第一样本音频信号的尾部静音时长,对尾部静音时长与第一阈值和第二阈值进行比较,若尾部静音时长小于第一阈值且大于第二阈值,可以确定第一样本音频信号满足第一条件(2)。
第一条件(3)文本信息组合对应的第一置信度大于第一文本信息对应的第二置信度。
文本信息组合为第一文本信息与第二文本信息的组合。第一文本信息表示第一样本音频信号的上一个样本音频信号的语义。第二文本信息表示第一样本音频信号的下一个样本音频信号的语义。该文本信息组合可以是有序的组合,第一文本信息在前,第二文本信息在后。
例如,如果样本用户在说话期间进行了短暂的停顿,文本信息组合可以表示说话期间整段音频的语义,第一文本信息可以表示停顿之前的语义,第二文本信息可以表示停顿之后的语义。在一个示例性场景中,用户说“我要去金海路”,之后停顿了一下,然后继续说“金穗路”。在这一场景中,第一样本音频信号可以为停顿期间的静音片段,第一文本信息是停顿之前的音频信号对应的文本信息,即“我要去金海路”。第二文本信息是停顿之后的音频信号对应的文本信息,即“金穗路”,文本信息组合可以是“我要去金海路”和“金穗路”的组合,即“我要去金海路金穗路”。
第一置信度表示文本信息组合为完整语句的概率。第一置信度越大,表示文本信息组合为完整语句的概率越高,那么第一样本音频信号是停顿而不是结束的概率越高,第一样本音 频信号对应的第三样本人脸图像包含继续说话意图的概率也就越高,则将该第三样本人脸图像标注为第一样本人脸图像的准确性越高。例如,在上述场景中,第一置信度可以表示“我要去金海路金穗路”为完整语句的概率。第一置信度可以记为Conf merge
第二置信度表示第一文本信息为完整语句的概率。第二置信度越大,表示第一文本信息为完整语句的概率越高,那么第一样本音频信号是结束而不是停顿的概率越高,第一样本音频信号对应的第三样本人脸图像包含不继续说话意图的概率也就越高。例如,在上述场景中,第二置信度可以表示“我要去金海路”为完整语句的概率。第二置信度可以记为Conf spliti
具体而言,电子设备以及第一条件(3)进行检测的过程可以包括以下步骤:
步骤一、对第一样本音频信号的上一个样本音频信号进行语音识别,得到第一文本信息。
步骤二、对第一样本音频信号的下一个样本音频信号进行语音识别,得到第二文本信息。
步骤三、对第一文本信息和第二文本信息进行拼接,得到文本信息组合。
步骤四、对文本信息组合进行句法分析,得到第一置信度。
步骤五、对第一文本信息进行句法分析,得到第二置信度。
步骤六、对第一置信度和第二置信度进行比较,若第一置信度大于第二置信度,可以确定样本人脸图像满足第一条件(3)。
第一条件(4)文本信息组合对应的第一置信度大于第二文本信息对应的第三置信度。
第三置信度表示第二文本信息为完整语句的概率。第三置信度越大,表示第二文本信息为完整语句的概率越高。例如,在上述场景中,第三置信度可以表示“金穗路”为完整语句的概率。
具体而言,电子设备以及第一条件(3)进行检测的过程可以包括以下步骤:
步骤一、对第一样本音频信号的上一个样本音频信号进行语音识别,得到第一文本信息。
步骤二、对第一样本音频信号的下一个样本音频信号进行语音识别,得到第二文本信息。
步骤三、对第一文本信息和第二文本信息进行拼接,得到文本信息组合。
步骤四、对文本信息组合进行句法分析,得到第一置信度。
步骤五、对第二文本信息进行句法分析,得到第三置信度。
步骤六、对第一置信度和第三置信度进行比较,若第一置信度大于第三置信度,可以确定样本人脸图像满足第一条件(4)。
需要说明的一点是,上述第一条件(3)和第一条件(4)可以结合,结合方案中第一条件(3)和第一条件(4)可以是且的关系。具体地,第一条件(3)和第一条件(4)的结合方案可以是:
步骤一、对第一样本音频信号的上一个样本音频信号进行语音识别,得到第一文本信息。
步骤二、对第一样本音频信号的下一个样本音频信号进行语音识别,得到第二文本信息。
步骤三、对第一文本信息和第二文本信息进行拼接,得到文本信息组合。
步骤四、对文本信息组合进行句法分析,得到第一置信度。
步骤五、对第一文本信息进行句法分析,得到第二置信度。
步骤六、对第二文本信息进行句法分析,得到第三置信度。
步骤七、对第一置信度与第二置信度进行比较,对第一置信度与第三置信度进行比较,若第一置信度大于第二置信度,且第一置信度大于第三置信度,可以确定样本人脸图像满足第一条件,若第一置信度小于或等于第二置信度,或者第一置信度小于或等于第三置信度, 可以确定样本人脸图像不满足第一条件。
通过上述第一条件(3)和第一条件(4),达到的效果至少可以包括:对于包含短暂停的一句话而言,相关技术会以停顿点为分割点,将这一句完整的话割裂开来,切分为两段语音。比如,对于“我要去金海路金穗路”来说,如果用户说完“我要去金海路”后停顿了一下,电子设备会将这句话切分为“我要去金海路”和“金穗路”。由于用户说到“我要去金海路”中的“路”时,电子设备提前判定已经检测到了语音结束点,导致语音结束点检测过早。那么,电子设备会直接将“我要去金海路”作为语音指令,而忽略掉后面跟随的“金穗路”,导致识别的语音指令不完整,如果电子设备直接根据“我要去金海路”来进行业务处理,比如说导航至金海路,无疑会影响业务处理的准确性。
而通过上述方法,能够综合考虑前后两段音频信号:不仅对前后两段音频信号分别进行识别,得出两段音频信号对应的分句各自是完整语句的置信度,还对多段音频信号组成的整体进行识别,得到两个分句的整体是完整语句的置信度;若整体是完整语句的置信度大于两个分句各自是完整语句的置信度,则将两个分句之间的静音片段对应的样本人脸图像取出,标记为第一样本人脸图像,从而可以让模型通过标注好的第一样本人脸图像,学习出停顿时人脸图像会包含的特征。例如,如果“我要去金海路金穗路”的置信度会大于“我要去金海路”的置信度和“金穗路”的置信度,此时把“我要去金海路”和“金穗路”之间的静音片段对应的样本人脸图像取出,添加标签“Think before speaking”。
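The automatic labeling rule discussed above can be sketched as follows; `sentence_completeness` is a placeholder for the syntactic-analysis component that scores how likely a text is a complete sentence, and the label string follows the "Think before speaking" convention used here:

```python
def label_pause_frame(text_before, text_after, sentence_completeness):
    """Decide whether the face image captured during the silent segment
    between `text_before` and `text_after` should be labeled as a pause.

    `sentence_completeness(text)` returns the confidence that `text` is a
    complete sentence (a stand-in for the real syntactic analyzer).
    """
    conf_merge = sentence_completeness(text_before + text_after)
    conf_split1 = sentence_completeness(text_before)
    conf_split2 = sentence_completeness(text_after)
    if conf_merge > conf_split1 and conf_merge > conf_split2:
        # The merged utterance is more likely a complete sentence than either
        # part alone, so the silence was a pause, not an end point.
        return "Think before speaking"
    return None
```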
应理解,上述第一条件(1)至第一条件(4)可以采用任意方式结合。例如,可以仅使用这4种第一条件中的一种第一条件,或者,执行这4种第一条件中两种或两种以上的第一条件。如果将不同第一条件结合起来,不同第一条件之间的逻辑关系可以是且的关系,也可以是或的关系。示例性地,参见图3,满足第一条件的情况可以如图3所示。还应理解,如果第一条件(1)至第一条件(4)中的不同第一条件结合,对结合方案中不同第一条件进行判定时时间先后不做限定。可以某一实现方式先执行,其他实现方式后执行,也可以多种实现方式并行执行。
此外,在标注过程中,对于样本人脸图像集中的每个第三样本人脸图像,电子设备可以还可以判断第三样本人脸图像是否满足第三条件,若第三样本人脸图像满足第三条件,则对第三样本人脸图像添加第一标签,得到第一样本人脸图像。其中,第三条件和第一条件(1)至第一条件(4)中的任一项或多项可以结合,若第三条件和第一条件(1)至第一条件(4)中的任一项或多项结合,第三条件和第一条件之间的逻辑关系可以是且的关系,也可以是或者的关系。
在一些实施例中,第三条件包括:将第一样本人脸图像分别输入预测模型中的第一分类器以及预测模型中的第二分类器后,第一分类器输出的概率大于第二分类器输出的概率。
第一分类器用于预测人脸图像包含动作的概率,第一分类器的输出参数可以是人脸图像包含动作的概率,第一分类器的输入参数可以是人脸图像的关键点的特征。第一分类器可以称为动作单元。第一分类器可以是预测模型的一部分,例如,第一分类器可以是图5中动作识别层中的一部分,可以包括一个或多个AU。
第二分类器用于预测人脸图像不包含动作的概率,第二分类器的输出参数可以是人脸图像不包含动作的概率,第二分类器的输入参数可以是人脸图像的关键点的特征。第二分类器可以称为动作单元。
第一分类器和第二分类器可以组合使用。例如,若第一分类器输出的概率大于第二分类器输出的概率,表明第一分类器的输出结果有效。在一些实施例中,第一分类器的数量可以包括多个,每个第一分类器可以用于预测一个动作,或者,多个第一分类器的组合用于预测一个动作。多个第一分类器输出的概率和第二分类器输出的概率之和可以等于1。
示例性地,第一分类器可以有N个,N为正整数。N个第一分类器中第i个第一分类器可以记为AU i,AU i输出的概率可以记为PAU i。第二分类器可以记为NEU,NEU输出的概率可以记为PNEU。PAU 1、PAU 2……PAU N与PNEU之和为1。若第一分类器AU i的输出结果大于第二分类器NEU的输出结果,即PAU i大于P NEU,则第一分类器AU i当前的输出结果有效;若第一分类器AU i的输出结果小于或等于第二分类器NEU的输出结果,即PAU i小于或等于PNEU,则第一分类器AU i当前的输出结果无效。其中,N为正整数,i为正整数,i小于N。其中,如果第一分类器的数量为多个,第一条件(5)具体可以是:任一个第一分类器输出的概率大于第二分类器输出的概率,即存在PAU i>P NEU
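The validity check on the action-unit outputs described above can be written as a small sketch; the probabilities are assumed to come from the N first classifiers AU_1..AU_N and the second classifier NEU and to sum to 1:

```python
def face_has_valid_action(p_aus, p_neu):
    """Third condition: the face image counts as containing an action if any
    action unit AU_i outputs a probability greater than the 'no action'
    classifier NEU, i.e. PAU_i > PNEU for some i."""
    assert abs(sum(p_aus) + p_neu - 1.0) < 1e-6  # probabilities sum to 1
    return any(p_au > p_neu for p_au in p_aus)

# Example: three action units plus NEU
print(face_has_valid_action([0.15, 0.40, 0.10], 0.35))  # True, since AU_2 > NEU
```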
获取方式二、电子设备向数据库发送获取请求,该获取请求用于请求获取第一样本人脸图像,数据库响应于获取请求,读取第一样本人脸图像返回给电子设备。
获取方式三、电子设备访问本地磁盘,读取磁盘中预先存储的第一样本人脸图像。
应理解,上述获取方式一至获取方式三仅是示例性说明,并不代表是第一样本人脸图像获取功能的必选实现方式。在另一些实施例中,也可以采用其他实现方式来实现获取第一样本人脸图像的功能,而这些实现第一样本人脸图像获取功能的其他方式作为步骤202的一种具体情况,也应涵盖在本申请实施例的保护范围之内。
步骤203、电子设备根据样本音频信号集中的第二样本音频信号,对样本人脸图像集中的第四样本人脸图像进行处理,得到第二样本人脸图像。
第四样本人脸图像为未标注的样本人脸图像,第四样本人脸图像可以是样本人脸图像集中的任一个样本人脸图像。第二样本音频信号和第四样本人脸图像具有对应关系,第四样本人脸图像的拍摄时间点及拍摄对象和对应的第二样本音频信号的采集时间点及采集对象均相同。
第二样本人脸图像为已标注的样本人脸图像,第二样本人脸图像可以由第三样本人脸图像添加标签后得到。由于第四样本人脸图像的拍摄时间点及拍摄对象和第二样本人脸图像的拍摄时间点及拍摄对象均相同,则第二样本人脸图像的拍摄时间点及拍摄对象和第二样本音频信号的采集时间点及采集对象也均相同。第二样本人脸图像的内容为样本用户的人脸,第二样本人脸图像包含样本用户具有继续说话的意图的特征,第二样本人脸图像可以由摄像头对样本用户进行拍摄得到。第二样本人脸图像的数量可以是多个,不同第二样本人脸图像对应的样本用户可以相同或者不同。此外,第二样本人脸图像对应的样本用户和第一样本人脸图像对应的样本用户可以相同或者不同。第二样本人脸图像标注了第二标签。
第二标签表示样本用户不具有继续说话的意图。第二标签可以是任意数据格式,例如数字、字母、字符串等。例如,第二标签可以是“Neutral”(中立)。
获取第二样本人脸图像的方式可以包括多种,以下通过获取方式一至获取方式三举例说明。
获取方式一、电子设备获取样本人脸图像集。对于样本人脸图像集中的每个第四样本人脸图像,电子设备可以确定该第四样本人脸图像对应的第二样本音频信号,判断第二样本音 频信号是否满足第二条件,若第二样本音频信号满足第二条件,则对第四样本人脸图像添加第二标签,得到第二样本人脸图像,该第二样本人脸图像包含第四样本人脸图像以及第二标签。通过该流程可见,在获取方式一中,第二标签是根据第二样本音频信号确定的。
第二条件用于判断对应的第二样本音频信号是否包含不继续说话的意图,若第二样本音频信号满足第二条件,可以将对应的第四样本人脸图像标注为第二样本人脸图像。第二条件可以根据实验、经验或需求设置,例如,第二条件包括以下第二条件(1)至第二条件(2)中的至少一项:
第二条件(1)第二样本音频信号对应的VAD结果从说话状态更新为沉默状态。
第二条件(2)第二样本音频信号的尾部静音时长大于第一阈值。
具体而言,电子设备可以检测第二样本音频信号的尾部静音时长,对尾部静音时长与第一阈值进行比较,若尾部静音时长大于第一阈值,由于尾部静音时长已经大于阈值的最大值,表明第二样本音频信号不是停顿而是结束,则可以确定第二样本音频信号满足第二条件(2)。
示例性地,参见图4,如果样本音频信号满足图4所示的第二条件,可以向样本人脸图像添加标签“Neutral”(中立),以标明样本人脸图像对应于没有继续说话的用户意图。
通过利用上述第一第二条件和第二条件为样本人脸图像添加对应的标签,达到的效果至少可以包括:能够融合拍摄的人脸图像、采集的音频信号、文本信息的语义等多个模态的信息,从而结合全局信息来对训练数据进行自动标注,由于综合考虑了各个模态的信息,可以保证样本人脸图像的标签与是否继续说话的用户意图相匹配,那么由于标注得到的样本的准确性高,模型根据准确的样本进行训练后,预测用户意图的准确性也会较高,因此有助于在模型应用阶段,准确地检测出语音结束点。
获取方式二、电子设备向数据库发送获取请求,该获取请求用于请求获取第二样本人脸图像,数据库响应于获取请求,读取第二样本人脸图像,返回给电子设备。
获取方式三、电子设备访问本地磁盘,读取磁盘中预先存储的第二样本人脸图像。
应理解,上述获取方式一至获取方式三仅是示例性说明,并不代表是第二样本人脸图像获取功能的必选实现方式。在另一些实施例中,也可以采用其他实现方式来实现获取第二样本人脸图像的功能,而这些实现第二样本人脸图像获取功能的其他方式作为步骤203的一种具体情况,也应涵盖在本申请实施例的保护范围之内。
应理解,本实施例对步骤202与步骤203的时序不做限定。在一些实施例中,步骤202与步骤203可以顺序执行。例如,可以先执行步骤202,再执行步骤203;也可以先执行步骤203,再执行步骤202。在另一些实施例中,步骤202与步骤203也可以并行执行,即,可以同时执行步骤202以及步骤203。
步骤204、电子设备使用第一样本人脸图像以及第二样本人脸图像进行模型训练,得到预测模型。
预测模型用于预测用户是否具有继续说话的意图。预测模型可以是一个二分类器,预测模型的预测结果可以包括第一取值和第二取值。预测结果的第一取值表示用户具有继续说话的意图。预测结果的第二取值表示用户不具有继续说话的意图。第一取值和第二取值可以是任意两个不同的数据。例如,预测结果的第一取值可以是1,预测结果的第二取值可以是0;或者,预测结果的第一取值可以是0,预测结果的第二取值可以是1。示例性地,将人脸图像输入预测模型之后,若预测模型预测该人脸图像表示用户具有继续说话的意图,预测模型可 以输出1;若预测模型根据输入的人脸图像预测该人脸图像表示用户不具有继续说话的意图时,预测模型可以输出0。
预测模型可以是人工智能(artificial intelligence,AI)模型。预测模型的具体类型可以包括多种。例如,预测模型可以包括神经网络、支持向量机、线性回归模型、逻辑回归模型、决策树或者随机森林中的至少一种。例如,预测模型可以是神经网络。具体地,预测模型可以是卷积神经网络或者循环神经网络等。
采用神经网络来实现预测模型时,预测模型中的每个模块可以是一个层,或者,每个模块可以是多个层组成的网络。每个层可以包括一个或多个节点。例如,参见图5,预测模型包括输入层、第一隐藏层、动作识别层、第二隐藏层和输出层。此外,预测模型可以包括关键点提取模块(图5未示出)。
预测模型中不同模块可以连接,这里的连接是指可进行数据交互。如图5所示,输入层可以和第一隐藏层相连,第一隐藏层和动作识别层相连,动作识别层和第二隐藏层相连,第二隐藏层和输出层相连。此外,关键点提取模块可以和输入层相连。应理解,虽然图5未示出,预测模型中不同模块之间还可以具有其他连接关系。例如,不同层之间可以跨层连接。
其中,关键点提取模块用于从人脸图像提取关键点的特征,将关键点的特征输入至输入层。输入层用于将关键点的特征输出至第一隐藏层。第一隐藏层用于对关键点的特征进行线性映射以及非线性映射,得到映射后的关键点的特征,将映射后的关键点的特征输出至动作识别层。动作识别层用于对映射后的关键点的特征进行识别,得到动作特征,将动作特征输出至第二隐藏层。第二隐藏层用于对动作特征进行线性映射以及非线性映射,得到映射后的动作特征,将映射后的动作特征输出至输出层。输出层用于对映射后的动作特征进行分类,得到不同类别分别对应的置信度;根据置信度确定预测结果。
在一些实施例中,输入层可以包括多个节点,输入层的每个节点用于接收一个关键点的特征。例如,参见图5,输入层可以包括FP1、FP2、FP3……FPn,FP1用于接收关键点1的特征,发送至隐藏层;FP2用于接收关键点2的特征,发送至隐藏层;FP3用于接收关键点3的特征,发送至隐藏层,依次类推,FPn用于接收关键点n的特征,发送至隐藏层。
在一些实施例中,动作识别层可以包括多个第一分类器以及第二分类器。每个第一分类器可以从第一隐藏层接收映射后的关键点的特征,进行动作识别后,得到人脸图像包含动作的概率。第二分类器可以从第一隐藏层接收映射后的关键点的特征,进行动作识别后,得到人脸图像不包含动作的概率。若第一分类器的输出结果大于第二分类器的输出结果,则该第一分类器的输出结果可以发送至第二隐藏层。
例如,参见图5,动作识别层可以包括N个第一分类器,这N个第一分类器中的每个第一分类器可以称为一个动作单元(Action Unit,AU),N个第一分类器分别记为AU1、AU2、AU3……AUn。通过这N个动作单元,可以识别人脸关键肌肉点的变化,从而利用肌肉点的变化,识别出面部微表情以及用户的精神状态;将识别出的特征经过隐藏层的非线性变换后,可以预测出用户未来是否具有继续说话的意图。
模型训练的过程可以包括多种实现方式。在一些实施例中,模型训练可以包括多次迭代的过程。每次迭代的过程可以包括以下步骤(1.1)至步骤(1.3):
步骤(1.1)电子设备将第一样本图像输入预测模型,通过预测模型对第一样本图像进行处理,输出预测结果。
步骤(1.2)电子设备根据该预测结果与第一标签,通过损失函数计算第一损失值,第一损失值表示预测结果与第一标签之间的偏差,预测结果与第一标签之间的偏差越大,则第一损失值越大。
步骤(1.3)电子设备根据第一损失值调整预测模型的参数。
或者,每次迭代的过程包括以下步骤(2.1)至步骤(2.3)。
步骤(2.1)电子设备将第二样本图像输入预测模型,通过预测模型对第二样本图像进行处理,输出预测结果。
步骤(2.2)电子设备根据该预测结果与第二标签,通过损失函数计算第二损失值,该第二损失值表示预测结果与第二标签之间的偏差,预测结果与第二标签之间的偏差越大,则第二损失值越大。
步骤(2.3)电子设备根据第二损失值调整预测模型的参数。
以上示出了训练的一次迭代过程,每当迭代一次后,电子设备可以检测当前是否已经满足训练终止条件,当不满足训练终止条件时,电子设备执行下一次迭代过程;当满足训练终止条件时,电子设备将本次迭代过程所采用的预测模型输出为训练完成的预测模型。
其中,该训练终止条件可以为迭代次数达到目标次数或者损失函数满足预设条件,还可以为基于验证数据集验证时,其能力在一段时间内没有提升。其中,该目标次数可以是预先设置的迭代次数,用以确定训练结束的时机,避免对训练资源的浪费。该预设条件可以是训练过程中损失函数值在一段时间内不变或者不下降,此时说明训练过程已经达到了训练的效果,即预测模型具有了根据人脸图像识别用户是否具有继续说话意图的功能。
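One training iteration as described in steps (1.1)–(1.3) and (2.1)–(2.3) can be sketched as below. PyTorch, the small fully connected network, and the binary cross-entropy loss are used purely for illustration and are not the exact configuration of the embodiment:

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the prediction model of FIG. 5
model = nn.Sequential(nn.Linear(136, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(keypoint_features, label):
    """One iteration: forward pass, loss against the label
    (1 = intends to continue speaking, 0 = does not), parameter update."""
    optimizer.zero_grad()
    prediction = model(keypoint_features)   # step (x.1): output prediction result
    loss = loss_fn(prediction, label)       # step (x.2): loss against the label
    loss.backward()                         # step (x.3): adjust model parameters
    optimizer.step()
    return loss.item()

# One fake batch: 8 samples of 68 facial keypoints (x, y) flattened to 136 dims
features = torch.randn(8, 136)
labels = torch.randint(0, 2, (8, 1)).float()
print(train_step(features, labels))
```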
在一些实施例中,预测模型的训练过程可以包括第一训练阶段以及第二训练阶段,第一训练阶段用于对第一分类器以及第二分类器进行训练,第二训练阶段用于对第三分类器进行训练。其中,第一分类器、第二分类器或者第三分类器均可以是预测模型的一部分。例如,第一分类器可以是图5中动作识别层中的一部分,可以包括一个或多个AU。第二分类器也可以是图5中动作识别层中的一部分,可以包括一个或多个AU。第三分类器可以是图5中输出层的判决器。可以预先使用第五样本人脸图像以及第六样本人脸图像进行模型训练,得到第一分类器和第二分类器,使用第一分类器和第二分类器,通过上述第三条件对样本人脸图像进行标注。对第一分类器、第二分类器以及待训练的第三分类器进行组合,得到待训练的预测模型,该待训练的预测模型包括第一分类器以及未训练的第三分类器。再通过执行本实施例,使用已标注的第一样本人脸图像和第二样本人脸图像进行训练,使得第三分类器的模型参数得到调整,能够学习到判决是否具有说话意图的能力,最终得到预测模型。
本实施例提供了实现用户意图预测功能的模型训练方法,利用包含继续说话的用户意图的样本人脸图像以及包含不继续说话的用户意图的样本人脸图像,进行模型训练,预测模型可以通过训练的过程,从包含继续说话的用户意图的样本人脸图像和对应的标签中,学习出用户意图为继续说话时,人脸图像的特征会是怎么样的,从包含不继续说话的用户意图的样本人脸图像和对应的标签中,学习出用户意图为不继续说话时,人脸图像的特征又会是怎么样的,那么预测模型由于学习到用户意图与人脸图像特征之间的对应关系,在模型应用阶段,即可通过模型来根据一幅未知的人脸图像,预测出当前用户是否具有继续说话的意图,从而利用人脸图像表示的用户意图,准确地检测出当前的语音信号是否为语音结束点。
上述方法实施例介绍了预测模型的训练流程,以下通过图6实施例,对应用图2实施例提供的预测模型进行语音端点检测的流程进行介绍。
参见图6,图6是本申请实施例提供的一种语音检测方法的流程图。该方法应用于电子设备。该电子设备可以为图1所示系统架构中的终端,也可以是图1所示系统架构中的语音检测平台,比如是服务器201。执行图6实施例的电子设备和执行图2实施例的电子设备可以是同一个电子设备,也可以是不同的电子设备。如果执行图6实施例的电子设备和执行图2实施例的电子设备不同,两个方法实施例中的电子设备可以进行交互,协同完成语音检测的任务。比如说,预测模型的训练步骤可以由服务器执行,利用预测模型进行检测的步骤可以由终端执行。当然,预测模型的训练步骤和检测步骤也可以均在终端侧执行,或者均在服务器侧执行。具体而言,该方法包括以下步骤:
步骤601、电子设备获取音频信号以及人脸图像。
人脸图像的拍摄时间点和音频信号的采集时间点相同。通过获取同一时间点对应的音频信号以及人脸图像,人脸图像表示的用户意图和音频信号表示的用户意图会相同,从而可以借助人脸图像包含的信息,来准确地检测音频信号是否为语音结束点。
例如,在X时Y刻,电子设备可以通过摄像头采集音频信号,并通过摄像头拍摄人脸图像。音频信号可以表示用户在X时Y刻是否具有继续说话的意图,人脸图像也可以表示用户在X时Y刻是否具有继续说话的意图。
当然,由电子设备本端来采集音频信号以及拍摄人脸图像仅是举例说明,在另一些实施例中,电子设备也可以从终端接收语音检测指令,该语音检测指令携带了音频信号以及人脸图像,电子设备可以响应于语音检测指令,根据音频信号以及人脸图像来执行下述方法流程,将语音检测的结果返回至终端。
步骤601的触发条件可以包括多种情况。举例来说,本实施例可以应用在语音交互的场景中,如果终端检测到包含唤醒词的音频信号,可以从待机状态切换为工作状态,也即是,终端被唤醒,终端的唤醒事件可以触发步骤601的执行。
步骤602、电子设备对音频信号进行语音识别,得到音频信号对应的第三文本信息,检测音频信号的尾部静音时长。
为了与模型训练阶段使用的文本信息区分描述,本实施例中,将步骤601中获取的音频信号对应的文本信息记为第三文本信息。具体而言,可以对步骤601中获取的音频信号进行ASR,得到第三文本信息。例如,第三文本信息可以是“打电话给张老师”、“导航到世纪大道”等。此外,还可以对步骤601中获取的音频信号进行VAD,得到尾部静音时长。
应理解,本实施例对语音识别的步骤和尾部静音时长的检测步骤的时序不做限定。在一些实施例中,在执行步骤602的过程中,语音识别的步骤与尾部静音时长的检测步骤可以顺序执行。例如,可以先执行语音识别的步骤,再执行尾部静音时长的检测步骤;也可以先执行尾部静音时长的检测步骤,再执行语音识别的步骤。在另一些实施例中,语音识别的步骤与尾部静音时长的检测步骤也可以并行执行,即,可以同时执行语音识别的步骤以及尾部静音时长的检测步骤。
步骤603、电子设备对尾部静音时长与对应的阈值进行比较。
在将尾部静音时长与阈值进行比较的过程中,可以使用第三阈值,该第三阈值可以是图2实施例提及的第一阈值,也可以是图2实施例提及的第二阈值,或者可以是第一阈值和第 二阈值的组合,或者可以是第一阈值和第二阈值之外的其他阈值。在一些实施例中,使用阈值进行比较的过程具体可以包括以下步骤:
步骤(1)电子设备可以对尾部静音时长与第一阈值进行比较,若尾部静音时长小于第一阈值,则执行步骤(2)。此外,若尾部静音时长大于或等于第一阈值,则电子设备确定语音信号为语音结束点。
步骤(2)电子设备可以对尾部静音时长与第三阈值进行比较,若尾部静音时长大于第三阈值,则执行步骤604。若尾部静音时长小于或等于第三阈值,则电子设备继续获取下一个音频信号以及下一个音频信号对应的人脸图像,继续对下一个音频信号执行步骤601至步骤603。其中,步骤(2)中使用的第三阈值可以小于步骤(1)中使用的第一阈值,此外,步骤(2)中使用的第三阈值和上文中的第二阈值数值相等,也即是,推理侧使用的静音检测阈值和训练侧使用的静音检测阈值可以相同。
通过上述比较方式,可以在尾部静音时长处于第三阈值和第一阈值之间时,执行下述语音检测的流程。这种方式达到的效果至少可以包括:一旦静音时长大于最小的阈值(第三阈值),就结合文本模态以及图像模态,利用句法分析的结果以及面部分析的结果来检测语音结束点,从而通过多模态信息的融合,尽可能快又准地检测到语音端点,避免延时过长的情况。而当静音时长大于最大的阈值(第一阈值),由于静默时间过长,可以免去句法分析的流程以及面部分析的流程,直接确定已经检测到语音结束点。
步骤604、若尾部静音时长大于第三阈值,电子设备对第三文本信息进行句法分析,得到第一分析结果。
第一分析结果用于表示第三文本信息是否为完整语句。第一分析结果可以包括第一取值和第二取值。第一分析结果的第一取值表示第三文本信息是完整语句。第一分析结果的第二取值表示第三文本信息不是完整语句,而是一个待补充的语句。第一分析结果的第一取值和第二取值可以是任意两个不同的数据。例如,第一分析结果的第一取值是1,第一分析结果的第二取值是0;或者,第一分析结果的第一取值是0,第一分析结果的第二取值是1。第三文本信息可以视为一个词汇序列,第一分析结果可以是该词汇序列的序列预测结果。
句法分析的实现方式可以包括多种。在一些实施例中,句法分析包括以下步骤一至步骤五:
步骤一、电子设备对第三文本信息进行分词,得到多个词汇。
分词的方式可以包括多种。举例来说,可以每隔一个字符分割一次,则得到的每个词汇为一个字。例如,参见图7,第三文本信息为“打电话给张老师”,对“打电话给张老师”进行分词后,得到多个词汇分别是“打”、“电”、“话”、“给”、“张”、“老”和“师”。又如,参见图8,第三文本信息为“导航到金海路金穗路”,对“导航到金海路金穗路”进行分词后,得到多个词汇分别是“导”、“航”、“到”、“金”、“海”、“路”、“金”、“穗”和“路”。
步骤二、对于多个词汇中的每个词汇,电子设备对词汇进行句法分析,得到词汇对应的第二分析结果。
第二分析结果用于表示词汇与词汇之前的词汇是否组成了完整语句。例如,第二分析结果可以包括第一取值和第二取值。第二分析结果的第一取值表示词汇与词汇之前的词汇组成了完整语句。第二分析结果的第二取值表示词汇与词汇之前的词汇没有组成了完整语句。第二分析结果的第一取值和第二取值可以是任意两个不同的数据。例如,第二分析结果的第一 取值是1,第二分析结果的第二取值是0;或者,第二分析结果的第一取值是0,第二分析结果的第二取值是1。
例如,参见图7,以第一取值为1,第二取值为0为例,如果分词后得到的多个词汇是多个词汇分别是“打”、“电”、“话”、“给”、“张”、“老”、“师”,句法分析后可以得出,“打”对应的第二分析结果为0,“电”对应的第二分析结果为0,“话”对应的第二分析结果为0,“给”对应的第二分析结果为0,“张”对应的第二分析结果为0,“老”对应的第二分析结果为0,“师”对应的第二分析结果为1。又如,参见图8,如果分词后得到的多个词汇是多个词汇分别是“导”、“航”、“到”、“金”、“海”、“路”、“金”、“穗”和“路”,句法分析后可以得出,“导”对应的第二分析结果为0,“航”对应的第二分析结果为0,“到”对应的第二分析结果为0,“金”对应的第二分析结果为0,“海”对应的第二分析结果为0,“路”(此处是指金海路中的路)对应的第二分析结果为1,“金”对应的第二分析结果为0,“穗”对应的第二分析结果为0,“路”(此处是指金穗路中的路)对应的第二分析结果为1。
在一些实施例中,可以采用流式检测的方式进行句法分析。流式检测的具体过程可以包括:电子设备可从第三文本信息中的第一个词汇开始,遍历每个词汇,对当前遍历的词汇与之前的每个词汇进行文本分析,输出当前遍历的词汇对应的第二分析结果。其中,若当前遍历的词汇对应的第二分析结果表示没有组成完整语句,则继续遍历下一个词汇,直至遍历到最后一个词汇为止,或者,直到遍历到的词汇的第二分析结果表示组成完整语句为止;若当前遍历的词汇对应的第二分析结果表示组成完整语句,电子设备可以确定第三文本信息为完整语句,停止继续遍历。
例如,参见图7,第三文本信息为“打”、“电”、“话”、“给”、“张”、“老”、“师”。在流式检测的过程中,当识别“打”时,预测“打”句法不完整,输出0;当识别“电”时,预测“打电”句法不完整,输出0;当识别“话”时,预测“打电话”句法不完整,输出0;当识别“给”时,预测“打电话给”句法不完整,输出0;当识别“张”时,预测“打电话给张”句法不完整,输出0;当识别“老”时,预测“打电话给张老”句法不完整,输出0;当识别“师”时,预测“打电话给张老师”句法完整,输出1。
又如,参见图8,第三文本信息为“导”、“航”、“到”、“金”、“海”、“路”、“金”、“穗”和“路”。在流式检测的过程中,当识别“导”时,预测“导”句法不完整,输出0;当识别“航”时,预测“导航”句法不完整,输出0;当识别“到”时,预测“导航到”句法不完整,输出0;当识别“金”时,预测“导航到金”句法不完整,输出0;当识别“海”时,预测“导航到金海”句法不完整,输出0;当识别“路”时,预测“导航到金海路”句法完整,输出1;当识别“金”时,预测“导航到金海路金”句法不完整,输出0;当识别“穗”时,预测“导航到金海路金穗”句法不完整,输出0;当识别“路”时,预测“导航到金海路金穗路”句法完整,输出1。
步骤三、对于多个词汇中的每个词汇,电子设备判断该词汇对应的第二分词结果是否表示组成了完整语句,若多个词汇中任一词汇对应的第二分析结果表示组成了完整语句,执行下述步骤四,若多个词汇中每个词汇对应的第二分析结果均表示没有组成完整语句,执行下述步骤五。
步骤四、电子设备确定第三文本信息为完整语句。
步骤五、电子设备确定第三文本信息不为完整语句。
通过执行上述步骤一至步骤五来进行句法分析,达到的效果至少可以包括:不仅综合考 虑了每个词汇与之前词汇之间在句法上的联系,而且利用了N—Best(N条最优)算法,每当检测到一个词汇,则判断该词汇是否已经和之前的词汇组成了完整语句,一旦当前的词汇表示已经组成完整语句时,即可确定已分析的文本信息为完整语句,执行下一步的检测流程。那么,可以在当前音频信号具有是语音结束点的概率时,及时检测出来,从而保证语音结束点检测的实时性,避免语音结束点检测过晚。
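The streaming N-best completeness check described in steps 一 to 五 above can be sketched as follows; `is_complete(prefix)` is a placeholder for the syntactic-analysis classifier that produces the second analysis result for each newly recognized token:

```python
def streaming_syntax_check(tokens, is_complete):
    """Traverse the tokens of the recognized text one by one; as soon as the
    prefix up to the current token forms a complete sentence (second analysis
    result = 1), report the text as complete."""
    prefix = ""
    for token in tokens:
        prefix += token
        if is_complete(prefix):   # e.g. "打电话给张老师" -> True
            return True, prefix
    return False, prefix

# Toy completeness function for the example in FIG. 7
complete_sentences = {"打电话给张老师"}
print(streaming_syntax_check(list("打电话给张老师"), lambda s: s in complete_sentences))
```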
步骤605、电子设备判断第一分析结果是否表示为第三文本信息为完整语句。
若第一分析结果表示为第三文本信息不为完整语句,则电子设备可以确定音频信号不为语音结束点。若第一分析结果表示为第三文本信息为完整语句,则电子设备可以确定音频信号具有是语音结束点的概率,则执行步骤606来进行人脸识别。
例如,参见图8,当识别“导”时,输出0,此时确定未检测到完整语句,继续遍历下一个词汇;当识别“航”时输出0,此时确定未检测到完整语句,继续遍历下一个词汇;当识别“到”时,输出0,此时确定未检测到完整语句,继续遍历下一个词汇;当识别“金”时,输出0,此时确定未检测到完整语句,继续遍历下一个词汇;当识别“海”时,输出0,此时确定未检测到完整语句,继续遍历下一个词汇;当识别“路”时,输出1,此时执行下述步骤606和步骤607;而通过人脸识别的步骤,预测模型输出的预测结果为0,表示用户具有继续说话的意图,则继续遍历下一个词汇;当识别“金”时,输出0,此时确定未检测到完整语句,继续遍历下一个词汇;当识别“穗”时,输出0,此时确定未检测到完整语句,继续遍历下一个词汇;当识别“路”时,输出1,此时执行步骤606步骤607;而通过人脸识别的步骤,预测模型输出的预测结果为1,表示用户不具有继续说话的意图,确定检测到了语音结束点。
从上述描述可以看出,通过实施本实施例提供的句法分析方法来检测语音结束点,至少可以达到以下效果:通过图8的例子可以看出,当前词汇与之前的词汇组成的语句的句法完整,并不能成为当前词汇是语音结束点的唯一依据。例如,用户的真实意图是“导航到金海路金穗路”,虽然“导航到金海路”这个分句的句法是完整的,但实际上,“金海路”中的“路”并不是真实的语音结束点,“金穗路”中的“路”才是真实的语音结束点。“导航到金海路金穗路”是一条完整的语音指令,但如果实施相关技术提供的方法,单纯依赖声学信息,就可能在检测到“导航到金海路”时,就将“金海路”中的“路”误判为语音结束点,导致语音指令被分割为“导航到金海路”和“金穗路”,造成曲解了用户意图,致使导航到错误的位置。而通过本实施例提供的方法,可以在检测到音频信号已经句法完整的条件下,触发应用预测模型来进行人脸识别的流程,从而利用预测结果,进一步判断音频信号是否确实到达了语音结束点,从而通过融合视觉模态的特征,避免句法分析误判的情况,极大地提高语音结束点检测的准确性,降低语音指令被过早截断的概率。此外,上述句法分析的方法简单易行,实用性高。
步骤606、若第一分析结果表示为第三文本信息为完整语句,电子设备将人脸图像输入预测模型。
步骤607、电子设备通过预测模型对人脸图像进行处理,输出预测结果。
由于在模型训练阶段,预测模型利用样本以及标签,学习到了人脸图像与用户意图之间的映射关系,那么在步骤607中,预测模型即可基于学习出的映射关系,对人脸图像进行识别,确定该人脸图像对应的用户意图,从而预测出用户是否具有继续说话的意图。
在一些实施例中,预测模型进行处理的具体过程可以包括以下步骤一至步骤四:
步骤一、提取人脸图像包含的关键点。
步骤二、对关键点进行处理,得到人脸图像的动作特征。
从人脸图像中挖掘出动作特征的具体过程可以包括多种实现方式。例如,可以通过执行以下(1)至(4)来获取动作特征。
(1)将人脸图像输入预测模型中的关键点提取模块,通过关键点提取模块从人脸图像中提取关键点的特征。
关键点的特征可以是任意数据形式,包括而不限于一维的向量、二维的特征图或者三维的特征立方体。人脸图像中关键点的数量可以是多个,在执行步骤(1)时,可以提取出多个关键点中每个关键点的特征。
(2)可以将关键点的特征输入至输入层,通过输入层将关键点的特征发送至第一隐藏层。
参见图5,可以将关键点1的特征、关键点2的特征、关键点2的特征、关键点3的特征、关键点n的特征输入至输入层,输入层的节点FP1接收关键点1的特征,发送至隐藏层;节点FP2接收关键点2的特征,发送至隐藏层;节点FP3接收关键点3的特征,发送至隐藏层,依次类推,节点FPn接收关键点n的特征,发送至隐藏层。
(3)通过第一隐藏层对关键点的特征进行线性映射以及非线性映射,得到映射后的关键点的特征,将映射后的关键点的特征发送至动作识别层。
(4)通过动作识别层对映射后的关键点的特征进行识别,得到动作特征。
例如,参见图5,动作识别层可以包括N个动作单元,分别记为AU 1、AU 2、AU 3……AU n。动作单元AU 1对映射后的关键点的特征进行识别后,输出PAU 1,若动作单元AU 1的输出结果大于NEU的输出结果,即PAU 1大于PNEU,则PAU 1的输出结果有效;动作单元AU 2对映射后的关键点的特征进行识别后,输出PAU 2,若动作单元AU 2的输出结果大于NEU的输出结果,即PAU 2大于PNEU,则PAU 2的输出结果有效;以此类推。动作单元NEU对映射后的关键点的特征进行识别后,输出PNEU,可以利用PNEU与其他动作单元的输出结果进行比较,对有效的动作单元的输出结果求和,得到的和值为动作特征。
动作识别层的每个动作单元可以对应于人脸中的一个关键肌肉点,每个动作单元能够在对应的关键肌肉点发生变化时识别出来。比如说,AU1可以识别抬起上嘴唇和人中区域的肌肉。AU2可以识别颔部下降,AU3可以识别嘴角拉伸,AU4可以识别眉毛压低并聚拢,AU5可以识别嘴角拉动向下倾斜,AU6可以识别抬起眉毛外角。AU的识别结果通过输出的概率的大小来指明,比如,PAU 1越大,表示人脸抬起了上嘴唇和人中区域的肌肉的概率越高。而用户面部微表情不同时,动作识别层中各个AU输出的概率也各不相同。比如说,如果用户当前的面部表情是喜悦,则由于喜悦时人脸通常扬起嘴角,则PAU 1会越大,因此可以通过PAU 1来识别出来。
步骤三、对动作特征进行分类,得到不同类别分别对应的置信度。
步骤四、根据置信度确定预测结果。
例如,可以对动作特征进行分类,得到第一类别的置信度以及第二类别的置信度,第一类别为用户具有继续说话的意图,第二类别为用户不具有继续说话的意图。可以对第一类别的置信度与第二类别的置信度进行比较,若第一类别的置信度大于第二类别的置信度,将用户具有继续说话的意图输出为预测结果;或,若第一类别的置信度不大于第二类别的置信度,将用户不具有继续说话的意图输出为预测结果。
例如,参见图5,可以将动作特征输入第二隐藏层,通过第二隐藏层对动作特征进行非 线性映射以及线性映射,得到映射后的动作特征。通过输出层对映射后的动作特征进行分类,得到的类别可以是预测模型的预测结果。如果类别是有继续说话意图,则表明当前音频信号还没有来到语音结束点。如果类别是没有继续说话意图,则将当前识别的音频信号作为语音结束点。
预测模型通过采用上述步骤一至步骤四来进行预测,达到的效果至少可以包括:
若一段语音中包含停顿,对语音进行句法分析时,无法区分一个音频信号是停顿还是语音结束点。而通过上述方法,融合了人脸的关键点的特征以及动作特征,能够基于人脸当前进行的动作,精确地识别出面部包含的微表情,从而根据表情推理出用户的精神状态,进而预测出用户是否具有继续说话的意图。这种方法借助于视觉信息来进行辅助判断,从而解决了句法分析无法解决的问题,能够减少语音的过早截断。
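The decision in steps 三 and 四 above reduces to comparing two class confidences. A minimal sketch, in which the keypoint-extraction, action-recognition, and classification functions are placeholders for the corresponding modules of the prediction model:

```python
def predict_intention(face_image, extract_keypoints, action_features, classify):
    """Return True if the user is predicted to intend to continue speaking.

    `extract_keypoints`, `action_features` and `classify` stand in for the
    keypoint-extraction module, the action-recognition layer and the output
    layer of the prediction model, respectively."""
    keypoints = extract_keypoints(face_image)        # step 一
    features = action_features(keypoints)            # step 二
    conf_continue, conf_stop = classify(features)    # step 三
    return conf_continue > conf_stop                 # step 四
```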
步骤608、若预测结果表示用户不具有继续说话的意图,电子设备确定音频信号为语音结束点。
电子设备确定音频信号为语音结束点时,可以执行语音结束对应的任意业务处理功能。例如,可以将语音检测结果返回给用户,或者将语音检测结果输出至后续模块。比如说,电子设备可以从音频中截取语音起始点和语音结束点之间的部分,解析出语音指令,响应于语音指令进行业务处理。
在一些实施例中,语音检测的方法流程可以如图9所示,包括以下5个步骤:
步骤1:对音频信号进行语音识别(ASR),获取流式的N-best结果和尾部静音时长。
步骤2:对尾部静音时长与最大静音时长阈值Dmax进行比较,若尾部静音时长大于Dmax,则进入到步骤5,否则进入步骤3;
步骤3:对尾部静音时长与最小静音时长阈值Dmin进行比较,若尾部静音时长小于Dmin,则进入到步骤1,否则进入步骤4;
步骤4:分析语音识别的N-best结果以及人脸面部动作单元和关键点,对音频信号进行分类,若满足语音结束点对应的条件,则进入步骤5,否则进入步骤1;
步骤5:检测到语音结束点。
可选地,还可以考虑行车状况,在执行步骤4的过程中,利用行车状况进行综合判断,具体参见下述图11实施例。
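Putting the five steps of FIG. 9 together, the overall decision for one audio signal can be sketched as follows; the helper functions and the D_MIN/D_MAX values are placeholders for the ASR, VAD, syntactic-analysis, and face-prediction components described above:

```python
D_MIN_MS = 300   # assumed minimum trailing-silence threshold (Dmin)
D_MAX_MS = 800   # assumed maximum trailing-silence threshold (Dmax)

def end_point_detected(audio, face_image, asr_nbest, trailing_silence_ms,
                       syntax_complete, face_predicts_stop):
    """One pass of the five-step flow in FIG. 9 for the current audio signal."""
    text = asr_nbest(audio)                     # step 1: streaming N-best result
    silence = trailing_silence_ms(audio)        # step 1: trailing silence
    if silence > D_MAX_MS:                      # step 2: silence long enough
        return True
    if silence <= D_MIN_MS:                     # step 3: keep listening
        return False
    # step 4: fuse the text and visual modalities before deciding (step 5)
    return syntax_complete(text) and face_predicts_stop(face_image)
```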
本实施例提供了一种多模态的语音结束点检测方法,通过模型对拍摄的人脸图像进行识别,从而预测出用户是否具有继续说话的意图,结合预测结果,来判决采集到的音频信号是否为语音结束点,由于在声学特征的基础上,还融合了人脸图像这种视觉模态的特征来进行检测,即使在背景噪声很强或者用户说话期间停顿的情况下,也能够利用人脸图像来准确判决语音信号是否为语音结束点,因此避免了背景噪声以及说话停顿造成的干扰,从而避免了背景噪声以及说话停顿的干扰会引发的过晚或者过早检测出语音交互处于结束状态的问题,提高了检测语音结束点的准确性。此外,由于解决了语音交互时语音结束点检测不准确的问题,避免了语音结束点过晚检测会引发的响应时延过长的问题,从而缩短了语音交互的时延,提高了语音交互的流畅性,避免了语音结束点过早检测会引发的语音指令被过早截断的问题,从而避免用户意图理解有误的情况,提高了语音交互的准确性。
上述方法实施例提供的预测模型可以应用在任意需要检测语音检测的场景下,以下通过 一个示例性应用场景举例说明。
参见图10,图10是本申请实施例提供的一种车载场景下语音检测方法的流程图。该方法的交互主体包括车载终端和服务器,包括以下步骤:
步骤1001、服务器获取样本音频信号集以及待标注的样本人脸图像集。
步骤1002、服务器根据样本音频信号集中的第一样本音频信号,对样本人脸图像集中的第三样本人脸图像进行处理,得到第一样本人脸图像。
步骤1003、服务器根据样本音频信号集中的第二样本音频信号,对样本人脸图像集中的第四样本人脸图像进行处理,得到第二样本人脸图像。
步骤1004、服务器使用第一样本人脸图像以及第二样本人脸图像进行模型训练,得到预测模型。
步骤1005、服务器向车载终端发送预测模型。
步骤1006、车载终端接收预测模型,对预测模型进行存储。
步骤1007、车载终端获取音频信号以及人脸图像。
步骤1008、车载终端对音频信号进行语音识别,得到音频信号对应的第三文本信息,检测音频信号的尾部静音时长。
步骤1009、车载终端对尾部静音时长与对应的阈值进行比较。
步骤1010、若尾部静音时长大于第三阈值,车载终端对第三文本信息进行句法分析,得到第一分析结果。
本实施例中,可以考虑车辆的行车状况,综合检测语音结束点。在一些实施例中,车载终端可以采集行车状况信息,根据行车状况信息,对尾部静音时长对应的阈值进行调整,例如对第三阈值进行调整。其中,行车状况信息表示搭载车载终端的车辆的行车状况。车载终端可以配置有传感器,可以通过传感器采集得到行车状况信息。
通过上述方式,达到的效果至少可以包括:可以融合语音检测的具体应用场景来进行端点检测,例如应用在车载场景下,可以利用驾驶过程中的行车状况,来调整尾部静音时长的阈值,使得阈值可以根据当前的行车状况自适应调整,提升语音端点检测的鲁棒性。
行车状况信息的具体含义可以包括至少一种,以下通过方式一至方式二举例说明。
方式一、若行车状况信息表示发生了急转弯,对第三阈值进行调整,调整后的第三阈值大于调整前的第三阈值。
如果车辆发生急转弯,用户的语音很可能由于发生急转弯而产生中断,导致语音结束点出现的概率变大,语音的中断时长也会相应变长,此时,通过提高尾部静音时长的阈值,能够让调整后的阈值适应于急转弯的情况。其中,车载终端可以配置加速度计传感器,可以通过加速度计传感器,采集到急转弯的情况。
方式二、若行车状况信息表示发生了急刹车,对第三阈值进行调整,调整后的第三阈值大于调整前的第三阈值。
如果车辆发生急刹车,用户的语音很可能由于发生急刹车而产生中断,导致语音结束点出现的概率变大,语音的中断时长也会相应变长,此时,通过提高尾部静音时长的阈值,能够让调整后的阈值适应于急刹车的情况。其中,车载终端可以配置加速度计传感器,可以通过加速度计传感器,采集到急刹车的情况。
其中,可以实施方式一或者实施方式二,或者实施方式一和方式二的结合。
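The threshold adjustment in manner 一 and manner 二 can be sketched as below; the 1.5 adjustment factor is an assumption, and the sharp-turn/hard-braking flags would come from the accelerometer-based driving-status information:

```python
def adjust_silence_threshold(base_threshold_ms, sharp_turn, hard_brake,
                             factor=1.5):
    """Raise the trailing-silence threshold when a sharp turn or hard braking
    is detected, so that interruptions caused by the maneuver are not mistaken
    for the voice end point. `factor` is an illustrative value."""
    if sharp_turn or hard_brake:
        return base_threshold_ms * factor
    return base_threshold_ms

print(adjust_silence_threshold(300, sharp_turn=True, hard_brake=False))  # 450.0
```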
步骤1011、车载终端判断第一分析结果是否表示为第三文本信息为完整语句。
步骤1012、若第一分析结果表示为第三文本信息为完整语句,车载终端将人脸图像输入预测模型。
步骤1013、车载终端通过预测模型对人脸图像进行处理,输出预测结果。
步骤1014、若预测结果表示用户不具有继续说话的意图,车载终端确定音频信号为语音结束点。
可选地,可以考虑车外环境,综合检测语音结束点。在一些实施例中,车载终端可以采集环境信息,环境信息表示搭载车载终端的车辆所处的环境。车载终端可以根据环境信息,对预测模型的参数进行调整。其中,车载终端可以配置行车记录仪,可以通过行车记录仪,采集到车外环境的情况。此外,与训练阶段的模型参数调整过程不同,根据环境信息进行调整的方式可以是模型微调。
示例性地,若环境信息表示发生了交通拥堵,车载终端可以对预测模型中第三分类器的判决阈值进行调整。其中,第三分类器用于在输入数据高于判决阈值时判决用户具有继续说话的意图,在输入数据低于或等于判决阈值时判决用户不具有继续说话的意图。例如,参见5,第三分类器可以是输出层的节点。
通过结合车外环境进行调参,达到的效果至少可以包括:在车辆驾驶的过程中,车外环境会对驾驶员的情绪产生影响。例如,交通拥塞的场景下驾驶员心情焦躁的概率,会比交通畅通的场景下驾驶员心情焦躁的概率更高。而情绪的变化会影响到人脸识别的过程,那么通过结合车外环境来调整预测模型的参数,可以让预测模型进行人脸识别的过程与当前的车外环境匹配,从而提高预测模型预测结果的精确性。
需要说明的一点是,本实施例是以车载场景为例进行说明,本方案可广泛应用于各种具备语音交互场景,可普遍实施。而在其他语音交互的场景下,也可以进一步利用场景信息来进行语音结束点的检测。例如,如果用在智能音箱或者机器人上,可以结合声源信息或者声场信息,来检测语音结束点。
本实施例提供了车载场景下多模态的语音结束点检测方法,通过模型对拍摄的人脸图像进行识别,从而预测出用户是否具有继续说话的意图,结合预测结果,来判决采集到的音频信号是否为语音结束点,由于在声学特征的基础上,还融合了人脸图像这种视觉模态的特征来进行检测,即使在背景噪声很强或者用户说话期间停顿的情况下,也能够利用人脸图像来准确判决语音信号是否为语音结束点,因此避免了背景噪声以及说话停顿造成的干扰,从而避免了背景噪声以及说话停顿的干扰会引发的过晚或者过早检测出语音交互处于结束状态的问题,提高了检测语音结束点的准确性。此外,由于解决了语音交互时语音结束点检测不准确的问题,避免了语音结束点过晚检测会引发的响应时延过长的问题,从而缩短了语音交互的时延,提高了语音交互的流畅性,避免了语音结束点过早检测会引发的语音指令被过早截断的问题,从而避免用户意图理解有误的情况,提高了车载场景下语音交互的准确性。
以上介绍了本实施例提供的语音检测方法,以下示例性介绍该语音检测方法的软件架构。
参见图11,该软件架构可以包括多个功能模块,例如可以包括数据获取模块、数据处理模块以及决策模块。其中,每个功能模块可以通过软件实现。
数据获取模块用于通过麦克风实时采集音频流,通过摄像头实时拍摄视频流。数据获取 模块可以将音频流和视频流传入数据处理模块。数据处理模块可以通过中央处理器提供的处理数据能力和控制设备能力,根据音频流和视频流提取多种模态的信息,例如声学信息、语义信息和视觉信息,将多种模态的信息传入决策模块。决策模块可以对各模态信息进行融合,从而决策当前音频信号是否为语音端点。
参见图12,图12是机器基于上述软件架构执行语音结束点检测的流程图。如图12所示,可以对音频信号进行自动语音识别,得到语音尾部静音的持续时长以及文本信息的N-best结果,根据N-best结果进行句法分析,根据分析结果以及持续时长与阈值之间的大小关系,可以对当前的音频信号进行分类,类别为语音结束点或者非语音结束点。
以上介绍了本申请实施例的语音检测方法,以下介绍本申请实施例的语音检测装置,应理解,该应用于语音检测装置其具有上述方法中语音检测设备的任意功能。
图13是本申请实施例提供的一种语音检测装置的结构示意图,如图13所示,该语音检测装置包括:获取模块1301,用于执行上述方法实施例中的步骤601或步骤1007;输入模块1302,用于执行步骤606或步骤1012;处理模块1303,用于执行步骤607或步骤1013;确定模块1304,用于执行步骤608或步骤1014。
可选地,处理模块,包括:
提取子模块,用于执行步骤607中的步骤一;
处理子模块,用于执行步骤607中的步骤二;
分类子模块,用于执行步骤607中的步骤三。
可选地,获取模块,还用于执行步骤201;装置还包括:训练模块,用于执行步骤202。
可选地,第一样本人脸图像满足第一条件。
可选地,第二样本人脸图像满足第二条件。
可选地,该装置还包括:语音识别模块,用于执行语音识别的步骤;句法分析模块,用于执行句法分析的步骤;确定模块,还用于若句法分析的结果表示不为完整语句,确定音频信号不为语音结束点;或者,若句法分析的结果表示为完整语句,触发输入模块1302执行步骤606或步骤1012。
可选地,句法分析模块,用于执行句法分析中的步骤一至步骤五。
可选地,将人脸图像输入预测模型的触发条件包括:检测音频信号的尾部静音时长;确定尾部静音时长大于第三阈值。
可选地,装置应用于车载终端,装置还包括:第一采集模块,用于采集行车状况信息;第一调整模块,用于根据行车状况信息,对第三阈值进行调整。
可选地,第一调整模块,用于若行车状况信息表示发生了急转弯,对第三阈值进行调整;若行车状况信息表示发生了急刹车,对第三阈值进行调整。
可选地,装置应用于车载终端,装置还包括:第二采集模块,用于采集环境信息;第二调整模块,用于根据环境信息,对预测模型的参数进行调整。
可选地,第二调整模块,用于若环境信息表示发生了交通拥堵,对预测模型中第三分类器的判决阈值进行调整。
应理解,图13实施例提供的语音检测装置对应于上述方法实施例中的语音检测装置,语音检测装置中的各模块和上述其他操作和/或功能分别为了实现方法实施例中的语音检测装 置所实施的各种步骤和方法,具体细节可参见上述方法实施例,为了简洁,在此不再赘述。
应理解,图13实施例提供的语音检测装置在检测语音时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将语音检测装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的语音检测装置与上述语音检测的方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
图14是本申请实施例提供的一种用于语音检测的预测模型的训练装置的结构示意图,如图14所示,该语音检测装置包括:获取模块1401,用于执行上述图2方法实施例中的步骤201或图10实施例中的步骤1001;处理模块1402,用于执行上述图2方法实施例中的步骤202和步骤203,或图10实施例中的步骤1002和步骤1003;训练模块1403,用于执行上述图2方法实施例中的步骤204,或图10实施例中的步骤1004。
可选地,第一样本音频信号满足第一条件。
可选地,第二样本音频信号满足第二条件。
可选地,第一样本人脸图像满足第三条件。
应理解,图14实施例提供的用于语音检测的预测模型的训练装置对应于图2方法实施例中的电子设备,用于语音检测的预测模型的训练装置中的各模块和上述其他操作和/或功能分别为了实现方法实施例中的图2方法实施例中的电子设备所实施的各种步骤和方法,具体细节可参见上述图2方法实施例,为了简洁,在此不再赘述。
应理解,图14实施例提供的用于语音检测的预测模型的训练装置在训练用于语音检测的预测模型时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将用于语音检测的预测模型的训练装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的用于语音检测的预测模型的训练装置与上述用于语音检测的预测模型的训练的方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
以上介绍了本申请实施例的电子设备,以下介绍电子设备可能的产品形态。
应理解,但凡具备上述电子设备的特征的任何形态的产品,都落入本申请的保护范围。还应理解,以下介绍仅为举例,不限制本申请实施例的电子设备的产品形态仅限于此。
本申请实施例提供了一种电子设备,该电子设备包括处理器,该处理器用于执行指令,使得该电子设备执行上述各个方法实施例提供的语音检测方法。
作为示例,处理器可以是一个通用中央处理器(central processing unit,CPU)、网络处理器(Network Processor,简称NP)、微处理器、或者可以是一个或多个用于实现本申请方案的集成电路,例如,专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD),现场可编程逻辑门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。该处理器可以是一个单核(single-CPU)处理器,也可以是一个多核(multi-CPU)处理器。该处理器的数量可以是一个,也可以是多个。
在一些可能的实施例中,该电子设备还可以包括存储器。
存储器可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其它类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其它类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、只读光盘(compact disc read-only Memory,CD-ROM)或其它光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其它磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其它介质,但不限于此。
存储器和处理器可以分离设置,存储器和处理器也可以集成在一起。在一些可能的实施例中,该电子设备还可以包括收发器。收发器用于与其它设备或通信网络通信,网络通信的方式可以而不限于是以太网,无线接入网(RAN),无线局域网(wireless local area networks,WLAN)等。
在一些可能的实施例中,执行上述图2实施例、图6实施例或图10实施例中的电子设备可以实现为终端,以下对终端的硬件结构进行示例性描述。
图15是本申请实施例提供的一种终端100的结构示意图。终端100可以是图1所示硬件环境中的车载终端101、智能手机102、智能音箱103或者机器人104,当然也可以是其他类型的终端。
终端100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
可以理解的是,本申请实施例示意的结构并不构成对终端100的具体限定。在本申请另一些实施例中,终端100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从该存储器中直接调用。避免了重复存 取,减少了处理器110的等待时间,因而提高了系统的效率。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。
I2C接口是一种双向同步串行总线,包括一根串行数据线(serial data line,SDA)和一根串行时钟线(derail clock line,SCL)。在一些实施例中,处理器110可以包含多组I2C总线。处理器110可以通过不同的I2C总线接口分别耦合触摸传感器180K,充电器,闪光灯,摄像头193等。例如:处理器110可以通过I2C接口耦合触摸传感器180K,使处理器110与触摸传感器180K通过I2C总线接口通信,实现终端100的触摸功能。
I2S接口可以用于音频通信。在一些实施例中,处理器110可以包含多组I2S总线。处理器110可以通过I2S总线与音频模块170耦合,实现处理器110与音频模块170之间的通信。在一些实施例中,音频模块170可以通过I2S接口向无线通信模块160传递音频信号,实现通过蓝牙耳机接听电话的功能。
PCM接口也可以用于音频通信,将模拟信号抽样,量化和编码。在一些实施例中,音频模块170与无线通信模块160可以通过PCM总线接口耦合。在一些实施例中,音频模块170也可以通过PCM接口向无线通信模块160传递音频信号,实现通过蓝牙耳机接听电话的功能。该I2S接口和该PCM接口都可以用于音频通信。
UART接口是一种通用串行数据总线,用于异步通信。该总线可以为双向通信总线。它将要传输的数据在串行通信与并行通信之间转换。在一些实施例中,UART接口通常被用于连接处理器110与无线通信模块160。例如:处理器110通过UART接口与无线通信模块160中的蓝牙模块通信,实现蓝牙功能。在一些实施例中,音频模块170可以通过UART接口向无线通信模块160传递音频信号,实现通过蓝牙耳机播放音乐的功能。
MIPI接口可以被用于连接处理器110与显示屏194,摄像头193等外围器件。MIPI接口包括摄像头串行接口(camera serial interface,CSI),显示屏串行接口(display serial interface,DSI)等。在一些实施例中,处理器110和摄像头193通过CSI接口通信,实现终端100的拍摄功能。处理器110和显示屏194通过DSI接口通信,实现终端100的显示功能。
GPIO接口可以通过软件配置。GPIO接口可以被配置为控制信号,也可被配置为数据信号。在一些实施例中,GPIO接口可以用于连接处理器110与摄像头193,显示屏194,无线通信模块160,音频模块170,传感器模块180等。GPIO接口还可以被配置为I2C接口,I2S接口,UART接口,MIPI接口等。
USB接口130是符合USB标准规范的接口,具体可以是Mini USB接口,Micro USB接口,USB Type C接口等。USB接口130可以用于连接充电器为终端100充电,也可以用于终端100与外围设备之间传输数据。也可以用于连接耳机,通过耳机播放音频。该接口还可以用于连接其他终端,例如AR设备等。
可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对终端100的结构限定。在本申请另一些实施例中,终端100也可以采用上述实施例中 不同的接口连接方式,或多种接口连接方式的组合。
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。在一些有线充电的实施例中,充电管理模块140可以通过USB接口130接收有线充电器的充电输入。在一些无线充电的实施例中,充电管理模块140可以通过终端100的无线充电线圈接收无线充电输入。充电管理模块140为电池142充电的同时,还可以通过电源管理模块141为终端供电。
电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,显示屏194,摄像头193,和无线通信模块160等供电。电源管理模块141还可以用于监测电池容量,电池循环次数,电池健康状态(漏电,阻抗)等参数。在其他一些实施例中,电源管理模块141也可以设置于处理器110中。在另一些实施例中,电源管理模块141和充电管理模块140也可以设置于同一个器件中。
终端100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。终端100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在终端100上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块150可以由天线1接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块150还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。在一些实施例中,移动通信模块150的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块150的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。
调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。随后解调器将解调得到的低频基带信号传送至基带处理器处理。低频基带信号经基带处理器处理后,被传递给应用处理器。应用处理器通过音频设备(不限于扬声器170A,受话器170B等)输出声音信号,或通过显示屏194显示图像或视频。在一些实施例中,调制解调处理器可以是独立的器件。在另一些实施例中,调制解调处理器可以独立于处理器110,与移动通信模块150或其他功能模块设置在同一个器件中。
无线通信模块160可以提供应用在终端100上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块160经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
在一些实施例中,终端100的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得终端100可以通过无线通信技术与网络以及其他设备通信。该无线通信技术可以包括全球移动通讯系统(global system for mobile communications,GSM),通用分组无线服务(general packet radio service,GPRS),码分多址接入(code division multiple access,CDMA),宽带码分多址(wideband code division multiple access,WCDMA),时分码分多址(time-division code division multiple access,TD-SCDMA),长期演进(long term evolution,LTE),BT,GNSS,WLAN,NFC,FM,和/或IR技术等。该GNSS可以包括全球卫星定位系统(global positioning system,GPS),全球导航卫星系统(global navigation satellite system,GLONASS),北斗卫星导航系统(beidou navigation satellite system,BDS),准天顶卫星系统(quasi-zenith satellite system,QZSS)和/或星基增强系统(satellite based augmentation systems,SBAS)。
终端100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),MiniLED,MicroLED,Micro-OLED,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,终端100可以包括1个或N个显示屏194,N为大于1的正整数。
终端100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将该电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,终端100可以包括1个或N个摄像头193,N为大于1的正整数。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当终端100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。终端100可以支持一种或多种视频编解码器。这样,终端100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现终端100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展终端100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储计算机可执行程序代码,该可执行程序代码包括指令。内部存储器121可以包括存储程序区和存储数据区。此外,内部存储器121可以存储上述方法实施例中描述的预测模型。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储终端100使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。处理器110通过运行存储在内部存储器121的指令,和/或存储在设置于处理器中的存储器的指令,执行终端100的各种功能应用以及数据处理。
终端100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。终端100可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当终端100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。终端100可以设置至少一个麦克风170C。在另一些实施例中,终端100可以设置两个麦克风170C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,终端100还可以设置三个,四个或更多麦克风170C,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动终端平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
压力传感器180A用于感受压力信号,可以将压力信号转换成电信号。在一些实施例中,压力传感器180A可以设置于显示屏194。压力传感器180A的种类很多,如电阻式压力传感器,电感式压力传感器,电容式压力传感器等。电容式压力传感器可以是包括至少两个具有导电材料的平行板。当有力作用于压力传感器180A,电极之间的电容改变。终端100根据电容的变化确定压力的强度。当有触摸操作作用于显示屏194,终端100根据压力传感器180A检测该触摸操作强度。终端100也可以根据压力传感器180A的检测信号计算触摸的位置。在一些实施例中,作用于相同触摸位置,但不同触摸操作强度的触摸操作,可以对应不同的操作指令。例如:当有触摸操作强度小于第一压力阈值的触摸操作作用于短消息应用图标时,执行查看短消息的指令。当有触摸操作强度大于或等于第一压力阈值的触摸操作作用于短消息应用图标时,执行新建短消息的指令。
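为便于理解上述按压力阈值区分操作指令的逻辑,下面给出一段简短的示意性代码(Python);其中的阈值数值与函数名均为假设举例,并非终端100的真实参数,仅用于说明相同触摸位置、不同触摸操作强度对应不同操作指令的判断方式。

```python
# 示意性草图(非真实实现):按触摸操作强度与第一压力阈值的比较结果区分操作指令
FIRST_PRESSURE_THRESHOLD = 0.5  # 假设的第一压力阈值(此处为归一化后的压力值,仅作举例)

def handle_sms_icon_touch(pressure: float) -> str:
    """作用于短消息应用图标、相同触摸位置但不同强度的触摸操作,对应不同的操作指令。"""
    if pressure < FIRST_PRESSURE_THRESHOLD:
        return "查看短消息"   # 强度小于第一压力阈值:执行查看短消息的指令
    return "新建短消息"       # 强度大于或等于第一压力阈值:执行新建短消息的指令
```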
陀螺仪传感器180B可以用于确定终端100的运动姿态。在一些实施例中,可以通过陀螺仪传感器180B确定终端100围绕三个轴(即,x,y和z轴)的角速度。陀螺仪传感器180B可以用于拍摄防抖。示例性的,当按下快门,陀螺仪传感器180B检测终端100抖动的角度,根据角度计算出镜头模组需要补偿的距离,让镜头通过反向运动抵消终端100的抖动,实现防抖。陀螺仪传感器180B还可以用于导航,体感游戏场景。
气压传感器180C用于测量气压。在一些实施例中,终端100通过气压传感器180C测得的气压值计算海拔高度,辅助定位和导航。
磁传感器180D包括霍尔传感器。终端100可以利用磁传感器180D检测翻盖皮套的开合。在一些实施例中,当终端100是翻盖机时,终端100可以根据磁传感器180D检测翻盖的开合。进而根据检测到的皮套的开合状态或翻盖的开合状态,设置翻盖自动解锁等特性。
加速度传感器180E可检测终端100在各个方向上(一般为三轴)加速度的大小。当终端100静止时可检测出重力的大小及方向。还可以用于识别终端姿态,应用于横竖屏切换,计步器等应用。
距离传感器180F,用于测量距离。终端100可以通过红外或激光测量距离。在一些实施例中,在拍摄场景中,终端100可以利用距离传感器180F测距以实现快速对焦。
接近光传感器180G可以包括例如发光二极管(LED)和光检测器,例如光电二极管。发光二极管可以是红外发光二极管。终端100通过发光二极管向外发射红外光。终端100使用光电二极管检测来自附近物体的红外反射光。当检测到充分的反射光时,可以确定终端100附近有物体。当检测到不充分的反射光时,终端100可以确定终端100附近没有物体。终端100可以利用接近光传感器180G检测用户手持终端100贴近耳朵通话,以便自动熄灭屏幕达到省电的目的。接近光传感器180G也可用于皮套模式,口袋模式自动解锁与锁屏。
环境光传感器180L用于感知环境光亮度。终端100可以根据感知的环境光亮度自适应调节显示屏194亮度。环境光传感器180L也可用于拍照时自动调节白平衡。环境光传感器180L还可以与接近光传感器180G配合,检测终端100是否在口袋里,以防误触。
指纹传感器180H用于采集指纹。终端100可以利用采集的指纹特性实现指纹解锁,访问应用锁,指纹拍照,指纹接听来电等。
温度传感器180J用于检测温度。在一些实施例中,终端100利用温度传感器180J检测的温度,执行温度处理策略。例如,当温度传感器180J上报的温度超过阈值,终端100执行降低位于温度传感器180J附近的处理器的性能,以便降低功耗实施热保护。在另一些实施例中,当温度低于另一阈值时,终端100对电池142加热,以避免低温导致终端100异常关机。在其他一些实施例中,当温度低于又一阈值时,终端100对电池142的输出电压执行升压,以避免低温导致的异常关机。
触摸传感器180K,也称“触控器件”。触摸传感器180K可以设置于显示屏194,由触摸传感器180K与显示屏194组成触摸屏,也称“触控屏”。触摸传感器180K用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏194提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器180K也可以设置于终端100的表面,与显示屏194所处的位置不同。
骨传导传感器180M可以获取振动信号。在一些实施例中,骨传导传感器180M可以获取人体声部振动骨块的振动信号。骨传导传感器180M也可以接触人体脉搏,接收血压跳动信号。在一些实施例中,骨传导传感器180M也可以设置于耳机中,结合成骨传导耳机。音频模块170可以基于该骨传导传感器180M获取的声部振动骨块的振动信号,解析出语音指令,实现语音功能。应用处理器可以基于该骨传导传感器180M获取的血压跳动信号解析心率信息,实现心率检测功能。
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。终端100可以接收按键输入,产生与终端100的用户设置以及功能控制有关的键信号输入。
马达191可以产生振动提示。马达191可以用于来电振动提示,也可以用于触摸振动反馈。例如,作用于不同应用(例如拍照,音频播放等)的触摸操作,可以对应不同的振动反馈效果。作用于显示屏194不同区域的触摸操作,马达191也可对应不同的振动反馈效果。不同的应用场景(例如:时间提醒,接收信息,闹钟,游戏等)也可以对应不同的振动反馈效果。触摸振动反馈效果还可以支持自定义。
指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。
SIM卡接口195用于连接SIM卡。SIM卡可以通过插入SIM卡接口195,或从SIM卡接口195拔出,实现和终端100的接触和分离。终端100可以支持1个或N个SIM卡接口,N为大于1的正整数。SIM卡接口195可以支持Nano SIM卡,Micro SIM卡,SIM卡等。同一个SIM卡接口195可以同时插入多张卡。该多张卡的类型可以相同,也可以不同。SIM卡接口195也可以兼容不同类型的SIM卡。SIM卡接口195也可以兼容外部存储卡。终端100通过SIM卡和网络交互,实现通话以及数据通信等功能。在一些实施例中,终端100采用eSIM,即:嵌入式SIM卡。eSIM卡可以嵌在终端100中,不能和终端100分离。
终端100的软件系统可以采用分层架构,事件驱动架构,微核架构,微服务架构,或云架构。
以分层架构的Android(安卓)系统为例,示例性说明终端100的软件结构。
图16是本申请实施例提供的一种终端100的功能架构图。
分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为应用程序层,应用程序框架层,安卓运行时(Android runtime)和系统库,以及内核层。
应用程序层可以包括一系列应用程序包。
如图16所示,应用程序包可以包括相机,图库,日历,通话,地图,导航,WLAN,蓝牙,音乐,视频,短信息等应用程序。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。
如图16所示,应用程序框架层可以包括窗口管理器,内容提供器,视图系统,电话管理器,资源管理器,通知管理器等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。该数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
电话管理器用于提供终端100的通信功能。例如通话状态的管理(包括接通,挂断等)。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,终端振动,指示灯闪烁等。
Android Runtime包括核心库和虚拟机。Android runtime负责安卓系统的调度和管理。
核心库包含两部分:一部分是java语言需要调用的功能函数,另一部分是安卓的核心库。
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。
系统库可以包括多个功能模块。例如:表面管理器(surface manager),媒体库(Media Libraries),三维图形处理库(例如:OpenGL ES),2D图形引擎(例如:SGL)等。
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG,PNG等。
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。
2D图形引擎是2D绘图的绘图引擎。
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。
下面结合语音检测场景,示例性说明终端100软件以及硬件的工作流程。
终端100通过音频驱动,启动麦克风170C,通过麦克风170C采集音频信号,启动摄像头驱动,通过摄像头193拍摄人脸图像。终端将预测模型加载至内部存储器121,处理器110将人脸图像输入预测模型,处理器110通过预测模型对人脸图像进行处理,输出预测结果;若预测结果表示用户不具有继续说话的意图,处理器110确定音频信号为语音结束点。
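为便于理解上述软硬件协作流程,下面给出一段最小化的示意性代码(Python),用于说明"采集音频信号与人脸图像、通过预测模型处理、判定语音结束点"的先后顺序;其中 capture_audio、capture_face_image、model.predict 以及返回对象的 intends_to_continue 字段均为假设的名称,并非终端100或任何真实系统的接口。

```python
# 示意性草图(接口均为假设):采集 -> 预测 -> 判定语音结束点
def detect_speech_end(capture_audio, capture_face_image, model):
    """若预测结果表示用户不具有继续说话的意图,则将当前音频信号判定为语音结束点。"""
    audio_signal = capture_audio()          # 对应:通过麦克风170C采集音频信号
    face_image = capture_face_image()       # 对应:通过摄像头193拍摄同一时间点的人脸图像
    prediction = model.predict(face_image)  # 对应:处理器110通过预测模型对人脸图像进行处理
    if not prediction.intends_to_continue:  # 预测结果:用户是否具有继续说话的意图
        return audio_signal                 # 不具有继续说话的意图:该音频信号为语音结束点
    return None                             # 具有继续说话的意图:继续等待后续音频
```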
在一些可能的实施例中,执行上述图2实施例、图6实施例或图10实施例中的电子设备可以实现为计算设备,该计算设备可以是服务器、主机或个人计算机等。该计算设备可以由一般性的总线体系结构来实现。
参见图17,图17是本申请实施例提供的一种计算设备的结构示意图,该计算设备可以配置为上述方法实施例中的电子设备。
计算设备可以是方法实施例全部或部分描述的内容中涉及的任一设备。计算设备包括至少一个处理器1701、通信总线1702、存储器1703以及至少一个通信接口1704。
处理器1701可以是一个通用中央处理器(central processing unit,CPU)、网络处理器(NP)、微处理器、或者可以是一个或多个用于实现本申请方案的集成电路,例如,专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD),现场可编程逻辑门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。
通信总线1702用于在上述组件之间传送信息。通信总线1702可以分为地址总线、数据总线、控制总线等。为便于表示,图中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
存储器1703可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其它类型的静态存储设备,也可以是随机存取存储器(random access memory,RAM)或者可存储信息和指令的其它类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only Memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其它光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其它磁存储设备,或者是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其它介质,但不限于此。存储器1703可以是独立存在,并通过通信总线1702与处理器1701相连接。存储器1703也可以和处理器1701集成在一起。
通信接口1704使用任何收发器一类的装置,用于与其它设备或通信网络通信。通信接口1704包括有线通信接口,还可以包括无线通信接口。其中,有线通信接口例如可以为以太网接口。以太网接口可以是光接口,电接口或其组合。无线通信接口可以为无线局域网(wireless local area networks,WLAN)接口,蜂窝网络通信接口或其组合等。
在具体实现中,作为一种实施例,处理器1701可以包括一个或多个CPU,如图17中所示的CPU0和CPU1。
在具体实现中,作为一种实施例,计算设备可以包括多个处理器,如图17中所示的处理器1701和处理器1705。这些处理器中的每一个可以是一个单核处理器(single-CPU),也可以是一个多核处理器(multi-CPU)。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(如计算机程序指令)的处理核。
在具体实现中,作为一种实施例,计算设备还可以包括输出设备1706和输入设备1707。输出设备1706和处理器1701通信,可以以多种方式来显示信息。例如,输出设备1706可以是液晶显示器(liquid crystal display,LCD)、发光二极管(light emitting diode,LED)显示设备、阴极射线管(cathode ray tube,CRT)显示设备或投影仪(projector)等。输入设备1707和处理器1701通信,可以以多种方式接收用户的输入。例如,输入设备1707可以是鼠标、键盘、触摸屏设备或传感设备等。
在一些实施例中,存储器1703用于存储执行本申请方案的程序代码1710,处理器1701可以执行存储器1703中存储的程序代码1710。也即是,计算设备可以通过处理器1701以及存储器1703中的程序代码1710,来实现方法实施例提供的方法。
本申请实施例的计算设备可对应于上述各个方法实施例中的电子设备,并且,该计算设备中的处理器1701、通信接口1704等可以实现上述各个方法实施例中的电子设备所具有的功能和/或所实施的各种步骤和方法。为了简洁,在此不再赘述。
在一些可能的实施例中,执行上述图2实施例、图6实施例或图10实施例中的电子设备也可以由通用处理器来实现。例如,该通用处理器的形态可以是一种芯片。具体地,实现电子设备的通用处理器包括处理电路和与该处理电路内部连接通信的输入接口以及输出接口,该输入接口可以将音频信号以及人脸图像输入处理电路,处理电路用于执行步骤602至步骤608,该处理电路可以通过输出接口,输出语音检测的结果。可选地,该通用处理器还可以包括存储介质,存储介质可以存储处理电路执行的指令,该处理电路用于执行存储介质存储的指令以执行上述各个方法实施例。可选地,该存储介质还可以用于缓存预测模型,或者对预测模型进行持久化存储。
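作为对上述通用处理器中输入接口、处理电路与输出接口之间数据通路的一种理解方式,下面给出一段示意性代码(Python)进行粗略模拟;其中的类名与方法名均为假设,processing_circuit 代表执行步骤602至步骤608的处理逻辑,并不对应任何真实芯片接口。

```python
# 示意性草图(非真实芯片接口):模拟通用处理器中"输入接口 -> 处理电路 -> 输出接口"的数据通路
class GeneralPurposeProcessorSketch:
    def __init__(self, processing_circuit):
        # processing_circuit:一个可调用对象,代表执行步骤602至步骤608的处理电路逻辑
        self.processing_circuit = processing_circuit

    def input_interface(self, audio_signal, face_image):
        # 输入接口:将音频信号以及人脸图像输入处理电路
        return self.processing_circuit(audio_signal, face_image)

    def output_interface(self, is_speech_end):
        # 输出接口:输出语音检测的结果
        return {"speech_end": bool(is_speech_end)}

    def run(self, audio_signal, face_image):
        return self.output_interface(self.input_interface(audio_signal, face_image))
```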
作为一种可能的产品形态,执行上述图2实施例、图6实施例或图10实施例中的电子设备,还可以使用下述器件来实现:一个或多个现场可编程门阵列(英文全称:field-programmable gate array,英文简称:FPGA)、可编程逻辑器件(英文全称:programmable logic device,英文简称:PLD)、控制器、状态机、门逻辑、分立硬件部件、任何其它适合的电路、或者能够执行本申请通篇所描述的各种功能的电路的任意组合。
在一些可能的实施例中,执行上述图2实施例、图6实施例或图10实施例中的电子设备还可以使用计算机程序产品实现。具体地,本申请实施例提供了一种计算机程序产品,当该计算机程序产品在电子设备上运行时,使得电子设备执行上述方法实施例中的语音检测方法。
应理解,上述各种产品形态的电子设备,比如终端100、图17所示的计算设备,分别具有上述图2实施例、图6实施例或图10实施例中电子设备的任意功能,此处不再赘述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例中描述的各方法步骤和单元,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各实施例的步骤及组成。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。本领域普通技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参见前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,该单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口、装置或单元的间接耦合或通信连接,也可以是电的,机械的或其它的形式连接。
该作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本申请实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以是两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
该集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分,或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例中方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上描述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。该计算机程序产品包括一个或多个计算机程序指令。在计算机上加载和执行该计算机程序指令时,全部或部分地产生按照本申请实施例中的流程或功能。该计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。该计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,该计算机程序指令可以从一个网站站点、计算机、服务器或数据中心通过有线或无线方式向另一个网站站点、计算机、服务器或数据中心进行传输。该计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。该可用介质可以是磁性介质(例如软盘、硬盘、磁带)、光介质(例如,数字视频光盘(digital video disc,DVD))、或者半导体介质(例如固态硬盘)等。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (23)

  1. 一种语音检测方法,其特征在于,所述方法包括:
    获取音频信号以及人脸图像,所述人脸图像的拍摄时间点和所述音频信号的采集时间点相同;
    将所述人脸图像输入预测模型,所述预测模型用于预测用户是否具有继续说话的意图;
    通过所述预测模型对所述人脸图像进行处理,输出预测结果;
    若所述预测结果表示所述用户不具有继续说话的意图,确定所述音频信号为语音结束点。
  2. 根据权利要求1所述的方法,其特征在于,所述通过所述预测模型对所述人脸图像进行处理,输出预测结果,包括:
    提取所述人脸图像包含的关键点;
    对所述关键点进行处理,得到所述人脸图像的动作特征;
    对所述动作特征进行分类,得到不同类别分别对应的置信度;
    根据所述置信度确定所述预测结果。
  3. 根据权利要求1或2所述的方法,其特征在于,所述预测模型是根据第一样本人脸图像以及第二样本人脸图像训练得到的;
    所述第一样本人脸图像标注了第一标签,所述第一标签表示样本用户具有继续说话的意图,所述第一标签是根据第一样本音频信号确定的,所述第一样本音频信号的采集时间点及采集对象和所述第一样本人脸图像的拍摄时间点及拍摄对象均相同;
    所述第二样本人脸图像标注了第二标签,所述第二标签表示样本用户不具有继续说话的意图,所述第二标签是根据第二样本音频信号确定的,所述第二样本音频信号的采集时间点及采集对象和所述第二样本人脸图像的拍摄时间点及拍摄对象均相同。
  4. 根据权利要求3所述的方法,其特征在于,所述第一样本音频信号满足第一条件,所述第一条件包括以下至少一项:
    所述第一样本音频信号对应的语音活性检测VAD结果先从说话状态更新为沉默状态,再从所述沉默状态更新为所述说话状态;或,
    所述第一样本音频信号的尾部静音时长小于第一阈值且大于第二阈值,所述第一阈值大于所述第二阈值;或,
    文本信息组合的第一置信度大于第一文本信息的第二置信度,所述文本信息组合为所述第一文本信息与第二文本信息的组合,所述第一文本信息表示所述第一样本音频信号的上一个样本音频信号的语义,所述第二文本信息表示所述第一样本音频信号的下一个样本音频信号的语义,所述第一置信度表示所述文本信息组合为完整语句的概率,所述第二置信度表示所述第一文本信息为完整语句的概率;或,
    所述文本信息组合的第一置信度大于所述第二文本信息的第三置信度,所述第三置信度表示所述第二文本信息为完整语句的概率。
  5. 根据权利要求3或4所述的方法,其特征在于,所述第二样本音频信号满足第二条件,所述第二条件包括以下至少一项:
    所述第二样本音频信号对应的VAD结果从说话状态更新为沉默状态;或,
    所述第二样本音频信号的尾部静音时长大于第一阈值。
  6. 根据权利要求3至5中任一项所述的方法,其特征在于,所述第一样本人脸图像满足第三条件,所述第三条件包括:
    将所述第一样本人脸图像分别输入所述预测模型中的第一分类器以及所述预测模型中的第二分类器后,所述第一分类器输出的概率大于所述第二分类器输出的概率,所述第一分类器用于预测人脸图像包含动作的概率,所述第二分类器用于预测人脸图像不包含动作的概率。
  7. 根据权利要求1所述的方法,其特征在于,所述获取音频信号以及人脸图像之后,所述方法包括:
    对所述音频信号进行语音识别,得到所述音频信号对应的第三文本信息;
    对所述第三文本信息进行句法分析,得到第一分析结果,所述第一分析结果用于表示所述第三文本信息是否为完整语句;
    若所述第一分析结果表示为所述第三文本信息不为完整语句,确定所述音频信号不为语音结束点;或者,若所述第一分析结果表示为所述第三文本信息为完整语句,执行所述将所述人脸图像输入预测模型的步骤。
  8. 根据权利要求7所述的方法,其特征在于,所述对所述第三文本信息进行句法分析,得到第一分析结果,包括:
    对所述第三文本信息进行分词,得到多个词汇;
    对于所述多个词汇中的每个词汇,对所述词汇进行句法分析,得到所述词汇对应的第二分析结果,所述第二分析结果用于表示所述词汇与所述词汇之前的词汇是否组成了完整语句;
    若所述多个词汇中任一词汇对应的第二分析结果表示组成了完整语句,确定所述第三文本信息为完整语句;或者,若所述多个词汇中每个词汇对应的第二分析结果均表示没有组成完整语句,确定所述第三文本信息不为完整语句。
  9. 一种用于语音检测的预测模型的训练方法,其特征在于,所述方法包括:
    获取样本音频信号集以及待标注的样本人脸图像集;
    根据所述样本音频信号集中的第一样本音频信号,对所述样本人脸图像集中的第三样本人脸图像进行处理,得到第一样本人脸图像,所述第一样本人脸图像标注了第一标签,所述第一标签表示样本用户具有继续说话的意图,所述第一样本人脸图像的拍摄时间点及拍摄对象和所述第一样本音频信号的采集时间点及采集对象均相同;
    根据所述样本音频信号集中的第二样本音频信号,对所述样本人脸图像集中的第四样本人脸图像进行处理,得到第二样本人脸图像,所述第二样本人脸图像标注了第二标签,所述第二标签表示样本用户不具有继续说话的意图,所述第二样本人脸图像的拍摄时间点及拍摄对象和所述第二样本音频信号的采集时间点及采集对象均相同;
    使用所述第一样本人脸图像以及所述第二样本人脸图像进行模型训练,得到预测模型,所述预测模型用于预测用户是否具有继续说话的意图。
  10. 根据权利要求9所述的方法,其特征在于,所述第一样本音频信号满足第一条件,所述第一条件包括以下至少一项:
    所述第一样本音频信号对应的语音活性检测VAD结果先从说话状态更新为沉默状态,再从所述沉默状态更新为所述说话状态;或,
    所述第一样本音频信号的尾部静音时长小于第一阈值且大于第二阈值,所述第一阈值大于所述第二阈值;或,
    文本信息组合的第一置信度大于第一文本信息的第二置信度,所述文本信息组合为所述第一文本信息与第二文本信息的组合,所述第一文本信息表示所述第一样本音频信号的上一个样本音频信号的语义,所述第二文本信息表示所述第一样本音频信号的下一个样本音频信号的语义,所述第一置信度表示所述文本信息组合为完整语句的概率,所述第二置信度表示所述第一文本信息为完整语句的概率;或,
    所述文本信息组合的第一置信度大于所述第二文本信息的第三置信度,所述第三置信度表示所述第二文本信息为完整语句的概率。
  11. 根据权利要求9或10所述的方法,其特征在于,所述第二样本音频信号满足第二条件,所述第二条件包括以下至少一项:
    所述第二样本音频信号对应的VAD结果从说话状态更新为沉默状态;或,
    所述第二样本音频信号的尾部静音时长大于第一阈值。
  12. 根据权利要求9至11中任一项所述的方法,其特征在于,所述第一样本人脸图像满足第三条件,所述第三条件包括:
    将所述第一样本人脸图像输入所述预测模型中的第一分类器以及所述预测模型中的第二分类器后,所述第一分类器输出的概率大于所述第二分类器输出的概率,所述第一分类器用于预测人脸图像包含动作的概率,所述第二分类器用于预测人脸图像不包含动作的概率。
  13. 一种语音检测装置,其特征在于,所述装置包括:
    获取模块,用于获取音频信号以及人脸图像,所述人脸图像的拍摄时间点和所述音频信号的采集时间点相同;
    输入模块,用于将所述人脸图像输入预测模型,所述预测模型用于预测用户是否具有继续说话的意图;
    处理模块,用于通过所述预测模型对所述人脸图像进行处理,输出预测结果;
    确定模块,用于若所述预测结果表示所述用户不具有继续说话的意图,确定所述音频信号为语音结束点。
  14. 根据权利要求13所述的装置,其特征在于,所述处理模块,包括:
    提取子模块,用于提取所述人脸图像包含的关键点;
    处理子模块,用于对所述关键点进行处理,得到所述人脸图像的动作特征;
    分类子模块,用于对所述动作特征进行分类,得到不同类别分别对应的置信度;
    确定子模块,用于根据所述置信度确定所述预测结果。
  15. 根据权利要求13或14所述的装置,其特征在于,所述预测模型是根据第一样本人脸图像以及第二样本人脸图像训练得到的;
    所述第一样本人脸图像标注了第一标签,所述第一标签表示样本用户具有继续说话的意图,所述第一标签是根据第一样本音频信号确定的,所述第一样本音频信号的采集时间点及采集对象和所述第一样本人脸图像的拍摄时间点及拍摄对象均相同;
    所述第二样本人脸图像标注了第二标签,所述第二标签表示样本用户不具有继续说话的意图,所述第二标签是根据第二样本音频信号确定的,所述第二样本音频信号的采集时间点及采集对象和所述第二样本人脸图像的拍摄时间点及拍摄对象均相同。
  16. 根据权利要求15所述的装置,其特征在于,所述第一样本音频信号满足第一条件,所述第一条件包括以下至少一项:
    所述第一样本音频信号对应的语音活性检测VAD结果先从说话状态更新为沉默状态,再从所述沉默状态更新为所述说话状态;或,
    所述第一样本音频信号的尾部静音时长小于第一阈值且大于第二阈值,所述第一阈值大于所述第二阈值;或,
    文本信息组合的第一置信度大于第一文本信息的第二置信度,所述文本信息组合为所述第一文本信息与第二文本信息的组合,所述第一文本信息表示所述第一样本音频信号的上一个样本音频信号的语义,所述第二文本信息表示所述第一样本音频信号的下一个样本音频信号的语义,所述第一置信度表示所述文本信息组合为完整语句的概率,所述第二置信度表示所述第一文本信息为完整语句的概率;或,
    所述文本信息组合的第一置信度大于所述第二文本信息的第三置信度,所述第三置信度表示所述第二文本信息为完整语句的概率。
  17. 根据权利要求15或16所述的装置,其特征在于,所述第二样本音频信号满足第二条件,所述第二条件包括以下至少一项:所述第二样本音频信号对应的VAD结果从说话状态更新为沉默状态;或,所述第二样本音频信号的尾部静音时长大于第一阈值。
  18. 根据权利要求15至17中任一项所述的装置,其特征在于,所述第一样本人脸图像满足第三条件,所述第三条件包括:将所述第一样本人脸图像分别输入所述预测模型中的第一分类器以及所述预测模型中的第二分类器后,所述第一分类器输出的概率大于所述第二分类器输出的概率,所述第一分类器用于预测人脸图像包含动作的概率,所述第二分类器用于预测人脸图像不包含动作的概率。
  19. 根据权利要求13所述的装置,其特征在于,所述装置还包括:
    语音识别模块,用于对所述音频信号进行语音识别,得到所述音频信号对应的第三文本信息;
    句法分析模块,用于对所述第三文本信息进行句法分析,得到第一分析结果,所述第一分析结果用于表示所述第三文本信息是否为完整语句;
    所述确定模块,还用于若所述第一分析结果表示为所述第三文本信息不为完整语句,确定所述音频信号不为语音结束点;或者,若所述第一分析结果表示为所述第三文本信息为完整语句,执行所述将所述人脸图像输入预测模型的步骤。
  20. 根据权利要求19所述的装置,其特征在于,所述句法分析模块,用于对所述第三文本信息进行分词,得到多个词汇;对于所述多个词汇中的每个词汇,对所述词汇进行句法分析,得到所述词汇对应的第二分析结果,所述第二分析结果用于表示所述词汇与所述词汇之前的词汇是否组成了完整语句;若所述多个词汇中任一词汇对应的第二分析结果表示组成了完整语句,确定所述第三文本信息为完整语句;或者,若所述多个词汇中每个词汇对应的第二分析结果均表示没有组成完整语句,确定所述第三文本信息不为完整语句。
  21. 一种用于语音检测的预测模型的训练装置,其特征在于,所述装置包括:
    获取模块,用于获取样本音频信号集以及待标注的样本人脸图像集;
    处理模块,用于根据所述样本音频信号集中的第一样本音频信号,对所述样本人脸图像集中的第三样本人脸图像进行处理,得到第一样本人脸图像,所述第一样本人脸图像标注了第一标签,所述第一标签表示样本用户具有继续说话的意图,所述第一样本人脸图像的拍摄时间点及拍摄对象和所述第一样本音频信号的采集时间点及采集对象均相同;
    所述处理模块,还用于根据所述样本音频信号集中的第二样本音频信号,对所述样本人脸图像集中的第四样本人脸图像进行处理,得到第二样本人脸图像,所述第二样本人脸图像标注了第二标签,所述第二标签表示样本用户不具有继续说话的意图,所述第二样本人脸图像的拍摄时间点及拍摄对象和所述第二样本音频信号的采集时间点及采集对象均相同;
    训练模块,用于使用所述第一样本人脸图像以及所述第二样本人脸图像进行模型训练,得到预测模型,所述预测模型用于预测用户是否具有继续说话的意图。
  22. 一种电子设备,其特征在于,所述电子设备包括处理器,所述处理器用于执行指令,使得所述电子设备执行如权利要求1至权利要求8中任一项所述的方法,或如权利要求9至权利要求12中任一项所述的方法。
  23. 一种计算机可读存储介质,其特征在于,所述存储介质中存储有至少一条指令,所述指令由处理器读取以使电子设备执行如权利要求1至权利要求8中任一项所述的方法,或如权利要求9至权利要求12中任一项所述的方法。
PCT/CN2019/125121 2019-12-13 2019-12-13 语音检测方法、预测模型的训练方法、装置、设备及介质 WO2021114224A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/CN2019/125121 WO2021114224A1 (zh) 2019-12-13 2019-12-13 语音检测方法、预测模型的训练方法、装置、设备及介质
EP19956031.9A EP4064284A4 (en) 2019-12-13 2019-12-13 SPEECH DETECTION METHODS, TRAINING METHODS FOR PREDICTIVE MODELS, DEVICE, DEVICE AND MEDIUM
CN201980052133.4A CN112567457B (zh) 2019-12-13 2019-12-13 语音检测方法、预测模型的训练方法、装置、设备及介质
US17/838,500 US20220310095A1 (en) 2019-12-13 2022-06-13 Speech Detection Method, Prediction Model Training Method, Apparatus, Device, and Medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/125121 WO2021114224A1 (zh) 2019-12-13 2019-12-13 语音检测方法、预测模型的训练方法、装置、设备及介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/838,500 Continuation US20220310095A1 (en) 2019-12-13 2022-06-13 Speech Detection Method, Prediction Model Training Method, Apparatus, Device, and Medium

Publications (1)

Publication Number Publication Date
WO2021114224A1 true WO2021114224A1 (zh) 2021-06-17

Family

ID=75041165

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/125121 WO2021114224A1 (zh) 2019-12-13 2019-12-13 语音检测方法、预测模型的训练方法、装置、设备及介质

Country Status (4)

Country Link
US (1) US20220310095A1 (zh)
EP (1) EP4064284A4 (zh)
CN (1) CN112567457B (zh)
WO (1) WO2021114224A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113873191A (zh) * 2021-10-12 2021-12-31 苏州万店掌软件技术有限公司 一种基于语音的视频回溯方法、装置及系统
CN114171016A (zh) * 2021-11-12 2022-03-11 北京百度网讯科技有限公司 语音交互的方法、装置、电子设备及存储介质
CN114267345A (zh) * 2022-02-25 2022-04-01 阿里巴巴达摩院(杭州)科技有限公司 模型训练方法、语音处理方法及其装置

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220321612A1 (en) * 2021-04-02 2022-10-06 Whatsapp Llc Enhanced text and voice communications
CN113223501B (zh) * 2021-04-27 2022-11-04 北京三快在线科技有限公司 一种语音交互业务的执行方法及执行装置
CN113488043B (zh) * 2021-06-30 2023-03-24 上海商汤临港智能科技有限公司 乘员说话检测方法及装置、电子设备和存储介质
CN113535925B (zh) * 2021-07-27 2023-09-05 平安科技(深圳)有限公司 语音播报方法、装置、设备及存储介质
CN113655938B (zh) * 2021-08-17 2022-09-02 北京百度网讯科技有限公司 一种用于智能座舱的交互方法、装置、设备和介质
WO2023092399A1 (zh) * 2021-11-25 2023-06-01 华为技术有限公司 语音识别方法、语音识别装置及系统
CN115171678A (zh) * 2022-06-01 2022-10-11 合众新能源汽车有限公司 语音识别方法、装置、电子设备、存储介质及产品
CN115240402B (zh) * 2022-07-13 2023-04-07 北京拙河科技有限公司 一种观光车调度方法和系统
CN114898755B (zh) * 2022-07-14 2023-01-17 科大讯飞股份有限公司 语音处理方法及相关装置、电子设备、存储介质
KR102516391B1 (ko) * 2022-09-02 2023-04-03 주식회사 액션파워 음성 구간 길이를 고려하여 오디오에서 음성 구간을 검출하는 방법
WO2024058911A1 (en) * 2022-09-14 2024-03-21 Microsoft Technology Licensing, Llc Systems for semantic segmentation for speech
CN115910043B (zh) * 2023-01-10 2023-06-30 广州小鹏汽车科技有限公司 语音识别方法、装置及车辆
CN116798427A (zh) * 2023-06-21 2023-09-22 支付宝(杭州)信息技术有限公司 基于多模态的人机交互方法及数字人系统

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005114576A1 (ja) * 2004-05-21 2005-12-01 Asahi Kasei Kabushiki Kaisha 動作内容判定装置
CN102682273A (zh) * 2011-03-18 2012-09-19 夏普株式会社 嘴唇运动检测设备和方法
CN103617801A (zh) * 2013-12-18 2014-03-05 联想(北京)有限公司 语音检测方法、装置及电子设备
CN103745723A (zh) * 2014-01-13 2014-04-23 苏州思必驰信息科技有限公司 一种音频信号识别方法及装置
CN108573701A (zh) * 2017-03-14 2018-09-25 谷歌有限责任公司 基于唇部检测的查询端点化
CN109817211A (zh) * 2019-02-14 2019-05-28 珠海格力电器股份有限公司 一种电器控制方法、装置、存储介质及电器
CN110033790A (zh) * 2017-12-25 2019-07-19 卡西欧计算机株式会社 声音认识装置、机器人、声音认识方法以及记录介质

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9318129B2 (en) * 2011-07-18 2016-04-19 At&T Intellectual Property I, Lp System and method for enhancing speech activity detection using facial feature detection
US9031847B2 (en) * 2011-11-15 2015-05-12 Microsoft Technology Licensing, Llc Voice-controlled camera operations
US9437186B1 (en) * 2013-06-19 2016-09-06 Amazon Technologies, Inc. Enhanced endpoint detection for speech recognition
US9215543B2 (en) * 2013-12-03 2015-12-15 Cisco Technology, Inc. Microphone mute/unmute notification
US10360926B2 (en) * 2014-07-10 2019-07-23 Analog Devices Global Unlimited Company Low-complexity voice activity detection
CN107679506A (zh) * 2017-10-12 2018-02-09 Tcl通力电子(惠州)有限公司 智能产品的唤醒方法、智能产品及计算机可读存储介质
CN108257616A (zh) * 2017-12-05 2018-07-06 苏州车萝卜汽车电子科技有限公司 人机对话的检测方法以及装置
CN109509471A (zh) * 2018-12-28 2019-03-22 浙江百应科技有限公司 一种基于vad算法打断智能语音机器人对话的方法
CN110265065B (zh) * 2019-05-13 2021-08-03 厦门亿联网络技术股份有限公司 一种构建语音端点检测模型的方法及语音端点检测系统
CN110534109B (zh) * 2019-09-25 2021-12-14 深圳追一科技有限公司 语音识别方法、装置、电子设备及存储介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005114576A1 (ja) * 2004-05-21 2005-12-01 Asahi Kasei Kabushiki Kaisha 動作内容判定装置
CN102682273A (zh) * 2011-03-18 2012-09-19 夏普株式会社 嘴唇运动检测设备和方法
CN103617801A (zh) * 2013-12-18 2014-03-05 联想(北京)有限公司 语音检测方法、装置及电子设备
CN103745723A (zh) * 2014-01-13 2014-04-23 苏州思必驰信息科技有限公司 一种音频信号识别方法及装置
CN108573701A (zh) * 2017-03-14 2018-09-25 谷歌有限责任公司 基于唇部检测的查询端点化
CN110033790A (zh) * 2017-12-25 2019-07-19 卡西欧计算机株式会社 声音认识装置、机器人、声音认识方法以及记录介质
CN109817211A (zh) * 2019-02-14 2019-05-28 珠海格力电器股份有限公司 一种电器控制方法、装置、存储介质及电器

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4064284A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113873191A (zh) * 2021-10-12 2021-12-31 苏州万店掌软件技术有限公司 一种基于语音的视频回溯方法、装置及系统
CN113873191B (zh) * 2021-10-12 2023-11-28 苏州万店掌软件技术有限公司 一种基于语音的视频回溯方法、装置及系统
CN114171016A (zh) * 2021-11-12 2022-03-11 北京百度网讯科技有限公司 语音交互的方法、装置、电子设备及存储介质
CN114171016B (zh) * 2021-11-12 2022-11-25 北京百度网讯科技有限公司 语音交互的方法、装置、电子设备及存储介质
CN114267345A (zh) * 2022-02-25 2022-04-01 阿里巴巴达摩院(杭州)科技有限公司 模型训练方法、语音处理方法及其装置

Also Published As

Publication number Publication date
EP4064284A4 (en) 2022-11-30
US20220310095A1 (en) 2022-09-29
CN112567457B (zh) 2021-12-10
EP4064284A1 (en) 2022-09-28
CN112567457A (zh) 2021-03-26

Similar Documents

Publication Publication Date Title
WO2021114224A1 (zh) 语音检测方法、预测模型的训练方法、装置、设备及介质
RU2766255C1 (ru) Способ голосового управления и электронное устройство
CN110134316B (zh) 模型训练方法、情绪识别方法及相关装置和设备
CN110910872B (zh) 语音交互方法及装置
WO2020168929A1 (zh) 对特定路线上的特定位置进行识别的方法及电子设备
CN110798506B (zh) 执行命令的方法、装置及设备
WO2022052776A1 (zh) 一种人机交互的方法、电子设备及系统
JP7252327B2 (ja) 人間とコンピュータとの相互作用方法および電子デバイス
WO2021258797A1 (zh) 图像信息输入方法、电子设备及计算机可读存储介质
WO2021254411A1 (zh) 意图识别方法和电子设备
WO2021057537A1 (zh) 一种卡顿预测的方法、数据处理的方法以及相关装置
CN114242037A (zh) 一种虚拟人物生成方法及其装置
WO2022179604A1 (zh) 一种分割图置信度确定方法及装置
CN111222836A (zh) 一种到站提醒方法及相关装置
WO2022143258A1 (zh) 一种语音交互处理方法及相关装置
WO2021238371A1 (zh) 生成虚拟角色的方法及装置
CN115113751A (zh) 调整触摸手势的识别参数的数值范围的方法和装置
CN114691839A (zh) 一种意图槽位识别方法
CN114822543A (zh) 唇语识别方法、样本标注方法、模型训练方法及装置、设备、存储介质
WO2022007757A1 (zh) 跨设备声纹注册方法、电子设备及存储介质
CN116391212A (zh) 一种防止手势误识别的方法及电子设备
WO2023098467A1 (zh) 语音解析方法、电子设备、可读存储介质及芯片系统
WO2023236908A1 (zh) 图像描述方法、电子设备及计算机可读存储介质
WO2024082914A1 (zh) 视频问答方法及电子设备
WO2020253694A1 (zh) 一种用于识别音乐的方法、芯片和终端

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19956031

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019956031

Country of ref document: EP

Effective date: 20220624

NENP Non-entry into the national phase

Ref country code: DE