CN110837758B - Keyword input method and device and electronic equipment

Info

Publication number
CN110837758B
Authority
CN
China
Prior art keywords
keyword
confidence
weighted
preset
lip
Legal status
Active
Application number
CN201810939640.7A
Other languages
Chinese (zh)
Other versions
CN110837758A
Inventor
董勤波
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201810939640.7A
Publication of CN110837758A
Application granted
Publication of CN110837758B

Classifications

    • G06V 40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06F 18/21: Pattern recognition; design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G10L 15/08: Speech recognition; speech classification or search
    • G10L 15/25: Speech recognition using non-acoustical features, using position of the lips, movement of the lips or face analysis
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The embodiment of the invention provides a keyword input method and device and an electronic device. The method comprises the following steps: acquiring an audio signal input by a user and a video signal captured while the user inputs the audio signal, wherein the video signal comprises a lip video image; performing keyword recognition on the audio signal to obtain a first keyword and the confidence of the first keyword; performing lip-reading recognition on the lip video image to obtain a second keyword and the confidence of the second keyword; determining a weighted confidence of the first keyword according to a relative quality and the confidence of the first keyword, and determining a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, wherein the relative quality represents the signal quality of the audio signal relative to the signal quality of the video signal; and taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword. The accuracy of keyword input can thereby be effectively improved.

Description

Keyword input method and device and electronic equipment
Technical Field
The present invention relates to the field of audio technologies, and in particular, to a keyword input method, a keyword input device, and an electronic device.
Background
Keyword recognition technology converts vocabulary in a user's speech into computer-readable input and can be used to realize intelligent human-machine interaction. Keyword recognition identifies a small number of specific words in the user's continuous speech and determines them as the keywords of that speech. However, limited by the accuracy of keyword recognition techniques, the recognized keywords may not be highly accurate.
In the prior art, in order to improve the accuracy of the keyword recognition result, when it is detected that a user is inputting speech, a camera captures the user's face to obtain a face video. Keyword recognition is performed on the speech input by the user to obtain a keyword recognition result, the user's lips are located in the face video for lip-reading recognition to obtain a lip-reading recognition result, and whichever of the two results has the higher confidence is taken as the recognition result. Here the confidence of the keyword recognition result depends on the similarity between the speech features and an acoustic model during keyword recognition, and the confidence of the lip-reading recognition result depends on the similarity between the lip features and a preset template during lip-reading recognition.
However, in practical application scenarios there may be factors that interfere with speech input and/or with camera capture, such as background noise and insufficient light. These factors can cause the confidence of the keyword recognition result and/or the lip-reading recognition result to deviate considerably from the actual likelihood, so the obtained recognition result may be inaccurate.
Disclosure of Invention
The embodiment of the invention aims to provide a keyword input method, a keyword input device and an electronic device, so as to adaptively adjust weights according to the signal quality of the audio signal and the video signal and improve the accuracy of keyword input. The specific technical scheme is as follows:
In a first aspect of an embodiment of the present invention, there is provided a keyword input method, including:
acquiring an audio signal input by a user and a video signal acquired during the period of inputting the audio signal by the user, wherein the video signal comprises a lip video image of the user;
performing keyword recognition on the audio signal to obtain a first keyword and the confidence of the first keyword, wherein the confidence of the first keyword indicates the degree of confidence that the keyword input by the user is the first keyword;
performing lip-reading recognition on the lip video image to obtain a second keyword and the confidence of the second keyword, wherein the confidence of the second keyword indicates the degree of confidence that the keyword input by the user is the second keyword;
determining a weighted confidence of the first keyword according to a relative quality and the confidence of the first keyword, and determining a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, wherein the relative quality represents how good the signal quality of the audio signal is relative to the signal quality of the video signal, the weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality;
and taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.
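As an illustration only, the fusion step of the first aspect can be sketched as follows; this is a minimal Python sketch, not the patented implementation: the interpolation-factor weighting rule is one admissible choice consistent with the stated monotonicity constraints, and all names are illustrative.

```python
def fuse_keywords(kw_first, conf_first, kw_second, conf_second, relative_quality):
    """Select the input keyword from the audio-recognition candidate (first
    keyword) and the lip-reading candidate (second keyword).

    relative_quality: signal quality of the audio signal divided by the
    signal quality of the video signal (larger = audio relatively better).
    """
    # One weighting rule satisfying the stated constraints: the first weight
    # alpha lies in (0, 1) and increases with the relative quality.
    alpha = relative_quality / (1.0 + relative_quality)
    weighted_first = alpha * conf_first            # positively correlated with relative quality
    weighted_second = (1.0 - alpha) * conf_second  # negatively correlated with relative quality
    # Take whichever keyword has the larger weighted confidence as the input keyword.
    return kw_first if weighted_first >= weighted_second else kw_second

# Noisy audio (relative quality 0.5) lets the lip-reading result win despite
# its lower raw confidence: prints "land".
print(fuse_keywords("rise", 0.9, "land", 0.8, relative_quality=0.5))
```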
With reference to the first aspect, in a first possible implementation manner, after the determining a weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, the method further includes:
Determining whether a larger value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is larger than a first preset confidence coefficient threshold value;
and if the larger value is larger than the first preset confidence threshold value, executing the step of taking the keyword with larger weighted confidence in the first keyword and the second keyword as the input keyword.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner, after the determining a weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, the method further includes:
and if the larger value is not larger than the first preset confidence threshold value, determining that the keyword is not recognized.
With reference to the first possible implementation manner of the first aspect, in a third possible implementation manner, before determining whether a larger value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is greater than a first preset confidence threshold value, the method further includes:
Determining whether the first keyword and the second keyword are consistent;
and if the first keyword is inconsistent with the second keyword, executing the step of determining whether the larger value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is larger than a first preset confidence coefficient threshold value.
If the first keyword is consistent with the second keyword, determining whether the larger value is greater than a second preset confidence threshold value, wherein the second preset confidence threshold value is smaller than the first preset confidence threshold value;
and if the larger value is larger than the second preset confidence threshold, taking the first keyword or the second keyword as an input keyword.
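The first through third implementations above combine into a single decision rule; a minimal sketch follows, where the threshold values are illustrative assumptions rather than values from the patent.

```python
def decide(kw_first, wconf_first, kw_second, wconf_second,
           first_threshold=0.8, second_threshold=0.6):
    """Apply the consistency check and the two preset confidence thresholds.

    second_threshold is smaller than first_threshold; returns the input
    keyword, or None when no keyword is recognised.
    """
    larger = max(wconf_first, wconf_second)
    if kw_first == kw_second:
        # Consistent results corroborate each other, so the lower
        # second preset confidence threshold applies.
        return kw_first if larger > second_threshold else None
    # Inconsistent results must clear the stricter first preset threshold.
    if larger > first_threshold:
        return kw_first if wconf_first >= wconf_second else kw_second
    return None  # keyword not recognised
```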
With reference to the first aspect, in a fourth possible implementation manner, the performing keyword recognition on the audio signal includes:
inputting the audio signal into a preset detection neural network to remove the noise signal and the mute signal in the audio signal and obtain the speech signal of the audio signal, wherein the detection neural network is trained in advance with a plurality of sample audio signals, and each sample audio signal is labeled in advance with its speech signal;
inputting the speech signal into a preset recognition neural network, wherein the recognition neural network is trained in advance with a plurality of sample speech signals, and each sample speech signal is labeled in advance with its corresponding speech content;
and acquiring the first keyword output by the recognition neural network and the confidence of the first keyword; or, acquiring the speech content output by the recognition neural network, and performing keyword recognition on the speech content to obtain the first keyword and the confidence of the first keyword.
With reference to the first aspect, in a fifth possible implementation manner, the performing lip language recognition on the lip video image includes:
inputting the lip video image into a preset lip-reading neural network, wherein the lip-reading neural network is trained in advance with a plurality of sample lip video images, and each sample lip video image is labeled in advance with its corresponding keyword;
and obtaining the keyword output by the lip-reading neural network and the confidence of that keyword as the second keyword and the confidence of the second keyword.
With reference to the first aspect, in a sixth possible implementation manner, the method is applied to an unmanned aerial vehicle controller, and after taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword, the method further includes:
Acquiring a control instruction corresponding to the input keyword;
and controlling the unmanned aerial vehicle bound by the unmanned aerial vehicle controller to execute the control instruction.
With reference to the first aspect, in a seventh possible implementation manner, the determining a weighted confidence of the first keyword according to a confidence of the first keyword and the relative quality, and determining a weighted confidence of the second keyword according to a confidence of the second keyword and the relative quality includes:
determining a first weight and a second weight according to a preset weighting rule, wherein the first weight is positively correlated with the relative quality and the second weight is negatively correlated with the relative quality;
calculating a product of the first weight and the confidence of the first keyword as a weighted confidence of the first keyword, and calculating a product of the second weight and the confidence of the second keyword as a weighted confidence of the second keyword.
With reference to the seventh possible implementation manner of the first aspect, in an eighth possible implementation manner, the determining, according to a preset weighting rule, the first weight and the second weight includes:
determining an interpolation factor α according to a preset weighting rule, wherein α is positively correlated with the relative quality, 0 < α < 1, and the relative quality is the ratio of the signal quality of the audio signal to the signal quality of the video signal;
taking α as the first weight;
and taking 1 - α as the second weight.
In a second aspect of the embodiment of the present invention, there is provided a keyword input apparatus, the apparatus including:
the system comprises a signal acquisition module, a video acquisition module and a display module, wherein the signal acquisition module is used for acquiring an audio signal input by a user and a video signal acquired during the period of inputting the audio signal by the user, and the video signal comprises a lip video image of the user;
the keyword recognition module is used for recognizing keywords of the audio signal to obtain a first keyword and the confidence coefficient of the first keyword, wherein the confidence coefficient of the first keyword is used for representing the confidence degree that the keywords input by the user are the first keyword;
the lip identification module is used for carrying out lip identification on the lip video image to obtain a second keyword and the confidence coefficient of the second keyword, wherein the confidence coefficient of the second keyword is used for indicating the confidence degree that the keyword input by the user is the second keyword;
a command decision module, configured to determine a weighted confidence of the first keyword according to a relative quality and the confidence of the first keyword, and determine a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, where the relative quality represents how good the signal quality of the audio signal is relative to the signal quality of the video signal, the weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality; and to take whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.
With reference to the second aspect, in a first possible implementation manner, the command decision module is further configured to determine, after the determining, according to the relative quality and the confidence level of the first keyword, a weighted confidence level of the first keyword, and determining, according to the relative quality and the confidence level of the second keyword, a weighted confidence level of the second keyword, whether a larger value of the weighted confidence level of the first keyword and the weighted confidence level of the second keyword is greater than a first preset confidence threshold;
and if the larger value is larger than the first preset confidence threshold value, executing the step of taking the keyword with larger weighted confidence in the first keyword and the second keyword as the input keyword.
With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner, the command decision module is further configured to determine, after determining whether the larger value of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold, that no keyword is recognized if the larger value is not greater than the first preset confidence threshold.
With reference to the first possible implementation manner of the second aspect, in a third possible implementation manner, the command decision module is further configured to determine, before determining whether a larger value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is greater than a first preset confidence threshold, whether the first keyword and the second keyword are consistent;
if the first keyword is inconsistent with the second keyword, executing the step of determining whether the larger value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is larger than a first preset confidence coefficient threshold value;
if the first keyword is consistent with the second keyword, determining whether the larger value is greater than a second preset confidence threshold value, wherein the second preset confidence threshold value is smaller than the first preset confidence threshold value;
and if the larger value is larger than the second preset confidence threshold value, taking the first keyword or the second keyword as an input keyword.
With reference to the second aspect, in a fourth possible implementation manner, the keyword recognition module is specifically configured to input the audio signal to a preset detection neural network, so as to remove noise signals and mute signals in the audio signal to obtain a voice signal of the audio signal, where the detection neural network is trained by a plurality of sample audio signals in advance, and for each sample audio signal, the voice signal of the sample audio signal is used in advance for labeling;
input the speech signal into a preset recognition neural network, where the recognition neural network is trained in advance with a plurality of sample speech signals, and each sample speech signal is labeled in advance with its corresponding speech content;
and acquiring the first keyword output by the identification neural network and the confidence coefficient of the first keyword.
With reference to the second aspect, in a fifth possible implementation manner, the lip recognition module is specifically configured to input the lip video image into a preset lip-reading neural network, where the lip-reading neural network is trained in advance with a plurality of sample lip video images, and each sample lip video image is labeled in advance with its corresponding keyword;
and obtaining the keywords output by the lip language neural network and the confidence degrees of the keywords as second keywords and the confidence degrees of the second keywords.
With reference to the second aspect, in a sixth possible implementation manner, the device is applied to an unmanned aerial vehicle controller, and the command decision module, after taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword, acquires the control instruction corresponding to the input keyword and controls the unmanned aerial vehicle bound to the unmanned aerial vehicle controller to execute the control instruction.
With reference to the second aspect, in a seventh possible implementation manner, the command determining module is specifically configured to determine, according to a preset weighting rule, a first weight and a second weight, where the first weight is positively related to a relative quality, and the second weight is negatively related to the relative quality;
calculating a product of the first weight and the confidence of the first keyword as a weighted confidence of the first keyword, and calculating a product of the second weight and the confidence of the second keyword as a weighted confidence of the second keyword.
With reference to the seventh possible implementation manner of the second aspect, in an eighth possible implementation manner, the command decision module is specifically configured to determine an interpolation factor α according to a preset weighting rule, where α is positively correlated with a relative quality, and α is greater than 0 and less than 1, and the relative quality is a ratio of a signal quality of the audio signal to a signal quality of the video signal;
taking alpha as a first weight;
1-alpha is taken as the second weight.
In a third aspect of the embodiment of the present invention, there is provided an electronic device including:
a memory for storing a computer program;
And the processor is used for realizing any one of the keyword input methods when executing the programs stored in the memory.
In a fourth aspect of the embodiments of the present invention, there is provided a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements any one of the keyword input methods described above.
According to the keyword input method and device and the electronic device provided by the embodiments of the invention, the first weight and the second weight can be adaptively adjusted according to the signal quality of the audio signal and the video signal: when the signal quality of the audio signal is relatively good, the weighted confidence of the first keyword is raised, and when the signal quality of the video signal is relatively good, the weighted confidence of the second keyword is raised. The weighted confidence of a keyword can therefore better reflect the likelihood that the audio signal input by the user includes that keyword, effectively improving the accuracy of keyword input. Of course, it is not necessary for any product or method embodying the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a keyword input method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of another keyword input method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of another keyword input method according to an embodiment of the present invention;
fig. 4 is a schematic flow chart of a control method of an unmanned aerial vehicle according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a neural network according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a keyword input device according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a keyword input method according to an embodiment of the present invention, which may include:
S101, acquiring an audio signal input by a user and acquiring a video signal acquired during the period when the audio signal is input by the user, wherein the video signal comprises a lip video image of the user.
The method can be applied to an electronic device with video capture and audio recording functions, such as a mobile terminal equipped with a microphone and a camera. After detecting that a user is inputting an audio signal, the mobile terminal can start the camera to collect a video signal. The collected video signal may include only lip video images; alternatively, the video signal may be obtained by the camera based on a face recognition technique, in which case the video signal may also include video images of other areas of the user's face besides the lip video images.
S102, keyword recognition is carried out on the audio signal, and a first keyword and the confidence of the first keyword are obtained.
The confidence of the first keyword indicates the degree of confidence that the keyword input by the user is the first keyword. Keyword recognition may proceed by removing the non-speech signal from the audio signal to obtain the speech signal, extracting acoustic features from the speech signal, matching the extracted acoustic features against the acoustic models of preset keywords to determine the keyword corresponding to the extracted acoustic features, taking that keyword as the first keyword, and calculating the confidence of the first keyword according to the degree of matching between the acoustic features and the acoustic model of the keyword.
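A minimal sketch of the matching-based variant just described: the feature extraction is stubbed out, and cosine similarity with a softmax normalisation stands in for the unspecified matching degree and confidence computation; all names are illustrative.

```python
import numpy as np

def spot_keyword(features, acoustic_models):
    """Match acoustic features extracted from the speech signal against the
    acoustic model of each preset keyword; return the best keyword and a
    confidence derived from the matching degree."""
    keywords = list(acoustic_models)
    # Cosine similarity as a stand-in for the matching degree.
    sims = np.array([
        np.dot(features, acoustic_models[kw]) /
        (np.linalg.norm(features) * np.linalg.norm(acoustic_models[kw]))
        for kw in keywords
    ])
    confidences = np.exp(sims) / np.exp(sims).sum()  # normalise scores to (0, 1)
    best = int(np.argmax(confidences))
    return keywords[best], float(confidences[best])

# Toy acoustic models (reference feature vectors) for two preset keywords.
models = {"rise": np.array([1.0, 0.2, 0.1]), "land": np.array([0.1, 0.9, 0.3])}
print(spot_keyword(np.array([0.9, 0.3, 0.2]), models))  # ('rise', ...)
```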
Further, in an alternative embodiment, removing the non-speech signal from the audio signal may be performed by a preset detection neural network, where the detection neural network is trained in advance with a plurality of sample audio signals, and each sample audio signal is labeled in advance with its speech signal; for example, each data frame in the sample audio signal is labeled in advance to indicate whether that frame belongs to the speech signal. Keyword recognition may then be performed by a preset recognition neural network, as described below.
After training, the detection neural network can remove the non-speech signal from an input audio signal to obtain the corresponding speech signal. It will be appreciated that, while the user is inputting an audio signal, the audio signal may contain a noise signal due to a noisy surrounding environment or unstable current in the recording device; moreover, the user does not necessarily speak throughout that period, so the audio signal also contains mute segments with no human voice. Neither is the signal the user intends to input, so removing the noise signal and the mute signal from the audio signal yields a purer speech signal for subsequent analysis and processing. With the detection neural network, the noise signal and the mute signal can be removed from the audio signal more accurately, based on deep learning.
After the speech signal is obtained, keyword recognition may be performed on it using a preset recognition neural network. The speech signal may be input into the recognition neural network, which is trained in advance with a plurality of sample speech signals, each labeled in advance with its corresponding speech content. In this way, an end-to-end mapping from the speech signal to keywords can be realized, without first recognizing all the characters in the speech signal and then determining the keywords from the recognized characters.
S103, lip language identification is carried out on the lip video image, and the second keywords and the confidence degrees of the second keywords are obtained.
The confidence level of the second keyword is used for indicating the confidence level that the keyword input by the user is the second keyword. It will be appreciated that if the video signal is obtained by capturing a face image of a user, the lip region may be located based on a lip recognition technique in the captured face video image, and the video image of the lip region may be referred to as a lip video image.
In an alternative embodiment, a motion mode of a lip in a lip video image may be analyzed, lip features are extracted according to the motion mode of the lip, the extracted lip features are matched with a preset lip recognition model, so as to determine a keyword corresponding to the extracted lip features, the keyword is used as a second keyword, and the confidence of the second keyword is calculated according to the matching degree of the lip features and the lip recognition model of the keyword. However, since the number of the preset lip language recognition models is limited, the lip language recognition method is not flexible enough and may have larger errors.
In another alternative embodiment, the lip video image may be input to a preset lip recognition neural network. The lip language identification neural network is trained by a plurality of sample lip language video images in advance, each sample lip language video image is labeled by using a keyword corresponding to the sample lip language video image in advance, and for example, the lip language video image of a related person can be collected as one sample lip language video image during the period that the related person recites a certain keyword, and the sample lip language video image is labeled by using the keyword which the related person recites.
After the lip language identification neural network is trained in advance, the end-to-end mapping from the lip language video image to the keywords can be realized, and the corresponding keywords and the confidence degrees of the keywords can be output according to the input lip language video image. And taking the keyword as a second keyword, wherein the confidence coefficient of the keyword is the confidence coefficient of the second keyword. Compared with the method using the lip language identification model, the lip language identification network can perform deep learning based on a large number of sample lip language video images, and further the accuracy of lip language identification is improved by adjusting network parameters to better approximate the mapping between the real lip language video images and keywords.
It can be understood that fig. 1 is only a schematic flow chart of the keyword input method provided in the embodiment of the present invention, and in other embodiments, S103 may be performed before S102, or may be performed synchronously or alternately with S102.
S104, determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword.
The relative quality represents how good the signal quality of the audio signal is relative to the signal quality of the video signal; the weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality.
Further, in an alternative embodiment, the relative quality may be the ratio of the signal quality of the audio signal to the signal quality of the video signal: the higher the relative quality, the better the signal quality of the audio signal relative to that of the video signal, and the lower the relative quality, the worse. That the weighted confidence of the first keyword is positively correlated with the relative quality means that, with all other parameters determining its value held fixed, the greater the relative quality, the greater the weighted confidence of the first keyword. Similarly, that the weighted confidence of the second keyword is negatively correlated with the relative quality means that, with all other parameters determining its value held fixed, the greater the relative quality, the smaller the weighted confidence of the second keyword.
In an alternative embodiment, the first weight and the second weight may be determined according to a preset weighting rule, where the first weight is positively correlated with the relative quality, the second weight is negatively correlated with the relative quality, a product of the first weight and the confidence of the first keyword is calculated as a weighted confidence of the first keyword, and a product of the second weight and the confidence of the second keyword is calculated as a weighted confidence of the second keyword. Because the first weight is positively correlated with the relative quality of the audio signal, the weighted confidence of the first keyword can reflect not only the matching degree of the acoustic characteristics of the audio signal and the acoustic model, but also the quality of the signal quality of the audio signal relative to the signal quality of the video signal. Similarly, the weighted confidence of the second keyword can reflect the quality of the signal quality of the video signal relative to the quality of the signal quality of the audio signal.
On the premise that the first weight is positively correlated with the relative quality and the second weight is negatively correlated with the relative quality, the weighting rule can be set according to actual requirements. For example, a quality score Score_voice of the audio signal and a quality score Score_video of the video signal may be determined according to a preset quality scoring rule, with Score_voice/Score_video taken as the first weight and 1 - Score_voice/Score_video taken as the second weight, where the higher the signal quality, the higher the quality score, and the signal quality may be determined from the bit rate of the signal.
Further, in an alternative embodiment, an interpolation factor α may be determined according to a preset weighting rule, where α is positively correlated with the relative quality and the value interval of α is (0, 1); α is taken as the first weight and (1 - α) as the second weight. It will be appreciated that since α is positively correlated with the relative quality, (1 - α) is negatively correlated with the relative quality. Illustratively, α may be calculated by the following formula:

α = Score_voice / (Score_voice + Score_video)

which lies in (0, 1) and increases with the relative quality Score_voice / Score_video.
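A numeric illustration of this rule follows; the quality scores are hypothetical values, e.g. derived from bit rate.

```python
score_voice, score_video = 60.0, 90.0              # hypothetical quality scores
relative_quality = score_voice / score_video       # about 0.67: audio relatively worse
alpha = score_voice / (score_voice + score_video)  # 0.4, inside (0, 1)
first_weight, second_weight = alpha, 1.0 - alpha   # 0.4 for audio, 0.6 for lip reading
print(first_weight, second_weight)
```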
S105, taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.
It will be appreciated that if the signal quality of the audio signal is poor, the audio signal may contain speech content less complete than what the user actually spoke; that is, the audio signal may carry little useful information. Even if the acoustic features match the acoustic model to a high degree, for lack of sufficient useful information the keyword input by the user may be less trustworthy than the confidence of the first keyword suggests, so there can be a large gap between the confidence of the first keyword and the actual trustworthiness. Similarly, when the signal quality of the video signal is poor, the confidence of the second keyword may differ greatly from the actual trustworthiness. In such cases it is difficult to judge the relative magnitude of the two keywords' actual trustworthiness from their confidences alone: for example, the actual trustworthiness of the first keyword may exceed that of the second keyword while its confidence is lower, and judging by confidence alone would take the less trustworthy second keyword as the input keyword, lowering the accuracy of the input keyword.
In this embodiment, the weighted confidence effectively reflects how good the signal quality of the audio signal is relative to the signal quality of the video signal: when the signal quality of the audio signal is relatively better, the keyword recognition result is treated as more reliable, and when the signal quality of the video signal is relatively better, the lip-reading recognition result is treated as more reliable. The weighted confidence is thus closer to the actual trustworthiness, and selecting the keyword with the larger weighted confidence as the input keyword is more accurate.
Referring to fig. 2, fig. 2 is another flow chart of a keyword input method provided in an embodiment of the present invention, which may include:
S201, acquiring an audio signal input by a user and acquiring a video signal acquired during the period when the audio signal is input by the user, wherein the video signal comprises a lip video image of the user.
This step is the same as S101, and reference may be made to the foregoing description of S101, which is not repeated here.
S202, keyword recognition is carried out on the audio signal, and a first keyword and the confidence of the first keyword are obtained.
This step is the same as S102, and reference may be made to the foregoing description of S102, which is not repeated here.
S203, lip language identification is carried out on the lip video image, and the second keywords and the confidence degrees of the second keywords are obtained.
This step is the same as S103, and reference may be made to the foregoing description of S103, which is not repeated here.
S204, determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword.
This step is the same as S104, and reference may be made to the foregoing description of S104, which is not repeated here.
S205, determining whether a larger value of the weighted confidence of the first keyword and the weighted confidence of the second keyword is larger than a first preset confidence threshold, executing S206 if the larger value is larger than the first preset confidence threshold, and executing S207 if the larger value is not larger than the first preset confidence threshold.
The first preset confidence threshold can be set according to actual requirements: if it is set higher, the accuracy of keyword input is higher, and if it is set lower, the recognition rate of keyword input is higher, where the recognition rate refers to the probability of successfully recognizing a keyword from the audio signal or the video signal.
S206, using the keywords with larger weighting confidence degrees in the first keywords and the second keywords as input keywords.
This step is the same as S105, and reference may be made to the foregoing description of S105, which is not repeated here.
S207, determining that the keyword is not recognized.
If neither the weighted confidence of the first keyword nor that of the second keyword is greater than the first preset confidence threshold, the degree of confidence that either keyword is the keyword input by the user can be considered low, and to avoid inputting a wrong keyword it can be determined that no keyword is recognized this time. In other embodiments, if neither weighted confidence is greater than the first preset confidence threshold, the user may further be prompted that the keyword input has failed or is invalid.
Referring to fig. 3, fig. 3 is another flow chart of a keyword input method provided in an embodiment of the present invention, which may include:
S301, acquiring an audio signal input by a user and acquiring a video signal acquired during the period when the audio signal is input by the user, wherein the video signal comprises a lip video image of the user.
This step is the same as S101, and reference may be made to the foregoing description of S101, which is not repeated here.
S302, keyword recognition is carried out on the audio signal, and a first keyword and the confidence of the first keyword are obtained.
This step is the same as S102, and reference may be made to the foregoing description of S102, which is not repeated here.
S303, performing lip language identification on the lip video image to obtain a second keyword and the confidence of the second keyword.
This step is the same as S103, and reference may be made to the foregoing description of S103, which is not repeated here.
S304, determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword.
This step is the same as S104, and reference may be made to the foregoing description of S104, which is not repeated here.
S305, determining whether the first keyword and the second keyword are identical, if the first keyword and the second keyword are not identical, executing S306, and if the first keyword and the second keyword are identical, executing S307.
S306, determining whether a larger value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is larger than a first preset confidence coefficient threshold value, executing S308 if the larger value is larger than the first preset confidence coefficient threshold value, and executing S309 if the larger value is not larger than the first preset confidence coefficient threshold value.
This step is the same as S205, and reference may be made to the foregoing description of S205, which is not repeated here.
S307, determining whether the larger value of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a second preset confidence threshold, executing S310 if the larger value is greater than the second preset confidence threshold, and executing S309 if the larger value is not greater than the second preset confidence threshold.
The second preset confidence threshold is smaller than the first preset confidence threshold. Suppose the weighted confidence of the first keyword is 0.8 and the weighted confidence of the second keyword is 0.7. If the first keyword and the second keyword are identical, and the two recognition results are treated as independent, the theoretical probability that neither the first keyword nor the second keyword is the keyword input by the user is 0.06, so the theoretical probability that the first keyword and the second keyword are the keyword input by the user is 0.94, greater than the weighted confidence of either keyword alone. It can be seen that when the first keyword is consistent with the second keyword, the actual trustworthiness of the keywords is higher than either weighted confidence; the accuracy of keyword input can then be considered assured, and the confidence threshold can be lowered appropriately to improve the recognition rate of keywords.
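Treating the two weighted confidences as independent probabilities, the figures quoted above follow from:

```latex
P(\text{correct}) = 1 - (1 - 0.8)\,(1 - 0.7) = 1 - 0.06 = 0.94
```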
And S308, taking the keywords with larger weighting confidence degrees of the first keywords and the second keywords as input keywords.
This step is the same as S105, and reference may be made to the foregoing description of S105, which is not repeated here.
S309, determining that the keyword is not recognized.
This step is the same as S207, and reference may be made to the foregoing description of S207, which is not repeated here.
S310, taking the first keyword or the second keyword as an input keyword.
Since the first keyword and the second keyword are identical, there is no substantial difference in whether the first keyword or the second keyword is used as an input keyword.
Referring to fig. 4, fig. 4 is a schematic flow chart of a control method of an unmanned aerial vehicle according to an embodiment of the present invention, where the method is applied to an unmanned aerial vehicle controller, and may include:
S401, acquiring an audio signal input by a user and acquiring a video signal acquired during the period when the audio signal is input by the user, wherein the video signal comprises a lip video image of the user.
This step is the same as S101, and reference may be made to the foregoing description of S101, which is not repeated here.
S402, keyword recognition is carried out on the audio signal, and a first keyword and the confidence of the first keyword are obtained.
This step is the same as S102, and reference may be made to the foregoing description of S102, which is not repeated here.
S403, performing lip language identification on the lip video image to obtain a second keyword and the confidence of the second keyword.
This step is the same as S103, and reference may be made to the foregoing description of S103, which is not repeated here.
S404, determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword.
This step is the same as S104, and reference may be made to the foregoing description of S104, which is not repeated here.
S405, it is determined whether the first keyword and the second keyword are identical, if the first keyword and the second keyword are not identical, S406 is performed, and if the first keyword and the second keyword are identical, S407 is performed.
This step is the same as S305, and reference may be made to the foregoing description of S305, which is not repeated here.
S406, determining whether a larger value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is larger than a first preset confidence coefficient threshold value, executing S408 if the larger value is larger than the first preset confidence coefficient threshold value, and executing S409 if the larger value is not larger than the first preset confidence coefficient threshold value.
This step is the same as S205, and reference may be made to the foregoing description of S205, which is not repeated here.
S407, determining whether the larger value of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a second preset confidence threshold, executing S410 if the larger value is greater than the second preset confidence threshold, and executing S409 if the larger value is not greater than the second preset confidence threshold.
This step is the same as S307, and reference may be made to the foregoing description of S307, which is not repeated here.
And S408, taking the keywords with larger weighting confidence degrees of the first keywords and the second keywords as input keywords.
This step is the same as S105, and reference may be made to the foregoing description of S105, which is not repeated here.
S409, determining that the keyword is not recognized.
This step is the same as S207, and reference may be made to the foregoing description of S207, which is not repeated here.
S410, using the first keyword or the second keyword as an input keyword.
This step is the same as S310, and reference may be made to the foregoing description of S310, which is not repeated here.
S411, acquiring a control instruction corresponding to the input keyword.
The correspondence between control instructions and keywords is preset and may be stored in the memory of the unmanned aerial vehicle controller in the form of a mapping table; further, the correspondence may be changed according to the actual requirements of the user. For example, assuming a correspondence is established in advance between the keyword "rise" and a control instruction for increasing the flight height of the unmanned aerial vehicle, when the input keyword is "rise", the control instruction for increasing the flight height of the unmanned aerial vehicle is acquired.
S412, controlling the unmanned aerial vehicle bound by the unmanned aerial vehicle controller to execute the control instruction.
The unmanned aerial vehicle controller can be a mobile terminal or a remote control server that has recording and shooting functions, or that is externally connected to equipment with these two functions; for example, the unmanned aerial vehicle controller can be a smartphone running an unmanned aerial vehicle control program. In the prior art, a user can control the unmanned aerial vehicle intelligently through voice, but the unmanned aerial vehicle often works outdoors, where large background noise may exist, so the keyword recognition result is not accurate enough and the user cannot accurately control the unmanned aerial vehicle by voice. By combining keyword recognition with lip-reading recognition and weighting the confidences according to signal quality, this embodiment lets the user input keywords more accurately and thus control the unmanned aerial vehicle through keywords, effectively improving the accuracy of controlling the unmanned aerial vehicle in noisy scenes.
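A minimal sketch of the mapping-table lookup in S411 and S412 follows; the keyword strings, instruction names, and the transport callback are all hypothetical.

```python
# Hypothetical preset correspondence between keywords and control instructions.
COMMAND_TABLE = {
    "rise": "INCREASE_ALTITUDE",
    "descend": "DECREASE_ALTITUDE",
    "hover": "HOLD_POSITION",
}

def execute_keyword(input_keyword, send_to_drone):
    """Acquire the control instruction corresponding to the input keyword and
    have the bound unmanned aerial vehicle execute it; send_to_drone is the
    controller's transport callback."""
    instruction = COMMAND_TABLE.get(input_keyword)
    if instruction is not None:  # ignore keywords with no preset instruction
        send_to_drone(instruction)

execute_keyword("rise", print)  # prints INCREASE_ALTITUDE
```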
The network structures of the detection neural network, the recognition neural network and the lip-reading neural network provided by the embodiments of the invention can be set according to actual requirements, for example as a recurrent neural network, a convolutional neural network or a deep neural network. The network structures of the three networks may be the same or different. Further, referring to fig. 5, fig. 5 is a schematic structural diagram of a neural network provided by an embodiment of the present invention; the detection neural network, the recognition neural network and the lip-reading neural network may each use this structure, which comprises an Input Layer 510, a Hidden Layer 520 and an Output Layer 530. The input layer 510 may include a plurality of input layer neurons 511, each used to input one feature; illustratively, in the lip-reading neural network, each neuron of the input layer may be used to input one feature of the lip video image. The hidden layer 520 is used to nonlinearly map the features input through the input layer 510; it may consist of one layer of hidden layer neurons 521 or, in other embodiments, of multiple layers of hidden neurons. The output layer may consist of one output node 531 and, in other embodiments, may include a plurality of output nodes.
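The Fig. 5 topology reduces to a standard feed-forward pass; a minimal numpy sketch follows, where the layer sizes and the tanh nonlinearity are illustrative assumptions.

```python
import numpy as np

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One pass through the Fig. 5 structure: input layer neurons feed a
    single hidden layer that nonlinearly maps the features, ending in one
    output node."""
    hidden = np.tanh(x @ w_hidden + b_hidden)  # hidden layer 520
    return hidden @ w_out + b_out              # single output node 531

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # 4 input-layer neurons, one feature each
w_h, b_h = rng.normal(size=(4, 8)), np.zeros(8)
w_o, b_o = rng.normal(size=(8, 1)), np.zeros(1)
print(forward(x, w_h, b_h, w_o, b_o))           # a single scalar output
```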
Referring to fig. 6, fig. 6 is a schematic structural diagram of a keyword input device according to an embodiment of the present invention, which may include:
the signal acquisition module 601 is configured to acquire an audio signal input by a user and a video signal acquired during the period when the audio signal is input by the user, where the video signal includes a lip video image of the user;
the keyword recognition module 602 is configured to perform keyword recognition on the audio signal to obtain a first keyword and a confidence level of the first keyword, where the confidence level of the first keyword is used to represent a confidence level that a keyword input by a user is the first keyword;
the lip recognition module 603 is configured to perform lip-reading recognition on the lip video image to obtain a second keyword and the confidence of the second keyword, where the confidence of the second keyword is used to indicate the degree of confidence that the keyword input by the user is the second keyword;
a command decision module 604, configured to determine a weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determine a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, where the relative quality represents how good the signal quality of the audio signal is relative to the signal quality of the video signal, the weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality; and to take whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.
Further, the command decision module 604 is further configured to, after the weighted confidences of the first keyword and the second keyword are determined as above, determine whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold;
and if the larger value is greater than the first preset confidence threshold, execute the step of taking the keyword with the larger weighted confidence of the first keyword and the second keyword as the input keyword.
Further, the command decision module 604 is further configured to determine that no keyword is recognized if the larger value is not greater than the first preset confidence threshold.
Further, the command decision module 604 is further configured to determine, before the threshold comparison above, whether the first keyword and the second keyword are consistent;
if the first keyword and the second keyword are inconsistent, execute the step of determining whether the larger of the two weighted confidences is greater than the first preset confidence threshold;
if the first keyword is consistent with the second keyword, determine whether the larger value is greater than a second preset confidence threshold, where the second preset confidence threshold is smaller than the first preset confidence threshold;
and if the larger value is greater than the second preset confidence threshold, take the first keyword or the second keyword as the input keyword.
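The decision flow described above can be summarized in the following sketch. The threshold values and the mapping from relative quality to the interpolation factor are assumptions for illustration; the embodiment only requires that the first weight rises, and the second weight falls, with the relative quality, and that the second threshold is smaller than the first:

```python
def decide_keyword(first_kw, first_conf, second_kw, second_conf,
                   relative_quality, first_threshold=0.8, second_threshold=0.6):
    """Sketch of the command decision module's flow.

    relative_quality: how good the audio signal quality is relative to the
    video signal quality (e.g. a ratio); the first weight rises and the
    second weight falls with it. Threshold values are illustrative.
    """
    alpha = relative_quality / (1.0 + relative_quality)  # assumed monotone map into (0, 1)
    weighted_first = alpha * first_conf                  # weighted confidence, audio branch
    weighted_second = (1.0 - alpha) * second_conf        # weighted confidence, lip branch
    larger = max(weighted_first, weighted_second)

    if first_kw == second_kw:
        # Consistent keywords only need to clear the smaller second threshold.
        return first_kw if larger > second_threshold else None
    # Inconsistent keywords must clear the larger first threshold.
    if larger > first_threshold:
        return first_kw if weighted_first >= weighted_second else second_kw
    return None  # keyword not recognized
```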
Further, the keyword recognition module 602 is specifically configured to input the audio signal into a preset detection neural network to remove noise signals and mute signals in the audio signal and obtain a speech signal of the audio signal, where the detection neural network is trained in advance with a plurality of sample audio signals, each sample audio signal being labeled in advance with its speech signal;
input the speech signal into a preset recognition neural network, where the recognition neural network is trained in advance with a plurality of sample speech signals, each sample speech signal being labeled in advance with its corresponding speech content;
and acquire the speech content output by the recognition neural network, and perform keyword recognition on the speech content to obtain the first keyword and the confidence of the first keyword.
Further, the lip recognition module 603 is specifically configured to input the lip video image into a preset lip language neural network, where the lip language neural network is trained in advance with a plurality of sample lip video images, each sample lip video image being labeled in advance with its corresponding keyword;
and obtain the keyword output by the lip language neural network and the confidence of the keyword as the second keyword and the confidence of the second keyword.
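Wiring the two recognition branches together, the modules above may be organized as in the following sketch, where the three network callables stand in for the trained networks and are assumptions for illustration:

```python
def recognize_audio_keyword(audio_signal, detection_net, recognition_net):
    """Audio branch: the detection network strips noise and mute segments,
    then the recognition branch yields the first keyword and its confidence."""
    speech_signal = detection_net(audio_signal)
    return recognition_net(speech_signal)   # -> (first_keyword, confidence)

def recognize_lip_keyword(lip_video_images, lip_net):
    """Lip branch: the lip language network maps the lip video images to
    the second keyword and its confidence."""
    return lip_net(lip_video_images)        # -> (second_keyword, confidence)
```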
Further, the device is applied to an unmanned aerial vehicle controller, and the command decision module 604 is further configured to, after the keyword with the larger weighted confidence of the first keyword and the second keyword is taken as the input keyword, obtain a control instruction corresponding to the input keyword, and control the unmanned aerial vehicle bound to the unmanned aerial vehicle controller to execute the control instruction.
Further, the command decision module 604 is specifically configured to determine, according to a preset weighting rule, a first weight and a second weight, where the first weight is positively related to the relative quality, and the second weight is negatively related to the relative quality;
The product of the first weight and the confidence of the first keyword is calculated as the weighted confidence of the first keyword, and the product of the second weight and the confidence of the second keyword is calculated as the weighted confidence of the second keyword.
Further, the command decision module 604 is specifically configured to determine an interpolation factor α according to a preset weighting rule, where α is positively correlated with the relative quality, α is greater than 0 and less than 1, and the relative quality is the ratio of the signal quality of the audio signal to the signal quality of the video signal;
take α as the first weight;
and take 1-α as the second weight.
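One possible weighting rule is sketched below. The normalization by reference bitrates and the concrete monotone mapping into (0, 1) are assumptions; the embodiment only fixes that α grows with the relative quality and lies between 0 and 1:

```python
def weights_from_bitrates(audio_bitrate, video_bitrate,
                          ref_audio=64_000, ref_video=1_000_000):
    """Derive the interpolation factor alpha from the two signal qualities.

    Signal quality is taken here as bitrate normalized by a reference
    bitrate (an illustrative choice), and the relative quality is their
    ratio, as required by the embodiment.
    """
    relative_quality = (audio_bitrate / ref_audio) / (video_bitrate / ref_video)
    alpha = relative_quality / (1.0 + relative_quality)  # monotone map into (0, 1)
    return alpha, 1.0 - alpha  # first weight and second weight
```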
The embodiment of the invention also provides an electronic device, as shown in fig. 7, which may include:
a memory 701 for storing a computer program;
the processor 702 is configured to execute the program stored in the memory 701, and implement the following steps:
acquiring an audio signal input by a user and a video signal acquired during the period of inputting the audio signal by the user, wherein the video signal comprises a lip video image of the user;
carrying out keyword recognition on the audio signal to obtain a first keyword and the confidence coefficient of the first keyword, wherein the confidence coefficient of the first keyword is used for representing the confidence degree that the keyword input by a user is the first keyword;
Performing lip language identification on the lip video image to obtain a second keyword and the confidence coefficient of the second keyword, wherein the confidence coefficient of the second keyword is used for indicating the confidence degree that the keyword input by the user is the second keyword;
determining a weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, where the relative quality is used for representing how good the signal quality of the audio signal is relative to the signal quality of the video signal, the weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality;
and taking the keywords with larger weighted confidence degrees of the first keywords and the second keywords as input keywords.
Further, after determining the weighted confidence of the first keyword based on the relative quality and the confidence of the first keyword, and determining the weighted confidence of the second keyword based on the relative quality and the confidence of the second keyword, the method further comprises:
determining whether a larger value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is larger than a first preset confidence coefficient threshold value;
And if the larger value is larger than a first preset confidence threshold value, executing the step of taking the keyword with larger weighted confidence in the first keyword and the second keyword as the input keyword.
Further, after determining whether the greater of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold, the method further includes:
if the larger value is not larger than the first preset confidence threshold value, determining that the keyword is not recognized.
Further, before determining whether the greater of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold, the method further includes:
determining whether the first keyword and the second keyword are consistent;
if the first keyword and the second keyword are not identical, the step of determining whether the greater of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold is performed.
If the first keyword is consistent with the second keyword, determining whether the larger value is larger than a second preset confidence threshold value, wherein the second preset confidence threshold value is smaller than the first preset confidence threshold value;
And if the larger value is larger than a second preset confidence threshold value, taking the first keyword or the second keyword as the input keyword.
Further, performing keyword recognition on the audio signal includes:
inputting the audio signal into a preset detection neural network to remove noise signals and mute signals in the audio signal and obtain a speech signal of the audio signal, where the detection neural network is trained in advance with a plurality of sample audio signals, each sample audio signal being labeled in advance with its speech signal;
inputting the speech signal into a preset recognition neural network, where the recognition neural network is trained in advance with a plurality of sample speech signals, each sample speech signal being labeled in advance with its corresponding speech content;
and acquiring the first keyword output by the recognition neural network and the confidence of the first keyword. Further, performing lip language recognition on the lip video image includes:
inputting the lip video image into a preset lip language neural network, where the lip language neural network is trained in advance with a plurality of sample lip video images, each sample lip video image being labeled in advance with its corresponding keyword;
and obtaining the keyword output by the lip language neural network and the confidence of the keyword as the second keyword and the confidence of the second keyword.
Further, when the method is applied to an unmanned aerial vehicle controller, after the keyword with the larger weighted confidence of the first keyword and the second keyword is taken as the input keyword, the method further comprises:
acquiring a control instruction corresponding to an input keyword;
and controlling the unmanned aerial vehicle bound by the unmanned aerial vehicle controller to execute the control instruction.
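The mapping from input keyword to control instruction can be as simple as a preset lookup table; the table entries and the send function below are hypothetical examples, not part of the embodiment:

```python
# Hypothetical keyword-to-instruction table for the unmanned aerial vehicle
# controller; the entries and the send function are illustrative assumptions.
CONTROL_TABLE = {
    "take off": "CMD_TAKEOFF",
    "land": "CMD_LAND",
    "hover": "CMD_HOVER",
}

def dispatch(input_keyword, send_to_drone):
    """Look up the control instruction for the input keyword and have the
    bound unmanned aerial vehicle execute it."""
    instruction = CONTROL_TABLE.get(input_keyword)
    if instruction is not None:
        send_to_drone(instruction)
```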
Further, determining a weighted confidence of the first keyword based on the relative quality and the confidence of the first keyword, and determining a weighted confidence of the second keyword based on the relative quality and the confidence of the second keyword, comprising:
determining a first weight and a second weight according to a preset weighting rule, wherein the first weight is positively related to the relative quality, and the second weight is negatively related to the relative quality;
the product of the first weight and the confidence of the first keyword is calculated as the weighted confidence of the first keyword, and the product of the second weight and the confidence of the second keyword is calculated as the weighted confidence of the second keyword.
Further, determining the first weight and the second weight according to a preset weighting rule includes:
determining an interpolation factor α according to a preset weighting rule, wherein α is positively correlated with the relative quality, α is greater than 0 and less than 1, and the relative quality is the ratio of the signal quality of the audio signal to the signal quality of the video signal;
taking α as the first weight, and taking 1-α as the second weight.
The memory mentioned in the electronic device may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), for example at least one magnetic disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, there is also provided a computer-readable storage medium having instructions stored therein, which when run on a computer, cause the computer to perform the keyword input method of any one of the above embodiments.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the keyword input method of any of the above embodiments.
In the above embodiments, the implementation may be wholly or partly realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be wholly or partly realized in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between these entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but possibly also other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, for the embodiments of the apparatus, the electronic device, the computer-readable storage medium, and the computer program product, the description is relatively brief since they are substantially similar to the method embodiments; for relevant details, see the corresponding parts of the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (18)

1. A keyword input method, the method comprising:
acquiring an audio signal input by a user and a video signal acquired during the period of inputting the audio signal by the user, wherein the video signal comprises a lip video image of the user;
carrying out keyword recognition on the audio signal to obtain a first keyword and the confidence coefficient of the first keyword, wherein the confidence coefficient of the first keyword is used for indicating the confidence degree that the keyword input by the user is the first keyword;
performing lip language identification on the lip video image to obtain a second keyword and the confidence coefficient of the second keyword, wherein the confidence coefficient of the second keyword is used for indicating the confidence degree that the keyword input by the user is the second keyword;
determining a weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, wherein the relative quality is a ratio of the signal quality of the audio signal to the signal quality of the video signal and is used for representing how good the signal quality of the audio signal is relative to the signal quality of the video signal, the signal quality is determined based on the code rate of the signal, the weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality;
taking the keyword with the larger weighted confidence of the first keyword and the second keyword as the input keyword;
wherein the determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and the determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword comprises:
determining a first weight and a second weight according to a preset weighting rule, wherein the first weight is positively related to the relative quality, and the second weight is negatively related to the relative quality;
calculating a product of the first weight and the confidence of the first keyword as a weighted confidence of the first keyword, and calculating a product of the second weight and the confidence of the second keyword as a weighted confidence of the second keyword.
2. The method of claim 1, wherein after the determining a weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, the method further comprises:
Determining whether a larger value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is larger than a first preset confidence coefficient threshold value;
and if the larger value is larger than the first preset confidence threshold value, executing the step of taking the keyword with larger weighted confidence in the first keyword and the second keyword as the input keyword.
3. The method of claim 2, wherein after said determining if the greater of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold, the method further comprises:
and if the larger value is not larger than the first preset confidence threshold value, determining that the keyword is not recognized.
4. The method of claim 2, wherein prior to said determining whether the greater of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold, the method further comprises:
determining whether the first keyword and the second keyword are consistent;
if the first keyword is inconsistent with the second keyword, executing the step of determining whether the larger value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is larger than a first preset confidence coefficient threshold value;
If the first keyword is consistent with the second keyword, determining whether the larger value is greater than a second preset confidence threshold value, wherein the second preset confidence threshold value is smaller than the first preset confidence threshold value;
and if the larger value is larger than the second preset confidence threshold, taking the first keyword or the second keyword as an input keyword.
5. The method of claim 1, wherein said keyword recognition of said audio signal comprises:
inputting the audio signal into a preset detection neural network to remove noise signals and mute signals in the audio signal and obtain a voice signal of the audio signal, wherein the detection neural network is trained by a plurality of sample audio signals in advance, and the voice signal of each sample audio signal is used for labeling in advance;
inputting the voice signal into a preset recognition neural network, wherein the recognition neural network is trained by a plurality of sample voice signals in advance, and the voice content corresponding to each sample voice signal is used in advance for labeling;
And acquiring the first keyword output by the identification neural network and the confidence coefficient of the first keyword.
6. The method of claim 1, wherein said lip recognition of the lip video image comprises:
inputting the lip video image into a preset lip language neural network, wherein the lip language neural network is trained by a plurality of sample lip video images in advance, and the keyword corresponding to each sample lip video image is used for labeling in advance;
and obtaining the keyword output by the lip language neural network and the confidence of the keyword as the second keyword and the confidence of the second keyword.
7. The method according to claim 1, wherein the method is applied to an unmanned aerial vehicle controller, and after the keyword with the larger weighted confidence of the first keyword and the second keyword is taken as the input keyword, the method further comprises:
acquiring a control instruction corresponding to the input keyword;
and controlling the unmanned aerial vehicle bound by the unmanned aerial vehicle controller to execute the control instruction.
8. The method of claim 1, wherein determining the first weight and the second weight according to a preset weighting rule comprises:
determining an interpolation factor α according to a preset weighting rule, wherein α is positively correlated with the relative quality, and α is greater than 0 and less than 1;
taking α as the first weight;
and taking 1-α as the second weight.
9. A keyword input device, the device comprising:
a signal acquisition module, configured to acquire an audio signal input by a user and a video signal acquired while the user inputs the audio signal, wherein the video signal comprises a lip video image of the user;
a keyword recognition module, configured to perform keyword recognition on the audio signal to obtain a first keyword and the confidence of the first keyword, wherein the confidence of the first keyword is used for representing the confidence degree that the keyword input by the user is the first keyword;
a lip recognition module, configured to perform lip language recognition on the lip video image to obtain a second keyword and the confidence of the second keyword, wherein the confidence of the second keyword is used for indicating the confidence degree that the keyword input by the user is the second keyword;
a command decision module, configured to determine a weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determine a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, wherein the relative quality is a ratio of the signal quality of the audio signal to the signal quality of the video signal and is used for representing how good the signal quality of the audio signal is relative to the signal quality of the video signal, the signal quality is determined based on the code rate of the signal, the weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality; and take the keyword with the larger weighted confidence of the first keyword and the second keyword as the input keyword;
wherein the command decision module is specifically configured to determine a first weight and a second weight according to a preset weighting rule, wherein the first weight is positively related to the relative quality, and the second weight is negatively related to the relative quality;
calculating a product of the first weight and the confidence of the first keyword as a weighted confidence of the first keyword, and calculating a product of the second weight and the confidence of the second keyword as a weighted confidence of the second keyword.
10. The apparatus of claim 9, wherein the command decision module is further configured to determine whether a greater value of the weighted confidence level of the first keyword and the weighted confidence level of the second keyword is greater than a first preset confidence threshold value after the determining the weighted confidence level of the first keyword based on the relative quality and the confidence level of the first keyword, and the determining the weighted confidence level of the second keyword based on the relative quality and the confidence level of the second keyword;
and if the larger value is larger than the first preset confidence threshold value, executing the step of taking the keyword with larger weighted confidence in the first keyword and the second keyword as the input keyword.
11. The apparatus of claim 10, wherein the command decision module is further configured to determine that no keyword is identified after determining whether a greater value of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold, if the greater value is not greater than the first preset confidence threshold.
12. The apparatus of claim 10, wherein the command decision module is further configured to determine whether the first keyword and the second keyword are consistent before determining whether the greater of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold;
if the first keyword is inconsistent with the second keyword, executing the step of determining whether the larger value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is larger than a first preset confidence coefficient threshold value;
if the first keyword is consistent with the second keyword, determining whether the larger value is greater than a second preset confidence threshold value, wherein the second preset confidence threshold value is smaller than the first preset confidence threshold value;
And if the larger value is larger than the second preset confidence threshold value, taking the first keyword or the second keyword as an input keyword.
13. The apparatus of claim 9, wherein the keyword recognition module is specifically configured to input the audio signal into a preset detection neural network to remove noise signals and mute signals in the audio signal and obtain a voice signal of the audio signal, wherein the detection neural network is trained by a plurality of sample audio signals in advance, and for each sample audio signal, the voice signal of the sample audio signal is used in advance for labeling;
input the voice signal into a preset recognition neural network, wherein the recognition neural network is trained by a plurality of sample voice signals in advance, and the voice content corresponding to each sample voice signal is used in advance for labeling;
and acquiring the first keyword output by the identification neural network and the confidence coefficient of the first keyword.
14. The apparatus of claim 9, wherein the lip recognition module is specifically configured to input the lip video image into a preset lip language neural network, wherein the lip language neural network is trained in advance by a plurality of sample lip video images, and for each sample lip video image, the keyword corresponding to the sample lip video image is used in advance for labeling;
and obtain the keyword output by the lip language neural network and the confidence of the keyword as the second keyword and the confidence of the second keyword.
15. The apparatus of claim 9, wherein the apparatus is applied to an unmanned aerial vehicle controller, and the command decision module is further configured to obtain a control instruction corresponding to the input keyword after the keyword with the larger weighted confidence of the first keyword and the second keyword is taken as the input keyword; and control the unmanned aerial vehicle bound to the unmanned aerial vehicle controller to execute the control instruction.
16. The apparatus according to claim 9, wherein the command decision module is specifically configured to determine an interpolation factor α according to a preset weighting rule, where α is positively correlated with the relative quality, and α is greater than 0 and less than 1;
take α as the first weight;
and take 1-α as the second weight.
17. An electronic device, the electronic device comprising:
a memory for storing a computer program;
a processor, configured to implement the method steps of any one of claims 1-8 when executing the program stored in the memory.
18. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-8.
CN201810939640.7A 2018-08-17 2018-08-17 Keyword input method and device and electronic equipment Active CN110837758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810939640.7A CN110837758B (en) 2018-08-17 2018-08-17 Keyword input method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110837758A CN110837758A (en) 2020-02-25
CN110837758B true CN110837758B (en) 2023-06-02

Family

ID=69573513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810939640.7A Active CN110837758B (en) 2018-08-17 2018-08-17 Keyword input method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110837758B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460907B (en) * 2020-03-05 2023-06-20 浙江大华技术股份有限公司 Malicious behavior identification method, system and storage medium
CN111429887B (en) * 2020-04-20 2023-05-30 合肥讯飞数码科技有限公司 Speech keyword recognition method, device and equipment based on end-to-end
WO2022016406A1 (en) * 2020-07-22 2022-01-27 北京小米移动软件有限公司 Information transmission method and apparatus, and communication device
CN112735413B (en) * 2020-12-25 2024-05-31 浙江大华技术股份有限公司 Instruction analysis method based on camera device, electronic equipment and storage medium
CN112861791B (en) * 2021-03-11 2022-08-23 河北工业大学 Lip language identification method combining graph neural network and multi-feature fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018039045A1 (en) * 2016-08-24 2018-03-01 Knowles Electronics, Llc Methods and systems for keyword detection using keyword repetitions

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023703A (en) * 2009-09-22 2011-04-20 现代自动车株式会社 Combined lip reading and voice recognition multimodal interface system
CN102194454A (en) * 2010-03-05 2011-09-21 富士通株式会社 Equipment and method for detecting key word in continuous speech
CN103177721A (en) * 2011-12-26 2013-06-26 中国电信股份有限公司 Voice recognition method and system
CN103943107A (en) * 2014-04-03 2014-07-23 北京大学深圳研究生院 Audio/video keyword identification method based on decision-making level fusion
CN104409075A (en) * 2014-11-28 2015-03-11 深圳创维-Rgb电子有限公司 Voice identification method and system
CN105679316A (en) * 2015-12-29 2016-06-15 深圳微服机器人科技有限公司 Voice keyword identification method and apparatus based on deep neural network
CN108182937A (en) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 Keyword recognition method, device, equipment and storage medium
CN108346427A (en) * 2018-02-05 2018-07-31 广东小天才科技有限公司 Voice recognition method, device, equipment and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Adaptive fusion of acoustic and visual sources for automatic speech recognition; Alexandrina Rogozan et al.; 1998-12-23; vol. 26, no. 2; entire document *
Deep multimodal learning for Audio-Visual Speech Recognition; Youssef Mroueh et al.; 2015 IEEE International Conference on Acoustics, Speech and Signal Processing; 2015-08-06; pp. 2131-2133, fig. 1 *
OPTIMAL WEIGHTING OF POSTERIORS FOR AUDIO-VISUAL SPEECH RECOGNITION; Martin Heckmann et al.; 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings; 2002-08-07; entire document *
Performance Improvement of Audio-Visual Speech Recognition with Optimal Reliability Fusion; Tariquzzaman et al.; 2011 International Conference on Internet Computing and Information Services; 2011-11-01; pp. 203-205 *
SMALL-FOOTPRINT KEYWORD SPOTTING USING DEEP NEURAL NETWORKS; Guoguo Chen et al.; 2014 IEEE International Conference on Acoustics, Speech and Signal Processing; 2014-07-14; entire document *
Stream weight estimation for multistream audio-visual speech recognition in a multispeaker environment; Xu Shao et al.; https://doi.org/10.1016/j.specom.2007.11.002; 2007-11-19; vol. 50, no. 4; entire document *

Also Published As

Publication number Publication date
CN110837758A (en) 2020-02-25

Similar Documents

Publication Publication Date Title
CN110837758B (en) Keyword input method and device and electronic equipment
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
CN109558512B (en) Audio-based personalized recommendation method and device and mobile terminal
CN112435684B (en) Voice separation method and device, computer equipment and storage medium
US11790912B2 (en) Phoneme recognizer customizable keyword spotting system with keyword adaptation
CN110473568B (en) Scene recognition method and device, storage medium and electronic equipment
CN111868823B (en) Sound source separation method, device and equipment
CN111048113A (en) Sound direction positioning processing method, device and system, computer equipment and storage medium
CN111739539A (en) Method, device and storage medium for determining number of speakers
CN110706707B (en) Method, apparatus, device and computer-readable storage medium for voice interaction
CN104715753A (en) Data processing method and electronic device
CN104575509A (en) Voice enhancement processing method and device
CN111326152A (en) Voice control method and device
CN110972112A (en) Subway running direction determining method, device, terminal and storage medium
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN110827799B (en) Method, apparatus, device and medium for processing voice signal
CN111312223A (en) Training method and device of voice segmentation model and electronic equipment
US20230386470A1 (en) Speech instruction recognition method, electronic device, and non-transient computer readable storage medium
CN112786028A (en) Acoustic model processing method, device, equipment and readable storage medium
CN112669837A (en) Awakening method and device of intelligent terminal and electronic equipment
US10818298B2 (en) Audio processing
CN116884402A (en) Method and device for converting voice into text, electronic equipment and storage medium
CN114220177B (en) Lip syllable recognition method, device, equipment and medium
CN116978359A (en) Phoneme recognition method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant