WO2021128847A1 - 终端交互方法、装置、计算机设备及存储介质 - Google Patents

终端交互方法、装置、计算机设备及存储介质

Info

Publication number
WO2021128847A1
WO2021128847A1 (PCT/CN2020/105762)
Authority
WO
WIPO (PCT)
Prior art keywords
verified
image
video
audio
analysis value
Prior art date
Application number
PCT/CN2020/105762
Other languages
English (en)
French (fr)
Inventor
熊玮
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2021128847A1 publication Critical patent/WO2021128847A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/22Interactive procedures; Man-machine interfaces
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a terminal interaction method, device, computer equipment and storage medium.
  • The purpose of the embodiments of this application is to propose a terminal interaction method to solve the problem that existing terminal interaction methods have a single identification dimension and usually only verify the user at the beginning of the operation, which fails to provide a good security guarantee.
  • In order to solve the above technical problem, an embodiment of the present application provides a terminal interaction method, which adopts the following technical solution: acquiring a first to-be-verified video and a first to-be-verified audio; inputting characteristic parameters of the first to-be-verified audio into a recognition model to analyze the first to-be-verified audio and obtain a first audio analysis value; analyzing the first to-be-verified video through facial action recognition technology of a face image to obtain a first video analysis value; determining, according to the first audio analysis value and the first video analysis value, whether to provide an interactive interface for the user to perform interactive operations; after the interactive interface is provided, periodically acquiring a second to-be-verified video and a second to-be-verified audio; and re-analyzing based on the second to-be-verified video and the second to-be-verified audio to determine whether to close the interactive interface.
  • an embodiment of the present application also provides a terminal interaction device, which adopts the following technical solutions:
  • the first acquisition module is configured to acquire the first to-be-verified video and the first to-be-verified audio;
  • the audio analysis module is configured to input characteristic parameters of the first to-be-verified audio into a recognition model to analyze the first to-be-verified audio and obtain a first audio analysis value;
  • the video analysis module is configured to analyze the first to-be-verified video through facial action recognition technology of a face image to obtain a first video analysis value;
  • the first determining module is configured to determine, according to the first audio analysis value and the first video analysis value, whether to provide an interactive interface for the user to perform interactive operations;
  • the second acquisition module is configured to periodically acquire the second to-be-verified video and the second to-be-verified audio after the interactive interface is provided; and
  • the second determining module is configured to re-analyze based on the second to-be-verified video and the second to-be-verified audio to determine whether to close the interactive interface.
  • the embodiments of the present application also provide a computer device, which adopts the following technical solutions:
  • a computer device includes a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, the following steps of the terminal interaction method are implemented: acquiring a first to-be-verified video and a first to-be-verified audio; inputting characteristic parameters of the first to-be-verified audio into a recognition model to analyze the first to-be-verified audio and obtain a first audio analysis value; analyzing the first to-be-verified video through facial action recognition technology of a face image to obtain a first video analysis value; determining, according to the first audio analysis value and the first video analysis value, whether to provide an interactive interface for the user to perform interactive operations; after the interactive interface is provided, periodically acquiring a second to-be-verified video and a second to-be-verified audio; and re-analyzing based on the second to-be-verified video and the second to-be-verified audio to determine whether to close the interactive interface.
  • the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
  • the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the following steps of the terminal interaction method are implemented: acquiring a first to-be-verified video and a first to-be-verified audio; inputting characteristic parameters of the first to-be-verified audio into a recognition model to analyze the first to-be-verified audio and obtain a first audio analysis value; analyzing the first to-be-verified video through facial action recognition technology of a face image to obtain a first video analysis value; determining, according to the first audio analysis value and the first video analysis value, whether to provide an interactive interface for the user to perform interactive operations; after the interactive interface is provided, periodically acquiring a second to-be-verified video and a second to-be-verified audio; and re-analyzing based on the second to-be-verified video and the second to-be-verified audio to determine whether to close the interactive interface.
  • Analyzing the first to-be-verified audio to obtain a first audio analysis value, analyzing the first to-be-verified video to obtain a first video analysis value, and then combining the two values to decide whether to provide an interactive interface for the user can increase the identification dimensions of terminal interaction and improve its security. In addition, periodically acquiring a second to-be-verified video and a second to-be-verified audio, analyzing them to obtain a second audio analysis value and a second video analysis value, and using these values to determine whether to close the interactive interface enables periodic analysis, with the terminal controlled according to the analysis result, so that identification also takes place while the user is using the terminal, further improving the security of terminal interaction.
  • Figure 1 is an exemplary system architecture diagram to which the present application can be applied;
  • Fig. 2 is a flowchart of an embodiment of a terminal interaction method according to the present application.
  • FIG. 3 is a flowchart of a specific implementation of step S3 in FIG. 2;
  • FIG. 4 is a flowchart of a specific implementation of step S31 in FIG. 3;
  • FIG. 5 is a flowchart of a specific implementation of step S33 in FIG. 3;
  • FIG. 6 is a flowchart of a specific implementation of step S34 in FIG. 3;
  • FIG. 7 is a flowchart of a specific implementation of step S4 in FIG. 2;
  • FIG. 8 is a flowchart of a specific implementation of step S5 in FIG. 2;
  • Fig. 9 is a schematic structural diagram of an embodiment of a terminal interaction device according to the present application.
  • Fig. 10 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • Various communication client applications such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software, can be installed on the terminal devices 101, 102, and 103.
  • the terminal devices 101, 102, 103 may be various electronic devices with display screens and support for web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • It should be noted that the terminal interaction method provided in the embodiments of the present application is generally executed by the server/terminal device; accordingly, the terminal interaction device is generally arranged in the server/terminal device.
  • It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks and servers according to implementation needs.
  • Continuing to refer to FIG. 2, a flowchart of one embodiment of the terminal interaction method according to the present application is shown. The terminal interaction method includes the following steps:
  • S1: Acquire the first to-be-verified video and the first to-be-verified audio.
  • In the above step S1, the first to-be-verified video and the first to-be-verified audio may be video and audio recorded in real time by a mobile terminal (for example, a personal mobile phone) or a dedicated terminal (for example, a bank teller machine).
  • While the video and audio are being recorded, the user needs to speak according to a given script or answer the salesperson's questions. For example, the salesperson asks: "Are you Mr. XX?", and the user answers: "Yes"; the salesperson then asks: "The return rate of the XX product you are buying is XX and the payback period is XX. Are you aware of this?", and the user answers: "Yes".
  • S2: Input the characteristic parameters of the first to-be-verified audio into a recognition model to analyze the first to-be-verified audio and obtain a first audio analysis value.
  • In the above step S2, the analysis of the first to-be-verified audio can be realized through the following steps:
  • (1) Extract the characteristic parameters of the first to-be-verified audio. The characteristic parameters may be MFCC (Mel-scale Frequency Cepstral Coefficients) parameters, sound-intensity parameters or formant parameters. Formant parameters can be extracted with the spectral envelope method, the cepstrum method, LPC interpolation, LPC root finding, the Hilbert transform method, and so on. The sound-intensity parameter can be computed as SIL = 10·lg(I/I′), where I is the sound intensity and I′ = 10⁻¹² W/m² is the reference intensity.
  • MFCC parameters can be extracted as follows. (a) Perform pre-emphasis, framing and windowing on the first to-be-verified audio. Pre-emphasis increases the resolution of the high-frequency band of the speech signal to remove the influence of lip radiation. Framing and windowing are used because the speech signal is non-stationary overall yet short-term stationary, so it is divided into short segments, each treated as a stationary signal; to analyze the signal completely there must be a frame shift (very similar to the sliding window in image processing), and the frame shift can be understood as windowing. Optionally, endpoint detection can also be performed: the start and end points of the valid sound segments are detected and the invalid segments removed, improving the processing efficiency of the speech signal. (b) For each short-time analysis window, obtain the corresponding spectrum through an FFT (Fast Fourier Transform). (c) Pass the spectrum through a Mel filter bank to obtain the Mel spectrum. (d) Perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC).
  • the characteristic parameters may also be prosodic characteristics such as speech rate, energy, average zero-crossing rate, and pitch frequency.
  • (2) Input the extracted characteristic parameters into a pre-trained recognition model to obtain the first audio analysis value. The training set used to train the recognition model usually includes characteristic parameters and the classification result corresponding to each characteristic parameter, so that during training the model's recognition result for the characteristic parameters gets closer and closer to the classification results of the training set; when the accuracy of the recognition result relative to the classification result reaches a certain level, the recognition model can be considered trained. Optionally, the classification results in the training set can be simply divided into two results, fraud and safety, or into multiple results such as fear, disgust, surprise, confusion, thinking, sadness and anger. The recognition model can use a Hidden Markov Model (HMM), a Gaussian Mixture Model (GMM), a Support Vector Machine (SVM), an Artificial Neural Network (ANN) and the like; among these, an SVM is easy to debug and experiment with.
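  • To make this audio branch concrete, the following is a minimal sketch (not the patented implementation): it extracts MFCC features with librosa and scores them with a pre-trained SVM. The file paths, the 13-coefficient setting and the emotion labels are assumptions made only for illustration.

```python
# Sketch only: MFCC extraction plus SVM scoring for one to-be-verified audio clip.
# Assumes a pre-trained sklearn SVC has been saved as "emotion_svm.joblib" (hypothetical file).
import librosa
import numpy as np
import joblib

def first_audio_analysis_value(wav_path: str, model_path: str = "emotion_svm.joblib") -> int:
    y, sr = librosa.load(wav_path, sr=16000)            # load and resample the audio to be verified
    y, _ = librosa.effects.trim(y, top_db=30)            # rough endpoint detection: drop leading/trailing silence
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])           # pre-emphasis to boost the high-frequency band
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # framing, windowing, FFT and Mel cepstrum in one call
    features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # one clip-level feature vector
    svm = joblib.load(model_path)                        # recognition model trained on labelled feature vectors
    return int(svm.predict(features.reshape(1, -1))[0])  # e.g. 0: normal, 1: fear, 2: confusion (illustrative)
```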
  • S3: Analyze the first to-be-verified video through facial action recognition technology of the face image to obtain a first video analysis value.
  • Further, as shown in FIG. 3, the above step S3 may include steps S31 to S34 below, where S31 extracts, from the first to-be-verified video, the image sequences in which micro-expressions are present, and S32 acquires the emotion code of each image in those sequences.
  • Further, as shown in FIG. 4, the above step S31 may include:
  • S311: In the first to-be-verified video, extract sample images according to a preset time interval.
  • In the above step S311, the preset time interval may be 1S, 2S, 5S, and so on.
  • For example, if the preset time interval is 1S, the extracted sample images are the images at 1S, 2S, 3S... in the first to-be-verified video. Since some micro-expressions have a very short duration (according to statistics, the shortest micro-expression can last only about 0.25S), the preset time interval should be as small as the available computing power allows.
  • S312: Determine whether there is a micro-expression in each sampled image, and when a sampled image contains a micro-expression, acquire the images adjacent to that sampled image in the first to-be-verified video to form an image sequence.
  • In the above step S312, an external Facial Action Coding System (FACS) can be requested to determine whether a micro-expression is present in each sampled image.
  • In the FACS system, the facial actions present in a face image are recognized, and the emotion code corresponding to the face image is then obtained from those facial actions. Different action unit (AU) numbers correspond to different facial actions, for example:
  • AU1 Raise the inner corner of the eyebrows
  • AU2 Raise the outer corner of the eyebrows
  • AU4 Lower the eyebrows
  • AU6 Raise the cheeks
  • AU9 Wrinkle the nose
  • If the facial actions recognized by FACS are AU4+AU6+AU9+AU11+AU16+AU25, the emotion corresponding to the resulting emotion code is pain; if they are AU4+AU5+AU7+AU23, the emotion is anger; and if they are AU4+AU14, the emotion is thinking.
  • The images adjacent to a sampled image may be the 20 frames before and the 20 frames after it, or the images within 0.5S before and 0.5S after it.
  • Following the example of step S311, suppose the sampled images containing micro-expressions occur at 1S and 3S of the video to be verified; then the images from 0.5S to 1.5S form one image sequence and the images from 2.5S to 3.5S form another image sequence.
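  • As a rough sketch of steps S311 and S312, the snippet below samples one frame per preset interval with OpenCV and, whenever a sampled frame is flagged as containing a micro-expression, gathers the frames within 0.5S before and after it into an image sequence. The `has_micro_expression` callback stands in for the external FACS-based detector and is an assumption of this example.

```python
# Sketch only: sample frames at a preset interval and collect neighbouring frames
# around any sampled frame flagged as containing a micro-expression.
import cv2

def extract_micro_expression_sequences(video_path, has_micro_expression, interval_s=1.0, window_s=0.5):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = []
    ok, frame = cap.read()
    while ok:                                   # read the short to-be-verified clip into memory
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    step = max(1, int(round(interval_s * fps))) # preset time interval, e.g. one sampled image per second
    half = int(round(window_s * fps))           # neighbourhood: 0.5S before and after the sampled image
    sequences = []
    for i in range(step, len(frames), step):
        if has_micro_expression(frames[i]):     # assumed external detector (e.g. a FACS-based service)
            lo, hi = max(0, i - half), min(len(frames), i + half + 1)
            sequences.append(frames[lo:hi])     # frames 0.5S either side form one image sequence
    return sequences
```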
  • S32: Acquire the emotion code of each image in the image sequence. In the above step S32, the emotion code of each image in the image sequence can be obtained by requesting an external Facial Action Coding System (FACS).
  • Specifically, each emotion code corresponds to one emotion. Emotions can be simply divided into two results, fraud and safety, or into multiple results such as fear, disgust, surprise, confusion, thinking, sadness and anger.
  • S33: Among the images of the same image sequence, divide the images with the same emotion code into one group, determine the score of each group's emotion code according to the images in the group, and take the emotion code of the group with the largest score as the image emotion code of the image sequence.
  • In the above step S33, for example, suppose there are ten images in the same image sequence, of which 3 images are recognized as 1: fear, 3 images as 0: normal, and 4 images as 2: confused. These ten images are then divided into three groups according to fear, normal and confused.
  • Further, as shown in FIG. 5, in the above step S33, determining the score of each group's emotion code according to the images in the group may include:
  • S331: Arrange the images of the image sequence according to the time at which each image appears in the video to be verified, and set the weight of each image according to its position in this order, the weights forming a sequence that first increases and then decreases along the order.
  • In the above step S331, the weight sequence may be preset; it generally follows the order of the images, with its peak in the middle of the order. For example, suppose an image sequence spans 0.5S to 1.5S of the video and contains 10 images, appearing at 0.6S, 0.7S, 0.8S, 0.9S, 1.0S, 1.1S, 1.2S, 1.3S, 1.4S and 1.5S; the weights set for these 10 images can then be 1, 2, 3, 4, 5, 6, 5, 4, 3, 2 in order. The three images of the 'fear' group appear at 1S, 1.1S and 1.2S with weights 5, 6 and 5; the three images of the 'normal' group appear at 1.3S, 1.4S and 1.5S with weights 4, 3 and 2; and the four images of the 'confused' group appear at 0.6S, 0.7S, 0.8S and 0.9S with weights 1, 2, 3 and 4.
  • S332: Sum the weights of all images in each group to obtain the score of the emotion code corresponding to that group.
  • In the above step S332, following the example of step S331, the score of the 'fear' group is 5+6+5=16, the score of the 'normal' group is 4+3+2=9, and the score of the 'confused' group is 1+2+3+4=10. The 'fear' group has the highest score, so fear is taken as the emotion of the image sequence.
  • During a micro-expression, its amplitude first gradually increases to a peak and then gradually falls back to normal, so the amplitude is largest at the midpoint of the micro-expression and the micro-expression recognition result for the image at that moment is more reliable. Through the above steps S331 and S332, appropriate weights can therefore be assigned to the micro-expression recognition results of images at different time points, making the emotion recognition result of the image sequence more accurate.
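  • The weighting of steps S331 and S332 can be expressed in a few lines; a minimal sketch follows. The triangular weight sequence and the integer emotion codes are illustrative assumptions rather than values fixed by the application, but the worked example above (3 fear, 3 normal, 4 confused images) still resolves to fear.

```python
# Sketch only: score the emotion groups inside one image sequence with weights that
# first increase and then decrease with the image's position (peak near the midpoint).
from collections import defaultdict

def sequence_emotion(emotion_codes):
    n = len(emotion_codes)
    mid = (n - 1) / 2.0
    # Largest weight near the midpoint of the micro-expression, where its amplitude
    # (and hence the recognition result) is most reliable.
    weights = [mid + 1 - abs(i - mid) for i in range(n)]
    scores = defaultdict(float)
    for code, w in zip(emotion_codes, weights):
        scores[code] += w                       # sum the weights of all images sharing an emotion code
    return max(scores, key=scores.get)          # emotion code of the highest-scoring group

codes = [2, 2, 2, 2, 1, 1, 1, 0, 0, 0]          # 0: normal, 1: fear, 2: confused (ordered by time)
print(sequence_emotion(codes))                  # -> 1, i.e. fear, as in the worked example
```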
  • S34: Determine the first video analysis value according to the image emotion codes of the image sequences.
  • Further, as shown in FIG. 6, when there are multiple image sequences, the above step S34 may include:
  • S341: Assign the same weight to each image sequence. In the above step S341, for example, four image sequences are extracted from the video, spanning 0.5S to 1.5S, 2S to 3S, 3.5S to 4.5S and 5S to 6S, and the weight of each image sequence may be set to 1.
  • S342: Identify the time period of each image sequence, and segment the first to-be-verified audio according to the identified time periods to obtain the audio segment corresponding to each image sequence.
  • S343: Analyze the audio segments to obtain the audio emotion code corresponding to each image sequence.
  • In the above steps S342 and S343, following the example of step S341, the four audio segments spanning 0.5S to 1.5S, 2S to 3S, 3.5S to 4.5S and 5S to 6S are taken from the first to-be-verified audio and analyzed to obtain the corresponding audio emotion codes.
  • Here, the audio segments can be analyzed by requesting a third-party audio emotion analysis service; the specific analysis method may also be the same as in step S2 above.
  • S344: When the audio emotion code is the same as the image emotion code of the image sequence in the same time period, increase the weight of that image sequence.
  • In the above step S344, following the example of steps S342 and S343, suppose the audio emotion codes of the four image sequences 0.5S to 1.5S, 2S to 3S, 3.5S to 4.5S and 5S to 6S are 1: fear, 2: confused, 2: confused and 3: angry, and the audio emotion verification results of the four audio segments are 1: fear, 0: normal, 0: normal and 0: normal. At this time, the weight of the 0.5S-to-1.5S image sequence is increased to 3, while the weights of the other three image sequences remain 1.
  • S345: Divide the image sequences with the same image emotion code into one group, sum the weights of the image sequences in each group to obtain the score of that group, and take the image emotion code of the group with the largest score as the first video analysis value.
  • In the above step S345, following the example of step S344, the 0.5S-to-1.5S image sequence forms one group with emotion code 1: fear; the 2S-to-3S and 3.5S-to-4.5S image sequences form one group with emotion code 2: confused; and the 5S-to-6S image sequence forms one group with emotion code 0: normal. The score of the 'fear' group is 3, the score of the 'confused' group is 1+1=2, and the score of the 'normal' group is 1, so the first video analysis value is 1: fear.
  • Through the above steps S341 to S345, the audio emotion codes of the corresponding segments of the first to-be-verified audio are taken into account when determining the first video analysis value from the emotions of the image sequences, which makes the resulting first video analysis value more accurate.
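  • Steps S341 to S345 can likewise be condensed into a short routine; the sketch below follows the worked example, boosting a sequence's weight from 1 to 3 when the audio emotion code of the same time period agrees with its image emotion code. The function and parameter names are assumptions for illustration.

```python
# Sketch only: derive the first video analysis value from several image sequences,
# boosting any sequence whose audio emotion code matches its image emotion code.
from collections import defaultdict

def first_video_analysis_value(image_codes, audio_codes, boosted_weight=3):
    # image_codes[i] / audio_codes[i]: emotion codes of the i-th image sequence and of
    # the audio segment covering the same time period.
    weights = [1] * len(image_codes)            # S341: every image sequence starts with the same weight
    for i, (img, aud) in enumerate(zip(image_codes, audio_codes)):
        if img == aud:                          # S344: the audio confirms the image emotion
            weights[i] = boosted_weight
    scores = defaultdict(int)
    for code, w in zip(image_codes, weights):   # S345: group sequences by image emotion code
        scores[code] += w
    return max(scores, key=scores.get)          # image emotion code of the highest-scoring group

# Worked example: image codes 1, 2, 2, 0 with only the first sequence confirmed by audio.
print(first_video_analysis_value([1, 2, 2, 0], [1, 0, 0, 0]))   # -> 1, i.e. fear
```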
  • S4: Determine, according to the first audio analysis value and the first video analysis value, whether to provide an interactive interface for the user to perform interactive operations.
  • Further, as shown in FIG. 7, the above step S4 may include:
  • S41: Determine, according to preset rules, the audio score corresponding to the first audio analysis value and the video score corresponding to the first video analysis value.
  • In the above step S41, the stronger the correlation in the preset rules between an emotion and fraudulent behaviour, the smaller the corresponding audio score and video score can be. The preset rules may be, for example: first audio analysis value → audio score: 1: fear → 1, 2: confusion → 5, ...; first video analysis value → video score: 1: fear → 1, 2: confusion → 5, ...
  • S42: Extract image data from the first to-be-verified video and perform face matching to obtain a face matching rate.
  • In the above step S42, face matching can be implemented as follows: extract the face image from the video and compare it with the face database of the public security authority to obtain the face matching rate. The face matching rate can be 0.1, 0.2, 0.5, 0.8, 1, and so on.
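  • Step S42 can be prototyped with an off-the-shelf face-embedding library. The sketch below uses the open-source face_recognition package purely as an illustration and compares a video frame against a reference face image; the application itself describes matching against an official face database, which is not reproduced here, and the distance-to-rate conversion is an assumption.

```python
# Sketch only: estimate a face matching rate by comparing a frame extracted from the
# to-be-verified video with a reference (enrolled) face image.
import face_recognition

def face_matching_rate(frame_path: str, reference_path: str) -> float:
    frame_enc = face_recognition.face_encodings(face_recognition.load_image_file(frame_path))
    ref_enc = face_recognition.face_encodings(face_recognition.load_image_file(reference_path))
    if not frame_enc or not ref_enc:
        return 0.0                               # no face found in one of the images
    distance = face_recognition.face_distance([ref_enc[0]], frame_enc[0])[0]
    return float(max(0.0, 1.0 - distance))       # crude conversion of embedding distance to a 0..1 rate
```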
  • S43: Calculate a weighted sum of the audio score and the video score using the face matching rate as the weight.
  • In the above step S43, if the face matching rate is 0.6 and the first audio analysis value and the first video analysis value are both 1: fear, the weighted sum is 0.6*1+0.6*1=1.2.
  • S44: When the weighted sum is greater than a safety threshold, provide the corresponding interactive interface.
  • S45: When the weighted sum is less than or equal to the safety threshold, do not provide the corresponding interactive interface.
  • In the above steps S44 and S45, following the example of step S43, when the safety threshold is 1.5 the weighted sum is below the threshold, so the user is judged to show signs of fraud and the corresponding interactive interface is not provided. The corresponding interactive interface may be a page where the user purchases a financial product, an introduction page of the financial product, and so on.
  • Through the above steps S41 to S45, the user's identity is further verified with the face matching rate when deciding, according to the first audio analysis value and the first video analysis value, whether to provide the corresponding interactive interface, which further increases the identification dimensions of terminal interaction and improves its security.
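  • The gate of steps S41 to S45 reduces to a short calculation. This sketch reuses the illustrative score table (fear → 1, confusion → 5) and the 1.5 safety threshold from the example above; both values are examples, not fixed by the application.

```python
# Sketch only: decide whether to open the interactive interface from the audio score,
# the video score and the face matching rate (used as the weight).
SCORE_TABLE = {1: 1, 2: 5}           # e.g. 1: fear -> 1, 2: confusion -> 5 (illustrative preset rule)

def should_open_interface(audio_value, video_value, face_match_rate, safety_threshold=1.5):
    audio_score = SCORE_TABLE.get(audio_value, 0)
    video_score = SCORE_TABLE.get(video_value, 0)
    weighted_sum = face_match_rate * audio_score + face_match_rate * video_score   # S43
    return weighted_sum > safety_threshold       # S44/S45: open only above the safety threshold

# Worked example: match rate 0.6, both analysis values 1: fear -> 0.6*1 + 0.6*1 = 1.2 < 1.5.
print(should_open_interface(1, 1, 0.6))          # -> False, so the interface is not provided
```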
  • S5: Periodically acquire the second to-be-verified video and the second to-be-verified audio.
  • In the above step S5, the second to-be-verified video and the second to-be-verified audio may be video and audio recorded by a mobile terminal (for example, a personal mobile phone) or a dedicated terminal (for example, a bank teller machine).
  • Further, as shown in FIG. 8, the above step S5 may include:
  • S51: Acquire the security level of the current interactive interface.
  • In the above step S51, a security label may be preset for each interactive interface, and the security level corresponding to that interactive interface is stored in the security label. For example, the security level of the auto insurance purchase page may be preset to level A and the security level of the auto insurance introduction page to level B; the security requirements of level A are higher than those of level B.
  • S52: Determine, according to the security level, the frequency for acquiring the second to-be-verified video and the second to-be-verified audio, and acquire the second to-be-verified video and the second to-be-verified audio at that frequency.
  • In the above step S52, the frequency corresponding to each security level can be determined according to preset rules. Following the example of step S51, suppose that in the preset rules the frequency corresponding to security level A is once every 10S and the frequency corresponding to security level B is once every 30S; then, while the user is on the auto insurance purchase page, the second to-be-verified video and the second to-be-verified audio are collected every 10S, and while the user is on the auto insurance introduction page, they are collected every 30S.
  • Through the above steps S51 and S52, different video and audio collection frequencies can be used according to the security level of the interactive interface the user is currently on: a higher collection frequency is used when the security level is higher, and a lower collection frequency when the security level is lower. This reduces the amount of data processing and improves efficiency while still ensuring the security of terminal interaction.
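  • Steps S51 and S52 amount to looking up a polling period from the current page's security label. The sketch below uses the 10S/30S figures from the example; `get_current_security_level`, `capture_clip` and `reanalyze` are hypothetical callbacks standing in for the terminal's own facilities.

```python
# Sketch only: periodically re-acquire the second to-be-verified video/audio at a
# frequency chosen from the security level of the interactive interface in use.
import time

ACQUISITION_PERIOD_S = {"A": 10, "B": 30}    # level-A pages are re-checked more often than level-B pages

def periodic_reverification(get_current_security_level, capture_clip, reanalyze):
    while True:
        level = get_current_security_level()          # S51: read the page's preset security label
        period = ACQUISITION_PERIOD_S.get(level, 30)  # S52: map the level to a collection frequency
        time.sleep(period)
        video, audio = capture_clip()                 # record the second to-be-verified video and audio
        if not reanalyze(video, audio):               # same analysis as for the first clip (step S6)
            break                                     # security no longer sufficient: close the interface
```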
  • S6: Re-analyze based on the second to-be-verified video and the second to-be-verified audio to determine whether to close the interactive interface.
  • In the above step S6, the process of re-analyzing the second to-be-verified video and the second to-be-verified audio is the same as the analysis of the first to-be-verified video and the first to-be-verified audio, and the process of deciding whether to close the interactive interface is similar to that of step S4; neither is repeated here. The main difference is that when the second audio analysis value and the second video analysis value indicate that security is insufficient (for example, the weighted sum is less than the safety threshold), the interactive interface is closed, and when security is sufficient (for example, the weighted sum is greater than the safety threshold), the interactive interface is not closed.
  • In this embodiment, the first to-be-verified audio can be analyzed to obtain the first audio analysis value, the first to-be-verified video can be analyzed to obtain the first video analysis value, and the first audio analysis value and the first video analysis value can then be combined to decide whether to provide the interactive interface for the user to perform interactive operations, which increases the identification dimensions of terminal interaction and improves its security. In addition, the second to-be-verified video and the second to-be-verified audio are acquired periodically, the second to-be-verified audio is analyzed to obtain a second audio analysis value, the second to-be-verified video is analyzed to obtain a second video analysis value, and these values determine whether to close the interactive interface; this enables periodic analysis and control of the terminal according to the analysis results, so that identification also takes place while the user is using the terminal, further improving the security of terminal interaction.
  • A person of ordinary skill in the art can understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments.
  • The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM).
  • With further reference to FIG. 9, as an implementation of the method shown in FIG. 2, this application provides an embodiment of a terminal interaction device. The device embodiment corresponds to the method embodiment shown in FIG. 2, and the device can specifically be applied to various electronic devices.
  • As shown in FIG. 9, the terminal interaction device 400 in this embodiment includes: a first acquisition module 401, an audio analysis module 402, a video analysis module 403, a first determination module 404, a second acquisition module 405, and a second determination module 406. Among them:
  • the first obtaining module 401 is configured to obtain the first to-be-verified video and the first to-be-verified audio.
  • the audio analysis module 402 is configured to input the characteristic parameters of the first to-be-verified audio into a recognition model to analyze the first to-be-verified audio to obtain a first audio analysis value.
  • the video analysis module 403 is configured to analyze the first to-be-verified video to obtain a first video analysis value by using a facial motion recognition technology of a face image.
  • the first determining module 404 is configured to determine whether to provide an interactive interface for the user to perform interactive operations according to the first audio analysis value and the first video analysis value.
  • the second acquisition module 405 is configured to periodically acquire the second to-be-verified video and the second to-be-verified audio after the interactive interface is provided.
  • the second determination module 406 is configured to re-analyze based on the second to-be-verified video and the second to-be-verified audio to determine whether to close the interactive interface.
  • In this embodiment, the first to-be-verified audio can be analyzed to obtain the first audio analysis value, the first to-be-verified video can be analyzed to obtain the first video analysis value, and the two values can then be combined to decide whether to provide the interactive interface for the user to perform interactive operations, which increases the identification dimensions of terminal interaction and improves its security. In addition, the second to-be-verified video and the second to-be-verified audio are acquired periodically and analyzed to obtain the second audio analysis value and the second video analysis value, which determine whether to close the interactive interface; this enables periodic analysis and control of the terminal according to the analysis results, so that identification also takes place while the user is using the terminal, further improving the security of terminal interaction.
  • The specific limitations on the terminal interaction device 400 are consistent with the specific limitations of the terminal interaction method above and will not be repeated here.
  • FIG. 10 is a block diagram of the basic structure of the computer device in this embodiment.
  • The computer device 11 includes a memory 111, a processor 112 and a network interface 113 that are communicatively connected to each other through a system bus. It should be pointed out that only the computer device 11 with components 111-113 is shown in the figure, but it should be understood that not all of the illustrated components are required to be implemented; more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), embedded devices, and so on.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • the memory 111 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, etc.
  • the memory 111 may be an internal storage unit of the computer device 11, such as a hard disk or a memory of the computer device 11.
  • the memory 111 may also be an external storage device of the computer device 11, for example, a plug-in hard disk equipped on the computer device 11, a smart media card (SMC), and a secure digital (Secure Digital, SD) card, Flash Card, etc.
  • the memory 111 may also include both an internal storage unit of the computer device 11 and an external storage device thereof.
  • the memory 111 is generally used to store an operating system and various application software installed in the computer device 11, such as computer-readable instructions for terminal interaction methods.
  • the memory 111 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 112 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 112 is generally used to control the overall operation of the computer device 11.
  • the processor 112 is configured to run computer-readable instructions or process data stored in the memory 111, for example, computer-readable instructions for running the terminal interaction method.
  • the network interface 113 may include a wireless network interface or a wired network interface, and the network interface 113 is generally used to establish a communication connection between the computer device 11 and other electronic devices.
  • This application also provides another implementation, namely a computer-readable storage medium storing a terminal interaction program, where the terminal interaction program can be executed by at least one processor, so that the at least one processor executes the steps of the terminal interaction method described above.
  • Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform; of course, they can also be implemented by hardware, but in many cases the former is the better implementation.
  • Based on such an understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions to make a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Engineering & Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of this application belong to the field of artificial intelligence and relate to a terminal interaction method, comprising: acquiring a first to-be-verified video and a first to-be-verified audio; analyzing the first to-be-verified audio and the first to-be-verified video to obtain a first analysis result; determining, according to the first analysis result, whether to provide an interactive interface for the user to perform interactive operations; after the interactive interface is provided, periodically acquiring a second to-be-verified video and a second to-be-verified audio; each time the second to-be-verified video and the second to-be-verified audio are acquired, analyzing them to obtain a second analysis result; and determining, according to the second analysis result, whether to close the interactive interface. This application also provides a terminal interaction apparatus, a computer device and a storage medium. This application can increase the identification dimensions of terminal interaction and also perform identification while the user is using the terminal, thereby improving the security of terminal interaction.

Description

终端交互方法、装置、计算机设备及存储介质
本申请以2019年12月25日提交的申请号为201911357310.8,名称为“终端交互方法、装置、计算机设备及存储介质”的中国发明专利申请为基础,并要求其优先权。
技术领域
本申请涉及人工智能领域,尤其涉及一种终端交互方法、装置、计算机设备及存储介质。
背景技术
随着计算机技术的发展,计算机终端已遍布于人们生活中的各个角落。现今,人们进行许多活动时都需要操作计算机终端。为保障安全性,终端在允许用户进行操作前,有时会对用户的身份进行验证,但在实现本申请的过程中,发明人意识到现有的验证方法的辨识维度较为单一,且通常只需要在用户一开始操作时进行验证,未能提供很好的安全保障。
发明内容
本申请实施例的目的在于提出一种终端交互方法,以解决现有的终端交互方法存在辨识维度较为单一,且通常只需要在用户一开始操作时进行验证,未能提供很好的安全保障的问题。
为了解决上述技术问题,本申请实施例提供一种终端交互方法,采用了如下所述的技术方案:
获取第一待验证视频和第一待验证音频;
将所述第一待验证音频的特征参数输入至识别模型以对所述第一待验证音频进行分析得到第一音频分析值;
通过人脸图像的面部动作识别技术对所述第一待验证视频进行分析以得到第一视频分析值;
根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面;以及
在提供交互界面后,定时地获取第二待验证视频和第二待验证音频;
基于所述第二待验证视频和第二待验证音频重新进行分析以确定是否关闭交互界面。
为了解决上述技术问题,本申请实施例还提供一种终端交互装置,采用了如下所述的技术方案:
第一获取模块,用于获取第一待验证视频和第一待验证音频;
音频分析模块,用于将所述第一待验证音频的特征参数输入至识别模型以对所述第一待验证音频进行分析得到第一音频分析值;
视频分析模块,用于通过人脸图像的面部动作识别技术对所述第一待验证视频进行分析以得到第一视频分析值;
第一确定模块,用于根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面;以及
第二获取模块,用于在提供交互界面后,定时地获取第二待验证视频和第二待验证音频;
第二确定模块,基于所述第二待验证视频和第二待验证音频重新进行分析以确定是否关闭交互界面。
为了解决上述技术问题,本申请实施例还提供一种计算机设备,采用了如下所述的技术方案:
一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机流程,所述处理 器执行所述计算机流程时实现如下所述的终端交互方法的步骤:
获取第一待验证视频和第一待验证音频;
将所述第一待验证音频的特征参数输入至识别模型以对所述第一待验证音频进行分析得到第一音频分析值;
通过人脸图像的面部动作识别技术对所述第一待验证视频进行分析以得到第一视频分析值;
根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面;以及
在提供交互界面后,定时地获取第二待验证视频和第二待验证音频;
基于所述第二待验证视频和第二待验证音频重新进行分析以确定是否关闭交互界面。
为了解决上述技术问题,本申请实施例还提供一种计算机可读存储介质,采用了如下所述的技术方案:
所述计算机可读存储介质上存储有计算机流程,所述计算机流程被处理器执行时实现如下所述的终端交互方法的步骤:
获取第一待验证视频和第一待验证音频;
将所述第一待验证音频的特征参数输入至识别模型以对所述第一待验证音频进行分析得到第一音频分析值;
通过人脸图像的面部动作识别技术对所述第一待验证视频进行分析以得到第一视频分析值;
根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面;以及
在提供交互界面后,定时地获取第二待验证视频和第二待验证音频;
基于所述第二待验证视频和第二待验证音频重新进行分析以确定是否关闭交互界面。
本申请的一个或多个实施例的细节在下面的附图和描述中提出,本申请的其他特征和优点将从说明书、附图以及权利要求变得明显。
对第一待验证音频进行分析得到第一音频分析值,对第一待验证视频进行分析得到第一视频分析值,然后综合第一音频分析值和第一视频分析值来提供供用户进行交互操作的交互界面,从而能够增加终端交互的辨识维度,提高终端交互的安全性。另外,定时地获取第二待验证视频和第二待验证音频,对第二待验证音频进行分析得到第二音频分析值,对第二待验证视频进行分析得到第二视频分析值,然后综合第二音频分析值和第二视频分析值确定是否关闭交互界面,从而能够实现定时的分析,并根据分析结果控制终端,实现在用户使用终端的过程中也进行识别,进一步提高了终端交互的安全性。
附图说明
为了更清楚地说明本申请中的方案,下面将对本申请实施例描述中所需要使用的附图作一个简单介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请可以应用于其中的示例性系统架构图;
图2是根据本申请的终端交互方法的一个实施例的流程图;
图3是图2中步骤S3的一种具体实施方式的流程图;
图4是图3中步骤S31的一种具体实施方式的流程图;
图5是图3中步骤S33的一种具体实施方式的流程图;
图6是图3中步骤S34的一种具体实施方式的流程图;
图7是图2中步骤S4的一种具体实施方式的流程图;
图8是图2中步骤S5的一种具体实施方式的流程图;
图9是根据本申请的终端交互装置的一个实施例的结构示意图;
图10是根据本申请的计算机设备的一个实施例的结构示意图。
具体实施方式
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同;本文中在申请的说明书中所使用的术语只是为了描述具体的实施例的目的,不是旨在于限制本申请;本申请的说明书和权利要求书及上述附图说明中的术语“包括”和“具有”以及它们的任何变形,意图在于覆盖不排他的包含。本申请的说明书和权利要求书或上述附图中的术语“第一”、“第二”等是用于区别不同对象,而不是用于描述特定顺序。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
为了使本技术领域的人员更好地理解本申请方案,下面将结合附图,对本申请实施例中的技术方案进行清楚、完整地描述。
如图1所示,系统架构100可以包括终端设备101、102、103,网络104和服务器105。网络104用以在终端设备101、102、103和服务器105之间提供通信链路的介质。网络104可以包括各种连接类型,例如有线、无线通信链路或者光纤电缆等等。
用户可以使用终端设备101、102、103通过网络104与服务器105交互,以接收或发送消息等。终端设备101、102、103上可以安装有各种通讯客户端应用,例如网页浏览器应用、购物类应用、搜索类应用、即时通信工具、邮箱客户端、社交平台软件等。
终端设备101、102、103可以是具有显示屏并且支持网页浏览的各种电子设备,包括但不限于智能手机、平板电脑、电子书阅读器、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、膝上型便携计算机和台式计算机等等。
服务器105可以是提供各种服务的服务器,例如对终端设备101、102、103上显示的页面提供支持的后台服务器。
需要说明的是,本申请实施例所提供的终端交互方法一般由服务器/终端设备执行,相应地,终端交互装置一般设置于服务器/终端设备中。
应该理解,图1中的终端设备、网络和服务器的数目仅仅是示意性的。根据实现需要,可以具有任意数目的终端设备、网络和服务器。
继续参考图2,示出了根据本申请的终端交互方法的一个实施例的流程图。所述的终端交互方法,包括以下步骤:
S1:获取第一待验证视频和第一待验证音频。
在上述步骤S1中,第一待验证视频和第一待验证音频可以是由移动终端(例如,个人手机)或专用终端(例如,银行柜员机等)实时录制的视频和音频。在录制视频和音频的过程中,用户需要按照给定的话术说话或者回答销售人员的问题。例如,销售人员询问:“请问您是XX先生吗?”,用户回答:“是”,销售人员继续询问:“您购买的XX产品回报率为XX,回收周期为XX,请问您是否了解”,用户回答:“是”。
S2:将所述第一待验证音频的特征参数输入至识别模型以对所述第一待验证音频进行分析得到第一音频分析值。
在上述步骤S2中,可以通过以下步骤来实现对第一待验证音频的分析:
(1)提取第一待验证音频的特征参数。其中,特征参数可以是MFCC(Mel-scale Frequency Cepstral Coefficients)特征参数,也可以是声强特征参数,也可以是共振峰特征参数。提取共振峰特征参数时,可以使用谱包络法、倒谱法、LPC内插法、LPC求 根法、希尔伯特变换法等。提取声强特征参数时,可以进行如下计算:SIL=10lg(I/I'),式中I为声强,I'=10e -12瓦/平米,为基准声强,SIL即为声强特征参数。提取MFCC特征参数时,可以通过以下形式实现:(a)对第一待验证音频进行预加重、分帧和加窗。这里,进行预加重能够增加语音信号中高频段的分辨率,以去除口唇辐射的影响。而加窗分帧的作用是:语音信号本身是非平稳的,但是又兼具短时平稳的特点,因此将语音信号分成一小段将此看作平稳信号来处理。这里的分段可以理解为是:分帧,为了全面完整地分析语音信号,要有帧移(这里的理解与图像处理的滑动窗很类似)。此处的帧移可理解为加窗。可选地,还可以进行端点检测:检测有效声音段的起始点与结束点,以去除无效声音段,从而提高语音信号的处理效率。(b)对每一个短时分析窗,通过FFT(快速傅里叶变换)得到对应的频谱。(c)将所述频谱通过梅尔滤波器组得到梅尔频谱。(d)在梅尔频谱上面进行倒谱分析获得梅尔频率倒谱系数MFCC。可选地,特征参数还可以是语速、能量、平均过零率、基音频率等韵律特征。
(2)将提取到的特征参数输入至预先训练好的识别模型以得到第一音频分析值。这里,用于训练识别模型的训练集通常包括有特征参数和该特征参数对应的分类结果。这样,在训练的过程中,识别模型对特征参数的识别结果将越来越接近训练集的分类结果。当识别结果与分类结果相比,其准确率达到一定程度时,即可认为识别模型已经训练完成。可选地,训练集中的分类结果可简单地分为欺诈和安全两个结果,也可以分为害怕、反感、惊讶、困惑、思考、哀伤、生气等多个结果。识别模型可以使用隐马尔可夫模型HMM、高斯混合模型GMM、支持向量机SVM、人工神经网络ANN等,其中使用支持向量机SVM时,易于调试和实验。
S3:通过人脸图像的面部动作识别技术对所述第一待验证视频进行分析以得到第一视频分析值。
进一步地,如图3所示,上述步骤S3可以包括:
S31:在所述第一待验证视频中提取存在微表情的图像序列。
进一步地,如图4所示,上述步骤S31可以包括:
S311:在所述第一待验证视频中,按照预设的时间间隔提取抽样图像。
在上述步骤S311中,预设的时间间隔可以是1S、2S、5S等。例如,预设的时间间隔取值为1S,即提取的抽样图像为第一待验证视频中1S、2S、3S......时对应的图像。由于有些微表情的持续时间很短(据统计,最短时间的微表情甚至可以仅持续0.25S),故在计算能力允许的条件下,预设的时间间隔尽量取更小的值。
S312:判断每张抽样图像中是否存在微表情,当抽样图像中存在微表情时,获取所述第一待验证视频中与所述抽样图像邻近的图像以组成所述图像序列。
在上述步骤S312中,可以通过请求外部的面部动作编码系统(Facial action coding system,FACS)来实现判断抽样图像中是否存在微表情。在FACS系统中,人脸图像会被识别存在哪些面部动作,然后根据所存在的面部动作得到人脸图像对应的情绪编码。不同的面部动作编号对应着不同的面部动作,例如,AU1:抬起眉毛内角,AU2:抬起眉毛外角,AU4:降低眉毛,AU6:脸颊提升,AU9:皱鼻......FACS识别到的面部动作为:AU4+AU6+AU9+AU11+AU16+AU25,则此时得到的情绪编码对应的情绪为疼痛。面部动作为:AU4+AU5+AU7+AU23,则此时得到的情绪编码对应的情绪为愤怒。面部动作为:AU4+AU14,则此时得到的情绪编码对应的情绪为思考。与抽样图像邻近的图像可以是抽样图像前20帧的图像和抽样图像后20帧的图像,也可以是抽样图像前0.5S的图像和抽样图像后0.5S的图像。承接上述步骤S311的例子,假设存在微表情的抽样图像在待验证视频中的时间为1S和3S,即0.5S至1.5S的图像为一个图像序列,2.5S至3.5S的图像为另一个图像序列。
S32:获取所述图像序列中每一张图像的情绪编码。
在上述步骤S32中,可以通过请求外部的面部动作编码系统(Facial action coding  system,FACS)来获取图像序列中每一张图像的情绪编码。具体的,一个情绪编码可以对应着一种情绪。情绪可以简单地分为欺诈和安全两个结果,也可以分为害怕、反感、惊讶、困惑、思考、哀伤、生气等多个结果。
S33:在同一个图像序列的图像中,将情绪编码相同的图像分为一组,并根据分组中的图像确定每组情绪编码的分值,以分值最大的分组对应的情绪编码作为所述图像序列的图像情绪编码。
在上述步骤S33中,例如,在同一个图像序列中有十张图像,其中,3张图像识别的情绪为1:害怕,3张图像识别的情绪为0:正常,4张图像识别:2:困惑。即这十张图像将被按照害怕、正常和困惑划分为三组。
进一步地,如图5所示,在上述步骤S33中,所述根据分组中的图像确定每组情绪编码的分值可以包括:
S331:按照图像在所述待验证视频中出现的时间对所述图像序列中的每一张图像进行排列,并根据排列的顺序设置每一张图像的权值,所述权值为一个跟随排列的顺序先递增后递减的数列。
在上述步骤S331中,数列可以是预设的,一般按照图像的排序,在排序正中的即是增大的峰值。例如,一个图像序列在视频中的时间为0.5S至1.5S,且图像序列中的图像共有10张,其在视频中出现的时间依次为:0.6S、0.7S、0.8S、0.9S、1.0S、1.1S、1.2S、1.3S、1.4S、1.5S。即对这10张图像设置的权值可以依次为1、2、3、4、5、6、5、4、3、2。情绪为害怕的分组的图像共有3张,其在视频中出现的时间依次为1S、1.1S、1.2S,其对应的权值依次为5、6、5,情绪为正常的分组的图像共有3张,其在视频中出现的时间依次为1.3S、1.4S、1.5S,其对应的权值依次为4、3、2。情绪为困惑的分组的图像共有4张,其在视频中出现的时间依次为0.6S、0.7S、0.8S、0.9S,其对应的权值依次为1、2、3、4。
S332:以每组中所有图像的权值相加以得到该组对应的情绪编码的分值。
在上述步骤S332中,承接上述步骤S331的例子,此时,害怕情绪的分组的分值为5+6+5=16,正常情绪的分组的分值为4+3+2=9,困惑情绪的分组的分值为1+2+3+4=10。害怕情绪分组的分值最大,故以害怕作为该图像序列的情绪。
由于在人的微表情持续过程中,微表情的幅度呈现出一个先逐渐增加至峰值然后又逐渐回落至正常的趋势,所以在微表情持续过程中的中点时,微表情的幅度最大,此时图像对应的微表情识别结果可信度更高,故通过上述步骤S331和步骤S332能够对不同时间点的图像的微表情识别结果赋予合适的权值,从而使图像序列的情绪识别结果更加准确。
S34:根据所述图像序列的情绪编码确定所述第一视频分析值。
进一步地,如图6所示,当所述图像序列为多个图像序列时,上述步骤S34可以包括:
S341:对每个图像序列赋予相同的权值。
在上述步骤S341中,例如,在视频中提取到四个图像序列,时间分别为0.5S至1.5S,2S至3S,3.5S至4.5S,5S至6S,其中每个图像序列的权值可以都设置为1。
S342:识别每个图像序列的时间段,将第一待验证音频按识别到的时间段进行分段,得到与每个图像序列对应的音频片段。
S343:对所述音频片段进行分析以获得每个图像序列对应的音频情绪编码。
在上述步骤S342和步骤S343中,承接上述步骤S341的例子,即在第一待验证音频中获取0.5S至1.5S,2S至3S,3.5S至4.5S,5S至6S这四段时间的音频片段,并对其进行分析以获得对应的音频情绪编码。这里,可以通过请求第三方音频情绪分析服务对音频片段进行分析。具体的分析方法也可以与上述步骤S2一致。
S344:当音频情绪编码与同时间段内的图像序列的图像情绪编码相同时,增大该图像序列对应的权值。
在上述步骤S344中,承接上述步骤S342和步骤S343的例子,假设0.5S至1.5S,2S 至3S,3.5S至4.5S,5S至6S这四个图像序列的音频情绪编码依次为1:害怕,2:困惑,2:困惑,3:生气,并且这四段音频片段的音频情绪校验结果依次为1:害怕,0:正常,0:正常,0:正常,此时,0.5S至1.5S的图像序列对应的权值即增大为3,其他三个图像序列的权值依然保持为1。
S345:将图像情绪编码相同的图像序列分为一组,并将每组中图像序列对应的权值相加以得到每组图像序列的分值,以分值最大的分组对应的图像情绪编码为所述第一视频分析值。
在上述步骤S345中,承接上述步骤S344的例子,此时,0.5S至1.5S的图像序列为一组,其情绪编码为1:害怕,2S至3S、3.5S至4.5S的图像序列为一组,其情绪编码为2:困惑,5S至6S的图像序列为一组,其情绪编码为0:正常。情绪编码为1:害怕的分组的分值为3,情绪编码为2:困惑的分组的分值为1+1=2,情绪编码为0:正常的分组的分值为1,故第一视频分析值为1:害怕。
通过上述步骤S341、S342、S343、S344和S345,能够在根据图像序列的情绪确定第一视频分析值时,结合图像序列对应的第一待验证音频的音频情绪编码进行计算,从而使得到的第一视频分析值更加准确。
S4:根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面。
进一步地,如图7所示,上述步骤S4可以包括:
S41:按照预设的规则确定所述第一音频分析值对应的音频分值和所述第二视频分析值对应的视频分值。
在上述步骤S41中,预设的规则中情绪与欺诈行为存在的相关关系越大,其对应的音频分值和视频分值可以越小。预设的规则可以如下表:
第一音频分析值 音频分值
1:害怕 1
2:困惑 5
... ...
第一视频分析值 视频分值
1:害怕 1
2:困惑 5
... ...
S42:从所述第一待验证视频中提取图像数据进行人脸匹配以获得人脸匹配率。
在上述步骤S42中,进行人脸匹配可以通过如下方式实现:提取视频中的人脸图像,将人脸图像与公安机关的人脸库进行比对以获得人脸匹配率。人脸匹配率可以是0.1、0.2、0.5、0.8、1等。
S43:以所述人脸匹配率为权值计算所述音频分值和所述视频分值的加权和。
在上述步骤S43中,假如人脸匹配率为0.6,第一音频分析值和第一视频分析值均为1:害怕,即加权和为0.6*1+0.6*1=1.2。
S44:当所述加权和大于安全阈值时,提供相应的交互界面。
S45:当所述加权和小于或等于安全阈值时,则不提供相应的交互界面。
在上述步骤S44和步骤S45中,承接上述步骤S43的例子,当安全阈值为1.5时,此时加权和小于安全阈值,即判断用户存在欺诈行为,故不提供相应的交互界面。这里,相应的交互界面可以是用户购买金融产品的页面或金融产品的介绍页面等等。
通过上述步骤S41、S42、S43、S44和S45,能够在根据第一音频分析值和第一视频分析值提供相应的交互界面时,结合人脸的匹配率对用户的身份进行进一步的验证,进一步 地增加终端交互的辨识维度,提高终端交互的安全性。
S5:定时地获取第二待验证视频和第二待验证音频。
在上述步骤S5中,第二待验证视频和第二待验证音频可以是由移动终端(例如,个人手机)或专用终端(例如,银行柜员机等)录制的视频和音频。
进一步地,如图8所示,上述步骤S5可以包括:
S51:获取当前交互界面的安全等级。
在上述步骤S51中,可以为每个交互界面预设一个安全标签,在安全标签中存储有该交互界面对应的安全等级。例如,车险购买页面的安全等级预设为A级,车险介绍页面的安全等级预设为B级等。A级的安全要求高于B级的安全要求。
S52:根据所述安全等级确定获取第二待验证视频和第二待验证音频的频率,并按照所述频率获取第二待验证视频和第二待验证音频。
在上述步骤S52中,可以按照预设的规则确定安全等级对应的频率。承接上述步骤S51的例子,假设在预设的规则中,安全等级A级对应的频率为每10S一次,安全等级B级对应的频率为每30S一次,即当用户在使用车险购买页面时,每10S采集一次第二待验证视频和第二待验证音频,当用户在使用车险介绍页面时,每30S采集一次第二待验证视频和第二待验证音频。
通过上述步骤S51和步骤S52,能够根据用户当前所使用的交互界面的安全等级采用不同的视频和音频采集频率,当安全等级较高时采用较高的采集频率,当安全等级较低时采用较低的采集频率,这样能够在保证终端交互的安全性的基础之上减少数据处理量,提高效率。
S6:基于所述第二待验证视频和第二待验证音频重新进行分析以确定是否关闭交互界面。
在上述步骤S6中,对第二待验证视频和第二待验证音频重新进行分析的过程与对第一待验证视频和第一待验证音频进行的分析过程一致,在此不再一一赘述。另外,关于确定是否关闭交互界面的过程和上述步骤S4的限定类似,在此不再一一赘述。其不同之处主要在于当根据第二音频分析值和第二视频分析值得出安全性不够(例如,加权和小于安全阈值)时,关闭交互界面,当安全性足够(例如,加权和大于安全阈值)时,不关闭交互界面。
在本实施例中,能够对第一待验证音频进行分析得到第一音频分析值,对第一待验证视频进行分析得到第一视频分析值,然后综合第一音频分析值和第一视频分析值来提供供用户进行交互操作的交互界面,从而能够增加终端交互的辨识维度,提高终端交互的安全性。另外,定时地获取第二待验证视频和第二待验证音频,对第二待验证音频进行分析得到第二音频分析值,对第二待验证视频进行分析得到第二视频分析值,然后综合第二音频分析值和第二视频分析值确定是否关闭交互界面,从而能够实现定时的分析,并根据分析结果控制终端,实现在用户使用终端的过程中也进行识别,进一步提高了终端交互的安全性。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机流程来指令相关的硬件来完成,该计算机流程可存储于一计算机可读取存储介质中,该流程在执行时,可包括如上述各方法的实施例的流程。其中,前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质,或随机存储记忆体(Random Access Memory,RAM)等。
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他 步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
进一步参考图9,作为对上述图2所示方法的实现,本申请提供了一种终端交互装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。
如图9所示,本实施例所述的终端交互装置400包括:第一获取模块401、音频分析模块402、视频分析模块403、第一确定模块404、第二获取模块405以及第二确定模块406。其中:
第一获取模块401,用于获取第一待验证视频和第一待验证音频。
音频分析模块402,用于将所述第一待验证音频的特征参数输入至识别模型以对所述第一待验证音频进行分析得到第一音频分析值。
视频分析模块403,用于通过人脸图像的面部动作识别技术对所述第一待验证视频进行分析以得到第一视频分析值。
第一确定模块404,用于根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面。
第二获取模块405,用于在提供交互界面后,定时地获取第二待验证视频和第二待验证音频。
第二确定模块406,基于所述第二待验证视频和第二待验证音频重新进行分析以确定是否关闭交互界面。
在本实施例中,能够对第一待验证音频进行分析得到第一音频分析值,对第一待验证视频进行分析得到第一视频分析值,然后综合第一音频分析值和第一视频分析值来提供供用户进行交互操作的交互界面,从而能够增加终端交互的辨识维度,提高终端交互的安全性。另外,定时地获取第二待验证视频和第二待验证音频,对第二待验证音频进行分析得到第二音频分析值,对第二待验证视频进行分析得到第二视频分析值,然后综合第二音频分析值和第二视频分析值确定是否关闭交互界面,从而能够实现定时的分析,并根据分析结果控制终端,实现在用户使用终端的过程中也进行识别,进一步提高了终端交互的安全性。
关于终端交互装置400的具体限定与上述终端交互方法的具体限定一致,在此不再一一赘述。
为解决上述技术问题,本申请实施例还提供计算机设备。具体请参阅图10,图10为本实施例计算机设备基本结构框图。
所述计算机设备11包括通过系统总线相互通信连接存储器111、处理器112、网络接口113。需要指出的是,图中仅示出了具有组件111-113的计算机设备11,但是应理解的是,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。其中,本技术领域技术人员可以理解,这里的计算机设备是一种能够按照事先设定或存储的指令,自动进行数值计算和/或信息处理的设备,其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程门阵列(Field-Programmable Gate Array,FPGA)、数字处理器(Digital Signal Processor,DSP)、嵌入式设备等。
所述计算机设备可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。所述计算机设备可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机交互。
所述存储器111至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。在一些实施例中,所述存储器111可以是所述计算机设备11的内部存储单元,例如该计算机设备11的硬盘或内存。在另一些实施例中,所述存储器111也可以是所述计算机设备11的外部存储设备,例如该计算 机设备11上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,所述存储器111还可以既包括所述计算机设备11的内部存储单元也包括其外部存储设备。本实施例中,所述存储器111通常用于存储安装于所述计算机设备11的操作系统和各类应用软件,例如终端交互方法的计算机可读指令等。此外,所述存储器111还可以用于暂时地存储已经输出或者将要输出的各类数据。
所述处理器112在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器112通常用于控制所述计算机设备11的总体操作。本实施例中,所述处理器112用于运行所述存储器111中存储的计算机可读指令或者处理数据,例如运行所述终端交互方法的计算机可读指令。
所述网络接口113可包括无线网络接口或有线网络接口,该网络接口113通常用于在所述计算机设备11与其他电子设备之间建立通信连接。
本申请还提供了另一种实施方式,即提供一种计算机可读存储介质,所述计算机可读存储介质存储有终端交互流程,所述终端交互流程可被至少一个处理器执行,以使所述至少一个处理器执行如上述的终端交互方法的步骤。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,空调器,或者网络设备等)执行本申请各个实施例所述的方法。
显然,以上所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例,附图中给出了本申请的较佳实施例,但并不限制本申请的专利范围。本申请可以以许多不同的形式来实现,相反地,提供这些实施例的目的是使对本申请的公开内容的理解更加透彻全面。尽管参照前述实施例对本申请进行了详细的说明,对于本领域的技术人员来而言,其依然可以对前述各具体实施方式所记载的技术方案进行修改,或者对其中部分技术特征进行等效替换。凡是利用本申请说明书及附图内容所做的等效结构,直接或间接运用在其他相关的技术领域,均同理在本申请专利保护范围之内。

Claims (20)

  1. 一种基于人工智能的终端交互方法,其中,包括下述步骤:
    获取第一待验证视频和第一待验证音频;
    将所述第一待验证音频的特征参数输入至识别模型以对所述第一待验证音频进行分析得到第一音频分析值;
    通过人脸图像的面部动作识别技术对所述第一待验证视频进行分析以得到第一视频分析值;
    根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面;以及
    在提供交互界面后,定时地获取第二待验证视频和第二待验证音频;
    基于所述第二待验证视频和第二待验证音频重新进行分析以确定是否关闭交互界面。
  2. 根据权利要求1所述的终端交互方法,其中,所述对所述第一待验证视频进行分析以得到第一视频分析值包括:
    在所述第一待验证视频中提取存在微表情的图像序列;
    获取所述图像序列中每一张图像的情绪编码;
    在同一个图像序列的图像中,将情绪编码相同的图像分为一组,并根据分组中的图像确定每组情绪编码的分值,以分值最大的分组对应的情绪编码作为所述图像序列的图像情绪编码;
    根据所述图像序列的图像情绪编码确定所述第一视频分析值。
  3. 根据权利要求2所述的终端交互方法,其中,所述在所述第一待验证视频中提取存在微表情的图像序列包括:
    在所述第一待验证视频中,按照预设的时间间隔提取抽样图像;
    判断每张抽样图像中是否存在微表情;
    当抽样图像中存在微表情时,获取所述第一待验证视频中与所述抽样图像邻近的图像以组成所述图像序列。
  4. 根据权利要求2所述的终端交互方法,其中,所述根据分组中的图像确定每组情绪编码的分值包括:
    按照图像在所述待验证视频中出现的时间对所述图像序列中的每一张图像进行排列,并根据排列的顺序设置每一张图像的权值,所述权值为一个跟随排列的顺序先递增后递减的数列;
    以每组中所有图像的权值相加以得到该组对应的情绪编码的分值。
  5. 根据权利要求2所述的终端交互方法,其中,所述图像序列为多个图像序列,所述根据所述图像序列的图像情绪编码确定所述第一视频分析值包括:
    对每个图像序列赋予相同的权值;
    识别每个图像序列的时间段,将第一待验证音频按识别到的时间段进行分段,得到与每个图像序列对应的音频片段;
    对所述音频片段进行分析以获得每个图像序列对应的音频情绪编码;
    当音频情绪编码与同时间段内的图像序列的图像情绪编码相同时,增大该图像序列对应的权值;
    将图像情绪编码相同的图像序列分为一组,并将每组中图像序列对应的权值相加以得到每组图像序列的分值,以分值最大的分组对应的图像情绪编码为所述第一视频分析值。
  6. 根据权利要求1所述的终端交互方法,其中,所述根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面包括:
    按照预设的规则确定所述第一音频分析值对应的音频分值和所述第二视频分析值对应的视频分值;
    从所述第一待验证视频中提取图像数据进行人脸匹配以获得人脸匹配率;
    以所述人脸匹配率为权值计算所述音频分值和所述视频分值的加权和;
    当所述加权和大于安全阈值时,提供相应的交互界面;
    当所述加权和小于或等于安全阈值时,则不提供相应的交互界面。
  7. 根据权利要求1所述的终端交互方法,其中,所述定时地获取第二待验证视频和第二待验证音频包括:
    获取当前交互界面的安全等级;
    根据所述安全等级确定获取第二待验证视频和第二待验证音频的频率,并按照所述频率获取第二待验证视频和第二待验证音频。
  8. 一种基于人工智能的终端交互装置,其中,包括:
    第一获取模块,用于获取第一待验证视频和第一待验证音频;
    音频分析模块,用于将所述第一待验证音频的特征参数输入至识别模型以对所述第一待验证音频进行分析得到第一音频分析值;
    视频分析模块,用于通过人脸图像的面部动作识别技术对所述第一待验证视频进行分析以得到第一视频分析值;
    第一确定模块,用于根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面;以及
    第二获取模块,用于在提供交互界面后,定时地获取第二待验证视频和第二待验证音频;
    第二确定模块,基于所述第二待验证视频和第二待验证音频重新进行分析以确定是否关闭交互界面。
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其中,所述处理器执行所述计算机可读指令时实现如下所述的数据更新方法的步骤:
    获取第一待验证视频和第一待验证音频;
    将所述第一待验证音频的特征参数输入至识别模型以对所述第一待验证音频进行分析得到第一音频分析值;
    通过人脸图像的面部动作识别技术对所述第一待验证视频进行分析以得到第一视频分析值;
    根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面;以及
    在提供交互界面后,定时地获取第二待验证视频和第二待验证音频;
    基于所述第二待验证视频和第二待验证音频重新进行分析以确定是否关闭交互界面。
  10. 根据权利要求9所述的计算机设备,其中,所述对所述第一待验证视频进行分析以得到第一视频分析值包括:
    在所述第一待验证视频中提取存在微表情的图像序列;
    获取所述图像序列中每一张图像的情绪编码;
    在同一个图像序列的图像中,将情绪编码相同的图像分为一组,并根据分组中的图像确定每组情绪编码的分值,以分值最大的分组对应的情绪编码作为所述图像序列的图像情绪编码;
    根据所述图像序列的图像情绪编码确定所述第一视频分析值。
  11. 根据权利要求10所述的计算机设备,其中,所述在所述第一待验证视频中提取存在微表情的图像序列包括:
    在所述第一待验证视频中,按照预设的时间间隔提取抽样图像;
    判断每张抽样图像中是否存在微表情;
    当抽样图像中存在微表情时,获取所述第一待验证视频中与所述抽样图像邻近的图像以组成所述图像序列。
  12. 根据权利要求10所述的计算机设备,其中,所述根据分组中的图像确定每组情 绪编码的分值包括:
    按照图像在所述待验证视频中出现的时间对所述图像序列中的每一张图像进行排列,并根据排列的顺序设置每一张图像的权值,所述权值为一个跟随排列的顺序先递增后递减的数列;
    以每组中所有图像的权值相加以得到该组对应的情绪编码的分值。
  13. 根据权利要求10所述的计算机设备,其中,所述图像序列为多个图像序列,所述根据所述图像序列的图像情绪编码确定所述第一视频分析值包括:
    对每个图像序列赋予相同的权值;
    识别每个图像序列的时间段,将第一待验证音频按识别到的时间段进行分段,得到与每个图像序列对应的音频片段;
    对所述音频片段进行分析以获得每个图像序列对应的音频情绪编码;
    当音频情绪编码与同时间段内的图像序列的图像情绪编码相同时,增大该图像序列对应的权值;
    将图像情绪编码相同的图像序列分为一组,并将每组中图像序列对应的权值相加以得到每组图像序列的分值,以分值最大的分组对应的图像情绪编码为所述第一视频分析值。
  14. 根据权利要求9所述的计算机设备,其中,所述根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面包括:
    按照预设的规则确定所述第一音频分析值对应的音频分值和所述第二视频分析值对应的视频分值;
    从所述第一待验证视频中提取图像数据进行人脸匹配以获得人脸匹配率;
    以所述人脸匹配率为权值计算所述音频分值和所述视频分值的加权和;
    当所述加权和大于安全阈值时,提供相应的交互界面;
    当所述加权和小于或等于安全阈值时,则不提供相应的交互界面。
  15. 一种计算机可读存储介质,其中,所述计算机可读指令被一种处理器执行时,使得所述一种处理执行所述的终端交互方法的步骤:
    获取第一待验证视频和第一待验证音频;
    将所述第一待验证音频的特征参数输入至识别模型以对所述第一待验证音频进行分析得到第一音频分析值;
    通过人脸图像的面部动作识别技术对所述第一待验证视频进行分析以得到第一视频分析值;
    根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面;以及
    在提供交互界面后,定时地获取第二待验证视频和第二待验证音频;
    基于所述第二待验证视频和第二待验证音频重新进行分析以确定是否关闭交互界面。
  16. 根据权利要求15所述的计算机可读存储介质,其中,所述对所述第一待验证视频进行分析以得到第一视频分析值包括:
    在所述第一待验证视频中提取存在微表情的图像序列;
    获取所述图像序列中每一张图像的情绪编码;
    在同一个图像序列的图像中,将情绪编码相同的图像分为一组,并根据分组中的图像确定每组情绪编码的分值,以分值最大的分组对应的情绪编码作为所述图像序列的图像情绪编码;
    根据所述图像序列的图像情绪编码确定所述第一视频分析值。
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述在所述第一待验证视频中提取存在微表情的图像序列包括:
    在所述第一待验证视频中,按照预设的时间间隔提取抽样图像;
    判断每张抽样图像中是否存在微表情;
    当抽样图像中存在微表情时,获取所述第一待验证视频中与所述抽样图像邻近的图像以组成所述图像序列。
  18. 根据权利要求16所述的计算机可读存储介质,其中,所述根据分组中的图像确定每组情绪编码的分值包括:
    按照图像在所述待验证视频中出现的时间对所述图像序列中的每一张图像进行排列,并根据排列的顺序设置每一张图像的权值,所述权值为一个跟随排列的顺序先递增后递减的数列;
    以每组中所有图像的权值相加以得到该组对应的情绪编码的分值。
  19. 根据权利要求16所述的计算机可读存储介质,其中,所述图像序列为多个图像序列,所述根据所述图像序列的图像情绪编码确定所述第一视频分析值包括:
    对每个图像序列赋予相同的权值;
    识别每个图像序列的时间段,将第一待验证音频按识别到的时间段进行分段,得到与每个图像序列对应的音频片段;
    对所述音频片段进行分析以获得每个图像序列对应的音频情绪编码;
    当音频情绪编码与同时间段内的图像序列的图像情绪编码相同时,增大该图像序列对应的权值;
    将图像情绪编码相同的图像序列分为一组,并将每组中图像序列对应的权值相加以得到每组图像序列的分值,以分值最大的分组对应的图像情绪编码为所述第一视频分析值。
  20. 根据权利要求15所述的计算机可读存储介质,其中,所述根据所述第一音频分析值和所述第一视频分析值确定是否提供供用户进行交互操作的交互界面包括:
    按照预设的规则确定所述第一音频分析值对应的音频分值和所述第二视频分析值对应的视频分值;
    从所述第一待验证视频中提取图像数据进行人脸匹配以获得人脸匹配率;
    以所述人脸匹配率为权值计算所述音频分值和所述视频分值的加权和;
    当所述加权和大于安全阈值时,提供相应的交互界面;
    当所述加权和小于或等于安全阈值时,则不提供相应的交互界面。
PCT/CN2020/105762 2019-12-25 2020-07-30 终端交互方法、装置、计算机设备及存储介质 WO2021128847A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911357310.8A CN111178226A (zh) 2019-12-25 2019-12-25 终端交互方法、装置、计算机设备及存储介质
CN201911357310.8 2019-12-25

Publications (1)

Publication Number Publication Date
WO2021128847A1 true WO2021128847A1 (zh) 2021-07-01

Family

ID=70650454

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105762 WO2021128847A1 (zh) 2019-12-25 2020-07-30 终端交互方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN111178226A (zh)
WO (1) WO2021128847A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115208585A (zh) * 2022-09-07 2022-10-18 环球数科集团有限公司 一种基于零知识证明的数据交互方法与系统

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178226A (zh) * 2019-12-25 2020-05-19 深圳壹账通智能科技有限公司 终端交互方法、装置、计算机设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150279369A1 (en) * 2014-03-27 2015-10-01 Samsung Electronics Co., Ltd. Display apparatus and user interaction method thereof
CN107766785A (zh) * 2017-01-25 2018-03-06 丁贤根 一种面部识别方法
CN108053218A (zh) * 2017-12-29 2018-05-18 宁波大学 一种安全的移动支付方法
CN110223710A (zh) * 2019-04-18 2019-09-10 深圳壹账通智能科技有限公司 多重联合认证方法、装置、计算机装置及存储介质
CN110473049A (zh) * 2019-05-22 2019-11-19 深圳壹账通智能科技有限公司 理财产品推荐方法、装置、设备及计算机可读存储介质
CN111178226A (zh) * 2019-12-25 2020-05-19 深圳壹账通智能科技有限公司 终端交互方法、装置、计算机设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150279369A1 (en) * 2014-03-27 2015-10-01 Samsung Electronics Co., Ltd. Display apparatus and user interaction method thereof
CN107766785A (zh) * 2017-01-25 2018-03-06 丁贤根 一种面部识别方法
CN108053218A (zh) * 2017-12-29 2018-05-18 宁波大学 一种安全的移动支付方法
CN110223710A (zh) * 2019-04-18 2019-09-10 深圳壹账通智能科技有限公司 多重联合认证方法、装置、计算机装置及存储介质
CN110473049A (zh) * 2019-05-22 2019-11-19 深圳壹账通智能科技有限公司 理财产品推荐方法、装置、设备及计算机可读存储介质
CN111178226A (zh) * 2019-12-25 2020-05-19 深圳壹账通智能科技有限公司 终端交互方法、装置、计算机设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115208585A (zh) * 2022-09-07 2022-10-18 环球数科集团有限公司 一种基于零知识证明的数据交互方法与系统
CN115208585B (zh) * 2022-09-07 2022-11-18 环球数科集团有限公司 一种基于零知识证明的数据交互方法与系统

Also Published As

Publication number Publication date
CN111178226A (zh) 2020-05-19

Similar Documents

Publication Publication Date Title
Yadav et al. Survey on machine learning in speech emotion recognition and vision systems using a recurrent neural network (RNN)
WO2021208287A1 (zh) 用于情绪识别的语音端点检测方法、装置、电子设备及存储介质
CN112259106B (zh) 声纹识别方法、装置、存储介质及计算机设备
Datcu et al. Semantic audiovisual data fusion for automatic emotion recognition
Mariooryad et al. Compensating for speaker or lexical variabilities in speech for emotion recognition
US10210867B1 (en) Adjusting user experience based on paralinguistic information
WO2019019256A1 (zh) 电子装置、身份验证的方法、系统及计算机可读存储介质
WO2021047319A1 (zh) 基于语音的个人信用评估方法、装置、终端及存储介质
US10019988B1 (en) Adjusting a ranking of information content of a software application based on feedback from a user
RU2720359C1 (ru) Способ и оборудование распознавания эмоций в речи
US20170294192A1 (en) Classifying Signals Using Mutual Information
Sethu et al. Speech based emotion recognition
Yang et al. Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification
Noroozi et al. Supervised vocal-based emotion recognition using multiclass support vector machine, random forests, and adaboost
Mohamed et al. Face mask recognition from audio: The MASC database and an overview on the mask challenge
WO2021128847A1 (zh) 终端交互方法、装置、计算机设备及存储介质
CN110136726A (zh) 一种语音性别的估计方法、装置、系统及存储介质
Lopez-Otero et al. Analysis of gender and identity issues in depression detection on de-identified speech
CN113314150A (zh) 基于语音数据的情绪识别方法、装置及存储介质
Dawood et al. A robust voice spoofing detection system using novel CLS-LBP features and LSTM
Pao et al. A study on the search of the most discriminative speech features in the speaker dependent speech emotion recognition
Shah et al. Speech emotion recognition based on SVM using MATLAB
Tsai et al. Self-defined text-dependent wake-up-words speaker recognition system
CN110838294B (zh) 一种语音验证方法、装置、计算机设备及存储介质
Ntalampiras Directed acyclic graphs for content based sound, musical genre, and speech emotion classification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20904556

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31.10.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20904556

Country of ref document: EP

Kind code of ref document: A1