WO2021128847A1 - Terminal interaction method and apparatus, computer device and storage medium - Google Patents

Terminal interaction method and apparatus, computer device and storage medium

Info

Publication number
WO2021128847A1
Authority
WO
WIPO (PCT)
Prior art keywords
verified
image
video
audio
analysis value
Prior art date
Application number
PCT/CN2020/105762
Other languages
English (en)
Chinese (zh)
Inventor
熊玮
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2021128847A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/22Interactive procedures; Man-machine interfaces
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a terminal interaction method, device, computer equipment and storage medium.
  • The purpose of the embodiments of this application is to propose a terminal interaction method to solve the problem that existing terminal interaction methods have a single identification dimension and usually verify the user only at the beginning of the operation, which fails to provide a good security guarantee.
  • an embodiment of the present application provides a terminal interaction method, which adopts the following technical solutions:
  • acquiring a first to-be-verified video and a first to-be-verified audio; inputting characteristic parameters of the first to-be-verified audio into a recognition model to analyze the first to-be-verified audio to obtain a first audio analysis value; analyzing the first to-be-verified video through facial action recognition technology of the face image to obtain a first video analysis value; determining, according to the first audio analysis value and the first video analysis value, whether to provide an interactive interface for the user to perform interactive operations; after the interactive interface is provided, periodically acquiring a second to-be-verified video and a second to-be-verified audio; and each time the second to-be-verified video and the second to-be-verified audio are acquired, analysis is performed again to determine whether to close the interactive interface.
  • an embodiment of the present application also provides a terminal interaction device, which adopts the following technical solutions:
  • the first acquisition module is configured to acquire the first to-be-verified video and the first to-be-verified audio;
  • an audio analysis module configured to input characteristic parameters of the first to-be-verified audio into a recognition model to analyze the first to-be-verified audio to obtain a first audio analysis value;
  • a video analysis module configured to analyze the first to-be-verified video through facial action recognition technology of the face image to obtain a first video analysis value;
  • the first determining module is configured to determine whether to provide an interactive interface for the user to perform interactive operations according to the first audio analysis value and the first video analysis value;
  • the second acquisition module is configured to periodically acquire the second to-be-verified video and the second to-be-verified audio after the interactive interface is provided;
  • the second determining module is configured to re-analyze based on the second to-be-verified video and the second to-be-verified audio to determine whether to close the interactive interface.
  • the embodiments of the present application also provide a computer device, which adopts the following technical solutions:
  • a computer device includes a memory and a processor, the memory stores computer-readable instructions, and when the processor executes the computer-readable instructions, the steps of the terminal interaction method described above are implemented.
  • the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
  • the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the steps of the terminal interaction method described above are implemented.
  • Determining whether to provide the interactive interface according to both the first audio analysis value and the first video analysis value increases the identification dimension of terminal interaction and improves the security of terminal interaction.
  • Determining whether to close the interactive interface according to the second audio analysis value and the second video analysis value realizes periodic analysis, and the terminal can be controlled according to the analysis result, so that identification is also performed while the user is using the terminal, which further improves the security of terminal interaction.
  • Figure 1 is an exemplary system architecture diagram to which the present application can be applied;
  • Fig. 2 is a flowchart of an embodiment of a terminal interaction method according to the present application.
  • FIG. 3 is a flowchart of a specific implementation of step S3 in FIG. 2;
  • FIG. 4 is a flowchart of a specific implementation of step S31 in FIG. 3;
  • FIG. 5 is a flowchart of a specific implementation of step S33 in FIG. 3;
  • FIG. 6 is a flowchart of a specific implementation of step S34 in FIG. 3;
  • FIG. 7 is a flowchart of a specific implementation of step S4 in FIG. 2;
  • FIG. 8 is a flowchart of a specific implementation of step S5 in FIG. 2;
  • Fig. 9 is a schematic structural diagram of an embodiment of a terminal interaction device according to the present application.
  • Fig. 10 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • Various communication client applications such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software, can be installed on the terminal devices 101, 102, and 103.
  • The terminal devices 101, 102, 103 may be various electronic devices with display screens and support for web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, etc.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • The terminal interaction method provided in the embodiments of the present application is generally executed by the server/terminal device, and accordingly, the terminal interaction device is generally set in the server/terminal device.
  • The numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs.
  • the terminal interaction method includes the following steps:
  • the first to-be-verified video and the first to-be-verified audio may be video and audio recorded in real time by a mobile terminal (for example, a personal mobile phone) or a dedicated terminal (for example, a bank teller machine, etc.).
  • The user needs to speak according to given wording or answer the salesperson's questions. For example, the salesperson asks: "Are you Mr. XX?", and the user answers: "Yes"; the salesperson continues to ask: "The return rate of the XX product you are buying is XX and the payback period is XX, do you understand?", and the user answers: "Yes".
  • S2 Input the characteristic parameters of the first to-be-verified audio into a recognition model to analyze the first to-be-verified audio to obtain a first audio analysis value.
  • step S2 the analysis of the first to-be-verified audio can be realized through the following steps:
  • the characteristic parameter may be a MFCC (Mel-scale Frequency Cepstral Coefficients) characteristic parameter, a sound intensity characteristic parameter, or a formant characteristic parameter.
  • Formant characteristic parameters can be extracted by the spectral envelope method, the cepstrum method, the LPC interpolation method, the LPC root-finding method, the Hilbert transform method, etc.
  • Pre-emphasis can increase the resolution of the high frequency band in the speech signal to remove the influence of the lip radiation.
  • the function of windowing and framing is: the speech signal itself is non-stationary, but it also has the characteristics of short-term stability, so the speech signal is divided into small segments and treated as a stationary signal.
  • The segmentation here can be understood as framing; in order to fully analyze the speech signal, there must be a frame shift (similar to the sliding window used in image processing).
  • the frame shift here can be understood as windowing.
  • endpoint detection can also be performed: the start point and the end point of the valid sound segment are detected to remove the invalid sound segment, thereby improving the processing efficiency of the voice signal.
  • After framing and windowing, a Fast Fourier Transform (FFT) can be applied to each frame to obtain its spectrum for subsequent feature extraction.
  • the characteristic parameters may also be prosodic characteristics such as speech rate, energy, average zero-crossing rate, and pitch frequency.
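  • As a non-limiting illustration of the feature extraction described above, the following sketch computes MFCC characteristic parameters for the first to-be-verified audio; the librosa library, the 0.97 pre-emphasis coefficient and the 25 ms/10 ms frame settings are assumptions made for illustration, not part of the disclosure. Framing, windowing and the FFT are handled inside librosa's MFCC routine.

    # Illustrative sketch only: extract MFCC characteristic parameters from the
    # first to-be-verified audio (librosa and all numeric settings are assumptions).
    import librosa
    import numpy as np

    def extract_mfcc_features(audio_path, n_mfcc=13):
        # Load the audio to be verified (mono, resampled to 16 kHz).
        signal, sr = librosa.load(audio_path, sr=16000, mono=True)
        # Pre-emphasis: boost the high-frequency band to offset lip radiation.
        emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
        # Framing, windowing and FFT happen inside librosa's MFCC routine
        # (25 ms frames with a 10 ms frame shift here).
        mfcc = librosa.feature.mfcc(
            y=emphasized, sr=sr, n_mfcc=n_mfcc,
            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
        # Summarise each coefficient over time to form the characteristic
        # parameter vector fed to the recognition model.
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])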
  • the training set used to train the recognition model usually includes a feature parameter and a classification result corresponding to the feature parameter.
  • the recognition result of the characteristic parameter of the recognition model will be closer and closer to the classification result of the training set.
  • the classification results in the training set can be simply divided into two results, fraud and safety, or can be divided into multiple results such as fear, disgust, surprise, confusion, thinking, sadness, and anger.
  • The recognition model can use a Hidden Markov Model (HMM), a Gaussian Mixture Model (GMM), a Support Vector Machine (SVM), an Artificial Neural Network (ANN), etc.; among them, the support vector machine (SVM) is easy to debug and experiment with.
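  • If the support vector machine is chosen, the recognition model could be trained roughly as in the following sketch; scikit-learn, the label set (0: safety, 1: fraud) and the 80/20 split are illustrative assumptions.

    # Illustrative sketch only: train an SVM recognition model on labelled
    # characteristic-parameter vectors (scikit-learn is an assumption).
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def train_recognition_model(feature_vectors, labels):
        # labels could be binary (0: safety, 1: fraud) or multi-class emotions.
        X_train, X_test, y_train, y_test = train_test_split(
            feature_vectors, labels, test_size=0.2, random_state=0)
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        model.fit(X_train, y_train)
        print("held-out accuracy:", model.score(X_test, y_test))
        return model

    # Hypothetical usage for one recording:
    # value = model.predict([extract_mfcc_features("first_audio.wav")])[0]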
  • S3 Analyze the first to-be-verified video through facial action recognition technology of the face image to obtain a first video analysis value.
  • step S3 may include:
  • step S31 may include:
  • S311 In the first video to be verified, extract sample images according to a preset time interval.
  • the preset time interval may be 1S, 2S, 5S, and so on.
  • For example, if the preset time interval is 1S, the extracted sample images are the images corresponding to 1S, 2S, 3S... in the first video to be verified. Since some micro-expressions have a short duration (according to statistics, the shortest micro-expression can last only 0.25 s), the preset time interval should be as small as possible if the computing power allows.
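  • A minimal sketch of this sampling step, assuming OpenCV and a video file as input (the 1S interval follows the example above):

    # Illustrative sketch only: extract sample images from the first to-be-verified
    # video at a preset time interval (OpenCV is an assumption).
    import cv2

    def extract_sample_images(video_path, interval_s=1.0):
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        step = max(1, int(round(fps * interval_s)))  # frames between samples
        samples = []  # (frame_index, image) pairs at 1S, 2S, 3S, ...
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index > 0 and index % step == 0:
                samples.append((index, frame))
            index += 1
        cap.release()
        return samples, fps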
  • S312 Determine whether there are micro-expressions in each sampled image, and when there are micro-expressions in the sampled images, acquire images adjacent to the sampled image in the first to-be-verified video to form the image sequence.
  • step S312 it is possible to request an external facial action coding system (Facial action coding system, FACS) to determine whether there are micro-expressions in the sampled image.
  • The face image is analyzed to determine which facial actions are present, and the emotion code corresponding to the face image is then obtained from those facial actions.
  • Different facial motion numbers correspond to different facial motions.
  • AU1 Raise the inner corner of the eyebrows
  • AU2 Raise the outer corner of the eyebrows
  • AU4 Lower the eyebrows
  • AU6 Raise the cheeks
  • AU9 Wrinkle the nose
  • the facial action obtained is: AU4+AU6+AU9+AU11+AU16+AU25, then the emotion corresponding to the emotion code obtained at this time is pain.
  • the facial action is: AU4+AU5+AU7+AU23, then the emotion corresponding to the emotion code obtained at this time is anger.
  • the facial action is: AU4+AU14, then the emotion corresponding to the emotion code obtained at this time is thinking.
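  • The mapping from detected facial actions to an emotion can be expressed as a simple lookup; the sketch below only encodes the three combinations listed above, and the "normal" fallback is an assumption.

    # Illustrative lookup from FACS action-unit combinations to emotions,
    # covering only the combinations listed above.
    AU_COMBINATIONS = {
        frozenset({"AU4", "AU6", "AU9", "AU11", "AU16", "AU25"}): "pain",
        frozenset({"AU4", "AU5", "AU7", "AU23"}): "anger",
        frozenset({"AU4", "AU14"}): "thinking",
    }

    def emotion_from_aus(detected_aus):
        """Return the emotion whose AU combination is fully present, if any."""
        detected = set(detected_aus)
        for combo, emotion in AU_COMBINATIONS.items():
            if combo <= detected:  # every AU of the combination was detected
                return emotion
        return "normal"  # assumed fallback when no listed combination matches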
  • the images adjacent to the sampled image may be the image of 20 frames before the sampled image and the image of 20 frames after the sampled image, or the image of 0.5S before the sampled image and the image of 0.5S after the sampled image.
  • For example, if the times of the sampled images with micro-expressions in the video to be verified are 1S and 3S, the images from 0.5S to 1.5S form one image sequence, and the images from 2.5S to 3.5S form another image sequence.
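  • Step S312 can then be sketched as follows; has_micro_expression() stands in for the request to the external FACS service and is a hypothetical helper, and the 0.5S window follows the example above.

    # Illustrative sketch only: build an image sequence around every sampled image
    # flagged as containing a micro-expression.
    def build_image_sequences(frames, sample_indices, fps, window_s=0.5):
        half = int(round(fps * window_s))  # frames covering 0.5S on each side
        sequences = []
        for idx in sample_indices:
            if has_micro_expression(frames[idx]):  # hypothetical FACS-based check
                start = max(0, idx - half)
                end = min(len(frames), idx + half + 1)
                sequences.append(frames[start:end])
        return sequences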
  • the emotion coding of each image in the image sequence can be obtained by requesting an external facial action coding system (Facial action coding system, FACS).
  • an emotion code can correspond to an emotion. Emotions can be simply divided into two results, fraud and safety, or can be divided into multiple results such as fear, disgust, surprise, confusion, thinking, sadness, and anger.
  • S33 Among the images of the same image sequence, divide the images with the same emotion code into a group, determine the score of each group of emotion codes according to the images in the group, and use the emotion code corresponding to the group with the largest score as the image emotion code of the image sequence.
  • In step S33, for example, there are ten images in the same image sequence, among which the emotion of 3 images is 1: fear, the emotion of 3 images is 0: normal, and the emotion of 4 images is 2: confused. That is, these ten images will be divided into three groups according to fear, normal and confused.
  • the determining the score of each group of emotion codes according to the images in the group may include:
  • S331 Arrange each image in the image sequence according to the time when the image appears in the video to be verified, and set the weight of each image according to the order of the arrangement, where the weights form a sequence that first increases and then decreases along that order.
  • The weight sequence may be preset; it generally follows the order of the images, with its peak in the middle of the order.
  • the time of an image sequence in the video is 0.5S to 1.5S, and there are a total of 10 images in the image sequence, and the time that they appear in the video are: 0.6S, 0.7S, 0.8S, 0.9S, 1.0 S, 1.1S, 1.2S, 1.3S, 1.4S, 1.5S. That is, the weights set for these 10 images can be 1, 2, 3, 4, 5, 6, 5, 4, 3, 2 in order.
  • step S332 the example of the above step S331 is followed.
  • The fear emotion group has the highest score, so fear is used as the emotion of the image sequence.
  • the amplitude of the micro-expression presents a trend that first gradually increases to a peak and then gradually falls back to normal. Therefore, at the midpoint of the micro-expression process, the amplitude of the micro-expression is the largest.
  • The micro-expression recognition results of the images near that midpoint are therefore more reliable. Through the above steps S331 and S332, appropriate weights can be assigned to the micro-expression recognition results of the images at different time points, so that the emotion recognition result of the image sequence is more accurate.
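  • A compact sketch of steps S331 and S332 follows; the exact weight sequence (here a symmetric ramp peaking at the midpoint) is an assumption consistent with the increase-then-decrease rule above.

    # Illustrative sketch only: weight per-image emotion codes so that images near
    # the midpoint of the sequence count more, then pick the highest-scoring code.
    from collections import defaultdict

    def sequence_emotion(emotion_codes):
        """emotion_codes: per-image emotion codes, in order of appearance."""
        n = len(emotion_codes)
        mid = (n - 1) / 2.0
        # Weights rise towards the midpoint (largest micro-expression amplitude)
        # and fall back towards both ends of the sequence.
        weights = [mid + 1.0 - abs(i - mid) for i in range(n)]
        scores = defaultdict(float)
        for code, weight in zip(emotion_codes, weights):
            scores[code] += weight
        return max(scores, key=scores.get)

    # e.g. sequence_emotion(["normal", "fear", "fear", "fear", "normal"]) -> "fear"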
  • S34 Determine the first video analysis value according to the emotion code of the image sequence.
  • the foregoing step S34 may include:
  • In step S341, for example, four image sequences are extracted from the video, with time periods 0.5S to 1.5S, 2S to 3S, 3.5S to 4.5S, and 5S to 6S, and the weight of each image sequence can be set to 1.
  • S342 Identify the time period of each image sequence, and segment the first to-be-verified audio according to the identified time period to obtain an audio segment corresponding to each image sequence.
  • In step S342 and step S343, the example of the above step S341 is followed, that is, the four audio segments of 0.5S to 1.5S, 2S to 3S, 3.5S to 4.5S, and 5S to 6S are cut from the first audio to be verified and analyzed to obtain the corresponding audio emotion codes.
  • the audio clip can be analyzed by requesting a third-party audio sentiment analysis service.
  • the specific analysis method may also be consistent with the above step S2.
  • In step S344, following the example of the above steps S342 and S343, suppose that the audio emotion codes of the four image sequences of 0.5S to 1.5S, 2S to 3S, 3.5S to 4.5S, and 5S to 6S are 1: fear, 2: confused, 2: confused, 3: angry, and the audio emotion check results of these four audio clips are 1: fear, 0: normal, 0: normal, 0: normal. At this time, the weight corresponding to the image sequence of 0.5S to 1.5S is increased to 3, and the weights of the other three image sequences remain at 1.
  • S345 Divide the image sequences with the same image emotion code into one group, add the corresponding weights of the image sequences in each group to obtain the score of each group, and use the image emotion code corresponding to the group with the largest score as the first video analysis value.
  • In step S345, the example of the above step S344 is followed: the image sequence of 0.5S to 1.5S forms one group whose emotion code is 1: fear and whose score is 3; the image sequences of 2S to 3S and 3.5S to 4.5S form one group whose emotion code is 2: confused and whose score is 2; and the image sequence of 5S to 6S forms one group whose emotion code is 0: normal and whose score is 1. The group with the largest score is the fear group, so the first video analysis value is 1: fear.
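  • Putting steps S341 to S345 together, a hedged sketch could look like the following; analyse_audio_segment() is a hypothetical helper (for example the recognition model of step S2 or a third-party service), and the boosted weight of 3 follows the example above.

    # Illustrative sketch only: derive the first video analysis value from the
    # image sequences and the audio segments covering the same time periods.
    from collections import defaultdict

    def first_video_analysis_value(sequences, audio_path, boosted_weight=3.0):
        """sequences: dicts like {"start": 0.5, "end": 1.5, "emotion": "fear"}."""
        scores = defaultdict(float)
        for seq in sequences:
            weight = 1.0
            audio_emotion = analyse_audio_segment(  # hypothetical helper
                audio_path, seq["start"], seq["end"])
            if audio_emotion == seq["emotion"]:  # audio check confirms the emotion
                weight = boosted_weight
            scores[seq["emotion"]] += weight
        # Emotion code of the highest-scoring group of image sequences.
        return max(scores, key=scores.get)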
  • S4 Determine, according to the first audio analysis value and the first video analysis value, whether to provide an interactive interface for the user to perform interactive operations.
  • step S4 may include:
  • S41 Determine the audio score corresponding to the first audio analysis value and the video score corresponding to the first video analysis value according to a preset rule.
  • the preset rules can be as follows:
  • For example, the first audio analysis value can be mapped to an audio score (1: fear → 1, 2: confusion → 5, ...), and the first video analysis value can be mapped to a video score in the same way (1: fear → 1, 2: confusion → 5, ...).
  • S42 Extract image data from the first to-be-verified video to perform face matching to obtain a face matching rate.
  • the face matching can be implemented in the following manner: extracting the face image in the video, and comparing the face image with the face database of the public security organ to obtain the face matching rate.
  • the face matching rate can be 0.1, 0.2, 0.5, 0.8, 1, etc.
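  • A sketch of the face-matching step, assuming the face_recognition package and a locally available reference image standing in for the external face database lookup (which is outside the scope of this sketch):

    # Illustrative sketch only: compute a rough face matching rate between a frame
    # from the first to-be-verified video and a reference face image.
    import face_recognition

    def face_matching_rate(frame_image_path, reference_image_path):
        frame = face_recognition.load_image_file(frame_image_path)
        reference = face_recognition.load_image_file(reference_image_path)
        frame_enc = face_recognition.face_encodings(frame)
        ref_enc = face_recognition.face_encodings(reference)
        if not frame_enc or not ref_enc:
            return 0.0  # no face found in one of the images
        distance = face_recognition.face_distance([ref_enc[0]], frame_enc[0])[0]
        # Map the embedding distance to a matching rate in [0, 1] (assumption).
        return max(0.0, 1.0 - float(distance))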
  • S43 Calculate a weighted sum of the audio score and the video score using the face matching rate as a weight.
  • In step S44 and step S45, following the example of step S43: when the security threshold is 1.5 and the weighted sum is less than the security threshold, it is judged that the user may have fraudulent behavior, so the corresponding interactive interface is not provided.
  • the corresponding interactive interface may be a page where the user purchases a financial product or an introduction page of the financial product, and so on.
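  • Combining steps S41 to S45, the decision whether to provide the interactive interface can be sketched as below; the score table and the way the face matching rate weights the sum are illustrative assumptions, and the 1.5 threshold follows the example above.

    # Illustrative sketch only: map the analysis values to scores, weight them by
    # the face matching rate, and compare the weighted sum with a security threshold.
    SCORE_TABLE = {"fear": 1, "confused": 5, "normal": 10}  # assumed preset rule

    def should_provide_interface(audio_value, video_value, face_match_rate,
                                 security_threshold=1.5):
        audio_score = SCORE_TABLE.get(audio_value, 0)
        video_score = SCORE_TABLE.get(video_value, 0)
        weighted_sum = face_match_rate * (audio_score + video_score)
        return weighted_sum >= security_threshold  # provide the interface or not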
  • the second to-be-verified video and the second to-be-verified audio may be video and audio recorded by a mobile terminal (for example, a personal mobile phone) or a dedicated terminal (for example, a bank teller machine, etc.).
  • step S5 may include:
  • a security label may be preset for each interactive interface, and the security level corresponding to the interactive interface is stored in the security label.
  • the security level of the auto insurance purchase page is preset to level A
  • the security level of the auto insurance introduction page is preset to level B, etc.
  • the safety requirements of Class A are higher than those of Class B.
  • S52 Determine the frequency for acquiring the second to-be-verified video and the second to-be-verified audio according to the security level, and acquire the second to-be-verified video and the second to-be-verified audio according to the frequency.
  • the frequency corresponding to the security level can be determined according to a preset rule.
  • the frequency corresponding to safety level A is once every 10S
  • the frequency corresponding to safety level B is once every 30S
  • That is, when the security level is A, the second to-be-verified video and the second to-be-verified audio are collected once every 10S;
  • when the security level is B, the second to-be-verified video and the second to-be-verified audio are collected once every 30S.
  • different video and audio collection frequencies can be used according to the security level of the interactive interface currently used by the user.
  • When the security level is higher, a higher collection frequency is used, and when the security level is lower, a lower collection frequency is used.
  • Low collection frequency can reduce the amount of data processing and improve efficiency on the basis of ensuring the safety of terminal interaction.
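  • Steps S51 and S52 can be sketched as a simple acquisition loop; capture_video_and_audio() and interface_is_open() are hypothetical helpers, and the 10S/30S intervals follow the example above.

    # Illustrative sketch only: acquire the second to-be-verified video and audio
    # at a frequency determined by the security level of the current interface.
    import time

    ACQUISITION_INTERVAL_S = {"A": 10, "B": 30}  # seconds between acquisitions

    def periodic_verification(security_level):
        interval = ACQUISITION_INTERVAL_S.get(security_level, 30)
        while interface_is_open():                    # hypothetical state check
            video, audio = capture_video_and_audio()  # hypothetical capture helper
            yield video, audio                        # handed to re-analysis (step S6)
            time.sleep(interval)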
  • S6 Re-analyze based on the second to-be-verified video and second to-be-verified audio to determine whether to close the interactive interface.
  • step S6 the process of reanalyzing the second video to be verified and the second audio to be verified is the same as the process of analyzing the first video to be verified and the first audio to be verified, and will not be repeated here.
  • the process of determining whether to close the interactive interface is similar to the limitation of step S4, and will not be repeated here.
  • The main difference is that when the security is insufficient based on the second audio analysis value and the second video analysis value (for example, the weighted sum is less than the safety threshold), the interactive interface is closed, and when the security is sufficient (for example, the weighted sum is greater than the safety threshold), the interactive interface is not closed.
  • In this way, the first audio to be verified can be analyzed to obtain the first audio analysis value, the first video to be verified can be analyzed to obtain the first video analysis value, and the first audio analysis value and the first video analysis value can be combined to determine whether to provide the interactive interface; determining whether to close the interactive interface according to the second audio analysis value and the second video analysis value realizes periodic analysis, and the terminal can be controlled according to the analysis result, so that identification is also performed while the user is using the terminal, which further improves the security of terminal interaction.
  • The computer-readable instructions can be stored in a computer-readable storage medium, and when they are executed, the processes of the above-mentioned method embodiments may be performed.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
  • this application provides an embodiment of a terminal interaction device.
  • The device embodiment corresponds to the method embodiment shown in FIG. 2, and the device can be applied to various electronic devices.
  • the terminal interaction device 400 in this embodiment includes: a first acquisition module 401, an audio analysis module 402, a video analysis module 403, a first determination module 404, a second acquisition module 405, and a second determination module 406. among them:
  • the first obtaining module 401 is configured to obtain the first to-be-verified video and the first to-be-verified audio.
  • the audio analysis module 402 is configured to input the characteristic parameters of the first to-be-verified audio into a recognition model to analyze the first to-be-verified audio to obtain a first audio analysis value.
  • the video analysis module 403 is configured to analyze the first to-be-verified video to obtain a first video analysis value by using a facial motion recognition technology of a face image.
  • the first determining module 404 is configured to determine whether to provide an interactive interface for the user to perform interactive operations according to the first audio analysis value and the first video analysis value.
  • the second acquisition module 405 is configured to periodically acquire the second to-be-verified video and the second to-be-verified audio after the interactive interface is provided.
  • The second determining module 406 is configured to re-analyze based on the second to-be-verified video and the second to-be-verified audio to determine whether to close the interactive interface.
  • In this way, the first audio to be verified can be analyzed to obtain the first audio analysis value, the first video to be verified can be analyzed to obtain the first video analysis value, and the first audio analysis value and the first video analysis value can be combined to determine whether to provide the interactive interface; determining whether to close the interactive interface according to the second audio analysis value and the second video analysis value realizes periodic analysis, and the terminal can be controlled according to the analysis result, so that identification is also performed while the user is using the terminal, which further improves the security of terminal interaction.
  • The specific limitations on the terminal interaction device 400 are consistent with the specific limitations on the above terminal interaction method and will not be repeated here.
  • FIG. 10 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 11 includes a memory 111, a processor 112, and a network interface 113 that are connected to each other in communication through a system bus. It should be pointed out that only the computer device 11 with components 111-113 is shown in the figure, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, etc.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • The memory 111 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, etc.
  • the memory 111 may be an internal storage unit of the computer device 11, such as a hard disk or a memory of the computer device 11.
  • The memory 111 may also be an external storage device of the computer device 11, for example, a plug-in hard disk equipped on the computer device 11, a smart media card (SMC), a secure digital (SD) card, a flash card, etc.
  • the memory 111 may also include both an internal storage unit of the computer device 11 and an external storage device thereof.
  • the memory 111 is generally used to store an operating system and various application software installed in the computer device 11, such as computer-readable instructions for terminal interaction methods.
  • the memory 111 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 112 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 112 is generally used to control the overall operation of the computer device 11.
  • the processor 112 is configured to run computer-readable instructions or process data stored in the memory 111, for example, computer-readable instructions for running the terminal interaction method.
  • the network interface 113 may include a wireless network interface or a wired network interface, and the network interface 113 is generally used to establish a communication connection between the computer device 11 and other electronic devices.
  • The present application also provides another implementation manner, that is, a computer-readable storage medium storing computer-readable instructions for terminal interaction, and the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the steps of the terminal interaction method described above.
  • The method of the above embodiments can be implemented by means of software plus the necessary general hardware platform; of course, it can also be implemented by hardware, but in many cases the former is the better implementation.
  • The technical solution of this application, essentially or the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to make a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Engineering & Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application relate to the field of artificial intelligence and concern a terminal interaction method, comprising: acquiring a first to-be-verified video and a first to-be-verified audio; analyzing the first to-be-verified audio and the first to-be-verified video to obtain a first analysis result; determining, according to the first analysis result, whether to provide an interactive interface for a user to perform an interactive operation; after the interactive interface is provided, periodically acquiring a second to-be-verified video and a second to-be-verified audio; each time the second to-be-verified video and the second to-be-verified audio are acquired, analyzing the second to-be-verified audio and the second to-be-verified video to obtain a second analysis result; and determining, according to the second analysis result, whether to close the interactive interface. The present application further relates to a terminal interaction apparatus, a computer device and a storage medium. According to the present application, the identification dimension of terminal interaction can be increased, and identification can be performed while the user is using the terminal, so that the security of terminal interaction is improved.
PCT/CN2020/105762 2019-12-25 2020-07-30 Procédé et appareil d'interaction de terminal, dispositif informatique et support de stockage WO2021128847A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911357310.8 2019-12-25
CN201911357310.8A CN111178226A (zh) 2019-12-25 2019-12-25 终端交互方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2021128847A1 true WO2021128847A1 (fr) 2021-07-01

Family

ID=70650454

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105762 WO2021128847A1 (fr) 2019-12-25 2020-07-30 Procédé et appareil d'interaction de terminal, dispositif informatique et support de stockage

Country Status (2)

Country Link
CN (1) CN111178226A (fr)
WO (1) WO2021128847A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115208585A (zh) * 2022-09-07 2022-10-18 环球数科集团有限公司 一种基于零知识证明的数据交互方法与系统

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178226A (zh) * 2019-12-25 2020-05-19 深圳壹账通智能科技有限公司 终端交互方法、装置、计算机设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150279369A1 (en) * 2014-03-27 2015-10-01 Samsung Electronics Co., Ltd. Display apparatus and user interaction method thereof
CN107766785A (zh) * 2017-01-25 2018-03-06 丁贤根 一种面部识别方法
CN108053218A (zh) * 2017-12-29 2018-05-18 宁波大学 一种安全的移动支付方法
CN110223710A (zh) * 2019-04-18 2019-09-10 深圳壹账通智能科技有限公司 多重联合认证方法、装置、计算机装置及存储介质
CN110473049A (zh) * 2019-05-22 2019-11-19 深圳壹账通智能科技有限公司 理财产品推荐方法、装置、设备及计算机可读存储介质
CN111178226A (zh) * 2019-12-25 2020-05-19 深圳壹账通智能科技有限公司 终端交互方法、装置、计算机设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150279369A1 (en) * 2014-03-27 2015-10-01 Samsung Electronics Co., Ltd. Display apparatus and user interaction method thereof
CN107766785A (zh) * 2017-01-25 2018-03-06 丁贤根 一种面部识别方法
CN108053218A (zh) * 2017-12-29 2018-05-18 宁波大学 一种安全的移动支付方法
CN110223710A (zh) * 2019-04-18 2019-09-10 深圳壹账通智能科技有限公司 多重联合认证方法、装置、计算机装置及存储介质
CN110473049A (zh) * 2019-05-22 2019-11-19 深圳壹账通智能科技有限公司 理财产品推荐方法、装置、设备及计算机可读存储介质
CN111178226A (zh) * 2019-12-25 2020-05-19 深圳壹账通智能科技有限公司 终端交互方法、装置、计算机设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115208585A (zh) * 2022-09-07 2022-10-18 环球数科集团有限公司 一种基于零知识证明的数据交互方法与系统
CN115208585B (zh) * 2022-09-07 2022-11-18 环球数科集团有限公司 一种基于零知识证明的数据交互方法与系统

Also Published As

Publication number Publication date
CN111178226A (zh) 2020-05-19


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20904556

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31.10.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20904556

Country of ref document: EP

Kind code of ref document: A1