CN111191073A - Video and audio recognition method, device, storage medium and device - Google Patents

Video and audio recognition method, device, storage medium and device

Info

Publication number
CN111191073A
Authority
CN
China
Prior art keywords
target
video
audio
user
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911374298.1A
Other languages
Chinese (zh)
Inventor
黄超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai
Priority to CN201911374298.1A
Publication of CN111191073A
Priority to PCT/CN2020/102532 (WO2021128817A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/686 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a video and audio recognition method, a device, a storage medium and a device. The method comprises: receiving a target business type input by a user, looking up the corresponding target business script according to the target business type, and displaying the target business script; shooting a target video of the user reading the target business script aloud, and separating the audio from the target video with an audio-video separator to obtain target audio information, so that reading aloud replaces the tedious steps of manual input; performing character recognition on the target audio information to obtain target information; performing frame extraction processing on the target video to obtain a user picture, which is used to verify the identity of the user; and generating a target business document of the user according to the user picture and the target information. By analyzing the video with artificial intelligence, data of several kinds is obtained at once, so the identity of the user is verified while the efficiency of information entry is improved.

Description

Video and audio recognition method, device, storage medium and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a video and audio recognition method, video and audio recognition equipment, a storage medium and a device.
Background
When a user has to be verified in a financial scenario, the user's data must be collected and its authenticity checked repeatedly, so as to strengthen risk control as far as possible and to evaluate the user's financial standing for the loan as accurately as possible; the goal is accurate risk control. In current loan scenarios, an identity authentication step is commonly added, and only after authentication passes does the user enter information in a web page or an application (APP) so that user data can be collected. This cumbersome procedure involves many pages, increases the chance of errors, makes information entry take a long time, and results in a very poor user experience.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a video and audio recognition method, a device, a storage medium and a device, and aims to solve the technical problem of long time consumption caused by complicated input operation of user information in the prior art.
In order to achieve the above object, the present invention provides a video and audio recognition method, which comprises the following steps:
receiving a target business type input by a user, looking up a corresponding target business script according to the target business type, and displaying the target business script;
shooting a target video of the user reading the target business script aloud, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
performing character recognition on the target audio information to obtain target information;
performing frame extraction processing on the target video to obtain a user picture;
and generating a target business document of the user according to the user picture and the target information.
Preferably, the performing character recognition on the target audio information to obtain target information includes:
performing character recognition on the target audio information to obtain corresponding text information;
comparing the text information with the target business script to obtain the accuracy of the text information;
and when the accuracy is greater than a preset accuracy threshold, extracting information from the text information through a regular expression to obtain the target information.
Preferably, after the information is extracted from the text information through the regular expression to obtain the target information when the accuracy is greater than the preset accuracy threshold, the video and audio recognition method further includes:
judging whether the target information satisfies a preset rule;
if not, prompting the user to read the target business script aloud again;
and if so, executing the step of performing frame extraction processing on the target video to obtain a user picture.
Preferably, before the frame extraction processing is performed on the target video and a user picture is obtained, the video and audio identification method further includes:
carrying out face recognition on the target video, and carrying out living body detection on the recognized face;
and when the living body detection is successful, executing the step of performing frame extraction processing on the target video to obtain a user picture.
Preferably, the performing face recognition on the target video and performing living body detection on the recognized face includes:
carrying out face recognition on the target video, and intercepting eye regions of the recognized face to obtain eye region images;
identifying whether the eye region image has blinking actions through a preset blinking model;
and if the eye region image is recognized to have blinking motion, determining that the living body detection is successful.
Preferably, after the frame extraction processing is performed on the target video to obtain the user picture, the video and audio identification method further includes:
preprocessing the user picture to obtain a preprocessed picture;
screening the preprocessed pictures according to the definition to obtain screened pictures;
comparing the screened picture with a preset picture to obtain a comparison result;
correspondingly, the generating the target business document of the user according to the user picture and the target information includes:
and when the comparison result exceeds a preset similarity threshold, generating the target business document of the user according to the screened picture and the target information.
Preferably, the shooting a target video of the user reading the target business script aloud, and performing audio separation on the target video through an audio-video separator to obtain target audio information includes:
playing target music while shooting the target video of the user reading the target business script aloud;
performing audio separation on the target video through the audio-video separator to obtain mixed audio information;
and extracting, from the mixed audio information, the target audio information of the user reading the target business script aloud by means of a computational auditory scene analysis algorithm.
Furthermore, to achieve the above object, the present invention further provides a video and audio recognition apparatus, which includes a memory, a processor, and a video and audio recognition program stored in the memory and executable on the processor, wherein the video and audio recognition program is configured to implement the steps of the video and audio recognition method as described above.
Furthermore, to achieve the above object, the present invention further provides a storage medium having a video and audio recognition program stored thereon, wherein the video and audio recognition program, when executed by a processor, implements the steps of the video and audio recognition method as described above.
In addition, to achieve the above object, the present invention further provides a video and audio recognition apparatus, including:
the searching module is used for receiving a target business type input by a user, looking up a corresponding target business script according to the target business type, and displaying the target business script;
the audio separation module is used for shooting a target video of the user reading the target business script aloud, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
the character recognition module is used for carrying out character recognition on the target audio information to obtain target information;
the frame extraction processing module is used for carrying out frame extraction processing on the target video to obtain a user picture;
and the generating module is used for generating the target business document of the user according to the user picture and the target information.
In the invention, a target business type input by a user is received, the corresponding target business script is looked up according to the target business type, and the target business script is displayed; a target video of the user reading the target business script aloud is shot, and the audio is separated from the target video by an audio-video separator to obtain target audio information, so that reading aloud replaces the tedious steps of manual input; character recognition is performed on the target audio information to obtain target information; frame extraction processing is performed on the target video to obtain a user picture, which is used to verify the identity of the user; and a target business document of the user is generated according to the user picture and the target information. By analyzing the video with artificial intelligence, data of several kinds is obtained at once, so the identity of the user is verified while the efficiency of information entry is improved.
Drawings
FIG. 1 is a schematic structural diagram of a video and audio recognition device of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a video/audio recognition method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of a video/audio recognition method according to the present invention;
FIG. 4 is a flowchart illustrating a video/audio recognition method according to a third embodiment of the present invention;
FIG. 5 is a block diagram of a video/audio recognition apparatus according to a first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a video and audio recognition device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the video and audio recognition device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 enables connection and communication between these components. The user interface 1003 may include a display screen (Display); optionally, the user interface 1003 may further include a standard wired interface and a wireless interface, and in the present invention the wired interface of the user interface 1003 may be a USB interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the video audio recognition device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a video and audio recognition program.
In the video and audio recognition apparatus shown in fig. 1, the network interface 1004 is mainly used for connecting to a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting user equipment; the video and audio recognition device calls a video and audio recognition program stored in the memory 1005 through the processor 1001 and executes the video and audio recognition method provided by the embodiment of the invention.
Based on the hardware structure, the embodiment of the video and audio identification method is provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating a video and audio recognition method according to a first embodiment of the present invention.
In a first embodiment, the video and audio recognition method comprises the following steps:
Step S10: receiving a target business type input by a user, looking up a corresponding target business script according to the target business type, and displaying the target business script.
It should be understood that the execution body of this embodiment is the video and audio recognition device, which may be an electronic device such as a smart phone, a personal computer, or a server; this embodiment is not limited in this respect. In a web page or an APP, the available business types can be presented as options, from which the user selects the target business type to be handled. When the target business type input by the user is received, the target business script corresponding to it is looked up in a preset mapping relation table, which records the correspondence between business types and business scripts. The target business type includes loan, lease or insurance business, and the target business script covers the user information that must be collected for that business type. For example, every business type needs the user's basic personal information, which can be collected with a passage such as: "I am xxx, my identification number is xxxxx, I am from the xxx region", and so on. Each business type also needs information specific to it; for example, a loan business additionally needs to know whether the user owns real estate or a vehicle, the user's annual income, and so on. The target business script can be prepared in advance for each business type, with the information to be collected presented as blanks to be filled in.
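The lookup in the preset mapping relation table can be sketched as a dictionary from business type to the text the user will read aloud. This is only a minimal illustration: the table contents, the placeholder names in braces, and the function name `find_script` are all hypothetical and do not come from the patent.

```python
# Hypothetical mapping table: business type -> text with blanks to fill in.
# The placeholder names in braces are illustrative only.
SCRIPT_TABLE = {
    "loan": "I am {name}, my ID number is {id_number}, my annual income is {income}.",
    "lease": "I am {name}, my ID number is {id_number}, I wish to lease {item}.",
    "insurance": "I am {name}, my ID number is {id_number}, I am from {region}.",
}


def find_script(business_type: str) -> str:
    """Return the text for a business type, or raise if the type is unknown."""
    try:
        return SCRIPT_TABLE[business_type]
    except KeyError:
        raise ValueError(f"unsupported business type: {business_type!r}") from None


print(find_script("loan"))
```

Raising on an unknown type, rather than returning an empty text, keeps a mistyped option from silently producing a blank page for the user to read.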
Step S20: shooting a target video of the user reading the target business script aloud, and performing audio separation on the target video through an audio-video separator to obtain target audio information.
It should be noted that, when reading the target business script aloud, the user fills the blanks in the script with his or her own information. The process of the user reading the target business script aloud is recorded through the camera function of the video and audio recognition device, for example the video recording function of a smart phone. A record button is provided in the web page or APP: when the target business script is displayed in the web page or APP, the record button is placed above or below the script, and by clicking it the user shoots a video of himself or herself reading the target business script aloud, which is the target video.
It can be understood that audio separation usually extracts the sound and the image of a video separately. The audio separation steps are: setting the source; obtaining the number of tracks in the source file and traversing them to find the required audio track; and extracting the audio track found to obtain the target audio information.
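The track traversal described above can be sketched without committing to any particular media library. The example assumes, purely for illustration, that the container's tracks are exposed as a list of MIME type strings; the function name `find_audio_track` and the sample track list are hypothetical.

```python
from typing import Optional, Sequence


def find_audio_track(track_mime_types: Sequence[str]) -> Optional[int]:
    """Return the index of the first audio track, or None if there is none."""
    for index, mime in enumerate(track_mime_types):
        if mime.startswith("audio/"):
            return index
    return None


# Hypothetical container with one video track and one audio track.
tracks = ["video/avc", "audio/mp4a-latm"]
audio_index = find_audio_track(tracks)
print(audio_index)  # 1
```

Once the index is known, a real separator would read the samples of that track only and write them out as the target audio information.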
Step S30: and carrying out character recognition on the target audio information to obtain target information.
In a specific implementation, the silence at the head and tail of the target audio information is cut off to reduce interference with the subsequent steps, which yields first audio information. The first audio information is then framed, that is, cut into small segments, each called a frame; framing is typically implemented with a moving window function. After framing, the first audio information becomes many small segments. The waveform is then transformed: Mel-frequency cepstral coefficient (MFCC) features are extracted, turning each frame of the waveform into a multi-dimensional vector. Next, frames are recognized as states, states are combined into phonemes, and phonemes are combined into words: several frames of speech correspond to one state, every three states are combined into one phoneme, and several phonemes are combined into one word. This yields the corresponding text information, from which the content filled in by the user can be extracted as the target information.
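The first two stages, trimming silence and framing with a moving window, can be sketched in plain Python. The amplitude threshold, frame length and hop size below are assumptions chosen for the toy signal, not values from the patent, and MFCC extraction is deliberately left out.

```python
from typing import List


def trim_silence(samples: List[float], threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing samples whose magnitude is below threshold."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


def split_frames(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut the signal into overlapping frames with a sliding window."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, hop)
    ]


signal = [0.0, 0.0, 0.5, -0.4, 0.3, 0.2, -0.1, 0.6, 0.0]
voiced = trim_silence(signal)
frames = split_frames(voiced, frame_len=4, hop=2)
print(len(voiced), len(frames))  # 6 2
```

Overlapping frames (hop smaller than frame length) are the usual choice so that a sound falling on a frame boundary is still seen whole in a neighbouring frame.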
Step S40: and performing frame extraction processing on the target video to obtain a user picture.
It should be understood that the target video is opened and initialized, and its total number of frames is obtained and printed. A variable is defined to hold each frame image, and a loop counter is defined to track the current frame. Each frame of the target video is read in turn, and the frame number, a long integer, is converted to a character string and passed to the object str. One frame out of every 10 is kept and converted into a picture for output. The loop ends when the current frame number is greater than the total number of frames, and the output pictures are the user pictures.
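The "keep one frame in every 10" loop can be sketched independently of the video decoder. In the example below the frames are stand-in strings; with a real decoder they would be decoded images. The function name `sample_frames` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

Frame = TypeVar("Frame")


def sample_frames(frames: Iterable[Frame], step: int = 10) -> Iterator[Tuple[int, Frame]]:
    """Yield (index, frame) for every `step`-th frame, starting at frame 0."""
    if step < 1:
        raise ValueError("step must be at least 1")
    for index, frame in enumerate(frames):
        if index % step == 0:
            yield index, frame


# Stand-in for decoded video frames; a real decoder would supply images.
fake_video = [f"frame-{i}" for i in range(35)]
kept = [index for index, _ in sample_frames(fake_video)]
print(kept)  # [0, 10, 20, 30]
```

Because the function is a generator, frames that are skipped are never stored, which matters for long recordings.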
Step S50: and generating a target business document of the user according to the user picture and the target information.
It should be noted that the user picture can serve as the user's identity verification information. In addition, a voiceprint can be extracted from the audio and used to verify the user's identity. The target information is the user-related information extracted from the text read aloud by the user. The user picture and the target information are combined to generate a document, namely the target business document, which contains the user's identity verification information and the user information required by the target business type.
In this embodiment, a target business type input by a user is received, the corresponding target business script is looked up according to the target business type, and the target business script is displayed; a target video of the user reading the target business script aloud is shot, and the audio is separated from the target video by an audio-video separator to obtain target audio information, so that reading aloud replaces the tedious steps of manual input; character recognition is performed on the target audio information to obtain target information; frame extraction processing is performed on the target video to obtain a user picture, which is used to verify the identity of the user; and a target business document of the user is generated according to the user picture and the target information. By analyzing the video with artificial intelligence, data of several kinds is obtained at once, so the identity of the user is verified while the efficiency of information entry is improved.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of the video and audio recognition method according to the present invention, and the second embodiment of the video and audio recognition method according to the present invention is provided based on the first embodiment shown in fig. 2.
In the second embodiment, the step S30 includes:
step S301: and performing character recognition on the target audio information to obtain corresponding text information.
It should be understood that, when character recognition is performed on the target audio information, the silence at its head and tail is first removed, and the resulting first audio information is framed. Several frames of speech correspond to one state: each frame is assigned to the state for which its probability is highest. A state network is constructed, and speech recognition consists of searching this network for the optimal path, that is, the path that best matches the sound. Every three states are combined into one phoneme and several phonemes are combined into one word, which yields the text information corresponding to the target audio information.
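The patent does not name the algorithm used to search the state network for the optimal path. A common choice for this kind of search is Viterbi decoding over a hidden Markov model, sketched below under that assumption. The two states, the "lo"/"hi" observation labels and all probabilities are made-up toy values.

```python
from typing import Dict, List


def viterbi(
    observations: List[str],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most probable state sequence for a non-empty observation list."""
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace back from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model: two states and two kinds of per-frame observation.
states = ["s1", "s2"]
start_p = {"s1": 0.8, "s2": 0.2}
trans_p = {"s1": {"s1": 0.6, "s2": 0.4}, "s2": {"s1": 0.1, "s2": 0.9}}
emit_p = {"s1": {"lo": 0.9, "hi": 0.1}, "s2": {"lo": 0.2, "hi": 0.8}}
print(viterbi(["lo", "lo", "hi", "hi"], states, start_p, trans_p, emit_p))
# ['s1', 's1', 's2', 's2']
```

In the toy run, the first two frames are assigned to one state and the last two to another, which mirrors the statement that several frames of speech correspond to one state.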
Step S302: and comparing the text information with the target business script to obtain the accuracy of the text information.
It can be understood that the text information is the text produced when the user reads the target business script aloud. To determine whether the user read the right script and read it correctly, the fixed content in the text information (the part that is not filled in by the user) can be extracted and compared with the target business script, and the similarity between the two is taken as the accuracy of the text information.
Step S303: and when the accuracy is greater than a preset accuracy threshold, extracting information from the text information through a regular expression to obtain the target information.
It should be noted that the preset accuracy threshold may be set according to an empirical value, for example 80%. When the accuracy is greater than the preset accuracy threshold, the text information and the target business script are similar, that is, the accuracy of the text information is considered to meet the requirement, and the text information can be analyzed further.
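The patent does not say how the similarity is computed. As one possible stand-in, the sketch below uses the ratio from Python's `difflib.SequenceMatcher` and compares it with the 80% example threshold. The two sentences and the function name `script_accuracy` are hypothetical.

```python
from difflib import SequenceMatcher

ACCURACY_THRESHOLD = 0.8  # the 80% example value from the text


def script_accuracy(recognized_fixed_text: str, script_fixed_text: str) -> float:
    """Similarity in [0, 1] between the recognized text and the displayed text."""
    return SequenceMatcher(None, recognized_fixed_text, script_fixed_text).ratio()


script = "I am applying for a loan and confirm the following is true"
spoken = "I am applying for a loan and confirm the following is two"
accuracy = script_accuracy(spoken, script)
print(accuracy > ACCURACY_THRESHOLD)
```

A character-level ratio tolerates a few misrecognized words, which suits speech input; a stricter system might compare word by word instead.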
In a specific implementation, extracting the character string at a specific position can be done with a regular expression. A character string at a single position can be extracted with (.+?). Take the string "a123b 456 b": a greedy pattern matches everything between a and the last b, whereas a non-greedy pattern matches only the value between a and the first b, and the ? modifier is what switches a regular expression between greedy and non-greedy matching. With ?, the quantifier matches as little as possible, so only the match up to the nearest b is output. Extracting character strings of a plurality of continuous positions by using (.
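The difference between greedy and non-greedy matching can be checked directly with Python's `re` module. The first string has the same shape as the example in the text; the spoken sentence and the phone-number pattern are hypothetical and only show how a filled-in value might be pulled out of recognized text.

```python
import re

text = "a123b456b"

# Greedy: (.+) expands as far as possible, up to the last "b".
greedy = re.search(r"a(.+)b", text).group(1)

# Non-greedy: (.+?) stops at the nearest "b".
lazy = re.search(r"a(.+?)b", text).group(1)

print(greedy)  # 123b456
print(lazy)    # 123

# Hypothetical recognized sentence with one filled-in slot.
spoken = "I am Zhang San, my phone number is 13800000000."
match = re.search(r"my phone number is (\d+)", spoken)
phone = match.group(1) if match else None
print(phone)  # 13800000000
```

Anchoring the capture group on the fixed words around the blank is what lets one pattern per blank recover the user's answers from the recognized text.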
Further, in this embodiment, after the step S303, the method further includes:
judging whether the target information meets a preset rule or not;
if not, prompting the user to read the target business script aloud again;
if yes, the step S40 is executed.
It should be understood that the template data to be filled into the target business script is analyzed in advance, and a corresponding rule is set for each item of information to be filled in. For example, a telephone number has 11 digits, so the preset rule for the telephone number in the target business script is to check whether it has 11 digits. If the preset rule is satisfied, the telephone number in the target information is considered correct; if not, the telephone number in the target information is considered to have been misread, and a voice prompt can be given, for example that a telephone number has 11 digits and the number just read has the wrong number of digits, or has one digit too many. The prompt can also be given as text, for example by marking the erroneous content in the text information in red and adding a text annotation beside it stating that there is an error. The input box may support correction, so that the user can modify the text information.
It can be understood that a preset rule can likewise be set for regions. For example, every geographic location on the map is entered in advance; when the content read aloud from the target business script is address information, it is judged whether the address information in the text information belongs to the pre-entered geographic locations. If so, the address information read is considered correct; if not, it is considered incorrect.
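The two rules given as examples, an 11-digit telephone number and a region that must appear in a pre-entered list, can be sketched as a small rule table. The field names, the region list and the function name `failed_fields` are hypothetical.

```python
import re
from typing import Callable, Dict, List

# Hypothetical pre-entered locations; a real system would load map data.
KNOWN_REGIONS = {"Shanghai", "Shenzhen", "Beijing"}

# One rule per field; the field names are illustrative.
RULES: Dict[str, Callable[[str], bool]] = {
    "phone": lambda value: re.fullmatch(r"\d{11}", value) is not None,
    "region": lambda value: value in KNOWN_REGIONS,
}


def failed_fields(target_info: Dict[str, str]) -> List[str]:
    """Return the names of fields that break their preset rule."""
    return [
        field for field, value in target_info.items()
        if field in RULES and not RULES[field](value)
    ]


errors = failed_fields({"phone": "1380000000", "region": "Shanghai"})
if errors:
    print("Please read the text again; check:", ", ".join(errors))
```

Returning the list of failing fields, rather than a single pass/fail flag, is what makes a targeted prompt such as "the telephone number has the wrong number of digits" possible.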
In this embodiment, the text information obtained by speech recognition is compared with the target business script, and when the accuracy is greater than a preset accuracy threshold, information is extracted from the text information through a regular expression to obtain the target information, which improves the accuracy of information entry.
Referring to fig. 4, fig. 4 is a flowchart illustrating a video and audio recognition method according to a third embodiment of the present invention, and the video and audio recognition method according to the third embodiment of the present invention is provided based on the first embodiment or the second embodiment. This embodiment is explained based on the first embodiment.
In the third embodiment, before the step S40, the method further includes:
carrying out face recognition on the target video, and carrying out living body detection on the recognized face;
when the living body detection is successful, the step S40 is executed.
It should be understood that face recognition is performed on the target video. Finding a person in a video works on the same principle as finding a person in a picture: a video is a collection of pictures, so the task is essentially still finding a person in pictures, and face recognition is realized by drawing rectangular frames around the persons found and the faces recognized. Face detection (Face Detection) is responsible for locating the face, and face alignment (Face Alignment) aligns it; the algorithm applies an affine transformation that aligns the face according to the eye coordinates. A Visual Geometry Group network (VGG) model is used for feature extraction: the get_feature_new function opens a picture and extracts features with the VGG network, and the compare_pic function calculates the similarity of the two features passed in, where the key point is the choice of threshold. The face_record_test function reads the test pictures and calculates the optimal parameters for each group of pictures. The aligned face picture is stored for use in subsequent face feature comparison. Face recognition may be performed with Seeta Face Engine or FaceAlignment to obtain the face features in the input picture, or with OpenCV, using cv2.CascadeClassifier.
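The text says only that a similarity is calculated for two feature vectors and that the threshold is the key choice. One common measure for such features is cosine similarity, used below as an assumption; the 4-dimensional vectors, the 0.8 threshold and the function names are illustrative and are not the patent's compare_pic implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def same_person(a: Sequence[float], b: Sequence[float], threshold: float = 0.8) -> bool:
    """Treat two faces as the same person when similarity exceeds the threshold."""
    return cosine_similarity(a, b) > threshold


# Made-up 4-dimensional features; real face features are much longer.
reference = [0.9, 0.1, 0.3, 0.5]
candidate = [0.8, 0.2, 0.3, 0.6]
print(same_person(reference, candidate))  # True
```

A higher threshold rejects more impostors but also more genuine users, which is why the threshold is usually tuned on labelled test pictures.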
Further, the performing face recognition on the target video and performing living body detection on the recognized face includes:
carrying out face recognition on the target video, and intercepting eye regions of the recognized face to obtain eye region images;
identifying whether the eye region image has blinking actions through a preset blinking model;
and if the eye region image is recognized to have blinking motion, determining that the living body detection is successful.
It can be understood that living body detection is performed on the recognized face: whether the detected face moves or blinks is used to determine that it is a real person rather than a photo. First the face is detected and the eyes are located; then the eye region is cropped and the degree of eye opening is calculated from the normalized image; a model for judging blinking is built on a convolutional neural network, and whether blinking occurs in the images is identified with this model. Specifically, a convolutional neural network model to be trained is established in advance and a large number of sample images are obtained; the eye region of the face in each sample image is cropped to obtain sample eye images, and the sample blink information corresponding to each sample image is obtained; the model is trained on the sample eye images and the corresponding sample blink information to obtain the preset blink model. The eye region images are then identified by the preset blink model, and if blinking is recognized in them, the person in the target video is considered real and living body detection is determined to be successful.
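Training a convolutional network is beyond a short example, so the sketch below substitutes a much simpler rule for the preset blink model: given a per-frame eye-openness score, it reports a blink when the eye goes from open to closed and back to open. The score range, both thresholds and the function name `has_blink` are assumptions.

```python
from typing import Sequence


def has_blink(
    openness: Sequence[float],
    closed_below: float = 0.2,
    open_above: float = 0.6,
) -> bool:
    """Return True if the sequence goes open -> closed -> open at least once."""
    state = "start"
    for value in openness:
        if state == "start" and value > open_above:
            state = "open"
        elif state == "open" and value < closed_below:
            state = "closed"
        elif state == "closed" and value > open_above:
            return True
    return False


# Hypothetical per-frame eye-openness scores in [0, 1].
print(has_blink([0.8, 0.7, 0.1, 0.05, 0.75]))  # True
print(has_blink([0.8, 0.8, 0.7, 0.8]))         # False, as for a still photo
```

Using two separate thresholds leaves a gap between "open" and "closed", so small fluctuations in the score around a single cut-off are not counted as blinks.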
In this embodiment, after the step S40, the method further includes:
step S401: preprocessing the user picture to obtain a preprocessed picture.
It should be understood that frame extraction on the target video usually yields a plurality of user pictures, which need further processing so that a user picture of better quality can serve as the authentication information of the user. The user picture can be preprocessed first. The purpose of image preprocessing is to eliminate irrelevant information from the image and to remove or reduce, as far as possible, interference from illumination, the imaging system, the external environment and the like, so that the features of the image are clearly represented. Preprocessing of the face image includes light compensation, gray-level transformation, histogram equalization, normalization, geometric correction, filtering, sharpening and the like, and yields the preprocessed picture.
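Of the preprocessing steps listed, histogram equalization is simple enough to sketch in a few lines. The example below assumes an 8-bit grayscale image stored as nested lists; the function name equalize_histogram is hypothetical, and a real pipeline would more likely rely on an image library.

```python
from typing import List

Image = List[List[int]]  # 8-bit grayscale, row-major


def equalize_histogram(image: Image) -> Image:
    """Spread the gray levels of an image over the full 0..255 range."""
    hist = [0] * 256
    for row in image:
        for pixel in row:
            hist[pixel] += 1
    total = sum(hist)
    if total == 0:
        return [row[:] for row in image]  # empty image: nothing to do
    # Cumulative distribution of gray levels.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        return [row[:] for row in image]  # single gray level: leave unchanged
    lookup = [max(0, round((c - cdf_min) * 255 / (total - cdf_min))) for c in cdf]
    return [[lookup[pixel] for pixel in row] for row in image]
```

For example, a low-contrast image with levels 10, 20, 30 and 40 is stretched to 0, 85, 170 and 255.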
Step S402: screening the preprocessed pictures according to definition to obtain screened pictures.
It should be noted that there are usually multiple preprocessed pictures, and a picture with higher definition is selected from them for face recognition. Definition (sharpness) is an important index of image quality and can be evaluated with a secondary-blur (Reblur) algorithm. If an image is already blurred, blurring it again changes its high-frequency components only slightly; if the original image is sharp, blurring it once changes the high-frequency components greatly. Therefore, a degraded image can be obtained by applying Gaussian blur once to the image to be evaluated, the changes between adjacent pixel values in the original image and in the degraded image are compared, and the definition value is determined from the magnitude of the change: the smaller the calculated result, the sharper the image; the larger the result, the blurrier the image. This idea can be called a sharpness algorithm based on secondary blurring. Specifically, low-pass filtering is performed on the preprocessed picture to obtain a blurred image; the change of gray values of adjacent pixels in the preprocessed picture is calculated to obtain a first pixel change value; the change of gray values of adjacent pixels in the blurred image is calculated to obtain a second pixel change value; the first and second pixel change values are compared and normalized to obtain a sharpness result; and the preprocessed pictures are screened according to the sharpness result to obtain the screened picture.
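The secondary-blur idea can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: a 3x3 mean filter stands in for the Gaussian low-pass filter, the comparison of the two pixel change values is done per pixel pair, and the names mean_filter, blur_score and pick_sharpest are hypothetical. The score is normalized to [0, 1], and a smaller score means a sharper image, matching the convention stated above.

```python
from typing import List, Optional, Sequence

Image = Sequence[Sequence[float]]  # grayscale, row-major


def mean_filter(image: Image) -> List[List[float]]:
    """3x3 mean filter with edge clamping (a simple low-pass filter)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += image[yy][xx]
            row.append(acc / 9.0)
        out.append(row)
    return out


def _direction_score(image: Image, blurred: Image, dy: int, dx: int) -> Optional[float]:
    """Share of neighbour variation that survives blurring along one direction."""
    h, w = len(image), len(image[0])
    s_orig = 0.0     # first pixel change value (original image)
    s_removed = 0.0  # variation that the blur took away
    for y in range(h - dy):
        for x in range(w - dx):
            d_orig = abs(image[y + dy][x + dx] - image[y][x])
            d_blur = abs(blurred[y + dy][x + dx] - blurred[y][x])
            s_orig += d_orig
            s_removed += max(0.0, d_orig - d_blur)
    if s_orig == 0.0:
        return None  # no variation in this direction
    return (s_orig - s_removed) / s_orig


def blur_score(image: Image) -> float:
    """Normalized blur score in [0, 1]; smaller means sharper."""
    blurred = mean_filter(image)
    scores = []
    for dy, dx in ((0, 1), (1, 0)):  # horizontal, then vertical neighbours
        score = _direction_score(image, blurred, dy, dx)
        if score is not None:
            scores.append(score)
    return max(scores) if scores else 1.0  # a flat image counts as blurred


def pick_sharpest(images: Sequence[Image]) -> Image:
    """Screen the candidates by keeping the one with the lowest blur score."""
    return min(images, key=blur_score)
```

On real photographs a wider kernel than 3x3 would normally be chosen, since a very small kernel separates sharp and blurred images less reliably.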
Step S403: comparing the screened picture with a preset picture to obtain a comparison result.
In a specific implementation, facial feature points are located in the screened picture to obtain the to-be-processed facial feature points corresponding to the screened picture; the to-be-processed facial feature points are compared with preset frontal-face feature points to obtain a homography matrix; and the face in the picture is transformed by the homography matrix to obtain a calibrated face picture. The preset pictures are pictures of users in a public security system. The calibrated face picture is compared, through a convolutional neural network model, with the features of the pictures in the public security system library to obtain the face similarity between the screened picture and each preset picture, and the face similarity is used as the comparison result.
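To make the homography step concrete, the sketch below estimates a 3x3 homography from exactly four point correspondences by solving the resulting 8x8 linear system. It assumes exactly four feature points with no three collinear; with more points a least-squares fit would normally be used. The names solve_linear, estimate_homography and apply_homography are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def estimate_homography(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """Homography H (with H[2][2] fixed to 1) mapping four src points to dst."""
    if len(src) != 4 or len(dst) != 4:
        raise ValueError("exactly four correspondences are required")
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h: List[List[float]], point: Point) -> Point:
    """Map one point through H using homogeneous coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

Here src would hold the feature points located in the screened picture and dst the preset frontal-face feature points; every pixel of the face is then mapped through the resulting matrix.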
Accordingly, the step S50 includes:
step S501: when the comparison result exceeds a preset similarity threshold, generating a target business document of the user according to the screened picture and the target information.
It can be understood that the face similarity is used as the comparison result: if it exceeds the preset similarity threshold, the identity of the user is considered verified and business data can be created for the user. The preset similarity threshold may be set according to an empirical value, such as 40%. That is, the face features are compared and the face similarity is calculated; with the threshold set to 0.4, a similarity greater than 40% means the two faces are considered to belong to the same person, and the target business document of the user is generated according to the screened picture and the target information.
In this embodiment, the step S20 includes:
shooting a target video of the target business case read by the user while playing the target music;
performing audio separation on the target video through an audio-video separator to obtain mixed audio information;
and extracting, from the mixed audio information, the target audio information of the user reading the target business case through a computational auditory scene analysis algorithm.
It should be understood that, to protect personal information, target music can be played while the user reads the target business case aloud. The target music creates a noisy speech environment that prevents others from overhearing the user's personal information, so the target video captured at this time contains both the target music and the audio of the user reading the target business case. The audio-video separator separates the audio of the target video to obtain mixed audio information, and a Computational Auditory Scene Analysis (CASA) algorithm, which simulates the human auditory system, is then used to extract the user's speech from the noisy environment. The audio information is encoded so that it can be grouped and parsed. There are currently several dozen grouping cues based on time and frequency correlations, including pitch, spatial position, and start/stop time. Pitch is a very important grouping cue that distinguishes sounds according to their different harmonic patterns. When two or more microphones are used, the sound isolation system can determine the direction and distance of the sound at each microphone from the spatial position information. The CASA modeling approach enables the sound isolation system to focus on one sound source, such as a particular person, and to mask out background sound. Start/stop time grouping refers to the times at which a sound component begins and ceases; combined with the original frequency data, it indicates whether sound components come from the same source. By masking out a series of noise clusters, a particular sound source can be identified. Sounds with similar attributes form the same audio stream, and sounds with different attributes form separate audio streams. These audio streams can be used to identify persistent or repetitive sound sources.
With enough sound groupings, the actual sound isolation process can map back from the identified sound sources and respond only to the voice of the actual speaker, thereby isolating the target audio information of the user reading the target business case.
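A full CASA system is far beyond a short example, but the start/stop time cue alone can be illustrated with a toy grouping routine. Everything here is an assumption made for illustration: sound components are modeled as (frequency, onset, offset) records, the tolerance value is arbitrary, and the names Component and group_by_onset_offset are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Component(NamedTuple):
    frequency_hz: float
    onset_s: float
    offset_s: float


def group_by_onset_offset(components: Sequence[Component],
                          tolerance_s: float = 0.05) -> List[List[Component]]:
    """Group components whose start and stop times nearly coincide.

    Components that begin and end together are assumed to come from the
    same sound source; each returned group stands for one audio stream.
    """
    groups: List[List[Component]] = []
    for comp in sorted(components, key=lambda c: (c.onset_s, c.offset_s)):
        for group in groups:
            ref = group[0]  # compare against the first member of the group
            if (abs(comp.onset_s - ref.onset_s) <= tolerance_s
                    and abs(comp.offset_s - ref.offset_s) <= tolerance_s):
                group.append(comp)
                break
        else:
            groups.append([comp])
    return groups
```

A practical system would combine this cue with pitch and spatial position before deciding which stream is the reader's voice.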
In the embodiment, the user picture is processed to obtain the screening picture with better quality, and then the screening picture is compared with the preset picture in the public security system to verify the identity of the user, so that the safety and the reliability of information input are improved.
In addition, an embodiment of the present invention further provides a storage medium, where a video and audio recognition program is stored on the storage medium, and the video and audio recognition program, when executed by a processor, implements the steps of the video and audio recognition method described above.
In addition, referring to fig. 5, an embodiment of the present invention further provides a video and audio recognition apparatus, where the video and audio recognition apparatus includes:
the searching module 10 is configured to receive a target service type input by a user, search for a corresponding target business case according to the target service type, and display the target business case.
It should be understood that, in a web page or APP, the various service types may be presented as options from which the user selects the target service type to be handled. When the target service type input by the user is received, the target business case corresponding to it (the script that the user is to read aloud) is looked up in a preset mapping relation table, which contains the correspondence between service types and business cases. The target service type includes loan, lease or insurance business, and the target business case covers the user-related information that needs to be collected for that service type. For example, every service type needs the user's basic personal information, which can be written as a short personal information case: "I am xxx, my identification number is xxxxx, I am from the xxx region", and so on. Different service types also need information specific to the service type; for example, the loan service also needs to know whether the user has real estate or a vehicle, the user's annual income, and the like. The corresponding business case can be established in advance for each service type, with the information to be collected presented as blanks to be filled in.
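The preset mapping relation table can be pictured as a simple dictionary lookup. The sketch below is illustrative only: the field names, the sentence wording and the function name find_target_case are hypothetical, and only the service types and the kinds of information come from the description above.

```python
from typing import Dict, List

# Information every service type collects (per the description above).
COMMON_FIELDS: List[str] = ["name", "identification_number", "region"]

# Hypothetical mapping table: service type -> extra fields for that type.
EXTRA_FIELDS_BY_TYPE: Dict[str, List[str]] = {
    "loan": ["real_estate", "vehicle", "annual_income"],
    "lease": [],
    "insurance": [],
}


def find_target_case(service_type: str) -> str:
    """Look up the script for a service type, with blanks for the user to fill."""
    if service_type not in EXTRA_FIELDS_BY_TYPE:
        raise KeyError(f"unknown service type: {service_type}")
    fields = COMMON_FIELDS + EXTRA_FIELDS_BY_TYPE[service_type]
    return " ".join(f"My {field.replace('_', ' ')} is ____." for field in fields)
```

Calling find_target_case("loan") yields a six-blank script, while "lease" yields only the three common blanks.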
And the audio separation module 20 is configured to shoot a target video of the target business case read aloud by the user, and perform audio separation on the target video through an audio/video separator to obtain target audio information.
It should be noted that, when reading the target business case aloud, the user fills the blanks with his or her own information. The process of the user reading the target business case is recorded as a video through the camera function of the video and audio recognition device, such as the video recording function of a smartphone. A recording button is provided in the web page or APP: when the target business case is displayed, the recording button is placed above or below it, and by clicking this button the user shoots a video of himself or herself reading the target business case aloud, which gives the target video.
It can be understood that audio separation usually takes out the sound and image of the video separately, and the audio separation step is: setting an audio source; acquiring the number of tracks in a source file, and traversing to find a required audio track; and extracting the found audio track to obtain the target audio information.
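The track-selection step above can be sketched without committing to any particular media library. The example assumes that the container has already been opened and that each track is described by a dictionary with a hypothetical "mime" key; the function name find_audio_track is likewise an assumption.

```python
from typing import Dict, Optional, Sequence


def find_audio_track(tracks: Sequence[Dict[str, str]]) -> Optional[int]:
    """Traverse the tracks of a source file and return the first audio track index."""
    for index, track in enumerate(tracks):
        if track.get("mime", "").startswith("audio/"):
            return index
    return None  # the file carries no audio track
```

The returned index would then be handed to whatever demuxer actually extracts the samples of that track.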
And the character recognition module 30 is configured to perform character recognition on the target audio information to obtain target information.
In a specific implementation, the silence at the head and tail of the target audio information is cut off to reduce interference with the subsequent steps, giving first audio information. The first audio information is then framed, that is, cut into small segments each called a frame; the framing operation is typically implemented with a moving window function. After framing, the first audio information becomes many small segments. The waveform is then transformed: Mel-frequency cepstral coefficient (MFCC) features are extracted, turning each frame of the waveform into a multi-dimensional vector. Next, frames are recognized as states, states are combined into phonemes, and phonemes are combined into words: several frames of speech correspond to one state, every three states are combined into one phoneme, and several phonemes are combined into one word. The corresponding text information is thus obtained, and the content filled in by the user can be extracted from it as the target information.
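The first two steps, silence trimming and framing with a moving window, can be sketched as follows. The amplitude threshold, the frame and hop lengths, and the choice of a Hamming window are assumptions for illustration; MFCC extraction and the state and phoneme decoding are not shown.

```python
import math
from typing import List, Sequence


def trim_silence(samples: Sequence[float], threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing samples whose magnitude is below threshold."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return list(samples[start:end])


def split_into_frames(samples: Sequence[float], frame_length: int,
                      hop_length: int) -> List[List[float]]:
    """Cut the signal into overlapping frames by sliding a window along it."""
    if frame_length <= 0 or hop_length <= 0:
        raise ValueError("frame_length and hop_length must be positive")
    last_start = len(samples) - frame_length
    return [list(samples[i:i + frame_length])
            for i in range(0, last_start + 1, hop_length)]


def apply_hamming(frame: Sequence[float]) -> List[float]:
    """Weight one frame with a Hamming window to soften its edges."""
    n = len(frame)
    if n < 2:
        return list(frame)
    return [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
            for i, s in enumerate(frame)]
```

Each windowed frame would then be passed to the feature extractor, which turns it into one multi-dimensional vector.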
And the frame extracting processing module 40 is configured to perform frame extracting processing on the target video to obtain a user picture.
It should be understood that the target video is instantiated and initialized, and its total number of frames is obtained and printed. A variable is defined to store each frame image, and a loop flag marks the current frame. Each frame of the target video is read in a loop; the frame number, of long integer type, is converted to a character string and passed to an object str. One frame out of every 10 is kept and converted into a picture for output. The end condition is that the current frame number is greater than the total frame number, at which point the loop stops. The output pictures are the user pictures.
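The sampling rule (keep one frame out of every 10) is independent of how frames are decoded, so it can be sketched over any iterable of frames. The function name sample_frames is hypothetical; in practice the frames might come from a video reader such as OpenCV's cv2.VideoCapture, which is not used here so that the example stays self-contained.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def sample_frames(frames: Iterable[T], step: int = 10) -> Iterator[Tuple[int, T]]:
    """Yield (frame_number, frame) for every step-th frame, starting at frame 0."""
    if step <= 0:
        raise ValueError("step must be positive")
    for index, frame in enumerate(frames):
        if index % step == 0:
            yield index, frame
```

The loop ends naturally when the frame source is exhausted, which corresponds to the end condition of the current frame number exceeding the total frame number.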
And a generating module 50, configured to generate a target service document of the user according to the user picture and the target information.
It should be noted that the user picture may be used as the authentication information of the user. Voiceprint extraction may also be performed on the audio, and the extracted voiceprint may be used to authenticate the identity of the user. The target information is the user-related information extracted from the text read aloud by the user. The user picture and the target information are combined to generate a document, namely the target service document, which includes the user's identity authentication information and the various items of user information required by the target service type.
In this embodiment, a target service type input by the user is received, the corresponding target business case is looked up according to the target service type and displayed, a target video of the user reading the target business case aloud is shot, and audio separation is performed on the target video through an audio-video separator to obtain target audio information, so that reading aloud replaces the tedious steps of manual input. Character recognition is performed on the target audio information to obtain target information, and frame extraction is performed on the target video to obtain a user picture with which the identity of the user is verified. A target business document of the user is generated according to the user picture and the target information. Analyzing the video with artificial intelligence thus yields data of several kinds, which verifies the identity of the user and at the same time improves the efficiency of information entry.
In an embodiment, the character recognition module 30 is further configured to perform character recognition on the target audio information to obtain corresponding text information; compare the text information with the target business case to obtain the accuracy of the text information; and when the accuracy is greater than a preset accuracy threshold, extract information from the text through a regular expression to obtain target information.
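The accuracy check and regular-expression extraction can be sketched with the Python standard library. The sketch rests on several assumptions: accuracy is approximated by the similarity ratio of difflib.SequenceMatcher, the script wording and the regular expression are invented for illustration, the 18-character identification number format is assumed, and the sample values are fictitious.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Optional

# Hypothetical script shown to the user, with xxx marking the blanks.
TEMPLATE = "I am xxx, my ID number is xxx, I am from xxx"

# Hypothetical pattern matching the filled-in script.
PATTERN = re.compile(
    r"I am (?P<name>.+?), my ID number is (?P<id_number>\d{17}[\dXx]), "
    r"I am from (?P<region>.+)"
)


def text_accuracy(recognized: str, template: str) -> float:
    """Similarity ratio in [0, 1] between recognized text and the script."""
    return SequenceMatcher(None, recognized, template).ratio()


def extract_target_info(recognized: str, min_accuracy: float) -> Optional[Dict[str, str]]:
    """Return the filled-in fields, or None if the text strays from the script."""
    if text_accuracy(recognized, TEMPLATE) <= min_accuracy:
        return None
    match = PATTERN.search(recognized)
    return match.groupdict() if match else None


spoken = "I am Li Lei, my ID number is 110101199001011234, I am from Beijing"
info = extract_target_info(spoken, min_accuracy=0.5)
```

Because the blanks are replaced by the user's own words, the ratio never reaches 1 for a correct reading, so the threshold has to be chosen with the length of the filled-in content in mind.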
In an embodiment, the video and audio recognition apparatus further comprises:
the judging module is used for judging whether the target information meets a preset rule;
the prompting module is used for prompting, if the preset rule is not met, so that the user reads the target business case again;
and the frame extracting processing module 40 is further configured to, if the preset rule is met, execute the step of performing frame extracting processing on the target video to obtain a user picture.
In an embodiment, the video and audio recognition apparatus further comprises:
the living body detection module is used for carrying out face recognition on the target video and carrying out living body detection on the recognized face;
the frame extracting processing module 40 is further configured to execute the step of performing frame extracting processing on the target video to obtain a user picture when the living body detection is successful.
In an embodiment, the living body detection module is further configured to perform face recognition on the target video, and intercept an eye region of the recognized face to obtain an eye region image; identifying whether the eye region image has blinking actions through a preset blinking model; and if the eye region image is recognized to have blinking motion, determining that the living body detection is successful.
In an embodiment, the video and audio recognition apparatus further comprises:
the preprocessing module is used for preprocessing the user picture to obtain a preprocessed picture;
the screening module is used for screening the preprocessed pictures according to the definition to obtain screened pictures;
the comparison module is used for comparing the screened picture with a preset picture to obtain a comparison result;
the generating module 50 is further configured to generate a target service document of the user according to the screening picture and the target information when the comparison result exceeds a preset similarity threshold.
In an embodiment, the audio separation module 20 is further configured to shoot a target video of the user reading the target business case while playing target music;
performing audio separation on the target video through an audio-video separator to obtain mixed audio information;
and extracting, from the mixed audio information, the target audio information of the user reading the target business case through a computational auditory scene analysis algorithm.
Other embodiments or specific implementation manners of the video and audio recognition apparatus according to the present invention may refer to the above method embodiments, and are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and the like does not denote any order; these words may be interpreted as names.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be substantially implemented or a part contributing to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g., a Read Only Memory (ROM)/Random Access Memory (RAM), a magnetic disk, an optical disk), and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A video and audio recognition method is characterized by comprising the following steps:
receiving a target service type input by a user, searching a corresponding target business case according to the target service type, and displaying the target business case;
shooting a target video of the target business case read by the user, and carrying out audio separation on the target video through an audio-video separator to obtain target audio information;
performing character recognition on the target audio information to obtain target information;
performing frame extraction processing on the target video to obtain a user picture;
and generating a target business document of the user according to the user picture and the target information.
2. The video audio recognition method of claim 1, wherein the performing text recognition on the target audio information to obtain target information comprises:
performing character recognition on the target audio information to obtain corresponding text information;
comparing the text information with the target business case to obtain the accuracy of the text information;
and when the accuracy is greater than a preset accuracy threshold, extracting information of the text through a regular expression to obtain target information.
3. The video audio recognition method of claim 2, wherein after the text is extracted by a regular expression to obtain target information when the accuracy is greater than a preset accuracy threshold, the video audio recognition method comprises:
judging whether the target information meets a preset rule or not;
if not, prompting to enable the user to read the target service case again;
and if so, executing the step of performing frame extraction processing on the target video to obtain a user picture.
4. The video and audio recognition method of claim 1, wherein before the frame-extracting process is performed on the target video to obtain the user picture, the video and audio recognition method further comprises:
carrying out face recognition on the target video, and carrying out living body detection on the recognized face;
and when the living body detection is successful, executing the step of performing frame extraction processing on the target video to obtain a user picture.
5. The video audio recognition method of claim 4, wherein the performing face recognition on the target video and performing live body detection on the recognized face comprises:
carrying out face recognition on the target video, and intercepting eye regions of the recognized face to obtain eye region images;
identifying whether the eye region image has blinking actions through a preset blinking model;
and if the eye region image is recognized to have blinking motion, determining that the living body detection is successful.
6. The video and audio recognition method of claim 1, wherein after the frame extraction processing is performed on the target video to obtain a user picture, the video and audio recognition method further comprises:
preprocessing the user picture to obtain a preprocessed picture;
screening the preprocessed pictures according to the definition to obtain screened pictures;
comparing the screened picture with a preset picture to obtain a comparison result;
correspondingly, the generating the target business document of the user according to the user picture and the target information includes:
and when the comparison result exceeds a preset similarity threshold value, generating a target business document of the user according to the screening picture and the target information.
7. The video and audio recognition method of any one of claims 1 to 6, wherein the capturing a target video of the user reading the target business case, and performing audio separation on the target video through an audio and video separator to obtain target audio information comprises:
shooting a target video of the target business case read by the user while playing the target music;
performing audio separation on the target video through an audio-video separator to obtain mixed audio information;
and extracting, from the mixed audio information, the target audio information of the user reading the target business case through a computational auditory scene analysis algorithm.
8. A video audio recognition device, characterized in that the video audio recognition device comprises: memory, processor and a video and audio recognition program stored on the memory and executable on the processor, the video and audio recognition program when executed by the processor implementing the steps of the video and audio recognition method according to any one of claims 1 to 7.
9. A storage medium, characterized in that the storage medium has stored thereon a video and audio recognition program which, when executed by a processor, implements the steps of the video and audio recognition method according to any one of claims 1 to 7.
10. An apparatus for video and audio recognition, the apparatus comprising:
the searching module is used for receiving a target service type input by a user, searching a corresponding target business case according to the target service type and displaying the target business case;
the audio separation module is used for shooting a target video of the target business case read by the user, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
the character recognition module is used for carrying out character recognition on the target audio information to obtain target information;
the frame extraction processing module is used for carrying out frame extraction processing on the target video to obtain a user picture;
and the generating module is used for generating the target business document of the user according to the user picture and the target information.
CN201911374298.1A 2019-12-26 2019-12-26 Video and audio recognition method, device, storage medium and device Pending CN111191073A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911374298.1A CN111191073A (en) 2019-12-26 2019-12-26 Video and audio recognition method, device, storage medium and device
PCT/CN2020/102532 WO2021128817A1 (en) 2019-12-26 2020-07-17 Video and audio recognition method, apparatus and device and storage medium

Publications (1)

Publication Number Publication Date
CN111191073A true CN111191073A (en) 2020-05-22

Family

ID=70710065

Country Status (2)

Country Link
CN (1) CN111191073A (en)
WO (1) WO2021128817A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814714A (en) * 2020-07-15 2020-10-23 前海人寿保险股份有限公司 Image identification method, device and equipment based on audio and video recording and storage medium
CN112734752A (en) * 2021-01-25 2021-04-30 上海微亿智造科技有限公司 Method and system for image screening in flying shooting process
CN112911180A (en) * 2021-01-28 2021-06-04 中国建设银行股份有限公司 Video recording method and device, electronic equipment and readable storage medium
WO2021128817A1 (en) * 2019-12-26 2021-07-01 深圳壹账通智能科技有限公司 Video and audio recognition method, apparatus and device and storage medium
CN113822195A (en) * 2021-09-23 2021-12-21 四川云恒数联科技有限公司 Government affair platform user behavior recognition feedback method based on video analysis
CN115250375A (en) * 2021-04-26 2022-10-28 北京中关村科金技术有限公司 Method and device for detecting audio and video content compliance based on fixed telephone technology

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325742A (en) * 2018-09-26 2019-02-12 平安普惠企业管理有限公司 Business approval method, apparatus, computer equipment and storage medium
CN109840406A (en) * 2017-11-29 2019-06-04 百度在线网络技术(北京)有限公司 Living body verification method, device and computer equipment
CN110348378A (en) * 2019-07-10 2019-10-18 北京旷视科技有限公司 A kind of authentication method, device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10178301B1 (en) * 2015-06-25 2019-01-08 Amazon Technologies, Inc. User identification based on voice and face
CN110147726B (en) * 2019-04-12 2024-02-20 财付通支付科技有限公司 Service quality inspection method and device, storage medium and electronic device
CN111191073A (en) * 2019-12-26 2020-05-22 深圳壹账通智能科技有限公司 Video and audio recognition method, device, storage medium and device

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021128817A1 (en) * 2019-12-26 2021-07-01 深圳壹账通智能科技有限公司 Video and audio recognition method, apparatus and device and storage medium
CN111814714A (en) * 2020-07-15 2020-10-23 前海人寿保险股份有限公司 Image identification method, device and equipment based on audio and video recording and storage medium
CN111814714B (en) * 2020-07-15 2024-03-29 前海人寿保险股份有限公司 Image recognition method, device, equipment and storage medium based on audio and video recording
CN112734752A (en) * 2021-01-25 2021-04-30 上海微亿智造科技有限公司 Method and system for image screening in flying shooting process
CN112734752B (en) * 2021-01-25 2021-10-01 上海微亿智造科技有限公司 Method and system for image screening in flying shooting process
CN112911180A (en) * 2021-01-28 2021-06-04 中国建设银行股份有限公司 Video recording method and device, electronic equipment and readable storage medium
CN115250375A (en) * 2021-04-26 2022-10-28 北京中关村科金技术有限公司 Method and device for detecting audio and video content compliance based on fixed telephone technology
CN115250375B (en) * 2021-04-26 2024-01-26 北京中关村科金技术有限公司 Audio and video content compliance detection method and device based on fixed telephone technology
CN113822195A (en) * 2021-09-23 2021-12-21 四川云恒数联科技有限公司 Government affair platform user behavior recognition feedback method based on video analysis

Also Published As

Publication number Publication date
WO2021128817A1 (en) 2021-07-01

Similar Documents

Publication Publication Date Title
CN111191073A (en) Video and audio recognition method, device, storage medium and device
US10275672B2 (en) Method and apparatus for authenticating liveness face, and computer program product thereof
CN111881726B (en) Living body detection method and device and storage medium
CN111339913A (en) Method and device for recognizing emotion of character in video
US11694474B2 (en) Interactive user authentication
JP7412496B2 (en) Living body (liveness) detection verification method, living body detection verification system, recording medium, and training method for living body detection verification system
CN112148922A (en) Conference recording method, conference recording device, data processing device and readable storage medium
CN112633221A (en) Face direction detection method and related device
CN111341350A (en) Man-machine interaction control method and system, intelligent robot and storage medium
CN115376559A (en) Emotion recognition method, device and equipment based on audio and video
CN111950327A (en) Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment
CN114639150A (en) Emotion recognition method and device, computer equipment and storage medium
CN111401198A (en) Audience emotion recognition method, device and system
Lucey et al. Continuous pose-invariant lipreading
CN116206373A (en) Living body detection method, electronic device and storage medium
CN116645683A (en) Signature handwriting identification method, system and storage medium based on prompt learning
CN110569707A (en) Identity recognition method and electronic equipment
CN114565449A (en) Intelligent interaction method and device, system, electronic equipment and computer readable medium
Cotter Laboratory exercises for an undergraduate biometric signal processing course
CN114697687B (en) Data processing method and device
CN117851835B (en) Deep learning internet of things recognition system and method
CN115116147B (en) Image recognition, model training, living body detection method and related device
CN114760484B (en) Live video identification method, live video identification device, computer equipment and storage medium
CN115512419A (en) Video identification method, system, electronic equipment and storage medium
Dixit et al. SIFRS: Spoof Invariant Facial Recognition System (A Helping Hand for Visual Impaired People)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination