WO2021128817A1 - Video and audio recognition method, apparatus and device and storage medium - Google Patents

Video and audio recognition method, apparatus and device and storage medium Download PDF

Info

Publication number
WO2021128817A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
video
audio
information
user
Prior art date
Application number
PCT/CN2020/102532
Other languages
French (fr)
Chinese (zh)
Inventor
黄超
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021128817A1 publication Critical patent/WO2021128817A1/en

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval of audio data
    • G06F16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/686: Retrieval using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783: Retrieval using metadata automatically derived from the content
    • G06F16/7837: Retrieval using objects detected or recognised in the video content
    • G06F16/784: Retrieval where the detected or recognised objects are people

Definitions

  • This application relates to the technical field of artificial intelligence, and in particular to a video and audio recognition method, equipment, storage medium and device.
  • The main purpose of this application is to provide a video and audio recognition method, equipment, storage medium, and device, intended to solve the prior-art technical problem that entering user information is complicated and time-consuming.
  • the video and audio recognition method includes the following steps:
  • the target business document of the user is generated according to the user picture and the target information.
  • This application also proposes a video and audio recognition device. The device includes a memory, a processor, and a video and audio recognition program stored in the memory and runnable on the processor, the program being configured to implement the following steps:
  • the target business document of the user is generated according to the user picture and the target information.
  • this application also proposes a storage medium with a video and audio recognition program stored on the storage medium, and when the video and audio recognition program is executed by a processor, the following steps are implemented:
  • the target business document of the user is generated according to the user picture and the target information.
  • this application also proposes a video and audio recognition device, the video and audio recognition device comprising:
  • The search module is configured to receive the target business type input by the user, search for the corresponding target business copy based on the target business type, and display the target business copy;
  • The audio separation module is configured to shoot a target video in which the user reads the target business copy aloud, and to perform audio separation on the target video through an audio-video separator to obtain target audio information;
  • The text recognition module is configured to perform text recognition on the target audio information to obtain target information;
  • The frame extraction processing module is configured to perform frame extraction processing on the target video to obtain user pictures;
  • The generating module is configured to generate the target business document of the user according to the user picture and the target information.
  • The audio-video separator performs audio separation on the target video to obtain target audio information, and reading aloud replaces tedious manual input; text recognition is performed on the target audio information to obtain target information; frames are extracted from the target video to obtain user pictures for verifying the user's identity; and the user's target business document is generated from the user picture and the target information. Based on artificial intelligence, various data are obtained by parsing the video, the user's identity is verified, and at the same time the efficiency of user information entry is improved.
  • FIG. 1 is a schematic structural diagram of a video and audio recognition device in a hardware operating environment involved in a solution of an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of a first embodiment of a video and audio recognition method according to this application;
  • FIG. 3 is a schematic flowchart of a second embodiment of a video and audio recognition method according to this application.
  • FIG. 4 is a schematic flowchart of a third embodiment of a video and audio recognition method according to this application.
  • Fig. 5 is a structural block diagram of a first embodiment of a video and audio recognition device according to the present application.
  • FIG. 1 is a schematic structural diagram of a video and audio recognition device in a hardware operating environment involved in a solution of an embodiment of the application.
  • the video and audio recognition device may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the user interface 1003 may include a display screen (Display), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
  • the wired interface of the user interface 1003 may be a USB interface in this application.
  • The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless Fidelity (Wi-Fi) interface).
  • The memory 1005 may be a high-speed Random Access Memory (RAM), or a stable Non-volatile Memory (NVM) such as disk memory.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
  • The structure shown in FIG. 1 does not constitute a limitation on the video and audio recognition device, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
  • a memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module, and a video and audio recognition program.
  • The network interface 1004 is mainly used to connect to a back-end server and exchange data with it; the user interface 1003 is mainly used to connect to user equipment. The video and audio recognition device calls, through the processor 1001, the video and audio recognition program stored in the memory 1005, and executes the video and audio recognition method provided in the embodiments of the present application.
  • the video and audio recognition method includes the following steps:
  • Step S10 Receive the target business type input by the user, search for the corresponding target business copy based on the target business type, and display the target business copy.
  • the execution subject of this embodiment is the video and audio recognition device, where the video and audio recognition device may be an electronic device such as a smart phone, a personal computer, or a server, which is not limited in this embodiment.
  • Various business types can be presented as options. The user selects the target business type to be handled, and when the target business type input by the user is received, the target business copy corresponding to that type is looked up in a preset mapping table, which stores the correspondence between business types and business copy.
  • the target business types include businesses such as loans, leasing, or insurance.
  • the target business copy is user-related information that needs to be collected for each business type.
  • Each business type needs to collect basic personal information of the user, such as this piece of personal-information copy: I am xxx, my ID number is xxxxx, I am from the xxx area, and so on.
  • Different business types also need to collect relevant information corresponding to the business type.
  • the loan business also needs to collect the following information: whether there is a loan, whether there is a real estate, a car, and the amount of annual income.
  • The corresponding business copy can be created in advance for each business type, with the information to be collected presented in fill-in-the-blank form.
  • Step S20 Shoot the target video in which the user reads the target business copy, and perform audio separation on the target video through an audio-video separator to obtain target audio information.
  • When reading the target business copy aloud, the user fills in the blanks with his or her own information.
  • the process of the user reading the target business copy is photographed, and the video recording can be performed through the camera function of the video and audio recognition device, such as the recording function of a smart phone.
  • When the target business copy is displayed in a web page or app, a camera button is placed above or below the copy. The user taps the camera button to record a video of himself or herself reading the target business copy, and the target video is obtained.
  • Audio separation takes the sound and the images of the video out separately. The steps of separating the audio are: set the audio source; obtain the number of tracks in the source file and traverse them to find the required audio track; and extract that track to obtain the target audio information.
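The track-traversal step above can be sketched in Python. The track descriptors and field names here are illustrative assumptions, since the application does not specify a container format:

```python
# Minimal sketch of the track-selection step, assuming the container has
# already been demuxed into a list of track descriptors. The "type" and
# "codec" fields are hypothetical, not from the application.
def find_audio_track(tracks):
    """Traverse the source file's tracks and return the first audio track."""
    for index, track in enumerate(tracks):
        if track["type"] == "audio":
            return index, track
    raise ValueError("no audio track found in source file")

tracks = [
    {"type": "video", "codec": "h264"},
    {"type": "audio", "codec": "aac"},
]
index, audio = find_audio_track(tracks)
# The selected track would then be extracted to obtain the target audio.
```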
  • Step S30 Perform text recognition on the target audio information to obtain target information.
  • The silence at the beginning and the end of the target audio information is trimmed off to reduce interference with subsequent steps.
  • The first audio information left after silence trimming is framed, that is, cut into small segments, each called a frame; the framing operation is generally implemented with a moving window function. After framing, the first audio information becomes many small segments.
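As an illustration, the framing-with-moving-window step might look like the following sketch. The Hamming window and the toy frame sizes are assumptions; real systems typically use frames of about 25 ms with a 10 ms hop:

```python
import math

def frame_signal(samples, frame_len, hop):
    """Cut the audio into overlapping frames and apply a Hamming window.

    Toy illustration of framing with a moving window function.
    """
    # Hamming window coefficients for one frame.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# 10 samples, frame length 4, hop 2 -> frames starting at 0, 2, 4, 6.
frames = frame_signal(list(range(10)), frame_len=4, hop=2)
```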
  • The waveform is then transformed to extract Mel-scale Frequency Cepstral Coefficient (MFCC) features, turning each frame of the waveform into a multi-dimensional vector.
  • Several frames of speech correspond to one state, every three states combine into a phoneme, and several phonemes combine into a word, yielding the corresponding text information.
  • The content filled in by the user can then be extracted from the text information as the target information.
  • Step S40 Perform frame extraction processing on the target video to obtain a user picture.
  • The target video is instantiated and initialized; the total number of frames of the target video is obtained and printed; a variable is defined to store each frame image; and the current frame is tracked in a loop that reads every frame of the target video (converting the long frame index into a string where needed). One frame is grabbed every 10 frames and converted into a picture output. The end condition is that the loop stops when the current frame number exceeds the total number of frames, and the output pictures are the user pictures.
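The frame-sampling logic described above (keep one frame out of every 10) can be sketched in plain Python. Reading actual frames would require a video library such as OpenCV, which is omitted here so the selection logic stands alone:

```python
def sample_frame_indices(total_frames, step=10):
    """Return the indices of frames to keep: one frame every `step` frames.

    Sketch of the frame-extraction loop; a real implementation would read
    frames from the target video and write each kept frame out as a picture.
    """
    kept = []
    current = 0
    while current < total_frames:      # stop once past the total frame count
        if current % step == 0:        # grab one frame every `step` frames
            kept.append(current)
        current += 1
    return kept

# For a 25-frame video, frames 0, 10, and 20 become the user pictures.
indices = sample_frame_indices(total_frames=25)
```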
  • Step S50 Generate a target business document of the user according to the user picture and the target information.
  • The user picture can be used as the user's identity verification information. The user's voiceprint can also be extracted from the audio and used as the user's identity, with identity verification performed based on the voiceprint.
  • The target information is the user-related information extracted from the text read aloud by the user. The user picture and the target information are combined to generate a data document, namely the target business document, which includes the user's identity verification information and the various user information required by the target business type.
  • The target video is audio-separated by the audio-video separator to obtain target audio information, and reading aloud replaces tedious manual input; text recognition is performed on the target audio information to obtain target information; frames are extracted from the target video to obtain user pictures for verifying the user's identity; and the user's target business document is generated from the user picture and the target information. Based on artificial intelligence, various data are obtained by parsing the video, the user's identity is verified, and at the same time the efficiency of user information entry is improved.
  • FIG. 3 is a schematic flowchart of the second embodiment of the video and audio recognition method of the present application. Based on the first embodiment shown in FIG. 2 above, the second embodiment of the video and audio recognition method of the present application is proposed.
  • step S30 includes:
  • Step S301 Perform text recognition on the target audio information to obtain corresponding text information.
  • The silence at the beginning and the end of the target audio information is trimmed, and the remaining first audio information is divided into frames; several frames of speech correspond to one state, and each frame is assigned to whichever state it matches with the greatest probability. A state network is constructed, and the path that best matches the sound is searched for in it.
  • The speech recognition process is essentially a search for the best path through this state network. Every three states combine into one phoneme, and several phonemes combine into one word, yielding the text information corresponding to the target audio information.
  • Step S302 Compare the text information with the target business copy to obtain the correct rate of the text information.
  • the text information is the text formed by the user reading the target business copy.
  • In the ideal case, the user reads the target business copy exactly as written.
  • the fixed content in the text information is extracted, and the extracted content is compared with the target business copy.
  • the similarity between the extracted content and the target business copy can be used as the correct rate of the text information.
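One simple stand-in for the similarity measure, which the application leaves unspecified, is sequence matching from Python's standard library:

```python
import difflib

def correctness_rate(recognized_text, business_copy):
    """Use string similarity as the correctness rate of recognized text.

    difflib's ratio (2 * matches / total length) is an assumed, minimal
    choice; any string-similarity metric could fill this role.
    """
    return difflib.SequenceMatcher(None, recognized_text, business_copy).ratio()

# Identical text gives a correctness rate of 1.0.
rate = correctness_rate("my ID number is 123", "my ID number is 123")
```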
  • Step S303 When the correctness rate is greater than a preset correctness rate threshold, information extraction is performed on the text through regular expressions to obtain target information.
  • the preset correctness rate threshold can be set according to an empirical value, such as 80%.
  • If the two are similar in content, the text information is considered to have a correctness rate that meets the requirement, and the text information can be analyzed further.
  • The requirement of extracting a string at a specific position can be met with regular expressions. Specifically, for extraction at a single position the pattern (.+?) can be used: for example, given the string "a123b", to extract the value 123 between a and b we can use findall with the regular expression, which returns a list of all matches. This method can be used to extract numbers such as the user's phone number and ID number at the corresponding positions. For a string "a123b456b", if we want to match everything between a and the last b rather than between a and the first b, we can use ? to control greedy versus non-greedy matching.
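The greedy versus non-greedy behavior described above can be demonstrated directly with Python's re module:

```python
import re

# Non-greedy (.+?): capture stops at the FIRST "b" after "a".
first = re.findall(r"a(.+?)b", "a123b")

# Greedy (.+): capture runs to the LAST "b", taking everything in between.
greedy = re.findall(r"a(.+)b", "a123b456b")

# Non-greedy on the same string still stops at the first "b".
lazy = re.findall(r"a(.+?)b", "a123b456b")
```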
  • the method further includes:
  • step S40 is executed.
  • The template data filled into the target business copy is analyzed in advance, and a corresponding rule is set for each item to be filled in. For example, if the phone number is an 11-digit number, the preset rule for the phone number in the target business copy is to check whether it is an 11-digit number. If the preset rule is satisfied, the phone number in the target information is considered correct; if not, the phone number is considered to have been read incorrectly, and a voice prompt can be issued, for example that the number of digits read aloud is wrong or that one digit too many was read.
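A minimal sketch of such a preset rule, assuming the 11-digit phone-number field described above (the example numbers are placeholders):

```python
import re

def check_phone_rule(value):
    """Preset rule for the phone-number field: must be exactly 11 digits."""
    return re.fullmatch(r"\d{11}", value) is not None

ok = check_phone_rule("13812345678")   # 11 digits: rule satisfied
bad = check_phone_rule("1381234567")   # 10 digits: prompt the user to re-read
```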
  • The input box can support editing for error correction, so that the user can modify the text information.
  • Preset rules are likewise set in advance for regions: for example, various geographic locations are entered into a map beforehand, and when the content read from the target business copy is address information, it is checked whether the address in the text information belongs to the pre-entered geographic locations. If it does, the address read aloud is considered correct; if not, it is considered wrong.
  • The text is processed with regular expressions to extract the target information, thereby improving the accuracy of information entry.
  • FIG. 4 is a schematic flowchart of a third embodiment of a video and audio recognition method according to this application. Based on the above-mentioned first or second embodiment, a third embodiment of the video and audio recognition method according to this application is proposed. This embodiment is described based on the first embodiment.
  • Before step S40, the method further includes:
  • step S40 is executed.
  • The principle of finding people in a video is the same as finding people in pictures. A video is a collection of pictures, so in essence the task is still finding people in pictures; rectangular frames are drawn around the people found and the recognized faces to realize face recognition.
  • Face detection (Face Detection) is responsible for locating the face position, and face alignment (Face Alignment) aligns the face.
  • The algorithm uses an affine transformation based on the eye coordinates to perform face alignment, and uses the Visual Geometry Group Network (VGG) model for feature extraction.
  • the get_feature_new function opens the picture, and uses the VGG network to extract features.
  • the compare_pic function calculates the similarity of the two features passed in.
  • the key point is the selection of the threshold.
  • The face_recog_test function reads the test pictures, calculates the best parameters for each group of pictures, and saves the aligned face pictures for use in subsequent facial feature comparison.
  • The SeetaFace Engine or Face Alignment can be used for face recognition to obtain the facial features in the input picture.
  • Performing face recognition on the target video and performing live detection on the recognized face includes:
  • Live detection is performed on the recognized face, checking for example whether the detected face moves or blinks, to determine that it is a real person and not a photo.
  • After face detection and eye localization, the eye region is cropped, and the degree of eye opening and closing is computed from the normalized image; a model for judging the blinking action is built on a convolutional neural network, and the model recognizes whether a blink occurs in the image.
  • A convolutional neural network model to be trained can be established in advance. A large number of sample images are obtained, the eye region of the face in each sample image is cropped to obtain sample eye images, and the blink label corresponding to each sample image is obtained. The model is trained on the sample eye images and their blink labels to obtain the preset blink model. The preset blink model can then recognize the eye-region image, and if a blinking action is recognized, the target video is considered to show a real person and the liveness detection is judged successful.
  • the method further includes:
  • Step S401 Perform preprocessing on the user picture to obtain a preprocessed picture.
  • the user picture can be pre-processed in advance.
  • The purpose of image preprocessing is to eliminate irrelevant information in the image and to remove or reduce, as far as possible, interference from lighting, the imaging system, or the external environment, so that the image's features stand out.
  • the preprocessing process includes processing steps such as light compensation, grayscale transformation, histogram equalization, normalization, geometric correction, filtering, and sharpening of the face image, so as to obtain the preprocessed picture.
  • Step S402 Screen the pre-processed pictures according to the definition to obtain screened pictures.
  • the sharpness of the image is an important indicator to measure the quality of the image.
  • Sharpness can be evaluated with the re-blur (Reblur) algorithm: if an image is already blurred, blurring it again changes its high-frequency components little; but if the original image is sharp, a single blurring pass changes the high-frequency components greatly. A degraded image is therefore obtained by applying Gaussian blurring to the image under evaluation, the changes in adjacent pixel values of the original and degraded images are compared, and the sharpness level is judged from the magnitude of the change.
  • Specifically, the preprocessed picture is low-pass filtered to obtain a blurred image. The gray-value changes of adjacent pixels in the preprocessed picture are computed to obtain a first pixel change value, and the gray-value changes of adjacent pixels in the blurred image are computed to obtain a second pixel change value. The two change values are compared, normalization is performed to obtain a sharpness result, and the preprocessed pictures are screened according to the sharpness result to obtain the screened pictures.
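A toy one-dimensional sketch of the re-blur idea follows. A 3-tap average stands in for the Gaussian low-pass filter, and a single row of pixels stands in for an image; this illustrates the principle, not the application's exact computation:

```python
def total_variation(signal):
    """Sum of absolute differences between adjacent pixel values."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))

def reblur_sharpness(signal):
    """Re-blur sharpness score on a 1-D row of pixels, normalized to [0, 1].

    Blur the input once, then measure how much the adjacent-pixel variation
    dropped: sharp inputs lose much high-frequency content, while inputs
    that are already blurry lose little.
    """
    padded = [signal[0]] + list(signal) + [signal[-1]]   # clamp the edges
    blurred = [(padded[i - 1] + padded[i] + padded[i + 1]) / 3
               for i in range(1, len(padded) - 1)]
    original = total_variation(signal)
    degraded = total_variation(blurred)
    if original == 0:
        return 0.0
    return (original - degraded) / original

sharp_score = reblur_sharpness([0, 255, 0, 255, 0, 255, 0, 255])  # hard edges
smooth_score = reblur_sharpness([0, 10, 20, 30, 40, 50, 60, 70])  # gentle ramp
```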
  • Step S403 Compare the selected picture with a preset picture to obtain a comparison result.
  • step S50 includes:
  • Step S501 When the comparison result exceeds a preset similarity threshold, generate a target business document of the user according to the screened picture and the target information.
  • The face similarity is used as the comparison result. If the face similarity exceeds the preset similarity threshold, the user's identity is considered verified, and the user's business document can then be generated.
  • The preset similarity threshold may be set according to an empirical value, such as 40%. The facial features are compared and the face similarity is calculated; with the preset similarity threshold set to 0.4, a similarity greater than 40% means the two faces are considered the same person, and the screened picture and the target information can be used to generate the user's target business document.
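As one possible realization of the comparison (the application does not fix the metric), cosine similarity between face feature vectors can be checked against the 0.4 threshold; the feature values below are arbitrary illustrations:

```python
import math

def face_similarity(feat_a, feat_b):
    """Cosine similarity between two face feature vectors (e.g. VGG features).

    Cosine similarity is an assumed, common choice of comparison function.
    """
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    return dot / (norm_a * norm_b)

PRESET_THRESHOLD = 0.4   # empirical value from the description

# Two similar (hypothetical) feature vectors clear the threshold.
same_person = face_similarity([0.9, 0.1, 0.3], [0.8, 0.2, 0.4]) > PRESET_THRESHOLD
```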
  • step S20 includes:
  • the target music can be played at the same time.
  • the target music can create a noisy voice environment and prevent the user's personal information from being learned by others.
  • the target video captured at this time includes the target music and the audio of the user reading the target business copy.
  • After audio separation of the target video by the audio-video separator yields mixed audio information, the Computational Auditory Scene Analysis (CASA) algorithm, which simulates the human auditory system, is further used to extract the speech read aloud by the user from the noisy environment.
  • the audio information will be encoded to achieve grouping and parsing. There are currently dozens of grouping criteria related to time and frequency, including pitch, spatial position, and start/end time.
  • Pitch is a very important grouping cue: it identifies the unique characteristics of a sound from its distinct harmonic pattern. When two or more microphones are used, the sound isolation system can determine the direction and distance of each sound source from the spatial position information.
  • the CASA modeling method enables the sound isolation system to focus on a certain sound source, such as a certain person, and shield the background sound.
  • Start/stop time grouping refers to the moments when a sound component begins and ends; combined with the original frequency data, these data allow judging whether components come from the same sound source. A series of noises can be masked out so that recognition focuses on a specific sound source. Sounds with similar attributes form the same audio stream, while sounds with different attributes form their own streams; these different audio streams can be used to identify continuous or repetitive sound sources. With enough voice groups, the actual voice isolation process can match the identified sound sources and respond to the real speaker's voice, thereby separating out the target audio information of the user reading the target business copy.
  • The user pictures are processed to obtain better-quality screened pictures, which are then compared with preset pictures in the public security system to verify the user's identity, improving the security and reliability of information entry.
  • An embodiment of the present application also proposes a storage medium, which may be non-volatile or volatile. A video and audio recognition program is stored on the storage medium, and when the program is executed by a processor, the steps of the video and audio recognition method described above are implemented.
  • an embodiment of the present application also proposes a video and audio recognition device, and the video and audio recognition device includes:
  • the searching module 10 is configured to receive the target business type input by the user, search for the corresponding target business copy according to the target business type, and display the target business copy.
  • Various business types can be presented as options. The user selects the target business type to be handled, and when the target business type input by the user is received, the target business copy corresponding to that type is looked up in a preset mapping table, which stores the correspondence between business types and business copy.
  • the target business types include businesses such as loans, leasing, or insurance.
  • The target business copy is the user-related information to be collected for each business type. For example, each business type needs to collect the user's basic personal information, such as this piece of personal-information copy: I am xxx, my ID number is xxxxx, I am from the xxx area, and so on.
  • the loan business also needs to collect the following information: whether there is a loan, whether there is a real estate, a car, and the amount of annual income.
  • The corresponding business copy can be created in advance for each business type, with the information to be collected presented in fill-in-the-blank form.
  • the audio separation module 20 is used to shoot the target video in which the user reads the target business copy, and perform audio separation on the target video through an audio-video separator to obtain target audio information.
  • When reading the target business copy aloud, the user fills in the blanks with his or her own information.
  • the process of the user reading the target business copy is photographed, and the video recording can be performed through the camera function of the video and audio recognition device, such as the recording function of a smart phone.
  • When the target business copy is displayed in a web page or app, a camera button is placed above or below the copy. The user taps the camera button to record a video of himself or herself reading the target business copy, and the target video is obtained.
  • Audio separation takes the sound and the images of the video out separately. The steps of separating the audio are: set the audio source; obtain the number of tracks in the source file and traverse them to find the required audio track; and extract that track to obtain the target audio information.
  • the text recognition module 30 is used to perform text recognition on the target audio information to obtain target information.
  • the mute at the beginning and the end of the target audio information is cut off to reduce the interference caused to subsequent steps.
  • the first audio information (after silence trimming) is divided into frames, that is, cut into short segments, each of which is called a frame; the framing operation is generally implemented with a moving window function. After framing, the first audio information becomes many short segments.
  • the waveform is then transformed to extract Mel-scale Frequency Cepstral Coefficient (MFCC) features, turning each frame of waveform into a multi-dimensional vector.
  • Several frames of speech correspond to a state, every three states are combined into a phoneme, and several phonemes are combined into a word, so as to obtain the corresponding text information.
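A minimal sketch of the framing step with a moving (Hamming) window; the frame length, hop size, and synthetic signal are arbitrary illustrative choices, not values from the original disclosure:

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Cut a 1-D signal into overlapping frames and apply a Hamming
    window, as done before MFCC feature extraction."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])  # shape: (n_frames, frame_len)

audio = np.ones(400)  # stand-in for real audio samples
frames = frame_signal(audio, frame_len=200, hop=80)
```

Each row of `frames` would then be converted into one MFCC vector for the frame-to-state decoding described above.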
  • the content filled in by the user can be extracted from the text information as the target information.
  • the frame extraction processing module 40 is configured to perform frame extraction processing on the target video to obtain a user picture.
  • the target video is instantiated and initialized at the same time; the total number of frames of the target video is obtained and printed; a variable is defined to store each frame image; a loop flag and the current frame are defined; each frame of the target video is read; a string stream converts the long-integer frame index into a character string passed to the object str; a frame is grabbed every 10 frames and converted into a picture for output; the end condition is that when the current frame number is greater than the total number of frames the loop stops, and the output pictures are the user pictures.
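The every-10-frames loop described above boils down to choosing which frame indices to keep; this pure-Python sketch shows only that selection (in practice the frames themselves would be read with a video library such as OpenCV):

```python
def sample_frame_indices(total_frames, step=10):
    """Return the indices of the frames to keep: one frame every `step`
    frames, stopping once the current frame number exceeds the total."""
    indices = []
    current = 0
    while current < total_frames:  # loop stops past the total frame count
        indices.append(current)
        current += step
    return indices

picked = sample_frame_indices(total_frames=95, step=10)  # -> [0, 10, ..., 90]
```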
  • the generating module 50 is configured to generate a target business document of the user according to the user picture and the target information.
  • the user picture can be used as the user's identity verification information; the user's voiceprint can also be extracted from the audio, the extracted voiceprint used as the user's identity, and identity verification performed based on the voiceprint.
  • the target information is the user-related information extracted from the text read aloud by the user; the user picture and the target information are combined to generate a data document, namely the target business document, which includes the user identity verification information and the various user information required by the target service type.
  • audio separation is performed on the target video through an audio-video separator to obtain target audio information, reducing the tedious steps of manual input through voice reading; text recognition is performed on the target audio information to obtain target information, and frame extraction is performed on the target video to obtain user pictures for verifying the user's identity; the user's target business document is generated according to the user picture and the target information. Based on artificial intelligence, various data are obtained by parsing the video, verifying the user's identity while improving the efficiency of user information entry.
  • the text recognition module 30 is further configured to perform text recognition on the target audio information to obtain corresponding text information; compare the text information with the target business copy to obtain the correct rate of the text information; and, when the correct rate is greater than a preset correct-rate threshold, perform information extraction on the text through regular expressions to obtain the target information.
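A sketch of this compare-then-extract flow. The similarity ratio from Python's difflib stands in for the correct-rate computation, and the template copy, threshold, and pattern are illustrative assumptions:

```python
import difflib
import re

TEMPLATE = "My name is xxx and my ID number is xxxxx"  # hypothetical copy
THRESHOLD = 0.6                                        # illustrative threshold

def extract_if_correct(recognized):
    """Compare the recognized text against the business copy; only when
    the correct rate exceeds the threshold, pull the filled-in fields
    out with a regular expression."""
    rate = difflib.SequenceMatcher(None, recognized, TEMPLATE).ratio()
    if rate <= THRESHOLD:
        return None  # prompt the user to read the copy again
    m = re.search(r"My name is (?P<name>\w+) and my ID number is (?P<idno>\d+)",
                  recognized)
    return m.groupdict() if m else None

info = extract_if_correct("My name is Alice and my ID number is 12345")
```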
  • the video and audio recognition device further includes:
  • the judgment module is used to judge whether the target information satisfies a preset rule
  • a prompting module is used for prompting if not satisfied, so that the user can read the target business copy again;
  • the frame extraction processing module 40 is further configured to perform the step of performing frame extraction processing on the target video to obtain a user picture if it is satisfied.
  • the video and audio recognition device further includes:
  • a living body detection module configured to perform face recognition on the target video, and perform live body detection on the recognized face
  • the frame extraction processing module 40 is further configured to perform the step of performing frame extraction processing on the target video to obtain a user picture when the living body detection is successful.
  • the living body detection module is further used to perform face recognition on the target video, intercept the eye area of the recognized face, and obtain an image of the eye area; recognize, through a preset blinking model, whether the eye-area image contains a blinking action; and, if a blinking action is recognized in the eye-area image, determine that the living body detection is successful.
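The blink check can be reduced to spotting an open-closed-open dip in a per-frame eye-openness score; this pure-Python sketch assumes such scores have already been produced by the preset blinking model, and the threshold is an illustrative choice:

```python
def has_blink(openness, closed_threshold=0.2):
    """Report a blinking action if the eye-openness score drops below
    the threshold and then recovers, i.e. open -> closed -> open."""
    seen_open_before = False
    seen_closed = False
    for score in openness:
        if score < closed_threshold:
            if seen_open_before:
                seen_closed = True
        else:
            if seen_closed:
                return True  # eye reopened after closing: a blink
            seen_open_before = True
    return False

blinked = has_blink([0.9, 0.8, 0.1, 0.05, 0.85, 0.9])
steady = has_blink([0.9, 0.85, 0.9, 0.88])
```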
  • the video and audio recognition device further includes:
  • the preprocessing module is used to preprocess the user picture to obtain the preprocessed picture
  • the screening module is used to screen the pre-processed pictures according to the definition to obtain the screened pictures;
  • the comparison module is used to compare the selected picture with the preset picture to obtain a comparison result
  • the generating module 50 is further configured to generate a target business document of the user according to the screened picture and the target information when the comparison result exceeds a preset similarity threshold.
  • the audio separation module 20 is also used to play the target music while shooting the target video of the user reading the target business copy;
  • serial numbers of the foregoing embodiments of the present application are for description only, and do not represent the superiority or inferiority of the embodiments.
  • several of these devices may be embodied in the same hardware item.
  • the use of the words first, second, and third does not indicate any order; these words may be interpreted as labels.
  • the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a Read-Only Memory (ROM) image or Random Access Memory (RAM), magnetic disks, or optical disks) and includes several instructions to enable a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, etc.) to execute the methods described in each embodiment of the present application.

Abstract

A video and audio recognition method, apparatus and device and a storage medium. The method comprises: receiving a target service type input by a user, and according to the target service type, searching a corresponding target service text and displaying same (S10); shooting a target video of the target service text read aloud by the user, and by means of an audio and video separator, carrying out audio separation on the target video to obtain target audio information (S20), wherein tedious steps of manual input are reduced by means of voice reading; carrying out character recognition on the target audio information to obtain target information (S30); carrying out frame extraction on the target video to obtain a user picture (S40) so as to perform verification of user identity; and according to the user picture and the target information, generating a target service document of the user (S50).

Description

视频音频识别方法、设备、存储介质及装置Video and audio recognition method, equipment, storage medium and device
本申请要求于2019年12月26日提交中国专利局、申请号为201911374298.1,发明名称为“视频音频识别方法、设备、存储介质及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on December 26, 2019 with the application number 201911374298.1 and the invention title "Video and Audio Recognition Method, Equipment, Storage Medium and Device", the entire content of which is incorporated by reference In this application.
技术领域Technical field
本申请涉及人工智能的技术领域,尤其涉及一种视频音频识别方法、设备、存储介质及装置。This application relates to the technical field of artificial intelligence, and in particular to a video and audio recognition method, equipment, storage medium and device.
背景技术Background technique
金融场景中在对用户进行真实的校验需求时，需要对用户的数据真实性反复收集再验证真假，以便尽可能提升风控能力，以尽可能的精确评价用户的贷款金融，目标是精准风控。发明人意识到，在目前贷款场景中，比较常见都会增加一个身份验证的过程，验证通过后在通过用户在网页或者应用程序(Application,APP)中输入信息，以进行用户资料的收集，如此繁琐的操作，会导致页面比较多，异常也会增加，用户信息的录入耗时长，对于用户体验也非常差。In financial scenarios, when real verification of a user is required, the authenticity of the user's data needs to be repeatedly collected and verified, so as to improve the risk-control capability as much as possible and evaluate the user's loan financing as accurately as possible; the goal is precise risk control. The inventor realized that in current loan scenarios it is common to add an identity verification process; after verification passes, the user enters information in a web page or application (Application, APP) so that user data can be collected. Such cumbersome operations lead to more pages and more anomalies, the entry of user information takes a long time, and the user experience is very poor.
上述内容仅用于辅助理解本申请的技术方案,并不代表承认上述内容是现有技术。The above content is only used to assist the understanding of the technical solution of the application, and does not mean that the above content is recognized as prior art.
技术解决方案Technical solutions
本申请的主要目的在于提供一种视频音频识别方法、设备、存储介质及装置,旨在解决现有技术中用户信息的录入操作繁琐导致耗时长的技术问题。The main purpose of this application is to provide a video and audio recognition method, equipment, storage medium, and device, which are intended to solve the technical problem that the input operation of user information is complicated and time-consuming in the prior art.
为实现上述目的,本申请提供一种视频音频识别方法,所述视频音频识别方法包括以下步骤:In order to achieve the above objective, the present application provides a video and audio recognition method. The video and audio recognition method includes the following steps:
接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;Receiving the target business type input by the user, searching for a corresponding target business copy based on the target business type, and displaying the target business copy;
拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;Shooting the target video in which the user reads the target business copy, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
对所述目标音频信息进行文字识别,获得目标信息;Perform text recognition on the target audio information to obtain target information;
对所述目标视频进行抽帧处理,获得用户图片;Performing frame extraction processing on the target video to obtain a user picture;
根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The target business document of the user is generated according to the user picture and the target information.
此外,为实现上述目的,本申请还提出一种视频音频识别设备,所述视频音频识别设备包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的视频音频识别程序,所述视频音频识别程序配置为实现以下步骤:In addition, in order to achieve the above object, this application also proposes a video and audio recognition device, the video and audio recognition device includes a memory, a processor, and a video and audio recognition program stored on the memory and running on the processor , The video and audio recognition program is configured to implement the following steps:
接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;Receiving the target business type input by the user, searching for a corresponding target business copy based on the target business type, and displaying the target business copy;
拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;Shooting the target video in which the user reads the target business copy, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
对所述目标音频信息进行文字识别,获得目标信息;Perform text recognition on the target audio information to obtain target information;
对所述目标视频进行抽帧处理,获得用户图片;Performing frame extraction processing on the target video to obtain a user picture;
根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The target business document of the user is generated according to the user picture and the target information.
此外,为实现上述目的,本申请还提出一种存储介质,所述存储介质上存储有视频音频识别程序,所述视频音频识别程序被处理器执行时实现以下步骤:In addition, in order to achieve the above-mentioned object, this application also proposes a storage medium with a video and audio recognition program stored on the storage medium, and when the video and audio recognition program is executed by a processor, the following steps are implemented:
接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;Receiving the target business type input by the user, searching for a corresponding target business copy based on the target business type, and displaying the target business copy;
拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;Shooting the target video in which the user reads the target business copy, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
对所述目标音频信息进行文字识别,获得目标信息;Perform text recognition on the target audio information to obtain target information;
对所述目标视频进行抽帧处理,获得用户图片;Performing frame extraction processing on the target video to obtain a user picture;
根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The target business document of the user is generated according to the user picture and the target information.
此外,为实现上述目的,本申请还提出一种视频音频识别装置,所述视频音频识别装置包括:In addition, in order to achieve the above objective, this application also proposes a video and audio recognition device, the video and audio recognition device comprising:
查找模块,用于接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;The search module is configured to receive the target business type input by the user, search for the corresponding target business copy based on the target business type, and display the target business copy;
音频分离模块,用于拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;An audio separation module, configured to shoot a target video in which the user reads the target business copy, and perform audio separation on the target video through an audio-video separator to obtain target audio information;
文字识别模块,用于对所述目标音频信息进行文字识别,获得目标信息;The text recognition module is used to perform text recognition on the target audio information to obtain target information;
抽帧处理模块,用于对所述目标视频进行抽帧处理,获得用户图片;The frame extraction processing module is used to perform frame extraction processing on the target video to obtain user pictures;
生成模块,用于根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The generating module is used to generate the target business document of the user according to the user picture and the target information.
本申请中,通过接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示,拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息,通过语音朗读减少手动输入的繁琐步骤;对所述目标音频信息进行文字识别,获得目标信息,对所述目标视频进行抽帧处理,获得用户图片,以对用户身份实现验证;根据所述用户图片和所述目标信息生成所述用户的目标业务文档,基于人工智能,通过解析视频获得多方面的数据,验证用户身份的同时提升用户的信息录入效率。In this application, by receiving the target business type input by the user, searching for the corresponding target business copy based on the target business type, displaying the target business copy, shooting the target video in which the user reads the target business copy, and passing The audio and video separator performs audio separation on the target video to obtain target audio information, and reduces the tedious steps of manual input through voice reading; performs text recognition on the target audio information to obtain target information, and extracts frames from the target video Process to obtain user pictures to verify the user identity; generate the user’s target business document based on the user picture and the target information, and based on artificial intelligence, obtain various data by parsing the video, and verify the user’s identity at the same time Improve the efficiency of user information entry.
附图说明Description of the drawings
图1是本申请实施例方案涉及的硬件运行环境的视频音频识别设备的结构示意图;FIG. 1 is a schematic structural diagram of a video and audio recognition device in a hardware operating environment involved in a solution of an embodiment of the present application;
图2为本申请视频音频识别方法第一实施例的流程示意图;2 is a schematic flowchart of a first embodiment of a video and audio recognition method according to this application;
图3为本申请视频音频识别方法第二实施例的流程示意图;FIG. 3 is a schematic flowchart of a second embodiment of a video and audio recognition method according to this application;
图4为本申请视频音频识别方法第三实施例的流程示意图;4 is a schematic flowchart of a third embodiment of a video and audio recognition method according to this application;
图5为本申请视频音频识别装置第一实施例的结构框图。Fig. 5 is a structural block diagram of a first embodiment of a video and audio recognition device according to the present application.
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
本发明的实施方式Embodiments of the present invention
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application.
参照图1,图1为本申请实施例方案涉及的硬件运行环境的视频音频识别设备结构示意图。Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a video and audio recognition device in a hardware operating environment involved in a solution of an embodiment of the application.
如图1所示,该视频音频识别设备可以包括:处理器1001,例如中央处理器(Central Processing Unit,CPU),通信总线1002、用户接口1003,网络接口1004,存储器1005。其中,通信总线1002用于实现这些组件之间的连接通信。用户接口1003可以包括显示屏(Display),可选用户接口1003还可以包括标准的有线接口、无线接口,对于用户接口1003的有线接口在本申请中可为USB接口。网络接口1004可选的可以包括标准的有线接口、无线接口(如无线保真(WIreless-FIdelity,WI-FI)接口)。存储器1005可以是高速的随机存取存储器(Random Access Memory,RAM)存储器,也可以是稳定的存储器(Non-volatile Memory,NVM),例如磁盘存储器。存储器1005可选的还可以是独立于前述处理器1001的存储装置。As shown in FIG. 1, the video and audio recognition device may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Among them, the communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display), and the optional user interface 1003 may also include a standard wired interface and a wireless interface. The wired interface of the user interface 1003 may be a USB interface in this application. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a wireless fidelity (WI-FIdelity, WI-FI) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) memory, or a stable memory (Non-volatile Memory, NVM), such as a disk memory. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
本领域技术人员可以理解,图1中示出的结构并不构成对视频音频识别设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。Those skilled in the art can understand that the structure shown in FIG. 1 does not constitute a limitation on the video and audio recognition device, and may include more or less components than those shown in the figure, or a combination of certain components, or different component arrangements.
如图1所示,作为一种计算机存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及视频音频识别程序。As shown in FIG. 1, a memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module, and a video and audio recognition program.
在图1所示的视频音频识别设备中,网络接口1004主要用于连接后台服务器,与所述后台服务器进行数据通信;用户接口1003主要用于连接用户设备;所述视频音频识别设备通过处理器1001调用存储器1005中存储的视频音频识别程序,并执行本申请实施例提供的视频音频识别方法。In the video and audio recognition device shown in FIG. 1, the network interface 1004 is mainly used to connect to a back-end server and perform data communication with the back-end server; the user interface 1003 is mainly used to connect to user equipment; the video and audio recognition device passes through the processor 1001 calls the video and audio recognition program stored in the memory 1005, and executes the video and audio recognition method provided in the embodiment of the present application.
基于上述硬件结构,提出本申请视频音频识别方法的实施例。Based on the above hardware structure, an embodiment of the video and audio recognition method of the present application is proposed.
参照图2,图2为本申请视频音频识别方法第一实施例的流程示意图,提出本申请视频音频识别方法第一实施例。2, which is a schematic flowchart of a first embodiment of a video and audio recognition method according to this application, and a first embodiment of the video and audio recognition method according to this application is proposed.
在第一实施例中,所述视频音频识别方法包括以下步骤:In the first embodiment, the video and audio recognition method includes the following steps:
步骤S10:接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示。Step S10: Receive the target business type input by the user, search for the corresponding target business copy based on the target business type, and display the target business copy.
应理解的是，本实施例的执行主体是所述视频音频识别设备，其中，所述视频音频识别设备可为智能手机、个人电脑或服务器等电子设备，本实施例对此不加以限制。在网页或者APP内，可通过选项呈现各种业务类型，用户选择需要进行的所述目标业务类型，在接收到用户输入的所述目标业务类型时，从预设映射关系表中查找与所述目标业务类型对应的目标业务文案，所述预设映射关系表中包括业务类型与业务文案之间的对应关系。所述目标业务类型包括贷款、租赁或保险等业务，所述目标业务文案为各业务类型需要收集的用户相关信息，比如各业务类型均需采集用户的个人基本信息，如一段个人信息文案：我是xxx，我的身份证件号是xxxxx，我是来自xxx地区等。不同的业务类型还需采集业务类型对应的相关信息，比如贷款业务还需采集如下信息：是否有在还贷款，是否有房产、车子以及年收入多少等信息，可预先按照业务类型建立对应的业务文案，将需要采集的信息以填空形式呈现。It should be understood that the execution subject of this embodiment is the video and audio recognition device, where the video and audio recognition device may be an electronic device such as a smart phone, a personal computer, or a server, which is not limited in this embodiment. In the web page or APP, various business types can be presented through options; the user selects the target business type to be handled, and when the target business type input by the user is received, the target business copy corresponding to the target business type is looked up in a preset mapping relationship table, which includes the correspondence between business types and business copies. The target business types include businesses such as loans, leasing, or insurance, and the target business copy is the user-related information that needs to be collected for each business type. For example, every business type needs to collect the user's basic personal information, such as a personal-information passage: "I am xxx, my ID number is xxxxx, I am from the xxx area", etc. Different business types also need to collect information specific to the business type; for example, the loan business also needs to collect the following information: whether there is an outstanding loan, whether the user owns real estate or a car, and the annual income. The corresponding business copy can be established in advance according to the business type, with the information to be collected presented in fill-in-the-blank form.
步骤S20:拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息。Step S20: Shoot the target video in which the user reads the target business copy, and perform audio separation on the target video through an audio-video separator to obtain target audio information.
需要说明的是，用户在朗读所述目标业务文案时，可对需要填空的内容结合自己的信息在朗读时进行填充。对所述用户朗读所述目标业务文案的过程进行拍摄，视频录制可通过所述视频音频识别设备的摄像功能进行，比如智能手机的录像功能。在所述网页或者APP内有摄像按钮，所述目标业务文案在所述网页或者APP内进行展示时，所述业务文案的上方或者下方设置摄像按钮，用户通过点击该摄像按钮，拍摄自己朗读所述目标业务文案的视频，获得所述目标视频。It should be noted that, when reading the target business copy aloud, the user can fill in the blanks with his or her own information while reading. The process of the user reading the target business copy aloud is recorded; the video recording can be performed through the camera function of the video and audio recognition device, such as the recording function of a smart phone. There is a camera button in the web page or APP: when the target business copy is displayed in the web page or APP, a camera button is set above or below the business copy, and the user clicks the camera button to record a video of himself or herself reading the target business copy aloud, thereby obtaining the target video.
可理解的是，音频分离通常是将视频的声音和图像分别取出来，分离音频步骤为：设置音频源；获取源文件中轨道的数量，并遍历找到需要的音频轨；对找到的音频轨进行提取，获得所述目标音频信息。It is understandable that audio separation usually means taking the sound and the image of the video out separately. The steps of separating the audio are: set the audio source; get the number of tracks in the source file and traverse them to find the required audio track; extract the found audio track to obtain the target audio information.
步骤S30:对所述目标音频信息进行文字识别,获得目标信息。Step S30: Perform text recognition on the target audio information to obtain target information.
在具体实现中，将所述目标音频信息中首尾端的静音切除，降低对后续步骤造成的干扰。对静音切除后的第一音频信息进行分帧，也就是把所述第一音频信息切开成一小段一小段，每小段称为一帧，分帧操作一般使用移动窗函数来实现。分帧后，所述第一音频信息就变成了很多小段。再将波形作变换，提取梅尔倒谱系数(Mel-scale Frequency Cepstral Coefficients,MFCC)特征，把每一帧波形变成一个多维向量。接着，把帧识别成状态；把状态组合成音素；把音素组合成单词。若干帧语音对应一个状态，每三个状态组合成一个音素，若干个音素组合成一个单词，从而获得对应的文本信息，可将所述文本信息中用户填充的内容进行提取，作为所述目标信息。In a specific implementation, the silence at the beginning and end of the target audio information is trimmed to reduce interference with subsequent steps. The first audio information after silence trimming is divided into frames, that is, cut into short segments, each of which is called a frame; the framing operation is generally implemented with a moving window function. After framing, the first audio information becomes many short segments. The waveform is then transformed to extract Mel-scale Frequency Cepstral Coefficient (MFCC) features, turning each frame of waveform into a multi-dimensional vector. Next, frames are recognized as states, states are combined into phonemes, and phonemes are combined into words. Several frames of speech correspond to one state, every three states combine into one phoneme, and several phonemes combine into one word, thereby obtaining the corresponding text information; the content filled in by the user can be extracted from the text information as the target information.
步骤S40:对所述目标视频进行抽帧处理,获得用户图片。Step S40: Perform frame extraction processing on the target video to obtain a user picture.
应理解的是，对所述目标视频实例化的同时进行初始化，获取所述目标视频总帧数并打印，定义一个变量，用来存放存储每一帧图像，循环标志位，定义当前帧，读取所述目标视频每一帧，字符串流，将长整型long类型的转换成字符型传给对象str，设置每10帧获取一次帧，将帧转成图片输出，结束条件，当前帧数大于总帧数时候时，循环停止，输出的图片即为所述用户图片。It should be understood that the target video is instantiated and initialized at the same time; the total number of frames of the target video is obtained and printed; a variable is defined to store each frame image; a loop flag and the current frame are defined; each frame of the target video is read; a string stream converts the long-integer frame index into a character string passed to the object str; a frame is grabbed every 10 frames and converted into a picture for output; the end condition is that when the current frame number is greater than the total number of frames the loop stops, and the output pictures are the user pictures.
步骤S50:根据所述用户图片和所述目标信息生成所述用户的目标业务文档。Step S50: Generate a target business document of the user according to the user picture and the target information.
需要说明的是，所述用户图片可作为所述用户的身份验证信息，还可对所述音频进行用户的声纹提取，将提取的声纹作用用户的身份标识，并根据声纹进行身份验证。所述目标信息为从用户朗读的文本中提取的关于用户的相关信息，将所述用户图片和所述目标信息结合生成一个资料文档，即为所述目标业务文档，则所述目标业务文档包括用户身份验证信息和所述目标业务类型需要的各种用户信息。It should be noted that the user picture can be used as the user's identity verification information; the user's voiceprint can also be extracted from the audio, the extracted voiceprint used as the user's identity, and identity verification performed based on the voiceprint. The target information is the user-related information extracted from the text read aloud by the user; the user picture and the target information are combined to generate a data document, namely the target business document, which then includes the user identity verification information and the various user information required by the target business type.
本实施例中，通过接收用户输入的目标业务类型，根据所述目标业务类型查找对应的目标业务文案，将所述目标业务文案进行展示，拍摄所述用户朗读所述目标业务文案的目标视频，通过音视频分离器对所述目标视频进行音频分离，获得目标音频信息，通过语音朗读减少手动输入的繁琐步骤；对所述目标音频信息进行文字识别，获得目标信息，对所述目标视频进行抽帧处理，获得用户图片，以对用户身份实现验证；根据所述用户图片和所述目标信息生成所述用户的目标业务文档，基于人工智能，通过解析视频获得多方面的数据，验证用户身份的同时提升用户的信息录入效率。In this embodiment, the target business type input by the user is received, the corresponding target business copy is looked up according to the target business type and displayed, and a target video of the user reading the target business copy aloud is shot; audio separation is performed on the target video through an audio-video separator to obtain target audio information, reducing the tedious steps of manual input through voice reading; text recognition is performed on the target audio information to obtain target information, and frame extraction is performed on the target video to obtain user pictures for verifying the user's identity; the user's target business document is generated according to the user picture and the target information. Based on artificial intelligence, various data are obtained by parsing the video, verifying the user's identity while improving the efficiency of user information entry.
参照图3,图3为本申请视频音频识别方法第二实施例的流程示意图,基于上述图2所示的第一实施例,提出本申请视频音频识别方法的第二实施例。Referring to FIG. 3, FIG. 3 is a schematic flowchart of the second embodiment of the video and audio recognition method of the present application. Based on the first embodiment shown in FIG. 2 above, the second embodiment of the video and audio recognition method of the present application is proposed.
在第二实施例中,所述步骤S30,包括:In the second embodiment, the step S30 includes:
步骤S301:对所述目标音频信息进行文字识别,获得对应的文本信息。Step S301: Perform text recognition on the target audio information to obtain corresponding text information.
应理解的是，对所述目标音频信息进行文字识别，首先，将所述目标音频信息中首尾端的静音切除，再对静音切除后的第一音频信息进行分帧，若干帧语音对应一个状态，看某帧对应哪个状态的概率最大，那这帧就属于哪个状态，构建一个状态网络，从状态网络中寻找与声音最匹配的路径，语音识别过程其实就是在状态网络中搜索一条最佳路径，每三个状态组合成一个音素，若干个音素组合成一个单词，从而获得所述目标音频信息对应的所述文本信息。It should be understood that, to perform text recognition on the target audio information, the silence at the beginning and end of the target audio information is first trimmed, and the first audio information after silence trimming is divided into frames. Several frames of speech correspond to one state: whichever state a frame corresponds to with the greatest probability is the state the frame belongs to. A state network is constructed, and the path that best matches the sound is sought in it; the speech recognition process is in fact a search for an optimal path through the state network. Every three states combine into one phoneme, and several phonemes combine into one word, thereby obtaining the text information corresponding to the target audio information.
步骤S302:将所述文本信息与所述目标业务文案进行比对,获得所述文本信息的正确率。Step S302: Compare the text information with the target business copy to obtain the correct rate of the text information.
可理解的是，所述文本信息为用户朗读所述目标业务文案所形成的文本，为了判断所述用户是否朗读了正确的业务文案，以及是否正确进行所述目标业务文案的朗读，可对所述文本信息中的固定内容进行提取，将提取的内容与所述目标业务文案进行对比，可将提取的内容与所述目标业务文案之间的相似度作为所述文本信息的正确率。It is understandable that the text information is the text formed by the user reading the target business copy aloud. In order to determine whether the user read the correct business copy and whether the target business copy was read aloud correctly, the fixed content in the text information can be extracted and compared with the target business copy, and the similarity between the extracted content and the target business copy can be taken as the correct rate of the text information.
步骤S303:在所述正确率大于预设正确率阈值时,通过正则表达式对所述文本进行信息提取,获得目标信息。Step S303: When the correctness rate is greater than a preset correctness rate threshold, information extraction is performed on the text through regular expressions to obtain target information.
It should be noted that the preset accuracy threshold may be set according to an empirical value, for example 80%. When the accuracy rate is greater than the preset threshold, the two contents are similar, that is, the accuracy of the text information is considered to meet the requirement, and the text information may be analyzed further.
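As a minimal sketch of this comparison, the similarity between the recognized text and the copy template can be computed with Python's difflib; the 80% threshold matches the empirical value mentioned above, and the sample strings are invented for illustration.

```python
from difflib import SequenceMatcher

def text_accuracy(recognized, copy_template):
    """Similarity ratio in [0, 1] between the recognized text and the
    business-copy template, used here as the accuracy rate."""
    return SequenceMatcher(None, recognized, copy_template).ratio()

# Hypothetical template and recognized reading.
template = "My name is ____ and my phone number is ____."
spoken = "My name is Alice and my phone number is 13800001111."
rate = text_accuracy(spoken, template)
proceed = rate > 0.8  # extract information only above the preset threshold
```

In practice a real system would compare only the fixed (non-blank) portions of the copy, as the text above describes.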
In a specific implementation, the requirement of extracting a character string at a specific position may be met through regular expressions. For extraction at a single position, the regular expression (.+?) may be used. For example, for the string "a123b", to extract the value 123 between a and b, findall may be used with the regular expression, which returns a list of all matches; digits such as the user's phone number and identity card number may be extracted from the corresponding positions in this way. For the string "a123b456b", matching all values between a and the last b, rather than between a and the first b, is controlled through the ? modifier, which switches between greedy and non-greedy matching: with non-greedy matching, only the content up to the nearest b is output. For extracting strings at multiple consecutive positions, the regular expression (?P<name>...) may be used. For example, given a line of web-server access log: '192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 "http://abc.com/search" "Mozilla/5.0"', all fields of the line may be extracted by writing multiple (?P<name>expr) groups, where name is a variable naming the string at that position and expr is the regular expression for that position. In this way, the content filled in by the user while reading aloud is extracted from the text information to obtain the target information.
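The patterns described above can be exercised directly with Python's re module; the field names in the named-group pattern are arbitrary labels chosen for this sketch.

```python
import re

# Non-greedy single-position extraction: (.+?) stops at the nearest b.
nearest = re.findall(r"a(.+?)b", "a123b456b")    # ['123']
# Greedy extraction: (.+) runs to the last b.
furthest = re.findall(r"a(.+)b", "a123b456b")    # ['123b456']

# Named groups (?P<name>expr) for several consecutive positions of the log line.
log = ('192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" '
       '200 44 "http://abc.com/search" "Mozilla/5.0"')
pattern = re.compile(r'(?P<ip>\S+) (?P<time>\S+) "(?P<request>[^"]+)" '
                     r'(?P<status>\d+) (?P<size>\d+)')
fields = pattern.match(log).groupdict()
```

The same named-group technique applies to extracting the user-filled blanks (phone number, ID number) from the recognized text.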
Further, in this embodiment, after step S303, the method further includes:
judging whether the target information satisfies a preset rule;
if not, issuing a prompt so that the user re-reads the target business copy;
if so, executing step S40.
It should be understood that the template data to be filled in according to the target business copy is analyzed in advance, and a corresponding rule is set for each item of information to be filled in. For example, if a phone number is an 11-digit number, the preset rule corresponding to the phone number in the target business copy is to judge whether the phone number is an 11-digit number. If the preset rule is satisfied, the phone number in the target information may be considered correct; if not, the phone number in the target information may be considered to have been read incorrectly, and a voice prompt may be issued, for example, prompting that the phone number should be 11 digits and that the content just read has an incorrect number of digits or one digit too many. A text prompt may also be used, for example, marking the wrong content in the text information in red and adding a text annotation beside it indicating the error. The input box may support correction, so that the user can modify the text information.
It should be understood that corresponding preset rules may also be set in advance for regions, for example by entering the geographic location information of a map in advance. When the content read from the target business copy is address information, it is judged whether the address information in the text information belongs to the pre-entered geographic location information; if it does, the address information read aloud is considered correct, and if it does not, the address information read aloud is considered wrong.
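The two preset rules just described (an 11-digit phone number, and an address drawn from pre-entered locations) can be sketched as follows; the rule table and the location set are illustrative, not data from the application.

```python
import re

# Pre-entered geographic locations (illustrative stand-in for map data).
PRESET_LOCATIONS = {"Beijing", "Shanghai", "Shenzhen"}

def phone_rule(value):
    """Preset rule: the phone number must be an 11-digit number."""
    return re.fullmatch(r"\d{11}", value) is not None

def address_rule(value):
    """Preset rule: the address must belong to the pre-entered locations."""
    return value in PRESET_LOCATIONS

def violated_fields(target_info):
    """Names of the fields whose preset rule is not satisfied."""
    rules = {"phone": phone_rule, "address": address_rule}
    return [name for name, rule in rules.items()
            if not rule(target_info.get(name, ""))]

# A 10-digit phone number violates the phone rule and would trigger a prompt.
errors = violated_fields({"phone": "1380000111", "address": "Shenzhen"})
```

An empty result would correspond to the "if satisfied, execute step S40" branch above.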
In this embodiment, the text information recognized from speech is compared with the target business copy, and when the accuracy rate is greater than a preset accuracy threshold, information is extracted from the text through regular expressions to obtain the target information, thereby improving the accuracy of information entry.
Referring to FIG. 4, FIG. 4 is a schematic flowchart of a third embodiment of the video and audio recognition method of this application. Based on the above first or second embodiment, a third embodiment of the video and audio recognition method of this application is proposed. This embodiment is described based on the first embodiment.
In the third embodiment, before step S40, the method further includes:
performing face recognition on the target video, and performing liveness detection on the recognized face;
when the liveness detection succeeds, executing step S40.
It should be understood that face recognition on the target video follows the same principle as finding a person in a picture: a video is a collection of pictures, so in essence a person is still found in pictures, and a rectangular frame is drawn around the found person and the recognized face to realize face recognition. Face detection is responsible for locating the position of the face, and face alignment aligns it; the algorithm uses an affine transformation to align the face according to the eye coordinates. A Visual Geometry Group Network (VGG) model is used for feature extraction: the get_feature_new function opens a picture and extracts features with the VGG network, and the compare_pic function computes the similarity between the two input features, where the key point is the selection of the threshold. The face_recog_test function reads the test pictures and calculates the best parameters for each group of pictures. The aligned face picture is saved for use in subsequent facial feature comparison. Seeta Face Engine or face alignment may be used for face recognition, the facial features are obtained from the input picture, and OpenCV's cv2.CascadeClassifier may be used for face detection.
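The source states only that compare_pic computes a similarity between two extracted feature vectors, without naming the metric. A minimal stand-in using cosine similarity (one common choice, assumed here, not confirmed by the application) is:

```python
import math

def compare_pic(feat_a, feat_b):
    """Cosine similarity between two face feature vectors.

    Cosine similarity is an assumption for this sketch; the application
    does not specify which similarity measure compare_pic uses."""
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The "key point" mentioned above is then choosing the decision threshold on this score.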
Further, performing face recognition on the target video and performing liveness detection on the recognized face includes:
performing face recognition on the target video, and cropping the eye region of the recognized face to obtain an eye region image;
recognizing, through a preset blink model, whether the eye region image contains a blinking action;
if a blinking action is recognized in the eye region image, determining that the liveness detection succeeds.
It should be understood that liveness detection is performed on the recognized face: whether the detected face moves, or whether it blinks, is used to judge whether it is a real person rather than a photo. First, face detection and eye localization are performed; then the eye region is cropped, and the degree of eye opening is calculated from the normalized image; a model for judging blinking actions is built based on a convolutional neural network, and the model recognizes whether a blinking action occurs in the image. A convolutional neural network model to be trained may be built in advance: a large number of sample images are obtained, the eye regions of the faces in the sample images are cropped to obtain sample eye images, and the sample blink information corresponding to each sample image is obtained; the convolutional neural network model to be trained is trained with the sample eye images and the corresponding sample blink information to obtain the preset blink model. The eye region image may then be recognized through the preset blink model; if a blinking action is recognized in the eye region image, the target video is considered to contain a real person, and the liveness detection is determined to be successful.
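The application uses a trained CNN as the blink model. As a simpler illustration of "computing the degree of eye opening and detecting an open-closed-open transition", the sketch below uses the common eye-aspect-ratio formulation over six eye landmarks; the landmark layout and the 0.2 threshold are assumptions for this sketch, not values from the source.

```python
import math

def eye_aspect_ratio(eye):
    """eye: six (x, y) landmarks around one eye, ordered as in the usual
    eye-aspect-ratio formulation; a small ratio means the eye is closed."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    vertical = dist(eye[1], eye[5]) + dist(eye[2], eye[4])
    horizontal = dist(eye[0], eye[3])
    return vertical / (2.0 * horizontal)

def blink_detected(ear_sequence, closed_thresh=0.2):
    """A blink is an open -> closed -> open transition across frames."""
    closed = [e < closed_thresh for e in ear_sequence]
    return any(not a and b and not c
               for a, b, c in zip(closed, closed[1:], closed[2:]))
```

A CNN-based model as described in the text would replace the fixed threshold with a learned classifier over the cropped eye images.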
In this embodiment, after step S40, the method further includes:
Step S401: Preprocess the user picture to obtain a preprocessed picture.
It should be understood that frame extraction on the target video usually yields multiple user pictures, which need to be processed further to obtain a user picture of better quality as the user's identity verification information. The user picture may be preprocessed in advance. The purpose of image preprocessing is to eliminate irrelevant information in the image and to remove or reduce, as far as possible, the interference of lighting, the imaging system, or the external environment, so that the image's features are clearly expressed. The preprocessing process includes steps such as light compensation, grayscale transformation, histogram equalization, normalization, geometric correction, filtering, and sharpening of the face image, so as to obtain the preprocessed picture.
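Two of the preprocessing steps named above, grayscale transformation and normalization, can be sketched in pure Python; the luma weights are the common ITU-R BT.601 coefficients, and the remaining steps (light compensation, equalization, etc.) are omitted from the sketch.

```python
def to_gray(pixel):
    """Grayscale value of an (R, G, B) pixel using BT.601 luma weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def normalize(gray_image):
    """Min-max normalization of a 2-D grayscale image to the range [0, 1]."""
    flat = [v for row in gray_image for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in gray_image]

rgb = [[(255, 255, 255), (0, 0, 0)]]            # a 1x2 toy image
norm = normalize([[to_gray(p) for p in row] for row in rgb])
```

A production pipeline would of course operate on real image arrays (e.g. via OpenCV) rather than nested lists.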
Step S402: Screen the preprocessed pictures according to sharpness to obtain a screened picture.
It should be noted that there are usually multiple preprocessed pictures, from which the pictures with higher sharpness are selected for face recognition. Sharpness is an important indicator of image quality, and it may be evaluated with the re-blur (Reblur) algorithm: if an image is already blurred, blurring it once more changes its high-frequency components little; but if the original image is sharp, blurring it once changes the high-frequency components greatly. Therefore, a Gaussian blur may be applied once to the image under evaluation to obtain its degraded image; the changes in adjacent pixel values of the original image and the degraded image are then compared, and the sharpness value is determined from the magnitude of the change: the smaller the calculated result, the sharper the image, and conversely the blurrier. This approach may be called a sharpness algorithm based on secondary blurring. Specifically, the preprocessed picture is passed through a low-pass filter to obtain a blurred image; the change in gray value between adjacent pixels in the preprocessed picture is calculated to obtain a first pixel change value, and the change in gray value between adjacent pixels in the blurred image is calculated to obtain a second pixel change value; the first pixel change value and the second pixel change value are compared, analyzed, and normalized to obtain a sharpness result, and the preprocessed pictures are screened according to the sharpness result to obtain the screened picture.
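A one-dimensional sketch of this re-blur idea, with a 3-tap moving average standing in for the Gaussian blur (the filter choice and the toy signals are assumptions of the sketch):

```python
def box_blur(signal):
    """Crude low-pass filter: 3-tap moving average with edge clamping
    (a stand-in for the Gaussian blur of the re-blur method)."""
    n = len(signal)
    return [(signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]

def total_variation(signal):
    """Sum of absolute changes between adjacent values
    (the 'pixel change value' of the text)."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))

def sharpness_score(signal):
    """Ratio of the blurred signal's variation to the original's:
    smaller means sharper, matching the description above."""
    tv = total_variation(signal)
    return total_variation(box_blur(signal)) / tv if tv else 1.0

sharp = [0, 1, 0, 1, 0, 1]               # high-frequency: blurring changes it a lot
smooth = [0, 0.2, 0.4, 0.6, 0.8, 1.0]    # already smooth: blurring changes little
```

Ranking the preprocessed pictures by this score and keeping the lowest-scoring ones corresponds to the screening in step S402.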
Step S403: Compare the screened picture with a preset picture to obtain a comparison result.
In a specific implementation, facial feature point localization is performed on the screened picture to obtain the face feature points to be processed corresponding to the screened picture; the face feature points to be processed are compared with preset frontal-face feature points to obtain a homography matrix; the face in the photo is transformed through the homography matrix to obtain a calibrated face picture. The preset picture is a picture of the user in the public security system; the calibrated face picture is compared with the features of each photo in the public security system library through a convolutional neural network model to obtain the face similarity between the screened picture and each preset picture, and the face similarity is taken as the comparison result.
Correspondingly, step S50 includes:
Step S501: When the comparison result exceeds a preset similarity threshold, generate a target business document of the user according to the screened picture and the target information.
It should be understood that the face similarity is taken as the comparison result; if the face similarity exceeds the preset similarity threshold, the identity of the user is considered verified, and a business profile may then be created for the user. The preset similarity threshold may be set according to an empirical value, for example 40%. The facial features are compared and the face similarity is calculated; if the preset similarity threshold is set to 0.4, a similarity greater than 40% is considered to indicate the same person, in which case the target business document of the user may be generated according to the screened picture and the target information.
In this embodiment, step S20 includes:
shooting a target video of the user reading the target business copy aloud while playing target music;
performing audio separation on the target video through an audio-video separator to obtain mixed audio information;
extracting, through a computational auditory scene analysis algorithm, the target audio information of the user reading the target business copy from the mixed audio information.
It should be understood that, to ensure the security of personal information, the target music may be played while the user reads the target business copy aloud; the target music creates a noisy voice environment and prevents the user's personal information from being overheard by others. The target video captured at this time includes both the target music and the audio of the user reading the target business copy. Audio separation is performed on the target video through an audio-video separator to obtain mixed audio information, and a Computational Auditory Scene Analysis (CASA) algorithm is further needed to simulate the human auditory system and extract the speech read by the user from the noisy environment. The audio information is encoded to achieve grouping and parsing. There are currently dozens of grouping criteria related to time and frequency, including pitch, spatial position, and onset/offset time. Pitch is a very important grouping criterion; it identifies the unique characteristics of a sound according to its harmonic pattern. When two or more microphones are used, the sound isolation system can determine the direction and distance of each sound source from the spatial position information. The CASA modeling approach enables the sound isolation system to focus on a particular sound source, such as a specific person, and to mask out background sound. Onset/offset grouping refers to the moments at which a sound component starts and stops; when these data are combined with the original frequency data, it can be judged whether the components come from the same sound source. A series of noises can thus be masked out to focus on identifying a particular sound source. Sounds with similar attributes form the same audio stream, and likewise sounds with different attributes form their own audio streams; these different audio streams may be used to identify continuous or repetitive sound sources. With enough sound groupings, the actual sound isolation process can match against the identified sound sources and respond to the real speaker's voice, thereby separating out the target audio information of the user reading the target business copy.
In this embodiment, the user pictures are processed to obtain screened pictures of better quality, and the screened pictures are compared with preset pictures in the public security system to verify the identity of the user, improving the security and reliability of information entry.
In addition, an embodiment of this application further proposes a storage medium, which may be non-volatile or volatile. A video and audio recognition program is stored on the storage medium, and when executed by a processor, the video and audio recognition program implements the steps of the video and audio recognition method described above.
In addition, referring to FIG. 5, an embodiment of this application further proposes a video and audio recognition apparatus, which includes:
a searching module 10, configured to receive a target business type input by a user, search for a corresponding target business copy according to the target business type, and display the target business copy.
It should be understood that, in a web page or an app, various business types may be presented through options, and the user selects the target business type to be handled. When the target business type input by the user is received, the target business copy corresponding to the target business type is searched for in a preset mapping relationship table, which records the correspondence between business types and business copies. The target business type includes businesses such as loans, leasing, or insurance, and the target business copy is the user-related information that needs to be collected for each business type. For example, every business type needs to collect the user's basic personal information, such as a piece of personal-information copy: "I am xxx, my ID number is xxxxx, I am from the xxx region", and so on. Different business types also need to collect information specific to that type; for example, a loan business further collects information such as whether a loan is being repaid, whether the user owns real estate or a car, and the annual income. The corresponding business copy may be built in advance according to the business type, presenting the information to be collected in fill-in-the-blank form.
an audio separation module 20, configured to shoot a target video of the user reading the target business copy aloud, and perform audio separation on the target video through an audio-video separator to obtain target audio information.
It should be noted that, when reading the target business copy aloud, the user may fill in the blanks with his or her own information while reading. The process of the user reading the target business copy is filmed, and the video recording may be performed through the camera function of the video and audio recognition device, such as the recording function of a smartphone. There is a camera button in the web page or app: when the target business copy is displayed in the web page or app, a camera button is set above or below the business copy, and by tapping this button the user films himself or herself reading the target business copy aloud, obtaining the target video.
It should be understood that audio separation usually takes the sound and the images of a video out separately. The steps of separating the audio are: setting the audio source; obtaining the number of tracks in the source file and traversing them to find the required audio track; and extracting the found audio track to obtain the target audio information.
a text recognition module 30, configured to perform text recognition on the target audio information to obtain target information.
In a specific implementation, the silence at the beginning and end of the target audio information is removed to reduce interference with subsequent steps. The first audio information obtained after silence removal is divided into frames, that is, cut into small segments, each called a frame; the framing operation is generally implemented with a moving window function. After framing, the first audio information becomes many small segments. The waveform is then transformed to extract Mel-scale Frequency Cepstral Coefficients (MFCC) features, turning each frame of the waveform into a multi-dimensional vector. Next, frames are recognized as states, states are combined into phonemes, and phonemes are combined into words. Several frames of speech correspond to one state, every three states are combined into one phoneme, and several phonemes are combined into one word, so as to obtain the corresponding text information; the content filled in by the user may be extracted from the text information as the target information.
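The frame-to-state-to-phoneme-to-word combination described above can be illustrated with a toy decoder; the state, phoneme, and word inventories below are entirely made up for the sketch, and a real recognizer searches a state network rather than using lookup tables.

```python
# Made-up inventories: every three states name one phoneme,
# and a phoneme sequence names one word.
PHONEME_OF = {("s1", "s2", "s3"): "n", ("s4", "s5", "s6"): "i"}
WORD_OF = {("n", "i"): "ni"}

def decode(states):
    """Combine states into phonemes (three at a time), then into a word."""
    triples = [tuple(states[i:i + 3]) for i in range(0, len(states), 3)]
    phonemes = tuple(PHONEME_OF[t] for t in triples)
    return WORD_OF[phonemes]

word = decode(["s1", "s2", "s3", "s4", "s5", "s6"])
```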
a frame extraction module 40, configured to perform frame extraction on the target video to obtain a user picture.
It should be understood that the target video is instantiated and initialized at the same time; the total number of frames of the target video is obtained and printed; a variable is defined to store each frame of the image, together with a loop flag and the current frame; each frame of the target video is read; a string stream converts the long-integer frame index into a character string; a frame is captured every 10 frames and converted into a picture for output; and, as the end condition, when the current frame number is greater than the total number of frames, the loop stops, and the output pictures are the user pictures.
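The keep-one-frame-every-ten loop described above reduces to the following sampling logic; the frames are represented here by placeholder strings rather than decoded video frames.

```python
def extract_every_nth(frames, step=10):
    """Read frames one by one and keep one every `step` frames,
    mirroring the loop described in the text."""
    kept = []
    for current, frame in enumerate(frames):
        if current % step == 0:
            kept.append(frame)
    return kept

# A hypothetical 25-frame video keeps frames 0, 10, and 20.
pictures = extract_every_nth([f"frame{i}" for i in range(25)])
```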
a generating module 50, configured to generate a target business document of the user according to the user picture and the target information.
It should be noted that the user picture may serve as the user's identity verification information; the user's voiceprint may also be extracted from the audio, the extracted voiceprint may serve as the user's identity, and identity verification may be performed according to the voiceprint. The target information is the information about the user extracted from the text the user read aloud; the user picture and the target information are combined to generate a data document, namely the target business document, which thus includes the user's identity verification information and the various items of user information required by the target business type.
In this embodiment, a target business type input by a user is received, a corresponding target business copy is searched for according to the target business type, and the target business copy is displayed; a target video of the user reading the target business copy aloud is shot, and audio separation is performed on the target video through an audio-video separator to obtain target audio information, reducing the tedious steps of manual input through reading aloud; text recognition is performed on the target audio information to obtain target information, and frame extraction is performed on the target video to obtain a user picture, so as to verify the user's identity; and a target business document of the user is generated according to the user picture and the target information. Based on artificial intelligence, multiple kinds of data are obtained by parsing the video, verifying the user's identity while improving the efficiency of the user's information entry.
In an embodiment, the text recognition module 30 is further configured to perform text recognition on the target audio information to obtain corresponding text information; compare the text information with the target business copy to obtain the accuracy rate of the text information; and, when the accuracy rate is greater than a preset accuracy threshold, extract information from the text through regular expressions to obtain the target information.
In an embodiment, the video and audio recognition apparatus further includes:
a judging module, configured to judge whether the target information satisfies a preset rule;
a prompting module, configured to issue a prompt if the rule is not satisfied, so that the user re-reads the target business copy;
the frame extraction module 40 is further configured to, if the rule is satisfied, execute the step of performing frame extraction on the target video to obtain a user picture.
In an embodiment, the video and audio recognition apparatus further includes:
a liveness detection module, configured to perform face recognition on the target video and perform liveness detection on the recognized face;
the frame extraction module 40 is further configured to, when the liveness detection succeeds, execute the step of performing frame extraction on the target video to obtain a user picture.
In an embodiment, the liveness detection module is further configured to perform face recognition on the target video, crop the eye region of the recognized face to obtain an eye region image, recognize through a preset blink model whether the eye region image contains a blinking action, and, if a blinking action is recognized in the eye region image, determine that the liveness detection succeeds.
In an embodiment, the video and audio recognition apparatus further includes:
a preprocessing module, configured to preprocess the user picture to obtain a preprocessed picture;
a screening module, configured to screen the preprocessed pictures according to sharpness to obtain a screened picture;
a comparison module, configured to compare the screened picture with a preset picture to obtain a comparison result;
the generating module 50 is further configured to, when the comparison result exceeds a preset similarity threshold, generate a target business document of the user according to the screened picture and the target information.
在一实施例中，所述音频分离模块20，还用于播放目标音乐的同时，拍摄所述用户朗读所述目标业务文案的目标视频；In an embodiment, the audio separation module 20 is further configured to shoot the target video of the user reading the target business copy aloud while the target music is playing;
通过音视频分离器对所述目标视频进行音频分离,获得混合音频信息;Performing audio separation on the target video by an audio and video separator to obtain mixed audio information;
通过计算听觉场景分析算法从所述混合音频信息中提取所述用户朗读所述目标业务文案的目标音频信息。Extracting the target audio information of the user reading the target business copy from the mixed audio information through a computational auditory scene analysis algorithm.
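A full computational auditory scene analysis algorithm is beyond what can be shown here, but one property this embodiment can exploit is that the "target music" is played by the device itself, so its waveform is known. As a simplified stand-in for the claimed algorithm (not CASA itself), the known reference can be projected out of the recorded mixture by least squares:

```python
def estimate_gain(mixture, reference):
    """Least-squares gain of the known reference (the played music) inside the
    recorded mixture; valid when the speech is uncorrelated with the music."""
    num = sum(m * r for m, r in zip(mixture, reference))
    den = sum(r * r for r in reference)
    return num / den

def subtract_reference(mixture, reference, gain):
    """Subtract the scaled reference from the mixture, leaving a speech
    estimate. Signals are equal-length, time-aligned lists of float samples."""
    return [m - gain * r for m, r in zip(mixture, reference)]
```

Real recordings would additionally need time alignment and frequency-domain processing; this sketch only illustrates why playing a known music track makes the separation tractable.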
本申请所述视频音频识别装置的其他实施例或具体实现方式可参照上述各方法实施例,此处不再赘述。For other embodiments or specific implementations of the video and audio recognition device described in the present application, reference may be made to the foregoing method embodiments, and details are not described herein again.
需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。It should be noted that, herein, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system including a series of elements not only includes those elements but also includes other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or system that includes the element.
上述本申请实施例序号仅仅为了描述，不代表实施例的优劣。在列举了若干装置的单元权利要求中，这些装置中的若干个可以是通过同一个硬件项来具体体现。词语第一、第二、以及第三等的使用不表示任何顺序，可将这些词语解释为标识。The serial numbers of the foregoing embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments. In a unit claim listing several devices, several of these devices may be embodied by the same hardware item. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as labels.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质（如只读存储器镜像（Read Only Memory image，ROM）/随机存取存储器（Random Access Memory，RAM）、磁碟、光盘）中，包括若干指令用以使得一台终端设备（可以是手机，计算机，服务器，空调器，或者网络设备等）执行本申请各个实施例所述的方法。Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a read-only memory image (ROM)/random access memory (RAM), magnetic disk, or optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to execute the methods described in the embodiments of this application.
以上仅为本申请的优选实施例，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。The above are only preferred embodiments of this application and do not thereby limit its patent scope; any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (20)

  1. 一种视频音频识别方法,其中,所述视频音频识别方法包括以下步骤:A video and audio recognition method, wherein the video and audio recognition method includes the following steps:
    接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;Receiving the target business type input by the user, searching for a corresponding target business copy based on the target business type, and displaying the target business copy;
    拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;Shooting the target video in which the user reads the target business copy, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
    对所述目标音频信息进行文字识别,获得目标信息;Perform text recognition on the target audio information to obtain target information;
    对所述目标视频进行抽帧处理,获得用户图片;Performing frame extraction processing on the target video to obtain a user picture;
    根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The target business document of the user is generated according to the user picture and the target information.
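At its core, the frame-extraction (抽帧) step of claim 1 samples the decoded video at a fixed interval. A minimal sketch, assuming the frames have already been decoded into a sequence (the interval of 25 is an illustrative assumption, e.g. one frame per second of 25 fps video):

```python
def extract_frames(frames, interval=25):
    """Keep every `interval`-th frame of the decoded target video as a
    candidate user picture."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return [frame for i, frame in enumerate(frames) if i % interval == 0]
```

In a real system the frames would be decoded from the recorded target video by a video library before sampling.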
  2. 如权利要求1所述的视频音频识别方法，其中，所述对所述目标音频信息进行文字识别，获得目标信息，包括：2. The video and audio recognition method according to claim 1, wherein said performing text recognition on said target audio information to obtain target information comprises:
    对所述目标音频信息进行文字识别,获得对应的文本信息;Perform text recognition on the target audio information to obtain corresponding text information;
    将所述文本信息与所述目标业务文案进行比对,获得所述文本信息的正确率;Comparing the text information with the target business copy to obtain the correct rate of the text information;
    在所述正确率大于预设正确率阈值时,通过正则表达式对所述文本进行信息提取,获得目标信息。When the correct rate is greater than a preset correct rate threshold, information extraction is performed on the text through regular expressions to obtain target information.
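Claim 2's two steps, a correct rate measured against the business copy followed by regular-expression extraction, can be sketched as follows. The similarity measure and the field patterns are illustrative assumptions, not the claimed implementation:

```python
import re
from difflib import SequenceMatcher

def correct_rate(recognized, script):
    """Similarity in [0, 1] between the speech-recognition transcript and the
    target business copy the user was asked to read aloud."""
    return SequenceMatcher(None, recognized, script).ratio()

def extract_target_info(text, patterns):
    """Pull named fields out of the transcript with regular expressions.
    `patterns` maps a field name to a regex with one capture group."""
    info = {}
    for name, pattern in patterns.items():
        m = re.search(pattern, text)
        if m:
            info[name] = m.group(1)
    return info
```

Extraction would only run once `correct_rate` exceeds the preset threshold, mirroring the claim's ordering.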
  3. 如权利要求2所述的视频音频识别方法，其中，所述在所述正确率大于预设正确率阈值时，通过正则表达式对所述文本进行信息提取，获得目标信息之后，所述视频音频识别方法包括：3. The video and audio recognition method according to claim 2, wherein after the step of, when the correct rate is greater than the preset correct-rate threshold, performing information extraction on the text through regular expressions to obtain the target information, the video and audio recognition method comprises:
    判断所述目标信息是否满足预设规则;Judging whether the target information satisfies a preset rule;
    若不满足,则进行提示,以使所述用户重新朗读所述目标业务文案;If it is not satisfied, a prompt is given to make the user re-read the target business copy;
    若满足,则执行所述对所述目标视频进行抽帧处理,获得用户图片的步骤。If it is satisfied, execute the step of performing frame extraction processing on the target video to obtain a user picture.
  4. 如权利要求1所述的视频音频识别方法，其中，所述对所述目标视频进行抽帧处理，获得用户图片之前，所述视频音频识别方法还包括：4. The video and audio recognition method according to claim 1, wherein, before the frame extraction processing is performed on the target video to obtain a user picture, the video and audio recognition method further comprises:
    对所述目标视频进行人脸识别,对识别到的人脸进行活体检测;Performing face recognition on the target video, and performing live detection on the recognized face;
    在活体检测成功时,执行所述对所述目标视频进行抽帧处理,获得用户图片的步骤。When the living body detection is successful, the step of performing frame extraction processing on the target video to obtain a user picture is performed.
  5. 如权利要求4所述的视频音频识别方法,其中,所述对所述目标视频进行人脸识别,对识别到的人脸进行活体检测,包括:5. The video and audio recognition method according to claim 4, wherein said performing face recognition on said target video and performing live detection on the recognized face comprises:
    对所述目标视频进行人脸识别,对识别到的人脸的眼部区域进行截取,获得眼部区域图像;Performing face recognition on the target video, and intercepting the eye area of the recognized face to obtain an eye area image;
    通过预设眨眼模型识别所述眼部区域图像是否有眨眼动作;Recognizing whether there is a blinking action in the eye area image by using a preset blinking model;
    若识别到所述眼部区域图像有眨眼动作,则认定活体检测成功。If it is recognized that the eye area image has a blinking action, it is determined that the living body detection is successful.
  6. 如权利要求1所述的视频音频识别方法，其中，所述对所述目标视频进行抽帧处理，获得用户图片之后，所述视频音频识别方法还包括：6. The video and audio recognition method according to claim 1, wherein, after the frame extraction processing is performed on the target video to obtain a user picture, the video and audio recognition method further comprises:
    对所述用户图片进行预处理,获得预处理图片;Preprocessing the user picture to obtain a preprocessed picture;
    根据清晰度对所述预处理图片进行筛选,获得筛选图片;Filter the pre-processed pictures according to the definition to obtain the filtered pictures;
    将所述筛选图片与预设图片进行对比,获得比对结果;Comparing the screened picture with a preset picture to obtain a comparison result;
    相应地,所述根据所述用户图片和所述目标信息生成所述用户的目标业务文档,包括:Correspondingly, the generating the target business document of the user according to the user picture and the target information includes:
    在所述对比结果超过预设相似度阈值时,根据所述筛选图片和所述目标信息生成所述用户的目标业务文档。When the comparison result exceeds a preset similarity threshold, the user's target business document is generated according to the screened picture and the target information.
  7. 如权利要求1-6中任一项所述的视频音频识别方法，其中，所述拍摄所述用户朗读所述目标业务文案的目标视频，通过音视频分离器对所述目标视频进行音频分离，获得目标音频信息，包括：7. The video and audio recognition method according to any one of claims 1-6, wherein said shooting the target video in which the user reads the target business copy aloud and performing audio separation on the target video through an audio-video separator to obtain the target audio information comprises:
    播放目标音乐的同时,拍摄所述用户朗读所述目标业务文案的目标视频;While playing the target music, shoot the target video of the user reading the target business copy;
    通过音视频分离器对所述目标视频进行音频分离,获得混合音频信息;Performing audio separation on the target video by an audio and video separator to obtain mixed audio information;
    通过计算听觉场景分析算法从所述混合音频信息中提取所述用户朗读所述目标业务文案的目标音频信息。Extracting the target audio information of the user reading the target business copy from the mixed audio information through a computational auditory scene analysis algorithm.
  8. 一种视频音频识别设备，其中，所述视频音频识别设备包括：存储器、处理器及存储在所述存储器上并可在所述处理器上运行的视频音频识别程序，所述视频音频识别程序被所述处理器执行时实现以下步骤：8. A video and audio recognition device, wherein the video and audio recognition device includes a memory, a processor, and a video and audio recognition program stored on the memory and executable on the processor, and the video and audio recognition program, when executed by the processor, implements the following steps:
    接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;Receiving the target business type input by the user, searching for a corresponding target business copy based on the target business type, and displaying the target business copy;
    拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;Shooting the target video in which the user reads the target business copy, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
    对所述目标音频信息进行文字识别,获得目标信息;Perform text recognition on the target audio information to obtain target information;
    对所述目标视频进行抽帧处理,获得用户图片;Performing frame extraction processing on the target video to obtain a user picture;
    根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The target business document of the user is generated according to the user picture and the target information.
  9. 如权利要求8所述的视频音频识别设备，其中，所述对所述目标音频信息进行文字识别，获得目标信息，包括：9. The video and audio recognition device according to claim 8, wherein said performing text recognition on said target audio information to obtain target information comprises:
    对所述目标音频信息进行文字识别,获得对应的文本信息;Perform text recognition on the target audio information to obtain corresponding text information;
    将所述文本信息与所述目标业务文案进行比对,获得所述文本信息的正确率;Comparing the text information with the target business copy to obtain the correct rate of the text information;
    在所述正确率大于预设正确率阈值时,通过正则表达式对所述文本进行信息提取,获得目标信息。When the correct rate is greater than a preset correct rate threshold, information extraction is performed on the text through regular expressions to obtain target information.
  10. 如权利要求9所述的视频音频识别设备，其中，所述在所述正确率大于预设正确率阈值时，通过正则表达式对所述文本进行信息提取，获得目标信息之后，所述视频音频识别方法包括：10. The video and audio recognition device according to claim 9, wherein after the step of, when the correct rate is greater than the preset correct-rate threshold, performing information extraction on the text through regular expressions to obtain the target information, the video and audio recognition method comprises:
    判断所述目标信息是否满足预设规则;Judging whether the target information satisfies a preset rule;
    若不满足,则进行提示,以使所述用户重新朗读所述目标业务文案;If it is not satisfied, a prompt is given to make the user re-read the target business copy;
    若满足,则执行所述对所述目标视频进行抽帧处理,获得用户图片的步骤。If it is satisfied, execute the step of performing frame extraction processing on the target video to obtain a user picture.
  11. 如权利要求8所述的视频音频识别设备，其中，所述对所述目标视频进行抽帧处理，获得用户图片之前，所述视频音频识别方法还包括：11. The video and audio recognition device according to claim 8, wherein, before the frame extraction processing is performed on the target video to obtain a user picture, the video and audio recognition method further comprises:
    对所述目标视频进行人脸识别,对识别到的人脸进行活体检测;Performing face recognition on the target video, and performing live detection on the recognized face;
    在活体检测成功时,执行所述对所述目标视频进行抽帧处理,获得用户图片的步骤。When the living body detection is successful, the step of performing frame extraction processing on the target video to obtain a user picture is performed.
  12. 如权利要求11所述的视频音频识别设备，其中，所述对所述目标视频进行人脸识别，对识别到的人脸进行活体检测，包括：12. The video and audio recognition device according to claim 11, wherein said performing face recognition on said target video and performing living body detection on the recognized face comprises:
    对所述目标视频进行人脸识别,对识别到的人脸的眼部区域进行截取,获得眼部区域图像;Performing face recognition on the target video, and intercepting the eye area of the recognized face to obtain an eye area image;
    通过预设眨眼模型识别所述眼部区域图像是否有眨眼动作;Recognizing whether there is a blinking action in the eye area image by using a preset blinking model;
    若识别到所述眼部区域图像有眨眼动作,则认定活体检测成功。If it is recognized that the eye area image has a blinking action, it is determined that the living body detection is successful.
  13. 如权利要求8所述的视频音频识别设备，其中，所述对所述目标视频进行抽帧处理，获得用户图片之后，所述视频音频识别方法还包括：13. The video and audio recognition device according to claim 8, wherein, after the frame extraction processing is performed on the target video to obtain a user picture, the video and audio recognition method further comprises:
    对所述用户图片进行预处理,获得预处理图片;Preprocessing the user picture to obtain a preprocessed picture;
    根据清晰度对所述预处理图片进行筛选,获得筛选图片;Filter the pre-processed pictures according to the definition to obtain the filtered pictures;
    将所述筛选图片与预设图片进行对比,获得比对结果;Comparing the screened picture with a preset picture to obtain a comparison result;
    相应地,所述根据所述用户图片和所述目标信息生成所述用户的目标业务文档,包括:Correspondingly, the generating the target business document of the user according to the user picture and the target information includes:
    在所述对比结果超过预设相似度阈值时,根据所述筛选图片和所述目标信息生成所述用户的目标业务文档。When the comparison result exceeds a preset similarity threshold, the user's target business document is generated according to the screened picture and the target information.
  14. 如权利要求8-13中任一项所述的视频音频识别设备，其中，所述拍摄所述用户朗读所述目标业务文案的目标视频，通过音视频分离器对所述目标视频进行音频分离，获得目标音频信息，包括：14. The video and audio recognition device according to any one of claims 8-13, wherein said shooting the target video in which the user reads the target business copy aloud and performing audio separation on the target video through an audio-video separator to obtain the target audio information comprises:
    播放目标音乐的同时,拍摄所述用户朗读所述目标业务文案的目标视频;While playing the target music, shoot the target video of the user reading the target business copy;
    通过音视频分离器对所述目标视频进行音频分离,获得混合音频信息;Performing audio separation on the target video by an audio and video separator to obtain mixed audio information;
    通过计算听觉场景分析算法从所述混合音频信息中提取所述用户朗读所述目标业务文案的目标音频信息。Extracting the target audio information of the user reading the target business copy from the mixed audio information through a computational auditory scene analysis algorithm.
  15. 一种存储介质，其中，所述存储介质上存储有视频音频识别程序，所述视频音频识别程序被处理器执行时实现以下步骤：15. A storage medium, wherein a video and audio recognition program is stored on the storage medium, and the video and audio recognition program, when executed by a processor, implements the following steps:
    接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;Receiving the target business type input by the user, searching for a corresponding target business copy based on the target business type, and displaying the target business copy;
    拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;Shooting the target video in which the user reads the target business copy, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
    对所述目标音频信息进行文字识别,获得目标信息;Perform text recognition on the target audio information to obtain target information;
    对所述目标视频进行抽帧处理,获得用户图片;Performing frame extraction processing on the target video to obtain a user picture;
    根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The target business document of the user is generated according to the user picture and the target information.
  16. 如权利要求15所述的存储介质，其中，所述对所述目标音频信息进行文字识别，获得目标信息，包括：16. The storage medium according to claim 15, wherein said performing text recognition on said target audio information to obtain target information comprises:
    对所述目标音频信息进行文字识别,获得对应的文本信息;Perform text recognition on the target audio information to obtain corresponding text information;
    将所述文本信息与所述目标业务文案进行比对,获得所述文本信息的正确率;Comparing the text information with the target business copy to obtain the correct rate of the text information;
    在所述正确率大于预设正确率阈值时,通过正则表达式对所述文本进行信息提取,获得目标信息。When the correct rate is greater than a preset correct rate threshold, information extraction is performed on the text through regular expressions to obtain target information.
  17. 如权利要求16所述的存储介质，其中，所述在所述正确率大于预设正确率阈值时，通过正则表达式对所述文本进行信息提取，获得目标信息之后，所述视频音频识别方法包括：17. The storage medium according to claim 16, wherein after the step of, when the correct rate is greater than the preset correct-rate threshold, performing information extraction on the text through regular expressions to obtain the target information, the video and audio recognition method comprises:
    判断所述目标信息是否满足预设规则;Judging whether the target information satisfies a preset rule;
    若不满足,则进行提示,以使所述用户重新朗读所述目标业务文案;If it is not satisfied, a prompt is given to make the user re-read the target business copy;
    若满足,则执行所述对所述目标视频进行抽帧处理,获得用户图片的步骤。If it is satisfied, execute the step of performing frame extraction processing on the target video to obtain a user picture.
  18. 如权利要求15所述的存储介质，其中，所述对所述目标视频进行抽帧处理，获得用户图片之前，所述视频音频识别方法还包括：18. The storage medium according to claim 15, wherein, before the frame extraction processing is performed on the target video to obtain a user picture, the video and audio recognition method further comprises:
    对所述目标视频进行人脸识别,对识别到的人脸进行活体检测;Performing face recognition on the target video, and performing live detection on the recognized face;
    在活体检测成功时,执行所述对所述目标视频进行抽帧处理,获得用户图片的步骤。When the living body detection is successful, the step of performing frame extraction processing on the target video to obtain a user picture is performed.
  19. 如权利要求18所述的存储介质，其中，所述对所述目标视频进行人脸识别，对识别到的人脸进行活体检测，包括：19. The storage medium according to claim 18, wherein said performing face recognition on said target video and performing living body detection on the recognized face comprises:
    对所述目标视频进行人脸识别,对识别到的人脸的眼部区域进行截取,获得眼部区域图像;Performing face recognition on the target video, and intercepting the eye area of the recognized face to obtain an eye area image;
    通过预设眨眼模型识别所述眼部区域图像是否有眨眼动作;Recognizing whether there is a blinking action in the eye area image by using a preset blinking model;
    若识别到所述眼部区域图像有眨眼动作,则认定活体检测成功。If it is recognized that the eye area image has a blinking action, it is determined that the living body detection is successful.
  20. 一种视频音频识别装置，其中，所述视频音频识别装置包括：20. A video and audio recognition device, wherein the video and audio recognition device includes:
    查找模块,用于接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;The search module is configured to receive the target business type input by the user, search for the corresponding target business copy based on the target business type, and display the target business copy;
    音频分离模块,用于拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;An audio separation module, configured to shoot a target video in which the user reads the target business copy, and perform audio separation on the target video through an audio-video separator to obtain target audio information;
    文字识别模块,用于对所述目标音频信息进行文字识别,获得目标信息;The text recognition module is used to perform text recognition on the target audio information to obtain target information;
    抽帧处理模块,用于对所述目标视频进行抽帧处理,获得用户图片;The frame extraction processing module is used to perform frame extraction processing on the target video to obtain user pictures;
    生成模块,用于根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The generating module is used to generate the target business document of the user according to the user picture and the target information.
PCT/CN2020/102532 2019-12-26 2020-07-17 Video and audio recognition method, apparatus and device and storage medium WO2021128817A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911374298.1 2019-12-26
CN201911374298.1A CN111191073A (en) 2019-12-26 2019-12-26 Video and audio recognition method, device, storage medium and device

Publications (1)

Publication Number Publication Date
WO2021128817A1 true WO2021128817A1 (en) 2021-07-01

Family

ID=70710065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/102532 WO2021128817A1 (en) 2019-12-26 2020-07-17 Video and audio recognition method, apparatus and device and storage medium

Country Status (2)

Country Link
CN (1) CN111191073A (en)
WO (1) WO2021128817A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191073A (en) * 2019-12-26 2020-05-22 深圳壹账通智能科技有限公司 Video and audio recognition method, device, storage medium and device
CN111814714B (en) * 2020-07-15 2024-03-29 前海人寿保险股份有限公司 Image recognition method, device, equipment and storage medium based on audio and video recording
CN112734752B (en) * 2021-01-25 2021-10-01 上海微亿智造科技有限公司 Method and system for image screening in flying shooting process
CN112911180A (en) * 2021-01-28 2021-06-04 中国建设银行股份有限公司 Video recording method and device, electronic equipment and readable storage medium
CN115250375B (en) * 2021-04-26 2024-01-26 北京中关村科金技术有限公司 Audio and video content compliance detection method and device based on fixed telephone technology
CN113822195B (en) * 2021-09-23 2023-01-24 四川云恒数联科技有限公司 Government affair platform user behavior recognition feedback method based on video analysis

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325742A (en) * 2018-09-26 2019-02-12 平安普惠企业管理有限公司 Business approval method, apparatus, computer equipment and storage medium
CN110147726A (en) * 2019-04-12 2019-08-20 财付通支付科技有限公司 Business quality detecting method and device, storage medium and electronic device
US20190313014A1 (en) * 2015-06-25 2019-10-10 Amazon Technologies, Inc. User identification based on voice and face
CN111191073A (en) * 2019-12-26 2020-05-22 深圳壹账通智能科技有限公司 Video and audio recognition method, device, storage medium and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840406B (en) * 2017-11-29 2022-05-17 百度在线网络技术(北京)有限公司 Living body verification method and device and computer equipment
CN110348378A (en) * 2019-07-10 2019-10-18 北京旷视科技有限公司 A kind of authentication method, device and storage medium


Also Published As

Publication number Publication date
CN111191073A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
WO2021128817A1 (en) Video and audio recognition method, apparatus and device and storage medium
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
Jafar et al. Forensics and analysis of deepfake videos
CN109697416B (en) Video data processing method and related device
CN109660744A (en) The double recording methods of intelligence, equipment, storage medium and device based on big data
US10970909B2 (en) Method and apparatus for eye movement synthesis
CN114465737B (en) Data processing method and device, computer equipment and storage medium
CN112148922A (en) Conference recording method, conference recording device, data processing device and readable storage medium
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN109829363A (en) Expression recognition method, device, computer equipment and storage medium
JP7148737B2 (en) Liveness detection verification method, liveness detection verification system, recording medium, and liveness detection verification system training method
US20230058259A1 (en) System and Method for Video Authentication
Korshunov et al. Tampered speaker inconsistency detection with phonetically aware audio-visual features
CN110493612A (en) Processing method, server and the computer readable storage medium of barrage information
CN111950327A (en) Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment
CN116708055B (en) Intelligent multimedia audiovisual image processing method, system and storage medium
CN111401198B (en) Audience emotion recognition method, device and system
CN114902217A (en) System for authenticating digital content
Lucey et al. Continuous pose-invariant lipreading
CN112466306B (en) Conference summary generation method, device, computer equipment and storage medium
CN114565449A (en) Intelligent interaction method and device, system, electronic equipment and computer readable medium
JP7347511B2 (en) Audio processing device, audio processing method, and program
CN111933131A (en) Voice recognition method and device
CN112365340A (en) Multi-mode personal loan risk prediction method
CN111209863A (en) Living body model training and human face living body detection method, device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20907878

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26.10.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20907878

Country of ref document: EP

Kind code of ref document: A1