WO2021128817A1 - Video and audio recognition method, apparatus and device and storage medium - Google Patents

Video and audio recognition method, apparatus and device and storage medium Download PDF

Info

Publication number
WO2021128817A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
video
audio
information
user
Prior art date
Application number
PCT/CN2020/102532
Other languages
French (fr)
Chinese (zh)
Inventor
黄超
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021128817A1 publication Critical patent/WO2021128817A1/en

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval of audio data
    • G06F16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/686: Retrieval using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783: Retrieval using metadata automatically derived from the content
    • G06F16/7837: Retrieval using objects detected or recognised in the video content
    • G06F16/784: Retrieval where the detected or recognised objects are people

Definitions

  • This application relates to the technical field of artificial intelligence, and in particular to a video and audio recognition method, equipment, storage medium and device.
  • The main purpose of this application is to provide a video and audio recognition method, equipment, storage medium, and device, intended to solve the prior-art technical problem that entering user information is complicated and time-consuming.
  • the video and audio recognition method includes the following steps:
  • the target business document of the user is generated according to the user picture and the target information.
  • This application also proposes a video and audio recognition device. The device includes a memory, a processor, and a video and audio recognition program stored in the memory and runnable on the processor, the program being configured to implement the following steps:
  • the target business document of the user is generated according to the user picture and the target information.
  • this application also proposes a storage medium with a video and audio recognition program stored on the storage medium, and when the video and audio recognition program is executed by a processor, the following steps are implemented:
  • the target business document of the user is generated according to the user picture and the target information.
  • this application also proposes a video and audio recognition device, the video and audio recognition device comprising:
  • The search module is configured to receive the target business type input by the user, search for the corresponding target business copy based on the target business type, and display the target business copy;
  • The audio separation module is configured to shoot a target video in which the user reads the target business copy aloud, and to perform audio separation on the target video through an audio-video separator to obtain target audio information;
  • The text recognition module is configured to perform text recognition on the target audio information to obtain target information;
  • The frame extraction processing module is configured to perform frame extraction processing on the target video to obtain user pictures;
  • The generating module is configured to generate the target business document of the user according to the user picture and the target information.
  • The audio-video separator performs audio separation on the target video to obtain target audio information, and reading aloud replaces tedious manual input; text recognition is performed on the target audio information to obtain target information; frames are extracted from the target video to obtain user pictures for verifying the user's identity; and the user's target business document is generated from the user picture and the target information. Based on artificial intelligence, various data are obtained by parsing the video, the user's identity is verified, and at the same time the efficiency of user information entry is improved.
  • FIG. 1 is a schematic structural diagram of a video and audio recognition device in a hardware operating environment involved in a solution of an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of a first embodiment of a video and audio recognition method according to this application;
  • FIG. 3 is a schematic flowchart of a second embodiment of a video and audio recognition method according to this application.
  • FIG. 4 is a schematic flowchart of a third embodiment of a video and audio recognition method according to this application.
  • Fig. 5 is a structural block diagram of a first embodiment of a video and audio recognition device according to the present application.
  • FIG. 1 is a schematic structural diagram of a video and audio recognition device in a hardware operating environment involved in a solution of an embodiment of the application.
  • the video and audio recognition device may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the user interface 1003 may include a display screen (Display), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
  • the wired interface of the user interface 1003 may be a USB interface in this application.
  • The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless Fidelity (Wi-Fi) interface).
  • The memory 1005 may be a high-speed Random Access Memory (RAM), or a stable Non-volatile Memory (NVM) such as disk memory.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
  • The structure shown in FIG. 1 does not constitute a limitation on the video and audio recognition device, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
  • a memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module, and a video and audio recognition program.
  • The network interface 1004 is mainly used to connect to a back-end server and exchange data with it; the user interface 1003 is mainly used to connect to user equipment. The video and audio recognition device calls, through the processor 1001, the video and audio recognition program stored in the memory 1005, and executes the video and audio recognition method provided in the embodiments of the present application.
  • the video and audio recognition method includes the following steps:
  • Step S10 Receive the target business type input by the user, search for the corresponding target business copy based on the target business type, and display the target business copy.
  • the execution subject of this embodiment is the video and audio recognition device, where the video and audio recognition device may be an electronic device such as a smart phone, a personal computer, or a server, which is not limited in this embodiment.
  • Various business types can be presented as options. The user selects the target business type to be handled, and when the target business type input by the user is received, the target business copy corresponding to that type is looked up in a preset mapping table, which stores the correspondence between business types and business copy.
  • the target business types include businesses such as loans, leasing, or insurance.
  • the target business copy is user-related information that needs to be collected for each business type.
  • Each business type needs to collect basic personal information of the user, such as this piece of personal-information copy: I am xxx, my ID number is xxxxx, I am from the xxx area, and so on.
  • Different business types also need to collect relevant information corresponding to the business type.
  • the loan business also needs to collect the following information: whether there is a loan, whether there is a real estate, a car, and the amount of annual income.
  • The corresponding business copy can be created in advance for each business type, with the information to be collected presented in fill-in-the-blank form.
  • Step S20 Shoot the target video in which the user reads the target business copy, and perform audio separation on the target video through an audio-video separator to obtain target audio information.
  • When reading the target business copy aloud, the user fills in the blanks with his or her own information.
  • the process of the user reading the target business copy is photographed, and the video recording can be performed through the camera function of the video and audio recognition device, such as the recording function of a smart phone.
  • When the target business copy is displayed in a web page or app, a camera button is placed above or below the copy. The user taps the camera button to record a video of himself or herself reading the target business copy, and the target video is obtained.
  • Audio separation takes the sound and the images of the video out separately. The steps of separating the audio are: set the audio source; obtain the number of tracks in the source file and traverse them to find the required audio track; and extract that track to obtain the target audio information.
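The track-traversal step above can be sketched in Python. The track descriptors and field names here are illustrative assumptions, since the application does not specify a container format:

```python
# Minimal sketch of the track-selection step, assuming the container has
# already been demuxed into a list of track descriptors. The "type" and
# "codec" fields are hypothetical, not from the application.
def find_audio_track(tracks):
    """Traverse the source file's tracks and return the first audio track."""
    for index, track in enumerate(tracks):
        if track["type"] == "audio":
            return index, track
    raise ValueError("no audio track found in source file")

tracks = [
    {"type": "video", "codec": "h264"},
    {"type": "audio", "codec": "aac"},
]
index, audio = find_audio_track(tracks)
# The selected track would then be extracted to obtain the target audio.
```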
  • Step S30 Perform text recognition on the target audio information to obtain target information.
  • The silence at the beginning and the end of the target audio information is trimmed off to reduce interference with subsequent steps.
  • The first audio information left after silence trimming is framed, that is, cut into small segments, each called a frame; the framing operation is generally implemented with a moving window function. After framing, the first audio information becomes many small segments.
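As an illustration, the framing-with-moving-window step might look like the following sketch. The Hamming window and the toy frame sizes are assumptions; real systems typically use frames of about 25 ms with a 10 ms hop:

```python
import math

def frame_signal(samples, frame_len, hop):
    """Cut the audio into overlapping frames and apply a Hamming window.

    Toy illustration of framing with a moving window function.
    """
    # Hamming window coefficients for one frame.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# 10 samples, frame length 4, hop 2 -> frames starting at 0, 2, 4, 6.
frames = frame_signal(list(range(10)), frame_len=4, hop=2)
```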
  • The waveform is then transformed to extract Mel-scale Frequency Cepstral Coefficient (MFCC) features, turning each frame of the waveform into a multi-dimensional vector.
  • Several frames of speech correspond to one state, every three states combine into a phoneme, and several phonemes combine into a word, yielding the corresponding text information.
  • The content filled in by the user can then be extracted from the text information as the target information.
  • Step S40 Perform frame extraction processing on the target video to obtain a user picture.
  • The target video is instantiated and initialized; the total number of frames of the target video is obtained and printed; a variable is defined to store each frame image; and the current frame is tracked in a loop that reads every frame of the target video (converting the long frame index into a string where needed). One frame is grabbed every 10 frames and converted into a picture output. The end condition is that the loop stops when the current frame number exceeds the total number of frames, and the output pictures are the user pictures.
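The frame-sampling logic described above (keep one frame out of every 10) can be sketched in plain Python. Reading actual frames would require a video library such as OpenCV, which is omitted here so the selection logic stands alone:

```python
def sample_frame_indices(total_frames, step=10):
    """Return the indices of frames to keep: one frame every `step` frames.

    Sketch of the frame-extraction loop; a real implementation would read
    frames from the target video and write each kept frame out as a picture.
    """
    kept = []
    current = 0
    while current < total_frames:      # stop once past the total frame count
        if current % step == 0:        # grab one frame every `step` frames
            kept.append(current)
        current += 1
    return kept

# For a 25-frame video, frames 0, 10, and 20 become the user pictures.
indices = sample_frame_indices(total_frames=25)
```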
  • Step S50 Generate a target business document of the user according to the user picture and the target information.
  • The user picture can be used as the user's identity verification information. The user's voiceprint can also be extracted from the audio and used as the user's identity, with identity verification performed based on the voiceprint.
  • The target information is the user-related information extracted from the text read aloud by the user. The user picture and the target information are combined to generate a data document, namely the target business document, which includes the user's identity verification information and the various user information required by the target business type.
  • The target video is audio-separated by the audio-video separator to obtain target audio information, and reading aloud replaces tedious manual input; text recognition is performed on the target audio information to obtain target information; frames are extracted from the target video to obtain user pictures for verifying the user's identity; and the user's target business document is generated from the user picture and the target information. Based on artificial intelligence, various data are obtained by parsing the video, the user's identity is verified, and at the same time the efficiency of user information entry is improved.
  • FIG. 3 is a schematic flowchart of the second embodiment of the video and audio recognition method of the present application. Based on the first embodiment shown in FIG. 2 above, the second embodiment of the video and audio recognition method of the present application is proposed.
  • step S30 includes:
  • Step S301 Perform text recognition on the target audio information to obtain corresponding text information.
  • The silence at the beginning and the end of the target audio information is trimmed, and the remaining first audio information is divided into frames; several frames of speech correspond to one state, and each frame is assigned to whichever state it matches with the greatest probability. A state network is constructed, and the path that best matches the sound is searched for in it.
  • The speech recognition process is essentially a search for the best path through this state network. Every three states combine into one phoneme, and several phonemes combine into one word, yielding the text information corresponding to the target audio information.
  • Step S302 Compare the text information with the target business copy to obtain the correct rate of the text information.
  • the text information is the text formed by the user reading the target business copy.
  • In the ideal case, the user reads the target business copy exactly as written.
  • the fixed content in the text information is extracted, and the extracted content is compared with the target business copy.
  • the similarity between the extracted content and the target business copy can be used as the correct rate of the text information.
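One simple stand-in for the similarity measure, which the application leaves unspecified, is sequence matching from Python's standard library:

```python
import difflib

def correctness_rate(recognized_text, business_copy):
    """Use string similarity as the correctness rate of recognized text.

    difflib's ratio (2 * matches / total length) is an assumed, minimal
    choice; any string-similarity metric could fill this role.
    """
    return difflib.SequenceMatcher(None, recognized_text, business_copy).ratio()

# Identical text gives a correctness rate of 1.0.
rate = correctness_rate("my ID number is 123", "my ID number is 123")
```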
  • Step S303 When the correctness rate is greater than a preset correctness rate threshold, information extraction is performed on the text through regular expressions to obtain target information.
  • the preset correctness rate threshold can be set according to an empirical value, such as 80%.
  • If the two are similar in content, the text information is considered to have a correctness rate that meets the requirement, and the text information can be analyzed further.
  • The requirement of extracting a string at a specific position can be met with regular expressions. Specifically, for extraction at a single position the pattern (.+?) can be used: for example, given the string "a123b", to extract the value 123 between a and b we can use findall with the regular expression, which returns a list of all matches. This method can be used to extract numbers such as the user's phone number and ID number at the corresponding positions. For a string "a123b456b", if we want to match everything between a and the last b rather than between a and the first b, we can use ? to control greedy versus non-greedy matching.
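The greedy versus non-greedy behavior described above can be demonstrated directly with Python's re module:

```python
import re

# Non-greedy (.+?): capture stops at the FIRST "b" after "a".
first = re.findall(r"a(.+?)b", "a123b")

# Greedy (.+): capture runs to the LAST "b", taking everything in between.
greedy = re.findall(r"a(.+)b", "a123b456b")

# Non-greedy on the same string still stops at the first "b".
lazy = re.findall(r"a(.+?)b", "a123b456b")
```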
  • the method further includes:
  • step S40 is executed.
  • The template data filled into the target business copy is analyzed in advance, and a corresponding rule is set for each item to be filled in. For example, if the phone number is an 11-digit number, the preset rule for the phone number in the target business copy is to check whether it is an 11-digit number. If the preset rule is satisfied, the phone number in the target information is considered correct; if not, the phone number is considered to have been read incorrectly, and a voice prompt can be issued, for example that the number of digits read aloud is wrong or that one digit too many was read.
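A minimal sketch of such a preset rule, assuming the 11-digit phone-number field described above (the example numbers are placeholders):

```python
import re

def check_phone_rule(value):
    """Preset rule for the phone-number field: must be exactly 11 digits."""
    return re.fullmatch(r"\d{11}", value) is not None

ok = check_phone_rule("13812345678")   # 11 digits: rule satisfied
bad = check_phone_rule("1381234567")   # 10 digits: prompt the user to re-read
```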
  • The input box can support editing for error correction, so that the user can modify the text information.
  • Preset rules are likewise set in advance for regions: for example, various geographic locations are entered into a map beforehand, and when the content read from the target business copy is address information, it is checked whether the address in the text information belongs to the pre-entered geographic locations. If it does, the address read aloud is considered correct; if not, it is considered wrong.
  • The text is processed with regular expressions to extract the target information, thereby improving the accuracy of information entry.
  • FIG. 4 is a schematic flowchart of a third embodiment of a video and audio recognition method according to this application. Based on the above-mentioned first or second embodiment, a third embodiment of the video and audio recognition method according to this application is proposed. This embodiment is described based on the first embodiment.
  • Before step S40, the method further includes:
  • step S40 is executed.
  • The principle of finding people in a video is the same as finding people in pictures. A video is a collection of pictures, so in essence the task is still finding people in pictures; rectangular frames are drawn around the people found and the recognized faces to realize face recognition.
  • Face detection (Face Detection) is responsible for locating the face position, and face alignment (Face Alignment) aligns the face.
  • The algorithm uses an affine transformation based on the eye coordinates to perform face alignment, and uses the Visual Geometry Group Network (VGG) model for feature extraction.
  • the get_feature_new function opens the picture, and uses the VGG network to extract features.
  • the compare_pic function calculates the similarity of the two features passed in.
  • the key point is the selection of the threshold.
  • The face_recog_test function reads the test pictures, calculates the best parameters for each group of pictures, and saves the aligned face pictures for use in subsequent facial feature comparison.
  • The SeetaFace Engine or Face Alignment can be used for face recognition to obtain the facial features in the input picture.
  • Performing face recognition on the target video and performing live detection on the recognized face includes:
  • Live detection is performed on the recognized face, checking for example whether the detected face moves or blinks, to determine that it is a real person and not a photo.
  • After face detection and eye localization, the eye region is cropped, and the degree of eye opening and closing is computed from the normalized image; a model for judging the blinking action is built on a convolutional neural network, and the model recognizes whether a blink occurs in the image.
  • A convolutional neural network model to be trained can be established in advance. A large number of sample images are obtained, the eye region of the face in each sample image is cropped to obtain sample eye images, and the blink label corresponding to each sample image is obtained. The model is trained on the sample eye images and their blink labels to obtain the preset blink model. The preset blink model can then recognize the eye-region image, and if a blinking action is recognized, the target video is considered to show a real person and the liveness detection is judged successful.
  • the method further includes:
  • Step S401 Perform preprocessing on the user picture to obtain a preprocessed picture.
  • the user picture can be pre-processed in advance.
  • The purpose of image preprocessing is to eliminate irrelevant information in the image and to remove or reduce, as far as possible, interference from lighting, the imaging system, or the external environment, so that the image's features stand out.
  • the preprocessing process includes processing steps such as light compensation, grayscale transformation, histogram equalization, normalization, geometric correction, filtering, and sharpening of the face image, so as to obtain the preprocessed picture.
  • Step S402 Screen the pre-processed pictures according to the definition to obtain screened pictures.
  • the sharpness of the image is an important indicator to measure the quality of the image.
  • Sharpness can be evaluated with the re-blur (Reblur) algorithm: if an image is already blurred, blurring it again changes its high-frequency components little; but if the original image is sharp, a single blurring pass changes the high-frequency components greatly. A degraded image is therefore obtained by applying Gaussian blurring to the image under evaluation, the changes in adjacent pixel values of the original and degraded images are compared, and the sharpness level is judged from the magnitude of the change.
  • Specifically, the preprocessed picture is low-pass filtered to obtain a blurred image. The gray-value changes of adjacent pixels in the preprocessed picture are computed to obtain a first pixel change value, and the gray-value changes of adjacent pixels in the blurred image are computed to obtain a second pixel change value. The two change values are compared, normalization is performed to obtain a sharpness result, and the preprocessed pictures are screened according to the sharpness result to obtain the screened pictures.
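A toy one-dimensional sketch of the re-blur idea follows. A 3-tap average stands in for the Gaussian low-pass filter, and a single row of pixels stands in for an image; this illustrates the principle, not the application's exact computation:

```python
def total_variation(signal):
    """Sum of absolute differences between adjacent pixel values."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))

def reblur_sharpness(signal):
    """Re-blur sharpness score on a 1-D row of pixels, normalized to [0, 1].

    Blur the input once, then measure how much the adjacent-pixel variation
    dropped: sharp inputs lose much high-frequency content, while inputs
    that are already blurry lose little.
    """
    padded = [signal[0]] + list(signal) + [signal[-1]]   # clamp the edges
    blurred = [(padded[i - 1] + padded[i] + padded[i + 1]) / 3
               for i in range(1, len(padded) - 1)]
    original = total_variation(signal)
    degraded = total_variation(blurred)
    if original == 0:
        return 0.0
    return (original - degraded) / original

sharp_score = reblur_sharpness([0, 255, 0, 255, 0, 255, 0, 255])  # hard edges
smooth_score = reblur_sharpness([0, 10, 20, 30, 40, 50, 60, 70])  # gentle ramp
```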
  • Step S403 Compare the selected picture with a preset picture to obtain a comparison result.
  • step S50 includes:
  • Step S501 When the comparison result exceeds a preset similarity threshold, generate a target business document of the user according to the screened picture and the target information.
  • The face similarity is used as the comparison result. If the face similarity exceeds the preset similarity threshold, the user's identity is considered verified, and the user's business document can then be generated.
  • The preset similarity threshold may be set according to an empirical value, such as 40%. The facial features are compared and the face similarity is calculated; with the preset similarity threshold set to 0.4, a similarity greater than 40% means the two faces are considered the same person, and the screened picture and the target information can be used to generate the user's target business document.
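As one possible realization of the comparison (the application does not fix the metric), cosine similarity between face feature vectors can be checked against the 0.4 threshold; the feature values below are arbitrary illustrations:

```python
import math

def face_similarity(feat_a, feat_b):
    """Cosine similarity between two face feature vectors (e.g. VGG features).

    Cosine similarity is an assumed, common choice of comparison function.
    """
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    return dot / (norm_a * norm_b)

PRESET_THRESHOLD = 0.4   # empirical value from the description

# Two similar (hypothetical) feature vectors clear the threshold.
same_person = face_similarity([0.9, 0.1, 0.3], [0.8, 0.2, 0.4]) > PRESET_THRESHOLD
```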
  • step S20 includes:
  • the target music can be played at the same time.
  • the target music can create a noisy voice environment and prevent the user's personal information from being learned by others.
  • the target video captured at this time includes the target music and the audio of the user reading the target business copy.
  • After audio separation of the target video by the audio-video separator yields mixed audio information, the Computational Auditory Scene Analysis (CASA) algorithm, which simulates the human auditory system, is further used to extract the speech read aloud by the user from the noisy environment.
  • the audio information will be encoded to achieve grouping and parsing. There are currently dozens of grouping criteria related to time and frequency, including pitch, spatial position, and start/end time.
  • Pitch is a very important grouping cue: it identifies the unique characteristics of a sound from its distinct harmonic pattern. When two or more microphones are used, the sound isolation system can determine the direction and distance of each sound source from the spatial position information.
  • the CASA modeling method enables the sound isolation system to focus on a certain sound source, such as a certain person, and shield the background sound.
  • Start/stop time grouping refers to the moments when a sound component begins and ends; combined with the original frequency data, these data allow judging whether components come from the same sound source. A series of noises can be masked out so that recognition focuses on a specific sound source. Sounds with similar attributes form the same audio stream, while sounds with different attributes form their own streams; these different audio streams can be used to identify continuous or repetitive sound sources. With enough voice groups, the actual voice isolation process can match the identified sound sources and respond to the real speaker's voice, thereby separating out the target audio information of the user reading the target business copy.
  • The user pictures are processed to obtain better-quality screened pictures, which are then compared with preset pictures in the public security system to verify the user's identity, improving the security and reliability of information entry.
  • An embodiment of the present application also proposes a storage medium, which may be non-volatile or volatile. A video and audio recognition program is stored on the storage medium, and when the program is executed by a processor, the steps of the video and audio recognition method described above are implemented.
  • an embodiment of the present application also proposes a video and audio recognition device, and the video and audio recognition device includes:
  • the searching module 10 is configured to receive the target business type input by the user, search for the corresponding target business copy according to the target business type, and display the target business copy.
  • Various business types can be presented as options. The user selects the target business type to be handled, and when the target business type input by the user is received, the target business copy corresponding to that type is looked up in a preset mapping table, which stores the correspondence between business types and business copy.
  • the target business types include businesses such as loans, leasing, or insurance.
  • The target business copy is the user-related information to be collected for each business type. For example, each business type needs to collect the user's basic personal information, such as this piece of personal-information copy: I am xxx, my ID number is xxxxx, I am from the xxx area, and so on.
  • the loan business also needs to collect the following information: whether there is a loan, whether there is a real estate, a car, and the amount of annual income.
  • The corresponding business copy can be created in advance for each business type, with the information to be collected presented in fill-in-the-blank form.
  • the audio separation module 20 is used to shoot the target video in which the user reads the target business copy, and perform audio separation on the target video through an audio-video separator to obtain target audio information.
  • When reading the target business copy aloud, the user fills in the blanks with his or her own information.
  • the process of the user reading the target business copy is photographed, and the video recording can be performed through the camera function of the video and audio recognition device, such as the recording function of a smart phone.
  • When the target business copy is displayed in a web page or app, a camera button is placed above or below the copy. The user taps the camera button to record a video of himself or herself reading the target business copy, and the target video is obtained.
  • Audio separation takes the sound and the images of the video out separately. The steps of separating the audio are: set the audio source; obtain the number of tracks in the source file and traverse them to find the required audio track; and extract that track to obtain the target audio information.
  • the text recognition module 30 is used to perform text recognition on the target audio information to obtain target information.
  • the mute at the beginning and the end of the target audio information is cut off to reduce the interference caused to subsequent steps.
  • the first audio information (after silence trimming) is divided into frames, that is, cut into short segments, each of which is called a frame; the framing operation is generally implemented with a moving window function. After framing, the first audio information becomes many short segments.
  • the waveform is then transformed to extract Mel-scale Frequency Cepstral Coefficient (MFCC) features, turning each frame of waveform into a multi-dimensional vector.
  • Several frames of speech correspond to a state, every three states are combined into a phoneme, and several phonemes are combined into a word, so as to obtain the corresponding text information.
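A minimal sketch of the framing step with a moving (Hamming) window; the frame length, hop size, and synthetic signal are arbitrary illustrative choices, not values from the original disclosure:

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Cut a 1-D signal into overlapping frames and apply a Hamming
    window, as done before MFCC feature extraction."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])  # shape: (n_frames, frame_len)

audio = np.ones(400)  # stand-in for real audio samples
frames = frame_signal(audio, frame_len=200, hop=80)
```

Each row of `frames` would then be converted into one MFCC vector for the frame-to-state decoding described above.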
  • the content filled in by the user can be extracted from the text information as the target information.
  • the frame extraction processing module 40 is configured to perform frame extraction processing on the target video to obtain a user picture.
  • the target video is instantiated and initialized at the same time; the total number of frames of the target video is obtained and printed; a variable is defined to store each frame image; a loop flag and the current frame are defined; each frame of the target video is read; a string stream converts the long-integer frame index into a character string passed to the object str; a frame is grabbed every 10 frames and converted into a picture for output; the end condition is that when the current frame number is greater than the total number of frames the loop stops, and the output pictures are the user pictures.
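The every-10-frames loop described above boils down to choosing which frame indices to keep; this pure-Python sketch shows only that selection (in practice the frames themselves would be read with a video library such as OpenCV):

```python
def sample_frame_indices(total_frames, step=10):
    """Return the indices of the frames to keep: one frame every `step`
    frames, stopping once the current frame number exceeds the total."""
    indices = []
    current = 0
    while current < total_frames:  # loop stops past the total frame count
        indices.append(current)
        current += step
    return indices

picked = sample_frame_indices(total_frames=95, step=10)  # -> [0, 10, ..., 90]
```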
  • the generating module 50 is configured to generate a target business document of the user according to the user picture and the target information.
  • the user picture can be used as the user's identity verification information; the user's voiceprint can also be extracted from the audio, the extracted voiceprint used as the user's identity, and identity verification performed based on the voiceprint.
  • the target information is the user-related information extracted from the text read aloud by the user; the user picture and the target information are combined to generate a data document, namely the target business document, which includes the user identity verification information and the various user information required by the target service type.
  • audio separation is performed on the target video through an audio-video separator to obtain target audio information, reducing the tedious steps of manual input through voice reading; text recognition is performed on the target audio information to obtain target information, and frame extraction is performed on the target video to obtain user pictures for verifying the user's identity; the user's target business document is generated according to the user picture and the target information. Based on artificial intelligence, various data are obtained by parsing the video, verifying the user's identity while improving the efficiency of user information entry.
  • the text recognition module 30 is further configured to perform text recognition on the target audio information to obtain corresponding text information; compare the text information with the target business copy to obtain the correct rate of the text information; and, when the correct rate is greater than a preset correct-rate threshold, perform information extraction on the text through regular expressions to obtain the target information.
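A sketch of this compare-then-extract flow. The similarity ratio from Python's difflib stands in for the correct-rate computation, and the template copy, threshold, and pattern are illustrative assumptions:

```python
import difflib
import re

TEMPLATE = "My name is xxx and my ID number is xxxxx"  # hypothetical copy
THRESHOLD = 0.6                                        # illustrative threshold

def extract_if_correct(recognized):
    """Compare the recognized text against the business copy; only when
    the correct rate exceeds the threshold, pull the filled-in fields
    out with a regular expression."""
    rate = difflib.SequenceMatcher(None, recognized, TEMPLATE).ratio()
    if rate <= THRESHOLD:
        return None  # prompt the user to read the copy again
    m = re.search(r"My name is (?P<name>\w+) and my ID number is (?P<idno>\d+)",
                  recognized)
    return m.groupdict() if m else None

info = extract_if_correct("My name is Alice and my ID number is 12345")
```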
  • the video and audio recognition device further includes:
  • the judgment module is used to judge whether the target information satisfies a preset rule
  • a prompting module is used for prompting if not satisfied, so that the user can read the target business copy again;
  • the frame extraction processing module 40 is further configured to perform the step of performing frame extraction processing on the target video to obtain a user picture if it is satisfied.
  • the video and audio recognition device further includes:
  • a living body detection module configured to perform face recognition on the target video, and perform live body detection on the recognized face
  • the frame extraction processing module 40 is further configured to perform the step of performing frame extraction processing on the target video to obtain a user picture when the living body detection is successful.
  • the living body detection module is further used to perform face recognition on the target video, intercept the eye area of the recognized face, and obtain an image of the eye area; recognize, through a preset blinking model, whether the eye-area image contains a blinking action; and, if a blinking action is recognized in the eye-area image, determine that the living body detection is successful.
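The blink check can be reduced to spotting an open-closed-open dip in a per-frame eye-openness score; this pure-Python sketch assumes such scores have already been produced by the preset blinking model, and the threshold is an illustrative choice:

```python
def has_blink(openness, closed_threshold=0.2):
    """Report a blinking action if the eye-openness score drops below
    the threshold and then recovers, i.e. open -> closed -> open."""
    seen_open_before = False
    seen_closed = False
    for score in openness:
        if score < closed_threshold:
            if seen_open_before:
                seen_closed = True
        else:
            if seen_closed:
                return True  # eye reopened after closing: a blink
            seen_open_before = True
    return False

blinked = has_blink([0.9, 0.8, 0.1, 0.05, 0.85, 0.9])
steady = has_blink([0.9, 0.85, 0.9, 0.88])
```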
  • the video and audio recognition device further includes:
  • the preprocessing module is used to preprocess the user picture to obtain the preprocessed picture
  • the screening module is used to screen the pre-processed pictures according to the definition to obtain the screened pictures;
  • the comparison module is used to compare the selected picture with the preset picture to obtain a comparison result
  • the generating module 50 is further configured to generate a target business document of the user according to the screened picture and the target information when the comparison result exceeds a preset similarity threshold.
  • the audio separation module 20 is also used to play the target music while shooting the target video of the user reading the target business copy;
  • serial numbers of the foregoing embodiments of the present application are for description only, and do not represent the superiority or inferiority of the embodiments.
  • several of these devices may be embodied in the same hardware item.
  • the use of the words first, second, and third does not indicate any order; these words may be interpreted as labels.
  • the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a Read-Only Memory (ROM) image or Random Access Memory (RAM), magnetic disks, or optical disks) and includes several instructions to enable a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, etc.) to execute the methods described in each embodiment of the present application.

Abstract

A video and audio recognition method, apparatus and device and a storage medium. The method comprises: receiving a target service type input by a user, and according to the target service type, searching a corresponding target service text and displaying same (S10); shooting a target video of the target service text read aloud by the user, and by means of an audio and video separator, carrying out audio separation on the target video to obtain target audio information (S20), wherein tedious steps of manual input are reduced by means of voice reading; carrying out character recognition on the target audio information to obtain target information (S30); carrying out frame extraction on the target video to obtain a user picture (S40) so as to perform verification of user identity; and according to the user picture and the target information, generating a target service document of the user (S50).

Description

视频音频识别方法、设备、存储介质及装置Video and audio recognition method, equipment, storage medium and device
本申请要求于2019年12月26日提交中国专利局、申请号为201911374298.1,发明名称为“视频音频识别方法、设备、存储介质及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on December 26, 2019 with the application number 201911374298.1 and the invention title "Video and Audio Recognition Method, Equipment, Storage Medium and Device", the entire content of which is incorporated by reference In this application.
技术领域Technical field
本申请涉及人工智能的技术领域,尤其涉及一种视频音频识别方法、设备、存储介质及装置。This application relates to the technical field of artificial intelligence, and in particular to a video and audio recognition method, equipment, storage medium and device.
背景技术Background technique
金融场景中在对用户进行真实的校验需求时，需要对用户的数据真实性反复收集再验证真假，以便尽可能提升风控能力，以尽可能的精确评价用户的贷款金融，目标是精准风控。发明人意识到，在目前贷款场景中，比较常见都会增加一个身份验证的过程，验证通过后在通过用户在网页或者应用程序(Application,APP)中输入信息，以进行用户资料的收集，如此繁琐的操作，会导致页面比较多，异常也会增加，用户信息的录入耗时长，对于用户体验也非常差。In financial scenarios, when real verification of a user is required, the authenticity of the user's data needs to be repeatedly collected and verified, so as to improve the risk-control capability as much as possible and evaluate the user's loan financing as accurately as possible; the goal is precise risk control. The inventor realized that in current loan scenarios it is common to add an identity verification process; after verification passes, the user enters information in a web page or application (Application, APP) so that user data can be collected. Such cumbersome operations lead to more pages and more anomalies, the entry of user information takes a long time, and the user experience is very poor.
上述内容仅用于辅助理解本申请的技术方案,并不代表承认上述内容是现有技术。The above content is only used to assist the understanding of the technical solution of the application, and does not mean that the above content is recognized as prior art.
技术解决方案Technical solutions
本申请的主要目的在于提供一种视频音频识别方法、设备、存储介质及装置,旨在解决现有技术中用户信息的录入操作繁琐导致耗时长的技术问题。The main purpose of this application is to provide a video and audio recognition method, equipment, storage medium, and device, which are intended to solve the technical problem that the input operation of user information is complicated and time-consuming in the prior art.
为实现上述目的,本申请提供一种视频音频识别方法,所述视频音频识别方法包括以下步骤:In order to achieve the above objective, the present application provides a video and audio recognition method. The video and audio recognition method includes the following steps:
接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;Receiving the target business type input by the user, searching for a corresponding target business copy based on the target business type, and displaying the target business copy;
拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;Shooting the target video in which the user reads the target business copy, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
对所述目标音频信息进行文字识别,获得目标信息;Perform text recognition on the target audio information to obtain target information;
对所述目标视频进行抽帧处理,获得用户图片;Performing frame extraction processing on the target video to obtain a user picture;
根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The target business document of the user is generated according to the user picture and the target information.
此外,为实现上述目的,本申请还提出一种视频音频识别设备,所述视频音频识别设备包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的视频音频识别程序,所述视频音频识别程序配置为实现以下步骤:In addition, in order to achieve the above object, this application also proposes a video and audio recognition device, the video and audio recognition device includes a memory, a processor, and a video and audio recognition program stored on the memory and running on the processor , The video and audio recognition program is configured to implement the following steps:
接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;Receiving the target business type input by the user, searching for a corresponding target business copy based on the target business type, and displaying the target business copy;
拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;Shooting the target video in which the user reads the target business copy, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
对所述目标音频信息进行文字识别,获得目标信息;Perform text recognition on the target audio information to obtain target information;
对所述目标视频进行抽帧处理,获得用户图片;Performing frame extraction processing on the target video to obtain a user picture;
根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The target business document of the user is generated according to the user picture and the target information.
此外,为实现上述目的,本申请还提出一种存储介质,所述存储介质上存储有视频音频识别程序,所述视频音频识别程序被处理器执行时实现以下步骤:In addition, in order to achieve the above-mentioned object, this application also proposes a storage medium with a video and audio recognition program stored on the storage medium, and when the video and audio recognition program is executed by a processor, the following steps are implemented:
接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;Receiving the target business type input by the user, searching for a corresponding target business copy based on the target business type, and displaying the target business copy;
拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;Shooting the target video in which the user reads the target business copy, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
对所述目标音频信息进行文字识别,获得目标信息;Perform text recognition on the target audio information to obtain target information;
对所述目标视频进行抽帧处理,获得用户图片;Performing frame extraction processing on the target video to obtain a user picture;
根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The target business document of the user is generated according to the user picture and the target information.
此外,为实现上述目的,本申请还提出一种视频音频识别装置,所述视频音频识别装置包括:In addition, in order to achieve the above objective, this application also proposes a video and audio recognition device, the video and audio recognition device comprising:
查找模块,用于接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;The search module is configured to receive the target business type input by the user, search for the corresponding target business copy based on the target business type, and display the target business copy;
音频分离模块,用于拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;An audio separation module, configured to shoot a target video in which the user reads the target business copy, and perform audio separation on the target video through an audio-video separator to obtain target audio information;
文字识别模块,用于对所述目标音频信息进行文字识别,获得目标信息;The text recognition module is used to perform text recognition on the target audio information to obtain target information;
抽帧处理模块,用于对所述目标视频进行抽帧处理,获得用户图片;The frame extraction processing module is used to perform frame extraction processing on the target video to obtain user pictures;
生成模块,用于根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The generating module is used to generate the target business document of the user according to the user picture and the target information.
本申请中,通过接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示,拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息,通过语音朗读减少手动输入的繁琐步骤;对所述目标音频信息进行文字识别,获得目标信息,对所述目标视频进行抽帧处理,获得用户图片,以对用户身份实现验证;根据所述用户图片和所述目标信息生成所述用户的目标业务文档,基于人工智能,通过解析视频获得多方面的数据,验证用户身份的同时提升用户的信息录入效率。In this application, by receiving the target business type input by the user, searching for the corresponding target business copy based on the target business type, displaying the target business copy, shooting the target video in which the user reads the target business copy, and passing The audio and video separator performs audio separation on the target video to obtain target audio information, and reduces the tedious steps of manual input through voice reading; performs text recognition on the target audio information to obtain target information, and extracts frames from the target video Process to obtain user pictures to verify the user identity; generate the user’s target business document based on the user picture and the target information, and based on artificial intelligence, obtain various data by parsing the video, and verify the user’s identity at the same time Improve the efficiency of user information entry.
附图说明Description of the drawings
图1是本申请实施例方案涉及的硬件运行环境的视频音频识别设备的结构示意图;FIG. 1 is a schematic structural diagram of a video and audio recognition device in a hardware operating environment involved in a solution of an embodiment of the present application;
图2为本申请视频音频识别方法第一实施例的流程示意图;2 is a schematic flowchart of a first embodiment of a video and audio recognition method according to this application;
图3为本申请视频音频识别方法第二实施例的流程示意图;FIG. 3 is a schematic flowchart of a second embodiment of a video and audio recognition method according to this application;
图4为本申请视频音频识别方法第三实施例的流程示意图;4 is a schematic flowchart of a third embodiment of a video and audio recognition method according to this application;
图5为本申请视频音频识别装置第一实施例的结构框图。Fig. 5 is a structural block diagram of a first embodiment of a video and audio recognition device according to the present application.
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
本发明的实施方式Embodiments of the present invention
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application.
参照图1,图1为本申请实施例方案涉及的硬件运行环境的视频音频识别设备结构示意图。Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a video and audio recognition device in a hardware operating environment involved in a solution of an embodiment of the application.
如图1所示,该视频音频识别设备可以包括:处理器1001,例如中央处理器(Central Processing Unit,CPU),通信总线1002、用户接口1003,网络接口1004,存储器1005。其中,通信总线1002用于实现这些组件之间的连接通信。用户接口1003可以包括显示屏(Display),可选用户接口1003还可以包括标准的有线接口、无线接口,对于用户接口1003的有线接口在本申请中可为USB接口。网络接口1004可选的可以包括标准的有线接口、无线接口(如无线保真(WIreless-FIdelity,WI-FI)接口)。存储器1005可以是高速的随机存取存储器(Random Access Memory,RAM)存储器,也可以是稳定的存储器(Non-volatile Memory,NVM),例如磁盘存储器。存储器1005可选的还可以是独立于前述处理器1001的存储装置。As shown in FIG. 1, the video and audio recognition device may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Among them, the communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display), and the optional user interface 1003 may also include a standard wired interface and a wireless interface. The wired interface of the user interface 1003 may be a USB interface in this application. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a wireless fidelity (WI-FIdelity, WI-FI) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) memory, or a stable memory (Non-volatile Memory, NVM), such as a disk memory. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
本领域技术人员可以理解,图1中示出的结构并不构成对视频音频识别设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。Those skilled in the art can understand that the structure shown in FIG. 1 does not constitute a limitation on the video and audio recognition device, and may include more or less components than those shown in the figure, or a combination of certain components, or different component arrangements.
如图1所示,作为一种计算机存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及视频音频识别程序。As shown in FIG. 1, a memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module, and a video and audio recognition program.
在图1所示的视频音频识别设备中,网络接口1004主要用于连接后台服务器,与所述后台服务器进行数据通信;用户接口1003主要用于连接用户设备;所述视频音频识别设备通过处理器1001调用存储器1005中存储的视频音频识别程序,并执行本申请实施例提供的视频音频识别方法。In the video and audio recognition device shown in FIG. 1, the network interface 1004 is mainly used to connect to a back-end server and perform data communication with the back-end server; the user interface 1003 is mainly used to connect to user equipment; the video and audio recognition device passes through the processor 1001 calls the video and audio recognition program stored in the memory 1005, and executes the video and audio recognition method provided in the embodiment of the present application.
基于上述硬件结构,提出本申请视频音频识别方法的实施例。Based on the above hardware structure, an embodiment of the video and audio recognition method of the present application is proposed.
参照图2,图2为本申请视频音频识别方法第一实施例的流程示意图,提出本申请视频音频识别方法第一实施例。2, which is a schematic flowchart of a first embodiment of a video and audio recognition method according to this application, and a first embodiment of the video and audio recognition method according to this application is proposed.
在第一实施例中,所述视频音频识别方法包括以下步骤:In the first embodiment, the video and audio recognition method includes the following steps:
步骤S10:接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示。Step S10: Receive the target business type input by the user, search for the corresponding target business copy based on the target business type, and display the target business copy.
应理解的是，本实施例的执行主体是所述视频音频识别设备，其中，所述视频音频识别设备可为智能手机、个人电脑或服务器等电子设备，本实施例对此不加以限制。在网页或者APP内，可通过选项呈现各种业务类型，用户选择需要进行的所述目标业务类型，在接收到用户输入的所述目标业务类型时，从预设映射关系表中查找与所述目标业务类型对应的目标业务文案，所述预设映射关系表中包括业务类型与业务文案之间的对应关系。所述目标业务类型包括贷款、租赁或保险等业务，所述目标业务文案为各业务类型需要收集的用户相关信息，比如各业务类型均需采集用户的个人基本信息，如一段个人信息文案：我是xxx，我的身份证件号是xxxxx，我是来自xxx地区等。不同的业务类型还需采集业务类型对应的相关信息，比如贷款业务还需采集如下信息：是否有在还贷款，是否有房产、车子以及年收入多少等信息，可预先按照业务类型建立对应的业务文案，将需要采集的信息以填空形式呈现。It should be understood that the execution subject of this embodiment is the video and audio recognition device, where the video and audio recognition device may be an electronic device such as a smart phone, a personal computer, or a server, which is not limited in this embodiment. In the web page or APP, various business types can be presented through options; the user selects the target business type to be handled, and when the target business type input by the user is received, the target business copy corresponding to the target business type is looked up in a preset mapping relationship table, which includes the correspondence between business types and business copies. The target business types include businesses such as loans, leasing, or insurance, and the target business copy is the user-related information that needs to be collected for each business type. For example, every business type needs to collect the user's basic personal information, such as a personal-information passage: "I am xxx, my ID number is xxxxx, I am from the xxx area", etc. Different business types also need to collect information specific to the business type; for example, the loan business also needs to collect the following information: whether there is an outstanding loan, whether the user owns real estate or a car, and the annual income. The corresponding business copy can be established in advance according to the business type, with the information to be collected presented in fill-in-the-blank form.
步骤S20:拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息。Step S20: Shoot the target video in which the user reads the target business copy, and perform audio separation on the target video through an audio-video separator to obtain target audio information.
需要说明的是，用户在朗读所述目标业务文案时，可对需要填空的内容结合自己的信息在朗读时进行填充。对所述用户朗读所述目标业务文案的过程进行拍摄，视频录制可通过所述视频音频识别设备的摄像功能进行，比如智能手机的录像功能。在所述网页或者APP内有摄像按钮，所述目标业务文案在所述网页或者APP内进行展示时，所述业务文案的上方或者下方设置摄像按钮，用户通过点击该摄像按钮，拍摄自己朗读所述目标业务文案的视频，获得所述目标视频。It should be noted that, when reading the target business copy aloud, the user can fill in the blanks with his or her own information while reading. The process of the user reading the target business copy aloud is recorded; the video recording can be performed through the camera function of the video and audio recognition device, such as the recording function of a smart phone. There is a camera button in the web page or APP: when the target business copy is displayed in the web page or APP, a camera button is set above or below the business copy, and the user clicks the camera button to record a video of himself or herself reading the target business copy aloud, thereby obtaining the target video.
可理解的是，音频分离通常是将视频的声音和图像分别取出来，分离音频步骤为：设置音频源；获取源文件中轨道的数量，并遍历找到需要的音频轨；对找到的音频轨进行提取，获得所述目标音频信息。It is understandable that audio separation usually means taking the sound and the image of the video out separately. The steps of separating the audio are: set the audio source; get the number of tracks in the source file and traverse them to find the required audio track; extract the found audio track to obtain the target audio information.
步骤S30:对所述目标音频信息进行文字识别,获得目标信息。Step S30: Perform text recognition on the target audio information to obtain target information.
在具体实现中，将所述目标音频信息中首尾端的静音切除，降低对后续步骤造成的干扰。对静音切除后的第一音频信息进行分帧，也就是把所述第一音频信息切开成一小段一小段，每小段称为一帧，分帧操作一般使用移动窗函数来实现。分帧后，所述第一音频信息就变成了很多小段。再将波形作变换，提取梅尔倒谱系数(Mel-scale Frequency Cepstral Coefficients,MFCC)特征，把每一帧波形变成一个多维向量。接着，把帧识别成状态；把状态组合成音素；把音素组合成单词。若干帧语音对应一个状态，每三个状态组合成一个音素，若干个音素组合成一个单词，从而获得对应的文本信息，可将所述文本信息中用户填充的内容进行提取，作为所述目标信息。In a specific implementation, the silence at the beginning and end of the target audio information is trimmed to reduce interference with subsequent steps. The first audio information after silence trimming is divided into frames, that is, cut into short segments, each of which is called a frame; the framing operation is generally implemented with a moving window function. After framing, the first audio information becomes many short segments. The waveform is then transformed to extract Mel-scale Frequency Cepstral Coefficient (MFCC) features, turning each frame of waveform into a multi-dimensional vector. Next, frames are recognized as states, states are combined into phonemes, and phonemes are combined into words. Several frames of speech correspond to one state, every three states combine into one phoneme, and several phonemes combine into one word, thereby obtaining the corresponding text information; the content filled in by the user can be extracted from the text information as the target information.
步骤S40:对所述目标视频进行抽帧处理,获得用户图片。Step S40: Perform frame extraction processing on the target video to obtain a user picture.
应理解的是，对所述目标视频实例化的同时进行初始化，获取所述目标视频总帧数并打印，定义一个变量，用来存放存储每一帧图像，循环标志位，定义当前帧，读取所述目标视频每一帧，字符串流，将长整型long类型的转换成字符型传给对象str，设置每10帧获取一次帧，将帧转成图片输出，结束条件，当前帧数大于总帧数时候时，循环停止，输出的图片即为所述用户图片。It should be understood that the target video is instantiated and initialized at the same time; the total number of frames of the target video is obtained and printed; a variable is defined to store each frame image; a loop flag and the current frame are defined; each frame of the target video is read; a string stream converts the long-integer frame index into a character string passed to the object str; a frame is grabbed every 10 frames and converted into a picture for output; the end condition is that when the current frame number is greater than the total number of frames the loop stops, and the output pictures are the user pictures.
步骤S50:根据所述用户图片和所述目标信息生成所述用户的目标业务文档。Step S50: Generate a target business document of the user according to the user picture and the target information.
需要说明的是，所述用户图片可作为所述用户的身份验证信息，还可对所述音频进行用户的声纹提取，将提取的声纹作用用户的身份标识，并根据声纹进行身份验证。所述目标信息为从用户朗读的文本中提取的关于用户的相关信息，将所述用户图片和所述目标信息结合生成一个资料文档，即为所述目标业务文档，则所述目标业务文档包括用户身份验证信息和所述目标业务类型需要的各种用户信息。It should be noted that the user picture can be used as the user's identity verification information; the user's voiceprint can also be extracted from the audio, the extracted voiceprint used as the user's identity, and identity verification performed based on the voiceprint. The target information is the user-related information extracted from the text read aloud by the user; the user picture and the target information are combined to generate a data document, namely the target business document, which then includes the user identity verification information and the various user information required by the target business type.
本实施例中，通过接收用户输入的目标业务类型，根据所述目标业务类型查找对应的目标业务文案，将所述目标业务文案进行展示，拍摄所述用户朗读所述目标业务文案的目标视频，通过音视频分离器对所述目标视频进行音频分离，获得目标音频信息，通过语音朗读减少手动输入的繁琐步骤；对所述目标音频信息进行文字识别，获得目标信息，对所述目标视频进行抽帧处理，获得用户图片，以对用户身份实现验证；根据所述用户图片和所述目标信息生成所述用户的目标业务文档，基于人工智能，通过解析视频获得多方面的数据，验证用户身份的同时提升用户的信息录入效率。In this embodiment, the target business type input by the user is received, the corresponding target business copy is looked up according to the target business type and displayed, and a target video of the user reading the target business copy aloud is shot; audio separation is performed on the target video through an audio-video separator to obtain target audio information, reducing the tedious steps of manual input through voice reading; text recognition is performed on the target audio information to obtain target information, and frame extraction is performed on the target video to obtain user pictures for verifying the user's identity; the user's target business document is generated according to the user picture and the target information. Based on artificial intelligence, various data are obtained by parsing the video, verifying the user's identity while improving the efficiency of user information entry.
参照图3,图3为本申请视频音频识别方法第二实施例的流程示意图,基于上述图2所示的第一实施例,提出本申请视频音频识别方法的第二实施例。Referring to FIG. 3, FIG. 3 is a schematic flowchart of the second embodiment of the video and audio recognition method of the present application. Based on the first embodiment shown in FIG. 2 above, the second embodiment of the video and audio recognition method of the present application is proposed.
在第二实施例中,所述步骤S30,包括:In the second embodiment, the step S30 includes:
步骤S301:对所述目标音频信息进行文字识别,获得对应的文本信息。Step S301: Perform text recognition on the target audio information to obtain corresponding text information.
应理解的是，对所述目标音频信息进行文字识别，首先，将所述目标音频信息中首尾端的静音切除，再对静音切除后的第一音频信息进行分帧，若干帧语音对应一个状态，看某帧对应哪个状态的概率最大，那这帧就属于哪个状态，构建一个状态网络，从状态网络中寻找与声音最匹配的路径，语音识别过程其实就是在状态网络中搜索一条最佳路径，每三个状态组合成一个音素，若干个音素组合成一个单词，从而获得所述目标音频信息对应的所述文本信息。It should be understood that, to perform text recognition on the target audio information, the silence at the beginning and end of the target audio information is first trimmed, and the first audio information after silence trimming is divided into frames. Several frames of speech correspond to one state: whichever state a frame corresponds to with the greatest probability is the state the frame belongs to. A state network is constructed, and the path that best matches the sound is sought in it; the speech recognition process is in fact a search for an optimal path through the state network. Every three states combine into one phoneme, and several phonemes combine into one word, thereby obtaining the text information corresponding to the target audio information.
步骤S302:将所述文本信息与所述目标业务文案进行比对,获得所述文本信息的正确率。Step S302: Compare the text information with the target business copy to obtain the correct rate of the text information.
可理解的是，所述文本信息为用户朗读所述目标业务文案所形成的文本，为了判断所述用户是否朗读了正确的业务文案，以及是否正确进行所述目标业务文案的朗读，可对所述文本信息中的固定内容进行提取，将提取的内容与所述目标业务文案进行对比，可将提取的内容与所述目标业务文案之间的相似度作为所述文本信息的正确率。It is understandable that the text information is the text formed by the user reading the target business copy aloud. In order to determine whether the user read the correct business copy and whether the target business copy was read aloud correctly, the fixed content in the text information can be extracted and compared with the target business copy, and the similarity between the extracted content and the target business copy can be taken as the correct rate of the text information.
步骤S303:在所述正确率大于预设正确率阈值时,通过正则表达式对所述文本进行信息提取,获得目标信息。Step S303: When the correctness rate is greater than a preset correctness rate threshold, information extraction is performed on the text through regular expressions to obtain target information.
It should be noted that the preset accuracy threshold may be set according to an empirical value, for example 80%. When the accuracy rate is greater than the preset threshold, the two contents are similar, that is, the accuracy of the text information is considered to meet the requirement, and the text information may be analyzed further.
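As a minimal sketch of this comparison, the similarity between the recognized text and the copy template can be computed with Python's difflib; the 80% threshold matches the empirical value mentioned above, and the sample strings are invented for illustration.

```python
from difflib import SequenceMatcher

def text_accuracy(recognized, copy_template):
    """Similarity ratio in [0, 1] between the recognized text and the
    business-copy template, used here as the accuracy rate."""
    return SequenceMatcher(None, recognized, copy_template).ratio()

# Hypothetical template and recognized reading.
template = "My name is ____ and my phone number is ____."
spoken = "My name is Alice and my phone number is 13800001111."
rate = text_accuracy(spoken, template)
proceed = rate > 0.8  # extract information only above the preset threshold
```

In practice a real system would compare only the fixed (non-blank) portions of the copy, as the text above describes.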
In a specific implementation, the requirement of extracting a character string at a specific position may be met through regular expressions. For extraction at a single position, the regular expression (.+?) may be used. For example, for the string "a123b", to extract the value 123 between a and b, findall may be used with the regular expression, which returns a list of all matches; digits such as the user's phone number and identity card number may be extracted from the corresponding positions in this way. For the string "a123b456b", matching all values between a and the last b, rather than between a and the first b, is controlled through the ? modifier, which switches between greedy and non-greedy matching: with non-greedy matching, only the content up to the nearest b is output. For extracting strings at multiple consecutive positions, the regular expression (?P<name>...) may be used. For example, given a line of web-server access log: '192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 "http://abc.com/search" "Mozilla/5.0"', all fields of the line may be extracted by writing multiple (?P<name>expr) groups, where name is a variable naming the string at that position and expr is the regular expression for that position. In this way, the content filled in by the user while reading aloud is extracted from the text information to obtain the target information.
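The patterns described above can be exercised directly with Python's re module; the field names in the named-group pattern are arbitrary labels chosen for this sketch.

```python
import re

# Non-greedy single-position extraction: (.+?) stops at the nearest b.
nearest = re.findall(r"a(.+?)b", "a123b456b")    # ['123']
# Greedy extraction: (.+) runs to the last b.
furthest = re.findall(r"a(.+)b", "a123b456b")    # ['123b456']

# Named groups (?P<name>expr) for several consecutive positions of the log line.
log = ('192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" '
       '200 44 "http://abc.com/search" "Mozilla/5.0"')
pattern = re.compile(r'(?P<ip>\S+) (?P<time>\S+) "(?P<request>[^"]+)" '
                     r'(?P<status>\d+) (?P<size>\d+)')
fields = pattern.match(log).groupdict()
```

The same named-group technique applies to extracting the user-filled blanks (phone number, ID number) from the recognized text.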
Further, in this embodiment, after step S303, the method further includes:
judging whether the target information satisfies a preset rule;
if not, issuing a prompt so that the user re-reads the target business copy;
if so, executing step S40.
It should be understood that the template data to be filled in according to the target business copy is analyzed in advance, and a corresponding rule is set for each item of information to be filled in. For example, if a phone number is an 11-digit number, the preset rule corresponding to the phone number in the target business copy is to judge whether the phone number is an 11-digit number. If the preset rule is satisfied, the phone number in the target information may be considered correct; if not, the phone number in the target information may be considered to have been read incorrectly, and a voice prompt may be issued, for example, prompting that the phone number should be 11 digits and that the content just read has an incorrect number of digits or one digit too many. A text prompt may also be used, for example, marking the wrong content in the text information in red and adding a text annotation beside it indicating the error. The input box may support correction, so that the user can modify the text information.
It should be understood that corresponding preset rules may also be set in advance for regions, for example by entering the geographic location information of a map in advance. When the content read from the target business copy is address information, it is judged whether the address information in the text information belongs to the pre-entered geographic location information; if it does, the address information read aloud is considered correct, and if it does not, the address information read aloud is considered wrong.
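The two preset rules just described (an 11-digit phone number, and an address drawn from pre-entered locations) can be sketched as follows; the rule table and the location set are illustrative, not data from the application.

```python
import re

# Pre-entered geographic locations (illustrative stand-in for map data).
PRESET_LOCATIONS = {"Beijing", "Shanghai", "Shenzhen"}

def phone_rule(value):
    """Preset rule: the phone number must be an 11-digit number."""
    return re.fullmatch(r"\d{11}", value) is not None

def address_rule(value):
    """Preset rule: the address must belong to the pre-entered locations."""
    return value in PRESET_LOCATIONS

def violated_fields(target_info):
    """Names of the fields whose preset rule is not satisfied."""
    rules = {"phone": phone_rule, "address": address_rule}
    return [name for name, rule in rules.items()
            if not rule(target_info.get(name, ""))]

# A 10-digit phone number violates the phone rule and would trigger a prompt.
errors = violated_fields({"phone": "1380000111", "address": "Shenzhen"})
```

An empty result would correspond to the "if satisfied, execute step S40" branch above.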
In this embodiment, the text information recognized from speech is compared with the target business copy, and when the accuracy rate is greater than a preset accuracy threshold, information is extracted from the text through regular expressions to obtain the target information, thereby improving the accuracy of information entry.
Referring to FIG. 4, FIG. 4 is a schematic flowchart of a third embodiment of the video and audio recognition method of this application. Based on the above first or second embodiment, a third embodiment of the video and audio recognition method of this application is proposed. This embodiment is described based on the first embodiment.
In the third embodiment, before step S40, the method further includes:
performing face recognition on the target video, and performing liveness detection on the recognized face;
when the liveness detection succeeds, executing step S40.
It should be understood that face recognition on the target video follows the same principle as finding a person in a picture: a video is a collection of pictures, so in essence a person is still found in pictures, and a rectangular frame is drawn around the found person and the recognized face to realize face recognition. Face detection is responsible for locating the position of the face, and face alignment aligns it; the algorithm uses an affine transformation to align the face according to the eye coordinates. A Visual Geometry Group Network (VGG) model is used for feature extraction: the get_feature_new function opens a picture and extracts features with the VGG network, and the compare_pic function computes the similarity between the two input features, where the key point is the selection of the threshold. The face_recog_test function reads the test pictures and calculates the best parameters for each group of pictures. The aligned face picture is saved for use in subsequent facial feature comparison. Seeta Face Engine or face alignment may be used for face recognition, the facial features are obtained from the input picture, and OpenCV's cv2.CascadeClassifier may be used for face detection.
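The source states only that compare_pic computes a similarity between two extracted feature vectors, without naming the metric. A minimal stand-in using cosine similarity (one common choice, assumed here, not confirmed by the application) is:

```python
import math

def compare_pic(feat_a, feat_b):
    """Cosine similarity between two face feature vectors.

    Cosine similarity is an assumption for this sketch; the application
    does not specify which similarity measure compare_pic uses."""
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The "key point" mentioned above is then choosing the decision threshold on this score.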
Further, performing face recognition on the target video and performing liveness detection on the recognized face includes:
performing face recognition on the target video, and cropping the eye region of the recognized face to obtain an eye region image;
recognizing, through a preset blink model, whether the eye region image contains a blinking action;
if a blinking action is recognized in the eye region image, determining that the liveness detection succeeds.
It should be understood that liveness detection is performed on the recognized face: whether the detected face moves, or whether it blinks, is used to judge whether it is a real person rather than a photo. First, face detection and eye localization are performed; then the eye region is cropped, and the degree of eye opening is calculated from the normalized image; a model for judging blinking actions is built based on a convolutional neural network, and the model recognizes whether a blinking action occurs in the image. A convolutional neural network model to be trained may be built in advance: a large number of sample images are obtained, the eye regions of the faces in the sample images are cropped to obtain sample eye images, and the sample blink information corresponding to each sample image is obtained; the convolutional neural network model to be trained is trained with the sample eye images and the corresponding sample blink information to obtain the preset blink model. The eye region image may then be recognized through the preset blink model; if a blinking action is recognized in the eye region image, the target video is considered to contain a real person, and the liveness detection is determined to be successful.
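The application uses a trained CNN as the blink model. As a simpler illustration of "computing the degree of eye opening and detecting an open-closed-open transition", the sketch below uses the common eye-aspect-ratio formulation over six eye landmarks; the landmark layout and the 0.2 threshold are assumptions for this sketch, not values from the source.

```python
import math

def eye_aspect_ratio(eye):
    """eye: six (x, y) landmarks around one eye, ordered as in the usual
    eye-aspect-ratio formulation; a small ratio means the eye is closed."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    vertical = dist(eye[1], eye[5]) + dist(eye[2], eye[4])
    horizontal = dist(eye[0], eye[3])
    return vertical / (2.0 * horizontal)

def blink_detected(ear_sequence, closed_thresh=0.2):
    """A blink is an open -> closed -> open transition across frames."""
    closed = [e < closed_thresh for e in ear_sequence]
    return any(not a and b and not c
               for a, b, c in zip(closed, closed[1:], closed[2:]))
```

A CNN-based model as described in the text would replace the fixed threshold with a learned classifier over the cropped eye images.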
In this embodiment, after step S40, the method further includes:
Step S401: Preprocess the user picture to obtain a preprocessed picture.
It should be understood that frame extraction on the target video usually yields multiple user pictures, which need to be processed further to obtain a user picture of better quality as the user's identity verification information. The user picture may be preprocessed in advance. The purpose of image preprocessing is to eliminate irrelevant information in the image and to remove or reduce, as far as possible, the interference of lighting, the imaging system, or the external environment, so that the image's features are clearly expressed. The preprocessing process includes steps such as light compensation, grayscale transformation, histogram equalization, normalization, geometric correction, filtering, and sharpening of the face image, so as to obtain the preprocessed picture.
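Two of the preprocessing steps named above, grayscale transformation and normalization, can be sketched in pure Python; the luma weights are the common ITU-R BT.601 coefficients, and the remaining steps (light compensation, equalization, etc.) are omitted from the sketch.

```python
def to_gray(pixel):
    """Grayscale value of an (R, G, B) pixel using BT.601 luma weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def normalize(gray_image):
    """Min-max normalization of a 2-D grayscale image to the range [0, 1]."""
    flat = [v for row in gray_image for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in gray_image]

rgb = [[(255, 255, 255), (0, 0, 0)]]            # a 1x2 toy image
norm = normalize([[to_gray(p) for p in row] for row in rgb])
```

A production pipeline would of course operate on real image arrays (e.g. via OpenCV) rather than nested lists.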
Step S402: Screen the preprocessed pictures according to sharpness to obtain a screened picture.
It should be noted that there are usually multiple preprocessed pictures, from which the pictures with higher sharpness are selected for face recognition. Sharpness is an important indicator of image quality, and it may be evaluated with the re-blur (Reblur) algorithm: if an image is already blurred, blurring it once more changes its high-frequency components little; but if the original image is sharp, blurring it once changes the high-frequency components greatly. Therefore, a Gaussian blur may be applied once to the image under evaluation to obtain its degraded image; the changes in adjacent pixel values of the original image and the degraded image are then compared, and the sharpness value is determined from the magnitude of the change: the smaller the calculated result, the sharper the image, and conversely the blurrier. This approach may be called a sharpness algorithm based on secondary blurring. Specifically, the preprocessed picture is passed through a low-pass filter to obtain a blurred image; the change in gray value between adjacent pixels in the preprocessed picture is calculated to obtain a first pixel change value, and the change in gray value between adjacent pixels in the blurred image is calculated to obtain a second pixel change value; the first pixel change value and the second pixel change value are compared, analyzed, and normalized to obtain a sharpness result, and the preprocessed pictures are screened according to the sharpness result to obtain the screened picture.
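A one-dimensional sketch of this re-blur idea, with a 3-tap moving average standing in for the Gaussian blur (the filter choice and the toy signals are assumptions of the sketch):

```python
def box_blur(signal):
    """Crude low-pass filter: 3-tap moving average with edge clamping
    (a stand-in for the Gaussian blur of the re-blur method)."""
    n = len(signal)
    return [(signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]

def total_variation(signal):
    """Sum of absolute changes between adjacent values
    (the 'pixel change value' of the text)."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))

def sharpness_score(signal):
    """Ratio of the blurred signal's variation to the original's:
    smaller means sharper, matching the description above."""
    tv = total_variation(signal)
    return total_variation(box_blur(signal)) / tv if tv else 1.0

sharp = [0, 1, 0, 1, 0, 1]               # high-frequency: blurring changes it a lot
smooth = [0, 0.2, 0.4, 0.6, 0.8, 1.0]    # already smooth: blurring changes little
```

Ranking the preprocessed pictures by this score and keeping the lowest-scoring ones corresponds to the screening in step S402.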
Step S403: Compare the screened picture with a preset picture to obtain a comparison result.
In a specific implementation, facial feature point localization is performed on the screened picture to obtain the face feature points to be processed corresponding to the screened picture; the face feature points to be processed are compared with preset frontal-face feature points to obtain a homography matrix; the face in the photo is transformed through the homography matrix to obtain a calibrated face picture. The preset picture is a picture of the user in the public security system; the calibrated face picture is compared with the features of each photo in the public security system library through a convolutional neural network model to obtain the face similarity between the screened picture and each preset picture, and the face similarity is taken as the comparison result.
Correspondingly, step S50 includes:
Step S501: When the comparison result exceeds a preset similarity threshold, generate a target business document of the user according to the screened picture and the target information.
It should be understood that the face similarity is taken as the comparison result; if the face similarity exceeds the preset similarity threshold, the identity of the user is considered verified, and a business profile may then be created for the user. The preset similarity threshold may be set according to an empirical value, for example 40%. The facial features are compared and the face similarity is calculated; if the preset similarity threshold is set to 0.4, a similarity greater than 40% is considered to indicate the same person, in which case the target business document of the user may be generated according to the screened picture and the target information.
In this embodiment, step S20 includes:
shooting a target video of the user reading the target business copy aloud while playing target music;
performing audio separation on the target video through an audio-video separator to obtain mixed audio information;
extracting, through a computational auditory scene analysis algorithm, the target audio information of the user reading the target business copy from the mixed audio information.
It should be understood that, to ensure the security of personal information, the target music may be played while the user reads the target business copy aloud; the target music creates a noisy voice environment and prevents the user's personal information from being overheard by others. The target video captured at this time includes both the target music and the audio of the user reading the target business copy. Audio separation is performed on the target video through an audio-video separator to obtain mixed audio information, and a Computational Auditory Scene Analysis (CASA) algorithm is further needed to simulate the human auditory system and extract the speech read by the user from the noisy environment. The audio information is encoded to achieve grouping and parsing. There are currently dozens of grouping criteria related to time and frequency, including pitch, spatial position, and onset/offset time. Pitch is a very important grouping criterion; it identifies the unique characteristics of a sound according to its harmonic pattern. When two or more microphones are used, the sound isolation system can determine the direction and distance of each sound source from the spatial position information. The CASA modeling approach enables the sound isolation system to focus on a particular sound source, such as a specific person, and to mask out background sound. Onset/offset grouping refers to the moments at which a sound component starts and stops; when these data are combined with the original frequency data, it can be judged whether the components come from the same sound source. A series of noises can thus be masked out to focus on identifying a particular sound source. Sounds with similar attributes form the same audio stream, and likewise sounds with different attributes form their own audio streams; these different audio streams may be used to identify continuous or repetitive sound sources. With enough sound groupings, the actual sound isolation process can match against the identified sound sources and respond to the real speaker's voice, thereby separating out the target audio information of the user reading the target business copy.
In this embodiment, the user pictures are processed to obtain screened pictures of better quality, and the screened pictures are compared with preset pictures in the public security system to verify the identity of the user, improving the security and reliability of information entry.
In addition, an embodiment of this application further proposes a storage medium, which may be non-volatile or volatile. A video and audio recognition program is stored on the storage medium, and when executed by a processor, the video and audio recognition program implements the steps of the video and audio recognition method described above.
In addition, referring to FIG. 5, an embodiment of this application further proposes a video and audio recognition apparatus, which includes:
a searching module 10, configured to receive a target business type input by a user, search for a corresponding target business copy according to the target business type, and display the target business copy.
It should be understood that, in a web page or an app, various business types may be presented through options, and the user selects the target business type to be handled. When the target business type input by the user is received, the target business copy corresponding to the target business type is searched for in a preset mapping relationship table, which records the correspondence between business types and business copies. The target business type includes businesses such as loans, leasing, or insurance, and the target business copy is the user-related information that needs to be collected for each business type. For example, every business type needs to collect the user's basic personal information, such as a piece of personal-information copy: "I am xxx, my ID number is xxxxx, I am from the xxx region", and so on. Different business types also need to collect information specific to that type; for example, a loan business further collects information such as whether a loan is being repaid, whether the user owns real estate or a car, and the annual income. The corresponding business copy may be built in advance according to the business type, presenting the information to be collected in fill-in-the-blank form.
an audio separation module 20, configured to shoot a target video of the user reading the target business copy aloud, and perform audio separation on the target video through an audio-video separator to obtain target audio information.
It should be noted that, when reading the target business copy aloud, the user may fill in the blanks with his or her own information while reading. The process of the user reading the target business copy is filmed, and the video recording may be performed through the camera function of the video and audio recognition device, such as the recording function of a smartphone. There is a camera button in the web page or app: when the target business copy is displayed in the web page or app, a camera button is set above or below the business copy, and by tapping this button the user films himself or herself reading the target business copy aloud, obtaining the target video.
It should be understood that audio separation usually takes the sound and the images of a video out separately. The steps of separating the audio are: setting the audio source; obtaining the number of tracks in the source file and traversing them to find the required audio track; and extracting the found audio track to obtain the target audio information.
a text recognition module 30, configured to perform text recognition on the target audio information to obtain target information.
In a specific implementation, the silence at the beginning and end of the target audio information is removed to reduce interference with subsequent steps. The first audio information obtained after silence removal is divided into frames, that is, cut into small segments, each called a frame; the framing operation is generally implemented with a moving window function. After framing, the first audio information becomes many small segments. The waveform is then transformed to extract Mel-scale Frequency Cepstral Coefficients (MFCC) features, turning each frame of the waveform into a multi-dimensional vector. Next, frames are recognized as states, states are combined into phonemes, and phonemes are combined into words. Several frames of speech correspond to one state, every three states are combined into one phoneme, and several phonemes are combined into one word, so as to obtain the corresponding text information; the content filled in by the user may be extracted from the text information as the target information.
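The frame-to-state-to-phoneme-to-word combination described above can be illustrated with a toy decoder; the state, phoneme, and word inventories below are entirely made up for the sketch, and a real recognizer searches a state network rather than using lookup tables.

```python
# Made-up inventories: every three states name one phoneme,
# and a phoneme sequence names one word.
PHONEME_OF = {("s1", "s2", "s3"): "n", ("s4", "s5", "s6"): "i"}
WORD_OF = {("n", "i"): "ni"}

def decode(states):
    """Combine states into phonemes (three at a time), then into a word."""
    triples = [tuple(states[i:i + 3]) for i in range(0, len(states), 3)]
    phonemes = tuple(PHONEME_OF[t] for t in triples)
    return WORD_OF[phonemes]

word = decode(["s1", "s2", "s3", "s4", "s5", "s6"])
```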
a frame extraction module 40, configured to perform frame extraction on the target video to obtain a user picture.
It should be understood that the target video is instantiated and initialized at the same time; the total number of frames of the target video is obtained and printed; a variable is defined to store each frame of the image, together with a loop flag and the current frame; each frame of the target video is read; a string stream converts the long-integer frame index into a character string; a frame is captured every 10 frames and converted into a picture for output; and, as the end condition, when the current frame number is greater than the total number of frames, the loop stops, and the output pictures are the user pictures.
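The keep-one-frame-every-ten loop described above reduces to the following sampling logic; the frames are represented here by placeholder strings rather than decoded video frames.

```python
def extract_every_nth(frames, step=10):
    """Read frames one by one and keep one every `step` frames,
    mirroring the loop described in the text."""
    kept = []
    for current, frame in enumerate(frames):
        if current % step == 0:
            kept.append(frame)
    return kept

# A hypothetical 25-frame video keeps frames 0, 10, and 20.
pictures = extract_every_nth([f"frame{i}" for i in range(25)])
```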
a generating module 50, configured to generate a target business document of the user according to the user picture and the target information.
It should be noted that the user picture may serve as the user's identity verification information; the user's voiceprint may also be extracted from the audio, the extracted voiceprint may serve as the user's identity, and identity verification may be performed according to the voiceprint. The target information is the information about the user extracted from the text the user read aloud; the user picture and the target information are combined to generate a data document, namely the target business document, which thus includes the user's identity verification information and the various items of user information required by the target business type.
In this embodiment, a target business type input by a user is received, a corresponding target business copy is searched for according to the target business type, and the target business copy is displayed; a target video of the user reading the target business copy aloud is shot, and audio separation is performed on the target video through an audio-video separator to obtain target audio information, reducing the tedious steps of manual input through reading aloud; text recognition is performed on the target audio information to obtain target information, and frame extraction is performed on the target video to obtain a user picture, so as to verify the user's identity; and a target business document of the user is generated according to the user picture and the target information. Based on artificial intelligence, multiple kinds of data are obtained by parsing the video, verifying the user's identity while improving the efficiency of the user's information entry.
In an embodiment, the text recognition module 30 is further configured to perform text recognition on the target audio information to obtain corresponding text information; compare the text information with the target business copy to obtain the accuracy rate of the text information; and, when the accuracy rate is greater than a preset accuracy threshold, extract information from the text through regular expressions to obtain the target information.
In an embodiment, the video and audio recognition apparatus further includes:
a judging module, configured to judge whether the target information satisfies a preset rule;
a prompting module, configured to issue a prompt if the rule is not satisfied, so that the user re-reads the target business copy;
the frame extraction module 40 is further configured to, if the rule is satisfied, execute the step of performing frame extraction on the target video to obtain a user picture.
In an embodiment, the video and audio recognition apparatus further includes:
a liveness detection module, configured to perform face recognition on the target video and perform liveness detection on the recognized face;
the frame extraction module 40 is further configured to, when the liveness detection succeeds, execute the step of performing frame extraction on the target video to obtain a user picture.
In an embodiment, the liveness detection module is further configured to perform face recognition on the target video, crop the eye region of the recognized face to obtain an eye region image, recognize through a preset blink model whether the eye region image contains a blinking action, and, if a blinking action is recognized in the eye region image, determine that the liveness detection succeeds.
In an embodiment, the video and audio recognition apparatus further includes:
a preprocessing module, configured to preprocess the user picture to obtain a preprocessed picture;
a screening module, configured to screen the preprocessed pictures according to sharpness to obtain a screened picture;
a comparison module, configured to compare the screened picture with a preset picture to obtain a comparison result;
the generating module 50 is further configured to, when the comparison result exceeds a preset similarity threshold, generate a target business document of the user according to the screened picture and the target information.
在一实施例中，所述音频分离模块20，还用于播放目标音乐的同时，拍摄所述用户朗读所述目标业务文案的目标视频；In an embodiment, the audio separation module 20 is further configured to shoot the target video of the user reading the target business copy aloud while the target music is playing;
通过音视频分离器对所述目标视频进行音频分离,获得混合音频信息;Performing audio separation on the target video by an audio and video separator to obtain mixed audio information;
通过计算听觉场景分析算法从所述混合音频信息中提取所述用户朗读所述目标业务文案的目标音频信息。Extracting the target audio information of the user reading the target business copy from the mixed audio information through a computational auditory scene analysis algorithm.
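A full computational auditory scene analysis algorithm is beyond what can be shown here, but one property this embodiment can exploit is that the "target music" is played by the device itself, so its waveform is known. As a simplified stand-in for the claimed algorithm (not CASA itself), the known reference can be projected out of the recorded mixture by least squares:

```python
def estimate_gain(mixture, reference):
    """Least-squares gain of the known reference (the played music) inside the
    recorded mixture; valid when the speech is uncorrelated with the music."""
    num = sum(m * r for m, r in zip(mixture, reference))
    den = sum(r * r for r in reference)
    return num / den

def subtract_reference(mixture, reference, gain):
    """Subtract the scaled reference from the mixture, leaving a speech
    estimate. Signals are equal-length, time-aligned lists of float samples."""
    return [m - gain * r for m, r in zip(mixture, reference)]
```

Real recordings would additionally need time alignment and frequency-domain processing; this sketch only illustrates why playing a known music track makes the separation tractable.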
本申请所述视频音频识别装置的其他实施例或具体实现方式可参照上述各方法实施例,此处不再赘述。For other embodiments or specific implementations of the video and audio recognition device described in the present application, reference may be made to the foregoing method embodiments, and details are not described herein again.
需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。It should be noted that, herein, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system including a series of elements not only includes those elements but also includes other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or system that includes the element.
上述本申请实施例序号仅仅为了描述，不代表实施例的优劣。在列举了若干装置的单元权利要求中，这些装置中的若干个可以是通过同一个硬件项来具体体现。词语第一、第二、以及第三等的使用不表示任何顺序，可将这些词语解释为标识。The serial numbers of the foregoing embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments. In a unit claim listing several devices, several of these devices may be embodied by the same hardware item. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as labels.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质（如只读存储器镜像（Read Only Memory image，ROM）/随机存取存储器（Random Access Memory，RAM）、磁碟、光盘）中，包括若干指令用以使得一台终端设备（可以是手机，计算机，服务器，空调器，或者网络设备等）执行本申请各个实施例所述的方法。Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a read-only memory image (ROM)/random access memory (RAM), magnetic disk, or optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to execute the methods described in the embodiments of this application.
以上仅为本申请的优选实施例，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。The above are only preferred embodiments of this application and do not thereby limit its patent scope; any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (20)

  1. 一种视频音频识别方法,其中,所述视频音频识别方法包括以下步骤:A video and audio recognition method, wherein the video and audio recognition method includes the following steps:
    接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;Receiving the target business type input by the user, searching for a corresponding target business copy based on the target business type, and displaying the target business copy;
    拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;Shooting the target video in which the user reads the target business copy, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
    对所述目标音频信息进行文字识别,获得目标信息;Perform text recognition on the target audio information to obtain target information;
    对所述目标视频进行抽帧处理,获得用户图片;Performing frame extraction processing on the target video to obtain a user picture;
    根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The target business document of the user is generated according to the user picture and the target information.
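At its core, the frame-extraction (抽帧) step of claim 1 samples the decoded video at a fixed interval. A minimal sketch, assuming the frames have already been decoded into a sequence (the interval of 25 is an illustrative assumption, e.g. one frame per second of 25 fps video):

```python
def extract_frames(frames, interval=25):
    """Keep every `interval`-th frame of the decoded target video as a
    candidate user picture."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return [frame for i, frame in enumerate(frames) if i % interval == 0]
```

In a real system the frames would be decoded from the recorded target video by a video library before sampling.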
  2. 如权利要求1所述的视频音频识别方法，其中，所述对所述目标音频信息进行文字识别，获得目标信息，包括：2. The video and audio recognition method according to claim 1, wherein said performing text recognition on said target audio information to obtain target information comprises:
    对所述目标音频信息进行文字识别,获得对应的文本信息;Perform text recognition on the target audio information to obtain corresponding text information;
    将所述文本信息与所述目标业务文案进行比对,获得所述文本信息的正确率;Comparing the text information with the target business copy to obtain the correct rate of the text information;
    在所述正确率大于预设正确率阈值时,通过正则表达式对所述文本进行信息提取,获得目标信息。When the correct rate is greater than a preset correct rate threshold, information extraction is performed on the text through regular expressions to obtain target information.
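Claim 2's two steps, a correct rate measured against the business copy followed by regular-expression extraction, can be sketched as follows. The similarity measure and the field patterns are illustrative assumptions, not the claimed implementation:

```python
import re
from difflib import SequenceMatcher

def correct_rate(recognized, script):
    """Similarity in [0, 1] between the speech-recognition transcript and the
    target business copy the user was asked to read aloud."""
    return SequenceMatcher(None, recognized, script).ratio()

def extract_target_info(text, patterns):
    """Pull named fields out of the transcript with regular expressions.
    `patterns` maps a field name to a regex with one capture group."""
    info = {}
    for name, pattern in patterns.items():
        m = re.search(pattern, text)
        if m:
            info[name] = m.group(1)
    return info
```

Extraction would only run once `correct_rate` exceeds the preset threshold, mirroring the claim's ordering.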
  3. 如权利要求2所述的视频音频识别方法，其中，所述在所述正确率大于预设正确率阈值时，通过正则表达式对所述文本进行信息提取，获得目标信息之后，所述视频音频识别方法包括：3. The video and audio recognition method according to claim 2, wherein after the step of, when the correct rate is greater than the preset correct-rate threshold, performing information extraction on the text through regular expressions to obtain the target information, the video and audio recognition method comprises:
    判断所述目标信息是否满足预设规则;Judging whether the target information satisfies a preset rule;
    若不满足,则进行提示,以使所述用户重新朗读所述目标业务文案;If it is not satisfied, a prompt is given to make the user re-read the target business copy;
    若满足,则执行所述对所述目标视频进行抽帧处理,获得用户图片的步骤。If it is satisfied, execute the step of performing frame extraction processing on the target video to obtain a user picture.
  4. 如权利要求1所述的视频音频识别方法，其中，所述对所述目标视频进行抽帧处理，获得用户图片之前，所述视频音频识别方法还包括：4. The video and audio recognition method according to claim 1, wherein, before the frame extraction processing is performed on the target video to obtain a user picture, the video and audio recognition method further comprises:
    对所述目标视频进行人脸识别,对识别到的人脸进行活体检测;Performing face recognition on the target video, and performing live detection on the recognized face;
    在活体检测成功时,执行所述对所述目标视频进行抽帧处理,获得用户图片的步骤。When the living body detection is successful, the step of performing frame extraction processing on the target video to obtain a user picture is performed.
  5. 如权利要求4所述的视频音频识别方法,其中,所述对所述目标视频进行人脸识别,对识别到的人脸进行活体检测,包括:5. The video and audio recognition method according to claim 4, wherein said performing face recognition on said target video and performing live detection on the recognized face comprises:
    对所述目标视频进行人脸识别,对识别到的人脸的眼部区域进行截取,获得眼部区域图像;Performing face recognition on the target video, and intercepting the eye area of the recognized face to obtain an eye area image;
    通过预设眨眼模型识别所述眼部区域图像是否有眨眼动作;Recognizing whether there is a blinking action in the eye area image by using a preset blinking model;
    若识别到所述眼部区域图像有眨眼动作,则认定活体检测成功。If it is recognized that the eye area image has a blinking action, it is determined that the living body detection is successful.
  6. 如权利要求1所述的视频音频识别方法，其中，所述对所述目标视频进行抽帧处理，获得用户图片之后，所述视频音频识别方法还包括：6. The video and audio recognition method according to claim 1, wherein, after the frame extraction processing is performed on the target video to obtain a user picture, the video and audio recognition method further comprises:
    对所述用户图片进行预处理,获得预处理图片;Preprocessing the user picture to obtain a preprocessed picture;
    根据清晰度对所述预处理图片进行筛选,获得筛选图片;Filter the pre-processed pictures according to the definition to obtain the filtered pictures;
    将所述筛选图片与预设图片进行对比,获得比对结果;Comparing the screened picture with a preset picture to obtain a comparison result;
    相应地,所述根据所述用户图片和所述目标信息生成所述用户的目标业务文档,包括:Correspondingly, the generating the target business document of the user according to the user picture and the target information includes:
    在所述对比结果超过预设相似度阈值时,根据所述筛选图片和所述目标信息生成所述用户的目标业务文档。When the comparison result exceeds a preset similarity threshold, the user's target business document is generated according to the screened picture and the target information.
  7. 如权利要求1-6中任一项所述的视频音频识别方法，其中，所述拍摄所述用户朗读所述目标业务文案的目标视频，通过音视频分离器对所述目标视频进行音频分离，获得目标音频信息，包括：7. The video and audio recognition method according to any one of claims 1-6, wherein said shooting the target video in which the user reads the target business copy aloud and performing audio separation on the target video through an audio-video separator to obtain the target audio information comprises:
    播放目标音乐的同时,拍摄所述用户朗读所述目标业务文案的目标视频;While playing the target music, shoot the target video of the user reading the target business copy;
    通过音视频分离器对所述目标视频进行音频分离,获得混合音频信息;Performing audio separation on the target video by an audio and video separator to obtain mixed audio information;
    通过计算听觉场景分析算法从所述混合音频信息中提取所述用户朗读所述目标业务文案的目标音频信息。Extracting the target audio information of the user reading the target business copy from the mixed audio information through a computational auditory scene analysis algorithm.
  8. 一种视频音频识别设备，其中，所述视频音频识别设备包括：存储器、处理器及存储在所述存储器上并可在所述处理器上运行的视频音频识别程序，所述视频音频识别程序被所述处理器执行时实现以下步骤：8. A video and audio recognition device, wherein the video and audio recognition device includes a memory, a processor, and a video and audio recognition program stored on the memory and executable on the processor, and the video and audio recognition program, when executed by the processor, implements the following steps:
    接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;Receiving the target business type input by the user, searching for a corresponding target business copy based on the target business type, and displaying the target business copy;
    拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;Shooting the target video in which the user reads the target business copy, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
    对所述目标音频信息进行文字识别,获得目标信息;Perform text recognition on the target audio information to obtain target information;
    对所述目标视频进行抽帧处理,获得用户图片;Performing frame extraction processing on the target video to obtain a user picture;
    根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The target business document of the user is generated according to the user picture and the target information.
  9. 如权利要求8所述的视频音频识别设备，其中，所述对所述目标音频信息进行文字识别，获得目标信息，包括：9. The video and audio recognition device according to claim 8, wherein said performing text recognition on said target audio information to obtain target information comprises:
    对所述目标音频信息进行文字识别,获得对应的文本信息;Perform text recognition on the target audio information to obtain corresponding text information;
    将所述文本信息与所述目标业务文案进行比对,获得所述文本信息的正确率;Comparing the text information with the target business copy to obtain the correct rate of the text information;
    在所述正确率大于预设正确率阈值时,通过正则表达式对所述文本进行信息提取,获得目标信息。When the correct rate is greater than a preset correct rate threshold, information extraction is performed on the text through regular expressions to obtain target information.
  10. 如权利要求9所述的视频音频识别设备，其中，所述在所述正确率大于预设正确率阈值时，通过正则表达式对所述文本进行信息提取，获得目标信息之后，所述视频音频识别方法包括：10. The video and audio recognition device according to claim 9, wherein after the step of, when the correct rate is greater than the preset correct-rate threshold, performing information extraction on the text through regular expressions to obtain the target information, the video and audio recognition method comprises:
    判断所述目标信息是否满足预设规则;Judging whether the target information satisfies a preset rule;
    若不满足,则进行提示,以使所述用户重新朗读所述目标业务文案;If it is not satisfied, a prompt is given to make the user re-read the target business copy;
    若满足,则执行所述对所述目标视频进行抽帧处理,获得用户图片的步骤。If it is satisfied, execute the step of performing frame extraction processing on the target video to obtain a user picture.
  11. 如权利要求8所述的视频音频识别设备，其中，所述对所述目标视频进行抽帧处理，获得用户图片之前，所述视频音频识别方法还包括：11. The video and audio recognition device according to claim 8, wherein, before the frame extraction processing is performed on the target video to obtain a user picture, the video and audio recognition method further comprises:
    对所述目标视频进行人脸识别,对识别到的人脸进行活体检测;Performing face recognition on the target video, and performing live detection on the recognized face;
    在活体检测成功时,执行所述对所述目标视频进行抽帧处理,获得用户图片的步骤。When the living body detection is successful, the step of performing frame extraction processing on the target video to obtain a user picture is performed.
  12. 如权利要求11所述的视频音频识别设备，其中，所述对所述目标视频进行人脸识别，对识别到的人脸进行活体检测，包括：12. The video and audio recognition device according to claim 11, wherein said performing face recognition on said target video and performing living body detection on the recognized face comprises:
    对所述目标视频进行人脸识别,对识别到的人脸的眼部区域进行截取,获得眼部区域图像;Performing face recognition on the target video, and intercepting the eye area of the recognized face to obtain an eye area image;
    通过预设眨眼模型识别所述眼部区域图像是否有眨眼动作;Recognizing whether there is a blinking action in the eye area image by using a preset blinking model;
    若识别到所述眼部区域图像有眨眼动作,则认定活体检测成功。If it is recognized that the eye area image has a blinking action, it is determined that the living body detection is successful.
  13. 如权利要求8所述的视频音频识别设备，其中，所述对所述目标视频进行抽帧处理，获得用户图片之后，所述视频音频识别方法还包括：13. The video and audio recognition device according to claim 8, wherein, after the frame extraction processing is performed on the target video to obtain a user picture, the video and audio recognition method further comprises:
    对所述用户图片进行预处理,获得预处理图片;Preprocessing the user picture to obtain a preprocessed picture;
    根据清晰度对所述预处理图片进行筛选,获得筛选图片;Filter the pre-processed pictures according to the definition to obtain the filtered pictures;
    将所述筛选图片与预设图片进行对比,获得比对结果;Comparing the screened picture with a preset picture to obtain a comparison result;
    相应地,所述根据所述用户图片和所述目标信息生成所述用户的目标业务文档,包括:Correspondingly, the generating the target business document of the user according to the user picture and the target information includes:
    在所述对比结果超过预设相似度阈值时,根据所述筛选图片和所述目标信息生成所述用户的目标业务文档。When the comparison result exceeds a preset similarity threshold, the user's target business document is generated according to the screened picture and the target information.
  14. 如权利要求8-13中任一项所述的视频音频识别设备，其中，所述拍摄所述用户朗读所述目标业务文案的目标视频，通过音视频分离器对所述目标视频进行音频分离，获得目标音频信息，包括：14. The video and audio recognition device according to any one of claims 8-13, wherein said shooting the target video in which the user reads the target business copy aloud and performing audio separation on the target video through an audio-video separator to obtain the target audio information comprises:
    播放目标音乐的同时,拍摄所述用户朗读所述目标业务文案的目标视频;While playing the target music, shoot the target video of the user reading the target business copy;
    通过音视频分离器对所述目标视频进行音频分离,获得混合音频信息;Performing audio separation on the target video by an audio and video separator to obtain mixed audio information;
    通过计算听觉场景分析算法从所述混合音频信息中提取所述用户朗读所述目标业务文案的目标音频信息。Extracting the target audio information of the user reading the target business copy from the mixed audio information through a computational auditory scene analysis algorithm.
  15. 一种存储介质，其中，所述存储介质上存储有视频音频识别程序，所述视频音频识别程序被处理器执行时实现以下步骤：15. A storage medium, wherein a video and audio recognition program is stored on the storage medium, and the video and audio recognition program, when executed by a processor, implements the following steps:
    接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;Receiving the target business type input by the user, searching for a corresponding target business copy based on the target business type, and displaying the target business copy;
    拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;Shooting the target video in which the user reads the target business copy, and performing audio separation on the target video through an audio-video separator to obtain target audio information;
    对所述目标音频信息进行文字识别,获得目标信息;Perform text recognition on the target audio information to obtain target information;
    对所述目标视频进行抽帧处理,获得用户图片;Performing frame extraction processing on the target video to obtain a user picture;
    根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The target business document of the user is generated according to the user picture and the target information.
  16. 如权利要求15所述的存储介质，其中，所述对所述目标音频信息进行文字识别，获得目标信息，包括：16. The storage medium according to claim 15, wherein said performing text recognition on said target audio information to obtain target information comprises:
    对所述目标音频信息进行文字识别,获得对应的文本信息;Perform text recognition on the target audio information to obtain corresponding text information;
    将所述文本信息与所述目标业务文案进行比对,获得所述文本信息的正确率;Comparing the text information with the target business copy to obtain the correct rate of the text information;
    在所述正确率大于预设正确率阈值时,通过正则表达式对所述文本进行信息提取,获得目标信息。When the correct rate is greater than a preset correct rate threshold, information extraction is performed on the text through regular expressions to obtain target information.
  17. 如权利要求16所述的存储介质，其中，所述在所述正确率大于预设正确率阈值时，通过正则表达式对所述文本进行信息提取，获得目标信息之后，所述视频音频识别方法包括：17. The storage medium according to claim 16, wherein after the step of, when the correct rate is greater than the preset correct-rate threshold, performing information extraction on the text through regular expressions to obtain the target information, the video and audio recognition method comprises:
    判断所述目标信息是否满足预设规则;Judging whether the target information satisfies a preset rule;
    若不满足,则进行提示,以使所述用户重新朗读所述目标业务文案;If it is not satisfied, a prompt is given to make the user re-read the target business copy;
    若满足,则执行所述对所述目标视频进行抽帧处理,获得用户图片的步骤。If it is satisfied, execute the step of performing frame extraction processing on the target video to obtain a user picture.
  18. 如权利要求15所述的存储介质，其中，所述对所述目标视频进行抽帧处理，获得用户图片之前，所述视频音频识别方法还包括：18. The storage medium according to claim 15, wherein, before the frame extraction processing is performed on the target video to obtain a user picture, the video and audio recognition method further comprises:
    对所述目标视频进行人脸识别,对识别到的人脸进行活体检测;Performing face recognition on the target video, and performing live detection on the recognized face;
    在活体检测成功时,执行所述对所述目标视频进行抽帧处理,获得用户图片的步骤。When the living body detection is successful, the step of performing frame extraction processing on the target video to obtain a user picture is performed.
  19. 如权利要求18所述的存储介质，其中，所述对所述目标视频进行人脸识别，对识别到的人脸进行活体检测，包括：19. The storage medium according to claim 18, wherein said performing face recognition on said target video and performing living body detection on the recognized face comprises:
    对所述目标视频进行人脸识别,对识别到的人脸的眼部区域进行截取,获得眼部区域图像;Performing face recognition on the target video, and intercepting the eye area of the recognized face to obtain an eye area image;
    通过预设眨眼模型识别所述眼部区域图像是否有眨眼动作;Recognizing whether there is a blinking action in the eye area image by using a preset blinking model;
    若识别到所述眼部区域图像有眨眼动作,则认定活体检测成功。If it is recognized that the eye area image has a blinking action, it is determined that the living body detection is successful.
  20. 一种视频音频识别装置，其中，所述视频音频识别装置包括：20. A video and audio recognition device, wherein the video and audio recognition device includes:
    查找模块,用于接收用户输入的目标业务类型,根据所述目标业务类型查找对应的目标业务文案,将所述目标业务文案进行展示;The search module is configured to receive the target business type input by the user, search for the corresponding target business copy based on the target business type, and display the target business copy;
    音频分离模块,用于拍摄所述用户朗读所述目标业务文案的目标视频,通过音视频分离器对所述目标视频进行音频分离,获得目标音频信息;An audio separation module, configured to shoot a target video in which the user reads the target business copy, and perform audio separation on the target video through an audio-video separator to obtain target audio information;
    文字识别模块,用于对所述目标音频信息进行文字识别,获得目标信息;The text recognition module is used to perform text recognition on the target audio information to obtain target information;
    抽帧处理模块,用于对所述目标视频进行抽帧处理,获得用户图片;The frame extraction processing module is used to perform frame extraction processing on the target video to obtain user pictures;
    生成模块,用于根据所述用户图片和所述目标信息生成所述用户的目标业务文档。The generating module is used to generate the target business document of the user according to the user picture and the target information.
PCT/CN2020/102532 2019-12-26 2020-07-17 Video and audio recognition method, apparatus and device and storage medium WO2021128817A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911374298.1 2019-12-26
CN201911374298.1A CN111191073A (en) 2019-12-26 2019-12-26 Video and audio recognition method, device, storage medium and device

Publications (1)

Publication Number Publication Date
WO2021128817A1 true WO2021128817A1 (en) 2021-07-01

Family

ID=70710065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/102532 WO2021128817A1 (en) 2019-12-26 2020-07-17 Video and audio recognition method, apparatus and device and storage medium

Country Status (2)

Country Link
CN (1) CN111191073A (en)
WO (1) WO2021128817A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191073A (en) * 2019-12-26 2020-05-22 深圳壹账通智能科技有限公司 Video and audio recognition method, device, storage medium and device
CN111814714B (en) * 2020-07-15 2024-03-29 前海人寿保险股份有限公司 Image recognition method, device, equipment and storage medium based on audio and video recording
CN112734752B (en) * 2021-01-25 2021-10-01 上海微亿智造科技有限公司 Method and system for image screening in flying shooting process
CN112911180A (en) * 2021-01-28 2021-06-04 中国建设银行股份有限公司 Video recording method and device, electronic equipment and readable storage medium
CN115250375B (en) * 2021-04-26 2024-01-26 北京中关村科金技术有限公司 Audio and video content compliance detection method and device based on fixed telephone technology
CN113822195B (en) * 2021-09-23 2023-01-24 四川云恒数联科技有限公司 Government affair platform user behavior recognition feedback method based on video analysis

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325742A (en) * 2018-09-26 2019-02-12 平安普惠企业管理有限公司 Business approval method, apparatus, computer equipment and storage medium
CN110147726A (en) * 2019-04-12 2019-08-20 财付通支付科技有限公司 Business quality detecting method and device, storage medium and electronic device
US20190313014A1 (en) * 2015-06-25 2019-10-10 Amazon Technologies, Inc. User identification based on voice and face
CN111191073A (en) * 2019-12-26 2020-05-22 深圳壹账通智能科技有限公司 Video and audio recognition method, device, storage medium and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840406B (en) * 2017-11-29 2022-05-17 百度在线网络技术(北京)有限公司 Living body verification method and device and computer equipment
CN110348378A (en) * 2019-07-10 2019-10-18 北京旷视科技有限公司 A kind of authentication method, device and storage medium


Also Published As

Publication number Publication date
CN111191073A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
WO2021128817A1 (en) Video and audio recognition method, apparatus and device and storage medium
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
Jafar et al. Forensics and analysis of deepfake videos
CN109697416B (en) Video data processing method and related device
CN109660744A (en) The double recording methods of intelligence, equipment, storage medium and device based on big data
US10970909B2 (en) Method and apparatus for eye movement synthesis
CN114465737B (en) Data processing method and device, computer equipment and storage medium
CN112148922A (en) Conference recording method, conference recording device, data processing device and readable storage medium
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN109829363A (en) Expression recognition method, device, computer equipment and storage medium
JP7148737B2 (en) Liveness detection verification method, liveness detection verification system, recording medium, and liveness detection verification system training method
US20230058259A1 (en) System and Method for Video Authentication
Korshunov et al. Tampered speaker inconsistency detection with phonetically aware audio-visual features
CN110493612A (en) Processing method, server and the computer readable storage medium of barrage information
CN111950327A (en) Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment
CN116708055B (en) Intelligent multimedia audiovisual image processing method, system and storage medium
CN111401198B (en) Audience emotion recognition method, device and system
CN114902217A (en) System for authenticating digital content
Lucey et al. Continuous pose-invariant lipreading
CN112466306B (en) Conference summary generation method, device, computer equipment and storage medium
CN114565449A (en) Intelligent interaction method and device, system, electronic equipment and computer readable medium
JP7347511B2 (en) Audio processing device, audio processing method, and program
CN111933131A (en) Voice recognition method and device
CN112365340A (en) Multi-mode personal loan risk prediction method
CN111209863A (en) Living body model training and human face living body detection method, device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20907878

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26.10.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20907878

Country of ref document: EP

Kind code of ref document: A1