CN116311538B - Distributed audio and video processing system - Google Patents

Distributed audio and video processing system

Info

Publication number
CN116311538B
Authority
CN
China
Prior art keywords
audio
video
frame
data
video data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310562473.XA
Other languages
Chinese (zh)
Other versions
CN116311538A (en)
Inventor
张巧霞
宗建新
刘恋恋
孟书铖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Xianwaiyin Zhizao Technology Co ltd
Original Assignee
Jiangsu Xianwaiyin Zhizao Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Xianwaiyin Zhizao Technology Co ltd filed Critical Jiangsu Xianwaiyin Zhizao Technology Co ltd
Priority to CN202310562473.XA priority Critical patent/CN116311538B/en
Publication of CN116311538A publication Critical patent/CN116311538A/en
Application granted granted Critical
Publication of CN116311538B publication Critical patent/CN116311538B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a distributed audio and video processing system, relating to the technical field of distributed audio and video processing. A mouth shape matching model training module trains in advance models for respectively performing pinyin recognition on video data and audio data of a person speaking; an audio and video data collection module collects the audio and video data to be processed; an audio and video segmentation module segments the audio and video to be processed according to the matching of the audio and the video in the audio and video data, obtaining a plurality of audio-video segments; and a distributed processing module assigns distributed processing nodes to the audio-video segments.

Description

Distributed audio and video processing system
Technical Field
The invention relates to the technical field of distributed audio and video processing, in particular to a distributed audio and video processing system.
Background
Audio and video are widely used in fields such as online education, video conferencing and scientific research, so real-time processing and transmission of audio and video data are very important. However, because audio and video data are large in volume and constrained by bandwidth limits and transmission delays, a single server can hardly meet the real-time processing and transmission requirements. The distributed audio and video processing system has therefore gradually become an important technical solution.
The distributed audio and video processing system can divide the audio and video data into a plurality of small data segments for processing and then merge the processing results. Such a system can exploit the processing capacity of multiple servers and greatly improves the processing efficiency and transmission speed of audio and video data. However, because audio and video data are real-time and time-ordered, the consistency and integrity of the time stamps of each data segment are very important. If the time stamps are inconsistent, the audio and video data may become misaligned and distorted; if a data segment is incomplete, key information may be lost, affecting the quality of the audio and video data.
Therefore, the invention provides a distributed audio and video processing system.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. Therefore, the invention provides a distributed audio and video processing system which ensures the consistency of the time stamps of the starting position and the end position of each data segment distributed to the distributed processing nodes, thereby ensuring the integrity of the data segment processed by the distributed processing nodes.
In order to achieve the above purpose, the invention provides a distributed audio/video processing system, which comprises a mouth shape matching model training module, an audio/video data collecting module, an audio/video dividing module and a distributed processing module; wherein, each module is connected by a wired and/or wireless network mode;
the mouth shape matching model training module is mainly used for training a model for respectively performing pinyin identification on video data and audio data of a tester speaking in advance;
the model for respectively performing pinyin recognition on the video data and the audio data of the speech of the tester is trained by the mouth shape matching model training module and comprises the following steps of:
step S1: collecting a pinyin set in advance; the pinyin set comprises all the pinyin representing the pronunciation of the Chinese character;
step S2: synchronously collecting audio data and video data while a plurality of testers read aloud each pinyin in the pinyin set, and labeling the audio data and the video data;
step S3: extracting the characteristics of each piece of audio data to obtain audio characteristic vectors of the audio data, and training a machine learning model for identifying corresponding pinyin according to the audio characteristic vectors; marking the machine learning model as M1;
step S4: identifying the human mouth in each frame image of the video data using a target recognition algorithm, cropping the mouth image from each frame, and combining the mouth images, in the frame order of the video, into a mouth shape action video at the frame rate of the video data;
step S5: inputting the mouth shape action video into an action recognition neural network model, training the action recognition neural network model, and obtaining an action recognition neural network model M2 for recognizing corresponding pinyin according to the mouth shape action video;
the method for training the action recognition neural network model comprises the following steps:
the motion recognition neural network model takes the mouth shape action video as input and the predicted label as output, takes the real label of the mouth shape action video as the prediction target, and takes the prediction accuracy between the predicted label and the real label as the training target; the model is trained until the prediction accuracy reaches a preset accuracy threshold, at which point training stops; the action recognition neural network model is marked as M2;
the mouth shape matching model training module sends a machine learning model M1 and a motion recognition neural network model M2 to the audio/video segmentation module;
the audio and video data collection module is mainly used for collecting audio and video data to be processed;
the audio and video data collection module collects the audio and video data to be processed in the following manner:
capturing the audio and the video pictures to be captured through an audio capture device and a video capture device to obtain the corresponding audio data and video data, and obtaining the sampling rate of the audio data and the frame rate of the video data;
marking audio data, video data, the sampling rate of the audio data and the frame rate of the video data as audio and video data to be processed;
the audio and video data collection module sends the audio and video data to be processed to the audio and video segmentation module;
the audio/video segmentation module is mainly used for segmenting the audio/video to be processed according to the matching condition of the audio and the video in the audio/video data to obtain a plurality of audio/video segments;
the audio and video segmentation module segments the audio and video to be processed, and the acquisition of a plurality of audio and video segments comprises the following steps:
step P1: the data processing background marks the same initial time stamp on the initial positions of the audio data and the video data according to the reference clock; marking the sampling rate of the audio data as v, the frame rate of the video data as f, and defining a variable i, wherein i=1;
step P2: presetting a frame sampling period T according to practical experience, and sampling the video data sequentially at intervals of the frame sampling period T;
for each frame image obtained by sampling, a target recognition algorithm is used to recognize whether a human mouth is present in the image; if a mouth is present, the images of the previous frame and the next frame are obtained, and an image comparison technique is used to judge whether the mouth shape is consistent across the current frame, its previous frame and its next frame; if the mouth shapes are consistent, step P2 is repeated; if the mouth shape is inconsistent in at least one of these frames, step P3 is executed;
step P3: a frame interval is formed from the frame sampled in the previous frame sampling period to the current frame of the video data, and a matching frame is searched for in this interval by binary search; the matching frame is the first frame in the interval in which a human mouth appears and whose mouth shape is inconsistent across the previous frame, the frame itself and the next frame; the position of the matching frame among all frames of the video data is marked as Pi, and i is updated to i+1;
step P4: calculating the video duration Yi of the Pi-th frame in the video data as Yi = Pi / f; presetting an error time threshold w;
searching for a matching sampling point position Ci in the audio data, calculated as Ci = (Yi − w) × v;
step P5: performing mouth shape matching on the audio data from the position Ci of the matched sampling point and the video data from the position of the Pi frame to obtain a mouth shape matching position Ki in the audio data;
the mouth shape matching mode is as follows:
presetting the number N of pinyin matches and a speech speed period x1; calculating a video traversing period x2 from the speech speed period x1 as x2 = x1 × f; calculating an audio traversing period x3 from the speech speed period x1 as x3 = x1 × v; the speech speed period x1 is the time the person in the video takes to speak one word, and under normal conditions the time taken to speak each word is the same in the video data and in the audio data;
starting from frame Pi of the video data and taking the video traversing period of x2 frames as one period, N segments of video each x2 frames long are intercepted; the mouth image is obtained from every frame of each intercepted segment and the mouth images are combined into mouth shape matching videos; pinyin are identified from the mouth shape matching videos using the action recognition neural network model M2, N pinyin in total, and the N pinyin are ordered according to the order of the mouth shape matching videos;
presetting a matching frequency threshold R;
starting from the matching sampling point position Ci, the subsequent audio data is intercepted in periods of x3 sampling points (one audio traversing period per segment) to obtain a plurality of audio segments; features are extracted from each audio segment and the pinyin in it is identified using the machine learning model M1; traversal stops when the N pinyin identified from the mouth shape matching videos are matched in sequence among all the identified pinyin, or when the number of traversed audio segments exceeds the matching frequency threshold R; if the number of traversed audio segments exceeds the matching frequency threshold R, an audio/video abnormality early warning signal is sent to the data processing background; if the N pinyin identified from the mouth shape matching videos are matched in sequence, the position of the first sampling point of the audio segment corresponding to the first of the N pinyin is obtained; this first sampling point is the mouth shape matching position Ki in the audio data;
step P6: if i=1, dividing the audio data from the starting position to the audio segment of the mouth shape matching position Ki, and dividing the video data from the starting position to the video segment of the Pi frame;
if i >1, dividing the audio data from the mouth shape matching position K (i-1) to the mouth shape matching position Ki, and dividing the video data from the frame P (i-1) to the video segment of the frame Pi; and continuing to execute the step P2;
step P7: combining the audio segments and the video segments according to the intercepting sequence to sequentially obtain combined audio-video segments; i.e. combining the ith audio segment and the ith video segment into an ith audio-video segment;
the audio and video segmentation module sends all audio and video segments to the distributed processing module;
the distributed processing module is mainly used for distributing distributed processing nodes to the audio and video segments;
the distributed processing module distributes the audio and video segments with distributed processing nodes in the following manner:
respectively stamping the audio and the video in the ith audio-video section with the same time stamp at the starting position; the calculation mode of the time stamp is that the video duration Yi is added to the initial time stamp;
ordering the audio and video segments according to the video duration Yi from large to small;
obtaining the current remaining computing power of each distributed processing node and sorting the nodes by remaining computing power from largest to smallest; the calculation principle and meaning of remaining computing power are understood by those skilled in the art and are not repeated here; for example, the remaining computing power of a CPU may be calculated from the CPU usage rate and the number of CPU cores: assuming n CPU cores and a current CPU usage rate p, the remaining CPU computing power may be expressed as (1 − p) × n;
and sequentially sending the audio and video segments to the distributed processing nodes in the corresponding sequence.
Compared with the prior art, the invention has the beneficial effects that:
according to the invention, through training the action recognition neural network model and the machine learning model for respectively carrying out pinyin recognition on the video data and the audio data of the person speaking in advance, traversing the complete video data according to the frame sampling period T, recognizing whether the mouth of the person appears in the video image, judging whether the mouth shape of the person is consistent, judging that the video person speaking is carried out when the mouth shape is inconsistent, and improving the retrieval efficiency of the segmentation points of the matched audio and video through periodic sampling and the judgment of the mouth shape consistency;
according to the invention, the matching sampling point positions of the audio data are obtained from the nodes with inconsistent mouth shapes of the video data, the consistency of the pinyin expressed by the mouth shapes of video personnel in the video data and the pinyin sequence expressed in the audio data is identified, under the condition of consistency, the audio and video are segmented at the corresponding positions, and the audio and video segments are obtained based on the segmentation sequence, so that the consistency of the time stamps of the starting position and the ending position of each audio and video segment is ensured, and the integrity of each data segment distributed to the distributed processing nodes is ensured.
Drawings
Fig. 1 is a block diagram of a distributed audio/video processing system according to embodiment 1 of the present invention.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, a distributed audio/video processing system includes a mouth shape matching model training module, an audio/video data collecting module, an audio/video dividing module and a distributed processing module; wherein, each module is connected by a wired and/or wireless network mode;
the mouth shape matching model training module is mainly used for pre-training models for respectively performing pinyin identification on video data and audio data of a test person speaking;
in a preferred embodiment, the mouth shape matching model training module trains the models for respectively performing pinyin recognition on the video data and the audio data of the person speaking through the following steps:
step S1: collecting a pinyin set in advance; the pinyin set comprises all the pinyin representing the pronunciation of the Chinese character;
step S2: synchronously collecting audio data and video data while a plurality of testers read aloud each pinyin in the pinyin set, and labeling the audio data and the video data;
the pinyin in the pinyin set is numbered, and the labels of the audio data and the video data are the numbers corresponding to the pinyin;
step S3: extracting the characteristics of each piece of audio data to obtain audio characteristic vectors of the audio data, and training a machine learning model for identifying corresponding pinyin according to the audio characteristic vectors; marking the machine learning model as M1; preferably, the machine learning model may be an SVM model;
wherein the audio feature vector may include vectorized sound frequency, amplitude, spectral density, etc.;
tools for audio data feature extraction may include Librosa, pyAudioAnalysis, MIRtoolbox, etc.;
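For illustration only, the following is a minimal sketch of how step S3 might look in practice, assuming MFCC features extracted with Librosa and an SVM classifier from scikit-learn as model M1; the feature choice, file layout and label handling are illustrative assumptions rather than requirements of the invention.

```python
# Minimal sketch of step S3 (assumption: MFCC features + an SVM classifier as model M1).
import numpy as np
import librosa
from sklearn.svm import SVC

def audio_feature_vector(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Load one labelled recording and return a fixed-length audio feature vector."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # spectral-shape features
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_m1(labelled_clips):
    """labelled_clips: list of (wav_path, pinyin_number) pairs produced in step S2."""
    X = np.stack([audio_feature_vector(path) for path, _ in labelled_clips])
    y = np.array([label for _, label in labelled_clips])
    m1 = SVC(kernel="rbf", probability=True)
    m1.fit(X, y)                                              # trained model M1
    return m1
```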
step S4: identifying the human mouth in each frame image of the video data using a target recognition algorithm, cropping the mouth image from each frame, and combining the mouth images, in the frame order of the video, into a mouth shape action video at the frame rate of the video data;
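A minimal sketch of step S4 follows, assuming OpenCV's bundled Haar face detector stands in for the unspecified target recognition algorithm and the lower third of the detected face box is taken as the mouth region; the output clip keeps the original frame rate, as required above.

```python
# Sketch of step S4: locate the mouth in every frame and rebuild a mouth-only clip.
# Assumptions: Haar face detection as the "target recognition algorithm"; lower third
# of the face box treated as the mouth region; output size is illustrative.
import cv2

face_det = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_action_video(src_path: str, dst_path: str, size=(96, 64)):
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)                            # keep the original frame rate
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_det.detectMultiScale(gray, 1.2, 5)
        if len(faces) == 0:
            continue                                           # no person in this frame
        x, y, w, h = max(faces, key=lambda r: r[2] * r[3])     # largest detected face
        mouth = frame[y + 2 * h // 3 : y + h, x : x + w]       # lower third of the face
        out.write(cv2.resize(mouth, size))
    cap.release()
    out.release()
```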
step S5: inputting the mouth shape action video into an action recognition neural network model, training the action recognition neural network model, and obtaining the action recognition neural network model which recognizes corresponding pinyin according to the mouth shape action video;
preferably, the training mode of the action recognition neural network model is as follows:
the motion recognition neural network model takes the mouth shape action video as input and the predicted label as output, takes the real label of the mouth shape action video as the prediction target, and takes the prediction accuracy between the predicted label and the real label as the training target; the model is trained until the prediction accuracy reaches a preset accuracy threshold, at which point training stops; the action recognition neural network model is marked as M2; preferably, the motion recognition neural network model may be a dual-stream (two-stream) convolutional network model or a TSN network model;
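For illustration, a compact training loop of the kind described above might look as follows, assuming PyTorch and a small 3D convolutional network standing in for the dual-stream or TSN model; the architecture, learning rate and accuracy threshold are illustrative assumptions.

```python
# Sketch of the step S5 training loop with an accuracy-threshold stopping rule.
import torch
import torch.nn as nn

class MouthNet(nn.Module):
    """Tiny 3D-CNN placeholder for the action recognition model M2."""
    def __init__(self, num_pinyin: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        self.head = nn.Linear(16, num_pinyin)
    def forward(self, clips):                      # clips: (B, 3, T, H, W)
        return self.head(self.features(clips).flatten(1))

def train_m2(model, loader, acc_threshold=0.95, epochs=50):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        correct = total = 0
        for clips, labels in loader:               # labels: pinyin numbers from step S2
            logits = model(clips)
            loss = loss_fn(logits, labels)
            opt.zero_grad(); loss.backward(); opt.step()
            correct += (logits.argmax(1) == labels).sum().item()
            total += labels.numel()
        if correct / total >= acc_threshold:       # stop once the preset accuracy is met
            break
    return model                                   # trained model M2
```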
the mouth shape matching model training module sends a machine learning model M1 and a motion recognition neural network model M2 to the audio/video segmentation module;
the audio and video data collection module is mainly used for collecting audio and video data to be processed;
in a preferred embodiment, the audio and video data collecting module collects audio and video data to be processed in the following manner:
capturing the audio and the video pictures to be captured through an audio capture device and a video capture device to obtain the corresponding audio data and video data, and obtaining the sampling rate of the audio data and the frame rate of the video data;
marking audio data, video data, the sampling rate of the audio data and the frame rate of the video data as audio and video data to be processed;
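As a small illustration, the sampling rate v and frame rate f accompanying the captured data might be read back as follows, assuming the captures are available as ordinary files opened with soundfile and OpenCV; the file names are placeholders.

```python
# Sketch of the collection step: attach sampling rate and frame rate to the data to process.
import cv2
import soundfile as sf

audio, v = sf.read("captured_audio.wav")        # v: audio sampling rate (samples per second)
cap = cv2.VideoCapture("captured_video.mp4")
f = cap.get(cv2.CAP_PROP_FPS)                   # f: video frame rate (frames per second)

to_process = {"audio": audio, "video": "captured_video.mp4",
              "sampling_rate": v, "frame_rate": f}
```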
the audio and video data collection module sends the audio and video data to be processed to the audio and video segmentation module;
the audio and video segmentation module segments the audio and video to be processed mainly according to the matching condition of the audio and video in the audio and video data;
in a preferred embodiment, the audio-video segmentation module segments the audio-video to be processed according to the matching condition of the audio and the video in the audio-video data, including the following steps:
step P1: the data processing background marks the same initial time stamp on the initial positions of the audio data and the video data according to the reference clock; it can be understood that the starting position of the audio is the first sampling point, and the starting position of the video is the first frame image; marking the sampling rate of the audio data as v, the frame rate of the video data as f, and defining a variable i, wherein i=1;
step P2: presetting a frame sampling period T according to practical experience, and sampling the video data sequentially at intervals of the frame sampling period T;
for each frame image obtained by sampling, a target recognition algorithm is used to recognize whether a human mouth is present in the image; if a mouth is present, the images of the previous frame and the next frame are obtained, and an image comparison technique is used to judge whether the mouth shape is consistent across the current frame, its previous frame and its next frame; if the mouth shapes are consistent, step P2 is repeated; if the mouth shape is inconsistent in at least one of these frames, step P3 is executed;
it should be noted that whether the mouth shapes are consistent may be judged by using an image comparison technique to check whether the degree of opening and closing of the mouth in each frame image exceeds a preset opening and closing threshold; if the threshold is exceeded, the mouth shapes are judged to be inconsistent; if it is below the threshold, the mouth shapes are judged to be consistent;
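A sketch of the step P2 scan is given below, assuming hypothetical helpers has_mouth() and mouth_openness() built on the mouth detection of step S4; the opening/closing threshold value is an illustrative assumption.

```python
# Sketch of step P2: walk forward every T frames and test mouth-shape consistency.
def mouth_shape_consistent(frames, idx, open_close_threshold=0.15):
    """Compare mouth opening in frame idx against its previous and next frames."""
    ref = mouth_openness(frames[idx])                  # hypothetical helper (see step S4)
    for j in (idx - 1, idx + 1):
        if abs(mouth_openness(frames[j]) - ref) > open_close_threshold:
            return False                               # mouth shape changed: likely speech
    return True

def scan_for_speech(frames, start, T):
    """Return the first sampled frame whose mouth shape is inconsistent, or None."""
    idx = start + T
    while idx < len(frames) - 1:
        if has_mouth(frames[idx]) and not mouth_shape_consistent(frames, idx):
            return idx                                 # hand this frame over to step P3
        idx += T
    return None
```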
step P3: a frame interval is formed from the frame sampled in the previous frame sampling period to the current frame of the video data, and a matching frame is searched for in this interval by binary search; the matching frame is the first frame in the interval in which a human mouth appears and whose mouth shape is inconsistent across the previous frame, the frame itself and the next frame; the position of the matching frame among all frames of the video data is marked as Pi, and i is updated to i+1; it can be understood that i indexes the i-th mouth shape matching in the audio and video;
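The binary search of step P3 might be sketched as follows, assuming a predicate is_speech_onset(k) that returns True when a mouth is present in frame k and its shape is inconsistent with its neighbours, as judged in step P2.

```python
# Sketch of the step P3 binary search over the frame interval (lo, hi].
def find_matching_frame(lo: int, hi: int, is_speech_onset) -> int:
    """Return the first frame index in (lo, hi] flagged as a speech onset."""
    while lo < hi:
        mid = (lo + hi) // 2
        if is_speech_onset(mid):
            hi = mid              # the onset is at mid or earlier
        else:
            lo = mid + 1          # the onset is strictly after mid
    return hi                     # this index is Pi
```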
step P4: calculating the video duration Yi of the Pi-th frame in the video data as Yi = Pi / f; presetting an error time threshold w;
searching for a matching sampling point position Ci in the audio data, calculated as Ci = (Yi − w) × v; note that if Yi < w, Ci is set to 0;
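A worked example of step P4 under assumed values: with f = 25 frames per second, a matching frame Pi = 500 gives Yi = 500 / 25 = 20 s; with w = 0.5 s and v = 16000 samples per second, Ci = (20 − 0.5) × 16000 = 312000. The small helper below also clamps Ci to 0 when Yi < w.

```python
# Sketch of step P4: locate the matching sampling point in the audio data.
def matching_sample_position(Pi: int, f: float, v: int, w: float) -> int:
    Yi = Pi / f                            # video time of the matching frame, in seconds
    return max(0, int((Yi - w) * v))       # Ci, never before the first audio sample
```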
step P5: performing mouth shape matching on the audio data from the matched sampling point position Ci and the video data from the Pi frame position to obtain a mouth shape matching position Ki in the audio data;
the mouth shape matching mode is as follows:
presetting the number N of pinyin matches and a speech speed period x1; calculating a video traversing period x2 from the speech speed period x1 as x2 = x1 × f; calculating an audio traversing period x3 from the speech speed period x1 as x3 = x1 × v; the speech speed period x1 is the time the person in the video takes to speak one word, and under normal conditions the time taken to speak each word is the same in the video data and in the audio data;
starting from frame Pi of the video data and taking the video traversing period of x2 frames as one period, N segments of video each x2 frames long are intercepted; the mouth image is obtained from every frame of each intercepted segment and the mouth images are combined into mouth shape matching videos; pinyin are identified from the mouth shape matching videos using the action recognition neural network model M2, N pinyin in total, and the N pinyin are ordered according to the order of the mouth shape matching videos;
presetting a matching frequency threshold R; it can be understood that if the audio corresponding to the video is not matched within the window that starts an error time threshold w before the matching sampling point position Ci and extends for R audio traversing periods after it, the audio and video had a large error during recording and manual inspection is required;
starting from the matching sampling point position Ci, the subsequent audio data is intercepted in periods of x3 sampling points (one audio traversing period per segment) to obtain a plurality of audio segments; features are extracted from each audio segment and the pinyin in it is identified using the machine learning model M1; traversal stops when the N pinyin identified from the mouth shape matching videos are matched in sequence among all the identified pinyin, or when the number of traversed audio segments exceeds the matching frequency threshold R; if the number of traversed audio segments exceeds the matching frequency threshold R, an audio/video abnormality early warning signal is sent to the data processing background; if the N pinyin identified from the mouth shape matching videos are matched in sequence, the position of the first sampling point of the audio segment corresponding to the first of the N pinyin is obtained; this first sampling point is the mouth shape matching position Ki in the audio data;
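A minimal sketch of the step P5 matching loop follows; recognize_video_pinyin() and recognize_audio_pinyin() are hypothetical wrappers around models M2 and M1, and the values of N, x1 and R are illustrative presets.

```python
# Sketch of step P5: align the N video pinyin against pinyin recognized from audio segments.
def find_mouth_match_position(video, Pi, audio, Ci, f, v, N=3, x1=0.4, R=200):
    x2 = int(x1 * f)                                   # video traversing period, in frames
    x3 = int(x1 * v)                                   # audio traversing period, in samples

    # N pinyin read off the mouth shape matching videos starting at frame Pi (model M2).
    target = [recognize_video_pinyin(video[Pi + k * x2 : Pi + (k + 1) * x2])
              for k in range(N)]

    recognized, starts = [], []
    for step in range(R):                              # traverse at most R audio segments
        s = Ci + step * x3
        recognized.append(recognize_audio_pinyin(audio[s : s + x3]))   # model M1
        starts.append(s)
        for j in range(len(recognized) - N + 1):       # look for target as a consecutive run
            if recognized[j : j + N] == target:
                return starts[j]                       # Ki: first sample of the first pinyin
    return None                                        # caller raises the anomaly warning
```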
step P6: if i=1, dividing the audio data from the starting position to the audio segment of the mouth shape matching position Ki, and dividing the video data from the starting position to the video segment of the Pi frame;
if i >1, dividing the audio data from the mouth shape matching position K (i-1) to the mouth shape matching position Ki, and dividing the video data from the frame P (i-1) to the video segment of the frame Pi; and continuing to execute the step P2;
step P7: combining the audio segments and the video segments according to the intercepting sequence to sequentially obtain combined audio-video segments; i.e. combining the ith audio segment and the ith video segment into an ith audio-video segment;
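Steps P6 and P7 might be sketched as follows, assuming match_points holds the ordered (Pi, Ki) pairs produced by steps P2 to P5 and that the audio and video are indexable by sample and by frame respectively.

```python
# Sketch of steps P6/P7: cut both streams at the matched positions and pair the pieces.
def split_into_av_segments(audio, video, match_points):
    segments, prev_K, prev_P = [], 0, 0                # index 0 = shared initial timestamp
    for Pi, Ki in match_points:
        segments.append((audio[prev_K:Ki], video[prev_P:Pi]))   # i-th audio-video segment
        prev_K, prev_P = Ki, Pi
    return segments
```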
the audio and video segmentation module sends all audio and video segments to the distributed processing module;
the distributed processing module is mainly used for distributing distributed processing nodes to the audio and video segments;
in a preferred embodiment, the distributed processing module assigns the audio and video segments to distributed processing nodes in the following manner:
respectively stamping the audio and the video in the ith audio-video section with the same time stamp at the starting position; the calculation mode of the time stamp is that the video duration Yi is added to the initial time stamp;
ordering the audio and video segments according to the video duration Yi from large to small;
obtaining the current remaining computing power of each distributed processing node and sorting the nodes by remaining computing power from largest to smallest; the calculation principle and meaning of remaining computing power are understood by those skilled in the art and are not repeated here; for example, the remaining computing power of a CPU may be calculated from the CPU usage rate and the number of CPU cores: assuming n CPU cores and a current CPU usage rate p, the remaining CPU computing power may be expressed as (1 − p) × n;
and sending the audio and video segments to the distributed processing nodes in the corresponding order, for example sending the first-ranked audio and video segment to the first-ranked distributed processing node.
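A compact sketch of this distribution step is given below, assuming each segment carries its video duration Yi and each node reports its CPU usage rate and core count; the segment and node structures are illustrative.

```python
# Sketch of the distribution step: stamp, sort, and pair segments with nodes.
def assign_segments(segments, nodes, initial_ts):
    # segments: list of dicts with "audio", "video", "duration" (Yi, seconds)
    # nodes:    list of dicts with "cpu_usage" (p) and "cpu_cores" (n)
    for seg in segments:
        seg["timestamp"] = initial_ts + seg["duration"]          # same stamp for audio and video
    segments = sorted(segments, key=lambda s: s["duration"], reverse=True)
    nodes = sorted(nodes,
                   key=lambda nd: (1 - nd["cpu_usage"]) * nd["cpu_cores"],  # remaining compute
                   reverse=True)
    # Longest segment goes to the node with the most headroom; a queue would be needed
    # if segments outnumber nodes.
    return list(zip(segments, nodes))
```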
The above embodiments are only for illustrating the technical method of the present invention and not for limiting the same, and it should be understood by those skilled in the art that the technical method of the present invention may be modified or substituted without departing from the spirit and scope of the technical method of the present invention.

Claims (6)

1. The distributed audio and video processing system is characterized by comprising a mouth shape matching model training module, an audio and video data collecting module, an audio and video segmentation module and a distributed processing module; wherein, each module is connected by a wired and/or wireless network mode;
the mouth shape matching model training module is used for training models for respectively performing pinyin identification on video data and audio data of a tester speaking in advance, and sending the trained models to the audio-video segmentation module;
the audio and video data collection module is used for collecting audio and video data to be processed and sending the audio and video data to be processed to the audio and video segmentation module;
the audio/video segmentation module is used for segmenting the audio/video to be processed according to the matching condition of the audio and the video in the audio/video data to obtain a plurality of audio/video segments, and sending all the audio/video segments to the distributed processing module;
the distributed processing module is used for distributing distributed processing nodes to the audio and video segments;
the audio and video to be processed is divided to obtain a plurality of audio and video segments, which comprises the following steps:
step P1: the data processing background marks the same initial time stamp on the initial positions of the audio data and the video data according to the reference clock; marking the sampling rate of the audio data as v, the frame rate of the video data as f, and defining a variable i, wherein i=1;
step P2: presetting a frame sampling period T, and sampling the video data sequentially at intervals of the frame sampling period T;
for each sampling to obtain an image corresponding to a frame, using a target recognition algorithm to recognize whether a mouth of a person exists in the image, judging whether the mouth shape of the mouth is consistent in the current frame, a frame above the current frame and a frame below the current frame through an image comparison technology if the mouth of the person exists, and repeating the step P2 if the mouth shapes are consistent; if at least one frame of the mouth shape is inconsistent, the step P3 is carried out;
step P3: forming a frame interval from the position of the current frame of the video data and the position of the frame sampled in the previous frame sampling period, and searching for a matching frame in the frame interval by binary search, wherein the matching frame refers to the first frame in the frame interval in which a human mouth appears and whose mouth shape is inconsistent across the previous frame, the frame itself and the next frame; the positions of the matching frames in all frames of the video data are marked as Pi, and i is updated to i+1;
step P4: calculating the video duration Yi of the Pi-th frame in the video data as Yi = Pi / f; presetting an error time threshold w;
searching for a matching sampling point position Ci in the audio data, calculated as Ci = (Yi − w) × v;
step P5: performing mouth shape matching on the audio data from the matched sampling point position Ci and the video data from the Pi frame position to obtain a mouth shape matching position Ki in the audio data;
step P6: if i=1, dividing the audio data from the starting position to the audio segment of the mouth shape matching position Ki, and dividing the video data from the starting position to the video segment of the Pi frame;
if i >1, dividing the audio data from the mouth shape matching position K (i-1) to the mouth shape matching position Ki, and dividing the video data from the frame P (i-1) to the video segment of the frame Pi; and continuing to execute the step P2;
step P7: and combining the audio and video segments according to the intercepting sequence to obtain the combined audio and video segments in sequence.
2. The distributed audio and video processing system according to claim 1, wherein the mouth shape matching model training module is configured to pre-train a model for respectively performing pinyin recognition on video data and audio data of a speaker of a tester, and the method comprises the following steps:
step S1: collecting a pinyin set in advance; the pinyin set comprises all the pinyin representing the pronunciation of the Chinese character;
step S2: the method comprises the steps that when a plurality of testers read according to each pinyin in a pinyin set, audio data and video data are synchronously collected, and the audio data and the video data are marked;
step S3: extracting the characteristics of each piece of audio data to obtain audio characteristic vectors of the audio data, and training a machine learning model for identifying corresponding pinyin according to the audio characteristic vectors; marking the machine learning model as M1;
step S4: identifying a human mouth in each frame of image of video data by using a target identification algorithm, intercepting the human mouth image in each frame of image, and combining the human mouth image into a mouth-shaped action video according to the frame rate of the video data according to the frame sequence of each frame of image in the video;
step S5: inputting the mouth shape action video into the action recognition neural network model, training the action recognition neural network model, and obtaining an action recognition neural network model M2 for recognizing the corresponding pinyin according to the mouth shape action video.
3. The distributed audio-video processing system of claim 2, wherein the training of the motion recognition neural network model is performed by:
the motion recognition neural network model takes a mouth shape motion video as input, takes a predicted label as output, takes a real label of the mouth shape motion video as a predicted target, and takes the predicted label and the prediction accuracy of the real label as a training target; training the action recognition neural network model until the prediction accuracy reaches a preset accuracy threshold value, and stopping training; the action recognition neural network model is labeled M2.
4. A distributed audio and video processing system according to claim 3, wherein the means for collecting audio and video data to be processed is:
capturing an audio picture and a video picture to be captured through an audio capturing device and a video capturing device to obtain corresponding audio data and video data, and obtaining the sampling rate of the audio data and the frame rate of the video data;
the audio data, the video data, the sampling rate of the audio data, and the frame rate of the video data are marked as audio-video data to be processed.
5. The distributed audio-video processing system according to claim 4, wherein the mouth shape matching is performed by:
presetting the number N of pinyin matches and a speech speed period x1; calculating a video traversing period x2 based on the speech speed period x1, wherein the video traversing period x2=x1×f; calculating an audio traversing period x3 based on the speech speed period x1, wherein the calculating formula of the audio traversing period x3 is x3=x1×v; the speech speed period x1 is the speech speed of each word of the video personnel;
starting video data from Pi frames, taking a video traversing period x2 frame as a period, intercepting N segments of video data with the length of x2, acquiring mouth images in each frame of image in each intercepted video data, forming mouth images into mouth shape matching videos, identifying pinyin from the mouth shape matching videos by using an action identification neural network model M2, identifying N pinyin altogether, and sequencing the N pinyin according to the sequence of the mouth shape matching videos;
presetting a matching frequency threshold R;
starting from the matching sampling point position Ci, intercepting the subsequent audio data by taking the x3 sampling points of the audio traversing period as a period to obtain a plurality of audio segments, extracting the characteristics of the data of the plurality of audio segments, and identifying the pinyin in the audio segments by using a machine learning model M1 until N pinyins identified according to the mouth shape matching video are sequentially matched in all the pinyins or the number of the traversed audio segments is greater than a matching frequency threshold R; if N pinyins identified according to the mouth shape matching video are matched in sequence, the position of a first sampling point corresponding to an audio segment of a first pinyin in the N pinyins is obtained; the first sample point is the die matching position Ki in the audio data.
6. The distributed audio/video processing system according to claim 5, wherein the distributed processing nodes are configured to:
respectively stamping the audio and the video in the ith audio-video section with the same time stamp at the starting position; the calculation mode of the time stamp is that the video duration Yi is added to the initial time stamp;
ordering the audio and video segments according to the video duration Yi from large to small;
obtaining the current residual calculation force of each distributed processing node, and sequencing the current residual calculation force from big to small;
and sequentially sending the audio and video segments to the distributed processing nodes in the corresponding sequence.
CN202310562473.XA 2023-05-18 2023-05-18 Distributed audio and video processing system Active CN116311538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310562473.XA CN116311538B (en) 2023-05-18 2023-05-18 Distributed audio and video processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310562473.XA CN116311538B (en) 2023-05-18 2023-05-18 Distributed audio and video processing system

Publications (2)

Publication Number Publication Date
CN116311538A CN116311538A (en) 2023-06-23
CN116311538B (en) 2023-09-01

Family

ID=86826166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310562473.XA Active CN116311538B (en) 2023-05-18 2023-05-18 Distributed audio and video processing system

Country Status (1)

Country Link
CN (1) CN116311538B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112004111A (en) * 2020-09-01 2020-11-27 南京烽火星空通信发展有限公司 News video information extraction method for global deep learning
CN114554268A (en) * 2022-02-23 2022-05-27 湖南快乐阳光互动娱乐传媒有限公司 Audio and video data processing method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220095532A (en) * 2020-12-30 2022-07-07 주식회사 쿠오핀 Method to divide the processing capabilities of artificial intelligence between devices and servers in a network environment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112004111A (en) * 2020-09-01 2020-11-27 南京烽火星空通信发展有限公司 News video information extraction method for global deep learning
CN114554268A (en) * 2022-02-23 2022-05-27 湖南快乐阳光互动娱乐传媒有限公司 Audio and video data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116311538A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN111292764B (en) Identification system and identification method
CN106297776B (en) A kind of voice keyword retrieval method based on audio template
CN110879989A (en) Ads-b signal target identification method based on small sample local machine learning model
CN112492343B (en) Video live broadcast monitoring method and related device
CN111488487B (en) Advertisement detection method and detection system for all-media data
CN112766218B (en) Cross-domain pedestrian re-recognition method and device based on asymmetric combined teaching network
CN111738218B (en) Human body abnormal behavior recognition system and method
CN112183289A (en) Method, device, equipment and medium for detecting patterned screen
CN115132201A (en) Lip language identification method, computer device and storage medium
WO2022062027A1 (en) Wine product positioning method and apparatus, wine product information management method and apparatus, and device and storage medium
CN116311538B (en) Distributed audio and video processing system
CN113065533A (en) Feature extraction model generation method and device, electronic equipment and storage medium
CN109829887B (en) Image quality evaluation method based on deep neural network
CN110163142B (en) Real-time gesture recognition method and system
CN111179972A (en) Human voice detection algorithm based on deep learning
CN114005054A (en) AI intelligence system of grading
CN114022754A (en) Few-sample image identification method combined with contrast learning
CN116453023B (en) Video abstraction system, method, electronic equipment and medium for 5G rich media information
CN110647810A (en) Method and device for constructing and identifying radio signal image identification model
CN116600166B (en) Video real-time editing method, device and equipment based on audio analysis
CN116052647A (en) Multi-modal pronunciation teaching interaction system, device and method
CN112990145B (en) Group-sparse-based age estimation method and electronic equipment
CN113329190B (en) Animation design video production analysis management method, equipment, system and computer storage medium
CN117011761A (en) Self-supervision behavior key frame detection method and system
CN116781856A (en) Audio-visual conversion control method, system and storage medium based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant