CN116311538B - Distributed audio and video processing system - Google Patents

Distributed audio and video processing system

Info

Publication number
CN116311538B
Authority
CN
China
Prior art keywords
audio
video
frame
data
video data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310562473.XA
Other languages
Chinese (zh)
Other versions
CN116311538A (en)
Inventor
张巧霞
宗建新
刘恋恋
孟书铖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Xianwaiyin Zhizao Technology Co ltd
Original Assignee
Jiangsu Xianwaiyin Zhizao Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Xianwaiyin Zhizao Technology Co ltd filed Critical Jiangsu Xianwaiyin Zhizao Technology Co ltd
Priority to CN202310562473.XA priority Critical patent/CN116311538B/en
Publication of CN116311538A publication Critical patent/CN116311538A/en
Application granted granted Critical
Publication of CN116311538B publication Critical patent/CN116311538B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a distributed audio and video processing system, relating to the technical field of distributed audio and video processing. A mouth shape matching model training module trains in advance models for respectively performing pinyin recognition on video data and audio data of a person speaking; an audio and video data collection module collects the audio and video data to be processed; an audio and video segmentation module segments the audio and video to be processed according to the matching of the audio and the video in the audio and video data, obtaining a plurality of audio-video segments; and a distributed processing module assigns distributed processing nodes to the audio-video segments.

Description

Distributed audio and video processing system
Technical Field
The invention relates to the technical field of distributed audio and video processing, in particular to a distributed audio and video processing system.
Background
Audio and video are widely used in fields such as online education, video conferencing and scientific research, so real-time processing and transmission of audio and video data are very important. However, because audio and video data are large in volume and constrained by bandwidth limits and transmission delays, a single server can hardly meet the real-time processing and transmission requirements. The distributed audio and video processing system has therefore gradually become an important technical solution.
The distributed audio and video processing system can divide the audio and video data into a plurality of small data segments for processing and then merge the processing results. Such a system can exploit the processing capacity of multiple servers and greatly improves the processing efficiency and transmission speed of audio and video data. However, because audio and video data are real-time and time-ordered, the consistency and integrity of the time stamps of each data segment are very important. If the time stamps are inconsistent, the audio and video data may become misaligned and distorted; if a data segment is incomplete, key information may be lost, affecting the quality of the audio and video data.
Therefore, the invention provides a distributed audio and video processing system.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. Therefore, the invention provides a distributed audio and video processing system which ensures the consistency of the time stamps of the starting position and the end position of each data segment distributed to the distributed processing nodes, thereby ensuring the integrity of the data segment processed by the distributed processing nodes.
In order to achieve the above purpose, the invention provides a distributed audio/video processing system, which comprises a mouth shape matching model training module, an audio/video data collecting module, an audio/video dividing module and a distributed processing module; wherein, each module is connected by a wired and/or wireless network mode;
the mouth shape matching model training module is mainly used for training a model for respectively performing pinyin identification on video data and audio data of a tester speaking in advance;
the model for respectively performing pinyin recognition on the video data and the audio data of the speech of the tester is trained by the mouth shape matching model training module and comprises the following steps of:
step S1: collecting a pinyin set in advance; the pinyin set comprises all the pinyin representing the pronunciation of the Chinese character;
step S2: synchronously collecting audio data and video data while a plurality of testers read aloud each pinyin in the pinyin set, and labeling the audio data and the video data;
step S3: extracting the characteristics of each piece of audio data to obtain audio characteristic vectors of the audio data, and training a machine learning model for identifying corresponding pinyin according to the audio characteristic vectors; marking the machine learning model as M1;
step S4: identifying the human mouth in each frame image of the video data using a target recognition algorithm, cropping the mouth image from each frame, and combining the mouth images, in the frame order of the video, into a mouth shape action video at the frame rate of the video data;
step S5: inputting the mouth shape action video into an action recognition neural network model, training the action recognition neural network model, and obtaining an action recognition neural network model M2 for recognizing corresponding pinyin according to the mouth shape action video;
the method for training the action recognition neural network model comprises the following steps:
the motion recognition neural network model takes the mouth shape action video as input and the predicted label as output, takes the real label of the mouth shape action video as the prediction target, and takes the prediction accuracy between the predicted label and the real label as the training target; the model is trained until the prediction accuracy reaches a preset accuracy threshold, at which point training stops; the action recognition neural network model is marked as M2;
the mouth shape matching model training module sends a machine learning model M1 and a motion recognition neural network model M2 to the audio/video segmentation module;
the audio and video data collection module is mainly used for collecting audio and video data to be processed;
the audio and video data collection module collects the audio and video data to be processed in the following manner:
capturing the audio and the video pictures to be captured through an audio capture device and a video capture device to obtain the corresponding audio data and video data, and obtaining the sampling rate of the audio data and the frame rate of the video data;
marking audio data, video data, the sampling rate of the audio data and the frame rate of the video data as audio and video data to be processed;
the audio and video data collection module sends the audio and video data to be processed to the audio and video segmentation module;
the audio/video segmentation module is mainly used for segmenting the audio/video to be processed according to the matching condition of the audio and the video in the audio/video data to obtain a plurality of audio/video segments;
the audio and video segmentation module segments the audio and video to be processed, and the acquisition of a plurality of audio and video segments comprises the following steps:
step P1: the data processing background marks the same initial time stamp on the initial positions of the audio data and the video data according to the reference clock; marking the sampling rate of the audio data as v, the frame rate of the video data as f, and defining a variable i, wherein i=1;
step P2: presetting a frame sampling period T according to practical experience, and sampling the video data sequentially at intervals of the frame sampling period T;
for each frame image obtained by sampling, a target recognition algorithm is used to recognize whether a human mouth is present in the image; if a mouth is present, the images of the previous frame and the next frame are obtained, and an image comparison technique is used to judge whether the mouth shape is consistent across the current frame, its previous frame and its next frame; if the mouth shapes are consistent, step P2 is repeated; if the mouth shape is inconsistent in at least one of these frames, step P3 is executed;
step P3: a frame interval is formed from the frame sampled in the previous frame sampling period to the current frame of the video data, and a matching frame is searched for in this interval by binary search; the matching frame is the first frame in the interval in which a human mouth appears and whose mouth shape is inconsistent across the previous frame, the frame itself and the next frame; the position of the matching frame among all frames of the video data is marked as Pi, and i is updated to i+1;
step P4: calculating the video duration Yi of the Pi-th frame in the video data as Yi = Pi / f; presetting an error time threshold w;
searching for a matching sampling point position Ci in the audio data, calculated as Ci = (Yi − w) × v;
step P5: performing mouth shape matching on the audio data from the position Ci of the matched sampling point and the video data from the position of the Pi frame to obtain a mouth shape matching position Ki in the audio data;
the mouth shape matching mode is as follows:
presetting the number N of pinyin matches and a speech speed period x1; calculating a video traversing period x2 from the speech speed period x1 as x2 = x1 × f; calculating an audio traversing period x3 from the speech speed period x1 as x3 = x1 × v; the speech speed period x1 is the time the person in the video takes to speak one word, and under normal conditions the time taken to speak each word is the same in the video data and in the audio data;
starting from frame Pi of the video data and taking the video traversing period of x2 frames as one period, N segments of video each x2 frames long are intercepted; the mouth image is obtained from every frame of each intercepted segment and the mouth images are combined into mouth shape matching videos; pinyin are identified from the mouth shape matching videos using the action recognition neural network model M2, N pinyin in total, and the N pinyin are ordered according to the order of the mouth shape matching videos;
presetting a matching frequency threshold R;
starting from the matching sampling point position Ci, the subsequent audio data is intercepted in periods of x3 sampling points (one audio traversing period per segment) to obtain a plurality of audio segments; features are extracted from each audio segment and the pinyin in it is identified using the machine learning model M1; traversal stops when the N pinyin identified from the mouth shape matching videos are matched in sequence among all the identified pinyin, or when the number of traversed audio segments exceeds the matching frequency threshold R; if the number of traversed audio segments exceeds the matching frequency threshold R, an audio/video abnormality early warning signal is sent to the data processing background; if the N pinyin identified from the mouth shape matching videos are matched in sequence, the position of the first sampling point of the audio segment corresponding to the first of the N pinyin is obtained; this first sampling point is the mouth shape matching position Ki in the audio data;
step P6: if i=1, dividing the audio data from the starting position to the audio segment of the mouth shape matching position Ki, and dividing the video data from the starting position to the video segment of the Pi frame;
if i >1, dividing the audio data from the mouth shape matching position K (i-1) to the mouth shape matching position Ki, and dividing the video data from the frame P (i-1) to the video segment of the frame Pi; and continuing to execute the step P2;
step P7: combining the audio segments and the video segments according to the intercepting sequence to sequentially obtain combined audio-video segments; i.e. combining the ith audio segment and the ith video segment into an ith audio-video segment;
the audio and video segmentation module sends all audio and video segments to the distributed processing module;
the distributed processing module is mainly used for distributing distributed processing nodes to the audio and video segments;
the distributed processing module distributes the audio and video segments with distributed processing nodes in the following manner:
respectively stamping the audio and the video in the ith audio-video section with the same time stamp at the starting position; the calculation mode of the time stamp is that the video duration Yi is added to the initial time stamp;
ordering the audio and video segments according to the video duration Yi from large to small;
obtaining the current remaining computing power of each distributed processing node and sorting the nodes by remaining computing power from largest to smallest; the calculation principle and meaning of remaining computing power are understood by those skilled in the art and are not repeated here; for example, the remaining computing power of a CPU may be calculated from the CPU usage rate and the number of CPU cores: assuming n CPU cores and a current CPU usage rate p, the remaining CPU computing power may be expressed as (1 − p) × n;
and sequentially sending the audio and video segments to the distributed processing nodes in the corresponding sequence.
Compared with the prior art, the invention has the beneficial effects that:
according to the invention, through training the action recognition neural network model and the machine learning model for respectively carrying out pinyin recognition on the video data and the audio data of the person speaking in advance, traversing the complete video data according to the frame sampling period T, recognizing whether the mouth of the person appears in the video image, judging whether the mouth shape of the person is consistent, judging that the video person speaking is carried out when the mouth shape is inconsistent, and improving the retrieval efficiency of the segmentation points of the matched audio and video through periodic sampling and the judgment of the mouth shape consistency;
according to the invention, the matching sampling point positions of the audio data are obtained from the nodes with inconsistent mouth shapes of the video data, the consistency of the pinyin expressed by the mouth shapes of video personnel in the video data and the pinyin sequence expressed in the audio data is identified, under the condition of consistency, the audio and video are segmented at the corresponding positions, and the audio and video segments are obtained based on the segmentation sequence, so that the consistency of the time stamps of the starting position and the ending position of each audio and video segment is ensured, and the integrity of each data segment distributed to the distributed processing nodes is ensured.
Drawings
Fig. 1 is a block diagram of a distributed audio/video processing system according to embodiment 1 of the present invention.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, a distributed audio/video processing system includes a mouth shape matching model training module, an audio/video data collecting module, an audio/video dividing module and a distributed processing module; wherein, each module is connected by a wired and/or wireless network mode;
the mouth shape matching model training module is mainly used for pre-training models for respectively performing pinyin identification on video data and audio data of a test person speaking;
in a preferred embodiment, the mouth shape matching model training module trains the models for respectively performing pinyin recognition on the video data and the audio data of the person speaking through the following steps:
step S1: collecting a pinyin set in advance; the pinyin set comprises all the pinyin representing the pronunciation of the Chinese character;
step S2: synchronously collecting audio data and video data while a plurality of testers read aloud each pinyin in the pinyin set, and labeling the audio data and the video data;
the pinyin in the pinyin set is numbered, and the labels of the audio data and the video data are the numbers corresponding to the pinyin;
step S3: extracting the characteristics of each piece of audio data to obtain audio characteristic vectors of the audio data, and training a machine learning model for identifying corresponding pinyin according to the audio characteristic vectors; marking the machine learning model as M1; preferably, the machine learning model may be an SVM model;
wherein the audio feature vector may include vectorized sound frequency, amplitude, spectral density, etc.;
tools for audio data feature extraction may include Librosa, pyAudioAnalysis, MIRtoolbox, etc.;
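For illustration only, the following is a minimal sketch of how step S3 might look in practice, assuming MFCC features extracted with Librosa and an SVM classifier from scikit-learn as model M1; the feature choice, file layout and label handling are illustrative assumptions rather than requirements of the invention.

```python
# Minimal sketch of step S3 (assumption: MFCC features + an SVM classifier as model M1).
import numpy as np
import librosa
from sklearn.svm import SVC

def audio_feature_vector(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Load one labelled recording and return a fixed-length audio feature vector."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # spectral-shape features
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_m1(labelled_clips):
    """labelled_clips: list of (wav_path, pinyin_number) pairs produced in step S2."""
    X = np.stack([audio_feature_vector(path) for path, _ in labelled_clips])
    y = np.array([label for _, label in labelled_clips])
    m1 = SVC(kernel="rbf", probability=True)
    m1.fit(X, y)                                              # trained model M1
    return m1
```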
step S4: identifying the human mouth in each frame image of the video data using a target recognition algorithm, cropping the mouth image from each frame, and combining the mouth images, in the frame order of the video, into a mouth shape action video at the frame rate of the video data;
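A minimal sketch of step S4 follows, assuming OpenCV's bundled Haar face detector stands in for the unspecified target recognition algorithm and the lower third of the detected face box is taken as the mouth region; the output clip keeps the original frame rate, as required above.

```python
# Sketch of step S4: locate the mouth in every frame and rebuild a mouth-only clip.
# Assumptions: Haar face detection as the "target recognition algorithm"; lower third
# of the face box treated as the mouth region; output size is illustrative.
import cv2

face_det = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_action_video(src_path: str, dst_path: str, size=(96, 64)):
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)                            # keep the original frame rate
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_det.detectMultiScale(gray, 1.2, 5)
        if len(faces) == 0:
            continue                                           # no person in this frame
        x, y, w, h = max(faces, key=lambda r: r[2] * r[3])     # largest detected face
        mouth = frame[y + 2 * h // 3 : y + h, x : x + w]       # lower third of the face
        out.write(cv2.resize(mouth, size))
    cap.release()
    out.release()
```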
step S5: inputting the mouth shape action video into an action recognition neural network model, training the action recognition neural network model, and obtaining the action recognition neural network model which recognizes corresponding pinyin according to the mouth shape action video;
preferably, the training mode of the action recognition neural network model is as follows:
the motion recognition neural network model takes the mouth shape action video as input and the predicted label as output, takes the real label of the mouth shape action video as the prediction target, and takes the prediction accuracy between the predicted label and the real label as the training target; the model is trained until the prediction accuracy reaches a preset accuracy threshold, at which point training stops; the action recognition neural network model is marked as M2; preferably, the motion recognition neural network model may be a dual-stream (two-stream) convolutional network model or a TSN network model;
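For illustration, a compact training loop of the kind described above might look as follows, assuming PyTorch and a small 3D convolutional network standing in for the dual-stream or TSN model; the architecture, learning rate and accuracy threshold are illustrative assumptions.

```python
# Sketch of the step S5 training loop with an accuracy-threshold stopping rule.
import torch
import torch.nn as nn

class MouthNet(nn.Module):
    """Tiny 3D-CNN placeholder for the action recognition model M2."""
    def __init__(self, num_pinyin: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        self.head = nn.Linear(16, num_pinyin)
    def forward(self, clips):                      # clips: (B, 3, T, H, W)
        return self.head(self.features(clips).flatten(1))

def train_m2(model, loader, acc_threshold=0.95, epochs=50):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        correct = total = 0
        for clips, labels in loader:               # labels: pinyin numbers from step S2
            logits = model(clips)
            loss = loss_fn(logits, labels)
            opt.zero_grad(); loss.backward(); opt.step()
            correct += (logits.argmax(1) == labels).sum().item()
            total += labels.numel()
        if correct / total >= acc_threshold:       # stop once the preset accuracy is met
            break
    return model                                   # trained model M2
```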
the mouth shape matching model training module sends a machine learning model M1 and a motion recognition neural network model M2 to the audio/video segmentation module;
the audio and video data collection module is mainly used for collecting audio and video data to be processed;
in a preferred embodiment, the audio and video data collecting module collects audio and video data to be processed in the following manner:
capturing the audio and the video pictures to be captured through an audio capture device and a video capture device to obtain the corresponding audio data and video data, and obtaining the sampling rate of the audio data and the frame rate of the video data;
marking audio data, video data, the sampling rate of the audio data and the frame rate of the video data as audio and video data to be processed;
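As a small illustration, the sampling rate v and frame rate f accompanying the captured data might be read back as follows, assuming the captures are available as ordinary files opened with soundfile and OpenCV; the file names are placeholders.

```python
# Sketch of the collection step: attach sampling rate and frame rate to the data to process.
import cv2
import soundfile as sf

audio, v = sf.read("captured_audio.wav")        # v: audio sampling rate (samples per second)
cap = cv2.VideoCapture("captured_video.mp4")
f = cap.get(cv2.CAP_PROP_FPS)                   # f: video frame rate (frames per second)

to_process = {"audio": audio, "video": "captured_video.mp4",
              "sampling_rate": v, "frame_rate": f}
```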
the audio and video data collection module sends the audio and video data to be processed to the audio and video segmentation module;
the audio and video segmentation module segments the audio and video to be processed mainly according to the matching condition of the audio and video in the audio and video data;
in a preferred embodiment, the audio-video segmentation module segments the audio-video to be processed according to the matching condition of the audio and the video in the audio-video data, including the following steps:
step P1: the data processing background marks the same initial time stamp on the initial positions of the audio data and the video data according to the reference clock; it can be understood that the starting position of the audio is the first sampling point, and the starting position of the video is the first frame image; marking the sampling rate of the audio data as v, the frame rate of the video data as f, and defining a variable i, wherein i=1;
step P2: presetting a frame sampling period T according to practical experience, and sampling the video data sequentially at intervals of the frame sampling period T;
for each frame image obtained by sampling, a target recognition algorithm is used to recognize whether a human mouth is present in the image; if a mouth is present, the images of the previous frame and the next frame are obtained, and an image comparison technique is used to judge whether the mouth shape is consistent across the current frame, its previous frame and its next frame; if the mouth shapes are consistent, step P2 is repeated; if the mouth shape is inconsistent in at least one of these frames, step P3 is executed;
it should be noted that whether the mouth shapes are consistent may be judged by using an image comparison technique to check whether the degree of opening and closing of the mouth in each frame image exceeds a preset opening and closing threshold; if the threshold is exceeded, the mouth shapes are judged to be inconsistent; if it is below the threshold, the mouth shapes are judged to be consistent;
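A sketch of the step P2 scan is given below, assuming hypothetical helpers has_mouth() and mouth_openness() built on the mouth detection of step S4; the opening/closing threshold value is an illustrative assumption.

```python
# Sketch of step P2: walk forward every T frames and test mouth-shape consistency.
def mouth_shape_consistent(frames, idx, open_close_threshold=0.15):
    """Compare mouth opening in frame idx against its previous and next frames."""
    ref = mouth_openness(frames[idx])                  # hypothetical helper (see step S4)
    for j in (idx - 1, idx + 1):
        if abs(mouth_openness(frames[j]) - ref) > open_close_threshold:
            return False                               # mouth shape changed: likely speech
    return True

def scan_for_speech(frames, start, T):
    """Return the first sampled frame whose mouth shape is inconsistent, or None."""
    idx = start + T
    while idx < len(frames) - 1:
        if has_mouth(frames[idx]) and not mouth_shape_consistent(frames, idx):
            return idx                                 # hand this frame over to step P3
        idx += T
    return None
```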
step P3: a frame interval is formed from the frame sampled in the previous frame sampling period to the current frame of the video data, and a matching frame is searched for in this interval by binary search; the matching frame is the first frame in the interval in which a human mouth appears and whose mouth shape is inconsistent across the previous frame, the frame itself and the next frame; the position of the matching frame among all frames of the video data is marked as Pi, and i is updated to i+1; it can be understood that i indexes the i-th mouth shape matching in the audio and video;
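The binary search of step P3 might be sketched as follows, assuming a predicate is_speech_onset(k) that returns True when a mouth is present in frame k and its shape is inconsistent with its neighbours, as judged in step P2.

```python
# Sketch of the step P3 binary search over the frame interval (lo, hi].
def find_matching_frame(lo: int, hi: int, is_speech_onset) -> int:
    """Return the first frame index in (lo, hi] flagged as a speech onset."""
    while lo < hi:
        mid = (lo + hi) // 2
        if is_speech_onset(mid):
            hi = mid              # the onset is at mid or earlier
        else:
            lo = mid + 1          # the onset is strictly after mid
    return hi                     # this index is Pi
```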
step P4: calculating the video duration Yi of the Pi-th frame in the video data as Yi = Pi / f; presetting an error time threshold w;
searching for a matching sampling point position Ci in the audio data, calculated as Ci = (Yi − w) × v; note that if Yi < w, Ci is set to 0;
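A worked example of step P4 under assumed values: with f = 25 frames per second, a matching frame Pi = 500 gives Yi = 500 / 25 = 20 s; with w = 0.5 s and v = 16000 samples per second, Ci = (20 − 0.5) × 16000 = 312000. The small helper below also clamps Ci to 0 when Yi < w.

```python
# Sketch of step P4: locate the matching sampling point in the audio data.
def matching_sample_position(Pi: int, f: float, v: int, w: float) -> int:
    Yi = Pi / f                            # video time of the matching frame, in seconds
    return max(0, int((Yi - w) * v))       # Ci, never before the first audio sample
```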
step P5: performing mouth shape matching on the audio data from the matched sampling point position Ci and the video data from the Pi frame position to obtain a mouth shape matching position Ki in the audio data;
the mouth shape matching mode is as follows:
presetting the number N of pinyin matches and a speech speed period x1; calculating a video traversing period x2 from the speech speed period x1 as x2 = x1 × f; calculating an audio traversing period x3 from the speech speed period x1 as x3 = x1 × v; the speech speed period x1 is the time the person in the video takes to speak one word, and under normal conditions the time taken to speak each word is the same in the video data and in the audio data;
starting from frame Pi of the video data and taking the video traversing period of x2 frames as one period, N segments of video each x2 frames long are intercepted; the mouth image is obtained from every frame of each intercepted segment and the mouth images are combined into mouth shape matching videos; pinyin are identified from the mouth shape matching videos using the action recognition neural network model M2, N pinyin in total, and the N pinyin are ordered according to the order of the mouth shape matching videos;
presetting a matching frequency threshold R; it can be understood that if the audio corresponding to the video is not matched within the window that starts an error time threshold w before the matching sampling point position Ci and extends for R audio traversing periods after it, the audio and video had a large error during recording and manual inspection is required;
starting from the matching sampling point position Ci, the subsequent audio data is intercepted in periods of x3 sampling points (one audio traversing period per segment) to obtain a plurality of audio segments; features are extracted from each audio segment and the pinyin in it is identified using the machine learning model M1; traversal stops when the N pinyin identified from the mouth shape matching videos are matched in sequence among all the identified pinyin, or when the number of traversed audio segments exceeds the matching frequency threshold R; if the number of traversed audio segments exceeds the matching frequency threshold R, an audio/video abnormality early warning signal is sent to the data processing background; if the N pinyin identified from the mouth shape matching videos are matched in sequence, the position of the first sampling point of the audio segment corresponding to the first of the N pinyin is obtained; this first sampling point is the mouth shape matching position Ki in the audio data;
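A minimal sketch of the step P5 matching loop follows; recognize_video_pinyin() and recognize_audio_pinyin() are hypothetical wrappers around models M2 and M1, and the values of N, x1 and R are illustrative presets.

```python
# Sketch of step P5: align the N video pinyin against pinyin recognized from audio segments.
def find_mouth_match_position(video, Pi, audio, Ci, f, v, N=3, x1=0.4, R=200):
    x2 = int(x1 * f)                                   # video traversing period, in frames
    x3 = int(x1 * v)                                   # audio traversing period, in samples

    # N pinyin read off the mouth shape matching videos starting at frame Pi (model M2).
    target = [recognize_video_pinyin(video[Pi + k * x2 : Pi + (k + 1) * x2])
              for k in range(N)]

    recognized, starts = [], []
    for step in range(R):                              # traverse at most R audio segments
        s = Ci + step * x3
        recognized.append(recognize_audio_pinyin(audio[s : s + x3]))   # model M1
        starts.append(s)
        for j in range(len(recognized) - N + 1):       # look for target as a consecutive run
            if recognized[j : j + N] == target:
                return starts[j]                       # Ki: first sample of the first pinyin
    return None                                        # caller raises the anomaly warning
```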
step P6: if i=1, dividing the audio data from the starting position to the audio segment of the mouth shape matching position Ki, and dividing the video data from the starting position to the video segment of the Pi frame;
if i >1, dividing the audio data from the mouth shape matching position K (i-1) to the mouth shape matching position Ki, and dividing the video data from the frame P (i-1) to the video segment of the frame Pi; and continuing to execute the step P2;
step P7: combining the audio segments and the video segments according to the intercepting sequence to sequentially obtain combined audio-video segments; i.e. combining the ith audio segment and the ith video segment into an ith audio-video segment;
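Steps P6 and P7 might be sketched as follows, assuming match_points holds the ordered (Pi, Ki) pairs produced by steps P2 to P5 and that the audio and video are indexable by sample and by frame respectively.

```python
# Sketch of steps P6/P7: cut both streams at the matched positions and pair the pieces.
def split_into_av_segments(audio, video, match_points):
    segments, prev_K, prev_P = [], 0, 0                # index 0 = shared initial timestamp
    for Pi, Ki in match_points:
        segments.append((audio[prev_K:Ki], video[prev_P:Pi]))   # i-th audio-video segment
        prev_K, prev_P = Ki, Pi
    return segments
```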
the audio and video segmentation module sends all audio and video segments to the distributed processing module;
the distributed processing module is mainly used for distributing distributed processing nodes to the audio and video segments;
in a preferred embodiment, the distributed processing module assigns the audio and video segments to distributed processing nodes in the following manner:
respectively stamping the audio and the video in the ith audio-video section with the same time stamp at the starting position; the calculation mode of the time stamp is that the video duration Yi is added to the initial time stamp;
ordering the audio and video segments according to the video duration Yi from large to small;
obtaining the current remaining computing power of each distributed processing node and sorting the nodes by remaining computing power from largest to smallest; the calculation principle and meaning of remaining computing power are understood by those skilled in the art and are not repeated here; for example, the remaining computing power of a CPU may be calculated from the CPU usage rate and the number of CPU cores: assuming n CPU cores and a current CPU usage rate p, the remaining CPU computing power may be expressed as (1 − p) × n;
and sending the audio and video segments to the distributed processing nodes in the corresponding order, for example sending the first-ranked audio and video segment to the first-ranked distributed processing node.
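A compact sketch of this distribution step is given below, assuming each segment carries its video duration Yi and each node reports its CPU usage rate and core count; the segment and node structures are illustrative.

```python
# Sketch of the distribution step: stamp, sort, and pair segments with nodes.
def assign_segments(segments, nodes, initial_ts):
    # segments: list of dicts with "audio", "video", "duration" (Yi, seconds)
    # nodes:    list of dicts with "cpu_usage" (p) and "cpu_cores" (n)
    for seg in segments:
        seg["timestamp"] = initial_ts + seg["duration"]          # same stamp for audio and video
    segments = sorted(segments, key=lambda s: s["duration"], reverse=True)
    nodes = sorted(nodes,
                   key=lambda nd: (1 - nd["cpu_usage"]) * nd["cpu_cores"],  # remaining compute
                   reverse=True)
    # Longest segment goes to the node with the most headroom; a queue would be needed
    # if segments outnumber nodes.
    return list(zip(segments, nodes))
```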
The above embodiments are only for illustrating the technical method of the present invention and not for limiting the same, and it should be understood by those skilled in the art that the technical method of the present invention may be modified or substituted without departing from the spirit and scope of the technical method of the present invention.

Claims (6)

1. The distributed audio and video processing system is characterized by comprising a mouth shape matching model training module, an audio and video data collecting module, an audio and video segmentation module and a distributed processing module; wherein, each module is connected by a wired and/or wireless network mode;
the mouth shape matching model training module is used for training models for respectively performing pinyin identification on video data and audio data of a tester speaking in advance, and sending the trained models to the audio-video segmentation module;
the audio and video data collection module is used for collecting audio and video data to be processed and sending the audio and video data to be processed to the audio and video segmentation module;
the audio/video segmentation module is used for segmenting the audio/video to be processed according to the matching condition of the audio and the video in the audio/video data to obtain a plurality of audio/video segments, and sending all the audio/video segments to the distributed processing module;
the distributed processing module is used for distributing distributed processing nodes to the audio and video segments;
the audio and video to be processed is divided to obtain a plurality of audio and video segments, which comprises the following steps:
step P1: the data processing background marks the same initial time stamp on the initial positions of the audio data and the video data according to the reference clock; marking the sampling rate of the audio data as v, the frame rate of the video data as f, and defining a variable i, wherein i=1;
step P2: presetting a frame sampling period T, and sampling the video data sequentially at intervals of the frame sampling period T;
for each sampling to obtain an image corresponding to a frame, using a target recognition algorithm to recognize whether a mouth of a person exists in the image, judging whether the mouth shape of the mouth is consistent in the current frame, a frame above the current frame and a frame below the current frame through an image comparison technology if the mouth of the person exists, and repeating the step P2 if the mouth shapes are consistent; if at least one frame of the mouth shape is inconsistent, the step P3 is carried out;
step P3: forming a frame interval from the position of the current frame of the video data and the position of the frame sampled in the previous frame sampling period, and searching for a matching frame in the frame interval by binary search, wherein the matching frame refers to the first frame in the frame interval in which a human mouth appears and whose mouth shape is inconsistent across the previous frame, the frame itself and the next frame; the positions of the matching frames in all frames of the video data are marked as Pi, and i is updated to i+1;
step P4: calculating the video duration Yi of the Pi-th frame in the video data as Yi = Pi / f; presetting an error time threshold w;
searching for a matching sampling point position Ci in the audio data, calculated as Ci = (Yi − w) × v;
step P5: performing mouth shape matching on the audio data from the matched sampling point position Ci and the video data from the Pi frame position to obtain a mouth shape matching position Ki in the audio data;
step P6: if i=1, dividing the audio data from the starting position to the audio segment of the mouth shape matching position Ki, and dividing the video data from the starting position to the video segment of the Pi frame;
if i >1, dividing the audio data from the mouth shape matching position K (i-1) to the mouth shape matching position Ki, and dividing the video data from the frame P (i-1) to the video segment of the frame Pi; and continuing to execute the step P2;
step P7: and combining the audio and video segments according to the intercepting sequence to obtain the combined audio and video segments in sequence.
2. The distributed audio and video processing system according to claim 1, wherein the mouth shape matching model training module is configured to pre-train a model for respectively performing pinyin recognition on video data and audio data of a speaker of a tester, and the method comprises the following steps:
step S1: collecting a pinyin set in advance; the pinyin set comprises all the pinyin representing the pronunciation of the Chinese character;
step S2: the method comprises the steps that when a plurality of testers read according to each pinyin in a pinyin set, audio data and video data are synchronously collected, and the audio data and the video data are marked;
step S3: extracting the characteristics of each piece of audio data to obtain audio characteristic vectors of the audio data, and training a machine learning model for identifying corresponding pinyin according to the audio characteristic vectors; marking the machine learning model as M1;
step S4: identifying a human mouth in each frame of image of video data by using a target identification algorithm, intercepting the human mouth image in each frame of image, and combining the human mouth image into a mouth-shaped action video according to the frame rate of the video data according to the frame sequence of each frame of image in the video;
step S5: inputting the mouth shape action video into the action recognition neural network model, training the action recognition neural network model, and obtaining an action recognition neural network model M2 for recognizing the corresponding pinyin according to the mouth shape action video.
3. The distributed audio-video processing system of claim 2, wherein the training of the motion recognition neural network model is performed by:
the motion recognition neural network model takes a mouth shape motion video as input, takes a predicted label as output, takes a real label of the mouth shape motion video as a predicted target, and takes the predicted label and the prediction accuracy of the real label as a training target; training the action recognition neural network model until the prediction accuracy reaches a preset accuracy threshold value, and stopping training; the action recognition neural network model is labeled M2.
4. A distributed audio and video processing system according to claim 3, wherein the means for collecting audio and video data to be processed is:
capturing an audio picture and a video picture to be captured through an audio capturing device and a video capturing device to obtain corresponding audio data and video data, and obtaining the sampling rate of the audio data and the frame rate of the video data;
the audio data, the video data, the sampling rate of the audio data, and the frame rate of the video data are marked as audio-video data to be processed.
5. The distributed audio-video processing system according to claim 4, wherein the mouth shape matching is performed by:
presetting the number N of pinyin matches and a speech speed period x1; calculating a video traversing period x2 based on the speech speed period x1, wherein the video traversing period x2=x1×f; calculating an audio traversing period x3 based on the speech speed period x1, wherein the calculating formula of the audio traversing period x3 is x3=x1×v; the speech speed period x1 is the speech speed of each word of the video personnel;
starting video data from Pi frames, taking a video traversing period x2 frame as a period, intercepting N segments of video data with the length of x2, acquiring mouth images in each frame of image in each intercepted video data, forming mouth images into mouth shape matching videos, identifying pinyin from the mouth shape matching videos by using an action identification neural network model M2, identifying N pinyin altogether, and sequencing the N pinyin according to the sequence of the mouth shape matching videos;
presetting a matching frequency threshold R;
starting from the matching sampling point position Ci, intercepting the subsequent audio data by taking the x3 sampling points of the audio traversing period as a period to obtain a plurality of audio segments, extracting the characteristics of the data of the plurality of audio segments, and identifying the pinyin in the audio segments by using a machine learning model M1 until N pinyins identified according to the mouth shape matching video are sequentially matched in all the pinyins or the number of the traversed audio segments is greater than a matching frequency threshold R; if N pinyins identified according to the mouth shape matching video are matched in sequence, the position of a first sampling point corresponding to an audio segment of a first pinyin in the N pinyins is obtained; the first sample point is the die matching position Ki in the audio data.
6. The distributed audio/video processing system according to claim 5, wherein the distributed processing nodes are configured to:
respectively stamping the audio and the video in the ith audio-video section with the same time stamp at the starting position; the calculation mode of the time stamp is that the video duration Yi is added to the initial time stamp;
ordering the audio and video segments according to the video duration Yi from large to small;
obtaining the current residual calculation force of each distributed processing node, and sequencing the current residual calculation force from big to small;
and sequentially sending the audio and video segments to the distributed processing nodes in the corresponding sequence.
CN202310562473.XA 2023-05-18 2023-05-18 Distributed audio and video processing system Active CN116311538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310562473.XA CN116311538B (en) 2023-05-18 2023-05-18 Distributed audio and video processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310562473.XA CN116311538B (en) 2023-05-18 2023-05-18 Distributed audio and video processing system

Publications (2)

Publication Number Publication Date
CN116311538A CN116311538A (en) 2023-06-23
CN116311538B (en) 2023-09-01

Family

ID=86826166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310562473.XA Active CN116311538B (en) 2023-05-18 2023-05-18 Distributed audio and video processing system

Country Status (1)

Country Link
CN (1) CN116311538B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112004111A (en) * 2020-09-01 2020-11-27 南京烽火星空通信发展有限公司 News video information extraction method for global deep learning
CN114554268A (en) * 2022-02-23 2022-05-27 湖南快乐阳光互动娱乐传媒有限公司 Audio and video data processing method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220095532A (en) * 2020-12-30 2022-07-07 주식회사 쿠오핀 Method to divide the processing capabilities of artificial intelligence between devices and servers in a network environment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112004111A (en) * 2020-09-01 2020-11-27 南京烽火星空通信发展有限公司 News video information extraction method for global deep learning
CN114554268A (en) * 2022-02-23 2022-05-27 湖南快乐阳光互动娱乐传媒有限公司 Audio and video data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116311538A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN111292764B (en) Identification system and identification method
CN106297776B (en) A kind of voice keyword retrieval method based on audio template
CN110879989A (en) Ads-b signal target identification method based on small sample local machine learning model
CN112492343B (en) Video live broadcast monitoring method and related device
CN111488487B (en) Advertisement detection method and detection system for all-media data
CN112766218B (en) Cross-domain pedestrian re-recognition method and device based on asymmetric combined teaching network
CN111738218B (en) Human body abnormal behavior recognition system and method
CN112183289A (en) Method, device, equipment and medium for detecting patterned screen
CN115132201A (en) Lip language identification method, computer device and storage medium
WO2022062027A1 (en) Wine product positioning method and apparatus, wine product information management method and apparatus, and device and storage medium
CN116311538B (en) Distributed audio and video processing system
CN113065533A (en) Feature extraction model generation method and device, electronic equipment and storage medium
CN109829887B (en) Image quality evaluation method based on deep neural network
CN110163142B (en) Real-time gesture recognition method and system
CN111179972A (en) Human voice detection algorithm based on deep learning
CN114005054A (en) AI intelligence system of grading
CN114022754A (en) Few-sample image identification method combined with contrast learning
CN116453023B (en) Video abstraction system, method, electronic equipment and medium for 5G rich media information
CN110647810A (en) Method and device for constructing and identifying radio signal image identification model
CN116600166B (en) Video real-time editing method, device and equipment based on audio analysis
CN116052647A (en) Multi-modal pronunciation teaching interaction system, device and method
CN112990145B (en) Group-sparse-based age estimation method and electronic equipment
CN113329190B (en) Animation design video production analysis management method, equipment, system and computer storage medium
CN117011761A (en) Self-supervision behavior key frame detection method and system
CN116781856A (en) Audio-visual conversion control method, system and storage medium based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant