CN115119007B - Big data based audio acquisition and processing system and method for online live broadcast recording - Google Patents


Info

Publication number
CN115119007B
CN115119007B (application CN202210724426.6A)
Authority
CN
China
Prior art keywords
image
audio
recording
target
data
Prior art date
Legal status
Active
Application number
CN202210724426.6A
Other languages
Chinese (zh)
Other versions
CN115119007A (en)
Inventor
冼文忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinyingke Electroacoustic Technology Co., Ltd.
Original Assignee
Xinyingke Electroacoustic Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Xinyingke Electroacoustic Technology Co., Ltd.
Priority to CN202210724426.6A
Publication of CN115119007A
Application granted
Publication of CN115119007B


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/76 Television signal recording
    • H04N5/91 Television signal processing therefor
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/76 Television signal recording
    • H04N5/91 Television signal processing therefor
    • H04N5/92 Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback
    • H04N5/9201 Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback involving the multiplexing of an additional signal and the video signal
    • H04N5/9202 Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback involving the multiplexing of an additional signal and the video signal, the additional signal being a sound signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a big-data-based audio acquisition and processing system and method for online live broadcast recording, comprising a video data acquisition module, a recording end data analysis module, a to-be-analyzed set judgment module, a first correction audio acquisition module and a background sound adjustment module. The video data acquisition module acquires video data from the online live broadcast recording end; the recording end data analysis module analyzes the image data and audio data of the recording end; the to-be-analyzed set judgment module judges whether the set to be analyzed is empty; when the set to be analyzed is not empty, the first correction audio acquisition module analyzes the image information acquired at the listening end and marks a first correction audio; and the background sound adjustment module adjusts the proportion of background sound to main body sound in the first correction audio. The invention preprocesses the audio corresponding to image data for which a serial mark exists, so as to improve the listening experience at the listening end.

Description

Big data based audio acquisition and processing system and method for online live broadcast recording
Technical Field
The invention relates to the technical field of audio acquisition and processing, in particular to an audio acquisition and processing system and method for online live broadcast recording based on big data.
Background
With the continuous development of science and technology, online live broadcast recording has become another way for people to communicate. Online live broadcast distinguishes a recording end from a listening end; a live recording not only shows the expressions and actions of the recorded person in video, but also transmits and records that person's sound synchronously, giving the listening end the double enjoyment of sound and vision;
however, when sound is recorded at the recording end, sounds other than the recorded person's are present: some of them are meant by the recorded person to reach listeners, while others are interference that cannot be completely avoided. When such 'noise' produced at the recording end reaches the audience, it harms their listening experience. How to predict, by analyzing data in advance, the 'noise' that may be produced before it actually occurs is therefore the problem the invention sets out to solve.
Disclosure of Invention
The invention aims to provide an audio acquisition and processing system and method for online live broadcast recording based on big data, so as to solve the problems in the background technology.
In order to solve the technical problems, the invention provides the following technical scheme: an audio acquisition and processing method for online live broadcast recording based on big data comprises the following specific steps:
step S100: acquiring video data of an online live recording end, wherein the video data comprises image data and audio data; dividing the audio data and the image data of the recording end into consecutive, equal-length segments placed in one-to-one correspondence by time order; the audio data of the recording end comprises a recording main body sound and a recording background sound, the recording main body sound being the audio of the content presented by the recorded person, and the recording background sound being the audio corresponding to non-recorded content;
step S200: recording the image data corresponding to the recording main body sound as a first image set, and the image data corresponding to the recording background sound as a second image set; analyzing the behavior of the recorded person in the first image set and the second image set of the recording end, and taking the intersection of the first image set and the second image set as the set to be analyzed;
step S300: judging the set to be analyzed, the first image set and the second image set both being non-empty sets; when the set to be analyzed is empty, analyzing the condition set under which the recording background sound in the second image set needs to be corrected; when the set to be analyzed is not empty, recording it as the target set, and extracting the corresponding images in the target set as target images, the audio data corresponding to a target image being a target audio, which comprises a target main body sound and a target background sound;
step S400: recording the image information acquired at the listening end at the times at which the images in the target set were generated, the image information being listener images captured by the listening end's camera device; analyzing the relation between the image information acquired at the listening end and the target audio, and marking a first correction audio;
step S500: based on the first correction audio marked in step S400, adjusting the target background sound in the first correction audio. When background sound and main body sound are present at the same time, they interfere with how the listening end hears the audio and give a poor listening experience; the main body sound recorded at the recording end is also less clear, which reduces recording efficiency.
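Returning to the alignment in step S100, the following sketch (in Python; the segment length, the stream stand-ins and the helper name pair_by_time are assumptions for illustration, not taken from the patent) cuts synchronized image and audio streams into consecutive equal-length segments and pairs them in time order:

def pair_by_time(image_frames, audio_samples, seg_len):
    # Return (image_segment, audio_segment) pairs in chronological order.
    n = min(len(image_frames), len(audio_samples)) // seg_len
    return [
        (image_frames[i * seg_len:(i + 1) * seg_len],
         audio_samples[i * seg_len:(i + 1) * seg_len])
        for i in range(n)
    ]

frames = list(range(12))       # stand-in image frames
samples = list(range(12))      # stand-in audio samples
print(pair_by_time(frames, samples, 4))   # three aligned segment pairs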
Further, analyzing the behavior of the recorded person in the image data of the recording end comprises the following steps:
step S210: marking the positions of the head, eyes and hand elbow of the recorded person in the first image set and the second image set; establishing a rectangular coordinate system with the centre point of the image data as the origin; recording the mean angle R1 formed, over the course of its changes, by the segment from the recorded person's head position (with the nose as the fixed point) to the origin in the first image set, and the corresponding mean angle R2 in the second image set; calculating the difference between the mean angles R1 and R2, and recording the recorded person's head as the first serial mark when the difference is greater than or equal to a preset difference threshold;
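A minimal sketch of step S210, under stated assumptions: the nose position per frame has already been extracted, the image centre is the origin, the "angle" is the polar angle of the origin-to-nose segment, and the 10-degree threshold is purely illustrative:

import math

def mean_head_angle(nose_points):
    # Mean angle (degrees) of the origin-to-nose segment across frames.
    return sum(math.degrees(math.atan2(y, x)) for x, y in nose_points) / len(nose_points)

def first_serial_mark(noses_set1, noses_set2, diff_threshold=10.0):
    r1 = mean_head_angle(noses_set1)   # R1: mean angle in the first image set
    r2 = mean_head_angle(noses_set2)   # R2: mean angle in the second image set
    return abs(r1 - r2) >= diff_threshold   # True: head receives the first serial mark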
step S220: when the difference is smaller than the preset difference threshold, obtaining the eye-opening ratios E = {E1a, E1b, E2a, E2b}, the eye-opening ratio being the ratio of the exposed eyeball area to the whole eye area, where the whole eye area is the rectangular region above the eyelids and below the eyebrows; using the formula:
(formula image GDA0004040792040000021 in the original: the eye dynamic index e expressed in terms of E1a, E1b, E2a and E2b)
calculating the eye dynamic index e of the recorded person, wherein E1a is the mean eye-opening ratio when the recorded person's head-position change angle in the first image set is less than or equal to R1, E1b the mean when that angle is greater than R1, E2a the mean when the head-position change angle in the second image set is less than or equal to R2, and E2b the mean when that angle is greater than R2;
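The formula image above is not reproduced; as one explicitly assumed form (the patented formula may differ), the following sketch contrasts how the eye-opening ratio shifts with head angle between the two image sets:

def eye_dynamic_index(e1a, e1b, e2a, e2b):
    # Change of eye opening as the head angle crosses its mean, per image set
    shift_first = e1b - e1a    # first image set
    shift_second = e2b - e2a   # second image set
    return abs(shift_first - shift_second)   # assumed definition of e

print(eye_dynamic_index(0.42, 0.55, 0.40, 0.41))   # 0.12 -> compare against e0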
the eye dynamic index is calculated in order to analyze the dynamic trend of the recorded person's eye behavior, under the same tendency of head-angle change, across the images corresponding to different audio; 'the same tendency' means that when the head-position angle change of the recorded person in the first image set is smaller than the mean angle, the value taken from the second image set is likewise the one smaller than its mean angle. The dynamic trend of eye behavior as influenced by head-angle change is analyzed within the same set, and the differences in the recorded person's eye behavior across different scenes are analyzed comprehensively;
step S230: comparing the recorded person's eye dynamic index e with a preset dynamic index threshold e0; when e is greater than or equal to e0, recording the recorded person's eyes as the second serial mark, an index above the threshold meaning that the recorded person's eye behavior differs dynamically across scenes; when e is smaller than e0, acquiring the dwell time h1k of the recorded person's hand elbow in the kth quadrant in the first image set and the dwell time h2k in the kth quadrant in the second image set; arranging the corresponding quadrants of the first image set into a set K1 in descending order of dwell time, and the corresponding quadrants of the second image set into a set K2 in descending order of dwell time; judging whether the first quadrants of the sets K1 and K2 are the same, and recording it as the third serial mark when the leading quadrant regions do not correspond.
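A minimal sketch of the elbow check in step S230, assuming the elbow position per frame is given as a quadrant number 1-4 and dwell time is approximated by frame counts:

from collections import Counter

def dwell_ranking(elbow_quadrants):
    # Quadrants ordered by descending dwell time (frame count).
    return [q for q, _ in Counter(elbow_quadrants).most_common()]

def third_serial_mark(quadrants_set1, quadrants_set2):
    k1 = dwell_ranking(quadrants_set1)   # set K1
    k2 = dwell_ranking(quadrants_set2)   # set K2
    return k1[0] != k2[0]                # differing leading quadrants -> third mark

print(third_serial_mark([1, 1, 1, 2], [3, 3, 4, 1]))   # True: quadrant 1 vs quadrant 3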
By analyzing the image data of the recording end, the differences in the recorded person's behavior between the main body sound and the background sound during recording are analyzed case by case; the image sets divided according to the audio produce different scenes, and different scenes yield different serial marks on analysis. The recorded person's head is judged first because the head is easy to calibrate in the video image and head differences are simple and fast to analyze; the eyes are analyzed further when the head shows no difference, since the head position of the recorded person may be unchanged while the difference between the images corresponding to the audio is caused by dynamic changes of the eyes; and if the dynamic changes of the eyes cannot effectively distinguish the recorded person's behavior between the images corresponding to the main body sound and the background sound, the hand elbow is analyzed further. Through this triple verification, the behavioral differences of the recorded person between the images corresponding to the main body sound and the background sound can be marked effectively.
Further, when the set to be analyzed is empty, analyzing the condition set under which the recording background sound in the second image set needs to be corrected comprises the following steps:
acquiring the ith image p2i in the second image set and the image p1i in the first image set corresponding to the immediately preceding time period, substituting the set formed by image p1i and image p2i into the procedure of steps S210 to S230 for analysis, and judging the resulting target serial mark, the target serial mark being any one of {first serial mark, second serial mark, third serial mark};
when the target serial mark is the first serial mark, recording the angle threshold of image p1i as [R(p1i)min, R(p1i)max], this angle threshold being correction condition one, where R(p1i)min is the minimum angle and R(p1i)max the maximum angle formed, over the course of its changes, by the segment from the head position in image p1i (with the nose as the fixed point) to the origin;
when the target serial mark is the second serial mark, establishing the relation pairs {first serial mark → second serial mark} in image p1i, a relation pair being the relation between each head position and the corresponding eye-opening ratio, and recording the relation-pair threshold {first serial mark min → second serial mark max} of image p1i as correction condition two, where first serial mark min → second serial mark max denotes all combinations of the minimum angle formed, over the course of its changes, by the segment from the head position (with the nose as the fixed point) to the origin with the maximum eye-opening ratio corresponding to that head position;
when the target serial mark is the third serial mark, acquiring the quadrant K0 corresponding to the third serial mark in the first image set before the generation time of the image corresponding to the third serial mark; acquiring the quadrant set {K1a, K2a, K3a} other than the third serial mark, arranged in descending order of dwell time, and the quadrants {K1b, K2b, K3b} corresponding to {K1a, K2a, K3a} in the first image set before the image generation time; constructing quadrant dynamic paths Q = {KA → KB} and sorting them by quadrant difference from small to large, with the quadrant dynamic path corresponding to the third serial mark always first; this priority ordering is correction condition three. A long dwell time indicates a large proportion of background sound and a large influence on the audio recording, so that condition is examined first; the quadrant differences are sorted from small to large because, where different audio corresponds to adjacent image data, a small quadrant change is hard to monitor, whereas a large quadrant change means a large movement by the recorded person that is likely to have already produced a difference in the previous two layers of analysis; quadrant paths with small differences are therefore examined with priority;
the condition set for correcting the recording background sound is {correction condition one, correction condition two, correction condition three}; when the image data of the recording end is detected to satisfy any correction condition, the volume of the background sound is reduced in the state corresponding to {first serial mark, second serial mark, third serial mark}.
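A minimal sketch of applying the condition set, assuming each correction condition has been reduced to a predicate over the latest recording-end image data; the 0.5 attenuation factor and the dictionary keys are assumptions for illustration:

def apply_corrections(image_data, conditions, background_gain):
    # Return the (possibly reduced) background-sound gain.
    if any(cond(image_data) for cond in conditions):
        return background_gain * 0.5   # reduce the background-sound volume
    return background_gain

# e.g. correction condition one: head angle inside [R(p1i)min, R(p1i)max]
cond_one = lambda d: d["angle_min"] <= d["head_angle"] <= d["angle_max"]
print(apply_corrections({"head_angle": 12, "angle_min": 5, "angle_max": 20},
                        [cond_one], 1.0))   # 0.5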
Further, analyzing the relation between the image information acquired at the listening end and the target audio, and marking the first correction audio, comprises the following steps:
step S410: acquiring the image information of the listening end corresponding to the target audio, establishing the face extension degree change curve of the listener image, extracting the target audio corresponding to the curve points whose change rate is greater than or equal to the change-rate threshold, and recording it as the first target audio;
step S420: recording the target audio corresponding to the curve points whose change rate is smaller than the change-rate threshold as the second target audio; comparing the similarity between the target image corresponding to the first target audio and the target image corresponding to the second target audio; if the similarity is smaller than the similarity threshold, marking the first target audio as the first correction audio; if the similarity is greater than or equal to the similarity threshold, not marking it. When background sound and main body sound occur, the facial expression of the listener at the listening end reflects how well the recording end's audio is received: if the recording end produces noise or a harsh sound, the face of the listener at the listening end reacts correspondingly, so it can be estimated whether the background sound and main body sound of the recording end are affecting the audio at the listening end. The similarity is analyzed because an obvious difference between the image above the face-extension change-rate threshold and the image below it indicates that the audio above the threshold is affecting the listening end, whereas if the two images do not differ obviously, the face extension above the threshold may be due to the listener's own reasons and the audio data of the recording end need not be adjusted.
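A minimal sketch of steps S410-S420 under stated assumptions: the listener's face extension degree per frame is a scalar, the change rate is a simple first difference, and the image-similarity measure is supplied by the caller (the patent does not specify one):

def mark_first_correction(face_curve, audios, images, rate_thr, sim_thr, similarity):
    rates = [abs(b - a) for a, b in zip(face_curve, face_curve[1:])]
    first = [i for i, r in enumerate(rates) if r >= rate_thr]   # first target audio
    second = [i for i, r in enumerate(rates) if r < rate_thr]   # second target audio
    marked = []
    for i in first:
        if any(similarity(images[i], images[j]) < sim_thr for j in second):
            marked.append(audios[i])   # visibly different image -> first correction audio
    return marked

sim = lambda a, b: 1.0 if a == b else 0.0   # stand-in similarity measure
print(mark_first_correction([0.1, 0.1, 0.6], ["a0", "a1"], ["i0", "i1"], 0.3, 0.5, sim))   # ['a1']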
Further, adjusting the target background sound in the first correction audio according to the first correction audio marked in step S400 comprises the following steps:
step S510: acquiring the first correction audio, and mixing and synthesizing the target main body sound and the target background sound in the first correction audio based on a mixing ratio, the mixing ratio being target main body sound : target background sound = s0 : g0, with s0 > g0;
step S520: acquiring the mixing ratio at which the listening end receives the second target audio, and calculating the mixing-ratio threshold G = [s1/g1, s2/g2] corresponding to the second target audio, where s1/g1 is the minimum value and s2/g2 the maximum value of the ratio of the target main body sound to the target background sound in the second target audio;
step S530: adjusting s0/g0 so that s0/g0 ∈ [s1/g1, s2/g2].
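A minimal sketch of step S530: pull the mixing ratio s0/g0 into the threshold interval [s1/g1, s2/g2] by rescaling the background-sound gain g0 (keeping s0 fixed is an assumption; the patent only requires the ratio to land in the interval):

def adjust_mix(s0, g0, lo, hi):
    ratio = s0 / g0
    target = min(max(ratio, lo), hi)   # clamp into [s1/g1, s2/g2]
    return s0, s0 / target             # keep s0, solve g0 = s0 / target

print(adjust_mix(3.0, 2.0, 2.0, 4.0))   # ratio 1.5 -> raised to 2.0, g0 becomes 1.5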
An audio acquisition and processing system for online live broadcast recording based on big data comprises a video data acquisition module, a recording end data analysis module, a to-be-analyzed set judgment module, a first correction audio acquisition module and a background sound adjustment module;
the video data acquisition module is used for acquiring video data of the online live broadcast recording end, wherein the video data comprises image data and audio data, divided into consecutive, equal-length segments placed in one-to-one correspondence by time order; the audio data of the recording end comprises a recording main body sound and a recording background sound, the recording main body sound being the audio of the content presented by the recorded person, and the recording background sound being the audio corresponding to non-recorded content;
the recording end data analysis module is used for analyzing the image data and the audio data of the recording end;
the to-be-analyzed set judgment module is used for judging whether the set to be analyzed, acquired by the recording end data analysis module, is empty;
when the set to be analyzed is not empty, the first correction audio acquisition module analyzes the image information acquired at the listening end and marks the first correction audio;
the background sound adjustment module is used for adjusting the proportion of background sound to main body sound in the first correction audio.
Further, the recording end data analysis module comprises a first image set acquisition unit, a second image set acquisition unit and a serial mark analysis unit;
the first image set acquisition unit acquires the image data corresponding to the recording main body sound and records it as the first image set;
the second image set acquisition unit acquires the image data corresponding to the recording background sound and records it as the second image set;
the serial mark analysis unit is used for analyzing the positional relations of the recorded person's head, eyes and hand elbow, establishing through layer-by-layer analysis a progressive analysis method from the head to the eyes to the elbow.
Further, the to-be-analyzed set judgment module comprises a data substitution unit, a scene analysis unit and a condition set acquisition unit;
the data substitution unit substitutes the actual data of the acquired set to be analyzed into the analysis method of the serial mark analysis unit for analysis;
the scene analysis unit analyzes different scenes based on the result of the data substitution unit;
the condition set acquisition unit forms a condition set based on the correction conditions obtained by the scene analysis unit, and reduces the volume of the background sound when detecting that the image data of the recording end meets any correction condition.
Further, the first correction audio acquisition module comprises a face extension degree change curve establishing unit, a curve point marking unit and a similarity comparison unit;
the face extension degree change curve establishing unit is used for acquiring image information of a listening end corresponding to the target audio and establishing a face extension degree change curve of the image of the listener;
the curve point marking unit is used for marking the target audio corresponding to the curve points of the face extension degree change curve whose change rate is greater than or equal to the change-rate threshold, recording it as the first target audio, and marking the target audio corresponding to the curve points whose change rate is smaller than the change-rate threshold, recording it as the second target audio;
the similarity comparison unit is used for comparing the similarity of a target image corresponding to the first target audio and a target image corresponding to the second target audio, and if the similarity is smaller than a similarity threshold value, the first target audio is marked as a first correction audio; and if the similarity is greater than or equal to the similarity threshold, not marking.
Compared with the prior art, the invention has the following beneficial effects: the invention combines image data and audio data in the analysis of the recording end and, on the basis of big data, analyzes the recorded person's behavior with a method that covers three different parts of the recording-end image and proceeds layer by layer, which is accurate and fast. At the same time, actual data under different conditions are substituted into the behavior-analysis method to obtain the corresponding serial marks, and the audio corresponding to image data for which a serial mark exists is preprocessed, so that the audio proportion of the background sound is adjusted in time whenever the background sound would make a poor impression on listeners. This satisfies the listening experience at the listening end, improves the recording efficiency of the recorded person, and makes the main body sound clearer.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic structural diagram of an audio acquisition and processing system for big data-based online live recording according to the present invention;
fig. 2 is a frame diagram of the audio acquisition and processing system in an embodiment of the big-data-based audio acquisition and processing system and method for online live broadcast recording.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-2, the present invention provides a technical solution: an audio acquisition and processing method for online live broadcast recording based on big data, characterized by comprising the following specific steps:
step S100: acquiring video data of an online live recording end, wherein the video data comprises image data and audio data; dividing the audio data and the image data of the recording end into consecutive, equal-length segments placed in one-to-one correspondence by time order; the audio data of the recording end comprises a recording main body sound and a recording background sound, the recording main body sound being the audio of the content presented by the recorded person, and the recording background sound being the audio corresponding to non-recorded content;
step S200: recording the image data corresponding to the recording main body sound as a first image set, and the image data corresponding to the recording background sound as a second image set; analyzing the behavior of the recorded person in the first image set and the second image set of the recording end. Two situations are possible. In the first, only the main body sound is present while the recorded person presents content, and background sound appears when the main body sound stops; for example, when a teacher lecturing online is not speaking, the friction sound of wiping the board or writing on it is background sound. In the second, the main body sound and the background sound of the recorded content are present at the same time; for example, the teacher writes on the board while lecturing, the friction sound of the writing being background sound, or the sound of a loudspeaker on the road outside the window serves as background sound alongside the main body sound. The intersection of the first image set and the second image set is taken as the set to be analyzed;
step S300: judging the set to be analyzed, the first image set and the second image set both being non-empty sets; when the set to be analyzed is empty, analyzing the condition set under which the recording background sound in the second image set needs to be corrected; when the set to be analyzed is not empty, recording it as the target set, and extracting the corresponding images in the target set as target images, the audio data corresponding to a target image being a target audio, which comprises a target main body sound and a target background sound;
step S400: recording the image information acquired at the listening end at the times at which the images in the target set were generated, the image information being listener images captured by the listening end's camera device; analyzing the relation between the image information acquired at the listening end and the target audio, and marking a first correction audio;
step S500: based on the first correction audio marked in step S400, adjusting the target background sound in the first correction audio. When background sound and main body sound are present at the same time, they interfere with how the listening end hears the audio and give a poor listening experience; the main body sound recorded at the recording end is also less clear, which reduces recording efficiency.
Analyzing the behavior of the recorded person in the image data of the recording end comprises the following steps:
step S210: marking the positions of the head, eyes and hand elbow of the recorded person in the first image set and the second image set; establishing a rectangular coordinate system with the centre point of the image data as the origin; recording the mean angle R1 formed, over the course of its changes, by the segment from the recorded person's head position (with the nose as the fixed point) to the origin in the first image set, and the corresponding mean angle R2 in the second image set; calculating the difference between the mean angles R1 and R2, and recording the recorded person's head as the first serial mark when the difference is greater than or equal to a preset difference threshold;
step S220: when the difference is smaller than the preset difference threshold, obtaining the eye-opening ratios E = {E1a, E1b, E2a, E2b}, the eye-opening ratio being the ratio of the exposed eyeball area to the whole eye area, where the whole eye area is the rectangular region above the eyelids and below the eyebrows; using the formula:
(formula image GDA0004040792040000091 in the original: the eye dynamic index e expressed in terms of E1a, E1b, E2a and E2b)
calculating the eye dynamic index e of the recorded person, wherein E1a is the mean eye-opening ratio when the recorded person's head-position change angle in the first image set is less than or equal to R1, E1b the mean when that angle is greater than R1, E2a the mean when the head-position change angle in the second image set is less than or equal to R2, and E2b the mean when that angle is greater than R2;
the eye dynamic index is calculated in order to analyze the dynamic trend of the recorded person's eye behavior, under the same tendency of head-angle change, across the images corresponding to different audio; 'the same tendency' means that when the head-position angle change of the recorded person in the first image set is smaller than the mean angle, the value taken from the second image set is likewise the one smaller than its mean angle. The dynamic trend of eye behavior as influenced by head-angle change is analyzed within the same set, and the differences in the recorded person's eye behavior across different scenes are analyzed comprehensively;
step S230: comparing the recorded person's eye dynamic index e with a preset dynamic index threshold e0; when e is greater than or equal to e0, recording the recorded person's eyes as the second serial mark, an index above the threshold meaning that the recorded person's eye behavior differs dynamically across scenes; when e is smaller than e0, acquiring the dwell time h1k of the recorded person's hand elbow in the kth quadrant in the first image set and the dwell time h2k in the kth quadrant in the second image set; arranging the corresponding quadrants of the first image set into a set K1 in descending order of dwell time, and the corresponding quadrants of the second image set into a set K2 in descending order of dwell time; judging whether the first quadrants of the sets K1 and K2 are the same, and recording it as the third serial mark when the leading quadrant regions do not correspond.
By analyzing the image data of the recording end, the differences in the recorded person's behavior between the main body sound and the background sound during recording are analyzed case by case; the image sets divided according to the audio produce different scenes, and different scenes yield different serial marks on analysis. The recorded person's head is judged first because the head is easy to calibrate in the video image and head differences are simple and fast to analyze; the eyes are analyzed further when the head shows no difference, since the head position of the recorded person may be unchanged while the difference between the images corresponding to the audio is caused by dynamic changes of the eyes; and if the dynamic changes of the eyes cannot effectively distinguish the recorded person's behavior between the images corresponding to the main body sound and the background sound, the hand elbow is analyzed further. Through this triple verification, the behavioral differences of the recorded person between the images corresponding to the main body sound and the background sound can be marked effectively.
When the set to be analyzed is empty, analyzing the condition set under which the recording background sound in the second image set needs to be corrected comprises the following steps:
acquiring the ith image p2i in the second image set and the image p1i in the first image set corresponding to the immediately preceding time period, substituting the set formed by image p1i and image p2i into the procedure of steps S210 to S230 for analysis, and judging the resulting target serial mark, the target serial mark being any one of {first serial mark, second serial mark, third serial mark};
when the target serial mark is the first serial mark, recording the angle threshold of image p1i as [R(p1i)min, R(p1i)max], this angle threshold being correction condition one, where R(p1i)min is the minimum angle and R(p1i)max the maximum angle formed, over the course of its changes, by the segment from the head position in image p1i (with the nose as the fixed point) to the origin;
when the target serial mark is the second serial mark, establishing the relation pairs {first serial mark → second serial mark} in image p1i, a relation pair being the relation between each head position and the corresponding eye-opening ratio, and recording the relation-pair threshold {first serial mark min → second serial mark max} of image p1i as correction condition two, where first serial mark min → second serial mark max denotes all combinations of the minimum angle formed, over the course of its changes, by the segment from the head position (with the nose as the fixed point) to the origin with the maximum eye-opening ratio corresponding to that head position;
when the target serial mark is the third serial mark, acquiring the quadrant K0 corresponding to the third serial mark in the first image set before the generation time of the image corresponding to the third serial mark; acquiring the quadrant set {K1a, K2a, K3a} other than the third serial mark, arranged in descending order of dwell time, and the quadrants {K1b, K2b, K3b} corresponding to {K1a, K2a, K3a} in the first image set before the image generation time; constructing quadrant dynamic paths Q = {KA → KB} and sorting them by quadrant difference from small to large, with the quadrant dynamic path corresponding to the third serial mark always first; this priority ordering is correction condition three. A long dwell time indicates a large proportion of background sound and a large influence on the audio recording, so that condition is examined first; the quadrant differences are sorted from small to large because, where different audio corresponds to adjacent image data, a small quadrant change is hard to monitor, whereas a large quadrant change means a large movement by the recorded person that is likely to have already produced a difference in the previous two layers of analysis; quadrant paths with small differences are therefore examined with priority;
for example, in the first image set the elbow dwell times are {quadrant 1: 15 min, quadrant 2: 3 min, quadrant 3: 1 min, quadrant 4: 2 min};
in the second image set they are {quadrant 1: 1 min, quadrant 2: 1 min, quadrant 3: 6 min, quadrant 4: 5 min};
the third serial mark corresponds to quadrant 3,
and the quadrant K0 corresponding to the first image set before the generation time of the image of the third serial mark is quadrant 2,
so the quadrant dynamic path corresponding to the third serial mark is {quadrant 3 → K0: quadrant 2};
the quadrant set other than the third serial mark, arranged in descending order of dwell time, is {K1a (quadrant 4), K2a (quadrant 2), K3a (quadrant 1)}; {K1a, K2a, K3a} correspond to the quadrants {K1b (quadrant 3), K2b (quadrant 4), K3b (quadrant 1)} in the first image set before the image generation time;
the corresponding differences are {K1a (quadrant 4) − K1b (quadrant 3) = 1, K2b (quadrant 4) − K2a (quadrant 2) = 2, K3a (quadrant 1) − K3b (quadrant 1) = 0};
the priority is {third serial mark → K0 > K3a → K3b > K1a → K1b > K2a → K2b};
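The ordering in this example can be reproduced in a few lines (a sketch under the assumption that the dwell-ranked quadrants of the second image set and their first-image-set counterparts are given; the path of the third serial mark is pinned first, the rest sorted by ascending quadrant difference):

third_mark_path = (3, 2)            # quadrant 3 -> K0 (quadrant 2)
pairs = [(4, 3), (2, 4), (1, 1)]    # (K1a, K1b), (K2a, K2b), (K3a, K3b)
ranked = sorted(pairs, key=lambda p: abs(p[0] - p[1]))
priority = [third_mark_path] + ranked
print(priority)   # [(3, 2), (1, 1), (4, 3), (2, 4)] -- matches the stated priority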
the condition set for correcting the recording background sound is {correction condition one, correction condition two, correction condition three}; when the image data of the recording end is detected to satisfy any correction condition, the volume of the background sound is reduced in the state corresponding to {first serial mark, second serial mark, third serial mark}.
Analyzing the relation between the image information acquired at the listening end and the target audio, and marking the first correction audio, comprises the following steps:
step S410: acquiring the image information of the listening end corresponding to the target audio, establishing the face extension degree change curve of the listener image, extracting the target audio corresponding to the curve points whose change rate is greater than or equal to the change-rate threshold, and recording it as the first target audio;
step S420: recording the target audio corresponding to the curve points whose change rate is smaller than the change-rate threshold as the second target audio; comparing the similarity between the target image corresponding to the first target audio and the target image corresponding to the second target audio; if the similarity is smaller than the similarity threshold, marking the first target audio as the first correction audio; if the similarity is greater than or equal to the similarity threshold, not marking it. When background sound and main body sound occur, the facial expression of the listener at the listening end reflects how well the recording end's audio is received: if the recording end produces noise or a harsh sound, the face of the listener at the listening end reacts correspondingly, so it can be estimated whether the background sound and main body sound of the recording end are affecting the audio at the listening end. The similarity is analyzed because an obvious difference between the image above the face-extension change-rate threshold and the image below it indicates that the audio above the threshold is affecting the listening end, whereas if the two images do not differ obviously, the face extension above the threshold may be due to the listener's own reasons and the audio data of the recording end need not be adjusted.
Adjusting the target background sound in the first correction audio according to the first correction audio marked in step S400 comprises the following steps:
step S510: acquiring the first correction audio, and mixing and synthesizing the target main body sound and the target background sound in the first correction audio based on a mixing ratio, the mixing ratio being target main body sound : target background sound = s0 : g0, with s0 > g0;
step S520: acquiring the mixing ratio at which the listening end receives the second target audio, and calculating the mixing-ratio threshold G = [s1/g1, s2/g2] corresponding to the second target audio, where s1/g1 is the minimum value and s2/g2 the maximum value of the ratio of the target main body sound to the target background sound in the second target audio;
step S530: adjusting s0/g0 so that s0/g0 ∈ [s1/g1, s2/g2].
The system for acquiring and processing the audio recorded by live broadcasting on line based on big data is characterized by comprising a video data acquisition module, a recording end data analysis module, a to-be-analyzed set judgment module, a first corrected audio acquisition module and a background sound adjustment module;
the video data acquisition module is used for acquiring video data of the online live broadcast recording end, wherein the video data comprises image data and audio data, divided into consecutive, equal-length segments placed in one-to-one correspondence by time order; the audio data of the recording end comprises a recording main body sound and a recording background sound, the recording main body sound being the audio of the content presented by the recorded person, and the recording background sound being the audio corresponding to non-recorded content;
the recording end data analysis module is used for analyzing the image data and the audio data of the recording end;
the to-be-analyzed set judgment module is used for judging whether the set to be analyzed, acquired by the recording end data analysis module, is empty;
when the set to be analyzed is not empty, the first correction audio acquisition module analyzes the image information acquired at the listening end and marks the first correction audio;
the background sound adjustment module is used for adjusting the proportion of background sound to main body sound in the first correction audio.
Example as shown in fig. 2: the video data acquisition module of the system uses an audio acquisition and processing system with the following structure: analog-to-digital and digital-to-analog converters (ADC/DAC), a large-scale array integrated circuit (FPGA), an integrated touch-display module, an OTG chip, a Bluetooth chip, FLASH memory, potentiometers, keys and other control parts;
the system is provided with 4 microphone input interfaces that can accept four people speaking or singing simultaneously, 2 groups of stereo input interfaces to which electronic keyboards or other instrument players can be connected, 1 digital OTG audio input/output interface matched to a smartphone terminal, 1 high-speed USB digital audio input/output interface, 5 earphone output interfaces, 1 monitoring audio output interface, wireless Bluetooth audio, and so on. The large-scale array integrated circuit forms an 8-by-8 audio matrix in which signal routes can be freely assigned to each input/output interface, and performs audio-signal processing such as gain, volume, effects, noise and equalization;
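The routing can be pictured as a gain matrix: each of the 8 outputs is a gain-weighted sum of the 8 inputs. A minimal sketch follows (the gain values, block size and use of numpy are illustrative assumptions, not the FPGA implementation):

import numpy as np

n = 8
gains = np.zeros((n, n))
gains[0, 0] = 1.0    # route input 1 to output 1 at unity gain
gains[1, 0] = 0.5    # also blend input 1 into output 2 at reduced gain

inputs = np.random.randn(n, 480)   # 8 channels x 480 samples (10 ms at 48 kHz)
outputs = gains @ inputs           # matrix routing: outputs = G x inputs
print(outputs.shape)               # (8, 480)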
the acquisition system uses a large touch display with a human-machine interface covered by independent intellectual property rights; a large number of physical operating keys can be simulated on the touch display, which effectively reduces the number of desktop keys on the audio acquisition and processing system for online live broadcast recording as a whole, reduces its volume and miniaturizes the product. The function and state of each audio channel of the system are visualized on the large touch display, so that a user can easily and vividly observe and understand them, effectively reducing the time needed to learn and use the system;
the acquisition system adopts adjustable RGB LEDs driven by PWM pulse adjustment, so that the colour and brightness displayed by each physical key on the system desktop can be set. Users can customize the brightness and colour of each key according to personal taste and ambient lighting, which makes the product more interesting and attractive to use and, in long sessions, gives the user's eyes a degree of protection. The keys of traditional products can only be switched between lit and unlit; their brightness and colour cannot be set;
the acquisition system provides a digital audio OTG interface and transmits the acquired and processed audio signal to a smartphone in digital audio form, realizing lossless, high-fidelity audio input and output. The system supports the main smartphone systems on the market, such as Huawei HarmonyOS, Android and Apple iOS; connected by an OTG data line to the smartphone's Type-C or Lightning interface, it can transmit lossless, high-fidelity digital audio of up to 24 bits/48 kHz in both directions, and can charge the connected smartphone during use, realizing charging while transmitting. Traditional products are mostly connected through a 4-pole audio cable to the smartphone's headphone interface or an external earphone interface to transmit analog audio signals, achieving an audio effect with a poor signal-to-noise ratio, narrow frequency response, small dynamic range and susceptibility to interference. The system solves the problems of the traditional analog connection, in which the dynamic range of the audio signal is limited, the noise is large and interference is easy, degrading the quality of the audio signal.
The recording end data analysis module comprises a first image set acquisition unit, a second image set acquisition unit and a serial mark analysis unit;
the first image set acquisition unit acquires the image data corresponding to the recording main body sound and records it as the first image set;
the second image set acquisition unit acquires the image data corresponding to the recording background sound and records it as the second image set;
the serial mark analysis unit is used for analyzing the positional relations of the recorded person's head, eyes and hand elbow, establishing through layer-by-layer analysis a progressive analysis method from the head to the eyes to the elbow.
The to-be-analyzed set judgment module comprises a data substituting unit, a scene analysis unit and a condition set acquisition unit;
the data substitution unit substitutes the actual data of the acquired set to be analyzed into the analysis method of the serial mark analysis unit for analysis;
the scene analysis unit analyzes different scenes based on the result of the data substitution unit;
the condition set acquisition unit forms a condition set based on the correction conditions obtained by the scene analysis unit, and reduces the volume of the background sound when detecting that the image data of the recording end meets any correction condition.
The first correction audio acquisition module comprises a face extension degree change curve establishing unit, a curve point marking unit and a similarity comparison unit;
the face extension degree change curve establishing unit is used for acquiring image information of a listening end corresponding to the target audio and establishing a face extension degree change curve of the image of the listener;
the curve point marking unit is used for marking the target audio corresponding to the curve points of the face extension degree change curve whose change rate is greater than or equal to the change-rate threshold, recording it as the first target audio, and marking the target audio corresponding to the curve points whose change rate is smaller than the change-rate threshold, recording it as the second target audio;
the similarity comparison unit is used for comparing the similarity of a target image corresponding to the first target audio and a target image corresponding to the second target audio, and if the similarity is smaller than a similarity threshold value, the first target audio is marked as a first correction audio; if the similarity is greater than or equal to the similarity threshold, marking is not performed.
Example in fig. 2: the background sound adjustment module of the system can also perform multitrack recording and synthesis, i.e., for the proportion adjustment of the mixed audio, the audio signals of all input channels are recorded to storage. Clicking the recording key on the system desktop starts the recording function with one key; the function is independent, so recording does not affect the system's other functions. Using the 'POLYWAV' multitrack recording technology, the system processes up to 16 channels of audio signals independently to form one recording file containing 16 tracks, which is written to storage. The file can restore 14 channels of independent audio signals in a computer DAW recording program and be processed independently in later editing. Traditional products can only mix the multichannel audio signals down to a 2-channel stereo file before recording it to a storage device, so the audio signal of each channel cannot be processed independently in later audio editing.
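A minimal sketch of such a multitrack capture, assuming the 16 input channels are interleaved into a single 16-track WAV file (the file name, the silent payload and the use of Python's wave module are illustrative; the product's 'POLYWAV' format itself is not published here):

import struct
import wave

CHANNELS, RATE, FRAMES = 16, 48000, 4800            # 0.1 s of audio
with wave.open("multitrack.wav", "wb") as w:
    w.setnchannels(CHANNELS)                        # one track per input channel
    w.setsampwidth(2)                               # 16-bit PCM
    w.setframerate(RATE)
    frame = struct.pack("<%dh" % CHANNELS, *([0] * CHANNELS))
    w.writeframes(frame * FRAMES)                   # interleaved sample frames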
It should be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. An audio acquisition and processing method for online live broadcast recording based on big data is characterized by comprising the following specific steps:
step S100: acquiring video data of an online live recording end, wherein the video data comprises image data and audio data; dividing the audio data and the image data of the recording end into consecutive, equal-length segments placed in one-to-one correspondence by time order; the audio data of the recording end comprises a recording main body sound and a recording background sound, the recording main body sound being the audio of the content presented by the recorded person, and the recording background sound being the audio corresponding to non-recorded content;
step S200: recording the image data corresponding to the recording main body sound as a first image set, and the image data corresponding to the recording background sound as a second image set; analyzing the behavior of the recorded person in the first image set and the second image set of the recording end, and taking the intersection of the first image set and the second image set as the set to be analyzed;
the method for analyzing the behavior of the recorded person in the image data of the recording end comprises the following steps:
step S210: marking the positions of the head, eyes and elbows of the recorded person in the first image set and the second image set; establishing a rectangular coordinate system with the center point of the image data as the origin, and recording the mean angle R1 formed, over the course of the change, by the line segment from the recorded person's head position (with the nose as the fixed point) to the origin in the first image set, together with the corresponding mean angle R2 in the second image set; calculating the difference between R1 and R2, and when the difference is greater than or equal to a preset difference threshold, recording the head of the recorded person as a first serial mark;
step S220: when the difference is smaller than the preset difference threshold, obtaining the eye expansion ratio set E = {E1a, E1b, E2a, E2b}, the eye expansion ratio being the ratio of the exposed eyeball area to the whole eye area, the whole eye area being the rectangular region bounded below by the eyelid and above by the eyebrow; using the formula:
[Formula FDA0004040792030000011: an image in the original defining the eye dynamic index e in terms of E1a, E1b, E2a and E2b; the expression itself is not recoverable from the text.]
calculating the eye dynamic index e of the recorded person, where E1a is the mean eyeball expansion ratio when the head-position change angle of the recorded person in the first image set is less than or equal to R1, E1b the mean when that angle is greater than R1, E2a the mean when the head-position change angle in the second image set is less than or equal to R2, and E2b the mean when that angle is greater than R2;
step S230: comparing the eye dynamic index e of the recorded person with a preset dynamic index threshold e0, and when e is greater than or equal to e0, recording the eyes of the recorded person as a second serial mark; when e is smaller than e0, acquiring the dwell time h1k of the recorded person's elbow in the k-th quadrant in the first image set and the dwell time h2k in the k-th quadrant in the second image set; arranging the quadrants of the first image set in descending order of dwell time into a set K1 and the quadrants of the second image set in descending order of dwell time into a set K2; judging whether the first quadrants of set K1 and set K2 are the same, and when they are not, marking the first quadrant region of the first image set, together with the corresponding first quadrant region of the second image set, as a third serial mark;
step S300: judging the set to be analyzed, the first image set and the second image set both being non-empty; when the set to be analyzed is 0, analyzing in the second image set the condition set under which the recording background sound needs to be corrected; when the set to be analyzed is not 0, recording it as a target set, extracting the corresponding images in the target set as target images, and extracting the audio data corresponding to the target images as target audio, the target audio comprising a target main body sound and a target background sound;
when the set to be analyzed is 0, analyzing in the second image set the condition set under which the recording background sound needs to be corrected comprises the following steps:
acquiring the i-th image p2i in the second image set and the image p1i in the first image set from the immediately preceding time period; substituting the set formed by image p1i and image p2i into steps S210 to S230 for analysis, and judging the finally obtained target serial mark, the target serial mark being any one of {first serial mark, second serial mark, third serial mark};
when the target serial mark is the first serial mark, recording the angle threshold of image p1i as [R(p1i)min, R(p1i)max], this angle threshold being the first correction condition, where R(p1i)min is the minimum and R(p1i)max the maximum angle formed, during the change process, by the line segment from the head position in image p1i (with the nose as the fixed point) to the origin;
when the target serial mark is the second serial mark, establishing in image p1i the relation pair {first serial mark → second serial mark}, the relation pair being the relation formed by the eye opening ratio corresponding to each head position, and recording the relation pair threshold {first serial mark min → second serial mark max} of image p1i as the second correction condition, where first serial mark min → second serial mark max represents all combinations of the minimum angle formed, during the change process, by the line segment from the head position (with the nose as the fixed point) to the origin with the maximum eye opening ratio corresponding to that head position;
when the target serial mark is the third serial mark, acquiring the quadrant K0 corresponding to the third serial mark in the first image set before the generation time of the image corresponding to the third serial mark; acquiring the quadrant set {K1a, K2a, K3a}, other than the third serial mark, arranged in descending order of dwell time, and the quadrants {K1b, K2b, K3b} corresponding to the first image set before the generation times of the images corresponding to {K1a, K2a, K3a}; constructing quadrant dynamic paths Q = {KA → KB} and sorting them in ascending order of quadrant difference, the quadrant dynamic path corresponding to the third serial mark always ranking first; this priority ordering is the third correction condition;
the condition set under which the recording background sound needs to be corrected is {first correction condition, second correction condition, third correction condition}; when the image data of the recording end is detected to meet any correction condition, the volume of the background sound is reduced in the state corresponding to the {first serial mark, second serial mark, third serial mark};
step S400: recording the image information acquired by the listening end at the image generation times in the target set, the image information being listener images acquired by the listening-end camera device; analyzing the relation between the image information acquired by the listening end and the target audio, and marking the first correction audio;
the method for analyzing the relation between the image information acquired by the listening end and the target audio and marking the first correction audio comprises the following steps:
step S410: acquiring the image information of the listening end corresponding to the target audio, establishing a face extension degree change curve from the listener images, extracting the target audio corresponding to curve points in the face extension degree change curve whose change rate is greater than or equal to the change rate threshold, and recording it as the first target audio;
step S420: marking the target audio corresponding to curve points in the face extension degree change curve whose change rate is smaller than the change rate threshold as the second target audio; comparing the similarity between the target image corresponding to the first target audio and the target image corresponding to the second target audio, and if the similarity is smaller than the similarity threshold, marking the first target audio as the first correction audio; if the similarity is greater than or equal to the similarity threshold, not marking it;
step S500: based on the first correction audio marked in step S400, adjusting the target background sound in the first correction audio;
adjusting the target background sound in the first correction audio marked in step S400 comprises the following steps:
step S510: acquiring the first correction audio, and performing mixing synthesis of the target main body sound and the target background sound in the first correction audio, the mixing synthesis being based on a mixing ratio, the mixing ratio being target main body sound : target background sound = s0 : g0, with s0 > g0;
step S520: acquiring the mixing ratio at which the listening end receives the second target audio, and calculating the mixing ratio threshold G = [s1/g1, s2/g2] corresponding to the second target audio, where s1/g1 is the minimum and s2/g2 the maximum ratio of target main body sound to target background sound in the second target audio;
step S530: adjusting s0/g0 so that s0/g0 ∈ [s1/g1, s2/g2].
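A minimal sketch of steps S510 to S530, assuming the clamping rule rescales only the background gain while the subject gain s0 is kept fixed; the claim itself only requires that s0/g0 land inside [s1/g1, s2/g2]:

import numpy as np

def adjust_mix(subject: np.ndarray, background: np.ndarray,
               s0: float, g0: float,
               ratio_min: float, ratio_max: float) -> np.ndarray:
    # Clamp the mixing ratio s0/g0 into [ratio_min, ratio_max] (S530).
    clamped = min(max(s0 / g0, ratio_min), ratio_max)
    g_adj = s0 / clamped  # only the background gain is rescaled (assumption)
    # Re-mix subject and background at the corrected ratio (S510).
    return s0 * subject + g_adj * background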
2. An audio acquisition and processing system for online live broadcast recording based on big data, applying the method of claim 1, characterized by comprising a video data acquisition module, a recording end data analysis module, a to-be-analyzed set judgment module, a first correction audio acquisition module and a background sound adjustment module;
the video data acquisition module is used for acquiring video data of the online live broadcast recording end, the video data comprising image data and audio data; continuously dividing the audio data and the image data of the recording end into equal, one-to-one corresponding segments in time order; the audio data of the recording end comprises a recording main body sound and a recording background sound, the recording main body sound being the audio of the content spoken by the recorded person, and the recording background sound being the audio corresponding to non-recorded content;
the recording end data analysis module is used for analyzing the image data and the audio data of the recording end;
the to-be-analyzed set judgment module is used for judging whether the set to be analyzed acquired by the recording end data analysis module is 0;
the first correction audio acquisition module, for the case where the set to be analyzed is not 0, analyzes the image information acquired by the listening end and marks the first correction audio;
the background sound adjustment module is used for adjusting the proportion of the background sound to the main body sound in the first correction audio.
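For orientation only, the five modules can be wired as a simple pipeline; every name below is an illustrative stand-in, not an interface defined by the patent:

def process_live_recording(video_data, acquire, analyze, judge, mark, adjust):
    # video data acquisition module
    images, audio = acquire(video_data)
    # recording end data analysis module
    to_analyze = analyze(images, audio)
    # to-be-analyzed set judgment module
    if judge(to_analyze):  # True when the set to be analyzed is not 0
        # first correction audio acquisition module
        first_correction = mark(to_analyze)
        # background sound adjustment module
        return adjust(first_correction)
    return audio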
3. The big data based audio acquisition and processing system for online live broadcast recording according to claim 2, wherein: the recording end data analysis module comprises a first image set acquisition unit, a second image set acquisition unit and a serial mark analysis unit;
the first image set acquisition unit acquires the image data corresponding to the recording main body sound and records it as the first image set;
the second image set acquisition unit acquires the image data corresponding to the recording background sound and records it as the second image set;
the serial mark analysis unit is used for analyzing the positional relations of the recorded person's head, eyes and elbows, establishing through layer-by-layer analysis a progressive analysis method from the head and eyes to the elbows.
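A hedged sketch of that layer-by-layer cascade (steps S210 to S230). The two thresholds are assumed values, and the eye dynamic index formula survives only as an image in the source, so a simple spread of the four expansion-ratio means stands in for it here:

from dataclasses import dataclass

@dataclass
class Observation:
    r1: float               # mean head angle, first image set
    r2: float               # mean head angle, second image set
    e1a: float; e1b: float  # eye expansion-ratio means, first set
    e2a: float; e2b: float  # eye expansion-ratio means, second set
    k1: list[int]           # quadrants by descending dwell time, first set
    k2: list[int]           # quadrants by descending dwell time, second set

ANGLE_DIFF_THRESHOLD = 15.0  # assumed, degrees
EYE_INDEX_THRESHOLD = 0.2    # assumed stand-in for e0

def serial_mark(obs: Observation) -> str | None:
    if abs(obs.r1 - obs.r2) >= ANGLE_DIFF_THRESHOLD:
        return "first serial mark"    # head layer (S210)
    # Stand-in for the unrecoverable eye dynamic index formula:
    ratios = (obs.e1a, obs.e1b, obs.e2a, obs.e2b)
    if max(ratios) - min(ratios) >= EYE_INDEX_THRESHOLD:
        return "second serial mark"   # eye layer (S220)
    if obs.k1 and obs.k2 and obs.k1[0] != obs.k2[0]:
        return "third serial mark"    # elbow-quadrant layer (S230)
    return None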
4. The big data based audio acquisition and processing system for online live broadcast recording according to claim 3, wherein: the to-be-analyzed set judgment module comprises a data substituting unit, a scene analysis unit and a condition set acquisition unit;
the data substituting unit is used for substituting the actual data in the acquired set to be analyzed into the analysis method of the serial mark analysis unit for analysis;
the scene analysis unit analyzes different scenes based on the results of the data substituting unit;
the condition set acquisition unit forms a condition set from the correction conditions obtained by the scene analysis unit, and reduces the volume of the background sound when it detects that the image data of the recording end meets any correction condition.
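The condition-set check itself reduces to a predicate test; a brief sketch, with the 0.5 attenuation factor assumed (the claim only requires that the background volume be reduced):

from typing import Callable

def corrected_background_gain(frame_state,
                              conditions: list[Callable],
                              gain: float) -> float:
    # Reduce the background volume when any correction condition holds.
    if any(condition(frame_state) for condition in conditions):
        return gain * 0.5  # assumed attenuation factor
    return gain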
5. The big data based audio acquisition and processing system for online live broadcast recording according to claim 4, wherein: the first correction audio acquisition module comprises a face extension degree change curve establishing unit, a curve point marking unit and a similarity comparison unit;
the face extension degree change curve establishing unit is used for acquiring image information of a listening end corresponding to the target audio and establishing a face extension degree change curve of the image of the listener;
the curve point marking unit is used for marking the target audio corresponding to curve points on the face extension degree change curve whose change rate is greater than or equal to the change rate threshold, recording it as the first target audio, and marking the target audio corresponding to curve points whose change rate is smaller than the change rate threshold as the second target audio;
the similarity comparison unit is used for comparing the similarity between the target image corresponding to the first target audio and the target image corresponding to the second target audio; if the similarity is smaller than the similarity threshold, the first target audio is marked as the first correction audio; if the similarity is greater than or equal to the similarity threshold, no marking is performed.
CN202210724426.6A 2022-06-23 2022-06-23 Big data based audio acquisition and processing system and method for online live broadcast recording Active CN115119007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210724426.6A CN115119007B (en) 2022-06-23 2022-06-23 Big data based audio acquisition and processing system and method for online live broadcast recording

Publications (2)

Publication Number Publication Date
CN115119007A CN115119007A (en) 2022-09-27
CN115119007B true CN115119007B (en) 2023-03-03

Family

ID=83327949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210724426.6A Active CN115119007B (en) 2022-06-23 2022-06-23 Big data based audio acquisition and processing system and method for online live broadcast recording

Country Status (1)

Country Link
CN (1) CN115119007B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116264620B (en) * 2023-04-21 2023-07-25 深圳市声菲特科技技术有限公司 Live broadcast recorded audio data acquisition and processing method and related device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5625463A (en) * 1994-07-20 1997-04-29 Ha-Ngoc; Tuan Video recorder with background audio
KR20070070481A (en) * 2005-12-29 2007-07-04 엠텍비젼 주식회사 Method and apparatus for providing background sound in mobile phone
CN101193381A (en) * 2006-12-01 2008-06-04 中兴通讯股份有限公司 A mobile terminal and method with voice pre-processing function
CN104133652A (en) * 2014-06-10 2014-11-05 腾讯科技(深圳)有限公司 Audio playing control method and terminal
CN105872253A (en) * 2016-05-31 2016-08-17 腾讯科技(深圳)有限公司 Live broadcast sound processing method and mobile terminal
CN109166589A (en) * 2018-08-13 2019-01-08 深圳市腾讯网络信息技术有限公司 Using sound suppressing method, device, medium and equipment
CN114416015A (en) * 2022-01-07 2022-04-29 北京小米移动软件有限公司 Audio adjusting method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Audio acquisition and processing system and method of online live recording based on Big data

Effective date of registration: 20230601

Granted publication date: 20230303

Pledgee: Agricultural Bank of China Limited Enping City sub branch

Pledgor: XINYINGKE ELECTROACOUSTIC TECHNOLOGY Co.,Ltd.

Registration number: Y2023980042545