CN114697687B - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN114697687B
Authority
CN
China
Prior art keywords
score
representing
feature set
voice
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011593579.9A
Other languages
Chinese (zh)
Other versions
CN114697687A (en)
Inventor
马少武
文湘江
刘千仞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd
Priority to CN202011593579.9A
Publication of CN114697687A
Application granted
Publication of CN114697687B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/21: Server components or server architectures
    • H04N 21/218: Source of audio or video content, e.g. local disk arrays
    • H04N 21/2187: Live feed
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/431: Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N 21/4312: Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/435: Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47: End-user applications
    • H04N 21/478: Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N 21/4788: Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a data processing method and a data processing device, relates to the field of data processing technology, and solves the problem of how to identify more intelligently whether the content broadcast in a live broadcast room is legal. The method includes: acquiring a live broadcast service video stream; extracting a target picture group, voice data and subtitle data from the live broadcast service video stream; determining a picture score according to a first feature set contained in the target picture group; determining a voice score according to a second feature set contained in the voice data; determining a subtitle score according to a third feature set contained in the subtitle data; determining a comprehensive score according to the picture score, the voice score and the subtitle score; and generating alarm information when the comprehensive score is determined to be greater than or equal to a target score, the alarm information being used for indicating that the live service video stream is abnormal.

Description

Data processing method and device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method and apparatus.
Background
For a live broadcast platform, the live broadcast content needs to be audited to confirm that it is normal, legal content rather than non-compliant content. In the prior art, confirming whether live broadcast content is legal mainly depends on human auditors, who need to check each live broadcast room regularly for illegal live broadcast content. Obviously, the prior art is time-consuming and labor-intensive, and the auditing efficiency is not high.
Therefore, how to identify more intelligently whether the content broadcast in a live broadcast room is legal, that is, how to improve the auditing efficiency, is a problem that needs to be solved in the prior art.
Disclosure of Invention
The invention provides a data processing method and a data processing device, which solve the problem of how to more intelligently identify whether the content broadcast in a live broadcast room is legal or not.
In order to achieve the above purpose, the invention adopts the following technical scheme:
in a first aspect, an embodiment of the present invention provides a data processing method, including: acquiring a live broadcast service video stream; extracting a target picture group, voice data and subtitle data in a live broadcast service video stream; determining a picture score according to a first feature set contained in the target picture group; determining a speech score from a second set of features contained in the speech data; determining a caption score according to a third feature set contained in the caption data; determining a composite score according to the picture score, the voice score and the subtitle score; when the comprehensive score is determined to be greater than or equal to the target score, generating alarm information; the alarm information is used for indicating that the live service video stream is abnormal.
As can be seen from the above, in the data processing method provided by the present invention, the target picture group, the voice data and the subtitle data in the live broadcast service video stream are extracted and analyzed, so that the picture score can be determined according to the first feature set contained in the target picture group. Then, a voice score is determined according to the second feature set contained in the voice data, and a subtitle score is determined according to the third feature set contained in the subtitle data. Further, a comprehensive score is determined according to the picture score, the voice score and the subtitle score. When the comprehensive score is greater than or equal to the target score, alarm information is generated, so that a user can determine from the alarm information that the live broadcast service video stream is abnormal. The live broadcast service video stream therefore does not need to be checked manually, which improves the auditing efficiency, reduces the labor cost, and solves the problem of how to identify more intelligently whether the content broadcast in the live broadcast room is legal.
In a second aspect, the present invention provides a data processing apparatus comprising: an acquisition unit and a processing unit.
Specifically, the acquiring unit is configured to acquire a live service video stream; the processing unit is used for extracting the target picture group, the voice data and the subtitle data in the live broadcast service video stream acquired by the acquisition unit; the processing unit is further configured to determine a picture score according to a first feature set included in the target picture group; the processing unit is further configured to determine a speech score according to a second feature set included in the speech data; the processing unit is further configured to determine a caption score according to a third feature set included in the caption data; the processing unit is further used for determining a comprehensive score according to the picture score, the voice score and the subtitle score; the processing unit is further used for generating alarm information when the comprehensive score is determined to be greater than or equal to the target score; the alarm information is used for indicating that the live service video stream is abnormal.
In a third aspect, the present invention provides a server comprising: a communication interface, a processor, a memory, and a bus; the memory is used for storing computer-executable instructions, and the processor is connected to the memory through the bus. When the server is running, the processor executes the computer-executable instructions stored in the memory to cause the server to perform the data processing method provided in the first aspect above.
In a fourth aspect, the present invention provides a computer-readable storage medium comprising instructions. The instructions, when executed on a computer, cause the computer to perform the data processing method as provided in the first aspect above.
In a fifth aspect, the present invention provides a computer program product for, when run on a computer, causing the computer to carry out the data processing method according to the design of the first aspect.
It should be noted that the above-mentioned computer instructions may be stored in whole or in part on the first computer readable storage medium. The first computer readable storage medium may be packaged together with the processor of the data processing apparatus or may be packaged separately from the processor of the data processing apparatus, which is not limited in the present invention.
The description of the second, third, fourth and fifth aspects of the present invention may refer to the detailed description of the first aspect; further, the advantageous effects described in the second aspect, the third aspect, the fourth aspect, and the fifth aspect may refer to the advantageous effect analysis of the first aspect, and are not described herein.
In the present invention, the names of the above-described data processing apparatuses do not constitute limitations on the devices or function modules themselves, and in actual implementations, these devices or function modules may appear under other names. Insofar as the function of each device or function module is similar to that of the present invention, it falls within the scope of the claims of the present invention and the equivalents thereof.
These and other aspects of the invention will be more readily apparent from the following description.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a communication system to which a data processing method is applied according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data processing method according to an embodiment of the present invention;
FIG. 3 is a second flowchart of a data processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a server according to an embodiment of the present invention;
FIG. 5 is a second schematic diagram of a server according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a computer program product of a data processing method according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described below with reference to the accompanying drawings.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to clearly describe the technical solution of the embodiments of the present invention, in the embodiments of the present invention, the terms "first", "second", etc. are used to distinguish the same item or similar items having substantially the same function and effect, and those skilled in the art will understand that the terms "first", "second", etc. do not limit the number and execution order.
Fig. 1 is an architecture diagram of an implementation environment to which the data processing method may be applied, according to an exemplary embodiment. As shown in fig. 1, the implementation environment includes the user equipment 01 of the anchor, a server 02, and the user equipment 03 of viewers (N user equipments (UEs), where N is an integer greater than or equal to 0). When the user equipment 01 establishes a communication connection with the server 02 through the application program and opens a live broadcast room, the user equipment 03 can access the live broadcast room after establishing a communication connection with the server 02 through the application program; the user equipment 03 then maintains a communication connection with the user equipment 01 through the long-connection service provided by the server 02.
In one implementation, the server 02 may be a server, a server cluster formed by a plurality of servers, or a cloud computing service center. The server 02 may include a processor, memory, network interfaces, and the like.
In one embodiment, the user equipment is used to provide voice and/or data connectivity services to a user. The terminal may have different names, such as UE, terminal unit, terminal station, mobile station, remote terminal, mobile device, wireless communication device, vehicle-mounted user equipment, terminal agent or terminal device. Optionally, the terminal may be a handheld device, an in-vehicle device, a wearable device, or a computer with a communication function, which is not limited in the embodiment of the present invention. For example, the handheld device may be a smart phone, the in-vehicle device may be an in-vehicle navigation system, the wearable device may be a smart bracelet, and the computer may be a personal digital assistant (PDA), a tablet computer, or a laptop computer.
At present, the content broadcast by a live broadcast platform is audited mainly by means of manual online inspection, after-the-fact user reporting and the like, and the auditing efficiency is low. Therefore, in the data processing method provided by the embodiment of the invention, the target picture group, the voice data and the subtitle data in the live broadcast service video stream are extracted and analyzed, so that the picture score can be determined according to the first feature set contained in the target picture group. Then, a voice score is determined according to the second feature set contained in the voice data, and a subtitle score is determined according to the third feature set contained in the subtitle data. Further, a comprehensive score is determined according to the picture score, the voice score and the subtitle score. When the comprehensive score is greater than or equal to the target score, alarm information is generated, so that a user can determine from the alarm information that the live broadcast service video stream is abnormal. The live broadcast service video stream therefore does not need to be checked manually, which improves the auditing efficiency, reduces the labor cost, and solves the problem of how to identify more intelligently whether the content broadcast in the live broadcast room is legal. The specific implementation process is as follows:
The data processing method provided by the embodiment of the present invention will be described below with reference to the communication system shown in fig. 1, taking the data processing device as the server 02 as an example.
As shown in fig. 2, the data processing method includes the contents of the following steps S11 to S17:
S11, the server 02 acquires the live service video stream.
In one implementation, live service video streams carried over the real-time streaming protocol (Real Time Streaming Protocol, RTSP) or the real-time messaging protocol (Real Time Messaging Protocol, RTMP) may be parsed and identified by an online deep packet inspection (Deep Packet Inspection, DPI) probe device.
S12, the server 02 extracts a target picture group, voice data and subtitle data in the live broadcast service video stream.
In one implementation, the live service video stream needs to be stored first, and only then can the target picture group, the voice data and the subtitle data in the live service video stream be extracted. When many live service video streams are acquired, the cost of storing them is high. Therefore, the data processing method provided by the embodiment of the invention samples the live service video stream at a ratio of N to 1 (for example, retaining one frame out of every N, as sketched below), thereby reducing the cost of storing the acquired live service video streams.
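As an illustration of the N-to-1 sampling step described above, the following is a minimal sketch, not code from the patent: it assumes OpenCV (cv2) as the decoding library, and the function name, stream URL handling and sampling interpretation (keep one frame out of every N) are assumptions made for this example.

```python
import cv2

def sample_stream(url: str, n: int):
    """Hypothetical N-to-1 sampler: keep one frame out of every n frames read."""
    cap = cv2.VideoCapture(url)   # url is a placeholder for the live stream address
    kept = []
    index = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if index % n == 0:        # retain only every n-th frame
            kept.append(frame)
        index += 1
    cap.release()
    return kept
```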
S13, the server 02 determines a screen score from the first feature set included in the target screen group.
In one implementation, when the server 02 extracts the target picture group from the live service video stream, the live service video stream first needs to be decoded by a decoder; after the interference of various kinds of noise is removed, the target picture group is determined (for example, a sequence of consecutive I/B/P frames, that is, a group of pictures (Group of Pictures, GOP)). Then, the first feature set contained in the target picture group is determined based on the target picture group.
S14, the server 02 determines a voice score according to the second feature set contained in the voice data.
In one implementation, when the server 02 extracts the voice data from the live service video stream, the live service video stream first needs to be decoded by a decoder; after the interference of various kinds of noise is removed, the target picture group is obtained, and at the same time the voice data synchronized with the target picture group is obtained. Then, the voice data is recognized and its features are extracted, so that the second feature set contained in the voice data is determined.
S15, the server 02 determines a caption score according to the third feature set included in the caption data.
In one implementation, when the server 02 extracts the subtitle data from the live service video stream, the live service video stream first needs to be decoded by a decoder; after the interference of various kinds of noise is removed, the target picture group is obtained, and at the same time the subtitle data synchronized with the target picture group is obtained by a recognition technique such as optical character recognition (Optical Character Recognition, OCR), so as to determine the third feature set contained in the subtitle data.
S16, the server 02 determines a comprehensive score according to the picture score, the voice score and the subtitle score.
S17, when the server 02 determines that the comprehensive score is greater than or equal to the target score, alarm information is generated. The alarm information is used for indicating that the live service video stream is abnormal.
In one implementation, the server 02 sends the alert information to the user in the form of a short message. Alternatively, after generating the alarm information, the server 02 displays the alarm information.
As can be seen from the above, the server 02 extracts the target group of pictures, the voice data, and the subtitle data in the live broadcast service video stream, and analyzes the target group of pictures, the voice data, and the subtitle data, so that the picture score can be determined according to the first feature set included in the target group of pictures. Then, the server 02 determines a voice score from the second feature set included in the voice data. Then, the server 02 determines a subtitle score from the third feature set included in the subtitle data. Further, the server 02 determines a composite score from the picture score, the voice score, and the subtitle score. When the comprehensive score is greater than or equal to the target score, the server 02 generates alarm information, so that a user can determine that the live broadcast service video stream is abnormal according to the alarm information, the live broadcast service video stream does not need to be checked manually, the checking efficiency is improved, the labor cost is reduced, and the problem of how to more intelligently identify whether the content broadcasted in the live broadcast room is legal or not is solved.
In one embodiment, the first feature set includes at least a picture color, a light compensation value, scene information, skin exposure, a human target part, a behavior feature, and background noise interference cancellation. In this case, as shown in fig. 3 in conjunction with fig. 2, S13 may be specifically implemented by S130 and S131 described below.
S130, the server 02 determines the score of each feature in the first feature set according to a preconfigured first correspondence. The first correspondence includes a correspondence between each feature in the first feature set and a score.
In one implementation, the picture color of the live service video stream may be determined as follows.
Specifically, to a computer a picture is a set of standardized two-dimensional array matrices, and common color space models include RGB (Red, Green, Blue) and HSV (Hue, Saturation, Value). By traversing the pixels of each frame/key-frame picture block by block, the three RGB channel values of the pixel at each position coordinate can be obtained, and the picture color can be determined by comparing these channel values with standard color space values.
Specifically, the picture color can be determined by a histogram method, a global/local cumulative histogram method, a statistical characteristic method of color parameters, a wavelet-based block image color characteristic extraction method and the like.
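To make the histogram method concrete, the following sketch, assuming OpenCV and NumPy (neither is named in the patent), computes a normalized per-channel color histogram that could be compared against standard color values; the function name and bin count are illustrative.

```python
import cv2
import numpy as np

def color_histogram(frame_bgr: np.ndarray, bins: int = 32) -> np.ndarray:
    """Normalized per-channel (B, G, R) histogram concatenated into one feature vector."""
    feats = []
    for ch in range(3):
        hist = cv2.calcHist([frame_bgr], [ch], None, [bins], [0, 256])
        feats.append(cv2.normalize(hist, hist).flatten())
    return np.concatenate(feats)
```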
In one implementation, the light compensation value of the live service video stream may be determined as follows.
Specifically, in the process of recognizing key-frame (I-frame) images, uneven illumination of the whole image, caused by the lighting environment or by reflections from object surfaces, makes information recognition difficult. The image is therefore preprocessed with a light compensation method to enhance its contrast and definition and to improve the accuracy of later image recognition.
Specifically, the light compensation value may be determined by a gray level transformation method represented by a histogram equalization method, a homomorphic filtering method based on an illumination-reflection model, a Retinex enhancement method, a gradient domain enhancement method, and the like.
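A minimal sketch of histogram-equalization-based light compensation, assuming OpenCV; equalizing only the luma channel of a YCrCb conversion is one common variant chosen for illustration, not a requirement of the patent.

```python
import cv2

def equalize_luminance(frame_bgr):
    """Equalize the luma (Y) channel to compensate for uneven lighting."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```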
In one implementation, the scene information of the live service video stream may be determined as follows.
Specifically, a convolutional neural network (Convolutional Neural Networks, CNN) deep learning method can be adopted for scene information recognition: feature objects in the surrounding environment other than the human body are learned from the key-frame (I-frame) image, and feature extraction, classification, training, matching and comparison are performed against a large standard library of extracted object feature values, so that the current environment of the subject (for example, indoor or outdoor) is inferred and classified. In this way, auxiliary judgment information about the scene is obtained, which facilitates more accurate recognition of the image content and judgment of its nature.
Exemplary scene information includes indoor and outdoor.
In one implementation, the skin exposure of the live service video stream may be determined as follows.
Specifically, a human body image can be extracted by a deep-learning portrait feature extraction algorithm such as an FCN (Fully Convolutional Networks for Semantic Segmentation) algorithm, or by an algorithm such as a Bayesian model based on edge contour features. Then, by calculating the area ratio and specific distribution of the face and limb image regions relative to the entire skin-color region, it is determined whether non-compliant content (also referred to as abnormal content) is included.
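The area-ratio idea can be illustrated with a rough sketch, assuming OpenCV and NumPy; the fixed HSV skin-color bounds below are illustrative placeholders only, whereas the patent relies on segmentation models such as FCN rather than a hard-coded color range.

```python
import cv2
import numpy as np

def skin_ratio(frame_bgr) -> float:
    """Fraction of pixels falling inside a rough, illustrative HSV skin-color range."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 30, 60], dtype=np.uint8)     # illustrative lower bound
    upper = np.array([25, 180, 255], dtype=np.uint8)  # illustrative upper bound
    mask = cv2.inRange(hsv, lower, upper)
    return float(np.count_nonzero(mask)) / mask.size
```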
In one implementation, the human target site of the live service video stream may be determined as follows.
Specifically, the human body and its target parts can be identified by human-feature extraction algorithms such as a CNN or a recurrent neural network (Recurrent Neural Network, RNN), and a non-compliance identification result and score are given by comparison and judgment against a human-feature data set. The human-feature data set contains all types of human body images with varied postures and body proportions, and can be used for finer and more complex human-feature recognition or matting scenarios. For example, the Helen Parsing Dataset is a face image segmentation data set obtained by relabeling the key-point detection data set Helen Dataset; it contains 2000 training images and 330 test images and is continuously refined and extended.
In one implementation, the behavior characteristics of the live service video stream may be determined as follows.
Specifically, in behavior-aware studies on images, the image features extracted by convolutional neural networks are applied to behavior classification (action classification). For behavior recognition in video, the convolutional neural network can retain its two-dimensional structure and learn by stacking the features of consecutive time segments of key I-frames, building a 2D convolutional neural network that varies along the time axis; alternatively, features can be extracted frame by frame and fed into a recurrent neural network (RNN). In this way, the motion trajectory and behavior of the target part of the human body are identified, a behavior judgment is made, whether the identification result is accurate and reliable is further confirmed, and a corresponding judgment score is given.
In one implementation, background noise interference cancellation for the live service video stream may be determined as follows.
Specifically, the interference from non-key feature objects can be removed by filtering algorithms such as mean filtering, box filtering, Gaussian filtering, median filtering and bilateral filtering, which improves the efficiency and accuracy of target image feature recognition. Then, while preserving the original information of the image as much as possible, the background noise in the image is filtered out: the image is smoothed and blurred, and the pixel value at the location of the noise is determined from the surrounding neighboring pixels as the noise-removed value.
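A minimal sketch of the filtering step, assuming OpenCV; the choice of a median filter followed by a light Gaussian blur, and the kernel sizes, are illustrative assumptions.

```python
import cv2

def denoise(frame_bgr):
    """Suppress background noise: median filter, then a light Gaussian blur."""
    filtered = cv2.medianBlur(frame_bgr, 3)
    return cv2.GaussianBlur(filtered, (3, 3), 0)
```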
In one implementation manner, the first correspondence includes a correspondence between a picture color and a score, and the server 02 determines the score corresponding to the picture color of the live service video stream by querying the score corresponding to the picture color of the live service video stream in the first correspondence.
Illustratively, the correspondence between the picture color and the score is shown in table 1.
TABLE 1
Picture color    Score
Clear            7-10
Standard         4-6
Blurred          0-3
In one implementation manner, the first correspondence includes a correspondence between the light compensation value and the score, and the server 02 determines the score corresponding to the light compensation value of the live service video stream by querying the score corresponding to the light compensation value of the live service video stream in the first correspondence.
Illustratively, the correspondence between the light compensation value and the score is shown in table 2.
TABLE 2
Light compensation value    Score
Clear                       7-10
Standard                    4-6
Weak                        0-3
In one implementation manner, the first correspondence includes a correspondence between scene information and scores, and the server 02 determines the score corresponding to the scene information of the live service video stream by querying the score corresponding to the scene information of the live service video stream in the first correspondence.
Illustratively, the correspondence between the scene information and the score is shown in table 3.
TABLE 3
(Table 3 is provided as an image in the original publication; it maps scene information, such as indoor and outdoor scenes, to score ranges.)
In one implementation manner, the first correspondence includes a correspondence between skin exposure and a score, and the server 02 determines the score corresponding to the skin exposure of the live service video stream by querying the score corresponding to the skin exposure of the live service video stream in the first correspondence.
Illustratively, the correspondence between the skin exposure and the score is shown in table 4.
TABLE 4
Skin exposure        Score
Fully exposed        7-10
Partially exposed    4-6
In one implementation manner, the first correspondence includes a correspondence between a human target portion and a score, and the server 02 determines the score corresponding to the human target portion of the live service video stream by querying the score corresponding to the human target portion of the live service video stream in the first correspondence.
Specifically, a score is given when a human body part in the live broadcast service video stream is identified and then the human body part is determined to be a human body target part. Otherwise, the score is 0.
In one implementation manner, the first correspondence includes a correspondence between behavior features and scores, and the server 02 determines the score corresponding to the behavior feature of the live service video stream by querying the score corresponding to the behavior feature of the live service video stream in the first correspondence.
Illustratively, the correspondence between behavior characteristics and scores is shown in table 5.
TABLE 5
Behavior feature          Score
Non-compliant behavior    6-10
Normal behavior           0-5
In one implementation manner, the first correspondence includes a correspondence between background noise interference cancellation and a score, and the server 02 determines the score corresponding to background noise interference cancellation of the live service video stream by querying the score corresponding to background noise interference cancellation of the live service video stream in the first correspondence.
Illustratively, the correspondence between background noise interference cancellation and the score is shown in table 6.
TABLE 6
Background noise interference cancellation    Score
High quality                                  7-10
Standard quality                              4-6
Ordinary quality                              0-3
S131, the server 02 determines a picture score according to the score of each feature in the first feature set. Wherein the picture score satisfies the following formula:
K_video = ω_f1 × S_f1 + ω_f2 × S_f2 + ω_f3 × S_f3 + ω_f4 × S_f4 + ω_f5 × S_f5 + ω_f6 × S_f6 + ω_f7 × S_f7
wherein K_video represents the picture score, S_f1 represents the score corresponding to the picture color, S_f2 represents the score corresponding to the light compensation value, S_f3 represents the score corresponding to the scene information, S_f4 represents the score corresponding to the skin exposure, S_f5 represents the score corresponding to the human target part, S_f6 represents the score corresponding to the behavior feature, S_f7 represents the score corresponding to background noise interference cancellation, and ω_f1, ω_f2, ω_f3, ω_f4, ω_f5, ω_f6 and ω_f7 are constants greater than 0 and less than or equal to 1, with ω_f1 + ω_f2 + ω_f3 + ω_f4 + ω_f5 + ω_f6 + ω_f7 = 1.
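A minimal sketch of this weighted sum in Python; the function name, the example weights and the per-feature scores are made up for illustration and are not values prescribed by the patent.

```python
def picture_score(scores, weights):
    """K_video as the weighted sum of the seven per-feature scores; weights sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return sum(w * s for w, s in zip(weights, scores))

# Illustrative call with hypothetical per-feature scores and weights:
k_video = picture_score([8, 6, 5, 9, 10, 7, 6],
                        [0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.1])
```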
In one embodiment, the second feature set includes at least a volume value, a pitch frequency, frequency domain information, a voice subband energy, a subband spectrum centroid, a target phonetic text, and background noise interference cancellation, in which case, as shown in fig. 3 in conjunction with fig. 2, S14 may be implemented specifically by S140 and S141 described below.
S140, the server 02 determines the score of each feature in the second feature set according to a preconfigured second correspondence. The second correspondence includes a correspondence between each feature in the second feature set and a score.
In one implementation, the volume value of the live service video stream may be determined as follows.
Specifically, volume (also called sound intensity or loudness) refers to the human ear's subjective perception of the strength of the sound heard, and its objective measure is the amplitude of the sound. Because the dynamic range of sound pressure perceived by the human ear is very large, and the ear's perception of loudness is approximately logarithmic with respect to sound pressure and intensity, sound is usually measured on a logarithmic scale in decibels. The most common speech coding standard is pulse code modulation (Pulse Code Modulation, PCM); the advanced audio coding (Advanced Audio Coding, AAC) adopted by MPEG likewise keeps all of the originally recorded audio data, essentially as individual sample points, typically 16-bit integers. Therefore, to determine the volume value, the speech captured during the time span of a synchronized GOP or of consecutive key I-frames is taken for recognition and analysis.
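A minimal sketch of deriving a decibel-scale volume value from 16-bit PCM samples, assuming NumPy; the RMS-based formulation is an assumption of this example, since the patent does not fix a specific computation.

```python
import numpy as np

def volume_db(pcm16: np.ndarray) -> float:
    """RMS level of 16-bit PCM samples in decibels relative to full scale."""
    samples = pcm16.astype(np.float64) / 32768.0
    rms = np.sqrt(np.mean(samples ** 2))
    return 20.0 * np.log10(max(rms, 1e-10))  # floor avoids log(0) for silence
```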
In one implementation, the pitch frequency of the live service video stream may be determined as follows.
Specifically, the audible frequency range of human beings is 20 Hz-20 kHz. The pitch frequency is related to the length, thickness, toughness, stiffness, pronunciation habit and the like of the vocal cords of the individuals, so that the characteristics of the individual sounds are reflected to a great extent, and the pitch frequency of the live service video stream can be determined by adopting a time domain estimation method, a transformation method or a mixing method.
In one implementation, the frequency domain information of the live service video stream may be determined as follows.
Specifically, a common voice frequency-domain analysis method, such as a band-pass filter bank method, a Fourier transform method, homomorphic analysis or a linear prediction method, can be adopted to determine the frequency domain information.
In one implementation, the voice subband energy of the live service video stream may be determined as follows.
Specifically, in order to perform speech recognition accurately in a noisy environment, methods such as a Gaussian mixture model (Gaussian Mixture Model, GMM) classifier can be used to train and test the extracted subband energy variation features. The live service video stream is thus input into the Gaussian mixture model, and the voice subband energy is determined.
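A hypothetical sketch of such a GMM-based classifier over subband-energy features, assuming scikit-learn and NumPy (neither is named in the patent); the number of mixture components and the two-class speech/noise framing are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(speech_feats: np.ndarray, noise_feats: np.ndarray, components: int = 4):
    """Fit one GMM per class on (n_samples, n_subbands) energy feature matrices."""
    gmm_speech = GaussianMixture(n_components=components).fit(speech_feats)
    gmm_noise = GaussianMixture(n_components=components).fit(noise_feats)
    return gmm_speech, gmm_noise

def is_speech(frame_feat: np.ndarray, gmm_speech, gmm_noise) -> bool:
    """Classify one feature vector by comparing per-class log-likelihoods."""
    x = frame_feat.reshape(1, -1)
    return gmm_speech.score_samples(x)[0] > gmm_noise.score_samples(x)[0]
```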
In one implementation, the sub-band spectral centroid of the live service video stream may be determined as follows.
In particular, since the spectral peak positions of each band are relatively less affected by background noise, better robustness is achieved, while the sub-band spectral centroid (Subband Spectrum Centroid, SSC) is very close to the peak position in the spectrum, so that the sub-band spectral centroid can be determined from the peaks.
In one implementation, the target phonetic text of the live service video stream may be determined as follows.
Specifically, after the voice signal is preprocessed, voice features are extracted frame by frame; traditional feature extraction methods include the Mel-frequency cepstral coefficient (Mel Frequency Cepstrum Coefficient, MFCC) method and the like. The best-matching word sequence is then found and output as the recognition result under the joint guidance of an acoustic model (such as a hidden Markov model, HMM (Hidden Markov Model)), a language model (such as an N-gram or a recurrent neural network language model, RNNLM (Recurrent Neural Network Language Modeling)) and a pronunciation dictionary.
Of course, an artificial intelligence neural network may also be employed to determine the target phonetic text of the live service video stream.
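A minimal sketch of frame-by-frame MFCC extraction, assuming the librosa library (an assumption of this example, not a dependency named in the patent); the sample rate and number of coefficients are illustrative.

```python
import librosa

def mfcc_features(wav_path: str, n_mfcc: int = 13):
    """MFCC matrix of shape (n_mfcc, n_frames) for the given audio file."""
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
```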
In one implementation, background noise interference cancellation for the live service video stream may be determined as follows.
Specifically, a cepstral mean subtraction (Cepstrum Mean Subtraction, CMS) may be used to eliminate background noise.
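Cepstral mean subtraction itself is simple to sketch; the example below assumes NumPy and MFCC frames arranged as a (coefficients x frames) array.

```python
import numpy as np

def cepstral_mean_subtraction(mfcc: np.ndarray) -> np.ndarray:
    """Subtract the per-coefficient mean over time to remove stationary channel/background effects."""
    return mfcc - mfcc.mean(axis=1, keepdims=True)
```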
In one implementation manner, the second correspondence includes a correspondence between a volume value and a score, and the server 02 determines the score corresponding to the volume value of the live service video stream by querying the score corresponding to the volume value of the live service video stream in the second correspondence.
Illustratively, the correspondence of the volume value and the score is shown in table 7.
TABLE 7
Volume value     Score
Sound present    10
Silent           0
In one implementation manner, the second correspondence includes a correspondence between pitch frequencies and scores, and the server 02 determines the score corresponding to the pitch frequency of the live service video stream by querying the score corresponding to the pitch frequency of the live service video stream in the second correspondence.
Illustratively, the correspondence between pitch frequency and score is shown in table 8.
TABLE 8
Pitch frequency                      Score
Matches detection features           6-10
Does not match detection features    0-5
In one implementation manner, the second correspondence includes a correspondence between frequency domain information and a score, and the server 02 determines the score corresponding to the frequency domain information of the live service video stream by querying the score corresponding to the frequency domain information of the live service video stream in the second correspondence.
Illustratively, the correspondence between the frequency domain information and the score is shown in table 9.
TABLE 9
Frequency domain information    Score
Low distortion                  7-10
Medium distortion               4-6
High distortion                 0-3
In one implementation manner, the second correspondence includes a correspondence between the voice subband energy and the score, and the server 02 determines the score corresponding to the voice subband energy of the live service video stream by querying the score corresponding to the voice subband energy of the live service video stream in the second correspondence.
Illustratively, the correspondence between the speech subband energy and the score is shown in table 10.
TABLE 10
Speech subband energy    Score
Good robustness          7-10
Medium robustness        4-6
Poor robustness          0-3
In one implementation, the second correspondence includes a correspondence of a sub-band spectrum centroid and a score, and the server 02 determines the score corresponding to the sub-band spectrum centroid of the live service video stream by querying the score corresponding to the sub-band spectrum centroid of the live service video stream in the second correspondence.
Illustratively, the correspondence between the sub-band spectral centroid and the score is shown in table 11.
TABLE 11
Sub-band spectral centroid    Score
Good robustness               7-10
Medium robustness             4-6
Poor robustness               0-3
In one implementation manner, the second correspondence includes a correspondence between background noise interference cancellation and a score, and the server 02 determines the score corresponding to background noise interference cancellation of the live service video stream by querying the score corresponding to background noise interference cancellation of the live service video stream in the second correspondence.
Illustratively, the correspondence between background noise interference cancellation and the score is shown in table 12.
TABLE 12
Background noise interference cancellation    Score
High quality                                  7-10
Standard quality                              4-6
Ordinary quality                              0-3
S141, the server 02 determines a voice score according to the score of each feature in the second feature set. Wherein the speech score satisfies the following formula:
K_voice = ω_v1 × S_v1 + ω_v2 × S_v2 + ω_v3 × S_v3 + ω_v4 × S_v4 + ω_v5 × S_v5 + ω_v6 × S_v6 + ω_v7 × S_v7
wherein K_voice represents the voice score, S_v1 represents the score corresponding to the volume value, S_v2 represents the score corresponding to the pitch frequency, S_v3 represents the score corresponding to the frequency domain information, S_v4 represents the score corresponding to the voice subband energy, S_v5 represents the score corresponding to the sub-band spectral centroid, S_v6 represents the score corresponding to the target phonetic text, S_v7 represents the score corresponding to background noise interference cancellation, and ω_v1, ω_v2, ω_v3, ω_v4, ω_v5, ω_v6 and ω_v7 are constants greater than 0 and less than or equal to 1, with ω_v1 + ω_v2 + ω_v3 + ω_v4 + ω_v5 + ω_v6 + ω_v7 = 1.
In one embodiment, the third feature set includes at least image binarization, noise removal value, inclination correction value, layout analysis value, target text, and context verification, in which case, as shown in fig. 3 in conjunction with fig. 2, S15 may be implemented specifically by S150 and S151 described below.
S150, the server 02 determines the score of each feature in the third feature set according to a preconfigured third correspondence. The third correspondence includes a correspondence between each feature in the third feature set and a score.
In one implementation, the image binarization of the live service video stream may be determined as follows.
Specifically, image binarization (Image Binarization) is the process of setting the gray value of each pixel in the image to either 0 or 255, so that the image presents an obvious black-and-white effect; this greatly reduces the amount of data in the image and highlights the outline of text objects. For example, the image binarization of the live service video stream may be determined by the OTSU maximum between-class variance method.
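A minimal sketch of OTSU binarization, assuming OpenCV; the function name is illustrative.

```python
import cv2

def binarize(frame_bgr):
    """Gray conversion followed by Otsu thresholding; each output pixel is 0 or 255."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```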
In one implementation, the noise removal value of the live service video stream may be determined as follows.
Specifically, the interference from non-key feature objects can be removed by filtering algorithms such as mean filtering, box filtering, Gaussian filtering, median filtering and bilateral filtering, which improves the efficiency and accuracy of target image feature recognition. Then, while preserving the original information of the image as much as possible, the background noise in the image is filtered out: the image is smoothed and blurred, and the pixel value at the location of the noise is determined from the surrounding neighboring pixels as the noise removal value.
In one implementation, the tilt correction value of the live service video stream may be determined as follows.
Specifically, during optical character recognition (Optical Character Recognition, OCR) of digital images there is always some angular tilt, because the capture of the image cannot be fully controlled. The inclination angle θ must stay within a range determined by the text line spacing d and the text line length L (the constraint is given as a formula image in the original publication); if the range is exceeded, text from an adjacent line may be spliced into the current line and replace the original text, so that recognition errors occur. Generally speaking, θ should be within 2 degrees. To avoid or correct this problem, the image needs to be preprocessed with tilt detection. For example, the tilt correction value of the live service video stream may be determined by a projection method, the Hough transform method (proposed by Paul Hough in 1962), or the like.
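A minimal sketch of Hough-based tilt estimation, assuming OpenCV and NumPy; the Canny and Hough parameters are illustrative, and the projection-method alternative mentioned above is not shown.

```python
import cv2
import numpy as np

def estimate_skew(binary_img) -> float:
    """Median inclination (degrees) of near-horizontal segments found by the Hough transform."""
    edges = cv2.Canny(binary_img, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=100, maxLineGap=10)
    if lines is None:
        return 0.0
    angles = [np.degrees(np.arctan2(y2 - y1, x2 - x1))
              for x1, y1, x2, y2 in lines[:, 0]]
    return float(np.median(angles))
```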
In one implementation, the layout analysis value of the live service video stream may be determined as follows.
Specifically, image layout analysis mainly judges whether a pixel region is text or image, and analyzes characteristics of the subtitle characters such as the horizontal and vertical regularity of the typesetting, small characters and character spacing; traditional methods such as contour projection, connected-component labeling and machine learning are adopted. For low-quality images that are tilted, blurred or distorted, deep learning methods are mainly used, such as the fully convolutional network (FCN, Fully Convolutional Networks for Semantic Segmentation) or region-based convolutional neural network methods (Region-based Convolutional Neural Networks, e.g., Faster R-CNN).
In one implementation, the target text of the live service video stream may be determined as follows.
Specifically, the target subtitle text of the key-frame image is extracted as follows. On the basis of preprocessing the image, text detection is performed first: edge detection with the Canny algorithm yields an edge map of the character outlines, and a mapping correction (warp) transform is then applied to the image. Next, the characters in the image are segmented: the image is cut into rows, each containing one line of characters, and each row is then divided according to the histogram in the horizontal direction, producing an image segmented by rows and columns in which each region contains a complete character string. A dilation operation is then performed on each region to repair broken characters, and finally a connected-component labeling algorithm segments the characters from left to right, completing the recognition of the target characters.
In one implementation, the context verification of live service video streams may be determined as follows.
Specifically, the best-matching word sequence can be found and output as the recognition result under the joint guidance of a language model (such as an N-gram or a recurrent neural network language model, RNNLM (Recurrent Neural Network Language Modeling)) and a context dictionary library.
In one implementation manner, the third correspondence includes a correspondence between image binarization and score, and the server 02 determines the score corresponding to image binarization of the live service video stream by querying the score corresponding to image binarization of the live service video stream in the third correspondence.
Illustratively, the correspondence between image binarization and the score is shown in table 13.
TABLE 13
Image binarization    Score
High quality          7-10
Standard quality      4-6
Ordinary quality      0-3
In one implementation manner, the third correspondence includes a correspondence between noise removal values and scores, and the server 02 determines the score corresponding to the noise removal value of the live service video stream by querying the score corresponding to the noise removal value of the live service video stream in the third correspondence.
Illustratively, the correspondence of the noise removal value to the score is shown in table 14.
TABLE 14
Noise removal value    Score
High quality           7-10
Standard quality       4-6
Ordinary quality       0-3
In one implementation manner, the third correspondence includes a correspondence between the inclination correction value and the score, and the server 02 determines the score corresponding to the inclination correction value of the live service video stream by querying the score corresponding to the inclination correction value of the live service video stream in the third correspondence.
Illustratively, the correspondence of the inclination correction value and the score is shown in table 15.
TABLE 15
Inclination correction value    Score
Good robustness                 7-10
Medium robustness               4-6
In one implementation manner, the third correspondence includes a correspondence between a layout analysis value and a score, and the server 02 determines the score corresponding to the layout analysis value of the live broadcast service video stream by querying the score corresponding to the layout analysis value of the live broadcast service video stream in the third correspondence.
Illustratively, the correspondence between layout analysis values and scores is shown in table 16.
TABLE 16
Layout analysis value    Score
Good robustness          7-10
Medium robustness        4-6
In one implementation manner, the third correspondence includes a correspondence between the target text and the score, and the server 02 determines the score of the target text of the live service video stream by querying the score corresponding to the target text of the live service video stream in the third correspondence.
In one implementation manner, the third correspondence includes a correspondence between a context check and a score, and the server 02 determines the score corresponding to the context check of the live service video stream by querying the score corresponding to the context check of the live service video stream in the third correspondence.
Illustratively, the correspondence of the context check and the score is shown in table 17.
TABLE 17
Context verification    Score
High accuracy           7-10
Medium accuracy         4-6
Low accuracy            0-3
S151, the server 02 determines a caption score according to the score of each feature in the third feature set. Wherein the caption score satisfies the following formula:
K_text = ω_t1 × S_t1 + ω_t2 × S_t2 + ω_t3 × S_t3 + ω_t4 × S_t4 + ω_t5 × S_t5 + ω_t6 × S_t6
wherein K_text represents the subtitle score, S_t1 represents the score corresponding to image binarization, S_t2 represents the score corresponding to the noise removal value, S_t3 represents the score corresponding to the inclination correction value, S_t4 represents the score corresponding to the layout analysis value, S_t5 represents the score corresponding to the target text, S_t6 represents the score corresponding to the context verification, and ω_t1, ω_t2, ω_t3, ω_t4, ω_t5 and ω_t6 are constants greater than 0 and less than or equal to 1, with ω_t1 + ω_t2 + ω_t3 + ω_t4 + ω_t5 + ω_t6 = 1.
In one embodiment, as shown in fig. 2, S16 may be implemented in the following manner.
The server 02 determines a composite score from the picture score, the voice score, and the subtitle score. Wherein the composite score satisfies the following formula:
K_total = ω_video × K_video + ω_voice × K_voice + ω_text × K_text
wherein K_total represents the composite score, K_video represents the picture score, K_voice represents the voice score, K_text represents the subtitle score, and ω_video, ω_voice and ω_text are constants greater than 0 and less than or equal to 1, with ω_video + ω_voice + ω_text = 1.
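A minimal sketch of the composite score and the alarm threshold check; the weight values and function names shown are illustrative placeholders, not values prescribed by the patent.

```python
def composite_score(k_video, k_voice, k_text,
                    w_video=0.4, w_voice=0.3, w_text=0.3):
    """K_total = w_video*K_video + w_voice*K_voice + w_text*K_text (weights sum to 1)."""
    return w_video * k_video + w_voice * k_voice + w_text * k_text

def is_abnormal(k_total, target_score):
    """Alarm information should be generated when the composite score reaches the target."""
    return k_total >= target_score
```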
The foregoing description of the solution provided by the embodiments of the present invention has been mainly presented in terms of a method. To achieve the above functions, it includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The embodiment of the invention can divide the functional modules of the data processing device according to the method example, for example, each functional module can be divided corresponding to each function, or two or more functions can be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present invention, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.
Fig. 4 is a schematic structural diagram of a server 02 according to an embodiment of the present invention. The server 02 is used for acquiring live broadcast service video streams; extracting a target picture group, voice data and subtitle data in a live broadcast service video stream; determining a picture score according to a first feature set contained in the target picture group; determining a speech score from a second set of features contained in the speech data; determining a caption score according to a third feature set contained in the caption data; determining a composite score according to the picture score, the voice score and the subtitle score; and when the comprehensive score is determined to be greater than or equal to the target score, generating alarm information. The server 02 may include an acquisition unit 101 and a processing unit 102.
An obtaining unit 101, configured to obtain a live service video stream. For example, in connection with fig. 2, the acquisition unit 101 may be used to perform S11.
The processing unit 102 is configured to extract a target picture group, voice data and subtitle data in a live broadcast service video stream; determining a picture score according to a first feature set contained in the target picture group; determining a speech score from a second set of features contained in the speech data; determining a caption score according to a third feature set contained in the caption data; determining a composite score according to the picture score, the voice score and the subtitle score; and when the comprehensive score is determined to be greater than or equal to the target score, generating alarm information. For example, in connection with fig. 2, the processing unit 102 may be used to perform S12, S13, S14, S15, S16 and S17. Referring to fig. 3, the processing unit 102 may be configured to perform S130, S131, S140, S141, S150, and S151.
For the relevant content of each step in the above method embodiment, reference may be made to the functional description of the corresponding functional module; the effects are not repeated here.
Of course, the server 02 provided in the embodiment of the present invention includes, but is not limited to, the above modules; for example, the server 02 may further include a storage unit 103. The storage unit 103 may be used for storing the program code of the server 02, and may also be used for storing data generated by the server 02 during operation, such as data in a write request.
Fig. 5 is a schematic structural diagram of a server 02 according to an embodiment of the present invention, as shown in fig. 5, where the server 02 may include: at least one processor 51, a memory 52, a communication interface 53 and a communication bus 54.
The following describes each constituent element of the server 02 in detail with reference to fig. 5:
the processor 51 is the control center of the server 02, and may be a single processor or a collective term for a plurality of processing elements. For example, the processor 51 is a central processing unit (Central Processing Unit, CPU), but may also be an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, for example one or more digital signal processors (Digital Signal Processor, DSP), or one or more field programmable gate arrays (Field Programmable Gate Array, FPGA).
In a particular implementation, processor 51 may include one or more CPUs, such as CPU0 and CPU1 shown in FIG. 5, as an example. Also, as one example, the server 02 may include a plurality of processors, such as the processor 51 and the processor 55 shown in fig. 5. Each of these processors may be a Single-core processor (Single-CPU) or a Multi-core processor (Multi-CPU). A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
The memory 52 may be, but is not limited to, a read-only memory (Read-Only Memory, ROM) or other type of static storage device that can store static information and instructions, a random access memory (Random Access Memory, RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 52 may be stand-alone and coupled to the processor 51 via the communication bus 54. The memory 52 may also be integrated with the processor 51.
In a specific implementation, the memory 52 is used to store the data of the present invention and the software programs for executing the present invention. The processor 51 performs various functions of the server 02 by running or executing the software program stored in the memory 52 and calling the data stored in the memory 52.
The communication interface 53 uses any transceiver-like means for communicating with other devices or communication networks, such as a radio access network (Radio Access Network, RAN), a wireless local area network (Wireless Local Area Networks, WLAN), a terminal, a cloud, etc. The communication interface 53 may include a receiving unit implementing a receiving function and a transmitting unit implementing a transmitting function.
The communication bus 54 may be an industry standard architecture (Industry Standard Architecture, ISA) bus, a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 5, but this does not mean that there is only one bus or only one type of bus.
As an example, in connection with fig. 4, the acquisition unit 101 in the server 02 realizes the same function as the communication interface 53 in fig. 5, the processing unit 102 realizes the same function as the processor 51 in fig. 5, and the storage unit 103 realizes the same function as the memory 52 in fig. 5.
Another embodiment of the present invention also provides a computer-readable storage medium having stored therein instructions which, when executed on a computer, cause the computer to perform the method shown in the above-described method embodiment.
In some embodiments, the disclosed methods may be implemented as computer program instructions encoded on a computer-readable storage medium in a machine-readable format or encoded on other non-transitory media or articles of manufacture.
Fig. 6 schematically illustrates a conceptual partial view of a computer program product provided by an embodiment of the invention, the computer program product comprising a computer program for executing a computer process on a computing device.
In one embodiment, a computer program product is provided using a signal bearing medium 410. The signal bearing medium 410 may include one or more program instructions that, when executed by one or more processors, may provide the functionality, or portions of the functionality, described above with respect to fig. 2. Thus, for example, referring to the embodiment shown in fig. 2, one or more features of S11-S17 may be carried by one or more instructions associated with the signal bearing medium 410. In addition, the program instructions described in fig. 6 are likewise exemplary.
In some examples, the signal bearing medium 410 may comprise a computer readable medium 411 such as, but not limited to, a hard disk drive, a compact disc (CD), a digital video disc (DVD), a digital tape, memory, a read-only memory (ROM), or a random access memory (RAM), among others.
In some implementations, the signal bearing medium 410 may include a computer recordable medium 412 such as, but not limited to, memory, read/write (R/W) CD, R/W DVD, and the like.
In some implementations, the signal bearing medium 410 may include a communication medium 413 such as, but not limited to, a digital and/or analog communication medium (e.g., fiber optic cable, waveguide, wired communications link, wireless communications link, etc.).
The signal bearing medium 410 may be conveyed by the communication medium 413 in wireless form (e.g., a wireless communication medium conforming to the IEEE 802.11 standard or another transmission protocol). The one or more program instructions may be, for example, computer-executable instructions or logic-implemented instructions.
In some examples, a data processing apparatus such as the one described with respect to fig. 2 may be configured to provide various operations, functions, or actions in response to program instructions conveyed through one or more of the computer-readable medium 411, the computer-recordable medium 412, and/or the communication medium 413.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional modules is illustrated by way of example. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the method described in the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or the like.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the present invention is not limited thereto, but any changes or substitutions within the technical scope of the present invention should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of data processing, comprising:
acquiring a live broadcast service video stream;
extracting a target picture group, voice data and subtitle data in the live broadcast service video stream;
determining a picture score according to a first feature set contained in the target picture group;
determining a speech score according to a second feature set contained in the speech data;
determining a caption score according to a third feature set contained in the caption data;
determining a composite score according to the picture score, the voice score and the subtitle score;
generating alarm information when the composite score is determined to be greater than or equal to a target score; the alarm information is used for indicating that the live broadcast service video stream is abnormal;
the third feature set at least comprises image binarization, noise removal value, inclination correction value, layout analysis value, target characters and context relation verification;
The determining a caption score according to the third feature set contained in the caption data includes:
determining the score of each feature in the third feature set according to a preset third corresponding relation; wherein the third correspondence includes a correspondence of each feature in the third feature set to a score;
determining a caption score according to the score of each feature in the third feature set; wherein the caption score satisfies the following formula:
K_text = ω_t1×S_t1 + ω_t2×S_t2 + ω_t3×S_t3 + ω_t4×S_t4 + ω_t5×S_t5 + ω_t6×S_t6
where K_text represents the caption score, S_t1 represents the score corresponding to image binarization, S_t2 represents the score corresponding to the noise removal value, S_t3 represents the score corresponding to the inclination correction value, S_t4 represents the score corresponding to the layout analysis value, S_t5 represents the score corresponding to the target text, S_t6 represents the score corresponding to the context relation check, ω_t1, ω_t2, ω_t3, ω_t4, ω_t5 and ω_t6 are constants greater than 0 and less than or equal to 1, and ω_t1 + ω_t2 + ω_t3 + ω_t4 + ω_t5 + ω_t6 = 1.
2. The data processing method according to claim 1, wherein the first feature set includes at least picture color, light compensation value, scene information, skin exposure, human target site, behavioral characteristics, and background noise interference cancellation;
the determining a picture score according to the first feature set contained in the target picture group includes:
Determining the score of each feature in the first feature set according to a first preset corresponding relation; wherein the first correspondence includes a correspondence of each feature in the first feature set to a score;
determining a picture score according to the score of each feature in the first feature set; wherein the picture score satisfies the following formula:
K_video = ω_f1×S_f1 + ω_f2×S_f2 + ω_f3×S_f3 + ω_f4×S_f4 + ω_f5×S_f5 + ω_f6×S_f6 + ω_f7×S_f7
where K_video represents the picture score, S_f1 represents the score corresponding to the picture color, S_f2 represents the score corresponding to the light compensation value, S_f3 represents the score corresponding to the scene information, S_f4 represents the score corresponding to the skin exposure, S_f5 represents the score corresponding to the human target site, S_f6 represents the score corresponding to the behavioral characteristics, S_f7 represents the score corresponding to background noise interference cancellation, ω_f1, ω_f2, ω_f3, ω_f4, ω_f5, ω_f6 and ω_f7 are constants greater than 0 and less than or equal to 1, and ω_f1 + ω_f2 + ω_f3 + ω_f4 + ω_f5 + ω_f6 + ω_f7 = 1.
3. The data processing method of claim 1, wherein the second feature set includes at least a volume value, a pitch frequency, frequency domain information, a speech subband energy, a subband spectral centroid, a target phonetic text, and background noise interference cancellation;
the determining a voice score according to the second feature set contained in the voice data comprises:
Determining the score of each feature in the second feature set according to a second preset corresponding relation; wherein the second correspondence includes a correspondence of each feature in the second feature set to a score;
determining a speech score based on the score for each feature in the second set of features; wherein the speech score satisfies the following formula:
K_voice = ω_v1×S_v1 + ω_v2×S_v2 + ω_v3×S_v3 + ω_v4×S_v4 + ω_v5×S_v5 + ω_v6×S_v6 + ω_v7×S_v7
where K_voice represents the voice score, S_v1 represents the score corresponding to the volume value, S_v2 represents the score corresponding to the pitch frequency, S_v3 represents the score corresponding to the frequency domain information, S_v4 represents the score corresponding to the voice subband energy, S_v5 represents the score corresponding to the subband spectral centroid, S_v6 represents the score corresponding to the target voice text, S_v7 represents the score corresponding to background noise interference cancellation, ω_v1, ω_v2, ω_v3, ω_v4, ω_v5, ω_v6 and ω_v7 are constants greater than 0 and less than or equal to 1, and ω_v1 + ω_v2 + ω_v3 + ω_v4 + ω_v5 + ω_v6 + ω_v7 = 1.
4. A data processing method according to any one of claims 1 to 3, wherein determining a composite score from the picture score, the speech score and the subtitle score comprises:
determining a composite score according to the picture score, the voice score and the subtitle score; wherein the composite score satisfies the following formula:
K_total = ω_video×K_video + ω_voice×K_voice + ω_text×K_text
where K_total represents the composite score, K_video represents the picture score, K_voice represents the voice score, K_text represents the subtitle score, ω_video, ω_voice and ω_text are constants greater than 0 and less than or equal to 1, and ω_video + ω_voice + ω_text = 1.
5. A data processing apparatus, comprising:
the acquisition unit is used for acquiring the live broadcast service video stream;
the processing unit is used for extracting the target picture group, the voice data and the subtitle data in the live broadcast service video stream acquired by the acquisition unit;
the processing unit is further configured to determine a picture score according to a first feature set included in the target picture group;
the processing unit is further configured to determine a speech score according to a second feature set included in the speech data;
the processing unit is further configured to determine a caption score according to a third feature set included in the caption data;
the processing unit is further configured to determine a composite score according to the picture score, the voice score, and the subtitle score;
the processing unit is further configured to generate alarm information when the composite score is determined to be greater than or equal to the target score; the alarm information is used for indicating that the live broadcast service video stream is abnormal;
The third feature set at least comprises image binarization, noise removal value, inclination correction value, layout analysis value, target characters and context relation verification;
the processing unit is specifically configured to determine a score of each feature in the third feature set according to a third preset correspondence; wherein the third correspondence includes a correspondence of each feature in the third feature set to a score;
the processing unit is specifically configured to determine a caption score according to the score of each feature in the third feature set; wherein the caption score satisfies the following formula:
K_text = ω_t1×S_t1 + ω_t2×S_t2 + ω_t3×S_t3 + ω_t4×S_t4 + ω_t5×S_t5 + ω_t6×S_t6
where K_text represents the caption score, S_t1 represents the score corresponding to image binarization, S_t2 represents the score corresponding to the noise removal value, S_t3 represents the score corresponding to the inclination correction value, S_t4 represents the score corresponding to the layout analysis value, S_t5 represents the score corresponding to the target text, S_t6 represents the score corresponding to the context relation check, ω_t1, ω_t2, ω_t3, ω_t4, ω_t5 and ω_t6 are constants greater than 0 and less than or equal to 1, and ω_t1 + ω_t2 + ω_t3 + ω_t4 + ω_t5 + ω_t6 = 1.
6. The data processing apparatus of claim 5, wherein the first feature set includes at least picture color, light compensation value, scene information, skin exposure, human target site, behavioral characteristics, and background noise interference cancellation;
The processing unit is specifically configured to determine a score of each feature in the first feature set according to a first preset correspondence; wherein the first correspondence includes a correspondence of each feature in the first feature set to a score;
the processing unit is specifically configured to determine a picture score according to a score of each feature in the first feature set; wherein the picture score satisfies the following formula:
K_video = ω_f1×S_f1 + ω_f2×S_f2 + ω_f3×S_f3 + ω_f4×S_f4 + ω_f5×S_f5 + ω_f6×S_f6 + ω_f7×S_f7
where K_video represents the picture score, S_f1 represents the score corresponding to the picture color, S_f2 represents the score corresponding to the light compensation value, S_f3 represents the score corresponding to the scene information, S_f4 represents the score corresponding to the skin exposure, S_f5 represents the score corresponding to the human target site, S_f6 represents the score corresponding to the behavioral characteristics, S_f7 represents the score corresponding to background noise interference cancellation, ω_f1, ω_f2, ω_f3, ω_f4, ω_f5, ω_f6 and ω_f7 are constants greater than 0 and less than or equal to 1, and ω_f1 + ω_f2 + ω_f3 + ω_f4 + ω_f5 + ω_f6 + ω_f7 = 1.
7. The data processing apparatus of claim 5, wherein the second feature set includes at least a volume value, a pitch frequency, frequency domain information, a speech subband energy, a subband spectral centroid, a target phonetic text, and background noise interference cancellation;
the processing unit is specifically configured to determine a score of each feature in the second feature set according to a second preset correspondence; wherein the second correspondence includes a correspondence of each feature in the second feature set to a score;
The processing unit is specifically configured to determine a speech score according to the score of each feature in the second feature set; wherein the speech score satisfies the following formula:
K_voice = ω_v1×S_v1 + ω_v2×S_v2 + ω_v3×S_v3 + ω_v4×S_v4 + ω_v5×S_v5 + ω_v6×S_v6 + ω_v7×S_v7
where K_voice represents the voice score, S_v1 represents the score corresponding to the volume value, S_v2 represents the score corresponding to the pitch frequency, S_v3 represents the score corresponding to the frequency domain information, S_v4 represents the score corresponding to the voice subband energy, S_v5 represents the score corresponding to the subband spectral centroid, S_v6 represents the score corresponding to the target voice text, S_v7 represents the score corresponding to background noise interference cancellation, ω_v1, ω_v2, ω_v3, ω_v4, ω_v5, ω_v6 and ω_v7 are constants greater than 0 and less than or equal to 1, and ω_v1 + ω_v2 + ω_v3 + ω_v4 + ω_v5 + ω_v6 + ω_v7 = 1.
8. The data processing apparatus according to any one of claims 5-7, wherein the processing unit is configured to determine a composite score based on the picture score, the speech score and the subtitle score; wherein the composite score satisfies the following formula:
K_total = ω_video×K_video + ω_voice×K_voice + ω_text×K_text
where K_total represents the composite score, K_video represents the picture score, K_voice represents the voice score, K_text represents the subtitle score, ω_video, ω_voice and ω_text are constants greater than 0 and less than or equal to 1, and ω_video + ω_voice + ω_text = 1.
9. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the data processing method according to any of the preceding claims 1-4.
10. A server, comprising: communication interface, processor, memory, bus;
the memory is used for storing computer execution instructions, and the processor is connected with the memory through the bus;
when the server is running, the processor executes the computer-executable instructions stored in the memory to cause the server to perform the data processing method according to any one of the preceding claims 1-4.