CN117115155A - Image analysis method and system based on AI live broadcast - Google Patents

Image analysis method and system based on AI live broadcast

Info

Publication number
CN117115155A
Authority
CN
China
Prior art keywords
image
frame
video
video segment
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311370419.1A
Other languages
Chinese (zh)
Inventor
陈达剑
李火亮
陈鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Tuoshi Intelligent Technology Co ltd
Original Assignee
Jiangxi Tuoshi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Tuoshi Intelligent Technology Co ltd filed Critical Jiangxi Tuoshi Intelligent Technology Co ltd
Priority to CN202311370419.1A priority Critical patent/CN117115155A/en
Publication of CN117115155A publication Critical patent/CN117115155A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image analysis method and system based on AI live broadcast. The method comprises the following steps: acquiring a live video stream and preprocessing the video stream to obtain a video set comprising a plurality of video segments, each video segment comprising a plurality of frame images; determining a key frame image from the video segment and extracting an optical flow image from the video segment; determining a combined feature of the video segment based on the key frame image and the optical flow image; and constructing an LSTM classifier and training the LSTM classifier with the combined features so that the LSTM classifier can recognize sensitive images. With this method, sensitive images can be identified automatically, accurately, efficiently and in real time, replacing manual moderation and meeting the requirement of real-time management.

Description

Image analysis method and system based on AI live broadcast
Technical Field
The application relates to the technical field of image processing, in particular to an image analysis method and system based on AI live broadcast.
Background
With the continuous improvement of network quality and bandwidth, people have access to increasingly fast, high-quality network services. Against this background, the live-streaming industry has grown rapidly with the support of extensive network deployment and high network speeds.
With the development of artificial intelligence technology, AI live broadcasting has also been widely adopted in the live-streaming industry, replacing the traditional live-broadcast mode with a new one. In AI live broadcasting, an intelligent virtual avatar constructed with artificial intelligence techniques broadcasts in real time on a network platform, which can enhance the user experience, improve broadcast quality and improve the live-broadcast environment.
However, the problems of AI live broadcasting have gradually become apparent. Sensitive images can appear during an AI live broadcast, and their management is still essentially manual, relying mainly on supervision by users and platform staff. Because the number of simultaneous AI live broadcasts is extremely large, manual management alone cannot meet the requirement of real-time control of sensitive images during live broadcasting.
Disclosure of Invention
The embodiments of the present application provide an image analysis method and system based on AI live broadcasting, which address the technical problem that, in the prior art, sensitive images in the AI live broadcast process are managed only manually, the management capacity is insufficient, and the requirement of real-time management of sensitive images during AI live broadcasting cannot be met.
In a first aspect, an embodiment of the present application provides an image analysis method based on AI live broadcast, including the following steps:
acquiring a live video stream, and preprocessing the video stream to acquire a video set comprising a plurality of video segments, wherein the video segments comprise a plurality of frame images;
determining a key frame image from the video segment, and extracting an optical flow image in the video segment;
determining a combined feature of the video segment based on the key frame image and the optical flow image;
constructing an LSTM classifier, and training the LSTM classifier through the combined features so as to enable the LSTM classifier to have sensitive image recognition capability.
Further, the step of preprocessing the video stream to obtain a video set including a plurality of video segments, the video segments including a plurality of frame images includes:
dividing the video stream into a plurality of frame images under continuous frames, dividing the frame images of a first frame into a first video segment, and taking the frame images of the first frame as a first center point of the first video segment;
comparing the similarity between the frame image of the second frame and the first center point;
if the similarity between the frame image of the second frame and the first center point is greater than a similarity threshold, dividing the frame image of the second frame into the first video segment, and calculating a first updated center point of the first video segment;
if the similarity between the frame image of the second frame and the first center point is smaller than a similarity threshold, dividing the frame image of the second frame into a second video segment, and taking the frame image of the second frame as a second center point of the second video segment;
and sequentially processing the subsequent frame images in a time sequence until a plurality of frame images are classified into a plurality of video segments to form a video set.
Further, the calculation formula of the first update center point is:
$$C_i' = \frac{1}{n_i + 1}\left(\sum_{k=1}^{n_i} x_{i,k} + x_j\right)$$

wherein $C_i'$ represents the first updated center point, $n_i$ indicates the number of existing frame images in the i-th video segment, $x_j$ represents the frame image of the j-th frame, and $x_{i,k}$ represents the k-th frame image in the i-th video segment.
Further, the step of determining a key frame image from the video segment specifically includes:
and calculating entropy values of the frame images in the video segment, and comparing the entropy values of different frame images to select the frame image with the largest entropy value as a key frame image.
Further, the step of extracting an optical flow image in the video segment includes:
extracting a first-direction moving image and a second-direction moving image of the video segment by a TV-L1 dense optical flow algorithm;
and accumulating the first direction moving image and the second direction moving image as an optical flow image.
Further, the step of determining the combined features of the video segment based on the key frame image and the optical flow image comprises:
constructing a feature extraction model, wherein the feature extraction model comprises a spatial convolution network and an optical flow convolution network, and the spatial convolution network and the optical flow convolution network are both connected with a combined network;
taking the key frame image as an input value of the spatial convolution network to acquire spatial features;
taking the optical flow image as an input value of the optical flow convolution network to acquire action characteristics;
and taking the spatial characteristics and the action characteristics as input values of the combined network so as to acquire the combined characteristics through the combined network.
Further, the step of acquiring the combined characteristic through the combined network specifically includes:
and carrying out averaging processing on the action features and the space features through the combined network to form combined features.
Further, the step of constructing an LSTM classifier, training the LSTM classifier by the combined features to enable the LSTM classifier to have sensitive image recognition capabilities includes:
constructing a plurality of first neurons to form an LSTM layer;
constructing a plurality of second neurons to form a fully connected layer, and connecting the LSTM layers with the fully connected layer to form an LSTM classifier;
the combined characteristic is partitioned into a positive sample and a negative sample based on a sensitive image and a normal image, and the positive sample and the negative sample are input into the LSTM classifier as input values, so that the LSTM classifier has sensitive image recognition capability.
Further, the fully-connected layer includes 2 second neurons, and an activation function of the fully-connected layer is a softmax function.
In a second aspect, an embodiment of the present application provides an image analysis system based on AI live broadcast, which is applied to an image analysis method based on AI live broadcast in the above technical solution, where the system includes:
the preprocessing module is used for acquiring a live video stream, preprocessing the video stream to acquire a video set comprising a plurality of video segments, wherein the video segments comprise a plurality of frame images;
the extraction module is used for determining a key frame image from the video segment and extracting an optical flow image in the video segment;
a combination module for determining a combination feature of the video segment based on the key frame image and the optical flow image;
and the analysis module is used for constructing an LSTM classifier, and training the LSTM classifier through the combined features so as to enable the LSTM classifier to have sensitive image recognition capability.
Compared with the prior art, the application has the following beneficial effects: by extracting the key frame image and then extracting its spatial features, a neural network can be trained with those spatial features to automatically recognize sensitive images in the AI live video stream; by additionally extracting the optical flow image and obtaining its action features, the spatial features and action features are fused into combined features, which avoids misjudging images that are hard to distinguish when only spatial features are considered and effectively improves the accuracy of automatic sensitive image recognition; and after the LSTM classifier is constructed and trained, once extraction of the combined features is complete, the LSTM classifier can classify images from a large number of video streams in the same time period, achieving real-time, accurate and efficient automatic recognition of sensitive images.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
Fig. 1 is a flowchart of an image analysis method based on AI live broadcast in a first embodiment of the present application;
fig. 2 is a block diagram of an image analysis system based on AI live broadcast according to a second embodiment of the present application;
the application will be further described in the following detailed description in conjunction with the above-described figures.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. All other embodiments, which can be made by a person of ordinary skill in the art based on the embodiments provided by the present application without making any inventive effort, are intended to fall within the scope of the present application.
It is apparent that the drawings in the following description show only some examples or embodiments of the present application, and those of ordinary skill in the art can apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that although such a development effort might be complex and time-consuming, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the described embodiments of the application can be combined with other embodiments without conflict.
Referring to fig. 1, the image analysis method based on AI live broadcast provided by the first embodiment of the present application includes the following steps:
step S10: acquiring a live video stream, and preprocessing the video stream to acquire a video set comprising a plurality of video segments, wherein the video segments comprise a plurality of frame images;
the purpose of preprocessing the video stream is to summarize and classify the frame images with higher similarity, so that the data processing amount of the video stream is effectively reduced, the calculation complexity is reduced, and the efficiency of sensitive image recognition is improved. The step S10 includes:
s110: dividing the video stream into a plurality of frame images under continuous frames, dividing the frame images of a first frame into a first video segment, and taking the frame images of the first frame as a first center point of the first video segment;
s120: comparing the similarity between the frame image of the second frame and the first center point;
in the initial situation, comparing the similarity of the frame image of the second frame with that of the first frame, specifically, constructing an HSV color space, and equally dividing an H value in the HSV color space into 12 sections, an S value into 5 sections and a V value into 5 sections to form a plurality of index areas; mapping the frame image of the first frame and the frame image of the second frame into HSV color space respectively to obtain a first color histogram and a second color histogram; acquiring a corresponding minimum value of the first color histogram and the second color histogram in the index area; it is understandable that the H value, S value, and V value of the first color histogram and the second color histogram fall into the number of points in the index region; and superposing the minimum values of different index areas to form a similarity value.
S130: if the similarity between the frame image of the second frame and the first center point is greater than a similarity threshold, dividing the frame image of the second frame into the first video segment, and calculating a first updated center point of the first video segment;
that is, the similarity value is smaller than the similarity threshold, it is understood that, in the different index regions, the first color histogram and the second color histogram have a larger number of points falling into the index regions, which means that the frame image of the first frame has a higher similarity with the frame image of the second frame, and therefore, the frame image of the second frame needs to be classified into the first video segment. After the division is completed, when the frame images of the subsequent frames are processed, the number of the frame images in the first video segment is changed, so that the first center point needs to be updated to meet the subsequent similarity comparison requirement.
Specifically, the calculation formula of the first update center point is:
$$C_i' = \frac{1}{n_i + 1}\left(\sum_{k=1}^{n_i} x_{i,k} + x_j\right)$$

wherein $C_i'$ represents the first updated center point, $n_i$ indicates the number of existing frame images in the i-th video segment, $x_j$ represents the frame image of the j-th frame, and $x_{i,k}$ represents the k-th frame image in the i-th video segment.
S140: if the similarity between the frame image of the second frame and the first center point is smaller than a similarity threshold, dividing the frame image of the second frame into a second video segment, and taking the frame image of the second frame as a second center point of the second video segment;
that is, when the similarity is insufficient, the frame image of the second frame is separated from the frame image of the first frame.
S150: sequentially processing the subsequent frame images in a time sequence until a plurality of frame images are classified into a plurality of video segments to form a video set;
It can be understood that when the frame image of a later frame is processed, its similarity to the center point of each already determined video segment is compared, and the frame image is classified by the processing of steps S130 to S140, which is not repeated here.
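Putting steps S110 to S150 together, the sketch below clusters frames into video segments. It is only a sketch: it assumes each frame is represented by an HSV histogram and compared by histogram intersection as in the earlier snippet (hsv_histogram, histogram_similarity), that a segment's center point is the running mean of its members (matching the update formula above), and that each new frame is compared only with the most recently formed segment's center, which is one reading of the description; the 0.7 threshold is illustrative.

```python
def segment_video(frames, threshold=0.7):
    """Cluster consecutive frames into video segments by similarity to segment centers.

    `frames` is a list of frame images in temporal order; returns a list of
    (segment_frames, center_histogram) pairs.
    """
    segments = []                                    # each entry: [frames, center, count]
    for frame in frames:
        hist = hsv_histogram(frame)
        if segments and histogram_similarity(hist, segments[-1][1]) > threshold:
            seg = segments[-1]
            seg[0].append(frame)
            n = seg[2]
            # running-mean center update: new_center = (n * old_center + hist) / (n + 1)
            seg[1] = (seg[1] * n + hist) / (n + 1)
            seg[2] = n + 1
        else:
            segments.append([[frame], hist, 1])      # start a new segment; this frame is its center
    return [(seg[0], seg[1]) for seg in segments]
```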
Step S20: determining a key frame image from the video segment, and extracting an optical flow image in the video segment;
the step S20 includes:
s210: calculating entropy values of the frame images in the video segment, and comparing the entropy values of different frame images to select the frame image with the largest entropy value as a key frame image;
Image entropy is a statistical feature that reflects the average amount of information in an image; specifically, it refers to the amount of information contained in the aggregate features of the image's gray-level distribution. It will be appreciated that the video stream as a whole contains a number of key frame images, one per video segment.
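A minimal sketch of this entropy-based key frame selection (gray-level Shannon entropy; OpenCV and NumPy assumed):

```python
import cv2
import numpy as np

def image_entropy(frame_bgr):
    """Shannon entropy of the gray-level distribution of a frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()
    p = p[p > 0]                                    # drop empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())

def key_frame(segment_frames):
    """Select the frame with the largest entropy as the key frame of the segment."""
    return max(segment_frames, key=image_entropy)
```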
S220: extracting a first-direction moving image and a second-direction moving image of the video segment by a TV-L1 dense optical flow algorithm;
optical flow is a method used to describe objects in a scene that dynamically change between consecutive frames due to motion. Essentially, it is a two-dimensional field of vectors, each representing the displacement of the point in the scene from the previous frame to the next frame. And solving the optical flow, namely inputting two continuous frame images in the video segment, and outputting a two-dimensional vector field based on pixels of the frame images.
Taking two continuous frame images in the video segment as an example, the illumination energy of the frame image of the previous frame can determine a first X-axis coordinate and a first Y-axis coordinate in a coordinate system, the illumination energy of the frame image of the next frame can determine a second X-axis coordinate and a second Y-axis coordinate in the coordinate system, the position change from the first X-axis coordinate to the second X-axis coordinate is the first direction moving image, and the second direction moving image is the same.
S230: accumulating the first-direction moving image and the second-direction moving image as an optical flow image;
Both the first-direction moving image and the second-direction moving image describe the motion trend of the frame images, so the two need to be considered together; that is, they are accumulated into an optical flow image that corresponds to the key frame image.
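A sketch of steps S220 to S230 using the TV-L1 implementation in opencv-contrib-python (cv2.optflow.DualTVL1OpticalFlow_create); summing the per-pair horizontal and vertical flow components over the segment is our reading of the accumulation described above:

```python
import cv2
import numpy as np

def accumulated_optical_flow(segment_frames):
    """Accumulate TV-L1 dense optical flow over a video segment.

    Returns an (H, W, 2) array: channel 0 is the accumulated first-direction
    (horizontal) motion, channel 1 the accumulated second-direction (vertical) motion.
    """
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in segment_frames]
    acc = np.zeros((*grays[0].shape, 2), dtype=np.float32)
    for prev, nxt in zip(grays[:-1], grays[1:]):
        flow = tvl1.calc(prev, nxt, None)           # per-pixel (dx, dy) between consecutive frames
        acc += flow
    return acc
```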
Step S30: determining a combined feature of the video segment based on the key frame image and the optical flow image;
The key frame image and the optical flow image are considered together; that is, the action features represented by the optical flow image are taken into account at the same time as the spatial features represented by the key frame image.
Specifically, the step S30 includes:
s310: constructing a feature extraction model, wherein the feature extraction model comprises a spatial convolution network and an optical flow convolution network, and the spatial convolution network and the optical flow convolution network are both connected with a combined network;
it should be noted that, the spatial convolution network and the optical flow convolution network both use RseNet-50 network models, which are different in that the training data for model training is different. The combination network is used for combining the features extracted through the spatial convolution network and the optical flow convolution network.
S320: taking the key frame image as an input value of the spatial convolution network to acquire spatial features;
s330: taking the optical flow image as an input value of the optical flow convolution network to acquire action characteristics;
s340: and taking the spatial characteristics and the action characteristics as input values of the combined network so as to acquire the combined characteristics through the combined network.
Specifically, the action features and the spatial features are subjected to an averaging process through the combination network to form the combined features. Assuming that 1000 action features and 1000 spatial features are extracted, the number of combined features after the averaging process is 1000.
In some embodiments, the combination network may also form the combined features by superimposing the action features and the spatial features. Assuming that 1000 action features and 1000 spatial features are extracted, the number of combined features after the superposition process is still 1000.
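One possible PyTorch sketch of the feature extraction model of steps S310 to S340: two ResNet-50 backbones with the classification head removed, one taking the key frame and one taking the optical flow image, fused by element-wise averaging as the combination network. The two-channel flow input, the input resolution and the use of torchvision (0.13 or later) are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoStreamFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # spatial stream: standard 3-channel ResNet-50 with the classification head removed
        spatial = models.resnet50(weights=None)
        spatial.fc = nn.Identity()
        self.spatial_net = spatial
        # optical-flow stream: ResNet-50 whose first convolution takes a 2-channel (dx, dy) input
        flow = models.resnet50(weights=None)
        flow.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)
        flow.fc = nn.Identity()
        self.flow_net = flow

    def forward(self, key_frame, flow_image):
        spatial_feat = self.spatial_net(key_frame)   # (batch, 2048) spatial features
        action_feat = self.flow_net(flow_image)      # (batch, 2048) action features
        return (spatial_feat + action_feat) / 2      # combination network: averaging fusion

# usage (shapes are illustrative):
# feats = TwoStreamFeatureExtractor()(torch.randn(1, 3, 224, 224), torch.randn(1, 2, 224, 224))
```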
Step S40: constructing an LSTM classifier, and training the LSTM classifier through the combined features so as to enable the LSTM classifier to have sensitive image recognition capability.
Specifically, the step S40 includes:
s410: constructing a plurality of first neurons to form an LSTM layer;
preferably, 128 of said first neurons are constructed to form said LSTM layer. The LSTM layer is used for learning the combined features, so that the LSTM classifier has distinguishing capability.
S420: constructing a plurality of second neurons to form a fully connected layer, and connecting the LSTM layers with the fully connected layer to form an LSTM classifier;
Since only sensitive images need to be distinguished, the classification result of the LSTM classifier is either a sensitive image or a normal image, and thus the fully connected layer includes 2 second neurons. Further, the activation function of the fully connected layer is a softmax function.
S430: the combined characteristic is partitioned into a positive sample and a negative sample based on a sensitive image and a normal image, and the positive sample and the negative sample are input into the LSTM classifier as input values, so that the LSTM classifier has sensitive image recognition capability.
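A hedged PyTorch sketch of steps S410 to S430: an LSTM layer of 128 units followed by a 2-neuron fully connected layer with a softmax output. The 2048-dimensional input, the sequence handling, and the training details are assumptions; only the 128 LSTM units, the two output neurons, and the softmax activation come from the description above.

```python
import torch
import torch.nn as nn

class SensitiveImageLSTM(nn.Module):
    """LSTM classifier: an LSTM layer (128 units) followed by a 2-neuron fully connected layer."""

    def __init__(self, feature_dim=2048, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)            # two outputs: normal image / sensitive image

    def forward(self, combined_features):
        # combined_features: (batch, sequence_length, feature_dim), one vector per video segment
        _, (h_n, _) = self.lstm(combined_features)
        return self.fc(h_n[-1])                       # raw logits; apply softmax for probabilities

# training sketch: positive (sensitive) samples labelled 1, negative (normal) samples labelled 0
# model = SensitiveImageLSTM()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# logits = model(batch_features)                      # (batch, 2)
# loss = nn.functional.cross_entropy(logits, labels)  # softmax is applied inside cross_entropy
# loss.backward(); optimizer.step()
```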
By extracting the key frame image and then extracting its spatial features, a neural network can be trained with those spatial features to automatically recognize sensitive images in the AI live video stream. At the same time, the optical flow image is extracted and its action features are obtained, and the spatial features and action features are fused into combined features; this avoids misjudging images that are hard to distinguish when only spatial features are considered and effectively improves the accuracy of automatic sensitive image recognition. After the LSTM classifier is constructed and trained, and once extraction of the combined features is complete, the LSTM classifier can classify images from a large number of video streams in the same time period, achieving real-time, accurate and efficient automatic recognition of sensitive images.
Referring to fig. 2, a second embodiment of the present application provides an image analysis system based on AI live broadcast, which applies the AI live broadcast-based image analysis method of the above embodiment; details already described are not repeated. As used below, the terms "module," "unit," "sub-unit," and the like may refer to a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
The system comprises:
the preprocessing module 10 is configured to acquire a live video stream, and preprocess the video stream to acquire a video set including a plurality of video segments, where the video segments include a plurality of frame images;
the preprocessing module 10 includes:
a first unit, configured to divide the video stream into a plurality of frame images under consecutive frames, divide the frame images of a first frame into a first video segment, and use the frame images of the first frame as a first center point of the first video segment;
a second unit, configured to perform similarity comparison on the frame image of a second frame and the first center point;
a third unit, configured to divide the frame image of the second frame into the first video segment and calculate a first updated center point of the first video segment if the similarity between the frame image of the second frame and the first center point is greater than a similarity threshold;
a fourth unit, configured to divide the frame image of the second frame into a second video segment if the similarity between the frame image of the second frame and the first center point is smaller than a similarity threshold, and take the frame image of the second frame as a second center point of the second video segment;
and a fifth unit, configured to sequentially process the subsequent frame images in a time sequence until a plurality of frame images are classified into a plurality of video segments, so as to form a video set.
An extraction module 20, configured to determine a key frame image from the video segment, and extract an optical flow image in the video segment;
the extraction module 20 includes:
a sixth unit, configured to calculate entropy values of the frame images in the video segment, and compare the entropy values of different frame images, so as to select the frame image with the largest entropy value as a key frame image;
a seventh unit, configured to extract a first direction moving image and a second direction moving image of the video segment through a TV-L1 dense optical flow algorithm;
eighth means for accumulating the first-direction moving image and the second-direction moving image as optical flow images;
a combination module 30 for determining a combination feature of the video segment based on the key frame image and the optical flow image;
the combining module 30 includes:
a ninth unit, configured to construct a feature extraction model, where the feature extraction model includes a spatial convolution network and an optical flow convolution network, and the spatial convolution network and the optical flow convolution network are both connected to a combined network;
a tenth unit, configured to use the key frame image as an input value of the spatial convolution network, so as to obtain a spatial feature;
an eleventh unit, configured to take the optical flow image as an input value of the optical flow convolution network, so as to obtain an action feature;
a twelfth unit, configured to use the spatial feature and the action feature as input values of the combination network, so as to obtain a combination feature through the combination network;
the twelfth unit is specifically configured to take the spatial feature and the action feature as input values of the combination network, and to perform an averaging process on the action feature and the spatial feature through the combination network to form a combined feature;
an analysis module 40, configured to construct an LSTM classifier, and train the LSTM classifier through the combined features, so that the LSTM classifier has a sensitive image recognition capability;
the analysis module 40 includes:
a thirteenth unit for constructing a number of first neurons to form LSTM layers;
a fourteenth unit for constructing a plurality of second neurons to form a fully connected layer, connecting the LSTM layers to the fully connected layer to form an LSTM classifier;
a fifteenth unit, configured to divide the combined feature into a positive sample and a negative sample based on a sensitive image and a normal image, and input the positive sample and the negative sample as input values into the LSTM classifier, so that the LSTM classifier has a sensitive image recognition capability.
The application also provides a computer, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the AI live broadcast-based image analysis method of the above technical solution when executing the computer program.
The application also provides a storage medium, on which a computer program is stored, which when being executed by a processor implements the AI live broadcast-based image analysis method as described in the above technical scheme.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. An image analysis method based on AI live broadcast is characterized by comprising the following steps:
acquiring a live video stream, and preprocessing the video stream to acquire a video set comprising a plurality of video segments, wherein the video segments comprise a plurality of frame images;
determining a key frame image from the video segment, and extracting an optical flow image in the video segment;
determining a combined feature of the video segment based on the key frame image and the optical flow image;
constructing an LSTM classifier, and training the LSTM classifier through the combined features so as to enable the LSTM classifier to have sensitive image recognition capability.
2. The AI live-based image analysis method of claim 1, wherein the step of preprocessing the video stream to obtain a video set including a number of video segments, the video segments including a number of frame images, comprises:
dividing the video stream into a plurality of frame images under continuous frames, dividing the frame images of a first frame into a first video segment, and taking the frame images of the first frame as a first center point of the first video segment;
comparing the similarity between the frame image of the second frame and the first center point;
if the similarity between the frame image of the second frame and the first center point is greater than a similarity threshold, dividing the frame image of the second frame into the first video segment, and calculating a first updated center point of the first video segment;
if the similarity between the frame image of the second frame and the first center point is smaller than a similarity threshold, dividing the frame image of the second frame into a second video segment, and taking the frame image of the second frame as a second center point of the second video segment;
and sequentially processing the subsequent frame images in a time sequence until a plurality of frame images are classified into a plurality of video segments to form a video set.
3. The AI live-based image analysis method of claim 2, wherein the first updated center point has a calculation formula:
$$C_i' = \frac{1}{n_i + 1}\left(\sum_{k=1}^{n_i} x_{i,k} + x_j\right)$$

wherein $C_i'$ represents the first updated center point, $n_i$ indicates the number of existing frame images in the i-th video segment, $x_j$ represents the frame image of the j-th frame, and $x_{i,k}$ represents the k-th frame image in the i-th video segment.
4. The AI live-based image analysis method of claim 1, wherein the step of determining a key frame image from the video segment is specifically:
and calculating entropy values of the frame images in the video segment, and comparing the entropy values of different frame images to select the frame image with the largest entropy value as a key frame image.
5. The AI live-based image analysis method of claim 1, wherein the step of extracting an optical flow image in the video segment comprises:
extracting a first-direction moving image and a second-direction moving image of the video segment by a TV-L1 dense optical flow algorithm;
and accumulating the first direction moving image and the second direction moving image as an optical flow image.
6. The AI live-based image analysis method of claim 1, wherein the step of determining a combined feature of the video segment based on the key frame image and the optical flow image comprises:
constructing a feature extraction model, wherein the feature extraction model comprises a spatial convolution network and an optical flow convolution network, and the spatial convolution network and the optical flow convolution network are both connected with a combined network;
taking the key frame image as an input value of the spatial convolution network to acquire spatial features;
taking the optical flow image as an input value of the optical flow convolution network to acquire action characteristics;
and taking the spatial characteristics and the action characteristics as input values of the combined network so as to acquire the combined characteristics through the combined network.
7. The AI-live-based image analysis method of claim 6, wherein the step of obtaining the combined features through the combined network is specifically:
and carrying out averaging processing on the action features and the space features through the combined network to form combined features.
8. The AI live-based image analysis method of claim 1, wherein the step of constructing an LSTM classifier, training the LSTM classifier with the combined features to provide the LSTM classifier with sensitive image recognition capabilities comprises:
constructing a plurality of first neurons to form an LSTM layer;
constructing a plurality of second neurons to form a fully connected layer, and connecting the LSTM layers with the fully connected layer to form an LSTM classifier;
the combined characteristic is partitioned into a positive sample and a negative sample based on a sensitive image and a normal image, and the positive sample and the negative sample are input into the LSTM classifier as input values, so that the LSTM classifier has sensitive image recognition capability.
9. The AI live-based image analysis method of claim 8, wherein the fully connected layer includes 2 of the second neurons, and an activation function of the fully connected layer is a softmax function.
10. An AI live broadcast-based image analysis system applied to the AI live broadcast-based image analysis method as claimed in any one of claims 1 to 9, characterized in that the system comprises:
the preprocessing module is used for acquiring a live video stream, preprocessing the video stream to acquire a video set comprising a plurality of video segments, wherein the video segments comprise a plurality of frame images;
the extraction module is used for determining a key frame image from the video segment and extracting an optical flow image in the video segment;
a combination module for determining a combination feature of the video segment based on the key frame image and the optical flow image;
and the analysis module is used for constructing an LSTM classifier, and training the LSTM classifier through the combined features so as to enable the LSTM classifier to have sensitive image recognition capability.
CN202311370419.1A 2023-10-23 2023-10-23 Image analysis method and system based on AI live broadcast Pending CN117115155A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311370419.1A CN117115155A (en) 2023-10-23 2023-10-23 Image analysis method and system based on AI live broadcast

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311370419.1A CN117115155A (en) 2023-10-23 2023-10-23 Image analysis method and system based on AI live broadcast

Publications (1)

Publication Number Publication Date
CN117115155A true CN117115155A (en) 2023-11-24

Family

ID=88795073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311370419.1A Pending CN117115155A (en) 2023-10-23 2023-10-23 Image analysis method and system based on AI live broadcast

Country Status (1)

Country Link
CN (1) CN117115155A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288015A (en) * 2017-01-10 2018-07-17 武汉大学 Human motion recognition method and system in video based on THE INVARIANCE OF THE SCALE OF TIME
US10402656B1 (en) * 2017-07-13 2019-09-03 Gopro, Inc. Systems and methods for accelerating video analysis
CN111950653A (en) * 2020-08-24 2020-11-17 腾讯科技(深圳)有限公司 Video processing method and device, storage medium and electronic equipment
CN112989117A (en) * 2021-04-14 2021-06-18 北京世纪好未来教育科技有限公司 Video classification method and device, electronic equipment and computer storage medium
CN113114946A (en) * 2021-04-19 2021-07-13 深圳市帧彩影视科技有限公司 Video processing method and device, electronic equipment and storage medium
CN113673307A (en) * 2021-07-05 2021-11-19 浙江工业大学 Light-weight video motion recognition method
CN113850158A (en) * 2021-09-08 2021-12-28 深圳供电局有限公司 Video feature extraction method
CN115734025A (en) * 2021-08-30 2023-03-03 北京安云世纪科技有限公司 Method and system for detecting redundant segments of video
US11768504B2 (en) * 2020-06-10 2023-09-26 AI Incorporated Light weight and real time slam for robots

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AMIN ULLAH et al.: "Activity Recognition Using Temporal Optical Flow Convolutional Features and Multilayers LSTM", IEEE *
ZHOU Yuxin et al.: "Research on lightweight action recognition method based on key frames", Chinese Journal of Scientific Instrument *

Similar Documents

Publication Publication Date Title
CN108304798B (en) Street level order event video detection method based on deep learning and motion consistency
CN112131978B (en) Video classification method and device, electronic equipment and storage medium
CN102334118B (en) Promoting method and system for personalized advertisement based on interested learning of user
CN108960059A (en) A kind of video actions recognition methods and device
CN104679818B (en) A kind of video key frame extracting method and system
CN110807757B (en) Image quality evaluation method and device based on artificial intelligence and computer equipment
CN111091109B (en) Method, system and equipment for predicting age and gender based on face image
CN107358141B (en) Data identification method and device
CN105740915B (en) A kind of collaboration dividing method merging perception information
CN109657715B (en) Semantic segmentation method, device, equipment and medium
CN111723773B (en) Method and device for detecting carryover, electronic equipment and readable storage medium
CN107622280B (en) Modularized processing mode image saliency detection method based on scene classification
CN111723687A (en) Human body action recognition method and device based on neural network
dos Santos et al. CV-C3D: action recognition on compressed videos with convolutional 3d networks
CN113221770A (en) Cross-domain pedestrian re-identification method and system based on multi-feature hybrid learning
CN110751191A (en) Image classification method and system
CN114782859B (en) Method for establishing target behavior perception space-time positioning model and application
CN109740527B (en) Image processing method in video frame
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network
CN116993760A (en) Gesture segmentation method, system, device and medium based on graph convolution and attention mechanism
CN116469177A (en) Living body target detection method with mixed precision and training method of living body detection model
CN116110074A (en) Dynamic small-strand pedestrian recognition method based on graph neural network
CN117115155A (en) Image analysis method and system based on AI live broadcast
CN115424164A (en) Method and system for constructing scene self-adaptive video data set
CN115457620A (en) User expression recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20240920