CN117975173A - Child evil dictionary picture identification method and device based on light-weight visual converter - Google Patents
- Publication number
- CN117975173A (application CN202410389340.1A)
- Authority
- CN
- China
- Prior art keywords
- network
- convolution
- layer
- converter
- mobile network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention provides a child evil dictionary picture identification method and device based on a lightweight visual converter, and relates to the technical field of image identification. The identification method comprises: S1, acquiring a cartoon image to be identified; S2, preprocessing the cartoon image; S3, inputting the preprocessed cartoon image into a trained real-time child evil dictionary picture identification model based on the lightweight visual converter to obtain a prediction vector; and S4, comparing the prediction vector against a prediction threshold to judge whether the cartoon image belongs to the child evil dictionary pictures. The network structure of the real-time child evil dictionary picture identification model based on the lightweight visual converter comprises a first convolution layer network, a first mobile network, a second mobile network, a third mobile network, a fourth mobile network, a first lightweight converter network, a fifth mobile network, a second lightweight converter network, a sixth mobile network, a third lightweight converter network, a second convolution layer network and a multi-layer perceptron which are sequentially connected.
Description
Technical Field
The invention relates to the technical field of image recognition, in particular to a child evil dictionary picture recognition method and device based on a lightweight visual converter.
Background
A child evil dictionary picture (often associated with so-called "Elsagate" content) refers to a class of inappropriate images that may appear on children's content platforms. These pictures are often disguised with a child-friendly look but actually contain themes and scenes unsuitable for children to view.
Compared with general image recognition technology, which identifies the objects contained in an image, automatic recognition of child evil dictionary pictures is still at an early stage of development. Its main challenges include the following five points.
1. Strong camouflage: child evil dictionary pictures often imitate or are disguised as normal children's content, making them difficult to identify automatically. Producers may cleverly use familiar children's cartoon characters or styles, blurring the boundary between genuinely child-friendly content and inappropriate content.
2. Diversity and variation: producers continually try new ways to circumvent recognition. This diversity and changeability add to the complexity of recognition algorithms.
3. Difficult data acquisition: training and optimizing models requires a large amount of labeled data, whose collection may be limited by ethical and legal constraints.
4. Semantic understanding: some evil dictionary content may contain hidden malicious themes that require a deep understanding of semantics and context. Simple image feature extraction is therefore likely to miss some problems.
5. Deployment cost: to cope with the above four problems, conventional models have to be built very large, which makes them difficult to deploy on edge devices and slow to run.
In view of the above, the applicant has studied the prior art and has made the present application.
Disclosure of Invention
The invention provides a child evil dictionary picture identification method and device based on a lightweight visual converter, so as to alleviate at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a child evil dictionary picture identification method based on a lightweight visual converter, which includes steps S1 to S4.
S1, acquiring a cartoon image to be identified.
S2, preprocessing the cartoon image.
S3, inputting the preprocessed cartoon image into a pre-trained real-time child evil dictionary picture identification model based on the lightweight visual converter, and obtaining a prediction vector indicating whether the cartoon image belongs to the child evil dictionary pictures.
S4, comparing the prediction vector against a prediction threshold to judge whether the cartoon image belongs to the child evil dictionary pictures.
The network structure of the real-time child evil dictionary picture identification model based on the lightweight visual converter comprises a first convolution layer network, a first mobile network, a second mobile network, a third mobile network, a fourth mobile network, a first lightweight converter network, a fifth mobile network, a second lightweight converter network, a sixth mobile network, a third lightweight converter network, a second convolution layer network and a multi-layer perceptron which are sequentially connected.
In an alternative embodiment, the training step of the real-time child evil dictionary picture identification model based on the lightweight visual converter includes steps A1 to A4.
A1, randomly selecting batches from the training set, where each batch contains n pictures.
A2, scaling the n pictures in each batch to 256×256, and then applying data enhancement to all the pictures in the batch.
A3, inputting the data-enhanced picture data into an untrained real-time child evil dictionary picture identification model based on the lightweight visual converter, and obtaining a group of prediction confidences.
A4, computing the loss between the prediction confidences and the labels of the n pictures, and optimizing the obtained loss value through the back-propagation algorithm until training is completed. The Loss function adopts the Focal Loss function.
The Focal Loss function is:

$$FL(p, y) = -\alpha \, y \, (1 - p)^{\gamma} \log(p) - (1 - \alpha)(1 - y) \, p^{\gamma} \log(1 - p)$$

where $FL(\cdot)$ denotes the classification loss function, $\alpha$ is a hyper-parameter for balancing the imbalance of positive and negative samples in the loss function, $p$ is the model's classification prediction, $\gamma$ is a hyper-parameter for adjusting the loss of easy and difficult samples in the loss function, and $y$ is the category label of the picture.
In an alternative embodiment, the first convolution layer network takes as input a picture of size 256×256×3 and outputs a vector of size 128×128×16. The output of the first mobile network is a 128×128×16 vector. The output of the second mobile network is a 64×64×24 vector. The output of the third mobile network is a 64×64×24 vector. The output of the fourth mobile network is a 32×32×48 vector. The output of the first lightweight converter network is a 32×32×48 vector. The output of the fifth mobile network is a 16×16×64 vector. The output of the second lightweight converter network is a 16×16×64 vector. The output of the sixth mobile network is an 8×8×80 vector. The output of the third lightweight converter network is an 8×8×80 vector; the third lightweight converter network is formed by connecting 4 convolution layer networks and 3 converter modules. The output of the second convolution layer network is an 8×8×320 vector. The output of the multi-layer perceptron is a 1×1×2 confidence.
In an alternative embodiment, the first convolution layer network comprises a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence. The convolution layer of the first convolution layer network comprises n = 16 convolution kernels of size 3×3.
In an alternative embodiment, the second convolution layer network comprises a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence. The convolution layer of the second convolution layer network comprises n = 320 convolution kernels of size 1×1.
In an alternative embodiment, the first mobile network and the third mobile network each employ a residual structure. The mobile network of the residual structure comprises a grouping convolution, a batch normalization layer, a SiLU activation function, a convolution layer, a batch normalization layer, a SiLU activation function, and an addition operation connected in sequence.
The grouping convolution of the first mobile network comprises 32 convolution kernels of size 3×3, with step size 1 and group number 32. The convolution layer of the first mobile network comprises 16 convolution kernels of size 1×1, with step size 1 and padding 0.

The grouping convolution of the third mobile network comprises 48 convolution kernels of size 3×3, with step size 1 and group number 48. The convolution layer of the third mobile network comprises 24 convolution kernels of size 1×1, with step size 1 and padding 0.
In an alternative embodiment, the second, fourth, fifth, and sixth mobile networks all adopt a sequential connection structure. The mobile network of the sequential connection structure comprises a grouping convolution, a batch normalization layer, a SiLU activation function, a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence.
The grouping convolution of the second mobile network comprises 32 convolution kernels of size 3×3, with step size 1 and group number 32. The convolution layer of the second mobile network comprises 24 convolution kernels of size 1×1, with step size 1 and padding 0.

The grouping convolution of the fourth mobile network comprises 48 convolution kernels of size 3×3, with step size 1 and group number 48. The convolution layer of the fourth mobile network comprises 48 convolution kernels of size 1×1, with step size 1 and padding 0.

The grouping convolution of the fifth mobile network comprises 96 convolution kernels of size 3×3, with step size 1 and group number 96. The convolution layer of the fifth mobile network comprises 64 convolution kernels of size 1×1, with step size 1 and padding 0.

The grouping convolution of the sixth mobile network comprises 128 convolution kernels of size 3×3, with step size 1 and group number 128. The convolution layer of the sixth mobile network comprises 80 convolution kernels of size 1×1, with step size 1 and padding 0.
In an alternative embodiment, the lightweight converter network comprises a No. 1 convolution layer network, a No. 2 convolution layer network, a vector stretch operation, a plurality of converter modules, a vector stretch operation, and a No. 3 convolution layer network connected in sequence. The number of converter modules is 2 in the first lightweight converter network, 4 in the second, and 3 in the third. The No. 1 convolution layer network comprises a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence; its convolution kernel size is 3×3. The No. 2 and No. 3 convolution layer networks each comprise a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence; their convolution kernel sizes are 1×1.
The number of convolution kernels in the first lightweight converter network is 48 for the No. 1 convolution layer network, 64 for the No. 2, and 48 for the No. 3.

The number of convolution kernels in the second lightweight converter network is 64 for the No. 1 convolution layer network, 80 for the No. 2, and 64 for the No. 3.

The number of convolution kernels in the third lightweight converter network is 80 for the No. 1 convolution layer network, 96 for the No. 2, and 80 for the No. 3.
In an alternative embodiment, the converter module comprises a connected attention layer and feed forward network.
The attention layer of the converter module includes a first layer normalization layer, a first linear layer, a vector stretch operation, an attention operation, a vector stretch operation, and a second linear layer connected in sequence.
The feed-forward network of the converter module includes a second layer normalization layer, a third linear layer, a SiLU activation function, a first Dropout random deactivation, a fourth linear layer, and a second Dropout random deactivation connected in sequence.

The output dimension of the first linear layer is 96. The output dimension of the second linear layer is the same as the input vector dimension of the converter module. The output dimension of the third linear layer is twice the input vector dimension of the converter module. The output dimension of the fourth linear layer is the same as the input vector dimension of the converter module. The deactivation rate of both the first and second Dropout random deactivations is 0.1.
In an alternative embodiment, the multi-layer perceptron includes five sequentially connected blocks, each consisting of a linear layer with input and output dimension 320×8×8, a layer normalization layer, and a ReLU activation function, followed by a sequentially connected linear layer with input dimension 320×8×8 and output dimension 1×1×2, a layer normalization layer, and a Sigmoid activation function.
In a second aspect, an embodiment of the present invention provides a child evil dictionary picture identification apparatus based on a lightweight visual converter, which includes a processor, a memory, and a computer program stored in the memory. The computer program is executable by the processor to implement the child evil dictionary picture identification method based on a lightweight visual converter described in any paragraph of the first aspect.
In a third aspect, embodiments of the present invention provide a computer-readable storage medium. The computer readable storage medium comprises a stored computer program, wherein the computer program controls a device where the computer readable storage medium is located to execute the method for identifying the child evil dictionary based on the lightweight visual converter according to any section of the first aspect.
By adopting the technical scheme, the invention can obtain the following technical effects:
The child evil dictionary picture identification method based on the lightweight visual converter does not need a large amount of computing power. Using computer vision analysis technology, it can identify child evil dictionary pictures in data streams in real time under high-traffic conditions, and it alleviates the challenges of strong camouflage, diversity and variation, semantic understanding, and high deployment cost posed by child evil dictionary content. The recognition model can thus identify various offending scenes quickly, efficiently, and at low cost.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of the child evil dictionary picture identification method based on a lightweight visual converter.

Fig. 2 is a detailed network structure diagram of the child evil dictionary picture identification model in an embodiment.
Fig. 3 is a schematic diagram of a first convolutional layer network in an embodiment.
Fig. 4 is a schematic diagram of a second convolutional layer network structure in an embodiment.
Fig. 5 is a schematic diagram of a mobile network structure (residual structure) in an embodiment.
Fig. 6 is a schematic diagram of a mobile network structure (sequential connection structure) in an embodiment.
Fig. 7 is a schematic diagram of a lightweight converter network structure in an embodiment.
Fig. 8 is a schematic diagram of a converter module structure in an embodiment.
Fig. 9 is a schematic structural diagram of a multi-layer perceptron in an embodiment.
Fig. 10 is a schematic diagram of a training process in an embodiment.
Fig. 11 is a schematic diagram of a test flow in an embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to figs. 1 to 11, a first embodiment of the present invention provides a child evil dictionary picture identification method based on a lightweight visual converter, which can be performed by a child evil dictionary picture identification device based on a lightweight visual converter (hereinafter referred to as the identification device), in particular by one or more processors in the identification device, to implement steps S1 to S4.
S1, acquiring a cartoon image to be identified.
The cartoon image to be identified can be a normal cartoon image or a child evil dictionary image similar to the normal image.
It is understood that the identification device may be an electronic device with computing capabilities, such as a portable notebook computer, a desktop computer, a server, a smart phone, or a tablet computer.
S2, preprocessing the cartoon image.
S3, inputting the preprocessed cartoon image into a pre-trained real-time child evil dictionary picture identification model based on the lightweight visual converter, and obtaining a prediction vector indicating whether the cartoon image belongs to the child evil dictionary pictures.
S4, comparing the prediction vector against a prediction threshold to judge whether the cartoon image belongs to the child evil dictionary pictures.
As shown in fig. 2, the network structure of the real-time child evil dictionary picture identification model based on the lightweight visual converter includes a first convolution layer network, a first mobile network, a second mobile network, a third mobile network, a fourth mobile network, a first lightweight converter network, a fifth mobile network, a second lightweight converter network, a sixth mobile network, a third lightweight converter network, a second convolution layer network, and a multi-layer perceptron that are sequentially connected.
Preferably, the first convolution layer network takes as input a picture of size 256×256×3 and outputs a vector of size 128×128×16. The output of the first mobile network is a 128×128×16 vector. The output of the second mobile network is a 64×64×24 vector. The output of the third mobile network is a 64×64×24 vector. The output of the fourth mobile network is a 32×32×48 vector. The output of the first lightweight converter network is a 32×32×48 vector. The output of the fifth mobile network is a 16×16×64 vector. The output of the second lightweight converter network is a 16×16×64 vector. The output of the sixth mobile network is an 8×8×80 vector. The output of the third lightweight converter network is an 8×8×80 vector; the third lightweight converter network is formed by connecting 4 convolution layer networks and 3 converter modules. The output of the second convolution layer network is an 8×8×320 vector. The output of the multi-layer perceptron is a 1×1×2 confidence.
Specifically, the real-time child evil dictionary picture identification model based on the lightweight visual converter in the embodiment of the invention adopts the lightweight visual converter as its core module. Compared with a traditional visual converter, the lightweight visual converter shows no obvious loss in precision but has a much smaller parameter count, so it can be deployed on edge devices and run smoothly. The low-power-consumption real-time child evil dictionary picture identification model based on the lightweight visual converter judges whether a picture belongs to the child evil dictionary by understanding the semantics of the picture.
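For orientation, the stage-by-stage tensor shapes listed above can be checked with a minimal PyTorch sketch that uses plain convolution stubs in place of the real mobile-network and lightweight-converter blocks; the class and variable names are illustrative, and the stubs use stride 2 wherever the listed outputs halve the spatial resolution:

```python
import torch
import torch.nn as nn

# (out_channels, stride) per stage, taken from the output sizes above.
STAGES = [
    (16, 2),   # first convolution layer network: 256 -> 128
    (16, 1),   # first mobile network
    (24, 2),   # second mobile network: 128 -> 64
    (24, 1),   # third mobile network
    (48, 2),   # fourth mobile network: 64 -> 32
    (48, 1),   # first lightweight converter network
    (64, 2),   # fifth mobile network: 32 -> 16
    (64, 1),   # second lightweight converter network
    (80, 2),   # sixth mobile network: 16 -> 8
    (80, 1),   # third lightweight converter network
    (320, 1),  # second convolution layer network (1x1 in the patent)
]

def build_shape_stub() -> nn.Sequential:
    layers, in_ch = [], 3
    for out_ch, stride in STAGES:
        layers.append(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1))
        in_ch = out_ch
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 256, 256)      # a preprocessed cartoon image
print(build_shape_stub()(x).shape)   # torch.Size([1, 320, 8, 8])
```

The resulting 8×8×320 feature map is what the multi-layer perceptron consumes to produce the 1×1×2 confidence.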
The child evil dictionary picture identification method based on the lightweight visual converter does not need a large amount of computing power. Using computer vision analysis technology, it can identify child evil dictionary pictures in data streams in real time under high-traffic conditions, and it alleviates the challenges of strong camouflage, diversity and variation, semantic understanding, and high deployment cost. The child evil dictionary recognition model can thus recognize various offending scenes at low cost, high speed, and high efficiency.
In an alternative embodiment of the present invention, as shown in fig. 3, based on the above embodiment, the first convolution layer network includes a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence. The convolution layer of the first convolution layer network comprises n = 16 convolution kernels of size 3×3.

In an alternative embodiment of the present invention, as shown in fig. 4, based on the above embodiment, the second convolution layer network includes a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence. The convolution layer of the second convolution layer network comprises n = 320 convolution kernels of size 1×1.
In an alternative embodiment of the present invention, as shown in fig. 5, on the basis of the above embodiment, the first mobile network and the third mobile network each adopt a residual structure. The mobile network of the residual structure comprises a grouping convolution, a batch normalization layer, a SiLU activation function, a convolution layer, a batch normalization layer, a SiLU activation function, and an addition operation connected in sequence. The addition operation sums the input of the residual-structure mobile network and the output of the last SiLU activation function.
Specifically, the first mobile network consists of a grouping convolution (32 convolution kernels of size 3×3, step size 1, group number 32), a batch normalization layer, a SiLU activation function, a convolution layer (16 convolution kernels of size 1×1, step size 1, padding 0), a batch normalization layer, a SiLU activation function, and an addition operation connected in sequence.

The third mobile network consists of a grouping convolution (48 convolution kernels of size 3×3, step size 1, group number 48), a batch normalization layer, a SiLU activation function, a convolution layer (24 convolution kernels of size 1×1, step size 1, padding 0), a batch normalization layer, a SiLU activation function, and an addition operation connected in sequence.
As shown in fig. 6, in an alternative embodiment of the present invention, on the basis of the above embodiment, the second, fourth, fifth, and sixth mobile networks all adopt a sequential connection structure. The mobile network of the sequential connection structure comprises a grouping convolution, a batch normalization layer, a SiLU activation function, a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence.
Specifically, the second mobile network consists of a grouping convolution (32 convolution kernels of size 3×3, step size 1, group number 32), a batch normalization layer, a SiLU activation function, a convolution layer (24 convolution kernels of size 1×1, step size 1, padding 0), a batch normalization layer, and a SiLU activation function connected in sequence.

The fourth mobile network consists of a grouping convolution (48 convolution kernels of size 3×3, step size 1, group number 48), a batch normalization layer, a SiLU activation function, a convolution layer (48 convolution kernels of size 1×1, step size 1, padding 0), a batch normalization layer, and a SiLU activation function connected in sequence.

The fifth mobile network consists of a grouping convolution (96 convolution kernels of size 3×3, step size 1, group number 96), a batch normalization layer, a SiLU activation function, a convolution layer (64 convolution kernels of size 1×1, step size 1, padding 0), a batch normalization layer, and a SiLU activation function connected in sequence.

The sixth mobile network consists of a grouping convolution (128 convolution kernels of size 3×3, step size 1, group number 128), a batch normalization layer, a SiLU activation function, a convolution layer (80 convolution kernels of size 1×1, step size 1, padding 0), a batch normalization layer, and a SiLU activation function connected in sequence.
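A minimal PyTorch sketch of the mobile network described above follows. One labeled assumption: the patent lists a group count equal to the kernel count of the grouping convolution (e.g. 32 groups for 32 kernels), but a grouped convolution also requires the input channel count to be divisible by the group count, so this sketch uses groups equal to the input channel count, which preserves the depthwise character. The class name is illustrative.

```python
import torch
import torch.nn as nn

class MobileNetworkBlock(nn.Module):
    """Grouping conv -> BN -> SiLU -> 1x1 conv -> BN -> SiLU,
    with an optional residual addition (first/third mobile networks)."""
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, residual: bool):
        super().__init__()
        self.residual = residual
        self.body = nn.Sequential(
            # grouping convolution: 3x3 kernels, step size 1
            nn.Conv2d(in_ch, mid_ch, 3, stride=1, padding=1, groups=in_ch),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),
            # 1x1 convolution layer, step size 1, padding 0
            nn.Conv2d(mid_ch, out_ch, 1, stride=1, padding=0),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.body(x)
        return x + y if self.residual else y

# First mobile network: 32 grouped 3x3 kernels, 16-channel 1x1 output,
# residual structure (input and output are both 128x128x16).
first_mn = MobileNetworkBlock(16, 32, 16, residual=True)
print(first_mn(torch.randn(1, 16, 128, 128)).shape)  # (1, 16, 128, 128)
```

The sequential-connection variant (second, fourth, fifth, and sixth mobile networks) is the same block with `residual=False`.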
As shown in fig. 7, in an alternative embodiment of the present invention, the lightweight converter network includes a No. 1 convolution layer network, a No. 2 convolution layer network, a vector stretch operation, a plurality of converter modules, a vector stretch operation, and a No. 3 convolution layer network connected in sequence. The number of converter modules is 2 in the first lightweight converter network, 4 in the second, and 3 in the third.

The No. 1 convolution layer network comprises a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence; its convolution kernel size is 3×3. The No. 2 and No. 3 convolution layer networks each comprise a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence; their convolution kernel sizes are 1×1.
The number of convolution kernels in the first lightweight converter network is 48 for the No. 1 convolution layer network, 64 for the No. 2, and 48 for the No. 3.

The number of convolution kernels in the second lightweight converter network is 64 for the No. 1 convolution layer network, 80 for the No. 2, and 64 for the No. 3.

The number of convolution kernels in the third lightweight converter network is 80 for the No. 1 convolution layer network, 96 for the No. 2, and 80 for the No. 3.
Specifically, the No. 1 convolution layer network has the same structure as the first convolution layer network of the overall model, but with n = 48, 64, or 80 (for the first, second, and third lightweight converter networks, respectively). The No. 2 convolution layer network has the same structure as the second convolution layer network of the overall model, but with n = 64, 80, or 96 (respectively). The No. 3 convolution layer network likewise follows the second convolution layer network's structure, but with n = 48, 64, or 80 (respectively).
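A sketch of the lightweight converter network's data flow, assuming the vector stretch operations are the flatten/unflatten steps between 2-D feature maps and token sequences (as in MobileViT-style designs). To keep the sketch self-contained, PyTorch's built-in `nn.TransformerEncoderLayer` stands in for the converter module described in the next section (its feed-forward activation differs); the class name and head count are illustrative.

```python
import torch
import torch.nn as nn

def conv_bn_silu(in_ch: int, out_ch: int, k: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(),
    )

class LightweightConverterNetwork(nn.Module):
    """No. 1 conv (3x3) -> No. 2 conv (1x1) -> stretch to tokens ->
    N converter modules -> stretch back -> No. 3 conv (1x1)."""
    def __init__(self, channels: int, token_dim: int, num_modules: int):
        super().__init__()
        self.no1 = conv_bn_silu(channels, channels, 3)
        self.no2 = conv_bn_silu(channels, token_dim, 1)
        self.converter_modules = nn.Sequential(*[
            nn.TransformerEncoderLayer(token_dim, nhead=4,
                                       dim_feedforward=2 * token_dim,
                                       dropout=0.1, batch_first=True)
            for _ in range(num_modules)])
        self.no3 = conv_bn_silu(token_dim, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        y = self.no2(self.no1(x))
        tokens = y.flatten(2).transpose(1, 2)            # vector stretch
        tokens = self.converter_modules(tokens)
        y = tokens.transpose(1, 2).unflatten(2, (h, w))  # stretch back
        return self.no3(y)

# First lightweight converter network: 48 channels, 64-dim tokens, 2 modules.
lwt1 = LightweightConverterNetwork(48, 64, 2)
print(lwt1(torch.randn(1, 48, 32, 32)).shape)  # (1, 48, 32, 32)
```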
In an alternative embodiment of the invention, as shown in fig. 8, on the basis of the above-described embodiment, the converter module comprises a connected attention layer and feed-forward network.
The attention layer of the converter module includes a first layer normalization layer, a first linear layer, a vector stretch operation, an attention operation, a vector stretch operation, and a second linear layer connected in sequence.
The feed-forward network of the converter module includes a second layer normalization layer, a third linear layer, a SiLU activation function, a first Dropout random deactivation, a fourth linear layer, and a second Dropout random deactivation connected in sequence.

The output dimension of the first linear layer is 96. The output dimension of the second linear layer is the same as the input vector dimension of the converter module. The output dimension of the third linear layer is twice the input vector dimension of the converter module. The output dimension of the fourth linear layer is the same as the input vector dimension of the converter module. The deactivation rate of both the first and second Dropout random deactivations is 0.1.
Specifically, the converter module is a base module of a lightweight converter network.
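A sketch of the converter module per the description above: a pre-norm attention layer followed by a feed-forward network with SiLU and 0.1 dropout. Two labeled assumptions: the 96-dim output of the first linear layer is read here as being split into query/key/value of 32 dims each around the attention operation, and residual connections are assumed, as is standard for transformer blocks.

```python
import torch
import torch.nn as nn

class ConverterModule(nn.Module):
    def __init__(self, dim: int, qkv_dim: int = 96, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)            # first layer normalization
        self.to_qkv = nn.Linear(dim, qkv_dim)     # first linear layer (out 96)
        self.proj = nn.Linear(qkv_dim // 3, dim)  # second linear layer
        self.norm2 = nn.LayerNorm(dim)            # second layer normalization
        self.ffn = nn.Sequential(
            nn.Linear(dim, 2 * dim),              # third linear layer (2x dim)
            nn.SiLU(),
            nn.Dropout(dropout),                  # first Dropout, rate 0.1
            nn.Linear(2 * dim, dim),              # fourth linear layer
            nn.Dropout(dropout),                  # second Dropout, rate 0.1
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # attention layer: norm -> linear -> stretch -> attention -> linear
        q, k, v = self.to_qkv(self.norm1(x)).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, -1)
        x = x + self.proj(attn @ v)
        # feed-forward network, also with a residual connection
        return x + self.ffn(self.norm2(x))

tokens = torch.randn(1, 32 * 32, 64)  # tokens entering the first converter network
print(ConverterModule(64)(tokens).shape)  # (1, 1024, 64)
```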
As shown in fig. 9, in an alternative embodiment of the present invention, the multi-layer perceptron includes five sequentially connected blocks, each consisting of a linear layer with input and output dimension 320×8×8, a layer normalization layer, and a ReLU activation function, followed by a sequentially connected linear layer with input dimension 320×8×8 and output dimension 1×1×2, a layer normalization layer, and a Sigmoid activation function.
Specifically, a picture is input into the low-power-consumption real-time child evil dictionary picture identification model based on the lightweight visual converter, and the model outputs the confidence of whether the picture is a child evil dictionary picture.
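A sketch of this head under the reading above, flattening the 8×8×320 feature map to width 320·8·8 = 20480; the flattening itself is an assumption, and note that five linear layers at this width each hold roughly 4×10⁸ weights, so the full-width head is memory-hungry:

```python
import torch
import torch.nn as nn

def build_mlp_head(feat: int = 320 * 8 * 8, num_blocks: int = 5) -> nn.Sequential:
    layers: list[nn.Module] = [nn.Flatten()]  # (B, 320, 8, 8) -> (B, feat)
    for _ in range(num_blocks):  # five linear + layer-norm + ReLU blocks
        layers += [nn.Linear(feat, feat), nn.LayerNorm(feat), nn.ReLU()]
    layers += [nn.Linear(feat, 2), nn.LayerNorm(2), nn.Sigmoid()]  # 1x1x2
    return nn.Sequential(*layers)

head = build_mlp_head()
confidence = head(torch.randn(1, 320, 8, 8))  # e.g. tensor([[0.47, 0.53]])
```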
Aiming at the shortcomings of publicly available child evil dictionary identification products and technologies, the embodiment of the invention particularly addresses the problems that existing approaches cannot identify child evil dictionary pictures in various scenes, cannot identify highly camouflaged child evil dictionary pictures, and are difficult to deploy for real-time operation on edge devices.
The real-time child evil dictionary picture identification model based on the lightweight visual converter has high identification accuracy and speed, and the lightweight model requires so little computing power that it runs smoothly on edge devices, so various child evil dictionary images can be better understood in more scenes. The harm of child evil dictionary content to children's psychological health can thereby be greatly reduced.
As shown in fig. 10, in an alternative embodiment of the present invention, the training step of the real-time child evil dictionary picture identification model based on the lightweight visual converter includes steps A1 to A4.
It will be appreciated that a child evil dictionary picture identification data set needs to be acquired before training. Specifically, a large number of pictures are crawled from mainstream websites at home and abroad, and from certain other websites, using crawler scripts with keywords such as "children's evil dictionary" and "elsagate"; manual secondary screening then yields a high-quality child evil dictionary data set. These pictures fall into two categories: normal animated images and child evil dictionary images.
After the data set is produced, it is divided into a training set and a test set for training and testing the low-power-consumption real-time child evil dictionary picture identification model based on the lightweight visual converter. The pictures of the training set and the test set do not overlap.
A1, carrying out batch random selection from the training set. Wherein each batch is n pictures. Specifically, the value of n can be reasonably selected according to the size of the video memory of the GPU device, which is not particularly limited in the present invention.
A2, scaling the n pictures in each batch to 256×256, and then applying data enhancement to all the pictures in the batch. The data enhancement uses MixUp, color space transformations, picture rotation, affine transformation, and other data enhancement methods, as sketched below.
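As one concrete example of the enhancement step, a minimal MixUp sketch (the Beta parameter 0.2 is a common default, not a value stated in the patent):

```python
import torch

def mixup(images: torch.Tensor, labels: torch.Tensor, alpha: float = 0.2):
    """Blend each image/label pair with a randomly chosen partner.
    images: (n, 3, 256, 256); labels: (n, 2) one-hot or soft labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_images, mixed_labels
```

Note that MixUp produces soft labels between 0 and 1, which is consistent with the remark below that the category label takes values between 0 and 1.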
A3, inputting the data-enhanced picture data into an untrained real-time child evil dictionary picture identification model based on the lightweight visual converter, and obtaining a group of prediction confidences.
A4, computing the loss between the prediction confidences and the labels of the n pictures, and optimizing the obtained loss value through the back-propagation algorithm until training is completed. The Loss function adopts the Focal Loss function.
The Focal Loss function is:

$$FL(p, y) = -\alpha \, y \, (1 - p)^{\gamma} \log(p) - (1 - \alpha)(1 - y) \, p^{\gamma} \log(1 - p)$$

where $FL(\cdot)$ denotes the classification loss function, $\alpha$ is a hyper-parameter for balancing the imbalance of positive and negative samples in the loss function, $p$ is the model's classification prediction, $\gamma$ is a hyper-parameter for adjusting the loss of easy and difficult samples in the loss function, and $y$ is the category label of the picture.

Through this learning procedure, the low-power-consumption real-time child evil dictionary picture identification model based on the lightweight visual converter continuously and iteratively learns to understand child evil dictionary scenes. $\gamma$ adjusts the loss of easy and difficult samples so that the loss function pays more attention to difficult samples. $y$, the category label of the picture, takes values between 0 and 1.
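A sketch of the binary Focal Loss as written above, with the common defaults α = 0.25 and γ = 2 (the patent does not state its hyper-parameter values):

```python
import torch

def focal_loss(p: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """p: predicted confidences in (0, 1); y: labels in [0, 1]
    (soft labels from MixUp are allowed)."""
    p = p.clamp(1e-6, 1 - 1e-6)  # avoid log(0)
    pos = -alpha * y * (1 - p) ** gamma * torch.log(p)
    neg = -(1 - alpha) * (1 - y) * p ** gamma * torch.log(1 - p)
    return (pos + neg).mean()

loss = focal_loss(torch.tensor([0.9, 0.2]), torch.tensor([1.0, 0.0]))
```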
In an alternative embodiment of the present invention, as shown in fig. 11, the accuracy of the model needs to be tested after training.
First, pictures are read from the test set portion of the child evil dictionary data set and uniformly scaled to 256×256. The picture data is then converted into tensors and input into the low-power-consumption real-time child evil dictionary picture identification model based on the lightweight visual converter, yielding a group of prediction vectors. Finally, each prediction vector is compared with a threshold to judge whether the picture is a child evil dictionary picture.
Preferably, the threshold in this patent is 0.5; in other embodiments the threshold may be set to other values, and the invention is not specifically limited in this regard. In testing, the low-power-consumption real-time child evil dictionary picture identification model based on the lightweight visual converter achieves 93.7% identification precision on the test set of the child evil dictionary identification data set, with a recall of 91.3%.
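The test flow above might look as follows in PyTorch; the trained `model` object is a placeholder, and treating index 1 of the prediction vector as the child-evil-dictionary class is an assumption for illustration:

```python
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),  # uniform scaling, as described
    transforms.ToTensor(),          # PIL image -> CHW float tensor
])

@torch.no_grad()
def is_evil_dictionary_picture(model: torch.nn.Module, path: str,
                               threshold: float = 0.5) -> bool:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    prediction = model(image)                  # 1x2 prediction vector
    return bool(prediction[0, 1] > threshold)  # compare with threshold
```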
The second embodiment of the invention provides a child evil dictionary picture identification device based on a lightweight visual converter, which comprises a processor, a memory, and a computer program stored in the memory. The computer program can be executed by the processor to implement the child evil dictionary picture identification method based on the lightweight visual converter according to any one of the embodiments.
In a third embodiment, embodiments of the present invention provide a computer-readable storage medium. The computer-readable storage medium comprises a stored computer program, wherein, when run, the computer program controls the device on which the computer-readable storage medium is located to execute the child evil dictionary picture identification method based on the lightweight visual converter according to any one of the embodiments.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein is merely one relationship describing the association of the associated objects, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Depending on the context, the word "if" as used herein may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determination" or "when detected (stated condition or event)" or "in response to detection (stated condition or event), depending on the context.
References to "first\second" in the embodiments are merely to distinguish similar objects and do not represent a particular ordering for the objects, it being understood that "first\second" may interchange a particular order or precedence where allowed. It is to be understood that the "first\second" distinguishing aspects may be interchanged where appropriate, such that the embodiments described herein may be implemented in sequences other than those illustrated or described herein.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (5)
1. A child evil dictionary picture identification method based on a lightweight visual converter, characterized by comprising the following steps:
acquiring a cartoon image to be identified;
preprocessing the cartoon image;
inputting the preprocessed cartoon image into a pre-trained real-time child evil dictionary picture identification model based on a lightweight visual converter, and obtaining a prediction vector indicating whether the cartoon image belongs to the child evil dictionary pictures;
comparing the prediction vector against a prediction threshold to judge whether the cartoon image belongs to the child evil dictionary pictures;
the network structure of the real-time child evil dictionary picture identifying model based on the light-weight visual converter comprises a first convolution layer network, a first mobile network, a second mobile network, a third mobile network, a fourth mobile network, a first light-weight converter network, a fifth mobile network, a second light-weight converter network, a sixth mobile network, a third light-weight converter network, a second convolution layer network and a multi-layer perceptron which are sequentially connected;
The first convolution layer network takes as input a picture of size 256×256×3 and outputs a vector of size 128×128×16; the output of the first mobile network is a 128×128×16 vector; the output of the second mobile network is a 64×64×24 vector; the output of the third mobile network is a 64×64×24 vector; the output of the fourth mobile network is a 32×32×48 vector; the output of the first lightweight converter network is a 32×32×48 vector; the output of the fifth mobile network is a 16×16×64 vector; the output of the second lightweight converter network is a 16×16×64 vector; the output of the sixth mobile network is an 8×8×80 vector; the output of the third lightweight converter network is an 8×8×80 vector, and the third lightweight converter network is formed by connecting 4 convolution layer networks and 3 converter modules; the output of the second convolution layer network is an 8×8×320 vector; the output of the multi-layer perceptron is a 1×1×2 confidence;
the first convolution layer network comprises a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence, wherein the convolution layer of the first convolution layer network comprises n = 16 convolution kernels of size 3×3;

the second convolution layer network comprises a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence, wherein the convolution layer of the second convolution layer network comprises n = 320 convolution kernels of size 1×1;
The first mobile network and the third mobile network both adopt residual error structures;
the mobile network of the residual structure comprises a grouping convolution, a batch normalization layer, a SiLU activation function, a convolution layer, a batch normalization layer, a SiLU activation function, and an addition operation connected in sequence;
the grouping convolution of the first mobile network comprises 32 convolution kernels of size 3×3, with step size 1 and group number 32; the convolution layer of the first mobile network comprises 16 convolution kernels of size 1×1, with step size 1 and padding 0;

the grouping convolution of the third mobile network comprises 48 convolution kernels of size 3×3, with step size 1 and group number 48; the convolution layer of the third mobile network comprises 24 convolution kernels of size 1×1, with step size 1 and padding 0;
The second mobile network, the fourth mobile network, the fifth mobile network and the sixth mobile network all adopt sequential connection structures;
the mobile network of the sequential connection structure comprises a grouping convolution, a batch normalization layer, a SiLU activation function, a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence;
the grouping convolution of the second mobile network comprises 32 convolution kernels of size 3×3, with step size 1 and group number 32; the convolution layer of the second mobile network comprises 24 convolution kernels of size 1×1, with step size 1 and padding 0;

the grouping convolution of the fourth mobile network comprises 48 convolution kernels of size 3×3, with step size 1 and group number 48; the convolution layer of the fourth mobile network comprises 48 convolution kernels of size 1×1, with step size 1 and padding 0;

the grouping convolution of the fifth mobile network comprises 96 convolution kernels of size 3×3, with step size 1 and group number 96; the convolution layer of the fifth mobile network comprises 64 convolution kernels of size 1×1, with step size 1 and padding 0;

the grouping convolution of the sixth mobile network comprises 128 convolution kernels of size 3×3, with step size 1 and group number 128; the convolution layer of the sixth mobile network comprises 80 convolution kernels of size 1×1, with step size 1 and padding 0;
the lightweight converter network comprises a No. 1 convolution layer network, a No. 2 convolution layer network, a vector stretch operation, a plurality of converter modules, a vector stretch operation, and a No. 3 convolution layer network connected in sequence; the No. 1 convolution layer network comprises a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence, and its convolution kernel size is 3×3; the No. 2 and No. 3 convolution layer networks each comprise a convolution layer, a batch normalization layer, and a SiLU activation function connected in sequence, and their convolution kernel sizes are 1×1;
the number of convolution kernels in the first lightweight converter network is 48 for the No. 1 convolution layer network, 64 for the No. 2, and 48 for the No. 3;

the number of convolution kernels in the second lightweight converter network is 64 for the No. 1 convolution layer network, 80 for the No. 2, and 64 for the No. 3;

the number of convolution kernels in the third lightweight converter network is 80 for the No. 1 convolution layer network, 96 for the No. 2, and 80 for the No. 3;
the number of converter modules of the first lightweight converter network is 2; the number of converter modules of the second lightweight converter network is 4; the number of converter modules of the third lightweight converter network is 3;
the converter module includes a connected attention layer and feed forward network;
the attention layer of the converter module comprises a first layer normalization layer, a first linear layer, a vector stretching operation, an attention operation, a vector stretching operation and a second linear layer which are sequentially connected;
the feed-forward network of the converter module comprises a second layer normalization layer, a third linear layer, a SiLU activation function, a first Dropout random deactivation, a fourth linear layer, and a second Dropout random deactivation connected in sequence;

wherein the output dimension of the first linear layer is 96; the output dimension of the second linear layer is the same as the input vector dimension of the converter module; the output dimension of the third linear layer is twice the input vector dimension of the converter module; the output dimension of the fourth linear layer is the same as the input vector dimension of the converter module; and the deactivation rate of both the first and second Dropout random deactivations is 0.1.
2. The child evil dictionary picture identification method based on the lightweight visual converter according to claim 1, wherein the training step of the real-time child evil dictionary picture identification model based on the lightweight visual converter comprises the following steps:
randomly selecting batches from the training set, wherein each batch contains n pictures;
scaling the n pictures in each batch to 256×256, and then applying data enhancement to all pictures in the batch;
inputting the enhanced picture data into the untrained real-time child evil dictionary picture identification model based on the lightweight visual converter to obtain a set of prediction confidences;
calculating the loss between the prediction confidences and the labels of the n pictures, and optimizing the resulting loss value through the back-propagation algorithm until training is complete; wherein the loss function adopts the Focal Loss function;
The Focal Loss function is:
$$\mathrm{FL}(p, y) = -\,\alpha\, y\,(1-p)^{\gamma}\log(p) \;-\; (1-\alpha)\,(1-y)\,p^{\gamma}\log(1-p)$$

where $\mathrm{FL}$ denotes the classification loss function, $\alpha$ is a hyper-parameter that balances the imbalance between positive and negative samples in the loss function, $p$ denotes the classification prediction of the model, $\gamma$ is a hyper-parameter that adjusts the contribution of easy and difficult samples to the loss, and $y$ denotes the category label of the picture.
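As a concrete reference, a minimal PyTorch sketch of this binary focal loss follows. The function name and the default values of α and γ are our choices (the claim fixes neither here), and `p` is read as the predicted probability of the positive class.

```python
import torch

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss matching the claimed formula. alpha balances
    positive/negative samples; gamma down-weights easy samples. The
    default values are common choices, not taken from the patent."""
    p = p.clamp(1e-6, 1 - 1e-6)  # numerical safety for the logs
    pos = -alpha * y * (1 - p) ** gamma * torch.log(p)
    neg = -(1 - alpha) * (1 - y) * p ** gamma * torch.log(1 - p)
    return (pos + neg).mean()

# Example: a confident correct prediction contributes almost nothing,
# while a confident wrong one dominates the loss.
p = torch.tensor([0.95, 0.10])   # predicted probability of the positive class
y = torch.tensor([1.0, 1.0])     # ground-truth category labels
print(focal_loss(p, y))
```

With γ > 0, well-classified samples (large p when y = 1) are down-weighted, so training concentrates on the hard, easily confused pictures — exactly the easy/difficult balancing the claim attributes to γ.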
3. The child evil dictionary picture identification method based on a lightweight visual converter according to claim 1 or 2, wherein the multi-layer perceptron comprises a linear layer whose input and output dimensions are 320×8, a layer normalization layer and a ReLU activation function, connected in sequence, followed by a linear layer with input dimension 320×8 and output dimension 1×2, a layer normalization layer and a Sigmoid activation function, connected in sequence.
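A minimal sketch of this perceptron head follows, under our reading of the (partly garbled) claim text that the first linear stage preserves the 320×8 width; the variable name and input shape are hypothetical.

```python
import torch
import torch.nn as nn

# Two-stage MLP head: 320*8 -> 320*8 with LayerNorm + ReLU, then
# 320*8 -> 2 with LayerNorm + Sigmoid producing the two-class confidence.
mlp = nn.Sequential(
    nn.Linear(320 * 8, 320 * 8), nn.LayerNorm(320 * 8), nn.ReLU(),
    nn.Linear(320 * 8, 2), nn.LayerNorm(2), nn.Sigmoid(),
)
print(mlp(torch.randn(1, 320 * 8)).shape)  # torch.Size([1, 2])
```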
4. A child evil dictionary picture identification device based on a lightweight visual converter, characterized by comprising a processor, a memory, and a computer program stored in the memory; the computer program is executable by the processor to implement the child evil dictionary picture identification method based on a lightweight visual converter as set forth in any one of claims 1 to 3.
5. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein, when run, the computer program controls a device in which the computer-readable storage medium is located to perform the child evil dictionary picture identification method based on a lightweight visual converter as claimed in any one of claims 1 to 3.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410389340.1A (CN117975173B) | 2024-04-02 | 2024-04-02 | Child evil dictionary picture identification method and device based on light-weight visual converter |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN117975173A | 2024-05-03 |
| CN117975173B | 2024-06-21 |
Family ID: 90849893
Country Status (1)
| Country | Link |
|---|---|
| CN | CN117975173B (en) |
Patent Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110929603A * | 2019-11-09 | 2020-03-27 | 北京工业大学 | Weather image identification method based on lightweight convolutional neural network |
| US2024/0064318A1 * | 2021-02-25 | 2024-02-22 | Huawei Technologies Co., Ltd. | Apparatus and method for coding pictures using a convolutional neural network |
| CN114612477A * | 2022-03-03 | 2022-06-10 | 成都信息工程大学 | Lightweight image segmentation method, system, medium, terminal and application |
| CN115690542A * | 2022-11-03 | 2023-02-03 | 国网甘肃省电力公司 | Improved YOLOv5-based aerial insulator directional identification method |
| CN116309110A * | 2023-01-06 | 2023-06-23 | 南京莱斯电子设备有限公司 | Low-light image defogging method based on lightweight deep neural network |
| CN116563844A * | 2023-04-06 | 2023-08-08 | 武汉轻工大学 | Cherry tomato maturity detection method, device, equipment and storage medium |
| CN117456480A * | 2023-12-21 | 2024-01-26 | 华侨大学 | Light vehicle re-identification method based on multi-source information fusion |
| CN117689731A * | 2024-02-02 | 2024-03-12 | 陕西德创数字工业智能科技有限公司 | Lightweight new energy heavy-duty truck battery pack identification method based on improved YOLOv5 model |
Non-Patent Citations (2)
| Title |
|---|
| Li Ya et al.: "Face attribute recognition method based on multi-task learning", Computer Engineering (计算机工程), no. 03, 15 March 2020 (2020-03-15) * |
| Zheng Dong et al.: "Vehicle and pedestrian detection network based on lightweight SSD", Journal of Nanjing Normal University (Natural Science Edition) (南京师大学报(自然科学版)), no. 01, 20 March 2019 (2019-03-20) * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |