CN110610230A - Station caption detection method and device and readable storage medium - Google Patents
- Publication number
- CN110610230A (application CN201910698120.6A)
- Authority
- CN
- China
- Prior art keywords
- station caption
- loss
- neural network
- training
- station
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
Abstract
The invention discloses a station caption detection method, a device and a readable storage medium, wherein the method comprises the following steps: acquiring a station caption data set, and grouping the station caption data set to acquire a station caption training set; constructing a multi-loss fusion twin neural network, and training the constructed multi-loss fusion twin neural network based on the station caption training set to obtain a trained multi-loss fusion twin neural network; and detecting the station caption to be detected through the trained multi-loss fusion twin neural network. By constructing a twin neural network framework, the method largely eliminates the adverse effect of an insufficient number of samples on network training, and can better detect unknown new types of sensitive station captions.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a station caption detection method, a station caption detection device and a readable storage medium.
Background
With the development of science and technology and changes in information carriers, all kinds of information on the internet now reach every corner of society. The empowerment brought by the network gives people a greater voice, but it also gives those who spread harmful information an opportunity. Because the internet influences the vast number of netizens through information dissemination and public opinion guidance, maintaining network information security has become a primary task.
Existing station caption data covers many types but each type has very little data; the workload of manual labeling in the early stage is large, and using conventional object detection is therefore both difficult and costly.
Disclosure of Invention
The embodiment of the invention provides a station caption detection method, a station caption detection device and a readable storage medium.
In a first aspect, a first embodiment of the present invention provides a station caption detecting method, including the following steps:
acquiring a station caption data set, and grouping the station caption data set to acquire a station caption training set;
constructing a multi-loss fusion twin neural network, and training the constructed multi-loss fusion twin neural network based on the station caption training set to obtain a trained multi-loss fusion twin neural network;
and detecting the station caption to be detected through the trained multi-loss fusion twin neural network.
Optionally, the acquiring the station caption data set includes:
acquiring a specified amount of picture data from a public data set and frames intercepted by an existing video, and cutting the picture data into picture clips with set sizes;
carrying out random processing on the existing vector station caption, and adding the vector station caption after the random processing as a watermark to different positions of the picture clip to obtain a station caption picture set;
classifying the station caption picture set according to the type of the station caption to obtain a station caption positive sample;
acquiring a plurality of pure background pictures, adding other watermarks to a set number of the pure background pictures to obtain watermark background pictures, and combining the watermark background pictures and the residual number of the pure background pictures to form a station caption negative sample;
and forming a station caption data set according to the station caption positive sample and the station caption negative sample.
Optionally, the grouping the station caption data sets to obtain a station caption training set includes:
and randomly arranging the station caption data sets, and dividing the randomly arranged station caption data sets into a station caption training set and a station caption testing set according to a proportion.
Optionally, the constructing a multi-loss fused twin neural network includes:
constructing a residual error neural network comprising a set depth;
constructing two residual error neural sub-networks with the same structure according to the residual error neural network;
constructing a contrast loss layer, connecting the outputs of the two residual neural sub-networks to the input of the contrast loss layer.
Optionally, the training the constructed multi-loss fusion twin neural network based on the station caption training set to obtain the trained multi-loss fusion twin neural network includes:
dividing the station caption training set and the station caption testing set into two equal parts;
keeping the corresponding relation unchanged, and disordering the data of the station caption training set and the station caption testing set after the two equal parts;
inputting the disturbed station caption training set and the disturbed station caption testing set into the two residual error neural sub-networks in pairs respectively;
performing similarity contrast on input data through a contrast-loss layer to train the multi-loss fused twin neural network.
Optionally, the performing similarity comparison on the input data through a contrast loss layer to train the multi-loss fused twin neural network includes:
constructing a classification loss function of a cost function layer of a residual neural subnetwork;
performing data processing on the classification loss values output by the two residual error neural sub-networks, and adding the classification loss values and the output of the contrast loss layer to obtain a final loss value;
training the multi-loss fused twin neural network according to the final loss value.
Optionally, the training the multi-loss fused twin neural network according to the final loss value includes:
setting training parameters and training the multi-loss fused twin neural network according to the final loss value;
and stopping training after the trained multi-loss fusion twin neural network achieves a preset effect.
In a second aspect, a second embodiment of the present invention provides a station caption detecting apparatus, including:
the data processing module is used for acquiring a station caption data set and grouping the station caption data set to acquire a station caption training set;
the network training module is used for constructing a multi-loss fusion twin neural network and training the constructed multi-loss fusion twin neural network based on the station caption training set to obtain a trained multi-loss fusion twin neural network;
and the detection module is used for detecting the station caption to be detected through the trained multi-loss fusion twin neural network.
In a third aspect, a third embodiment of the present invention provides a computer-readable storage medium, on which an implementation program for information transfer is stored, and the program, when executed by a processor, implements the steps of the method of the first embodiment.
According to the embodiment of the invention, the influence of insufficient sample quantity on the training network is well eliminated by constructing the twin neural network framework, and unknown new types of sensitive station marks can be better detected.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of an embodiment of the method of the present invention;
FIG. 2 is a schematic diagram of a training model according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In a first aspect, a first embodiment of the present invention provides a station caption detection method. As shown in fig. 1, the method includes the following steps:
acquiring a station caption data set, and grouping the station caption data set to acquire a station caption training set;
constructing a multi-loss fusion twin neural network, and training the constructed multi-loss fusion twin neural network based on the station caption training set to obtain a trained multi-loss fusion twin neural network;
and detecting the station caption to be detected through the trained multi-loss fusion twin neural network.
According to the embodiment of the invention, the influence of insufficient sample quantity on the training network is well eliminated by constructing the twin neural network framework, and unknown new types of sensitive station marks can be better detected.
Optionally, in an optional embodiment of the present invention, the acquiring the station caption data set includes:
acquiring a specified amount of picture data from a public data set and frames intercepted by an existing video, and cutting the picture data into picture clips with set sizes;
specifically, the public data set may be obtained through programming, or may be obtained through other manners, which is not specifically limited in this application.
Carrying out random processing on the existing vector station caption, and adding the vector station caption after the random processing as a watermark to different positions of the picture clip to obtain a station caption picture set;
specifically, in this embodiment, the watermark is randomly subjected to size scaling, deformation, color enhancement and weakening, and is randomly added to different positions of the picture, the pictures with the same type of logo are used as data of one type of label, and each type of picture is about ten thousand, so as to obtain a logo picture set.
Classifying the station caption picture set according to the type of the station caption to obtain a station caption positive sample;
acquiring a plurality of pure background pictures, adding other watermarks to a set number of the pure background pictures to obtain watermark background pictures, and combining the watermark background pictures and the residual number of the pure background pictures to form a station caption negative sample;
specifically, in this embodiment, according to the number of station caption positive samples, for example, in this embodiment, ten thousand pure background pictures are further cut, and part of the pure background pictures is added with other watermarks, which are different from the station caption watermark, and the other watermarks are used as negative samples together with the remaining pure background pictures.
And forming a station caption data set according to the station caption positive sample and the station caption negative sample.
Optionally, in an optional embodiment of the present invention, the grouping the station caption data set to obtain a station caption training set includes:
and randomly arranging the station caption data sets, and dividing the randomly arranged station caption data sets into a station caption training set and a station caption testing set according to a proportion.
Specifically, the whole station caption data set formed in the above manner is randomly shuffled, and the shuffled station caption data set is then randomly divided into a training set and a test set according to a certain proportion.
Optionally, in an optional embodiment of the present invention, the constructing a multi-loss fused twin neural network includes:
constructing a residual error neural network comprising a set depth;
specifically, in this embodiment, a residual neural network with a predetermined depth is constructed, and the design of the residual neural network may include the following structure: the device comprises an ImageData data layer, a volume Convolution layer, a Batch Normal normalization layer, a ReLU activation function layer, a Pooling pool layer, an Eltwise addition layer, an InnerProduct full connection layer, a SoftmaxWithLoss cost function layer, an Accuracy precision layer and other structures, wherein the data layer, the initial Convolution layer, the normalization layer, the activation function layer, the maximum value pool layer, the Convolution layer and the normalization layer are sequentially connected, and a result is output to the addition layer. The initial convolutional layer is used for carrying out convolution on input sample data, a short-circuit path exists in the pooling layer and leads to the addition layer, the output of the addition layer leads to the activation function layer, and the output of the activation function layer is used as the input of the next convolutional layer and the next addition layer; and an average pooling layer and a full-link layer are connected behind the last activation function layer, and are sent to a cost function SoftmaxWithLoss layer and an Accuracy layer together with the label value output by the data layer, so that the final output of the network, namely the probability and the Accuracy of the station caption as the predicted station caption, is obtained.
In this embodiment, the residual neural network short-circuits the features of the front and rear layers and satisfies:

y = F(x, {W_i}) + x

where x and y represent the input and output vectors of the residual block, respectively, and F(x, {W_i}) is the residual mapping learned by the stacked layers with weights {W_i}.
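As an illustrative sketch of such a residual unit, a minimal PyTorch module is given below; the 3 × 3 kernels, equal channel counts and identity shortcut are simplifying assumptions and do not reproduce the exact layer configuration described above.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x, {W_i}) + x : two conv/BN layers form F, the shortcut carries x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))   # F(x, {W_i}): conv -> BN -> ReLU
        residual = self.bn2(self.conv2(residual))       # ... -> conv -> BN
        return self.relu(residual + x)                  # Eltwise addition: F(x) + x
```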
Constructing two residual error neural sub-networks with the same structure according to the residual error neural network;
the residual error neural network is transformed into two residual error neural sub-networks with the same structure, and the input data can be averagely divided into two parts to be respectively input into the two sub-networks in the specific implementation process.
Constructing a contrast loss layer, connecting the outputs of the two residual neural sub-networks to the input of the contrast loss layer.
In this embodiment, a contrast Loss layer is further added on top of the residual neural sub-networks; the outputs of the average pooling layers of the two residual neural sub-networks, together with the label values output by their data layers, are used as the input of the contrast Loss layer.
The contrast Loss layer function satisfies:

L1 = (1/2N) Σ_n [ y_n·d_n² + (1 − y_n)·max(margin − d_n, 0)² ]

where d_n = ||a_n − b_n||_2 represents the Euclidean distance between the features of the two samples in the nth pair, y is a label indicating whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they do not match), and margin is a set threshold. The output of this loss function expresses the matching degree of a pair of samples well: when y = 1 (i.e., the samples are similar), only the term y·d² remains, so if the Euclidean distance between the samples in the feature space is large, the loss increases because the current model is still poor. When y = 0 (i.e., the samples are not similar), the loss function reduces to Σ (1 − y)·max(margin − d, 0)², so if the Euclidean distance in the feature space is small, the loss value becomes large. By comparing the matching degree of pairs of samples in this way, the network can be trained well.
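A minimal PyTorch sketch of this contrast loss, written directly from the expression above, might look as follows; the 1/2N averaging and the default margin value are assumptions in line with common contrastive-loss implementations rather than values stated in this embodiment.

```python
import torch

def contrastive_loss(a, b, y, margin=1.0):
    """a, b: feature batches from the two sub-networks; y: float tensor,
    1 = matching pair, 0 = non-matching pair.
    L1 = (1/2N) * sum( y*d^2 + (1-y)*max(margin - d, 0)^2 ), with d = ||a_n - b_n||_2."""
    d = torch.norm(a - b, p=2, dim=1)                                  # Euclidean distance per pair
    loss = y * d.pow(2) + (1 - y) * torch.clamp(margin - d, min=0).pow(2)
    return loss.mean() / 2
```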
Optionally, in an optional embodiment of the present invention, the training the constructed multi-loss fusion twin neural network based on the station caption training set to obtain a trained multi-loss fusion twin neural network includes:
dividing the station caption training set and the station caption testing set into two equal parts;
specifically, on the basis that the training set and the test set are randomly divided into the training set and the test set according to a certain proportion, the training set and the test set are reclassified, the training set and the test set are respectively and averagely divided into two parts, and the picture position information and the tag values are written into the text documents, wherein two text documents in the training set and two text documents in the test set can ensure that each row corresponds to one another one by one, half of the tag values are the same, and the other half of the tag values are different.
Keeping the correspondence unchanged, the data of the halved station caption training set and station caption test set are shuffled; that is, the data in the two text documents of the training set are shuffled while their line-by-line correspondence is preserved.
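A possible sketch of building such line-aligned pair files is shown below; the file names, the "path label" line format, and the 50/50 split between matching and non-matching pairs are hypothetical choices consistent with the description above.

```python
import random

def build_pair_lists(samples, out_a="train_a.txt", out_b="train_b.txt"):
    """samples: list of (path, label) tuples, assumed to cover at least two classes.
    Writes two line-aligned text files in which roughly half of the corresponding
    lines share a label (matching pairs) and half do not (non-matching pairs)."""
    by_label = {}
    for path, label in samples:
        by_label.setdefault(label, []).append(path)
    labels = list(by_label)

    pairs = []
    for path, label in samples:
        if random.random() < 0.5:                         # matching pair: same label
            other_label = label
        else:                                             # non-matching pair: different label
            other_label = random.choice([l for l in labels if l != label])
        other_path = random.choice(by_label[other_label])
        pairs.append(((path, label), (other_path, other_label)))

    random.shuffle(pairs)                                 # shuffle while keeping line correspondence
    with open(out_a, "w") as fa, open(out_b, "w") as fb:
        for (pa, la), (pb, lb) in pairs:
            fa.write(f"{pa} {la}\n")
            fb.write(f"{pb} {lb}\n")
```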
Inputting the disturbed station caption training set and the disturbed station caption testing set into the two residual error neural sub-networks in pairs respectively;
performing similarity contrast on input data through a contrast-loss layer to train the multi-loss fused twin neural network.
Then, the two parts of data are input in pairs into the two sub-networks respectively, and through the contrast loss function L1 the inputs are mapped into a target space, where their similarity is compared using a simple distance (such as the Euclidean distance). In the training phase, the loss values of pairs of samples from the same class are minimized, and the loss values of pairs of samples from different classes are maximized.
Optionally, the performing similarity comparison on the input data through a contrast loss layer to train the multi-loss fused twin neural network includes:
constructing a classification loss function of a cost function layer of a residual neural subnetwork;
in the embodiment, classification loss is applied to each network in the twin network, and the network is trained to recognize the existing type-sensitive station caption. Loss of classification L2And cross entropy loss is adopted, so that the following conditions are met:
wherein,class predicted by the network for the nth sample, znIs the corresponding true category label.
And performing data processing on the classification loss values output by the two residual neural sub-networks, and adding the classification loss values and the output of the contrast loss layer to obtain a final loss value.
Specifically, as shown in fig. 2, the classification loss value L2 output by the cost function layer is multiplied by a set coefficient and then added to the contrast loss function value L1; the sum is used as the final loss value for training the network. For example, the final loss function may be:

L = L1 + α·L2
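The fused loss could be sketched as follows, reusing the contrastive_loss function from the earlier sketch; the value of the coefficient alpha and the summing of the two branches' cross-entropy losses are illustrative assumptions.

```python
import torch.nn.functional as F

def fused_loss(feat_a, feat_b, logits_a, logits_b, labels_a, labels_b, match, alpha=0.5, margin=1.0):
    """L = L1 + alpha * L2: L1 is the contrastive loss between the two sub-network
    features, L2 sums the cross-entropy classification losses of both branches."""
    l1 = contrastive_loss(feat_a, feat_b, match, margin)           # contrast (matching) loss
    l2 = F.cross_entropy(logits_a, labels_a) + F.cross_entropy(logits_b, labels_b)
    return l1 + alpha * l2
```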
training the multi-loss fused twin neural network according to the final loss value.
Optionally, the training the multi-loss fused twin neural network according to the final loss value includes:
setting training parameters and training the multi-loss fused twin neural network according to the final loss value;
and stopping training after the trained multi-loss fusion twin neural network achieves a preset effect.
Specifically, in this embodiment, hyperparameters such as the number of iterations are set, and training is stopped and the model is saved once the output of the network reaches the expected effect.
And detecting the unknown station caption by the saved model.
In view of the fact that the position of a sensitive station caption is relatively fixed, the method can directly extract the four corner areas of a video frame to detect sensitive station captions. Meanwhile, the station captions in the sensitive library may keep increasing over time, so a station caption comparison step is added: the sample to be detected obtains its recognition result through comparison with the newly added station captions. In addition, because the amount of sensitive station caption data is scarce, the invention simulates real scenes to generate training data and uses a contrast loss to train the network, reducing the model's demand for data volume. Fusing multiple losses, namely the contrast loss and the classification loss, improves the classification performance and the comparison performance of the detection model at the same time.
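One way such a comparison against the sensitive station caption library could be implemented is sketched below; the feature extractor interface, the distance threshold, and keeping one reference feature per known station caption are assumptions about deployment rather than details fixed by the text.

```python
import torch

@torch.no_grad()
def identify_logo(model, corner_patch, library, threshold=0.5):
    """corner_patch: preprocessed tensor cut from one corner of the video frame.
    library: dict mapping station-caption name -> reference feature vector.
    Returns the closest known caption if its Euclidean distance is below the
    threshold, otherwise None (no sensitive station caption detected)."""
    feat = model(corner_patch.unsqueeze(0)).squeeze(0)     # embed the patch with one sub-network
    best_name, best_dist = None, float("inf")
    for name, ref in library.items():
        dist = torch.norm(feat - ref, p=2).item()
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None
```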
After the technical scheme is adopted, the invention at least has the following beneficial effects:
1. The invention adopts multi-loss fusion: the classification loss value is multiplied by a specific coefficient and then added to the contrast loss function value to form the final loss value for training the network, so that unknown new types of sensitive station captions can be better detected.
2. By adopting the twin neural network framework, the invention largely eliminates the adverse effect of an insufficient number of samples on network training, and better suits the characteristics of station caption detection data sets, which have many sample types but few samples per type.
3. The invention embeds a convolutional neural network into the twin neural network framework and adds a ContrastiveLoss contrast loss layer to the convolutional neural network, so that the network learns a similarity measure from the data and uses the learned measure to compare and match new samples of unknown classes, achieving higher accuracy.
In a second aspect, a second embodiment of the present invention provides an implementation case of a station caption detection method, including:
1. Generate a data set by program: select more than one hundred brightly colored pictures from a public data set and from frames intercepted from existing videos, and cut them to a specific size. Take the vector station captions as watermarks, randomly apply size scaling, deformation, and color enhancement or weakening to them, and randomly add them to different positions of the pictures; pictures with the same station caption are treated as data of one class of labels, with about ten thousand pictures per class. Shuffle the whole data set and randomly divide it into a training set and a test set according to a certain proportion. Divide the training set and the test set each into two equal parts, and write the picture path information and label values into text documents. The two text documents of the training set and the two of the test set correspond line by line, with half of the label values the same and the other half different. Shuffle the data in the two text documents of the training set while keeping the correspondence unchanged.
2. The data are input into the data layers of the two sub-networks for preprocessing and data enhancement, which includes the following: 64 × 112 blocks are randomly cropped and randomly mirrored, the mean of all three channels is set to 127.5 to normalize the data to between 0 and 1, and the training batch size of the network is set to 64.
3. After the station caption data set is divided, the training set is used to train the network. The specific network structure is shown in fig. 2; the whole network is formed by stacking several residual modules. First, the two groups of data are input into the initial convolution layers of the two sub-networks respectively, which use 7 × 7 convolution filters with the stride set to 2 and a padding of 3 pixels on each side of the input image; that is, all four edges are expanded by 3 pixels, so that the width and the height are each expanded by 6 pixels and the feature map after the convolution operation can keep the original size. The feature vectors output by the initial convolution layer then pass in sequence through a BatchNorm normalization layer, a ReLU activation function layer, a max Pooling layer, a Convolution layer and a BatchNorm normalization layer, and the result is output to an Eltwise addition layer. A shortcut path from the max pooling layer leads to the addition layer, the output of the addition layer leads to an activation function layer, and the output of that activation function layer serves as the input of the next convolution layer and the next addition layer. An average pooling layer and a fully connected layer follow the last activation function layer. The outputs of the average pooling layers of the two sub-networks, together with the label values output by the data layers of the two sub-networks, are input into the contrast Loss layer for training; the output of the fully connected layer and the label values are sent to the SoftmaxWithLoss cost function layer and the Accuracy layer.
4. The classification loss value output by the cost function layer is multiplied by a specific coefficient and added to the contrast loss function value to obtain the final loss value, namely:
L=L1+αL2
the network is trained with this total loss.
5. Hyperparameters are set when training the network: the initial learning rate is set to 0.1, and a multi-step learning strategy is adopted so that the learning rate is updated at specified iteration counts. Two step values (stepvalue) are set to 20000 and 80000 respectively, and the decay factor is set to 0.1, i.e., the learning rate changes to 0.01 after 20000 iterations and to 0.001 after 80000 iterations. Using a changing learning rate during training effectively improves the accuracy of the model.
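Sketched in PyTorch, this learning-rate schedule might look as follows; the choice of SGD with momentum and stepping the scheduler once per iteration are assumptions, while the milestones, base learning rate, and decay factor follow the numbers above.

```python
import torch

def train_with_multistep_lr(model, training_batches, compute_fused_loss):
    """Train with the changing learning rate described above: base lr 0.1,
    decayed by a factor of 0.1 at iterations 20000 and 80000."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20000, 80000], gamma=0.1)
    for batch in training_batches:
        optimizer.zero_grad()
        loss = compute_fused_loss(model, batch)   # e.g. the fused L = L1 + alpha*L2 sketched earlier
        loss.backward()
        optimizer.step()
        scheduler.step()                          # lr: 0.1 -> 0.01 -> 0.001
```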
6. After a certain number of iterations, the network converges, the accuracy reaches its peak, and the model is saved.
7. Unknown station captions are detected using the trained model.
In a third aspect, a third embodiment of the present invention provides a station caption detecting apparatus, including:
the data processing module is used for acquiring a station caption data set and grouping the station caption data set to acquire a station caption training set;
the network training module is used for constructing a multi-loss fusion twin neural network and training the constructed multi-loss fusion twin neural network based on the station caption training set to obtain a trained multi-loss fusion twin neural network;
and the detection module is used for detecting the station caption to be detected through the trained multi-loss fusion twin neural network.
In a fourth aspect, a fourth embodiment of the present invention provides a computer-readable storage medium, on which an implementation program for information transfer is stored, and the program, when executed by a processor, implements the steps of the method of the first embodiment.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (9)
1. A station caption detection method is characterized by comprising the following steps:
acquiring a station caption data set, and grouping the station caption data set to acquire a station caption training set;
constructing a multi-loss fusion twin neural network, and training the constructed multi-loss fusion twin neural network based on the station caption training set to obtain a trained multi-loss fusion twin neural network;
and detecting the station caption to be detected through the trained multi-loss fusion twin neural network.
2. The method of claim 1, wherein said obtaining a station caption data set comprises:
acquiring a specified amount of picture data from a public data set and frames intercepted by an existing video, and cutting the picture data into picture clips with set sizes;
carrying out random processing on the existing vector station caption, and adding the vector station caption after the random processing as a watermark to different positions of the picture clip to obtain a station caption picture set;
classifying the station caption picture set according to the type of the station caption to obtain a station caption positive sample;
acquiring a plurality of pure background pictures, adding other watermarks to a set number of the pure background pictures to obtain watermark background pictures, and combining the watermark background pictures and the residual number of the pure background pictures to form a station caption negative sample;
and forming a station caption data set according to the station caption positive sample and the station caption negative sample.
3. The method of claim 2, wherein the grouping the station caption data set to obtain a station caption training set comprises:
and randomly arranging the station caption data sets, and dividing the randomly arranged station caption data sets into a station caption training set and a station caption testing set according to a proportion.
4. The method of claim 3, wherein constructing a multi-loss fused twin neural network comprises:
constructing a residual error neural network comprising a set depth;
constructing two residual error neural sub-networks with the same structure according to the residual error neural network;
constructing a contrast loss layer, connecting the outputs of the two residual neural sub-networks to the input of the contrast loss layer.
5. The method of claim 4, wherein training the constructed multi-loss fused twin neural network based on the station mark training set obtains a trained multi-loss fused twin neural network, comprising:
dividing the station caption training set and the station caption testing set into two equal parts;
keeping the corresponding relation unchanged, and disordering the data of the station caption training set and the station caption testing set after the two equal parts;
inputting the disturbed station caption training set and the disturbed station caption testing set into the two residual error neural sub-networks in pairs respectively;
performing similarity contrast on input data through a contrast-loss layer to train the multi-loss fused twin neural network.
6. The method of claim 5, wherein the similarity comparison of input data by a contrast-loss layer to train the multi-loss fused twin neural network comprises:
constructing a classification loss function of a cost function layer of a residual neural subnetwork;
performing data processing on the classification loss values output by the two residual error neural sub-networks, and adding the classification loss values and the output of the contrast loss layer to obtain a final loss value;
training the multi-loss fused twin neural network according to the final loss value.
7. The method of claim 6, wherein training the multi-loss fused twin neural network according to the final loss values comprises:
setting training parameters and training the multi-loss fused twin neural network according to the final loss value;
and stopping training after the trained multi-loss fusion twin neural network achieves a preset effect.
8. A station caption detecting apparatus, characterized in that the apparatus comprises:
the data processing module is used for acquiring a station caption data set and grouping the station caption data set to acquire a station caption training set;
the network training module is used for constructing a multi-loss fusion twin neural network and training the constructed multi-loss fusion twin neural network based on the station caption training set to obtain a trained multi-loss fusion twin neural network;
and the detection module is used for detecting the station caption to be detected through the trained multi-loss fusion twin neural network.
9. A computer-readable storage medium, characterized in that it has stored thereon a program for implementing the transfer of information, which program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910698120.6A CN110610230A (en) | 2019-07-31 | 2019-07-31 | Station caption detection method and device and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910698120.6A CN110610230A (en) | 2019-07-31 | 2019-07-31 | Station caption detection method and device and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110610230A true CN110610230A (en) | 2019-12-24 |
Family
ID=68890202
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910698120.6A Pending CN110610230A (en) | 2019-07-31 | 2019-07-31 | Station caption detection method and device and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110610230A (en) |
- 2019-07-31 CN CN201910698120.6A patent/CN110610230A/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103902987A (en) * | 2014-04-17 | 2014-07-02 | 福州大学 | Station caption identifying method based on convolutional network |
CN106488313A (en) * | 2016-10-31 | 2017-03-08 | Tcl集团股份有限公司 | A kind of TV station symbol recognition method and system |
CN108171209A (en) * | 2018-01-18 | 2018-06-15 | 中科视拓(北京)科技有限公司 | A kind of face age estimation method that metric learning is carried out based on convolutional neural networks |
CN108388927A (en) * | 2018-03-26 | 2018-08-10 | 西安电子科技大学 | Small sample polarization SAR terrain classification method based on the twin network of depth convolution |
CN108846358A (en) * | 2018-06-13 | 2018-11-20 | 浙江工业大学 | Target tracking method for feature fusion based on twin network |
CN109117744A (en) * | 2018-07-20 | 2019-01-01 | 杭州电子科技大学 | A kind of twin neural network training method for face verification |
CN109766921A (en) * | 2018-12-19 | 2019-05-17 | 合肥工业大学 | A kind of vibration data Fault Classification based on depth domain-adaptive |
Non-Patent Citations (3)
Title |
---|
KAI QIU ET AL.: "Siamese-ResNet: Implementing Loop Closure Detection based on Siamese Network", 《2018 IEEE INTELLIGENT VEHICLES SYMPOSIUM》 * |
YISEN WANG ET AL.: "Iterative Learning with Open-set Noisy Labels", 《ARXIV》 * |
LIU KUN: "Application of deep-learning-based station caption detection in online video review", 《无线互联科技》 (Wireless Internet Technology) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111178290A (en) * | 2019-12-31 | 2020-05-19 | 上海眼控科技股份有限公司 | Signature verification method and device |
CN111311475A (en) * | 2020-02-21 | 2020-06-19 | 广州腾讯科技有限公司 | Detection model training method and device, storage medium and computer equipment |
CN111860472A (en) * | 2020-09-24 | 2020-10-30 | 成都索贝数码科技股份有限公司 | Television station caption detection method, system, computer equipment and storage medium |
CN112975639A (en) * | 2021-05-19 | 2021-06-18 | 江苏中科云控智能工业装备有限公司 | Die casting polishing and deburring mechanism and method |
CN113222802A (en) * | 2021-05-27 | 2021-08-06 | 西安电子科技大学 | Digital image watermarking method based on anti-attack |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110610230A (en) | Station caption detection method and device and readable storage medium | |
CN110443143B (en) | Multi-branch convolutional neural network fused remote sensing image scene classification method | |
Thai et al. | Image classification using support vector machine and artificial neural network | |
CN111598182B (en) | Method, device, equipment and medium for training neural network and image recognition | |
CN110674874B (en) | Fine-grained image identification method based on target fine component detection | |
CN109063649B (en) | Pedestrian re-identification method based on twin pedestrian alignment residual error network | |
CN109063723A (en) | The Weakly supervised image, semantic dividing method of object common trait is excavated based on iteration | |
CN111178120B (en) | Pest image detection method based on crop identification cascading technology | |
CN113221987B (en) | Small sample target detection method based on cross attention mechanism | |
CN111709313B (en) | Pedestrian re-identification method based on local and channel combination characteristics | |
CN109919149B (en) | Object labeling method and related equipment based on object detection model | |
CN112052845A (en) | Image recognition method, device, equipment and storage medium | |
CN113807214B (en) | Small target face recognition method based on deit affiliated network knowledge distillation | |
CN111680705A (en) | MB-SSD method and MB-SSD feature extraction network suitable for target detection | |
CN115410059B (en) | Remote sensing image part supervision change detection method and device based on contrast loss | |
CN111339869A (en) | Face recognition method, face recognition device, computer readable storage medium and equipment | |
CN115731422A (en) | Training method, classification method and device of multi-label classification model | |
CN115661777A (en) | Semantic-combined foggy road target detection algorithm | |
CN112464775A (en) | Video target re-identification method based on multi-branch network | |
CN116563410A (en) | Electrical equipment electric spark image generation method based on two-stage generation countermeasure network | |
CN111027472A (en) | Video identification method based on fusion of video optical flow and image space feature weight | |
Cyganek | An analysis of the road signs classification based on the higher-order singular value decomposition of the deformable pattern tensors | |
CN111310516A (en) | Behavior identification method and device | |
CN113221991A (en) | Method for re-labeling data set by utilizing deep learning | |
CN109635647A (en) | A kind of clustering method based on more picture plurality of human faces under constraint condition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191224