CN111859237A - Network content auditing method and device, electronic equipment and storage medium - Google Patents

Network content auditing method and device, electronic equipment and storage medium

Info

Publication number
CN111859237A
Authority
CN
China
Prior art keywords
content
webpage
training
image
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010717301.1A
Other languages
Chinese (zh)
Inventor
张宁
蔡琳
李玉惠
刘瑞
傅强
阿曼太
梁彧
马寒军
田野
王杰
杨满智
金红
陈晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Eversec Beijing Technology Co Ltd
Original Assignee
Eversec Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Eversec Beijing Technology Co Ltd
Priority to CN202010717301.1A
Publication of CN111859237A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the disclosure provide a network content auditing method and device, electronic equipment, and a storage medium. The method comprises: acquiring a webpage image to be audited; inputting the webpage image into a pre-trained content auditing model; and acquiring first result information and second result information output by the content auditing model, wherein the first result information indicates the type of violation that the content displayed by the webpage image is suspected of, and the second result information indicates the probability that the content displayed by the webpage image belongs to the type indicated by the first result information. Suspected violating content in webpages can thereby be identified, and identification efficiency can be improved.

Description

Network content auditing method and device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of machine learning, in particular to a network content auditing method, a device, electronic equipment and a storage medium.
Background
Networks are now highly developed, and the quality of online information is uneven. Online gambling and online lottery information is mixed in with other information, and users may be led unknowingly to gambling or lottery webpages by phishing websites.
The current technique for detecting gambling and lottery information in webpages and mobile application interfaces is keyword matching based on optical character recognition (OCR): the characters in the input picture are first recognized, and the recognized characters are then matched against keywords. Because character recognition is performed on the entire input picture, matching accuracy can be very high as long as the characters are recognized correctly, but recognition is slow and overall efficiency is low.
Disclosure of Invention
In view of this, the present disclosure provides a method and an apparatus for auditing network content, an electronic device, and a storage medium, so as to improve the identification efficiency.
Additional features and advantages of the disclosed embodiments will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosed embodiments.
In a first aspect of the present disclosure, an embodiment of the present disclosure provides a method for auditing network content, including:
acquiring a webpage image to be audited;
inputting the webpage image into a pre-trained content auditing model, and acquiring first result information and second result information output by the content auditing model, wherein the first result information is used for indicating the type of suspected violation of content displayed by the webpage image, and the second result information is used for indicating the probability that the content displayed by the webpage image belongs to the type displayed by the first result information.
In an embodiment, the content audit model is obtained by training as follows:
acquiring a training sample set, wherein the training sample comprises a webpage image, first labeling information used for representing the type of suspected violation of content displayed by the webpage image, and second labeling information used for representing the probability that the content displayed by the webpage image belongs to the type corresponding to the first labeling information;
determining an initialized content auditing model, wherein the initialized content auditing model comprises a first target layer and a second target layer, the first target layer is used for outputting the type of content suspected violation displayed by a webpage image, and the second target layer is used for representing the probability of the content suspected violation displayed by the webpage image;
and by utilizing a machine learning method, taking the webpage images in the training samples in the training sample set as the input of the initialized content auditing model, taking the first labeling information and the second labeling information corresponding to the input webpage images as the expected output of the initialized content auditing model, and training to obtain the content auditing model.
In one embodiment, the initialized content audit model comprises a convolutional neural network model.
In one embodiment, after obtaining the training sample set, the method further includes: counting training samples contained in the training sample set according to the violation categories of the webpage images in the training samples; and determining at least one class with a small number of training samples according to the statistical result, and performing oversampling processing on the training samples of the at least one class to balance the number of the training samples of each class in the training sample set.
In one embodiment, after obtaining the training sample set, the method further includes:
counting training samples contained in the training sample set according to the image sizes in the training samples;
and determining at least one size with less training samples according to the statistical result, and performing oversampling processing on the training samples with the at least one size so as to balance the number of the training samples with each size in the training sample set.
In one embodiment, after obtaining the training sample set, the method further includes:
determining the number of difficult samples and the number of easy samples of training samples contained in the training sample set;
and performing oversampling processing on a small number of difficult samples or easy samples so as to balance the number of the difficult samples and the number of the easy samples in the training sample set.
In one embodiment, obtaining the training sample set includes:
and crawling a plurality of webpage pictures from a website, performing duplicate removal operation on the webpage pictures, and marking the pictures reserved after the duplicate removal operation to be used as training samples so as to form the training sample set.
In an embodiment, the performing the deduplication operation on the plurality of web page pictures includes: and carrying out duplicate removal operation on the plurality of webpage pictures by adopting a difference value hash algorithm.
In a second aspect of the present disclosure, an embodiment of the present disclosure further provides a network content auditing apparatus, including:
the image acquisition unit is used for acquiring a webpage image to be audited;
and the image auditing unit is used for inputting the webpage image into a pre-trained content auditing model and acquiring first result information and second result information output by the content auditing model, wherein the first result information is used for indicating the type of the violation suspected by the content displayed by the webpage image, and the second result information is used for indicating the probability that the content displayed by the webpage image belongs to the type displayed by the first result information.
In an embodiment, the content audit model in the image audit unit is obtained by training through the following modules:
the sample acquisition module is used for acquiring a training sample set, wherein the training sample comprises a webpage image, first labeling information used for representing the type of suspected violation of content displayed by the webpage image, and second labeling information used for representing the probability that the content displayed by the webpage image belongs to the type corresponding to the first labeling information;
the model determining module is used for determining an initialized content auditing model, wherein the initialized content auditing model comprises a first target layer and a second target layer, the first target layer is used for outputting the type of the content suspected violation displayed by the webpage image, and the second target layer is used for representing the probability of the content suspected violation displayed by the webpage image;
and the model training module is used for taking the webpage images in the training samples in the training sample set as the input of the initialized content auditing model by using a machine learning method, taking the first labeling information and the second labeling information corresponding to the input webpage images as the expected output of the initialized content auditing model, and training to obtain the content auditing model.
In one embodiment, the initialized content audit model comprises a convolutional neural network model.
In one embodiment, the sample acquisition module is further configured to, after acquiring the training sample set:
counting training samples contained in the training sample set according to the violation categories of the webpage images in the training samples;
and determining at least one class with a small number of training samples according to the statistical result, and performing oversampling processing on the training samples of the at least one class to balance the number of the training samples of each class in the training sample set.
In one embodiment, the sample acquisition module is further configured to, after acquiring the training sample set:
counting training samples contained in the training sample set according to the image sizes in the training samples;
and determining at least one size with less training samples according to the statistical result, and performing oversampling processing on the training samples with the at least one size so as to balance the number of the training samples with each size in the training sample set.
In one embodiment, the sample acquisition module is further configured to, after acquiring the training sample set:
determining the number of difficult samples and the number of easy samples of training samples contained in the training sample set;
and performing oversampling processing on a small number of difficult samples or easy samples so as to balance the number of the difficult samples and the number of the easy samples in the training sample set.
In one embodiment, the sample acquisition module is configured to: and crawling a plurality of webpage pictures from a website, performing duplicate removal operation on the webpage pictures, and marking the pictures reserved after the duplicate removal operation to be used as training samples so as to form the training sample set.
In an embodiment, when performing the deduplication operation on the plurality of webpage pictures, the sample acquisition module is configured to perform the deduplication operation on the plurality of webpage pictures by adopting a difference value hash algorithm.
In a third aspect of the disclosure, an electronic device is provided. The electronic device includes: a processor; and a memory for storing executable instructions that, when executed by the processor, cause the electronic device to perform the method of the first aspect.
In a fourth aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the method in the first aspect.
The technical scheme provided by the embodiment of the disclosure has the beneficial technical effects that:
In the embodiments of the disclosure, the webpage image to be audited is input into the pre-trained content auditing model, and the first result information and the second result information output by the content auditing model are obtained, wherein the first result information indicates the type of violation that the content displayed by the webpage image is suspected of, and the second result information indicates the probability that the content displayed by the webpage image belongs to the type indicated by the first result information. In this way, suspected violating content in webpages can be identified, and identification efficiency can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed to be used in the description of the embodiments of the present disclosure will be briefly described below, and it is obvious that the drawings in the following description are only a part of the embodiments of the present disclosure, and for those skilled in the art, other drawings can be obtained according to the contents of the embodiments of the present disclosure and the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a network content auditing method provided according to an embodiment of the present disclosure;
Fig. 2 is a schematic flow chart of a method for training a content auditing model according to an embodiment of the present disclosure;
Fig. 3 is a schematic flow chart of another method for training a content auditing model according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of a network content auditing apparatus according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of a training apparatus for a content auditing model according to an embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.
Detailed Description
In order to make the technical problems solved, technical solutions adopted and technical effects achieved by the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be described in further detail below with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments, but not all embodiments, of the embodiments of the present disclosure. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present disclosure, belong to the protection scope of the embodiments of the present disclosure.
It should be noted that the terms "system" and "network" are often used interchangeably in the embodiments of the present disclosure. Reference to "and/or" in embodiments of the present disclosure is meant to include any and all combinations of one or more of the associated listed items. The terms "first", "second", and the like in the description and claims of the present disclosure and in the drawings are used for distinguishing between different objects and not for limiting a particular order.
It should also be noted that, in the embodiments of the present disclosure, each of the following embodiments may be executed alone, or may be executed in combination with each other, and the embodiments of the present disclosure are not limited specifically.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The technical solutions of the embodiments of the present disclosure are further described by the following detailed description in conjunction with the accompanying drawings.
Fig. 1 shows a schematic flow chart of a network content auditing method according to an embodiment of the present disclosure. The embodiment is applicable to auditing whether a webpage is suspected of a violation, and the method may be executed by a network content auditing apparatus configured in an electronic device. As shown in Fig. 1, the network content auditing method of this embodiment includes:
In step S110, a webpage image to be audited is acquired.
The source of the webpage images to be audited can be set as required; for example, webpage images may be crawled in batches from the target website to be audited using crawler technology. Further, to reduce the amount of computation, a deduplication operation can be performed on the large number of crawled webpage images, for example using a difference value hash algorithm, so as to reduce the number of webpage images to be audited and thereby improve auditing efficiency.
In step S120, the web page image is input to a pre-trained content audit model, and first result information and second result information output by the content audit model are obtained, where the first result information is used to indicate a type of violation suspected by content displayed by the web page image, and the second result information is used to indicate a probability that content displayed by the web page image belongs to the type displayed by the first result information.
In this embodiment, the content auditing model is used to audit whether webpage content is suspected of a violation. For example, it may audit whether the webpage content is suspected of involving gambling, that is, whether the content includes gambling-related text and picture information such as playing cards, chess cards, baccarat, cattle and slot machines.
As another example, it may audit whether the webpage content is suspected of involving lotteries, that is, whether the content includes lottery-related text and picture information such as lottery, six-color lottery, real-time lottery, fast three, pk10 and 11-select-5.
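As a minimal sketch of how the two outputs of step S120 relate, the snippet below turns a vector of raw model scores into a (type, probability) pair. The use of a softmax, the `CATEGORIES` list and its order, and the function names are assumptions for illustration; the disclosure does not specify how the model computes its outputs.

```python
import math

# Hypothetical category names and order, for illustration only.
CATEGORIES = ["lottery", "six-color lottery", "baccarat", "background"]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def audit_result(scores, categories=CATEGORIES):
    """Return (first result information, second result information).

    The first value is the most likely category; the second is the
    probability assigned to that category.
    """
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best], probs[best]

# Raw scores here are made up; a real model would produce them from the image.
violation_type, probability = audit_result([0.2, 3.1, 0.4, 1.0])
print(violation_type, round(probability, 3))
```

Returning the probability alongside the type lets a caller apply its own threshold before flagging a page.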
This embodiment does not limit the training method used for the content auditing model. As an example, Fig. 2 is a schematic flow chart of a training method for the content auditing model; as shown in Fig. 2, the content auditing model is obtained by training as follows:
in step S210, a training sample set is obtained, where the training sample includes a web page image, first annotation information used for indicating a type of a suspected violation of content displayed by the web page image, and second annotation information used for indicating a probability that the content displayed by the web page image belongs to a type corresponding to the first annotation information.
In order to avoid the repeated samples contained in the training sample set, when the training sample set is obtained, a plurality of webpage pictures can be crawled from a website, the plurality of webpage pictures are subjected to duplication elimination operation, and the pictures reserved after the duplication elimination operation are labeled and then serve as training samples to form the training sample set.
When the duplicate removal operation is performed, the multiple web page pictures can be subjected to the duplicate removal operation by adopting a difference value hash algorithm, so that the situation that the training sample set contains very similar web page pictures is avoided.
In step S220, an initialized content audit model is determined, where the initialized content audit model includes a first target layer for outputting a type of content suspected violation for web page image display and a second target layer for indicating a probability of content suspected violation for web page image display.
The initialized content auditing model may be of various types; for example, a convolutional neural network model may be adopted.
In step S230, using a machine learning method, taking the web page image in the training sample set as an input of the initialized content audit model, taking the first label information and the second label information corresponding to the input web page image as an expected output of the initialized content audit model, and training to obtain the content audit model.
In order to balance various training samples in the training sample set and improve the accuracy of the content auditing model, oversampling processing can be performed.
On one hand, after a training sample set is obtained, statistics can be carried out on training samples contained in the training sample set according to the violation categories of webpage images in the training samples; and determining at least one class with a small number of training samples according to the statistical result, and performing oversampling processing on the training samples of the at least one class to balance the number of the training samples of each class in the training sample set.
On the other hand, after a training sample set is obtained, statistics can be performed on training samples contained in the training sample set according to the sizes of images in the training samples; and determining at least one size with less training samples according to the statistical result, and performing oversampling processing on the training samples with the at least one size so as to balance the number of the training samples with each size in the training sample set.
In yet another aspect, after the training sample set is obtained, the number of difficult samples and the number of easy samples among the training samples in the set can be determined; oversampling is then performed on whichever of the difficult samples or easy samples is fewer, so as to balance the number of difficult samples and easy samples in the training sample set.
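The three balancing options above share one pattern: group the samples by some attribute, then add copies to the smaller groups. The following is a minimal sketch of that pattern using random duplication with replacement. Random duplication, the `oversample` function, and the dictionary fields are assumptions for illustration; a later embodiment in this disclosure oversamples by copying labeled regions into new images instead.

```python
import random
from collections import defaultdict

def oversample(samples, key, seed=0):
    """Duplicate samples at random until every group reaches the largest group's size.

    `key` maps a sample to its group, e.g. its violation category,
    its image size bucket, or whether it is a difficult sample.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Top up under-represented groups with randomly chosen copies.
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced

# Hypothetical samples: three of category 0 and one of category 1.
samples = [
    {"label": 0, "hard": False},
    {"label": 0, "hard": False},
    {"label": 0, "hard": True},
    {"label": 1, "hard": False},
]
balanced = oversample(samples, key=lambda s: s["label"])
```

Passing a different `key` (for example `lambda s: s["hard"]`) balances difficult and easy samples with the same function.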
In this embodiment, a webpage image to be checked is input to a pre-trained content checking model, and first result information and second result information output by the content checking model are obtained, where the first result information is used to indicate a type of a suspected violation of a content displayed by the webpage image, and the second result information is used to indicate a probability that the content displayed by the webpage image belongs to the type displayed by the first result information, so that the content suspected of the violation of the webpage can be identified, and the identification efficiency can be improved.
Fig. 3 is a schematic flow chart of another content audit model training method provided in the embodiment of the present disclosure, and the embodiment is based on the foregoing embodiment and performs improved optimization. As shown in fig. 3, the method for auditing network content according to this embodiment includes image crawling, image storage, image deduplication using a difference hash algorithm, image tagging, image data oversampling, convolutional neural network structure building, convolutional neural network model parameter initialization, convolutional neural network forward propagation, and convolutional neural network backward propagation, and specifically includes:
In step S301, image crawling is performed: a large number of webpage interface pictures are acquired from the network using a crawler algorithm.
In step S302, the pictures are stored, either on the local computer or on an image bed (image hosting) server.
In step S303, duplicate images are removed using a difference value hash algorithm.
Illustratively, image deduplication with the difference value hash algorithm can be implemented as follows:
First, a color image is input and converted to grayscale, and then rescaled to a width and height of 80. An empty character string is created to store characters, and an empty image fingerprint library list is created to store character strings. The whole image is traversed: within each row, a character 1 is appended to the string when a pixel value is greater than the next pixel value, and a character 0 is appended when it is less than or equal to the next pixel value. The string obtained after the traversal is compared with the strings in the image fingerprint library list. If it is already present, the image already exists and the string is not added to the list; if it is not present, the string is added to the list. This removes images that are essentially identical to images already in the library. Differentiated duplicate images are then further removed; these are pairs of images with essentially the same content but possibly different resolution or an overall offset in pixel values. To remove them, the number of characters 1 in the obtained string is counted and compared with a set difference threshold: if the count is greater than or equal to the difference threshold, the image is a duplicate and needs to be removed; if the count is less than the difference threshold, the image is not a duplicate and does not need to be removed.
Removing duplicate images with the difference value hash algorithm saves a large amount of repetitive labor and greatly helps improve the accuracy, generalization capability and robustness of the algorithm.
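The fingerprinting and exact-match stage described above can be sketched as follows. This is an illustrative sketch under stated assumptions: images are plain 2-D lists of grayscale values that have already been converted and rescaled, a set stands in for the fingerprint library list, and the function names are hypothetical. The second, threshold-based stage is not covered.

```python
def dhash_bits(gray):
    """Build the fingerprint string for a 2-D list of grayscale pixel values.

    For each pair of horizontally adjacent pixels, append "1" when the
    left pixel is greater than the right one, otherwise "0".
    """
    bits = []
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits.append("1" if left > right else "0")
    return "".join(bits)

def deduplicate(images):
    """Return the names of images whose fingerprint has not been seen before.

    `images` is a list of (name, grayscale pixels) pairs.
    """
    fingerprints = set()  # plays the role of the image fingerprint library
    kept = []
    for name, gray in images:
        fingerprint = dhash_bits(gray)
        if fingerprint in fingerprints:
            continue  # already in the library: treat as a duplicate
        fingerprints.add(fingerprint)
        kept.append(name)
    return kept

# Tiny made-up images; "a_brighter.png" is "a.png" with every pixel raised by 10.
images = [
    ("a.png", [[10, 20, 5], [7, 7, 9]]),
    ("a_brighter.png", [[20, 30, 15], [17, 17, 19]]),
    ("b.png", [[5, 4, 3], [1, 2, 3]]),
]
kept = deduplicate(images)
```

Because the fingerprint depends only on comparisons between neighboring pixels, a uniform brightness offset leaves it unchanged, which is why the brighter copy is dropped.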
In step S304, a picture is labeled.
The pictures remaining after deduplication are labeled manually. The labeled content mainly comprises gambling-related text and picture information, covering six types including playing cards, chess cards, baccarat, cattle and slot machines, and lottery-related text and picture information, covering six types: lottery, six-color lottery, real-time lottery, fast three, pk10 and 11-select-5. Each of the 12 categories corresponds to a label; for example, the label for lottery is 0, the label for six-color lottery is 1, and so on up to label 11. Finally, label 12 corresponds to the background, which is any part that contains none of the 12 categories of information; the background label does not need to be labeled manually.
In step S305, picture data oversampling is performed.
After manual labeling is completed, an annotation file is generated for each labeled picture, named after the picture. Counting the annotation files reveals three problems: the number of samples per category is extremely unbalanced, the numbers of difficult and easy samples are extremely unbalanced, and the numbers of large and small targets are extremely unbalanced. To address these three problems, a strategy of oversampling the labeled data is adopted.
An oversampled data sample is obtained by cutting the labeled sample out of the original image according to the size of its labeling frame and copying it into a new image that contains no content related to the target information to be detected. Let the width and height of the labeling frame be w and h and its area be S, as follows:
S=w×h
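The crop-and-copy step described above can be sketched as follows. This is an illustrative sketch only: images are modeled as plain lists of pixel rows, the box layout `(x, y, w, h)` is an assumption, and the helper names `box_area`, `crop`, and `paste` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]          # image as rows of pixel values
Box = Tuple[int, int, int, int]  # assumed layout: (x, y, w, h)


def box_area(box: Box) -> int:
    """Area S = w * h of a labeling box."""
    _, _, w, h = box
    return w * h


def crop(image: Image, box: Box) -> Image:
    """Cut the labeled region out of the original image."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def paste(background: Image, patch: Image, x: int, y: int) -> Tuple[Image, Box]:
    """Copy a patch onto a copy of a target-free background image.

    Returns the new image and the labeling box of the pasted sample.
    """
    h, w = len(patch), len(patch[0])
    if x < 0 or y < 0 or y + h > len(background) or x + w > len(background[0]):
        raise ValueError("patch does not fit inside the background")
    out = [row[:] for row in background]  # leave the background untouched
    for dy, patch_row in enumerate(patch):
        out[y + dy][x:x + w] = patch_row
    return out, (x, y, w, h)
```

The pasted sample keeps its original width and height, so the area S of its labeling box is unchanged on the new image.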
The strategy first counts the number of labeled samples in each category and the distribution of labeling-box sizes within each category. The category with the largest number of samples is taken as the reference M. Each category is divided into 4 classes according to the area of the labeling box, corresponding to S1, S2, S3, and S4, with the value ranges given below. The numbers of samples in classes S1, S2, S3, and S4 are
Figure BDA0002598682660000111
where i corresponds to the categories of label 0 through label 11:
Figure BDA0002598682660000112
As verified by a large number of experiments, the sample data corresponding to S1, S2, S3, and S4 are oversampled in the ratio 3:2:1:1 to obtain the new number of samples NEWi, where i corresponds to the categories of label 0 through label 11. The oversampling formula is as follows:
Figure BDA0002598682660000113
In this way, the method balances the number of small samples, the number of samples in each category, and the numbers of difficult and easy samples.
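The formulas of this embodiment are given only as figures and are not reproduced in the text, so the following sketch does not implement them. It shows one possible reading, under stated assumptions, of how a 3:2:1:1 ratio could be applied: the shortfall of a category relative to the reference M is split across the four area classes in that ratio. The function name `split_deficit` and the handling of the rounding remainder are hypothetical.

```python
from typing import List, Sequence


def split_deficit(class_count: int, reference: int,
                  ratio: Sequence[int] = (3, 2, 1, 1)) -> List[int]:
    """Split the shortfall (reference - class_count) across area classes.

    Returns how many extra samples to generate for S1, S2, S3, S4 in turn.
    Assumption: the rounding remainder goes to the first area class.
    """
    deficit = max(reference - class_count, 0)
    total = sum(ratio)
    shares = [deficit * r // total for r in ratio]
    shares[0] += deficit - sum(shares)
    return shares
```

For example, a category with 30 samples and a reference of M = 100 has a shortfall of 70, which this sketch splits into 30, 20, 10, and 10 additional samples.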
In step S306, a content audit model of the convolutional neural network structure is built.
Taking the auditing of gambling content and lottery content as an example, the gambling and lottery content auditing system includes an image hosting server, an image encryption and decryption module, a data processing module, and a result output module.
The image hosting server is mainly used to store the images to be detected. Each image stored on the image hosting server is given a corresponding Uniform Resource Locator (URL), which serves as the address from which the image is obtained.
The image encryption and decryption module is mainly used to apply Base64 encoding and decoding to images during transmission; Base64 is a method of representing binary data using 64 printable characters. To protect the privacy of the image, the image is Base64-encoded before it is transmitted from the image hosting server to the data processing module and Base64-decoded after it arrives at the data processing module, which restores the image data.
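The Base64 round trip between the image hosting server and the data processing module can be sketched with the Python standard library. The function names `encode_image` and `decode_image` are hypothetical. Note that Base64 is a keyless, reversible encoding, so any confidentiality in transit would have to come from a separate mechanism that this sketch does not show.

```python
import base64


def encode_image(raw: bytes) -> str:
    """Represent raw image bytes as Base64 text before transmission."""
    return base64.b64encode(raw).decode("ascii")


def decode_image(payload: str) -> bytes:
    """Restore the raw image bytes on the receiving side.

    validate=True rejects characters outside the Base64 alphabet.
    """
    return base64.b64decode(payload, validate=True)
```

Decoding the encoded payload returns exactly the original bytes, which is what allows the data processing module to restore the image data.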
The data processing module feeds the decoded image data into the trained convolutional neural network model for inference. The inference result is the coordinate positions and confidences of the detected gambling-related and lottery-related targets, that is, playing cards, chess and card games, baccarat, cattle, tiger machines (slot machines), lottery, six-lottery, real-time lottery, fast three, pk10, and 11-choose-5.
The result output module performs summary processing on the detection results to judge whether the input image contains gambling contents and lottery contents.
The data processing module may include the following processes:
The program first loads a trained content audit model for identifying gambling and lottery content and then waits for picture input. If a picture is input, its width and height are scaled to predetermined sizes (e.g., 800 pixels and 1400 pixels, respectively); if no picture is input, the program continues to wait.
Gambling and lottery content detection is performed on the scaled picture, and the position coordinates and confidence of any gambling or lottery content are output according to the detection result.
The confidence is compared with a set detection threshold. If the confidence is greater than the detection threshold, the gambling and lottery content judgment is performed; if the confidence is less than the detection threshold, the program returns to waiting for picture input.
For example, for the gambling judgment, the character string 'gambling' is returned if the picture contains gambling content, and an empty character string is returned otherwise. For the lottery judgment, the character string 'lottery' is returned if the picture contains lottery content, and an empty character string is returned otherwise. Finally, the character string results are summarized and output.
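The thresholding and summarizing steps above can be sketched as follows. This is a minimal sketch under assumptions: the `Detection` type, the label names, the default threshold of 0.5, and joining the two strings with a comma are all hypothetical choices, and the gambling label set is an illustrative subset rather than the full list of categories.

```python
from typing import Iterable, NamedTuple, Tuple

# Hypothetical label names; the gambling set is an illustrative subset.
GAMBLING_LABELS = frozenset({"playing_cards", "chess_cards", "baccarat", "slot_machine"})
LOTTERY_LABELS = frozenset({"lottery", "six_lottery", "real_time_lottery",
                            "fast_three", "pk10", "11_choose_5"})


class Detection(NamedTuple):
    label: str
    confidence: float
    box: Tuple[int, int, int, int]  # assumed layout: (x, y, w, h)


def summarize(detections: Iterable[Detection], threshold: float = 0.5) -> str:
    """Keep detections above the threshold and summarize them as strings."""
    confident = [d for d in detections if d.confidence > threshold]
    gambling = "gambling" if any(d.label in GAMBLING_LABELS for d in confident) else ""
    lottery = "lottery" if any(d.label in LOTTERY_LABELS for d in confident) else ""
    # Assumed summary format: non-empty strings joined by a comma.
    return ",".join(s for s in (gambling, lottery) if s)
```

A picture with no detection above the threshold yields an empty string, which corresponds to returning to the wait for the next picture.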
In step S307, the content audit model parameters are initialized.
In step S308, content audit model training is performed.
For example, the content audit model training process may include forward propagation and backward propagation of the convolutional neural network. The prepared picture data and annotation files are input to the convolutional neural network, and the parameters of the entire convolutional neural network model are initialized. The convolutional neural network has a hierarchical structure: a series of convolutional layers, activation layers, pooling layers, and normalization layers are arranged and combined, and finally connected to a fully connected layer and a loss layer. The loss layer computes the difference between the predicted value and the true value; to minimize this difference, the parameters of the entire convolutional neural network model are updated by the back-propagation algorithm. Forward propagation and backward propagation are iterated N times to obtain the optimal parameters, which yields the trained content audit model.
This embodiment takes as an example auditing whether the content displayed by a web page image contains gambling content and lottery content, and discloses a training method for a content audit model with a convolutional neural network structure.
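The structure of the training loop (initialize parameters, forward pass, loss, backward pass, update, repeat N times) can be illustrated on a deliberately tiny stand-in. The sketch below fits a single linear unit with a mean squared error loss; it is not a convolutional neural network and not the model of this embodiment, and the function name `train`, the learning rate, and the iteration count are assumptions.

```python
from typing import List, Tuple


def train(samples: List[Tuple[float, float]],
          n_iterations: int = 1000, lr: float = 0.05) -> Tuple[float, float]:
    """Fit y = w * x + b by repeated forward and backward passes."""
    w, b = 0.0, 0.0  # parameter initialization
    n = len(samples)
    for _ in range(n_iterations):
        # Forward propagation: compute predictions from current parameters.
        preds = [w * x + b for x, _ in samples]
        # Backward propagation: gradients of the mean squared error
        # between predicted and true values.
        grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, samples)) / n
        grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, samples)) / n
        # Parameter update that reduces the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

On data generated from y = 2x + 1, the loop drives w toward 2 and b toward 1; a real content audit model would replace the single unit with the layered network and use a framework's automatic differentiation.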
As an implementation of the methods shown in the above figures, the present application provides an embodiment of a network content auditing apparatus, and fig. 4 illustrates a schematic structural diagram of a network content auditing apparatus provided in this embodiment, where the embodiment of the apparatus corresponds to the method embodiments shown in fig. 1 to fig. 3, and the apparatus may be specifically applied to various electronic devices. As shown in fig. 4, the network content auditing apparatus according to the present embodiment includes an image obtaining unit 410 and an image auditing unit 420.
The image obtaining unit 410 is configured to obtain a web page image to be reviewed.
The image auditing unit 420 is configured to input the web page image into a pre-trained content audit model and acquire first result information and second result information output by the content audit model, where the first result information indicates the type of suspected violation of the content displayed by the web page image, and the second result information indicates the probability that the content displayed by the web page image belongs to the type indicated by the first result information.
The network content auditing device provided by the embodiment can execute the network content auditing method provided by the embodiment of the method disclosed by the invention, and has corresponding functional modules and beneficial effects of the execution method.
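The division of the apparatus into an image obtaining unit and an image auditing unit can be sketched as two small classes. This is an illustrative sketch only: the class and field names are hypothetical, and the model and the image fetcher are passed in as plain callables so that the sketch stays independent of any particular framework or network library.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class AuditResult:
    violation_type: str  # first result information
    probability: float   # second result information


class ImageObtainingUnit:
    """Obtains the web page image to be audited."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch  # assumed: maps an image URL to image bytes

    def obtain(self, url: str) -> bytes:
        return self._fetch(url)


class ImageAuditingUnit:
    """Feeds the image to a pre-trained model and wraps its two outputs."""

    def __init__(self, model: Callable[[bytes], Tuple[str, float]]) -> None:
        self._model = model  # assumed: returns (type, probability)

    def audit(self, image: bytes) -> AuditResult:
        violation_type, probability = self._model(image)
        return AuditResult(violation_type, probability)
```

In use, the output of `ImageObtainingUnit.obtain` is passed directly to `ImageAuditingUnit.audit`, mirroring the order of the two units in fig. 4.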
Fig. 5 is a schematic structural diagram of a training apparatus for a content audit model according to an embodiment of the present disclosure, and as shown in fig. 5, the training apparatus for a content audit model according to this embodiment includes a sample obtaining module 510, a model determining module 520, and a model training module 530.
The sample obtaining module 510 is configured to obtain a training sample set, where a training sample includes a web page image, first annotation information used for indicating a type of a suspected violation of content displayed by the web page image, and second annotation information used for indicating a probability that the content displayed by the web page image belongs to a type corresponding to the first annotation information.
The model determination module 520 is configured to determine an initialized content audit model, where the initialized content audit model includes a first target layer for outputting the type of suspected violation of the content displayed by a web page image and a second target layer for representing the probability of suspected violation of the content displayed by the web page image.
The model training module 530 is configured to use a machine learning method to train the initialized content audit model, taking the web page images in the training samples of the training sample set as input and the first annotation information and second annotation information corresponding to the input web page images as expected output, to obtain the content audit model.
According to one or more embodiments of the present disclosure, the initialized content audit model comprises a convolutional neural network model.
In accordance with one or more embodiments of the present disclosure, the sample acquisition module 510 is configured to, after acquiring the set of training samples: counting training samples contained in the training sample set according to the violation categories of the webpage images in the training samples; and determining at least one class with a small number of training samples according to the statistical result, and performing oversampling processing on the training samples of the at least one class to balance the number of the training samples of each class in the training sample set.
In accordance with one or more embodiments of the present disclosure, the sample acquisition module 510 is configured to, after acquiring the set of training samples: counting training samples contained in the training sample set according to the image sizes in the training samples; and determining at least one size with less training samples according to the statistical result, and performing oversampling processing on the training samples with the at least one size so as to balance the number of the training samples with each size in the training sample set.
In accordance with one or more embodiments of the present disclosure, the sample acquisition module 510 is configured to, after acquiring the set of training samples: determining the number of difficult samples and the number of easy samples of training samples contained in the training sample set; and performing oversampling processing on a small number of difficult samples or easy samples so as to balance the number of the difficult samples and the number of the easy samples in the training sample set.
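The three balancing embodiments above share one pattern: group the training samples by some attribute (violation category, image size, or difficulty), then oversample the smaller groups. The following minimal sketch shows that pattern using random duplication; duplication is an assumed stand-in for whatever oversampling the sample obtaining module actually applies, and the function name `oversample_to_balance` and its parameters are hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, TypeVar

T = TypeVar("T")


def oversample_to_balance(samples: List[T], key: Callable[[T], Hashable],
                          seed: int = 0) -> List[T]:
    """Duplicate samples of smaller groups until every group is as large
    as the largest one. Groups are defined by key(sample)."""
    if not samples:
        return []
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    groups: Dict[Hashable, List[T]] = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample)
    target = max(len(group) for group in groups.values())
    balanced: List[T] = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced
```

Passing a key that returns the category label balances categories; a key that returns a size bucket or a difficulty flag gives the other two embodiments.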
According to one or more embodiments of the present disclosure, the sample obtaining module 510 is configured to crawl a plurality of web page pictures from a website, perform a deduplication operation on the plurality of web page pictures, label pictures remaining after the deduplication operation, and then use the pictures as training samples to form the training sample set.
Further, the sample obtaining module being configured to perform a deduplication operation on the plurality of web page pictures includes: performing the deduplication operation on the plurality of web page pictures using a difference hash algorithm.
The training device for the content audit model provided by the embodiment can execute the training method for the content audit model provided by the embodiment of the method disclosed by the invention, and has corresponding functional modules and beneficial effects of the execution method.
Referring now to FIG. 6, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium described above in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the disclosed embodiments, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the disclosed embodiments, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a webpage image to be audited; inputting the webpage image into a pre-trained content auditing model, and acquiring first result information and second result information output by the content auditing model, wherein the first result information is used for indicating the type of suspected violation of content displayed by the webpage image, and the second result information is used for indicating the probability that the content displayed by the webpage image belongs to the type displayed by the first result information.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
The foregoing description is only a preferred embodiment of the disclosed embodiments and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure in the embodiments of the present disclosure is not limited to the particular combination of the above-described features, but also encompasses other embodiments in which any combination of the above-described features or their equivalents is possible without departing from the scope of the present disclosure. One example is a technical solution formed by replacing the above features with features of similar function disclosed in (but not limited to) the embodiments of the present disclosure.

Claims (12)

1. A network content auditing method is characterized by comprising the following steps:
acquiring a webpage image to be audited;
inputting the webpage image into a pre-trained content auditing model, and acquiring first result information and second result information output by the content auditing model, wherein the first result information is used for indicating the type of suspected violation of content displayed by the webpage image, and the second result information is used for indicating the probability that the content displayed by the webpage image belongs to the type displayed by the first result information.
2. The method of claim 1, wherein the content audit model is trained by:
acquiring a training sample set, wherein the training sample comprises a webpage image, first labeling information used for representing the type of suspected violation of content displayed by the webpage image, and second labeling information used for representing the probability that the content displayed by the webpage image belongs to the type corresponding to the first labeling information;
determining an initialized content auditing model, wherein the initialized content auditing model comprises a first target layer and a second target layer, the first target layer is used for outputting the type of content suspected violation displayed by a webpage image, and the second target layer is used for representing the probability of the content suspected violation displayed by the webpage image;
and by utilizing a machine learning method, taking the webpage images in the training samples in the training sample set as the input of the initialized content auditing model, taking the first labeling information and the second labeling information corresponding to the input webpage images as the expected output of the initialized content auditing model, and training to obtain the content auditing model.
3. The method of claim 2, wherein the initialized content audit model comprises a convolutional neural network model.
4. The method of claim 2, further comprising, after obtaining the set of training samples:
counting training samples contained in the training sample set according to the violation categories of the webpage images in the training samples;
and determining at least one class with a small number of training samples according to the statistical result, and performing oversampling processing on the training samples of the at least one class to balance the number of the training samples of each class in the training sample set.
5. The method of claim 2, further comprising, after obtaining the set of training samples:
counting training samples contained in the training sample set according to the image sizes in the training samples;
and determining at least one size with less training samples according to the statistical result, and performing oversampling processing on the training samples with the at least one size so as to balance the number of the training samples with each size in the training sample set.
6. The method of claim 2, further comprising, after obtaining the set of training samples:
determining the number of difficult samples and the number of easy samples of training samples contained in the training sample set;
and performing oversampling processing on a small number of difficult samples or easy samples so as to balance the number of the difficult samples and the number of the easy samples in the training sample set.
7. The method of claim 2, wherein obtaining a set of training samples comprises:
and crawling a plurality of webpage pictures from a website, performing duplicate removal operation on the webpage pictures, and marking the pictures reserved after the duplicate removal operation to be used as training samples so as to form the training sample set.
8. The method of claim 7, wherein the performing the deduplication operation on the plurality of web page pictures comprises:
and carrying out duplicate removal operation on the plurality of webpage pictures by adopting a difference value hash algorithm.
9. A network content auditing apparatus, comprising:
the image acquisition unit is used for acquiring a webpage image to be audited;
and the image auditing unit is used for inputting the webpage image into a pre-trained content auditing model and acquiring first result information and second result information output by the content auditing model, wherein the first result information is used for indicating the type of the violation suspected by the content displayed by the webpage image, and the second result information is used for indicating the probability that the content displayed by the webpage image belongs to the type displayed by the first result information.
10. The apparatus according to claim 9, wherein the content audit model in the image audit unit is trained by:
a sample obtaining module, configured to obtain a training sample set, wherein a training sample comprises a webpage image, first labeling information used for representing the type of suspected violation of content displayed by the webpage image, and second labeling information used for representing the probability that the content displayed by the webpage image belongs to the type corresponding to the first labeling information;
the model determining module is used for determining an initialized content auditing model, wherein the initialized content auditing model comprises a first target layer and a second target layer, the first target layer is used for outputting the type of the content suspected violation displayed by the webpage image, and the second target layer is used for representing the probability of the content suspected violation displayed by the webpage image;
and the model training module is used for taking the webpage images in the training samples in the training sample set as the input of the initialized content auditing model by using a machine learning method, taking the first labeling information and the second labeling information corresponding to the input webpage images as the expected output of the initialized content auditing model, and training to obtain the content auditing model.
11. An electronic device, comprising:
a processor; and
a memory to store executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-8.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-8.
CN202010717301.1A 2020-07-23 2020-07-23 Network content auditing method and device, electronic equipment and storage medium Pending CN111859237A (en)


Publications (1)

Publication Number Publication Date
CN111859237A true CN111859237A (en) 2020-10-30

Family

ID=72949609

Country Status (1)

Country Link
CN (1) CN111859237A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171260A (en) * 2017-12-15 2018-06-15 百度在线网络技术(北京)有限公司 A kind of image identification method and system
CN108229535A (en) * 2017-12-01 2018-06-29 百度在线网络技术(北京)有限公司 Relate to yellow image audit method, apparatus, computer equipment and storage medium
CN110852231A (en) * 2019-11-04 2020-02-28 云目未来科技(北京)有限公司 Illegal video detection method and device and storage medium
CN111225234A (en) * 2019-12-23 2020-06-02 广州市百果园信息技术有限公司 Video auditing method, video auditing device, equipment and storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112492606A (en) * 2020-11-10 2021-03-12 恒安嘉新(北京)科技股份公司 Classification and identification method and device for spam messages, computer equipment and storage medium
CN112492606B (en) * 2020-11-10 2024-05-17 恒安嘉新(北京)科技股份公司 Classification recognition method and device for spam messages, computer equipment and storage medium
CN112507936A (en) * 2020-12-16 2021-03-16 平安银行股份有限公司 Image information auditing method and device, electronic equipment and readable storage medium
CN112507936B (en) * 2020-12-16 2024-04-23 平安银行股份有限公司 Image information auditing method and device, electronic equipment and readable storage medium
CN112597339A (en) * 2020-12-25 2021-04-02 合安科技技术有限公司 Content security auditing method and device and related equipment
CN113434790A (en) * 2021-06-16 2021-09-24 北京百度网讯科技有限公司 Method and device for identifying repeated links and electronic equipment
CN113434790B (en) * 2021-06-16 2023-07-25 北京百度网讯科技有限公司 Method and device for identifying repeated links and electronic equipment
CN113569953A (en) * 2021-07-29 2021-10-29 中国工商银行股份有限公司 Training method and device of classification model and electronic equipment
CN115099966A (en) * 2022-06-21 2022-09-23 中国银行股份有限公司 Approval mode determination method and device and electronic equipment
CN116932854A (en) * 2023-09-14 2023-10-24 百鸟数据科技(北京)有限责任公司 Webpage information anti-crawler method, device, system, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111859237A (en) Network content auditing method and device, electronic equipment and storage medium
US10915980B2 (en) Method and apparatus for adding digital watermark to video
CN112766189B (en) Deep forgery detection method and device, storage medium and electronic equipment
CN110008428B (en) News data processing method and device, blockchain node equipment and storage medium
CN112949477B (en) Information identification method, device and storage medium based on graph convolution neural network
CN112766284B (en) Image recognition method and device, storage medium and electronic equipment
CN111767838B (en) Video auditing method and system, computer system and computer readable storage medium
CN115346278A (en) Image detection method, device, readable medium and electronic equipment
CN110287350A (en) Image search method, device and electronic equipment
CN113343069B (en) User information processing method, device, medium and electronic equipment
CN112507884B (en) Live content detection method and device, readable medium and electronic equipment
CN113971402A (en) Content identification method, device, medium and electronic equipment
CN117523586A (en) Check seal verification method and device, electronic equipment and medium
CN112214770A (en) Malicious sample identification method and device, computing equipment and medium
CN111221461A (en) Method, apparatus, device and storage medium for gradually presenting image presentation content
CN116363365A (en) Image segmentation method based on semi-supervised learning and related equipment
CN115525781A (en) Multi-mode false information detection method, device and equipment
CN113031950B (en) Picture generation method, device, equipment and medium
CN114301713A (en) Risk access detection model training method, risk access detection method and risk access detection device
CN116092094A (en) Image text recognition method and device, computer readable medium and electronic equipment
CN113888760A (en) Violation information monitoring method, device, equipment and medium based on software application
CN112364682A (en) Case searching method and device
CN113542527A (en) Face image transmission method and device, electronic equipment and storage medium
CN111563276A (en) Webpage tampering detection method, detection system and related equipment
CN113111833B (en) Safety detection method and device of artificial intelligence system and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201030