CN116610962A - Content auditing method and device, electronic equipment and storage medium


Publication number
CN116610962A
Authority
CN
China
Prior art keywords
violation
sample
characterization
database
checked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310457132.6A
Other languages
Chinese (zh)
Inventor
丁顺意
林明安
张璐
陶明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Renyimen Technology Co ltd
Original Assignee
Shanghai Renyimen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Renyimen Technology Co ltd
Priority to CN202310457132.6A
Publication of CN116610962A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a content auditing method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises: obtaining a violation characterization database, wherein the violation characterization database comprises a plurality of violation characterizations, each extracted from the characterizations of violation samples; when a sample to be audited is received, extracting the characterization of the sample to be audited; calculating the similarity between the characterization of the sample to be audited and the violation characterizations in the violation characterization database; and if the similarity between the characterization of the sample to be audited and a target violation characterization is greater than a preset value, determining that the sample to be audited is a violation sample. The content auditing method provided by the application is therefore simple to deploy and efficient, and can meet the content auditing requirements of newly added violation scenarios.

Description

Content auditing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of network technologies, and in particular, to a method and apparatus for content auditing, an electronic device, and a computer readable storage medium.
Background
Content security auditing is one of the core services of the Internet industry. Its main purpose is to keep a platform's content safe, suppress violating content such as pornography, gambling, drugs and spam advertisements, and foster a clean and healthy Internet environment. In the related art, a separate security audit model must be trained for each violation type, which generally requires data acquisition, cleaning, labeling, model training, testing, deployment and similar operations. The turnaround time is therefore long and the cost is high, which is ill-suited to auditing the continuously changing data on the Internet and makes it difficult to respond quickly to newly added violation scenarios.
Disclosure of Invention
The application aims to provide a content auditing method and apparatus, an electronic device and a computer-readable storage medium that meet the content auditing requirements of newly added violation scenarios.
In order to achieve the above object, the present application provides a content auditing method, including:
obtaining a violation characterization database; the violation characterization database comprises a plurality of violation characterizations, and the violation characterizations are extracted based on the characterization of the violation sample;
when a sample to be audited is received, extracting the representation of the sample to be audited;
calculating the similarity between the characterization of the sample to be checked and the violation characterization in the violation characterization database;
and if the similarity between the characterization of the sample to be checked and the target violation characterization is larger than a preset value, judging that the sample to be checked is a violation sample.
The violation characterization database is a violation type characterization database, and the violation type characterization database comprises a plurality of violation characterizations of violation types;
the obtaining the violation characterization database comprises:
obtaining a plurality of violation samples corresponding to the violation types, and extracting characterization of the violation samples;
calculating the cluster centers of the characterizations of the plurality of violation samples as the violation characterizations of the violation types, and adding the violation characterizations of the violation types to the violation type characterization database.
If the similarity between the characterization of the sample to be checked and the target violation characterization is greater than a preset value, determining that the sample to be checked is a violation sample comprises:
and if the similarity between the characterization of the sample to be checked and the first target violation characterization of the target violation type in the violation type characterization database is larger than a first preset value, judging that the sample to be checked belongs to the violation sample of the target violation type.
The violation characterization database is a violation sample characterization database, and the violation sample characterization database comprises violation characterizations of a plurality of violation samples;
the obtaining the violation characterization database comprises:
obtaining a violation sample, and extracting a violation characterization of the violation sample;
adding the violation characterization of the violation sample to the violation sample characterization database.
If the similarity between the characterization of the sample to be checked and the target violation characterization is greater than a preset value, determining that the sample to be checked is a violation sample comprises:
and if the similarity between the characterization of the sample to be checked and the second target violation characterization of the target violation sample in the violation sample characterization database is larger than a second preset value, judging that the sample to be checked is a violation sample.
The violation characterization database is a violation type characterization database and a violation sample characterization database, wherein the violation type characterization database comprises a plurality of violation characterizations of violation types, and the violation sample characterization database comprises violation characterizations of a plurality of violation samples;
if the similarity between the characterization of the sample to be checked and the target violation characterization is greater than a preset value, determining that the sample to be checked is a violation sample comprises:
and if the similarity between the representation of the sample to be checked and the first target violation representation of the target violation type in the violation type representation database is greater than a first preset value, and/or the similarity between the representation of the sample to be checked and the second target violation representation of the target violation sample in the violation sample representation database is greater than a second preset value, judging that the sample to be checked is a violation sample.
Wherein, the method further comprises:
obtaining a sample library, and extracting the characterization of the samples in the sample library;
calculating the similarity between the violation characterization in the characterization database and the characterization of each sample, sorting the samples in the sample library by similarity in descending order, and taking the first preset number of samples in the sorting result as target samples;
determining a violation mark of each target sample; wherein the violation mark is either violating or non-violating;
and determining the similarity between the characterization of the last sample in the sorting result and the violation characterization as the preset value.
Wherein, after determining the violation mark of the target sample, the method further comprises:
determining the proportion of violation samples among the target samples according to the violation marks;
judging whether the proportion is greater than a preset proportion, and if so, judging that the preset value meets the precision requirement.
In order to achieve the above object, the present application provides a content auditing apparatus, comprising:
the first acquisition module is used for acquiring the violation characterization database; the violation characterization database comprises a plurality of violation characterizations, and the violation characterizations are extracted based on the characterization of the violation sample;
the extraction module is used for extracting the representation of the sample to be checked when the sample to be checked is received;
the first calculation module is used for calculating the similarity between the characterization of the sample to be checked and the violation characterization in the violation characterization database;
and the auditing module is used for judging the sample to be checked as the illegal sample when the similarity between the characterization of the sample to be checked and the target illegal characterization is larger than a preset value.
To achieve the above object, the present application provides an electronic device including:
a memory for storing a computer program;
and a processor for implementing the steps of the content auditing method as described above when executing the computer program.
To achieve the above object, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the content auditing method as described above.
According to the scheme, the content auditing method provided by the application comprises the following steps: obtaining a violation characterization database; the violation characterization database comprises a plurality of violation characterizations, and the violation characterizations are extracted based on the characterization of the violation sample; when a sample to be audited is received, extracting the representation of the sample to be audited; calculating the similarity between the characterization of the sample to be checked and the violation characterization in the violation characterization database; and if the similarity between the characterization of the sample to be checked and the target violation characterization is larger than a preset value, judging that the sample to be checked is a violation sample.
With the content auditing method provided by the application, when a violation type is newly added, only the violation characterization of the new violation type needs to be added to the violation characterization database; the next content audit can then judge whether a sample to be audited belongs to the new violation type. The content auditing method provided by the application is therefore simple to deploy and efficient, and can meet the content auditing requirements of newly added violation scenarios. The application also discloses a content auditing apparatus, an electronic device and a computer-readable storage medium, which achieve the same technical effects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate the disclosure and together with the description serve to explain, but do not limit the disclosure. In the drawings:
FIG. 1 is a flow chart illustrating a method of content auditing, according to an example embodiment;
FIG. 2 is a flow chart illustrating another method of content auditing, according to an example embodiment;
FIG. 3 is a block diagram of a content auditing apparatus, according to an example embodiment;
FIG. 4 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. In addition, in the embodiments of the present application, "first", "second", etc. are used to distinguish similar objects and are not necessarily used to describe a particular order or precedence.
The embodiment of the application discloses a content auditing method which meets the content auditing requirements of newly added violation scenes.
Referring to FIG. 1, a flowchart of a content auditing method according to an exemplary embodiment is shown. As shown in FIG. 1, the method includes:
S101: obtaining a violation characterization database; the violation characterization database comprises a plurality of violation characterizations, and the violation characterizations are extracted based on the characterization of the violation sample;
The aim of this embodiment is to perform content auditing on a sample to be audited.
In this step, a violation characterization database is obtained, which contains a plurality of violation characterizations extracted from the characterizations of violation samples. The violation samples here may be images, single passages of text, speech, and so on, which this embodiment does not specifically limit. A violation characterization here may be a multi-dimensional vector.
As a possible implementation, the violation characterization database is a violation type characterization database, which contains a plurality of violation characterizations of violation types; the obtaining the violation characterization database comprises: obtaining a plurality of violation samples corresponding to the violation types, and extracting characterization of the violation samples; calculating the cluster centers of the characterizations of the plurality of violation samples as the violation characterizations of the violation types, and adding the violation characterizations of the violation types to the violation type characterization database.
In a specific implementation, the violation type characterization database contains the violation characterizations corresponding to a plurality of violation types. When the violation type characterization database is constructed, violation samples can be labeled with their violation types, a small number of violation samples is obtained for each violation type, and a unimodal encoder is used to extract the characterization of each violation sample: an image encoder if the violation sample is an image, a text encoder if it is text, and a speech encoder if it is speech. The image encoder, the text encoder and the speech encoder are models with global characterization capability. The characterizations of the violation samples are then clustered by violation type to form one cluster per violation type, and the cluster center of each cluster is calculated as the violation characterization of the corresponding violation type and added to the violation type characterization database. When a new violation type appears, a small number of violation samples of the new type is obtained, the cluster center of their characterizations is calculated in the same way as the violation characterization of the new violation type, and this characterization is added to the violation type characterization database, thereby extending the database.
As another possible implementation, the violation characterization database is a violation sample characterization database containing violation characterizations of a plurality of violation samples; the obtaining the violation characterization database comprises: obtaining a violation sample, and extracting a violation characterization of the violation sample; adding the violation characterization of the violation sample to the violation sample characterization database.
In a specific implementation, each violation sample can be treated as a separate violation type, and the violation characterization of each violation sample can be extracted and added to the violation sample characterization database.
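The construction of the violation type characterization database described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the characterizations are assumed to be already extracted as equal-length lists of floats, and the cluster center is assumed to be the component-wise mean of the vectors labeled with each violation type. The names `build_type_database` and `labeled_vectors` are hypothetical.

```python
from collections import defaultdict


def build_type_database(labeled_vectors):
    """Return {violation_type: cluster center} from (violation_type, vector) pairs.

    Assumption: the cluster center is the component-wise mean of the
    characterizations that share a violation type label.
    """
    groups = defaultdict(list)
    for violation_type, vector in labeled_vectors:
        groups[violation_type].append(vector)

    database = {}
    for violation_type, vectors in groups.items():
        dim = len(vectors[0])
        if any(len(v) != dim for v in vectors):
            raise ValueError("all characterizations must share one dimension")
        database[violation_type] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return database


# Toy 2-dimensional characterizations standing in for encoder output.
samples = [
    ("gambling", [1.0, 0.0]),
    ("gambling", [0.0, 1.0]),
    ("spam_ad", [2.0, 2.0]),
]
type_db = build_type_database(samples)

# A newly added violation type only needs a new database entry.
new_samples = [("new_type", [4.0, 0.0]), ("new_type", [2.0, 0.0])]
type_db.update(build_type_database(new_samples))
```

Extending the database is a dictionary update in this sketch, which mirrors the point in the text that a new violation type does not require training a new audit model.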
S102: when a sample to be audited is received, extracting the representation of the sample to be audited;
In this step, when the sample to be audited is received, its type (image, single passage of text, speech, etc.) is determined and its characterization is extracted with the corresponding encoder. The extraction method is described in detail in the previous step and is not repeated here.
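Selecting an encoder by sample type can be sketched as a simple lookup table. The three encoder functions below are placeholders invented for illustration: a real system would call trained image, text and speech models, which the source does not name. The modality keys and the function `extract_characterization` are likewise hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]


# Placeholder encoders (assumptions): they only make the sketch runnable
# and do not produce meaningful characterizations.
def encode_image(payload: bytes) -> Vector:
    return [float(len(payload)), 0.0, 0.0]


def encode_text(payload: str) -> Vector:
    return [0.0, float(len(payload)), 0.0]


def encode_speech(payload: bytes) -> Vector:
    return [0.0, 0.0, float(len(payload))]


ENCODERS: Dict[str, Callable[..., Vector]] = {
    "image": encode_image,
    "text": encode_text,
    "speech": encode_speech,
}


def extract_characterization(modality: str, payload) -> Vector:
    """Pick the encoder that matches the sample's modality and apply it."""
    try:
        encoder = ENCODERS[modality]
    except KeyError:
        raise ValueError(f"unsupported modality: {modality!r}") from None
    return encoder(payload)
```

The same function would be used both when building the violation characterization database and when auditing, so that both sides of the similarity comparison come from the same encoder.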
S103: calculating the similarity between the characterization of the sample to be checked and the violation characterization in the violation characterization database;
In this step, the similarity between the characterization of the sample to be checked and each violation characterization in the violation characterization database is calculated. This embodiment does not limit the specific way the similarity is calculated. For example, the similarity between two characterization vectors may be calculated as their cosine similarity:
$$\cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
where $A = \{A_1, A_2, \ldots, A_n\}$ is the characterization of the sample to be checked, $B = \{B_1, B_2, \ldots, B_n\}$ is a violation characterization in the violation characterization database, $n$ is the dimension of $A$ and $B$, $A_i$ is the element of the $i$-th dimension of $A$, $B_i$ is the element of the $i$-th dimension of $B$, and $1 \le i \le n$.
Of course, other ways of calculating the similarity between the two vectors are within the protection scope of the present embodiment, and will not be described herein.
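The cosine method mentioned in this step can be written in a few lines. The sketch below uses only the Python standard library; returning 0.0 for a zero vector is an assumption made here to avoid division by zero and is not specified in the source.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length characterization vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the result depends only on direction, scaling a characterization vector does not change its similarity to any violation characterization.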
S104: and if the similarity between the characterization of the sample to be checked and the target violation characterization is larger than a preset value, judging that the sample to be checked is a violation sample.
In this step, it is judged whether the similarity between the characterization of the sample to be checked and each violation characterization in the violation characterization database is greater than a preset value, thereby obtaining the content auditing result of the sample to be checked. If the violation characterization database contains a target violation characterization whose similarity to the characterization of the sample to be checked is greater than the preset value, the sample to be checked is judged to be a violation sample; otherwise, it is judged to be a non-violation sample.
The preset value can be set flexibly by an auditor based on experience, or calculated automatically by a preset algorithm. The preset values corresponding to different violation characterizations may be the same or different, which is not specifically limited here.
As a possible implementation manner, the calculation process of the preset value includes: obtaining a sample library, and extracting the characterization of the samples in the sample library; calculating the similarity between the violation characterization in the characterization database and the characterization of each sample, sorting the samples in the sample library by similarity in descending order, and taking the first preset number of samples in the sorting result as target samples; determining a violation mark of each target sample, wherein the violation mark is either violating or non-violating; and determining the similarity between the characterization of the last sample in the sorting result and the violation characterization as the preset value.
In a specific implementation, the sample library contains a plurality of samples. The similarity between the characterization of each sample and the violation characterization in the characterization database is calculated, and the samples in the sample library are sorted by similarity in descending order. An auditor marks each of the first preset number of target samples in the sorting result as violating or non-violating, and the similarity between the characterization of the last sample in the sorting result and the violation characterization is determined as the preset value. For example, suppose the sample library contains 5 million pictures. The similarity between the violation characterization in the characterization database and each picture is calculated, the 1000 pictures with the highest similarity are taken, and each is manually marked as violating or not. If the first 90% of these pictures are violations, the similarity between the 900th picture and the violation characterization is used as the preset value.
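The decision in S104, including the option of a different preset value per violation characterization, might look like the sketch below. The function name `audit`, the dictionary layout of the database, and the fallback value `default_threshold=0.8` are assumptions for illustration only; the source gives no concrete preset values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def audit(sample_vec, violation_db, thresholds, default_threshold=0.8):
    """Return the keys of all violation characterizations that the sample matches.

    violation_db maps a key (a violation type or a sample id) to its vector.
    thresholds maps a key to its own preset value; keys that are missing
    fall back to default_threshold (a hypothetical value).
    An empty result means the sample is judged to be a non-violation sample.
    """
    hits = []
    for key, violation_vec in violation_db.items():
        preset = thresholds.get(key, default_threshold)
        if cosine_similarity(sample_vec, violation_vec) > preset:
            hits.append(key)
    return hits


violation_db = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
thresholds = {"a": 0.9}  # "b" uses the default preset value

matched = audit([1.0, 0.1], violation_db, thresholds)
unmatched = audit([1.0, 1.0], violation_db, thresholds)
```

Here `matched` contains only `"a"`, and `unmatched` is empty because a similarity of about 0.707 exceeds neither preset value.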
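The calibration procedure can be sketched as below. One reading of the worked example is assumed here: the preset value is the similarity of the last sample in the leading run of target samples that the auditor marks as violations (the 900th picture in the example). The callback `is_violation` stands in for the human auditor, and all names are hypothetical.

```python
def calibrate_threshold(scored_samples, is_violation, top_k):
    """Derive a preset value from a ranked, manually labeled sample library.

    scored_samples: iterable of (sample_id, similarity to the violation
        characterization).
    is_violation: callable returning True if the auditor marks the sample
        as a violation (stand-in for manual labeling).
    top_k: the first preset number of samples to label.

    Returns (preset_value, labels). preset_value is None when even the
    most similar sample is not a violation.
    """
    ranked = sorted(scored_samples, key=lambda item: item[1], reverse=True)
    targets = ranked[:top_k]
    labels = [bool(is_violation(sample_id)) for sample_id, _ in targets]

    preset_value = None
    for (_, similarity), label in zip(targets, labels):
        if not label:
            break  # assumption: stop at the first non-violation
        preset_value = similarity
    return preset_value, labels


scored = [("p1", 0.95), ("p2", 0.91), ("p3", 0.60), ("p4", 0.88), ("p5", 0.40)]
violating_ids = {"p1", "p2", "p4"}
preset_value, labels = calibrate_threshold(
    scored, lambda sample_id: sample_id in violating_ids, top_k=4
)
```

In this toy run the ranking is p1, p2, p4, p3, the first three are violations, and the similarity of p4 becomes the preset value.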
Further, the precision verification may be further performed on the calculated preset value, and after determining the violation mark of the target sample, the method further includes: determining the proportion of the violation samples in the target samples according to the violation marks; judging whether the proportion is larger than a preset proportion, if so, judging that the preset value meets the precision requirement.
In a specific implementation, the proportion of violation samples among the target samples is determined according to the violation marks. When the proportion is greater than the preset proportion, the calculated preset value is judged to meet the precision requirement; otherwise, the calculated preset value does not meet the precision requirement and needs to be recalculated. In the above example, if the preset proportion is 80%, then since 90% is greater than 80%, the calculated preset value is judged to meet the precision requirement.
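The precision check reduces to a ratio comparison, shown here with the numbers from the example (900 violations among 1000 target samples, preset proportion 80%). The function name `meets_precision` is hypothetical.

```python
def meets_precision(labels, preset_proportion):
    """Check whether the share of violations among the target samples is high enough.

    labels: violation marks of the target samples (True means violation).
    Returns (ok, proportion).
    """
    if not labels:
        raise ValueError("no target samples were labeled")
    proportion = sum(1 for label in labels if label) / len(labels)
    return proportion > preset_proportion, proportion


# Numbers from the example: 900 of the 1000 target samples are violations.
example_labels = [True] * 900 + [False] * 100
ok, proportion = meets_precision(example_labels, 0.80)
```

If `ok` were false, the text calls for recalculating the preset value, for instance with a different number of target samples.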
When the violation characterization database is a violation type characterization database, the method comprises the following steps: and if the similarity between the characterization of the sample to be checked and the first target violation characterization of the target violation type in the violation type characterization database is larger than a first preset value, judging that the sample to be checked belongs to the violation sample of the target violation type.
In a specific implementation, it is judged whether the similarity between the characterization of the sample to be audited and each violation characterization in the violation type characterization database is greater than the first preset value. If the violation type characterization database contains a first target violation characterization whose similarity to the characterization of the sample to be checked is greater than the first preset value, the sample to be checked is judged to be a violation sample of the target violation type corresponding to the first target violation characterization; otherwise, it is judged to be a non-violation sample.
It can be seen that, in the violation type characterization database scheme, each violation characterization in the database represents a whole class of violation samples, so when content auditing is performed on a sample to be audited, the recall rate is high and the content auditing precision is improved.
When the violation characterization database is a violation sample characterization database, the method comprises the following steps: and if the similarity between the characterization of the sample to be checked and the second target violation characterization of the target violation sample in the violation sample characterization database is larger than a second preset value, judging that the sample to be checked is a violation sample.
In a specific implementation, it is determined whether a similarity between the characterization of the sample to be audited and each offending characterization in the offending sample characterization database is greater than a second preset value. If the similarity between the second target violation characterization and the characterization of the sample to be checked is larger than a second preset value in the violation sample characterization database, the sample to be checked is judged to be the violation sample, otherwise, the sample to be checked is judged to be the non-violation sample.
It can be seen that, in the violation sample characterization database scheme, because each single violation sample is treated as an independent violation type, no cluster centers need to be calculated when the database is constructed, so construction is fast. When a sample to be audited is audited, the violation characterizations in the violation sample characterization database will not necessarily match the characterization of the sample, so the recall rate is low. Once a match succeeds, however, the sample to be audited is very similar to the violation sample behind some violation characterization in the database, so it is very likely a violation sample and the precision of the content audit is high.
With the content auditing method provided by the embodiment of the application, when a violation type is newly added, only the violation characterization of the new violation type needs to be added to the violation characterization database; the next content audit can then judge whether a sample to be audited belongs to the new violation type. The content auditing method provided by the embodiment of the application is therefore simple to deploy and efficient, and can meet the content auditing requirements of newly added violation scenarios.
The embodiment of the application discloses a content auditing method that further describes and optimizes the technical scheme of the previous embodiment. Specifically:
Referring to FIG. 2, a flowchart of another content auditing method according to an exemplary embodiment is shown. As shown in FIG. 2, the method includes:
S201: obtaining a plurality of violation samples corresponding to the violation types, and extracting violation characterizations of the plurality of violation samples;
S202: adding the violation characterizations of the plurality of violation samples to a violation sample characterization database;
S203: calculating the cluster centers of the characterizations of the plurality of violation samples as the violation characterizations of the violation types, and adding the violation characterizations of the violation types to a violation type characterization database;
In this embodiment, a violation sample characterization database and a violation type characterization database are constructed separately. The violation type characterization database contains the violation characterizations corresponding to a plurality of violation types, and the violation sample characterization database contains the violation characterizations of a plurality of violation samples. The specific construction is described in detail in the previous embodiment and is not repeated here.
S204: when a sample to be audited is received, extracting the representation of the sample to be audited;
S205: calculating a first similarity between the characterization of the sample to be audited and the violation characterizations in the violation sample characterization database, and calculating a second similarity between the characterization of the sample to be audited and the violation characterizations in the violation type characterization database;
S206: if the similarity between the characterization of the sample to be audited and a first target violation characterization of a target violation type in the violation type characterization database is greater than a first preset value, and/or the similarity between the characterization of the sample to be audited and a second target violation characterization of a target violation sample in the violation sample characterization database is greater than a second preset value, judging that the sample to be audited is a violation sample.
In this embodiment, when a sample to be audited is received, its characterization is extracted. A first similarity between that characterization and the violation characterizations in the violation sample characterization database, and a second similarity between that characterization and the violation characterizations in the violation type characterization database, are then calculated. Whether the first similarity is greater than a first preset value and whether the second similarity is greater than a second preset value are judged separately, and the content auditing result of the sample to be audited is determined from the two judgment results. The first preset value and the second preset value may be the same or different and are not specifically limited here.
As one possible implementation, when either judgment result is yes, the sample to be audited is judged to be a violation sample; otherwise it is judged to be a non-violation sample. As another possible implementation, the sample to be audited is judged to be a violation sample only when both judgment results are yes; otherwise it is judged to be a non-violation sample.
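The flow S201 to S206 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the text names neither the similarity measure nor the way the cluster center is computed, so cosine similarity and the element-wise mean are assumed here. The function names (`cosine`, `cluster_center`, `audit`), the `require_both` flag, and the toy 2-dimensional vectors are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_center(vectors):
    """Element-wise mean of the vectors, assumed here to be the cluster center."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def audit(x, sample_db, type_db, first_preset, second_preset, require_both=False):
    """Return True if characterization x is judged to be a violation.

    First similarity: best match in the violation sample database.
    Second similarity: best match in the violation type database.
    """
    first = max((cosine(x, v) for v in sample_db), default=-1.0)
    second = max((cosine(x, c) for c in type_db.values()), default=-1.0)
    hits = (first > first_preset, second > second_preset)
    # require_both=False is the "or" variant, True is the "and" variant.
    return all(hits) if require_both else any(hits)


# S201-S203 with toy data: two violation samples of one hypothetical type.
violations = [[1.0, 0.0], [0.0, 1.0]]
sample_db = list(violations)
type_db = {"example_type": cluster_center(violations)}

# S204-S206: the query is close to the cluster center but not to any one sample.
either = audit([1.0, 1.0], sample_db, type_db, 0.9, 0.9)
both = audit([1.0, 1.0], sample_db, type_db, 0.9, 0.9, require_both=True)
```

The query is flagged under the "or" variant and not under the "and" variant, which illustrates why the two databases can give different results for the same sample.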
An application embodiment provided by the application is described below. It comprises the following steps:
Step 1: online manual auditing finds 10 violating cartoon-kiss images that passed machine auditing; a CLIP image encoder converts each image into a 768-dimensional vector, giving a 10×768-dimensional vector characterization;
Step 2: calculating a 1×768-dimensional vector from the 10×768-dimensional vector characterization as the cluster center characterization of the cartoon-kiss violation type;
Step 3: adding the 10×768-dimensional vector characterizations to a black sample library;
Step 4: adding the 1×768-dimensional cluster center characterization to a black cluster center library;
Step 5: passing an image to be audited through the model to obtain the characterization of the sample to be audited;
Step 6: calculating the similarity S1 between the characterization of the sample to be audited and each characterization in the black sample library;
Step 7: calculating the similarity S2 between the characterization of the sample to be audited and each characterization in the black cluster center library;
Step 8: judging whether the sample to be audited is a violation: if the similarity S1 is greater than a given threshold Z1 or the similarity S2 is greater than a given threshold Z2, the image to be audited is judged to be a violation; otherwise it is judged not to be a violation.
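The eight steps can be written compactly in formulas. The text does not state how the cluster center or the similarity is computed, so the following assumes the arithmetic mean and cosine similarity, and takes S1 as the largest similarity over the black sample library. Here v_1, ..., v_10 denote the 10 sample vectors, c the cluster center, and x the characterization of the image to be audited; these symbols are introduced only for this sketch.

```latex
\[
c = \frac{1}{10}\sum_{i=1}^{10} v_i, \qquad v_i \in \mathbb{R}^{768}
\]
\[
S_1 = \max_{1 \le i \le 10} \frac{x \cdot v_i}{\lVert x \rVert \, \lVert v_i \rVert},
\qquad
S_2 = \frac{x \cdot c}{\lVert x \rVert \, \lVert c \rVert}
\]
\[
\text{violation} \iff (S_1 > Z_1) \lor (S_2 > Z_2)
\]
```

With a single violation type, the black cluster center library holds only c, so S2 is one value; with several types it would likewise be taken over all centers.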
The following describes a content auditing device according to an embodiment of the present application, and a content auditing device described below and a content auditing method described above may be referred to each other.
Referring to fig. 3, a structure diagram of a content auditing apparatus according to an exemplary embodiment is shown, as shown in fig. 3, including:
a first obtaining module 301, configured to obtain a violation characterization database; the violation characterization database comprises a plurality of violation characterizations, and the violation characterizations are extracted based on the characterization of the violation sample;
an extracting module 302, configured to extract, when a sample to be checked is received, a representation of the sample to be checked;
a first calculation module 303, configured to calculate a similarity between the characterization of the sample to be checked and the violation characterization in the violation characterization database;
and the auditing module 304 is configured to determine that the sample to be audited is a violation sample when the similarity between the characterization of the sample to be audited and the target violation characterization is greater than a preset value.
When a new violation type is added, only the violation characterization of the new violation type needs to be added to the violation characterization database; the next content audit can then judge whether a sample to be audited belongs to the new violation type. The content auditing device provided by the embodiment of the application is therefore simple to deploy and more efficient, and can meet the content auditing requirements of newly added violation scenarios.
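One way to picture modules 301 to 304, and why adding a violation type only touches the database, is the sketch below. It is illustrative, not the device itself: the class name `ContentAuditor`, its method names, the `encoder` callable, and the use of cosine similarity are all assumptions, and the toy encoder simply returns the input vector.

```python
import math


class ContentAuditor:
    """Hypothetical mapping of modules 301-304 onto methods."""

    def __init__(self, encoder, preset_value):
        self.encoder = encoder            # assumed callable: sample -> list of floats
        self.preset_value = preset_value
        self.violation_db = {}            # label -> violation characterization

    def add_violation(self, label, sample):
        # First obtaining module 301: extend the database; nothing else changes.
        self.violation_db[label] = self.encoder(sample)

    def extract(self, sample):
        # Extracting module 302.
        return self.encoder(sample)

    @staticmethod
    def similarity(a, b):
        # First calculation module 303; cosine similarity is an assumption.
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.hypot(*a) * math.hypot(*b)
        return dot / norm if norm else 0.0

    def audit(self, sample):
        # Auditing module 304: labels whose similarity exceeds the preset value.
        x = self.extract(sample)
        return [label for label, v in self.violation_db.items()
                if self.similarity(x, v) > self.preset_value]


auditor = ContentAuditor(encoder=lambda s: list(s), preset_value=0.9)
auditor.add_violation("type_a", [1.0, 0.0])
before = auditor.audit([0.0, 1.0])            # no matching characterization yet
auditor.add_violation("type_b", [0.0, 1.0])   # newly added violation type
after = auditor.audit([0.0, 1.0])             # the same sample is now flagged
```

An empty list from `audit` means the sample is not judged to be a violation; in this sketch the second call flags the sample without any change to the encoder.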
On the basis of the above embodiment, as a preferred implementation manner, the violation characterization database is a violation type characterization database, and the violation type characterization database contains multiple violation characterizations of violation types;
the first obtaining module 301 is specifically configured to: obtaining a plurality of violation samples corresponding to the violation types, and extracting characterization of the violation samples; calculating the cluster centers of the characterizations of the plurality of violation samples as the violation characterizations of the violation types, and adding the violation characterizations of the violation types to the violation type characterization database.
Based on the foregoing embodiment, as a preferred implementation manner, the auditing module 304 is specifically configured to: and if the similarity between the characterization of the sample to be checked and the first target violation characterization of the target violation type in the violation type characterization database is larger than a first preset value, judging that the sample to be checked belongs to the violation sample of the target violation type.
Based on the above embodiment, as a preferred implementation manner, the violation characterization database is a violation sample characterization database, and the violation sample characterization database contains the violation characterizations of a plurality of violation samples;
the first obtaining module 301 is specifically configured to: obtain a violation sample, and extract the violation characterization of the violation sample; add the violation characterization of the violation sample to the violation sample characterization database.
Based on the foregoing embodiment, as a preferred implementation manner, the auditing module 304 is specifically configured to: and if the similarity between the characterization of the sample to be checked and the second target violation characterization of the target violation sample in the violation sample characterization database is larger than a second preset value, judging that the sample to be checked is a violation sample.
On the basis of the embodiment, as a preferred implementation manner, the violation characterization database is a violation type characterization database and a violation sample characterization database, wherein the violation type characterization database comprises a plurality of violation characterizations of violation types, and the violation sample characterization database comprises violation characterizations of a plurality of violation samples;
the auditing module 304 is specifically configured to: and if the similarity between the representation of the sample to be checked and the first target violation representation of the target violation type in the violation type representation database is greater than a first preset value, and/or the similarity between the representation of the sample to be checked and the second target violation representation of the target violation sample in the violation sample representation database is greater than a second preset value, judging that the sample to be checked is a violation sample.
On the basis of the above embodiment, as a preferred implementation manner, the device further includes:
a second obtaining module, configured to obtain a sample library and extract the characterizations of the samples in the sample library;
a second calculation module, configured to calculate the similarity between a violation characterization in the violation characterization database and the characterization of each sample, sort the samples in the sample library in descending order of similarity, and take the first preset number of samples in the sorting result as target samples;
a first determining module, configured to determine a violation mark for each target sample, wherein the violation marks include violation and non-violation;
a second determining module, configured to determine the similarity between the characterization of the last sample in the sorting result and the violation characterization as the preset value.
On the basis of the above embodiment, as a preferred implementation manner, the device further includes:
a judging module, configured to determine the proportion of violation samples among the target samples according to the violation marks, and to judge whether the proportion is greater than a preset proportion; if so, the preset value is judged to meet the precision requirement.
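The calibration of the preset value described by these modules can be sketched as below. This is a hedged illustration: the function name `calibrate` and its parameters are hypothetical, "the last sample in the sorting result" is read here as the last of the target samples, and the violation marks are assumed to come from manual annotation.

```python
def calibrate(similarities, marks, preset_number, preset_ratio):
    """Derive a preset value from a sample library and check its precision.

    similarities[i]: similarity between sample i and the violation characterization.
    marks[i]: True if sample i is marked as a violation (only targets are read).
    """
    # Sort sample indices by similarity, from large to small.
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    targets = order[:preset_number]
    # Similarity of the last target sample becomes the preset value.
    preset_value = similarities[targets[-1]]
    # Proportion of target samples marked as violations.
    ratio = sum(1 for i in targets if marks[i]) / len(targets)
    return preset_value, ratio, ratio > preset_ratio


sims = [0.95, 0.40, 0.88, 0.91, 0.60]
marks = [True, False, False, True, False]
preset_value, ratio, ok = calibrate(sims, marks, preset_number=3, preset_ratio=0.6)
```

In this toy run the three most similar samples are taken as targets, two of them are marked as violations, and the resulting proportion is compared against the preset proportion of 0.6.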
The specific manner in which each module of the device in the above embodiments performs its operations has been described in detail in the method embodiments and is not repeated here.
Based on the hardware implementation of the program modules, and in order to implement the method according to the embodiment of the present application, the embodiment of the present application further provides an electronic device, and fig. 4 is a block diagram of an electronic device according to an exemplary embodiment, and as shown in fig. 4, the electronic device includes:
a communication interface 1 capable of information interaction with other devices such as network devices and the like;
a processor 2, connected with the communication interface 1 to realize information interaction with other devices, and configured to execute, when running a computer program, the content auditing method provided by one or more of the above technical schemes. The computer program is stored in the memory 3.
Of course, in practice, the various components in the electronic device are coupled together by a bus system 4. It will be appreciated that the bus system 4 is used to enable connected communications between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. But for clarity of illustration the various buses are labeled as bus system 4 in fig. 4.
The memory 3 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.
It will be appreciated that the memory 3 may be volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic random access memory (FRAM, ferromagnetic random access memory), flash memory, magnetic surface memory, optical disk, or compact disc read-only memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). The memory 3 described in the embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiment of the present application may be applied to the processor 2 or implemented by the processor 2. The processor 2 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 2 or by instructions in the form of software. The processor 2 described above may be a general purpose processor, DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps and logic blocks disclosed in embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiment of the application can be directly embodied in the hardware of the decoding processor or can be implemented by combining hardware and software modules in the decoding processor. The software modules may be located in a storage medium in the memory 3 and the processor 2 reads the program in the memory 3 to perform the steps of the method described above in connection with its hardware.
The corresponding flow in each method of the embodiments of the present application is implemented when the processor 2 executes the program, and for brevity, will not be described in detail herein.
In an exemplary embodiment, the present application also provides a storage medium, i.e. a computer storage medium, in particular a computer readable storage medium, for example comprising a memory 3 storing a computer program executable by the processor 2 for performing the steps of the method described above. The computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash Memory, magnetic surface Memory, optical disk, CD-ROM, etc.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.
Alternatively, the above-described integrated units of the present application may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied essentially or in part in the form of a software product stored in a storage medium, including instructions for causing an electronic device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A method of content auditing, comprising:
obtaining a violation characterization database; the violation characterization database comprises a plurality of violation characterizations, and the violation characterizations are extracted based on the characterization of the violation sample;
when a sample to be audited is received, extracting the representation of the sample to be audited;
calculating the similarity between the characterization of the sample to be checked and the violation characterization in the violation characterization database;
and if the similarity between the characterization of the sample to be checked and the target violation characterization is larger than a preset value, judging that the sample to be checked is a violation sample.
2. The content auditing method of claim 1, wherein the violation characterization database is a violation type characterization database that contains a plurality of violation characterizations of violation types;
the obtaining the violation characterization database comprises:
obtaining a plurality of violation samples corresponding to the violation types, and extracting characterization of the violation samples;
calculating the cluster centers of the characterizations of the plurality of violation samples as the violation characterizations of the violation types, and adding the violation characterizations of the violation types to the violation type characterization database.
3. The content auditing method according to claim 2, wherein if the similarity between the characterization of the sample to be audited and the target violation characterization is greater than a preset value, determining that the sample to be audited is a violation sample comprises:
and if the similarity between the characterization of the sample to be checked and the first target violation characterization of the target violation type in the violation type characterization database is larger than a first preset value, judging that the sample to be checked belongs to the violation sample of the target violation type.
4. The content auditing method of claim 1, wherein the violation characterization database is a violation sample characterization database that contains the violation characterizations of a plurality of violation samples;
the obtaining the violation characterization database comprises:
obtaining a violation sample, and extracting a violation characterization of the violation sample;
adding the violation characterization of the violation sample to the violation sample characterization database.
5. The content auditing method according to claim 4, wherein if the similarity between the characterization of the sample to be audited and the target violation characterization is greater than a preset value, determining that the sample to be audited is a violation sample comprises:
and if the similarity between the characterization of the sample to be checked and the second target violation characterization of the target violation sample in the violation sample characterization database is larger than a second preset value, judging that the sample to be checked is a violation sample.
6. The content auditing method of claim 1, wherein the violation characterization database is a violation type characterization database and a violation sample characterization database, the violation type characterization database containing violation characterizations of a plurality of violation types, the violation sample characterization database containing violation characterizations of a plurality of violation samples;
if the similarity between the characterization of the sample to be checked and the target violation characterization is greater than a preset value, determining that the sample to be checked is a violation sample comprises:
and if the similarity between the representation of the sample to be checked and the first target violation representation of the target violation type in the violation type representation database is greater than a first preset value, and/or the similarity between the representation of the sample to be checked and the second target violation representation of the target violation sample in the violation sample representation database is greater than a second preset value, judging that the sample to be checked is a violation sample.
7. The content auditing method of claim 1, further comprising:
obtaining a sample library, and extracting the characterizations of the samples in the sample library;
calculating the similarity between a violation characterization in the violation characterization database and the characterization of each sample, sorting the samples in the sample library in descending order of similarity, and taking the first preset number of samples in the sorting result as target samples;
determining a violation mark of each target sample; wherein the violation marks include violation and non-violation;
and determining the similarity between the characterization of the last sample in the sorting result and the violation characterization as the preset value.
8. The content auditing method according to claim 7, wherein after said determining the violation mark of the target sample, the method further comprises:
determining the proportion of the violation samples in the target samples according to the violation marks;
judging whether the proportion is larger than a preset proportion, if so, judging that the preset value meets the precision requirement.
9. A content auditing apparatus, comprising:
the first acquisition module is used for acquiring the violation characterization database; the violation characterization database comprises a plurality of violation characterizations, and the violation characterizations are extracted based on the characterization of the violation sample;
the extraction module is used for extracting the representation of the sample to be checked when the sample to be checked is received;
the first calculation module is used for calculating the similarity between the characterization of the sample to be checked and the violation characterization in the violation characterization database;
and the auditing module is used for judging the sample to be checked as the illegal sample when the similarity between the characterization of the sample to be checked and the target illegal characterization is larger than a preset value.
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the content auditing method according to any one of claims 1 to 8 when executing the computer program.
11. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the content auditing method according to any of claims 1 to 8.
CN202310457132.6A 2023-04-25 2023-04-25 Content auditing method and device, electronic equipment and storage medium Pending CN116610962A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310457132.6A CN116610962A (en) 2023-04-25 2023-04-25 Content auditing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310457132.6A CN116610962A (en) 2023-04-25 2023-04-25 Content auditing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116610962A true CN116610962A (en) 2023-08-18

Family

ID=87684533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310457132.6A Pending CN116610962A (en) 2023-04-25 2023-04-25 Content auditing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116610962A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117473339A (en) * 2023-12-28 2024-01-30 智者四海(北京)技术有限公司 Content auditing method and device, electronic equipment and storage medium
CN117473339B (en) * 2023-12-28 2024-04-30 智者四海(北京)技术有限公司 Content auditing method and device, electronic equipment and storage medium
CN117610002A (en) * 2024-01-22 2024-02-27 南京众智维信息科技有限公司 Multi-mode feature alignment-based lightweight malicious software threat detection method
CN117610002B (en) * 2024-01-22 2024-04-30 南京众智维信息科技有限公司 Multi-mode feature alignment-based lightweight malicious software threat detection method

Similar Documents

Publication Publication Date Title
CN116610962A (en) Content auditing method and device, electronic equipment and storage medium
CN110781299A (en) Asset information identification method and device, computer equipment and storage medium
CN110826006B (en) Abnormal collection behavior identification method and device based on privacy data protection
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
CN112860841A (en) Text emotion analysis method, device and equipment and storage medium
CN112256849B (en) Model training method, text detection method, device, equipment and storage medium
CN110619115B (en) Template creating method and device, electronic equipment and storage medium
CN110968689A (en) Training method of criminal name and law bar prediction model and criminal name and law bar prediction method
CN112214984A (en) Content plagiarism identification method, device, equipment and storage medium
CN112257413A (en) Address parameter processing method and related equipment
CN113076961B (en) Image feature library updating method, image detection method and device
CN113887551A (en) Target person analysis method based on ticket data, terminal device and storage medium
CN111523322A (en) Requirement document quality evaluation model training method and requirement document quality evaluation method
CN111783425A (en) Intention identification method based on syntactic analysis model and related device
CN116189215A (en) Automatic auditing method and device, electronic equipment and storage medium
CN116340892A (en) Video infringement prevention method and system based on blockchain, storage medium and platform
CN115617998A (en) Text classification method and device based on intelligent marketing scene
CN115437930A (en) Identification method of webpage application fingerprint information and related equipment
CN113449506A (en) Data detection method, device and equipment and readable storage medium
CN113421552A (en) Audio recognition method and device
CN113255670A (en) Unbalanced small sample target detection method and device and computer equipment
CN113836297A (en) Training method and device for text emotion analysis model
CN115935775A (en) Neural network model training method, device, equipment and storage medium
CN111753521B (en) Reading understanding method based on artificial intelligence and related equipment
CN111325024B (en) Risk item statistical method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination