CN112508432A - Advertisement potential risk detection method and device, electronic equipment, medium and product

Info

Publication number: CN112508432A
Application number: CN202011483791.XA
Authority: CN
Inventors: 郑诗琪; 李思; 王笑吉
Original assignee: Baidu International Technology Shenzhen Co ltd
Current assignee: Baidu International Technology Shenzhen Co Ltd
Priority date: 2020-12-15
Filing date: 2020-12-15
Publication date: 2021-03-16

Abstract

The present disclosure provides a method and an apparatus for detecting potential risks in advertisements, an electronic device, a computer-readable storage medium, and a computer program product, which relate to the field of computers and, in particular, to natural language processing and cloud computing. The implementation scheme is as follows: acquiring an advertisement material to be detected; acquiring one or more sets of text data in the advertisement material; in response to a preset risk detection condition being met, splicing at least one of the one or more sets of text data, as to-be-detected information, with a path address of a search engine to obtain a search link containing the to-be-detected information; acquiring a corresponding search result page according to the search link; and parsing the search result page to determine, according to the parsing result, whether the advertisement material is a risk material.

Description

Advertisement potential risk detection method and device, electronic equipment, medium and product

Technical Field

The present disclosure relates to the field of computer technologies, in particular to natural language processing and cloud computing, and more specifically to a method and an apparatus for detecting potential risks of advertisements, an electronic device, a computer-readable storage medium, and a computer program product.

Background

Cloud computing refers to a technology architecture in which an elastically scalable pool of shared physical or virtual resources is accessed over a network; the resources may include servers, operating systems, networks, software, applications, storage devices, and the like, and may be deployed and managed in an on-demand, self-service manner. Cloud computing can provide efficient and powerful data processing capacity for technical applications and model training in fields such as artificial intelligence and blockchain.

For internet products that contain advertisements, the review of advertisement materials is a very important step: the materials displayed online must comply with advertising law and must not cause adverse effects on society. However, many advertisement materials conceal their risks, for example by scrambling the character order of risky words, by substituting misspelled characters, pinyin, or traditional characters, or even by using uncommon terms that people outside the relevant circle do not recognize. Accurate identification and monitoring are therefore hard to achieve with either manual review or machine review based on rule matching against a related lexicon, which creates significant safety hazards and violation risks.

Disclosure of Invention

The present disclosure provides a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for advertisement potential risk detection.

According to an aspect of the present disclosure, there is provided an advertisement potential risk detection method, including: acquiring an advertisement material to be detected; acquiring one or more sets of text data in the advertisement material; in response to a preset risk detection condition being met, splicing at least one of the one or more sets of text data, as to-be-detected information, with a path address of a search engine to obtain a search link containing the to-be-detected information; acquiring a corresponding search result page according to the search link; and parsing the search result page to determine, according to the parsing result, whether the advertisement material is a risk material.

According to another aspect of the present disclosure, there is provided an advertisement potential risk detection apparatus, including: a first acquisition unit configured to acquire an advertisement material to be detected; a second acquisition unit configured to acquire one or more sets of text data in the advertisement material; a response unit configured to, in response to a preset risk detection condition being met, splice at least one of the one or more sets of text data, as to-be-detected information, with a path address of a search engine to obtain a search link containing the to-be-detected information; a third acquisition unit configured to acquire a corresponding search result page according to the search link; and an analysis unit configured to parse the search result page to determine, according to the parsing result, whether the advertisement material is a risk material.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the advertisement potential risk detection method.

According to another aspect of the present disclosure, a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to execute an advertisement potential risk detection method is provided.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements an advertisement potential risk detection method.

According to one or more embodiments of the disclosure, the detection efficiency of the risks of the advertising materials can be improved, and the potential safety hazards and the violation risks can be effectively reduced.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.

FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;

FIG. 2 shows a schematic diagram of advertising material according to an embodiment of the present disclosure;

FIG. 3 shows a flow diagram of an advertisement potential risk detection method according to an embodiment of the present disclosure;

FIG. 4 illustrates a flow diagram of advertisement potential risk detection in accordance with an exemplary embodiment of the present disclosure;

FIG. 5 illustrates a flow diagram of advertisement potential risk detection according to another exemplary embodiment of the present disclosure;

FIG. 6 shows a block diagram of an advertisement potential risk detection apparatus according to an embodiment of the present disclosure; and

FIG. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.

The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120.

Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.

In embodiments of the present disclosure, server 120 may run one or more services or software applications that enable the execution of the advertisement potential risk detection method.

In some embodiments, the server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.

In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.

Client devices 101, 102, 103, 104, 105, and/or 106 may be used to present the advertising material, to receive corresponding advertising material data, and so forth. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.

Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as Microsoft Windows, Apple iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., Google Chrome OS); or include various mobile operating systems, such as Microsoft Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.

Network 110 may be any type of network known to those skilled in the art that can support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, the one or more networks 110 may be a Local Area Network (LAN), an Ethernet-based network, a token ring, a Wide Area Network (WAN), the Internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, Wi-Fi), and/or any combination of these and/or other networks.

The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.

The computing units in server 120 may run one or more operating systems, including any of the operating systems described above as well as any commercially available server operating system. The server 120 may also run any of a variety of additional server applications and/or middle-tier applications, including HTTP servers, FTP servers, CGI servers, Java servers, database servers, and the like.

In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of the client devices 101, 102, 103, 104, 105, and 106.

In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or a smart cloud computing server or smart cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service scalability in traditional physical host and Virtual Private Server (VPS) services.

The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as picture files and risk word files. The databases 130 may reside in various locations. For example, a database used by the server 120 may be local to the server 120, or may be remote from the server 120 and communicate with it via a network-based or dedicated connection. The databases 130 may be of different types. In certain embodiments, a database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.

In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.

The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.

In the exemplary embodiment shown in fig. 2, advertising material for placement on an internet advertising product may generally include advertisement copy 201 (the accompanying text) and an advertisement image 202 (the accompanying picture). It should be understood that the advertising material may also include other data carriers, such as video (e.g., short videos, animated pictures, etc.) and voice; the advertising material in the present disclosure covers all possible forms of material information, which is not limited herein.

Before advertisement material is officially released at an advertisement delivery end, it needs to undergo a safety review. Text review is typically performed on the advertisement copy. When the copy uses uncommon, specialized risk terms, manual review often misses them because of the limits of the reviewers' professional knowledge; in ordinary machine review, the copy is only subjected to conventional text matching against a risk lexicon, and such terms basically cannot be recognized because the lexicon contains no corresponding entries. The content of the risk lexicon is generally words or phrases extracted from advertisement copy that reviewers have already labeled as risky, and it rarely contains uncommon terms known only to insiders. The review of advertisement materials is therefore limited, leaving significant safety hazards and violation risks.

There is therefore provided, in accordance with an exemplary embodiment of the present disclosure, an advertisement potential risk detection method 300, as shown in fig. 3, including: acquiring an advertisement material to be detected (step 310); acquiring one or more groups of text data in the advertising material (step 320); in response to the preset risk detection condition being met, splicing at least one group of text data in one or more groups of text data serving as information to be detected with a path address of a search engine to obtain a search link containing the information to be detected (step 330); acquiring a corresponding search result page according to the search link (step 340); and analyzing the search result page to judge whether the advertisement material is a risk material according to the analysis result (step 350).

According to the embodiment of the disclosure, the advertisement potential risk detection method can improve the detection efficiency of potential risks of advertisement materials, and effectively reduces potential safety hazards and violation risks.

In step 310, advertising material to be detected is obtained.

In some examples, the advertising material may be from a to-be-detected advertising material library, which may be a collection of collected advertising materials that need to be presented at one or more clients. Additionally or alternatively, the advertisement material to be displayed submitted via the client may also be directly obtained for detection, so that the material may be displayed at one or more clients when no risk is found after detection is completed.

It should be understood that the above description is by way of example only, and that other methods of obtaining advertising material are possible and not limiting.

At step 320, one or more sets of textual data in the advertising material are obtained.

According to some embodiments, the one or more sets of text data are from at least one of: the advertisement copy and the advertisement image. It should be understood that the one or more sets of text data may also come from other data carriers in the advertising material, such as video or voice, without limitation.

In some examples, the one or more sets of text data may take any possible textual form, including but not limited to Chinese text (including out-of-order, simplified, and traditional text), foreign-language text, character strings, and pinyin.

In some examples, the text in the acquired advertisement image may be recognized by Optical Character Recognition (OCR). Illustratively, words, phrases, or sentences at different positions in the image are recognized as different sets of text data. For example, as shown in fig. 2, the second image from top to bottom includes text data XXXX and XXX, each representing a possible word, phrase, or sentence. Thus, in this example, two sets of text data, XXXX and XXX, may be identified from the image. It should be understood that other suitable methods of identifying text data in an advertisement image are possible and are not limited herein.

In some examples, the advertisement copy may be identified as one set or multiple sets of text data, depending on the circumstances. Illustratively, where the advertisement copy includes a title, a summary, and so on, the title may be identified as one set of text data and the summary as another. Alternatively, each sentence of the copy, or each line of text, may be identified as a set of text data. Other suitable methods of identifying text data in the advertisement copy are possible and are not limited herein.
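As a minimal sketch of the grouping options just described, the following Python function returns either one set per field (title, summary) or one set per sentence. The function name, its parameters, and the list of sentence terminators are illustrative assumptions and are not taken from the disclosure.

```python
import re


def split_copy_into_sets(title: str, summary: str, per_sentence: bool = False) -> list[str]:
    """Group advertisement copy into sets of text data (hypothetical helper)."""
    if not per_sentence:
        # One set per field: the title is one set, the summary another.
        return [t for t in (title.strip(), summary.strip()) if t]
    # One set per sentence: split on common Chinese and Western terminators
    # and on line breaks, then drop empty pieces.
    parts = re.split(r"[。！？!?.\n]+", f"{title}\n{summary}")
    return [p.strip() for p in parts if p.strip()]
```

Which granularity works best is a design choice: per-field grouping keeps context together, while per-sentence grouping yields shorter pieces of to-be-detected information.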

In some examples, the advertising material may also include other forms of data carriers, such as video. In this case, frames may be extracted from the video to obtain a picture for each extracted frame, and the text in the obtained pictures is recognized, for example by Optical Character Recognition (OCR), to obtain one or more sets of text data. In some examples, the advertising material may also include a data carrier in the form of speech; for example, the speech information may be extracted and converted into text using Automatic Speech Recognition (ASR) to obtain one or more sets of text data. Other forms of data carrier and other ways of acquiring the sets of text data are of course possible and are not limited herein.

In step 330, in response to a preset risk detection condition being met, at least one of the one or more sets of text data is spliced, as to-be-detected information, with a path address of a search engine to obtain a search link containing the to-be-detected information.

According to some embodiments, the method 300 may further comprise: before splicing at least one group of text data in one or more groups of text data as the information to be detected with the path address of the search engine (i.e. before executing step 330), at least one group of text data in one or more groups of text data is matched with the risk lexicon. If the direct matching with the risk lexicon is successful, the advertisement material can be directly considered as the risk material, and the step 330 is not needed, so that the detection efficiency is improved.

In some examples, the preset risk detection condition includes: none of the one or more sets of text data matches the risk lexicon.

According to some embodiments, none of the one or more sets of text data matching the risk lexicon may mean that no set among the one or more sets of text data contains a risk word from the risk lexicon. Illustratively, the match against the risk lexicon may be an exact text match.

It should be understood that other possible matching manners are possible and are not limited herein.
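The pre-check and the preset risk detection condition can be sketched as follows, assuming the risk lexicon is held in memory as a set of strings and that "exact match" means a risk word occurs verbatim as a substring of a text set. Both function names are hypothetical.

```python
def contains_risk_word(text: str, risk_lexicon: set[str]) -> bool:
    """True if any non-empty lexicon entry occurs verbatim in the text."""
    return any(word and word in text for word in risk_lexicon)


def preset_condition_met(text_sets: list[str], risk_lexicon: set[str]) -> bool:
    """True when no text set contains a risk word.

    In that case direct lexicon matching found nothing, so the
    search-engine-based detection would be triggered.
    """
    return not any(contains_risk_word(t, risk_lexicon) for t in text_sets)
```

When `preset_condition_met` returns `False`, the material can be treated as risky right away and no search link needs to be built.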

According to some embodiments, matching at least one of the one or more sets of text data against the risk lexicon may include: in response to the multiple sets of text data coming from both the advertisement copy and the advertisement image, matching the text data from the advertisement image against the text data from the advertisement copy; in response to all sets of text data from the advertisement image being contained in the text data from the advertisement copy, matching the text data from the advertisement copy against the risk lexicon; and in response to at least one set of text data from the advertisement image not being contained in the text data from the advertisement copy, matching that at least one set of text data from the advertisement image, together with the text data from the advertisement copy, against the risk lexicon.

In the exemplary embodiment shown in FIG. 4, the advertising material includes advertisement copy and an advertisement image. First, the advertisement material to be detected is obtained (step 410), and the text data in the image and in the copy is obtained (step 420). It is then determined whether all sets of text data from the advertisement image are directly contained in the advertisement copy (step 430). If not (step 430, no), risk detection is performed on the image text data sets that are not contained in the copy (step 440) and then on the text data sets of the copy (step 450); if all are contained (step 430, yes), risk detection is performed only on the text data sets of the copy (step 450). As described above, the risk detection may be the operation of, in response to a preset risk detection condition being met, splicing at least one of the one or more sets of text data, as to-be-detected information, with a path address of a search engine to obtain a search link containing the to-be-detected information. For example, before that splicing, the at least one set of text data may first be matched against the risk lexicon.
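The filtering in steps 430 to 450 can be sketched as below. The sketch assumes that "directly contained" means an image text set occurs verbatim inside at least one copy text set; the function name and the ordering of the returned list (uncontained image sets first, then copy sets) are illustrative choices.

```python
def select_sets_for_detection(copy_sets: list[str], image_sets: list[str]) -> list[str]:
    """Return the text sets that still need risk detection.

    Image text sets already contained in the copy are skipped, because
    checking the copy covers them (step 430, yes). Image sets that are not
    contained are checked first (step 440), followed by the copy (step 450).
    """
    uncontained = [
        s for s in image_sets
        if not any(s in c for c in copy_sets)
    ]
    return uncontained + list(copy_sets)
```

Skipping image text that merely repeats the copy avoids issuing duplicate lexicon checks and search requests for the same wording.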

In some examples, a general path address of the relevant search engine may be obtained. For example, for Baidu, the path address to be spliced is "https://www.baidu.com/s?wd=", which can be saved in advance and then spliced with the information to be detected. For example, if the information to be detected is "XX", the search link obtained after splicing is "https://www.baidu.com/s?wd=XX". It should be understood, however, that this is merely an example: different search engines may have different splicing rules, and any other splicing manner is possible, including but not limited to converting the text to be detected into UTF-8 or GBK (GB2312) encoding and then splicing according to the corresponding rules.

In some examples, the path addresses of multiple search engines may be pre-saved to adaptively obtain corresponding path addresses according to a selected search engine for splicing according to their corresponding rules. In this way, multiple search engines can be adapted to meet detection requirements in multiple application scenarios.
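A minimal sketch of the splicing step with several pre-saved path addresses follows. It assumes the Baidu path address has the form "https://www.baidu.com/s?wd="; the second entry, `search.example.com`, is a made-up placeholder, and the table and function names are hypothetical. Percent-encoding uses the standard library's `urllib.parse.quote`, whose `encoding` parameter allows UTF-8 or GBK.

```python
from urllib.parse import quote

# Pre-saved path addresses, keyed by search engine name.
# "example" is a hypothetical placeholder, not a real search engine.
SEARCH_PATHS = {
    "baidu": "https://www.baidu.com/s?wd=",
    "example": "https://search.example.com/search?q=",
}


def build_search_link(info: str, engine: str = "baidu", encoding: str = "utf-8") -> str:
    """Splice the to-be-detected information onto the engine's path address."""
    base = SEARCH_PATHS[engine]
    # Percent-encode the text in the requested character encoding so that
    # non-ASCII characters and reserved characters are safe inside a URL.
    return base + quote(info, safe="", encoding=encoding)
```

For instance, `build_search_link("XX")` yields "https://www.baidu.com/s?wd=XX", and passing `encoding="gbk"` would encode Chinese text as GBK bytes before percent-encoding.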

According to some embodiments, splicing at least one group of text data in one or more groups of text data as the information to be detected and the path address of the search engine includes: performing word segmentation processing on at least one group of character data in one or more groups of character data to obtain one or more word segmentation data; and splicing the obtained word segmentation data serving as the information to be detected and the path address of the search engine respectively. The character data after word segmentation are respectively used as information to be detected to be spliced, so that the accuracy of detecting the risk of the advertising materials through a search engine can be further improved.
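To illustrate segmenting before splicing, the sketch below uses forward maximum matching against a small vocabulary. This is a classic dictionary-based segmentation technique swapped in for illustration only; the disclosure does not specify it. The vocabulary, `max_len`, and both function names are assumptions.

```python
from urllib.parse import quote


def forward_max_match(text: str, vocabulary: set[str], max_len: int = 4) -> list[str]:
    """Segment text by greedily taking the longest vocabulary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


def links_for_segments(tokens: list[str], path_address: str) -> list[str]:
    """Splice each segmented word separately onto the search engine path address."""
    return [path_address + quote(tok, safe="") for tok in tokens]
```

Each resulting link searches for one segmented word, so a risky term embedded in a longer sentence is queried on its own.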

In some examples, the at least one set of text data consists of the sets of advertisement copy text data together with the sets of advertisement image text data that remain after match filtering against the copy, that is, the image text data sets that are not contained in the copy text data.

In some examples, it may also be configured to perform the risk detection operation based on the search engine on one or more groups of text data in sequence according to a predetermined order or rule, and when a certain group of text data is detected as having a risk, the risk detection operation of the next group of text data is not performed any more.
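The early-stopping behaviour can be sketched as a simple loop. Here `is_risky` stands for whatever search-engine-based check is used for a single text set; it is passed in as a callable because its implementation is outside this sketch, and the function name is hypothetical.

```python
from typing import Callable, Iterable, Optional


def first_risky_set(text_sets: Iterable[str], is_risky: Callable[[str], bool]) -> Optional[str]:
    """Check text sets in order and stop at the first one found risky.

    Returns the risky text set, or None if every set passes. Later sets
    are never checked once a risk is found.
    """
    for text in text_sets:
        if is_risky(text):
            return text
    return None
```

Because each check may involve a network request, stopping at the first hit saves the remaining requests for material that is already known to be risky.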

According to some embodiments, word segmentation may be performed on at least one of the one or more sets of text data by a trained network model suited to risk detection. It should be understood that other methods capable of performing the word segmentation are possible and are not limited herein.

In step 340, a corresponding search result page is obtained according to the search link.

According to some embodiments, the retrieved search results page is an HTML page.

HTML (HyperText Markup Language) is a markup language that includes a series of tags for marking each part of a web page to be displayed; these tags unify document formats on the network and connect scattered Internet resources into a logical whole. An HTML page is a descriptive page composed of HTML commands that tell the browser how to display its content. The HTML commands may describe text, graphics, animations, sounds, tables, links, and so on. Parsing the HTML page allows the relevant identifiers and content of the page to be obtained more conveniently and quickly, further improving detection efficiency; and because the search page does not need to be rendered, detection time is shortened.

It should be understood, however, that other forms of search results pages are possible and are not limiting herein.
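Fetching the raw HTML without rendering can be sketched with the standard library alone. The User-Agent string, the timeout, and both function names are illustrative assumptions; a real search engine may impose its own access rules on automated requests.

```python
from urllib.request import Request, urlopen


def make_request(search_link: str) -> Request:
    """Build an HTTP GET request for the search link (hypothetical helper)."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    return Request(search_link, headers={"User-Agent": "Mozilla/5.0 (risk-check sketch)"})


def fetch_result_page(search_link: str, timeout: float = 10.0) -> str:
    """Download the search result page as HTML text, without rendering it."""
    with urlopen(make_request(search_link), timeout=timeout) as resp:
        # Use the charset declared by the server, defaulting to UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Only the markup is downloaded; no scripts run and no layout is computed, which is what keeps this step fast compared with loading the page in a browser.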

In step 350, the search result page is parsed to determine whether the advertisement material is a risk material according to the parsing result.

According to some embodiments, step 350 may include: analyzing the search result page to identify a natural search result in the search result page; and carrying out risk detection on the natural search result so as to judge whether the advertisement material is a risk material according to the risk detection result.

Natural search, also referred to as organic search, produces natural search results, which are typically results generated without paid advertising. Its counterpart is sponsored search, whose results are typically promotional results (e.g., advertisements). Natural search results are the results that a search engine, using its own algorithm, selects from all the websites in its index database and returns to the user for the search keywords. Such results are not controlled by advertising, and their ranking is determined automatically and entirely by the algorithm.

In embodiments according to the present disclosure, sponsored search results may generally be considered to carry the lowest risk, because they are generally presented as promotional information only after strict review. Therefore, first identifying the natural search results in the obtained search result page, and performing risk detection on those results, greatly improves detection efficiency and speed.

According to some embodiments, parsing the search result page to identify natural search results in the search result page may include: parsing the search result page, identifying a natural content identifier in the search result page, and determining, according to the natural content identifier, that the corresponding search result is a natural search result.

In some embodiments, the retrieved search result page may be an HTML page. In the HTML page, natural search results are tagged with corresponding natural content identifiers, which may differ between search engines. For example, in some search engines, the natural content identifier is class="c-result result", and when this identifier is detected, the current search result can be considered a natural search result. It should be understood, however, that the above natural content identifier is provided merely as an example for ease of understanding and is not intended to be limiting.
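By way of illustration only, identifying natural search results by their natural content identifier might be sketched as follows. The sketch assumes the example identifier given above (class tokens "c-result" and "result"); the names NaturalResultFinder and find_natural_results are hypothetical and are not part of the disclosure.

```python
from html.parser import HTMLParser

# Assumed marker, copied from the example above; real engines use other values.
NATURAL_CLASSES = {"c-result", "result"}
# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NaturalResultFinder(HTMLParser):
    """Collects the text of every element whose class list carries the marker."""

    def __init__(self):
        super().__init__()
        self.results = []   # text of each natural search result, in page order
        self._depth = 0     # nesting depth inside the current result (0 = outside)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if NATURAL_CLASSES <= classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse whitespace so each result is a single line of text.
            self.results.append(" ".join(" ".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def find_natural_results(page_html):
    """Return the text of each natural search result found in page_html."""
    finder = NaturalResultFinder()
    finder.feed(page_html)
    finder.close()
    return finder.results
```

Because the page is only parsed and never rendered, this step needs no browser. The depth counter assumes that non-void elements are explicitly closed; a page with omitted closing tags would need a more tolerant parser.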

According to some embodiments, the risk detection of the natural search result to determine whether the advertisement material is a risk material according to the risk detection result may include: at least one of title text data and abstract text data of the natural search result is used as content to be matched to perform rule matching with a risk word bank; and in response to a successful match, stopping the risk detection and judging the advertising material as a risk material.

In the present disclosure, the above-mentioned "at least one" means that detection may be set to cover only the title text data, only the abstract text data, or both the title text data and the abstract text data.

According to some embodiments, risk detection is performed on the natural search result to judge whether the advertisement material is a risk material according to the risk detection result, further comprising: analyzing the search result page to obtain a landing page path address of a natural search result, and acquiring corresponding landing page content according to the landing page path address; taking character data in landing page content as content to be matched to perform rule matching with a risk word bank; and in response to a successful match, stopping the risk detection and judging the advertising material as a risk material.
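As a minimal sketch of the landing page step, the landing page could be fetched by its path address and reduced to its visible text before rule matching. The helper names VisibleText, landing_page_text and fetch_landing_page are hypothetical, and skipping script and style content is an assumption made here so that only displayed text is matched.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class VisibleText(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def landing_page_text(html):
    """Return the text data of a landing page as one string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch_landing_page(url, timeout=10):
    """Download the landing page at url and decode it to text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The string returned by landing_page_text would then serve as the content to be matched against the risk word bank.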

According to some embodiments, the rule in rule matching the content to be matched with the risk thesaurus may include: one or more words in the risk word bank exist in the content to be matched.

It should be understood that other rules that may be used to achieve matching of the content to be matched with the risk thesaurus are possible and are not limited herein.
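For illustration, the containment rule described above, applied to the title and/or abstract of a natural search result, might look like the following sketch. The function names and the two boolean switches are hypothetical; they only mirror the "at least one of title and abstract" option described earlier.

```python
def find_risk_words(content, risk_lexicon):
    """Return the lexicon words that occur in content (plain substring rule)."""
    return [word for word in risk_lexicon if word and word in content]


def result_is_risky(title, summary, risk_lexicon,
                    check_title=True, check_summary=True):
    """Apply the rule to the title and/or the summary; stop at the first hit."""
    if not (check_title or check_summary):
        raise ValueError("at least one of title and summary must be checked")
    if check_title and find_risk_words(title, risk_lexicon):
        return True
    return bool(check_summary and find_risk_words(summary, risk_lexicon))
```

Substring containment is the simplest possible rule; other rules, such as requiring several words to co-occur, could replace find_risk_words without changing the caller.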

According to some embodiments, the method 300 may further comprise: in response to determining that the advertisement material is a risk material, identifying and recording the search keywords marked by the search engine in at least one of the title text data and the abstract text data of the natural search result.

In some examples, in a portion of an HTML search result page, the text enclosed between the <em> and </em> tags is a search keyword marked by the search engine. Illustratively, the marked search keywords are typically rendered in a red font in the browser. It should be understood that the above example is merely one possible form shown for convenience in describing the search keyword; the disclosure is not limited thereto, and any other form is possible.
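A minimal sketch of recording the marked search keywords, assuming the search engine wraps them in em tags as in the example above, is given below. The function name marked_keywords is hypothetical.

```python
import re
from html import unescape

# Matches <em>...</em> but not other tags that merely start with "em".
EM_PATTERN = re.compile(r"<em\b[^>]*>(.*?)</em>", re.IGNORECASE | re.DOTALL)


def marked_keywords(fragment):
    """Return the distinct keywords marked with <em> in an HTML fragment."""
    keywords = []
    for raw in EM_PATTERN.findall(fragment):
        # Drop any nested tags and decode entities such as &amp;.
        text = unescape(re.sub(r"<[^>]+>", "", raw)).strip()
        if text and text not in keywords:
            keywords.append(text)
    return keywords
```

The keywords are returned in order of first appearance, which keeps the record stable when the same keyword is marked in both the title and the abstract.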

In some examples, risk detection is performed on the text data in the advertisement matching words and/or the advertisement matching pictures. As shown in fig. 5, one or more groups of text data are first matched with the risk thesaurus (step 510). If the matching is not successful (step 510, "not matching"), at least one group of text data in the one or more groups of text data is spliced, as information to be detected, with the path address of a search engine to obtain a search link containing the information to be detected, and a corresponding search result page is obtained according to the search link (step 520). The natural search results in the obtained search result page are identified (step 530), and at least one of the title and the abstract of each natural search result is rule-matched with the risk thesaurus (step 540). If the match is successful (step 540, "match"), the corresponding search keyword is recorded (step 550) and the advertisement material is marked as a risk material (step 580). If the match is not successful (step 540, "no match"), the landing page of the natural search result is then rule-matched with the risk thesaurus (step 560). If that match is successful (step 560, "match"), the corresponding search keyword is recorded (step 550) and the advertisement material is marked as a risk material (step 580); if it is not successful (step 560, "no match"), the advertisement material is marked as a risk-free material (step 570).
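The flow of fig. 5 can be sketched roughly as follows. This is an illustrative outline only: NaturalResult, Verdict, detect, and the injected callables fetch_results and fetch_landing_text are hypothetical names, the search path address passed as search_base is a placeholder, and treating a match at step 510 as an immediate risk verdict is an assumption, since the passage above only describes the branch taken when step 510 does not match.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List
from urllib.parse import quote


@dataclass
class NaturalResult:
    title: str
    summary: str
    landing_url: str
    keywords: List[str] = field(default_factory=list)  # marked by the engine


@dataclass
class Verdict:
    risky: bool
    keywords: List[str] = field(default_factory=list)


def hits(text: str, lexicon: Iterable[str]) -> bool:
    """Containment rule: any lexicon word occurs in text."""
    return any(word in text for word in lexicon)


def detect(text_groups: List[str], lexicon: List[str], search_base: str,
           fetch_results: Callable[[str], List[NaturalResult]],
           fetch_landing_text: Callable[[str], str]) -> Verdict:
    # Step 510: match the material's own text against the lexicon.
    # Assumption: a direct match is already a risk verdict.
    if any(hits(group, lexicon) for group in text_groups):
        return Verdict(True)
    for group in text_groups:
        # Step 520: splice the text onto the search engine's path address.
        link = search_base + quote(group)
        # Step 530: fetch_results is expected to return natural results only.
        for result in fetch_results(link):
            # Step 540: title and abstract first, since they are already in hand.
            if hits(result.title, lexicon) or hits(result.summary, lexicon):
                return Verdict(True, result.keywords)   # steps 550 and 580
            # Step 560: fall back to the landing page content.
            if hits(fetch_landing_text(result.landing_url), lexicon):
                return Verdict(True, result.keywords)   # steps 550 and 580
    return Verdict(False)                               # step 570
```

Passing the network operations in as callables keeps the control flow testable without issuing real searches, and returning at the first successful match reflects the "stop the risk detection" behaviour described above.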

In some examples, only at least one of the title and the abstract in the natural search result may be matched with the risk thesaurus by rules, and if the matching is successful, the search keyword is recorded and the advertisement material is marked as risk material, otherwise, the advertisement material is directly marked as risk-free material.

According to some embodiments, the method 300 may further comprise: in response to determining that the advertisement material is a risk material, extending the risk thesaurus based on the advertisement material. This maintains the accuracy of the risk thesaurus while further improving the speed and efficiency of subsequently detecting other advertisement materials.

According to some embodiments, augmenting the risk thesaurus based on the advertising material comprises: extracting core texts from one or more groups of character data of the advertisement materials judged as risk materials; refining the extracted core text based on the recorded search keywords; and adding the refined core text into a risk word bank according to the risk type.

In some examples, one or more sets of textual data for advertising material determined to be risk material may be subject to core text extraction by a trained network model applicable to risk detection.

In some examples, a key word extraction algorithm may be used for core text extraction of one or more sets of textual data of an advertising material determined to be a risk material, and algorithms that may be used for key word extraction include, but are not limited to: TF-IDF keyword extraction method, Topic-model keyword extraction method, RAKE keyword extraction method, TextRank algorithm, LDA algorithm, TPR algorithm and the like.

It should be understood that the above examples are only possible methods that may be used to implement core text extraction, but are not limited thereto, and any other possible method is possible.
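As one illustration of the listed options, a bare-bones TF-IDF keyword extraction over already segmented text might be sketched as follows. The function name tfidf_keywords is hypothetical, word segmentation is assumed to have been done beforehand, and the unsmoothed logarithmic IDF used here is only one common variant.

```python
import math
from collections import Counter


def tfidf_keywords(documents, doc_index, top_k=3):
    """Return the top_k tokens of documents[doc_index] ranked by TF-IDF.

    documents is a list of token lists, one per group of text data.
    """
    counts = Counter(documents[doc_index])
    total = sum(counts.values())
    if total == 0:
        return []
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    n_docs = len(documents)
    scores = {
        token: (count / total) * math.log(n_docs / doc_freq[token])
        for token, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
```

Tokens that occur in every document score zero under this variant, so words common to all materials drop to the bottom of the ranking.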

In some examples, core text extraction may be performed periodically (e.g., every few days) on the materials judged to be risk materials on the basis of the search engine. The obtained words or phrases are classified by risk type and added to the risk thesaurus so as to expand it, for example after further refinement by auditors based on the previously recorded search keywords marked by the search engine and on their own reading comprehension.
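The bookkeeping side of this expansion might be sketched as below. The disclosure leaves the refinement to auditors, so the refine function is only an assumed pre-filter that shortlists candidates overlapping a recorded search keyword before human review; the names refine and extend_lexicon and the mapping from risk type to word set are likewise hypothetical.

```python
from collections import defaultdict


def refine(candidates, recorded_keywords):
    """Shortlist candidates that overlap a recorded keyword (assumed heuristic)."""
    return [c for c in candidates
            if any(c in k or k in c for k in recorded_keywords)]


def extend_lexicon(lexicon, risk_type, terms):
    """Add terms under risk_type and return the terms that were actually new."""
    bucket = lexicon[risk_type]
    added = [t for t in dict.fromkeys(terms) if t not in bucket]
    bucket.update(added)
    return added
```

A usage example under the same assumptions:

```python
lexicon = defaultdict(set, {"gambling": {"casino"}})
shortlist = refine(["casino bonus", "free shipping", "bonus"], ["bonus"])
extend_lexicon(lexicon, "gambling", shortlist)
```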

According to an exemplary embodiment of the present disclosure, as shown in fig. 6, there is also provided an advertisement potential risk detection apparatus 600, including: a first obtaining unit 610 configured to obtain an advertisement material to be detected; a second obtaining unit 620, configured to obtain one or more sets of text data in the advertisement material; the response unit 630 is configured to respond to that a preset risk detection condition is met, splice at least one group of text data in the one or more groups of text data as to-be-detected information with a path address of a search engine, so as to obtain a search link containing the to-be-detected information; a third obtaining unit 640, configured to obtain a corresponding search result page according to the search link; and an analysis unit 650 configured to analyze the search result page to determine whether the advertisement material is a risk material according to an analysis result.

According to some embodiments, the analysis unit 650 comprises: a unit configured to parse the search result page to identify natural search results in the search result page; and a unit configured to perform risk detection on the natural search results so as to determine, according to the risk detection result, whether the advertisement material is a risk material.

Here, the operations of the above units 610 to 650 of the advertisement potential risk detection apparatus 600 are similar to the operations of the above steps 310 to 350, and are not described herein again.

There is also provided, in accordance with an exemplary embodiment of the present disclosure, an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the advertisement potential risk detection method described above.

There is also provided, in accordance with an exemplary embodiment of the present disclosure, a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to execute the above advertisement potential risk detection method.

There is also provided, in accordance with an exemplary embodiment of the present disclosure, a computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the above advertisement potential risk detection method.

Referring to fig. 7, a block diagram of the structure of an electronic device 700 will now be described. The electronic device 700 may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the device 700; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote controller. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 708 may include, but is not limited to, magnetic or optical disks. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.

Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the method 300. For example, in some embodiments, the method 300 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When the computer program is loaded into RAM 703 and executed by the computing unit 701, one or more steps of the method 300 described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method 300 by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems and apparatus are merely exemplary embodiments or examples and that the scope of the present invention is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims

1. An advertisement potential risk detection method, comprising:

acquiring an advertisement material to be detected;

acquiring one or more groups of character data in the advertisement material;

in response to the condition that a preset risk detection condition is met, splicing at least one group of character data in the one or more groups of character data serving as information to be detected with a path address of a search engine to obtain a search link containing the information to be detected;

acquiring a corresponding search result page according to the search link; and

analyzing the search result page to judge whether the advertisement material is a risk material according to an analysis result.

2. The method of claim 1, wherein parsing the search results page to determine whether the advertising material is a risk material according to the parsing result comprises:

analyzing the search result page to identify natural search results in the search result page; and

carrying out risk detection on the natural search result so as to judge whether the advertisement material is a risk material according to the risk detection result.

3. The method of claim 1, wherein the one or more sets of textual data are from at least one of: advertisement matching words and advertisement matching pictures.

4. The method of claim 3, further comprising:

before splicing at least one group of text data in the one or more groups of text data serving as information to be detected with a path address of a search engine, matching at least one group of text data in the one or more groups of text data with a risk word bank,

wherein the preset risk detection condition comprises: none of the one or more groups of character data matches the risk word bank.

5. The method of claim 4, wherein none of the one or more groups of character data matching the risk word bank comprises:

no group of character data in the one or more groups of character data contains a risk word in the risk word bank.

6. The method of claim 4, wherein matching at least one of the one or more sets of textual data to a risk thesaurus comprises:

in response to the one or more groups of character data coming from both the advertisement matching words and the advertisement matching pictures, matching the character data from the advertisement matching pictures with the character data from the advertisement matching words;

in response to the character data from the advertisement matching words comprising all groups of character data from the advertisement matching pictures, matching the character data from the advertisement matching words with the risk word bank; and

in response to at least one group of character data from the advertisement matching pictures not being included in the character data from the advertisement matching words, matching the at least one group of character data from the advertisement matching pictures and the character data from the advertisement matching words with the risk word bank.

7. The method of claim 1, wherein splicing at least one of the one or more sets of textual data as information to be detected with a path address of a search engine comprises:

performing word segmentation processing on at least one group of character data in the one or more groups of character data to obtain one or more word segmentation data; and

splicing each piece of the obtained word segmentation data, as the information to be detected, with the path address of the search engine.

8. The method of claim 2, wherein the retrieved search results page is an HTML page.

9. The method of claim 8, wherein parsing the search results page to identify natural search results in the search results page comprises:

analyzing the search result page, identifying the natural content identification in the search result page, and judging the corresponding search result as a natural search result according to the natural content identification.

10. The method of claim 8, wherein risk detecting the natural search result to determine whether the advertising material is a risk material according to the risk detection result comprises:

at least one of the title text data and the abstract text data of the natural search result is used as a content to be matched to perform rule matching with a risk word bank; and

in response to a successful match, stopping the risk detection and judging the advertisement material as a risk material.

11. The method of claim 10, wherein risk detecting the natural search result to determine whether the advertisement material is a risk material according to the risk detection result further comprises:

parsing the search results page to obtain a landing page path address of the natural search results,

acquiring corresponding landing page contents according to the landing page path address;

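taking the character data in the landing page content as the content to be matched to perform rule matching with the risk word bank; and

in response to a successful match, stopping the risk detection and judging the advertisement material as a risk material.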

12. The method of claim 10 or 11, wherein the rule comprises: one or more words in the risk word bank exist in the content to be matched.

13. The method of claim 10 or 11, further comprising:

in response to determining that the advertisement material is a risk material, identifying and recording the search keywords marked by the search engine in at least one of the title text data and the abstract text data of the natural search result.

14. The method of claim 13, further comprising: in response to determining the advertising material as a risk material, extending the risk thesaurus based on the advertising material.

15. The method of claim 14, wherein augmenting the risk thesaurus based on the advertising material comprises:

extracting core texts from one or more groups of character data of the advertisement materials judged as risk materials;

refining the extracted core text based on the recorded search keywords; and

adding the refined core text into the risk word bank according to the risk type.

16. An advertisement potential risk detection apparatus comprising:

a first acquisition unit configured to acquire an advertisement material to be detected;

a second acquisition unit configured to acquire one or more groups of character data in the advertisement material;

a response unit configured to, in response to a preset risk detection condition being met, splice at least one group of character data in the one or more groups of character data, as information to be detected, with a path address of a search engine to obtain a search link containing the information to be detected;

a third acquisition unit configured to acquire a corresponding search result page according to the search link; and

an analysis unit configured to analyze the search result page so as to judge whether the advertisement material is a risk material according to an analysis result.

17. The apparatus of claim 16, wherein the analysis unit comprises:

a unit configured to analyze the search result page to identify natural search results in the search result page; and

a unit configured to carry out risk detection on the natural search results so as to judge whether the advertisement material is a risk material according to the risk detection result.

18. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-15.

19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-15.

20. A computer program product comprising a computer program, wherein the computer program implements the method of any one of claims 1-15 when executed by a processor.