CN112508432B

CN112508432B - Advertisement potential risk detection method and device, electronic equipment, medium and product

Info

Publication number: CN112508432B
Application number: CN202011483791.XA
Authority: CN
Inventors: 郑诗琪; 李思; 王笑吉
Original assignee: Baidu International Technology Shenzhen Co ltd
Current assignee: Baidu International Technology Shenzhen Co ltd
Priority date: 2020-12-15
Filing date: 2020-12-15
Publication date: 2024-08-02
Anticipated expiration: 2040-12-15
Also published as: CN112508432A

Abstract

The disclosure provides a method, a device, an electronic device, a computer readable storage medium and a computer program product for detecting advertisement potential risks, and relates to the field of computers, in particular to the fields of natural language processing and cloud computing. The implementation scheme is as follows: acquiring advertisement materials to be detected; acquiring one or more groups of text data in the advertisement materials; in response to meeting a preset risk detection condition, splicing at least one group of text data in one group or a plurality of groups of text data serving as information to be detected with a path address of a search engine to obtain a search link containing the information to be detected; acquiring a corresponding search result page according to the search link; and analyzing the search result page to judge whether the advertisement material is a risk material according to the analysis result.

Description

Advertisement potential risk detection method and device, electronic equipment, medium and product

Technical Field

The present disclosure relates to the field of computer technology, and in particular, to natural language processing and cloud computing, and more particularly, to a method, an apparatus, an electronic device, a computer readable storage medium, and a computer program product for advertisement risk detection.

Background

Cloud computing refers to a technical system that accesses an elastically scalable pool of shared physical or virtual resources over a network, where the resources may include servers, operating systems, networks, software, applications, storage devices, and the like, and may be deployed and managed in an on-demand, self-service manner. Cloud computing technology can provide efficient and powerful data processing capability for technical applications such as artificial intelligence and blockchain, and for model training.

For Internet products that contain advertisements, the review of advertisement materials is a very important step: the advertisement materials displayed online must meet the requirements of advertising law and must not adversely affect society. However, the risks in many advertisement materials may be well hidden. Risk words may be reordered, deliberately miswritten, or written in pinyin or traditional characters, and some materials even use uncommon terms that people outside the relevant circle cannot recognize. Accurate identification and monitoring are therefore difficult to achieve with either manual review or machine review based on rule matching against a risk lexicon, which leaves significant safety hazards and compliance risks.

Disclosure of Invention

The present disclosure provides a method, apparatus, electronic device, computer-readable storage medium, and computer program product for advertisement risk detection.

According to an aspect of the present disclosure, there is provided an advertisement risk potential detection method, including: acquiring advertisement materials to be detected; acquiring one or more groups of text data in the advertisement materials; in response to meeting a preset risk detection condition, splicing at least one group of text data in one group or a plurality of groups of text data serving as information to be detected with a path address of a search engine to obtain a search link containing the information to be detected; acquiring a corresponding search result page according to the search link; and analyzing the search result page to judge whether the advertisement material is a risk material according to the analysis result.

According to another aspect of the present disclosure, there is provided an advertisement potential risk detection apparatus, including: a first acquisition unit configured to acquire advertisement materials to be detected; a second acquisition unit configured to acquire one or more groups of text data in the advertisement materials; a response unit configured to, in response to a preset risk detection condition being met, splice at least one group of text data of the one or more groups of text data, as information to be detected, with a path address of a search engine to obtain a search link containing the information to be detected; a third acquisition unit configured to acquire a corresponding search result page according to the search link; and an analysis unit configured to analyze the search result page to judge whether the advertisement material is a risk material according to the analysis result.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the advertising risk detection method.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform an advertising risk potential detection method.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program when executed by a processor implements an advertising risk potential detection method.

According to one or more embodiments of the disclosure, the detection efficiency of advertisement material risks can be improved, and potential safety hazards and illegal risks are effectively reduced.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.

FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;

FIG. 2 shows a schematic diagram of advertising material in accordance with an embodiment of the present disclosure;

FIG. 3 illustrates a flow chart of an advertising risk potential detection method according to an embodiment of the disclosure;

FIG. 4 illustrates a flow chart of advertisement risk potential detection according to an example embodiment of the present disclosure;

FIG. 5 illustrates a flow chart of advertisement risk potential detection according to another exemplary embodiment of the present disclosure;

FIG. 6 shows a block diagram of an advertising risk potential detection device, according to an embodiment of the disclosure; and

Fig. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.

The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.

In embodiments of the present disclosure, server 120 may run one or more services or software applications that enable the performance of the advertisement risk potential detection method.

In some embodiments, server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.

In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.

The client devices 101, 102, 103, 104, 105, and/or 106 may be used to present advertising material placed or receive corresponding advertising material data, and the like. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.

Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as Microsoft Windows, Apple iOS, UNIX-like operating systems, Linux or Linux-like operating systems (e.g., Google Chrome OS); or include various mobile operating systems such as Microsoft Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays and other devices. The gaming system may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.

Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an Ethernet-based network, a token ring, a Wide Area Network (WAN), the Internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, Wi-Fi), and/or any combination of these and/or other networks.

The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.

The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.

In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.

In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that overcomes the drawbacks of high management difficulty and weak service scalability found in traditional physical host and Virtual Private Server (VPS) services.

The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as picture files and risk word files. The databases 130 may reside in a variety of locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The databases 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.

In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.

The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.

In an exemplary embodiment as shown in FIG. 2, advertising materials for placement on Internet advertising products may generally include advertisement copy 201 (the accompanying text) and an advertisement image 202 (the accompanying picture). It should be understood that other data carriers may be included in the advertising material, such as video (e.g., short video, animated pictures, etc.), voice, etc., and that the advertising material in this disclosure includes material information in a variety of possible forms, without limitation.

Before the advertisement materials are formally delivered to the advertisement delivery end, they need to undergo a safety review. Text review is typically based on the advertisement copy. When the copy uses a rare, specialized risk term, manual review often misses it because of the limits of the reviewers' expertise. Ordinary machine review performs only conventional text matching between the advertisement copy and a risk lexicon, and such terms essentially cannot be identified because the lexicon contains no corresponding entries. The content of the risk lexicon is generally words or phrases extracted from advertisement copy that reviewers have marked as risky, and it rarely contains uncommon terms known only to the people involved. The review of advertisement materials is therefore limited, and the safety hazards and the risk of violations are high.

Accordingly, there is provided, in accordance with an exemplary embodiment of the present disclosure, an advertising risk potential detection method 300, as shown in fig. 3, comprising: acquiring advertisement materials to be detected (step 310); acquiring one or more sets of text data in the advertising material (step 320); in response to the preset risk detection condition being met, at least one set of text data in the one or more sets of text data is spliced with a path address of the search engine as information to be detected, so as to obtain a search link containing the information to be detected (step 330); acquiring a corresponding search result page according to the search link (step 340); and parsing the search result page to determine whether the advertisement material is a risk material according to the parsing result (step 350).

According to the embodiment of the disclosure, the advertisement potential risk detection method can improve the detection efficiency of the potential risk of advertisement materials and effectively reduce potential safety hazards and illegal risks.
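The five steps of method 300 can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `detect_ad_risk`, `fetch_page`, and `page_is_risky` are hypothetical, the risk lexicon is a plain in-memory set, and page fetching and page analysis are passed in as callables so that the sketch does not depend on any particular search engine.

```python
from typing import Callable, Iterable
from urllib.parse import quote


def detect_ad_risk(
    text_groups: Iterable[str],
    risk_lexicon: set[str],
    search_path: str,
    fetch_page: Callable[[str], str],
    page_is_risky: Callable[[str], bool],
) -> bool:
    """Return True if the advertisement material is judged to be risk material."""
    # Steps 310/320 are assumed done: text_groups holds the extracted text data.
    groups = list(text_groups)

    # Pre-check: a direct lexicon hit makes the search-engine step unnecessary.
    if any(word in group for group in groups for word in risk_lexicon):
        return True

    # The preset condition (no lexicon match) is met, so steps 330-350 run.
    for group in groups:
        link = search_path + quote(group)  # step 330: splice the search link
        page = fetch_page(link)            # step 340: get the search result page
        if page_is_risky(page):            # step 350: analyse the page
            return True                    # stop at the first risky group
    return False
```

Injecting `fetch_page` and `page_is_risky` keeps the control flow testable without network access; a real deployment would supply an HTTP client and an HTML analysis routine in their place.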

In step 310, advertising material to be detected is acquired.

In some examples, the advertising material may come from a library of advertising materials to be detected, which may be a collection of advertising material that needs to be presented at one or more clients. Additionally or alternatively, advertising material submitted via a client for display may be acquired directly for detection, so that the material is displayed on one or more clients when no risk is found after the detection is completed.

It should be understood that the above description is by way of example only and that other methods of enabling advertising material to be obtained are possible and are not limiting herein.

At step 320, one or more sets of text data in the advertising material are obtained.

According to some embodiments, the one or more sets of text data come from at least one of: the advertisement copy and the advertisement image. It should be understood that the one or more sets of text data may also come from other data carriers in the advertising material, such as video or voice, without limitation.

In some examples, the one or more sets of text data include, but are not limited to, Chinese characters (including reordered, simplified, and traditional forms), foreign-language text, character strings, pinyin, and other possible text forms, without limitation.

In some examples, the text in the acquired advertisement image may be recognized by Optical Character Recognition (OCR) techniques. For example, words, phrases, or sentences at different positions in the image are recognized as different sets of text data. For example, as shown in fig. 2, the second image from top to bottom includes the text data XXXX and XXX, each of which represents a possible word, phrase, or sentence. Thus, in this example, two sets of text data, XXXX and XXX, can be recognized from the image. It should be understood that other suitable methods for recognizing text data in an advertisement image are possible and are not limited in this regard.

In some examples, the advertisement copy may be identified as one or more sets of text data depending on the actual situation. For example, where the advertisement copy includes a title, an abstract, etc., the title may be identified as one set of text data and the abstract as another set. Alternatively, each sentence in the advertisement copy may be identified as a set of text data, or each line of text may be identified as a set of text data, or the like. Other suitable methods of identifying text data in the advertisement copy are possible and are not limited in this regard.

In some examples, other forms of data carriers, such as video, may also be included in the advertising material. In this example, frames may be extracted from the video to obtain a picture for each extracted frame. The text in the obtained pictures is then recognized, for example by Optical Character Recognition (OCR) techniques, to obtain one or more sets of text data. In some examples, data carriers in the form of speech, etc., may also be included in the advertising material; for example, the speech information may be extracted and converted to text using Automatic Speech Recognition (ASR) techniques to obtain one or more sets of text data. Of course, other data carrier formats and other ways of obtaining the sets of text data are possible, without limitation.

In step 330, at least one set of text data of the one or more sets of text data is spliced with the path address of the search engine as the information to be detected in response to the preset risk detection condition being satisfied, so as to obtain a search link containing the information to be detected.

According to some embodiments, the method 300 may further comprise: matching at least one set of the one or more sets of text data against the risk lexicon before the at least one set is spliced, as the information to be detected, with the path address of the search engine (i.e., before step 330 is performed). If the match against the risk lexicon succeeds, the advertisement material can be directly determined to be risk material and step 330 need not be performed, which improves detection efficiency.

In some examples, the preset risk detection condition includes: none of the one or more sets of text data matches the risk lexicon.

According to some embodiments, the condition that none of the one or more sets of text data matches the risk lexicon may include: none of the one or more sets of text data contains a risk word from the risk lexicon. The matching against the risk lexicon described above may be, for example, exact text matching.

It should be understood that other possible ways of matching are possible and are not limiting herein.

According to some embodiments, matching at least one of the one or more sets of text data against the risk lexicon may include: in response to the sets of text data coming from both the advertisement copy and the advertisement image, matching the text data from the advertisement image against the text data from the advertisement copy; in response to the text data from the advertisement copy containing all sets of text data from the advertisement image, matching the text data from the advertisement copy against the risk lexicon; and in response to at least one set of text data from the advertisement image not being contained in the text data from the advertisement copy, matching both that at least one set of text data from the advertisement image and the text data from the advertisement copy against the risk lexicon.

In the exemplary embodiment shown in FIG. 4, the advertising material includes an advertisement image and advertisement copy. First, the advertisement material to be detected is acquired (step 410), and the text data in its image and copy is acquired (step 420). It is then judged whether all sets of text data in the advertisement image are directly contained in the advertisement copy (step 430). If not (step 430, "no"), risk detection is performed on the image text data sets not contained in the advertisement copy (step 440), and then on the text data sets of the advertisement copy (step 450); if all are contained (step 430, "yes"), risk detection is performed only on the text data sets of the advertisement copy (step 450). As described above, the risk detection may be the operation of "in response to the preset risk detection condition being met, splicing at least one set of text data of the one or more sets of text data, as the information to be detected, with the path address of the search engine to obtain a search link containing the information to be detected". For example, the at least one set of text data may be matched against the risk lexicon before being spliced with the path address of the search engine.
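The selection logic of steps 430 to 450 can be illustrated as follows. This is only a sketch: the function name `select_groups_for_detection` is hypothetical, and "contained in" is assumed here to mean plain substring containment within the combined advertisement copy, which the text does not prescribe.

```python
def select_groups_for_detection(
    copy_groups: list[str], image_groups: list[str]
) -> list[str]:
    """Return the text groups that need risk detection, following the FIG. 4 flow."""
    # Assumption: containment is tested as a substring of the joined copy text.
    copy_text = "\n".join(copy_groups)

    # Steps 430/440: keep only image text that the copy does not already contain.
    uncovered = [group for group in image_groups if group not in copy_text]

    # Step 450: the advertisement copy itself is always checked.
    return uncovered + copy_groups
```

Filtering out image text that merely repeats the copy avoids running the same detection twice on identical content.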

In some examples, a common path address of the respective search engine may be obtained. Taking Baidu as an example, the path address to be spliced is "https://www.baidu.com/s?word=", and the information to be detected is appended after the "=". For example, if the information to be detected is XX, the search link obtained after splicing is https://www.baidu.com/s?word=XX. It should be understood that this is merely an example; different search engines may have different splicing rules, and any other suitable splicing method is possible, including but not limited to converting the text information to be detected into UTF-8 or GBK (GB2312) encoding and splicing it according to the corresponding rules.

In some examples, path addresses of multiple search engines may be pre-saved to adaptively obtain corresponding path addresses according to a selected search engine to splice according to their corresponding rules. In this way, multiple search engines can be adapted to meet detection requirements in multiple application scenarios.
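As a rough sketch of pre-saved path addresses with per-engine splicing rules, the table below maps an engine key to a path address and a character encoding. The engine keys, the `example` domains, and the choice of encoding per engine are all hypothetical placeholders; only `urllib.parse.quote` from the Python standard library is real.

```python
from urllib.parse import quote

# Hypothetical path addresses and encodings; real engines define their own rules.
SEARCH_ENGINES = {
    "engine_a": ("https://search-a.example/s?word=", "utf-8"),
    "engine_b": ("https://search-b.example/find?q=", "gbk"),
}


def build_search_link(engine: str, info: str) -> str:
    """Splice the information to be detected onto the engine's path address."""
    path_address, encoding = SEARCH_ENGINES[engine]
    # Percent-encode the text in the encoding this engine is assumed to expect.
    return path_address + quote(info, safe="", encoding=encoding)
```

For instance, `build_search_link("engine_a", "广告")` yields `https://search-a.example/s?word=%E5%B9%BF%E5%91%8A`, while the same text sent to `engine_b` is percent-encoded from its GBK bytes instead.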

According to some embodiments, splicing at least one of the one or more sets of text data, as the information to be detected, with a path address of a search engine includes: performing word segmentation on the at least one set of text data to obtain one or more word segments; and splicing each obtained word segment, as information to be detected, with the path address of the search engine. Splicing the segmented words separately as information to be detected can further improve the accuracy of advertisement material risk detection through a search engine.

In some examples, the at least one set of text data comprises the text data sets of the advertisement copy, together with any text data sets of the advertisement image that remain after the text data in each image has been matched against, and filtered by, the text data in the advertisement copy (that is, image text data sets not contained in the advertisement copy text data).

In some examples, the search-engine-based risk detection operation may be performed on the one or more sets of text data one after another according to a predetermined order or rule, and once one set of text data is found to be risky, the risk detection operation is not performed on the next set of text data.

According to some embodiments, word segmentation may be performed on at least one of the one or more sets of text data by a trained network model suitable for risk detection. It should be understood that other methods capable of performing the word segmentation are possible and are not limiting herein.
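The segment-then-splice step can be sketched as below. The segmenter is passed in as a callable because the text leaves its implementation open (for example, a trained model); the name `links_from_segments` is hypothetical, and skipping empty or repeated segments is an added assumption that is not part of the source.

```python
from typing import Callable
from urllib.parse import quote


def links_from_segments(
    group: str,
    path_address: str,
    segment: Callable[[str], list[str]],
) -> list[str]:
    """Build one search link per distinct word segment of a text group."""
    seen: set[str] = set()
    links: list[str] = []
    for word in segment(group):
        word = word.strip()
        # Assumption: empty and repeated segments need not be searched again.
        if word and word not in seen:
            seen.add(word)
            links.append(path_address + quote(word, safe=""))
    return links
```

For a quick trial, whitespace splitting can stand in for the segmenter, e.g. `links_from_segments("cheap loans cheap", "https://search.example/s?word=", str.split)`; Chinese text would need a real segmentation component.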

In step 340, a corresponding search results page is obtained from the search link.

According to some embodiments, the retrieved search results page is an HTML page.

HTML (HyperText Markup Language) is a markup language that includes a series of tags. The tags mark the individual portions of a web page to be displayed and unify document formats on the network, so that distributed Internet resources are connected into a logical whole. An HTML page is a descriptive page made up of HTML commands that tell a browser how to display its content. HTML commands may specify text, graphics, animation, sound, tables, links, etc. Parsing the HTML page makes it quicker and more convenient to obtain the relevant tags and content of the page, which further improves detection efficiency; and because the search page does not need to be rendered, detection time is shortened.

It should be understood that other forms of search results pages are possible and are not limiting herein.

In step 350, the search results page is parsed to determine whether the advertisement material is a risk material according to the parsing result.

According to some embodiments, step 350 may include: analyzing the search result page to identify natural search results in the search result page; and performing risk detection on the natural search result to judge whether the advertisement material is a risk material according to the risk detection result.

The results of natural search, also called organic search, are referred to as natural search results; they are generally results that are not paid advertisements. The counterpart is sponsored search, whose results are typically promotional results (e.g., advertisements). In natural search, the search engine applies its own algorithm to all websites in its index database and returns to the user the search results for the search keywords. Such results are not controlled by advertising; their ranking is determined entirely automatically by the algorithm.

In embodiments according to the present disclosure, sponsored search results may generally be considered to carry the least risk, since they are typically presented as promotional information only after strict review. Identifying the natural search results in the acquired search results page first, and restricting risk detection to them, therefore greatly improves detection efficiency and speed.

According to some embodiments, parsing the search results page to identify natural search results in the search results page may include: analyzing the search result page, and identifying natural content identifiers in the search result page so as to judge that the corresponding search result is a natural search result according to the natural content identifiers.

In some embodiments, the retrieved search results page may be an HTML page. Natural search results are marked in HTML pages by corresponding natural content identifiers, which may differ between search engines. For example, in some search engines the natural content identifier is class="c-result result", and the current search result may be considered a natural search result when this identifier is detected. It should be understood that the above natural content identifier is merely an illustrative form given for ease of understanding and is not limiting herein.
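The identification step described above can be illustrated with a minimal sketch that scans the raw HTML, without rendering it, for elements carrying the example identifier. It assumes, purely for illustration, that each natural result is a div element whose class list contains both tokens of the example identifier; the class name NaturalResultParser and the function find_natural_results are hypothetical and not part of the disclosure.

```python
from html.parser import HTMLParser

# Class tokens taken from the example identifier in the text; real search
# engines use their own markup, so treat these as placeholders.
NATURAL_CLASSES = {"c-result", "result"}


class NaturalResultParser(HTMLParser):
    """Collects the text of every <div> whose class list contains NATURAL_CLASSES."""

    def __init__(self):
        super().__init__()
        self.results = []   # text of each natural search result found so far
        self._depth = 0     # <div> nesting depth inside the current natural result
        self._buffer = []   # text fragments of the current natural result

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            # Nested <div> inside a natural result: track depth only.
            self._depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if NATURAL_CLASSES <= classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.results.append(" ".join(self._buffer))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._buffer.append(data.strip())


def find_natural_results(html):
    """Return the text content of each natural search result in the page."""
    parser = NaturalResultParser()
    parser.feed(html)
    return parser.results
```

Results that lack the identifier, such as promotional entries, are simply never collected, so later risk detection only sees natural search results.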

According to some embodiments, performing risk detection on the natural search result to determine whether the advertisement material is a risk material according to the risk detection result may include: rule-matching at least one of the title text data and the abstract text data of the natural search result, as content to be matched, against a risk word stock; and, in response to a successful match, stopping risk detection and judging the advertisement material to be a risk material.

In this disclosure, the above-mentioned "at least one" means that detection may be configured to cover only the title text data, only the abstract text data, or both the title and abstract text data.

According to some embodiments, performing risk detection on the natural search result to determine whether the advertisement material is a risk material according to the risk detection result further includes: parsing the search result page to obtain a landing page path address of the natural search result, and obtaining the corresponding landing page content according to the landing page path address; rule-matching the text data in the landing page content, as content to be matched, against the risk word stock; and, in response to a successful match, stopping risk detection and judging the advertisement material to be a risk material.
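As a hedged illustration of the landing page check, the sketch below extracts the link addresses from the HTML of one natural search result, obtains each landing page through a caller-supplied fetch function, and tests the visible text against the risk word stock. The names landing_page_urls and landing_page_is_risky, the fetch parameter, and the assumption that landing page addresses appear as href attributes of anchor elements are all illustrative and not taken from the disclosure.

```python
from html.parser import HTMLParser


class _LinkParser(HTMLParser):
    """Collects href values of <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class _TextParser(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def landing_page_urls(result_html):
    """Return the landing page path addresses found in one natural result."""
    parser = _LinkParser()
    parser.feed(result_html)
    return parser.links


def landing_page_is_risky(result_html, fetch, risk_words):
    """fetch(url) -> HTML string is supplied by the caller (hypothetical hook)."""
    for url in landing_page_urls(result_html):
        parser = _TextParser()
        parser.feed(fetch(url))
        text = " ".join(parser.chunks)
        if any(word in text for word in risk_words):
            return True  # stop at the first successful match
    return False
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP client and makes it easy to exercise with canned pages.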

According to some embodiments, the rule in rule matching the content to be matched with the risk lexicon may include: one or more words in the risk word stock exist in the content to be matched.

It should be understood that other rules that may be used to achieve matching of content to be matched to the risk thesaurus are possible and are not limiting herein.
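The simplest rule named above, that one or more words of the risk word stock occur in the content to be matched, can be sketched as a substring test. The function names and the use_title and use_abstract switches are hypothetical; they only illustrate how the "at least one of title and abstract" configuration might be expressed.

```python
def match_risk_words(content, risk_lexicon):
    """Return the risk words from risk_lexicon that occur in content."""
    return [word for word in risk_lexicon if word in content]


def check_natural_result(title, abstract, risk_lexicon,
                         use_title=True, use_abstract=True):
    """Rule-match the selected fields of one natural search result.

    Returns (matched, hits), where hits lists the risk words that were found.
    """
    parts = []
    if use_title:
        parts.append(title)
    if use_abstract:
        parts.append(abstract)
    # Join with a newline so a risk word cannot match across the field boundary.
    hits = match_risk_words("\n".join(parts), risk_lexicon)
    return bool(hits), hits
```

A caller would stop risk detection and judge the advertisement material to be a risk material as soon as the first element of the returned pair is true.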

According to some embodiments, the method 300 may further comprise: in response to determining the advertising material as a risk material, search keywords marked by a search engine in at least one of title text data and summary text data of the natural search result are identified and recorded.

In some examples, the text enclosed between <em> and </em> is a search keyword marked by the search engine. Illustratively, the marked search keywords are typically rendered in a red font in the browser. It should be understood that the above example merely illustrates one possible form of the search keyword for convenience of description; it is not limiting, and any other possible form may be used.
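A minimal sketch of recording the marked keywords, assuming they are wrapped in em elements as in the example above, might look as follows. The names EmKeywordParser and marked_keywords are illustrative only.

```python
from html.parser import HTMLParser


class EmKeywordParser(HTMLParser):
    """Collects the distinct text fragments enclosed in <em>...</em>."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self._in_em = 0
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "em":
            self._in_em += 1
            if self._in_em == 1:
                self._buf = []

    def handle_endtag(self, tag):
        if tag == "em" and self._in_em:
            self._in_em -= 1
            if self._in_em == 0:
                keyword = "".join(self._buf).strip()
                # Record each keyword once, in order of first appearance.
                if keyword and keyword not in self.keywords:
                    self.keywords.append(keyword)

    def handle_data(self, data):
        if self._in_em:
            self._buf.append(data)


def marked_keywords(snippet_html):
    """Return search keywords marked by the search engine in a title or abstract."""
    parser = EmKeywordParser()
    parser.feed(snippet_html)
    return parser.keywords
```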

In some examples, risk detection is performed on the advertisement word and/or text data in the advertisement map. As shown in fig. 5, one or more sets of text data are first matched with a risk word stock (step 510). If the matching is unsuccessful (step 510, "mismatch"), at least one set of text data in the one or more sets of text data is spliced, as information to be detected, with a path address of a search engine to obtain a search link containing the information to be detected, and a corresponding search result page is obtained according to the search link (step 520). Natural search results in the obtained search result page are identified (step 530), and at least one of a title and an abstract in the natural search results is rule-matched with the risk word stock (step 540). If the match is successful (step 540, "match"), the corresponding search keyword is recorded (step 550) and the advertisement material is marked as risky material (step 580). If the match is unsuccessful (step 540, "mismatch"), the landing page of the natural search result is then rule-matched with the risk word stock (step 560). If that match is successful (step 560, "match"), the corresponding search keyword is recorded (step 550) and the advertisement material is marked as risky material (step 580); if it is unsuccessful (step 560, "mismatch"), the advertisement material is marked as risk-free material (step 570).

In some examples, only at least one of the title and the abstract in the natural search result is rule-matched with the risk word stock, without checking the landing page: if the matching is successful, the search keyword is recorded and the advertisement material is marked as risky material; otherwise, the advertisement material is directly marked as risk-free material.
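The flow of fig. 5, including the variant that skips the landing page, can be sketched as a single function. This is an illustration under stated assumptions, not the disclosed implementation: the search parameter is a hypothetical callable standing in for steps 520 and 530, each result it returns is assumed to be a dictionary with the keys title, abstract, keywords and landing_text, and a direct hit at step 510 is assumed to mark the material as risky, which this passage does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    risky: bool
    keywords: list = field(default_factory=list)  # recorded search keywords


def contains_risk_word(text, lexicon):
    return any(word in text for word in lexicon)


def detect(text_groups, lexicon, search, check_landing_pages=True):
    """Sketch of the fig. 5 flow.

    search(text) -> list of dicts with keys 'title', 'abstract',
    'keywords', 'landing_text' (hypothetical result shape).
    """
    # Step 510: match the ad text itself against the risk word stock.
    if any(contains_risk_word(text, lexicon) for text in text_groups):
        return Verdict(True)  # assumed outcome of a direct hit
    for text in text_groups:
        for result in search(text):  # steps 520-530
            snippet = result["title"] + "\n" + result["abstract"]
            if contains_risk_word(snippet, lexicon):  # step 540
                return Verdict(True, list(result["keywords"]))  # steps 550, 580
            if check_landing_pages and contains_risk_word(
                result["landing_text"], lexicon
            ):  # step 560
                return Verdict(True, list(result["keywords"]))  # steps 550, 580
    return Verdict(False)  # step 570
```

Setting check_landing_pages to False gives the shorter variant in which only titles and abstracts are matched.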

According to some embodiments, the method 300 may further comprise: in response to determining the advertising material as a risk material, augmenting the risk thesaurus based on the advertising material. This keeps the risk word stock accurate and can further improve the speed and efficiency of detecting other advertisement materials.

According to some embodiments, augmenting the risk thesaurus based on the advertising material comprises: extracting core text from one or more groups of text data of advertisement materials which are judged to be risk materials; refining the extracted core text based on the recorded search keywords; and adding the refined core text into the risk word stock according to the risk type.
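The three augmentation steps can be sketched as follows. The disclosure does not define how the core text is refined, so the rule used here, keeping only core terms that overlap a recorded search keyword, is an assumption made for illustration, as are the function names and the representation of the risk word stock as a mapping from risk type to a set of terms.

```python
from collections import defaultdict


def refine(core_terms, recorded_keywords):
    """Keep core terms that overlap a recorded keyword (assumed refinement rule)."""
    return [
        term for term in core_terms
        if any(term in kw or kw in term for kw in recorded_keywords)
    ]


def augment_lexicon(lexicon, core_terms, recorded_keywords, risk_type):
    """Add refined core terms to lexicon[risk_type]; return the terms newly added.

    lexicon is assumed to be a defaultdict(set) keyed by risk type.
    """
    added = []
    for term in refine(core_terms, recorded_keywords):
        if term not in lexicon[risk_type]:
            lexicon[risk_type].add(term)
            added.append(term)
    return added
```

For example, with a stock created as defaultdict(set), calling augment_lexicon(stock, core_terms, keywords, "finance") files the surviving terms under the "finance" risk type.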

In some examples, core text extraction may be performed on one or more sets of text data of advertising material determined to be a risk material through a trained network model suitable for risk detection.

In some examples, core text extraction may be performed on one or more sets of text data of advertising material determined to be a risk material using a keyword extraction algorithm. Algorithms that may be used for keyword extraction include, but are not limited to: the TF-IDF keyword extraction method, the topic-model keyword extraction method, the RAKE keyword extraction method, the TextRank algorithm, the LDA algorithm, the TPR algorithm, etc.

It should be appreciated that the above examples are only possible methods that may be used to implement core text extraction, but are not limited thereto and any other possible methods are possible.
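Of the listed options, TF-IDF is the simplest to illustrate. The sketch below scores the tokens of one piece of advertisement text against a small collection and returns the highest-scoring ones as core text candidates. It assumes the text has already been segmented into tokens; the function name tfidf_keywords and its parameters are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(documents, doc_index, top_k=3):
    """Return the top_k tokens of documents[doc_index] ranked by TF-IDF.

    documents is a list of token lists (word segmentation is assumed done).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    tf = Counter(documents[doc_index])
    total = len(documents[doc_index])
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_k]
```

Tokens that occur in every document receive a score of zero, so generic advertising words tend to fall out of the candidate list.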

In some examples, core text extraction may be performed periodically (e.g., every few days) on the risky material identified via the search engine. For example, an auditor can further refine the keywords based on the previously recorded search keywords and their own reading of the material, and the resulting words or phrases are classified by risk type and added to the risk word stock so as to expand it.

There is also provided, as shown in fig. 6, an advertising risk potential detecting apparatus 600 according to an exemplary embodiment of the present disclosure, including: a first acquiring unit 610 configured to acquire advertisement materials to be detected; a second obtaining unit 620 configured to obtain one or more sets of text data in the advertisement materials; a response unit 630, configured to splice at least one set of text data in the one or more sets of text data as information to be detected with a path address of a search engine in response to satisfaction of a preset risk detection condition, so as to obtain a search link containing the information to be detected; a third acquiring unit 640 configured to acquire a corresponding search result page according to the search link; and a parsing unit 650 configured to parse the search result page, so as to determine whether the advertisement material is a risk material according to the parsing result.

According to some embodiments, the parsing unit comprises: means for parsing the search results page to identify natural search results in the search results page; and a unit for performing risk detection on the natural search result to judge whether the advertisement material is a risk material according to the risk detection result.

Here, the operations of the above units 610 to 650 of the advertisement risk potential detection apparatus 600 are similar to the operations of the steps 310 to 350 described above, respectively, and are not repeated here.

There is also provided, in accordance with an exemplary embodiment of the present disclosure, an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the advertising risk detection method described above.

There is also provided in accordance with an exemplary embodiment of the present disclosure a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the above-described advertising risk potential detection method.

There is also provided in accordance with an exemplary embodiment of the present disclosure a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the advertising risk potential detection method described above.

Referring to fig. 7, a block diagram of an electronic device 700 that may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Various components in device 700 are connected to I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the device 700; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 708 may include, but is not limited to, magnetic disks and optical disks. The communication unit 709 allows the device 700 to exchange information/data with other devices through computer networks, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth(TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.

The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as method 300. For example, in some embodiments, the method 300 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. One or more of the steps of the method 300 described above may be performed when a computer program is loaded into RAM 703 and executed by computing unit 701. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method 300 by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the disclosure.

Claims

1. An advertising risk potential detection method, comprising:

acquiring advertisement materials to be detected;

acquiring one or more groups of text data in the advertisement materials;

in response to meeting a preset risk detection condition, splicing at least one group of text data in the one or more groups of text data, as information to be detected, with a path address of a search engine to obtain a search link containing the information to be detected;

acquiring a corresponding search result page according to the search link; and

parsing the search result page to judge whether the advertisement material is a risk material according to the parsing result,

wherein parsing the search result page to judge whether the advertisement material is a risk material according to the parsing result comprises:

parsing the search result page to identify natural search results in the search result page; and

performing risk detection on the natural search results to judge whether the advertisement material is a risk material according to the risk detection result.

2. The method of claim 1, wherein the one or more sets of text data are from at least one of: advertisement word matching and advertisement picture matching.

3. The method of claim 2, further comprising:

matching at least one set of text data in the one or more sets of text data with a risk word stock before splicing the at least one set of text data in the one or more sets of text data, as information to be detected, with a path address of a search engine,

wherein the preset risk detection condition includes: none of the one or more groups of text data is matched with the risk word stock.

4. The method of claim 3, wherein none of the one or more sets of literal data are matched with the risk thesaurus, comprising:

none of the one or more sets of text data contains any risk word in the risk word stock.

5. The method of claim 3, wherein matching at least one of the one or more sets of literal data to a risk lexicon comprises:

matching the text data from the advertisement match graph with the text data from the advertisement match word in response to the plurality of sets of text data from the advertisement match word and the advertisement match graph;

matching the text data from the advertisement match word with the risk word stock in response to the text data from the advertisement match word including all groups of text data from the advertisement match graph; and

matching both the at least one set of text data from the advertisement match graph and the text data from the advertisement match word with the risk word stock in response to at least one set of text data from the advertisement match graph not being included in the text data from the advertisement match word.

6. The method of claim 1, wherein stitching at least one of the one or more sets of literal data with a path address of a search engine as information to be detected comprises:

word segmentation is carried out on at least one group of text data in the one or more groups of text data so as to obtain one or more word segmentation data; and

splicing each of the obtained word segmentation data, as the information to be detected, with the path address of the search engine.

7. The method of claim 1, wherein the retrieved search results page is an HTML page.

8. The method of claim 7, wherein parsing the search results page to identify natural search results in the search results page comprises:

parsing the search result page and identifying natural content identifiers in the search result page, so as to judge, according to the natural content identifiers, that the corresponding search results are natural search results.

9. The method of claim 7, wherein performing risk detection on the natural search results to determine whether the advertising material is a risk material based on the risk detection results comprises:

rule-matching at least one of the title text data and the abstract text data of the natural search result, as the content to be matched, against a risk word stock; and

in response to a successful match, stopping risk detection and judging the advertisement material to be a risk material.

10. The method of claim 9, wherein performing risk detection on the natural search results to determine whether the advertising material is a risk material based on the risk detection results, further comprises:

parsing the search result page to obtain a landing page path address of the natural search result,

acquiring corresponding landing page contents according to the landing page path address;

rule-matching the text data in the landing page content, as the content to be matched, against a risk word stock; and

in response to a successful match, stopping risk detection and judging the advertisement material to be a risk material.

11. The method of claim 9 or 10, wherein the rule comprises: one or more words in the risk word stock exist in the content to be matched.

12. The method of claim 9 or 10, further comprising:

in response to judging the advertisement material to be a risk material, identifying and recording search keywords marked by the search engine in at least one of the title text data and the abstract text data of the natural search result.

13. The method of claim 12, further comprising: in response to judging the advertising material to be a risk material, expanding the risk word stock based on the advertising material.

14. The method of claim 13, wherein augmenting the risk thesaurus based on the advertising material comprises:

extracting core text from one or more groups of text data of the advertisement materials judged to be risk materials;

refining the extracted core text based on the recorded search keywords; and

adding the refined core text into the risk word stock according to the risk type.

15. An advertising risk potential detection device, comprising:

a first acquisition unit configured to acquire advertisement materials to be detected;

a second obtaining unit configured to obtain one or more sets of text data in the advertisement material;

a response unit configured to, in response to a preset risk detection condition being met, splice at least one group of text data in the one or more groups of text data, as information to be detected, with a path address of a search engine so as to obtain a search link containing the information to be detected;

a third obtaining unit configured to obtain a corresponding search result page according to the search link; and

a parsing unit configured to parse the search result page so as to judge whether the advertisement material is a risk material according to a parsing result, the parsing unit comprising:

a unit for parsing the search results page to identify natural search results in the search results page; and

a unit for performing risk detection on the natural search result to judge whether the advertisement material is a risk material according to the risk detection result.

16. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-14.

17. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-14.

18. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-14.