WO2019169769A1 - Advertisement picture identification method, electronic device, and readable storage medium - Google Patents

Advertisement picture identification method, electronic device, and readable storage medium

Info

Publication number
WO2019169769A1
WO2019169769A1 PCT/CN2018/089720 CN2018089720W WO2019169769A1 WO 2019169769 A1 WO2019169769 A1 WO 2019169769A1 CN 2018089720 W CN2018089720 W CN 2018089720W WO 2019169769 A1 WO2019169769 A1 WO 2019169769A1
Authority
WO
WIPO (PCT)
Prior art keywords
analyzed
advertisement
keyword
font
image
Prior art date
Application number
PCT/CN2018/089720
Other languages
French (fr)
Chinese (zh)
Inventor
宋杰
郑佳
赵骏
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019169769A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Definitions

  • the present application relates to the field of computer technologies, and in particular, to an advertisement picture identification method, an electronic device, and a readable storage medium.
  • the purpose of the present application is to provide an advertisement picture identification method, an electronic device, and a readable storage medium, which are intended to improve the efficiency of identifying an advertisement picture.
  • A first aspect of the present application provides an electronic device that includes a memory and a processor, where the memory stores an advertisement picture authentication system that can run on the processor, and the advertisement picture authentication system implements the following steps when executed by the processor:
  • the second aspect of the present application further provides an advertisement picture identification method, where the advertisement picture identification method includes:
  • A third aspect of the present application further provides a computer readable storage medium that stores an advertisement picture authentication system, where the advertisement picture authentication system is executable by at least one processor to cause the at least one processor to perform the steps of the advertisement picture identification method described above.
  • In the advertisement picture identification method, system, and readable storage medium proposed by the application, optical character recognition is performed on the picture to be analyzed; the recognized text is segmented into words; each word segment is matched against each advertisement keyword in the pre-established advertisement keyword library, and a corresponding keyword matching score is assigned according to the matching result and a preset matching score rule; the font size of each character is identified, and a corresponding font score is assigned according to the font size of the matched word segments and a preset font score rule; and whether the picture to be analyzed is an advertisement picture is determined from the keyword matching score and the font score by using a preset rule.
  • The present application matches each word segment in the picture to be analyzed against each advertisement keyword in the pre-established advertisement keyword library, assigns a corresponding keyword matching score according to the match, assigns a corresponding font score according to the font size of the matched word segment, and combines the keyword matching score and the font score for a comprehensive identification. This determines more accurately and effectively whether the picture to be analyzed is an advertisement picture carrying advertisement information. Moreover, the identification is performed automatically, without manual inspection, which effectively improves detection efficiency.
  • FIG. 1 is a schematic diagram of an operating environment of a preferred embodiment of an advertisement picture authentication system 10 of the present application
  • FIG. 2 is a schematic flow chart of an embodiment of an advertisement picture identification method according to the present application.
  • FIG. 1 is a schematic diagram of an operating environment of a preferred embodiment of the advertisement image authentication system 10 of the present application.
  • the advertisement picture authentication system 10 is installed and operated in the electronic device 1.
  • the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13.
  • Figure 1 shows only the electronic device 1 with components 11-13, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
  • the memory 11 is at least one type of readable computer storage medium, which in some embodiments may be an internal storage unit of the electronic device 1, such as a hard disk or memory of the electronic device 1.
  • In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart memory card (SMC), a Secure Digital (SD) card, or a flash card fitted to the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 is configured to store application software and various types of data installed in the electronic device 1, such as program codes of the advertisement picture authentication system 10, and the like.
  • the memory 11 can also be used to temporarily store data that has been output or is about to be output.
  • The processor 12, in some embodiments, may be a central processing unit (CPU), a microprocessor, or another data processing chip for running program code or processing data stored in the memory 11, for example executing the advertisement picture authentication system 10.
  • the display 13 in some embodiments may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch sensor, or the like.
  • The display 13 is configured to display information processed in the electronic device 1 and to present a visual user interface, for example the text recognized from the picture to be analyzed by optical character recognition, the word segmentation result of the recognized text, the (marked) word segments in the picture to be analyzed that match advertisement keywords in the advertisement keyword library, and the final identification result of whether the picture to be analyzed is an advertisement picture.
  • the components 11-13 of the electronic device 1 communicate with one another via a system bus.
  • the advertising picture authentication system 10 includes at least one computer readable instruction stored in the memory 11, the at least one computer readable instruction being executable by the processor 12 to implement various embodiments of the present application.
  • Step S1: after receiving the picture to be analyzed, perform optical character recognition on it and identify the text in the picture.
  • In this embodiment, the advertisement picture identification system receives an advertisement picture identification request sent by a user, for example through a mobile phone, a tablet computer, or a self-service terminal device.
  • After receiving the request, the advertisement picture identification system performs Optical Character Recognition (OCR) on the picture to be analyzed carried in the request. That is, the printed characters are optically converted into a black-and-white dot-matrix image file, and recognition software converts the text in the image into a text format.
  • OCR is applied to the picture to be analyzed in order to identify the text it contains.
  • A rare-character matching strategy can be applied during OCR. Because advertisement information is meant to be easy to understand and easy to publicize, rare characters generally do not appear in it. Therefore, if during OCR of the picture to be analyzed a character matches some rare character with a high matching degree but matches common characters only with a low matching degree, the result is treated as a probable OCR error. The character is then checked together with its surrounding characters against the OCR lexicon, and when the sequence matches a phrase with a high degree, the character is recognized as the common character at the corresponding position in the matched phrase. This improves the recognition accuracy for advertisement information in the subsequent analysis of the picture.
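  • A minimal sketch of the rare-character check above, under simplifying assumptions that are not in the text: any rare top candidate is treated as suspect (matching degrees are ignored), and a phrase counts as a match only if it agrees exactly with the neighbouring characters. The function name `resolve_character` and its parameters are hypothetical.

```python
def resolve_character(best_char, left, right, common_chars, lexicon):
    """Choose the character to keep at one OCR position.

    best_char    -- top OCR candidate for this position
    left, right  -- already recognized neighbouring text
    common_chars -- set of characters considered common (assumed input)
    lexicon      -- iterable of known phrases (assumed input)
    """
    if best_char in common_chars:
        return best_char  # a common character is not suspicious
    pos = len(left)
    for phrase in lexicon:
        # Simplification: require exact agreement with both neighbours.
        if (len(phrase) == pos + 1 + len(right)
                and phrase.startswith(left)
                and phrase.endswith(right)
                and phrase[pos] in common_chars):
            return phrase[pos]  # common character at the same position
    return best_char  # no phrase supports a correction


common = set("abcdefghijklmnopqrstuvwxyz")
print(resolve_character("å", "s", "le", common, ["sale"]))
```

  • A real system would likely replace the exact-agreement test with a graded phrase match, as the text describes.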
  • Advertisement information sometimes applies special treatment to its text so that the text is deformed, for example circled, crossed out, or assembled from a decorative advertisement font. Such additions can be removed after detection and the text itself restored, to facilitate subsequent matching and identification of the advertisement information.
  • The picture to be analyzed may also be checked for a two-dimensional code. If the picture contains two-dimensional code information, it is directly determined to be an advertisement picture, and identification ends without follow-up operations.
  • Step S2: perform word segmentation on the recognized text.
  • The text extracted by OCR is first preprocessed, for example by removing special characters from the preliminary recognition result and by applying line-break processing to characters that have the same font size and lie close together. The preprocessed text is then segmented as follows:
      a. Scanning the sentence to be segmented from left to right, take m characters as the matching field, where m is the length of the longest entry in the preset machine dictionary.
      b. Look up the matching field in the machine dictionary. If the match succeeds, split the matching field off as a word. If it fails, remove the last character of the matching field, use the remaining string as the new matching field, and match again. Repeat until all words have been segmented.
      c. Perform steps a and b from right to left as well.
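  • Steps a to c can be sketched as follows. This is an illustrative sketch, not the applicant's implementation: the function names are hypothetical, the dictionary is assumed to be a non-empty set of strings, and emitting a single character when nothing matches is an assumption the text does not spell out.

```python
def forward_max_match(sentence, dictionary):
    """Steps a and b: scan left to right, trying the longest field first."""
    m = max(map(len, dictionary))  # length of the longest dictionary entry
    words, i = [], 0
    while i < len(sentence):
        # On each miss the field loses its last character.
        for size in range(min(m, len(sentence) - i), 0, -1):
            field = sentence[i:i + size]
            if size == 1 or field in dictionary:  # assumed single-char fallback
                words.append(field)
                i += size
                break
    return words


def reverse_max_match(sentence, dictionary):
    """Step c: the same procedure from right to left.

    Here a miss drops the first character of the field instead of the last.
    """
    m = max(map(len, dictionary))
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(m, j), 0, -1):
            field = sentence[j - size:j]
            if size == 1 or field in dictionary:
                words.append(field)
                j -= size
                break
    return words[::-1]  # restore reading order


vocab = {"free", "gift", "card", "giftcard"}
print(forward_max_match("freegiftcard", vocab))
print(reverse_max_match("freegiftcard", vocab))
```

  • When the two directions disagree, a bidirectional scheme has to pick one result; the text does not say how, so that choice is left out of the sketch.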
  • A second pass can also be performed in which runs of consecutive digits (including uppercase numerals) or English letters are treated as a whole and converted, so that advertisement information promoted through continuous numbers or English text can be identified.
  • An N-gram model, a Hidden Markov Model (HMM), or a Maximum Entropy Model may also be used for word segmentation. Applicable word segmentation algorithms include forward maximum matching, reverse maximum matching, bidirectional maximum matching, and the shortest path algorithm.
  • Step S3: match each word segment against each advertisement keyword in the pre-established advertisement keyword library to obtain the word segments that match advertisement keywords in the library; and assign a corresponding keyword matching score according to the matching result and a preset matching score rule.
  • An advertisement keyword library may be established in advance and may be organized by advertisement category, for example with separate keyword libraries for product advertisements, brand advertisements, concept advertisements, public service advertisements, and the like.
  • Advertisements can also be graded by level. For example, pornography, gambling, and fraud advertisements that are common on the network are set to a high-risk level and must be eliminated; advertisements of competing systems and brand advertisements related to the business system are set to a dangerous level; general merchandise advertisements and the like are set to the normal level.
  • The specifically defined matching score rule includes:
  • Synonymous inclusion: compared with precise inclusion, the matching condition is appropriately extended to synonyms, near-synonyms, and related words of the keyword, to phrases containing the keyword, and to forms in which part of the literal order is reversed or characters are inserted. That is, the matching condition is that the word to be matched completely contains a deformed form of a keyword in the keyword library (insertion, inversion, synonym, near-synonym, related word), and p3 is 8 points.
  • Core inclusion: if a word segment of the picture to be analyzed matches the core part of an advertisement keyword in the pre-established advertisement keyword library, or a preset related word of that core part, the corresponding third keyword matching score is assigned. That is, the matching condition is that the word to be matched contains the core part of a keyword in the keyword library or a deformation of that core part (insertion, inversion, synonym, near-synonym, related word), and p3 is 6 points.
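  • The tiered rule can be sketched as a lookup that tries the strictest condition first. The 8-point and 6-point values come from the text; the score for precise inclusion is not stated here, so the sketch takes it as a caller-supplied parameter. The `AdKeyword` structure, its field names, and the use of plain substring tests are illustrative assumptions.

```python
from dataclasses import dataclass, field

SYNONYM_SCORE = 8  # synonymous inclusion, from the text
CORE_SCORE = 6     # core inclusion, from the text


@dataclass
class AdKeyword:
    text: str                                         # the keyword itself
    variants: set = field(default_factory=set)        # hypothetical: synonyms, related words, reordered forms
    core: str = ""                                    # hypothetical: core part of the keyword
    core_variants: set = field(default_factory=set)   # hypothetical: deformations of the core part


def keyword_match_score(segment, keyword, exact_score):
    """Return p3 for one word segment against one keyword, or 0 if nothing matches.

    exact_score is the value for precise inclusion, which this text does not give.
    """
    if keyword.text in segment:
        return exact_score
    if any(v in segment for v in keyword.variants):
        return SYNONYM_SCORE
    cores = ({keyword.core} if keyword.core else set()) | keyword.core_variants
    if any(c in segment for c in cores):
        return CORE_SCORE
    return 0


kw = AdKeyword("discount coupon", variants={"discount voucher"}, core="discount")
print(keyword_match_score("discount today", kw, exact_score=10))
```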
  • After keyword matching is completed, if a word segment in the picture to be analyzed matches a keyword in the keyword library (whether by precise inclusion, synonymous inclusion, or core inclusion) and the matched keyword belongs to the keyword library of high-risk advertisements, the picture is directly determined to contain a high-risk advertisement that must be eliminated, and identification ends without subsequent operations.
  • If the matched keyword does not belong to the keyword library of high-risk advertisements, that is, it belongs to the library of dangerous-level or normal-level advertisements, further semantic analysis can follow. For example, whether the picture to be analyzed contains advertisement information, and its advertisement category and level, may be determined from the contextual meaning of the matched keyword or from a combination of several keywords. The picture can also be checked for direct contact information such as QQ, WeChat, an email address, a website address, or a mobile phone number. If such information is present, the picture can be directly determined to contain advertisement information, for example an advertisement unrelated to the business system.
  • Direct contact information is detected as follows: when the text in the picture to be analyzed includes a run of digits, check whether monetary unit information, unit-of-measurement information, or the like accompanies it; if not, check whether the digits are a telephone number.
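  • A rough sketch of this check using regular expressions. Everything specific in it is an assumption made for illustration: the minimum run length of five digits, the short lists of currency and measurement units, and the mainland China mobile number pattern.

```python
import re

DIGIT_RUN = re.compile(r"\d{5,}")      # assumption: a "run of digits" is 5 or more
MOBILE = re.compile(r"1[3-9]\d{9}")    # assumption: mainland China mobile format
UNIT_BEFORE = re.compile(r"[¥￥$]\s*$")  # hypothetical currency symbols before the digits
UNIT_AFTER = re.compile(r"\s*(?:元|kg|cm|ml|%)(?![a-z])", re.IGNORECASE)  # hypothetical units after


def find_phone_numbers(text):
    """Return digit runs that carry no unit and look like phone numbers."""
    found = []
    for match in DIGIT_RUN.finditer(text):
        has_unit = (UNIT_BEFORE.search(text[:match.start()]) is not None
                    or UNIT_AFTER.match(text, match.end()) is not None)
        if has_unit:
            continue  # a price or a quantity, not contact information
        if MOBILE.fullmatch(match.group()):
            found.append(match.group())
    return found


print(find_phone_numbers("Call 13812345678 now"))
```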
  • Step S4: identify the font size of each character in the picture to be analyzed, and assign a corresponding font score according to the font size of the matched word segments and a preset font score rule.
  • Font size analysis may be performed on each of the recognized characters.
  • Font color analysis may also be performed on each of the recognized characters: for the text recognized by OCR in the picture to be analyzed, the saliency of each character's font color is calculated.
  • Characters whose font color saliency is greater than a preset color saliency threshold are identified as high color saliency characters, and characters whose font color saliency is less than or equal to the threshold are identified as low color saliency characters. A corresponding color saliency score is set for each character in the picture to be analyzed according to its font color saliency, where the score for high color saliency characters is greater than the score for low color saliency characters.
  • Advertisement information may use a highly noticeable font color to obtain a better publicity effect. Therefore, in this embodiment, a color saliency score p2 is assigned according to the font color of the characters in the picture to be analyzed.
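  • The thresholding rule can be sketched as below. The text does not say how color saliency is computed, so `color_saliency` uses a stand-in measure chosen only for illustration (normalized RGB distance between text and background); the threshold and the two score values are caller-supplied assumptions.

```python
import math

# Largest possible distance between two RGB colors.
MAX_RGB_DISTANCE = math.dist((0, 0, 0), (255, 255, 255))


def color_saliency(text_rgb, background_rgb):
    """Stand-in saliency measure in [0, 1]; not taken from the text."""
    return math.dist(text_rgb, background_rgb) / MAX_RGB_DISTANCE


def color_saliency_score(saliency, threshold, high_score, low_score):
    """Return p2: high_score above the threshold, low_score at or below it."""
    if high_score <= low_score:
        raise ValueError("high_score must be greater than low_score")
    return high_score if saliency > threshold else low_score


s = color_saliency((255, 0, 0), (255, 255, 255))  # red text on white
print(color_saliency_score(s, threshold=0.5, high_score=2, low_score=1))
```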
  • Step S5: determine, from the keyword matching score and the font score, whether the picture to be analyzed is an advertisement picture by using a preset rule.
  • When determining whether the picture to be analyzed is an advertisement picture by using a preset rule, a P value may be calculated according to the following formula:
  • P1 is the font score corresponding to the font size of the matched word segment in the picture to be analyzed;
  • P2 is the color saliency score corresponding to the font color saliency of the matched word segment in the picture to be analyzed;
  • P3 is the keyword matching score corresponding to the matched word segment in the picture to be analyzed.
  • A threshold is set in advance. When the calculated P value reaches the threshold, the picture to be analyzed is determined to be an advertisement picture containing advertisement information, and an early warning is issued.
  • The advertisement information may be comprehensively evaluated according to the font, color, keyword level, number of keywords, and so on of the matched word segments in the picture to be analyzed, and different measures may be taken for different advertisements by setting advertisement categories and advertisement levels.
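  • Taken together, the decision flow of steps S1 to S5 can be sketched as a chain of early exits followed by the threshold test. The formula for P is not reproduced in this text, so the sketch takes the combining function as a parameter; the `Evidence` structure, its field names, and the verdict strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Evidence:
    has_qr_code: bool        # two-dimensional code found in the picture
    high_risk_keyword: bool  # a matched keyword belongs to the high-risk library
    has_contact_info: bool   # QQ, WeChat, email, URL, or phone number found
    p1: float                # font score
    p2: float                # color saliency score
    p3: float                # keyword matching score


def identify(e: Evidence, threshold: float,
             combine: Callable[[float, float, float], float]) -> str:
    """Apply the early exits described in the text, then the P threshold."""
    if e.has_qr_code:
        return "advertisement"
    if e.high_risk_keyword:
        return "high-risk advertisement"
    if e.has_contact_info:
        return "advertisement"
    p = combine(e.p1, e.p2, e.p3)  # caller supplies the formula for P
    return "advertisement" if p >= threshold else "not advertisement"


# The product used here is a placeholder, not the formula from the application.
sample = Evidence(False, False, False, p1=2, p2=2, p3=8)
print(identify(sample, threshold=30, combine=lambda a, b, c: a * b * c))
```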
  • In this embodiment, optical character recognition is performed on the picture to be analyzed; the recognized text is segmented into words; each word segment is matched against each advertisement keyword in the pre-established advertisement keyword library, and a corresponding keyword matching score is assigned according to the matching result and a preset matching score rule; the font size of each character is identified, and a corresponding font score is assigned according to the font size of the matched word segments and a preset font score rule; and whether the picture to be analyzed is an advertisement picture is determined from the keyword matching score and the font score by using a preset rule. Because advertisement text appearing in a picture generally differs from ordinary text, for example in font size or font color, each word segment in the picture can be matched against the advertisement keywords in the pre-established library and given a keyword matching score according to the match, a font score according to its font size, and a color saliency score according to its color saliency.
  • FIG. 2 is a schematic flowchart of an embodiment of the advertisement picture identification method of the present application.
  • The advertisement picture identification method includes the following steps, which correspond to steps S1 to S5 described above:
  • Step S10: after receiving the picture to be analyzed, perform optical character recognition on it and identify the text in the picture.
  • Step S20: perform word segmentation on the recognized text.
  • Step S30: match each word segment against each advertisement keyword in the pre-established advertisement keyword library to obtain the word segments that match advertisement keywords in the library; and assign a corresponding keyword matching score according to the matching result and a preset matching score rule.
  • Step S40: identify the font size of each character in the picture to be analyzed, and assign a corresponding font score according to the font size of the matched word segments and a preset font score rule.
  • Step S50: Determine, according to the keyword matching score and the font score and using a preset rule, whether the picture to be analyzed is an advertisement picture.
  • When determining by the preset rule whether the picture to be analyzed is an advertisement picture, a P value may be calculated according to the following formula:
  • P1 is the font score corresponding to the font size of the matched segmented words in the picture to be analyzed
  • P2 is the color saliency score corresponding to the font color saliency of the matched segmented words in the picture to be analyzed
  • P3 is the keyword matching score corresponding to the matched segmented words in the picture to be analyzed
  • A threshold is set in advance; when the calculated P value reaches the threshold, the picture to be analyzed is determined to be an advertisement picture containing advertisement information, and a warning is issued.
  • The advertisement information may be evaluated comprehensively according to the font, color, keyword level, number of keywords, and so on of the matched segmented words in the picture to be analyzed, and different measures may be taken for different advertisements by setting advertisement categories and advertisement levels.
  • In this embodiment, text is recognized from the picture to be analyzed by optical character recognition; the recognized text is segmented into words; each segmented word is matched against the advertisement keywords in the pre-established advertisement keyword library, and a keyword matching score is assigned to the matching result according to the preset matching scoring rule; the different font sizes of the characters are identified, and a font score is assigned according to the font size of the matched segmented words under the preset font scoring rule; and, according to the keyword matching score and the font score, a preset rule is used to determine whether the picture to be analyzed is an advertisement picture. When advertisement information appears in a picture, the advertisement text generally differs from other, normal text, for example in font size or font color.
  • Each segmented word in the picture to be analyzed can be matched against the advertisement keywords in the pre-established advertisement keyword library, with a keyword matching score assigned according to the match; a font score is assigned according to the font size of the matched segmented words, and a color saliency score is set according to the color saliency of the matched segmented words.
  • The present application also provides a computer-readable storage medium storing an advertisement picture identification system, the advertisement picture identification system being executable by at least one processor to cause the at least one processor to perform the steps of the advertisement picture identification method described above.
  • The method of the foregoing embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation.
  • The part of the technical solution of the present application that is essential, or that contributes over the prior art, may be embodied as a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods described in the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • General Engineering & Computer Science (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Artificial Intelligence (AREA)
  • Game Theory and Decision Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present application relates to an advertisement picture identification method, an electronic device, and a readable storage medium. The method comprises: performing optical character recognition on a picture to be analyzed, to recognize characters; performing word segmentation on the recognized characters; comparing segmented words with advertisement keywords in a pre-established advertisement keyword library, to obtain segmented words matching the advertisement keywords in the advertisement keyword library; allocating corresponding keyword matching scores according to the matching result and according to a preset matching scoring rule; recognizing different font sizes of characters in said picture, and allocating corresponding font scores according to the font sizes of the matching segmented words and according to a preset font scoring rule; according to the keyword matching score and the font score, using a preset rule to determine whether said picture is an advertisement picture. The present application can accurately and effectively determine whether a picture to be analyzed is an advertisement picture. Furthermore, the present invention can automatically identify an advertisement picture without manual detection, effectively improving detection efficiency.

Description

Advertisement picture identification method, electronic device, and readable storage medium
Priority claim
This application claims priority under the Paris Convention to Chinese patent application No. CN 201810183371.6, filed on March 6, 2018 and entitled "Advertisement Picture Identification Method, Electronic Device, and Readable Storage Medium", the entire content of which is incorporated herein by reference.
Technical field
The present application relates to the field of computer technologies, and in particular to an advertisement picture identification method, an electronic device, and a readable storage medium.
Background
At present, the business processes of large Internet finance companies involve a large number of business pictures, and advertisement pictures may be mixed in among them. These advertisement pictures contain various kinds of advertisement information, spam, and the like, which interfere with normal business processing and must be effectively identified and removed. The traditional way of identifying advertisement pictures is to have staff manually review a large number of business pictures one by one to screen out the advertisement pictures; this manual detection is costly, time-consuming, and inefficient.
Summary of the invention
The purpose of the present application is to provide an advertisement picture identification method, an electronic device, and a readable storage medium, with the aim of improving the efficiency of identifying advertisement pictures.
To achieve the above purpose, a first aspect of the present application provides an electronic device. The electronic device includes a memory and a processor; the memory stores an advertisement picture identification system that can run on the processor, and the advertisement picture identification system, when executed by the processor, implements the following steps:
after receiving a picture to be analyzed, performing optical character recognition on the picture to be analyzed to recognize the text in the picture to be analyzed;
performing word segmentation on the recognized text;
matching each segmented word against the advertisement keywords in a pre-established advertisement keyword library to obtain the segmented words that match advertisement keywords in the library, and assigning a corresponding keyword matching score to the matching result according to a preset matching scoring rule;
identifying the different font sizes of the characters in the picture to be analyzed, and assigning a corresponding font score, according to a preset font scoring rule, based on the font size of the matched segmented words;
determining, according to the keyword matching score and the font score and using a preset rule, whether the picture to be analyzed is an advertisement picture.
In addition, to achieve the above purpose, a second aspect of the present application provides an advertisement picture identification method, which includes:
after receiving a picture to be analyzed, performing optical character recognition on the picture to be analyzed to recognize the text in the picture to be analyzed;
performing word segmentation on the recognized text;
matching each segmented word against the advertisement keywords in a pre-established advertisement keyword library to obtain the segmented words that match advertisement keywords in the library, and assigning a corresponding keyword matching score to the matching result according to a preset matching scoring rule;
identifying the different font sizes of the characters in the picture to be analyzed, and assigning a corresponding font score, according to a preset font scoring rule, based on the font size of the matched segmented words;
determining, according to the keyword matching score and the font score and using a preset rule, whether the picture to be analyzed is an advertisement picture.
Further, to achieve the above purpose, a third aspect of the present application provides a computer-readable storage medium storing an advertisement picture identification system, the advertisement picture identification system being executable by at least one processor to cause the at least one processor to perform the steps of the advertisement picture identification method described above.
The advertisement picture identification method, system, and readable storage medium proposed by the present application recognize text in the picture to be analyzed by optical character recognition; segment the recognized text into words; match each segmented word against the advertisement keywords in a pre-established advertisement keyword library and assign a corresponding keyword matching score to the matching result according to a preset matching scoring rule; identify the different font sizes of the characters and assign a corresponding font score, according to a preset font scoring rule, based on the font size of the matched segmented words; and determine, according to the keyword matching score and the font score and using a preset rule, whether the picture to be analyzed is an advertisement picture. Because advertisement text in a picture generally differs from other, normal text, combining the keyword matching score with the font score in a single assessment makes it possible to determine more accurately and effectively whether the picture to be analyzed is an advertisement picture containing advertisement information. Moreover, advertisement pictures are identified automatically without manual detection, which effectively improves detection efficiency.
Brief description of the drawings
FIG. 1 is a schematic diagram of the operating environment of a preferred embodiment of the advertisement picture identification system 10 of the present application;
FIG. 2 is a schematic flowchart of an embodiment of the advertisement picture identification method of the present application.
Detailed description
To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and not to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application, without creative effort, fall within the scope of protection of the present application.
It should be noted that descriptions involving "first", "second", and the like in the present application are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, but only on the basis that a person of ordinary skill in the art can implement the combination; when a combination of technical solutions is contradictory or cannot be implemented, the combination should be considered not to exist and is not within the scope of protection claimed by the present application.
The present application provides an advertisement picture identification system. Refer to FIG. 1, which is a schematic diagram of the operating environment of a preferred embodiment of the advertisement picture identification system 10 of the present application.
In this embodiment, the advertisement picture identification system 10 is installed and runs in an electronic device 1. The electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13. FIG. 1 shows only the electronic device 1 with components 11-13, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The memory 11 is at least one type of computer-readable storage medium. In some embodiments the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or internal memory of the electronic device 1. In other embodiments the memory 11 may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 is used to store the application software installed on the electronic device 1 and various types of data, such as the program code of the advertisement picture identification system 10. The memory 11 may also be used to temporarily store data that has been output or is about to be output.
In some embodiments the processor 12 may be a Central Processing Unit (CPU), a microprocessor, or another data processing chip, used to run the program code or process the data stored in the memory 11, for example to execute the advertisement picture identification system 10.
In some embodiments the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display 13 is used to display the information processed in the electronic device 1 and to display a visual user interface, for example the text recognized from the picture to be analyzed by optical character recognition, the word segmentation result for the recognized text, the (marked) segmented words in the picture to be analyzed that match advertisement keywords in the advertisement keyword library, and the final result of whether the picture to be analyzed is an advertisement picture. The components 11-13 of the electronic device 1 communicate with one another via a system bus.
The advertisement picture identification system 10 includes at least one computer-readable instruction stored in the memory 11, and the at least one computer-readable instruction can be executed by the processor 12 to implement the embodiments of the present application.
When executed by the processor 12, the advertisement picture identification system 10 implements the following steps:
Step S1: after receiving a picture to be analyzed, perform optical character recognition on the picture to be analyzed to recognize the text in the picture to be analyzed.
In this embodiment, the advertisement picture identification system receives an advertisement picture identification request, sent by a user, that contains the picture to be analyzed. For example, the request may be sent from a terminal such as a mobile phone, a tablet computer, or a self-service terminal device, either from a client pre-installed on the terminal or from a browser on the terminal.
After receiving the advertisement picture identification request, the advertisement picture identification system performs Optical Character Recognition (OCR) on the picture to be analyzed in the request. That is, for printed characters, the text is converted optically into a black-and-white dot-matrix image file, and recognition software converts the text in the image into text format.
OCR is used to recognize the characters in the picture to be analyzed so as to obtain the text it contains. In this embodiment, a rare-character matching strategy may be applied during OCR. Advertisement information is meant to be simple and easy to understand, so rare characters seldom appear in it. Therefore, if during OCR a character matches some rare character with a high score but matches the common characters with low scores, the result is judged to be an OCR error. The character is then combined with its surrounding characters into a phrase and checked against the OCR phrase lexicon; when the phrase matches a lexicon entry with a high score, the character is recognized as the common character at the corresponding position in the matched entry. This improves the accuracy with which advertisement information in the picture to be analyzed is subsequently recognized.
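The rare-character strategy above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `resolve_rare_char`, the ranked candidate list, the `rare_chars` set, and the `phrase_lexicon` set are all hypothetical, and the "high match" against the lexicon is simplified to exact membership.

```python
from typing import List, Set


def resolve_rare_char(
    candidates: List[str],    # OCR candidates for one position, best first
    left: str,                # character(s) already recognized to the left
    right: str,               # character(s) already recognized to the right
    rare_chars: Set[str],     # hypothetical list of rare characters
    phrase_lexicon: Set[str], # hypothetical phrase lexicon used by the OCR stage
) -> str:
    """Prefer a common character when the top OCR guess is a rare one."""
    best = candidates[0]
    if best not in rare_chars:
        return best  # top guess is already a common character
    # Top guess is rare: try each common candidate together with its neighbours.
    for cand in candidates[1:]:
        if cand in rare_chars:
            continue
        phrases = (left + cand, cand + right, left + cand + right)
        if any(p in phrase_lexicon for p in phrases):
            return cand  # the phrase is known, so accept the common character
    return best  # no phrase matched; keep the original guess


# The OCR engine ranks the rare "貣" above the common "贷", but "贷款" is a known phrase.
fixed = resolve_rare_char(["貣", "贷"], "", "款", {"貣"}, {"贷款"})
```

A real OCR engine would expose per-candidate confidences, which could replace the simple "first common candidate that forms a known phrase" rule used here.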
Distortion detection may also be performed on the rare characters recognized in the picture to be analyzed. Text in advertisement information is sometimes specially processed in ways that distort it, for example by circling it, crossing it out, or assembling it from an advertisement font. After detection, these special symbols can be removed and the text itself restored, so that the advertisement information can subsequently be matched and recognized.
In an optional implementation, QR code detection may also be performed on the picture to be analyzed. Once the picture to be analyzed is found to contain QR code information, it is directly determined to be an advertisement picture; identification ends and no subsequent operations are needed.
Step S2: perform word segmentation on the recognized text.
In this embodiment, the text extracted by OCR is first preprocessed: for example, special characters from the initial recognition are removed, and line breaks are removed between characters that have the same font size and are close together. The preprocessed text is then segmented as follows:
a, take m characters of the sentence to be segmented, from left to right, as the matching field, where m is the number of characters in the longest entry of the preset machine dictionary.
b, look up the m characters in the machine dictionary. If the match succeeds, split off the matching field as a word; if it fails, drop the last character of the matching field and use the remaining string as the new matching field for another match. Repeat this process until all words have been split off.
c, perform operations a and b from right to left to segment the text in the reverse direction.
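Steps a to c describe forward and reverse maximum matching. The sketch below is one straightforward way to write them; the function names and the toy dictionary are assumptions, and a single character that matches no dictionary entry is emitted as its own token so that the loop always advances.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Steps a and b: scan left to right, always trying the longest field first."""
    m = max(len(w) for w in dictionary)  # length of the longest dictionary entry
    words, i = [], 0
    while i < len(text):
        for size in range(min(m, len(text) - i), 0, -1):
            field = text[i:i + size]
            # On failure the last character is dropped (size shrinks by one).
            if size == 1 or field in dictionary:
                words.append(field)
                i += size
                break
    return words


def backward_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Step c: the same procedure from right to left, dropping the first character."""
    m = max(len(w) for w in dictionary)
    words, j = [], len(text)
    while j > 0:
        for size in range(min(m, j), 0, -1):
            field = text[j - size:j]
            if size == 1 or field in dictionary:
                words.append(field)
                j -= size
                break
    words.reverse()
    return words


toy_dict = {"研究", "研究生", "生命", "命", "起源"}
print(forward_max_match("研究生命起源", toy_dict))   # ['研究生', '命', '起源']
print(backward_max_match("研究生命起源", toy_dict))  # ['研究', '生命', '起源']
```

The two directions can disagree, as in this example, which is why a bidirectional scheme compares both results before choosing a segmentation.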
Further, a second pass may be performed after segmentation: runs of consecutive uppercase numerals or English letters are segmented as a whole and translated, so that advertisement information promoted through consecutive digits or English text can be recognized.
In this embodiment, word segmentation may also use an N-gram model, a Hidden Markov Model (HMM), or a Maximum Entropy Model. The segmentation algorithms may include forward maximum matching, reverse maximum matching, bidirectional maximum matching, and the shortest-path algorithm.
Step S3: match each segmented word against the advertisement keywords in the pre-established advertisement keyword library to obtain the segmented words that match advertisement keywords in the library, and assign a corresponding keyword matching score to the matching result according to a preset matching scoring rule.
In this embodiment, the advertisement keyword library may be established in advance. For example, keyword libraries may be built by advertisement category, such as product advertisements, brand advertisements, concept advertisements, and public service advertisements. Advertisements may also be graded by level: illegal advertisements popular online, such as those for pornography, gambling, drugs, and fraud, are set to the high-risk level and must be removed; competitor and brand advertisements related to this business system are set to the dangerous level; and ordinary product advertisements and the like are set to the normal level.
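One possible in-memory layout for such a library is sketched below. The three level names follow the text; the class names, the sample English keywords, and the `highest_level` helper are illustrative assumptions only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class RiskLevel(Enum):
    NORMAL = 1      # ordinary product advertisements
    DANGEROUS = 2   # competitor and brand advertisements
    HIGH_RISK = 3   # illegal advertisements that must be removed


@dataclass(frozen=True)
class AdKeyword:
    text: str
    category: str   # e.g. product, brand, concept, public service
    level: RiskLevel


# Hypothetical sample entries, keyed by keyword text for fast lookup.
KEYWORD_LIBRARY = {
    kw.text: kw
    for kw in [
        AdKeyword("casino", "illegal", RiskLevel.HIGH_RISK),
        AdKeyword("rival-bank", "brand", RiskLevel.DANGEROUS),
        AdKeyword("discount", "product", RiskLevel.NORMAL),
    ]
}


def highest_level(words: Iterable[str]) -> Optional[RiskLevel]:
    """Return the most severe level among the words found in the library."""
    levels = [KEYWORD_LIBRARY[w].level for w in words if w in KEYWORD_LIBRARY]
    return max(levels, key=lambda lv: lv.value) if levels else None


level = highest_level(["big", "discount", "casino"])
```

Storing the level next to each keyword lets later steps branch on it directly, for example to end identification early when a high-risk keyword is matched.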
The established advertisement keyword library is used to perform keyword matching on the segmented words in the picture to be analyzed, and a score p3 is given according to how the segmented words match the library. The preset matching scoring rule is defined as follows:
a, exact inclusion: if a segmented word of the picture to be analyzed matches an advertisement keyword in the pre-established advertisement keyword library, the corresponding first keyword matching score is assigned. That is, when the word to be matched completely contains a keyword in the advertisement keyword library, it is considered an exact hit and p3 is 10.
b, synonymous inclusion: if a segmented word of the picture to be analyzed matches a preset related word of an advertisement keyword in the library, the corresponding second keyword matching score is assigned. The preset related words of an advertisement keyword include its synonyms, near-synonyms, phrases related to the keyword, and/or variant forms in which the characters of the keyword are reordered or separated. In other words, compared with exact inclusion, the matching condition is extended to synonyms, near-synonyms, related words, phrases containing the keyword, and forms with part of the character order reversed or with gaps inserted. When the word to be matched completely contains such a variant form of a keyword in the library (insertion, reversal, synonym, near-synonym, related word), p3 is 8.
c, core inclusion: if a segmented word of the picture to be analyzed matches the core part of an advertisement keyword in the library, or a preset related word of that core part, the corresponding third keyword matching score is assigned. That is, when the word to be matched contains the core part of a keyword in the library, or a variant of the core part (insertion, reversal, synonym, near-synonym, related word), p3 is 6.
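The three-tier rule can be sketched as a single function that checks the tiers in order of strictness. The scores 10, 8, and 6 come from the text; the function name, the substring test for "contains", the score of 0 for no match, and the English sample keyword are assumptions made for illustration.

```python
from typing import Iterable

EXACT_SCORE, RELATED_SCORE, CORE_SCORE = 10, 8, 6


def keyword_match_score(
    word: str,
    keyword: str,
    related_words: Iterable[str],  # synonyms, near-synonyms, reordered or spaced forms
    core_parts: Iterable[str],     # core part of the keyword and its variants
) -> int:
    """Return p3 for one segmented word against one library keyword."""
    if keyword in word:
        return EXACT_SCORE                       # a: exact inclusion
    if any(r in word for r in related_words):
        return RELATED_SCORE                     # b: synonymous inclusion
    if any(c in word for c in core_parts):
        return CORE_SCORE                        # c: core inclusion
    return 0                                     # assumption: no match scores 0


related = ["cheap loan", "loan low-interest"]
core = ["loan"]
p3_exact = keyword_match_score("low-interest loan today", "low-interest loan", related, core)
p3_related = keyword_match_score("get a cheap loan now", "low-interest loan", related, core)
p3_core = keyword_match_score("personal loan offer", "low-interest loan", related, core)
```

Checking the strictest tier first ensures that a word satisfying several tiers receives the highest applicable score.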
After keyword matching is complete, if a segmented word in the picture to be analyzed matches a keyword in the advertisement keyword library (whether by exact, synonymous, or core inclusion) and the matched keyword belongs to the high-risk keyword library, the picture to be analyzed is directly determined to contain a high-risk advertisement and must be removed; identification ends and no subsequent operations are needed.
If the matched keyword does not belong to the high-risk keyword library, that is, it belongs to the dangerous-level or normal-level library, further semantic analysis may be performed. For example, whether the picture to be analyzed contains advertisement information, and its advertisement category and level, may be determined from the context of a matched keyword or from a combination of several keywords. The picture may also be checked for direct contact information such as a QQ number, WeChat ID, email address, website address, or mobile phone number; if any is present, the picture can be directly determined to contain advertisement information, for example an advertisement unrelated to the business system. Specifically, direct contact information is detected as follows: when the characters in the picture to be analyzed include a run of digits, check whether the digits are followed by currency-unit or measurement-unit information; if not, check whether the digits have the form of a phone number.
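The digit-run check can be sketched with regular expressions. The unit list and the phone pattern below are assumptions: the text does not enumerate units, and `1[3-9]\d{9}` is only one common approximation of a mainland-China mobile number.

```python
import re
from typing import List

DIGIT_RUN = re.compile(r"\d+")
# Hypothetical, non-exhaustive list of currency and measurement units.
UNIT_AFTER = re.compile(r"\s*(?:元|万|美元|yuan|RMB|kg|cm|%)")
# Assumption: 11-digit mobile numbers starting with 13-19.
PHONE_FORM = re.compile(r"1[3-9]\d{9}")


def find_phone_numbers(text: str) -> List[str]:
    """Return digit runs that are not followed by a unit and look like phone numbers."""
    found = []
    for run in DIGIT_RUN.finditer(text):
        if UNIT_AFTER.match(text, run.end()):
            continue  # digits followed by a unit are a price or quantity
        if PHONE_FORM.fullmatch(run.group()):
            found.append(run.group())
    return found


numbers = find_phone_numbers("Only 5000 yuan, call 13812345678")
```

The same skeleton extends to other contact types, for example by adding patterns for email addresses or URLs alongside the phone check.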
Step S4: identify the different font sizes of the characters in the picture to be analyzed, and assign a corresponding font score to each matched segmented word according to its font size and a preset font scoring rule.
When optical character recognition (OCR) is used to recognize the characters in the picture to be analyzed, font size analysis may also be performed on the recognized characters. Specifically, the picture may first be Gaussian-blurred, for example f'(x,y) = f(x,y) * g(x,y), where g(x,y) = exp(-(x^2 + y^2)/9). A peak distribution map is drawn for f'(x,y), and peak distribution maps of different levels are extracted according to a stepped distribution. In other words, the rough outline of each character is analyzed to distinguish the different font sizes in the picture. For example, characters in the peak distribution map of a preset level may be identified as larger-font characters, and the remaining characters in the picture as smaller-font characters. In practice, when advertisement information is mixed into a business picture, it is usually displayed in a larger font to attract attention. This embodiment therefore assigns a font score p1 to the characters in the picture to be analyzed, where larger-font characters receive a higher font score than smaller-font characters. For example, p1 = 2 for larger-font characters and p1 = 1 for smaller-font characters.
Further, in an optional implementation, font color analysis may also be performed on the recognized characters. For each character recognized by OCR in the picture to be analyzed, a font color saliency is calculated. Characters whose font color saliency is greater than a preset color saliency threshold are identified as high-saliency characters, and characters whose font color saliency is less than or equal to the threshold are identified as low-saliency characters. Each character is then given a color saliency score according to its font color saliency, where the score for high-saliency characters is greater than the score for low-saliency characters. Specifically, for a font detected by OCR, the color saliency is calculated, for example, as drgb = (rgb(x,y) - rgb(s,t))^2; when drgb is greater than a specific threshold, the font is considered to have high color saliency. In practice, advertisement information may raise its color saliency to achieve a better promotional effect. This embodiment therefore assigns a color saliency score p2 to the font color of the characters in the picture to be analyzed, where characters with high color saliency receive a higher score than characters with low color saliency. For example, p2 = 1 for characters with high color saliency and p2 = 0.5 for characters with low color saliency.
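A minimal sketch of the color saliency score follows. It assumes that the squared difference is summed over the R, G and B channels, that the second color rgb(s,t) is a reference such as the local background, and that the threshold is 10000; none of these details are fixed by the text, so all three are hypothetical.

```python
def squared_rgb_distance(color_a, color_b):
    """Sum of squared per-channel differences between two RGB triples."""
    return sum((a - b) ** 2 for a, b in zip(color_a, color_b))


def color_saliency_score(text_rgb, reference_rgb, threshold=10000):
    """Return p2: 1.0 for high color saliency, 0.5 for low color saliency.

    `threshold` is a hypothetical value; the text only says "a preset
    color saliency threshold".
    """
    drgb = squared_rgb_distance(text_rgb, reference_rgb)
    return 1.0 if drgb > threshold else 0.5
```

With these assumptions, saturated red text on a white background scores as high saliency, while light gray text on white scores as low saliency.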
Step S5: according to the keyword matching score and the font score, determine by a preset rule whether the picture to be analyzed is an advertisement picture.
In this embodiment, when the preset rule is used to determine whether the picture to be analyzed is an advertisement picture, a value P may be calculated by the following formula:
P = a1*P1 + a2*P2 + a3*P3
Here P1 is the font score corresponding to the font size of the matched segmented word in the picture to be analyzed, P2 is the color saliency score corresponding to the font color saliency of the matched segmented word, and P3 is the keyword matching score of the matched segmented word. a1, a2 and a3 are weights set in advance for the font score P1, the color saliency score P2 and the keyword matching score P3; for example, a1 = 0.2, a2 = 0.1, a3 = 0.7.
A threshold is set in advance. When the calculated P value reaches the threshold, the picture to be analyzed is determined to be an advertisement picture containing advertisement information, and an early warning is issued. In addition, the advertisement information may be evaluated comprehensively by combining the font, color, keyword level, number of keywords and so on of the matched segmented words in the picture, and different measures may be taken for different advertisements by defining advertisement categories and advertisement levels.
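The weighted score and threshold test can be sketched as below. The weights are the example values given in the text; the threshold of 5.0 is a hypothetical value chosen only for illustration, since the text does not state one. For a larger-font, high-saliency, exactly matched word, P = 0.2*2 + 0.1*1 + 0.7*10 = 7.5.

```python
def ad_score(p1, p2, p3, a1=0.2, a2=0.1, a3=0.7):
    """P = a1*P1 + a2*P2 + a3*P3, using the example weights from the text."""
    return a1 * p1 + a2 * p2 + a3 * p3


def is_advertisement(p1, p2, p3, threshold=5.0):
    """True when P reaches the preset threshold (5.0 is an assumed value)."""
    return ad_score(p1, p2, p3) >= threshold
```

Because a3 dominates in the example weights, the keyword match largely decides the outcome, and font size and color act as secondary evidence.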
Compared with the prior art, this embodiment recognizes the text in the picture to be analyzed by optical character recognition; segments the recognized text into words; matches each segmented word against the advertisement keywords in the pre-established advertisement keyword library and assigns a keyword matching score according to the match result and a preset matching scoring rule; identifies the different font sizes of the characters and assigns a font score to each matched segmented word according to its font size and a preset font scoring rule; and determines by a preset rule, from the keyword matching score and the font score, whether the picture to be analyzed is an advertisement picture. When advertisement information appears in a picture, the advertisement text generally differs from the other, normal text, for example in font size or font color saliency. This embodiment matches each segmented word in the picture against the advertisement keywords in the library, assigns a keyword matching score according to the match, assigns a font score according to the font size of the matched segmented word, and sets a color saliency score according to its font color saliency. Combining the keyword matching score, the font score and the color saliency score allows a more accurate and effective judgment of whether the picture to be analyzed is an advertisement picture containing advertisement information. Moreover, no manual inspection is needed: advertisement pictures are identified automatically, which improves detection efficiency.
FIG. 2 is a schematic flowchart of an embodiment of the advertisement picture identification method of the present application. As shown in FIG. 2, the method includes the following steps:
Step S10: after receiving a picture to be analyzed, perform optical character recognition on the picture and recognize the text in it.
In this embodiment, the advertisement picture identification system receives an advertisement picture identification request, sent by a user, that contains the picture to be analyzed. For example, it receives a request sent by the user through a terminal such as a mobile phone, a tablet computer or a self-service terminal device, either from a client pre-installed on the terminal or from a browser on the terminal.
After receiving the advertisement picture identification request, the advertisement picture identification system performs optical character recognition (OCR) on the picture to be analyzed in the request. That is, for printed characters, the text is optically converted into a black-and-white dot-matrix image file, and recognition software converts the text in the image into text format.
OCR is used to recognize the characters in the picture to be analyzed. In this embodiment, a rare-character matching strategy may be applied during OCR. Because advertisement text is meant to be simple and easy to publicize, rare characters seldom appear in it. Therefore, if during OCR a character matches some rare character with high confidence but matches the common characters only with low confidence, an OCR recognition error is assumed. The character is then combined with its surrounding characters into a phrase and checked against the OCR phrase library; when the phrase closely matches a library phrase, the character is recognized as the common character at the corresponding position in the matched phrase. This improves the accuracy with which advertisement information in the picture is subsequently identified.
Distortion detection may also be performed on rare characters recognized in the picture to be analyzed. Advertisement text is sometimes specially processed in ways that distort the characters, for example by circling them, crossing them out, or assembling them from an advertisement font. After detection, these special marks can be removed and the characters themselves restored, so that advertisement information can be matched and identified in later steps.
In an optional implementation, QR code detection may also be performed on the picture to be analyzed. Once QR code information is detected in the picture, the picture is directly determined to be an advertisement picture; identification ends and no subsequent operation is needed.
Step S20: perform word segmentation on the recognized text.
In this embodiment, the text extracted by OCR is preprocessed; for example, special characters found in the initial recognition are removed, and line breaks are removed between characters that have the same font size and lie close together. The preprocessed text is then segmented as follows. a. Take the first m characters of the sentence to be segmented, from left to right, as the matching field, where m is the number of characters in the longest entry of the preset machine dictionary. b. Look up the m characters in the machine dictionary. If the match succeeds, cut the matching field off as one word. If the match fails, drop the last character of the matching field and use the remaining string as the new matching field; repeat until all words have been segmented. c. Perform steps a and b from right to left.
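Steps a to c above can be sketched as forward and backward maximum matching. This is a minimal illustration: the dictionary and sentence in the usage note are made-up examples, and emitting an unmatched single character as its own word is an assumption, since the text does not say how such characters are handled.

```python
def forward_max_match(sentence, dictionary):
    """Left-to-right maximum matching (steps a and b)."""
    max_len = max(len(word) for word in dictionary)
    words, start = [], 0
    while start < len(sentence):
        size = min(max_len, len(sentence) - start)
        # Drop the last character until the field is a dictionary word.
        while size > 1 and sentence[start:start + size] not in dictionary:
            size -= 1
        words.append(sentence[start:start + size])
        start += size
    return words


def backward_max_match(sentence, dictionary):
    """Right-to-left maximum matching (step c)."""
    max_len = max(len(word) for word in dictionary)
    words, end = [], len(sentence)
    while end > 0:
        size = min(max_len, end)
        # Drop the first character until the field is a dictionary word.
        while size > 1 and sentence[end - size:end] not in dictionary:
            size -= 1
        words.append(sentence[end - size:end])
        end -= size
    return words[::-1]
```

For the dictionary {"研究", "研究生", "生命", "命", "起源"} and the sentence "研究生命起源", the forward pass yields 研究生 / 命 / 起源 while the backward pass yields 研究 / 生命 / 起源, which shows why running both directions can be useful.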
Further, a second pass may follow segmentation: runs of consecutive capital-form numerals or English letters are treated as a single token and translated, so that advertisement information promoted through consecutive digits or English text can be identified.
In this embodiment, segmentation may also use an N-gram model, a Hidden Markov Model (HMM), or a Maximum Entropy Model. Segmentation algorithms may include forward maximum matching, reverse maximum matching, bidirectional maximum matching, and the shortest-path algorithm.
Step S30: match each segmented word against the advertisement keywords in the pre-established advertisement keyword library to obtain the segmented words that match advertisement keywords in the library, and assign a corresponding keyword matching score according to the match result and a preset matching scoring rule.
In this embodiment, the advertisement keyword library may be established in advance. For example, keyword libraries may be built per advertisement category, such as product advertisements, brand advertisements, concept advertisements and public service advertisements. Advertisements may also be graded by level: illegal advertisements common on the network, such as pornography, gambling, drugs and fraud, are set to the high-risk level and must be removed; advertisements for competing products and brands related to the business system are set to the dangerous level; and ordinary merchandise advertisements and the like are set to the ordinary level.
The established advertisement keyword library is used to perform keyword matching on the segmented words in the picture to be analyzed, and a score p3 is assigned according to the match between the segmented words and the library. The preset matching scoring rule is defined as follows:
a. Exact inclusion: if a segmented word of the picture to be analyzed matches an advertisement keyword in the pre-established advertisement keyword library, the corresponding first keyword matching score is assigned. That is, when the word to be matched completely contains a keyword in the library, it is considered an exact hit and p3 is 10 points.
b. Synonymous inclusion: if a segmented word of the picture to be analyzed matches a preset related word of an advertisement keyword in the pre-established advertisement keyword library, the corresponding second keyword matching score is assigned. The preset related words of an advertisement keyword include its synonyms, its near-synonyms, phrases related to it, and/or variant forms in which its characters are reversed or separated. Compared with exact inclusion, the match condition is thus extended to synonyms, near-synonyms and related words of the keyword, to phrases containing the keyword, and to forms with part of the character order reversed or with gaps inserted. That is, the match condition is that the word to be matched completely contains a variant form of a keyword in the library (insertion, reversal, synonym, near-synonym, related word); p3 is then 8 points.
c. Core inclusion: if a segmented word of the picture to be analyzed matches the core part of an advertisement keyword in the pre-established advertisement keyword library, or a preset related word of that core part, the corresponding third keyword matching score is assigned. That is, the match condition is that the word to be matched contains the core part of a keyword in the library, or a variant of that core part (insertion, reversal, synonym, near-synonym, related word); p3 is then 6 points.
After keyword matching is complete, if a segmented word in the picture to be analyzed matches a keyword in the advertisement keyword library (whether by exact inclusion, synonymous inclusion, or core inclusion) and the matched keyword belongs to the keyword library for high-risk-level advertisements, the picture is directly determined to contain a high-risk-level advertisement and must be removed. Identification then ends, and no subsequent operation is needed.
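The three scoring rules and the high-risk short circuit can be sketched as follows. The `AdKeyword` record, its field names, the level labels, and the choice to keep the best score across all segmented words are hypothetical; the text specifies only the scores 10, 8 and 6 and the early exit on a high-risk match.

```python
from dataclasses import dataclass, field


@dataclass
class AdKeyword:
    """Hypothetical library entry; field names are illustrative only."""
    word: str
    level: str  # "high_risk", "dangerous" or "ordinary"
    related: set = field(default_factory=set)       # synonyms, variants
    core: str = ""                                   # core part of the keyword
    core_related: set = field(default_factory=set)  # variants of the core


def keyword_score(token, keyword):
    """Return p3 for one segmented word against one library keyword."""
    if keyword.word in token:
        return 10  # a. exact inclusion
    if any(r in token for r in keyword.related):
        return 8   # b. synonymous inclusion
    if (keyword.core and keyword.core in token) or any(
        r in token for r in keyword.core_related
    ):
        return 6   # c. core inclusion
    return 0


def match_tokens(tokens, library):
    """Return (best p3, high_risk_hit); stop at the first high-risk match."""
    best = 0
    for token in tokens:
        for keyword in library:
            score = keyword_score(token, keyword)
            if score and keyword.level == "high_risk":
                return score, True
            best = max(best, score)
    return best, False
```

When the second element of the result is true, the remaining scoring steps would be skipped and the picture removed directly.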
If the matched keyword does not belong to the high-risk-level keyword library, that is, it belongs to the library for dangerous-level or ordinary-level advertisements, further semantic analysis may follow. For example, whether the picture to be analyzed contains advertisement information, and its advertisement category and level, may be judged from the context of the matched keyword or from a combination of several keywords. The picture may also be checked for direct contact information such as QQ, WeChat, email address, website address, or mobile phone number; if any is present, the picture may be directly determined to contain advertisement information, for example an advertisement unrelated to the business system. Specifically, direct contact information is detected as follows: when the characters in the picture to be analyzed contain a run of digits, check whether the run is followed by currency-unit information, measurement-unit information, or the like; if not, check whether the run has the form of a telephone number.
Step S40: identify the different font sizes of the characters in the picture to be analyzed, and assign a corresponding font score to each matched segmented word according to its font size and a preset font scoring rule.
When optical character recognition (OCR) is used to recognize the characters in the picture to be analyzed, font size analysis may also be performed on the recognized characters. Specifically, the picture may first be Gaussian-blurred, for example f'(x,y) = f(x,y) * g(x,y), where g(x,y) = exp(-(x^2 + y^2)/9). A peak distribution map is drawn for f'(x,y), and peak distribution maps of different levels are extracted according to a stepped distribution. In other words, the rough outline of each character is analyzed to distinguish the different font sizes in the picture. For example, characters in the peak distribution map of a preset level may be identified as larger-font characters, and the remaining characters in the picture as smaller-font characters. In practice, when advertisement information is mixed into a business picture, it is usually displayed in a larger font to attract attention. This embodiment therefore assigns a font score p1 to the characters in the picture to be analyzed, where larger-font characters receive a higher font score than smaller-font characters. For example, p1 = 2 for larger-font characters and p1 = 1 for smaller-font characters.
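One possible reading of the blur step is sketched below: after convolution with g(x,y) = exp(-(x^2 + y^2)/9), thick strokes from larger fonts keep higher peaks than thin strokes, so peak height can separate font sizes. The kernel radius of 3, the normalization of the kernel, and the zero padding at the image border are assumptions that the text does not specify.

```python
import math


def gaussian_kernel(radius=3):
    """Sample g(x, y) = exp(-(x^2 + y^2) / 9) and normalize it to sum to 1."""
    raw = [
        [math.exp(-(x * x + y * y) / 9) for x in range(-radius, radius + 1)]
        for y in range(-radius, radius + 1)
    ]
    total = sum(map(sum, raw))
    return [[value / total for value in row] for row in raw]


def blur(image, kernel):
    """Convolve a 2D list of intensities with a symmetric kernel."""
    r = len(kernel) // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += image[yy][xx] * kernel[dy + r][dx + r]
            out[y][x] = acc
    return out


def peak(image):
    """Highest value in a 2D list, used here as a crude font-size cue."""
    return max(max(row) for row in image)
```

Thresholding the peaks at a preset level would then split characters into the larger-font and smaller-font groups.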
Further, in an optional implementation, font color analysis may also be performed on the recognized characters. For each character recognized by OCR in the picture to be analyzed, a font color saliency is calculated. Characters whose font color saliency is greater than a preset color saliency threshold are identified as high-saliency characters, and characters whose font color saliency is less than or equal to the threshold are identified as low-saliency characters. Each character is then given a color saliency score according to its font color saliency, where the score for high-saliency characters is greater than the score for low-saliency characters. Specifically, for a font detected by OCR, the color saliency is calculated, for example, as drgb = (rgb(x,y) - rgb(s,t))^2; when drgb is greater than a specific threshold, the font is considered to have high color saliency. In practice, advertisement information may raise its color saliency to achieve a better promotional effect. This embodiment therefore assigns a color saliency score p2 to the font color of the characters in the picture to be analyzed, where characters with high color saliency receive a higher score than characters with low color saliency. For example, p2 = 1 for characters with high color saliency and p2 = 0.5 for characters with low color saliency.
Step S50: according to the keyword matching score and the font score, determine by a preset rule whether the picture to be analyzed is an advertisement picture.
In this embodiment, when the preset rule is used to determine whether the picture to be analyzed is an advertisement picture, a value P may be calculated by the following formula:
P = a1*P1 + a2*P2 + a3*P3
Here P1 is the font score corresponding to the font size of the matched segmented word in the picture to be analyzed, P2 is the color saliency score corresponding to the font color saliency of the matched segmented word, and P3 is the keyword matching score of the matched segmented word. a1, a2 and a3 are weights set in advance for the font score P1, the color saliency score P2 and the keyword matching score P3; for example, a1 = 0.2, a2 = 0.1, a3 = 0.7.
A threshold is set in advance. When the calculated P value reaches the threshold, the picture to be analyzed is determined to be an advertisement picture containing advertisement information, and an early warning is issued. In addition, the advertisement information may be evaluated comprehensively by combining the font, color, keyword level, number of keywords and so on of the matched segmented words in the picture, and different measures may be taken for different advertisements by defining advertisement categories and advertisement levels.
Compared with the prior art, this embodiment recognizes the text in the picture to be analyzed by optical character recognition; segments the recognized text into words; matches each segmented word against the advertisement keywords in the pre-established advertisement keyword library and assigns a keyword matching score according to the match result and a preset matching scoring rule; identifies the different font sizes of the characters and assigns a font score to each matched segmented word according to its font size and a preset font scoring rule; and determines by a preset rule, from the keyword matching score and the font score, whether the picture to be analyzed is an advertisement picture. When advertisement information appears in a picture, the advertisement text generally differs from the other, normal text, for example in font size or font color saliency. This embodiment matches each segmented word in the picture against the advertisement keywords in the library, assigns a keyword matching score according to the match, assigns a font score according to the font size of the matched segmented word, and sets a color saliency score according to its font color saliency. Combining the keyword matching score, the font score and the color saliency score allows a more accurate and effective judgment of whether the picture to be analyzed is an advertisement picture containing advertisement information. Moreover, no manual inspection is needed: advertisement pictures are identified automatically, which improves detection efficiency.
In addition, the present application provides a computer-readable storage medium storing an advertisement picture identification system. The system is executable by at least one processor to cause the at least one processor to perform the steps of the advertisement picture identification method of the above embodiments. The specific implementation of steps S10, S20, S30 and so on is as described above and is not repeated here.
It should be noted that, in this document, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, may be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to perform the methods described in the embodiments of the present application.
The preferred embodiments of the present application have been described above with reference to the drawings; they do not limit the scope of the claims of the present application. The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments. In addition, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
Those skilled in the art can implement the present application in many variants without departing from its scope and spirit; for example, a feature of one embodiment can be used in another embodiment to yield a further embodiment. Any modification, equivalent replacement, or improvement made within the technical concept of the present application shall fall within the scope of rights of the present application.

Claims (20)

  1. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing an advertisement picture identification system executable on the processor, wherein the advertisement picture identification system, when executed by the processor, implements the following steps:
    after a picture to be analyzed is received, performing optical character recognition on the picture to be analyzed to recognize the text in the picture to be analyzed;
    performing word segmentation on the recognized text;
    matching each segmented word against each advertisement keyword in a pre-established advertisement keyword library to obtain the segmented words that match advertisement keywords in the pre-established advertisement keyword library, and assigning a corresponding keyword matching score according to the matching result under a preset matching scoring rule;
    identifying the different font sizes of the characters in the picture to be analyzed, and assigning a corresponding font score under a preset font scoring rule according to the font size of the matched segmented words;
    determining, according to the keyword matching score and the font score and by using a preset rule, whether the picture to be analyzed is an advertisement picture.
  2. The electronic device according to claim 1, characterized in that identifying the different font sizes of the characters in the picture to be analyzed comprises:
    performing Gaussian blur processing on the picture to be analyzed, drawing a peak distribution map of the picture to be analyzed after the Gaussian blur processing, and extracting peak distribution maps of different levels according to a stepped distribution; identifying the characters in the peak distribution map of a preset level as larger-font characters, and identifying the remaining characters in the picture to be analyzed as smaller-font characters;
    the preset font scoring rule comprises:
    setting a corresponding font score for each character in the picture to be analyzed according to its font size, wherein the font score of a larger-font character is greater than the font score of a smaller-font character.
  3. The electronic device according to claim 1, characterized in that the processor is further configured to execute the advertisement picture identification system to implement the following steps:
    for the text recognized in the picture to be analyzed by optical character recognition, calculating the font color saliency of each character;
    identifying characters whose font color saliency is greater than a preset color saliency threshold as high-color-saliency characters, and identifying characters whose font color saliency is less than or equal to the preset color saliency threshold as low-color-saliency characters;
    setting a corresponding color saliency score for each character in the picture to be analyzed according to its font color saliency, wherein the color saliency score of a high-color-saliency character is greater than the color saliency score of a low-color-saliency character.
  4. The electronic device according to claim 2, characterized in that the processor is further configured to execute the advertisement picture identification system to implement the following steps:
    for the text recognized in the picture to be analyzed by optical character recognition, calculating the font color saliency of each character;
    identifying characters whose font color saliency is greater than a preset color saliency threshold as high-color-saliency characters, and identifying characters whose font color saliency is less than or equal to the preset color saliency threshold as low-color-saliency characters;
    setting a corresponding color saliency score for each character in the picture to be analyzed according to its font color saliency, wherein the color saliency score of a high-color-saliency character is greater than the color saliency score of a low-color-saliency character.
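The color saliency steps of claims 3 and 4 could be sketched roughly as follows. The claims do not say how font color saliency is computed, so this sketch assumes one plausible measure: the Euclidean RGB distance between a character's font color and its background, normalized to the range 0 to 1. The threshold and the two score values are hypothetical; the claims only require that the high-saliency score exceed the low-saliency score.

```python
import math
from typing import Tuple

RGB = Tuple[int, int, int]

# Hypothetical values, chosen only for illustration.
SALIENCY_THRESHOLD = 0.5
HIGH_SALIENCY_SCORE = 1.0
LOW_SALIENCY_SCORE = 0.2

# Largest possible RGB distance (black versus white), used for normalization.
MAX_RGB_DISTANCE = math.sqrt(3 * 255 ** 2)


def color_saliency(font_color: RGB, background_color: RGB) -> float:
    """Assumed saliency measure: normalized RGB distance to the background."""
    return math.dist(font_color, background_color) / MAX_RGB_DISTANCE


def color_saliency_score(
    font_color: RGB,
    background_color: RGB,
    threshold: float = SALIENCY_THRESHOLD,
) -> float:
    """Strictly above the threshold is high saliency; otherwise low saliency."""
    saliency = color_saliency(font_color, background_color)
    if saliency > threshold:
        return HIGH_SALIENCY_SCORE
    return LOW_SALIENCY_SCORE
```

Under these assumptions, red text on a white background receives the high score, while light gray text on white receives the low score.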
  5. The electronic device according to claim 3, characterized in that the preset matching scoring rule comprises:
    if the advertisement keyword in the pre-established advertisement keyword library that is matched by a segmented word of the picture to be analyzed is a preset high-risk advertisement word, directly determining that the picture to be analyzed is an advertisement picture;
    if the advertisement keyword in the pre-established advertisement keyword library that is matched by a segmented word of the picture to be analyzed is not a preset high-risk advertisement word, then:
    if a segmented word of the picture to be analyzed matches an advertisement keyword in the pre-established advertisement keyword library, assigning a corresponding first keyword matching score;
    if a segmented word of the picture to be analyzed matches a preset related word of an advertisement keyword in the pre-established advertisement keyword library, assigning a corresponding second keyword matching score, wherein the preset related words of an advertisement keyword include synonyms of the advertisement keyword, near-synonyms of the advertisement keyword, phrases related to the advertisement keyword, and/or variant forms produced by reversing or spacing out the characters of the advertisement keyword;
    if a segmented word of the picture to be analyzed matches the core part of an advertisement keyword in the pre-established advertisement keyword library, or a preset related word of that core part, assigning a corresponding third keyword matching score;
    wherein the first keyword matching score is greater than the second keyword matching score, and the second keyword matching score is greater than the third keyword matching score.
  6. The electronic device according to claim 4, characterized in that the preset matching scoring rule comprises:
    if the advertisement keyword in the pre-established advertisement keyword library that is matched by a segmented word of the picture to be analyzed is a preset high-risk advertisement word, directly determining that the picture to be analyzed is an advertisement picture;
    if the advertisement keyword in the pre-established advertisement keyword library that is matched by a segmented word of the picture to be analyzed is not a preset high-risk advertisement word, then:
    if a segmented word of the picture to be analyzed matches an advertisement keyword in the pre-established advertisement keyword library, assigning a corresponding first keyword matching score;
    if a segmented word of the picture to be analyzed matches a preset related word of an advertisement keyword in the pre-established advertisement keyword library, assigning a corresponding second keyword matching score, wherein the preset related words of an advertisement keyword include synonyms of the advertisement keyword, near-synonyms of the advertisement keyword, phrases related to the advertisement keyword, and/or variant forms produced by reversing or spacing out the characters of the advertisement keyword;
    if a segmented word of the picture to be analyzed matches the core part of an advertisement keyword in the pre-established advertisement keyword library, or a preset related word of that core part, assigning a corresponding third keyword matching score;
    wherein the first keyword matching score is greater than the second keyword matching score, and the second keyword matching score is greater than the third keyword matching score.
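The tiered matching rule of claims 5 and 6 might look something like the sketch below. The `KeywordEntry` structure, the toy library, and the three score values are all hypothetical; the claims only fix the ordering first > second > third and the high-risk shortcut. Matching here is plain string equality, which is an assumption, since the claims do not specify how a match is tested.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Tuple

# Hypothetical scores; only the ordering FIRST > SECOND > THIRD is required.
FIRST_SCORE, SECOND_SCORE, THIRD_SCORE = 1.0, 0.7, 0.4


@dataclass(frozen=True)
class KeywordEntry:
    """One entry of an illustrative advertisement keyword library."""
    keyword: str
    high_risk: bool = False
    related: FrozenSet[str] = frozenset()       # synonyms, reversed or spaced variants
    core: Optional[str] = None                  # core part of the keyword
    core_related: FrozenSet[str] = frozenset()  # related words of the core part


def score_token(token: str, library: Iterable[KeywordEntry]) -> Tuple[bool, float]:
    """Return (is_high_risk, keyword_matching_score) for one segmented word."""
    best = 0.0
    for entry in library:
        if token == entry.keyword:
            if entry.high_risk:
                # High-risk keyword: the picture is judged an advertisement directly.
                return True, FIRST_SCORE
            best = max(best, FIRST_SCORE)
        elif token in entry.related:
            best = max(best, SECOND_SCORE)
        elif token == entry.core or token in entry.core_related:
            best = max(best, THIRD_SCORE)
    return False, best


# Toy library with made-up English keywords.
LIBRARY = [
    KeywordEntry("loan", high_risk=True),
    KeywordEntry(
        "free-shipping",
        related=frozenset({"free-delivery", "gnippihs-eerf"}),
        core="free",
        core_related=frozenset({"gratis"}),
    ),
]
```

With this toy library, `score_token("loan", LIBRARY)` triggers the high-risk shortcut, while `"free-shipping"`, `"free-delivery"`, and `"gratis"` fall into the first, second, and third tiers respectively.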
  7. The electronic device according to claim 5, characterized in that determining, by using the preset rule, whether the picture to be analyzed is an advertisement picture comprises:
    calculating a value P according to the following formula:
    P = a1*P1 + a2*P2 + a3*P3
    wherein P1 is the font score corresponding to the font size of the matched segmented words in the picture to be analyzed, P2 is the color saliency score corresponding to the font color saliency of the matched segmented words in the picture to be analyzed, and P3 is the keyword matching score corresponding to the matched segmented words in the picture to be analyzed; a1, a2, and a3 are weights set in advance for the font score P1, the color saliency score P2, and the keyword matching score P3, respectively;
    comparing the calculated value P with a preset threshold, and if P is greater than the preset threshold, determining that the picture to be analyzed is an advertisement picture.
  8. The electronic device according to claim 6, characterized in that determining, by using the preset rule, whether the picture to be analyzed is an advertisement picture comprises:
    calculating a value P according to the following formula:
    P = a1*P1 + a2*P2 + a3*P3
    wherein P1 is the font score corresponding to the font size of the matched segmented words in the picture to be analyzed, P2 is the color saliency score corresponding to the font color saliency of the matched segmented words in the picture to be analyzed, and P3 is the keyword matching score corresponding to the matched segmented words in the picture to be analyzed; a1, a2, and a3 are weights set in advance for the font score P1, the color saliency score P2, and the keyword matching score P3, respectively;
    comparing the calculated value P with a preset threshold, and if P is greater than the preset threshold, determining that the picture to be analyzed is an advertisement picture.
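The weighted decision of claims 7 and 8 can be illustrated with a short sketch. The formula P = a1*P1 + a2*P2 + a3*P3 and the strict "greater than" comparison come from the claims; the default weights and the default threshold below are hypothetical placeholders, since the claims leave them as preset values.

```python
from typing import Tuple

# Hypothetical presets; the claims do not give concrete values.
DEFAULT_WEIGHTS: Tuple[float, float, float] = (0.3, 0.2, 0.5)  # a1, a2, a3
DEFAULT_THRESHOLD = 0.6


def weighted_score(
    p1: float,
    p2: float,
    p3: float,
    weights: Tuple[float, float, float] = DEFAULT_WEIGHTS,
) -> float:
    """P = a1*P1 + a2*P2 + a3*P3 (font, color saliency, keyword match)."""
    a1, a2, a3 = weights
    return a1 * p1 + a2 * p2 + a3 * p3


def is_advertisement(
    p1: float,
    p2: float,
    p3: float,
    weights: Tuple[float, float, float] = DEFAULT_WEIGHTS,
    threshold: float = DEFAULT_THRESHOLD,
) -> bool:
    """The picture counts as an advertisement only if P exceeds the threshold."""
    return weighted_score(p1, p2, p3, weights) > threshold
```

Because the comparison is strict, a picture whose P value equals the threshold exactly is not classified as an advertisement under this reading of the claims.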
  9. An advertisement picture identification method, characterized in that the advertisement picture identification method comprises:
    after a picture to be analyzed is received, performing optical character recognition on the picture to be analyzed to recognize the text in the picture to be analyzed;
    performing word segmentation on the recognized text;
    matching each segmented word against each advertisement keyword in a pre-established advertisement keyword library to obtain the segmented words that match advertisement keywords in the pre-established advertisement keyword library, and assigning a corresponding keyword matching score according to the matching result under a preset matching scoring rule;
    identifying the different font sizes of the characters in the picture to be analyzed, and assigning a corresponding font score under a preset font scoring rule according to the font size of the matched segmented words;
    determining, according to the keyword matching score and the font score and by using a preset rule, whether the picture to be analyzed is an advertisement picture.
  10. The advertisement picture identification method according to claim 9, characterized in that identifying the different font sizes of the characters in the picture to be analyzed comprises:
    performing Gaussian blur processing on the picture to be analyzed, drawing a peak distribution map of the picture to be analyzed after the Gaussian blur processing, and extracting peak distribution maps of different levels according to a stepped distribution; identifying the characters in the peak distribution map of a preset level as larger-font characters, and identifying the remaining characters in the picture to be analyzed as smaller-font characters;
    the preset font scoring rule comprises:
    setting a corresponding font score for each character in the picture to be analyzed according to its font size, wherein the font score of a larger-font character is greater than the font score of a smaller-font character.
  11. The advertisement picture identification method according to claim 9, characterized in that the method further comprises:
    for the text recognized in the picture to be analyzed by optical character recognition, calculating the font color saliency of each character;
    identifying characters whose font color saliency is greater than a preset color saliency threshold as high-color-saliency characters, and identifying characters whose font color saliency is less than or equal to the preset color saliency threshold as low-color-saliency characters;
    setting a corresponding color saliency score for each character in the picture to be analyzed according to its font color saliency, wherein the color saliency score of a high-color-saliency character is greater than the color saliency score of a low-color-saliency character.
  12. The advertisement picture identification method according to claim 10, characterized in that the method further comprises:
    for the text recognized in the picture to be analyzed by optical character recognition, calculating the font color saliency of each character;
    identifying characters whose font color saliency is greater than a preset color saliency threshold as high-color-saliency characters, and identifying characters whose font color saliency is less than or equal to the preset color saliency threshold as low-color-saliency characters;
    setting a corresponding color saliency score for each character in the picture to be analyzed according to its font color saliency, wherein the color saliency score of a high-color-saliency character is greater than the color saliency score of a low-color-saliency character.
  13. The advertisement picture identification method according to claim 11, characterized in that the preset matching scoring rule comprises:
    if the advertisement keyword in the pre-established advertisement keyword library that is matched by a segmented word of the picture to be analyzed is a preset high-risk advertisement word, directly determining that the picture to be analyzed is an advertisement picture;
    if the advertisement keyword in the pre-established advertisement keyword library that is matched by a segmented word of the picture to be analyzed is not a preset high-risk advertisement word, then:
    if a segmented word of the picture to be analyzed matches an advertisement keyword in the pre-established advertisement keyword library, assigning a corresponding first keyword matching score;
    if a segmented word of the picture to be analyzed matches a preset related word of an advertisement keyword in the pre-established advertisement keyword library, assigning a corresponding second keyword matching score, wherein the preset related words of an advertisement keyword include synonyms of the advertisement keyword, near-synonyms of the advertisement keyword, phrases related to the advertisement keyword, and/or variant forms produced by reversing or spacing out the characters of the advertisement keyword;
    if a segmented word of the picture to be analyzed matches the core part of an advertisement keyword in the pre-established advertisement keyword library, or a preset related word of that core part, assigning a corresponding third keyword matching score;
    wherein the first keyword matching score is greater than the second keyword matching score, and the second keyword matching score is greater than the third keyword matching score.
  14. The advertisement picture identification method according to claim 12, characterized in that the preset matching scoring rule comprises:
    if the advertisement keyword in the pre-established advertisement keyword library that is matched by a segmented word of the picture to be analyzed is a preset high-risk advertisement word, directly determining that the picture to be analyzed is an advertisement picture;
    if the advertisement keyword in the pre-established advertisement keyword library that is matched by a segmented word of the picture to be analyzed is not a preset high-risk advertisement word, then:
    if a segmented word of the picture to be analyzed matches an advertisement keyword in the pre-established advertisement keyword library, assigning a corresponding first keyword matching score;
    if a segmented word of the picture to be analyzed matches a preset related word of an advertisement keyword in the pre-established advertisement keyword library, assigning a corresponding second keyword matching score, wherein the preset related words of an advertisement keyword include synonyms of the advertisement keyword, near-synonyms of the advertisement keyword, phrases related to the advertisement keyword, and/or variant forms produced by reversing or spacing out the characters of the advertisement keyword;
    if a segmented word of the picture to be analyzed matches the core part of an advertisement keyword in the pre-established advertisement keyword library, or a preset related word of that core part, assigning a corresponding third keyword matching score;
    wherein the first keyword matching score is greater than the second keyword matching score, and the second keyword matching score is greater than the third keyword matching score.
  15. The advertisement picture identification method according to claim 13, characterized in that determining, by using the preset rule, whether the picture to be analyzed is an advertisement picture comprises:
    calculating a value P according to the following formula:
    P = a1*P1 + a2*P2 + a3*P3
    wherein P1 is the font score corresponding to the font size of the matched segmented words in the picture to be analyzed, P2 is the color saliency score corresponding to the font color saliency of the matched segmented words in the picture to be analyzed, and P3 is the keyword matching score corresponding to the matched segmented words in the picture to be analyzed; a1, a2, and a3 are weights set in advance for the font score P1, the color saliency score P2, and the keyword matching score P3, respectively;
    comparing the calculated value P with a preset threshold, and if P is greater than the preset threshold, determining that the picture to be analyzed is an advertisement picture.
  16. The advertisement picture identification method according to claim 14, characterized in that determining, by using the preset rule, whether the picture to be analyzed is an advertisement picture comprises:
    calculating a value P according to the following formula:
    P = a1*P1 + a2*P2 + a3*P3
    wherein P1 is the font score corresponding to the font size of the matched segmented words in the picture to be analyzed, P2 is the color saliency score corresponding to the font color saliency of the matched segmented words in the picture to be analyzed, and P3 is the keyword matching score corresponding to the matched segmented words in the picture to be analyzed; a1, a2, and a3 are weights set in advance for the font score P1, the color saliency score P2, and the keyword matching score P3, respectively;
    comparing the calculated value P with a preset threshold, and if P is greater than the preset threshold, determining that the picture to be analyzed is an advertisement picture.
  17. A computer-readable storage medium, characterized in that the computer-readable storage medium stores an advertisement picture identification system, wherein the advertisement picture identification system, when executed by a processor, implements the following steps:
    after a picture to be analyzed is received, performing optical character recognition on the picture to be analyzed to recognize the text in the picture to be analyzed;
    performing word segmentation on the recognized text;
    matching each segmented word against each advertisement keyword in a pre-established advertisement keyword library to obtain the segmented words that match advertisement keywords in the pre-established advertisement keyword library, and assigning a corresponding keyword matching score according to the matching result under a preset matching scoring rule;
    identifying the different font sizes of the characters in the picture to be analyzed, and assigning a corresponding font score under a preset font scoring rule according to the font size of the matched segmented words;
    determining, according to the keyword matching score and the font score and by using a preset rule, whether the picture to be analyzed is an advertisement picture.
  18. The computer-readable storage medium according to claim 17, characterized in that identifying the different font sizes of the characters in the picture to be analyzed comprises:
    performing Gaussian blur processing on the picture to be analyzed, drawing a peak distribution map of the picture to be analyzed after the Gaussian blur processing, and extracting peak distribution maps of different levels according to a stepped distribution; identifying the characters in the peak distribution map of a preset level as larger-font characters, and identifying the remaining characters in the picture to be analyzed as smaller-font characters;
    the preset font scoring rule comprises:
    setting a corresponding font score for each character in the picture to be analyzed according to its font size, wherein the font score of a larger-font character is greater than the font score of a smaller-font character.
  19. The computer-readable storage medium according to claim 17, characterized in that the method further comprises:
    for the text recognized in the picture to be analyzed by optical character recognition, calculating the font color saliency of each character;
    identifying characters whose font color saliency is greater than a preset color saliency threshold as high-color-saliency characters, and identifying characters whose font color saliency is less than or equal to the preset color saliency threshold as low-color-saliency characters;
    setting a corresponding color saliency score for each character in the picture to be analyzed according to its font color saliency, wherein the color saliency score of a high-color-saliency character is greater than the color saliency score of a low-color-saliency character.
  20. The computer-readable storage medium according to claim 18, characterized in that the method further comprises:
    for the text recognized in the picture to be analyzed by optical character recognition, calculating the font color saliency of each character;
    identifying characters whose font color saliency is greater than a preset color saliency threshold as high-color-saliency characters, and identifying characters whose font color saliency is less than or equal to the preset color saliency threshold as low-color-saliency characters;
    setting a corresponding color saliency score for each character in the picture to be analyzed according to its font color saliency, wherein the color saliency score of a high-color-saliency character is greater than the color saliency score of a low-color-saliency character.
PCT/CN2018/089720 2018-03-06 2018-06-03 Advertisement picture identification method, electronic device, and readable storage medium WO2019169769A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810183371.6A CN108399161A (en) 2018-03-06 2018-03-06 Advertisement picture identification method, electronic device, and readable storage medium
CN201810183371.6 2018-03-06

Publications (1)

Publication Number Publication Date
WO2019169769A1 true WO2019169769A1 (en) 2019-09-12

Family

ID=63091969

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/089720 WO2019169769A1 (en) 2018-03-06 2018-06-03 Advertisement picture identification method, electronic device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN108399161A (en)
WO (1) WO2019169769A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063076B (en) * 2018-07-24 2021-07-13 维沃移动通信有限公司 Picture generation method and mobile terminal
CN109246465A (en) * 2018-08-30 2019-01-18 维沃移动通信有限公司 A kind of interface display method and terminal device
CN109241437A (en) * 2018-09-19 2019-01-18 麒麟合盛网络技术股份有限公司 A kind of generation method, advertisement recognition method and the system of advertisement identification model
CN109583443B (en) * 2018-11-15 2022-10-18 四川长虹电器股份有限公司 Video content judgment method based on character recognition
CN110163203B (en) * 2019-04-09 2021-08-24 浙江口碑网络技术有限公司 Character recognition method, device, storage medium and computer equipment
CN110598211B (en) * 2019-09-02 2023-09-26 腾讯科技(深圳)有限公司 Article identification method and device, storage medium and electronic device
CN110705364B (en) * 2019-09-06 2021-04-30 武汉美格科技股份有限公司 Malicious advertisement eliminating method and system
CN114444504B (en) * 2022-04-11 2022-08-05 西南交通大学 Enterprise business classification coding method, device, equipment and readable storage medium
CN116841424B (en) * 2023-08-28 2024-02-09 华能信息技术有限公司 Screen capture monitoring method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130330008A1 (en) * 2011-09-24 2013-12-12 Lotfi A. Zadeh Methods and Systems for Applications for Z-numbers
CN104376304A (en) * 2014-11-18 2015-02-25 新浪网技术(中国)有限公司 Identification method and device for text advertisement image
CN104715248A (en) * 2015-03-19 2015-06-17 无锡华云数据技术服务有限公司 Method for recognizing mail advertisement picture

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605692A (en) * 2013-11-04 2014-02-26 北京奇虎科技有限公司 Device and method used for shielding advertisement contents in ask-and-answer community
CN106815242A (en) * 2015-11-30 2017-06-09 腾讯科技(深圳)有限公司 Textual resources data detection method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561549A (en) * 2019-09-25 2021-03-26 北京国双科技有限公司 Advertisement generation method, advertisement delivery method, advertisement generation device and advertisement delivery device
CN111191430A (en) * 2019-12-27 2020-05-22 中国平安财产保险股份有限公司 Automatic table building method and device, computer equipment and storage medium
CN114758216A (en) * 2022-05-05 2022-07-15 北京容联易通信息技术有限公司 Illegal advertisement detection method and system based on machine vision
CN114758216B (en) * 2022-05-05 2023-01-13 北京容联易通信息技术有限公司 Illegal advertisement detection method and system based on machine vision
CN116996840A (en) * 2023-09-26 2023-11-03 北京百悟科技有限公司 Short message auditing method, device, equipment and storage medium
CN116996840B (en) * 2023-09-26 2023-12-29 北京百悟科技有限公司 Short message auditing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN108399161A (en) 2018-08-14

Similar Documents

Publication Publication Date Title
WO2019169769A1 (en) Advertisement picture identification method, electronic device, and readable storage medium
CN108519970B (en) Method for identifying sensitive information in text, electronic device and readable storage medium
US10853638B2 (en) System and method for extracting structured information from image documents
WO2019184217A1 (en) Hotspot event classification method and apparatus, and storage medium
CN112507936B (en) Image information auditing method and device, electronic equipment and readable storage medium
CN107704512B (en) Financial product recommendation method based on social data, electronic device and medium
US11361570B2 (en) Receipt identification method, apparatus, device and storage medium
US11055327B2 (en) Unstructured data parsing for structured information
US9754176B2 (en) Method and system for data extraction from images of semi-structured documents
CN108491866B (en) Pornographic picture identification method, electronic device and readable storage medium
US20190294912A1 (en) Image processing device, image processing method, and image processing program
CA2656425A1 (en) Recognizing text in images
US7136526B2 (en) Character string recognition apparatus, character string recognizing method, and storage medium therefor
US20210073535A1 (en) Information processing apparatus and information processing method for extracting information from document image
US20080008391A1 (en) Method and System for Document Form Recognition
WO2023038722A1 (en) Entry detection and recognition for custom forms
US8571262B2 (en) Methods of object search and recognition
KR102282025B1 (en) Method for automatically sorting documents and extracting characters by using computer
JP2022095391A (en) Information processing apparatus and information processing program
CN111144345A (en) Character recognition method, device, equipment and storage medium
CN113342977B (en) Invoice image classification method, device, equipment and storage medium
CN110807322B (en) Method, device, server and storage medium for identifying new words based on information entropy
CN114663886A (en) Text recognition method, model training method and device
CN113536782A (en) Sensitive word recognition method and device, electronic equipment and storage medium
US9224040B2 (en) Method for object recognition and describing structure of graphical objects

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application; Ref document number: 18908595; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase; Ref country code: DE

32PN Ep: public notification in the EP bulletin as address of the addressee cannot be established; Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.12.2020)

122 Ep: PCT application non-entry in European phase; Ref document number: 18908595; Country of ref document: EP; Kind code of ref document: A1