WO2022222850A1 - Multimedia content identification method, related devices, equipment, and storage medium - Google Patents

Multimedia content identification method, related devices, equipment, and storage medium

Info

Publication number
WO2022222850A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
malicious
video
promotion
information
Prior art date
Application number
PCT/CN2022/086948
Other languages
English (en)
French (fr)
Inventor
郝彦超
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to US17/967,454 (US20230032728A1)
Publication of WO2022222850A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/435 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0248 Avoiding fraud
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0269 Targeted advertisements based on user profile or attribute
    • G06Q30/0271 Personalized advertisement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0277 Online advertisement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G06V30/19013 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Definitions

  • the present application relates to the field of cloud computing technology, and in particular, to the identification of multimedia content.
  • malicious promotion can be identified based on advertisement promotion text.
  • Embodiments of the present application provide a multimedia content identification method, related devices, equipment, and storage medium, which identify the degree of malicious promotion of multimedia content from multiple perspectives, such as text, images, and audio, to grasp video quality more comprehensively and form a more complete strategy for identifying malicious video promotion, thereby improving the accuracy of that identification.
  • one aspect of the present application provides a method for identifying multimedia content, including:
  • acquiring target text information and content information of a video to be identified, wherein the target text information includes at least one of title text and introduction text, and the content information includes at least one of image data and audio data;
  • performing text recognition processing on the content information to obtain associated text information, wherein the associated text information includes at least one of image text and audio text, the image text is obtained by performing text recognition on the image data, and the audio text is obtained by performing text recognition on the audio data;
  • using at least one of the target text information and the associated text information that satisfies a first malicious promotion condition as text content, and obtaining a target text classification result through a text classification model based on the text content, wherein the target text classification result represents the degree of malicious promotion of the text content;
  • determining a video recognition result corresponding to the video to be identified according to the target text classification result, wherein the video recognition result indicates the degree of malicious promotion of the video to be identified.
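  • The steps of this aspect can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation: the `VideoInput` fields, the `meets_first_condition` predicate, the `classify` callable, and the choice to aggregate per-text scores with `max` are all assumptions introduced here, since the text above does not define the first malicious promotion condition or how several classified texts are combined. OCR and ASR are assumed to have already produced `image_text` and `audio_text`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VideoInput:
    """Text available for one video (all field names are hypothetical)."""
    title_text: str = ""
    intro_text: str = ""
    image_text: str = ""  # assumed output of a separate OCR step
    audio_text: str = ""  # assumed output of a separate ASR step


def identify_video(
    video: VideoInput,
    meets_first_condition: Callable[[str], bool],
    classify: Callable[[str], float],
) -> float:
    """Return a malicious-promotion degree in [0, 1] for the video."""
    candidates: List[str] = [
        video.title_text, video.intro_text, video.image_text, video.audio_text,
    ]
    # Only texts that satisfy the (assumed) first condition become "text content".
    text_content = [t for t in candidates if t and meets_first_condition(t)]
    if not text_content:
        return 0.0  # assumption: nothing flagged means no promotion detected
    # Assumption: the video-level result is the highest per-text score.
    return max(classify(t) for t in text_content)


def toy_condition(text: str) -> bool:
    # Stand-in keyword check, purely for illustration.
    return any(k in text.lower() for k in ("loan", "add me"))


def toy_classifier(text: str) -> float:
    # Stand-in for a trained text classification model.
    return 0.9 if "loan" in text.lower() else 0.4


suspicious = VideoInput(title_text="Cute cats", audio_text="Cheap loan, add me now")
harmless = VideoInput(title_text="Cute cats", intro_text="A short clip")
print(identify_video(suspicious, toy_condition, toy_classifier))
print(identify_video(harmless, toy_condition, toy_classifier))
```

  • Passing the condition and the classifier in as callables keeps the sketch independent of any particular model or rule set.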
  • Another aspect of the present application provides a method for identifying multimedia content, comprising:
  • acquiring target text information of a text to be recognized and original text information for the text to be recognized, wherein the target text information includes at least one of title text and introduction text, and the original text information includes at least one of comment information and bullet screen information;
  • if the target text information satisfies a first malicious promotion condition, obtaining a target text classification result through a text classification model based on the target text information, wherein the target text classification result represents the degree of malicious promotion of the target text information;
  • if the original text information satisfies the first malicious promotion condition, obtaining an original text classification result through the text classification model based on the original text information, wherein the original text classification result indicates the degree of malicious promotion of the original text information;
  • determining a text recognition result corresponding to the text to be recognized according to the target text classification result and/or the original text classification result, wherein the text recognition result indicates the degree of malicious promotion of the text to be recognized.
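  • The "and/or" combination in the last step can be illustrated with a small sketch. The helper names, the use of `None` for a result that was never computed, and the `max` aggregation are assumptions made for illustration; the text above does not specify how the two classification results are merged.

```python
from typing import Callable, Optional


def classify_if_flagged(
    text: str,
    meets_first_condition: Callable[[str], bool],
    classify: Callable[[str], float],
) -> Optional[float]:
    """Classify the text only when it satisfies the (assumed) first condition."""
    if text and meets_first_condition(text):
        return classify(text)
    return None  # no classification result for this source


def combine_results(
    target_result: Optional[float],
    original_result: Optional[float],
) -> float:
    """Merge whichever results exist; the max rule is an assumption."""
    available = [r for r in (target_result, original_result) if r is not None]
    return max(available) if available else 0.0


# Title is harmless and never classified; a comment is flagged and scored.
title_score = classify_if_flagged("Cute cats", lambda t: "loan" in t, lambda t: 0.8)
comment_score = classify_if_flagged("quick loan here", lambda t: "loan" in t, lambda t: 0.8)
print(combine_results(title_score, comment_score))
```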
  • Another aspect of the present application provides a method for identifying multimedia content, comprising:
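  • acquiring image data of an image to be recognized and original text information for the image to be recognized, wherein the original text information includes at least one of comment information and bullet screen information;
  • performing text recognition processing on the image data to obtain image text;
  • if the image text satisfies a first malicious promotion condition, obtaining an image text classification result through a text classification model based on the image text, wherein the image text classification result represents the degree of malicious promotion of the image text;
  • if the original text information satisfies the first malicious promotion condition, obtaining an original text classification result through the text classification model based on the original text information, wherein the original text classification result indicates the degree of malicious promotion of the original text information;
  • determining an image recognition result corresponding to the image to be recognized according to the image text classification result and/or the original text classification result, wherein the image recognition result indicates the degree of malicious promotion of the image to be recognized.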
  • Another aspect of the present application provides a method for identifying multimedia content, comprising:
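  • acquiring audio data of an audio to be recognized and original text information for the audio to be recognized, wherein the original text information includes at least one of comment information and bullet screen information;
  • performing text recognition processing on the audio data to obtain audio text;
  • if the audio text satisfies a first malicious promotion condition, obtaining an audio text classification result through a text classification model based on the audio text, wherein the audio text classification result represents the degree of malicious promotion of the audio text;
  • if the original text information satisfies the first malicious promotion condition, obtaining an original text classification result through the text classification model based on the original text information, wherein the original text classification result indicates the degree of malicious promotion of the original text information;
  • determining an audio recognition result corresponding to the audio to be recognized according to the audio text classification result and/or the original text classification result, wherein the audio recognition result indicates the degree of malicious promotion of the audio to be recognized.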
  • Another aspect of the present application provides a multimedia content identification device, comprising:
  • an acquisition module configured to acquire target text information and content information of the video to be identified, wherein the target text information includes at least one of title text and introduction text, and the content information includes at least one of image data and audio data;
  • a recognition module configured to perform text recognition processing on the content information to obtain associated text information, wherein the associated text information includes at least one of image text and audio text, the image text is obtained after text recognition is performed on the image data, and the audio text is obtained after text recognition is performed on the audio data;
  • the acquisition module is further configured to use at least one of the target text information and the associated text information that satisfies the first malicious promotion condition as the text content, and then obtain the target text classification result through the text classification model based on the text content, wherein the target text classification result indicates the degree of malicious promotion of the text content;
  • the recognition module is further configured to determine a video recognition result corresponding to the video to be recognized according to the target text classification result, wherein the video recognition result indicates the degree of malicious promotion of the video to be recognized.
  • Another aspect of the present application provides a multimedia content identification device, comprising:
  • an acquisition module configured to acquire target text information of the text to be recognized and original text information for the text to be recognized, wherein the target text information includes at least one of title text and introduction text, and the original text information includes at least one of comment information and bullet screen information;
  • the obtaining module is further configured to obtain the target text classification result through the text classification model based on the target text information if the target text information satisfies the first malicious promotion condition, wherein the target text classification result represents the malicious promotion degree of the target text information;
  • the obtaining module is further configured to obtain the original text classification result through the text classification model based on the original text information if the original text information satisfies the first malicious promotion condition, wherein the original text classification result indicates the malicious promotion degree of the original text information;
  • the recognition module is configured to determine the text recognition result corresponding to the text to be recognized according to the target text classification result and/or the original text classification result, wherein the text recognition result indicates the degree of malicious promotion of the text to be recognized.
  • Another aspect of the present application provides a multimedia content identification device, comprising:
  • an acquisition module configured to acquire image data of the image to be recognized and original text information for the image to be recognized, wherein the original text information includes at least one of comment information and bullet screen information;
  • the recognition module is used to perform text recognition processing on the image data to obtain the image text;
  • the obtaining module is further configured to obtain the image text classification result through the text classification model based on the image text if the image text satisfies the first malicious promotion condition, wherein the image text classification result represents the malicious promotion degree of the image text;
  • the obtaining module is further configured to obtain the original text classification result through the text classification model based on the original text information if the original text information satisfies the first malicious promotion condition, wherein the original text classification result indicates the malicious promotion degree of the original text information;
  • the recognition module is further configured to determine the image recognition result corresponding to the to-be-recognized image according to the image text classification result and/or the original text classification result, wherein the image recognition result represents the degree of malicious promotion of the to-be-recognized image.
  • Another aspect of the present application provides a multimedia content identification device, comprising:
  • an acquisition module configured to acquire audio data of the audio to be recognized and original text information for the audio to be recognized, wherein the original text information includes at least one of comment information and bullet screen information;
  • the recognition module is used to perform text recognition processing on the audio data to obtain audio text;
  • the obtaining module is further configured to obtain the audio text classification result through the text classification model based on the audio text if the audio text satisfies the first malicious promotion condition, wherein the audio text classification result represents the malicious promotion degree of the audio text;
  • the obtaining module is further configured to obtain the original text classification result through the text classification model based on the original text information if the original text information satisfies the first malicious promotion condition, wherein the original text classification result indicates the malicious promotion degree of the original text information;
  • the recognition module is further configured to determine the audio recognition result corresponding to the audio to be recognized according to the audio text classification result and/or the original text classification result, wherein the audio recognition result indicates the degree of malicious promotion of the audio to be recognized.
  • Another aspect of the present application provides a computer device, including a memory, a processor, and a bus system;
  • the memory is used to store the program
  • the processor is used to execute the program in the memory, and the processor is used to execute the methods of the above aspects according to the instructions in the program code;
  • the bus system is used to connect the memory and the processor to enable the memory and the processor to communicate.
  • Another aspect of the present application provides a computer-readable storage medium having instructions stored in the computer-readable storage medium, which, when executed on a computer, cause the computer to perform the methods of the above aspects.
  • Another aspect of the present application provides a computer program product or computer program, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided by the above aspects.
  • the embodiments of the present application have the following advantages:
  • A method for identifying multimedia content is provided. First, target text information and content information of a video to be identified are obtained, where the target text information includes at least one of title text and introduction text, and the content information includes at least one of image data and audio data. Text recognition processing is then performed on the content information to obtain associated text information. If at least one of the target text information and the associated text information satisfies the first malicious promotion condition, the target text classification result is obtained through the text classification model. Finally, the video recognition result corresponding to the video to be identified is determined according to the target text classification result, where the video recognition result indicates the degree of malicious promotion of the video to be identified.
  • FIG. 1 is a schematic structural diagram of a multimedia content identification system in an embodiment of the application
  • FIG. 2 is a schematic diagram of an application framework of the multimedia content recognition system in the embodiment of the application;
  • FIG. 3 is a schematic flowchart of a method for identifying multimedia content in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a recognition framework based on the dimension of multiple video information sources in the embodiment of the present application;
  • FIG. 5 is a schematic diagram of a recognition scene of video media content in the embodiment of the present application.
  • FIG. 6 is a schematic diagram of performing deduplication processing on subtitles based on coordinate information in an embodiment of the present application
  • FIG. 7 is a schematic diagram of outputting a text classification result based on a single classifier in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of outputting text classification results based on multiple classifiers in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of another identification framework based on the dimension of multiple video information sources in the embodiment of the present application.
  • FIG. 10 is a schematic diagram of a recognition framework supplemented based on non-content level features in an embodiment of the application;
  • FIG. 11 is a schematic diagram of an identification framework supplemented based on non-content level features in an embodiment of the present application.
  • FIG. 12 is a schematic diagram of an overall recognition framework for video in an embodiment of the application.
  • FIG. 13 is another schematic flowchart of a method for identifying multimedia content in an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a recognition scene of textual media content in an embodiment of the present application.
  • FIG. 16 is another schematic flowchart of a method for identifying multimedia content in an embodiment of the present application.
  • FIG. 17 is a schematic diagram of an overall recognition framework for images in an embodiment of the present application.
  • FIG. 18 is a schematic diagram of a recognition scene of image media content in an embodiment of the application.
  • FIG. 20 is a schematic diagram of an overall recognition framework for audio in an embodiment of the application.
  • FIG. 21 is a schematic diagram of a recognition scene of audio media content in the embodiment of the application.
  • FIG. 22 is a schematic diagram of a multimedia content identification device in an embodiment of the application.
  • FIG. 23 is another schematic diagram of a multimedia content identification device in an embodiment of the present application.
  • FIG. 24 is another schematic diagram of a multimedia content identification device in an embodiment of the present application.
  • FIG. 25 is another schematic diagram of a multimedia content identification device in an embodiment of the present application.
  • FIG. 26 is a schematic structural diagram of a server in an embodiment of the present application.
  • Embodiments of the present application provide a multimedia content identification method, related devices, equipment, and storage medium, which identify the degree of malicious promotion of multimedia content from multiple perspectives, such as text, images, and audio, to grasp video quality more comprehensively and form a more complete strategy for identifying malicious video promotion, thereby improving the accuracy of that identification.
  • Multimedia includes text, pictures, photos, sounds, animations and videos, as well as interactive features provided by the program.
  • Multimedia technology is a rapidly developing comprehensive electronic information technology, which brings directional changes to traditional computer systems, audio and video equipment, and will have a profound impact on mass media.
  • Multimedia computers will accelerate the process of computers entering all aspects of the family and society, and bring profound revolution to people's work, life and entertainment.
  • With the development of 5th generation mobile networks (5G), more and more multimedia content, mainly uploaded by users, is available to the public.
  • However, a large amount of maliciously promoted multimedia content seriously endangers the development of the content ecology.
  • “Promotion” means to let more people and organizations understand and accept one's own products, services, technology, culture and deeds through media advertising, so as to achieve the purpose of publicity and popularization.
  • Malicious promotion refers to promoting contact information or advertising products in videos, pictures, voice, and text, which affects the users who watch it to a greater or lesser extent and thereby damages the content ecology.
  • By subject, malicious promotion can be divided into medical beauty, finance, loans and credit cards, stock recommendation, lottery, entrepreneurship and money-making, medical treatment, feng shui and numerology, curio collection, chess and card games, game plug-ins, Pick-up Artist (PUA), credit checks and chat records, phishing and fake customer service, and so on.
  • The present application proposes a multimedia content recognition method that focuses on detecting and recognizing malicious promotion in multimedia content (including video, pictures, voice, text, etc.), as a machine strategy to assist human review.
  • the method is applied to the multimedia content identification system shown in Figure 1.
  • the multimedia content identification system includes a server and a terminal device, and the client is deployed on the terminal device.
  • The server involved in this application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
  • the terminal device may be a smart phone, a tablet computer, a notebook computer, a palmtop computer, a personal computer, a smart TV, a smart watch, etc., but is not limited thereto.
  • the terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
  • the number of servers and terminal devices is also not limited.
  • The content publisher selects local multimedia content on terminal device A and uploads it to the server. The server then recognizes the multimedia content using the identification strategy provided by the present application and outputs the recognition result.
  • Content managers can choose whether to further perform manual review of the identification results. If it is determined based on the identification result that the multimedia content is not maliciously promoted, the server publishes the multimedia content, and other users can view the published multimedia content through the terminal device B. On the contrary, if it is determined based on the identification result that the multimedia content is maliciously promoted, the server intercepts the multimedia content and will not publish it on the Internet.
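  • The upload, identification, optional manual review, and publish-or-intercept flow described above can be sketched as follows. The `Decision` enum, the `handle_upload` function, and the convention that a manual reviewer may override the machine verdict are illustrative assumptions; the text only states that content managers can choose whether to review the identification result manually.

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    PUBLISH = "publish"      # content becomes visible to other users
    INTERCEPT = "intercept"  # content is blocked and not published


def handle_upload(
    content: str,
    identify: Callable[[str], bool],
    manual_review: Optional[Callable[[str, bool], bool]] = None,
) -> Decision:
    """Decide what the server does with one uploaded item.

    `identify` returns True when the content is judged maliciously promoted.
    `manual_review`, when supplied, receives the content and the machine
    verdict and returns the final verdict (assumed override semantics).
    """
    is_malicious = identify(content)
    if manual_review is not None:
        is_malicious = manual_review(content, is_malicious)
    return Decision.INTERCEPT if is_malicious else Decision.PUBLISH


def toy_identify(content: str) -> bool:
    # Stand-in for the identification strategy.
    return "loan" in content.lower()


print(handle_upload("Holiday vlog", toy_identify))
print(handle_upload("Fast LOAN, call now", toy_identify))
# A reviewer clears an item that the machine flagged.
print(handle_upload("History of the loan word", toy_identify, lambda c, m: False))
```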
  • FIG. 2 is a schematic diagram of an application framework of the multimedia content recognition system in the embodiment of the application.
  • The multimedia content recognition system mainly includes three modules: the application service module, the basic service module, and the underlying architecture module.
  • the underlying architecture modules include network communication, data security and database.
  • network communication is used to support the communication between the terminal device and the server.
  • Data security can use blockchain technology to upload the identification results of multimedia content to the chain.
  • the database stores information related to the multimedia content, for example, basic information and behavior information of the content publisher.
  • OCR: Optical Character Recognition.
  • ASR: Automatic Speech Recognition.
  • Neural networks and processing strategies are used to determine whether multimedia content is maliciously promoted.
  • intelligent identification refers to calling the neural network and processing strategy to determine whether the multimedia content is malicious promotion content, and then can output the identification result, thereby displaying the identification result to the content manager.
  • the multimedia content recognition system utilizes a deep neural network based on semantic transfer, and combines heuristic strategies and dictionary expansion to form a set of combined strategies to improve the accuracy of malicious title recognition on the premise of ensuring the recall rate.
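  • One possible reading of such a combined strategy is a cheap, high-recall pre-filter (an expanded dictionary plus a heuristic rule) in front of a more precise model. The sketch below is only that reading: the text does not say how the strategies are combined, and `expand_dictionary`, `CONTACT_PATTERN`, `combined_score`, and the variant table are all hypothetical names and data introduced for illustration.

```python
import re
from typing import Callable, Dict, Iterable, Set


def expand_dictionary(seeds: Iterable[str], variants: Dict[str, Iterable[str]]) -> Set[str]:
    """Add known spelling variants of each seed term (variant table is assumed)."""
    expanded: Set[str] = set()
    for term in seeds:
        expanded.add(term.lower())
        expanded.update(v.lower() for v in variants.get(term, ()))
    return expanded


# Hypothetical heuristic: a long digit run that may be a contact number.
CONTACT_PATTERN = re.compile(r"\d{7,}")


def combined_score(title: str, dictionary: Set[str], model: Callable[[str], float]) -> float:
    """Run the model only on titles that the dictionary or heuristic flags."""
    lowered = title.lower()
    flagged = any(term in lowered for term in dictionary) or bool(CONTACT_PATTERN.search(title))
    if not flagged:
        return 0.0  # assumption: unflagged titles are treated as clean
    return model(title)


dictionary = expand_dictionary(["loan"], {"loan": ["l0an"]})


def toy_model(title: str) -> float:
    # Stand-in for a trained deep neural network.
    return 0.8


print(combined_score("Fast L0AN today", dictionary, toy_model))
print(combined_score("Call 13800001111", dictionary, toy_model))
print(combined_score("Cute cats", dictionary, toy_model))
```

  • Widening the dictionary raises recall of the pre-filter, while the model behind it is what keeps accuracy from dropping; that trade-off is the design idea this sketch tries to show.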
  • Deep neural networks need to use large-scale training corpus and be trained by machine learning (ML) methods.
  • ML is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and many other disciplines. It studies how computers simulate or realize human learning behaviors to acquire new knowledge or skills, and how they reorganize existing knowledge structures to continuously improve their performance.
  • ML is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications are in all fields of artificial intelligence.
  • ML and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other techniques.
  • Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, AI is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a way similar to human intelligence.
  • AI is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • AI technology is a comprehensive discipline involving a wide range of fields, including both hardware-level technologies and software-level technologies.
  • AI basic technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • AI software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the multimedia content identification system provided by the present application adopts cloud computing to realize parallel processing of a large amount of multimedia content.
  • Cloud computing refers to the delivery and use mode of Internet technology (Internet Technology, IT) infrastructure, that is, obtaining the required resources through the network in an on-demand and easily scalable way;
  • cloud computing in a broad sense refers to the delivery and use mode of services, that is, obtaining the required services through the network in an on-demand and easily scalable way.
  • Such services can be IT and software, Internet-related, or other services.
  • Cloud computing is the product of the development and integration of traditional computer and network technologies such as grid computing (Grid Computing), distributed computing (Distributed Computing), parallel computing (Parallel Computing), utility computing (Utility Computing), network storage (Network Storage Technologies), virtualization (Virtualization), and load balancing (Load Balance).
  • Cloud computing has grown rapidly with the development of the Internet, real-time data streaming, the diversity of connected devices, and the need for search services, social networking, mobile commerce, and open collaboration. Different from the parallel distributed computing in the past, the emergence of cloud computing will promote revolutionary changes in the entire Internet model and enterprise management model.
  • Cloud computing is a type of cloud technology, where cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software, and networks within a wide area network or a local area network to realize the computing, storage, processing, and sharing of data.
  • Cloud technology is the general term for the network technology, information technology, integration technology, management platform technology, application technology, and so on that are applied based on the cloud computing business model. These can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The background services of technical network systems, such as video websites, picture websites, and other portal websites, require a large amount of computing and storage resources.
  • Each item may have its own identification mark, which needs to be transmitted to the back-end system for logical processing.
  • Data of different levels will be processed separately, and all kinds of industry data need strong system backing support, which can only be achieved through cloud computing.
  • the multimedia content identification system provided by this application can also be connected to a blockchain system, thereby preventing the generated information from being tampered with and improving the reliability of the information source.
  • Blockchain is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
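  • As a minimal sketch of how data blocks can be linked by a cryptographic method so that tampering becomes detectable, the following example chains blocks by SHA-256 hashes. The block fields, the all-zero genesis hash, and the function names are assumptions made for illustration; the application does not specify a block format or hash function.

```python
import hashlib
import json

GENESIS_PREV_HASH = "0" * 64  # assumed placeholder for the first block


def block_hash(index, payload, prev_hash):
    """Hash the block contents together with the previous block's hash."""
    body = json.dumps(
        {"index": index, "payload": payload, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a new block whose hash depends on the previous block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": block_hash(index, payload, prev_hash),
    })


def is_valid(chain):
    """Recompute every hash and check that each block points at its predecessor."""
    prev_hash = GENESIS_PREV_HASH
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != block_hash(i, block["payload"], block["prev_hash"]):
            return False
        prev_hash = block["hash"]
    return True
```

  • Because each hash covers the previous hash, changing a stored recognition result in an earlier block invalidates that block and breaks the link to every later one.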
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the underlying platform of the blockchain can include processing modules such as user management, basic services, smart contracts, and operation management.
  • the user management module is responsible for the identity information management of all blockchain participants, including maintenance of public and private key generation (account management), key management, and maintenance of the corresponding relationship between the user's real identity and blockchain address (authority management), etc.
  • the basic service module is deployed on all blockchain node devices to verify the validity of business requests and, after consensus is reached on valid requests, to record them in storage.
  • For a new business request, the basic service first performs interface adaptation, parsing, and authentication (interface adaptation), then encrypts the business information through the consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger after encryption (network communication), and stores the record; the smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution.
  • Developers can define contract logic in a programming language and publish it to the blockchain (contract registration); execution is triggered by a key or another event according to the logic of the contract terms to complete the contract logic, and functions for upgrading and cancelling contracts are also provided;
  • the operation management module is mainly responsible for deployment, configuration modification, contract settings, and cloud adaptation during product release, and for the visual output of real-time status during product operation, such as alarms, monitoring of network conditions, and monitoring of the health status of node devices.
  • the platform product service layer provides the basic capabilities and implementation framework of typical applications. Based on these basic capabilities, developers can superimpose business characteristics to complete the blockchain implementation of business logic.
  • the application service layer provides application services based on blockchain solutions for business participants to use.
  • An embodiment of the method for identifying multimedia content in the embodiment of the present application includes:
  • Acquire target text information and content information of the video to be identified, wherein the target text information includes at least one of title text and introduction text, and the content information includes at least one of image data and audio data;
  • the multimedia content recognition apparatus acquires target text information and content information of the video to be recognized, wherein the target text information includes at least one of a title text and an introduction text of the video to be recognized, and the content information includes at least one of image data and audio data.
  • the title text of the video to be identified is a very important source of information. In practice, maliciously promoted titles are sparser than maliciously promoted video images or maliciously promoted speech. In daily sampling of the overall video stream, the proportion of title text involving malicious promotion is about 4.76 per 10,000 (roughly 5 per 10,000).
  • the multimedia content identification device may be deployed in a server, a terminal device, or a multimedia content identification system composed of a terminal device and a server, which is not limited here.
  • the multimedia content recognition apparatus performs text recognition processing on the content information, thereby obtaining associated text information.
  • the technical means used to generate the associated text information will be described below.
  • CV computer vision
  • CV is a science that studies how to make machines "see". More specifically, it refers to machine vision in which cameras and computers are used instead of human eyes to identify, track, and measure targets, with further graphics processing so that the computer produces an image more suitable for human observation or for transmission to instruments for detection.
  • CV studies related theories and technologies in an attempt to build artificial intelligence systems that can obtain information from images or multi-dimensional data.
  • CV technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and map construction. It also includes common biometric identification technologies such as face recognition and fingerprint recognition.
  • speech technology can be used to recognize the corresponding audio text.
  • the key technologies of speech technology include ASR technology, speech synthesis (Text To Speech, TTS) technology, and voiceprint recognition technology. Enabling computers to hear, see, speak, and feel is the future development direction of human-computer interaction, and voice will become one of the most promising human-computer interaction methods in the future.
  • the target text classification result represents the degree of malicious promotion of the text content.
  • the multimedia content identification device needs to determine whether the target text information and the associated text information satisfy the first malicious promotion condition.
  • satisfying the first malicious promotion condition may mean hitting a keyword or template in the matching library.
  • The target text information or associated text information that satisfies the first malicious promotion condition is input into the trained text classification model to obtain the target text classification result, wherein the target text classification result can be a binary classification result.
  • the target text classification result may be a multi-classification result, for example, "belongs to malicious promotion type", "suspected malicious promotion type” or "does not belong to malicious promotion type”.
  • NLP natural language processing
  • FIG. 4 is a schematic diagram of a recognition framework based on the dimension of multiple video information sources in an embodiment of the present application.
  • the video to be recognized includes title text, introduction text, image data, and audio data.
  • the title text and the introduction text can be classified separately.
  • the results of the two classifications should be consistent, and the types that can be classified include but are not limited to: general, medical and medical aesthetics, finance, loan and credit card, stock recommendation, lottery, entrepreneurship and making money, medical treatment, feng shui and numerology, curio collecting, chess and card game plug-ins, PUA, credit checking, chat record checking, and phishing and fake customer service.
  • the multimedia content recognition device determines the video recognition result corresponding to the video to be recognized according to the target text classification result. For example, if the target text classification result is "belongs to malicious promotion type", the output video recognition result of the video to be recognized is "belongs to malicious promotion type", and the degree of malicious promotion is the highest. For another example, if the target text classification result is "suspected malicious promotion type", the output video recognition result is "suspected malicious promotion type", and the degree of malicious promotion is medium. For another example, if the target text classification result is "does not belong to malicious promotion type", the output video recognition result is "does not belong to malicious promotion type", and the degree of malicious promotion is low.
  • FIG. 5 is a schematic diagram of a recognition scene of video media content in the embodiment of the application.
  • the recognition results of different videos can be displayed to the content manager.
  • the video recognition platform can directly remove or delete videos that "belong to malicious promotion”.
  • content managers can further check the specific information of the videos and manually check the accuracy of the output results.
  • Maliciously promoted videos involve all aspects of human social life and cover a variety of topics. Compared with ordinary videos, maliciously promoted videos lean more obviously toward particular segments.
  • a method for identifying multimedia content is provided.
  • Considering the media form of video multimedia content, the video quality can be grasped more comprehensively from the title, introduction, image, audio, and so on, forming a more complete strategy for identifying malicious promotion in videos and thereby improving the accuracy of identifying malicious promotion in videos.
  • the content information includes image data
  • Perform text recognition processing on the content information to obtain associated text information which may specifically include:
  • L is an integer greater than or equal to 1 and less than K
  • the deduplicated subtitles in each video frame are used as the image text in the associated text information.
  • a method for performing OCR identification on image data is introduced.
  • the image data of the maliciously promoted video itself is also an important source of information.
  • L video frames may be acquired from K video frames at a preset frame rate (e.g., one frame is extracted per second).
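  • The frame extraction step can be sketched as follows. This is only an illustration: the function name and the assumption that the source frame rate of the video is known are not taken from the application.

```python
def sample_frame_indices(k_frames, source_fps, sample_fps=1.0):
    """Pick the indices of L frames out of K frames at a preset sampling rate.

    k_frames: total number of frames K in the video.
    source_fps: frame rate of the video (assumed to be known).
    sample_fps: preset extraction rate, e.g. 1.0 for one frame per second.
    """
    if k_frames <= 0 or source_fps <= 0 or sample_fps <= 0:
        raise ValueError("frame count and frame rates must be positive")
    # Keep every `step`-th frame; never use a step smaller than one frame.
    step = max(1, round(source_fps / sample_fps))
    return list(range(0, k_frames, step))
```

  • For a 25 fps video with K = 100 frames, extracting one frame per second keeps L = 4 frames. If the sampling rate is not lower than the source rate, every frame is kept, so a caller that requires L to be less than K would need to check for that case.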
  • OCR is used to extract the subtitles of each of the L video frames and the coordinate information of the subtitles.
  • FIG. 6 is a schematic diagram of performing deduplication processing on subtitles based on coordinate information in an embodiment of the present application.
  • the frame is divided into four blocks, namely, the block indicated by S1, the block indicated by S2, the block indicated by S3, and the block indicated by S4.
  • the lower left corner can be set as the coordinate origin, that is, each block can be represented by horizontal and vertical coordinates.
  • For example, if the subtitle "Quick Investment" is extracted from block S1 in one frame and the same subtitle "Quick Investment" is extracted from block S1 again in the next frame, deduplication is performed so that complete and clean information is obtained.
  • the image text is obtained.
  • the coordinate information of the subtitles in each video frame is very important, and is used to help locate or cluster the content that may be a piece of information.
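  • The coordinate-based deduplication described above can be sketched as follows. The quadrant layout of blocks S1 to S4, the tuple format of an OCR hit, and the function names are assumptions for illustration. The sketch also simplifies by dropping any repeat of the same text in the same block, not only repeats in directly adjacent frames.

```python
def block_of(xy, width, height):
    """Map a subtitle coordinate (lower-left origin) to one of four blocks.

    The assignment of S1..S4 to quadrants is an assumption of this sketch.
    """
    x, y = xy
    col = 0 if x < width / 2 else 1
    row = 0 if y < height / 2 else 1
    return "S%d" % (row * 2 + col + 1)


def dedup_subtitles(hits, width, height):
    """Remove subtitles repeated in the same block across sampled frames.

    hits: iterable of (frame_index, subtitle_text, (x, y)) tuples from OCR.
    Returns the deduplicated subtitle texts in frame order.
    """
    seen = set()
    image_text = []
    for _frame_index, text, xy in sorted(hits, key=lambda hit: hit[0]):
        key = (block_of(xy, width, height), text.strip())
        if key in seen:
            continue  # same subtitle already extracted from this block
        seen.add(key)
        image_text.append(text.strip())
    return image_text
```

  • Keying on both the block and the text keeps two different subtitles that share a block, and keeps the same wording when it appears in a different region of the picture.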
  • a method for performing OCR recognition on image data is provided.
  • frame extraction is first performed on the image data of the video, and the recognized image text is then matched with the template, thereby adding a dimension for identifying maliciously promoted videos and helping to improve their identification accuracy.
  • the content information includes audio data
  • Perform text recognition processing on the content information to obtain associated text information which may specifically include:
  • T is an integer greater than or equal to 1;
  • Feature extraction processing is performed on each audio frame in the T audio frames to obtain an audio feature vector corresponding to each audio frame;
  • the audio text in the associated text information is determined.
  • the phoneme information may be output through the acoustic model; based on the phoneme information, the audio text in the associated text information may be output through the language model.
  • a method for performing ASR identification on audio data is introduced.
  • the audio data of the maliciously promoted video itself is also an important source of information.
  • the audio data included in the video to be identified may be processed in frames to obtain T audio frames.
  • feature extraction is performed on each of the T audio frames to obtain an audio feature vector corresponding to each audio frame.
  • the phoneme information is output through the acoustic model, and finally the audio text is output through the language model based on the phoneme information.
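  • The first two steps of this pipeline (framing and per-frame feature extraction) can be sketched as follows. The feature vector used here, log energy plus zero-crossing rate, is a toy stand-in chosen for illustration; the application does not state which audio features are used, and the acoustic model and language model stages are omitted. All function names are hypothetical.

```python
import math


def split_into_frames(samples, frame_len, hop):
    """Split a sequence of audio samples into T overlapping frames."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


def frame_features(frame):
    """Return a toy audio feature vector: [log energy, zero-crossing rate]."""
    energy = sum(s * s for s in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0)
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    zcr = crossings / (len(frame) - 1) if len(frame) > 1 else 0.0
    return [log_energy, zcr]


def extract_features(samples, frame_len, hop):
    """Produce one feature vector per audio frame."""
    return [frame_features(f) for f in split_into_frames(samples, frame_len, hop)]
```

  • In a full system, the per-frame vectors would be passed to the acoustic model to obtain phoneme information, which the language model then turns into audio text.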
  • the present application may choose not to perform ASR processing on these audio data, or not to perform subsequent template matching.
  • a method for performing ASR recognition on audio data is provided.
  • the audio data of the video is first recognized, and the recognized audio text is then matched with the template, thereby adding a dimension for identifying maliciously promoted videos and improving the recognition accuracy of maliciously promoted videos.
  • another optional embodiment provided by the embodiment of the present application may further include:
  • If the title text successfully matches a template in the matching library and fails to match the information in the whitelist, it is determined that the title text satisfies the first malicious promotion condition, and the target text information satisfies the first malicious promotion condition;
  • If the introduction text successfully matches a template in the matching library and fails to match the information in the whitelist, it is determined that the introduction text satisfies the first malicious promotion condition, and the target text information satisfies the first malicious promotion condition;
  • If the image text successfully matches a template in the matching library and fails to match the information in the whitelist, it is determined that the image text satisfies the first malicious promotion condition, and the associated text information satisfies the first malicious promotion condition;
  • If the audio text successfully matches a template in the matching library and fails to match the information in the whitelist, it is determined that the audio text satisfies the first malicious promotion condition, and the associated text information satisfies the first malicious promotion condition.
  • Satisfying the first malicious promotion condition includes hitting a template or keyword in the matching library, and may also include not hitting any information in the whitelist. This accounts for the fact that, if only the templates or keywords in the matching library are considered, some multimedia content that is not malicious promotion may be recalled. Therefore, a "rejection strategy" is further set (that is, information that appears in the whitelist is rejected and not recognized as malicious promotion).
  • the title text is taken as an example for description. It can be understood that the processing methods of the introduction text, the image text and the audio text are similar to those of the title text, which will not be repeated here.
  • the title text of the video is short text. According to the characteristics of each category of maliciously promoted video titles, a matching library based on keywords and templates is maintained for each divided category (for example, the general category, the medical and medical aesthetics category, and the financial category). Analysis of the accumulated maliciously promoted videos shows that, over a long period of time (unless there is an obvious change in ideology or mainstream media form), relevant keywords or templates can be summarized for each type of maliciously promoted video title. Hitting these keywords or templates is a necessary but not sufficient condition for a video title to be malicious promotion.
  • That is, a title text hit by the matching library is not necessarily a truly malicious promotion title, but a truly malicious promotion title will definitely be hit by the matching library. Therefore, it is necessary to maintain a full-coverage matching library based on keywords and templates, and the strategy module for video titles should focus on the recall rate. While giving priority to recall, this application also designs a "rejection strategy" to improve precision as much as possible.
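  • The combination of matching-library recall and the whitelist-based "rejection strategy" can be sketched as follows. The categories, patterns, and whitelist entries are invented for illustration, and regular expressions are only one assumed way to express templates; the application does not specify a template format.

```python
import re

# Hypothetical matching library: one list of keyword/template patterns per category.
MATCH_LIBRARY = {
    "stock recommendation": [
        re.compile(r"guaranteed\s+returns?"),
        re.compile(r"stock\s+tips?"),
    ],
    "loan and credit card": [
        re.compile(r"instant\s+loan"),
    ],
}

# Hypothetical whitelist: text matching any of these is rejected from recall.
WHITELIST = [
    re.compile(r"official\s+investor\s+education"),
]


def hit_categories(text):
    """Return the categories whose keywords or templates the text hits."""
    lowered = text.lower()
    return [
        category
        for category, patterns in MATCH_LIBRARY.items()
        if any(p.search(lowered) for p in patterns)
    ]


def satisfies_first_condition(text):
    """True if the text hits the matching library and misses the whitelist."""
    lowered = text.lower()
    if not hit_categories(text):
        return False
    return not any(p.search(lowered) for p in WHITELIST)
```

  • Only text for which `satisfies_first_condition` returns true would be passed on to the text classification model; the same check applies to title text, introduction text, image text, and audio text.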
  • a method for determining the type of malicious promotion based on template matching and a "rejection strategy" is provided.
  • videos that may contain malicious promotion can be found based on template matching, and filtering out the videos that belong to the whitelist helps improve the precision of the recalled videos.
  • Text classification results which can include:
  • the title text classification result is obtained through the text classification model based on the title text
  • the introduction text classification result is obtained through the text classification model based on the introduction text
  • the image text classification result is obtained through the text classification model based on the image text
  • the audio text classification result is obtained through the text classification model based on the audio text.
  • a prediction method for accessing a text classification model is introduced.
  • A text classification model (i.e., a classifier) is used so that accurate malicious promotion title text, malicious promotion introduction text, malicious promotion image text, and malicious promotion audio text are selected as far as possible.
  • FIG. 7 is a schematic diagram of outputting a text classification result based on a single classifier in an embodiment of the present application.
  • any text for example, title text, introduction text, image text or audio text
  • the text classification model outputs the probability distribution.
  • the title text is input into the text classification model, and the probability distribution of the output of the text classification model is (0.1, 0.9), where 0.1 represents the probability of "belonging to malicious promotion type", and 0.9 represents the probability of "belonging to malicious promotion type”.
  • the title text is input into the text classification model, and the probability distribution of the output of the text classification model is (0.1, 0.6, 0.3), where 0.1 represents the probability of "belonging to the malicious promotion type", 0.6 represents the probability of "suspected malicious promotion type", and 0.3 represents the probability of "not belonging to malicious promotion type”. Therefore, the title text classification result is "suspected malicious promotion type”.
  • the method of obtaining the classification result of the introduction text, the classification result of the image text, and the classification result of the audio text is similar to the method of obtaining the classification result of the title text, so it is not repeated here.
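  • Turning a probability distribution into a classification result can be sketched as taking the label with the highest probability, which is consistent with the three-class example above. The label order and the function name are assumptions for illustration.

```python
# Label order follows the three-class example: (belongs, suspected, does not belong).
MULTI_LABELS = (
    "belongs to malicious promotion type",
    "suspected malicious promotion type",
    "does not belong to malicious promotion type",
)


def classification_result(probabilities, labels=MULTI_LABELS):
    """Return the label whose probability is highest."""
    if len(probabilities) != len(labels):
        raise ValueError("one probability is required per label")
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best]
```

  • With the distribution (0.1, 0.6, 0.3), the function returns "suspected malicious promotion type".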
  • the text classification model is obtained after supervised training by manually labeling a certain amount of samples.
  • feature extraction and feature classifiers can be used, or text classification based on deep learning, which has become popular in the industry in recent years, can be selected.
  • text classification based on deep learning can generally be divided into models based on recurrent neural networks (RNN), models based on convolutional neural networks (CNN), classification models based on the Transformer encoder (Transformer-encoder), and mixtures of these models.
  • deep hybrid models based on the above three types of models, attention-based deep models, and memory network-based deep models.
  • the deep model used in text classification is basically a two-stage pre-trained Transformer-encoder-based model with some simple domain adaptations made on top of it, including but not limited to: performing secondary pre-training of the pre-trained model on a large-scale unsupervised domain corpus, compressing the model by model distillation, and fusing the results of multiple models by stacking and other methods.
  • When a text classification model performs semantic analysis on text, the text (for example, title text, introduction text, image text, or audio text) is first preprocessed and regarded as a document, which is then represented using document representation methods commonly used in the information retrieval and natural language processing fields, such as the bag-of-words model.
  • The document is represented by an unweighted one-hot representation or a weighted tf-idf representation. In the context of the bag-of-words model, the order in which terms appear in the text unit is usually not considered.
  • The one-hot representation is a 0/1 representation of each term that ignores the weight of the term, while the tf-idf representation represents each term by its calculated tf-idf score.
  • term frequency (tf) refers to how often a term appears in a document.
  • some terms (for example, auxiliary words) have very weak discriminative ability for the calculation of relevance.
  • For example, most text units may contain "concern", so the word "concern" cannot well reflect the semantic distinction between text units.
  • Inverse document frequency (idf) addresses this by giving lower weight to terms that appear in many documents.
  • the vector space model can be used to model the relationship between text units. For example, to calculate similarity or semantic relevance, the most typical method is to calculate the cosine similarity between text unit vectors. The higher the similarity between two text units, the more similar their topics are.
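  • The tf-idf representation and the cosine similarity between text units can be sketched as follows. The sketch uses one common textbook definition, tf = count / document length and idf = log(N / document frequency); the application does not state which tf-idf variant is used, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build one sparse {term: tf-idf weight} vector per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

  • A term that appears in every document receives an idf of zero under this definition, so it contributes nothing to the similarity, which matches the observation that very common terms have weak discriminative ability.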
  • supervised classification is performed according to the process of text classification in NLP.
  • Classification models can be roughly divided into two categories. One is the traditional approach: feature extraction is performed first (that is, feature engineering), and a classifier is then applied to the extracted features, such as logistic regression, linear regression, support vector machine (SVM), adaptive boosting (AdaBoost), and extreme gradient boosting (XGBoost).
  • The other category consists of deep-learning-based short text classification models, mainly RNNs, CNNs, or variants based on RNNs or CNNs.
  • Convolutional layers are essentially feature extraction layers, and you can specify how many convolution kernels there are with a hyperparameter.
  • Because the convolution kernel covers a sliding window, a CNN extracts N-gram segment features, and the size of n determines the distance over which features are captured. Because the sliding windows have no positional dependency on one another, CNNs are highly parallel, which is also an advantage of CNNs.
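  • The way a convolution kernel of width n yields one feature per n-gram window can be sketched as follows. This is a single-kernel, pure-Python illustration with made-up embeddings and no bias or activation; the function name is hypothetical.

```python
def conv1d_ngram_features(embeddings, kernel):
    """Slide one kernel of width n over a sequence of token embeddings.

    embeddings: list of token vectors (each a list of floats).
    kernel: list of n weight vectors with the same dimension as the tokens.
    Returns one scalar feature per n-gram window.
    """
    n = len(kernel)
    if n == 0 or len(embeddings) < n:
        return []
    features = []
    for start in range(len(embeddings) - n + 1):
        window = embeddings[start:start + n]
        # Each window is computed independently of all other windows.
        features.append(sum(
            w * x
            for k_vec, token in zip(kernel, window)
            for w, x in zip(k_vec, token)
        ))
    return features
```

  • Since no window depends on the result of another, all windows could be computed in parallel; the loop here is sequential only for readability.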
  • the Transformer-encoder is in effect a well-performing feature extractor that can be computed in parallel and is built by stacking self-attention mechanisms.
  • the self-attention mechanism allows each word to be related to any other word and then integrated into an embedding vector. Its advantage over CNN is therefore that the distance over which features are extracted is unrestricted rather than fixed by the convolution kernel.
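  • A minimal sketch of single-head scaled dot-product self-attention follows. To stay short it uses the token embeddings directly as queries, keys, and values, which is a simplifying assumption: real Transformer encoders apply learned projection matrices, multiple heads, and further layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Mix every token with every other token, regardless of distance.

    embeddings: list of token vectors of equal dimension.
    Returns one output vector per token (a weighted average of all tokens).
    """
    dim = len(embeddings[0])
    outputs = []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, embeddings))
            for d in range(dim)
        ])
    return outputs
```

  • Every output position attends to all input positions in one step, so the first and last words of a text can influence each other directly, unlike a convolution kernel of fixed width.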
  • a large number of experiments have proved that Transformer-encoder surpasses RNN and CNN in the ability to extract semantic features.
  • Transformer-encoder is slightly better than RNN, and RNN is better than CNN.
  • In terms of parallel computing efficiency, Transformer-encoder is slightly better than CNN, and CNN is better than RNN.
  • a prediction method for accessing a text classification model is provided.
  • the target text classification results (including the title text classification results, introduction text classification results, image text classification results, and audio text classification results) can be predicted directly, and multiple prediction branches can be processed in parallel, that is, multiple multimedia contents can be classified at the same time, thereby improving the efficiency of classification prediction.
  • the target text classification result is obtained through a text classification model based on the text content, which may specifically include:
  • If the title text satisfies the first malicious promotion condition, then based on the title text, obtain N title text classification sub-results through the N sub-classification models included in the text classification model, and determine the title text classification result according to the N title text classification sub-results, where each title text classification sub-result corresponds to a malicious promotion type;
  • If the introduction text satisfies the first malicious promotion condition, then based on the introduction text, obtain N introduction text classification sub-results through the N sub-classification models included in the text classification model, and determine the introduction text classification result according to the N introduction text classification sub-results, where each introduction text classification sub-result corresponds to a malicious promotion type;
  • If the image text satisfies the first malicious promotion condition, then based on the image text, obtain N image text classification sub-results through the N sub-classification models included in the text classification model, and determine the image text classification result according to the N image text classification sub-results, where each image text classification sub-result corresponds to a malicious promotion type;
  • If the audio text satisfies the first malicious promotion condition, then based on the audio text, obtain N audio text classification sub-results through the N sub-classification models included in the text classification model, and determine the audio text classification result according to the N audio text classification sub-results, where each audio text classification sub-result corresponds to a malicious promotion type.
  • another prediction method for accessing the text classification model is introduced.
  • matching recall may be followed by multiple sub-classification models (that is, multiple sub-classification models collectively constitute a text classification model), which strive to pick out, as accurately as possible, the malicious promotion title text, malicious promotion introduction text, malicious promotion image text, and malicious promotion audio text from the large number of recalled videos.
  • FIG. 8 is a schematic diagram of outputting text classification results based on multiple classifiers in an embodiment of the present application.
  • the text classification model includes three sub-classification models (that is, assuming that N is equal to 3), and each sub-classification model corresponds to one category. For example, sub-classification model 1 is the classification model for the "stock recommendation class", sub-classification model 2 is the classification model for the "treatment class", and sub-classification model 3 is the classification model for the "loan credit card class". Any text (for example, title text, introduction text, image text, or audio text) is input into the different sub-classification models, and each sub-classification model outputs its own probability distribution.
  • Taking binary classification as an example, the title text is input into sub-classification model 1 included in the text classification model, and the probability distribution output by sub-classification model 1 is (0.1, 0.9), where 0.1 represents the probability of "belonging to the stock recommendation class" and 0.9 represents the probability of "not belonging to the stock recommendation class".
  • The title text is input into sub-classification model 2 included in the text classification model, and the probability distribution output by sub-classification model 2 is (0.9, 0.1), where 0.9 represents the probability of "belonging to the treatment class" and 0.1 represents the probability of "not belonging to the treatment class".
  • Sub-classification model 3 likewise outputs a probability distribution for the "loan credit card class". Based on this, the probability of "belonging to the treatment class" is the highest, so the title text classification result is "belonging to treatment class malicious promotion".
  • Taking multi-classification as an example, the title text is input into sub-classification model 1 included in the text classification model, and the probability distribution output by sub-classification model 1 is (0.1, 0.6, 0.3), where 0.1 represents the probability of "belonging to the stock recommendation class", 0.6 represents the probability of "suspected stock recommendation class", and 0.3 represents the probability of "not belonging to the stock recommendation class".
  • The title text is input into sub-classification model 2 included in the text classification model, and the probability distribution output by sub-classification model 2 is (0.8, 0.1, 0.1), where 0.8 represents the probability of "belonging to the treatment class", 0.1 represents the probability of "suspected treatment class", and 0.1 represents the probability of "not belonging to the treatment class".
  • The title text is input into sub-classification model 3 included in the text classification model, and the probability distribution output by sub-classification model 3 is (0.4, 0.4, 0.2), where 0.4 represents the probability of "belonging to the loan credit card class", 0.4 represents the probability of "suspected loan credit card class", and 0.2 represents the probability of "not belonging to the loan credit card class". Based on this, the probability of "belonging to the treatment class" is the highest, so the title text classification result is "belonging to treatment class malicious promotion".
  • the method of obtaining the classification result of the introduction text, the classification result of the image text, and the classification result of the audio text is similar to the method of obtaining the classification result of the title text, so it is not repeated here.
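  • The selection step in the worked example above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `pick_category` is hypothetical, the sub-classification model outputs are hard-coded from the three-class example, and the sketch assumes that the first entry of each distribution is the "belongs to the category" probability. How ties or uniformly low probabilities are handled is not stated in the text and is left out.

```python
from typing import Dict, Tuple


def pick_category(sub_results: Dict[str, Tuple[float, ...]]) -> Tuple[str, float]:
    """Return the category whose "belongs" probability (index 0) is highest.

    Assumption: every sub-classification model puts the "belongs to this
    category" probability first in its output distribution.
    """
    best = max(sub_results, key=lambda name: sub_results[name][0])
    return best, sub_results[best][0]


# Outputs of the three sub-classification models for one title text,
# copied from the three-class example in the description.
three_way = {
    "stock recommendation": (0.1, 0.6, 0.3),
    "treatment": (0.8, 0.1, 0.1),
    "loan credit card": (0.4, 0.4, 0.2),
}

category, probability = pick_category(three_way)
print(f"title text classification result: {category} ({probability})")
```

  • The same function would apply unchanged to the introduction text, image text, and audio text, since each is scored by the same set of sub-classification models.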
  • When the text classification model includes multiple classifiers, multiple text classification sub-results (including title text classification sub-results, introduction text classification sub-results, image text classification sub-results, and audio text classification sub-results) can be predicted, and the target text classification result is finally determined based on these text classification sub-results.
  • the multimedia content can be classified more precisely, and subsequent identification can be performed for the specific classification, thereby helping to improve the accuracy of the identification.
  • another optional embodiment provided by the embodiment of the present application may further include:
  • an image classification result is obtained through an image classification model based on the image data, wherein the image classification result indicates the degree of malicious promotion of the image data;
  • Determining the video recognition result corresponding to the video to be recognized according to the target text classification result may specifically include:
  • The video recognition result corresponding to the video to be recognized is determined according to the target text classification result and the image classification result.
  • a method of using CV technology to assist in judging malicious promotion is introduced.
  • the judgment of whether a video is maliciously promoted is divided into "white judgment” strategy algorithm and "black judgment” strategy algorithm.
  • the divided video frames can be input into an image classification model, and the image classification model is used to identify the region of interest (ROI) in the video frame, for example, a two-dimensional code, trend chart or station logo, etc.
  • the image classification model includes but is not limited to the You Only Look Once (YOLO) model and the Single Shot MultiBox Detector (SSD), etc., which are not limited here.
  • the video recognition result corresponding to the video to be recognized is determined according to the target text classification result and the image classification result. For example, if the classification result of the target text is "belongs to malicious promotion type" and the image classification result is "belongs to malicious promotion type”, then the video recognition result corresponding to the video to be identified is “belongs to malicious promotion type”. For another example, if the classification result of the target text is "belongs to malicious promotion type” and the image classification result is "does not belong to malicious promotion type", then the video recognition result corresponding to the video to be identified is "suspected malicious promotion type".
  • the video recognition result corresponding to the video to be identified is "suspected malicious promotion type”.
  • For another example, if the classification result of the target text is "does not belong to malicious promotion type" and the image classification result is "does not belong to malicious promotion type", then the video recognition result corresponding to the video to be identified is "does not belong to malicious promotion type".
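  • The combination of the target text classification result and the image classification result described above can be sketched as a small lookup. The function name `fuse_text_and_image` is hypothetical. The description states the outcome when both results agree and when the text is malicious but the image is not; treating every other disagreement as "suspected" is an assumption made only for this sketch.

```python
BELONGS = "belongs to malicious promotion type"
SUSPECTED = "suspected malicious promotion type"
NOT_BELONGS = "does not belong to malicious promotion type"


def fuse_text_and_image(text_result: str, image_result: str) -> str:
    """Combine the two classification results into a video recognition result."""
    if text_result == BELONGS and image_result == BELONGS:
        return BELONGS
    if text_result == NOT_BELONGS and image_result == NOT_BELONGS:
        return NOT_BELONGS
    # Assumption: any disagreement between the two signals is reported as
    # "suspected", which matches the one mixed case given in the description.
    return SUSPECTED


print(fuse_text_and_image(BELONGS, NOT_BELONGS))
```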
  • a method for judging malicious promotion by using CV technology is provided.
  • The image data in the video is recognized to obtain the image classification result, which is then combined with the target text classification result obtained from text recognition to jointly judge whether the video to be identified is malicious promotion, so that the strategies in each source-information dimension complement one another.
  • Determining the video recognition result corresponding to the video to be recognized according to the target text classification result and the image classification result may include:
  • If the image classification result indicates that the to-be-recognized video includes an information code, and the target text classification result satisfies the second malicious promotion condition, it is determined that the video recognition result corresponding to the to-be-recognized video is a malicious promotion video;
  • If the image classification result indicates that the to-be-recognized video includes a trend chart, and the target text classification result satisfies the second malicious promotion condition, it is determined that the video recognition result corresponding to the to-be-recognized video is a malicious promotion video;
  • If the image classification result indicates that the video to be recognized belongs to the preset video type, it is determined that the video recognition result corresponding to the video to be recognized is a non-malicious promotion video.
  • a method for realizing auxiliary "black judgment” and direct “white judgment” using CV technology is introduced.
  • A "black judgment" hit is an auxiliary feature for judging that a video is a malicious promotion video; a hit alone does not cause the video to be directly judged as malicious promotion.
  • the algorithm of the "black judgment” strategy includes identification of information codes (eg, QR codes and barcodes) and recommended stock charts.
  • The "white judgment" strategy algorithms cover government propaganda videos and news propaganda videos.
  • If the "black judgment" strategy algorithm determines that the video to be identified includes an information code, and the target text classification result satisfies the second malicious promotion condition, the video to be identified is directly determined to be a malicious promotion video, that is, the video recognition result is "belongs to malicious promotion type". If the "black judgment" strategy algorithm determines that the video to be identified includes a trend chart, and the target text classification result satisfies the second malicious promotion condition, the video to be identified is directly determined to be a malicious promotion video, that is, the video recognition result is "belongs to malicious promotion type".
  • If the "white judgment" strategy algorithm determines that the video to be identified belongs to a preset video type, the video to be identified is directly determined to be a non-malicious promotion video, that is, the video recognition result is "does not belong to malicious promotion type".
  • Satisfying the second malicious promotion condition includes, but is not limited to, the following cases: the target text classification result is "belongs to malicious promotion type"; the target text classification result is "suspected malicious promotion type"; or the malicious promotion type is divided into five risk levels and the target text classification result is "level three" or above. This is not limited here.
  • FIG. 9 is a schematic diagram of another recognition framework based on the dimension of video multi-information sources in the embodiment of the present application.
  • In one "black judgment" strategy algorithm, a CV algorithm is used to judge whether an information code is shown for more than 2 seconds and is used for contact promotion or product promotion.
  • In another "black judgment" strategy algorithm, it is judged whether the video to be identified is a stock trend recommendation video.
  • The more obvious feature of such videos is that the extracted video frames often contain a stock trend chart shown for a long time, accompanied by lecture-style stock recommendation or trend explanation.
  • The "white judgment" strategy algorithms cover government propaganda videos and news propaganda videos.
  • The government propaganda video strategy determines whether a video is a government propaganda video of the state or a local government. Such videos have certain promotional attributes, but they do not belong to the category of malicious promotion videos. The same is true for the news propaganda video strategy, which judges whether a video is a long-segment news propaganda video. The news clips in such videos are often matched by mistake in the preceding template matching stage, so they also need to be judged.
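  • The information code check described above (a code present for more than 2 seconds) can be sketched as a run-length test over per-frame detections. This is only an illustration under stated assumptions: the per-frame booleans are assumed to come from some detector that is not shown, frames are assumed to be sampled at a fixed interval, "more than 2 seconds" is read as one continuous run, and the function name `shown_longer_than` is hypothetical.

```python
from typing import Sequence


def shown_longer_than(
    detections: Sequence[bool],
    frame_interval_s: float,
    min_seconds: float = 2.0,
) -> bool:
    """Return True if a code is detected in one continuous run of sampled
    frames whose approximate duration is strictly longer than min_seconds.

    Duration is approximated as (run length) * (sampling interval).
    """
    longest = current = 0
    for hit in detections:
        current = current + 1 if hit else 0  # reset the run on a miss
        longest = max(longest, current)
    return longest * frame_interval_s > min_seconds


# Frames sampled every 0.5 s; the code is visible in five consecutive frames.
frames = [False, True, True, True, True, True, False]
print(shown_longer_than(frames, frame_interval_s=0.5))
```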
  • the "white judgment” strategy is a direct white judgment strategy. If one of the "white judgment” strategy algorithms is hit, the video will be automatically classified as a non-malicious promotion video.
  • a method for assisting "black judgment” and direct “white judgment” using CV technology is provided.
  • the "white judgment” strategy algorithm and the “black judgment” strategy algorithm are introduced. Among them, if any one of the "white judgment” strategy algorithms is hit, the to-be-identified video is automatically classified as a non-malicious promotion video. If it hits the "black judgment” strategy algorithm, it is used as an auxiliary feature of malicious promotion video, but it is not directly judged as malicious promotion.
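  • The interaction between the two strategy families can be sketched as follows. The label strings, the set names, and the function `judge` are hypothetical. The second malicious promotion condition is modelled here as only two of the cases the description lists ("belongs to" or "suspected"), and returning `None` when neither rule fires is an assumption standing in for "leave the decision to the other signals".

```python
from typing import Optional, Set

BELONGS = "belongs to malicious promotion type"
SUSPECTED = "suspected malicious promotion type"
NOT_BELONGS = "does not belong to malicious promotion type"

# Labels an image classification model might emit (illustrative names).
WHITE_TYPES = {"government propaganda video", "news propaganda video"}
BLACK_FEATURES = {"information code", "trend chart"}

# Assumption: a simplified reading of the second malicious promotion condition.
SECOND_CONDITION = {BELONGS, SUSPECTED}


def judge(image_labels: Set[str], text_result: str) -> Optional[str]:
    """Apply "white judgment" first, then auxiliary "black judgment"."""
    if image_labels & WHITE_TYPES:
        # Any white hit directly classifies the video as non-malicious.
        return NOT_BELONGS
    if (image_labels & BLACK_FEATURES) and text_result in SECOND_CONDITION:
        # A black hit only counts together with the text classification result.
        return BELONGS
    return None  # undecided here; other signals would decide


print(judge({"information code"}, SUSPECTED))
```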
  • another optional embodiment provided by the embodiment of the present application may further include:
  • If the original text information satisfies the first malicious promotion condition, the original text classification result is obtained through the text classification model based on the original text information, wherein the original text classification result indicates the malicious promotion degree of the original text information;
  • Determining the video recognition result corresponding to the video to be recognized according to the target text classification result may specifically include:
  • The video recognition result corresponding to the video to be recognized is determined according to the target text classification result, the original text classification result, and the image classification result, wherein the image classification result is obtained through the image classification model based on the image data and represents the degree of malicious promotion of the image data.
  • a supplementary identification method combining comment information and bullet screen information is introduced.
  • FIG. 10 is a schematic diagram of an identification framework supplemented based on non-content-level features in an embodiment of the application.
  • User generated content (UGC) serves as the original text information, which includes at least one of comment information and bullet screen information.
  • Original text information is not necessarily available for every video to be identified, but when it exists, malicious promotion detection can be performed on it. For example, if the original text information satisfies the first malicious promotion condition, the original text classification result is obtained through the text classification model.
  • The video recognition result corresponding to the video to be recognized may be determined according to the target text classification result and the original text classification result. For example, if the classification result of the target text is "belongs to malicious promotion type" and the original text classification result is "belongs to malicious promotion type", then the video recognition result corresponding to the video to be identified is "belongs to malicious promotion type". For another example, if the classification result of the target text is "belongs to malicious promotion type" and the original text classification result is "does not belong to malicious promotion type", then the video recognition result corresponding to the video to be identified is "suspected malicious promotion type".
  • the video recognition result corresponding to the video to be recognized may be determined according to the target text classification result, the original text classification result, and the image classification result.
  • For example, if the target text classification result is "belongs to malicious promotion type", the original text classification result is "belongs to malicious promotion type", and the image classification result is "belongs to malicious promotion type", then the video recognition result corresponding to the video to be identified is "belongs to malicious promotion type".
  • For another example, if the classification result of the target text is "belongs to malicious promotion type", the original text classification result is "does not belong to malicious promotion type", and the image classification result is "belongs to malicious promotion type", then the video recognition result corresponding to the video to be recognized is "suspected malicious promotion type".
  • a method for supplementary identification combining comment information and bullet screen information is provided.
  • Non-content-level features can be supplemented for judging malicious promotion of videos through additional comment information. Malicious promotion detection can be performed on this content, and items that exceed a certain threshold can be pushed to detailed manual verification, which optimizes the identification process and provides a positive closed-loop supplement to the multi-source information hierarchical strategy combination algorithm.
  • another optional embodiment provided by the embodiment of the present application may further include:
  • Determining the video recognition result corresponding to the video to be recognized according to the target text classification result may specifically include:
  • The video recognition result corresponding to the video to be recognized is determined according to the target text classification result and the identity confidence of the content publisher.
  • a method for dynamically identifying a content publisher's malicious promotion tendency tag is introduced.
  • Each maliciously promoted video corresponds to a content publisher. Therefore, modeling content publishers' malicious promotion-related tags is also of great significance for the judgment of malicious promotion of videos.
  • Content publishers who publish a large number of malicious promotion videos often act as paid posters (the so-called "Internet water army"). For example, if a content publisher publishes dozens of malicious promotion videos within a concentrated period of time to promote the same or similar products, it can be determined that the content publisher is a malicious promotion force for a certain company or a certain product. If a content publisher has been publishing malicious promotion videos for a long time, involving a variety of products or content, it can be determined that the content publisher is an intermediary paid-poster group that makes a living by taking on promotion jobs.
  • FIG. 11 is a schematic diagram of an identification framework supplemented based on non-content-level features in the embodiment of the application.
  • the basic information of the content publisher including the number of followers, is used.
  • the identity confidence of the content publisher is used to assist in judging whether the video published or forwarded by the content publisher should be judged as malicious promotion, and to increase the confidence of the judgment result of the multi-source information hierarchical strategy combination algorithm.
  • the harm level of publishing maliciously promoted videos is higher than that of forwarding maliciously promoted videos.
  • the malicious promotion tendency label of the content publisher is constantly changing dynamically with the behavior of the content publisher. Therefore, the publisher information for the video to be identified in this application is based on the current period. Generally, if a content publisher publishes more videos that are determined to be maliciously promoted in a shorter period of time, the content publisher's malicious promotion tendency will increase faster. However, if a maliciously promoted video has been published or forwarded in the past, but the content publisher has not published a maliciously promoted video recently, the malicious promotion tendency of the content publisher is decreasing over time.
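  • One way to obtain a tendency value with the qualitative behaviour described above is a time-decayed sum over past malicious promotion events. The description gives no formula, so everything in this sketch is an assumption: the exponential decay, the 30-day half-life, the weights that rank publishing above forwarding, and the names `MaliciousEvent` and `tendency_score`.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MaliciousEvent:
    days_ago: float  # age of a video already judged as malicious promotion
    kind: str        # "publish" or "forward"


# Assumed weights: publishing is described as more harmful than forwarding.
WEIGHTS = {"publish": 1.0, "forward": 0.5}
HALF_LIFE_DAYS = 30.0  # assumed; an event's contribution halves every 30 days


def tendency_score(events: Iterable[MaliciousEvent]) -> float:
    """Sum event weights, each discounted exponentially by its age."""
    decay = math.log(2) / HALF_LIFE_DAYS
    return sum(WEIGHTS[e.kind] * math.exp(-decay * e.days_ago) for e in events)


recent_burst = [MaliciousEvent(d, "publish") for d in (1, 2, 3)]
old_burst = [MaliciousEvent(d, "publish") for d in (181, 182, 183)]
print(tendency_score(recent_burst) > tendency_score(old_burst))
```

  • Under these assumptions, a burst of recent events raises the score quickly, and the score falls over time when no further malicious promotion videos are published or forwarded.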
  • a method for dynamically identifying a content publisher's malicious promotion tendency tag is provided.
  • the identity confidence of the content publisher is determined according to the publisher information obtained in each cycle.
  • a more accurate malicious promotion tendency label of the content publisher can be obtained, which is beneficial to assist in determining the malicious degree of the video.
  • The malicious promotion tendency label of the content publisher is established to assist in judging whether a video published or forwarded by the content publisher should be judged as malicious promotion, and to increase the confidence of the decision result of the multi-source information hierarchical strategy combination algorithm.
  • Determining the video recognition result corresponding to the video to be identified according to the target text classification result and the identity confidence of the content publisher may specifically include:
  • The video recognition result corresponding to the video to be recognized is determined according to the target text classification result, the original text classification result, and the identity confidence of the content publisher, wherein the original text classification result is obtained through the text classification model based on the original text information, the original text classification result indicates the degree of malicious promotion of the original text information, and the original text information includes at least one of comment information and bullet screen information; or,
  • the video recognition result corresponding to the video to be recognized is determined according to the target text classification result, the image classification result and the identity confidence of the content publisher, wherein the image classification result is obtained based on the image data through the image classification model, and the image classification result represents the extent of malicious promotion of image data; or,
  • the video recognition result corresponding to the video to be recognized is determined according to the target text classification result, the original text classification result, the image classification result and the identity confidence of the content publisher.
  • a method for video recognition based on non-content level feature supplementation is introduced.
  • multiple classification results are used for joint judgment, which increases the confidence of the judgment results of the multi-source information hierarchical strategy combination algorithm.
  • the video recognition result corresponding to the video to be recognized may be determined according to the classification result of the target text, the classification result of the original text, and the identity confidence of the content publisher.
  • For example, if the classification result of the target text is "belongs to malicious promotion type", the original text classification result is "belongs to malicious promotion type", and the identity confidence of the content publisher is "belongs to malicious promotion type", then the video recognition result corresponding to the video to be identified is "belongs to malicious promotion type".
  • For another example, if the classification result of the target text is "belongs to malicious promotion type", the original text classification result is "does not belong to malicious promotion type", and the identity confidence of the content publisher is "does not belong to malicious promotion type", then the video recognition result corresponding to the video to be identified is "suspected malicious promotion type".
  • The video recognition result corresponding to the to-be-recognized video may be determined according to the target text classification result, the image classification result, and the identity confidence of the content publisher. For example, if the classification result of the target text is "belongs to malicious promotion type", the image classification result is "belongs to malicious promotion type", and the identity confidence of the content publisher is "belongs to malicious promotion type", then the video recognition result corresponding to the video to be identified is "belongs to malicious promotion type".
  • For another example, if the classification result of the target text is "belongs to malicious promotion type", the image classification result is "does not belong to malicious promotion type", and the identity confidence of the content publisher is "does not belong to malicious promotion type", then the corresponding video recognition result is "suspected malicious promotion type".
  • the video recognition result corresponding to the to-be-recognized video may be determined according to the target text classification result, the original text classification result, the image classification result, and the identity confidence of the content publisher.
  • For example, if the classification result of the target text is "belongs to malicious promotion type", the original text classification result is "belongs to malicious promotion type", the image classification result is "belongs to malicious promotion type", and the identity confidence of the content publisher is "belongs to malicious promotion type", then the video recognition result corresponding to the video to be recognized is "belongs to malicious promotion type".
  • The video recognition result may also be graded: "first-level malicious promotion type" indicates the type most likely to be malicious promotion, the level after "first-level malicious promotion type" indicates the type next most likely to be malicious promotion, and so on.
  • Non-content-level features can be supplemented for judging malicious promotion of videos through additional object behavior. Malicious promotion detection can be performed on this content, and items that exceed a certain threshold can be pushed to detailed manual verification, which optimizes the identification process and provides a positive closed-loop supplement to the multi-source information hierarchical strategy combination algorithm.
  • the video recognition result corresponding to the to-be-recognized video is determined according to the target text classification result, which may specifically include:
  • If the target text classification result is greater than or equal to the target text classification threshold, it is determined that the video recognition result corresponding to the video to be recognized is a malicious promotion type;
  • If the target text classification result is less than the target text classification threshold, it is determined that the video recognition result corresponding to the video to be recognized is a non-malicious promotion type;
  • If the target text classification result is greater than or equal to the target text classification threshold, and the original text classification result is greater than or equal to the original text classification threshold, it is determined that the video recognition result corresponding to the video to be recognized is a malicious promotion type;
  • If the target text classification result is greater than or equal to the target text classification threshold, the original text classification result is greater than or equal to the original text classification threshold, and the image classification result is greater than or equal to the image classification threshold, it is determined that the video recognition result corresponding to the video to be recognized is a malicious promotion type;
  • If the target text classification result is greater than or equal to the target text classification threshold, the original text classification result is greater than or equal to the original text classification threshold, the image classification result is greater than or equal to the image classification threshold, and the identity confidence of the content publisher is greater than or equal to the identity confidence threshold, it is determined that the video recognition result corresponding to the video to be recognized is a malicious promotion type;
  • After the video recognition result corresponding to the video to be recognized is determined according to the target text classification result, the method may further include:
  • At least one of the target text classification threshold, the original text classification threshold, the image classification threshold, and the identity confidence threshold is adjusted.
  • This application further proposes a design scheme that integrates machine intelligence and swarm intelligence. It mainly includes two stages of work, namely the early stage of video malicious promotion identification and the later stage of video malicious promotion identification.
  • FIG. 12 is a schematic diagram of an overall recognition framework for videos in this embodiment of the present application.
  • The early stage of video malicious promotion identification refers to the period when a video has just been released or was released recently.
  • If the algorithm architecture of the multi-source information dimensions and the non-content-level feature supplement determine that the video is suspected of malicious promotion, the video is reviewed by manual operation before it is finally labeled and the identification conclusion is output. This is because current computer equipment cannot guarantee 100% accuracy.
  • the current entire technical architecture should focus on recall, and gradually improve the accuracy rate while giving priority to ensuring recall.
  • the to-be-recognized video is recognized only based on the target text classification result. If the target text classification result is greater than or equal to the target text classification threshold, it is determined that the video recognition result corresponding to the video to be recognized is a malicious promotion type. Conversely, if the target text classification result is less than the target text classification threshold, it is determined that the video recognition result corresponding to the video to be recognized is a non-malicious promotion type.
  • The target text classification threshold can then be adjusted, for example, by increasing or reducing it.
  • the to-be-recognized video is recognized based on the target text classification result and the original text classification result together. If the target text classification result is greater than or equal to the target text classification threshold, and the original text classification result is greater than or equal to the original text classification threshold, it is determined that the video recognition result corresponding to the video to be recognized is a malicious promotion type.
  • At this time, at least one of the target text classification threshold and the original text classification threshold can be adjusted.
  • the to-be-recognized video is recognized based on the target text classification result, the original text classification result, and the image classification result together. If the target text classification result is greater than or equal to the target text classification threshold, and the original text classification result is greater than or equal to the original text classification threshold, and the image classification result is greater than or equal to the image classification threshold, it is determined that the video recognition result corresponding to the video to be recognized is malicious Promotion type.
  • At least one of the target text classification threshold, original text classification threshold, and image classification threshold can be adjusted.
  • The to-be-recognized video is recognized based on the target text classification result, the original text classification result, the image classification result, and the identity confidence of the content publisher together. If the target text classification result is greater than or equal to the target text classification threshold, the original text classification result is greater than or equal to the original text classification threshold, the image classification result is greater than or equal to the image classification threshold, and the identity confidence of the content publisher is greater than or equal to the identity confidence threshold, it is determined that the video recognition result corresponding to the video to be recognized is a malicious promotion type.
  • At this time, at least one of the target text classification threshold, the original text classification threshold, the image classification threshold, and the identity confidence threshold can be adjusted.
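  • The threshold comparisons and the threshold adjustment described above can be sketched as follows. This is an illustration only: scores are assumed to be numeric values in [0, 1], the signal names and the functions `is_malicious` and `adjust_threshold` are hypothetical, and the size of an adjustment is not specified in the description, so it is passed in by the caller.

```python
from typing import Dict


def is_malicious(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Return True only if every available signal reaches its threshold.

    `scores` holds whichever signals exist for the video; the target text
    score is required, the others are optional.
    """
    if "target_text" not in scores:
        raise ValueError("the target text classification result is required")
    return all(scores[name] >= thresholds[name] for name in scores)


def adjust_threshold(
    thresholds: Dict[str, float], name: str, delta: float
) -> Dict[str, float]:
    """Return a copy with one threshold raised or lowered, clamped to [0, 1]."""
    updated = dict(thresholds)
    updated[name] = min(1.0, max(0.0, updated[name] + delta))
    return updated


thresholds = {
    "target_text": 0.5,
    "original_text": 0.5,
    "image": 0.5,
    "publisher_identity": 0.5,
}

scores = {"target_text": 0.9, "image": 0.4}
print(is_malicious(scores, thresholds))           # image score is below 0.5
relaxed = adjust_threshold(thresholds, "image", -0.2)
print(is_malicious(scores, relaxed))              # image threshold lowered
```

  • Returning a new dictionary instead of mutating the thresholds in place keeps earlier settings available for comparison after manual review, which is a design choice of this sketch rather than something the description requires.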
  • the present application proposes a design scheme that integrates machine intelligence and crowd intelligence, which can gradually improve the accuracy rate while giving priority to ensuring recall.
  • Another embodiment of the method for identifying multimedia content in the embodiment of the present application includes:
  • The multimedia content recognition device obtains the target text information and the original text information of the text to be recognized.
  • FIG. 14 is a schematic diagram of an overall recognition framework for text in this embodiment of the application.
  • the target text information may include at least one of title text and introduction text
  • the original text information may include at least one of comment information and bullet screen information.
  • the multimedia content identification device may be deployed in a server, a terminal device, or a multimedia content identification system composed of a terminal device and a server, which is not limited here.
  • If the target text information satisfies the first malicious promotion condition, the target text classification result is obtained through a text classification model based on the target text information, wherein the target text classification result represents the malicious promotion degree of the target text information;
  • the multimedia content identification device determines whether the target text information satisfies the first malicious promotion condition.
  • The first malicious promotion condition may be satisfied when a keyword or a template in the matching library is hit.
  • For target text information that satisfies the first malicious promotion condition, the target text classification result can be obtained by inputting the text into the trained text classification model, wherein the target text classification result can be a binary classification result, for example, "belongs to malicious promotion type" or "does not belong to malicious promotion type".
  • the target text classification result may be a multi-classification result, for example, "belongs to malicious promotion type", "suspected malicious promotion type” or "does not belong to malicious promotion type”.
  • the original text information satisfies the first malicious promotion condition, based on the original text information, obtain the original text classification result through a text classification model, wherein the original text classification result indicates the malicious promotion degree of the original text information;
  • the multimedia content identification device determines whether the original text information satisfies the first malicious promotion condition. For the original text information that meets the first malicious promotion condition, input it into the trained text classification model to obtain the original text classification result, where the original text classification result can be a binary classification result, or A multi-class result.
  • The multimedia content recognition device determines the text recognition result corresponding to the text to be recognized according to the target text classification result and the original text classification result. For example, if the target text classification result is "belongs to the malicious promotion type" and the original text classification result is also "belongs to the malicious promotion type", the text recognition result output for the text to be recognized is "belongs to the malicious promotion type", and the degree of malicious promotion is the highest. For another example, if the target text classification result is "belongs to the malicious promotion type" and the original text classification result is "does not belong to the malicious promotion type", the text recognition result output for the text to be recognized is "suspected malicious promotion type".
  • For another example, if the target text classification result is "does not belong to the malicious promotion type" and the original text classification result is also "does not belong to the malicious promotion type", the text recognition result output for the text to be recognized is "does not belong to the malicious promotion type".
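The three example cases above can be sketched as a small fusion function. This is a minimal illustration, not the application's implementation: the function name and the boolean inputs are hypothetical, and the case where only the original text is classified as malicious is not stated in the text, so treating it as "suspected" is an assumption.

```python
MALICIOUS = "belongs to the malicious promotion type"
SUSPECTED = "suspected malicious promotion type"
BENIGN = "does not belong to the malicious promotion type"


def fuse_text_results(target_is_malicious: bool, original_is_malicious: bool) -> str:
    """Combine the target text and original text classification results."""
    if target_is_malicious and original_is_malicious:
        # Both sources agree: the degree of malicious promotion is the highest.
        return MALICIOUS
    if target_is_malicious or original_is_malicious:
        # The text only describes "target malicious, original not malicious".
        # Handling the reverse case the same way is an assumption.
        return SUSPECTED
    return BENIGN
```

The same shape applies to the image and audio embodiments, with the image text or audio text classification result taking the place of the target text classification result.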
  • FIG. 15 is a schematic diagram of a recognition scene of text-based media content in the embodiment of the application. As shown in the figure, the recognition results of different texts can be displayed to the content manager.
  • On the one hand, the text recognition platform can directly remove or delete texts that "belong to the malicious promotion type".
  • On the other hand, content managers can further check the specific information of the texts and manually check the accuracy of the output results.
  • A method for identifying multimedia content is provided.
  • For the media form of text-based multimedia content, text quality can be grasped more comprehensively from aspects such as the title, the introduction and the original text information, forming a more complete strategy for identifying malicious promotion in text and thereby improving the accuracy of identifying it.
  • Another embodiment of the method for identifying multimedia content in the embodiment of the present application includes:
  • The multimedia content recognition device obtains the image data of the image to be recognized and the original text information for the image to be recognized.
  • FIG. 17 is a schematic diagram of an overall recognition framework for images in this embodiment of the application. As shown in the figure, the original text information may include at least one of comment information and bullet screen information.
  • The multimedia content identification device may be deployed in a server, a terminal device, or a multimedia content identification system composed of a terminal device and a server, which is not limited here.
  • The multimedia content recognition apparatus performs optical character recognition (OCR) processing on the image data in the image to be recognized to obtain image text.
  • If the image text satisfies the first malicious promotion condition, an image text classification result is obtained through a text classification model based on the image text, wherein the image text classification result represents the degree of malicious promotion of the image text;
  • The multimedia content identification device determines whether the image text satisfies the first malicious promotion condition.
  • The first malicious promotion condition may be satisfied when a keyword or a template in the matching library is hit.
  • The image text classification result can be a binary classification result, for example, "belongs to the malicious promotion type" or "does not belong to the malicious promotion type".
  • Alternatively, the image text classification result may be a multi-class result, for example, "belongs to the malicious promotion type", "suspected malicious promotion type" or "does not belong to the malicious promotion type".
  • If the original text information satisfies the first malicious promotion condition, an original text classification result is obtained through a text classification model based on the original text information, wherein the original text classification result represents the degree of malicious promotion of the original text information;
  • The multimedia content identification device determines whether the original text information satisfies the first malicious promotion condition. Original text information that satisfies the first malicious promotion condition is input into the trained text classification model to obtain the original text classification result, which can be a binary classification result or a multi-class result.
  • The multimedia content recognition device determines the image recognition result corresponding to the image to be recognized according to the image text classification result and the original text classification result. For example, if the image text classification result is "belongs to the malicious promotion type" and the original text classification result is also "belongs to the malicious promotion type", the image recognition result output for the image to be recognized is "belongs to the malicious promotion type", and the degree of malicious promotion is the highest. For another example, if the image text classification result is "belongs to the malicious promotion type" and the original text classification result is "does not belong to the malicious promotion type", the image recognition result output for the image to be recognized is "suspected malicious promotion type".
  • For another example, if the image text classification result is "does not belong to the malicious promotion type" and the original text classification result is also "does not belong to the malicious promotion type", the image recognition result output for the image to be recognized is "does not belong to the malicious promotion type".
  • FIG. 18 is a schematic diagram of a recognition scene of image-based media content in the embodiment of the application. As shown in the figure, after text recognition, the recognition results of different images can be displayed to the content manager. On the one hand, the image recognition platform can directly remove or delete images that "belong to the malicious promotion type"; on the other hand, content managers can further check the specific information of the images and manually check the accuracy of the output results.
  • A method for identifying multimedia content is provided.
  • For the media form of image-based multimedia content, image quality can be grasped more comprehensively from aspects such as the image data and the original text information, forming a more complete strategy for identifying malicious promotion in images and thereby improving the accuracy of identifying it.
  • Another embodiment of the method for identifying multimedia content in the embodiment of the present application includes:
  • The multimedia content recognition device acquires the audio data of the audio to be recognized and the original text information for the audio to be recognized.
  • FIG. 20 is a schematic diagram of an overall recognition framework for audio in this embodiment of the application. As shown in the figure, the original text information may include at least one of comment information and bullet screen information.
  • The multimedia content identification device may be deployed in a server, a terminal device, or a multimedia content identification system composed of a terminal device and a server, which is not limited here.
  • The multimedia content recognition apparatus performs automatic speech recognition (ASR) processing on the audio data in the audio to be recognized to obtain audio text.
  • If the audio text satisfies the first malicious promotion condition, an audio text classification result is obtained through a text classification model based on the audio text, wherein the audio text classification result represents the degree of malicious promotion of the audio text;
  • The multimedia content identification device determines whether the audio text satisfies the first malicious promotion condition.
  • The first malicious promotion condition may be satisfied when a keyword or a template in the matching library is hit.
  • The audio text classification result can be a binary classification result, for example, "belongs to the malicious promotion type" or "does not belong to the malicious promotion type".
  • Alternatively, the audio text classification result may be a multi-class result, for example, "belongs to the malicious promotion type", "suspected malicious promotion type" or "does not belong to the malicious promotion type".
  • If the original text information satisfies the first malicious promotion condition, an original text classification result is obtained through a text classification model based on the original text information, wherein the original text classification result represents the degree of malicious promotion of the original text information;
  • The multimedia content identification device determines whether the original text information satisfies the first malicious promotion condition. Original text information that satisfies the first malicious promotion condition is input into the trained text classification model to obtain the original text classification result, which can be a binary classification result or a multi-class result.
  • The multimedia content recognition device determines the audio recognition result corresponding to the audio to be recognized according to the audio text classification result and the original text classification result. For example, if the audio text classification result is "belongs to the malicious promotion type" and the original text classification result is also "belongs to the malicious promotion type", the audio recognition result output for the audio to be recognized is "belongs to the malicious promotion type", and the degree of malicious promotion is the highest. For another example, if the audio text classification result is "belongs to the malicious promotion type" and the original text classification result is "does not belong to the malicious promotion type", the audio recognition result output for the audio to be recognized is "suspected malicious promotion type".
  • For another example, if the audio text classification result is "does not belong to the malicious promotion type" and the original text classification result is also "does not belong to the malicious promotion type", the audio recognition result output for the audio to be recognized is "does not belong to the malicious promotion type".
  • FIG. 21 is a schematic diagram of a recognition scene of audio media content in the embodiment of the application. As shown in the figure, the recognition results of different audios can be displayed to the content manager.
  • On the one hand, the audio recognition platform can directly remove or delete audio that "belongs to the malicious promotion type".
  • On the other hand, the content manager can further check the specific information of the audio and manually check the accuracy of the output results.
  • A method for identifying multimedia content is provided.
  • For the media form of audio multimedia content, audio quality can be grasped more comprehensively from aspects such as the audio data and the original text information, forming a more complete strategy for identifying malicious promotion in audio and thereby improving the accuracy of identifying it.
  • FIG. 22 is a schematic diagram of an embodiment of the multimedia content identification device in the embodiment of the present application.
  • the multimedia content identification device 50 includes:
  • an acquisition module 501 configured to acquire target text information and content information of the video to be identified, wherein the target text information includes at least one of title text and introduction text, and the content information includes at least one of image data and audio data;
  • the identification module 502 is configured to perform text recognition processing on the content information to obtain associated text information, wherein the associated text information includes at least one of image text and audio text, the image text is obtained after text recognition is performed on the image data, and the audio text is obtained after text recognition is performed on the audio data;
  • the obtaining module 501 is further configured to use at least one of the target text information and the associated text information that satisfies the first malicious promotion condition as the text content, and obtain the target text classification result through the text classification model based on the text content, wherein the target text classification result indicates the degree of malicious promotion of the text content;
  • the identification module 502 is further configured to determine a video identification result corresponding to the video to be identified according to the target text classification result, where the video identification result indicates the degree of malicious promotion of the video to be identified.
  • A multimedia content identification device is provided. With the above device, the degree of malicious promotion of multimedia content is identified from multiple perspectives. Video quality is grasped more comprehensively from aspects such as audio, forming a more complete strategy for identifying malicious video promotion and thereby improving the accuracy of identifying it.
  • the content information includes image data
  • the identification module 502 is specifically configured to perform frame-by-frame processing on the image data included in the video to be identified to obtain K video frames, where K is an integer greater than or equal to 1;
  • L is an integer greater than or equal to 1 and less than K
  • the deduplicated subtitles in each video frame are used as the image text in the associated text information.
  • A multimedia content recognition device is provided.
  • With the above device, the image data of the video is first subjected to frame extraction processing, and the recognized image text is then matched against the templates, which adds a dimension for identifying malicious promotion videos and helps improve the accuracy of identifying malicious video promotion.
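The frame extraction and subtitle deduplication described above might look like the following sketch. The sampling step, both function names, and the whitespace normalization are assumptions made for illustration; the OCR step that would produce the per-frame subtitle strings is left out.

```python
from typing import List


def sample_frame_indices(frame_count: int, step: int) -> List[int]:
    """Pick every `step`-th frame index (a hypothetical extraction policy)."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return list(range(0, frame_count, step))


def dedupe_subtitles(subtitles: List[str]) -> List[str]:
    """Drop empty and repeated subtitle strings while keeping first-seen order."""
    seen = set()
    unique = []
    for raw in subtitles:
        normalized = " ".join(raw.split())  # collapse runs of whitespace
        if normalized and normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    return unique
```

Deduplicating before template matching keeps a subtitle that stays on screen across many frames from being matched repeatedly.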
  • the content information includes audio data
  • the identification module 502 is specifically configured to perform frame-by-frame processing on the audio data included in the video to be identified, to obtain T audio frames, where T is an integer greater than or equal to 1;
  • Feature extraction processing is performed on each audio frame in the T audio frames to obtain an audio feature vector corresponding to each audio frame;
  • the audio text in the associated text information is determined based on the audio feature vector corresponding to each audio frame.
  • A multimedia content identification device is provided. With the above device, the audio data of the video is first recognized, and the recognized audio text is then matched against the templates, which adds a dimension for identifying malicious promotion videos and helps improve the accuracy of identifying malicious video promotion.
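The audio path above splits the audio into T frames and computes a feature vector per frame. The sketch below illustrates only that framing step. The frame length, the hop size, and the two toy features (log energy and zero-crossing rate) are assumptions standing in for whatever features the speech recognizer actually consumes.

```python
import math
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Slice the signal into overlapping frames; the number of frames is T."""
    last_start = len(samples) - frame_len
    return [list(samples[s:s + frame_len]) for s in range(0, last_start + 1, hop)]


def frame_features(frame: Sequence[float]) -> List[float]:
    """Return a toy feature vector: [log energy, zero-crossing rate]."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return [math.log(energy + 1e-10), crossings / len(frame)]


def audio_feature_vectors(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """One feature vector per audio frame."""
    return [frame_features(f) for f in split_into_frames(samples, frame_len, hop)]
```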
  • the apparatus for identifying multimedia content 50 further includes a determining module 503;
  • the determining module 503 is configured to determine, if the title text successfully matches a template in the matching library and fails to match the information in the whitelist, that the title text satisfies the first malicious promotion condition and that the target text information satisfies the first malicious promotion condition;
  • the determining module 503 is further configured to determine, if the introduction text successfully matches a template in the matching library and fails to match the information in the whitelist, that the introduction text satisfies the first malicious promotion condition and that the target text information satisfies the first malicious promotion condition;
  • the determining module 503 is further configured to determine, if the image text successfully matches a template in the matching library and fails to match the information in the whitelist, that the image text satisfies the first malicious promotion condition and that the associated text information satisfies the first malicious promotion condition;
  • the determining module 503 is further configured to determine, if the audio text successfully matches a template in the matching library and fails to match the information in the whitelist, that the audio text satisfies the first malicious promotion condition and that the associated text information satisfies the first malicious promotion condition.
  • A multimedia content identification device is provided.
  • With the above device, videos that may be maliciously promoted can be found based on template matching, and videos belonging to a whitelist can be filtered out based on a rejection strategy, which helps improve the accuracy of the recalled videos.
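The recall-then-reject check described above can be sketched as follows. The regular-expression templates and the substring whitelist are hypothetical representations; the text does not specify the format of the matching library or the whitelist.

```python
import re
from typing import Iterable


def satisfies_first_condition(text: str, templates: Iterable[str], whitelist: Iterable[str]) -> bool:
    """True if the text hits a template and is not rejected by the whitelist."""
    matched = any(re.search(pattern, text, flags=re.IGNORECASE) for pattern in templates)
    if not matched:
        return False
    lowered = text.lower()
    # Rejection strategy: a whitelist hit overrides the template match.
    return not any(entry.lower() in lowered for entry in whitelist)
```

The same function would be applied separately to the title text, introduction text, image text and audio text, so that only texts passing this check reach the text classification model.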
  • the obtaining module 501 is specifically configured to obtain a title text classification result through a text classification model based on the title text if the title text satisfies the first malicious promotion condition;
  • if the introduction text satisfies the first malicious promotion condition, the introduction text classification result is obtained through the text classification model based on the introduction text;
  • if the image text satisfies the first malicious promotion condition, the image text classification result is obtained through the text classification model based on the image text;
  • if the audio text satisfies the first malicious promotion condition, the audio text classification result is obtained through the text classification model based on the audio text.
  • A multimedia content recognition device is provided.
  • With the above device, when the text classification model is a single classifier, the target text classification results (including the title text classification result, the introduction text classification result, the image text classification result and the audio text classification result) can be predicted directly.
  • Multiple prediction branches can be processed in parallel, that is, multiple pieces of multimedia content can be classified at the same time, which improves the efficiency of classification prediction.
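One way to picture the parallel prediction branches is to run the same classifier over the title, introduction, image text and audio text concurrently. The sketch below uses a thread pool and a toy keyword scorer; `toy_classifier`, its keywords and the branch names are all assumptions standing in for the trained text classification model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def toy_classifier(text: str) -> float:
    """Stand-in for the trained model: 1.0 if a hypothetical keyword appears."""
    keywords = ("add me", "scan the code")
    return 1.0 if any(k in text.lower() for k in keywords) else 0.0


def classify_branches(branches: Dict[str, str],
                      classifier: Callable[[str], float] = toy_classifier) -> Dict[str, float]:
    """Score every non-empty text branch with the same classifier, in parallel."""
    names = [name for name, text in branches.items() if text]
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        scores = list(pool.map(classifier, (branches[name] for name in names)))
    return dict(zip(names, scores))
```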
  • the obtaining module 501 is specifically configured to: if the title text satisfies the first malicious promotion condition, obtain, based on the title text, N title text classification sub-results through N sub-classification models included in the text classification model, and determine the title text classification result according to the N title text classification sub-results, wherein each title text classification sub-result corresponds to one malicious promotion type;
  • if the introduction text satisfies the first malicious promotion condition, obtain, based on the introduction text, N introduction text classification sub-results through the N sub-classification models included in the text classification model, and determine the introduction text classification result according to the N introduction text classification sub-results, wherein each introduction text classification sub-result corresponds to one malicious promotion type;
  • if the image text satisfies the first malicious promotion condition, obtain, based on the image text, N image text classification sub-results through the N sub-classification models included in the text classification model, and determine the image text classification result according to the N image text classification sub-results, wherein each image text classification sub-result corresponds to one malicious promotion type;
  • if the audio text satisfies the first malicious promotion condition, obtain, based on the audio text, N audio text classification sub-results through the N sub-classification models included in the text classification model, and determine the audio text classification result according to the N audio text classification sub-results, wherein each audio text classification sub-result corresponds to one malicious promotion type.
  • A multimedia content recognition device is provided.
  • With the above device, when the text classification model includes multiple classifiers, multiple text classification sub-results (including title text classification sub-results, introduction text classification sub-results, image text classification sub-results and audio text classification sub-results) can be predicted, and a target text classification result is finally determined based on these sub-results.
  • In this way, the multimedia content can be classified more precisely, and subsequent identification can be performed for the specific classification, which helps improve the accuracy of the identification.
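A minimal sketch of the multi-classifier variant follows: each of the N sub-models scores one malicious promotion type, and the sub-results are reduced to a single result. The text does not say how the sub-results are combined, so taking the highest-scoring type is an assumption, and the keyword scorers and promotion type names are hypothetical stand-ins for trained sub-models.

```python
from typing import Callable, Dict, Tuple

SubModel = Callable[[str], float]


def keyword_scorer(*keywords: str) -> SubModel:
    """Build a toy sub-model that scores the fraction of keywords present."""
    def score(text: str) -> float:
        lowered = text.lower()
        hits = sum(1 for k in keywords if k in lowered)
        return hits / len(keywords)
    return score


def classify_with_submodels(text: str, submodels: Dict[str, SubModel]) -> Tuple[Dict[str, float], str, float]:
    """Return the N sub-results plus the top promotion type and its score."""
    sub_results = {ptype: model(text) for ptype, model in submodels.items()}
    top_type = max(sub_results, key=sub_results.get)
    return sub_results, top_type, sub_results[top_type]
```

Keeping the per-type sub-results alongside the combined result lets later steps act on the specific promotion type rather than only on an overall score.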
  • the obtaining module 501 is further configured to obtain an image classification result through an image classification model based on the image data if the content information includes image data, wherein the image classification result represents the degree of malicious promotion of the image data;
  • the recognition module 502 is specifically configured to determine the video recognition result corresponding to the video to be recognized according to the target text classification result and the image classification result.
  • A multimedia content recognition device is provided. With the above device, the image data in the video is recognized to obtain the image classification result, which is combined with the target text classification result obtained after text recognition to jointly judge whether the video content is maliciously promoted, so that the hierarchical strategies in each source information dimension are combined to achieve a complementary effect.
  • the identification module 502 is specifically configured to determine that the video identification result corresponding to the to-be-identified video is a malicious promotion video if the image classification result indicates that the to-be-identified video includes an information code and the target text classification result satisfies the second malicious promotion condition;
  • if the image classification result indicates that the to-be-identified video includes a trend chart and the target text classification result satisfies the second malicious promotion condition, it is determined that the video identification result corresponding to the to-be-identified video is a malicious promotion video;
  • if the image classification result indicates that the to-be-identified video belongs to a preset video type, it is determined that the video identification result corresponding to the to-be-identified video is a non-malicious promotion video.
  • A multimedia content identification device is provided.
  • With the above device, a "white judgment" strategy algorithm and a "black judgment" strategy algorithm are introduced.
  • If any one of the "white judgment" strategy algorithms is hit, the to-be-identified video is automatically classified as a non-malicious promotion video. A hit on a "black judgment" strategy algorithm is used as an auxiliary feature of a malicious promotion video, but does not by itself lead to a judgment of malicious promotion.
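The interplay of the "white judgment" and "black judgment" strategies can be sketched as a small decision function. The label strings, the `"undetermined"` fallback and the function name are assumptions; the text names only the information code, the trend chart and an unspecified preset video type.

```python
from typing import Iterable

# Hypothetical label names for what the image classification model reports.
BLACK_FEATURES = {"information_code", "trend_chart"}
WHITE_TYPES = {"preset_video_type"}


def judge_video(image_labels: Iterable[str], text_meets_second_condition: bool) -> str:
    """Apply the white rule first, then the black rule combined with the text result."""
    labels = set(image_labels)
    if labels & WHITE_TYPES:
        # Any white hit classifies the video as non-malicious outright.
        return "non-malicious promotion video"
    if labels & BLACK_FEATURES and text_meets_second_condition:
        # A black hit is only auxiliary: it needs the text result to agree.
        return "malicious promotion video"
    return "undetermined"  # assumption: left to the other signals or manual review
```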
  • the obtaining module 501 is further configured to obtain original text information for the video to be identified, wherein the original text information includes at least one of comment information and bullet screen information;
  • the obtaining module 501 is further configured to obtain the original text classification result through the text classification model based on the original text information if the original text information satisfies the first malicious promotion condition, wherein the original text classification result represents the malicious promotion degree of the original text information;
  • the recognition module 502 is specifically configured to determine the video recognition result corresponding to the video to be recognized according to the target text classification result and the original text classification result; or,
  • determine the video recognition result corresponding to the video to be recognized according to the target text classification result, the original text classification result and the image classification result, wherein the image classification result is obtained through the image classification model based on the image data, and the image classification result represents the degree of malicious promotion of the image data.
  • A multimedia content identification device is provided.
  • With the above device, additional comment information supplements the identification of malicious video promotion with non-content-level features. Malicious promotion detection is performed on this content, and only content that exceeds a certain threshold is pushed to detailed manual verification, which optimizes the identification process and provides a positive closed-loop supplement to the multi-source information hierarchical strategy combination algorithm.
  • the acquiring module 501 is further configured to acquire publisher information for the video to be identified in the current cycle, wherein the publisher information includes basic information and behavior information of the content publisher in the current cycle;
  • the determining module 503 is further configured to determine the identity confidence of the content publisher according to the publisher information
  • the identification module 502 is specifically used to determine the video identification result corresponding to the video to be identified according to the target text classification result and the identity confidence of the content publisher.
  • A multimedia content identification device is provided. With the above device, the identity confidence of the content publisher is determined according to the publisher information obtained in each cycle.
  • In this way, a more accurate malicious promotion tendency label of the content publisher can be obtained, which helps determine the degree of malicious promotion of the video.
  • The malicious promotion tendency label of the content publisher is established to assist in judging whether a video published or forwarded by the content publisher should be judged as malicious promotion, which improves the confidence of the decision result of the multi-source information hierarchical strategy combination algorithm.
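The text does not state how the identity confidence is computed from the per-cycle publisher information. Purely as an illustration, the sketch below blends the previous confidence with the share of the publisher's videos flagged in the current cycle; the smoothing factor, the choice of the flagged share as the behavior signal, and the function name are all assumptions.

```python
def update_identity_confidence(previous: float, flagged: int, published: int, alpha: float = 0.5) -> float:
    """Blend last cycle's confidence with this cycle's flagged-video share."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if published == 0:
        # No activity in this cycle: keep the previous estimate.
        return previous
    cycle_rate = flagged / published
    return (1.0 - alpha) * previous + alpha * cycle_rate
```

Updating once per cycle lets the label follow changes in publisher behavior without letting a single cycle dominate the estimate.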
  • the identification module 502 is specifically configured to determine the video identification result corresponding to the video to be identified according to the target text classification result, the original text classification result and the identity confidence of the content publisher, wherein the original text classification result is obtained through the text classification model based on the original text information, the original text classification result indicates the degree of malicious promotion of the original text information, and the original text information includes at least one of comment information and bullet screen information; or,
  • the video recognition result corresponding to the video to be recognized is determined according to the target text classification result, the image classification result and the identity confidence of the content publisher, wherein the image classification result is obtained based on the image data through the image classification model, and the image classification result represents the extent of malicious promotion of image data; or,
  • the video recognition result corresponding to the video to be recognized is determined according to the target text classification result, the original text classification result, the image classification result and the identity confidence of the content publisher.
  • A multimedia content identification device is provided.
  • With the above device, additional object behaviors supplement the identification of malicious video promotion with non-content-level features. Malicious promotion detection is performed on this content, and content can then be pushed to detailed manual verification, which optimizes the identification process and provides a positive closed-loop supplement to the multi-source information hierarchical strategy combination algorithm.
  • the multimedia content identification device 50 further includes an adjustment module 504;
  • the identification module 502 is specifically configured to determine that the video identification result corresponding to the video to be identified is a malicious promotion type if the target text classification result is greater than or equal to the target text classification threshold;
  • if the target text classification result is less than the target text classification threshold, it is determined that the video recognition result corresponding to the video to be recognized is a non-malicious promotion type;
  • if the target text classification result is greater than or equal to the target text classification threshold, and the original text classification result is greater than or equal to the original text classification threshold, it is determined that the video recognition result corresponding to the video to be recognized is a malicious promotion type;
  • if the target text classification result is greater than or equal to the target text classification threshold, the original text classification result is greater than or equal to the original text classification threshold, and the image classification result is greater than or equal to the image classification threshold, it is determined that the video recognition result corresponding to the video to be recognized is a malicious promotion type;
  • if the target text classification result is greater than or equal to the target text classification threshold, the original text classification result is greater than or equal to the original text classification threshold, the image classification result is greater than or equal to the image classification threshold, and the identity confidence of the content publisher is greater than or equal to the identity confidence threshold, it is determined that the video recognition result corresponding to the video to be recognized is a malicious promotion type;
  • the acquisition module 501 is also used to obtain the video labeling result corresponding to the video to be recognized after determining the corresponding video recognition result of the video to be recognized according to the target text classification result;
  • the adjustment module 504 is configured to adjust at least one of the target text classification threshold, the original text classification threshold, the image classification threshold and the identity confidence threshold if the video recognition result does not match the video annotation result.
  • A multimedia content identification device is provided.
  • The present application proposes a design scheme that integrates machine intelligence and crowd intelligence, which can gradually improve accuracy while giving priority to ensuring recall.
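The threshold decision and the annotation-driven adjustment described above can be sketched as follows. The decision rule (every available signal must reach its threshold) follows the listed cases. The adjustment policy is an assumption, since the text only says that at least one threshold is adjusted on a mismatch: thresholds that blocked a missed video are lowered, and all thresholds are raised after a false alarm. The step size and signal names are hypothetical.

```python
from typing import Dict


def is_malicious(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Malicious only if every available signal reaches its threshold."""
    return all(score >= thresholds[name] for name, score in scores.items())


def adjust_thresholds(scores: Dict[str, float], thresholds: Dict[str, float],
                      annotated_malicious: bool, step: float = 0.05) -> Dict[str, float]:
    """Return new thresholds after comparing the prediction with the manual label."""
    adjusted = dict(thresholds)
    if is_malicious(scores, thresholds) == annotated_malicious:
        return adjusted  # prediction matches the annotation: nothing to change
    for name, score in scores.items():
        if annotated_malicious and score < thresholds[name]:
            # Missed a malicious video: relax the thresholds that blocked it.
            adjusted[name] = max(0.0, thresholds[name] - step)
        elif not annotated_malicious:
            # False alarm: tighten the thresholds of the signals that fired.
            adjusted[name] = min(1.0, thresholds[name] + step)
    return adjusted
```

Here `scores` would hold whichever of the target text, original text, image and identity confidence signals are available for the video, keyed by the same names as `thresholds`.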
  • FIG. 23 is a schematic diagram of an embodiment of the multimedia content identification device in the embodiment of the present application.
  • the multimedia content identification device 60 includes:
  • the acquisition module 601 is configured to acquire the target text information of the text to be recognized and the original text information for the text to be recognized, wherein the target text information includes at least one of title text and introduction text, and the original text information includes at least one of comment information and bullet screen information;
  • the obtaining module 601 is further configured to obtain a target text classification result through a text classification model based on the target text information if the target text information satisfies the first malicious promotion condition, wherein the target text classification result represents the malicious promotion degree of the target text information;
  • the obtaining module 601 is further configured to obtain the original text classification result through the text classification model based on the original text information if the original text information satisfies the first malicious promotion condition, wherein the original text classification result indicates the malicious promotion degree of the original text information;
  • the recognition module 602 is configured to determine the text recognition result corresponding to the text to be recognized according to the target text classification result and/or the original text classification result, wherein the text recognition result indicates the degree of malicious promotion of the text to be recognized.
  • a multimedia content identification device is provided.
  • with the above device, the malicious promotion degree of multimedia content is identified from multiple angles.
  • for the media form of text-type multimedia content, the text quality is grasped more comprehensively from aspects such as the title, the introduction and the original text information, and a more complete strategy for identifying malicious promotion in text is formed, thereby improving the accuracy of identifying malicious promotion in text.
  • FIG. 24 is a schematic diagram of an embodiment of the multimedia content identification device in the embodiment of the present application.
  • the multimedia content identification device 70 includes:
  • the acquisition module 701 is used to acquire image data of the image to be recognized and original text information for the image to be recognized, wherein the original text information includes at least one of comment information and bullet screen information;
  • the recognition module 702 is used to perform text recognition processing on the image data to obtain image text;
  • the obtaining module 701 is further configured to obtain an image text classification result through a text classification model based on the image text if the image text satisfies the first malicious promotion condition, wherein the image text classification result represents the malicious promotion degree of the image text;
  • the obtaining module 701 is further configured to obtain the original text classification result through the text classification model based on the original text information if the original text information satisfies the first malicious promotion condition, wherein the original text classification result represents the malicious promotion degree of the original text information;
  • the recognition module 702 is further configured to determine the image recognition result corresponding to the to-be-recognized image according to the image-text classification result and/or the original text classification result, wherein the image recognition result represents the degree of malicious promotion of the to-be-recognized image.
  • a multimedia content identification device is provided.
  • with the above device, the malicious promotion degree of multimedia content is identified from multiple perspectives.
  • for the media form of image-type multimedia content, the image quality is grasped more comprehensively from aspects such as the image data and the original text information, and a more complete strategy for identifying malicious promotion in images is formed, thereby improving the accuracy of identifying malicious promotion in images.
  • FIG. 25 is a schematic diagram of an embodiment of the multimedia content identification device in the embodiment of the present application.
  • the multimedia content identification device 80 includes:
  • the acquisition module 801 is used to acquire audio data of the audio to be recognized and original text information for the audio to be recognized, wherein the original text information includes at least one of comment information and bullet screen information;
  • the recognition module 802 is used to perform text recognition processing on the audio data to obtain audio text;
  • the obtaining module 801 is further configured to obtain an audio text classification result through a text classification model based on the audio text if the audio text satisfies the first malicious promotion condition, wherein the audio text classification result represents the malicious promotion degree of the audio text;
  • the obtaining module 801 is further configured to obtain the original text classification result through the text classification model based on the original text information if the original text information satisfies the first malicious promotion condition, wherein the original text classification result indicates the malicious promotion degree of the original text information;
  • the recognition module 802 is further configured to determine the audio recognition result corresponding to the audio to be recognized according to the audio text classification result and/or the original text classification result, wherein the audio recognition result represents the malicious promotion degree of the audio to be recognized.
  • a multimedia content identification device is provided, and the above device is used to identify the degree of malicious promotion of multimedia content from multiple perspectives.
  • for the media form of audio-type multimedia content, the audio quality is grasped more comprehensively from aspects such as the audio data and the original text information, and a more complete strategy for identifying malicious promotion in audio is formed, thereby improving the accuracy of identifying malicious promotion in audio.
  • FIG. 26 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 900 may vary greatly due to different configurations or performance, and may include one or more central processing units (CPUs) 922 (e.g., one or more processors), memory 932, and one or more storage media 930 (e.g., one or more mass storage devices) that store applications 942 or data 944.
  • the memory 932 and the storage medium 930 may be short-term storage or persistent storage.
  • the program stored in the storage medium 930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
  • the central processing unit 922 may be configured to communicate with the storage medium 930 to execute a series of instruction operations in the storage medium 930 on the server 900 .
  • Server 900 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input and output interfaces 958, and/or one or more operating systems 941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
  • the steps performed by the server in the above embodiment may be based on the server structure shown in FIG. 26 .
  • Embodiments of the present application also provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when it runs on a computer, causes the computer to execute the methods described in the foregoing embodiments.
  • the embodiments of the present application also provide a computer program product including a program, which, when run on a computer, causes the computer to execute the methods described in the foregoing embodiments.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application, in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

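The present application discloses a multimedia content identification method applicable to the field of cloud computing. The method includes: acquiring target text information and content information of a video to be recognized; performing text recognition processing on the content information to obtain associated text information; if at least one of the target text information and the associated text information satisfies a first malicious promotion condition, obtaining a target text classification result through a text classification model; and determining, according to the target text classification result, a video recognition result corresponding to the video to be recognized. The present application also provides a related apparatus, device and storage medium. The present application identifies the degree of malicious promotion of multimedia content from multiple angles. For the media form of video-type multimedia content, video quality is assessed more comprehensively in terms of title, introduction, image and audio, forming a more complete strategy for identifying malicious promotion in videos and thereby improving the accuracy of that identification.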

Description

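Multimedia content identification method, related apparatus, device and storage medium
This application claims priority to Chinese patent application No. 202110426652.1, filed with the China Patent Office on April 20, 2021 and entitled "Multimedia content identification method, related apparatus, device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of cloud computing, and in particular to the identification of multimedia content.
Background
With the rapid growth of video applications, the video content ecosystem has changed dramatically, and more and more multimedia content (for example, video, pictures, text or audio), most of it uploaded by users, is made available to the public. A large amount of multimedia content of a malicious promotion nature has seriously harmed the development of the content ecosystem.
At present, malicious promotion can be identified in advertising and promotional text.
However, because malicious promotion content is highly varied and changes very quickly, the accuracy of identifying malicious promotion information is low, and it is difficult to combat such information effectively.
Summary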
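Embodiments of the present application provide a multimedia content identification method, a related apparatus, a device and a storage medium, which identify the degree of malicious promotion of multimedia content from multiple angles. For the media form of video-type multimedia content, video quality is assessed more comprehensively in terms of title, introduction, image and audio, forming a more complete strategy for identifying malicious promotion in videos and thereby improving identification accuracy.
In view of this, one aspect of the present application provides a multimedia content identification method, including:
acquiring target text information and content information of a video to be recognized, wherein the target text information includes at least one of title text and introduction text, and the content information includes at least one of image data and audio data;
performing text recognition processing on the content information to obtain associated text information, wherein the associated text information includes at least one of image text and audio text, the image text is obtained by performing text recognition on the image data, and the audio text is obtained by performing text recognition on the audio data;
taking at least one of the target text information and the associated text information that satisfies a first malicious promotion condition as text content, and obtaining a target text classification result through a text classification model based on the text content, wherein the target text classification result represents the malicious promotion degree of the text content;
determining, according to the target text classification result, a video recognition result corresponding to the video to be recognized, wherein the video recognition result represents the malicious promotion degree of the video to be recognized.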
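Another aspect of the present application provides a multimedia content identification method, including:
acquiring target text information of a text to be recognized and original text information for the text to be recognized, wherein the target text information includes at least one of title text and introduction text, and the original text information includes at least one of comment information and bullet screen information;
if the target text information satisfies a first malicious promotion condition, obtaining a target text classification result through a text classification model based on the target text information, wherein the target text classification result represents the malicious promotion degree of the target text information;
if the original text information satisfies the first malicious promotion condition, obtaining an original text classification result through the text classification model based on the original text information, wherein the original text classification result represents the malicious promotion degree of the original text information;
determining, according to the target text classification result and/or the original text classification result, a text recognition result corresponding to the text to be recognized, wherein the text recognition result represents the malicious promotion degree of the text to be recognized.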
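Another aspect of the present application provides a multimedia content identification method, including:
acquiring image data of an image to be recognized and original text information for the image to be recognized, wherein the original text information includes at least one of comment information and bullet screen information;
performing text recognition processing on the image data to obtain image text;
if the image text satisfies a first malicious promotion condition, obtaining an image text classification result through a text classification model based on the image text, wherein the image text classification result represents the malicious promotion degree of the image text;
if the original text information satisfies the first malicious promotion condition, obtaining an original text classification result through the text classification model based on the original text information, wherein the original text classification result represents the malicious promotion degree of the original text information;
determining, according to the image text classification result and/or the original text classification result, an image recognition result corresponding to the image to be recognized, wherein the image recognition result represents the malicious promotion degree of the image to be recognized.
Another aspect of the present application provides a multimedia content identification method, including:
acquiring audio data of an audio to be recognized and original text information for the audio to be recognized, wherein the original text information includes at least one of comment information and bullet screen information;
performing text recognition processing on the audio data to obtain audio text;
if the audio text satisfies a first malicious promotion condition, obtaining an audio text classification result through a text classification model based on the audio text, wherein the audio text classification result represents the malicious promotion degree of the audio text;
if the original text information satisfies the first malicious promotion condition, obtaining an original text classification result through the text classification model based on the original text information, wherein the original text classification result represents the malicious promotion degree of the original text information;
determining, according to the audio text classification result and/or the original text classification result, an audio recognition result corresponding to the audio to be recognized, wherein the audio recognition result represents the malicious promotion degree of the audio to be recognized.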
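Another aspect of the present application provides a multimedia content identification device, including:
an acquisition module, configured to acquire target text information and content information of a video to be recognized, wherein the target text information includes at least one of title text and introduction text, and the content information includes at least one of image data and audio data;
a recognition module, configured to perform text recognition processing on the content information to obtain associated text information, wherein the associated text information includes at least one of image text and audio text, the image text is obtained by performing text recognition on the image data, and the audio text is obtained by performing text recognition on the audio data;
the acquisition module is further configured to take at least one of the target text information and the associated text information that satisfies a first malicious promotion condition as text content, and to obtain a target text classification result through a text classification model based on the text content, wherein the target text classification result represents the malicious promotion degree of the text content;
the recognition module is further configured to determine, according to the target text classification result, a video recognition result corresponding to the video to be recognized, wherein the video recognition result represents the malicious promotion degree of the video to be recognized.
Another aspect of the present application provides a multimedia content identification device, including:
an acquisition module, configured to acquire target text information of a text to be recognized and original text information for the text to be recognized, wherein the target text information includes at least one of title text and introduction text, and the original text information includes at least one of comment information and bullet screen information;
the acquisition module is further configured to, if the target text information satisfies a first malicious promotion condition, obtain a target text classification result through a text classification model based on the target text information, wherein the target text classification result represents the malicious promotion degree of the target text information;
the acquisition module is further configured to, if the original text information satisfies the first malicious promotion condition, obtain an original text classification result through the text classification model based on the original text information, wherein the original text classification result represents the malicious promotion degree of the original text information;
a recognition module, configured to determine, according to the target text classification result and/or the original text classification result, a text recognition result corresponding to the text to be recognized, wherein the text recognition result represents the malicious promotion degree of the text to be recognized.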
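Another aspect of the present application provides a multimedia content identification device, including:
an acquisition module, configured to acquire image data of an image to be recognized and original text information for the image to be recognized, wherein the original text information includes at least one of comment information and bullet screen information;
a recognition module, configured to perform text recognition processing on the image data to obtain image text;
the acquisition module is further configured to, if the image text satisfies a first malicious promotion condition, obtain an image text classification result through a text classification model based on the image text, wherein the image text classification result represents the malicious promotion degree of the image text;
the acquisition module is further configured to, if the original text information satisfies the first malicious promotion condition, obtain an original text classification result through the text classification model based on the original text information, wherein the original text classification result represents the malicious promotion degree of the original text information;
the recognition module is further configured to determine, according to the image text classification result and/or the original text classification result, an image recognition result corresponding to the image to be recognized, wherein the image recognition result represents the malicious promotion degree of the image to be recognized.
Another aspect of the present application provides a multimedia content identification device, including:
an acquisition module, configured to acquire audio data of an audio to be recognized and original text information for the audio to be recognized, wherein the original text information includes at least one of comment information and bullet screen information;
a recognition module, configured to perform text recognition processing on the audio data to obtain audio text;
the acquisition module is further configured to, if the audio text satisfies a first malicious promotion condition, obtain an audio text classification result through a text classification model based on the audio text, wherein the audio text classification result represents the malicious promotion degree of the audio text;
the acquisition module is further configured to, if the original text information satisfies the first malicious promotion condition, obtain an original text classification result through the text classification model based on the original text information, wherein the original text classification result represents the malicious promotion degree of the original text information;
the recognition module is further configured to determine, according to the audio text classification result and/or the original text classification result, an audio recognition result corresponding to the audio to be recognized, wherein the audio recognition result represents the malicious promotion degree of the audio to be recognized.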
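Another aspect of the present application provides a computer device, including: a memory, a processor and a bus system;
wherein the memory is used to store a program;
the processor is used to execute the program in the memory, and the processor is used to perform the methods of the above aspects according to instructions in the program code;
the bus system is used to connect the memory and the processor so that the memory and the processor can communicate.
Another aspect of the present application provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to perform the methods of the above aspects.
Another aspect of the present application provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the above aspects.
It can be seen from the above technical solutions that the embodiments of the present application have the following advantages:
In the embodiments of the present application, a multimedia content identification method is provided. First, target text information and content information of a video to be recognized are acquired, where the target text information includes at least one of title text and introduction text and the content information includes at least one of image data and audio data. Text recognition processing is then performed on the content information to obtain associated text information. If the first malicious promotion condition is satisfied, a target text classification result is obtained through a text classification model. Finally, the video recognition result corresponding to the video to be recognized is determined according to the target text classification result, and this result represents the malicious promotion degree of the video. In this way, the degree of malicious promotion of multimedia content is identified from multiple angles. For the media form of video-type multimedia content, video quality is assessed more comprehensively in terms of title, introduction, image and audio, forming a more complete strategy for identifying malicious promotion in videos and thereby improving identification accuracy.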
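Brief Description of the Drawings
FIG. 1 is a schematic architecture diagram of the multimedia content identification system in an embodiment of the present application;
FIG. 2 is a schematic diagram of an application framework of the multimedia content identification system in an embodiment of the present application;
FIG. 3 is a schematic flowchart of the multimedia content identification method in an embodiment of the present application;
FIG. 4 is a schematic diagram of an identification framework based on multiple video information sources in an embodiment of the present application;
FIG. 5 is a schematic diagram of an identification scenario for video-type media content in an embodiment of the present application;
FIG. 6 is a schematic diagram of deduplicating subtitles based on coordinate information in an embodiment of the present application;
FIG. 7 is a schematic diagram of outputting a text classification result based on a single classifier in an embodiment of the present application;
FIG. 8 is a schematic diagram of outputting a text classification result based on multiple classifiers in an embodiment of the present application;
FIG. 9 is a schematic diagram of another identification framework based on multiple video information sources in an embodiment of the present application;
FIG. 10 is a schematic diagram of an identification framework supplemented with non-content-level features in an embodiment of the present application;
FIG. 11 is a schematic diagram of an identification framework supplemented with non-content-level features in an embodiment of the present application;
FIG. 12 is a schematic diagram of an overall identification framework for video in an embodiment of the present application;
FIG. 13 is another schematic flowchart of the multimedia content identification method in an embodiment of the present application;
FIG. 14 is a schematic diagram of an overall identification framework for text in an embodiment of the present application;
FIG. 15 is a schematic diagram of an identification scenario for text-type media content in an embodiment of the present application;
FIG. 16 is another schematic flowchart of the multimedia content identification method in an embodiment of the present application;
FIG. 17 is a schematic diagram of an overall identification framework for images in an embodiment of the present application;
FIG. 18 is a schematic diagram of an identification scenario for image-type media content in an embodiment of the present application;
FIG. 19 is another schematic flowchart of the multimedia content identification method in an embodiment of the present application;
FIG. 20 is a schematic diagram of an overall identification framework for audio in an embodiment of the present application;
FIG. 21 is a schematic diagram of an identification scenario for audio-type media content in an embodiment of the present application;
FIG. 22 is a schematic diagram of the multimedia content identification device in an embodiment of the present application;
FIG. 23 is another schematic diagram of the multimedia content identification device in an embodiment of the present application;
FIG. 24 is another schematic diagram of the multimedia content identification device in an embodiment of the present application;
FIG. 25 is another schematic diagram of the multimedia content identification device in an embodiment of the present application;
FIG. 26 is a schematic structural diagram of the server in an embodiment of the present application.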
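Detailed Description
Embodiments of the present application provide a multimedia content identification method, a related apparatus, a device and a storage medium, which identify the degree of malicious promotion of multimedia content from multiple angles. For the media form of video-type multimedia content, video quality is assessed more comprehensively in terms of title, introduction, image and audio, forming a more complete strategy for identifying malicious promotion in videos and thereby improving identification accuracy.
The terms "first", "second", "third", "fourth" and the like (if any) in the specification, claims and drawings of the present application are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can, for example, be implemented in an order other than that illustrated or described here. In addition, the terms "include" and "correspond to" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
Multimedia includes text, pictures, photographs, sound, animation and film, together with the interactive functions provided by programs. Multimedia technology is a rapidly developing, comprehensive electronic information technology. It has changed the direction of traditional computer systems and of audio and video equipment, and will have a far-reaching effect on mass media. Multimedia computers will speed up the entry of computers into homes and into every part of society, and will profoundly change how people work, live and entertain themselves. With the rollout of fifth-generation mobile communication technology (5th generation mobile networks, 5G), more and more multimedia content, most of it uploaded by users, is made available to the public, and a large amount of multimedia content of a malicious promotion nature seriously harms the development of the content ecosystem.
"Promotion" means making one's products, services, technologies, culture, achievements and so on known to and accepted by more people and organizations through media advertising, for the purpose of publicity and popularization. In scenarios with massive amounts of content, malicious promotion means promoting contact details or advertising products in video, pictures, speech and text in a way that affects viewers to a greater or lesser extent and thereby damages the content ecosystem. Malicious promotion can be subdivided into categories such as medical and cosmetic medicine, finance, loans and credit cards, stock recommendation, lottery, get-rich-quick schemes, disease treatment, feng shui and fortune-telling, antiques and collectibles, cheats for board and card games, pick-up artist (PUA) content, credit-report and chat-history lookup, and phishing by fake customer service. In recent years, as science and technology have advanced, media have moved from traditional paper toward electronic, diversified and dynamic forms, and video (short video in particular) has attracted growing attention as an increasingly popular carrier. The development of video and other multimedia forms allows large amounts of malicious promotion information to spread quickly and widely, which makes it hard for people to pick out trustworthy information from a complex mass of information, disrupts the normal order of their lives, and can even steer them toward unhealthy and unscientific lifestyles.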
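For this reason, in order to improve the efficiency of multimedia content review, reduce the cost of manual review and improve review accuracy, the present application proposes a multimedia content identification method that focuses on detecting and identifying malicious promotion in multimedia content (including video, pictures, speech and text) and serves as a machine strategy that assists manual review. The method is applied to the multimedia content identification system shown in FIG. 1. As shown in the figure, the system includes a server and terminal devices, and a client is deployed on each terminal device. The server involved in the present application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms. The terminal device may be, but is not limited to, a smartphone, tablet computer, notebook computer, palmtop computer, personal computer, smart TV or smart watch. The terminal device and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in the present application. The numbers of servers and terminal devices are likewise not limited.
As an example based on the multimedia content identification system shown in FIG. 1, a content publisher (content provider, CP) selects local multimedia content on terminal device A and uploads it to the server. The server identifies the multimedia content using the identification strategy provided in the present application and outputs an identification result. A content administrator may choose whether to review the identification result manually. If the identification result indicates that the multimedia content is not malicious promotion, the server publishes it and other users can view it through terminal device B. Conversely, if the identification result indicates that the multimedia content is malicious promotion, the server blocks it and does not publish it online.
Based on the multimedia content identification system shown in FIG. 1, refer to FIG. 2, which is a schematic diagram of an application framework of the system in an embodiment of the present application. As shown in the figure, the system mainly includes three modules: an application service module, a basic service module and an underlying architecture module. The underlying architecture module includes network communication, data security and a database. Network communication supports communication between the terminal device and the server. Data security may use blockchain technology to record the identification results of multimedia content on a chain. The database stores information related to multimedia content, for example, basic information and behavior information of content publishers. In the basic service module, optical character recognition (OCR) is used to recognize text in images or video, automatic speech recognition (ASR) is used to recognize text in audio, and neural networks and processing strategies are used to judge whether multimedia content is malicious promotion. In the application service module, intelligent identification means invoking the neural networks and processing strategies to determine whether multimedia content is malicious promotion and then outputting the identification result so that it can be shown to the content administrator.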
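The multimedia content identification system provided in the present application uses a deep neural network based on semantic transfer, combined with heuristic strategies and dictionary expansion, to form a set of combined strategies that improves the accuracy of identifying malicious titles while preserving recall. The deep neural network is trained on a large-scale training corpus using machine learning (ML). ML is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to keep improving their own performance. ML is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in every area of artificial intelligence. ML and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
ML is an important technology in the field of artificial intelligence (AI). AI is the theory, methods, technology and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best result. In other words, AI is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. AI studies the design principles and implementation methods of intelligent machines so that machines can perceive, reason and make decisions.
AI technology is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating and interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning and deep learning.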
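The multimedia content identification system provided in the present application uses cloud computing to process large amounts of multimedia content in parallel. In the narrow sense, cloud computing refers to a delivery and usage model for Internet technology (IT) infrastructure in which the required resources are obtained over a network in an on-demand, easily scalable way. In the broad sense, cloud computing refers to a delivery and usage model for services in which the required services are obtained over a network in an on-demand, easily scalable way. Such services may relate to IT, software and the Internet, or may be other services. Cloud computing is the product of the development and convergence of traditional computing and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage technologies, virtualization and load balancing. Cloud computing has grown rapidly with the development of the Internet, real-time data streams and the diversification of connected devices, and with demand from search services, social networks, mobile commerce and open collaboration. Unlike earlier parallel and distributed computing, the emergence of cloud computing is expected to drive revolutionary changes in the whole Internet model and in enterprise management models.
Cloud computing is one kind of cloud technology. Cloud technology is a hosting technology that unifies hardware, software, network and other resources within a wide area network or local area network to implement the computation, storage, processing and sharing of data. Cloud technology is a general term for the network, information, integration, management platform and application technologies used in the cloud computing business model. These can form a resource pool that is used on demand and is flexible and convenient. Cloud computing technology will become an important support, because the back-end services of technical network systems, such as video websites, picture websites and portal websites, require large amounts of computing and storage resources. As the Internet industry develops, every item may in the future have its own identifier that has to be transmitted to a back-end system for logical processing, data of different levels will be processed separately, and industry data of all kinds will require strong system support, which can only be provided through cloud computing.
The multimedia content identification system provided in the present application can also be connected to a blockchain system to prevent generated information from being tampered with and to improve the reliability of information sources. Blockchain is a new application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked using cryptographic methods, where each block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer and an application service layer.
The underlying blockchain platform may include processing modules for user management, basic services, smart contracts and operation management. The user management module manages the identity information of all blockchain participants, including maintaining public and private key generation (account management), key management, and the mapping between users' real identities and blockchain addresses (permission management); where authorized, it supervises and audits the transactions of certain real identities and provides rule configuration for risk control (risk-control auditing). The basic service module is deployed on all blockchain node devices to verify the validity of service requests and to record valid requests to storage after consensus is reached. For a new service request, the basic service first performs interface adaptation, parsing and authentication (interface adaptation), then encrypts the service information through the consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger after encryption (network communication), and records and stores it. The smart contract module handles contract registration and issuance, contract triggering and contract execution. Developers can define contract logic in a programming language and publish it to the blockchain (contract registration); execution is triggered by keys or other events according to the logic of the contract terms, and the module also supports upgrading and cancelling contracts. The operation management module is mainly responsible for deployment during product release, configuration changes, contract settings, cloud adaptation, and visual output of real-time status while the product is running, for example alarms, network status management and node device health management.
The platform product service layer provides the basic capabilities and implementation frameworks of typical applications. Developers can build on these basic capabilities and add the characteristics of their business to implement business logic on the blockchain. The application service layer provides application services based on blockchain solutions for business participants to use.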
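With the above introduction in mind, the multimedia content identification method of the present application is described below. Referring to FIG. 3, an embodiment of the multimedia content identification method in the embodiments of the present application includes:
101. Acquire target text information and content information of a video to be recognized, wherein the target text information includes at least one of title text and introduction text, and the content information includes at least one of image data and audio data;
In this embodiment, the multimedia content identification device acquires the target text information and the content information of the video to be recognized. The target text information includes at least one of the title text and the introduction (abstract) text of the video, and the content information includes at least one of image data and audio data. The title text of the video to be recognized is a very important source of information. In practice, maliciously promotional video titles are sparser than maliciously promotional video frames or speech. In samples of the daily overall video stream, the proportion of title text involving malicious promotion is about 4.76 per 10,000 (roughly 5 per 10,000).
It should be noted that the multimedia content identification device may be deployed on a server, on a terminal device, or in a multimedia content identification system composed of terminal devices and a server, which is not limited here.
102. Perform text recognition processing on the content information to obtain associated text information, wherein the associated text information includes at least one of image text and audio text, the image text is obtained by performing text recognition on the image data, and the audio text is obtained by performing text recognition on the audio data;
In this embodiment, the multimedia content identification device performs text recognition processing on the content information to obtain the associated text information. The techniques used to generate the associated text information are described below.
Taking text recognition on image data as an example, computer vision (CV) technology may be used to recognize the corresponding image text. CV is the science of making machines "see". More specifically, it uses cameras and computers in place of human eyes to recognize, track and measure targets, and further performs graphics processing so that the result is an image better suited to human observation or to transmission to instruments for inspection. As a scientific discipline, CV studies the related theories and technologies in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. CV technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content and behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
Taking text recognition on audio data as an example, speech technology may be used to recognize the corresponding audio text. The key speech technologies are ASR, speech synthesis (text to speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future direction of human-computer interaction, and speech is expected to become one of the most promising modes of interaction.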
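103. Take at least one of the target text information and the associated text information that satisfies a first malicious promotion condition as text content, and obtain a target text classification result through a text classification model based on the text content, wherein the target text classification result represents the malicious promotion degree of the text content;
In this embodiment, the multimedia content identification device needs to judge whether the target text information and the associated text information satisfy the first malicious promotion condition. Satisfying the first malicious promotion condition may mean hitting a keyword or template in a matching library. Target text information or associated text information that satisfies the first malicious promotion condition is input into the trained text classification model to obtain the target text classification result. The target text classification result may be a binary result, for example "malicious promotion type" or "not malicious promotion type". Alternatively, it may be a multi-class result, for example "malicious promotion type", "suspected malicious promotion type" or "not malicious promotion type".
Semantic understanding involves natural language processing (NLP), an important direction in computer science and artificial intelligence. NLP studies the theories and methods that enable effective communication between humans and computers in natural language, and combines linguistics, computer science and mathematics. Because research in this field deals with natural language, that is, the language people use every day, it is closely related to linguistics. NLP technology usually includes text processing, semantic understanding, machine translation, question answering and knowledge graphs.
Specifically, for ease of understanding, refer to FIG. 4, which is a schematic diagram of an identification framework based on multiple video information sources in an embodiment of the present application. As shown in the figure, assume that the video to be recognized includes title text, introduction text, image data and audio data. The title text and the introduction text can then be classified separately. It should be noted that the two should be classified consistently, and the available categories include but are not limited to general, medical and cosmetic medicine, finance, loans and credit cards, stock recommendation, lottery, get-rich-quick schemes, disease treatment, feng shui and fortune-telling, antiques and collectibles, cheats for board and card games, PUA, credit-report and chat-history lookup, and phishing by fake customer service. In most of these categories, promotion takes the form of steering viewers toward contact details in order to promote a product or an action. The features that matter differ for each kind of malicious promotion video. A combined algorithm of hierarchical strategies is therefore built from multiple information sources, and the emphasis within each information source differs by type of malicious promotion video. In addition, the content information is classified: image text and audio text are generated using OCR and ASR and are combined with the title text and the introduction text as the basis for pattern matching, which improves recall.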
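104. Determine, according to the target text classification result, a video recognition result corresponding to the video to be recognized, wherein the video recognition result represents the malicious promotion degree of the video to be recognized.
In this embodiment, the multimedia content identification device determines the video recognition result according to the target text classification result. For example, if the target text classification result is "malicious promotion type", the video recognition result is "malicious promotion type" and the malicious promotion degree is the highest. If the target text classification result is "suspected malicious promotion type", the video recognition result is "suspected malicious promotion type" and the malicious promotion degree is medium. If the target text classification result is "not malicious promotion type", the video recognition result is "not malicious promotion type" and the malicious promotion degree is low.
Specifically, for ease of understanding, refer to FIG. 5, which is a schematic diagram of an identification scenario for video-type media content in an embodiment of the present application. As shown in the figure, after video identification, the identification results of different videos can be shown to the content administrator. On the one hand, the video identification platform can directly take down or delete videos of the "malicious promotion type". On the other hand, the content administrator can look further into the details of a video and check the accuracy of the output manually. Malicious promotion videos touch on every part of social life and cover a wide variety of topics. Compared with ordinary videos, they show a fairly clear tendency for the promotion to be concentrated in local segments.
In the embodiments of the present application, a multimedia content identification method is provided. In this way, the degree of malicious promotion of multimedia content is identified from multiple angles. For the media form of video-type multimedia content, video quality is assessed more comprehensively in terms of title, introduction, image and audio, forming a more complete strategy for identifying malicious promotion in videos and thereby improving identification accuracy.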
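Steps 101 to 104 can be summarized as a minimal sketch. It assumes the image text and audio text have already been produced by OCR and ASR (step 102). The callables `meets_first_condition` and `classify` are hypothetical stand-ins for the matching library and the trained text classification model, and taking the most severe label across text sources is an assumption, since the source does not specify how several per-source results are merged.

```python
from typing import Callable, Optional

# Severity ordering of the three labels used in this embodiment.
SEVERITY = {
    "not malicious promotion": 0,
    "suspected malicious promotion": 1,
    "malicious promotion": 2,
}

def recognize_video(
    title: Optional[str],
    intro: Optional[str],
    image_text: Optional[str],
    audio_text: Optional[str],
    meets_first_condition: Callable[[str], bool],
    classify: Callable[[str], str],
) -> str:
    """Return a video recognition result from whichever text sources are present."""
    texts = [t for t in (title, intro, image_text, audio_text) if t]
    # Step 103: only texts that satisfy the first condition reach the classifier.
    candidates = [t for t in texts if meets_first_condition(t)]
    if not candidates:
        return "not malicious promotion"
    labels = [classify(t) for t in candidates]
    # Step 104 (assumed merge rule): report the most severe label.
    return max(labels, key=SEVERITY.__getitem__)
```

Filtering before classification keeps the expensive model off the large majority of videos that never hit the matching library.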
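Optionally, on the basis of the embodiment corresponding to FIG. 3, in another optional embodiment provided by the embodiments of the present application, the content information includes image data;
performing text recognition processing on the content information to obtain the associated text information may specifically include:
splitting the image data included in the video to be recognized into frames to obtain K video frames, where K is an integer greater than or equal to 1;
obtaining L video frames from the K video frames according to a preset frame rate, where L is an integer greater than or equal to 1 and less than K;
performing optical character recognition (OCR) on each of the L video frames to obtain a text recognition result for each video frame, where the text recognition result includes subtitles and the coordinate information corresponding to the subtitles;
for each of the L video frames, deduplicating the subtitles according to the coordinate information corresponding to the subtitles;
using the deduplicated subtitles of each video frame as the image text in the associated text information.
This embodiment describes a way of performing OCR on image data. Besides the title text of a malicious promotion video, the image data of the video itself is an important source of information. For the image data, L video frames can be obtained from the K video frames according to a preset frame rate (for example, one frame per second). OCR is then used to extract the subtitles and the subtitle coordinates from each of the L video frames.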
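Selecting L frames out of K at a preset rate can be sketched as follows. The function name and parameters are hypothetical, and rounding the ratio of the two frame rates to the nearest integer step is an assumption; the source only gives "one frame per second" as an example rate.

```python
from typing import List

def sample_frame_indices(total_frames: int, video_fps: float, sample_fps: float = 1.0) -> List[int]:
    """Return the indices of the frames to pass to OCR.

    total_frames is K; the length of the returned list is L.
    """
    if total_frames <= 0 or video_fps <= 0 or sample_fps <= 0:
        raise ValueError("frame count and frame rates must be positive")
    # Keep every `step`-th frame; never skip fewer than one frame at a time.
    step = max(1, round(video_fps / sample_fps))
    return list(range(0, total_frames, step))
```

For a 25 fps video, sampling at one frame per second keeps every 25th frame, which cuts the OCR workload by roughly that factor.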
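Specifically, the process of deduplicating subtitles based on coordinate information is described below with reference to FIG. 6, which is a schematic diagram of deduplicating subtitles based on coordinate information in an embodiment of the present application. As shown in the figure, take one frame as an example and assume that it is divided into four blocks, indicated by S1, S2, S3 and S4. The lower-left corner can be set as the coordinate origin so that every block can be expressed in horizontal and vertical coordinates. In this case, the subtitle "invest now" is extracted from block S1. If the same subtitle "invest now" is extracted from block S1 in the next frame, it is deduplicated so that the resulting information is complete and clean. Once all L video frames have been deduplicated, the image text is obtained.
It can be seen that the coordinate information of the subtitles in each video frame is very important in the implementation, because it helps locate or cluster content that is likely to belong to a single piece of information.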
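A minimal sketch of the block-based deduplication is shown below. It assumes each OCR detection is a `(text, x, y)` tuple with the origin at the lower-left corner and a 2 x 2 grid as in the FIG. 6 example. The function names and the rule "drop a subtitle if the same block last showed the same text" are illustrative assumptions, not the source's exact algorithm.

```python
from typing import Dict, Iterable, List, Tuple

Detection = Tuple[str, float, float]  # (subtitle text, x, y)

def block_of(x: float, y: float, frame_w: float, frame_h: float, cols: int = 2, rows: int = 2) -> int:
    """Map a coordinate to the index of the grid block that contains it."""
    col = min(int(x * cols / frame_w), cols - 1)
    row = min(int(y * rows / frame_h), rows - 1)
    return row * cols + col

def dedupe_subtitles(frames: Iterable[List[Detection]], frame_w: float, frame_h: float) -> List[str]:
    """Keep a subtitle only when it differs from the last text seen in the same block."""
    last_text: Dict[int, str] = {}
    kept: List[str] = []
    for detections in frames:
        for text, x, y in detections:
            block = block_of(x, y, frame_w, frame_h)
            if last_text.get(block) != text:
                kept.append(text)
            last_text[block] = text
    return kept
```

Keying on the block rather than on exact coordinates tolerates the small jitter that OCR bounding boxes show from frame to frame.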
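Secondly, the embodiments of the present application provide a way of performing OCR on image data. In this way, frames are first sampled from the image data of the video and the recognized image text is then matched against templates. This adds another dimension for identifying malicious promotion videos and helps improve identification accuracy.
Optionally, on the basis of the embodiment corresponding to FIG. 3, in another optional embodiment provided by the embodiments of the present application, the content information includes audio data;
performing text recognition processing on the content information to obtain the associated text information may specifically include:
splitting the audio data included in the video to be recognized into frames to obtain T audio frames, where T is an integer greater than or equal to 1;
performing feature extraction on each of the T audio frames to obtain the audio feature vector corresponding to each audio frame;
determining the audio text in the associated text information based on the audio feature vector corresponding to each audio frame.
Optionally, phoneme information may be output through an acoustic model based on the audio feature vector corresponding to each audio frame, and the audio text in the associated text information may be output through a language model based on the phoneme information.
This embodiment describes a way of performing ASR on audio data. Besides the title text of a malicious promotion video, the audio data of the video itself is an important source of information. The audio data included in the video to be recognized is split into T audio frames. In the encoding stage, features are extracted from each of the T audio frames to obtain the audio feature vector of each frame. In the decoding stage, phoneme information is output through the acoustic model based on the audio feature vector of each frame, and finally the audio text is output through the language model based on the phoneme information.
Specifically, audio data may contain background music or a heavy dialect accent, in which case it is often difficult to recognize the audio text correctly. The present application may therefore choose not to perform ASR on such audio data, or not to perform the subsequent template matching.
Secondly, the embodiments of the present application provide a way of performing ASR on audio data. In this way, the audio data of the video is first recognized and the recognized audio text is then matched against templates. This adds another dimension for identifying malicious promotion videos and helps improve identification accuracy.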
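Optionally, on the basis of the embodiment corresponding to FIG. 3, another optional embodiment provided by the embodiments of the present application may further include:
if the title text matches a template in the matching library and does not match any information in the whitelist, determining that the title text satisfies the first malicious promotion condition and that the target text information satisfies the first malicious promotion condition;
if the introduction text matches a template in the matching library and does not match any information in the whitelist, determining that the introduction text satisfies the first malicious promotion condition and that the target text information satisfies the first malicious promotion condition;
if the image text matches a template in the matching library and does not match any information in the whitelist, determining that the image text satisfies the first malicious promotion condition and that the associated text information satisfies the first malicious promotion condition;
if the audio text matches a template in the matching library and does not match any information in the whitelist, determining that the audio text satisfies the first malicious promotion condition and that the associated text information satisfies the first malicious promotion condition.
This embodiment describes a way of judging the malicious promotion type by combining template matching with a "rejection strategy". As the foregoing embodiments show, satisfying the first malicious promotion condition includes hitting a template or keyword in the matching library, and may additionally include not hitting any information in the whitelist. The reason is that hitting a template or keyword alone may recall multimedia content that is not malicious promotion, so a "rejection strategy" is added (that is, information that appears in the whitelist is refused recognition as malicious promotion).
Specifically, the title text is used as an example; the introduction text, image text and audio text are handled in a similar way and are not described again here. The title text of a video is short text. Based on the characteristics of each category of maliciously promotional video titles, a matching library of keywords and templates is maintained for each category (for example, general, medical and cosmetic medicine, and finance). Analysis of the malicious promotion videos accumulated through review shows that, over a long period (unless there is an obvious change in ideology or in mainstream media forms), related keywords or templates can be summarized for each category of title text. Keywords or templates related to malicious promotion titles are a necessary but not sufficient condition for a video title to be malicious promotion. In other words, a title text hit by the matching library is not necessarily a genuine malicious promotion title, but a genuine malicious promotion title will always be hit by the matching library. A full matching library of keywords and templates therefore has to be maintained, and the strategy module for video titles should emphasize recall. To improve accuracy as far as possible while giving priority to recall, the present application also designs the "rejection strategy".
Among the multimedia content (for example, videos to be recognized) recalled on the basis of title text, some items are publicity about the government affairs of national or local government departments, and some are social news. Such content should not be judged as malicious promotion. For this situation, the "rejection strategy" includes strategy algorithms for government-publicity style and social-news style content. The core of these algorithms is also to judge multiple features in parallel through matching. For example, if the title text of the video to be recognized is "Everyone is responsible for environmental protection" and "environmental protection" hits the whitelist, the title text does not satisfy the first malicious promotion condition, and so the target text information does not satisfy it either. Similarly, for the introduction text, image text and audio text, it is also necessary to determine whether the whitelist is hit in order to decide whether the first malicious promotion condition is satisfied.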
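The combination of matching-library recall and whitelist rejection can be sketched as below. The keyword set, the regular-expression template and the whitelist entries are invented examples, and simple substring and regex matching is an assumption about how "hitting" the library is implemented.

```python
import re

# Illustrative entries only; a production matching library would be maintained per category.
KEYWORDS = {"guaranteed returns", "stock tips"}
TEMPLATES = [re.compile(r"add (my|our) \w+ (id|account)")]
WHITELIST = {"environmental protection", "municipal government"}

def meets_first_condition(text: str) -> bool:
    """True when the text hits the matching library and is not rejected by the whitelist."""
    lowered = text.lower()
    hit = any(k in lowered for k in KEYWORDS) or any(p.search(lowered) for p in TEMPLATES)
    rejected = any(w in lowered for w in WHITELIST)
    return hit and not rejected
```

The matching library is tuned for recall, so the whitelist check is what removes government publicity and news items before they reach the classifier.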
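Many resources collected from government affairs and news can be added to the whitelist, for example the names of the publicity units of relevant government departments and the names of provinces, cities and counties, obtained by crawling and then curated. These resources can be used as whitelist keywords for a coarse recall of a government-publicity training corpus, which makes further manual labeling easier. They can also be used to generate large numbers of samples for data augmentation of the training set in supervised classifier training. For the social-news algorithm, title corpora from negative news can also be used for supervised training.
It should be noted that, because a full matching library of keywords and templates is so important, several strategies (for example, manual participation or automatic adjustment by the device) are set up to keep the matching library diverse and up to date.
Secondly, the embodiments of the present application provide a way of judging the malicious promotion type by combining template matching with a "rejection strategy". Template matching finds videos that may contain malicious promotion, and the rejection strategy filters out videos that fall within the whitelist, which helps improve the accuracy of the recalled videos.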
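Optionally, on the basis of the embodiment corresponding to FIG. 3, in another optional embodiment provided by the embodiments of the present application, obtaining the target text classification result through the text classification model based on at least one of the target text information and the associated text information may specifically include:
if the title text satisfies the first malicious promotion condition, obtaining a title text classification result through the text classification model based on the title text;
if the introduction text satisfies the first malicious promotion condition, obtaining an introduction text classification result through the text classification model based on the introduction text;
if the image text satisfies the first malicious promotion condition, obtaining an image text classification result through the text classification model based on the image text;
if the audio text satisfies the first malicious promotion condition, obtaining an audio text classification result through the text classification model based on the audio text.
This embodiment describes a prediction approach that adds a text classification model. As the foregoing embodiments show, once a text is determined to satisfy the first malicious promotion condition, a text classification model (that is, a classifier) can be attached after matching-based recall in order to improve the identification accuracy for malicious promotion videos. The aim is to pick out, as accurately as possible, the malicious promotion title text, introduction text, image text and audio text from the large number of recalled videos.
Specifically, for ease of understanding, refer to FIG. 7, which is a schematic diagram of outputting a text classification result based on a single classifier in an embodiment of the present application. As shown in the figure, any text (for example, title text, introduction text, image text or audio text) is input into the text classification model, and the model outputs a probability distribution.
As an example with a binary output, the title text is input into the text classification model and the model outputs the probability distribution (0.1, 0.9), where 0.1 is the probability of "malicious promotion type" and 0.9 is the probability of "not malicious promotion type". The title text classification result is therefore "not malicious promotion type".
As an example with a multi-class output, the title text is input into the text classification model and the model outputs the probability distribution (0.1, 0.6, 0.3), where 0.1 is the probability of "malicious promotion type", 0.6 is the probability of "suspected malicious promotion type" and 0.3 is the probability of "not malicious promotion type". The title text classification result is therefore "suspected malicious promotion type".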
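Turning the model's probability distribution into a label amounts to taking the most probable class, as in this minimal sketch. The label tuples follow the order used in the two examples above; the function name is hypothetical.

```python
from typing import Sequence

BINARY_LABELS = ("malicious promotion", "not malicious promotion")
TERNARY_LABELS = ("malicious promotion", "suspected malicious promotion", "not malicious promotion")

def label_from_distribution(probs: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label whose probability is highest."""
    if len(probs) != len(labels):
        raise ValueError("one probability is required per label")
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]

# The two distributions from the examples above.
binary_result = label_from_distribution((0.1, 0.9), BINARY_LABELS)
ternary_result = label_from_distribution((0.1, 0.6, 0.3), TERNARY_LABELS)
```

With these inputs the binary case resolves to "not malicious promotion" and the three-class case to "suspected malicious promotion", matching the worked examples.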
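It should be noted that the introduction text classification result, the image text classification result and the audio text classification result are obtained in a way similar to the title text classification result, so the details are not repeated here.
The text classification model is obtained by supervised training on a certain number of manually labeled samples. For the choice of model, a feature extraction step followed by a feature classifier may be used, or deep-learning-based text classification, which has become popular in industry in recent years, may be chosen. Deep-learning-based text classification can in general be divided into models based on recurrent neural networks (RNN), models based on convolutional neural networks (CNN), classification models based on the Transformer encoder, and hybrids of these. There are also variants such as deep hybrid models built on these three families, attention-based deep models and memory-network-based deep models. The deep models used for text classification are essentially Transformer-encoder-based models with two-stage pre-training, with some simple domain adaptation on top, including but not limited to a second round of pre-training on an already pre-trained model using a large-scale unsupervised in-domain corpus, model compression through distillation, and fusion of multiple model results through stacking and similar methods.
Three text classification models that can be used for semantic analysis are described below.
1. Text classification model based on term frequency-inverse document frequency (tf-idf);
When the text classification model performs semantic analysis on a text, the text is first preprocessed. The text (for example, title text, introduction text, image text or audio text) is treated as a document, and a document representation commonly used in information retrieval and natural language processing is applied, such as the bag-of-words model. The document is represented by an unweighted one-hot representation or by a weighted tf-idf representation. In a bag-of-words setting, the order in which terms appear in the text unit is usually ignored. The one-hot representation is a 0/1 representation of terms that ignores term weights, whereas the tf-idf representation characterizes each term by its computed tf-idf score. For a text unit, term frequency (tf) is the number of times a term appears in the document. Some terms (for example, particles) have very little discriminating power when relevance is computed. For example, most text units may contain the word "follow", and "follow" does not reflect the semantic differences between text units well. The concept of inverse document frequency (idf) is therefore introduced, which represents the reciprocal of the frequency with which a term appears across all documents.
High-frequency words that do not discriminate between text units have low idf scores. Combining tf and idf gives a good indication of how important a term is in a text. If a term does not appear in the current text unit, the component for that term is 0. After a text unit has been converted into an unweighted or weighted bag-of-words representation, a vector space model can be used to model the relationships between text units, for example by computing similarity or semantic relevance. The most typical method is to compute the cosine similarity between text unit vectors. The higher the similarity between two text units, the closer their topics.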
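The tf-idf weighting and cosine similarity described above can be illustrated with a small standard-library sketch. It assumes the documents are already tokenized and uses the common log-scaled form idf = log(N / df), which is one conventional variant; the source only describes idf as an inverse of document frequency. Function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse tf-idf vector (term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either vector is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A term such as "follow" that occurs in every document receives weight log(1) = 0, so it contributes nothing to the similarity, which is the behavior the paragraph above describes.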
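2. Text classification model based on deep learning;
Some promotion samples are labeled manually, and classification is carried out in a supervised manner following the usual NLP text classification workflow. Classification models fall roughly into two groups. The first is the traditional approach: features are extracted first (feature engineering) and a classifier is attached on top of the extracted features, such as logistic regression, linear regression, support vector machine (SVM), adaptive boosting (AdaBoost) or extreme gradient boosting (XGBoost). The second is short-text classification models based on deep learning, mainly RNN, CNN or their variants. A convolutional layer is essentially a feature extraction layer, and a hyperparameter specifies how many convolution kernels it has. Because a convolution kernel covers a sliding window, it extracts n-gram fragment features, and the size of n determines how far apart the captured features can be. Because there is no dependency on relative position between sliding windows, a CNN is highly parallel, which is one of its advantages.
3. Text classification model based on the Transformer encoder;
The Transformer encoder is in effect a high-performing feature extractor that can be computed in parallel and is built by stacking self-attention layers. Self-attention allows each word to relate to any other word and integrates the result into an embedding vector. Its advantage over a CNN is that the distance over which local features are extracted is unrestricted and is not fixed by a convolution kernel. A large number of experiments have also shown that the Transformer encoder outperforms RNN and CNN at extracting semantic features. For capturing long-distance features, the Transformer encoder is slightly better than RNN, and RNN is better than CNN. For parallel computing efficiency, the Transformer encoder is slightly better than CNN, and CNN is better than RNN.
Thirdly, the embodiments of the present application provide a prediction approach that adds a text classification model. Because the text classification model is a single classifier, the target text classification result (including the title text, introduction text, image text and audio text classification results) can be predicted directly, and multiple prediction branches can be processed in parallel, that is, multiple pieces of multimedia content can be classified at the same time, which improves the efficiency of classification and prediction.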
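Optionally, on the basis of the embodiment corresponding to FIG. 3, in another optional embodiment provided by the embodiments of the present application, obtaining the target text classification result through the text classification model based on the text content may specifically include:
if the title text satisfies the first malicious promotion condition, obtaining N title text classification sub-results through the N sub-classification models included in the text classification model based on the title text, and determining the title text classification result according to the N title text classification sub-results, where each title text classification sub-result corresponds to one malicious promotion type;
if the introduction text satisfies the first malicious promotion condition, obtaining N introduction text classification sub-results through the N sub-classification models included in the text classification model based on the introduction text, and determining the introduction text classification result according to the N introduction text classification sub-results, where each introduction text classification sub-result corresponds to one malicious promotion type;
if the image text satisfies the first malicious promotion condition, obtaining N image text classification sub-results through the N sub-classification models included in the text classification model based on the image text, and determining the image text classification result according to the N image text classification sub-results, where each image text classification sub-result corresponds to one malicious promotion type;
if the audio text satisfies the first malicious promotion condition, obtaining N audio text classification sub-results through the N sub-classification models included in the text classification model based on the audio text, and determining the audio text classification result according to the N audio text classification sub-results, where each audio text classification sub-result corresponds to one malicious promotion type.
This embodiment describes another prediction approach that adds a text classification model. As the foregoing embodiments show, once a text is determined to satisfy the first malicious promotion condition, multiple sub-classification models (which together make up the text classification model) can be attached after matching-based recall in order to improve the identification accuracy for malicious promotion videos. The aim is to pick out, as accurately as possible, the malicious promotion title text, introduction text, image text and audio text from the large number of recalled videos.
Specifically, for ease of understanding, refer to FIG. 8, which is a schematic diagram of outputting a text classification result based on multiple classifiers in an embodiment of the present application. As shown in the figure, assume that the text classification model includes 3 sub-classification models (that is, N equals 3), each corresponding to one category. For example, sub-classification model 1 is the classifier for "stock recommendation", sub-classification model 2 is the classifier for "disease treatment", and sub-classification model 3 is the classifier for "loans and credit cards". Any text (for example, title text, introduction text, image text or audio text) is input into the different sub-classification models, and each sub-classification model outputs its own probability distribution.
As an example with binary outputs, the title text is input into sub-classification model 1, which outputs the probability distribution (0.1, 0.9), where 0.1 is the probability of "stock recommendation" and 0.9 is the probability of "not stock recommendation". The title text is input into sub-classification model 2, which outputs (0.9, 0.1), where 0.9 is the probability of "disease treatment" and 0.1 is the probability of "not disease treatment". The title text is input into sub-classification model 3, which outputs (0.4, 0.6), where 0.4 is the probability of "loans and credit cards" and 0.6 is the probability of "not loans and credit cards". The probability of "disease treatment" is the highest, so the title text classification result is "disease treatment malicious promotion".
As an example with multi-class outputs, the title text is input into sub-classification model 1, which outputs (0.1, 0.6, 0.3), where 0.1 is the probability of "stock recommendation", 0.6 is the probability of "suspected stock recommendation" and 0.3 is the probability of "not stock recommendation". The title text is input into sub-classification model 2, which outputs (0.8, 0.1, 0.1), where 0.8 is the probability of "disease treatment", 0.1 is the probability of "suspected disease treatment" and 0.1 is the probability of "not disease treatment". The title text is input into sub-classification model 3, which outputs (0.4, 0.4, 0.2), where 0.4 is the probability of "loans and credit cards", 0.4 is the probability of "suspected loans and credit cards" and 0.2 is the probability of "not loans and credit cards". The probability of "disease treatment" is the highest, so the title text classification result is "disease treatment malicious promotion".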
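Merging the N sub-results can be sketched as picking the category whose "belongs" probability is highest, which is the rule the two examples above apply. The function name and the dictionary layout (category mapped to a distribution whose first entry is the "belongs" probability) are assumptions for illustration.

```python
from typing import Dict, Sequence

def fuse_sub_results(sub_results: Dict[str, Sequence[float]]) -> str:
    """Return the category whose sub-classifier gives the highest 'belongs' probability.

    Each value is a probability distribution whose first entry is P(belongs to category).
    """
    if not sub_results:
        raise ValueError("at least one sub-classifier result is required")
    return max(sub_results, key=lambda category: sub_results[category][0])

# The binary and three-class distributions from the examples above.
binary_outputs = {
    "stock recommendation": (0.1, 0.9),
    "disease treatment": (0.9, 0.1),
    "loans and credit cards": (0.4, 0.6),
}
ternary_outputs = {
    "stock recommendation": (0.1, 0.6, 0.3),
    "disease treatment": (0.8, 0.1, 0.1),
    "loans and credit cards": (0.4, 0.4, 0.2),
}
```

Both dictionaries resolve to "disease treatment". The sketch does not handle the case where every sub-classifier reports a low "belongs" probability, which the source leaves unspecified.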
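It should be noted that the introduction text classification result, the image text classification result and the audio text classification result are obtained in a way similar to the title text classification result, so the details are not repeated here.
Thirdly, the embodiments of the present application provide another prediction approach that adds a text classification model. Because the text classification model includes multiple classifiers, multiple text classification sub-results (including title text, introduction text, image text and audio text classification sub-results) can be predicted, and the target text classification result is finally determined from these sub-results. Multimedia content can thus be classified at a finer granularity and subsequently identified per category, which helps improve identification accuracy.
Optionally, on the basis of the embodiment corresponding to FIG. 3, another optional embodiment provided by the embodiments of the present application may further include:
if the content information includes image data, obtaining an image classification result through an image classification model based on the image data, where the image classification result represents the malicious promotion degree of the image data;
determining the video recognition result corresponding to the video to be recognized according to the target text classification result may specifically include:
determining the video recognition result corresponding to the video to be recognized according to the target text classification result and the image classification result.
This embodiment describes a way of using CV technology to assist in judging malicious promotion. At the video content level, building on accumulated capability in identifying low-quality video and combining video content features, the judgment of whether a video is malicious promotion is divided into "white judgment" strategy algorithms and "black judgment" strategy algorithms.
Specifically, after the image data is split into frames, the video frames can be input into the image classification model, which is used to identify regions of interest (ROI) in the video frames, for example QR codes, trend charts or station logos. It should be noted that the image classification model includes but is not limited to the You Only Look Once (YOLO) model and the Single Shot MultiBox Detector (SSD), which is not limited here.
On this basis, the video recognition result corresponding to the video to be recognized is determined according to the target text classification result and the image classification result. For example, if the target text classification result is "malicious promotion type" and the image classification result is "malicious promotion type", the video recognition result is "malicious promotion type". If the target text classification result is "malicious promotion type" and the image classification result is "not malicious promotion type", the video recognition result is "suspected malicious promotion type". If the target text classification result is "not malicious promotion type" and the image classification result is "malicious promotion type", the video recognition result is "suspected malicious promotion type". If the target text classification result is "not malicious promotion type" and the image classification result is "not malicious promotion type", the video recognition result is "not malicious promotion type".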
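The four combinations above reduce to counting how many of the two classifiers report malicious promotion. A minimal sketch, with a hypothetical function name and boolean inputs standing in for the two classification results:

```python
RESULTS = (
    "not malicious promotion",        # neither classifier flags the video
    "suspected malicious promotion",  # exactly one classifier flags it
    "malicious promotion",            # both classifiers flag it
)

def combine_text_and_image(text_is_malicious: bool, image_is_malicious: bool) -> str:
    """Combine the target text classification result with the image classification result."""
    votes = int(text_is_malicious) + int(image_is_malicious)
    return RESULTS[votes]
```

Disagreement between the two sources is reported as "suspected" rather than resolved in favor of either source, which leaves the final call to later strategies or to manual review.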
其次,本申请实施例中,提供了一种利用CV技术辅助判断恶意推广的方式,通过上述方式,对视频中的图像数据进行识别,得到图像分类结果,再结合文本识别后得到的目标文本分类结果,共同对待识别视频的恶意推广情况进行判别,由此,将各个源信息维度中的策略组合起来,达到相辅相成的效果。
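Optionally, on the basis of the embodiment corresponding to FIG. 3, in another optional embodiment provided by the embodiments of this application, determining the video recognition result corresponding to the to-be-recognized video according to the target text classification result and the image classification result may specifically include: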
若图像分类结果表示待识别视频包括信息码,且目标文本分类结果满足第二恶意推广条件,则确定待识别视频所对应的视频识别结果为恶意推广视频;
若图像分类结果表示待识别视频包括走势图,且目标文本分类结果满足第二恶意推广条件,则确定待识别视频所对应的视频识别结果为恶意推广视频;
若图像分类结果表示待识别视频属于预设视频类型,则确定待识别视频所对应的视频识别结果为非恶意推广视频。
本实施例中,介绍了一种利用CV技术实现辅助“判黑”以及直接“判白”的方式。“判黑”作为判断视频为恶意推广视频的一个辅助特征,并不是直接判定为恶意推广。“判黑”策略算法包括信息码(例如,二维码和条形码)识别以及推荐股走势图等。“判白”策略算法包括政务宣传视频以及新闻宣传视频等。
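Specifically, if a "judge-black" strategy algorithm determines that the to-be-recognized video includes an information code, and the target text classification result satisfies the second malicious promotion condition, the to-be-recognized video is directly determined to be a malicious promotion video, that is, the video recognition result is "belongs to the malicious promotion type". If a "judge-black" strategy algorithm determines that the to-be-recognized video includes a trend chart, and the target text classification result satisfies the second malicious promotion condition, the to-be-recognized video is likewise directly determined to be a malicious promotion video, that is, the video recognition result is "belongs to the malicious promotion type". If a "judge-white" strategy algorithm determines that the to-be-recognized video belongs to a preset video type (for example, a government publicity video or a news publicity video), the to-be-recognized video is directly determined to be a non-malicious promotion video, that is, the video recognition result is "does not belong to the malicious promotion type".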
需要说明的是,满足第二恶意推广条件情况包含但不仅限于,目标文本分类结果为“属于恶意推广类型”,或者,目标文本分类结果为“疑似恶意推广类型”,或者,将恶意推广类型划分为五个风险等级,目标文本分类结果为“等级三”以上。此处不做限定。
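The three rules above can be written as a short rule function. The sketch below is illustrative only: the names `ImageFindings`, `second_condition`, and `judge`, the label strings, and the choice to evaluate the "judge-white" rule before the "judge-black" rules are all assumptions, since the text does not fix an evaluation order. The second malicious promotion condition is modelled here as "the text label is malicious or suspected", which is one of the options the text lists.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifiers for the preset (whitelisted) video types.
WHITELISTED_TYPES = {"government_publicity", "news_publicity"}


@dataclass
class ImageFindings:
    has_info_code: bool = False      # QR code or barcode detected
    has_trend_chart: bool = False    # stock trend chart detected
    video_type: Optional[str] = None  # preset type, if the image model reports one


def second_condition(text_label: str) -> bool:
    """One possible reading of the second malicious promotion condition."""
    return text_label in {"malicious", "suspected"}


def judge(findings: ImageFindings, text_label: str) -> Optional[str]:
    """Apply the judge-white rule first (an assumption), then the judge-black rules."""
    if findings.video_type in WHITELISTED_TYPES:
        return "not malicious"
    if (findings.has_info_code or findings.has_trend_chart) and second_condition(text_label):
        return "malicious"
    return None  # no rule fired; the decision is left to the other signals
```

Returning `None` when no rule fires keeps the "judge-black" hit as an auxiliary feature rather than a verdict on its own.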
为了便于理解,请参阅图9,图9为本申请实施例中基于视频多信息源维度的另一个识别框架示意图,如图所示,在一个“判黑”策略算法中,通过待识别视频中的抽帧图,利用CV算法判断是否有超过2秒的信息码,进而进行联系方式推广或者产品推广。在另一个“判黑”策略算法中,判断待识别视频是否为荐股走势视频,比较明显的特征是视频抽帧图常常包含较长时间的股票走势图,进行讲课式地推荐股票或者走势讲解。“判白”策略算法包括政务宣传视频以及新闻宣传视频等。政务宣传视频策略是判断一个视频是否为国家或者地方政府的政务宣传视频,此类视频带有一定性质的推广属性,但并不属于恶意推广视频范畴。新闻宣传视频也类似,是判断一个视频是否为长段的新闻宣传类视频,这种视频的新闻宣传片段往往会被前置模板匹配环节误匹配,也是需要进行判白处理的。
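The "information code visible for more than 2 seconds" check can be approximated from per-frame detections on the sampled frames. The sketch below assumes a detector has already produced one boolean per sampled frame and that frames are sampled at a constant rate; the function names and the `sample_fps` parameter are hypothetical.

```python
from typing import Sequence


def longest_detection_seconds(frame_hits: Sequence[bool], sample_fps: float) -> float:
    """Length, in seconds, of the longest run of consecutive sampled frames with a detection."""
    if sample_fps <= 0:
        raise ValueError("sample_fps must be positive")
    longest = current = 0
    for hit in frame_hits:
        current = current + 1 if hit else 0  # extend or reset the current run
        longest = max(longest, current)
    return longest / sample_fps


def shows_code_too_long(frame_hits: Sequence[bool], sample_fps: float,
                        min_seconds: float = 2.0) -> bool:
    """True if an information code stays on screen for more than min_seconds."""
    return longest_detection_seconds(frame_hits, sample_fps) > min_seconds
```

Counting runs of sampled frames only approximates the true on-screen duration; the error is bounded by the sampling interval.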
这些策略算法基本都会用到监督训练的分类手段。与“判黑”策略的辅助特征不同,“判白”策略为直接判白策略,如果命中“判白”策略算法中的一种,则自动将视频分为非恶意推广视频。
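Furthermore, this embodiment of the application provides a way of using CV technology to assist "judge-black" decisions and to make direct "judge-white" decisions. In this way, at the video content level, building on accumulated capability in recognizing low-quality video and combining video content features, "judge-white" strategy algorithms and "judge-black" strategy algorithms are introduced. If any one of the "judge-white" strategy algorithms is hit, the to-be-recognized video is automatically classified as a non-malicious promotion video. If a "judge-black" strategy algorithm is hit, the hit serves as an auxiliary feature of a malicious promotion video, but does not by itself determine that the video is malicious promotion.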
可选地,在上述图3对应的实施例的基础上,本申请实施例提供的另一个可选实施例中,还可以包括:
获取针对于待识别视频的原创文本信息,其中,原创文本信息包括评论信息以及弹幕信息中的至少一种;
若原创文本信息满足第一恶意推广条件,则基于原创文本信息,通过文本分类模型获取原创文本分类结果,其中,原创文本分类结果表示原创文本信息的恶意推广程度;
根据目标文本分类结果确定待识别视频所对应的视频识别结果,具体可以包括:
根据目标文本分类结果以及原创文本分类结果,确定待识别视频所对应的视频识别结果;或者,
根据目标文本分类结果、原创文本分类结果以及图像分类结果,确定待识别视频所对应的视频识别结果,其中,图像分类结果为基于图像数据,通过图像分类模型获取的,其中,图像分类结果表示图像数据的恶意推广程度。
本实施例中,介绍了一种结合评论信息以及弹幕信息补充识别的方式。通过纳入视频一些相关的特征信息来辅助判别恶意推广的视频,其中,一个比较有针对性的非视频内容特征为用户的用户原创内容(User Generated Content,UGC)。
具体地,为了便于理解,请参阅图10,图10为本申请实施例中基于非内容层面特征补充的一个识别框架示意图,如图所示,UGC即为原创文本信息,原创文本信息包括评论信息以及弹幕信息中的至少一种。而原创文本信息不一定每个待识别视频都有,但对于具有原创文本信息情况而言,可对这些内容进行恶意推广的检测。例如,原创文本信息满足第一恶意推广条件,则通过文本分类模型获取原创文本分类结果。
示例性地,可根据目标文本分类结果以及原创文本分类结果,确定待识别视频所对应的视频识别结果。例如,目标文本分类结果为“属于恶意推广类型”,原创文本分类结果为“属于恶意推广类型”,那么待识别视频所对应的视频识别结果为“属于恶意推广类型”。又例如,目标文本分类结果为“属于恶意推广类型”,原创文本分类结果为“不属于恶意推广类型”,那么待识别视频所对应的视频识别结果为“疑似恶意推广类型”。
示例性地,可根据目标文本分类结果、原创文本分类结果以及图像分类结果,确定待识别视频所对应的视频识别结果。例如,目标文本分类结果为“属于恶意推广类型”,原创文本分类结果为“属于恶意推广类型”,图像分类结果为“属于恶意推广类型”,那么待识别视频所对应的视频识别结果为“属于恶意推广类型”。又例如,目标文本分类结果为“属于恶意推广类型”,原创文本分类结果为“不属于恶意推广类型”,图像分类结果为“属于恶意推广类型”,那么待识别视频所对应的视频识别结果为“疑似恶意推广类型”。
其次,本申请实施例中,提供了一种结合评论信息以及弹幕信息补充识别的方式,通过上述方式,可以通过额外评论信息对视频恶意推广进行非内容层面的特征补充,对这些内容进行恶意推广检测,超过一定阈值的即可推送至人工详细查证,优化识别流程,实现对多源信息层级策略组合算法的正向闭环补充。
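Optionally, on the basis of the embodiment corresponding to FIG. 3, in another optional embodiment provided by the embodiments of this application, the method may further include: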
在当前周期内获取针对于待识别视频的发布者信息,其中,发布者信息包括内容发布者在当前周期内的基本信息以及行为信息;
根据发布者信息确定内容发布者的身份置信度;
根据目标文本分类结果确定待识别视频所对应的视频识别结果,具体可以包括:
根据目标文本分类结果以及内容发布者的身份置信度,确定待识别视频所对应的视频识别结果。
本实施例中,介绍了一种动态识别内容发布者恶意推广倾向标签的方式。每个恶意推广视频都会对应一个内容发布者,因此,建模内容发布者恶意推广相关的标签对于视频恶意推广的判别也具有重要意义。大量发布恶意推广视频的内容发布者,往往带有网络水军的性质。例如,某个内容发布者在一段很集中的时间内发布了几十条恶意推广视频来推广相同或相似产品,那么可以认定该内容发布者为某公司或某产品的恶意推广水军。若某个内容发布者在很长时间范围内一直在发布恶意推广视频,涉及的产品或内容五花八门,那么可以认定该内容发布者为恶意推广中介水军,以接活赚钱为生。
具体地,为了便于理解,请参阅图11,图11为本申请实施例中基于非内容层面特征补充的一个识别框架示意图,如图所示,利用内容发布者的基本信息,包括关注者数,有无签名、签名内容、注册时间、认证类型、历史发布视频数以及历史转发视频数等特征,再结合用户发布或转发的历史恶意推广视频数的行为信息,共同判断内容发布者的身份置信度,即建立内容发布者的基本恶意推广倾向标签。内容发布者的身份置信度用于辅助判别内容发布者所发布或转发的视频是否应该判定为恶意推广,增加多源信息层级策略组合算法判定结果的置信度。其中,发布恶意推广视频的危害等级高于转发恶意推广视频的危害等级。
需要说明的是,内容发布者的恶意推广倾向标签是随着内容发布者自身的行为在不断动态变化的。因此,本申请中针对于待识别视频的发布者信息是基于当前周期的。通常情况下,内容发布者如果在越短的时间段内发布越多被判定为恶意推广的视频,则该内容发布者的恶意推广倾向度增加越快。而如果历史发布或转发过恶意推广视频,但是该内容发布者在近期没有发布过恶意推广视频,那么该内容发布者的恶意推广倾向度是在随着时间推移不断降低的。
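The dynamic behaviour described above (the tendency rises quickly when many malicious videos are published in a short period, falls over time when there is no new activity, and publishing counts more than forwarding) can be illustrated with a time-decayed score. The application does not specify a formula; the exponential half-life decay, the weights, and all names below are assumptions chosen only to show the idea.

```python
import math
from typing import Iterable, Tuple

# Hypothetical parameters; the text only states that publishing is more harmful than forwarding.
PUBLISH_WEIGHT = 1.0
FORWARD_WEIGHT = 0.5
HALF_LIFE_DAYS = 30.0


def tendency_score(events: Iterable[Tuple[float, str]], now_day: float) -> float:
    """Sum decayed weights of past malicious-promotion events.

    events: (day, kind) pairs, where kind is 'publish' or 'forward' and day is
    the day on which a video later judged as malicious promotion was posted.
    """
    score = 0.0
    for day, kind in events:
        age = now_day - day
        if age < 0:
            continue  # ignore events dated after the evaluation time
        weight = PUBLISH_WEIGHT if kind == "publish" else FORWARD_WEIGHT
        score += weight * math.pow(0.5, age / HALF_LIFE_DAYS)  # halves every HALF_LIFE_DAYS
    return score
```

With these settings, ten malicious videos published within the last few days score far higher than ten published a year ago, which is the qualitative behaviour the paragraph describes.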
其次,本申请实施例中,提供了一种动态识别内容发布者恶意推广倾向标签的方式,通过上述方式,根据每个周期内获取到的发布者信息,确定内容发布者的身份置信度。一方面能够得到内容发布者更准确的恶意推广倾向标签,有利于辅助判定视频的恶意程度。另一方面,结合用户发布或转发的历史恶意推广视频数的行为信息,建立内容发布者的恶意推广倾向标签,辅助判别内容发布者所发布或转发的视频是否应该判定为恶意推广,增加多源信息层级策略组合算法判定结果的置信度。
可选地,在上述图3对应的实施例的基础上,本申请实施例提供的另一个可选实施例中,根据目标文本分类结果以及内容发布者的身份置信度,确定待识别视频所对应的视频识别结果,具体可以包括:
根据目标文本分类结果、原创文本分类结果以及内容发布者的身份置信度,确定待识别视频所对应的视频识别结果,其中,原创文本分类结果为基于原创文本信息,通过文本分类模型获取的,原创文本分类结果表示原创文本信息的恶意推广程度,原创文本信息包括评论信息以及弹幕信息中的至少一种;或者,
根据目标文本分类结果、图像分类结果以及内容发布者的身份置信度,确定待识别视频所对应的视频识别结果,其中,图像分类结果为基于图像数据,通过图像分类模型获取的,图像分类结果表示图像数据的恶意推广程度;或者,
根据目标文本分类结果、原创文本分类结果、图像分类结果以及内容发布者的身份置信度,确定待识别视频所对应的视频识别结果。
本实施例中,介绍了一种基于非内容层面特征补充进行视频识别的方式。在辅助判断中,采用多个分类结果进行联合判定,增加多源信息层级策略组合算法判定结果的置信度。
示例性地,可根据目标文本分类结果、原创文本分类结果以及内容发布者的身份置信度,确定待识别视频所对应的视频识别结果。例如,目标文本分类结果为“属于恶意推广类型”,原创文本分类结果为“属于恶意推广类型”,内容发布者的身份置信度为“属于恶意推广类型”,那么待识别视频所对应的视频识别结果为“属于恶意推广类型”。又例如,目标文本分类结果为“属于恶意推广类型”,原创文本分类结果为“不属于恶意推广类型”,内容发布者的身份置信度为“不属于恶意推广类型”,那么待识别视频所对应的视频识别结果为“疑似恶意推广类型”。
示例性地,可根据目标文本分类结果、图像分类结果以及内容发布者的身份置信度,确定待识别视频所对应的视频识别结果。例如,目标文本分类结果为“属于恶意推广类型”,图像分类结果为“属于恶意推广类型”,内容发布者的身份置信度为“属于恶意推广类型”,那么待识别视频所对应的视频识别结果为“属于恶意推广类型”。又例如,目标文本分类结果为“属于恶意推广类型”,图像分类结果为“不属于恶意推广类型”,内容发布者的身份置信度为“不属于恶意推广类型”,那么待识别视频所对应的视频识别结果为“疑似恶意推广类型”。
示例性地,可根据目标文本分类结果、原创文本分类结果、图像分类结果以及内容发布者的身份置信度,确定待识别视频所对应的视频识别结果。例如,目标文本分类结果为“属于恶意推广类型”,原创文本分类结果为“属于恶意推广类型”,图像分类结果为“属于恶意推广类型”,内容发布者的身份置信度为“属于恶意推广类型”,那么待识别视频所对应的视频识别结果为“属于恶意推广类型”。
需要说明的是,上述例子仅为一个示意,在实际应用中,还可以采用更细的分类,例如,“一级恶意推广类型”表示最有可能是恶意推广的类型,“二级恶意推广类型”表示仅次于“一级恶意推广类型”的有可能是恶意推广的类型,以此类推。
再次,本申请实施例中,提供了一种基于非内容层面特征补充进行视频识别的方式,通过上述方式,可以通过额外对象行为对视频恶意推广进行非内容层面的特征补充,对这些内容进行恶意推广检测,超过一定阈值的即可推送至人工详细查证,优化识别流程,实现对多源信息层级策略组合算法的正向闭环补充。
可选地,在上述图3对应的实施例的基础上,本申请实施例提供的另一个可选实施例中,根据目标文本分类结果确定待识别视频所对应的视频识别结果,具体可以包括:
若目标文本分类结果大于或等于目标文本分类阈值,则确定待识别视频所对应的视频识别结果为恶意推广类型;
若目标文本分类结果小于目标文本分类阈值,则确定待识别视频所对应的视频识别结果为非恶意推广类型;
若目标文本分类结果大于或等于目标文本分类阈值,且原创文本分类结果大于或等于原创文本分类阈值,则确定待识别视频所对应的视频识别结果为恶意推广类型;
若目标文本分类结果大于或等于目标文本分类阈值,且原创文本分类结果大于或等于原创文本分类阈值,且图像分类结果大于或等于图像分类阈值,则确定待识别视频所对应的视频识别结果为恶意推广类型;
若目标文本分类结果大于或等于目标文本分类阈值,且原创文本分类结果大于或等于原创文本分类阈值,且图像分类结果大于或等于图像分类阈值,且内容发布者的身份置信度大于或等于身份置信度阈值,则确定待识别视频所对应的视频识别结果为恶意推广类型;
根据目标文本分类结果确定待识别视频所对应的视频识别结果之后,还可以包括:
获取待识别视频所对应的视频标注结果;
若视频识别结果与视频标注结果匹配不一致,则对目标文本分类阈值、原创文本分类阈值、图像分类阈值以及身份置信度阈值中的至少一项进行调整。
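This embodiment describes a way of combining crowd intelligence to optimize multimedia recognition. This application further proposes a design that combines machine intelligence with crowd intelligence. It mainly comprises two stages of work: the early stage of video malicious promotion recognition and the late stage of video malicious promotion recognition.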
具体地,为了便于理解,请参阅图12,图12为本申请实施例中针对于视频的一个整体识别框架示意图,如图所示,视频恶意推广识别前期指一个视频刚刚发布或发布不久,通过多源信息维度的算法架构以及非内容层面的特征补充,判定视频为疑似恶意推广的情况,那么在最终打标输出鉴定结论之前会经过人工运营鉴定。这是因为当前计算机设备无法保证100%准确率。为了整个视频恶意推广识别打击的生态建设,当前整个技术架构应该偏重召回,在优先保证召回的情况下,逐步提升准确率。
在视频恶意推广后期,对于经过群体智慧判定为恶意推广的,需要根据不同的情况将其归纳入“判黑”策略算法或“判白”策略算法,不断补充库内信息形成良性循环。对于人工判定为视频恶意推广但是机器智能各环节综合判定为非视频恶意推广的失败案例,还可以进行反溯查询原因,并修正调整部分阈值,并补充进入监督训练数据。同时对于视频发布者或视频转发者进行恶意推广倾向反馈,更新其恶意推广倾向度分值,并将恶意推广倾向度分值高于一定阈值的内容发布者加入可疑内容发布者库,对可疑内容发布者库内的内容发布者发布或转发的视频进行更加严格的判定。
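For example, the to-be-recognized video is recognized based only on the target text classification result. If the target text classification result is greater than or equal to the target text classification threshold, the video recognition result corresponding to the to-be-recognized video is determined to be the malicious promotion type. Otherwise, if the target text classification result is less than the target text classification threshold, the video recognition result corresponding to the to-be-recognized video is determined to be the non-malicious promotion type.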
假设被机器判定为恶意推广类型的待识别视频,经过人工审核之后判定为非恶意推广类型,此时,可以对目标文本分类阈值进行调整,例如,增大目标文本分类阈值,或,减小目标文本分类阈值。
示例性地,基于目标文本分类结果和原创文本分类结果共同对待识别视频进行识别。如果目标文本分类结果大于或等于目标文本分类阈值,且原创文本分类结果大于或等于原创文本分类阈值,则确定待识别视频所对应的视频识别结果为恶意推广类型。
假设被机器判定为恶意推广类型的待识别视频,经过人工审核之后判定为非恶意推广类型,此时,可以对目标文本分类阈值以及原创文本分类阈值中的至少一项进行调整。
示例性地,基于目标文本分类结果、原创文本分类结果以及图像分类结果共同对待识别视频进行识别。如果目标文本分类结果大于或等于目标文本分类阈值,且原创文本分类结果大于或等于原创文本分类阈值,且图像分类结果大于或等于图像分类阈值,则确定待识别视频所对应的视频识别结果为恶意推广类型。
假设被机器判定为恶意推广类型的待识别视频,经过人工审核之后判定为非恶意推广类型,此时,可以对目标文本分类阈值、原创文本分类阈值以及图像分类阈值中的至少一项进行调整。
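For example, the to-be-recognized video is recognized jointly based on the target text classification result, the original text classification result, the image classification result, and the identity confidence of the content publisher. If the target text classification result is greater than or equal to the target text classification threshold, the original text classification result is greater than or equal to the original text classification threshold, the image classification result is greater than or equal to the image classification threshold, and the identity confidence of the content publisher is greater than or equal to the identity confidence threshold, the video recognition result corresponding to the to-be-recognized video is determined to be the malicious promotion type.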
假设被机器判定为恶意推广类型的待识别视频,经过人工审核之后判定为非恶意推广类型,此时,可以对目标文本分类阈值、原创文本分类阈值、图像分类阈值以及身份置信度阈值中的至少一项进行调整。
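The threshold checks and the feedback adjustment in the examples above can be sketched as two small functions. The text only says that at least one threshold is adjusted when the machine result and the manual label disagree; the rule used below (raise the thresholds after a false positive, lower them after a false negative, by a fixed step) is an assumption for illustration, as are the names `is_malicious`, `adjust_thresholds`, and `step`.

```python
from typing import Dict


def is_malicious(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Malicious only if every available score reaches its own threshold."""
    return all(scores[name] >= thresholds[name] for name in scores)


def adjust_thresholds(thresholds: Dict[str, float], scores: Dict[str, float],
                      machine_says_malicious: bool, human_says_malicious: bool,
                      step: float = 0.05) -> Dict[str, float]:
    """Return updated thresholds after comparing the machine result with the manual label."""
    updated = dict(thresholds)
    if machine_says_malicious == human_says_malicious:
        return updated  # results match, nothing to adjust
    # False positive: make the check stricter. False negative: make it looser.
    delta = step if machine_says_malicious else -step
    for name in scores:  # only the thresholds that took part in the decision
        updated[name] = min(1.0, max(0.0, updated[name] + delta))
    return updated
```

For instance, with `scores = {"text": 0.6}` and `thresholds = {"text": 0.5}` the machine result is malicious; if manual review disagrees, the text threshold moves to about 0.55. A production system would more likely re-fit thresholds on a batch of labelled cases than react to a single one.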
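In addition, this embodiment of the application provides a way of combining crowd intelligence to optimize multimedia recognition. Although malicious promotion videos can be recognized on the basis of the hierarchical strategy combination algorithm over multi-source information dimensions and the non-content-level feature supplements, some recognition results may still be inaccurate. This application therefore further proposes a design that combines machine intelligence with crowd intelligence, which can gradually improve accuracy while giving priority to recall.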
结合上述介绍,下面将对本申请中多媒体内容的识别方法进行介绍,请参阅图13,本申请实施例中多媒体内容识别方法的另一个实施例包括:
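201. Obtain target text information of a to-be-recognized text and original text information for the to-be-recognized text, where the target text information includes at least one of a title text and an introduction text, and the original text information includes at least one of comment information and bullet-screen information;
In this embodiment, the multimedia content recognition apparatus obtains the target text information and the original text information of the to-be-recognized text. For ease of understanding, refer to FIG. 14, which is a schematic diagram of an overall recognition framework for text in an embodiment of this application. As shown in the figure, the target text information may include at least one of a title text and an introduction text, and the original text information may include at least one of comment information and bullet-screen information.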
需要说明的是,多媒体内容识别装置可部署于服务器,也可以部署于终端设备,还可以部署于由终端设备和服务器组成的多媒体内容识别系统中,此处不做限定。
202、若目标文本信息满足第一恶意推广条件,则基于目标文本信息,通过文本分类模型获取目标文本分类结果,其中,目标文本分类结果表示目标文本信息的恶意推广程度;
本实施例中,多媒体内容识别装置判断目标文本信息是否满足第一恶意推广条件。可以理解的是,满足第一恶意推广条件的情况可以是命中匹配库中的关键词或模板。对于满足第一恶意推广条件的目标文本信息而言,将其输入至训练好的文本分类模型中,即可得到目标文本分类结果,其中,目标文本分类结果可以是一个二分类的结果,例如,“属于恶意推广类型”或者“不属于恶意推广类型”。或者,目标文本分类结果可以是一个多分类的结果,例如,“属于恶意推广类型”、“疑似恶意推广类型”或者“不属于恶意推广类型”。
203、若原创文本信息满足第一恶意推广条件,则基于原创文本信息,通过文本分类模型获取原创文本分类结果,其中,原创文本分类结果表示原创文本信息的恶意推广程度;
本实施例中,类似地,多媒体内容识别装置判断原创文本信息是否满足第一恶意推广条件。对于满足第一恶意推广条件的原创文本信息而言,将其输入至训练好的文本分类模型中,即可得到原创文本分类结果,其中,原创文本分类结果可以是一个二分类的结果,或者是一个多分类的结果。
204、根据目标文本分类结果和/或原创文本分类结果,确定待识别文本所对应的文本识别结果,其中,文本识别结果表示待识别文本的恶意推广程度。
本实施例中,多媒体内容识别装置根据目标文本分类结果以及原创文本分类结果,确定待识别文本所对应的文本识别结果。例如,目标文本分类结果为“属于恶意推广类型”,且原创文本分类结果也为“属于恶意推广类型”,则输出待识别文本的文本识别结果为“属于恶意推广类型”,且恶意推广程度最高。又例如,目标文本分类结果为“属于恶意推广类型”,且原创文本分类结果为“不属于恶意推广类型”,则输出待识别文本的文本识别结果为“疑似恶意推广类型”。又例如,目标文本分类结果为“不属于恶意推广类型”,且原创文本分类结果也为“不属于恶意推广类型”,则输出待识别文本的文本识别结果为“不属于恶意推广类型”。
具体地,为了便于理解,请参阅图15,图15为本申请实施例中文本类媒体内容的一个识别场景示意图,如图所示,经过文本识别后,可向内容管理者展示不同文本的识别结果。一方面,文本识别平台可直接下架或删除“属于恶意推广类型”的文本,另一方面,内容管理者还可以进一步查看文本的具体信息,以人工的方式查验输出结果的准确度。
本申请实施例中,提供了一种多媒体内容的识别方法。通过上述方式,从多个角度对多媒体内容进行恶意推广程度的识别,针对于文本类多媒体内容的媒体形态,从标题、简介以及原创文本信息等方面对文本质量进行更全面地把握,形成了更完善的文本恶意推广识别策略,从而提升识别出文本恶意推广的准确率。
结合上述介绍,下面将对本申请中多媒体内容的识别方法进行介绍,请参阅图16,本申请实施例中多媒体内容识别方法的另一个实施例包括:
301、获取待识别图像的图像数据以及针对于待识别图像的原创文本信息,其中,原创文本信息包括评论信息以及弹幕信息中的至少一种;
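In this embodiment, the multimedia content recognition apparatus obtains the image data and the original text information of the to-be-recognized image. For ease of understanding, refer to FIG. 17, which is a schematic diagram of an overall recognition framework for images in an embodiment of this application. As shown in the figure, the original text information may include at least one of comment information and bullet-screen information.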
需要说明的是,多媒体内容识别装置可部署于服务器,也可以部署于终端设备,还可以部署于由终端设备和服务器组成的多媒体内容识别系统中,此处不做限定。
302、对图像数据进行文本识别处理,得到图像文本;
本实施例中,多媒体内容识别装置对待识别图像中的图像数据进行OCR处理,得到图像文本。
303、若图像文本满足第一恶意推广条件,则基于图像文本,通过文本分类模型获取图像文本分类结果,其中,图像文本分类结果表示图像文本的恶意推广程度;
本实施例中,多媒体内容识别装置判断图像文本是否满足第一恶意推广条件。可以理解的是,满足第一恶意推广条件的情况可以是命中匹配库中的关键词或模板。对于满足第一恶意推广条件的图像文本而言,将其输入至训练好的文本分类模型中,即可得到图像文本分类结果,其中,图像文本分类结果可以是一个二分类的结果,例如,“属于恶意推广类型”或者“不属于恶意推广类型”。或者,图像文本分类结果可以是一个多分类的结果,例如,“属于恶意推广类型”、“疑似恶意推广类型”或者“不属于恶意推广类型”。
304、若原创文本信息满足第一恶意推广条件,则基于原创文本信息,通过文本分类模型获取原创文本分类结果,其中,原创文本分类结果表示原创文本信息的恶意推广程度;
本实施例中,类似地,多媒体内容识别装置判断原创文本信息是否满足第一恶意推广条件。对于满足第一恶意推广条件的原创文本信息而言,将其输入至训练好的文本分类模型中,即可得到原创文本分类结果,其中,原创文本分类结果可以是一个二分类的结果,或者是一个多分类的结果。
305、根据图像文本分类结果和/或原创文本分类结果,确定待识别图像所对应的图像识别结果,其中,图像识别结果表示待识别图像的恶意推广程度。
本实施例中,多媒体内容识别装置根据图像文本分类结果以及原创文本分类结果,确定待识别图像所对应的图像识别结果。例如,图像文本分类结果为“属于恶意推广类型”,且原创文本分类结果也为“属于恶意推广类型”,则输出待识别图像的图像识别结果为“属于恶意推广类型”,且恶意推广程度最高。又例如,图像文本分类结果为“属于恶意推广类型”,且原创文本分类结果为“不属于恶意推广类型”,则输出待识别图像的图像识别结果为“疑似恶意推广类型”。又例如,图像文本分类结果为“不属于恶意推广类型”,且原创文本分类结果也为“不属于恶意推广类型”,则输出待识别图像的图像识别结果为“不属于恶意推广类型”。
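Specifically, for ease of understanding, refer to FIG. 18, which is a schematic diagram of a recognition scenario for image-type media content in an embodiment of this application. As shown in the figure, after recognition, the recognition results of different images can be shown to a content manager. On the one hand, the image recognition platform may directly take down or delete images that "belong to the malicious promotion type". On the other hand, the content manager may further view the specific information of an image and manually check the accuracy of the output result.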
本申请实施例中,提供了一种多媒体内容的识别方法。通过上述方式,从多个角度对多媒体内容进行恶意推广程度的识别,针对于图像类多媒体内容的媒体形态,从图像数据以及原创文本信息等方面对图像质量进行更全面地把握,形成了更完善的图像恶意推广识别策略,从而提升识别出图像恶意推广的准确率。
结合上述介绍,下面将对本申请中多媒体内容的识别方法进行介绍,请参阅图19,本申请实施例中多媒体内容识别方法的另一个实施例包括:
401、获取待识别音频的音频数据以及针对于待识别音频的原创文本信息,其中,原创文本信息包括评论信息以及弹幕信息中的至少一种;
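In this embodiment, the multimedia content recognition apparatus obtains the audio data and the original text information of the to-be-recognized audio. For ease of understanding, refer to FIG. 20, which is a schematic diagram of an overall recognition framework for audio in an embodiment of this application. As shown in the figure, the original text information may include at least one of comment information and bullet-screen information.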
需要说明的是,多媒体内容识别装置可部署于服务器,也可以部署于终端设备,还可以部署于由终端设备和服务器组成的多媒体内容识别系统中,此处不做限定。
402、对音频数据进行文本识别处理,得到音频文本;
本实施例中,多媒体内容识别装置对待识别音频中的音频数据进行ASR处理,得到音频文本。
403、若音频文本满足第一恶意推广条件,则基于音频文本,通过文本分类模型获取音频文本分类结果,其中,音频文本分类结果表示音频文本的恶意推广程度;
本实施例中,多媒体内容识别装置判断音频文本是否满足第一恶意推广条件。可以理解的是,满足第一恶意推广条件的情况可以是命中匹配库中的关键词或模板。对于满足第一恶意推广条件的音频文本而言,将其输入至训练好的文本分类模型中,即可得到音频文本分类结果,其中,音频文本分类结果可以是一个二分类的结果,例如,“属于恶意推广类型”或者“不属于恶意推广类型”。或者,音频文本分类结果可以是一个多分类的结果,例如,“属于恶意推广类型”、“疑似恶意推广类型”或者“不属于恶意推广类型”。
404、若原创文本信息满足第一恶意推广条件,则基于原创文本信息,通过文本分类模型获取原创文本分类结果,其中,原创文本分类结果表示原创文本信息的恶意推广程度;
本实施例中,类似地,多媒体内容识别装置判断原创文本信息是否满足第一恶意推广条件。对于满足第一恶意推广条件的原创文本信息而言,将其输入至训练好的文本分类模型中,即可得到原创文本分类结果,其中,原创文本分类结果可以是一个二分类的结果,或者是一个多分类的结果。
405、根据音频文本分类结果和/或原创文本分类结果,确定待识别音频所对应的音频识别结果,其中,音频识别结果表示待识别音频的恶意推广程度。
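In this embodiment, the multimedia content recognition apparatus determines the audio recognition result corresponding to the to-be-recognized audio according to the audio text classification result and the original text classification result. For example, if the audio text classification result is "belongs to the malicious promotion type" and the original text classification result is also "belongs to the malicious promotion type", the audio recognition result output for the to-be-recognized audio is "belongs to the malicious promotion type", with the highest degree of malicious promotion. As another example, if the audio text classification result is "belongs to the malicious promotion type" and the original text classification result is "does not belong to the malicious promotion type", the audio recognition result output for the to-be-recognized audio is "suspected malicious promotion type". As another example, if the audio text classification result is "does not belong to the malicious promotion type" and the original text classification result is also "does not belong to the malicious promotion type", the audio recognition result output for the to-be-recognized audio is "does not belong to the malicious promotion type".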
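Specifically, for ease of understanding, refer to FIG. 21, which is a schematic diagram of a recognition scenario for audio-type media content in an embodiment of this application. As shown in the figure, after recognition, the recognition results of different audio items can be shown to a content manager. On the one hand, the audio recognition platform may directly take down or delete audio that "belongs to the malicious promotion type". On the other hand, the content manager may further view the specific information of an audio item and manually check the accuracy of the output result.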
本申请实施例中,提供了一种多媒体内容的识别方法。通过上述方式,从多个角度对多媒体内容进行恶意推广程度的识别,针对于音频类多媒体内容的媒体形态,从音频数据以及原创文本信息等方面对音频质量进行更全面地把握,形成了更完善的音频恶意推广识别策略,从而提升识别出音频恶意推广的准确率。
下面对本申请中的多媒体内容识别装置进行详细描述,请参阅图22,图22为本申请实施例中多媒体内容识别装置的一个实施例示意图,多媒体内容识别装置50包括:
获取模块501,用于获取待识别视频的目标文本信息以及内容信息,其中,目标文本信息包括标题文本以及简介文本中的至少一种,内容信息包括图像数据以及音频数据中的至少一种;
识别模块502,用于对内容信息进行文本识别处理,得到关联文本信息,其中,关联文本信息包括图像文本以及音频文本中的至少一种,图像文本为对图像数据进行文本识别后得到的,音频文本为对音频数据进行文本识别后得到的;
获取模块501,还用于将满足第一恶意推广条件的目标文本信息以及关联文本信息中的至少一种作为文本内容,基于文本内容通过文本分类模型获取目标文本分类结果,其中,目标文本分类结果表示文本内容的恶意推广程度;
识别模块502,还用于根据目标文本分类结果确定待识别视频所对应的视频识别结果,其中,视频识别结果表示待识别视频的恶意推广程度。
本申请实施例中,提供了一种多媒体内容识别装置,采用上述装置,从多个角度对多媒体内容进行恶意推广程度的识别,针对于视频类多媒体内容的媒体形态,从标题、简介、图像以及音频等方面对视频质量进行更全面地把握,形成了更完善的视频恶意推广识别策略,从而提升识别出视频恶意推广的准确率。
可选地,在上述图22所对应的实施例的基础上,本申请实施例提供的多媒体内容识别装置50的另一实施例中,内容信息包括图像数据;
识别模块502,具体用于对待识别视频所包括的图像数据进行分帧处理,得到K个视频帧,其中,K为大于或等于1的整数;
按照预设帧率从K个视频帧中获取L个视频帧,其中,L为大于或等于1,且小于K的整数;
对L个视频帧中的每个视频帧进行光学字符识别OCR处理,得到每个视频帧的文本识别结果,其中,文本识别结果包括字幕以及字幕所对应的坐标信息;
针对于L个视频帧中的每个视频帧,根据字幕所对应的坐标信息,对字幕进行去重处理;
将每个视频帧中经过去重后的字幕作为关联文本信息中的图像文本。
本申请实施例中,提供了一种多媒体内容识别装置,采用上述装置,先对视频的图像数据进行抽帧处理,然后将识别到的图像文本与模板进行匹配,由此,增加识别恶意推广视频的维度,从而有利于提升视频恶意推广的识别准确率。
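The frame sampling and coordinate-based subtitle de-duplication steps listed above can be sketched as follows. The OCR output format (a text string plus a bounding box), the fixed sampling step, and the pixel tolerance are assumptions for this illustration; no real OCR library is called, and the function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # x1, y1, x2, y2 in pixels (assumed format)
OcrItem = Tuple[str, Box]        # recognized subtitle text and its coordinates


def sample_frames(frames: list, step: int) -> list:
    """Keep every step-th frame, standing in for sampling at a preset frame rate."""
    return frames[::step]


def dedupe_subtitles(frames_ocr: List[List[OcrItem]], tolerance: int = 10) -> List[str]:
    """Drop a subtitle if the same text was already seen at roughly the same position."""
    kept: List[OcrItem] = []
    for items in frames_ocr:
        for text, box in items:
            duplicate = any(
                text == seen_text
                and all(abs(a - b) <= tolerance for a, b in zip(box, seen_box))
                for seen_text, seen_box in kept
            )
            if not duplicate:
                kept.append((text, box))
    return [text for text, _ in kept]
```

The same subtitle usually stays on screen across several sampled frames, so matching on both text and position removes those repeats while keeping different text that later appears in the same region.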
可选地,在上述图22所对应的实施例的基础上,本申请实施例提供的多媒体内容识别装置50的另一实施例中,内容信息包括音频数据;
识别模块502,具体用于对待识别视频所包括的音频数据进行分帧处理,得到T个音频帧,其中,T为大于或等于1的整数;
对T个音频帧中的每个音频帧进行特征提取处理,得到每个音频帧所对应的音频特征向量;
基于每个音频帧所对应的音频特征向量确定关联文本信息中的音频文本。
本申请实施例中,提供了一种多媒体内容识别装置,采用上述装置,先对视频的音频数据进行识别,然后将识别到的音频文本与模板进行匹配,由此,增加识别恶意推广视频的维度,从而有利于提升视频恶意推广的识别准确率。
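The audio framing and per-frame feature extraction steps can be illustrated with a toy example. The application does not name the features used; the two features below (mean energy and zero-crossing rate) are stand-ins chosen for simplicity, and a real ASR front end would typically use richer features before decoding text. All names and parameters are hypothetical.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split the sample sequence into T overlapping frames of frame_len samples."""
    if frame_len < 2 or hop <= 0:
        raise ValueError("frame_len must be >= 2 and hop must be positive")
    last_start = len(samples) - frame_len
    return [list(samples[s:s + frame_len]) for s in range(0, last_start + 1, hop)]


def frame_features(frame: Sequence[float]) -> List[float]:
    """Toy feature vector: [mean energy, zero-crossing rate]."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, crossings / (len(frame) - 1)]


def extract_features(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """One feature vector per audio frame."""
    return [frame_features(f) for f in split_into_frames(samples, frame_len, hop)]
```

The resulting sequence of feature vectors is what a recognition model would consume to produce the audio text; that decoding step is outside the scope of this sketch.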
可选地,在上述图22所对应的实施例的基础上,本申请实施例提供的多媒体内容识别装置50的另一实施例中,多媒体内容识别装置50还包括确定模块503;
确定模块503,用于若标题文本与匹配库中的模板匹配成功,且标题文本与白名单中的信息匹配失败,则确定标题文本满足第一恶意推广条件,且目标文本信息满足第一恶意推广条件;
确定模块503,还用于若简介文本与匹配库中的模板匹配成功,且简介文本与白名单中的信息匹配失败,则确定简介文本满足第一恶意推广条件,且目标文本信息满足第一恶意推广条件;
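The determining module 503 is further configured to: if the image text successfully matches a template in the matching library and fails to match the information in the whitelist, determine that the image text satisfies the first malicious promotion condition and that the associated text information satisfies the first malicious promotion condition;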
确定模块503,还用于若音频文本与匹配库中的模板匹配成功,且音频文本与白名单中的信息匹配失败,则确定音频文本满足第一恶意推广条件,且关联文本信息满足第一恶意推广条件。
本申请实施例中,提供了一种多媒体内容识别装置,采用上述装置,基于模板匹配能够找出可能存在恶意推广的视频,基于拒识策略能够过滤掉属于白名单内的视频,由此,有利于提升召回视频的准确率。
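The first malicious promotion condition described above (a hit in the matching library combined with a miss on the whitelist) can be sketched with regular-expression templates and substring whitelist terms. The sample templates, the whitelist entries, and the function name are made up for illustration; the application does not specify how templates are represented.

```python
import re
from typing import Iterable

# Hypothetical matching library and whitelist.
TEMPLATES = [re.compile(r"stock tips?", re.IGNORECASE), re.compile(r"add me on \w+", re.IGNORECASE)]
WHITELIST = ["official announcement"]


def meets_first_condition(text: str, templates: Iterable[re.Pattern],
                          whitelist: Iterable[str]) -> bool:
    """True if the text hits a template and is not rejected by the whitelist."""
    hits_template = any(pattern.search(text) for pattern in templates)
    hits_whitelist = any(term in text.lower() for term in whitelist)
    return hits_template and not hits_whitelist
```

The same check would be applied independently to the title text, the introduction text, the image text, and the audio text; only texts that pass it are sent on to the text classification model.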
可选地,在上述图22所对应的实施例的基础上,本申请实施例提供的多媒体内容识别装置50的另一实施例中,
获取模块501,具体用于若标题文本满足第一恶意推广条件,则基于标题文本,通过文本分类模型获取标题文本分类结果;
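if the introduction text satisfies the first malicious promotion condition, obtain an introduction text classification result through the text classification model based on the introduction text;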
若图像文本满足第一恶意推广条件,则基于图像文本,通过文本分类模型获取图像文本分类结果;
若音频文本满足第一恶意推广条件,则基于音频文本,通过文本分类模型获取音频文本分类结果。
本申请实施例中,提供了一种多媒体内容识别装置,采用上述装置,由于文本分类模型是单个分类器,因此,可直接预测出目标文本分类结果(包括标题文本分类结果、简介文本分类结果、图像文本分类结果和音频文本分类结果),而且能够并行处理多条预测分支,即同时对多个多媒体内容进行分类,由此提升分类预测效率。
可选地,在上述图22所对应的实施例的基础上,本申请实施例提供的多媒体内容识别装置50的另一实施例中,
获取模块501,具体用于若标题文本满足第一恶意推广条件,则基于标题文本,通过文本分类模型所包括的N个子分类模型分别获取N个标题文本分类子结果,并根据所述N个标题文本分类子结果确定标题文本分类结果,其中,每个标题文本分类子结果对应于一个恶意推广类型;
若简介文本满足第一恶意推广条件,则基于简介文本,通过文本分类模型所包括的N个子分类模型分别获取N个简介文本分类子结果,并根据所述N个简介文本分类子结果确定简介文本分类结果,其中,每个简介文本分类子结果对应于一个恶意推广类型;
若图像文本满足第一恶意推广条件,则基于图像文本,通过文本分类模型所包括的N个子分类模型分别获取N个图像文本分类子结果,并根据所述N个图像文本分类子结果确定图像文本分类结果,其中,每个图像文本分类子结果对应于一个恶意推广类型;
若音频文本满足第一恶意推广条件,则基于音频文本,通过文本分类模型所包括的N个子分类模型分别获取N个音频文本分类子结果,并根据所述N个音频文本分类子结果确定音频文本分类结果,其中,每个音频文本分类子结果对应于一个恶意推广类型。
本申请实施例中,提供了一种多媒体内容识别装置,采用上述装置,由于文本分类模型包括多个分类器,因此,可预测多个文本分类子结果(包括标题文本分类子结果、简介文本分类子结果、图像文本分类子结果和音频文本分类子结果),最后基于多个文本分类子结果确定目标文本分类结果。由此,能够对多媒体内容进行更精细的分类,针对具体分类进行后续的识别,从而有利于提升识别的准确性。
可选地,在上述图22所对应的实施例的基础上,本申请实施例提供的多媒体内容识别装置50的另一实施例中,
获取模块501,还用于若内容信息包括图像数据,则基于图像数据,通过图像分类模型获取图像分类结果,其中,图像分类结果表示图像数据的恶意推广程度;
识别模块502,具体用于根据目标文本分类结果以及图像分类结果,确定待识别视频所对应的视频识别结果。
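This embodiment of the application provides a multimedia content recognition apparatus. With this apparatus, the image data in a video is recognized to obtain an image classification result, which is then combined with the target text classification result obtained after text recognition to jointly judge whether the to-be-recognized video is malicious promotion. In this way, the hierarchical strategies of the individual source-information dimensions are combined so that they complement one another.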
可选地,在上述图22所对应的实施例的基础上,本申请实施例提供的多媒体内容识别装置50的另一实施例中,
识别模块502,具体用于若图像分类结果表示待识别视频包括信息码,且目标文本分类结果满足第二恶意推广条件,则确定待识别视频所对应的视频识别结果为恶意推广视频;
若图像分类结果表示待识别视频包括走势图,且目标文本分类结果满足第二恶意推广条件,则确定待识别视频所对应的视频识别结果为恶意推广视频;
若图像分类结果表示待识别视频属于预设视频类型,则确定待识别视频所对应的视频识别结果为非恶意推广视频。
本申请实施例中,提供了一种多媒体内容识别装置,采用上述装置,在视频内容层面上,基于对视频低质量识别的能力积累,结合视频内容特征,引入“判白”策略算法以及“判黑”策略算法。其中,如果命中“判白”策略算法中的任意一种,则自动将待识别视频划分为非恶意推广视频。如果命中“判黑”策略算法,即作为恶意推广视频的一个辅助特征,但并不直接判定为恶意推广。
可选地,在上述图22所对应的实施例的基础上,本申请实施例提供的多媒体内容识别装置50的另一实施例中,
获取模块501,还用于获取针对于待识别视频的原创文本信息,其中,原创文本信息包括评论信息以及弹幕信息中的至少一种;
获取模块501,还用于若原创文本信息满足第一恶意推广条件,则基于原创文本信息,通过文本分类模型获取原创文本分类结果,其中,原创文本分类结果表示原创文本信息的恶意推广程度;
识别模块502,具体用于根据目标文本分类结果以及原创文本分类结果,确定待识别视频所对应的视频识别结果;或者,
根据目标文本分类结果、原创文本分类结果以及图像分类结果,确定待识别视频所对应的视频识别结果,其中,图像分类结果为基于图像数据,通过图像分类模型获取的,其中,图像分类结果表示图像数据的恶意推广程度。
本申请实施例中,提供了一种多媒体内容识别装置,采用上述装置,可以通过额外评论信息对视频恶意推广进行非内容层面的特征补充,对这些内容进行恶意推广检测,超过一定阈值的即可推送至人工详细查证,优化识别流程,实现对多源信息层级策略组合算法的正向闭环补充。
可选地,在上述图22所对应的实施例的基础上,本申请实施例提供的多媒体内容识别装置50的另一实施例中,
获取模块501,还用于在当前周期内获取针对于待识别视频的发布者信息,其中,发布者信息包括内容发布者在当前周期内的基本信息以及行为信息;
确定模块503,还用于根据发布者信息确定内容发布者的身份置信度;
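The recognition module 502 is specifically configured to determine the video recognition result corresponding to the to-be-recognized video according to the target text classification result and the identity confidence of the content publisher.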
本申请实施例中,提供了一种多媒体内容识别装置,采用上述装置,根据每个周期内获取到的发布者信息,确定内容发布者的身份置信度。一方面能够得到内容发布者更准确的恶意推广倾向标签,有利于辅助判定视频的恶意程度。另一方面,结合用户发布或转发的历史恶意推广视频数的行为信息,建立内容发布者的恶意推广倾向标签,辅助判别内容发布者所发布或转发的视频是否应该判定为恶意推广,增加多源信息层级策略组合算法判定结果的置信度。
可选地,在上述图22所对应的实施例的基础上,本申请实施例提供的多媒体内容识别装置50的另一实施例中,
识别模块502,具体用于根据目标文本分类结果、原创文本分类结果以及内容发布者的身份置信度,确定待识别视频所对应的视频识别结果,其中,原创文本分类结果为基于原创文本信息,通过文本分类模型获取的,原创文本分类结果表示原创文本信息的恶意推广程度,原创文本信息包括评论信息以及弹幕信息中的至少一种;或者,
根据目标文本分类结果、图像分类结果以及内容发布者的身份置信度,确定待识别视频所对应的视频识别结果,其中,图像分类结果为基于图像数据,通过图像分类模型获取的,图像分类结果表示图像数据的恶意推广程度;或者,
根据目标文本分类结果、原创文本分类结果、图像分类结果以及内容发布者的身份置信度,确定待识别视频所对应的视频识别结果。
本申请实施例中,提供了一种多媒体内容识别装置,采用上述装置,可以通过额外对象行为对视频恶意推广进行非内容层面的特征补充,对这些内容进行恶意推广检测,超过一定阈值的即可推送至人工详细查证,优化识别流程,实现对多源信息层级策略组合算法的正向闭环补充。
可选地,在上述图22所对应的实施例的基础上,本申请实施例提供的多媒体内容识别装置50的另一实施例中,多媒体内容识别装置50还包括调整模块504;
识别模块502,具体用于若目标文本分类结果大于或等于目标文本分类阈值,则确定待识别视频所对应的视频识别结果为恶意推广类型;
若目标文本分类结果小于目标文本分类阈值,则确定待识别视频所对应的视频识别结果为非恶意推广类型;
若目标文本分类结果大于或等于目标文本分类阈值,且原创文本分类结果大于或等于原创文本分类阈值,则确定待识别视频所对应的视频识别结果为恶意推广类型;
若目标文本分类结果大于或等于目标文本分类阈值,且原创文本分类结果大于或等于原创文本分类阈值,且图像分类结果大于或等于图像分类阈值,则确定待识别视频所对应的视频识别结果为恶意推广类型;
若目标文本分类结果大于或等于目标文本分类阈值,且原创文本分类结果大于或等于原创文本分类阈值,且图像分类结果大于或等于图像分类阈值,且内容发布者的身份置信度大于或等于身份置信度阈值,则确定待识别视频所对应的视频识别结果为恶意推广类型;
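The obtaining module 501 is further configured to obtain a video annotation result corresponding to the to-be-recognized video after the video recognition result corresponding to the to-be-recognized video is determined according to the target text classification result;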
调整模块504,用于若视频识别结果与视频标注结果匹配不一致,则对目标文本分类阈值、原创文本分类阈值、图像分类阈值以及身份置信度阈值中的至少一项进行调整。
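This embodiment of the application provides a multimedia content recognition apparatus. Although malicious promotion videos can be recognized with this apparatus on the basis of the hierarchical strategy combination algorithm over multi-source information dimensions and the non-content-level feature supplements, some recognition results may still be inaccurate. This application therefore further proposes a design that combines machine intelligence with crowd intelligence, which can gradually improve accuracy while giving priority to recall.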
下面对本申请中的多媒体内容识别装置进行详细描述,请参阅图23,图23为本申请实施例中多媒体内容识别装置的一个实施例示意图,多媒体内容识别装置60包括:
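The obtaining module 601 is configured to obtain target text information of a to-be-recognized text and original text information for the to-be-recognized text, where the target text information includes at least one of a title text and an introduction text, and the original text information includes at least one of comment information and bullet-screen information;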
获取模块601,还用于若目标文本信息满足第一恶意推广条件,则基于目标文本信息,通过文本分类模型获取目标文本分类结果,其中,目标文本分类结果表示目标文本信息的恶意推广程度;
获取模块601,还用于若原创文本信息满足第一恶意推广条件,则基于原创文本信息,通过文本分类模型获取原创文本分类结果,其中,原创文本分类结果表示原创文本信息的恶意推广程度;
识别模块602,用于根据目标文本分类结果和/或原创文本分类结果,确定待识别文本所对应的文本识别结果,其中,文本识别结果表示待识别文本的恶意推广程度。
本申请实施例中,提供了一种多媒体内容识别装置,采用上述装置,从多个角度对多媒体内容进行恶意推广程度的识别,针对于文本类多媒体内容的媒体形态,从标题、简介以及原创文本信息等方面对文本质量进行更全面地把握,形成了更完善的文本恶意推广识别策略,从而提升识别出文本恶意推广的准确率。
下面对本申请中的多媒体内容识别装置进行详细描述,请参阅图24,图24为本申请实施例中多媒体内容识别装置的一个实施例示意图,多媒体内容识别装置70包括:
获取模块701,用于获取待识别图像的图像数据以及针对于待识别图像的原创文本信息,其中,原创文本信息包括评论信息以及弹幕信息中的至少一种;
识别模块702,用于对图像数据进行文本识别处理,得到图像文本;
获取模块701,还用于若图像文本满足第一恶意推广条件,则基于图像文本,通过文本分类模型获取图像文本分类结果,其中,图像文本分类结果表示图像文本的恶意推广程度;
获取模块701,还用于若原创文本信息满足第一恶意推广条件,则基于原创文本信息,通过文本分类模型获取原创文本分类结果,其中,原创文本分类结果表示原创文本信息的恶意推广程度;
识别模块702,还用于根据图像文本分类结果和/或原创文本分类结果,确定待识别图像所对应的图像识别结果,其中,图像识别结果表示待识别图像的恶意推广程度。
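This embodiment of the application provides a multimedia content recognition apparatus. With this apparatus, the degree of malicious promotion of multimedia content is recognized from multiple angles. For the media form of image-type multimedia content, image quality is assessed more comprehensively from aspects such as the image data and the original text information, forming a more complete strategy for recognizing malicious promotion in images and thereby improving the accuracy of that recognition.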
下面对本申请中的多媒体内容识别装置进行详细描述,请参阅图25,图25为本申请实施例中多媒体内容识别装置的一个实施例示意图,多媒体内容识别装置80包括:
获取模块801,用于获取待识别音频的音频数据以及针对于待识别音频的原创文本信息,其中,原创文本信息包括评论信息以及弹幕信息中的至少一种;
识别模块802,用于对音频数据进行文本识别处理,得到音频文本;
获取模块801,还用于若音频文本满足第一恶意推广条件,则基于音频文本,通过文本分类模型获取音频文本分类结果,其中,音频文本分类结果表示音频文本的恶意推广程度;
获取模块801,还用于若原创文本信息满足第一恶意推广条件,则基于原创文本信息,通过文本分类模型获取原创文本分类结果,其中,原创文本分类结果表示原创文本信息的恶意推广程度;
识别模块802,还用于根据音频文本分类结果和/或原创文本分类结果,确定待识别音频所对应的音频识别结果,其中,音频识别结果表示待识别音频的恶意推广程度。
本申请实施例中,提供了一种多媒体内容识别装置,采用上述装置,从多个角度对多媒体内容进行恶意推广程度的识别,针对于音频类多媒体内容的媒体形态,从音频数据以及原创文本信息等方面对音频质量进行更全面地把握,形成了更完善的音频恶意推广识别策略,从而提升识别出音频恶意推广的准确率。
本申请提供的多媒体内容识别装置可部署于服务器,请参阅图26,图26是本申请实施例提供的一种服务器结构示意图,该服务器900可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)922(例如,一个或一个以上处理器)和存储器932,一个或一个以上存储应用程序942或数据944的存储介质930(例如一个或一个以上海量存储设备)。其中,存储器932和存储介质930可以是短暂存储或持久存储。存储在存储介质930的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器922可以设置为与存储介质930通信,在服务器900上执行存储介质930中的一系列指令操作。
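The server 900 may further include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input/output interfaces 958, and/or one or more operating systems 941, for example Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.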
上述实施例中由服务器所执行的步骤可以基于该图26所示的服务器结构。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行如前述各个实施例描述的方法。
本申请实施例中还提供一种包括程序的计算机程序产品,当其在计算机上运行时,使得计算机执行前述各个实施例描述的方法。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (19)

  1. 一种多媒体内容的识别方法,所述方法由计算机设备执行,所述方法包括:
    获取待识别视频的目标文本信息以及内容信息,其中,所述目标文本信息包括标题文本以及简介文本中的至少一种,所述内容信息包括图像数据以及音频数据中的至少一种;
    对所述内容信息进行文本识别处理,得到关联文本信息,其中,所述关联文本信息包括图像文本以及音频文本中的至少一种,所述图像文本为对所述图像数据进行文本识别后得到的,所述音频文本为对所述音频数据进行文本识别后得到的;
    将满足第一恶意推广条件的所述目标文本信息以及所述关联文本信息中的至少一种作为文本内容,基于所述文本内容通过文本分类模型获取目标文本分类结果,其中,所述目标文本分类结果表示所述文本内容的恶意推广程度;
    根据所述目标文本分类结果确定所述待识别视频所对应的视频识别结果,其中,所述视频识别结果表示所述待识别视频的恶意推广程度。
  2. 根据权利要求1所述的识别方法,若所述内容信息包括所述图像数据;
    所述对所述内容信息进行文本识别处理,得到关联文本信息,包括:
    对所述待识别视频所包括的所述图像数据进行分帧处理,得到K个视频帧,其中,所述K为大于或等于1的整数;
    按照预设帧率从所述K个视频帧中获取L个视频帧,其中,所述L为大于或等于1,且小于所述K的整数;
    对所述L个视频帧中的每个视频帧进行光学字符识别OCR处理,得到每个视频帧的文本识别结果,其中,所述文本识别结果包括字幕以及所述字幕所对应的坐标信息;
    针对于所述L个视频帧中的每个视频帧,根据所述字幕所对应的坐标信息,对所述字幕进行去重处理;
    将所述每个视频帧中经过去重后的字幕作为所述关联文本信息中的所述图像文本。
  3. 根据权利要求1所述的识别方法,若所述内容信息包括所述音频数据;
    所述对所述内容信息进行文本识别处理,得到关联文本信息,包括:
    对所述待识别视频所包括的所述音频数据进行分帧处理,得到T个音频帧,其中,所述T为大于或等于1的整数;
    对所述T个音频帧中的每个音频帧进行特征提取处理,得到所述每个音频帧所对应的音频特征向量;
    基于所述每个音频帧所对应的音频特征向量,确定所述关联文本信息中的所述音频文本。
  4. 根据权利要求1所述的识别方法,所述方法还包括:
    若所述标题文本与匹配库中的模板匹配成功,且所述标题文本与白名单中的信息匹配失败,则确定所述标题文本满足所述第一恶意推广条件,且所述目标文本信息满足所述第一恶意推广条件;
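    if the introduction text successfully matches a template in the matching library and fails to match the information in the whitelist, determining that the introduction text satisfies the first malicious promotion condition and that the target text information satisfies the first malicious promotion condition;
    if the image text successfully matches a template in the matching library and fails to match the information in the whitelist, determining that the image text satisfies the first malicious promotion condition and that the associated text information satisfies the first malicious promotion condition;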
    若所述音频文本与所述匹配库中的模板匹配成功,且所述音频文本与所述白名单中的信息匹配失败,则确定所述音频文本满足所述第一恶意推广条件,且所述关联文本信息满足所述第一恶意推广条件。
  5. 根据权利要求4所述的识别方法,所述基于所述文本内容通过文本分类模型获取目标文本分类结果,包括:
    若所述标题文本满足所述第一恶意推广条件,则基于所述标题文本,通过所述文本分类模型获取标题文本分类结果;
    若所述简介文本满足所述第一恶意推广条件,则基于所述简介文本,通过所述文本分类模型获取简介文本分类结果;
    若所述图像文本满足所述第一恶意推广条件,则基于所述图像文本,通过所述文本分类模型获取图像文本分类结果;
    若所述音频文本满足所述第一恶意推广条件,则基于所述音频文本,通过所述文本分类模型获取音频文本分类结果。
  6. 根据权利要求4所述的识别方法,所述基于所述文本内容通过文本分类模型获取目标文本分类结果,包括:
    若所述标题文本满足所述第一恶意推广条件,则基于所述标题文本,通过所述文本分类模型所包括的N个子分类模型分别获取N个标题文本分类子结果,并根据所述N个标题文本分类子结果确定标题文本分类结果,其中,每个标题文本分类子结果对应于一个恶意推广类型;
    若所述简介文本满足所述第一恶意推广条件,则基于所述简介文本,通过所述文本分类模型所包括的N个子分类模型分别获取N个简介文本分类子结果,并根据所述N个简介文本分类子结果确定简介文本分类结果,其中,每个简介文本分类子结果对应于一个恶意推广类型;
    若所述图像文本满足所述第一恶意推广条件,则基于所述图像文本,通过所述文本分类模型所包括的N个子分类模型分别获取N个图像文本分类子结果,并根据所述N个图像文本分类子结果确定图像文本分类结果,其中,每个图像文本分类子结果对应于一个恶意推广类型;
    若所述音频文本满足所述第一恶意推广条件,则基于所述音频文本,通过所述文本分类模型所包括的N个子分类模型分别获取N个音频文本分类子结果,并根据所述N个音频文本分类子结果确定音频文本分类结果,其中,每个音频文本分类子结果对应于一个恶意推广类型。
  7. 根据权利要求1所述的识别方法,所述方法还包括:
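    if the content information comprises the image data, obtaining an image classification result through an image classification model based on the image data, wherein the image classification result indicates the degree of malicious promotion of the image data;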
    所述根据所述目标文本分类结果确定所述待识别视频所对应的视频识别结果,包括:
    根据所述目标文本分类结果以及所述图像分类结果,确定所述待识别视频所对应的视频识别结果。
  8. 根据权利要求7所述的识别方法,所述根据所述目标文本分类结果以及所述图像分类结果,确定所述待识别视频所对应的视频识别结果,包括:
    若所述图像分类结果表示所述待识别视频包括信息码,且所述目标文本分类结果满足第二恶意推广条件,则确定所述待识别视频所对应的视频识别结果为恶意推广视频;
    若所述图像分类结果表示所述待识别视频包括走势图,且所述目标文本分类结果满足所述第二恶意推广条件,则确定所述待识别视频所对应的视频识别结果为恶意推广视频;
    若所述图像分类结果表示所述待识别视频属于预设视频类型,则确定所述待识别视频所对应的视频识别结果为非恶意推广视频。
  9. 根据权利要求1所述的识别方法,所述方法还包括:
    获取针对于所述待识别视频的原创文本信息,其中,所述原创文本信息包括评论信息以及弹幕信息中的至少一种;
    若所述原创文本信息满足所述第一恶意推广条件,则基于所述原创文本信息,通过文本分类模型获取原创文本分类结果,其中,所述原创文本分类结果表示所述原创文本信息的恶意推广程度;
    所述根据所述目标文本分类结果确定所述待识别视频所对应的视频识别结果,包括:
    根据所述目标文本分类结果以及所述原创文本分类结果,确定所述待识别视频所对应的视频识别结果;或者,
    根据所述目标文本分类结果、所述原创文本分类结果以及图像分类结果,确定所述待识别视频所对应的视频识别结果,其中,所述图像分类结果为基于所述图像数据,通过图像分类模型获取的,其中,所述图像分类结果表示所述图像数据的恶意推广程度。
  10. 根据权利要求1所述的识别方法,所述方法还包括:
    在当前周期内获取针对于所述待识别视频的发布者信息,其中,所述发布者信息包括内容发布者在所述当前周期内的基本信息以及行为信息;
    根据所述发布者信息确定所述内容发布者的身份置信度;
    所述根据所述目标文本分类结果确定所述待识别视频所对应的视频识别结果,包括:
    根据所述目标文本分类结果以及所述内容发布者的身份置信度,确定所述待识别视频所对应的视频识别结果。
  11. 根据权利要求10所述的识别方法,所述根据所述目标文本分类结果以及所述内容发布者的身份置信度,确定所述待识别视频所对应的视频识别结果,包括:
    根据所述目标文本分类结果、原创文本分类结果以及所述内容发布者的身份置信度,确定所述待识别视频所对应的视频识别结果,其中,所述原创文本分类结果为基于原创文本信息,通过文本分类模型获取的,所述原创文本分类结果表示所述原创文本信息的恶意推广程度,所述原创文本信息包括评论信息以及弹幕信息中的至少一种;或者,
    根据所述目标文本分类结果、图像分类结果以及所述内容发布者的身份置信度,确定所述待识别视频所对应的视频识别结果,其中,所述图像分类结果为基于所述图像数据,通过图像分类模型获取的,所述图像分类结果表示所述图像数据的恶意推广程度;或者,
    根据所述目标文本分类结果、所述原创文本分类结果、所述图像分类结果以及所述内容发布者的身份置信度,确定所述待识别视频所对应的视频识别结果。
  12. 根据权利要求1至11中任一项所述的识别方法,所述根据所述目标文本分类结果确定所述待识别视频所对应的视频识别结果,包括:
    若所述目标文本分类结果大于或等于目标文本分类阈值,则确定所述待识别视频所对应的视频识别结果为恶意推广类型;
    若所述目标文本分类结果小于所述目标文本分类阈值,则确定所述待识别视频所对应的视频识别结果为非恶意推广类型;
    若所述目标文本分类结果大于或等于目标文本分类阈值,且原创文本分类结果大于或等于原创文本分类阈值,则确定所述待识别视频所对应的视频识别结果为恶意推广类型;
    若所述目标文本分类结果大于或等于目标文本分类阈值,且所述原创文本分类结果大于或等于所述原创文本分类阈值,且图像分类结果大于或等于图像分类阈值,则确定所述待识别视频所对应的视频识别结果为恶意推广类型;
    若所述目标文本分类结果大于或等于目标文本分类阈值,且所述原创文本分类结果大于或等于所述原创文本分类阈值,且所述图像分类结果大于或等于所述图像分类阈值,且内容发布者的身份置信度大于或等于身份置信度阈值,则确定所述待识别视频所对应的视频识别结果为恶意推广类型;
    所述根据所述目标文本分类结果确定所述待识别视频所对应的视频识别结果之后,所述方法还包括:
    获取所述待识别视频所对应的视频标注结果;
    若所述视频识别结果与所述视频标注结果匹配不一致,则对所述目标文本分类阈值、所述原创文本分类阈值、所述图像分类阈值以及所述身份置信度阈值中的至少一项进行调整。
  13. 一种多媒体内容的识别方法,所述方法由计算机设备执行,所述方法包括:
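    obtaining target text information of a to-be-recognized text and original text information for the to-be-recognized text, wherein the target text information comprises at least one of a title text and an introduction text, and the original text information comprises at least one of comment information and bullet-screen information;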
    若所述目标文本信息满足第一恶意推广条件,则基于所述目标文本信息,通过文本分类模型获取目标文本分类结果,其中,所述目标文本分类结果表示所述目标文本信息的恶意推广程度;
    若所述原创文本信息满足所述第一恶意推广条件,则基于所述原创文本信息,通过文本分类模型获取原创文本分类结果,其中,所述原创文本分类结果表示所述原创文本信息的恶意推广程度;
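    determining, according to the target text classification result and/or the original text classification result, a text recognition result corresponding to the to-be-recognized text, wherein the text recognition result indicates the degree of malicious promotion of the to-be-recognized text.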
  14. 一种多媒体内容的识别方法,所述方法由计算机设备执行,所述方法包括:
    获取待识别图像的图像数据以及针对于所述待识别图像的原创文本信息,其中,所述原创文本信息包括评论信息以及弹幕信息中的至少一种;
    对所述图像数据进行文本识别处理,得到图像文本;
    若所述图像文本满足第一恶意推广条件,则基于所述图像文本,通过文本分类模型获取图像文本分类结果,其中,所述图像文本分类结果表示所述图像文本的恶意推广程度;
    若所述原创文本信息满足所述第一恶意推广条件,则基于所述原创文本信息,通过文本分类模型获取原创文本分类结果,其中,所述原创文本分类结果表示所述原创文本信息的恶意推广程度;
    根据所述图像文本分类结果和/或所述原创文本分类结果,确定所述待识别图像所对应的图像识别结果,其中,所述图像识别结果表示所述待识别图像的恶意推广程度。
  15. 一种多媒体内容的识别方法,所述方法由计算机设备执行,所述方法包括:
    获取待识别音频的音频数据以及针对于所述待识别音频的原创文本信息,其中,所述原创文本信息包括评论信息以及弹幕信息中的至少一种;
    对所述音频数据进行文本识别处理,得到音频文本;
    若所述音频文本满足第一恶意推广条件,则基于所述音频文本,通过文本分类模型获取音频文本分类结果,其中,所述音频文本分类结果表示所述音频文本的恶意推广程度;
    若所述原创文本信息满足所述第一恶意推广条件,则基于所述原创文本信息,通过文本分类模型获取原创文本分类结果,其中,所述原创文本分类结果表示所述原创文本信息的恶意推广程度;
    根据所述音频文本分类结果和/或所述原创文本分类结果,确定所述待识别音频所对应的音频识别结果,其中,所述音频识别结果表示所述待识别音频的恶意推广程度。
  16. 一种多媒体内容识别装置,包括:
    获取模块,用于获取待识别视频的目标文本信息以及内容信息,其中,所述目标文本信息包括标题文本以及简介文本中的至少一种,所述内容信息包括图像数据以及音频数据中的至少一种;
    识别模块,用于对所述内容信息进行文本识别处理,得到关联文本信息,其中,所述关联文本信息包括图像文本以及音频文本中的至少一种,所述图像文本为对所述图像数据进行文本识别后得到的,所述音频文本为对所述音频数据进行文本识别后得到的;
    所述获取模块,还用于将满足第一恶意推广条件的所述目标文本信息以及所述关联文本信息中的至少一种作为文本内容,基于所述文本内容通过文本分类模型获取目标文本分类结果,其中,所述目标文本分类结果表示所述文本内容的恶意推广程度;
    所述识别模块,还用于根据所述目标文本分类结果确定所述待识别视频所对应的视频识别结果,其中,所述视频识别结果表示所述待识别视频的恶意推广程度。
  17. 一种计算机设备,包括:存储器、处理器和总线系统;
    其中,所述存储器用于存储程序;
    所述处理器用于执行所述存储器中的程序,所述处理器用于根据程序代码中的指令执行权利要求1至12中任一项所述的识别方法,或者,执行权利要求13所述的识别方法,或者,执行权利要求14所述的识别方法,或者,执行权利要求15所述的识别方法;
    所述总线系统用于连接所述存储器和所述处理器,以使所述存储器和所述处理器进行通信。
  18. 一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1至12中任一项所述的识别方法,或者,执行权利要求13所述的识别方法,或者,执行权利要求14所述的识别方法,或者,执行权利要求15所述的识别方法。
  19. 一种包括指令的计算机程序产品,当其在计算机上运行时,使得所述计算机执行权利要求1至12中任一项所述的识别方法,或者,执行权利要求13所述的识别方法,或者,执行权利要求14所述的识别方法,或者,执行权利要求15所述的识别方法。
PCT/CN2022/086948 2021-04-20 2022-04-15 一种多媒体内容的识别方法、相关装置、设备及存储介质 WO2022222850A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/967,454 US20230032728A1 (en) 2021-04-20 2022-10-17 Method and apparatus for recognizing multimedia content

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110426652.1 2021-04-20
CN202110426652.1A CN113761235A (zh) 2021-04-20 2021-04-20 一种多媒体内容的识别方法、相关装置、设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/967,454 Continuation US20230032728A1 (en) 2021-04-20 2022-10-17 Method and apparatus for recognizing multimedia content

Publications (1)

Publication Number Publication Date
WO2022222850A1 true WO2022222850A1 (zh) 2022-10-27

Family

ID=78787007

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/086948 WO2022222850A1 (zh) 2021-04-20 2022-04-15 一种多媒体内容的识别方法、相关装置、设备及存储介质

Country Status (3)

Country Link
US (1) US20230032728A1 (zh)
CN (1) CN113761235A (zh)
WO (1) WO2022222850A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761235A (zh) * 2021-04-20 2021-12-07 腾讯科技(深圳)有限公司 一种多媒体内容的识别方法、相关装置、设备及存储介质
CN114611637B (zh) * 2022-05-11 2022-08-05 腾讯科技(深圳)有限公司 一种数据处理方法、装置、设备以及可读存储介质
CN115205766A (zh) * 2022-09-16 2022-10-18 北京吉道尔科技有限公司 基于区块链的网络安全异常视频大数据检测方法及系统
CN116524394A (zh) * 2023-03-30 2023-08-01 北京百度网讯科技有限公司 视频检测方法、装置、设备以及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190364126A1 (en) * 2018-05-25 2019-11-28 Mark Todd Computer-implemented method, computer program product, and system for identifying and altering objectionable media content
CN110798703A (zh) * 2019-11-04 2020-02-14 云目未来科技(北京)有限公司 视频违规内容检测的方法、装置以及存储介质
CN111460267A (zh) * 2020-04-01 2020-07-28 腾讯科技(深圳)有限公司 对象识别方法、装置和系统
CN111858973A (zh) * 2020-07-30 2020-10-30 北京达佳互联信息技术有限公司 多媒体事件信息的检测方法、装置、服务器及存储介质
CN112257661A (zh) * 2020-11-11 2021-01-22 腾讯科技(深圳)有限公司 低俗图像的识别方法、装置、设备及计算机可读存储介质
CN113761235A (zh) * 2021-04-20 2021-12-07 腾讯科技(深圳)有限公司 一种多媒体内容的识别方法、相关装置、设备及存储介质

Also Published As

Publication number Publication date
US20230032728A1 (en) 2023-02-02
CN113761235A (zh) 2021-12-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22790958

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE