CN114330476A - Model training method for media content recognition and media content recognition method


Info

Publication number
CN114330476A
Authority
CN
China
Prior art keywords: media content, training, decision, model, sample set
Prior art date
Legal status
Pending
Application number
CN202111287497.6A
Other languages
Chinese (zh)
Inventor
王赟豪
陈少华
余亭浩
张绍明
侯昊迪
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111287497.6A
Publication of CN114330476A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a model training method for media content identification and a media content identification method. The method comprises the following steps: acquiring a media content sample set and a multi-dimensional feature set; when a decision tree to be trained is trained, determining a training sample set based on the media content sample set and determining a training feature set based on the multi-dimensional feature set; determining an optimal splitting rule meeting a preset splitting requirement based on the training sample set and the training feature set, and training the decision tree to be trained based on the optimal splitting rule to obtain a first decision tree; and repeating the steps from determining the training sample set through obtaining the first decision tree, and constructing at least two random forests based on a plurality of first decision trees to obtain a decision model. In the scenario of judging whether media content belongs to abnormal classified content, the method and the device construct a decision model comprising random forests for identification, improving the identification capability for media content with corresponding multi-dimensional features.

Description

Model training method for media content recognition and media content recognition method
Technical Field
The present application relates to the field of internet communication technologies, and in particular, to a model training method for media content recognition and a media content recognition method.
Background
With the development of internet communication technology, networks have become an important way for people to acquire and share information. The information acquisition form of the user can be active (for example, sending an information acquisition request to the server side through the client), or passive (for example, the client passively receives information actively pushed by the server side). To enhance the information reading experience for the user, the information is often presented in the form of media content indicating multimodal (multi-dimensional) features.
Whether the information is obtained actively or passively, the media content read by the user may belong to abnormal content (such as content that tends to cause user aversion and discomfort), which often affects the information reading experience of the user; therefore, abnormal content identification is required. In the related art, the media content is usually subjected to anomaly identification from each single-modality (single-dimension) angle, and the anomaly identification results from the individual single-modality angles are then integrated to determine whether the media content belongs to the anomalous content. As a result, the determination of abnormal content is often poor in accuracy and low in efficiency.
Disclosure of Invention
In order to solve the problems of poor accuracy and low efficiency when the prior art is applied to abnormal content identification of media content, the present application provides a model training method for media content identification and a media content identification method:
according to a first aspect of the present application, there is provided a model training method for media content recognition, the method comprising:
acquiring a media content sample set and a multi-dimensional feature set; the multi-dimensional feature set is constructed based on multi-dimensional features corresponding to each media content sample in the media content sample set, and the media content samples carry tags for judging whether the media content samples belong to abnormal classified content or not;
when a decision tree to be trained is trained, determining a training sample set participating in training based on the media content sample set, and determining a training feature set participating in training based on the multi-dimensional feature set;
determining an optimal splitting rule meeting a preset splitting requirement based on the training sample set and the training feature set, and training the decision tree to be trained based on the optimal splitting rule to obtain a first decision tree;
and repeating the steps of determining the training sample set participating in training until obtaining a first decision tree, and constructing at least two random forests based on a plurality of first decision trees to obtain a decision model.
According to a second aspect of the present application, there is provided a media content identification method, wherein the method comprises:
acquiring media content to be processed;
determining a multi-dimensional characteristic corresponding to the media content to be processed;
obtaining an identification result of the media content to be processed by using the multi-dimensional feature as an input and the decision model according to the first aspect; wherein the identification result indicates the relation between the media content to be processed and the abnormal classified content.
According to a third aspect of the present application, there is provided a model training apparatus for media content recognition, the apparatus comprising:
a set acquisition module: configured to acquire a media content sample set and a multi-dimensional feature set; the multi-dimensional feature set is constructed based on multi-dimensional features corresponding to each media content sample in the media content sample set, and the media content samples carry tags indicating whether the media content samples belong to abnormal classified content;
a set determination module: configured to, when a decision tree to be trained is trained, determine a training sample set participating in training based on the media content sample set, and determine a training feature set participating in training based on the multi-dimensional feature set;
a decision tree construction module: configured to determine an optimal splitting rule meeting a preset splitting requirement based on the training sample set and the training feature set, and to train the decision tree to be trained based on the optimal splitting rule to obtain a first decision tree;
a decision model building module: configured to repeat the steps from determining the training sample set participating in training through obtaining a first decision tree, and to construct at least two random forests based on a plurality of first decision trees to obtain a decision model.
According to a fourth aspect of the present application, there is provided a media content identification apparatus, the apparatus comprising:
a media content acquisition module: configured to acquire media content to be processed;
a feature determination module: configured to determine the multi-dimensional features corresponding to the media content to be processed;
an identification module: configured to obtain an identification result of the media content to be processed by taking the multi-dimensional features as input and using the decision model according to the first aspect; wherein the identification result indicates the relationship between the media content to be processed and the abnormal classified content.
According to a fifth aspect of the present application, there is provided an electronic device comprising a processor and a memory, wherein at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded by the processor and executed to implement the model training method for media content recognition according to the first aspect or the media content recognition method according to the second aspect.
According to a sixth aspect of the present application, there is provided a computer-readable storage medium having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement the method for model training for media content recognition according to the first aspect or the method for media content recognition according to the second aspect.
According to a seventh aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the model training method for media content recognition as described in the first aspect, or the media content recognition method as described in the second aspect.
The model training method for media content identification and the media content identification method provided by the application have the following technical effects:
the method comprises the steps of obtaining a media content sample set and a multi-dimensional feature set; then when a decision tree to be trained is trained, determining a training sample set participating in training based on the media content sample set, and determining a training feature set participating in training based on the multi-dimensional feature set; determining an optimal splitting rule meeting a preset splitting requirement based on a training sample set and a training characteristic set, and training a decision tree to be trained based on the optimal splitting rule to obtain a first decision tree; and repeating the steps of determining the training sample set participating in the training until obtaining the first decision tree, and constructing at least two random forests based on the plurality of first decision trees to obtain the decision model. The multi-dimensional feature set is constructed based on the multi-dimensional features corresponding to each media content sample in the media content sample set, and the media content samples carry tags for judging whether the media content samples belong to abnormal classified content. According to the method and the device, on the scene that whether the media content belongs to the abnormal classified content or not is judged, the decision model comprising the random forest is constructed to identify, and the identification accuracy and the identification efficiency of the media content corresponding to the multi-dimensional characteristics are improved. The application of the decision tree and the random forest can effectively capture the incidence relation among the features, and meanwhile, the random forest is also beneficial to reducing the overfitting risk. When the decision model obtained by training is used for identification, the identification adaptability of the multidimensional characteristics with form difference can be improved, and the reliability and effectiveness of media content identification can be greatly improved.
Drawings
In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present application or the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present application;
FIG. 2 is a flowchart illustrating a model training method for media content recognition according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of determining an optimal splitting rule that meets a preset splitting requirement based on a training sample set and a training feature set, and training a decision tree to be trained based on the optimal splitting rule, according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a method for constructing at least two random forests based on a plurality of first decision trees to obtain a decision model according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a media content identification method according to an embodiment of the present application;
FIG. 6 is a block diagram of a multimodal fusion module provided by an embodiment of the present application;
FIG. 7 is a block diagram illustrating a model training apparatus for media content recognition according to an embodiment of the present disclosure;
fig. 8 is a block diagram illustrating a media content recognition apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a blockchain system according to an embodiment of the present invention;
fig. 11 is a schematic diagram of a block structure according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of this application and the above-described drawings, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is a schematic diagram of an application environment provided by an embodiment of the present application. The application environment may include a client 10 and a server 20, and the client 10 and the server 20 may be connected directly or indirectly through wired or wireless communication. The user can send an information acquisition request to the server 20 through the client 10. The server 20 determines corresponding information to be returned based on the information acquisition request, then determines the multi-dimensional features corresponding to the information to be returned, and then processes the multi-dimensional features by using a preset decision model to obtain an identification result of the information to be returned, so as to determine whether the information to be returned belongs to abnormal classified content. If it does not, the information is returned to the client 10 directly; if it does, the information is returned to the client 10 after relevant processing is performed on it. It should be noted that fig. 1 is only an example.
The client may be an entity device of a type such as a smart phone, a computer (e.g., a desktop computer, a tablet computer, a laptop computer), an Augmented Reality (AR)/Virtual Reality (VR) device, a digital assistant, an intelligent voice interaction device (e.g., an intelligent sound box), an intelligent wearable device, an intelligent household appliance, a vehicle-mounted terminal, or may be software running in the entity device, such as a computer program. The operating system corresponding to the client may be an Android system (Android system), an iOS system (mobile operating system developed by apple inc.), a linux system (one operating system), a Microsoft Windows system (Microsoft Windows operating system), and the like.
The server side may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like. Which may include a network communication unit, a processor, and memory, among others. The server side can provide background services for the corresponding client side.
In practical applications, the preset decision model used above relates to the Computer Vision (CV) technology of artificial intelligence. Computer vision is a science that studies how to make a machine "see"; it uses cameras and computers instead of human eyes to perform machine vision tasks such as identification, tracking and measurement on a target, and further performs image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multi-dimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent transportation and other technologies, as well as common biometric identification technologies such as face recognition and fingerprint recognition.
The client 10 and the server 20 may be used to construct a system for media content identification, which may be a distributed system. Taking a blockchain system as an example of the distributed system, referring to fig. 10, fig. 10 is an optional structural schematic diagram of the distributed system 100 applied to the blockchain system provided in the embodiment of the present invention. The system is formed by a plurality of nodes (computing devices in any form in the access network, such as servers and user terminals) and clients; a Peer-to-Peer (P2P) network is formed between the nodes, and the P2P protocol is an application-layer protocol operating on top of the Transmission Control Protocol (TCP). In a distributed system, any machine, such as a server or a terminal, can join to become a node, and a node comprises a hardware layer, an intermediate layer, an operating system layer and an application layer.
Referring to the functions of each node in the blockchain system shown in fig. 10, the functions involved include:
1) routing, a basic function that a node has, is used to support communication between nodes.
Besides the routing function, the node may also have the following functions:
2) the application is used for being deployed in a block chain, realizing specific services according to actual service requirements, recording data related to the realization functions to form recording data, carrying a digital signature in the recording data to represent a source of task data, and sending the recording data to other nodes in the block chain system, so that the other nodes add the recording data to a temporary block when the source and integrity of the recording data are verified successfully.
For example, the services implemented by the application include:
2.1) a wallet, for providing functions of electronic money transactions, including initiating a transaction (i.e., sending the transaction record of the current transaction to other nodes in the blockchain system; after the other nodes verify it successfully, the record data of the transaction is stored in a temporary block of the blockchain as a response acknowledging that the transaction is valid); of course, the wallet also supports querying the electronic money remaining at an electronic money address;
and 2.2) sharing the account book, wherein the shared account book is used for providing functions of operations such as storage, query and modification of account data, record data of the operations on the account data are sent to other nodes in the block chain system, and after the other nodes verify the validity, the record data are stored in a temporary block as a response for acknowledging that the account data are valid, and confirmation can be sent to the node initiating the operations.
2.3) Intelligent contracts, computerized agreements, which can enforce the terms of a contract, implemented by codes deployed on a shared ledger for execution when certain conditions are met, for completing automated transactions according to actual business requirement codes, such as querying the logistics status of goods purchased by a buyer, transferring the buyer's electronic money to the merchant's address after the buyer signs for the goods; of course, smart contracts are not limited to executing contracts for trading, but may also execute contracts that process received information.
3) The blockchain comprises a series of blocks (Blocks) that are connected to one another in the chronological order of their generation; new blocks cannot be removed once they are added to the blockchain, and the blocks record the record data submitted by nodes in the blockchain system.
Referring to fig. 11, fig. 11 is an optional schematic diagram of a Block Structure (Block Structure) according to an embodiment of the present invention, where each Block includes a hash value of a transaction record stored in the Block (hash value of the Block) and a hash value of a previous Block, and the blocks are connected by the hash values to form a Block chain. The block may include information such as a time stamp at the time of block generation. A block chain (Blockchain), which is essentially a decentralized database, is a string of data blocks associated by using cryptography, and each data block contains related information for verifying the validity (anti-counterfeiting) of the information and generating a next block.
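As a minimal illustrative sketch of the hash-linked block structure described above (using Python's standard hashlib; the field names are hypothetical and not taken from the present application):

```python
import hashlib
import json
import time

def block_hash(block: dict) -> str:
    # Hash the block's contents (including the previous block's hash) so that
    # tampering with earlier record data invalidates every later block.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def new_block(records: list, prev_hash: str) -> dict:
    block = {
        "timestamp": time.time(),  # time stamp at block generation
        "records": records,        # record data submitted by nodes
        "prev_hash": prev_hash,    # hash value of the previous block
    }
    block["hash"] = block_hash(block)
    return block

genesis = new_block(["genesis record"], prev_hash="0" * 64)
second = new_block(["media-content recognition record"], prev_hash=genesis["hash"])
```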
The following describes a specific embodiment of a model training method for media content recognition, and fig. 2 is a flowchart of a model training method for media content recognition provided by the embodiment of the present application, and the present application provides the method operation steps described in the embodiment or the flowchart, but may include more or less operation steps based on conventional or non-creative labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In actual system or product execution, sequential execution or parallel execution (e.g., parallel processor or multi-threaded environment) may be possible according to the embodiments or methods shown in the figures. Specifically, as shown in fig. 2, the method may include:
s201: acquiring a media content sample set and a multi-dimensional feature set; the multi-dimensional feature set is constructed based on multi-dimensional features corresponding to each media content sample in the media content sample set, and the media content samples carry tags for judging whether the media content samples belong to abnormal classified content or not;
in the embodiment of the application, a server side obtains a media content sample set and a multi-dimensional feature set. The information may be presented in the form of media content indicative of multi-dimensional features, and a sample of the media content may include at least two dimensional objects: text, images, audio, probability values, and label vectors (label embedding), etc. The multi-dimensional feature set is derived from multi-dimensional features corresponding to respective media content samples. Illustratively, the media content sample set includes media content samples 1-2, the multidimensional features corresponding to the media content samples 1 are text features a and image features b, and the multidimensional features corresponding to the media content samples 2 are text features c, image features d and audio features e. Then the multi-dimensional feature set includes text feature a, image feature b, text feature c, image feature d, and audio feature e. Of course, media content indicative of multimodal features may also be presented as a stream of information, and a sample of the media content may then characterize the stream of information. Information streams are carriers of information, and the information streams may contain various modal information such as text, pictures, voice, video, etc., which may be processed as media content indicating multi-dimensional features. It will be appreciated that media content indicative of multi-dimensional features may characterize a single piece of information, or may characterize an information stream comprising multiple pieces of information.
The media content sample carries a label indicating whether it belongs to abnormal classified content. From this label it can be determined whether the media content sample is normal content or abnormal content and, when it is abnormal content, to which specific abnormal classification it belongs. The abnormal content can be content that tends to cause user aversion and discomfort, such as content containing abnormal characteristic information relating to horror, snakes, skin diseases and the like. The specific abnormal classification can be determined from the abnormal characteristic information, such as an abnormal classification indicating horror, an abnormal classification indicating snakes, and an abnormal classification indicating skin diseases. Of course, the specific abnormal classifications are not limited to the above, and they may be updated according to service requirements, for example by deleting an original abnormal classification, replacing an original abnormal classification, adding a new abnormal classification, and the like. The abnormal content may also be content that does not comply with laws and administrative regulations, violates public order and good morals, or impairs national interests, social public interests or the legitimate interests of third parties.
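As an illustrative sketch only (the class name, field names and feature dimensions below are hypothetical, not taken from the present application), a media content sample carrying multi-dimensional features and an abnormal-classification label might be represented as follows:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
import numpy as np

@dataclass
class MediaContentSample:
    # Multi-dimensional (multi-modal) features, e.g. text, image, audio vectors.
    features: Dict[str, np.ndarray]
    # None for normal content; otherwise the specific abnormal classification,
    # e.g. "horror", "snake", "skin_disease".
    abnormal_label: Optional[str] = None

    @property
    def is_abnormal(self) -> bool:
        return self.abnormal_label is not None

# A tiny two-sample set mirroring the example in the text: sample 1 carries
# text and image features, sample 2 carries text, image and audio features.
sample_set: List[MediaContentSample] = [
    MediaContentSample({"text": np.zeros(768), "image": np.zeros(2048)}, None),
    MediaContentSample({"text": np.zeros(768), "image": np.zeros(2048),
                        "audio": np.zeros(128)}, "horror"),
]
```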
S202: when a decision tree to be trained is trained, determining a training sample set participating in training based on the media content sample set, and determining a training feature set participating in training based on the multi-dimensional feature set;
in the embodiment of the application, when a decision tree to be trained is trained, a server side determines a training sample set participating in training based on a media content sample set, and determines a training feature set participating in training based on a multi-dimensional feature set.
Generally, each decision tree in the decision model is independent, and steps S202 and S203 here illustrate the process of training one decision tree. The training sample set used for training each decision tree is derived from the media content sample set, and the training feature set used is derived from the multi-dimensional feature set.
Considering that a random forest needs to be constructed by using the decision trees subsequently, when each decision tree is trained, the training sample set and the training feature set can be determined by combining the type of the random forest.
For a first type of random forest (ordinary random forest), in training each decision tree for constructing the first type of random forest, a plurality of media content samples participating in training can be randomly determined from a media content sample set to construct a training sample set, and a plurality of feature items participating in training can be randomly determined from a multi-dimensional feature set to construct a training feature set. The number of the media contents in the training sample set is less than or equal to the number of the media contents in the media content sample set; the number of feature items of the training feature set is smaller than that of the multi-dimensional feature set.
For the second type of random forest (extreme random forest), in training each decision tree for constructing the second type of random forest, the media content sample set can be determined as a training sample set, and one feature item participating in training is randomly determined from the multi-dimensional feature set to construct a training feature set.
Taking the multi-dimensional feature set including text feature a, image feature b, text feature c, image feature d and audio feature e as an example, if the text features are taken as one feature item, the image features as one feature item and the audio features as one feature item, then the training features may be determined from the text feature item, the image feature item and the audio feature item. Further, finer-grained feature items may also be employed. For example, text feature a and text feature c indicate different feature items: text feature a reflects text features in terms of word/phrase segmentation, and text feature c reflects text features in terms of emotion. Image feature b and image feature d indicate different feature items: image feature b reflects image features in terms of pixels, and image feature d reflects image features in terms of shapes. Then, the training features may be determined from the text feature item reflecting word/phrase segmentation, the text feature item reflecting emotion, the image feature item reflecting pixels, the image feature item reflecting shapes, and the audio feature item.
For the above "random determination" involved in constructing the training sample set and the training feature set, the "random determination" may take the form of a put-back in the case of a random forest. In practical application, for a first-class random forest (common random forest), if the number of feature items of a multi-dimensional feature set is s, the feature items can be randomly determined "
Figure BDA0003333673430000111
Individual characteristic itemTo construct a training feature set.
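The two sampling strategies above can be sketched as follows (a simplified illustration; the function names, the bootstrap size and the $\sqrt{s}$ heuristic as written here are assumptions rather than the claimed procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_for_ordinary_forest(n_samples: int, feature_items: list):
    # First-type (ordinary) random forest: bootstrap the media content samples
    # with replacement and randomly pick a subset of feature items
    # (here sqrt(s) of the s available items).
    sample_idx = rng.choice(n_samples, size=n_samples, replace=True)
    k = max(1, int(np.sqrt(len(feature_items))))
    train_features = list(rng.choice(feature_items, size=k, replace=False))
    return sample_idx, train_features

def sample_for_extreme_forest(n_samples: int, feature_items: list):
    # Second-type (extreme) random forest: use the whole media content sample
    # set and randomly pick a single feature item as its training feature set.
    sample_idx = np.arange(n_samples)
    train_features = [rng.choice(feature_items)]
    return sample_idx, train_features
```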
S203: determining an optimal splitting rule meeting a preset splitting requirement based on the training sample set and the training feature set, and training the decision tree to be trained based on the optimal splitting rule to obtain a first decision tree;
in the embodiment of the application, the server side determines an optimal splitting rule meeting the preset splitting requirement based on the training sample set and the training feature set, and trains the decision tree to be trained based on the optimal splitting rule to obtain the first decision tree. In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values. A decision tree is a tree-like structure in which each internal node represents a test on an attribute, each branch represents a test output, and each leaf node represents a category. When training of a decision tree to be trained is started, tree structure information (which may include depth limitation for the decision tree, number limitation for each layer of decision nodes in the decision tree, and the like) of the decision tree to be trained may be obtained, so as to initialize a root node to be split. The execution logic for starting feature splitting with the root node as the starting point is described with reference to the optimal splitting rule determined by the training sample set, the training feature set and the preset splitting requirement. For example, the optimal splitting rule may be determined once and globally, and after determining the optimal classification rule, the optimal classification rule and the tree structure information of the decision tree to be trained are fused to construct the first decision tree. At this time, the optimal splitting rule describes the content of constructing the splitting logic by using the target training features, the target training features indicate at least one training feature from the training feature set, the splitting logic indicates the execution sequence of at least one splitting condition (corresponding to the local optimal splitting rule described later), and the splitting condition corresponds to the decision nodes in the decision tree to be trained one to one. The optimal splitting rule may also indicate a plurality of locally optimal splitting rules, and after each locally optimal splitting rule is determined, the most recently determined locally optimal splitting rule and the latest tree structure information of the decision tree to be trained are fused to construct the first decision tree.
In an exemplary embodiment, the preset splitting requirement includes a preset splitting evaluation parameter and a preset splitting end condition, and the preset splitting evaluation parameter includes at least one of the following: the information gain and the Gini coefficient.
1. For the information gain:
The information gain is measured by entropy; the larger the information gain, the stronger the classification capability of the attribute. If the values of attribute A partition the sample set T into m subsets T_1, T_2, ..., T_m, the information gain is defined by Equation 1:
$$\mathrm{Gain}(A)=\mathrm{Entropy}(T)-\sum_{i=1}^{m}\frac{|T_i|}{|T|}\,\mathrm{Entropy}(T_i) \qquad (1)$$
where |T| is the number of samples in T, |T_i| is the number of samples in T_i, and Entropy(T_i) is computed according to Equation 2:
$$\mathrm{Entropy}(T)=-\sum_{j=1}^{s}\mathrm{freq}(C_j,T)\,\log_2\mathrm{freq}(C_j,T) \qquad (2)$$
where freq(C_j, T) is the frequency with which samples in T belong to category C_j, and s is the number of sub-categories of T.
2. For the Gini coefficient:
The Gini coefficient is suitable for attributes whose values are continuous numerical values. The specific idea is as follows: assuming that the data sample set T at a node t contains records of k classes, the Gini coefficient is defined by Equation 3:
$$\mathrm{Gini}(t)=1-\sum_{j=1}^{k}\left[p(j\mid t)\right]^2 \qquad (3)$$
where p(j|t) is the probability of category j at node t, so the value of the Gini coefficient is non-negative. If Gini(t) equals 0, all samples at the node belong to the same class, i.e., the amount of information obtained is maximum; if Gini(t) reaches its maximum, the amount of information obtained is minimum.
The information gain alone may be selected as the preset splitting evaluation parameter, the Gini coefficient alone may be selected, or a combination of the information gain and the Gini coefficient may be selected. For the combination of the information gain and the Gini coefficient, a weight of the information gain for split evaluation and a weight of the Gini coefficient for split evaluation may be set respectively. Of course, the preset splitting evaluation parameter may also include the information gain ratio and the like.
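The split evaluation parameters above can be sketched as follows (a simplified illustration; the weighted combination shown at the end, including its default weights, is an assumption):

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    # Equation 2: -sum_j freq(C_j, T) * log2 freq(C_j, T)
    _, counts = np.unique(labels, return_counts=True)
    freq = counts / counts.sum()
    return float(-np.sum(freq * np.log2(freq)))

def information_gain(labels: np.ndarray, subsets: list) -> float:
    # Equation 1: Entropy(T) - sum_i |T_i|/|T| * Entropy(T_i)
    total = len(labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(labels) - weighted

def gini(labels: np.ndarray) -> float:
    # Equation 3: 1 - sum_j p(j|t)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def combined_score(labels, subsets, w_gain=0.5, w_gini=0.5):
    # Assumed weighted combination of information gain and (negated) weighted
    # Gini impurity after the split; higher is better.
    split_gini = sum(len(s) / len(labels) * gini(s) for s in subsets)
    return w_gain * information_gain(labels, subsets) - w_gini * split_gini
```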
The preset split end condition may be derived from tree structure information of the decision tree to be trained, such as depth limit information for the decision tree. The predetermined splitting termination condition may also be an upper limit or a lower limit depending on the value of the predetermined splitting evaluation parameter.
Correspondingly, referring to fig. 3, the determining an optimal splitting rule meeting a preset splitting requirement based on the training sample set and the training feature set, and training the decision tree to be trained based on the optimal splitting rule include:
s301: selecting training features indicating current local optimal splitting from the training feature set, and generating a local optimal splitting rule based on the selected training features; wherein the current locally optimal split is determined based on the training sample set and the preset split evaluation parameter;
s302: establishing an incidence relation between the local optimal splitting rule and a current decision node to guide the current decision node to perform feature splitting; wherein, the current decision node is a child node of a decision node which is last subjected to feature splitting in the decision tree to be trained;
s303: and repeating the steps from the selection of the training features indicating the current local optimal splitting to the guidance of the current decision node for feature splitting until the preset splitting ending condition is met so as to train the decision tree to be trained.
With reference to the above introduction of the information gain and the Gini coefficient: when the information gain is used, the current locally optimal split indicates the training feature whose information gain is maximal when it serves as the splitting feature; when the Gini coefficient is used, it indicates the training feature whose Gini coefficient is minimal when it serves as the splitting feature. Generating the locally optimal splitting rule based on the selected training feature is equivalent to generating the splitting condition to be executed based on the selected training feature.
If the root node is in the to-be-split state, the root node is the current decision node. If the root node has already been split, a child node or sibling node of the decision node that most recently underwent feature splitting is the current decision node. The locally optimal splitting rule is associated with the current decision node so as to guide the current decision node to perform feature splitting. The most recently determined locally optimal splitting rule is continuously fused with the latest tree structure information of the decision tree to be trained to construct the first decision tree. Compared with determining a single globally optimal splitting rule, this process of constructing the first decision tree may reduce the risk of over-fitting and may also reduce the training time.
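A compact sketch of steps S301 to S303 (a greedy, locally optimal split per decision node; this is a simplified illustration using the Gini coefficient rather than the claimed method, and all names are illustrative):

```python
import numpy as np

def _gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, feature_ids):
    # S301: among the candidate training features, pick the (feature, threshold)
    # whose split yields the lowest weighted Gini impurity, i.e. the current
    # locally optimal split.
    best = (None, None, np.inf)
    for f in feature_ids:
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * _gini(left) + len(right) * _gini(right)) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best

def grow_tree(X, y, feature_ids, depth=0, max_depth=5):
    # S303: stop when the preset split-end condition (here a depth limit or a
    # pure node) is met; otherwise S302: split the current decision node using
    # the locally optimal rule and recurse into its child nodes.
    if depth >= max_depth or len(np.unique(y)) == 1:
        vals, counts = np.unique(y, return_counts=True)
        return {"leaf": vals[np.argmax(counts)]}
    f, t, _ = best_split(X, y, feature_ids)
    if f is None:
        vals, counts = np.unique(y, return_counts=True)
        return {"leaf": vals[np.argmax(counts)]}
    mask = X[:, f] <= t
    return {"feature": f, "threshold": t,  # the locally optimal splitting rule
            "left": grow_tree(X[mask], y[mask], feature_ids, depth + 1, max_depth),
            "right": grow_tree(X[~mask], y[~mask], feature_ids, depth + 1, max_depth)}
```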
S204: and repeating the steps of determining the training sample set participating in training until obtaining a first decision tree, and constructing at least two random forests based on a plurality of first decision trees to obtain a decision model.
In the embodiment of the application, the server repeats the steps of determining the training sample set participating in the training until obtaining the first decision tree, and constructs at least two random forests based on the plurality of first decision trees to obtain the decision model. Training processes of the decision trees are independent, and after the training of the decision trees is finished, random forests can be constructed based on the trained first decision trees, so that a decision model formed by at least two random forests is obtained. In practical applications, two first-type random forests and two second-type random forests may be constructed, each first-type random forest comprising 500 decision trees, and each second-type random forest comprising 500 decision trees.
Random forests have favorable characteristics: they can handle both discrete and continuous feature inputs, can handle high-dimensional data classification and prediction problems in parallel, have good tolerance to outliers and noise, and have high prediction accuracy. A decision model constructed from at least two random forests can be regarded as classifying the plurality of trained first decision trees once again, which improves the internal architecture of the decision model and better ensures the recognition reliability and stability of the decision model for media content with corresponding multi-dimensional features.
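Assuming that scikit-learn's RandomForestClassifier and ExtraTreesClassifier are acceptable stand-ins for the first-type and second-type random forests, a flat decision model of this kind might be sketched as follows:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

def build_flat_decision_model(n_trees=500, seed=0):
    # Two first-type (ordinary) and two second-type (extreme) random forests,
    # each containing n_trees first decision trees, as described above.
    return [
        RandomForestClassifier(n_estimators=n_trees, random_state=seed),
        RandomForestClassifier(n_estimators=n_trees, random_state=seed + 1),
        ExtraTreesClassifier(n_estimators=n_trees, random_state=seed + 2),
        ExtraTreesClassifier(n_estimators=n_trees, random_state=seed + 3),
    ]

def fit_decision_model(forests, X, y):
    for forest in forests:
        forest.fit(X, y)
    return forests

def predict_decision_model(forests, X):
    # The recognition result is the global average of the class probabilities
    # produced by the individual random forests.
    return np.mean([forest.predict_proba(X) for forest in forests], axis=0)
```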
Further, in addition to at least two random forests at a single flat level, the internal architecture of the decision model may also contain random forests at different levels. As shown in fig. 4, the constructing at least two random forests based on a plurality of the first decision trees to obtain a decision model includes:
s401: constructing a plurality of first random forests based on a plurality of the first decision trees;
s402: constructing a first-layer classification structure based on the plurality of first random forests;
s403: when a non-first-layer classification structure is constructed, a target splitting rule is determined based on an output result of a target training sample set and associated input features by utilizing a previous-layer classification structure, and a second decision tree for constructing an adjacent next-layer classification structure is obtained based on the target splitting rule training; wherein the target set of training samples is determined based on the set of media content samples, the neighboring next-level classification structure is constructed based on a plurality of second random forests constructed from a plurality of the second decision trees;
s404: and fusing the first-layer classification structure and at least one non-first-layer classification structure to obtain the decision model.
The first random forest may be constructed directly by using the first decision tree obtained through the foregoing steps S202 to S203, and then the first-level classification structure of the decision model may be constructed based on a plurality of first random forests. The first random forest in the first-layer classification structure can be the same type of random forest or different types of random forests.
Compared with the first decision tree, the data sources selected for training the sample set and the training feature set to obtain the second decision tree for training are as follows: output results and associated input features for a target training sample set using a previous level classification structure, wherein the target training sample set is determined based on the media content sample set. Taking the media content sample set as a target training sample set as an example, the data sources selected for training the training sample set to obtain the second decision tree are as follows: a media content sample set, and an output result (such as a probability value belonging to a specific abnormal classification and a probability value of normal content) aiming at a target training sample set by using a previous layer classification structure; the data sources selected for training the feature set to obtain the second decision tree for training are: multi-dimensional feature sets and output results (such as probability values belonging to specific abnormal classes and probability values of normal contents) aiming at the target training sample set by utilizing the upper-layer classification structure. Reference may be made to the process of training one decision tree in the foregoing steps S202 to S203 to train a second decision tree for constructing a non-first-layer structure, which is not described in detail herein.
An Nth random forest can be constructed based on a plurality of trained Nth decision trees, and then an Nth-layer classification structure of the decision model can be constructed based on a plurality of Nth random forests. The Nth random forests in the Nth-layer classification structure can be random forests of the same type or of different types, where N is an integer greater than or equal to 2. It should be noted that the second decision tree mentioned in step S403 is only named to distinguish it from the aforementioned first decision tree, and is not limited to being used for constructing the second-layer classification structure.
Through the above steps, a decision model comprising a multi-layer classification structure is constructed, where each layer of the classification structure comprises a plurality of random forests. In this way, differences between features of different dimensions can be eliminated by the layer-by-layer processing of the deep random forest structure, ensuring the effectiveness of the recognition result obtained with the decision model. Because the decision model is processed layer by layer, the integration of random forests in each layer does not act as a simple classifier or feature extraction structure, and can handle differences between features of different dimensions better than a fully-connected classifier.
In practical applications, referring to fig. 6, four random forests may be used for each layer of the classification structure in the decision model, including two first-type random forests and two second-type random forests, each random forest containing 500 decision trees. That is, there are a total of 2000 decision trees per classification layer. When the decision model is applied, the input of the next layer is the probability values generated by the previous layer spliced with the features of the previous layer, and finally the probability values generated by the random forests of the last layer are globally averaged to produce the final prediction probability. The recognition result of the decision model is the average of the scores (probabilities) for the classes generated by each random forest of the last layer.
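A minimal sketch of this layer-by-layer (cascade) structure, again assuming scikit-learn forests as stand-ins; a production implementation would normally generate the per-layer probabilities with out-of-fold predictions, which is omitted here for brevity:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

def make_layer(n_trees=500, seed=0):
    return [RandomForestClassifier(n_estimators=n_trees, random_state=seed),
            RandomForestClassifier(n_estimators=n_trees, random_state=seed + 1),
            ExtraTreesClassifier(n_estimators=n_trees, random_state=seed + 2),
            ExtraTreesClassifier(n_estimators=n_trees, random_state=seed + 3)]

def fit_cascade(X, y, n_layers=3):
    layers, layer_input = [], X
    for i in range(n_layers):
        layer = make_layer(seed=10 * i)
        probas = []
        for forest in layer:
            forest.fit(layer_input, y)
            probas.append(forest.predict_proba(layer_input))
        layers.append(layer)
        # Next layer's input: the previous layer's probability values spliced
        # with the previous layer's features.
        layer_input = np.hstack([layer_input] + probas)
    return layers

def predict_cascade(layers, X):
    layer_input = X
    for i, layer in enumerate(layers):
        probas = [forest.predict_proba(layer_input) for forest in layer]
        if i == len(layers) - 1:
            # Final prediction: global average over the last layer's forests.
            return np.mean(probas, axis=0)
        layer_input = np.hstack([layer_input] + probas)
```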
In addition, the decision model can be used as a depth forest fusion module to construct a multi-modal fusion module together with the multi-modal feature extraction module. For the multi-modal feature extraction module, high-dimensional features of an image modality and a text modality can be extracted through a deep neural network in combination with corresponding global or average pooling operations.
For the depth-based forest fusion module, random forests with different tree structures, tree depths, forest widths, forest depths and different classification heads are designed and combined.
Referring to fig. 6, the overall structure of the multimodal fusion module mainly consists of three modules: an image-side BiT module, a text-side BERT module and a depth forest fusion module. The image-side BiT module adopts a BiT model (a transfer-learning model) as the pre-training model; the BiT model optimizes pre-training and uses a larger-scale pre-training corpus. In the pre-training stage, GN (group normalization) plus weight standardization replaces BN (batch normalization), reducing the influence of the batch size on training; a HyperRule mechanism is also provided to reduce the hyper-parameter tuning work in the fine-tuning stage. Through this pre-training optimization, the representation capability of the BiT model is greatly improved, and a good effect can be achieved with only a few labeled samples when fine-tuning on a downstream task.
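A small PyTorch sketch of the GN plus weight standardization substitution mentioned above (illustrative only; the real BiT backbone is much larger and its exact layer definitions are not reproduced here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with weight standardization: weights are normalized to zero mean
    and unit variance per output channel before the convolution."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

def gn_ws_block(in_ch, out_ch, groups=32):
    # GroupNorm replaces BatchNorm so the statistics do not depend on the batch
    # size; out_ch should be divisible by `groups`.
    return nn.Sequential(
        WSConv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.GroupNorm(groups, out_ch),
        nn.ReLU(inplace=True),
    )
```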
For the title-side BERT module, a BERT model (a Transformer-based bidirectional encoder representation model) is adopted as the pre-training model; BERT has achieved good results in many NLP (natural language processing) tasks. The BERT model uses a bidirectional Transformer encoder structure to train two tasks on a large-scale unsupervised corpus: a) the Masked LM task: 15% of the tokens in the corpus are randomly masked, of which 80% are replaced with the [MASK] token, 10% are replaced with arbitrary tokens, and the remaining 10% are kept unchanged. The model needs to predict the original tokens at the masked positions from the context semantics. b) The Next Sentence Prediction task: given two sentences A and B, where B has a 50% probability of being the sentence that follows A, the model needs to predict whether B follows A. By training these two tasks on a large-scale unsupervised corpus, the BERT model learns sufficient contextual semantic features; since no labeled data is needed in the pre-training stage, a huge unsupervised corpus can be collected for pre-training. The representation capability of the pre-trained model is greatly improved, and a good effect can be obtained with only a small number of labeled samples when fine-tuning the BERT model on a downstream task.
For the depth forest fusion module, the title-side features and the image-side features need to be fused so that the fused vector contains the features of both modalities. The input of this module is the high-dimensional multi-modal feature vector. The first type of random forest and the second type of random forest may both be used in the module. For the second type of random forest, there is no need to introduce randomness on the samples (corresponding to the selected training samples), because the feature level (corresponding to the selected training features) already has sufficient randomness. Randomness and diversity are beneficial to ensemble learning; given that the overall structure is a forest, different types of trees are used to promote randomness and diversity.
A two-stage training approach can be adopted: first, the classification tasks are trained separately under each single modality, that is, an image-modality model (corresponding to the preset image feature extraction model described later) is trained based on the BiT model to obtain 2048-dimensional image features, and a text-modality model (corresponding to the preset text feature extraction model) is trained based on the BERT model to obtain 768-dimensional text features. After the single-modality training is completed, the single-modality models are fixed, and the fusion model based on the deep forest (corresponding to the decision model) is then trained. The advantage of this approach is that the two modalities can each be trained with large-scale single-modality data, so that the single-modality models learn sufficient semantic features; the deep forest fusion model is then retrained on small-scale joint data to learn the fusion features.
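A sketch of the two-stage scheme, assuming the single-modality extractors are already trained and frozen; the Hugging Face transformers usage and the "bert-base-chinese" checkpoint are assumptions, and the image-side extractor is left as a placeholder:

```python
import numpy as np
import torch
from transformers import BertModel, BertTokenizer

# Stage 1 (assumed already done): single-modality models are trained and frozen.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text_model = BertModel.from_pretrained("bert-base-chinese").eval()

def text_features(title: str) -> np.ndarray:
    # 768-dimensional text feature from the frozen BERT-side model.
    with torch.no_grad():
        inputs = tokenizer(title, return_tensors="pt", truncation=True)
        return text_model(**inputs).pooler_output[0].numpy()

def image_features(image) -> np.ndarray:
    # Placeholder for the frozen BiT-side model producing a 2048-d vector.
    raise NotImplementedError

def fused_features(title, image) -> np.ndarray:
    # Stage 2: concatenate the two modality vectors (768 + 2048 dims) and feed
    # the result to the deep forest fusion module for training/inference.
    return np.concatenate([text_features(title), image_features(image)])
```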
When performing content recognition, the multi-modal fusion module provided in the embodiment of the present application combines different model units with a well-designed network structure, can effectively identify multi-modal, high-dimensional discomfort-inducing content, and has high recognition accuracy. In experiments, the misjudgment rate of the recognition effect is 0.3% and the recall rate is 70%. Furthermore, in the deep forest structure, different optimizations can be performed according to the business scenario, and the base classifier can use an LR (logistic regression) classifier, a LightGBM-based classifier (an algorithm framework), and the like, in addition to decision trees and random forests. Of course, the number and width of the deep forest layers and the features of each layer can also be adjusted according to the business scenario.
According to the technical solution provided in the embodiment of the present application, in the scenario of judging whether media content belongs to abnormal classified content, a decision model comprising random forests is constructed for recognition, which improves the recognition accuracy and recognition efficiency for media content with corresponding multi-dimensional features. The application of decision trees and random forests can effectively capture the association relationships among features, and the random forests also help reduce the risk of over-fitting.
While specific embodiments of a media content identification method of the present application are described below, fig. 5 is a flowchart of a media content identification method provided by embodiments of the present application, which provides the method operation steps described in the embodiments or flowcharts, but may include more or less operation steps based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In actual system or product execution, sequential execution or parallel execution (e.g., parallel processor or multi-threaded environment) may be possible according to the embodiments or methods shown in the figures. Specifically, as shown in fig. 5, the method may include:
s501: acquiring media content to be processed;
in the embodiment of the application, the server side obtains the media content to be processed. The server side can respond to the information acquisition request sent by the client side, and determine corresponding information to be returned or information flow to be returned according to the information acquisition request. The information to be returned or the information stream to be returned is a data source of the media content to be processed. The server side can also actively push information or information streams to the user, and the information to be pushed or the information streams to be pushed are data sources of the media content to be processed. For understanding the media content to be processed, reference may be made to the related introduction of the media content sample in step S201, and details are not described again. Here, the server side that is the execution subject of steps S501 to S503 may be the same as or different from the server side related to steps S201 to S204.
S502: determining a multi-dimensional characteristic corresponding to the media content to be processed;
in the embodiment of the application, the server determines the multi-dimensional characteristics corresponding to the media content to be processed. For understanding the "multidimensional characteristics corresponding to the media content to be processed", reference may be made to the related introduction of the "multidimensional characteristics corresponding to the media content sample" in step S201, and details are not repeated.
Image-class features corresponding to the media content to be processed can be extracted by using a preset image feature extraction model, and text-class features corresponding to the media content to be processed can be extracted by using a preset text feature extraction model; the multi-dimensional features are then obtained based on the image-class features and the text-class features. The preset image feature extraction model is obtained by training with a transfer-learning model as the initial model, and the preset text feature extraction model is obtained by training with a Transformer-based bidirectional encoder representation model as the initial model. For understanding the "preset image feature extraction model" and the "preset text feature extraction model", reference may be made to the related descriptions of the "image modality model" and the "text modality model" in the foregoing step S204, which are not repeated here. Using trained models for feature extraction ensures the efficiency and accuracy of feature extraction, and can thus improve the efficiency and accuracy of identifying the media content to be processed.
S503: obtaining the identification result of the media content to be processed by using the multi-dimensional characteristics as input and the decision model in the steps S201-S204; wherein the identification result indicates the relation between the media content to be processed and the abnormal classified content.
In the embodiment of the present application, the server obtains the recognition result of the media content to be processed by using the decision model in the foregoing steps S201 to S204 with the multi-dimensional features as input. The relationship between the media content to be processed and the abnormal classified content indicated by the identification result may be: 1) no relation, in which case the media content to be processed is normal content rather than abnormal content; 2) a relation exists, in which case the media content to be processed belongs to some specific abnormal classification. The media content to be processed may belong to at least one specific abnormal classification. In practical applications, whether a relation exists can be judged by a preset threshold. Taking the existence of 10 specific abnormal classifications as an example, the identification result includes the probabilities that the multi-dimensional features corresponding to the media content to be processed belong to the 10 specific abnormal classifications respectively, and only when a probability is greater than or equal to the preset threshold does it indicate that the media content to be processed is related to the corresponding abnormal classified content. For understanding the "abnormal content" and the "specific abnormal classification", reference may be made to the related description in the foregoing step S201, and details are not repeated. Of course, referring to fig. 6, the media content to be processed may also be used as an input to the multi-modal fusion module described in the foregoing step S204, and the recognition result of the media content to be processed can be obtained by using the multi-modal fusion module.
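As a minimal sketch of the threshold-based judgment described above (the threshold value of 0.5 and the helper name resolve_recognition_result are illustrative assumptions, not part of the present application):

```python
import numpy as np

def resolve_recognition_result(class_probs, threshold=0.5):
    """class_probs: the probabilities that the multi-dimensional features
    belong to each specific abnormal classification (e.g. 10 values).
    A class is reported only when its probability reaches the preset
    threshold; otherwise the media content is treated as normal content."""
    class_probs = np.asarray(class_probs)
    hit_classes = np.flatnonzero(class_probs >= threshold)
    return {"abnormal": hit_classes.size > 0,
            "classes": hit_classes.tolist()}
```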
In an exemplary embodiment, the process of obtaining the identification result of the media content to be processed includes the following steps: first, predicting the multi-dimensional features by respectively using at least two random forests in the decision model; then, generating the recognition result of the media content to be processed based on the prediction result of each random forest; wherein each prediction result is determined based on the classification results of the decision trees in the random forest, and each classification result is determined by traversing the decision tree from its root node based on the multi-dimensional features until a corresponding leaf node is reached.
It will be appreciated that the decision model includes at least two random forests at the same (flat) level, and the recognition result is taken from the global average of the class probabilities generated by the random forests. The class probability generated by each random forest is in turn taken from the global average of the classification results generated by the decision trees in that random forest. Taking the existence of 10 specific abnormal classifications as an example, the classification result generated by one decision tree includes the probabilities that the multi-dimensional features corresponding to the media content to be processed belong to the 10 specific abnormal classifications respectively. The classification results generated by the decision trees are globally averaged to obtain the prediction result of the random forest, which includes the probabilities (namely class probabilities) that the multi-dimensional features corresponding to the media content to be processed belong to the 10 specific abnormal classifications respectively. The prediction results generated by the random forests are globally averaged to obtain the recognition result of the decision model.
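The two-level averaging can be sketched as follows in Python; the sketch assumes that each decision tree exposes a scikit-learn-style predict_proba method and that all trees share the same class ordering (these assumptions, like the function names, are illustrative rather than mandated by the present application).

```python
import numpy as np

def forest_predict(forest, features):
    """Globally average the per-class probabilities produced by every
    decision tree in one random forest."""
    tree_probs = np.stack([tree.predict_proba(features.reshape(1, -1))[0]
                           for tree in forest])
    return tree_probs.mean(axis=0)            # class probabilities of this forest

def model_predict(forests, features):
    """Globally average the class probabilities of all random forests in the
    decision model to obtain the recognition result."""
    forest_probs = np.stack([forest_predict(forest, features)
                             for forest in forests])
    return forest_probs.mean(axis=0)          # recognition result of the decision model
```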
In the related art, the media content is subjected to abnormality recognition from each single-modal (single-dimensional) angle, and whether the media content belongs to abnormal content is judged by integrating the abnormality recognition of each single-modal angle. Compared with the related art, the present application introduces decision trees to perform probability-dimension modeling on the features. With the decision tree as a classifier, the decision process can be regarded as a probability modeling process: starting from the root node, a feature of a certain dimension is tested at each internal node of the tree, and the branch to enter is determined according to the test result until a leaf node is reached, thereby obtaining a classification result. Probability relationships among nodes can be established in the feature processing process, and the association relationships among features can be effectively captured, so that the obtained classification result has high precision. For multi-dimensional features, using random forests (especially multiple random forests) can reduce the risk of overfitting. Generally, a wrong prediction is made only if more than half of the base classifiers (decision trees) are wrong. The random forest is also very stable: even if a new data point appears in the data set, the whole algorithm is hardly affected, because the new point typically affects only one decision tree rather than all decision trees.
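The root-to-leaf decision process described above can be illustrated with a minimal, self-contained sketch; the Node class and the "less-than-or-equal-to-threshold goes left" split convention are assumptions made for illustration, and the present application does not prescribe a particular node layout.

```python
class Node:
    """Internal nodes test one feature dimension; leaves hold the class
    probabilities estimated from the training samples that reached them."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, class_probs=None):
        self.feature = feature          # index of the feature dimension to test
        self.threshold = threshold      # split point of the splitting rule
        self.left = left                # branch for feature value <= threshold
        self.right = right              # branch for feature value > threshold
        self.class_probs = class_probs  # set only on leaf nodes

def classify(root, features):
    """Walk from the root node to a leaf, choosing a branch at each internal
    node according to the tested feature dimension."""
    node = root
    while node.class_probs is None:     # not yet a leaf
        if features[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.class_probs             # classification result of this tree
```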
In an exemplary embodiment, the process of obtaining the identification result of the media content to be processed includes the following step: processing the multi-dimensional features layer by layer by using at least two layers of classification structures in the decision model to obtain the recognition result of the media content to be processed; wherein the input of a next-layer classification structure is determined by the output and the input of the adjacent previous-layer classification structure, the output of each layer of classification structure is determined based on the prediction result of each random forest in that layer, each prediction result is determined based on the classification results of the decision trees in the random forest, and each classification result is determined by traversing the decision tree from its root node based on the multi-dimensional features until a corresponding leaf node is reached.
It is to be understood that the decision model includes random forests at different layers, and the recognition result is taken from the global average of the class probabilities generated by the random forests in the last layer. The class probability generated by each random forest of the last layer is in turn taken from the global average of the classification results generated by the decision trees in that random forest. The differences in the input feature distributions of different modalities are unfavorable for classifier learning: features of different modalities are obtained from different source data forms and different model structures and have different feature dimensions, and these differences make learning difficult for the classifier and hinder normal convergence of the model. Layer-by-layer processing with random forests arranged in depth can eliminate the differences among features of different dimensions and ensure the effectiveness of the recognition result obtained by using the decision model.
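A minimal sketch of such layer-by-layer (cascade) processing is given below. It assumes each layer is a list of fitted scikit-learn RandomForestClassifier instances and that the forests of every non-first layer were trained on the original features concatenated with the previous layer's class probabilities; these structural choices are assumptions for illustration only.

```python
import numpy as np

def cascade_predict(layers, features):
    """layers: a list of classification structures; each structure is a list
    of fitted random forests (e.g. sklearn.ensemble.RandomForestClassifier)."""
    raw = features.reshape(1, -1)
    layer_input = raw
    layer_probs = None
    for layer in layers:
        layer_probs = [forest.predict_proba(layer_input)[0] for forest in layer]
        # the adjacent next layer sees the previous layer's output plus the raw input
        layer_input = np.hstack([raw, np.concatenate(layer_probs).reshape(1, -1)])
    return np.mean(layer_probs, axis=0)       # global average over the last layer
```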
In an exemplary embodiment, the method further includes: when the media content to be processed belongs to the abnormal classified content, performing abnormal marking on the media content to be processed; wherein the exception flag is used for indicating that the recommendation weight of the media content to be processed is to be reduced, or that the media content to be processed is to be excluded from being used as a candidate cover.
In a personalized recommendation scene, if the media content to be processed, as candidate recommendation content, belongs to the abnormal classified content, it is subjected to weight-reduction processing to lower its exposure rate, so that users are spared a poor experience of the related internet product. In a cover selection scene, if the media content to be processed, as a candidate cover, belongs to the abnormal classified content, it is filtered out so that it is not used as the cover, avoiding an adverse effect on the display of the characterized object. Illustratively, the media content to be processed is a candidate cover of a theme video album, and the theme video album is the object that the media content to be processed would characterize.
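By way of example only, the down-weighting and cover-filtering logic could look like the following sketch; the 0.1 weight-reduction factor, the dictionary fields and the recognize callback are hypothetical and merely illustrate how the exception flag might be consumed downstream.

```python
def apply_exception_flags(recommendation_pool, cover_pool, recognize):
    """recognize(content) returns a recognition result with an 'abnormal'
    field, e.g. as sketched earlier.  Flagged recommendation candidates are
    down-weighted to reduce their exposure rate; flagged cover candidates
    are filtered out entirely."""
    for item in recommendation_pool:
        if recognize(item["content"])["abnormal"]:
            item["weight"] *= 0.1             # illustrative weight-reduction factor
    return [cover for cover in cover_pool
            if not recognize(cover["content"])["abnormal"]]
```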
According to the technical solution provided by the embodiments of the present application, when the trained decision model is used for identification, the adaptability of identification to multi-dimensional features with differences in form can be improved, and the reliability and effectiveness of media content identification can be greatly improved. In service scenes of providing information (such as the personalized recommendation scene and the information-flow content service scene), effective and accurate identification of abnormal content ensures information quality and thus improves user experience.
The embodiment of the present application further provides a model training apparatus for media content recognition, and as shown in fig. 7, the model training apparatus 70 for media content recognition includes:
the set acquisition module 701: configured to acquire a media content sample set and a multi-dimensional feature set; wherein the multi-dimensional feature set is constructed based on multi-dimensional features corresponding to each media content sample in the media content sample set, and the media content samples carry tags for judging whether the media content samples belong to abnormal classified content or not;
the set determination module 702: configured to, when a decision tree to be trained is trained, determine a training sample set participating in training based on the media content sample set and determine a training feature set participating in training based on the multi-dimensional feature set;
the decision tree construction module 703: configured to determine an optimal splitting rule meeting a preset splitting requirement based on the training sample set and the training feature set, and train the decision tree to be trained based on the optimal splitting rule to obtain a first decision tree;
the decision model construction module 704: configured to repeat the steps from determining the training sample set participating in training to obtaining a first decision tree, and construct at least two random forests based on a plurality of first decision trees to obtain a decision model.
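A minimal sketch of the training flow carried out by these modules, using scikit-learn decision trees as the base classifiers, is as follows; the bootstrap sampling scheme, the feature fraction, the number of forests and trees, and the function name train_decision_model are assumptions made for illustration and are not fixed by the present application.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_decision_model(X, y, n_forests=2, trees_per_forest=50,
                         feature_frac=0.5, criterion="gini", seed=0):
    """X: array of multi-dimensional features, one row per media content sample;
    y: labels indicating whether each sample belongs to abnormal classified content.
    Each first decision tree is fitted on a bootstrap training sample set and a
    random training feature set; the trees are then grouped into at least two
    random forests to form the decision model."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    k = max(1, int(feature_frac * n_features))
    forests = []
    for _ in range(n_forests):
        forest = []
        for _ in range(trees_per_forest):
            sample_idx = rng.integers(0, n_samples, size=n_samples)      # bootstrap samples
            feature_idx = rng.choice(n_features, size=k, replace=False)  # feature subset
            # criterion="gini" uses the Gini coefficient, "entropy" the information gain
            tree = DecisionTreeClassifier(criterion=criterion)
            tree.fit(X[np.ix_(sample_idx, feature_idx)], y[sample_idx])
            forest.append((tree, feature_idx))   # remember which features the tree saw
        forests.append(forest)
    return forests
```

Prediction with such a model would then follow the averaging scheme sketched for the identification method above, applying each tree only to its remembered feature subset.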
It should be noted that the apparatus embodiment and the method embodiment in the embodiments of the present application are based on the same inventive concept.
An embodiment of the present application further provides a media content identification apparatus, as shown in fig. 8, where the media content identification apparatus 80 includes:
the media content acquisition module 801: configured to acquire media content to be processed;
the feature determination module 802: configured to determine the multi-dimensional features corresponding to the media content to be processed;
the identification module 803: configured to obtain the recognition result of the media content to be processed by using the decision model of the foregoing steps S201 to S204 with the multi-dimensional features as input; wherein the recognition result indicates the relation between the media content to be processed and the abnormal classified content.
It should be noted that the apparatus embodiment and the method embodiment in the embodiments of the present application are based on the same inventive concept.
The embodiment of the present application provides an electronic device, which includes a processor and a memory, where at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the model training method for media content recognition or the media content recognition method provided in the above method embodiment.
Further, fig. 9 is a schematic diagram of a hardware structure of an electronic device for implementing the model training method for media content recognition or the media content recognition method provided in the embodiments of the present application, and the electronic device may participate in forming or include the model training apparatus for media content recognition or the media content identification apparatus provided in the embodiments of the present application. As shown in fig. 9, the electronic device 90 may include one or more processors 902 (shown here as 902a, 902b, …, 902n; the processors 902 may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 904 for storing data, and a transmission device 906 for communication functions. In addition, the electronic device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 9 is only an illustration and is not intended to limit the structure of the electronic device. For example, the electronic device 90 may also include more or fewer components than shown in fig. 9, or have a different configuration than shown in fig. 9.
It should be noted that the one or more processors 902 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the electronic device 90 (or mobile device). As referred to in the embodiments of the present application, the data processing circuitry may act as a kind of processor control (for example, the selection of a variable-resistance termination path connected to an interface).
The memory 904 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the model training method for media content recognition described in the embodiments of the present application, or program instructions/data storage devices corresponding to the media content recognition method, and the processor 902 executes various functional applications and data processing by running the software programs and modules stored in the memory 904, so as to implement one of the above-described model training methods for media content recognition or media content recognition methods. The memory 904 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 904 may further include memory located remotely from the processor 902, which may be connected to the electronic device 90 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmitting means 906 is used for receiving or sending data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the electronic device 90. In one example, the transmission device 906 includes a network adapter (NIC) that can be connected to other network devices through a base station so as to communicate with the internet. In one embodiment, the transmitting device 906 can be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the electronic device 90 (or mobile device).
Embodiments of the present application further provide a computer-readable storage medium, which may be disposed in an electronic device and stores at least one instruction or at least one program, where the at least one instruction or the at least one program is loaded and executed by a processor to implement the model training method for media content recognition or the media content recognition method provided in the above method embodiments.
Alternatively, in this embodiment, the storage medium may be located in at least one network server of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
It should be noted that: the sequence of the embodiments of the present application is only for description, and does not represent the advantages and disadvantages of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the device and electronic apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (13)

1. A model training method for media content recognition, the method comprising:
acquiring a media content sample set and a multi-dimensional feature set; the multi-dimensional feature set is constructed based on multi-dimensional features corresponding to each media content sample in the media content sample set, and the media content samples carry tags for judging whether the media content samples belong to abnormal classified content or not;
when a decision tree to be trained is trained, determining a training sample set participating in training based on the media content sample set, and determining a training feature set participating in training based on the multi-dimensional feature set;
determining an optimal splitting rule meeting a preset splitting requirement based on the training sample set and the training feature set, and training the decision tree to be trained based on the optimal splitting rule to obtain a first decision tree;
and repeating the steps from the determining of the training sample set participating in training to the obtaining of a first decision tree, and constructing at least two random forests based on a plurality of first decision trees to obtain a decision model.
2. The method of claim 1, wherein:
the preset splitting requirement comprises a preset splitting evaluation parameter and a preset splitting ending condition, wherein the preset splitting evaluation parameter comprises at least one of the following parameters: information gain and Gini coefficient;
the determining an optimal splitting rule meeting a preset splitting requirement based on the training sample set and the training feature set, and training the decision tree to be trained based on the optimal splitting rule include:
selecting training features indicating a current local optimal splitting from the training feature set, and generating a local optimal splitting rule based on the selected training features; wherein the current local optimal splitting is determined based on the training sample set and the preset splitting evaluation parameter;
establishing an incidence relation between the local optimal splitting rule and a current decision node to guide the current decision node to perform feature splitting; the current decision node is a child node or a same-level node of a decision node which is subjected to feature splitting in the decision tree to be trained;
and repeating the steps from the selection of the training features indicating the current local optimal splitting to the guidance of the current decision node for feature splitting until the preset splitting ending condition is met so as to train the decision tree to be trained.
3. A method according to claim 1 or 2, wherein said constructing at least two random forests based on a plurality of said first decision trees to derive a decision model comprises:
constructing a plurality of first random forests based on a plurality of the first decision trees;
constructing a first-layer classification structure based on the plurality of first random forests;
when a non-first-layer classification structure is constructed, determining a target splitting rule based on an output result obtained by a previous-layer classification structure for a target training sample set and the associated input features, and training based on the target splitting rule to obtain a second decision tree for constructing an adjacent next-layer classification structure; wherein the target training sample set is determined based on the media content sample set, and the adjacent next-layer classification structure is constructed based on a plurality of second random forests constructed from a plurality of the second decision trees;
and fusing the first-layer classification structure and at least one non-first-layer classification structure to obtain the decision model.
4. A method for media content identification, the method comprising:
acquiring media content to be processed;
determining a multi-dimensional characteristic corresponding to the media content to be processed;
obtaining an identification result of the media content to be processed by using the decision model obtained by the model training method according to any one of claims 1 to 3 with the multi-dimensional features as input; wherein the identification result indicates the relation between the media content to be processed and the abnormal classified content.
5. The method according to claim 4, wherein the step of obtaining the identification result of the media content to be processed comprises the following steps:
predicting the multi-dimensional features by respectively using at least two random forests in the decision model;
generating an identification result of the media content to be processed based on the prediction result of each random forest; wherein the prediction result is determined based on the classification result of each decision tree in the random forest, and the classification result is determined by traversing the decision tree from its root node based on the multi-dimensional features until a corresponding leaf node is reached.
6. The method according to claim 4, wherein the step of obtaining the identification result of the media content to be processed comprises the following steps:
processing the multi-dimensional features layer by layer by using at least two layers of classification structures in the decision model to obtain an identification result of the media content to be processed; wherein the input of a next-layer classification structure is determined by the output and the input of the adjacent previous-layer classification structure, the output of each layer of classification structure is determined based on the prediction result of each random forest therein, the prediction result is determined based on the classification result of each decision tree in the random forest, and the classification result is determined by traversing the decision tree from its root node based on the multi-dimensional features until a corresponding leaf node is reached.
7. The method of claim 4, further comprising:
when the media content to be processed belongs to the abnormal classified content, performing abnormal marking on the media content to be processed; wherein the exception flag is used for indicating that the recommendation weight of the media content to be processed is to be reduced, or that the media content to be processed is to be excluded from being used as a candidate cover.
8. The method of claim 4, wherein the determining the multi-dimensional features corresponding to the media content to be processed comprises:
extracting image class features corresponding to the media content to be processed by using a preset image feature extraction model; wherein the preset image feature extraction model is obtained by training with a migration model as an initial model;
extracting text class features corresponding to the media content to be processed by using a preset text feature extraction model; wherein the preset text feature extraction model is obtained by training with a transformer-based bidirectional encoder representation model as an initial model;
and obtaining the multi-dimensional features based on the image class features and the text class features.
9. A model training apparatus for media content recognition, the apparatus comprising:
a set acquisition module: configured to acquire a media content sample set and a multi-dimensional feature set; wherein the multi-dimensional feature set is constructed based on multi-dimensional features corresponding to each media content sample in the media content sample set, and the media content samples carry tags for judging whether the media content samples belong to abnormal classified content or not;
a set determination module: configured to, when a decision tree to be trained is trained, determine a training sample set participating in training based on the media content sample set and determine a training feature set participating in training based on the multi-dimensional feature set;
a decision tree construction module: configured to determine an optimal splitting rule meeting a preset splitting requirement based on the training sample set and the training feature set, and train the decision tree to be trained based on the optimal splitting rule to obtain a first decision tree;
a decision model construction module: configured to repeat the steps from the determining of the training sample set participating in training to the obtaining of a first decision tree, and construct at least two random forests based on a plurality of first decision trees to obtain a decision model.
10. An apparatus for identifying media content, the apparatus comprising:
a media content acquisition module: configured to acquire media content to be processed;
a feature determination module: configured to determine multi-dimensional features corresponding to the media content to be processed;
an identification module: configured to obtain an identification result of the media content to be processed by using the decision model obtained by the model training method according to any one of claims 1 to 3 with the multi-dimensional features as input; wherein the identification result indicates the relation between the media content to be processed and the abnormal classified content.
11. An electronic device, comprising a processor and a memory, wherein at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded by the processor and executed to implement the model training method for media content recognition according to any one of claims 1 to 3 or the media content recognition method according to any one of claims 4 to 8.
12. A computer-readable storage medium, in which at least one instruction or at least one program is stored, which is loaded and executed by a processor to implement the model training method for media content recognition according to any one of claims 1 to 3 or the media content recognition method according to any one of claims 4 to 8.
13. A computer program product comprising at least one instruction or at least one program which is loaded and executed by a processor to implement the method for model training for media content recognition according to any of claims 1-3 or the method for media content recognition according to any of claims 4-8.
CN202111287497.6A 2021-11-02 2021-11-02 Model training method for media content recognition and media content recognition method Pending CN114330476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111287497.6A CN114330476A (en) 2021-11-02 2021-11-02 Model training method for media content recognition and media content recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111287497.6A CN114330476A (en) 2021-11-02 2021-11-02 Model training method for media content recognition and media content recognition method

Publications (1)

Publication Number Publication Date
CN114330476A true CN114330476A (en) 2022-04-12

Family

ID=81045703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111287497.6A Pending CN114330476A (en) 2021-11-02 2021-11-02 Model training method for media content recognition and media content recognition method

Country Status (1)

Country Link
CN (1) CN114330476A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147225A (en) * 2022-07-28 2022-10-04 连连银通电子支付有限公司 Data transfer information identification method, device, equipment and storage medium
CN115147225B (en) * 2022-07-28 2024-04-05 连连银通电子支付有限公司 Data transfer information identification method, device, equipment and storage medium
CN115602282A (en) * 2022-09-23 2023-01-13 北京华益精点生物技术有限公司(Cn) Guiding method for blood sugar monitoring and related equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination