CN114611637A - Data processing method, device, equipment and readable storage medium


Info

Publication number
CN114611637A
Authority
CN
China
Prior art keywords
target
image
media
category
feature
Prior art date
Legal status
Granted
Application number
CN202210509663.0A
Other languages
Chinese (zh)
Other versions
CN114611637B (en)
Inventor
郭卉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210509663.0A
Publication of CN114611637A
Application granted
Publication of CN114611637B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a data processing method, apparatus, device and readable storage medium. The method includes: acquiring a target data frame corresponding to target media data, and identifying the target media category to which the target media data belongs and the target image feature and target image category corresponding to the target data frame; acquiring, from a parameter mapping table, a target matching parameter jointly indicated by the target media category and the target image category; searching a candidate image feature set for matched image features matching the target image feature according to the target image feature and the target matching parameter, the candidate image feature set being the set of image features corresponding to each piece of to-be-recalled media data in a to-be-recalled media data set; and determining valid recall media data in the to-be-recalled media data set according to the matched image features and the target image feature. The method and apparatus can improve retrieval accuracy in media data retrieval services.

Description

Data processing method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, apparatus, device, and readable storage medium.
Background
With the rapid development of multimedia technology, massive amounts of multimedia data (such as music and video) have come into people's view, and the retrieval of multimedia data has become increasingly important. Multimedia data retrieval means, based on the media content contained in a given piece of multimedia data, retrieving from candidate multimedia data those items whose media content is similar to it.
In the prior art, multimedia data retrieval is usually performed manually. However, media service platforms are diverse and contain multimedia data of many categories (such as sports, movie and variety, concert, daily life, music, etc.), and each category contains different content. Given the diversity, richness and sheer volume of multimedia data, retrieval based on manual operation inevitably requires a great deal of manpower and time, at extremely high cost; meanwhile, owing to the limitations of manual operation, coverage is incomplete and the retrieved media data are not accurate enough, so accuracy is very low.
Disclosure of Invention
The embodiments of the application provide a data processing method, apparatus, device and readable storage medium, which can improve retrieval accuracy in media data retrieval services.
An embodiment of the present application provides a data processing method, including:
acquiring a target data frame corresponding to target media data, and identifying the target media category to which the target media data belongs and the target image feature and target image category corresponding to the target data frame;
acquiring, from a parameter mapping table, a target matching parameter jointly indicated by the target media category and the target image category; the parameter mapping table comprises mapping relationships among a configured media category set, a configured image category set and a matching parameter set, wherein a mapping relationship exists among one configured media category in the configured media category set, one configured image category in the configured image category set and one configured matching parameter in the matching parameter set; one configured matching parameter is used to reflect the matching condition for image features of data frames having the corresponding configured media category and the corresponding configured image category;
searching a candidate image feature set for matched image features matching the target image feature according to the target image feature and the target matching parameter; the candidate image feature set is the set of image features corresponding to each piece of to-be-recalled media data in a to-be-recalled media data set;
and determining valid recall media data in the to-be-recalled media data set according to the matched image features and the target image feature.
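To make the parameter mapping table concrete, the following is a minimal sketch in Python. The category names, the dictionary layout and the threshold values are illustrative assumptions; the patent only specifies that one configured media category and one configured image category jointly map to one configured matching parameter.

```python
# Hypothetical parameter mapping table: (configured media category,
# configured image category) -> configured matching parameter.
# All names and values below are assumptions for illustration.
PARAMETER_MAPPING_TABLE = {
    ("movie_variety", "text"):   {"similarity_threshold": 0.92},
    ("movie_variety", "person"): {"similarity_threshold": 0.85},
    ("daily_life", "animal"):    {"similarity_threshold": 0.80},
}

def get_target_matching_parameter(target_media_category: str,
                                  target_image_category: str) -> dict:
    """Look up the target matching parameter jointly indicated by the
    target media category and the target image category."""
    return PARAMETER_MAPPING_TABLE[(target_media_category, target_image_category)]
```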
An embodiment of the present application provides a data processing apparatus, including:
a frame acquiring module, configured to acquire a target data frame corresponding to target media data;
an identification module, configured to identify the target media category to which the target media data belongs and the target image feature and target image category corresponding to the target data frame;
a threshold acquiring module, configured to acquire, from a parameter mapping table, a target matching parameter jointly indicated by the target media category and the target image category; the parameter mapping table comprises mapping relationships among a configured media category set, a configured image category set and a matching parameter set, wherein a mapping relationship exists among one configured media category in the configured media category set, one configured image category in the configured image category set and one configured matching parameter in the matching parameter set; one configured matching parameter is used to reflect the matching condition for image features of data frames having the corresponding configured media category and the corresponding configured image category;
a feature matching module, configured to search a candidate image feature set for matched image features matching the target image feature according to the target image feature and the target matching parameter; the candidate image feature set is the set of image features corresponding to each piece of to-be-recalled media data in a to-be-recalled media data set;
and a valid media determining module, configured to determine valid recall media data in the to-be-recalled media data set according to the matched image features and the target image feature.
In one embodiment, the identification module comprises:
a feature extraction unit, configured to input the target data frame into the multitask recognition model and extract, through a base feature extraction layer in the multitask recognition model, the image base features corresponding to the target data frame;
a feature input unit, configured to input the image base features to a convolutional network layer in the multitask recognition model, determine, through the convolutional network layer and the image base features, the image embedding feature corresponding to the target data frame, and determine the image embedding feature as the target image feature;
the feature input unit is further configured to input the image base features to an image category prediction layer in the multitask recognition model and determine, through the image category prediction layer and the image base features, the target image category corresponding to the target data frame;
and the feature input unit is further configured to input the image base features to a media category prediction layer in the multitask recognition model and determine, through the media category prediction layer and the image base features, the target media category to which the target media data belongs.
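A minimal PyTorch sketch of such a multitask recognition model follows. The backbone choice (ResNet-50), the embedding dimension and the class counts are assumptions; the patent only fixes the structure of one shared base feature extraction layer feeding a convolutional embedding branch, an image category prediction layer and a media category prediction layer.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class MultiTaskRecognitionModel(nn.Module):
    # Backbone, embedding size and class counts are illustrative assumptions.
    def __init__(self, embed_dim=128, num_image_classes=20, num_media_classes=10):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # base feature extraction layer (shared by all three branches)
        self.base = nn.Sequential(*list(backbone.children())[:-2])
        # convolutional network layer producing the image embedding feature
        self.embed_conv = nn.Conv2d(2048, embed_dim, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        # image category prediction layer
        self.image_head = nn.Linear(2048, num_image_classes)
        # media category prediction layer (a per-frame prediction)
        self.media_head = nn.Linear(2048, num_media_classes)

    def forward(self, x):
        base = self.base(x)                                   # image base features
        emb = self.pool(self.embed_conv(base)).flatten(1)
        emb = F.normalize(emb, dim=1)                         # target image feature
        pooled = self.pool(base).flatten(1)
        return emb, self.image_head(pooled), self.media_head(pooled)
```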
In one embodiment, the number of target data frames is N, and the N target data frames include a target data frame S_i; the image base features include an image base feature T_i corresponding to the target data frame S_i; N and i are both positive integers;
a feature input unit comprising:
a category determining subunit, configured to determine the frame media category corresponding to the target data frame S_i through the media category prediction layer and the image base feature T_i corresponding to the target data frame S_i;
a frame classification subunit, configured to, when the frame media categories respectively corresponding to the N target data frames have been determined, classify the N target data frames according to the N frame media categories to obtain M data frame sets; the target data frames contained in each data frame set belong to the same frame media category; M is a positive integer;
a quantity counting subunit, configured to count the number of target data frames contained in each of the M data frame sets to obtain M frame quantities;
the quantity counting subunit is further configured to acquire the maximum frame quantity among the M frame quantities and determine the data frame set corresponding to the maximum frame quantity as the target data frame set;
and the category prediction subunit is used for determining the frame media category to which the target data frame contained in the target data frame set belongs as the target media category to which the target media data belongs.
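As a sketch, the frame-level majority vote described by these subunits can be written as follows in plain Python; the category labels are placeholders:

```python
from collections import Counter

def determine_target_media_category(frame_media_categories):
    """Group the per-frame media categories into M data frame sets, count
    the frames in each set, and return the frame media category of the
    set with the maximum frame quantity."""
    frame_counts = Counter(frame_media_categories)   # M frame quantities
    return frame_counts.most_common(1)[0][0]         # category with max count
```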
In one embodiment, each configured matching parameter in the set of matching parameters comprises a configured similarity threshold, and the target matching parameter comprises a target similarity threshold;
the feature matching module includes:
the similarity determining unit is used for determining the feature similarity between the target image features and each candidate image feature in the candidate image feature set respectively to obtain a feature similarity set;
the characteristic determining unit is used for determining the characteristic similarity which is greater than the target similarity threshold value in the characteristic similarity set as the target characteristic similarity;
and the characteristic determining unit is further used for determining the candidate image characteristic corresponding to the target characteristic similarity as a matched image characteristic matched with the target image characteristic.
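A sketch of this threshold test over the candidate image feature set follows. Cosine similarity over L2-normalized embeddings is an assumption; the patent only speaks of "feature similarity".

```python
import numpy as np

def find_matched_image_features(target_feature: np.ndarray,
                                candidate_features: np.ndarray,
                                target_similarity_threshold: float) -> np.ndarray:
    """Return the indices of candidate image features whose feature
    similarity with the target image feature exceeds the target
    similarity threshold. Rows of candidate_features are assumed to be
    L2-normalized embeddings, so the dot product is cosine similarity."""
    feature_similarities = candidate_features @ target_feature  # similarity set
    return np.where(feature_similarities > target_similarity_threshold)[0]
```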
In one embodiment, the number of the target data frames is N, the target image features corresponding to the target data frames include target image features corresponding to the N target data frames, respectively, and N is a positive integer; the number of the matched image features is Q, the Q matched image features are composed of matched image features respectively matched with the N target image features, and Q is a positive integer;
the effective media determination module includes:
a feature classification unit, configured to acquire, in the to-be-recalled media data set, the to-be-recalled media data to which each of the Q matched image features belongs;
the feature classification unit is further configured to perform feature classification on the Q matched image features according to the to-be-recalled media data to which they respectively belong, to obtain W matched feature sets; the matched image features contained in each matched feature set belong to the same piece of to-be-recalled media data; the W matched feature sets include a matched feature set R_j; W and j are both positive integers;
a feature quantity counting unit, configured to count a first feature quantity of the matched image features contained in the matched feature set R_j;
an attribute determining unit, configured to determine, according to the first feature quantity and the N target image features, a recall attribute of the to-be-recalled media data indicated by the matched feature set R_j;
and a valid media determining unit, configured to, when the recall attributes of the to-be-recalled media data respectively indicated by the W matched feature sets have been determined, determine, among the to-be-recalled media data respectively indicated by the W matched feature sets, the to-be-recalled media data whose recall attribute is the valid attribute as valid recall media data.
In one embodiment, the matched image features contained in the matched feature set R_j include a first matched image feature and a second matched image feature;
the attribute determining unit is further specifically configured to acquire, among the N target image features, a first target image feature matching the first matched image feature and a second target image feature matching the second matched image feature;
the attribute determining unit is further specifically configured to determine the total number of features contained in the first target image feature and the second target image feature as a second feature quantity;
the attribute determining unit is further specifically configured to determine, according to the first feature quantity, the second feature quantity and the target media data, the recall attribute of the to-be-recalled media data indicated by the matched feature set R_j.
In one embodiment, the attribute determining unit includes:
a duration acquiring subunit, configured to acquire a first media duration corresponding to the to-be-recalled media data indicated by the matched feature set R_j and a second media duration corresponding to the target media data;
a ratio determining subunit, configured to determine a first ratio between the first feature quantity and the first media duration, and a second ratio between the second feature quantity and the second media duration;
an attribute determining subunit, configured to determine the recall attribute of the to-be-recalled media data indicated by the matched feature set R_j as a valid attribute if at least one of the first ratio and the second ratio is greater than a ratio threshold;
the attribute determining subunit is further configured to determine the recall attribute of the to-be-recalled media data indicated by the matched feature set R_j as an invalid attribute if both the first ratio and the second ratio are smaller than the ratio threshold.
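A sketch of the ratio test performed by these subunits, assuming durations in seconds; the threshold value 0.5 is an assumption, not a value from the patent:

```python
def determine_recall_attribute(first_feature_quantity: int,
                               second_feature_quantity: int,
                               first_media_duration: float,
                               second_media_duration: float,
                               ratio_threshold: float = 0.5) -> str:
    """Valid if at least one of the two ratios exceeds the ratio threshold."""
    first_ratio = first_feature_quantity / first_media_duration
    second_ratio = second_feature_quantity / second_media_duration
    if first_ratio > ratio_threshold or second_ratio > ratio_threshold:
        return "valid"
    return "invalid"
```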
In one embodiment, the valid recall media data include valid recall media data K_a, where a is a positive integer; the matched image features include a valid matched image feature set corresponding to the valid recall media data K_a; the target image features include a valid target image feature set, which consists of the valid target image features matching each valid matched image feature in the valid matched image feature set; and the target data frames include a valid target data frame set corresponding to the valid target image feature set;
the data processing apparatus further includes:
a timestamp acquiring module, configured to acquire the frame timestamp corresponding to each valid target data frame in the valid target data frame set;
a frame sorting module, configured to sort the valid target data frame set in chronological order of the frame timestamps respectively corresponding to the valid target data frames, to obtain a valid frame sequence;
a service processing module, configured to determine the media segment indicated by the valid frame sequence in the target media data as a to-be-compared segment;
the service processing module is further configured to perform media service processing according to the to-be-compared segment, the target media data and the media category to which the valid recall media data K_a belongs.
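The timestamp sort and segment extraction above might look like the following sketch, where each valid target data frame is assumed to carry its frame timestamp in seconds:

```python
def to_be_compared_segment(valid_target_frames):
    """valid_target_frames: list of (frame_timestamp, frame) pairs.
    Sorts the valid target data frame set chronologically to form the
    valid frame sequence, then returns the start and end timestamps of
    the media segment it indicates."""
    valid_frame_sequence = sorted(valid_target_frames, key=lambda f: f[0])
    return valid_frame_sequence[0][0], valid_frame_sequence[-1][0]
```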
In one embodiment, the service processing module comprises:
a recall category determining unit, configured to determine the media category to which the valid recall media data K_a belong as the recall media category;
a first processing unit, configured to, if the category attribute of the recall media category is a private resource attribute, acquire, in the valid recall media data K_a, the valid media segment corresponding to the valid matched image feature set, compare and analyze the to-be-compared segment and the valid media segment, generate abnormality warning information based on the analysis result obtained by the comparison, and send the abnormality warning information to a target terminal device; the target terminal device is the terminal device corresponding to the target object that generated the target media data; the abnormality warning information is used to prompt the target object to correct the target media data based on the analysis result;
the first processing unit is further configured to, if the category attribute of the recall media category is a shared resource attribute, acquire, in the valid recall media data K_a, the valid media segment corresponding to the valid matched image feature set, determine the media topic matching both the valid media segment and the to-be-compared segment, and push similar media data containing the media topic to the target terminal device.
In one embodiment, the data processing apparatus further comprises:
the sample acquisition module is used for acquiring sample image triples; the sample image triplets comprise target sample images, first similar sample images corresponding to the target sample images and second similar sample images corresponding to the target sample images;
the model processing module is used for inputting the sample image triples into the initial multi-task recognition model;
the model processing module is further used for determining a first sample image embedding feature corresponding to the target sample image, a first sample image category and a first sample target media category, a second sample image embedding feature corresponding to the first similar sample image, a second sample image category and a second sample target media category, a third sample image embedding feature corresponding to the second similar sample image, a third sample image category and a third sample target media category through the initial multi-task recognition model;
The loss value determining module is used for determining a first loss value according to the first sample image embedding feature, the second sample image embedding feature and the third sample image embedding feature;
the loss value determining module is further used for determining a second loss value according to the first sample image category, the second sample image category and the third sample image category;
the loss value determining module is further used for determining a third loss value according to the first sample target media category, the second sample target media category and the third sample target media category;
the loss value determining module is further used for generating a target loss value according to the first loss value, the second loss value and the third loss value;
and a model adjusting module, configured to adjust the initial multi-task recognition model according to the target loss value to obtain the multi-task recognition model.
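The three losses combine as sketched below in PyTorch. The use of a triplet margin loss for the embeddings and cross-entropy for the two category heads, and all weights and the margin, are assumptions consistent with this module description but not dictated by it.

```python
import torch.nn.functional as F

def target_loss(emb_target, emb_first, emb_second,
                image_logits, image_labels,
                media_logits, media_labels,
                margin=0.2, w1=1.0, w2=1.0, w3=1.0):
    """emb_*: the three sample image embedding features; image_/media_
    logits and labels: stacked category predictions and labels for the
    three sample images. Weights and margin are assumptions."""
    first_loss = F.triplet_margin_loss(emb_target, emb_first, emb_second,
                                       margin=margin)
    second_loss = F.cross_entropy(image_logits, image_labels)
    third_loss = F.cross_entropy(media_logits, media_labels)
    return w1 * first_loss + w2 * second_loss + w3 * third_loss
```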
In one embodiment, the sample acquisition module comprises:
a sample set acquisition unit for acquiring a sample image set; the sample image set comprises at least two similar sample image pairs, and one similar sample image pair comprises two sample images with similar relation;
the image combination unit is used for acquiring a target similar sample image pair in at least two similar sample image pairs;
the image combination unit is further configured to select a sample image to be operated from the sample images contained in the remaining similar sample image pairs; the remaining similar sample image pairs are the similar sample image pairs, among the at least two similar sample image pairs, other than the target similar sample image pair;
and the image combination unit is also used for determining a sample image triple according to the sample image to be operated and the target similar sample image pair.
In one embodiment, the image combining unit includes:
an image similarity determining subunit, configured to select a target sample image from the sample images included in the target similar sample image pair;
the image similarity determining subunit is further configured to acquire the sample image representation feature corresponding to the sample image to be operated and the target image representation feature corresponding to the target sample image;
the image similarity determining subunit is further configured to determine the representation feature similarity between the sample image representation feature and the target image representation feature;
a triplet determining subunit, configured to, if the representation feature similarity is greater than a feature similarity threshold, determine the remaining sample image as the first similar sample image corresponding to the target sample image, determine the sample image to be operated as the second similar sample image corresponding to the target sample image, and determine the target sample image, the first similar sample image and the second similar sample image as a sample image triplet; the remaining sample image is the sample image other than the target sample image in the target similar sample image pair.
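A sketch of this triplet construction follows. sim_fn stands in for the representation feature similarity computation, and the threshold value is an assumption:

```python
import random

def build_sample_image_triplet(similar_pairs, sim_fn,
                               feature_similarity_threshold=0.9):
    """similar_pairs: list of (image, image) pairs with a similar
    relation. Picks a target similar pair, then a sample image to be
    operated from the remaining pairs; if its representation feature
    similarity with the target sample image exceeds the threshold, it
    becomes the second similar sample image."""
    target_pair = random.choice(similar_pairs)
    target_image, remaining_image = target_pair        # target / first similar
    rest = [img for pair in similar_pairs if pair is not target_pair
            for img in pair]
    to_be_operated = random.choice(rest)
    if sim_fn(to_be_operated, target_image) > feature_similarity_threshold:
        return target_image, remaining_image, to_be_operated
    return None  # no triplet formed for this draw
```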
An embodiment of the present application provides a computer device, including: a processor and a memory;
the memory stores a computer program that, when executed by the processor, causes the processor to perform the method in the embodiments of the present application.
An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, where the computer program includes program instructions, and the program instructions, when executed by a processor, perform the method in the embodiments of the present application.
In one aspect of the present application, a computer program product is provided, the computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device executes the method provided by the aspect of the embodiment of the present application.
In the embodiments of the application, different matching parameters are configured for different media categories and image categories, so that a parameter mapping table is obtained through configuration. The parameter mapping table comprises mapping relationships among configured media categories, configured image categories and configured matching parameters, where one configured matching parameter is used to reflect the matching condition for image features of data frames having the corresponding configured media category and the corresponding configured image category.

When target media data is obtained, in a retrieval service for the target media data, the target media category of the target media data and the target image feature and target image category corresponding to a target data frame may first be identified. A target matching parameter jointly indicated by the target media category and the target image category may then be obtained from the parameter mapping table, and according to the target matching parameter, matched image features matching the target image feature may be searched for in a candidate image feature set, where the matching condition reflected by the target matching parameter is satisfied between a matched image feature and the target image feature. Since the candidate image feature set is the set of image features respectively corresponding to each piece of to-be-recalled media data in the to-be-recalled media data set, valid recall media data can be determined in the to-be-recalled media data set through the matched image features and the target image feature; the valid recall media data are the recall media data retrieved for the target media data.

It should be understood that the media retrieval recall service simultaneously makes use of the media category information of the media data and the image category information and image feature information of its data frames. Different matching parameters are configured for different media categories and different image categories, and matched image features meeting the matching condition under a given image category can be found according to the matching parameters, so that recall media data meeting the matching condition can be found. That is, targeted retrieval for different media categories and different image categories can be realized, and more accurate media retrieval recall can be performed according to the image information presented in the data frames of the media data, thereby improving the accuracy of the recall result. In conclusion, the application can perform targeted retrieval based on different media categories and image categories in media data retrieval services, improving retrieval accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a diagram of a network architecture provided in an embodiment of the present application;
fig. 2a is a schematic view of a scenario for performing a media re-ranking process according to an embodiment of the present application;
fig. 2b is a schematic view of a scenario for performing a media re-ranking process according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of model training provided by an embodiment of the present application;
FIG. 5 is an architecture diagram for determining a target loss value according to an embodiment of the present application;
FIG. 6 is a diagram of a system architecture provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The present application relates to artificial intelligence and other related technologies. For ease of understanding, related concepts such as artificial intelligence are first described below.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
The scheme provided by the embodiments of the application relates to Computer Vision (CV) and Machine Learning (ML) technologies in the field of artificial intelligence.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers in place of human eyes to perform machine vision such as identifying and measuring targets, and further performing graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and synchronized positioning and mapping, among other technologies.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formula learning.
The scheme of the application particularly relates to an image recognition technology in a computer vision technology, and can realize image recognition processing on an image so as to perform subsequent processing based on an image recognition result; the scheme of the application also specifically relates to machine learning, and can realize training the model, so that the trained model can more accurately perform image recognition processing.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present disclosure. As shown in fig. 1, the network architecture may include a server 1000 and a terminal device cluster. The terminal device cluster may include one or more terminal devices, and may specifically include a terminal device 100a, a terminal device 100b, a terminal device 100c, …, and a terminal device 100n. As shown in fig. 1, the terminal device 100a, the terminal device 100b, the terminal device 100c, …, and the terminal device 100n may each be in network connection with the server 1000, so that each terminal device may perform data interaction with the server 1000 through the network connection. The network connection is not limited to a particular connection manner: it may be a direct or indirect connection through wired communication, a direct or indirect connection through wireless communication, or a connection through another manner, which is not limited herein.
Each terminal device in the terminal device cluster may be an intelligent terminal with image recognition and media retrieval functions, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, a smart television, and the like. It should be understood that each terminal device in the terminal device cluster shown in fig. 1 may be installed with a target application (i.e., an application client); when the application client runs in a terminal device, it may perform data interaction with the server 1000 shown in fig. 1. The application client may include a social client, a multimedia client (e.g., a video client), an entertainment client (e.g., a game client), an education client, a live-streaming client, and the like. The application client may be an independent client, or may be an embedded sub-client integrated in another client (for example, a social client, an education client, a multimedia client, and the like), which is not limited herein.
As shown in fig. 1, the server 1000 in this embodiment may be a server corresponding to the application client. The server 1000 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The embodiment of the application does not limit the number of the terminal devices and the servers.
For convenience of understanding, in the embodiment of the present application, one terminal device may be selected as a target terminal device from the plurality of terminal devices shown in fig. 1. For example, the terminal device 100a shown in fig. 1 may be used as the target terminal device, and a target application (i.e., an application client) may be integrated in the target terminal device. At this time, the target terminal device may implement data interaction with the server 1000 through the service data platform corresponding to the application client. The target application may run a trained multitask recognition model, which may be a neural network model for performing image recognition on a target data frame. The multitask recognition model may predict the media category (e.g., a daily life category, a movie and variety category, a concert category, etc.) to which a piece of media data (e.g., target media data, such as a certain video) belongs, the image category corresponding to a data frame of the media data (e.g., the image category corresponding to the image content contained in the data frame, such as an animal category, a person category, a text category, etc.), and the image feature corresponding to the data frame (i.e., a representation feature for representing the image content of the data frame). Based on the media category of the target media data, the image category of a certain data frame (e.g., a target data frame) of the target media data (which may be referred to as a target image category), and the image feature (which may be referred to as a target image feature) identified by the multitask recognition model, media data retrieval processing may be performed to retrieve whether a media segment of sufficient duration in the target media data is the same as a media segment in existing media data (which may also be understood as performing media re-ranking processing on the media data).
It can be understood that, when performing the media data retrieval processing, the present application may perform the retrieval processing by comparing the data frames of the two media data. Taking the target media data as an example, if it is determined whether the target media data and a certain media data (which may be referred to as to-be-compared media data or to-be-recalled media data) have the same media segment with a sufficient length (i.e., it is determined whether the two media data are the same media data), data frames of the two media data may be obtained respectively, and then the data frames of the two media data are compared. When two data frames are compared, the image features (such as the embedding feature) corresponding to the two data frames respectively can be identified and obtained, then the distance (such as the Euclidean distance) between the two image features is calculated, and if the distance between the two image features is small enough (such as smaller than a distance threshold), the two data frames can be indicated to be similar data frames. It should be understood that when two media data have enough data frames (e.g., data frames greater than a number threshold) as similar data frames, it may indicate that the two media data have a media segment of a fixed length as a similar segment (or a repeated segment). However, because the image contents contained in each data frame have diversity, the capabilities that a single image feature (such as an embedding feature) can represent on different data frames are different, for example, for images containing text contents, fine-grained objects (such as faces), and the like, the single image feature cannot well represent the image contents, thereby affecting the comparison accuracy and further affecting the accuracy of media retrieval.
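As a sketch, the per-frame comparison described here reduces to a distance test between embeddings; the distance threshold value is an assumption:

```python
import numpy as np

def frames_are_similar(embedding_a: np.ndarray, embedding_b: np.ndarray,
                       distance_threshold: float = 1.0) -> bool:
    """Two data frames are treated as similar when the Euclidean distance
    between their embedding features is small enough."""
    return float(np.linalg.norm(embedding_a - embedding_b)) < distance_threshold
```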
In order to improve comparison accuracy and thus the accuracy of media re-ranking (also referred to as media retrieval), the present application relates to a data processing method (i.e., a media re-ranking method, which may also be referred to as a media retrieval method or a media comparison method) in which different matching parameters may be configured for different media categories and different image categories. One matching parameter is used to reflect the matching condition for image features of data frames having the corresponding media category and the corresponding image category; that is, for two data frames, their image features are determined to be similar only when the matching condition indicated by the matching parameter is satisfied. Specifically, taking target media data as an example, when the target media category of the target media data and the target image category and target image feature of a target data frame have been identified, the target matching parameter jointly indicated by the target media category and the target image category may be obtained, and based on the target matching parameter, matched image features matching the target image feature may be found in a candidate image feature set (the set of image features corresponding to each piece of to-be-recalled media data in the to-be-recalled media data set); valid recall media data may then be determined in the to-be-recalled media data set according to the matched image features and the target image feature. The valid recall media data are the media data, retrieved by the method of the application, that share a large amount of identical content (the same or similar media segments) with the target media data. It can be understood that, by configuring matching parameters for different media categories and different image categories, the embodiment of the present application can provide customized retrieval for different media categories and different image categories, fully taking into account the image content of different data frames, the representation capability of image features, and the retrieval requirements of different media categories, which can well improve media comparison accuracy.
It is understood that the method provided by the embodiment of the present application may be executed by a computer device, which includes but is not limited to a terminal device or a server. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
Alternatively, it is understood that the computer device (such as the server 1000, the terminal device 100a, the terminal device 100b, and the like) may be a node in a distributed system, where the distributed system may be a blockchain system formed by a plurality of nodes connected through network communication into a peer-to-peer (P2P) network. The P2P protocol is an application-layer protocol operating on top of the Transmission Control Protocol (TCP). In a distributed system, any form of computer device, such as a server or a terminal device, may become a node in the blockchain system by joining the peer-to-peer network. For ease of understanding, the concept of a blockchain is explained below: a blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms, and is mainly used for sorting data in chronological order and encrypting it into a ledger, so that the data cannot be tampered with or forged, while the data can still be verified, stored and updated. When the computer device is a blockchain node, owing to the tamper-proof and forgery-proof characteristics of the blockchain, the data in the present application (such as the uploaded target media data, the identified target media category of the target media data, the target image feature of the target data frame, the target image category, and the like) can have authenticity and security, so that the results obtained after relevant data processing based on the data are more reliable.
In the embodiments of the present application, data related to user information, user data (such as uploaded images, media data, and the like) and the like needs to be authorized by the user for obtaining. That is, when the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of relevant data need to comply with relevant laws and regulations and standards of relevant countries and regions.
The embodiment of the application can be applied to various scenes, including but not limited to audio and video, cloud technology, artificial intelligence, intelligent traffic, driving assistance and the like. For ease of understanding, please refer to fig. 2a, and fig. 2a is a schematic view of a scenario for performing a media re-ranking process according to an embodiment of the present application. The scenario shown in fig. 2a is described by taking media data as video data, a data frame of the media data as a video frame, and a media category to which the media data belongs as a video category as an example, where the server 200 shown in fig. 2a may be the server 1000, and the terminal device 100a shown in fig. 2a may be any one terminal device selected from the terminal device cluster of the embodiment corresponding to fig. 1, for example, the terminal device may be the terminal device 100 a.
As shown in fig. 2a, when an object A (e.g., user A) uses the target application (e.g., a video client) in the terminal device and uploads a piece of video data through the video client (e.g., video data 1 shown in fig. 2a, which may serve as the target video data), the server 200 may obtain the video data 1 uploaded by object A. Further, the server 200 may perform frame extraction on video data 1 according to a preset frame extraction parameter; the frame extraction parameter specifies how many frames are extracted per interval, for example, 1 frame uniformly extracted per second, 2 frames per second, 1 frame per 0.5 second, and so on. The video frames obtained after frame extraction may be referred to as target video frames of video data 1. Assume here that the video frames of video data 1, arranged in chronological order, are {video frame 1, video frame 2, video frame 3, …, video frame n}, and that the frame extraction parameter is 1 frame uniformly extracted per second; after frame extraction is performed on video data 1 according to this parameter, the obtained video frames are video frame 1, video frame 3, video frame 5, …, and video frame n (the extracted frames are given here only for ease of understanding and have no actual reference meaning).
Further, the server 200 may input each video frame obtained by frame extraction into the multitask recognition model, through which the video category, image category and image feature (where the image feature may be an embedding feature) corresponding to each video frame may be identified; for the specific implementation of identifying the video category, image category and image feature corresponding to each video frame through the multitask recognition model, reference may be made to the description of the embodiment corresponding to fig. 3 below. As shown in fig. 2a, the identified video category corresponding to video frame 1 is video category 1, and the image category is image category 2; the video category corresponding to video frame 3 is video category 2, and the image category is image category 3; …; the video category corresponding to video frame n is video category 1, and the image category is image category 1. Further, the server 200 may determine the video category to which video data 1 belongs according to the video category corresponding to each video frame. For example, the total number of video frames under each video category may be counted, and the video category with the maximum total number may be selected as the video category to which video data 1 belongs. For example, if, among video frame 1, video frame 3, …, video frame n, 10 video frames have video category 1, 3 video frames have video category 2, and 4 video frames have video category 3, then the maximum total number is 10, and the video category to which video data 1 belongs may be determined as video category 1.
Further, the server 200 finds, in the database, video frames similar to each video frame according to video category 1 of video data 1 and the image category of each video frame. Specifically, the server 200 may obtain the parameter mapping table 20, which contains mapping relationships among a video category set, an image category set and a matching parameter set. It should be understood that in the present application a matching parameter may be configured for each image category under each video category, serving as the matching condition for the image features of video frames of that category. For a given video frame, when determining whether another video frame is similar to it, the image features of the two video frames may be obtained; when the two image features satisfy the matching condition, the two video frames may be determined to be similar video frames.
For convenience of understanding, take video frame 1 as an example. The server 200 may extract the image feature corresponding to video frame 1 (the image feature may be an embedding feature; assume it is image feature 1). Since the image category corresponding to video frame 1 is image category 2, the server 200 may obtain, from the parameter mapping table 20, the matching parameter corresponding to image category 2 under video category 1 (the video category to which video data 1 belongs), namely matching parameter 2. According to the matching condition indicated by matching parameter 2, the server 200 may perform feature matching between image feature 1 of video frame 1 and the existing image features in the database one by one (these may be called stock image features; each stock image feature is an image feature corresponding to a video frame of an existing video, where each existing video may be called to-be-recalled video data or stock video data). In this way, the stock image features satisfying the matching condition with image feature 1 can be found, together with the stock video frames corresponding to those stock image features, i.e., the video frames similar to video frame 1.
For ease of understanding, please refer to fig. 2b, which is a schematic view of a scenario for performing media re-ranking processing according to an embodiment of the present application. As shown in fig. 2b, take the image feature of video frame 1 as image feature 1, the image feature of video frame 3 as image feature 3, …, and the image feature of video frame n as image feature n. As described above, the similar video frames corresponding to each video frame can be determined according to the matching parameters corresponding to the different image categories under video category 1. Here, suppose the stock image features matching image feature 1 (i.e., satisfying the matching condition corresponding to its matching parameter) include stock image feature 1 and stock image feature 3; the stock image features matching image feature 3 include stock image feature 3; …; and the stock image features matching image feature n include stock image feature 2. The video data respectively corresponding to each matched stock image feature (which may be called stock video data) can then be obtained; as shown in fig. 2b, these may specifically include stock video data 1, …, and stock video data 2. Further, according to the present application, valid recall video data for video data 1, that is, video data sharing repeated (or similar) video segments of a certain length with video data 1, may be determined among stock video data 1, …, and stock video data 2 according to the matched stock image features and the image features of each video frame (video frame 1, video frame 3, …, video frame n) of video data 1. For the specific implementation of determining valid recall video data from the matched stock image features and image features, reference may be made to the embodiment corresponding to fig. 3 below.
Further, the server 200 may perform corresponding video service processing according to the different video categories to which the valid recall video data belong. For example, if the valid recall video data include video data whose video category is the movie and variety category, and that category is one in which copying is prohibited and which is protected by copyright, the server 200 may generate abnormality warning information and send it to the terminal device 100a. The abnormality warning information may be used to prompt object A that part of the video content in video data 1 is similar to valid recall video data of the movie and variety category, that video data 1 is an abnormal video, and that object A should handle video data 1 accordingly in time. Object A can view the abnormality warning information in the display interface of the terminal device 100a and act on it accordingly (for example, delete video data 1, or delete the video segment of video data 1 that is similar to the valid recall video data).
It should be understood that, in the embodiment of the present application, when performing video retrieval and deduplication, customized retrieval may be configured for different video categories and different image categories. The retrieval can simultaneously utilize the video category information of the video to be compared and the image category information of its different video frames. Even if the image contents contained in the video frames are diverse, targeted retrieval can be performed using the matching parameters of the different image categories, so that the accuracy of video retrieval and deduplication can be substantially improved.
Further, please refer to fig. 3, wherein fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application. As shown in fig. 3, the method may be performed by a terminal device (e.g., the terminal device in fig. 1), a server (e.g., the server in fig. 1), or both the terminal device and the server. For the sake of understanding, the present embodiment is described as an example in which the method is performed by the server described above, so as to illustrate a specific process of performing media data processing in the server. Wherein, the method at least comprises the following steps S101-S104:
Step S101, acquiring a target data frame corresponding to target media data, and identifying a target media category to which the target media data belongs, and a target image feature and a target image category corresponding to the target data frame.
In this application, the media data may include video data, music data, text data, and the like. The target media data refers to the media data to be compared, for example, certain media data (e.g., a video) uploaded by an object through a target application in a terminal device. In a media retrieval service (also referred to as a media deduplication service), in order to protect the rights of the originator of media data, it is necessary to determine, for each uploaded piece of media data, whether it contains a media segment similar or even identical to other media data. If a media segment of a certain duration is similar or identical to other existing media data, the uploaded media data is duplicate media data and needs to be processed accordingly (for example, hidden from display or deleted).
It can be understood that, when determining whether certain media data is duplicate media data, that media data serves as the media data to be determined, i.e., the target media data in the present application. For the target media data, the present application needs to search whether similar media data exists (that is, whether there is media data having a similar or identical media segment), and if so, the similar media data in the database can be used as recall media data of the target media data.
In the present application, the retrieval and deduplication of media data are mainly performed based on the data frames of the media data. Therefore, after the target media data is acquired, the target data frames of the target media data can be acquired, and the retrieval and deduplication processing can then be performed on the target media data based on the target data frames. Each data frame of the target media data may be determined as a target data frame; alternatively, to improve retrieval efficiency, data frames may be extracted from the target media data according to preset frame extraction parameters, and the extracted data frames may be used as the target data frames. The frame extraction parameter may specify the number of frames extracted per unit time, for example, 3 frames per second, 1 frame per 0.5 second, and so on; further examples are omitted here.
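For ease of understanding, a minimal Python sketch of such frame extraction is given below. It assumes the OpenCV (cv2) library is available; the function name and the default rate of 1 frame per second are illustrative choices, not values fixed by this application.

import cv2

def extract_target_frames(video_path: str, frames_per_second: float = 1.0):
    """Sample data frames from a video at a fixed rate (illustrative sketch)."""
    capture = cv2.VideoCapture(video_path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if the rate is unknown
    step = max(int(round(native_fps / frames_per_second)), 1)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)  # one candidate target data frame
        index += 1
    capture.release()
    return frames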
In the application, after the target data frames of the target media data are acquired, the target media category to which the target media data belongs, and the target image feature and target image category corresponding to each target data frame, may be identified. The media category refers to the media type or genre to which certain media data belongs, and may include a daily life category, a movie and television integrated art category, a sports category, a concert category, a competition category, a game category, a live broadcast category, and the like. The image category refers to the type of the image content contained in a certain data frame, and may include a person category, an animal category, a building category, a text category, a music category, and the like; the image feature may be an image embedding feature. It should be understood that the media category and the image category describe the data from two different angles. The media category can be determined from the behaviors presented by the objects in the media data: for example, if the media data shows a concert scene in which one object sings while audience objects wave fluorescent sticks and sing along, its media category may be the concert category. To some extent, the media category reflects the event presented by the media data (such as a game event or a concert event). The image category, by contrast, can be determined from the object categories of the objects contained in a single data frame: for example, if the image content of a data frame is a puppy running, the object category of the puppy is the animal category, and the image category of that data frame may be the animal category. In effect, the image category of a data frame reflects an attribute of that frame. Of course, the above distinction between the media category and the image category is described for ease of understanding and is not meant as a limitation.
In this application, the target media category to which the target media data belongs, and the target image feature and target image category corresponding to the target data frame, may be identified by a multi-task recognition model. A specific implementation may be as follows. The target data frame is input into the multi-task recognition model, and the image basic feature corresponding to the target data frame is extracted through a basic feature extraction layer in the model. Subsequently, the image basic feature is input into a convolutional network layer of the model, an image embedding feature corresponding to the target data frame is determined through the convolutional network layer, and the image embedding feature is determined as the target image feature. Meanwhile, the image basic feature is input into an image category prediction layer of the model, and the target image category corresponding to the target data frame is determined through that layer; the image basic feature is also input into a media category prediction layer of the model, and the target media category to which the target media data belongs is determined through that layer.
Suppose the number of the target data frames is N, the N target data frames include a target data frame S_i, and the image basic features include an image basic feature T_i corresponding to the target data frame S_i (N and i are positive integers). A specific implementation of determining the target media category through the media category prediction layer and the image basic features may then be as follows. Through the media category prediction layer and the image basic feature T_i, a frame media category corresponding to the target data frame S_i can be determined. In the same manner, the frame media categories corresponding to the other target data frames among the N target data frames can also be determined. When the frame media categories corresponding to all N target data frames have been determined, the N target data frames are classified according to their frame media categories to obtain M data frame sets, where the target data frames contained in each data frame set belong to the same frame media category (M is a positive integer). Then, the number of target data frames contained in each of the M data frame sets is counted, yielding M frame counts. The maximum frame count is taken from the M frame counts, and the data frame set corresponding to the maximum frame count is determined as the target data frame set. Further, the frame media category to which the target data frames contained in the target data frame set belong may be determined as the target media category to which the target media data belongs.
It can be understood that, for identifying the target media category, the multi-task recognition model in the present application identifies the media category of each target data frame (referred to as the frame media category). The target data frames belonging to the same frame media category among the N target data frames are then grouped into one data frame set, the number of target data frames in each data frame set is counted, the data frame set with the maximum frame count is determined as the target data frame set, and the frame media category of the target data frames in that set is taken as the target media category of the target media data.
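This majority vote can be sketched in a few lines of Python; the function and variable names below are illustrative, and each frame media category is assumed to be available as a string.

from collections import Counter

def vote_media_category(frame_media_categories: list) -> str:
    """Pick the frame media category with the largest frame count."""
    counts = Counter(frame_media_categories)  # M data frame sets -> M frame counts
    target_media_category, _ = counts.most_common(1)[0]  # maximum frame count wins
    return target_media_category

# e.g. vote_media_category(["concert", "concert", "daily life"]) -> "concert"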
It is to be understood that the multi-task recognition model in the present application may be a model trained on a sample image database, which may be a large visual database for visual object recognition research. Optionally, the multi-task recognition model may also build on an open-source model; for example, it may use an ImageNet pre-trained model, i.e., a deep learning network model trained on a large-scale general object recognition open-source data set (such as the ImageNet data set). The multi-task recognition model in the present application may include a plurality of recognition networks: a basic feature recognition network (corresponding to the basic feature extraction layer, which may specifically be a convolutional network), an embedding feature recognition network (corresponding to the convolutional network layer), a media category recognition network (corresponding to the media category prediction layer), and an image category recognition network (corresponding to the image category prediction layer). Each of the four recognition networks may be a Convolutional Neural Network (CNN), but the network structures of the four CNNs differ. The embedding feature recognition network, the media category recognition network, and the image category recognition network share the same bottom-layer feature (also called the basic feature, i.e., the image basic feature extracted by the basic feature recognition network); the three branch recognition networks then output their different recognition results separately, which can substantially reduce model inference resources. The present application may pre-train a residual network (such as ResNet-101) in the initial multi-task recognition model using the open-source ImageNet data set, and the pre-trained residual network can serve as the basic feature extraction network of the initial multi-task recognition model. Then, after the initial multi-task recognition model is trained on sample images, the multi-task recognition model for performing the image recognition task is obtained.
Further, for ease of understanding, please refer to table 1, which is a schematic table of a residual network structure provided in an embodiment of the present application. The schematic table represents the network structure of ResNet-101 and may include the convolutional layers (Layer names), the output image sizes (Output sizes), and the convolution information of each layer. As shown in table 1, the residual network structure may include the following network layers: a convolutional network layer 1 (Conv1), a convolutional network layer 2 (Conv2_x), a convolutional network layer 3 (Conv3_x), a convolutional network layer 4 (Conv4_x), a convolutional network layer 5 (Conv5_x), and a network layer 6 (Pool_cr1, also referred to as Max pool, i.e., a pooling layer). The 101 layers of the residual network count only the layers within the convolutional network layers; activation layers and pooling layers are not counted.
As shown in table 1, the convolutional network layer 1 contains 64 convolutions of 7×7 with a stride of 2. The convolutional network layers 2 to 5 are each composed of residual blocks; for example, the convolutional network layer 2 includes a max pooling layer (3×3 pooling with a stride of 2) and 3 residual blocks, where each residual block contains 3 layers, specifically 64 convolutions of 1×1, 64 convolutions of 3×3, and 256 convolutions of 1×1. As shown in table 1:
TABLE 1
[Table 1 appears as an image in the original publication; it lists the layer name, output size, and convolution configuration of each network layer of ResNet-101.]
For convenience of understanding, please refer to table 2, which is a schematic table of the structure of the embedding feature recognition network provided in an embodiment of the present application. Table 2 may include the convolutional layers (Layer names), the output image sizes (Output sizes), and the specific meaning of each layer. As shown in table 2, the embedding feature recognition network structure schematic table may include 1 embedding layer, which may specifically be a fully connected layer. As shown in table 2:
TABLE 2
[Table 2 appears as an image in the original publication; it lists the layer name, output size, and meaning of the embedding layer.]
For ease of understanding, please refer to table 3, which is a schematic table of the structure of the media category recognition network according to an embodiment of the present application. Table 3 may include the convolutional layers (Layer names), the output image sizes (Output sizes), and the specific meaning of each layer. As shown in table 3, the media category recognition network structure schematic table may include 1 Fc1 layer, which may specifically be a fully connected layer. The Fc1 layer outputs the media category classification, where Nclass1 is the number of media category classes. As shown in table 3:
TABLE 3
[Table 3 appears as an image in the original publication; it lists the layer name, output size, and meaning of the Fc1 layer.]
For ease of understanding, please refer to table 4, which is a schematic table of the structure of the image category recognition network provided in an embodiment of the present application. Table 4 may include the convolutional layers (Layer names), the output image sizes (Output sizes), and the specific meaning of each layer. As shown in table 4, the image category recognition network structure schematic table may include 1 Fc2 layer, which may specifically be a fully connected layer. The Fc2 layer outputs the image classification categories, where Nclass2 is the number of image classification classes. As shown in table 4:
TABLE 4
[Table 4 appears as an image in the original publication; it lists the layer name, output size, and meaning of the Fc2 layer.]
It should be noted that the network structures shown in tables 1 to 4 are only examples of one possible network structure. Each recognition network may also adopt other structures; for instance, the structures in tables 2 to 4 may be replaced by several stacked fully connected layers with ReLU activations before the final output. The present application does not specifically limit the network structure.
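For ease of understanding, the following is a hedged PyTorch sketch of the multi-branch structure described above: a shared ResNet-101 backbone as the basic feature extraction layer, a fully connected embedding head, and two fully connected classification heads (Fc1 for the media category and Fc2 for the image category). The embedding dimension and the class counts are placeholders, not values fixed by this application; this is a sketch under these assumptions, not a definitive implementation.

import torch
import torch.nn as nn
import torchvision.models as models

class MultiTaskRecognitionModel(nn.Module):
    def __init__(self, embedding_dim=256, n_media_classes=5, n_image_classes=5):
        super().__init__()
        backbone = models.resnet101(weights=None)  # torchvision >= 0.13; may be ImageNet pre-trained
        # basic feature extraction layer: everything up to the pooled 2048-d feature
        self.base = nn.Sequential(*list(backbone.children())[:-1])
        feat_dim = backbone.fc.in_features  # 2048 for ResNet-101
        self.embedding_layer = nn.Linear(feat_dim, embedding_dim)  # embedding head
        self.fc1_media = nn.Linear(feat_dim, n_media_classes)      # Fc1: media category
        self.fc2_image = nn.Linear(feat_dim, n_image_classes)      # Fc2: image category

    def forward(self, frame_batch):
        base_feature = self.base(frame_batch).flatten(1)  # shared bottom-layer feature
        embedding = self.embedding_layer(base_feature)
        media_logits = self.fc1_media(base_feature)
        image_logits = self.fc2_image(base_feature)
        return embedding, media_logits, image_logits

Because the three heads share one backbone pass, a single forward call yields all three outputs, e.g. embedding, media_logits, image_logits = model(batch), which reflects the saving of inference resources mentioned above.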
Step S102, acquiring a target matching parameter jointly indicated by the target media category and the target image category in a parameter mapping table; the parameter mapping table comprises mapping relations among a configuration media category set, a configuration image category set, and a matching parameter set, wherein a mapping relation exists among one configuration media category in the configuration media category set, one configuration image category in the configuration image category set, and one configuration matching parameter in the matching parameter set; a configuration matching parameter is used to reflect the matching condition for image features of data frames having the corresponding configuration media category and configuration image category.
In the application, different matching parameters can be configured for different combinations of media category and image category, yielding a parameter mapping table. When a matching parameter is configured for a certain media category and a certain image category, the media category may be called a configuration media category, the image category a configuration image category, and the configured matching parameter a configuration matching parameter. After configuration, the parameter mapping table contains the matching parameter configured for each pair of media category and image category; once a matching parameter has been configured for a certain media category and image category, a mapping relation is established among the three. The parameter mapping table therefore comprises the mapping relations among the media category set, the image category set, and the matching parameter set, with one mapping relation existing among each configured media category, image category, and matching parameter.
It can be understood that, as described above, when comparing whether two pieces of media data are similar, the comparison may be performed based on their data frames, and specifically based on the image features of those data frames. To compare whether two data frames are similar, the similarity of the image features of the two data frames may be determined; if the similarity is greater than a similarity threshold, the two data frames may be determined to be similar data frames. Accordingly, each configuration matching parameter in the present application may include a configuration similarity threshold, which reflects the matching condition for image features of data frames having the corresponding media category and image category: only when the similarity between two image features is greater than the configuration similarity threshold can the two image features be determined as matching (i.e., similar image features), in which case the two corresponding data frames are similar data frames. Alternatively, the Euclidean distance between two image features can be determined, and the similarity can be derived from the Euclidean distance: the greater the Euclidean distance, the more dissimilar the two image features. Accordingly, the configuration matching parameter may instead include a distance threshold, and the matching condition is then that two image features match when the Euclidean distance between them is smaller than the distance threshold.
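A minimal sketch of such a parameter mapping table is shown below, assuming each configuration matching parameter is a similarity threshold. The category names and threshold values are illustrative assumptions; the "default" entries sketch the fallback to the extra media/image categories described later in this application.

PARAMETER_MAPPING_TABLE = {
    ("movie_and_tv", "person"):  {"similarity_threshold": 0.90},  # illustrative values
    ("movie_and_tv", "text"):    {"similarity_threshold": 0.95},
    ("daily_life",   "animal"):  {"similarity_threshold": 0.85},
    ("default",      "default"): {"similarity_threshold": 0.88},  # extra categories fallback
}

def get_target_matching_parameter(media_category: str, image_category: str) -> dict:
    """Look up the matching parameter jointly indicated by the two categories."""
    for key in ((media_category, image_category),
                (media_category, "default"),   # target media category known, image category not
                ("default", "default")):       # neither category configured
        if key in PARAMETER_MAPPING_TABLE:
            return PARAMETER_MAPPING_TABLE[key]
    raise KeyError("no configuration matching parameter found")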
It should be noted that, as can be seen from the above description, the multi-task recognition model identifies the frame media category of each target data frame. However, once the target media category of the target media data has been determined from these frame media categories, the frame media categories of all the target data frames may be adjusted to the target media category; that is, the frame media category of every target data frame becomes the target media category, and when a matching parameter is looked up for a target data frame, it is likewise determined by the target media category.
Step S103, searching for a matching image feature matched with the target image feature in the candidate image feature set according to the target image feature and the target matching parameter; the candidate image feature set is a set formed by image features respectively corresponding to each piece of media data to be recalled in the media data set to be recalled.
In this application, take as an example that each configuration matching parameter in the matching parameter set includes a configuration similarity threshold, so that the target matching parameter includes a target similarity threshold. A specific implementation of searching the candidate image feature set for matching image features according to the target image feature and the target matching parameter may then be as follows: determine the feature similarity between the target image feature and each candidate image feature in the candidate image feature set, obtaining a feature similarity set; determine the feature similarities in that set which are greater than the target similarity threshold as target feature similarities; and determine the candidate image features corresponding to the target feature similarities as the matching image features matching the target image feature.
The feature similarity can be determined based on the Euclidean distance between the target image feature and a candidate image feature, where one Euclidean distance indicates one feature similarity, and a greater Euclidean distance means a smaller feature similarity. The feature similarity may also be determined based on the cosine similarity between the target image feature and the candidate image feature, computed from their representation vectors. The present application does not limit the specific manner of determining the feature similarity.
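For ease of understanding, a minimal numpy sketch of step S103 follows, using cosine similarity (one of the options above). The names and the normalization step are illustrative; the threshold could come from a lookup such as the get_target_matching_parameter sketch above.

import numpy as np

def find_matching_features(target_feature: np.ndarray,
                           candidate_features: np.ndarray,
                           similarity_threshold: float) -> np.ndarray:
    """Return indices of candidate image features matching the target image feature."""
    t = target_feature / np.linalg.norm(target_feature)
    c = candidate_features / np.linalg.norm(candidate_features, axis=1, keepdims=True)
    feature_similarities = c @ t  # the feature similarity set
    # target feature similarities are those greater than the target similarity threshold
    return np.where(feature_similarities > similarity_threshold)[0]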
It should be noted that, in the present application, a media data search library (also referred to as a database) may be pre-constructed. Currently acquired media data (which may be uploaded by objects, and may include both copyright-protected media data and copyright-open media data) can be stored in the media data search library as stock media data (also referred to as media data to be recalled). For each piece of media data to be recalled, the image features (embedding features) of its data frames, the image category of each data frame, and the media category to which it belongs may be acquired and stored together in the media data search library. An index relationship is established among the image features of each piece of media data to be recalled, the image category of each of its data frames (which can also be understood as the image category corresponding to each image feature), the media category to which it belongs, and the media data itself, so that when one piece of information is acquired, the other information can be indexed through the index relationship. Meanwhile, for each piece of media data to be recalled, the timestamp of the data frame corresponding to each image feature (for example, which second of the media data it falls in) and the total media duration of the media data may also be stored in the media data search library to facilitate subsequent processing. After the image features, image categories, and media category of a piece of media data to be recalled have been stored in the media data search library, the image features may be referred to as stock image features or candidate image features, the image categories as stock image categories, and the media category as a stock media category. The candidate image feature set in the present application is then the set formed by the candidate image features corresponding to each piece of media data to be recalled.
Step S104, determining valid recall media data in the media data set to be recalled according to the matching image features and the target image features.
In the present application, as can be seen from the above description, the target image feature of one target data frame may match one or more stock image features; that is, one target image feature may have one or more matching image features. Suppose the number of the target data frames is N, so that there are N target image features, and suppose the number of matching image features is Q, the Q matching image features being those matched with the N target image features (N and Q are positive integers). A specific implementation of determining the valid recall media data in the media data set to be recalled according to the matching image features and the target image features may then be as follows. The media data to be recalled to which the Q matching image features respectively belong are acquired from the media data set to be recalled. The Q matching image features are then classified according to the media data to be recalled to which they belong, obtaining W matching feature sets, where the matching image features contained in each matching feature set belong to the same media data (W is a positive integer). Take a matching feature set R_j among the W matching feature sets as an example (j is a positive integer). The number of matching image features contained in the matching feature set R_j (which may be referred to as a first feature number) is counted, and the recall attribute of the media data to be recalled indicated by the matching feature set R_j is determined according to the first feature number and the N target image features (here, the media data to be recalled indicated by the matching feature set R_j is the media data to be recalled to which the matching image features contained in R_j belong). In the same manner, the recall attributes of the media data to be recalled indicated by the other matching feature sets among the W matching feature sets can be determined. When the recall attributes of the media data to be recalled indicated by all W matching feature sets have been determined, the media data to be recalled whose recall attribute is the valid attribute may be determined as the valid recall media data.
Suppose the matching image features contained in the matching feature set R_j include a first matching image feature and a second matching image feature. A specific implementation of determining the recall attribute of the media data to be recalled indicated by the matching feature set R_j according to the first feature number and the N target image features may then be: acquiring, from the N target image features, a first target image feature matched with the first matching image feature and a second target image feature matched with the second matching image feature; determining the total number of features comprised by the first target image feature and the second target image feature as a second feature number; and determining the recall attribute of the media data to be recalled indicated by the matching feature set R_j according to the first feature number, the second feature number, and the target media data.
A specific implementation of determining the recall attribute of the media data to be recalled indicated by the matching feature set R_j according to the first feature number, the second feature number, and the target media data may be as follows. The first media duration corresponding to the media data to be recalled indicated by the matching feature set R_j, and the second media duration corresponding to the target media data, are acquired. A first ratio between the first feature number and the first media duration, and a second ratio between the second feature number and the second media duration, are then determined. If at least one of the first ratio and the second ratio is greater than a ratio threshold, the recall attribute of the media data to be recalled indicated by the matching feature set R_j is determined as the valid attribute; if both the first ratio and the second ratio are smaller than the ratio threshold, the recall attribute is determined as the invalid attribute.
It can be understood that when the Q matching image features are determined, the media data to be recalled to which they respectively belong (which may be referred to as initial recall media data) can be acquired from the media data search library according to the index relationship, and the number of image features (embedding features) matched on each piece of initial recall media data can be counted from the media data to be recalled to which each matching image feature belongs. The specific statistical method is that the Q matching image features are classified according to the media data to be recalled to which they belong; that is, the matching image features belonging to the same media data to be recalled are divided into one matching feature set, so that W matching feature sets are obtained, and the total number of matching image features contained in one matching feature set is the number of image features matched on the corresponding piece of initial recall media data. For a certain piece of initial recall media data (e.g., the media data to be recalled indicated by the matching feature set R_j), whether it is valid recall media data may then be further determined from the total number of matching image features matched on it (the first feature number) and the N target image features.
Specifically, for ease of illustration, the media data to be recalled indicated by the matching feature set R_j is here called the initial media data U. The target image features matched with the matching image features of the initial media data U (such as the first target image feature and the second target image feature) can be acquired from the N target image features. For example, suppose the matching image features of the initial media data U include the matching image feature 1, the matching image feature 5, and the matching image feature 9, and among the N target image features, the target image feature matched with the matching image feature 1 is the target image feature 1, the target image feature matched with the matching image feature 5 is also the target image feature 1, and the target image feature matched with the matching image feature 9 is the target image feature 3. The acquired target image features matched with the matching image features of the initial media data U then comprise the target image feature 1 and the target image feature 3, so the total number of matched target image features is 2, which is the second feature number. The total media duration of the initial media data U (referred to as the first media duration) and the total media duration of the target media data (referred to as the second media duration) can be obtained from the media data search library. A recall frame number ratio of the initial media data U (i.e., the first ratio, the first feature number divided by the first media duration) is determined from the first feature number and the first media duration, and a recall frame number ratio of the target media data (i.e., the second ratio, the second feature number divided by the second media duration) is determined from the second feature number and the second media duration. If the first ratio is greater than the ratio threshold (which may be a preset value) and the second ratio is also greater than the ratio threshold, the recall attribute of the initial media data U is the valid attribute; that is, the initial media data U is valid recall media data. If the first ratio is greater than the ratio threshold but the second ratio is smaller than it, or the first ratio is smaller than the ratio threshold but the second ratio is greater than it, the recall attribute of the initial media data U is likewise the valid attribute.
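The recall-attribute test described above can be sketched as follows. The data layout (a list of matches plus duration lookups) and the ratio threshold of 0.2 are illustrative assumptions, not values fixed by this application.

from collections import defaultdict

def determine_valid_recalls(matches, target_duration, stock_durations,
                            ratio_threshold=0.2):
    """matches: iterable of (stock_media_id, stock_feature_id, target_feature_id)."""
    groups = defaultdict(lambda: (set(), set()))  # one entry per matching feature set
    for media_id, stock_fid, target_fid in matches:
        stock_set, target_set = groups[media_id]
        stock_set.add(stock_fid)    # source of the first feature number
        target_set.add(target_fid)  # source of the second feature number
    valid = []
    for media_id, (stock_set, target_set) in groups.items():
        first_ratio = len(stock_set) / stock_durations[media_id]  # per first media duration
        second_ratio = len(target_set) / target_duration          # per second media duration
        if first_ratio > ratio_threshold or second_ratio > ratio_threshold:
            valid.append(media_id)  # recall attribute: valid
    return valid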
It should be understood that with matching parameters customized for different media categories and different image categories, targeted media retrieval and re-ranking can be performed for different image categories, which substantially improves the accuracy of the retrieval recall results. Meanwhile, even when a new media category or image category appears, matching parameters can be quickly configured for it and the parameter mapping table quickly updated; that is, the method can be quickly extended to new media categories.
It should be noted that the parameter mapping table of the present application may also contain matching parameters for other media categories (media categories not currently considered, which may be referred to as extra media categories) or other image categories (image categories not currently considered, which may be referred to as extra image categories). For a certain target data frame with a determined target media category and target image category, suppose no target matching parameter corresponding to that combination exists in the parameter mapping table. If the target media category exists in the table but no matching parameter is configured for the target image category under it, the matching parameter of the extra image category under the target media category may be obtained from the table and used as the target matching parameter. If the target media category does not exist in the table at all, the configuration matching parameter of the extra image category under the extra media category may be used as the target matching parameter.
Optionally, it can be understood that, after the valid recall media data is determined, the application may perform corresponding recall feedback (i.e., media service processing) according to the media category to which each piece of valid recall media data belongs. For example, for media data that needs copyright protection, such as the movie and television integrated art category or the concert category, abnormality warning processing can be performed (for example, the abnormality warning is used to prompt the author of the target media data that the target media data duplicates the valid recall media data, that the valid recall media data is protected by copyright, and that the target media data should be corrected or deleted in time); for media data such as daily-life short videos, similar life scenes can be pushed to the author of the media data to encourage further creation, and so on. Suppose the valid recall media data includes valid recall media data K_a (a is a positive integer), the matching image features include a valid matching image feature set corresponding to the valid recall media data K_a, the target image features include a valid target image feature set containing a valid target image feature matched with each valid matching image feature in the valid matching image feature set, and the target data frames include a valid target data frame set corresponding to the valid target image feature set. The media service processing may then specifically be as follows: the frame timestamp corresponding to each valid target data frame in the valid target data frame set is obtained; the valid target data frames are sorted according to the chronological order of their frame timestamps to obtain a valid frame sequence; the media segment indicated by the valid frame sequence in the target media data is determined as the segment to be compared; and media service processing is performed on the target media data and the valid recall media data K_a according to the segment to be compared and the media category to which the valid recall media data K_a belongs.
A specific implementation of performing media service processing on the target media data and the valid recall media data K_a according to the segment to be compared and the media category to which the valid recall media data K_a belongs may be as follows. The media category to which the valid recall media data K_a belongs is determined as the recall media category. If the category attribute of the recall media category is the private resource attribute, the valid media segment corresponding to the valid matching image feature set is obtained from the valid recall media data K_a, the segment to be compared and the valid media segment are compared and analyzed, abnormality warning information is generated based on the analysis result of the comparison, and the abnormality warning information is sent to the target terminal device, where the target terminal device is the terminal device corresponding to the target object that generated the target media data, and the abnormality warning information is used to prompt the target object to correct the target media data based on the analysis result. If the category attribute of the recall media category is the shared resource attribute, the valid media segment corresponding to the valid matching image feature set is obtained from the valid recall media data K_a, the media topic matched by both the valid media segment and the segment to be compared is determined, and similar media data containing that media topic is pushed to the target terminal device.
It can be understood that, for a certain piece of valid recall media data, the target data frames matched with its matching image features (which may be referred to as valid target data frames) can be obtained from the target data frames. Since the target data frames in the present application are consecutive frames extracted according to the frame extraction parameters, the valid target data frames can be sorted according to the chronological order of their frame timestamps (for example, from earliest to latest) to obtain a valid frame sequence, and the media segment indicated by the valid frame sequence can be determined as the segment to be compared. For example, the valid target data frame with the largest frame timestamp and the valid target data frame with the smallest frame timestamp in the valid frame sequence can be obtained; then, among all data frames of the target media data, the data frames whose frame timestamps are greater than the smallest frame timestamp and smaller than the largest frame timestamp can be obtained, and these data frames, together with the two valid target data frames having the largest and smallest frame timestamps, compose the media segment. If the category attribute of the valid recall media data is the private resource attribute (e.g., a copyright-protected attribute, such as the movie and television integrated art category or the concert category), a segment to be compared is determined in the valid recall media data in the same manner based on its matching image features (called a valid media segment for ease of distinction). Comparison analysis processing is then performed on the segment to be compared and the valid media segment, and abnormality warning information (such as prompt information for correcting or deleting the target media data) is generated based on the comparison analysis result (such as a result indicating identical persons). The abnormality warning information is used to prompt that the target media data duplicates the valid recall media data, that the valid recall media data is protected by copyright, and that the author of the target media data needs to immediately correct or delete the target media data based on the comparison analysis result.
If the category attribute of the valid recall media data is the shared resource attribute (e.g., an attribute without copyright protection, i.e., a copyright-open attribute, such as the daily life category), the media topic matched by both the valid media segment and the segment to be compared may be determined. For example, the media topic may be determined from the event occurring in the media data: for an outdoor cooking event, the media topic may include outdoor activities and cooking; for an indoor dance event, the media topic may include indoor activities and dancing; for a school lesson event, the media topic may include school and education. Once the media topic is determined, similar media data highly relevant to the media topic may be pushed to the author of the target media data to encourage the author to create more similar media data.
Optionally, it can be understood that when the valid recall media data is determined, the present application may also sort the pieces of valid recall media data according to the recall duration ratio of the target media data, obtaining a valid media data sequence. For example, for a certain piece of valid recall media data, after the valid target image features matched with its matching image features are determined, the valid target data frames of those valid target image features can be acquired, and the frame timestamp of each valid target data frame obtained. The minimum frame timestamp is then subtracted from the maximum frame timestamp, and the resulting difference is used as the duration of the duplicated segment between the target media data and the valid recall media data (which may be referred to as the repetition duration); the ratio between the repetition duration and the total media duration of the target media data is the recall duration ratio of the target media data with respect to that valid recall media data. When the recall duration ratio of the target media data with respect to each piece of valid recall media data has been determined, the pieces can be sorted in descending order of the ratio, which quickly and clearly reflects which valid recall media data shares the most repetition duration with the target media data and which shares the least.
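A minimal sketch of this recall-duration-ratio ranking follows, under the assumption that the frame timestamps (in seconds) of the matched valid target data frames are available per valid recall; the names are illustrative.

def rank_valid_recalls(recall_to_timestamps: dict, total_media_duration: float):
    """recall_to_timestamps: media id -> frame timestamps of its matched valid target data frames."""
    ratios = {}
    for media_id, timestamps in recall_to_timestamps.items():
        repetition_duration = max(timestamps) - min(timestamps)  # duplicated segment length
        ratios[media_id] = repetition_duration / total_media_duration  # recall duration ratio
    # the valid media data sequence, largest recall duration ratio first
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)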
In the embodiment of the application, the media retrieval recall service can simultaneously utilize the media category information of the media data, the image category information of its data frames, and the image feature information. Different matching parameters are configured for different media categories and different image categories, and the matching image features satisfying the matching condition under each image category can be found according to those matching parameters, so that the recall media data satisfying the matching condition can be found. That is, targeted retrieval for different media categories and different image categories is realized, more accurate media retrieval recall can be performed according to the image information presented in the data frames of the media data, and the accuracy of the recall results is improved. In conclusion, the method and the device can perform targeted retrieval based on different media categories and image categories in the media data retrieval service and improve retrieval accuracy.
It can be understood that, as described above, the image recognition processing may be performed based on the multi-task recognition model. To improve the accuracy of image recognition, the multi-task recognition model may be trained so that the trained and tuned model performs well; based on the trained multi-task recognition model, the image recognition processing can be performed on the target data frame (for example, recognizing the target image feature, the target image category, and the frame media category of the target data frame). For ease of understanding, please refer to fig. 4, which is a schematic flowchart of a model training process provided in an embodiment of the present application. As shown in fig. 4, the model training method may be executed by a terminal device (e.g., the terminal device in fig. 1), by a server (e.g., the server in fig. 1), or by both. For convenience of understanding, the present embodiment describes the method as executed by the server to illustrate a specific model training process in the server. The training process at least includes the following steps S201-S206:
Step S201, acquiring a sample image triplet, and inputting the sample image triplet into an initial multi-task recognition model; the sample image triplet comprises a target sample image, a first similar sample image corresponding to the target sample image, and a second similar sample image corresponding to the target sample image.
Specifically, three sample images can be screened from the full set of sample images to form a triplet: an anchor sample image (anchor), a sample image similar to the anchor sample image (a positive sample image), and a sample image dissimilar to the anchor sample image (a negative sample image). It can be understood that, to improve the recognition capability of the trained model, the present application can increase the adversarial difficulty of the triplets, making them harder for the model to distinguish; a model trained on more adversarial triplets acquires better recognition capability. A specific implementation for obtaining the sample image triplet may then be as follows. A sample image set is obtained; the sample image set comprises at least two similar sample image pairs, where one similar sample image pair comprises two sample images having a similarity relationship. A target similar sample image pair is acquired from the at least two similar sample image pairs. Subsequently, sample images to be operated on are selected from the sample images contained in the remaining similar sample image pairs, where the remaining similar sample image pairs are the similar sample image pairs other than the target similar sample image pair. The sample image triplet is then determined according to the sample images to be operated on and the target similar sample image pair.
A specific implementation of determining the sample image triplet according to a sample image to be operated on and the target similar sample image pair may be as follows. A target sample image is selected from the sample images contained in the target similar sample image pair. The sample image representation feature corresponding to the sample image to be operated on and the target image representation feature corresponding to the target sample image are then obtained, and the representation feature similarity between the two is determined. If the representation feature similarity is greater than a feature similarity threshold, the remaining sample image is determined as the first similar sample image corresponding to the target sample image, the sample image to be operated on is determined as the second similar sample image corresponding to the target sample image, and the target sample image, the first similar sample image, and the second similar sample image are determined as the sample image triplet; here, the remaining sample image is the sample image other than the target sample image in the target similar sample image pair.
It is understood that the sample image set may refer to the full set of sample images, or to a set obtained by dividing the images into groups of a certain size (e.g., 100 sample images per set). In the sample image set, two similar sample images can be combined into a sample pair (referred to as a similar sample image pair, also called a positive sample pair). One of the similar sample image pairs is then determined as the target similar sample image pair. From the target similar sample image pair, one sample image may be randomly selected as the anchor sample image (i.e., the target sample image), and from each remaining similar sample image pair, one image may be randomly selected as a sample image to be operated on. Then, for each sample image to be operated on, the image distance to the target sample image may be computed; for example, the representation feature similarity between the sample image representation feature (such as an embedding feature) of the sample image to be operated on and the target image representation feature (such as an embedding feature) of the target sample image may be calculated. The greater the representation feature similarity, the smaller the image distance between the two images; when the representation feature similarity is greater than the feature similarity threshold, the sample image to be operated on can be characterized as an image similar to the target sample image. In other words, the sample image triplet finally determined by the present application may include an anchor sample image, a positive sample image (a sample image similar to the anchor sample image), and a negative sample image (another sample image that is also similar to the anchor sample image). It should be understood why the negative sample is chosen this way. In the media retrieval and re-ranking service, if the target media data and a certain piece of media data to be recalled are duplicates, they are very similar: if a data frame of the target media data is obtained by some image transformation (adding noise, cropping, adding borders, etc.) of a data frame of that media data to be recalled, the contents of the two data frames are almost identical. An image randomly drawn from the massive sample images is, in practice, almost always different from the target sample image. The present application can therefore select a sample image whose representation feature similarity is greater than the similarity threshold as the negative sample image of the positive sample pair: although the negative sample image is still different from the target sample image, it is, among the massive sample images, one that resembles the target sample image, making the triplet harder to distinguish.
Optionally, in a feasible embodiment, the present application may compute the image distance between each sample image to be operated on and the target sample image, sort the sample images to be operated on in order of image distance (for example, from smallest to largest), and then select the first E sample images in the sorted sequence as negative sample images for the positive sample pair, thereby obtaining E sample image triplets.
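For ease of understanding, a minimal numpy sketch of this hard-negative selection follows. It assumes L2-normalized embedding vectors, so that a dot product gives the representation feature similarity; the threshold of 0.5 and the names are illustrative assumptions.

import numpy as np

def mine_hard_negatives(anchor_embedding, candidate_embeddings,
                        similarity_threshold=0.5, top_e=1):
    """Return indices of the E candidates most similar to the anchor (hard negatives)."""
    similarities = candidate_embeddings @ anchor_embedding  # representation feature similarity
    hard = np.where(similarities > similarity_threshold)[0]  # above-threshold candidates
    hard = hard[np.argsort(-similarities[hard])]  # most anchor-like first
    return hard[:top_e]

Each returned index, combined with the anchor sample image and its positive sample image, yields one sample image triplet.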
Step S202, determining, through the initial multi-task recognition model, a first sample image embedding feature, a first sample image category, and a first sample media category corresponding to the target sample image; a second sample image embedding feature, a second sample image category, and a second sample media category corresponding to the first similar sample image; and a third sample image embedding feature, a third sample image category, and a third sample media category corresponding to the second similar sample image.
Specifically, the basic feature extraction layer in the initial multi-task recognition model extracts the sample image basic feature corresponding to each sample image in the sample image triplet. Through the convolutional network layer of the model and the corresponding sample image basic feature, the corresponding sample image embedding feature is output; through the image category prediction layer and the corresponding sample image basic feature, the corresponding sample image category is output; and through the media category prediction layer and the corresponding sample image basic feature, the corresponding sample media category is output.
Step S203 determines a first loss value according to the first sample image embedding feature, the second sample image embedding feature, and the third sample image embedding feature.
Specifically, for a sample image triplet, a feature loss value (i.e., a first loss value) may be determined according to the first sample image embedding feature, the second sample image embedding feature, and the third sample image embedding feature. For ease of understanding, a specific implementation for determining the first loss value can be as shown in equation (1):
$$L_{trip} = \max\left( \left\| f_a - f_p \right\|_2 - \left\| f_a - f_n \right\|_2 + \alpha,\; 0 \right) \tag{1}$$

wherein, as shown in formula (1), $L_{trip}$ may be used to characterize the first loss value of a certain triplet; $f_a$ may be used to characterize the sample image feature corresponding to the anchor sample image (e.g., the first sample image embedding feature); $f_p$ may be used to characterize the sample image feature corresponding to the positive sample image (e.g., the second sample image embedding feature); $f_n$ may be used to characterize the sample image feature corresponding to the negative sample image (e.g., the third sample image embedding feature); $\left\| \cdot \right\|_2$ may be used to characterize the L2 distance (Euclidean distance) between two features; and $\alpha$ is an edge (margin) parameter, which may be set to 0.6. The loss value shown in formula (1) aims to make the distance between the anchor sample image (anchor) and the negative sample image (negative) greater than the distance between the anchor sample image (anchor) and the positive sample image (positive) by at least the preset value (e.g., 0.6).
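A minimal PyTorch transcription of formula (1) could look as follows (torch.nn.TripletMarginLoss with margin=0.6 is equivalent); batched embedding inputs of shape (B, D) are assumed.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.6):
    # L2 (Euclidean) distances between the anchor and the positive/negative.
    d_ap = F.pairwise_distance(anchor, positive, p=2)
    d_an = F.pairwise_distance(anchor, negative, p=2)
    # Hinge: penalize triplets where the negative is not at least `margin`
    # farther from the anchor than the positive is.
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()
```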
Step S204, a second loss value is determined according to the first sample image category, the second sample image category and the third sample image category.
Specifically, for each sample image in a sample image triplet, its real image category may be labeled (in this application, because the contents of the two sample images in a positive sample pair are extremely similar, the real image categories of the two sample images may be set to the same image category). An image category sub-loss value may then be determined according to the first sample image category and the real image category corresponding to the target sample image; an image category sub-loss value may be determined according to the second sample image category and the real image category corresponding to the first similar sample image; and an image category sub-loss value may be determined according to the third sample image category and the real image category corresponding to the second similar sample image. The three image category sub-loss values are added and then averaged to obtain the total image category loss value (i.e., the second loss value) corresponding to the sample image triplet.
For ease of understanding, a specific implementation for determining the second loss value can be shown as equation (2):
$$L_{cls\text{-}img} = -\frac{1}{3}\sum_{i=1}^{3}\sum_{c=1}^{M} y_{i,c}\log\left(p_{i,c}\right) \tag{2}$$

wherein, as shown in formula (2), $L_{cls\text{-}img}$ may be used to characterize the image category loss value of the triplet. For a certain sample image $i$, the image content classification layer Fc2 may output a classification probability vector, and the cross entropy loss between this vector and the real image category may be calculated. $y_{i,c}$ may be used to characterize the vector corresponding to the real image category of sample image $i$ (specifically, a 1×5 vector of 0s and 1s, in which only the position of the true category takes 1 and the others take 0); $p_{i,c}$ may be used to characterize the prediction value output by the model for sample image $i$ (specifically, the prediction probabilities corresponding to the 1×5 candidate image categories, where $i$ denotes the $i$-th sample image, $c$ denotes the $c$-th classification position, and $M$ may be the total number of values of $c$, which may be 5).
Step S205, a third loss value is determined according to the first sample target media category, the second sample target media category, and the third sample target media category.
Specifically, for each sample image in a sample image triplet, its real media category may be labeled (in this application, because the contents of the two sample images in a positive sample pair are extremely similar, the real media categories of the two sample images may be set to the same media category). A media category sub-loss value may then be determined according to the first sample target media category and the real media category corresponding to the target sample image; a media category sub-loss value may be determined according to the second sample target media category and the real media category corresponding to the first similar sample image; and a media category sub-loss value may be determined according to the third sample target media category and the real media category corresponding to the second similar sample image. The three media category sub-loss values are added and then averaged to obtain the total media category loss value (i.e., the third loss value) corresponding to the sample image triplet.
For ease of understanding, a specific implementation for determining the third loss value can be shown as formula (3):
$$L_{cls\text{-}media} = -\frac{1}{3}\sum_{i=1}^{3}\sum_{c=1}^{M} y_{i,c}\log\left(p_{i,c}\right) \tag{3}$$

wherein, as shown in formula (3), $L_{cls\text{-}media}$ may be used to characterize the media category loss value of the triplet. For a certain sample image $i$, the media content classification layer Fc1 may output a classification probability vector, and the cross entropy loss between this vector and the real media category may be calculated. $y_{i,c}$ may be used to characterize the vector corresponding to the real media category of sample image $i$ (specifically, a 1×10 vector of 0s and 1s, in which only the position of the true category takes 1 and the others take 0); $p_{i,c}$ may be used to characterize the prediction value output by the model for sample image $i$ (specifically, the prediction probabilities corresponding to the 1×10 candidate media categories, where $i$ denotes the $i$-th sample image, $c$ denotes the $c$-th classification position, and $M$ may be the total number of values of $c$, which may be 10).
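Formulas (2) and (3) share the same cross-entropy form and differ only in the classification head and the number of categories (M = 5 versus M = 10). A minimal sketch, assuming integer class labels rather than one-hot vectors:

```python
import torch
import torch.nn.functional as F

def triplet_classification_loss(logits, true_labels):
    # logits:      (3, M) classification scores for the anchor, positive and
    #              negative images of one triplet (M = 5 for image categories,
    #              M = 10 for media categories)
    # true_labels: (3,) integer true category indices (identical for the two
    #              images of a positive pair)
    # F.cross_entropy applies softmax internally and averages over the three
    # images, matching "add the sub-loss values and take the mean".
    return F.cross_entropy(logits, true_labels)
```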
Step S206, generating a target loss value according to the first loss value, the second loss value, and the third loss value, and adjusting the initial multi-task recognition model according to the target loss value to obtain the multi-task recognition model.
Specifically, a specific implementation manner of generating the target loss value according to the first loss value, the second loss value, and the third loss value may be as shown in formula (4):
$$L = w_1 L_{trip} + w_2 L_{cls\text{-}img} + w_3 L_{cls\text{-}media} \tag{4}$$

wherein, as shown in formula (4), $L$ may be used to characterize the target loss value; $L_{trip}$ may be used to characterize the first loss value; $L_{cls\text{-}img}$ may be used to characterize the second loss value; $L_{cls\text{-}media}$ may be used to characterize the third loss value; and $w_1$, $w_2$, and $w_3$ may be used to characterize the weight coefficients, where specifically $w_1$ may be set to 1, $w_2$ to 0.5, and $w_3$ to 0.5.
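Putting the pieces together, the target loss of formula (4) is a plain weighted sum; a one-line sketch with the weights quoted above:

```python
def target_loss(l_triplet, l_image_cls, l_media_cls, w1=1.0, w2=0.5, w3=0.5):
    # Formula (4): weighted combination of the three loss terms, with
    # w1 = 1, w2 = 0.5, w3 = 0.5 as stated in the text.
    return w1 * l_triplet + w2 * l_image_cls + w3 * l_media_cls
```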
For a better understanding of the specific manner of generating the target loss value, please refer to fig. 5; fig. 5 is an architecture diagram for determining a target loss value according to an embodiment of the present application. As shown in fig. 5, after the sample triplet is input to the basic feature extraction network, the image basic features output by the basic feature extraction network may be input to the embedding feature identification network, the media category identification network, and the image category identification network, respectively. The hash loss value (corresponding to the first loss value), the media classification loss value (corresponding to the third loss value), and the image classification loss value (corresponding to the second loss value) may be determined from the predicted values output by the embedding feature identification network, the media category identification network, and the image category identification network, respectively, and the total loss value (i.e., the target loss value) may be determined from the first loss value, the second loss value, and the third loss value.
Further, multiple rounds of adjustment training can be performed on the model parameters in the initial multi-task recognition model according to the target loss value; when the number of adjustment iterations of the initial multi-task recognition model reaches the iteration threshold, the model at that time can be determined as the multi-task recognition model for executing the image recognition task.
In the embodiment of the application, the sample image triplets used for training can be labeled and adjusted so that they contain harder, more interfering negative samples. Training the model with such samples can better improve its anti-interference capability and thus the recognition accuracy of the trained model.
Further, for ease of understanding, please refer to fig. 6; fig. 6 is a diagram illustrating a system architecture according to an embodiment of the present application. The system architecture shown in fig. 6 takes video as an example of the media data, and may include a multi-task recognition model, a matching parameter obtaining module, a frame-level customized recall module, a recall result sorting module, and a feedback module. The query video may refer to a certain video to be compared; the query video is input into the multi-task recognition model, and the image features (embedding features), image categories, and video category of the video frames of the query video may be obtained through the model. In the matching parameter obtaining module, the corresponding matching parameters can be obtained according to the video category and the image category of each video frame. Taking video categories that include lifestyle short videos, movie/variety videos, and concert videos as an example, the matching parameter obtaining module may include a parameter mapping table: for lifestyle short videos, the image categories include text, non-text, and others, with corresponding matching parameters of 0.1, 0.5, and 0.3, respectively; for movie/variety videos, the image categories include characters, non-characters, and others, with corresponding matching parameters of 0.3, 0, and 0.3; for concert videos, the image categories include characters, text, and others, with corresponding matching parameters of 0.2, 0.1, and 0.2. The matching parameter corresponding to a certain video frame can then be obtained through the parameter mapping table.
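One plausible encoding of such a parameter mapping table is a nested lookup keyed by (media category, image category); the entries below mirror the fig. 6 example, while the names and data structure are assumptions of this sketch.

```python
# Hypothetical in-memory form of the parameter mapping table from fig. 6.
PARAMETER_MAPPING_TABLE = {
    "lifestyle short video": {"text": 0.1, "non-text": 0.5, "others": 0.3},
    "movie/variety video":   {"characters": 0.3, "non-characters": 0.0, "others": 0.3},
    "concert video":         {"characters": 0.2, "text": 0.1, "others": 0.2},
}

def get_matching_parameter(media_category: str, image_category: str) -> float:
    """Return the matching parameter jointly indicated by the media category
    of the video and the image category of one of its frames."""
    return PARAMETER_MAPPING_TABLE[media_category][image_category]
```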
Furthermore, in the frame-level customized recall module, matching image features matching the image features of each video frame can be found according to the matching parameter of that video frame, and effective recall video data can be determined based on the matching image features and the image features of the video frames. These effective recall video data may then be sorted in the recall result sorting module to obtain an effective video sequence. In the feedback module, feedback can be given for each effective recall video data in the effective video sequence; for example, similar-video recommendation processing can be performed for effective recall video data of lifestyle short videos, and exception warning processing can be performed for effective recall video data of movie/variety videos or concert videos.
Further, please refer to fig. 7, where fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing means may be a computer program (comprising program code) running on a computer device, for example the data processing means being an application software; the data processing apparatus may be adapted to perform the method illustrated in fig. 3. As shown in fig. 7, the data processing apparatus 1 may include: a frame acquisition module 100, a recognition module 200, a threshold acquisition module 300, a feature matching module 400, and an effective medium determination module 500.
A frame acquiring module 100, configured to acquire a target data frame corresponding to target media data;
the identification module 200 is configured to identify a target media category to which the target media data belongs, a target image feature corresponding to the target data frame, and a target image category;
a threshold obtaining module 300, configured to obtain, in the parameter mapping table, a target matching parameter indicated by both the target media category and the target image category; the parameter mapping table comprises a configuration media category set, a configuration image category set and a mapping relation among the matching parameter sets, wherein the mapping relation exists among one configuration media category in the configuration media category set, one configuration image category in the configuration image category set and one configuration matching parameter in the matching parameter set; a configuration matching parameter for reflecting a matching condition of image features of a data frame having a corresponding configuration media category and a corresponding configuration image category;
the feature matching module 400 is configured to search a candidate image feature set for a matching image feature matching the target image feature according to the target image feature and the target matching parameter; the candidate image characteristic set is a set formed by image characteristics corresponding to each media data to be recalled in the media data set to be recalled;
And the effective media determining module 500 is configured to determine effective recall media data in the to-be-recalled media data set according to the matching image feature and the target image feature.
For specific implementation manners of the frame obtaining module 100, the identifying module 200, the threshold obtaining module 300, the feature matching module 400, and the effective medium determining module 500, reference may be made to the descriptions of step S101 to step S104 in the embodiment corresponding to fig. 3, which will not be described herein again.
It can be understood that the data processing apparatus 1 in this embodiment of the application can perform the description of the multimedia data processing method in the embodiment corresponding to fig. 3, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, please refer to fig. 8, wherein fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 8, the data processing apparatus 2 may include: a frame acquisition module 11, a recognition module 12, a threshold acquisition module 13, a feature matching module 14, and a valid media determination module 15.
A frame acquiring module 11, configured to acquire a target data frame corresponding to target media data;
the identification module 12 is configured to identify a target media category to which the target media data belongs, a target image feature corresponding to the target data frame, and a target image category;
A threshold obtaining module 13, configured to obtain, in the parameter mapping table, a target matching parameter indicated by both the target media category and the target image category; the parameter mapping table comprises a configuration media category set, a configuration image category set and a mapping relation between matching parameter sets, wherein the mapping relation exists between one configuration media category in the configuration media category set, one configuration image category in the configuration image category set and one configuration matching parameter in the matching parameter set; a configuration matching parameter for reflecting a matching condition of image features of a data frame having a corresponding configuration media category and a corresponding configuration image category;
the feature matching module 14 is configured to search for a matching image feature matching the target image feature in the candidate image feature set according to the target image feature and the target matching parameter; the candidate image characteristic set is a set consisting of image characteristics corresponding to each media data to be recalled in the media data sets to be recalled;
and the effective media determining module 15 is configured to determine effective recall media data in the to-be-recalled media data set according to the matching image features and the target image features.
The specific implementation manners of the frame obtaining module 11, the identifying module 12, the threshold obtaining module 13, the feature matching module 14, and the effective medium determining module 15 are respectively consistent with the frame obtaining module 100, the identifying module 200, the threshold obtaining module 300, the feature matching module 400, and the effective medium determining module 500 in fig. 7, and will not be described again here.
In one embodiment, the identification module 12 may include: feature extraction section 121 and feature input section 122.
The feature extraction unit 121 is configured to input the target data frame into the multitask recognition model, and extract an image basic feature corresponding to the target data frame through a basic feature extraction layer in the multitask recognition model;
the feature input unit 122 is configured to input the image basic features into a convolutional network layer in the multitask recognition model, determine image embedding features corresponding to the target data frame through the convolutional network layer and the image basic features, and determine the image embedding features as target image features;
the feature input unit 122 is further configured to input the image basic features into an image category prediction layer in the multitask identification model, and determine a target image category corresponding to the target data frame through the image category prediction layer and the image basic features;
The feature input unit 122 is further configured to input the image basic features into a media category prediction layer in the multitask identification model, and determine a target media category to which the target media data belongs according to the media category prediction layer and the image basic features.
For specific implementation manners of the feature extraction unit 121 and the feature input unit 122, reference may be made to the description of step S101 in the embodiment corresponding to fig. 3, and details are not repeated here.
In one embodiment, the number of target data frames is N; the N target data frames include a target data frame Si; the image basic features include an image basic feature Ti corresponding to the target data frame Si; N and i are both positive integers;
the feature input unit 122 may include: a category determination subunit 1221, a frame classification subunit 1222, a quantity statistics subunit 1223, and a category prediction subunit 1224.
a category determination subunit 1221, configured to determine, through the media category prediction layer and the image basic feature Ti corresponding to the target data frame Si, the frame media category corresponding to the target data frame Si;
a frame classifying subunit 1222, configured to, when frame media categories corresponding to the N target data frames are determined, classify the N target data frames according to the N frame media categories to obtain M data frame sets; the frame media categories to which the target data frames contained in each data frame set belong are the same; m is a positive integer;
A number counting subunit 1223, configured to count the number of target data frames included in each of the M data frame sets, to obtain the number of M frames;
the number statistics subunit 1223 is further configured to obtain a maximum frame number from the M frame numbers, and determine a data frame set corresponding to the maximum frame number as a target data frame set;
the category predicting subunit 1224 is configured to determine, as a target media category to which the target media data belongs, a frame media category to which the target data frame included in the target data frame set belongs.
For a specific implementation manner of the category determining subunit 1221, the frame classifying subunit 1222, the quantity counting subunit 1223, and the category predicting subunit 1224, reference may be made to the description of step S101 in the embodiment corresponding to fig. 3, which will not be described herein again.
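The frame-level voting performed by these subunits could be sketched as follows; the function name and the string encoding of categories are illustrative only.

```python
from collections import Counter

def vote_media_category(frame_media_categories):
    """Group the N per-frame media categories into sets, count the frames in
    each set, and return the category of the set with the maximum frame
    count as the target media category of the whole media data."""
    counts = Counter(frame_media_categories)
    target_category, _ = counts.most_common(1)[0]
    return target_category

# e.g. vote_media_category(["concert", "concert", "movie/variety"]) -> "concert"
```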
In one embodiment, each configured matching parameter in the set of matching parameters comprises a configured similarity threshold, and the target matching parameter comprises a target similarity threshold;
the feature matching module 14 may include: a similarity determination unit 141 and a feature determination unit 142.
A similarity determining unit 141, configured to determine feature similarities between the target image features and each candidate image feature in the candidate image feature set, respectively, to obtain a feature similarity set;
A feature determining unit 142, configured to determine, as a target feature similarity, a feature similarity that is greater than a target similarity threshold in the feature similarity set;
the feature determining unit 142 is further configured to determine the candidate image feature corresponding to the target feature similarity as a matching image feature matching the target image feature.
For specific implementation of the similarity determining unit 141 and the feature determining unit 142, reference may be made to the description of step S103 in the embodiment corresponding to fig. 3, which will not be repeated here.
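A minimal sketch of this threshold-based matching follows, assuming cosine similarity as the feature similarity measure (the embodiment does not fix a particular measure); shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def find_matching_features(target_feature, candidate_features, target_sim_threshold):
    # target_feature:     (D,) target image feature of one data frame
    # candidate_features: (K, D) candidate image feature set
    # Returns the candidates whose feature similarity exceeds the target
    # similarity threshold taken from the parameter mapping table.
    sims = F.cosine_similarity(target_feature.unsqueeze(0), candidate_features, dim=1)
    keep = sims > target_sim_threshold
    return candidate_features[keep], sims[keep]
```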
In one embodiment, the number of the target data frames is N, the target image features corresponding to the target data frames include target image features corresponding to the N target data frames, respectively, where N is a positive integer; the number of the matched image features is Q, the Q matched image features are composed of matched image features respectively matched with the N target image features, and Q is a positive integer;
the effective medium determination module 15 may include: a feature classification unit 151, a feature quantity statistics unit 152, an attribute determination unit 153, and a valid media determination unit 154.
The feature classification unit 151 is configured to obtain media data to be recalled to which Q matching image features respectively belong from a media data set to be recalled;
The feature classification unit 151 is further configured to perform feature classification on the Q matching image features according to the to-be-recalled media data to which the Q matching image features respectively belong, so as to obtain W matching feature sets; the media data to be recalled to which the matching image features contained in each matching feature set belong are the same media data; the W matching feature sets include a matching feature set Rj; W and j are both positive integers;
a feature quantity statistics unit 152, configured to count a first feature quantity of the matching image features contained in the matching feature set Rj;
an attribute determining unit 153, configured to determine, according to the first feature quantity and the N target image features, a recall attribute of the media data to be recalled indicated by the matching feature set Rj;
an effective media determining unit 154, configured to, when determining the recall attribute of the to-be-recalled media data respectively indicated by the W matching feature sets, determine, as effective recall media data, the to-be-recalled media data whose recall attribute is an effective attribute in the to-be-recalled media data respectively indicated by the W matching feature sets.
For specific implementation of the feature classification unit 151, the feature quantity statistics unit 152, the attribute determination unit 153, and the effective media determination unit 154, reference may be made to the description of step S104 in the embodiment corresponding to fig. 3, which will not be described herein again.
In one embodiment, the matching image features contained in the matching feature set Rj include a first matching image feature and a second matching image feature;
the attribute determining unit 153 is further specifically configured to obtain, among the N target image features, a first target image feature matched with the first matching image feature and a second target image feature matched with the second matching image feature;
the attribute determining unit 153 is further specifically configured to determine the total number of features included in the first target image feature and the second target image feature as a second feature number;
the attribute determining unit 153 is further specifically configured to determine, according to the first feature quantity, the second feature quantity, and the target media data, the recall attribute of the media data to be recalled indicated by the matching feature set Rj.
In one embodiment, the attribute determining unit 153 may include: a duration obtaining sub-unit 1531, a ratio determining sub-unit 1532, and an attribute determining sub-unit 1533.
a duration obtaining subunit 1531, configured to obtain a first media duration corresponding to the media data to be recalled indicated by the matching feature set Rj, and a second media duration corresponding to the target media data;
a ratio determining subunit 1532, configured to determine a first ratio between the first feature amount and the first media duration and a second ratio between the second feature amount and the second media duration;
an attribute determining subunit 1533, configured to determine, if at least one of the first ratio and the second ratio is greater than the ratio threshold, the recall attribute of the media data to be recalled indicated by the matching feature set Rj as an effective attribute;

the attribute determining subunit 1533 is further configured to determine, if both the first ratio and the second ratio are smaller than the ratio threshold, the recall attribute of the media data to be recalled indicated by the matching feature set Rj as an invalid attribute.
For a specific implementation manner of the duration obtaining subunit 1531, the ratio determining subunit 1532, and the attribute determining subunit 1533, reference may be made to the description of step S104 in the embodiment corresponding to fig. 3, which will not be described herein again.
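The ratio test applied by these subunits might be sketched as below; durations are assumed to be in seconds and the ratio threshold is a deployment-chosen value not fixed by this embodiment.

```python
def recall_attribute(first_feature_qty, first_media_duration,
                     second_feature_qty, second_media_duration,
                     ratio_threshold):
    # Density of matched frames on the candidate side (first ratio) and on
    # the query side (second ratio); the candidate is an effective recall
    # if either density exceeds the threshold.
    first_ratio = first_feature_qty / first_media_duration
    second_ratio = second_feature_qty / second_media_duration
    if first_ratio > ratio_threshold or second_ratio > ratio_threshold:
        return "effective"
    return "invalid"
```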
In one embodiment, the effective recall media data includes effective recall media data Ka, where a is a positive integer; the matching image features include an effective matching image feature set corresponding to the effective recall media data Ka; the target image features include an effective target image feature set, and the effective target image feature set includes effective target image features matched with each effective matching image feature in the effective matching image feature set; the target data frames include an effective target data frame set corresponding to the effective target image feature set;
The data processing apparatus 2 may include: a timestamp acquisition module 16, a frame ordering module 17 and a service processing module 18.
A timestamp obtaining module 16, configured to obtain a frame timestamp corresponding to each valid target data frame in the valid target data frame set;
a frame ordering module 17, configured to order the set of effective target data frames according to a time sequence of frame timestamps corresponding to each effective target data frame, so as to obtain an effective frame sequence;
the service processing module 18 is configured to determine a media segment indicated by an effective frame sequence in the target media data as a segment to be compared;
the business processing module 18 is further configured to perform media service processing on the target media data and the effective recall media data Ka according to the segment to be compared and the media category to which the effective recall media data Ka belongs.
For a specific implementation manner of the timestamp obtaining module 16, the frame sorting module 17, and the service processing module 18, reference may be made to the description of step S104 in the embodiment corresponding to fig. 3, which will not be described herein again.
In one embodiment, the traffic processing module 18 may include: a recall category determining unit 181, a first processing unit 182, and a second processing unit 183.
a recall category determining unit 181, configured to determine the media category to which the effective recall media data Ka belongs as a recall media category;
a first processing unit 182, configured to, if the category attribute of the recall media category is a private resource attribute, obtain, in the effective recall media data Ka, the effective media segment corresponding to the effective matching image feature set, perform comparison analysis on the segment to be compared and the effective media segment, generate abnormality warning information based on the analysis result obtained by the comparison analysis, and send the abnormality warning information to the target terminal device; the target terminal device is the terminal device corresponding to the target object that generated the target media data; the abnormality warning information is used to prompt the target object to correct the target media data based on the analysis result;
a second processing unit 183, configured to, if the category attribute of the recall media category is a shared resource attribute, obtain, in the effective recall media data Ka, the effective media segment corresponding to the effective matching image feature set, determine a media theme matching the effective media segment and the segment to be compared, and push similar media data containing the media theme to the target terminal device.
For a specific implementation manner of the recall category determining unit 181, the first processing unit 182, and the second processing unit 183, reference may be made to the description of step S104 in the embodiment corresponding to fig. 3, which will not be described herein again.
In one embodiment, the data processing apparatus 2 may include: a sample acquisition module 19, a model processing module 21, a loss value determination module 22, and a model adjustment module 23.
A sample obtaining module 19, configured to obtain sample image triples; the sample image triple comprises a target sample image, a first similar sample image corresponding to the target sample image and a second similar sample image corresponding to the target sample image;
the model processing module 21 is configured to input the sample image triplet into the initial multi-task recognition model;
the model processing module 21 is further configured to determine, through the initial multi-task recognition model, a first sample image embedding feature corresponding to the target sample image, a first sample image category and a first sample target media category, a second sample image embedding feature corresponding to the first similar sample image, a second sample image category and a second sample target media category, a third sample image embedding feature corresponding to the second similar sample image, a third sample image category and a third sample target media category;
a loss value determining module 22, configured to determine a first loss value according to the first sample image embedding feature, the second sample image embedding feature, and the third sample image embedding feature;
A loss value determining module 22, further configured to determine a second loss value according to the first sample image class, the second sample image class, and the third sample image class;
a loss value determining module 22, further configured to determine a third loss value according to the first sample target media category, the second sample target media category, and the third sample target media category;
the loss value determining module 22 is further configured to generate a target loss value according to the first loss value, the second loss value, and the third loss value;
and the model adjusting module 23 is configured to adjust the initial multi-task recognition model according to the target loss value to obtain the multi-task recognition model.
For specific implementation manners of the sample obtaining module 19, the model processing module 21, the loss value determining module 22, and the model adjusting module 23, reference may be made to the descriptions of step S201 to step S206 in the embodiment corresponding to fig. 4, and details will not be described here.
In one embodiment, the sample acquisition module 19 may include: a sample set acquisition unit 191, and an image combining unit 192.
A sample set acquiring unit 191 configured to acquire a sample image set; the sample image set comprises at least two similar sample image pairs, and one similar sample image pair comprises two sample images with similar relation;
An image combining unit 192 for obtaining a target pair of similar sample images among the at least two pairs of similar sample images;
the image combination unit 192 is further configured to select a sample image to be operated from the sample images included in the remaining similar sample image pairs; the remaining similar sample image pairs refer to similar sample image pairs except the target similar sample image pair in at least two similar sample image pairs;
the image combining unit 192 is further configured to determine a sample image triplet according to the sample image to be operated and the target similar sample image pair.
For a specific implementation of the sample set obtaining unit 191 and the image combining unit 192, reference may be made to the description of step S201 in the embodiment corresponding to fig. 4, which will not be repeated herein.
In one embodiment, the image combining unit 192 may include: an image similarity determination subunit 1921 and a triplet determination subunit 1922.
An image similarity determining subunit 1921 configured to select a target sample image from the sample images included in the target similar sample image pair;
the image similarity determining subunit 1921 is further configured to obtain a sample image representation feature corresponding to the sample image to be operated and a target image representation feature corresponding to the target sample image;
An image similarity determining subunit 1921, further configured to determine a representation feature similarity between the sample image representation feature and the target image representation feature;
a triplet determining subunit 1922, configured to, if the representation feature similarity is greater than the feature similarity threshold, determine the remaining sample image as the first similar sample image corresponding to the target sample image, determine the sample image to be operated as the second similar sample image corresponding to the target sample image, and determine the target sample image, the first similar sample image, and the second similar sample image as the sample image triplet; the remaining sample image is the sample image in the target similar sample image pair other than the target sample image.
For specific implementation manners of the image similarity determining subunit 1921 and the triplet determining subunit 1922, reference may be made to the description of step S201 in the embodiment corresponding to fig. 4, and details will not be described here.
It can be understood that the data processing apparatus 2 in the embodiment of the present application can perform the description of the data processing method in the embodiment corresponding to fig. 3 to fig. 4, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, please refer to fig. 9, where fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 9, the computer device 8000 may be a terminal device or a server in the embodiment corresponding to fig. 1, and the computer device 8000 may include: a processor 8001, a network interface 8004, and a memory 8005, and the computer device 8000 further includes: a user interface 8003, and at least one communication bus 8002. The communication bus 8002 is used for connection communication between these components. In some embodiments, the user interface 8003 may include a Display screen (Display) and a Keyboard (Keyboard), and the optional user interface 8003 may also include a standard wired interface and a wireless interface. The network interface 8004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). Memory 8005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. Memory 8005 may optionally also be at least one storage device located remotely from the aforementioned processor 8001. As shown in fig. 9, the memory 8005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 8000 of fig. 9, a network interface 8004 may provide network communication functions; and user interface 8003 is primarily an interface for providing input to a user; and processor 8001 may be used to invoke a device control application stored in memory 8005 to implement:
acquiring a target data frame corresponding to target media data, and identifying a target media category to which the target media data belongs, a target image characteristic corresponding to the target data frame and a target image category;
acquiring a target matching parameter commonly indicated by a target media category and a target image category in a parameter mapping table; the parameter mapping table comprises a configuration media category set, a configuration image category set and a mapping relation among the matching parameter sets, wherein the mapping relation exists among one configuration media category in the configuration media category set, one configuration image category in the configuration image category set and one configuration matching parameter in the matching parameter set; a configuration matching parameter for reflecting a matching condition of image features of a data frame having a corresponding configuration media category and a corresponding configuration image category;
searching for matched image features matched with the target image features in the candidate image feature set according to the target image features and the target matching parameters; the candidate image characteristic set is a set formed by image characteristics corresponding to each media data to be recalled in the media data set to be recalled;
And determining effective recall media data in the media data set to be recalled according to the matched image characteristics and the target image characteristics.
It should be understood that the computer device 8000 described in this embodiment may perform the description of the data processing method in the embodiment corresponding to fig. 3 to fig. 4, may also perform the description of the data processing apparatus 1 in the embodiment corresponding to fig. 7, and may also perform the description of the data processing apparatus 2 in the embodiment corresponding to fig. 8, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present application further provides a computer-readable storage medium, and the computer-readable storage medium stores therein a computer program executed by the aforementioned data processing computer device 8000, and the computer program includes program instructions, and when the processor executes the program instructions, the description of the data processing method in the embodiment corresponding to fig. 3 to fig. 4 can be executed, so that details are not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of embodiments of the method of the present application.
The computer-readable storage medium may be the data processing apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer device, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, a flash card (flash card), and the like, provided on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
In one aspect of the present application, a computer program product is provided, the computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and executes the computer program, so that the computer device executes the method provided by the aspect in the embodiment of the present application.
The terms "first," "second," and the like in the description and claims of embodiments of the present application and in the drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or modules recited, but may alternatively include other steps or modules not recited, or may alternatively include other steps or elements inherent to such process, method, apparatus, article, or apparatus.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and that the components and steps of the examples have been described above in general terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and specifically, each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flows and/or blocks in the flowchart and/or the block diagram, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above disclosure is only a preferred embodiment of the present application and is not intended to limit the scope of the claims of the present application; equivalent variations made in accordance with the claims of the present application therefore remain within the scope covered by the present application.

Claims (15)

1. A method of data processing, comprising:
acquiring a target data frame corresponding to target media data, and identifying a target media category to which the target media data belongs, a target image characteristic corresponding to the target data frame and a target image category;
in a parameter mapping table, acquiring a target matching parameter indicated by the target media category and the target image category; the parameter mapping table comprises a configuration media category set, a configuration image category set and a mapping relation among matching parameter sets, wherein the mapping relation exists among one configuration media category in the configuration media category set, one configuration image category in the configuration image category set and one configuration matching parameter in the matching parameter set; the configuration matching parameter is used for reflecting the matching condition of the image characteristics of the data frame with the corresponding configuration media category and the corresponding configuration image category;
Searching for a matched image feature matched with the target image feature in a candidate image feature set according to the target image feature and the target matching parameter; the candidate image feature set is a set consisting of image features respectively corresponding to each piece of media data to be recalled in the media data sets to be recalled;
and according to the matched image features and the target image features, effective recall media data are determined in the media data set to be recalled.
2. The method of claim 1, wherein the identifying a target media category to which the target media data belongs, a target image feature corresponding to the target data frame, and a target image category comprises:
inputting the target data frame into a multitask recognition model, and extracting image basic features corresponding to the target data frame through a basic feature extraction layer in the multitask recognition model;
inputting the image basic features into a convolution network layer in the multitask identification model, determining image embedding features corresponding to the target data frame through the convolution network layer and the image basic features, and determining the image embedding features as the target image features;
Inputting the image basic features into an image category prediction layer in the multitask recognition model, and determining a target image category corresponding to the target data frame through the image category prediction layer and the image basic features;
and inputting the image basic features into a media category prediction layer in the multitask recognition model, and determining a target media category to which the target media data belongs through the media category prediction layer and the image basic features.
3. The method of claim 2, wherein the number of target data frames is N, and the N target data frames comprise a target data frame Si; the image basic features include an image basic feature Ti corresponding to the target data frame Si; N and i are positive integers;
the determining, by the media category prediction layer and the image base feature, a target media category to which the target media data belongs includes:
determining, through the media category prediction layer and the image basic feature Ti corresponding to the target data frame Si, the frame media category corresponding to the target data frame Si;
when the frame media types corresponding to the N target data frames are determined, classifying the N target data frames according to the N frame media types to obtain M data frame sets; the frame media categories to which the target data frames contained in each data frame set belong are the same; m is a positive integer;
Counting the number of target data frames contained in each data frame set in the M data frame sets to obtain the number of M frames;
acquiring the maximum frame number from the M frame numbers, and determining a data frame set corresponding to the maximum frame number as a target data frame set;
and determining the frame media category to which the target data frame contained in the target data frame set belongs as the target media category to which the target media data belongs.
4. The method of claim 1, wherein each configured matching parameter in the set of matching parameters comprises a configured similarity threshold, and wherein the target matching parameter comprises a target similarity threshold;
the searching for the matched image feature matched with the target image feature in the candidate image feature set according to the target image feature and the target matching parameter includes:
determining the feature similarity between the target image features and each candidate image feature in the candidate image feature set respectively to obtain a feature similarity set;
determining the feature similarity larger than the target similarity threshold in the feature similarity set as a target feature similarity;
And determining the candidate image features corresponding to the similarity of the target features as matched image features matched with the target image features.
5. The method according to claim 1, wherein the number of the target data frames is N, the target image features corresponding to the target data frames include target image features corresponding to the N target data frames, respectively, and N is a positive integer; the number of the matched image features is Q, the Q matched image features are composed of matched image features respectively matched with the N target image features, and Q is a positive integer;
the determining effective recall media data in the set of media data to be recalled according to the matched image features and the target image features comprises:
in the media data set to be recalled, acquiring media data to be recalled to which the Q matched image features respectively belong;
according to the media data to be recalled to which the Q matching image features respectively belong, performing feature classification on the Q matching image features to obtain W matching feature sets; the media data to be recalled to which the matching image features contained in each matching feature set belong are the same media data; the W matching feature sets comprise a matching feature set Rj; W and j are positive integers;
counting a first feature quantity of the matching image features contained in the matching feature set Rj, and determining, according to the first feature quantity and the N target image features, a recall attribute of the media data to be recalled indicated by the matching feature set Rj;
and when the recall attribute of the to-be-recalled media data respectively indicated by the W matching feature sets is determined, determining the to-be-recalled media data of which the recall attribute is an effective attribute in the to-be-recalled media data respectively indicated by the W matching feature sets as the effective recall media data.
6. The method of claim 5, wherein the matching image features contained in the matching feature set Rj comprise a first matching image feature and a second matching image feature;

the determining, according to the first feature quantity and the N target image features, a recall attribute of the media data to be recalled indicated by the matching feature set Rj comprises:

acquiring, among the N target image features, a first target image feature matched with the first matching image feature and a second target image feature matched with the second matching image feature;

determining the total number of the first target image features and the second target image features as a second feature quantity;

determining, according to the first feature quantity, the second feature quantity, and the target media data, the recall attribute of the media data to be recalled indicated by the matching feature set Rj.
7. The method of claim 6, wherein the determining, according to the first feature quantity, the second feature quantity, and the target media data, the recall attribute of the media data to be recalled indicated by the matching feature set Rj comprises:

acquiring a first media duration corresponding to the media data to be recalled indicated by the matching feature set Rj, and a second media duration corresponding to the target media data;

determining a first ratio between the first feature quantity and the first media duration and a second ratio between the second feature quantity and the second media duration;

determining, if at least one of the first ratio and the second ratio is greater than a ratio threshold, the recall attribute of the media data to be recalled indicated by the matching feature set Rj as a valid attribute;

determining, if both the first ratio and the second ratio are smaller than the ratio threshold, the recall attribute of the media data to be recalled indicated by the matching feature set Rj as an invalid attribute.
8. The method of claim 1, wherein the effective recall media data comprises effective recall media data Ka, where a is a positive integer; the matching image features include an effective matching image feature set corresponding to the effective recall media data Ka; the target image features comprise an effective target image feature set, and the effective target image feature set comprises effective target image features matched with each effective matching image feature in the effective matching image feature set; the target data frames comprise an effective target data frame set corresponding to the effective target image feature set;
the method further comprises the following steps:
acquiring a frame time stamp corresponding to each effective target data frame in the effective target data frame set;
sequencing the effective target data frame set according to the time sequence of the frame time stamps corresponding to each effective target data frame to obtain an effective frame sequence;
determining the media segment indicated by the effective frame sequence in the target media data as a segment to be compared, and performing media service processing on the target media data and the effective recall media data Ka according to the segment to be compared and the media category to which the effective recall media data Ka belongs.
9. The method of claim 8, wherein the comparison is performed according to the segment to be compared and the recall-available media data KaMedia category to which the target media data and the valid recall media data K belongaPerforming media service processing, including:
will be provided withRecall-effective media data KaThe media category to which the media belongs is determined as a recall media category;
if the category attribute of the recall media category is a private resource attribute, then the media data K is recalled effectivelyaThe effective media fragments corresponding to the effective matching image feature set are obtained, the fragments to be compared and the effective media fragments are compared and analyzed, abnormal warning information is generated based on an analysis result obtained through comparison and analysis, and the abnormal warning information is sent to target terminal equipment; the target terminal equipment is the terminal equipment corresponding to the target object for generating the target media data; the abnormal warning information is used for prompting the target object to correct the target media data based on the analysis result;
if the category attribute of the recall media category is a shared resource attribute, obtaining, from the valid recall media data Ka, a valid media segment corresponding to the valid matching image feature set, determining a media theme matched by both the valid media segment and the segment to be compared, and pushing similar media data containing the media theme to the target terminal device.
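The branching of claim 9 can be read as the following dispatch, where every callable argument is a hypothetical stand-in for a service the claim leaves unspecified (comparative analysis, warning delivery, theme matching, content push):

    def media_service_processing(category_attribute, segment_to_compare,
                                 valid_media_segment, analyze, notify,
                                 match_theme, push_similar):
        """Dispatch on the category attribute per claim 9."""
        if category_attribute == "private":
            # Private resource: compare the segments and warn the target
            # object so it can correct the target media data.
            notify(analyze(segment_to_compare, valid_media_segment))
        elif category_attribute == "shared":
            # Shared resource: determine the common media theme and push
            # similar media data containing that theme.
            push_similar(match_theme(segment_to_compare, valid_media_segment))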
10. The method of claim 2, further comprising:
acquiring a sample image triplet, and inputting the sample image triplet into an initial multi-task recognition model; the sample image triplet comprises a target sample image, a first similar sample image corresponding to the target sample image, and a second similar sample image corresponding to the target sample image;
determining, through the initial multi-task recognition model, a first sample image embedding feature, a first sample image category and a first sample target media category corresponding to the target sample image, a second sample image embedding feature, a second sample image category and a second sample target media category corresponding to the first similar sample image, and a third sample image embedding feature, a third sample image category and a third sample target media category corresponding to the second similar sample image;
determining a first loss value according to the first sample image embedding feature, the second sample image embedding feature and the third sample image embedding feature;
determining a second loss value from the first sample image class, the second sample image class, and the third sample image class;
determining a third loss value according to the first sample target media category, the second sample target media category, and the third sample target media category;
and generating a target loss value according to the first loss value, the second loss value and the third loss value, and adjusting the initial multi-task recognition model according to the target loss value to obtain the multi-task recognition model.
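A hedged PyTorch sketch of the three-loss training step in claim 10; the claim does not name the loss functions, so the triplet margin loss, the cross-entropy terms, and the equal 1:1:1 weighting below are assumptions:

    import torch
    import torch.nn as nn

    triplet_loss_fn = nn.TripletMarginLoss(margin=1.0)  # first loss (embeddings)
    image_loss_fn = nn.CrossEntropyLoss()               # second loss (image category)
    media_loss_fn = nn.CrossEntropyLoss()               # third loss (media category)

    def target_loss(model, anchor, positive, negative, image_labels, media_labels):
        """Combine the three loss values of claim 10 into one target loss.

        `model(x)` is assumed to return (embedding, image_logits, media_logits);
        `image_labels` / `media_labels` are (anchor, positive, negative) label
        tensors. All of these interface choices are illustrative.
        """
        outputs = [model(x) for x in (anchor, positive, negative)]
        # First loss value: triplet loss over the three embedding features.
        first_loss = triplet_loss_fn(outputs[0][0], outputs[1][0], outputs[2][0])
        # Second loss value: image-category classification for all three images.
        second_loss = sum(image_loss_fn(out[1], y)
                          for out, y in zip(outputs, image_labels))
        # Third loss value: target-media-category classification likewise.
        third_loss = sum(media_loss_fn(out[2], y)
                         for out, y in zip(outputs, media_labels))
        return first_loss + second_loss + third_loss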
11. The method of claim 10, wherein the acquiring a sample image triplet comprises:
acquiring a sample image set; the sample image set comprises at least two similar sample image pairs, and one similar sample image pair comprises two sample images having a similar relationship;
obtaining a target similar sample image pair from the at least two similar sample image pairs;
selecting a sample image to be operated from the sample images contained in the remaining similar sample image pairs; the remaining similar sample image pairs refer to the similar sample image pairs other than the target similar sample image pair in the at least two similar sample image pairs;
determining the sample image triplet according to the sample image to be operated and the target similar sample image pair.
12. The method of claim 11, wherein the determining the sample image triplet according to the sample image to be operated and the target similar sample image pair comprises:
selecting the target sample image from the sample images included in the target similar sample image pair;
acquiring a sample image representation feature corresponding to the sample image to be operated, and a target image representation feature corresponding to the target sample image;
determining a representation feature similarity between the sample image representation feature and the target image representation feature;
if the representation feature similarity is greater than a feature similarity threshold, determining the remaining sample image as the first similar sample image corresponding to the target sample image, determining the sample image to be operated as the second similar sample image corresponding to the target sample image, and determining the target sample image, the first similar sample image and the second similar sample image as the sample image triplet; the remaining sample image is the sample image in the target similar sample image pair other than the target sample image.
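Claims 11 and 12 together describe building one training triplet. The sketch below assumes cosine similarity over the representation features and a hypothetical threshold value; the claims fix neither the metric nor the threshold:

    import random
    import numpy as np

    def build_triplet(similar_pairs, feature_of, similarity_threshold=0.3):
        """Build one (target, first-similar, second-similar) triplet
        per claims 11-12; all defaults are illustrative.
        """
        target_pair = random.choice(similar_pairs)
        # Sample image to be operated: any image from the remaining pairs.
        remaining_images = [img for pair in similar_pairs
                            if pair is not target_pair for img in pair]
        to_be_operated = random.choice(remaining_images)
        target_image = random.choice(target_pair)
        # The other image of the target pair is the candidate first similar image.
        other = target_pair[0] if target_image is target_pair[1] else target_pair[1]
        a = feature_of(target_image)
        b = feature_of(to_be_operated)
        # Cosine similarity between the two representation features (assumed metric).
        similarity = float(np.dot(a, b) /
                           (np.linalg.norm(a) * np.linalg.norm(b)))
        if similarity > similarity_threshold:
            # A sufficiently similar outside image is kept as the second
            # similar sample image, i.e. a hard negative for the triplet loss.
            return target_image, other, to_be_operated
        return None  # in practice, resample and try again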
13. A data processing apparatus, characterized by comprising:
a frame acquisition module, configured to acquire a target data frame corresponding to target media data;
an identification module, configured to identify a target media category to which the target media data belongs, and a target image feature and a target image category corresponding to the target data frame;
a threshold obtaining module, configured to obtain, from a parameter mapping table, a target matching parameter indicated by the target media category and the target image category; the parameter mapping table comprises mapping relations among a configured media category set, a configured image category set and a matching parameter set, wherein a mapping relation exists among one configured media category in the configured media category set, one configured image category in the configured image category set and one configured matching parameter in the matching parameter set; the configured matching parameter is used for reflecting the matching condition of image features of data frames having the corresponding configured media category and the corresponding configured image category;
a feature matching module, configured to search a candidate image feature set for matched image features matching the target image feature according to the target image feature and the target matching parameter; the candidate image feature set is a set of the image features respectively corresponding to each piece of media data to be recalled in a set of media data to be recalled; and
a valid media determining module, configured to determine valid recall media data in the set of media data to be recalled according to the matched image features and the target image features.
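The parameter mapping table of the apparatus can be modeled as a plain dictionary keyed by (configured media category, configured image category). All keys, values, and the Euclidean distance metric in this sketch are illustrative assumptions, not claim language:

    import numpy as np

    # Parameter mapping table: (configured media category, configured image
    # category) -> configured matching parameter. Keys and values are made up.
    PARAMETER_MAP = {
        ("movie", "person"): 0.15,
        ("movie", "landscape"): 0.25,
        ("game", "ui"): 0.10,
    }

    def find_matching_features(target_feature, target_media_category,
                               target_image_category, candidate_features,
                               default_parameter=0.2):
        """Look up the target matching parameter, then search the candidate set.

        Treats the matching parameter as a Euclidean distance threshold over
        feature vectors; stricter thresholds for categories where near
        matches are common is one plausible reading of the claim.
        """
        threshold = PARAMETER_MAP.get(
            (target_media_category, target_image_category), default_parameter)
        # Distance from the target feature to every candidate image feature.
        distances = np.linalg.norm(candidate_features - target_feature, axis=1)
        # Indices of candidate image features that match the target feature.
        return np.flatnonzero(distances < threshold)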
14. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is configured to provide a network communication function, the memory is configured to store a computer program, and the processor is configured to call the computer program to cause the computer device to perform the method of any one of claims 1 to 12.
15. A computer-readable storage medium, in which a computer program is stored, the computer program being adapted to be loaded and executed by a processor to carry out the method of any one of claims 1 to 12.
CN202210509663.0A 2022-05-11 2022-05-11 Data processing method, device, equipment and readable storage medium Active CN114611637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210509663.0A CN114611637B (en) 2022-05-11 2022-05-11 Data processing method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN114611637A 2022-06-10
CN114611637B CN114611637B (en) 2022-08-05

Family

ID=81870354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210509663.0A Active CN114611637B (en) 2022-05-11 2022-05-11 Data processing method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114611637B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107404646A (en) * 2016-05-20 2017-11-28 华为技术有限公司 The method, apparatus and headend of video quality assessment
US10885099B1 (en) * 2019-08-07 2021-01-05 Capital One Services, Llc Systems and methods for presenting image classification results
US20220078530A1 (en) * 2019-12-17 2022-03-10 Tencent Technology (Shenzhen) Company Limited Video labeling method and apparatus, device, and computer-readable storage medium
US20210240758A1 (en) * 2020-01-30 2021-08-05 Electronics And Telecommunications Research Institute Method of image searching based on artificial intelligence and apparatus for performing the same
WO2022063265A1 (en) * 2020-09-28 2022-03-31 华为技术有限公司 Inter-frame prediction method and apparatus
CN113705597A (en) * 2021-03-05 2021-11-26 腾讯科技(北京)有限公司 Image processing method and device, computer equipment and readable storage medium
CN113761235A (en) * 2021-04-20 2021-12-07 腾讯科技(深圳)有限公司 Multimedia content identification method, related device, equipment and storage medium
CN113761259A (en) * 2021-04-29 2021-12-07 腾讯科技(深圳)有限公司 Image processing method and device and computer equipment
CN113704507A (en) * 2021-10-26 2021-11-26 腾讯科技(深圳)有限公司 Data processing method, computer device and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ye Zhi et al.: "Baidu Intelligent Cloud video AI technology helps the industrial upgrading of the media industry", Artificial Intelligence (《人工智能》) *
Ye Ou et al.: "Video data quality and video data detection technology", Journal of Xi'an University of Science and Technology (《西安科技大学学报》) *

Also Published As

Publication number Publication date
CN114611637B (en) 2022-08-05

Similar Documents

Publication Publication Date Title
Li et al. Zero-shot event detection via event-adaptive concept relevance mining
CN110784759B (en) Bullet screen information processing method and device, electronic equipment and storage medium
Pasquini et al. Media forensics on social media platforms: a survey
CN110149529B (en) Media information processing method, server and storage medium
WO2021082589A1 (en) Content check model training method and apparatus, video content check method and apparatus, computer device, and storage medium
CN111368141B (en) Video tag expansion method, device, computer equipment and storage medium
CN113590854B (en) Data processing method, data processing equipment and computer readable storage medium
CN113177538B (en) Video cycle identification method and device, computer equipment and storage medium
CN112015928B (en) Information extraction method and device for multimedia resources, electronic equipment and storage medium
CN112990378B (en) Scene recognition method and device based on artificial intelligence and electronic equipment
CN110991246A (en) Video detection method and system
CN113762326A (en) Data identification method, device and equipment and readable storage medium
Shenkman et al. Do you see what I see? Capabilities and limits of automated multimedia content analysis
CN114422271B (en) Data processing method, device, equipment and readable storage medium
CN112202849A (en) Content distribution method, content distribution device, electronic equipment and computer-readable storage medium
CN113962199B (en) Text recognition method, text recognition device, text recognition equipment, storage medium and program product
CN116935170A (en) Processing method and device of video processing model, computer equipment and storage medium
CN115062709A (en) Model optimization method, device, equipment, storage medium and program product
CN113362852A (en) User attribute identification method and device
CN112989167B (en) Method, device and equipment for identifying transport account and computer readable storage medium
CN111144546A (en) Scoring method and device, electronic equipment and storage medium
CN114219971A (en) Data processing method, data processing equipment and computer readable storage medium
CN112925899B (en) Ordering model establishment method, case clue recommendation method, device and medium
CN113986660A (en) Matching method, device, equipment and storage medium of system adjustment strategy
CN113822138A (en) Similar video determination method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40071515
Country of ref document: HK