CN116955707A - Content tag determination method, device, equipment, medium and program product - Google Patents

Content tag determination method, device, equipment, medium and program product Download PDF

Info

Publication number
CN116955707A
CN116955707A (application CN202211483665.3A)
Authority
CN
China
Prior art keywords
content
text
candidate
tag
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211483665.3A
Other languages
Chinese (zh)
Inventor
杨煜霖
陈世哲
刘霄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211483665.3A
Publication of CN116955707A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a content tag determination method, apparatus, device, medium, and program product, relating to the field of artificial intelligence. The method comprises: acquiring a plurality of candidate tags of target content; acquiring text content corresponding to the target content; acquiring tag description content corresponding to each of the candidate tags; performing feature extraction on the confidence scores corresponding to the candidate tags to obtain a score feature representation; performing feature extraction on the text content and the tag description content to obtain association feature representations corresponding to the candidate tags; and determining, based on the score feature representation and the association feature representations, the content tag corresponding to the target content from the plurality of candidate tags. By integrating the text information of the target content and the candidate tags, rich text information and the association information among candidate tags are fully utilized in decision-making, improving the accuracy of the finally obtained content tags.

Description

Content tag determination method, device, equipment, medium and program product
Technical Field
Embodiments of the present application relate to the field of artificial intelligence, and in particular to a content tag determination method, apparatus, device, medium, and program product.
Background
Content tags are keywords or phrases that concisely summarize specified content. For a video, for example, content tags may be the names of people, shows, or music appearing in the video, or objects and scenes shown in it, helping viewers understand the video content more directly.
In the related art, one method of determining video tags is retrieval-based: a video tag index library is built from preset tags. When determining the tag of a specified video frame, the target video frame most similar to the specified frame is retrieved from the index library, and the preset tag of that target frame is taken as the tag of the specified frame.
However, this approach determines the tag of the specified video frame by picture similarity alone, and the accuracy of the resulting tag is low.
Disclosure of Invention
Embodiments of the present application provide a content tag determination method, apparatus, device, medium, and program product that can improve the accuracy of content tags. The technical scheme is as follows:
in one aspect, a method for determining a content tag is provided, the method comprising:
acquiring a plurality of candidate tags of target content, wherein the plurality of candidate tags correspond to at least two acquisition modes, the at least two acquisition modes analyze the target content in at least two different analysis manners to obtain the candidate tags, and each candidate tag includes a confidence score under its corresponding acquisition mode;
acquiring text content corresponding to the target content, wherein the text content is text data associated with the target content;
acquiring tag description content corresponding to each of the candidate tags, wherein the tag description content is used to describe the candidate tags;
performing feature extraction on the confidence scores corresponding to the candidate tags to obtain a score feature representation, and performing feature extraction on the text content and the tag description content to obtain association feature representations corresponding to the candidate tags, wherein the association feature representations indicate association relationships among different candidate tags;
and determining, based on the score feature representation and the association feature representations, a content tag corresponding to the target content from the plurality of candidate tags.
In another aspect, a content tag determination apparatus is provided, the apparatus including:
an acquisition module, configured to acquire a plurality of candidate tags of target content, wherein the plurality of candidate tags correspond to at least two acquisition modes, the at least two acquisition modes analyze the target content in at least two different analysis manners to obtain the candidate tags, and each candidate tag includes a confidence score under its corresponding acquisition mode;
the acquisition module being further configured to acquire text content corresponding to the target content, wherein the text content is text data associated with the target content;
the acquisition module being further configured to acquire tag description content corresponding to each of the candidate tags, wherein the tag description content is used to describe the candidate tags;
an extraction module, configured to perform feature extraction on the confidence scores corresponding to the candidate tags to obtain a score feature representation, and to perform feature extraction on the text content and the tag description content to obtain association feature representations corresponding to the candidate tags, wherein the association feature representations indicate association relationships among different candidate tags;
and a determination module, configured to determine, based on the score feature representation and the association feature representations, the content tag corresponding to the target content from the candidate tags.
In another aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory having stored therein a computer program, the computer program being loaded and executed by the processor to implement a method of determining content tags as in any of the above embodiments.
In another aspect, a computer readable storage medium is provided, in which a computer program is stored, the computer program being loaded and executed by a processor to implement a method for determining a content tag as described in any of the above embodiments.
In another aspect, a computer program product is provided that includes a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program to cause the computer device to execute the content tag determination method described in any one of the above embodiments.
The technical scheme provided by the embodiments of the application has at least the following beneficial effects:
Text content corresponding to the target content and tag description content corresponding to each of the plurality of candidate tags are obtained; a score feature representation and association feature representations are extracted and jointly analyzed, so that the content tag corresponding to the target content is determined from the candidate tags. On the one hand, because the candidate tags are obtained by analyzing the target content in at least two different analysis manners, the diversity of the candidate tags for the target content is increased and the fault tolerance of the finally obtained content tags is improved. On the other hand, by integrating the text information of the target content and the candidate tags, rich text information and the association information among candidate tags are fully utilized in decision-making, improving the accuracy of the finally obtained content tags.
Drawings
For a clearer description of the technical solutions in the embodiments of the present application, the drawings required by the description of the embodiments are briefly introduced below. Apparently, the drawings described below show only some embodiments of the present application; other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a process diagram of a method for determining content tags provided in an exemplary embodiment of the present application;
FIG. 2 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method of determining content tags provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of a method of determining content tags provided by another exemplary embodiment of the present application;
FIG. 5 is an overall flow diagram of a method for determining content tags according to an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a network architecture provided by an exemplary embodiment of the present application;
FIG. 7 is a flowchart of a training method for a tag determination model provided by an exemplary embodiment of the present application;
FIG. 8 is an overall block diagram of a training method for a tag determination model provided by an exemplary embodiment of the present application;
FIG. 9 is a block diagram of a content tag determination apparatus provided by an exemplary embodiment of the present application;
FIG. 10 is a block diagram of a content tag determination apparatus provided by another exemplary embodiment of the present application;
FIG. 11 is a block diagram of a computer device provided by an exemplary embodiment of the present application.
Detailed Description
For the purpose of promoting an understanding of the principles and advantages of the application, the embodiments of the application will now be described in detail, some but not all of which are illustrated in the accompanying drawings. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The terms "first", "second", and the like in this disclosure are used to distinguish between similar items having substantially the same function and role. It should be understood that "first" and "second" imply no logical or chronological dependency, nor any limitation on quantity or order of execution.
First, a brief description will be given of terms involved in the embodiments of the present application.
Artificial Intelligence (AI): the theory, methods, techniques, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence research covers the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason, and make decisions.
Machine Learning (ML): a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and more. It studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Computer Vision technology (CV): the science of how to make machines "see" — that is, using cameras and computers in place of human eyes to recognize and measure targets, and further performing graphics processing so that the result is better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and techniques for building artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric techniques such as face recognition and fingerprint recognition.
In the related art, one method of determining video tags is retrieval-based: a video tag index library is built from preset tags; when determining the tag of a specified video frame, the target video frame most similar to the specified frame is retrieved from the index library, and the preset tag of that target frame is taken as the tag of the specified frame. However, this approach determines the tag of the specified video frame by picture similarity alone, and the accuracy of the resulting tag is low.
The embodiments of the application provide a content tag determination method that takes tags of the target content obtained through at least two acquisition modes as candidate tags and adds a candidate-tag ranking algorithm on that basis, making full use of rich text information and of the association information among candidate tags when making decisions, thereby improving the accuracy of the content tags determined for the target content.
Referring to FIG. 1, FIG. 1 is a schematic diagram of the overall process of a content tag determination method provided by an exemplary embodiment of the present application. As shown in FIG. 1:
for the target video 101, in the video tag system, a plurality of candidate tags 102 of the target video 101 may be obtained through at least two acquisition modes (for example, face recognition, object detection, tag multi-classification, video retrieval, etc.). Each candidate tag has a corresponding acquisition mode and a confidence score under that mode, for example: square dance (tag content), 0.9 (confidence score), object detection (acquisition mode).
After the plurality of candidate tags 102 are acquired, the tag determination model 103 provided by the embodiments of the application obtains the video text content corresponding to the target video 101 and the tag description content corresponding to each of the candidate tags 102. Feature extraction is then performed on the confidence scores of the candidate tags 102 to obtain a score feature representation, and on the video text content and the tag description content to obtain association feature representations corresponding to the candidate tags 102, the association feature representations indicating the association relationships among different candidate tags. Finally, the video tag corresponding to the target video 101 is determined from the candidate tags 102 based on the score feature representation and the association feature representations.
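The two-branch fusion described above (a score feature from recall confidences, an association feature from text) can be sketched minimally as follows. All names, the word-overlap stand-in for the learned association feature, and the fixed 0.5/0.5 fusion weights are illustrative assumptions, not the patent's implementation:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CandidateTag:
    name: str
    confidence: float  # confidence score from the recall model
    source: str        # acquisition mode, e.g. "scene_recognition"

def determine_content_tags(candidates: List[CandidateTag],
                           text_content: str,
                           descriptions: Dict[str, str],
                           threshold: float = 0.5) -> List[str]:
    """Fuse each tag's score feature with a stand-in association feature."""
    text_words = set(text_content.lower().split())
    selected = []
    for tag in candidates:
        # Score feature: just the recall confidence in this sketch.
        score_feature = tag.confidence
        # Stand-in for the learned association feature: word overlap
        # between the tag's description and the content's text.
        desc_words = set(descriptions.get(tag.name, tag.name).lower().split())
        overlap = len(desc_words & text_words) / max(len(desc_words), 1)
        if 0.5 * score_feature + 0.5 * overlap >= threshold:
            selected.append(tag.name)
    return selected
```

In the patent, both features are produced by learned feature extractors inside the tag determination model; the overlap score here merely stands in for that association signal to make the control flow concrete.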
The content tag determination method provided by the embodiments of the application can be applied to a short-video publishing scenario: when a creator uploads a short video to a short-video platform, a computer device can identify the content tag corresponding to the short video and publish the short video carrying that content tag. It can also be applied to a video recommendation scenario: the computer device identifies the target video, obtains its corresponding content tag, and stores the target video in association with the content tag; when a viewer later searches for videos related to that content tag, videos can be recommended from the associated storage, improving recommendation accuracy. It should be noted that the above application scenarios are merely illustrative examples; the content tag determination method provided in this embodiment may also be applied to other scenarios, which is not limited in the embodiments of the present application.
Secondly, the implementation environment related to the embodiment of the present application is described, and optionally, the embodiment of the present application may be implemented by a terminal alone, or by a server alone, or by both the terminal and the server together. In this embodiment, a terminal and a server are implemented together as an example.
Referring to fig. 2, the implementation environment relates to a terminal 210 and a server 220, and the terminal 210 and the server 220 are connected through a communication network 230. The communication network 230 may be a wired network or a wireless network, which is not limited in the embodiments of the present application.
In some alternative embodiments, the terminal 210 has installed and running therein a target application program having a content tag determination function. The target application may be implemented as an instant messaging application, a video application, a news information application, a comprehensive search engine application, a social application, a game application, a shopping application, a map navigation application, etc., which is not limited in this embodiment of the present application. Illustratively, when a tag determination needs to be performed on a target content (for example, a target video), the target video may be input into the terminal 210, and after the terminal 210 identifies a content tag corresponding to the target video, the terminal 210 optionally displays the content tag.
In some optional embodiments, the server 220 provides background services for the target application installed in the terminal 210, and a tag labeling system and a tag determination model are deployed in the server 220. After receiving a target video, the server 220 obtains a plurality of candidate tags of the target video through the tag labeling system using at least two acquisition modes, each candidate tag carrying a confidence score under its acquisition mode. The server 220 inputs the candidate tags into the tag determination model, which first obtains the text content corresponding to the target video and the tag description content corresponding to each candidate tag; then performs feature extraction on the confidence scores of the candidate tags to obtain a score feature representation, and on the text content and the tag description content to obtain association feature representations corresponding to the candidate tags; and finally determines the content tag corresponding to the target video from the candidate tags based on the score feature representation and the association feature representations. Optionally, the server 220 transmits the obtained content tag to the terminal 210.
In some optional embodiments, the tag labeling system is disposed in the terminal 210, that is, the terminal 210 may obtain a plurality of candidate tags after analyzing the target video by the tag labeling system; the terminal 210 then transmits the plurality of candidate tags to the server 220.
The terminal 210 includes at least one of a smart phone, a tablet computer, a portable laptop, a desktop computer, an intelligent sound box, an intelligent wearable device, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, and the like.
It should be noted that the server 220 can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms.
Cloud Technology refers to a hosting technology that unifies hardware, software, network, and other resources in a wide area network or local area network to realize the computation, storage, processing, and sharing of data. It is the general term for the network, information, integration, management-platform, and application technologies based on the cloud computing business model, and can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: the background services of technical network systems, such as video websites, picture websites, and other portals, require large amounts of computing and storage resources. With the continued development of the internet industry, each item may in the future carry its own identification mark that must be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data require strong backing system support, which can only be realized through cloud computing. Optionally, the server 220 may also be implemented as a node in a blockchain system.
It should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals involved in the present application are individually authorized by the user or fully authorized by all parties, and the collection, use, and processing of the related data comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the candidate tags, the text content corresponding to the target content, and the like involved in the present application are all acquired with sufficient authorization.
In connection with the above description and implementation environment, FIG. 3 is a flowchart of a content tag determination method provided by an embodiment of the present application. The method may be performed by a server or a terminal, or by both together; in this embodiment, execution by the server is taken as an example. As shown in FIG. 3, the method includes:
step 301, a plurality of candidate tags of the target content are acquired.
The plurality of candidate tags correspond to at least two acquisition modes; the at least two acquisition modes analyze the target content in at least two different analysis manners to obtain the candidate tags, and each candidate tag includes a confidence score under its corresponding acquisition mode.
Optionally, the target content includes at least one of a target video, target text, target audio, and the like; the target content may be any content that a user publishes on a specified platform (e.g., an instant messaging application), such as short videos, official account articles, music, etc.
Optionally, based on a preset tag system, the plurality of candidate tags of the target content are obtained through at least two acquisition modes. At least two recall models may be set in the tag system, each corresponding to one acquisition mode; a recall model analyzes the target content to obtain the corresponding candidate tags. Optionally, the recall models include a face recognition model, an object detection model, a tag multi-classification model, a video retrieval model, a show-name recognition model, a scene recognition model, or any other model capable of performing tag recognition on the target content, which is not limited in the embodiments of the present application.
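A minimal sketch of a tag system that runs several recall models and merges their outputs, assuming each recall model is simply a callable returning (tag name, confidence) pairs; the function and parameter names are illustrative, not from the patent:

```python
def recall_candidates(content, recall_models):
    """recall_models: dict mapping a model name (the acquisition mode) to a
    callable(content) that returns [(tag_name, confidence)]. Every
    (tag, model) pair is kept as a separate candidate triple."""
    candidates = []
    for model_name, model_fn in recall_models.items():
        for tag_name, confidence in model_fn(content):
            candidates.append((tag_name, confidence, model_name))
    return candidates
```

Keeping one triple per (tag, model) pair preserves the per-acquisition-mode confidence scores that the later feature extraction step consumes.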
Optionally, different recall models are responsible for identifying different types of tags, for example: the face recognition model identifies person-name tags; the show-name recognition model identifies theme tags such as movies, television dramas, variety shows, and animation; the scene recognition model identifies scene tags; and so on.
Optionally, one candidate tag corresponds to one recall model; alternatively, one candidate tag may correspond to multiple recall models. For example, if the scene recognition model recognizes the scene of the target video as a desert and the object detection model also detects a desert, then the tag "desert" corresponds to two recall models.
Each candidate tag has a corresponding tag score, which indicates the probability that the candidate tag, obtained by analyzing the target content in a specified analysis manner, belongs to the actual tags of the target content. Optionally, the candidate tags may be stored in the preset tag system as triples in the format <tag name, confidence score, recall model>, for example: one candidate tag of the target video is <desert, 0.8, scene recognition model>, meaning that the scene recognition model performed scene recognition on the target video, obtained the tag "desert", and the probability that this tag reflects an actual scene of the target video is 0.8.
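The triple storage and the multi-recall case above can be sketched with a namedtuple; the field names and the grouping helper are illustrative assumptions:

```python
from collections import defaultdict, namedtuple

# The <tag name, confidence score, recall model> triple from the text.
Candidate = namedtuple("Candidate", ["name", "score", "recall_model"])

def group_by_name(candidates):
    """Group candidates by tag name, so a tag recalled by several models
    keeps one entry per recall model (the multi-recall case above)."""
    groups = defaultdict(list)
    for c in candidates:
        groups[c.name].append(c)
    return dict(groups)
```

For example, the tag "desert" recalled by both the scene recognition model and the object detection model yields one group containing two candidates, each with its own confidence score.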
Step 302, obtaining text content corresponding to the target content.
Wherein the text content is text data associated with the target content.
Optionally, if the target content is implemented as target text, the text content includes first text data contained in the target text itself, for example: title, keywords, body text, etc.; the text content may also include second text data related to the target text, for example: references. Optionally, the first text data and the second text data are spliced to obtain the text content corresponding to the target text.
Optionally, if the target content is implemented as target audio, the text content includes third text data included in the target audio itself, for example: title text, subtitle text, source information, etc.; the text content may also include fourth text data resulting from automatic speech recognition (Automatic Speech Recognition, ASR) of the target audio. Optionally, the third text data and the fourth text data are spliced to obtain text content corresponding to the target audio.
Optionally, if the target content is implemented as a target video, the text content includes fifth text data contained in the target video itself, for example: title text, subtitle text, video links, etc.; the text content may also include text data obtained by analyzing the target video, for example: sixth text data obtained by performing ASR on the target video; or seventh text data obtained by performing optical character recognition (Optical Character Recognition, OCR) on the target video to convert the text contained in the video frames. Optionally, the fifth text data, the sixth text data, and the seventh text data are spliced to obtain the text content corresponding to the target content.
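The splicing of the per-modality texts into one text content string can be sketched as follows (a minimal illustration; the separator and the skipping of empty fields are assumptions, not specified by the patent):

```python
def build_text_content(title, asr_text, ocr_text, sep=" "):
    """Splice the title text, ASR text, and OCR text of a target video into a
    single text content string; empty fields are skipped (an assumption)."""
    parts = [p.strip() for p in (title, asr_text, ocr_text) if p and p.strip()]
    return sep.join(parts)
```

The same pattern applies to target texts (title + keywords + body + references) and target audio (title + subtitles + ASR text).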
And step 303, acquiring tag description contents corresponding to the plurality of candidate tags respectively.
The tag description content is used for describing the candidate tags.
Optionally, the preset tag system includes a tag library in which tag information corresponding to each of the plurality of candidate tags is recorded. Optionally, the tag information includes a tag identifier, a tag name, a tag description, a tag classification, and the like, for example: "001", "Liu somewhere", "well-known domestic actor who starred in 'Certain Records'", "actor", etc.
Optionally, the tag information of a candidate tag may be screened and processed into a complete sentence, which is used as the tag description content corresponding to the candidate tag. Illustratively, the tag description content is formatted as: "tag name, classification: tag classification, description: tag description". Assuming tag i is < Liu somewhere, 0.9, face recognition >, the tag description content of tag i is: "Liu somewhere, classification: actor, description: well-known domestic actor who starred in 'Certain Records'", etc.
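A minimal sketch of assembling the tag description content from the tag information fields, following the template in the example above (the function name is hypothetical):

```python
def build_tag_description(name, classification, description):
    """Format tag information into the sentence used as tag description content:
    'name, classification: ..., description: ...'."""
    return f"{name}, classification: {classification}, description: {description}"
```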
Step 304, extracting the characteristics of the confidence scores corresponding to the candidate labels respectively to obtain score characteristic representations; and extracting the characteristics of the text content and the tag description content to obtain associated characteristic representations corresponding to the candidate tags respectively.
Wherein the association feature representation is used for indicating the association relation between different candidate labels.
The following describes the extraction process of the score feature representation and the associated feature representation, respectively:
1. Score feature representation
Optionally, feature extraction is performed on the confidence scores corresponding to the candidate tags to obtain a score feature vector as the score feature representation, where the number of the at least two acquisition modes determines the dimension of the score feature vector. A K-dimensional score vector is constructed as the score feature representation according to the K acquisition modes and the confidence scores of the candidate tag under the K acquisition modes, where the score in the j-th dimension of the K-dimensional score vector is used for indicating the confidence score of the candidate tag under the j-th acquisition mode, K and j are positive integers, j is smaller than or equal to K, and K is greater than or equal to 2; feature extraction is then performed on the K-dimensional score vector to obtain the score feature representation.
The K acquisition modes correspond to K recall models, and the recall models are used for analyzing target contents to obtain corresponding candidate labels. Optionally, if the recall model includes 6 models, such as a face recognition model, an object detection model, a tag multi-classification model, a video retrieval model, a play name recognition model, and a scene recognition model, then 6 acquisition modes are corresponding, K is 6, that is, the constructed score vector is a 6-dimensional score vector.
Illustratively, for each candidate tag, its scores under all recall models are extracted (the same candidate tag may be identified by multiple recall models simultaneously) to form a score vector. The dimension (or length) of the score vector is the total number of recall models. Illustratively, for candidate tag i, assume there are 7 recall models in total, where the face recognition model is the 3rd recall model and the multimodal classification model is the 5th recall model. If candidate tag i is < Liu somewhere, 0.9, face recognition > and < Liu somewhere, 0.3, multimodal classification model >, then the score vector of candidate tag i is [0, 0, 0.9, 0, 0.3, 0, 0]. After the score vector of candidate tag i is constructed, feature extraction is performed on this vector, and the result is the score feature representation of candidate tag i.
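The construction of the K-dimensional score vector can be sketched directly from the example above (function and parameter names are hypothetical):

```python
def build_score_vector(candidate_triples, model_index, k):
    """Build the K-dimensional score vector of one candidate tag.

    candidate_triples: list of (tag_name, confidence, recall_model) triples
                       for this one tag;
    model_index: dict mapping recall model name -> dimension j (0-based);
    k: total number of recall models (i.e., acquisition modes).
    """
    vec = [0.0] * k                              # initialize as an all-zero vector
    for _, confidence, model in candidate_triples:
        vec[model_index[model]] = confidence     # place the score at its model's slot
    return vec
```

With the face recognition model as the 3rd of 7 models and the multimodal classification model as the 5th, the example tag above yields [0, 0, 0.9, 0, 0.3, 0, 0].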
In some embodiments, feature extraction may also be performed on a single-dimensional score vector to obtain the score feature representation. Optionally, for a candidate tag identified by only one recall model, its confidence score under that recall model is the score vector corresponding to the candidate tag; for a candidate tag identified by a plurality of recall models, the confidence scores under these recall models are weighted and averaged, and the weighted average result is the score vector corresponding to the candidate tag. For example: if the confidence score of tag i under recall model 1 is 0.9 and its confidence score under recall model 2 is 0.6, the final score vector of tag i may be the average of the two, 0.75.
2. Associated feature representation
The associated feature representation refers to the degree of association between different tags. Illustratively, the associated feature representation 1 of tag i is used for indicating the degree of association between tag i and tag j, and can represent the likelihood that tag i belongs to the actual tags of the target content given that tag j belongs to the actual tags of the target content. For example: if "Liu somewhere" and "Certain Records" both appear in the candidate tag list, then the two tags can be considered highly trusted (Liu somewhere starred in "Certain Records"); whereas if "Liu somewhere" and "Ultraman" both appear in the list, then neither tag is trusted, because the probability of the two tags appearing together is low.
Optionally, when extracting the associated feature representation corresponding to each candidate tag, the text information of the candidate tag and the text information of the target content corresponding to the candidate tag are comprehensively considered. Text splicing processing is performed on the text content and the tag description content to obtain text spliced content; feature extraction is then performed on the text spliced content to obtain the associated feature representations respectively corresponding to the plurality of candidate tags.
It should be noted that, the text content spliced by each candidate tag is the same, i.e. the text content corresponding to the target content. Schematically, for the candidate tag i, splicing the tag description content corresponding to the candidate tag i and the text content corresponding to the target content to obtain an ith text spliced content, and extracting features of the ith text spliced content to obtain an associated feature representation corresponding to the candidate tag i.
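The per-tag splicing step, in which each tag description is joined to the one shared text content of the target content, can be sketched as (names and separator are hypothetical):

```python
def splice_per_tag(text_content, tag_descriptions, sep=" "):
    """For each candidate tag, splice its tag description content with the
    (shared) text content of the target content, yielding one text spliced
    content per candidate tag."""
    return [f"{desc}{sep}{text_content}" for desc in tag_descriptions]
```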
Step 305, determining a content tag corresponding to the target content from the plurality of candidate tags based on the score feature representation and the associated feature representation.
Optionally, performing weighted average processing on the score characteristic representation and the associated characteristic representation to obtain target characteristic representations corresponding to the candidate labels respectively; and determining a content tag corresponding to the target content from the plurality of candidate tags based on the target feature representation.
Optionally, averaging the score feature representation and the associated feature representation to obtain a target feature representation; and carrying out feature analysis on the target feature representation, and determining a content tag corresponding to the target content from the plurality of candidate tags based on the result of the feature analysis.
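The plain-average variant of combining the two representations can be sketched in one line (element-wise mean; weighted variants would attach per-representation weights instead):

```python
def average_features(score_feat, assoc_feat):
    """Element-wise average of the score feature representation and the
    associated feature representation, yielding the target feature
    representation (the unweighted case described above)."""
    assert len(score_feat) == len(assoc_feat)
    return [(s + a) / 2.0 for s, a in zip(score_feat, assoc_feat)]
```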
In summary, according to the method for determining a content tag provided by the embodiment of the present application, the text content corresponding to the target content and the tag description content respectively corresponding to the plurality of candidate tags of the target content are obtained, the score feature representation and the association feature representation are extracted, and the score feature representation and the association feature representation are jointly analyzed, so that the content tag corresponding to the target content is determined from the plurality of candidate tags. On the one hand, the plurality of candidate labels are obtained by analyzing the target content based on at least two different analysis modes, so that the diversity of the candidate labels corresponding to the target content is increased, and the fault tolerance of the finally obtained content labels is also improved; on the other hand, the text information of the target content and the candidate labels is integrated, rich text information is fully utilized, and the association information among the candidate labels is fully utilized to make decisions, so that the accuracy of finally obtaining the content labels is improved.
According to the method provided by the embodiment of the application, the text information related to the target content and the text information related to the candidate labels are subjected to text splicing, so that for each candidate label, the corresponding associated feature representation comprises the related information of the target content, the information expression capability of the associated feature representation is increased, and the accuracy of the finally obtained content label is improved.
According to the method provided by the embodiment of the application, the score vector is constructed through the confidence scores respectively corresponding to the candidate labels in at least two acquisition modes, so that the score vector is subjected to feature extraction, and the score feature representation of the candidate labels is obtained. By the method for constructing the score vector, the confidence score condition of the candidate label under different analysis modes is better reflected, and the accuracy and the comprehensiveness of the extracted score feature representation are improved.
According to the method provided by the embodiment of the application, the target characteristic representation is obtained by carrying out weighted average processing on the score characteristic representation and the associated characteristic representation, so that the content label corresponding to the target content is determined from the plurality of candidate labels based on the target characteristic representation, the correlation between the candidate labels and the content associated with the candidate labels is maximized, and the accuracy and summarizing capability of the obtained target characteristic representation are improved.
In some alternative embodiments, feature analysis may be performed on the target feature representation so that each candidate tag is re-scored; then, according to a preset score threshold, the content tags corresponding to the target content are determined from the plurality of candidate tags by thresholding. Fig. 4 is a flowchart of another method for determining a content tag according to an embodiment of the present application. The method may be performed by a server or a terminal, or by the server and the terminal together; in the embodiment of the present application, execution by the server is taken as an example. As shown in fig. 4, the method includes:
step 401, obtaining a plurality of candidate tags of the target content.
The method comprises the steps that a plurality of candidate labels are correspondingly provided with at least two acquisition modes, wherein the at least two acquisition modes are modes for analyzing target contents based on at least two different analysis modes to obtain the candidate labels, and the candidate labels comprise confidence scores corresponding to the acquisition modes.
Step 402, obtaining text content corresponding to the target content.
Wherein the text content is text data associated with the target content.
Optionally, taking the implementation of the target content as the target video as an example, a title text, an ASR text (text obtained after automatic speech recognition of the target video), and an OCR text (text obtained after optical character recognition of the target video) corresponding to the target video are obtained.
Referring schematically to fig. 5, which shows an overall flowchart of a method for determining a content tag according to an embodiment of the present application, the method is implemented by a tag determination model 500, where the tag determination model 500 includes an information processor 501, a score mapper 502, a text feature extractor 503, a text feature fusion device 504, and a scoring unit 505. The title text, the ASR text, and the OCR text corresponding to the target video are input to the information processor 501 and spliced therein to obtain video text information 506 corresponding to the target video.
Step 403, obtaining tag description contents corresponding to the plurality of candidate tags respectively.
The tag description content is used for describing the candidate tags.
Optionally, the plurality of candidate tags are stored in a tag system in a manner of triples < tag name, confidence score, recall model >, and tag information corresponding to the plurality of candidate tags respectively is also stored in the tag system, including: tag identification, tag name, tag description, tag classification, etc.
For illustration, please refer to fig. 5: the triples and the tag information corresponding to the plurality of candidate tags are input to the information processor 501, where the description information of each candidate tag and its text information are integrated to obtain tag text information 507 corresponding to the plurality of candidate tags, namely the tag description content, which may be stored in the form of "tag name, classification: tag classification, description: tag description".
And step 404, constructing a K-dimensional score vector according to the K acquisition modes and confidence scores corresponding to the candidate labels under the K acquisition modes.
The score corresponding to the jth dimension in the K-dimensional score vector is used for indicating the confidence score corresponding to the candidate label in the jth acquisition mode, K and j are positive integers, and j is smaller than or equal to K, and K is larger than or equal to 2.
For candidate tag i, the confidence scores of candidate tag i under all of its corresponding recall models are extracted in the information processor 501 to form a score vector; the length of the score vector is the total number of recall models. The score vector is initialized as an all-zero vector, and the entry at the position corresponding to each recall model is set to that model's score; the resulting vector is the K-dimensional score vector corresponding to candidate tag i.
Step 405, feature extraction is performed on the K-dimensional score vector to obtain the score feature representation.
Illustratively, referring to fig. 5, the K-dimensional score vector corresponding to candidate tag i is input to the score mapper 502. Optionally, the score mapper 502 is a five-layer network: a fully connected layer, a layer_norm layer (which normalizes all features), an activation function layer (e.g., the ReLU function), a second fully connected layer, and another layer_norm layer. The dimension of the first fully connected layer is K×512, where K is the number of recall models, i.e., the K acquisition modes, and 512 is the feature dimension after mapping, i.e., the K-dimensional score vector is mapped to 512 dimensions; the dimension of the second fully connected layer is 512×512, so the dimension of the final output feature vector remains 512. That is, after feature extraction is performed on the K-dimensional score vector corresponding to candidate tag i, the obtained score feature representation is 512-dimensional. It should be noted that each candidate tag is mapped to a 512-dimensional score feature representation 508.
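A pure-Python sketch of the score mapper's forward pass, under the assumption of the five-layer layout described above (FC, layer_norm, ReLU, FC, layer_norm); tiny dimensions and hand-picked weights stand in for the real K×512 and 512×512 layers:

```python
import math

def linear(x, w, b):
    """Fully connected layer: y_j = sum_i x_i * w[i][j] + b_j."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
            for j in range(len(w[0]))]

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def relu(x):
    return [max(0.0, v) for v in x]

def score_mapper(score_vec, w1, b1, w2, b2):
    """Five-layer score mapper sketch: FC (K x D) -> layer_norm -> ReLU ->
    FC (D x D) -> layer_norm, mapping a K-dim score vector to a D-dim
    score feature representation."""
    h = layer_norm(linear(score_vec, w1, b1))
    h = relu(h)
    return layer_norm(linear(h, w2, b2))
```

In the patent's configuration K is the number of recall models and D is 512; the weights here would be learned during training.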
And step 406, performing text splicing processing on the text content and the tag description content to obtain text spliced content.
For schematic illustration, please refer to fig. 5, for each candidate tag, the video text information 506 and the tag text information 507 are spliced into a text, so as to obtain text splicing contents corresponding to each candidate tag.
Optionally, before the text splicing processing is performed, a preprocessing operation may also be performed on the text content corresponding to the target content, where the preprocessing operation includes: redundant information removal, since the text content may contain redundant information such as unnecessary spaces, repeated punctuation marks, and unnecessary repeated words, which can be checked for and deleted before the text splicing operation; typo correction, since the text content may contain wrongly written characters, which need to be detected and then corrected in the target text; and punctuation processing, in which punctuation marks contained in the text content may be marked or deleted from the target text. It should be noted that the above preprocessing operations are merely illustrative examples, and the embodiments of the present application are not limited thereto.
And step 407, responding to the difference value between the text length corresponding to the text splicing content and the preset length being larger than a preset threshold value, and performing text length adjustment operation on the text splicing content according to the preset threshold value to obtain target text splicing content.
The plurality of candidate labels comprise a first part of candidate labels, a second part of candidate labels and a third part of candidate labels; the text splice content of the candidate labels of each part corresponds to the text length in a different length range.
Optionally, in response to the text length corresponding to the text splicing content of the first part of candidate labels being greater than the preset length, performing text length cutting operation on the text splicing content according to the preset length to obtain first text splicing content; responding to the text length corresponding to the text splicing content of the second part of candidate labels to be smaller than the preset length, and performing text length supplementing operation on the text splicing content according to the preset length to obtain second text splicing content; responding to the text splicing content of the third part candidate label, wherein the text length corresponding to the text splicing content is equal to the preset length, and taking the text splicing content as third text splicing content; and the first text splicing content, the second text splicing content and the third text splicing content are used as target text splicing contents respectively corresponding to the candidate labels.
For illustration, please refer to fig. 5, text splicing contents corresponding to the candidate labels are input into the text feature extractor 503, and in the text feature extractor 503, for each text splicing content, if the text length is greater than 384, the text is truncated, and if the text length is less than 384, a flag bit is added at the end of the text until the text length is 384, for example: a [ PAD ] flag bit; if the text length is equal to 384, no manipulation of the text length of the text splice content is required.
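The truncate-or-pad step above can be sketched as follows, using the preset length 384 and the [PAD] flag bit from the description (operating on a token list; the tokenization itself is out of scope here):

```python
MAX_LEN = 384     # preset length from the description above
PAD = "[PAD]"     # flag bit appended to short texts

def adjust_length(tokens, max_len=MAX_LEN, pad=PAD):
    """Truncate token sequences longer than max_len; pad shorter ones with the
    [PAD] flag bit until they reach max_len; leave exact-length ones as-is."""
    if len(tokens) > max_len:
        return tokens[:max_len]
    return tokens + [pad] * (max_len - len(tokens))
```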
In some optional embodiments, text length adjustment operation can be performed on the text spliced content according to a semantic analysis result of the text spliced content, optionally, sentence processing is performed on the text spliced content in response to the text length corresponding to the text spliced content being greater than a preset length, so as to obtain a content sentence result; carrying out semantic similarity analysis on the content clause result to obtain a similarity analysis result, wherein the similarity analysis result is used for indicating semantic similarity among the clauses in the content clause result; based on the similarity analysis result, cutting the text splicing content according to the preset length to obtain the target text splicing content.
Schematically, when the text length of the text splicing content is greater than the preset length, the text can be cut, then the text splicing content is firstly divided into sentences, after each sentence is obtained, the semantic similarity among the sentences is analyzed, and the target text splicing content is obtained through the semantic similarity.
The condition that the target text splice content is obtained based on the semantic similarity comprises at least one of the following conditions:
1. the similarity analysis result comprises semantic similarity among m first clauses corresponding to the spliced text content in the text spliced content, wherein m is a positive integer, and m is more than or equal to 2; responding to the similarity between n pairs of first clauses in the similarity analysis result to be larger than a preset similarity threshold, wherein n is a positive integer, and 2n is smaller than or equal to m; and cutting the first clause of the n pairs according to the preset length to obtain the target text splicing content.
Schematically, after obtaining the clauses of the video text content in the text splicing content, calculating the similarity between the clauses in the video text content, and if the similarity between the clause 1 and the clause 2 in the video text is greater than a preset similarity threshold value, cutting the clause 1 and the clause 2. And calculating a difference value between the preset length and the text length, if the length of the clause 1 is close to the difference value, performing cutting operation on the clause 1, and if the length of the clause 2 is close to the difference value, performing cutting operation on the clause 2. Optionally, if the text length does not reach the preset length after clipping, continuing to perform a clipping operation on the next clause 3 and the next clause 4 with higher semantic similarity until the text length reaches the preset length.
2. The similarity analysis result comprises semantic similarity between a second clause corresponding to the label description content spliced in the text splicing content and m first clauses; obtaining a similarity average value between a second clause and m first clauses corresponding to the tag description content; and cutting the m first clauses according to the preset length based on the average similarity value to obtain the target text splicing content.
Schematically, obtaining clause a corresponding to the tag description content, obtaining clause 1, clause 2, clause 3 and clause 4 corresponding to the video text content, calculating the average value of semantic similarity between the clause a and the clause 1, the clause 2, the clause 3 and the clause 4, and optionally, cutting the m clauses from large to small according to the difference value between the m first clauses and the average value until the text length reaches the preset length.
And step 408, extracting features of the spliced content of the target text to obtain tag text feature representations corresponding to the candidate tags respectively.
Illustratively, referring to fig. 5, after the text lengths corresponding to the text spliced contents of all candidate tags are unified in the text feature extractor 503, the resulting target text spliced contents are embedded, i.e., each word in the target text spliced content is mapped to a dense vector, yielding 384 input embeddings. The text feature extractor 503 is provided with a BERT-base-chinese (Bidirectional Encoder Representations from Transformers) model, which performs feature encoding on the 384 input embeddings to obtain 384 output embeddings; the embedding corresponding to the first position [CLS] among the 384 output embeddings, feature_0, is taken as the output of the text feature extractor 503, and the dimension of each output feature vector is 512, namely the tag text feature representation. Each candidate tag is thus mapped to a 512-dimensional tag text feature representation.
It should be noted that the text feature extractor 503 may also be implemented as any chinese text feature extraction model, for example: a Set of Words (SoW) model, a Bag of Words (BoW) model, etc., as embodiments of the application are not limited in this regard.
And 409, performing association analysis on label text feature representations corresponding to an ith candidate label and other candidate labels in the plurality of candidate labels to obtain association feature representations corresponding to the ith candidate label, wherein i is a positive integer.
Schematically, please refer to fig. 5, the text feature representations of the i-th candidate tag and the other candidate tags in the plurality of candidate tags are input to the text feature fusion device 504, and are subjected to association analysis in the text feature fusion device 504, so as to obtain the associated feature representation 509 corresponding to the i-th candidate tag. It should be noted that, association degree analysis is performed on each candidate tag to obtain a corresponding association feature representation.
The text feature fusion device 504 may be implemented as a convolutional neural network including an attention mechanism. Optionally, the tag text feature representations corresponding to the i-th candidate tag and the other candidate tags are input into the convolutional neural network, and a convolution operation is performed on them; based on the convolution operation, the attention weights between the i-th candidate tag and the other candidate tags are obtained and taken as the associated feature representation corresponding to the i-th candidate tag, where the attention weights are used for indicating the degree of association between the i-th candidate tag and the other candidate tags.
Referring to fig. 6, the convolutional neural network 600 is a simple stack of three multi-head attention layers, with layer_norm and activation function layers interposed between them. Optionally, a residual structure is added in the convolutional neural network 600: the input tag correlation feature 601 is directly added to the output of the last layer to obtain the output feature, namely the associated feature representation. Optionally, k vectors of dimension 512 are input to the convolutional neural network 600, where k is the number of candidate tags, and k vectors of dimension 512 are output.
Alternatively, the text feature fusion device 504 may be at least one of a long short-term memory (Long Short-Term Memory, LSTM) network, a graph convolutional network (Graph Convolutional Network, GCN), and the like, which is not limited in the embodiments of the present application.
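A simplified, single-head self-attention sketch of the fusion step (the real model uses three multi-head attention layers with learned projections; here queries, keys, and values are the tag text feature vectors themselves, which is an assumption made for brevity), keeping the residual addition described above:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_fuse(tag_feats):
    """Single-head self-attention over the k tag text feature vectors, with a
    residual connection adding each input vector onto its attention output.
    tag_feats: list of k feature vectors of equal dimension d."""
    d = len(tag_feats[0])
    scale = math.sqrt(d)
    fused = []
    for q in tag_feats:
        # attention weights of this tag against every tag (including itself)
        weights = softmax([dot(q, kvec) / scale for kvec in tag_feats])
        mixed = [sum(w * v[j] for w, v in zip(weights, tag_feats))
                 for j in range(d)]
        fused.append([m + qv for m, qv in zip(mixed, q)])  # residual add
    return fused
```

The output keeps the k × d shape, matching the "k vectors of dimension 512 in, k vectors of dimension 512 out" behavior of the fusion device.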
And 410, performing weighted average processing on the score characteristic representation and the associated characteristic representation to obtain target characteristic representations corresponding to the candidate labels respectively.
Optionally, calculating an average feature between the score feature representation and the associated feature representation to obtain target feature representations respectively corresponding to the plurality of candidate tags.
Schematically, please refer to fig. 5, for the candidate tag i, the score feature representation 508 and the associated feature representation 509 are averaged to obtain the target feature representation.
In step 411, feature analysis is performed on the target feature representation, so as to obtain a re-scoring result corresponding to each of the plurality of candidate labels.
The re-scoring result is obtained by updating the confidence score of the candidate tag.
For candidate tag i, the target feature representation is input to the scoring unit 505, which includes a fully connected layer of dimension 512×1. The target feature representation is mapped to a single number through the fully connected layer, and the value is mapped into the range (0, 1) using an activation function (e.g., the sigmoid function); the resulting score is the re-scoring result of candidate tag i. For each candidate tag, feature analysis is performed on its corresponding target feature representation to obtain the corresponding re-scoring result.
Step 412, the target candidate tags, among the plurality of candidate tags, whose re-scoring results are greater than or equal to the preset score threshold are taken as the content tags corresponding to the target content.
Illustratively, in the scoring unit 505, if the re-scoring result of a candidate tag is greater than or equal to the set score threshold, the candidate tag is considered correct and is output; if the re-scoring result of a candidate tag is smaller than the set score threshold, the candidate tag is not output. Finally, all output candidate tags are taken as the content tags corresponding to the target content.
Optionally, for the target candidate tags whose re-scoring results are greater than or equal to the preset score threshold, the scoring unit 505 may also rank them while scoring, sorting them in descending order of their re-scoring results.
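The re-scoring, thresholding, and sorting steps can be sketched together as follows (the 512×1 fully connected layer is reduced to a dot product with a weight vector; all names are illustrative):

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rescore_and_filter(tag_feats, w, b, threshold=0.5):
    """Map each target feature representation to a re-scoring result in (0, 1)
    via a fully connected layer (weight vector w, bias b) plus sigmoid, keep
    only tags scoring at or above the threshold, and sort them in descending
    order of score. tag_feats: dict of tag name -> target feature vector."""
    scored = {name: sigmoid(dot_product(feat, w) + b)
              for name, feat in tag_feats.items()}
    kept = [(name, s) for name, s in scored.items() if s >= threshold]
    return sorted(kept, key=lambda p: p[1], reverse=True)
```

The threshold value 0.5 here is a placeholder; the patent only requires a preset score threshold, not a specific value.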
In summary, according to the method for determining a content tag provided by the embodiment of the present application, the text content corresponding to the target content and the tag description content respectively corresponding to the plurality of candidate tags of the target content are obtained, the score feature representation and the association feature representation are extracted, and the score feature representation and the association feature representation are jointly analyzed, so that the content tag corresponding to the target content is determined from the plurality of candidate tags. On the one hand, the plurality of candidate labels are obtained by analyzing the target content based on at least two different analysis modes, so that the diversity of the candidate labels corresponding to the target content is increased, and the fault tolerance of the finally obtained content labels is also improved; on the other hand, the text information of the target content and the candidate labels is integrated, rich text information is fully utilized, and the association information among the candidate labels is fully utilized to make decisions, so that the accuracy of finally obtaining the content labels is improved.
According to the method provided by the embodiment of the application, the association degree analysis is carried out on the label text characteristics of different candidate labels, the correlation among the candidate labels is fully considered by a method of an attention mechanism, the effectiveness of screening the content labels is enhanced, and the accuracy of the obtained content labels is improved.
According to the method provided by the embodiment of the application, the label text features of different candidate labels are analyzed after being unified in length, so that the complexity of a model is reduced, and the efficiency of obtaining the content labels is improved.
According to the method provided by the embodiment of the application, the target characteristic representation is subjected to characteristic analysis to obtain the re-scoring results corresponding to the candidate labels, the target candidate label with the re-scoring result being greater than or equal to the preset score threshold value in the candidate labels is used as the content label corresponding to the target content, and the candidate labels are screened in a threshold value setting mode, so that the accuracy of the obtained content label is improved.
In some alternative embodiments, the tag determination model in the above embodiments may be obtained through supervised training. Fig. 7 is a flowchart of a training method of the tag determination model provided in an embodiment of the present application; the method may be executed by a server or a terminal, or by the server and the terminal together. In the embodiment of the present application, execution by the server is taken as an example. As shown in fig. 7, the method includes:
Step 701, q sample tags of sample content are acquired, and a reference tag set of the sample content is acquired.
The q sample labels correspond to at least two acquisition modes, wherein the at least two acquisition modes are modes for analyzing sample contents based on at least two different analysis modes to obtain sample labels, the sample labels comprise confidence scores corresponding to the acquisition modes, and q is a positive integer; the reference tag set of the sample content is an actual tag set of the sample content, and optionally, the reference tag set of the sample content is obtained in a manual labeling mode.
Step 702, sample text content corresponding to the sample content is obtained.
Wherein the sample text content is text data associated with the sample content.
Optionally, taking a sample video as the sample content for illustration, the title text, ASR text, and OCR text corresponding to the sample video are obtained.
Illustratively, 50 thousand videos, together with their corresponding title text, ASR text, OCR text, sample tags, and tag sets, are prepared as the training data set of the sample tag determination model.
Referring to fig. 8, an overall structure diagram of a training method of a tag determination model according to an embodiment of the present application is shown. As shown in fig. 8, the tag determination model 800 includes a sample information processor 801, a sample score mapper 802, a sample text feature extractor 803, a shared random mask layer 804, a sample text feature attention fusion device 805, and a sample score device 806. The title text, ASR text, and OCR text corresponding to the sample video are input into the sample information processor 801 and spliced to obtain sample video text information 807 corresponding to the sample video.
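The splicing performed by the sample information processor 801 can be pictured as plain text concatenation; the `[SEP]` separator below is an assumption for illustration, as the embodiment does not specify a separator.

```python
def build_video_text(title, asr_text, ocr_text, sep=" [SEP] "):
    # Concatenate the three text sources into one sample video text
    # string (sample video text information 807); the [SEP] separator
    # is an assumed choice, not specified by the embodiment.
    return sep.join([title, asr_text, ocr_text])

text = build_video_text("cooking vlog", "today we make noodles", "step 1: boil water")
# text == "cooking vlog [SEP] today we make noodles [SEP] step 1: boil water"
```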
In step 703, sample tag description contents corresponding to the q sample tags are obtained through the sample tag determination model.
The sample tag description content is used for describing the sample tag. Optionally, q sample tags are stored in a tag system in a manner of triplet < sample tag name, confidence score, recall model >, and tag information corresponding to the q sample tags respectively is also stored in the tag system, including: tag identification, tag name, tag description, tag classification, etc. And integrating the label information to obtain sample label descriptive contents corresponding to the q sample labels respectively.
For illustration, please refer to fig. 8: the triplets and the tag information corresponding to the q sample tags are input to the sample information processor 801, and the description information corresponding to each sample tag is integrated with its text information to obtain sample tag text information 808 corresponding to the q sample tags, i.e., the sample tag description contents.
Step 704, constructing a K-dimensional score vector according to the K acquisition modes and confidence scores corresponding to the sample labels in the K acquisition modes through a sample label determination model; and extracting the characteristics of the K-dimensional fractional vector to obtain sample fractional characteristic representation.
The score corresponding to the jth dimension in the K-dimensional score vector is used for indicating the confidence score corresponding to the sample label in the jth acquisition mode, K and j are positive integers, j is smaller than or equal to K, and K is larger than or equal to 2.
For illustration, please refer to fig. 8, the triples corresponding to the q sample tags are input to the sample information processor 801, and for the sample tag i, confidence scores under all recall models corresponding to the sample tag i are extracted to form a score vector; the length of the score vector is the number of all recall models; initializing the score vector into an all-zero vector, changing the number corresponding to the recall model position into the score thereof, and obtaining a vector which is a K-dimensional score vector corresponding to the sample label i; the K-dimensional score vector corresponding to the sample tag i is input to the sample score mapper 802, and the sample score feature representation 809 corresponding to the sample tag i can be output, where the sample score mapper 802 has the same network structure as the score mapper 502, and the feature extraction process of the K-dimensional score vector can refer to step 405, which is not repeated herein.
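The construction of the K-dimensional score vector for a sample tag can be sketched as below; the dictionary-based interface is an assumption, but the logic follows the description above: initialize an all-zero vector of length K, then write each recall model's confidence score at that model's position.

```python
def build_score_vector(recall_scores, model_index, k):
    """recall_scores: {recall model name: confidence score} for one tag;
    model_index: maps each recall model name to its dimension j;
    k: total number of recall models (the vector length)."""
    vec = [0.0] * k                      # initialize as an all-zero vector
    for name, score in recall_scores.items():
        vec[model_index[name]] = score   # write the score at the model's position
    return vec

v = build_score_vector({"cls": 0.8, "ner": 0.6}, {"cls": 0, "match": 1, "ner": 2}, k=3)
# v == [0.8, 0.0, 0.6]
```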
Step 705, performing text splicing processing on the sample text content and the sample tag description content through the sample tag determination model to obtain sample text spliced content.
For illustration, referring to fig. 8, for each sample tag, the sample video text information 807 and the sample tag text information 808 are spliced into a text, so as to obtain sample text splicing contents respectively corresponding to each sample tag.
Step 706, in response to the difference value between the text length corresponding to the sample text splicing content and the preset length being greater than the preset threshold, performing text length adjustment operation on the sample text splicing content according to the preset threshold to obtain the target text splicing content.
For illustration, please refer to fig. 8: the sample text spliced contents corresponding to the q sample tags are input to the sample text feature extractor 803. For each sample text spliced content, the text is truncated if its length is greater than 384; if its length is less than 384, flag bits, for example [PAD] flag bits, are appended at the end of the text until the length reaches 384; if the length is exactly 384, no adjustment of the text length is required.
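The length adjustment can be sketched as truncate-or-pad logic over a token sequence; treating the text as a token list and the function name are assumptions, while the 384 length and the [PAD] flag bit come from the example above.

```python
def adjust_length(tokens, target_len=384, pad_token="[PAD]"):
    if len(tokens) > target_len:
        return tokens[:target_len]   # truncate texts longer than the preset length
    # pad shorter texts with [PAD] flag bits until the preset length is reached
    return tokens + [pad_token] * (target_len - len(tokens))
```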
Step 707, extracting features of the target text splicing content through the sample tag determination model to obtain sample tag text feature representations corresponding to the sample tags respectively.
Referring to fig. 8, in the sample text feature extractor 803, after the text lengths corresponding to the sample text splicing contents of all the sample tags are unified, feature extraction is performed on the target text splicing contents, so as to obtain sample tag text feature representations corresponding to the plurality of sample tags respectively. The network structures of the sample text feature extractor 803 and the text feature extractor 503 are the same, and the process of feature extraction on the target text stitching content can refer to step 408, which is not described herein.
Step 708, according to the preset probability p, the sample label determining model performs discarding operation on the sample score feature representations and the sample label text feature representations corresponding to the q×p sample labels at random, so as to obtain sample score feature representations and sample label text feature representations corresponding to the y sample labels respectively.
Wherein p is greater than 0 and less than 1, y=q-q×p.
Illustratively, to enhance the generalization capability of the sample tag determination model, a shared random mask layer 804 is added after the sample score mapper 802 and the sample text feature extractor 803, respectively, to randomly discard features with probability p, p e (0, 1).
In the sample label determination model training process, assuming a sample video has q sample labels, the sample score feature representations and sample label text feature representations corresponding to q×p randomly chosen sample labels are discarded, with the features at those positions set to zero. The sample score feature representations and sample label text feature representations discarded in the shared random mask layer 804 must correspond to the same sample labels, and the range of each mask is randomly generated.
It should be noted that, in the actual application process of the model, p may be set to 0, that is, all the score feature representations and the label text feature representations corresponding to the candidate labels are reserved.
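A minimal sketch of the shared random mask: the same randomly chosen tags lose both their score features and their text features (which are set to zero) with drop probability p, and setting p = 0, as noted above for actual application, keeps everything. Representing features as plain lists is an illustrative assumption.

```python
import random

def shared_random_mask(score_feats, text_feats, p):
    q = len(score_feats)
    # choose the SAME tag indices to drop from both feature lists
    drop = set(random.sample(range(q), int(q * p)))
    zero = lambda f: [0.0] * len(f)      # set the dropped positions to zero
    score_out = [zero(f) if i in drop else f for i, f in enumerate(score_feats)]
    text_out = [zero(f) if i in drop else f for i, f in enumerate(text_feats)]
    return score_out, text_out
```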
Step 709, performing association analysis on the text feature representations of the sample labels corresponding to the ith sample label and other sample labels in the q sample labels through the sample label determining model to obtain sample association feature representations corresponding to the ith sample label.
Wherein i is a positive integer, and i is less than or equal to q.
Schematically, please refer to fig. 8, the sample tag text feature representations corresponding to the i-th sample tag and other sample tags in the q sample tags are input to the sample text feature attention fusion device 805, and are subjected to association analysis, so as to output a sample association feature representation 810 corresponding to the i-th sample tag. The correlation analysis is performed on each sample label to obtain a corresponding sample correlation characteristic representation. The structure of the sample text feature attention fusion device 805 is the same as that of the text feature fusion device 504, and the specific process of association analysis may refer to step 409, which is not described herein.
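As a rough illustration of the association analysis, in which each tag attends to the text features of every tag, here is a generic scaled dot-product self-attention over the tag text feature representations. Note this is a stand-in: the embodiment's fusion device derives attention weights via a convolutional network, which is not reproduced here.

```python
import math

def association_features(tag_feats):
    """For each tag, attend over all tags' text features and return the
    attention-weighted mixture as its association feature representation."""
    d = len(tag_feats[0])
    out = []
    for query in tag_feats:
        logits = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in tag_feats]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]   # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * feat[j] for w, feat in zip(weights, tag_feats))
                    for j in range(d)])
    return out
```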
Step 710, carrying out weighted average processing on the sample score feature representation and the sample association feature representation through the sample label determination model to obtain sample feature representations respectively corresponding to a plurality of sample labels.
Optionally, calculating an average feature between the sample score feature representation and the sample association feature representation to obtain sample feature representations respectively corresponding to the plurality of sample tags.
Schematically, please refer to fig. 8, for the sample label i, the sample score feature representation 809 and the sample association feature representation 810 thereof are averaged to obtain the sample feature representation.
Step 711, performing feature analysis on the sample feature representation through the sample label determination model, and predicting the sample re-scoring results corresponding to the sample labels.
The sample re-scoring result is obtained after the confidence score of the sample label is updated.
For the sample tag i, referring to fig. 8, its sample feature representation is input to the sample score device 806, which maps the sample feature representation to a single number through a fully connected layer and then maps that number into the (0, 1) range using an activation function (e.g., a sigmoid function); the resulting score is the sample re-scoring result of the sample tag i. For each sample tag, feature analysis is performed on the corresponding sample feature representation to obtain the corresponding sample re-scoring result.
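The scoring step, combining the averaging of step 710 with the fully connected layer and sigmoid described here, can be sketched for a single tag as follows; the single weight vector interface is an assumed simplification of the fully connected layer.

```python
import math

def rescore(score_feat, assoc_feat, weight, bias):
    # Average the score feature and association feature representations
    fused = [(a + b) / 2 for a, b in zip(score_feat, assoc_feat)]
    # Fully connected layer: map the fused representation to one number
    logit = sum(w * x for w, x in zip(weight, fused)) + bias
    # Sigmoid activation maps the number into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-logit))
```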
And step 712, training the sample tag determination model based on the sample re-scoring result and the reference tag set to obtain the tag determination model.
Optionally, the sample tag determination model is subjected to supervised training by a mean square error loss function (Mean Square Error, MSE), and model parameters in the sample tag determination model are updated to obtain the tag determination model, wherein the MSE loss function is shown in the following formula one:
Equation one: Loss(X_i, Y_i) = (X_i - Y_i)^2
Wherein X_i is the re-scoring score predicted by the sample tag determination model and Y_i is the label: if the sample tag corresponding to the sample re-scoring result belongs to the reference tag set, Y_i is 1; if the sample tag corresponding to the sample re-scoring result does not belong to the reference tag set, Y_i is 0.
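Under the 0/1 target construction described above, the MSE training objective can be sketched as follows; averaging over the batch of sample tags is an assumed detail.

```python
def mse_loss(pred_scores, sample_tags, reference_set):
    # Y_i is 1 when the sample tag is in the reference tag set, else 0
    targets = [1.0 if tag in reference_set else 0.0 for tag in sample_tags]
    # Mean of (X_i - Y_i)^2 over the sample tags
    return sum((x - y) ** 2 for x, y in zip(pred_scores, targets)) / len(pred_scores)
```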
According to the training method of the label determination model, sample text content corresponding to sample content and sample label description content corresponding to a plurality of sample labels of the sample content are obtained, sample score characteristic representation and sample association characteristic representation are extracted, and combined analysis is carried out on the sample score characteristic representation and the sample association characteristic representation to obtain sample re-scoring results of the plurality of sample labels, so that the sample label determination model is trained based on the sample re-scoring results and a reference label set, and the label determination model is obtained. On the one hand, in the training process, the generalization performance of the model obtained by training is improved by randomly discarding the features, so that the phenomenon of overfitting of the model in actual use is reduced; on the other hand, by fully using text information corresponding to the target content and fully using text information associated with the tag, richer information is provided for the model, and the capability of feature extraction of the model is enhanced, so that the accuracy of determining the model by the tag obtained through training is improved.
Referring to fig. 9, a block diagram of a content tag determining apparatus according to an exemplary embodiment of the present application is shown, where the apparatus includes:
the obtaining module 900 is configured to obtain a plurality of candidate tags of the target content, where the plurality of candidate tags correspond to at least two obtaining manners, and the at least two obtaining manners are manners of analyzing the target content based on at least two different analysis manners to obtain a candidate tag, and the candidate tag includes a confidence score corresponding to the obtaining manner;
the obtaining module 900 is further configured to obtain text content corresponding to the target content, where the text content is text data associated with the target content;
the obtaining module 900 is further configured to obtain tag description contents corresponding to the plurality of candidate tags, where the tag description contents are used to describe the candidate tags;
an extracting module 910, configured to perform feature extraction on confidence scores corresponding to the plurality of candidate labels respectively, so as to obtain a score feature representation; extracting features of the text content and the tag description content to obtain associated feature representations corresponding to the candidate tags respectively, wherein the associated feature representations are used for indicating association relations among different candidate tags;
A determining module 920, configured to determine a content tag corresponding to the target content from the plurality of candidate tags based on the score feature representation and the associated feature representation.
Referring to fig. 10, in some alternative embodiments, the extracting module 910 includes:
a splicing unit 911, configured to perform text splicing processing on the text content and the tag description content to obtain text spliced content;
the extracting module 910 is further configured to perform feature extraction on the text spliced content, so as to obtain associated feature representations corresponding to the plurality of candidate tags respectively.
In some optional embodiments, the extracting module 910 is further configured to extract, based on the text splicing content, a tag text feature representation corresponding to each of the plurality of candidate tags, where the feature extraction is performed on the text splicing content to obtain associated feature representations corresponding to each of the plurality of candidate tags; the extraction module 910 includes:
the first analysis unit 912 is configured to perform association degree analysis on label text feature representations corresponding to an i-th candidate label and other candidate labels in the plurality of candidate labels, so as to obtain an association feature representation corresponding to the i-th candidate label, where i is a positive integer.
In some alternative embodiments, the extracting module 910 further includes:
an adjustment unit 913, configured to respond to a difference between a text length corresponding to the text splicing content and a preset length being greater than a preset threshold, and perform a text length adjustment operation on the text splicing content according to the preset threshold, so as to obtain a target text splicing content;
the extracting module 910 is further configured to perform feature extraction on the target text spliced content, so as to obtain tag text feature representations corresponding to the plurality of candidate tags respectively.
In some optional embodiments, the adjusting unit 913 is configured to respond to the text length corresponding to the text splicing content being greater than a preset length, and perform sentence processing on the text splicing content to obtain a content sentence result; the adjusting unit 913 is configured to perform semantic similarity analysis on the content clause result to obtain a similarity analysis result, where the similarity analysis result is used to indicate semantic similarity between clauses in the content clause result; the adjusting unit 913 is configured to perform a clipping operation on the text splicing content according to the preset length based on the similarity analysis result, so as to obtain the target text splicing content.
In some optional embodiments, the similarity analysis result includes semantic similarity between m first clauses corresponding to the text content spliced in the text splicing content, where m is a positive integer, and m is greater than or equal to 2; the adjusting unit 913 is configured to respond to the similarity between n pairs of first clauses in the similarity analysis result being greater than a preset similarity threshold, where n is a positive integer, and 2n is less than or equal to m; the adjusting unit 913 is configured to perform clipping operation on the n pairs of first clauses according to the preset length, so as to obtain the target text splicing content.
In some optional embodiments, the similarity analysis result includes semantic similarity between a second clause corresponding to the tag description content spliced in the text splice content and the m first clauses; the adjusting unit 913 is configured to obtain a similarity average value between the m first clauses and the second clauses corresponding to the tag description content; the adjusting unit 913 is configured to perform clipping operation on the m first clauses according to the preset length based on the average similarity value, so as to obtain the target text splicing content.
In some optional embodiments, the first analysis unit 912 is configured to input the tag text feature representations corresponding to the i-th candidate tag and the other candidate tags into a convolutional neural network, and perform a convolutional operation on the tag text feature representations corresponding to the i-th candidate tag and the other candidate tags; the first analysis unit 912 is further configured to obtain, based on the convolution operation, an attention weight between the ith candidate tag and the other candidate tags, and use the attention weight as an association feature representation corresponding to the ith candidate tag, where the attention weight is used to indicate a degree of association between the ith candidate tag and the other candidate tags.
In some alternative embodiments, the extracting module 910 includes:
a construction unit 914, configured to construct a K-dimensional score vector according to K acquisition modes and confidence scores corresponding to the candidate tags in the K acquisition modes, where a score corresponding to a j-th dimension in the K-dimensional score vector is used to indicate a confidence score corresponding to the candidate tag in the j-th acquisition mode, where K and j are positive integers, j is less than or equal to K, and K is greater than or equal to 2;
The extracting module 910 is further configured to perform feature extraction on the K-dimensional score vector, so as to obtain the score feature representation.
In some alternative embodiments, the determining module 920 includes:
a processing unit 921, configured to perform weighted average processing on the score feature representation and the associated feature representation, to obtain target feature representations corresponding to the plurality of candidate tags respectively;
the determining module 920 is further configured to determine, from the plurality of candidate tags, a content tag corresponding to the target content based on the target feature representation.
In some alternative embodiments, the determining module 920 includes:
a second analysis unit 922, configured to perform feature analysis on the target feature representation to obtain re-scoring results corresponding to the multiple candidate tags, where the re-scoring results are obtained by updating confidence scores of the candidate tags;
the determining module 920 is further configured to use a target candidate tag, where the re-scoring result of the plurality of candidate tags is greater than or equal to a preset score threshold, as a content tag corresponding to the target content.
In summary, according to the content tag determining device provided by the embodiment of the present application, the text content corresponding to the target content and the tag description content respectively corresponding to the plurality of candidate tags of the target content are obtained, the score feature representation and the association feature representation are extracted, and the score feature representation and the association feature representation are jointly analyzed, so that the content tag corresponding to the target content is determined from the plurality of candidate tags. On the one hand, the plurality of candidate labels are obtained by analyzing the target content based on at least two different analysis modes, so that the diversity of the candidate labels corresponding to the target content is increased, and the fault tolerance of the finally obtained content labels is also improved; on the other hand, the text information of the target content and the candidate labels is integrated, rich text information is fully utilized, and the association information among the candidate labels is fully utilized to make decisions, so that the accuracy of finally obtaining the content labels is improved.
It should be noted that: the content tag determining apparatus provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the content tag determining device provided in the above embodiment and the content tag determining method embodiment belong to the same concept, and specific implementation processes thereof are detailed in the method embodiment and are not described herein again.
Fig. 11 shows a block diagram of a computer device 1100 provided by an exemplary embodiment of the application. The computer device 1100 may be a terminal or a server.
In general, the computer device 1100 includes: a processor 1101 and a memory 1102.
The processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1101 may be implemented in at least one hardware form of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 1101 may also include a main processor and a coprocessor; the main processor is a processor for processing data in an awake state, also referred to as a central processing unit (Central Processing Unit, CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a graphics processing unit (Graphics Processing Unit, GPU) responsible for rendering and drawing the content to be displayed by the display screen. In some embodiments, the processor 1101 may also include an artificial intelligence (Artificial Intelligence, AI) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store a computer program for execution by processor 1101 to implement the method of determining content tags provided by the method embodiments of the present application.
Illustratively, the computer device 1100 also includes other components, and those skilled in the art will appreciate that the structure illustrated in FIG. 11 is not limiting of the computer device 1100, and may include more or fewer components than illustrated, or may combine certain components, or employ a different arrangement of components.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the above embodiments may be implemented by a computer program stored in a computer-readable storage medium, which may be the computer-readable storage medium included in the memory of the above embodiments, or a stand-alone computer-readable storage medium that is not assembled into a computer device. The computer-readable storage medium stores a computer program that is loaded and executed by the processor to implement the method of determining a content tag according to any of the above embodiments.
Alternatively, the computer-readable storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), Solid State Drives (SSD), optical disks, etc. The random access memory may include Resistive Random Access Memory (ReRAM) and Dynamic Random Access Memory (DRAM), among others. The foregoing embodiment numbers of the present application are merely for the purpose of description and do not represent the advantages or disadvantages of the embodiments.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of implementing the above-described embodiments may be implemented by hardware, or may be implemented by computer programs instructing the relevant hardware, where the computer programs may be stored on a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or optical disk, etc.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application; the scope of protection of the application is defined by the appended claims.

Claims (15)

1. A method of determining content tags, the method comprising:
Acquiring a plurality of candidate labels of target content, wherein the plurality of candidate labels are corresponding to at least two acquisition modes, the at least two acquisition modes are modes for analyzing the target content based on at least two different analysis modes to obtain the candidate labels, and the candidate labels comprise confidence scores corresponding to the acquisition modes;
acquiring text content corresponding to the target content, wherein the text content is text data associated with the target content;
acquiring tag description contents corresponding to the candidate tags respectively, wherein the tag description contents are used for describing the candidate tags;
carrying out feature extraction on confidence scores corresponding to the candidate labels respectively to obtain score feature representation; extracting features of the text content and the tag description content to obtain associated feature representations corresponding to the candidate tags respectively, wherein the associated feature representations are used for indicating association relations among different candidate tags;
and determining a content tag corresponding to the target content from the plurality of candidate tags based on the score feature representation and the association feature representation.
2. The method according to claim 1, wherein the feature extracting the text content and the tag description content to obtain associated feature representations corresponding to the plurality of candidate tags respectively includes:
performing text splicing processing on the text content and the tag description content to obtain text spliced content;
and extracting the characteristics of the text splicing content to obtain associated characteristic representations corresponding to the candidate labels respectively.
3. The method according to claim 2, wherein the feature extracting the text splicing content to obtain associated feature representations corresponding to the plurality of candidate tags respectively includes:
extracting and obtaining label text characteristic representations respectively corresponding to the plurality of candidate labels based on the text splicing content;
and performing association degree analysis on label text characteristic representations corresponding to an ith candidate label and other candidate labels in the plurality of candidate labels to obtain association characteristic representations corresponding to the ith candidate label, wherein i is a positive integer.
4. The method according to claim 3, wherein extracting, based on the text splicing content, the tag text feature representations respectively corresponding to the plurality of candidate tags comprises:
in response to a difference between a text length corresponding to the text splicing content and a preset length being greater than a preset threshold, performing a text length adjustment operation on the text splicing content according to the preset threshold to obtain target text splicing content;
and performing feature extraction on the target text splicing content to obtain the tag text feature representations respectively corresponding to the plurality of candidate tags.
5. The method according to claim 4, further comprising:
in response to the text length corresponding to the text splicing content being greater than the preset length, performing sentence segmentation on the text splicing content to obtain a content clause result;
performing semantic similarity analysis on the content clause result to obtain a similarity analysis result, wherein the similarity analysis result is used for indicating semantic similarity among clauses in the content clause result;
and trimming the text splicing content according to the preset length based on the similarity analysis result to obtain the target text splicing content.
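Claim 5's idea — when text is too long, drop the clauses that are semantically redundant rather than truncating blindly — can be sketched as follows. This is a toy illustration: it segments on sentence punctuation and uses token-set Jaccard overlap as a crude proxy for semantic similarity; the patent would use a learned semantic model, and all names here are assumptions.

```python
def trim_by_similarity(text, max_len):
    """Repeatedly drop the clause most similar to another clause until
    the total length fits within max_len characters."""
    clauses = [c.strip() for c in
               text.replace('!', '.').replace('?', '.').split('.') if c.strip()]

    def jaccard(a, b):
        sa, sb = set(a.split()), set(b.split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    def total_len(cs):
        return sum(len(c) for c in cs)

    while len(clauses) > 1 and total_len(clauses) > max_len:
        # redundancy of clause i = its highest similarity to any other clause
        redundancy = [max(jaccard(c, o) for j, o in enumerate(clauses) if j != i)
                      for i, c in enumerate(clauses)]
        clauses.pop(redundancy.index(max(redundancy)))
    return '. '.join(clauses)
```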
6. The method according to claim 5, wherein the similarity analysis result includes semantic similarities among m first clauses corresponding to the text content spliced into the text splicing content, wherein m is a positive integer and m is greater than or equal to 2;
and trimming the text splicing content according to the preset length based on the similarity analysis result to obtain the target text splicing content comprises:
in response to the similarity between each of n pairs of first clauses in the similarity analysis result being greater than a preset similarity threshold, wherein n is a positive integer and 2n is less than or equal to m,
trimming the n pairs of first clauses according to the preset length to obtain the target text splicing content.
7. The method according to claim 6, wherein the similarity analysis result includes semantic similarities between a second clause corresponding to the tag description content spliced into the text splicing content and the m first clauses;
and trimming the text splicing content according to the preset length based on the similarity analysis result to obtain the target text splicing content comprises:
obtaining an average similarity between the second clause corresponding to the tag description content and the m first clauses;
and trimming the m first clauses according to the preset length based on the average similarity to obtain the target text splicing content.
8. The method according to claim 3, wherein performing association-degree analysis on the tag text feature representations corresponding to the i-th candidate tag and the other candidate tags among the plurality of candidate tags to obtain the association feature representation corresponding to the i-th candidate tag comprises:
inputting the tag text feature representations respectively corresponding to the i-th candidate tag and the other candidate tags into a convolutional neural network, and performing a convolution operation on the tag text feature representations;
and obtaining, based on the convolution operation, attention weights between the i-th candidate tag and the other candidate tags, and taking the attention weights as the association feature representation corresponding to the i-th candidate tag, wherein the attention weights are used for indicating degrees of association between the i-th candidate tag and the other candidate tags.
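The attention weights in claim 8 can be illustrated with a standard scaled dot-product attention over one tag's feature vector and the others'. Note this is a generic sketch of attention-weight computation, not the claimed convolution-based variant; the convolutional step that produces the features is omitted, and the names are assumptions.

```python
import math

def attention_weights(query_vec, other_vecs):
    """Scaled dot-product attention of one tag's feature vector over the
    other tags' vectors; returns weights that sum to 1."""
    d = len(query_vec)
    scores = [sum(q * k for q, k in zip(query_vec, v)) / math.sqrt(d)
              for v in other_vecs]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```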
9. The method according to claim 1, wherein performing feature extraction on the confidence scores respectively corresponding to the plurality of candidate tags to obtain the score feature representation comprises:
constructing a K-dimensional score vector according to K acquisition modes and the confidence scores corresponding to a candidate tag in the K acquisition modes, wherein the score in the j-th dimension of the K-dimensional score vector is used for indicating the confidence score corresponding to the candidate tag in the j-th acquisition mode, K and j are positive integers, j is less than or equal to K, and K is greater than or equal to 2;
and performing feature extraction on the K-dimensional score vector to obtain the score feature representation.
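The K-dimensional score vector of claim 9 is simply one confidence slot per acquisition mode, with a default for modes that did not propose the tag. A minimal sketch, assuming a dict-of-dicts layout and a 0.0 default — both illustrative choices, not specified by the patent:

```python
def build_score_vector(tag, scores_by_mode, modes):
    """scores_by_mode: mode name -> {tag: confidence score}.
    Returns a K-dimensional vector, one entry per mode in `modes`;
    a mode that did not produce this tag contributes 0.0."""
    return [scores_by_mode.get(mode, {}).get(tag, 0.0) for mode in modes]
```

For example, a tag proposed by an OCR-based mode and a speech-based mode but not a visual mode would yield a vector like `[0.9, 0.7, 0.0]` over modes `["ocr", "asr", "visual"]` (mode names hypothetical).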
10. The method according to any one of claims 1 to 9, wherein determining, from the plurality of candidate tags, the content tag corresponding to the target content based on the score feature representation and the association feature representations comprises:
performing weighted averaging on the score feature representation and the association feature representations to obtain target feature representations respectively corresponding to the plurality of candidate tags;
and determining, from the plurality of candidate tags, the content tag corresponding to the target content based on the target feature representations.
11. The method according to claim 10, wherein determining, from the plurality of candidate tags, the content tag corresponding to the target content based on the target feature representations comprises:
performing feature analysis on the target feature representations to obtain re-scoring results respectively corresponding to the plurality of candidate tags, wherein a re-scoring result is obtained by updating the confidence score of a candidate tag;
and taking a target candidate tag, among the plurality of candidate tags, whose re-scoring result is greater than or equal to a preset score threshold as the content tag corresponding to the target content.
12. A content tag determination apparatus, the apparatus comprising:
an acquisition module, configured to acquire a plurality of candidate tags of target content, wherein the plurality of candidate tags correspond to at least two acquisition modes, the at least two acquisition modes being modes of analyzing the target content in at least two different analysis manners to obtain the candidate tags, and each candidate tag having a confidence score corresponding to its acquisition mode;
the acquisition module being further configured to acquire text content corresponding to the target content, wherein the text content is text data associated with the target content;
the acquisition module being further configured to acquire tag description content respectively corresponding to the plurality of candidate tags, wherein the tag description content is used for describing the candidate tags;
an extraction module, configured to perform feature extraction on the confidence scores respectively corresponding to the plurality of candidate tags to obtain a score feature representation, and to perform feature extraction on the text content and the tag description content to obtain association feature representations respectively corresponding to the plurality of candidate tags, wherein the association feature representations are used for indicating association relationships among different candidate tags;
and a determination module, configured to determine, from the plurality of candidate tags, a content tag corresponding to the target content based on the score feature representation and the association feature representations.
13. A computer device, comprising a processor and a memory, wherein the memory stores a computer program that is loaded and executed by the processor to implement the content tag determination method according to any one of claims 1 to 11.
14. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that is loaded and executed by a processor to implement the content tag determination method according to any one of claims 1 to 11.
15. A computer program product, comprising a computer program which, when executed by a processor, implements the content tag determination method according to any one of claims 1 to 11.
CN202211483665.3A 2022-11-24 2022-11-24 Content tag determination method, device, equipment, medium and program product Pending CN116955707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211483665.3A CN116955707A (en) 2022-11-24 2022-11-24 Content tag determination method, device, equipment, medium and program product

Publications (1)

Publication Number Publication Date
CN116955707A true CN116955707A (en) 2023-10-27

Family

ID=88460765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211483665.3A Pending CN116955707A (en) 2022-11-24 2022-11-24 Content tag determination method, device, equipment, medium and program product

Country Status (1)

Country Link
CN (1) CN116955707A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117573870A (en) * 2023-11-20 2024-02-20 中国人民解放军国防科技大学 Text label extraction method, device, equipment and medium for multi-mode data
CN117573870B (en) * 2023-11-20 2024-05-07 中国人民解放军国防科技大学 Text label extraction method, device, equipment and medium for multi-mode data
CN118230224A (en) * 2024-05-21 2024-06-21 腾讯科技(深圳)有限公司 Label scoring method, label scoring model training method and device
CN118230224B (en) * 2024-05-21 2024-08-20 腾讯科技(深圳)有限公司 Label scoring method, label scoring model training method and device
CN118228035A (en) * 2024-05-22 2024-06-21 腾讯科技(深圳)有限公司 Content tag determination method and related equipment

Similar Documents

Publication Publication Date Title
Kaur et al. Comparative analysis on cross-modal information retrieval: A review
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN111079444B (en) Network rumor detection method based on multi-modal relationship
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN112085120B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN111783903B (en) Text processing method, text model processing method and device and computer equipment
CN113766299B (en) Video data playing method, device, equipment and medium
CN116977701A (en) Video classification model training method, video classification method and device
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN118035945B (en) Label recognition model processing method and related device
CN113641797A (en) Data processing method, device, equipment, storage medium and computer program product
CN117765450B (en) Video language understanding method, device, equipment and readable storage medium
CN114329051B (en) Data information identification method, device, apparatus, storage medium and program product
CN116977992A (en) Text information identification method, apparatus, computer device and storage medium
CN115269781A (en) Modal association degree prediction method, device, equipment, storage medium and program product
Li et al. Social context-aware person search in videos via multi-modal cues
CN114661951A (en) Video processing method and device, computer equipment and storage medium
Wajid et al. Neutrosophic-CNN-based image and text fusion for multimodal classification
Liu et al. A multimodal approach for multiple-relation extraction in videos
CN118051630A (en) Image-text retrieval system and method based on multi-mode consensus perception and momentum contrast
CN114398505A (en) Target word determining method, model training method and device and electronic equipment
CN114329004A (en) Digital fingerprint generation method, digital fingerprint generation device, data push method, data push device and storage medium
CN113395584B (en) Video data processing method, device, equipment and medium
CN112101154B (en) Video classification method, apparatus, computer device and storage medium
CN113822521A (en) Method and device for detecting quality of question library questions and storage medium

Legal Events

Date Code Title Description
PB01 Publication