WO2022199504A1 - Content recognition method, apparatus, computer device and storage medium - Google Patents

Content recognition method, apparatus, computer device and storage medium

Info

Publication number
WO2022199504A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
text
target
extraction
adjusted
Prior art date
Application number
PCT/CN2022/081896
Other languages
English (en)
French (fr)
Inventor
徐启东
陈小帅
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2022199504A1
Priority to US17/991,727 (published as US20230077849A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/158Segmentation of character regions using character size, text spacings or pitch estimation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635Overlay text, e.g. embedded captions in a TV program

Definitions

  • the present application relates to the field of computer technology, and in particular, to a content identification method, apparatus, computer device and storage medium.
  • content recognition is required in many cases, such as video recognition.
  • the content can be identified based on the artificial intelligence model, and the required information can be obtained from the content.
  • text can be recognized to obtain desired content entities from the text.
  • a content identification method, apparatus, computer device, storage medium and computer program product are provided.
  • a content recognition method executed by a computer device, the method comprising: determining target content to be recognized, and obtaining target text and text-related data associated with the target text from the target content; performing feature extraction on the target text to obtain a text extraction feature; performing feature extraction on the text-related data to obtain an associated extraction feature; determining a feature correlation degree between the associated extraction feature and the text extraction feature, the feature correlation degree being used to characterize the degree of association between the target text and the text-related data; adjusting the text extraction feature based on the feature correlation degree to obtain an adjusted text feature; and performing recognition based on the adjusted text feature to obtain a content recognition result corresponding to the target content.
  • a content recognition device comprises: a target content determination module, configured to determine target content to be recognized and to acquire, from the target content, target text and text-related data associated with the target text; a feature extraction module, configured to perform feature extraction on the target text to obtain a text extraction feature, and to perform feature extraction on the text-related data to obtain an associated extraction feature; a feature correlation degree obtaining module, configured to determine a feature correlation degree between the associated extraction feature and the text extraction feature, the feature correlation degree being used to represent the degree of association between the target text and the text-related data; an adjusted text feature obtaining module, configured to adjust the text extraction feature based on the feature correlation degree to obtain an adjusted text feature; and a content recognition result obtaining module, configured to perform recognition based on the adjusted text feature to obtain a content recognition result corresponding to the target content.
  • a computer device comprises a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to execute the steps of the above content recognition method.
  • one or more non-transitory readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to implement the steps of the above content recognition method.
  • a computer program product includes computer-readable instructions that, when executed by a processor, implement the steps of the above content recognition method.
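  • Taken together, the claimed steps form a single pipeline: extract features from the target text and from the text-related data, measure how strongly the two are correlated, re-weight the text features by that correlation, and recognize from the re-weighted features. The following minimal sketch, written with PyTorch-style tensors, only illustrates that flow; the function and module names (recognize_content, text_encoder, assoc_encoder, recognizer) are hypothetical and not defined by this application.

```python
import torch

def recognize_content(target_text, text_related_data,
                      text_encoder, assoc_encoder, recognizer):
    # Step 1: feature extraction on the target text and the text-related data
    text_feat = text_encoder(target_text)          # (num_tokens, d) text extraction features
    assoc_feat = assoc_encoder(text_related_data)  # (d,) associated extraction feature

    # Step 2: feature correlation degree between the two feature sets
    correlation = torch.softmax(text_feat @ assoc_feat, dim=0)  # (num_tokens,)

    # Step 3: adjust the text extraction features based on the correlation degree
    adjusted = correlation.unsqueeze(-1) * text_feat             # (num_tokens, d)

    # Step 4: recognize based on the adjusted text features
    return recognizer(adjusted)
```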
  • FIG. 1 is an application environment diagram of a content recognition method in some embodiments.
  • FIG. 2 is a schematic flowchart of a content identification method in some embodiments
  • FIG. 3 is a schematic diagram of video recognition using a content recognition method in some embodiments.
  • FIG. 4 is a framework diagram of a content recognition model in some embodiments.
  • Figure 5 is a framework diagram of a content recognition model in some embodiments.
  • FIG. 6 is a schematic diagram of entity identification using an entity identification network in some embodiments.
  • FIG. 7 is a block diagram of a content-aware network in some embodiments.
  • FIG. 8 is a schematic diagram of entity recognition using an entity recognition model in some embodiments.
  • FIG. 9 is a structural block diagram of a content identification device in some embodiments.
  • FIG. 10 is a diagram of the internal structure of a computer device in some embodiments.
  • Figure 11 is a diagram of the internal structure of a computer device in some embodiments.
  • the content identification method provided by this application can be applied to the application environment shown in FIG. 1 .
  • the application environment includes the terminal 102 and the server 104 .
  • the terminal 102 and the server 104 communicate through the network.
  • the server 104 may obtain the target content to be identified in response to a content identification request; the target content may be carried in the content identification request, or obtained according to a content identifier carried in the content identification request.
  • the server 104 may obtain the target text from the target content, and obtain the text-related data associated with the target text from the target content.
  • the server 104 may perform feature extraction on the target text to obtain the text extraction feature, and perform feature extraction on the text-related data to obtain the associated extraction feature.
  • the server 104 may determine the feature correlation degree between the associated extraction feature and the text extraction feature; the feature correlation degree is used to represent the degree of association between the target text and the text-related data. The server 104 may adjust the text extraction feature based on the feature correlation degree to obtain the adjusted text feature.
  • recognition is performed based on the adjusted text feature to obtain a content identification result corresponding to the target content.
  • the server 104 may store the content identification result in association with the target content.
  • the content identification result may be used as a tag of the target content.
  • the content identification request may be triggered by the server 104 or sent to the server 104 by other devices such as a terminal.
  • the terminal 102 may be installed with a client, for example, may be installed with at least one of a video client, a browser client, an instant messaging client, or an education client.
  • the terminal 102 may send a content search request to the server 104 in response to the content search operation triggered by the user through the client.
  • the content search request may carry search information, and the server 104 may match the search information with the content identification result.
  • when the search information matches a content identification result, the content corresponding to that content identification result is sent to the terminal 102, and the terminal 102 can display the content returned by the server 104 in the client.
  • the terminal 102 can be, but is not limited to, a laptop computer, a smart phone, a smart TV, a desktop computer, a tablet computer or a portable wearable device, and the server 104 can be implemented by an independent server, or by a server cluster or cloud server composed of multiple servers.
  • the above application scenario is only an example, and does not constitute a limitation on the content identification method provided by the embodiment of the present application.
  • the method provided by the embodiment of the present application can also be applied to other application scenarios, such as the content identification method provided by the present application.
  • the method may be executed by the terminal 102 or the server 104, or may be executed by the terminal 102 and the server 104 in cooperation; the terminal 102 may upload the obtained content identification result to the server 104, and the server 104 may store the target content and the content identification result in association.
  • a content identification method is provided, and the method is applied to the server 104 in FIG. 1 as an example for description, including the following steps:
  • Step 202: Determine the target content to be identified, and obtain the target text and the text-related data associated with the target text from the target content.
  • the content may be any one of video, audio or text
  • the content includes text data, and may also include at least one of image data or audio data
  • the audio data may be, for example, voice data.
  • the text data in the content may include at least one of subtitles, bullet screens, comments or titles in the video
  • the image data in the content may be video frames in the video
  • the audio data in the content may be audio data such as dubbing or music in a video.
  • the text data in the content may be text data corresponding to the audio data.
  • the audio data in the content may be audio frames.
  • the audio frame is obtained by framing the audio, and framing refers to dividing the audio into multiple sub-segments, and each sub-segment is a frame.
  • the target content refers to the content to be identified, which can be at least one of content on which person identification is to be performed or content on which scene recognition is to be performed. Person identification refers to identifying the identity of a person appearing in the target content; for example, the identity of a character can be determined by identifying the character information appearing in the target content, and the character information can include at least one of the character's name or the character's face. Scene recognition refers to identifying the scene to which the target content belongs; for example, the scene can be determined by identifying the location appearing in the target content.
  • the target text refers to the text data in the target content, and can include text data at any time in the target content. For example, when the target content is a video, the target text can include at least one of subtitles, bullet screens, comments or a title. When the target content is a song, the target text can be the lyrics corresponding to the song.
  • the text-related data refers to data in the target content that is associated with the target text, and may include, for example, at least one of target image data or target audio data that is associated with the target text in the target content.
  • the target image data is image data associated with the target text in the target content
  • the target audio data is audio data associated with the target text in the target content.
  • the target image data may include one or more images, and multiple refers to at least two.
  • the target audio data may include one or more segments of audio frames, and multiple segments refer to at least two segments.
  • Associations may include temporal associations.
  • the text-related data may include data that appears within the time when the target text appears in the target content, or data in the target content whose time interval from the time when the target text appears is less than a time interval threshold.
  • the text-related data can be the video frames and voice matching the subtitle; for example, the target text and the corresponding text-related data can be data describing the same video scene.
  • the text-related data may include data appearing at a target time in the target video, for example, at least one of the video frames, bullet screens or audio frames appearing at the target time in the target video, or data whose time interval from the target time is less than the time interval threshold.
  • the time interval threshold can be preset or set as required.
  • the target video can be any video; it can be a video obtained by direct shooting, or a video clip obtained from a video obtained by shooting. The target video can be any type of video, including but not limited to at least one of an advertisement video, a TV series video, or a news video, and the target video may also be a video to be pushed to the user.
  • the video frames appearing at the target time in the target video may include one or more frames
  • the audio frames appearing at the target time in the target video may include one or more frames
  • multiple frames refer to at least two frames.
  • Associations may also include semantic associations.
  • the text-related data may include data in the target content that matches the semantics of the target text
  • the data that matches the semantics of the target text may include data that is consistent with the semantics of the target text, or data whose semantic difference from the target text is smaller than a semantic difference threshold.
  • the semantic difference threshold can be preset or set as needed.
  • the server may acquire the content to be identified, such as a video to be identified, take the content to be identified as the target content to be identified, and use the content identification method provided in this application to identify the content to be identified, obtain the identified entity words, and build a knowledge graph based on the identified entity words, or use the identified entity words as labels corresponding to the target content.
  • a user matched by the target content can be determined according to a tag corresponding to the target content, and the target content is pushed to the terminal of the matched user.
  • the entity refers to a thing with a specific meaning, for example, it may include at least one of a place name, an institution name, or a proper noun.
  • the target text may include one or more entities, and entity words are words that represent entities. For example, if the target text is "monkey likes to eat bananas", the entities included in the target text are "monkey” and "banana", “monkey” is an entity word, and "banana” is an entity word.
  • Knowledge Graph is a graph-based data structure, including nodes (points) and edges (Edges), each node represents an entity, and each edge is a relationship between entities.
  • Entity recognition can also be called entity word recognition or Named Entity Recognition (NER).
  • Entity word recognition is an important research direction in the field of Natural Language Processing (NLP).
  • methods for entity word recognition include dictionary-based and rule-based methods; machine learning methods such as the Hidden Markov Model (HMM), the Maximum Entropy Markov Model (MEMM) and Conditional Random Fields (CRF); deep learning models such as Recurrent Neural Networks (RNN) and Long Short-Term Memory networks (LSTM); as well as recognition methods combining LSTM and CRF.
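  • As a rough illustration of the deep-learning family mentioned above, the sketch below scores each token with an entity label using a BiLSTM followed by a linear emission layer; in the LSTM+CRF combination a CRF decoding layer would sit on top of these emission scores, and it is omitted here. The vocabulary size, dimensions and label set are assumed values, not ones given by this application.

```python
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Toy BiLSTM entity tagger (BIO-style labels); a CRF layer could be stacked on top."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden=128, num_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_labels)   # e.g. O / B-ENT / I-ENT

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        hidden_states, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, 2*hidden)
        return self.emit(hidden_states)                  # per-token entity label scores
```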
  • natural language processing is an important direction in the field of computer science and artificial intelligence. It studies various theories and methods that can realize effective communication between humans and computers using natural language.
  • the first terminal may send a content push request to the server, and the content push request may carry a content identifier corresponding to the content to be pushed, and the content identifier is used to uniquely identify the content.
  • the content to be pushed may be, for example, a video to be pushed.
  • the server may, in response to the content push request, obtain the content to be pushed corresponding to the content identifier carried in the content push request, as the target content to be identified.
  • the first terminal may display a content push interface
  • the content push interface may display a push content acquisition area and a content push trigger control
  • the push content acquisition area is used to receive content information corresponding to the content to be pushed
  • the content information includes one or more content identifiers
  • the content push trigger control is used to trigger the first terminal to send a content push request to the server.
  • when the first terminal obtains a triggering operation on the content push trigger control, it obtains the content information received in the push content acquisition area and sends a content push request carrying the content information to the server.
  • the server may acquire the content corresponding to each content identifier included in the content information, respectively, as each target content to be identified.
  • the server can use the content identification method provided in the present application to identify each content to be identified, determine the users matching each target content according to the identification result, and push the target content to the terminal of the matched user.
  • the recognition result can be matched with the user's user portrait, and when the match is successful, the target content can be pushed to the user's terminal
  • the content identification method may also be referred to as a video identification method
  • the content push request may also be referred to as a video push request
  • the content push interface may be, for example, the video push interface 300 in FIG. 3.
  • the terminal can send a video push request to the server
  • the server identifies video A and video B according to the video recognition method, determines user 1 matched by video A and user 2 matched by video B, pushes video A to the terminal of user 1, and pushes video B to the terminal of user 2.
  • Step 204: Perform feature extraction on the target text to obtain text extraction features; perform feature extraction on the text-related data to obtain associated extraction features.
  • the text extraction feature is a feature obtained by feature extraction on the target text
  • the text extraction feature can be a feature obtained by further feature extraction on the target word vector of the target word segmentation corresponding to the target text.
  • the target word segmentation is obtained by segmenting the target text.
  • the granularity of segmentation can be set as required. For example, it can be segmented in units of words, words or sentences to obtain segmented text blocks, and each text block is used as a participle.
  • a word corresponds to a text block, that is, a word is a word segment.
  • the target word vector is the vector representation of the target word segmentation. There may be one or more target word segments obtained by segmenting the target text, and multiple means at least two.
  • the association extraction feature is the feature obtained by the feature extraction of text association data.
  • the associated extraction feature may be a target image feature obtained by performing image feature extraction on the target image data.
  • the associated extraction feature may be a target audio feature obtained by performing audio feature extraction on the target audio data.
  • the target image feature is an image feature extracted by performing image feature extraction on the target image data.
  • the target audio feature is an audio feature extracted by performing audio feature extraction on the target audio data.
  • the text extraction feature and the association extraction feature may be of the same dimension, for example, a vector of the same dimension or a matrix of the same dimension.
  • the server can input the target text into the text feature extraction network in the trained content recognition model, use the text feature extraction network to perform feature extraction on the target text to obtain the text extraction features, input the text-related data into the associated feature extraction network in the trained content recognition model, and use the associated feature extraction network to perform feature extraction on the text-related data to obtain the associated extraction features.
  • the trained content recognition model is used to recognize content to obtain a content recognition result, for example, used to recognize at least one of an entity word included in a subtitle of a video or a scene of the video.
  • the associated feature extraction network may include at least one of an image feature extraction network or an audio feature extraction network, and the image feature extraction network is used to extract the features of the image,
  • the audio feature extraction network is used to extract audio features.
  • the text-related data is the target image data
  • the text-related data can be input into the image feature extraction network, and the image features extracted by the image feature extraction network can be used as the correlation extraction features.
  • the text-related data is the target audio data
  • the text-related data can be input into the audio feature extraction network, and the audio features extracted by the audio feature extraction network are used as the correlation extraction features.
  • the text feature extraction network, the image feature extraction network, and the audio feature extraction network may be artificial intelligence-based neural networks, such as convolutional neural networks (Convolutional Neural Networks, CNN), and of course other types of neural networks.
  • the text feature extraction network may be, for example, a Transformer (converter) network or a Transformer-based bidirectional encoder (Bidirectional Encoder Representations from Transformers, BERT) network.
  • the image feature extraction network can be, for example, a residual convolutional network (ResNet, Residual Network).
  • the audio feature extraction network can be, for example, a VGG (Visual Geometry Group) convolutional network.
  • VGG stands for the Visual Geometry Group (VGG) of the University of Oxford.
  • the server can perform scale transformation on the target image to obtain a scale-transformed image, input the scale-transformed image into the residual convolutional network for image feature extraction, pool the features output by the feature map extraction layer in the residual convolutional network, for example to a fixed size of n*n, and use the pooled features as the associated extraction features.
  • n is a positive number greater than or equal to 1.
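  • A minimal sketch of this image branch, assuming torchvision's ResNet as the residual convolutional network; the choice of ResNet-50, the 224x224 resize and the pooled size n=7 are assumptions made only for illustration.

```python
import torch.nn as nn
from torchvision import models, transforms

resize = transforms.Compose([
    transforms.Resize((224, 224)),   # scale transformation of the target image (assumed size)
    transforms.ToTensor(),
])

backbone = models.resnet50(weights=None)
feature_map_layers = nn.Sequential(*list(backbone.children())[:-2])  # keep convolutional layers only
pool = nn.AdaptiveAvgPool2d(7)       # pool the feature map to a fixed n*n size (n=7 here)

def extract_image_feature(pil_image):
    x = resize(pil_image).unsqueeze(0)   # (1, 3, 224, 224)
    fmap = feature_map_layers(x)         # (1, 2048, 7, 7) feature map
    return pool(fmap).flatten(1)         # associated extraction feature (target image feature)
```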
  • the step of performing feature extraction on the target text to obtain the text extraction features includes: segmenting the target text to obtain target word segments, performing vector transformation on the target word segments to obtain the target word vectors corresponding to the target word segments, and using the target word vectors as the text extraction features.
  • the server can input the target text into an attention-based transformer model; the transformer model, as an encoder of text features, can encode the target text to obtain an embedding representation of each word in the target text as an encoding feature, and the encoding feature corresponding to each word can be used as a text extraction feature.
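  • A sketch of this text branch, assuming the Hugging Face transformers implementation of a BERT encoder; the checkpoint name "bert-base-chinese" is only an example and is not specified by this application.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # example checkpoint
encoder = AutoModel.from_pretrained("bert-base-chinese")

def extract_text_features(target_text):
    inputs = tokenizer(target_text, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # one embedding (encoding feature) per token, used as the text extraction features
    return outputs.last_hidden_state.squeeze(0)     # (num_tokens, hidden_size)
```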
  • the server may perform spectrum calculation on the target audio data to obtain a spectrogram corresponding to the target audio data, perform feature extraction on the spectrogram corresponding to the target audio data, and use the extracted features as associated extraction features.
  • the server may perform sound spectrum calculation on the spectrogram corresponding to the target audio data to obtain the sound spectrum information corresponding to the target audio data, and perform feature extraction on the sound spectrum information of the target audio data to obtain the associated extraction features.
  • the server can apply a Fourier transform with a Hann (Hanning) time window to the target audio data to obtain the spectrogram corresponding to the target audio data, and use a mel filter to calculate the spectrogram to obtain the sound spectrum information corresponding to the target audio data.
  • the VGG convolutional network is used to extract the features of the sound spectrum information, and the audio features obtained by the feature extraction are used as the associated extraction features.
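  • The audio branch can be sketched as follows, assuming torchaudio for the Hann-windowed Fourier transform and mel filtering; the sample rate, window sizes, number of mel bins and log compression are assumed values, and vgg_like_network stands in for any VGG-style feature extractor.

```python
import torch
import torchaudio

# Hann-windowed STFT followed by a mel filter bank, as described above (parameters assumed)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=64,
    window_fn=torch.hann_window,
)

def extract_audio_feature(waveform, vgg_like_network):
    spec = mel(waveform)                 # (channels, n_mels, frames) sound spectrum information
    spec = torch.log(spec + 1e-6)        # log compression (a common practice, assumed here)
    return vgg_like_network(spec.unsqueeze(0))   # associated extraction feature (target audio feature)
```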
  • Step 206: Determine the feature correlation degree between the association extraction feature and the text extraction feature; the feature correlation degree is used to represent the correlation degree between the target text and the text-related data.
  • the feature correlation degree is the result obtained by the correlation calculation between the correlation extraction feature and the text extraction feature.
  • the feature correlation degree can include at least one of the image correlation degree or the audio correlation degree.
  • the image correlation degree refers to the result obtained by the correlation calculation between the target image feature and the text extraction feature
  • the audio correlation degree refers to the result obtained by the correlation calculation between the target audio feature and the text extraction feature.
  • the associative computation may be at least one of a product operation or an addition operation.
  • the association extraction feature may include multiple ordered associated feature values, and the text extraction feature may include multiple ordered text feature values.
  • the text feature value refers to a feature value included in the text extraction feature, and the associated feature value refers to a feature value included in the association extraction feature.
  • the association extraction feature and the text extraction feature can be of the same dimension, for example a vector or matrix of the same dimension; that is, the number of associated feature values included in the association extraction feature can be the same as the number of text feature values included in the text extraction feature.
  • for example, vector A = [a1, a2, a3] includes 3 elements, a1, a2 and a3, and each element in vector A is a text feature value; vector B = [b1, b2, b3] includes 3 elements, b1, b2 and b3, and each element in vector B is an associated feature value.
  • the association calculation may be at least one of a product operation or an addition operation.
  • the association calculation is a product operation
  • the association feature value in the association extraction feature and the text feature value at the corresponding position in the text extraction feature can be multiplied to obtain the product operation value.
  • the product operation values are added or averaged to obtain a statistical operation result, and the feature correlation degree can be obtained based on the statistical operation result.
  • the statistical operation result can be used as the feature correlation degree, or the statistical operation result can be normalized.
  • the result of the normalization process is used as the feature correlation degree.
  • association calculation is an addition operation
  • the association feature value in the association extraction feature and the text feature value at the corresponding position in the text extraction feature can be added to obtain the sum operation value, and the statistical operation is performed on each addition operation value.
  • an addition operation or an average operation can be performed on each addition operation value to obtain a statistical operation result.
  • the server may obtain the text extraction features obtained according to each target word segmentation, form a matrix of each text extraction feature, and use the formed matrix as a text extraction feature matrix, Each column in the text extraction feature matrix is a text extraction feature.
  • the server may perform a product operation on the association extraction feature and the text extraction feature matrix to obtain a result of the total product operation, and determine the feature correlation degree corresponding to each text extraction feature based on the total product operation result.
  • the step of performing a product operation on the association extraction feature and the text extraction feature matrix to obtain a total product operation result may include: performing a product operation on each text extraction feature in the text extraction feature matrix and the association extraction feature respectively, obtaining the sub-product operation result corresponding to each text extraction feature, and using the sub-product operation results as the total product operation result.
  • the step of performing a product operation on the text extraction features in the text extraction feature matrix and the association extraction features respectively, and obtaining the sub-product operation results corresponding to each text extraction feature may include: combining the text feature values in the text extraction features with the association extraction The associated feature values of the corresponding positions in the feature are multiplied to obtain the sub-product operation result corresponding to the text extraction feature.
  • the step of determining the feature correlation degree corresponding to each text extraction feature based on the total product operation result may include: normalizing each sub-product operation result in the total product operation result to obtain normalized sub-product operation results, and using each normalized sub-product operation result as the feature correlation degree corresponding to the corresponding text extraction feature.
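  • Concretely, the per-word feature correlation degree described above can be computed as a product between each text extraction feature and the association extraction feature, followed by normalization across the words; a minimal sketch, using row vectors for convenience (the orientation and shapes are assumptions):

```python
import torch

def feature_correlation(text_feats, assoc_feat):
    """text_feats: (num_words, d), one text extraction feature per word.
    assoc_feat: (d,), association extraction feature of the same dimension."""
    sub_products = text_feats @ assoc_feat        # sub-product operation result per word
    return torch.softmax(sub_products, dim=0)     # normalized feature correlation degrees
```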
  • the server may form a matrix from the target image features corresponding to each target image data, and use the formed matrix as the target image feature matrix, where each column in the target image feature matrix is a target image feature.
  • the server can perform a matrix multiplication operation on the transposed matrix of the target image feature matrix and the text extraction feature matrix to obtain a first product matrix, normalize each matrix value in the first product matrix to obtain a normalized first product matrix, and determine the image relevance corresponding to each text extraction feature based on the normalized first product matrix; the normalized first product matrix includes the image relevance corresponding to each text extraction feature.
  • the target text is “I'm so thirsty”
  • the target text is segmented with a word as the unit to obtain 3 target word segments, namely "I", "good" and "thirsty".
  • each word segment is a text block.
  • the text extraction feature matrix feature_text can be expressed as formula (1).
  • the step of performing normalization processing on each matrix value in the first product matrix to obtain a normalized first product matrix includes: determining a scaling factor, and converting each matrix value in the first product matrix Divide by the scaling factor respectively to obtain the scaling value corresponding to each matrix value, perform normalization processing on each scaling value, and use the matrix formed by each scaling value as the normalized first product matrix.
  • the scaling factor can be preset or set according to needs.
  • the scaling factor can be determined according to the dimension of the text extraction feature.
  • the scaling factor can be positively correlated with the dimension of the text extraction feature.
  • for example, the square root of the dimension of the text extraction feature can be calculated to obtain the scaling factor.
  • alternatively, the square root of the dimension of the text extraction feature can be calculated, and the ratio of the square-rooted result to the first value is used as the scaling factor.
  • the first value may be preset.
  • the method used for normalization can be any function that can convert the input data into a number between 0 and 1.
  • the function softmax can be used for normalization.
  • the normalized first product matrix L2 can be obtained by calculation using formula (4), where d is the dimension of the text extraction feature and m is the first value.
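  • Formula (4) itself is not reproduced in this excerpt, but the computation described above amounts to a scaled, normalized product between the target image features and the text extraction features; the sketch below uses row-oriented matrices, treats the first value m as an assumed constant, and assumes the normalization axis. The audio branch below is handled analogously.

```python
import math
import torch

def image_relevance(image_feats, text_feats, m=1.0):
    """image_feats: (num_images, d); text_feats: (num_words, d); m is the first value."""
    d = text_feats.shape[-1]
    scale = math.sqrt(d) / m                       # scaling factor from the feature dimension
    scores = image_feats @ text_feats.T / scale    # first product matrix, divided by the scaling factor
    return torch.softmax(scores, dim=-1)           # normalized first product matrix (axis assumed)
```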
  • the server may form the target audio features corresponding to each target audio data into a target audio feature matrix, where each column in the target audio feature matrix is a target audio feature; the server can perform a matrix multiplication operation on the transposed matrix of the target audio feature matrix and the text extraction feature matrix to obtain a second product matrix, and normalize each matrix value in the second product matrix to obtain a normalized second product matrix.
  • Step 208: Adjust the text extraction feature based on the feature correlation degree to obtain the adjusted text feature.
  • the adjusted text feature is a feature obtained by adjusting the text extraction feature based on the feature correlation degree, and the adjusted text feature may include at least one of a first adjusted text feature or a second adjusted text feature.
  • the first adjusted text feature refers to a feature obtained by adjusting the text extraction feature based on the image attention intensity.
  • the second adjusted text feature refers to a feature obtained by adjusting the text extraction feature based on the audio attention intensity.
  • the server may obtain the feature attention intensity corresponding to the text extraction feature based on the feature relevance degree, the feature relevance degree and the feature attention intensity are positively correlated, and adjust the text extraction feature based on the feature attention intensity to obtain the adjusted text feature.
  • the feature correlation degree is positively correlated with the feature attention intensity.
  • the feature correlation degree can be used as the feature attention intensity, or a linear operation or a nonlinear operation can be performed on the feature correlation degree, and the result of the operation can be used as the feature attention intensity.
  • the linear operation includes at least one of an addition operation or a product operation
  • the nonlinear operation includes at least one of an exponential operation or a logarithmic operation.
  • a positive correlation means that when other conditions remain unchanged, two variables change in the same direction. When one variable changes from large to small, the other variable also changes from large to small. It is understandable that the positive correlation here means that the direction of change is consistent, but it does not require that when one variable changes a little, the other variable must also change. For example, when the variable a is 10 to 20, the variable b can be set to 100, and when the variable a is 20 to 30, the variable b can be set to 120. In this way, the direction of change of a and b is that when a becomes larger, b also becomes larger. However, when a is in the range of 10 to 20, b may be unchanged.
  • the feature attention intensity may include at least one of an image attention intensity or an audio attention intensity.
  • the image attention intensity is obtained based on the image relevance, the image attention intensity is positively correlated with the image relevance, the audio attention intensity is obtained based on the audio relevance, and the audio attention intensity is positively correlated with the audio relevance.
  • the feature attention intensity is used to reflect the intensity of attention to the feature. The higher the feature attention intensity is, the more attention needs to be paid to the feature when performing content recognition.
  • the server may perform similarity calculation on the association extraction feature and the text extraction feature to obtain the feature similarity, use the feature similarity as the feature correlation, and obtain the feature attention intensity corresponding to the text extraction feature based on the feature correlation.
  • the correlation extraction feature and the text extraction feature may be calculated according to the cosine similarity calculation formula, and the calculated cosine similarity may be used as the feature similarity.
  • the server may use the feature attention intensity to adjust each text feature value in the text extraction feature to obtain the adjusted text feature.
  • for example, a linear operation may be performed on each text feature value and the feature attention intensity to obtain the text feature value after the linear operation, and the adjusted text feature is obtained based on the text feature values after each linear operation.
  • the linear operation may include at least one of an addition operation or a product operation.
  • the server may perform a product operation on the feature attention intensity and each feature value in the text extraction feature respectively to obtain the product of each feature value, and sort the feature value product according to the position of the feature value in the text extraction feature to obtain the feature value sequence, Take the sequence of eigenvalues as the adjusted text feature.
  • the position of the text feature value in the text extraction feature is the same as the position of the feature value product calculated from the text feature value in the feature value sequence.
  • for example, if the text extraction feature is the vector [a1, a2, a3], where a1, a2 and a3 are the feature values in the text extraction feature, and the feature attention intensity is c, then the feature value sequence is the vector [a1*c, a2*c, a3*c], where a1*c, a2*c and a3*c are the feature value products, and their positions are the same as the positions of a1, a2 and a3 in the text extraction feature [a1, a2, a3].
  • the server may adjust the text extraction feature matrix by using the normalized first product matrix to obtain a first adjusted text feature matrix; the normalized first product matrix includes the image correlation degree corresponding to each text extraction feature, and the first adjusted text feature matrix may include the first adjusted text features corresponding to each text extraction feature.
  • the server may perform a matrix multiplication operation on the normalized first product matrix and the transposed matrix of the text extraction feature matrix, and use the transposed matrix of the multiplied matrix as the first adjusted text feature matrix.
  • the first adjusted text feature matrix feature_fusion1 can be obtained by calculating formula (5), where feature_fusion1 represents the first adjusted text feature matrix, and [feature_fusion1]^T represents the transposed matrix of feature_fusion1.
  • the server may perform a matrix multiplication operation on the normalized second product matrix and the transposed matrix of the text extraction feature matrix to obtain a second adjusted text feature matrix; the normalized second product matrix includes the audio correlation degree corresponding to each text extraction feature, and the second adjusted text feature matrix may include the second adjusted text features corresponding to each text extraction feature.
  • the second adjusted text feature matrix feature_fusion2 may be obtained by calculation using formula (6), where feature_audio is the target audio feature matrix and [feature_audio]^T represents the transposed matrix corresponding to the target audio feature matrix.
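  • Formulas (5) and (6) are not reproduced in this excerpt; following the surrounding description, the adjustment can be read either as a per-word scaling by the feature attention intensity or as an attention-weighted combination computed from the normalized product matrix. Both variants are sketched below with row-vector conventions; the exact matrix orientation used by formulas (5) and (6) is an assumption here.

```python
import torch

def adjust_by_intensity(text_feats, intensity):
    """Element-wise variant: each text extraction feature is scaled by its attention intensity.
    text_feats: (num_words, d); intensity: (num_words,) feature attention intensities."""
    return intensity.unsqueeze(-1) * text_feats

def adjust_by_attention(text_feats, assoc_feats, scale=1.0):
    """Matrix variant in the spirit of formulas (5)/(6): each word gathers an attention-weighted
    combination of the associated (image or audio) features.
    text_feats: (num_words, d); assoc_feats: (num_assoc, d)."""
    weights = torch.softmax(text_feats @ assoc_feats.T / scale, dim=-1)  # (num_words, num_assoc)
    return weights @ assoc_feats                                         # adjusted text features
```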
  • Step 210: Perform recognition based on the adjusted text feature, and obtain a content recognition result corresponding to the target content.
  • the content recognition result is the result obtained by the recognition based on the adjusted text features.
  • the content recognition result may be determined according to the content recognition network used in the recognition, and the content recognition results obtained may be the same or different for different content recognition networks.
  • the content-aware network may include at least one of a scene-aware network or an entity-recognition network.
  • the scene recognition network is used to recognize the scene
  • the entity recognition network is used to recognize the entity.
  • the content recognition model can also be called a scene recognition model
  • the content recognition model can also be called an entity recognition model or an entity word recognition model.
  • the server may input the adjusted text features into the content recognition network of the trained content recognition model, and use the content recognition model to recognize the adjusted text features to obtain a content recognition result corresponding to the target content.
  • the text extraction feature is the feature corresponding to the target word segmentation in the target text
  • the adjusted text features corresponding to each target word segmentation can be sorted according to the order of the target word segmentation in the target text, and the sequence obtained by sorting can be used as the feature sequence
  • the server can identify based on the feature sequence and obtain the content identification result corresponding to the target content.
  • the feature sequence can be input into the content identification network in the content identification model to obtain the content identification result.
  • for example, when the content identification network is an entity identification network, the entity words included in the target content can be identified.
  • the content recognition model 400 includes a text feature extraction network, an association feature extraction network, an attention intensity calculation module, a feature adjustment module and a content recognition network.
  • the attention intensity calculation module is used to perform an association calculation on the association extraction feature and the text extraction feature to obtain the feature attention intensity.
  • the feature adjustment module is used to adjust the text extraction features based on the feature attention intensity, obtain the adjusted text features, and input the adjusted text features into the content recognition network to obtain the content recognition results corresponding to the target content.
  • Each network and module in the content recognition model 400 may be obtained through joint training.
  • the server obtains the target text and the text-related data from the target content, inputs the target text into the text feature extraction network to obtain the text extraction features, inputs the text-related data into the associated feature extraction network to obtain the associated extraction features, inputs the text extraction features and the associated extraction features into the attention intensity calculation module to obtain the feature attention intensity, inputs the feature attention intensity and the text extraction features into the feature adjustment module to obtain the adjusted text features, and inputs the adjusted text features into the content recognition network to obtain the content recognition result.
  • the server may also fuse the adjusted text features and the text extraction features to obtain fused text features; for example, statistical operations may be performed on the adjusted text features and the text extraction features, such as weighted calculation or mean calculation, to obtain the fused text features.
  • the server can determine the adjusted feature weight corresponding to the adjusted text feature, fuse the adjusted text feature and the text extraction feature based on the adjusted feature weight to obtain the fused text feature, and perform recognition based on the fused text feature to obtain the content recognition result corresponding to the target content.
  • the adjusted text feature includes a first adjusted text feature and a second adjusted text feature
  • the server may perform fusion based on the first adjusted text feature, the second adjusted text feature and the text extraction feature, such as weighted calculation or mean calculation, to obtain the fused text features.
  • the adjusted feature weight may include a first feature weight corresponding to the first adjusted text feature and a second feature weight corresponding to the second adjusted text feature.
  • the server may fuse the first adjusted text feature and the text extraction feature based on the first feature weight to obtain a first fusion feature, fuse the second adjusted text feature and the text extraction feature based on the second feature weight to obtain a second fusion feature, perform a statistical operation on the first fusion feature and the second fusion feature, and use the result of the statistical operation as the fused text feature. For example, the feature values at corresponding positions in the first fusion feature and the second fusion feature are added to obtain summed values; the summed values are sorted according to the positions of the feature values in the first fusion feature or the second fusion feature, and the sequence obtained after sorting is used as the fused text feature.
  • the target content to be recognized is determined, the target text and the text-related data associated with the target text are obtained from the target content, feature extraction is performed on the target text to obtain the text extraction features, feature extraction is performed on the text-related data to obtain the associated extraction features, the feature correlation degree between the associated extraction features and the text extraction features is determined, where the feature correlation degree is used to represent the degree of association between the target text and the text-related data, the text extraction features are adjusted based on the feature correlation degree to obtain the adjusted text features, and recognition is performed based on the adjusted text features to obtain a content recognition result corresponding to the target content.
  • since the text extraction features are adjusted based on the feature correlation degree, adaptive adjustment of the text extraction features is realized according to the degree of association between the text-related data and the target text, so that when recognition is performed based on the adjusted text features, the recognition result is affected by the text-related data.
  • the greater the degree of association between the target text and the text-related data, the greater the impact of the text-related data on the recognition result. This makes information with a greater degree of association receive more attention during content identification, and improves the accuracy of content identification.
  • performing recognition based on the adjusted text features to obtain a content recognition result corresponding to the target content includes: fusing the adjusted text features and the text extraction features to obtain fused text features; and performing recognition based on the fused text features to obtain the content recognition result corresponding to the target content.
  • the fused text feature is a feature obtained by fusing the adjusted text feature and the text extraction feature.
  • the dimensions of fusing text features, adjusting text features, and text extracting features may be the same, for example, they may be vectors or matrices of the same dimensions.
  • the server may perform a statistical operation on the adjusted text feature and the text extraction feature, such as a mean value operation or an addition operation, and use the result of the statistical operation as the fused text feature.
  • the server may encode the text extraction feature to obtain the encoded feature corresponding to the text extraction feature as the first encoding feature, and may encode the adjusted text feature to obtain the encoded feature corresponding to the adjusted text feature as the second encoding feature; statistical operations are then performed on the first encoding feature and the second encoding feature, such as a mean value operation or a sum operation, and the result of the operation is used as the fused text feature.
  • the server may input the fused text features into the content recognition network of the trained content recognition model, and use the content recognition network to recognize the fused text features to obtain a content recognition result corresponding to the target content.
  • the adjusted text features and the text extraction features are fused to obtain fused text features, and recognition is performed based on the fused text features to obtain a content recognition result corresponding to the target content, which can improve the accuracy of content recognition.
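  • A small sketch of this fusion-then-recognize step: here the adjusted text features and the original text extraction features are simply averaged, one of the statistical operations mentioned above, before being passed to a recognition head; the recognition head itself (for example a per-token entity classifier) is left abstract.

```python
import torch.nn as nn

def recognize_with_fusion(text_feats, adjusted_feats, recognition_head: nn.Module):
    """Mean-value fusion of text extraction features and adjusted text features."""
    fused = (text_feats + adjusted_feats) / 2   # fused text features (mean statistical operation)
    return recognition_head(fused)              # content recognition result, e.g. label scores
```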
  • fusing the adjusted text feature and the text extraction feature to obtain the fused text feature includes: encoding the text extraction feature to obtain the first encoded feature, encoding the adjusted text feature to obtain the second encoded feature; The first coding feature and the second coding feature are fused to obtain the fused coding feature; the adjusted feature weight corresponding to the adjusted text feature is obtained based on the fused coding feature; the adjusted text feature and the text extraction feature are fused based on the adjusted feature weight to obtain the fused text feature .
  • the first encoding feature is a feature obtained by encoding the text extraction feature.
  • the second encoding feature is a feature obtained by encoding the adjusted text feature.
  • the fused coding feature is a feature obtained by fusing the first coding feature and the second coding feature.
  • the adjusted feature weights are obtained based on fused encoded features.
  • the content recognition model may further include a first encoder, a second encoder, and a feature fusion module, and the feature fusion module is configured to fuse the first coding feature and the second coding feature to obtain a fused coding feature.
  • the server may input the text extraction feature into the first encoder in the trained content recognition model for encoding to obtain the first encoding feature, and input the adjusted text feature into the second encoder in the trained content recognition model for encoding, The second coding feature is obtained, and the first coding feature and the second coding feature are fused.
  • the first coding feature and the second coding feature may be input into the feature fusion module to obtain the fused coding feature.
  • the first encoder and the second encoder may be artificial intelligence-based neural networks, and each network and module in the content recognition model may be obtained by joint training.
  • the first encoder and the second encoder may be obtained by joint training.
  • the server may perform a statistical operation on the first encoding feature and the second encoding feature to obtain the encoding statistical feature; for example, the first encoding feature and the second encoding feature may be added and the result used as the fused coding feature, or an average value operation is performed on the first coding feature and the second coding feature and the calculated mean value is used as the fused coding feature.
  • the server may determine the fused coding feature based on the coding statistical feature, for example, the coding statistical feature may be used as the fused coding feature.
  • the server may perform normalization processing on the fusion coding feature, and use the result obtained by normalization as the adjusted feature weight corresponding to the adjusted text feature.
  • the trained content recognition model can include an activation layer, which can convert data into values between 0 and 1; the step of normalizing the fusion coding feature and using the result of the normalization as the adjusted feature weight may include: inputting the fusion coding feature into the activation layer of the content recognition model for activation processing, and using the result of the activation processing as the adjusted feature weight corresponding to the adjusted text feature.
  • the server may calculate the product of the adjusted feature weight and the adjusted text feature to obtain the calculated adjusted text feature, and perform a statistical operation on the calculated adjusted text feature and the text extraction feature, such as a sum operation or an average value operation, and use the result of statistical operation as the fused text feature.
  • the server may determine the text feature weight corresponding to the text extraction feature; for example, a preset weight may be obtained and used as the text feature weight, and the preset weight may be set in advance as required.
  • the text feature weight can also be determined according to the adjusted feature weight.
  • the adjusted feature weight can be negatively correlated with the text feature weight.
  • the sum of the adjusted feature weight and the text feature weight can be a preset value, and the preset value can be preset as needed. For example, it can be 1.
  • for example, the result obtained by subtracting the adjusted feature weight from the preset value can be used as the text feature weight.
  • the text feature weight can be 0.7.
  • the preset value is greater than the text feature weight, and the preset value is greater than the adjustment feature weight.
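  • The weighting scheme above, an adjusted feature weight plus a complementary text feature weight that together sum to a preset value, can be sketched as a gated combination; the use of linear layers as the two encoders, a sigmoid gate, and 1 as the preset value are assumptions consistent with, but not stated by, the description.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse adjusted text features with text extraction features using an adjusted feature weight."""
    def __init__(self, dim):
        super().__init__()
        self.text_encoder = nn.Linear(dim, dim)      # first encoder (encodes the text extraction feature)
        self.adjusted_encoder = nn.Linear(dim, dim)  # second encoder (encodes the adjusted text feature)

    def forward(self, text_feats, adjusted_feats):
        enc1 = self.text_encoder(text_feats)             # first encoding feature
        enc2 = self.adjusted_encoder(adjusted_feats)     # second encoding feature
        weight = torch.sigmoid(enc1 + enc2)              # adjusted feature weight in (0, 1)
        # the text feature weight is the complement, so the two weights sum to the preset value 1
        return weight * adjusted_feats + (1 - weight) * text_feats   # fused text features
```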
  • Negative correlation means that when other conditions remain unchanged, the two variables change in opposite directions. When one variable changes from large to small, the other variable changes from small to large. It is understandable that the negative correlation here means that the direction of change is opposite, but it does not require that when one variable changes a little, the other variable must also change.
  • the first encoder may include at least one of a first text encoder or a second text encoder
  • the second encoder may include at least one of an image encoder or an audio encoder.
• the first encoding feature may include at least one of a first text feature or a second text feature, where the first text feature is obtained by using the first text encoder to encode the text extraction feature, and the second text feature is obtained by using the second text encoder to encode the text extraction feature.
• the second encoding feature may include at least one of an image encoding feature or an audio encoding feature, where the image encoding feature is obtained by using the image encoder to encode the first adjusted text feature, and the audio encoding feature is obtained by using the audio encoder to encode the second adjusted text feature.
  • the fusion coding features may include at least one of text image coding features or text audio coding features.
  • the text image coding feature is a feature obtained by fusing the first text coding feature and the image coding feature.
  • the text-audio coding feature is a feature obtained by fusing the second text-coding feature and the audio coding feature.
• the server may input the text extraction feature into the first text encoder for encoding to obtain the first text feature, input the first adjusted text feature into the image encoder for encoding to obtain the image encoding feature, and fuse the first text feature and the image encoding feature to obtain the text-image encoding feature.
• the server may input the text extraction feature into the second text encoder for encoding to obtain the second text feature, input the second adjusted text feature into the audio encoder for encoding to obtain the audio encoding feature, and fuse the second text feature and the audio encoding feature to obtain the text-audio encoding feature; the text-image encoding feature and the text-audio encoding feature can be used as the fused encoding features.
  • the first text encoder and the second text encoder may be the same encoder or different encoders, and the image encoder and the audio encoder may be the same encoder or different encoders.
  • the text extraction feature is encoded to obtain the first encoding feature
  • the adjusted text feature is encoded to obtain the second encoding feature
• the first encoding feature and the second encoding feature are fused to obtain the fused encoding feature, and the adjusted feature weight corresponding to the adjusted text feature is obtained based on the fused encoding feature.
• the adjusted text features and the text extraction features are fused to obtain the fused text features, so that the fused text features can reflect both the text extraction features and the adjusted text features; the expression ability of the fused text features is improved, and the recognition accuracy can be improved when recognition is performed based on the fused text features.
• the first encoding feature is encoded by a first encoder in the trained content recognition model, and the second encoding feature is encoded by a second encoder in the content recognition model; obtaining the adjusted feature weight corresponding to the adjusted text feature based on the fused encoding feature includes: inputting the fused encoding feature into the target activation layer in the content recognition model for activation processing to obtain the target activation value, and using the target activation value as the adjusted feature weight corresponding to the adjusted text feature, where the target activation layer is an activation layer shared by the first encoder and the second encoder.
  • the activation layer is used to convert data into data between 0 and 1, which can be implemented by an activation function, and the activation function includes but is not limited to at least one of a Sigmoid function, a tanh function, or a Relu function.
• the target activation layer is the activation layer in the trained content recognition model that is shared by the first encoder and the second encoder, that is, the target activation layer can receive the output data of both the first encoder and the second encoder.
  • the target activation value is the result obtained by using the target activation layer to activate the fusion coding feature.
  • the target activation value and the fusion coding feature can have the same dimension, for example, a vector or matrix of the same dimension.
• the content recognition model 500 includes an association feature extraction network, an attention intensity calculation module, a text feature extraction network, a feature adjustment module, a first encoder, a second encoder, and a feature fusion module.
  • the target activation layer may include at least one of a first activation layer shared by the first text encoder and the image encoder, and a second activation layer shared by the second text encoder and the audio encoder.
  • the target activation value may include at least one of a first activation value obtained by activating a text-image coding feature, or a second activation value obtained by activating a text-audio coding feature.
• when the fused encoding feature is the text-image encoding feature, the server can input the text-image encoding feature into the first activation layer for activation to obtain the first activation value, and use the first activation value as the first feature weight corresponding to the first adjusted text feature; when the fused encoding feature is the text-audio encoding feature, the server can input the text-audio encoding feature into the second activation layer for activation to obtain the second activation value, use the second activation value as the second feature weight corresponding to the second adjusted text feature, and use the first feature weight and the second feature weight as the adjusted feature weights.
• the server may perform matrix fusion of the first adjusted text feature matrix and the text extraction feature matrix. For example, the text extraction feature matrix is input into the first text encoder for encoding to obtain the first matrix encoding feature, and the first adjusted text feature matrix is input into the image encoder for encoding to obtain the second matrix encoding feature.
• a statistical operation is performed on the first matrix encoding feature and the second matrix encoding feature to obtain the first matrix feature statistical result, and the first matrix feature statistical result is normalized; for example, the first matrix feature statistical result can be input into the first activation layer for activation to obtain the normalized first matrix feature statistical result.
• the normalized first matrix feature statistical result may include the first feature weights corresponding to each first adjusted text feature respectively.
• formula (7) can be used to obtain the normalized first matrix feature statistical result gate_1, where gate_1 represents the normalized first matrix feature statistical result and sigmoid is the activation function.
• the server may perform matrix fusion of the second adjusted text feature matrix and the text extraction feature matrix. For example, the text extraction feature matrix is input into the second text encoder for encoding to obtain the third matrix encoding feature, the second adjusted text feature matrix is input into the audio encoder for encoding to obtain the fourth matrix encoding feature, a statistical operation is performed on the third matrix encoding feature and the fourth matrix encoding feature to obtain the second matrix feature statistical result, and the second matrix feature statistical result is normalized; for example, the second matrix feature statistical result can be input into the second activation layer for activation to obtain the normalized second matrix feature statistical result, which may include the second feature weights corresponding to each second adjusted text feature.
• formula (8) can be used to obtain the normalized second matrix feature statistical result gate_2, where gate_2 represents the normalized second matrix feature statistical result, and the remaining parameters in the formula are the model parameters of the second text encoder and the model parameters of the audio encoder, respectively.
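• The exact forms of formulas (7) and (8) are not reproduced in this text. As a hedged sketch of what the gate computation could look like, assuming the statistical operation is a sum of learned projections of the text encoder output and the corresponding modality encoder output followed by a sigmoid, one might write (W_t1, W_img, W_t2 and W_aud are hypothetical stand-ins for the encoder parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# hypothetical parameter matrices standing in for the model parameters of the
# first/second text encoders and the image/audio encoders
d = 16
rng = np.random.default_rng(0)
W_t1, W_img = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_t2, W_aud = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def gates(text_matrix, adj_img_matrix, adj_aud_matrix):
    """Sketch of gate_1 / gate_2: project the text extraction feature matrix and
    each adjusted text feature matrix, sum the projections (the 'statistical
    operation'), then normalize with a sigmoid activation."""
    gate_1 = sigmoid(text_matrix @ W_t1 + adj_img_matrix @ W_img)  # image-side gates
    gate_2 = sigmoid(text_matrix @ W_t2 + adj_aud_matrix @ W_aud)  # audio-side gates
    return gate_1, gate_2
```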
• the fused encoding feature is input into the target activation layer in the content recognition model for activation processing to obtain the target activation value, and the target activation value is used as the adjusted feature weight corresponding to the adjusted text feature, so that the adjusted feature weight is a normalized value, which improves the rationality of the adjusted feature weight.
• fusing the adjusted text features and the text extraction features based on the adjusted feature weights to obtain the fused text features includes: obtaining the text feature weight corresponding to the text extraction feature based on the adjusted feature weight; calculating the product of the adjusted feature weight and the adjusted text feature to obtain the calculated adjusted text feature; calculating the product of the text feature weight and the text extraction feature to obtain the calculated text extraction feature; and adding the calculated adjusted text feature and the calculated text extraction feature to obtain the fused text feature.
• the text feature weight may be determined according to the adjusted feature weight, and the text feature weight may have a negative correlation with the adjusted feature weight; for example, the result obtained by subtracting the adjusted feature weight from a preset value may be used as the text feature weight.
  • the preset value is greater than the text feature weight, and the preset value is greater than the adjustment feature weight.
  • the server may use the result of multiplying the adjusted feature weight by the adjusted text feature as the calculated adjusted text feature, and may use the result of multiplying the text feature weight by the text extraction feature as the calculated text extraction feature,
  • the result obtained by adding the calculated adjusted text feature and the calculated text extraction feature can be used as a fused text feature.
  • the adjusted feature weight includes a first feature weight and a second feature weight
  • the text feature weight may include a first text weight obtained based on the first feature weight and a second text weight obtained based on the second feature weight.
  • the first text weight has a negative correlation with the first feature weight.
  • the second text weight has a negative correlation with the second feature weight.
  • the server may use the first feature weight, the second feature weight, the first text weight, and the second text weight to perform a weighted calculation on the first adjusted text feature, the second adjusted text feature, and the text extraction feature, and use the result of the weighted calculation as a fusion text features.
  • the server may use the first feature weight and the first text weight to perform a weighted calculation on the first adjusted text feature and the text extraction feature to obtain a first weighted value, and use the second feature weight and the second text weight to perform a weighted calculation on the second adjusted text feature and the text extraction feature to perform weighted calculation to obtain a second weighted value, and the result of adding the first weighted value and the second weighted value is used as a fusion text feature.
  • the server may perform a product calculation on the first text weight and the text extraction feature to obtain a first product value, perform a product calculation on the first feature weight and the first adjusted text feature to obtain a second product value, and calculate the second text weight Perform product calculation with the text extraction feature to obtain the third product value, multiply the second feature weight and the second adjusted text feature to obtain the fourth product value, and combine the first product value, the second product value, and the third product value. And the fourth product value is added, and the added result is used as the fused text feature.
• the server may perform weighted calculation on the first adjusted text feature matrix and the second adjusted text feature matrix by using the normalized first matrix feature statistical result and the normalized second matrix feature statistical result to obtain the fused text feature matrix.
  • the fused text feature matrix may include fused text features corresponding to each text extraction feature respectively.
  • formula (9) can be used to obtain the fused text feature matrix output.
  • output refers to the fusion text feature matrix.
• the product of the adjusted feature weight and the adjusted text feature is calculated to obtain the calculated adjusted text feature, and the product of the text feature weight and the text extraction feature is calculated to obtain the calculated text extraction feature.
• the calculated adjusted text feature and the calculated text extraction feature are added to obtain the fused text feature. Since the text feature weight is obtained based on the adjusted feature weight, the accuracy of the text feature weight is improved, thereby improving the accuracy of the fused text feature.
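• A short sketch of this weighted fusion, in the spirit of the four-product combination described above and of formula (9) (whose exact form is not reproduced here); the function and argument names are illustrative, and the text feature weights are taken as the complements of the gates:

```python
import numpy as np

def fuse_text_features(text_feat, adj_img_feat, adj_aud_feat, gate_1, gate_2):
    """Sketch of the fused text feature: each gate weights its adjusted text
    feature, the complementary (1 - gate) weights weight the text extraction
    feature, and the four products are added."""
    first_product  = (1.0 - gate_1) * text_feat   # first text weight * text extraction feature
    second_product = gate_1 * adj_img_feat        # first feature weight * first adjusted text feature
    third_product  = (1.0 - gate_2) * text_feat   # second text weight * text extraction feature
    fourth_product = gate_2 * adj_aud_feat        # second feature weight * second adjusted text feature
    return first_product + second_product + third_product + fourth_product
```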
  • the target content is a target video; from the target content, obtaining the target text and text associated data associated with the target text includes: obtaining the text corresponding to the target time in the target video to obtain the target text; obtaining the target text in the target video For the video-related data corresponding to the time, the video-related data is regarded as the text-related data associated with the target text, and the video-related data includes at least one of a video frame or an audio frame.
  • a video frame is the smallest unit that composes a video, and a video is composed of multiple images.
  • An image in the video is called a frame, which can also be called a video frame.
• the target video can be any video; it can be a directly shot video or a video clip obtained from a shot video, and the target video can be any type of video, including but not limited to at least one of an advertising video, a TV series video, or a news video; the target video may also be a video to be pushed to the user.
  • the target time may be any time point or time period from the start time point to the end time point of the target video.
• Video-related data refers to any data displayed or played at the target time in the target video, and can include at least one of the video frames displayed at the target time in the target video or the audio frames played at the target time; the video frames displayed at the target time may include one or more frames, and the audio frames played at the target time may include one or more frames.
  • the server may obtain the text displayed at the target time in the target video as the target text, for example, at least one of subtitles, bullet screens or comments displayed at the target time, as the target text.
  • the server may acquire at least one of the video frame displayed at the target time or the audio frame played at the target time in the target video, as video-related data.
• the text corresponding to the target time in the target video is obtained as the target text, the video-related data corresponding to the target time in the target video is obtained, and the video-related data is used as the text-related data associated with the target text; the video-related data includes at least one of a video frame or an audio frame, so that text data and also image data or audio data other than the text data are acquired, and the video can be identified on the basis of the text data in combination with the image data or the audio data, which is beneficial to improving the accuracy of recognition.
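• Purely as an illustration of obtaining a video frame displayed at a target time (the text does not prescribe any particular tool), a sketch using OpenCV might look like the following; the file path and target time are hypothetical, and audio frames at the same time could be extracted analogously with an audio library:

```python
import cv2

def frame_at_time(video_path, target_ms):
    """Return the video frame displayed at target_ms (milliseconds), or None."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, target_ms)  # seek to the target time
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None

# hypothetical usage: grab the frame shown 12.5 seconds into the target video
frame = frame_at_time("target_video.mp4", 12_500)
```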
• the adjusted text feature includes a first adjusted text feature obtained by adjusting based on the video frame; performing recognition based on the adjusted text feature to obtain the content recognition result corresponding to the target content includes: fusing the first adjusted text feature and the text extraction feature to obtain the fused text feature; and performing recognition based on the fused text feature to obtain the content recognition result corresponding to the target content.
• the server may obtain video frames from the text-related data, perform feature extraction on the obtained video frames to obtain target image features, and obtain the first adjusted text feature based on the target image features; the server may also obtain audio frames from the text-related data, perform feature extraction on the obtained audio frames to obtain target audio features, and obtain the second adjusted text feature based on the target audio features.
• the server may perform weighted calculation on the first adjusted text feature and the text extraction feature, and use the result of the weighted calculation as the fused text feature. For example, the server may multiply the first text weight by the text extraction feature to obtain a first product value, multiply the first feature weight by the first adjusted text feature to obtain a second product value, multiply the second text weight by the text extraction feature to obtain a third product value, add the first product value, the second product value, and the third product value, and use the added result as the fused text feature.
• the first adjusted text feature and the text extraction feature are fused to obtain the fused text feature, so that the fused text feature is obtained based on both the first adjusted text feature and the text extraction feature, thereby improving the feature richness of the fused text feature, so that when recognition is performed based on the fused text feature, the recognition accuracy can be improved.
• the adjusted text feature further includes a second adjusted text feature obtained by adjusting based on the audio frame; fusing the first adjusted text feature and the text extraction feature to obtain the fused text feature includes: fusing the first adjusted text feature, the second adjusted text feature, and the text extraction feature to obtain the fused text feature.
  • the server may acquire the audio frame from the text-related data, perform feature extraction on the acquired audio frame to obtain the target audio feature, and obtain the second adjusted text feature based on the target audio feature.
• the server may perform weighted calculation on the first adjusted text feature, the second adjusted text feature, and the text extraction feature, and use the result of the weighted calculation as the fused text feature. For example, the server may multiply the first text weight by the text extraction feature to obtain a first product value, multiply the first feature weight by the first adjusted text feature to obtain a second product value, multiply the second text weight by the text extraction feature to obtain a third product value, multiply the second feature weight by the second adjusted text feature to obtain a fourth product value, add the first product value, the second product value, the third product value, and the fourth product value, and use the added result as the fused text feature.
• the first adjusted text feature, the second adjusted text feature, and the text extraction feature are fused to obtain the fused text feature, so that the fused text feature is obtained based on the first adjusted text feature, the second adjusted text feature, and the text extraction feature, thereby improving the feature richness of the fused text feature.
• adjusting the text extraction feature based on the feature correlation degree to obtain the adjusted text feature includes: obtaining the feature attention intensity corresponding to the text extraction feature based on the feature correlation degree, where the feature correlation degree is positively correlated with the feature attention intensity; and adjusting the text extraction feature based on the feature attention intensity to obtain the adjusted text feature.
• the text extraction feature is adjusted based on the feature attention intensity, so that the text extraction feature is adaptively adjusted according to the degree of correlation between the text-related data and the target text; when recognition is performed based on the adjusted text feature, the recognition result is influenced by the text-related data, the greater the degree of correlation, the greater the influence, so that information with a greater degree of correlation receives more attention in content recognition, which improves the accuracy of content recognition.
• adjusting the text extraction feature based on the feature attention intensity to obtain the adjusted text feature includes: multiplying the feature attention intensity by each feature value of the text extraction feature to obtain feature value products; arranging the feature value products according to the positions of the feature values in the text extraction feature; and using the feature value sequence obtained by the arrangement as the adjusted text feature.
  • the feature value product refers to the result obtained by multiplying the text feature value and the feature attention intensity.
• the feature value sequence is obtained by arranging the feature value products according to the positions, in the text extraction feature, of the text feature values from which they were calculated; that is, the position of a text feature value in the text extraction feature is the same as the position, in the feature value sequence, of the feature value product calculated from that text feature value.
• the feature attention intensity is multiplied by each feature value of the text extraction feature to obtain the feature value products, so that the feature value products can reflect the degree of attention of the text-related data to the text feature values; the feature value products are arranged according to the positions of the feature values in the text extraction feature, and the feature value sequence obtained by the arrangement is used as the adjusted text feature, so that the adjusted text feature can reflect the attention of the text-related data to the text extraction feature.
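• A minimal sketch of this element-wise adjustment, assuming the feature attention intensity is a scalar and the text extraction feature is a vector (the function name is illustrative):

```python
import numpy as np

def adjust_text_feature(text_feature, attention_intensity):
    """Multiply the feature attention intensity by each feature value of the
    text extraction feature; positions are preserved, so the resulting sequence
    of feature value products is the adjusted text feature."""
    return attention_intensity * np.asarray(text_feature)

# toy usage: an 8-dimensional text extraction feature and a scalar intensity
adjusted = adjust_text_feature(np.random.randn(8), attention_intensity=0.83)
```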
• the text extraction feature is a feature corresponding to a word segment in the target text; the adjusted text features form a feature sequence according to the order of the word segments in the target text; performing recognition based on the adjusted text features to obtain the content recognition result corresponding to the target content includes: obtaining the positional relationship of each word segment relative to the named entity based on the feature sequence; obtaining the target named entity from the target text based on each positional relationship; and using the target named entity as the content recognition result corresponding to the target content.
  • the feature sequence is a sequence obtained by sorting the adjusted text features corresponding to the target word segmentation according to the order of the target word segmentation in the target text, and the target word segmentation refers to the word segmentation in the target text.
  • a named entity refers to an entity identified by a name, which can include at least one of a person's name, a place name or an organization name. The named entity can be, for example, "Zhang San", "A region" or "B organization”.
  • the positional relationship with respect to the named entity may include at least one of a named entity position or a non-named entity position.
  • the named entity position refers to the position where the named entity is located, and may include at least one of the starting position of the named entity, the ending position of the named entity, or the middle position of the named entity.
  • the intermediate position of the named entity may include various positions between the starting position and the ending position of the named entity.
  • the non-named entity position refers to the position of the participle outside the named entity.
  • the server may determine the positional relationship of each target word segment relative to the named entity based on the feature sequence, obtain the positional relationship corresponding to each target word segmentation, and obtain the target corresponding to the positional relationship belonging to the location of the named entity from each positional relationship Participle, as entity participle, obtains the target named entity based on each entity participle.
  • the trained content recognition model may include an entity recognition network
• the server may input the feature sequence into the entity recognition network, and use the entity recognition network to perform position recognition on each adjusted text feature in the feature sequence; for example, the entity recognition network can determine, based on the adjusted text feature, the probability that the target word segment corresponding to the adjusted text feature is at the position of a named entity to obtain the named entity probability, and determine the positional relationship of a target word segment whose named entity probability is greater than the named entity probability threshold as the named entity position.
  • the named entity probability threshold can be set as desired.
• the entity recognition network can also determine, based on the adjusted text feature, the probability that the target word segment corresponding to the adjusted text feature is at the starting position of a named entity to obtain the starting probability, and determine the positional relationship of a target word segment whose starting probability is greater than the starting probability threshold as the starting position of the named entity.
  • the starting probability threshold can be set as required.
• the entity recognition network can also determine, based on the adjusted text feature, the probability that the target word segment corresponding to the adjusted text feature is at the end position of a named entity to obtain the end probability, and determine the positional relationship of a target word segment whose end probability is greater than the end probability threshold as the end position of the named entity.
  • the end probability threshold can be set as required.
  • the positional relationship of each word segment relative to the named entity is obtained based on the feature sequence
  • the target named entity is obtained from the target text based on each positional relationship
• the target named entity is used as the content recognition result corresponding to the target content, so that content recognition is performed based on the feature sequence formed by the adjusted text features, which improves the accuracy of content recognition.
• acquiring the target named entity from the target text based on each positional relationship includes: acquiring a word segment whose positional relationship is the starting position of the named entity as the named entity start word; among the backward word segments corresponding to the named entity start word, using the word segments whose positional relationship is inside the named entity as named entity constituent words; and combining the named entity start word and the named entity constituent words to obtain the target named entity.
  • the named entity start word refers to the participle at the starting position of the named entity
  • the backward participle corresponding to the named entity start word refers to the participle sorted after the named entity start word in the target text.
  • Named entity composition word refers to the participle inside the named entity in the target text.
• the interior of the named entity includes the end position of the named entity and the middle positions of the named entity; the end position of the named entity and the middle position of the named entity can be the same location.
  • the target text is "Zhang San likes flowers"
  • the named entity is “Zhang San” which is two characters
  • the named entity The starting word is “Zhang”
• the backward word segments corresponding to the named entity start word include "three", "xi", "huan" and "flower"; since "three" is inside the named entity, "three" is a named entity constituent word.
  • the target named entity is the entity included in the target text, which is obtained by combining the named entity start word and the corresponding named entity composition word.
  • One or more target named entities may be included in the target text, and multiple refers to at least two. For example, if the target text is "Zhang San likes Li Si", the target text includes two target named entities, namely "Zhang San” and "Li Si”.
• the server may, based on the positional relationship corresponding to each target word segment, obtain from the target text the word segment whose positional relationship is the starting position of the named entity and use it as the named entity start word; then, in order from front to back, obtain a backward word segment from the backward word segments of the named entity start word as the current backward word segment. When the positional relationship of the current backward word segment is inside the named entity, the current backward word segment is used as a named entity constituent word corresponding to the named entity start word; when the positional relationship of the current backward word segment is outside the named entity, the server stops obtaining backward word segments from the backward word segments of the named entity start word.
• the named entity start word and the named entity constituent words are sorted from front to back according to their positions in the target text to obtain the target named entity. For example, since the position of "Zhang" is before "three", "Zhang San" is obtained by sorting, that is, "Zhang San" is the target named entity.
• the word segment whose positional relationship is the starting position of the named entity is obtained as the named entity start word; among the backward word segments corresponding to the named entity start word, the word segments whose positional relationship is inside the named entity are used as named entity constituent words; the named entity start word and the named entity constituent words are combined to obtain the target named entity, so that entity recognition can be performed based on the feature sequence formed by the adjusted text features, which improves the accuracy of entity recognition.
• obtaining the positional relationship of each word segment relative to the named entity based on the feature sequence includes: obtaining, based on the feature sequence, the positional relationship of each word segment relative to the named entity and the entity type corresponding to the word segment; among the backward word segments corresponding to the named entity start word, using the word segments whose positional relationship is inside the named entity as named entity constituent words includes: among the backward word segments corresponding to the named entity start word, using the word segments whose positional relationship is inside the named entity and whose entity type is the same as that of the named entity start word as named entity constituent words.
  • the entity type refers to the type of named entity, including at least one type of person name, institution name or place name.
  • Named entity start words and named entity composition words can correspond to entity types respectively.
• the server can identify the entity type of each feature in the feature sequence, determine the entity type corresponding to each feature, and, in order from front to back, obtain a backward word segment from the backward word segments of the named entity start word as the current backward word segment; when the positional relationship of the current backward word segment is inside the named entity and its entity type is the same as that of the named entity start word, the current backward word segment is used as a named entity constituent word corresponding to the named entity start word; when the positional relationship of the current backward word segment is outside the named entity or its entity type is different from that of the named entity start word, the server stops obtaining backward word segments from the backward word segments of the named entity start word.
• the text extraction feature is the feature corresponding to a target word segment in the target text; the fused text features corresponding to the target word segments form a fused feature sequence according to the order of the target word segments in the target text; performing recognition based on the adjusted text features to obtain the content recognition result corresponding to the target content includes: obtaining the positional relationship of each word segment relative to the named entity based on the fused feature sequence; obtaining the target named entity from the target text based on each positional relationship; and using the target named entity as the content recognition result corresponding to the target content.
  • the fused feature sequence may be input into an entity recognition network, and the entity recognition network performs entity word recognition on each fused text feature in the fused feature sequence.
  • the entity recognition network can be, for example, the CRF network in Figure 6.
  • the target text is "Zhang Xiaohua loves to laugh”
  • the fusion feature sequence is [h1, h2, h3, h4, h5], and h1 corresponds to the word segmentation "Zhang”
  • h2 is the fusion text feature corresponding to the participle "small”
  • h3 is the fusion text feature corresponding to the participle "flower”
  • h4 is the fusion text feature corresponding to the participle "love”
• h5 is the fused text feature corresponding to the word segment "laugh".
  • the CRF network can score the word segmentation in the target text based on each feature in the fusion feature sequence, and obtain the score corresponding to each word segmentation, and can use softmax to normalize the score of the word segmentation.
  • the probability distribution corresponding to the word segmentation is obtained by normalization processing.
• the CRF network can use the "BIO" labeling method to label each target word segment in "Zhang Xiaohua loves to laugh", and obtain the label corresponding to each fused text feature.
  • B is the abbreviation of begin, which means the beginning of the entity word
  • I is the abbreviation of inside, which means the inside of the entity word
  • O is the abbreviation of outside, which means the outside of the entity word, as shown in the figure "Zhang Xiaohua Loves to Smile” is marked as "B-PER, I-PER, I-PER, O, O", where "PER” indicates that the entity word type is a person's name. From “B-PER, I-PER, I-PER, O, O”, "Zhang Xiaohua” in "Zhang Xiaohua Loves to Smile" can be determined as the target named entity.
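• A hedged sketch of how the BIO labels in this example could be decoded into the target named entity; the tokens below are illustrative romanizations of the word segments, and the decoding follows the start-word / same-type constituent-word combination described earlier:

```python
def decode_bio(tokens, tags):
    """Combine a B-* start word with the following I-* words of the same
    entity type into one named entity (BIO decoding)."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # same-type named entity constituent word
        else:
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities

# the example above: "B-PER, I-PER, I-PER, O, O" -> [("ZhangXiaoHua", "PER")]
print(decode_bio(["Zhang", "Xiao", "Hua", "ai", "xiao"],
                 ["B-PER", "I-PER", "I-PER", "O", "O"]))
```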
• the positional relationship of each word segment relative to the named entity and the entity type corresponding to the word segment are obtained based on the feature sequence, and only the word segments of the same entity type as the named entity start word are used as named entity constituent words, which improves the accuracy of entity recognition.
  • determining the feature correlation degree between the association extraction feature and the text extraction feature includes: performing a product operation on the association feature value in the association extraction feature and the text feature value at the corresponding position in the text extraction feature to obtain a product operation value; Statistics are performed on the product operation value to obtain the feature correlation degree between the correlation extraction feature and the text extraction feature.
  • the association extraction feature may be a vector or matrix of the same dimension as the text extraction feature
• the server may obtain the association feature value at a target position from the association extraction feature as the first target feature value, and obtain the text feature value at the same target position from the text extraction feature as the second target feature value; the first target feature value and the second target feature value then have a positional correspondence, and the server can perform a product operation on the first target feature value and the second target feature value to obtain the product operation value of the text feature value and the association feature value at the target position.
  • the target position may be any position in the association extraction feature or the text extraction feature.
  • the target position may be any ranking position, such as the first position.
  • the server may perform statistics on each product operation value to obtain a product statistical value, perform normalization processing on the product statistical value, and use the result of the normalization processing as the feature correlation degree.
  • the server can use the feature correlation degree as the feature attention intensity corresponding to the text extraction feature.
• a product operation is performed on the association feature values in the association extraction feature and the text feature values at the corresponding positions in the text extraction feature to obtain the product operation values, and a statistical operation is performed on the product operation values to obtain the feature correlation degree between the association extraction feature and the text extraction feature, so that the feature correlation degree can accurately reflect the relationship between the text-related data and the target text; when the text extraction feature is adjusted based on the feature correlation degree, the adjustment accuracy can therefore be improved.
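• A minimal sketch of this correlation computation, assuming the statistical operation is a sum of the element-wise products and the normalization uses a sigmoid (as the activation examples above suggest); the function name is illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_correlation(assoc_feature, text_feature):
    """Multiply association feature values with the text feature values at the
    corresponding positions, sum the product operation values (the statistical
    operation), and normalize the statistic to obtain the correlation degree."""
    products = np.asarray(assoc_feature) * np.asarray(text_feature)  # element-wise products
    return float(sigmoid(products.sum()))                            # normalized correlation degree

# toy usage with two features of the same dimension
corr = feature_correlation(np.random.randn(8), np.random.randn(8))
```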
  • a content identification method comprising the steps of:
  • Step A Determine the target video to be identified, and obtain the target text, target image data and target audio data associated with the target text from the target video.
  • Step B Perform feature extraction on the target text to obtain text extraction features, perform feature extraction on target image data to obtain target image features, and perform feature extraction on target audio data to obtain target audio features.
• a trained entity recognition model 700 is shown.
  • the server can use the text feature extraction network in the trained entity recognition model to perform feature extraction on the target text to obtain text extraction features.
  • the image feature extraction network is used to perform feature extraction on the target image data to obtain the target image features, and the audio feature extraction network is used to perform feature extraction on the target audio data to obtain the target audio features.
• Step C Perform correlation calculation on the target image feature and the text extraction feature to obtain the image correlation degree, and use the image correlation degree as the image attention intensity; perform correlation calculation on the target audio feature and the text extraction feature to obtain the audio correlation degree, and use the audio correlation degree as the audio attention intensity.
• the image attention intensity calculation module can be used to perform correlation calculation on the target image feature and the text extraction feature to obtain the image attention intensity, and the audio attention intensity calculation module can be used to perform correlation calculation on the target audio feature and the text extraction feature to obtain the audio attention intensity.
• the image attention intensity calculation module includes a product operation unit and a normalization processing unit; the image attention intensity calculation module can perform a product operation on the target image feature and the text extraction feature through the product operation unit, and input the result of the operation into the normalization processing unit for normalization to obtain the image attention intensity. For the process by which the audio attention intensity calculation module obtains the audio attention intensity, reference may be made to the image attention intensity calculation module.
  • Step D Adjust the text extraction feature based on the image attention intensity to obtain the first adjusted text feature, and adjust the text extraction feature based on the audio attention intensity to obtain the second adjusted text feature.
• the image attention intensity and the text extraction feature can be input into the first feature adjustment module; the first feature adjustment module can multiply the image attention intensity by each feature value of the text extraction feature, arrange the products according to the positions of the feature values in the text extraction feature, and use the arranged values as the first adjusted text feature.
  • the second adjusted text feature can be obtained by using the second feature adjustment module.
  • Step E Determine the first feature weight corresponding to the first adjusted text feature, and determine the second feature weight corresponding to the second adjusted text feature.
• the server may input the first adjusted text feature into the image encoder for encoding to obtain the image encoding feature, input the text extraction feature into the first text encoder for encoding to obtain the first text feature, and input the first text feature and the image encoding feature into the first feature fusion module to obtain the text-image encoding feature.
• the server may input the second adjusted text feature into the audio encoder for encoding to obtain the audio encoding feature, input the text extraction feature into the second text encoder for encoding to obtain the second text feature, and input the second text feature and the audio encoding feature into the second feature fusion module to obtain the text-audio encoding feature; the text-image encoding feature is input into the first activation layer for activation to obtain the first feature weight corresponding to the first adjusted text feature, and the text-audio encoding feature is input into the second activation layer for activation to obtain the second feature weight corresponding to the second adjusted text feature.
• Step F Fuse the first adjusted text feature and the text extraction feature based on the first feature weight to obtain the first fusion feature, fuse the second adjusted text feature and the text extraction feature based on the second feature weight to obtain the second fusion feature, perform a statistical operation on the first fusion feature and the second fusion feature, and use the result of the statistical operation as the fused text feature.
• the server may input the first feature weight, the first adjusted text feature, and the text extraction feature into the first fused text feature generation module to obtain the first fusion feature, and input the second feature weight, the second adjusted text feature, and the text extraction feature into the second fused text feature generation module to obtain the second fusion feature.
  • Step G Perform named entity recognition based on the fused text features, obtain a target named entity corresponding to the target content, and use the target named entity as a content recognition result corresponding to the target content.
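• The following is a hedged sketch of how Steps C–G could be wired together, reusing the illustrative functions sketched earlier (feature_correlation, adjust_text_feature, gates, fuse_text_features); random vectors stand in for the Step A/B extraction outputs, and the final CRF step is only indicated in a comment:

```python
import numpy as np

rng = np.random.default_rng(1)
text_feat = rng.normal(size=16)   # text extraction feature (Step B, stand-in)
img_feat  = rng.normal(size=16)   # target image feature (Step B, stand-in)
aud_feat  = rng.normal(size=16)   # target audio feature (Step B, stand-in)

img_attn = feature_correlation(img_feat, text_feat)   # Step C: image attention intensity
aud_attn = feature_correlation(aud_feat, text_feat)   # Step C: audio attention intensity
adj_img  = adjust_text_feature(text_feat, img_attn)   # Step D: first adjusted text feature
adj_aud  = adjust_text_feature(text_feat, aud_attn)   # Step D: second adjusted text feature

gate_1, gate_2 = gates(text_feat, adj_img, adj_aud)   # Step E: feature weights
fused = fuse_text_features(text_feat, adj_img, adj_aud, gate_1, gate_2)  # Step F
# Step G would feed `fused` into the entity recognition network (e.g. a CRF layer).
```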
• for example, the target video is the video of "Zhang Xiaohua", and the target text is the subtitle "Zhang Xiaohua loves to laugh" in the video of "Zhang Xiaohua".
• the target image data is the images in the video that are temporally associated with the subtitle "Zhang Xiaohua loves to laugh", that is, the images that include "Zhang Xiaohua", and the target audio data is the audio that is temporally associated with the subtitle "Zhang Xiaohua loves to laugh", that is, the audio that includes "Zhang Xiaohua".
• the subtitle "Zhang Xiaohua loves to laugh", the images including "Zhang Xiaohua", and the audio including "Zhang Xiaohua" are input into the entity recognition model.
• when the above content recognition method is used for entity recognition, it uses not only the text information in the video, such as the title, subtitles, or description information, but also the audio features and image features of the video; the fusion of multiple modal features can extract video information more accurately and effectively, and enhances the effect of entity word recognition, for example improving the accuracy and efficiency of entity word recognition and improving the precision and recall on the test data set.
• one modality may correspond to one data type; for example, text, audio, and image are each one modality, and multimodality includes at least two modalities.
  • Modal features can be, for example, any of text features, audio features, or image features.
  • Multimodal features include at least two modal features.
• the entity word recognition model, i.e., the entity recognition model provided in this application, can effectively extract video information.
  • the present application further provides an application scenario, where the above-mentioned content recognition method is applied to the application scenario, which can perform entity recognition on the text in the video.
  • the application of the content recognition method in this application scenario is as follows:
• Receive a video tag generation request for the target video; in response to the video tag generation request, use the content recognition method provided by this application to perform entity word recognition on the target video, obtain the identified entity words, and use the identified entity words as the video tags corresponding to the target video.
  • the content recognition method provided in the present application can save time for acquiring video information and improve the efficiency of understanding video information.
  • the present application also provides an application scenario, where the above content recognition method is applied to the application scenario, which can perform entity recognition on the text in the video.
  • the application of the content recognition method in this application scenario is as follows:
• Receive a video recommendation request corresponding to the target user, obtain a candidate video, use the content recognition method provided in this application to perform entity word recognition on the candidate video, use the identified entity words as video tags corresponding to the candidate video, and obtain the user information corresponding to the target user; when it is determined that a video tag matches the user information, for example when the video tag matches the user's user portrait, the candidate video is pushed to the terminal corresponding to the target user.
  • the content identification method provided by the present application is applied to video recommendation, and can provide high-quality features for the video recommendation algorithm and optimize the video recommendation effect.
• a content identification apparatus may adopt a software module or a hardware module, or a combination of the two, to become a part of a computer device; the apparatus specifically includes: a target content determination module 902, a feature extraction module 904, a feature correlation degree obtaining module 906, an adjusted text feature obtaining module 908, and a content recognition result obtaining module 910, wherein: the target content determination module 902 is used to determine the target content to be identified and obtain, from the target content, the target text and the text-related data associated with the target text.
  • the feature extraction module 904 is configured to perform feature extraction on the target text to obtain text extraction features; perform feature extraction on text-related data to obtain associated extraction features.
  • the feature correlation degree obtaining module 906 is used to determine the feature correlation degree between the correlation extraction feature and the text extraction feature; the feature correlation degree is used to represent the correlation degree between the target text and the text correlation data.
  • the adjusting text feature obtaining module 908 is configured to adjust the text extraction feature based on the feature correlation degree to obtain the adjusted text feature.
  • the content recognition result obtaining module 910 is configured to perform recognition based on the adjusted text feature to obtain a content recognition result corresponding to the target content.
• the target content to be recognized is determined, the target text and the text-related data associated with the target text are obtained from the target content, feature extraction is performed on the target text to obtain the text extraction feature, feature extraction is performed on the text-related data to obtain the association extraction feature, the feature correlation degree between the association extraction feature and the text extraction feature is determined, the text extraction feature is adjusted based on the feature correlation degree to obtain the adjusted text feature, and recognition is performed based on the adjusted text feature to obtain the content recognition result corresponding to the target content. Since the feature correlation degree reflects the degree of correlation between the target text and the text-related data, the greater the feature correlation degree, the greater the correlation degree between the target text and the text-related data, and the smaller the feature correlation degree, the smaller that correlation degree; adjusting the text extraction feature based on the feature correlation degree therefore improves the accuracy of the features used in identification, which improves the accuracy of content identification.
  • the content recognition result obtaining module 910 includes: a first fused text feature obtaining unit, configured to fuse the adjusted text feature and the text extraction feature to obtain a fused text feature.
  • the first content recognition result obtaining unit is configured to perform recognition based on the fused text feature to obtain a content recognition result corresponding to the target content.
  • the adjusted text features and the text extraction features are fused to obtain fused text features, and recognition is performed based on the fused text features to obtain a content recognition result corresponding to the target content, which can improve the accuracy of content recognition.
• the first fused text feature obtaining unit is further configured to encode the text extraction feature to obtain the first encoding feature, encode the adjusted text feature to obtain the second encoding feature, fuse the first encoding feature and the second encoding feature to obtain the fused encoding feature, obtain the adjusted feature weight corresponding to the adjusted text feature based on the fused encoding feature, and fuse the adjusted text feature and the text extraction feature based on the adjusted feature weight to obtain the fused text feature.
  • the text extraction feature is encoded to obtain the first encoding feature
  • the adjusted text feature is encoded to obtain the second encoding feature
• the first encoding feature and the second encoding feature are fused to obtain the fused encoding feature, and the adjusted feature weight corresponding to the adjusted text feature is obtained based on the fused encoding feature.
• the adjusted text features and the text extraction features are fused to obtain the fused text features, so that the fused text features can reflect both the text extraction features and the adjusted text features; the expression ability of the fused text features is improved, and the recognition accuracy can be improved when recognition is performed based on the fused text features.
• the first encoding feature is encoded by a first encoder in the trained content recognition model, and the second encoding feature is encoded by a second encoder in the content recognition model; the first fused text feature obtaining unit is also used to input the fused encoding feature into the target activation layer in the content recognition model for activation processing to obtain the activation value, and use the activation value as the adjusted feature weight corresponding to the adjusted text feature, where the target activation layer is an activation layer shared by the first encoder and the second encoder.
• the fused encoding feature is input into the target activation layer in the content recognition model for activation processing to obtain the target activation value, and the target activation value is used as the adjusted feature weight corresponding to the adjusted text feature, so that the adjusted feature weight is a normalized value, which improves the rationality of the adjusted feature weight.
• the first fused text feature obtaining unit is further configured to obtain the text feature weight corresponding to the text extraction feature based on the adjusted feature weight, calculate the product of the adjusted feature weight and the adjusted text feature to obtain the calculated adjusted text feature, calculate the product of the text feature weight and the text extraction feature to obtain the calculated text extraction feature, and add the calculated adjusted text feature and the calculated text extraction feature to obtain the fused text feature.
• the product of the adjusted feature weight and the adjusted text feature is calculated to obtain the calculated adjusted text feature, and the product of the text feature weight and the text extraction feature is calculated to obtain the calculated text extraction feature.
• the calculated adjusted text feature and the calculated text extraction feature are added to obtain the fused text feature. Since the text feature weight is obtained based on the adjusted feature weight, the accuracy of the text feature weight is improved, thereby improving the accuracy of the fused text feature.
  • the target content is a target video;
  • the target content determination module 902 includes: a target text obtaining unit, configured to obtain the text corresponding to the target time in the target video to obtain the target text.
  • the text-related data obtaining unit is used to obtain the video-related data corresponding to the target time in the target video, and use the video-related data as the text-related data associated with the target text, and the video-related data includes at least one of video frames or audio frames.
• the text corresponding to the target time in the target video is obtained as the target text, the video-related data corresponding to the target time in the target video is obtained, and the video-related data is used as the text-related data associated with the target text; the video-related data includes at least one of a video frame or an audio frame, so that text data and also image data or audio data other than the text data are acquired, and the video can be identified on the basis of the text data in combination with the image data or the audio data, which is beneficial to improving the accuracy of recognition.
  • adjusting the text feature includes a first adjusted text feature obtained by adjusting the video frame;
  • the content recognition result obtaining module 910 includes: a second fused text feature obtaining unit, configured to combine the first adjusted text feature and the text extraction feature Fusion is performed to obtain fused text features.
  • the second content recognition result obtaining unit is configured to perform recognition based on the fused text feature to obtain a content recognition result corresponding to the target content.
• the first adjusted text feature and the text extraction feature are fused to obtain the fused text feature, so that the fused text feature is obtained based on the first adjusted text feature and the text extraction feature, thereby improving the feature richness of the fused text feature, so that the recognition accuracy can be improved when recognition is performed based on the fused text feature.
  • the adjusted text feature further includes a second adjusted text feature obtained through adjustment based on the audio frame;
  • the second fused text feature obtaining unit is further configured to fuse the first adjusted text feature, the second adjusted text feature, and the text extraction feature to obtain the fused text feature.
  • because the fused text feature is derived from all three of the first adjusted text feature, the second adjusted text feature, and the text extraction feature, its feature richness is increased, which improves accuracy when recognition is performed based on the fused text feature. A sketch of this three-way fusion follows.
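The sketch below only illustrates the idea; the gate names and the exact combination are assumptions modeled on formula (9) of the description, not a definitive implementation.

```python
def fuse_three_way(text_feat, image_adjusted, audio_adjusted, gate_img, gate_aud):
    # gate_img / gate_aud are the adjustment feature weights obtained from the
    # image and audio branches; each branch is combined with the text
    # extraction feature and the two branch results are summed.
    image_term = gate_img * image_adjusted + (1.0 - gate_img) * text_feat
    audio_term = gate_aud * audio_adjusted + (1.0 - gate_aud) * text_feat
    return image_term + audio_term
```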
  • adjusting the text extraction feature based on the feature correlation degree to obtain the adjusted text feature includes: obtaining a feature attention intensity corresponding to the text extraction feature based on the feature correlation degree, the feature correlation degree being positively correlated with the feature attention intensity; and adjusting the text extraction feature based on the feature attention intensity to obtain the adjusted text feature.
  • because the feature correlation degree is positively correlated with the feature attention intensity, the text feature is adjusted according to the degree of association between the target text and the text-related data, so that when recognition is performed based on the adjusted text feature, the recognition result is influenced by the text-related data: the greater the degree of association, the greater the influence of the text-related data on the recognition result, so information with a higher degree of association receives more attention during content recognition, and the accuracy of content recognition is improved.
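As a small illustration of that positive correlation (the linear map below is just one assumed choice; the description also allows using the correlation degree directly or applying a non-linear operation):

```python
def attention_intensity(correlation_degree, scale=1.0, bias=0.0):
    # With scale > 0, the attention intensity grows whenever the correlation
    # degree grows, preserving the required positive correlation; scale=1 and
    # bias=0 reduce to using the correlation degree itself.
    return scale * correlation_degree + bias
```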
  • the adjusted text feature obtaining module 908 includes: a feature value product obtaining unit, configured to multiply the feature attention intensity by each feature value of the text extraction feature to obtain feature value products.
  • the adjusted text feature obtaining unit is used to arrange the feature value products according to the positions of the feature values in the text extraction feature, and to use the feature value sequence obtained by the arrangement as the adjusted text feature.
  • multiplying the feature attention intensity by each feature value of the text extraction feature yields feature value products that reflect the degree of attention the text-related data pays to each text feature value; arranging the products according to the positions of the feature values in the text extraction feature and using the resulting feature value sequence as the adjusted text feature ensures that the adjusted text feature reflects the attention the text-related data pays to the text extraction feature.
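A minimal sketch of this scaling step, assuming the text extraction feature is held in a NumPy array (not the patent's actual code):

```python
import numpy as np

def adjust_text_feature(text_extraction_feature, attention_intensity):
    # Multiply every feature value by the attention intensity; element positions
    # are preserved, so the scaled sequence is the adjusted text feature.
    return np.asarray(text_extraction_feature) * attention_intensity
```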
  • the text extraction features are the features corresponding to the word segments in the target text; the adjusted text features form a feature sequence according to the order of the word segments in the target text; the content recognition result obtaining module 910 includes: a positional relationship obtaining unit, configured to obtain, based on the feature sequence, the positional relationship of each word segment relative to a named entity.
  • the third content recognition result obtaining unit is configured to obtain the target named entity from the target text based on each positional relationship, and to use the target named entity as the content recognition result corresponding to the target content.
  • the positional relationship of each word segment relative to a named entity is obtained based on the feature sequence, the target named entity is obtained from the target text based on each positional relationship, and the target named entity is used as the content recognition result corresponding to the target content, so that content recognition is performed on the feature sequence formed by the adjusted text features, which improves the accuracy of content recognition.
  • the third content recognition result obtaining unit is also used to obtain the word segment whose positional relationship is the starting position of a named entity as the named entity starting word; among the backward word segments corresponding to the named entity starting word, to take the word segments whose positional relationship is inside the named entity as named entity constituent words; and to combine the named entity starting word with the named entity constituent words to obtain the target named entity.
  • obtaining the word segment at the starting position of a named entity as the named entity starting word, taking the backward word segments whose positional relationship is inside the named entity as named entity constituent words, and combining them to obtain the target named entity means that entity recognition is performed on the feature sequence formed by the adjusted text features, which improves the accuracy of entity recognition; a decoding sketch follows.
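The following sketch illustrates this start-word/constituent-word combination using BIO-style position labels; the tag format and the example are assumptions modeled on the "张小花爱笑" example given later in the description, not the patent's implementation.

```python
def decode_entities(word_segments, tags):
    # word_segments: segments in text order; tags: 'B-<TYPE>' marks a named
    # entity starting word, 'I-<TYPE>' marks a word inside the entity, 'O' is
    # outside any entity.
    entities, i = [], 0
    while i < len(word_segments):
        if tags[i].startswith("B-"):
            entity_type, j = tags[i][2:], i + 1
            # Collect following segments that stay inside an entity of the same type.
            while j < len(word_segments) and tags[j] == "I-" + entity_type:
                j += 1
            entities.append(("".join(word_segments[i:j]), entity_type))
            i = j
        else:
            i += 1
    return entities

# Example: decode_entities(["张", "小", "花", "爱", "笑"],
#                          ["B-PER", "I-PER", "I-PER", "O", "O"])
# returns [("张小花", "PER")]
```

Because the inner loop requires each "I" tag to carry the same type as the starting word, the same sketch also covers the type-constrained variant described next.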
  • the positional relationship obtaining unit is also used to obtain, based on the feature sequence, the positional relationship of each word segment relative to a named entity and the entity type corresponding to the word segment; the third content recognition result obtaining unit is also used to take, among the backward word segments corresponding to the named entity starting word, the word segments whose positional relationship is inside the named entity and whose entity type is the same as the type of the named entity starting word, as named entity constituent words.
  • obtaining, based on the feature sequence, the positional relationship of each word segment relative to a named entity together with its entity type, and using only word segments of the same type as the named entity starting word as constituent words, improves the accuracy of entity recognition.
  • the feature correlation degree obtaining module 906 includes: a product operation value obtaining unit, configured to perform a product operation between each association feature value in the association extraction feature and the text feature value at the corresponding position in the text extraction feature to obtain product operation values.
  • the feature attention intensity obtaining unit is used to perform a statistical operation on the product operation values to obtain the feature correlation degree between the association extraction feature and the text extraction feature.
  • performing the product operation between the association feature values in the association extraction feature and the text feature values at the corresponding positions in the text extraction feature, and then performing a statistical operation on the product operation values, yields a feature correlation degree that accurately reflects the degree of correlation between the text-related data and the target text, so that when the text extraction feature is adjusted based on the feature correlation degree, the accuracy of the adjustment is improved.
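A minimal NumPy sketch of this position-wise product followed by a statistical operation; the scaling by √d / m and the softmax normalization are assumptions based on formula (4) of the description, not a definitive implementation:

```python
import numpy as np

def feature_correlation(assoc_feature, text_feature_matrix, d, m=1.0):
    # assoc_feature: association extraction feature (length-d vector).
    # text_feature_matrix: one column (length d) per word segment.
    # The position-wise products are summed into one score per segment ...
    scores = assoc_feature @ text_feature_matrix
    # ... then scaled and normalized so each score becomes a correlation degree.
    scaled = scores / (np.sqrt(d) / m)
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()
```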
  • Each module in the above-mentioned content identification device may be implemented in whole or in part by software, hardware, or a combination thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10 .
  • the computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store content identification data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer readable instructions when executed by a processor, implement a content identification method.
  • a computer device is provided, and the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 11 .
  • the computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer-readable instructions.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized by WIFI, operator network, NFC (Near Field Communication) or other technologies.
  • the computer readable instructions when executed by a processor, implement a content identification method.
  • the display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device of the computer device may be a touch layer covering the display screen, a button, a trackball, or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
  • FIG. 10 and FIG. 11 are only block diagrams of partial structures related to the solution of the present application, and do not constitute a limitation on the computer equipment to which the solution of the present application is applied.
  • a computer device may include more or fewer components than those shown in the figures, or combine certain components, or have a different arrangement of components.
  • a computer apparatus is provided, including a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to implement the steps in the foregoing method embodiments.
  • one or more non-transitory readable storage media are provided, storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to implement the steps in the foregoing method embodiments.
  • a computer program product includes computer readable instructions that, when executed by a processor, implement the steps of the above image processing method.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical memory, and the like.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

Abstract

This application relates to a content recognition method and apparatus, a computer device, and a storage medium. The method includes: obtaining, from target content to be recognized, a target text and text-related data associated with the target text; performing feature extraction on the target text to obtain a text extraction feature; performing feature extraction on the text-related data to obtain an association extraction feature; determining a feature correlation degree between the association extraction feature and the text extraction feature; adjusting the text extraction feature based on the feature correlation degree to obtain an adjusted text feature; and performing recognition based on the adjusted text feature to obtain a content recognition result corresponding to the target content.

Description

内容识别方法、装置、计算机设备和存储介质
本申请要求于2021年03月26日提交中国专利局,申请号为202110325997.8,申请名称为“内容识别方法、装置、计算机设备和存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,特别是涉及一种内容识别方法、装置、计算机设备和存储介质。
背景技术
随着自然语言处理技术以及人工智能技术的发展,在很多情况下都需要对内容进行识别,例如对视频进行识别。在对内容进行识别时,可以基于人工智能的模型对内容进行识别,并可以从内容中获取所需要的信息。例如,可以对文本进行识别以从文本中获取所需要的内容实体。
目前,对内容进行识别的方法,存在不能准确识别内容的信息的情况,导致内容识别的准确度较低。
发明内容
根据本申请提供的各种实施例,提供了一种内容识别方法、装置、计算机设备、存储介质和计算机程序产品。
一种内容识别方法,由计算机设备执行,所述方法包括:确定待识别的目标内容,从所述目标内容中,获取所述目标文本以及与所述目标文本关联的文本关联数据;对所述目标文本进行特征提取,得到文本提取特征;对所述文本关联数据进行特征提取,得到关联提取特征;确定所述关联提取特征与所述文本提取特征间的特征关联度;所述特征关联度用于表征所述目标文本与所述文本关联数据之间的关联程度;基于所述特征关联度对所述文本提取特征进行调整,得到调整文本特征;及,基于所述调整文本特征进行识别,得到所述目标内容对应的内容识别结果。
一种内容识别装置,所述装置包括:目标内容确定模块,用于确定待识别的目标内容,从所述目标内容中,获取目标文本以及与所述目标文本关联的文本关联数据;特征提取模块,用于对所述目标文本进行特征提取,得到文本提取特征;对所述文本关联数据进行特征提取,得到关联提取特征;特征关联度得到模块,用于确定所述关联提取特征与所述文本提取特征间的特征关联度;所述特征关联度用于表征所述目标文本与所述文本关联数据之间的关联程度;调整文本特征得到模块,用于基于所述特征关联度对所述文本提取特征进行调整,得到调整文本特征;及,内容识别结果得到模块,用于基于所述调整文本特征进行识别,得到所述目标内容对应的内容识别结果。
一种计算机设备,包括存储器和一个或多个处理器,所述存储器存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述一个或多个处理器执行上述内容识别方法中的步骤。
一个或多个非易失性可读存储介质,其上存储有计算机可读指令,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器实现上述内容识别方法中的步骤。
一种计算机程序产品,包括计算机可读指令,所述计算机可读指令被处理器执行时实现上述图像处理方法的步骤。
本申请的一个或多个实施例的细节在下面的附图和描述中提出。本申请的其他特征、目的和优点将从说明书、附图以及权利要求书变得明显。
附图说明
图1为一些实施例中内容识别方法的应用环境图;
图2为一些实施例中内容识别方法的流程示意图;
图3为一些实施例中应用内容识别方法进行视频识别的原理图;
图4为一些实施例中内容识别模型的框架图;
图5为一些实施例中内容识别模型的框架图;
图6为一些实施例中利用实体识别网络进行实体识别的原理图;
图7为一些实施例中内容识别网络的框架图;
图8为一些实施例中利用实体识别模型进行实体识别的原理图;
图9为一些实施例中内容识别装置的结构框图;
图10为一些实施例中计算机设备的内部结构图;
图11为一些实施例中计算机设备的内部结构图。
具体实施方式
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
本申请提供的内容识别方法,可以应用于如图1所示的应用环境中。该应用环境包括终端102以及服务器104。其中,终端102以及服务器104通过网络进行通信。
具体地,服务器104可以响应于内容识别请求,获取待识别的目标内容,待识别的目标内容可以是内容识别请求中携带的,或者根据内容识别请求中携带的内容标识所获取的内容,服务器104可以从目标内容中,获取目标文本,并从目标内容中获取与目标文本关联的文本关联数据,对目标文本进行特征提取,得到文本提取特征,对文本关联数据进行特征提取,得到关联提取特征,确定关联提取特征与文本提取特征间的特征关联度,特征关联度用于表征目标文本与文本关联数据之间的关联程度,基于特征关联度对文本提取特征进行调整,得到调整文本特征,基于调整文本特征进行识别,得到目标内容对应的内容识别结果,服务器104可以将内容识别结果与目标内容关联存储,例如可以将内容识别结果作为目标内容的标签。其中,内容识别请求可以是服务器104触发的,也可以是其他设备例如终端发送至服务器104的。
其中,终端102上可以安装有客户端,例如可以安装有视频客户端、浏览器客户端、即时通信客户端或者教育客户端等中的至少一个。终端102可以通过客户端响应于用户触发的内容搜索操作,向服务器104发送内容搜索请求,内容搜索请求中可以携带搜索信息,服务器104可以将搜索信息与内容识别结果进行匹配,当搜索信息与内容识别结果匹配时,将该内容识别结果对应的内容发送至终端102,终端102可以在客户端中展示服务器104返回的内容。
其中,终端102可以但不限于是笔记本电脑、智能手机、智能电视、台式电脑、平板电脑和便携式可穿戴设备,服务器104可以用独立的服务器或者是多个服务器组成的服务器集群或云服务器来实现。可以理解,上述应用场景仅是一种示例,并不构成对本申请实施例提供的内容识别方法的限定,本申请实施例提供的方法还可以应用在其他应用场景中,例如本 申请提供的内容识别方法可以是由终端102或服务器104执行的,还可以是由终端102与服务器104协同执行的,终端102可以将识别出的内容识别结果上传至服务器104,服务器104可以将目标内容与内容识别结果关联存储。
在一些实施例中,如图2所示,提供了一种内容识别方法,以该方法应用于图1中的服务器104为例进行说明,包括以下步骤:
步骤202,确定待识别的目标内容,从目标内容中,获取目标文本以及与目标文本关联的文本关联数据。
其中,内容可以是视频、音频或文本中的任意一种,内容包括文本数据,还可以包括图像数据或音频数据中的至少一种,音频数据例如可以是语音数据。当内容为视频时,内容中的文本数据可以包括视频中的字幕、弹幕、评论或标题中的至少一种,内容中的图像数据可以是视频中的视频帧,内容中的音频数据可以是视频中的配音或音乐等音频数据。当内容为音频数据时,内容中的文本数据可以是音频数据对应的文本数据,例如当内容为歌曲时,内容中的文本数据可以是歌曲对应的歌词,内容中的音频数据可以是音频帧。音频帧是对音频进行分帧得到的,分帧指的是将音频分为多个小段,每个小段为一帧。
目标内容指的是待进行识别的内容,可以是待进行身份识别的内容或待进行场景识别的内容中的至少一种,身份识别指的是对目标内容中出现的人物的身份进行识别,例如可以通过识别目标内容中出现的人物信息确定人物的身份,人物信息可以包括人物的名称或人物的人脸中的至少一种,场景识别指的是对目标内容所属的场景进行识别,例如可以通过识别目标内容中出现的地点确定场景。目标文本指的是目标内容中的文本数据,可以包括目标内容中任意时刻的文本数据,例如,当目标文本为视频时,目标文本可以包括视频中任意时刻或时间段展示的字幕、弹幕、评论或标题中的至少一种。当目标内容为歌曲时,目标文本数据可以是歌曲对应的歌词。
文本关联数据指的是目标内容中与目标文本具有关联关系的数据,例如可以包括目标内容中与目标文本具有关联关系的目标图像数据或目标音频数据中的至少一种。目标图像数据为目标内容中与目标文本具有关联关系的图像数据,目标音频数据为目标内容中与目标文本具有关联关系的音频数据。目标图像数据可以包括一张或多张图像,多张指的是至少两张,目标音频数据可以包括一段或多段音频帧,多段指的是至少两段。
关联关系可以包括时间上的关联关系。例如文本关联数据可以包括目标内容中目标文本出现的时间内所出现的数据,或者包括目标内容中与目标文本出现的时间之间的时间间隔小于时间间隔阈值的时间内所出现的数据。例如当目标内容为目标视频,目标文本为视频的字幕文本时,则文本关联数据可以为与该字幕匹配的视频帧以及语音,例如目标文本与对应的文本关联数据可以为对同一视频场景进行描述的数据。例如目标文本为目标视频中的目标时间出现的字幕时,文本关联数据可以包括目标视频中目标时间出现的数据,例如可以包括目标视频中目标时间出现的视频帧、弹幕或音频帧中的至少一种,或者包括目标视频中与目标时间之间的时间间隔小于时间间隔阈值的时间所出现的数据。时间间隔阈值可以是预先设置的,也可以根据需要设置。其中,目标视频可以是任意的视频,可以是直接拍摄得到的视频,也可以是从拍摄得到的视频中截取得到的视频片段,目标视频可以是任意类型的视频,包括但不限于是广告类的视频、电视剧类的视频或新闻类视频中的至少一种,目标视频还可以是待推送至用户的视频。目标视频中目标时间出现的视频帧可以包括一帧或多帧,目标视频中目标时间出现的音频帧可以包括一帧或多帧,多帧指的是至少两帧。
关联关系还可以包括语义上的关联关系。例如文本关联数据可以包括目标内容中与目标文本的语义匹配的数据,与目标文本的语义匹配的数据可以包括与目标文本的语义一致的数据,或者包括语义与目标文本的语义的差异小于语义差异阈值的数据。语义差异阈值可以是预先设置的,也可以根据需要设置。
具体地,服务器可以获取待进行实体识别的内容,例如待进行实体识别的视频,将待进行实体识别的内容作为待识别的目标内容,利用本申请提供的内容识别方法对待进行实体识别的内容进行识别,得到识别出的实体词,基于识别出的实体词构建知识图谱,或者可以将识别出的实体词作为目标内容对应的标签。当需要对目标内容进行推送时,可以根据目标内容对应的标签,确定目标内容所匹配的用户,将目标内容推送至该匹配用户的终端。
其中,实体(Entity)是指具有特定意义的事物,例如可以包括地名、机构名或者专有名词等中的至少一种。目标文本中可以包括一个或者多个实体,实体词为表示实体的词语。例如假设目标文本为“猴子喜欢吃香蕉”,则目标文本中包括的实体为“猴子”以及“香蕉”,“猴子”为一个实体词,“香蕉”为一个实体词。知识图谱(Knowledge Graph)是一种基于图的数据结构,包括节点(point)和边(Edge),每个节点表示一个实体,每条边为实体与实体之间的关系。
实体识别还可以称为实体词识别或命名实体识别(Named Entity Recognition,NER)。实体词识别是自然语言处理(Natural Language Processing,NLP)领域研究的一个重要方向,进行实体词识别的方法有很多,例如包括基于词典和规则的方法,包括隐马尔科夫模型(Hidden Markov Model,HMM)、最大熵马尔科夫模型(Maximum Entropy Markov Model,MEMM)、条件随机场(Conditional Random Fields,CRF)等机器学习方法,包括循环神经网络(RNN,Recurrent Neural Networks)和长短期记忆网络(LSTM,Long Short-Term Memory)等深度学习模型,以及包括LSTM与CRF结合的识别方法。其中,自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
在一些实施例中,第一终端可以向服务器发送内容推送请求,内容推送请求中可以携带待推送的内容对应的内容标识,内容标识用于唯一识别内容,待推送的内容例如可以是待推送的视频,服务器可以响应于内容推送请求,获取内容推送请求中携带的内容标识对应的待推送的内容,作为待识别的目标内容。例如,第一终端可以展示内容推送界面,内容推送界面中可以展示推送内容获取区域以及内容推送触发控件,推送内容获取区域用于接收待推送的内容对应的内容信息,内容信息中包括一个或者多个内容标识,多个指的是至少两个,内容推送触发控件用于触发第一终端向服务器发送内容推送请求,当第一终端获取到对内容推送触发控件的触发操作时,获取推送内容获取区域中接收的内容信息,向服务器发送携带该内容信息的内容推送请求。服务器可以获取内容信息中包括的各个内容标识分别对应的内容,分别作为各个待识别的目标内容。服务器可以利用本申请提供的内容识别方法对各个待识别的内容进行识别,根据识别的结果确定与各个目标内容分别匹配的用户,将目标内容推送至所匹配的用户的终端。例如,可以将识别的结果与用户的用户画像进行匹配,当匹配成功时,将目标内容推送至该用户的终端
举例说明,当内容为视频时,内容识别方法还可以称为视频识别方法,内容推送请求还可以称为视频推送请求,内容推送界面例如可以为图3中的视频推送界面300,推送内容获取区域例如可以是图3中的区域302,内容推送触发控件例如可以是图3中的“确定”控件 304,当终端获取到对“确定”控件304的点击操作时,可以向服务器发送视频推送请求,服务器根据视频识别方法对视频A和视频B进行识别,确定视频A所匹配的用户1,以及视频B所匹配的用户B,将视频A推送至用户1的终端,将视频B推送至用户2的终端。
步骤204,对目标文本进行特征提取,得到文本提取特征;对文本关联数据进行特征提取,得到关联提取特征。
其中,文本提取特征是对目标文本进行特征提取所得到的特征,文本提取特征可以为目标文本对应的目标分词的目标词向量进行进一步的特征提取得到的特征。目标分词是对目标文本进行切分得到的,切分的粒度可以根据需要设置,例如可以是以字、词语或句子为单位进行切分,得到切分后的文本块,将每个文本块作为一个分词。当以字为单位进行切分时,一个字对应一个文本块,即一个字为一个分词。例如,当目标文本为“我好渴”时,当以字为单位进行分词时,得到的各个分词分别为“我”、“好”以及“渴”。目标词向量为目标分词的向量表示形式。目标文本切分得到的目标分词可以有一个或多个,多个指的是至少两个。
关联提取特征是对文本关联数据进行特征提取所得到的特征。当文本关联数据为目标图像数据时,关联提取特征可以是对目标图像数据进行图像特征提取所得到的目标图像特征。当文本关联数据为目标音频数据时,关联提取特征可以是对目标音频数据进行音频特征提取所得到的目标音频特征。目标图像特征是对目标图像数据进行图像特征提取所提取的图像特征。目标音频特征是对目标音频特征进行音频特征提取所提取的音频特征。文本提取特征与关联提取特征可以是同维度的,例如可以是同维度的向量或同纬度的矩阵。
具体地,服务器可以将目标文本输入到已训练的内容识别模型中的文本特征提取网络,利用文本特征提取网络对目标文本进行特征提取,得到文本提取特征,将文本关联数据输入到已训练的内容识别模型中的关联特征提取网络中,利用关联提取网络对文本关联数据进行特征提取,得到关联提取特征。已训练的内容识别模型用于对内容进行识别,得到内容识别结果,例如用于识别视频的字幕中包括的实体词或视频的场景中的至少一种。已训练的内容识别模型中的关联特征提取网络可以有多个,例如关联特征提取网络可以包括图像特征提取网络或音频特征提取网络中的至少一种,图像特征提取网络用于提取图像的特征,音频特征提取网络用于提取音频的特征。当文本关联数据为目标图像数据时,可以将文本关联数据输入到图像特征提取网络中,将图像特征提取网络提取的图像特征作为关联提取特征。当文本关联数据为目标音频数据时,可以将文本关联数据输入到音频特征提取网络中,将音频特征提取网络提取的音频特征作为关联提取特征。
其中,文本特征提取网络、图像特征提取网络以及音频特征提取网络可以是基于人工智能的神经网络,例如可以是卷积神经网络(Convolutional Neural Networks,CNN),当然也可以是其他类型的神经网络。文本特征提取网络例如可以是Transformer(转换器)网络或基于Transformer的双向编码器(Bidirectional Encoder Representations from Transformers,BERT)网络。图像特征提取网络例如可以是残差卷积网络(ResNet,Residual Network)。音频特征提取网络例如可以是VGG(Visual Geometry Group)卷积网络。VGG代表了牛津大学的视觉几何组(Visual Geometry Group,VGG)。例如,服务器可以对目标图像进行尺度变换,得到尺度变换后的图像,将尺度变换后的图像数据输入到残差卷积网络中进行图像特征提取,将残差卷积网络中的feature map(特征图)提取层输出的特征进行池化(pooling),例如池化为固定尺寸的n*n大小,将池化后的特征作为关联提取特征。n为 大于等于1的正数。
在一些实施例中,对目标文本进行特征提取,得到文本提取特征的步骤包括:对目标文本进行切分,得到目标分词,对目标分词进行向量转化,得到目标分词对应的目标词向量,将目标词向量作为文本提取特征。
在一些实施例中,服务器可以将目标文本输入到基于attention(注意力)的transformer模型中,transformer模型作为文本特征的编码器,可以对目标文本进行编码,得到目标文本中每个字的嵌入式(embedding)表示形式的编码特征,可以将每个字对应的编码特征作为文本提取特征。
在一些实施例中,服务器可以对目标音频数据进行频谱计算,得到目标音频数据对应的频谱图,对目标音频数据对应的频谱图进行特征提取,将提取到的特征作为关联提取特征。例如,服务器可以对目标音频数据对应的频谱图进行声谱计算,得到目标音频数据对应的声谱信息,对目标音频数据的声谱信息进行特征提取得到关联提取特征。例如,服务器可以利用hann(海宁窗)时窗对目标音频数据进行傅里叶变换得到目标音频数据对应的频谱图,通过mel(梅尔)滤波器对频谱图进行计算,得到目标音频数据对应的声谱信息,利用VGG卷积网络对声谱信息进行特征提取,将特征提取得到的音频特征作为关联提取特征。
步骤206,确定关联提取特征与文本提取特征间的特征关联度;特征关联度用于表征目标文本与文本关联数据之间的关联程度。
其中,特征关联度是对关联提取特征与文本提取特征进行关联计算所得到的结果,关联度越大,则表示关联关系越强。特征关联度可以包括图像关联度或音频关联度中的至少一个,图像关联度指的是对目标图像特征与文本提取特征进行关联计算所得到的结果,音频关联度指的是对目音频特征与文本提取特征进行关联计算所得到的结果。关联计算可以为乘积运算或加和运算中的至少一种。
关联提取特征可以包括多个有序的关联特征值,文本提取特征可以包括多个有序的文本特征值,文本特征值指的是文本提取特征中包括的特征值,关联特征值指的是关联提取特征中包括的特征值。关联提取特征与文本提取特征可以是同维度的,例如可以是同维度的向量或矩阵,也就是说,关联提取特征中所包括的关联特征值的数量,与文本提取特征中所包括的文本特征值的数量可以是相同的。例如,假设文本提取特征为向量A=[a1,a2,a3],关联提取特征为向量B=[b1,b2,b3],其中,向量A包括3个元素,分别为a1,a2和a3,向量A中的每个元素为一个文本特征值,同样的,向量B中包括3个元素,分别为b1,b2和b3,向量B中的每个元素为一个关联特征值。
具体地,关联计算可以为乘积运算或加和运算中的至少一种。当关联计算为乘积运算时,可以将关联提取特征中的关联特征值与文本提取特征中对应位置的文本特征值进行乘积运算,得到乘积运算值,对各个乘积运算值进行统计运算,例如对各个乘积运算值进行加和运算或均值运算,得到统计运算结果,基于统计运算结果得到特征关联度,例如可以将统计运算结果作为特征关联度,也可以对统计运算结果进行归一化处理,将归一化处理的结果作为特征关联度。当关联计算为加和运算时,可以将关联提取特征中的关联特征值与文本提取特征中对应位置的文本特征值进行加和运算,得到加和运算值,对各个加和运算值进行统计运算,例如可以对各个加和运算值进行加和运算或均值运算,得到统计运算结果。
在一些实施例中,目标文本切分得到的目标分词有多个,服务器可以获取根据各个目标分词分别得到的文本提取特征,将各个文本提取特征组成矩阵,将组成的矩阵作为文本提取 特征矩阵,文本提取特征矩阵中的每一列为一个文本提取特征。服务器可以对关联提取特征与文本提取特征矩阵进行乘积运算,得到总乘积运算结果,基于总乘积运算结果确定各个文本提取特征分别对应的特征关联度。其中,对关联提取特征与文本提取特征矩阵进行乘积运算,得到总乘积运算结果的步骤可以包括:将文本提取特征矩阵中的文本提取特征分别与关联提取特征进行乘积运算,得到各个文本提取特征分别对应的子乘积运算结果,将各个子乘积运算结果作为总乘积运算结果。其中,将文本提取特征矩阵中的文本提取特征分别与关联提取特征进行乘积运算,得到各个文本提取特征分别对应的子乘积运算结果的步骤可以包括:将文本提取特征中的文本特征值与关联提取特征中对应位置的关联特征值进行乘积运算,得到文本提取特征对应的子乘积运算结果。基于总乘积运算结果确定各个文本提取特征分别对应的特征关联度的步骤可以包括:对总乘积运算结果中的各个子乘积运算结果进行归一化处理,得到各个归一化后的子乘积运算结果,将归一化后的子乘积运算结果作为文本提取特征对应的特征关联度。
在一些实施例中,当文本关联数据为目标图像数据,并且目标图像数据有多个时,服务器可以将各个目标图像数据对应的目标图像特征组成矩阵,将组成的矩阵作为图像特征矩阵,图像特征矩阵中的每一列为一个目标图像特征。服务器可以对目标图像特征矩阵的转置矩阵与文本提取特征矩阵进行矩阵相乘运算,得到第一乘积矩阵,对第一乘积矩阵中的各个矩阵值进行归一化处理,得到归一化后的第一乘积矩阵,基于归一化后的第一乘积矩阵确定各个文本提取特征分别对应的图像关联度,归一化后的第一乘积矩阵中包括各个文本提取特征分别对应的图像关联度。
例如,假设目标文本为“我好渴”,按照字为单元对目标文本进行切分,得到3个目标分词,分别为“我”、“好”以及“渴”,一个分词为一个文字,目标分词对应的目标词向量的维度为2,“我”对应的目标词向量为A=(a1,a2) T,“好”对应的目标词向量为B=(b1,b2) T,“渴”对应的目标词向量为C=(c1,c2) T,将每一个目标词向量作为一个文本提取特征,则文本提取特征矩阵feature text可以表示为公式(1)。假设有3个目标图像数据,这3个目标图像数据可以相同也可以不同,例如为3张图像,各个目标图像数据分别对应的目标图像特征为R=(r1,r2) T,M=(m1,m2) T,N=(n1,n2) T,R、M以及N可以相同也可以不同,则目标图像特征矩阵feature image可以表示为公式(2)。则第一乘积矩阵L1可以表示为公式(3)。
feature_text = [A B C] = [[a1 b1 c1], [a2 b2 c2]]    (1)
feature_image = [R M N] = [[r1 m1 n1], [r2 m2 n2]]    (2)
L1 = [feature_image]^T [feature_text]    (3)
在一些实施例中,对第一乘积矩阵中的各个矩阵值进行归一化处理,得到归一化后的第一乘积矩阵的步骤包括:确定缩放因子,将第一乘积矩阵中的各个矩阵值分别除以缩放因子,得到各个矩阵值对应的缩放值,对各个缩放值进行归一化处理,将各个缩放值组成的矩阵作为归一化后的第一乘积矩阵。其中,缩放因子可以是预先设置的,也可以是根据需要设置的,例如缩放因子可以根据文本提取特征的维度确定,例如缩放因子可以与文本提取特征的维度成正相关关系,例如可以对文本提取特征的维度进行开方计算,得到缩放因子,例如可以对文本提取特征的维度进行开平方处理,将开平方处理后的结果与第一数值的比值,作为缩放因子。第一数值可以是预设设置的。归一化处理所采用的方法,可以是任意的能够将输入数据转化为0到1之间的数的函数,例如可以采用函数softmax进行归一化处理。例如,可以利用公式(4)计算得到归一化后的第一乘积矩阵L2。其中,d为文本提取特征的维度,m为 第一数值。
L2 = softmax( L1 / (√d / m) )    (4)
同样的,当文本关联数据为目标音频数据,且目标音频数据有多个时,服务器可以将各个目标音频数据对应的目标音频特征组成目标音频特征矩阵,目标音频特征矩阵中的每一列为一个目标音频特征,服务器可以对目标音频特征矩阵的转置矩阵与文本提取特征矩阵进行矩阵相乘运算,得到第二乘积矩阵,对第二乘积矩阵中的各个矩阵值进行归一化处理,得到归一化后的第二乘积矩阵,基于归一化后的第二乘积矩阵确定各个文本提取特征分别对应的音频关联度,归一化后的第二乘积矩阵中包括各个文本提取特征分别对应的音频关联度。
步骤208,基于特征关联度对文本提取特征进行调整,得到调整文本特征。
其中,调整文本特征是基于特征关联度对文本提取特征进行调整所得到的特征,调整文本特征可以包括第一调整文本特征或第二调整文本特征中的至少一种。第一调整文本特征指的是基于图像关注强度对文本提取特征进行调整所得到的特征。第二调整文本特征指的是基于音频关注强度对文本提取特征进行调整所得到的特征。
具体地,服务器可以基于特征关联度得到文本提取特征对应的特征关注强度,特征关联度与特征关注强度成正相关关系,基于特征关注强度对文本提取特征进行调整,得到调整文本特征。其中,特征关联度与特征关注强度成正相关关系。当文本提取特征对应的目标分词不同时,文本提取特征对应的特征关注强度可以不同。例如可以将特征关联度作为特征关注强度,或者对特征关联度进行线性运算或非线性运算,将运算的结果作为特征关注强度。线性运算包括加和运算或乘积运算中的至少一种,非线性运算包括指数运算或对数运算中的至少一种。正相关关系指的是:在其他条件不变的情况下,两个变量变动方向相同,一个变量由大到小变化时,另一个变量也由大到小变化。可以理解的是,这里的正相关关系是指变化的方向是一致的,但并不是要求当一个变量有一点变化,另一个变量就必须也变化。例如,可以设置当变量a为10至20时,变量b为100,当变量a为20至30时,变量b为120。这样,a与b的变化方向都是当a变大时,b也变大。但在a为10至20的范围内时,b可以是没有变化的。
特征关注强度可以包括图像关注强度或音频关注强度中的至少一个。图像关注强度基于图像关联度得到,图像关注强度与图像关联度成正相关关系,音频关注强度基于音频关联度得到,音频关注强度与音频关联度成正相关关系。特征关注强度用于反映对特征进行关注的强度,特征关注强度越大,说明在进行内容识别时,需要更加关注该特征。
在一些实施例中,服务器可以对关联提取特征与文本提取特征进行相似度计算,得到特征相似度,将特征相似度作为特征关联度,基于特征关联度得到文本提取特征对应的特征关注强度。例如可以按照余弦相似度计算公式,对关联提取特征与文本提取特征进行相似度计算,将计算得到的余弦相似度作为特征相似度。
在一些实施例中,服务器可以利用特征关注强度对文本提取特征中的各个文本特征值进行调整,得到调整文本特征,例如可以对文本特征值与特征关注强度进行线性运算,得到线性运算后的文本特征值,基于各个线性运算后的文本特征值得到调整文本特征。其中,线性运算可以包括加和运算或乘积运算中的至少一种。例如,服务器可以将特征关注强度分别与文本提取特征中的各个特征值进行乘积运算,得到各个特征值乘积,并按照文本提取特征中 特征值的位置对特征值乘积进行排序,得到特征值序列,将特征值序列作为调整文本特征。文本特征值在文本提取特征中的位置与该文本特征值计算得到的特征值乘积在特征值序列中的位置相同。例如,假设文本提取特征为向量[a1,a2,a3],则a1、a2和a3为文本提取特征中的特征值,当特征关注强度为c时,特征值序列为向量[a1*c,a2*c,a3*c],a1*c、a2*c和a3*c为特征值乘积,a1*c在特征值序列[a1*c,a2*c,a3*c]中位置与a1在文本提取特征[a1,a2,a3]中的位置相同。
在一些实施例中,服务器可以利用归一化后的第一乘积矩阵对文本提取特征矩阵进行调整,得到第一调整文本特征矩阵,归一化后的第一乘积矩阵中包括各个文本提取特征分别对应的图像关联度,第一调整文本特征矩阵中可以包括各个文本提取特征分别对应的第一调整文本特征。例如,服务器可以对归一化后的第一乘积矩阵与文本提取特征矩阵的转置矩阵进行矩阵相乘运算,将相乘后得到的矩阵的转置矩阵作为第一调整文本特征矩阵。例如,可以采用公式(5)计算得到第一调整文本特征矩阵feature fusion1,其中feature fusion1表示第一调整文本特征矩阵,[feature fusion1] T表示feature fusion1的转置矩阵。同样的,服务器可以利用归一化后的第二乘积矩阵与文本提取特征的转置矩阵进行矩阵相乘运算,得到第二调整文本特征矩阵,归一化后的第二乘积矩阵中包括各个文本提取特征分别对应的音频关联度,第二调整文本特征矩阵中可以包括各个文本提取特征分别对应的第二调整文本特征。例如,可以采用公式(6)计算得到第二调整文本特征矩阵feature fusion2,其中,feature audio为目标音频特征矩阵。[feature audio] T表示目标音频特征矩阵对应的转置矩阵。
[feature_fusion1]^T = L2 · [feature_text]^T    (5)
[feature_fusion2]^T = softmax( ([feature_audio]^T [feature_text]) / (√d / m) ) · [feature_text]^T    (6)
步骤210,基于调整文本特征进行识别,得到目标内容对应的内容识别结果。
其中,内容识别结果是基于调整文本特征进行识别所得到的结果。内容识别结果可以根据识别时所采用的内容识别网络确定,内容识别网络不同,所得到的内容识别结果可以相同,也可以不同。内容识别网络可以包括场景识别网络或实体识别网络中的至少一种。场景识别网络用于识别场景,实体识别网络用于识别实体。当内容识别网络为场景识别网络时,内容识别模型还可以称为场景识别模型,当内容识别网络为实体识别网络时,内容识别模型还可以称为实体识别模型或实体词识别模型。
具体地,服务器可以将调整文本特征输入到已训练的内容识别模型的内容识别网络中,利用内容识别模型对调整文本特征进行识别,得到目标内容对应的内容识别结果。例如,当文本提取特征为目标文本中的目标分词对应的特征时,可以将各个目标分词分别对应的调整文本特征按照目标分词在目标文本中的顺序进行排序,将排序得到的序列作为特征序列,服务器可以基于特征序列进行识别,得到目标内容对应的内容识别结果,例如可以将特征序列输入到内容识别模型中的内容识别网络中,得到内容识别结果,例如当内容识别网络为实体识别网络时,可以识别出目标内容中包括的实体词。
如图4所示,展示了一个内容识别模型400,内容识别模型400包括文本特征提取网络、关联特征提取网络、关注强度计算模块、特征调整模块以及内容识别网络。其中,关注强度计算模块用于对关联提取特征与文本提取特征进行关联计算,得到特征关注强度。特征调整 模块用于基于特征关注强度对文本提取特征进行调整,得到调整文本特征,将调整文本特征输入到内容识别网络中,得到目标内容对应的内容识别结果。内容识别模型400中各个网络以及模块可以是通过联合训练得到的。服务器从目标中获取目标文本以及文本关联数据,将目标文本输入到文本特征提取网络中,得到文本提取特征,将关联文本数据输入到关联特征提取网络中,得到关联提取特征,将文本提取特征以及关联提取特征输入到关注强度计算模块,得到特征关注强度,将特征关注强度以及文本提取特征输入到特征调整模块中,得到调整文本特征,将调整文本特征输入到内容识别网络中,得到内容识别结果。
在一些实施例中,服务器也可以对调整文本特征以及文本提取特征进行融合,得到融合文本特征,例如可以对调整文本特征与文本提取特征进行统计运算,例如进行加权计算或均值计算,得到融合文本特征,例如服务器可以确定调整文本特征对应的调整特征权重,基于调整特征权重对调整文本特征以及文本提取特征进行融合,得到融合文本特征,服务器可以基于融合文本特征进行识别,得到目标内容对应的内容识别结果。
在一些实施例中,调整文本特征包括第一调整文本特征以及第二调整文本特征,服务器可以基于第一调整文本特征、第二调整文本特征以及文本提取特征进行融合,例如加权计算或均值计算,得到融合文本特征。例如,调整特征权重可以包括第一调整文本特征对应的第一特征权重以及第二调整文本特征对应的第二特征权重,服务器可以基于第一特征权重对第一调整文本特征以及文本提取特征进行融合,得到第一融合特征,基于第二特征权重对第二调整文本特征以及文本提取特征进行融合,得到第二融合特征,将第一融合特征与第二融合特征进行统计运算,将统计运算的结果作为融合文本特征,例如将第一融合特征与第二融合特征中对应位置的特征值进行加和计算,得到各个加和值,按照特征值在第一融合特征或第二融合特征中的位置,对各个加和值进行排序,将排序后得到的序列作为融合文本特征。
上述内容识别方法中,确定待识别的目标内容,从目标内容中,获取目标文本以及与目标文本关联的文本关联数据,对目标文本进行特征提取,得到文本提取特征,对文本关联数据进行特征提取,得到关联提取特征,确定关联提取特征与文本提取特征间的特征关联度,特征关联度用于表征目标文本与文本关联数据之间的关联程度,基于特征关联度对文本提取特征进行调整,得到调整文本特征,基于调整文本特征进行识别,得到目标内容对应的内容识别结果。由于特征关联度用于表征目标文本与文本关联数据之间的关联程度,从而基于特征关联度对文本提取特征进行调整,实现了自适应的根据文本关联数据与目标文本之间的关联程度进行文本特征的调整,从而在基于调整文本特征进行识别时,使得识别结果受到文本关联数据的影响,目标文本与文本关联数据之间的关联程度越大,文本关联数据对识别结果的影响程度越大,使得在内容识别时更加关注关联程度较大的信息,提高了内容识别的准确度。
在一些实施例中,基于调整文本特征进行识别,得到目标内容对应的内容识别结果包括:对调整文本特征以及文本提取特征进行融合,得到融合文本特征;基于融合文本特征进行识别,得到目标内容对应的内容识别结果。
其中,融合文本特征是将调整文本特征与文本提取特征进行融合所得到的特征。融合文本特征、调整文本特征以及文本提取特征的维度可以相同,例如可以是相同维度的向量或矩阵。
具体地,服务器可以对调整文本特征与文本提取特征进行统计运算,例如均值运算或加和运算,将统计运算的结果作为融合文本特征。例如,服务器可以对文本提取特征进行编码, 得到文本提取特征对应的编码后的特征,作为第一编码特征,可以对调整文本特征进行编码,得到调整文本特征对应的编码后的特征,作为第二编码特征,对第一编码特征以及第二编码特征进行统计运算,例如均值运算或加和运算,将运算的结果作为融合文本特征。
在一些实施例中,服务器可以将融合文本特征输入到已训练的内容识别模型的内容识别网络中,利用内容识别网络对融合文本特征进行识别,得到目标内容对应的内容识别结果。
本实施例中,对调整文本特征以及文本提取特征进行融合,得到融合文本特征,基于融合文本特征进行识别,得到目标内容对应的内容识别结果,可以提高内容识别的准确度。
在一些实施例中,对调整文本特征以及文本提取特征进行融合,得到融合文本特征包括:对文本提取特征进行编码,得到第一编码特征,对调整文本特征进行编码,得到第二编码特征;将第一编码特征与第二编码特征进行融合,得到融合编码特征;基于融合编码特征得到调整文本特征对应的调整特征权重;基于调整特征权重对调整文本特征以及文本提取特征进行融合,得到融合文本特征。
其中,第一编码特征是对文本提取特征进行编码所得到的特征。第二编码特征是对调整文本特征进行编码所得到的特征。融合编码特征是将第一编码特征与第二编码特征进行融合所得到的特征。调整特征权重基于融合编码特征得到。
具体地,内容识别模型中还可以包括第一编码器、第二编码器以及特征融合模块,特征融合模块用于对第一编码特征与第二编码特征进行融合,得到融合编码特征。服务器可以将文本提取特征输入到已训练的内容识别模型中的第一编码器进行编码,得到第一编码特征,将调整文本特征输入到已训练的内容识别模型中的第二编码器进行编码,得到第二编码特征,将第一编码特征与第二编码特征进行融合,例如可以将第一编码特征以及第二编码特征输入到特征融合模块中,得到融合编码特征。其中,第一编码器以及第二编码器可以是基于人工智能的神经网络,内容识别模型中的各个网络以及模块可以是联合训练得到的,例如第一编码器与第二编码器是联合训练得到的。
在一些实施例中,服务器可以对第一编码特征与第二编码特征进行统计运算,得到编码统计特征,例如,可以将第一编码特征与第二编码特征进行相加运算,将相加运算的结果作为融合编码特征,或者对第一编码特征与第二编码特征进行均值运算,将计算得到的均值作为融合编码特征。服务器可以基于编码统计特征确定融合编码特征,例如可以将编码统计特征作为融合编码特征。
在一些实施例中,服务器可以对融合编码特征进行归一化处理,将归一化所得到的结果作为调整文本特征对应的调整特征权重。例如,已训练的内容识别模型中可以包括激活层,激活层可以将数据进行转化为0与1之间的数据,对融合编码特征进行归一化处理,将归一化所得到的结果作为调整特征权重的步骤可以包括:将融合编码特征输入到内容识别模型的激活层中进行激活处理,将激活处理的结果作为调整文本特征对应的调整特征权重。
在一些实施例中,服务器可以将调整特征权重与调整文本特征进行乘积计算,得到计算后的调整文本特征,将计算后的调整文本特征与文本提取特征进行统计运算,例如进行加和运算或均值运算,将统计运算的结果作为融合文本特征。
在一些实施例中,服务器可以确定文本提取特征对应的文本特征权重,例如可以获取预设权重,将预设权重作为文本特征权重,预设权重可以是根据需要预先设置的权重。文本特征权重也可以是根据调整特征权重确定的,例如调整特征权重可以与文本特征权重成负相关关系,调整特征权重与文本特征权重的和可以为预设数值,预设数值可以根据需要预先设置 的,例如可以是1,例如可以将预设数值减去文本特征权重所得到的结果作为文本特征权重,例如,当调整特征权重为0.3时,文本特征权重可以为0.7。其中,预设数值大于文本特征权重,并且预设数值大于调整特征权重。负相关关系指的是:在其他条件不变的情况下,两个变量变动方向相反,一个变量由大到小变化时,另一个变量由小到大变化。可以理解的是,这里的负相关关系是指变化的方向是相反的,但并不是要求当一个变量有一点变化,另一个变量就必须也变化。
在一些实施例中,第一编码器可以包括第一文本编码器或第二文本编码器中的至少一个,第二编码器可以包括图像编码器或音频编码器中的至少一个。第一编码特征可以包括第一文本特征或第二文本特征中的至少一个,第一文本特征是利用第一文本编码器对文本提取特征进行编码所得到的特征,第二文本特征是利用第二文本编码器对文本提取特征进行编码所得到的特征。第二编码特征可以包括图像编码特征或音频编码特征中的至少一个,图像编码特征是利用图像编码器对第一调整文本特征进行编码所得到的特征,音频编码特征是利用音频编码特征对第二调整文本特征进行编码所得到的特征。融合编码特征可以包括文本图像编码特征或文本音频编码特征中的至少一种。文本图像编码特征是将第一文本编码特征与图像编码特征进行融合所得到的特征。文本音频编码特征是将第二文本编码特征与音频编码特征进行融合所得到的特征。例如,当调整文本特征为第一调整文本特征时,服务器可以将文本提取特征输入到第一文本编码器进行编码,得到第一文本特征,将第一调整文本特征输入到图像编码器进行编码,得到图像编码特征,将第一文本特征与图像编码特征进行融合,得到文本图像编码特征。当调整文本特征为第二调整文本特征时,服务器可以将文本提取特征输入到第二文本编码器进行编码,得到第二文本特征,将第二调整文本特征输入到音频编码器进行编码,得到音频编码特征,将第二文本特征与音频编码特征进行融合,得到文本音频编码特征,可以将文本图像编码特征以及文本音频编码特征作为融合编码特征。其中,第一文本编码器与第二文本编码器可以为同一编码器,也可以为不同的编码器,图像编码器与音频编码器可以为同一编码器,也可以为不同的编码器。
本实施例中,对文本提取特征进行编码,得到第一编码特征,对调整文本特征进行编码,得到第二编码特征,将第一编码特征与第二编码特征进行融合,得到融合编码特征,基于融合编码特征得到调整文本特征对应的调整特征权重,基于调整特征权重对调整文本特征以及文本提取特征进行融合,得到融合文本特征,从而融合文本特征既能反映文本提取特征又能反映调整文本特征,提高了融合文本特征的表达能力,当基于调整文本特征进行识别时,可以提高识别的准确度。
在一些实施例中,第一编码特征是通过已训练的内容识别模型中的第一编码器编码得到的,第二编码特征是通过内容识别模型中的第二编码器编码得到的,基于融合编码特征得到调整文本特征对应的调整特征权重包括:将融合编码特征输入到内容识别模型中的目标激活层进行激活处理,得到目标激活值,将目标激活值作为调整文本特征对应的调整特征权重,激活层为第一编码器与第二编码器的共享激活层。
其中,激活层用于将数据转换为0与1之间的数据,可以通过激活函数实现,激活函数包括但不限于是Sigmoid函数、tanh函数或Relu函数中的至少一种。目标激活层是已训练的内容识别模型中的激活层,是第一编码器与第二编码器共享的激活层,即目标激活层集可以接收第一编码器的输出数据,又可以接收第二编码器的输出的数据。目标激活值是利用目标激活层对融合编码特征进行激活处理所得到的结果,目标激活值与融合编码特征的维度可 以相同,例如可以是同维度的向量或矩阵。如图5所示,展示了一个内容识别模块500,内容识别模块500中包括关联特征提取网络、关注强度计算模块、文本特征提取网络、特征调整模块、第一编码器、第二编码器、特征融合模块、目标激活层、融合文本特征生成模块以及内容识别网络,特征融合模块用于将第一编码特征与第二编码特征进行融合得到融合编码特征,融合文本特征生成用于基于调整特征权重对调整文本特征与文本提取特征进行融合,得到融合文本特征。
具体地,目标激活层可以包括第一文本编码器与图像编码器共享的第一激活层、第二文本编码器与音频编码器共享的第二激活层中的至少一个。目标激活值可以包括对文本图像编码特征进行激活得到的第一激活值、或对文本音频编码特征进行激活得到的第二激活值中的至少一个,当融合编码特征为文本图像编码特征时,服务器可以将文本图像编码特征输入到第一激活层进行激活,得到第一激活值,将第一激活值作为第一调整文本特征对应的第一特征权重;当融合编码特征为文本音频编码特征时,服务器可以将文本音频编码特征输入到第二激活层进行激活,得到第二激活值,将第二激活值作为第二调整文本特征对应的第二特征权重,将第一特征权重以及第二特征权重作为调整特征权重。
在一些实施例中,当调整文本特征为第一调整文本特征,并且第一调整文本特征有多个时,服务器可以将第一调整文本特征矩阵与文本提取特征矩阵进行矩阵融合,例如可以将文本提取特征矩阵输入到第一文本编码器进行编码,得到第一矩阵编码特征,第一调整文本特征矩阵输入到图像编码器进行编码,得到第二矩阵编码特征,对第一矩阵编码特征与第二矩阵编码特征进行统计运算,得到第一矩阵特征统计结果,对第一矩阵特征统计结果进行归一化处理,例如可以将第一矩阵特征统计结果输入至第一激活层进行激活,得到归一化后的第一矩阵特征统计结果,归一化后的第一矩阵特征统计结果中可以包括各个第一调整文本特征分别对应的第一特征权重。例如,可以采用公式(7)计算得到归一化后的第一矩阵特征统计结果gate 1。其中,gate 1表示归一化后的第一矩阵特征统计结果,sigmoid为激活函数,
W_text1为第一文本编码器的模型参数，W_image为图像编码器的模型参数。
gate_1 = sigmoid( W_text1 · feature_text + W_image · feature_fusion1 )    (7)
在一些实施例中,当调整文本特征为第二调整文本特征,并且第二调整文本特征有多个时,服务器可以将第二调整文本特征矩阵与文本提取特征矩阵进行矩阵融合,例如可以将文本提取特征矩阵输入到第二文本编码器进行编码,得到第三矩阵编码特征,将第二调整文本特征矩阵输入到音频编码器进行编码,得到第四矩阵编码特征,对第三矩阵编码特征与第四矩阵编码特征进行统计运算,得到第二矩阵特征统计结果,对第二矩阵特征统计结果进行归一化处理,例如可以将第一矩阵特征统计结果输入至第二激活层进行激活,得到归一化后的第二矩阵特征统计结果,归一化后的第二矩阵特征统计结果中可以包括各个第二调整文本特征分别对应的第二特征权重。例如,可以采用公式(8)计算得到归一化后的第二矩阵特征统计结果gate 2。其中,gate 2表示归一化后的第二矩阵特征统计结果,
W_text2为第二文本编码器的模型参数，W_audio为音频编码器的模型参数。
gate_2 = sigmoid( W_text2 · feature_text + W_audio · feature_fusion2 )    (8)
本实施例中,将融合编码特征输入到内容识别模型中的目标激活层进行激活处理,得到目标激活值,将目标激活值作为调整文本特征对应的调整特征权重,使得调整特征权重为归一化后的值,提高了调整特征权重的合理性。
在一些实施例中,基于调整特征权重对调整文本特征以及文本提取特征进行融合,得到融合文本特征包括:基于调整特征权重得到文本提取特征对应的文本特征权重;将调整特征权重与调整文本特征进行乘积计算,得到计算后的调整文本特征;将文本特征权重与文本提取特征进行乘积计算,得到计算后的文本提取特征;将计算后的调整文本特征与计算后的文本提取特征进行相加,得到融合文本特征。
其中,文本特征权重可以根据调整特征权重确定,文本特征权重可以与调整特征权重成负相关关系,例如可以将预设数值减去文本特征权重所得到的结果作为文本特征权重。预设数值大于文本特征权重,并且预设数值大于调整特征权重。
具体地,服务器可以将调整特征权重与调整文本特征相乘后的结果,作为计算后的调整文本特征,可以将文本特征权重与文本提取特征相乘后的结果,作为计算后的文本提取特征,可以将计算后的调整文本特征与计算后的文本提取特征进行相加所得到的结果,作为融合文本特征。
在一些实施例中,调整特征权重包括第一特征权重以及第二特征权重,文本特征权重可以包括基于第一特征权重得到的第一文本权重以及基于第二特征权重得到的第二文本权重。第一文本权重与第一特征权重成负相关关系。第二文本权重与第二特征权重成负相关关系。服务器可以利用第一特征权重、第二特征权重、第一文本权重以及第二文本权重,对第一调整文本特征、第二调整文本特征以及文本提取特征进行加权计算,将加权计算的结果作为融合文本特征。例如,服务器可以利用第一特征权重以及第一文本权重对第一调整文本特征以及文本提取特征进行加权计算,得到第一加权值,利用第二特征权重以及第二文本权重对第二调整文本特征以及文本提取特征进行加权计算,得到第二加权值,将第一加权值与第二加权值相加的结果作为融合文本特征。具体地,服务器可以将第一文本权重与文本提取特征进行乘积计算,得到第一乘积值,将第一特征权重与第一调整文本特征进行乘积计算,得到第二乘积值,将第二文本权重与文本提取特征进行乘积计算,得到第三乘积值,将第二特征权重与第二调整文本特征进行乘积计算,得到第四乘积值,将第一乘积值、第二乘积值、第三乘积值以及第四乘积值进行相加,将相加后的结果作为融合文本特征。
在一些实施例中,服务器可以利用归一化后的第一矩阵特征统计结果以及归一化后的第二矩阵特征统计结果,对第一调整文本特征矩阵以及第二调整文本特征矩阵进行加权计算,得到融合文本特征矩阵,融合文本特征矩阵中可以包括各个文本提取特征分别对应的融合文本特征。例如可以利用公式(9)计算得到融合文本特征矩阵output。其中,output指的是融合文本特征矩阵。
output = feature_fusion1 · gate_1 + (1 - gate_1) · feature_text + feature_fusion2 · gate_2 + (1 - gate_2) · feature_text    (9)
本实施例中,将调整特征权重与调整文本特征进行乘积计算,得到计算后的调整文本特征,将文本特征权重与文本提取特征进行乘积计算,得到计算后的文本提取特征,将计算后的调整文本特征与计算后的文本提取特征进行相加,得到融合文本特征,由于文本特征权重是基于调整特征权重得到的,故提高了文本特征权重的准确度,从而提高了融合文本特征的准确度。
在一些实施例中,目标内容为目标视频;从目标内容中,获取目标文本以及与目标文本关联的文本关联数据包括:获取目标视频中目标时间对应的文本,得到目标文本;获取目标视频中目标时间对应的视频相关数据,将视频相关数据作为与目标文本关联的文本关联数据,视频相关数据包括视频帧或者音频帧的至少一种。
其中,视频帧是组成视频的最小单元,视频有多个图像组成,视频中的一张图像称为一帧,也可以称为视频帧。目标视频可以是任意的视频,可以是直接拍摄得到的视频,也可以是从拍摄得到的视频中截取得到的视频片段,目标视频可以是任意类型的视频,包括但不限于是广告类的视频、电视剧类的视频或新闻类视频中的至少一种,目标视频还可以是待推送至用户的视频。目标时间可以是目标视频的起始时间点到终止时间点中的任意的时间点或时间段。视频相关数据指的是目标视频中目标时间展示或播放的任意的数据,可以包括目标视频中目标时间展示的视频帧或目标时间播放的音频帧中的至少一种,目标时间展示的视频帧可以包括一帧或多帧,目标时间播放的音频帧可以包括一帧或多帧。
具体地,服务器可以获取目标视频中目标时间展示的文本,作为目标文本,例如目标时间展示的字幕、弹幕或评论中的至少一种,作为目标文本。服务器可以获取目标视频中目标时间展示的视频帧或目标时间播放的音频帧中的至少一个,作为视频相关数据。
本实施例中,获取目标视频中目标时间对应的文本,得到目标文本,获取目标视频中目标时间对应的视频相关数据,将视频相关数据作为与目标文本关联的文本关联数据,由于视频相关数据包括视频帧或者音频帧的至少一种,从而获取了文本数据以及除文本数据之外的图像数据或音频数据,从而可以在文本数据的基础上结合图像数据或音频数据对视频进行识别,从而有利于提高识别的准确度。
在一些实施例中,调整文本特征包括根据视频帧调整得到的第一调整文本特征;基于调整文本特征进行识别,得到目标内容对应的内容识别结果包括:将第一调整文本特征以及文本提取特征进行融合,得到融合文本特征;基于融合文本特征进行识别,得到目标内容对应的内容识别结果。
具体地,服务器可以从文本关联数据中获取视频帧,对获取的视频帧进行特征提取,得到目标图像特征,基于目标图像特征得到第一调整文本特征,从文本关联数据中获取音频帧,对获取的音频帧进行特征提取,得到目标音频特征,基于目标音频特征得到第二调整文本特征。
在一些实施例中,服务器可以将第一调整文本特征以及文本提取特征进行加权计算,将加权计算的结果作为融合文本特征。例如,服务器可以将第一文本权重与文本提取特征进行乘积计算,得到第一乘积值,将第一特征权重与第一调整文本特征进行乘积计算,得到第二乘积值,将第二文本权重与文本提取特征进行乘积计算,得到第三乘积值,将第一乘积值、第二乘积值以及第三乘积值进行相加,将相加后的结果作为融合文本特征。
本实施例中,将第一调整文本特征以及文本提取特征进行融合,得到融合文本特征,从而使得融合文本特征是基于第一调整文本特征以及文本提取特征这两种特征得到的,从而提高了融合文本特征的特征丰富程度,从而当基于融合文本特征进行识别时,可以提高识别的准确度。
在一些实施例中,调整文本特征还包括根据音频帧调整得到的第二调整文本特征,将第一调整文本特征以及文本提取特征进行融合,得到融合文本特征包括:将第一调整文本特征、第二调整文本特征以及文本提取特征进行融合,得到融合文本特征。
具体地,服务器可以从文本关联数据中获取音频帧,对获取的音频帧进行特征提取,得到目标音频特征,基于目标音频特征得到第二调整文本特征。
在一些实施例中,服务器可以将第一调整文本特征、第二调整文本特征以及文本提取特征进行加权计算,将加权计算的结果作为融合文本特征。例如,服务器可以将第一文本权重 与文本提取特征进行乘积计算,得到第一乘积值,将第一特征权重与第一调整文本特征进行乘积计算,得到第二乘积值,将第二文本权重与文本提取特征进行乘积计算,得到第三乘积值,将第二特征权重与第二调整文本特征进行乘积计算,得到第四乘积值,将第一乘积值、第二乘积值、第三乘积值以及第四乘积值进行相加,将相加后的结果作为融合文本特征。
本实施例中,将第一调整文本特征、第二调整文本特征以及文本提取特征进行融合,得到融合文本特征,从而使得融合文本特征是基于第一调整文本特征、第二调整文本特征以及文本提取特征这三种特征得到的,从而提高了融合文本特征的特征丰富程度,从而当基于融合文本特征进行识别时,可以提高识别的准确度。
在一些实施例中,基于特征关联度对文本提取特征进行调整,得到调整文本特征包括:基于特征关联度得到文本提取特征对应的特征关注强度,特征关联度与特征关注强度成正相关关系;基于特征关注强度对文本提取特征进行调整,得到调整文本特征。
本实施例中,由于特征关联度与特征关注强度成正相关关系,从而基于特征关注强度对文本提取特征进行调整,实现了自适应的根据文本关联数据与目标文本之间的关联程度进行文本特征的调整,从而在基于调整文本特征进行识别时,使得识别结果受到文本关联数据的影响,目标文本与文本关联数据之间的关联程度越大,文本关联数据对识别结果的影响程度越大,使得在内容识别时更加关注关联程度较大的信息,提高了内容识别的准确度。
在一些实施例中,基于特征关注强度对文本提取特征进行调整,得到调整文本特征包括:将特征关注强度与文本提取特征的各个特征值相乘,得到特征值乘积;按照文本提取特征中特征值的位置对特征值乘积进行排列,将排列得到的特征值序列作为调整文本特征。
其中,特征值乘积指的是文本特征值与特征关注强度相乘所得到的结果。特征值序列是按照文本特征值在文本提取特征中的位置对文本特征值计算得到的特征值乘积进行排列得到的。即文本特征值在文本提取特征中的位置与该文本特征值计算得到的特征值乘积在特征值序列中的位置相同。
本实施例中,将特征关注强度与文本提取特征的各个特征值相乘,得到特征值乘积,从而特征值乘积可以反映文本关联数据对文本特征值的关注程度,按照文本提取特征中特征值的位置对特征值乘积进行排列,将排列得到的特征值序列作为调整文本特征,从而调整文本特征可以反映文本关联数据对文本提取特征的关注程度。
在一些实施例中,文本提取特征为目标文本中的分词对应的特征;各个调整文本特征按照分词在目标文本的顺序形成特征序列;基于调整文本特征进行识别,得到目标内容对应的内容识别结果包括:基于特征序列得到各个分词相对于命名实体的位置关系;基于各个位置关系从目标文本中获取目标命名实体,将目标命名实体作为目标内容对应的内容识别结果。
其中,特征序列是按照目标分词在目标文本中的顺序,对目标分词对应的调整文本特征进行排序所得到的序列,目标分词指的是目标文本中的分词。命名实体(named entity)指的是以名称为标识的实体,可以包括人名、地名或机构名中的至少一种,命名实体例如可以是“张三”、“A地区”或“B机构”。
相对于命名实体的位置关系可以包括命名实体位置或非命名实体位置的至少一种。命名实体位置指的是命名实体所在的位置,可以包括命名实体的起始位置、命名实体的结束位置或命名实体的中间位置的至少一个。命名实体的中间位置可以包括命名实体的起始位置与结束位置之间的各个位置。非命名实体位置指的是命名实体之外的分词所处的位置。
具体地,服务器可以基于特征序列,确定各个目标分词相对于命名实体的位置关系,得 到各个目标分词分别对应的位置关系,从各个位置关系中,获取位置关系属于命名实体位置的位置关系对应的目标分词,作为实体分词,基于各个实体分词得到目标命名实体。
在一些实施例中,已训练的内容识别模型可以包括实体识别网络,服务器可以将特征序列输入到实体识别网络中,利用实体识别网络对特征序列中的各个调整文本特征进行位置识别,例如实体识别网络可以基于调整文本特征确定该调整文本特征对应的目标分词处于命名实体位置的概率,得到命名实体概率,将命名实体概率大于命名实体概率阈值的目标分词的位置关系确定为命名实体位置。命名实体概率阈值可以根据需要设置。实体识别网络还可以基于调整文本特征,确定该调整文本特征对应的目标分词处于命名实体的起始位置的概率,得到起始概率,将起始概率大于起始概率阈值的目标分词的位置关系确定为命名实体的起始位置。起始概率阈值可以根据需要设置。实体识别网络还可以基于调整文本特征,确定该调整文本特征对应的目标分词处于命名实体的结束位置的概率,得到结束概率,将结束概率大于结束概率阈值的目标分词的位置关系确定为命名实体的结束位置。结束概率阈值可以根据需要设置。
本实施例中,基于特征序列得到各个分词相对于命名实体的位置关系,基于各个位置关系从目标文本中获取目标命名实体,将目标命名实体作为目标内容对应的内容识别结果,从而可以基于调整文本特征形成的特征序列进行内容识别,提高了内容识别的准确度。
在一些实施例中,基于各个位置关系从目标文本中获取目标命名实体包括:获取位置关系为命名实体的起始位置的分词,作为命名实体起始词;将命名实体起始词对应的后向分词中,位置关系为处于命名实体的内部的分词作为命名实体组成词;将命名实体起始词与命名实体组成词进行组合,得到目标命名实体。
其中,命名实体起始词指的是处于命名实体的起始位置的分词,命名实体起始词对应的后向分词指的是目标文本中排序在命名实体起始词之后的分词。命名实体组成词指的是目标文本中处于命名实体的内部的分词,命名实体的内部包括命名实体的结束位置以及命名实体的中间位置,命名实体词的结束位置与命名实体的中间位置可以为同一位置。例如,当分词为单个字时,假设目标文本为“张三喜欢花”,由于则命名实体为“张三”,为两个字,由于“张”处于命名实体的起始位置,故命名实体起始词为“张”,命名实体起始词对应的后向分词包括“三”、“喜”、“欢”以及“花”,由于“三”处于命名实体的内部,故命名实体组成词为“三”。目标命名实体是目标文本中包括的实体,是将命名实体起始词与对应的命名实体组成词组合得到的。目标文本中可以包括一个或多个目标命名实体,多个指的是至少两个。例如假设目标文本为“张三喜欢李四”,则目标文本包括2个目标命名实体,分别为“张三”以及“李四”。
具体地,服务器可以基于各个目标分词对应的位置关系,从目标文本中获取位置关系为命名实体的起始位置的分词,作为命名实体起始词,按照从前到后的排列顺序,依次从命名实体起始词的各个后向分词中获取一个后向分词,作为当前后向分词,当当前后向分词的位置关系为命名实体的内部时,将当前后向分词作为与命名实体起始词对应的命名实体组成词,当当前后向分词的位置关系为命名实体的外部时,停止从命名实体起始词的各个后向分词中获取后向分词,按照命名实体起始词以及命名实体组成词在目标文本中的位置,从前到后对命名实体起始词以及命名实体组成词进行排序,得到目标命名实体。例如,由于“张”的位置在“三”之前,因此排序得到的是“张三”,即“张三”为目标命名实体。
本实施例中,获取位置关系为命名实体的起始位置的分词,作为命名实体起始词,将命 名实体起始词对应的后向分词中,位置关系为处于命名实体的内部的分词作为命名实体组成词,将命名实体起始词与命名实体组成词进行组合,得到目标命名实体,从而可以基于调整文本特征形成的特征序列进行实体识别,提高了实体识别的准确度。
在一些实施例中,基于特征序列得到各个分词相对于命名实体的位置关系包括:基于特征序列得到各个分词相对于命名实体的位置关系以及分词对应的实体类型;将命名实体起始词对应的后向分词中,位置关系为处于命名实体的内部的分词作为命名实体组成词包括:将命名实体起始词对应的后向分词中,位置关系为处于命名实体的内部,且实体类型与命名实体起始词的类型相同的分词作为命名实体组成词。
其中,实体类型指的是命名实体的类型,包括人名、机构名或地名中的至少一种类型。命名实体起始词以及命名实体组成词可以分别对应有实体类型。
具体地,服务器可以对特征序列中的各个特征进行实体类型的识别,确定特征序列中的各个特征分别对应的实体类型,按照从前到后的排列顺序,依次从命名实体起始词的各个后向分词中获取一个后向分词,作为当前后向分词,当当前后向分词的位置关系为命名实体的内部,并且实体类型与命名实体起始词的实体类型相同时时,将当前后向分词作为与命名实体起始词对应的命名实体组成词,当当前后向分词的位置关系为命名实体的外部或实体类型与命名实体起始词的实体类型不同时,停止从命名实体起始词的各个后向分词中获取后向分词。
在一些实施例中,文本提取特征为目标文本中的目标分词对应的特征;各个目标分词对应的融合文本特征按照目标分词在目标文本的顺序形成融合特征序列;基于调整文本特征进行识别,得到目标内容对应的内容识别结果包括:基于融合特征序列得到各个分词相对于命名实体的位置关系;基于各个位置关系从目标文本中获取目标命名实体,将目标命名实体作为目标内容对应的内容识别结果。
在一些实施例中,可以将融合特征序列输入到实体识别网络中,实体识别网络对融合特征序列中的各个融合文本特征进行实体词的识别。实体识别网络例如可以是图6中的CRF网络,图6中,目标文本为“张小花爱笑”,融合特征序列为[h1,h2,h3,h4,h5],h1为分词“张”对应的融合文本特征,h2为分词“小”对应的融合文本特征,h3为分词“花”对应的融合文本特征,h4为分词“爱”对应的融合文本特征,h5为分词“笑”对应的融合文本特征。将融合特征序列输入到CRF网络中进行实体识别,CRF网络可以基于融合特征序列中的各个特征对目标文本中的分分词进行打分,得到各个分词对应的分数,可以利用softmax对分词的分数进行归一化处理,得到分词对应的概率分布。利用CRF网络识别“张小花爱笑”中的人名所处的位置,CRF网络可以采用“BIO”的标注方法对“张小花爱笑”中的各个目标分词进行标注,得到各个融合文本特征对应的标注,其中,B为begin的缩写,表示实体词开始,I为inside的缩写,表示实体词内部,O为outside的缩写,表示实体词外部,如图所示“张小花爱笑”的标注为“B-PER,I-PER,I-PER,O,O”,其中,“PER”为表示实体词类型为人名。从“B-PER,I-PER,I-PER,O,O”可以确定“张小花爱笑”中的“张小花”为目标命名实体。
本实施例中,基于特征序列得到各个分词相对于命名实体的位置关系以及分词对应的实体类型,将命名实体起始词对应的后向分词中,位置关系为处于命名实体的内部,且实体类型与命名实体起始词的类型相同的分词作为命名实体组成词,提高了实体识别的准确度。
在一些实施例中,确定关联提取特征与文本提取特征间的特征关联度包括:将关联提取 特征中的关联特征值与文本提取特征中对应位置的文本特征值进行乘积运算,得到乘积运算值;对乘积运算值进行统计,得到关联提取特征与文本提取特征之间的特征关联度。
具体地,关联提取特征可以与文本提取特征为同维度的向量或矩阵,服务器可以从关联提取特征中获取目标排序处的关联特征值,作为第一目标特征值,从文本提取特征中获取目标位置处的文本特征值,作为第二目标特征值,则第二目标特征值与第二目标特征值具有位置对应关系,服务器可以对第一目标特征值与第二目标特征值的乘积运算,得到目标位置处的文本特征值与关联特征值计算得到的乘积运算值。目标位置可以是关联提取特征或文本提取特征中的任一位置,例如当关联提取特征为向量时,目标位置可以是任一排序位置,例如第一位。
在一些实施例中,服务器可以对各个乘积运算值进行统计,得到乘积统计值,对乘积统计值进行归一化处理,将归一化处理的结果作为特征关联度。服务器可以将特征关联度作为文本提取特征对应的特征关注强度。
本实施例中,将关联提取特征中的关联特征值与文本提取特征中对应位置的文本特征值进行乘积运算,得到乘积运算值,对各个乘积运算值进行统计运算,得到关联提取特征与文本提取特征之间的特征关联度,从而特征关联度可以准确的反映文本关联数据与目标文本之间的关联关系,从而当基于特征关联度对文本提取特征进行调整时,可以提高调整的准确度。
在一些实施例中,提供了一种内容识别方法,包括以下步骤:
步骤A、确定待识别的目标视频,从目标视频中,获取目标文本以及与目标文本关联的目标图像数据以及目标音频数据。
步骤B、对目标文本进行特征提取,得到文本提取特征,对目标图像数据进行特征提取,得到目标图像特征,对目标音频数据进行特征提取,得到目标音频特征。
具体地,如图7所示,展示了一个已训练的实体识别网络700,服务器可以利用已训练的实体识别模型中的文本特征提取网络对目标文本进行特征提取,得到文本提取特征,同样的可以利用图像特征提取网络对目标图像数据进行特征提取,得到目标图像特征,利用音频特征提取网络对目标音频数据进行特征提取,得到目标音频特征。
步骤C、对目标图像特征与文本提取特征进行关联计算,得到图像关联度,将图像关联度作为图像关注强度,对目标音频特征与文本提取特征进行关联计算,得到音频关联度,将音频关联度作为音频关注强度。
具体地,如图7所示,可以利用图像关注强度计算模块,对目标图像特征与文本提取特征进行关联计算,得到图像关注强度,利用音频关注强度计算模块,对目标音频特征与文本提取特征进行关联计算得到音频关注强度。图像关注强度计算模块中包括乘积运算单元以及归一化处理单元,图像关注强度计算模块可以通过乘积运算单元对目标图像特征与文本提取特征进行乘积运算,将运算的结果输入到归一化运算单元进行归一化处理,得到图像关注强度。音频关注强度计算模块计算得到音频关注强度的过程可以参考图像关注强度计算模块。
步骤D、基于图像关注强度对文本提取特征进行调整,得到第一调整文本特征,基于音频关注强度对文本提取特征进行调整,得到第二调整文本特征。
具体地,如图7所示,可以将图像关注强度以及文本提取特征输入第一特征调整模块,第一特征调整模块可以将图像关注强度与文本提取特征的各个特征值相乘,按照文本提取特征中特征值的位置对相乘得到的各个值进行排列得到第一调整文本特征。同样的,可以利用第二特征调整模块得到第二调整文本特征。
步骤E、确定第一调整文本特征对应的第一特征权重,确定第二调整文本特征对应的第二特征权重。
具体地,如图7所示,服务器可以将第一调整文本特征输入到图像编码器进行编码,得到图像编码特征,将文本提取特征输入到第一文本编码器进行编码,得到第一文本特征,将第一文本特征以及图像编码特征输入到第一特征融合模块中,得到文本图像编码特征。服务器可以将第二调整文本特征输入到音频编码器进行编码,得到音频编码特征,将文本提取特征输入到第二文本编码器进行编码,得到第二文本特征,将第二文本特征以及音频编码特征输入到第二特征融合模块中,得到文本图像编码特征,将文本图像编码特征输入到第一激活层进行激活,得到第一调整文本特征对应的第一特征权重,将文本音频编码特征输入到第二激活层进行激活,得到第二调整文本特征对应的第二特征权重。
步骤F、基于第一特征权重对第一调整文本特征以及文本提取特征进行融合,得到第一融合特征,基于第二特征权重对第二调整文本特征以及文本提取特征进行融合,得到第二融合特征,将第一融合特征与第二融合特征进行统计运算,将统计运算的结果作为融合文本特征。
具体地,如图7所示,服务器可以将第一特征权重、第一调整文本特征以及文本提取特征输入到第一融合文本特征生成模块中,得到第一融合特征,将第二特征权重、第二调整文本特征以及文本提取特征输入到第二融合文本特征生成模块中,得到第二融合特征。
步骤G、基于融合文本特征进行命名实体识别,得到目标内容对应的目标命名实体,将目标命名实体作为目标内容对应的内容识别结果。
例如,如图8所示,目标视频为“张小花”的视频,目标文本为“张小花”的视频中的字幕“张小花爱笑”,目标图像数据为“张小花”的视频中与字幕“张小花爱笑”在时间上关联的图像,即包括“张小花”的图像,目标音频数据为“张小花”的视频中与字幕“张小花爱笑”在时间上关联的音频,即包括“张小花”的音频,将字幕“张小花爱笑”、包括“张小花”的图像以及包括“张小花”的音频输入到实体识别模型中,可以确定实体词“张小花”。
上述的内容识别方法在进行实体识别时,除了利用视频中的文本信息,例如视频中的标题、字幕或描述信息外,还利用了视频的音频特征以及图像特征,并且将多种模态特征进行融合,能更加精准有效的提取视频信息,增强实体词识别的识别效果,例如提高了实体词识别的准确度和效率。可以提升测试数据集上的准确率和召回率。其中,一种模态可以是一种数据类型,例如文本、音频以及图像分别一种模态,多模态包括至少两种模态。模态特征例如可以是文本特征、音频特征或图像特征中的任意一种。多模态特征包括至少两种模态特征。本申请提供的实体词识别模型(即实体识别模型),可以对视频信息进行有效的提取。
本申请还提供一种应用场景,该应用场景应用上述的内容识别方法,可以对视频中的文本进行实体识别。具体地,该内容识别方法在该应用场景的应用如下:
接收针对目标视频的视频标签生成请求,响应于视频标签生成请求,使用本申请提供的内容识别方法,对目标视频进行实体词识别,得到识别出的实体词,将识别出的各个实体词作为目标视频对应的视频标签。
本申请提供的内容识别方法,应用于视频识别,可以节省获取视频信息的时间,提高理解视频信息的效率。
本申请还提供一种应用场景,该应用场景应用上述的内容识别方法,可以对视频中的文 本进行实体识别。具体地,该内容识别方法在该应用场景的应用如下:
接收针对目标用户对应的视频推荐请求,获取候选视频,利用本申请提供的内容识别方法对候选视频进行实体词识别,将识别出的实体词作为候选视频对应的视频标签,获取目标用户对应的用户信息,当确定视频标签与用户信息匹配时,例如视频标签与用户的用户画像匹配时,向目标用户对应的终端推送该候选视频。
本申请提供的内容识别方法,应用于视频推荐中,可以为视频推荐算法提供优质特征,优化视频推荐效果。
应该理解的是,虽然上述各实施例的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,上述各实施例中的至少一部分步骤可以包括多个步骤或者多个阶段,这些步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。
在一些实施例中,如图9所示,提供了一种内容识别装置,该装置可以采用软件模块或硬件模块,或者是二者的结合成为计算机设备的一部分,该装置具体包括:目标内容确定模块902、特征提取模块904、特征关联度得到模块906、调整文本特征得到模块908和内容识别结果得到模块910,其中:目标内容确定模块902,用于确定待识别的目标内容,从目标内容中,获取目标文本以及与目标文本关联的文本关联数据。特征提取模块904,用于对目标文本进行特征提取,得到文本提取特征;对文本关联数据进行特征提取,得到关联提取特征。特征关联度得到模块906,用于确定关联提取特征与文本提取特征间的特征关联度;特征关联度用于表征目标文本与文本关联数据之间的关联程度。调整文本特征得到模块908,用于基于特征关联度对文本提取特征进行调整,得到调整文本特征。内容识别结果得到模块910,用于基于调整文本特征进行识别,得到目标内容对应的内容识别结果。
上述内容识别装置中,确定待识别的目标内容,从目标内容中,获取目标文本以及与目标文本关联的文本关联数据,对目标文本进行特征提取,得到文本提取特征,对文本关联数据进行特征提取,得到关联提取特征,确定关联提取特征与文本提取特征间的特征关联度,基于特征关联度对文本提取特征进行调整,得到调整文本特征,基于调整文本特征进行识别,得到目标内容对应的内容识别结果。由于特征关联度反映了目标文本与文本关联数据之间的关联程度的大小,特征关联度越大,则目标文本与文本关联数据之间的关联程度越大,特征关联度越小,则目标文本与文本关联数据之间的关联程度越小,从而在基于调整文本特征进行识别时,目标文本与文本关联数据之间的关联程度越大,文本关联数据对识别结果的影响程度越大,目标文本与文本关联数据之间的关联程度越小,文本关联数据对识别结果的影响程度越小,从而可以自适应的根据文本关联数据与目标文本之间的关系,调整用于识别时的特征,提高识别时所用的特征的准确度,提高内容识别的准确度。
在一些实施例中,内容识别结果得到模块910包括:第一融合文本特征得到单元,用于对调整文本特征以及文本提取特征进行融合,得到融合文本特征。第一内容识别结果得到单元,用于基于融合文本特征进行识别,得到目标内容对应的内容识别结果。
本实施例中,对调整文本特征以及文本提取特征进行融合,得到融合文本特征,基于融合文本特征进行识别,得到目标内容对应的内容识别结果,可以提高内容识别的准确度。
在一些实施例中,第一融合文本特征得到单元,还用于对文本提取特征进行编码,得到 第一编码特征,对调整文本特征进行编码,得到第二编码特征;将第一编码特征与第二编码特征进行融合,得到融合编码特征;基于融合编码特征得到调整文本特征对应的调整特征权重;基于调整特征权重对调整文本特征以及文本提取特征进行融合,得到融合文本特征。
本实施例中,对文本提取特征进行编码,得到第一编码特征,对调整文本特征进行编码,得到第二编码特征,将第一编码特征与第二编码特征进行融合,得到融合编码特征,基于融合编码特征得到调整文本特征对应的调整特征权重,基于调整特征权重对调整文本特征以及文本提取特征进行融合,得到融合文本特征,从而融合文本特征既能反映文本提取特征又能反映调整文本特征,提高了融合文本特征的表达能力,当基于调整文本特征进行识别时,可以提高识别的准确度。
在一些实施例中,第一编码特征是通过已训练的内容识别模型中的第一编码器编码得到的,第二编码特征是通过内容识别模型中的第二编码器编码得到的,第一融合文本特征得到单元,还用于将融合编码特征输入到内容识别模型中的目标激活层进行激活处理,得到激活值,将激活值作为调整文本特征对应的调整特征权重,激活层为第一编码器与第二编码器的共享激活层。
本实施例中,将融合编码特征输入到内容识别模型中的目标激活层进行激活处理,得到目标激活值,将目标激活值作为调整文本特征对应的调整特征权重,使得调整特征权重为归一化后的值,提高了调整特征权重的合理性。
在一些实施例中,第一融合文本特征得到单元,还用于基于调整特征权重得到文本提取特征对应的文本特征权重;将调整特征权重与调整文本特征进行乘积计算,得到计算后的调整文本特征;将文本特征权重与文本提取特征进行乘积计算,得到计算后的文本提取特征;将计算后的调整文本特征与计算后的文本提取特征进行相加,得到融合文本特征。
本实施例中,将调整特征权重与调整文本特征进行乘积计算,得到计算后的调整文本特征,将文本特征权重与文本提取特征进行乘积计算,得到计算后的文本提取特征,将计算后的调整文本特征与计算后的文本提取特征进行相加,得到融合文本特征,由于文本特征权重是基于调整特征权重得到的,故提高了文本特征权重的准确度,从而提高了融合文本特征的准确度。
在一些实施例中,目标内容为目标视频;目标内容确定模块902包括:目标文本得到单元,用于获取目标视频中目标时间对应的文本,得到目标文本。文本关联数据得到单元,用于获取目标视频中目标时间对应的视频相关数据,将视频相关数据作为与目标文本关联的文本关联数据,视频相关数据包括视频帧或者音频帧的至少一种。
本实施例中,获取目标视频中目标时间对应的文本,得到目标文本,获取目标视频中目标时间对应的视频相关数据,将视频相关数据作为与目标文本关联的文本关联数据,由于视频相关数据包括视频帧或者音频帧的至少一种,从而获取了文本数据以及除文本数据之外的图像数据或音频数据,从而可以在文本数据的基础上结合图像数据或音频数据对视频进行识别,从而有利于提高识别的准确度。
在一些实施例中,调整文本特征包括根据视频帧调整得到的第一调整文本特征;内容识别结果得到模块910包括:第二融合文本特征得到单元,用于将第一调整文本特征以及文本提取特征进行融合,得到融合文本特征。第二内容识别结果得到单元,用于基于融合文本特征进行识别,得到目标内容对应的内容识别结果。
本实施例中,将第一调整文本特征、第二调整文本特征以及文本提取特征进行融合,得 到融合文本特征,从而使得融合文本特征是基于第一调整文本特征、第二调整文本特征以及文本提取特征这三种特征得到的,从而提高了融合文本特征的特征丰富程度,从而当基于融合文本特征进行识别时,可以提高识别的准确度。
在一些实施例中,调整文本特征还包括根据音频帧调整得到的第二调整文本特征,第二融合文本特征得到单元还用于:将第一调整文本特征、第二调整文本特征以及文本提取特征进行融合,得到融合文本特征。
本实施例中,将第一调整文本特征、第二调整文本特征以及文本提取特征进行融合,得到融合文本特征,从而使得融合文本特征是基于第一调整文本特征、第二调整文本特征以及文本提取特征这三种特征得到的,从而提高了融合文本特征的特征丰富程度,从而当基于融合文本特征进行识别时,可以提高识别的准确度。
在一些实施例中,基于特征关联度对文本提取特征进行调整,得到调整文本特征包括:基于特征关联度得到文本提取特征对应的特征关注强度,特征关联度与特征关注强度成正相关关系;基于特征关注强度对文本提取特征进行调整,得到调整文本特征。
本实施例中,由于特征关联度与特征关注强度成正相关关系,故目标文本与文本关联数据之间的关联程度越大,特征关注强度越大,对文本提取特征调整的程度越大,目标文本与文本关联数据之间的关联程度越小,特征关注强度越小,对文本提取特征调整的程度越小,,从而基于特征关注强度对文本提取特征进行调整,实现了自适应的根据文本关联数据与目标文本之间的关联程度进行文本特征的调整,从而在基于调整文本特征进行识别时,使得识别结果受到文本关联数据的影响,目标文本与文本关联数据之间的关联程度越大,文本关联数据对识别结果的影响程度越大,使得在内容识别时更加关注关联程度较大的信息,提高了内容识别的准确度。
在一些实施例中,调整文本特征得到模块908包括:特征值乘积得到单元,用于将特征关注强度与文本提取特征的各个特征值相乘,得到特征值乘积。调整文本特征得到单元,用于按照文本提取特征中特征值的位置对特征值乘积进行排列,将排列得到的特征值序列作为调整文本特征。
本实施例中,将特征关注强度与文本提取特征的各个特征值相乘,得到特征值乘积,从而特征值乘积可以反映文本关联数据对文本特征值的关注程度,按照文本提取特征中特征值的排序对特征值乘积进行排序,将排序得到的特征值序列作为调整文本特征,从而调整文本特征可以反映文本关联数据对文本提取特征的关注程度。
在一些实施例中,文本提取特征为目标文本中的分词对应的特征;各个调整文本特征按照分词在目标文本的顺序形成特征序列;内容识别结果得到模块910包括:位置关系得到单元,用于基于特征序列得到各个分词相对于命名实体的位置关系。第三内容识别结果得到单元,用于基于各个位置关系从目标文本中获取目标命名实体,将目标命名实体作为目标内容对应的内容识别结果。
本实施例中,基于特征序列得到各个分词相对于命名实体的位置关系,基于各个位置关系从目标文本中获取目标命名实体,将目标命名实体作为目标内容对应的内容识别结果,从而可以基于调整文本特征形成的特征序列进行内容识别,提高了内容识别的准确度。
在一些实施例中,第三内容识别结果得到单元,还用于获取位置关系为命名实体的起始位置的分词,作为命名实体起始词;将命名实体起始词对应的后向分词中,位置关系为处于命名实体的内部的分词作为命名实体组成词;将命名实体起始词与命名实体组成词进行组合, 得到目标命名实体。
本实施例中,获取位置关系为命名实体的起始位置的分词,作为命名实体起始词,将命名实体起始词对应的后向分词中,位置关系为处于命名实体的内部的分词作为命名实体组成词,将命名实体起始词与命名实体组成词进行组合,得到目标命名实体,从而可以基于调整文本特征形成的特征序列进行实体识别,提高了实体识别的准确度。
在一些实施例中,位置关系得到单元,还用于基于特征序列得到各个分词相对于命名实体的位置关系以及分词对应的实体类型;第三内容识别结果得到单元,还用于将命名实体起始词对应的后向分词中,位置关系为处于命名实体的内部,且实体类型与命名实体起始词的类型相同的分词作为命名实体组成词。
本实施例中,基于特征序列得到各个分词相对于命名实体的位置关系以及分词对应的实体类型,将命名实体起始词对应的后向分词中,位置关系为处于命名实体的内部,且实体类型与命名实体起始词的类型相同的分词作为命名实体组成词,提高了实体识别的准确度。
在一些实施例中,特征关联度得到模块906包括:乘积运算值得到单元,用于将关联提取特征中的关联特征值与文本提取特征中对应位置的文本特征值进行乘积运算,得到乘积运算值。特征关注强度得到单元,用于对乘积运算值进行统计,得到关联提取特征与文本提取特征之间的特征关联度。
本实施例中,将关联提取特征中的关联特征值与文本提取特征中对应位置的文本特征值进行乘积运算,得到乘积运算值,对各个乘积运算值进行统计运算,得到关联提取特征与文本提取特征之间的特征关联度,从而特征关联度可以准确的反映文本关联数据与目标文本之间的关联程度,从而当基于特征关联度对文本提取特征进行调整时,可以提高调整的准确度。
关于内容识别装置的具体限定可以参见上文中对于内容识别方法的限定。上述内容识别装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一些实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图10所示。该计算机设备包括通过系统总线连接的处理器、存储器和网络接口。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储内容识别数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种内容识别方法。
在一些实施例中,提供了一种计算机设备,该计算机设备可以是终端,其内部结构图可以如图11所示。该计算机设备包括通过系统总线连接的处理器、存储器、通信接口、显示屏和输入装置。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机可读指令。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的通信接口用于与外部的终端进行有线或无线方式的通信,无线方式可通过WIFI、运营商网络、NFC(近场通信)或其他技术实现。该计算机可读指令被处理器执行时以实现一种内容识别方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏,该计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨 迹球或触控板,还可以是外接的键盘、触控板或鼠标等。
本领域技术人员可以理解,图10和图11中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一些实施例中,还提供了一种计算机设备,包括存储器和一个或多个处理器,存储器中存储有计算机可读指令,该计算机可读指令被处理器执行时,使得一个或多个处理器执行计算机可读指令时实现上述各方法实施例中的步骤。
在一些实施例中,提供了一个或多个非易失性可读存储介质,存储有计算机可读指令,该计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器实现上述各方法实施例中的步骤。
一种计算机程序产品,包括计算机可读指令,所述计算机可读指令被处理器执行时实现上述图像处理方法的步骤。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器(Read-Only Memory,ROM)、磁带、软盘、闪存或光存储器等。易失性存储器可包括随机存取存储器(Random Access Memory,RAM)或外部高速缓冲存储器。作为说明而非局限,RAM可以是多种形式,比如静态随机存取存储器(Static Random Access Memory,SRAM)或动态随机存取存储器(Dynamic Random Access Memory,DRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (18)

  1. 一种内容识别方法,其特征在于,由计算机设备执行,所述方法包括:
    确定待识别的目标内容,从所述目标内容中,获取目标文本以及与所述目标文本关联的文本关联数据;
    对所述目标文本进行特征提取,得到文本提取特征;对所述文本关联数据进行特征提取,得到关联提取特征;
    确定所述关联提取特征与所述文本提取特征间的特征关联度;所述特征关联度用于表征所述目标文本与所述文本关联数据之间的关联程度;基于所述特征关联度对所述文本提取特征进行调整,得到调整文本特征;及,
    基于所述调整文本特征进行识别,得到所述目标内容对应的内容识别结果。
  2. 根据权利要求1所述的方法,其特征在于,所述基于所述调整文本特征进行识别,得到所述目标内容对应的内容识别结果包括:
    对所述调整文本特征以及所述文本提取特征进行融合,得到融合文本特征;及,
    基于所述融合文本特征进行识别,得到所述目标内容对应的内容识别结果。
  3. 根据权利要求2所述的方法,其特征在于,所述对所述调整文本特征以及所述文本提取特征进行融合,得到融合文本特征包括:
    对所述文本提取特征进行编码,得到第一编码特征,对所述调整文本特征进行编码,得到第二编码特征;
    将所述第一编码特征与所述第二编码特征进行融合,得到融合编码特征;
    基于所述融合编码特征得到所述调整文本特征对应的调整特征权重;及,
    基于所述调整特征权重对所述调整文本特征以及所述文本提取特征进行融合,得到融合文本特征。
  4. 根据权利要求3所述的方法,其特征在于,所述第一编码特征是通过已训练的内容识别模型中的第一编码器编码得到的,所述第二编码特征是通过所述内容识别模型中的第二编码器编码得到的,所述基于所述融合编码特征得到所述调整文本特征对应的调整特征权重包括:
    将所述融合编码特征输入到所述内容识别模型中的目标激活层进行激活处理,得到目标激活值,将所述目标激活值作为所述调整文本特征对应的调整特征权重,所述激活层为所述第一编码器与所述第二编码器的共享激活层。
  5. 根据权利要求3所述的方法,其特征在于,所述基于所述调整特征权重对所述调整文本特征以及所述文本提取特征进行融合,得到融合文本特征包括:
    基于所述调整特征权重得到所述文本提取特征对应的文本特征权重;
    将所述调整特征权重与所述调整文本特征进行乘积计算,得到计算后的调整文本特征;
    将所述文本特征权重与所述文本提取特征进行乘积计算,得到计算后的文本提取特征;及,
    将所述计算后的调整文本特征与所述计算后的文本提取特征进行相加,得到融合文本特征。
  6. 根据权利要求1所述的方法,其特征在于,所述目标内容为目标视频;所述从所述目标内容中,获取目标文本以及与所述目标文本关联的文本关联数据包括:
    获取所述目标视频中目标时间对应的文本,得到目标文本;及,
    获取所述目标视频中所述目标时间对应的视频相关数据,将所述视频相关数据作为与所述目标文本关联的文本关联数据,所述视频相关数据包括视频帧或者音频帧的至少一种。
  7. 根据权利要求6所述的方法,其特征在于,所述调整文本特征包括根据所述视频帧调整得到的第一调整文本特征;
    所述基于所述调整文本特征进行识别,得到所述目标内容对应的内容识别结果包括:
    将所述第一调整文本特征以及所述文本提取特征进行融合,得到融合文本特征;及,
    基于所述融合文本特征进行识别,得到所述目标内容对应的内容识别结果。
  8. 根据权利要求7所述的方法,其特征在于,所述调整文本特征还包括根据所述音频帧调整得到的第二调整文本特征,所述将所述第一调整文本特征以及所述文本提取特征进行融合,得到融合文本特征包括:
    将所述第一调整文本特征、所述第二调整文本特征以及所述文本提取特征进行融合,得到融合文本特征。
  9. 根据权利要求1所述的方法,其特征在于,所述基于所述特征关联度对所述文本提取特征进行调整,得到调整文本特征包括:
    基于所述特征关联度得到所述文本提取特征对应的特征关注强度,所述特征关联度与所述特征关注强度成正相关关系;
    基于所述特征关注强度对所述文本提取特征进行调整,得到调整文本特征。
  10. 根据权利要求9所述的方法,其特征在于,所述基于所述特征关注强度对所述文本提取特征进行调整,得到调整文本特征包括:
    将所述特征关注强度与所述文本提取特征的各个特征值相乘,得到特征值乘积;及,
    按照所述文本提取特征中特征值的位置对所述特征值乘积进行排列,将排列得到的特征值序列作为所述调整文本特征。
  11. 根据权利要求1所述的方法,其特征在于,所述文本提取特征为所述目标文本中的分词对应的特征;各个调整文本特征按照分词在所述目标文本的顺序形成特征序列;所述基于所述调整文本特征进行识别,得到所述目标内容对应的内容识别结果包括:
    基于所述特征序列得到各个所述分词相对于命名实体的位置关系;及,
    基于各个所述位置关系从所述目标文本中获取目标命名实体,将所述目标命名实体作为所述目标内容对应的内容识别结果。
  12. 根据权利要求11所述的方法,其特征在于,所述基于各个所述位置关系从所述目标文本中获取目标命名实体包括:
    获取位置关系为命名实体的起始位置的分词,作为命名实体起始词;
    将所述命名实体起始词对应的后向分词中,位置关系为处于命名实体的内部的分词作为命名实体组成词;及,
    将所述命名实体起始词与所述命名实体组成词进行组合,得到目标命名实体。
  13. 根据权利要求12所述的方法,其特征在于,所述基于所述特征序列得到各个所述分词相对于命名实体的位置关系包括:
    基于所述特征序列得到各个所述分词相对于命名实体的位置关系以及所述分词对应的实体类型;及,
    所述将所述命名实体起始词对应的后向分词中,位置关系为处于命名实体的内部的分词作为命名实体组成词包括:
    将所述命名实体起始词对应的后向分词中,位置关系为处于命名实体的内部,且实体类型与所述命名实体起始词的类型相同的分词作为命名实体组成词。
  14. 根据权利要求1所述的方法,其特征在于,所述确定所述关联提取特征与所述文本提取特征间的特征关联度包括:
    将所述关联提取特征中的关联特征值与所述文本提取特征中对应位置的文本特征值进行乘积运算,得到乘积运算值;及,
    对所述乘积运算值进行统计,得到所述关联提取特征与所述文本提取特征之间的特征关联度。
  15. 一种内容识别装置,其特征在于,所述装置包括:
    目标内容确定模块,用于确定待识别的目标内容,从所述目标内容中,获取目标文本以及与所述目标文本关联的文本关联数据;
    特征提取模块,用于对所述目标文本进行特征提取,得到文本提取特征;对所述文本关联数据进行特征提取,得到关联提取特征;
    特征关联度得到模块,用于确定所述关联提取特征与所述文本提取特征间的特征关联度;所述特征关联度用于表征所述目标文本与所述文本关联数据之间的关联程度;
    调整文本特征得到模块,用于基于所述特征关联度对所述文本提取特征进行调整,得到调整文本特征;及,
    内容识别结果得到模块,用于基于所述调整文本特征进行识别,得到所述目标内容对应的内容识别结果。
  16. 一种计算机设备,包括存储器和一个或多个处理器,所述存储器存储有计算机可读指令,其特征在于,所述计算机可读指令被所述处理器执行时,使得所述一个或多个处理器执行权利要求1至14中任一项所述的方法的步骤。
  17. 一个或多个非易失性可读存储介质,存储有计算机可读指令,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器实现权利要求1至14中任一项所述的方法的步骤。
  18. 一种计算机程序产品,包括计算机可读指令,所述计算机可读指令被处理器执行时实现权利要求1至14中任一项所述的方法的步骤。
PCT/CN2022/081896 2021-03-26 2022-03-21 内容识别方法、装置、计算机设备和存储介质 WO2022199504A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/991,727 US20230077849A1 (en) 2021-03-26 2022-11-21 Content recognition method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110325997.8A CN113723166A (zh) 2021-03-26 2021-03-26 内容识别方法、装置、计算机设备和存储介质
CN202110325997.8 2021-03-26

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/991,727 Continuation US20230077849A1 (en) 2021-03-26 2022-11-21 Content recognition method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022199504A1 true WO2022199504A1 (zh) 2022-09-29

Family

ID=78672647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/081896 WO2022199504A1 (zh) 2021-03-26 2022-03-21 内容识别方法、装置、计算机设备和存储介质

Country Status (3)

Country Link
US (1) US20230077849A1 (zh)
CN (1) CN113723166A (zh)
WO (1) WO2022199504A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522994B (zh) * 2020-04-15 2023-08-01 北京百度网讯科技有限公司 用于生成信息的方法和装置
CN113723166A (zh) * 2021-03-26 2021-11-30 腾讯科技(北京)有限公司 内容识别方法、装置、计算机设备和存储介质
CN114495938B (zh) * 2021-12-04 2024-03-08 腾讯科技(深圳)有限公司 音频识别方法、装置、计算机设备及存储介质
CN114495102A (zh) * 2022-01-12 2022-05-13 北京百度网讯科技有限公司 文本识别方法、文本识别网络的训练方法及装置
CN114078277A (zh) * 2022-01-19 2022-02-22 深圳前海中电慧安科技有限公司 一人一档的人脸聚类方法、装置、计算机设备及存储介质
CN115168568B (zh) * 2022-03-16 2024-04-05 腾讯科技(深圳)有限公司 一种数据内容的识别方法、装置以及存储介质
CN117237259B (zh) * 2023-11-14 2024-02-27 华侨大学 基于多模态融合的压缩视频质量增强方法及装置

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150324350A1 (en) * 2014-05-12 2015-11-12 International Business Machines Corporation Identifying Content Relationship for Content Copied by a Content Identification Mechanism
CN109885723A (zh) * 2019-02-20 2019-06-14 腾讯科技(深圳)有限公司 一种视频动态缩略图的生成方法、模型训练的方法及装置
CN110991427A (zh) * 2019-12-25 2020-04-10 北京百度网讯科技有限公司 用于视频的情绪识别方法、装置和计算机设备
CN111680541A (zh) * 2020-04-14 2020-09-18 华中科技大学 一种基于多维度注意力融合网络的多模态情绪分析方法
CN112364810A (zh) * 2020-11-25 2021-02-12 深圳市欢太科技有限公司 视频分类方法及装置、计算机可读存储介质与电子设备
CN112418034A (zh) * 2020-11-12 2021-02-26 元梦人文智能国际有限公司 多模态情感识别方法、装置、电子设备和存储介质
CN113010740A (zh) * 2021-03-09 2021-06-22 腾讯科技(深圳)有限公司 词权重的生成方法、装置、设备及介质
CN113723166A (zh) * 2021-03-26 2021-11-30 腾讯科技(北京)有限公司 内容识别方法、装置、计算机设备和存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600157A (zh) * 2022-11-29 2023-01-13 支付宝(杭州)信息技术有限公司(Cn) 一种数据处理的方法、装置、存储介质及电子设备
CN115600157B (zh) * 2022-11-29 2023-05-16 支付宝(杭州)信息技术有限公司 一种数据处理的方法、装置、存储介质及电子设备

Also Published As

Publication number Publication date
US20230077849A1 (en) 2023-03-16
CN113723166A (zh) 2021-11-30

Similar Documents

Publication Publication Date Title
WO2022199504A1 (zh) 内容识别方法、装置、计算机设备和存储介质
US11544474B2 (en) Generation of text from structured data
CN107066464B (zh) 语义自然语言向量空间
CN106973244B (zh) 使用弱监督数据自动生成图像字幕的方法和系统
US11368705B2 (en) Video feature extraction and video content understanding method, apparatus, storage medium and server
US10885344B2 (en) Method and apparatus for generating video
CN111581510A (zh) 分享内容处理方法、装置、计算机设备和存储介质
Luo et al. Online learning of interpretable word embeddings
WO2019052403A1 (zh) 图像文本匹配模型的训练方法、双向搜索方法及相关装置
WO2021204269A1 (zh) 分类模型的训练、对象分类
WO2020199904A1 (zh) 视频描述信息的生成方法、视频处理方法、相应的装置
CN111539197B (zh) 文本匹配方法和装置以及计算机系统和可读存储介质
CN111866610B (zh) 用于生成信息的方法和装置
CN112364204B (zh) 视频搜索方法、装置、计算机设备及存储介质
WO2023179429A1 (zh) 一种视频数据的处理方法、装置、电子设备及存储介质
WO2018068648A1 (zh) 一种信息匹配方法及相关装置
CN109145083B (zh) 一种基于深度学习的候选答案选取方法
WO2023134082A1 (zh) 图像描述语句生成模块的训练方法及装置、电子设备
US20230004608A1 (en) Method for content recommendation and device
CN112950291A (zh) 模型的偏差优化方法、装置、设备及计算机可读介质
CN110321565B (zh) 基于深度学习的实时文本情感分析方法、装置及设备
Wu et al. Hashtag recommendation with attention-based neural image hashtagging network
CN114155388B (zh) 一种图像识别方法、装置、计算机设备和存储介质
CN113657116B (zh) 基于视觉语义关系的社交媒体流行度预测方法及装置
CN116415624A (zh) 模型训练方法及装置、内容推荐方法及装置

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22774165; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)