WO2019242278A1 - Method and device for acquiring a loss value of a content description generation model - Google Patents

Method and device for acquiring a loss value of a content description generation model

Info

Publication number
WO2019242278A1
WO2019242278A1 (PCT/CN2018/123955)
Authority
WO
WIPO (PCT)
Prior art keywords
content
loss value
video
generation model
content description
Prior art date
Application number
PCT/CN2018/123955
Other languages
English (en)
French (fr)
Inventor
李岩
李涛
Original Assignee
北京达佳互联信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2019242278A1 publication Critical patent/WO2019242278A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84 Generation or processing of descriptive data, e.g. content descriptors

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a method and a device for acquiring a loss value of a content description generation model.
  • Neural networks are an important branch of deep learning. Because of their strong fitting ability and end-to-end global optimization ability, the accuracy of the video content description generation task has been greatly improved since neural network models were applied to it.
  • although the current content description generation model can generate a content description of a video according to the video features of the video, the generated content description may still differ from the content expressed by the video itself. Therefore, it is often necessary to acquire a loss value of the content description generation model and to optimize the content description generation model according to the loss value.
  • the embodiments of the present disclosure show a method and a device for acquiring a loss value of a content description generation model.
  • an embodiment of the present disclosure shows a method for acquiring a loss value of a content description generation model, the method including: acquiring a first loss value of the content description generation model based on an annotated content description of a video and a predicted content description output by the content description generation model; acquiring, according to a preset parameter matrix in the content description generation model, a predicted content theme used to describe the content of the video; acquiring an annotated content theme used to describe the content of the video; determining a second loss value of the content description generation model according to the predicted content theme and the annotated content theme; and determining a target loss value of the content description generation model according to the first loss value and the second loss value.
  • the obtaining a predicted content theme for describing the content of the video according to a preset parameter matrix in the content description generation model includes: acquiring a video feature of the video; and calculating the product of the video feature and the preset parameter matrix, which is used as the predicted content theme.
  • the acquiring video features of a video includes: acquiring multiple frames of video images in the video; using a convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) network to acquire an image feature of each frame; calculating an image average feature across the image features of the frames; and determining the image average feature as the video feature.
  • the acquiring an annotated content theme used to describe the content of the video includes: acquiring a stored annotated content description of the video; splitting the annotated content description to obtain multiple description words; determining the topic to which each of the description words belongs; and determining the annotated content theme according to the topic to which each of the description words belongs.
  • determining the second loss value of the content description generation model according to the predicted content theme and the annotated content theme includes: calculating a difference matrix between the matrix corresponding to the annotated content theme and the matrix corresponding to the predicted content theme; calculating the square of a first norm of the difference matrix; calculating a second norm of the preset parameter matrix; and calculating, according to the square and the second norm, the second loss value by the formula L1 = α*X1 + γ*X2;
  • where L1 is the second loss value, α is the first preset coefficient, γ is the second preset coefficient, X1 is the square, and X2 is the second norm.
  • determining the target loss value of the content description generation model according to the first loss value and the second loss value includes: calculating, according to the first loss value and the second loss value, the target loss value by the formula L0 = β*L1 + λ*L2;
  • where L0 is the target loss value, β is the third preset coefficient, λ is the fourth preset coefficient, L1 is the second loss value, and L2 is the first loss value.
  • the method further includes: optimizing a preset parameter matrix of the content description generation model according to a target loss value of the content description generation model.
  • an embodiment of the present disclosure illustrates a loss value obtaining device for a content description generation model, where the device includes:
  • a first acquisition module configured to acquire a first loss value of the content description generation model based on the annotated content description of the video and the predicted content description output by the content description generation model;
  • a second obtaining module configured to obtain a predicted content theme used to describe the content of the video according to a preset parameter matrix in the content description generation model
  • a third acquisition module configured to acquire an annotated content theme used to describe the content of the video
  • a first determining module configured to determine a second loss value of the content description generation model according to the predicted content theme and the annotated content theme;
  • a second determining module is configured to determine a target loss value of the content description generation model according to the first loss value and the second loss value.
  • the second acquisition module includes:
  • a first obtaining unit configured to obtain a video feature of the video
  • a first calculation unit is configured to calculate a product between the video feature and the preset parameter matrix, and use the product as the predicted content theme.
  • the first obtaining unit includes:
  • a first acquisition subunit configured to acquire a multi-frame video image in the video
  • a second acquisition subunit configured to use a convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) network to acquire the image feature of each frame of the video images;
  • a calculation subunit configured to calculate an average image feature between image features of the video image in each frame
  • a determining subunit configured to determine the image average feature as the video feature.
  • the third acquisition module includes:
  • a second acquisition unit configured to acquire the stored annotated content description of the video;
  • a splitting unit configured to split the annotated content description to obtain multiple description words;
  • a first determining unit configured to determine the topic to which each of the description words belongs;
  • a second determining unit configured to determine the annotated content theme according to the topic to which each of the description words belongs.
  • the first determining module includes:
  • a second calculation unit configured to calculate a difference matrix between the matrix corresponding to the annotated content theme and the matrix corresponding to the predicted content theme;
  • a third calculation unit configured to calculate a square of a first norm of the difference matrix;
  • a fourth calculation unit configured to calculate a second norm of the preset parameter matrix;
  • a fifth calculation unit configured to calculate, according to the square and the second norm, the second loss value by the following formula: L1 = α*X1 + γ*X2;
  • where L1 is the second loss value, α is the first preset coefficient, γ is the second preset coefficient, X1 is the square, and X2 is the second norm.
  • the second determining module is specifically configured to calculate the target loss value according to the following formula: L0 = β*L1 + λ*L2;
  • where L0 is the target loss value, β is the third preset coefficient, λ is the fourth preset coefficient, L1 is the second loss value, and L2 is the first loss value.
  • an embodiment of the present disclosure shows a terminal, including: a memory, a processor, and a loss value acquisition program of a content description generation model stored on the memory and executable on the processor; when the loss value acquisition program of the content description generation model is executed by the processor, the steps of the loss value acquisition method of the content description generation model described in the first aspect are implemented.
  • an embodiment of the present disclosure shows a computer-readable storage medium on which a loss value acquisition program of a content description generation model is stored; when the loss value acquisition program of the content description generation model is executed by a processor, the steps of the method for acquiring a loss value of the content description generation model described in the first aspect are implemented.
  • an embodiment of the present disclosure shows a computer program product, the computer program product including a computer program, the computer program including program instructions and being stored on a computer-readable storage medium; when the program instructions are executed by a processor, the steps of the loss value acquisition method of the content description generation model described in the first aspect are implemented.
  • the embodiments of the present disclosure include the following advantages:
  • in the embodiments of the present disclosure, a first loss value of the content description generation model is acquired; a predicted content theme used to describe the content of the video is acquired according to a preset parameter matrix in the content description generation model; an annotated content theme used to describe the content of the video is acquired; a second loss value of the content description generation model is determined according to the predicted content theme and the annotated content theme; and a target loss value of the content description generation model is determined according to the first loss value and the second loss value.
  • according to the target loss value, the degree of error in the content description of the video generated by the content description generation model can be determined, and an optimization method suited to that degree of error can then be selected to optimize the preset parameter matrix in the content description generation model. This improves the sparsity of the preset parameter matrix in the content description generation model, that is, keeps the number of non-zero values in the preset parameter matrix as small as possible, so that the relationship between each dimension of the video feature of the video and the content theme of the video becomes clearer and the interpretability higher, making the correlation between the video features of the video and the content theme of the content description of the video generated according to the content description generation model more significantly visible.
  • FIG. 1 is a flowchart of steps in an embodiment of a method for obtaining a loss value of a content description generation model according to the present disclosure
  • FIG. 2 is a structural block diagram of an embodiment of a loss value obtaining device for a content description generation model of the present disclosure
  • FIG. 3 is a structural block diagram of a terminal embodiment of the present disclosure.
  • referring to FIG. 1, there is shown a flowchart of steps in an embodiment of a method for obtaining a loss value of a content description generation model of the present disclosure, which may specifically include the following steps:
  • step S101 a first loss value of the content description generation model is acquired
  • the first loss value of the preset content description generation model may be obtained according to any conventional loss value acquisition method in the prior art.
  • the video is input into the content description generation model, and the predicted content description of the video output by the content description generation model is obtained; the annotated content description of the video is acquired; and the first loss value of the content description generation model is obtained according to the predicted content description and the annotated content description.
  • step S102 according to a preset parameter matrix in the content description generation model, a predicted content theme used to describe the content of the video is acquired;
  • the content description generation model is used to generate the predicted content description of the video.
  • the content description generation model includes a preset parameter matrix.
  • the predicted content theme used to describe the content of the video can be obtained according to the following process, including:
  • CNN: Convolutional Neural Network
  • LSTM: Long Short-Term Memory network
  • for any one frame of the multiple frames of video images, the frame is input into the CNN to obtain the 1536-dimensional feature description of the frame output by the CNN; the 1536-dimensional feature description is input into the bidirectional LSTM to obtain two 256-dimensional feature descriptions; and the 1536-dimensional feature description and the two 256-dimensional feature descriptions are concatenated into a 2048-dimensional feature description, which serves as the image feature of the frame. The same is done for every other frame of the multiple frames of video images.
  • the image average feature across the image features of the frames is then calculated; for example, within the 2048-dimensional feature descriptions of the frames, the average of the values in the same dimension is computed to obtain the image average feature.
  • the image average feature is then determined as the video feature of the video.
  • the content description generation model is used to generate a predicted content description of a video, and the predicted content topic is generated according to the predicted content description.
  • the content description generation model includes a preset parameter matrix. In order to describe the correlation between the video features of the video and the predicted content theme of the video, the product between the video features and the preset parameter matrix can be calculated and used as the predicted content theme.
  • step S103 acquiring a labeled content theme for describing the content of the video
  • This step can be implemented through the following processes, including:
  • a technician can watch the content of the video in advance, summarize the content description of the video according to the content of the video, use it as the annotated content description of the video, and then store the annotated content description of the video. Therefore, in this step, the stored annotated content description of the video can be acquired.
  • the Chinese word segmentation system NLPIR can be used to segment the annotated content description to obtain the multiple description words included in the annotated content description.
  • a technician sets a plurality of topics in advance, and for each topic, the description words describing the topic can be collected to form the description vocabulary set corresponding to the topic.
  • for any description word, the description vocabulary set including the description word can be found among the plurality of description vocabulary sets, and the topic corresponding to that description vocabulary set is taken as the topic to which the description word belongs. The same is true for every other description word.
  • the topic to which the largest number of description words belong may be determined as the annotated content theme.
  • the annotated content theme may also be determined in other ways, which is not limited in the embodiments of the present disclosure.
  • step S104 a second loss value of the content description generation model is determined according to the predicted content theme and the annotated content theme;
  • a difference matrix between the matrix corresponding to the annotated content theme and the matrix corresponding to the predicted content theme is calculated; the square of a first norm of the difference matrix and a second norm of the preset parameter matrix are calculated; and the second loss value is then calculated according to the following formula: L1 = α*X1 + γ*X2;
  • where L1 is the second loss value, α is the first preset coefficient, γ is the second preset coefficient, X1 is the square, and X2 is the second norm.
  • α includes a value between 0.1 and 1, and γ includes a value such as 1, 1.001, or 1.0001.
  • step S105 a target loss value of the content description generation model is determined according to the first loss value and the second loss value.
  • the target loss value can be calculated according to the following formula: L0 = β*L1 + λ*L2;
  • where L0 is the target loss value, β is the third preset coefficient, λ is the fourth preset coefficient, L1 is the second loss value, and L2 is the first loss value.
  • β includes values such as 1, 1.001, or 1.0001, and λ includes values such as 0.5, 0.51, or 0.501.
  • in the embodiments of the present disclosure, a first loss value of the content description generation model is acquired; a predicted content theme used to describe the content of the video is acquired according to a preset parameter matrix in the content description generation model; an annotated content theme used to describe the content of the video is acquired; a second loss value of the content description generation model is determined according to the predicted content theme and the annotated content theme; and a target loss value of the content description generation model is determined according to the first loss value and the second loss value.
  • according to the target loss value, the degree of error in the content description of the video generated by the content description generation model can be determined, and an optimization method suited to that degree of error can then be selected to optimize the preset parameter matrix in the content description generation model, thereby improving the sparsity of the preset parameter matrix.
  • the device may specifically include the following modules:
  • a first acquisition module 11 configured to acquire a first loss value of a content description generation model
  • a second obtaining module 12 configured to obtain a predicted content theme for describing the content of the video according to a preset parameter matrix in the content description generation model
  • a third acquisition module 13 configured to acquire an annotated content theme for describing the content of the video
  • the first determining module 14 is configured to determine a second loss value of the content description generation model according to the predicted content theme and the annotated content theme;
  • the second determining module 15 is configured to determine a target loss value of the content description generation model according to the first loss value and the second loss value.
  • the second obtaining module 12 includes:
  • a first obtaining unit configured to obtain video features of a video
  • the first calculation unit is configured to calculate the product of the video feature and the preset parameter matrix, and use the product as the predicted content theme.
  • the first obtaining unit includes:
  • a first acquisition subunit configured to acquire multiple frames of video images in a video
  • a second acquisition subunit configured to acquire the image feature of each frame of the video images using a convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) network;
  • a calculation subunit configured to calculate an image average feature across the image features of the frames of video images;
  • a determination subunit configured to determine the image average feature as the video feature.
  • the third obtaining module 13 includes:
  • a second acquisition unit, configured to acquire the stored annotated content description of the video;
  • a splitting unit, configured to split the annotated content description to obtain multiple description words;
  • a first determining unit, configured to determine the topic to which each description word belongs;
  • a second determining unit, configured to determine the annotated content theme according to the topic to which each description word belongs.
  • the first determination module 14 includes:
  • a second calculation unit configured to calculate a difference matrix between the matrix corresponding to the annotated content theme and the matrix corresponding to the predicted content theme;
  • a third calculation unit configured to calculate a square of a first norm of the difference matrix
  • a fourth calculation unit configured to calculate a second norm of the preset parameter matrix
  • a fifth calculation unit configured to calculate, according to the square and the second norm, the second loss value by the following formula: L1 = α*X1 + γ*X2;
  • where L1 is the second loss value, α is the first preset coefficient, γ is the second preset coefficient, X1 is the square, and X2 is the second norm.
  • the second determining module 15 is specifically configured to:
  • the target loss value is calculated according to the following formula:
  • the formula is L0 = β*L1 + λ*L2, where L0 is the target loss value, β is the third preset coefficient, λ is the fourth preset coefficient, L1 is the second loss value, and L2 is the first loss value.
  • in the embodiments of the present disclosure, a first loss value of the content description generation model is acquired; a predicted content theme used to describe the content of the video is acquired according to a preset parameter matrix in the content description generation model; an annotated content theme used to describe the content of the video is acquired; a second loss value of the content description generation model is determined according to the predicted content theme and the annotated content theme; and a target loss value of the content description generation model is determined according to the first loss value and the second loss value.
  • according to the target loss value, the degree of error in the content description of the video generated by the content description generation model can be determined, and an optimization method suited to that degree of error can then be selected to optimize the preset parameter matrix in the content description generation model, thereby improving the sparsity of the preset parameter matrix in the content description generation model, that is, keeping the number of non-zero values in the preset parameter matrix as small as possible, so that the relationship between each dimension of the video feature of the video and the content theme of the video becomes clearer and the interpretability higher, making the correlation between the video features of the video and the content theme of the content description of the video generated according to the content description generation model more significantly visible.
  • as the device embodiment is basically similar to the method embodiment, its description is relatively simple; for the relevant parts, refer to the description of the method embodiment.
  • the present disclosure also shows a terminal, which may include: a memory, a processor, and a loss value acquisition program of a content description generation model stored on the memory and executable on the processor; when the loss value acquisition program of the content description generation model is executed by the processor, the steps of any loss value acquisition method of a content description generation model in the present disclosure are implemented.
  • Fig. 3 is a block diagram of a terminal 600 according to an exemplary embodiment.
  • the terminal 600 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
  • the terminal 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input / output (I / O) interface 612, a sensor component 614, And communication component 616.
  • the processing component 602 generally controls the overall operation of the terminal 600, such as operations associated with display, telephone calls, data communication, camera operation, and recording operation.
  • the processing component 602 may include one or more processors 620 to execute instructions to complete all or part of the steps of the method for obtaining a loss value of the model described above.
  • the processing component 602 may include one or more modules to facilitate the interaction between the processing component 602 and other components.
  • the processing component 602 may include a multimedia module to facilitate the interaction between the multimedia component 608 and the processing component 602.
  • the memory 604 is configured to store various types of data to support operations at the terminal 600. Examples of such data include instructions for any application or method for operating on the terminal 600, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 604 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk.
  • the power supply component 606 provides power to various components of the terminal 600.
  • the power component 606 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal 600.
  • the multimedia component 608 includes a screen that provides an output interface between the terminal 600 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. A touch sensor can not only sense the boundaries of a touch or slide action, but also detect the duration and pressure associated with a touch or slide operation.
  • the multimedia component 608 includes a front camera and / or a rear camera. When the terminal 600 is in an operation mode, such as a shooting mode or a video mode, the front camera and / or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component 610 is configured to output and / or input audio signals.
  • the audio component 610 includes a microphone (MIC).
  • the microphone When the terminal 600 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal.
  • the received audio signal may be further stored in the memory 604 or transmitted via the communication component 616.
  • the audio component 610 further includes a speaker for outputting audio signals.
  • the I / O interface 612 provides an interface between the processing component 602 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, a button, or the like. These buttons can include, but are not limited to: a home button, a volume button, a start button, and a lock button.
  • the sensor component 614 includes one or more sensors for providing the terminal 600 with a status assessment of various aspects.
  • the sensor component 614 can detect the open/closed state of the terminal 600 and the relative positioning of components, for example the display and keypad of the terminal 600; the sensor component 614 can also detect a change in the position of the terminal 600 or of a component of the terminal 600, the presence or absence of user contact with the terminal 600, the orientation or acceleration/deceleration of the terminal 600, and the temperature change of the terminal 600.
  • the sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • the sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 614 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • the communication component 616 is configured to facilitate wired or wireless communication between the terminal 600 and other devices.
  • the terminal 600 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • the communication component 616 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel.
  • the communication component 616 further includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the terminal 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the loss value acquisition method of a content description generation model.
  • the method includes: acquiring a first loss value of the content description generation model; acquiring, according to a preset parameter matrix in the content description generation model, a predicted content theme used to describe the content of the video; acquiring an annotated content theme used to describe the content of the video; determining a second loss value of the content description generation model according to the predicted content theme and the annotated content theme; and determining a target loss value of the content description generation model according to the first loss value and the second loss value.
  • obtaining the predicted content theme for describing the content of the video according to a preset parameter matrix in the content description generation model includes: obtaining a video feature of the video; and calculating the product of the video feature and the preset parameter matrix, which is used as the predicted content theme.
  • obtaining the video feature of the video includes: obtaining multiple frames of video images in the video; using a convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) network to obtain an image feature of each frame; calculating the image average feature across the image features of the frames; and determining the image average feature as the video feature.
  • obtaining the annotated content theme used to describe the content of the video includes: obtaining a stored annotated content description of the video; splitting the annotated content description to obtain multiple description words; determining the topic to which each description word belongs; and determining the annotated content theme according to those topics.
  • determining the second loss value of the content description generation model according to the predicted content theme and the annotated content theme includes: calculating a difference matrix between the matrix corresponding to the annotated content theme and the matrix corresponding to the predicted content theme; calculating the square of a first norm of the difference matrix; calculating a second norm of the preset parameter matrix; and calculating the second loss value as L1 = α*X1 + γ*X2, where L1 is the second loss value, α is the first preset coefficient, γ is the second preset coefficient, X1 is the square, and X2 is the second norm.
  • determining the target loss value of the content description generation model according to the first loss value and the second loss value includes: calculating, according to the first loss value and the second loss value, the target loss value as L0 = β*L1 + λ*L2, where L0 is the target loss value, β is the third preset coefficient, λ is the fourth preset coefficient, L1 is the second loss value, and L2 is the first loss value.
  • a non-transitory computer-readable storage medium including instructions is provided; the instructions may be executed by the processor 620 of the terminal 600 to complete the loss value acquisition method of the content description generation model. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
  • when the instructions in the storage medium are executed by the processor of the terminal, the terminal is enabled to execute the steps of any loss value acquisition method of a content description generation model in the present disclosure.
  • a computer program product includes a computer program; the computer program includes program instructions and is stored on a computer-readable storage medium; when the program instructions are executed by the processor of a terminal, the terminal is enabled to execute the steps of any loss value acquisition method of a content description generation model in the present disclosure.
  • the method for obtaining the loss value of the content description generation model provided here is not inherently related to any particular computer, virtual system, or other device.
  • Various general-purpose systems may also be used with the teachings herein. From the above description, the structure required to construct a system having the solution of the present disclosure is obvious. Furthermore, the present disclosure is not directed to any particular programming language. It should be understood that the content of the present disclosure described herein may be implemented using various programming languages, and that the above description of a specific language is intended to disclose the best embodiment of the present disclosure.
  • modules in the device in the embodiment can be adaptively changed and set in one or more devices different from the embodiment.
  • the modules or units or components in the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination.
  • the various component embodiments of the present disclosure may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof.
  • those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the loss value acquisition method of the content description generation model according to the embodiments of the present disclosure.
  • the present disclosure may also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing part or all of the methods described herein.
  • Such a program that implements the present disclosure may be stored on a computer-readable medium or may have the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure provide a method and a device for acquiring a loss value of a content description generation model. In the embodiments of the present disclosure, the degree of error in the content description of a video generated by the content description generation model can be determined according to a target loss value, and an optimization method suited to that degree of error can then be selected to optimize a preset parameter matrix in the content description generation model. This improves the sparsity of the preset parameter matrix in the content description generation model, that is, keeps the number of non-zero values in the preset parameter matrix as small as possible, so that the relationship between each dimension of the video feature of the video and the content theme of the video becomes clearer and the interpretability higher, making the correlation between the video feature of the video and the content theme of the content description generated for the video by the content description generation model more significantly visible.

Description

Method and device for acquiring a loss value of a content description generation model
This application claims priority to Chinese patent application No. 201810637242.X, filed on June 20, 2018 and entitled "Method and device for acquiring a loss value of a content description generation model", the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the field of computer technology, and in particular to a method and a device for acquiring a loss value of a content description generation model.
Background
Recently, deep learning has been widely applied in video, image, speech, natural language processing, and other related fields. Neural networks are an important branch of deep learning; because of their strong fitting ability and end-to-end global optimization ability, the accuracy of the video content description generation task has been greatly improved since neural network models were applied to it.
Although current content description generation models can generate the content description of a video according to the video features of the video, the generated content description may still differ from the content expressed by the video itself. Therefore, it is often necessary to acquire a loss value of the content description generation model and to optimize the content description generation model according to the loss value.
Summary
To solve the above technical problem, embodiments of the present disclosure show a method and a device for acquiring a loss value of a content description generation model.
In a first aspect, an embodiment of the present disclosure shows a method for acquiring a loss value of a content description generation model, the method including:
acquiring a first loss value of the content description generation model based on an annotated content description of a video and a predicted content description output by the content description generation model;
acquiring, according to a preset parameter matrix in the content description generation model, a predicted content theme used to describe the content of the video;
acquiring an annotated content theme used to describe the content of the video;
determining a second loss value of the content description generation model according to the predicted content theme and the annotated content theme;
determining a target loss value of the content description generation model according to the first loss value and the second loss value.
In some implementations, the acquiring, according to the preset parameter matrix in the content description generation model, the predicted content theme used to describe the content of the video includes:
acquiring a video feature of the video;
calculating the product of the video feature and the preset parameter matrix, and using the product as the predicted content theme.
In some implementations, the acquiring the video feature of the video includes:
acquiring multiple frames of video images in the video;
using a convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) network to acquire an image feature of each frame of the video images;
calculating an image average feature across the image features of the frames of video images;
determining the image average feature as the video feature.
In some implementations, the acquiring the annotated content theme used to describe the content of the video includes:
acquiring a stored annotated content description of the video;
splitting the annotated content description to obtain multiple description words;
determining the topic to which each of the description words belongs;
determining the annotated content theme according to the topic to which each of the description words belongs.
In some implementations, the determining the second loss value of the content description generation model according to the predicted content theme and the annotated content theme includes:
calculating a difference matrix between the matrix corresponding to the annotated content theme and the matrix corresponding to the predicted content theme;
calculating the square of a first norm of the difference matrix;
calculating a second norm of the preset parameter matrix;
calculating, according to the square and the second norm, the second loss value by the following formula:
L1=α*X1+γ*X2;
where, in the above formula, L1 is the second loss value, α is a first preset coefficient, γ is a second preset coefficient, X1 is the square, and X2 is the second norm.
In some implementations, the determining the target loss value of the content description generation model according to the first loss value and the second loss value includes:
calculating, according to the first loss value and the second loss value, the target loss value by the following formula:
L0=β*L1+λ*L2;
where, in the above formula, L0 is the target loss value, β is a third preset coefficient, λ is a fourth preset coefficient, L1 is the second loss value, and L2 is the first loss value.
In some implementations, the method further includes: optimizing the preset parameter matrix of the content description generation model according to the target loss value of the content description generation model.
In a second aspect, an embodiment of the present disclosure shows a device for acquiring a loss value of a content description generation model, the device including:
a first acquisition module, configured to acquire a first loss value of the content description generation model based on an annotated content description of a video and a predicted content description output by the content description generation model;
a second acquisition module, configured to acquire, according to a preset parameter matrix in the content description generation model, a predicted content theme used to describe the content of the video;
a third acquisition module, configured to acquire an annotated content theme used to describe the content of the video;
a first determining module, configured to determine a second loss value of the content description generation model according to the predicted content theme and the annotated content theme;
a second determining module, configured to determine a target loss value of the content description generation model according to the first loss value and the second loss value.
In some implementations, the second acquisition module includes:
a first acquisition unit, configured to acquire a video feature of the video;
a first calculation unit, configured to calculate the product of the video feature and the preset parameter matrix and use the product as the predicted content theme.
In some implementations, the first acquisition unit includes:
a first acquisition subunit, configured to acquire multiple frames of video images in the video;
a second acquisition subunit, configured to use a convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) network to acquire an image feature of each frame of the video images;
a calculation subunit, configured to calculate an image average feature across the image features of the frames of video images;
a determining subunit, configured to determine the image average feature as the video feature.
In some implementations, the third acquisition module includes:
a second acquisition unit, configured to acquire a stored annotated content description of the video;
a splitting unit, configured to split the annotated content description to obtain multiple description words;
a first determining unit, configured to determine the topic to which each of the description words belongs;
a second determining unit, configured to determine the annotated content theme according to the topic to which each of the description words belongs.
In some implementations, the first determining module includes:
a second calculation unit, configured to calculate a difference matrix between the matrix corresponding to the annotated content theme and the matrix corresponding to the predicted content theme;
a third calculation unit, configured to calculate the square of a first norm of the difference matrix;
a fourth calculation unit, configured to calculate a second norm of the preset parameter matrix;
a fifth calculation unit, configured to calculate, according to the square and the second norm, the second loss value by the following formula:
L1=α*X1+γ*X2;
where, in the above formula, L1 is the second loss value, α is a first preset coefficient, γ is a second preset coefficient, X1 is the square, and X2 is the second norm.
In some implementations, the second determining module is specifically configured to:
calculate, according to the first loss value and the second loss value, the target loss value by the following formula:
L0=β*L1+λ*L2;
where, in the above formula, L0 is the target loss value, β is a third preset coefficient, λ is a fourth preset coefficient, L1 is the second loss value, and L2 is the first loss value.
In a third aspect, an embodiment of the present disclosure shows a terminal, including: a memory, a processor, and a loss value acquisition program of a content description generation model stored on the memory and executable on the processor, where when the loss value acquisition program of the content description generation model is executed by the processor, the steps of the method for acquiring a loss value of a content description generation model described in the first aspect are implemented.
In a fourth aspect, an embodiment of the present disclosure shows a computer-readable storage medium on which a loss value acquisition program of a content description generation model is stored, where when the loss value acquisition program of the content description generation model is executed by a processor, the steps of the method for acquiring a loss value of a content description generation model described in the first aspect are implemented.
In a fifth aspect, an embodiment of the present disclosure shows a computer program product, the computer program product including a computer program, where the computer program includes program instructions and is stored on a computer-readable storage medium, and when the program instructions are executed by a processor, the steps of the method for acquiring a loss value of a content description generation model described in the first aspect are implemented.
Compared with the prior art, the embodiments of the present disclosure include the following advantages:
In the embodiments of the present disclosure, a first loss value of the content description generation model is acquired; a predicted content theme used to describe the content of a video is acquired according to a preset parameter matrix in the content description generation model; an annotated content theme used to describe the content of the video is acquired; a second loss value of the content description generation model is determined according to the predicted content theme and the annotated content theme; and a target loss value of the content description generation model is determined according to the first loss value and the second loss value.
According to the target loss value, the degree of error in the content description of the video generated by the content description generation model can be determined, and an optimization method suited to that degree of error can then be selected to optimize the preset parameter matrix in the content description generation model. This improves the sparsity of the preset parameter matrix in the content description generation model, that is, keeps the number of non-zero values in the preset parameter matrix as small as possible, so that the relationship between each dimension of the video feature of the video and the content theme of the video becomes clearer and the interpretability higher, making the correlation between the video feature of the video and the content theme of the content description of the video generated according to the content description generation model more significantly visible.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of the present disclosure. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Fig. 1 is a flowchart of the steps of an embodiment of a method for acquiring a loss value of a content description generation model of the present disclosure;
Fig. 2 is a structural block diagram of an embodiment of a device for acquiring a loss value of a content description generation model of the present disclosure;
Fig. 3 is a structural block diagram of an embodiment of a terminal of the present disclosure.
Detailed description
To make the above objects, features, and advantages of the present disclosure more comprehensible, the present disclosure is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, a flowchart of the steps of an embodiment of a method for acquiring a loss value of a content description generation model of the present disclosure is shown. The method may specifically include the following steps.
In step S101, a first loss value of the content description generation model is acquired.
The first loss value of the preset content description generation model may be acquired according to any conventional loss value acquisition method in the prior art. For example, the video is input into the content description generation model to obtain the predicted content description of the video output by the content description generation model; the annotated content description of the video is acquired; and the first loss value of the content description generation model is acquired according to the predicted content description and the annotated content description.
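The following minimal sketch (an editorial illustration, not part of the original application) shows one conventional choice for this first loss value: a token-level cross-entropy between the word scores of the predicted content description and the annotated content description. The tensor shapes, the function name, and the use of PyTorch are all assumptions.

```python
# Hypothetical sketch: token-level cross-entropy as the conventional first
# loss value. Shapes and names are assumptions, not taken from the patent.
import torch
import torch.nn.functional as F

def first_loss_value(pred_logits: torch.Tensor,
                     annotated_word_ids: torch.Tensor) -> torch.Tensor:
    # pred_logits: (seq_len, vocab_size) scores for each word position of the
    # predicted content description; annotated_word_ids: (seq_len,) word ids
    # of the annotated content description. Returns the scalar first loss.
    return F.cross_entropy(pred_logits, annotated_word_ids)
```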
In step S102, a predicted content theme used to describe the content of the video is acquired according to a preset parameter matrix in the content description generation model.
The content description generation model is used to generate the predicted content description of the video. The content description generation model includes a preset parameter matrix. To describe the correlation between the video feature of the video and the predicted content theme of the video, the predicted content theme used to describe the content of the video may be acquired according to the following process:
11) Acquire the video feature of the video.
Multiple frames of video images in the video are acquired. The video includes a large number of video images arranged in sequence, and frames may be sampled from them at equal intervals to obtain the multiple frames of video images, for example, 26 frames of video images. Then a CNN (Convolutional Neural Network) and a bidirectional LSTM (Long Short-Term Memory network) are used to acquire the image feature of each frame of video image. For example, for any one frame among the multiple frames of video images, the frame is input into the CNN to obtain the 1536-dimensional feature description of the frame output by the CNN; the 1536-dimensional feature description is input into the bidirectional LSTM to obtain two 256-dimensional feature descriptions; and the 1536-dimensional feature description and the two 256-dimensional feature descriptions are concatenated into a 2048-dimensional feature description, which serves as the image feature of the frame. The same is done for every other frame of the multiple frames of video images. Then the image average feature across the image features of the frames is calculated; for example, within the 2048-dimensional feature descriptions of the frames, the average of the values in the same dimension is computed to obtain the image average feature. The image average feature is then determined as the video feature of the video.
12) Calculate the product of the video feature and the preset parameter matrix, and use the product as the predicted content theme.
In the embodiments of the present disclosure, the content description generation model is used to generate the predicted content description of the video, and the predicted content theme is generated according to the predicted content description. The content description generation model includes a preset parameter matrix; to describe the correlation between the video feature of the video and the predicted content theme of the video, the product of the video feature and the preset parameter matrix may be calculated and used as the predicted content theme.
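The sketch below illustrates steps 11) and 12) under stated assumptions: `cnn` stands for any image backbone with a 1536-dimensional output (the application does not name a specific network), the bidirectional LSTM uses a hidden size of 256 so that its two directions yield the two 256-dimensional feature descriptions, and `num_topics` is an assumed hyperparameter. It is a sketch of the described flow, not a definitive implementation.

```python
# Hypothetical sketch of steps 11)-12): per-frame CNN + bidirectional LSTM
# features, dimension-wise averaging, and multiplication by the preset
# parameter matrix W to obtain the predicted content theme.
import torch
import torch.nn as nn

class VideoTopicPredictor(nn.Module):
    def __init__(self, cnn: nn.Module, num_topics: int):
        super().__init__()
        self.cnn = cnn  # assumed: maps a frame to a 1536-dim feature
        self.bilstm = nn.LSTM(input_size=1536, hidden_size=256,
                              bidirectional=True, batch_first=True)
        # Preset parameter matrix: maps the 2048-dim video feature to topics.
        self.W = nn.Parameter(torch.randn(2048, num_topics) * 0.01)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, H, W), e.g. 26 frames sampled at equal intervals
        f = self.cnn(frames)                               # (num_frames, 1536)
        ctx, _ = self.bilstm(f.unsqueeze(0))               # (1, num_frames, 512)
        # Concatenate the 1536-dim CNN feature and the two 256-dim LSTM
        # features of each frame into its 2048-dim image feature.
        per_frame = torch.cat([f, ctx.squeeze(0)], dim=1)  # (num_frames, 2048)
        video_feature = per_frame.mean(dim=0)              # image average feature
        # Predicted content theme: product of video feature and matrix W.
        return video_feature @ self.W                      # (num_topics,)
```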
In step S103, an annotated content theme used to describe the content of the video is acquired.
This step may be implemented through the following process:
21) Acquire the stored annotated content description of the video.
In the embodiments of the present disclosure, a technician may watch the content of the video in advance, summarize a content description of the video according to its content, use it as the annotated content description of the video, and then store the annotated content description of the video. Therefore, in this step, the stored annotated content description of the video can be acquired.
22) Split the annotated content description to obtain multiple description words.
The Chinese word segmentation system NLPIR may be used to segment the annotated content description to obtain the multiple description words included in the annotated content description.
23) Determine the topic to which each description word belongs.
In the embodiments of the present disclosure, a technician sets multiple topics in advance; for each topic, the description words used to describe the topic can be collected to form the description vocabulary set corresponding to the topic. Therefore, for any description word, the description vocabulary set including the description word can be looked up among the multiple description vocabulary sets, and the topic corresponding to that set is taken as the topic to which the description word belongs. The same is done for every other description word.
24) Determine the annotated content theme according to the topic to which each description word belongs.
In the embodiments of the present disclosure, among the topics to which the description words belong, the topic to which the largest number of description words belong may be determined as the annotated content theme. Of course, the annotated content theme may also be determined in other ways, which is not limited in the embodiments of the present disclosure.
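A sketch of steps 22) to 24) follows. Here jieba is used only as a readily available stand-in for the NLPIR segmentation system named above, and `topic_vocabularies` (the topic-to-description-vocabulary-set mapping compiled in advance by a technician) is an assumed data structure.

```python
# Hypothetical sketch of steps 22)-24): segment the annotated content
# description, look up each description word's topic, and take the topic
# with the most description words as the annotated content theme.
from collections import Counter

import jieba  # stand-in for the NLPIR Chinese word segmentation system

def annotated_content_theme(annotated_description: str,
                            topic_vocabularies: dict[str, set[str]]) -> str | None:
    words = jieba.lcut(annotated_description)   # the description words
    topics = []
    for word in words:
        # Find the description vocabulary set containing this word; its
        # topic is the topic to which the description word belongs.
        for topic, vocabulary in topic_vocabularies.items():
            if word in vocabulary:
                topics.append(topic)
                break
    if not topics:
        return None
    # The topic to which the largest number of description words belong.
    return Counter(topics).most_common(1)[0][0]
```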
In step S104, a second loss value of the content description generation model is determined according to the predicted content theme and the annotated content theme.
The difference matrix between the matrix corresponding to the annotated content theme and the matrix corresponding to the predicted content theme may be calculated; the square of a first norm of the difference matrix is calculated; a second norm of the preset parameter matrix is calculated; and then, according to the square and the second norm, the second loss value may be calculated by the following formula:
L1=α*X1+γ*X2;
where, in the above formula, L1 is the second loss value, α is a first preset coefficient, γ is a second preset coefficient, X1 is the square, and X2 is the second norm.
Here, α includes a value between 0.1 and 1, and γ includes a value such as 1, 1.001, or 1.0001.
In step S105, a target loss value of the content description generation model is determined according to the first loss value and the second loss value.
According to the first loss value and the second loss value, the target loss value may be calculated by the following formula:
L0=β*L1+λ*L2;
where, in the above formula, L0 is the target loss value, β is a third preset coefficient, λ is a fourth preset coefficient, L1 is the second loss value, and L2 is the first loss value.
Here, β includes a value such as 1, 1.001, or 1.0001, and λ includes a value such as 0.5, 0.51, or 0.501.
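The sketch below works the two formulas end to end. The application does not pin down which norms the "first norm" and "second norm" are; the Frobenius norm for the difference matrix and the entrywise L1 norm for the preset parameter matrix are assumptions consistent with the stated sparsity goal, and the default coefficients use example values from the text.

```python
# Hypothetical worked example of L1 = α*X1 + γ*X2 and L0 = β*L1 + λ*L2.
# Norm choices and default coefficient values are assumptions noted above.
import torch

def second_and_target_loss(annotated_topic: torch.Tensor,
                           predicted_topic: torch.Tensor,
                           W: torch.Tensor,
                           first_loss: torch.Tensor,
                           alpha: float = 0.5,   # first preset coefficient
                           gamma: float = 1.0,   # second preset coefficient
                           beta: float = 1.0,    # third preset coefficient
                           lam: float = 0.5):    # fourth preset coefficient
    diff = annotated_topic - predicted_topic   # difference matrix
    X1 = diff.norm() ** 2                      # square of the first norm
    X2 = W.abs().sum()                         # second norm of the matrix W
    L1 = alpha * X1 + gamma * X2               # second loss value
    L0 = beta * L1 + lam * first_loss          # target loss value
    return L1, L0
```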
In the embodiments of the present disclosure, a first loss value of the content description generation model is acquired; a predicted content theme used to describe the content of the video is acquired according to a preset parameter matrix in the content description generation model; an annotated content theme used to describe the content of the video is acquired; a second loss value of the content description generation model is determined according to the predicted content theme and the annotated content theme; and a target loss value of the content description generation model is determined according to the first loss value and the second loss value.
According to the target loss value, the degree of error in the content description of the video generated by the content description generation model can be determined, and an optimization method suited to that degree of error can then be selected to optimize the preset parameter matrix in the content description generation model. This improves the sparsity of the preset parameter matrix in the content description generation model, that is, keeps the number of non-zero values in the preset parameter matrix as small as possible, so that the relationship between each dimension of the video feature of the video and the content theme of the video becomes clearer and the interpretability higher, making the correlation between the video feature of the video and the content theme of the content description of the video generated according to the content description generation model more significantly visible.
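Finally, a minimal sketch of one optimization step on the preset parameter matrix, reusing the hypothetical `second_and_target_loss` above and assuming `W` is a trainable tensor (for example an `nn.Parameter`). SGD and the learning rate are assumed choices; the application leaves the optimization method open.

```python
# Hypothetical single optimization step of the preset parameter matrix W
# driven by the target loss value L0.
import torch

optimizer = torch.optim.SGD([W], lr=1e-3)  # W, topics, and first_loss as above

_, L0 = second_and_target_loss(annotated_topic, predicted_topic, W, first_loss)
optimizer.zero_grad()
L0.backward()      # gradient of the target loss value with respect to W
optimizer.step()   # the γ-weighted norm term pushes entries of W toward zero
```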
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations; however, those skilled in the art should know that the embodiments of the present disclosure are not limited by the described order of actions, because according to the embodiments of the present disclosure, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present disclosure.
Referring to Fig. 2, a structural block diagram of an embodiment of a device for acquiring a loss value of a content description generation model of the present disclosure is shown. The device may specifically include the following modules:
a first acquisition module 11, configured to acquire a first loss value of the content description generation model;
a second acquisition module 12, configured to acquire, according to a preset parameter matrix in the content description generation model, a predicted content theme used to describe the content of a video;
a third acquisition module 13, configured to acquire an annotated content theme used to describe the content of the video;
a first determining module 14, configured to determine a second loss value of the content description generation model according to the predicted content theme and the annotated content theme;
a second determining module 15, configured to determine a target loss value of the content description generation model according to the first loss value and the second loss value.
In some implementations, the second acquisition module 12 includes:
a first acquisition unit, configured to acquire a video feature of the video;
a first calculation unit, configured to calculate the product of the video feature and the preset parameter matrix and use the product as the predicted content theme.
In some implementations, the first acquisition unit includes:
a first acquisition subunit, configured to acquire multiple frames of video images in the video;
a second acquisition subunit, configured to use a convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) network to acquire an image feature of each frame of the video images;
a calculation subunit, configured to calculate an image average feature across the image features of the frames of video images;
a determining subunit, configured to determine the image average feature as the video feature.
In some implementations, the third acquisition module 13 includes:
a second acquisition unit, configured to acquire a stored annotated content description of the video;
a splitting unit, configured to split the annotated content description to obtain multiple description words;
a first determining unit, configured to determine the topic to which each description word belongs;
a second determining unit, configured to determine the annotated content theme according to the topic to which each description word belongs.
In some implementations, the first determining module 14 includes:
a second calculation unit, configured to calculate a difference matrix between the matrix corresponding to the annotated content theme and the matrix corresponding to the predicted content theme;
a third calculation unit, configured to calculate the square of a first norm of the difference matrix;
a fourth calculation unit, configured to calculate a second norm of the preset parameter matrix;
a fifth calculation unit, configured to calculate, according to the square and the second norm, the second loss value by the following formula:
L1=α*X1+γ*X2;
where, in the above formula, L1 is the second loss value, α is a first preset coefficient, γ is a second preset coefficient, X1 is the square, and X2 is the second norm.
In some implementations, the second determining module 15 is specifically configured to:
calculate, according to the first loss value and the second loss value, the target loss value by the following formula:
L0=β*L1+λ*L2;
where, in the above formula, L0 is the target loss value, β is a third preset coefficient, λ is a fourth preset coefficient, L1 is the second loss value, and L2 is the first loss value.
In the embodiments of the present disclosure, a first loss value of the content description generation model is acquired; a predicted content theme used to describe the content of a video is acquired according to a preset parameter matrix in the content description generation model; an annotated content theme used to describe the content of the video is acquired; a second loss value of the content description generation model is determined according to the predicted content theme and the annotated content theme; and a target loss value of the content description generation model is determined according to the first loss value and the second loss value.
According to the target loss value, the degree of error in the content description of the video generated by the content description generation model can be determined, and an optimization method suited to that degree of error can then be selected to optimize the preset parameter matrix in the content description generation model. This improves the sparsity of the preset parameter matrix in the content description generation model, that is, keeps the number of non-zero values in the preset parameter matrix as small as possible, so that the relationship between each dimension of the video feature of the video and the content theme of the video becomes clearer and the interpretability higher, making the correlation between the video feature of the video and the content theme of the content description of the video generated according to the content description generation model more significantly visible.
Since the device embodiment is basically similar to the method embodiment, its description is relatively simple; for relevant parts, refer to the corresponding description of the method embodiment.
The present disclosure also shows a terminal, which may include: a memory, a processor, and a loss value acquisition program of a content description generation model stored on the memory and executable on the processor; when the loss value acquisition program of the content description generation model is executed by the processor, the steps of any method for acquiring a loss value of a content description generation model in the present disclosure are implemented.
Fig. 3 is a block diagram of a terminal 600 according to an exemplary embodiment. For example, the terminal 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to Fig. 3, the terminal 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls the overall operation of the terminal 600, such as operations associated with display, telephone calls, data communication, camera operation, and recording operation. The processing component 602 may include one or more processors 620 to execute instructions to complete all or part of the steps of the above method for acquiring a loss value of a content description generation model. In addition, the processing component 602 may include one or more modules to facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operations at the terminal 600. Examples of such data include instructions for any application or method operated on the terminal 600, contact data, phone book data, messages, pictures, videos, and the like. The memory 604 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk.
The power component 606 provides power to the various components of the terminal 600. The power component 606 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal 600.
The multimedia component 608 includes a screen that provides an output interface between the terminal 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the terminal 600 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a microphone (MIC); when the terminal 600 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 604 or transmitted via the communication component 616. In some embodiments, the audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing the terminal 600 with status assessments of various aspects. For example, the sensor component 614 can detect the open/closed state of the terminal 600 and the relative positioning of components, for example the display and keypad of the terminal 600; the sensor component 614 can also detect a change in the position of the terminal 600 or of a component of the terminal 600, the presence or absence of user contact with the terminal 600, the orientation or acceleration/deceleration of the terminal 600, and the temperature change of the terminal 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the terminal 600 and other devices. The terminal 600 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the method for acquiring a loss value of a content description generation model; specifically, the method includes:
acquiring a first loss value of the content description generation model;
acquiring, according to a preset parameter matrix in the content description generation model, a predicted content theme used to describe the content of a video;
acquiring an annotated content theme used to describe the content of the video;
determining a second loss value of the content description generation model according to the predicted content theme and the annotated content theme;
determining a target loss value of the content description generation model according to the first loss value and the second loss value.
In some implementations, acquiring, according to the preset parameter matrix in the content description generation model, the predicted content theme used to describe the content of the video includes:
acquiring a video feature of the video;
calculating the product of the video feature and the preset parameter matrix, and using the product as the predicted content theme.
In some implementations, acquiring the video feature of the video includes:
acquiring multiple frames of video images in the video;
using a convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) network to acquire an image feature of each frame of the video images;
calculating an image average feature across the image features of the frames of video images;
determining the image average feature as the video feature.
In some implementations, acquiring the annotated content theme used to describe the content of the video includes:
acquiring a stored annotated content description of the video;
splitting the annotated content description to obtain multiple description words;
determining the topic to which each description word belongs;
determining the annotated content theme according to the topic to which each description word belongs.
In some implementations, determining the second loss value of the content description generation model according to the predicted content theme and the annotated content theme includes:
calculating a difference matrix between the matrix corresponding to the annotated content theme and the matrix corresponding to the predicted content theme;
calculating the square of a first norm of the difference matrix;
calculating a second norm of the preset parameter matrix;
calculating, according to the square and the second norm, the second loss value by the following formula:
L1=α*X1+γ*X2;
where, in the above formula, L1 is the second loss value, α is a first preset coefficient, γ is a second preset coefficient, X1 is the square, and X2 is the second norm.
In some implementations, determining the target loss value of the content description generation model according to the first loss value and the second loss value includes:
calculating, according to the first loss value and the second loss value, the target loss value by the following formula:
L0=β*L1+λ*L2;
where, in the above formula, L0 is the target loss value, β is a third preset coefficient, λ is a fourth preset coefficient, L1 is the second loss value, and L2 is the first loss value.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example, the memory 604 including instructions, which may be executed by the processor 620 of the terminal 600 to complete the above method for acquiring a loss value of a content description generation model. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like. When the instructions in the storage medium are executed by the processor of the terminal, the terminal is enabled to execute the steps of any method for acquiring a loss value of a content description generation model in the present disclosure.
In an exemplary embodiment, a computer program product is also provided; the computer program product includes a computer program, the computer program includes program instructions and is stored on a computer-readable storage medium, and when the program instructions are executed by the processor of a terminal, the terminal is enabled to execute the steps of any method for acquiring a loss value of a content description generation model in the present disclosure.
The method for acquiring a loss value of a content description generation model provided herein is not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the above description, the structure required to construct a system having the solution of the present disclosure is obvious. In addition, the present disclosure is not directed to any particular programming language. It should be understood that the content of the present disclosure described herein may be implemented using various programming languages, and the above description of a specific language is intended to disclose the best embodiment of the present disclosure.
Numerous specific details are set forth in the specification provided herein. However, it can be understood that the embodiments of the present disclosure may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the present disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present disclosure, the various features of the present disclosure are sometimes grouped together into a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the claims reflect, the inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the present disclosure.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from the embodiment. The modules or units or components in the embodiments may be combined into one module or unit or component, and in addition they may be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art can understand that although some embodiments herein include certain features included in other embodiments rather than other features, combinations of features of different embodiments are meant to be within the scope of the present disclosure and form different embodiments. For example, in the claims, any one of the claimed embodiments may be used in any combination.
The various component embodiments of the present disclosure may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the method for acquiring a loss value of a content description generation model according to the embodiments of the present disclosure. The present disclosure may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present disclosure may be stored on a computer-readable medium, or may have the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present disclosure, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present disclosure may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (16)

  1. A method for acquiring a loss value of a content description generation model, the method comprising:
    acquiring a first loss value of the content description generation model based on an annotated content description of a video and a predicted content description output by the content description generation model;
    acquiring, according to a preset parameter matrix in the content description generation model, a predicted content theme used to describe the content of the video;
    acquiring an annotated content theme used to describe the content of the video;
    determining a second loss value of the content description generation model according to the predicted content theme and the annotated content theme;
    determining a target loss value of the content description generation model according to the first loss value and the second loss value.
  2. The method according to claim 1, wherein the acquiring, according to the preset parameter matrix in the content description generation model, the predicted content theme used to describe the content of the video comprises:
    acquiring a video feature of the video;
    calculating the product of the video feature and the preset parameter matrix, and using the product as the predicted content theme.
  3. The method according to claim 2, wherein the acquiring the video feature of the video comprises:
    acquiring multiple frames of video images in the video;
    using a convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) network to acquire an image feature of each frame of the video images;
    calculating an image average feature across the image features of the frames of video images;
    determining the image average feature as the video feature.
  4. The method according to claim 1, wherein the acquiring the annotated content theme used to describe the content of the video comprises:
    acquiring a stored annotated content description of the video;
    splitting the annotated content description to obtain multiple description words;
    determining the topic to which each of the description words belongs;
    determining the annotated content theme according to the topic to which each of the description words belongs.
  5. The method according to claim 1, wherein the determining the second loss value of the content description generation model according to the predicted content theme and the annotated content theme comprises:
    calculating a difference matrix between the matrix corresponding to the annotated content theme and the matrix corresponding to the predicted content theme;
    calculating the square of a first norm of the difference matrix;
    calculating a second norm of the preset parameter matrix;
    calculating, according to the square and the second norm, the second loss value by the following formula:
    L1=α*X1+γ*X2;
    where, in the above formula, L1 is the second loss value, α is a first preset coefficient, γ is a second preset coefficient, X1 is the square, and X2 is the second norm.
  6. The method according to claim 5, wherein the determining the target loss value of the content description generation model according to the first loss value and the second loss value comprises:
    calculating, according to the first loss value and the second loss value, the target loss value by the following formula:
    L0=β*L1+λ*L2;
    where, in the above formula, L0 is the target loss value, β is a third preset coefficient, λ is a fourth preset coefficient, L1 is the second loss value, and L2 is the first loss value.
  7. The method according to claim 1, further comprising: optimizing the preset parameter matrix of the content description generation model according to the target loss value of the content description generation model.
  8. A device for acquiring a loss value of a content description generation model, the device comprising:
    a first acquisition module, configured to acquire a first loss value of the content description generation model based on an annotated content description of a video and a predicted content description output by the content description generation model;
    a second acquisition module, configured to acquire, according to a preset parameter matrix in the content description generation model, a predicted content theme used to describe the content of the video;
    a third acquisition module, configured to acquire an annotated content theme used to describe the content of the video;
    a first determining module, configured to determine a second loss value of the content description generation model according to the predicted content theme and the annotated content theme;
    a second determining module, configured to determine a target loss value of the content description generation model according to the first loss value and the second loss value.
  9. The device according to claim 8, wherein the second acquisition module comprises:
    a first acquisition unit, configured to acquire a video feature of the video;
    a first calculation unit, configured to calculate the product of the video feature and the preset parameter matrix and use the product as the predicted content theme.
  10. The device according to claim 9, wherein the first acquisition unit comprises:
    a first acquisition subunit, configured to acquire multiple frames of video images in the video;
    a second acquisition subunit, configured to use a convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) network to acquire an image feature of each frame of the video images;
    a calculation subunit, configured to calculate an image average feature across the image features of the frames of video images;
    a determining subunit, configured to determine the image average feature as the video feature.
  11. The device according to claim 8, wherein the third acquisition module comprises:
    a second acquisition unit, configured to acquire a stored annotated content description of the video;
    a splitting unit, configured to split the annotated content description to obtain multiple description words;
    a first determining unit, configured to determine the topic to which each of the description words belongs;
    a second determining unit, configured to determine the annotated content theme according to the topic to which each of the description words belongs.
  12. The device according to claim 8, wherein the first determining module comprises:
    a second calculation unit, configured to calculate a difference matrix between the matrix corresponding to the annotated content theme and the matrix corresponding to the predicted content theme;
    a third calculation unit, configured to calculate the square of a first norm of the difference matrix;
    a fourth calculation unit, configured to calculate a second norm of the preset parameter matrix;
    a fifth calculation unit, configured to calculate, according to the square and the second norm, the second loss value by the following formula:
    L1=α*X1+γ*X2;
    where, in the above formula, L1 is the second loss value, α is a first preset coefficient, γ is a second preset coefficient, X1 is the square, and X2 is the second norm.
  13. The device according to claim 12, wherein the second determining module is specifically configured to:
    calculate, according to the first loss value and the second loss value, the target loss value by the following formula:
    L0=β*L1+λ*L2;
    where, in the above formula, L0 is the target loss value, β is a third preset coefficient, λ is a fourth preset coefficient, L1 is the second loss value, and L2 is the first loss value.
  14. A terminal, comprising: a memory, a processor, and a loss value acquisition program of a content description generation model stored on the memory and executable on the processor, wherein when the loss value acquisition program of the content description generation model is executed by the processor, the steps of the method for acquiring a loss value of a content description generation model according to any one of claims 1 to 7 are implemented.
  15. A computer-readable storage medium, storing a loss value acquisition program of a content description generation model, wherein when the loss value acquisition program of the content description generation model is executed by a processor, the steps of the method for acquiring a loss value of a content description generation model according to any one of claims 1 to 7 are implemented.
  16. A computer program product, comprising a computer program, wherein the computer program comprises program instructions and is stored on a computer-readable storage medium, and when the program instructions are executed by a processor, the steps of the method for acquiring a loss value of a content description generation model according to any one of claims 1 to 7 are implemented.
PCT/CN2018/123955 2018-06-20 2018-12-26 Method and device for acquiring a loss value of a content description generation model WO2019242278A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810637242.X 2018-06-20
CN201810637242.XA CN108984628B (zh) 2018-06-20 2018-06-20 Method and device for acquiring a loss value of a content description generation model

Publications (1)

Publication Number Publication Date
WO2019242278A1 true WO2019242278A1 (zh) 2019-12-26

Family

ID=64541496

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/123955 WO2019242278A1 (zh) 2018-06-20 2018-12-26 Method and device for acquiring a loss value of a content description generation model

Country Status (2)

Country Link
CN (1) CN108984628B (zh)
WO (1) WO2019242278A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984628B (zh) 2018-06-20 2020-01-24 北京达佳互联信息技术有限公司 Method and device for acquiring a loss value of a content description generation model
CN110730381A (zh) * 2019-07-12 2020-01-24 北京达佳互联信息技术有限公司 Method, apparatus, terminal and storage medium for synthesizing a video based on a video template
CN111047187B (zh) * 2019-12-12 2023-10-17 浙江大搜车软件技术有限公司 Information matching processing method and apparatus, computer device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014179906A (ja) * 2013-03-15 2014-09-25 Nippon Telegr & Teleph Corp <Ntt> Video summarization device, video summarization method, and video summarization program
CN107122801A (zh) * 2017-05-02 2017-09-01 北京小米移动软件有限公司 Image classification method and apparatus
CN107391646A (zh) * 2017-07-13 2017-11-24 清华大学 Method and apparatus for extracting semantic information from video images
CN107908601A (zh) * 2017-11-01 2018-04-13 北京颐圣智能科技有限公司 Word segmentation model construction method and device for medical text, readable storage medium, and word segmentation method
CN108984628A (zh) * 2018-06-20 2018-12-11 北京达佳互联信息技术有限公司 Method and device for acquiring a loss value of a content description generation model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315670B (zh) * 2007-06-01 2010-08-11 清华大学 Specific subject detection device, and learning device and learning method thereof
US8145943B2 (en) * 2009-05-11 2012-03-27 Empire Technology Development Llc State variable-based detection and correction of errors
CN104572786A (zh) * 2013-10-29 2015-04-29 华为技术有限公司 Visual optimization processing method and apparatus for a random forest classification model
JP6518254B2 (ja) * 2014-01-09 2019-05-22 ドルビー ラボラトリーズ ライセンシング コーポレイション Spatial error metrics of audio content
CN104850818B (zh) * 2014-02-17 2018-05-18 华为技术有限公司 Face detector training method, face detection method and apparatus
CN107066973B (zh) * 2017-04-17 2020-07-21 杭州电子科技大学 Video content description method using a spatio-temporal attention model


Also Published As

Publication number Publication date
CN108984628B (zh) 2020-01-24
CN108984628A (zh) 2018-12-11

Similar Documents

Publication Publication Date Title
TWI781359B Face and hand association detection method and apparatus, electronic device, and computer-readable storage medium
US11048983B2 Method, terminal, and computer storage medium for image classification
WO2020134556A1 Image style transfer method and apparatus, electronic device, and storage medium
US20210133459A1 Video recording method and apparatus, device, and readable storage medium
JP6227766B2 Method, device and terminal equipment for changing emoticons in a chat interface
CN107644646B Speech processing method and apparatus, and apparatus for speech processing
JP6918181B2 Machine translation model training method, apparatus and system
CN109919829B Image style transfer method, apparatus and computer-readable storage medium
EP3176709A1 Video categorization method and apparatus, computer program and recording medium
WO2017031905A1 Social relationship analysis method and apparatus
WO2017088247A1 Input processing method, apparatus and device
WO2021031308A1 Audio processing method and apparatus, and storage medium
WO2017020482A1 Ticket information display method and apparatus
US20180365200A1 Method, device, electric device and computer-readable storage medium for updating page
WO2019242278A1 Method and device for acquiring a loss value of a content description generation model
EP3260998A1 Method and device for setting profile picture
CN114240882A Defect detection method and apparatus, electronic device and storage medium
WO2017054354A1 Information processing method and apparatus
WO2017092121A1 Information processing method and apparatus
CN111242303A Network training method and apparatus, image processing method and apparatus
CN111160047A Data processing method and apparatus, and apparatus for data processing
JP6085067B2 User data update method, device, program, and recording medium
WO2016197549A1 Search method and apparatus
CN109145151B Method and apparatus for acquiring emotion classification of a video
WO2019105243A1 Image processing method, apparatus and terminal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18923511

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25/03/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18923511

Country of ref document: EP

Kind code of ref document: A1