WO2024082914A1 - Video question answering method and electronic device

Video question answering method and electronic device

Info

Publication number
WO2024082914A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
question
feature
answer
text
Prior art date
Application number
PCT/CN2023/120449
Other languages
English (en)
French (fr)
Inventor
姚淅峰
许坤
陈开济
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2024082914A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783: Retrieval using metadata automatically derived from the content
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements using pattern recognition or machine learning
    • G06V10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions

Definitions

  • the present application relates to the field of terminal technology, and in particular to a video question-and-answer method and an electronic device.
  • the specific implementation of video question answering may include: obtaining text features from the question text, and visual features and semantic features of each frame image from the original video to be processed; obtaining a global visual representation of each frame image based on the text features, visual features and semantic features; and finally obtaining the answer to the question based on the text features and the global visual representations.
  • the above-mentioned video question-answering method requires obtaining a global visual representation of each frame image of the original video to be processed, which takes a lot of time to process the image data, resulting in a slow speed in obtaining the answer to the question.
  • the embodiments of the present application provide a video question-and-answer method and an electronic device, which can extract relevant video clips of a question text according to factors such as time, characters, and semantics implied in the question text, reduce the number of video frames containing implicit answers to the question, and further reduce the time for processing the relevant video clips, thereby increasing the speed of obtaining answers to the question, improving the efficiency of video question-and-answering, and improving the user experience.
  • a video question-answering method is provided, which is applied to an electronic device.
  • the electronic device obtains a target video and a user's question information.
  • the electronic device obtains at least one associated parameter in the question information, wherein the associated parameter includes one or more of a time associated parameter, an object associated parameter, or a semantic associated parameter.
  • the electronic device segments the target video according to the at least one associated parameter to obtain at least one video segment.
  • the electronic device obtains an answer to the question corresponding to the question information in the at least one video segment.
  • the electronic device displays the answer to the question.
  • the target video is a continuously recorded video stored in the video library.
  • the continuous time of the target video can be the maximum storage time of the video (such as 7 days, one week, one month, etc.), and the recording time of the target video is calculated from the current time according to the maximum storage time.
  • the target video has a large amount of data, and it is slow to get the answer to the question if the judgment is performed frame by frame.
  • each video segment in the at least one video segment is processed separately. Since each video segment contains independent semantics, mutual interference between the video segments can be avoided, which improves the accuracy of the answer to the question.
  • the electronic device obtains at least one associated parameter in the question information, which may include: first, the electronic device converts the question information into a question text. Then, the electronic device performs word segmentation on the question text to obtain a word vector. Furthermore, the electronic device inputs the word vector into a preset text encoding model to obtain a text feature. Next, the electronic device extracts the time feature and the object feature in the text feature. Finally, the electronic device obtains one or more of the semantic association parameters corresponding to the text feature, the time association parameters corresponding to the time feature, and the object association parameters corresponding to the object feature.
  • the association parameters include one or more of semantic association parameters corresponding to text features, time association parameters corresponding to time features, and object association parameters corresponding to object features.
  • the question information is obtained in response to text input, or in response to voice input, or in response to video input.
  • After obtaining the question information, the electronic device cleans and standardizes it to obtain the question text. Compared with the question information, the question text is easier to process as text and easier for a machine to recognize.
  • the preset text encoding model may be a BERT model.
  • the main input of the BERT model is the original word vector of each character/word (or token) in the text.
  • the original word vector may be a one-dimensional vector converted from each character in the text by querying a word vector table, or a vector obtained by pre-training a word vector model.
  • the output of the BERT model is a vector representation corresponding to each character/word in the text after integrating the semantic information of the entire text.
  • the text features and the semantic association parameters corresponding to the text features are obtained.
  • the above processing can improve the accuracy of the text features.
  • the time association parameters corresponding to the time features and the object association parameters corresponding to the object features are obtained. In this way, the accuracy of the obtained association parameters can be improved.
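  • For illustration only, the sketch below encodes a question into a text feature with a pretrained BERT model from the Hugging Face transformers library and mean-pools the token vectors; the checkpoint name and the pooling choice are assumptions, not details given by the application.
```python
# Minimal sketch (not from the application): encode the question text into a
# text feature with a pretrained BERT model. The checkpoint name and the
# mean-pooling step are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def encode_question(question: str) -> torch.Tensor:
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state: [1, seq_len, hidden]; mean-pool over tokens
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

text_feature = encode_question("Where did my mother put the keys this afternoon")
print(text_feature.shape)  # torch.Size([768])
```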
  • the time-related parameters include the video start time and the video end time.
  • different methods can be used to obtain the time-related parameters corresponding to the time feature, depending on whether the time feature is valid. Specifically, when the time feature is a valid feature, the time segment corresponding to the time feature is mapped according to a preset mapping rule to determine the video end time and the video start time; when the time feature is an invalid feature, the video end time is determined as the time when the question text was obtained, and the video start time is determined as the time that is a preset duration before the video end time.
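  • A minimal sketch of the valid/invalid branching just described; how a valid time feature such as "this afternoon" is parsed, and the one-day fallback window, are illustrative assumptions.
```python
# Sketch of the valid / invalid branching. The mapping of a valid time feature
# and the fallback look-back window are illustrative assumptions, not values
# fixed by the application.
from datetime import datetime, timedelta
from typing import Optional, Tuple

DEFAULT_LOOKBACK = timedelta(days=1)  # hypothetical preset duration

def time_association(valid_range: Optional[Tuple[datetime, datetime]],
                     question_time: datetime) -> Tuple[datetime, datetime]:
    """Return (video_start_time, video_end_time).

    valid_range is the result of mapping a valid time feature with a preset
    rule; None means the time feature is invalid.
    """
    if valid_range is not None:            # valid time feature
        return valid_range
    # invalid: end = moment the question text was obtained,
    #          start = end minus a preset duration
    return question_time - DEFAULT_LOOKBACK, question_time

asked_at = datetime(2023, 9, 20, 20, 0)
this_afternoon = (asked_at.replace(hour=12), asked_at.replace(hour=18))
print(time_association(this_afternoon, asked_at))
print(time_association(None, asked_at))
```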
  • when the object feature is a person category feature, the electronic device obtains the object association parameter corresponding to the object feature by: determining the user identity information based on the question text; determining, in a preset identity relationship table, the target person corresponding to the object feature based on the user identity information; and determining the target feature corresponding to the target person as the object association parameter.
  • the target feature includes at least one of the following: image feature, behavior feature and voiceprint feature.
  • the target feature can be extracted by a preset feature extraction algorithm based on pre-stored data such as the image and voice of the target person, and the preset feature extraction algorithm can be a residual network ResNet algorithm.
  • when the character classifier result contains a character category feature, the object feature is determined to be a character category feature.
  • the character category feature can represent character relationship categories such as dad, mom, me, grandpa and grandma.
  • the question text is "Mom's key", and the result of the character classifier is "Mom"; the question text is "Find the key", and the result of the character classifier is "Me".
  • the target features of the target person related to the question text can be obtained according to the preset identity relationship table and the user identity information, thereby improving the accuracy of the obtained "person" association parameters and ultimately improving the accuracy of the answer to the question.
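  • For illustration, the sketch below derives an image feature for a target person with a torchvision ResNet-50 backbone whose classification head has been removed; the exact extraction network, preprocessing, and the image file name are assumptions, since the application only names "a residual network ResNet algorithm".
```python
# Illustrative only: derive an image feature for a target person with a
# ResNet-50 backbone (classifier head removed).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

backbone = models.resnet50(weights="IMAGENET1K_V1")  # fetches pretrained weights
backbone.fc = torch.nn.Identity()                    # keep the 2048-d feature
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_feature(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(img).squeeze(0)

# Pre-stored target feature for a person in the identity relationship table.
target_features = {"mom": image_feature("mom.jpg")}  # hypothetical image file
```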
  • the electronic device determines the user identity information based on the question text, including any of the following: in the case of obtaining the question information in response to text input, the user identity information is determined as the identity information corresponding to the biometric feature used to start the electronic device; in the case of obtaining the question information in response to voice input, the user identity information is determined as the identity information corresponding to the voiceprint feature of the voice stream of the voice input; in the case of obtaining the question information in response to video input, the user identity information is determined as the identity information corresponding to the voiceprint feature and/or facial feature of the video stream of the video input.
  • the electronic device can select a target identity confirmation method according to the way the question text was obtained, and determine the identity information of the user who asked the question according to that method. If the user identity information cannot be confirmed, the person-related parameter is not output. The electronic device being unable to confirm the user identity information means that the user who input the question text is not within the preset face or preset voiceprint recognition range, or that the person classifier result has no target person corresponding to the user.
  • the biometric features, voiceprint information or facial features can be directly obtained to determine the user's identity information, thereby increasing the speed of determining the identity information.
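  • A data-structure sketch of how a preset identity relationship table might map the asking user and the person word in the question to a target person; the names and table contents are made up for illustration, only the lookup logic follows the description above.
```python
from typing import Optional

IDENTITY_TABLE = {
    # asking user -> {relationship word in the question -> target person}
    "xiaoming": {"me": "xiaoming", "mom": "lihua", "dad": "zhangwei"},
    "lihua":    {"me": "lihua", "grandma": "chen_lan"},
}

def resolve_target_person(user_id: Optional[str],
                          person_word: str) -> Optional[str]:
    """Map the person word in the question to a concrete target person.

    Returns None when the asking user cannot be identified or has no entry,
    in which case no object association parameter is output.
    """
    if user_id is None or user_id not in IDENTITY_TABLE:
        return None
    return IDENTITY_TABLE[user_id].get(person_word.lower())

print(resolve_target_person("xiaoming", "Mom"))  # -> lihua
print(resolve_target_person(None, "Mom"))        # -> None
```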
  • the electronic device segments the target video according to at least one associated parameter to obtain at least one video segment, which may include: obtaining an associated video corresponding to at least one associated parameter in the target video. Next, segmenting the associated video to obtain at least one video segment.
  • Part or all of the target video is determined as an associated video corresponding to at least one associated parameter.
  • the recording time corresponding to the associated video may be discontinuous.
  • At least one video segment may be segmented according to whether the recording time of the associated video is discontinuous.
  • the associated video is divided into at least one video segment, so that in the process of obtaining the answer to the question, each video segment in the at least one video segment is processed separately. Since each video segment contains independent semantics, mutual interference between the video segments can be avoided, and the accuracy of the answer to the question can be improved.
  • the electronic device obtains an associated video corresponding to at least one associated parameter in a target video, specifically comprising: firstly extracting a first video from the target video according to a time association parameter, then extracting a second video from the first video according to an object association parameter, and finally extracting an associated video from the second video according to a semantic association parameter.
  • the electronic device first uses the time association parameter as the video recording time period to extract the first video from the target video, and encodes each video frame in the first video to obtain the video frame feature of each video frame.
  • the electronic device then calculates the first feature similarity between the object association parameter (such as the target feature of the target person) and the video frame feature of each video frame. If the first feature similarity is greater than the preset threshold, the video frame is retained; if it is less than the preset threshold, the video frame is deleted. Each video frame in the first video is judged in turn until all are retained or deleted, which yields the second video. Finally, the electronic device calculates the second feature similarity between the semantic association parameter (the text feature) and the video frame feature of each video frame of the second video. If the second feature similarity is greater than the preset threshold, the video frame is retained; if it is less than the preset threshold, the video frame is deleted. Each video frame in the second video is judged in turn until all are retained or deleted, which yields the associated video.
  • the associated video in the target video is extracted according to the time association parameter, the object association parameter and the semantic association parameter, in order of extraction speed from fast to slow.
  • the step with the fastest processing speed, extracting the first video, is executed first, when the amount of data to be processed is largest.
  • the step with the slowest processing speed, extracting the associated video, is executed last, when the amount of data to be processed is smallest.
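  • The coarse-to-fine extraction just described can be sketched as follows; frame features are assumed to be pre-computed vectors, and cosine similarity with 0.5 thresholds stands in for the unspecified similarity measure and preset thresholds.
```python
# Sketch of the fast-to-slow extraction order: time filter, then object
# similarity filter, then semantic similarity filter.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def extract_associated_video(frames, start, end, object_feature, text_feature,
                             obj_thr=0.5, sem_thr=0.5):
    """frames: list of (timestamp, frame_feature) pairs, ordered by time."""
    # Step 1 (fastest, most data): keep frames recorded inside [start, end].
    first = [(t, f) for t, f in frames if start <= t <= end]
    # Step 2: keep frames similar enough to the object association parameter.
    second = [(t, f) for t, f in first if cosine(f, object_feature) > obj_thr]
    # Step 3 (slowest, least data): keep frames similar to the text feature.
    return [(t, f) for t, f in second if cosine(f, text_feature) > sem_thr]
```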
  • the electronic device segments the associated video to obtain at least one video segment, which may be: segmenting the associated video according to the video segmentation position to obtain at least one video segment.
  • the video segmentation position is: the position of adjacent video frames in the associated video where the recording time difference is greater than the preset time difference. It can be understood that in the process of determining the segmentation position, the video segments in the associated video where the recording time is not continuous can be determined first for judgment, thereby reducing the number of times the recording time difference corresponding to adjacent video frames is calculated, further reducing the amount of calculation required to obtain the answer to the question, and improving the speed of obtaining the answer to the question.
  • the video segments in the at least one video segment are discontinuous in time, so that each video segment contains independent semantics, which can avoid mutual interference between the video segments and improve the accuracy of the answer to the question.
  • after the associated video is segmented according to the video segmentation position to obtain at least one video segment, when the feature similarity between the feature means of adjacent video segments in the at least one video segment is greater than a preset threshold, the electronic device merges the adjacent video segments and regenerates the at least one video segment.
  • the feature mean may be obtained by summing the video frame features of all video frames in a video clip and dividing by the total number of video frames.
  • the feature similarity is the similarity between two feature means of adjacent video clips.
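  • A sketch of the gap-based segmentation and the merging of highly similar adjacent segments, assuming one timestamp and one feature vector per frame; the 5-second gap and 0.9 similarity threshold are placeholders.
```python
import numpy as np

def split_by_time_gap(frames, max_gap=5.0):
    """Split (timestamp, feature) frames wherever adjacent recording times
    differ by more than max_gap seconds."""
    if not frames:
        return []
    segments, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if cur[0] - prev[0] > max_gap:
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments

def merge_similar_neighbors(segments, sim_thr=0.9):
    """Merge adjacent segments whose feature means are highly similar."""
    def mean_feat(seg):
        return np.mean([f for _, f in seg], axis=0)
    merged = [segments[0]]
    for seg in segments[1:]:
        a, b = mean_feat(merged[-1]), mean_feat(seg)
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if sim > sim_thr:
            merged[-1] = merged[-1] + seg
        else:
            merged.append(seg)
    return merged
```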
  • the video question-and-answer method provided by the present application further includes: first, the electronic device extracts the answer quantity feature in the text feature. Then, the electronic device obtains the answer quantity association parameter corresponding to the answer quantity feature. Next, after the associated video is segmented to obtain at least one video segment, when the answer quantity association parameter is 1, the electronic device deletes the other segments except the first video segment in the at least one video segment, and the first video segment is: the last recorded video segment in the at least one video segment.
  • For example, the electronic device inputs the text feature into the character classifier, determines whether a character feature is included in the classifier result, makes a judgment based on the semantic association parameter, and then determines whether the answer quantity association parameter corresponding to the answer quantity feature is 1 or multiple.
  • the associated parameter includes an answer quantity associated parameter. When the answer quantity associated parameter is 1, the electronic device retains the video segment with the latest recording time in the at least one video segment; when the answer quantity associated parameter is multiple, the electronic device retains all the video segments in the at least one video segment.
  • the number of answers is related only to the question text and has nothing to do with the associated video; it is a pre-determined possible number of answers. Of course, the number of answers has a certain correlation with the answer to the question. For example, if the number of answers is one, the actual number of answers to the question may be zero or one; if the number of answers is multiple, the actual number of answers may be any natural number, such as 0, 1, 2, 3, etc.
  • the answer quantity association parameter is determined by the answer quantity feature.
  • when the answer quantity association parameter is 1, the at least one video clip can be further screened to retain only the video clip with the most recent recording time, so as to further reduce the amount of data in the video clips and increase the speed of obtaining the answer to the question.
  • the video question-and-answer method provided by the present application also includes: first, the electronic device extracts the answer ordinal feature in the text feature. Then, the electronic device obtains the answer ordinal association parameter corresponding to the answer ordinal feature. Next, after the associated video is segmented to obtain at least one video segment, when the answer quantity association parameter is multiple and is greater than or equal to the answer ordinal association parameter, the electronic device deletes the other segments in the at least one video segment except the second video segment, wherein the second video segment is the video segment corresponding to the answer ordinal association parameter in the at least one video segment, and the at least one video segment is arranged in chronological order.
  • At least one video clip is further screened and only the video clips corresponding to the answer ordinal associated parameters are retained, so as to further reduce the amount of data in the video clips and increase the speed of obtaining answers to questions.
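  • The two screening rules above can be sketched together as follows, assuming the segments are ordered by recording time and the ordinal is 1-based; "many" stands in for an answer quantity association parameter of multiple.
```python
def screen_segments(segments, answer_count, answer_ordinal=None):
    """Keep only the segments allowed by the answer quantity / ordinal rules."""
    if answer_count == 1:
        return segments[-1:]                       # most recently recorded one
    if answer_ordinal is not None and answer_ordinal <= len(segments):
        return [segments[answer_ordinal - 1]]      # the asked-for occurrence
    return segments                                # keep all segments

clips = ["clip_noon", "clip_3pm", "clip_5pm"]
print(screen_segments(clips, 1))                          # ['clip_5pm']
print(screen_segments(clips, "many", answer_ordinal=2))   # ['clip_3pm']
```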
  • the electronic device displays the answer to the question, specifically comprising: first, in the case where the question text, at least one video clip, or the answer to the question contains private information, verifying the user's identity information. Then, in the case where the identity verification passes, displaying the answer to the question.
  • whether information is private is relative to the user identity information. Assuming the question text is "safe password", the corresponding answer can be obtained for user identity information saved in the preset identity relationship table. For user identity information not saved in the preset identity relationship table (strangers relative to the people in the table), the corresponding answer cannot be obtained, in order to prevent strangers from achieving illegal purposes based on the obtained answer.
  • the electronic device displays the answer to the question, including at least one of the following: playing the answer to the question through voice playback; displaying the answer to the question through text display; playing a video clip corresponding to the answer to the question in at least one video clip through video display.
  • At least one display method is used to display the answers to the questions and increase the diversity of the displayed answers to the questions, so as to increase the probability of users knowing the answers to the questions.
  • the present application provides a video question-and-answer device, which includes an acquisition unit, a processing unit, and a display unit; the acquisition unit is used to acquire a target video and a user's question information.
  • the acquisition unit is also used to acquire at least one associated parameter in the question information, wherein the associated parameter includes one or more of a time associated parameter, an object associated parameter, or a semantic associated parameter.
  • the processing unit is used to segment the target video according to at least one associated parameter to obtain at least one video segment.
  • the display unit is used to acquire a question answer corresponding to the question information in at least one video segment; and display the question answer.
  • each video clip in the at least one video clip is processed separately. Since each video clip contains independent semantics, mutual interference between the video clips can be avoided, which can improve the accuracy of the answers to questions.
  • an electronic device comprising: a memory, one or more processors; the memory and the processor are coupled; wherein computer program code is stored in the memory, and the computer program code comprises computer instructions, and when the computer instructions are executed by the processor, the electronic device executes the video question-and-answer method described in any one of the above-mentioned first aspects.
  • a computer-readable storage medium comprising computer instructions.
  • When the computer instructions are executed on an electronic device, the electronic device executes the video question-and-answer method described in any one of the first aspects.
  • a computer program product is provided.
  • When the computer program product is run on a computer, the computer executes the video question-and-answer method described in any one of the first aspects.
  • For the beneficial effects that can be achieved by the electronic device described in the third aspect, the computer-readable storage medium described in the fourth aspect, and the computer program product described in the fifth aspect, reference may be made to the beneficial effects of the first aspect and any of its possible designs, which will not be repeated here.
  • FIG1 is a schematic diagram of a flow chart of a video question-answering method in a first related art
  • FIG2 is a schematic diagram of a flow chart of a video question-answering method in a second related art
  • FIG3 is one of the schematic diagrams of a video question-and-answer scenario shown in an embodiment of the present application.
  • FIG4 is a schematic diagram of the structure of a video question-answering system framework shown in an embodiment of the present application.
  • FIG5 is a schematic diagram of the hardware structure of an electronic device shown in an embodiment of the present application.
  • FIG6 is a schematic diagram of a software structure of an electronic device shown in an embodiment of the present application.
  • FIG. 7 is a flowchart of a video question-and-answer method according to an embodiment of the present application.
  • FIG8 is a schematic diagram of a model structure for obtaining associated parameters according to an embodiment of the present application.
  • FIG9 is a family member relationship diagram shown in an embodiment of the present application.
  • FIG10 is a second flow chart of a video question-answering method according to an embodiment of the present application.
  • FIG11 is a schematic diagram of a scene of a video clip shown in an embodiment of the present application.
  • FIG12 is a schematic diagram of the structure of a VQA model shown in an embodiment of the present application.
  • FIG13 is a second schematic diagram of a video question-and-answer scenario shown in an embodiment of the present application.
  • FIG14 is a third flow chart of a video question-answering method according to an embodiment of the present application.
  • FIG15 is a schematic diagram of a model structure for determining whether privacy information is contained, shown in an embodiment of the present application.
  • FIG16 is a fourth flow chart of a video question-and-answer method according to an embodiment of the present application.
  • FIG17 is a schematic diagram of the structure of a video question-answering device shown in an embodiment of the present application.
  • FIG. 18 is a schematic diagram of the structure of another electronic device according to an embodiment of the present application.
  • At least one of the following or its similar expressions refers to any combination of these items, including any combination of single items or plural items.
  • at least one of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, c can be single or multiple.
  • the words "first”, “second” and the like are used to distinguish between the same items or similar items with basically the same functions and effects.
  • the Recurrent Neural Network (RNN) model takes sequence data as input, performs recursion in the direction of sequence evolution, and all nodes are connected in a chain.
  • the Long-Short Term Memory (LSTM) model is a time-recurrent neural network model that can solve the long-term dependency problem of the RNN model.
  • the RNN model has only one state in a single loop structure, while the LSTM model has four states in a single loop structure.
  • the LSTM model maintains a persistent cell state that is continuously passed on to decide which information to forget or to keep. This avoids the gradient explosion or gradient vanishing that occurs during the recurrent process and that makes it impossible to process longer sequence data or to obtain information from long-distance data.
  • the Bidirectional Encoder Representations from Transformers (BERT) model is a language representation model that uses a masked language model to generate deep bidirectional language representations. Its purpose is to use pre-training on large-scale unlabeled text to obtain a representation of the text that contains rich semantic information.
  • the Residual Neural Network (ResNet) model is formed by skip connections that bypass some layers of an artificial neural network (ANN) and connect directly to neurons of a later layer, which can weaken the strong coupling between layers.
  • the more convolutional layers and pooling layers there are in the ANN, the more comprehensive the obtained image feature information and the better the learning effect.
  • However, gradient vanishing and gradient explosion may then occur; the ResNet model is proposed to avoid this phenomenon.
  • ASR: Automatic Speech Recognition.
  • the Transformer (TRM) model uses the attention mechanism to improve the model training speed. Its architecture relies on the attention mechanism together with a feedforward neural network. Its purpose is to take one language as input and output another, that is, to process the natural language obtained by ASR technology and extract valid text.
  • VQA: Visual Question Answering.
  • CV: Computer Vision.
  • NLP: Natural Language Processing.
  • CV is used to perform image recognition, image classification, target tracking and other processing on a given image
  • NLP is used to perform machine translation, information retrieval, and text summary generation on natural language.
  • the first video question-answering method is as shown in S101 to S106 below.
  • the second video question-answering method is shown in S201 to S213 below.
  • S202: Obtain the correlation score between the description information of each video frame and the question, sort the video frames in descending order of correlation score, and use the first M video frames after sorting as key frames.
  • S203: Obtain an audio vector corresponding to the video and a question vector corresponding to the question.
  • S209: Concatenate the vector identifiers of the target regions with the feature vectors corresponding to the key frame, and use the concatenation result as the vector representation of the key frame.
  • a fixed number of key frames can be selected through the video frame description and text, which can reduce the number of video frames for feature extraction, thereby increasing the speed of obtaining answers to questions.
  • However, the selected key frames may include image noise, so the key frame vector representations used may be insufficient to determine the answer to the question.
  • In addition, when the number of video frames is large, it still takes a long time to select key frames, so the answer to the question is still obtained slowly.
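  • For reference, a toy sketch of the top-M key-frame selection in S202 above, with the relevance scores between frame descriptions and the question assumed to be already computed.
```python
def select_key_frames(frame_scores, m):
    """frame_scores: list of (frame_index, relevance_score) pairs.
    Sort by score in descending order and keep the first M frames."""
    ranked = sorted(frame_scores, key=lambda x: x[1], reverse=True)
    return [idx for idx, _ in ranked[:m]]

print(select_key_frames([(0, 0.2), (1, 0.9), (2, 0.5), (3, 0.7)], m=2))  # [1, 3]
```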
  • an embodiment of the present application provides a video question and answer method, in which, first, a target video and a user's question text are obtained. Then, the associated parameters such as time, object (such as a person), semantics, and number of answers implied in the question text are obtained, and then the associated video corresponding to the associated parameters is obtained, and the associated video is segmented to obtain at least one video clip, and finally at least one video clip and the question text are input into the VQA model to obtain the answer to the question.
  • the semantic representation can be text or text vector.
  • the number of answers can be one or more.
  • the text processing process in the VQA model can adopt natural language processing models such as BERT model, LSTM model, word vector word2vec model, etc.
  • image processing network models such as Transformer network model and ResNet network can be used.
  • the object in the associated parameter can be a person or thing object, etc.
  • a person object refers to a text with person characteristics obtained through semantic recognition
  • a thing object refers to a text with thing characteristics obtained through semantic recognition.
  • Thing objects can be things that cannot move automatically (such as computers, houses, trees, etc.), and things that can move automatically (such as sweeping robots, pets, etc.).
  • the number of answers is related only to the question text and has nothing to do with the associated video; it is a pre-determined possible number of answers to the question. Of course, the number of answers has a certain correlation with the answer to the question. For example, if the number of answers is one, the actual number of answers to the question may be zero or one; if the number of answers is multiple, the actual number of answers may be any natural number less than the number of answers, such as 0, 1, 2, 3, etc.
  • user A asks the question text "Where did my mother put the keys this afternoon" to the intelligent assistant B, wherein the intelligent assistant B is able to shoot videos, receive voice messages, and play answers to questions.
  • Intelligent assistant B understands the time corresponding to "this afternoon", the object corresponding to "mom" (a person object), and the semantics corresponding to "find the keys", and then obtains the corresponding answer according to the actual meaning of the question text, for example, the keys are on the table.
  • Intelligent assistant B can also further understand the question text and determine the location of the key to be asked (that is, the current location of the key). Even if the mother puts the keys in 3 different places in the afternoon, there is only one answer to the question.
  • the associated parameters include time (this afternoon), target person (mom), semantics (find the keys), and number of answers (1).
  • the embodiment of the present application provides a video question-answering method, which obtains the associated video corresponding to the associated parameters. This can reduce the number of video frames that imply the answer to the question, and further reduce the time the VQA model needs to process the at least one video segment, so as to improve the speed of obtaining the answer to the question. At the same time, avoiding the interference of irrelevant videos with the answer can further improve the speed of obtaining the answer and the accuracy of the answer. Furthermore, the VQA model processes each video segment in the at least one video segment separately; since each video segment contains independent semantics, interference between the video segments can be avoided, which can improve the accuracy of the answers to questions.
  • the video question-and-answer method provided in the embodiment of the present application can be applied to a video question-and-answer system framework.
  • the video question-and-answer system framework includes an input/front-end processing module, an ASR speech recognition module, a camera module, and a video question-and-answer module.
  • the above-mentioned video question-and-answer system framework is used to answer questions raised by users.
  • Each module in the above-mentioned video question-and-answer system framework can be configured in the same electronic device.
  • the electronic device may be a device that supports voice recognition, such as a television, a laptop computer, a personal computer, a mobile phone, a tablet computer, a smart speaker, etc.
  • the embodiments of the present application do not impose any special restrictions on the specific form of the electronic device.
  • the various modules in the video question and answer system framework can also be configured in multiple electronic devices, with at least one module configured in each electronic device.
  • a voice processing module and a voice recognition module are configured in the first electronic device
  • a camera module is configured in the second electronic device
  • a video question and answer module is configured in the third electronic device.
  • a voice processing module, a voice recognition module, and a video question and answer module are configured in the fourth electronic device
  • a camera module is configured in the fifth electronic device.
  • Electronic devices configured with modules that have a data transmission relationship can communicate with each other. The embodiment of the present application does not limit the number of electronic devices corresponding to the video question and answer system framework.
  • the input/front-end processing module is used to process the input voice stream (user's question) into a preset data format, including: audio decoding, separation and noise reduction of the input voice stream using voiceprint or other features; extracting audio features through audio processing algorithms such as framing, windowing, short-time Fourier transform; and sending the audio vector corresponding to the audio feature to the ASR speech recognition module.
  • the input/front-end processing module is implemented on the terminal side.
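  • As a rough illustration of the framing, windowing and short-time Fourier transform steps mentioned above, the numpy sketch below computes per-frame magnitude spectra; the frame length, hop and window are typical values, not ones the application specifies.
```python
import numpy as np

def stft_features(signal: np.ndarray, frame_len: int = 400,
                  hop: int = 160) -> np.ndarray:
    """Frame the signal, apply a Hann window, and return the magnitude
    spectrum of each frame (a simple short-time Fourier transform)."""
    window = np.hanning(frame_len)
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

audio = np.random.randn(16000)        # one second of fake 16 kHz audio
print(stft_features(audio).shape)     # (98, 201)
```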
  • the ASR speech recognition module is used to obtain audio features and convert the input audio features into text through the acoustic model and the language model.
  • the typical implementation method includes: obtaining the phonemes corresponding to the acoustic features in the audio features through the acoustic model, obtaining the text corresponding to the language features in the audio features through the language model, and outputting the text corresponding to the speech stream by outputting the phonemes and the text in series.
  • the ASR speech recognition module can be implemented on the terminal side. In the embodiment of the present application, the speech recognition method adopted by the ASR speech recognition module is not limited.
  • the acoustic model and the language model are both neural network structures, and they are jointly trained during the model training process. Therefore, the text corresponding to the audio features output by the acoustic model and the language model is a sequence of Chinese characters.
  • the camera module is used to shoot videos of a specific area.
  • the camera module can save videos of preset duration.
  • the preset duration refers to the length of time from the current time in the shot video, such as 3 days, 7 days, one month or three months, etc.
  • the camera module can also perform image processing on the shot video, such as analyzing video semantics, extracting image features, extracting character features, extracting environmental features, etc.
  • the video question-answering module is used to receive text (question information) and video, and output the answer corresponding to the question information.
  • Typical methods include: extracting text features and video features, performing feature fusion, and obtaining the answer to the question by classification or generation.
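  • A toy sketch of the "extract features, fuse, classify" pipeline named above; the feature dimensions, fusion layers and fixed answer vocabulary are illustrative assumptions rather than the module's actual architecture.
```python
import torch
import torch.nn as nn

class SimpleVQAHead(nn.Module):
    """Concatenate a text feature and a video feature, then classify over a
    fixed answer vocabulary. All sizes are placeholders."""
    def __init__(self, text_dim=768, video_dim=2048, num_answers=1000):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + video_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_answers),   # classification over answer vocabulary
        )

    def forward(self, text_feat: torch.Tensor, video_feat: torch.Tensor):
        return self.fuse(torch.cat([text_feat, video_feat], dim=-1))

head = SimpleVQAHead()
logits = head(torch.randn(1, 768), torch.randn(1, 2048))
print(logits.shape)  # torch.Size([1, 1000])
```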
  • the above-mentioned video question-answering system framework can be applied in a voice assistant program to answer the questions contained in the voice text input by the user, by combining the voice text input by the user with the video shot by the camera module.
  • FIG5 shows a schematic diagram of the hardware structure of an electronic device, which can realize the function of the video question-answering module.
  • the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a camera 193, a display screen 194, and a subscriber identification module (SIM) card interface 195, etc.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 100.
  • the electronic device 100 may include more or fewer components than shown in the figure, or combine some components, or split some components, or arrange the components differently.
  • the components shown in the figure may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processor (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • Different processing units may be independent devices or integrated in one or more processors.
  • the controller can be the nerve center and command center of the electronic device 100.
  • the controller can generate an operation control signal according to the instruction operation code and the timing signal, and complete the control of instruction fetching and execution.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • the memory in the processor 110 is a cache memory.
  • the memory may store instructions or data that the processor 110 has just used or cyclically used. If the processor 110 needs to use the instruction or data again, it may be directly called from the memory. This avoids repeated access, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.
  • the processor 110 can obtain the question text and the target video; input the question text into the encoder, output the associated parameters of the question text, and the associated parameters include: time, object, semantics and number of answers; obtain the associated video corresponding to the associated parameters, and then segment the associated video to obtain at least one video clip, and finally input the at least one video clip and the question text into the VQA model to obtain the answer to the question.
  • the question text is obtained by converting the voice stream through ASR, or the question text can also be obtained directly as text input; the video library includes continuously recorded videos.
  • the object can be a person object, a thing object, etc.
  • the processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the USB interface 130 is an interface that complies with the USB standard specification, and specifically can be a Mini USB interface, a Micro USB interface, a USB Type C interface, etc.
  • the USB interface 130 can be used to connect a charger to charge the electronic device 100, and can also be used to transmit data between the electronic device 100 and a peripheral device. It can also be used to connect headphones to play audio through the headphones.
  • the interface can also be used to connect other electronic devices, such as AR devices, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration and does not constitute a structural limitation on the electronic device 100.
  • the electronic device 100 may also adopt different interface connection methods in the above embodiments, or a combination of multiple interface connection methods.
  • the USB interface 130 may be used to transmit data such as voice streams, question information, target videos, associated videos, and question answers.
  • the wireless communication function of the electronic device 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor and the baseband processor.
  • Antenna 1 and antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in electronic device 100 can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve the utilization of antennas.
  • antenna 1 can be reused as a diversity antenna for a wireless local area network.
  • the antenna can be used in combination with a tuning switch.
  • the mobile communication module 150 can provide solutions for wireless communications including 2G/3G/4G/5G, etc., applied to the electronic device 100.
  • the mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), etc.
  • the mobile communication module 150 may receive electromagnetic waves from the antenna 1, and perform filtering, amplification, and other processing on the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
  • the mobile communication module 150 may also amplify the signal modulated by the modulation and demodulation processor, and convert it into electromagnetic waves for radiation through the antenna 1.
  • at least some of the functional modules of the mobile communication module 150 may be arranged in the processor 110.
  • at least some of the functional modules of the mobile communication module 150 may be arranged in the same device as at least some of the modules of the processor 110.
  • the wireless communication module 160 can provide wireless communication solutions including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) network), bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR) and the like applied to the electronic device 100.
  • the wireless communication module 160 can be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates the frequency of the electromagnetic wave signal and performs filtering processing, and sends the processed signal to the processor 110.
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, modulate the frequency of the signal, amplify the signal, and convert it into electromagnetic waves for radiation through the antenna 2.
  • the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology.
  • the technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc.
  • antenna 1 of electronic device 100 is coupled to mobile communication module 150, and antenna 2 is coupled to wireless communication module 160, so that electronic device 100 can be used to transmit data such as voice stream, question information, target video, associated video and answer to question.
  • the electronic device 100 implements the display function through a GPU, a display screen 194, and an application processor.
  • the GPU is a microprocessor for image processing, which connects the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, etc.
  • the display screen 194 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode or an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), Miniled, MicroLed, Micro-oLed, quantum dot light-emitting diodes (QLED), etc.
  • the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the display screen 194 may be used to display data such as question text, associated video, and answer to the question.
  • the electronic device 100 can realize the shooting function through ISP, camera 193, video codec, GPU, display screen 194 and application processor.
  • the ISP is used to process the data fed back by the camera 193. For example, when taking a photo, the shutter is opened, and the light is transmitted to the camera photosensitive element through the lens. The light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converts it into an image visible to the naked eye.
  • the ISP can also perform algorithm optimization on the noise, brightness, and skin color of the image. The ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP can be set in the camera 193.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and projects it onto the photosensitive element.
  • the photosensitive element can be a charge coupled device (CCD) or a complementary metal oxide semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then passes the electrical signal to the ISP to be converted into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • the DSP converts the digital image signal into an image signal in a standard RGB, YUV or other format.
  • the electronic device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
  • the camera 193 can be used to continuously record videos to form videos in the video library.
  • the target video is the video in the video library.
  • the digital signal processor is used to process digital signals, and can process not only digital image signals but also other digital signals. For example, when the electronic device 100 is selecting a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
  • Video codecs are used to compress or decompress digital videos.
  • the electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record videos in a variety of coding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
  • NPU is a neural network (NN) computing processor.
  • Through the NPU, applications such as intelligent cognition of the electronic device 100 can be realized, for example image recognition, face recognition, voice recognition, text understanding, etc.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music and videos can be stored in the external memory card.
  • the internal memory 121 may be used to store computer executable program codes, which may include instructions.
  • the processor 110 executes various functional applications and data processing of the electronic device 100 by running the instructions stored in the internal memory 121.
  • the internal memory 121 may include a program storage area and a data storage area.
  • the program storage area may store an operating system, an application required for at least one function (such as a sound playback function, an image playback function, etc.), etc.
  • the data storage area may store data created during the use of the electronic device 100 (such as audio data, a phone book, etc.), etc.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one disk storage device, a flash memory device, a universal flash storage (UFS), etc.
  • the external memory interface 120 and the internal memory 121 may be used to store data such as voice streams, question information, target videos, associated videos, and question answers.
  • the electronic device 100 can implement audio functions such as music playing and recording through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone jack 170D, and the application processor.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signals.
  • the audio module 170 can also be used to encode and decode audio signals.
  • the audio module 170 can be arranged in the processor 110, or some functional modules of the audio module 170 can be arranged in the processor 110.
  • The speaker 170A, also called a "loudspeaker", is used to convert an audio electrical signal into a sound signal.
  • the electronic device 100 can listen to music or listen to a hands-free call through the speaker 170A.
  • The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.
  • the voice can be received by placing the receiver 170B close to the human ear.
  • Microphone 170C, also called a "mike" or "mic", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to microphone 170C to input the sound signal into microphone 170C.
  • the electronic device 100 can be provided with at least one microphone 170C. In other embodiments, the electronic device 100 can be provided with two microphones 170C, which can not only collect sound signals but also realize noise reduction function. In other embodiments, the electronic device 100 can also be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify the sound source, realize directional recording function, etc.
  • the earphone interface 170D is used to connect a wired earphone.
  • the earphone interface 170D may be the USB interface 130, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • the receiver 170B and the microphone 170C may be used to receive a voice stream input by a user
  • the audio module 170 may be used to convert the voice stream through ASR to obtain a question text
  • the speaker 170A may be used to play the answer to the question.
  • the SIM card interface 195 is used to connect a SIM card.
  • the SIM card can be connected to and separated from the electronic device 100 by inserting it into the SIM card interface 195 or pulling it out from the SIM card interface 195.
  • the electronic device 100 can support 1 or N SIM card interfaces, where N is a positive integer greater than 1.
  • the SIM card interface 195 can support Nano SIM cards, Micro SIM cards, SIM cards, and the like. Multiple cards can be inserted into the same SIM card interface 195 at the same time. The types of the multiple cards can be the same or different.
  • the SIM card interface 195 can also be compatible with different types of SIM cards.
  • the SIM card interface 195 can also be compatible with external memory cards.
  • the electronic device 100 interacts with the network through the SIM card to implement functions such as calls and data communications.
  • the electronic device 100 uses an eSIM, i.e., an embedded SIM card.
  • the eSIM card can be embedded in the electronic device 100 and cannot be separated from the electronic device 100.
  • the antenna 1 of the electronic device 100 and the mobile communication module 150 are coupled, so that the electronic device 100 can be used to transmit data such as voice streams, question texts, videos in the video library, and answers to questions.
  • the antenna 1 of the electronic device 100 can be coupled to the mobile communication module 150, and the antenna 2 can be coupled to the wireless communication module 160, so that the electronic device 100 obtains the voice stream, question information, question text, or target video in the video library through wireless communication technology.
  • the electronic device 100 can also obtain the voice stream, question text and target video through the USB interface 130.
  • the voice stream can also be obtained by the receiver 170B and the microphone 170C in response to the user input, and correspondingly, the question text can also be obtained by converting the voice stream through ASR by the audio module 170.
  • the video in the video library can also be a video continuously recorded by the camera 193.
  • the processor 110 can obtain the question text and target video corresponding to the question information; input the question text into the encoder, and output the associated parameters of the question text, and the associated parameters include: time, object, semantics and number of answers; obtain the associated video corresponding to the associated parameters, and then segment the associated video to obtain at least one video segment, and finally input the at least one video segment and the question text into the VQA model to obtain the answer to the question.
  • the display screen 194 can be used to display data such as the question text, target video, and question answer.
  • the display screen 194 can also be used to display video clips corresponding to the question answer.
  • the software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture.
  • the Android system of the layered architecture is taken as an example to exemplify the software structure of the electronic device 100.
  • FIG. 6 is a software structure block diagram of the electronic device according to an embodiment of the present invention.
  • the layered architecture divides the software into several layers, each with a clear role and division of labor.
  • the layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, from top to bottom: the application layer, the application framework layer, the Android runtime (Android Runtime) and system library, and the kernel layer.
  • the application layer may include a series of application packages.
  • the application packages may include camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message and other applications.
  • an application corresponding to the video question-answering method can be started based on the application layer to obtain the answer to the question corresponding to the question text.
  • the application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, etc.
  • the window manager is used to manage window programs.
  • the window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make it accessible to applications.
  • the data may include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying images, etc.
  • the view system can be used to build applications.
  • a display interface can be composed of one or more views.
  • a display interface including a text notification icon can include a view for displaying text and a view for displaying images.
  • the resource manager provides various resources for applications, such as localized strings, icons, images, layout files, video files, and so on.
  • the notification manager enables applications to display notification information in the status bar. It can be used to convey notification-type messages and can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also be a notification that appears in the system top status bar in the form of a chart or scroll bar text, such as notifications of applications running in the background, or a notification that appears on the screen in the form of a dialog window. For example, a text message is displayed in the status bar, a prompt sound is emitted, an electronic device vibrates, an indicator light flashes, etc.
  • the application framework layer can control the display area size of the display screen 194 through the window manager.
  • the application framework layer can also store or obtain data such as question text, videos in the video library, answers to questions, and video clips corresponding to the answers to questions through the content provider.
  • the application framework layer can also display visual controls for indicating whether to display data such as question text, videos in the video library, answers to questions, and video clips corresponding to the answers to questions through the view system.
  • the videos in the video library can be transmitted to the content provider through the resource manager.
  • Android Runtime includes the core library and the virtual machine. Android Runtime is responsible for the scheduling and management of the Android system.
  • the core library contains two parts: one is the function that the Java language needs to call, and the other is the Android core library.
  • the application layer and the application framework layer run in a virtual machine.
  • the virtual machine executes the Java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library may include multiple functional modules, such as surface manager, media library, 3D graphics processing library (such as OpenGL ES), 2D graphics engine (such as SGL), etc.
  • the surface manager is used to manage the display subsystem and provide the fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as static image files, etc.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, synthesis and layer processing.
  • a 2D graphics engine is a drawing engine for 2D drawings.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer includes at least display driver, camera driver, audio driver, and sensor driver.
  • the USB interface 130, antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, external memory interface 120, internal memory 121, audio module 170, speaker 170A, receiver 170B, microphone 170C, headphone interface 170D, camera 193, and SIM card interface 195 shown in FIG. 5 need to be driven by corresponding drivers in the kernel layer, so as to transmit data such as the voice stream, question information, question text, target video, and question answer, to display data such as the question text, target video, and question answer, or to continuously record video.
  • the following is an example of the workflow of the software and hardware of the electronic device 100 in conjunction with a video question-and-answer scenario.
  • the electronic device 100 can drive the antenna 1 and the mobile communication module 150 to couple through the kernel layer, and drive the antenna 2 and the wireless communication module 160 to couple, so that the electronic device 100 obtains the voice stream, question information, question text or target video through wireless communication technology.
  • the electronic device 100 can also drive the USB interface 130 through the kernel layer to obtain the voice stream, question text and video in the video library.
  • the electronic device 100 can also drive the receiver 170B and the microphone 170C through the kernel layer to obtain the voice stream in response to the user input, and accordingly, the electronic device 100 can also drive the audio module 170 through the kernel layer to convert the voice stream through ASR to obtain the question text.
  • the electronic device 100 can also drive the camera 193 through the kernel layer to continuously record the videos in the video library.
  • the external storage device connected to the external memory interface 120 of the electronic device, or the internal memory 121 stores the question text and the video in the video library through the content provider in the application framework layer.
  • the processor 110 of the electronic device obtains the question text and the target video through the content provider in the application framework layer; inputs the question text into the encoder through Android Runtime and the system library, and outputs the associated parameters of the question text, the associated parameters including: time, object, semantics and number of answers; obtains the associated video corresponding to the associated parameters in the target video, divides the associated video to obtain at least one video clip, inputs at least one video clip and the question text into the VQA model, and obtains the answer to the question.
  • the electronic device sets the display parameters through the window manager of the application framework layer, and manages them through the surface manager and display subsystem of the system library, and finally drives the display screen 194 through the kernel layer to display the question text, target video, question answer and the video clip corresponding to the question answer and other data.
  • the embodiment of the present application provides a video question-answering method, and the following two aspects need to be considered during the implementation process:
  • the videos used to find answers to questions often last for several days or months, so it is necessary to obtain the videos associated with the question text in order to accurately understand the associated videos and get the answers to the questions. Therefore, when extracting the associated videos, it is necessary to consider the associated parameters such as time, characters, semantics, and number of answers implied in the question text.
  • a question text may correspond to multiple video clips. If the associated video is taken as a whole to find the answer to the question corresponding to the question text, the multiple video clips in the associated video may interfere with each other. Therefore, the associated video can be segmented to obtain multiple video clips, and each video clip can be understood separately. In particular, when there are multiple answers to a question text, in the process of finding the answer to the question text, the mutual interference between multiple video clips may cause a misunderstanding of the video, resulting in the answer to the question being inconsistent with the question text.
  • the video question-answering method provided by the embodiment of the present application is described below by taking a mobile phone as an example of an electronic device. As shown in FIG. 7, the method may include the following steps S701-S704.
  • S701. The electronic device obtains the associated parameters in the question text.
  • the electronic device may also obtain the user's question information.
  • Obtaining the associated parameters in the question text may also be understood as the electronic device obtaining at least one associated parameter in the question information, the at least one associated parameter including one or more of a time associated parameter, an object associated parameter, and a semantic associated parameter.
  • the question text corresponds to the question information
  • the question information refers to the video, audio, and text (which can be written text or spoken text) including the question.
  • the question text refers to the text including the question.
  • the electronic device may obtain question information in response to text input by the user.
  • the electronic device may also obtain a voice stream in response to voice input by the user, and then convert the voice stream into question information through voice recognition technology.
  • the electronic device may also obtain question information by responding to video input by the user.
  • the electronic device may also obtain question information from other electronic devices in a wired or wireless manner.
  • the source of the question information is not limited.
  • the specific implementation method of the electronic device obtaining the associated parameters may include: first, the electronic device converts the question information into a question text. Then, the electronic device performs word segmentation on the question text to obtain a word vector. Furthermore, the electronic device inputs the word vector into a preset text encoding model to obtain text features. Next, the electronic device extracts time features and object features from the text features. Finally, the electronic device obtains at least one associated parameter in the question information.
  • the association parameters include one or more of semantic association parameters corresponding to text features, time association parameters corresponding to time features, and object association parameters corresponding to object features.
  • the electronic device cleans and standardizes the question information to obtain the question text.
  • One or more of the text features, object features and time features may be extracted from the question text.
  • if a feature cannot be extracted from the question text, the electronic device can assign the corresponding feature a preset parameter, which identifies the feature as an invalid feature.
  • the preset parameters are parameters that do not interfere with the process of obtaining the answer to the question.
  • the following describes the association parameters in the question information by taking, as an example, the case where the object association parameter is a person association parameter.
  • the electronic device can segment the question text, and input the word vector corresponding to the segmentation into a text encoder (such as a BERT model), encode the text through the text encoder, obtain text features, and then extract the time features in the text features, and then input the non-time features in the text features into a person classifier, extract the person features and the number of answers features, and finally obtain the time association parameters corresponding to the time features, the person association parameters corresponding to the person features, and the number of answers association parameters corresponding to the number of answers features.
  • the text features are the semantic association parameters in the association parameters.
  • the text encoder can adopt the BERT model, and can also adopt the LSTM model.
  • the main input of the BERT model is the original word vector of each character/word (or token) in the text.
  • the original word vector can be converted into a one-dimensional vector by querying the word vector table for each word in the text, or it can be a vector obtained after pre-training using the word vector model.
  • the output of the BERT model is the vector representation of each character/word in the text after integrating the semantic information of the whole text.
  • obtaining text features that are more similar to the actual question from the question text can improve the accuracy of the semantic association parameters and further improve the accuracy of the answer to the question.
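  • The following is a minimal sketch (not part of the original embodiment) of this text-encoding step, assuming a BERT-style encoder from the Hugging Face transformers library; the checkpoint name, the example question, and the use of per-token hidden states as the text features are illustrative assumptions.

```python
# Illustrative sketch only: encode the question text into per-token text features
# with a BERT-style encoder; checkpoint name and pooling choice are assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.eval()

def encode_question(question_text: str) -> torch.Tensor:
    """Segment the question text into tokens and return one feature vector per token."""
    inputs = tokenizer(question_text, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Per-token vectors fused with whole-sentence semantics; these act as the text
    # features (the semantic association parameter) described above.
    return outputs.last_hidden_state.squeeze(0)

text_features = encode_question("Where did you put your keys this afternoon?")
```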
  • the electronic device can input the word segmentation features corresponding to each text segmentation in the question text into the time classifier (marking the time-related words in the input text in a sequence annotation manner) to extract the time features in the text features.
  • training in the sequence annotation manner can improve the accuracy of the obtained time features, which in turn improves the accuracy of the obtained time association parameters and, finally, the accuracy of the obtained answer to the question.
  • the time feature may exist (valid feature) or may not exist (invalid feature).
  • the electronic device may determine the time-related parameters corresponding to the time feature according to whether the time feature is a valid feature or an invalid feature.
  • the time-related parameters include the video start time and the video end time.
  • the time segmentation word corresponding to the time feature is mapped according to the preset mapping rule to determine the video start time and the video end time. It can be understood that if there is a time-related word in the question text, the world time mapped by the time-related word is determined as the time-related parameter.
  • the question text is "Where did you put your keys this afternoon?"
  • the classification result of the time classifier is "BIIIO OOOO"
  • B represents the first word of the time-related word
  • I represents an internal word of the time-related word
  • O represents a non-time word.
  • the time-related word "this afternoon” corresponding to the question text is obtained, and according to the time-related word "this afternoon”, the time-related parameter is determined to be "from 12:00 on September 7, 2022 to 18:00 on September 7, 2022" through mapping.
  • the video end time is determined as: the time when the question text is obtained
  • the video start time is determined as: the time at which the preset time length is separated from the video end time. It can be understood that if there is no time-related word in the question text, the time-related parameter is determined as N hours before the time when the question text is obtained. Among them, N is greater than 0 and N is less than the total time length of the videos in the video library.
  • the question text is "find the key”, and the classification result of the time classifier is "OOO".
  • the classification result "OOO” it is determined that there is no corresponding time-related word in the question text.
  • the acquisition time of the question text is obtained (15:30 on September 7, 2022), and the time-related parameter is determined to be "12:30 on September 7, 2022 to 15:30 on September 7, 2022", where N is 3 hours.
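  • A minimal sketch of the mapping from the time classifier output to the time association parameter is given below; the BIO tags are assumed to be already predicted, and the "afternoon" mapping rule and the 3-hour default are illustrative values taken from the examples above.

```python
# Illustrative sketch only: map BIO-tagged time words to (video_start_time, video_end_time).
from datetime import datetime, timedelta

def time_association_parameter(tokens, bio_tags, query_time, default_hours=3):
    """Return the (video_start_time, video_end_time) window for filtering the target video."""
    time_words = [tok for tok, tag in zip(tokens, bio_tags) if tag in ("B", "I")]
    if not time_words:
        # No time-related word: default to the N hours before the question was asked.
        return query_time - timedelta(hours=default_hours), query_time
    phrase = " ".join(time_words)
    # Preset mapping rule (illustrative): map the time phrase to a world-time window.
    if "afternoon" in phrase:
        start = query_time.replace(hour=12, minute=0, second=0, microsecond=0)
        return start, start.replace(hour=18)
    # Fallback for phrases not covered by the illustrative rules.
    return query_time - timedelta(hours=default_hours), query_time

start, end = time_association_parameter(
    ["this", "afternoon", "where", "did", "you", "put", "your", "keys"],
    ["B", "I", "O", "O", "O", "O", "O", "O"],
    datetime(2022, 9, 7, 15, 30),
)
```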
  • the electronic device inputs the non-time features in the text features into the person classifier to extract the person category feature (i.e., the object feature whose object association parameter is a person association parameter).
  • the electronic device obtains the object association parameter corresponding to the object feature, specifically including: the electronic device determines the user identity information according to the question text; in the preset identity relationship table, the electronic device determines the target person corresponding to the object feature according to the user identity information; the electronic device determines the target feature corresponding to the target person as the object association parameter, and the target feature includes at least one of the following: image feature, behavior feature and voiceprint feature.
  • the electronic device selects the target person corresponding to the character category features in the family member relationship table. Then the target features of the target person are determined as character association parameters.
  • the target features include image features, behavior features, and voiceprint features.
  • the target features can be extracted by a preset feature extraction algorithm based on pre-stored data such as the portrait and voice of the target person, and the preset feature extraction algorithm can be a residual network ResNet algorithm.
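  • The following is a minimal sketch of extracting an image feature for a target person with a ResNet backbone from torchvision; the ImageNet preprocessing values and the portrait file name are illustrative assumptions rather than the preset feature extraction algorithm of the embodiment.

```python
# Illustrative sketch only: extract a unit-normalized image feature from a pre-stored
# portrait of the target person using a pre-trained ResNet backbone.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

backbone = models.resnet50(pretrained=True)
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
feature_extractor.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_feature(image_path: str) -> torch.Tensor:
    """Return a 2048-dimensional image feature for the given portrait image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = feature_extractor(img).flatten(1)
    return feat / feat.norm(dim=-1, keepdim=True)   # normalized for similarity comparison

portrait_feature = image_feature("portrait_of_person_2.jpg")   # hypothetical file name
```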
  • the above-mentioned electronic device determines the user identity information based on the question text, which may include any of the following items: in the case of obtaining the question information in response to text input, the user identity information is determined as: the identity information corresponding to the biometric feature of starting the electronic device; in the case of obtaining the question information in response to voice input, the user identity information is determined as: the identity information corresponding to the voiceprint feature corresponding to the voice stream of the voice input; in the case of obtaining the question information in response to video input, the user identity information is determined as: the identity information corresponding to the voiceprint feature and/or facial feature corresponding to the video stream of the video input.
  • the question information is obtained by the electronic device from other electronic devices via wired or wireless means, the user identity information is determined based on the voiceprint or facial image carried in the question information.
  • the character category feature can represent the character relationship categories such as father, mother, me, grandfather and grandmother. For example, if the question text is “Mom's key”, the result of the character classifier is “Mom”; if the question text is “Find the key”, the result of the character classifier is "Me”.
  • the family member relationship table shown in FIG9 can be pre-set or continuously added as the electronic device is used.
  • the family member relationship table includes two main elements: family members and member relationships.
  • the mother of character 1 is character 2
  • the father of character 1 is character 3
  • the mother of character 2 is character 4
  • the father of character 2 is character 5.
  • the amount of data in the family member relationship table is determined according to the number of family members and is not limited in the embodiments of the present application.
  • the character category feature extracted by the electronic device is "mom" and the user identity information is determined to be character 1
  • the target character that can be obtained by searching the family relationship table is character 2
  • the target feature of character 2 is determined as the character association parameter.
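  • A minimal sketch of the look-up described above is shown below, assuming the family member relationship table is a simple in-memory mapping; the person identifiers, relations, and stored feature values are illustrative.

```python
# Illustrative sketch only: resolve the target person from the person category feature
# and the asking user's identity, then return that person's pre-extracted target features.
FAMILY_RELATIONS = {
    ("person_1", "mom"): "person_2",
    ("person_2", "mom"): "person_4",
    ("person_2", "dad"): "person_5",
}

TARGET_FEATURES = {
    # Pre-extracted (for example with a ResNet-style encoder) image/behavior/voiceprint features.
    "person_2": {"image": [0.12, 0.53], "behavior": [0.40, 0.08], "voiceprint": [0.77, 0.19]},
}

def object_association_parameter(user_id, person_category):
    """Return the target features of the person referred to by the question, or None."""
    if person_category in (None, "me"):
        target = user_id                     # e.g. "find the key" -> the asking user
    else:
        target = FAMILY_RELATIONS.get((user_id, person_category))
    if target is None:
        return None                          # no person association parameter is output
    return TARGET_FEATURES.get(target)

params = object_association_parameter("person_1", "mom")   # -> features of person_2
```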
  • the electronic device can select a target identity confirmation method according to the method of obtaining the question text, and determine the identity information of the user who asked the question according to the target identity confirmation method. If the user identity information cannot be confirmed, the person-related parameters are not output. That the electronic device cannot confirm the user identity information means that the user who inputs the question text is not within the preset face and preset voiceprint recognition range, or that the person classifier result does not include a target person corresponding to the user.
  • if the electronic device directly obtains the question text, the identity information corresponding to the biometric feature (such as a fingerprint, voiceprint, or face image) used to start the electronic device is determined as the identity information of the user who asked the question. If the electronic device indirectly obtains the question text through a voice stream, the identity information of the user who asked the question is determined based on the voiceprint corresponding to the voice stream. If the electronic device indirectly obtains the question text through a video, the identity information of the user who asked the question is determined based on the voiceprint corresponding to the voice in the video and/or the face image in the video.
  • the electronic device inputs the text features into the person classifier, determines whether the classifier result includes a person feature, and judges, based on the semantic association parameter, whether the answer quantity association parameter corresponding to the answer quantity feature is 1 or more. In this way, by determining the number of answers to the question, the accuracy of the obtained answer quantity association parameter can be improved, which further improves the accuracy of the obtained answer to the question.
  • when there are multiple answers, the question text may also specify the ordinal number of the answer to be output.
  • the electronic device inputs the text features into the character classifier (which marks the ordinal-related words in the input text in a sequence tagging manner), determines the answer quantity feature, and also extracts the answer ordinal feature in the question text.
  • the question text is "Who is the second person to come this afternoon?" Although there are many corresponding answers, there is actually only one answer, and the answer ordinal number is "second.”
  • S702 The electronic device obtains an associated video corresponding to the associated parameter.
  • the electronic device may also obtain a target video, and then obtain an associated video corresponding to the associated parameters in the target video.
  • the electronic device can also directly obtain the target video from the video library, or obtain the associated video corresponding to the associated parameter.
  • the target video can be stored in the video library
  • the video library refers to a database for storing continuously recorded videos.
  • the video library can be stored in the electronic device, in a server, or in a recording device for recording videos.
  • the electronic device uses the associated parameter as a filtering condition to obtain associated videos that meet the filtering condition from the target video in the video library.
  • the electronic device sends the associated parameter as a screening condition to the server.
  • the server receives the screening condition, extracts associated videos that meet the screening condition from the target video in the video library, and sends the associated video to the electronic device, so that the electronic device obtains the associated video corresponding to the associated parameter.
  • the electronic device sends the associated parameter as a filter condition to the recording device.
  • the recording device receives the filter condition, extracts the associated video that meets the filter condition from the target video in the video library, and sends the associated video to the electronic device, so that the electronic device obtains the associated video corresponding to the associated parameter.
  • the electronic device requests to obtain all videos in the video library (including the target video), and the recording device sends all videos to the electronic device. After receiving all videos, the electronic device uses the associated parameter as a screening condition to obtain associated videos that meet the screening condition from all videos.
  • the associated video corresponding to the associated parameter in the target video may be obtained through the following steps S1001 to S1003.
  • S1001. The electronic device extracts a first video from the target video according to the time association parameter.
  • S1002 The electronic device extracts the second video from the first video according to the object association parameter.
  • S1003 The electronic device extracts a related video from the second video according to the semantic association parameter.
  • the electronic device uses the time association parameter as the video recording time period to extract the first video from the target video, and encodes each video frame in the first video to obtain the video frame feature of each video frame.
  • the electronic device then calculates a first feature similarity between the object association parameter (such as the target feature of the target person) and the video frame feature of each video frame. If the first feature similarity is greater than a preset threshold, the video frame is retained; if the first feature similarity is less than the preset threshold, the video frame is deleted. Each video frame in the first video is judged in turn to be retained or deleted, so as to obtain the second video. Finally, the electronic device calculates a second feature similarity between the semantic association parameter (text feature) and the video frame feature of each video frame of the second video. If the second feature similarity is greater than the preset threshold, the video frame is retained; if the second feature similarity is less than the preset threshold, the video frame is deleted. Each video frame in the second video is judged in turn, so as to obtain the associated video.
  • the video frame feature of each video frame can be generated in the process of implementing the video question-answering method, or can be generated in the process of recording the video.
  • the video frame feature of each video frame can be encoded using a pre-trained ResNet network as an encoder.
  • a vector product method can be used, that is, the vector corresponding to the target feature is multiplied (dot product) with the vector corresponding to the video frame feature to obtain the first feature similarity, and the vector corresponding to the text feature is multiplied with the vector corresponding to the video frame feature to obtain the second feature similarity.
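  • The following is a minimal sketch of this frame-filtering step, assuming each video frame has already been encoded into a unit-normalized feature vector (for example with a pre-trained ResNet encoder); the threshold value is illustrative.

```python
# Illustrative sketch only: keep the frames whose feature similarity (vector product)
# with a query feature exceeds a preset threshold.
import numpy as np

def filter_frames(frame_features, query_feature, threshold=0.5):
    """frame_features: list of 1-D numpy arrays; returns the indices of retained frames."""
    kept = []
    for idx, frame_feature in enumerate(frame_features):
        similarity = float(np.dot(frame_feature, query_feature))
        if similarity > threshold:
            kept.append(idx)                 # retain the frame
    return kept

# First pass: object association parameter (target-person feature) -> second video.
# Second pass: semantic association parameter (text feature) -> associated video.
```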
  • obtaining the associated video corresponding to the associated parameter can reduce the number of video frames containing the implicit answer to the question in the associated video, thereby reducing the time for the VQA model to process at least one video clip, so as to improve the speed of obtaining the answer to the question.
  • it can avoid the interference of unrelated videos on the answer to the question, which can further improve the speed of obtaining the answer to the question and improve the accuracy of the answer to the question.
  • by processing each of the at least one video clip separately through the VQA model, since each video clip contains independent semantics, mutual interference between the video clips can be avoided and the accuracy of the answer to the question can be improved.
  • S703 The electronic device divides the associated video to obtain at least one video segment.
  • the electronic device can segment the associated video according to the video segmentation position to obtain at least one video segment, wherein the video segmentation position is the position, in the associated video, of adjacent video frames whose recording time difference is greater than a preset time difference.
  • the electronic device judges temporally discontinuous video segments of associated videos, and when the recording time difference between adjacent video frames in adjacent video segments is greater than a preset time interval, the adjacent video segments are divided into different video clips; when the recording time difference between adjacent video frames in adjacent video segments is not greater than the preset time interval, the adjacent video segments are divided into the same video clip, thereby obtaining at least one video clip.
  • the temporally discontinuous video segments of the associated video include f1, f2, and f3, etc.
  • f1 is determined to belong to the first video segment, and if the recording time difference between adjacent video frames in f1 and f2 is greater than 10 seconds, f2 is determined to belong to the second video segment, otherwise, f2 is determined to also belong to the first video segment.
  • the adjacent video segments are merged to regenerate at least one video segment.
  • the electronic device may further calculate the similarity of the feature means of two adjacent video segments, and if the similarity is greater than a preset threshold, the two adjacent video segments are merged into one video segment.
  • the feature mean of the video segment may be the average value obtained by adding the sum of the video frame features of each video frame in the video segment and dividing it by the total number of video frames.
  • a vector product method can be used, that is, the feature mean of the previous video segment in the adjacent video segments is multiplied (dot product) with the feature mean of the next video segment to obtain the similarity of the feature means of the two adjacent video segments.
  • mom’s activity scenes include: scene one: mom is preparing food in the kitchen and the restaurant is empty; scene two: the kitchen is empty and mom is placing dishes in the restaurant; scene three: mom returns to the kitchen to prepare food and the restaurant is empty.
  • the kitchen videos corresponding to scene one and scene three are related videos, the similarity of the feature means of the kitchen videos corresponding to scene one and scene three is greater than the preset threshold, and they are adjacent video segments, so the two can be merged into one video segment.
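  • A minimal sketch of the segmentation and merging logic described above follows; the 10-second gap and the similarity threshold are illustrative values, and frame features are assumed to be unit-normalized vectors.

```python
# Illustrative sketch only: split the associated video at large recording gaps, then
# merge adjacent clips whose feature means are similar (e.g. the same scene resumed).
import numpy as np

def split_by_time_gap(frames, max_gap=10.0):
    """frames: list of (timestamp_in_seconds, feature_vector), sorted by recording time."""
    if not frames:
        return []
    segments, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if cur[0] - prev[0] > max_gap:          # recording gap -> start a new video clip
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments

def merge_similar_neighbours(segments, threshold=0.8):
    """Merge adjacent clips whose feature means have a vector-product similarity above the threshold."""
    if not segments:
        return []
    merged = [segments[0]]
    for seg in segments[1:]:
        mean_prev = np.mean([feat for _, feat in merged[-1]], axis=0)
        mean_cur = np.mean([feat for _, feat in seg], axis=0)
        if float(np.dot(mean_prev, mean_cur)) > threshold:
            merged[-1] = merged[-1] + seg       # same scene: merge into one clip
        else:
            merged.append(seg)
    return merged
```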
  • the electronic device may further extract the answer quantity feature from the text feature; obtain the answer quantity association parameter corresponding to the answer quantity feature; and then after segmenting the associated video to obtain at least one video segment, the electronic device may further delete the other segments except the first video segment from the at least one video segment when the answer quantity association parameter is 1.
  • the first video segment is the most recently recorded video segment in the at least one video segment.
  • the association parameter includes an answer quantity association parameter
  • the electronic device when the answer quantity association parameter is 1, the electronic device retains the video segment with the most recent recording time in at least one video segment; when the answer quantity association parameter is multiple, the electronic device retains all video segments in at least one video segment. That is, when the answer quantity association parameter is 1, only the video segment corresponding to the video segment with the most recent recording time is retained, so as to further reduce the amount of data in the video segment and improve the speed of obtaining the answer to the question.
  • the electronic device derives multiple possible answers corresponding to the question text (i.e., corresponding to multiple video clips); however, if the associated parameters also include an answer ordinal associated parameter, then the final answer to the question is one of the answers corresponding to the answer ordinal associated parameter (i.e., corresponding to the video clip in multiple video clips that corresponds to the answer ordinal associated parameter).
  • the electronic device can also extract the answer ordinal feature in the text features and obtain the answer ordinal association parameter corresponding to the answer ordinal feature. Then, after segmenting the associated video to obtain at least one video segment, when the answer quantity association parameter indicates multiple answers and is greater than or equal to the answer ordinal association parameter, the electronic device can delete the segments in the at least one video segment other than the second video segment.
  • the second video segment is: a video segment corresponding to the ordinal number of the answer in at least one video segment, and at least one video segment is arranged in chronological order.
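  • The following is a minimal sketch of narrowing the clips by the answer quantity and answer ordinal association parameters; the clips are assumed to be ordered by recording time from earliest to latest.

```python
# Illustrative sketch only: keep the clips that match the answer quantity / ordinal.
def select_clips(clips, answer_quantity=None, answer_ordinal=None):
    if answer_quantity == 1:
        return clips[-1:]                     # keep only the most recently recorded clip
    if answer_ordinal is not None and answer_ordinal <= len(clips):
        return [clips[answer_ordinal - 1]]    # e.g. "the second person to come in"
    return clips                              # multiple answers: keep all clips
```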
  • S704. The electronic device inputs the at least one video clip and the question text into the VQA model to obtain the answer to the question corresponding to the question text, and displays the answer to the question.
  • the electronic device may first obtain the answer to the question corresponding to the question text in at least one video clip, and then display the answer to the question.
  • the electronic device can combine the video frame features of at least one video clip and the text features of the question text through the VQA model to obtain the answer to the question text.
  • the feature combination method is not limited.
  • T is the text feature and F is the video frame feature.
  • the text feature and the video frame feature are input into the Transformer neural network to obtain the fused feature, and then the answer to the question is obtained by using the classification or generation method based on the fused feature.
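  • A minimal sketch of this fusion step is given below, assuming PyTorch; the layer sizes, the mean pooling, and the fixed answer vocabulary are illustrative choices, not the VQA model of the embodiment.

```python
# Illustrative sketch only: fuse text features T and video-frame features F with a
# Transformer encoder, then classify the answer over a fixed answer vocabulary.
import torch
import torch.nn as nn

class SimpleVQAFusion(nn.Module):
    def __init__(self, dim=768, num_answers=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.answer_head = nn.Linear(dim, num_answers)

    def forward(self, text_feats, frame_feats):
        # Concatenate T and F along the sequence dimension, fuse them, then pool and classify.
        fused = self.fusion(torch.cat([text_feats, frame_feats], dim=1))
        return self.answer_head(fused.mean(dim=1))

logits = SimpleVQAFusion()(torch.randn(1, 12, 768), torch.randn(1, 30, 768))
```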
  • the time association parameter is determined to be N hours before the time when the question text is obtained, and the associated video extracted based on this may not include the answer to the question corresponding to the question text. Therefore, if the electronic device cannot obtain the answer to the question corresponding to the question text, the electronic device can prompt the user in which time period the answer to the question corresponding to the question text was not found, and prompt the user to enter a target time period, and use the target time period as the time association parameter to re-search for the answer to the question corresponding to the question text.
  • the electronic device is mobile phone C.
  • mobile phone C obtains associated videos based on “this afternoon (time)”, “who (any person)” and “comes in (semantics)”, and inputs at least one video clip corresponding to the associated video into the VQA model. Based on each video clip, it is determined that the persons who came in this afternoon include: Person A, Person B and Person C. Finally, the answer to the question corresponding to the question text is Person A, Person B and Person C.
  • the electronic device is mobile phone C.
  • mobile phone C obtains an associated video based on “this afternoon (time)”, “who (any person)”, “enters the door (semantics)” and “second (answer ordinal number)”, and inputs at least one video clip corresponding to the associated video into the VQA model. Based on each video clip, it is determined that the persons who enter the door in the afternoon include: Person A, Person B and Person C. Finally, the answer to the question corresponding to the question text is: the second person among the above-mentioned persons (Person B).
  • the electronic device displaying the answer to the question may include at least one of the following: playing the answer to the question through voice playback; displaying the answer to the question through text display; playing a video clip corresponding to the answer to the question in at least one video clip through video display.
  • the electronic device can display the answer to the question through voice playback, text display, and video display.
  • the video display method can display a video clip corresponding to the answer to the question in at least one video clip.
  • the electronic device is an intelligent assistant B
  • the intelligent assistant B displays the answer to the question through voice: the key is on the table.
  • the electronic device is a mobile phone
  • the mobile phone displays the answer to the question through video, and displays an image of the key at the location on the display screen of the mobile phone.
  • the electronic device is a mobile phone
  • the mobile phone displays the answer to the question through text, and displays an image of the key at the location on the display screen of the mobile phone.
  • using at least one display method to display the answer to the question increases the diversity of the displayed answer, so as to increase the probability that the user learns the answer to the question.
  • the embodiment of the present application provides a video question-answering method, which can reduce the number of video frames containing the answer to the question in the at least one video segment by obtaining at least one associated parameter in the question information, and thereby reduce the time for processing the at least one video segment, so as to increase the speed of obtaining the answer to the question. At the same time, it can avoid the interference of irrelevant videos in the target video on the answer to the question, which can further increase the speed of obtaining the answer to the question and improve the accuracy of the answer to the question. Furthermore, in the process of obtaining the answer to the question corresponding to the question text, each video segment in the at least one video segment is processed separately. Since each video segment contains an independent semantics, it can avoid mutual interference between the video segments and improve the accuracy of the answer to the question.
  • the answers to questions may involve privacy during the video question-and-answer process
  • the answers to the questions may be displayed after the user identity information is verified.
  • the display of the answers to the questions in the above S704 may also include the following steps S1401 and S1402.
  • the question text, at least one video clip or the answer to the question are respectively input into the privacy feature extraction model, and if any of the above inputs contains privacy information, the electronic device performs identity verification on the user identity information.
  • the user identity information can be obtained in the process of obtaining the character association parameters.
  • the electronic device may use a sequence labeling method to determine whether the input feature contains privacy information.
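  • The following is a minimal sketch of the privacy check, assuming a sequence-labelling tagger that marks each token as privacy-related ("P") or not ("O"); the tagger itself and the tag set are illustrative stand-ins for the privacy feature extraction model.

```python
# Illustrative sketch only: gate the display of the answer on a privacy check plus
# identity verification; tag_fn stands in for the privacy feature extraction model.
def contains_privacy(tokens, tag_fn):
    """tag_fn maps a list of tokens to one tag per token, e.g. ["O", "P", "O"]."""
    return any(tag == "P" for tag in tag_fn(tokens))

def display_answer(answer, tokens, tag_fn, verify_identity):
    """Return the answer only if it is non-private or the user's identity is verified."""
    if contains_privacy(tokens, tag_fn) and not verify_identity():
        return None          # identity verification failed: do not display the answer
    return answer
```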
  • the question text is "safe password"
  • the electronic device needs to verify the user's identity information.
  • the corresponding answer can be obtained.
  • for user identity information not stored in the preset identity relationship table (that is, strangers relative to the persons in the preset identity relationship table), the corresponding answer to the question cannot be obtained, in order to prevent strangers from achieving illegal purposes based on the obtained answer.
  • the user identity information verification method is determined according to the method of obtaining the question text. If the electronic device directly obtains the question text, the identity information corresponding to the biometric feature (such as a fingerprint, voiceprint, or face image) used to start the electronic device is determined as the user identity information of the question text. If the electronic device indirectly obtains the question text through a voice stream, the user identity information of the question text is determined based on the voiceprint corresponding to the voice stream. If the electronic device indirectly obtains the question text through a video, the user identity information of the question text is determined based on the voiceprint corresponding to the voice in the video and/or the face image in the video.
  • the verification method used by the electronic device to verify the user identity information corresponds to the actual information type of the user identity information.
  • for example, if the user identity information is fingerprint information, the identity verification is performed by fingerprint recognition.
  • when the identity verification fails, the electronic device does not display the answer to the question.
  • the electronic device may display a prompt message to indicate the reason for not displaying the answer to the question, to prompt to re-obtain the question text, and so on.
  • the video question-answering method inputs the question text into the text encoder, and the text encoder can perform text feature recognition, time recognition, character recognition, and answer quantity recognition. If it is determined that the question text includes a time feature according to time recognition, a time association parameter is obtained according to the time feature, and a first video corresponding to the time association parameter is obtained from the video library. If it is determined that the question text does not include a time feature according to time recognition, a time association parameter is obtained according to a preset rule (the time association parameter is determined to be N hours before the acquisition time of the question text), and a first video corresponding to the time association parameter is obtained from the video library.
  • a target feature corresponding to the target person is determined as a character association parameter, and a second video corresponding to the character association parameter is obtained from the first video.
  • an associated video is extracted from the second video according to a semantic association parameter (text feature).
  • the associated video is segmented to obtain at least one video clip.
  • the number of answers is determined according to the answer quantity recognition, and a video clip with a higher correlation with the question text is selected from the at least one video clip.
  • finally, the answer to the question is obtained through the VQA model and is displayed according to whether privacy information is included.
  • the video question-and-answer method provided in the embodiment of the present application proposes a video question-and-answer solution that combines time, person, number of answers, and privacy, which can quickly locate the time and video clips of the person corresponding to the question text, and can accurately output one or more answers based on the text semantics. It can also identify the privacy of the text and the corresponding video, and support identity verification.
  • the video is filtered layer by layer, and the most relevant video part is selected, which can respond quickly, and other irrelevant interference is eliminated, so that the question can be answered accurately.
  • the screened video part is divided into segments according to the time interval and semantic relevance.
  • the segments are divided and each segment is answered, which can avoid interference and accurately answer the question.
  • the video segment is selected according to the time dimension to answer. Some questions have multiple or one answer, so the most relevant segment can be selected to answer.
  • a video question-and-answer device is also provided in an embodiment of the present application.
  • the video question-and-answer device includes an acquisition unit 1701 , a segmentation unit 1702 , and a processing unit 1703 .
  • the acquisition unit 1701 is used to acquire the associated parameters in the question text, for example, by executing step S701 in the above embodiment.
  • the acquisition unit 1701 is used to acquire the associated video corresponding to the associated parameter, for example, by executing step S702 in the above embodiment.
  • the segmentation unit 1702 is used to segment the associated video to obtain at least one video segment, for example, by executing step S703 in the above embodiment.
  • the processing unit 1703 is used to input at least one video clip and question text into the VQA model, obtain the answer to the question corresponding to the question text, and display the answer to the question. For example, step S704 in the above embodiment is performed.
  • the electronic device includes hardware and/or software modules corresponding to the execution of each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is executed in the form of hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application in combination with the embodiments, but such implementation should not be considered to be beyond the scope of the present application.
  • the electronic device can be divided into functional modules according to the above method example.
  • each functional module can be divided according to each function, or two or more functions can be integrated into one processing module.
  • the above integrated module can be implemented in the form of hardware. It should be noted that the division of modules in this embodiment is schematic and is only a logical function division. There may be other division methods in actual implementation.
  • An embodiment of the present application also provides an electronic device, as shown in FIG. 18, which may include one or more processors 1001, a memory 1002, and a communication interface 1003.
  • the memory 1002 and the communication interface 1003 are coupled to the processor 1001.
  • the memory 1002, the communication interface 1003 and the processor 1001 may be coupled together via a bus 1004.
  • the communication interface 1003 is used for data transmission with other devices.
  • the memory 1002 stores computer program code.
  • the computer program code includes computer instructions.
  • when the processor 1001 executes the computer instructions, the electronic device executes the video question-answering method in the embodiments of the present application.
  • the processor 1001 can be a processor or a controller, for example, it can be a central processing unit (CPU), a general processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic devices, transistor logic devices, hardware components or any combination thereof. It can implement or execute various exemplary logic blocks, modules and circuits described in conjunction with the contents of this disclosure.
  • the processor can also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of DSP and microprocessors, and the like.
  • the bus 1004 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
  • the bus 1004 may be divided into an address bus, a data bus, a control bus, etc.
  • for ease of representation, only one thick line is used in FIG. 18, but this does not mean that there is only one bus or only one type of bus.
  • An embodiment of the present application further provides a computer-readable storage medium, in which a computer program code is stored.
  • when the computer program code runs on the electronic device, the electronic device executes the relevant method steps in the above method embodiments.
  • the embodiment of the present application also provides a computer program product.
  • when the computer program product is run on a computer, it enables the computer to execute the relevant method steps in the above method embodiments.
  • the electronic device, computer storage medium or computer program product provided in this application is used to execute the corresponding method provided above. Therefore, the beneficial effects that can be achieved can refer to the beneficial effects in the corresponding method provided above, and will not be repeated here.
  • the disclosed devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the modules or units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another device, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed in multiple different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium.
  • the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions to enable a device (which can be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), disk or optical disk and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

一种视频问答方法及电子设备,方法中,获取目标视频和用户的问题信息;获取问题信息中的至少一个关联参数(S701),包括时间关联参数、对象关联参数和语义关联参数中的一个或多个;根据至少一个关联参数,对目标视频进行分割,得到至少一个视频片段(S703);获取至少一个视频片段中,问题信息对应的问题答案;展示问题答案(S704)。

Description

视频问答方法及电子设备
本申请要求于2022年10月20日提交国家知识产权局、申请号为202211289300.7、申请名称为“视频问答方法及电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及终端技术领域,尤其涉及一种视频问答方法及电子设备。
背景技术
在开启摄像头拍摄特定区域(如客厅、饭店、门口等)并录制视频的情况下,可以通过查看视频的方式追溯过去发生的事件,例如,钥匙在哪里、今天有多少人用餐、下午是否有人进门等等。为了提高事件追溯的效率,可以将追溯的事件对应的问题文本和录制的视频,输入视觉问答(Visual Question Answering,VQA)模型,通过VQA模型将视觉处理与自然语言处理相结合,自动求解并输出问题文本对应的答案,实现视频问答。
在相关技术中,视频问答的具体实现方式,可以包括:从问题文本和待处理的原始视频中,得到文本特征以及各帧图像的视觉特征和语义特征,并根据文本特征、视觉特征和语义特征得到各帧图像的全局视觉表示,最后根据文本特征和全局视觉表示,得到问题答案。
然而,上述视频问答方式需要得到待处理的原始视频的各帧图像的全局视觉表示,需要耗费大量的时间进行图像数据处理,导致得到问题答案的速度较慢。
发明内容
本申请实施例提供一种视频问答方法及电子设备,能够根据问题文本中隐含的时间、人物、语义等因素,提取问题文本的关联视频片段,减少隐含问题答案的视频帧的数量,进而减少处理关联视频片段的时间,从而提高得到问题答案的速度,提高视频问答效率和用户体验。
为达到上述目的,本申请的实施例采用如下技术方案:
第一方面,提供了一种视频问答方法,应用于电子设备,在该方法中,首先,电子设备获取目标视频和用户的问题信息。然后,电子设备获取问题信息中的至少一个关联参数,其中,关联参数包括时间关联参数、对象关联参数或语义关联参数中的一个或多个。再者,电子设备根据至少一个关联参数,对目标视频进行分割,得到至少一个视频片段。接下来,电子设备获取至少一个视频片段中,问题信息对应的问题答案。最后,电子设备展示问题答案。
其中,目标视频是存储在视频库中的连续录制的视频。目标视频的连续时间可以为视频的最大存储时间(如7天、一星期、一个月等),目标视频的录制时间从当前时间开始根据最大存储时间进行推算。目标视频的数据量较大,如果逐个视频帧进行判断得到问题答案的速度较慢。
如此,通过获取问题信息中的至少一个关联参数,对目标视频进行分割,并且得到至少一个视频片段,能够减少至少一个视频片段中隐含问题答案的视频帧的数量,进而减少处理至少一个视频片段的时间,以便于提高得到问题答案的速度。同时,能够避免目标视频中的不相关视频对问题答案的干扰,能够进一步提高得到问题答案的速度,并提高得到的问题答案的准确度。再者,获取问题文本对应的问题答案的过程中,分别处理至少一个视频片段中的每个视频片段,由于每个视频片段包含一个独立的语义,能够避免各个视频片段之间相互干扰,能够提高问题答案的准确度。
在第一方面的一种可实现方式中,电子设备获取问题信息中的至少一个关联参数,可以包括:首先,电子设备将问题信息转换为问题文本。然后,电子设备对问题文本进行分词,得到词向量。再者,电子设备将词向量输入预置文本编码模型,得到文本特征。接下来,电子设备提取文本特征中的时间特征和对象特征。最后,电子设备获取文本特征对应的语义关联参数,时间特征对应的时间关联参数,和对象特征对应的对象关联参数中的一个或多个。
其中,关联参数包括文本特征对应的语义关联参数,时间特征对应的时间关联参数,和对象特征对应的对象关联参数中的一个或多个。
需要说明的是,问题信息是响应于文本输入获取的,或者响应于语音输入获取的,或者响应于视频输入获取的。在获取问题信息后,电子设备对问题信息进行清洗、标准化等处理,得到问题文本。相比于问题信息,问题文本更容易进行文本处理,更容易被机器识别。
其中,预置文本编码模型可以是BERT模型,BERT模型的主要输入是文本中各个字/词(或者称为token)的原始词向量,原始词向量可以是通过查询字向量表将文本中的每个字转换为一维向量,也可以是将利用词向量模型进行预训练后得到的向量。BERT模型的输出是文本中各个字/词对应的融合全文语义信息后的向量表示。
可以理解的是,从问题文本中可能提取出文本特征、对象特征和时间特征中的一个或多个。
如此，通过将问题信息转换为问题文本，对问题文本进行分词得到词向量，再将词向量输入预置文本编码模型，以此，得到文本特征，以及文本特征对应的语义关联参数，通过上述处理能够提高文本特征的准确性。通过将文本特征进行分类，提取其中的时间特征和对象特征，以此，获取时间特征对应的时间关联参数，以及对象特征对应的对象关联参数。如此，能够提高获取的关联参数的准确性。
在第一方面的一种可实现方式中,时间关联参数包括视频起始时刻和视频终止时刻。以此,可以根据时间特征是否有效,采用不同的方法获取时间特征对应的时间关联参数。具体包括:在时间特征为有效特征的情况下,根据预置映射规则,对时间特征对应的时间分词进行映射,确定视频终止时刻和视频起始时刻;在时间特征为无效特征的情况下,确定视频终止时刻为:获取问题文本的时刻,确定视频起始时刻为:与视频终止时刻相距预置时长的时刻。
如此,根据时间特征为有效特征还是无效特征,采用不同的方式,确定时间特征对应的时间关联参数,以使得时间关联参数与问题文本的关联程度更高,使得根据时间关联参数得到的关联视频与问题文本的关联程度更高,进一步提高获取的问题答案的准确性。
在第一方面的一种可实现方式中,对象特征为人物类别特征,电子设备获取对象特征对应的对象关联参数,包括:根据问题文本,确定用户身份信息;在预置身份关系表中,根据用户身份信息确定对象特征对应的目标人物;将目标人物对应的目标特征,确定为对象关联参数。
其中,目标特征包括以下至少一项:图像特征、行为特征和声纹特征。目标特征可以是根据预先存储的目标人物的人像、声音等数据,通过预置特征提取算法提取的,预置特征提取算法可以为残差网络ResNet算法。
需要说明的是，如果能够确认用户身份信息，并且人物分类器结果中包含人物类别特征，则确定对象特征为人物类别特征。示例性的，人物类别特征可以表示爸爸、妈妈、我、爷爷和奶奶等人物关系类别。例如，问题文本为“妈妈的钥匙”，人物分类器的结果为“妈妈”；问题文本为“找一下钥匙”，人物分类器的结果为“我”。
如此,能够根据预置身份关系表和用户身份信息,获取问题文本相关的目标人物的目标特征,提高获取的“人物”关联参数的准确性,最终提高得到的问题答案的准确性。
在第一方面的一种可实现方式中,电子设备根据问题文本,确定用户身份信息,包括以下任一项:在响应于文本输入获取问题信息的情况下,确定用户身份信息为:启动电子设备的生物特征对应的身份信息;在响应于语音输入获取问题信息的情况下,确定用户身份信息为:语音输入的语音流对应的声纹特征对应的身份信息;在响应于视频输入获取问题信息的情况下,确定用户身份信息为:视频输入的视频流对应的声纹特征和/或人脸特征对应的身份信息。
需要说明的是，电子设备可以根据问题文本的获取方式，选取目标身份确认方法，并根据目标身份确认方式确定提出问题文本的用户身份信息。如果无法确认用户身份信息，那么不输出人物关联参数。电子设备无法确认用户身份信息，也就是，输入问题文本的用户，不在预设人脸和预设声纹识别范围内，或者人物分类器结果中没有该用户对应的目标人物。
如此,通过获取问题信息的方式,能够直接得到的生物特征、声纹信息或人脸特征,确定用户身份信息,以提高确定身份信息的速度。
在第一方面的一种可实现方式中,电子设备根据至少一个关联参数,对目标视频进行分割,得到至少一个视频片段,可以包括:获取目标视频中,至少一个关联参数对应的关联视频。接下来,将关联视频进行分割,得到至少一个视频片段。
其中，目标视频中的部分视频或全部视频被确定为至少一个关联参数对应的关联视频。关联视频对应的录制时间可能是间断的。至少一个视频片段可以根据关联视频的录制时间是否间断进行分割。
如此,将关联视频分割为至少一个视频片段,以使得获取问题答案的过程中,分别处理至少一个视频片段中的每个视频片段,由于每个视频片段包含一个独立的语义,能够避免各个视频片段之间相互干扰,能够提高问题答案的准确度。
在第一方面的一种可实现方式中,电子设备获取目标视频中,至少一个关联参数对应的关联视频,具体包括:首先根据时间关联参数,从目标视频中提取第一视频。然后,根据对象关联参数,从第一视频中提取第二视频。最后,根据语义关联参数,从第二视频中提取关联视频。
需要说明的是,电子设备首先将时间关联参数作为视频录制时间段,提取目标视频中的第一视频。针对第一视频中的每个视频帧进行编码,获取每个视频帧的视频帧特征。电子设备计算对象关联参数(如,目标人物的目标特征),与每个视频帧的视频帧特征的第一特征相似度,如果第一特征相似度大于预置阈值,则保留该视频帧,如果第一特征相似度小于预置阈值,则删除该视频帧,直至依次判断第一视频中每个视频帧保留或删除,得到第二视频。最后,电子设备计算语义关联参数(文本特征),与第二视频帧的每个视频帧的视频帧特征的第二特征相似度,如果第二特征相似度大于预置阈值,则保留该视频帧,如果第二特征相似度小于预置阈值,则删除该视频帧,直至依次判断第二视频中每个视频帧保留或删除,得到关联视频。
如此,按照提取视频的数据处理速度从快到慢的顺序,以此根据时间关联参数、对象关联参数和语义关联参数提取目标视频中的关联视频。最先执行处理速度最快的提取第一视频的步骤,需要处理的数据量最大。最后执行处理速度最慢的提取相关视频的步骤,需要处理的数据量最小。通过平衡数据处理的处理速度和数据处理的数据量的关系,提高提取目标视频中关联视频的速度。
在第一方面的一种可实现方式中,电子设备将关联视频进行分割,得到至少一个视频片段,具体可以为:按照视频分割位置对关联视频进行分割,获取至少一个视频片段。
其中,视频分割位置为:关联视频中录制时间差大于预置时间差的相邻视频帧所在的位置。可以理解的,在确定分割位置过程中,可以首先确定关联视频中录制时间不连续的视频段进行判断,以此,减少计算相邻视频帧对应的录制时间差的次数,进一步减少得到问题答案所需的计算量,提高得到问题答案的速度。
如此,通过按照视频分割位置对关联视频进行分割,得到的至少一个视频段中的各个视频段在时间上不连续,以使得各个视频段包含一个独立的语义,能够避免各个视频片段之间相互干扰,能够提高问题答案的准确度。
在第一方面的一种可实现方式中,按照视频分割位置对关联视频进行分割,获取至少一个视频片段之后,在至少一个视频片段中的相邻视频片段的特征均值的特征相似度大于预置阈值的情况下,电子设备将相邻视频片段合并,重新生成至少一个视频片段。
其中,特征均值可以为视频片段中每个视频帧的视频帧特征相加的和值,再除以视频帧总数得到的平均值。特征相似度为相邻视频片段的两个特征均值相比的相似度。
如此,对于特征均值的相似度较高的视频片段,在视频帧特征、人物追踪等等方面,可以复用的数据较多,如果将特征均值的相似度较高的视频片段进行合并,可以减少后续数据处理所需的时间,进而提高得到问题答案的速度。
在第一方面的一种可实现方式中,本申请提供的视频问答方法还包括:首先,电子设备提取文本特征中的答案数量特征。然后,电子设备获取答案数量特征对应的答案数量关联参数。接下来,将关联视频进行分割,得到至少一个视频片段之后,在答案数量关联参数为1的情况下,电子设备删除至少一个视频片段中除了第一视频片段以外的其他片段,第一视频片段为:至少一个视频片段中最后录制的视频片段。
其中，电子设备将文本特征输入人物分类器，根据人物分类器结果中是否包含人物特征，并结合语义关联参数进行判断，确定答案数量特征对应的答案数量关联参数为1或多。
可以理解的是,如果关联参数包括答案数量关联参数,在答案数量关联参数为1的情况下, 电子设备保留至少一个视频片段中的录制时间最近的视频片段;在答案数量关联参数为多的情况下,电子设备保留至少一个视频片段中所有的视频片段。
需要说明的是,答案数量只与问题文本相关,与关联视频无关,是预先判断的问题答案的可能数量。当然,答案数量与问题答案具有一定的相关性,例如,在答案数量为一个的情况下,问题答案的实际数量可能为零个或一个;在答案数量为多个的情况下,问题答案的实际数量可能为任一自然数,如,0、1、2、3等数值。
如此，通过答案数量特征确定答案数量关联参数，对于答案数量关联参数为1的情况，能够对至少一个视频片段进行进一步筛选，仅保留录制时间最近的视频片段，以便于进一步减少视频片段中的数据量，提高得到问题答案的速度。
在第一方面的一种可实现方式中,本申请提供的视频问答方法还包括:首先,电子设备提取文本特征中的答案序数特征。然后,电子设备获取答案序数特征对应的答案序数关联参数。接下来,将关联视频进行分割,得到至少一个视频片段之后,在答案数量关联参数为多,且,答案数量关联参数大于或等于答案序数关联参数的情况下,电子设备删除至少一个视频片段中除了第二视频片段以外的其他片段,其中,第二视频片段为至少一个视频片段中答案序数关联参数对应的视频片段,至少一个视频片段按照时间先后顺序排列。
如此,对于答案数量关联参数为多,且,答案数量关联参数大于或等于答案序数关联参数的情况,对至少一个视频片段进行进一步筛选,仅保留答案序数关联参数对应的视频片段,以便于进一步减少视频片段中的数据量,提高得到问题答案的速度。
在第一方面的一种可实现方式中,电子设备展示问题答案,具体包括:首先,在问题文本、至少一个视频片段或问题答案包含隐私信息的情况下,对用户身份信息进行身份校验。然后,在通过身份校验的情况下,展示问题答案。
其中,隐私信息与用户身份信息是相对而言的。假设问题文本为“保险箱密码”,对于在预置身份关系表中保存的用户身份信息都可以获取对应的问题答案,对于在预置身份关系表中未保存的用户身份信息(相对于预置身份关系表中各个人物而言的陌生人),为了避免陌生人根据获取的问题答案实现非法目的,则不可获取对应的问题答案。
如此，通过判断问题文本、至少一个视频片段或问题答案是否包含隐私信息，确定用户身份信息是否需要进行身份校验，并在身份校验通过的情况下，才展示问题答案，在身份校验不通过的情况下，不展示问题答案，能够避免隐私泄露。
在第一方面的一种可实现方式中,电子设备展示问题答案,包括以下至少一项:通过语音播放方式,播放问题答案;通过文字显示方式,显示问题答案;通过视频显示方式,播放至少一个视频片段中问题答案对应的视频片段。
需要说明的是,由于想要获取问题答案的用户的个人感知能力有限,如,不识字、视力障碍或者耳聋等等,采用至少一种展示方式展示问题答案,增加展示问题答案的多样性,以便于提高用户获知问题答案的概率。
第二方面,本申请提供一种视频问答装置,该装置中包括获取单元、处理单元和展示单元;获取单元,用于获取目标视频和用户的问题信息。获取单元,还用于获取问题信息中的至少一个关联参数,其中,关联参数包括时间关联参数、对象关联参数或语义关联参数中的一个或多个。处理单元,用于根据至少一个关联参数,对目标视频进行分割,得到至少一个视频片段。展示单元,用于获取至少一个视频片段中,问题信息对应的问题答案;展示问题答案。
如此,能够减少关联视频中隐含问题答案的视频帧的数量,进而减少处理至少一个视频片段的时间,以便于提高得到问题答案的速度。同时,能够避免不关联视频对问题答案的干扰,能够进一步提高得到问题答案的速度,并提高得到的问题答案的准确度。再者,获取问题文本对应的问题答案的过程中,分别处理至少一个视频片段中的每个视频片段,由于每个视频片段包含一个独立的语义,能够避免各个视频片段之间相互干扰,能够提高问题答案的准确度。
第三方面,提供了一种电子设备,包括:存储器、一个或多个处理器;存储器和处理器耦合;其中,存储器中存储有计算机程序代码,计算机程序代码包括计算机指令,当计算机指令被处理器执行时,使得电子设备执行上述第一方面任一项所述的视频问答方法。
第四方面,提供了一种计算机可读存储介质,包括计算机指令,当计算机指令在电子设备上运行时,使得电子设备执行上述第一方面任一项所述的视频问答方法。
第五方面,提供了一种计算机程序产品,当计算机程序产品在计算机上运行时,使得计算机执行上述第一方面任一项所述的视频问答方法。
可以理解地,上述提供的第三方面所述的电子设备,第四方面所述的计算机可读存储介质,第五方面所述的计算机程序产品所能达到的有益效果,可参考第一方面及其任一种可能的设计方式中的有益效果,此处不再赘述。
附图说明
图1为第一种相关技术中的视频问答方法的流程示意图;
图2为第二种相关技术中的视频问答方法的流程示意图;
图3为本申请实施例示出的视频问答的场景示意图之一;
图4为本申请实施例示出的一种视频问答系统框架的结构示意图;
图5为本申请实施例示出的一种电子设备的硬件结构示意图;
图6为本申请实施例示出的一种电子设备的软件结构示意图;
图7为本申请实施例示出的一种视频问答方法的流程示意图之一;
图8为本申请实施例示出的一种获取关联参数的模型结构示意图;
图9为本申请实施例示出的一种家庭成员关系图；
图10为本申请实施例示出的一种视频问答方法的流程示意图之二;
图11为本申请实施例示出的一种视频片段的场景示意图;
图12为本申请实施例示出的一种VQA模型的结构示意图;
图13为本申请实施例示出的视频问答的场景示意图之二;
图14为本申请实施例示出的一种视频问答方法的流程示意图之三;
图15为本申请实施例示出的一种判断是否包含隐私信息的模型结构示意图;
图16为本申请实施例示出的一种视频问答方法的流程示意图之四;
图17为本申请实施例示出的一种视频问答方法装置的结构示意图;
图18为本申请实施例示出的另一种电子设备的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。其中,在本申请的描述中,除非另有说明,“/”表示前后关联的对象是一种“或”的关系,例如,A/B可以表示A或B;本申请中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况,其中A,B可以是单数或者复数。并且,在本申请的描述中,除非另有说明,“多个”是指两个或多于两个。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。另外,为了便于清楚描述本申请实施例的技术方案,在本申请的实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分。本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定,并且“第一”、“第二”等字样也并不限定一定不同。同时,在本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用
“示例性的”或者“例如”等词旨在以具体方式呈现相关概念,便于理解。
本申请实施例涉及的技术术语包括:
循环神经网络(Recurrent Neural Network,RNN)模型,是以序列数据为输入,在序列的演进方向进行递归且所有节点按链式连接的递归神经网络模型。
长短期记忆(Long-Short Term Memory,LSTM)模型,是时间循环神经网络模型,能够解决RNN模型存在的长期依赖的问题。RNN模型的单个循环结构内部只有一个状态,而LSTM模型的单个循环结构内部有四个状态。相比于RNN模型,LSTM模型循环结构之间保持一个持久 的单元状态不断传递下去,用于决定哪些信息要遗忘或者继续传递,能够避免在循环过程中出现梯度爆炸或者梯度消失,进而导致无法处理较长序列数据,无法获取长距离数据的信息的问题。
基于变换器的双向编码器表示(Bidirectional Encoder Representations from Transformers,BERT)模型,是语言表征模型,采用掩码语言模型生成深度的双向语言表征,其目的在于,利用大规模无标注预料训练,获得文本的包含丰富语义信息的模型表示。
残差神经网络(Residual Neural Network,ResNet)模型,是通过将人工神经网络(Artificial Neural Networks,ANN)中的某些层跳过下一层神经元,与下下层的神经元隔层相连形成的,能够弱化每层之间的强联系。需要说明的是,通常ANN中卷积层和池化层的层数越多,获取到的图片特征信息越全面,学习效果也越好,但是,实际上随着卷积层和池化层的叠加,可能出现梯度消失和梯度爆炸的现象,为了避免出现该现象,提出的ResNet模型。
自动语音识别(Automatic Speech Recognition,ASR)技术,是将人的语音转换为文本的技术。其目的在于,“听写”出不同人所说的连续语音,是实现“声音”到“文字”转换的技术。
变换器(Transformer,TRM)模型,是利用注意力机制提高模型训练速度的模型,依赖于注意力机制的架构,包括注意力机制和前馈神经网络。其目的在于,输入一种语言,输出另一种语言,即,对通过ASR技术得到的自然语言进行处理,提取有效文本。
视觉问答(Visual Question Answer,VQA)模型,是一种结合计算机视觉(Computer Vision,CV)和自然语言处理(Natural Language Processing,NLP)的学习任务,用于根据输入的图片和问题文本输出一个符合自然语言规则且内容合理的答案。其中,CV用于对给定图像进行图像识别、图像分类、目标跟踪等处理,NLP用于对自然语言进行机器翻译、信息检索、生成文本摘要等处理。
随着人工智能技术的发展,可以应用在智能控制、智能搜索、语言理解和图像理解等方面。对于语言理解和图像理解,可以应用于视频问答场景,能够解决自动救援搜索、智能家居管理和多媒体信息检索等实际问题。示例性的,提出问题“是否有人进门”,通过获取能够捕捉到人像的视频,对“是否有人进门”和该视频进行语言理解和图像理解(即,视觉处理和自然语言处理),查找问题对应的答案,实现视频问答。视频问答的具体实现方式,详见下述相关技术。
在第一种相关技术中,如图1所示,第一种视频问答的方法如下S101至S106所示。
S101、从待处理的原始视频及与原始视频对应的问题文本中,得到文本特征以及各帧图像中多个目标的第一视觉特征及第一语义特征。
S102、针对每帧图像中的每一目标,根据文本特征以及目标的第一视觉特征及第一语义特征,确定目标的第二视觉特征及第二语义特征。
S103、根据文本特征、目标的第二视觉特征及第二语义特征,得到该帧图像的第一全局视觉表示及第一全局语义表示。
S104、根据文本特征及各帧图像的第一全局视觉表示及第一全局语义表示,得到各帧图像的全局视觉表示。
S105、根据文本特征及各帧图像的全局视觉表示,得到原始视频的全局视觉特征表示。
S106、根据所述全局视觉特征表示及文本特征,可准确得到所述原始视频的问题答案。
如此,能够从视频中得到问题答案,然而,得到原始视频的各帧图像的全局视觉表示,耗费的时间较长,导致得到问题答案的速度较慢。
为了提高得到问题答案的速度,在第二种相关技术中,如图2所示,第二种视频问答的方法如下S201至S213所示。
S201、针对待回答的问题对应的视频,分别获取其中的各视频帧的描述信息。
S202、分别获取各个视频帧的描述信息与所述问题之间的相关性评分,按照相关性评分从大到小的顺序对各个视频帧进行排序,将排序后处于前M位的视频帧作为关键帧。
S203、获取所述视频对应的音频向量及所述问题对应的问题向量。
S204、针对任一关键帧,分别按照下述S205至S210所示方式进行处理。
S205、对该关键帧进行目标区域提取。
S206、对关键帧进行特征提取,得到关键帧对应的特征向量,并分别提取出各目标区域进行 特征提取,得到各目标区域对应的特征向量。
S207、获取该关键帧对应的文本向量。
S208、针对该关键帧中的每个目标区域,分别进行以下处理:将该目标区域对应的特征向量与以下向量进行拼接:该关键帧对应的特征向量,该关键帧对应的文本向量和音量向量;获取该目标区域对应的空间注意力权重,将空间注意力权重与拼接结果相乘,将相乘结果作为该目标区域的向量表示。
S209、将各个目标区域的向量标识与该关键帧对应的特征向量进行拼接,将拼接结果作为该关键帧的向量表示。
S210、获取该关键帧对应的时序注意力权重,将所述时序注意力权重与该关键帧的向量表示相乘,得到更新后的该关键帧的向量表示。
S211、确定所述问题为直观问题还是非直观问题。若是直观问题,则执行S212;若是非直观问题,则执行S213。
S212、利用各关键帧的向量表示及所述问题确定出对应的答案。
S213、利用各关键帧的向量表示,所述问题及对应的知识图谱确定出对应的答案。
如此,通过上述S201至S213,可以通过视频帧描述和文本,选取固定数目的关键帧,能够减少进行特征提取的视频帧数量,以此能够提高得到问题答案的速度。其中,选取的关键帧可能包括图像噪声,导致利用关键帧的向量表示不足以确定出问题对应的答案。同时,如果视频帧数量较大,那么选取关键帧仍然耗费的时间较长,导致仍然存在得到问题答案的速度较慢的问题。
在视频问答的场景中,用于查找问题答案的视频,往往会持续几天或者几个月,无论通过上述第一种相关技术,还是通过上述第二种相关技术,都不能解决得到问题答案的速度较慢的问题。因此,本申请实施例提供一种视频问答方法,该方法中,首先,获取目标视频和用户的问题文本。然后,获取问题文本中隐含的时间、对象(如人物)、语义和答案数量等关联参数,然后获取关联参数对应的关联视频,将关联视频进行分割得到至少一个视频片段,最后将至少一个视频片段和问题文本输入VQA模型,得出问题答案。
其中,语义的表示方式可以为文字,还可以为文本向量。答案数量可以为一个或者多个。VQA模型中的文本处理过程,可采用BERT模型、LSTM模型、词向量word2vec模型等自然语言处理模型。VQA模型中在进行图像处理过程中,可采用Transformer网络模型、ResNet网络等图像处理网络模型。关联参数中的对象,可以为人物或事物对象等等。具体的,人物对象是指通过语义识别得到的具有人物特征的文本,事物对象是指通过语义识别得到的具有事物特征的文本,事物对象可以为不能自动移动的事物(如电脑、房屋、树木等),以及可以自动移动的事物(如扫地机器人、宠物等)。
需要说明的是,答案数量只与问题文本相关,与关联视频无关,是预先判断的问题答案的可能数量。当然,答案数量与问题答案具有一定的相关性,例如,在答案数量为一个的情况下,问题答案的实际数量可能为零个或一个;在答案数量为多个的情况下,问题答案的实际数量可能为小于答案数量的任一自然数,如,0、1、2、3等数值。
示例性的,如图3所示,用户A向智能助手B提出问题文本“今天下午妈妈把钥匙放在哪里了”,其中,智能助手B能够拍摄视频、接收语音信息和播放问题答案。智能助手B理解“今天下午”对应的时间,“妈妈”对应的对象(人物对象),“找钥匙”对应的语义,然后按照问题文本的实际含义需要对应的问题答案,如,钥匙在桌子上。智能助手B还可以对问题文本进一步理解,确定询问钥匙的位置(也就是钥匙现在的放置位置),即使下午妈妈把钥匙放在3处不同的地方,问题答案只有一个。关联参数包括时间(今天下午)、目标人物(妈妈)、语义(找钥匙)和答案数量(1个)。
本申请实施例提供一种视频问答方法,获取关联参数对应的关联视频,能够减少隐含问题答案的视频帧的数量,进而减少VQA模型需要处理的至少一个视频片段的时间,以便于提高得到问题答案的速度。同时,避免不相关视频对问题答案的干扰,能够进一步提高得到问题答案的速度,并提高得到的问题答案的准确度。再者,通过VQA模型分别处理至少一个视频片段中的每个视频片段,由于每个视频片段包含一个独立的语义,能够避免各个视频片段之间相互干扰,能 够提高问题答案的准确度。
本申请实施例提供的视频问答方法,可应用于视频问答系统框架。如图4所示,视频问答系统框架包括输入/前端处理模块、ASR语音识别模块、摄像模块和视频问答模块。上述视频问答系统框架,用于回答用户提出的询问问题。上述视频问答系统框架中的各个模块可以配置于同一电子设备中。
示例性的,电子设备可以是电视、笔记本电脑、个人计算机、手机、平板电脑、智能音箱等支持语音识别的设备,本申请实施例对电子设备的具体形式不做特殊限制。
需要说明的是,视频问答系统框架中的各个模块,还可以配置在多个电子设备中,每个电子设备中至少配置一个模块。例如,在第一电子设备中配置语音处理模块和语音识别模块、在第二电子设备中配置摄像模块、在第三电子设备中配置视频问答模块。例如,在第四电子设备中配置语音处理模块、语音识别模块和视频问答模块,在第五电子设备中配置摄像模块。配置有数据传输关系模块的电子设备之间,能够进行通信。本申请实施例对视频问答系统框架对应的电子设备的数量不做限定。
输入/前端处理模块,用于将输入的语音流(用户的询问问题)处理成预置的数据格式,具体包括:音频解码、利用声纹或其他特征对输入的语音流进行分离和降噪;通过分帧、开窗、短时傅里叶变换等音频处理算法,提取音频特征;将音频特征对应的音频向量发送至ASR语音识别模块。一般的,输入/前端处理模块在终端侧实现。
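示例性的，下述代码给出一种对输入语音流进行分帧、开窗和短时傅里叶变换以提取音频特征的简化示意(仅为便于理解的示例实现，其中帧长、帧移等参数均为假设的示例值，并非对本申请实施例的限定)：

```python
import numpy as np

def extract_audio_features(waveform: np.ndarray,
                           frame_len: int = 400,    # 帧长，约25ms(按16kHz采样率，示例值)
                           frame_shift: int = 160   # 帧移，约10ms(示例值)
                           ) -> np.ndarray:
    """对输入语音流进行分帧、开窗和短时傅里叶变换，返回逐帧幅度谱特征(示意实现)。"""
    if len(waveform) < frame_len:                    # 语音过短时补零
        waveform = np.pad(waveform, (0, frame_len - len(waveform)))
    window = np.hamming(frame_len)                   # 开窗：汉明窗
    num_frames = 1 + (len(waveform) - frame_len) // frame_shift
    feats = []
    for i in range(num_frames):                      # 分帧
        frame = waveform[i * frame_shift: i * frame_shift + frame_len] * window
        feats.append(np.abs(np.fft.rfft(frame)))     # 短时傅里叶变换取幅度谱
    return np.stack(feats)                           # 形状为(帧数, 频点数)的音频特征
```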
ASR语音识别模块,用于获取音频特征,通过声学模型和语言模型将输入的音频特征转换为文本,典型的实现方式包括:通过声学模型获取音频特征中的声学特征对应的音素,通过语音模型获取音频特征中的语言特征对应的文字,串联输出音素和文字输出语音流对应的文本。一般的,ASR语音识别模块可以在终端侧实现。在本申请实施例中,对ASR语音识别模块所采用的语音识别方法不做限定。
其中,声学模型和语言模型都是神经网络结构,在进行模型训练过程中,是联合训练的,因此,通过声学模型和语言模型输出的音频特征对应的文本,为汉字序列。
摄像模块，用于拍摄特定区域的视频。摄像模块可以保存预置时长的视频。预置时长是指在拍摄的视频中，距离当前时间的时间长度，如3天、7天、一个月或三个月等等。需要说明的是，摄像模块还可以对拍摄的视频进行图像处理，例如分析视频语义、提取图像特征、提取人物特征、提取环境特征等等。
视频问答模块,用于接收文本(问题信息)和视频,输出问题信息对应的答案。典型的方式包括:通过提取文本特征和视频特征,并进行特征融合,利用分类或生成的方式得到问题的答案。
示例性的，上述视频问答系统框架可以应用在语音助手程序中，通过结合用户输入的语音文本以及摄像模块收集的视频，回答用户输入的语音文本包含的问题。
图5示出了电子设备的硬件结构示意图,能够实现视频问答模块的功能。如图5所示,电子设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。
可以理解的是,本发明实施例示意的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
其中,控制器可以是电子设备100的神经中枢和指挥中心。控制器可以根据指令操作码和时 序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
示例性的,处理器110可以获取问题文本和目标视频;将问题文本输入编码器,输出问题文本的关联参数,关联参数包括:时间、对象、语义和答案数量;获取关联参数对应的关联视频,然后将关联视频进行分割得到至少一个视频片段,最后将至少一个视频片段和问题文本输入VQA模型,得出问题答案。其中,问题文本是由语音流经过ASR转化得到的,问题文本还可以是直接获取的文字;视频库包括连续录制的视频。其中,对象可以为人物对象和事物对象等。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。
USB接口130是符合USB标准规范的接口,具体可以是Mini USB接口,Micro USB接口,USB Type C接口等。USB接口130可以用于连接充电器为电子设备100充电,也可以用于电子设备100与外围设备之间传输数据。也可以用于连接耳机,通过耳机播放音频。该接口还可以用于连接其他电子设备,例如AR设备等。
可以理解的是,本发明实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对电子设备100的结构限定。在本申请另一些实施例中,电子设备100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
示例性的,USB接口130可以用于传输语音流、问题信息、目标视频、关联视频和问题答案等数据。
电子设备100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。电子设备100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在电子设备100上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块150可以由天线1接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块150还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。在一些实施例中,移动通信模块150的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块150的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。
无线通信模块160可以提供应用在电子设备100上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块160经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
在一些实施例中,电子设备100的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得电子设备100可以通过无线通信技术与网络以及其他设备通信。所述无线通信技 术可以包括全球移动通讯系统(global system for mobile communications,GSM),通用分组无线服务(general packet radio service,GPRS),码分多址接入(code division multiple access,CDMA),宽带码分多址(wideband code division multiple access,WCDMA),时分码分多址(time-division code division multiple access,TD-SCDMA),长期演进(long term evolution,LTE),BT,GNSS,WLAN,NFC,FM,和/或IR技术等。所述GNSS可以包括全球卫星定位系统(global positioning system,GPS),全球导航卫星系统(global navigation satellite system,GLONASS),北斗卫星导航系统(beidou navigation satellite system,BDS),准天顶卫星系统(quasi-zenith satellite system,QZSS)和/或星基增强系统(satellite based augmentation systems,SBAS)。
示例性的,电子设备100的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得电子设备100可以用于传输语音流、问题信息、目标视频、关联视频和问题答案等数据。
电子设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,电子设备100可以包括1个或N个显示屏194,N为大于1的正整数。
示例性的,显示屏194可以用于显示问题文本、关联视频和问题答案等数据。
电子设备100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,电子设备100可以包括1个或N个摄像头193,N为大于1的正整数。
示例性的,摄像头193可以用于连续录制的视频,构成视频库中的视频。目标视频为视频库中的视频。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。电子设备100可以支持一种或多种视频编解码器。这样,电子设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展电子设备100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理 器110通过运行存储在内部存储器121的指令,从而执行电子设备100的各种功能应用以及数据处理。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储电子设备100使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。
示例性的,外部存储器接口120和内部存储器121,可以用于存储语音流、问题信息、目标视频、关联视频和问题答案等数据。
电子设备100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。电子设备100可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当电子设备100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。电子设备100可以设置至少一个麦克风170C。在另一些实施例中,电子设备100可以设置两个麦克风170C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,电子设备100还可以设置三个,四个或更多麦克风170C,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
示例性的,受话器170B和麦克风170C可以用于接收用户输入的语音流,音频模块170可以用于将语音流经过ASR转化得到问题文本,扬声器170A可以用于播放问题答案。
SIM卡接口195用于连接SIM卡。SIM卡可以通过插入SIM卡接口195,或从SIM卡接口195拔出,实现和电子设备100的接触和分离。电子设备100可以支持1个或N个SIM卡接口,N为大于1的正整数。SIM卡接口195可以支持Nano SIM卡,Micro SIM卡,SIM卡等。同一个SIM卡接口195可以同时插入多张卡。所述多张卡的类型可以相同,也可以不同。SIM卡接口195也可以兼容不同类型的SIM卡。SIM卡接口195也可以兼容外部存储卡。电子设备100通过SIM卡和网络交互,实现通话以及数据通信等功能。在一些实施例中,电子设备100采用eSIM,即:嵌入式SIM卡。eSIM卡可以嵌在电子设备100中,不能和电子设备100分离。
示例性的,SIM卡接口195可以在连接SIM卡后,电子设备100的天线1和移动通信模块150耦合,使得电子设备100能够用于传输语音流、问题文本、视频库中的视频和问题答案等数据。
基于图5所示的电子设备100实现本申请实施例中的视频问答方法时,电子设备100可以通过天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得电子设备100通过无线通信技术获取语音流、问题信息、问题文本或视频库中的目标视频。电子设备100还可以通过USB接口130获取语音流、问题文本和目标视频。其中,语音流还可以通过受话器170B和麦克风170C响应于用户输入得到,相应的,问题文本还可以通过音频模块170将语音流经过ASR转化得到。视频库中的视频还可以通过摄像头193连续录制的视频。处理器110可以获取问题信息对应的问题文本和目标视频;将问题文本输入编码器,输出问题文本的关联参数,关联参数包括:时间、对象、语义和答案数量;获取关联参数对应的关联视频,然后将关联视频进行分割得到至少一个视频片段,最后将至少一个视频片段和问题文本输入VQA模型,得出问题答案。显 示屏194可以用于显示问题文本、目标视频和问题答案等数据。显示屏194还可以用于显示问题答案对应的视频片段。
电子设备100的软件系统可以采用分层架构,事件驱动架构,微核架构,微服务架构,或云架构。本发明实施例以分层架构的Android系统为例,示例性说明电子设备100的软件结构。
图6是本发明实施例的电子设备的软件结构框图。
分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为应用程序层,应用程序框架层,安卓运行时(Android Runtime)和系统库,以及内核层。
如图6所示,应用程序层可以包括一系列应用程序包。应用程序包可以包括相机,图库,日历,通话,地图,导航,WLAN,蓝牙,音乐,视频,短信息等应用程序。
示例性的,基于应用程序层可以启动视频问答方法对应的应用程序,以便于得到问题文本对应的问题答案。
如图6所示,应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。应用程序框架层可以包括窗口管理器,内容提供器,视图系统,电话管理器,资源管理器,通知管理器等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。
示例性的，应用程序框架层可以通过窗口管理器控制显示屏194的显示区域大小。应用程序框架层还可以通过内容提供器存放或获取问题文本、视频库中的视频、问题答案和问题答案对应的视频片段等数据。应用程序框架层还可以通过视图系统显示用于指示是否显示问题文本、视频库中的视频、问题答案和问题答案对应的视频片段等数据的可视控件。当然，视频库中的视频，可以通过资源管理器传输至内容提供器。
如图6所示,Android Runtime包括核心库和虚拟机。Android Runtime负责安卓系统的调度和管理。核心库包含两部分:一部分是java语言需要调用的功能函数,另一部分是安卓的核心库。
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。
如图6所示,系统库可以包括多个功能模块。例如:表面管理器(surface manager),媒体库(Media Libraries),三维图形处理库(例如:OpenGL ES),2D图形引擎(例如:SGL)等。
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG,PNG等。
三维图形处理库用于实现三维图形绘图,图像渲染,合成和图层处理等。
2D图形引擎是2D绘图的绘图引擎。
如图6所示,内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。
示例性的,上述图5所示的USB接口130、天线1,天线2,移动通信模块150,无线通信模块160、外部存储器接口120、内部存储器121,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D、摄像头193和SIM卡接口195,需要通过内核层对应的驱动进行驱动后,才能实现传输语音流、问题信息、问题文本、目标视频和问题答案等数据,显示问题文本、目标视频和问题答案等数据,或者连续录制的视频。
下面结合视频问答场景,示例性说明电子设备100软件以及硬件的工作流程。
基于图5和图6所示的电子设备100实现本申请实施例中的视频问答方法时,示例性说明电子设备100软件以及硬件的工作流程。电子设备100可以通过内核层驱动天线1和移动通信模块150耦合,驱动天线2和无线通信模块160耦合,使得电子设备100通过无线通信技术获取语音流、问题信息、问题文本或目标视频。电子设备100还可以通过内核层驱动USB接口130获取语音流、问题文本和视频库中的视频。其中,电子设备100还可以通过内核层驱动受话器170B和麦克风170C响应于用户输入得到语音流,相应的,电子设备100还可以通过内核层驱动音频模块170将语音流经过ASR转化得到问题文本。电子设备100还可以通过内核层驱动摄像头193连续录制的视频库中的视频。电子设备的外部存储器接口120连接的外部存储设备,或者内部存储器121通过应用程序框架层中的内容提供器存储问题文本和视频库中视频。电子设备的处理器110通过应用程序框架层中的内容提供器获取问题文本和目标视频;通过Android Runtime和系统库,将问题文本输入编码器,输出问题文本的关联参数,关联参数包括:时间、对象、语义和答案数量;获取目标视频中关联参数对应的关联视频,将关联视频进行分割得到至少一个视频片段,将至少一个视频片段和问题文本输入VQA模型,得出问题答案。电子设备通过应用程序框架层的窗口管理器设置显示参数,并通过系统库的表面管理器与显示子系统进行管理,最后通过内核层驱动显示屏194显示问题文本、目标视频、问题答案以及问题答案对应的视频片段等数据。
在确保问题答案的准确性的基础上,为了解决得到问题答案的速度较慢的问题,本申请实施例提供一种视频问答方法,在实现过程中需要考虑以下两个方面:
第一、用于查找问题答案的视频，往往会持续几天或者几个月，需要获取与问题文本对应的关联视频，以便于精确地理解关联视频，得到问题答案。因此，在提取关联视频时需要考虑问题文本中隐含的时间、人物、语义和答案数量等关联参数。
第二、一个问题文本可能对应多个视频片段,如果将关联视频作为一个整体,查找问题文本对应的问题答案,那么关联视频中多个视频片段之间可能互相干扰。因此,可以分割关联视频得到多个视频片段,并对各个视频片段分别进行理解。尤其是,对于一个问题文本存在多个问题答案的情况,在查找问题文本对应的问题答案过程中,多个视频片段之间的互相干扰可能造成对视频的错误理解,导致得到的问题答案与问题文本不一致。
以下将以电子设备为手机为例,对本申请实施例提供的视频问答方法进行说明。如图7所示,该方法可以包括如下步骤S701-S704。
S701、电子设备获取问题文本中的关联参数。
在一些实施例中,电子设备获取问题信息中的关联参数之前,电子设备还可以获取用户的问题信息。获取问题文本中的关联参数,也可以理解为,电子设备获取问题信息中的至少一个关联参数,至少一个关联参数包括时间关联参数、对象关联参数和语义关联参数中的一个或多个。
其中,问题文本与问题信息相对应,问题信息是指包括询问问题的视频、音频、文字(可以为书面形式的文字、也可以为口语化的文字)。问题文本是指包括询问问题的文本。
在一些实施例中,电子设备可以响应于用户的文字输入获取问题信息。电子设备还可以响应于用户的语音输入获取语音流,再通过语音识别技术将语音流转换为问题信息。电子设备还可以通过响应于用户的视频输入获取问题信息。电子设备还可以通过有线或无线方式,从其他电子设备获取问题信息。在本申请实施例中,对问题信息的来源不做限定。
在一些实施例中,电子设备获取关联参数的具体实现方式可以包括:首先,电子设备将问题信息转换为问题文本。然后,电子设备对问题文本进行分词,得到词向量。再者,电子设备将词向量输入预置文本编码模型,得到文本特征。接下来,电子设备提取文本特征中的时间特征和对象特征。最后,电子设备获取问题信息中的至少一个关联参数。
其中,关联参数包括文本特征对应的语义关联参数,时间特征对应的时间关联参数,和对象特征对应的对象关联参数中的一个或多个。
可以理解的是,在获取问题信息后,电子设备对问题信息进行清洗、标准化等处理,得到问题文本。从问题文本中可能提取出文本特征、对象特征和时间特征中的一个或多个。当然,电子设备从问题文本中提取文本特征、对象特征和时间特征的过程中,如果不能从问题文本中提取到文本特征、对象特征或时间特征对应的特征参数,那么电子设备可以将对应的特征赋值为预置参数,预置参数用于标识该特征为无效特征。
在本申请实施例中,预置参数为不干扰得到问题答案的过程的参数。
以下,以对象关联参数为人物关联参数为例,对问题信息中的关联参数进行详细说明。
针对语义关联参数而言,如图8中的(a)所示,电子设备可以将问题文本进行分词,并将分词对应的词向量输入文本编码器(如BERT模型),通过文本编码器进行文本编码,获取文本特征,然后提取文本特征中的时间特征,再将文本特征中的非时间特征,输入人物分类器,提取人物特征和答案数量特征,最后获取时间特征对应的时间关联参数,人物特征对应的人物关联参数,以及答案数量特征对应的答案数量关联参数。文本特征即关联参数中的语义关联参数。
其中,文本编码器可以采用BERT模型,还可以采用LSTM模型。在本申请实施例中,以文本编码器可以采用BERT模型为例,说明如何获取关联参数。需要说明的是,BERT模型的主要输入是文本中各个字/词(或者称为token)的原始词向量,原始词向量可以是通过查询字向量表将文本中的每个字转换为一维向量,也可以是将利用词向量模型进行预训练后得到的向量。BERT模型的输出是文本中各个字/词对应的融合全文语义信息后的向量表示。
如此,从问题文本中获取与实际问题更相似的文本特征,能够提高语义关联参数的准确性,进一步提高得到的问题答案的准确性。
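示例性的，下述代码给出一种利用开源预训练BERT模型对问题文本进行分词和编码、得到文本特征的简化示意(仅为示例，其中transformers库及“bert-base-chinese”模型均为假设采用的开源实现，并非对预置文本编码模型的限定)：

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # 示例模型，仅作说明
encoder = BertModel.from_pretrained("bert-base-chinese")

def encode_question(question_text: str) -> torch.Tensor:
    """将问题文本分词并编码，返回融合全文语义信息的文本特征(示意实现)。"""
    inputs = tokenizer(question_text, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # last_hidden_state为各个字/词对应的向量表示，可作为文本特征(语义关联参数)
    return outputs.last_hidden_state.squeeze(0)

token_features = encode_question("今天下午妈妈把钥匙放在哪里了")
```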
针对时间关联参数而言,如图8中的(b)所示,电子设备可以将问题文本中的每一个文本分词对应的分词特征输入时间分类器(以序列标注方式标记输入文本中的时间关联词),提取文本特征中的时间特征。如此,通过序列标注方式获取时间特征,能够通过对序列标注方式的训练,提高获取的时间特征的准确性,进一步提高获取的时间关联参数的准确性,最终提高得到的问题答案的准确性。
需要说明的是,时间特征可以存在(有效特征),可以不存在(无效特征),获取时间特征之后,电子设备可以根据时间特征为有效特征还是无效特征,确定时间特征对应的时间关联参数。其中,时间关联参数包括视频起始时刻和视频终止时刻。
在时间特征为有效特征的情况下,根据预置映射规则,对时间特征对应的时间分词进行映射,确定视频起始时刻和视频终止时刻。可以理解为,如果问题文本中存在时间关联词,则将时间关联词映射的世界时间,确定为时间关联参数。
示例性的,问题文本为“今天下午钥匙放哪了”,时间分类器的分类结果为“BIIIO OOOO”,B表示时间关联词的第一个字,I标识时间关联词的内部字,O表示非时间词。根据分类结果“BIII”,获取问题文本对应的时间关联词“今天下午”,根据时间关联词“今天下午”,通过映射确定时间关联参数为“2022年9月7日12点至2022年9月7日18点”。
在时间特征为无效特征的情况下,确定视频终止时刻为:获取问题文本的时刻,确定视频起始时刻为:与视频终止时刻相距预置时长的时刻。可以理解为,如果问题文本中不存在时间关联词,则确定时间关联参数为从问题文本的获取时刻之前的N个小时。其中,N大于0,N小于视频库中视频的总时长。
示例性的，问题文本为“找钥匙”，时间分类器的分类结果为“OOO”。根据分类结果“OOO”，确定问题文本不存在对应的时间关联词。获取问题文本的获取时间(2022年9月7日15点30分)，确定时间关联参数为“2022年9月7日12点30分至2022年9月7日15点30分”，N为3小时。
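示例性的，下述代码给出一种根据时间分类器的BIO标注结果确定时间关联参数(视频起始时刻和视频终止时刻)的简化示意(仅为示例，其中时间关联词到具体时间段的映射表、默认时长N等均为假设的示例值)：

```python
from datetime import datetime, timedelta

def resolve_time_params(tokens, bio_tags, query_time: datetime, default_hours: int = 3):
    """根据BIO标注结果确定时间关联参数(视频起始时刻、视频终止时刻)的示意实现。"""
    time_words = "".join(tok for tok, tag in zip(tokens, bio_tags) if tag in ("B", "I"))
    if time_words:
        # 时间特征为有效特征：按预置映射规则映射为具体时间段(映射表仅为示例)
        mapping = {"今天下午": (query_time.replace(hour=12, minute=0, second=0),
                                query_time.replace(hour=18, minute=0, second=0))}
        if time_words in mapping:
            return mapping[time_words]
    # 时间特征为无效特征：终止时刻为获取问题文本的时刻，起始时刻为其前N小时
    return query_time - timedelta(hours=default_hours), query_time

start, end = resolve_time_params(list("今天下午钥匙放哪了"),
                                 ["B", "I", "I", "I", "O", "O", "O", "O", "O"],
                                 datetime(2022, 9, 7, 15, 30))
```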
针对人物关联参数而言,如图8中的(c)所示,电子设备将文本特征中的非时间特征,输入人物分类器,提取人物类别特征(即,确定对象关联参数为人物关联参数的对象特征)。电子设备获取对象特征对应的对象关联参数,具体包括:电子设备根据问题文本,确定用户身份信息;在预置身份关系表中,电子设备根据用户身份信息确定对象特征对应的目标人物;电子设备将目标人物对应的目标特征,确定为对象关联参数,目标特征包括以下至少一项:图像特征、行为特征和声纹特征。
值得说明的是，如果能够确认用户身份信息，并且人物分类器结果中包含人物类别特征，则电子设备在家庭成员关系表中，选取人物类别特征对应的目标人物。然后确定目标人物的目标特征为人物关联参数。其中，目标特征包括图像特征、行为特征和声纹特征等特征。目标特征可以是根据预先存储的目标人物的人像、声音等数据，通过预置特征提取算法提取的，预置特征提取算法可以为残差网络ResNet算法。
如此,通过人物分类器和用户身份信息获取目标特征,能够通过获取问题文本相关的目标人物的目标特征,提高获取的“人物”关联参数的准确性,最终提高得到的问题答案的准确性。
其中,上述电子设备根据问题文本,确定用户身份信息,可以包括以下任一项:在响应于文本输入获取问题信息的情况下,确定用户身份信息为:启动电子设备的生物特征对应的身份信息;在响应于语音输入获取问题信息的情况下,确定用户身份信息为:语音输入的语音流对应的声纹特征对应的身份信息;在响应于视频输入获取问题信息的情况下,确定用户身份信息为:视频输入的视频流对应的声纹特征和/或人脸特征对应的身份信息。
需要说明的是,如果问题信息是电子设备通过有线或无线方式从其他电子设备获取的,那么根据问题信息携带的声纹或人脸图像,确定用户身份信息。
其中,人物类别特征可以表示爸爸、妈妈、我、爷爷和奶奶等人物关系类别。例如,问题文本为“妈妈的钥匙”,人物分类器的结果为“妈妈”;问题文本为“找一下钥匙”,人物分类器的结果为“我”。
示例性的，如图9所示的家庭成员关系表，可以是预先设置的，也可以是随着电子设备的使用不断添加的。在家庭成员关系表中，包括家庭成员和成员关系两个主要元素，如，人物1的妈妈是人物2，人物1的爸爸是人物3，人物2的妈妈是人物4，人物2的爸爸是人物5。家庭成员关系表中的数据量，是根据家庭成员的多少确定的，在本申请实施例中不做限定。
示例性的,在图9的基础上,如果电子设备提取的人物类别特征为“妈妈”,确定用户身份信息为人物1,则通过查找家庭关系表可以获取的目标人物为人物2,然后将人物2的目标特征确定为人物关联参数。
需要说明的是，电子设备可以根据问题文本的获取方式，选取目标身份确认方法，并根据目标身份确认方式确定提出问题文本的用户身份信息。如果无法确认用户身份信息，那么不输出人物关联参数。电子设备无法确认用户身份信息，也就是，输入问题文本的用户，不在预设人脸和预设声纹识别范围内，或者人物分类器结果中没有该用户对应的目标人物。
可以理解的是,如果电子设备直接获取问题文本,那么将开启电子设备采集的生物特征(如指纹、声纹、或人脸图像等)对应的用户身份信息,确定为提出问题文本的用户身份信息。如果电子设备通过语音流间接获取问题文本,那么根据语音流对应的声纹,确定提出问题文本的用户身份信息。如果电子设备通过视频间接获取问题文本,那么根据视频中语音对应的声纹,和/或视频中的人脸图像,确定提出问题文本的用户身份信息。
针对答案数量关联参数而言，如图8中的(d)所示，电子设备将文本特征输入人物分类器，根据人物分类器结果中是否包含人物特征，并结合语义关联参数进行判断，确定答案数量特征对应的答案数量关联参数为1或多。如此，通过确定问题答案的答案数量，能够提高获取的答案数量关联参数的准确性，进一步提高得到的问题答案的准确性。
在另一些实施例中，在答案数量为多的情况下，在问题文本中还可能包括指定输出第几个答案。如图8中的(e)所示，电子设备将文本特征输入人物分类器(以序列标注方式标记输入文本中的序数关联词)，确定答案数量特征，同时，还提取问题文本中的答案序数特征。例如，问题文本为“今天下午第二个来的是谁”，虽然对应的答案数量为多，但是实际的答案为一个，答案序数为“第二”。
S702、电子设备获取关联参数对应的关联视频。
在一些实施例中,电子设备获取问题文本中的关联参数之前,电子设备还可以获取目标视频,然后获取目标视频中,关联参数对应的关联视频。
可以理解的是,电子设备还可以直接从视频库中目标视频,或者获取关联参数对应的关联视频。其中,目标视频可以存储在视频库中,视频库是指存放连续录制视频的数据库。视频库可以存储在电子设备中,也可以存储在服务器中、还可以存储在录制视频的录制设备中。
在第一种示例中,如果视频库可以存储在电子设备中,那么电子设备将关联参数作为筛选条件,从视频库的目标视频中获取符合筛选条件的关联视频。
在第二种示例中,如果视频库存储在服务器中,那么电子设备将关联参数作为筛选条件发送至服务器。服务器接收筛选条件,并从视频库的目标视频中提取符合筛选条件的关联视频,并将关联视频发送至电子设备,以使得电子设备获取关联参数对应的关联视频。
在第三种示例中,如果视频库存储在录制视频的录制设备中,那么电子设备将关联参数作为筛选条件发送至录制设备。录制设备接收筛选条件,并从视频库的目标视频中提取符合筛选条件的关联视频,并将关联视频发送至电子设备,以使得电子设备获取关联参数对应的关联视频。
在第四种示例中,如果视频库存储在录制视频的录制设备中,那么电子设备请求获取视频库中的全部视频(包括目标视频),录制设备将全部视频发送至电子设备。电子设备在接收到全部视频后,将关联参数作为筛选条件,从全部视频中获取符合筛选条件的关联视频。
在另一些实施例中,如图10所示,上述获取目标视频中,关联参数对应的关联视频,可以通过下述步骤S1001至S1003实现。
S1001、电子设备根据时间关联参数,从目标视频中提取第一视频。
S1002、电子设备根据对象关联参数,从第一视频中提取第二视频。
S1003、电子设备根据语义关联参数,从第二视频中提取关联视频。
在本申请实施例中,电子设备将时间关联参数作为视频录制时间段,提取目标视频中的第一视频。针对第一视频中的每个视频帧进行编码,获取每个视频帧的视频帧特征。电子设备计算对象关联参数(如,目标人物的目标特征),与每个视频帧的视频帧特征的第一特征相似度,如果第一特征相似度大于预置阈值,则保留该视频帧,如果第一特征相似度小于预置阈值,则删除该视频帧,直至依次判断第一视频中每个视频帧保留或删除,得到第二视频。最后,电子设备计算语义关联参数(文本特征),与第二视频帧的每个视频帧的视频帧特征的第二特征相似度,如果第二特征相似度大于预置阈值,则保留该视频帧,如果第二特征相似度小于预置阈值,则删除该视频帧,直至依次判断第二视频中每个视频帧保留或删除,得到关联视频。
需要说明的是,每个视频帧的视频帧特征,可以实现视频问答方法的过程中生成的,还可以是在录制视频的过程中生成的。每个视频帧的视频帧特征,可以以预训练的ResNet网络为编码器进行编码得到。
为了计算第一特征相似度和第二特征相似度,可以采用向量乘积方式,即,将目标特征对应的向量与视频帧特征对应的向量进行向量乘积,得到第一特征相似度;将文本特征对应的向量与视频帧特征对应的向量进行向量乘积,得到第二特征相似度。
如此,获取关联参数对应的关联视频,能够减少关联视频中隐含问题答案的视频帧的数量,进而减少VQA模型处理至少一个视频片段的时间,以便于提高得到问题答案的速度。同时,能够避免不关联视频对问题答案的干扰,能够进一步提高得到问题答案的速度,并提高得到的问题答案的准确度。再者,通过VQA模型分别处理至少一个视频片段中的每个视频片段,由于每个视频片段包含一个独立的语义,能够避免各个视频片段之间相互干扰,能够提高问题答案的准确度。
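示例性的，下述代码给出一种按照时间关联参数、对象关联参数和语义关联参数依次过滤视频帧、得到关联视频的简化示意(仅为示例，其中各特征向量假设已归一化，预置阈值取值仅为示例)：

```python
import numpy as np

def extract_related_video(frames, frame_feats, record_times,
                          start, end, target_feat, text_feat, thr=0.5):
    """按时间、对象、语义三个关联参数依次过滤视频帧的示意实现。
    frame_feats、target_feat、text_feat假设均为已归一化的特征向量。"""
    # 第一步：按时间关联参数截取第一视频
    kept = [i for i, t in enumerate(record_times) if start <= t <= end]
    # 第二步：按对象关联参数(如目标人物的目标特征)计算第一特征相似度，过滤得到第二视频
    kept = [i for i in kept if float(np.dot(target_feat, frame_feats[i])) > thr]
    # 第三步：按语义关联参数(文本特征)计算第二特征相似度，过滤得到关联视频
    kept = [i for i in kept if float(np.dot(text_feat, frame_feats[i])) > thr]
    return [frames[i] for i in kept]
```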
S703、电子设备将关联视频进行分割,得到至少一个视频片段。
在本申请实施例中，电子设备可以按照视频分割位置对关联视频进行分割，获取至少一个视频片段，其中，视频分割位置为：关联视频中录制时间差大于预置时间差的相邻视频帧所在的位置。
在一些实施例中,电子设备针对关联视频的时间上不连续的视频段进行判断,在相邻视频段中相邻视频帧的录制时间差,大于预置时间间隔的情况下,则将相邻视频段分割为不同的视频片段,在相邻视频段中相邻视频帧的录制时间差,不大于预置时间间隔的情况下,则将相邻视频段分割为相同的视频片段,以此,得到至少一个视频片段。
示例性的,关联视频的时间上不连续的视频段,包括f1、f2和f3等等。首先将f1确定属于第一个视频片段,如果f1和f2中相邻视频帧的录制时间差大于10秒,则确定f2属于第二个视频片段,否则确定f2也属于第一个视频片段。
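示例性的，下述代码给出一种按照相邻视频帧的录制时间差对关联视频进行分割的简化示意(仅为示例，其中预置时间差取10秒仅为示例值，record_times假设为各帧的录制时刻)：

```python
def split_by_time_gap(frames, record_times, max_gap_seconds: float = 10.0):
    """按相邻视频帧的录制时间差对关联视频进行分割，得到至少一个视频片段(示意实现)。"""
    segments, current = [], [frames[0]]
    for prev_idx, frame in enumerate(frames[1:]):
        gap = (record_times[prev_idx + 1] - record_times[prev_idx]).total_seconds()
        if gap > max_gap_seconds:       # 视频分割位置：录制时间差大于预置时间差
            segments.append(current)
            current = []
        current.append(frame)
    segments.append(current)
    return segments
```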
在一些实施例中,在至少一个视频片段中的相邻视频片段的特征均值的特征相似度大于预置阈值的情况下,将相邻视频片段合并,重新生成至少一个视频片段。
可以理解的是,在得到至少一个视频片段之后,电子设备还可以计算两个相邻视频片段的特征均值的相似度,如果该相似度大于预设阈值,则将两个相邻视频片段合并为一个视频片段。视频片段的特征均值,可以为视频片段中每个视频帧的视频帧特征相加的和值,再除以视频帧总数得到的平均值。
对于特征均值的特征相似度,可以采用向量乘积方式,即,将相邻视频片段中前一个视频片段的特征均值,与相邻视频片段中后一个视频片段的特征均值,进行向量乘积,得到两个相邻视频片段的特征均值的相似度。
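示例性的，下述代码给出一种根据相邻视频片段特征均值的特征相似度进行合并的简化示意(仅为示例，其中特征均值假设已归一化，合并后特征均值的更新方式为简化处理)：

```python
import numpy as np

def merge_similar_segments(segments, segment_feats, thr: float = 0.8):
    """相邻视频片段特征均值的相似度(向量乘积)大于预置阈值时合并片段的示意实现。
    segment_feats[i]为第i个片段中各视频帧特征的均值向量(假设已归一化)。"""
    merged, merged_feats = [segments[0]], [segment_feats[0]]
    for seg, feat in zip(segments[1:], segment_feats[1:]):
        if float(np.dot(merged_feats[-1], feat)) > thr:
            merged[-1] = merged[-1] + seg                      # 合并相邻视频片段
            merged_feats[-1] = (merged_feats[-1] + feat) / 2   # 简化的均值更新(示例做法)
        else:
            merged.append(seg)
            merged_feats.append(feat)
    return merged
```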
示例性的,如图11所示,如果问题文本为“妈妈把炒锅放在哪里了”,那么追踪妈妈的位置,捕捉妈妈的动作,是得到问题答案的关键。假设,妈妈的活动场景包括,场景一为:妈妈在厨房备菜,餐厅处于无人状态;场景二为:厨房处于无人状态,妈妈在餐厅摆放菜品;场景三为:妈妈回到厨房备菜,餐厅处于无人状态。其中,场景一和场景三对应的厨房视频属于关联视频,且,场景一和场景三对应的厨房视频的特征均值的相似度大于预设阈值,是相邻视频片段。
如此,对于特征均值的相似度较高的视频片段,在视频帧特征、人物追踪等等方面,可以复用的数据较多,如果将特征均值的相似度较高的视频片段进行合并,可以减少后续数据处理所需的时间,进而提高得出问题文本对应的问题答案的速度。
在一些实施例中,电子设备还可以提取文本特征中的答案数量特征;获取答案数量特征对应的答案数量关联参数;然后在将关联视频进行分割,得到至少一个视频片段之后,电子设备还可以在答案数量关联参数为1的情况下,删除至少一个视频片段中除了第一视频片段以外的其他片段。其中,第一视频片段为:至少一个视频片段中最后录制的视频片段。
可以理解的是,如果关联参数包括答案数量关联参数,在答案数量关联参数为1的情况下,电子设备保留至少一个视频片段中的录制时间最近的视频片段;在答案数量关联参数为多的情况下,电子设备保留至少一个视频片段中所有的视频片段。也就是,对于答案数量关联参数为1的情况,仅保留录制时间最近的视频片段对应的视频片段,以便于进一步减少视频片段中的数据量,提高得到问题答案的速度。
在一些实施例中,如果关联参数包括答案数量关联参数和答案序数关联参数,在答案数量关联参数为多的情况下,电子设备得出问题文本对应的多个可能答案(即,对应于多个视频片段),但是,如果关联参数中还包括答案序数关联参数,那么最终的问题答案为答案序数关联参数对应的一个问题答案(即,对应于多个视频片段中与答案序数关联参数对应的视频片段)。
具体的,电子设备还可以提取文本特征中的答案序数特征;获取答案序数特征对应的答案序数关联参数;然后在将关联视频进行分割,得到至少一个视频片段之后,电子设备还可以在答案数量关联参数为多,且,答案数量关联参数大于或等于答案序数关联参数的情况下,删除至少一个视频片段中除了第二视频片段以外的其他片段。
其中,第二视频片段为:至少一个视频片段中答案序数对应的视频片段,至少一个视频片段按照时间先后顺序排列。
如此,仅保留答案序数关联参数对应的视频片段,以便于进一步减少视频片段中的数据量,提高得到问题答案的速度。
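示例性的，下述代码给出一种根据答案数量关联参数和答案序数关联参数筛选视频片段的简化示意(仅为示例，假设至少一个视频片段已按录制时间先后顺序排列)：

```python
def select_segments(segments, answer_count, answer_ordinal=None):
    """根据答案数量关联参数和答案序数关联参数筛选视频片段的示意实现。"""
    if answer_count == 1:
        return segments[-1:]                      # 仅保留最后录制的视频片段
    if answer_ordinal is not None and answer_ordinal <= len(segments):
        return [segments[answer_ordinal - 1]]     # 仅保留答案序数关联参数对应的视频片段
    return segments                               # 答案数量为多：保留全部视频片段
```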
S704、电子设备将至少一个视频片段和问题文本输入VQA模型，得出问题文本对应的问题答案，并展示问题答案。
在本申请实施例中,电子设备可以先获取至少一个视频片段中,问题文本对应的问题答案。再展示问题答案。
在一些实施例中,电子设备可以通过VQA模型,至少一个视频片段的视频帧特征,和问题文本的文本特征进行结合,得出问题文本对应的问题答案。在本申请实施例中,对特征结合方法不做限定。
如图12所示,T为文本特征,F为视频帧特征,将文本特征和视频帧特征输入变换Transformer神经网络,得到融合特征,再根据融合特征利用分类或生成方法得到问题答案。
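示例性的，下述代码给出一种将文本特征与视频帧特征输入Transformer网络得到融合特征、再通过分类得到问题答案的简化示意(仅为示例，其中特征维度、层数、候选答案数量等均为假设的示例值)：

```python
import torch
import torch.nn as nn

class SimpleVQAFusion(nn.Module):
    """将文本特征T与视频帧特征F拼接后送入Transformer编码器并分类的示意模型(结构仅为示例)。"""
    def __init__(self, dim: int = 768, num_answers: int = 1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, text_feat: torch.Tensor, frame_feat: torch.Tensor) -> torch.Tensor:
        # text_feat形状为(B, Lt, dim)，frame_feat形状为(B, Lf, dim)
        fused = self.fusion(torch.cat([text_feat, frame_feat], dim=1))  # 得到融合特征
        return self.classifier(fused.mean(dim=1))                       # 以分类方式得到问题答案
```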
可以理解的是,确定时间关联参数为从问题文本的获取时刻之前的N个小时,据此提取的关联视频中,可能不包括问题文本对应的问题答案。因此,如果电子设备不能得出问题文本对应的问题答案,那么电子设备可以提示用户在哪个时间段没有查找到问题文本对应的问题答案,并提示用户可以输入目标时间段,将目标时间段作为时间关联参数,重新查找问题文本对应的问题答案。
示例性的,如图13中的(a)所示,电子设备为手机C,手机C响应于用户A输入的问题文本“今天下午谁进门了”,手机C根据“今天下午(时间)”、“谁(任意人物)”和“进门(语义)”获取关联视频,将关联视频对应的所有至少一个视频片段输入VQA模型,依据各个视频片段确定下午进门的人物包括:人物甲、人物乙和人物丙,最后得到问题文本对应的问题答案即为人物甲、人物乙和人物丙。
示例性的,如图13中的(b)所示,电子设备为手机C,手机C响应于用户A输入的问题文本“今天下午第二个进门的是谁”,手机C根据“今天下午(时间)”、“谁(任意人物)”、“进门(语义)”和“第二(答案序数)”获取关联视频,并将关联视频对应的所有至少一个视频片段输入VQA模型,依据各个视频片段确定下午进门的人物包括:人物甲、人物乙和人物丙,最后得到问题文本对应的问题答案即为:上述人物中的第二个人物(人物乙)。
在本申请实施例中,电子设备展示问题答案可以包括以下至少一项:通过语音播放方式,播放问题答案;通过文字显示方式,显示问题答案;通过视频显示方式,播放至少一个视频片段中问题答案对应的视频片段。
简而言之，电子设备可以通过语音播放、文字显示、视频显示的方式展示问题答案。其中，视频显示方式，可以显示至少一个视频片段中问题答案对应的视频片段。示例性的，参见图3，如果电子设备为智能助手B，那么智能助手B通过语音展示问题答案：钥匙在桌子上。再示例性的，如果电子设备为手机，那么手机通过视频展示问题答案，在手机的显示屏上显示钥匙所在位置的图像。再示例性的，如果电子设备为手机，那么手机通过文字展示问题答案，在手机的显示屏上显示“钥匙在桌子上”等文字。
需要说明的是,由于想要获取问题答案的用户的个人感知能力有限,如,不识字、视力障碍或者耳聋等等,采用至少一种展示方式展示问题答案,增加展示问题答案的多样性,以便于提高用户获知问题答案的概率。
本申请实施例提供一种视频问答方法,通过获取问题信息中的至少一个关联参数,对目标视频进行分割,并且得到至少一个视频片段,能够减少至少一个视频片段中隐含问题答案的视频帧的数量,进而减少处理至少一个视频片段的时间,以便于提高得到问题答案的速度。同时,能够避免目标视频中的不相关视频对问题答案的干扰,能够进一步提高得到问题答案的速度,并提高得到的问题答案的准确度。再者,获取问题文本对应的问题答案的过程中,分别处理至少一个视频片段中的每个视频片段,由于每个视频片段包含一个独立的语义,能够避免各个视频片段之间相互干扰,能够提高问题答案的准确度。
进一步地,在本申请实施例中,由于视频问答过程中,问题答案可能涉及隐私,因此,还可以在用户身份信息校验通过后再展示问题答案,比如,在图7的基础上,如图14所示,上述S704中的展示问题答案,还可以包括下述步骤S1401和S1402。
S1401、在问题文本、至少一个视频片段或问题答案包含隐私信息的情况下,电子设备对用户身份信息进行身份校验。
S1402、在身份校验通过的情况下,电子设备展示问题答案。
在本申请实施例中，如图15所示，将问题文本、至少一个视频片段或问题答案分别输入隐私特征提取模型，如果输出结果表明上述任一输入包含隐私信息，则电子设备对用户身份信息进行身份校验。其中，在获取人物关联参数过程中可以获取用户身份信息。
在一些实施例中,电子设备可以采用序列标注方法,判断输入特征是否包含隐私信息。
示例性的,如果问题文本为“保险箱密码”,显然,该问题文本包含隐私信息,需要电子设备对用户身份信息进行身份校验。对于在预置身份关系表中保存的用户身份信息都可以获取对应的问题答案,对于在预置身份关系表中未保存的用户身份信息(相对于预置身份关系表中各个人物而言的陌生人),为了避免陌生人根据获取的问题答案实现非法目的,则不可获取对应的问题答案。
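示例性的，下述代码给出一种判断输入是否包含隐私信息、从而决定是否进行身份校验的简化示意(仅为示例，其中privacy_labeler表示隐私特征提取模型，作为假设的外部依赖，此处假设其以序列标注方式逐字输出B/I/O标签)：

```python
def needs_identity_check(question_text, answer_text, privacy_labeler):
    """判断问题文本或问题答案是否包含隐私信息、从而决定是否进行身份校验的示意实现。"""
    for text in (question_text, answer_text):
        tags = privacy_labeler(text)                 # 逐字输出B/I/O标签(假设的外部模型)
        if any(tag in ("B", "I") for tag in tags):   # 命中隐私关联词，需要身份校验
            return True
    return False
```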
在一些实施例中,用户身份信息校验方式是根据问题文本的获取方式得到的。如果电子设备直接获取问题文本,那么将开启电子设备采集的生物特征(如指纹、声纹、或人脸图像等)对应的用户身份信息,确定为提出问题文本的用户身份信息。如果电子设备通过语音流间接获取问题文本,那么根据语音流对应的声纹,确定提出问题文本的用户身份信息。如果电子设备通过视频间接获取问题文本,那么根据视频中语音对应的声纹,和/或视频中的人脸图像,确定提出问题文本的用户身份信息。
可以理解的是,电子设备对用户身份信息进行校验采用的校验方式,与用户身份信息的实际信息类型相对应。在用户身份信息为指纹的情况下,根据指纹识别方法进行身份校验。
在另一些实施例中,在身份校验不通过的情况下,电子设备不展示问题答案,电子设备可以展示提示信息,提示不展示问题答案的原因,提示重新获取问题文本等等。
如此，通过判断问题文本、至少一个视频片段或问题答案是否包含隐私信息，确定用户身份信息是否需要进行身份校验，并在身份校验通过的情况下，才展示问题答案，在身份校验不通过的情况下，不展示问题答案，能够避免隐私泄露。
综上可知,如图16所示,本申请实施例提供的视频问答方法,将问题文本输入文本编码器,文本编码器可以进行文本特征识别、时间识别、人物识别和答案数量识别。如果根据时间识别确定问题文本中包括时间特征,则根据时间特征得到时间关联参数,从视频库中获取时间关联参数对应的第一视频。如果根据时间识别确定问题文本中不包括时间特征,则根据预置规则得到时间关联参数(确定时间关联参数为从问题文本的获取时刻之前的N个小时),从视频库中获取时间关联参数对应的第一视频。其后,如果根据人物识别确定问题文本中包括目标人物,则确定目标人物对应的目标特征为人物关联参数,从第一视频中获取人物关联参数对应的第二视频。其后,根据语义关联参数(文本特征),从第二视频中提取关联视频。其后,将关联视频进行分割,得到至少一个视频片段。其后,根据答案数量识别确定的答案数量,挑选至少一个视频片段中的与问题文本关联性更高的视频片段。最后根据VQA模型和是否包括隐私信息,展示问题答案。
如此,本申请实施例提供的视频问答方法,提出结合时间、人物、答案数量以及隐私的视频问答方案,能够快速定位问题文本对应的时间、人物的视频片段,并且能够根据文本语义准确输出一个或者多个答案,还能够识别文本和对应视频的隐私,支持身份校验。结合时间、目标人物以及文本语义相关性,对视频进行层层过滤,选取最相关的视频部分,能够快速响应,并且剔除了其它不相关干扰,能够准确回答问题。对筛选的视频部分按照时间间隔和语义相关性进行划分片段,因为问题可能包含多个时间段的答案,因此划分片段,对每一个片段进行回答,能够避免干扰,准确回答问题。根据文本识别答案的个数,进而按照时间维度选择视频片段进行回答,有些问题是有多个或者一个答案,因此能够选择最相关的片段去回答。
本申请实施例中还提供了一种视频问答装置,参见图17,该视频问答装置包括获取单元1701、分割单元1702和处理单元1703。
获取单元1701用于获取问题文本中的关联参数。例如执行前述实施例中的步骤S701。
获取单元1701用于获取关联参数对应的关联视频。例如执行前述实施例中的步骤S702。
分割单元1702用于将关联视频进行分割，得到至少一个视频片段。例如执行前述实施例中的步骤S703。
处理单元1703用于将至少一个视频片段和问题文本输入VQA模型,得出问题文本对应的问题答案,并展示问题答案。例如执行前述实施例中的步骤S704。
可以理解的是,为了实现上述功能,电子设备包含了执行各个功能相应的硬件和/或软件模块。结合本文中所公开的实施例描述的各示例的算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。本领域技术人员可以结合实施例对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本实施例可以根据上述方法示例对电子设备进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块可以采用硬件的形式实现。需要说明的是,本实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
本申请实施例还提供一种电子设备,如图18所示,该电子设备可以包括一个或者多个处理器1001、存储器1002和通信接口1003。
其中,存储器1002、通信接口1003与处理器1001耦合。例如,存储器1002、通信接口1003与处理器1001可以通过总线1004耦合在一起。
其中,通信接口1003用于与其他设备进行数据传输。存储器1002中存储有计算机程序代码。计算机程序代码包括计算机指令,当计算机指令被处理器1001执行时,使得电子设备执行本申请实施例中的视频问答方法。
其中,处理器1001可以是处理器或控制器,例如可以是中央处理器(Central Processing Unit,CPU),通用处理器,数字信号处理器(Digital Signal Processor,DSP),专用集成电路(Application-Specific Integrated Circuit,ASIC),现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、晶体管逻辑器件、硬件部件或者其任意组合。其可以实现或执行结合本公开内容所描述的各种示例性的逻辑方框,模块和电路。所述处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,DSP和微处理器的组合等等。
其中,总线1004可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。上述总线1004可以分为地址总线、数据总线、控制总线等。为便于表示,图18中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
本申请实施例还提供一种计算机可读存储介质,该计算机存储介质中存储有计算机程序代码,当上述处理器执行该计算机程序代码时,电子设备执行上述方法实施例中的相关方法步骤。
本申请实施例还提供了一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述方法实施例中的相关方法步骤。
其中,本申请提供的电子设备、计算机存储介质或者计算机程序产品均用于执行上文所提供的对应的方法,因此,其所能达到的有益效果可参考上文所提供的对应的方法中的有益效果,此处不再赘述。
通过以上实施方式的描述,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个装置,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是一个物理单元或多个物理单元，即可以位于一个地方，或者也可以分布到多个不同地方。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该软件产品存储在一个存储介质中,包括若干指令用以使得一个设备(可以是单片机,芯片等)或处理器(processor)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上内容,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何在本申请揭露的技术范围内的变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (12)

  1. 一种视频问答方法,其特征在于,应用于电子设备,所述方法包括:
    获取目标视频和用户的问题信息;
    获取所述问题信息中的至少一个关联参数,所述关联参数包括时间关联参数、对象关联参数和语义关联参数中的一个或多个;
    根据所述至少一个关联参数,对所述目标视频进行分割,得到至少一个视频片段;
    获取所述至少一个视频片段中,所述问题信息对应的问题答案;
    展示所述问题答案。
  2. 根据权利要求1所述的方法,其特征在于,所述获取所述问题信息中的至少一个关联参数,包括:
    将所述问题信息转换为问题文本;
    对所述问题文本进行分词,得到词向量;
    将所述词向量输入预置文本编码模型,得到文本特征;
    提取所述文本特征中的时间特征和对象特征;
    获取所述文本特征对应的语义关联参数,所述时间特征对应的时间关联参数,和所述对象特征对应的对象关联参数中的一个或多个。
  3. 根据权利要求2所述的方法,其特征在于,所述时间关联参数包括视频起始时刻和视频终止时刻;所述获取所述时间特征对应的时间关联参数,包括:
    在所述时间特征为有效特征的情况下,根据预置映射规则,对所述时间特征对应的时间分词进行映射,确定所述视频终止时刻和所述视频起始时刻;
    在所述时间特征为无效特征的情况下,确定所述视频终止时刻为:获取所述问题文本的时刻,确定所述视频起始时刻为:与所述视频终止时刻相距预置时长的时刻。
  4. 根据权利要求2或3所述的方法,其特征在于,所述对象特征为人物类别特征,所述获取所述对象特征对应的对象关联参数,包括:
    根据所述问题文本,确定用户身份信息;
    在预置身份关系表中,根据所述用户身份信息确定所述对象特征对应的目标人物;
    将所述目标人物对应的目标特征,确定为所述对象关联参数,所述目标特征包括以下至少一项:图像特征、行为特征和声纹特征。
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述根据所述至少一个关联参数,对所述目标视频进行分割,得到至少一个视频片段,包括:
    获取所述目标视频中,所述至少一个关联参数对应的关联视频;
    将所述关联视频进行分割,得到至少一个视频片段。
  6. 根据权利要求5所述的方法,其特征在于,所述获取所述目标视频中,所述至少一个关联参数对应的关联视频,包括:
    根据所述时间关联参数,从所述目标视频中提取第一视频;
    根据所述对象关联参数,从所述第一视频中提取第二视频;
    根据所述语义关联参数,从所述第二视频中提取所述关联视频。
  7. 根据权利要求5或6所述的方法,其特征在于,所述将所述关联视频进行分割,得到至少一个视频片段,包括:
    按照视频分割位置对所述关联视频进行分割,获取所述至少一个视频片段,其中,所述视频分割位置为:所述关联视频中录制时间差大于预置时间差的相邻视频帧所在的位置。
  8. 根据权利要求7所述的方法,其特征在于,所述按照视频分割位置对所述关联视频进行分割,获取所述至少一个视频片段之后,所述方法还包括:
    在所述至少一个视频片段中的相邻视频片段的特征均值的特征相似度大于预置阈值的情况下,将所述相邻视频片段合并,重新生成所述至少一个视频片段。
  9. 根据权利要求8所述的方法,其特征在于,所述方法还包括:
    提取所述文本特征中的答案数量特征;
    获取所述答案数量特征对应的答案数量关联参数;
    所述将所述关联视频进行分割,得到至少一个视频片段之后,所述方法还包括:
    在所述答案数量关联参数为1的情况下,删除所述至少一个视频片段中除了第一视频片段以外的其他片段,所述第一视频片段为:所述至少一个视频片段中最后录制的视频片段。
  10. 根据权利要求9所述的方法,其特征在于,所述方法还包括:
    提取所述文本特征中的答案序数特征;
    获取所述答案序数特征对应的答案序数关联参数;
    所述将所述关联视频进行分割,得到至少一个视频片段之后,所述方法还包括:
    在所述答案数量关联参数为多,且,所述答案数量关联参数大于或等于所述答案序数关联参数的情况下,删除所述至少一个视频片段中除了第二视频片段以外的其他片段,其中,所述第二视频片段为所述至少一个视频片段中所述答案序数关联参数对应的视频片段,所述至少一个视频片段按照时间先后顺序排列。
  11. 一种电子设备,其特征在于,包括:存储器、一个或多个处理器;所述存储器与所述处理器耦合;其中,所述存储器中存储有计算机程序代码,所述计算机程序代码包括计算机指令,当所述计算机指令被所述处理器执行时,使得所述电子设备执行如权利要求1-10任一项所述的视频问答方法。
  12. 一种计算机可读存储介质,其特征在于,包括计算机指令,当所述计算机指令在电子设备上运行时,使得所述电子设备执行如权利要求1-10任一项所述的视频问答方法。
PCT/CN2023/120449 2022-10-20 2023-09-21 视频问答方法及电子设备 WO2024082914A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211289300.7 2022-10-20
CN202211289300.7A CN117917696A (zh) 2022-10-20 2022-10-20 视频问答方法及电子设备

Publications (1)

Publication Number Publication Date
WO2024082914A1 true WO2024082914A1 (zh) 2024-04-25

Family

ID=90729596

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/120449 WO2024082914A1 (zh) 2022-10-20 2023-09-21 视频问答方法及电子设备

Country Status (2)

Country Link
CN (1) CN117917696A (zh)
WO (1) WO2024082914A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763444A (zh) * 2018-05-25 2018-11-06 杭州知智能科技有限公司 利用分层编码解码器网络机制来解决视频问答的方法
CN112036276A (zh) * 2020-08-19 2020-12-04 北京航空航天大学 一种人工智能视频问答方法
CN112559698A (zh) * 2020-11-02 2021-03-26 山东师范大学 基于多模态融合模型的提高视频问答精度方法及系统
CN113032535A (zh) * 2019-12-24 2021-06-25 中国移动通信集团浙江有限公司 辅助视障人士视觉问答方法、装置、计算设备及存储介质
CN113392288A (zh) * 2020-03-11 2021-09-14 阿里巴巴集团控股有限公司 视觉问答及其模型训练的方法、装置、设备及存储介质
WO2021190078A1 (zh) * 2020-03-26 2021-09-30 华为技术有限公司 短视频的生成方法、装置、相关设备及介质
EP3920048A1 (en) * 2020-06-02 2021-12-08 Siemens Aktiengesellschaft Method and system for automated visual question answering
CN114387537A (zh) * 2021-11-30 2022-04-22 河海大学 一种基于描述文本的视频问答方法

Also Published As

Publication number Publication date
CN117917696A (zh) 2024-04-23

Similar Documents

Publication Publication Date Title
US20220310095A1 (en) Speech Detection Method, Prediction Model Training Method, Apparatus, Device, and Medium
WO2020078299A1 (zh) 一种处理视频文件的方法及电子设备
WO2021244457A1 (zh) 一种视频生成方法及相关装置
WO2021104485A1 (zh) 一种拍摄方法及电子设备
WO2020019220A1 (zh) 在预览界面中显示业务信息的方法及电子设备
WO2021258797A1 (zh) 图像信息输入方法、电子设备及计算机可读存储介质
CN112214636A (zh) 音频文件的推荐方法、装置、电子设备以及可读存储介质
CN112580400B (zh) 图像选优方法及电子设备
WO2021254411A1 (zh) 意图识别方法和电子设备
CN114816610B (zh) 一种页面分类方法、页面分类装置和终端设备
CN112840635A (zh) 智能拍照方法、系统及相关装置
CN114691839A (zh) 一种意图槽位识别方法
WO2023273543A1 (zh) 一种文件夹管理方法及装置
CN111460231A (zh) 电子设备以及电子设备的搜索方法、介质
WO2021238371A1 (zh) 生成虚拟角色的方法及装置
CN114697543B (zh) 一种图像重建方法、相关装置及系统
WO2021031862A1 (zh) 一种数据处理方法及其装置
CN114943976B (zh) 模型生成的方法、装置、电子设备和存储介质
WO2024082914A1 (zh) 视频问答方法及电子设备
CN115661941A (zh) 手势识别方法和电子设备
CN113507406B (zh) 消息管理方法及相关设备
CN116193275B (zh) 视频处理方法及相关设备
WO2023197949A1 (zh) 汉语翻译的方法和电子设备
CN115525783B (zh) 图片显示方法及电子设备
WO2023006001A1 (zh) 视频处理方法及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23878907

Country of ref document: EP

Kind code of ref document: A1