WO2021259336A1 - A modal information completion method, apparatus, and device - Google Patents

A modal information completion method, apparatus, and device

Info

Publication number
WO2021259336A1
WO2021259336A1 · PCT/CN2021/101905 (CN2021101905W)
Authority
WO
WIPO (PCT)
Prior art keywords
modal information
feature vector
modal
information
group
Prior art date
Application number
PCT/CN2021/101905
Other languages
English (en)
French (fr)
Inventor
李太松
李明磊
吴益灵
怀宝兴
袁晶
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to EP21829076.5A (published as EP4160477A4)
Publication of WO2021259336A1
Priority to US18/069,822 (published as US20230206121A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • G06F18/15Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing

Definitions

  • This application relates to the field of computer technology, and in particular to a method, apparatus, and device for completing modal information.
  • A modality refers to the source or form of information.
  • The definition of a modality is broad. For example, human touch, hearing, taste, sight, and smell can all serve as sources of information, and each can be regarded as a modality.
  • Information forms include voice, video, text, and so on, and each can likewise be regarded as a modality.
  • Various sensors, such as radar, pressure gauges, and accelerometers, are also sources of information; similarly, any sensor can be regarded as a modality.
  • The definition of a modality is thus quite broad and is not limited to the cases listed above. For example, two different languages can be considered two different modalities, and data collected in two different scenarios can also be considered two different modalities.
  • Multimodal machine learning (MMML) learns jointly from multiple modalities.
  • Modal information refers to the information content of a modality.
  • Multi-modal machine learning is mostly used to learn modal information of the image, video, audio, and text types.
  • Modality missing means that, among multiple pieces of modal information, at least one piece is missing part or all of its information. Missing modalities reduce the accuracy of multi-modal machine learning.
  • Data cleaning refers to discarding the remaining information of the modal information that has missing parts.
  • With data cleaning, all information of at least one modality in the group is removed, so that modality cannot be learned at all during multi-modal machine learning, and learning efficiency degrades.
  • Data filling refers to using zero values or the mean value of the modal information to fill in the missing part of at least one piece of modal information. Information filled in this way does not conform to the actual distribution of the modal information, so the missing modal information cannot be learned accurately during machine learning.
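  As a concrete illustration of the data-filling baseline described above, the sketch below fills missing feature entries with zeros or per-dimension means (NumPy-based; the function name and array layout are illustrative, not from the application):

```python
import numpy as np

def naive_fill(features: np.ndarray, missing_mask: np.ndarray, mode: str = "mean") -> np.ndarray:
    """Fill missing entries with zeros or the per-dimension mean of the
    observed values. This is the data-filling baseline the application
    criticizes: filled values need not follow the real distribution of
    the modal information."""
    filled = features.copy()
    if mode == "zero":
        filled[missing_mask] = 0.0
    else:
        observed = np.where(missing_mask, np.nan, features)
        col_mean = np.nanmean(observed, axis=0)  # per-dimension mean of observed values
        rows, cols = np.nonzero(missing_mask)
        filled[rows, cols] = col_mean[cols]
    return filled
```

  Mean filling matches the observed marginal statistics but ignores the correlations between modalities that the application's method exploits.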
  • This application provides a method, device, and equipment for completing modal information to accurately complete modal information.
  • This application provides a method for completing modal information, which can be executed by a completion device.
  • The completion device first obtains a modal information group, which includes at least two pieces of modal information. The completion device can then determine, according to the attributes of the modal information group, whether one or more pieces of modal information in the group are missing all of their information, and whether one or more pieces are missing part of their information.
  • The modal information that lacks some or all of its information is called the first modal information.
  • A piece of modal information in the modal information group other than the first modal information is called the second modal information.
  • The completion device can extract the feature vector of the second modal information and then, based on a preset feature vector mapping relationship, determine the target feature vector of the first modal information from the feature vector of the second modal information.
  • Because the pieces of modal information in a group are usually correlated, the target feature vector determined from the feature vector of the second modal information is closer to the true feature vector of the first modal information, which ensures the accuracy of the target feature vector.
  • When the completion device determines the target feature vector of the first modal information based on the preset feature vector mapping relationship and the feature vector of the second modal information, it may first determine a candidate feature vector of the first modal information based on the mapping relationship and the feature vector of the second modal information, and then determine the target feature vector of the first modal information from that candidate feature vector.
  • The completion device can adjust the candidate feature vector of the first modal information and use the adjusted candidate as the target feature vector, or it can use the candidate feature vector directly as the target feature vector.
  • By first determining a candidate feature vector and then deriving the target feature vector from it, the completion device can make the final target feature vector of the first modal information closer to its true feature vector.
  • the feature vector mapping relationship can be set in the form of data mapping.
  • the feature vector mapping relationship can also be set in the manner of a machine learning model.
  • the machine learning model learns the feature vector mapping relationship, and can be used to output the feature vector of other modal information according to the feature vector of the input modal information.
  • The completion device may determine the target feature vector of the first modal information based on the preset machine learning model and the feature vector of the second modal information.
  • the completion device can use the machine learning model to more conveniently determine the target feature vector of the first modal information.
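  As a minimal, hypothetical stand-in for the machine learning model that holds the feature vector mapping relationship, the sketch below learns a linear least-squares mapping between two modalities' feature vectors; the application itself mentions richer models such as Seq2Seq or MCTN, so this is only illustrative:

```python
import numpy as np

def fit_mapping(src_feats: np.ndarray, dst_feats: np.ndarray) -> np.ndarray:
    """Learn a linear mapping W with dst ≈ src @ W by least squares, from
    training samples where both modalities are complete. A deliberately
    simple stand-in for the learned feature vector mapping relationship."""
    W, *_ = np.linalg.lstsq(src_feats, dst_feats, rcond=None)
    return W

def complete_feature(src_vec: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Output the target feature vector of the missing (first) modality
    from the feature vector of the available (second) modality."""
    return src_vec @ W
```

  A real model would be nonlinear and trained on much larger paired corpora; the point is only that the mapping is fitted from complete pairs and then applied when one side is missing.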
  • the attributes of the modal information group include some or all of the following:
  • The quantity of modal information in the modal information group, the data amount of each piece of modal information in the group, and the type of each piece of modal information in the group.
  • the attribute of the modal information group can indicate one or more kinds of information of the modal information group, so that the completion device can quickly determine that part or all of the first modal information is missing.
  • Before extracting the feature vector of the second modal information, the completion device can, in addition to determining that part or all of the first modal information is missing, also determine according to the attributes of the modal information group that the second modal information is complete.
  • In this way, the completion device can quickly distinguish, according to the attributes of the modal information group, the missing modal information (such as the first modal information) from the non-missing modal information (such as the second modal information) in the group.
  • Manner 1: The completion device obtains first auxiliary information and determines the attributes of the modal information group according to the first auxiliary information.
  • The first auxiliary information can indicate the attributes of the modal information group, that is, part or all of the following: the quantity of modal information in the group, the data amount of each piece of modal information in the group, and the type of each piece of modal information in the group.
  • Manner 2: Second auxiliary information is preset in the completion device; it is the information that any modal information group received by the completion device needs to match, and the completion device can determine the attributes of the modal information group according to the preset second auxiliary information.
  • The second auxiliary information can indicate part or all of the following: the quantity of modal information in any acquired modal information group, the data amount of each piece of modal information in any acquired group, and the type of each piece of modal information in any acquired group.
  • Manner 3: The completion device determines the attributes of the modal information group according to the attributes of other modal information groups, that is, modal information groups acquired before the current one.
  • In this way, the completion device can flexibly determine the attributes of the modal information group in a variety of different ways.
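  The three manners above could be tried in order, as in this hypothetical sketch (the priority order among the manners is an assumption; the application does not fix one):

```python
def group_attributes(first_aux=None, preset_aux=None, previous_groups=None):
    """Determine the attributes of a modal information group by trying the
    three manners in turn: (1) first auxiliary information received with
    the group, (2) preset second auxiliary information, (3) the attributes
    of a previously acquired group."""
    if first_aux is not None:        # Manner 1
        return first_aux
    if preset_aux is not None:       # Manner 2
        return preset_aux
    if previous_groups:              # Manner 3: most recent earlier group
        return previous_groups[-1]
    raise ValueError("attributes of the modal information group cannot be determined")
```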
  • The modal information group may further include third modal information. The completion device can extract the feature vector of the third modal information and then, based on the preset feature vector mapping relationship, determine the target feature vector of the first modal information from the feature vectors of the third and second modal information.
  • In this way, the completion device can determine the target feature vector of the first modal information according to the feature vectors of multiple pieces of modal information in the modal information group.
  • When the completion device determines the target feature vector of the first modal information based on the preset feature vector mapping relationship and the feature vectors of the third and second modal information, it can determine another candidate feature vector of the first modal information from the feature vector of the third modal information, and then determine the target feature vector from the candidate feature vector of the first modal information together with this other candidate feature vector.
  • In this way, the completion device can accurately determine the target feature vector of the first modal information from the multiple candidate feature vectors of the first modal information.
  • When the completion device determines the target feature vector of the first modal information from its candidate feature vector and another candidate feature vector, it can proceed as follows: configure each of the two candidate feature vectors with a corresponding weight, and then determine the target feature vector of the first modal information from the candidate feature vectors and their corresponding weights.
  • In this way, the completion device determines the target feature vector of the first modal information by a weighted summation of the multiple candidate feature vectors.
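  The weighted summation of candidate feature vectors might be sketched as follows (normalizing the weights to sum to 1 is an assumption; the application leaves weight selection open):

```python
import numpy as np

def fuse_candidates(candidates, weights):
    """Weighted summation of several candidate feature vectors of the first
    modal information, one candidate per available modality. Weights are
    normalized so they sum to 1 before the summation."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.stack([np.asarray(c, dtype=float) for c in candidates])
    return (w[:, None] * stacked).sum(axis=0)
```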
  • The completion device can also determine whether the missing part of the first modal information meets a preset condition, for example, whether the data amount of the missing partial information, or its proportion, is greater or less than a threshold. After determining that the preset condition is met, the completion device determines the target feature vector of the first modal information.
  • the completion device can further determine the preset conditions that the missing part of the first modal information needs to meet, so that the target feature vector of the first modal information can be accurately determined subsequently.
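  A possible form of the preset condition check, assuming a proportion-based threshold (the 0.5 value and the "not greater than" direction are illustrative choices):

```python
def should_complete(total_units: int, missing_units: int, max_missing_ratio: float = 0.5) -> bool:
    """Preset condition check before completion: complete only when the
    missing proportion of the first modal information does not exceed a
    threshold. The application allows amount- or proportion-based,
    greater- or less-than conditions; this is just one instance."""
    return missing_units / total_units <= max_missing_ratio
```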
  • the modal information group includes multiple modal information, and the type of each modal information may be different.
  • The completion device can use the feature vector of one type of modal information to determine the target feature vector of another type of modal information.
  • the embodiment of the present application does not limit the type of the first modal information or the second modal information.
  • For example, the second modal information may be voice-type modal information.
  • The second modal information can also be image-type modal information, or text-type modal information.
  • the second modal information can also be structured data.
  • For different types of second modal information, the completion device can extract the feature vector in different manners.
  • For example, the completion device may extract the feature vector of the second modal information based on one-hot encoding.
  • In this way, the completion device can determine the feature vector of the second modal information in a manner suited to its type.
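  For instance, one-hot encoding of a token sequence, one of the extraction manners mentioned above, could look like this (a simple sketch; real extractors for voice or image modalities would differ):

```python
def one_hot_encode(tokens, vocabulary):
    """One-hot encoding: represent each token as a vector with a 1 at the
    token's vocabulary index and 0 elsewhere -- one way the completion
    device may extract a feature vector from text-type modal information."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    vectors = []
    for tok in tokens:
        v = [0] * len(vocabulary)
        v[index[tok]] = 1
        vectors.append(v)
    return vectors
```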
  • The specific type of the machine learning model is not limited; it can be, for example, a sequence-to-sequence (Seq2Seq) model or a multimodal cyclic translation network (MCTN).
  • The embodiments of the present application also provide a completion device; for its beneficial effects, refer to the description of the first aspect, which is not repeated here.
  • the device has the function of realizing the behavior in the method example of the first aspect described above.
  • the functions can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above-mentioned functions.
  • the structure of the device includes an information acquisition module, a feature extraction module, and a completion module. These modules can perform the corresponding functions in the method example of the first aspect. For details, please refer to the detailed description in the method example, which will not be repeated here.
  • an embodiment of the present application also provides a computing device.
  • the computing device includes a processor and a memory, and may also include a communication interface.
  • The processor executes the program instructions in the memory to perform the method provided by the above first aspect or any possible implementation of the first aspect.
  • The memory is coupled with the processor and stores the program instructions and data necessary for the modal information completion process.
  • The communication interface is used to communicate with other devices, for example to receive a modal information group, or to send the target feature vector of the missing modal information and the feature vectors of the non-missing modal information.
  • the present application provides a computing device cluster, which includes at least one computing device.
  • Each computing device includes a memory and a processor.
  • the processor of at least one computing device is configured to access the code in the memory to execute the first aspect or the method provided in any one of the possible implementation manners of the first aspect.
  • The present application provides a non-transitory readable storage medium.
  • The storage medium stores a program; when the program is run by a computing device, the computing device performs the method of the foregoing first aspect or any possible implementation of the first aspect.
  • The storage medium includes, but is not limited to, volatile memory such as random access memory, and non-volatile memory such as flash memory, hard disk drives (HDD), and solid state drives (SSD).
  • this application provides a computing device program product.
  • The computing device program product includes computer instructions; when they are executed by a computing device, the computing device performs the method of the foregoing first aspect or any possible implementation of the first aspect.
  • The computer program product may be a software installation package; when the method provided by the foregoing first aspect or any possible implementation of the first aspect needs to be used, the computer program product can be downloaded and executed on a computing device.
  • The present application also provides a computer chip connected to a memory; the chip is used to read and execute the software program stored in the memory, and to perform the method of the aforementioned first aspect or any possible implementation of the first aspect.
  • Figure 1 is a schematic structural diagram of a system provided by this application.
  • Figure 2 is a schematic structural diagram of a completion device provided by this application.
  • Figure 3 is a schematic diagram of a modal information completion method provided by this application.
  • FIG. 4 is a schematic diagram of training a machine learning model provided by this application.
  • FIG. 5A is a schematic diagram of a method for complementing modal information of voice type and image type provided by this application;
  • FIG. 5B is a schematic diagram of a text-type modal information completion method provided by this application.
  • Figure 6 is a schematic structural diagram of a computer cluster provided by this application.
  • FIG. 7 is a schematic structural diagram of a system provided by this application.
  • FIG. 1 is a schematic structural diagram of a system to which the embodiment of this application is applicable.
  • the system includes a collection device 100, a complement device 200, and optionally, an information processing device 300.
  • the collection device 100 is used to collect information, and the information collected by the collection device 100 may be used as modal information.
  • the embodiment of the present application does not limit the number and specific form of the collection device 100.
  • the system may include one or more collection devices 100.
  • The collection device 100 can be a sensor, a video camera, a smart camera, a monitoring device, a mobile phone, a tablet computer (pad), a computer with a transceiver function, a terminal device in a smart city or a smart home, or an Internet of Things (IoT) device.
  • the information collected by the collection device 100 may be used as a type of modal information, or may be a modal information group including multiple modal information.
  • For example, the collection device 100 can be a mobile phone: modal information of the voice type can be collected through the mobile phone's microphone, modal information of the image type through its camera, and modal information of the text type through application programs (such as WeChat or QQ).
  • the capture device 100 may be a camera, and the camera may capture video.
  • Video can be used as a modal information group, where the video includes voice, image, and text-type modal information.
  • The completion device 200 can obtain the modal information group from the collection device 100 and execute the modal information completion method provided in the embodiments of this application.
  • There is a connection between the collection device 100 and the completion device 200; its form is not limited in this embodiment of the application.
  • For example, the collection device 100 may be connected to the completion device 200 in a wireless or wired manner.
  • The completion device 200 (or some modules of the completion device 200) can also be set in the collection device 100, so that after the collection device 100 collects the modal information group, the modal information group can be obtained quickly and the modal information completion method provided in the embodiments of this application can be executed.
  • The information processing device 300 can obtain from the completion device 200 the feature vector of each piece of modal information in the modal information group.
  • These feature vectors include the target feature vector of the missing modal information and the feature vectors of the non-missing modal information.
  • The information processing device 300 uses the feature vector of each piece of modal information in the group to process and understand the modal information group.
  • the information processing device 300 includes a multi-modal machine learning model, and has the ability to process and understand modal information groups.
  • the manner in which the information processing device 300 processes and understands the modal information group is related to the application scenario of the multi-modal machine learning model.
  • the information processing device 300 may perform emotion recognition on the modal information group, and predict the emotions hidden in the modal information group.
  • the modal information group is a video
  • the information processing device 300 may generate a video tag (used to identify the category of the video) based on the video, and may also extract video features (such as the category and duration of the video) for video recommendation.
  • the video is recommended to users who have a potential demand for the video.
  • the information processing device 300 can analyze the modal information group and detect target information (such as false advertisements, violent content, etc.) in the modal information group.
  • the information processing device 300 may perform voice recognition on the modal information group to determine the voice content.
  • When the modal information group includes face information, voiceprint information, human gait information, fingerprint information, or iris information, the information processing device 300 can perform recognition on the modal information group to determine the user to whom it belongs.
  • The connection between the information processing device 300 and the completion device 200 is similar to the connection between the collection device 100 and the completion device 200.
  • The completion device 200 includes an information acquisition module 210, a feature extraction module 220, and a completion module 230.
  • The information acquisition module 210 is used to acquire a modal information group in which at least one piece of modal information is missing part or all of its information (for convenience of description, modal information missing part or all of its information is called missing modal information in this embodiment).
  • The feature extraction module 220 can obtain the modal information group from the information acquisition module 210. Modal information in the group that is not missing any information, that is, complete modal information, is called non-missing modal information for convenience of description in this embodiment.
  • The feature extraction module 220 extracts the feature vector of each piece of non-missing modal information; each piece corresponds to one feature vector.
  • The completion module 230 obtains the feature vectors of the non-missing modal information from the feature extraction module 220 and, based on the preset feature vector mapping relationship, determines the target feature vector of the missing modal information from them. For example, the completion module 230 may first determine one or more candidate feature vectors of the missing modal information based on the preset feature vector mapping relationship and the feature vectors of the non-missing modal information, and then determine the target feature vector of the missing modal information from those candidate feature vectors.
  • The feature vector mapping relationship indicates the mapping between feature vectors of different types of modal information, including the mapping between the feature vectors of the non-missing modal information and those of the missing modal information.
  • the embodiment of the application does not limit the setting form of the feature vector mapping relationship.
  • The feature vector mapping relationship may be set in the completion module 230 in the form of a machine learning model.
  • The machine learning model analyzes and learns the mapping relationship between the feature vectors of different types of modal information, and can output the feature vectors of one or more other pieces of modal information according to the input feature vectors of one or more pieces of modal information.
  • When the completion device 200 performs modal information completion, it first obtains the feature vectors of the non-missing modal information in the modal information group and, based on the preset feature vector mapping relationship, determines the target feature vector of the missing modal information from those feature vectors. Since a modal information group usually consists of multiple pieces of modal information with certain associations, the target feature vector determined from the non-missing modal information is closer to the true feature vector and the true information distribution of the missing modal information, so multi-modal machine learning based on the target feature vector of the missing modal information and the feature vectors of the non-missing modal information is also more accurate.
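  Putting the three modules together, an end-to-end sketch of the completion flow might look as follows (the dictionary representation, uniform candidate weights, and linear mapping matrices are all illustrative assumptions, not the application's concrete design):

```python
import numpy as np

def complete_group(group: dict, mapping: dict) -> dict:
    """Sketch of the completion device 200 pipeline: missing modalities are
    marked None (information acquisition module 210); feature vectors of
    non-missing modalities are assumed precomputed (feature extraction
    module 220); the completion module 230 maps them through learned
    matrices mapping[(src, dst)] and averages the candidates with uniform
    weights."""
    present = {name: np.asarray(vec, dtype=float)
               for name, vec in group.items() if vec is not None}
    result = dict(present)
    for name, vec in group.items():
        if vec is not None:
            continue  # non-missing: keep its feature vector as-is
        candidates = [present[src] @ mapping[(src, name)]
                      for src in present if (src, name) in mapping]
        result[name] = sum(candidates) / len(candidates)
    return result
```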
  • the method includes:
  • Step 301: The information acquisition module 210 acquires a modal information group, which includes at least two pieces of modal information.
  • Step 302: The information acquisition module 210 determines, according to the attributes of the modal information group, that part or all of the first modal information in the group is missing and that the group includes complete second modal information.
  • After acquiring the modal information group, the information acquisition module 210 may first determine, according to the attributes of the modal information group, whether the group contains missing modal information. If it does, the information acquisition module 210 sends the modal information group to the feature extraction module 220, that is, step 302 is executed. If the group contains no missing modal information, the information acquisition module 210 may still send it to the feature extraction module 220, which extracts the feature vector of each piece of modal information in the group; those feature vectors can then be sent to the information processing device 300 or to a training device, which can use them to train the multi-modal machine learning model.
  • The attributes of the modal information group can indicate part or all of the following: the quantity of modal information in the modal information group, and the data amount of each piece of modal information in the group.
  • this application does not limit the way of indicating the data amount of modal information.
  • The data amount of modal information may be the size of the modal information (such as the number of bytes occupied); for modal information such as voice or video, the data amount can also be indicated by its duration.
  • the attribute of the modal information group may also include the type of each modal information in the modal information group.
  • When the information acquisition module 210 determines, according to the attributes of the modal information group, whether the group includes missing and non-missing modal information, it first needs to determine the attributes of the modal information group. This embodiment of the application does not limit the method by which the information acquisition module 210 determines those attributes.
  • When acquiring the modal information group, the information acquisition module 210 may also acquire first auxiliary information, which can indicate the attributes of the modal information group; that is, the first auxiliary information indicates part or all of the following: the quantity of modal information in the modal information group, and the data amount of each piece of modal information in the group.
  • the first auxiliary information may also indicate the type or name of each modal information item in the modal information group.
  • the information acquisition module 210 may determine the attributes of the modal information group according to the first auxiliary information.
  • alternatively, the information acquisition module 210 may be pre-configured with second auxiliary information. The second auxiliary information may indicate the attributes of any modal information group acquired by the information acquisition module 210 (such as the number of modal information items in the group and the data amount of each modal information item); that is, any modal information group acquired by the information acquisition module 210 needs to satisfy the second auxiliary information.
  • the information acquisition module 210 may determine the attribute of the modal information group according to the second auxiliary information.
  • the information acquisition module 210 may also compare the attributes of one or more modal information groups obtained before the current modal information group with the current group, and determine the attributes of the current group accordingly, for example by using the attributes of those earlier modal information groups as the attributes of the current modal information group.
  • after the information acquisition module 210 determines the attributes of the modal information group, it can determine whether the acquired modal information group satisfies those attributes. For example, the information acquisition module 210 can determine whether the number of modal information items in the group is consistent with the number indicated by the attributes of the group; if consistent, the group includes all the modal information; otherwise, all information of one or more modal information items is missing from the group.
  • similarly, the information acquisition module 210 can determine whether the data amount of a modal information item is consistent with the data amount indicated by the attributes of the modal information group. If consistent, the modal information is complete, that is, non-missing modal information; otherwise, part of its information is missing and it is missing modal information.
  • for example, if the attributes of the modal information group determined by the information acquisition module 210 indicate that the group contains 3 modal information items, but the actually acquired group contains only 2, the information acquisition module 210 can determine that all information of one modal information item is missing from the group.
  • for example, if the attributes of the modal information group determined by the information acquisition module 210 indicate that the voice-type modal information in the group is 10 minutes of voice data, but the actually acquired voice-type modal information is only 2 minutes of voice data, the information acquisition module 210 can determine that part of the voice-type modal information in the group is missing.
  • the information acquisition module 210 may also use other methods to determine that there is missing modal information in the modal information group.
  • take the modal information group as a video that includes text, voice, image, and other types of modal information.
  • when determining whether the image-type modal information is missing, the information acquisition module 210 can detect whether blurred images exist in the image-type modal information; if so, the image-type modal information is determined to be missing. When determining whether the voice-type modal information is missing, it can check whether the total duration of the voice-type modal information equals the total duration of the video; if not equal, the voice-type modal information is missing, and if equal, it is not missing.
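As a rough illustration of the attribute checks described above, the logic might be sketched as follows; the dictionary layout and field names here are illustrative assumptions, not the patent's data format:

```python
# Minimal sketch of the attribute-based missing-information check: an item
# expected by the attributes but absent entirely is fully missing, and an item
# with less data than the attributes indicate is partially missing.

def find_missing(group, attributes):
    """Return names of modal information items missing in part or in whole."""
    missing = []
    for name, expected_amount in attributes["data_amounts"].items():
        actual = group.get(name)
        if actual is None:
            missing.append(name)           # all information missing
        elif actual["amount"] < expected_amount:
            missing.append(name)           # part of the information missing
    return missing

# Attributes indicate 3 items: 600 s of voice, 600 s of text, 9000 images.
attributes = {"data_amounts": {"voice": 600, "text": 600, "image": 9000}}
group = {"voice": {"amount": 120},         # only 2 minutes of voice acquired
         "text": {"amount": 600}}          # text complete; image item absent
print(find_missing(group, attributes))     # → ['voice', 'image']
```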
  • in the following, a modal information group that includes both missing modal information and non-missing modal information is taken as an example. Missing modal information may be missing part or all of its information; non-missing modal information refers to the complete modal information in the group.
  • the modal information group may lose information while it is transmitted, so that one or more modal information items in the group are missing some or all of their information. Alternatively, other devices may perform preprocessing operations on the modal information group, such as noise reduction, filtering, cleaning, compression, and re-encoding, so that part or all of one or more modal information items in the group is missing. For example, noise reduction usually eliminates the "noise" in the modal information, and eliminating this "noise" causes some information in the modal information to be deleted.
  • the information acquisition module 210 may also determine whether the missing part of the missing modal information meets a preset condition, for example, whether the data amount of the missing part (such as its size in bytes or the duration corresponding to it) is less than a first threshold. If so, the amount of missing information is small, and the information acquisition module 210 may execute step 303; otherwise, the modal information group may be discarded.
  • alternatively, the information acquisition module 210 may determine whether the proportion of the missing part relative to the total information of the missing modal information is less than a second threshold. If it is, the data amount of the missing part is small, and the information acquisition module 210 may execute step 303; otherwise, the modal information group may be discarded.
  • for example, if only a small number of images are missing from the image-type modal information, the information acquisition module 210 can determine to send the modal information group; if a large number of images are missing and the number of missing images is greater than an image threshold, the information acquisition module 210 may discard the modal information group.
  • likewise, if only a small amount of voice data is missing from the voice-type modal information, the information acquisition module 210 may determine to send the modal information group; if a large amount of voice data is missing and the duration of the missing voice data is greater than a time threshold, the information acquisition module 210 may discard the modal information group.
  • alternatively, the information acquisition module 210 may determine whether the ratio of the data amount of the remaining part of the missing modal information (the information left after the missing part is removed) to that of the missing part is greater than a third threshold. If it is, the data amount of the missing part is small, and the information acquisition module 210 may execute step 303; otherwise, the modal information group may be discarded.
  • the modal information can be unstructured data such as voice, images, or text, or it can be structured data, that is, data represented by a unified structure (such as a two-dimensional table).
  • the information acquisition module 210 may also analyze the type of the missing modal information, and determine whether to send the modal information group according to the analysis result.
  • if the type of missing modal information in the modal information group is image, the modal information is difficult to complete, since image-type modal information usually contains rich information, and the information acquisition module 210 may discard the modal information group. If the type of missing modal information in the modal information group is text, then since voice-type modal information still exists in the group, completing the modal information is less difficult, and the information acquisition module 210 can determine to send the modal information group.
  • Step 303 The information acquisition module 210 sends the modal information group to the feature extraction module 220.
  • Step 304: After the feature extraction module 220 obtains the modal information group, for each non-missing modal information item in the group, the feature extraction module 220 extracts its feature vector; each non-missing modal information item corresponds to a feature vector.
  • the feature extraction module 220 can extract a single feature vector from a non-missing modal information item, or it can extract multiple feature vectors from it.
  • the embodiments of the present application do not limit the manner of extracting feature vectors from non-missing modal information; any manner capable of extracting feature vectors is applicable to the embodiments of the present application.
  • the application scenarios of multimodal machine learning are different, and the feature extraction module 220 extracts feature vectors in different ways.
  • for example, the feature extraction module 220 can determine the feature vector of voice-type modal information based on the spectral features of the speech, low-level descriptors (LLDs), and other methods.
  • the feature extraction module 220 may obtain the feature vector of the modal information of the image type by convolving the face region in the image.
  • the feature extraction module 220 can determine the feature vector of voice-type modal information based on the spectral and timing features of the voice, and can obtain the feature vector of image-type modal information by convolving the entire image. The feature extraction module 220 may use the word vector of text-type modal information as the feature vector of the text-type modal information.
  • the feature extraction module 220 can extract the feature vector of the structured data in a one-hot manner.
  • for example, if the structured data is a statistic of the user's age, the feature extraction module 220 can construct a 100-dimensional vector; for an age of 18, the 18th value of the 100-dimensional vector is 1 and the remaining values are 0, and this 100-dimensional vector is the feature vector of the structured data.
  • if the structured data is a statistic of the user's gender, the feature extraction module 220 can construct a 2-dimensional vector: when the user's gender is female, the 2-dimensional vector is 10, and when the user's gender is male, the 2-dimensional vector is 01.
  • if the structured data is a statistical value such as temperature, pressure, or length, the value can be continuous. The feature extraction module 220 can first divide the data into intervals, each interval corresponding to a value range, and then use the one-hot method to extract the feature vector of the structured data.
  • for example, if the structured data is a temperature value, the range from 0 to 100 degrees can be divided into 100 intervals, each spanning 1 degree. A temperature value of 37.5 degrees belongs to the 37-38 interval. Extracting the feature vector by the one-hot method, the feature extraction module 220 constructs a 100-dimensional vector in which the 38th value is 1 and the remaining values are 0; this 100-dimensional vector is the feature vector of the structured data.
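The one-hot extraction for structured data described above (a discrete value indexing directly, a continuous value binned first) can be sketched as:

```python
import numpy as np

# Sketch of the one-hot feature extraction for structured data: discrete
# values index directly into the vector, continuous values are binned first.

def one_hot(index, dims):
    vec = np.zeros(dims)
    vec[index] = 1.0
    return vec

# Age 18 → 100-dimensional vector whose 18th value is 1 (0-based index 17).
age_vec = one_hot(17, 100)

# Temperature 37.5 degrees → 100 one-degree bins over 0-100 degrees; 37.5
# falls in the 37-38 bin, so the 38th value is 1 (0-based index 37).
temp_vec = one_hot(int(37.5), 100)

print(int(age_vec[17]), int(temp_vec[37]))  # → 1 1
```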
  • Step 305: The feature extraction module 220 sends the feature vectors of the non-missing modal information to the completion module 230.
  • Step 306: Based on the preset feature vector mapping relationship, the completion module 230 determines candidate feature vectors of the missing modal information according to the feature vectors of the non-missing modal information.
  • a feature vector mapping relationship is preset in the completion module 230, and the feature vector mapping relationship describes the mapping relationship between feature vectors of different types of modal information.
  • the feature vector mapping relationship includes, but is not limited to: the mapping relationship between the feature vector of voice-type modal information and the feature vector of image-type modal information, the mapping relationship between the feature vector of text-type modal information and the feature vector of image-type modal information, the mapping relationship between the feature vector of voice-type modal information and the feature vector of text-type modal information, and the mapping relationship between the feature vector of image-type modal information and the feature vector of text-type modal information.
  • the embodiment of the present application does not limit the setting form of the feature vector mapping relationship.
  • the feature vector mapping relationship may be a mapping relationship between data.
  • the feature vector mapping relationship can also be set in the form of a machine learning model that has learned, in advance, the mapping relationship between feature vectors of different types of modal information and can output feature vectors of other types of modal information according to the feature vector of the input modal information.
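The two forms of the mapping relationship just mentioned (a mapping between data versus a learned model) could be illustrated roughly as follows; both mappings here are toy stand-ins, not the patent's implementation:

```python
import numpy as np

# Form 1: a mapping between data, e.g. a table keyed by (source, target) type.
# The matrices are arbitrary placeholders for the stored mapping data.
linear_maps = {("voice", "text"): np.eye(4) * 0.5,
               ("image", "text"): np.eye(4) * 2.0}

def map_by_table(src_type, tgt_type, feat):
    """Look up the stored mapping and apply it to the source feature vector."""
    return feat @ linear_maps[(src_type, tgt_type)]

# Form 2: a machine learning model that has learned the mapping in advance
# (the patent names Seq2Seq / MCTN; a fixed function stands in for the
# trained network's forward pass here).
def map_by_model(feat):
    return np.tanh(feat)

voice_feat = np.array([0.2, 0.4, 0.6, 0.8])
candidate_text_feat = map_by_table("voice", "text", voice_feat)
print(candidate_text_feat)
```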
  • the following takes the feature vector mapping relationship set in the completion module 230 in the form of a machine learning model as an example to illustrate the training method of the machine learning model; see Fig. 4:
  • Step 1: Prepare a multi-modal training set. The training set includes multiple modal information groups; each modal information group includes multiple modal information items, each of which is complete with nothing missing.
  • the multi-modal training set can also be used to train a multi-modal machine learning model. The embodiments of this application do not limit the training method of the multi-modal machine learning model; any training method that can use the multi-modal training set to realize the multi-modal machine learning model is applicable to the embodiments of this application.
  • Step 2: Extract the feature vector of each modal information item in each modal information group in the multi-modal training set. For the method of extracting the feature vector of modal information, refer to step 302, which will not be repeated here.
  • Step 3: Based on the feature vector of each modal information item in each modal information group in the multi-modal training set, train the preset machine learning model using supervised learning, so that the preset machine learning model learns the mapping relationship between feature vectors of different types of modal information and outputs feature vectors of other modal information according to the input feature vectors of modal information.
  • the preset machine learning model may be a sequence-to-sequence (Seq2Seq) model or a multimodal cyclic translation network (MCTN).
  • Step 4: Prepare a multi-modal test set. The test set includes multiple modal information groups; each modal information group includes multiple modal information items, each of which is complete with nothing missing.
  • Step 5: Extract the feature vector of each modal information item in each modal information group in the multi-modal test set. For the method of extracting the feature vector of modal information, refer to step 302, which will not be repeated here.
  • Step 6 Test the trained machine learning model based on the feature vector of each modal information in each modal information group in the multi-modal test set.
  • the embodiment of this application does not limit the way the trained machine learning model is tested.
  • for example, the feature vector of modal information A in modal information group M of the test set can be input into the trained machine learning model, which outputs a candidate feature vector of modal information B. The candidate feature vector of modal information B is compared with the actual feature vector of modal information B in the group; if they are consistent, or their similarity is greater than a set value, the model training can be considered successful. Otherwise it fails, and steps 1 to 3 are re-executed to continue training the machine learning model.
  • alternatively, the feature vector of modal information A of modal information group M in the test set can be input into the trained machine learning model, which outputs a candidate feature vector of modal information B. The candidate feature vector of modal information B and the feature vectors of the remaining modal information in group M are then input to the multi-modal machine learning model for analysis. If the multi-modal machine learning model's analysis result on the actual feature vector of modal information B together with the feature vectors of the remaining modal information is consistent with its analysis result on the candidate feature vector of modal information B together with those feature vectors, or the similarity is greater than the set value, the machine learning model training can be considered successful; otherwise it fails, and steps 1 to 3 are re-executed to continue training the model.
  • the successfully tested machine learning model can be configured in the completion module 230 to output candidate feature vectors of missing modal information according to the input feature vectors of non-missing modal information.
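As a minimal, hedged sketch of steps 1 to 6: the patent names Seq2Seq and MCTN as candidate mapping models, but here a plain linear least-squares map between modality feature spaces stands in for them, and all data is synthetic.

```python
import numpy as np

def train_mapping(src_feats, tgt_feats):
    """Step 3 stand-in: learn a linear map W so src @ W approximates tgt."""
    W, *_ = np.linalg.lstsq(src_feats, tgt_feats, rcond=None)
    return W

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 4))        # hidden ground-truth relation

# Steps 1-2: training set of paired (voice, text) feature vectors.
voice_train = rng.normal(size=(100, 8))
text_train = voice_train @ W_true

# Step 3: supervised training of the mapping model.
W = train_mapping(voice_train, text_train)

# Steps 4-6: on a held-out test set, compare the candidate feature vectors
# against the actual ones; accept if the similarity exceeds a set value.
voice_test = rng.normal(size=(10, 8))
text_test = voice_test @ W_true
candidate = voice_test @ W
sims = [cosine_similarity(c, t) for c, t in zip(candidate, text_test)]
print(min(sims) > 0.99)                 # training considered successful
```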
  • in other words, the completion module 230 can obtain one candidate feature vector of missing modal information from a single feature vector of non-missing modal information, or obtain multiple candidate feature vectors of missing modal information from multiple feature vectors of non-missing modal information; each feature vector of non-missing modal information yields one candidate feature vector of missing modal information.
  • Step 307 The completion module 230 can use the candidate feature vector of the missing modal information to determine the target feature vector of the missing modal information.
  • if only one candidate feature vector of the missing modal information is generated, the completion module 230 can directly use it as the target feature vector of the missing modal information.
  • that is, the completion module 230 can directly obtain the target feature vector of the missing modal information through step 306. The completion module 230 can also adjust the candidate feature vector of the missing modal information, such as scaling it up or down, and use the adjusted candidate feature vector as the target feature vector of the missing modal information. If multiple candidate feature vectors of the missing modal information are generated, the completion module 230 may determine the target feature vector of the missing modal information according to these multiple candidate feature vectors.
  • the method for the completion module 230 to determine the target feature vector of the missing modal information according to the multiple candidate feature vectors of the missing modal information is not limited in the embodiment of the application.
  • for example, one candidate feature vector of the missing modal information may be selected as the target feature vector. It is also possible to perform a weighted summation of the multiple candidate feature vectors of the missing modal information (that is, each candidate feature vector corresponds to a weight) to obtain the target feature vector of the missing modal information.
  • the weight of each candidate feature vector of missing modal information can be an empirical value, or it can be predetermined according to a multi-modal machine learning model.
  • for example, a variable parameter can be set as the weight of each candidate feature vector of the missing modal information, and the weighted summation of the multiple candidate feature vectors then yields a target feature vector that includes the variable parameter. By changing the specific value of the variable parameter and inputting the resulting target feature vector of the missing modal information together with the feature vectors of the non-missing modal information into the multi-modal machine learning model, the specific value of the variable parameter that yields the best output can be determined and used as the weight of the candidate feature vector of the missing modal information.
  • for example, the weight of one candidate feature vector of the text-type modal information is set as a parameter X, where X is between 0 and 1. The candidate feature vector of the text-type modal information determined from the feature vector of the voice-type modal information is f1, with weight X; the candidate feature vector of the text-type modal information determined from the feature vector of the image-type modal information is f2, with weight 1-X. The target feature vector of the text-type modal information obtained by weighted summation is X*f1+(1-X)*f2.
  • by changing the value of the parameter X, different output values of the multi-modal machine learning model are obtained. In different application scenarios, the output value of the multi-modal machine learning model indicates different information. Taking the emotion recognition scenario as an example, the output value of the multi-modal machine learning model indicates the emotional changes of the characters in the video. Among the output values of the multi-modal machine learning model, the one closest to the real emotional changes of the characters in the video is determined, and the value of the parameter X in the corresponding target feature vector of the text-type modal information is used as the weight of that candidate feature vector of the text-type modal information.
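The weighted summation above can be sketched as a sweep over X; the `score` function here is a hypothetical stand-in for judging the multi-modal model's output, not the patent's actual model:

```python
import numpy as np

# Two candidate feature vectors for the missing text modality: f1 derived from
# voice features, f2 from image features. They are fused as X*f1 + (1-X)*f2,
# and X is chosen by sweeping values and scoring each fused vector.

f1 = np.array([0.9, 0.1, 0.0, 0.4])          # candidate from voice features
f2 = np.array([0.7, 0.3, 0.2, 0.6])          # candidate from image features
reference = np.array([0.8, 0.2, 0.1, 0.5])   # stand-in for "closest to reality"

def score(vec):
    # Placeholder for the quality of the multi-modal model's output: here,
    # closeness of the fused vector to the reference (higher is better).
    return -np.linalg.norm(vec - reference)

best_x = max(np.linspace(0.0, 1.0, 101),
             key=lambda x: score(x * f1 + (1 - x) * f2))
target = best_x * f1 + (1 - best_x) * f2     # target feature vector
print(round(best_x, 2))
```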
  • after the completion module 230 uses the candidate feature vectors of the missing modal information to determine its target feature vector, it has both the target feature vector of the missing modal information and the feature vectors of the non-missing modal information. The completion module 230 can send the target feature vector of the missing modal information and the feature vectors of the non-missing modal information to the information processing device 300, and the information processing device 300 can process these feature vectors.
  • take the modal information group as a video that includes three types of modal information: voice, text, and image. The voice-type and image-type modal information are missing modal information, so the feature vectors of the voice-type and image-type modal information cannot be extracted; the text-type modal information is non-missing modal information, and its feature vector can be extracted. Based on the preset feature vector mapping relationship, the completion module 230 may generate candidate feature vectors of the voice-type and image-type modal information according to the feature vector of the text-type modal information, use these candidate feature vectors as the target feature vectors of the voice-type and image-type modal information, and send the target feature vectors together with the feature vector of the text-type modal information to the information processing device 300 for subsequent processing.
  • in another example, the modal information group is a video that includes three types of modal information: voice, text, and image. The voice-type and image-type modal information are non-missing modal information, and their feature vectors can be extracted separately; the text-type modal information is missing modal information, and its feature vector cannot be extracted. Based on the preset feature vector mapping relationship, the completion module 230 may generate two candidate feature vectors of the text-type modal information according to the feature vectors of the voice-type and image-type modal information, determine the target feature vector of the text-type modal information from these two candidates, and send it together with the feature vectors of the voice-type and image-type modal information to the information processing device 300 for subsequent processing.
  • an embodiment of this application also provides a computer cluster for executing the method shown in the above method embodiments. The computer cluster includes at least one computing device 600, and the computing devices 600 establish communication paths through a communication network.
  • Each computing device 600 includes a bus 601, a processor 602, a communication interface 603, and a memory 604.
  • the computing device 600 may also include a display screen 605.
  • the processor 602, the memory 604, and the communication interface 603 communicate through a bus 601.
  • the processor 602 may be composed of one or more general-purpose processors, such as a central processing unit (CPU), or a combination of a CPU and a hardware chip.
  • the above-mentioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a general array logic (generic array logic, GAL), or any combination thereof.
  • the memory 604 may include a volatile memory (volatile memory), such as a random access memory (random access memory, RAM).
  • the memory 604 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).
  • the memory 604 may also include a combination of the above types.
  • the memory 604 stores executable codes.
  • the processor 602 can read the executable codes in the memory 604 to realize functions, and can also communicate with other computing devices through the communication interface 603.
  • in this embodiment, the processor 602 can realize the functions of one or more modules of the completion device 200 (such as one or more of the information acquisition module 210, the feature extraction module 220, and the completion module 230); one or more modules of the completion device 200 (such as the information acquisition module 210 and the feature extraction module 220) are stored in the memory 604.
  • the processors 602 in the multiple computing devices 600 may work in coordination to execute the modal information completion method provided in the embodiments of the present application.
  • a system architecture provided by this embodiment of the application includes a client 200 and a cloud device 300 on which the completion device is deployed.
  • the client 200 and the cloud device 300 are connected via a network, and the cloud device 300 is located in a cloud environment and can be a server or a virtual machine deployed in a cloud data center.
  • here, the completion apparatus 100 deployed on one cloud device 300 is taken as an example; the completion apparatus may also be deployed on multiple cloud devices 300 in a distributed manner.
  • the client 200 includes a bus 201, a processor 202, a communication interface 203, a memory 204 and a display screen 205.
  • the processor 202, the memory 204, and the communication interface 203 communicate through a bus 201.
  • the memory 204 stores executable code, and the processor 202 can read the executable code in the memory 204 to implement functions.
  • the processor 202 may also communicate with the cloud device through the communication interface 203. For example, the processor 202 may prompt the user to input the modal information group through the display screen 205, and feedback the modal information group to the cloud device 300 through the communication interface 203.
  • the cloud device 300 includes a bus 301, a processor 302, a communication interface 303, and a memory 304.
  • the processor 302, the memory 304, and the communication interface 303 communicate through a bus 301.
  • the memory 304 stores executable code
  • the processor 302 can read the executable code in the memory 304 to implement functions, and can also communicate with the client 200 through the communication interface 303.
  • the processor 302 may implement the functions of the completion device 200. For example, the information acquisition module 210, the feature extraction module 220, and the completion module 230 of the completion device 200 are stored in the memory 304. After the processor 302 receives the modal information group from the client 200 through the communication interface 303, it can call the modules stored in the memory 304 to implement the modal information completion method provided in the embodiments of the present application.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.


Abstract

一种模态信息补全方法、装置及设备,本申请中,补全装置先获取模态信息组,该模态信息组包括至少两个模态信息;之后,补全装置可以根据该模态信息组的属性,判断该模态信息组中第一模态信息是否缺失部分或全部;之后,基于预设的特性向量映射关系,根据模态信息组中第二模态信息的特征向量确定第一模态信息的目标特征向量。补全装置利用第二模态信息的特征向量确定的该第一模态信息的目标特征向量更贴近第一模态信息真实的特征向量,保证了第一模态信息的目标特征向量的准确性。

Description

一种模态信息补全方法、装置及设备 技术领域
本申请涉及计算机技术领域,尤其涉及一种模态信息补全方法、装置及设备。
背景技术
模态是指信息来源或信息形式,模态的定义较为广泛,例如,人类的触觉、听觉、味觉、视觉、嗅觉都可以作为信息的来源,均可以看做一种模态。信息形式有语音、视频、文字等,分别可以作为一种模态。各种传感器,如雷达、压力计、加速度计等,也都是信息的来源,同样的,任一个传感器也可以作为一种模态。模态的定义较为广泛,并不仅限于上述列举的几种情况。例如,两种不同的语言可以认为是两种不同的模态,在两个不同场景下采集到的数据,也可以认为是两种不同的模态。
而多模态机器学习(multimodal machine learning,MMML)通过机器学习的方法获得处理和理解多个模态信息(一种模态信息是指模态的信息内容)的能力。目前,多模态机器学习多用于学习图像、视频、音频、文字等类型的模态信息。
但在多模态机器学习的过程中,通常会遇到模态缺失的情况,模态缺失是指多个模态信息中至少一个模态信息缺失了部分信息或全部信息。模态缺失会影响多模态机器学习的准确程度。
目前,针对模态缺失的问题,常见的处理方式有数据清洗以及数据填充。数据清洗是指剔除缺失的模态信息中的剩余信息,数据清洗的方式会导致多个模态信息中缺少至少一种模态的全部信息,使得在进行多模态机器学习时,不能对缺失的至少一种模态信息进行学习,多模态机器学习的效率变差。数据填充是指采用零值或模态信息的均值对缺失的至少一个模态信息的部分信息进行填充,这种方式下填充的信息并不符合实际的模态信息的分布情况,使得在进行多模态机器学习时,不能对缺失的至少一种模态信息进行准确的学习。
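作为上述数据填充方式的一个极简示意,下面的Python片段用零值或均值填充缺失的模态数据(函数名与数据表示均为假设,仅用于说明,并非本申请的方案):

```python
def fill_missing(values, strategy="mean"):
    """用零值或均值填充模态数据中的缺失项(以None表示缺失)。

    values: 模态信息的数值序列,缺失位置为None。
    strategy: "zero"用零值填充,"mean"用未缺失部分的均值填充。
    """
    present = [v for v in values if v is not None]
    # 无可用数据或选择零值策略时,统一填零
    fill = 0.0 if strategy == "zero" or not present else sum(present) / len(present)
    return [fill if v is None else v for v in values]
```

如上文所述,这样填充得到的值并不符合真实模态信息的分布情况,这正是本申请要改进之处。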
综上,针对模态缺失的问题的处理方式均会导致多模态机器学习的效率变差,准确率变低。
发明内容
本申请提供一种模态信息补全方法、装置及设备,用以准确对模态信息进行补全。
第一方面,本申请提供了一种模态信息补全方法,该方法可以由补全装置执行,在该方法中,补全装置先获取模态信息组,该模态信息组包括至少两个模态信息;之后,补全装置可以根据该模态信息组的属性,判断该模态信息组中是否缺失一个或多个模态信息(也即一个或多个模态信息缺失了全部信息),以及一个或多个模态信息是否缺失部分信息。为方便说明,缺失部分信息或全部信息的模态信息称为第一模态信息。模态信息组中除第一模态信息外的一个模态信息称为第二模态信息,补全装置可以提取该第二模态信息的特征向量;之后,基于预设的特性向量映射关系,根据第二模态信息的特征向量确定第一模态信息的目标特征向量。
通过上述方法,补全装置可以利用该模态信息组中第二模态信息的特征向量确定该第一模态信息的目标特征向量,利用第二模态信息的特征向量确定的该第一模态信息的目标特征向量更贴近第一模态信息真实的特征向量,保证了第一模态信息的目标特征向量的准确性。
在一种可能的实现方式中,补全装置在基于预设的特性向量映射关系,根据第二模态信息的特征向量确定第一模态信息的目标特征向量时,可以先基于特性向量映射关系,根据第二模态信息的特征向量确定第一模态信息的候选特征向量;之后,再根据第一模态信息的候选特征向量确定第一模态信息的目标特征向量。例如,补全装置可以对第一模态信息的候选特征向量进行调整,将调整后的第一模态信息的候选特征向量作为第一模态信息的目标特征向量,也可以直接将该第一模态信息的候选特征向量直接作为第一模态信息的目标特征向量。
通过上述方法,补全装置能够先确定第一模态信息的候选特征向量,之后再利用第一模态信息的候选特征向量确定第一模态信息的目标特征向量,以便最终确定的第一模态信息的目标特征向量能够更加接近与第一模态信息的真实的特征向量。
在一种可能的实现方式中,特性向量映射关系的设置方式有许多,例如,特性向量映射关系可以采用数据映射的方式设置,又例如,特性向量映射关系也可以采用机器学习模型的方式进行设置,机器学习模型学习了特性向量映射关系,能够用于根据输入的模态信息的特征向量输出其他的模态信息的特征向量。补全装置可以基于预设的机器学习模型,根据第二模态信息的特征向量确定第一模态信息的目标特征向量。
通过上述方法,补全装置可以利用机器学习模型的方式可以更加便捷的确定第一模态信息的目标特征向量。
在一种可能的实现方式中,模态信息组的属性包括下列的部分或全部:
模态信息组中模态信息的数量、模态信息组中每个模态信息的数据量、模态信息组中每个模态信息的类型。
通过上述方法,模态信息组的属性能够指示该模态信息组的一种或多种信息,以便补全装置能够较为快捷的确定第一模态信息缺失部分或全部。
在一种可能的实现方式中,补全装置提取第二模态信息的特征向量之前,除了确定第一模态信息缺失部分或全部,还可以根据模态信息组的属性,确定第二模态信息是完整的。
通过上述方法,补全装置根据模态信息组的属性能够快速区分出该模态信息组中缺失模态信息(如第一模态信息)或无缺失模态信息(如第二模态信息)。
在一种可能的实现方式中,补全装置确定该模态信息组的属性的方式有很多种,下面列举其中几种:
方式一、补全装置可以获取第一辅助信息,根据第一辅助信息确定模态信息组的属性,第一辅助信息能够指示该模态信息组的属性,也即可以指示下列的部分或全部:模态信息组中模态信息的数量、模态信息组中每个模态信息的数据量、模态信息组中每个模态信息的类型。
方式二、补全装置中预先设置有第二辅助信息,第二辅助信息为补全装置接收的任一模态信息组所需符合的信息,补全装置可以根据预设的第二辅助信息,确定模态信息组的属性,第二辅助信息可以指示下列的部分或全部:获取的任一模态信息组中模态信息的数量、获取的任一模态信息组中每个模态信息的数据量、获取的任一模态信息组中每个模态信息的类型。
方式三、补全装置根据其他模态信息组的属性确定模态信息组的属性,该其他模态信息组为在获取模态信息组之前所获取的模态信息组。
通过上述方法,补全装置可以通过多种不同的方式灵活的确定该模态信息组的属性。
在一种可能的实现方式中,模态信息组还包括第三模态信息;补全装置可以提取第三模态信息的特征向量;之后,基于预设的特性向量映射关系,根据第三模态信息的特征向量和第二模态信息的特征向量确定第一模态信息的目标特征向量。
通过上述方法,补全装置可以根据该模态信息组中的多个模态信息的特征向量确定第一模态信息的目标特征向量。
在一种可能的实现方式中,补全装置在基于预设的特性向量映射关系,根据第三模态信息的特征向量和第二模态信息的特征向量确定第一模态信息的目标特征向量时,可以基于特性向量映射关系,根据第三模态信息的特征向量确定第一模态信息的另一候选特征向量;之后,根据第一模态信息的候选特征向量和第一模态信息的另一候选特征向量确定第一模态信息的目标特征向量。
通过上述方法,补全装置可以根据多个第一模态信息的候选特征向量准确的确定第一模态信息的目标特征向量。
在一种可能的实现方式中,补全装置在根据第一模态信息的候选特征向量和第一模态信息的另一候选特征向量确定第一模态信息的目标特征向量时,可以为这两个候选特征向量配置对应的权重,之后,根据第一模态信息的候选特征向量和对应的权重、以及第一模态信息的另一候选特征向量和对应的权重,确定第一模态信息的目标特征向量。
通过上述方法,补全装置通过对该多个候选特征向量进行加权求和的方式,确定第一模态信息的目标特征向量。
在一种可能的实现方式中,补全装置在确定模态信息组缺失了第一模态信息的部分的情况下,补全装置还可以确定第一模态信息中缺失的部分是否符合预设条件。例如,可以确定第一模态信息缺失的部分信息的数据量或该缺失的部分信息的比例是否大于阈值或小于阈值。在确定符合预设条件后,补全装置可以确定第一模态信息的目标特征向量。
通过上述方法,补全装置可以进一步确定该第一模态信息中缺失的部分信息所需符合的预设条件,以便后续能够准确的确定出第一模态信息的目标特征向量。
在一种可能的实现方式中,模态信息组中包括多个模态信息,每个模态信息的类型可以不同。
通过上述方法,补全装置能够利用一个类型的模态信息的特征向量确定另一种类型的模态信息的目标特征向量。
在一种可能的实现方式中,本申请实施例并不限定第一模态信息或第二模态信息的类型,以第二模态信息为例,第二模态信息可以为语音类型的模态信息,也可以为图像类型的模态信息,还可以为文字类型的模态信息,第二模态信息还可以为结构化数据,对于不同类型的模态信息,补全装置可以采用不同的方式提取第二模态信息的特征向量,例如,该第二模态信息为结构化数据时,补全装置可以基于独热编码的方式,提取第二模态信息的特征向量。
通过上述方法,对于不同类型的第二模态信息,补全装置可以针对性的采用对应的方式确定第二模态信息的特征向量。
在一种可能的实现方式中,机器学习模型的具体类型并不限定,可以为Seq2Seq模型,也可以为MCTN。
第二方面,本申请实施例还提供了一种补全装置,有益效果可以参见第一方面的描述此处不再赘述。该装置具有实现上述第一方面的方法实例中行为的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的模块。在一个可能的设计中,所述装置的结构中包括信息获取模块、特征提取模块、补全模块。这些模块可以执行上述第一方面方法示例中的相应功能,具体参见方法示例中的详细描述,此处不做赘述。
第三方面,本申请实施例还提供了一种计算设备,所述计算设备包括处理器和存储器,还可以包括通信接口,所述处理器执行所述存储器中的程序指令执行上述第一方面或第一方面任一可能的实现方式提供的方法。所述存储器与所述处理器耦合,其保存模态信息补全的过程中必要的程序指令和数据。所述通信接口,用于与其他设备进行通信,例如接收模态信息组,又例如发送缺失模态信息的目标特征向量和无缺失模态信息的特征向量。
第四方面,本申请提供了一种计算设备集群,该计算设备集群包括至少一个计算设备。每个计算设备包括存储器和处理器。至少一个计算设备的处理器用于访问所述存储器中的代码以执行第一方面或第一方面的任意一种可能的实现方式提供的方法。
第五方面,本申请提供了一种非瞬态的可读存储介质,所述非瞬态的可读存储介质被计算设备执行时,所述计算设备执行前述第一方面或第一方面的任意可能的实现方式中提供的方法。该存储介质中存储了程序。该存储介质包括但不限于易失性存储器,例如随机访问存储器,非易失性存储器,例如快闪存储器、硬盘(hard disk drive,HDD)、固态硬盘(solid state drive,SSD)。
第六方面,本申请提供了一种计算设备程序产品,所述计算设备程序产品包括计算机指令,在被计算设备执行时,所述计算设备执行前述第一方面或第一方面的任意可能的实现方式中提供的方法。该计算机程序产品可以为一个软件安装包,在需要使用前述第一方面或第一方面的任意可能的实现方式中提供的方法的情况下,可以下载该计算机程序产品并在计算设备上执行该计算机程序产品。
第七方面,本申请还提供一种计算机芯片,芯片与存储器相连,芯片用于读取并执行存储器中存储的软件程序,执行前述第一方面或第一方面的任意可能的实现方式中提供的方法。
附图说明
图1为本申请提供的一种系统的结构示意图;
图2为本申请提供的一种补全装置的结构示意图;
图3为本申请提供的一种模态信息补全方法示意图;
图4为本申请提供的一种机器学习模型的训练示意图;
图5A为本申请提供的一种语音类型和图像类型的模态信息补全方法示意图;
图5B为本申请提供的一种文字类型的模态信息补全方法示意图;
图6为本申请提供的一种计算机集群的结构示意图;
图7为本申请提供一种系统的结构示意图。
具体实施方式
如图1所示,为本申请实施例适用的一种系统结构示意图,该系统包括采集设备100、补全装置200,可选的,还可以包括信息处理设备300。
采集设备100用于采集信息,采集设备100采集的信息可以作为模态信息,本申请实施例并不限定采集设备100的数量以及具体形态,该系统中可以包括一个或多个采集设备100。采集设备100可以是传感器、摄像机、智能相机、监控设备、手机(mobile phone)、平板电脑(pad)、带收发功能的电脑、智慧城市(smart city)中的终端设备、智慧家庭(smart home)中的终端设备、物联网(internet of things,IoT)终端设备等,凡是能够进行信息采集的设备均适用于本申请实施例。
采集设备100采集的信息可以作为一种模态信息,也可以为包括多个模态信息的模态信息组。例如采集设备100可以为手机,通过手机上设置的麦克风可以采集语音类型的模态信息,通过手机上的摄像头可以采集图像类型的模态信息,还可以通过手机上安装的应用程序(如微信、QQ)等采集文字类型的模态信息。采集设备100可以为摄像机,摄像机可以采集视频。视频可以作为模态信息组,其中,视频包括语音、图像、以及文字类型的模态信息。
补全装置200可以从采集设备100中获取模态信息组,执行本申请实施例提供的模态信息补全方法,采集设备100与补全装置200之间存在连接,本申请实施例并不限定采集设备100与补全装置200之间的连接方式。例如,采集设备100可以通过无线或有线的方式与补全装置200连接。又例如,补全装置200(或补全装置200中的部分模块)也可以设置在采集设备100中,在采集设备100采集到模态信息组后,可以较快速的获取模态信息组,执行本申请实施例提供的模态信息补全方法。
信息处理设备300可以从补全装置200中获取模态信息组中各个模态信息的特征向量,该模态信息组中各个模态信息的特征向量包括缺失模态信息的目标特性向量和无缺失模态信息的特性向量,根据模态信息组中各个模态信息的特征向量,对模态信息组进行处理和理解。例如,信息处理设备300中包括多模态机器学习模型,具备对模态信息组进行处理和理解的能力。信息处理设备300对模态信息组进行处理以及理解的方式,与多模态机器学习模型的应用场景有关。
例如,在情感识别场景中,信息处理设备300可以对模态信息组进行情感识别,预测模态信息组中隐藏的情绪。又例如,若模态信息组为视频,信息处理设备300可以基于视频生成视频标签(用于标识该视频的类别),还可以提取视频特征(如视频的类别、时长等)进行视频推荐,将该视频推荐给对该视频存在潜在需求的用户。又例如,信息处理设备300可以对模态信息组进行分析,检测模态信息组中的目标信息(如虚假广告、暴力内容等)。又例如,若模态信息组中包括语音类型的模态信息,或模态信息组为展示唇形的视频,信息处理设备300可以对模态信息组进行语音识别,确定语音内容。又例如,若模态信息组中包括人脸信息、声纹信息、人体的步态信息、指纹信息或虹膜信息,信息处理设备300可以对模态信息组进行身份识别,确定模态信息组所属的用户信息。
上述列举的几种场景仅是举例,本申请实施例并不限定信息处理设备300对模态信息组进行处理以及理解的方式。
信息处理设备300与补全装置200之间存在连接,信息处理设备300与补全装置200之间的连接方式与采集设备100与补全装置200之间的连接方式相似,具体可参见前述内容,此处不再赘述。
如图2所示,为本申请实施例提供的一种补全装置200的结构示意图,该补全装置200包括信息获取模块210、特征提取模块220、补全模块230。
信息获取模块210用于获取模态信息组,该模态信息组中缺失至少一个模态信息,或者至少一个模态信息中存在缺失(为方便说明,本申请实施例将存在部分缺失的模态信息或全部缺失的模态信息称为缺失模态信息)。
特征提取模块220能够从信息获取模块210获取该模态信息组,对于模态信息组中不存在缺失的模态信息,也即完整的模态信息(为方便说明,本申请实施例将不存在缺失的模态信息或完整的模态信息称为无缺失模态信息),特征提取模块220提取该无缺失模态信息的特征向量,每个无缺失模态信息对应一个特征向量。
补全模块230从特征提取模块220获取无缺失模态信息的特征向量,基于预设的特征向量映射关系,根据无缺失模态信息的特征向量确定缺失模态信息的目标特征向量,例如,补全模块230可以先基于预设的特征向量映射关系,根据无缺失模态信息的特征向量确定一个或多个缺失模态信息的候选特征向量,之后,根据一个或多个缺失模态信息的候选特征向量确定缺失模态信息的目标特征向量。
该特征向量映射关系指示了不同类型的模态信息的特征向量之间的映射关系,其中,不同类型的模态信息的特征向量之间的映射关系包括无缺失模态信息的特征向量与缺失模态信息的特征向量之间的映射关系,本申请实施例并不限定该特征向量映射关系的设置形式,例如该特征向量映射关系可以以机器学习模型的形式设置在补全模块230中,该机器学习模型可以分析不同类型的模态信息的特征向量之间的映射关系,学习不同类型的模态信息的特征向量之间的映射关系,能够根据输入的一个或多个模态信息的特征向量输出其他一个或多个模态信息的特征向量。
在本申请实施例中,补全装置200在进行模态信息补全时,需要先获取模态信息组中无缺失模态信息的特征向量,基于预设的特征向量映射关系,根据无缺失模态信息的特征向量确定缺失模态信息的目标特征向量。由于模态信息组通常为存在一定关联的多个模态信息,利用无缺失模态信息的特征向量确定的缺失模态信息的目标特征向量,更加接近缺失模态信息真实的特征向量,更贴近缺失模态信息的信息分布情况,基于缺失模态信息的目标特征向量和无缺失模态信息的特征向量进行多模态机器学习的准确程度也更高。
下面结合如图3,对本申请实施例提供的一种模态信息补全方法进行说明,参见图3,该方法包括:
步骤301:信息获取模块210获取模态信息组,该模态信息组中包括至少两个模态信息。
步骤302:信息获取模块210根据模态信息组的属性,确定该模态信息组缺失了第一模态信息的部分或全部,以及该模态信息组中包括完整的第二模态信息。
信息获取模块210在获取模态信息组后,可以先根据该模态信息组的属性判断该模态信息组中是否存在缺失模态信息,若确定该模态信息组包括缺失模态信息,将该模态信息组发送至特征提取模块220,也即执行步骤303;若确定该模态信息组中不包括缺失模态信息,信息获取模块210可以将该模态信息组发送至特征提取模块220,提取模态信息组中每个模态信息的特征向量,之后可以将模态信息组中每个模态信息的特征向量发送至信息处理设备300,也可以发送至训练设备,训练设备可以利用模态信息组中每个模态信息的特征向量对多模态机器学习模型进行训练。
在本申请实施例中,模态信息组的属性能够指示下列的部分或全部:该模态信息组中模态信息的数量、模态信息组中每个模态信息的数据量。其中,本申请并不限定模态信息的数据量的指示方式,例如,模态信息的数据量可以是模态信息的大小(如占用的字节数等),又例如,对于语音类型的模态信息,该模态信息的数据量可以用时长来指示。可选的,该模态信息组的属性还可以包括该模态信息组中每个模态信息的类型。
信息获取模块210根据该模态信息组的属性判断该模态信息组中是否包括缺失模态信息和无缺失模态信息之前,需要先确定该模态信息组的属性,本申请实施例并不限定信息获取模块210确定模态信息组属性的方法。
例如,信息获取模块210在获取模态信息组时,还可以获取第一辅助信息,该第一辅助信息能够指示该模态信息组的属性,也即第一辅助信息指示下列部分或全部:该模态信息组中模态信息的数量、模态信息组中每个模态信息的数据量。可选的,该第一辅助信息还可以指示该模态信息组中各个模态信息的类型或名称。信息获取模块210在获取了第一辅助信息后,可以根据该第一辅助信息确定该模态信息组的属性。
又例如,信息获取模块210预先配置了第二辅助信息,该第二辅助信息可以指示信息获取模块210获取的任一模态信息组的属性(如模态信息组中模态信息的数量、模态信息组中各个模态信息的数据量),也即信息获取模块210获取的任一模态信息组均需要满足该第二辅助信息。信息获取模块210可以根据该第二辅助信息确定该模态信息组的属性。
又例如,信息获取模块210可以参考在获取该模态信息组之前获得的一个或多个模态信息组的属性,将该一个或多个模态信息组的属性作为该模态信息组的属性。
信息获取模块210在确定了模态信息组的属性后,可以确定获取的多模态信息组是否满足该模态信息组的属性。例如,信息获取模块210可以确定多模态信息组中模态信息的数量是否与该多模态信息组的属性指示的模态信息的数量一致,若一致,则说明该多模态信息组中包括了所有的模态信息,否则,说明多模态信息组中缺失了一个或多个模态信息的全部信息。又例如,信息获取模块210可以确定多模态信息组中每个模态信息的数据量是否与该多模态信息组的属性指示的每个模态信息的数据量一致,对于任一模态信息,若该模态信息的数据量与该多模态信息组的属性指示的该模态信息的数据量一致,则说明该模态信息为完整的模态信息,也即无缺失模态信息,否则,说明该模态信息缺失部分信息,为缺失模态信息。
举例来说,信息获取模块210确定的模态信息组的属性指示该模态信息组中模态信息的数量为3个,而实际获得的该模态信息组中包括的模态信息的数量为2个,信息获取模块210可以确定该模态信息组中缺失一个模态信息的全部。又如,信息获取模块210确定的模态信息组的属性指示该模态信息组中语音类型的模态信息为10分钟时长的语音数据,而实际获得的该模态信息组中语音类型的模态信息为2分钟时长的语音数据,信息获取模块210可以确定该模态信息组中语音类型的模态信息缺失了部分信息。
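以上根据属性判断缺失的逻辑可以示意为如下Python片段(其中的字典字段与函数名均为假设,仅为原理说明):

```python
def check_missing(group, attrs):
    """根据模态信息组的属性,找出整体缺失或部分缺失的模态。

    group: {模态类型: 实际数据量},为实际获取的模态信息组;
    attrs: {模态类型: 期望数据量},为该模态信息组的属性。
    返回 (完全缺失的模态列表, 部分缺失的模态列表)。
    """
    # 属性中指示、但组中不存在的模态:缺失全部信息
    fully_missing = [m for m in attrs if m not in group]
    # 数据量小于属性指示值的模态:缺失部分信息
    partly_missing = [m for m, size in group.items()
                      if m in attrs and size < attrs[m]]
    return fully_missing, partly_missing
```

例如,属性指示语音模态的数据量为10(分钟)而实际只有2(分钟),则语音模态被判为部分缺失;属性中指示的文字模态在组中不存在,则被判为全部缺失。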
信息获取模块210也可以采用其他方法确定该模态信息组中存在缺失模态信息,以模态信息组为视频为例,该视频中包括文字、语音、图像等类型的模态信息,信息获取模块210在确定图像类型的模态信息是否存在缺失时,可以检测图像类型的模态信息中是否存在模糊的图像,若存在模糊的图像,确定图像类型的模态信息存在缺失;信息获取模块210在确定语音类型的模态信息是否存在缺失时,可以确定语音类型的模态信息的总时长是否等于视频的总时长,若不等于,确定语音类型的模态信息存在缺失,若等于,确定语音类型的模态信息不存在缺失。
在本申请实施例中以模态信息组包括缺失模态信息以及无缺失模态信息为例进行说明。缺失模态信息可以缺失部分信息,也可以是缺失全部信息。无缺失模态信息是指模态信息组中完整的模态信息。
在实际应用场景中,导致模态缺失的情况有许多,例如模态信息组在传输过程中,由于传输环境的影响,如传输线缆故障、传输网络中断、采集设备100故障等,导致模态信息组中一个或多个模态信息缺失了部分或全部信息。又例如,信息获取模块210在接收到模态信息组之前,其他设备对模态信息组进行了预处理操作,如降噪、过滤、清洗、压缩、再编码等,使得模态信息组中一个或多个模态信息缺失了部分或全部信息。以降噪为例,降噪通常对模态信息中存在的“噪音”剔除,而剔除“噪音”会导致模态信息中的一些信息被删除。
信息获取模块210在执行步骤303之前,还可以确定缺失模态信息缺失的部分是否符合预设条件,例如该缺失模态信息缺失的部分信息的数据量(如该部分信息的数据量的大小或该部分信息对应的时长)是否小于第一阈值,若该缺失模态信息缺失的部分信息的数据量小于第一阈值,该缺失模态信息缺失的部分信息的数据量较小,信息获取模块210可以执行步骤303,否则可以丢弃该模态信息组。
又例如,信息获取模块210可以确定该缺失模态信息缺失的部分信息占该缺失模态信息的总信息的比例是否小于第二阈值,若该缺失模态信息缺失的部分信息占该缺失模态信息的总信息的比例小于第二阈值,该缺失模态信息缺失的部分信息的数据量较小,信息获取模块210可以执行步骤303,否则可以丢弃该模态信息组。
以该模态信息组包括语音、文字、图像这三种类型的模态信息为例,若该模态信息组中缺失模态信息的类型为图像,图像类型的模态信息中缺失较少量的图像,缺失的图像数量小于图像阈值(第二阈值的一种表征形式),信息获取模块210可以确定发送该模态信息组;图像类型的模态信息中缺失大量的图像,缺失的图像数量大于图像阈值,信息获取模块210可以丢弃该模态信息组。若该模态信息组中缺失模态信息的类型为语音,语音类型的模态信息中缺失较少量的语音数据,缺失的语音数据的时长小于时间阈值(第二阈值的另一种表征形式),信息获取模块210可以确定发送该模态信息组;语音类型的模态信息中缺失大量的语音数据,缺失的语音数据的时长大于时间阈值,信息获取模块210可以丢弃该模态信息组。
又例如,信息获取模块210可以确定该缺失模态信息中剩余的部分信息(除去缺失的部分信息的信息即为剩余的部分信息)与缺失的部分信息的数据量比值是否大于第三阈值,若剩余的部分信息与缺失的部分信息的数据量比值大于第三阈值,该缺失模态信息缺失的部分信息的数据量较小,信息获取模块210可以执行步骤303,否则可以丢弃该模态信息组。
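上述几种预设条件的判断可以用如下片段示意(其中的阈值数值与函数名均为假设):

```python
def should_complete(total, missing, max_missing=60, max_ratio=0.3):
    """判断缺失模态信息是否满足进行补全的预设条件。

    total: 该模态信息的总数据量;missing: 缺失部分的数据量。
    缺失量小于第一阈值、且缺失比例小于第二阈值时才进行补全,否则丢弃。
    """
    if missing >= max_missing:        # 第一阈值:缺失部分的数据量
        return False
    if missing / total >= max_ratio:  # 第二阈值:缺失部分占总信息的比例
        return False
    return True
```

只有缺失量与缺失比例都足够小时才进行补全,否则按上文所述丢弃该模态信息组。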
本申请实施例并不限定模态信息的类型,例如模态信息可以为语音、图像、文字等类型的非结构化数据,模态信息也可以为结构化数据,其中,结构化数据为能够用统一结构(如二维表格)表示的数据。
信息获取模块210也可以分析该缺失模态信息的类型,根据分析结果确定是否发送该模态信息组。
以该模态信息组包括语音、文字、图像这三种类型的模态信息为例,若该模态信息组中缺失模态信息的类型为图像,由于图像类型的模态信息中通常蕴含较为丰富的信息,较难进行补全,信息获取模块210可以丢弃该模态信息组。若该模态信息组中缺失模态信息的类型为文字,由于模态信息组中还存在语音类型的模态信息,模态信息补全难度较小,信息获取模块210可以确定发送该模态信息组。
步骤303:信息获取模块210将模态信息组发送至特征提取模块220。
步骤304:特征提取模块220获取该模态信息组后,对于模态信息组中的无缺失模态信息,特征提取模块220提取该无缺失模态信息的特征向量,每个无缺失模态信息对应一个特征向量。
特征提取模块220在执行步骤304时,特征提取模块220可以只提取一个无缺失模态信息的特征向量,也可以提取多个无缺失模态信息的特征向量。本申请实施例并不限定提取无缺失模态信息的特征向量的方式,凡是能够提取特征向量的方式均适用于本申请实施例。
多模态机器学习的应用场景不同,特征提取模块220提取特征向量的方式也不同,以模态信息组是视频为例,在情感识别的场景下,特征提取模块220可以基于语音的频谱特征、低水平特征(low level descriptors,LLDs)等方式确定语音类型的模态信息的特征向量。特征提取模块220可以通过对图像中的人脸区域进行卷积获得图像类型的模态信息的特征向量。又例如,在视频推荐的场景下,特征提取模块220可以基于语音的频谱和时序特征确定语音类型的模态信息的特征向量,特征提取模块220可以通过对整个图像进行卷积获得图像类型的模态信息的特征向量,特征提取模块220可以将文字类型的模态信息的词向量作为文字类型的模态信息的特征向量。
若模态信息组中的一个或多个模态信息为结构化数据,针对结构化数据,特征提取模块220可以采用独热编码(one-hot)的方式提取结构化数据的特征向量。
例如,该结构化数据为用户年龄的统计数据,特征提取模块220可以构建一个100维的向量,当用户的年龄为18时,该100维的向量的第18个值为1,其余值为0,该100维的向量即为该结构化数据的特征向量。例如,该结构化数据为用户性别的统计数据,特征提取模块220可以构建一个2维的向量,当用户性别为女时,2维的向量为10,用户性别为男时,2维的向量为01。
对于结构化数据中包括的连续数据,例如结构化数据为温度、压力、或长度等统计值,温度、压力、或长度的统计值可以为连续值,特征提取模块220可以先划分数据区间,每个数据区间对应一个取值范围,之后再利用one-hot的方式提取结构化数据的特征向量。例如结构化数据为温度的统计值,温度值可以从0到100度划分为100个区间,每个区间的温度间隔为1度,当温度值为37.5度时,属于37-38的区间,在确定了温度值所属的区间后,利用one-hot的方式提取结构化数据的特征向量,特征提取模块220构建一个100维的向量,该100维的向量的第38个值为1,其余值为0,该100维的向量即为该结构化数据的特征向量。
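上述对连续值先分区间再独热编码的过程,可以用如下Python片段示意(区间划分沿用正文中的温度示例,函数名为假设):

```python
def one_hot_continuous(value, low=0, high=100):
    """将连续统计值(如温度)映射为独热编码的特征向量:
    先确定其所属的1度区间,再将向量中对应位置置1。"""
    dim = high - low
    # 37.5度落在[37,38)区间,对应向量的第38个分量(下标37)
    idx = min(max(int(value) - low, 0), dim - 1)
    vec = [0] * dim
    vec[idx] = 1
    return vec
```

按正文示例,温度值37.5度落在37-38度区间,对应100维向量的第38个分量置1,其余为0。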
步骤305:特征提取模块220将该无缺失模态信息的特征向量发送至补全模块230。
步骤306:补全模块230基于预设的特征向量映射关系,根据无缺失模态信息的特征向量确定缺失模态信息的候选特征向量。
补全模块230中预先设置了特征向量映射关系,该特征向量映射关系描述的不同类型的模态信息的特征向量之间的映射关系。以模态信息组包括语音、文字、图像这三种类型的模态信息为例,该特征向量映射关系包括但不限于:语音类型的模态信息的特征向量与图像类型的模态信息的特征向量之间的映射关系,文字类型的模态信息的特征向量与图像类型的模态信息的特征向量之间的映射关系,语音类型的模态信息的特征向量与文字类型的模态信息的特征向量之间的映射关系,图像类型的模态信息的特征向量与文字类型的模态信息的特征向量之间的映射关系。
本申请实施例并不限定特征向量映射关系的设置形式,例如特征向量映射关系可以为数据之间的映射关系。又例如,特征向量映射关系可以以机器学习模型的形式进行设置,该机器学习模型预先学习了不同类型的模态信息的特征向量之间的映射关系,能够根据输入的模态信息的特征向量输出其他类型的模态信息的特征向量。
下面以特征向量映射关系以机器学习模型的形式设置在补全模块230为例,对机器学习模型的训练方式进行说明,参见图4:
步骤1、准备多模态训练集,训练集中包括多个模态信息组,每个模态信息组中包括多个模态信息,每个模态信息是完整的,不存在缺失。该多模态训练集也可以用于训练多模态机器学习模型,本申请实施例并不限定多模态机器学习模型的训练方式,凡是能够利用多模态训练集实现多模态机器学习模型的训练的方式均适用于本申请实施例。
步骤2、提取多模态训练集中各个模态信息组中每个模态信息的特征向量。模态信息的特征向量的提取方式可以参见步骤304,此处不再赘述。
步骤3、基于多模态训练集中各个模态信息组中每个模态信息的特征向量,基于监督学习的方式对预设的机器学习模型进行训练,使得预设的机器学习模型可以学习到不同类型的模态信息的特征向量之间的映射关系,该预设的机器学习模型根据输入的模态信息的特征向量输出其他模态信息的特征向量。
预设的机器学习模型可以为序列到序列模型(sequence to sequence,Seq2Seq)或多模态循环翻译网络(multimodal cyclic translation network,MCTN)。
步骤4、准备多模态测试集,测试集中包括多个模态信息组,每个模态信息组中包括多个模态信息,每个模态信息是完整的,不存在缺失。
步骤5、提取多模态测试集中各个模态信息组中每个模态信息的特征向量。模态信息的特征向量的提取方式可以参见步骤304,此处不再赘述。
步骤6、基于多模态测试集中各个模态信息组中每个模态信息的特征向量,对训练好的机器学习模型进行测试。
本申请实施例并不限定训练好的机器学习模型进行测试的方式,例如,可以将测试集中的模态信息组M中模态信息A的特征向量输入至训练好的机器学习模型中,输出模态信息B的候选特征向量,比对模态信息组中模态信息B的特征向量与模态信息B的候选特征向量,若一致,或者相似度大于设定值,则可以认为模型训练成功,否则,失败,重新执行步骤1~3,继续对机器学习模型进行训练。又例如,可以将测试集中的模态信息组M的模态信息A的特征向量输入至训练好的机器学习模型中,输出模态信息B的候选特征向量,将输出的模态信息B的候选特征向量与该模态信息组M中的剩余模态信息的特征向量输入至多模态机器学习模型中进行分析,多模态机器学习模型对模态信息组M中各个模态信息的特征向量的分析结果与多模态机器学习模型对模态信息B的候选特征向量与剩余模态信息的特征向量的分析结果一致,或者相似度大于设定值,则可以认为机器学习模型训练成功,否则,失败,重新执行步骤1~3,继续对模型进行训练。
测试成功的机器学习模型可以配置在补全模块230中,根据输入的无缺失模态信息的特征向量输出缺失模态信息的候选特征向量。
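特征向量映射关系的学习思路可以用一个极简替代来示意:这里用最小二乘求解的线性映射代替正文中的Seq2Seq或MCTN模型(假设可使用numpy;仅为原理演示,并非本申请的实际模型):

```python
import numpy as np

def fit_mapping(feats_a, feats_b):
    """在训练集上学习模态A特征向量到模态B特征向量的线性映射W,
    即最小化 ||feats_a @ W - feats_b||,作为特征向量映射关系的极简替代。"""
    W, *_ = np.linalg.lstsq(feats_a, feats_b, rcond=None)
    return W

def predict_missing(W, feat_a):
    """根据无缺失模态的特征向量,输出缺失模态的候选特征向量。"""
    return feat_a @ W
```

训练得到的映射W即可在推理时根据无缺失模态信息的特征向量输出缺失模态信息的候选特征向量;实际系统中该映射关系由Seq2Seq或MCTN等模型学习得到。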
由于模态信息组中可能存在多个无缺失模态信息,补全模块230在执行步骤306时,可以根据一个无缺失模态信息的特征向量获得一个缺失模态信息的候选特征向量,也可以根据多个无缺失模态信息的特征向量获得多个缺失模态信息的候选特征向量,一个无缺失模态信息的特征向量可以获得一个缺失模态信息的候选特征向量。
步骤307:补全模块230可以利用缺失模态信息的候选特征向量确定缺失模态信息的目标特征向量。
补全模块230在执行步骤306时,若生成一个缺失模态信息的候选特征向量,补全模块230可以直接将该缺失模态信息的候选特征向量作为缺失模态信息的目标特征向量,这种情况下,补全模块230可以通过步骤306直接获得该缺失模态信息的目标特征向量,补全模块230也可以对该缺失模态信息的候选特征向量进行调整,如放大或缩小等调整,将调整后的缺失模态信息的候选特征向量作为缺失模态信息的目标特征向量。若生成多个缺失模态信息的候选特征向量,补全模块230可以根据多个缺失模态信息的候选特征向量确定缺失模态信息的目标特征向量。
补全模块230根据多个缺失模态信息的候选特征向量确定缺失模态信息的目标特征向量的方式本申请实施例并不限定,例如,补全模块230可以从多个缺失模态信息的候选特征向量中选择缺失模态信息的一个候选特征向量作为缺失模态信息的目标特征向量,也可以对多个缺失模态信息的候选特征向量进行加权求和(也即每个缺失模态信息的候选特征向量对应一个权重),获取缺失模态信息的目标特征向量,其中,每个缺失模态信息的候选特征向量的权重可以为经验值,也可以是根据多模态机器学习模型预先确定的。
下面对根据多模态机器学习模型预先确定每个缺失模态信息的候选特征向量的权重的方式进行说明:
对每个缺失模态信息的候选特征向量的权重设置一个可变参数,对多个缺失模态信息的候选特征向量进行加权求和,确定缺失模态信息的目标特征向量,该目标特征向量包括该可变参数。改变可变参数的具体数值,每改变一次可变参数的具体数值,将缺失模态信息的目标特征向量和无缺失模态信息的特征向量输入至多模态机器学习模型,确定多模态机器学习模型的输出值,由此可以获得多模态机器学习模型的多个输出值,确定多模态机器学习模型的多个输出值中最接近真实值的输出值,将该输出值所对应的目标特征向量中可变参数的具体数值作为缺失模态信息的候选特征向量的权重。
以模态信息组为视频,缺失模态信息为文字类型的模态信息为例,将文字类型的模态信息的一个候选特征向量的权重设定为一个参数X,参数X介于在0到1之间,通过语音类型的模态信息的特征向量确定的文字类型的模态信息的一个候选特征向量为f1,权重为X,通过图像类型的模态信息的特征向量确定的文字类型的模态信息的候选特征向量为f2,权重为1-X,通过加权求和方式获得的文字类型的模态信息的目标特征向量为X*f1+(1-X)*f2。将参数X从0变化到1每次增加0.1,每增加0.1,将文字类型的模态信息的目标特征向量、语音类型的模态信息的特征向量以及图像类型的模态信息的特征向量输入至多模态机器学习模型中,获得该多模态机器学习模型的输出值,在不同的应用场景中,该多模态机器学习模型的输出值指示的信息不同,以情感识别场景为例,该多模态机器学习模型的输出值用于指示视频中人物的情绪变化。从该多模态机器学习模型的输出值中确定最接近视频中人物真实情绪变化的输出值,该输出值对应的文字类型的模态信息的目标特征向量中参数X的取值即为文字类型的模态信息的一个候选特征向量的权重。
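正文中通过改变参数X确定候选特征向量权重的过程,本质上是一次网格搜索,可示意如下(多模态机器学习模型以打分函数score代替,变量名均为假设):

```python
def search_weight(f1, f2, score, steps=10):
    """在[0,1]上以1/steps为步长搜索权重X,
    使加权融合 X*f1+(1-X)*f2 得到的目标特征向量的得分最高。

    f1、f2: 缺失模态信息的两个候选特征向量;
    score: 模拟多模态机器学习模型的打分函数,得分越高越接近真实值。
    """
    best_x, best_s = 0.0, float("-inf")
    for i in range(steps + 1):
        x = i / steps
        fused = [x * a + (1 - x) * b for a, b in zip(f1, f2)]
        s = score(fused)
        if s > best_s:
            best_x, best_s = x, s
    return best_x, best_s
```

搜索结束后best_x即作为候选特征向量f1的权重,1-best_x作为f2的权重,与正文中每次将X增加0.1并比较模型输出值的做法对应。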
补全模块230可以利用缺失模态信息的候选特征向量确定缺失模态信息的目标特征向量,可以获得缺失模态信息的目标特征向量和无缺失模态信息的特征向量。补全模块230可以将缺失模态信息的目标特征向量和无缺失模态信息的特征向量发送至信息处理设备300,由信息处理设备300对缺失模态信息的目标特征向量和无缺失模态信息的特征向量进行处理。
如图5A所示,以该模态信息组为视频,其中,包括语音、文字、图像这三种类型的模态信息,语音和图像类型的模态信息为缺失模态信息,无法提取语音和图像类型的模态信息的特征向量,文本类型的模态信息为无缺失模态信息,可以提取文本类型的模态信息的特征向量。补全模块230可以基于预设的特征向量映射关系,根据文本类型的模态信息的特征向量分别生成语音和图像类型的模态信息的候选特征向量。补全模块230可以将语音和图像类型的模态信息的候选特征向量作为语音和图像类型的模态信息的目标特征向量,补全模块230可以将语音和图像类型的模态信息的目标特征向量以及文本类型的模态信息的特征向量发送给信息处理设备300,做后续处理。
如图5B所示,以该模态信息组为视频,其中,包括语音、文字、图像这三种类型的模态信息,语音和图像类型的模态信息为无缺失模态信息,可以分别提取语音和图像类型的模态信息的特征向量,文本类型的模态信息为缺失模态信息,无法提取文本类型的模态信息的特征向量。补全模块230可以基于预设的特征向量映射关系,分别根据语音和图像类型的模态信息的特征向量生成两个文本类型的模态信息的候选特征向量。补全模块230可以根据两个文本类型的模态信息的候选特征向量确定文本类型的模态信息的目标特征向量,补全模块230可以将文本类型的模态信息的目标特征向量以及语音和图像类型的模态信息的特征向量发送给信息处理设备300,做后续处理。
基于与方法实施例同一发明构思,本申请实施例还提供了一种计算机集群,用于执行上述方法实施例中所示的方法,相关特征可参见上述方法实施例,此处不再赘述,如图6所示,为本申请实施例提供的一种计算机集群,该计算机集群中包括至少一个计算设备600,每个计算设备600间通过通信网络建立通信通路。
每个计算设备600中包括总线601、处理器602、通信接口603以及存储器604,可选的,计算设备600中还可以包括显示屏605。处理器602、存储器604和通信接口603之间通过总线601通信。
其中,处理器602可以由一个或者多个通用处理器构成,例如中央处理器(central processing unit,CPU),或者CPU和硬件芯片的组合。上述硬件芯片可以是专用集成电路(application-specific integrated circuit,ASIC)、可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD)、现场可编程逻辑门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合。
存储器604可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM)。存储器604还可以包括非易失性存储器(non-volatile memory,NVM),例如只读存储器(read-only memory,ROM),快闪存储器,硬盘(hard disk drive,HDD)或固态硬盘(solid-state drive,SSD)。存储器604还可以包括 上述种类的组合。
存储器604中存储有可执行代码,处理器602可以读取存储器604中的该可执行代码实现功能,还可以通过通信接口603与其他计算设备进行通信,在本申请实施例中,处理器602可以实现补全装置200的一个或多个模块(如信息获取模块210、特征提取模块220、补全模块230中的一个或多个模块)的功能,这种情况下,存储器604中存储有补全装置200的一个或多个模块(如信息获取模块210、特征提取模块220、补全模块230中的一个或多个模块)。
在本申请实施例中,多个计算设备600中的处理器602可以协调工作,执行本申请实施例提供的模态信息补全方法。
如图7所示,为本申请实施例提供的一种系统架构,该系统架构中包括客户端200和部署有补全装置的云端设备300,客户端200与云端设备300通过网络连接,该云端设备300位于云环境中,可以是部署在云数据中心中的服务器或者虚拟机,图7中,仅是以该补全装置部署在一个云端设备300为例,作为一种可能的实施方式,该补全装置可以分布式地部署在多个云端设备300上。
如图7所示,客户端200包括总线201、处理器202、通信接口203、存储器204以及显示屏205。处理器202、存储器204和通信接口203之间通过总线201通信。其中,处理器202和存储器204的类型可以参见处理器602以及存储器604的相关说明,此处不再赘述。存储器204中存储有可执行代码,处理器202可以读取存储器204中的该可执行代码实现功能。处理器202还可以通过通信接口203与云端设备进行通信。例如处理器202可以通过显示屏205提示用户输入模态信息组,通过通信接口203将模态信息组反馈给云端设备300。
如图7所示,云端设备300包括总线301、处理器302、通信接口303以及存储器304。处理器302、存储器304和通信接口303之间通过总线301通信。其中,处理器302和存储器304的类型可以参见处理器602以及存储器604的相关说明,此处不再赘述。存储器304中存储有可执行代码,处理器302可以读取存储器304中的该可执行代码实现功能,还可以通过通信接口303与客户端200进行通信。在本申请实施例中,处理器302可以实现补全装置200的功能,这种情况下,存储器304中存储有补全装置200的信息获取模块210、特征提取模块220、补全模块230中的一个或多个模块。
处理器302通过通信接口303从客户端200接收模态信息组后,可以调用存储器304中存储的模块实现本申请实施例提供的模态信息补全方法。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
以上所述,仅为本发明的具体实施方式。熟悉本技术领域的技术人员根据本发明提供的具体实施方式,可想到变化或替换,都应涵盖在本发明的保护范围之内。

Claims (18)

  1. 一种模态信息补全方法,其特征在于,该方法包括:
    获取模态信息组,所述模态信息组包括至少两个模态信息;
    根据所述模态信息组的属性,确定所述模态信息组缺失了第一模态信息的部分或全部,所述模态信息组还包括第二模态信息;
    提取所述第二模态信息的特征向量;
    基于预设的特性向量映射关系,根据所述第二模态信息的特征向量确定所述第一模态信息的目标特征向量。
  2. 如权利要求1所述的方法,其特征在于,所述基于预设的特性向量映射关系,根据所述第二模态信息的特征向量确定所述第一模态信息的目标特征向量,包括:
    基于所述特性向量映射关系,根据所述第二模态信息的特征向量确定所述第一模态信息的候选特征向量;
    根据所述第一模态信息的候选特征向量确定所述第一模态信息的目标特征向量。
  3. 如权利要求1或2所述的方法,其特征在于,所述基于预设的特性向量映射关系,根据所述第二模态信息的特征向量确定所述第一模态信息的目标特征向量,包括:
    基于预设的机器学习模型,根据所述第二模态信息的特征向量确定所述第一模态信息的目标特征向量,所述机器学习模型学习了所述特性向量映射关系,用于根据输入的模态信息的特征向量输出其他的模态信息的特征向量。
  4. 如权利要求1~3任一所述的方法,其特征在于,所述模态信息组的属性包括下列的部分或全部:
    所述模态信息组中模态信息的数量、所述模态信息组中每个模态信息的数据量。
  5. 如权利要求1~4任一所述的方法,其特征在于,所述方法还包括:
    获取第一辅助信息,根据所述第一辅助信息确定所述模态信息组的属性,所述第一辅助信息用于指示下列的部分或全部:所述模态信息组中模态信息的数量、所述模态信息组中每个模态信息的数据量;或
    根据预设的第二辅助信息,确定所述模态信息组的属性,所述第二辅助信息用于指示下列的部分或全部:获取的任一模态信息组中模态信息的数量、获取的任一模态信息组中每个模态信息的数据量;或
    根据其他模态信息组的属性确定所述模态信息组的属性,所述其他模态信息组为在获取所述模态信息组之前所获取的模态信息组。
  6. 如权利要求1~5任一所述的方法,其特征在于,所述模态信息组还包括第三模态信息;
    所述方法还包括:
    提取所述第三模态信息的特征向量;
    基于所述特性向量映射关系,根据所述第三模态信息的特征向量和所述第二模态信息的特征向量确定所述第一模态信息的目标特征向量。
  7. 如权利要求6所述的方法,其特征在于,所述基于所述预设的特性向量映射关系,根据所述第三模态信息的特征向量和所述第二模态信息的特征向量确定所述第一模态信息的目标特征向量,包括:
    基于所述特性向量映射关系,根据所述第三模态信息的特征向量确定所述第一模态信息的另一候选特征向量;
    根据所述第一模态信息的候选特征向量和所述第一模态信息的另一候选特征向量确定所述第一模态信息的目标特征向量。
  8. 如权利要求1~7任一所述的方法,其特征在于,所述模态信息组中包括的每个模态信息的类型不同。
  9. 一种补全装置,其特征在于,该装置包括:
    信息获取模块,用于获取模态信息组,所述模态信息组包括至少两个模态信息;以及根据所述模态信息组的属性,确定所述模态信息组缺失了第一模态信息的部分或全部,所述模态信息组还包括第二模态信息;
    特征提取模块,用于提取所述第二模态信息的特征向量;
    补全模块,用于基于预设的特性向量映射关系,根据所述第二模态信息的特征向量确定所述第一模态信息的目标特征向量。
  10. 如权利要求9所述的装置,其特征在于,所述补全模块,具体用于:
    基于所述特性向量映射关系,根据所述第二模态信息的特征向量确定所述第一模态信息的候选特征向量;
    根据所述第一模态信息的候选特征向量确定所述第一模态信息的目标特征向量。
  11. 如权利要求9或10所述的装置,其特征在于,所述补全模块在基于预设的特性向量映射关系,根据所述第二模态信息的特征向量确定所述第一模态信息的目标特征向量时,具体用于:
    基于预设的机器学习模型,根据所述第二模态信息的特征向量确定所述第一模态信息的目标特征向量,所述机器学习模型学习了所述特性向量映射关系,用于根据输入的模态信息的特征向量输出其他的模态信息的特征向量。
  12. 如权利要求9~11任一所述的装置,其特征在于,所述模态信息组的属性包括下列的部分或全部:
    所述模态信息组中模态信息的数量、所述模态信息组中每个模态信息的数据量。
  13. 如权利要求9~12任一所述的装置,其特征在于,所述信息获取模块,还用于:
    获取第一辅助信息,根据所述第一辅助信息确定所述模态信息组的属性,所述第一辅助信息用于指示下列的部分或全部:所述模态信息组中模态信息的数量、所述模态信息组中每个模态信息的数据量;或
    根据预设的第二辅助信息,确定所述模态信息组的属性,所述第二辅助信息用于指示下列的部分或全部:获取的任一模态信息组中模态信息的数量、获取的任一模态信息组中每个模态信息的数据量;或
    根据其他模态信息组的属性确定所述模态信息组的属性,所述其他模态信息组为在获取所述模态信息组之前所获取的模态信息组。
  14. 如权利要求9~13任一所述的装置,其特征在于,所述模态信息组还包括第三模态信息;
    所述特征提取模块,还用于:
    提取所述第三模态信息的特征向量;
    所述补全模块,还用于:
    基于所述特性向量映射关系,根据所述第三模态信息的特征向量和所述第二模态信息的特征向量确定所述第一模态信息的目标特征向量。
  15. 如权利要求14所述的装置,其特征在于,所述补全模块在基于所述预设的特性向量映射关系,根据所述第三模态信息的特征向量和所述第二模态信息的特征向量确定所述第一模态信息的目标特征向量时,具体用于:
    基于所述特性向量映射关系,根据所述第三模态信息的特征向量确定所述第一模态信息的另一候选特征向量;
    根据所述第一模态信息的候选特征向量和所述第一模态信息的另一候选特征向量确定所述第一模态信息的目标特征向量。
  16. 如权利要求9~15任一所述的装置,其特征在于,所述模态信息组中每个模态信息的类型不同。
  17. 一种计算设备,其特征在于,所述计算设备包括处理器和存储器;
    所述存储器,用于存储计算机程序指令;
    所述处理器调用所述存储器中的计算机程序指令执行如权利要求1至8中任一项所述的方法。
  18. 一种计算设备集群,其特征在于,所述计算设备集群中包括多个计算设备,每个计算设备包括处理器和存储器;至少一个所述计算设备中的存储器,用于存储计算机程序指令;
    至少一个所述计算设备中的处理器调用所述存储器中存储的计算机程序指令执行如权利要求1至8中任一项所述的方法。
PCT/CN2021/101905 2020-06-23 2021-06-23 一种模态信息补全方法、装置及设备 WO2021259336A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21829076.5A EP4160477A4 (en) 2020-06-23 2021-06-23 METHOD, DEVICE AND DEVICE FOR SUPPLEMENTING MODAL INFORMATION
US18/069,822 US20230206121A1 (en) 2020-06-23 2022-12-21 Modal information completion method, apparatus, and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010582370.6A CN113837390A (zh) 2020-06-23 2020-06-23 一种模态信息补全方法、装置及设备
CN202010582370.6 2020-06-23

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/069,822 Continuation US20230206121A1 (en) 2020-06-23 2022-12-21 Modal information completion method, apparatus, and device

Publications (1)

Publication Number Publication Date
WO2021259336A1 true WO2021259336A1 (zh) 2021-12-30

Family

ID=78964152

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/101905 WO2021259336A1 (zh) 2020-06-23 2021-06-23 一种模态信息补全方法、装置及设备

Country Status (4)

Country Link
US (1) US20230206121A1 (zh)
EP (1) EP4160477A4 (zh)
CN (1) CN113837390A (zh)
WO (1) WO2021259336A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024021008A1 (zh) * 2022-07-29 2024-02-01 华为技术有限公司 数据处理方法、装置、系统以及存储介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117370933B (zh) * 2023-10-31 2024-05-07 中国人民解放军总医院 多模态统一特征提取方法、装置、设备及介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202281A (zh) * 2016-06-28 2016-12-07 广东工业大学 一种多模态数据表示学习方法及系统
CN106803098A (zh) * 2016-12-28 2017-06-06 南京邮电大学 一种基于语音、表情与姿态的三模态情感识别方法
CN108536735A (zh) * 2018-03-05 2018-09-14 中国科学院自动化研究所 基于多通道自编码器的多模态词汇表示方法与系统
CN109614895A (zh) * 2018-10-29 2019-04-12 山东大学 一种基于attention特征融合的多模态情感识别的方法
CN110301920A (zh) * 2019-06-27 2019-10-08 清华大学 用于心理压力检测的多模态融合方法及装置
US20190325025A1 (en) * 2018-04-24 2019-10-24 Electronics And Telecommunications Research Institute Neural network memory computing system and method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202281A (zh) * 2016-06-28 2016-12-07 广东工业大学 一种多模态数据表示学习方法及系统
CN106803098A (zh) * 2016-12-28 2017-06-06 南京邮电大学 一种基于语音、表情与姿态的三模态情感识别方法
CN108536735A (zh) * 2018-03-05 2018-09-14 中国科学院自动化研究所 基于多通道自编码器的多模态词汇表示方法与系统
US20190325025A1 (en) * 2018-04-24 2019-10-24 Electronics And Telecommunications Research Institute Neural network memory computing system and method
CN109614895A (zh) * 2018-10-29 2019-04-12 山东大学 一种基于attention特征融合的多模态情感识别的方法
CN110301920A (zh) * 2019-06-27 2019-10-08 清华大学 用于心理压力检测的多模态融合方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4160477A4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024021008A1 (zh) * 2022-07-29 2024-02-01 华为技术有限公司 数据处理方法、装置、系统以及存储介质

Also Published As

Publication number Publication date
CN113837390A (zh) 2021-12-24
EP4160477A1 (en) 2023-04-05
EP4160477A4 (en) 2023-08-30
US20230206121A1 (en) 2023-06-29

Similar Documents

Publication Publication Date Title
US11138903B2 (en) Method, apparatus, device and system for sign language translation
US20190392587A1 (en) System for predicting articulated object feature location
WO2021051545A1 (zh) 基于行为识别模型的摔倒动作判定方法、装置、计算机设备及存储介质
US20230206121A1 (en) Modal information completion method, apparatus, and device
CN112929695B (zh) 视频去重方法、装置、电子设备和存储介质
US20230080098A1 (en) Object recognition using spatial and timing information of object images at diferent times
WO2019119396A1 (zh) 人脸表情识别方法及装置
CN108229375B (zh) 用于检测人脸图像的方法和装置
CN110941978B (zh) 一种未识别身份人员的人脸聚类方法、装置及存储介质
CN108388889B (zh) 用于分析人脸图像的方法和装置
CN111539897A (zh) 用于生成图像转换模型的方法和装置
CN111985414B (zh) 一种关节点位置确定方法及装置
JP2019153092A (ja) 位置特定装置、位置特定方法及びコンピュータプログラム
CN111597933A (zh) 人脸识别方法和装置
CN104794446A (zh) 基于合成描述子的人体动作识别方法及系统
CN112529149A (zh) 一种数据处理方法及相关装置
CN116311539A (zh) 基于毫米波的睡眠动作捕捉方法、装置、设备及存储介质
CN115731341A (zh) 三维人头重建方法、装置、设备及介质
CN110633630A (zh) 一种行为识别方法、装置及终端设备
CN113822871A (zh) 基于动态检测头的目标检测方法、装置、存储介质及设备
CN110489592B (zh) 视频分类方法、装置、计算机设备和存储介质
CN111401317A (zh) 视频分类方法、装置、设备及存储介质
CN113221920B (zh) 图像识别方法、装置、设备、存储介质以及计算机程序产品
WO2023152832A1 (ja) 識別装置、識別方法、及び非一時的なコンピュータ可読媒体
CN116501176B (zh) 基于人工智能的用户动作识别方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21829076

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021829076

Country of ref document: EP

Effective date: 20221230

NENP Non-entry into the national phase

Ref country code: DE