WO2022142014A1 - Text classification method based on multimodal information fusion, and corresponding related device - Google Patents

Text classification method based on multimodal information fusion, and corresponding related device

Info

Publication number
WO2022142014A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
image
sample
model
feature
Prior art date
Application number
PCT/CN2021/090497
Other languages
English (en)
Chinese (zh)
Inventor
陈昊
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022142014A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular, to a text classification method based on multimodal information fusion, and related equipment.
  • the correct classification and labeling of text information are crucial to the classification, storage, search and user understanding of information.
  • For example, when audio is converted into text, the conversion result may be ambiguous; correctly classifying the text information can help users correctly understand the converted content.
  • The purpose of the embodiments of the present application is to propose a text classification method, apparatus, computer device and storage medium based on multimodal information fusion, so as to solve the problem that classifying information from a single information source, without mutual verification from other homologous information, yields low accuracy.
  • the embodiments of the present application provide a text classification method based on multimodal information fusion, which adopts the following technical solutions:
  • acquiring text to be classified, where the text is derived from multimodal information, and the multimodal information at least further includes images;
  • inputting the text into a pre-trained text feature extraction model for feature extraction to obtain text features of the text;
  • inputting the image in the multimodal information into a pre-trained image feature extraction model for feature extraction to obtain image features of the image;
  • inputting the text features and the image features into a pre-trained attention fusion model for feature fusion to obtain a fusion feature fused with the text features and the image features;
  • inputting the fusion feature into a pre-trained text classification model to obtain a classification result of the text in the multimodal information.
  • the embodiments of the present application also provide a text classification device based on multimodal information fusion, which adopts the following technical solutions:
  • an acquisition module configured to acquire text to be classified, the text is derived from multimodal information, wherein the multimodal information at least further includes images;
  • a first extraction module configured to input the text into a pre-trained text feature extraction model for feature extraction to obtain text features of the text
  • a second extraction module configured to input the image in the multimodal information into a pre-trained image feature extraction model for feature extraction to obtain image features of the image;
  • a fusion module for inputting the text features and the image features into a pre-trained attention fusion model for feature fusion, to obtain a fusion feature fused with the text features and the image features;
  • a classification module configured to input the fusion feature into a pre-trained text classification model to obtain a classification result of the text in the multimodal information.
  • the embodiment of the present application also provides a computer device, which adopts the following technical solutions:
  • a computer device includes a memory and a processor, wherein computer-readable instructions are stored in the memory, and the processor implements the following steps when executing the computer-readable instructions:
  • acquiring text to be classified, where the text is derived from multimodal information, and the multimodal information at least further includes images;
  • inputting the text into a pre-trained text feature extraction model for feature extraction to obtain text features of the text;
  • inputting the image in the multimodal information into a pre-trained image feature extraction model for feature extraction to obtain image features of the image;
  • inputting the text features and the image features into a pre-trained attention fusion model for feature fusion to obtain a fusion feature fused with the text features and the image features;
  • inputting the fusion feature into a pre-trained text classification model to obtain a classification result of the text in the multimodal information.
  • the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
  • a computer-readable storage medium where computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the processor is caused to perform the following steps:
  • acquiring text to be classified, where the text is derived from multimodal information, and the multimodal information at least further includes images;
  • inputting the text into a pre-trained text feature extraction model for feature extraction to obtain text features of the text;
  • inputting the image in the multimodal information into a pre-trained image feature extraction model for feature extraction to obtain image features of the image;
  • inputting the text features and the image features into a pre-trained attention fusion model for feature fusion to obtain a fusion feature fused with the text features and the image features;
  • inputting the fusion feature into a pre-trained text classification model to obtain a classification result of the text in the multimodal information.
  • Text to be classified is acquired, where the text is derived from multimodal information and the multimodal information at least further includes images; the text is input into a pre-trained text feature extraction model for feature extraction to obtain text features of the text; the image in the multimodal information is input into a pre-trained image feature extraction model for feature extraction to obtain image features of the image; the text features and the image features are input into a pre-trained attention fusion model for feature fusion to obtain a fusion feature that fuses the text features and the image features; and the fusion feature is input into a pre-trained text classification model to obtain the classification result of the text in the multimodal information.
  • Through the fusion of text features and image features, text classification is performed based on the fused features; the text classification thus utilizes image information, and the classification results are more accurate.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a text classification method based on multimodal information fusion according to the present application
  • FIG. 3 is a schematic structural diagram of the attention fusion model of the present application.
  • FIG. 4 is a schematic structural diagram of a gated activation layer in the attention fusion model of the present application.
  • FIG. 5 is a schematic structural diagram of an attention layer in the attention fusion model of the present application.
  • FIG. 6 is a schematic structural diagram of an embodiment of a text classification apparatus based on multimodal information fusion according to the present application
  • FIG. 7 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
  • The terminal devices 101, 102, and 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
  • The text classification method based on multimodal information fusion provided in the embodiments of the present application is generally executed by the server/terminal device; accordingly, the text classification apparatus based on multimodal information fusion is generally provided in the server/terminal device.
  • The numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks and servers according to implementation needs.
  • Referring to FIG. 2, a flowchart of one embodiment of a text classification method based on multimodal information fusion according to the present application is shown.
  • the described text classification method based on multimodal information fusion includes the following steps:
  • Step S201 acquiring text to be classified, where the text is derived from multi-modal information, wherein the multi-modal information further includes at least an image.
  • The electronic device (for example, the server/terminal device shown in FIG. 1) on which the text classification method based on multimodal information fusion runs can obtain the text to be classified through a wired connection or a wireless connection.
  • the above wireless connection methods may include but are not limited to 3G/4G connection, WiFi connection, Bluetooth connection, WiMAX connection, Zigbee connection, UWB (ultra wideband) connection, and other wireless connection methods currently known or developed in the future .
  • "Multimodal" here refers to the various ways in which information can be represented; multimodal information is information that includes images, text, and sounds.
  • the text to be classified here can be directly extracted from the multi-modal information.
  • For example, the text to be classified may come from extracting the text that appears in a video; when the audio of the video is used, the text to be classified comes from the text conversion result of that audio. Correctly classifying and labeling the text in the multimodal information helps users understand the textual information correctly.
  • Step S202 Input the text into a pre-trained text feature extraction model for feature extraction to obtain text features of the text.
  • the pre-trained text feature extraction model is based on the DPCNN (Deep Pyramid Convolutional Neural Networks) structure, which is widely considered to be able to effectively extract semantic information in text.
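As an illustration only (the publication discloses no source code), a minimal DPCNN-style text encoder might be sketched in PyTorch as follows; the vocabulary size, embedding width, channel count and block depth are assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class DPCNNBlock(nn.Module):
    """One pyramid stage: halve the sequence length, then a residual double convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
        self.convs = nn.Sequential(
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        x = self.pool(x)
        return x + self.convs(x)  # residual connection, as in the DPCNN design

class DPCNNTextEncoder(nn.Module):
    """Minimal DPCNN-style text feature extractor with illustrative sizes."""
    def __init__(self, vocab_size=30000, embed_dim=128, channels=250, num_blocks=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.region = nn.Conv1d(embed_dim, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[DPCNNBlock(channels) for _ in range(num_blocks)])

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        x = self.region(x.transpose(1, 2))   # region embedding: (batch, channels, seq_len)
        x = self.blocks(x)                   # pyramid of downsampling residual blocks
        return x.max(dim=-1).values          # global max pooling -> (batch, channels)

text_features = DPCNNTextEncoder()(torch.randint(0, 30000, (2, 64)))
print(text_features.shape)  # torch.Size([2, 250])
```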
  • Step S203 Input the image in the multimodal information into a pre-trained image feature extraction model for feature extraction to obtain image features of the image.
  • The image feature extraction model is composed of the first five layers of a ResNet (deep residual network), which is used for image feature extraction.
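For illustration, truncating a standard ResNet to its first five top-level layers (the initial convolution, batch normalization, ReLU, max pooling, and the first residual stage) can be sketched with torchvision as below; the choice of ResNet-18 is an assumption, since the publication names only the ResNet family:

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumption: ResNet-18 stands in for the unspecified ResNet variant.
resnet = models.resnet18(weights=None)
# The first five top-level children: conv1, bn1, relu, maxpool, layer1.
image_encoder = nn.Sequential(*list(resnet.children())[:5])

with torch.no_grad():
    image_features = image_encoder(torch.randn(1, 3, 224, 224))
print(image_features.shape)  # torch.Size([1, 64, 56, 56])
```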
  • Step S204 inputting the text feature and the image feature into a pre-trained attention fusion model for feature fusion, to obtain a fusion feature fused with the text feature and the image feature.
  • The attention fusion model mainly needs to realize the fusion of the two features, including dimension filling and transformation. For example, if the text information is "watering trees" and there are tree regions in the image, then the information of those tree regions should be supplemented into the text information tensor in some way. The attention fusion model here locates the image region corresponding to the trees and effectively fuses it into the output fusion feature.
  • the structure of the attention fusion model here is shown in Figure 3. It consists of a gated activation layer, an attention layer and a fusion layer.
  • the gated activation layer adopts a classic gated design, as shown in Figure 4.
  • In Figure 4, the position marked t applies softmax as the activation operation; hi is the hidden state; ri is the learnable gating parameter; and the gated output is the estimated state of the hidden state after the gating operation.
  • the attention layer is shown in Figure 5: it is composed of multiplication/addition operations and softmax operations between multiple convolutional layers.
  • The fusion layer is composed of residual blocks of ResNet.
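Figures 3 to 5 are not reproduced in this text, so the sketch below is only a hedged reading of the description: a softmax-gated activation, a scaled-dot-product-style attention between projected features, and a residual fusion block. Every dimension and projection here is an assumption made for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedActivation(nn.Module):
    """Gated activation: a learnable gate r_i, squashed by softmax, rescales the hidden state h_i."""
    def __init__(self, dim: int):
        super().__init__()
        self.r = nn.Parameter(torch.zeros(dim))   # learnable gating parameter r_i

    def forward(self, h):
        return F.softmax(self.r, dim=-1) * h      # estimated hidden state after gating

class AttentionFusion(nn.Module):
    """Attention over image regions, queried by the text feature, then a residual fusion block."""
    def __init__(self, text_dim=250, image_dim=64, fused_dim=256):
        super().__init__()
        self.gate_t = GatedActivation(text_dim)
        self.gate_v = GatedActivation(image_dim)
        self.q = nn.Linear(text_dim, fused_dim)
        self.k = nn.Linear(image_dim, fused_dim)
        self.v = nn.Linear(image_dim, fused_dim)
        self.proj_t = nn.Linear(text_dim, fused_dim)
        self.fuse = nn.Sequential(                # stands in for the ResNet residual block
            nn.Linear(fused_dim, fused_dim), nn.ReLU(),
            nn.Linear(fused_dim, fused_dim),
        )

    def forward(self, text_feat, image_feat):
        # text_feat: (batch, text_dim); image_feat: (batch, regions, image_dim)
        t = self.gate_t(text_feat)
        v = self.gate_v(image_feat)
        scores = self.q(t).unsqueeze(1) @ self.k(v).transpose(1, 2) / 16.0  # 16 = sqrt(fused_dim)
        attended = (F.softmax(scores, dim=-1) @ self.v(v)).squeeze(1)  # regions relevant to the text
        fused = self.proj_t(t) + attended         # supplement the text tensor with image information
        return fused + self.fuse(fused)           # residual fusion

fusion = AttentionFusion()
out = fusion(torch.randn(2, 250), torch.randn(2, 49, 64))
print(out.shape)  # torch.Size([2, 256])
```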
  • Step S205 inputting the fusion feature into a pre-trained text classification model to obtain a classification result of the text in the multimodal information.
  • In this embodiment, text to be classified is acquired, where the text is derived from multi-modal information and the multi-modal information at least further includes images; the text is input into a pre-trained text feature extraction model for feature extraction to obtain text features of the text; the images in the multimodal information are input into a pre-trained image feature extraction model for feature extraction to obtain image features of the images; the text features and the image features are input into a pre-trained attention fusion model for feature fusion to obtain a fusion feature that combines the text features and the image features; and the fusion feature is input into a pre-trained text classification model to obtain the classification result of the text in the multimodal information. Through the fusion of text features and image features, text classification is performed based on the fused features; the text classification thus utilizes image information, and the classification results are more accurate.
  • the above electronic device may further perform the following steps:
  • Step S301 obtaining a multimodal information sample, where the multimodal information sample at least includes a text sample and an image sample;
  • Step S302 inputting the text samples in the multimodal information samples into a preset text feature extraction model to obtain the text sample features of the text samples;
  • Step S303 inputting the image samples in the multimodal information samples into a preset image feature extraction model to obtain image sample features of the image samples;
  • Step S304 inputting the text sample feature and the image sample feature into a preset attention fusion model to obtain the fusion sample feature of the multimodal information sample;
  • Step S305 inputting the fused sample features into a preset image restoration model for image restoration, to obtain a restored image of the image sample;
  • Step S306: compare the consistency between the restored image and the image sample through a first loss function, where X is the image sample and Y is the restored image; the formula of the first loss function is rendered only as an image in the original publication.
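Since the formula survives only as an image in the published text, the sketch below assumes a plain pixel-wise mean squared error between X and Y, which matches the stated purpose of comparing consistency; the real first loss function may differ:

```python
import torch
import torch.nn.functional as F

def first_loss(restored: torch.Tensor, sample: torch.Tensor) -> torch.Tensor:
    """Consistency between the restored image Y and the image sample X.
    Assumption: pixel-wise MSE stands in for the formula shown only as an image."""
    return F.mse_loss(restored, sample)

X = torch.rand(4, 1, 28, 28)   # image samples (e.g. binarized single-channel tensors)
Y = torch.rand(4, 1, 28, 28)   # restored images from the image restoration model
print(first_loss(Y, X).item())
```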
  • Step S307: adjust the parameters of each node in the text feature extraction model, the image feature extraction model, the attention fusion model, and the image restoration model, ending when the first loss function reaches its minimum value, to obtain the trained text feature extraction model, image feature extraction model and attention fusion model.
  • the preset text feature extraction model is based on the DPCNN structure;
  • the preset image feature extraction model is based on the ResNet structure;
  • the attention fusion model is composed of a gated activation layer, an attention layer and a fusion layer;
  • the image restoration model is based on a CNN structure.
  • Features are extracted from the text samples and image samples and then fused; the fused features are restored into an image, and the consistency between the restored image and the image sample is compared to check the completeness of the fusion feature and to ensure that the image information is integrated into the fusion feature used for text classification.
  • When the first loss function, which compares X, the image sample, with Y, the restored image, reaches its minimum value, the restored image is considered consistent with the image sample; the text feature extraction model, the image feature extraction model and the attention fusion model have then reached the optimal state, and the training ends.
  • the text samples in the multimodal information samples are marked with reference classifications.
  • the above electronic device may perform the following steps:
  • the second loss function is used to compare whether the classification prediction result is consistent with the reference classification.
  • The second loss function is a softmax loss averaged over the N training samples, where N is the number of training samples and yi is the labeled reference classification of the i-th sample; its formula appears only as an image in the original publication.
  • the parameters of each node of the text classification model are adjusted until the second loss function reaches a minimum, and a trained text classification model is obtained.
  • The text samples in the multimodal information samples are labeled with a reference classification, and the fused sample features, obtained by feature extraction and fusion of the text samples and image samples in the multimodal information samples, are input into the text classification model.
  • The text classification model is based on the textCNN structure; the parameters of each node of the text classification model are adjusted so that the classification prediction results output by the model are consistent with the labeled reference classification, completing the training of the text classification model.
  • the softmax loss function is used here.
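As an illustration, a softmax loss averaged over N training samples is the standard cross-entropy; the class count below is an assumed example value:

```python
import torch
import torch.nn.functional as F

def second_loss(logits: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    """Softmax (cross-entropy) loss averaged over the N training samples.
    logits:    (N, num_classes) raw scores from the textCNN classifier
    reference: (N,) labeled reference classification y_i of each sample"""
    return F.cross_entropy(logits, reference)

N, num_classes = 8, 5                      # illustrative sizes
logits = torch.randn(N, num_classes)
y = torch.randint(0, num_classes, (N,))
print(second_loss(logits, y).item())
```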
  • the above electronic device may perform the following steps:
  • the multimodal information includes at least audio information
  • Text conversion is performed on the audio information to obtain the text to be classified.
  • When the multi-modal information contains audio information, the audio needs to be converted into text, and wrong conversions are prone to occur.
  • Users need to make judgments about the content based on the text, which is error-prone. If the text converted from the audio information is correctly classified, it can help the user understand the audio information, and it also makes it convenient to correctly classify the multimodal information. For example, if the audio is converted into the text "home tree" but the text classification result is "person's name", the user will not understand the intended action of "watering the tree".
  • Audio-to-text conversion is achieved through general-purpose software.
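The publication does not name the software. Purely as an example of such general-purpose tooling, the Python SpeechRecognition package can perform the conversion; the file name and the recognizer backend below are assumptions:

```python
import speech_recognition as sr  # pip install SpeechRecognition

recognizer = sr.Recognizer()
with sr.AudioFile("clip.wav") as source:   # "clip.wav" is a placeholder path
    audio = recognizer.record(source)      # read the whole file into memory

# Assumption: any generic recognizer backend will do; the patent only
# says "general-purpose software" without naming one.
text_to_classify = recognizer.recognize_google(audio, language="zh-CN")
print(text_to_classify)
```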
  • In some embodiments of step S201, the above electronic device may perform the following steps:
  • the text to be classified is segmented into words based on the HMM (hidden Markov model) algorithm to obtain the word segmentation result of the text to be classified;
  • the word segmentation result is formed into a text tensor according to a preset corpus dictionary;
  • the text tensor is input into a pre-trained text feature extraction model for feature extraction to obtain text features of the text.
  • the hidden Markov algorithm is applied to Chinese word segmentation.
  • A Chinese sentence is given as input; a tag sequence composed of "B", "E", "M" and "S" is produced as output; word segmentation is then performed according to these tags to obtain the division of the input sentence.
  • B indicates that the character is at the beginning of a word;
  • M indicates that the character is in the middle of a word;
  • E indicates that the character is at the end of a word;
  • S indicates a single-character word.
  • What we want to obtain is the positional state of each character, but only the Chinese characters themselves can be observed. The positional state of each character must therefore be deduced from the observed characters, and the state of each character also depends on the character before it. This is a typical HMM problem: the states are hidden, and the characters are the observations.
  • The specific implementation can be realized by calling relevant Python functions based on the HMM algorithm.
  • In some embodiments, the word segmentation result is formed into a text tensor that includes a time dimension: according to a time interval of 5 s, a text tensor <t, contents> is constructed, where t is the moment determined by the time interval and contents is the content obtained by the above method within that interval.
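A minimal sketch of this step, using the jieba package (whose segmenter applies an HMM to out-of-vocabulary words) and an assumed toy corpus dictionary; the dictionary, the ids, and the windowing helper are all illustrative:

```python
import jieba  # pip install jieba; applies an HMM to unseen words

# Assumed toy corpus dictionary mapping words to integer ids (0 = unknown).
corpus_dict = {"浇": 1, "树": 2, "浇树": 3}

def text_tensor(segments_by_window):
    """Build <t, contents> pairs: t is the 5 s window start time and
    contents are the dictionary ids of the words segmented in that window."""
    return [(t, [corpus_dict.get(w, 0) for w in words])
            for t, words in segments_by_window]

words_in_first_window = jieba.lcut("浇树", HMM=True)  # word segmentation result
tensor = text_tensor([(0, words_in_first_window)])
print(tensor)  # e.g. [(0, [3])] or [(0, [1, 2])], depending on segmentation
```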
  • the above electronic device may perform the following steps:
  • The image in the multimodal information is grayscaled to obtain a grayscale image of the image; the grayscale image is binarized to obtain a two-dimensional image tensor of the image; and the two-dimensional image tensor is input into a pre-trained image feature extraction model for feature extraction to obtain image features of the image.
  • Grayscaling the image in the multimodal information means unifying the RGB values of each pixel into a single value, so the grayscaled image changes from three channels to a single channel. Binarization is then performed: a grayscale threshold is set, pixels whose grayscale is greater than the threshold are set to the maximum grayscale value (normalized here to 1), and pixels whose grayscale is less than the threshold are set to the minimum grayscale value (that is, 0), thereby realizing binarization.
  • the grayscale and binarized images not only retain the image features, but also reduce the data complexity.
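A short OpenCV sketch of the grayscaling and binarization described above; the threshold value and the placeholder input frame are assumptions:

```python
import cv2
import numpy as np

# Placeholder frame; in practice this would be an image from the multimodal source.
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)

gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)     # three channels -> one channel
threshold = 128                                    # illustrative grayscale threshold
_, binary = cv2.threshold(gray, threshold, 1, cv2.THRESH_BINARY)  # normalize to 0/1

print(gray.shape, binary.shape, np.unique(binary))  # (224, 224) (224, 224) [0 1]
```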
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • The present application provides an embodiment of a text classification apparatus based on multimodal information fusion, which corresponds to the method embodiment shown in FIG. 2.
  • the apparatus can be specifically applied to various electronic devices.
  • The text classification apparatus 600 based on multimodal information fusion described in this embodiment includes: an acquisition module 601, a first extraction module 602, a second extraction module 603, a fusion module 604 and a classification module 605, wherein:
  • an acquisition module 601, configured to acquire the text to be classified, the text being derived from multimodal information, wherein the multimodal information at least further includes images;
  • a first extraction module 602 configured to input the text into a pre-trained text feature extraction model for feature extraction to obtain text features of the text;
  • the second extraction module 603 is configured to input the image in the multimodal information into a pre-trained image feature extraction model for feature extraction to obtain image features of the image;
  • a fusion module 604 configured to input the text feature and the image feature into a pre-trained attention fusion model for feature fusion, to obtain a fusion feature that fuses the text feature and the image feature;
  • the classification module 605 is configured to input the fusion feature into a pre-trained text classification model to obtain a classification result of the text in the multimodal information.
  • In this embodiment, text to be classified is acquired, where the text is derived from multimodal information and the multimodal information at least further includes images; the text is input into a pre-trained text feature extraction model for feature extraction to obtain text features of the text; the image in the multimodal information is input into a pre-trained image feature extraction model for feature extraction to obtain image features of the image; the text features and the image features are input into a pre-trained attention fusion model for feature fusion to obtain a fusion feature that fuses the text features and the image features; and the fusion feature is input into a pre-trained text classification model to obtain the classification result of the text in the multimodal information.
  • Through the fusion of text features and image features, text classification is performed based on the fused features; the text classification thus utilizes image information, and the classification results are more accurate.
  • the text classification apparatus 600 based on multimodal information fusion further includes:
  • a first acquisition sub-module for acquiring multimodal information samples, the multimodal information samples at least including text samples and image samples;
  • a first extraction submodule configured to input text samples in the multimodal information samples into a preset text feature extraction model to obtain text sample features of the text samples;
  • a second extraction submodule configured to input the image samples in the multimodal information samples into a preset image feature extraction model to obtain image sample features of the image samples
  • a first fusion sub-module configured to input the text sample feature and the image sample feature into a preset attention fusion model to obtain the fusion sample feature of the multimodal information sample;
  • a first restoration sub-module configured to input the fused sample features into a preset image restoration model for image restoration, and obtain a restored image of the image sample
  • a first calculation submodule, configured to compare the consistency between the restored image and the image sample through a first loss function, where X is the image sample and Y is the restored image (the formula appears only as an image in the original publication);
  • a first adjustment submodule, used to adjust the parameters of each node in the text feature extraction model, the image feature extraction model, the attention fusion model and the image restoration model, ending when the first loss function reaches its minimum value, to obtain the trained text feature extraction model, image feature extraction model and attention fusion model.
  • the text classification apparatus 600 based on multimodal information fusion further includes:
  • a second acquisition sub-module for acquiring multimodal information, wherein the multimodal information at least includes audio information
  • the first conversion submodule is configured to perform text conversion on the audio information to obtain the text to be classified.
  • the text classification device based on multimodal information fusion further includes:
  • the first word segmentation submodule is used to segment the text to be classified based on the HMM Hidden Markov algorithm to obtain the word segmentation result of the text to be classified;
  • the first construction submodule is used to form a text tensor according to the word segmentation result according to a preset corpus dictionary
  • the third extraction sub-module is used for inputting the text tensor into a pre-trained text feature extraction model for feature extraction to obtain text features of the text.
  • the text classification apparatus 600 based on multimodal information fusion further includes:
  • a first processing submodule configured to grayscale the image in the multimodal information to obtain a grayscale image of the image
  • a second processing submodule configured to binarize the grayscale image to obtain a two-dimensional image tensor of the image
  • the fourth extraction sub-module is used for inputting the two-dimensional image tensor into a pre-trained image feature extraction model for feature extraction to obtain image features of the image.
  • the text classification apparatus 600 based on multimodal information fusion further includes:
  • a first prediction submodule configured to input the fused sample feature into a preset text classification model, and obtain a text sample classification prediction result output by the text classification model in response to the fused sample feature;
  • a second calculation submodule, used to compare, through a second loss function, whether the classification prediction result is consistent with the reference classification, where N is the number of training samples and yi is the labeled reference classification of the i-th sample (the formula appears only as an image in the original publication);
  • the second adjustment sub-module is used to adjust the parameters of each node of the text classification model, and ends when the second loss function reaches a minimum, and a trained text classification model is obtained.
  • FIG. 7 is a block diagram of the basic structure of a computer device according to this embodiment.
  • The computer device 7 includes a memory 71, a processor 72, and a network interface 73 that communicate with each other through a system bus. It should be pointed out that only the computer device 7 with components 71-73 is shown in the figure, but it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead.
  • The computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
  • The memory 71 includes at least one type of readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, and the like.
  • the memory 71 may be an internal storage unit of the computer device 7 , such as a hard disk or a memory of the computer device 7 .
  • the memory 71 may also be an external storage device of the computer device 7, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 71 may also include both the internal storage unit of the computer device 7 and its external storage device.
  • the memory 71 is generally used to store the operating system and various application software installed on the computer device 7 , such as computer-readable instructions for a text classification method based on multimodal information fusion.
  • the memory 71 can also be used to temporarily store various types of data that have been output or will be output.
  • The processor 72 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 72 is typically used to control the overall operation of the computer device 7. In this embodiment, the processor 72 is configured to execute computer-readable instructions stored in the memory 71 or to process data, for example, computer-readable instructions for executing the text classification method based on multimodal information fusion.
  • the network interface 73 may include a wireless network interface or a wired network interface, and the network interface 73 is generally used to establish a communication connection between the computer device 7 and other electronic devices.
  • In this embodiment, text to be classified is acquired, where the text is derived from multimodal information and the multimodal information at least further includes images; the text is input into a pre-trained text feature extraction model for feature extraction to obtain text features of the text; the image in the multimodal information is input into a pre-trained image feature extraction model for feature extraction to obtain image features of the image; the text features and the image features are input into a pre-trained attention fusion model for feature fusion to obtain a fusion feature that fuses the text features and the image features; and the fusion feature is input into a pre-trained text classification model to obtain the classification result of the text in the multimodal information.
  • Through the fusion of text features and image features, text classification is performed based on the fused features; the text classification thus utilizes image information, and the classification results are more accurate.
  • the present application also provides another embodiment, that is, to provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor to The at least one processor is caused to perform the steps of the method for text classification based on multimodal information fusion as described above.
  • the computer-readable storage medium may be non-volatile or volatile.
  • In this embodiment, text to be classified is acquired, where the text is derived from multimodal information and the multimodal information at least further includes images; the text is input into a pre-trained text feature extraction model for feature extraction to obtain text features of the text; the image in the multimodal information is input into a pre-trained image feature extraction model for feature extraction to obtain image features of the image; the text features and the image features are input into a pre-trained attention fusion model for feature fusion to obtain a fusion feature that fuses the text features and the image features; and the fusion feature is input into a pre-trained text classification model to obtain the classification result of the text in the multimodal information.
  • Through the fusion of text features and image features, text classification is performed based on the fused features; the text classification thus utilizes image information, and the classification results are more accurate.
  • The methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • The technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or CD-ROM) and includes several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a text classification method and apparatus based on multimodal information fusion, as well as a computer device and a storage medium, relating to the field of artificial intelligence. The method comprises the steps of: acquiring text to be classified; inputting the text into a pre-trained text feature extraction model for feature extraction, so as to obtain a text feature of the text; inputting an image in multimodal information into a pre-trained image feature extraction model for feature extraction, so as to obtain an image feature of the image; inputting the text feature and the image feature into a pre-trained attention fusion model for feature fusion, so as to obtain a fusion feature in which the text feature and the image feature are fused; and inputting the fusion feature into a pre-trained text classification model, so as to obtain a classification result of the text in the multimodal information. By fusing a text feature and an image feature and performing text classification on the basis of the fused feature, image information is used for the text classification, such that the classification result is more accurate.
PCT/CN2021/090497 2020-12-29 2021-04-28 Text classification method based on multimodal information fusion, and corresponding related device WO2022142014A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011594264.6 2020-12-29
CN202011594264.6A CN112685565B (zh) 2020-12-29 2023-07-21 Text classification method based on multimodal information fusion, and related device thereof

Publications (1)

Publication Number Publication Date
WO2022142014A1 (fr) 2022-07-07

Family

ID=75455223

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090497 WO2022142014A1 (fr) 2020-12-29 2021-04-28 Procédé de classification de texte sur la base d'une fusion d'informations multimodales et dispositif associé correspondant

Country Status (2)

Country Link
CN (1) CN112685565B (fr)
WO (1) WO2022142014A1 (fr)


Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112685565B (zh) * 2020-12-29 2023-07-21 平安科技(深圳)有限公司 基于多模态信息融合的文本分类方法、及其相关设备
CN113361247A (zh) * 2021-06-23 2021-09-07 北京百度网讯科技有限公司 文档版面分析方法、模型训练方法、装置和设备
CN113469067B (zh) * 2021-07-05 2024-04-16 北京市商汤科技开发有限公司 一种文档解析方法、装置、计算机设备和存储介质
CN113377958A (zh) * 2021-07-07 2021-09-10 北京百度网讯科技有限公司 一种文档分类方法、装置、电子设备以及存储介质
CN113449808B (zh) * 2021-07-13 2022-06-21 广州华多网络科技有限公司 多源图文信息分类方法及其相应的装置、设备、介质
CN113343936A (zh) * 2021-07-15 2021-09-03 北京达佳互联信息技术有限公司 视频表征模型的训练方法及训练装置
CN113343703B (zh) * 2021-08-09 2021-10-29 北京惠每云科技有限公司 医学实体的分类提取方法、装置、电子设备及存储介质
CN113779934B (zh) * 2021-08-13 2024-04-26 远光软件股份有限公司 多模态信息提取方法、装置、设备及计算机可读存储介质
CN113742483A (zh) * 2021-08-27 2021-12-03 北京百度网讯科技有限公司 文档分类的方法、装置、电子设备和存储介质
CN113468108B (zh) * 2021-09-06 2021-11-12 辰风策划(深圳)有限公司 基于特征数据识别的企业策划方案智能管理分类系统
CN114238690A (zh) * 2021-12-08 2022-03-25 腾讯科技(深圳)有限公司 视频分类的方法、装置及存储介质
CN113961710B (zh) * 2021-12-21 2022-03-08 北京邮电大学 基于多模态分层融合网络的细粒度化论文分类方法及装置
CN114445833B (zh) * 2022-01-28 2024-05-14 北京百度网讯科技有限公司 文本识别方法、装置、电子设备和存储介质
CN114625897A (zh) * 2022-03-21 2022-06-14 腾讯科技(深圳)有限公司 多媒体资源处理方法、装置、电子设备及存储介质
CN114662033B (zh) * 2022-04-06 2024-05-03 昆明信息港传媒有限责任公司 一种基于文本和图像的多模态有害链接识别
CN114579964A (zh) * 2022-04-29 2022-06-03 成都明途科技有限公司 一种信息监测方法及装置、电子设备、存储介质
CN115828162B (zh) * 2023-02-08 2023-07-07 支付宝(杭州)信息技术有限公司 一种分类模型训练的方法、装置、存储介质及电子设备
CN117421641B (zh) * 2023-12-13 2024-04-16 深圳须弥云图空间科技有限公司 一种文本分类的方法、装置、电子设备及可读存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN110717335A (zh) * 2019-09-23 2020-01-21 中国科学院深圳先进技术研究院 用户评论数据处理方法、装置、存储介质及电子设备
CN111259215A (zh) * 2020-02-14 2020-06-09 北京百度网讯科技有限公司 基于多模态的主题分类方法、装置、设备、以及存储介质
CN111985369A (zh) * 2020-08-07 2020-11-24 西北工业大学 基于跨模态注意力卷积神经网络的课程领域多模态文档分类方法
CN112685565A (zh) * 2020-12-29 2021-04-20 平安科技(深圳)有限公司 基于多模态信息融合的文本分类方法、及其相关设备

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492666B (zh) * 2018-09-30 2021-07-06 北京百卓网络技术有限公司 图像识别模型训练方法、装置及存储介质
US11244205B2 (en) * 2019-03-29 2022-02-08 Microsoft Technology Licensing, Llc Generating multi modal image representation for an image
CN109961491B (zh) * 2019-04-12 2023-05-26 上海联影医疗科技股份有限公司 多模态图像截断补偿方法、装置、计算机设备和介质
CN111126282B (zh) * 2019-12-25 2023-05-12 中国矿业大学 一种基于变分自注意力强化学习的遥感图像内容描述方法
CN111259851B (zh) * 2020-01-23 2021-04-23 清华大学 一种多模态事件检测方法及装置
CN111461174B (zh) * 2020-03-06 2023-04-07 西北大学 多层次注意力机制的多模态标签推荐模型构建方法及装置
CN111860116B (zh) * 2020-06-03 2022-08-26 南京邮电大学 一种基于深度学习和特权信息的场景识别方法
CN111861672A (zh) * 2020-07-28 2020-10-30 青岛科技大学 基于多模态的生成式兼容性服装搭配方案生成方法及系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN110717335A (zh) * 2019-09-23 2020-01-21 中国科学院深圳先进技术研究院 用户评论数据处理方法、装置、存储介质及电子设备
CN111259215A (zh) * 2020-02-14 2020-06-09 北京百度网讯科技有限公司 基于多模态的主题分类方法、装置、设备、以及存储介质
CN111985369A (zh) * 2020-08-07 2020-11-24 西北工业大学 基于跨模态注意力卷积神经网络的课程领域多模态文档分类方法
CN112685565A (zh) * 2020-12-29 2021-04-20 平安科技(深圳)有限公司 基于多模态信息融合的文本分类方法、及其相关设备

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115310122A (zh) * 2022-07-13 2022-11-08 广州大学 一种多模态数据融合训练中的隐私参数优化方法
CN115114408B (zh) * 2022-07-14 2024-05-31 平安科技(深圳)有限公司 多模态情感分类方法、装置、设备及存储介质
CN115114408A (zh) * 2022-07-14 2022-09-27 平安科技(深圳)有限公司 多模态情感分类方法、装置、设备及存储介质
CN115909317A (zh) * 2022-07-15 2023-04-04 广东工业大学 一种三维模型-文本联合表达的学习方法及系统
CN115375934A (zh) * 2022-10-25 2022-11-22 北京鹰瞳科技发展股份有限公司 一种用于对进行聚类的模型进行训练的方法和相关产品
CN115906845B (zh) * 2022-11-08 2024-05-10 芽米科技(广州)有限公司 一种电商商品标题命名实体识别方法
CN115906845A (zh) * 2022-11-08 2023-04-04 重庆邮电大学 一种电商商品标题命名实体识别方法
CN115797706A (zh) * 2023-01-30 2023-03-14 粤港澳大湾区数字经济研究院(福田) 目标检测方法、目标检测模型训练方法及相关装置
CN116052186A (zh) * 2023-01-30 2023-05-02 无锡容智技术有限公司 多模态发票自动分类识别方法、校验方法及系统
CN116029556B (zh) * 2023-03-21 2023-05-30 支付宝(杭州)信息技术有限公司 一种业务风险的评估方法、装置、设备及可读存储介质
CN116029556A (zh) * 2023-03-21 2023-04-28 支付宝(杭州)信息技术有限公司 一种业务风险的评估方法、装置、设备及可读存储介质
CN116469111B (zh) * 2023-06-08 2023-09-15 江西师范大学 一种文字生成模型训练方法及目标文字生成方法
CN116469111A (zh) * 2023-06-08 2023-07-21 江西师范大学 一种文字生成模型训练方法及目标文字生成方法
CN116796290A (zh) * 2023-08-23 2023-09-22 江西尚通科技发展有限公司 一种对话意图识别方法、系统、计算机及存储介质
CN116796290B (zh) * 2023-08-23 2024-03-29 江西尚通科技发展有限公司 一种对话意图识别方法、系统、计算机及存储介质
CN116994069A (zh) * 2023-09-22 2023-11-03 武汉纺织大学 一种基于多模态信息的图像解析方法及系统
CN116994069B (zh) * 2023-09-22 2023-12-22 武汉纺织大学 一种基于多模态信息的图像解析方法及系统
CN117312612A (zh) * 2023-10-07 2023-12-29 广东鼎尧科技有限公司 一种基于多模态的远程会议数据记录方法、系统和介质
CN117312612B (zh) * 2023-10-07 2024-04-02 广东鼎尧科技有限公司 一种基于多模态的远程会议数据记录方法、系统和介质

Also Published As

Publication number Publication date
CN112685565A (zh) 2021-04-20
CN112685565B (zh) 2023-07-21

Similar Documents

Publication Publication Date Title
WO2022142014A1 (fr) Procédé de classification de texte sur la base d'une fusion d'informations multimodales et dispositif associé correspondant
WO2021121198A1 (fr) Procédé et appareil d'extraction de relation d'entité basée sur une similitude sémantique, dispositif et support
CN113159010B (zh) 视频分类方法、装置、设备和存储介质
CN112559800B (zh) 用于处理视频的方法、装置、电子设备、介质和产品
US20170116521A1 (en) Tag processing method and device
CN112287069A (zh) 基于语音语义的信息检索方法、装置及计算机设备
CN110633475A (zh) 基于计算机场景的自然语言理解方法、装置、系统和存储介质
CN114817478A (zh) 基于文本的问答方法、装置、计算机设备及存储介质
WO2022001233A1 (fr) Procédé de pré-étiquetage basé sur un apprentissage par transfert hiérarchique et dispositif associé
CN112182255A (zh) 用于存储媒体文件和用于检索媒体文件的方法和装置
CN113239215B (zh) 多媒体资源的分类方法、装置、电子设备及存储介质
WO2022105120A1 (fr) Procédé et appareil de détection de texte à partir d'une image, dispositif informatique et support de mémoire
WO2022073341A1 (fr) Procédé et appareil de mise en correspondance d'entités de maladie fondés sur la sémantique vocale, et dispositif informatique
JP2023554210A (ja) インテリジェント推奨用のソートモデルトレーニング方法及び装置、インテリジェント推奨方法及び装置、電子機器、記憶媒体、並びにコンピュータプログラム
CN115168609A (zh) 一种文本匹配方法、装置、计算机设备和存储介质
US10910014B2 (en) Method and apparatus for generating video
CN112149389A (zh) 简历信息结构化处理方法、装置、计算机设备和存储介质
CN111723177A (zh) 信息提取模型的建模方法、装置及电子设备
CN114238574B (zh) 基于人工智能的意图识别方法及其相关设备
CN116363686B (zh) 一种在线社交网络视频平台来源检测方法及其相关设备
WO2023168997A1 (fr) Procédé de récupération intermodale et dispositif associé
CN118070072A (zh) 基于人工智能的问题处理方法、装置、设备及存储介质
CN117874073A (zh) 一种搜索优化方法、装置、设备及其存储介质
CN117992569A (zh) 基于生成式大模型生成文档的方法、装置、设备及介质
CN113361280A (zh) 训练模型的方法、预测方法、装置、电子设备以及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912776

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912776

Country of ref document: EP

Kind code of ref document: A1