CN110706771B - Method, device, server and storage medium for generating multi-modal patient education content - Google Patents

Method, device, server and storage medium for generating multi-modal patient education content

Info

Publication number
CN110706771B
CN110706771B (application CN201910957077.0A)
Authority
CN
China
Prior art keywords
data
mode data
mode
video
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910957077.0A
Other languages
Chinese (zh)
Other versions
CN110706771A (en)
Inventor
王天浩
潘志刚
虞莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongshan Hospital Fudan University
Original Assignee
Zhongshan Hospital Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongshan Hospital Fudan University filed Critical Zhongshan Hospital Fudan University
Priority to CN201910957077.0A priority Critical patent/CN110706771B/en
Publication of CN110706771A publication Critical patent/CN110706771A/en
Application granted granted Critical
Publication of CN110706771B publication Critical patent/CN110706771B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to a method, a device, a server and a storage medium for generating multi-modal patient education content, and belongs to the technical field of Internet data processing and the field of medical information. The generation method comprises the following steps: performing entity recognition on patient education content of at least one modality to generate data of different modalities; encoding the data of the different modalities to generate embedded data corresponding to each modality; and writing the obtained embedded data into a database to generate multi-modal data. A device, a server and a storage medium implementing the method are also provided. According to the invention, patient education content of other modalities can be generated by active searching from patient education content of a single modality, which enriches the patient education content and spares the user the process of searching for it manually.

Description

Method, device, server and storage medium for generating multi-modal patient education content
Technical Field
The invention relates to a method, a device, a server and a storage medium for generating multi-modal patient education content, and belongs to the technical field of Internet data processing and the field of medical information.
Background
The internet holds massive amounts of multi-modal content such as text, pictures, video and audio; in most cases, however, this content exists only in a single modal form, independent, discrete and unconnected. Existing search technologies and recommendation systems usually target single-modality data only; mainstream search engines, for example, mainly retrieve text, so acquiring multi-modal data is laborious for users. If data of different modalities were mapped into the same vector space to obtain a joint representation, they could retrieve one another in that space, which would greatly ease the acquisition of multi-modal content. The data on the internet, however, is so varied that processing it across all fields is impractical; a joint representation of multi-modal content is more feasible within a single field, and among the many candidate fields the medical field is one of the most significant.
With its popularization, the internet has become an important channel through which patients acquire medical information, and how to obtain multi-modal medical information more conveniently is a key problem. For example, if a diabetic patient is reading an article about insulin injection and a teaching video on insulin injection is automatically loaded at the end of the article, the user is spared the extra step of searching for the related video from another source.
Disclosure of Invention
The technical problem to be solved by the invention is: mapping patient education data of different modalities into the same space to obtain a joint representation, and generating patient education content that contains multi-modal data, so as to facilitate the dissemination of patient education.
To solve this technical problem, the technical scheme of the invention provides a method for generating multi-modal patient education content, characterized by comprising the following steps:
step 1: defining entities and performing entity recognition on patient education content of different modalities to generate data of the different modalities, wherein the patient education content of different modalities comprises text data, picture data and video data; entity recognition is performed on the text data to generate text modality data; a picture title or picture label is extracted from the picture data and entity recognition is performed on it to generate picture modality data; subtitle or audio information in the video data is converted into text and entity recognition is performed on that text to generate video modality data;
step 2: encoding the text modality data, picture modality data and video modality data, the encoded data being the embedded data corresponding to the data of each modality;
step 3: writing the embedded data obtained in the previous step, together with the text modality data, picture modality data and video modality data of the corresponding modalities, into a database;
step 4: after obtaining patient education content of a current modality in real time, obtaining the current embedded data of that content using the same procedure as steps 1 and 2, searching the database obtained in step 3 for embedded data similar to the current embedded data, and inserting the text modality data, picture modality data or video modality data corresponding to the similar embedded data into the modality data of the current modality, thereby obtaining patient education content of modalities different from the current one.
Preferably, in step 1, the entity recognition uses a bidirectional long short-term memory network plus a conditional random field (BiLSTM+CRF).
Preferably, the step 2 includes the steps of:
step 201: encoding the text modality data, picture modality data and video modality data with one-hot (One-Hot) encoding;
step 202: automatically selecting the corresponding encoder to encode the text modality data, picture modality data and video modality data according to the result of the one-hot encoding, wherein: the encoder for text modality data encodes the text modality data with a bidirectional LSTM model and then, through a fully connected layer, encodes it into a 200-dimensional vector serving as the embedded data of the text modality data;
the encoder for picture modality data encodes the picture modality data with a deep residual network model and then, through a fully connected layer, encodes it into a 200-dimensional vector serving as the embedded data of the picture modality data;
the encoder for video modality data encodes each frame image of the current video with a deep residual network model and a bidirectional LSTM model and then, through a fully connected layer, encodes it into a vector serving as the embedded data of the video modality data.
Preferably, in step 201, the following method is adopted when training with the one-hot encoding:
the text modality data, picture modality data and video modality data used for training are placed in the same batch and trained simultaneously, so that the encoding result output with the one-hot encoding can distinguish whether the input modality data is text modality data, picture modality data or video modality data.
Preferably, in step 202, encoding the video modality data by the encoder for video modality data comprises the following steps:
first, a deep residual network model encodes the video modality data; the encoding result of the deep residual network model is then input into a bidirectional LSTM model for secondary encoding; finally, the encoding result output by the bidirectional LSTM model is encoded, through a fully connected layer, into a vector serving as the embedded data of the video modality data.
Preferably, after step 2 and before step 3, the method further comprises the following step:
predicting the corresponding label from the embedded data of the text modality data, picture modality data and video modality data obtained in step 2 through the same softmax layer;
in step 3, the method further comprises: aligning the embedded data corresponding to the different modalities through the same label, so that embedded data of the same entity, whether within one modality or across modalities, lie at nearby positions in the same vector space.
Preferably, in step 4, the embedded data similar to the current embedded data is looked up in the database using a KD-tree (k-dimensional tree).
Another technical solution of the present invention provides a device for generating multi-modal patient education content, characterized by comprising: an entity recognition unit, a data generation unit and a writing unit; the entity recognition unit is used for performing entity recognition on patient education content of different modalities to generate data of the different modalities; the data generation unit is connected with the entity recognition unit and is used for encoding the data of the different modalities to generate embedded data corresponding to each modality; the writing unit is connected with the data generation unit and is used for writing the obtained embedded data into a database to generate multi-modal data.
Another technical solution of the present invention provides a server, characterized by comprising: a memory, one or more processors, and a computer program stored in the memory and executable on the processors; the one or more programs are executed by the one or more processors, so that the one or more processors implement the above method for generating multi-modal patient education content.
Another technical solution of the present invention provides a computer storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above method for generating multi-modal patient education content.
Compared with the prior art, the invention has the following beneficial effects:
1. According to the invention, patient education content of other modalities can be generated by active searching from patient education content of a single modality, which enriches the patient education content and spares the user the process of searching for it manually.
2. Information of different modalities is fed into the trained model to obtain a corresponding low-dimensional, dense embedded representation. Vector similarity retrieval then returns the most similar matching content of the other modalities, which together form a piece of patient education content.
3. The data of different modalities are placed in the same batch and trained simultaneously, and the data of each modality is processed by the neural network into a low-dimensional dense vector, which greatly saves storage space and improves performance for tasks such as online pushing and searching. The data of different modalities are aligned through the same predicted label, so that data on the same topic, whether within one modality or across modalities, lie at nearby positions in the same vector space.
Drawings
Fig. 1 is a flowchart of a method for generating multi-modal patient education content according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the entity recognition processing of a method for generating multi-modal patient education content according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the bidirectional LSTM model of a method for generating multi-modal patient education content according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of the video coding network model of a method for generating multi-modal patient education content according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of the ResNet model of a method for generating multi-modal patient education content according to an embodiment of the present invention.
Fig. 6 is a flowchart of a preferred mode of a method for generating multi-modal patient education content according to an embodiment of the present invention.
Fig. 7 is a block diagram of a device for generating multi-modal patient education content according to a second embodiment of the present invention.
Reference numerals: 1. entity recognition unit; 2. data generation unit; 3. writing unit.
Fig. 8 is a structural diagram of a server according to a third embodiment of the present invention.
Reference numerals: 71. memory; 72. processor; 73. communication interface; 74. bus.
Detailed Description
The invention will be further illustrated with reference to specific examples. It should be understood that these examples are only intended to illustrate the invention and not to limit its scope. Further, it should be understood that, after reading the teachings of the invention, those skilled in the art may make various changes and modifications, and such equivalents likewise fall within the scope defined by the claims appended hereto.
Embodiment 1
As shown in figs. 1 to 6, the present invention provides a method for generating multi-modal patient education content. Fig. 1 is a flowchart of the generation method provided in the first embodiment, and the method comprises:
step S1, carrying out entity identification processing on the suffering teaching content of at least one mode to generate data of different modes.
In this embodiment, the patient education content may include, but is not limited to, text data, picture data and video data. In step S1, when the patient education content is text data, a bidirectional long short-term memory network plus a conditional random field (BiLSTM+CRF) is used to recognize the entities in the text and generate text modality data. The entities may include diseases, drugs, treatment modalities, examinations, operations, and so on. Fig. 2 shows the BiLSTM+CRF network used for entity recognition; when the result is output, the conditional random field (CRF) adjusts it and excludes unlikely orders of labels. Since each word can be considered directly here, the character-level feature block can be omitted. When the patient education content is picture data, the title or label of the picture undergoes the same entity recognition processing as the text data to generate picture modality data. When the patient education content is video data, the subtitles contained in the video, or the audio information in the video, can be turned into text: the subtitles are recognized by software and converted into characters, or the audio information is converted into characters, producing a text. The information in this text then undergoes the same entity recognition processing as the text data to generate video modality data. Specifically, the subtitles in the video are converted into characters by optical character recognition (OCR), or the audio information is converted into characters.
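The entity recognition step described above can be sketched as a BiLSTM+CRF tagger. The following PyTorch sketch is illustrative only: the pytorch-crf package, the tag set size, the vocabulary size and all dimensions are assumptions and are not specified in the patent.

```python
# Minimal BiLSTM+CRF entity tagger sketch (PyTorch; pytorch-crf assumed: pip install pytorch-crf).
# Vocabulary size, dimensions and number of tags are illustrative assumptions.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=256, num_tags=11):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                            bidirectional=True)
        self.emission = nn.Linear(hidden_dim, num_tags)   # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)        # excludes unlikely label orders

    def loss(self, tokens, tags, mask):
        feats = self.emission(self.lstm(self.embed(tokens))[0])
        return -self.crf(feats, tags, mask=mask)          # negative log-likelihood

    def decode(self, tokens, mask):
        feats = self.emission(self.lstm(self.embed(tokens))[0])
        return self.crf.decode(feats, mask=mask)          # best tag sequence per sentence
```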
Step S2: encoding the data of the different modalities to generate embedded data corresponding to the data of each modality.
Specifically, one-hot (One-Hot) encoding is applied: the different modalities are identified from the one-hot code, and the corresponding encoder is then selected for encoding. Training with the one-hot encoding may proceed as follows: a corresponding plurality of training data is generated from the multiple entities of the data of each modality, and the training data of the different modalities are then encoded by different encoders to produce the embedded data. In the embodiment of the invention, if a single input of a single modality contains N corresponding entities, N corresponding training samples can be generated for that modality's data: (input, N1), (input, N2), ..., (input, Nn). During training, data of different modalities are placed in the same batch and trained simultaneously; specifically, the information of the different modality data is placed in the same batch and trained at the same time. The data of each modality can encode its modality information with a one-hot code; the training model encodes according to this one-hot information, identifies the different modalities from the one-hot code of the modality, and thereby selects the corresponding encoder during training.
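The routing described above can be sketched as follows: each sample carries a one-hot modality code, and that code selects the encoder applied to it within a mixed batch. The MODALITIES order, the class names and the per-sample loop are illustrative assumptions rather than the patent's implementation.

```python
# Sketch: a one-hot modality code routes each sample of a mixed batch to its encoder.
import torch
import torch.nn as nn

MODALITIES = ("text", "picture", "video")   # assumed ordering of the one-hot code

def one_hot_modality(name):
    code = torch.zeros(len(MODALITIES))
    code[MODALITIES.index(name)] = 1.0
    return code

class MultiModalEncoder(nn.Module):
    def __init__(self, text_enc, picture_enc, video_enc):
        super().__init__()
        self.encoders = nn.ModuleDict({"text": text_enc,
                                       "picture": picture_enc,
                                       "video": video_enc})

    def forward(self, batch):
        # batch: list of (modality_one_hot, raw_input) pairs mixed in one batch;
        # each raw_input already carries a leading batch dimension of 1.
        embeddings = []
        for code, raw in batch:
            name = MODALITIES[int(code.argmax())]        # decode the one-hot modality code
            embeddings.append(self.encoders[name](raw))  # (1, 200) embedding per sample
        return torch.cat(embeddings, dim=0)
```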
The text modality data is encoded with a bidirectional long short-term memory (LSTM) model, and the sentence is then encoded, through a fully connected layer, into a 200-dimensional vector serving as the embedded data of the sentence. Specifically, the text modality data is encoded with the same bidirectional LSTM model as the one used for entity recognition to obtain the hidden state of the last cell of the bidirectional LSTM model, and the sentence is then encoded, through a fully connected layer, into a 200-dimensional vector serving as the embedding of the sentence.
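A possible PyTorch sketch of this text encoder is given below. Only the bidirectional LSTM, the use of the final hidden states and the 200-dimensional output follow the description; the vocabulary size and hidden dimension are assumptions.

```python
# Sketch of the text encoder: BiLSTM, final hidden states of both directions, FC layer to 200-d.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=256, out_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, out_dim)    # 200-d sentence embedding

    def forward(self, tokens):                          # tokens: (batch, seq_len)
        _, (h_n, _) = self.lstm(self.embed(tokens))     # h_n: (2, batch, hidden_dim)
        last = torch.cat([h_n[-2], h_n[-1]], dim=-1)    # forward + backward final hidden states
        return self.fc(last)                            # (batch, 200)
```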
The picture modality data is encoded with a deep residual network (ResNet) model, and the picture is then encoded, through a fully connected layer, into a 200-dimensional vector serving as the embedded data of the picture. As shown in fig. 3, the ResNet model is divided into 5 parts: conv1, conv2_x, conv3_x, conv4_x and conv5_x. The embodiment of the invention applies ResNet34: first a 7x7 convolution layer with 64 channels, then 3+4+6+3=16 building blocks of 2 layers each, i.e. 16x2=32 layers, and finally an fc layer (for classification), so the whole model has 1+32+1=34 layers.
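A sketch of such a picture encoder, assuming torchvision's pretrained ResNet-34 as the backbone, might look as follows; only the ResNet backbone plus a fully connected layer producing a 200-dimensional vector is taken from the description.

```python
# Sketch of the picture encoder: ResNet-34 backbone, classification head replaced by a 200-d FC layer.
import torch.nn as nn
from torchvision import models

class PictureEncoder(nn.Module):
    def __init__(self, out_dim=200):
        super().__init__()
        # torchvision >= 0.13; older versions use models.resnet34(pretrained=True)
        backbone = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)
        backbone.fc = nn.Identity()              # keep the 512-d pooled feature
        self.backbone = backbone
        self.fc = nn.Linear(512, out_dim)        # 200-d picture embedding

    def forward(self, images):                   # images: (batch, 3, 224, 224)
        return self.fc(self.backbone(images))
```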
For the video modality data, each frame picture is encoded with a deep residual network model and a bidirectional long short-term memory (LSTM) model, and the result is then encoded, through a fully connected layer, into a vector serving as the embedded data of the video. Specifically, as shown in fig. 4, the video modality data is decomposed into individual frame pictures, and each frame picture is encoded with the same deep residual network as used for pictures, pretrained on ImageNet; the encoding result of each frame picture is input into a bidirectional LSTM model for encoding to obtain the hidden state of the last cell of the LSTM, and the sequence is then encoded, through a fully connected layer, into a 200-dimensional vector serving as the embedded data of the video.
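The video encoder could be sketched as follows, reusing the per-frame ResNet backbone from the picture encoder; the frame-count handling and the hidden dimension are assumptions.

```python
# Sketch of the video encoder: per-frame ResNet features, BiLSTM over frames, FC layer to 200-d.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    def __init__(self, frame_encoder, feat_dim=512, hidden_dim=256, out_dim=200):
        super().__init__()
        self.frame_encoder = frame_encoder       # e.g. the 512-d ResNet backbone above
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, frames):                   # frames: (batch, n_frames, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1)).view(b, t, -1)  # per-frame features
        _, (h_n, _) = self.lstm(feats)           # final hidden states of both directions
        return self.fc(torch.cat([h_n[-2], h_n[-1]], dim=-1))  # (batch, 200) video embedding
```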
Step S2 further comprises: passing the embedded data corresponding to the different modalities through the same softmax layer to predict the corresponding labels.
As shown in fig. 5, step S2 includes:
step S20: and inputting the data of each mode. Including text modality data, picture modality data, and video modality data.
Step S21: preprocessing the data of each mode.
Step S22: and (3) putting the data of different modes into the same batch to carry out effective coding (One-Hot) coding, and selecting a corresponding coder. Specifically, putting different mode data into the same batch process for training simultaneously, and coding by applying One-bit effective coding (One-Hot); different modes are identified according to the effective coding (One-Hot) coding, and then the corresponding coder is selected for coding.
Step S23: the video modality data is encoded by applying a corresponding encoder, and then the process goes to step S26. Specifically, each frame of picture is encoded by using a depth residual network model and a two-way long-short-term memory (lstm) model for video modal data, and then is encoded into a 200-dimensional vector serving as embedded data of the video through a full-connection layer.
Step S24: the text modality data is encoded by applying a corresponding encoder, and then jumps to step S26. Specifically, the text modal data is encoded using a two-way long-short term memory (lstm) model, and then the sentence is encoded into a 200-dimensional vector as embedded data of the sentence through a full connection layer.
Step S25: the corresponding encoder is applied to the picture modality data to encode, and then step S26 is performed. Specifically, the depth residual error network model is used for encoding the picture modal data, and then the picture is encoded into a 200-dimensional vector serving as the embedded data of the picture through a full connection layer.
Step S26: the embedded data corresponding to different modes are input into the same softmax layer.
Step S27: and applying a softmax layer to predict corresponding labels according to the embedded data corresponding to different modes.
The embedded data corresponding to the different modalities are aligned through the same predicted label, so that data of the same entity, whether within one modality or across modalities, lie at nearby positions in the same vector space.
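A minimal sketch of this shared label head is shown below; the number of entity labels is an assumption.

```python
# Sketch of the shared softmax label head: embeddings of every modality pass through the same
# classifier, so a common label pulls same-entity data of different modalities together.
import torch.nn as nn

class SharedLabelHead(nn.Module):
    def __init__(self, embed_dim=200, num_labels=500):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_labels)    # shared across modalities

    def forward(self, embeddings):            # embeddings: (batch, 200), any modality
        return self.classifier(embeddings)    # logits; CrossEntropyLoss applies the softmax

# Training uses one loss for all modalities in the batch, e.g.:
#   loss = nn.CrossEntropyLoss()(head(embeddings), entity_labels)
```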
Step S3: writing the obtained embedded data into a database to generate multi-modal data.
Specifically, the embedded data corresponding to the data of the different modalities is written into the database to generate the multi-modal data.
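One way to persist the embeddings is sketched below with SQLite; the table layout and the choice of SQLite are assumptions, since the patent only requires a database that supports later similarity lookup.

```python
# Sketch of step S3: store each embedding with its modality and a reference to the original content.
import sqlite3
import numpy as np

def write_embeddings(db_path, rows):
    """rows: iterable of (modality, content_id, embedding), embedding being a (200,) array."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS embeddings (
                     modality TEXT, content_id TEXT, vector BLOB)""")
    con.executemany(
        "INSERT INTO embeddings VALUES (?, ?, ?)",
        [(m, cid, np.asarray(vec, dtype=np.float32).tobytes()) for m, cid, vec in rows])
    con.commit()
    con.close()
```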
According to the embodiment of the invention, patient education content of other modalities can be actively searched for and generated from patient education content of a single modality, which enriches the patient education content and spares the user processes such as active searching.
In a preferred mode of this embodiment, as shown in fig. 6, the method for generating multi-modal patient education content comprises:
step S50, inputting single-mode data;
specifically, it may be one of text data, picture data, and video data.
Step S51, model training;
specifically, model training is performed on input single-mode data to obtain corresponding embedded data.
Step S52, searching a k-dimensional tree (KD tree);
specifically, similar data similar to the embedded data is found in a database of a k-dimensional tree (KD-tree).
Step S53, filtering.
And filtering the searched similar data.
Step S54, inserting single mode data;
similar data is inserted into the single modality data.
In step S55, multi-modal data is formed.
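The retrieval flow of steps S50 to S55 can be sketched as below, using scipy's cKDTree for the KD-tree search; the distance threshold used for filtering and the metadata layout are assumptions.

```python
# Sketch of steps S50-S55: KD-tree search over stored embeddings, filtering, and assembly.
import numpy as np
from scipy.spatial import cKDTree

def retrieve_other_modalities(query_emb, stored_embs, stored_meta,
                              query_modality, k=5, max_dist=1.0):
    tree = cKDTree(stored_embs)                  # stored_embs: (n, 200) array (step S52)
    dists, idxs = tree.query(query_emb, k=k)     # nearest stored embeddings
    results = []
    for dist, idx in zip(np.atleast_1d(dists), np.atleast_1d(idxs)):
        modality, content_id = stored_meta[idx]
        if modality != query_modality and dist <= max_dist:   # filtering (step S53)
            results.append((modality, content_id, float(dist)))
    return results   # attached to the current content to form multi-modal data (steps S54-S55)
```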
According to the embodiment of the invention, content of different modalities can be mapped into the same vector space, so that content of different modalities on the same topic has high similarity in that space and can be retrieved mutually; when patient education content is generated, data of different modalities can therefore be fused and then pushed to the patient.
In this embodiment, patient education content of other modalities can be actively searched for and generated from patient education content of a single modality, which enriches the patient education content and spares the user processes such as active searching.
Secondly, information of different modalities is fed into the trained model to obtain a corresponding low-dimensional, dense embedded representation. Vector similarity retrieval then returns the most similar matching content of the other modalities, which together form a piece of patient education content.
Furthermore, the data of different modalities are placed in the same batch and trained simultaneously, and the data of each modality is processed by the neural network into a low-dimensional dense vector, which greatly saves storage space and improves performance for tasks such as online pushing and searching. The data of different modalities are aligned through the same predicted label, so that data on the same topic, whether within one modality or across modalities, lie at nearby positions in the same vector space.
Embodiment 2
Fig. 7 is a block diagram of a device for generating multi-modal patient education content according to a second embodiment of the present invention. The device comprises: an entity recognition unit 1, a data generation unit 2 connected to the entity recognition unit 1, and a writing unit 3 connected to the data generation unit 2. The entity recognition unit 1 is used for performing entity recognition on patient education content of different modalities to generate data of the different modalities. In particular, the patient education content may include, but is not limited to, text data, picture data and video data. When the patient education content is text data, the entity recognition unit 1 uses a bidirectional long short-term memory network plus a conditional random field (BiLSTM+CRF) to recognize the entities in the text and generate text modality data. The entities may include diseases, drugs, treatment modalities, examinations, operations, and so on. When the patient education content is picture data, the entity recognition unit 1 performs on the title or label of the picture the same entity recognition processing as for text data, generating picture modality data. When the patient education content is video data, the subtitles contained in the video, or the audio information in the video, can be turned into text: the entity recognition unit 1 recognizes the subtitles with software and converts them into characters, or converts the audio information into characters, producing a text; the information in this text then undergoes the same entity recognition processing as for text data, generating video modality data. More specifically, the entity recognition unit 1 converts the subtitles in the video into characters by optical character recognition (OCR), or converts the audio information into characters.
The data generation unit 2 is used for encoding the data of the different modalities to generate embedded data corresponding to each modality. Specifically, the data generation unit 2 generates a corresponding plurality of training data from the multiple entities of the data of each modality, and the training data of the different modalities are then encoded by different encoders to produce the embedded data. In the embodiment of the invention, if a single input of a single modality contains N corresponding entities, N corresponding training samples can be generated for that modality's data: (input, N1), (input, N2), ..., (input, Nn). During training, the data generation unit 2 places the data of different modalities in the same batch and trains them simultaneously, applies one-hot (One-Hot) encoding, identifies the different modalities from the one-hot code, and selects the corresponding encoder for encoding. Specifically, the data generation unit 2 places the information of the different modality data in the same batch and trains it at the same time; the data of each modality can encode its modality information with a one-hot code, the training model encodes according to this one-hot information, identifies the different modalities from the one-hot code of the modality, and thereby selects the corresponding encoder during training.
The data generation unit 2 encodes the text modality data with a bidirectional long short-term memory (LSTM) model, and encodes the sentence, through a fully connected layer, into a 200-dimensional vector serving as the embedded data of the sentence. Specifically, the data generation unit 2 encodes the text modality data with the same LSTM model as the one used for entity recognition, obtains the hidden state of the last cell of the LSTM model, and then encodes the sentence, through a fully connected layer, into a 200-dimensional vector serving as the embedding of the sentence. The data generation unit 2 encodes the picture modality data with a deep residual network (ResNet) model, and encodes the picture, through a fully connected layer, into a 200-dimensional vector serving as the embedded data of the picture. For the video modality data, the data generation unit 2 encodes each frame picture with a deep residual network model and a bidirectional LSTM model, and encodes the result, through a fully connected layer, into a vector serving as the embedded data of the video. Specifically, the data generation unit 2 decomposes the video modality data into individual frame pictures, encodes each frame picture with the same deep residual network as used for pictures, pretrained on ImageNet, inputs the encoding result of each frame picture into a bidirectional LSTM model for encoding, obtains the hidden state of the last cell of the LSTM, and then encodes the sequence, through a fully connected layer, into a 200-dimensional vector serving as the embedded data of the video.
In the embodiment of the invention, the data generation unit 2 also passes the embedded data corresponding to the different modalities through the same softmax layer to predict the corresponding labels.
The writing unit 3 is used for writing the obtained embedded data into a database for vector similarity queries. Specifically, the writing unit 3 writes the embedded data corresponding to the data of the different modalities into the database, generating multi-modal data. In this embodiment, patient education content of other modalities can be actively searched for and generated from patient education content of a single modality, which enriches the patient education content and spares the user processes such as active searching.
Secondly, information of different modalities is fed into the trained model to obtain a corresponding low-dimensional, dense embedded representation. Vector similarity retrieval then returns the most similar matching content of the other modalities, which together form a piece of patient education content.
Furthermore, the data of different modalities are placed in the same batch and trained simultaneously, and the data of each modality is processed by the neural network into a low-dimensional dense vector, which greatly saves storage space and improves performance for tasks such as online pushing and searching. The data of different modalities are aligned through the same predicted label, so that data on the same topic, whether within one modality or across modalities, lie at nearby positions in the same vector space.
Embodiment 3
Fig. 8 shows a structural diagram of a server according to a third embodiment of the present invention. The server comprises: a memory 71, a processor 72, a communication interface 73 and a bus 74; the processor 72, the memory 71 and the communication interface 73 communicate with each other via the bus 74. The memory 71 is used for storing various data; specifically, the memory 71 stores data such as text data, picture data, video data and the various modality data, without being limited thereto, and also contains a plurality of computer programs.
The communication interface 73 is used for information transmission between the communication devices of the server.
The processor 72 is configured to invoke the computer programs in the memory 71 to execute the method for generating multi-modal patient education content according to the first embodiment, for example:
performing entity recognition on patient education content of at least one modality to generate data of different modalities;
encoding the data of the different modalities to generate embedded data corresponding to each modality;
and writing the obtained embedded data into a database to generate multi-modal data.
In this embodiment, patient education content of other modalities can be actively searched for and generated from patient education content of a single modality, which enriches the patient education content and spares the user processes such as active searching.
The invention belongs to the technical field of data processing. The units and algorithm steps of the examples described in connection with the disclosed embodiments can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the technical solution.

Claims (6)

1. A method for generating multi-modal patient education content, characterized by comprising the following steps:
step 1: defining entities and performing entity recognition on patient education content of different modalities to generate data of the different modalities, wherein the patient education content of different modalities comprises text data, picture data and video data; entity recognition is performed on the text data to generate text modality data; a picture title or picture label is extracted from the picture data and entity recognition is performed on it to generate picture modality data; subtitle or audio information in the video data is converted into text and entity recognition is performed on that text to generate video modality data;
step 2: encoding the text modality data, picture modality data and video modality data, the encoded data being the embedded data corresponding to the data of each modality, comprising the following steps:
step 201: encoding the text modality data, picture modality data and video modality data with one-hot (One-Hot) encoding; when training with the one-hot encoding, the following method is adopted: the text modality data, picture modality data and video modality data used for training are placed in the same batch and trained simultaneously, so that the encoding result output with the one-hot encoding can distinguish whether the input modality data is text modality data, picture modality data or video modality data;
step 202: automatically selecting the corresponding encoder to encode the text modality data, picture modality data and video modality data according to the result of the one-hot encoding, wherein: the encoder for text modality data encodes the text modality data with a bidirectional LSTM model and then, through a fully connected layer, encodes it into a 200-dimensional vector serving as the embedded data of the text modality data;
the encoder for picture modality data encodes the picture modality data with a deep residual network model and then, through a fully connected layer, encodes it into a 200-dimensional vector serving as the embedded data of the picture modality data;
the encoder for video modality data encodes each frame image of the current video with a deep residual network model and a bidirectional LSTM model and then, through a fully connected layer, encodes it into a vector serving as the embedded data of the video modality data;
step 3: predicting the corresponding label from the embedded data of the text modality data, picture modality data and video modality data obtained in step 2 through the same softmax layer;
step 4: writing the embedded data obtained in the previous step, together with the text modality data, picture modality data and video modality data of the corresponding modalities, into a database, and aligning the embedded data corresponding to the different modalities through the same label, so that embedded data of the same entity, whether within one modality or across modalities, lie at nearby positions in the same vector space;
step 5: after obtaining patient education content of a current modality in real time, obtaining the current embedded data of that content using the same procedure as steps 1 and 2, searching the database obtained in step 4 with a k-dimensional tree for embedded data similar to the current embedded data, and inserting the text modality data, picture modality data or video modality data corresponding to the similar embedded data into the modality data of the current modality, thereby obtaining patient education content of modalities different from the current one.
2. The method of claim 1, wherein in step 1 the entity recognition uses a bidirectional long short-term memory network plus a conditional random field.
3. The method of claim 1, wherein in step 202 the encoding of the video modality data by the encoder for video modality data comprises the following steps:
first, a deep residual network model encodes the video modality data; the encoding result of the deep residual network model is then input into a bidirectional LSTM model for secondary encoding; finally, the encoding result output by the bidirectional LSTM model is encoded, through a fully connected layer, into a vector serving as the embedded data of the video modality data.
4. A device for generating multi-modal patient education content, characterized by comprising: an entity recognition unit, a data generation unit and a writing unit; the entity recognition unit is used for performing entity recognition on patient education content of different modalities to generate data of the different modalities; the data generation unit is connected with the entity recognition unit and is used for encoding the data of the different modalities to generate embedded data corresponding to each modality; the writing unit is connected with the data generation unit and is used for writing the obtained embedded data into a database to generate multi-modal data, implementing the method for generating multi-modal patient education content according to any one of claims 1 to 3.
5. A server, characterized by comprising: a memory, one or more processors, and a computer program stored in the memory and executable on the processors; the one or more programs are executed by the one or more processors, so that the one or more processors implement the method for generating multi-modal patient education content according to any one of claims 1 to 3.
6. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for generating multi-modal patient education content according to any one of claims 1 to 3.
CN201910957077.0A 2019-10-10 2019-10-10 Method, device, server and storage medium for generating multi-modal patient education content Active CN110706771B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910957077.0A CN110706771B (en) 2019-10-10 2019-10-10 Method, device, server and storage medium for generating multi-modal patient education content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910957077.0A CN110706771B (en) 2019-10-10 2019-10-10 Method, device, server and storage medium for generating multi-modal patient education content

Publications (2)

Publication Number Publication Date
CN110706771A CN110706771A (en) 2020-01-17
CN110706771B true CN110706771B (en) 2023-06-30

Family

ID=69199143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910957077.0A Active CN110706771B (en) Method, device, server and storage medium for generating multi-modal patient education content

Country Status (1)

Country Link
CN (1) CN110706771B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111966883B (en) * 2020-08-13 2024-02-23 成都考拉悠然科技有限公司 Zero sample cross-modal retrieval method combining automatic encoder and generation countermeasure network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108257129A (en) * 2018-01-30 2018-07-06 浙江大学 The recognition methods of cervical biopsy region aids and device based on multi-modal detection network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102262624A (en) * 2011-08-08 2011-11-30 中国科学院自动化研究所 System and method for realizing cross-language communication based on multi-mode assistance
JP6506683B2 (en) * 2015-11-30 2019-04-24 株式会社沖データ Image forming apparatus and image processing method
CN107562812B (en) * 2017-08-11 2021-01-15 北京大学 Cross-modal similarity learning method based on specific modal semantic space modeling
CN107657989B (en) * 2017-09-11 2021-05-28 山东第一医科大学(山东省医学科学院) Multimodal medical image platform based on sparse learning and mutual information
CN110069650B (en) * 2017-10-10 2024-02-09 阿里巴巴集团控股有限公司 Searching method and processing equipment
CN109255047A (en) * 2018-07-18 2019-01-22 西安电子科技大学 Based on the complementary semantic mutual search method of image-text being aligned and symmetrically retrieve
CN109614895A (en) * 2018-10-29 2019-04-12 山东大学 A method of the multi-modal emotion recognition based on attention Fusion Features

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108257129A (en) * 2018-01-30 2018-07-06 浙江大学 The recognition methods of cervical biopsy region aids and device based on multi-modal detection network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A multi-modal multimedia retrieval model based on probabilistic latent semantic analysis; Zhang Yu et al.; Journal of Chinese Computer Systems (小型微型计算机系统), No. 8; full text *

Also Published As

Publication number Publication date
CN110706771A (en) 2020-01-17

Similar Documents

Publication Publication Date Title
CN109388793B (en) Entity marking method, intention identification method, corresponding device and computer storage medium
CN111027327B (en) Machine reading understanding method, device, storage medium and device
CN110287278B (en) Comment generation method, comment generation device, server and storage medium
CN109344413B (en) Translation processing method, translation processing device, computer equipment and computer readable storage medium
CN111026861B (en) Text abstract generation method, training device, training equipment and medium
CN113268586A (en) Text abstract generation method, device, equipment and storage medium
Islam et al. Exploring video captioning techniques: A comprehensive survey on deep learning methods
Xue et al. A better way to attend: Attention with trees for video question answering
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN110866098B (en) Machine reading method and device based on transformer and lstm and readable storage medium
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN113553850A (en) Entity relation extraction method based on ordered structure encoding pointer network decoding
CN111881292B (en) Text classification method and device
CN113657105A (en) Medical entity extraction method, device, equipment and medium based on vocabulary enhancement
CN114861601A (en) Event joint extraction method based on rotary coding and storage medium
CN112270184A (en) Natural language processing method, device and storage medium
CN116958997A (en) Graphic summary method and system based on heterogeneous graphic neural network
CN115831105A (en) Speech recognition method and device based on improved Transformer model
CN115169333A (en) Text entity identification method, device, equipment, storage medium and program product
CN117493608B (en) Text video retrieval method, system and computer storage medium
CN110706771B (en) Method, device, server and storage medium for generating multi-modal patient education content
CN113569068B (en) Descriptive content generation method, visual content encoding and decoding method and device
CN114519397A (en) Entity link model training method, device and equipment based on comparative learning
CN116469123A (en) Document picture processing method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant