CN113408208B - Model training method, information extraction method, related device and storage medium - Google Patents

Model training method, information extraction method, related device and storage medium

Info

Publication number
CN113408208B
CN113408208B (application CN202110709820.8A)
Authority
CN
China
Prior art keywords
data
sub
processed
vector
representation space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110709820.8A
Other languages
Chinese (zh)
Other versions
CN113408208A (en)
Inventor
刘曙铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Oppo Communication Technology Co ltd
Original Assignee
Chengdu Oppo Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Oppo Communication Technology Co ltd filed Critical Chengdu Oppo Communication Technology Co ltd
Priority to CN202110709820.8A priority Critical patent/CN113408208B/en
Publication of CN113408208A publication Critical patent/CN113408208A/en
Application granted granted Critical
Publication of CN113408208B publication Critical patent/CN113408208B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a model training method, an information extraction method, a related device, and a storage medium. The model training method comprises: acquiring N sample data, where each sample data includes sub-data of M categories; the N sample data correspond to M×N sub-data pairs, each sub-data pair comprising M sub-data that belong to different categories and correspond to an association relation, and the M categories of sub-data included in each sample data being associated with each other; and inputting the M×N sub-data pairs into a preset model for training to generate a pre-training model corresponding to each category of sub-data. The preset model is used for calculating the similarity between the M sub-data of each sub-data pair, and determining the vector representation space corresponding to each data category according to the similarity between the M sub-data included in each of the M×N sub-data pairs. The pre-training model obtained by this method can rapidly and accurately extract target information.

Description

Model training method, information extraction method, related device and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a model training method, an information extraction method, a related device, and a storage medium.
Background
At present, Internet penetration keeps rising and the number of Internet users keeps growing, and more and more people record and share their lives through multi-modal data such as videos. When creating a short video, it is necessary not only to prepare the video content, audio content, and text, but also to consider how to generate high-quality copy or titles to attract more viewers. Current copy-generation methods rely mainly on manual writing, which generally leads to problems such as low quality of the generated copy and low generation efficiency.
Disclosure of Invention
The embodiment of the application provides a model training method, an information extraction method, a related device and a storage medium, which can rapidly and accurately extract target information.
In order to solve the above technical problems, the present application adopts the following technical solutions:
in a first aspect, an embodiment of the present application provides a model training method, where the method includes:
acquiring N sample data; each sample data comprises sub-data of M categories; the M categories of sub-data included in the N sample data correspond to M×N sub-data pairs; each sub-data pair comprises M sub-data, each belonging to a different category; the M sub-data included in each sub-data pair correspond to an association relation; the M categories of sub-data included in each sample data are associated with each other; and M and N are positive integers greater than or equal to 2;
inputting the M×N sub-data pairs into a preset model for training, and generating a pre-training model corresponding to each category of sub-data; the preset model is used for calculating the similarity between the M sub-data included in each sub-data pair, and determining the vector representation space corresponding to each data category according to the similarity between the M sub-data included in each of the M×N sub-data pairs.
In a second aspect, an embodiment of the present application provides an information extraction method, where the method includes:
acquiring data to be processed, wherein the data to be processed comprises at least one category of sub-data to be processed;
inputting the at least one category of sub-data to be processed into a pre-training model corresponding to each category of sub-data to be processed, to obtain vector information corresponding to the data to be processed; wherein the pre-training model is a pre-training model obtained by the model training method according to the first aspect;
and extracting target information carried by the data to be processed according to the vector information.
In a third aspect, embodiments of the present application provide a model training apparatus, including:
the first acquisition module is used for acquiring N sample data; each sample data comprises sub-data of M categories; the M categories of sub-data included in the N sample data correspond to M×N sub-data pairs; each sub-data pair comprises M sub-data, each belonging to a different category; the M sub-data included in each sub-data pair correspond to an association relation; the M categories of sub-data included in each sample data are associated with each other; and M and N are positive integers greater than or equal to 2;
the training module is used for inputting the M×N sub-data pairs into a preset model for training and generating a pre-training model corresponding to each category of sub-data; the preset model is used for calculating the similarity between the M sub-data included in each sub-data pair, and determining the vector representation space corresponding to each data category according to the similarity between the M sub-data included in each of the M×N sub-data pairs.
In a fourth aspect, an embodiment of the present application provides an information extraction apparatus, including:
the second acquisition module is used for acquiring data to be processed, wherein the data to be processed comprises at least one category of sub-data to be processed;
the output module is used for inputting the at least one category of sub-data to be processed into the pre-training model corresponding to each category of sub-data to be processed, to obtain vector information corresponding to the data to be processed; wherein the pre-training model is a pre-training model obtained by the model training method according to the first aspect;
and the extraction module is used for extracting the target information carried by the data to be processed according to the vector information.
In a fifth aspect, the present application provides another model training apparatus, the apparatus comprising a processor, a memory, and a communication interface:
The processor is connected with the memory and the communication interface;
the memory is used for storing executable program codes;
the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory for executing the model training method as described in the first aspect above.
In a sixth aspect, the present application provides another information extraction apparatus, the apparatus comprising a processor, a memory, and a communication interface:
the processor is connected with the memory and the communication interface;
the memory is used for storing executable program codes;
the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory for executing the information extraction method as described in the second aspect above.
In a seventh aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the model training method according to the first aspect or the information extraction method according to the second aspect.
According to the present application, a preset model is trained with a large amount of sample data containing sub-data of multiple categories to generate a pre-training model corresponding to each category of sub-data. In use, original data are input into the pre-training model to obtain the corresponding vector information, and target information corresponding to the original data is then extracted by a preset information extraction model. With the model training method provided by the present application, the pre-training model is built using the idea of contrastive learning, vector information corresponding to each of multiple different categories of original data is generated, and a more accurate and compact representation of the original data is obtained; target information is then extracted from the resulting vector information, so that the extracted target information is of higher quality and can be extracted from the original data rapidly and accurately. This solves the problems of low quality and low generation efficiency caused by existing manual-writing approaches to information extraction.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a method for constructing sub-data pairs according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the vector representation space of a dual-tower model provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an encoder according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of an information extraction method according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a decoder according to an embodiment of the present application;
FIG. 8 is a schematic diagram of the overall framework of a method for extracting information based on a pre-training model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of the interface display of an electronic device during information extraction according to an embodiment of the present application;
FIG. 10 is a flowchart of another information extraction method according to an embodiment of the present application;
FIG. 11 is a flowchart of another information extraction method according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a model training device according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an information extraction device according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of another model training apparatus according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of another information extraction apparatus according to an embodiment of the present application.
Detailed Description
In order to make the above objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below.
The terms first, second, third and the like in the description and in the claims of the application and in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario provided in the present application. As shown in fig. 1, a video A may be input into an electronic device 10, and the electronic device 10 performs data analysis on the video A to output a title or text for the video A. In this embodiment of the present application, the title or text of the video A may be referred to as the target information of the video A, and the data input into the electronic device for information extraction may be referred to as the data to be processed. The data to be processed may be processed to obtain sub-data of at least one category, where the categories of sub-data may include, but are not limited to, video data, audio data, picture data, and text data; the present application does not limit the categories of sub-data.
The electronic device 10 may include, but is not limited to, a smart phone, a personal computer, a notebook computer, a smart tablet computer, a portable wearable device, and the like. The electronic device 10 has flexible access and high bandwidth communication capabilities, and may communicate over a variety of wireless operating networks including, but not limited to, GSM, code division multiple access (Code Division Multiple Access, CDMA), wideband code division multiple access (Wideband Code Division Multiple Access, W-CDMA), and the like, as well as over a wireless local area network, bluetooth, and infrared.
In embodiments of the present application, the electronic device 10 may include an encoder and a decoder. The encoder is used for training on the sample data to obtain a pre-training model and, during use, for obtaining vector information corresponding to the data to be processed according to the pre-training model; the decoder is used for generating target information according to the vector information. As shown in fig. 1, after a user uploads a video A to the electronic device, the electronic device may output the target information obtained by extracting information from the video A according to the information extraction method provided in the present application.
Next, the model training method and the information extraction method provided in the embodiments of the present application will be described with reference to the application scenario schematic diagram shown in fig. 1.
Referring to fig. 2, fig. 2 is a flow chart of a model training method according to an embodiment of the present application, where the method includes:
s201, N sample data are acquired.
Specifically, the electronic device trains on the sample data using its internal encoder, and first acquires N sample data. Each sample data comprises sub-data of M categories; the M categories of sub-data contained in the N sample data correspond to M×N sub-data pairs; each sub-data pair comprises M sub-data, each belonging to a different category; the M sub-data included in each sub-data pair correspond to an association relation; the M categories of sub-data contained in each sample data are associated with each other; and M and N are positive integers greater than or equal to 2. In embodiments of the present application, the sample data may include, but is not limited to, a variety of different categories of data, such as video data, audio data, image data, and text data.
Further, after acquiring the N sample data, the electronic device first processes all the sample data. The manner of processing the sample data may include batch processing, for which the electronic device first constructs a batch data set from the sample data. Specifically, taking 3 pieces of sample data whose sub-data categories are video data and audio data as an example, the 3 pieces of sample data are processed to obtain the sub-data corresponding to each piece, and sub-data pairs are constructed from the video data and audio data contained in the current sample data, each sub-data pair comprising one piece of video data and one piece of audio data.
FIG. 3 shows a schematic diagram of a method of constructing sub-data pairs. Assume the sample data include sub-data of two categories, audio and video, and that there are 3 pieces of sample data, namely sample data A, sample data B, and sample data C, as shown in fig. 3. First, the 3 pieces of data are decomposed to obtain six sub-data: video A, audio A, video B, audio B, video C, and audio C. Similar sub-data pairs and dissimilar sub-data pairs are then constructed from these 6 sub-data. In a similar sub-data pair, the sub-data of different categories come from the same source, i.e., from the same sample data. The similar sub-data pairs shown in fig. 3 include: (video A, audio A), (video B, audio B), (video C, audio C). In a dissimilar sub-data pair, the sub-data of different categories come from different sources, i.e., from different sample data. The dissimilar sub-data pairs shown in fig. 3 include: (video A, audio B), (video A, audio C), (video B, audio A), (video B, audio C), (video C, audio A), (video C, audio B). Illustratively, the sub-data pair (video A, audio A) is a similar sub-data pair, since both video A and audio A are derived from sample data A; the sub-data pair (video B, audio C) is a dissimilar sub-data pair, since video B is derived from sample data B while audio C is derived from sample data C. It will be appreciated that the similar and dissimilar sub-data pairs constructed in the above example may also be regarded as the positive and negative examples of the constructed sample data set; specifically, the similar sub-data pairs may be taken as positive examples and the dissimilar sub-data pairs as negative examples. In embodiments of the present application, similar sub-data pairs may also be referred to as similar instance pairs, and dissimilar sub-data pairs as dissimilar instance pairs. The present application does not limit the method for processing the sample data or the manner of constructing the sub-data pairs. A minimal sketch of this pair-construction step is given below.
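The following Python sketch illustrates the pair construction just described for the 3-sample, video/audio example. All names (samples, build_pairs) and data structures are illustrative assumptions, not identifiers from the patent.

```python
from itertools import product

# Illustrative batch: each sample has already been decomposed into video/audio
# sub-data. The string values stand in for real video and audio features.
samples = [
    {"id": "A", "video": "video_A", "audio": "audio_A"},
    {"id": "B", "video": "video_B", "audio": "audio_B"},
    {"id": "C", "video": "video_C", "audio": "audio_C"},
]

def build_pairs(samples):
    """Construct similar (same-source) and dissimilar (cross-source) pairs."""
    similar, dissimilar = [], []
    for s_v, s_a in product(samples, samples):
        pair = (s_v["video"], s_a["audio"])
        if s_v["id"] == s_a["id"]:
            similar.append(pair)     # e.g. (video_A, audio_A)
        else:
            dissimilar.append(pair)  # e.g. (video_A, audio_B)
    return similar, dissimilar

similar, dissimilar = build_pairs(samples)
print(len(similar), len(dissimilar))  # -> 3 6
```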
S202, inputting the M×N sub-data pairs into a preset model for training, and generating a pre-training model corresponding to each category of sub-data.
Specifically, after processing the N sample data to obtain the M×N sub-data pairs, the electronic device inputs the M×N sub-data pairs into a preset model for training, and after training obtains a pre-training model corresponding to each category of sub-data. The preset model is used for calculating the similarity between the M sub-data included in each sub-data pair, and determining the vector representation space corresponding to each data category according to the similarity between the M sub-data included in each of the M×N sub-data pairs. The preset model may include a contrastive learning model obtained based on dual-tower model training; this may be any contrastive learning model suitable for the model training method of the present application, and the present application does not limit its type. The method for training the model based on the dual-tower model may comprise: performing contrastive learning training based on data of any two categories.
Fig. 4 shows a schematic diagram of the vector representation space of a dual-tower model. The vector representation space may be presented in the form of a coordinate system: a two-dimensional coordinate system for a dual-tower model, and a three-dimensional coordinate system for a tri-tower model. In the vector representation space, sub-data from the same source are concentrated in one region, and sub-data from different sources occupy different regions. As shown in fig. 4, the sub-data included in each of sample data A, sample data B, and sample data C are concentrated: the sub-data included in sample data A are concentrated in the region corresponding to sample data A in the figure, and similarly the sub-data included in sample data B and sample data C are concentrated in their respective regions. The present application does not limit the presentation form of the vector representation space. Sub-data of different categories from the same source have higher similarity than sub-data of different categories from different sources. The higher the similarity, the closer the sub-data are in the vector representation space and the smaller the loss value of the loss function in the pre-training model; the lower the similarity, the farther apart the sub-data are in the vector representation space and the greater the loss value of the loss function in the pre-training model.
It can be understood that in practical applications, depending on the service scenario, one category of data is used as the reference tower for model training, and contrastive learning is then performed in combination with the data of the other categories. For example, if training is to obtain pre-training models corresponding to video data, audio data, picture data, and text data respectively, the features corresponding to the video data are generally taken as the reference tower in the dual-tower model, and the audio data, picture data, and text data are combined with it to build a video-audio dual-tower model, a video-picture dual-tower model, and a video-text dual-tower model, respectively; these dual-tower models are then trained on the sample data to obtain a pre-training model corresponding to each category of data. It should be noted that the embodiments of the present application do not limit the data category of the reference tower during dual-tower training; for example, but without limitation, audio data may serve as the reference tower. It will also be appreciated that in practical applications the dual-tower model may be extended to a multi-tower structure: when the sample data include three categories of data, the dual-tower model may be extended to a tri-tower model, the sample data may subsequently be trained based on the tri-tower model, and so on. In the embodiments of the present application, the extension to a multi-tower structure may be chosen flexibly according to the service and the sample data in practical applications, and the present application is not limited in this respect.
Further, inputting the M×N sub-data pairs into a preset model for training and generating a pre-training model corresponding to each category of sub-data includes: inputting the M×N sub-data pairs into the preset model; and training the preset model according to the association relations among the M sub-data included in the M×N sub-data pairs, to generate a pre-training model corresponding to each category of sub-data.
Specifically, the InfoNCE function may be used as the loss function for model training. The InfoNCE loss is calculated as follows:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_i')/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i, z_j')/\tau\right)}$$

where $(z_i, z_i')$ denotes a similar sub-data pair, $(z_i, z_j')$ with $j \neq i$ denotes a dissimilar sub-data pair, $\mathrm{sim}(\cdot,\cdot)$ is a similarity measure, and $\tau$ is a temperature parameter. Examples of similar and dissimilar sub-data pairs are detailed in the related content of S201.
During model training, the InfoNCE loss function pulls video data and audio data derived from the same sample data (i.e., the similar sub-data pairs mentioned in the previous embodiments) closer together in the current vector representation space, while pushing video data and audio data derived from different sample data (i.e., the dissimilar sub-data pairs mentioned in the previous embodiments) farther apart. For similar sub-data pairs, the higher the similarity, the closer the sub-data are in the vector representation space and the smaller the loss value of the loss function; the lower the similarity, the farther apart the sub-data are and the greater the loss value. For dissimilar sub-data pairs, the lower the similarity, the farther apart the sub-data are in the vector representation space and the smaller the loss value; the higher the similarity, the closer the sub-data are and the greater the loss value. Training the contrastive learning model with the InfoNCE loss yields a vector representation space corresponding to each category of data. In this vector representation space, similar sub-data pairs have high similarity and lie close together, while dissimilar sub-data pairs have low similarity and lie far apart. During training, the relevant parameter values of the loss function are continuously adjusted according to the respective similarities of the pre-constructed similar and dissimilar sub-data pairs, so that the trained pre-training model reaches an optimal state. Based on this training process, a pre-training model corresponding to each category of sub-data can be obtained. During subsequent use, data whose information is to be extracted is input into the pre-training model, and the vector information corresponding to that data can be obtained and used for downstream tasks. A minimal PyTorch sketch of this loss is given below.
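As one concrete form the training step could take, the following PyTorch sketch computes an InfoNCE loss over a batch of video and audio embeddings, treating same-index (same-source) rows as similar pairs and all cross-index combinations as dissimilar pairs. This is an illustrative implementation rather than the patent's own code; the cosine similarity measure and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_video: torch.Tensor, z_audio: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch of shape (batch, dim): row i of z_video and row i
    of z_audio form a similar pair; all other (i, j) combinations are
    dissimilar pairs. The temperature 0.07 is an assumed hyperparameter."""
    z_video = F.normalize(z_video, dim=-1)  # L2-normalize both towers
    z_audio = F.normalize(z_audio, dim=-1)
    logits = z_video @ z_audio.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(z_video.size(0), device=z_video.device)
    # Cross-entropy with the diagonal as targets is the InfoNCE loss: it pulls
    # similar pairs together and pushes dissimilar pairs apart.
    return F.cross_entropy(logits, targets)

# Usage with a batch of 3 sample data (cf. sample data A, B, C above).
z_v = torch.randn(3, 128, requires_grad=True)
z_a = torch.randn(3, 128, requires_grad=True)
loss = info_nce_loss(z_v, z_a)
loss.backward()
```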
The model training method provided by the embodiment of the application is executed by an encoder inside the electronic equipment. The structure of an encoder for performing the model training method in the embodiment of the present application will be described below in conjunction with the above model training method.
Fig. 5 shows a schematic structural diagram of an encoder. The encoder is mainly used for generating the pre-training model and for using the pre-training model to generate the vectors corresponding to the data whose information is to be extracted. As shown in fig. 5, the electronic device first obtains sample data including video data, audio data, picture data, and text data. Since the current model training performs contrastive learning based on a dual-tower model to obtain the pre-training model, the video data may be used as the reference tower to construct a video tower, which is combined with the other three categories of data to construct a video-audio dual-tower model, a video-picture dual-tower model, and a video-text dual-tower model, respectively.

Taking the video-audio dual-tower model as an example, the encoder in the electronic device first performs batch processing on the sample data and separates the video data, audio data, picture data, and text data to obtain the required video data and audio data. Suppose there are 3 pieces of sample data: sample data A, sample data B, and sample data C. Processing the three sample data respectively shows that sample data A includes video A and audio A, sample data B includes video B and audio B, and sample data C includes video C and audio C. Thus, when constructing the sub-data pairs, similar sub-data pairs such as (video A, audio A), (video B, audio B), (video C, audio C) and dissimilar sub-data pairs such as (video A, audio B), (video B, audio C) can be constructed according to the sources of the sub-data. If there are N pieces of sample data and M categories of data, M×N sub-data pairs can be obtained. After the sub-data pairs are constructed, the dual-tower model is trained with the InfoNCE function as the loss function to obtain the pre-training model.

As shown in fig. 5, the pre-training backbone differs for different categories of data: video data is generally trained with contrastive learning based on a 3D ResNet-50, audio data and text data are trained based on BERT models, and picture data is trained based on a ResNet. After the pre-training model is obtained, the vector representation space corresponding to each category of data can be obtained: as shown in fig. 5, the representations of the video data, audio data, picture data, and text data are taken from a linear projector, the CLS vector, a linear projector, and the CLS vector, respectively, and the vector corresponding to each sub-data can be obtained from these representation spaces. Taking one piece of sample data as an example, the video data, audio data, picture data, and text data are transformed into the vectors z_a, z_b, z_c, and z_d, respectively. After the vector information is obtained, the data generally needs to be L2-regularized to avoid overfitting; after this processing, a feature fusion operation is performed on the individual vectors to obtain the final vectors corresponding to the sample data, namely (z_a, z_b), (z_a, z_c), and (z_a, z_d).
Here, the vector (z_a, z_b) is the vector output by the video-audio dual-tower model, and its original data include video data and audio data; the vector (z_a, z_c) is the vector output by the video-picture dual-tower model, and its original data include video data and picture data; the vector (z_a, z_d) is the vector output by the video-text dual-tower model, and its original data include video data and text data. At this point the encoder's task is complete, and the generated vectors may be sent to the pre-training model in the decoder obtained by training a T5 model, so that the decoder generates the target information. A sketch of such a dual-tower encoder is given below.
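To make the dual-tower structure concrete, the sketch below wires a video tower and an audio tower into one module whose forward pass returns L2-normalized vectors. The small MLP backbones are simplified stand-ins for the 3D ResNet-50 and BERT encoders named above; every module name and dimension here is an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Stand-in encoder tower: backbone features -> linear projector head."""
    def __init__(self, in_dim: int, out_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.projector = nn.Linear(256, out_dim)  # the "linear projector"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.projector(self.backbone(x))
        return F.normalize(z, dim=-1)  # L2 regularization of the vector

class VideoAudioDualTower(nn.Module):
    """Video tower as the reference tower, paired with an audio tower."""
    def __init__(self, video_dim: int = 2048, audio_dim: int = 768):
        super().__init__()
        self.video_tower = Tower(video_dim)  # stand-in for 3D ResNet-50 features
        self.audio_tower = Tower(audio_dim)  # stand-in for a BERT-style encoder

    def forward(self, video_feats, audio_feats):
        return self.video_tower(video_feats), self.audio_tower(audio_feats)

model = VideoAudioDualTower()
z_a, z_b = model(torch.randn(3, 2048), torch.randn(3, 768))
fused = torch.cat([z_a, z_b], dim=-1)  # feature fusion by splicing: (z_a, z_b)
```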
According to the model training method described above, after the sample data are acquired, they are first processed and sub-data pairs are constructed; contrastive learning is then performed on the processed sub-data pairs based on the dual-tower model to obtain the pre-training model, and the vector representation space corresponding to each category of data is obtained from the dual-tower training at the same time. These vector representation spaces are used in downstream information extraction tasks to generate vector information specific to each piece of data, laying a foundation for subsequent information extraction and improving the quality of the extracted target information.
The information extraction method provided by the embodiment of the present application will be described below in conjunction with the model training method provided by the embodiment of the present application. The information extraction method provided by the embodiment of the application adopts the model training method to extract information.
Referring to fig. 6, fig. 6 is a flowchart of an information extraction method according to an embodiment of the present application, where the method includes:
s601, acquiring data to be processed.
Specifically, when the user wants to perform an information extraction operation, the user can click a designated position on the interface of the electronic device and upload the data to be processed at that position. The electronic device then obtains the data to be processed, where the data to be processed comprises at least one category of sub-data to be processed. For example, the data to be processed may be any video in a video application; after processing, the video yields video data and audio data, which are the sub-data to be processed.
S602, inputting the sub-data to be processed of at least one category into a pre-training model corresponding to each category of the sub-data to be processed, and obtaining vector information corresponding to the data to be processed.
Specifically, after the electronic device obtains the data to be processed, three cases arise. When the data to be processed comprise only one piece of data and that piece contains only one category of sub-data, no batch processing is needed: the data is directly input into the pre-training model corresponding to that category of sub-data, and the pre-training model outputs the vector information corresponding to the data to be processed. When the data to be processed comprise one piece of data containing sub-data of at least two categories, the sub-data of the at least two categories are respectively input into the pre-training models of their respective categories, which output the vector information corresponding to the sub-data of the at least two categories. When the data to be processed comprise at least two pieces of data, each containing at least one category of sub-data, the data to be processed are batch processed; please refer to the above embodiments for the specific batch-processing method, which is not repeated here. Batch processing yields the at least one piece of sub-data to be processed included in the data to be processed, and the electronic device inputs the sub-data to be processed into the pre-training models corresponding to their respective categories to obtain the vector information corresponding to the data to be processed. For example, if batch processing of the data to be processed yields sub-data comprising video data and audio data, the video data is input into the video-tower pre-training model and the audio data into the audio-tower pre-training model, obtaining the vector information corresponding to the video data and to the audio data respectively. The pre-training model is obtained by the model training method shown in fig. 2; please refer to the above embodiments for the specific training method, which is not detailed in this embodiment.
S603, extracting target information carried by the data to be processed according to the vector information.
Specifically, after the vector information is obtained, it is regularized to avoid overfitting; the encoder in the electronic device then transfers the processed vector information to the decoder, and the decoder extracts the target information from the vector information corresponding to the data to be processed based on a T5 (Text-to-Text Transfer Transformer) model.
Fig. 7 shows a schematic diagram of a decoder. The function of the decoder is mainly to extract target information from the vector information generated by the encoder, using a pre-training model for information extraction obtained by training a T5 model. After the encoder generates the vector information corresponding to the data to be processed, the vector information is transmitted to the decoder; the decoder receives the vector information and inputs it into the pre-training model for information extraction, and the model processes the vector information to generate the target information corresponding to the data to be processed. The T5 model can convert all natural language processing tasks into text-to-text tasks; it adopts a Transformer structure and has extremely strong feature extraction capability. Natural language processing tasks may include, but are not limited to: text translation tasks, text classification tasks, text generation tasks, and automatic summarization tasks. A text translation task translates input text from its language into text in a specified language. A text classification task automatically classifies input text according to certain criteria, for example classifying input words by semantic features. Text generation and automatic summarization tasks are similar to the information extraction method provided in the embodiments of the present application, in that their main purpose is to extract the target information of the data to be processed. The method for obtaining the information extraction pre-training model by training a T5 model includes: creating a self-supervised task (such as language modeling or filling in missing words) and pre-training the model with a large amount of sample data to obtain a pre-trained model for information extraction; and then fine-tuning that model with a small amount of data comprising various categories of original data and target information generated from those original data. Fine-tuning specifically includes adjusting all parameters contained in the pre-trained model; continuously adjusting the model optimizes it and improves its effect. The fine-tuned pre-training model for information extraction can then be used to extract target information. It should be noted that since T5 is pre-trained on English sample data, the multilingual version mT5 may be used to train the model in the embodiments of the present application. An illustrative sketch of such a decoder interface is given below.
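As a hedged sketch of how a decoder of this kind might consume the encoder's fused vector information, the following uses a Hugging Face mT5 checkpoint and passes the vectors in as inputs_embeds (supported by recent versions of the transformers library). The adapter layer, checkpoint name, dimensions, and the idea of projecting fused vectors into the model's embedding space are illustrative assumptions; the patent does not specify this interface.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
decoder = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Assumed adapter: project the encoder's fused vectors (say 256-d) into the
# mT5 embedding space so they can serve as a pseudo input sequence.
adapter = nn.Linear(256, decoder.config.d_model)

fused_vectors = torch.randn(1, 4, 256)   # (batch, seq, dim), illustrative
inputs_embeds = adapter(fused_vectors)

# Fine-tuning step: teacher-force against a reference title.
labels = tokenizer("a precious family book", return_tensors="pt").input_ids
loss = decoder(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()

# Inference: generate the target information (title/caption) from the vectors.
with torch.no_grad():
    ids = decoder.generate(inputs_embeds=inputs_embeds, max_new_tokens=32)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```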
Fig. 8 shows a schematic diagram of the overall framework of a method for extracting information based on a pre-training model, and fig. 9 shows a schematic diagram of the interface display of an electronic device during information extraction. As shown in fig. 8, the overall framework for information extraction based on the pre-training model includes an encoder and a decoder. The main function of the encoder is to obtain the vector information corresponding to the data to be processed using the pre-training model obtained by the above model training method, and the main function of the decoder is to extract the target information corresponding to the data to be processed from the vector information obtained by the encoder. For example, when a user inputs a piece of data into the electronic device, the electronic device processes the data and determines that it includes two categories of data, video data and audio data; it obtains the video vector information and audio vector information from the pre-training models corresponding to video data and audio data, performs feature fusion on the video vector information and the audio vector information, and obtains the vector information of the piece of data after this processing. After the encoder acquires the vector information, it sends the vector information to the decoder; the decoder decodes the vector information of the data to be processed according to the pre-training model trained in advance on a T5 model, and extracts the target information of the data to be processed from the vector information. As shown in fig. 9, according to the information extraction method provided in the present application, when a user uploads data A whose information is to be extracted at a designated position on the electronic device, the encoder and decoder inside the system can extract the target information of data A through the processing described above, obtain the target information shown in fig. 9B, namely "a precious family book", and display it on the screen of the electronic device for the user.
According to the above information extraction method, when the user inputs data whose information is to be extracted, the electronic device extracts the vector information of the data according to the pre-training model trained in advance, and the decoder extracts the target information corresponding to the data from that vector information. By converting the data into vectors and extracting information from the vectors, this solves the problems of low quality and low efficiency of extracted information caused by the low accuracy of manual writing and similar approaches in current information extraction methods.
Referring to fig. 10, fig. 10 shows a flowchart of another information extraction method. The method comprises the following steps:
s1001, acquiring data to be processed.
Specifically, the user uploads the data to be processed to the electronic device, and the electronic device obtains the data to be processed, which comprises at least one category of sub-data to be processed. Please refer to the above embodiments for the categories of the data to be processed, which are not repeated in this embodiment.
S1002, inputting at least one type of sub-data to be processed into a pre-training model corresponding to each type of sub-data to be processed, and obtaining vector information corresponding to each type of sub-data to be processed.
Specifically, suppose a piece of data to be processed includes sub-data of two categories, video data and audio data. After the electronic device acquires the sub-data to be processed, it inputs the video data into the pre-training model corresponding to video data and the audio data into the pre-training model corresponding to audio data, obtaining the vector information corresponding to the video data and to the audio data respectively. The pre-training model is obtained by training according to the model training method in the above embodiments. Please refer to the above embodiments for the specific method of obtaining vector information from the pre-training model, which is not repeated here.
S1003, performing feature fusion operation on vector information corresponding to each sub-data to be processed to obtain vector information corresponding to the data to be processed.
Specifically, the electronic device performs a feature fusion operation on the vector information corresponding to the acquired sub-data to be processed. The feature fusion operation includes at least one of a splicing (concatenation) operation and a pooling operation. For the splicing operation, the specific method is: the vector information of at least two pieces of sub-data to be processed is spliced into one vector, and the spliced vector is the vector information corresponding to the data to be processed; the format of the spliced vector may include, but is not limited to, a list format. The pooling operation includes two methods. The first is the sum method: the vectors of all the sub-data to be processed are added element-wise, and the summed vector is the vector information corresponding to the data to be processed. The second is the average method: the element-wise average of the vectors of all the sub-data to be processed is computed, and the resulting vector is the vector corresponding to the data to be processed. For example, if a piece of data to be processed includes two pieces of sub-data whose corresponding vectors are z_a and z_b, then splicing yields the fused vector (z_a, z_b), sum pooling yields z_a + z_b, and average pooling yields (z_a + z_b)/2. These three operations are sketched below.
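The three fusion operations just described can be written compactly; the tensors below are illustrative placeholders, a minimal sketch:

```python
import torch

z_a = torch.tensor([1.0, 2.0, 3.0])  # vector of the first sub-data (illustrative)
z_b = torch.tensor([4.0, 5.0, 6.0])  # vector of the second sub-data (illustrative)

spliced = torch.cat([z_a, z_b])             # splicing: (z_a, z_b), length 6
summed = z_a + z_b                          # sum pooling: z_a + z_b
averaged = torch.stack([z_a, z_b]).mean(0)  # average pooling: (z_a + z_b) / 2
```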
S1004, extracting target information carried by the data to be processed according to the vector information.
Specifically, after the vector information is acquired, an encoder in the electronic device transmits the processed vector information to a decoder, and the decoder extracts target information according to the vector information corresponding to the data to be processed based on a T5 model. The process of extracting information based on the T5 model is referred to the above embodiment, and the description of this embodiment is omitted.
In the above method for obtaining the vector information of the data to be processed from its sub-data, the sub-data to be processed are input into the pre-training model to obtain the vector information corresponding to each piece of sub-data, and feature fusion is then performed on that vector information to obtain the vector information corresponding to the data to be processed. By first extracting the vector information corresponding to each category of sub-data and then fusing the vectors of the multiple pieces of sub-data included in the data to be processed, the final vector of the data to be processed is more accurate, which in turn improves the accuracy of the target information generated during subsequent information extraction.
Referring to fig. 11, fig. 11 is a flowchart illustrating another method for determining a vector corresponding to data to be processed according to sub-data to be processed. The method comprises the following steps:
S1101, acquiring data to be processed.
Specifically, the user uploads the data to be processed to the electronic device, and the electronic device obtains the data to be processed, which comprises at least one category of sub-data to be processed. Please refer to the above embodiments for the categories of the data to be processed, which are not repeated in this embodiment.
S1102, inputting at least two types of sub-data to be processed into pre-training models corresponding to the types of the sub-data to be processed.
Specifically, suppose a piece of data to be processed includes sub-data of two categories, video data and audio data. After the electronic device acquires the sub-data to be processed, it inputs the video data into the pre-training model corresponding to video data and the audio data into the pre-training model corresponding to audio data. The pre-training model is obtained by training according to the model training method in the above embodiments.
S1103, determining the similarity between the sub-data to be processed of at least two categories.
Specifically, the electronic device obtains the similarity of the video data and the audio data contained in the data to be processed according to the pre-training model. The similarity represents the degree of similarity between the sub-data of different categories in a sub-data pair, and in particular can represent the likelihood that the sub-data of different categories are derived from the same original data. For similar sub-data pairs, the closer the sub-data are in the vector representation space, the higher the similarity and the smaller the loss value of the loss function; the farther apart they are, the lower the similarity and the greater the loss value. For dissimilar sub-data pairs, the farther apart the sub-data are in the vector representation space, the lower the similarity and the smaller the loss value; the closer together they are, the higher the similarity and the greater the loss value. For example, consider two sub-data pairs: (video A, audio A) and (video A, audio B). After analysis, the electronic device can determine that the video and audio sub-data in the first pair are both derived from data A, while in the second pair the video is derived from data A and the audio from data B. Therefore, the similarity between the video data and audio data in the first sub-data pair is greater than that in the second sub-data pair. One simple realization of this similarity measure is sketched below.
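One simple way to realize the similarity described here is cosine similarity between the two sub-data vectors produced by the pre-training models. The snippet below is an illustrative sketch, not the patent's prescribed measure:

```python
import torch
import torch.nn.functional as F

def pair_similarity(z_video: torch.Tensor, z_audio: torch.Tensor) -> torch.Tensor:
    """Cosine similarity in [-1, 1]; higher values suggest the two sub-data
    are more likely to be derived from the same original data."""
    return F.cosine_similarity(z_video, z_audio, dim=-1)

# e.g. comparing the embeddings of (video A, audio A) vs (video A, audio B)
sim = pair_similarity(torch.randn(128), torch.randn(128))
```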
S1104, generating vector information corresponding to each sub-data to be processed according to the similarity and the vector representation space.
Specifically, after determining the similarity between the sub-data of different categories in a sub-data pair, the electronic device combines the similarity with the vector representation space corresponding to the category of each piece of sub-data to be processed, to generate the vector information corresponding to each piece of sub-data to be processed.
S1105, performing feature fusion operation on the vector information corresponding to each sub-data to be processed to obtain the vector information corresponding to the data to be processed.
Specifically, the electronic device performs feature fusion operation on vector information corresponding to the acquired sub-data to be processed. Wherein the feature fusion operation includes at least one of: a splicing operation and a pooling operation. The specific methods corresponding to the splicing operation and the pooling operation refer to the above embodiments, and this embodiment is not repeated.
S1106, extracting target information carried by the data to be processed according to vector information corresponding to the data to be processed.
Specifically, after the vector information is acquired, an encoder in the electronic device transmits the processed vector information to a decoder, and the decoder extracts target information according to the vector information corresponding to the data to be processed based on a T5 model. The process of extracting information based on the T5 model is referred to the above embodiment, and the description of this embodiment is omitted.
In the above flow for determining the vector corresponding to the data to be processed from its sub-data, the similarity between the sub-data of different categories in each sub-data pair is determined, and the vector information corresponding to each piece of sub-data to be processed is determined according to the similarity and the vector representation space corresponding to its category. The generated vector information is therefore more accurate, which improves the accuracy of the target information extracted from it and makes information extraction faster and more accurate.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a model training device provided in the present application. The model training device in the embodiments of the present application has the same function and effect as the encoder mentioned in the above embodiments; both belong to the same concept and are used for executing the model training method of the present application. The model training apparatus 1200 includes:
a first obtaining module 1201, configured to obtain N sample data; each sample data comprises M categories of sub-data; the sub data of M categories included in the N sample data correspond to M multiplied by N sub data pairs, each sub data pair comprises M sub data, the categories to which each sub data belongs are different, each sub data pair comprises M sub data corresponding to an association relation, the sub data of M categories included in each sample data are associated with each other, and M and N are positive integers greater than or equal to 2;
The training module 1202 is configured to input the mxn pairs of sub-data into a preset model to perform training, and generate a pre-training model corresponding to each class of sub-data; the preset model is used for calculating the similarity between M pieces of sub data included in each piece of sub data pair, and determining the vector representation space corresponding to each piece of data according to the similarity between M pieces of sub data included in each M×N piece of sub data pair.
In some embodiments, the training module 1202 includes:
an input unit for inputting the M×N sub-data pairs into a preset model;
the generating unit is used for training the preset model according to the association relation among the M sub-data included in the M×N sub-data pairs, and generating a pre-training model corresponding to each category of sub-data (one possible training objective is sketched below).
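One plausible reading of training on associated sub-data pairs is a contrastive (InfoNCE-style) objective that pulls the representations of associated sub-data together and pushes non-associated ones apart. The sketch below, in PyTorch, assumes M = 2 categories and is not the patent's exact loss:

```python
import torch
import torch.nn.functional as F

def pair_contrastive_loss(z_a, z_b, temperature=0.07):
    """z_a, z_b: (batch, dim) embeddings of two categories of sub-data;
    row i of z_a and row i of z_b come from the same sample data, so
    they form an associated sub-data pair (the positive)."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature                    # batch x batch similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives on the diagonal
    # Pull associated pairs together, push non-associated pairs apart.
    return F.cross_entropy(logits, targets)
```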
Referring to fig. 13, fig. 13 is a schematic structural diagram of an information extraction device provided in the present application. The information extraction device in the embodiment of the present application has the same function and effect as the decoder mentioned in the above embodiments; both belong to the same concept and are used for executing the information extraction method of the present application. The information extraction apparatus 1300 includes:
A second obtaining module 1301, configured to obtain data to be processed, where the data to be processed includes at least one category of sub data to be processed;
the output module 1302 is configured to input the sub-data to be processed of the at least one type into a pre-training model corresponding to each type of the sub-data to be processed, so as to obtain vector information corresponding to the data to be processed; wherein the pre-training model is a pre-training model obtained by the model training method of claim 1;
and the extracting module 1303 is configured to extract target information carried by the data to be processed according to the vector information.
In some embodiments, the output module 1302 includes:
the input unit is used for inputting the sub-data to be processed of at least one category into the pre-training model corresponding to each category of the sub-data to be processed, and obtaining the vector information corresponding to each sub-data to be processed;
the fusion unit is used for carrying out a feature fusion operation on the vector information corresponding to each sub-data to be processed to obtain the vector information corresponding to the data to be processed; wherein the feature fusion operation comprises at least one of: a splicing operation and a pooling operation.
In some embodiments, the extracting module 1303 includes:
the processing unit is used for regularizing the vector information (one possible regularization is sketched after this list);
and the extraction unit is used for extracting the target information carried by the data to be processed according to the processed vector information.
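The regularization step is not detailed in the embodiment; if it amounts to L2-normalizing the fused vector before decoding (an assumption), a short sketch suffices:

```python
import numpy as np

def regularize(vec, eps=1e-12):
    # L2-normalize the fused vector before handing it to the decoder.
    return vec / (np.linalg.norm(vec) + eps)
```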
In some embodiments, the data to be processed includes at least two categories of sub-data to be processed;
the apparatus further comprises:
a determining module, configured to determine a similarity between the at least two types of sub-data to be processed after the output module 1302 inputs the at least one type of sub-data to be processed into a pre-training model corresponding to each type of sub-data to be processed;
the output module 1302 is specifically configured to:
generating vector information corresponding to each sub-data to be processed according to the similarity and the vector representation space; wherein the vector representation space is a vector representation space obtained by the model training method of claim 1.
Referring to fig. 14, fig. 14 is a schematic structural diagram of another model training apparatus 1400 according to an embodiment of the present disclosure. The model training apparatus may be integrated in the electronic device 10. The model training apparatus 1400 may include at least: at least one processor 1401, such as a CPU, at least one network interface 1404, a user interface 1403, a memory 1405, and at least one communication bus 1402. The communication bus 1402 is used to enable connected communication among these components. The user interface 1403 may include, but is not limited to, a camera, a display, a touch screen, a keyboard, a mouse, a joystick, and the like. The network interface 1404 may optionally include a standard wired interface or a wireless interface (e.g., a WIFI interface), and a communication connection may be established with a server through the network interface 1404. The memory 1405 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. As shown in fig. 14, the memory 1405, which is a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and program instructions.
It should be noted that, the network interface 1404 may be connected to an acquirer, a transmitter, or other communication modules, which may include, but are not limited to, a WiFi module, an operator network communication module, etc., and it is understood that the model training device in the embodiment of the present application may also include an acquirer, a transmitter, and other communication modules, etc.
The processor 1401 may be configured to invoke the program instructions stored in the memory 1405 and may perform the steps of:
acquiring N sample data, each sample data comprising sub-data of M categories; the M categories of sub-data included in the N sample data correspond to M×N sub-data pairs; each sub-data pair comprises M sub-data that belong to different categories and have an association relation with one another; the M categories of sub-data included in each sample data are associated with each other; and M and N are positive integers greater than or equal to 2;
inputting the M×N sub-data pairs into a preset model for training, and generating a pre-training model corresponding to each category of sub-data; the preset model is used for calculating the similarity between the M sub-data included in each sub-data pair, and for determining the vector representation space corresponding to each category of sub-data according to the similarity between the M sub-data included in each of the M×N sub-data pairs.
Possibly, when the processor 1401 inputs the M×N sub-data pairs into a preset model for training and generates a pre-training model corresponding to each category of sub-data, it specifically performs:
inputting the M×N sub-data pairs into a preset model;
training the preset model according to the association relation among the M sub-data included in the M×N sub-data pairs, and generating a pre-training model corresponding to each category of sub-data.
Referring to fig. 15, fig. 15 is a schematic structural diagram of another information extraction apparatus 1500 provided in an embodiment of the present application. The information extraction apparatus may be integrated in the electronic device 10. The information extraction apparatus 1500 may include at least: at least one processor 1501, such as a CPU, at least one network interface 1504, a user interface 1503, a memory 1505, and at least one communication bus 1502. The communication bus 1502 is used to enable connected communication among these components. The user interface 1503 may include, but is not limited to, a camera, a display, a touch screen, a keyboard, a mouse, a joystick, and the like. The network interface 1504 may optionally include a standard wired interface or a wireless interface (e.g., a WIFI interface), and a communication connection may be established with a server through the network interface 1504. The memory 1505 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. As shown in fig. 15, the memory 1505, which is a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and program instructions.
It should be noted that, the network interface 1504 may be connected to an acquirer, a transmitter, or other communication modules, and the other communication modules may include, but are not limited to, a WiFi module, an operator network communication module, etc., and it is understood that the information extraction device in the embodiment of the present application may also include an acquirer, a transmitter, and other communication modules, etc.
The processor 1501 may be configured to invoke the program instructions stored in the memory 1505 and may perform the steps of:
acquiring data to be processed, wherein the data to be processed comprises at least one category of sub data to be processed;
inputting the sub-data to be processed of at least one category into a pre-training model corresponding to each category of the sub-data to be processed, and obtaining vector information corresponding to the data to be processed; wherein the pre-training model is a pre-training model obtained by the model training method of claim 1;
and extracting target information carried by the data to be processed according to the vector information.
Possibly, the processor 1501 inputs the sub-data to be processed of the at least one category into a pre-training model corresponding to each category of the sub-data to be processed, so as to obtain vector information corresponding to the data to be processed, and specifically performs:
Inputting the sub-data to be processed of at least one category into a pre-training model corresponding to each category of the sub-data to be processed, and obtaining vector information corresponding to each sub-data to be processed;
performing feature fusion operation on the vector information corresponding to each sub-data to be processed to obtain the vector information corresponding to the data to be processed; wherein the feature fusion operation comprises at least one of: a splicing operation and a pooling operation.
Possibly, the processor 1501 extracts target information carried by the data to be processed according to the vector information, specifically performs:
regularizing the vector information;
and extracting target information carried by the data to be processed according to the processed vector information.
Possibly, the data to be processed includes at least two categories of sub data to be processed;
after the processor 1501 inputs the sub-data to be processed of the at least one category into the pre-training model corresponding to each category of the sub-data to be processed, the processor 1501 is further configured to perform:
determining the similarity between the sub-data to be processed of the at least two categories;
the processor 1501 obtains vector information corresponding to each of the sub-data to be processed, and specifically performs:
Generating vector information corresponding to each sub-data to be processed according to the similarity and the vector representation space; wherein the vector representation space is a vector representation space obtained by the model training method of claim 1.
The embodiments of the present application also provide a computer-readable storage medium having instructions stored therein which, when run on a computer or processor, cause the computer or processor to perform one or more steps of any of the methods described above. If the above-described constituent modules of the model training apparatus and the information extraction apparatus are implemented in the form of software functional units and sold or used as independent products, they may be stored in the computer-readable storage medium.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in, or transmitted through, a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (Digital Subscriber Line, DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (Digital Video Disc, DVD)), or a semiconductor medium (e.g., a solid state disk (Solid State Disk, SSD)), etc.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the flows of the above method embodiments. The aforementioned storage medium includes: a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or the like. The technical features in the present examples and embodiments may be arbitrarily combined without conflict.
The above-described embodiments are merely preferred embodiments of the present application and are not intended to limit its scope. Various modifications and improvements made by those skilled in the art to the technical solutions of the present application without departing from its design spirit shall fall within the protection scope defined by the claims of the present application.

Claims (10)

1. A method of model training, the method comprising:
acquiring N sample data, each sample data comprising sub-data of M categories; the M categories of sub-data included in the N sample data correspond to M×N sub-data pairs; each sub-data pair comprises M sub-data that belong to different categories and have an association relation with one another; the M categories of sub-data included in each sample data are associated with each other; and M and N are positive integers greater than or equal to 2;
Inputting the M×N sub-data pairs into a preset model for training, and generating a pre-training model corresponding to each category of sub-data; the preset model is used for calculating the similarity between the M sub-data included in each sub-data pair, and for determining the vector representation space corresponding to each category of sub-data according to the similarity between the M sub-data included in each of the M×N sub-data pairs;
the pre-training model is used for determining the similarity between at least two types of sub-data to be processed according to the sub-data to be processed of the corresponding type in the data to be processed after the data to be processed is acquired, obtaining the vector information corresponding to the sub-data to be processed respectively, and obtaining the vector information corresponding to the data to be processed based on the vector information corresponding to the sub-data to be processed respectively, so that the target information carried by the data to be processed is extracted according to the vector information; the data to be processed comprises at least two categories of sub data to be processed;
the vector representation space is used for generating vector information corresponding to each sub-data to be processed by combining the similarity;
the sample data is video, the sub data is in the form of video data, audio data, picture data and text data, the video data corresponds to a video vector representation space, the video vector representation space is used for acquiring video vector information corresponding to the video data, the audio data corresponds to an audio vector representation space, the audio vector representation space is used for acquiring audio vector information corresponding to the audio data, the picture data corresponds to a picture vector representation space, the picture vector representation space is used for acquiring picture vector information corresponding to the picture data, the text data corresponds to a text vector representation space, and the text vector representation space is used for acquiring text vector information corresponding to the text data;
The target information is the title and the text of the data to be processed.
2. The method of claim 1, wherein the inputting the M×N sub-data pairs into a preset model for training and generating a pre-training model corresponding to each category of sub-data comprises:
inputting the M×N sub-data pairs into a preset model;
training the preset model according to the association relation among the M sub-data included in the M×N sub-data pairs, and generating a pre-training model corresponding to each category of sub-data.
3. An information extraction method, characterized in that the method comprises:
acquiring data to be processed, wherein the data to be processed comprises at least one category of sub data to be processed;
when the data to be processed comprises at least two types of sub data to be processed, inputting the sub data to be processed of at least one type into a pre-training model corresponding to each type of the sub data to be processed, determining similarity between the sub data to be processed of at least two types, obtaining vector information corresponding to each type of sub data to be processed, and obtaining vector information corresponding to the data to be processed based on the vector information corresponding to each type of sub data to be processed; wherein the pre-training model is a pre-training model obtained by the model training method of claim 1;
Extracting target information carried by the data to be processed according to the vector information;
the obtaining the vector information corresponding to each sub-data to be processed includes: generating vector information corresponding to each sub-data to be processed according to the similarity and the vector representation space; wherein the vector representation space is a vector representation space obtained by adopting the model training method of claim 1;
the data to be processed is video, the categories of the sub data to be processed are video data, audio data, picture data and text data, the video data corresponds to a video vector representation space, the video vector representation space is used for acquiring video vector information corresponding to the video data, the audio data corresponds to an audio vector representation space, the audio vector representation space is used for acquiring audio vector information corresponding to the audio data, the picture data corresponds to a picture vector representation space, the picture vector representation space is used for acquiring picture vector information corresponding to the picture data, the text data corresponds to a text vector representation space, and the text vector representation space is used for acquiring text vector information corresponding to the text data;
The target information is the title and the text of the data to be processed.
4. The method of claim 3, wherein the obtaining the vector information corresponding to the to-be-processed data based on the vector information corresponding to each of the to-be-processed sub-data includes:
performing feature fusion operation on the vector information corresponding to each sub-data to be processed to obtain the vector information corresponding to the data to be processed; wherein the feature fusion operation comprises at least one of: a splicing operation and a pooling operation.
5. A method according to claim 3, wherein said extracting target information carried by said data to be processed from said vector information comprises:
regularizing the vector information;
and extracting target information carried by the data to be processed according to the processed vector information.
6. A model training apparatus, the apparatus comprising:
the first acquisition module is used for acquiring N sample data, each sample data comprising sub-data of M categories; the M categories of sub-data included in the N sample data correspond to M×N sub-data pairs; each sub-data pair comprises M sub-data that belong to different categories and have an association relation with one another; the M categories of sub-data included in each sample data are associated with each other; and M and N are positive integers greater than or equal to 2;
The training module is used for inputting the M×N sub-data pairs into a preset model for training and generating a pre-training model corresponding to each category of sub-data; the preset model is used for calculating the similarity between the M sub-data included in each sub-data pair, and for determining the vector representation space corresponding to each category of sub-data according to the similarity between the M sub-data included in each of the M×N sub-data pairs;
the pre-training model is used for determining the similarity between at least two types of sub-data to be processed according to the sub-data to be processed of the corresponding type in the data to be processed after the data to be processed is acquired, obtaining the vector information corresponding to the sub-data to be processed respectively, and obtaining the vector information corresponding to the data to be processed based on the vector information corresponding to the sub-data to be processed respectively, so that the target information carried by the data to be processed is extracted according to the vector information; the data to be processed comprises at least two categories of sub data to be processed;
the vector representation space is used for generating vector information corresponding to each sub-data to be processed by combining the similarity;
the sample data is video, the sub data is in the form of video data, audio data, picture data and text data, the video data corresponds to a video vector representation space, the video vector representation space is used for acquiring video vector information corresponding to the video data, the audio data corresponds to an audio vector representation space, the audio vector representation space is used for acquiring audio vector information corresponding to the audio data, the picture data corresponds to a picture vector representation space, the picture vector representation space is used for acquiring picture vector information corresponding to the picture data, the text data corresponds to a text vector representation space, and the text vector representation space is used for acquiring text vector information corresponding to the text data;
The target information is the title and the text of the data to be processed.
7. An information extraction apparatus, characterized in that the apparatus comprises:
the second acquisition module is used for acquiring data to be processed, wherein the data to be processed comprises at least one category of sub data to be processed;
the output module is used for inputting the sub-data to be processed of the at least one type into a pre-training model corresponding to each type of the sub-data to be processed, obtaining vector information corresponding to each sub-data to be processed, and obtaining vector information corresponding to the data to be processed based on the vector information corresponding to each sub-data to be processed; wherein the pre-training model is a pre-training model obtained by the model training method of claim 1;
the extraction module is used for extracting target information carried by the data to be processed according to the vector information;
the determining module is used for determining the similarity between the at least two categories of sub-data to be processed when the data to be processed comprises the at least two categories of sub-data to be processed;
the output module is further used for generating vector information corresponding to each sub-data to be processed according to the similarity and the vector representation space; wherein the vector representation space is a vector representation space obtained by adopting the model training method of claim 1;
The data to be processed is video, the categories of the sub data to be processed include, but are not limited to, video data, audio data, picture data and text data, the video data corresponds to a video vector representation space, the video vector representation space is used for acquiring video vector information corresponding to the video data, the audio data corresponds to an audio vector representation space, the audio vector representation space is used for acquiring audio vector information corresponding to the audio data, the picture data corresponds to a picture vector representation space, the picture vector representation space is used for acquiring picture vector information corresponding to the picture data, the text data corresponds to a text vector representation space, and the text vector representation space is used for acquiring text vector information corresponding to the text data;
the target information is the title and the text of the data to be processed.
8. A model training device, comprising a processor, a memory, and a communication interface:
the processor is connected with the memory and the communication interface;
the memory is used for storing executable program codes;
the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory for executing the model training method according to any one of claims 1-2.
9. An information extraction device, comprising a processor, a memory, and a communication interface:
the processor is connected with the memory and the communication interface;
the memory is used for storing executable program codes;
the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory for performing the information extraction method according to any one of claims 3 to 5.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the model training method according to any one of claims 1-2 or the information extraction method according to any one of claims 3-5.