CN114567811A - Multi-modal model training method and system for sound sequencing and related equipment - Google Patents
- Publication number
- CN114567811A (application number CN202210192960.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- sound
- modal
- model
- sequencing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/63—Querying
- G06F16/638—Presentation of query results
- G06F16/639—Presentation of query results using playlists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/686—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/4508—Management of client data or end-user data
- H04N21/4532—Management of client data or end-user data involving end-user characteristics, e.g. viewer profile, preferences
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4662—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention is applicable to the field of artificial intelligence technology and provides a multi-modal model training method, system and related equipment for sound sequencing. The method comprises the following steps: acquiring sound platform data for sound sequencing, wherein the sound platform data comprises play queue data, and the play queue data comprises the sound click time information and sound duration information of each sound; extracting data sections from the play queue data; expanding the data sections to obtain frequent subgraphs, and constructing a data sample containing the platform data according to the frequent subgraphs; inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample; and constructing a multi-modal sound sequencing model, training it, and outputting the trained multi-modal sound sequencing model. Through the multi-modal sequencing model, the method captures changes in user interest in real time and thereby improves the accuracy of pushed content.
Description
Technical Field
The invention belongs to the field of artificial intelligence technology application, and particularly relates to a multi-modal model training method and system for sound sequencing and related equipment.
Background
At present, intelligent terminals that access information over the mobile internet are widespread, and audio and video applications have become part of people's work and life: with an intelligent terminal, a user can freely shoot and edit audio and video content and share it with others through these applications. Owing to their entertainment and public attributes, audio and video applications generate strong social momentum and traffic and attract users with high-quality content, which has opened a new research direction for self-media creators.
In current audio and video applications, content can be pushed to the user according to certain rules and levels of attention. To push different audio and video content according to its priority and thereby optimize the user experience, a large amount of data is needed as the basis for analysis. From the perspective of recommendation algorithms, an LR (logistic regression) model can rank content automatically according to its features, but LR is a linear model with weak learning ability that cannot learn high-dimensional features, so applying it requires a large amount of manual feature engineering and relies heavily on business knowledge. A method combining GBDT (gradient boosting decision trees) with an LR model has also been applied, in which the leaf-node outputs of each tree in the GBDT model are combined and fed into the LR model as features; this effectively reduces the use of hand-crafted features, but it is not suitable for high-dimensional sparse features, most features in recommendation ranking scenarios are discrete, and the approach cannot learn changes in user interest when applied to audio and video pushing. To address these problems, some audio and video applications use a Wide & Deep model, whose core idea is to combine the memorization ability of a generalized linear model with the generalization ability of a deep feedforward neural network to enhance prediction; however, the input of the Wide part still depends on manual feature engineering, multi-modal features are not considered, and changes in user interest still cannot be learned.
Disclosure of Invention
The embodiment of the invention provides a multi-modal model training method, a multi-modal model training system and related equipment for sound sequencing, and aims to solve the problems that the existing neural network model cannot consider multi-modal characteristics and cannot learn the interest change of a user when pushing sound.
In a first aspect, an embodiment of the present invention provides a multi-modal model training method for sound ranking, where the method includes the following steps:
acquiring sound platform data for sound sequencing, wherein the sound platform data comprises sound ID information, sound tags, sound keywords, anchor ID information and playing sequence data, and the playing sequence data comprises sound click time information and sound duration information of sound;
extracting data sections according to the play queue data;
expanding the data sections to obtain frequent subgraphs, and constructing a data sample containing the platform data according to the frequent subgraphs;
inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample;
and constructing a multi-modal sound sequencing model, training the multi-modal sound sequencing model according to the platform data, the data sample and the mapping data, and outputting the trained multi-modal sound sequencing model.
Further, the step of extracting the data section according to the play queue data specifically includes:
A queue segment in which the interval between the sound click time information of any two consecutive sounds in the play queue data is not more than 20 minutes, and the sound duration information of every sound is not less than 20 seconds, is taken as the data section.
Further, the step of expanding the data section to obtain a frequent subgraph and constructing a data sample containing the platform data according to the frequent subgraph specifically comprises the following steps:
Frequent subgraph mining is performed on the data sections to obtain a plurality of frequent subgraphs corresponding to the data sections, and the frequent subgraphs are combined with the sound platform data to construct the data sample containing the sound ID information, sound label, sound keyword and anchor ID information of all the sounds in the frequent subgraphs.
Furthermore, in the step of inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample, the mapping data of the sound keyword is positioned before that of the sound ID information.
Furthermore, the multi-modal sound sequencing model comprises a first input end, a second input end and a Dense backbone network, wherein the first input end is used for processing the data sample sequentially through pre-training, max pooling, a concatenate function and an activation function to obtain a first text fusion feature, the second input end is used for processing the data sample sequentially through onehot coding, multi-hot coding, mapping sharing, Hadamard product, a concatenate function and an activation function to obtain a second text fusion feature, and the Dense backbone network is used for performing ranking processing with the first text fusion feature and the second text fusion feature as input and outputting a ranking result of the sounds in the sound platform data.
Further, in the Dense backbone network of the multi-modal sound sequencing model, the first text fusion feature and the second text fusion feature are fused using a concatenate function before being input to the backbone.
Further, the multi-modal sound sequencing model uses logistic loss as its loss function during training.
In a second aspect, an embodiment of the present invention further provides a multi-modal model training system for sound ranking, including:
the data acquisition module is used for acquiring sound platform data for sound sequencing, wherein the sound platform data comprises sound ID information, sound labels, sound keywords, anchor ID information and playing sequence data, and the playing sequence data comprises sound click time information and sound duration information;
the data section extraction module is used for extracting data sections according to the play queue data;
the data section expansion module is used for expanding the data sections to obtain frequent subgraphs and constructing a data sample containing the platform data according to the frequent subgraphs;
the data mapping module is used for inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample;
and the sequencing model training module is used for constructing a multi-modal sound sequencing model, training the multi-modal sound sequencing model according to the platform data, the data sample and the mapping data, and outputting the trained multi-modal sound sequencing model.
Furthermore, the multi-modal sound sequencing model comprises a first input end, a second input end and a Dense backbone network, wherein the first input end is used for processing the data sample sequentially through pre-training, max pooling, a concatenate function and an activation function to obtain a first text fusion feature, the second input end is used for processing the data sample sequentially through onehot coding, multi-hot coding, mapping sharing, Hadamard product, a concatenate function and an activation function to obtain a second text fusion feature, and the Dense backbone network is used for performing ranking processing with the first text fusion feature and the second text fusion feature as input and outputting a ranking result of the sounds in the sound platform data.
Further, in the Dense backbone network of the multi-modal sound sequencing model, the first text fusion feature and the second text fusion feature are fused using a concatenate function before being input to the backbone.
In a third aspect, an embodiment of the present invention further provides a computer device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the multimodal model training method as described in any of the above embodiments when executing the computer program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps in the multimodal model training method as described in any one of the above embodiments.
The method has the advantage that, because it fuses multi-modal data such as graph mining, text and numerical features, pre-training and deep learning models, the final model can capture changes in user interest in real time, which improves the accuracy of pushed content and increases the click-through rate and playing duration of sound content.
Drawings
FIG. 1 is a flowchart of the steps of a multi-modal model training method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of data sections provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a first input end provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a second input end provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of the overall structure of a multi-modal sound sequencing model provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a multi-modal model training system provided by an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a multi-modal model training method according to an embodiment of the present invention, including the following steps:
s101, sound platform data used for sound sequencing are obtained, the sound platform data comprise sound ID information, sound labels, sound keywords, anchor ID information and playing sequence data, and the playing sequence data comprise sound click time information and sound duration information.
Specifically, the sound platform data comes from an audio/video application platform. The embodiment of the present invention takes sound recommendation on such a platform as an example: a user can click and play sounds recommended by the platform, and the sounds are uploaded by anchors. Each sound carries sound ID information, a sound tag, a sound keyword and anchor ID information, where the sound ID information identifies the sound on the platform, the sound tag and sound keyword are the description information provided by the anchor at upload time, and the anchor ID information identifies the anchor corresponding to the sound. A plurality of sounds are displayed to the user as a list on the platform, and the ranking rule of the list places sounds of high interest to the user ahead. The order in which the user clicks sounds and the corresponding watching duration are recorded in the play queue data of that user, so the play queue data comprises the sound click time information and sound duration information of each sound.
In the embodiment of the present invention, the sound platform data may be obtained by querying the platform backend or by data capture; the embodiment of the present invention does not limit the method of acquiring the data.
And S102, extracting data sections according to the play queue data.
In this step, a queue segment in which the interval between the sound click time information of any two consecutive sounds in the play queue data is not greater than 20 minutes, and the sound duration information of every sound is not less than 20 seconds, is taken as one data section. Specifically, referring to fig. 2, fig. 2 is a schematic diagram of data sections provided in an embodiment of the present invention, taking the play queue data of users U1 and U2 as an example. In chronological order, the play queue data of user U1 contains the sounds I1, I3, I8, I2, I6 and I5, where the play interval between I8 and I2 exceeds 20 minutes, and the play queue data of user U2 contains the sounds I1, I3, I8, I2, I7 and I6; for convenience of description, the durations of all sounds are greater than 20 seconds. For the play queue data of user U1, the extracted data sections are split between I8 and I2, because their play interval exceeds 20 minutes; in a real scenario this means the user did not click continuously within the same push queue, so there is no push relevance between the two parts. The data sections corresponding to user U1 are therefore I1, I3, I8 and I2, I6, I5. No interval greater than 20 minutes appears in the play queue data of user U2, so the data section corresponding to user U2 is I1, I3, I8, I2, I7, I6.
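For illustration only, the following Python sketch shows one possible implementation of the extraction rule in step S102. The record layout, the function name and the handling of too-short sounds are assumptions made for the example and are not prescribed by the embodiment.

```python
from datetime import timedelta

MAX_GAP = timedelta(minutes=20)   # maximum click interval inside one data section
MIN_DURATION = 20                 # minimum sound duration, in seconds

def extract_data_sections(play_queue):
    """play_queue: list of (sound_id, click_time: datetime, duration_seconds),
    already sorted chronologically."""
    sections, current, last_click = [], [], None
    for sound_id, click_time, duration in play_queue:
        if duration < MIN_DURATION:
            continue              # too short to signal real interest
        if last_click is not None and click_time - last_click > MAX_GAP:
            if current:
                sections.append(current)
            current = []
        current.append(sound_id)
        last_click = click_time
    if current:
        sections.append(current)
    return sections

# For user U1 in FIG. 2, a gap of more than 20 minutes between I8 and I2
# yields the sections [I1, I3, I8] and [I2, I6, I5].
```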
S103, expanding the data sections to obtain frequent subgraphs, and constructing a data sample containing the platform data according to the frequent subgraphs.
Specifically, in this step, frequent subgraph mining is performed on the data section to obtain a plurality of frequent subgraphs corresponding to the data section, and the frequent subgraphs are combined with the sound platform data to construct the data sample including the sound ID information, the sound tag, the sound keyword, and the anchor ID information of all the sounds in the frequent subgraph.
Taking the data section I1, I3, I8 obtained for user U1 in the above embodiment as an example, frequent subgraph mining with a minimum support threshold of 2 yields the frequent subgraph I1, I3, I8 (this pattern also appears in the data section of user U2, so its support is 2). The order of this frequent subgraph is then randomly shuffled to generate 2 new frequent subgraphs, such as I1, I8, I3 and I8, I3, I1. After the frequent subgraphs are obtained, the data samples are constructed from them. Taking the frequent subgraph I1, I3, I8, which contains 3 sounds, as an example, a data sample is constructed for each sound as follows:
I1 keyword, I1 ID information, I1 anchor ID information, I1 sound tag;
I3 keyword, I3 ID information, I3 anchor ID information, I3 sound tag;
I8 keyword, I8 ID information, I8 anchor ID information, I8 sound tag.
It should be noted that when the anchor uploads a sound to the audio/video application platform, the sound description may contain several sound tags in order to better describe the content and characteristics of the sound; in that case the data sample may also contain several sound tags, for example the I1 sound tag may include a first-level I1 tag and a second-level I1 tag, and the number of sound tags may be set according to actual needs.
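The sketch below illustrates, under simplifying assumptions, how step S103 can be realised: frequent subgraph mining is approximated here by counting contiguous subsequences across all data sections against a minimum support threshold, the mined patterns are shuffled to generate additional variants, and one sample row per sound is assembled from assumed platform metadata fields (keyword, sound ID, anchor ID, sound tag). A dedicated frequent subgraph mining algorithm could be substituted without changing the surrounding flow.

```python
import random
from collections import Counter

def mine_frequent_patterns(sections, length=3, min_support=2):
    """Simplified stand-in for frequent subgraph mining: count contiguous
    subsequences of the given length across all data sections and keep those
    whose support reaches the threshold."""
    counts = Counter()
    for section in sections:
        for i in range(len(section) - length + 1):
            counts[tuple(section[i:i + length])] += 1
    return [list(p) for p, c in counts.items() if c >= min_support]

def expand_and_build_samples(frequent_patterns, platform_data, n_shuffles=2, seed=0):
    """Shuffle each mined pattern to create extra variants, then build one sample
    row per sound from the platform metadata (keyword, ID, anchor ID, tag)."""
    rng = random.Random(seed)
    expanded = []
    for pattern in frequent_patterns:
        expanded.append(pattern)
        for _ in range(n_shuffles):
            variant = pattern[:]
            rng.shuffle(variant)
            expanded.append(variant)
    samples = []
    for pattern in expanded:
        samples.append([
            (platform_data[s]["keyword"], s,
             platform_data[s]["anchor_id"], platform_data[s]["tag"])
            for s in pattern
        ])
    return samples
```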
S104, inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample.
Specifically, the word2vec model is a commonly-used neural network model for converting non-computable unstructured text data into computable structured mathematical vector data, and in the embodiment of the present invention, the word2vec model is configured to obtain the mapping data (embedding) corresponding to the data sample, so as to use the data sample as input data of a subsequent ranking model.
Preferably, in the mapping data the sound keyword is placed before the sound ID information. This ordering is chosen because, in a single playing behaviour, the user first sees the sound keyword contained in the title of the sound, becomes interested in it, and only then clicks the sound to play it. Reflecting this behaviour in the order in which the word2vec model derives the mapping data makes the mapping data better match the actual situation, and thus improves the mapping quality of the resulting mathematical vector data.
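As an illustration of step S104, the sketch below builds token sequences from the data samples of the previous sketch, placing each sound's keyword before its sound ID as described above, and trains a word2vec model with the gensim library; the field prefixes, vector size and other hyperparameters are assumptions made for the example.

```python
from gensim.models import Word2Vec

def build_sentences(samples):
    """Turn each data sample into a token sequence; the keyword token precedes
    the sound ID token, mirroring the click behaviour described above."""
    sentences = []
    for sample in samples:
        tokens = []
        for keyword, sound_id, anchor_id, tag in sample:
            tokens += [f"kw_{keyword}", f"snd_{sound_id}",
                       f"anc_{anchor_id}", f"tag_{tag}"]
        sentences.append(tokens)
    return sentences

sentences = build_sentences(samples)   # `samples` from the previous sketch
w2v = Word2Vec(sentences, vector_size=64, window=5, min_count=1, sg=1, epochs=10)
mapping_i1 = w2v.wv["snd_I1"]          # mapping data (embedding) for sound I1
```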
S105, constructing a multi-modal sound sequencing model, training the multi-modal sound sequencing model according to the platform data, the data sample and the mapping data, and outputting the trained multi-modal sound sequencing model.
For example, please refer to fig. 3, fig. 4 and fig. 5, where fig. 3 is a schematic structural diagram of a first input end provided by an embodiment of the present invention, fig. 4 is a schematic structural diagram of a second input end provided by an embodiment of the present invention, and fig. 5 is a schematic diagram of the overall structure of a multi-modal sound sequencing model provided by an embodiment of the present invention. The first input end is configured to process the data sample through pre-training, max pooling, a concatenate function and an activation function to obtain the first text fusion feature. More preferably, as shown in fig. 3, besides processing the sound ID information, sound tag, sound keyword and anchor ID information in the data sample through pre-training, max pooling and so on before the concatenate function, the first input end also processes the user basic information, user behaviour information and statistical information related to the data sample through onehot encoding, weighted multi-hot encoding, bucketing and data mapping. These features are included because, when a list of sounds is pushed to the user, they reflect the user's playing behaviour on the audio/video application platform: the user behaviour information includes the user's clicking habits, and the statistical information includes, for example, the number of times the user has clicked sounds in the play list within a period of time. The concatenate function is a commonly used character or data merging function: after the input data has been encoded as above, the concatenate function merges it into data that carries the characteristics of all the inputs. To further refine the first text fusion feature obtained at the first input end, the first input end contains two groups of leaky_relu activation functions with different size parameters.
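A minimal Keras sketch of the first input end is given below, assuming TensorFlow is used; the layer widths, input names and the exact placement of the two leaky_relu blocks are illustrative and only mirror the structure of FIG. 3 as described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_first_input(vocab_size, embed_dim, pretrained_matrix, seq_len, dense_feat_dim):
    """Illustrative first input end: pre-trained embeddings -> max pooling ->
    concatenate with encoded user/statistical features -> two leaky_relu blocks."""
    text_in = layers.Input(shape=(seq_len,), name="text_ids")     # sound ID / tag / keyword / anchor ID tokens
    user_in = layers.Input(shape=(dense_feat_dim,), name="user_feats")  # onehot / weighted multi-hot / bucketed features
    emb = layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained_matrix),
        trainable=False)(text_in)                                 # pre-trained word2vec mapping data
    pooled = layers.GlobalMaxPooling1D()(emb)                     # max pooling over the token axis
    fused = layers.Concatenate()([pooled, user_in])               # concatenate function
    x = layers.LeakyReLU()(layers.Dense(256)(fused))              # first leaky_relu block
    first_text_fusion = layers.LeakyReLU()(layers.Dense(128)(x))  # second, smaller block
    return [text_in, user_in], first_text_fusion
```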
The second input end is configured to process the data sample successively through onehot coding, multi-hot coding, mapping sharing, Hadamard product, a concatenate function and an activation function to obtain the second text fusion feature. More preferably, as shown in fig. 4, besides applying onehot or multi-hot coding to the sound ID information, sound tag, sound keyword and anchor ID information in the data sample before data mapping, the second input end also applies multi-hot coding or weighted multi-hot coding to the play-sequence sound IDs, play-sequence anchor IDs, keyword preferences and tag preferences related to the data sample, and maps the data. A Hadamard product is then computed between the data that share the data mapping, yielding an element-wise product of the two inputs and thereby unifying the corresponding inputs into data of the same size, after which the result is processed through a concatenate function and an activation function to obtain the second text fusion feature. Because the input data at the second input end is not as complex as that at the first input end, the second input end includes only one group of leaky_relu activation functions.
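Continuing under the same assumptions, the sketch below mirrors the structure of FIG. 4: the candidate sound's fields and the user's play-sequence and preference fields share one embedding table, a Hadamard (element-wise) product aligns them to the same size, and a single leaky_relu block produces the second text fusion feature. Pooling the history with average pooling is an assumption made for the example.

```python
from tensorflow.keras import layers

def build_second_input(vocab_size, embed_dim, seq_len):
    """Illustrative second input end with a shared embedding table and a
    Hadamard product between the candidate and the user's history."""
    item_in = layers.Input(shape=(1,), name="item_id")            # onehot-coded candidate sound/anchor ID
    hist_in = layers.Input(shape=(seq_len,), name="history_ids")  # multi-hot play-sequence IDs and preferences
    shared_emb = layers.Embedding(vocab_size, embed_dim, name="shared_embedding")  # mapping sharing
    item_vec = layers.Flatten()(shared_emb(item_in))
    hist_vec = layers.GlobalAveragePooling1D()(shared_emb(hist_in))  # pool the history to one vector
    hadamard = layers.Multiply()([item_vec, hist_vec])               # Hadamard product
    fused = layers.Concatenate()([item_vec, hist_vec, hadamard])     # concatenate function
    second_text_fusion = layers.LeakyReLU()(layers.Dense(128)(fused))  # single leaky_relu block
    return [item_in, hist_in], second_text_fusion
```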
The multi-modal sound sequencing model takes the first text fusion feature and the second text fusion feature output by the first input end and the second input end respectively as input, processes them through a concatenate function and an activation function, and then feeds the result into the Dense backbone network. Specifically, the Dense backbone network is a dense convolutional network (DenseNet); compared with other convolutional neural networks, using DenseNet as the backbone network effectively extracts the correlations among the input parameters through its multi-layer connections and avoids the overfitting caused by relying on a single kind of data. The network outputs a new sequencing result, so that changes in the user's interest are captured in real time while the user clicks sounds, achieving real-time interest-based pushing. Preferably, logistic loss is used as the loss function when training the multi-modal sound sequencing model.
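Finally, the two towers from the preceding sketches can be assembled as follows. The backbone here is a simplified DenseNet-style stack of dense layers with concatenated skip connections rather than a full DenseNet, and the vocabulary size, layer widths and the `pretrained` embedding matrix are assumptions; training uses binary cross-entropy, i.e. the logistic loss mentioned above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Continues the two tower sketches above; `pretrained` is an assumed
# (vocab_size x embed_dim) word2vec embedding matrix from step S104.
first_inputs, first_feat = build_first_input(
    vocab_size=50000, embed_dim=64, pretrained_matrix=pretrained,
    seq_len=20, dense_feat_dim=32)
second_inputs, second_feat = build_second_input(vocab_size=50000, embed_dim=64, seq_len=20)

x = layers.Concatenate()([first_feat, second_feat])      # fuse the two text fusion features
x = layers.LeakyReLU()(layers.Dense(256)(x))
for units in (128, 64):                                   # DenseNet-style backbone, simplified:
    block = layers.LeakyReLU()(layers.Dense(units)(x))    # each block also sees all earlier outputs
    x = layers.Concatenate()([x, block])
click_prob = layers.Dense(1, activation="sigmoid")(x)     # probability that the sound is clicked

model = tf.keras.Model(inputs=first_inputs + second_inputs, outputs=click_prob)
model.compile(optimizer="adam",
              loss="binary_crossentropy",                 # logistic loss
              metrics=[tf.keras.metrics.AUC()])
# Candidate sounds are then ranked by their predicted click probability.
```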
The method has the advantage that, because it fuses multi-modal data such as graph mining, text and numerical features, pre-training and deep learning models, the final model can capture changes in user interest in real time, which improves the accuracy of pushed content and increases the click-through rate and playing duration of sound content.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a multi-modal model training system 200 according to an embodiment of the present invention, which includes a data obtaining module 201, a data section extracting module 202, a data section expanding module 203, a data mapping module 204, and a ranking model training module 205, where:
the data acquisition module 201 is configured to acquire sound platform data for sound sequencing, where the sound platform data includes sound ID information, sound tags, sound keywords, anchor ID information, and play sequence data, and the play sequence data includes sound click time information and sound duration information of sound;
a data section extraction module 202, configured to extract data sections according to the play queue data;
the data section expansion module 203 is used for expanding the data sections to obtain frequent subgraphs and constructing a data sample containing the platform data according to the frequent subgraphs;
a data mapping module 204, configured to input the data sample into a word2vec model, to obtain mapping data corresponding to the data sample;
the ranking model training module 205 is configured to construct a multi-modal sound ranking model, train the multi-modal sound ranking model according to the platform data, the data sample, and the mapping data, and output the trained multi-modal sound ranking model.
The multi-modal model training system 200 can implement the steps in the multi-modal model training method in the above embodiments, and can implement the same technical effects, which is referred to the description in the above embodiments and is not repeated herein.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device provided in an embodiment of the present invention, where the computer device 300 includes: a memory 302, a processor 301, and a computer program stored on the memory 302 and executable on the processor 301.
The processor 301 calls the computer program stored in the memory 302 to execute the steps in the multi-modal model training method provided by the embodiment of the present invention, and with reference to fig. 1, the method specifically includes:
s101, sound platform data used for sound sequencing are obtained, the sound platform data comprise sound ID information, sound labels, sound keywords, anchor ID information and playing sequence data, and the playing sequence data comprise sound click time information and sound duration information.
And S102, extracting data sections according to the play queue data.
Further, the step of extracting the data section according to the play queue data specifically includes:
A queue segment in which the interval between the sound click time information of any two consecutive sounds in the play queue data is not more than 20 minutes, and the sound duration information of every sound is not less than 20 seconds, is taken as the data section.
S103, expanding the data sections to obtain frequent subgraphs, and constructing a data sample containing the platform data according to the frequent subgraphs.
Further, the step of expanding the data section to obtain frequent subgraphs and constructing a data sample containing the platform data according to the frequent subgraphs specifically comprises the following steps:
Frequent subgraph mining is performed on the data sections to obtain a plurality of frequent subgraphs corresponding to the data sections, and the frequent subgraphs are combined with the sound platform data to construct the data sample containing the sound ID information, sound label, sound keyword and anchor ID information of all the sounds in the frequent subgraphs.
Furthermore, in the step of inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample, the mapping data of the sound keyword is positioned before that of the sound ID information.
S104, inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample.
S105, constructing a multi-modal sound sequencing model, training the multi-modal sound sequencing model according to the platform data, the data sample and the mapping data, and outputting the trained multi-modal sound sequencing model.
Furthermore, the multi-modal sound sequencing model comprises a first input end, a second input end and a Dense backbone network, wherein the first input end is used for processing the data sample sequentially through pre-training, max pooling, a concatenate function and an activation function to obtain a first text fusion feature, the second input end is used for processing the data sample sequentially through onehot coding, multi-hot coding, mapping sharing, Hadamard product, a concatenate function and an activation function to obtain a second text fusion feature, and the Dense backbone network is used for performing ranking processing with the first text fusion feature and the second text fusion feature as input and outputting a ranking result of the sounds in the sound platform data.
Still further, the multi-modal sound sequencing model uses logistic loss as its loss function during training.
The computer device 300 provided in the embodiment of the present invention can implement the steps in the multimodal model training method in the foregoing embodiment, and can implement the same technical effects, and reference is made to the description in the foregoing embodiment, which is not repeated herein.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process and step in the multi-modal model training method provided in the embodiment of the present invention, and can implement the same technical effect, and in order to avoid repetition, the detailed description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, which are illustrative, but not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A method of multi-modal model training for sound sequencing, the method comprising the steps of:
acquiring sound platform data for sound sequencing, wherein the sound platform data comprises sound ID information, sound tags, sound keywords, anchor ID information and playing sequence data, and the playing sequence data comprises sound click time information and sound duration information of sound;
extracting data sections according to the play queue data;
expanding the data sections to obtain frequent subgraphs, and constructing a data sample containing the platform data according to the frequent subgraphs;
inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample;
and constructing a multi-modal sound sequencing model, training the multi-modal sound sequencing model according to the platform data, the data sample and the mapping data, and outputting the trained multi-modal sound sequencing model.
2. The multi-modal model training method of claim 1, wherein the step of extracting data sections according to the play queue data comprises:
A queue segment in which the interval between the sound click time information of any two consecutive sounds in the play queue data is not more than 20 minutes, and the sound duration information of every sound is not less than 20 seconds, is taken as the data section.
3. The multi-modal model training method of claim 1, wherein the step of expanding the data section to obtain frequent subgraphs and constructing data samples containing the platform data according to the frequent subgraphs comprises:
Frequent subgraph mining is performed on the data sections to obtain a plurality of frequent subgraphs corresponding to the data sections, and the frequent subgraphs are combined with the sound platform data to construct the data sample containing the sound ID information, sound label, sound keyword and anchor ID information of all the sounds in the frequent subgraphs.
4. The multi-modal model training method as claimed in claim 3, wherein, in the step of inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample, the mapping data of the sound keyword is positioned before that of the sound ID information.
5. The multi-modal model training method as claimed in claim 1, wherein the multi-modal sound sequencing model comprises a first input end, a second input end and a Dense backbone network, wherein the first input end is used for processing the data sample sequentially through pre-training, max pooling, a concatenate function and an activation function to obtain a first text fusion feature, the second input end is used for processing the data sample sequentially through onehot coding, multi-hot coding, mapping sharing, Hadamard product, a concatenate function and an activation function to obtain a second text fusion feature, and the Dense backbone network is used for performing ranking processing with the first text fusion feature and the second text fusion feature as inputs and outputting a ranking result of the sounds in the sound platform data.
6. The multi-modal model training method as claimed in claim 5, wherein in the Dense backbone network of the multi-modal sound sequencing model, the first text fusion feature and the second text fusion feature are fused using a concatenate function before being input to the backbone.
7. The multi-modal model training method as claimed in claim 1, wherein the multi-modal sound sequencing model is trained using logistic loss as a loss function.
8. A multi-modal model training system for sound sequencing, comprising:
the data acquisition module is used for acquiring sound platform data for sound sequencing, wherein the sound platform data comprises sound ID information, sound labels, sound keywords, anchor ID information and playing sequence data, and the playing sequence data comprises sound click time information and sound duration information;
the data section extraction module is used for extracting data sections according to the play queue data;
the data section expansion module is used for expanding the data sections to obtain frequent subgraphs and constructing a data sample containing the platform data according to the frequent subgraphs;
the data mapping module is used for inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample;
and the sequencing model training module is used for constructing a multi-modal sound sequencing model, training the multi-modal sound sequencing model according to the platform data, the data sample and the mapping data, and outputting the trained multi-modal sound sequencing model.
9. A computer device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the multi-modal model training method as claimed in any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the multi-modal model training method as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210192960.7A CN114567811B (en) | 2022-02-28 | 2022-02-28 | Multi-modal model training method, system and related equipment for voice sequencing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210192960.7A CN114567811B (en) | 2022-02-28 | 2022-02-28 | Multi-modal model training method, system and related equipment for voice sequencing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114567811A true CN114567811A (en) | 2022-05-31 |
CN114567811B CN114567811B (en) | 2024-02-09 |
Family
ID=81714946
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210192960.7A Active CN114567811B (en) | 2022-02-28 | 2022-02-28 | Multi-modal model training method, system and related equipment for voice sequencing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114567811B (en) |
Citations (5)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160093292A1 (en) * | 2014-09-26 | 2016-03-31 | Intel Corporation | Optimizations to decoding of wfst models for automatic speech recognition |
WO2020054822A1 (en) * | 2018-09-13 | 2020-03-19 | LiLz株式会社 | Sound analysis device, processing method thereof, and program |
CN112287160A (en) * | 2020-10-28 | 2021-01-29 | 广州欢聊网络科技有限公司 | Audio data sorting method and device, computer equipment and storage medium |
CN113688167A (en) * | 2021-01-15 | 2021-11-23 | 稿定(厦门)科技有限公司 | Deep interest capture model construction method and device based on deep interest network |
CN113486833A (en) * | 2021-07-15 | 2021-10-08 | 北京达佳互联信息技术有限公司 | Multi-modal feature extraction model training method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN114567811B (en) | 2024-02-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |