CN114567811A - Multi-modal model training method and system for sound sequencing and related equipment - Google Patents
- Publication number
- CN114567811A (application number CN202210192960.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- sound
- modal
- model
- sequencing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/63—Querying
- G06F16/638—Presentation of query results
- G06F16/639—Presentation of query results using playlists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/686—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/4508—Management of client data or end-user data
- H04N21/4532—Management of client data or end-user data involving end-user characteristics, e.g. viewer profile, preferences
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4662—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention is applicable to the field of artificial intelligence technology and provides a multi-modal model training method, system and related equipment for sound sequencing. The method comprises the following steps: acquiring sound platform data for sound sequencing, wherein the sound platform data comprises play queue data, and the play queue data comprises the sound click time information and sound duration information of each sound; extracting data sections from the play queue data; expanding the data sections to obtain frequent subgraphs, and constructing a data sample containing the platform data according to the frequent subgraphs; inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample; and constructing a multi-modal sound sequencing model, training it, and outputting the trained multi-modal sound sequencing model. Through the multi-modal sequencing model, the method captures changes in user interest in real time and thereby improves the accuracy of pushed content.
Description
Technical Field
The invention belongs to the field of artificial intelligence technology application, and particularly relates to a multi-modal model training method and system for sound sequencing and related equipment.
Background
At present, intelligent terminals that access information over the mobile internet are widespread, and audio and video applications have become part of people's work and life: with an intelligent terminal, a user can freely shoot and edit audio and video content and share it with others through these applications. Owing to their entertainment and public attributes, audio and video applications generate strong social momentum and traffic and attract users with high-quality content, which has opened a new research direction for self-media creators.
In current audio and video applications, content can be pushed to the user according to certain rules and levels of attention. To push different audio and video content according to its priority and thereby optimize the user experience, a large amount of data is needed as the basis for analysis. From the perspective of recommendation algorithms, an LR (logistic regression) model can rank content automatically according to its features, but LR is a linear model with weak learning ability that cannot learn high-dimensional features, so applying it requires a large amount of manual feature engineering and relies heavily on business knowledge. A method combining GBDT (gradient boosting decision trees) with an LR model has also been applied, in which the leaf-node outputs of each tree in the GBDT model are combined and fed into the LR model as features; this effectively reduces the use of hand-crafted features, but it is not suitable for high-dimensional sparse features, most features in recommendation ranking scenarios are discrete, and the approach cannot learn changes in user interest when applied to audio and video pushing. To address these problems, some audio and video applications use a Wide & Deep model, whose core idea is to combine the memorization ability of a generalized linear model with the generalization ability of a deep feedforward neural network to enhance prediction; however, the input of the Wide part still depends on manual feature engineering, multi-modal features are not considered, and changes in user interest still cannot be learned.
Disclosure of Invention
The embodiment of the invention provides a multi-modal model training method, a multi-modal model training system and related equipment for sound sequencing, and aims to solve the problems that the existing neural network model cannot consider multi-modal characteristics and cannot learn the interest change of a user when pushing sound.
In a first aspect, an embodiment of the present invention provides a multi-modal model training method for sound ranking, where the method includes the following steps:
acquiring sound platform data for sound sequencing, wherein the sound platform data comprises sound ID information, sound tags, sound keywords, anchor ID information and playing sequence data, and the playing sequence data comprises sound click time information and sound duration information of sound;
extracting data sections according to the play queue data;
expanding the data sections to obtain frequent subgraphs, and constructing a data sample containing the platform data according to the frequent subgraphs;
inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample;
and constructing a multi-modal sound sequencing model, training the multi-modal sound sequencing model according to the platform data, the data sample and the mapping data, and outputting the trained multi-modal sound sequencing model.
Further, the step of extracting the data section according to the play queue data specifically includes:
A queue segment in which the interval between the sound click time information of any two consecutive sounds in the play queue data is not more than 20 minutes, and the sound duration information of every sound is not less than 20 seconds, is taken as the data section.
Further, the step of expanding the data section to obtain a frequent subgraph and constructing a data sample containing the platform data according to the frequent subgraph specifically comprises the following steps:
Frequent subgraph mining is performed on the data sections to obtain a plurality of frequent subgraphs corresponding to the data sections, and the frequent subgraphs are combined with the sound platform data to construct the data sample containing the sound ID information, sound label, sound keyword and anchor ID information of all the sounds in the frequent subgraphs.
Furthermore, in the step of inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample, the mapping data of the sound keyword is positioned before that of the sound ID information.
Furthermore, the multi-modal sound sequencing model comprises a first input end, a second input end and a Dense backbone network, wherein the first input end is used for processing the data sample sequentially through pre-training, max pooling, a concatenate function and an activation function to obtain a first text fusion feature, the second input end is used for processing the data sample sequentially through onehot coding, multi-hot coding, mapping sharing, Hadamard product, a concatenate function and an activation function to obtain a second text fusion feature, and the Dense backbone network is used for performing ranking processing with the first text fusion feature and the second text fusion feature as input and outputting a ranking result of the sounds in the sound platform data.
Further, in the Dense backbone network of the multi-modal sound sequencing model, the first text fusion feature and the second text fusion feature are fused using a concatenate function before being input to the backbone.
Further, the multi-modal sound sequencing model uses logistic loss as its loss function during training.
In a second aspect, an embodiment of the present invention further provides a multi-modal model training system for sound ranking, including:
the data acquisition module is used for acquiring sound platform data for sound sequencing, wherein the sound platform data comprises sound ID information, sound labels, sound keywords, anchor ID information and playing sequence data, and the playing sequence data comprises sound click time information and sound duration information;
the data section extraction module is used for extracting data sections according to the play queue data;
the data section expansion module is used for expanding the data sections to obtain frequent subgraphs and constructing a data sample containing the platform data according to the frequent subgraphs;
the data mapping module is used for inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample;
and the sequencing model training module is used for constructing a multi-modal sound sequencing model, training the multi-modal sound sequencing model according to the platform data, the data sample and the mapping data, and outputting the trained multi-modal sound sequencing model.
Furthermore, the multi-modal sound sequencing model comprises a first input end, a second input end and a Dense backbone network, wherein the first input end is used for processing the data sample sequentially through pre-training, max pooling, a concatenate function and an activation function to obtain a first text fusion feature, the second input end is used for processing the data sample sequentially through onehot coding, multi-hot coding, mapping sharing, Hadamard product, a concatenate function and an activation function to obtain a second text fusion feature, and the Dense backbone network is used for performing ranking processing with the first text fusion feature and the second text fusion feature as input and outputting a ranking result of the sounds in the sound platform data.
Further, in the Dense backbone network of the multi-modal sound sequencing model, the first text fusion feature and the second text fusion feature are fused using a concatenate function before being input to the backbone.
In a third aspect, an embodiment of the present invention further provides a computer device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the multimodal model training method as described in any of the above embodiments when executing the computer program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps in the multimodal model training method as described in any one of the above embodiments.
The method has the advantage that, because it fuses multi-modal data such as graph mining, text and numerical features, pre-training and deep learning models, the final model can capture changes in user interest in real time, which improves the accuracy of pushed content and increases the click-through rate and playing duration of sound content.
Drawings
FIG. 1 is a flowchart of the steps of a multi-modal model training method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of data sections provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a first input end provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a second input end provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of the overall structure of a multi-modal sound sequencing model provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a multi-modal model training system provided by an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a multi-modal model training method according to an embodiment of the present invention, including the following steps:
s101, sound platform data used for sound sequencing are obtained, the sound platform data comprise sound ID information, sound labels, sound keywords, anchor ID information and playing sequence data, and the playing sequence data comprise sound click time information and sound duration information.
Specifically, the sound platform data comes from an audio/video application platform. The embodiment of the present invention takes sound recommendation on such a platform as an example: a user can click and play sounds recommended by the platform, and the sounds are uploaded by anchors. Each sound carries sound ID information, a sound tag, a sound keyword and anchor ID information, where the sound ID information identifies the sound on the platform, the sound tag and sound keyword are the description information provided by the anchor at upload time, and the anchor ID information identifies the anchor corresponding to the sound. A plurality of sounds are displayed to the user as a list on the platform, and the ranking rule of the list places sounds of high interest to the user ahead. The order in which the user clicks sounds and the corresponding watching duration are recorded in the play queue data of that user, so the play queue data comprises the sound click time information and sound duration information of each sound.
In the embodiment of the present invention, the sound platform data may be obtained by querying the platform backend or by data capture; the embodiment of the present invention does not limit the method of acquiring the data.
And S102, extracting data sections according to the play queue data.
In this step, a queue segment in which the interval between the sound click time information of any two consecutive sounds in the play queue data is not greater than 20 minutes, and the sound duration information of every sound is not less than 20 seconds, is taken as one data section. Specifically, referring to fig. 2, fig. 2 is a schematic diagram of data sections provided in an embodiment of the present invention, taking the play queue data of users U1 and U2 as an example. In chronological order, the play queue data of user U1 contains the sounds I1, I3, I8, I2, I6 and I5, where the play interval between I8 and I2 exceeds 20 minutes, and the play queue data of user U2 contains the sounds I1, I3, I8, I2, I7 and I6; for convenience of description, the durations of all sounds are greater than 20 seconds. For the play queue data of user U1, the extracted data sections are split between I8 and I2, because their play interval exceeds 20 minutes; in a real scenario this means the user did not click continuously within the same push queue, so there is no push relevance between the two parts. The data sections corresponding to user U1 are therefore I1, I3, I8 and I2, I6, I5. No interval greater than 20 minutes appears in the play queue data of user U2, so the data section corresponding to user U2 is I1, I3, I8, I2, I7, I6.
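For illustration only, the following Python sketch shows one possible implementation of the extraction rule in step S102. The record layout, the function name and the handling of too-short sounds are assumptions made for the example and are not prescribed by the embodiment.

```python
from datetime import timedelta

MAX_GAP = timedelta(minutes=20)   # maximum click interval inside one data section
MIN_DURATION = 20                 # minimum sound duration, in seconds

def extract_data_sections(play_queue):
    """play_queue: list of (sound_id, click_time: datetime, duration_seconds),
    already sorted chronologically."""
    sections, current, last_click = [], [], None
    for sound_id, click_time, duration in play_queue:
        if duration < MIN_DURATION:
            continue              # too short to signal real interest
        if last_click is not None and click_time - last_click > MAX_GAP:
            if current:
                sections.append(current)
            current = []
        current.append(sound_id)
        last_click = click_time
    if current:
        sections.append(current)
    return sections

# For user U1 in FIG. 2, a gap of more than 20 minutes between I8 and I2
# yields the sections [I1, I3, I8] and [I2, I6, I5].
```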
S103, expanding the data sections to obtain frequent subgraphs, and constructing a data sample containing the platform data according to the frequent subgraphs.
Specifically, in this step, frequent subgraph mining is performed on the data section to obtain a plurality of frequent subgraphs corresponding to the data section, and the frequent subgraphs are combined with the sound platform data to construct the data sample including the sound ID information, the sound tag, the sound keyword, and the anchor ID information of all the sounds in the frequent subgraph.
Taking the data section I1, I3, I8 obtained for user U1 in the above embodiment as an example, frequent subgraph mining with a minimum support threshold of 2 yields the frequent subgraph I1, I3, I8 (this pattern also appears in the data section of user U2, so its support is 2). The order of this frequent subgraph is then randomly shuffled to generate 2 new frequent subgraphs, such as I1, I8, I3 and I8, I3, I1. After the frequent subgraphs are obtained, the data samples are constructed from them. Taking the frequent subgraph I1, I3, I8, which contains 3 sounds, as an example, a data sample is constructed for each sound as follows:
I1 keyword, I1 ID information, I1 anchor ID information, I1 sound tag;
I3 keyword, I3 ID information, I3 anchor ID information, I3 sound tag;
I8 keyword, I8 ID information, I8 anchor ID information, I8 sound tag.
It should be noted that when the anchor uploads a sound to the audio/video application platform, the sound description may contain several sound tags in order to better describe the content and characteristics of the sound; in that case the data sample may also contain several sound tags, for example the I1 sound tag may include a first-level I1 tag and a second-level I1 tag, and the number of sound tags may be set according to actual needs.
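The sketch below illustrates, under simplifying assumptions, how step S103 can be realised: frequent subgraph mining is approximated here by counting contiguous subsequences across all data sections against a minimum support threshold, the mined patterns are shuffled to generate additional variants, and one sample row per sound is assembled from assumed platform metadata fields (keyword, sound ID, anchor ID, sound tag). A dedicated frequent subgraph mining algorithm could be substituted without changing the surrounding flow.

```python
import random
from collections import Counter

def mine_frequent_patterns(sections, length=3, min_support=2):
    """Simplified stand-in for frequent subgraph mining: count contiguous
    subsequences of the given length across all data sections and keep those
    whose support reaches the threshold."""
    counts = Counter()
    for section in sections:
        for i in range(len(section) - length + 1):
            counts[tuple(section[i:i + length])] += 1
    return [list(p) for p, c in counts.items() if c >= min_support]

def expand_and_build_samples(frequent_patterns, platform_data, n_shuffles=2, seed=0):
    """Shuffle each mined pattern to create extra variants, then build one sample
    row per sound from the platform metadata (keyword, ID, anchor ID, tag)."""
    rng = random.Random(seed)
    expanded = []
    for pattern in frequent_patterns:
        expanded.append(pattern)
        for _ in range(n_shuffles):
            variant = pattern[:]
            rng.shuffle(variant)
            expanded.append(variant)
    samples = []
    for pattern in expanded:
        samples.append([
            (platform_data[s]["keyword"], s,
             platform_data[s]["anchor_id"], platform_data[s]["tag"])
            for s in pattern
        ])
    return samples
```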
S104, inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample.
Specifically, the word2vec model is a commonly-used neural network model for converting non-computable unstructured text data into computable structured mathematical vector data, and in the embodiment of the present invention, the word2vec model is configured to obtain the mapping data (embedding) corresponding to the data sample, so as to use the data sample as input data of a subsequent ranking model.
Preferably, in the mapping data the sound keyword is placed before the sound ID information. This ordering is chosen because, in a single playing behaviour, the user first sees the sound keyword contained in the title of the sound, becomes interested in it, and only then clicks the sound to play it. Reflecting this behaviour in the order in which the word2vec model derives the mapping data makes the mapping data better match the actual situation, and thus improves the mapping quality of the resulting mathematical vector data.
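As an illustration of step S104, the sketch below builds token sequences from the data samples of the previous sketch, placing each sound's keyword before its sound ID as described above, and trains a word2vec model with the gensim library; the field prefixes, vector size and other hyperparameters are assumptions made for the example.

```python
from gensim.models import Word2Vec

def build_sentences(samples):
    """Turn each data sample into a token sequence; the keyword token precedes
    the sound ID token, mirroring the click behaviour described above."""
    sentences = []
    for sample in samples:
        tokens = []
        for keyword, sound_id, anchor_id, tag in sample:
            tokens += [f"kw_{keyword}", f"snd_{sound_id}",
                       f"anc_{anchor_id}", f"tag_{tag}"]
        sentences.append(tokens)
    return sentences

sentences = build_sentences(samples)   # `samples` from the previous sketch
w2v = Word2Vec(sentences, vector_size=64, window=5, min_count=1, sg=1, epochs=10)
mapping_i1 = w2v.wv["snd_I1"]          # mapping data (embedding) for sound I1
```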
S105, constructing a multi-modal sound sequencing model, training the multi-modal sound sequencing model according to the platform data, the data sample and the mapping data, and outputting the trained multi-modal sound sequencing model.
For example, please refer to fig. 3, fig. 4 and fig. 5, where fig. 3 is a schematic structural diagram of a first input end provided by an embodiment of the present invention, fig. 4 is a schematic structural diagram of a second input end provided by an embodiment of the present invention, and fig. 5 is a schematic diagram of the overall structure of a multi-modal sound sequencing model provided by an embodiment of the present invention. The first input end is configured to process the data sample through pre-training, max pooling, a concatenate function and an activation function to obtain the first text fusion feature. More preferably, as shown in fig. 3, besides processing the sound ID information, sound tag, sound keyword and anchor ID information in the data sample through pre-training, max pooling and so on before the concatenate function, the first input end also processes the user basic information, user behaviour information and statistical information related to the data sample through onehot encoding, weighted multi-hot encoding, bucketing and data mapping. These features are included because, when a list of sounds is pushed to the user, they reflect the user's playing behaviour on the audio/video application platform: the user behaviour information includes the user's clicking habits, and the statistical information includes, for example, the number of times the user has clicked sounds in the play list within a period of time. The concatenate function is a commonly used character or data merging function: after the input data has been encoded as above, the concatenate function merges it into data that carries the characteristics of all the inputs. To further refine the first text fusion feature obtained at the first input end, the first input end contains two groups of leaky_relu activation functions with different size parameters.
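A minimal Keras sketch of the first input end is given below, assuming TensorFlow is used; the layer widths, input names and the exact placement of the two leaky_relu blocks are illustrative and only mirror the structure of FIG. 3 as described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_first_input(vocab_size, embed_dim, pretrained_matrix, seq_len, dense_feat_dim):
    """Illustrative first input end: pre-trained embeddings -> max pooling ->
    concatenate with encoded user/statistical features -> two leaky_relu blocks."""
    text_in = layers.Input(shape=(seq_len,), name="text_ids")     # sound ID / tag / keyword / anchor ID tokens
    user_in = layers.Input(shape=(dense_feat_dim,), name="user_feats")  # onehot / weighted multi-hot / bucketed features
    emb = layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained_matrix),
        trainable=False)(text_in)                                 # pre-trained word2vec mapping data
    pooled = layers.GlobalMaxPooling1D()(emb)                     # max pooling over the token axis
    fused = layers.Concatenate()([pooled, user_in])               # concatenate function
    x = layers.LeakyReLU()(layers.Dense(256)(fused))              # first leaky_relu block
    first_text_fusion = layers.LeakyReLU()(layers.Dense(128)(x))  # second, smaller block
    return [text_in, user_in], first_text_fusion
```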
The second input end is configured to process the data sample successively through onehot coding, multi-hot coding, mapping sharing, Hadamard product, a concatenate function and an activation function to obtain the second text fusion feature. More preferably, as shown in fig. 4, besides applying onehot or multi-hot coding to the sound ID information, sound tag, sound keyword and anchor ID information in the data sample before data mapping, the second input end also applies multi-hot coding or weighted multi-hot coding to the play-sequence sound IDs, play-sequence anchor IDs, keyword preferences and tag preferences related to the data sample, and maps the data. A Hadamard product is then computed between the data that share the data mapping, yielding an element-wise product of the two inputs and thereby unifying the corresponding inputs into data of the same size, after which the result is processed through a concatenate function and an activation function to obtain the second text fusion feature. Because the input data at the second input end is not as complex as that at the first input end, the second input end includes only one group of leaky_relu activation functions.
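Continuing under the same assumptions, the sketch below mirrors the structure of FIG. 4: the candidate sound's fields and the user's play-sequence and preference fields share one embedding table, a Hadamard (element-wise) product aligns them to the same size, and a single leaky_relu block produces the second text fusion feature. Pooling the history with average pooling is an assumption made for the example.

```python
from tensorflow.keras import layers

def build_second_input(vocab_size, embed_dim, seq_len):
    """Illustrative second input end with a shared embedding table and a
    Hadamard product between the candidate and the user's history."""
    item_in = layers.Input(shape=(1,), name="item_id")            # onehot-coded candidate sound/anchor ID
    hist_in = layers.Input(shape=(seq_len,), name="history_ids")  # multi-hot play-sequence IDs and preferences
    shared_emb = layers.Embedding(vocab_size, embed_dim, name="shared_embedding")  # mapping sharing
    item_vec = layers.Flatten()(shared_emb(item_in))
    hist_vec = layers.GlobalAveragePooling1D()(shared_emb(hist_in))  # pool the history to one vector
    hadamard = layers.Multiply()([item_vec, hist_vec])               # Hadamard product
    fused = layers.Concatenate()([item_vec, hist_vec, hadamard])     # concatenate function
    second_text_fusion = layers.LeakyReLU()(layers.Dense(128)(fused))  # single leaky_relu block
    return [item_in, hist_in], second_text_fusion
```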
The multi-modal sound sequencing model takes the first text fusion feature and the second text fusion feature output by the first input end and the second input end respectively as input, processes them through a concatenate function and an activation function, and then feeds the result into the Dense backbone network. Specifically, the Dense backbone network is a dense convolutional network (DenseNet); compared with other convolutional neural networks, using DenseNet as the backbone network effectively extracts the correlations among the input parameters through its multi-layer connections and avoids the overfitting caused by relying on a single kind of data. The network outputs a new sequencing result, so that changes in the user's interest are captured in real time while the user clicks sounds, achieving real-time interest-based pushing. Preferably, logistic loss is used as the loss function when training the multi-modal sound sequencing model.
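Finally, the two towers from the preceding sketches can be assembled as follows. The backbone here is a simplified DenseNet-style stack of dense layers with concatenated skip connections rather than a full DenseNet, and the vocabulary size, layer widths and the `pretrained` embedding matrix are assumptions; training uses binary cross-entropy, i.e. the logistic loss mentioned above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Continues the two tower sketches above; `pretrained` is an assumed
# (vocab_size x embed_dim) word2vec embedding matrix from step S104.
first_inputs, first_feat = build_first_input(
    vocab_size=50000, embed_dim=64, pretrained_matrix=pretrained,
    seq_len=20, dense_feat_dim=32)
second_inputs, second_feat = build_second_input(vocab_size=50000, embed_dim=64, seq_len=20)

x = layers.Concatenate()([first_feat, second_feat])      # fuse the two text fusion features
x = layers.LeakyReLU()(layers.Dense(256)(x))
for units in (128, 64):                                   # DenseNet-style backbone, simplified:
    block = layers.LeakyReLU()(layers.Dense(units)(x))    # each block also sees all earlier outputs
    x = layers.Concatenate()([x, block])
click_prob = layers.Dense(1, activation="sigmoid")(x)     # probability that the sound is clicked

model = tf.keras.Model(inputs=first_inputs + second_inputs, outputs=click_prob)
model.compile(optimizer="adam",
              loss="binary_crossentropy",                 # logistic loss
              metrics=[tf.keras.metrics.AUC()])
# Candidate sounds are then ranked by their predicted click probability.
```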
The method has the advantage that, because it fuses multi-modal data such as graph mining, text and numerical features, pre-training and deep learning models, the final model can capture changes in user interest in real time, which improves the accuracy of pushed content and increases the click-through rate and playing duration of sound content.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a multi-modal model training system 200 according to an embodiment of the present invention, which includes a data obtaining module 201, a data section extracting module 202, a data section expanding module 203, a data mapping module 204, and a ranking model training module 205, where:
the data acquisition module 201 is configured to acquire sound platform data for sound sequencing, where the sound platform data includes sound ID information, sound tags, sound keywords, anchor ID information, and play sequence data, and the play sequence data includes sound click time information and sound duration information of sound;
a data section extraction module 202, configured to extract data sections according to the play queue data;
the data section expansion module 203 is used for expanding the data sections to obtain frequent subgraphs and constructing a data sample containing the platform data according to the frequent subgraphs;
a data mapping module 204, configured to input the data sample into a word2vec model, to obtain mapping data corresponding to the data sample;
the ranking model training module 205 is configured to construct a multi-modal sound ranking model, train the multi-modal sound ranking model according to the platform data, the data sample, and the mapping data, and output the trained multi-modal sound ranking model.
The multi-modal model training system 200 can implement the steps in the multi-modal model training method in the above embodiments, and can implement the same technical effects, which is referred to the description in the above embodiments and is not repeated herein.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device provided in an embodiment of the present invention, where the computer device 300 includes: a memory 302, a processor 301, and a computer program stored on the memory 302 and executable on the processor 301.
The processor 301 calls the computer program stored in the memory 302 to execute the steps in the multi-modal model training method provided by the embodiment of the present invention, and with reference to fig. 1, the method specifically includes:
s101, sound platform data used for sound sequencing are obtained, the sound platform data comprise sound ID information, sound labels, sound keywords, anchor ID information and playing sequence data, and the playing sequence data comprise sound click time information and sound duration information.
And S102, extracting data sections according to the play queue data.
Further, the step of extracting the data section according to the play queue data specifically includes:
A queue segment in which the interval between the sound click time information of any two consecutive sounds in the play queue data is not more than 20 minutes, and the sound duration information of every sound is not less than 20 seconds, is taken as the data section.
S103, expanding the data sections to obtain frequent subgraphs, and constructing a data sample containing the platform data according to the frequent subgraphs.
Further, the step of expanding the data section to obtain frequent subgraphs and constructing a data sample containing the platform data according to the frequent subgraphs specifically comprises the following steps:
Frequent subgraph mining is performed on the data sections to obtain a plurality of frequent subgraphs corresponding to the data sections, and the frequent subgraphs are combined with the sound platform data to construct the data sample containing the sound ID information, sound label, sound keyword and anchor ID information of all the sounds in the frequent subgraphs.
Furthermore, in the step of inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample, the mapping data of the sound keyword is positioned before that of the sound ID information.
S104, inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample.
S105, constructing a multi-modal sound sequencing model, training the multi-modal sound sequencing model according to the platform data, the data sample and the mapping data, and outputting the trained multi-modal sound sequencing model.
Furthermore, the multi-modal sound sequencing model comprises a first input end, a second input end and a Dense backbone network, wherein the first input end is used for processing the data sample sequentially through pre-training, max pooling, a concatenate function and an activation function to obtain a first text fusion feature, the second input end is used for processing the data sample sequentially through onehot coding, multi-hot coding, mapping sharing, Hadamard product, a concatenate function and an activation function to obtain a second text fusion feature, and the Dense backbone network is used for performing ranking processing with the first text fusion feature and the second text fusion feature as input and outputting a ranking result of the sounds in the sound platform data.
Still further, the multi-modal sound sequencing model uses logistic loss as its loss function during training.
The computer device 300 provided in the embodiment of the present invention can implement the steps in the multimodal model training method in the foregoing embodiment, and can implement the same technical effects, and reference is made to the description in the foregoing embodiment, which is not repeated herein.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process and step in the multi-modal model training method provided in the embodiment of the present invention, and can implement the same technical effect, and in order to avoid repetition, the detailed description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, which are illustrative, but not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A method of multi-modal model training for sound sequencing, the method comprising the steps of:
acquiring sound platform data for sound sequencing, wherein the sound platform data comprises sound ID information, sound tags, sound keywords, anchor ID information and playing sequence data, and the playing sequence data comprises sound click time information and sound duration information of sound;
extracting data sections according to the play queue data;
expanding the data sections to obtain frequent subgraphs, and constructing a data sample containing the platform data according to the frequent subgraphs;
inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample;
and constructing a multi-modal sound sequencing model, training the multi-modal sound sequencing model according to the platform data, the data sample and the mapping data, and outputting the trained multi-modal sound sequencing model.
2. The multi-modal model training method of claim 1, wherein the step of extracting data sections according to the play queue data comprises:
A queue segment in which the interval between the sound click time information of any two consecutive sounds in the play queue data is not more than 20 minutes, and the sound duration information of every sound is not less than 20 seconds, is taken as the data section.
3. The multi-modal model training method of claim 1, wherein the step of expanding the data section to obtain frequent subgraphs and constructing data samples containing the platform data according to the frequent subgraphs comprises:
Frequent subgraph mining is performed on the data sections to obtain a plurality of frequent subgraphs corresponding to the data sections, and the frequent subgraphs are combined with the sound platform data to construct the data sample containing the sound ID information, sound label, sound keyword and anchor ID information of all the sounds in the frequent subgraphs.
4. The multi-modal model training method as claimed in claim 3, wherein, in the step of inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample, the mapping data of the sound keyword is positioned before that of the sound ID information.
5. The multi-modal model training method as claimed in claim 1, wherein the multi-modal sound sequencing model comprises a first input end, a second input end and a Dense backbone network, wherein the first input end is used for processing the data sample sequentially through pre-training, max pooling, a concatenate function and an activation function to obtain a first text fusion feature, the second input end is used for processing the data sample sequentially through onehot coding, multi-hot coding, mapping sharing, Hadamard product, a concatenate function and an activation function to obtain a second text fusion feature, and the Dense backbone network is used for performing ranking processing with the first text fusion feature and the second text fusion feature as inputs and outputting a ranking result of the sounds in the sound platform data.
6. The multi-modal model training method as claimed in claim 5, wherein in the Dense backbone network of the multi-modal sound sequencing model, the first text fusion feature and the second text fusion feature are fused using a concatenate function before being input to the backbone.
7. The multi-modal model training method as claimed in claim 1, wherein the multi-modal sound sequencing model is trained using logistic loss as a loss function.
8. A multi-modal model training system for sound sequencing, comprising:
the data acquisition module is used for acquiring sound platform data for sound sequencing, wherein the sound platform data comprises sound ID information, sound labels, sound keywords, anchor ID information and playing sequence data, and the playing sequence data comprises sound click time information and sound duration information;
the data section extraction module is used for extracting data sections according to the play queue data;
the data section expansion module is used for expanding the data sections to obtain frequent subgraphs and constructing a data sample containing the platform data according to the frequent subgraphs;
the data mapping module is used for inputting the data sample into a word2vec model to obtain mapping data corresponding to the data sample;
and the sequencing model training module is used for constructing a multi-modal sound sequencing model, training the multi-modal sound sequencing model according to the platform data, the data sample and the mapping data, and outputting the trained multi-modal sound sequencing model.
9. A computer device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the multi-modal model training method as claimed in any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the multi-modal model training method as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210192960.7A CN114567811B (en) | 2022-02-28 | 2022-02-28 | Multi-modal model training method, system and related equipment for voice sequencing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210192960.7A CN114567811B (en) | 2022-02-28 | 2022-02-28 | Multi-modal model training method, system and related equipment for voice sequencing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114567811A true CN114567811A (en) | 2022-05-31 |
CN114567811B CN114567811B (en) | 2024-02-09 |
Family
ID=81714946
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210192960.7A Active CN114567811B (en) | 2022-02-28 | 2022-02-28 | Multi-modal model training method, system and related equipment for voice sequencing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114567811B (en) |
Citations (5)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160093292A1 (en) * | 2014-09-26 | 2016-03-31 | Intel Corporation | Optimizations to decoding of wfst models for automatic speech recognition |
WO2020054822A1 (en) * | 2018-09-13 | 2020-03-19 | LiLz株式会社 | Sound analysis device, processing method thereof, and program |
CN112287160A (en) * | 2020-10-28 | 2021-01-29 | 广州欢聊网络科技有限公司 | Audio data sorting method and device, computer equipment and storage medium |
CN113688167A (en) * | 2021-01-15 | 2021-11-23 | 稿定(厦门)科技有限公司 | Deep interest capture model construction method and device based on deep interest network |
CN113486833A (en) * | 2021-07-15 | 2021-10-08 | 北京达佳互联信息技术有限公司 | Multi-modal feature extraction model training method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN114567811B (en) | 2024-02-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |