CN113813609B - Game music style classification method and device, readable medium and electronic equipment - Google Patents


Info

Publication number
CN113813609B
Authority
CN
China
Prior art keywords
game
audio
music
data set
music style
Prior art date
Legal status
Active
Application number
CN202110615605.1A
Other languages
Chinese (zh)
Other versions
CN113813609A (en)
Inventor
彭博 (Peng Bo)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110615605.1A
Publication of CN113813609A
Application granted
Publication of CN113813609B


Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/50 Controlling the output signals based on the game progress
    • A63F13/54 Controlling the output signals based on the game progress involving acoustic signals, e.g. for simulating revolutions per minute [RPM] dependent engine sounds in a driving game or reverberation against a virtual wall
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The application belongs to the technical field of artificial intelligence, and in particular to the automatic identification of game music styles. It discloses a game music style classification method and device, a readable medium, and an electronic device. The method obtains game audio cluster sets by performing unsupervised clustering on the game audio in a game music data set, selects a plurality of game audio samples from each cluster set, and determines the music style tag of the cluster set from the content correlation of those samples. This makes it possible to determine music style tags for game music, so that users can listen to collections of game music by the classification corresponding to the music style tag, which greatly improves the user experience of game music and promotes its development.

Description

Game music style classification method and device, readable medium and electronic equipment
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a game music style classification method, a game music style classification device, a computer readable medium and electronic equipment.
Background
A music style is a tag that reflects the overall character of a piece of music; typical styles include country, jazz, rock, heavy metal, and punk. Music is currently classified mostly by such genre labels.
Game music, however, is largely instrumental score, much of it with no singer, so directly transplanting ordinary music style classifications makes it difficult to classify game music effectively. In practice, users often want to listen to collections of game music by a specific category but cannot obtain such categories for game music, which greatly degrades the user experience of game music and hinders its development.
It should be noted that the information disclosed in the background section above is only intended to enhance understanding of the background of the application, and may therefore include information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The application aims to provide a game music style classification method and device, a computer readable medium, and an electronic device, which overcome, at least to a certain extent, the problem in the related art that game music styles cannot be classified.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a game music style classification method including:
acquiring a game music data set, wherein the game music data set comprises game audio;
performing unsupervised clustering on the game audio in the game music data set to obtain a game audio cluster set, wherein the game audio cluster set comprises the game audio gathered into one set after the unsupervised clustering;
selecting a plurality of game audio samples from the game audio cluster set, and determining a music style tag of the game audio cluster set according to the content correlation of the plurality of game audio samples;
and adding a music style tag to the game audio in the game music data set to obtain an audio tag data set.
According to an aspect of an embodiment of the present application, there is provided a game music style classification apparatus including:
the acquisition module is used for acquiring a game music data set, wherein the game music data set comprises game audio;
the clustering module is connected with the acquisition module and is used for performing unsupervised clustering on the game audio in the game music data set to obtain a game audio cluster set, wherein the game audio cluster set comprises the game audio gathered into one set after the unsupervised clustering;
the identification module is connected with the clustering module and is used for selecting a plurality of game audio samples from the game audio cluster set and determining the music style tag of the game audio cluster set according to the content correlation of the game audio samples;
and the adding module is connected with the identification module and used for adding a music style tag to the game audio in the game music data set to obtain an audio tag data set.
In some embodiments of the present application, based on the above technical solution, the apparatus further includes:
an audio acquisition module configured to acquire the audio tag dataset and game audio to be categorized;
the transformation module is configured to obtain mel spectrograms from the game audio with music style tags in the audio tag data set via short-time Fourier transform, and to obtain a mel spectrogram from the game audio to be classified in the same way;
the prediction training module is configured to input a mel spectrogram corresponding to the audio tag data set into a preset deep convolutional neural network for training to obtain a network model for music style tag prediction;
And the style identification module is configured to input the mel spectrogram corresponding to the game audio to be classified into the network model for music style label prediction to obtain the music style label of the game audio to be classified.
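The transformation step that these modules describe (game audio, short-time Fourier transform, mel filter bank, mel spectrogram) can be sketched in plain numpy. This is a minimal illustration and not the patent's implementation; the frame length, hop size, and number of mel bands below are assumptions, and production code would typically use a library such as librosa:

```python
import numpy as np

def mel_spectrogram(audio, sr=22050, n_fft=1024, hop=512, n_mels=64):
    """Short-time Fourier transform followed by a triangular mel filter
    bank. Parameters (n_fft, hop, n_mels) are illustrative assumptions."""
    # Frame the signal and apply a Hann window
    n_frames = 1 + (len(audio) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (frames, n_fft//2+1)

    # Build a triangular mel filter bank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    return power @ fbank.T                              # (frames, n_mels)
```

The resulting two-dimensional array is what would be fed, as an image-like input, to the deep convolutional neural network of the prediction training module.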
In some embodiments of the present application, based on the above technical solution, the prediction training module is further configured to perform supervised learning on a deep convolutional neural network using the set of mel spectrograms obtained by short-time Fourier transform of the game audio with music style tags in the audio tag data set, so as to obtain suitable weight parameter matrices and offsets, and to assign a weight parameter matrix and an offset to each layer of the deep convolutional neural network accordingly.
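As a toy illustration of how supervised learning with backpropagation yields a weight parameter matrix and an offset, consider a single softmax layer trained by gradient descent. The patent's model is a deep convolutional neural network, which this one-layer sketch does not reproduce; the layer shape and learning rate are assumptions:

```python
import numpy as np

def supervised_step(W, b, x, y, lr=0.1):
    """One backpropagation step for a single softmax layer: a stand-in
    for the deep CNN training described above, showing how supervised
    learning updates a weight parameter matrix W and an offset (bias) b."""
    logits = x @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)          # softmax probabilities
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1.0          # d(cross-entropy)/d(logits)
    grad /= len(y)
    W = W - lr * (x.T @ grad)                  # update the weight matrix
    b = b - lr * grad.sum(axis=0)              # update the offset
    return W, b
```

Iterating this step decreases the cross-entropy loss on the labeled spectrograms; in the full method, one such weight matrix and offset pair is learned per network layer.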
In some embodiments of the present application, based on the above technical solution, the clustering module includes an unsupervised training unit and an unsupervised clustering unit.
In some embodiments of the present application, based on the above technical solution, the unsupervised training unit is configured to randomly clip two segments from each game audio in the game music data set and convert the two segments into audio feature vectors, forming an audio slice pair;
The unsupervised training unit is further configured to input the audio slice pairs into a multi-class cross entropy contrastive loss function for unsupervised training, yielding an unsupervised-trained game music data set, wherein the loss function reduces the intra-pair feature distance of each audio slice pair and increases the inter-pair feature distance between different pairs.
In some embodiments of the present application, based on the above technical solution, the unsupervised clustering unit is configured to input the audio feature vector corresponding to each audio in the unsupervised-trained game music data set into a greedy algorithm for unsupervised clustering, so as to obtain the game audio cluster sets.
In some embodiments of the present application, based on the above technical solutions, inputting the audio feature vector corresponding to each audio in the unsupervised-trained game music data set into a greedy algorithm for unsupervised clustering comprises:
selecting the pair of audio feature vectors with the minimum distance in the unsupervised-trained game music data set;
and, if the distance between that pair of audio feature vectors is smaller than a specified threshold, merging them into one class to form a game audio cluster set.
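The two steps above amount to a greedy agglomeration: repeatedly merge the closest pair while its distance stays below the threshold. The patent does not specify how the distance between merged clusters is measured, so the centroid linkage in this sketch is an assumption:

```python
import numpy as np

def greedy_cluster(features, threshold):
    """Greedy unsupervised clustering: repeatedly find the pair of clusters
    whose centroid feature vectors are closest and merge them while that
    minimum distance is below `threshold`. Centroid linkage is an
    assumption; the text only specifies the minimum-distance rule."""
    clusters = [[i] for i in range(len(features))]
    centroids = [features[i].astype(float) for i in range(len(features))]
    while len(clusters) > 1:
        best, best_d = None, np.inf
        for a in range(len(clusters)):      # O(n^2) scan: fine for a sketch
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                if d < best_d:
                    best, best_d = (a, b), d
        if best_d >= threshold:             # nothing close enough to merge
            break
        a, b = best
        clusters[a] += clusters.pop(b)
        centroids.pop(b)
        centroids[a] = features[clusters[a]].mean(axis=0)
    return clusters
```

For example, four one-dimensional feature vectors [0.0], [0.1], [5.0], [5.2] with a threshold of 1.0 collapse into the two cluster sets {0, 1} and {2, 3}.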
In some embodiments of the present application, based on the above technical solutions, the identification module includes:
a feature extraction unit configured to extract game music features of game audio in a game music data set and to extract game music features of a plurality of game audio samples in the game audio cluster set;
the label training unit is configured to input the game music characteristics and the music style labels corresponding to the game music characteristics into a machine learning model for training, and a label calibration model for predicting the music style labels based on the game music characteristics is obtained;
and the label prediction unit is configured to input game music characteristics of a plurality of game audio samples in the game audio cluster set into the label calibration model to obtain music style labels of the game audio cluster set.
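The label training and label prediction units can be sketched with a minimal stand-in for the label calibration model: a nearest-centroid classifier over game music feature vectors. The class name, the model family, and the example style tags ("ambient", "battle") are all assumptions; the text does not fix the machine learning model:

```python
import numpy as np

class LabelCalibrationModel:
    """Hypothetical minimal label calibration model: nearest-centroid
    classification over game music feature vectors."""
    def fit(self, features, labels):
        # One centroid per music style tag seen in training
        self.labels_ = sorted(set(labels))
        self.centroids_ = np.stack([
            np.mean([f for f, l in zip(features, labels) if l == lab], axis=0)
            for lab in self.labels_])
        return self

    def predict(self, samples):
        # Assign each sample the tag of its nearest centroid
        d = np.linalg.norm(samples[:, None, :] - self.centroids_[None], axis=2)
        return [self.labels_[i] for i in d.argmin(axis=1)]
```

Feeding the features of a cluster set's sampled audio through `predict` yields a candidate music style tag for the whole game audio cluster set.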
According to an aspect of the embodiments of the present application, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a game music style classification method as in the above technical solution.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the game music style classification method as in the above technical solution via execution of the executable instructions.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the game music style classification method as in the above technical solution.
According to the technical solution provided by the embodiments of the application, the game audio in the game music data set is gathered by unsupervised clustering into a plurality of game audio cluster sets according to music style. The music style tag of each cluster set is then determined from the content correlation of a plurality of game audio samples drawn from it. Finally, music style tags are added to all game audio in the game music data set to obtain an audio tag data set. The method thus yields classification tags dedicated to game music styles.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
FIG. 2 schematically illustrates a flow chart of steps of a game music style classification method in one embodiment of the application.
FIG. 3 schematically illustrates a flow chart of method steps for unsupervised training of game audio in one embodiment of the present application.
FIG. 4 schematically illustrates an effect diagram of unsupervised training of game audio in one embodiment of the present application.
FIG. 5 schematically illustrates a flow chart of method steps for unsupervised clustering of unsupervised training data in one embodiment of the present application.
FIG. 6 schematically illustrates a general classification table based on game music style tags in one embodiment of the application.
FIG. 7 schematically illustrates a flowchart of method steps for determining a music style tag for a collection of game audio categories in one embodiment of the application.
FIG. 8 schematically illustrates a flowchart of method steps for music style tag prediction of game audio in one embodiment of the present application.
Fig. 9 schematically shows a flowchart of the method steps for training a mel-frequency spectrogram in an embodiment of the application.
FIG. 10 schematically illustrates a schematic of a specific application of music style tag prediction on game audio in one embodiment of the application.
Fig. 11 schematically shows a block diagram of a game music style classification apparatus according to an embodiment of the present application.
Fig. 12 schematically shows a block diagram of a computer system of an electronic device for implementing an embodiment of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason and make decisions.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
As a major direction of artificial intelligence software technology, machine learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, and algorithmic complexity theory. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
The application relates to a technology for automatically classifying game music styles in the field of artificial intelligence, and mainly realizes two functions. First, it obtains a classification tag system dedicated to game music styles; second, it automatically identifies and determines the music style tag of game audio to be classified. The application makes extensive use of machine learning; its specific technical content is disclosed further below.
Music has a style, can be classified according to style, and each person has different preferences among styles. A music style tag describes the style of a song; for example, a style may be RAP, POP, or FOLK (ballad), and the style of a song is determined mainly by the singer's performance style, the rhythmic composition of the whole song, and so on.
Game music, by contrast, is mostly instrumental score with no singer and often no specific music style, so directly transplanting the labels of ordinary songs makes it difficult to tag. Yet solutions for music style tagging rely on high-quality, known tag labeling data; for game music without music style tags, it is therefore difficult to classify and identify the music in a way that satisfies the user experience of different types of game music styles.
The solution provided by the present application better addresses these problems, and supplements and perfects the tag system for game music styles.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
As shown in fig. 1, system architecture 100 may include a terminal device 110, a network 120, and a server 130. Terminal device 110 may include various electronic devices such as smart phones, tablet computers, notebook computers, desktop computers, and the like. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal device 110 and server 130, and may be, for example, a wired communication link or a wireless communication link.
The steps corresponding to the game music style classification method of the present application may be performed in the server 130 or in the terminal device 110. Specifically, the server 130 or the terminal device 110 can recognize and add a music style tag of game audio through the following steps. The method comprises the following specific steps: after the server 130 or the terminal device 110 acquires the game music data set, performing unsupervised clustering on game audio in the game music data set to obtain a game audio cluster set; selecting a plurality of game audio samples from a game audio cluster set, and determining a music style tag of the game audio cluster set according to the content correlation of the plurality of game audio samples; and adding a music style tag to game audio in the game music data set to obtain an audio tag data set.
After the audio tag data set is obtained by the above steps, it can be used for training, so that music style tags of any game audio can be identified automatically for the user. Game audio to be classified may be acquired through the terminal device 110 or the server 130, and its music style tag is then recognized automatically by a network model for music style tag prediction. The application has two main parts. The first part obtains the music style tags corresponding to game audio: because the music style tags of ordinary songs cannot be applied directly to game music, tags dedicated to game audio must first be obtained and a music style tag system constructed. The second part automatically identifies the music style tag of game audio to be classified according to that constructed tag system: after the tags are obtained, machine learning over a large number of game audio items with known music style tags yields a network model for music style tag prediction, and inputting game audio to be classified into this model achieves automatic identification of its music style tag.
The system architecture in embodiments of the present application may have any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by the terminal device 110 and the server 130 together, which is not limited in particular.
The game music style classification method provided by the application is described in detail below with reference to the specific embodiments.
According to an aspect of an embodiment of the present application, there is provided a game music style classification method, as shown in FIG. 2, which schematically shows a flow chart of the steps of the method in one embodiment of the application. The method may be performed by the server 130, by the terminal device 110, or jointly by both. Specifically, the method comprises steps S210-S240.
The step S210 specifically includes: a game music dataset is obtained, the game music dataset comprising game audio.
The game music data set includes a plurality of game music items; it is a collection of game music. Game music here covers all music related to a game. For example, it may include the background music played when a game is opened, the prompt sound played when an interaction is clicked in the game interface, in-game prompt music, music played along with game special effects, and in-game dialogue and narration. The game music in the data set can come from different kinds of games: by obtaining the base packages of various games, all game audio content in those packages can be obtained directly.
In specific applications, the game music data set may be acquired in various ways. For example, when the data set is stored locally on the server 130 or terminal device 110 executing the present application, all the game music composing it may be acquired directly from local storage. Alternatively, when the data set is stored in a network cloud communicatively connected with the server 130 or terminal device 110, all game-related audio may be requested from the cloud as the game music data set. Or, when the data set is stored on another server or terminal device, a request may be made to that device and, after authorization is obtained, all game-related audio may be acquired as the game music data set.
Any way of acquiring a game music data set may be used in the present application, which is not limited in this embodiment.
The step S220 specifically includes: performing unsupervised clustering on the game audio in the game music data set to obtain a game audio cluster set, wherein the game audio cluster set comprises the game audio gathered into one set after the unsupervised clustering.
When a large number of game audio items from each game form a game music data set, the game audio must be clustered; otherwise, if every game audio item carried its own music style tag, the classification of game music would have no reference value. The application therefore performs unsupervised clustering on the game audio in the game music data set, so that the audio is gathered according to certain rules into game audio cluster sets. A game audio cluster set is the set obtained after game audio in the data set is gathered into one class. The whole game music data set forms a plurality of game audio cluster sets, which correspond to the specific classifications of the data set.
By way of example, suppose all game audio of 200 games is collected and each game has on average 100 audio items; this composes a game music data set of 20000 game audio items, whose music style tags are all undetermined. Step S220 of the application performs unsupervised clustering on these 20000 items. After clustering, 100 game audio cluster sets may be obtained, i.e. 100 classes covering the 20000 items, each class corresponding to one game music style. Each game audio cluster set is thus the set corresponding to one game music style.
The application further discloses a specific method for unsupervised clustering.
In one embodiment of the application, the method for performing unsupervised clustering on the game audio in the game music data set to obtain the game audio cluster sets comprises two stages: first, performing unsupervised training on the game audio in the game music data set, and then performing unsupervised clustering on the unsupervised-trained data to obtain the game audio cluster sets.
In one embodiment of the present application, a method for performing unsupervised training on game audio in a game music data set is shown in fig. 3, and fig. 3 schematically shows a flowchart of steps of a method for performing unsupervised training on game audio in one embodiment of the present application. Specifically, the method comprises the steps of S310-S320.
In step S310, two segments of each game audio in the game music data set are randomly clipped, and the two segments are converted into audio feature vectors to form an audio slice pair.
The purpose of unsupervised training of game audio is to highlight the styles and features of different game audios. Since there are very many game audios in the game music data set, if these individual game audios cannot stand out with a specific style by themselves, clustering the entire huge game music data set becomes more difficult. Taking ordinary music as an example, if a song contains multiple music styles it is difficult to cluster, because the difference between that song and other songs is not large. Therefore, to avoid the clustering difficulty caused by small differences between game audios, the application performs unsupervised training on the game audios to highlight the styles and features of different game audios.
The application takes the audio slice pair composed of the two segments as representative of the specific game music style of that game audio. For example, for one game audio in the game music data set, two segments are cut at random, and the style and features of the audio slice pair composed of these two segments then represent the style and features of the whole game audio. Taking the music style of songs as an example, if a song contains both classical elements and ballad elements, the application randomly selects two segments of the song to form an audio slice pair and takes the styles of the two segments as the style of the whole song.
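As a minimal sketch of step S310, assuming numpy and a placeholder log-magnitude-spectrum feature extractor (the application's actual encoder is likely mel-based; `random_slice_pair` and `to_feature_vector` are hypothetical names):

```python
import numpy as np

def random_slice_pair(waveform, slice_len, rng=None):
    """Randomly cut two slices from one game-audio waveform (step S310 sketch)."""
    rng = rng or np.random.default_rng()
    starts = rng.integers(0, len(waveform) - slice_len, size=2)
    return (waveform[starts[0]:starts[0] + slice_len],
            waveform[starts[1]:starts[1] + slice_len])

def to_feature_vector(slice_):
    """Placeholder feature extractor: log magnitude spectrum (stand-in for a mel encoder)."""
    return np.log1p(np.abs(np.fft.rfft(slice_)))

# One game audio -> one audio slice pair of feature vectors
audio = np.random.default_rng(0).standard_normal(22050)   # 1 s of synthetic audio
a, b = random_slice_pair(audio, slice_len=4096, rng=np.random.default_rng(1))
pair = (to_feature_vector(a), to_feature_vector(b))
```

In a real pipeline `to_feature_vector` would be the learned embedding network trained in step S320; here it only fixes the shapes of the pair.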
After the two segments of each game audio are converted into audio feature vectors to form an audio slice pair, the feature distance between the two segments in the pair may still be large, that is, the music styles of the two segments may still differ considerably; this is addressed in step S320.
In step S320, the audio slice pair is input into a multi-class cross entropy contrast loss function (NT-Xent loss) for unsupervised training to obtain an unsupervised-trained game music data set; the multi-class cross entropy contrast loss function is used to reduce the intra-pair feature distance of each audio slice pair and increase the inter-pair feature distance between different audio slice pairs.
The application uses the multi-class cross entropy contrast loss function (NT-Xent loss) to perform unsupervised training on the audio slice pairs; the purpose of training is mainly to reduce the intra-pair feature distance of each audio slice pair and increase the inter-pair feature distance between different audio slice pairs. The feature distance corresponds to the style difference: the smaller the feature distance, the smaller the style difference; the larger the feature distance, the larger the style difference.
Figure 4 schematically illustrates the effect of unsupervised training of game audio in one embodiment of the present application. Referring to fig. 4, for the two segments within an audio slice pair, the multi-class cross entropy contrast loss function (NT-Xent loss) reduces the distance between their audio feature vectors, while for different game audios, owing to the differences between them, it increases the distance between the audio feature vectors of the respective game audios. In this way, similar game audios in the game music data set are aggregated and dissimilar ones are separated.
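The NT-Xent loss described above can be sketched in numpy as follows (a simplified illustration of the loss value only, not a differentiable training implementation; the embedding dimension and batch contents are invented for the example):

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss over a batch of audio-slice pairs.
    z1[i] and z2[i] are the two slice embeddings of game audio i."""
    z = np.concatenate([z1, z2])                      # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity via dot product
    sim = z @ z.T / temperature                       # (2N, 2N) similarity matrix
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    # the positive of embedding i is its partner slice at index i +/- n
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
anchors = rng.standard_normal((8, 16))
positives = anchors + 0.01 * rng.standard_normal((8, 16))   # near-identical slices
negatives = rng.standard_normal((8, 16))                    # unrelated slices
# matching slice pairs give a lower loss than mismatched ones
assert nt_xent_loss(anchors, positives) < nt_xent_loss(anchors, negatives)
```

Minimizing this quantity pulls the two slices of one game audio together and pushes slices of different game audios apart, which is exactly the intra-pair/inter-pair distance behaviour the text describes.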
After the unsupervised training of the game audio in the game music data set is completed, unsupervised clustering can be carried out on the unsupervised training data to obtain the game audio cluster sets. The specific steps are as follows.
In one embodiment of the application, the method for performing unsupervised clustering on unsupervised training data to obtain the game audio cluster set comprises the steps of inputting audio feature vectors corresponding to each audio in the unsupervised training game music data set into a greedy algorithm to perform unsupervised clustering to obtain the game audio cluster set.
A greedy algorithm is an algorithm that always makes the locally best choice at each step when solving a problem. That is, without considering overall optimality, the algorithm obtains a locally optimal solution in some sense. A greedy algorithm therefore cannot obtain the globally optimal solution for every problem, and the key lies in the choice of greedy strategy. The greedy strategy chosen by the application is as follows.
In one embodiment of the present application, a specific method for inputting an audio feature vector corresponding to each audio in an unsupervised training game music dataset into a greedy algorithm for unsupervised clustering is shown in fig. 5, where fig. 5 schematically shows a flowchart of steps of a method for unsupervised clustering unsupervised training data in one embodiment of the present application. Including steps S510-S520.
Step S510: a pair of audio feature vectors with the smallest distance in the game music data set for unsupervised training is selected.
Through steps S310-S320, the game audio in the game music data set has undergone unsupervised training, so that the internal variation of each single game audio is reduced, the difference between different game audios is increased, and the uncertainty of the style of any single game audio is eliminated. The game audio in the game music data set can now be clustered. The clustering method first selects the pair of audio feature vectors with the smallest distance; this pair corresponds to the two game audios whose feature distance is the smallest, one game audio corresponding to one audio feature vector. The purpose of selecting the pair with the smallest distance is to gather those two game audios into one class when they satisfy a certain condition, which is judged in step S520.
Step S520: if the distance between the pair of audio feature vectors with the smallest distance is smaller than the specified threshold value, the pair of audio feature vectors with the smallest distance are gathered into one class to form a game audio aggregation set.
The specified threshold is user-defined, and its size is related to the number of game music style tags: the greater the number of game music style tags, the smaller the specified threshold needs to be. For example, if 100 game music style tags are required, the corresponding game music styles have 100 specific classifications and the threshold should be set relatively small. If only 50 game music style tags are required, the corresponding game music styles have 50 specific classifications, and the threshold is larger than the one for 100 classifications. If more than 100 game music style tags are required, the corresponding threshold is smaller than the one for 100 classifications.
The clustering method of the application judges whether the distance between the pair of audio feature vectors with the smallest distance is smaller than the specified threshold; if so, that pair is gathered into one class to form a game audio cluster set. This step is performed repeatedly until the smallest remaining distance between any pair of audio feature vectors is no longer smaller than the specified threshold, at which point clustering stops.
For example, with 20000 game audios, the pair of audio feature vectors with the smallest distance is found; if its distance is smaller than the specified threshold, the pair is gathered into one class, i.e. the two corresponding game audios are gathered into one class to form a game audio cluster set N. The comparison then continues over the other 19998 game audios: when the next pair with the smallest distance is found and its distance is smaller than the specified threshold, it is gathered into one class to form a game audio cluster set M. If the feature vector of some game audio is closest to a game audio already in cluster set M and that distance is smaller than the specified threshold, the game audio is added to cluster set M, and so on in turn. In this way, game audios that are close in distance, with distances below the specified threshold, are gathered into one class. Under such clustering, the 20000 game audios might for instance yield a cluster set M of 1000 game audios, a cluster set N of 2000 game audios, and 200 cluster sets in total, where the number of game audios in each cluster set varies and is determined by the concrete clustering result. The game audio of the application would then correspond to 200 music style tags.
The number of game audios in a game audio cluster set is at least one. A singleton occurs when, for a given game audio, the distances from all other game audios are greater than or equal to the specified threshold, so that the game audio stands alone as its own class. This situation is generally caused by an insufficient amount of game audio in the game music data set or by game types that are too uniform, and when it occurs the number of samples in the game music data set can be further expanded.
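The greedy threshold clustering of steps S510-S520 can be sketched as follows (a minimal single-linkage variant in numpy; the toy one-dimensional feature vectors and the brute-force loop are for illustration only, not the application's actual implementation):

```python
import numpy as np

def greedy_cluster(features, threshold):
    """Greedy clustering per steps S510-S520: repeatedly merge the closest pair
    of audios/clusters until the smallest remaining distance >= threshold."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        # find the pair of clusters with the smallest inter-cluster distance
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(features[i] - features[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d >= threshold:       # no pair closer than the threshold: stop
            break
        clusters[a] += clusters.pop(b)
    return clusters

feats = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]])
sets = greedy_cluster(feats, threshold=1.0)
# three cluster sets result: {0, 1, 2}, {3, 4} and the singleton {5}
assert sorted(sorted(c) for c in sets) == [[0, 1, 2], [3, 4], [5]]
```

The singleton `{5}` illustrates the case described above: an audio whose distance to every other audio is at least the threshold remains alone in its own class.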
After the game music data sets are clustered, the music style labels of the clustered game audios can be identified. The specific steps are as in step S230.
The step S230 specifically includes: and selecting a plurality of game audio samples from the game audio cluster set, and determining the music style label of the game audio cluster set according to the content correlation of the plurality of game audio samples.
After the game music data set has been clustered into a plurality of game audio cluster sets by unsupervised clustering, each game audio cluster set needs identification and confirmation of its music style label. There are a number of methods for this identification and confirmation.
The present application can use the simplest way: manual confirmation. Several game audio samples are selected from an audio cluster set, the operator listens to these samples through a terminal, and the music style label of the cluster set is then identified and confirmed according to what is heard. This identification and confirmation can follow the operator's own classification standard for game music. For example, the operator may define that when the game audio samples of a certain cluster set all contain soldiers' battle cries or the sound of gunfire, the game music style label of that cluster set is the epic class. The operator can also define labels according to the content of the samples; for example, if the lyrics concern immortal cultivation, the game music style label of the corresponding cluster set may be defined as the immortal cultivation class.
The present application obtains the general classification based on game music style tags shown in fig. 6 by clustering a large number of game audios; however, the specific classification of game audio is not limited thereto, but is determined by the resulting audio cluster sets.
FIG. 6 schematically illustrates a general classification table based on game music style tags in one embodiment of the application.
As can be seen from fig. 6, the classification for game music can be divided into three major categories, including song-based classification, background sound-based classification, and other classification. And the three major classes comprise a plurality of minor classes. The composition of these three general classes will be described in detail below.
Song-based classifications include pop, electronic, rock, metal, rap, disco, Chinese style, light music, etc. The pop class corresponds to pop songs or soundtracks used as game music, often in game promotions, in the background music at the start of a game, or in in-game activities. The electronic class uses electronic songs or soundtracks as game music and is often used in battle scenes. The rock class uses rock songs or soundtracks as game music, often as background music of the game's main interface or within the game. The metal class corresponds to metal songs or soundtracks used as game music, commonly as background music in large game levels. The rap class uses rap songs as game music and is often used in relaxed game scenes. The disco class corresponds to music with very heavy beats used as game music and is often used in music games. The Chinese style class uses strongly Chinese-flavoured music as game music, often in games deeply combined with Chinese culture. The light music class uses soft music as game music, often as background music for different game scenes. Of course, the specific subcategories of the song-based classification are not limited to the above; the application merely illustrates those used relatively often in games.
Based on background sound, game music can be classified into epic, ACG, xianxia, fantasy jazz, chiptune, instrumental, and so on. Specifically: the epic class comprises soundtracks common in war games, such as battle sounds, soldiers' battle cries and the sound of gunfire. The ACG class contains many soundtracks in the style of animation, comics and games, often used at the climax or turning points of a game. The xianxia class, common in martial-arts and immortal-cultivation games, features more graceful music with many accompaniments of traditional instruments such as the flute and xiao. The fantasy jazz class combines wind-instrument elements with looping beats and is typically used as an indefinitely looping soundtrack for a game menu or a fixed scene. The chiptune class is 8-bit music, produced by the sound chips of old-style game consoles. The instrumental class is music played purely by instruments, such as performances on the pipa or dulcimer, often used at the beginning or in the middle of a game. Of course, the specific subcategories of the background-sound-based classification are not limited to the above; the application merely illustrates those used relatively often in games.
The other category can be understood as a catch-all category for all game audio outside the song-based and background-sound-based categories described above. The other category often contains long stretches of near-silent ambience, for example when a game scene is set on a silent night. It also includes pure voice, for example a spoken background introduction at the start of a game or character dubbing, as well as specific sound effects such as doors opening or gunshots. Various animal sounds, in-game riding sounds and the like can likewise be grouped into the other category. Of course, the specific subcategories of the other category are not limited to the above; the application merely illustrates those used relatively often in games.
The application also discloses a method for identifying and confirming the music style label of each game audio collection.
In one embodiment of the present application, a method for determining a music style tag of a game audio collection based on content correlation of a plurality of game audio samples is shown in fig. 7, and fig. 7 schematically illustrates a flowchart of method steps for determining a music style tag of a game audio collection in one embodiment of the present application. Including steps S710-S730.
Step S710: game music features of game audio in the game music dataset are extracted.
The game music features correspond to the specific features of the game music style classes shown in fig. 6. These features may be user-defined. For example, if the user specifies that battle sounds, soldiers' battle cries and the sound of gunfire are the game music features corresponding to the epic music style tag, step S710 extracts battle sounds, battle cries and gunfire as game music features. Game music features are anything that can influence the music style tag of a game, such as flute or piano sounds.
Step S720: inputting the music style labels corresponding to the game music characteristics and the game music characteristics into a machine learning model for training to obtain a label calibration model for predicting the music style labels based on the game music characteristics.
The game music features and the music style labels corresponding to them are preset by the user. For example, if the game music style label corresponding to flute sounds is "immortal cultivation", the application inputs the many flute sounds in the game music data set together with the corresponding "immortal cultivation" label into the machine learning model for training. The machine learning model of the present application may be a model built on a convolutional neural network, a recurrent neural network, or the like. After training on a large number of game music features, a tag calibration model for predicting music style tags from game music features is obtained.
Step S730: and extracting game music characteristics of a plurality of game audio samples in the game audio cluster set, and inputting the game music characteristics of the plurality of game audio samples in the game audio cluster set into a tag calibration model to obtain music style tags of the plurality of game audio samples in the game audio cluster set.
The game music features in step S730 are similar to those in step S710: specific sounds related to the game music style tag. For example, if the game music features extracted from a game audio cluster set are flute sounds, inputting them into the tag calibration model of step S720 yields the music style label "immortal cultivation" for that cluster set.
Through steps S710 to S730, automatic identification and confirmation of the tags can be realized by machine learning, thereby obtaining the specific music style tags of the game audio cluster sets.
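A minimal stand-in for the tag calibration model of steps S710-S730 might look as follows (a nearest-centroid classifier over hand-labelled feature vectors; the two-dimensional "flute-ness"/"gunfire-ness" features and the tag names are invented for the example, and the application's actual model is a trained neural network):

```python
import numpy as np

class TagCalibrationModel:
    """Toy tag calibration model: learns one centroid per music style tag
    from labelled game music features and predicts the nearest centroid."""
    def fit(self, features, tags):
        self.tags = sorted(set(tags))
        self.centroids = np.stack([
            np.mean([f for f, t in zip(features, tags) if t == tag], axis=0)
            for tag in self.tags])
        return self

    def predict(self, feature):
        dists = np.linalg.norm(self.centroids - feature, axis=1)
        return self.tags[int(np.argmin(dists))]

# Hypothetical 2-d "game music features": [flute-ness, gunfire-ness]
train_x = np.array([[0.9, 0.1], [0.8, 0.0], [0.1, 0.9], [0.0, 0.8]])
train_y = ["immortal cultivation", "immortal cultivation", "epic", "epic"]
model = TagCalibrationModel().fit(train_x, train_y)
assert model.predict(np.array([0.85, 0.05])) == "immortal cultivation"
```

Features extracted from a cluster set's samples would be fed to `predict` to label that cluster set, mirroring step S730.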
After the game music style tags of the respective game audio collection are acquired, step S240 is also required.
The step S240 specifically includes: and adding a music style tag to game audio in the game music data set to obtain an audio tag data set.
The specific music style labels of the game audio cluster sets are obtained through step S230, and the music style label of a game audio cluster set is directly used as the music style label of all game audio in that cluster set. For example, if step S230 yields a cluster set of the "immortal cultivation" class containing 100 game audios, then the game music style labels of those 100 game audios are all "immortal cultivation". With this method, music style labels can be added to all game audio in the game music data set, yielding the audio tag data set. The audio tag data set is the collection of all game audio with known music style tags.
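Step S240's propagation of each cluster set's label to every audio it contains can be sketched as follows (the audio ids and tag names are hypothetical):

```python
def build_audio_tag_dataset(cluster_sets, cluster_tags):
    """Step S240 sketch: every audio in a cluster set inherits that set's style tag.
    cluster_sets: list of lists of audio ids; cluster_tags: one tag per set."""
    return {audio_id: tag
            for cluster, tag in zip(cluster_sets, cluster_tags)
            for audio_id in cluster}

clusters = [["a1", "a2", "a3"], ["b1", "b2"]]
tags = ["immortal cultivation", "epic"]
dataset = build_audio_tag_dataset(clusters, tags)
assert dataset == {"a1": "immortal cultivation", "a2": "immortal cultivation",
                   "a3": "immortal cultivation", "b1": "epic", "b2": "epic"}
```

The resulting mapping is the audio tag data set: every game audio now carries a known music style tag and can serve as supervised training data for the steps that follow.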
Through the above steps, the construction of the audio tag data set with music style tags for game music is completed; once constructed, music style tag prediction for game audio can be performed based on the audio tag data set. As shown in FIG. 8, FIG. 8 schematically illustrates a flow chart of the method steps for music style tag prediction for game audio in one embodiment of the present application. Specifically, steps S810-S840 are included.
In step S810: an audio tag dataset and game audio to be categorized are obtained.
The audio tag data set is obtained through steps S210-S240 and may be retrieved directly, whereas the game audio to be classified typically comes from a terminal device. In a particular application, the game audio to be classified may be obtained in various ways. For example, when the user uploads the game audio to be classified through a terminal device, it can be received directly from that device; when the user uploads it to a cloud server, the cloud server can be accessed through the network to obtain it.
In step S820: and obtaining a Mel spectrogram through short-time Fourier transform of game audio with music style tags in the audio tag data set.
The formula of the short-time Fourier transform is as follows:

STFT{x}(t, ω) = ∫ x(τ) w(τ − t) e^(−jωτ) dτ

where t is the frame position and w(n) is the window function, typically a Hanning window; discretizing this expression yields the following formula:

X(m, k) = Σ_{n=0}^{N−1} x(n + mH) w(n) e^(−j2πkn/N)

where N is the window length and H is the hop size.
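A minimal numpy sketch of the short-time Fourier transform just described (Hann window w(n), window length N, hop size H; the mel filterbank that would be applied afterwards to obtain the mel spectrogram is omitted here):

```python
import numpy as np

def stft(x, n_window=1024, hop=512):
    """STFT with Hann window w(n), window length N=n_window and hop size H=hop."""
    w = np.hanning(n_window)
    n_frames = 1 + (len(x) - n_window) // hop
    # frame m covers samples x[mH : mH + N], windowed by w(n)
    frames = np.stack([x[m * hop: m * hop + n_window] * w for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)        # (n_frames, N//2 + 1) complex bins

x = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)   # 440 Hz tone, 1 s @ 22.05 kHz
spec = stft(x)
# energy concentrates in the bin nearest 440 Hz: 440 * 1024 / 22050 ≈ bin 20
```

In practice a library routine such as `librosa.stft` plus a mel filterbank would replace this sketch; it is shown only to make the roles of N, H and w(n) concrete.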
In step S830: and inputting the Mel spectrogram into a preset deep convolutional neural network for training to obtain a network model for predicting the music style label.
In one embodiment of the present application, the method of inputting the mel spectrogram into the preset deep convolutional neural network for training in step S830 includes steps S910-S920, and fig. 9 schematically illustrates a flowchart of the method steps for training the mel spectrogram in one embodiment of the present application.
Step S910: supervised learning is performed on the deep convolutional neural network using the spectrogram set obtained after data set preprocessing, to obtain a suitable weight parameter matrix and offset.
The specific method for supervised learning of the deep convolutional neural network comprises: inputting spectrograms from the spectrogram set into the deep convolutional neural network; performing forward propagation through the deep convolutional neural network to obtain a recognition result; judging whether the recognition result matches the actual music style; if it matches, stopping training; if it does not match, adjusting the weight parameter matrix and offset during back propagation using a stochastic gradient descent algorithm, and inputting spectrograms from the spectrogram set into the deep convolutional neural network again.
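The loop just described (forward propagation, compare with the label, gradient step, repeat) can be sketched with a toy numpy model standing in for the deep convolutional neural network (full-batch gradient descent is used for brevity where the text specifies stochastic gradient descent; all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the spectrogram set: 4-d features, two style classes
X = rng.standard_normal((200, 4))
true_w = np.array([2.0, -1.0, 0.5, 0.0])
y = (X @ true_w > 0).astype(float)

w, b = np.zeros(4), 0.0                      # weight parameter matrix and offset
for epoch in range(200):
    p = 1 / (1 + np.exp(-(X @ w + b)))       # forward propagation -> recognition result
    if np.mean((p > 0.5) == y) > 0.98:       # result matches the labels: stop training
        break
    grad = p - y                             # otherwise adjust w, b via gradient descent
    w -= 0.1 * X.T @ grad / len(X)
    b -= 0.1 * grad.mean()
```

A real implementation would replace the logistic unit with the convolutional network and assign the learned weight matrices and offsets to its layers, as step S920 describes.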
Step S920: the weight parameter matrix and the offset are assigned to each layer of the deep convolutional neural network correspondingly.
The preset deep convolutional neural network can comprise a plurality of convolutional layers, a plurality of pooling layers and a plurality of fully-connected layers, wherein the number of the convolutional layers can be five, and the number of the pooling layers and the fully-connected layers can be three. When a proper weight parameter matrix and offset are obtained, the weight parameter matrix and the offset can be correspondingly assigned to each layer of the deep convolutional neural network, and then the Mel spectrogram can be input into the preset deep convolutional neural network for training.
Among these, many deep convolutional neural networks can be used with the present application. The audio classification model PANN (PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition) can be selected, as can convolutional neural network models such as musicnn or HarmonicCNN. The objective function for prediction with the deep convolutional neural network is a multi-label cross entropy of the form:

ℓ = −Σ_{n=1}^{N} [ y_n · ln f(x_n) + (1 − y_n) · ln(1 − f(x_n)) ]

wherein the audio input is x_n, the model prediction is f(x_n) ∈ [0,1]^K, y_n ∈ {0,1}^K is the corresponding tag vector, and the number of samples is N.
An end-to-end training model is constructed on the spectrogram set obtained after data set preprocessing, with the objective function optimized by gradient descent.
In step S840: and obtaining a Mel spectrogram through short-time Fourier transform of the game audio to be classified, and inputting a network model for music style label prediction to obtain the music style label of the game audio to be classified.
After the construction of the network model for music style tag prediction in step S830 is completed, the game audio to be classified may be input, so as to obtain a music style tag corresponding to the game audio to be classified.
For example, when the mel spectrogram of the game audio to be classified is input into the network model for music style label prediction, and the model finds by comparison that this mel spectrogram is similar to the mel spectrograms corresponding to the "immortal cultivation" music style label, the music style label of the game audio to be classified is determined to be "immortal cultivation".
A specific application of steps S810-S840 of the present application is further described below using the HarmonicCNN convolutional neural network model. As shown in FIG. 10, FIG. 10 schematically illustrates a specific application of music style tag prediction to game audio in one embodiment of the application. It mainly comprises steps a, b and c.
Step a: first, the game audio (waveform) is input and subjected to a short-time Fourier transform (STFT) to obtain a mel spectrogram. The mel spectrogram is then input into the HarmonicCNN convolutional neural network model for training: it is transformed by HarmonicCNN's learnable harmonic filters and then passed through the CNN network to obtain music features.
Step b: a line plot corresponding to the music features is then obtained based on the music features.

Step c: the line plots corresponding to the music features are converted into mel spectrograms, so that the music style labels corresponding to the various mel spectrograms can be obtained, facilitating subsequent prediction.
After the model is obtained through steps a, b and c, the game audio to be classified can be transformed by short-time Fourier transform, the resulting mel spectrogram input into the network model for music style label prediction, and, by comparing the mel spectrograms of step c with the mel spectrogram of the game audio to be classified, the music style label of the matching mel spectrogram can be determined, thus yielding the music style label of the game audio to be classified.
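The comparison step above can be sketched as a nearest-neighbour lookup over labelled spectrogram features (cosine similarity is an assumption here; in practice the trained network model performs this matching, and the feature vectors and tag names are invented for the example):

```python
import numpy as np

def predict_style(query_feat, reference_feats, reference_tags):
    """Predict a music style tag by cosine similarity between the query's
    spectrogram feature and labelled reference features (toy stand-in for the model)."""
    q = query_feat / np.linalg.norm(query_feat)
    refs = reference_feats / np.linalg.norm(reference_feats, axis=1, keepdims=True)
    return reference_tags[int(np.argmax(refs @ q))]

refs = np.array([[1.0, 0.0, 0.2],     # reference feature labelled "immortal cultivation"
                 [0.0, 1.0, 0.1]])    # reference feature labelled "epic"
tags = ["immortal cultivation", "epic"]
assert predict_style(np.array([0.9, 0.1, 0.2]), refs, tags) == "immortal cultivation"
```

The query feature most similar to a labelled reference inherits that reference's music style tag, mirroring the mel-spectrogram comparison described in step S840.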
It should be noted that although the steps of the methods of the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes an embodiment of the apparatus of the present application that can be used to perform the game music style classification method in the above-described embodiment of the present application. Fig. 11 schematically shows a block diagram of a game music style classification apparatus 1100 according to an embodiment of the present application. The method specifically comprises the following steps:
the acquisition module 1110, the acquisition module 1110 is configured to acquire a game music data set, where the game music data set includes game audio;
the clustering module 1120, connected with the acquisition module 1110, is used for performing unsupervised clustering on game audio in the game music data set to obtain a game audio cluster set, wherein the game audio cluster set comprises the game audio gathered into one set after unsupervised clustering;
The recognition module 1130 is connected with the clustering module 1120 and is used for selecting a plurality of game audio samples from the game audio clustering set and determining music style labels of the game audio clustering set according to the content correlation of the plurality of game audio samples;
the adding module 1140 is connected to the identifying module 1130, and is used for adding a music style tag to game audio in the game music data set to obtain an audio tag data set.
In some embodiments of the present application, the game music style classification apparatus 1100 of the present application further comprises:
an audio acquisition module 1110, the audio acquisition module 1110 being configured to acquire an audio tag dataset and game audio to be categorized;
the conversion module is configured to obtain a Mel spectrogram through short-time Fourier transform of game audio with music style tags in the audio tag data set, and obtain the Mel spectrogram through short-time Fourier transform of the game audio to be classified;
the prediction training module is configured to input a mel spectrogram corresponding to the audio tag data set into a preset deep convolutional neural network for training to obtain a network model for music style tag prediction;
The style recognition module 1130 is configured to input mel spectrograms corresponding to the game audio to be classified into a network model for music style tag prediction to obtain a music style tag of the game audio to be classified.
In some embodiments of the present application, the prediction training module is further configured to perform supervised learning on the deep convolutional neural network using the spectrogram set obtained by short-time Fourier transform of the game audio with music style tags in the audio tag data set, so as to obtain a suitable weight parameter matrix and offset, and to assign the weight parameter matrix and offset correspondingly to each layer of the deep convolutional neural network.
In some embodiments of the application, the clustering module 1120 includes an unsupervised training unit and an unsupervised clustering unit.
In some embodiments of the present application, the unsupervised training unit is configured to randomly clip any two pieces of each game audio in the game music dataset, convert the any two pieces of each game audio into audio feature vectors, and then form an audio slice pair;
the unsupervised training unit is further configured to input the audio slice pairs into a multi-class cross entropy contrast loss function for unsupervised training to obtain a game music data set for unsupervised training, wherein the multi-class cross entropy contrast loss function is used for reducing intra-pair feature distances of the audio slice pairs and increasing inter-pair feature distances of the audio slice pairs;
In some embodiments of the present application, the unsupervised clustering unit is configured to input an audio feature vector corresponding to each audio in the game music dataset for unsupervised training into a greedy algorithm for unsupervised clustering, so as to obtain a game audio cluster set.
In some embodiments of the present application, inputting the audio feature vector corresponding to each audio in the unsupervised-trained game music data set into a greedy algorithm for unsupervised clustering includes:
selecting the pair of audio feature vectors with the minimum distance in the unsupervised-trained game music data set;
if the distance between that pair of audio feature vectors is smaller than a specified threshold, merging the pair into one class to form a game audio cluster set.
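The greedy clustering described above — repeatedly selecting the pair of feature vectors (or clusters) with the minimum distance and merging them while that distance stays below a specified threshold — can be sketched as follows; the threshold value and the toy feature vectors are illustrative assumptions.

```python
import numpy as np

def greedy_cluster(vectors, threshold):
    """Greedily merge the closest pair of clusters until the smallest
    remaining inter-cluster distance reaches the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-linkage distance between two clusters.
                d = min(np.linalg.norm(vectors[i] - vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best:
                    best, pair = d, (a, b)
        if best >= threshold:          # closest pair no longer close enough
            break
        a, b = pair
        clusters[a].extend(clusters[b])  # merge the closest pair into one class
        del clusters[b]
    return clusters

# Two tight groups of toy audio feature vectors, far apart from each other.
vecs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
clusters = greedy_cluster(vecs, threshold=1.0)
```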
In some embodiments of the application, the identification module 1130 includes:
a feature extraction unit configured to extract game music features of game audio in the game music data set and to extract game music features of a plurality of game audio samples in the game audio cluster set;
the label training unit is configured to input the game music features and the music style tags corresponding to the game music features into a machine learning model for training, so as to obtain a label calibration model for predicting music style tags based on game music features;
the label prediction unit is configured to input the game music features of a plurality of game audio samples in the game audio cluster set into the label calibration model to obtain the music style tags of the plurality of game audio samples in the game audio cluster set.
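The label calibration workflow above — train a machine learning model on (game music feature, music style tag) pairs, then predict tags for samples from the game audio cluster set — can be illustrated with a simple nearest-centroid classifier standing in for the machine learning model; the 2-D features and the tag names are made up for the example.

```python
import numpy as np

class NearestCentroidTagger:
    """Toy stand-in for the label calibration model: it learns one centroid
    per music style tag and assigns new samples to the nearest centroid."""

    def fit(self, features, tags):
        self.tags = sorted(set(tags))
        self.centroids = np.array([
            np.mean([f for f, t in zip(features, tags) if t == tag], axis=0)
            for tag in self.tags])
        return self

    def predict(self, features):
        # Distance from every sample to every tag centroid.
        d = np.linalg.norm(features[:, None, :] - self.centroids[None], axis=2)
        return [self.tags[i] for i in d.argmin(axis=1)]

# Hypothetical 2-D "game music features" with two made-up style tags.
train_x = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
train_y = ["battle", "battle", "ambient", "ambient"]

model = NearestCentroidTagger().fit(train_x, train_y)
pred = model.predict(np.array([[0.85, 0.15], [0.15, 0.85]]))
```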
Specific details of the game music style classification device provided in the embodiments of the present application have been described in the corresponding method embodiments and are not repeated here.
According to an aspect of the embodiments of the present application, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a game music style classification method as in the above technical solution.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the game music style classification method as in the above technical solution via execution of the executable instructions.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the game music style classification method as in the above technical solution.
Fig. 12 schematically shows a block diagram of a computer system of an electronic device for implementing an embodiment of the application.
It should be noted that, the computer system 1200 of the electronic device shown in fig. 12 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 12, the computer system 1200 includes a central processing unit 1201 (Central Processing Unit, CPU), which can perform various appropriate actions and processes according to a program stored in a read-only memory 1202 (Read-Only Memory, ROM) or a program loaded from a storage section 1208 into a random access memory 1203 (Random Access Memory, RAM). The random access memory 1203 also stores various programs and data necessary for system operation. The central processing unit 1201, the read-only memory 1202, and the random access memory 1203 are connected to each other via a bus 1204. An input/output interface 1205 (I/O interface) is also connected to the bus 1204.
The following components are connected to the input/output interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a cathode ray tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), a speaker, and the like; a storage section 1208 including a hard disk or the like; and a communication section 1209 including a network interface card such as a LAN card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the input/output interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1210 as needed, so that a computer program read therefrom is installed into the storage section 1208 as needed.
In particular, the processes described in the various method flowcharts may be implemented as computer software programs according to embodiments of the application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 1209, and/or installed from the removable medium 1211. When executed by the central processing unit 1201, the computer program performs the various functions defined in the system of the present application.
According to the technical scheme provided by the embodiments of the present application, game audios in the game music data set are first clustered into a plurality of game audio cluster sets; the music style tag of each game audio cluster set is then determined according to the content correlation of a plurality of game audio samples from that set; finally, music style tags are added to all game audios in the game music data set to obtain the audio tag data set. The method of the present application thereby provides a classification method specifically for game music styles and realizes the determination of music style tags for game music, so that users can collect and listen to game music according to the classification corresponding to the music style tags, which greatly improves the user experience of game music and promotes the development of game music.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (9)

1. A game music style classification method, comprising:
acquiring a game music data set, wherein the game music data set comprises game audio;
performing unsupervised clustering on game audio in the game music data set to obtain a game audio aggregation set, wherein the game audio aggregation set comprises game audio aggregated in one set after unsupervised clustering;
selecting a plurality of game audio samples from the game audio cluster set, and determining a music style tag of the game audio cluster set according to the content correlation of the plurality of game audio samples;
adding a music style tag to game audio in the game music data set to obtain an audio tag data set, wherein the music style tag of the game audio is a music style tag of a game audio collection to which the game audio belongs;
the performing unsupervised clustering on game audio in the game music data set to obtain a game audio collection set includes:
randomly clipping any two segments of each game audio in the game music data set, and converting the two segments into audio feature vectors to form an audio slice pair;
inputting the audio slice pairs into a multi-class cross-entropy contrastive loss function for unsupervised training to obtain an unsupervised-trained game music data set, wherein the multi-class cross-entropy contrastive loss function is used to reduce the intra-pair feature distance of each audio slice pair and increase the inter-pair feature distance between different audio slice pairs.
2. The game music style classification method of claim 1, further comprising:
acquiring an audio tag data set and game audio to be classified;
obtaining a Mel spectrogram through short-time Fourier transform of game audio with music style tags in the audio tag data set;
inputting the Mel spectrogram into a preset deep convolutional neural network for training to obtain a network model for predicting music style labels;
and obtaining a mel spectrogram of the game audio to be classified through short-time Fourier transform, and inputting it into the network model for music style tag prediction to obtain the music style tag of the game audio to be classified.
3. The game music style classification method according to claim 2, wherein inputting the mel spectrogram into a preset deep convolutional neural network for training to obtain a network model for music style label prediction, comprising:
performing supervised learning on the deep convolutional neural network by using the set of spectrograms obtained by short-time Fourier transform of the game audio with music style tags in the audio tag data set, to obtain suitable weight parameter matrices and offsets;
and correspondingly assigning a weight parameter matrix and an offset to each layer of the deep convolutional neural network.
4. The game music style classification method of claim 1, wherein performing unsupervised clustering on game audio in the game music dataset to obtain a game audio cluster set, further comprising:
and inputting the audio feature vector corresponding to each audio in the unsupervised-trained game music data set into a greedy algorithm for unsupervised clustering, so as to obtain a game audio cluster set.
5. The game music style classification method of claim 4, wherein inputting the audio feature vector corresponding to each audio in the unsupervised-trained game music data set into a greedy algorithm for unsupervised clustering comprises:
selecting the pair of audio feature vectors with the minimum distance in the unsupervised-trained game music data set;
and if the distance between the pair of audio feature vectors with the minimum distance is smaller than a specified threshold, merging the pair into one class to form a game audio cluster set.
6. The game music style classification method of claim 1, wherein determining a music style tag for the set of game audio categories based on content correlation of the plurality of game audio samples comprises:
extracting game music characteristics of game audio in the game music data set;
inputting the game music characteristics and the music style labels corresponding to the game music characteristics into a machine learning model for training to obtain a label calibration model for predicting the music style labels based on the game music characteristics;
and extracting game music characteristics of a plurality of game audio samples in the game audio cluster set, and inputting the game music characteristics of the plurality of game audio samples in the game audio cluster set into the label calibration model to obtain the music style tags of the plurality of game audio samples in the game audio cluster set.
7. A game music style classification device, comprising:
the acquisition module is used for acquiring a game music data set, wherein the game music data set comprises game audio;
the clustering module is connected with the acquisition module and is used for performing unsupervised clustering on game audios in the game music data set to obtain a game audio clustering set, wherein the game audio clustering set comprises game audios which are clustered in a set after unsupervised clustering;
The identification module is connected with the clustering module and is used for selecting a plurality of game audio samples from the game audio clustering set and determining music style labels of the game audio clustering set according to the content correlation of the game audio samples;
the adding module is connected with the identification module and is used for adding a music style tag to the game audio in the game music data set to obtain an audio tag data set, wherein the music style tag of the game audio is a music style tag of a game audio aggregation set to which the game audio belongs;
the performing unsupervised clustering on game audio in the game music data set to obtain a game audio collection set includes:
randomly clipping any two segments of each game audio in the game music data set, and converting the two segments into audio feature vectors to form an audio slice pair;
inputting the audio slice pairs into a multi-class cross-entropy contrastive loss function for unsupervised training to obtain an unsupervised-trained game music data set, wherein the multi-class cross-entropy contrastive loss function is used to reduce the intra-pair feature distance of each audio slice pair and increase the inter-pair feature distance between different audio slice pairs.
8. A computer readable medium having stored thereon a computer program which, when executed by a processor, implements the game music style classification method of any of claims 1 to 6.
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the game music style classification method of any of claims 1 to 6 via execution of the executable instructions.
CN202110615605.1A 2021-06-02 2021-06-02 Game music style classification method and device, readable medium and electronic equipment Active CN113813609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110615605.1A CN113813609B (en) 2021-06-02 2021-06-02 Game music style classification method and device, readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113813609A CN113813609A (en) 2021-12-21
CN113813609B true CN113813609B (en) 2023-10-31

Family

ID=78923795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110615605.1A Active CN113813609B (en) 2021-06-02 2021-06-02 Game music style classification method and device, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113813609B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114036341B (en) * 2022-01-10 2022-03-29 腾讯科技(深圳)有限公司 Music tag prediction method and related equipment
CN114464152B (en) * 2022-04-13 2022-07-19 齐鲁工业大学 Music genre classification method and system based on visual transformation network
CN114917585A (en) * 2022-06-24 2022-08-19 四川省商投信息技术有限责任公司 Sound effect generation method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197282A (en) * 2018-01-10 2018-06-22 腾讯科技(深圳)有限公司 Sorting technique, device and the terminal of file data, server, storage medium
CN108363769A (en) * 2018-02-07 2018-08-03 大连大学 The method for building up of semantic-based music retrieval data set
CN109918535A (en) * 2019-01-18 2019-06-21 华南理工大学 Music automatic marking method based on label depth analysis
CN110188235A (en) * 2019-05-05 2019-08-30 平安科技(深圳)有限公司 Music style classification method, device, computer equipment and storage medium
CN110491393A (en) * 2019-08-30 2019-11-22 科大讯飞股份有限公司 The training method and relevant apparatus of vocal print characterization model
CN111859010A (en) * 2020-07-10 2020-10-30 浙江树人学院(浙江树人大学) Semi-supervised audio event identification method based on depth mutual information maximization
CN112861758A (en) * 2021-02-24 2021-05-28 中国矿业大学(北京) Behavior identification method based on weak supervised learning video segmentation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7777125B2 (en) * 2004-11-19 2010-08-17 Microsoft Corporation Constructing a table of music similarity vectors from a music similarity graph



Similar Documents

Publication Publication Date Title
Nam et al. Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from bach
CN113813609B (en) Game music style classification method and device, readable medium and electronic equipment
EP3803846B1 (en) Autonomous generation of melody
Wu et al. Automatic audio chord recognition with MIDI-trained deep feature and BLSTM-CRF sequence decoding model
US8392414B2 (en) Hybrid audio-visual categorization system and method
Zhang Music style classification algorithm based on music feature extraction and deep neural network
CA3194565A1 (en) System and method for recommending semantically relevant content
Pachet et al. Analytical features: a knowledge-based approach to audio feature generation
CN109829482A (en) Song training data processing method, device and computer readable storage medium
CN108766451B (en) Audio file processing method and device and storage medium
CN110136689A (en) Song synthetic method, device and storage medium based on transfer learning
Middlebrook et al. Song hit prediction: Predicting billboard hits using spotify data
CN111816170B (en) Training of audio classification model and garbage audio recognition method and device
CN110851650B (en) Comment output method and device and computer storage medium
Wu Research on automatic classification method of ethnic music emotion based on machine learning
Yang Research on music content recognition and recommendation technology based on deep learning
CN113506553A (en) Audio automatic labeling method based on transfer learning
KR101801250B1 (en) Method and system for automatically tagging themes suited for songs
Zhang Research on music classification technology based on deep learning
Li et al. Music genre classification based on fusing audio and lyric information
Wen et al. Parallel attention of representation global time–frequency correlation for music genre classification
CN111026908A (en) Song label determination method and device, computer equipment and storage medium
CN114117096A (en) Multimedia data processing method and related equipment
CN115206270A (en) Training method and training device of music generation model based on cyclic feature extraction
Wijaya et al. Song Similarity Analysis With Clustering Method On Korean Pop Song

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant