CN113813609A - Game music style classification method and device, readable medium and electronic equipment - Google Patents


Info

Publication number
CN113813609A
Authority
CN
China
Prior art keywords
game
audio
music
style
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110615605.1A
Other languages
Chinese (zh)
Other versions
CN113813609B (en)
Inventor
彭博 (Peng Bo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110615605.1A
Publication of CN113813609A
Application granted
Publication of CN113813609B
Current legal status: Active

Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/50: Controlling the output signals based on the game progress
    • A63F 13/54: Controlling the output signals based on the game progress involving acoustic signals, e.g. for simulating revolutions per minute [RPM] dependent engine sounds in a driving game or reverberation against a virtual wall
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Abstract

This application belongs to the technical field of artificial intelligence and relates in particular to the automatic identification of game music styles. It discloses a game music style classification method and apparatus, a readable medium, and an electronic device. The method obtains a set of game audio clusters by performing unsupervised clustering on the game audio in a game music data set, then selects a number of game audio samples from each game audio cluster set and determines the cluster's music style label from the content relevance of those samples. In this way, music style labels can be determined for game music, and users can collect and listen to game music by the category corresponding to each music style label, which greatly improves the user experience of game music and promotes its development.

Description

Game music style classification method and device, readable medium and electronic equipment
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a game music style classification method and apparatus, a computer-readable medium, and an electronic device.
Background
Music style is a label reflecting the overall characteristics of a piece of music; a piece may be classified into styles such as country, jazz, rock, heavy metal, punk, and so on. At present, music style classification is generally done by the type of the music.
For game music, however, much of it is incidental soundtrack music without a singer, so it is difficult to classify effectively by directly applying the usual music style categories. In practice, users often want to collect and listen to game music by specific category, but no such categories of game music are available, which greatly degrades the user experience of game music and hinders its development.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present application and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
This application aims to provide a game music style classification method and apparatus, a computer-readable medium, and an electronic device, so as to overcome, at least to some extent, the technical problem in the related art that game music styles cannot be classified.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of an embodiment of the present application, there is provided a game music style classification method, including:
obtaining a game music data set, the game music data set including game audio;
carrying out unsupervised clustering on the game audio in the game music data set to obtain a game audio cluster set, wherein a game audio cluster set comprises game audios gathered into one set after unsupervised clustering;
selecting a plurality of game audio samples from the game audio cluster set, and determining music style labels of the game audio cluster set according to the content relevance of the plurality of game audio samples;
and adding a music style label to the game audio in the game music data set to obtain an audio label data set.
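Abstracting away the concrete models, the flow of steps S220 to S240 above can be sketched as follows; the function and variable names are hypothetical, and the cluster-labeling callable stands in for the content-relevance step described in the claims:

```python
def label_clusters(clusters, label_from_samples, n_samples=3):
    """Sketch of steps S230/S240: `clusters` is the game audio cluster set
    produced by unsupervised clustering (step S220), as lists of audio ids;
    `label_from_samples` is a hypothetical callable that infers one music
    style label from a few sampled audios."""
    audio_label_dataset = {}
    for cluster in clusters:
        samples = cluster[:n_samples]              # select several game audio samples
        style_label = label_from_samples(samples)  # label from content relevance
        for audio_id in cluster:                   # add the label to every audio
            audio_label_dataset[audio_id] = style_label
    return audio_label_dataset
```

Here `label_from_samples` could be a human annotator or a trained model such as the label calibration model described in the embodiments.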
According to an aspect of an embodiment of the present application, there is provided a game music style classification apparatus including:
an acquisition module for acquiring a game music data set, the game music data set including game audio;
the clustering module is connected with the acquisition module and is used for carrying out unsupervised clustering on the game audio in the game music data set to obtain a game audio cluster set, a game audio cluster set comprising game audios gathered into one set after unsupervised clustering;
the identification module is connected with the clustering module and used for selecting a plurality of game audio samples from the game audio clustering set and determining the music style labels of the game audio clustering set according to the content relevance of the plurality of game audio samples;
and the adding module is connected with the identification module and is used for adding music style labels to the game audio in the game music data set to obtain an audio label data set.
In some embodiments of the present application, based on the above technical solutions, the apparatus further includes:
an audio acquisition module configured to acquire the audio tag dataset and game audio to be categorized;
a transformation module configured to obtain a Mel frequency spectrogram from the game audio with music style labels in the audio label data set through short-time Fourier transformation, and obtain a Mel frequency spectrogram from the game audio to be classified through short-time Fourier transformation;
the prediction training module is configured to input a Mel frequency spectrogram corresponding to the audio tag data set into a preset deep convolution neural network for training to obtain a network model for predicting the music style tag;
the style identification module is configured to input a Mel frequency spectrogram corresponding to the game audio to be classified into the network model for music style label prediction to obtain a music style label of the game audio to be classified.
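As a rough sketch of the transformation module, a mel spectrogram can be computed from raw audio with a short-time Fourier transform followed by a triangular mel filterbank. The window size, hop length, and number of mel bands below are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def mel_spectrogram(signal, sr=22050, n_fft=1024, hop=512, n_mels=40):
    """Mel spectrogram via short-time Fourier transform (illustrative parameters)."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # power spectrogram
    # triangular mel filterbank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        if center > left:
            fb[m - 1, left:center] = (np.arange(left, center) - left) / (center - left)
        if right > center:
            fb[m - 1, center:right] = (right - np.arange(center, right)) / (right - center)
    return power @ fb.T                                      # shape: (frames, n_mels)
```

In practice an audio library (for example librosa's `melspectrogram`) provides an equivalent transform; the resulting frames-by-mels matrix is the image that would be fed to the deep convolutional neural network.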
In some embodiments of the present application, based on the above technical solution, the prediction training module is further configured to perform supervised learning on a deep convolutional neural network using the spectrogram set obtained by short-time Fourier transform of the game audio with music style labels in the audio label data set, so as to obtain suitable weight parameter matrices and biases, and to assign those weight parameter matrices and biases to the corresponding layers of the deep convolutional neural network.
In some embodiments of the present application, based on the above technical solution, the clustering module includes an unsupervised training unit and an unsupervised clustering unit.
In some embodiments of the present application, based on the above technical solution, the unsupervised training unit is configured to randomly clip any two segments of each game audio in the game music data set, convert those two segments into audio feature vectors, and form an audio slice pair;
the unsupervised training unit is further configured to input the audio slice pairs into a multi-class cross-entropy contrastive loss function for unsupervised training, resulting in an unsupervised-trained game music data set; the multi-class cross-entropy contrastive loss function serves to reduce the intra-pair feature distance of each audio slice pair and to increase the inter-pair feature distance between different audio slice pairs.
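The multi-class cross-entropy contrastive loss described here behaves like the NT-Xent loss used in contrastive representation learning (this equivalence is an assumption for illustration): treating each slice's own pair as the correct class pulls intra-pair features together and pushes inter-pair features apart. A numpy sketch with an assumed temperature:

```python
import numpy as np

def contrastive_loss(z_a, z_b, temperature=0.1):
    """z_a[i] and z_b[i] are feature vectors of two slices clipped from the
    same game audio i. Each row's positive (correct class) is its own pair,
    i.e. the diagonal of the similarity matrix."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature   # similarity of every slice-a to every slice-b
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))  # cross-entropy with diagonal targets
```

Minimizing this loss makes matched slice pairs more similar than mismatched ones, which is exactly the intra-pair/inter-pair distance behavior the unit requires.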
In some embodiments of the application, based on the above technical solution, the unsupervised clustering unit is configured to input the audio feature vector corresponding to each audio in the unsupervised trained game music data set into a greedy algorithm for unsupervised clustering, so as to obtain a game audio cluster set.
In some embodiments of the present application, based on the above technical solutions, the process of inputting the audio feature vector corresponding to each audio in the unsupervised-trained game music data set into a greedy algorithm for unsupervised clustering includes:
selecting the pair of audio feature vectors with the minimum distance in the unsupervised-trained game music data set;
and, if the distance between that pair of audio feature vectors is smaller than a specified threshold, grouping them into one class, thereby forming a game audio cluster set.
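A minimal pure-Python sketch of such a greedy procedure (centroid linkage and repeated merging are assumptions; the text above only specifies grouping the closest pair while its distance is below the threshold):

```python
import math

def greedy_cluster(vectors, threshold):
    """Greedy unsupervised clustering sketch: repeatedly merge the closest
    pair of clusters while their centroid distance stays below `threshold`."""
    clusters = [[i] for i in range(len(vectors))]

    def centroid(cluster):
        dim = len(vectors[0])
        return [sum(vectors[i][d] for i in cluster) / len(cluster)
                for d in range(dim)]

    while len(clusters) > 1:
        # find the pair of clusters with the minimum centroid distance
        d, i, j = min((math.dist(centroid(clusters[a]), centroid(clusters[b])), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        if d >= threshold:       # stop once the closest pair is too far apart
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

Unlike k-means, the number of clusters is not fixed in advance; it falls out of the distance threshold, which matches the patent's threshold-driven grouping.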
In some embodiments of the present application, based on the above technical solutions, the identification module includes:
a feature extraction unit configured to extract game music features of game audio in a game music data set and to extract game music features of a plurality of game audio samples in the game audio cluster;
the label training unit is configured to input the game music features and music style labels corresponding to the game music features into a machine learning model for training to obtain a label calibration model for predicting the music style labels based on the game music features;
a tag prediction unit configured to input game music features of a plurality of game audio samples in the game audio cluster into the tag targeting model, resulting in music style tags for the game audio cluster.
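The patent does not name a specific machine learning model for the label calibration step. As a deliberately simple hypothetical stand-in, a nearest-centroid classifier over game music features follows the same fit/predict pattern:

```python
import math

class NearestCentroidLabeler:
    """Hypothetical label calibration model: fit on labeled game music
    features, then predict a music style label for new feature vectors
    by the nearest class centroid."""
    def fit(self, features, labels):
        buckets = {}
        for feat, label in zip(features, labels):
            buckets.setdefault(label, []).append(feat)
        # per-label mean feature vector
        self.centroids = {label: [sum(col) / len(feats) for col in zip(*feats)]
                          for label, feats in buckets.items()}
        return self

    def predict(self, feature):
        return min(self.centroids,
                   key=lambda label: math.dist(self.centroids[label], feature))
```

Any supervised classifier over the extracted game music features could play this role; the centroid version merely keeps the sketch self-contained.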
According to an aspect of the embodiments of the present application, there is provided a computer readable medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a game music style classification method as in the above technical solution.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the game music style classification method as in the above technical solution via executing the executable instructions.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the game music style classification method as in the above technical solution.
In the technical solution provided by the embodiments of this application, the game audios in the game music data set are first grouped by unsupervised clustering into a number of game audio cluster sets. The music style label of each cluster set is then determined from the content relevance of several game audio samples drawn from it, and finally that label is added to all game audios in the data set to obtain the audio label data set. This yields a classification label system tailored to game music styles, so that users can collect and listen to game music by the category corresponding to each music style label, which greatly improves the user experience of game music and promotes its development.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the solution of the present application applies.
Fig. 2 schematically shows a flowchart of steps of a game music style classification method in an embodiment of the present application.
FIG. 3 is a flow chart that schematically illustrates steps of a method for unsupervised training of game audio in one embodiment of the present application.
Fig. 4 schematically shows an effect diagram of unsupervised training of game audio in one embodiment of the present application.
Fig. 5 schematically illustrates a flowchart of method steps for unsupervised clustering of unsupervised trained data in one embodiment of the present application.
FIG. 6 schematically illustrates a table of general categories based on game music style labels in one embodiment of the present application.
FIG. 7 schematically illustrates a flow chart of method steps for determining a music style label for a game audio cluster set in one embodiment of the present application.
FIG. 8 is a flow diagram that schematically illustrates steps in a method for music style label prediction for game audio in one embodiment of the present application.
Fig. 9 is a flow chart schematically illustrating the steps of a method for training a mel-frequency spectrum diagram in an embodiment of the present application.
Fig. 10 schematically shows a specific application of music style label prediction to game audio in one embodiment of the present application.
Fig. 11 schematically shows a block diagram of a game music style classification apparatus according to an embodiment of the present application.
Fig. 12 schematically shows a block diagram of a computer system of an electronic device for implementing an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, spanning both hardware-level and software-level technologies. Its basic infrastructure includes technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML), a major direction of AI software technology, is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and more. It specifically studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of AI. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
This application provides a technology for automatically classifying the style of game music within the field of artificial intelligence, and it realizes two main functions. First, it builds a classification label system specifically for game music styles; second, it automatically identifies and determines the music style label of game audio to be classified. The application uses a number of machine learning techniques, whose detailed technical content is disclosed below.
Music has distinct styles and can be classified by them, and each person's preference among music styles differs. Music style labels describe the style of a song; for example, styles include rap, pop, folk, and so on, and a style is determined mainly by the way the corresponding singer performs the song, the rhythmic composition of the whole piece, and similar factors.
Game music, by contrast, is mostly incidental soundtrack music with no singer and no clear music style, so it is difficult to label by directly borrowing the music style labels of ordinary songs. Moreover, solutions for music style tagging must rely on high-quality, known label annotation data; without music style labels, game music is therefore difficult to classify and identify well enough to satisfy users' preferences for different types of game music styles.
The solution provided in this application addresses these problems and completes a label system for game music styles.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the solution of the present application applies.
As shown in fig. 1, system architecture 100 may include a terminal device 110, a network 120, and a server 130. The terminal device 110 may include various electronic devices such as a smart phone, a tablet computer, a notebook computer, and a desktop computer. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal device 110 and server 130, such as a wired communication link or a wireless communication link.
The steps corresponding to the game music style classification method of the present application may be completed in the server 130, or may be completed in the terminal device 110. Specifically, the server 130 or the terminal device 110 may implement the identification and addition of the music style label of the game audio by the following steps. The method comprises the following specific steps: after the server 130 or the terminal device 110 obtains the game music data set, unsupervised clustering is performed on the game audio in the game music data set to obtain a game audio cluster set; selecting a plurality of game audio samples from a game audio cluster set, and determining music style labels of the game audio cluster set according to the content relevance of the plurality of game audio samples; and adding music style labels to game audio in the game music data set to obtain an audio label data set.
After the audio label data set is obtained through the above steps, it can be used for training, after which the music style label of any game audio can be identified automatically. At this point, the terminal device 110 or the server 130 obtains the game audio to be classified, and the corresponding music style label is then identified automatically by a network model for music style label prediction. The overall method therefore has two parts. The first is building a music style label system specifically for game audio, since the music style labels of ordinary songs cannot be applied to game music directly. The second is automatically identifying the music style label of game audio to be classified according to that label system: after the labels are obtained, machine learning is performed on a large number of game audios with known music style labels to obtain a network model for predicting music style labels, and feeding the game audio to be classified into this model yields its music style label automatically.
The system architecture in the embodiments of the present application may have any number of terminal devices, networks, and servers, according to implementation needs. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by both the terminal device 110 and the server 130, which is not particularly limited in this application.
The following describes the method for classifying the style of game music provided by the present application in detail with reference to the specific embodiments.
According to an aspect of an embodiment of the present application, there is provided a method for classifying a style of game music, as shown in fig. 2, where fig. 2 schematically shows a flowchart of steps of the method for classifying a style of game music in an embodiment of the present application. The method corresponding to the present application is applied to the server 130, may also be applied to the terminal device 110, or may also be executed by both the terminal device 110 and the server 130. The method specifically comprises the steps S210-S240.
Wherein, step S210 specifically includes: a game music data set is obtained, the game music data set including game audio.
The game music data set is a collection of multiple pieces of game music. Game music may be any music related to a game, for example background music played when a game is opened, prompt sounds for click interactions on the game interface, in-game prompt music, music played for game special effects, in-game dialogue, and so on. The game music in the data set can come from different games; by obtaining the base packages of different games, all game audio content in each base package can be obtained directly.
In a particular application, the game music data set may be acquired in various ways. For example, when the game music data set is stored locally on the server 130 or the terminal device 110 that executes the method, all game music can be read directly from local storage to compose the game music data set. Alternatively, when the game music data set is stored on a network cloud communicatively connected to the server 130 or the terminal device 110, all game-related audio can be requested from the cloud as the game music data set. Or, when the game music data set is stored on another server 130 or another terminal device 110, a request can be made to that server or device and, once authorization is obtained, all game-related audio can be acquired as the game music data set.
Any method for obtaining the game music data set can be used in the present invention, and the present embodiment does not limit this.
Wherein, step S220 specifically includes: carrying out unsupervised clustering on the game audio in the game music data set to obtain a game audio cluster set, wherein a game audio cluster set comprises game audios gathered into one set after unsupervised clustering;
After a large number of game audios are collected from individual games to form the game music data set, they need to be clustered; if every game audio corresponded to its own music style label, the classification of game music would carry no reference value. Therefore, the game audio in the game music data set is clustered without supervision, so that the audios are first grouped by some rule to obtain the game audio cluster sets. A game audio cluster set is the set of game audios from the game music data set that are clustered into one category. The whole game music data set thus forms a number of game audio cluster sets, which correspond to the specific categories of the data set.
For example, suppose all the game audio of 200 games is acquired, with an average of 100 game audios per game; this composes a game music data set of 20000 game audios whose music style labels are undetermined. Step S220 of this application performs unsupervised clustering on these 20000 game audios. After clustering, 100 game audio cluster sets may be obtained, giving 100 corresponding categories for the 20000 game audios, each category corresponding to one game music style; each game audio cluster set is thus the set corresponding to one game music style.
The application will further disclose specific methods for unsupervised clustering.
In an embodiment of the application, a method for unsupervised clustering of game audio in a game music data set to obtain a game audio cluster set specifically includes two aspects, a first aspect is to unsupervised train the game audio in the game music data set, and a second aspect is to unsupervised cluster the unsupervised trained data to obtain the game audio cluster set.
In one embodiment of the present application, a method for unsupervised training of game audio in a game music data set is shown in fig. 3, and fig. 3 schematically shows a flowchart of steps of the method for unsupervised training of game audio in one embodiment of the present application. Specifically, the method comprises steps S310 to S320.
In step S310, any two segments of each game audio in the game music data set are randomly clipped, and the two segments of each game audio are converted into audio feature vectors to form an audio slice pair.
The game audio is unsupervised-trained to highlight the style and characteristics of each game audio. Since there are very many game audios in a game music data set, if the individual game audios cannot themselves exhibit a distinct style, clustering the entire huge game music data set becomes more difficult. Taking common music as an example, if one song mixes multiple music styles, clustering that song is difficult because it does not differ much from other songs. To avoid clustering difficulty caused by small differences between game audios, the present application performs unsupervised training on the game audios to highlight the styles and characteristics of different game audios.
The present application takes the audio slice pair corresponding to the two segments as the representation of the specific music style of a game audio. For example, for one game audio in the game music data set, two segments of the game audio are randomly clipped, so that the style and characteristics of the game audio are represented by the audio slice pair consisting of these two segments. Taking the music style of a song as an example again: if a song contains elements of both classical and ballad styles, what the present application does is randomly select two segments of the song to form an audio slice pair, and the styles of these two segments then stand for the style of the whole song.
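The random clipping and feature conversion of step S310 can be sketched as follows; the patent does not fix a particular feature extractor, so the coarse log-energy band summary below is purely an illustrative assumption:

```python
import numpy as np

def to_feature_vector(segment, n_bands=16):
    """Summarize a clip's magnitude spectrum into coarse log-energy bands.
    (Illustrative stand-in: the patent does not fix the feature extractor.)"""
    spectrum = np.abs(np.fft.rfft(segment))
    bands = np.array_split(spectrum, n_bands)
    return np.log1p(np.array([b.sum() for b in bands]))

def make_slice_pair(waveform, segment_len, rng):
    """Step S310 sketch: randomly clip two segments of one game audio and
    convert each into an audio feature vector, forming an audio slice pair."""
    starts = rng.integers(0, len(waveform) - segment_len, size=2)
    return [to_feature_vector(waveform[s:s + segment_len]) for s in starts]

rng = np.random.default_rng(0)
audio = rng.standard_normal(22050 * 10)          # ten seconds of stand-in audio
pair = make_slice_pair(audio, segment_len=22050, rng=rng)
```

Each game audio in the data set would contribute one such pair per training pass.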
After any two segments of each game audio are converted into audio feature vectors to form an audio slice pair, it may still happen that the feature distance between the two segments within the pair is large, meaning the music style difference between the two segments is still large in terms of the corresponding music style; this is what step S320 addresses.
In step S320, the audio slice pairs are input into a multi-class cross entropy contrastive loss function (NT-Xent loss) for unsupervised training to obtain an unsupervised-trained game music data set, wherein the multi-class cross entropy contrastive loss function is used to reduce the intra-pair feature distance of each audio slice pair and to increase the inter-pair feature distance between different audio slice pairs.
The method performs unsupervised training on the audio slice pairs using the multi-class cross entropy contrastive loss function (NT-Xent loss); the aim of the training is to reduce the intra-pair feature distance of each audio slice pair and to increase the inter-pair feature distance between different pairs. Style difference corresponds to feature distance: the smaller the feature distance, the smaller the style difference, and the larger the feature distance, the larger the style difference. This unsupervised training reduces the style variation within each game audio in the game music data set and increases the style differences between game audios, facilitating the clustering of the game music data set in the subsequent steps.
Fig. 4 schematically shows an effect diagram of unsupervised training of game audio in one embodiment of the present application. With reference to fig. 4, for the two segments within an audio slice pair, the multi-class cross entropy contrastive loss function (NT-Xent loss) reduces the distance between the audio feature vectors corresponding to any two segments of the same game audio; for different game audios, owing to the differences between them, the loss function increases the distance between the audio feature vectors corresponding to the respective game audios. The result is that similar game audio in the game music data set converges together while dissimilar game audio separates.
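The pull-together / push-apart behaviour of the NT-Xent loss described above can be sketched in plain numpy; the batch layout (rows 2i and 2i+1 forming one audio slice pair) and the temperature value are illustrative assumptions:

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent loss over a batch of audio slice pairs.  z has shape (2N, d),
    where rows 2i and 2i+1 are the two slices of one game audio; the batch
    layout and temperature tau are illustrative assumptions."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # cosine similarities
    sim = (z @ z.T) / tau
    n = len(z)
    np.fill_diagonal(sim, -np.inf)                       # exclude self-similarity
    pos = np.arange(n) ^ 1                               # index of the partner slice
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n), pos].mean()

rng = np.random.default_rng(1)
matched = np.repeat(rng.standard_normal((4, 8)), 2, axis=0)  # identical slice pairs
unrelated = rng.standard_normal((8, 8))                      # no real pairs
```

On a batch where the two slices of each pair are identical, the loss is markedly lower than on a batch of unrelated vectors, which is exactly the intra-pair versus inter-pair behaviour the training targets.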
After the unsupervised training of the game audio in the game music data set is completed, unsupervised clustering can be performed on the unsupervised training data to obtain a game audio cluster set. The specific steps are as follows.
In an embodiment of the application, the method for performing unsupervised clustering on unsupervised training data to obtain the game audio cluster set includes inputting audio feature vectors corresponding to each audio in the unsupervised training game music data set into a greedy algorithm for performing unsupervised clustering to obtain the game audio cluster set.
A greedy algorithm is an algorithm that, when solving a problem, always makes the choice that seems best at the moment. That is, instead of considering the global optimum, the algorithm obtains a locally optimal solution in some sense. Therefore, the greedy algorithm cannot obtain the globally optimal solution for every problem; the key is the selection of the greedy strategy. The selection of the greedy strategy is specified in the following steps.
In an embodiment of the present application, a specific method for unsupervised clustering by inputting an audio feature vector corresponding to each audio in an unsupervised training game music data set into a greedy algorithm is provided, as shown in fig. 5, and fig. 5 schematically illustrates a flowchart of steps of the method for unsupervised clustering by unsupervised training data in an embodiment of the present application. Comprising step S510-step S520.
Step S510: and selecting a pair of audio feature vectors with the minimum distance in the unsupervised training game music data set.
Through steps S310 to S320, the game audio in the game music data set has been subjected to unsupervised training, which reduces the internal style variation of each single game audio, increases the differences between game audios, and eliminates the situation that the style of a single game audio is uncertain. The game audio in the game music data set can be clustered at this point. The clustering mode is to select the pair of audio feature vectors with the minimum distance; this pair corresponds to the two game audios whose feature distance is the smallest, one game audio corresponding to one audio feature vector. The purpose of selecting the closest pair is to group the two game audios together when the pair satisfies a certain condition. The specific condition is determined in step S520.
Step S520: and if the distance between the pair of audio feature vectors with the minimum distance is smaller than a specified threshold value, grouping the pair of audio feature vectors with the minimum distance into a class to form a game audio cluster set.
Wherein the specified threshold is user-defined, and its size is related to the number of game music style labels: the greater the number of game music style labels, the smaller the specified threshold needs to be. For example, if 100 game music style labels are required, there are 100 specific categories of corresponding game music styles, and the specified threshold can be set relatively small. If only 50 game music style labels are required, there are 50 specific categories, and the specified threshold is larger than that for 100 categories. If more than 100 game music style labels are desired, the corresponding specified threshold is smaller than that for 100 categories.
The clustering mode of the present application is to judge whether the distance between the pair of audio feature vectors with the minimum distance is smaller than the specified threshold; if so, the pair is grouped into one class to form a game audio cluster set. This step is repeated until the distance between the closest pair of audio feature vectors is no longer smaller than the specified threshold, at which point clustering stops.
For example, suppose there are 20000 game audios. The pair of audio feature vectors with the smallest distance is found; if their distance is smaller than the specified threshold, the pair is grouped into one class, so the two corresponding game audios are grouped together to form a game audio cluster set N. The comparison continues over the remaining 19998 game audios: whenever the closest pair of audio feature vectors is found and its distance is smaller than the specified threshold, the pair is grouped into a class, forming, say, a game audio cluster set M. The loop continues; when the feature vector of some game audio is closest to a game audio already in cluster set M and that distance is smaller than the specified threshold, the game audio is added to cluster set M, and so on. In this way, game audios that are closer together than the specified threshold can be grouped into one category. If, through this clustering, 1000 of the 20000 game audios are grouped into cluster set M, 2000 into cluster set N, and 200 game audio cluster sets are formed in total, where the number of game audios in each cluster set is not fixed but determined by the specific clustering result, then the game audios of the present application have 200 music style labels.
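The threshold-bounded greedy merging of steps S510 and S520 can be sketched as follows; the patent does not specify how the distance of merged clusters is measured, so the centroid linkage used here is an assumption:

```python
import numpy as np

def greedy_cluster(vectors, threshold):
    """Greedy merge (steps S510-S520 sketch): repeatedly join the closest
    pair of clusters while their centroid distance stays below the
    threshold; centroid linkage is an assumption, not fixed by the patent."""
    vectors = [np.asarray(v, float) for v in vectors]
    clusters = [[i] for i in range(len(vectors))]
    centroids = [v.copy() for v in vectors]
    while len(clusters) > 1:
        best, best_d = None, np.inf
        for i in range(len(clusters)):          # find the closest pair
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if d < best_d:
                    best, best_d = (i, j), d
        if best_d >= threshold:                 # closest pair too far apart: stop
            break
        i, j = best
        clusters[i].extend(clusters.pop(j))     # group the pair into one class
        centroids.pop(j)
        centroids[i] = np.mean([vectors[k] for k in clusters[i]], axis=0)
    return clusters

# two tight groups far apart: audios 0, 1, 4 cluster together, as do 2 and 3
points = [[0, 0], [0.1, 0], [5, 5], [5, 5.1], [0, 0.1]]
result = greedy_cluster(points, threshold=1.0)
```

With a smaller threshold, fewer merges succeed and more cluster sets (hence more music style labels) remain, matching the threshold-versus-label-count relationship described above.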
A game audio cluster set may contain as little as one game audio; this occurs when the distances between all other game audios and that game audio are greater than or equal to the specified threshold, meaning the game audio stands alone as a category. This situation generally indicates that there are not enough game audios in the game music data set or that the game types in the data set are too homogeneous; when it occurs, the number of samples of the game music data set can be further expanded.
When clustering of the game music data sets is completed, the music style labels can be identified for the clustered game audio. The specific steps are as in step S230.
Wherein, step S230 specifically includes: a plurality of game audio samples are selected from the game audio cluster set, and the music style labels of the game audio cluster set are determined according to the content relevance of the plurality of game audio samples.
After the game music data sets are subjected to unsupervised clustering to obtain a plurality of game audio cluster sets, identification and confirmation of music style labels are required to be carried out on each game audio cluster set. There are many methods for identifying and confirming the specific identification.
The present application can use the simplest means: manual confirmation. A plurality of game audio samples are selected from an audio cluster set, the operator listens to them through the corresponding terminal, and then identifies and confirms the music style label of the audio cluster set according to what is heard. The identification can follow user-defined classification standards specific to game music. For example, the user can define that when a plurality of game audio samples in a certain audio cluster set carry soldier shouting or weapon gunshots, the game music style label of that audio cluster set is defined as the war class. The user may also define labels according to the content of the game audio samples, for example according to the theme of the lyrics, and name the game music style tag of the corresponding audio cluster set accordingly.
The present application obtains a common classification based on game music style labels as shown in fig. 6 by clustering a large number of game audios, but the specific classification of game audios is not limited thereto, but is determined by the finally obtained audio cluster set.
FIG. 6 schematically illustrates a table of general categories based on game music style labels in one embodiment of the present application.
As can be seen from fig. 6, the classification of game music can be divided into three major categories, including song-based classification, background-sound-based classification, and other classifications. These three main classes include several subclasses. The composition of these three main categories will be described in detail below.
Song-based categories include pop, electronic, rock, metal, rap, disco, Chinese style, light music, and so on. The pop class uses popular songs or scores as game music, often in game publicity, in the background music when starting a game, or in game activities. The electronic class uses electronic songs or scores as game music, often in strongly rhythmic music. The rock class uses rock songs or scores as game music, often as background music of the game's main interface or in-game background music. The metal class uses metal songs or scores as game music, often as background music in large level stages of a game. The rap class uses rap songs as game music, often applied to fast-paced, light-hearted game scenes. The disco class uses music with a very heavy beat as game music, often in music games. The Chinese style class uses music with a strong Chinese flavor as game music, often in games deeply combined with Chinese culture. The light music class uses soft music as game music, often as background music for different game scenes. Of course, the specific sub-categories of the song-based classification are not limited to the above; the present application only lists those used more commonly in games.
The classification based on background sounds can divide game music into the war class, the ACG class, the immortal-hero class, the fantasy class, the chiptune class, the instrumental class, and so on. Specifically: the war class includes the usual accompaniment of war games, such as war drums, soldier shouting, and weapon gunshots. The ACG class includes accompaniment for many anime-style games, tending toward the hot-blooded, commonly used in the climax or turning point of a game. The immortal-hero class, common in mysterious and fantastical immortal-themed games, is lingering and melodious, often with accompaniment by traditional instruments such as the flute. The fantasy class contains orchestral elements plus beat cycles, an indefinitely looping soundtrack commonly used for game menus or fixed scenes. Chiptune is 8-bit music, that is, music made with the sound chips of old-fashioned game consoles. The instrumental class is pure instrument-playing accompaniment, such as the lute or dulcimer, often used at the beginning or middle of a game. Of course, the specific sub-categories of the classification based on background sounds are not limited to the above; the present application only lists those used more frequently in games.
The classification based on other categories can be understood as a catch-all class, covering all game audio other than the song-based and background-sound-based classes described above. Other categories often include long stretches of quiet music, such as where the game background is a still, silent night. They also include pure human voice, such as a narrated background introduction at the beginning of a game, and character voice dubbing. They further include special sound effects, such as a door opening or a gunshot. Likewise, various animal sounds, the sounds of mounts in the game, and so on can each be placed into other categories. Of course, the specific sub-categories of other categories are not limited to the above; the present application only lists those used more commonly in games.
There are various methods for identifying and confirming the music style label for each game audio cluster set, and the above method is only the simplest one, and the present application also discloses the following method for identifying and confirming the music style label for each game audio cluster set.
In one embodiment of the present application, a method for determining a music style label for a game audio cluster based on content relevance of a plurality of game audio samples is provided, as shown in fig. 7, and fig. 7 schematically illustrates a flowchart of the method steps for determining a music style label for a game audio cluster in one embodiment of the present application. Including step S710-step S730.
Step S710: game music features of game audio in the game music data set are extracted.
The game music features correspond to the specific features of the corresponding game music style sub-categories shown in fig. 6. These features may be user-defined. For example, when the user defines that war drums, soldier shouting, and weapon gunshots are the game music features corresponding to the war-class music style label, step S710 extracts the war drum sound, soldier shouting, and weapon gunshot sound as the game music features. Game music features are sounds that bear on the music style label of a game, such as the sound of a vertical bamboo flute, a transverse flute, a piano, and the like.
Step S720: and inputting the game music characteristics and the music style labels corresponding to the game music characteristics into a machine learning model for training to obtain a label calibration model for predicting the music style labels based on the game music characteristics.
The game music features and the music style labels corresponding to them are preset by the user. For example, for the "immortal" game music style label, the corresponding game music feature may be the flute sound; the present application then inputs the many flute sounds in the game music data set together with the corresponding "immortal" label into the machine learning model for training. The machine learning model of the present application may be a model constructed based on a convolutional neural network, a recurrent neural network, or the like. After a large number of game music features are trained, a label calibration model for predicting music style labels based on game music features is obtained.
Step S730: extracting game music characteristics of a plurality of game audio samples in the game audio clustering set, and inputting the game music characteristics of the plurality of game audio samples in the game audio clustering set into the label calibration model to obtain the music style labels of the plurality of game audio samples in the game audio clustering set.
The game music features in step S730 are similar to those in step S710: specific sounds related to the game music style labels. For example, if the game music feature extracted from the game audio cluster set is the sound of a vertical bamboo flute or a transverse flute, and this feature is input into the label calibration model of step S720, the music style label of the game audio cluster set is "immortal".
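A minimal stand-in for the label calibration model of steps S710 to S730 can be sketched as follows; the patent trains a machine learning model on game music features, so the nearest-centroid rule and the two-dimensional toy features below are illustrative assumptions only:

```python
import numpy as np

class LabelCalibrationModel:
    """Stand-in for the label calibration model of steps S710-S730.
    The patent trains a machine learning model on game music features;
    the nearest-centroid rule here is an illustrative assumption."""
    def __init__(self):
        self.centroids, self.labels = [], []

    def fit(self, features, labels):
        for label in set(labels):
            members = [f for f, l in zip(features, labels) if l == label]
            self.centroids.append(np.mean(members, axis=0))
            self.labels.append(label)

    def predict(self, feature):
        dists = [np.linalg.norm(np.asarray(feature) - c) for c in self.centroids]
        return self.labels[int(np.argmin(dists))]

model = LabelCalibrationModel()
# hypothetical 2-d features: (flute energy, drum energy)
model.fit([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]],
          ["immortal", "immortal", "war", "war"])
```

A flute-dominated sample then maps to the "immortal" label and a drum-dominated sample to the "war" label, mirroring the flute-sound example above.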
The automatic identification and confirmation of the tags through the method from step S710 to step S730 can be realized through machine learning, so as to complete the acquisition of the music style tags specific to the game audio cluster.
After the game music style labels of the respective game audio clusters are acquired, step S240 is also required.
Wherein, step S240 specifically includes: and adding music style labels to game audio in the game music data set to obtain an audio label data set.
The specific music style label of each game audio cluster set is obtained in step S230, and at this time, the music style label of the game audio cluster set is directly used as the music style label of all game audio in the game audio cluster set. For example, if a game audio cluster set of the "immortal" class is obtained through step S230, and there are 100 game audios in the game audio cluster set, then the game music style labels of the 100 game audios are all "immortals". By the method, music style labels can be added to all game audio in the game music data set, so that the audio label data set is obtained. The audio tag dataset is a collection of tags corresponding to all known musical styles of game audio.
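The label propagation of step S240 amounts to a simple lookup from cluster set to members; the audio identifiers and labels below are hypothetical:

```python
def build_audio_tag_dataset(cluster_sets, cluster_labels):
    """Step S240 sketch: every game audio inherits the music style label of
    the cluster set it belongs to."""
    return {audio_id: cluster_labels[c]
            for c, members in enumerate(cluster_sets)
            for audio_id in members}

# hypothetical cluster sets and their labels obtained in step S230
audio_tags = build_audio_tag_dataset([["a1", "a2"], ["a3"]],
                                     ["immortal", "war"])
```

The resulting mapping is exactly the audio tag data set: every game audio paired with a known music style label.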
The audio tag data set obtained by constructing the music style tag of the game music is completed through the steps, and the music style tag of the game audio can be predicted based on the audio tag data set after the audio tag data set is constructed. As shown in fig. 8, fig. 8 is a flow chart schematically illustrating steps of a method for music style label prediction for game audio in an embodiment of the present application. Specifically, steps S810-S840 are included.
In step S810: and acquiring an audio tag data set and game audio to be classified.
The audio tag data set is obtained through steps S210 to S240 and can be obtained directly. The game audio to be classified generally comes from a terminal device. In a specific application, the game audio to be classified can be obtained in various ways. Illustratively, when the user uploads the game audio to be classified through a terminal device, it can be received directly by the terminal device. Alternatively, when the user uploads the game audio to be classified to a cloud server, the cloud server can be accessed through the network to obtain it.
In step S820: and (4) carrying out short-time Fourier transform on the game audio with the music style labels in the audio label data set to obtain a Mel frequency spectrogram.
Wherein, the short-time Fourier transform takes the form:

    X(t, k) = Σ_n x(n) · w(n − t) · e^(−i2πkn/N)

where t is the frame offset and w(n) is a window function, typically a Hanning window. Expressed frame by frame, the following formula is obtained by conversion:

    X(m, k) = Σ_{n=0}^{N−1} x(n + mH) · w(n) · e^(−i2πkn/N)

where N is the window length and H is the hop size.
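The framed transform above (window w(n) of length N, hop size H) can be sketched in plain numpy; real pipelines would typically use a library such as librosa, and the band-pooled "mel-like" summary below is a simplification, not a true mel filter bank:

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Framed STFT: X(m, k) = sum_n x(n + mH) w(n) exp(-i 2 pi k n / N),
    with a Hann window; the frame and hop sizes are illustrative."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[m * hop:m * hop + n_fft] * w for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)            # shape (n_frames, n_fft // 2 + 1)

def mel_spectrogram_db(x, n_mels=64, **kw):
    """Log-power spectrogram pooled into coarse bands -- a simplification of
    a true mel filter bank (the triangular filters are omitted for brevity)."""
    power = np.abs(stft(x, **kw)) ** 2
    bands = np.array_split(power, n_mels, axis=1)
    return 10 * np.log10(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

tone = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)   # one second at 22.05 kHz
mel = mel_spectrogram_db(tone)
```

Each row of the result is one frame of the spectrogram, which is the two-dimensional input the following training step consumes.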
In step S830: and inputting the Mel frequency spectrogram into a preset deep convolution neural network for training to obtain a network model for predicting the music style label.
In an embodiment of the present application, the method for inputting the mel frequency spectrum diagram into the preset deep convolutional neural network for training in step S830 includes steps S910 to S920, and fig. 9 schematically illustrates a flowchart of the method for training the mel frequency spectrum diagram in an embodiment of the present application.
Step S910: and (3) carrying out supervised learning on the deep convolutional neural network by using a spectrogram set obtained after data set preprocessing to obtain a proper weight parameter matrix and offset.
The specific method for supervised learning of the deep convolutional neural network comprises: inputting the spectrograms in the spectrogram set into the deep convolutional neural network; performing forward propagation through the deep convolutional neural network to obtain a recognition result; judging whether the recognition result is consistent with the actual music style; if consistent, stopping training; if inconsistent, adjusting the weight parameter matrix and the offset with a stochastic gradient descent algorithm during back propagation, and inputting the spectrograms in the spectrogram set into the deep convolutional neural network again.
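The forward-propagate / compare / back-propagate loop of step S910 can be sketched with a single linear layer standing in for the deep convolutional network; the learning rate, epoch count, and full-batch gradient step are illustrative simplifications of the stochastic gradient descent named above:

```python
import numpy as np

def train_supervised(X, Y, lr=0.5, epochs=200):
    """Step S910 sketch: forward propagation, comparison with the actual
    style, and gradient-descent adjustment of the weight parameter matrix W
    and offset b.  A single linear layer stands in for the deep CNN, and the
    full-batch update is a simplification of stochastic gradient descent."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((X.shape[1], Y.shape[1])) * 0.01
    b = np.zeros(Y.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(X @ W + b)))    # forward propagation (sigmoid per tag)
        grad = p - Y                          # mismatch with the actual style
        W -= lr * X.T @ grad / len(X)         # back propagation: adjust the weights
        b -= lr * grad.mean(axis=0)           # ... and the offset
    return W, b

# toy spectrogram summaries (2 features) with two style tags
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
Y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
W, b = train_supervised(X, Y)
```

After training, the learned W and b classify the toy samples correctly, which is the "recognition result consistent with the actual music style" stopping criterion in miniature.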
Step S920: and correspondingly assigning the weight parameter matrix and the offset to each layer of the deep convolutional neural network.
The preset deep convolutional neural network may include a plurality of convolutional layers, pooling layers, and fully-connected layers; for example, five convolutional layers with three pooling layers and three fully-connected layers. The five convolutional layers and the pooling layers appear alternately, and once a suitable weight parameter matrix and offset are obtained, they can be assigned to the corresponding layers of the deep convolutional neural network, so that the Mel frequency spectrogram can be input into the preset deep convolutional neural network for training.
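A quick shape walk-through of such a five-convolution, three-pooling stack follows; the 128×128 input, kernel sizes, and "same" padding are illustrative assumptions, not values fixed by the patent:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Standard convolution output size: (size + 2*pad - kernel)//stride + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# hypothetical 5-conv / 3-pool stack over a 128x128 mel spectrogram "image";
# kernel sizes, padding, and input size are illustrative assumptions
h = w = 128
layers = [("conv", 3), ("pool", 2), ("conv", 3), ("pool", 2),
          ("conv", 3), ("conv", 3), ("conv", 3), ("pool", 2)]
for kind, k in layers:
    if kind == "conv":
        h, w = conv_out(h, k, pad=1), conv_out(w, k, pad=1)  # 'same' padding
    else:                                                    # 2x2 max pooling
        h, w = h // k, w // k
# the resulting h x w feature map is what the three fully-connected layers consume
```

Each pooling layer halves the spatial size while the padded convolutions preserve it, so the spectrogram shrinks gradually before the fully-connected layers.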
There are a wide variety of deep convolutional neural networks that may be used with the present application. The audio classification model PANN (PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition) can be selected, and convolutional neural network models such as musicnn or Harmonic CNN can also be selected. The objective function predicted using the deep convolutional neural network is as follows:
    L = −(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} [ y_{n,k} · log f_k(x_n) + (1 − y_{n,k}) · log(1 − f_k(x_n)) ]

where x_n is the audio input, f(x_n) ∈ [0,1]^K is the model prediction over K tags, y_n ∈ {0,1}^K is the corresponding label, and N is the number of samples.
By optimizing this objective function with gradient descent on the spectrogram set obtained after data set preprocessing, the model can be trained end to end.
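Under the symbols above (labels y_n, predictions f(x_n) ∈ [0,1]^K over N samples), the objective can be sketched as a multi-label binary cross entropy; this standard form is a reconstruction consistent with the stated symbols, not a verbatim copy of the patent's formula:

```python
import numpy as np

def multilabel_bce(y_true, y_pred, eps=1e-12):
    """Mean multi-label binary cross entropy between labels y_n and
    predictions f(x_n) in [0,1]^K over N samples -- a standard form
    consistent with the symbols described above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(np.sum(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred), axis=1))
```

A perfect prediction yields a near-zero loss, while an uninformative 0.5 prediction on every tag yields a loss of K·log 2, which is the quantity gradient descent drives down during end-to-end training.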
In step S840: and obtaining a Mel frequency spectrogram by short-time Fourier transform of the game audio to be classified, and inputting a network model for music style label prediction to obtain the music style label of the game audio to be classified.
After the building of the network model for music style label prediction in step S830 is completed, the game audio to be classified may be input, so as to obtain the music style label corresponding to the game audio to be classified.
For example, the Mel frequency spectrogram of the game audio to be classified is input into the network model for music style label prediction; the network model finds that this Mel frequency spectrogram is similar to the Mel frequency spectrograms corresponding to the "immortal" music style label, and therefore determines the music style label of the game audio to be classified as "immortal".
The following further illustrates an embodiment of a specific application of steps S810-S840 of the present application through a convolutional neural network model of harmonicCNN. As shown in fig. 10, fig. 10 schematically shows a specific application of music style label prediction to game audio in an embodiment of the present application. Mainly comprises steps a, b and c.
Step a: first, a game audio waveform (Waveform) is input, and a Mel frequency Spectrogram is obtained by short-time Fourier transform (STFT) of the game audio. The Mel frequency spectrogram is then input into the Harmonic CNN convolutional neural network model for training: it is transformed by the learnable harmonic filters of Harmonic CNN and then passed through the CNN network to obtain music features.
Step b: and finally, obtaining a line drawing corresponding to the music characteristics based on the music characteristics.
Step c: and converting the line graph corresponding to the music characteristics into a Mel frequency spectrogram, so that the music style labels corresponding to various Mel frequency spectrograms can be obtained, and subsequent prediction is facilitated.
After the models are obtained through the steps a, b and c, a Mel frequency spectrogram can be obtained through short-time Fourier transform of the game audio to be classified, a network model for music style label prediction is input, and according to the comparison between the Mel frequency spectrogram in the step c and the Mel frequency spectrogram of the game audio to be classified, the music style label corresponding to the Mel frequency spectrogram consistent with the Mel frequency spectrogram of the game audio to be classified can be determined, so that the music style label of the game audio to be classified is obtained.
It should be noted that although the various steps of the methods in this application are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the shown steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
The following describes embodiments of the apparatus of the present application, which may be used to perform the method for classifying a genre of game music in the above-described embodiments of the present application. Fig. 11 schematically shows a block diagram of a game music style classification apparatus 1100 according to an embodiment of the present application. The method specifically comprises the following steps:
an obtaining module 1110, wherein the obtaining module 1110 is configured to obtain a game music data set, where the game music data set includes game audio;
the clustering module 1120 is connected with the acquiring module 1110, and the clustering module 1120 is used for carrying out unsupervised clustering on game audio in the game music data set to obtain a game audio cluster set, wherein the game audio cluster set comprises the game audio which is aggregated in a set after unsupervised clustering;
the identifying module 1130, where the identifying module 1130 is connected to the clustering module 1120, and is configured to select a plurality of game audio samples from the game audio cluster set, and determine a music style tag of the game audio cluster set according to content correlations of the plurality of game audio samples;
an adding module 1140, the adding module 1140 is connected to the identifying module 1130, and is configured to add a music style tag to the game audio in the game music data set, so as to obtain an audio tag data set.
In some embodiments of the present application, the apparatus 1100 for classifying a style of game music further comprises:
an audio acquisition module 1110, wherein the audio acquisition module 1110 is configured to acquire an audio tag data set and game audio to be classified;
the conversion module is configured to obtain a Mel frequency spectrogram from the game audio with the music style labels in the audio label data set through short-time Fourier transform, and obtain a Mel frequency spectrogram from the game audio to be classified through short-time Fourier transform;
the prediction training module is configured to input the Mel frequency spectrogram corresponding to the audio tag data set into a preset deep convolution neural network for training to obtain a network model for predicting the music style tag;
the style recognition module 1130, the style recognition module 1130 is configured to input the mel frequency spectrum map corresponding to the game audio to be classified into the network model for music style label prediction to obtain the music style label of the game audio to be classified.
In some embodiments of the present application, the prediction training module is further configured to perform supervised learning on the deep convolutional neural network using a spectrogram set obtained by short-time Fourier transform of the game audio with music style labels in the audio label data set, to obtain a suitable weight parameter matrix and bias; and to assign the weight parameter matrix and the bias to the corresponding layers of the deep convolutional neural network.
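The supervised step above — learning a weight parameter matrix and offset (bias term) from labeled spectrograms — can be sketched with a deliberately simplified stand-in: softmax regression on flattened spectrograms in place of a deep convolutional network. The network architecture, learning rate, and epoch count are not given in the patent, so everything below is illustrative.

```python
import numpy as np

def train_classifier(spectrograms, labels, n_classes, lr=0.1, epochs=200):
    """Toy stand-in for the patent's supervised step: learn one weight
    matrix W and bias b by softmax regression on flattened spectrograms.
    A real implementation would learn such (W, b) for every layer of a
    deep convolutional neural network via backpropagation."""
    X = np.stack([s.ravel() for s in spectrograms])
    X = (X - X.mean()) / (X.std() + 1e-8)        # simple global normalization
    rng = np.random.default_rng(0)
    W = rng.normal(0.0, 0.01, (X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]                # one-hot style labels
    for _ in range(epochs):
        logits = X @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)        # softmax probabilities
        grad = (p - Y) / len(X)                  # cross-entropy gradient
        W -= lr * X.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

The returned `(W, b)` pair corresponds to the "weight parameter matrix and bias" the patent assigns to a layer after training.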
In some embodiments of the present application, clustering module 1120 includes an unsupervised training unit and an unsupervised clustering unit.
In some embodiments of the present application, the unsupervised training unit is configured to randomly crop two arbitrary segments from each game audio in the game music data set, convert the two segments into audio feature vectors, and form audio slice pairs;
the unsupervised training unit is further configured to input the audio slice pairs into a multi-class cross-entropy contrastive loss function for unsupervised training to obtain an unsupervised-trained game music data set, where the multi-class cross-entropy contrastive loss function reduces the intra-pair feature distance of an audio slice pair and increases the inter-pair feature distance between different audio slice pairs.
In some embodiments of the present application, the unsupervised clustering unit is configured to input the audio feature vector corresponding to each audio in the unsupervised-trained game music data set into a greedy algorithm for unsupervised clustering, so as to obtain a game audio cluster set.
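A minimal NumPy sketch of such a multi-class cross-entropy contrastive loss is given below, in the style of NT-Xent. The temperature value is an assumption, and a real system would minimize this loss with gradient descent over an encoder network; here it is only evaluated so its pull/push behavior on slice pairs is visible.

```python
import numpy as np

def ntxent_loss(z_a, z_b, temperature=0.1):
    """Multi-class cross-entropy contrastive loss over a batch of audio
    slice pairs: row i of z_a and row i of z_b are two slices cropped
    from the same game track (a positive pair); every other slice in
    the batch acts as a negative."""
    z = np.concatenate([z_a, z_b])                    # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine geometry
    sim = z @ z.T / temperature                       # similarity logits
    np.fill_diagonal(sim, -np.inf)                    # never pair a slice with itself
    n = len(z_a)
    # the positive target for slice i is its partner from the same track
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_denom = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(log_denom - sim[np.arange(2 * n), targets]))
```

Minimizing this loss pulls the two slices of a pair together (small intra-pair feature distance) and pushes slices from different pairs apart (large inter-pair feature distance), matching the role the patent assigns to the loss function.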
In some embodiments of the present application, inputting the audio feature vector corresponding to each audio in the unsupervised-trained game music data set into a greedy algorithm for unsupervised clustering includes:
selecting the pair of audio feature vectors with the smallest distance in the unsupervised-trained game music data set;
and, if the distance between that pair of audio feature vectors is smaller than a specified threshold, grouping the pair into one class to form a game audio cluster set.
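The greedy procedure described above can be sketched as repeatedly merging the closest pair of clusters until the smallest remaining distance exceeds the threshold. Merging by centroid distance is an assumption for illustration, since the patent does not fix the linkage criterion.

```python
import numpy as np

def greedy_cluster(features, threshold):
    """Greedy unsupervised clustering: repeatedly merge the pair of
    clusters whose centroid distance is smallest, stopping once that
    smallest distance is no longer below the threshold."""
    clusters = [[i] for i in range(len(features))]
    centroids = [features[i].astype(float) for i in range(len(features))]
    while len(clusters) > 1:
        best, best_d = None, threshold
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                if d < best_d:            # strictly below the threshold
                    best, best_d = (a, b), d
        if best is None:                  # nothing left to merge
            break
        a, b = best
        clusters[a].extend(clusters.pop(b))
        members = np.stack([features[i] for i in clusters[a]])
        centroids[a] = members.mean(axis=0)  # recompute merged centroid
        centroids.pop(b)
    return clusters
```

Each returned cluster corresponds to one game audio cluster set whose members are later sampled for style labeling.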
In some embodiments of the present application, the identifying module 1130 includes:
a feature extraction unit configured to extract game music features of the game audio in the game music data set and of the plurality of game audio samples in the game audio cluster set;
a label training unit configured to input the game music features and the music style labels corresponding to those features into a machine learning model for training, obtaining a label calibration model that predicts music style labels from game music features;
and a label prediction unit configured to input the game music features of the plurality of game audio samples in the game audio cluster set into the label calibration model, obtaining the music style labels of the plurality of game audio samples in the game audio cluster set.
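A minimal stand-in for the label calibration model is sketched below: nearest-centroid classification over game music features, with a majority vote across the sampled cluster members to pick the cluster's style label. The patent does not specify which machine learning model is used, so this choice (and the example labels) is illustrative.

```python
import numpy as np

class LabelCalibrationModel:
    """Nearest-centroid stand-in for the label calibration model: fit on
    (feature vector, style label) pairs, then predict a style label for
    each sampled cluster member."""
    def fit(self, features, labels):
        self.labels_ = sorted(set(labels))
        F, L = np.asarray(features, float), np.asarray(labels)
        self.centroids_ = np.stack([F[L == c].mean(axis=0)
                                    for c in self.labels_])
        return self

    def predict(self, features):
        F = np.asarray(features, float)
        d = np.linalg.norm(F[:, None] - self.centroids_[None], axis=2)
        return [self.labels_[i] for i in d.argmin(axis=1)]

def cluster_style_label(model, sample_features):
    """Majority vote over the sampled members of one game audio cluster set."""
    votes = model.predict(sample_features)
    return max(set(votes), key=votes.count)
```

Given a handful of samples drawn from a cluster, `cluster_style_label` returns the style label the identifying module would assign to the whole cluster set.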
The specific details of the game music style classification apparatus provided in the embodiments of the present application have been described in detail in the corresponding method embodiments, and are not described herein again.
According to an aspect of the embodiments of the present application, there is provided a computer readable medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a game music style classification method as in the above technical solution.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions for the processor; wherein the processor is configured to execute the game music style classification method as in the above technical solution via executing the executable instructions.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the game music style classification method as in the above technical solution.
Fig. 12 schematically shows a block diagram of a computer system of an electronic device for implementing an embodiment of the present application.
It should be noted that the computer system 1200 of the electronic device shown in fig. 12 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 12, the computer system 1200 includes a Central Processing Unit (CPU) 1201, which can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. The RAM 1203 also stores various programs and data necessary for system operation. The CPU 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An Input/Output (I/O) interface 1205 is also connected to the bus 1204.
The following components are connected to the input/output interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a display device such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a network interface card such as a local area network card or a modem. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the input/output interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1210 as needed, so that a computer program read from it can be installed into the storage section 1208 as needed.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211. The computer program, when executed by the central processing unit 1201, performs various functions defined in the system of the present application.
In the technical solutions provided by the embodiments of the present application, game audio in the game music data set is grouped by unsupervised clustering into a plurality of game audio cluster sets according to music style. The method first obtains a game music data set, then performs unsupervised clustering on its game audio to obtain game audio cluster sets, determines the music style label of each game audio cluster set according to the content relevance of a plurality of game audio samples selected from it, and finally adds music style labels to all game audio in the game music data set to obtain an audio label data set. This yields a classification method tailored to game music styles and enables the music style labels of game music to be determined, so that users can collect and listen to game music by the categories corresponding to the music style labels, which greatly improves the user experience of game music and promotes the development of game music.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for classifying a style of game music, comprising:
obtaining a game music data set, the game music data set including game audio;
performing unsupervised clustering on the game audio in the game music data set to obtain a game audio cluster set, wherein the game audio cluster set comprises game audio grouped into one set by the unsupervised clustering;
selecting a plurality of game audio samples from the game audio cluster set, and determining music style labels of the game audio cluster set according to the content relevance of the plurality of game audio samples;
and adding a music style label to the game audio in the game music data set to obtain an audio label data set.
2. The method of classifying a style of game music according to claim 1, further comprising:
acquiring an audio tag data set and game audio to be classified;
obtaining a Mel frequency spectrogram by short-time Fourier transform of game audio with music style labels in the audio label data set;
inputting the Mel frequency spectrogram into a preset deep convolution neural network for training to obtain a network model for predicting a music style label;
and obtaining a Mel frequency spectrogram by short-time Fourier transform of the game audio to be classified, and inputting the Mel frequency spectrogram into the network model for music style label prediction to obtain the music style label of the game audio to be classified.
3. The method of claim 2, wherein inputting the Mel frequency spectrogram into a preset deep convolutional neural network for training to obtain a network model for music style label prediction comprises:
carrying out supervised learning on the deep convolutional neural network by utilizing a spectrogram set obtained by short-time Fourier transform of the game audio with music style labels in the audio label data set, to obtain a suitable weight parameter matrix and bias;
and assigning the weight parameter matrix and the bias to the corresponding layers of the deep convolutional neural network.
4. The method of classifying a style of game music as claimed in claim 1, wherein unsupervised clustering of game audio in said game music data set to obtain a set of game audio clusters comprises:
randomly cropping two arbitrary segments from each game audio in the game music data set, converting the two segments into audio feature vectors, and forming audio slice pairs;
and inputting the audio slice pairs into a multi-class cross-entropy contrastive loss function for unsupervised training to obtain an unsupervised-trained game music data set, wherein the multi-class cross-entropy contrastive loss function is used for reducing the intra-pair feature distance of the audio slice pairs and increasing the inter-pair feature distance between the audio slice pairs.
5. The method of classifying a style of game music according to claim 4, wherein unsupervised clustering of game audio in said game music data set is performed to obtain a set of game audio clusters, further comprising:
and inputting the audio characteristic vector corresponding to each audio in the unsupervised training game music data set into a greedy algorithm for unsupervised clustering to obtain a game audio cluster set.
6. The method of classifying a style of game music according to claim 5, wherein the step of inputting the audio feature vector corresponding to each audio in the unsupervised trained game music data set into a greedy algorithm for unsupervised clustering comprises:
selecting a pair of audio feature vectors with the minimum distance in the unsupervised training game music data set;
and if the distance between the pair of audio characteristic vectors with the minimum distance is smaller than a specified threshold value, grouping the pair of audio characteristic vectors with the minimum distance into a class to form a game audio cluster set.
7. The method of claim 1, wherein determining the music style label for the set of game audio clusters based on the content relevance of the plurality of game audio samples comprises:
extracting game music characteristics of game audio in the game music data set;
inputting the game music characteristics and the music style labels corresponding to the game music characteristics into a machine learning model for training to obtain a label calibration model for predicting the music style labels based on the game music characteristics;
extracting game music characteristics of a plurality of game audio samples in the game audio cluster set, and inputting the game music characteristics of the plurality of game audio samples into the label calibration model to obtain the music style label of the game audio cluster set.
8. A game music style classification apparatus, comprising:
an acquisition module for acquiring a game music data set, the game music data set including game audio;
the clustering module is connected with the acquisition module and is used for carrying out unsupervised clustering on the game audio in the game music data set to obtain a game audio cluster set, and the game audio cluster set comprises the game audio which is gathered in one set after unsupervised clustering;
the identification module is connected with the clustering module and used for selecting a plurality of game audio samples from the game audio clustering set and determining the music style labels of the game audio clustering set according to the content relevance of the plurality of game audio samples;
and the adding module is connected with the identification module and is used for adding music style labels to the game audio in the game music data set to obtain an audio label data set.
9. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method of classifying a style of game music according to any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the game music style classification method of any one of claims 1 to 7 via execution of the executable instructions.
CN202110615605.1A 2021-06-02 2021-06-02 Game music style classification method and device, readable medium and electronic equipment Active CN113813609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110615605.1A CN113813609B (en) 2021-06-02 2021-06-02 Game music style classification method and device, readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110615605.1A CN113813609B (en) 2021-06-02 2021-06-02 Game music style classification method and device, readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113813609A true CN113813609A (en) 2021-12-21
CN113813609B CN113813609B (en) 2023-10-31

Family

ID=78923795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110615605.1A Active CN113813609B (en) 2021-06-02 2021-06-02 Game music style classification method and device, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113813609B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060107823A1 (en) * 2004-11-19 2006-05-25 Microsoft Corporation Constructing a table of music similarity vectors from a music similarity graph
CN108197282A (en) * 2018-01-10 2018-06-22 腾讯科技(深圳)有限公司 Sorting technique, device and the terminal of file data, server, storage medium
CN108363769A (en) * 2018-02-07 2018-08-03 大连大学 The method for building up of semantic-based music retrieval data set
CN109918535A (en) * 2019-01-18 2019-06-21 华南理工大学 Music automatic marking method based on label depth analysis
CN110188235A (en) * 2019-05-05 2019-08-30 平安科技(深圳)有限公司 Music style classification method, device, computer equipment and storage medium
CN110491393A (en) * 2019-08-30 2019-11-22 科大讯飞股份有限公司 The training method and relevant apparatus of vocal print characterization model
CN111859010A (en) * 2020-07-10 2020-10-30 浙江树人学院(浙江树人大学) Semi-supervised audio event identification method based on depth mutual information maximization
CN112861758A (en) * 2021-02-24 2021-05-28 中国矿业大学(北京) Behavior identification method based on weak supervised learning video segmentation


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114036341A (en) * 2022-01-10 2022-02-11 腾讯科技(深圳)有限公司 Music tag prediction method and related equipment
CN114036341B (en) * 2022-01-10 2022-03-29 腾讯科技(深圳)有限公司 Music tag prediction method and related equipment
CN114464152A (en) * 2022-04-13 2022-05-10 齐鲁工业大学 Music genre classification method and system based on visual transformation network
CN114917585A (en) * 2022-06-24 2022-08-19 四川省商投信息技术有限责任公司 Sound effect generation method and system

Also Published As

Publication number Publication date
CN113813609B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
Nam et al. Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from bach
EP3803846B1 (en) Autonomous generation of melody
Wu et al. Automatic audio chord recognition with MIDI-trained deep feature and BLSTM-CRF sequence decoding model
Tingle et al. Exploring automatic music annotation with "acoustically-objective" tags
Zhang Music style classification algorithm based on music feature extraction and deep neural network
CN113813609B (en) Game music style classification method and device, readable medium and electronic equipment
JP2009508156A (en) Music analysis
Pachet et al. Analytical features: a knowledge-based approach to audio feature generation
Castillo et al. Web-based music genre classification for timeline song visualization and analysis
Tsatsishvili Automatic subgenre classification of heavy metal music
Middlebrook et al. Song hit prediction: Predicting billboard hits using spotify data
Wu Research on automatic classification method of ethnic music emotion based on machine learning
Dixon et al. Probabilistic and logic-based modelling of harmony
Chen et al. Robotic musicianship based on least squares and sequence generative adversarial networks
CN110134823B (en) MIDI music genre classification method based on normalized note display Markov model
Zhang Research on music classification technology based on deep learning
Li et al. Music genre classification based on fusing audio and lyric information
Yeh et al. Popular music representation: chorus detection & emotion recognition
Ullrich et al. Music transcription with convolutional sequence-to-sequence models
Wen et al. Parallel attention of representation global time–frequency correlation for music genre classification
Chen et al. Music recognition using blockchain technology and deep learning
Pons Puig Deep neural networks for music and audio tagging
Widmer et al. From sound to” sense” via feature extraction and machine learning: Deriving high-level descriptors for characterising music
Won Representation learning for music classification and retrieval: bridging the gap between natural language and music semantics
Yang et al. Improving Musical Concept Detection by Ordinal Regression and Context Fusion.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant