KR20170091888A - Method and system for automatically tagging themes suited for songs - Google Patents

Method and system for automatically tagging themes suited for songs Download PDF

Info

Publication number
KR20170091888A
Authority
KR
South Korea
Prior art keywords
sound source
text
data
model
learning
Prior art date
Application number
KR1020160012717A
Other languages
Korean (ko)
Other versions
KR101801250B1 (en)
Inventor
김정명
하정우
김정희
Original Assignee
네이버 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 네이버 주식회사 filed Critical 네이버 주식회사
Priority to KR1020160012717A priority Critical patent/KR101801250B1/en
Publication of KR20170091888A publication Critical patent/KR20170091888A/en
Application granted granted Critical
Publication of KR101801250B1 publication Critical patent/KR101801250B1/en

Classifications

    • G06F17/30752
    • G06F17/30743
    • G06F17/30778
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services

Landscapes

  • Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a method and system for automatically tagging themes suitable for music. A method for automatically tagging music includes the steps of: receiving sound source data and text information of sound source content with respect to the sound source content; preprocessing the sound source data and the text information into the form of learning data; learning the preprocessed learning data for the sound source data together with the preprocessed learning data for the text information through a sound source-text integrated learning model; and performing automatic music tagging by assigning at least one tag to the sound source content according to the learning result. Accordingly, the present invention can reduce the cost of assigning tags.

Description

METHOD AND SYSTEM FOR AUTOMATICALLY TAGGING THEMES SUITED FOR SONGS

The following description relates to a technique for tagging a theme suitable for music.

With the development of various digital compression technologies, existing large-capacity media can be digitized and converted into low-capacity media, so that various types of media can be stored in a portable user terminal and users can easily select the desired media.

In addition, media digitized through such digital compression technology can be shared among users over a network, which has caused online media services to grow explosively.

Music, which accounts for a large part of such mass media, is low in capacity and imposes a low communication load compared to other types of media, so it is easy to support with a real-time streaming service, and satisfaction is high for both service providers and users.

Accordingly, a variety of online music services are currently available. A conventional online music service provides sound sources to a user connected online either by delivering the entire file of the sound source content to the user terminal or by providing the sound source content through a real-time streaming service.

Korean Patent Registration No. 10-0615522 (registered on Aug. 17, 2006) discloses an example of an online music service, namely a technique of categorizing music content on the basis of its content and providing the sound source content to users accessing through the network.

In order to efficiently recommend, search, and manage music, it is important to classify music content. Methods for grouping sound source content with similar sound source content include classification by genre, by singer, and by song. Features used to distinguish music by genre include timbre, rhythm, pitch, and the Mel-frequency cepstral coefficient (MFCC), which is widely used in speech recognition.

However, although a music genre is usually assigned when the music is created, the genre cannot be clearly defined, and the genre system is ambiguous and inconsistent, making it difficult to classify music by genre and serve it to users.

In particular, in order to recommend music based on the current situation of the user rather than merely recommending music according to a specific classification, prior information on which situation each piece of music fits is required before determining the current situation of the user. In the general conventional art, people mostly generate tags directly for the respective sound source contents. However, because tags generated by a person are highly subjective, different people can produce different tags even for the same sound source content, and having a person generate tags for a very large number of sound source contents is costly and inefficient.

As another prior art, the genre included in the meta information of the sound source content may be utilized as such prior information. However, as described above, the genre provided as meta information is assigned at the time of song creation, and its system is ambiguous and inconsistent. In addition, there is the problem that the genre provided as meta information is not clearly related to the user's situation.

On the other hand, there is a prior art that utilizes log information such as what kind of music users have listened to before. For example, Korean Patent No. 10-1170208 (music recommendation system and method) discloses a technique of extracting structural features by analyzing music structurally, modeling the analysis results of the music structure and characteristics based on user information, and recommending music corresponding to the modeling result.

However, this prior art does not generate and provide a tag as prior information for the sound source content itself; rather, it empirically sets the relation between situations and sound source contents in advance and recommends the related sound source content for the same situation, which is a limitation.

An object is to provide a method and system for automatically tagging a theme suitable for music by using sound source data together with text information (for example, lyrics, meta information such as singer, genre, title, and album name, and other text).

The present invention provides a method and system capable of assigning a plurality of tags, such as a theme, a genre, and a mood suitable for sound source content, by automatically learning not only the sound source data but also the text information at the same time.

A method for automatically tagging music performed by a computer-implemented music auto-tagging system comprises: receiving sound source data and text information of sound source content with respect to the sound source content; preprocessing the sound source data and the text information into the form of learning data; learning together the learning data preprocessed for the sound source data and the learning data preprocessed for the text information through a sound source-text integrated learning model; and performing automatic music tagging by assigning at least one tag to the sound source content according to the learning result.

According to one aspect of the present invention, the pre-processing step may convert the sound source data into learning data expressed in time-frequency.

According to another aspect of the present invention, the preprocessing step converts the text information into learning data represented by a sequence of individual words.

According to another aspect of the present invention, the sound source-text integrated learning model is a combination of a sound source model and a text model, and the learning step includes: generating a first real-number vector corresponding to the learning data preprocessed for the sound source data using the sound source model of the sound source-text integrated learning model; generating a second real-number vector corresponding to the learning data preprocessed for the text information using the text model of the sound source-text integrated learning model; and calculating a score for each tag of the entire tag set according to a third real-number vector in which the first real-number vector and the second real-number vector are concatenated, through the sound source-text integrated learning model.

According to another aspect of the present invention, the step of performing automatic music tagging includes determining the at least one tag in the entire tag set using the per-tag scores output through the sound source-text integrated learning model and assigning it to the sound source content.

According to another aspect of the present invention, the step of generating the real-number vector corresponding to the learning data preprocessed for the sound source data may include sampling a plurality of frames from the learning data preprocessed for the sound source data and generating a real-number vector for the sound source data by using each of the sampled frames as an input of the sound source model.

According to another aspect of the present invention, the sound source model may have the same number of channels as the number of the sampled frames.

According to another aspect, the step of generating the real-number vector corresponding to the learning data preprocessed for the text information may include generating a plurality of real-number vectors by applying an individual text model to each of a plurality of pieces of individual text information of different types.

According to another aspect of the present invention, the sound source-text integrated learning model may be trained using an error based on the difference between a tag vector value obtained by digitizing the tags of the entire tag set and a vector based on the calculated score of each tag.

According to another aspect of the present invention, an error change value calculated by partially differentiating the error in the sound source-text integrated learning model may be propagated to the individual sound source model and text model through backpropagation or backpropagation through time (BPTT).

A computer-implemented music auto-tagging system comprises: an input control unit for receiving sound source data and text information of sound source content with respect to the sound source content; a preprocessor for processing the sound source data and the text information into the form of learning data; a learning unit for learning together the learning data preprocessed for the sound source data and the learning data preprocessed for the text information through a sound source-text integrated learning model; and a tagging unit for performing automatic music tagging by assigning at least one tag to the sound source content according to the learning result.

According to the embodiment of the present invention, it is possible to automatically tag a theme suitable for music by using the sound source data of the sound source content together with the text information, thereby reducing the cost incurred in tagging the sound source content.

According to the embodiment of the present invention, the sound source data and the text information of the sound source content can be automatically learned and tagged with only minimal preprocessing and without a separate feature extraction process for the sound source.

According to the embodiment of the present invention, by learning text information including lyrics or meta information at the same time as the sound source data, a plurality of tags such as a theme, a genre, and a mood suitable for the sound source content can be assigned, enabling various kinds of tagging.

FIG. 1 is a block diagram for explaining an example of the internal configuration of a computer system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of components that a processor of a computer system according to an embodiment of the present invention may include.
FIG. 3 is a flowchart illustrating an example of a music auto-tagging method that can be performed by a computer system according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of an automatic music tagging learning model structure according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The present invention relates to a music auto-tagging method and system, and more particularly, to a technique capable of automatically assigning a theme suitable for music by using the sound source data and the text information of sound source content together.

Embodiments, including those specifically disclosed herein, achieve automatic learning and tagging for sound source content and thereby provide significant advantages in terms of cost savings, efficiency, and accuracy.

The tags generated for the sound source content can be utilized for recommendation, search, classification, and management of the sound source content. For example, in order to recommend sound source content based on a user's situation, prior information on which situation each piece of music fits is required, and a tag suitable for the music can serve as such prior information.

FIG. 1 is a block diagram for explaining an example of the internal configuration of a computer system according to an embodiment of the present invention. For example, a music auto-tagging system according to embodiments of the present invention may be implemented through the computer system 100 of FIG. 1. As shown in FIG. 1, the computer system 100 includes a processor 110, a memory 120, a persistent storage device 130, a bus 140, an input/output interface 150, and a network interface 160.

The processor 110 may include or be part of any device capable of processing a sequence of instructions. The processor 110 may include, for example, a computer processor, a processor in a mobile device or other electronic device, and/or a digital processor. The processor 110 may be included in, for example, a server computing device, a server computer, a series of server computers, a server farm, a cloud computer, a content platform, a mobile computing device, a smart phone, a tablet, or a set-top box. The processor 110 may be connected to the memory 120 via the bus 140.

The memory 120 may include volatile memory, permanent memory, virtual memory, or other memory for storing information used by or output from the computer system 100. The memory 120 may include, for example, random access memory (RAM) and/or dynamic RAM (DRAM). The memory 120 may be used to store arbitrary information, such as state information of the computer system 100. The memory 120 may also be used to store instructions of the computer system 100, including, for example, instructions for automatic tagging of sound source content. The computer system 100 may include one or more processors 110 as needed or where appropriate.

The bus 140 may comprise a communication infrastructure that enables interaction between the various components of the computer system 100. The bus 140 may, for example, carry data between components of the computer system 100, for example, between the processor 110 and the memory 120. The bus 140 may comprise a wireless and / or wired communication medium between the components of the computer system 100 and may include parallel, serial, or other topology arrangements.

The persistent storage device 130 may include components, such as memory or another persistent storage device, used by the computer system 100 to store data for an extended period of time (e.g., as compared to the memory 120). The persistent storage device 130 may include non-volatile main memory as used by the processor 110 in the computer system 100. The persistent storage device 130 may include, for example, flash memory, a hard disk, an optical disk, or another computer-readable medium.

The input/output interface 150 may include interfaces to a keyboard, a mouse, voice command input, a display, or other input or output devices. Configuration instructions and/or the sound source data and text information of the sound source content for tagging may be received via the input/output interface 150.

The network interface 160 may include one or more interfaces to networks such as a local area network or the Internet. The network interface 160 may include interfaces for wired or wireless connections. Configuration data and/or the text information of the sound source content for tagging may be received via the network interface 160.

Also, in other embodiments, the computer system 100 may include more components than those shown in FIG. 1. However, most prior art components need not be clearly illustrated. For example, the computer system 100 may be implemented to include at least some of the input/output devices connected to the input/output interface 150 described above, or may further include other components such as a transceiver, a Global Positioning System (GPS) module, and a database. More specifically, when the computer system 100 is implemented in the form of a mobile device such as a smart phone, components generally included in a mobile device, such as an acceleration sensor, a gyro sensor, a camera, various physical buttons, buttons using a touch panel, an input/output port, and a vibrator, may be further included in the computer system 100.

FIG. 2 is a diagram illustrating an example of components that a processor of a computer system according to an exemplary embodiment of the present invention may include, and FIG. 3 is a flowchart showing an example of a music auto-tagging method that can be performed by the computer system.

As shown in FIG. 2, the processor 110 may include a sound source input control unit 210, a text input control unit 220, a sound source preprocessing unit 230, a text preprocessing unit 240, a learning unit 250, and a tagging unit 260. The components of the processor 110 may be representations of different functions performed by the processor 110 in accordance with control commands provided by at least one program code. For example, the sound source input control unit 210 may be a functional representation used when the processor 110 operates to control the computer system 100 to receive sound source data. The processor 110 and the components of the processor 110 may perform steps S310 through S350 included in the music auto-tagging method of FIG. 3. For example, the processor 110 and the components of the processor 110 may be implemented to execute instructions according to the at least one program code described above and the code of the operating system contained in the memory 120. Here, the at least one program code may correspond to the code of a program implemented to process the music auto-tagging method.

The music auto-tagging method may not occur in the order shown, and some of the steps may be omitted or an additional process may be further included.

In step S310, the processor 110 may load the program code stored in the program file for the music auto-tagging method into the memory 120. For example, the program file for the music auto-tagging method may be stored in the persistent storage device 130 described with reference to FIG. 1, and the processor 110 may control the computer system 100 so that the program code is loaded from the program file stored in the persistent storage device 130 into the memory 120.

At this time, the sound source input control unit 210, the text input control unit 220, the sound source preprocessing unit 230, the text preprocessing unit 240, the learning unit 250, and the tagging unit 260 included in the processor 110 may each be a different functional representation of the processor 110 for executing the instructions of the corresponding portion of the program code loaded into the memory 120 in order to perform the subsequent steps S320 through S350. For the execution of steps S320 through S350, the processor 110 and the components of the processor 110 may process an operation according to a direct control command or control the computer system 100.

In step S320, the sound source input control unit 210 and the text input control unit 220 may control the computer system 100 to receive the sound source data and the text information of the sound source content with respect to the sound source content. In other words, the sound source input control unit 210 may control the computer system 100 to receive the sound source data of the sound source content, and the text input control unit 220 may control the computer system 100 to receive the text information of the sound source content. In this case, the sound source content may refer to any digital data having an audio file format, for example MPEG Audio Layer-3 (MP3), Waveform Audio Format (WAVE), or Free Lossless Audio Codec (FLAC). The text information related to the sound source content may include lyrics, meta information such as singer, genre, title, and album name, or a query entered in connection with a search for the sound source content. The sound source data and text information may be input to, or received by, the computer system 100 via the input/output interface 150 or the network interface 160 and stored and managed in the memory 120 or the persistent storage device 130.

In step S330, the sound source preprocessing unit 230 and the text preprocessing unit 240 may process the sound source data and the text information received in step S320, respectively, into learning data. At this time, the sound source preprocessing unit 230 can express the sound source data in time-frequency form through the preprocessing. For example, the sound source preprocessing unit 230 may convert the sound source data into time-frequency-magnitude data such as a mel-spectrogram or Mel-Frequency Cepstral Coefficients (MFCC). The text preprocessing unit 240 can express the text information of the sound source content as a sequence of individual words through the preprocessing. For example, the text preprocessing unit 240 may filter meaningless text from the input text information using a language preprocessor such as a morphological analyzer or an index term extractor. In other words, the text preprocessing unit 240 removes unnecessary parts of speech and special symbols (!, ., /, etc.) included in the text information and extracts the words corresponding to stems or roots.
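As one way to picture this preprocessing step, the minimal sketch below converts an audio file into a mel-spectrogram and reduces text to a word sequence; the use of librosa, the simple regular-expression tokenizer, and the file name are illustrative assumptions rather than part of the disclosed system.

    import re
    import librosa
    import numpy as np

    def preprocess_audio(path, sr=22050, n_mels=128):
        # Load the sound source data and convert it to a time-frequency
        # representation (mel-spectrogram in dB), one possible realization
        # of the time-frequency-magnitude learning data described above.
        y, sr = librosa.load(path, sr=sr)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        return librosa.power_to_db(mel)          # shape: (n_mels, time)

    def preprocess_text(lyrics, meta):
        # Join lyrics and meta information, strip special symbols, and
        # represent the text as a sequence of individual words.
        text = " ".join([lyrics] + list(meta.values()))
        text = re.sub(r"[^\w\s]", " ", text.lower())
        return text.split()

    spec = preprocess_audio("song.mp3")
    words = preprocess_text("some lyrics ...", {"singer": "A", "genre": "ballad"})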

In step S340, the learning unit 250 may learn the learning data preprocessed for the sound source data and the learning data preprocessed for the text information together through the sound source-text integrated learning model. The learning unit 250 uses a sound source-text integrated learning model in which a sound source model and a text model are integrated. For example, the learning unit 250 may express the preprocessed sound source data and the preprocessed text information as real-number vectors and then calculate, from these vectors, a score for each tag of the entire tag set. In other words, by learning the sound source data and the text information of the sound source content together, the learning unit 250 can learn the relationship between the sound source content and each tag of the entire tag set.

In step S350, the tagging unit 260 may perform automatic music tagging by assigning at least one tag to the corresponding sound source content according to the learning result of step S340. At this time, the tagging unit 260 can determine a tag suitable for the sound source content from the entire tag set using the per-tag scores output through the sound source-text integrated learning model. For example, the tagging unit 260 may select a predetermined number of tags, starting from the tag with the highest score among the per-tag scores output for the sound source content, and assign them as tags of the corresponding sound source content. Here, the tag set may be a set of tags that can be assigned to sound source contents. For example, the computer system 100 may manage a set of a plurality of words as the tag set, and the per-tag score may be a score calculated for each of the tags included in this entire tag set. According to an embodiment, the entire tag set can be divided into a plurality of subtag sets; in this case, a score may be calculated and used for each tag included in one subtag set for one piece of sound source content.
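For illustration only, a minimal sketch of this selection step is shown below, assuming the per-tag scores are already available as a NumPy array; the tag names and the choice of the top three tags are hypothetical.

    import numpy as np

    tag_set = ["sadness", "pleasure", "break", "drive", "rain"]   # hypothetical tag set
    scores = np.array([0.05, 0.62, 0.10, 0.48, 0.21])             # per-tag scores from the model

    top_k = 3
    best = np.argsort(scores)[::-1][:top_k]        # indices of the highest-scoring tags
    assigned_tags = [tag_set[i] for i in best]     # tags assigned to the sound source content
    print(assigned_tags)                           # ['pleasure', 'drive', 'rain']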

This embodiment provides a multimodal technique that learns text data together with sound source data for music tagging. In particular, it provides an end-to-end (E2E) technique that can automatically assign multiple tags from sound source data and text information as input, without a separate feature extraction process for the sound source data. As described above, according to the embodiments of the present invention, text information such as lyrics or meta information is used for tagging together with the sound source data, so that tags considering the content of the lyrics and the meta information can be assigned at the same time. For example, if a song is slow but the lyrics contain bright content, or if the song has a fast beat but the lyrics contain gloomy content, a more appropriate tag can be assigned to the song, improving the accuracy and reliability of tagging.

FIG. 4 is a diagram illustrating an example of an automatic music tagging learning model structure according to an embodiment of the present invention. The automatic music tagging learning model structure according to the embodiment of FIG. 4 combines a sound source model and a text model and can be used as one embodiment of the structure of the sound source-text integrated model described above. As shown in FIG. 4, the automatic music tagging learning model structure may include a sound source data pattern learning layer 410, a text information pattern learning layer 420, a sound source-text integration hidden connection layer 430, and an output layer 440. Hereinafter, an example of a process of generating a tag for sound source content based on the automatic music tagging learning model structure of FIG. 4 will be described. The sound source data pattern learning layer 410 may correspond to the sound source model described above, the text information pattern learning layer 420 may correspond to the text model described above, the sound source-text integration hidden connection layer 430 may be the layer in which the output of the sound source model and the output of the text model are combined, and the output layer 440 may be the layer in which the final output of the sound source-text integrated model is produced.

Steps 1 to 3 below may be an example of a process of generating a real vector corresponding to sound source data in the sound source data pattern learning layer 410.

In step 1, the sound source data (for example, an MP3 file) included in the sound source content can be converted into time-frequency-magnitude data such as a mel-spectrogram or MFCC through preprocessing. For example, the sound source preprocessing unit 230 may represent the sound source data input through the sound source input control unit 210 in time-frequency form.

In step 2, a plurality of frequency frames for one or more short time intervals (1 second to 10 seconds) may be sampled from the converted sound source data and used as input data for the learning model of the sound source data. For example, the learning unit 250 may sample a plurality of frames and use them as the input of a Convolutional Neural Network (CNN) model, presented here as an example of the sound source model in the sound source data pattern learning layer 410. Accordingly, the CNN model for learning sound source data can be a model having the same number of channels as the number of sampled frames.

In step 3, the plurality of convolution and pooling layers that the sound source data pattern learning layer 410 may include can be stacked repeatedly to generate abstract features from the sound source frames. In the convolution, the patch size can be configured in various ways, and at least one of several pooling techniques, such as pooling using the maximum value (max pooling) or pooling using the average (average pooling), can be used.

A fully-connected layer is placed on top of the plurality of convolution and pooling layers, and various functions such as the sigmoid function, the hyperbolic tangent (tanh) function, and the ReLU (Rectified Linear Unit) function can be used as the activation function of each layer. As a result, one multidimensional real-number vector can be generated for the sound source data. For example, given the first sound source data as an input to the sound source learning model, the output layer of the sound source learning model can represent it as a single multidimensional real-number vector such as {-0.24, 0.124, 0.312, ...}. The number of dimensions is an integer greater than zero and, empirically, typically has a value of at least 100. This step 3 can be processed by the learning unit 250 described above.
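A minimal PyTorch sketch of such a sound source model is given below for illustration; the number of input channels equals the number of sampled frames as described above, while the specific layer sizes, kernel sizes, and the 100-dimensional output are assumptions made only for this example.

    import torch
    import torch.nn as nn

    class SoundSourceModel(nn.Module):
        # CNN over sampled mel-spectrogram frames; one input channel per sampled frame.
        def __init__(self, n_frames=3, n_mels=128, frame_len=128, out_dim=100):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(n_frames, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.fc = nn.Linear(64 * (n_mels // 4) * (frame_len // 4), out_dim)

        def forward(self, x):              # x: (batch, n_frames, n_mels, frame_len)
            h = self.features(x)
            return self.fc(h.flatten(1))   # one multidimensional real vector per sound source

    audio_vec = SoundSourceModel()(torch.randn(1, 3, 128, 128))   # shape: (1, 100)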

Steps 4 to 6 below may be an example of a process of generating a real vector corresponding to text information in the text information pattern learning layer 420.

In step 4, the lyrics and meta information (genre, title, singer, album name, and query information) can be represented as sequences of individual words through morphological analysis preprocessing. For example, the text preprocessing unit 240 may generate a plurality of word sequences through morphological analysis preprocessing based on the text information input through the text input control unit 220. At this time, each piece of text information may be given as the input of an individual model, or all of the text information may be joined into one sequence and given as a single input. For example, individual sequences for the lyrics and the meta information may be fed to a model for the lyrics and a model for the meta information respectively, or a single sequence covering the entire lyrics and meta information may be fed to a single model.

In step 5, each individual word sequence (or the single joined sequence) may be expressed as a single multidimensional real-number vector using a Recurrent Neural Network (RNN) or a model capable of learning sequence information, such as the CNN described above. If a plurality of individual sequences are respectively input to a plurality of models, a plurality of multidimensional real-number vectors may be generated. Assuming that an individual model is used for each kind of text information, the first data of the given m-th text information, when given as an input to the m-th text model, can be represented in the output layer of that model as a single multidimensional real-number vector such as {-0.312, 0.991, ...}. The number of dimensions is an integer greater than zero and, empirically, generally has a value greater than 50. This step 5 can be processed by the learning unit 250 described above.
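Continuing the illustration in PyTorch, a minimal text model of this kind could look as follows; the vocabulary size, embedding size, and 50-dimensional output are assumed values, and a GRU stands in for the generic RNN mentioned above.

    import torch
    import torch.nn as nn

    class TextModel(nn.Module):
        # RNN over a sequence of word indices, producing one real vector per text input.
        def __init__(self, vocab_size=10000, emb_dim=128, out_dim=50):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.GRU(emb_dim, out_dim, batch_first=True)

        def forward(self, word_ids):            # word_ids: (batch, seq_len)
            _, h = self.rnn(self.emb(word_ids))
            return h[-1]                        # last hidden state: (batch, out_dim)

    text_vec = TextModel()(torch.randint(0, 10000, (1, 40)))   # shape: (1, 50)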

In step 6, the real-number vectors computed in the output layers of step 3 and step 5 can be concatenated into a single vector in the sound source-text integration hidden connection layer 430 and used as the input of the model that integrates the sound source model and the text model. The number of nodes of the output layer 440 of the integrated model M can be set equal to the number of kinds of theme words for sound sources. The output layer 440 of the integrated model M may consist of a plurality of independent layers and may be composed of abstract categories or words describing songs, such as theme, genre, and mood. At this time, each output layer present in the output layer 440 of the integrated model M may be fully connected to the sound source-text integration hidden connection layer 430 immediately below it, and the output layers may not be connected to each other. Propagation of values from input to output may follow the method of a typical feedforward network. In other words, the weighted sum of all node values in the lower layer is given as the argument of the activation function, and the activation function value becomes the node value in the upper layer.

F_h(k+1) = φ( W(k) · F_h(k) ),   F_h(0) = u = [ h_a ; h_t(1) ; ... ; h_t(K) ]     (1)

In Equation (1), u is the concatenated vector, F_h(k) denotes the k-th hidden layer of the integrated model, W(k) is the weight matrix connecting the k-th and (k+1)-th layers of the integrated model, and φ is the activation function. f_a and f_t(k) denote the sound source data learning model and the k-th text information learning model, respectively; h_a is the real-number vector generated at the output of the sound source data learning model f_a when the sound source data a is given, and h_t(k) is the real-number vector generated when the k-th text information is given as an input.
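A sketch of this integration and output stage in the same PyTorch style is shown below; treating the concatenated vector u as the input of a small fully-connected network with independent output heads for theme, genre, and mood mirrors the description above, while the layer widths and the numbers of tags per head are assumptions.

    import torch
    import torch.nn as nn

    class IntegratedModel(nn.Module):
        # Concatenates the sound source vector and the text vectors, passes them
        # through a hidden layer, and produces per-tag scores in independent output heads.
        def __init__(self, audio_dim=100, text_dims=(50, 50), hidden=256, n_tags=None):
            super().__init__()
            n_tags = n_tags or {"theme": 20, "genre": 15, "mood": 10}
            in_dim = audio_dim + sum(text_dims)
            self.hidden = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.heads = nn.ModuleDict({name: nn.Linear(hidden, n)
                                        for name, n in n_tags.items()})

        def forward(self, h_audio, h_texts):
            u = torch.cat([h_audio] + list(h_texts), dim=1)   # concatenated vector u
            f = self.hidden(u)                                # integrated hidden layer
            return {name: head(f) for name, head in self.heads.items()}

    scores = IntegratedModel()(torch.randn(1, 100), [torch.randn(1, 50), torch.randn(1, 50)])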

In the case of the theme output layer, it may be composed of the same number of nodes as the size of the theme word set, and the tags assigned to the training data may be represented by a vector having the same size as the number of nodes in the output layer. For example, if the entire set of theme tags is Y = {sadness, pleasure, break} and the tag of the first data item is 'pleasure', then y = {0, 1, 0}. When there are multiple tags, such as 'pleasure' and 'break', the element values may be divided by the number of tags so that the elements of the vector sum to 1, for example y = {0, 0.5, 0.5}.
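For concreteness, the target-vector construction described here can be sketched as follows; the example tag names come directly from the text, while the helper function name is hypothetical.

    import numpy as np

    def theme_target(assigned_tags, theme_set=("sadness", "pleasure", "break")):
        # Build the tag vector y: 1 for a single assigned tag, or equal fractions
        # summing to 1 when several tags are assigned.
        y = np.zeros(len(theme_set))
        for tag in assigned_tags:
            y[theme_set.index(tag)] = 1.0 / len(assigned_tags)
        return y

    print(theme_target(["pleasure"]))            # [0. 1. 0.]
    print(theme_target(["pleasure", "break"]))   # [0.  0.5 0.5]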

Alternatively, the output layer may be configured with as many output layers as the number of tag words, each composed of two nodes representing 0 and 1.

When the output layer is defined such that the tags assigned to the training data are represented by a vector having the same size as the number of nodes of the output layer, the score to be given to each tag can be expressed as a probability as shown in Equation 2 below.

ŷ_k = f_k(a, t)     (2)

In Equation (2), f denotes the entire sound source-text learning model, and f_k(a, t) denotes the k-th theme value of the theme output layer calculated when the sound source data a and the text information t are given; ŷ_k is the score (probability) given to the k-th tag.

In step 7, when the output layer is defined such that the tags assigned to the training data are represented by a vector of the same size as the number of nodes in the output layer, the error can be defined as the difference between the actual tag vector values and the per-tag values calculated in Equation (2), as in Equation (3) below.

E = D(y, ŷ)     (3)

At this time, various functions, such as the mean square error function and the cross-entropy function, can be used as the difference function D. Here, y and ŷ refer to the actual theme binary vector and the predicted theme vector, respectively.
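As an aside, the two difference functions named here would look like this in the same illustrative Python; the variable names simply follow the y and ŷ of Equation (3).

    import numpy as np

    def mean_square_error(y, y_hat):
        # E = mean over tags of (y_k - y_hat_k)^2
        return np.mean((y - y_hat) ** 2)

    def cross_entropy(y, y_hat, eps=1e-12):
        # E = - sum over tags of y_k * log(y_hat_k)
        return -np.sum(y * np.log(y_hat + eps))

    y = np.array([0.0, 1.0, 0.0])        # actual theme vector
    y_hat = np.array([0.1, 0.8, 0.1])    # predicted theme vector
    print(mean_square_error(y, y_hat), cross_entropy(y, y_hat))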

In step 8, learning can use a backpropagation technique that adjusts the weights of the model based on gradient descent so that the error is minimized. The error change value for an individual weight change can be defined by partially differentiating the error with respect to the weight vector, as in Equation (4) below.

∂E / ∂w     (4)

The error change value follows the general backpropagation from the output layer of the integrated model down to the layer where the vectors of the sound source data and the text information are joined, and from that joint layer it is propagated backward separately to the vector portion corresponding to the sound source data and to the vector portion corresponding to the text information.

Equation (5) is an example of the error change values obtained when the mean square error function is used as the error function. In Equation (5), F denotes the sound source-text integrated feedforward network, F_h(k) and F_h(k+1) denote its k-th and (k+1)-th hidden layers, F_o denotes its output layer, and the joint layer is included among the hidden layers of the integrated model. Likewise, W_a(k) and W_t(k) denote the weight matrices of the connections between the k-th and (k+1)-th layers in the sound source model and the text model, respectively, and h_a(k) and h_t(k) denote the node values of the k-th hidden layer of each model. In other words, the first expression in Equation (5) is the error change value occurring in the output layer of the sound source-text integrated feedforward model, and the second is the error change value occurring in the hidden layers of the integrated feedforward model. The third and fourth refer to the error change values occurring in the output layers of the sound source model and the text model, respectively. φ refers to the nonlinear activation function used in the feedforward model, and the sigmoid function, the hyperbolic tangent function, the ReLU function, and the like described above may be used as this function.

From the hidden layers below the output layers of the sound source model and the text model, learning can proceed using the backpropagation used in each model or, when a recurrent neural network is used, an algorithm such as BackPropagation Through Time (BPTT).
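Tying the pieces together, a minimal training step in the same PyTorch style might look as follows, reusing the SoundSourceModel, TextModel, and IntegratedModel classes sketched above; the mean-square-error loss on the theme head only, the optimizer choice, and the learning rate are assumptions. Backpropagation through the concatenation layer automatically splits the gradient between the sound source part and the text part, as described above.

    import torch
    import torch.nn as nn

    audio_model = SoundSourceModel()            # CNN from the earlier sketch
    text_model = TextModel()                    # RNN from the earlier sketch
    integrated = IntegratedModel(text_dims=(50,), n_tags={"theme": 3})
    params = (list(audio_model.parameters()) + list(text_model.parameters())
              + list(integrated.parameters()))
    optimizer = torch.optim.SGD(params, lr=0.01)   # gradient descent on all weights

    frames = torch.randn(1, 3, 128, 128)           # sampled spectrogram frames
    words = torch.randint(0, 10000, (1, 40))       # word-index sequence
    y = torch.tensor([[0.0, 1.0, 0.0]])            # theme tag vector, e.g. 'pleasure'

    scores = integrated(audio_model(frames), [text_model(words)])
    loss = nn.functional.mse_loss(torch.sigmoid(scores["theme"]), y)
    loss.backward()        # error propagates back through the integrated, sound source, and text models
    optimizer.step()
    optimizer.zero_grad()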

In this way, sound source, lyric, and meta information can be integrated into the integrated model and used for theme tagging. Errors occurring during learning are propagated backward in shared form within the integrated model and then propagated to each individual model at the joint layer.

As described above, according to the embodiments of the present invention, it is possible to automatically tag a theme suitable for music by using the sound source data of the sound source content together with the text information, thereby reducing the cost incurred in tagging sound source content. In addition, the sound source data and text information of the sound source content can be automatically learned and tagged with only minimal preprocessing and without a separate feature extraction process for the sound source. Furthermore, by learning text information including lyrics or meta information together with the sound source data, multiple tags such as a theme, a genre, and a mood suitable for the sound source content can be assigned, and various kinds of music-related tagging can be performed.

The apparatus described above may be implemented as a hardware component, a software component, or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit, a microprocessor, or any other device capable of executing and responding to instructions. The processing device may execute an operating system (OS) and one or more software applications running on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of the software. For ease of understanding, the processing device is sometimes described as being used singly, but those skilled in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may comprise a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may command the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device so as to be interpreted by the processing device or to provide instructions or data to the processing device, or may be permanently or temporarily embodied in a transmitted signal wave. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be those specially designed and configured for the embodiments or those known and available to those skilled in the art of computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, the invention is not limited to the disclosed embodiments. For example, appropriate results can be achieved even if the described techniques are performed in an order different from the described method, and/or if components of the described systems, structures, devices, circuits, and the like are combined or coupled in a different form, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.

Claims (14)

A music auto-tagging method performed by a computer-implemented music auto-tagging system, the method comprising:
Receiving sound source data and text information of sound source content with respect to the sound source content;
Preprocessing the sound source data and the text information into the form of learning data;
Learning together the learning data preprocessed for the sound source data and the learning data preprocessed for the text information through a sound source-text integrated learning model; and
Assigning at least one tag to the sound source content according to the learning result to perform automatic music tagging.
The method according to claim 1,
Wherein the preprocessing comprises:
Converting the sound source data into learning data expressed in time-frequency form.
The method according to claim 1,
Wherein the preprocessing comprises:
Converting the text information into learning data represented by a sequence of individual words.
The method according to claim 1,
Wherein the sound source-text integrated learning model is a combination of a sound source model and a text model, and
Wherein the learning comprises:
Generating a first real-number vector corresponding to the learning data preprocessed for the sound source data using the sound source model of the sound source-text integrated learning model;
Generating a second real-number vector corresponding to the learning data preprocessed for the text information using the text model of the sound source-text integrated learning model; and
Calculating a score for each tag of the entire tag set according to a third real-number vector in which the first real-number vector and the second real-number vector are concatenated, through the sound source-text integrated learning model.
5. The method of claim 4,
Wherein the step of performing the automatic music tagging comprises determining the at least one tag from the entire tag set using the per-tag scores output through the sound source-text integrated learning model and assigning it to the sound source content.
6. The method of claim 4,
Wherein, in the step of generating the first real-number vector corresponding to the learning data preprocessed for the sound source data, a plurality of frames are sampled from the learning data preprocessed for the sound source data, and a real-number vector for the sound source data is generated by using each of the sampled frames as an input of the sound source model.
The method according to claim 6,
Wherein the sound source model has a number of channels equal to the number of the sampled frames.
8. The method of claim 4,
Wherein, in the step of generating the second real-number vector corresponding to the learning data preprocessed for the text information, a plurality of real-number vectors are generated by applying an individual text model to each of a plurality of pieces of individual text information of different kinds.
9. The method of claim 4,
Wherein the sound source-text integrated learning model is trained using an error based on a difference between a tag vector value obtained by digitizing the tags of the entire tag set and a vector based on the calculated score of each tag.
10. The method of claim 9,
Wherein an error change value calculated by partially differentiating the error in the sound source-text integrated learning model is propagated to the individual sound source model and the individual text model through backpropagation or backpropagation through time (BPTT).
A computer-readable recording medium storing a program for causing a computer to execute the method according to any one of claims 1 to 10.

A computer-implemented music auto-tagging system comprising:
An input control unit for receiving sound source data and text information of the sound source content with respect to the sound source content;
A preprocessor for processing the sound source data and the text information in the form of learning data;
A learning unit for learning together the learning data preprocessed for the sound source data and the learning data preprocessed for the text information through a sound source-text integrated learning model; and
A tagging unit for assigning at least one tag to the sound source content according to the learning result and performing automatic music tagging.
13. The method of claim 12,
Wherein the sound source-text integrated learning model is a combination of a sound source model and a text model, and
Wherein the learning unit generates a first real-number vector corresponding to the learning data preprocessed for the sound source data using the sound source model of the sound source-text integrated learning model, generates a second real-number vector corresponding to the learning data preprocessed for the text information using the text model of the sound source-text integrated learning model, and calculates a score for each tag of the entire tag set according to a third real-number vector in which the first real-number vector and the second real-number vector are concatenated, through the sound source-text integrated learning model.
14. The method of claim 13,
Wherein the tagging unit determines the at least one tag from the entire tag set using the per-tag scores output through the sound source-text integrated learning model and assigns it to the sound source content.
KR1020160012717A 2016-02-02 2016-02-02 Method and system for automatically tagging themes suited for songs KR101801250B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020160012717A KR101801250B1 (en) 2016-02-02 2016-02-02 Method and system for automatically tagging themes suited for songs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020160012717A KR101801250B1 (en) 2016-02-02 2016-02-02 Method and system for automatically tagging themes suited for songs

Publications (2)

Publication Number Publication Date
KR20170091888A true KR20170091888A (en) 2017-08-10
KR101801250B1 KR101801250B1 (en) 2017-11-27

Family

ID=59652382

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160012717A KR101801250B1 (en) 2016-02-02 2016-02-02 Method and system for automatically tagging themes suited for songs

Country Status (1)

Country Link
KR (1) KR101801250B1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190114221A (en) * 2018-03-29 2019-10-10 삼성전자주식회사 Equipment diagnosis system and method based on deep learning
CN111026908A (en) * 2019-12-10 2020-04-17 腾讯科技(深圳)有限公司 Song label determination method and device, computer equipment and storage medium
CN111026908B (en) * 2019-12-10 2023-09-08 腾讯科技(深圳)有限公司 Song label determining method, device, computer equipment and storage medium
KR20220033304A (en) * 2020-09-09 2022-03-16 주식회사 엘지유플러스 Method and apparatus for recommending hehavior of user

Also Published As

Publication number Publication date
KR101801250B1 (en) 2017-11-27

Similar Documents

Publication Publication Date Title
CN108255934B (en) Voice control method and device
CN109196495B (en) System and method for fine-grained natural language understanding
CN111933129A (en) Audio processing method, language model training method and device and computer equipment
US8321414B2 (en) Hybrid audio-visual categorization system and method
KR100446627B1 (en) Apparatus for providing information using voice dialogue interface and method thereof
Gu et al. Speech intention classification with multimodal deep learning
JP5167546B2 (en) Sentence search method, sentence search device, computer program, recording medium, and document storage device
Tran et al. Ensemble application of ELM and GPU for real-time multimodal sentiment analysis
CN110502610A (en) Intelligent sound endorsement method, device and medium based on text semantic similarity
CN104299623A (en) Automated confirmation and disambiguation modules in voice applications
KR101942459B1 (en) Method and system for generating playlist using sound source content and meta information
WO2022142115A1 (en) Adversarial learning-based speaker voice conversion method and related device
CN113420556B (en) Emotion recognition method, device, equipment and storage medium based on multi-mode signals
CN108710653B (en) On-demand method, device and system for reading book
KR101801250B1 (en) Method and system for automatically tagging themes suited for songs
CN111414513A (en) Music genre classification method and device and storage medium
CN116010902A (en) Cross-modal fusion-based music emotion recognition method and system
CN113343692B (en) Search intention recognition method, model training method, device, medium and equipment
Luitel et al. Audio Sentiment Analysis using Spectrogram and Bag-of-Visual-Words
CN114661951A (en) Video processing method and device, computer equipment and storage medium
Muthumari et al. A novel model for emotion detection with multilayer perceptron neural network
CN115512692B (en) Voice recognition method, device, equipment and storage medium
Li et al. Music genre classification based on fusing audio and lyric information
CN111552778B (en) Audio resource management method, device, computer readable storage medium and equipment
Madhavi et al. Comparative analysis of different classifiers for speech emotion recognition

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
GRNT Written decision to grant