WO2022160054A1

WO2022160054A1 - Artificial intelligence and audio processing system & methodology to automatically compose, perform, mix, and compile large collections of music

Info

Publication number: WO2022160054A1
Application number: PCT/CA2022/050119
Authority: WO
Inventors: Chu HANG; Chantal LEMIRE; Sang KOH; Deborah ROBB; Ang Li
Original assignee: 1227997 B.C. Ltd.
Priority date: 2021-01-29
Filing date: 2022-01-28
Publication date: 2022-08-04

Abstract

A method and system of using artificial intelligence to automatically create and generate music and lyrics that can later be played through a music engine.

Description

ARTIFICIAL INTELLIGENCE AND AUDIO PROCESSING SYSTEM & METHODOLOGY TO AUTOMATICALLY COMPOSE, PERFORM, MIX, AND COMPILE LARGE COLLECTIONS OF MUSIC

FIELD OF INVENTION

0001 The present invention is invention relates to artificial intelligence in general, and more particularly to a novel software platform for authoring, analyzing and performing music and lyrics by artificial intelligence.

BACKGROUND TO THE INVENTION

0002 Artificial intelligence (“A. I ”) is a field of computer science concerned with creating software systems and methods which can perform activities that are traditionally thought to be the exclusive domain of humans. Research in artificial intelligence (Al) is known to have impacted medical diagnosis, stock trading, robot control, and several other fields. One subfield in this area relates to creating software systems and methods that can mimic human behavior, including human creativity, so that these software systems and methods can replicate human characteristics or traits.

0003 A.I. has also contributed to the field of music. Artificial intelligent systems and methods have been subject to much research. Current research includes the application of A.I. in music composition, performance, theory, etc.. Several music software programs have been developed that use A.I. to produce music. A prominent feature is the capability of the A.I. based processes or algorithms to learn based on information obtained such as the computer accompaniment technology, which is capable of listening to and following a human performer so it can perform in synchrony. Artificial intelligence also drives the so- called interactive composition technology, wherein a computer composes music in response to the performance of a live musician. There are several other A.I. applications to music that covers not only music composition, production, and performance but also the way it is marketed and consumed.

0004 However, current A.I. systems either rely on human input to initiate the compositional process or depend solely on analysis of musical data that utilize statistical analysis of the musical data. These two characteristics of current Al software limits the degree of autonomy and the compositional variety. As a result, some music being produced is a human-AI hybrid rather than purely Al-generated, and the music quality does not meet either production or artistic quality. Similarly, music produced through “pure” Al-generated processes known in the art also do not meet either production or artistic quality.

0005 With all the limitations and challenges above comes the need for new technologies and techniques to address the limitations.

SUMMARY OF THE INVENTION

0006 There remains a need for techniques, systems, methods and devices that can be used in the development of artificial intelligence.

0007 An aspect of the present invention is directed to a system for using artificial intelligence to automatically create and generate music and lyrics that can later be played through a music engine or be played by human performers.

0008 An aspect of the present invention is directed to a method for automatically composing a music composition, the method comprising the steps of: (a) a user selecting a music related control variable; (b) processing the music related control variable through a music composer module to compose a music file based on the music related control variable; (c) synthesizing a raw audio file based on the music file from (b) from a rendered module; and (d) processing the raw audio file through a mixer module to produce auditory effects to produce the music composition.

0009 Another aspect of the present invention is directed to the above noted method wherein the music related control variable comprises audio files of a music to be emulated, an emotional tone the music, the genres of the music, the tempo, sentiment, time of the music to be emulated.

0010 Another aspect of the present invention is directed to the above noted method wherein the music composer module comprise an Al generative module or a module applying rule-based templates to concatenate audio samples that are pitch-tuned. 0011 Another aspect of the present invention is directed to the above noted method wherein the Al generative module is selected from the group consisting of convolutional neural networks, recurrent neural networks, self-attention neural networks and symbolic music compositional systems.

0012 Another aspect of the present invention is directed to the above noted method wherein the composer module determines how the music should be performed and dictates a musical parameter, a performative parameter or text-setting content parameter.

0013 Another aspect of the present invention is directed to the above noted method wherein the musical parameter is selected from the group consisting of pitch, durations, motivic patterns, harmony, and voice-leading, the performative parameter is selected from the group consisting of articulatory profile, velocity, and temporality, and the text-setting content is syllable positions.

0014 Another aspect of the present invention is directed to the above noted method wherein the music related control variable is rendered from music composed by a human.

0015 Yet another aspect of the present invention is directed to a system implemented on a machine that utilizes predictive models of human memory to automatically compose a music composition, the system comprising a component that (a) acquires a music related control variable ; (b) process the music related control variable through a music composer module to compose a music file based on the music related control variable; (c) synthesize a raw audio file based on the music file from (b) from a rendered module; and (d) process the raw audio file through a mixer module to produce auditory effects to produce the music composition.

0016 Yet another aspect of the present invention is directed to the system noted above wherein the music related control variable comprises audio files of a music to be emulated, an emotional tone the music, the genres of the music, the tempo, sentiment, time of the music to be emulated. 0017 Yet another aspect of the present invention is directed to the system noted above wherein the music composer module comprise an Al generative module or a module applying rule-based templates to concatenate audio samples that are pitch-tuned.

0018 Yet another aspect of the present invention is directed to the system noted above wherein the Al generative module is selected from the group consisting of convolutional neural networks, recurrent neural networks, self-attention neural networks and symbolic music compositional systems.

0019 Yet another aspect of the present invention is directed to the system noted above wherein the composer module determines how the music should be performed and dictates a musical parameter, a performative parameter or text-setting content parameter.

0020 Yet another aspect of the present invention is directed to the system noted above wherein the musical parameter is selected from the group consisting of pitch, durations, motivic patterns, harmony, and voice-leading, the performative parameter is selected from the group consisting of articulatory profile, velocity, and temporality, and the text-setting content is syllable positions.

0021 Yet another aspect of the present invention is directed to the system noted above wherein the music related control variable is rendered from music composed by a human.

0022 Yet another aspect of the present invention is directed to one or more non- transitory computer-readable media storing computer-executable instructions that upon execution cause one or more processors to perform acts comprising: (a) acquiring a music related control variable ; (b) processing the music related control variable through a music composer module to compose a music file based on the music related control variable; (c) synthesizing a raw audio file based on the music file from (b) from a rendered module; and (d) processing the raw audio file through a mixer module to produce auditory effects to produce the music composition.

0023 Yet another aspect of the present invention is directed to one or more non- transitory computer-readable media storing computer-executable instructions, as noted above, wherein the music related control variable comprises audio files of a music to be emulated, an emotional tone the music, the genres of the music, the tempo, sentiment, time of the music to be emulated.

0024 Yet another aspect of the present invention is directed to one or more non- transitory computer-readable media storing computer-executable instructions, as noted above, wherein the music composer module comprise an Al generative module or a module applying rule-based templates to concatenate audio samples that are pitch-tuned.

0025 Yet another aspect of the present invention is directed to one or more non- transitory computer-readable media storing computer-executable instructions, as noted above, wherein the Al generative module is selected from the group consisting of convolutional neural networks, recurrent neural networks, self-attention neural networks and symbolic music compositional systems.

0026 Yet another aspect of the present invention is directed to one or more non- transitory computer-readable media storing computer-executable instructions, as noted above, wherein the composer module determines how the music should be performed and dictates a musical parameter, a performative parameter or text-setting content parameter.

0027 Yet another aspect of the present invention is directed to one or more non- transitory computer-readable media storing computer-executable instructions, as noted above, wherein the musical parameter is selected from the group consisting of pitch, durations, motivic patterns, harmony, and voice-leading, the performative parameter is selected from the group consisting of articulatory profile, velocity, and temporality, and the text-setting content is syllable positions.

0028 Yet another aspect of the present invention is directed to one or more non- transitory computer-readable media storing computer-executable instructions, as noted above, wherein the music related control variable is rendered from music composed by a human.

0029 Yet another aspect of the present invention is directed to an automated music composition system for composing and performing a music composition in response to a user providing a music related control variable, said automated music composition and generation system comprising: an automated music composition engine, the automated music composition engine using a trained machine learning model to compose the music composition, the trained machine learning model employing multiple types of machine learning algorithms to compose the music composition; a user interface subsystem interfaced with the automated music composition engine, and employing a graphical user interface (GUI) for permitting the user to select the music related control variable for the music composition; a processing subsystem interfaced with the automated music composition engine: (i) processing the music related control variable through a music composer module to compose a music file based on the music related control variable(ii) synthesizing a raw audio file based on the music file from (i) from a rendered module; and (iii) processing the raw audio file through a mixer module to produce auditory effects to produce the music composition.

BRIEF DESCRIPTION OF THE DRAWINGS

0030 In the drawings, which illustrate embodiments of the invention:

0031 FIG. 1 illustrates a preferred embodiment of the present invention detailing the overall system architecture and its components.

0032 FIG. 2 illustrates a preferred embodiment of the present invention detailing the Musical Turing Test.

0033 FIG. 3 illustrates a preferred embodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

0034 The description that follows, and the embodiments described therein, is provided by way of illustration of an example, or examples, of particular embodiments of the principles and aspects of the present invention. These examples are provided for the purposes of explanation, and not of limitation, of those principles and of the invention.

0035 It should also be appreciated that the present invention can be implemented in numerous ways, including as a process, method, an apparatus, a system, a device or a method. In this specification, these implementations, or any other form that the invention may take, may be referred to as processes. In general, the order of the steps of the disclosed processes may be altered within the scope of the invention. The description that follows, and the embodiments described therein, is provided by way of illustration of an example, or examples, of particular embodiments of the principles and aspects of the present invention. These examples are provided for the purposes of explanation, and not of limitation, of those principles and of the invention.

0036 It will be understood by a person skilled in the relevant art that in different geographical regions and jurisdictions these terms and definitions used herein may be given different names, but relate to the same respective systems.

0037 Although the present specification describes components and functions implemented in the embodiments with reference to standards and protocols known to a person skilled in the art, the present disclosure as well as the embodiments of the present invention are not limited to any specific standard or protocol. Each of the standards for Internet and other forms of computer network transmission (e.g., TCP/IP, UDP/IP, HTML, HTTP, SSL and SFTP) represent examples of the state of the art. Such standards are periodically superseded by faster or more efficient equivalents having essentially the same functions. Accordingly, replacement standards and protocols having the same functions are considered equivalents.

0038 Preferred embodiments of the present invention can be implemented in numerous configurations depending on implementation choices based upon the principles described herein. Various specific aspects are disclosed, which are illustrative embodiments not to be construed as limiting the scope of the disclosure. Although the present specification describes components and functions implemented in the embodiments with reference to standards and protocols known to a person skilled in the art, the present disclosures as well as the embodiments of the present invention are not limited to any specific standard or protocol.

0039 Some portion of the detailed descriptions that follow are presented in terms of procedures, steps, logic block, processing, and other symbolic representations of operations on data bits that can be performed on computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, computer executed step, logic block, process, etc. may be here, and generally, conceived to be a self-consi stent sequence of operations or instructions leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers or the like.

0040 A person skilled in the art will understand that the present description will reference terminology from the field of artificial intelligence, including machine learning, and may be known to such a person skilled in the relevant art. A person skilled in the relevant art will also understand that artificial neural networks generally refer to computing or computer systems that are design to mimic biological neural networks (e.g. animal brains). Such systems “learn” to perform tasks by considering examples, generally being programmed with or without task-specific rules. For example, in music composition, such systems might learn to produce musical scores that contain sequences of notes by analyzing musical data from a particular composer or a specific compositional style. A person skilled in the relevant art will understand that convolutional neural networks, recurrent neural networks, transformer neural networks are classes of neural networks that specializes in processing data that has a grid-like or sequential-like topology, such as a music score. A digitized music score is a digital representation of music data. It contains a series of notes arranged in a sequence-like fashion that contains pitch and rhythmic values to denote how human or electronic performers should perform the notations.

0041 Machine learning techniques will generally be understood as being used to identify and classify specific reviewed data. Machine learning approaches first tend to involve what is known in the art as a “training phase”. In the context of classifying functions, a training “corpus” is first constructed. This corpus typically comprises a set of known data. Each set is optionally accompanied with a “label” of its disposition. It is preferable to have fewer unknown samples. Furthermore, it is preferable for the corpus to be representative of the real world scenarios in which the machine learning techniques will ultimately be applied. This is followed by a “training phase” in which the data together with the labels associated with the data, files, etc. themselves, are fed into an algorithm that implements the “training phase”. The goal of this phase is to automatically derive a “generative model”. A person skilled in the relevant art will understand that a generative model effectively encodes a mathematical function whose input is the data and whose output is also the data. By exploiting patterns that exist in the data through the training phase, the model learns the process that generates similar note patterns and arrangements of notes, which are indicative of specific compositional styles. A generative machine learning algorithm should ideally produce a generator that is reasonably consistent with the training examples and that has a reasonable likelihood of generating new instances that are similar to its training data but not identical. Specific generative machine learning algorithms in the art include the Autoregressive Recurrent Neural Networks, Variational Auto-Encoders, Generative Adversarial Neural Networks, Energy-Based Models, Flow-Based Neural Networks, and others known in the art. The term generator is also used to describe a model. For example, one may refer to a Recurrent Neural Network Generator. Once the model/generator is established, it can be used to generate new instances, scenarios or data sets that are presented to a computer or computer network in practice.

0042 The present invention may be a system, a method, and/or a computer program product such that selected embodiments include software that performs certain tasks. The software discussed herein may include script, batch, or other executable files. The software may be stored on a machine-readable or computer-readable storage medium, and is otherwise available to direct the operation of the computer system as described herein and claimed below. In one embodiment, the software uses a local or database memory to implement the data transformation and data structures so as to automatically generate and add libraries to a library knowledge base for use in detecting library substitution opportunities, thereby improving the quality and robustness of software and educating developers about library opportunities and implementation to generate more readable, reliable, smaller, and robust code with less effort. The local or database memory used for storing firmware or hardware modules in accordance with an embodiment of the invention may also include a semiconductor-based memory, which may be permanently, removably or remotely coupled to a microprocessor system. Other new and various types of computer- readable storage media may be used to store the modules discussed herein. Additionally, those skilled in the art will recognize that the separation of functionality into modules is for illustrative purposes. Alternative embodiments may merge the functionality of multiple software modules into a single module or may impose an alternate decomposition of functionality of modules. For example, a software module for calling sub-modules may be decomposed so that each sub-module performs its function and passes control directly to another sub-module.

0043 In addition, selected aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and/or hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of computer program product embodied in a computer readable storage medium or media having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention. Thus embodied, the disclosed system, a method, and/or a computer program product is operative to improve the design, functionality and performance of software programs by adding libraries for use in automatically detecting and recommending library function substitutions for replacing validated code snippets in the software program.

0044 The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non- exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a dynamic or static random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a magnetic storage device, a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

0045 A person skilled in the relevant art will understand that the computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a Public Switched Circuit Network (PSTN), a packet-based network, a personal area network (PAN), a local area network (LAN), a wide area network (WAN), a wireless network, or any suitable combination thereof. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device. Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Python, Visual Basic.net, Ruby, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language, Hypertext Precursor (PHP), or similar programming languages. The computer readable program instructions may execute entirely on the user’s computer, partly on the user’s computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer or entirely on the remote computer or server or cluster of servers. In the latter scenario, the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention. A person skilled in the relevant art will understand that the Al based or algorithmic processes of the present invention may be implemented in any desired source code language, such as Python, Java, and other programming languages and may reside in private software repositories or online hosting service such as Github.

0046 A person skilled in the relevant art will understand that the term “deep learning” refers to a type of machine learning based on artificial neural networks. Deep learning is a class of machine learning algorithms (e.g. a set of instructions, typically to solve a class of problems or perform a computation) that use multiple layers to progressively extract higher level features from raw input. For example, in image processing, lower layers may identify edges, while higher layers may identify human-meaningful items such as digits or letters or faces; in music analysis, lower layers may identify local pitch and rhythmic movements, while higher layers may identify emotional artistic expressions of the composer.

0047 A person skilled in the art will understand that the operation of the network ready device (e.g. mobile device, work station, etc.) may be controlled by a variety of different program modules. Examples of program modules are routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. It will be understood that the present invention may also be practiced with other computer system configurations, including multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCS, minicomputers, mainframe computers, and the like. Furthermore, the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices. One skilled in the relevant art would appreciate that the device connections mentioned herein are for illustration purposes only and that any number of possible configurations and selection of peripheral devices could be coupled to the computer system.

0048 Embodiments of the present invention can be implemented by a software program for processing data through a computer system. It will be understood by a person skilled in the relevant art that the computer system can be a personal computer, mobile device, notebook computer, server computer, mainframe, networked computer (e.g., router), workstation, and the like. The program or its corresponding hardware implementation is operable for providing user authentication. In one embodiment, the computer system includes a processor coupled to a bus and memory storage coupled to the bus. The memory storage can be volatile or non-volatile (i.e. transitory or non-transitory) and can include removable storage media. The computer can also include a display, provision for data input and output, etc. as will be understood by a person skilled in the relevant art.

0049 A person skilled in the art will understand that a “Turing Test” is typically referred to as test to tell computers and humans apart. In theory, it is a simple test that can be easily answered by a human but extremely difficult to be answered by a computer. Such tests have been widely used for security reasons, such as for example, preventing automated registration in web-based services like web-based email. Email providers may use an automated Turing Test as a step in the registration process to prevent automated scripts from subscribing and using their resources for spam distribution. Other applications of Automated Turing Tests involve on-line polls, web-blogs, or purchasing products, where only humans are permitted to participate. An automated Turing Test typically presents a human with a token that includes a key. The token is often implemented as an image and the key is often implemented as text within the image. While a human is generally able to identify the text within the image fairly easily, such identification is often difficult for a computer program. Automated Turing Tests typically attempt to frustrate a computer program’s ability to identify the key by embedding text into the image that violates OCR recognition rules. In embodiment of the present invention, an analogy to a Turing Test is used to determine whether music compositions (containing music and/or lyrics) has been composed by a machine (e.g. a computer) or by a person. 0050 It should be bome in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as “receiving,” “creating,” “providing,” or the like refer to the actions and processes of a computer system, or similar electronic computing device, including an embedded system, that manipulates and transfers data represented as physical (electronic) quantities within the computer system’s registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

Composition Rule Sets

0051 A music composition may comprise multiple voices, where a voice will be generally understood to represent an independent musical line consisting of an ordered sequence of notes. Within each voice, certain compositional rules apply to adjacent notes, which are referred to as the Melodic or Horizontal Rules. When notes from two or more voices are involved, the governing rules are referred to as Contrapuntal Rules, or Vertical Rules. Musical Phrase is typically multiple measures that are spliced together to form musical sentences and paragraphs as in written languages. Within the context of Musical Phrase, Cadential Rules function as the musical equivalence of punctuations, partitioning the musical materials into various sections and providing the music a sense of conclusion. Motivic Rules dictate the pattern of specific pitches, rhythms, and intervals, that serve to interrelate various sections and provide the music a sense of cohesiveness. Text-Setting Rules determine how unaccented and accented syllables can be paired with specific rhythms and pitch motions. When all the foregoing rules, as seen in FIG. 1, both music rule sets 110 and lyric rule sets 111, are applied as a set of rules, they are referred to as a Composition Rules Set 115.

0052 A person skilled in the relevant art will understand each type, style, era, etc. of music may have specific Compositional Rules Sets. As such, the methods, systems and apparatus of the present invention can be directed to the production and performance of music and lyrics derived from any Composition Rules Set. In a preferred embodiment, a Compositional Rule Set may be derived from or based on 16th and 17th century speciescounterpoint rules. It will be understood, however, that any species-counterpoint rules can be used in the present invention. In this preferred embodiment, as shown in FIG. 1, the target music data 100 (for example, in MIDI format) may be rendered (see 145 of FIG. 1) from computer readable data of human composed music 140 (e.g. the music from specific known musical compositions). A person skilled in the relevant art will understand that any other computer readable format may be used in the embodiments of the present invention. The rendered music data 100 may contain note values (e.g. pitch content and rhythmic duration), lyrics, performance parameters (e.g. breathing, note velocity, note amplitude, vibrato, maintenance of sound etc.), etc.. The Al-based processes of the present invention utilize the Composition Rules Set 115 to determine the best option for the next immediate note or notes to follow in one or multiple voices based on the specific Compositional Rules Set employed in order to produce Rules Set music and lyrics data 120. In a preferred embodiment, this process is repeated until the conclusion of the composition. The Melodic (horizontal) Rules and the Contrapuntal (vertical) Rules from the Compositional Rules Sets dictate the solution space, which consists of all the plausible note selections that follow the Compositional Rules Sets. The resulting music/lyrics data 120 may be produced based strictly in accordance with that specific Compositional Rule Set. The music data 120 may be then said to have zero “errors” (e.g. it complies with the parameters established under the Compositional Rule Set). Despite the Composition Rule Set being designated for a specific time period (i.e. 16th and 17th century), it will be understood that these rule sets tend to be fundamental to most , if not all, music in Western Art Music tradition and provide fertile grounds for implementing Al-rule augmented systems.

0053 Music compositions may also comprise lyrics as well as musical notes. Text- Setting Rules 111 determine how words or syllables (e.g. lyrics/libretto) are applied to the music data 100, which dictates how notes and syllables are paired. When the Text Setting Rules 111 are combined with the Music Setting or Note Placement Rules 110 to arrive at the Compositional Rules Set 115, algorithm based processes of the present invention referred to as purely rule based composition or a searching based virtual composer 130 may be able to produce new music and/or lyrics data set 195 that abides by the conventions and rules selected (e.g. 16th and 17th species-counterpoint rules). In a preferred embodiment, the Text Setting Rules 111 may be a part of the Compositional Rule Set 115, and it is activated when the composition involves text.

0054 In order to obtain the necessary music derived computer readable data 100, the raw music data (e.g. MIDI or any other known format) 140 must first be reviewed, transcribed and translated into corresponding mathematical and computer readable code formulations (e.g. “rendering” music data at 145 in FIG. 1).

0055 In an embodiment of the present invention, there is provided Al-based processes of the present invention responsible for auto-detection and auto-annotation, referred to herein as the “Rule Annotator” (see 150 in FIG. 1). Using the encoded rules provided under the selected Compositional Rules Set 115 as the basis for analysis, the Rule Annotator 150 scans the musical data and detects points in the music data 100 in which the music deviates from the encoded Compositional Rules Set 115. The data representing these “violations” 160 are then automatically identified and compiled using an auto-annotation algorithm. The algorithm encodes each individual horizontal and vertical rules as a logical expression (i.e. a rule-abiding or rule-violating statement), and sequentially applies this logical checking to annotate the music data. Through the repetition of this process, Rule Annotator 150 collects data on what rules are being violated, how a rule is violated, when it is violated, and how the violation is rectified within each violation’s respective musical contexts. This is the first step towards developing the present inventions’ Al’s based “conscious learning” 170 of the music data 100 and the violations data 160 (see 196 and 197 in FIG. 1). Over time, this process will produce a database of rule violations, which would serve to identify particular styles, composers, and time periods.

0056 The details of this analysis and annotation process may, in a more preferred embodiment, involve the following steps. First, Al-based or algorithmic processes of the present invention represent each musical note (or alternatively, a lyric) as a pair of two (or three) integer values based on or corresponding to the musical note’s pitch and time duration. This data structure enables querying of neighboring notes in the music, both horizontally and vertically, which may be used to verify whether or not certain Compositional Rules Sets are violated (see 160 in FIG. 1). 0057 Simultaneously, Al-based or algorithmic processes of the present invention represented by the Rule Annotator 150 divides the Musical Phrases, which defines a unit of musical segment that has a complete musical sense of its own, by detecting cadences or specific note sequences that signal ends of musical phrases using a Cadential Point Detector, which is based on the aforementioned Cadential Rules that checks specific combinations of note pitch and duration values. Motifs are detected by checking and mining the repetitive note patterns in the note values of music data, which is based on the aforementioned Motivic Rules. Also simultaneously, Rule Annotator 150 first converts the inputted text (e.g. lyrics) into words, then a sequence of syllables, consisting of the syllable name, letters, and accentuations. The syllable sequence is sequentially paired with musical notes, where one syllable can be mapped to multiple notes, and vice versa. The Text Setting Rules 111 dictates when and what syllable can appear based on the notes’ rhythmic durations, pitch values, and horizontal motions from adjacent notes.

0058 In a preferred embodiment of the present invention, a rule-augmented Al-based process referred to as the “conscious learning” algorithm uses the rule violation data 160 from Rule Annotator 150 and music data 100 to train one or more music-generating neural networks (collectively referred to as 185 in FIG. 1) so as to develop a “Conscious Learning” Al-based algorithm at 170. In a preferred embodiment, conscious learning 170 trains the one or more neural networks 185 to allow the generation of music data. After processing 160 and 100, the neural networks 185 learn to jointly model these two data modalities to produce, with Al-based algorithms 180, new music data set A 190. These Al-based algorithms 180 can be referred to as “Mixture of Experts”, as described below.

0059 In a preferred embodiment, a “Mixture-of-Expert” system 180 may be used with 185 and 115 to generate new music data set A 190, where each “expert” (e.g. set of rules or probabilities provided by trained neural networks 185) provides certain constraints or probabilistic distributions of the next note to be added, given preceding notes 190. A search-based algorithm, e.g. A-Star searching algorithm (such as for example, 130), may be used to improve the efficiency of music generation in 180. In a preferred embodiment, the “Mixture-of-Expert” system 180 produces computer readable data that corresponds to the music and/or lyrics. This computer readable music data 190 and/or 195 can then be converted into sound (See FIG. 2).

Instrumental and Vocal Performing.

0060 In a preferred embodiment, the performances of the generated computer readable music data (see 195 and 190 in FIG. 2) may be automated and may be dictated by neural networks that autonomously control velocity, pedal, vocal range, tonal quality, vibrato, amplitude, breathing, and timbre. When multiple voices are being performed, separately trained neural networks, or other ML- or rule-based algorithms 201 synthesize or automatically generate mixing parameters for EQ, compression, normalization, reverb, and acoustical effects, which are conducive to producing the most favourable musical results 210 and/or 211. If lyrics are involved in the music, the Text Setting Rules 111 provide the basis for how the syllables are coupled with the notes in the performance process.

0061 It will be understood that any musical and lyric computer readable data 100 could be used to “train” the Al-based process of the present invention (e.g. see FIG. 1 where 100 is used to train 170). In a preferred embodiment, the musical (and lyric, when available) computer readable data 100 may be derived from music data 140. As will be understood by a person skilled in the relevant art, MIDI files of Palestrina and Bach’s music may be used as the music data 140. MIDI files, widely available on the web, may have to be downloaded and edited to exclude any anomalies and human errors in the transposition process (e.g. rendered as provided in 145). In an embodiment of the present invention, rhythms, instrumentation, and pitch range may also be manually modified in 145 to ensure that the Rule Annotator 150 can properly analyze the music data files 100.

Training data and methodology for rendering training data

0062 Along with manual tweaking of MIDI files, several methods for rendering the data (see 145 in FIG. 1) may be employed to ensure that the converted MIDI files may be formatted into well-paced symbolic music scores, properly transposed between different music modes, e.g. Aeolian, Phrygian, Mixolydian, Dorian, Ionian, Lydian, and Locarian known by a person skilled in the relevant art, and reduce down to the music’s nonornamental notes, e.g. notes that conveys the main body of the flow of the music. These modifications were then exported as new MIDI files or other known formats, which may be then processed in accordance with the embodiments of the present invention.

Design of Musical Turing Test: Comparison of human compositions and Al generations under same compositional parameters

0063 In a preferred embodiment, as shown in FIG. 2, the Musical Turing Test of the present invention is designed to compare Al-generated musical compositions (e.g. containing music and/or lyrics, such as, for example, music data A 190 and music data B 195 in FIGS. 1 and 2) with music data whereby the music is by human composers (see for example 140 in FIGS. 1 and 2) to determine whether or not the Al based processes of the present invention or other Al systems can compose music comparable to and/or indistinguishable from those of humans. The aim of the Musical Turing Test is to explore the qualities of the music that makes human compositions human and isolate (as shown in 240 of FIG. 2), the effective sub-procedures of the algorithm that correlate to human-like Al music generations.

0064 In a preferred embodiment, human composers may be asked to compose music based on specific criteria, such as, images, texts, themes, etc. These same criteria may be processed by the Al based systems of the present invention to produce music similarly “inspired” by the selected criteria. In a more preferred embodiment, an input criteria, such as an image, may be converted into a representational vector of floating numbers by Al or algorithm based processes such as convolutional neural networks (e.g. a class of deep neural networks, most commonly applied to analyzing visual imagery), which is then converted into a text consisting of a sequence of words that describes the content of the image via Al or algorithm based processes such as recurrent neural networks or transformer neural networks. The generated text is then used to condition music generation, such as, for example, generation of music in accordance to the semantics and meaning of the text. Human performers 220 may record the human compositions 140 to produce audio recordings 214 and may record the transcribed Al-generated music 190 and 195 to produce 213 and 212 (see FIG. 2), and these recordings 212, 213 and 214 may be sampled into excerpts (not shown) (e.g. of one minute in duration and/or sixteen bars of composed music). These audio excerpts may be the sound materials that the human participants will evaluate as part of the Musical Turing Test (see 230 in FIG. 2). Human participants will listen to randomized sound materials and determine whether or not sound files were produced by humans or Al. Al-generated music that were misidentified as being human produced will serve as the proof-of-concept for this music generating software.

Methodology to automatically evaluate algorithm effectiveness

0065 The first type of evaluation requires collecting a statistically sufficient number of human questionnaires as part of the Musical Turing Test (see 230 of FIG. 2) to evaluate the effectiveness of the generated test results, e.g. via ratings. The second type of evaluation (e.g. via a trained evaluation neural network 240) may take test music data (e.g. 140, 190 and 195) and automatically determine features and metrics that may be well-correlated with human ratings. This enables automatic human-like evaluation of the present invention’s algorithms effectiveness, even without a human in-the-loop. In a preferred embodiment, 240 may be used together with 180 (see 241 in FIG. 2) to further improve the quality of 190.

Application of Present Invention

0066 EMUJI™ is a system that automatically converts text into singing music with accompanied instrumental performances. The user of Emuji first inputs text and chooses a desired music style to “Emujify” (e.g. Piano/String/Rock/Pop/etc, male/female/etc). The algorithm generates new and diverse musical samples utilizing the text provided within the musical constraints identified by the user. When a viewer browses text that has been processed by Emuji, the generated music automatically plays.

0067 The Emuji system consists of an automatic text-to-music API empowered by the aforementioned composition algorithm, as well as the GUI of querying and controlling the API, and the streaming of generated music data, while text is being browsed.

0068 Another aspect of the present invention is directed to methods, systems and apparatus for using artificial intelligence to automatically compose, perform, mix, and compile large collections of music. As shown in FIG. 3, a user may provide input 300 to select a plurality of music related control variables, including, but not limited to, actual audio files of the music or hyperlinks to the music to be emulated, the emotional tone or mood of the music (e.g. happy/sad, etc.), the genres of the music or style (e.g. rock/punk, type of video game (e.g. sci-fi, medieval, etc.), etc.), a sliding bar for the tempo, sentiment, decade/time, etc. In a preferred embodiment, the variables can be selected a through series of multiple-choice selections. A person skilled in the art will understand that for each music related control variable where user-input field is provided, it is possible to have further multiple (e.g. second level) control variable entries and each second level entry may also have multiple (e.g. third level) associated values. It will be understood that any number of and levels of control variables may be provided. This will collectively be referred to as control variables. User inputted control variables can be obtained by directly presenting the selections to the user, or via other means such as automatically detected sentiment/emotional values using automatic or semi-automatic classification techniques on text, image, audio data (as provided herein), or any other direct or indirect user input such as gaming controllers or gameplay. For example, in the context if video games, the style, pace, genre of play could provide the control variable. In a preferred embodiment, a fast paced first person shooter may provide control variables that include aggressive rock music.

0069 As shown in FIG. 3, a composer module 310 may be provided that produces collections of music sheet files 320 based on the control variables selected by the user. A typical composer module may have included therein an Al generative module 315, such as machine-learning-based systems such as convolutional neural networks, recurrent neural networks, self-attention neural networks, or symbolic music compositional systems, including the embodiments provided and described herein. Another example of a composer module can be applying rule-based templates to concatenate audio samples that are pitch- tuned. The composer may include sub-modules such as a performer module (e.g. 201) that generates performative instructions and specifications, and/or a text-setter that generates vocal performance instruction based on user inputted control variables.

0070 As shown in FIG. 3, a music sheet file 320 is a data file produced by composer module 310 that provides information on how the music should be performed and dictates musical (i.e. pitch, durations, motivic patterns, harmony, voice-leading), performative parameters (e.g. articulatory profile, velocity, temporality), and/or text-setting content (e.g. syllable positions). The music sheet file 320 may be realized in various audio outputs (e.g. the MIDI, and XML file format).

0071 In a preferred embodiment, a Tenderer module 330 is an audio synthesizing module that converts music sheet file 320 into a raw audio file 340 featuring instrument performance. In a preferred embodiment of the present invention, the Tenderer 330 may also include a machine learning classification or regression module or submodule that determines appropriate instrument assignment for different parts of the music sheet file 320 based on the user input control variables and a database 350 for training the classification submodule 345. Music parts can be different sections or segments e.g. staff/voice, melody /harmony, and/or verse/chorus. The instrument assignment includes but is not limited to instrument types, settings, and/or effects e.g. piano/guitar with amplifier/distortion, and/or audio snippets. An example of this sub-module 345 is a neural network that takes user input (i.e. the control variables) as its input (see 362 of FIG. 3) and predicts the category of instruments to be applied to various musical voices/parameters including melody, harmony, and rhythms, etc. In a preferred embodiment, Tenderer training database 350 is used for training the machine learning sub-module 345. The database 350 may comprise a training dataset of manually or procedurally created “ground-truth” examples of user inputs and the associated Tenderer settings and parameters. A raw audio file 340 is an audio file for an individual music part (e.g. the melody performed by a selected instrument). File formats may include (but are not limited to) wav, mp3, ogg, flac, aiff, live stream, or any audio output signals.

0072 The mixer module 360 (see FIG. 3) may be, in a preferred embodiment, an audio signal processing module that manipulates audio signals and applies appropriate auditory effects (e.g. EQ, reverb, delay, flange, compressor, limiter, side-chain/gate, groove pool, etc.) to the raw audio file 340. In a preferred embodiment of this invention, the mixer module 360 comprises a machine learning classification or regression module or sub-module 361 trained on a database 365 that determines which auditory effects are applied and its associated musical and performative parameters based on the user inputted control variables. An example of this sub-module 361 is a neural network that predicts a set of auditory parameters based on user and audio inputs (see, for example, 363 of FIG. 3) to determine aesthetically appropriate effects that are applied to the raw audio files as its output. The mixer training database 365 is used for training the machine learning sub-module 361 of mixer 360. The database 365 may comprises consists of a training dataset of manually or procedurally created “ground-truth” examples of user inputs and the associated mixer settings and parameters.

0073 The processed audio data file 370 is the final output of the inventions and processes discussed herein. The methods and outputs of the present invention may be distributed, in a preferred embodiment, through a cloud-based API that provides processed audio files or streams based on user-specified input parameters (e.g. 300). Given the variety of user inputs and the multiple possible outputs from the various stages of the system discussed herein, this invention can produce a nearly infinite combination of outputs resulting in a perceptually endless collection of generated music.

0074 As shown in FIG. 3, the client-side interface or software 375 controls how the processed audio 370 (e.g. music outcome) is consumed by the user, whose form can be flexible depending on the application scenario. One example is to pre-generate a large quantity of processed audio results 370 and distribute them with a cloud-based file downloader. Another example is to fully execute the preferred whole music-generating process on the client-side, enabling all users to access endless libraries of pre created/performed or simultaneously created/performed music through the client-side software e.g. each copy of a video game having its own unique music library/soundtrack.

0075 INFINIMUSIC™ is a system that automatically takes the user inputs and creates instrumental and vocal performances. The user of INFINIMUSIC™ first inputs the desired music style and other inputs as described above to generates new and diverse musical samples utilizing the musical constraints identified by the user.

0076 The INFINIMUSIC™ system consists of an automatic music API empowered by the aforementioned algorithms, as well as a GUI of querying and controlling the API, and the streaming of generated music data. 0077 Although this disclosure has described and illustrated certain preferred embodiments. As shown in FIG. 1, in a second situation, of the invention, it may be understood that the invention may be not restricted to those embodiments. Rather, the invention includes all embodiments which are functional or mechanical equivalence of the specific embodiments and features that have been described and illustrated.

Claims

25 CLAIMS:

1. A method for automatically composing a music composition, the method comprising the steps of:

(a) a user selecting a music related control variable;

(b) processing the music related control variable through a music composer module to compose a music file based on the music related control variable;

(c) synthesizing a raw audio file based on the music file from (b) from a rendered module; and

(d) processing the raw audio file through a mixer module to produce auditory effects to produce the music composition.

2. The method of claim 1 wherein the music related control variable comprises audio files of a music to be emulated, an emotional tone the music, the genres of the music, the tempo, sentiment, time of the music to be emulated.

3. The method of claim 2 wherein the music composer module comprise an Al generative module or a module applying rule-based templates to concatenate audio samples that are pitch-tuned.

4. The method of claim 3 wherein the Al generative module is selected from the group consisting of convolutional neural networks, recurrent neural networks, self-attention neural networks and symbolic music compositional systems.

5. The method of claim 4 wherein the composer module determines how the music should be performed and dictates a musical parameter, a performative parameter or textsetting content parameter.

6. The method of claim 5 wherein the musical parameter is selected from the group consisting of pitch, durations, motivic patterns, harmony, and voice-leading, the performative parameter is selected from the group consisting of articulatory profile, velocity, and temporality, and the text-setting content is syllable positions.

7. The method of claim 6 wherein the music related control variable is rendered from music composed by a human.

8. A system implemented on a machine that utilizes predictive models of human memory to automatically compose a music composition, the system comprising a component that

(a) acquires a music related control variable ;

(b) process the music related control variable through a music composer module to compose a music file based on the music related control variable;

(c) synthesize a raw audio file based on the music file from (b) from a rendered module; and

(d) process the raw audio file through a mixer module to produce auditory effects to produce the music composition.

9. The system of claim 8 wherein the music related control variable comprises audio files of a music to be emulated, an emotional tone the music, the genres of the music, the tempo, sentiment, time of the music to be emulated.

10. The system of claim 9 wherein the music composer module comprise an Al generative module or a module applying rule-based templates to concatenate audio samples that are pitch-tuned.

11. The system of claim 10 wherein the Al generative module is selected from the group consisting of convolutional neural networks, recurrent neural networks, self-attention neural networks and symbolic music compositional systems.

12. The system of claim 11 wherein the composer module determines how the music should be performed and dictates a musical parameter, a performative parameter or textsetting content parameter.

13. The system of claim 12 wherein the musical parameter is selected from the group consisting of pitch, durations, motivic patterns, harmony, and voice-leading, the performative parameter is selected from the group consisting of articulatory profile, velocity, and temporality, and the text-setting content is syllable positions.

14. The system of claim 13 wherein the music related control variable is rendered from music composed by a human.

15. One or more non-transitory computer-readable media storing computer-executable instructions that upon execution cause one or more processors to perform acts comprising:

(a) acquiring a music related control variable ;

16. The non-transitory computer-readable media of claim 15 wherein the music related control variable comprises audio files of a music to be emulated, an emotional tone the music, the genres of the music, the tempo, sentiment, time of the music to be emulated.

17. The non-transitory computer-readable media of claim 16 wherein the music composer module comprise an Al generative module or a module applying rule-based templates to concatenate audio samples that are pitch-tuned.

18. The non-transitory computer-readable media of claim 17 wherein the Al generative module is selected from the group consisting of convolutional neural networks, recurrent neural networks, self-attention neural networks and symbolic music compositional systems.

19. The non-transitory computer-readable media of claim 18 wherein the composer module determines how the music should be performed and dictates a musical parameter, a performative parameter or text-setting content parameter. 28

20. The non-transitory computer-readable media of claim 19 wherein the musical parameter is selected from the group consisting of pitch, durations, motivic patterns, harmony, and voice-leading, the performative parameter is selected from the group consisting of articulatory profile, velocity, and temporality, and the text-setting content is syllable positions.

21. The non-transitory computer-readable media of claim 20 wherein the music related control variable is rendered from music composed by a human.

22. An automated music composition system for composing and performing a music composition in response to a user providing a music related control variable, said automated music composition and generation system comprising: an automated music composition engine, the automated music composition engine using a trained machine learning model to compose the music composition, the trained machine learning model employing multiple types of machine learning algorithms to compose the music composition; a user interface subsystem interfaced with the automated music composition engine, and employing a graphical user interface (GUI) for permitting the user to select the music related control variable for the music composition; a processing subsystem interfaced with the automated music composition engine:

(i) processing the music related control variable through a music composer module to compose a music file based on the music related control variable;

(ii) synthesizing a raw audio file based on the music file from (i) from a rendered module; and

(iii) processing the raw audio file through a mixer module to produce auditory effects to produce the music composition.