WO2021068812A1

WO2021068812A1 - Music generation method and apparatus, electronic device and computer-readable storage medium

Info

Publication number: WO2021068812A1
Application number: PCT/CN2020/119078
Authority: WO
Inventors: 刘奡智; 蔡梓丰; 王健宗
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-10-12
Filing date: 2020-09-29
Publication date: 2021-04-15
Also published as: CN110827789B; CN110827789A

Abstract

A music generation method, comprising: recording a motion video of a user; reading a current video frame of the motion video as a first target video frame; identifying IDs and position coordinate values of human joint sites in the first target video frame, and controlling a playback unit to start up and play back music generated according to preset initial values of a music parameter and sound effect parameter; by using the reading time of the first target video frame as a time start point, reading a current video frame of the motion video at every preset time as a second target video frame; identifying the IDs and position coordinate values of the human joint sites in the second target video frame; and adjusting the music parameter and the sound effect parameter according to changes in the position coordinate values of the human joint sites, thereby adjusting the music so as to generate new music. A music generation apparatus, an electronic device and a computer-readable storage medium. Thus, the problem in which music creation is difficult and is not easy to expand is solved.

Description

Music generation method, device, electronic equipment and computer readable storage medium

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office, application number CN201910969868.5, and titled "Music Generation Method, Electronic Device, and Computer-readable Storage Medium" on October 12, 2019, and the entire content of it is approved The reference is incorporated in this application.

Technical field

This application relates to the field of data processing technology, and in particular to a music generation method, device, electronic equipment, and computer-readable storage medium.

Background technique

In today's society, music has penetrated deeply into people's lives. Music can adjust mood, relieve stress and reduce anxiety. The inventor realized that the generation of traditional music requires the creator to have a certain knowledge of music theory, combined with inspiration and creative experience, in order to create complete music. For those who do not have a musical foundation, these conditions have formed a high threshold, which prevents many non-professional people who love music from participating in the creation of music. At present, there is a lack of a music generation method that is simple to create and easy to expand.

Summary of the invention

The music generation method provided in this application includes:

The first recognition step: use the camera unit to record the user's action video, read the current video frame of the action video as the first target video frame, input the first target video frame into the pre-trained model, and recognize the first target video frame. A user’s key part information in a target video frame, where the key part information includes the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and its position coordinate value;

Generation step: when the position coordinate values of the first type of human body joint parts and the position coordinate values of the second type of human body joint parts in the first target video frame are recognized, the playback unit is controlled to start and play according to the preset music parameters and The music generated by the initial value of the sound effect parameter;

The second identification step: taking the reading time of the first target video frame as the starting point of time, reading the current video frame of the action video as the second target video frame at a preset time interval, and setting the second target video frame Input the pre-trained model to identify the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and the position coordinate value of the user in the second target video;

The adjustment step: adjust the music according to the predetermined mapping relationship table of the first type of human body joint parts and the music parameters, the change of the position coordinate values of the first type of human body joint parts in the second target video frame, and the first preset adjustment range table Parameters: adjust the sound effect parameters according to the predetermined mapping table of the second type of human joint parts and the sound effect parameters, the change in the position coordinate values of the second type of human joint parts in the second target video frame, and the second preset adjustment range table , And adjust the music according to the adjusted music parameters and sound effect parameters to generate new music.

The present application also provides a music generating device, which includes:

The first recognition module is used to record the user's action video by using the camera unit, read the current video frame of the action video as the first target video frame, input the first target video frame into the pre-trained model, and identify all The key part information of the user in the first target video frame, where the key part information includes the ID of the first type of human body joint part and its position coordinate value, and the ID of the second type of human body joint part and its position coordinate value;

The generating module is used to control the playback unit to start and play music according to the preset when the position coordinate values of the first type of human joint parts and the second type of human joint parts in the first target video frame are recognized The music generated by the initial values of the parameters and sound effect parameters;

The second recognition module is used to read the current video frame of the action video as the second target video frame by taking the reading time of the first target video frame as the time starting point and every preset time interval, and set the second target video frame as the second target video frame. The video frame is input into the pre-trained model to identify the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and the position coordinate value of the user in the second target video;

The adjustment module is used to perform the mapping relationship between the first type of human body joints and the music parameters, the change of the position coordinate values of the first type of human joints in the second target video frame, and the first preset adjustment range table. Adjust the music parameters according to the predetermined mapping relationship table of the second type of human body joints and the sound effect parameters, the change of the position coordinate value of the second type of human joints in the second target video frame, and the second preset adjustment range table. Sound effect parameters, and adjust the music according to the adjusted music parameters and sound effect parameters to generate new music.

The present application also provides an electronic device, the electronic device comprising: a memory, a processor, the memory stores a music generation program that can run on the processor, and when the music generation program is executed by the processor To achieve the following steps:

The present application also provides a computer-readable storage medium having a music generation program stored on the computer-readable storage medium, and the music generation program can be executed by one or more processors to implement the following steps:

Description of the drawings

FIG. 1 is a schematic diagram of an embodiment of an electronic device of this application;

2 is a schematic diagram of modules of an embodiment of a music generating device;

Fig. 3 is a flowchart of an embodiment of a method for generating music according to this application.

Detailed ways

In order to make the purpose, technical solutions, and advantages of this application clearer, the following further describes this application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

It should be noted that the descriptions related to "first", "second", etc. in this application are only for descriptive purposes, and cannot be understood as indicating or implying their relative importance or implicitly indicating the number of indicated technical features . Therefore, the features defined with "first" and "second" may explicitly or implicitly include at least one of the features. In addition, the technical solutions between the various embodiments can be combined with each other, but it must be based on what can be achieved by a person of ordinary skill in the art. When the combination of technical solutions is contradictory or cannot be achieved, it should be considered that such a combination of technical solutions does not exist. , Is not within the scope of protection required by this application.

As shown in FIG. 1, it is a schematic diagram of an embodiment of the electronic device 1 of this application. The electronic device 1 is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions. The electronic device 1 may be a computer, a single web server, a server group composed of multiple web servers, or a cloud composed of a large number of hosts or web servers based on cloud computing, where cloud computing is a type of distributed computing, A super virtual computer composed of a group of loosely coupled computer sets.

In this embodiment, the electronic device 1 includes, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can be communicatively connected to each other through a system bus. The memory 11 stores a music generation program 10, the music generation program 10 can be executed by the processor 12. Figure 1 only shows an electronic device 1 with components 11-13 and a music generating program 10. Those skilled in the art will understand that the structure shown in Figure 1 does not constitute a limitation on the electronic device 1, and may include a comparison chart. Show fewer or more components, or combinations of certain components, or different component arrangements.

Among them, the memory 11 includes a memory and at least one type of readable storage medium. The memory provides a cache for the operation of the electronic device 1; the readable storage medium can be, for example, flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM) ), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks and other non-volatile or volatile storage media. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as the hard disk of the electronic device 1. In other embodiments, the non-volatile or volatile storage medium may also be an electronic device. The external storage device of the device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, and a flash card (Flash Card) equipped on the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and various application software installed in the electronic device 1, for example, to store the code of the music generation program 10 in an embodiment of the present application. In addition, the memory 11 can also be used to temporarily store various types of data that have been output or will be output.

In some embodiments, the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. The processor 12 is generally used to control the overall operation of the electronic device 1, such as performing data interaction or communication-related control and processing with other devices. In this embodiment, the processor 12 is used to run the program code or process data stored in the memory 11, for example, to run the music generation program 10 and so on.

The network interface 13 may include a wireless network interface or a wired network interface, and the network interface 13 is used to establish a communication connection between the electronic device 1 and a client (not shown in the figure).

Optionally, the electronic device 1 may further include a user interface. The user interface may include a display (Display) and an input unit such as a keyboard (Keyboard). The optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, etc. Among them, the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the electronic device 1 and to display a visualized user interface.

In an embodiment of the present application, when the music generation program 10 is executed by the processor 12, the following first recognition step, generation step, second recognition step, and adjustment step are implemented.

The first recognition step: use the camera unit to record the user's action video, read the current video frame of the action video as the first target video frame, input the first target video frame into the pre-trained model, and recognize the first target video frame. The key part information of the user in a target video frame, where the key part information includes the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and its position coordinate value.

The recording of the user's action video can be the recording of the user's dance video, or the recording of the user's fitness video, sports training video, or any other action video.

In an embodiment of the present application, the pre-trained model is a PoseNet model, which is a convolutional neural network model, which runs on Tensorflow.js (a deep learning framework) and can be viewed in a browser Perform real-time human pose estimation.

The PoseNet model can recognize single-person postures, as well as multi-person postures. In this embodiment, the single-player action video of the user is selected.

The training process of the PoseNet model includes:

B1. Obtain a preset number (for example, 10,000) of character action picture samples, and divide the picture samples into a training set of a first proportion and a verification set of a second proportion;

B2. Use the training set to train the PoseNet model;

B3. Use the verification set to verify the accuracy of the trained PoseNet model. If the accuracy is greater than or equal to the preset accuracy (for example, 95%), the training ends;

B4. If the accuracy rate is less than the preset accuracy rate, increase the preset number of character action picture samples according to a preset percentage (for example, 15%), and return to step B1.

In this embodiment, the output of the PoseNet model is the ID of the user's 17 key joint parts and their position coordinate values.

The relationship between human joints and their IDs can be shown in Table 1 below:

IDID	人体关节部位Human joints
00	鼻子nose
11	左眼Left eye
22	右眼Right eye
33	左耳Left ear
44	右耳Right ear
55	左肩Left shoulder
66	右肩Right shoulder
77	左肘Left elbow
88	右肘Right elbow
99	左腕 Left wrist
1010	右腕Right wrist
1111	左臀Left hip
1212	右臀Right hip
1313	左膝Left knee
1414	右膝Right knee
1515	左踝Left ankle
1616	右踝Right ankle

Table 1

In this embodiment, the key joint parts are divided into the first type of human joint parts and the second type of human joint parts according to the position distribution of the human body joint parts in the human body. For example, the first type of human joints may be the upper half of the human body, the second type of human joints may be the lower half of the human body, or the first type of human joints may be the left half of the human body, and the second type The node part of the human body is the joint part of the right half of the human body.

In this embodiment, the first type of joint parts of the human body are the joint parts of the left half of the human body, for example, the left wrist, the left knee, the left elbow, and the left hip.

The second type of human joints is the joints of the right half of the human body, such as the right wrist, right knee, right elbow, and right hip.

In this embodiment, the position of the camera unit is fixed, the position coordinate value of the human body joint part is the two-dimensional coordinate value (X, Y) of the human body joint part in each video frame, and the X of the two-dimensional coordinate system The axis is the upper border of each video frame, the Y axis is the left border of each video frame, and the origin is the intersection of the upper border and the left border of each video frame.

The key part information also includes the confidence score of the position accuracy of the human body joint part. The confidence score is between 0 and 1.0. The higher the confidence score, the position of the identified human joint part is indicated. The higher the accuracy.

Generation step: When the position coordinate values of the first and second types of human joint parts in the first target video frame are recognized, the playback unit is controlled to start and play the generated music parameters and the initial values of the sound effect parameters. music.

The music parameters include pitch, music speed, note time value, sound zone and so on.

The pitch is the height of the sound, including four types: A, B, C, and D.

The music speed is the number of beats per minute, the slow speed is 40 to 69 beats per minute, the medium speed is 72 to 84 beats per minute, and the fast speed is 108 to 28 beats per minute.

The note duration is used to indicate the relative duration between notes. The duration of a half note is 1/2 of a whole note, a quarter note is 1/4 of a whole note, and an eighth note is 1/8 of a whole note. .

The sound zone includes a high-range zone, a mid-range zone and a low-range zone, and the numerical range is 3-5.

The sound effect parameters include loudness, delay time, left and right phase, reverberation time, and the like.

The loudness is used to describe the size of the volume.

The delay time is the intermediate period of time from sound emission to human ear reception.

The reverberation time is the intermediate period of time between reflection and absorption of sound waves after the sound source stops sounding before the sound disappears.

The left-right phase is the direction of the sound, including three types: left, right, and center.

For example, the initial values of preset music parameters and sound effect parameters are as follows:

The initial value of pitch is C, the initial value of music speed is 90 beats, the initial value of note duration is one quarter, the initial value of sound zone is 4, the initial value of loudness is 80% of the system volume, and the initial value of delay time is 0.6 seconds. , The initial value of the reverberation time is 1 second, and the initial value of the left and right phases is centered.

The second identification step: taking the reading time of the first target video frame as the starting point of time, reading the current video frame of the action video as the second target video frame at a preset time interval, and setting the second target video frame Input the pre-trained model to identify the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and its position coordinate value of the user in the second target video.

Because the changes of adjacent video frames in the action video are small, and in order to reduce the amount of data to be processed, this embodiment does not read all the video frames, but adopts a method of reading one frame at a preset time interval.

The adjustment step: adjust the music according to the predetermined mapping relationship table of the first type of human body joint parts and the music parameters, the change of the position coordinate values of the first type of human body joint parts in the second target video frame, and the first preset adjustment range table Parameter, adjust the sound effect parameter according to the predetermined mapping relationship table of the second type of human body joint parts and the sound effect parameter, the change amount of the position coordinate value of the second type of human body joint part in the second target video frame, and the second preset adjustment range table , And adjust the music according to the adjusted music parameters and sound effect parameters to generate new music.

In an embodiment of the present application, the adjustment step includes:

A1. Use the position coordinate value of each joint part of the human body in the first target video frame as the initial value of the position of each joint part;

For example, the initial value of the left wrist position of the first target video frame ID 9 can be expressed as (X _9-start , Y _9-start ), and the initial value of the right wrist position ID of 10 can be expressed as (X _10-start , Y _10-start ).

A2. According to the position coordinate values of the first type of human body joints in the second target video frame and their initial position values, calculate the amount of change in the position coordinate values of the first type of human joints in the second target video frame;

For example, the position coordinate value of the left wrist with the ID of 9 in the second target video frame is represented as (X _9-2 , Y _9-2 ), then the position coordinate value of the left wrist with the ID of 9 in the second target video frame on the X axis The amount of change in X _9-2start =X _9-2 -X _9-start , and the amount of change in the position coordinate value of the Y axis is Y _9-2start =Y _9-2 -Y _9-start .

A3. According to the position coordinate values of the second type of human joints in the second target video frame and their initial position values, calculate the amount of change in the position coordinate values of the second type of human joints in the second target video frame;

For example, the position coordinate value of the right wrist with ID 10 in the second target video frame is expressed as (X _10-2 , Y _10-2 ), then the X-axis position coordinate value of the right wrist with ID 10 in the second target video frame The amount of change of X _10-2start =X _10-2 -X _10-start , and the amount of change of the position coordinate value of the Y axis is Y _10-2start =Y _10-2 -Y _10-start .

A4. Determine the name of the music parameter that needs to be adjusted according to the change in the position coordinate value of the first type of human joint part in the second target video frame and the predetermined mapping relationship between the first type of human joint part and the music parameter. 2. The amount of change in the position coordinate values of the second type of human joint parts in the target video frame and the predetermined mapping relationship table between the second type of human joint parts and the sound effect parameters determine the name of the sound effect parameter that needs to be adjusted;

The predetermined mapping relationship table between IDs of the first type of human joint parts and music parameters can be represented by Table 2 below.

IDID	动作姿态Action posture	有变化的坐标Changing coordinates	音乐参数Music parameters
99	左腕上下移动Move left wrist up and down	YY	音高pitch
99	左腕左右摆动Swing left and right	XX	音乐速度Music speed
1313	左膝左右摆动Left knee swing left and right	XX	音符时值Note duration
77	左肘上下移动Move the left elbow up and down	YY	音区Sound zone
……	……	……	……

Table 2

The predetermined mapping relationship table between the ID of the second type of human joint part and the sound effect parameter can be represented by Table 3 below.

IDID	动作姿态Action posture	有变化的坐标Changing coordinates	音效参数 Sound effect parameters
1010	右腕上下移动Move right wrist up and down	YY	响度Loudness
88	右肘左右摆动Swing right elbow left and right	XX	延迟时间delay
66	右肩左右摆动Swing right shoulder	XX	左右相位Left and right phase
1414	右膝左右摆动Right knee swing left and right	XX	混响时间Reverberation time
……	……	……	……

table 3

For example, according to the change in the X-axis position coordinate value of the left wrist with ID 9 in the second target video frame, it can be determined that the music speed in the music parameters needs to be adjusted, and according to the Y-axis of the left wrist with ID 9 in the second target video frame The amount of change in the position coordinate value can determine that the pitch of the music parameter needs to be adjusted.

A5. Adjust the music parameters that need to be adjusted according to the amount of change in the position coordinate values of the joints of the first type of human body in the second target video frame and the first preset adjustment range table, and adjust the music parameters to be adjusted according to the second target video frame in the second target video frame. The change amount of the position coordinate values of the human body joint parts and the second preset adjustment range table adjust the sound effect parameters that need to be adjusted.

The first preset adjustment range table can be represented by Table 4.

Table 4

The second preset adjustment range table can be represented by Table 5.

table 5

_{For example, if d is 5, when the amount of change X 9-2start} of the X axis position coordinate value of the left wrist with ID 9 in the second target video frame is 8, the music speed needs to be adjusted to 110 beats.

_{When the change amount Y 10-2start} of the Y-axis position coordinate value of the right wrist with the ID of 10 in the second target video frame is -13, the loudness needs to be adjusted to 74% of the system volume.

A6. Adjust the music according to the adjusted music parameters and sound effect parameters to generate new music.

In an embodiment of the present application, when the music generating program 10 is executed by the processor 12, the following steps are further implemented:

Stopping step: when a preset stop signal is received, the playing unit is controlled to stop playing the music.

In this embodiment, the preset stop signal may be to stop recording the user's action video, or it may be that the music playing time reaches a preset time threshold (for example, 3 minutes).

It can be seen from the above embodiment that the electronic device 1 proposed in this application first reads the current video frame of the action video being recorded as the first target video frame, and identifies the ID and position coordinates of the human joints in the first target video frame Value and control the playback unit to start, play music generated according to the preset music parameters and the initial values of the sound effect parameters; then, take the reading time of the first target video frame as the starting point, and read the action video every preset time The current video frame is used as the second target video frame, the ID and position coordinate values of the human joints in the second target video frame are identified, and the music parameters and sound effect parameters are adjusted according to the changes in the position coordinate values of the human joints, so that the music Make adjustments to generate new music, thereby solving the problem of difficulty in music creation and not easy to expand.

As shown in FIG. 2, it is a schematic diagram of modules of an embodiment of the music generating apparatus 100.

In an embodiment of the present application, the music generation device 100 includes a first recognition module 110, a generation module 120, a second recognition module 130, and an adjustment module 140. Illustratively:

The first recognition module 110 is configured to use a camera unit to record a user's action video, read the current video frame of the action video as a first target video frame, and input the first target video frame into a pre-trained model Identify the key part information of the user in the first target video frame, where the key part information includes the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and its position coordinate value;

The generating module 120 is configured to control the playing unit to start and play the first target video frame when the position coordinate values of the first type of human joint parts and the second type of human joint parts are recognized. Music generated by the initial values of the set music parameters and sound effect parameters;

The second identification module 130 is configured to take the reading time of the first target video frame as the starting point of time, and read the current video frame of the action video as the second target video frame at a preset time interval, and set the The second target video frame is input to the pre-trained model to identify the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and its position coordinate in the second target video value;

The adjustment module 140 is used to determine the mapping relationship between the first type of human body joint parts and the music parameters, the change in the position coordinate values of the first type of human body joint parts in the second target video frame, and the first preset The adjustment range table adjusts the music parameters, according to the predetermined mapping table of the second type of human body joints and the sound effect parameters, the change of the position coordinate values of the second type of human joints in the second target video frame, and the second preset adjustment The amplitude table adjusts the sound effect parameters, and adjusts the music according to the adjusted music parameters and the sound effect parameters to generate new music.

The functions or operation steps implemented by the first recognition module 110, the generation module 120, the second recognition module 130, and the adjustment module 140 when executed are substantially the same as those in the foregoing embodiment, and will not be repeated here.

As shown in FIG. 3, it is a flowchart of an embodiment of the music generating method of this application. The music generating method includes steps S1-S4.

S1. Use the camera unit to record the user's action video, read the current video frame of the action video as the first target video frame, input the first target video frame into a pre-trained model, and identify the first target video The key part information of the user in the frame, the key part information includes the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and its position coordinate value.

The training process of the PoseNet model includes:

B2. Use the training set to train the PoseNet model;

The relationship between the body joints and their ID can be as shown in Table 1 above.

In this embodiment, the key joint parts are divided into the first type of human joint parts and the second type of human joint parts according to the position distribution of the human body joint parts in the human body. For example, the first type of human joints may be the upper half of the human body, the second type of human joints may be the lower half of the human body, or the first type of joints may be the left half of the human body, and the second type of joints The part is the joint part of the right half of the human body.

S2. When the position coordinate values of the first and second types of human joint parts in the first target video frame are recognized, control the playback unit to start and play the music generated according to the preset music parameters and the initial values of the sound effect parameters .

The pitch is the height of the sound, including four types: A, B, C, and D.

The loudness is used to describe the size of the volume.

S3. Taking the reading time of the first target video frame as the starting point of time, reading the current video frame of the action video as the second target video frame at a preset time interval, and inputting the second target video frame to the The pre-trained model identifies the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and its position coordinate value of the user in the second target video.

S4. Adjust the music parameters according to the predetermined mapping relationship table of the first type of human body joint parts and the music parameters, the change amount of the position coordinate values of the first type of human body joint parts in the second target video frame, and the first preset adjustment range table , Adjust the sound effect parameters according to the predetermined mapping relationship table of the second type of human body joint parts and the sound effect parameters, the change amount of the position coordinate values of the second type of human body joint parts in the second target video frame, and the second preset adjustment range table, And according to the adjusted music parameters and sound effect parameters, the music is adjusted to generate new music.

In an embodiment of the present application, the adjustment step includes:

The predetermined mapping relationship table between IDs of the first type of human joint parts and music parameters can be represented by Table 2 above.

The predetermined mapping relationship table between the ID of the second type of human body joint part and the sound effect parameter can be represented by Table 3 above.

A5. Adjust the music parameters that need to be adjusted according to the amount of change in the position coordinate values of the joints of the first type of human body in the second target video frame and the first preset adjustment range table, and according to the second target video frame in the second target video frame. The change amount of the position coordinate values of the human body joint parts and the second preset adjustment range table adjust the sound effect parameters that need to be adjusted.

The first preset adjustment range table can be represented by Table 4 above.

The second preset adjustment range table can be represented by Table 5 above.

It can be seen from the above embodiments that the music generation method proposed in this application first reads the current video frame of the action video being recorded as the first target video frame, and identifies the ID and position coordinates of the human joints in the first target video frame Value and control the playback unit to start, play music generated according to the preset music parameters and the initial values of the sound effect parameters; then, take the reading time of the first target video frame as the starting point, and read the action video every preset time The current video frame is used as the second target video frame, the ID and position coordinate values of the human joints in the second target video frame are identified, and the music parameters and sound effect parameters are adjusted according to the changes in the position coordinate values of the human joints, so that the music Make adjustments to generate new music, thereby solving the problem of difficulty in music creation and not easy to expand.

In addition, the embodiment of the present application also proposes a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. The computer-readable storage medium may be a hard disk, a multimedia card, or an SD card. , Flash memory card, SMC, read only memory (ROM), erasable programmable read only memory (EPROM), portable compact disk read only memory (CD-ROM), USB memory, etc. any one or more of them random combination. The computer-readable storage medium includes a music generation program 10, and when the music generation program 10 is executed by a processor, the following operations are implemented:

Use the camera unit to record the user's action video, read the current video frame of the action video as the first target video frame, input the first target video frame into the pre-trained model, and identify the first target video frame The key part information of the user, where the key part information includes the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and its position coordinate value;

When the position coordinate values of the first type of human body joint parts and the position coordinate values of the second type of human body joint parts in the first target video frame are recognized, the playback unit is controlled to start and play the music parameters and sound effect parameters according to the preset music parameters and sound effect parameters. Music generated by the initial value;

Taking the reading time of the first target video frame as the starting point of time, every preset time interval, read the current video frame of the action video as the second target video frame, and input the second target video frame into the pre-training A good model to identify the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and its position coordinate value of the user in the second target video;

The music parameters are adjusted according to the predetermined mapping relationship table of the first type of human body joint parts and the music parameters, the change of the position coordinate value of the first type of human body joint parts in the second target video frame, and the first preset adjustment range table. The predetermined mapping relationship table between the second type of human body joints and the sound effect parameters, the change of the position coordinate values of the second type of human joints in the second target video frame, and the second preset adjustment range table to adjust the sound effect parameters, and adjust the sound effect parameters according to The adjusted music parameters and sound effect parameters are adjusted to the music to generate new music.

The specific implementation of the computer-readable storage medium of the present application is substantially the same as the specific implementation of the above-mentioned music generation method and electronic device, and will not be repeated here.

The serial numbers of the foregoing embodiments of the present application are only for description, and do not represent the superiority of the embodiments.

It should be noted that in this article, the terms "include", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements not only includes those elements, It also includes other elements not explicitly listed, or elements inherent to the process, device, article, or method. If there are no more restrictions, the element defined by the sentence "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.

Through the description of the above implementation manners, those skilled in the art can clearly understand that the above-mentioned embodiment method can be implemented by means of software plus the necessary general hardware platform, of course, it can also be implemented by hardware, but in many cases the former is better.的实施方式。 Based on this understanding, the technical solution of this application essentially or the part that contributes to the existing technology can be embodied in the form of a software product, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, The optical disc) includes several instructions to enable a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the method described in each embodiment of the present application.

The above are only the preferred embodiments of the application, and do not limit the scope of the patent for this application. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of the application, or directly or indirectly applied to other related technical fields , The same reason is included in the scope of patent protection of this application.

Claims

A music generation method is applied to an electronic device, the electronic device includes a camera unit and a playback unit, wherein the method includes:

The first recognition step: use the camera unit to record the user's action video, read the current video frame of the action video as the first target video frame, input the first target video frame into the pre-trained model, and recognize the first target video frame. A user’s key part information in a target video frame, where the key part information includes the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and its position coordinate value;

Generation step: when the position coordinate values of the first type of human body joint parts and the position coordinate values of the second type of human body joint parts in the first target video frame are recognized, the playback unit is controlled to start and play according to the preset music parameters and The music generated by the initial value of the sound effect parameter;

The second identification step: taking the reading time of the first target video frame as the starting point of time, reading the current video frame of the action video as the second target video frame at a preset time interval, and setting the second target video frame Input the pre-trained model to identify the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and the position coordinate value of the user in the second target video;

The adjustment step: adjust the music according to the predetermined mapping relationship table of the first type of human body joint parts and the music parameters, the change of the position coordinate values of the first type of human body joint parts in the second target video frame, and the first preset adjustment range table Parameters: adjust the sound effect parameters according to the predetermined mapping table of the second type of human joint parts and the sound effect parameters, the change in the position coordinate values of the second type of human joint parts in the second target video frame, and the second preset adjustment range table , And adjust the music according to the adjusted music parameters and sound effect parameters to generate new music.
8. The music generating method according to claim 1, wherein the adjusting step comprises:

A1. Use the position coordinate value of each joint part of the human body in the first target video frame as the initial value of the position of each joint part;

A2. According to the position coordinate values of the first type of human body joints in the second target video frame and their initial position values, calculate the amount of change in the position coordinate values of the first type of human joints in the second target video frame;

A3. According to the position coordinate values of the second type of human joints in the second target video frame and their initial position values, calculate the amount of change in the position coordinate values of the second type of human joints in the second target video frame;

A4. Determine the name of the music parameter that needs to be adjusted according to the change in the position coordinate value of the first type of human joint part in the second target video frame and the predetermined mapping relationship between the first type of human joint part and the music parameter. 2. The amount of change in the position coordinate values of the second type of human joint parts in the target video frame and the predetermined mapping relationship table between the second type of human joint parts and the sound effect parameters determine the name of the sound effect parameter that needs to be adjusted;

A5. Adjust the music parameters that need to be adjusted according to the variation of the position coordinate values of the joints of the first type of human body in the second target video frame and the first preset adjustment range table, and adjust the music parameters to be adjusted according to the second target video frame in the second target video frame. The amount of change in the position coordinate values of the human body joint parts and the second preset adjustment range table adjust the sound effect parameters that need to be adjusted;

A6. Adjust the music according to the adjusted music parameters and sound effect parameters to generate new music.
3. The music generating method according to claim 2, wherein the method further comprises:

Stopping step: when a preset stop signal is received, the playing unit is controlled to stop playing the music.
5. The music generating method according to claim 1, wherein the first type of human joint part is a left half of the human body joint part, and the second type of human body joint part is a right half of the human body joint part.
The music generation method according to any one of claims 1 to 4, wherein the pre-trained model is a PoseNet model, and the training process of the PoseNet model includes:

B1. Obtain a preset number of character action picture samples, and divide the picture samples into a training set of a first proportion and a verification set of a second proportion;

B2. Use the training set to train the PoseNet model;

B3. Use the verification set to verify the accuracy of the trained PoseNet model. If the accuracy is greater than or equal to the preset accuracy, the training ends;

B4. If the accuracy rate is less than the preset accuracy rate, increase the preset number of character action picture samples according to the preset percentage, and return to step B1.
3. The music generating method according to claim 1, wherein the key part information further includes the confidence scores of the position accuracy of the first and second types of human joint parts.
3. The music generating method according to claim 1, wherein the music parameters include pitch, music speed, note duration, and sound zone, and the sound effect parameters include loudness, delay time, left and right phase, and reverberation time.
A music generating device, wherein the device includes:

The first recognition module is used to record the user's action video by using the camera unit, read the current video frame of the action video as the first target video frame, input the first target video frame into the pre-trained model, and identify all The key part information of the user in the first target video frame, where the key part information includes the ID of the first type of human body joint part and its position coordinate value, and the ID of the second type of human body joint part and its position coordinate value;

The generating module is used to control the playback unit to start and play music according to the preset when the position coordinate values of the first type of human joint parts and the second type of human joint parts in the first target video frame are recognized The music generated by the initial values of the parameters and sound effect parameters;

The second recognition module is used to read the current video frame of the action video as the second target video frame by taking the reading time of the first target video frame as the time starting point and every preset time interval, and set the second target video frame as the second target video frame. The video frame is input into the pre-trained model to identify the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and the position coordinate value of the user in the second target video;

The adjustment module is used to perform the mapping relationship between the first type of human body joints and the music parameters, the change of the position coordinate values of the first type of human joints in the second target video frame, and the first preset adjustment range table. Adjust the music parameters according to the predetermined mapping relationship table of the second type of human body joints and the sound effect parameters, the change of the position coordinate value of the second type of human joints in the second target video frame, and the second preset adjustment range table. Sound effect parameters, and adjust the music according to the adjusted music parameters and sound effect parameters to generate new music.
An electronic device, wherein the electronic device includes a memory and a processor, the memory stores a music generation program that can run on the processor, and when the music generation program is executed by the processor, the following is achieved step:

The first recognition step: use the camera unit to record the user's action video, read the current video frame of the action video as the first target video frame, input the first target video frame into the pre-trained model, and recognize the first target video frame. A user’s key part information in a target video frame, where the key part information includes the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and its position coordinate value;

Generation step: when the position coordinate values of the first type of human body joint parts and the position coordinate values of the second type of human body joint parts in the first target video frame are recognized, the playback unit is controlled to start and play according to the preset music parameters and The music generated by the initial value of the sound effect parameter;

The second identification step: taking the reading time of the first target video frame as the starting point of time, reading the current video frame of the action video as the second target video frame at a preset time interval, and setting the second target video frame Input the pre-trained model to identify the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and the position coordinate value of the user in the second target video;

The adjustment step: adjust the music according to the predetermined mapping relationship table of the first type of human body joint parts and the music parameters, the change of the position coordinate values of the first type of human body joint parts in the second target video frame, and the first preset adjustment range table Parameters: adjust the sound effect parameters according to the predetermined mapping table of the second type of human joint parts and the sound effect parameters, the change in the position coordinate values of the second type of human joint parts in the second target video frame, and the second preset adjustment range table , And adjust the music according to the adjusted music parameters and sound effect parameters to generate new music.
9. The electronic device of claim 9, wherein the adjusting step comprises:

A1. Use the position coordinate value of each joint part of the human body in the first target video frame as the initial value of the position of each joint part;

A2. According to the position coordinate values of the first type of human body joints in the second target video frame and their initial position values, calculate the amount of change in the position coordinate values of the first type of human joints in the second target video frame;

A3. According to the position coordinate values of the second type of human joints in the second target video frame and their initial position values, calculate the amount of change in the position coordinate values of the second type of human joints in the second target video frame;

A4. Determine the name of the music parameter that needs to be adjusted according to the change in the position coordinate value of the first type of human joint part in the second target video frame and the predetermined mapping relationship between the first type of human joint part and the music parameter. 2. The amount of change in the position coordinate values of the second type of human joint parts in the target video frame and the predetermined mapping relationship table between the second type of human joint parts and the sound effect parameters determine the name of the sound effect parameter that needs to be adjusted;

A5. Adjust the music parameters that need to be adjusted according to the variation of the position coordinate values of the joints of the first type of human body in the second target video frame and the first preset adjustment range table, and adjust the music parameters to be adjusted according to the second target video frame in the second target video frame. The amount of change in the position coordinate values of the human body joint parts and the second preset adjustment range table adjust the sound effect parameters that need to be adjusted;

A6. Adjust the music according to the adjusted music parameters and sound effect parameters to generate new music.
9. The electronic device of claim 10, wherein the following steps are further implemented when the music generating program is executed by the processor:

Stopping step: when a preset stop signal is received, the playing unit is controlled to stop playing the music.
9. The electronic device according to claim 9, wherein the first type of human joint part is a left half of the human body joint part, and the second type of human body joint part is a right half of the human body joint part.
The electronic device according to any one of claims 9 to 12, wherein the pre-trained model is a PoseNet model, and the training process of the PoseNet model includes:

B1. Obtain a preset number of character action picture samples, and divide the picture samples into a training set of a first proportion and a verification set of a second proportion;

B2. Use the training set to train the PoseNet model;

B3. Use the verification set to verify the accuracy of the trained PoseNet model. If the accuracy is greater than or equal to the preset accuracy, the training ends;

B4. If the accuracy rate is less than the preset accuracy rate, increase the preset number of character action picture samples according to the preset percentage, and return to step B1.
9. The electronic device according to claim 9, wherein the key part information further comprises the confidence scores of the position accuracy of the first and second types of human joint parts.
9. The electronic device of claim 9, wherein the music parameters include pitch, music speed, note duration, and sound zone, and the sound effect parameters include loudness, delay time, left and right phase, and reverberation time.
A computer-readable storage medium, wherein a music generation program is stored on the computer-readable storage medium, and the music generation program can be executed by one or more processors to implement the following steps:

The first recognition step: use the camera unit to record the user's action video, read the current video frame of the action video as the first target video frame, input the first target video frame into the pre-trained model, and recognize the first target video frame. A user’s key part information in a target video frame, where the key part information includes the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and its position coordinate value;

Generation step: when the position coordinate values of the first type of human body joint parts and the position coordinate values of the second type of human body joint parts in the first target video frame are recognized, the playback unit is controlled to start and play according to the preset music parameters and The music generated by the initial value of the sound effect parameter;

The second identification step: taking the reading time of the first target video frame as the starting point of time, reading the current video frame of the action video as the second target video frame at a preset time interval, and setting the second target video frame Input the pre-trained model to identify the ID of the first type of human body joint part and its position coordinate value and the ID of the second type of human body joint part and the position coordinate value of the user in the second target video;

The adjustment step: adjust the music according to the predetermined mapping relationship table of the first type of human body joint parts and the music parameters, the change of the position coordinate values of the first type of human body joint parts in the second target video frame, and the first preset adjustment range table Parameter, adjust the sound effect parameter according to the predetermined mapping relationship table of the second type of human body joint parts and the sound effect parameter, the change amount of the position coordinate value of the second type of human body joint part in the second target video frame, and the second preset adjustment range table , And adjust the music according to the adjusted music parameters and sound effect parameters to generate new music.
The computer-readable storage medium of claim 16, wherein the adjusting step comprises:

A1. Use the position coordinate value of each joint part of the human body in the first target video frame as the initial value of the position of each joint part;

A2. According to the position coordinate values of the first type of human body joints in the second target video frame and their initial position values, calculate the amount of change in the position coordinate values of the first type of human joints in the second target video frame;

A3. According to the position coordinate values of the second type of human joints in the second target video frame and their initial position values, calculate the amount of change in the position coordinate values of the second type of human joints in the second target video frame;

A4. Determine the name of the music parameter that needs to be adjusted according to the change in the position coordinate value of the first type of human joint part in the second target video frame and the predetermined mapping relationship between the first type of human joint part and the music parameter. 2. The amount of change in the position coordinate values of the second type of human joint parts in the target video frame and the predetermined mapping relationship table between the second type of human joint parts and the sound effect parameters determine the name of the sound effect parameter that needs to be adjusted;

A5. Adjust the music parameters that need to be adjusted according to the variation of the position coordinate values of the joints of the first type of human body in the second target video frame and the first preset adjustment range table, and adjust the music parameters to be adjusted according to the second target video frame in the second target video frame. The amount of change in the position coordinate values of the human body joint parts and the second preset adjustment range table adjust the sound effect parameters that need to be adjusted;

A6. Adjust the music according to the adjusted music parameters and sound effect parameters to generate new music.
17. The computer-readable storage medium according to claim 17, wherein the following steps are further implemented when the music generation program is executed by one or more processors:

Stopping step: when a preset stop signal is received, the playing unit is controlled to stop playing the music.
16. The computer-readable storage medium of claim 16, wherein the first type of human joint part is a left half of the human body joint part, and the second type of human body joint part is a right half of the human body joint part.
The computer-readable storage medium according to any one of claims 16 to 19, wherein the pre-trained model is a PoseNet model, and the training process of the PoseNet model includes:

B1. Obtain a preset number of character action picture samples, and divide the picture samples into a training set of a first proportion and a verification set of a second proportion;

B2. Use the training set to train the PoseNet model;

B3. Use the verification set to verify the accuracy of the trained PoseNet model. If the accuracy is greater than or equal to the preset accuracy, the training ends;

B4. If the accuracy rate is less than the preset accuracy rate, increase the preset number of character action picture samples according to the preset percentage, and return to step B1.