WO2020155915A1 - Method and apparatus for playing back audio

Info

Publication number: WO2020155915A1
Application number: PCT/CN2019/126772
Authority: WO
Inventors: 黄佳斌
Original assignee: 北京字节跳动网络技术有限公司
Priority date: 2019-01-29
Filing date: 2019-12-19
Publication date: 2020-08-06
Also published as: CN109828741A

Abstract

A method and an apparatus for playing audio. A specific embodiment of the method comprises: acquiring a display video frame displayed on a target interface (401); performing human skeleton key point detection on the display video frame and, in response to detecting a human skeleton key point information set, determining whether the set comprises target human skeleton key point information (402); and, in response to determining that the set comprises the target human skeleton key point information, for an audio trigger position among at least one audio trigger position, playing a preset audio corresponding to that audio trigger position in response to determining that the human skeleton key point indicated by the target human skeleton key point information has moved from outside the audio trigger position onto it (4031). The embodiment allows a photographed person to trigger audio playback through body movement, thereby improving the flexibility of triggering audio playback.

Description

Method and device for playing audio

Cross references to related applications

This application is filed based on a Chinese patent application with an application number of 201910086010.4 and an application date of January 29, 2019, and claims the priority of the Chinese patent application. The entire content of the Chinese patent application is hereby incorporated into this application by reference.

Technical field

The embodiments of the present disclosure relate to the field of computer technology, and in particular to methods and devices for playing audio.

Background

With the development of computer technology, people can use mobile phones, tablet computers and other devices to shoot short videos and conduct video chats. When people shoot videos of themselves playing music, they usually need to use actual musical instruments, or to play music by touching virtual musical instruments displayed on electronic devices.

Summary of the invention

The embodiments of the present disclosure propose methods and devices for playing audio.

In a first aspect, an embodiment of the present disclosure provides a method for playing audio. The method includes: obtaining a display video frame displayed on a target interface, where the display video frame is a video frame included in the currently shot video, and at least one audio trigger position is preset on the target interface; performing human bone key point detection on the display video frame and, in response to detecting a human bone key point information set, determining whether the human bone key point information set includes target human bone key point information; and, in response to determining that it does, for an audio trigger position in the at least one audio trigger position, in response to determining that the human bone key point indicated by the target human bone key point information moves from outside the audio trigger position onto the audio trigger position, playing the preset audio corresponding to the audio trigger position.

In some embodiments, the target human bone key point information is human bone key point information used to characterize the hand.

In some embodiments, the video is a video of a target user shot in real time; and after the human bone key point detection is performed on the display video frame, the method further includes: in response to no human bone key point information set being detected, displaying, on the target interface, prompt information for prompting the target user that his or her position is incorrect.

In some embodiments, the video is a video of a target user shot in real time; and after determining whether the human bone key point information set includes the target human bone key point information, the method further includes: in response to determining that the human bone key point information set does not include the target human bone key point information, displaying, on the target interface, prompt information for prompting the target user that his or her position is incorrect.

In some embodiments, before the step of playing the preset audio corresponding to the audio trigger position (that is, the step performed in response to determining that the set includes the target human bone key point information and that the indicated human bone key point moves from outside the audio trigger position onto the audio trigger position), the method further includes: determining a human body image based on the human bone key point information set; and, in response to determining that the size of the human body image is smaller than a preset size, enlarging the display video frame so that the size of the human body image reaches the preset size.

In some embodiments, the audio trigger position is characterized by at least one of the following: an area of a preset size and shape, and a line of a preset length.

In some embodiments, after playing the preset audio corresponding to the audio trigger position, the method further includes: in response to determining that the audio trigger position is characterized by an area of a preset size and shape, and that the human bone key point indicated by the target human bone key point information has stayed at the audio trigger position for a preset duration, stopping playback of the audio corresponding to the audio trigger position.

In some embodiments, playing the preset audio corresponding to the audio trigger position includes: determining the moving speed, on the target interface, of the human bone key point indicated by the target human bone key point information; and playing the preset audio corresponding to the audio trigger position at a preset volume corresponding to the determined moving speed.
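As an illustration of the speed-dependent volume described above, the following is a minimal sketch rather than the disclosed implementation. It assumes that speed is measured in pixels per second between two consecutive frames and that volume grows linearly with speed up to a cap; the function names and the parameters `max_speed`, `min_volume` and `max_volume` are hypothetical.

```python
import math


def moving_speed(prev_xy, curr_xy, dt_seconds):
    """Speed of a key point in pixels per second between two frames."""
    if dt_seconds <= 0:
        raise ValueError("dt_seconds must be positive")
    return math.dist(prev_xy, curr_xy) / dt_seconds


def volume_for_speed(speed, max_speed=2000.0, min_volume=0.2, max_volume=1.0):
    """Map a speed to a volume in [min_volume, max_volume].

    The linear mapping and the default values are assumptions made for
    this sketch; the text only requires a preset speed-to-volume relation.
    """
    ratio = min(max(speed / max_speed, 0.0), 1.0)
    return min_volume + ratio * (max_volume - min_volume)
```

A lookup table of speed ranges would satisfy the same description; the linear form is used here only because it is short.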

In a second aspect, an embodiment of the present disclosure provides a device for playing audio. The device includes: an acquiring unit configured to acquire a display video frame displayed on a target interface, where the display video frame is a video frame included in the currently shot video, and at least one audio trigger position is preset on the target interface; a detection unit configured to perform human bone key point detection on the display video frame and, in response to detecting a human bone key point information set, determine whether the human bone key point information set includes target human bone key point information; and a playback unit configured to, in response to determining that it does, for an audio trigger position in the at least one audio trigger position, in response to determining that the human bone key point indicated by the target human bone key point information moves from outside the audio trigger position onto the audio trigger position, play the preset audio corresponding to the audio trigger position.

In some embodiments, the video is a video of a target user shot in real time; and the detection unit is further configured to: in response to no human bone key point information set being detected, display, on the target interface, prompt information for prompting the target user that his or her position is incorrect.

In some embodiments, the video is a video of a target user shot in real time; and the device further includes: a display unit configured to, in response to determining that the human bone key point information set does not include the target human bone key point information, display, on the target interface, prompt information for prompting the target user that his or her position is incorrect.

In some embodiments, the device further includes: a determining unit configured to determine a human body image based on the human bone key point information set; and an enlarging unit configured to, in response to determining that the size of the human body image is smaller than a preset size, enlarge the display video frame so that the size of the human body image reaches the preset size.

In some embodiments, the playback unit is further configured to: in response to determining that the audio trigger position is characterized by an area of a preset size and shape, and that the human bone key point indicated by the target human bone key point information has stayed at the audio trigger position for a preset duration, stop playing the audio corresponding to the audio trigger position.

In some embodiments, the playback unit includes: a determining module configured to determine the moving speed, on the target interface, of the human bone key point indicated by the target human bone key point information; and a playing module configured to play the preset audio corresponding to the audio trigger position at a preset volume corresponding to the determined moving speed.

In a third aspect, an embodiment of the present disclosure provides a terminal device. The terminal device includes: one or more processors; and a storage device on which one or more programs are stored. When the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any implementation manner of the first aspect.

In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, and the computer program, when executed by a processor, implements the method described in any implementation manner in the first aspect.

The method and device for playing audio provided by the embodiments of the present disclosure acquire the display video frame displayed on the target interface and perform human bone key point detection on it. If a human bone key point information set is detected and the set includes target human bone key point information, then, for an audio trigger position in at least one audio trigger position on the target interface, the preset audio corresponding to the audio trigger position is played in response to determining that the human bone key point indicated by the target human bone key point information has moved onto the audio trigger position. Audio playback can thus be triggered by the body movements of the person being photographed, which improves the flexibility of triggering audio playback and helps the person being photographed to play music through body movements alone, without using an instrument.

Description of the drawings

By reading the detailed description of the non-limiting embodiments with reference to the following drawings, other features, purposes and advantages of the present disclosure will become more apparent:

Fig. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure can be applied;

Fig. 2 is a flowchart of one embodiment of a method for playing audio according to an embodiment of the present disclosure;

Fig. 3 is a schematic diagram of an application scenario of the method for playing audio according to an embodiment of the present disclosure;

Fig. 4 is a flowchart of another embodiment of a method for playing audio according to an embodiment of the present disclosure;

Fig. 5 is a schematic structural diagram of an embodiment of an apparatus for playing audio according to an embodiment of the present disclosure;

Fig. 6 is a schematic structural diagram of a terminal device suitable for implementing embodiments of the present disclosure.

Detailed description

The present disclosure will be further described in detail below in conjunction with the drawings and embodiments. It can be understood that the specific embodiments described here are only used to explain the relevant disclosure, but not to limit the disclosure. In addition, it should be noted that, for ease of description, only the parts related to the relevant disclosure are shown in the drawings.

It should be noted that the embodiments in the present disclosure and the features in the embodiments can be combined with each other if there is no conflict. Hereinafter, the present disclosure will be described in detail with reference to the drawings and in conjunction with embodiments.

FIG. 1 shows an exemplary system architecture 100 of a method for playing audio or a device for playing audio to which embodiments of the present disclosure can be applied.

As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages, and so on. Various communication client applications, such as video shooting applications, video playback applications, and social platform software, may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices. When the terminal devices 101, 102, 103 are software, they can be installed in the aforementioned electronic devices, and can be implemented as multiple pieces of software or software modules (for example, software or software modules used to provide distributed services), or as a single piece of software or software module. No specific limitation is made here.

The server 105 may be a server that provides various services, for example, a back-end server that provides support for videos shot by the terminal devices 101, 102, 103. The back-end server can be used to set the audio trigger positions on the target interface and the audio corresponding to each audio trigger position.

It should be noted that the method for playing audio provided by the embodiments of the present disclosure is generally executed by the terminal devices 101, 102, 103, and correspondingly, the device for playing audio is generally set in the terminal devices 101, 102, 103.

It should be noted that the server can be hardware or software. When the server is hardware, it can be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it can be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services), or as a single piece of software or software module. No specific limitation is made here.

It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. According to implementation needs, there can be any number of terminal devices, networks and servers. In the case that the display video frame does not need to be obtained remotely, the foregoing system architecture may not include a server and a network.

Continuing to refer to FIG. 2, a process 200 of an embodiment of the method for playing audio according to the present disclosure is shown. The method for playing audio includes the following steps:

Step 201: Obtain a display video frame displayed on the target interface.

In this embodiment, the execution subject of the method for playing audio (for example, the terminal device shown in FIG. 1) may obtain, locally, the display video frame displayed on the target interface. The display video frame is a video frame included in the currently shot video. The target interface may be an interface for displaying video frames of that video. For example, the target interface may be a playback interface of a video playback application installed on the execution subject. At least one audio trigger position is preset on the target interface. The audio trigger position is used to trigger audio playback.

In some optional implementation manners of this embodiment, the audio trigger position may be characterized by at least one of the following: an area of a preset size and shape, and a line of a preset length. As an example, the audio trigger position may be characterized by a rectangular area of a preset size. The line of a preset length may be a straight line segment or a curved line segment.
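The two kinds of audio trigger position can be pictured as small data structures. The following is a minimal sketch under stated assumptions, not the disclosed implementation: the class names, the `audio_id` field and the `tolerance` parameter (how close a point must be to count as touching a line) are hypothetical, and only the straight line segment case is covered.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class RectTrigger:
    """Audio trigger position given as a rectangular area (pixels)."""
    x: float
    y: float
    width: float
    height: float
    audio_id: str  # hypothetical key of the preset audio

    def contains(self, p: Point) -> bool:
        return (self.x <= p[0] <= self.x + self.width
                and self.y <= p[1] <= self.y + self.height)


@dataclass(frozen=True)
class LineTrigger:
    """Audio trigger position given as a straight line segment."""
    start: Point
    end: Point
    audio_id: str
    tolerance: float = 5.0  # assumed contact distance in pixels

    def contains(self, p: Point) -> bool:
        (x1, y1), (x2, y2) = self.start, self.end
        dx, dy = x2 - x1, y2 - y1
        length_sq = dx * dx + dy * dy
        if length_sq == 0:
            return math.hypot(p[0] - x1, p[1] - y1) <= self.tolerance
        # Project p onto the segment and clamp to its end points.
        t = ((p[0] - x1) * dx + (p[1] - y1) * dy) / length_sq
        t = max(0.0, min(1.0, t))
        nearest = (x1 + t * dx, y1 + t * dy)
        return math.hypot(p[0] - nearest[0], p[1] - nearest[1]) <= self.tolerance
```

Giving both classes the same `contains` method lets later logic treat areas and lines uniformly. A curved line segment could be approximated by several short `LineTrigger` instances.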

Step 202: Perform human bone key point detection on the displayed video frame, and in response to detecting the human bone key point information set, determine whether the human bone key point information set includes target human bone key point information.

In this embodiment, the execution subject may perform human bone key point detection on the display video frame and, in response to detecting a human bone key point information set, determine whether the human bone key point information set includes target human bone key point information. Human bone key point information is used to indicate a human bone key point. Human bone key points are points used to characterize specific parts of the human body, for example, the top of the head, the elbow joints, or the shoulder joints. The human bone key point information may include coordinates in a coordinate system established on the display video frame, and the coordinates may be used to characterize the position of the human bone key point in the display video frame.

The execution subject may perform human bone key point detection on the display video frame according to various existing methods for determining human bone key points. For example, the execution subject may input the display video frame into a pre-trained convolutional neural network (CNN) for detecting human bone key points to obtain a human bone key point information set. The convolutional neural network may be an existing convolutional neural network of any of various structures, such as R-CNN (Region-CNN) or STN (Spatial Transformer Network). It should be noted that the above method for detecting human bone key points is a well-known technology that is currently widely researched and applied, and will not be described in detail here.

The target human bone key point information may be human bone key point information, from the detected human bone key point information set, that is used to characterize a specific part of the human body (for example, a hand or a foot). Generally, each piece of human bone key point information may have a corresponding serial number. The serial number may be determined, when the execution subject detects the human bone key point information set, according to the body part corresponding to the human bone key point indicated by each piece of human bone key point information. The execution subject may determine the target human bone key point information from the human bone key point information set according to a preset serial number corresponding to the target human bone key point information.
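The selection by serial number can be sketched as a dictionary lookup. This is an illustrative sketch only: it assumes the detector returns a mapping from serial number to pixel coordinates, and the serial numbers `LEFT_HAND` and `RIGHT_HAND` are hypothetical values, not those of any particular detection model.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float]

# Hypothetical serial numbers for the target body parts.
LEFT_HAND = 9
RIGHT_HAND = 10


def select_target_key_points(
    key_points: Dict[int, Point],
    target_ids: Iterable[int] = (LEFT_HAND, RIGHT_HAND),
) -> Dict[int, Point]:
    """Return the detected key points whose serial number is a target.

    An empty result means the set does not include any target human
    bone key point information.
    """
    return {i: key_points[i] for i in target_ids if i in key_points}
```

For example, `select_target_key_points({0: (50, 10), 9: (20, 80)})` keeps only the entry for serial number 9, because serial number 10 was not detected.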

In some optional implementation manners of this embodiment, the target human bone key point information is human bone key point information used to characterize the hand. The target human bone key point information may be the human bone key point information representing either hand of a person. Setting the human bone key point information representing the hands as the target human bone key point information helps the person being photographed to control audio playback flexibly through hand movements.

In some optional implementation manners of this embodiment, the video is a video of a target user shot in real time. The target user may be a user captured by a camera included in the execution subject, or by a camera included in an electronic device communicatively connected with the execution subject. In response to determining that no human bone key point information set is detected, the execution subject may display, on the target interface, prompt information for prompting the target user that his or her position is incorrect. Specifically, the execution subject usually fails to detect a human bone key point information set because the target user is not positioned correctly, so that the execution subject cannot obtain a human body image of the target user. In this case, the execution subject may display the prompt information on the target interface. The prompt information may include, but is not limited to, at least one of the following: text, images, and so on. As an example, the prompt information may be an image that characterizes the outline of a human body; the target user may refer to this image and adjust his or her position so that his or her human body image is located at the position of the outline image. In practice, the position at which the human body image coincides with the outline image is usually the best position for triggering audio playback.

In some optional implementation manners of this embodiment, the above-mentioned execution subject may perform the following steps before step 203:

First, the human body image is determined based on the human bone key point information set. Specifically, as an example, the execution subject may determine, as the human body image, a rectangular area in the display video frame that includes all the human bone key points indicated by the human bone key point information (for example, the area enclosed by the smallest such rectangle, or the area enclosed by a rectangle obtained by enlarging the smallest rectangle by a preset multiple). Alternatively, the human bone key point information may have corresponding serial numbers, and the execution subject may, according to pre-designated serial numbers, determine the rectangular area including the human bone key points corresponding to those serial numbers as the human body image. Generally, the human body image determined according to the designated serial numbers may be an image of the upper body.

Then, in response to determining that the size of the human body image is smaller than the preset size, the displayed video frame is enlarged so that the size of the human body image reaches the preset size.

The size of an image is usually characterized by its number of pixels, for example, x×y, where x is the number of horizontal pixels and y is the number of vertical pixels. The preset size may be a preset fixed size, or it may be a size determined according to a preset ratio. For example, assuming that the size of the interface for displaying the display video frame is m×n and the preset ratio is 0.8, the preset size is 0.8m×0.8n. It should be noted that the size of the human body image can be determined to be smaller than the preset size when at least one of the following conditions is met: the number of horizontal pixels of the human body image is less than the number of horizontal pixels of the preset size; the number of vertical pixels of the human body image is less than the number of vertical pixels of the preset size; the number of pixels on the diagonal of the human body image is less than the number of pixels on the diagonal of the rectangle represented by the preset size. Correspondingly, the human body image is determined to have reached the preset size when at least one of the following conditions is met: the number of horizontal pixels of the human body image is the same as the number of horizontal pixels of the preset size; the number of vertical pixels of the human body image is the same as the number of vertical pixels of the preset size; the number of pixels on the diagonal of the human body image is the same as the number of pixels on the diagonal of the rectangle represented by the preset size. It should be understood that the above conditions are merely exemplary, and other conditions may also be used in practice.

By executing this optional implementation manner, the human body image can be enlarged to a preset size when the human body image of the captured user is small, thereby helping the user to trigger audio playback through body movements more accurately.
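The two steps above can be sketched as follows, under stated assumptions rather than as the disclosed implementation. `body_rect` takes the smallest rectangle around the key points and optionally enlarges it about its center by a multiple `enlarge`; `zoom_factor` derives the preset size from the interface size and a ratio (0.8 as in the example above) and returns the magnification at which the first dimension of the human body image reaches the preset size. The function names are hypothetical, and zooming only while both dimensions are below the preset size is one possible reading of the conditions.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (left, top, width, height)


def body_rect(points: Iterable[Point], enlarge: float = 1.0) -> Rect:
    """Smallest rectangle around the key points, scaled about its center."""
    pts = list(points)
    if not pts:
        raise ValueError("at least one key point is required")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    center_x = (min(xs) + max(xs)) / 2
    center_y = (min(ys) + max(ys)) / 2
    width = (max(xs) - min(xs)) * enlarge
    height = (max(ys) - min(ys)) * enlarge
    return (center_x - width / 2, center_y - height / 2, width, height)


def zoom_factor(body_size: Tuple[float, float],
                interface_size: Tuple[float, float],
                ratio: float = 0.8) -> float:
    """Magnification for the display video frame; 1.0 means no zoom."""
    body_w, body_h = body_size
    if body_w <= 0 or body_h <= 0:
        raise ValueError("body image size must be positive")
    preset_w = interface_size[0] * ratio
    preset_h = interface_size[1] * ratio
    # Stop at the first dimension that reaches the preset size.
    factor = min(preset_w / body_w, preset_h / body_h)
    return max(factor, 1.0)
```

Taking the smaller of the two ratios keeps the enlarged human body image within the preset size in both directions. The sketch does not clamp the rectangle to the frame boundaries.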

Step 203: In response to determining that the human bone key point information set includes the target human bone key point information, for an audio trigger position in the at least one audio trigger position, in response to determining that the human bone key point indicated by the target human bone key point information moves from outside the audio trigger position onto the audio trigger position, play the preset audio corresponding to the audio trigger position.

In this embodiment, in response to determining that the human bone key point information set includes the target human bone key point information, the execution subject may, for an audio trigger position in the at least one audio trigger position, play the preset audio corresponding to the audio trigger position in response to determining that the human bone key point indicated by the target human bone key point information moves from outside the audio trigger position onto the audio trigger position. The audio corresponding to the audio trigger position can be pre-stored in the execution subject, and the correspondence between audio trigger positions and audio can be pre-established in the form of a list, pointers, or the like.

Specifically, as an example, for a given audio trigger position, if the audio trigger position is characterized by an area of a preset size and shape, then when it is detected that the human bone key point indicated by the target human bone key point information moves from outside the area into the area, it is determined that the human bone key point has moved from outside the audio trigger position onto the audio trigger position. At this time, the preset audio corresponding to the audio trigger position is played. As another example, for a given audio trigger position, if the audio trigger position is characterized by a straight line segment of a preset length, then when it is detected that the human bone key point indicated by the target human bone key point information comes into contact with the line segment, it is determined that the human bone key point has moved from outside the audio trigger position onto the audio trigger position. At this time, the preset audio corresponding to the audio trigger position is played.
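Detecting a move "from outside onto" an audio trigger position amounts to comparing the key point's state in consecutive frames. The following minimal sketch covers the rectangular-area case only and is not the disclosed implementation: the class name `EnterDetector`, the `(left, top, width, height)` tuple layout and the choice not to fire on the very first frame are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (left, top, width, height)


def in_rect(p: Point, rect: Rect) -> bool:
    left, top, width, height = rect
    return left <= p[0] <= left + width and top <= p[1] <= top + height


class EnterDetector:
    """Reports trigger positions that a key point has just moved onto."""

    def __init__(self, triggers: Dict[str, Rect]) -> None:
        self.triggers = triggers
        # None means "no previous frame seen yet" for that trigger.
        self._was_inside: Dict[str, Optional[bool]] = {
            name: None for name in triggers
        }

    def update(self, key_point: Point) -> List[str]:
        """Feed the key point position for one frame.

        Returns the names of triggers that were outside in the previous
        frame and are inside in this one.
        """
        entered = []
        for name, rect in self.triggers.items():
            inside = in_rect(key_point, rect)
            if inside and self._was_inside[name] is False:
                entered.append(name)
            self._was_inside[name] = inside
        return entered
```

Calling `update` once per display video frame yields the trigger positions whose audio should start playing. Because only the outside-to-inside transition is reported, a hand resting inside an area does not retrigger the audio on every frame. The line segment case could reuse the same transition logic with a contact test in place of `in_rect`.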

In practice, the audio corresponding to an audio trigger position may simulate a particular tone of a particular musical instrument. By triggering the audio corresponding to each audio trigger position, playing the musical instrument can be simulated through body movements. In addition, the audio corresponding to the audio trigger position may also be another type of audio, such as a piece of music or a sound effect.

In some optional implementation manners of this embodiment, the video is a video of a target user shot in real time. The target user may be a user captured by a camera included in the execution subject, or by a camera included in an electronic device communicatively connected with the execution subject. In response to determining that the human bone key point information set does not include the target human bone key point information, the execution subject may display, on the target interface, prompt information for prompting the target user that his or her position is incorrect. Specifically, when it is determined that the human bone key point information set does not include the target human bone key point information, this indicates that the target user is not positioned correctly: the camera cannot capture a complete human body image, and the execution subject cannot detect the target human bone key point information. In this case, the execution subject may display prompt information on the target interface to remind the target user of the incorrect position. The prompt information in this implementation manner may be the same as the prompt information in the foregoing optional implementation manner, and will not be repeated here.
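The two prompting cases, no human bone key point information set detected and a set detected without the target information, can be combined in one check. The sketch below is illustrative only: the function name, the prompt text and the serial-number representation of key points are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

Point = Tuple[float, float]

# Hypothetical prompt text; the text above also allows an outline image.
POSITION_PROMPT = "Please adjust your position so your hands are in view."


def position_prompt(
    key_points: Optional[Dict[int, Point]],
    target_ids: Iterable[int],
) -> Optional[str]:
    """Return prompt information to display, or None if none is needed."""
    if not key_points:
        # Case 1: no human bone key point information set was detected.
        return POSITION_PROMPT
    if not any(i in key_points for i in target_ids):
        # Case 2: the set lacks the target key point information.
        return POSITION_PROMPT
    return None
```

The same prompt is returned in both cases, matching the statement that the prompt information may be identical in the two implementation manners.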

Continue to refer to FIG. 3, which is a schematic diagram of an application scenario of the method for playing audio according to this embodiment. In the application scenario of FIG. 3, the terminal device 301 first obtains the display video frame currently displayed on the target interface 302 (that is, the page on the terminal device used to display the video frames included in the captured video), where the display video frame is a video frame included in a selfie video shot by the user of the terminal device. Seven audio trigger positions (that is, the rectangular areas A-G in the figure) are preset on the target interface 302, and the seven audio trigger positions are used to simulate piano keys. Then, the terminal device 301 performs human bone key point detection on the display video frame to obtain a human bone key point information set, where the human bone key point information included in the set corresponds to the human bone key points shown as black dots in the figure. The human bone key point information corresponding to the human bone key points 303 and 304 is the target human bone key point information, which is used to represent the hands. Finally, in response to determining that the human bone key point 303 indicated by the target human bone key point information moves from outside the audio trigger position G onto the audio trigger position G, the terminal device 301 plays the preset audio corresponding to the audio trigger position G.
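A layout like the one in this scenario could be generated as follows. This is a sketch under assumptions, since the figure itself is not reproduced here: it places seven equal-width rectangular areas labeled A to G in a row along the bottom of the interface, and the function names and the `key_height_ratio` parameter are hypothetical.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (left, top, width, height)


def piano_triggers(interface_w: float, interface_h: float,
                   labels: str = "ABCDEFG",
                   key_height_ratio: float = 0.25) -> Dict[str, Rect]:
    """Lay out one rectangular trigger per label along the bottom edge."""
    key_w = interface_w / len(labels)
    key_h = interface_h * key_height_ratio
    top = interface_h - key_h
    return {label: (i * key_w, top, key_w, key_h)
            for i, label in enumerate(labels)}


def trigger_at(triggers: Dict[str, Rect], point: Point) -> Optional[str]:
    """Return the label of the trigger containing the point, if any."""
    for label, (left, top, width, height) in triggers.items():
        if left <= point[0] < left + width and top <= point[1] < top + height:
            return label
    return None
```

With a 700×400 interface, `piano_triggers(700, 400)["G"]` is the rightmost key, and `trigger_at` maps a hand key point located over that key to the label `"G"`, whose preset audio would then be played.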

The method provided by the above embodiment of the present disclosure acquires the display video frame displayed on the target interface and performs human bone key point detection on it. If a human bone key point information set is detected and the set includes target human bone key point information, then, for an audio trigger position in at least one audio trigger position on the target interface, the preset audio corresponding to the audio trigger position is played in response to determining that the human bone key point indicated by the target human bone key point information has moved onto the audio trigger position. The person being photographed can thus trigger audio playback through body movements, which improves the flexibility of triggering audio playback and helps the person being photographed to play music through body movements alone, without using an instrument.

With further reference to FIG. 4, it shows a process 400 of another embodiment of a method for playing audio. The process 400 of the method for playing audio includes the following steps:

Step 401: Obtain display video frames displayed on the target interface.

In this embodiment, the audio trigger position can be characterized by at least one of the following: an area of a preset size and shape, and a line of a preset length. As an example, the audio trigger position may be characterized by a rectangular area of a preset size. The line of a preset length may be a straight line segment or a curved line segment.
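As a non-limiting illustration, the two characterizations above could be represented as follows. This is a minimal sketch: the class names `RectTrigger` and `LineTrigger`, the `contains` method, and the `tolerance` parameter (how close a key point must be to count as being on a line) are assumptions introduced here and do not appear in the disclosure. Only straight line segments are covered; a curved segment could be approximated by several short straight segments.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RectTrigger:
    """Audio trigger position characterized by a rectangular area."""
    x: float       # left edge, in target-interface coordinates
    y: float       # top edge
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


@dataclass(frozen=True)
class LineTrigger:
    """Audio trigger position characterized by a straight line segment."""
    x1: float
    y1: float
    x2: float
    y2: float
    tolerance: float = 5.0  # assumed value, not taken from the disclosure

    def contains(self, px: float, py: float) -> bool:
        dx, dy = self.x2 - self.x1, self.y2 - self.y1
        length_sq = dx * dx + dy * dy
        if length_sq == 0.0:
            # Degenerate segment: treat it as a single point.
            return math.hypot(px - self.x1, py - self.y1) <= self.tolerance
        # Project the point onto the segment and clamp to its endpoints.
        t = ((px - self.x1) * dx + (py - self.y1) * dy) / length_sq
        t = max(0.0, min(1.0, t))
        nearest_x, nearest_y = self.x1 + t * dx, self.y1 + t * dy
        return math.hypot(px - nearest_x, py - nearest_y) <= self.tolerance
```

Both classes expose the same `contains` test, so later steps can treat either kind of audio trigger position uniformly.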

Step 402: Perform human bone key point detection on the displayed video frame, and in response to detecting the human bone key point information set, determine whether the human bone key point information set includes target human bone key point information.

In this embodiment, step 402 is basically the same as step 202 in the embodiment corresponding to FIG. 2, and will not be repeated here.

Step 403: In response to determining that the human bone key point information set includes the target human bone key point information, for an audio trigger position among the at least one audio trigger position: in response to determining that the human bone key point indicated by the target human bone key point information moves from outside the audio trigger position onto the audio trigger position, play the preset audio corresponding to the audio trigger position; and in response to determining that the audio trigger position is characterized by an area of a preset size and shape and that the dwell time of the human bone key point indicated by the target human bone key point information at the audio trigger position reaches a preset duration, stop playing the audio.

In this embodiment, in response to determining that the human bone key point information set includes the target human bone key point information, the following sub-steps (steps 4031 and 4032) are performed for an audio trigger position among the at least one audio trigger position:

Step 4031: In response to determining that the human bone key point indicated by the target human bone key point information moves from outside the audio trigger position to above the audio trigger position, play a preset audio corresponding to the audio trigger position.
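The condition in step 4031 is a transition (outside, then on), not merely the key point being on the audio trigger position. One possible way to detect that transition across successive video frames is sketched below. The class `EntryDetector` and its `update` method are hypothetical names; each audio trigger position is assumed to be supplied as a function that reports whether a point lies on it.

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple

Point = Tuple[float, float]
HitTest = Callable[[float, float], bool]


class EntryDetector:
    """Reports a trigger position only on the frame where the key point enters it."""

    def __init__(self, triggers: Dict[Hashable, HitTest]) -> None:
        self._triggers = triggers
        # None means "not yet observed"; an entry is reported only after the
        # key point has been seen outside the trigger position at least once.
        self._was_inside: Dict[Hashable, Optional[bool]] = {
            name: None for name in triggers
        }

    def update(self, point: Point) -> List[Hashable]:
        """Feed the key point position for one frame; return the positions just entered."""
        entered = []
        for name, hit_test in self._triggers.items():
            inside = hit_test(*point)
            if inside and self._was_inside[name] is False:
                entered.append(name)
            self._was_inside[name] = inside
        return entered


# Usage: the preset audio would be played for each name returned by update().
detector = EntryDetector({"G": lambda x, y: 600 <= x <= 700 and 0 <= y <= 200})
detector.update((100.0, 50.0))             # key point is outside G
entered = detector.update((650.0, 50.0))   # key point has moved onto G
```

Because the previous state is remembered per audio trigger position, a key point that stays on the position over many frames triggers playback once, not once per frame.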

Specifically, step 4031 is basically the same as step 203 in the embodiment corresponding to FIG. 2, and will not be repeated here.

Step 4032: In response to determining that the audio trigger position is characterized by an area of a preset size and shape, and that the human bone key point indicated by the target human bone key point information has stayed at the audio trigger position for a preset duration, stop playing the audio.

Specifically, when the audio trigger position is characterized by an area of a preset size and shape (for example, a rectangular area of a preset size), the execution subject can determine the dwell time, at the audio trigger position, of the human bone key point indicated by the target human bone key point information. Generally, the execution subject may start timing when it starts playing the audio corresponding to the audio trigger position, and perform human bone key point detection in real time on the video frames displayed on the target interface. When it detects that the dwell time of the human bone key point indicated by the target human bone key point information at the audio trigger position reaches a preset duration (for example, 3 seconds), it stops playing the audio corresponding to the audio trigger position. In practice, the audio can usually be stopped by gradually decreasing the volume.
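The timing and gradual stop described above could be sketched as follows. The class `DwellStopper`, its method names, and the linear fade of length `fade_s` are assumptions made for illustration; the disclosure specifies only that timing starts with playback, that playback stops once a preset duration (for example, 3 seconds) is reached, and that the volume may decrease gradually. The sketch also assumes that, if the key point leaves the audio trigger position early, the audio is simply left unaffected.

```python
from typing import Optional


class DwellStopper:
    """Computes a volume multiplier from how long the key point has stayed."""

    def __init__(self, max_dwell_s: float = 3.0, fade_s: float = 0.5) -> None:
        self.max_dwell_s = max_dwell_s  # preset duration from the text
        self.fade_s = fade_s            # assumed fade-out length
        self._started_at: Optional[float] = None

    def on_play_started(self, now_s: float) -> None:
        """Start timing at the moment the audio starts playing."""
        self._started_at = now_s

    def on_left(self) -> None:
        """The key point left the trigger position before the limit was reached."""
        self._started_at = None

    def volume_scale(self, now_s: float) -> float:
        """Return 1.0 until the dwell limit, then fall linearly to 0.0."""
        if self._started_at is None:
            return 1.0
        overtime = (now_s - self._started_at) - self.max_dwell_s
        if overtime <= 0.0:
            return 1.0
        if self.fade_s <= 0.0:
            return 0.0
        return max(0.0, 1.0 - overtime / self.fade_s)
```

A caller would multiply the playback volume by `volume_scale(now)` on every frame and stop the audio once the result reaches 0.0.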

In some optional implementation manners of this embodiment, the above-mentioned execution subject may play the audio corresponding to the audio trigger position according to the following steps:

First, determine the moving speed, on the target interface, of the human bone key point indicated by the target human bone key point information. Specifically, the execution subject may perform human bone key point detection in real time on the video frames displayed on the target interface. From the change in position of the human bone key point indicated by the target human bone key point information between two adjacent video frames (or two video frames separated by a preset number of video frames), together with the playback time interval between those two video frames, the execution subject can determine in real time the moving speed of that key point on the target interface. When it is detected that the key point moves from outside the audio trigger position onto the audio trigger position, the moving speed of the key point at that moment is taken as the moving speed used to determine the audio volume in the next step.

Then, according to the preset volume corresponding to the determined moving speed, the preset audio corresponding to the audio trigger position is played. This implementation manner can control the volume of the played audio according to the moving speed of the key points of the human bones indicated by the key point information of the target human bones, thereby helping to more accurately simulate the performance of the musical instrument. For example, when the aforementioned audio trigger position is used to simulate a piano, the movement speed of the key points of the human bones indicated by the key point information of the target human bones can represent the strength of the human fingers hitting the keys, thereby more realistically simulating piano performance.
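The two steps above can be sketched as a speed computation followed by a table lookup. The function names, the unit (interface pixels per second), and every threshold and volume value in `VOLUME_BY_SPEED` are hypothetical; the disclosure states only that a preset volume corresponds to the determined moving speed.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def moving_speed(prev: Point, curr: Point, interval_s: float) -> float:
    """Speed, in pixels per second, between two detections of the same key point."""
    if interval_s <= 0.0:
        raise ValueError("interval_s must be positive")
    return math.hypot(curr[0] - prev[0], curr[1] - prev[1]) / interval_s


# Hypothetical presets: (minimum speed in px/s, volume in [0, 1]), ascending.
VOLUME_BY_SPEED: Sequence[Tuple[float, float]] = (
    (0.0, 0.3),
    (200.0, 0.6),
    (600.0, 1.0),
)


def volume_for_speed(
    speed: float, table: Sequence[Tuple[float, float]] = VOLUME_BY_SPEED
) -> float:
    """Return the preset volume of the highest band whose threshold the speed reaches."""
    volume = table[0][1]
    for threshold, band_volume in table:
        if speed >= threshold:
            volume = band_volume
    return volume
```

For example, with video frames 1/30 second apart, `interval_s` would be 1/30 for adjacent frames, or a multiple of it when the two frames are separated by a preset number of frames.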

It can be seen from FIG. 4 that, compared with the embodiment corresponding to FIG. 2, the process 400 of the method for playing audio in this embodiment highlights the step of stopping audio playback according to the dwell time, at the audio trigger position, of the human bone key point indicated by the target human bone key point information. Therefore, the solution described in this embodiment can control audio playback more flexibly, which helps to simulate the performance of a musical instrument more accurately.

With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a device for playing audio. The device embodiment corresponds to the method embodiment shown in FIG. The device can be applied to various electronic devices.

As shown in FIG. 5, the apparatus 500 for playing audio in this embodiment includes: an acquiring unit 501 configured to acquire a display video frame displayed on the target interface, where the display video frame is a video frame included in the currently shot video and at least one audio trigger position is preset on the target interface; a detection unit 502 configured to perform human bone key point detection on the display video frame and, in response to detecting a human bone key point information set, determine whether the human bone key point information set includes target human bone key point information; and a playback unit 503 configured to, in response to determining that the human bone key point information set includes the target human bone key point information, for an audio trigger position among the at least one audio trigger position, play the preset audio corresponding to the audio trigger position in response to determining that the human bone key point indicated by the target human bone key point information moves from outside the audio trigger position onto the audio trigger position.

In this embodiment, the obtaining unit 501 may obtain the display video frame displayed on the target interface. Wherein, the displayed video frame is a video frame included in the currently shot video. The target interface may be an interface for displaying video frames of the aforementioned video. For example, the target interface may be a playback interface of a video playback application installed on the aforementioned device 500. At least one audio trigger position is preset on the target interface. The audio trigger position is used to trigger audio playback.

In this embodiment, the detection unit 502 may perform human bone key point detection on the displayed video frame, and in response to detecting the human bone key point information set, determine whether the human bone key point information set includes target human bone key point information. Among them, the key point information of the human bone is used to indicate the key point of the human bone. The key points of human bones are points used to characterize specific parts of the human body, for example, points used to characterize the top of the head, elbow joints, shoulder joints, etc. The human bone key point information may include coordinates in a coordinate system established on the display video frame, and the coordinates may be used to characterize the position of the human bone key point in the display video frame.

The aforementioned detection unit 502 can perform human bone key point detection on the displayed video frame according to various existing methods for determining human bone key points. For example, the aforementioned detection unit 502 may input the display video frame into a pre-trained convolutional neural network (Convolutional Neural Networks, CNN) to obtain a set of key point information of human bones. The aforementioned convolutional neural network may be existing convolutional neural networks with various structures, such as R-CNN (Region-CNN), STN (Spatial Transform Networks, spatial transformation network), etc. It should be noted that the above-mentioned method for detecting key points of human bones is a well-known technology that is currently widely researched and applied, and will not be repeated here.

The aforementioned target human bone key point information may be human bone key point information, in the detected human bone key point information set, that is used to characterize specific parts of the human body (for example, hands, feet, etc.). Generally, each piece of human bone key point information may have a corresponding serial number, which may be determined, when the detection unit 502 detects the human bone key point information set, according to the human body part corresponding to the human bone key point that the piece of information indicates. The detection unit 502 may determine the target human bone key point information from the human bone key point information set according to a preset serial number corresponding to the target human bone key point information.
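Selecting the target human bone key point information by serial number could look like the sketch below. The `KeyPoint` structure, the function `select_targets`, and the serial numbers 9 and 10 for the hands are assumptions for illustration only; an actual key point detector defines its own numbering of body parts.

```python
from dataclasses import dataclass
from typing import Collection, Iterable, List


@dataclass(frozen=True)
class KeyPoint:
    serial: int  # serial number assigned according to the body part
    x: float     # coordinates in the display video frame
    y: float


# Hypothetical numbering, not taken from the disclosure or any specific model.
LEFT_HAND, RIGHT_HAND = 9, 10
TARGET_SERIALS = frozenset({LEFT_HAND, RIGHT_HAND})


def select_targets(
    detected: Iterable[KeyPoint],
    target_serials: Collection[int] = TARGET_SERIALS,
) -> List[KeyPoint]:
    """Return the detected key points whose serial number is a preset target."""
    return [kp for kp in detected if kp.serial in target_serials]
```

An empty result corresponds to the case in which the human bone key point information set does not include the target human bone key point information.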

In this embodiment, the playback unit 503 may, in response to determining that the human bone key point information set includes the target human bone key point information, for an audio trigger position among the at least one audio trigger position, play the preset audio corresponding to the audio trigger position in response to determining that the human bone key point indicated by the target human bone key point information moves from outside the audio trigger position onto the audio trigger position.

In some optional implementation manners of this embodiment, the target human bone key point information is human bone key point information used to characterize the hand.

In some optional implementations of this embodiment, the video is a video of the target user taken in real time; and the detection unit 502 may be further configured to: in response to no human bone key point information set being detected, display on the target interface prompt information for prompting the target user that his or her position is wrong.

In some optional implementations of this embodiment, the video is a video of the target user taken in real time; and the device 500 may further include: a display unit (not shown in the figure) configured to, in response to determining that the human bone key point information set does not include the target human bone key point information, display on the target interface prompt information for prompting the target user that his or her position is wrong.
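The two prompting cases above differ only in what was detected, so they can be sketched as one decision function. The function name `position_prompt` and both message strings are hypothetical; the disclosure does not specify the wording of the prompt information.

```python
from typing import Collection, Optional

# Hypothetical prompt texts; the disclosure does not give any wording.
NO_PERSON_PROMPT = "No person detected. Please step into the frame."
NO_TARGET_PROMPT = "Hands not visible. Please adjust your position."


def position_prompt(
    detected_serials: Optional[Collection[int]],
    target_serials: Collection[int],
) -> Optional[str]:
    """Return the prompt to display, or None when no prompt is needed."""
    if not detected_serials:
        # No human bone key point information set was detected at all.
        return NO_PERSON_PROMPT
    if not any(serial in target_serials for serial in detected_serials):
        # A set was detected, but it lacks the target key point information.
        return NO_TARGET_PROMPT
    return None
```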

In some optional implementations of this embodiment, the device 500 may further include: a determining unit (not shown in the figure) configured to determine a human body image based on the human bone key point information set; and an enlarging unit (not shown in the figure) configured to, in response to determining that the size of the human body image is smaller than a preset size, enlarge the display video frame so that the size of the human body image reaches the preset size.
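One way to realize the determining and enlarging units is sketched below, under two assumptions that are not stated in the disclosure: the human body image is approximated by the bounding box of the detected key points, and its size is measured by the height of that box. The function names are hypothetical.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def body_bounding_box(points: Iterable[Point]) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) of the box enclosing all key points."""
    pts = list(points)
    if not pts:
        raise ValueError("at least one key point is required")
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return min(xs), min(ys), max(xs), max(ys)


def zoom_factor(points: Iterable[Point], preset_height: float) -> float:
    """Factor by which to enlarge the frame so the body reaches the preset height.

    Returns 1.0 (no enlargement) when the body is already large enough.
    """
    _, top, _, bottom = body_bounding_box(points)
    height = bottom - top
    if height <= 0.0 or height >= preset_height:
        return 1.0
    return preset_height / height
```

The returned factor would then be applied to the display video frame by whatever image scaling routine the terminal device provides.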

In some optional implementations of this embodiment, the audio trigger position is characterized by at least one of the following: an area with a preset size and shape, and a line with a preset length.

In some optional implementations of this embodiment, the playing unit 503 may be further configured to: in response to determining that the audio trigger position is characterized by an area of a preset size and shape, and that the human bone key point indicated by the target human bone key point information has stayed at the audio trigger position for a preset duration, stop playing the audio corresponding to the audio trigger position.

In some optional implementations of this embodiment, the playing unit 503 may include: a determining module (not shown in the figure) configured to determine the moving speed, on the target interface, of the human bone key point indicated by the target human bone key point information; and a playing module (not shown in the figure) configured to play the preset audio corresponding to the audio trigger position according to a preset volume corresponding to the determined moving speed.

The device provided by the above embodiment of the present disclosure acquires the display video frame displayed on the target interface and performs human bone key point detection on it. If a human bone key point information set is detected and the set includes the target human bone key point information, then, for an audio trigger position among the at least one audio trigger position on the target interface, the preset audio corresponding to that audio trigger position is played in response to determining that the human bone key point indicated by the target human bone key point information has moved from outside the audio trigger position onto it. The person being photographed can thus trigger audio playback through body movements, which improves the flexibility of triggering audio playback and allows the person being photographed to play music using body movements alone.

Reference is now made to FIG. 6, which shows a schematic structural diagram of a terminal device 600 suitable for implementing embodiments of the present disclosure. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, car navigation terminals), and fixed terminals such as digital TVs and desktop computers. The terminal device shown in FIG. 6 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.

As shown in FIG. 6, the terminal device 600 may include a processing device (such as a central processing unit, a graphics processor, etc.) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the terminal device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Generally, the following devices can be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), speakers, and a vibrator; a storage device 608 such as a memory; and a communication device 609. The communication device 609 may allow the terminal device 600 to perform wireless or wired communication with other devices to exchange data. Although FIG. 6 shows a terminal device 600 having various devices, it should be understood that it is not required to implement or have all the illustrated devices. More or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one device, or may represent multiple devices as needed.

In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart can be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 609, or installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above-mentioned functions defined in the method of the embodiments of the present disclosure are executed. It should be noted that the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

In the embodiments of the present disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device. In the embodiments of the present disclosure, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, and the computer-readable signal medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wire, optical cable, RF (Radio Frequency), etc., or any suitable combination of the above.

The above-mentioned computer-readable medium may be included in the above-mentioned terminal device, or it may exist alone without being assembled into the terminal device. The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the terminal device, the terminal device is caused to: obtain the display video frame displayed on the target interface, where the display video frame is a video frame included in the currently shot video and at least one audio trigger position is preset on the target interface; perform human bone key point detection on the display video frame, and in response to detecting a human bone key point information set, determine whether the human bone key point information set includes target human bone key point information; and in response to determining that the human bone key point information set includes the target human bone key point information, for an audio trigger position among the at least one audio trigger position, play the preset audio corresponding to the audio trigger position in response to determining that the human bone key point indicated by the target human bone key point information moves from outside the audio trigger position onto the audio trigger position.

The computer program code used to perform the operations of the embodiments of the present disclosure can be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram can represent a module, program segment, or part of code, and the module, program segment, or part of code contains one or more executable instructions for realizing the specified logic function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in a different order from the order marked in the drawings. For example, two blocks shown in succession can actually be executed substantially in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units involved in the embodiments described in the present disclosure may be implemented in a software manner, and may also be implemented in a hardware manner. The described unit may also be provided in the processor, for example, it may be described as: a processor includes an acquiring unit, a detecting unit, and a playing unit. Among them, the names of these units do not constitute a limitation on the unit itself under certain circumstances. For example, the acquiring unit can also be described as "a unit for acquiring the displayed video frame displayed on the target interface".

The above description covers only preferred embodiments of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions that are disclosed in (but not limited to) the embodiments of the present disclosure.

Claims

A method for playing audio, including:

Acquiring a display video frame displayed on a target interface, where the display video frame is a video frame included in the currently shot video, and at least one audio trigger position is preset on the target interface;

Performing human bone key point detection on the display video frame, and in response to detecting a human bone key point information set, determining whether the human bone key point information set includes target human bone key point information;

In response to determining that the human bone key point information set includes the target human bone key point information, for an audio trigger position among the at least one audio trigger position, in response to determining that the human bone key point indicated by the target human bone key point information moves from outside the audio trigger position onto the audio trigger position, playing the preset audio corresponding to the audio trigger position.
The method according to claim 1, wherein the target human bone key point information is human bone key point information used to characterize the hand.
The method according to claim 1, wherein the video is a video taken in real time for the target user; and

After the detection of human bone key points on the display video frame, the method further includes:

In response to no human bone key point information set being detected, displaying on the target interface prompt information for prompting the target user that his or her position is wrong.
The method according to claim 1, wherein the video is a video taken in real time for the target user; and

After the determining whether the human bone key point information set includes target human bone key point information, the method further includes:

In response to determining that the human bone key point information set does not include the target human bone key point information, displaying on the target interface prompt information for prompting the target user that his or her position is wrong.
The method according to claim 1, wherein before the playing of the preset audio corresponding to the audio trigger position, which is performed, in response to determining that the human bone key point information set includes the target human bone key point information, for an audio trigger position among the at least one audio trigger position in response to determining that the human bone key point indicated by the target human bone key point information moves from outside the audio trigger position onto the audio trigger position, the method further includes:

Determining a human body image based on the human bone key point information set;

In response to determining that the size of the human body image is smaller than a preset size, the display video frame is enlarged so that the size of the human body image reaches the preset size.
The method according to any one of claims 1 to 5, wherein the audio trigger position is characterized by at least one of the following: an area of a preset size and shape, and a line of a preset length.
The method according to claim 6, wherein after the playing the preset audio corresponding to the audio trigger position, the method further comprises:

In response to determining that the audio trigger position is characterized by an area of a preset size and shape, and that the human bone key point indicated by the target human bone key point information has stayed at the audio trigger position for a preset duration, stopping playing the audio corresponding to the audio trigger position.
The method according to any one of claims 1 to 5, wherein said playing the preset audio corresponding to the audio trigger position comprises:

Determining the moving speed of the key point of the human bone indicated by the key point information of the target human bone on the target interface;

According to the preset volume corresponding to the determined moving speed, the preset audio corresponding to the audio trigger position is played.
A device for playing audio, including:

The acquiring unit is configured to acquire a display video frame displayed on a target interface, where the display video frame is a video frame included in the currently shot video, and at least one audio trigger position is preset on the target interface;

The detection unit is configured to perform human bone key point detection on the display video frame, and in response to detecting a human bone key point information set, determine whether the human bone key point information set includes target human bone key point information;

The playback unit is configured to, in response to determining that the human bone key point information set includes the target human bone key point information, for an audio trigger position among the at least one audio trigger position, in response to determining that the human bone key point indicated by the target human bone key point information moves from outside the audio trigger position onto the audio trigger position, play the preset audio corresponding to the audio trigger position.
The device according to claim 9, wherein the target human bone key point information is human bone key point information used to characterize the hand.
The device according to claim 9, wherein the video is a video taken in real time for the target user; and

The detection unit is further configured to:

In response to no human bone key point information set being detected, display on the target interface prompt information for prompting the target user that his or her position is wrong.
The device according to claim 9, wherein the video is a video taken in real time for the target user; and

The device also includes:

The display unit is configured to, in response to determining that the human bone key point information set does not include the target human bone key point information, display on the target interface prompt information for prompting the target user that his or her position is wrong.
The device according to claim 9, wherein the device further comprises:

The determining unit is configured to determine a human body image based on the human skeleton key point information set;

The magnifying unit is configured to, in response to determining that the size of the human body image is smaller than a preset size, magnify the display video frame so that the size of the human body image reaches the preset size.
The device according to any one of claims 9-13, wherein the audio trigger position is characterized by at least one of the following: an area of a preset size and shape, and a line of a preset length.
The apparatus according to claim 14, wherein the playing unit is further configured to:

In response to determining that the audio trigger position is characterized by an area of a preset size and shape, and that the human bone key point indicated by the target human bone key point information has stayed at the audio trigger position for a preset duration, stop playing the audio corresponding to the audio trigger position.
The device according to any one of claims 9-13, wherein the playing unit comprises:

A determining module configured to determine the moving speed of the key points of the human bones indicated by the key point information of the target human bones on the target interface;

The playing module is configured to play the preset audio corresponding to the audio trigger position according to the preset volume corresponding to the determined moving speed.
A terminal device, including:

One or more processors;

A storage device on which one or more programs are stored;

When the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-8.
A computer-readable medium with a computer program stored thereon, wherein the program is executed by a processor to implement the method according to any one of claims 1-8.