WO2022111168A1 - Method and apparatus for video classification

Method and apparatus for video classification

Info

Publication number
WO2022111168A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
target
video
image frame
matching degree
Prior art date
Application number
PCT/CN2021/125750
Other languages
English (en)
Chinese (zh)
Inventor
徐东
刘承诚
Original Assignee
腾讯音乐娱乐科技(深圳)有限公司
Application filed by 腾讯音乐娱乐科技(深圳)有限公司
Publication of WO2022111168A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the present application relates to the technical field of data processing, and in particular, to a video classification method and apparatus.
  • users can imitate the actions of the characters in the video and sing along while the video and accompaniment play, recording audio and video of themselves at the same time.
  • the embodiment of the present application provides a video classification method, which addresses the lack in the prior art of a method capable of classifying imitations of song and dance videos.
  • a video classification method comprising:
  • acquiring target audio and a corresponding target video including human motion;
  • the method further includes:
  • based on the human action matching degree score corresponding to each target image frame, a human action matching degree score curve is displayed, and based on the audio matching degree score corresponding to each target audio segment, an audio matching degree score curve is displayed.
  • the method further includes:
  • the human action matching degree score of the target image frame corresponding to the target time point and the audio matching degree score of the target audio segment corresponding to the target time point are displayed.
  • the method further includes:
  • the human action matching degree score is added in the form of an image at a position in the target video corresponding to the target image frame.
  • the method further includes:
  • the audio matching degree score is added in the form of an image at a position corresponding to the target audio segment in the target video.
  • before determining the total human action matching degree score of the target video relative to the reference video based on the human action matching degree of each target image frame in the target video relative to the corresponding reference image frame in the reference video, the method also includes:
  • the multiple human body key points include a preset reference human body key point and non-reference human body key points, and determining, based on the positions of the multiple human body key points in the target image frame and in the reference image frame, the angle between the lines connecting the same human body key points in the target image frame and in the reference image frame includes:
  • for each non-reference human body key point, determining, in the target image frame, a first connection line between the non-reference human body key point and the reference human body key point based on the position of the non-reference human body key point and the position of the reference human body key point, obtaining a second connection line between the non-reference human body key point and the reference human body key point in the reference image frame, and determining the angle between the first connection line and the second connection line.
  • the method further includes:
  • the determining, based on the determined included angle, the degree of human motion matching of the target image frame relative to the reference image frame including:
  • the matching degree of the human motion of the target image frame with respect to the reference image frame is determined.
  • before determining the total audio matching degree score of the target audio relative to the reference audio, the method also includes:
  • the target audio segments included in the target audio are obtained one by one; each time a target audio segment is obtained, the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio is determined, and based on that fundamental frequency similarity, the audio matching degree of the target audio segment relative to the corresponding reference audio segment is determined.
  • the determining the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio includes:
  • the proportion, in the total number of frames of the target audio segment, of the number of audio frames whose fundamental frequency difference from the corresponding reference frame is within a preset range is used as the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio.
  • the method further includes:
  • Determining the audio matching degree of the target audio segment relative to the corresponding reference audio segment based on the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio including:
  • the audio matching degree of the target audio segment with respect to the corresponding reference audio segment is determined.
  • the determining the text similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio includes:
  • the text similarity between the target audio segment and the corresponding reference audio segment in the reference audio is taken as the text similarity.
  • a video classification device comprising:
  • an acquisition module for acquiring target audio and corresponding target video including human motion
  • a video determination module configured to determine the total human motion matching degree of the target video relative to the reference video based on the human motion matching degree of each target image frame in the target video relative to the corresponding reference image frame in the reference video score;
  • An audio determination module configured to determine the audio of the target audio relative to the reference audio based on the audio matching degree of each target audio segment in the target audio relative to the corresponding reference audio segment in the reference audio of the reference video Overall match score;
  • a comprehensive determination module configured to determine a comprehensive classification result based on the total human action matching degree score and the audio total matching degree score.
  • the device further includes:
  • a first determination module configured to determine the human action matching degree score corresponding to the human action matching degree of the target image frame relative to the reference image frame, and determine the audio matching of the target audio segment relative to the reference audio segment The audio matching score corresponding to the degree;
  • the first display module is used for displaying the human action matching degree score curve based on the human body motion matching degree score corresponding to each target image frame, and displaying the audio matching degree scoring curve based on the audio matching degree score corresponding to each target audio segment.
  • the device further includes:
  • a second display module configured to display the time axis corresponding to the target video and the target audio
  • the human action matching degree score of the target image frame corresponding to the target time point and the audio matching degree score of the target audio segment corresponding to the target time point are displayed.
  • the device further includes:
  • a first determining module configured to determine a corresponding human motion matching degree score based on the human motion matching degree of the target image frame relative to the reference image frame;
  • the first adding module is configured to add the human action matching degree score in the form of an image at a position corresponding to the target image frame in the target video.
  • the device further includes:
  • a first determining module configured to determine a corresponding audio matching degree score based on the audio matching degree of the target audio segment relative to the reference audio segment
  • the second adding module is configured to add the audio matching degree score in the form of an image at a position corresponding to the target audio segment in the target video.
  • the first determining module is further used for:
  • the plurality of human body key points include preset reference human body key points and non-reference human body key points, and the first determination module is used for:
  • for each non-reference human body key point, determining, in the target image frame, a first connection line between the non-reference human body key point and the reference human body key point based on the position of the non-reference human body key point and the position of the reference human body key point, obtaining a second connection line between the non-reference human body key point and the reference human body key point in the reference image frame, and determining the angle between the first connection line and the second connection line.
  • the first determining module is further used for:
  • the first determining module is further used for:
  • each determined included angle is processed, and the corresponding processing result value of each included angle is obtained;
  • the matching degree of the human motion of the target image frame with respect to the reference image frame is determined.
  • the first determining module is further used for:
  • the target audio segments included in the target audio are obtained one by one; each time a target audio segment is obtained, the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio is determined, and based on that fundamental frequency similarity, the audio matching degree of the target audio segment relative to the corresponding reference audio segment is determined.
  • the first determining module is used for:
  • the proportion, in the total number of frames of the target audio segment, of the number of audio frames whose fundamental frequency difference from the corresponding reference frame is within a preset range is used as the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio.
  • the first determining module is further used for:
  • Determining the audio matching degree of the target audio segment relative to the corresponding reference audio segment based on the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio including:
  • the audio matching degree of the target audio segment with respect to the corresponding reference audio segment is determined.
  • the first determining module is used for:
  • the text similarity of the target audio segment with respect to the corresponding reference audio segment in the reference audio is used as the text similarity.
  • a video classification method comprising:
  • acquiring target audio and a corresponding target video including human motion;
  • a total audio matching degree score of the target audio relative to the reference audio is determined, wherein the reference audio is the audio corresponding to the reference video;
  • the method further includes:
  • based on the human action matching degree score corresponding to each target image frame, a human action matching degree score curve is displayed, and based on the audio matching degree score corresponding to each target audio segment, an audio matching degree score curve is displayed.
  • the method further includes:
  • the human action matching degree score of the target image frame corresponding to the target time point and the audio matching degree score of the target audio segment corresponding to the target time point are displayed.
  • the method further includes:
  • the human action matching degree score is added in the form of an image.
  • the method further includes:
  • the audio matching degree score is added in the form of an image at a position corresponding to the target audio segment in the target video.
  • before determining the total human action matching degree score of the target video relative to the reference video based on the human action matching degree of each target image frame in the target video relative to the corresponding reference image frame in the reference video, the method also includes:
  • the multiple human body key points include a preset reference human body key point and non-reference human body key points, and determining, based on the positions of the multiple human body key points in the target image frame and in the reference image frame, the angle between the lines connecting the same human body key points in the target image frame and in the reference image frame includes:
  • for each non-reference human body key point, determining, in the target image frame, a first connection line between the non-reference human body key point and the reference human body key point based on the position of the non-reference human body key point and the position of the reference human body key point, obtaining a second connection line between the non-reference human body key point and the reference human body key point in the reference image frame, and determining the angle between the first connection line and the second connection line.
  • the method further includes:
  • determining, based on the determined included angle, the degree of human motion matching of the target image frame relative to the reference image frame including:
  • the matching degree of human motion of the target image frame with respect to the reference image frame is determined.
  • before determining the total audio matching degree score of the target audio relative to the reference audio, the method also includes:
  • the target audio segments included in the target audio are obtained one by one; each time a target audio segment is obtained, the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio is determined, and based on that fundamental frequency similarity, the audio matching degree of the target audio segment relative to the corresponding reference audio segment is determined.
  • the determining the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio includes:
  • the proportion, in the total number of frames of the target audio segment, of the number of audio frames whose fundamental frequency difference from the corresponding reference frame is within a preset range is used as the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio.
  • the method further includes:
  • Determining the audio matching degree of the target audio segment relative to the corresponding reference audio segment based on the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio including:
  • the audio matching degree of the target audio segment with respect to the corresponding reference audio segment is determined.
  • the determining the text similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio includes:
  • the text similarity between the target audio segment and the corresponding reference audio segment in the reference audio is taken as the text similarity.
  • a video classification device comprising:
  • an acquisition module for acquiring target audio and corresponding target video including human motion
  • a video determination module configured to determine the total human motion matching degree of the target video relative to the reference video based on the human motion matching degree of each target image frame in the target video relative to the corresponding reference image frame in the reference video score;
  • An audio determination module configured to determine the total audio matching score of the target audio relative to the reference audio based on the audio matching degree of each target audio segment in the target audio relative to the corresponding reference audio segment in the reference audio, Wherein, the reference audio is the audio corresponding to the reference video;
  • a comprehensive determination module configured to determine a comprehensive classification result based on the total human action matching degree score and the audio total matching degree score.
  • the device further includes:
  • a first determination module configured to determine the human action matching degree score corresponding to the human action matching degree of the target image frame relative to the reference image frame, and determine the audio matching of the target audio segment relative to the reference audio segment The audio matching score corresponding to the degree;
  • the first display module is used for displaying the human action matching degree score curve based on the human body action matching degree score corresponding to each target image frame, and displaying the audio matching degree scoring curve based on the audio matching degree score corresponding to each target audio segment.
  • the device further includes:
  • a second display module configured to display the time axis corresponding to the target video and the target audio
  • the human action matching degree score of the target image frame corresponding to the target time point and the audio matching degree score of the target audio segment corresponding to the target time point are displayed.
  • the device further includes:
  • a first determining module configured to determine a human motion matching degree score corresponding to the target image frame based on the human motion matching degree of the target image frame relative to the reference image frame;
  • the first adding module is configured to add the human action matching degree score in the form of an image at a position corresponding to the target image frame in the target video.
  • the device further includes:
  • a first determining module configured to determine an audio matching degree score corresponding to the target audio segment based on the audio matching degree of the target audio segment relative to the reference audio segment;
  • the second adding module is configured to add the audio matching degree score in the form of an image at a position corresponding to the target audio segment in the target video.
  • the first determining module is further used for:
  • the plurality of human body key points include preset reference human body key points and non-reference human body key points, and the first determination module is used for:
  • for each non-reference human body key point, determining, in the target image frame, a first connection line between the non-reference human body key point and the reference human body key point based on the position of the non-reference human body key point and the position of the reference human body key point, obtaining a second connection line between the non-reference human body key point and the reference human body key point in the reference image frame, and determining the angle between the first connection line and the second connection line.
  • the first determining module is further used for:
  • the first determining module is further used for:
  • each determined included angle is processed, and the corresponding processing result value of each included angle is obtained;
  • the matching degree of the human motion of the target image frame with respect to the reference image frame is determined.
  • the first determining module is further used for:
  • the first determining module is used for:
  • the proportion, in the total number of frames of the target audio segment, of the number of audio frames whose fundamental frequency difference from the corresponding reference frame is within a preset range is used as the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio.
  • the first determining module is further used for:
  • Determining the audio matching degree of the target audio segment relative to the corresponding reference audio segment based on the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio including:
  • the audio matching degree of the target audio segment with respect to the corresponding reference audio segment is determined.
  • the first determining module is used for:
  • the text similarity between the target audio segment and the corresponding reference audio segment in the reference audio is taken as the text similarity.
  • In a fifth aspect, a computer device includes a processor and a memory; the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the video classification method.
  • A computer-readable storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the operations performed by the video classification method.
  • the solution provided in the embodiments of the present application can obtain the target audio and the target video, determine the total human action matching degree score of the target video relative to the reference video based on the human action matching degree of each target image frame in the target video relative to the corresponding reference image frame in the reference video, determine the total audio matching degree score of the target audio relative to the reference audio based on the audio matching degree of each target audio segment in the target audio relative to the corresponding reference audio segment in the reference audio, and then determine the comprehensive classification result based on the total human action matching degree score and the total audio matching degree score.
  • the comprehensive classification result can reflect the overall degree of imitation of both the video and the audio. Therefore, the embodiments of the present application provide a method capable of classifying imitations of song and dance videos.
  • FIG. 1 is a flowchart of a method for classifying videos provided in an embodiment of the present application
  • FIG. 2 is a schematic display diagram of a scoring window provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a method for determining human action matching degree provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of a human body key point provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of determining an included angle between a first connection line and a second connection line according to an embodiment of the present application
  • FIG. 6 is a flowchart of a method for determining an audio matching degree provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a video classification apparatus provided by an embodiment of the present application.
  • FIG. 8 is a structural block diagram of a terminal provided by an embodiment of the present application.
  • FIG. 9 is a structural block diagram of a server provided by an embodiment of the present application.
  • the embodiment of the present application provides a video classification method, and the method can be implemented by a computer device.
  • the computer device can be a terminal or a server, etc.
  • the terminal can be a desktop computer, a notebook computer, a tablet computer, a mobile phone, etc.
  • the server can be a single server or a server cluster.
  • a computer device may include a processor, memory, input components, output components, communication components, and the like.
  • the memory can be any of a variety of non-volatile or volatile storage devices and can be used for data storage, such as data of the application with video shooting and evaluation functions, pre-stored data for determining matching degrees and displaying scores, data generated when the user records video and audio, intermediate data generated in the process of determining matching degrees, and so on.
  • the processor can be used to run applications with video capture and assessment capabilities, process user-recorded video and audio, and more.
  • the input component may include a mouse, a keyboard, a touchpad, a handwriting pad, a microphone, etc., and is used to obtain data input by the user.
  • the output components can include image output components, audio output components, vibration output components, etc.
  • the image output component can be used to display the interface of the application program with video shooting and evaluation functions for user operation, and can also be used to display the reference video and the target video.
  • the audio output component can be used to play reference audio, and can also be used to play audio recorded by the user, and the vibration output component can be used to output some prompt signals.
  • the computer device can be connected with an image capture device, such as a camera, etc., which can be used for video shooting.
  • the image capture device can be an independent device or a matching component of the computer device.
  • the video in this application refers to a video that does not have an audio part and only has an image part.
  • FIG. 1 is a flowchart of a video classification method provided by an embodiment of the present application. Referring to Figure 1, this embodiment includes:
  • the target audio is the audio in the audio and video recorded by the user; for example, it can be the singing audio in a song and dance recording made by the user. The target video is the video in the audio and video recorded by the user; for example, it can be the dance video in a song and dance recording made by the user.
  • the user can click to run an application program with video shooting and evaluation functions on the terminal and record the audio and video that the user wants to imitate in the application program; the computer device can then obtain the audio and video recorded by the user and extract the target audio and target video included in it.
  • there are various possibilities for the time at which step 101 is performed: the complete recorded target audio and target video can be obtained when the recording is completed and the completion interface is entered, or a previously recorded target audio and target video can be obtained when the corresponding work is selected in the "My Works" interface, and so on.
  • the reference video is the video in the reference audio and video imitated by the user; for example, it may be the video in an MV (Music Video) imitated by the user.
  • the technician can preset the time interval between two adjacent target image frames.
  • the computer device can periodically acquire the target image frame and the corresponding reference image frame. Specifically, in the process of recording the target audio and video, the computer device can, based on a preset interval, periodically acquire an image frame in the target video and determine it and the corresponding image frame in the reference video as the target image frame and the corresponding reference image frame respectively. The pair is then processed to calculate the human motion matching degree between the target image frame and the corresponding reference image frame (the calculation method of the human motion matching degree will be described in detail below and is not repeated here).
  • a human action matching degree score corresponding to each human action matching degree can be determined based on the matching degree and an algorithm preset by the technician. For example, if the human action matching degree is 0.8, the human action matching degree score can be set to 100 times the matching degree, that is, 80 points. The obtained human action matching degree score is then stored in a preset location. After the next target image frame is acquired, the human action matching degree score between it and the corresponding reference image frame is calculated and stored, and so on until the user's recording is completed.
  • the total score of the target video relative to the reference video can be calculated, that is, the total human action matching degree score.
  • the display of the completion interface is triggered.
  • the human action matching degree score corresponding to each target image frame, calculated and saved during the recording process, is retrieved, and the total human action matching degree score can then be calculated and displayed in the completion interface.
  • the average of the human action matching degree scores of the target image frames in the target video can be used as the total human action matching degree score, or the median of those scores can be used, or one minimum value and one maximum value can be removed and the average of the remaining scores used, and so on; any one of the above calculation methods, or another calculation method, can be selected to calculate the total human action matching degree score. This application does not limit the algorithm; a sketch of these options is shown below.
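  • As an illustration only (not part of the original disclosure), the aggregation options above can be sketched as follows; the function name and the choice of dropping exactly one minimum and one maximum in the trimmed variant are assumptions.

```python
from statistics import mean, median

def total_action_score(frame_scores, method="mean"):
    """Aggregate per-frame human action matching degree scores into a total score."""
    if not frame_scores:
        return 0.0
    if method == "mean":
        return mean(frame_scores)
    if method == "median":
        return median(frame_scores)
    if method == "trimmed_mean":
        # Remove one minimum and one maximum value, then average the rest.
        if len(frame_scores) <= 2:
            return mean(frame_scores)
        return mean(sorted(frame_scores)[1:-1])
    raise ValueError(f"unknown method: {method}")

# Example: per-frame scores saved during recording.
print(total_action_score([80, 72, 95, 60, 88], method="trimmed_mean"))  # 80
```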
  • the reference audio is the audio corresponding to the reference video and is the audio in the reference audio and video imitated by the user; for example, it may be the audio in the MV (Music Video) imitated by the user.
  • the technician presets the duration of the target audio segment.
  • the computer device may periodically acquire the target audio segment and the corresponding reference audio segment. Specifically, the computer device may acquire the target audio segment and the corresponding reference audio segment according to the preset duration of the target audio segment. Then it is processed, and the matching degree between the target audio segment and the corresponding reference audio segment is calculated, that is, the audio matching degree (the calculation method of the audio matching degree will be described in detail below, and will not be repeated here).
  • an audio matching degree score corresponding to the audio matching degree can be determined based on each audio matching degree and an algorithm preset by the technician.
  • For example, if the audio matching degree is 0.5, the audio matching degree score can be set to 100 times the audio matching degree, that is, 50 points. The obtained audio matching degree score is then stored in a preset location. After the next target audio segment is acquired, the audio matching degree score of the next target audio segment relative to the corresponding reference audio segment is calculated and then stored, and so on until the user's recording is completed.
  • the total score of the target audio relative to the reference audio, that is, the total audio matching degree score, can be calculated.
  • the audio matching score corresponding to each target audio segment calculated and saved during the recording process will be retrieved, and then the total audio matching score can be calculated and displayed in the completion interface.
  • the average of the audio matching degree scores of the target audio segments in the target audio can be used as the total audio matching degree score, or the median of those scores can be used.
  • the comprehensive classification result may be considered as a comprehensive score or a comprehensive rating of human actions and audio in the audio and video recorded by the user.
  • the total human action matching degree score and the total audio matching degree score can be weighted to obtain a comprehensive score or comprehensive rating for the target video and target audio imitated by the user, which is then displayed in the completion interface.
  • If the comprehensive classification result is a comprehensive score, it can be obtained by weighting the total human action matching degree score and the total audio matching degree score.
  • If the comprehensive classification result is a comprehensive rating, the technician needs to classify the rating categories in advance, and after the comprehensive score is calculated, the comprehensive rating is determined according to that classification.
  • For example, technicians can pre-classify the comprehensive classification results into five categories, B, A, S, SS, and SSS, corresponding to comprehensive scores of 0-20, 21-40, 41-60, 61-80, and above 80, respectively.
  • For example, if the total human action matching degree score is 80 points, the total audio matching degree score is 60 points, the video weight is 0.6 and the audio weight is 0.4, the comprehensive score is 80 × 0.6 + 60 × 0.4 = 72 points, which falls in the 61-80 band and therefore corresponds to the rating SS, as sketched below.
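  • A minimal sketch of this weighting and rating step, using the example weights (0.6 video, 0.4 audio) and the five example bands; the exact handling of band boundaries is an assumption.

```python
def comprehensive_result(action_total, audio_total, w_video=0.6, w_audio=0.4):
    """Weight the total human action score and total audio score into a
    comprehensive score, then map it onto the example rating bands."""
    score = action_total * w_video + audio_total * w_audio
    if score <= 20:
        rating = "B"
    elif score <= 40:
        rating = "A"
    elif score <= 60:
        rating = "S"
    elif score <= 80:
        rating = "SS"
    else:
        rating = "SSS"
    return score, rating

# Example from the text: action total 80, audio total 60 -> (72.0, 'SS')
print(comprehensive_result(80, 60))
```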
  • the scores of each part of the recorded audio and video can be calculated, and these scores can be presented in the form of curves for subsequent viewing by the user.
  • the corresponding processing can be as follows:
  • In addition to displaying the total human action matching degree score, the total audio matching degree score and the comprehensive classification result, the completion interface also has a "Generate" button. The user can click the "Generate" button to trigger synthesis of the target audio, the target video and the accompaniment into a synthesized audio and video; the release interface, which is provided with a "release" button, is then displayed.
  • the matching degree scores, the total human action matching degree score, the total audio matching degree score, the comprehensive classification result and the synthesized audio and video are stored in the server. The user can click the icon of the imitated audio and video in the "My Works" interface to trigger display of the audio and video work interface, and then view the scores and the synthesized audio and video, as follows:
  • Audio and video options and a scoring option are set in the "My Works" interface. Clicking the audio and video option triggers retrieval of the stored synthesized audio and video from the server, and the target audio and target video recorded by the user, together with the synthesized accompaniment audio and video, are played. Clicking the scoring option triggers retrieval of the scores stored in the server and displays the scoring interface, which shows the total human action matching degree score, the total audio matching degree score and the comprehensive classification result, as well as an abbreviation icon for each audio text; clicking the abbreviation icon of any audio text triggers display of a scoring window.
  • The window displays the human action matching degree score curve and the audio matching degree score curve drawn from the human action matching degree scores and audio matching degree scores of the time period corresponding to the audio text abbreviation. For example, Figure 2 shows the audio and video scores from 1 minute 15 seconds to 1 minute 24 seconds, in which the curve corresponding to the action is the human action matching degree score curve and the curve corresponding to the sound is the audio matching degree score curve.
  • the time axis corresponding to the human action matching degree scoring curve and the audio matching degree scoring curve can be displayed accordingly, so that the user can view it according to the time point, and the corresponding processing can be as follows:
  • the corresponding time axis can be displayed at the position corresponding to the human action matching score curve and the audio matching score curve.
  • If the user wants to view the human action matching degree score and the audio matching degree score at a certain time point, the corresponding position on the time axis can be selected.
  • Figure 2 shows the time axis from 1 minute 15 seconds to 1 minute 24 seconds, the mouse is located at the position of 1 minute 20 seconds on the time axis, and the magnified window below the mouse shows the human action matching degree corresponding to 1 minute and 20 seconds Scoring and Audio Match Score.
  • the scores of each part of the audio and video can be integrated into the target video in the form of images to generate a new video in which the corresponding scores are displayed.
  • the corresponding processing can be as follows:
  • the human motion matching degree score corresponding to the target image frame is determined; the human motion matching degree score is added in the form of an image at the position corresponding to the target image frame in the target video.
  • the audio matching degree score corresponding to the target audio segment is determined; the audio matching degree score is added in the form of an image at the position corresponding to the target audio segment in the target video.
  • the user can click the "Generate” button in the completion interface, which will trigger the modification of the pixels in the corresponding positions of all image frames in the target video, and display the total human action matching score, audio total matching score and comprehensive classification results at the position.
  • It is also possible to modify the pixels at the corresponding position of the image frames in the time period corresponding to a target audio segment so that the audio matching degree score corresponding to that target audio segment is displayed at that position, or to modify the image frames in the time period corresponding to multiple target audio segments within a preset duration so that the average of the audio matching degree scores corresponding to those target audio segments is displayed at that position. In this way, the various scores can be integrated into the target video in the form of images.
  • When watching the audio and video, the user can not only see the action video he or she imitated and hear the accompaniment and the recorded audio, but also view the total audio score, the total video score, the classification of the audio and video, and the video and audio scores corresponding to the current playback time point.
  • the score of each part of the audio and video can be displayed in real time, and the corresponding processing can be as follows:
  • the human motion matching degree score corresponding to the processed target image frame can be displayed after a preset time period after the time point when the target image frame and the corresponding reference image frame are obtained.
  • the value of the human action matching score can be displayed for a preset display time in the form of floating layer display in the recording interface.
  • For example, the technician can preset both the interval between adjacent target image frames and the display time to be 3 seconds. The computer device obtains a target image frame and the corresponding reference image frame at the 2nd second of the target video, displays the calculated human action matching degree score of that target image frame at the 5th second, obtains the next target image frame and the corresponding reference image frame at the 5th second, and then updates the displayed human action matching degree score at the 8th second to show the score corresponding to this target image frame, and so on.
  • Alternatively, the technician can preset the interval between adjacent target image frames as a seconds; the computer device obtains the first target image frame and the corresponding reference image frame at the b-th second of the target video and displays the corresponding human action matching degree score as soon as it is calculated. When the computer device obtains the second target image frame and the corresponding reference image frame, it processes them to obtain the human action matching degree score corresponding to the second target image frame and updates the human action matching degree score displayed in the recording interface to that score, and so on.
  • the human action matching score can be displayed in the form of numerical values or in the form of curves. If the human action matching score is displayed in numerical form, it will be displayed in the recording interface and updated in real time.
  • If the human action matching degree score is displayed in the form of a curve, each time a human action matching degree score is calculated, the point corresponding to it is connected with the point corresponding to the previous human action matching degree score, so that a human action matching degree score curve is drawn in the recording interface. The entire curve can be moved to the left in the recording interface synchronously with playback, so that the user can always see the most recently updated real-time score. It can be understood that the human action matching degree score curve starts from 0.
  • the audio matching degree score can be displayed in the form of a numerical value, or can be displayed in the form of a curve. If the audio matching score is displayed in numerical form, the audio matching score corresponding to the preset number of target audio segments can be calculated, then the average value is taken, and the average value is displayed. After the average value of the audio matching score corresponding to the segment, the displayed score is updated to display the latest calculated average value. If the audio matching score is displayed in the form of a curve, in the recording interface, a lyric being sung or about to be sung will be displayed in real time during recording, and the color of the lyrics will gradually change to other colors with the playback time.
  • the lyrics corresponding to the audio before the current playback time point change to the other color, and as the user sings, each time a target audio segment is completed, the connection line between the audio matching degree score corresponding to that target audio segment and the previous audio matching degree score, that is, the audio matching degree score curve, is displayed at the corresponding position above the lyrics.
  • the displayed audio matching degree score curve is thus a curve connecting the multiple audio matching degree scores corresponding to the whole sentence of lyrics sung by the user. It can be understood that each segment of the curve starts from 0: each time an audio matching degree score is generated, the curve extends to the right by one segment until the next line of lyrics appears, and a new curve starts from 0 again.
  • FIG. 3 is a flowchart of a method for determining a human action matching degree provided by an embodiment of the present application. Referring to Figure 3, this embodiment includes:
  • the target image frame is an image frame in the target video collected by the image collection device.
  • the user can click to run the application with video shooting and evaluation functions on the terminal, triggering the display of the main interface of the application.
  • The main interface is provided with options for recommended popular videos and a search bar; the user can click the option of a recent popular video, or search for the video to be imitated in the search bar and click its option, to trigger display of the interface of the video to be imitated. Clicking the "record" button in that interface triggers playback of the imitated video, that is, the reference video; at this time, the user can imitate the actions in the reference video, and the image acquisition device called by the application records the video imitated by the user, that is, the target video.
  • the interface of the application can display the reference video and the target video at the same time.
  • the target image frame may be all image frames in the target video, that is, all image frames in the target video are processed in the process of this solution.
  • The target image frame can also be an image frame obtained periodically from the target video, that is, only part of the image frames in the target video are processed in this solution; for example, the first image frame of every 20 image frames is selected as a target image frame. Periodically acquiring the target image frame does not have much influence on determining the matching degree of the user's imitated actions, and the amount of data to be processed is not too large.
  • The corresponding reference image frame, that is, the image frame of the video imitated by the user, can be obtained in either of the following manners:
  • Manner 1 After acquiring the target image frame collected by the image acquisition device, a reference image frame whose playback time point in the reference video is the same as the playback time point of the target image frame in the target video may also be acquired.
  • a technician may preset the frame interval of the target video and the reference video, and the frame interval of the two may be the same or different.
  • If the frame interval of the target video and the reference video is the same, after the target image frame is collected by the image acquisition device, the playback time point in the reference video closest to the playback time point of the target image frame in the target video is determined, and the image frame at that playback time point in the reference video is selected as the reference image frame corresponding to the target image frame.
  • If the frame intervals are different, the playback time point in the reference video that is the same as that of the target image frame is determined, and the image frame in the reference video closest to that playback time point is selected as the reference image frame corresponding to the target image frame.
  • Manner 2: the reference image frame being played at the collection time point of the target image frame may also be obtained.
  • The clock time when the target image frame is collected can be determined and compared with the previously recorded clock time at which the reference video started playing to obtain a difference value; this difference value is the playback time point of the corresponding reference image frame in the reference video. Therefore, based on the difference value, the reference image frame at the corresponding playback time point in the reference video can be determined.
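  • A rough sketch of the two manners above, assuming the frame rates of the target and reference videos are known and that rounding to the nearest frame is acceptable; the function names and parameters are illustrative only.

```python
def reference_index_by_playback_time(target_index, target_fps, reference_fps):
    """Manner 1: map the target frame's playback time point onto the
    closest frame of the reference video."""
    playback_time = target_index / target_fps       # seconds into the target video
    return round(playback_time * reference_fps)     # closest reference frame index

def reference_index_by_clock(capture_clock, reference_start_clock, reference_fps):
    """Manner 2: use the wall-clock capture time of the target frame minus
    the clock time at which the reference video started playing."""
    playback_time = capture_clock - reference_start_clock
    return round(playback_time * reference_fps)
```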
  • the key points of the human body are the feature points of the human body with obvious features that can display the relative position of each part when the human body moves, for example, a wrist point, an ankle point, a knee point, an elbow point, and the like.
  • technicians can preset a number of human key points.
  • For example, 17 human body key points are set, as shown in FIG. 4: the head point (that is, the point between the eyebrows), the throat point, the chest point, the navel point, the chin point, the left shoulder point, the right shoulder point, the left elbow point, the right elbow point, the left wrist point, the right wrist point, the left hip point, the right hip point, the left knee point, the right knee point, the left ankle point and the right ankle point.
  • Technicians can train the machine learning model for 17 key points in the human body in advance, and then input the target image frame into the machine learning model, and can output the position coordinates of the 17 key points of the human body in the target image frame.
  • the machine learning model can also be used to determine the positions of the same human body key points in the reference image frame, so that the position coordinates of the 17 human body key points in the reference image frame can be obtained. Any two human body key points can then be selected and connected in the target image frame and in the reference image frame respectively; the two connecting lines are placed in the same coordinate system, and one of them is translated in the two-dimensional plane so that one endpoint of the two connecting lines coincides, at which point the angle between the two lines can be obtained.
  • the slopes of the connecting line in the target image frame and of the connecting line in the reference image frame can also be obtained, and the included angle of the two connecting lines can then be calculated from the two slopes.
  • one or more human body key points with reference significance can be set as the reference human body key point, and further, the connection between the non-reference key point and the reference key point can be determined, Then determine the angle.
  • the multiple human body key points may include preset reference human body key points and non-reference human body key points, and the processing of step 303 may be as follows:
  • for each non-reference human body key point, a first connecting line between the non-reference human body key point and the reference human body key point is determined in the target image frame, a second connecting line between the non-reference human body key point and the reference human body key point is obtained in the reference image frame, and the included angle between the first connecting line and the second connecting line is determined.
  • the technician can preset one of the multiple human body key points as the reference human body key point.
  • the throat point is selected as the reference human body key point, and the other 16 human body key points are non-reference human body key points.
  • In the target image frame, the non-reference human body key point is connected with the reference human body key point to obtain the first connecting line, and in the reference image frame the same non-reference human body key point is connected with the reference human body key point to obtain the second connecting line; one of the connecting lines is translated in the same coordinate system so that the reference human body key points (the throat points in this embodiment) coincide, and the angle between the first connecting line and the second connecting line can then be determined.
  • For example, as shown in Fig. 5, for the throat point and the left wrist point, the first line connecting throat point A1 and left wrist point B1 in the target image frame and the second line connecting throat point A2 and left wrist point B2 in the reference image frame are obtained; the first connecting line is then translated in the two-dimensional plane of the two connecting lines so that throat point A1 coincides with throat point A2, and included angle 1 between the two connecting lines is obtained. Included angle 1 is the angle corresponding to the throat point and the left wrist point.
  • the included angle 2 corresponding to the throat point and the right wrist point, the included angle 3 corresponding to the throat point and the head point, the included angle 4 corresponding to the throat point and the thoracic point, etc. can be obtained.
  • a total of 16 included angles can be obtained for the target image frame and the reference image frame.
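  • A sketch of this angle computation, assuming the key-point coordinates for both frames are already available (for example from the machine learning model mentioned above); computing the angle from vectors instead of an explicit translation is an implementation choice, not taken from the original text.

```python
import math

def included_angle(ref_pt_target, pt_target, ref_pt_ref, pt_ref):
    """Angle in degrees between the first connection line (target frame) and the
    second connection line (reference frame), both drawn from the reference
    human body key point to the same non-reference key point."""
    v1 = (pt_target[0] - ref_pt_target[0], pt_target[1] - ref_pt_target[1])
    v2 = (pt_ref[0] - ref_pt_ref[0], pt_ref[1] - ref_pt_ref[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return 0.0
    cos_a = max(-1.0, min(1.0, (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)))
    return math.degrees(math.acos(cos_a))

def included_angles(target_kps, reference_kps, ref_name="throat"):
    """Included angle for each non-reference key point (16 of them here).
    target_kps / reference_kps map key-point names to (x, y) coordinates."""
    return {
        name: included_angle(target_kps[ref_name], target_kps[name],
                             reference_kps[ref_name], reference_kps[name])
        for name in target_kps if name != ref_name
    }
```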
  • the human action matching degree is the similarity of the user's action relative to the action of the characters in the imitated video.
  • each angle value is converted to obtain the processing result values corresponding to the 16 included angles; based on these 16 processing result values, a score that can be fed back to the user and used to show the user the human action matching degree can be determined.
  • the process of determining the human motion matching degree of the target image frame relative to the reference image frame may be as follows:
  • each determined angle is processed to obtain the processing result value corresponding to each included angle; based on the processing result value corresponding to each included angle, the human motion matching degree of the target image frame relative to the reference image frame is determined.
  • the value of each included angle can be converted by a function preset by the technician, and converted into a processing result value that can display the integrity of the user's imitation action.
  • For example, formula (1) preset by the technician can be used for the calculation, where y is the processing result value and the input variable is the included angle.
  • the 16 included angles are respectively put into the formula to obtain 16 processing result values, and then the average of these 16 processing result values can be obtained as the matching degree of human motion between the target image frame and the reference image frame.
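  • Formula (1) itself is not reproduced in this text, so the per-angle mapping below is only an assumed stand-in (1 for a perfectly matching angle of 0 degrees, falling to 0 at 180 degrees); what the text does state is that the 16 processing result values are averaged to give the human action matching degree.

```python
def processing_result(angle_deg):
    """Assumed stand-in for formula (1): maps an included angle in degrees
    to a value in [0, 1]."""
    return max(0.0, 1.0 - angle_deg / 180.0)

def human_action_matching_degree(angles_deg):
    """Average of the per-angle processing result values."""
    values = [processing_result(a) for a in angles_deg]
    return sum(values) / len(values) if values else 0.0
```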
  • the human action matching degree of this image frame can be displayed on the interface of the application as a score, so that users can view the score of their own action imitation in real time for adjustment.
  • FIG. 6 is a flowchart of a method for determining the audio matching degree provided by an embodiment of the present application. Referring to Figure 6, the processing procedure may be as follows:
  • users can imitate and score audio in addition to imitating video.
  • the user can choose a karaoke song, or choose a mode of both dancing and singing, that is, imitating the actions in the reference video and also imitating the corresponding reference audio.
  • the audio in this embodiment of the present application may be the dry vocal of a cappella music, or other human voice audio.
  • the user can select an imitation file with both video and audio in the application, and click the "record" button to trigger the playback of the reference video, the reference audio corresponding to the reference video and the accompaniment audio.
  • whether or not the reference audio corresponding to the reference video is played can be selected according to the user's operation in the interface, so that the user can choose whether to record the target audio while recording the target video.
  • the user's voice can also be recorded through the microphone of the terminal, which is the target audio.
  • the fundamental frequencies of the target audio segment and of the corresponding reference audio segment can be obtained, and the fundamental frequency similarity between the target audio segment and the corresponding reference audio segment can be determined by judging whether the fundamental frequency of the target audio segment and the corresponding fundamental frequency of the reference audio segment are consistent or within a difference value range.
  • the process of determining the fundamental frequency similarity may be as follows: the number of target audio frames whose fundamental frequency difference from the corresponding reference audio frame is within a preset range is counted, and the proportion of that number in the total number of frames of the target audio segment is taken as the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio.
  • the technician may preset the reference audio, and divide the reference audio into multiple reference audio segments.
  • the duration of each reference audio segment in the reference audio may be the same or different.
  • the corresponding target audio is also divided into multiple target audio segments. It can be understood that the start time point and the end time point of each target audio segment and the corresponding reference audio segment are the same.
  • the technician can preset the duration corresponding to each frame in the reference audio, and then can obtain the fundamental frequency corresponding to each audio frame.
  • the fundamental frequencies of all audio frames in the target audio segment and in the reference audio segment are obtained, and the fundamental frequencies in the target audio segment are compared one by one with the corresponding fundamental frequencies in the reference audio segment. That is, for each audio frame in the target audio segment, the difference between its fundamental frequency and the fundamental frequency of the corresponding audio frame in the reference audio segment is obtained, and the number of differences that fall within the preset range is counted. This number is then compared with the total number of fundamental frequencies in the target audio segment, and the resulting ratio is the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio.
  • for example, both the target audio segment and the reference audio segment contain 1000 audio frames, so 1000 fundamental frequency difference values are obtained, and the preset range for the fundamental frequency difference value is set to 8 Hz.
  • if 800 of the determined fundamental frequency difference values are within the 8 Hz range, the fundamental frequency similarity of the target audio segment is 800/1000, that is, 0.8.
  • the fundamental frequency similarity of each target audio segment relative to the corresponding reference audio segment in the reference audio can then be used directly as the audio matching degree of that target audio segment relative to the corresponding reference audio segment.
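  • A minimal Python sketch of this fundamental-frequency comparison, assuming the per-frame fundamental frequencies (in Hz) have already been extracted for both segments; the sample data below are made up to mirror the 800-out-of-1000 example.

    from typing import Sequence

    def f0_similarity(target_f0: Sequence[float],
                      reference_f0: Sequence[float],
                      max_diff_hz: float = 8.0) -> float:
        # Fraction of frames whose fundamental-frequency difference is within the preset range.
        assert len(target_f0) == len(reference_f0)
        within = sum(1 for t, r in zip(target_f0, reference_f0) if abs(t - r) <= max_diff_hz)
        return within / len(target_f0)

    target = [220.0] * 800 + [260.0] * 200   # 800 frames close to the reference, 200 far off
    reference = [221.0] * 1000
    print(f0_similarity(target, reference))  # 0.8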
  • in step 602, when determining the audio matching degree between the target audio segment and the reference audio segment, in addition to the fundamental frequency, whether the lyrics sung by the user are accurate can also be considered.
  • the processing of step 602 can be as follows:
  • For each target audio segment, determine the text similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio; based on the fundamental frequency similarity and the text similarity of each target audio segment in the target audio relative to the corresponding reference audio segment in the reference audio, determine the audio matching degree of each target audio segment relative to the corresponding reference audio segment.
  • after the fundamental frequency similarity between each target audio segment and the corresponding reference audio segment has been determined, it can also be determined whether the lyrics sung by the user in the target audio segment are accurate. The text similarity is obtained by comparing the text of the target audio segment with that of the corresponding reference audio segment. At this point, the target audio segment has both a fundamental frequency similarity and a text similarity, and the audio matching degree of the target audio segment can be obtained by processing the two.
  • the process of determining the text similarity of the target audio segment may be as follows:
  • the text of the user's voice in the target audio segment, that is, the target recognition text, can be obtained through speech recognition technology; the text of the reference audio segment corresponding to the target audio segment, that is, the reference recognition text, can then be obtained, and the reference recognition text may be pre-stored.
  • the target recognition text is compared with the reference recognition text, and the text similarity is determined according to whether each word and its order are consistent. For example, if the text of the pre-stored reference audio segment is "the weather is fine today" and the text of the user's voice obtained through speech recognition technology is "the weather is fine", the user has not sung the word "today", and the text similarity is 5/7.
  • if the text of the pre-stored reference audio segment is "the weather is fine today" and the text of the user's voice obtained through speech recognition technology is "we are fine today", the user has sung the extra word "we", and the text similarity is 7/9.
  • if the text of the pre-stored reference audio segment is "the weather is fine today" and the text of the user's voice obtained through speech recognition technology is "the sky is fine today", the user has sung the word "qi" incorrectly, and the text similarity is 6/7.
  • the Euclidean distance between the target recognition text and the reference recognition text can also be directly calculated as the similarity.
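  • As a rough, non-authoritative illustration of this comparison, the Python sketch below scores the recognized lyrics against the reference lyrics; difflib's SequenceMatcher is used only as a stand-in for the word-and-order comparison described above, and it will not reproduce the exact ratios of the running example, which count the characters of the original lyrics.

    from difflib import SequenceMatcher

    def text_similarity(target_text: str, reference_text: str) -> float:
        # Similarity in [0, 1] between the target recognition text and the reference recognition text.
        return SequenceMatcher(None, target_text, reference_text).ratio()

    print(text_similarity("the weather is fine", "the weather is fine today"))  # close to 1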
  • the fundamental frequency similarity and the text similarity can then be weighted and combined to obtain the audio matching degree of the target audio segment.
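  • A hedged sketch of such a weighted combination is shown below; the 0.6 and 0.4 weights are assumptions for illustration rather than values taken from the present application.

    def audio_matching_degree(f0_sim: float, text_sim: float,
                              w_f0: float = 0.6, w_text: float = 0.4) -> float:
        # Weighted combination of fundamental-frequency similarity and text similarity.
        return w_f0 * f0_sim + w_text * text_sim

    print(round(audio_matching_degree(0.8, 0.71), 3))  # 0.764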
  • with the solution described in the embodiment of the present application, the target audio and the target video can be obtained; the total human action matching degree score of the target video relative to the reference video is determined based on the human action matching degree of each target image frame in the target video relative to the corresponding reference image frame in the reference video; the total audio matching degree score of the target audio relative to the reference audio is determined based on the audio matching degree of each target audio segment in the target audio relative to the corresponding reference audio segment in the reference audio; and the comprehensive classification result is then determined based on the total human action matching degree score and the total audio matching degree score.
  • the comprehensive classification result can reflect the overall quality of the imitation of both the video and the audio. Therefore, the embodiment of the present application provides a method that can classify imitations of singing and dancing videos.
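  • The excerpt does not fix how the two total scores are combined into the comprehensive classification result, so the following Python sketch is only an assumption for illustration: equal weighting of the two totals and arbitrary grade thresholds.

    def comprehensive_result(action_total: float, audio_total: float) -> str:
        # Assumed rule: average the two total scores and map the average to a grade.
        combined = (action_total + audio_total) / 2.0
        if combined >= 90:
            return "S"
        if combined >= 75:
            return "A"
        if combined >= 60:
            return "B"
        return "C"

    print(comprehensive_result(88.0, 76.5))  # "A"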
  • An embodiment of the present application provides an apparatus for classifying videos, and the apparatus may be the computer equipment in the foregoing embodiment. As shown in FIG. 7 , the apparatus includes:
  • an acquisition module 710 configured to acquire target audio and corresponding target video including human actions
  • the video determination module 720 is configured to determine the total human action matching degree score of the target video relative to the reference video based on the human action matching degree of each target image frame in the target video relative to the corresponding reference image frame in the reference video;
  • the audio determination module 730 is configured to determine the total audio matching score of the target audio relative to the reference audio based on the audio matching degree of each target audio segment in the target audio relative to the corresponding reference audio segment in the reference audio , wherein the reference audio is the audio corresponding to the reference video;
  • the comprehensive determination module 740 is configured to determine a comprehensive classification result based on the total human action matching degree score and the audio total matching degree score.
  • the device further includes:
  • a first determination module configured to determine the human action matching degree score corresponding to the human action matching degree of the target image frame relative to the reference image frame, and determine the audio matching of the target audio segment relative to the reference audio segment The audio matching score corresponding to the degree;
  • the first display module is used for displaying the human action matching degree score curve based on the human body action matching degree score corresponding to each target image frame, and displaying the audio matching degree scoring curve based on the audio matching degree score corresponding to each target audio segment.
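  • A minimal plotting sketch of the two score curves handled by the first display module, assuming matplotlib is installed; the per-frame and per-segment scores below are made-up sample data.

    import matplotlib.pyplot as plt

    action_scores = [72, 80, 85, 78, 90, 88]  # human action matching degree score per target image frame
    audio_scores = [65, 70, 82, 79, 84]       # audio matching degree score per target audio segment

    plt.plot(action_scores, label="human action matching degree score")
    plt.plot(audio_scores, label="audio matching degree score")
    plt.xlabel("frame / segment index")
    plt.ylabel("score")
    plt.legend()
    plt.show()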
  • the device further includes:
  • a second display module configured to display the time axis corresponding to the target video and the target audio
  • the human action matching degree score of the target image frame corresponding to the target time point and the audio matching degree score of the target audio segment corresponding to the target time point are displayed.
  • the device further includes:
  • a first determining module configured to determine a human motion matching degree score corresponding to the target image frame based on the human motion matching degree of the target image frame relative to the reference image frame;
  • the first adding module is configured to add the human action matching degree score in the form of an image at a position corresponding to the target image frame in the target video.
  • the device further includes:
  • a first determining module configured to determine an audio matching degree score corresponding to the target audio segment based on the audio matching degree of the target audio segment relative to the reference audio segment;
  • the second adding module is configured to add the audio matching degree score in the form of an image at a position corresponding to the target audio segment in the target video.
  • the first determining module is further used for:
  • the plurality of human body key points include preset reference human body key points and non-reference human body key points, and the first determination module is used for:
  • for each non-reference human body key point, determine, in the target image frame, the first connection line between the non-reference human body key point and the reference human body key point based on the position of the non-reference human body key point and the position of the reference human body key point; obtain the second connection line between the non-reference human body key point and the reference human body key point in the reference image frame; and determine the included angle between the first connection line and the second connection line.
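  • A small Python sketch of this angle determination, assuming 2-D key-point coordinates are available for both frames; the point values in the example are illustrative only.

    import math

    def included_angle(ref_point_target, point_target, ref_point_ref, point_ref) -> float:
        # Angle in degrees between the target-frame connection line and the reference-frame connection line.
        v1 = (point_target[0] - ref_point_target[0], point_target[1] - ref_point_target[1])
        v2 = (point_ref[0] - ref_point_ref[0], point_ref[1] - ref_point_ref[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        cos_a = dot / (math.hypot(*v1) * math.hypot(*v2))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))

    # Example: a wrist key point relative to the same reference key point in both frames.
    print(round(included_angle((0, 0), (1, 0), (0, 0), (1, 1)), 1))  # 45.0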
  • the first determining module is further used for:
  • the first determining module is further used for:
  • each determined included angle is processed, and the corresponding processing result value of each included angle is obtained;
  • the matching degree of the human motion of the target image frame with respect to the reference image frame is determined.
  • the first determining module is further used for:
  • the first determining module is used for:
  • the proportion of the number of target audio frames whose fundamental frequency difference is within the preset range to the total number of frames in the target audio segment is used as the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio.
  • the first determining module is further used for:
  • Determining the audio matching degree of the target audio segment relative to the corresponding reference audio segment based on the fundamental frequency similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio includes:
  • the audio matching degree of the target audio segment with respect to the corresponding reference audio segment is determined.
  • the first determining module is used for:
  • the similarity between the target recognition text and the reference recognition text is taken as the text similarity of the target audio segment relative to the corresponding reference audio segment in the reference audio.
  • with the solution described in the embodiment of the present application, the target audio and the target video can be obtained; the total human action matching degree score of the target video relative to the reference video is determined based on the human action matching degree of each target image frame in the target video relative to the corresponding reference image frame in the reference video; the total audio matching degree score of the target audio relative to the reference audio is determined based on the audio matching degree of each target audio segment in the target audio relative to the corresponding reference audio segment in the reference audio; and the comprehensive classification result is then determined based on the total human action matching degree score and the total audio matching degree score.
  • the comprehensive classification result can reflect the overall quality of the imitation of both the video and the audio. Therefore, the embodiment of the present application provides a method that can classify imitations of singing and dancing videos.
  • when the video classification apparatus provided in the above embodiment classifies videos, the division into the above functional modules is used only as an example for illustration. In practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • the video classification apparatus and the video classification method embodiments provided by the above embodiments belong to the same concept, and the specific implementation process thereof is detailed in the method embodiments, which will not be repeated here.
  • FIG. 8 shows a structural block diagram of a terminal 800 provided by an exemplary embodiment of the present application.
  • the terminal may be the computer device in the above embodiment.
  • the terminal 800 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer.
  • Terminal 800 may also be called user equipment, portable terminal, laptop terminal, desktop terminal, and the like by other names.
  • the terminal 800 includes: a processor 801 and a memory 802 .
  • the processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 801 can be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array).
  • the processor 801 may also include a main processor and a coprocessor.
  • the main processor is a processor used to process data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor used to process data in a standby state.
  • the processor 801 may be integrated with a GPU (Graphics Processing Unit), and the GPU is used for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 801 may further include an AI (Artificial Intelligence, artificial intelligence) processor, where the AI processor is used to process computing operations related to machine learning.
  • Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 802 is used to store at least one instruction, and the at least one instruction is executed by the processor 801 to implement the video classification method provided by the method embodiments of this application.
  • the terminal 800 may optionally further include: a peripheral device interface 803 and at least one peripheral device.
  • the processor 801, the memory 802 and the peripheral device interface 803 can be connected by a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 803 through a bus, a signal line or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 804 , a display screen 805 , a camera 806 , an audio circuit 807 , a positioning component 808 and a power supply 809 .
  • the peripheral device interface 803 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 801 and the memory 802 .
  • in some embodiments, the processor 801, the memory 802, and the peripheral device interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral device interface 803 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 804 is used for receiving and transmitting RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals.
  • the radio frequency circuit 804 communicates with the communication network and other communication devices via electromagnetic signals.
  • the radio frequency circuit 804 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 804 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like.
  • the radio frequency circuit 804 may communicate with other terminals through at least one wireless communication protocol.
  • the wireless communication protocols include, but are not limited to, metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G and 5G), wireless local area networks and/or WiFi (Wireless Fidelity, wireless fidelity) networks.
  • the radio frequency circuit 804 may further include a circuit related to NFC (Near Field Communication, short-range wireless communication), which is not limited in this application.
  • the display screen 805 is used to display UI (User Interface, user interface).
  • the UI can include graphics, text, icons, video, and any combination thereof.
  • the display screen 805 also has the ability to acquire touch signals on or above the surface of the display screen 805 .
  • the touch signal can be input to the processor 801 as a control signal for processing.
  • the display screen 805 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards.
  • in some embodiments, there is one display screen 805, which is provided on the front panel of the terminal 800; in other embodiments, there may be at least two display screens 805, which are respectively arranged on different surfaces of the terminal 800 or adopt a folded design; in still other embodiments, the display screen 805 may be a flexible display screen disposed on a curved or folding surface of the terminal 800. The display screen 805 can even be set as a non-rectangular irregular figure, that is, a special-shaped screen.
  • the display screen 805 can be prepared by using materials such as LCD (Liquid Crystal Display, liquid crystal display), OLED (Organic Light-Emitting Diode, organic light emitting diode).
  • the camera assembly 806 is used to capture images or video.
  • the camera assembly 806 includes a front camera and a rear camera.
  • the front camera is set on the front panel of the terminal, and the rear camera is set on the back of the terminal.
  • in some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blur function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions.
  • camera assembly 806 may also include a flash.
  • the flash can be a single color temperature flash or a dual color temperature flash. Dual color temperature flash refers to the combination of warm light flash and cold light flash, which can be used for light compensation under different color temperatures.
  • Audio circuitry 807 may include a microphone and speakers.
  • the microphone is used to collect the sound waves of the user and the environment, convert the sound waves into electrical signals, and input them to the processor 801 for processing, or to the radio frequency circuit 804 to realize voice communication.
  • the microphone may also be an array microphone or an omnidirectional collection microphone.
  • the speaker is used to convert the electrical signal from the processor 801 or the radio frequency circuit 804 into sound waves.
  • the loudspeaker can be a traditional thin-film loudspeaker or a piezoelectric ceramic loudspeaker.
  • audio circuitry 807 may also include a headphone jack.
  • the positioning component 808 is used to locate the current geographic location of the terminal 800 to implement navigation or LBS (Location Based Service, location-based service).
  • the positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
  • the power supply 809 is used to power various components in the terminal 800 .
  • the power source 809 may be alternating current, direct current, disposable batteries or rechargeable batteries.
  • the rechargeable battery can support wired charging or wireless charging.
  • the rechargeable battery can also be used to support fast charging technology.
  • terminal 800 also includes one or more sensors 810 .
  • the one or more sensors 810 include, but are not limited to, an acceleration sensor 811 , a gyro sensor 812 , a pressure sensor 813 , a fingerprint sensor 814 , an optical sensor 815 , and a proximity sensor 816 .
  • the acceleration sensor 811 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established by the terminal 800 .
  • the acceleration sensor 811 can be used to detect the components of the gravitational acceleration on the three coordinate axes.
  • the processor 801 can control the display screen 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811 .
  • the acceleration sensor 811 can also be used for game or user movement data collection.
  • the gyroscope sensor 812 can detect the body direction and rotation angle of the terminal 800 , and the gyroscope sensor 812 can cooperate with the acceleration sensor 811 to collect 3D actions of the user on the terminal 800 .
  • the processor 801 can implement the following functions according to the data collected by the gyroscope sensor 812 : motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
  • the pressure sensor 813 may be disposed on the side frame of the terminal 800 and/or the lower layer of the display screen 805 .
  • the processor 801 can perform left and right hand identification or shortcut operations according to the holding signal collected by the pressure sensor 813.
  • the processor 801 controls the operability controls on the UI interface according to the user's pressure operation on the display screen 805 .
  • the operability controls include at least one of button controls, scroll bar controls, icon controls, and menu controls.
  • the fingerprint sensor 814 is used to collect the user's fingerprint, and the processor 801 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 814 , or the fingerprint sensor 814 identifies the user's identity according to the collected fingerprint. When the user's identity is identified as a trusted identity, the processor 801 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings.
  • the fingerprint sensor 814 may be provided on the front, back or side of the terminal 800 . When the terminal 800 is provided with physical buttons or a manufacturer's logo, the fingerprint sensor 814 may be integrated with the physical buttons or the manufacturer's logo.
  • Optical sensor 815 is used to collect ambient light intensity.
  • the processor 801 may control the display brightness of the display screen 805 according to the ambient light intensity collected by the optical sensor 815 . Specifically, when the ambient light intensity is high, the display brightness of the display screen 805 is increased; when the ambient light intensity is low, the display brightness of the display screen 805 is decreased.
  • the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 according to the ambient light intensity collected by the optical sensor 815 .
  • a proximity sensor 816 also called a distance sensor, is usually provided on the front panel of the terminal 800 .
  • the proximity sensor 816 is used to collect the distance between the user and the front of the terminal 800 .
  • when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually decreases, the processor 801 controls the display screen 805 to switch from the bright-screen state to the off-screen state; when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually increases, the processor 801 controls the display screen 805 to switch from the off-screen state to the bright-screen state.
  • the structure shown in FIG. 8 does not constitute a limitation on the terminal 800; the terminal may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • FIG. 9 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 900 may vary greatly due to different configurations or performance, and may include one or more processors (central processing units, CPUs) 901 and one or more memories 902, where the memory 902 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 901 to implement the methods provided by the foregoing method embodiments.
  • the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing device functions, which will not be described here.
  • a computer-readable storage medium, such as a memory including instructions, is also provided, and the instructions can be executed by a processor in the terminal to complete the video classification method in the foregoing embodiments.
  • the computer-readable storage medium may be non-transitory.
  • the computer-readable storage medium may be ROM (Read-Only Memory, read-only memory), RAM (Random Access Memory, random access memory), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The present application discloses a video classification method and apparatus, belonging to the technical field of data processing. The method comprises the steps of: acquiring target audio and a corresponding target video comprising human body actions; determining, based on a human body action matching degree of each target frame in the target video relative to a corresponding reference frame in a reference video, a total human body action matching degree score of the target video relative to the reference video; determining, based on an audio matching degree of each target audio segment in the target audio relative to a corresponding reference audio segment in a reference audio of the reference video, a total audio matching degree score of the target audio relative to the reference audio; and determining a comprehensive classification result based on the total human body action matching degree score and the total audio matching degree. The present application provides a method for classifying singing and dancing videos.
PCT/CN2021/125750 2020-11-26 2021-10-22 Procédé et appareil de classement de vidéos WO2022111168A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011350031.1 2020-11-26
CN202011350031.1A CN112487940B (zh) 2020-11-26 2020-11-26 视频的分类方法和装置

Publications (1)

Publication Number Publication Date
WO2022111168A1 true WO2022111168A1 (fr) 2022-06-02

Family

ID=74935277

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/125750 WO2022111168A1 (fr) 2020-11-26 2021-10-22 Procédé et appareil de classement de vidéos

Country Status (2)

Country Link
CN (1) CN112487940B (fr)
WO (1) WO2022111168A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115115822A (zh) * 2022-06-30 2022-09-27 小米汽车科技有限公司 车端图像处理方法、装置、车辆、存储介质及芯片

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487940B (zh) * 2020-11-26 2023-02-28 腾讯音乐娱乐科技(深圳)有限公司 视频的分类方法和装置
CN113596353B (zh) * 2021-08-10 2024-06-14 广州艾美网络科技有限公司 体感互动数据处理方法、装置及体感互动设备
CN114513694A (zh) * 2022-02-17 2022-05-17 平安国际智慧城市科技股份有限公司 评分确定方法、装置、电子设备和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013141522A1 (fr) * 2012-03-20 2013-09-26 Young Dae Kim Jeu de karaoké et de danse
CN107943291A (zh) * 2017-11-23 2018-04-20 乐蜜有限公司 人体动作的识别方法、装置和电子设备
CN109887524A (zh) * 2019-01-17 2019-06-14 深圳壹账通智能科技有限公司 一种演唱评分方法、装置、计算机设备及存储介质
CN112487940A (zh) * 2020-11-26 2021-03-12 腾讯音乐娱乐科技(深圳)有限公司 视频的分类方法和装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106970950B (zh) * 2017-03-07 2021-08-24 腾讯音乐娱乐(深圳)有限公司 相似音频数据的查找方法及装置
CN111081277B (zh) * 2019-12-19 2022-07-12 广州酷狗计算机科技有限公司 音频测评的方法、装置、设备及存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013141522A1 (fr) * 2012-03-20 2013-09-26 Young Dae Kim Jeu de karaoké et de danse
CN107943291A (zh) * 2017-11-23 2018-04-20 乐蜜有限公司 人体动作的识别方法、装置和电子设备
CN109887524A (zh) * 2019-01-17 2019-06-14 深圳壹账通智能科技有限公司 一种演唱评分方法、装置、计算机设备及存储介质
CN112487940A (zh) * 2020-11-26 2021-03-12 腾讯音乐娱乐科技(深圳)有限公司 视频的分类方法和装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115115822A (zh) * 2022-06-30 2022-09-27 小米汽车科技有限公司 车端图像处理方法、装置、车辆、存储介质及芯片
CN115115822B (zh) * 2022-06-30 2023-10-31 小米汽车科技有限公司 车端图像处理方法、装置、车辆、存储介质及芯片

Also Published As

Publication number Publication date
CN112487940A (zh) 2021-03-12
CN112487940B (zh) 2023-02-28

Similar Documents

Publication Publication Date Title
CN108008930B (zh) 确定k歌分值的方法和装置
WO2022111168A1 (fr) Procédé et appareil de classement de vidéos
WO2020103550A1 (fr) Procédé et appareil de notation de signal audio, dispositif terminal et support de stockage informatique
WO2019128593A1 (fr) Procédé et dispositif de recherche de son
CN110956971B (zh) 音频处理方法、装置、终端及存储介质
CN110688082B (zh) 确定音量的调节比例信息的方法、装置、设备及存储介质
CN111048111B (zh) 检测音频的节奏点的方法、装置、设备及可读存储介质
CN111061405B (zh) 录制歌曲音频的方法、装置、设备及存储介质
CN109192223B (zh) 音频对齐的方法和装置
CN111081277B (zh) 音频测评的方法、装置、设备及存储介质
CN111428079B (zh) 文本内容处理方法、装置、计算机设备及存储介质
CN111625682A (zh) 视频的生成方法、装置、计算机设备及存储介质
CN111276122A (zh) 音频生成方法及装置、存储介质
CN110867194B (zh) 音频的评分方法、装置、设备及存储介质
CN113420177A (zh) 音频数据处理方法、装置、计算机设备及存储介质
CN111368136A (zh) 歌曲识别方法、装置、电子设备及存储介质
CN112086102B (zh) 扩展音频频带的方法、装置、设备以及存储介质
CN112667844A (zh) 检索音频的方法、装置、设备和存储介质
CN109003627B (zh) 确定音频得分的方法、装置、终端及存储介质
CN112118482A (zh) 音频文件的播放方法、装置、终端及存储介质
CN109036463B (zh) 获取歌曲的难度信息的方法、装置及存储介质
CN108831423B (zh) 提取音频数据中主旋律音轨的方法、装置、终端及存储介质
CN112992107B (zh) 训练声学转换模型的方法、终端及存储介质
CN111063372B (zh) 确定音高特征的方法、装置、设备及存储介质
CN111063364A (zh) 生成音频的方法、装置、计算机设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896670

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.09.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21896670

Country of ref document: EP

Kind code of ref document: A1