CN113099305A - Play control method and device - Google Patents

Play control method and device

Info

Publication number
CN113099305A
CN113099305A
Authority
CN
China
Prior art keywords
user
signal
sampling
sampling segment
state information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110404946.4A
Other languages
Chinese (zh)
Inventor
张奕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd filed Critical Shanghai Bilibili Technology Co Ltd
Priority to CN202110404946.4A
Publication of CN113099305A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47: End-user applications
    • H04N 21/472: End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N 21/47202: End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for requesting content on demand, e.g. video on demand
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439: Processing of audio elementary streams
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83: Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845: Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456: Structuring of content by decomposing the content in the time domain, e.g. in time segments

Abstract

An embodiment of the present application provides a play control method and apparatus. The play control method includes: collecting a to-be-processed signal of a user; sampling the signal at a preset time interval to generate at least one sampling segment; inputting the at least one sampling segment into a signal processing model for classification to generate user state information of the user in at least one dimension; determining a play control instruction corresponding to a target multimedia resource according to the user state information in the at least one dimension; and adjusting the play state of the target multimedia resource according to the play control instruction.

Description

Play control method and device
Technical Field
Embodiments of the present application relate to the technical field of signal processing, and in particular to a play control method. One or more embodiments of the present application also relate to a play control apparatus, a computing device, and a computer-readable storage medium.
Background
With the continuous development of information technology, intelligent terminals offer ever more functions and come in ever more varieties, and terminals with screens of different sizes are now available. Touch-screen mobile phones in particular have developed rapidly: phones of the same or different brands can be installed with a variety of application programs, through which users meet everyday needs such as browsing news or watching videos of various categories.
Currently, a video playing application generally places play-adjustment functions such as volume and progress control as buttons in a region below the playing interface. To adjust the play state of a multimedia resource, such as starting or pausing playback, the user must touch these buttons on the playing interface. This approach requires the user to perform active, conscious control actions to exercise play control, which affects the user's viewing experience.
Disclosure of Invention
In view of this, an embodiment of the present application provides a play control method. One or more embodiments of the present application also relate to a play control apparatus, a computing device, and a computer-readable storage medium, so as to address a technical defect of prior-art audio/video play control methods: if a user lacks the ability or awareness to actively control audio/video playback, play control cannot be performed.
According to a first aspect of an embodiment of the present application, there is provided a playback control method, including:
acquiring a signal to be processed of a user, and sampling the signal to be processed according to a preset time interval to generate at least one sampling segment;
inputting the at least one sampling segment into a signal processing model for classification, and generating user state information of the user in at least one dimension;
determining a playing control instruction corresponding to the target multimedia resource according to the user state information of at least one dimension;
and adjusting the playing state of the target multimedia resource according to the playing control instruction.
According to a second aspect of embodiments of the present application, there is provided a playback control apparatus including:
a sampling module configured to collect a to-be-processed signal of a user and sample the signal at a preset time interval to generate at least one sampling segment;
a classification module configured to input the at least one sampling segment into a signal processing model for classification, generating user state information of the user in at least one dimension;
the determining module is configured to determine a playing control instruction corresponding to the target multimedia resource according to the user state information of the at least one dimension;
and the adjusting module is configured to adjust the playing state of the target multimedia resource according to the playing control instruction.
According to a third aspect of embodiments of the present application, there is provided a computing device, including:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, wherein the processor implements the steps of the playback control method when executing the computer-executable instructions.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the playback control method.
An embodiment of the present application realizes a play control method and apparatus. The play control method includes: collecting a to-be-processed signal of a user; sampling the signal at a preset time interval to generate at least one sampling segment; inputting the at least one sampling segment into a signal processing model for classification to generate user state information of the user in at least one dimension; determining a play control instruction corresponding to a target multimedia resource according to the user state information in the at least one dimension; and adjusting the play state of the target multimedia resource according to the play control instruction.
By collecting a to-be-processed signal of the user, analyzing the user's state information in multiple dimensions from that signal, and comprehensively deciding the play control instruction for the target multimedia resource based on the multi-dimensional user state information obtained from the analysis, the method automatically controls the play state of the target multimedia resource and thereby helps improve the user's viewing experience.
Drawings
Fig. 1 is a system architecture diagram of a play control method according to an embodiment of the present application;
fig. 2 is a flowchart of a play control method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a semantic category generation process provided by one embodiment of the present application;
fig. 4 is a flowchart of the playback control method in the sleep monitoring field according to an embodiment of the present application;
fig. 5 is a flowchart of the playing control method in the field of interactive video lessons according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a playback control apparatus according to an embodiment of the present application;
fig. 7 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments of the present application to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
In the present application, a playback control method is provided. One or more embodiments of the present application are also related to a playback control apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
The playing control method provided by the embodiment of the application can be applied to any field needing to control the playing state of multimedia resources, such as the control of the playing state of video resources in the video field, the control of the playing state of audio resources in the audio field, the control of the playing state of voice conversations in the communication field, the control of the playing state of voice messages in the media field, and the like; for convenience of understanding, the embodiment of the present application takes the application of the playback control method to control the playback state of a video resource in the video field as an example, but is not limited to this.
When the play control method is applied, as in this example, to controlling the play state of a video resource in the video field, the target multimedia resource in the play control method can be understood as that video resource.
In specific implementations, the target multimedia resource of the embodiments of the present application may be presented on display terminals such as large-scale video playing devices, game consoles, desktop computers, smart phones, tablet computers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptops, and e-book readers.
Referring to fig. 1, fig. 1 is a diagram illustrating an architecture of a play control method according to an embodiment of the present application.
In fig. 1, the architecture of the play control method mainly includes an audio/video playing device, a multi-modal signal acquisition device, a multi-modal signal modeling system, and a play control decision system. Multi-modal signal feedback is collected while a viewer watches a video, and a model from this feedback to a play control decision is established, thereby realizing play control of the video.
The multi-modal signals include visual signals, audio signals, and the like. The multi-modal signal acquisition device collects audio and visual signals while the viewer watches the video; the collected signals serve as the input of the multi-modal modeling, so as to analyze the viewer's feedback during viewing.
The multi-modal signal modeling system is used for modeling, analyzing and identifying signals such as audio, vision and the like, performing necessary fusion and extracting relevant feedback information. The visual signals can be used for content modeling of expressions, actions and the like in multiple dimensions, and the audio signals can be used for voice recognition, other sound detection classification and the like.
Audio and visual signals are continuous features, and semantic content in multiple dimensions such as expression and action is likewise continuous. The embodiment of the present application therefore monitors continuous features in the time domain during modeling so as to achieve time-series modeling, which can be realized through the following steps:
the method comprises the following steps: sampling audio and visual signals at fixed time intervals, and extracting the characteristics of audio and visual signal sampling segments;
step two: establishing a time sequence model based on the characteristics of the audio and visual signal sampling segments, and calculating the characteristic association degree between nodes on the time sequence;
after the model is built, the distance measurement between the two sequence diagrams is defined, and the matching degree between the semantic content corresponding to the actual audio and visual signal segments and the specific semantic content is obtained by calculating the distance between the actual sequence diagram and the specific semantic content sequence diagram.
And the playing control decision system decides the action of controlling the audio and video playing through comprehensive decision according to the output result of each model. The decision reasoning conditions in the comprehensive decision can be freely combined according to the requirements of users, and the output results of the models and the playing action form various reasoning combinations to meet the requirements of different application scenes.
Referring to fig. 2, fig. 2 is a flowchart illustrating a play control method according to an embodiment of the present application, including the following steps:
step 202, collecting a signal to be processed of a user, and sampling the signal to be processed according to a preset time interval to generate at least one sampling segment.
Audio/video play control is a basic function in the video field. The conventional approach is to control play, pause, fast-forward, rewind, stop, and similar functions through physical or virtual touch keys. In recent years, some methods that perform play control through voice recognition have appeared. A common characteristic of all these methods is that the user must perform an active, conscious control action, such as touching a key or issuing a voice instruction, to control audio/video playback; if the user lacks the ability or awareness to do so, play control cannot be realized, which affects the user's viewing experience.
Based on this, in the embodiment of the application, a signal to be processed of a user is collected and sampled according to a preset time interval, so as to generate at least one sampling segment, the at least one sampling segment is input into a signal processing model for classification, so as to generate user state information of the user in at least one dimension, a play control instruction corresponding to a target multimedia resource is determined according to the user state information of the at least one dimension, and a play state of the target multimedia resource is adjusted according to the play control instruction.
Wherein the signal to be processed comprises an audio signal and/or a visual signal; multimedia assets include, but are not limited to, audio assets, video assets, and the like.
The embodiment of the application can be provided with a signal acquisition module in the multimedia resource playing device, wherein the signal acquisition module comprises but is not limited to audio and video monitoring acquisition equipment, such as a visible light or infrared camera and the like, and the signal acquisition module is utilized to acquire audio signals and/or visual signals of a user and the like.
After the audio signal and/or the visual signal of the user are collected, the audio signal and/or the visual signal can be sampled according to a preset time interval, at least one sampling segment is generated, and user state information of the user is determined based on the at least one sampling segment.
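The fixed-interval sampling described above can be sketched as follows. The frame rate and segment length are illustrative assumptions, not values given in the application:

```python
def split_into_segments(frames, frames_per_segment):
    """Group consecutive signal frames into fixed-length sampling segments."""
    segments = []
    for start in range(0, len(frames), frames_per_segment):
        segment = frames[start:start + frames_per_segment]
        if len(segment) == frames_per_segment:  # drop an incomplete tail
            segments.append(segment)
    return segments

# e.g. a 1-second preset interval at an assumed 30 frames per second
frames = list(range(95))                # stand-in for collected signal frames
segments = split_into_segments(frames, 30)
```

Each resulting segment is then passed to the signal processing model as one classification input.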
Step 204, inputting the at least one sampling segment into a signal processing model for classification, and generating user state information of the user in at least one dimension.
Specifically, the user state information includes, but is not limited to, state information of a user, such as expressions, actions, and sounds, where the expressions, actions, and sounds are user states belonging to three dimensions.
After the at least one sampling segment is input into the signal processing model, different signal processing sub-modules in the signal processing model can respectively process the sampling segments with different dimensions so as to generate user state information of the user in different dimensions.
Wherein the signal processing model is trained by:
acquiring a historical sample signal of a user, and sampling the historical sample signal according to a preset time interval to generate at least one sampling segment;
determining user state information corresponding to the at least one sampling segment;
and taking the at least one sampling segment as a training sample, taking user state information corresponding to the at least one sampling segment as a sample label, and inputting a signal processing model to be trained for training to obtain the signal processing model.
Specifically, the historical sample signal includes an audio signal and/or a visual signal, and the state information of the user includes, but is not limited to, state information of sound, motion, expression, and the like of the user, specifically speaking state, crying state, moving state, sleeping state, and the like.
The method comprises the steps of obtaining historical sample signals, sampling the historical sample signals according to preset time intervals to generate corresponding sampling segments, determining user state information corresponding to each sampling segment, using the sampling segments as training samples, using the user state information corresponding to the sampling segments as sample labels, inputting a signal processing model to be trained, and training to obtain the signal processing model.
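The training-set construction above can be sketched as follows; each sampling segment becomes a training sample, and its user-state information becomes the sample label. The `labeler` function is an assumption standing in for the annotation step, and the threshold it uses is purely illustrative:

```python
def build_training_set(history_signal, interval, labeler):
    """Cut a historical sample signal into fixed-interval segments and
    pair each segment with its user-state label."""
    pairs = []
    for start in range(0, len(history_signal) - interval + 1, interval):
        segment = history_signal[start:start + interval]
        pairs.append((segment, labeler(segment)))  # (sample, sample label)
    return pairs

# toy example: label a segment "sleeping" when its mean amplitude is low
signal = [0.1] * 10 + [0.9] * 10
pairs = build_training_set(
    signal, 5,
    labeler=lambda seg: "sleeping" if sum(seg) / len(seg) < 0.5 else "moving",
)
```

The `(segment, label)` pairs would then be fed to the signal processing model to be trained.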
In addition, since the historical sample signal may include a visual signal and/or an audio signal and is continuous in time, the embodiment of the present application may monitor the continuous characteristics of the historical sample signal in the time domain during modeling of the signal processing model, and build the time-series graph data model according to the monitoring result.
The time-series model is a dynamic network whose graph structure changes and evolves over time; the attributes and number of nodes and edges in the network change with time.
In this embodiment, if the historical sample signal is a visual signal, a visual node graph model is constructed from the temporally consecutive signal frames in the visual signal. Each node in the model represents one signal frame, and connections between nodes are established according to the temporal order of the frames. The model records all changes of the signal frames in the visual signal; specifically, each connecting edge is marked with multiple timestamps to represent the changes of that edge over time.
After training of the signal processing model is completed, the at least one sampling segment is input into the model for classification to generate user state information in at least one dimension. Specifically, the signal processing model calculates the similarity between the at least one sampling segment and sample sampling segments, and determines the user state information of the user in at least one dimension according to the similarity.
Further, calculating the similarity between the at least one sampling segment and the sample sampling segment includes:
extracting the characteristics of signal frames in a target sampling segment, and calculating the characteristic correlation degree between adjacent signal frames in the target sampling segment, wherein the target sampling segment is one of the at least one sampling segment;
determining a first feature vector corresponding to the target sampling segment according to the feature association degree;
calculating the distance between the first feature vector and a second feature vector corresponding to the sample sampling segment;
and determining the similarity between the target sampling segment and the sample sampling segment according to the distance.
Specifically, after the to-be-processed signal is sampled into at least one sampling segment and the segment is input into the signal processing model, the model calculates the similarity between the at least one sampling segment and sample sampling segments, and determines the user state information in at least one dimension according to the similarity. A sample sampling segment is a segment pre-stored in a database that serves as a reference when determining the user state information of a sampling segment.
The to-be-processed signal is continuous and consists of multiple consecutive signal frames. After the signal is sampled and the resulting sampling segments are input into the signal processing model, the model performs feature extraction on the signal frames in each sampling segment. Since the to-be-processed signal may be a visual signal and/or an audio signal, feature extraction on a signal frame means extracting the image features and/or audio features in that frame. After feature extraction, the feature association degree between adjacent signal frames in each sampling segment is calculated, and the first feature vector corresponding to the target sampling segment is determined according to the feature association degrees.
In practical application, the signal processing model may perform feature extraction on each signal frame in each sampling segment, and the result of the feature extraction may be a feature vector, and set a weight for each signal frame according to a feature association degree between adjacent signal frames in the sampling segment, then perform product operation on the feature vector of each signal frame and the corresponding weight thereof, and then perform summation operation on the operation result to obtain the first feature vector.
After the first feature vector is obtained, the second feature vector of a sample sampling segment can be determined, and the distance between the first and second feature vectors calculated; the similarity between the two can then be determined from the distance. In practice, the similarity is inversely related to the distance: the smaller the distance, the higher the similarity.
In addition, to ensure the accuracy of the similarity calculation, the first feature vector of a sampling segment must be computed in the same way as the second feature vector of a sample sampling segment.
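The similarity computation described in the preceding paragraphs can be sketched as follows. The per-frame features (mean and energy), the association function `1/(1+d)`, and the distance-to-similarity mapping are all assumptions for illustration; only the overall shape, association-weighted frame features summed into a segment vector, then distance between segment vectors, follows the text:

```python
import math

def frame_feature(frame):
    """Assumed per-frame features: mean and energy of the sample values."""
    n = len(frame)
    mean = sum(frame) / n
    energy = sum(x * x for x in frame) / n
    return [mean, energy]

def association(f1, f2):
    # feature association degree between adjacent frames:
    # higher when the features are closer (assumed form)
    return 1.0 / (1.0 + math.dist(f1, f2))

def segment_vector(segment):
    """First feature vector: association-weighted sum of frame features."""
    feats = [frame_feature(fr) for fr in segment]
    weights = []
    for i, f in enumerate(feats):
        w = 0.0
        if i > 0:
            w += association(feats[i - 1], f)
        if i < len(feats) - 1:
            w += association(f, feats[i + 1])
        weights.append(w)
    total = sum(weights) or 1.0
    vec = [0.0] * len(feats[0])
    for f, w in zip(feats, weights):
        for j, v in enumerate(f):
            vec[j] += v * w / total
    return vec

def similarity(seg_a, seg_b):
    # similarity is inversely related to the feature-vector distance
    d = math.dist(segment_vector(seg_a), segment_vector(seg_b))
    return 1.0 / (1.0 + d)

seg = [[0.1, 0.2, 0.3], [0.1, 0.2, 0.4], [0.2, 0.3, 0.4]]
```

Note that both segments pass through the same `segment_vector` computation, matching the consistency requirement stated above.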
Further, the user state information in at least one dimension is determined according to the similarity. Specifically, the sample sampling segments are screened according to the similarity, and the sample label corresponding to each sample sampling segment in the screening result is taken as the user state information of the user in the dimension corresponding to the target sampling segment.
Specifically, to ensure the accuracy of the determined user state information, after the signal processing model calculates the similarity between the at least one sampling segment and the sample sampling segments, the sample sampling segments can be screened according to the similarity to obtain the target sample sampling segments whose similarity to the sampling segment exceeds a preset similarity threshold; the sample label corresponding to such a target sample sampling segment is then taken as the user state information in the dimension corresponding to the sampling segment.
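The screening step above can be sketched as follows: keep only sample segments whose similarity exceeds the preset threshold, then take the best-matching sample's label as the user state. The threshold value and the tie-breaking rule (take the highest-similarity candidate) are illustrative assumptions:

```python
def pick_user_state(scored_samples, threshold=0.8):
    """scored_samples: list of (sample_label, similarity) pairs."""
    candidates = [(label, sim) for label, sim in scored_samples
                  if sim > threshold]
    if not candidates:
        return None  # no sample segment is similar enough
    # return the label of the most similar surviving sample segment
    return max(candidates, key=lambda pair: pair[1])[0]

state = pick_user_state([("sleeping", 0.93), ("crying", 0.41),
                         ("speaking", 0.85)])
```

Here `state` becomes the user state information in the dimension corresponding to the input sampling segment.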
In addition, as mentioned above, after the at least one sampling segment is input into the signal processing model, different signal processing sub-modules in the model may process sampling segments of different dimensions. Therefore, when the to-be-processed signal contains an audio signal, inputting the at least one sampling segment into the signal processing model for classification and generating user state information of the user in at least one dimension can be implemented as follows:
inputting at least one audio signal sampling segment into an audio signal processing submodule of the signal processing model for classification, and generating a corresponding classification result;
and screening the classification result to obtain the sound state information of the user in the sound dimension.
Specifically, in this embodiment of the application, the signal processing model includes an audio signal processing sub-module, which is configured to classify sampling segments of the audio signal and generate user state information of a user in a sound dimension. Therefore, under the condition that the signal to be processed contains the audio signal, the audio signal can be sampled to generate a corresponding audio signal sampling segment, and the audio signal sampling segment is input into the audio signal processing submodule to be classified to generate a corresponding classification result.
In order to ensure the usability of the classification result output by the signal processing model, before the result is output, the classification result can be screened to obtain the sound state information of the user in the sound dimension.
For example, when the play control method is applied to a sleep monitoring scenario, after the signal acquisition device collects an audio signal, the play state of the target multimedia resource is determined from the user's sound state information; for instance, playback of the target multimedia resource is paused when the user is determined to be asleep. However, the collected audio signal may contain environmental noise. If that noise is not removed and the corresponding classification result is output directly, the accuracy of the play control instruction determined from that result may be affected.
Or, under the condition that the signal to be processed includes a visual signal, inputting the at least one sampling segment into a signal processing model for classification, and generating user state information of the user in at least one dimension, specifically: and inputting at least one visual signal sampling segment into at least one visual signal processing submodule of the signal processing model for classification, and generating visual state information of the user in at least one visual dimension.
Specifically, in this embodiment of the application, the signal processing model further includes a visual signal processing sub-module, which is configured to classify sampling segments of the visual signal and generate user state information of the user in the visual dimension. Therefore, under the condition that the signals to be processed contain visual signals, the visual signals can be sampled to generate corresponding visual signal sampling segments, and the visual signal sampling segments are input into the visual signal processing submodule to be classified to generate corresponding classification results.
And step 206, determining a playing control instruction corresponding to the target multimedia resource according to the user state information of the at least one dimension.
In specific implementation, determining a play control instruction corresponding to the target multimedia resource according to the user state information of the at least one dimension includes:
determining a resource type corresponding to the target multimedia resource, and combining the user state information of at least one dimension according to the resource type;
and determining the playing control instruction corresponding to the target multimedia resource according to the combination result and the mapping relation between the pre-configured resource type and the playing control instruction.
Specifically, the resource types include, but are not limited to, the audio type and the video type; that is, the multimedia resource may be an audio resource or a video resource. Because the play control instructions available for different types of multimedia resources may differ, the play control instruction corresponding to the same user state information may also differ across different types of multimedia resources.
In addition, since the user state information includes, but is not limited to, state information such as the expression, motion, and sound of the user, a corresponding semantic category may be determined according to the user state information. For example, if the user state is motion, the semantic category determined from the user state may be moving; similarly, if the user state is sound, the semantic category determined from the user state may be speaking. After the semantic categories are determined according to the user states, the semantic categories can be combined, and the corresponding play control instruction for each type of multimedia resource is determined according to the combination result.
For example, if the resource type is the audio type, the play control instructions corresponding to the audio resource may include, but are not limited to, start playing, pause playing, play the next episode, and the like; if the resource type is the video type, fast-forward play, double-speed play, and the like may be included in addition to the above play control instructions, and the combination of user state information corresponding to each play control instruction may differ according to the resource type.
Therefore, after the signal processing model generates the user state information of the user in at least one dimension, the resource type corresponding to the target multimedia resource can be determined, and the user state information of the at least one dimension is combined according to the resource type (or semantic categories are determined from the user states and combined), so that the play control instruction corresponding to the target multimedia resource is determined according to the combination result and the preconfigured mapping relationship between the resource type and the play control instruction.
For example, if the signal to be processed includes an audio signal and a visual signal, the sound state information of the user in the sound dimension and the visual state information of the user in the visual dimension output by the signal processing model may be combined; alternatively, semantic categories may be determined according to the sound state information and the visual state information, and the semantic categories combined, so as to determine the play control instruction corresponding to the combination result.
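The preconfigured mapping can be sketched as a simple lookup table keyed on resource type and the combined user states; the resource types, state labels, and instruction names below are hypothetical placeholders, not the patent's configuration:

```python
# Hypothetical mapping from (resource_type, combined user states) to a
# play control instruction; entries are illustrative only.
INSTRUCTION_MAP = {
    ("audio", ("sleeping",)): "pause",
    ("audio", ("speaking",)): "pause",
    ("video", ("sleeping",)): "pause",
    ("video", ("moving", "speaking")): "fast_forward",
}

def resolve_instruction(resource_type, user_states):
    # Combine the per-dimension states into a canonical, ordered key
    # before looking up the preconfigured mapping.
    key = (resource_type, tuple(sorted(user_states)))
    return INSTRUCTION_MAP.get(key, "keep_playing")
```

Sorting the states before the lookup makes the combination order-independent, so the same multi-dimensional state information always resolves to the same instruction.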
In addition, the playing control instruction corresponding to the target multimedia resource is determined according to the user state information of at least one dimension, and the method can also be realized by the following steps:
combining the user state information of the at least one dimension according to at least one preset time interval;
and determining the playing control instruction corresponding to the target multimedia resource according to at least one combination result and a mapping relation between the pre-configured resource type and the playing control instruction.
Specifically, after the user state information of the user in at least one dimension is generated, the user state information at different time points or in different time intervals may differ, yet there may be an association between the state information at those different times. Therefore, after the user state information of the at least one dimension is combined according to at least one preset time interval, different combination results may correspond to different play control instructions.
For example, suppose the target multimedia resource is a video course resource, and the acquired signals to be processed are the visual signals and voice signals of the actions, expressions, and so on made by the user while the course is playing. The system can determine the user state information of the user in at least one dimension according to the signals acquired from the user at different moments, so as to determine the corresponding play control instruction according to the user state information.
During course play, the user may be required to make specified actions, expressions, voices, and the like, and the system makes video play decisions according to the actions, expressions, and voices the user actually makes. The decisions include: if the user fails to make an action, expression, or voice that meets the requirements, the video is replayed and the user is asked to try again; if the user makes an action, expression, or voice that meets the requirements, an encouraging video clip can be jumped to and played.
In addition, the user state information can be combined according to time intervals so that the corresponding play control instruction is determined from the combination result. For example, if the preset time interval is 10 minutes, the user state information in the two intervals of minutes 1-10 and/or minutes 11-20 is combined, so that the play control instruction of the video course is determined according to the user's continuous action feedback in different time intervals over the whole video course.
If the user state information is combined over a single time interval and the user does not make the corresponding specified actions, expressions, voices, and the like within minutes 1-10 of the video course, or does not make them within minutes 11-20, the video play control instruction can be determined as replaying the video course. If the user state information is combined over the two time intervals and the user does not make the corresponding specified actions, expressions, voices, and the like within minutes 1-10 but does make them within minutes 11-20, this indicates that the user is gradually coming to understand the course content as the video course progresses; therefore, the video play control instruction can be determined as playing an encouraging video clip.
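A toy version of this interval-based combination, under the assumption that per-minute compliance flags are available; the 10-minute interval and decision names mirror the example above but are otherwise illustrative:

```python
def combine_by_intervals(events, interval_minutes=10, num_intervals=2):
    """Group per-minute compliance flags into fixed intervals and decide.

    `events` maps minute -> True if the user made the specified
    action/expression/voice in that minute (illustrative representation).
    """
    met = []
    for i in range(num_intervals):
        start, end = i * interval_minutes + 1, (i + 1) * interval_minutes
        met.append(any(events.get(m, False) for m in range(start, end + 1)))
    if all(met):
        return "play_encouragement_clip"
    if met[-1] and not met[0]:
        # Succeeding in a later interval after failing an earlier one
        # indicates the user caught up as the course progressed.
        return "play_encouragement_clip"
    return "replay_course"
```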
According to the embodiment of the application, the user state information of at least one dimension is combined according to at least one preset time interval, and the playing control instruction corresponding to the target multimedia resource is determined according to at least one combination result, so that the accuracy of the determination result of the playing control instruction is guaranteed.
Or, in the case that the signal to be processed includes an audio signal and a visual signal, at least one visual signal sampling segment may be input to at least one visual signal processing sub-module of a signal processing model for classification, and a visual feature vector of the user in at least one visual dimension is generated; and the number of the first and second groups,
inputting at least one audio signal sampling segment into at least one audio signal processing submodule of a signal processing model for classification, and generating an audio feature vector of the user in at least one sound dimension;
performing feature fusion on the audio feature vector and the visual feature vector, inputting a fusion result into a classification model for semantic category division, and generating a comprehensive semantic category corresponding to the fusion result;
and determining a playing control instruction corresponding to the target multimedia resource according to the comprehensive semantic category.
Specifically, a schematic diagram of a semantic category generation process provided in an embodiment of the present application is shown in fig. 3. After the signal to be processed of the user is acquired, the audio signal in the signal to be processed is sampled, and the sampled audio sampling segments are input into the signal processing model. The audio signal processing sub-model in the signal processing model performs feature extraction on the signal frames in each audio sampling segment and calculates the feature association degree between adjacent signal frames in each audio sampling segment, so as to determine the audio feature vector corresponding to each audio sampling segment according to the feature association degree.
Similarly, the visual signal in the signal to be processed is sampled, and the sampled visual sampling segments are input into the signal processing model. The visual signal processing sub-model in the signal processing model performs feature extraction on the signal frames in each visual sampling segment and calculates the feature association degree between adjacent signal frames in each visual sampling segment, so as to determine the visual feature vector corresponding to each visual sampling segment according to the feature association degree.
After the visual feature vector and the audio feature vector are generated, the visual feature vector and the audio feature vector can be fused, the fused vector is input into a classification model, and the classification model performs semantic classification on the fused vector to generate a corresponding comprehensive semantic class; in addition, after the comprehensive semantic category is generated, a corresponding playing control instruction can be determined according to the comprehensive semantic category.
In addition, to ensure the accuracy of the semantic category determination result, in the embodiment of the present application, after the classification model outputs the comprehensive semantic category, the audio signal processing sub-model outputs the audio semantic category, and the visual signal processing sub-model outputs the visual semantic category, the comprehensive semantic category, the audio semantic category, and the visual semantic category can be fused to obtain a target semantic category, and the corresponding play control instruction is determined according to the target semantic category.
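A simplified stand-in for the fusion-and-classification step: the audio and visual feature vectors are concatenated and scored by a single linear layer in place of the trained classification model; the weights and labels below are made up for illustration:

```python
def fuse_and_classify(audio_vec, visual_vec, weight_matrix, labels):
    """Concatenate the audio and visual feature vectors and classify the
    fused vector with one linear layer (a toy stand-in for the trained
    classification model; weights and labels are illustrative)."""
    fused = list(audio_vec) + list(visual_vec)
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weight_matrix]
    return labels[scores.index(max(scores))]
```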
And 208, adjusting the playing state of the target multimedia resource according to the playing control instruction.
Specifically, as mentioned above, if the resource type is the audio type, the play control instructions corresponding to the audio resource may include, but are not limited to, start playing, pause playing, play the next episode, and the like; if the resource type is the video type, fast-forward play, double-speed play, and the like may be included in addition to the above play control instructions. Therefore, after the play control instruction corresponding to the target multimedia resource is determined, the play state of the target multimedia resource can be adjusted according to the play control instruction.
For example, if the current play state of the target multimedia resource is playing and the play control instruction is pause playing, the play state of the target multimedia resource can be adjusted from the playing state to the paused state.
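Adjusting the play state according to an instruction can be sketched as a tiny state machine; the instruction names and the 30-second skip are illustrative assumptions:

```python
class Player:
    """Minimal player state machine for applying a play control
    instruction (instruction names are illustrative)."""

    def __init__(self):
        self.state = "stopped"
        self.position = 0  # playback position in seconds

    def apply(self, instruction):
        if instruction == "start_playing":
            self.state = "playing"
        elif instruction == "pause_playing":
            self.state = "paused"
        elif instruction == "fast_forward":
            # Skip ahead without changing the play/pause state.
            self.position += 30
        return self.state
```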
In specific implementations, if the signal to be processed includes an audio signal and the sound state information of the user is determined to indicate a speaking state (the target state), speech recognition may be performed on the audio signal, and the play control instruction corresponding to the target multimedia resource is determined according to the speech recognition result.
Specifically, the user can control the play state of the target multimedia resource by voice. Accordingly, when the signal acquisition device acquires the audio signal of the user and the user state information is determined, by analyzing the audio signal, to indicate the speaking state, speech recognition is performed on the audio signal so that the play control instruction is determined according to the recognition result.
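A hedged sketch of mapping a speech-recognition transcript to a play control instruction via keyword spotting; the phrases and instruction names are assumptions, and a production system would use a trained recognizer and intent model rather than substring matching:

```python
# Hypothetical keyword-to-instruction table (illustrative only).
VOICE_COMMANDS = {
    "pause": "pause_playing",
    "stop": "pause_playing",
    "next episode": "play_next",
    "fast forward": "fast_forward",
}

def instruction_from_transcript(transcript):
    """Return the play control instruction spotted in the recognized
    transcript, or None if no known command phrase appears."""
    text = transcript.lower()
    for phrase, instruction in VOICE_COMMANDS.items():
        if phrase in text:
            return instruction
    return None
```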
The present application determines action decisions for controlling audio and video play by capturing and analyzing various kinds of feedback from the viewer, so that play control is accomplished without any active action by the viewer. The various kinds of feedback from the viewer include multi-modal information such as visual information and sound information, and the final play control is generated by comprehensively modeling the information of these dimensions.
An embodiment of the present application implements a play control method, including acquiring a signal to be processed of a user, sampling the signal to be processed according to a preset time interval, generating at least one sampling segment, inputting the at least one sampling segment into a signal processing model for classification, generating user state information of the user in at least one dimension, determining a play control instruction corresponding to a target multimedia resource according to the user state information of the at least one dimension, and adjusting a play state of the target multimedia resource according to the play control instruction.
By collecting the signal to be processed of the user, performing multi-dimensional analysis of the user's state information through the signal to be processed, and comprehensively determining the play control instruction of the target multimedia resource based on the multi-dimensional user state information obtained from the analysis, the play state of the target multimedia resource is controlled automatically, which helps improve the viewing experience of the user.
Referring to fig. 4, the application of the play control method provided in the embodiment of the present application in the sleep monitoring field is taken as an example to further describe the play control method. Fig. 4 shows a flowchart of a play control method in the sleep monitoring field according to an embodiment of the present application, which specifically includes the following steps:
step 402, collecting sleep data signals of a child.
Wherein the sleep data signal comprises an audio signal and a visual signal.
Step 404, sampling the sleep data signal according to a preset time interval, and generating at least one sampling segment.
The audio signal and the visual signal are sampled at fixed time intervals, and the features of the audio signal sampling segments and the visual signal sampling segments are extracted.
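Fixed-interval sampling of a signal can be sketched as follows; the sample rate and interval length are illustrative parameters, and a real implementation would operate on audio/video frame buffers:

```python
def sample_signal(signal, sample_rate_hz, interval_seconds):
    """Split a 1-D signal (a list of samples) into fixed-length sampling
    segments of `interval_seconds` each; the final segment may be shorter."""
    segment_len = int(sample_rate_hz * interval_seconds)
    return [
        signal[i : i + segment_len]
        for i in range(0, len(signal), segment_len)
    ]
```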
Step 406, inputting at least one audio signal sampling segment into an audio signal processing submodule of the signal processing model for classification, and generating sound state information of the user in at least one sound dimension.
Specifically, feature extraction is carried out on signal frames in an audio signal sampling segment, and feature correlation degrees between adjacent signal frames in the audio signal sampling segment are calculated;
determining a first feature vector corresponding to the target sampling segment according to the feature association degree;
calculating the distance between the first feature vector and a second feature vector corresponding to the sample sampling segment;
determining the similarity between the target sampling segment and the sample sampling segment according to the distance;
and determining sound state information of the user in at least one dimension according to the similarity.
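The five steps above can be sketched end to end: a cosine-style association degree between adjacent frames forms the segment's feature vector, the Euclidean distance to each sample segment's vector is converted into a similarity, and the most similar sample's label becomes the state. This is a simplified stand-in for the learned sub-module; the frames and samples are illustrative:

```python
import math

def frame_association(frames):
    """Cosine association degree between adjacent signal frames
    (each frame is a small list of feature values)."""
    assoc = []
    for a, b in zip(frames, frames[1:]):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        assoc.append(dot / norm if norm else 0.0)
    return assoc

def classify_segment(target_frames, samples):
    """Pick the sample label whose association vector is closest to the
    target segment's (Euclidean distance -> similarity = 1 / (1 + d))."""
    target_vec = frame_association(target_frames)
    best_label, best_sim = None, -1.0
    for label, sample_frames in samples.items():
        sample_vec = frame_association(sample_frames)
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(target_vec, sample_vec)))
        sim = 1.0 / (1.0 + d)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label
```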
Step 408, inputting the visual signal sampling segment into at least one visual signal processing submodule of the signal processing model for classification, and generating the visual state information of the user in at least one visual dimension.
And step 410, combining the sound state information and the visual state information, and determining a playing control instruction corresponding to the target multimedia resource according to a combination result.
A time-series model is built over the features of the audio and visual signal sampling segments, the feature association degree between the nodes (signal frames) on the time-series graph is calculated, and the result is matched against the time-series graph models corresponding to relevant semantic content samples collected offline, so as to determine whether the segment belongs to specific semantic content.
In the embodiment of the application, the related semantic content includes a sleep posture, whether to move, whether to speak, and the like.
The story player is controlled according to the detected relevant semantic content, and action decisions are made, including stopping play, continuing play, adjusting play volume, and the like.
Step 412, adjusting the playing status of the target multimedia resource according to the playing control instruction.
And adjusting the playing state of the story player according to the determined action decision.
By collecting the sleep data signals of the child, performing multi-dimensional analysis of the child's state information through the sleep data signals, and comprehensively determining the play control instruction of the story play resource based on the multi-dimensional state information obtained from the analysis, the play state of the story player is controlled automatically, which helps improve the viewing experience of the user.
Referring to fig. 5, the application of the play control method provided in the embodiment of the present application in the field of interactive video courses is taken as an example to further explain the play control method. Fig. 5 shows a flowchart of a play control method in the field of interactive video courses according to an embodiment of the present application, which specifically includes the following steps:
step 502, collecting visual and audio signals of the curriculum object for real-time analysis.
Step 504, sampling the visual and audio signals according to a preset time interval, and generating at least one sampling segment.
The visual and audio signals are sampled at fixed time intervals, and the characteristics of the sampled segments of the audio and video signals are extracted.
Step 506, inputting at least one audio signal sampling segment into an audio signal processing submodule of the signal processing model for classification, and generating sound state information of the user in at least one sound dimension.
Step 508, inputting the visual signal sampling segments into at least one visual signal processing submodule of the signal processing model for classification, and generating visual state information of the user in at least one visual dimension.
And 510, combining the sound state information and the visual state information, and determining a playing control instruction corresponding to the video course according to a combination result.
A time-series model is built over the features of the audio and visual signal sampling segments, the feature association degree between the nodes on the time series is calculated, and the result is matched against the time-series graph models corresponding to relevant semantic content samples collected offline, so as to determine whether the segment belongs to specific semantic content. The relevant semantic content in the embodiment of the present application includes actions, expressions, voice, and the like.
According to the requirements in the course video, the course object needs to produce the specified content that meets the requirements, such as actions, expressions, and voice. After the content is recognized and evaluated, the system makes a video play decision according to how well the recognition result matches the specification. The decision includes: if the course object fails to produce the required content, the video is replayed and the course object is asked to try again; if the course object produces the required content as specified, an encouraging video clip is jumped to and played.
And step 512, adjusting the playing state of the video course according to the playing control instruction.
And adjusting the playing state of the video course according to the determined video playing decision.
In addition, an overall evaluation of the course video can be made according to the course object's continuous action feedback across different time periods throughout the course video. For example, if in the first 10 minutes the course object does not produce the required content, but in the last 10 minutes it does, this indicates that the course object comes to understand the course as the course advances.
By collecting the audio and visual signals of the course object, performing multi-dimensional analysis of the course object's state information through the audio and visual signals, and comprehensively determining the play control instruction of the video course based on the multi-dimensional state information obtained from the analysis, the play state of the video course is controlled automatically, which helps improve the viewing experience of the user.
Corresponding to the above method embodiment, the present application further provides an embodiment of a playback control apparatus, and fig. 6 shows a schematic structural diagram of a playback control apparatus provided in an embodiment of the present application. As shown in fig. 6, the apparatus includes:
a sampling module 602, configured to collect a signal to be processed of a user, and sample the signal to be processed according to a preset time interval, so as to generate at least one sampling segment;
a classification module 604 configured to input the at least one sampling segment into a signal processing model for classification, generating user state information of the user in at least one dimension;
a determining module 606 configured to determine a play control instruction corresponding to the target multimedia resource according to the user state information of the at least one dimension;
an adjusting module 608 configured to adjust the playing state of the target multimedia resource according to the playing control instruction.
Optionally, the signal to be processed comprises an audio signal;
the classification module 604 includes:
the first classification sub-module is configured to input at least one audio signal sampling segment into the audio signal processing sub-module of the signal processing model for classification, generating a corresponding classification result;
and the screening submodule is configured to screen the classification result to obtain the sound state information of the user in the sound dimension.
Optionally, the signal to be processed comprises a visual signal;
the classification module 604 includes:
and the second classification submodule is configured to input at least one visual signal sampling segment into at least one visual signal processing submodule of the signal processing model for classification, and generates visual state information of the user in at least one visual dimension.
Optionally, the classification module 604 includes:
and the third classification submodule is configured to input the at least one sampling segment into a signal processing model for classification, the signal processing model calculates the similarity between the at least one sampling segment and a sample sampling segment, and determines the user state information of the user in at least one dimension according to the similarity.
Optionally, the third classification sub-module includes:
the extraction unit is configured to extract features of signal frames in a target sampling segment, and calculate a feature correlation degree between adjacent signal frames in the target sampling segment, wherein the target sampling segment is one of the at least one sampling segment;
a first determining unit configured to determine a first feature vector corresponding to the target sampling segment according to the feature relevance;
a calculating unit configured to calculate a distance between the first feature vector and a second feature vector corresponding to a sample sampling segment;
a second determining unit configured to determine a similarity between the target sampling segment and the sample sampling segment according to the distance.
Optionally, the third classification sub-module includes:
and the screening unit is configured to screen the sample sampling segments according to the similarity, and take a sample label corresponding to the sample sampling segment in a screening result as the user state information of the user in the dimension corresponding to the target sampling segment.
Optionally, the determining module 606 includes:
the first combination sub-module is configured to determine a resource type corresponding to a target multimedia resource and combine the user state information of the at least one dimension according to the resource type;
and the first instruction control module is configured to determine the play control instruction corresponding to the target multimedia resource according to the combination result and a mapping relation between the pre-configured resource type and the play control instruction.
Optionally, the determining module 606 includes:
a second combination sub-module configured to combine the user state information of the at least one dimension according to at least one preset time interval;
and the second instruction control module is configured to determine the play control instruction corresponding to the target multimedia resource according to at least one combination result and a mapping relation between a pre-configured resource type corresponding to the target multimedia resource and the play control instruction.
Optionally, the signal processing model is trained by:
acquiring a historical sample signal of a user, and sampling the historical sample signal according to a preset time interval to generate at least one sampling segment;
determining user state information corresponding to the at least one sampling segment;
and taking the at least one sampling segment as a training sample, taking user state information corresponding to the at least one sampling segment as a sample label, and inputting a signal processing model to be trained for training to obtain the signal processing model.
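Schematically, the training procedure pairs historical sampling segments with user-state labels. Below, a nearest-centroid model stands in for the trainable signal processing model; the real model would be a neural network, and all data here is illustrative:

```python
def train_signal_model(segments, labels):
    """Toy supervised 'training': fit one centroid per user-state label
    from historical sampling segments (a nearest-centroid stand-in for
    the patent's trainable signal processing model)."""
    model = {}
    for label in set(labels):
        rows = [seg for seg, lab in zip(segments, labels) if lab == label]
        n = len(rows)
        model[label] = [sum(col) / n for col in zip(*rows)]
    return model

def predict(model, segment):
    """Classify a new sampling segment by its nearest centroid."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, segment))
    return min(model, key=lambda lab: sq_dist(model[lab]))
```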
Optionally, the playback control apparatus further includes:
the recognition module is configured to perform voice recognition on the audio signal under the condition that the user state is determined to be the target state according to the user state information;
and the instruction determining module is configured to determine a playing control instruction corresponding to the target multimedia resource according to the voice recognition result.
Optionally, the signal to be processed comprises an audio signal and a visual signal;
the classification module 604 includes:
a first classification submodule configured to input at least one visual signal sampling segment into at least one visual signal processing submodule of a signal processing model for classification, and generate a visual feature vector of the user in at least one visual dimension;
a second classification sub-module configured to input at least one audio signal sampling segment into at least one audio signal processing sub-module of the signal processing model for classification, generating an audio feature vector of the user in at least one sound dimension.
Optionally, the determining module 606 includes:
the fusion submodule is configured to perform feature fusion on the audio feature vector and the visual feature vector, input a fusion result into a classification model to perform semantic category division, and generate a comprehensive semantic category corresponding to the fusion result;
and the determining submodule is configured to determine a playing control instruction corresponding to the target multimedia resource according to the comprehensive semantic category.
The above is a schematic scheme of a playback control apparatus of the present embodiment. It should be noted that the technical solution of the playback control apparatus and the technical solution of the playback control method belong to the same concept, and for details that are not described in detail in the technical solution of the playback control apparatus, reference may be made to the description of the technical solution of the playback control method.
FIG. 7 illustrates a block diagram of a computing device 700 provided according to an embodiment of the present application. The components of the computing device 700 include, but are not limited to, memory 710 and a processor 720. Processor 720 is coupled to memory 710 via bus 730, and database 750 is used to store data.
Computing device 700 also includes access device 740, which enables computing device 700 to communicate via one or more networks 760. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 740 may include one or more of any type of network interface, wired or wireless, such as a Network Interface Card (NIC), an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the application, the above-described components of the computing device 700 and other components not shown in fig. 7 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 7 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 700 may also be a mobile or stationary server.
The processor 720 is configured to execute computer-executable instructions, wherein the steps of the play control method are implemented when the processor executes the computer-executable instructions.
The above is an illustrative solution of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above playback control method belong to the same concept; for details not described in the technical solution of the computing device, reference may be made to the description of the technical solution of the above playback control method.
An embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the steps of the above playback control method.
The above is an illustrative solution of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above playback control method belong to the same concept; for details not described in the technical solution of the storage medium, reference may be made to the description of the technical solution of the above playback control method.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of action combinations, but those skilled in the art should understand that the embodiments of the present application are not limited by the described order of actions, because some steps may be performed in other orders or simultaneously according to the embodiments of the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the embodiments of the present application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in explaining the application. The alternative embodiments are not described exhaustively, and the application is not limited to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments of the application and their practical application, thereby enabling others skilled in the art to understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (15)

1. A playback control method, comprising:
acquiring a signal to be processed of a user, and sampling the signal to be processed according to a preset time interval to generate at least one sampling segment;
inputting the at least one sampling segment into a signal processing model for classification, and generating user state information of the user in at least one dimension;
determining a playing control instruction corresponding to the target multimedia resource according to the user state information of at least one dimension;
and adjusting the playing state of the target multimedia resource according to the playing control instruction.
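The four steps of claim 1 can be sketched as follows. This is a minimal illustration only: the function names, the energy-based stand-in classifier, the 0.5 threshold, and the state-to-instruction table are all hypothetical, not part of the claimed method.

```python
# Minimal sketch of the claim 1 pipeline; all names and thresholds are hypothetical.

def sample_signal(signal, interval):
    """Split a raw signal into fixed-length sampling segments."""
    return [signal[i:i + interval] for i in range(0, len(signal), interval)]

def classify_segment(segment):
    """Stand-in for the signal processing model: maps a segment to user
    state information in one dimension (here, via mean amplitude)."""
    energy = sum(abs(x) for x in segment) / len(segment)
    return "attentive" if energy > 0.5 else "inattentive"

# Hypothetical mapping from user state to a play control instruction.
STATE_TO_INSTRUCTION = {"attentive": "play", "inattentive": "pause"}

def control_playback(signal, interval=4):
    segments = sample_signal(signal, interval)
    states = [classify_segment(s) for s in segments]
    # Use the most recent segment's state to pick the instruction.
    return STATE_TO_INSTRUCTION[states[-1]]

print(control_playback([0.9, 0.8, 0.7, 0.9, 0.1, 0.0, 0.2, 0.1]))  # pause
```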
2. The playback control method according to claim 1, wherein the signal to be processed includes an audio signal;
the inputting the at least one sampling segment into a signal processing model for classification, and generating user state information of the user in at least one dimension comprises:
inputting at least one audio signal sampling segment into an audio signal processing submodule of the signal processing model for classification, and generating a corresponding classification result;
and screening the classification result to obtain the sound state information of the user in the sound dimension.
3. The playback control method according to claim 1 or 2, wherein the signal to be processed includes a visual signal;
the inputting the at least one sampling segment into a signal processing model for classification, and generating user state information of the user in at least one dimension comprises:
and inputting at least one visual signal sampling segment into at least one visual signal processing submodule of the signal processing model for classification, and generating visual state information of the user in at least one visual dimension.
4. The playback control method of claim 1, wherein the inputting the at least one sampling segment into a signal processing model for classification, and generating user state information of the user in at least one dimension comprises:
and inputting the at least one sampling segment into a signal processing model for classification, calculating the similarity between the at least one sampling segment and a sample sampling segment by the signal processing model, and determining the user state information of the user in at least one dimension according to the similarity.
5. The playback control method of claim 4, wherein the calculating the similarity between the at least one sampling segment and the sample sampling segment comprises:
extracting features of signal frames in a target sampling segment, and calculating a feature association degree between adjacent signal frames in the target sampling segment, wherein the target sampling segment is one of the at least one sampling segment;
determining a first feature vector corresponding to the target sampling segment according to the feature association degree;
calculating the distance between the first feature vector and a second feature vector corresponding to the sample sampling segment;
and determining the similarity between the target sampling segment and the sample sampling segment according to the distance.
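The similarity computation of claim 5 can be sketched as below. The per-frame features, the cosine measure of the feature association degree, and the `1 / (1 + distance)` conversion are illustrative assumptions; the claim does not fix these choices.

```python
import math

def frame_features(frame):
    """Toy per-frame features: mean and peak amplitude (hypothetical choice)."""
    return (sum(frame) / len(frame), max(frame))

def adjacent_correlation(f1, f2):
    """Feature association degree between adjacent frames, here taken
    as the cosine similarity of their feature tuples."""
    dot = sum(a * b for a, b in zip(f1, f2))
    n1 = math.sqrt(sum(a * a for a in f1))
    n2 = math.sqrt(sum(b * b for b in f2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def segment_vector(segment_frames):
    """First feature vector: the sequence of adjacent-frame association
    degrees within the target sampling segment."""
    feats = [frame_features(f) for f in segment_frames]
    return [adjacent_correlation(a, b) for a, b in zip(feats, feats[1:])]

def similarity(target_frames, sample_frames):
    """Similarity from the Euclidean distance between the target segment's
    vector and the sample segment's vector: 1 / (1 + distance)."""
    v1, v2 = segment_vector(target_frames), segment_vector(sample_frames)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
    return 1.0 / (1.0 + dist)
```

Identical segments give a distance of zero and hence a similarity of 1; larger distances shrink the similarity toward 0.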
6. The playback control method of claim 5, wherein the determining the user status information of the user in at least one dimension according to the similarity comprises:
and screening the sample sampling segments according to the similarity, and taking sample labels corresponding to the sample sampling segments in a screening result as user state information of the user in the dimension corresponding to the target sampling segment.
7. The playback control method according to claim 1, wherein the determining the playback control command corresponding to the target multimedia resource according to the user status information of the at least one dimension includes:
determining a resource type corresponding to the target multimedia resource, and combining the user state information of at least one dimension according to the resource type;
and determining the playing control instruction corresponding to the target multimedia resource according to the combination result and the mapping relation between the pre-configured resource type and the playing control instruction.
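The preconfigured mapping of claim 7 can be illustrated with a simple lookup table. The resource types, combined state keys, and instructions below are hypothetical examples of such a configuration, not values from the disclosure.

```python
# Hypothetical preconfigured mapping from (resource type, combined user
# state) to a play control instruction, per claim 7.
INSTRUCTION_MAP = {
    ("lecture", "away+silent"):     "pause",
    ("lecture", "present+silent"):  "play",
    ("music",   "away+silent"):     "play",          # music may keep playing
    ("music",   "present+speaking"): "lower_volume",
}

def combine_states(states):
    """Combine the per-dimension user state information into one key."""
    return "+".join(states)

def instruction_for(resource_type, states, default="play"):
    """Look up the play control instruction for this resource type and
    combined state; fall back to a default when unconfigured."""
    return INSTRUCTION_MAP.get((resource_type, combine_states(states)), default)

print(instruction_for("lecture", ["away", "silent"]))  # pause
```

Keying the table on the resource type lets the same user state drive different instructions, e.g. pausing a lecture but letting music continue when the user steps away.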
8. The playback control method according to claim 1, wherein the determining the playback control command corresponding to the target multimedia resource according to the user status information of the at least one dimension includes:
combining the user state information of the at least one dimension according to at least one preset time interval;
and determining the playing control instruction corresponding to the target multimedia resource according to at least one combination result and a preset mapping relation between the resource type corresponding to the target multimedia resource and the playing control instruction.
9. The playback control method of claim 1, wherein the signal processing model is trained by:
acquiring a historical sample signal of a user, and sampling the historical sample signal according to a preset time interval to generate at least one sampling segment;
determining user state information corresponding to the at least one sampling segment;
and taking the at least one sampling segment as a training sample, taking user state information corresponding to the at least one sampling segment as a sample label, and inputting a signal processing model to be trained for training to obtain the signal processing model.
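The training-data preparation of claim 9 can be sketched as pairing each sampling segment of the historical signal with its user-state label; the model training itself is out of scope here, and all names are hypothetical.

```python
# Sketch of claim 9's training-data preparation (hypothetical names).

def sample_signal(signal, interval):
    """Sample the historical signal at a preset interval into segments."""
    return [signal[i:i + interval] for i in range(0, len(signal), interval)]

def build_training_set(history_signal, labels, interval):
    """Pair each sampling segment with its user-state label, yielding
    (training sample, sample label) tuples for the model to be trained."""
    segments = sample_signal(history_signal, interval)
    return list(zip(segments, labels))

training = build_training_set([0.9, 0.8, 0.1, 0.0], ["attentive", "away"], 2)
print(training)  # [([0.9, 0.8], 'attentive'), ([0.1, 0.0], 'away')]
```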
10. The playback control method according to claim 2, further comprising:
under the condition that the user state is determined to be the target state according to the user state information, carrying out voice recognition on the audio signal;
and determining a playing control instruction corresponding to the target multimedia resource according to the voice recognition result.
11. The playback control method according to claim 1, wherein the signal to be processed includes an audio signal and a visual signal;
the inputting the at least one sampling segment into a signal processing model for classification, and generating user state information of the user in at least one dimension comprises:
inputting at least one visual signal sampling segment into at least one visual signal processing submodule of a signal processing model for classification, and generating a visual feature vector of the user in at least one visual dimension; and
and inputting at least one audio signal sampling segment into at least one audio signal processing submodule of the signal processing model for classification, and generating an audio feature vector of the user in at least one sound dimension.
12. The method according to claim 11, wherein the determining the playback control command corresponding to the target multimedia resource according to the user status information of the at least one dimension includes:
performing feature fusion on the audio feature vector and the visual feature vector, inputting a fusion result into a classification model for semantic category division, and generating a comprehensive semantic category corresponding to the fusion result;
and determining a playing control instruction corresponding to the target multimedia resource according to the comprehensive semantic category.
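The fusion step of claim 12 can be sketched as concatenating the audio and visual feature vectors and scoring the fused vector against per-category weights. The concatenation choice, the linear stand-in classifier, the category names, and the weights are all hypothetical illustrations.

```python
# Sketch of claim 12: fuse audio and visual feature vectors, then map the
# fusion result to a comprehensive semantic category. Weights hypothetical.

CATEGORIES = ["watching", "distracted"]
# One hypothetical weight vector per category (audio dims then visual dims).
WEIGHTS = {
    "watching":   [0.2, 0.1, 0.9, 0.8],
    "distracted": [0.9, 0.8, 0.1, 0.2],
}

def fuse(audio_vec, visual_vec):
    """Simple early fusion: feature concatenation."""
    return audio_vec + visual_vec

def semantic_category(audio_vec, visual_vec):
    """Score the fused vector against each category and return the
    comprehensive semantic category with the highest score."""
    fused = fuse(audio_vec, visual_vec)
    scores = {c: sum(w * x for w, x in zip(ws, fused))
              for c, ws in WEIGHTS.items()}
    return max(scores, key=scores.get)

print(semantic_category([0.1, 0.1], [0.9, 0.9]))  # watching
```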
13. A playback control apparatus, comprising:
a sampling module configured to collect a signal to be processed of a user, and sample the signal to be processed according to a preset time interval to generate at least one sampling segment;
a classification module configured to input the at least one sampling segment into a signal processing model for classification, and generate user state information of the user in at least one dimension;
the determining module is configured to determine a playing control instruction corresponding to the target multimedia resource according to the user state information of the at least one dimension;
and the adjusting module is configured to adjust the playing state of the target multimedia resource according to the playing control instruction.
14. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, wherein the processor implements the steps of the playback control method according to any one of claims 1 to 12 when executing the computer-executable instructions.
15. A computer-readable storage medium, characterized in that it stores computer instructions which, when executed by a processor, implement the steps of the playback control method of any one of claims 1-12.
CN202110404946.4A 2021-04-15 2021-04-15 Play control method and device Pending CN113099305A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110404946.4A CN113099305A (en) 2021-04-15 2021-04-15 Play control method and device

Publications (1)

Publication Number Publication Date
CN113099305A true CN113099305A (en) 2021-07-09

Family

ID=76677773

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114449320A (en) * 2021-12-31 2022-05-06 北京地平线机器人技术研发有限公司 Playing control method and device, storage medium and electronic equipment

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160089028A1 (en) * 2014-09-25 2016-03-31 Harman International Industries, Inc. Media player automated control based on detected physiological parameters of a user
CN106454456A (en) * 2016-10-31 2017-02-22 维沃移动通信有限公司 Video playing control method and mobile terminal
CN107613368A (en) * 2017-09-26 2018-01-19 珠海市魅族科技有限公司 Video pause method and apparatus, computer installation and computer-readable recording medium
CN108040289A (en) * 2017-12-12 2018-05-15 天脉聚源(北京)传媒科技有限公司 A kind of method and device of video playing
CN108055581A (en) * 2017-12-13 2018-05-18 深圳市雷鸟网络传媒有限公司 Method, smart television and the storage medium of dynamic play TV programme
CN108307238A (en) * 2018-01-23 2018-07-20 北京中企智达知识产权代理有限公司 A kind of video playing control method, system and equipment
CN109743630A (en) * 2018-12-15 2019-05-10 深圳壹账通智能科技有限公司 Video control method, device, electronic equipment and medium based on recognition of face
CN109831817A (en) * 2019-02-27 2019-05-31 北京达佳互联信息技术有限公司 Terminal control method, device, terminal and storage medium
CN110113639A (en) * 2019-05-14 2019-08-09 北京儒博科技有限公司 Video playing control method, device, terminal, server and storage medium
CN111225273A (en) * 2018-11-26 2020-06-02 Tcl集团股份有限公司 Television play control method, storage medium and television
CN111738887A (en) * 2020-07-19 2020-10-02 平安国际智慧城市科技股份有限公司 Online real-time data interaction method and device, electronic equipment and storage medium
CN111970568A (en) * 2020-08-31 2020-11-20 上海松鼠课堂人工智能科技有限公司 Method and system for interactive video playing
CN112016521A (en) * 2020-09-15 2020-12-01 北京百度网讯科技有限公司 Video processing method and device
CN112637678A (en) * 2021-03-09 2021-04-09 北京世纪好未来教育科技有限公司 Video playing method, device, storage medium and equipment

Similar Documents

Publication Publication Date Title
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
US11858118B2 (en) Robot, server, and human-machine interaction method
US11241789B2 (en) Data processing method for care-giving robot and apparatus
US10210002B2 (en) Method and apparatus of processing expression information in instant communication
CN110544488B (en) Method and device for separating multi-person voice
CN109637518A (en) Virtual newscaster's implementation method and device
CN109176535B (en) Interaction method and system based on intelligent robot
CN110459214B (en) Voice interaction method and device
KR102488530B1 (en) Method and apparatus for generating video
CN107316641B (en) Voice control method and electronic equipment
CN110554831B (en) Operation synchronization method, device, equipment and storage medium
CN110781881A (en) Method, device, equipment and storage medium for identifying match scores in video
US20180027090A1 (en) Information processing device, information processing method, and program
CN107623622A (en) A kind of method and electronic equipment for sending speech animation
CN112258232A (en) Promotion content display method and device based on live broadcast picture
CN113099305A (en) Play control method and device
CN110784762B (en) Video data processing method, device, equipment and storage medium
CN116229311B (en) Video processing method, device and storage medium
CN113496156A (en) Emotion prediction method and equipment
CN111768729A (en) VR scene automatic explanation method, system and storage medium
CN116453005A (en) Video cover extraction method and related device
CN115526772A (en) Video processing method, device, equipment and storage medium
CN113128261A (en) Data processing method and device and video processing method and device
CN114727119A (en) Live broadcast and microphone connection control method and device and storage medium
CN113326760A (en) Video classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination