CN111935501A - Scene recognition method and device

Scene recognition method and device

Info

Publication number
CN111935501A
CN111935501A (application CN201910393305.6A; granted as CN111935501B)
Authority
CN
China
Prior art keywords
target
scene
time stamp
determining
sound state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910393305.6A
Other languages
Chinese (zh)
Other versions
CN111935501B (en)
Inventor
王聪 (Wang Cong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910393305.6A
Publication of CN111935501A
Application granted
Publication of CN111935501B
Legal status: Active (current)

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B: EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B 5/00: Electrically-operated educational appliances
    • G09B 5/06: Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
    • G09B 5/065: Combinations of audio and video presentations, e.g. videotapes, videodiscs, television systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/57: Speech or voice analysis techniques for comparison or discrimination, specially adapted for processing of video signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78: Detection of presence or absence of voice signals
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233: Processing of audio elementary streams
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/431: Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N 21/4312: Generation of visual interfaces involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83: Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845: Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8455: Structuring of content involving pointers to the content, e.g. pointers to the I-frames of the video stream
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/14: Systems for two-way working
    • H04N 7/15: Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application provides a scene recognition method and device, belonging to the technical field of computers. The method includes: acquiring a target video to be identified; determining a correspondence between sound states and playing timestamps through a preset audio processing algorithm, the audio data of the target video, and the playing timestamps corresponding to the audio data in the target video, where a sound state is either a voiced state or a silent state; and determining, according to the correspondence between sound states and playing timestamps, the playing timestamps whose corresponding sound state is the voiced state, and determining the scenes corresponding to the determined playing timestamps as target scenes, so as to establish a correspondence between scenes and playing timestamps. With the method and device, the video display effect can be improved.

Description

Scene recognition method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a scene recognition method and apparatus.
Background
The recording and broadcasting system has functions such as live video, video on demand, and video editing. A server of the recording and broadcasting system can record a target activity through a video recorder arranged at a preset position to obtain a video file containing the activity scene, and then store the video file locally. A user can thus acquire the video file from the server through a client and play it on demand. For example, the server can record a teacher's lecture in a classroom through the video recorder to obtain a video file containing the teaching scene, and a user can acquire the video file through the client, thereby realizing teaching recording and broadcasting.
However, a teaching video contains not only teacher-teaching scenes but also student problem-solving scenes, and when watching the video through the client, the user can only determine whether a given scene is a teacher-teaching scene or a student problem-solving scene by watching it, so the video display effect is poor.
Disclosure of Invention
An object of the embodiments of the present application is to provide a scene recognition method and apparatus, so as to improve the video display effect. The specific technical solutions are as follows:
In a first aspect, a scene recognition method is provided, where the method includes:
acquiring a target video to be identified;
determining a correspondence between sound states and playing timestamps through a preset audio processing algorithm, the audio data of the target video, and the playing timestamps corresponding to the audio data in the target video, where a sound state is either a voiced state or a silent state;
and determining, according to the correspondence between sound states and playing timestamps, the playing timestamps whose corresponding sound state is the voiced state, and determining the scenes corresponding to the determined playing timestamps as target scenes, so as to establish a correspondence between scenes and playing timestamps.
Optionally, the determining, according to the correspondence between sound states and playing timestamps, the playing timestamps whose corresponding sound state is the voiced state, and determining the scenes corresponding to the determined playing timestamps as target scenes includes:
when the sound state corresponding to the current playing timestamp is the silent state, determining the minimum playing timestamp which is after the current playing timestamp and whose corresponding sound state is the voiced state, and determining the scene corresponding to the minimum playing timestamp as a target scene;
acquiring the sound state corresponding to each sampling moment according to a preset sampling time interval;
if that sound state is the voiced state, determining the scenes corresponding to the playing timestamps within the sampling time interval as target scenes, and acquiring the sound state corresponding to the next sampling moment;
if that sound state is the silent state, determining the scenes corresponding to the playing timestamps within the sampling time interval as non-target scenes, taking the sampling moment as the current playing timestamp, and returning to the step of determining the minimum playing timestamp which is after the current playing timestamp and whose corresponding sound state is the voiced state and determining the scene corresponding to the minimum playing timestamp as a target scene.
Optionally, before determining the scenes corresponding to the playing timestamps within the sampling time interval as non-target scenes, the method further includes:
determining a minimum detection duration according to the sampling moment and the playing timestamp which is after the sampling moment and whose corresponding sound state is the voiced state;
if the minimum detection duration is less than a preset detection duration threshold, determining the scenes corresponding to the playing timestamps within the sampling time interval as target scenes, and acquiring the sound state corresponding to the next sampling moment;
and if the minimum detection duration is greater than the detection duration threshold, performing the step of determining the scenes corresponding to the playing timestamps within the sampling time interval as non-target scenes.
Optionally, the method further includes:
receiving a video acquisition request sent by a client, where the video acquisition request is used to request a target video containing a target scene;
determining the target playing timestamps corresponding to the target scene according to the correspondence between scenes and playing timestamps;
and generating a response message according to the target playing timestamps and sending the response message to the client, where the response message is used to determine the target scene contained in the target video.
Optionally, when the client is in the state of displaying the target video, the generating a response message according to the target playing timestamps and sending the response message to the client includes:
generating a response message containing the target playing timestamps;
and sending the response message to the client, so that the client, while currently displaying the target video, marks the target playing timestamps of the target scene in a preset playing time progress bar.
Optionally, when the client is in the state of not displaying the target video, the generating a response message according to the target playing timestamps and sending the response message to the client includes:
acquiring, from the video data of the target video, the video data corresponding to the target playing timestamps to obtain a response message;
and sending the response message to the client, so that the client displays the target scene based on the video data corresponding to the target playing timestamps.
In a second aspect, a scene recognition apparatus is provided, where the apparatus includes:
the acquisition module is used for acquiring a target video to be identified;
the first determining module is used for determining the correspondence between sound states and playing timestamps through a preset audio processing algorithm, the audio data of the target video and the playing timestamps corresponding to the audio data in the target video, wherein a sound state is either a voiced state or a silent state;
and the establishing module is used for determining, according to the correspondence between sound states and playing timestamps, the playing timestamps whose corresponding sound state is the voiced state, and determining the scenes corresponding to the determined playing timestamps as target scenes, so as to establish the correspondence between scenes and playing timestamps.
Optionally, the establishing module includes:
the first determining submodule is used for determining, when the sound state corresponding to the current playing timestamp is the silent state, the minimum playing timestamp which is after the current playing timestamp and whose corresponding sound state is the voiced state, and determining the scene corresponding to the minimum playing timestamp as a target scene;
the acquisition submodule is used for acquiring the sound state corresponding to each sampling moment according to a preset sampling time interval;
the second determining submodule is used for determining, when that sound state is the voiced state, the scenes corresponding to the playing timestamps within the sampling time interval as target scenes, and acquiring the sound state corresponding to the next sampling moment;
and the third determining submodule is used for determining, when that sound state is the silent state, the scenes corresponding to the playing timestamps within the sampling time interval as non-target scenes, taking the sampling moment as the current playing timestamp, and returning to the step of determining the minimum playing timestamp which is after the current playing timestamp and whose corresponding sound state is the voiced state and determining the scene corresponding to the minimum playing timestamp as a target scene.
Optionally, the establishing module further includes:
the fourth determining submodule is used for determining the minimum detection duration according to the sampling moment and the playing timestamp which is after the sampling moment and whose corresponding sound state is the voiced state;
the second determining submodule is further used for determining, when the minimum detection duration is less than a preset detection duration threshold, the scenes corresponding to the playing timestamps within the sampling time interval as target scenes, and acquiring the sound state corresponding to the next sampling moment;
and the third determining submodule is further used for performing, when the minimum detection duration is greater than the detection duration threshold, the step of determining the scenes corresponding to the playing timestamps within the sampling time interval as non-target scenes.
Optionally, the apparatus further comprises:
the receiving module is used for receiving a video acquisition request sent by a client, wherein the video acquisition request is used to request a target video containing a target scene;
the second determining module is used for determining, according to the correspondence between scenes and playing timestamps, the target playing timestamps corresponding to the target scene;
and the sending module is used for generating a response message according to the target playing timestamps and sending the response message to the client, wherein the response message is used to determine the target scene contained in the target video.
Optionally, when the client is in a state of displaying the target video, the sending module includes:
the first generating submodule is used for generating a response message containing the target playing timestamps;
and the first sending submodule is used for sending the response message to the client, so that the client, while currently displaying the target video, marks the target playing timestamps of the target scene in a preset playing time progress bar.
Optionally, when the client is in a state of not displaying the target video, the sending module includes:
the second generating submodule is used for acquiring, from the video data of the target video, the video data corresponding to the target playing timestamps to obtain a response message;
and the second sending submodule is used for sending the response message to the client, so that the client displays the target scene based on the video data corresponding to the target playing timestamps.
In a third aspect, an electronic device is provided, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory, configured to store a computer program;
a processor, configured to implement the method steps of any one of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored; the computer program, when executed by a processor, implements the method steps of any one of the first aspect.
According to the scene recognition method and device provided by the embodiments of the application, a target video to be identified can be acquired; a correspondence between sound states and playing timestamps is determined through a preset audio processing algorithm, the audio data of the target video, and the playing timestamps corresponding to the audio data in the target video, where a sound state is either a voiced state or a silent state; and the playing timestamps whose corresponding sound state is the voiced state are determined according to that correspondence, and the scenes corresponding to the determined playing timestamps are determined as target scenes, so as to establish a correspondence between scenes and playing timestamps. Because the correspondence between scenes and playing timestamps is established, the target playing timestamps of a target scene in the target video can be determined directly from this correspondence; the user no longer needs to determine whether a given scene is the target scene by watching the scenes one by one, so the video display effect can be improved.
Of course, not all advantages described above need to be achieved at the same time in the practice of any one product or method of the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of a recording and broadcasting system according to an embodiment of the present application;
fig. 2 is a flowchart of a scene recognition method according to an embodiment of the present application;
fig. 3 is a flowchart of a scene recognition method according to an embodiment of the present application;
fig. 4 is a flowchart of a scene recognition method according to an embodiment of the present application;
fig. 5 is a schematic diagram of a display page provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a scene recognition apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by those skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The embodiment of the application provides a scene recognition method, which is applied to a server of a recording and broadcasting system, wherein the server can be an electronic device with a storage function and a data processing function, and the recording and broadcasting system can also comprise a video acquisition device, an audio acquisition device and a client which is in communication connection with the server. Wherein the video capture device may be a video recorder and the audio capture device may be a sound pickup. Fig. 1 is a schematic diagram of a recording and playing system according to an embodiment of the present application.
The recording and broadcasting system can be applied to recording and broadcasting of various activities, such as recording and broadcasting of meeting activities, recording and broadcasting of classroom teaching and the like. Taking the example of recording and broadcasting classroom teaching, the work flow of the recording and broadcasting system comprises: the server can shoot teaching activities in a classroom through a video recorder arranged in the classroom to obtain video data containing teaching scenes; meanwhile, the server can collect audio data of teaching activities through a sound pickup arranged in a classroom, and then the server can locally store video data containing teaching scenes and audio data of the teaching activities. Therefore, the user can acquire the video data and the audio data through the client of the recording and broadcasting system to watch the video containing the teaching scene, and the teaching recording and broadcasting are realized.
In the embodiment of the application, the server realizes scene recognition on a video by establishing the correspondence between the scenes in the video and the playing timestamps of the video, so that a user watching the video through the client can later quickly locate the video clips containing the scenes he or she wants to watch, which improves the video display effect.
As shown in fig. 2, an embodiment of the present application provides an implementation in which the server obtains the correspondence between scenes and playing timestamps. The specific processing flow includes:
Step 201, acquiring a target video to be identified.
In an implementation, the server may collect the audio data of a target activity through an audio acquisition device preset at the target place, and take the collected audio data as the audio data of the target video to be identified.
In a possible implementation, the server may receive a video identification instruction carrying the video identifier of the target video to be identified. The server may then determine the target video among the locally stored videos according to the video identifier, and acquire the audio data of the target video.
Step 202, determining the correspondence between sound states and playing timestamps through a preset audio processing algorithm, the audio data of the target video, and the playing timestamps corresponding to the audio data in the target video.
The server may be preset with an audio processing algorithm, which can be used to determine the sound state corresponding to each playing timestamp in the audio played based on the audio data. A sound state is either a voiced state or a silent state. The audio processing algorithm may be a Voice Activity Detection (VAD) algorithm.
In an implementation, the server may process, through the audio processing algorithm, the audio data together with the playing timestamps corresponding to the audio data in the target video, and determine the sound state at each point in the audio played based on the audio data; the server can thereby determine the correspondence between sound states and playing timestamps.
In the embodiment of the application, in order to speed up audio processing, the server may determine, through the audio processing algorithm, the sound state corresponding to each playing timestamp at a preset playing time interval. The playing time interval may be 0.25s or 0.5s; in a possible implementation, the playing time interval should not be greater than 1s.
For example, the server may determine, through the audio processing algorithm, the sound state corresponding to each playing timestamp at a preset playing time interval of 0.5s, obtaining the correspondence between sound states and playing timestamps shown in Table 1:
TABLE 1

Playing timestamp | 0s     | 0.5s   | 1s     | 1.5s   | 2s     | 2.5s
Sound state       | Silent | Silent | Voiced | Voiced | Voiced | Voiced
Any existing algorithm capable of determining the sound state corresponding to each playing timestamp in audio played based on audio data may be used as the audio processing algorithm; this is not specifically limited in the embodiment of the present application.
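To make this concrete, below is a minimal sketch of how such a correspondence might be computed. A simple short-time-energy detector stands in for the preset audio processing algorithm (the description above only requires some VAD-like algorithm); the function name, the energy threshold, and the assumption of 16-bit mono PCM WAV input are illustrative, not prescribed by this application.

    import wave

    import numpy as np

    def timestamp_sound_states(wav_path, interval_s=0.5, energy_threshold=1e-3):
        # Returns {playing_timestamp_in_seconds: "voiced" or "silent"},
        # i.e. the correspondence between sound states and playing timestamps.
        with wave.open(wav_path, "rb") as wav:
            rate = wav.getframerate()
            # Assumes 16-bit mono PCM audio for simplicity.
            samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
        samples = samples.astype(np.float64) / 32768.0  # normalize to [-1, 1]

        states = {}
        step = int(rate * interval_s)  # one state per preset playing time interval
        for start in range(0, len(samples), step):
            frame = samples[start:start + step]
            energy = float(np.mean(frame ** 2))  # short-time energy of the frame
            states[start / rate] = "voiced" if energy > energy_threshold else "silent"
        return states

Running this on the audio behind Table 1 with interval_s=0.5 would yield a mapping like {0.0: "silent", 0.5: "silent", 1.0: "voiced", ...}.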
Step 203, determining, according to the correspondence between sound states and playing timestamps, the playing timestamps whose corresponding sound state is the voiced state, and determining the scenes corresponding to the determined playing timestamps as target scenes, so as to establish the correspondence between scenes and playing timestamps.
In an implementation, for each playing timestamp in the correspondence between sound states and playing timestamps, the server may determine the sound state corresponding to that playing timestamp and then judge whether it is the voiced state.
If it is the voiced state, the server can determine the scene corresponding to the playing timestamp as a target scene; if it is not, the server can determine the scene corresponding to the playing timestamp as a non-target scene. In this way, the server can establish the correspondence between scenes and playing timestamps.
In the embodiment of the application, the server may assign a scene identifier to each scene in advance; correspondingly, the server may store the correspondence between scenes and playing timestamps using the scene identifier as the key of the scene. When the target video is a teaching video, the target scene can be the teacher-teaching scene; when the target video is a conference video, the target scene may be the presenter-speaking scene.
Conversely, the target scene may also correspond to the silent state: when the target video is a teaching video, the target scene may be the student-answering scene; when the target video is a conference video, the target scene may be an intermission scene. In that case, when the sound state corresponding to a playing timestamp is the silent state, the server may determine the scene corresponding to that playing timestamp as a target scene, so as to establish the correspondence between scenes and playing timestamps.
In a possible implementation, the server may store a correspondence between sound states and scenes in advance. For each playing timestamp in the correspondence between sound states and playing timestamps, the server may determine the sound state corresponding to that playing timestamp, and then determine the scene corresponding to that sound state according to the correspondence between sound states and scenes. The server can thereby determine the scene corresponding to each playing timestamp and obtain the correspondence between scenes and playing timestamps.
For example, the correspondence between sound states and scenes may be: the voiced state corresponds to the teacher-teaching scene, and the silent state corresponds to the student-answering scene. For the silent state corresponding to the playing timestamp 0s in Table 1, the server may determine, according to this correspondence, that the scene corresponding to that playing timestamp is the student-answering scene.
In the embodiment of the application, the server determines, through the audio processing algorithm, the correspondence between sound states and playing timestamps according to the audio data of the target video to be identified and the playing timestamps corresponding to the audio data in the target video. The server then determines the playing timestamps whose corresponding sound state is the voiced state, determines the scenes corresponding to these playing timestamps as target scenes, and obtains the correspondence between the scenes of the target video and the playing timestamps. Based on this correspondence, the server can offer users watching the target video through the recording and broadcasting system a service for quickly locating the target scene, so that a user does not need to determine whether a given scene is a target scene of interest by watching scenes one by one, which improves the video display effect.
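As a minimal illustration of this naive per-timestamp labeling (before the sampling refinement of fig. 3 below), the following sketch maps each playing timestamp's sound state to a scene; the default mapping and scene names are taken from the examples above and are configurable, since the target scene may correspond to either state.

    def label_scenes(states, scene_for_state=None):
        # Naive step-203 labeling: one scene per playing timestamp,
        # looked up from the stored sound-state-to-scene correspondence.
        if scene_for_state is None:
            scene_for_state = {"voiced": "teacher teaching", "silent": "student answering"}
        return {t: scene_for_state[state] for t, state in states.items()}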
Optionally, a sampling time interval may be preset in the server, and the server may determine the scenes corresponding to the playing timestamps at the granularity of that sampling time interval. As shown in fig. 3, the specific steps are as follows:
step 301, when the sound state corresponding to the current playing timestamp is a silent state, determining a minimum playing timestamp which is after the current playing timestamp and the corresponding sound state is a sound state, and determining a scene corresponding to the minimum playing timestamp as a target scene.
In an implementation, the server may use an initial play time stamp included in the correspondence between the sound state and the play time stamp as a current play time stamp according to the play order of the audio, and determine whether the sound state corresponding to the current play time stamp is a silent state.
If the sound state corresponding to the current playing time stamp is a sound state, the server may determine the scene corresponding to the current playing time stamp as the target scene. Then, the server may use a next playing time stamp after the current playing time stamp as the current playing time stamp according to the playing sequence of the audio, and then determine the sound state corresponding to the current playing time stamp.
If the sound state corresponding to the current playing time stamp is a silent state, the server may determine, according to the playing sequence of the audio, the minimum playing time stamp which is after the current playing time stamp and the corresponding sound state is a sound state, and determine a scene corresponding to the minimum playing time stamp as a target scene.
For example, the server may set the initial play time stamp 0s as the current play time stamp in the order of playing the audio, and determine that the sound state corresponding to the current play time stamp is a silent state. Then, the server may determine, according to the playing sequence of the audio, a minimum playing time stamp that is after the current playing time stamp and whose corresponding sound state is a sound state, to obtain 1s, and then, the server may determine a scene corresponding to the minimum playing time stamp 1s as a target scene.
Step 302, acquiring the sound state corresponding to each sampling moment according to a preset sampling time interval.
In an implementation, the server may determine a series of sampling moments according to the minimum playing timestamp and the preset sampling time interval, and then acquire the sound state corresponding to each sampling moment. If that sound state is the voiced state, the server may perform step 303; if it is the silent state, the server may perform step 304.
For example, the server may determine a sampling moment of 11s from the minimum playing timestamp 1s and a preset sampling time interval of 10s. The server may then acquire the sound state corresponding to the sampling moment 11s, obtaining the voiced state, and accordingly determine the scenes corresponding to the playing timestamps within the sampling interval 1s-11s as target scenes.
Step 303, determining the scenes corresponding to the playing timestamps within the sampling time interval as target scenes, and acquiring the sound state corresponding to the next sampling moment.
In an implementation, the server may determine the scenes corresponding to the playing timestamps within the sampling time interval as target scenes, and then acquire the sound state corresponding to the next sampling moment.
Step 304, determining the scenes corresponding to the playing timestamps within the sampling time interval as non-target scenes, taking the sampling moment as the current playing timestamp, and returning to the step of determining the minimum playing timestamp which is after the current playing timestamp and whose corresponding sound state is the voiced state and determining the scene corresponding to the minimum playing timestamp as a target scene.
In an implementation, the server may determine the scenes corresponding to the playing timestamps within the sampling time interval as non-target scenes, take the sampling moment as the current playing timestamp, and then perform step 301.
In this embodiment, when the sound state corresponding to the current playing timestamp is the silent state, the server determines the minimum playing timestamp after it whose corresponding sound state is the voiced state, and then determines sampling moments from that minimum playing timestamp and the sampling time interval. When the sound state corresponding to a sampling moment is the voiced state, the scenes within the sampling interval are determined as target scenes; when it is the silent state, they are determined as non-target scenes, the sampling moment becomes the current playing timestamp, and the search for the next minimum voiced playing timestamp starts again. Because sampling moments are determined at the sampling time interval and scenes are determined per sampling moment rather than per playing timestamp, the computational load of the server can be reduced and the speed of scene recognition on the target video improved.
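The following sketch is one possible reading of steps 301-304. It assumes `states` maps evenly spaced playing timestamps (in seconds) to "voiced"/"silent", as produced by the earlier sketch, and that sampling moments land on that grid; all names are illustrative.

    def label_scenes_sampled(states, sampling_interval=10.0):
        timestamps = sorted(states)
        scenes = {}
        current = timestamps[0]

        def next_voiced(after):
            # Step 301: minimum playing timestamp at or after `after` whose state is voiced.
            return next((t for t in timestamps if t >= after and states[t] == "voiced"), None)

        while current <= timestamps[-1]:
            if states.get(current, "silent") == "silent":
                start = next_voiced(current)
                if start is None:  # no voiced audio remains
                    for t in timestamps:
                        if t >= current:
                            scenes[t] = "non-target"
                    return scenes
                for t in timestamps:  # the silence up to the voiced start is non-target
                    if current <= t < start:
                        scenes[t] = "non-target"
            else:
                start = current
            sample = start + sampling_interval  # step 302: next sampling moment
            # Steps 303/304: label the whole interval from the sampling moment's state.
            label = "target" if states.get(sample, "silent") == "voiced" else "non-target"
            for t in timestamps:
                if start <= t <= sample:
                    scenes[t] = label
            current = sample  # continue (or restart the search) from the sampling moment
        return scenes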
Optionally, when the sound state corresponding to a sampling moment is the silent state, the server may perform the following steps before determining the scenes corresponding to the playing timestamps within the sampling time interval as non-target scenes:
Step one, determining the minimum detection duration according to the sampling moment and the playing timestamp which is after the sampling moment and whose corresponding sound state is the voiced state.
In an implementation, the server may determine, among the playing timestamps which are after the sampling moment and whose corresponding sound state is the voiced state, the detection playing timestamp closest to the sampling moment. The server may then compute the time difference between the sampling moment and the detection playing timestamp to obtain the minimum detection duration.
The server can then judge whether the minimum detection duration is greater than a preset detection duration threshold; if the minimum detection duration is less than the threshold, the server may perform step two, and if it is greater, step three.
For example, when the sound state corresponding to the sampling moment 31s is the silent state, the server may determine the detection playing timestamp which is after 31s and whose corresponding sound state is the voiced state, obtaining 39s. The server may then compute the time difference between the sampling moment 31s and the detection playing timestamp 39s, obtaining a minimum detection duration of 8s. Since 8s is less than the pre-stored detection duration threshold of 10s, the server may determine the scenes corresponding to the playing timestamps within the sampling time interval as target scenes and acquire the sound state corresponding to the next sampling moment, 49s.
Step two, determining the scenes corresponding to the playing timestamps within the sampling time interval as target scenes, and acquiring the sound state corresponding to the next sampling moment.
In an implementation, the server may determine the scenes corresponding to the playing timestamps within the sampling time interval as target scenes, and then acquire the sound state corresponding to the next sampling moment.
Step three, determining the scenes corresponding to the playing timestamps within the sampling time interval as non-target scenes, taking the sampling moment as the current playing timestamp, determining the minimum playing timestamp which is after the current playing timestamp and whose corresponding sound state is the voiced state, and determining the scene corresponding to the minimum playing timestamp as a target scene.
In an implementation, the specific processing of this step is similar to that of step 304 and is not repeated here.
In this embodiment, when the sound state corresponding to a sampling moment is the silent state, the server may determine the minimum detection duration from the sampling moment and the first subsequent playing timestamp whose corresponding sound state is the voiced state, and, when the minimum detection duration is less than the detection duration threshold, simply acquire the sound state corresponding to the next sampling moment. Because a speaker may pause while talking, the sound state corresponding to one playing timestamp may differ from that of its neighboring playing timestamps. By determining the minimum detection duration and continuing to the next sampling moment when it falls below the threshold, such pauses can be detected and the interval is not mistakenly determined as a non-target scene, which improves the detection accuracy for the target scene.
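A sketch of this pause-tolerance check, under the same assumptions as above; the threshold value and names are illustrative. Plugged into the previous sketch, it would run just before an interval is labeled non-target.

    def silence_is_short_pause(states, sample_time, detection_threshold=10.0):
        # Step one: minimum detection duration = gap from the sampling moment to the
        # first later playing timestamp whose sound state is voiced.
        voiced_after = [t for t in sorted(states) if t > sample_time and states[t] == "voiced"]
        if not voiced_after:
            return False  # no voiced audio follows, so this is not a mere pause
        min_detection = voiced_after[0] - sample_time
        # Below the threshold: treat as a pause inside the target scene (step two);
        # otherwise treat as a real scene change (step three).
        return min_detection < detection_threshold

For the example above, silence at 31s with the next voiced timestamp at 39s gives a minimum detection duration of 8s, below the 10s threshold, so the interval stays a target scene.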
Optionally, when a user acquires the target video through the recording and broadcasting system, the server can quickly locate the target scene in the target video based on the correspondence between scenes and playing timestamps. As shown in fig. 4, the specific processing includes:
step 401, receiving a video acquisition request sent by a client.
The video acquisition request is used to request a target video containing a target scene. In a possible implementation, the video acquisition request may carry the identifier of the target video to be acquired and the identifier of the target scene.
In an implementation, when a user wants to watch only the video clips of the target video that contain the target scene, the user may perform a preset operation to issue a video display instruction to the client of the recording and broadcasting system. The preset operation may be clicking a scene display icon in a preset display page, or entering a command for displaying the video clips of the target scene. After receiving the video display instruction, the client may obtain the identifier of the target video and the identifier of the target scene, generate a video acquisition request containing these identifiers, and send it to the server.
The server receives the video acquisition request sent by the client and then obtains the identifier of the target video and the identifier of the target scene carried in the request.
Step 402, determining a target playing time stamp corresponding to the target scene according to the corresponding relationship between the scene and the playing time stamp.
In an implementation, the server may determine, among the pre-stored correspondences between scenes and playing timestamps of the individual videos, the correspondence for the target video according to the identifier of the target video. Within that correspondence, the server may then take the playing timestamps corresponding to the identifier of the target scene as the target playing timestamps of the target scene in the target video.
For example, when the identifier of the target scene identifies the teacher-teaching scene, the server may take the playing timestamps corresponding to that identifier in the correspondence for the target video, for example the interval from 10 minutes to 13 minutes, as the target playing timestamps of the teaching scene in the target video.
Step 403, generating a response message according to the target playing timestamp, and sending the response message to the client.
Wherein the response message is used for determining a target scene contained in the target video.
In an implementation, the server may generate, according to the target playing timestamps, a response message for determining the video data of the target scene contained in the target video. The response message may carry the target playing timestamps, or it may carry the video data containing the target scene. The server then sends the response message to the client, so that the client can determine the target scene contained in the target video after receiving it.
For example, when the response message carries the target playing timestamps, the client may determine the video data containing the target scene from the target playing timestamps and the locally stored video data of the target video; when the response message carries the video data of the target scene, the client may take the video data in the response message as the video data of the target video that contains the target scene.
In the embodiment of the application, the server receives the video acquisition request sent by the client, determines the target playing timestamps corresponding to the target scene according to the correspondence between scenes and playing timestamps, and then generates a response message for determining the target scene contained in the target video and sends it to the client. Because the target playing timestamps of the target scene in the target video are determined from the established correspondence and the response message is generated from them, the client can display the target scene based on the response message; the user is spared from determining whether a given scene is a target scene of interest by watching scenes one by one, and the video display effect can be improved.
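One possible server-side shape of this request/response flow is sketched below. The dictionary-based request and response formats, all field names, and the `video_store` lookup (standing in for the preset correspondence between video data and playing timestamps) are assumptions for illustration.

    def handle_video_request(request, scene_timestamps, video_store):
        # Step 401: the request carries the target video and target scene identifiers.
        video_id = request["video_id"]
        scene_id = request["scene_id"]
        # Step 402: look up the target playing timestamps in the stored
        # scene-to-playing-timestamp correspondence of that video.
        target_timestamps = scene_timestamps[video_id][scene_id]
        # Step 403: if the client already displays this video, answer with the
        # timestamps; otherwise answer with the matching video data.
        if request.get("displayed_video_id") == video_id:
            return {"target_timestamps": target_timestamps}
        segments = [video_store[video_id][t] for t in target_timestamps]
        return {"video_data": segments}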
Meanwhile, when the target video is a teaching video, the user can also learn from the displayed video clips containing the teaching scene how many times and during which periods the teacher lectures, which makes it convenient to understand the distribution of the teacher's teaching time in a class.
Optionally, the server may generate different response messages depending on whether the client is currently displaying the target video. If the client is in the state of displaying the target video, the client generates, after receiving the video display instruction, a video acquisition request containing the identifier of the displayed target video. In this case, the process of the server generating the response message and sending it to the client may include the following steps:
Step one, generating a response message containing the target playing timestamps.
In an implementation, the server may judge whether the received video acquisition request contains the identifier of a displayed target video; when it does, the server may generate a response message containing the target playing timestamps.
Step two, sending the response message to the client, so that the client, while currently displaying the target video, marks the target playing timestamps of the target scene in a preset playing time progress bar.
In an implementation, the server may send the response message containing the target playing timestamps to the client, so that the client obtains the target playing timestamps from the response message and marks them in the playing time progress bar of a preset display page, which makes it convenient for the user to locate the video clips of the target video that contain only the target scene.
For example, fig. 5 is a schematic diagram of a display page provided in the embodiment of the present application: the client may mark the progress area corresponding to the target playing timestamps in the playing time progress bar (e.g., the area 510 indicated by the arrow in fig. 5) in red, so that the user can quickly locate the video clips containing only the target scene by dragging the playing time progress bar to that area.
In the embodiment of the application, if the client receives the video display instruction while displaying the target video, it generates a video acquisition request containing the identifier of the displayed target video; after obtaining the target playing timestamps, the server generates a response message containing them, so that the client can determine, from its locally stored video data of the target video, the video clips containing only the target scene. In this way, the viewing needs of users in different usage scenarios can be met.
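On the client side, turning the returned target playing timestamps into highlighted regions of the progress bar (such as area 510 in fig. 5) might look like the sketch below; the grouping gap and all names are assumptions.

    def progress_bar_regions(target_timestamps, video_duration, gap=1.0):
        # Groups timestamps that are at most `gap` seconds apart into one region
        # and converts each region into fractions of the progress bar width.
        if not target_timestamps:
            return []
        ts = sorted(target_timestamps)
        regions, start, prev = [], ts[0], ts[0]
        for t in ts[1:]:
            if t - prev > gap:
                regions.append((start / video_duration, prev / video_duration))
                start = t
            prev = t
        regions.append((start / video_duration, prev / video_duration))
        return regions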
Optionally, if the client is in the state of not displaying the target video, the client generates, after receiving the video display instruction, a video acquisition request that does not contain the identifier of a displayed target video. In this case, the process of the server generating the response message and sending it to the client may include the following steps:
Step 1, acquiring, from the video data of the target video, the video data corresponding to the target playing timestamps to obtain a response message.
In an implementation, the server may judge whether the received video acquisition request contains the identifier of a displayed target video. When it does not, the server may determine the target video among the locally stored videos according to the identifier of the target video and obtain the video data of the target video. Then, according to the preset correspondence between video data and playing timestamps, the server may take the video data corresponding to the target playing timestamps in the video data of the target video as the response message.
Step 2, sending the response message to the client, so that the client displays the target scene based on the video data corresponding to the target playing timestamps.
In an implementation, the server may send the response message containing the video data corresponding to the target playing timestamps to the client, so that the client can obtain the video data from the response message and then display the video clips of the target scene in a preset display page.
In the embodiment of the application, if the client receives the video display instruction before displaying the target video, it generates a video acquisition request that does not contain the identifier of a displayed target video; after obtaining the target playing timestamps, the server determines, in the locally stored video data of the target video, the video data corresponding to the target playing timestamps and generates a response message containing that video data. The client thus obtains the video clips containing the target scene by receiving the response message, and the user can directly watch the video clips containing only the target scene, or download a file containing only those clips, meeting the viewing needs of users in different usage scenarios.
An embodiment of the present application further provides a scene recognition apparatus, as shown in fig. 6, the apparatus includes:
an obtaining module 610, configured to obtain a target video to be identified;
a first determining module 620, configured to determine the correspondence between sound states and playing timestamps through a preset audio processing algorithm, the audio data of the target video, and the playing timestamps corresponding to the audio data in the target video, where a sound state is either a voiced state or a silent state;
The establishing module 630 is configured to determine, according to the correspondence between sound states and playing timestamps, the playing timestamps whose corresponding sound state is the voiced state, and determine the scenes corresponding to the determined playing timestamps as target scenes, so as to establish the correspondence between scenes and playing timestamps.
Optionally, the establishing module includes:
the first determining submodule is used for determining, when the sound state corresponding to the current playing timestamp is the silent state, the minimum playing timestamp which is after the current playing timestamp and whose corresponding sound state is the voiced state, and determining the scene corresponding to the minimum playing timestamp as a target scene;
the acquisition submodule is used for acquiring the sound state corresponding to each sampling moment according to a preset sampling time interval;
the second determining submodule is used for determining, when that sound state is the voiced state, the scenes corresponding to the playing timestamps within the sampling time interval as target scenes, and acquiring the sound state corresponding to the next sampling moment;
and the third determining submodule is used for determining, when that sound state is the silent state, the scenes corresponding to the playing timestamps within the sampling time interval as non-target scenes, taking the sampling moment as the current playing timestamp, and returning to the step of determining the minimum playing timestamp which is after the current playing timestamp and whose corresponding sound state is the voiced state and determining the scene corresponding to the minimum playing timestamp as a target scene.
Optionally, the establishing module further includes:
the fourth determining submodule is used for determining the minimum detection duration according to the sampling moment and the playing timestamp which is after the sampling moment and whose corresponding sound state is the voiced state;
the second determining submodule is further used for determining, when the minimum detection duration is less than a preset detection duration threshold, the scenes corresponding to the playing timestamps within the sampling time interval as target scenes, and acquiring the sound state corresponding to the next sampling moment;
and the third determining submodule is further used for performing, when the minimum detection duration is greater than the detection duration threshold, the step of determining the scenes corresponding to the playing timestamps within the sampling time interval as non-target scenes.
Optionally, the apparatus further comprises:
the receiving module is used for receiving a video acquisition request sent by a client, wherein the video acquisition request is used to request a target video containing a target scene;
the second determining module is used for determining, according to the correspondence between scenes and playing timestamps, the target playing timestamps corresponding to the target scene;
and the sending module is used for generating a response message according to the target playing timestamps and sending the response message to the client, wherein the response message is used to determine the target scene contained in the target video.
Optionally, when the client is in a state of displaying the target video, the sending module includes:
the first generating submodule is used for generating a response message containing the target playing timestamps;
and the first sending submodule is used for sending the response message to the client, so that the client, while currently displaying the target video, marks the target playing timestamps of the target scene in a preset playing time progress bar.
Optionally, when the client is in a state of not displaying the target video, the sending module includes:
a second generation submodule, configured to acquire, from the video data of the target video, the video data corresponding to the target play timestamp to obtain a response message;
and a second sending submodule, configured to send the response message to the client, so that the client displays the target scene based on the video data corresponding to the target play timestamp.
The scene recognition apparatus provided by the embodiments of the application can acquire a target video to be recognized; determine a correspondence between a sound state and a play timestamp through a preset audio processing algorithm, the audio data of the target video, and the play timestamp corresponding to the audio data in the target video, where the sound state includes a sounded state and a silent state; and determine, according to the correspondence between the sound state and the play timestamp, the play timestamp whose corresponding sound state is the sounded state, and determine the scene corresponding to the determined play timestamp as a target scene, so as to establish a correspondence between the scene and the play timestamp. Because the correspondence between scenes and play timestamps is established, the target play timestamp of a target scene in the target video can be determined directly from this correspondence, so a user does not need to watch scenes one by one to determine whether a given scene is the target scene, which improves the video display effect.
The embodiments of the present application further provide an electronic device, as shown in fig. 7, which includes a processor 701, a communication interface 702, a memory 703 and a communication bus 704, where the processor 701, the communication interface 702, and the memory 703 communicate with one another through the communication bus 704.
a memory 703 for storing a computer program;
the processor 701 is configured to implement the following steps when executing the program stored in the memory 703:
acquiring a target video to be identified;
determining a correspondence between a sound state and a play timestamp through a preset audio processing algorithm, the audio data of the target video, and the play timestamp corresponding to the audio data in the target video, where the sound state includes a sounded state and a silent state;
and determining, according to the correspondence between the sound state and the play timestamp, the play timestamp whose corresponding sound state is the sounded state, and determining the scene corresponding to the determined play timestamp as a target scene, so as to establish a correspondence between the scene and the play timestamp.
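For illustration only, the following sketch shows one way the correspondence between sound states and play timestamps might be computed. The embodiments leave the preset audio processing algorithm open, so this assumes a simple short-time-energy check over decoded PCM samples; the function name, frame length, and energy threshold are all hypothetical.

```python
import numpy as np

def sound_states(samples: np.ndarray, sample_rate: int,
                 frame_ms: int = 20, energy_threshold: float = 1e-4) -> dict:
    """Map each frame's play timestamp (in seconds) to 'sounded' or 'silent'.

    Hypothetical sketch: the patent's preset audio processing algorithm is
    unspecified, so a short-time-energy threshold stands in for it here.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    states = {}
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len].astype(np.float64)
        energy = float(np.mean(frame ** 2))   # short-time energy of the frame
        states[start / sample_rate] = (
            "sounded" if energy >= energy_threshold else "silent"
        )
    return states
```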
Optionally, the determining, according to the correspondence between the sound state and the play timestamp, the play timestamp whose corresponding sound state is the sounded state, and determining the scene corresponding to the determined play timestamp as a target scene includes:
when the sound state corresponding to the current play timestamp is the silent state, determining the minimum play timestamp that is after the current play timestamp and whose corresponding sound state is the sounded state, and determining the scene corresponding to the minimum play timestamp as a target scene;
acquiring the sound state corresponding to each sampling moment according to a preset sampling time interval;
if the sound state is the sounded state, determining the scene corresponding to the play timestamps within the sampling time interval as a target scene, and acquiring the sound state corresponding to the next sampling moment;
if the sound state is the silent state, determining the scene corresponding to the play timestamps within the sampling time interval as a non-target scene, determining the sampling moment as the current play timestamp, and performing the step of determining the minimum play timestamp that is after the current play timestamp and whose corresponding sound state is the sounded state and determining the scene corresponding to the minimum play timestamp as a target scene.
Optionally, before the determining the scene corresponding to the play timestamps within the sampling time interval as a non-target scene, the method further includes:
determining a minimum detection duration according to the sampling moment and the play timestamp that is after the sampling moment and whose corresponding sound state is the sounded state;
if the minimum detection duration is less than a preset detection duration threshold, determining the scene corresponding to the play timestamps within the sampling time interval as a target scene, and acquiring the sound state corresponding to the next sampling moment;
and if the minimum detection duration is greater than the detection duration threshold, performing the step of determining the scene corresponding to the play timestamps within the sampling time interval as a non-target scene.
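A minimal sketch of the sampling procedure above, assuming the timestamp-to-state map from the previous sketch; the sampling interval and the detection duration threshold are the preset values the embodiments mention but do not fix, and every identifier here is hypothetical.

```python
def label_target_spans(states: dict, interval: float, max_gap: float) -> list:
    """Return (start, end) spans of play timestamps whose scenes are target scenes.

    `states` maps play timestamps to 'sounded'/'silent'; `interval` stands in
    for the preset sampling time interval and `max_gap` for the detection
    duration threshold, both assumed values.
    """
    timestamps = sorted(states)
    if not timestamps:
        return []

    def next_sounded(after: float):
        # minimum play timestamp after `after` whose sound state is sounded
        return next((t for t in timestamps
                     if t > after and states[t] == "sounded"), None)

    spans = []
    t = timestamps[0]
    if states[t] == "silent":
        t = next_sounded(t)                   # skip the leading silence
    while t is not None:
        start = t
        probe = t + interval                  # advance by the sampling interval
        while probe <= timestamps[-1]:
            nearest = min(timestamps, key=lambda x: abs(x - probe))
            if states[nearest] == "sounded":
                probe += interval             # still sounded: scene stays a target
                continue
            upcoming = next_sounded(probe)
            gap = (upcoming - probe) if upcoming is not None else float("inf")
            if gap < max_gap:                 # silence shorter than the threshold
                probe += interval             # is tolerated within the target scene
                continue
            break                             # long silence ends the target span
        spans.append((start, min(probe, timestamps[-1])))
        t = next_sounded(probe)               # restart from the next sounded moment
    return spans
```

Used together with the previous sketch, `label_target_spans(sound_states(samples, rate), 1.0, 3.0)` would, under these assumptions, mark every span of play timestamps whose scenes count as target scenes.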
Optionally, the method further includes:
receiving a video acquisition request sent by a client, wherein the video acquisition request is used to instruct acquisition of a target video containing a target scene;
determining a target playing time stamp corresponding to the target scene according to the corresponding relation between the scene and the playing time stamp;
and generating a response message according to the target playing time stamp, and sending the response message to the client, wherein the response message is used for determining the target scene contained in the target video.
Optionally, when the client is in a state of displaying the target video, generating a response message according to the target play timestamp, and sending the response message to the client includes:
generating a response message containing the target playing time stamp;
and sending the response message to the client, so that the client, in its current state of displaying the target video, marks the target play timestamp of the target scene on a preset play-time progress bar.
Optionally, when the client is in a state where the target video is not displayed, the generating a response message according to the target play timestamp and sending the response message to the client includes:
acquiring, from the video data of the target video, the video data corresponding to the target play timestamp to obtain a response message;
and sending the response message to the client so that the client displays the target scene based on the video data corresponding to the target playing time stamp.
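The two response paths could be sketched as follows; the message layout, field names, and the clip-extraction helper are assumptions, since the embodiments specify only what each response must let the client do.

```python
from typing import Any, Dict, List

def extract_clip(video_data: bytes, timestamp: float) -> bytes:
    """Hypothetical helper: slice the encoded stream around `timestamp`.
    A real implementation would seek to the nearest keyframe; elided here."""
    return video_data  # placeholder

def build_response(client_showing_video: bool,
                   target_timestamps: List[float],
                   video_data: bytes) -> Dict[str, Any]:
    if client_showing_video:
        # The client already displays the target video: only the timestamps
        # are needed so it can mark target scenes on its progress bar.
        return {"type": "timestamps",
                "target_play_timestamps": target_timestamps}
    # The client does not display the target video: return the video data
    # for the target play timestamps so it can render the target scenes.
    clips = [extract_clip(video_data, ts) for ts in target_timestamps]
    return {"type": "video_data", "clips": clips}
```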
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The electronic device provided by the embodiments of the application can acquire a target video to be recognized; determine a correspondence between a sound state and a play timestamp through a preset audio processing algorithm, the audio data of the target video, and the play timestamp corresponding to the audio data in the target video, where the sound state includes a sounded state and a silent state; and determine, according to the correspondence between the sound state and the play timestamp, the play timestamp whose corresponding sound state is the sounded state, and determine the scene corresponding to the determined play timestamp as a target scene, so as to establish a correspondence between the scene and the play timestamp. Because the correspondence between scenes and play timestamps is established, the target play timestamp of a target scene in the target video can be determined directly from this correspondence, so a user does not need to watch scenes one by one to determine whether a given scene is the target scene, which improves the video display effect.
In yet another embodiment provided by the present application, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the above-mentioned scene recognition methods.
In yet another embodiment provided by the present application, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any one of the scene recognition methods in the above embodiments.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for apparatus, electronic devices, computer-readable storage media, and computer program product embodiments containing instructions that are substantially similar to method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (14)

1. A method for scene recognition, the method comprising:
acquiring a target video to be identified;
determining a correspondence between a sound state and a play timestamp through a preset audio processing algorithm, the audio data of the target video, and the play timestamp corresponding to the audio data in the target video, wherein the sound state includes a sounded state and a silent state;
and determining, according to the correspondence between the sound state and the play timestamp, the play timestamp whose corresponding sound state is the sounded state, and determining the scene corresponding to the determined play timestamp as a target scene, so as to establish a correspondence between the scene and the play timestamp.
2. The method according to claim 1, wherein the determining, according to the correspondence between the sound state and the play timestamp, the play timestamp whose corresponding sound state is the sounded state, and determining the scene corresponding to the determined play timestamp as a target scene comprises:
when the sound state corresponding to the current play timestamp is the silent state, determining the minimum play timestamp that is after the current play timestamp and whose corresponding sound state is the sounded state, and determining the scene corresponding to the minimum play timestamp as a target scene;
acquiring the sound state corresponding to each sampling moment according to a preset sampling time interval;
if the sound state is the sounded state, determining the scene corresponding to the play timestamps within the sampling time interval as a target scene, and acquiring the sound state corresponding to the next sampling moment;
if the sound state is the silent state, determining the scene corresponding to the play timestamps within the sampling time interval as a non-target scene, determining the sampling moment as the current play timestamp, and performing the step of determining the minimum play timestamp that is after the current play timestamp and whose corresponding sound state is the sounded state and determining the scene corresponding to the minimum play timestamp as a target scene.
3. The method according to claim 2, wherein before the determining the scene corresponding to the play timestamps within the sampling time interval as a non-target scene, the method further comprises:
determining a minimum detection duration according to the sampling moment and the play timestamp that is after the sampling moment and whose corresponding sound state is the sounded state;
if the minimum detection duration is less than a preset detection duration threshold, determining the scene corresponding to the play timestamps within the sampling time interval as a target scene, and acquiring the sound state corresponding to the next sampling moment;
and if the minimum detection duration is greater than the detection duration threshold, performing the step of determining the scene corresponding to the play timestamps within the sampling time interval as a non-target scene.
4. The method of claim 1, further comprising:
receiving a video acquisition request sent by a client, wherein the video acquisition request is used to instruct acquisition of a target video containing a target scene;
determining a target playing time stamp corresponding to the target scene according to the corresponding relation between the scene and the playing time stamp;
and generating a response message according to the target playing time stamp, and sending the response message to the client, wherein the response message is used for determining the target scene contained in the target video.
5. The method of claim 4, wherein generating a response message according to the target playing timestamp and sending the response message to the client when the client is in a state of displaying the target video comprises:
generating a response message containing the target playing time stamp;
and sending the response message to the client, so that the client, in its current state of displaying the target video, marks the target play timestamp of the target scene on a preset play-time progress bar.
6. The method of claim 4, wherein generating a response message according to the target playing timestamp and sending the response message to the client when the client is in a state where the target video is not displayed comprises:
acquiring, from the video data of the target video, the video data corresponding to the target play timestamp to obtain a response message;
and sending the response message to the client so that the client displays the target scene based on the video data corresponding to the target playing time stamp.
7. A scene recognition apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a target video to be identified;
a first determining module, configured to determine a correspondence between a sound state and a play timestamp through a preset audio processing algorithm, the audio data of the target video, and the play timestamp corresponding to the audio data in the target video, wherein the sound state includes a sounded state and a silent state;
and an establishing module, configured to determine, according to the correspondence between the sound state and the play timestamp, the play timestamp whose corresponding sound state is the sounded state, and determine the scene corresponding to the determined play timestamp as a target scene, so as to establish a correspondence between the scene and the play timestamp.
8. The apparatus of claim 7, wherein the establishing module comprises:
a first determining submodule, configured to, when the sound state corresponding to the current play timestamp is the silent state, determine the minimum play timestamp that is after the current play timestamp and whose corresponding sound state is the sounded state, and determine the scene corresponding to the minimum play timestamp as a target scene;
an acquisition submodule, configured to acquire the sound state corresponding to each sampling moment according to a preset sampling time interval;
a second determining submodule, configured to, when the sound state is the sounded state, determine the scene corresponding to the play timestamps within the sampling time interval as a target scene, and acquire the sound state corresponding to the next sampling moment;
and a third determining submodule, configured to, when the sound state is the silent state, determine the scene corresponding to the play timestamps within the sampling time interval as a non-target scene, determine the sampling moment as the current play timestamp, and perform the step of determining the minimum play timestamp that is after the current play timestamp and whose corresponding sound state is the sounded state and determining the scene corresponding to the minimum play timestamp as a target scene.
9. The apparatus of claim 8, wherein the establishing module further comprises:
a fourth determining submodule, configured to determine a minimum detection duration according to the sampling moment and the play timestamp that is after the sampling moment and whose corresponding sound state is the sounded state;
the second determining submodule is further configured to, when the minimum detection duration is less than a preset detection duration threshold, determine the scene corresponding to the play timestamps within the sampling time interval as a target scene, and acquire the sound state corresponding to the next sampling moment;
and the third determining submodule is further configured to, when the minimum detection duration is greater than the detection duration threshold, perform the step of determining the scene corresponding to the play timestamps within the sampling time interval as a non-target scene.
10. The apparatus of claim 7, further comprising:
a receiving module, configured to receive a video acquisition request sent by a client, wherein the video acquisition request is used to instruct acquisition of a target video containing a target scene;
a second determining module, configured to determine, according to a correspondence between the scene and the play timestamp, a target play timestamp corresponding to the target scene;
and a sending module, configured to generate a response message according to the target play timestamp and send the response message to the client, wherein the response message is used to determine the target scene contained in the target video.
11. The apparatus of claim 10, wherein when the client is in a state of displaying the target video, the sending module comprises:
a first generation submodule, configured to generate a response message containing the target play timestamp;
and a first sending submodule, configured to send the response message to the client, so that the client, in its current state of displaying the target video, marks the target play timestamp of the target scene on a preset play-time progress bar.
12. The apparatus of claim 10, wherein when the client is in a state where the target video is not displayed, the sending module comprises:
a second generation submodule, configured to acquire, from the video data of the target video, the video data corresponding to the target play timestamp to obtain a response message;
and a second sending submodule, configured to send the response message to the client, so that the client displays the target scene based on the video data corresponding to the target play timestamp.
13. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-6 when executing a program stored in the memory.
14. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 6.
CN201910393305.6A 2019-05-13 2019-05-13 Scene recognition method and device Active CN111935501B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910393305.6A CN111935501B (en) 2019-05-13 2019-05-13 Scene recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910393305.6A CN111935501B (en) 2019-05-13 2019-05-13 Scene recognition method and device

Publications (2)

Publication Number Publication Date
CN111935501A true CN111935501A (en) 2020-11-13
CN111935501B CN111935501B (en) 2022-06-03

Family

ID=73282528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910393305.6A Active CN111935501B (en) 2019-05-13 2019-05-13 Scene recognition method and device

Country Status (1)

Country Link
CN (1) CN111935501B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1032776A (en) * 1996-07-18 1998-02-03 Matsushita Electric Ind Co Ltd Video display method and recording/reproducing device
US20030016945A1 (en) * 2001-07-17 2003-01-23 Pioneer Corporation Apparatus and method for summarizing video information, and processing program for summarizing video information
US20030112265A1 (en) * 2001-12-14 2003-06-19 Tong Zhang Indexing video by detecting speech and music in audio
US20080266457A1 (en) * 2007-04-24 2008-10-30 Nec Electronics Corporation Scene change detection device, coding device, and scene change detection method
CN102737109A (en) * 2011-04-12 2012-10-17 尼尔森(美国)有限公司 Methods and apparatus to generate a tag for media content
CN104519401A (en) * 2013-09-30 2015-04-15 华为技术有限公司 Video division point acquiring method and equipment
CN108764304A (en) * 2018-05-11 2018-11-06 Oppo广东移动通信有限公司 scene recognition method, device, storage medium and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Liu Anan et al.: "Structured Browsing and Annotation System for News Video", Computer Engineering *
Zhang Nanping et al.: "Research on Video Search Technology Based on Pattern Recognition", Fujian Computer *
Chen Zhongke et al.: "Automatic Analysis and Extraction of Highlight Scenes in Soccer Matches", Journal of Computer-Aided Design & Computer Graphics *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116600166A (en) * 2023-05-26 2023-08-15 武汉星巡智能科技有限公司 Video real-time editing method, device and equipment based on audio analysis
CN116600166B (en) * 2023-05-26 2024-03-12 武汉星巡智能科技有限公司 Video real-time editing method, device and equipment based on audio analysis

Also Published As

Publication number Publication date
CN111935501B (en) 2022-06-03

Similar Documents

Publication Publication Date Title
CN106998486B (en) Video playing method and device
CN107066619B (en) User note generation method and device based on multimedia resources and terminal
CN110673777A (en) Online teaching method and device, storage medium and terminal equipment
CN107612815B (en) Information sending method, device and equipment
CN106840209B (en) Method and apparatus for testing navigation applications
CN112653902B (en) Speaker recognition method and device and electronic equipment
CN109120954B (en) Video message pushing method and device, computer equipment and storage medium
CN104320682B (en) A kind of formulation task order method and system, relevant device
US11127307B2 (en) Joint media broadcasting and live media methods and systems
CN108509611A (en) Method and apparatus for pushed information
CN111209417A (en) Information display method, server, terminal and storage medium
CN111163348A (en) Searching method and device based on video playing
CN111930453A (en) Dictation interaction method and device and electronic equipment
CN111935501B (en) Scene recognition method and device
CN112073757B (en) Emotion fluctuation index acquisition method, emotion fluctuation index display method and multimedia content production method
CN113391745A (en) Method, device, equipment and storage medium for processing key contents of network courses
WO2023056850A1 (en) Page display method and apparatus, and device and storage medium
US11960703B2 (en) Template selection method, electronic device and non-transitory computer-readable storage medium
CN109343761B (en) Data processing method based on intelligent interaction equipment and related equipment
CN113420135A (en) Note processing method and device in online teaching, electronic equipment and storage medium
CN109982143B (en) Method, device, medium and equipment for determining video playing time delay
CN108228829B (en) Method and apparatus for generating information
CN111930229B (en) Man-machine interaction method and device and electronic equipment
CN113808615B (en) Audio category positioning method, device, electronic equipment and storage medium
CN109815408B (en) Method and device for pushing information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant