WO2017211206A1

WO2017211206A1 - Video marking method and device, and video monitoring method and system

Info

Publication number: WO2017211206A1
Application number: PCT/CN2017/086325
Authority: WO
Inventors: 韦薇; 王启贵; 谢思远
Original assignee: 中兴通讯股份有限公司
Priority date: 2016-06-08
Filing date: 2017-05-27
Publication date: 2017-12-14
Also published as: CN107483879A; CN107483879B

Abstract

Provided are a video marking method and device, and a video monitoring method and system. The video marking method comprises: extracting, from a video file, a sound feature of an audio signal; performing matching on the basis of the extracted sound feature and each audio event in an audio event library; and if the extracted sound feature matches at least one audio event, adding an event marker at a corresponding position in the video file to signify occurrence of an audio event.

Description

Video marking method, device and video monitoring method and system

Technical field

This application relates to, but is not limited to, the field of communication technology.

Background

In the related art, in video recording for fields such as monitoring, recorded video files are marked manually after the fact. The general process is: first record video to obtain a video file, then open the recorded video in a video editor, watch the video manually to find the time point to be marked, add the corresponding time on the timeline of the marker bar, add a marker indicator, and add a text label. This way of marking video has the following problems:

Marking a video file requires a person to watch it and decide whether to mark it and where. This is not only inefficient; both the decision to mark and the choice of marking position are subject to human judgment, which may result in poor marking accuracy.

Summary of invention

The following is an overview of the topics detailed in this document. This Summary is not intended to limit the scope of the claims.

The present invention provides a video tagging method, device, and video monitoring method and system, which solve the problem of low efficiency and poor accuracy when the video tagging is manually implemented in the related art.

A video marking method, including:

Extracting sound characteristics of the audio signal in the video file;

The extracted sound features are matched to each audio event in the audio event library; each of the audio events is established based on a sound characteristic of the audio signal generated at the time of the event;

When the sound feature is successfully matched with at least one of the audio events, event marking is performed for the audio event that occurred at the corresponding position in the video file.

Optionally, the extracting the sound characteristics of the audio signal in the video file includes:

The sound characteristics of the audio signal in the video file are extracted during video recording.

Optionally, the extracting the sound characteristics of the audio signal includes:

Converting the audio signal to a time-frequency domain and extracting a foreground signal of the audio signal;

The matching the extracted sound features with each of the audio events comprises:

Extracting a sound feature set from the foreground signal, and calculating a similarity between the sound feature set and each of the audio events, wherein the matching is successful when the obtained similarity is greater than a set similarity threshold.

Optionally, performing event marking on the audio event that occurred at the corresponding position in the video file includes performing one or more of the following:

Marking a start time of occurrence of the audio event at a key video frame position of the video file;

Acquiring and marking one or more of direction information and distance information of the sound source relative to the pickup in the audio event that occurs;

Obtaining and marking the severity level corresponding to the audio event that occurred;

Acquiring and marking the name of the audio event that occurred.

Optionally, the acquiring the severity level corresponding to the audio event that occurs includes:

The severity level corresponding to the audio event is determined according to one or more of recording location information of the video file and duration after the audio event occurs.

Optionally, performing event marking on the audio event that occurred at the corresponding position in the video file includes:

When the severity level corresponding to the audio event is marked, the marking format corresponding to the severity level of the audio event is selected according to the correspondence table of the severity level and the marking format.

Optionally, the extracting the sound feature of the audio signal in the video file includes: extracting a sound feature of the audio signal in the video file according to a preset detection period;

The method further includes:

When the sound features of the audio signals extracted in adjacent detection periods are each successfully matched with at least one of the audio events, determining, according to a preset event merging rule, whether to merge the audio events occurring in the two adjacent detection periods;

When it is determined that the merging is performed, the start time of the audio event in the previous detection period is taken as the start time of the audio event in the current detection period;

When it is determined that the merging is not performed, the end time of the audio event in the previous detection period and the start time of the audio event in the current detection period are set as the start time of the current detection period.

The embodiment of the invention further provides a video monitoring method, including:

Perform surveillance video recording;

In the process of recording the video, performing event marking on the recorded video file by using the video marking method according to any one of the above;

After an event tag is completed on the video file, the video file of the event tag portion is displayed as an alarm.

The embodiment of the invention further provides a video marking device, comprising:

a feature extraction module, configured to: extract a sound feature of the audio signal in the video file;

a processing module, configured to: match the sound feature extracted by the feature extraction module with each audio event in the audio event library; each audio event is established based on a sound feature of an audio signal generated when the event occurs;

a marking module, configured to: when the processing result of the processing module is that the sound feature is successfully matched with at least one of the audio events, perform event marking on the audio event that occurred at the corresponding position in the video file.

Optionally, the device further includes:

The video recording module is set to: perform video recording;

The feature extraction module is configured to: extract a sound feature of an audio signal in the video file during video recording by the video recording module.

Optionally, the marking module performing event marking on the audio event that occurred at the corresponding position in the video file includes performing one or more of the following markings:

Acquiring and marking the name of the audio event that occurred.

The embodiment of the invention further provides a video monitoring system, comprising: a monitoring processing device and the video marking device according to any one of the preceding claims;

The video tagging device is configured to: perform an event tag on the video file recorded during the video monitoring process, and notify the monitoring processing device after completing an event tag on the video file;

The monitoring processing device is configured to: after receiving the notification from the video marking device, display the event-marked portion of the video file as an alarm.

The embodiment of the present invention further provides a computer readable storage medium, where the computer readable storage medium stores computer executable instructions, and when the processor executes the computer executable instructions, the following operations are performed:

Extracting sound characteristics of the audio signal in the video file;

Perform surveillance video recording;

In the video marking method and device and the video monitoring method and system provided by the embodiments of the present invention, the sound features of the audio signal in the video file are extracted and matched against each audio event in the audio event library. When the extracted sound feature is successfully matched with at least one audio event, this indicates that the audio event occurs in the video file, and event marking is performed for that audio event at the corresponding position in the video file; each audio event is established in advance based on the sound features of the audio signal generated when the event occurs. Because the audio events are set in advance and the sound features of the audio signal in the video file are matched against each audio event to determine whether marking is needed, there is no need to view the video content manually to decide whether to mark, and the efficiency and accuracy of marking video files can be greatly improved.

Other aspects will be apparent upon reading and understanding the drawings and detailed description.

Brief description of the drawings

FIG. 1 is a flowchart of a video marking method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a video monitoring method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a video marking apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of another video marking apparatus according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of still another video marking apparatus according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a video monitoring system according to an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of a component of a video monitoring system according to an embodiment of the present disclosure;

FIG. 8 is a flowchart of still another video monitoring method according to an embodiment of the present invention.

Detailed description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that, provided there is no conflict, the embodiments herein and the features in those embodiments may be arbitrarily combined with each other.

The steps illustrated in the flowchart of the figures may be executed in a computer system in accordance with a set of computer executable instructions. Also, although logical sequences are shown in the flowcharts, in some cases the steps shown or described may be performed in a different order than the ones described herein.

The embodiments described below are only some of the embodiments of the present invention, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

In the embodiment of the present invention, audio events are set in advance; the sound feature of the audio signal in the video file to be processed is then extracted and matched against each audio event to determine automatically whether marking is needed. There is no need to view the video content manually to decide whether to mark, which can greatly improve the efficiency and accuracy of marking video files. FIG. 1 is a flowchart of a video marking method according to an embodiment of the present invention. The video marking method provided by the embodiment of the present invention may include the following steps, namely S101 to S105:

S101: Extract sound characteristics of the audio signal in the video file.

The video file in the embodiment of the present invention may include an audio signal and a synchronously recorded video signal. Optionally, S101 may be performed after the video file has been recorded or while it is being recorded. Performing S101 during recording improves the timeliness of marking the video file, so that essentially real-time marking can be achieved. In some fields, especially video surveillance, seeing the marked alarm content one minute or even one second earlier may make a great difference to how the alarm event subsequently develops. Therefore, performing S101 during video file recording is of great significance for the field of video surveillance.

S102: Match the extracted sound features with each audio event in the audio event library.

Each audio event in an embodiment of the invention is established based on the sound features of the audio signal produced when the event occurs. For example, a smoke alarm event produces a smoke alarm sound, and extracting the sound features of that sound yields the audio event of a smoke alarm. For another example, in a robbery or another violent or aggressive incident there may be calls for help, such as screaming for help, and extracting the sound features of these sounds yields the audio event of a robbery or violent or aggressive incident. Beyond these examples, different events generally each have one or more corresponding sound features: a gunshot event may correspond to a gunshot sound, while other events may produce the sound of breaking glass, crying, a car horn, and so on. These are not enumerated one by one here. Those skilled in the art should understand that the audio events in the embodiments of the present invention can be flexibly configured according to the requirements in practical applications.

S103: Determine whether the extracted sound feature is successfully matched with at least one audio event; when the matching is successful, perform S104; when the matching is not successful, perform S105.

S104: Perform event tagging on the audio event that occurs in the corresponding position in the video file.

S105: No audio event currently occurs; continue to wait for the next detection.

Through the flow shown in FIG. 1, the embodiment of the present invention presets the audio events, automatically extracts the sound features of the audio signal in the video file, and matches them against each audio event. If the extracted sound feature is successfully matched with one of the audio events, this indicates that the audio event occurs in the video file, and the audio event is automatically marked at the corresponding position of the video file. The judgment and marking require no manual participation at all, which can improve both the efficiency and the accuracy of marking.
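The S101 to S105 flow can be sketched roughly as follows. This is a minimal illustration rather than the implementation of the embodiment: the function name `mark_segment`, the toy `peak_amplitude` extractor, the `loud_noise` event, and the dictionary-based marker format are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Sequence

# Assumed types: a "feature" is a list of floats, and each library entry is a
# predicate that reports whether a feature matches that audio event.
Feature = Sequence[float]
Matcher = Callable[[Feature], bool]


def mark_segment(
    audio_segment: Sequence[float],
    position_s: float,
    extract: Callable[[Sequence[float]], Feature],
    event_library: Dict[str, Matcher],
    markers: List[dict],
) -> List[str]:
    """Run one pass of S101 to S105 over a single audio segment."""
    feature = extract(audio_segment)                      # S101: extract sound feature
    matched = [name for name, matches in event_library.items()
               if matches(feature)]                       # S102: match against library
    if matched:                                           # S103: any successful match?
        for name in matched:                              # S104: mark at this position
            markers.append({"event": name, "position_s": position_s})
    # S105: nothing matched, so nothing is marked and the caller simply
    # waits for the next detection.
    return matched


def peak_amplitude(samples: Sequence[float]) -> Feature:
    """Toy extractor (assumption): the peak absolute amplitude of the segment."""
    return [max(abs(s) for s in samples)]


# Hypothetical one-entry event library.
library: Dict[str, Matcher] = {"loud_noise": lambda f: f[0] > 0.8}

markers: List[dict] = []
mark_segment([0.1, -0.95, 0.3], 12.0, peak_amplitude, library, markers)
mark_segment([0.1, -0.2, 0.3], 22.0, peak_amplitude, library, markers)
```

Only the first segment exceeds the toy threshold, so only one marker is recorded; the second call corresponds to the S105 branch.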

Optionally, in the embodiment of the present invention, the foregoing process may also be performed during video recording, so that the recorded video file is marked in real time. Compared with approaches in the related art that perform marking only after the video file has been recorded, this can greatly improve the timeliness of marking. For the video surveillance field, some malicious events can then be stopped in time or even avoided, helping to ensure the security of the user's property and life.

Optionally, in the embodiment of the present invention, the implementation of extracting the sound feature of the audio signal and matching the extracted sound feature with each audio event may be performed by using the following process:

After transforming the audio signal into the time-frequency domain, the background signal and the foreground signal of the audio signal are extracted. For example, the background signal and the foreground signal can be separated by behavioral modeling based on the neural mechanism of human hearing; this can eliminate the influence of the background signal.
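As a rough illustration of this step, the sketch below converts a signal to the time-frequency domain with a plain short-time Fourier transform and then separates a foreground from a stationary background. Note the substitution: the text describes behavioral modeling based on the neural mechanism of human hearing, which is not specified in enough detail to reproduce, so a simple per-frequency median subtraction is used here instead. The function names, frame length, and hop size are assumptions.

```python
import cmath
import math
from statistics import median
from typing import List, Sequence


def stft_magnitude(samples: Sequence[float], frame_len: int = 8,
                   hop: int = 4) -> List[List[float]]:
    """Magnitude spectrogram: one list of frequency-bin magnitudes per frame."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        spectrum = []
        for k in range(frame_len // 2 + 1):  # non-negative frequency bins
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


def foreground(frames: List[List[float]]) -> List[List[float]]:
    """Subtract a per-bin median (treated as background) and clamp at zero."""
    if not frames:
        return []
    n_bins = len(frames[0])
    background = [median(f[k] for f in frames) for k in range(n_bins)]
    return [[max(f[k] - background[k], 0.0) for k in range(n_bins)]
            for f in frames]


# A steady hum in bin 1 with a short burst in bin 3 during samples 16..23.
hum = [math.sin(2 * math.pi * 1 * n / 8) for n in range(40)]
samples = [h + (math.sin(2 * math.pi * 3 * n / 8) if 16 <= n < 24 else 0.0)
           for n, h in enumerate(hum)]
fg = foreground(stft_magnitude(samples))
```

In this toy signal the steady hum is removed as background, while the short burst survives as foreground energy in the frame that fully contains it.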

A sound feature set is extracted from the foreground signal, and each audio event is then read from the audio event library and the similarity between the extracted sound feature set and each audio event is calculated. When the similarity between the sound feature set and an audio event is greater than the set similarity threshold, it is determined that the audio event is successfully matched, that is, the audio event occurs in the video file. Event marking can then be performed for that audio event in the video file.
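The threshold comparison can be sketched as follows. The text does not name a similarity measure, so cosine similarity is assumed here purely for illustration; the feature vectors, the event names in `EVENT_LIBRARY`, and the threshold of 0.9 are likewise hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_events(feature_set: Sequence[float],
                 event_library: Dict[str, Sequence[float]],
                 threshold: float = 0.9) -> List[str]:
    """Return the names of all audio events whose similarity exceeds the threshold."""
    return [name for name, template in event_library.items()
            if cosine_similarity(feature_set, template) > threshold]


# Hypothetical library: each audio event is stored as a reference feature vector.
EVENT_LIBRARY = {
    "smoke_alarm": [0.0, 0.1, 0.9, 0.1],
    "glass_break": [0.7, 0.6, 0.1, 0.0],
}

matched = match_events([0.05, 0.1, 0.85, 0.15], EVENT_LIBRARY)
```

A strictly-greater comparison is used because the text says the similarity must be greater than the set threshold.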

Optionally, in the embodiment of the present invention, performing event marking on the audio event that occurs in the corresponding position in the video file includes performing one or more of the following:

The start time of the occurrence of the audio event is marked at the key video frame position of the video file; the key video frame in the embodiment of the present invention refers to the video frame at the moment when the audio event occurs.

Acquire and mark one or more of the direction information and distance information of the sound source relative to the pickup in the audio event that occurred. In the embodiment of the present invention, when the sound source is localized, event marking can be performed after a clear sound source direction (which can also be characterized by an angle) and/or distance is obtained, so that unclear sound source information does not mislead subsequent processing.

Acquire and mark the severity level corresponding to the audio event that occurred.

Acquire and mark the name of the audio event that occurred.
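One way to hold the marking items listed above is a small record with optional fields, so that direction or distance is simply left unset when no clear value is available. The class name `EventMarker`, its field names and units, and the choice to make the start time mandatory are assumptions of this sketch, not details taken from the text.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EventMarker:
    """An event mark attached at the key video frame of an audio event."""
    start_time_s: float                    # start time of the audio event
    direction_deg: Optional[float] = None  # source direction relative to the pickup
    distance_m: Optional[float] = None     # source distance relative to the pickup
    severity: Optional[int] = None         # severity coefficient
    name: Optional[str] = None             # name of the audio event

    def fields_present(self) -> dict:
        """Only the items that were actually marked."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# Direction and distance are omitted here, as when localization is unclear.
marker = EventMarker(start_time_s=83.5, name="gunshot", severity=2)
```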

Optionally, in the embodiment of the present invention, a corresponding severity level may be set for each audio event. For example, for a non-vicious and non-dangerous audio event, the severity level may be set to generally severe, characterized by a coefficient of 0; for a vicious audio event, the severity level may be set to relatively severe, characterized by a coefficient of 1; for a dangerous audio event, the severity level may be set to very severe, characterized by a coefficient of 2. When marking the severity level, the coefficient corresponding to the severity level can be marked directly.

For an event, its severity level may be related not only to the event itself but also closely to where the event occurred (e.g., hotels, shops, schools, residential areas, homes) and to the duration of the event.

Optionally, in the embodiment of the present invention, obtaining the severity level corresponding to the audio event may include: determining the severity level corresponding to the audio event according to the recording location information of the video file (that is, the location where the audio event occurs) and/or the duration after the audio event occurs. Because the severity level is determined in combination with the acquired location information and/or duration, it takes more factors into account and the result obtained is more accurate.
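A minimal sketch of combining these factors is shown below. The text states only that location and duration may influence the severity level; the specific base levels, the set of sensitive locations, the 30-second cutoff, and the rule of raising the level by one step per factor are all invented here for illustration.

```python
from typing import Optional

# Hypothetical base coefficients per event: 0 general, 1 relatively severe, 2 very severe.
BASE_SEVERITY = {"car_horn": 0, "scream": 1, "gunshot": 2}

# Hypothetical set of locations where an event is treated as more serious.
SENSITIVE_LOCATIONS = {"school", "home"}

MAX_SEVERITY = 2


def severity_level(event_name: str,
                   location: Optional[str] = None,
                   duration_s: Optional[float] = None,
                   long_event_s: float = 30.0) -> int:
    """Combine the event's own level with location and duration, capped at 2."""
    level = BASE_SEVERITY.get(event_name, 0)
    if location in SENSITIVE_LOCATIONS:
        level += 1  # assumed rule: a sensitive location raises the level
    if duration_s is not None and duration_s >= long_event_s:
        level += 1  # assumed rule: a long-lasting event raises the level
    return min(level, MAX_SEVERITY)
```

Under these assumed rules, `severity_level("scream", location="school")` yields 2, while the same event with no location or duration information stays at its base level of 1.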

Optionally, in the embodiment of the present invention, to make it easier for monitoring personnel to view and handle events afterwards, performing event marking on the audio event at the corresponding position in the video file may include: when marking the severity level corresponding to the audio event, selecting the mark format corresponding to that severity level according to a preset correspondence table between severity levels and mark formats. Optionally, the mark format in the embodiment of the present invention includes, but is not limited to, different colors/formats for the text box and the text. An example is given in Table 1.

Table 1

Based on the above correspondence table, audio events of different severity levels are marked with different mark formats. When the marked portion of the video file is displayed, the marks appear in different formats and so prompt the user differently, which helps the user respond correctly and quickly to events of different severity levels.
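A lookup of this kind can be sketched as a simple mapping from severity coefficient to mark format. The contents of Table 1 are not available in this text, so the colors and styles below are placeholders, and the names `MARK_FORMATS`, `mark_format`, and `render_marker` are hypothetical.

```python
# Placeholder correspondence table (assumption): the actual Table 1 values
# are not reproduced in this text.
MARK_FORMATS = {
    0: {"box_color": "green", "text_style": "normal"},
    1: {"box_color": "orange", "text_style": "bold"},
    2: {"box_color": "red", "text_style": "bold"},
}


def mark_format(severity: int) -> dict:
    """Select the mark format for a severity coefficient."""
    try:
        return MARK_FORMATS[severity]
    except KeyError:
        raise ValueError(
            f"no mark format configured for severity {severity}") from None


def render_marker(name: str, severity: int) -> str:
    """Render a text label for the marker using the selected format."""
    fmt = mark_format(severity)
    return f"[{fmt['box_color']}/{fmt['text_style']}] {name} (severity {severity})"
```

Keeping the table as data rather than code means the formats can be reconfigured without touching the marking logic.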

Optionally, in the embodiment of the present invention, the process of marking the video file may be performed periodically. To this end, a detection period may be preset, for example 10 seconds, so that detection is performed once every 10 seconds; it may also be set to 30 seconds, 1 minute, and so on. In practical applications, the value of the detection period can be set flexibly according to actual needs. The audio signal and its sound features are thus extracted from the recorded video file in each detection period, and the extracted sound features are matched against each audio event. The following situation can then arise: suppose a robbery occurs, during which sounds of threats and intimidation and of crying each last several seconds with no obvious long interruption in between, so that the same audio event or associated audio events keep occurring over a period of time. Therefore, the embodiment of the present invention can set an event merging rule under which such occurrences are merged into one audio event, improving the intelligence and accuracy of detection and marking. For example, the merging rule can be set to any of the following rules:

Merge when the same audio event is detected (with no limit on duration);

Merge when the same audio event is detected within M audio detection periods, where M is greater than or equal to 2; for example, if the detection period is 1 minute and M is 10, the same audio event is allowed to be merged within 10 minutes;

Merge when an associated audio event is detected (with no limit on duration);

Merge when an associated audio event is detected within N audio detection periods, where N is greater than or equal to 2.
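The four rules above can be expressed as one predicate with an optional limit on the number of detection periods. In this sketch the names `MergeRule`, `ASSOCIATED_PAIRS`, and `should_merge` are hypothetical, the association table is invented, identical events are assumed to count as trivially associated, and `span_periods` is assumed to count the detection periods covered by both detections inclusively (so two adjacent periods span 2).

```python
from enum import Enum
from typing import FrozenSet, Optional, Set


class MergeRule(Enum):
    SAME_EVENT = "same"              # merge only identical audio events
    ASSOCIATED_EVENT = "associated"  # merge identical or associated audio events


# Hypothetical association table: unordered pairs of related audio events.
ASSOCIATED_PAIRS: Set[FrozenSet[str]] = {frozenset({"scream", "crying"})}


def should_merge(prev_event: str, curr_event: str, span_periods: int,
                 rule: MergeRule, max_periods: Optional[int] = None) -> bool:
    """Decide whether two detections belong to one merged audio event.

    max_periods is M (or N) from the rules above; None means no limit
    on duration.
    """
    if max_periods is not None and span_periods > max_periods:
        return False
    if prev_event == curr_event:
        return True
    if rule is MergeRule.ASSOCIATED_EVENT:
        return frozenset({prev_event, curr_event}) in ASSOCIATED_PAIRS
    return False
```

For example, with a 1-minute detection period and `max_periods=10`, two detections of the same event are merged only if they fall within a 10-period span.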

In an application scenario of the embodiment of the present invention, when an audio event is detected for the first time in a certain detection period, the start time of the event is marked in the video file and the end time is left unmarked for the moment, pending the next detection period. If no audio event, or no identical or associated audio event, is detected in the next detection period, the end time of the audio event is marked as the start time of that next detection period.

In another application scenario of the embodiment of the present invention, when the sound features of the audio signals extracted in adjacent detection periods are each successfully matched with at least one audio event, that is, audio events occur in two adjacent detection periods, it is determined according to the preset event merging rule whether to merge the audio events occurring in the two adjacent detection periods. When it is determined that merging is performed, the start time of the audio event in the previous detection period is used as the start time of the audio event in the current detection period, and the end time is left unmarked for the moment, pending subsequent detection results. When it is determined that merging is not performed, the end time of the audio event in the previous detection period is set to the start time of the current detection period, and the start time of the audio event in the current detection period is also set to the start time of the current detection period.

After merging, the duration of the audio event changes, so the severity level corresponding to the audio event may also change. Optionally, after the audio events are merged, the method provided by the embodiment of the present invention may re-acquire the severity level corresponding to the audio event and update the mark accordingly when the level has changed. Thus, in an embodiment of the invention, the marking of the video file may also include the end time and/or duration of the audio event. The portion of the video file between the start time and the end time can be referred to as a marked video. In the subsequent alarm display, this marked video can be displayed in a targeted manner.
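The start and end time bookkeeping across detection periods can be sketched as follows. This is an illustrative sketch under assumptions: the names `MarkedEvent` and `update_timeline` are hypothetical, at most one audio event is considered per detection period, and a newly detected event is given the start time of the detection period in which it was found.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MarkedEvent:
    name: str
    start_s: float
    end_s: Optional[float] = None  # None while the end time is not yet marked


def update_timeline(timeline: List[MarkedEvent], detected: Optional[str],
                    period_start_s: float,
                    can_merge: Callable[[str, str], bool]) -> None:
    """Apply the detection result of one period to the list of marked events."""
    last = timeline[-1] if timeline else None
    open_event = last if last is not None and last.end_s is None else None
    if open_event is not None:
        if detected is not None and can_merge(open_event.name, detected):
            # Merged: keep the earlier start time and leave the end unmarked.
            return
        # Not merged, or nothing detected: the earlier event ends where
        # the current detection period starts.
        open_event.end_s = period_start_s
    if detected is not None:
        timeline.append(MarkedEvent(detected, period_start_s))


# Four 10-second detection periods; merging only identical events (assumed rule).
timeline: List[MarkedEvent] = []
results = [(0.0, "scream"), (10.0, "scream"), (20.0, None), (30.0, "gunshot")]
for start, event in results:
    update_timeline(timeline, event, start, lambda a, b: a == b)
```

After these four periods the two adjacent scream detections form a single event from 0 s to 20 s, and the gunshot event starting at 30 s still has no end time.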

FIG. 2 is a flowchart of a video monitoring method according to an embodiment of the present invention. The video monitoring method provided by the embodiment of the present invention may include the following steps, that is, S201 to S203:

S201: Perform monitoring video recording; in practical applications, video acquisition and synchronized audio collection can be performed by an image collector (such as a camera) and a pickup.

S202: During the video recording process, perform event marking on the recorded video file by the video marking method of any of the embodiments shown in FIG. 1.

S203: After an event mark is completed on the video file, display the event-marked portion of the video file as an alarm.

Through the video monitoring method provided by the embodiment of the present invention, monitoring personnel can see the video content of the audio event portion as early as possible and handle it in time. It should be noted that, according to the above embodiments, the event may still be in progress or may already have ended, depending on the event duration and the detection period.

Optionally, in the embodiment of the present invention, the alarm display of the marked portion of the video file may be implemented, for example, by sending an alarm to the background server. If the display device is currently displaying the real-time video of the image collector in the embodiment of the present invention, the display device displays the video content of the marked portion together with the corresponding event mark. If the display device is not displaying the real-time video of the image collector, an alarm message and a video link to the event-marked portion may be sent to the display device, and the user may play the video by clicking the video link. A function for switching to the real-time video at any time can also be provided in the embodiment of the present invention.

Optionally, in an embodiment of the invention, corresponding alarm handling may be required for audio events that occur (e.g., robbery, shooting). Therefore, when displaying on the display device, the embodiment of the present invention can also provide an alarm option bar, which can be integrated with the time point mark on the video progress bar so that the alarm options pop up when the user clicks it. In addition, the user may need to view a key event several times to make a determination. Therefore, the embodiment of the present invention can also provide a lookback function, which can likewise be integrated at a certain position on the video progress bar with a corresponding identifier (for example, on the time point mark); when the user needs to look back, the user clicks the identifier at the corresponding position.
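The two display cases described above amount to a small decision, sketched below. The names `DisplayAction` and `alarm_display_action`, the action labels, and the clip identifier are all hypothetical; how the alarm actually reaches the background server or the display device is not specified in the text and is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class DisplayAction:
    kind: str          # what the display device should do
    clip_ref: str      # reference to the event-marked portion of the video
    show_marker: bool  # whether the event mark is overlaid immediately


def alarm_display_action(showing_camera_live: bool, clip_ref: str) -> DisplayAction:
    """Choose how to present an alarm for the event-marked portion of the video."""
    if showing_camera_live:
        # The display already shows this camera: show the marked content
        # together with the corresponding event mark.
        return DisplayAction("play_marked_clip", clip_ref, True)
    # Otherwise send an alarm message carrying a link the user can click.
    return DisplayAction("send_alarm_with_link", clip_ref, False)


# "camera-1/clip-0001" is a made-up identifier used only for this example.
action = alarm_display_action(False, "camera-1/clip-0001")
```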

Through the technical solution provided by the embodiment of the present invention, the audio event occurring can be timely and accurately observed in the monitoring process, and a timely and accurate response can be made to ensure the security of the user's property and life.

FIG. 3 is a schematic structural diagram of a video marking apparatus according to an embodiment of the present invention. The video marking device 30 provided by the embodiment of the present invention may include: a feature extraction module 31, a processing module 32, and a marking module 33.

The feature extraction module 31 is configured to: extract sound features of the audio signal in the video file.

The video file in the embodiment of the present invention may include an audio signal and a synchronously recorded video signal.

The processing module 32 is configured to: match the sound feature extracted by the feature extraction module 31 with each audio event in the audio event library.

Each audio event in an embodiment of the invention is established based on the sound features of the audio signal produced when the event occurs. Different events generally each have one or more corresponding sound features. For example, a gunshot event may correspond to a gunshot sound, while other events may produce the sound of breaking glass, crying, a car horn, and so on. These are not enumerated one by one here. Those skilled in the art should understand that the audio events in the embodiments of the present invention can be flexibly configured according to the requirements in practical applications.

The marking module 33 is configured to: when the processing result of the processing module 32 is that the sound feature is successfully matched with at least one audio event, perform event marking on the audio event that occurred at the corresponding position in the video file.

Optionally, the functions of the feature extraction module 31, the processing module 32, and the marking module 33 in the embodiment of the present invention may be implemented by one processor, or may each be implemented independently. The feature extraction module 31 automatically extracts the sound features of the audio signal in the video file, and the processing module 32 matches them against each audio event. When an audio event is successfully matched (indicating that the audio event occurs in the video file), the marking module 33 automatically performs event marking at the corresponding position in the video file. The entire process requires no manual participation, so marking efficiency and accuracy can be well ensured.

FIG. 4 is a schematic structural diagram of another video marking apparatus according to an embodiment of the present invention. Based on the structure of the device shown in FIG. 3, the video tagging device 30 in the embodiment of the present invention may further include:

The video recording module 34 is configured to: perform video recording.

Optionally, the video recording module 34 can include a video capture device and a pickup. That is, the video tagging device 30 itself can be used as a monitoring device, which can cooperate with the monitoring platform to complete video monitoring in various scenarios.

The feature extraction module 31 is configured to: during the video recording performed by the video recording module 34, extract the sound features of the audio signal in the video file, after which the subsequent marking flow is completed via the processing module 32 and the marking module 33. This improves the timeliness of marking the video file and essentially achieves real-time marking. In the video monitoring field, seeing the marked alarm content one minute or even one second earlier may make a great difference to how the alarm event subsequently develops.

Optionally, in the embodiment of the present invention, the extraction of the sound feature of the audio signal by the feature extraction module 31 and the matching of the extracted sound feature with each audio event by the processing module 32 may be performed by the following process:

After transforming the audio signal into the time-frequency domain, the feature extraction module 31 extracts the background signal and the foreground signal of the audio signal, and extracts the sound feature set from the foreground signal. The feature extraction module 31 can separate the background signal and the foreground signal by behavioral modeling based on the neural mechanism of human hearing; this can eliminate the influence of the background signal.

The processing module 32 reads the audio events from the audio event library and calculates the similarity between the sound feature set extracted by the feature extraction module 31 and each audio event. When the similarity between the sound feature set and an audio event is greater than the set similarity threshold, it is determined that the audio event is successfully matched, that is, the audio event is determined to have occurred in the video file. Event marking can then be performed for the audio event that occurs in the video file.
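The threshold-based matching can be sketched as follows. This is a minimal example under stated assumptions: the description does not specify the similarity measure, so cosine similarity is used here as one common choice, and the three-dimensional templates in `EVENT_LIBRARY`, the threshold value 0.8, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def match_audio_events(feature_set, event_library, threshold=0.8):
    """Return the names of all library events whose similarity exceeds the threshold."""
    return [name for name, template in event_library.items()
            if cosine_similarity(feature_set, template) > threshold]

# Hypothetical templates; real sound feature sets would be far richer.
EVENT_LIBRARY = {
    "robbery": [0.9, 0.1, 0.4],
    "gunshot": [0.1, 0.95, 0.2],
}

matched = match_audio_events([0.85, 0.15, 0.45], EVENT_LIBRARY)
unmatched = match_audio_events([0.0, 0.0, 1.0], EVENT_LIBRARY)
```

Because the function returns a list, a single feature set may match more than one audio event, consistent with matching "at least one" audio event.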

Optionally, in the embodiment of the present invention, the marking module 33 performs event marking on the audio event that occurs at the corresponding position in the video file, including performing one or more of the following markings:

Obtaining and marking one or more of the direction information and the distance information of the sound source relative to the pickup in the audio event that occurs; in the embodiment of the present invention, when the sound source is located, event marking is performed only after a clear sound source direction (which may also be characterized by an angle) and/or distance is obtained, so that unclear sound source information does not mislead subsequent processing.

Obtaining and marking the name of the audio event that occurred.

Optionally, in the embodiment of the present invention, a corresponding severity level may be set for each audio event. For example, for a non-malignant and non-hazardous audio event, the severity level may be set to general, characterized by a coefficient of 0; for a vicious audio event, the severity level may be set to more serious, characterized by a coefficient of 1; for a dangerous audio event, the severity level may be set to severe, characterized by a coefficient of 2. When marking the severity level, the coefficient corresponding to each severity level can be marked directly.

The severity level of an event may be related not only to the event itself, but also closely to the location where the event occurs (such as a hotel, shop, school, residential area, or home) and to the duration of the event.

Optionally, in the embodiment of the present invention, in order to make subsequent viewing and processing easier for monitoring personnel, the event marking of the audio event at the corresponding position in the video file may include: when marking the severity level corresponding to the audio event, selecting the mark format corresponding to the severity level of the audio event according to a preset correspondence table between severity levels and mark formats. Optionally, the mark format in the embodiment of the present invention includes, but is not limited to, the different colors/formats adopted by the text box and the text.
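The correspondence table between severity levels and mark formats can be illustrated with a simple lookup. This is a sketch only: the coefficients 0, 1, and 2 follow the description, while the level names in `SEVERITY_NAMES`, the colors, the border flag, and the `TagFormat` type are hypothetical choices made for the example.

```python
from dataclasses import dataclass

# Severity coefficients as described above; the display names are assumptions.
SEVERITY_NAMES = {0: "general", 1: "more serious", 2: "severe"}

@dataclass(frozen=True)
class TagFormat:
    text_color: str   # color of the annotation text
    box_border: bool  # whether the text box is drawn with a border

# Hypothetical preset correspondence table: severity coefficient -> mark format.
TAG_FORMATS = {
    0: TagFormat("white", False),
    1: TagFormat("orange", True),
    2: TagFormat("red", True),
}

def format_for(severity):
    """Look up the mark format for a severity coefficient."""
    return TAG_FORMATS[severity]

def render_label(event_name, severity):
    """Build a plain-text stand-in for the on-screen annotation."""
    fmt = format_for(severity)
    return f"[{fmt.text_color}] {SEVERITY_NAMES[severity]}: {event_name}"

label = render_label("robbery", 1)
```

Keeping the table as data rather than code means that the formats can be changed through configuration without touching the marking logic.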

Optionally, FIG. 5 is a schematic structural diagram of still another video marking apparatus according to an embodiment of the present invention. Based on the structure of the device shown in FIG. 4, the video tagging apparatus 30 provided by the embodiment of the present invention may further include:

The cache module 35 is configured to: store an audio event library, which may be obtained from another server (for example, a monitoring server) or obtained directly from settings entered by the user. The cache module 35 is further configured to: cache the audio data and video data collected by the video recording module 34, and cache the various data marked by the marking module 33.

Optionally, in the embodiment of the present invention, the process of marking the video file may be performed periodically. To this end, the embodiment of the present invention may preset a detection period, for example 10 seconds, that is, detection is performed once every 10 seconds; the period can also be set to 30 seconds, 1 minute, and so on. In practical applications, the value of the detection period can be set flexibly according to actual needs. Thus, in each detection period, the feature extraction module 31 extracts the audio signal and its sound features from the recorded video file, and the processing module 32 matches the extracted sound features with each audio event. As a result, the same or associated audio events may be matched over multiple detection periods. Therefore, the embodiment of the present invention can set an event merging rule, according to which the marking module 33 can merge such detections into one audio event, improving the intelligence and accuracy of detection and marking. For example, the marking module 33 can merge according to any of the following merge rules:

Merging when the same audio event is detected (with no limit on duration);

Merging when the same audio event is detected within M detection periods, where M is greater than or equal to 2; for example, if the detection period is 2 minutes and M is 5, the same audio event is allowed to be merged within 10 minutes;

Merging when an associated audio event is detected (with no limit on duration).
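The merge rules listed above can be sketched as a single decision function. This is one possible reading, not a definitive implementation: the rule identifiers, the `Detection` type, the `ASSOCIATED` table, and the interpretation of "within M detection periods" as a period-index difference smaller than M (measured from the first detection of the ongoing event) are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    name: str    # name of the matched audio event
    period: int  # index of the detection period in which it was detected

# Hypothetical table of associated audio events (unordered pairs).
ASSOCIATED = {frozenset({"robbery", "scream"})}

def should_merge(first, current, rule, m=2):
    """Decide whether `current` is merged into the event that began at `first`."""
    same = first.name == current.name
    if rule == "same":            # same audio event, no limit on duration
        return same
    if rule == "same_within_m":   # same audio event within M detection periods
        return same and (current.period - first.period) < m
    if rule == "associated":      # associated audio event, no limit on duration
        return frozenset({first.name, current.name}) in ASSOCIATED
    raise ValueError(f"unknown merge rule: {rule}")

start = Detection("robbery", 0)
merge_same = should_merge(start, Detection("robbery", 7), "same")
merge_in_window = should_merge(start, Detection("robbery", 4), "same_within_m", m=5)
merge_out_of_window = should_merge(start, Detection("robbery", 5), "same_within_m", m=5)
merge_associated = should_merge(start, Detection("scream", 1), "associated")
```

With a 2-minute detection period and M equal to 5, the window rule above covers period indices 0 to 4, which corresponds to the 10-minute span given in the example.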

In an application scenario of the embodiment of the present invention, when the processing module 32 detects that an audio event occurs for the first time in a certain detection period, the marking module 33 first marks the start time of the event in the video file. The end time may be left unmarked for now, pending the detection result of the next detection period. If no audio event is detected in the next detection period, or no identical or associated audio event is detected, the end time of the audio event is marked as the start time of the next detection period.

In another application scenario of the embodiment of the present invention, when the processing module 32 finds that the sound features of the audio signals extracted in adjacent detection periods each match at least one audio event, that is, audio events occur in two adjacent detection periods, the marking module 33 may determine, according to the above preset event merging rule, whether to merge the audio events occurring in the two adjacent detection periods. When it is determined that merging is performed, the start time of the audio event in the previous detection period is used as the start time of the audio event in the current detection period; the end time is left unmarked for now, pending subsequent detection results. When it is determined that merging is not performed, the end time of the audio event in the previous detection period is set to the start time of the current detection period, and the start time of the audio event in the current detection period is likewise set to the start time of the current detection period.
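The handling of start and end times across two adjacent detection periods can be sketched as below. The function name `resolve_adjacent` and the use of `None` to represent an end time that is not yet marked are assumptions made for illustration; the calendar date is arbitrary.

```python
from datetime import datetime

def resolve_adjacent(prev_start, cur_period_start, merge):
    """Return (end time of the previous event, start time of the current event).

    merge=True : the current event inherits the previous start time and the
                 end time stays unmarked (None) pending later detection results.
    merge=False: the previous event ends, and the current event starts, at the
                 start of the current detection period.
    """
    if merge:
        return None, prev_start
    return cur_period_start, cur_period_start

prev_start = datetime(2017, 1, 1, 19, 50, 0)   # arbitrary date
cur_period = datetime(2017, 1, 1, 19, 50, 10)

merged = resolve_adjacent(prev_start, cur_period, merge=True)
separate = resolve_adjacent(prev_start, cur_period, merge=False)
```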

After the merging, the duration of the audio event changes, so the severity level corresponding to the audio event may also change. Optionally, in the embodiment of the present invention, after the marking module 33 merges the audio events that occurred, it can re-determine whether the severity level corresponding to the audio event has changed and, when it has, update the mark accordingly. Therefore, in the embodiment of the present invention, the marking performed by the marking module 33 on the video file may further include the end time and/or duration of the audio event. The portion of the video file between the start time and the end time may be called the tag video.

FIG. 6 is a schematic structural diagram of a video monitoring system according to an embodiment of the present invention. The video monitoring system 60 provided by the embodiment of the present invention may include: a monitoring processing device 61 and a video marking device 62 in any of the embodiments shown in FIGS. 3 to 5.

The video tagging device 62 is configured to: perform event marking on the video file recorded during the video monitoring process, and after performing event marking on the video file, raise an alarm to the monitoring processing device 61. The alarm process is also the activation process of the event mark, which can be performed by a mark activation module provided in the video tagging device 62.

The monitoring processing device 61 is configured to: after receiving the alarm from the video marking device 62, display the event-marked portion of the video file. It should be noted that, according to the above description, the event may still be occurring and not yet ended, or the event may already have ended, depending on factors such as the duration of the event and the detection period.

Optionally, in the embodiment of the present invention, the monitoring processing device 61 may be implemented by a background server in combination with a corresponding display device. The background server includes a storage medium for storing the audio event library and also for storing video data, alarm information, and the like from the video marking device 62.

Optionally, in the embodiment of the present invention, the video tagging device 62 may further include an interaction unit, which may be a display unit, configured to: receive tag video and real-time video from the marking module 33 for viewing, or receive or deliver various interactive messages.

Optionally, in the embodiment of the present invention, the video marking device 62 sends the event-marked portion of the video file to the monitoring processing device 61 as an alarm. If the monitoring processing device 61 was already displaying the real-time video of the image collector, the monitoring processing device 61 now displays the video content of the marked portion, with the corresponding event mark displayed accordingly. If the monitoring processing device 61 is not displaying the real-time video of the image collector, the alarm message and a video link to the event-marked portion may be sent to the monitoring processing device 61, and the user may click the link to play the video. A function for switching to real-time video at any time can also be provided in the embodiment of the present invention.

Optionally, in an embodiment of the invention, audio events that occur (e.g., robbery, shooting, etc.) may require corresponding alarm processing. Therefore, in the embodiment of the present invention, an alarm option bar can also be provided when the monitoring processing device 61 displays the video. The alarm option bar can be integrated with the time point mark on the video progress bar, so that the alarm option pops up when the user clicks the mark. In addition, considering that the user may need to view a key event several times before making a judgment, the embodiment of the invention can also provide a lookback function, which can likewise be integrated at a certain position on the video progress bar; to look back, the user only needs to click the mark at the corresponding position.

Optionally, in the embodiment of the present invention, the monitoring processing device 61 and the video marking device 62 can be combined to form a monitoring system. Because the video marking device 62 can mark the video in real time, audio events that occur can be viewed promptly and accurately during monitoring and responded to promptly and accurately, helping to protect the user's property and life.

For a better understanding of the video marking method, device, and video monitoring method and system provided by the embodiments of the present invention, the method provided by the embodiment of the present invention is exemplarily described below in conjunction with an actual monitoring scenario.

FIG. 7 is a schematic structural diagram of a video surveillance system according to an embodiment of the present invention. The components of the video surveillance system may include:

A camera and pickup module 71 (also referred to as a monitoring module or monitoring device), an audio event object 72, a network, a background server 73 (the core portion of the monitoring processing device 61), a management station 74, and a display component 75 (which may be a display or a separate display terminal, such as a mobile phone or a pad).

The camera and pickup module 71 may be a camera with a built-in pickup or a camera with an external pickup. If the pickup is external, audio and video synchronization is required.

The camera and pickup module 71 further includes a feature extraction module, a processing module, a marking module, and a cache module. The feature extraction module and the processing module are configured to: detect an audio event object 72 according to the collected audio signal. The marking module is configured to: according to the detected audio event object 72, acquire the mark attributes of the time point mark of the video event, edit the real-time video, add the time point mark, and add an eye-catching text box annotation on the tag video frame. The cache module is configured to: cache the audio event library, the acquired audio and video signals, the event marks, and the like. The feature extraction module is configured to: first separate the foreground signal and the background signal of the audio signal, and perform feature extraction on the foreground signal. The processing module is further configured to: compare the features of the foreground signal with the audio events of the event detection model library in the cache module; if the similarity exceeds the set threshold, one or more types of audio events are detected. The marking module can locate the sound source to obtain the sound source distance and sound source direction, and then determine the severity level. The processing module is further configured to: first determine whether to integrate the audio event; if so, integrate the audio event, obtain the start time, the end time, and the duration, integrate the audio detection conclusion, the sound source angle, and the sound source distance, and re-determine the severity level. The audio events within an integrated time period generate only one time point mark, which includes the mark start time and the mark end time and is saved to the cache module and synchronized to the database of the background server 73. The camera and pickup module 71 is connected to the background server 73 via the network and sends alarms to the background server 73.
If the real-time video of the camera was already being displayed, display continues, now showing the tag video with the mark attributes. If the real-time video of the camera was not being displayed, the background server displays the alarm message and the video link; clicking the link displays the tag video with the mark attributes, and the user can switch to live video at any time. The time point mark appears on the video progress bar. Clicking the mark allows the user to select alarm or rollback. If alarm is selected, the specified alarm number is dialed and the tag video marked with time and location is shared.

In addition, through the management station 74 (which may include an audio event management module, an audio event severity level determination management module, and a merge rule management module), the audio features of specific events, the severity level determination rules, the merge rules, and the like can be entered and managed.

The following describes the video monitoring system shown in FIG. 7 in conjunction with a robbery audio event. For example:

In a residential community, a man followed a young woman into the elevator and, armed with a weapon, committed robbery. The woman was frightened and shouted, "Don't come over, I'll give you the bag." This lasted for about 1 minute. After forcibly seizing the bag, the man began to rummage through it; taking advantage of this gap, the woman quickly pressed the elevator button for the nearest floor and ran out as soon as the elevator door opened. Since there are cameras and pickups in the elevator, video files are recorded and displayed in real time, through the network, on the monitoring screen of the property management room of the community. The process of implementing monitoring by using the video monitoring method provided by the embodiment of the present invention is as follows:

The monitoring device (including the camera and the pickup) is disposed in the elevator. After audio event detection is performed on the data collected by the pickup in the current detection period, the detected audio event is "robbery". The audio event E1 is pre-registered and the mark attributes of the corresponding time point mark S1 are recorded, including the mark start time (i.e., the current time), the severity level, the sound source distance, and the sound source direction. For example, the mark start time: 19:50:00, mark attribute: robbery | more serious | within 1 meter | upper right.

No audio event was detected during the last detection period, so no audio event integration is required. The audio event E1 is formally registered. The video in the time period from the mark start time to the mark end time is called a tag video, and the tag video can be appropriately extended in time; for example, the start time of the tag video is pushed forward n seconds and the end time is pushed back n seconds to give a more complete picture of the event. The start time of the tag video is then n seconds before 19:50:00. If n is 5, the start time is 19:49:55, and the end time is empty, indicating that the audio event is still occurring and has not ended.
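The extension of the tag video by n seconds on each side can be checked with a short calculation. This is a minimal sketch: the function name `tag_video_window` is hypothetical, an empty end time is represented by `None`, and the calendar date is arbitrary because the description gives only times of day.

```python
from datetime import datetime, timedelta

def tag_video_window(mark_start, mark_end, n_seconds=5):
    """Extend the tag video n seconds before the start and after the end.

    An empty end time (None) means that the audio event is still occurring,
    so the end of the window is left empty as well.
    """
    pad = timedelta(seconds=n_seconds)
    start = mark_start - pad
    end = mark_end + pad if mark_end is not None else None
    return start, end

mark_start = datetime(2017, 1, 1, 19, 50, 0)  # arbitrary date, time from the example
window_open = tag_video_window(mark_start, None, n_seconds=5)
window_closed = tag_video_window(
    mark_start, datetime(2017, 1, 1, 19, 51, 9), n_seconds=5)
```

With n equal to 5 and a mark start time of 19:50:00, the window starts at 19:49:55, matching the example.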

An alarm is sent to the background server of the community property management office. On the display screen corresponding to the camera in the community security room, a text annotation is displayed in the center of the current video frame: "More serious alarm: 19:50:00, robbery, upper right, within 1 meter", in a bordered orange size-3 font. The progress bar is also displayed; at 19:50:00 on the progress bar there is a highlighted time point mark annotated "robbery" in orange font. Clicking the time point mark on the progress bar pops up two buttons, "Alarm" and "Video Fallback to Mark". If "Alarm" is clicked, the specified alarm number is dialed and the link of the tag video is shared; if "Video Fallback to Mark" is clicked, the video falls back to the video starting at 19:49:55, and the user can right-click and select "watch live video" to return to live video.

If the detection period is set to 10 seconds, the next detection period starts at 19:50:11, and audio and video acquisition and audio event detection are again performed according to the previous steps. After audio event detection is performed on the data collected by the pickup of the monitoring device, the detection result is "robbery". The audio event is pre-registered as E2, and the current time, severity level, sound source distance, and sound source direction are recorded. For example, the mark start time: 19:50:11, mark attribute: robbery | more serious | within 1 meter | upper right.

E2 and E1 are merged according to the event integration decision rule. The mark attributes of the event mark S1 of the audio event E1, for example the duration, are updated, and the severity level is re-determined according to the severity level determination rule.

The above detection process is repeated for 1 minute 9 seconds, until 19:51:09.

The start time of the next detection period is 19:51:10. Audio and video acquisition and audio event detection are again performed according to the previous steps. After audio event detection is performed on the data collected by the pickup of the camera, no event is detected. Since the mark end time of the time point mark S1 of the last audio event E1 is empty, the mark end time of S1 is set to 19:51:09. In the community security room, normal video is played on the display screen corresponding to the camera, with no text annotation in the center of the video. The progress bar is displayed; at 19:50:00 on the progress bar there is a time point mark annotated in orange font "robbery: 1 minute 09 seconds", which is no longer highlighted. Clicking the time point mark on the progress bar pops up two buttons, "Alarm" and "Video Fallback to Mark". If "Alarm" is clicked, the specified alarm number is dialed and the link of the tag video of the time point mark S1 is shared; if "Video Fallback to Mark" is clicked, the video falls back to the video starting at 19:49:55, and the user can right-click and select "watch live video" to return to live video.

For the above process, please refer to FIG. 8 , which is a flowchart of still another video monitoring method according to an embodiment of the present invention. The method provided by the embodiment of the present invention may include the following steps, that is, S801 to S820:

S801: an event detection period starts, and at time T1, an audio event is detected according to the above-mentioned video marking method (that is, matching of an audio event is performed);

S802: determining whether an audio event is detected; when it is determined that an audio event is not detected, executing S803; when it is determined that an audio event is detected, executing S805;

S803: determining whether the marking end time of the last audio event E0 is empty; when it is determined that it is empty, executing S804; when it is determined that it is not empty, executing S801 (waiting for the arrival of the next event detecting period);

S804: set the mark end time of the last audio event E0 to the previous second of T1, re-determine the severity level of E0, update the event flag S0 of the audio event E0, activate the flag S0; then execute S811;

S805: pre-register the audio event E1, record the start time T1 and the mark attribute of the corresponding event mark S1;

S806: determining whether the audio event E1 is integrated with the last audio event E0; if integrated, executing S807; if not, executing S808;

S807: Integrate the audio event E1 with the audio event E0, re-determine the severity level of E0, update the event flag S0, delete E1, activate the flag S0; then execute S811;

S808: determining whether the end time of the marking of the audio event E0 is empty; when it is judged to be empty, executing S809; when it is determined that it is not empty, executing S810;

S809: Recording the end time of the time point mark of the audio event E0 is the previous second of T1, re-determining the severity level of E0, updating the event flag S0; then executing S810;

S810: formally registering the audio event E1, where the event mark S1 starts at time T1 and its end time is empty, and activating the mark S1; then executing S811;

S811: determining whether the video of the camera is being played; when it is determined that the video is being played, executing S814; when it is determined that the video is not playing, executing S812;

S812: displaying the alarm message and the tag video link of the current audio event on the display screen (the display manner may vary; for example, the messages may be displayed in the right area of the screen and sorted by event start time from the time the monitoring personnel came on duty; if the monitoring personnel never view the tag video link of an audio event, multiple event alarm messages may accumulate after a period of time; for the same audio event whose mark attributes are updated multiple times, the alarm messages need to be merged);

S813: determining whether the tag video link of an audio event is clicked; if yes, executing S814; if not, executing S812 and continuing to display, to wait for the monitoring personnel to click to view the tag video (alternatively, it may be set so that when an alarm message of an audio event with a severe severity level is received, the screen actively switches to the tag video of that audio event);

S814: playing a tag video, simultaneously displaying an event tag attribute and a start time of the corresponding audio event on the screen, and displaying a progress bar, and displaying an audio event flag on a start time of the time point mark corresponding to the audio event on the progress bar;

S815: when the event mark of the current audio event is clicked, popping up the "Alarm" and "Video Fallback to Mark" buttons;

S816: judging whether to click "alarm"; if clicked, executing S817; if not clicking, executing S818;

S817: Alert the designated terminal by phone or SMS or other specified means, share the tag video link of the audio event; then end the process.

S818: judging whether to click "video to fall back to the mark"; if clicked, execute S819; if not click, execute S816, and then continue to determine whether to click "alarm";

S819: playing the tag video rolled back to the time point of the current audio event;

S820: when "watch live video" is selected during viewing, switching to real-time video;

The process ends.
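The detection part of the flow (S802 to S810) can be sketched as a single step function applied once per detection period. This is a simplified illustration under stated assumptions: the names `Mark` and `detection_step` are hypothetical, severity re-determination and the display steps S811 to S820 are omitted, merging is only attempted with a last event whose end time is still empty, and the simulation assumes detection periods exactly 10 seconds apart on an arbitrary date.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Mark:
    name: str
    start: datetime
    end: Optional[datetime] = None  # None means the end time is still empty

ONE_SECOND = timedelta(seconds=1)

def detection_step(marks, t1, detected, can_merge):
    """Apply S802-S810 for the detection period starting at t1.

    marks:     list of Mark; the last element plays the role of E0
    detected:  name of the detected audio event, or None if nothing matched
    can_merge: callable(prev_name, cur_name) -> bool, the merge rule of S806
    Returns the mark to activate, or None when there is nothing to activate.
    """
    last = marks[-1] if marks else None
    open_last = last is not None and last.end is None
    if detected is None:                    # S802 -> S803
        if open_last:                       # S804: close E0 one second before T1
            last.end = t1 - ONE_SECOND
            return last
        return None                         # wait for the next detection period
    # S805: E1 is pre-registered with start time T1
    if open_last and can_merge(last.name, detected):
        return last                         # S807: E1 is folded into E0
    if open_last:                           # S808 -> S809: close E0
        last.end = t1 - ONE_SECOND
    new_mark = Mark(detected, t1)           # S810: formally register E1
    marks.append(new_mark)
    return new_mark

# Replay of the robbery example with a "same event" merge rule.
same_event = lambda prev, cur: prev == cur
marks = []
t = datetime(2017, 1, 1, 19, 50, 0)
for _ in range(7):                          # periods at 19:50:00 ... 19:51:00
    detection_step(marks, t, "robbery", same_event)
    t += timedelta(seconds=10)
detection_step(marks, t, None, same_event)  # period at 19:51:10, nothing detected
duration = marks[0].end - marks[0].start
```

In this replay only one mark is produced, starting at 19:50:00 and ending at 19:51:09, which gives the duration of 1 minute 9 seconds described in the example.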

The video marking method and the video monitoring method provided by the embodiments of the present invention can quickly locate, during video monitoring, the moment at which a specific behavior or a specific event occurs in the video, so that video monitoring personnel can find problems quickly and work more efficiently.

Embodiments of the present invention also provide a computer readable storage medium storing computer executable instructions that, when executed, perform the following operations, namely S11 to S13:

S11. Extract a sound feature of the audio signal in the video file;

S12. Match the extracted sound features with each audio event in the audio event library; each audio event is established based on the sound characteristics of the audio signal generated when the event occurs;

S13. When the sound feature is successfully matched with at least one audio event, perform event marking on the audio event that occurs at the corresponding position in the video file.

Embodiments of the present invention also provide a computer readable storage medium storing computer executable instructions that, when executed, perform the following operations, namely S21 to S23:

S21. Perform monitoring video recording;

S22. In the process of video recording, perform event marking on the recorded video file by using a video marking method as described above;

S23. After completing an event mark on the video file, display the event-marked portion of the video file as an alarm.

The above are only optional embodiments of the present invention and are not intended to limit the scope of protection of the embodiments of the present invention. For those skilled in the art, various changes and modifications may be made to the embodiments of the present invention. Any modifications, equivalent substitutions, improvements, etc. within the spirit and scope of the present invention are intended to be included within the scope of the present invention.

One of ordinary skill in the art will appreciate that all or a portion of the steps of the above-described embodiments can be implemented using a computer program flow. The computer program can be stored in a computer readable storage medium and executed on a corresponding hardware platform (such as a system, device, or apparatus), and when executed, includes one or a combination of the steps of the method embodiments.

Alternatively, all or part of the steps of the above embodiments may also be implemented by using integrated circuits. These steps may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module.

The device/function module/functional unit in the above embodiment can be implemented by using a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network of multiple computing devices.

When the device/function module/functional unit in the above embodiment is implemented in the form of a software function module and sold or used as a stand-alone product, it can be stored in a computer readable storage medium. The above mentioned computer readable storage medium may be a read only memory, a magnetic disk or an optical disk or the like.

Industrial applicability

In the embodiment of the present invention, the sound feature of the audio signal in the video file is extracted and matched with each audio event in the audio event library. When the extracted sound feature is successfully matched with at least one audio event, this indicates that the audio event occurs in the video file, and the audio event occurring in the video file is event-marked; the audio event is established in advance based on the sound characteristics of the audio signal generated when the event occurs. By setting audio events in advance and then matching the sound feature of the audio signal in the video file against each audio event to determine whether marking is needed, the embodiment of the present invention makes it unnecessary to manually view the video content to decide whether to mark, so the efficiency and accuracy of marking video files can be greatly improved.

Claims

A video marking method, including:

Extracting sound characteristics of the audio signal in the video file;

The extracted sound features are matched to each audio event in the audio event library; each of the audio events is established based on a sound characteristic of the audio signal generated at the time of the event;

When the sound feature is successfully matched with at least one of the audio events, an event flag is generated for the audio event that occurs at a corresponding location in the video file.
The video marking method according to claim 1, wherein the extracting the sound characteristics of the audio signal in the video file comprises:

The sound characteristics of the audio signal in the video file are extracted during video recording.
The video marking method according to claim 1, wherein said extracting sound characteristics of said audio signal comprises:

Converting the audio signal to a time-frequency domain and extracting a foreground signal of the audio signal;

The matching the extracted sound features with each of the audio events comprises:

Extracting a sound feature set from the foreground signal, and calculating a similarity between the sound feature set and each of the audio events, wherein the matching is successful when the obtained similarity is greater than a set similarity threshold.
The video marking method according to any one of claims 1 to 3, wherein the performing event marking on the audio event that occurs at the corresponding location in the video file includes performing one or more of the following markings:

Marking a start time of occurrence of the audio event at a key video frame position of the video file;

Acquiring and marking one or more of direction information and distance information of the sound source relative to the pickup in the audio event that occurs;

Obtaining and marking the severity level corresponding to the audio event that occurred;

Obtaining and marking the name of the audio event that occurred.
The video marking method according to claim 4, wherein the acquiring the severity level corresponding to the audio event that occurs includes:

Determining the severity level corresponding to the audio event according to one or more of the location information of the recorded video file and the duration after the occurrence of the audio event.
The video tagging method according to claim 5, wherein the performing event marking on the audio event that occurs at the corresponding location in the video file includes:

When the severity level corresponding to the audio event is marked, the marking format corresponding to the severity level of the audio event is selected according to the correspondence table of the severity level and the marking format.
The video marking method according to any one of claims 1 to 3, wherein the extracting the sound feature of the audio signal in the video file comprises: extracting a sound feature of the audio signal in the video file according to a preset detection period;

The method further includes:

When the sound features of the audio signals extracted in two adjacent detection periods are each successfully matched with at least one of the audio events, determining, according to a preset event merging rule, whether to merge the audio events occurring in the two adjacent detection periods;

When it is determined that the merging is performed, the start time of the audio event in the previous detection period is taken as the start time of the audio event in the current detection period;

When it is determined that the merging is not performed, the end time of the audio event in the previous detection period and the start time of the audio event in the current detection period are both set to the start time of the current detection period.
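The two branches above can be sketched as follows. The merging rule itself is left open by the claim, so the rule used here (merge when both periods matched the same event name) is only an assumed example, and the `PeriodEvent` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PeriodEvent:
    name: str                      # matched audio event name
    start_s: float                 # start time of the event
    end_s: Optional[float] = None  # end time, unknown while the event is ongoing

def should_merge(prev: PeriodEvent, curr: PeriodEvent) -> bool:
    """Assumed merging rule: the same event was matched in both adjacent periods."""
    return prev.name == curr.name

def resolve_adjacent(prev: PeriodEvent, curr: PeriodEvent,
                     current_period_start_s: float) -> Tuple[PeriodEvent, PeriodEvent]:
    """Apply the merge / no-merge time assignments to two adjacent detection periods."""
    if should_merge(prev, curr):
        # Merged: the current event inherits the earlier start time.
        curr.start_s = prev.start_s
    else:
        # Not merged: the previous event ends, and the current one starts,
        # at the start of the current detection period.
        prev.end_s = current_period_start_s
        curr.start_s = current_period_start_s
    return prev, curr
```

With a 5-second detection period, a scream matched in the periods starting at 10 s and 15 s is treated as one event starting at 10 s, whereas a glass break followed by a scream yields two events split at 15 s.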
A video monitoring method, comprising:

Performing surveillance video recording;

During the video recording, performing event marking on the recorded video file by the video marking method according to any one of claims 1-7;

After event marking of the video file is completed, displaying the event-marked portion of the video file as an alarm.
A video marking device comprising:

a feature extraction module, configured to: extract a sound feature of the audio signal in the video file;

a processing module, configured to: match the sound feature extracted by the feature extraction module against each audio event in an audio event library, each audio event being established based on the sound characteristics of the audio signal generated when the event occurs;

and a marking module, configured to: when the processing result of the processing module is that the sound feature is successfully matched with at least one of the audio events, perform event marking for the audio event that occurs at the corresponding position in the video file.
The video marking device of claim 9, further comprising:

a video recording module, configured to: perform video recording;

wherein the feature extraction module is configured to: extract a sound feature of the audio signal in the video file during video recording by the video recording module.
The video marking device according to claim 9 or 10, wherein the marking module performing event marking for the audio event that occurs at the corresponding position in the video file includes performing one or more of the following:

Marking a start time of occurrence of the audio event at a key video frame position of the video file;

Acquiring and marking one or more of direction information and distance information of the sound source relative to the pickup in the audio event that occurs;

Obtaining and marking the severity level corresponding to the audio event that occurred;

Acquiring and marking the name of the audio event that occurred.
A video surveillance system comprising: a monitoring processing device and the video marking device of any one of claims 9-11;

The video marking device is configured to: perform event marking on the video file recorded during video monitoring, and notify the monitoring processing device after event marking of the video file is completed;

The monitoring processing device is configured to: after receiving the notification from the video marking device, display the event-marked portion of the video file as an alarm.
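The interaction between the two devices can be sketched as a simple notification callback. Everything below is an assumption for illustration: the class and method names are hypothetical, the `detect` function stands in for the feature extraction and matching steps, and appending to a list stands in for an actual alarm display.

```python
from typing import Callable, Dict, List, Tuple

class MonitoringProcessingDevice:
    """Receives notifications and records which marked portions to show as alarms."""

    def __init__(self) -> None:
        self.alarms: List[Tuple[str, Dict]] = []

    def on_marked(self, video_id: str, marker: Dict) -> None:
        # Stand-in for displaying the event-marked portion as an alarm.
        self.alarms.append((video_id, marker))

class VideoMarkingDevice:
    """Marks matched audio events and notifies the monitoring processing device."""

    def __init__(self, detect: Callable[[List[float]], List[str]],
                 listener: MonitoringProcessingDevice) -> None:
        self.detect = detect      # hypothetical: audio chunk -> matched event names
        self.listener = listener

    def process_period(self, video_id: str, period_start_s: float,
                       audio_chunk: List[float]) -> List[str]:
        names = self.detect(audio_chunk)
        for name in names:
            marker = {"event": name, "start_s": period_start_s}
            # Notify only after the marker for this event is complete.
            self.listener.on_marked(video_id, marker)
        return names
```

Passing the marker along with the notification lets the monitoring side locate the marked portion without re-scanning the video file, which is one plausible reading of "the video file of the event marking portion".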
A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, perform the following operations:

Extracting sound characteristics of the audio signal in the video file;

The extracted sound features are matched to each audio event in the audio event library; each of the audio events is established based on a sound characteristic of the audio signal generated at the time of the event;

When the sound feature is successfully matched with at least one of the audio events, event marking is performed for the audio event that occurs at the corresponding position in the video file.
A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, perform the following operations:

Performing surveillance video recording;

During the video recording, performing event marking on the recorded video file by the video marking method according to any one of claims 1-7;

After event marking of the video file is completed, displaying the event-marked portion of the video file as an alarm.