CN113114986B - Early warning method based on picture and sound synchronization and related equipment


Info

Publication number: CN113114986B
Authority: CN (China)
Prior art keywords: target, audio data, text information, sound collection, early warning
Legal status: Active
Application number: CN202110353106.XA
Other languages: Chinese (zh)
Other versions: CN113114986A
Inventor: 唐军
Current Assignee: Guangxi Guanbiao Technology Co ltd
Original Assignee: Shenzhen Soyo Technology Development Co ltd
Application filed by Shenzhen Soyo Technology Development Co ltd
Priority: CN202110353106.XA
Publication of application: CN113114986A
Application granted; publication of grant: CN113114986B


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/18: Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172: Classification, e.g. identification
    • G06V 40/174: Facial expression recognition
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • H04N 23/00: Cameras or camera modules comprising electronic image sensors; control thereof
    • H04N 23/60: Control of cameras or camera modules
    • H04N 23/695: Control of camera direction for changing a field of view, e.g. pan, tilt or based on tracking of objects
    • H04N 5/00: Details of television systems
    • H04N 5/04: Synchronising

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application provides an early warning method based on picture and sound synchronization, and related equipment. The method comprises the following steps: acquiring audio data collected by a plurality of sound collection devices; analyzing the audio data and determining a target sound collection device from the plurality of sound collection devices based on sensitive-word filtering; adjusting the angle of a video monitoring device to capture the live picture of the object recorded by the target sound collection device; predicting the behavior of the object using the target audio data collected by the target sound collection device and the live picture; and outputting early warning information when the behavior is a preset behavior. The embodiments of the application help improve the timeliness of early warning output and prevent potential security hazards from materializing.

Description

Early warning method based on picture and sound synchronization and related equipment
Technical Field
The present application relates to the technical field of security monitoring, and in particular to an early warning method based on picture and sound synchronization and to related equipment.
Background
With the development of video image processing technology, video monitoring has become one of the most powerful tools in the security field: commercial, office, and leisure or tourist venues are all commonly equipped with video monitoring devices. Staff can issue manual warnings in emergencies based on the monitoring pictures collected by these devices, and a server can likewise issue automatic warnings by analyzing the monitoring pictures. However, most existing monitoring pictures are "silent": both staff and servers can only raise warnings from the picture alone, so the value of on-site sound is routinely ignored. Warning information is therefore output only when an emergency is already occurring or imminent, and this lack of timeliness often leads to immeasurable consequences.
Disclosure of Invention
In view of these problems, the application provides an early warning method based on picture and sound synchronization, and related equipment, which synchronize the monitoring picture with on-site sound so as to improve the timeliness of early warning output.
To achieve the above object, a first aspect of the embodiments of the present application provides an early warning method based on picture and sound synchronization, the method comprising:
acquiring audio data collected by a plurality of sound collection devices;
analyzing the audio data and determining a target sound collection device from the plurality of sound collection devices based on sensitive-word filtering;
adjusting the angle of a video monitoring device to capture the live picture of the object recorded by the target sound collection device;
predicting the behavior of the object using the target audio data collected by the target sound collection device and the live picture;
and outputting early warning information when the behavior is a preset behavior.
With reference to the first aspect, in a possible implementation, the audio data are obtained by the plurality of sound collection devices performing forward error correction coding on the collected sound, and the analyzing the audio data and determining a target sound collection device from the plurality of sound collection devices based on sensitive-word filtering comprises:
performing forward error correction decoding on the audio data to obtain corresponding audio signals;
converting the audio signals into text information;
determining target text information from the text information based on sensitive-word filtering;
determining the audio data corresponding to the target text information as the target audio data;
querying the sound collection device identifier corresponding to the target audio data;
and determining the target sound collection device from the plurality of sound collection devices according to the sound collection device identifier.
With reference to the first aspect, in a possible implementation, the determining target text information from the text information based on sensitive-word filtering comprises:
performing word segmentation and part-of-speech tagging on the text information and retaining the nouns, adjectives and verbs, where the retained nouns, adjectives and verbs form a candidate keyword set;
constructing a candidate keyword graph from the candidate keyword set, where each node in the candidate keyword graph represents one candidate keyword in the candidate keyword set;
calculating the weight, within the text information, of each candidate keyword in the candidate keyword graph;
performing weighted random sampling on the nodes in the candidate keyword graph based on the weights to obtain target candidate keywords;
calculating the matching degree between each target candidate keyword and each word in a preset first sensitive word set, a preset second sensitive word set and a preset third sensitive word set;
determining the target candidate keywords whose matching degree is greater than or equal to a preset value as sensitive words;
and determining the text information containing the sensitive words as the target text information.
With reference to the first aspect, in a possible implementation, before outputting the early warning information, the method further comprises:
obtaining information to be sent based on the target text information, where the information to be sent is used to warn the object;
converting the information to be sent into a digital signal;
and performing forward error correction coding on the digital signal to obtain the early warning information.
With reference to the first aspect, in a possible implementation, the predicting the behavior of the object using the target audio data collected by the target sound collection device and the live picture comprises:
obtaining a first emotion label of the object based on the target text information corresponding to the target audio data;
extracting image frames from the live picture to obtain a plurality of image frame sequences;
obtaining a second emotion label of the object based on the plurality of image frame sequences;
obtaining a feature map to be classified based on the plurality of image frame sequences;
forming a matrix to be classified from the first emotion label, the second emotion label and the feature map to be classified;
and classifying the matrix to be classified to obtain the behavior of the object.
With reference to the first aspect, in a possible implementation, the obtaining a second emotion label of the object based on the plurality of image frame sequences comprises:
performing face detection on each frame in the plurality of image frame sequences and cropping a face region image from each frame based on the face detection;
performing face action unit recognition on the face region images;
and obtaining the second emotion label from the face action unit recognition results.
A second aspect of the embodiments of the present application provides an early warning device based on picture and sound synchronization, the device comprising:
an audio acquisition module, configured to acquire audio data collected by a plurality of sound collection devices;
an audio analysis module, configured to analyze the audio data and determine a target sound collection device from the plurality of sound collection devices based on sensitive-word filtering;
a synchronization module, configured to adjust the angle of a video monitoring device to capture the live picture of the object recorded by the target sound collection device;
a behavior prediction module, configured to predict the behavior of the object using the target audio data collected by the target sound collection device and the live picture;
and an alarm module, configured to output early warning information when the behavior is a preset behavior.
A third aspect of the embodiments of the present application provides an electronic device comprising an input device, an output device, a processor adapted to implement one or more instructions, and a computer storage medium storing one or more instructions adapted to be loaded by the processor to perform the following steps:
acquiring audio data collected by a plurality of sound collection devices;
analyzing the audio data and determining a target sound collection device from the plurality of sound collection devices based on sensitive-word filtering;
adjusting the angle of a video monitoring device to capture the live picture of the object recorded by the target sound collection device;
predicting the behavior of the object using the target audio data collected by the target sound collection device and the live picture;
and outputting early warning information when the behavior is a preset behavior.
A fourth aspect of the embodiments of the present application provides a computer storage medium storing one or more instructions adapted to be loaded by a processor to perform the following steps:
acquiring audio data collected by a plurality of sound collection devices;
analyzing the audio data and determining a target sound collection device from the plurality of sound collection devices based on sensitive-word filtering;
adjusting the angle of a video monitoring device to capture the live picture of the object recorded by the target sound collection device;
predicting the behavior of the object using the target audio data collected by the target sound collection device and the live picture;
and outputting early warning information when the behavior is a preset behavior.
The scheme of the application provides at least the following beneficial effects. Compared with the prior art, the application acquires audio data collected by a plurality of sound collection devices; analyzes the audio data and determines a target sound collection device from the plurality of sound collection devices based on sensitive-word filtering; adjusts the angle of a video monitoring device to capture the live picture of the object recorded by the target sound collection device; predicts the behavior of the object using the target audio data collected by the target sound collection device and the live picture; and outputs early warning information when the behavior is a preset behavior. In this way, the target sound collection device, i.e. the one collecting on-site sound that suggests a security hazard, is determined from the plurality of sound collection devices based on the audio data and sensitive-word filtering; the video monitoring device then captures the live picture to synchronize picture and sound; the behavior of the on-site object is predicted from the on-site sound (i.e. the target audio data) and the live picture; and the object is warned when it is about to carry out a preset behavior. This improves the timeliness of early warning output and prevents security hazards from materializing.
Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for a person of ordinary skill in the art, other drawings may be obtained from these drawings without inventive effort.
Fig. 1 is a schematic diagram of an application environment provided in an embodiment of the present application;
Fig. 2 is a schematic flow chart of an early warning method based on picture and sound synchronization provided in an embodiment of the present application;
Fig. 3 is a schematic diagram of extracting video frames provided in an embodiment of the present application;
Fig. 4 is a schematic flow chart of another early warning method based on picture and sound synchronization provided in an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an early warning device based on picture and sound synchronization provided in an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of protection of the present application.
The terms "comprising" and "having", and any variations thereof, as used in the specification, claims and drawings, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article or apparatus that comprises a list of steps or units is not limited to the listed steps or units, but may include steps or units that are not listed or that are inherent to such a process, method, article or apparatus. Furthermore, the terms "first", "second", "third" and the like are used to distinguish between different objects, not to describe a particular order.
The embodiment of the application provides an early warning method based on picture and sound synchronization, which can be implemented in the application environment shown in Fig. 1. As shown in Fig. 1, the application environment comprises a plurality of sound collection devices 11, an audio processing device 12, a video monitoring device 13, an audio playing device 14 and a display device 15. Optionally, the sound collection device 11 and the audio playing device 14 may be mutually independent devices, or an integrated device combining sound collection and audio playback. The audio processing device 12 is communicatively connected to the sound collection devices 11 and the audio playing devices 14 through a wireless communication technology, which may optionally be the long-range radio (LoRa) wireless transmission technology. In a specific implementation, each sound collection device 11 collects the sound within its coverage area, encodes the collected sound, and sends the encoded audio data to the audio processing device 12. The audio processing device 12 analyzes the received audio data to determine that a potential security hazard may exist at the scene covered by one or more sound collection devices 11, then steers a video monitoring device 13 within a preset range of that sound collection device 11 to a suitable angle so as to accurately capture the live picture around the sound collection device 11. The captured live picture is sent to the audio processing device 12 and displayed in real time on the display device 15, while the audio processing device 12 may play the sound synchronized with the live picture through a peripheral. The audio processing device 12 predicts the behavior of the person on site based on the live picture and the on-site sound from the sound collection device 11; if it predicts that a preset behavior (such as an aggressive act) is about to occur, it outputs early warning information to the audio playing device 14, which decodes and plays the early warning information to alert the object.
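To make the division of labor among the devices of Fig. 1 concrete, the following minimal Python sketch models the message flow just described. Every name in it is a hypothetical illustration, not an identifier from the patent, and each stage is reduced to a stub that the later sketches flesh out:

```python
# A hypothetical, heavily simplified rendering of the Fig. 1 message flow;
# every name is illustrative and none comes from the patent itself.

def fec_decode(payload: bytes) -> bytes:
    return payload                      # stub: real FEC decoding is sketched at S21

def transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")        # stub: stands in for streaming speech recognition

def contains_sensitive_words(text: str) -> bool:
    return "knife" in text              # stub: stands in for the keyword-graph filter

def predict_behavior(text: str, frames: list) -> str:
    return "aggressive_act"             # stub: stands in for the trained models of S24

class Camera:                           # video monitoring device 13
    def point_at(self, device_id: str): print(f"camera turned toward {device_id}")
    def capture(self) -> list: return ["frame"]

class Display:                          # display device 15
    def show(self, frames: list): print(f"showing {len(frames)} live frame(s)")

class Speaker:                          # audio playing device 14
    def play(self, data: bytes): print("playing early warning on site")

def on_audio_packet(device_id: str, payload: bytes,
                    camera: Camera, display: Display, speaker: Speaker) -> None:
    """What the audio processing device 12 does with each incoming packet."""
    text = transcribe(fec_decode(payload))
    if not contains_sensitive_words(text):
        return                          # no hazard detected at this device's scene
    camera.point_at(device_id)          # synchronize the picture with the sound
    frames = camera.capture()
    display.show(frames)
    if predict_behavior(text, frames) == "aggressive_act":
        speaker.play(b"...")            # warning is FEC-coded in the real pipeline

on_audio_packet("SC-02", b"give me the knife", Camera(), Display(), Speaker())
```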
Referring to Fig. 2, Fig. 2 is a flow chart of an early warning method based on picture and sound synchronization according to an embodiment of the present application. The method may be implemented in the application environment shown in Fig. 1 and, as shown in Fig. 2, comprises steps S21-S25:
S21, acquiring audio data collected by a plurality of sound collection devices.
In this embodiment of the application, the audio data are obtained by the plurality of sound collection devices 11 encoding the collected sound, for example by forward error correction coding. The sound collection devices 11 send the encoded audio data to the audio processing device 12 through a wireless communication technology, and the audio processing device 12 may store the identifier of each sound collection device 11 in association with the corresponding audio data. The identifier of a sound collection device 11 is a unique code: it may be the serial number of the device itself, or a code customized by the staff.
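Forward error correction lets the receiver repair bit errors introduced on the wireless link without any retransmission. The patent does not name a particular FEC scheme, so the self-contained sketch below uses a Hamming(7,4) code, chosen purely as the simplest illustration: each 4-bit block is expanded to 7 bits, and any single flipped bit per block can be corrected.

```python
def hamming74_encode(nibble: list[int]) -> list[int]:
    """Encode 4 data bits d1..d4 into 7 bits with parity bits p1, p2, p3."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(block: list[int]) -> list[int]:
    """Correct up to one flipped bit, then return the 4 data bits."""
    b = block[:]
    # The syndrome bits locate the erroneous position (1-indexed); 0 means no error.
    s1 = b[0] ^ b[2] ^ b[4] ^ b[6]
    s2 = b[1] ^ b[2] ^ b[5] ^ b[6]
    s3 = b[3] ^ b[4] ^ b[5] ^ b[6]
    pos = s1 * 1 + s2 * 2 + s3 * 4
    if pos:
        b[pos - 1] ^= 1   # flip the corrupted bit back
    return [b[2], b[4], b[5], b[6]]

# A single-bit error on the wireless link is corrected transparently.
sent = hamming74_encode([1, 0, 1, 1])
sent[3] ^= 1                           # simulate channel noise
assert hamming74_decode(sent) == [1, 0, 1, 1]
```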
S22, analyzing the audio data, and determining a target sound collection device from the plurality of sound collection devices based on sensitive-word filtering.
In this embodiment of the application, the target sound collection device is the sound collection device 11 whose audio data contain sensitive words. After receiving the audio data, the audio processing device 12 decodes them, checks for and corrects erroneous data through forward error correction, obtains the audio signal corresponding to the audio data of each sound collection device 11, and stores the audio signals in separate streams. For example, if the target sound collection device records the voices of two people, there are two corresponding audio signal streams. For the separately stored audio signals, the audio processing device 12 may invoke an idle streaming speech recognition resource to recognize each stream, convert the speech signal into text information, and then determine target text information from the text information corresponding to each stream through keyword recognition, where the target text information is the text information containing sensitive words. Based on the correspondence between audio signals and audio data, the audio data corresponding to the target text information are the target audio data, and the target sound collection device can be determined from the plurality of sound collection devices 11 by querying the sound collection device identifier associated with the target audio data.
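The association between device identifier, audio data and per-stream transcripts can be kept in a simple lookup structure, and the target device then falls out of that association once the target text information is found. The sketch below uses hypothetical record and device names and a placeholder sensitive-word test; the real filtering is described under the keyword graph further on.

```python
# Hypothetical illustration of S22: each record associates a sound collection
# device identifier with its decoded audio data and per-stream transcripts.
records = [
    {"device_id": "SC-01", "audio": b"...", "transcripts": ["nice weather today"]},
    {"device_id": "SC-02", "audio": b"...", "transcripts": ["give me the knife", "back off"]},
]

SENSITIVE_WORDS = {"knife"}  # stand-in for the preset sensitive word sets

def is_target_text(text: str) -> bool:
    """Placeholder for the keyword-graph filtering described below."""
    return any(word in text.split() for word in SENSITIVE_WORDS)

def find_target_device(records) -> str | None:
    for record in records:
        # A record whose text information contains a sensitive word marks
        # its audio as the target audio data ...
        if any(is_target_text(t) for t in record["transcripts"]):
            # ... and the stored identifier yields the target device.
            return record["device_id"]
    return None

assert find_target_device(records) == "SC-02"
```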
In one possible implementation, the determining target text information from the text information based on sensitive-word filtering comprises:
performing word segmentation and part-of-speech tagging on the text information and retaining the nouns, adjectives and verbs, where the retained nouns, adjectives and verbs form a candidate keyword set;
constructing a candidate keyword graph from the candidate keyword set, where each node in the candidate keyword graph represents one candidate keyword in the candidate keyword set;
calculating the weight, within the text information, of each candidate keyword in the candidate keyword graph;
performing weighted random sampling on the nodes in the candidate keyword graph based on the weights to obtain target candidate keywords;
calculating the matching degree between each target candidate keyword and each word in a preset first sensitive word set, a preset second sensitive word set and a preset third sensitive word set;
determining the target candidate keywords whose matching degree is greater than or equal to a preset value as sensitive words;
and determining the text information containing the sensitive words as the target text information.
In this embodiment of the present application, the preset first sensitive word set refers to a preset noun set, the preset second sensitive word set refers to a preset adjective set, and the preset third sensitive word set refers to a preset verb set. The keyword graph is denoted as g= (V, E), V represents a vertex set in the keyword graph, i.e. a candidate keyword set, E represents an edge set in the keyword graph, a TextRank algorithm is used to calculate a weight of each candidate keyword in the keyword graph in text information, and then a weighted random sampling is performed on nodes in the keyword graph based on the weight, for example: nodes in the keyword graph are represented as { a, b, c and d }, a preset probability of 0.8 is generated, the weights of the nodes in the keyword graph are accumulated from a, when c is sampled, the sum of the accumulated weights reaches 0.8, c is selected, the number of times of sampling is calculated a, b, c, d after the preset number of times of sampling, candidate keywords corresponding to the nodes with the number of times of sampling being greater than a threshold value are determined as target candidate keywords, then the candidate keywords are matched with each word in a preset noun set, a preset adjective set and a preset verb set, the matching degree is calculated, the matching degree can be represented by a cosine distance, a soil carrying distance or a hamming distance, if the matching degree between the target candidate keywords and any word in the preset first sensitive word set, the preset second sensitive word set and the preset third sensitive word set is greater than or equal to a preset value, the matching degree is determined as a sensitive word, the text information where the target text information is located is determined as target text information, the accuracy of determining the target text information is improved, and the warning information is prevented from being sent out by errors.
S23, adjusting the angle of the video monitoring device to capture the live picture of the object recorded by the target sound collection device.
S24, predicting the behavior of the object using the target audio data collected by the target sound collection device and the live picture.
In this specific embodiment of the application, after the target sound collection device has been determined, the on-site sound source recorded by the target sound collection device can be localized, and the angle of the video monitoring device is adjusted based on the position of that sound source so that the captured live picture is synchronized with the sound.
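The patent does not fix a particular localization method, but once the sound-source position is known in the same coordinate frame as the camera, the pan and tilt angles follow from elementary trigonometry. A minimal sketch under that assumption:

```python
import math

def pan_tilt_to(source_xyz, camera_xyz):
    """Return the (pan, tilt) angles, in degrees, that point the video
    monitoring device at a localized sound source.

    Assumes both positions are in one right-handed frame with z up,
    pan measured from the +x axis and tilt from the horizontal plane.
    """
    dx = source_xyz[0] - camera_xyz[0]
    dy = source_xyz[1] - camera_xyz[1]
    dz = source_xyz[2] - camera_xyz[2]
    pan = math.degrees(math.atan2(dy, dx))
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return pan, tilt

# Example: source 4 m east and 3 m north of the camera, 1 m below its mount.
pan, tilt = pan_tilt_to((4.0, 3.0, 1.0), (0.0, 0.0, 2.0))
print(f"pan {pan:.1f} deg, tilt {tilt:.1f} deg")  # pan 36.9 deg, tilt -11.3 deg
```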
In a possible implementation, the predicting the behavior of the object using the target audio data collected by the target sound collection device and the live picture comprises:
obtaining a first emotion label of the object based on the target text information corresponding to the target audio data;
extracting image frames from the live picture to obtain a plurality of image frame sequences;
obtaining a second emotion label of the object based on the plurality of image frame sequences;
obtaining a feature map to be classified based on the plurality of image frame sequences;
forming a matrix to be classified from the first emotion label, the second emotion label and the feature map to be classified;
and classifying the matrix to be classified to obtain the behavior of the object.
Specifically, the first emotion label is the emotion label recognized from the target text information: the target text information corresponding to the target audio data is taken as input, and a trained emotion recognition model recognizes the emotion of the object, yielding the first emotion label. As for extracting image frames from the live picture, as shown in Fig. 3, a speech part and an ambient-noise part can be detected in the target audio signal corresponding to the target audio data; the video segments of the live picture corresponding to the speech parts are determined, giving a plurality of video segments, and the image frames of each video segment are extracted to obtain a plurality of image frame sequences, with the video segments corresponding one-to-one to the image frame sequences. The second emotion label is the emotion label obtained through face action unit recognition. In one possible implementation, to obtain the second emotion label of the object based on the plurality of image frame sequences, face detection is performed on each image frame in the sequences to obtain face region images; a trained face Action Unit (AU) recognition model extracts features from the face region images to obtain a feature map for each face region image; the feature maps of the face region images are fused, and the fused feature map is classified by a fully connected layer to obtain the second emotion label. For example, AU3 indicates that the subject's eyebrows are lowered and drawn together, suggesting that the current emotion is anger. For the feature map to be classified, a trained behavior prediction model extracts features from each image frame in the plurality of image frame sequences to obtain a feature map for each frame, and these feature maps are fused into the feature map to be classified. The first emotion label, the second emotion label and the feature map to be classified then form the matrix to be classified, which is fed to a fully connected layer, and the final predicted behavior is output by a softmax function. In this embodiment, the first emotion label predicted from the target text information, the second emotion label predicted from the face action units, and the feature map to be classified are combined into a single input matrix, so that multiple sources of information are fused for behavior prediction, improving its accuracy.
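The fusion step can be pictured as concatenating the two emotion labels (one-hot encoded) with the flattened feature map and pushing the result through a fully connected layer plus softmax. The sketch below uses PyTorch with made-up dimensions; the emotion recognition, AU recognition and behavior prediction backbones are trained models the patent leaves unspecified, so random tensors stand in for their outputs.

```python
import torch
import torch.nn.functional as F

NUM_EMOTIONS, FEATURE_DIM, NUM_BEHAVIORS = 7, 128, 5   # illustrative sizes

class BehaviorClassifier(torch.nn.Module):
    """Fully connected head over [first label | second label | fused features]."""

    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(2 * NUM_EMOTIONS + FEATURE_DIM, NUM_BEHAVIORS)

    def forward(self, first_label, second_label, feature_map):
        x = torch.cat([first_label, second_label, feature_map], dim=-1)
        return F.softmax(self.fc(x), dim=-1)   # distribution over behaviors

# Stand-ins for the upstream models' outputs on one live-picture segment:
first = F.one_hot(torch.tensor([3]), NUM_EMOTIONS).float()   # e.g. "angry" from text
second = F.one_hot(torch.tensor([3]), NUM_EMOTIONS).float()  # e.g. "angry" from AUs
features = torch.randn(1, FEATURE_DIM)                       # fused per-frame features

probs = BehaviorClassifier()(first, second, features)
print(probs.argmax(dim=-1))   # index of the predicted behavior
```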
S25, outputting early warning information when the behavior is a preset behavior.
In one possible implementation, before outputting the early warning information, the method further comprises:
obtaining information to be sent based on the target text information, where the information to be sent is used to warn the object;
converting the information to be sent into a digital signal;
and performing forward error correction coding on the digital signal to obtain the early warning information.
Specifically, when the predicted behavior is a preset behavior, for example an aggressive act, the audio processing device 12 may select a corresponding prompt from a preset prompt library as the information to be sent, based on the intent of the target text information; it then converts the information to be sent into a digital signal and performs forward error correction coding to obtain the early warning information. The early warning information is sent to the audio playing device 14, which performs forward error correction decoding to recover the digital signal and converts it into audio for playback, so as to warn the object off the aggressive act. In some scenarios, the information to be sent may also be text or voice input by the staff on the basis of the target text information: for voice input, the audio processing device 12 first obtains the audio signal of the voice, converts it into a digital signal, and then performs forward error correction coding; for text input, the text can be converted directly into a digital signal and coded. In this embodiment, applying forward error correction coding to the information to be sent improves the interference resistance of the data on its way to the audio playing device 14 and extends the transmission distance of the early warning information, enabling long-distance early warning.
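End to end, the warning path is: choose a prompt, turn it into a digital signal (here taken to be the bits of its UTF-8 encoding), and FEC-encode it before transmission. The sketch below reuses the hamming74_encode function from the Hamming(7,4) sketch at step S21; the prompt table and intent key are hypothetical.

```python
# Hypothetical prompt table keyed by the intent recognized from the target text.
PROMPTS = {"threat": "Please calm down. Security has been notified."}

def to_bits(message: str) -> list[int]:
    """Convert text to its digital signal: the bits of its UTF-8 bytes."""
    return [byte >> i & 1 for byte in message.encode("utf-8") for i in range(7, -1, -1)]

def build_warning(intent: str) -> list[int]:
    """Prompt -> digital signal -> forward-error-correction-coded warning."""
    bits = to_bits(PROMPTS[intent])
    coded = []
    for i in range(0, len(bits), 4):                  # 4 data bits per code block
        coded.extend(hamming74_encode(bits[i:i + 4])) # from the S21 sketch above
    return coded

warning = build_warning("threat")   # ready to transmit to the audio playing device 14
```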
As can be seen, in the embodiment of the present application, audio data collected by a plurality of sound collection devices are acquired; the audio data are analyzed and a target sound collection device is determined from the plurality of sound collection devices based on sensitive-word filtering; the angle of a video monitoring device is adjusted to capture the live picture of the object recorded by the target sound collection device; the behavior of the object is predicted using the target audio data collected by the target sound collection device and the live picture; and early warning information is output when the behavior is a preset behavior. The target sound collection device, i.e. the one collecting on-site sound that suggests a security hazard, is determined from the plurality of sound collection devices based on the audio data and sensitive-word filtering; the video monitoring device then captures the live picture to synchronize picture and sound; the behavior of the on-site object is predicted from the on-site sound (i.e. the target audio data) and the live picture; and the object is warned when it is about to carry out a preset behavior. This improves the timeliness of early warning output and prevents security hazards from materializing.
Referring to Fig. 4, Fig. 4 is a flow chart of another early warning method based on picture and sound synchronization according to an embodiment of the present application, which may likewise be implemented in the application environment shown in Fig. 1. As shown in Fig. 4, the method comprises steps S401-S410:
S401, acquiring audio data collected by a plurality of sound collection devices, the audio data being obtained by the plurality of sound collection devices performing forward error correction coding on the collected sound;
S402, performing forward error correction decoding on the audio data to obtain corresponding audio signals;
S403, converting the audio signals into text information;
S404, determining target text information from the text information based on sensitive-word filtering;
S405, determining the audio data corresponding to the target text information as the target audio data;
S406, querying the sound collection device identifier corresponding to the target audio data;
S407, determining the target sound collection device from the plurality of sound collection devices according to the sound collection device identifier;
S408, adjusting the angle of the video monitoring device to capture the live picture of the object recorded by the target sound collection device;
S409, predicting the behavior of the object using the target audio data collected by the target sound collection device and the live picture;
S410, outputting early warning information when the behavior is a preset behavior.
The specific implementation of steps S401-S410 is described in detail in the embodiment shown in Fig. 2 and achieves the same or similar beneficial effects, so it is not repeated here.
Based on the description of the foregoing embodiments of the early warning method based on picture and sound synchronization, the present application further provides an early warning device based on picture and sound synchronization, which may be a computer program (including program code) running on a terminal. The early warning device can perform the method shown in Fig. 2 or Fig. 4. Referring to Fig. 5, the device comprises:
an audio acquisition module 51, configured to acquire audio data collected by a plurality of sound collection devices;
an audio analysis module 52, configured to analyze the audio data and determine a target sound collection device from the plurality of sound collection devices based on sensitive-word filtering;
a synchronization module 53, configured to adjust the angle of the video monitoring device to capture the live picture of the object recorded by the target sound collection device;
a behavior prediction module 54, configured to predict the behavior of the object using the target audio data collected by the target sound collection device and the live picture;
and an alarm module 55, configured to output early warning information when the behavior is a preset behavior.
In one possible implementation, in analyzing the audio data and determining a target sound collection device from the plurality of sound collection devices based on sensitive-word filtering, the audio analysis module 52 is specifically configured to:
perform forward error correction decoding on the audio data to obtain corresponding audio signals;
convert the audio signals into text information;
determine target text information from the text information based on sensitive-word filtering;
determine the audio data corresponding to the target text information as the target audio data;
query the sound collection device identifier corresponding to the target audio data;
and determine the target sound collection device from the plurality of sound collection devices according to the sound collection device identifier.
In one possible implementation, in determining target text information from the text information based on sensitive-word filtering, the audio analysis module 52 is specifically configured to:
perform word segmentation and part-of-speech tagging on the text information and retain the nouns, adjectives and verbs, where the retained nouns, adjectives and verbs form a candidate keyword set;
construct a candidate keyword graph from the candidate keyword set, where each node in the candidate keyword graph represents one candidate keyword in the candidate keyword set;
calculate the weight, within the text information, of each candidate keyword in the candidate keyword graph;
perform weighted random sampling on the nodes in the candidate keyword graph based on the weights to obtain target candidate keywords;
calculate the matching degree between each target candidate keyword and each word in a preset first sensitive word set, a preset second sensitive word set and a preset third sensitive word set;
determine the target candidate keywords whose matching degree is greater than or equal to a preset value as sensitive words;
and determine the text information containing the sensitive words as the target text information.
In one possible implementation, before the early warning information is output, the alarm module 55 is further configured to:
obtain information to be sent based on the target text information, where the information to be sent is used to warn the object;
convert the information to be sent into a digital signal;
and perform forward error correction coding on the digital signal to obtain the early warning information.
In one possible implementation, in predicting the behavior of the object using the target audio data collected by the target sound collection device and the live picture, the behavior prediction module 54 is specifically configured to:
obtain a first emotion label of the object based on the target text information corresponding to the target audio data;
extract image frames from the live picture to obtain a plurality of image frame sequences;
obtain a second emotion label of the object based on the plurality of image frame sequences;
obtain a feature map to be classified based on the plurality of image frame sequences;
form a matrix to be classified from the first emotion label, the second emotion label and the feature map to be classified;
and classify the matrix to be classified to obtain the behavior of the object.
In one possible implementation, in obtaining the second emotion label of the object based on the plurality of image frame sequences, the behavior prediction module 54 is specifically configured to:
perform face detection on each frame in the plurality of image frame sequences and crop a face region image from each frame based on the face detection;
perform face action unit recognition on the face region images;
and obtain the second emotion label from the face action unit recognition results.
According to an embodiment of the present application, the modules of the early warning device based on picture and sound synchronization shown in Fig. 5 may be separately or jointly combined into one or several additional units, or one or more of the modules may be further split into several functionally smaller units, which can achieve the same operation without affecting the technical effects of the embodiments of the application. The above units are divided on the basis of logical function; in practical applications, the function of one unit may be implemented by several units, or the functions of several units may be implemented by one unit. In other embodiments of the application, the early warning device may likewise include other units, and in practical applications these functions may be implemented with the assistance of, and in cooperation with, several units.
According to another embodiment of the present application, the early warning device based on picture and sound synchronization shown in Fig. 5 may be constructed, and the early warning method based on picture and sound synchronization of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps of the methods shown in Fig. 2 or Fig. 4 on a general-purpose computing device, such as a computer comprising processing elements and storage media such as a central processing unit (CPU), a random access storage medium (RAM) and a read-only storage medium (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and run on the above computing device via that medium.
Based on the descriptions of the foregoing method and apparatus embodiments, please refer to Fig. 6, which is a schematic structural diagram of an electronic device provided in an embodiment of the present application. As shown in Fig. 6, the electronic device comprises at least a processor 61, an input device 62, an output device 63 and a computer storage medium 64, which may be connected by a bus or in other ways.
The computer storage medium 64 may be stored in the memory of the electronic device and is configured to store a computer program comprising program instructions; the processor 61 is configured to execute the program instructions stored in the computer storage medium 64. The processor 61 (or CPU, Central Processing Unit) is the computing and control core of the electronic device; it is adapted to implement one or more instructions, and in particular to load and execute one or more instructions to realize the corresponding method flow or function.
In one embodiment, the processor 61 of the electronic device provided in the embodiments of the present application may be configured to perform a series of early warning operations based on picture and sound synchronization:
acquiring audio data collected by a plurality of sound collection devices;
analyzing the audio data and determining a target sound collection device from the plurality of sound collection devices based on sensitive-word filtering;
adjusting the angle of the video monitoring device to capture the live picture of the object recorded by the target sound collection device;
predicting the behavior of the object using the target audio data collected by the target sound collection device and the live picture;
and outputting early warning information when the behavior is a preset behavior.
In yet another embodiment, the audio data are obtained by the plurality of sound collection devices performing forward error correction coding on the collected sound, and the processor 61 performs the analyzing of the audio data and the determining of a target sound collection device from the plurality of sound collection devices based on sensitive-word filtering, comprising:
performing forward error correction decoding on the audio data to obtain corresponding audio signals;
converting the audio signals into text information;
determining target text information from the text information based on sensitive-word filtering;
determining the audio data corresponding to the target text information as the target audio data;
querying the sound collection device identifier corresponding to the target audio data;
and determining the target sound collection device from the plurality of sound collection devices according to the sound collection device identifier.
In yet another embodiment, the determining target text information from the text information based on sensitive-word filtering comprises:
performing word segmentation and part-of-speech tagging on the text information and retaining the nouns, adjectives and verbs, where the retained nouns, adjectives and verbs form a candidate keyword set;
constructing a candidate keyword graph from the candidate keyword set, where each node in the candidate keyword graph represents one candidate keyword in the candidate keyword set;
calculating the weight, within the text information, of each candidate keyword in the candidate keyword graph;
performing weighted random sampling on the nodes in the candidate keyword graph based on the weights to obtain target candidate keywords;
calculating the matching degree between each target candidate keyword and each word in a preset first sensitive word set, a preset second sensitive word set and a preset third sensitive word set;
determining the target candidate keywords whose matching degree is greater than or equal to a preset value as sensitive words;
and determining the text information containing the sensitive words as the target text information.
In yet another embodiment, before outputting the early warning information, the processor 61 is further configured to perform:
obtaining information to be sent based on the target text information, where the information to be sent is used to warn the object;
converting the information to be sent into a digital signal;
and performing forward error correction coding on the digital signal to obtain the early warning information.
In still another embodiment, the processor 61 performs the predicting of the behavior of the object using the target audio data collected by the target sound collection device and the live picture, comprising:
obtaining a first emotion label of the object based on the target text information corresponding to the target audio data;
extracting image frames from the live picture to obtain a plurality of image frame sequences;
obtaining a second emotion label of the object based on the plurality of image frame sequences;
obtaining a feature map to be classified based on the plurality of image frame sequences;
forming a matrix to be classified from the first emotion label, the second emotion label and the feature map to be classified;
and classifying the matrix to be classified to obtain the behavior of the object.
In yet another embodiment, the processor 61 performs the obtaining of the second emotion label of the object based on the plurality of image frame sequences, comprising:
performing face detection on each frame in the plurality of image frame sequences and cropping a face region image from each frame based on the face detection;
performing face action unit recognition on the face region images;
and obtaining the second emotion label from the face action unit recognition results.
The electronic device may be a sound collection device, an audio playing device, a server, a host computer, a cloud server or the like. The electronic device may include, but is not limited to, the processor 61, the input device 62, the output device 63 and the computer storage medium 64. Those skilled in the art will appreciate that the schematic diagram is merely an example of an electronic device and does not limit it; the electronic device may include more or fewer components than shown, or combine certain components, or use different components.
It should be noted that, since the steps of the foregoing early warning method based on picture and sound synchronization are implemented when the processor 61 of the electronic device executes the computer program, the embodiments of the foregoing method all apply to the electronic device, and the same or similar beneficial effects can be achieved.
The embodiments of the present application also provide a computer storage medium (memory), which is a storage device in an information processing device, an information transmitting device or an information receiving device and is used to store programs and data. It can be understood that the computer storage medium here may include both a built-in storage medium of the terminal and an extended storage medium supported by the terminal. The computer storage medium provides storage space, which stores the operating system of the terminal and one or more instructions adapted to be loaded and executed by the processor; these instructions may be one or more computer programs (including program code). The computer storage medium here may be a high-speed RAM memory, or a non-volatile memory such as at least one magnetic disk memory; optionally, it may also be at least one computer storage medium located remotely from the aforementioned processor. In one embodiment, the one or more instructions stored in the computer storage medium may be loaded and executed by the processor to implement the corresponding steps of the early warning method based on picture and sound synchronization described above.
The embodiments of the present application are described above in detail. Specific examples are used herein to illustrate the principles and implementations of the application, and the description of the above embodiments is intended only to help understand the method of the application and its core idea. Meanwhile, a person skilled in the art may make changes to the specific implementations and the application scope in accordance with the idea of the application. In summary, the content of this specification should not be construed as limiting the application.

Claims (6)

1. An early warning method based on picture and sound synchronization is characterized by comprising the following steps:
acquiring audio data acquired by a plurality of sound acquisition devices;
analyzing the audio data, and determining a target sound collection device from the plurality of sound collection devices based on sensitive word filtering;
adjusting the angle of the video monitoring equipment to acquire the field picture of the object acquired by the target sound acquisition equipment;
predicting the behavior of the object by utilizing the target audio data acquired by the target sound acquisition equipment and the scene;
outputting early warning information under the condition that the behavior is a preset behavior;
the audio data is obtained by performing forward error correction coding on the collected sound by the plurality of sound collection devices; the analyzing the audio data, determining a target sound collection device from the plurality of sound collection devices based on sensitive word filtering, including:
Performing forward error correction decoding on the audio data to obtain corresponding audio signals;
converting the audio signal into text information;
determining target text information from the text information based on sensitive word filtering;
determining the audio data corresponding to the target text information as the target audio data;
inquiring a sound collection device identifier corresponding to the target audio data;
determining the target sound collection equipment from the plurality of sound collection equipment according to the sound collection equipment identification;
the method for determining the target text information from the text information based on the sensitive word filtering comprises the following steps:
performing word segmentation and part-of-speech tagging on the text information, and reserving nouns, adjectives and verbs, wherein the nouns, adjectives and verbs obtained after the word segmentation and part-of-speech tagging form a candidate keyword set;
constructing a candidate keyword graph by the candidate keyword set; each node in the candidate keyword graph represents each candidate keyword in the candidate keyword set;
calculating the weight of each candidate keyword in the keyword graph in the text information;
carrying out weighted random sampling on nodes in the candidate keyword graph based on the weights to obtain target candidate keywords;
Calculating the matching degree between the target candidate keyword and each word in a preset first sensitive word set, a preset second sensitive word set and a preset third sensitive word set;
determining the target candidate keywords with the matching degree larger than or equal to a preset value as sensitive words;
determining the text information containing the sensitive words as the target text information;
wherein, before outputting the early warning information, the method further comprises:
obtaining information to be sent based on the target text information, the information to be sent being used for early warning of the object;
converting the information to be sent into a digital signal;
and performing forward error correction coding on the digital signal to obtain the early warning information.
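Claim 1 leaves the forward error correction scheme unspecified. As a rough illustration of the encode/decode round trip recited above, here is a minimal Python sketch using Reed-Solomon coding from the third-party reedsolo package; the 4-byte device-identifier header and the packet layout are hypothetical additions for illustration, not part of the claim.

```python
# Sketch of the FEC round trip in claim 1 using Reed-Solomon coding
# (reedsolo package, API as of the 1.x releases). The device-ID header
# is an assumed packet layout, not taken from the patent.
from reedsolo import RSCodec

rsc = RSCodec(10)  # 10 parity bytes: corrects up to 5 corrupted bytes per block

def encode_audio_packet(device_id: int, pcm_bytes: bytes) -> bytes:
    """Sound collection device side: prepend a device identifier, then FEC-encode."""
    payload = device_id.to_bytes(4, "big") + pcm_bytes
    return bytes(rsc.encode(payload))

def decode_audio_packet(packet: bytes) -> tuple[int, bytes]:
    """Receiver side: FEC-decode, then split off the device identifier."""
    decoded = rsc.decode(packet)[0]  # raises ReedSolomonError if unrecoverable
    return int.from_bytes(decoded[:4], "big"), bytes(decoded[4:])

# Round trip with two corrupted bytes, which RS with 10 parity bytes corrects:
packet = bytearray(encode_audio_packet(7, b"\x01\x02\x03\x04" * 8))
packet[5] ^= 0xFF
packet[11] ^= 0xFF
device_id, audio = decode_audio_packet(bytes(packet))
assert device_id == 7 and audio == b"\x01\x02\x03\x04" * 8
```

The same codec would be applied in the opposite direction when the early warning information is encoded before transmission, as in the last step of claim 1.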
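The word segmentation, keyword graph, node weighting and weighted sampling steps recited in claim 1 resemble a TextRank-style keyword pipeline. A minimal sketch follows, assuming jieba for Chinese segmentation and part-of-speech tagging, networkx PageRank as the node weight, and a difflib similarity ratio as the matching degree; the co-occurrence window, sample count, threshold and example sensitive-word sets are illustrative choices the claim does not fix.

```python
import random
from difflib import SequenceMatcher

import jieba.posseg as pseg  # word segmentation with part-of-speech tags
import networkx as nx

# Illustrative sets; claim 1 only says three preset sensitive-word sets exist.
SENSITIVE_SETS = [{"救命"}, {"抢劫"}, {"杀人"}]

def candidate_keywords(text: str) -> list[str]:
    """Keep nouns (n*), adjectives (a*) and verbs (v*) as candidate keywords."""
    return [w.word for w in pseg.cut(text) if w.flag[:1] in ("n", "a", "v")]

def keyword_weights(candidates: list[str], window: int = 3) -> dict[str, float]:
    """Build a co-occurrence graph over the candidates; weight nodes by PageRank."""
    graph = nx.Graph()
    graph.add_nodes_from(candidates)
    for i, word in enumerate(candidates):
        for other in candidates[i + 1:i + window]:
            if other != word:
                graph.add_edge(word, other)
    return nx.pagerank(graph)

def contains_sensitive_word(text: str, threshold: float = 0.8,
                            samples: int = 5) -> bool:
    """Weighted random sampling of graph nodes, then fuzzy matching against
    the three sensitive-word sets, as recited in claim 1."""
    candidates = candidate_keywords(text)
    if not candidates:
        return False
    weights = keyword_weights(candidates)
    picked = random.choices(list(weights), weights=list(weights.values()),
                            k=samples)
    return any(
        SequenceMatcher(None, kw, word).ratio() >= threshold
        for kw in picked for word_set in SENSITIVE_SETS for word in word_set)
```

Text information for which contains_sensitive_word returns True would be kept as the target text information, and the device identifier of its source packet would then select the target sound collection device.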
2. The method of claim 1, wherein predicting the behavior of the object using the target audio data collected by the target sound collection device and the live view comprises:
obtaining a first emotion label of the object based on the target text information corresponding to the target audio data;
extracting image frames from the live view to obtain a plurality of image frame sequences;
obtaining a second emotion label of the object based on the plurality of image frame sequences;
acquiring a feature map to be classified based on the plurality of image frame sequences;
forming a matrix to be classified from the first emotion label, the second emotion label and the feature map to be classified;
and classifying the matrix to be classified to obtain the behavior of the object.
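Claim 2 does not specify how the two emotion labels and the feature map are combined, nor which classifier is applied. The sketch below assumes one-hot label encodings concatenated with a flattened feature map and a plain linear classifier; the label space, behavior classes and weights are placeholders.

```python
import numpy as np

EMOTIONS = ["calm", "angry", "fearful"]  # assumed emotion label space
BEHAVIORS = ["normal", "dangerous"]      # assumed behavior classes

def one_hot(label: str, vocab: list[str]) -> np.ndarray:
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[vocab.index(label)] = 1.0
    return vec

def build_matrix(first_label: str, second_label: str,
                 feature_map: np.ndarray) -> np.ndarray:
    """Form the 'matrix to be classified' of claim 2: both emotion labels
    plus the flattened feature map, concatenated into one vector."""
    return np.concatenate([one_hot(first_label, EMOTIONS),
                           one_hot(second_label, EMOTIONS),
                           feature_map.ravel().astype(np.float32)])

def classify(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> str:
    """A linear classifier stands in for the unspecified classification step."""
    return BEHAVIORS[int(np.argmax(w @ x + b))]

# Usage with random placeholder weights and a dummy 4x4 feature map:
x = build_matrix("angry", "fearful", np.random.rand(4, 4))
w = np.random.rand(len(BEHAVIORS), x.size)
b = np.zeros(len(BEHAVIORS))
print(classify(x, w, b))
```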
3. The method of claim 2, wherein the obtaining a second emotion label of the object based on the plurality of image frame sequences comprises:
performing face detection on each frame of image in the plurality of image frame sequences, and cutting out a face region image from each frame of image based on the face detection;
performing face action unit recognition on the face region image;
and obtaining the second emotion label according to the face action unit recognition result.
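Claim 3 recites face detection, face region cropping and face action unit recognition, but names no concrete detector or AU model. The sketch below uses OpenCV's bundled Haar cascade for detection; the AU recognizer is a stub (a trained model would be required in practice), and the AU-to-emotion mapping is illustrative.

```python
import cv2
import numpy as np

# Haar cascade face detector shipped with opencv-python.
_DETECTOR = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_faces(frame: np.ndarray) -> list[np.ndarray]:
    """Detect faces in one frame and cut out each face region image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = _DETECTOR.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [frame[y:y + h, x:x + w] for (x, y, w, h) in boxes]

def recognize_action_units(face: np.ndarray) -> set[int]:
    """Stub for face action unit recognition; claim 3 does not name a model,
    so a trained AU classifier would be plugged in here."""
    return set()

def second_emotion_label(frames: list[np.ndarray]) -> str:
    """Aggregate AUs over all frames, then map to a coarse emotion label.
    AU4 + AU5 + AU7 is a common FACS anger pattern; the mapping is illustrative."""
    aus: set[int] = set()
    for frame in frames:
        for face in crop_faces(frame):
            aus |= recognize_action_units(face)
    return "angry" if {4, 5, 7} <= aus else "calm"
```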
4. An early warning device based on picture and sound synchronization, the device comprising:
the audio acquisition module is used for acquiring audio data collected by a plurality of sound collection devices;
the audio analysis module is used for analyzing the audio data and determining a target sound collection device from the plurality of sound collection devices based on sensitive word filtering;
the synchronization module is used for adjusting the angle of a video monitoring device to acquire a live view of the object captured by the target sound collection device;
the behavior prediction module is used for predicting the behavior of the object by using the target audio data collected by the target sound collection device and the live view;
the alarm module is used for outputting early warning information under the condition that the behavior is a preset behavior;
in terms of analyzing the audio data and determining a target sound collection device from the plurality of sound collection devices based on sensitive word filtering, the audio analysis module is specifically configured to:
perform forward error correction decoding on the audio data to obtain a corresponding audio signal;
convert the audio signal into text information;
determine target text information from the text information based on sensitive word filtering;
determine the audio data corresponding to the target text information as the target audio data;
query a sound collection device identifier corresponding to the target audio data;
determine the target sound collection device from the plurality of sound collection devices according to the sound collection device identifier;
in terms of determining target text information from the text information based on sensitive word filtering, the audio analysis module is specifically configured to:
perform word segmentation and part-of-speech tagging on the text information and retain the nouns, adjectives and verbs, wherein the nouns, adjectives and verbs obtained after the word segmentation and part-of-speech tagging form a candidate keyword set;
construct a candidate keyword graph from the candidate keyword set, wherein each node in the candidate keyword graph represents one candidate keyword in the candidate keyword set;
calculate the weight of each candidate keyword in the candidate keyword graph within the text information;
perform weighted random sampling on the nodes in the candidate keyword graph based on the weights to obtain target candidate keywords;
calculate the matching degree between each target candidate keyword and each word in a preset first sensitive word set, a preset second sensitive word set and a preset third sensitive word set;
determine the target candidate keywords whose matching degree is greater than or equal to a preset value as sensitive words;
determine the text information containing the sensitive words as the target text information;
the alarm module is further configured to:
obtain information to be sent based on the target text information, the information to be sent being used for early warning of the object;
convert the information to be sent into a digital signal;
and perform forward error correction coding on the digital signal to obtain the early warning information.
5. An electronic device comprising an input device and an output device, and further comprising:
a processor adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the method of any one of claims 1-3.
6. A computer storage medium storing one or more instructions adapted to be loaded by a processor and to perform the method of any one of claims 1-3.
CN202110353106.XA 2021-03-30 2021-03-30 Early warning method based on picture and sound synchronization and related equipment Active CN113114986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110353106.XA CN113114986B (en) 2021-03-30 2021-03-30 Early warning method based on picture and sound synchronization and related equipment

Publications (2)

Publication Number Publication Date
CN113114986A CN113114986A (en) 2021-07-13
CN113114986B CN113114986B (en) 2023-04-28

Family

ID=76713716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110353106.XA Active CN113114986B (en) 2021-03-30 2021-03-30 Early warning method based on picture and sound synchronization and related equipment

Country Status (1)

Country Link
CN (1) CN113114986B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989924A (en) * 2021-10-22 2022-01-28 北京明略软件系统有限公司 Violent behavior early warning method and device
CN114245205B (en) * 2022-02-23 2022-05-24 达维信息技术(深圳)有限公司 Video data processing method and system based on digital asset management

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391942A (en) * 2014-11-25 2015-03-04 中国科学院自动化研究所 Short text characteristic expanding method based on semantic atlas
CN109255118A (en) * 2017-07-11 2019-01-22 普天信息技术有限公司 A kind of keyword extracting method and device
CN110830771A (en) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Intelligent monitoring method, device, equipment and computer readable storage medium
CN111626126A (en) * 2020-04-26 2020-09-04 腾讯科技(北京)有限公司 Face emotion recognition method, device, medium and electronic equipment
CN112419661A (en) * 2019-08-20 2021-02-26 北京国双科技有限公司 Danger identification method and device

Similar Documents

Publication Title
CN108256404B (en) Pedestrian detection method and device
CN113114986B (en) Early warning method based on picture and sound synchronization and related equipment
JP5061382B2 (en) Time-series data identification device and person meta information addition device for moving images
CN111783712A (en) Video processing method, device, equipment and medium
CN115050077A (en) Emotion recognition method, device, equipment and storage medium
CN115103157A (en) Video analysis method and device based on edge cloud cooperation, electronic equipment and medium
CN115272656A (en) Environment detection alarm method and device, computer equipment and storage medium
CN110874554B (en) Action recognition method, terminal device, server, system and storage medium
CN113160279A (en) Method and device for detecting abnormal behaviors of pedestrians in subway environment
WO2015093687A1 (en) Data processing system
US10186253B2 (en) Control device for recording system, and recording system
CN110855932B (en) Alarm method and device based on video data, electronic equipment and storage medium
CN113099283B (en) Method for synchronizing monitoring picture and sound and related equipment
JP2018137639A (en) Moving image processing system, encoder and program, decoder and program
CN114708429A (en) Image processing method, image processing device, computer equipment and computer readable storage medium
CN113342978A (en) City event processing method and device
CN113705689A (en) Training data acquisition method and abnormal behavior recognition network training method
JP5907487B2 (en) Information transmission system, transmission device, reception device, information transmission method, and program
CN112215114A (en) Target identification method, device, equipment and computer readable storage medium
CN111904429A (en) Human body falling detection method and device, electronic equipment and storage medium
CN112052325A (en) Voice interaction method and device based on dynamic perception
CN113518201B (en) Video processing method, device and equipment
KR102497399B1 (en) Image control system compatible object search method based on artificial intelligence technology
WO2023238721A1 (en) Information creation method and information creation device
CN113743293B (en) Fall behavior detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230925

Address after: Room 206, Floor 2, R&D Building B, Nanning ASEAN Agricultural Science and Technology Development Enterprise Headquarters Base, No. 10, Gaoxin 3rd Road, Nanning, Guangxi Zhuang Autonomous Region, 530000

Patentee after: Guangxi Guanbiao Technology Co.,Ltd.

Address before: 518000 room 528A, 5 / F, building 3, Duoli Industrial Zone, 105 Meihua Road, Meifeng community, Meilin street, Futian District, Shenzhen City, Guangdong Province

Patentee before: SHENZHEN SOYO TECHNOLOGY DEVELOPMENT Co.,Ltd.