CN112422601A

CN112422601A - Data processing method and device and electronic equipment

Info

Publication number: CN112422601A
Application number: CN201910783506.7A
Authority: CN
Inventors: 伊威; 王全占; 庄博宇; 李名杨; 古鉴
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2019-08-23
Filing date: 2019-08-23
Publication date: 2021-02-26
Anticipated expiration: 2039-08-23
Also published as: CN112422601B

Abstract

The application provides three data processing methods, together with corresponding devices and electronic equipment. One of the data processing methods comprises the following steps: obtaining real-time image data representing an environment to be processed (that is, current image data representing the current environment); obtaining reference image data representing a reference environment; judging whether the features of a target object in the real-time image data match the features of a reference target object in the reference image data and, if they do not match, judging the real-time image data to be abnormal image data and sending the abnormal image data to a server; and sending the audio data matched with the abnormal image data to the server, where the matched audio data is the audio data of the environment collected when the abnormal image data was generated. The method screens audio data on the basis of image data and reduces the data uploaded to the cloud, thereby reducing the storage and data processing pressure on the server.

Description

Data processing method and device and electronic equipment

Technical Field

The present application relates to the field of data processing, and in particular, to three data processing methods and apparatuses, and an electronic device.

Background

With the rapid development of science and technology, daily life has become increasingly intelligent, and intelligent products have emerged in large numbers in recent years. Intelligent monitoring, a key research field for intelligent products, can extract abnormal behaviors from video and act on them in real time, overcoming the passive state of traditional monitoring, which can only observe and cannot control.

In the field of existing intelligent monitoring, a camera device is mainly applied to an intelligent electronic product. The intelligent electronic product monitoring system mainly adopts a camera device to monitor the surrounding environment of the intelligent electronic product and returns acquired real-time information to control the work of the intelligent electronic product. For example, the camera device is installed on the intelligent sweeping robot, and the camera device installed on the intelligent sweeping robot can be used for monitoring the working environment. When the sweeping robot works, the camera device returns monitored information including ground environment information and other to-be-swept environment information to a user terminal for controlling the sweeping robot in real time, and a user controls a working route of the sweeping robot according to the obtained environment information.

However, the conventional way of applying a camera device to an intelligent electronic product only controls the operation of the product at the hardware level. A few intelligent electronic products can process data, but doing so requires storing a large amount of image data and audio data and uploading all of it to a server for processing. Because the data transmission speed and capacity of an intelligent electronic product are limited, transmitting a large amount of data slows down the whole data processing and operating process. Even when the data can be uploaded to and stored in the cloud, the large amount of image data and audio data uploaded from the local device still reduces the computing efficiency of the cloud.

Disclosure of Invention

The application provides a data processing method, which is used for reducing data uploaded to a cloud end by locally screening image data and audio data, so that the pressure of data storage and processing of a server end is reduced. The application simultaneously provides another two data processing methods, three data processing devices and three data processing electronic devices.

The application provides a data processing method, which comprises the following steps:

obtaining real-time image data representing an environment to be processed;

obtaining reference image data representing a reference environment;

judging whether the characteristics of a target object in the real-time image data are matched with the characteristics of a reference target object in the reference image data, if the characteristics of the target object are not matched with the characteristics of the reference target object, judging that the real-time image data are abnormal image data, and sending the abnormal image data to a server;

and sending the audio data matched with the abnormal image data to the server, wherein the audio data matched with the abnormal image data is the audio data of the environment collected when the abnormal image data is generated.
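The four steps above can be illustrated with a minimal sketch. It is not the implementation described in this application: the `Frame` and `AudioClip` types, the `features_match` callback, and the `send` callback are hypothetical names introduced only for illustration, and timestamps are assumed to be plain numbers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Frame:
    timestamp: float  # acquisition time (assumed to be seconds)
    features: tuple   # placeholder for the features of the target object


@dataclass
class AudioClip:
    timestamp: float  # acquisition time (assumed to be seconds)
    samples: bytes


def screen_and_upload(
    frame: Frame,
    reference: Frame,
    audio_clips: List[AudioClip],
    features_match: Callable[[tuple, tuple], bool],
    send: Callable[[str, object], None],
) -> bool:
    """Return True if the frame was judged abnormal and uploaded."""
    if features_match(frame.features, reference.features):
        return False  # features match the reference: nothing is uploaded
    # Features do not match: treat the frame as abnormal image data.
    send("image", frame)
    # Audio matched with the abnormal image: collected at the same time.
    for clip in audio_clips:
        if clip.timestamp == frame.timestamp:
            send("audio", clip)
    return True
```

In this sketch, only frames that fail the local comparison, and the audio collected at the same time, ever reach the `send` callback, which is the data reduction the method aims at.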

Optionally, the obtaining real-time image data representing an environment to be processed includes: obtaining, through a camera device arranged on an audio playing device, real-time image data representing the environment to be processed around the audio playing device.

Optionally, the obtaining reference image data representing a reference environment includes: obtaining default image data that is stored in advance by an audio playing device and represents the default environment around the audio playing device.

Optionally, the obtaining real-time image data representing an environment to be processed includes:

sending a request for obtaining the real-time image data;

receiving the real-time image data for the request.

Optionally, the determining whether the features of the target object in the real-time image data are matched with the features of the reference target object in the reference image data includes:

judging whether the target object and the reference target object are the same object or not;

if the target object and the reference target object are the same object, judging whether the similarity between the characteristics of the target object and the characteristics of the reference target object exceeds a preset similarity threshold value;

determining that the feature of the target object matches the feature of the reference target object if the similarity between the feature of the target object and the feature of the reference target object exceeds the similarity threshold, otherwise determining that the feature of the target object does not match the feature of the reference target object.

Optionally, the determining whether the similarity between the feature of the target object and the feature of the reference target object exceeds a predetermined similarity threshold includes:

determining whether difference information between the position of the target object in the real-time image data and the position of the reference target object in the reference image data exceeds a predetermined position difference threshold;

determining that the similarity between the feature of the target object and the feature of the reference target object does not exceed the similarity threshold if the difference information exceeds the location difference threshold, otherwise determining that the similarity between the feature of the target object and the feature of the reference target object exceeds the similarity threshold.

determining that the feature of the target object does not match the feature of the reference target object if the target object and the reference target object are different objects.

Optionally, the target object and the reference target object being different objects means that all target objects in the real-time image data are different from all reference target objects in the reference image data, or that the target objects in the real-time image data are not completely the same as the reference target objects in the reference image data.
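The matching logic described above (same-object check first, then a position difference threshold) can be sketched as follows. This is only an illustration under stated assumptions: objects are identified by hypothetical string labels, positions are 2-D coordinates, and the "difference information" is taken to be Euclidean distance, none of which is specified by this application.

```python
import math
from typing import Dict, Tuple

Position = Tuple[float, float]


def features_match(
    targets: Dict[str, Position],
    references: Dict[str, Position],
    position_diff_threshold: float,
) -> bool:
    """Return True if the target objects match the reference target objects."""
    # Different objects: the two sets of objects are not completely the same.
    if set(targets) != set(references):
        return False
    # Same objects: compare the position of each target with its reference.
    for label, position in targets.items():
        difference = math.dist(position, references[label])
        if difference > position_diff_threshold:
            return False  # moved too far: similarity does not exceed threshold
    return True
```

For example, with a reference of a sofa at (0, 0) and a table at (5, 5), a frame in which the sofa has moved by one unit still matches under a threshold of 2, whereas a frame containing a person instead of the table does not.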

Optionally, if the feature of the target object is not matched with the feature of the reference target object, determining that the real-time image data is abnormal image data, and sending the abnormal image data to a server, including:

if the characteristics of the target object are not matched with the characteristics of the reference target object, inputting the characteristics of the target object into a deep neural network model to obtain an identification result of whether the target object comprises a target object needing attention; the deep neural network model is used for identifying whether the target object comprises a target object needing attention according to the characteristics of the target object;

and if the target object comprises a target object needing attention, judging that the real-time image data is abnormal image data, and sending the abnormal image data to a server.

Optionally, if the target object includes a target object that needs to be paid attention to, determining that the real-time image data is abnormal image data, and sending the abnormal image data to a server, including:

and if the target object comprises a target object needing attention and the target object needing attention is matched with a preset comparison object, judging that the real-time image data is abnormal image data and sending the abnormal image data to a server.
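The two optional refinements above (a model that identifies objects needing attention, followed by a check against a preset comparison object) can be sketched as below. The `classify` callback stands in for the deep neural network model and is a hypothetical placeholder; its signature and the use of string labels are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence, Set


def is_abnormal_image(
    features_matched: bool,
    target_features: Sequence[float],
    classify: Callable[[Sequence[float]], Optional[str]],
    comparison_objects: Set[str],
) -> bool:
    """Decide whether real-time image data should be treated as abnormal."""
    if features_matched:
        return False  # matches the reference: no further checks needed
    # Stand-in for the deep neural network model: returns the label of an
    # object needing attention, or None if there is none.
    label = classify(target_features)
    if label is None:
        return False
    # Only objects that match a preset comparison object count as abnormal.
    return label in comparison_objects
```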

Optionally, the method further includes: and sending the audio data set comprising the audio data matched with the abnormal image data in the specified time range to the server.

Optionally, the method further includes:

searching audio data with the same acquisition time as the abnormal image data from the audio data set in the specified time range;

and taking the searched audio data as audio data matched with the abnormal image data.
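The lookup of matched audio data within a specified time range can be sketched as follows. The `AudioClip` type and both function names are hypothetical; the `tolerance` parameter is an added assumption (the text above requires the same acquisition time, which corresponds to the default of 0).

```python
from typing import List, NamedTuple, Optional


class AudioClip(NamedTuple):
    timestamp: float  # acquisition time (assumed to be seconds)
    samples: bytes


def audio_in_range(clips: List[AudioClip], start: float, end: float) -> List[AudioClip]:
    """Audio data set within the specified time range [start, end]."""
    return [clip for clip in clips if start <= clip.timestamp <= end]


def find_matching_audio(
    clips: List[AudioClip], image_timestamp: float, tolerance: float = 0.0
) -> Optional[AudioClip]:
    """Return the clip acquired at the same time as the abnormal image, if any."""
    for clip in clips:
        if abs(clip.timestamp - image_timestamp) <= tolerance:
            return clip
    return None
```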

Optionally, the method further includes:

obtaining real-time audio data representing an environment to be processed;

obtaining reference audio data representing a reference environment;

and judging whether the audio features of the target object in the real-time audio data are matched with the audio features of the reference target object in the reference audio data, and if the audio features of the target object are not matched with the audio features of the reference target object, sending the real-time audio data to a server.

Optionally, the determining whether the audio feature of the target object in the real-time audio data matches the audio feature of the reference target object in the reference audio data includes:

obtaining sound characteristic information of a reference target object in the reference audio data;

acquiring sound characteristic information of a target object in the real-time audio data;

judging whether the similarity between the sound characteristic of the target object and the sound characteristic of the reference target object exceeds a preset sound similarity threshold value or not;

determining that the sound feature of the target object matches the sound feature of the reference target object if the similarity between the sound feature of the target object and the sound feature of the reference target object exceeds the predetermined sound similarity threshold, otherwise determining that the sound feature of the target object does not match the sound feature of the reference target object.
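The sound similarity test can be illustrated with a small sketch. The application does not say how similarity is computed; cosine similarity over numeric feature vectors and the threshold value 0.9 are assumptions chosen only to make the example concrete.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sound_features_match(
    target: Sequence[float], reference: Sequence[float], threshold: float = 0.9
) -> bool:
    """Match only if the similarity exceeds the sound similarity threshold."""
    return cosine_similarity(target, reference) > threshold
```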

Correspondingly, the present application provides a data processing apparatus comprising:

a real-time image data acquisition unit for acquiring real-time image data representing an environment to be processed;

a reference image data obtaining unit for obtaining reference image data representing a reference environment;

a judging unit configured to judge whether a feature of a target object in the real-time image data matches a feature of a reference target object in the reference image data;

the first sending unit is used for judging the real-time image data to be abnormal image data if the characteristics of the target object are not matched with the characteristics of the reference target object, and sending the abnormal image data to a server;

and the second sending unit is used for sending the audio data matched with the abnormal image data to the server side, wherein the audio data matched with the abnormal image data is the audio data of the environment collected when the abnormal image data is generated.

The present application further provides a data processing method, including:

obtaining real-time audio data representing an environment to be processed;

obtaining reference audio data representing a reference environment;

judging whether the audio features of a target object in the real-time audio data are matched with the audio features of a reference target object in the reference audio data, if the audio features of the target object are not matched with the audio features of the reference target object, judging that the real-time audio data are abnormal audio data, and sending the abnormal audio data to a server;

and sending the image data matched with the abnormal audio data to the server, wherein the image data matched with the abnormal audio data is the image data of the environment collected when the abnormal audio data is generated.

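Optionally, the judging whether the audio features of the target object in the real-time audio data are matched with the audio features of the reference target object in the reference audio data includes:

judging whether the target object and the reference target object are the same object;

if the target object and the reference target object are the same object, judging whether the similarity between the audio features of the target object and the audio features of the reference target object exceeds a preset similarity threshold value;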

determining that the audio feature of the target object matches the audio feature of the reference target object if the similarity between the audio feature of the target object and the audio feature of the reference target object exceeds the similarity threshold, otherwise determining that the audio feature of the target object does not match the audio feature of the reference target object.

determining that the audio features of the target object do not match the audio features of the reference target object if the target object and the reference target object are different objects.

Optionally, the target object and the reference target object are different objects, that is, all target objects in the real-time audio data are different from all reference target objects in the reference audio data, or all target objects in the real-time audio data are not completely the same as all reference target objects in the reference audio data.

Optionally, if the audio feature of the target object is not matched with the audio feature of the reference target object, determining that the real-time audio data is abnormal audio data, and sending the abnormal audio data to a server, including:

if the audio features of the target object are not matched with the audio features of the reference target object, inputting the audio features of the target object into a deep neural network model to obtain an identification result of whether the target object comprises a target object needing attention; the deep neural network model is used for identifying whether the target object comprises a target object needing attention according to the audio features of the target object;

and if the target object comprises a target object needing attention, judging that the real-time audio data is abnormal audio data, and sending the abnormal audio data to a server.

Optionally, if the target object includes a target object that needs to be paid attention to, determining that the real-time audio data is abnormal audio data, and sending the abnormal audio data to a server, including:

and if the target object comprises a target object needing attention and the target object needing attention is matched with a preset comparison object, judging that the real-time audio data is abnormal audio data, and sending the abnormal audio data to a server.

Optionally, the method further includes: and sending the image data set which comprises the image data matched with the abnormal audio data in the specified time range to the server.

Optionally, the method further includes:

searching image data with the same acquisition time as the abnormal audio data from the image data set within the specified time range;

and taking the searched image data as the image data matched with the abnormal audio data.

Optionally, the method further includes:

obtaining real-time image data representing an environment to be processed;

obtaining reference image data representing a reference environment;

and judging whether the characteristics of the target object in the real-time image data are matched with the characteristics of the reference target object in the reference image data, and if the characteristics of the target object are not matched with the characteristics of the reference target object, sending the real-time image data to a server.

obtaining feature information of a reference target object in the reference image data;

acquiring characteristic information of a target object in the real-time image data;

determining whether a similarity between the feature of the target object and the feature of the reference target object exceeds a predetermined similarity threshold;

determining that the feature of the target object matches the feature of the reference target object if the similarity between the feature of the target object and the feature of the reference target object exceeds the predetermined similarity threshold, otherwise determining that the feature of the target object does not match the feature of the reference target object.

Correspondingly, the present application additionally provides a data processing apparatus comprising:

a real-time audio data acquisition unit for acquiring real-time audio data representing an environment to be processed;

a reference audio data acquisition unit for acquiring reference audio data representing a reference environment;

the judging unit is used for judging whether the audio characteristics of the target object in the real-time audio data are matched with the audio characteristics of the reference target object in the reference audio data;

the first sending unit is used for judging the real-time audio data to be abnormal audio data if the audio characteristics of the target object are not matched with the audio characteristics of the reference target object, and sending the abnormal audio data to a server;

and the second sending unit is used for sending the image data matched with the abnormal audio data to the server side, wherein the image data matched with the abnormal audio data is the image data of the environment collected when the abnormal audio data is generated.

The present application further provides a data processing method, including:

obtaining an image data set and an audio data set, wherein the image data in the image data set and the audio data in the audio data set are used for representing the surrounding environment of the monitoring device;

acquiring an image data set needing attention from the image data set by using a deep neural network model, and acquiring an abnormal audio data set meeting abnormal audio data conditions from the audio data set; the deep neural network model is used for identifying whether the target object comprises a target object needing attention or not according to the characteristics of the target object in the image data;

and if the time information of at least one image data in the image data set needing attention, which is acquired by the monitoring equipment, is matched with the time information of at least one abnormal audio data in the abnormal audio data set, which is acquired by the monitoring equipment, sending alarm information to the monitoring equipment or computing equipment for displaying a monitoring result.

Optionally, the obtaining the image data set and the audio data set includes:

acquiring an image data set through a cloud storage server which stores the image data set in advance;

the audio data set is obtained by a storage means provided on the audio playback apparatus that stores the audio data set in advance.

Optionally, the obtaining the image data set and the audio data set includes:

sending a request for obtaining an image data set and an audio data set;

an image data set and an audio data set for the request are obtained.

Optionally, the obtaining, by using the deep neural network model, an image data set that needs to be focused from the image data set includes:

screening out image data containing the target object needing attention from the image data set by utilizing a deep neural network model;

marking the screened image data containing the target object needing attention;

and deleting the image data which do not contain the marks in the image data set to obtain all the image data containing the marks.
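The screen, mark, and delete steps above can be sketched as follows. The `needs_attention` callback is a hypothetical stand-in for the deep neural network model, and representing each image as a dictionary with a `marked` key is an assumption made for illustration.

```python
from typing import Callable, Dict, List


def select_images_needing_attention(
    image_set: List[Dict], needs_attention: Callable[[Dict], bool]
) -> List[Dict]:
    """Mark images containing an object needing attention and keep only those."""
    # Screening and marking: the callback stands in for the model.
    for image in image_set:
        if needs_attention(image):
            image["marked"] = True
    # Deleting unmarked image data leaves only the marked images.
    return [image for image in image_set if image.get("marked")]
```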

Optionally, the obtaining an abnormal audio data set satisfying an abnormal audio data condition from the audio data set includes:

judging whether each piece of audio data in the audio data set meets an abnormal audio data condition;

and marking the audio data meeting the abnormal audio data condition, and taking all the audio data containing the marks as an abnormal audio data set meeting the abnormal audio data condition.

Optionally, the determining whether each piece of audio data in the audio data set satisfies an abnormal audio data condition includes:

obtaining abnormal audio data conditions; the abnormal audio data condition comprises that the audio features of the target object of the audio data do not match the audio features of the reference object;

and judging whether each audio data in the audio data set meets the abnormal audio data condition or not according to the abnormal audio data condition.

Optionally, the method further includes:

acquiring time information of each audio data of an abnormal audio data set meeting abnormal audio data conditions, wherein the time information is acquired by the monitoring equipment;

acquiring time information of each image data of an image data set needing attention, which is acquired by the monitoring equipment;

and judging whether the time information of each audio data of the abnormal audio data set, which is acquired by the monitoring equipment, is matched with the time information of each image data of the image data set needing attention, which is acquired by the monitoring equipment.

Optionally, the determining whether the time information of each audio data of the abnormal audio data set, which is acquired by the monitoring device, matches with the time information of each image data of the image data set that needs to be focused, which is acquired by the monitoring device, includes:

respectively calculating the matching degree of the time information of each audio data of the abnormal audio data set, which is acquired by the monitoring equipment, and the time information of each image data of the image data set needing attention, which is acquired by the monitoring equipment;

and if at least one of the matching degree calculation results is within a specified time matching degree threshold, determining that the time information, acquired by the monitoring equipment, of at least one image data in the image data set needing attention matches the time information, acquired by the monitoring equipment, of at least one abnormal audio data in the abnormal audio data set.
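The pairwise time matching can be sketched as below. The application does not define the matching degree; here it is assumed to be the absolute difference between two acquisition times in seconds, and the function names are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def matched_pairs(
    image_times: List[float], audio_times: List[float], threshold_seconds: float
) -> List[Tuple[float, float]]:
    """All (image time, audio time) pairs whose difference is within the threshold."""
    return [
        (image_time, audio_time)
        for image_time, audio_time in product(image_times, audio_times)
        if abs(image_time - audio_time) <= threshold_seconds
    ]


def should_alarm(
    image_times: List[float], audio_times: List[float], threshold_seconds: float
) -> bool:
    """Alarm if at least one image and one abnormal audio clip match in time."""
    return bool(matched_pairs(image_times, audio_times, threshold_seconds))
```

A single matching pair is enough to trigger the alarm in this sketch; the list of pairs can also be reused when combining the abnormal audio data with the image data needing attention.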

Optionally, the sending of the alarm information to the monitoring device or the computing device for displaying the monitoring result includes:

matching and combining the abnormal audio data matched with the time information and the image data needing attention according to the time information;

sending the abnormal audio data matched and combined with the image data needing attention to the monitoring equipment or computing equipment for displaying a monitoring result; and sending alarm information to the monitoring equipment or the computing equipment for displaying the monitoring result.

Optionally, the method further includes:

judging the type of an abnormal event according to the abnormal audio data matched and combined with the image data needing attention;

and acquiring the warning mode corresponding to the type of the abnormal event according to the type of the abnormal event and the correspondence between types of abnormal events and warning modes;

and issuing a warning for the abnormal event in the warning mode corresponding to its type.

Optionally, the method further includes:

judging whether the type of the abnormal event is in a list of the corresponding relation between the type of the abnormal event and a warning mode, and if the type of the abnormal event is in the list, directly obtaining the warning mode corresponding to the type of the abnormal event according to the list;

if the type of the abnormal event is not in the list, acquiring an alarm mode corresponding to the type of the abnormal event according to the type of the abnormal event, and adding the alarm mode corresponding to the type of the abnormal event into the list.
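The list lookup with fallback described above can be sketched as a dictionary that is extended on a miss. The `derive_mode` callback, which decides a warning mode for an event type not yet in the list, and the example event and mode names are hypothetical.

```python
from typing import Callable, Dict


def get_warning_mode(
    event_type: str,
    table: Dict[str, str],
    derive_mode: Callable[[str], str],
) -> str:
    """Look up the warning mode for an event type, adding it to the list if absent."""
    if event_type in table:
        return table[event_type]  # type already in the correspondence list
    # Not in the list: obtain a warning mode for this type and record it.
    mode = derive_mode(event_type)
    table[event_type] = mode
    return mode
```

With this arrangement the fallback runs at most once per event type, since later lookups are served from the list.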

The present application also provides a data processing apparatus, comprising:

the data set obtaining unit is used for obtaining an image data set and an audio data set, wherein the image data in the image data set and the audio data in the audio data set are used for representing the surrounding environment of the monitoring device;

the screening unit is used for acquiring an image data set needing attention from the image data set by using a deep neural network model and acquiring an abnormal audio data set meeting abnormal audio data conditions from the audio data set; the deep neural network model is used for identifying whether the target object comprises a target object needing attention according to the characteristics of the target object in the image data;

and the warning unit is used for sending warning information to the monitoring equipment or computing equipment for displaying a monitoring result if the time information of at least one image data in the image data set needing attention, which is acquired by the monitoring equipment, is matched with the time information of at least one abnormal audio data in the abnormal audio data set, which is acquired by the monitoring equipment.

The present application further provides an electronic device, comprising:

a processor;

a memory for storing a program of the data processing method, the program performing the following steps:

obtaining real-time image data representing an environment to be processed;

obtaining reference image data representing a reference environment;

The present application additionally provides an electronic device comprising:

a processor;

obtaining real-time audio data representing an environment to be processed;

obtaining reference audio data representing a reference environment;

The present application further provides an electronic device, comprising:

a processor;

The present application also provides a computer storage medium storing a program of a data processing method, the program being executed by a processor to perform the steps of:

obtaining real-time image data representing an environment to be processed;

obtaining reference image data representing a reference environment;

The present application further provides a computer storage medium storing a program of a data processing method, the program being executed by a processor to perform the steps of:

obtaining real-time audio data representing an environment to be processed;

obtaining reference audio data representing a reference environment;

Compared with the prior art, the method has the following advantages:

The application provides a data processing method, which comprises the following steps: obtaining real-time image data representing an environment to be processed; obtaining reference image data representing a reference environment; judging whether the features of a target object in the real-time image data match the features of a reference target object in the reference image data and, if they do not match, judging the real-time image data to be abnormal image data and sending the abnormal image data to a server; and sending the audio data matched with the abnormal image data to the server, where the matched audio data is the audio data of the environment collected when the abnormal image data was generated. By adopting this data processing method, the acquired image data is preliminarily screened locally, which reduces the data sent to the server and thereby reduces the pressure on the server for storing and processing data. That is, the collected real-time image data is compared with the reference image data to judge whether the real-time image data is abnormal image data; when the real-time image data is preliminarily judged locally to be abnormal image data, the audio data corresponding to that abnormal image data is matched and sent to the server together with it, so that the audio data is screened on the basis of the image data.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them.

Fig. 1-A is a schematic diagram of a first application scenario embodiment provided in the present application.

Fig. 1-B is a schematic diagram of a second application scenario embodiment provided in the present application.

Fig. 1 is a flowchart of a data processing method according to a first embodiment of the present application.

Fig. 2 is a schematic diagram of a data processing apparatus according to a second embodiment of the present application.

Fig. 3 is a flowchart of a data processing method according to a third embodiment of the present application.

Fig. 4 is a schematic diagram of a data processing apparatus according to a fourth embodiment of the present application.

Fig. 5 is a flowchart of a data processing method according to a fifth embodiment of the present application.

Fig. 6 is a schematic diagram of a data processing apparatus according to a sixth embodiment of the present application.

Fig. 7 is a schematic diagram of an electronic device for data processing according to a seventh embodiment of the present application.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, this application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this application; the application is therefore not limited to the specific implementations disclosed below.

The application provides three data processing methods and devices, electronic equipment and a computer storage medium respectively.

Some embodiments provided in this application can be applied to a scenario in which a client, a cloud server, and a smart speaker equipped with a camera device and a sound collection device (such as a microphone) interact. Fig. 1-A is a schematic diagram of a first application scenario embodiment provided in the present application. The smart speaker with the camera device and the sound collection device collects image data and audio data of the current environment (the detection object) in real time and performs preliminary screening on the image data. The preliminarily screened image data and the audio data corresponding to it are then uploaded to the cloud server. The cloud server receives the preliminarily screened image data and the corresponding audio data, analyzes them using image recognition technology and audio recognition technology respectively, and obtains the identification result of an abnormal event from this analysis. Finally, the identification result of the abnormal event is sent to the client and alarm information is issued. The client opens a voice interaction system to the monitoring object and offers help, and the monitoring object's feedback on the help needed is transmitted to the client through the microphone. In this scenario, the amount of data obtained at the local end by the smart speaker with the camera device and the sound collection device is large, whereas the amount of preliminarily screened image data and corresponding audio data received by the cloud server is comparatively small.
Therefore, different service modes can be selected depending on whether the data volume is large or small. For example, a time interval T₁ may be used for collecting audio and video data when the data volume is large, and a time interval T₂ may be used when the data volume is small, where T₁ is greater than T₂. It should be noted that this application scenario is only one embodiment of an application scenario; it is provided to facilitate understanding of the data processing method of the present application and is not intended to limit the data processing method of the present application.
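The choice between the two collection intervals can be illustrated with a minimal sketch. The function name, the byte-based measure of data volume, and the threshold that separates "large" from "small" are assumptions made for illustration; the application only states that T₁ is used for a large data volume, T₂ for a small one, and that T₁ is greater than T₂.

```python
def choose_capture_interval(data_volume_bytes: int,
                            large_volume_threshold: int,
                            t1_seconds: float,
                            t2_seconds: float) -> float:
    """Return the collection interval for the given data volume.

    large_volume_threshold is a hypothetical cut-off; the application
    does not say how "large" and "small" are told apart.
    """
    if t1_seconds <= t2_seconds:
        raise ValueError("T1 must be greater than T2")
    if data_volume_bytes >= large_volume_threshold:
        return t1_seconds  # large data volume: interval T1
    return t2_seconds      # small data volume: interval T2
```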

The present application provides a data processing method, and the following embodiments are specific examples.

Fig. 1 is a flowchart of an embodiment of a data processing method according to an embodiment of the present application. The method comprises the following steps.

Step S101: real-time image data representing an environment to be processed is obtained.

When data processing is performed using the data processing method of the present embodiment, real-time image data representing the environment to be processed is obtained first; in other words, current image data representing the current environment is obtained. There are various ways of obtaining the current image data representing the current environment, one of which is described below.

Current image data representing the current environment around the audio playback apparatus is obtained by a camera device provided on the audio playback apparatus. In the present embodiment, current image data representing the current environment can be obtained using an audio playback apparatus equipped with a camera. For example, a camera device may be mounted on a smart speaker. Specifically, the camera device may be a camera built into the smart speaker, or an external camera device that can be detached independently. The camera device may also be connected to the smart speaker in other ways.

The audio playing device connected with the camera device can acquire the audio data and the image data of the current environment around the audio playing device in real time. Specifically, a storage device for storing audio data and image data collected in real time may be built in the audio playback device.

With the audio playback apparatus equipped with the image pickup apparatus of the present embodiment, current image data representing the current environment can be obtained in the following manner.

First, a request to obtain the current image data is sent.

The audio playback device connected to the camera device can collect the audio data and image data of the current environment in real time and store them in the storage device. Therefore, when the current image data is to be obtained, a request for obtaining the current image data is first sent to the storage device. The camera device may itself be able to store image data collected in real time, and may therefore also serve as the storage carrier for the audio data and image data collected in real time; in that case, the request for obtaining the current image data is sent to the camera device.

Thereafter, the current image data for the request is received. After the request for obtaining the current image data is sent, the camera device or the storage device receives the request and sends out the current image data for the request, and the party that issued the request receives the current image data sent by the camera device or the storage device.
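The request and response exchange described above can be sketched as follows. This is only an illustration under stated assumptions: the `Frame` and `FrameStore` types, the `GET_CURRENT_IMAGE` request name, and the choice to return the most recently stored frame are hypothetical and are not specified in the application.

```python
from dataclasses import dataclass
from typing import Optional

GET_CURRENT_IMAGE = "GET_CURRENT_IMAGE"  # hypothetical request name


@dataclass
class Frame:
    timestamp: float  # acquisition time in seconds
    pixels: bytes     # raw image payload


class FrameStore:
    """Stand-in for the storage device (or the camera device acting as one)."""

    def __init__(self) -> None:
        self._latest: Optional[Frame] = None

    def save(self, frame: Frame) -> None:
        # Frames collected in real time overwrite the previous "current" frame.
        self._latest = frame

    def handle_request(self, request: str) -> Optional[Frame]:
        # The store answers a request for the current image data.
        if request == GET_CURRENT_IMAGE:
            return self._latest
        return None


def obtain_current_image(store: FrameStore) -> Optional[Frame]:
    """Send the request, then receive the current image data for it."""
    return store.handle_request(GET_CURRENT_IMAGE)
```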

Step S102: reference image data representing a reference environment is obtained.

While step S101 is being performed, a step of obtaining reference image data representing a reference environment is performed.

The audio playing device connected with the camera device of the embodiment can achieve the effect of intelligently monitoring the surrounding environment by acquiring the audio data and the image data of the current environment and comparing the acquired audio data and the image data of the current environment with the audio data and the image data of the reference environment. Therefore, it is necessary to obtain reference image data of a reference environment while obtaining image data of a current environment. And comparing the difference between the image data of the current environment and the image data of the reference environment, and analyzing the change of the current environment according to the difference degree of the image data.

As one of the ways of obtaining reference image data representing a reference environment, first, default image data representing a default environment in the vicinity of an audio playback device, which is stored in advance by the audio playback device, is obtained. Then, the default image data of the default environment is used as the reference image data of the reference environment.

Specifically, default image data of a default environment around the audio playback apparatus may be stored in the audio playback apparatus in advance. For example, in an initial stage of enabling the audio playing device, the audio playing device starts to store the surrounding environment image information, the surrounding environment image information of the audio playing device in the initial stage may be stored in the audio playing device, and the surrounding environment image information of the audio playing device in the initial stage may be used as default image data of a default environment, that is, reference image data of a reference environment.
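One way to picture the default image data is a store that keeps the first image observed after the device is enabled and serves it as the reference from then on. The class and method names below are hypothetical, and treating "the initial stage" as "the first observed frame" is an assumption made for this sketch.

```python
from typing import Optional


class ReferenceImageStore:
    """Keeps the image captured in the initial stage as the default image data."""

    def __init__(self) -> None:
        self._default: Optional[bytes] = None

    def observe(self, frame: bytes) -> None:
        # Only the first frame seen after enabling the device is retained.
        if self._default is None:
            self._default = frame

    def reference(self) -> bytes:
        # The default image data is used as the reference image data.
        if self._default is None:
            raise LookupError("no default image data has been stored yet")
        return self._default
```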

Step S103: judging whether the features of the target object in the real-time image data match the features of the reference target object in the reference image data; if the features of the target object do not match the features of the reference target object, judging that the real-time image data is abnormal image data, and sending the abnormal image data to the server.

After current image data representing a current environment is obtained and reference image data representing a reference environment is obtained, it is determined whether a feature of a current target object in the current image data matches a feature of a reference target object in the reference image data.

With the data processing method of this embodiment, the image data and audio data of the current environment must not only be collected in real time but also be screened. The screening selects the data for which the image data and audio data of the current environment differ substantially from the image data and audio data of the reference environment. The data with the larger difference is then sent to the cloud for fine-grained computation, so that intelligent monitoring is achieved according to the cloud's computation result. Screening is needed mainly because, in the field of intelligent monitoring, a cloud algorithm must perform fine-grained computation on the image data and audio data in order to learn whether the image data and audio data of the current environment are abnormal. However, the storage capacity of the cloud is limited, and the image data and audio data of the current environment collected in real time occupy a large amount of storage space. If the data collected in real time is not screened, it puts storage pressure on the cloud's memory and in turn affects the processing speed of the cloud algorithm.

In the actual data processing process, the audio data and the image data of the current environment can be preliminarily screened in multiple modes, so that the data storage pressure of a cloud memory is reduced. Since the audio data occupies a much smaller memory than the image data, in this embodiment, it is preferable to perform preliminary screening on the image data.

As one way of preliminarily screening the image data of the current environment, it may be determined whether the features of the target object in the real-time image data match the features of the reference target object in the reference image data; if the features of the target object do not match the features of the reference target object, the real-time image data is determined to be abnormal image data, and the abnormal image data is sent to the server. Accordingly, if the features of the target object match the features of the reference target object, the real-time image data is not sent to the server.

Specifically, whether the features of the target object in the real-time image data match the features of the reference target object in the reference image data may be determined in a manner described below.

First, it is determined whether the target object and the reference target object are the same object.

Before judging whether the target object and the reference target object are the same object, firstly, the target object is obtained according to the real-time image data, and the reference target object is obtained according to the reference image data.

And then, judging whether the target object and the reference target object are the same object or not according to the obtained target object and the reference target object.

For example, if the reference target object obtained in the reference image data is a child, and the current target object obtained in the current image data is also the same child, the current target object and the reference target object are the same object. Or if the reference target object obtained in the reference image data is an automobile and the current target object obtained in the current image data is the same automobile, the current target object and the reference target object are the same object. In short, the determination in this step mainly determines whether or not the current target object of the current image data has changed with respect to the reference target object in the reference image data, with reference to the reference target object in the reference image data.

After determining whether the target object and the reference target object are the same object, if the target object and the reference target object are the same object, determining whether a similarity between a feature of the target object and a feature of the reference target object exceeds a predetermined similarity threshold.

Specifically, after the target object and the reference target object are determined to be the same object, feature information of the target object and feature information of the reference target object are obtained, and whether the similarity between the features of the target object and the reference target object exceeds a predetermined similarity threshold is determined.

More specifically, it is determined whether or not the similarity between the feature of the target object and the feature of the reference target object exceeds a predetermined similarity threshold, and the determination may be made in the manner described below.

First, it is determined whether difference information between a position of a target object in the real-time image data and a position of the reference target object in the reference image data exceeds a predetermined position difference threshold.

For example, when the current target object and the reference target object are the same child and the child is standing in the reference image data, the position of the standing child in the reference image data can be determined; for instance, the vertical distance d₁ between the child's head and a specified reference object in the reference image data can be determined. Similarly, the vertical distance d₂ between the child's head and a specified reference object in the current image data can be determined. To allow comparison, the specified reference object in the reference image data and the specified reference object in the current image data are set to be the same reference object, and its position in the current image data is the same as in the reference image data. For example, the reference object may be the ground. After the vertical distance d₁ from the child's head to the specified reference object in the reference image data and the vertical distance d₂ from the child's head to the specified reference object in the current image data are obtained, d₁ and d₂ are compared to determine whether the child has fallen or is about to fall.

Specifically, a position difference threshold may be preset, and the position difference threshold may be a height difference or a percentage. If the preset position difference threshold is a height difference, the difference between d₁ and d₂ obtained above is computed directly and compared with the preset height difference; if the obtained difference is smaller than the preset height difference, the child is determined to be in a relatively safe state, and otherwise the child is determined to be in an unsafe state. Similarly, if the preset position difference threshold is a percentage, the quotient of d₁ and d₂ obtained above is computed directly and compared with the preset percentage; if the obtained quotient is within the preset percentage range, the child is determined to be in a relatively safe state, and otherwise the child is determined to be in an unsafe state.

In the two embodiments, it is determined that the difference information does not exceed the position difference threshold value when the obtained difference value is smaller than a preset height difference or the quotient value is within a preset percentage range; conversely, it is determined that the difference information exceeds the position difference threshold value if the obtained difference value is greater than a preset height difference or the quotient value is not within a preset percentage range.
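The two threshold variants above can be sketched as follows. This is an illustrative sketch, not the application's implementation: the function names are hypothetical, the absolute value of the difference is an assumption (the text only says the difference is taken), and the quotient is assumed to be d₂ divided by d₁ with the "percentage range" given as a lower and upper bound.

```python
def exceeds_height_difference(d1: float, d2: float, max_height_diff: float) -> bool:
    """Height-difference variant of the position difference threshold.

    Returns True when the difference information exceeds the threshold.
    Per the text, a difference smaller than the preset value is safe and
    anything else is treated as unsafe.
    """
    return abs(d1 - d2) >= max_height_diff  # abs() is an assumption


def exceeds_ratio_range(d1: float, d2: float, low: float, high: float) -> bool:
    """Percentage variant: True when the quotient falls outside [low, high].

    The direction of the quotient (d2 / d1) is an assumption.
    """
    if d1 <= 0:
        raise ValueError("d1 must be positive")
    ratio = d2 / d1
    return not (low <= ratio <= high)
```

For instance, with d₁ = 1.0 m for a standing child and d₂ = 0.3 m in the current image, a preset height difference of 0.3 m would be exceeded, whereas d₂ = 0.95 m would not exceed it.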

If the difference information exceeds the position difference threshold, it is determined that the similarity between the feature of the target object and the feature of the reference target object does not exceed the similarity threshold; otherwise, it is determined that the similarity between the feature of the target object and the feature of the reference target object exceeds the similarity threshold. For example, in the above example, if the obtained difference is smaller than the preset height difference or the quotient is within the preset percentage range, the similarity between the feature of the target object and the feature of the reference target object exceeds the similarity threshold, which means that the child has remained standing and confirms that the child has remained in a relatively safe state. If the obtained difference is greater than the preset height difference or the quotient is not within the preset percentage range, the similarity between the feature of the target object and the feature of the reference target object does not exceed the similarity threshold, which means that the child has not remained standing and indicates that the child may have fallen and be in an unsafe state. After it is determined whether the similarity between the features of the target object and the features of the reference target object exceeds the similarity threshold, whether the features of the target object match the features of the reference target object is judged according to that result.

Specifically, if the similarity between the features of the target object and the features of the reference target object exceeds a similarity threshold, determining that the features of the target object match the features of the reference target object;

and if the similarity between the features of the target object and the reference target object does not exceed the similarity threshold, determining that the features of the target object are not matched with the features of the reference target object.

The determination process described above applies to the case in which the target object and the reference target object are determined to be the same object. When it is actually determined whether the features of the target object match the features of the reference target object, there is also the case in which the target object and the reference target object are not the same object.

If the target object and the reference target object are different objects, it is determined that the features of the target object do not match the features of the reference target object.

Specifically, the target object and the reference target object are different objects, that is, all target objects in the real-time image data are different from all reference target objects in the reference image data, or all target objects in the real-time image data are not completely the same as all reference target objects in the reference image data.

For example, if the reference target object obtained in the reference image data is a child and a car, and the current target object obtained in the current image data is only a car, it is determined that all the current target objects in the current image data are not completely identical to all the reference target objects in the reference image data.

If the reference target objects obtained in the reference image data are a child and a car and the current target objects obtained in the current image data are a pot, an electric car and an adult man, all the current target objects in the current image data are different from all the reference target objects in the reference image data.

Of course, whether all the target objects in the real-time image data are different from all the reference target objects in the reference image data or all the target objects in the real-time image data are not completely the same as all the reference target objects in the reference image data, it means that the target objects and the reference target objects are different objects, and when the target objects and the reference target objects are different objects, it is directly determined that the features of the target objects are not matched with the features of the reference target objects.
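Putting the two cases together, the overall matching decision can be sketched as below. The representation is an assumption made for illustration: each image is reduced to a dictionary mapping a hypothetical object label to one position value (for example, the vertical distance to the floor), and the height-difference variant of the position threshold is used.

```python
from typing import Dict


def features_match(current: Dict[str, float],
                   reference: Dict[str, float],
                   max_height_diff: float) -> bool:
    """Return True when the current objects match the reference objects.

    current / reference map an object label to a position value.
    """
    # Different objects (entirely different, or not completely the same):
    # the features are directly judged not to match.
    if set(current) != set(reference):
        return False
    # Same objects: compare positions against the position difference threshold.
    for label, ref_position in reference.items():
        if abs(ref_position - current[label]) >= max_height_diff:
            return False  # similarity does not exceed the threshold
    return True


reference = {"child": 1.0, "car": 1.4}
print(features_match({"car": 1.4}, reference, 0.3))                # only the car remains
print(features_match({"child": 0.3, "car": 1.4}, reference, 0.3))  # child much lower
print(features_match({"child": 0.95, "car": 1.4}, reference, 0.3)) # child still standing
```

Only the last call reports a match; the first two would mark the real-time image data as abnormal image data.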

After it is determined that the features of the target object do not match the features of the reference target object, the real-time image data is judged to be abnormal image data, and the abnormal image data is sent to the server. Sending only the real-time image data in which the features of the target object do not match the features of the reference target object amounts to a preliminary screening of the image data of the current environment; compared with sending all image data directly to the server, this reduces the storage pressure on the cloud storage server to a certain extent.

In this embodiment, the server corresponds to the cloud storage, and data can be transmitted between the server and the cloud storage, which is not described herein again because the data transmission technology for the server and the cloud storage in the prior art is mature.

Certainly, in order to further reduce the storage pressure of the cloud storage server, the image data after the primary screening may be subjected to secondary screening before being sent to the server. Specific secondary screening procedures are described below.

Firstly, if the characteristics of the current target object are not matched with the characteristics of the reference target object, inputting the characteristics of the current target object into a deep neural network model to obtain an identification result of whether the current target object comprises a target object needing attention; the deep neural network model is used for identifying whether the target object comprises the target object needing attention according to the characteristics of the target object.

The deep neural network model receives the features of the target objects and identifies, according to those features, whether the target objects include a target object requiring attention. For example, suppose the current target objects in the current image data are one potted flower, one electric vehicle, and one adult man, the reference target objects in the reference image data are one potted flower and one electric vehicle, and a child is the target object requiring attention. Even though the features of the current target objects do not match the features of the reference target objects, the deep neural network model identifies that the current target objects (here, the additional adult man) do not include a target object requiring attention. Therefore, in this scenario, even if it is determined that the features of the current target object do not match the features of the reference target object, the abnormal image data need not be sent to the server.

And then, judging whether the abnormal image data is sent to the server side according to the judgment result of whether the current target object comprises the target object needing attention.

If the current target object comprises a target object needing attention, further, whether the target object needing attention is matched with a preset comparison object needs to be judged, and if yes, the abnormal image data are sent to a server side. For example, the preset comparison object is a child a, and if the target object needing attention is also the child a, the abnormal image data is sent to the server; otherwise, the abnormal image data is not sent to the server.
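The secondary screening can be sketched as a two-stage check. The deep neural network model is replaced here by an arbitrary callable, which is purely a placeholder; the label strings, the function name, and the string-prefix "classifier" in the usage lines are hypothetical and only mimic the child example above.

```python
from typing import Callable, Iterable


def should_upload(current_labels: Iterable[str],
                  needs_attention: Callable[[str], bool],
                  comparison_object: str) -> bool:
    """Decide whether abnormal image data is sent to the server.

    needs_attention stands in for the deep neural network model that
    recognises target objects requiring attention.
    """
    attention_objects = [label for label in current_labels if needs_attention(label)]
    if not attention_objects:
        return False  # nothing requiring attention: do not send
    # Send only if an attention object matches the preset comparison object.
    return comparison_object in attention_objects


# Placeholder "model": anything labelled as a child requires attention.
is_child = lambda label: label.startswith("child")
print(should_upload(["potted_flower", "electric_vehicle", "adult_man"], is_child, "child_a"))
print(should_upload(["child_a", "car"], is_child, "child_a"))
```

The first call reports that nothing is uploaded and the second that the abnormal image data is uploaded.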

Specifically, if the current target object includes a target object that needs to be focused, before sending the abnormal image data to the server, the following method may be further performed.

And judging whether the frequency of the target object needing to be concerned appearing in the abnormal image data is within a specified threshold frequency within a specified time, and if so, sending the abnormal image data to a server. Otherwise, the abnormal image data is not sent to the server.

With the above judging mode, image data in which a target object requiring attention appears only occasionally in the abnormal image data can be screened out, and that abnormal image data is not uploaded to the server. In this way, current image data corresponding to a target object requiring attention that is in a normal state is not uploaded to the server. This embodiment mainly processes image data of abnormal situations (including falling, quarreling, getting up, and the like) to achieve intelligent monitoring, so image data of a target object requiring attention in a normal state need not be uploaded to the cloud storage server or the server, which reduces the storage pressure on the cloud storage server.
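The frequency check can be sketched as counting how often the target object requiring attention appears within the specified time and testing that count against a preset range. The text does not spell out what "within a specified threshold frequency" means numerically, so the lower and upper bounds, the half-open time window, and the function names below are all assumptions.

```python
from typing import Iterable


def appearance_count(timestamps: Iterable[float],
                     window_start: float,
                     window_length: float) -> int:
    """Count appearances whose timestamp lies in [window_start, window_start + window_length)."""
    window_end = window_start + window_length
    return sum(1 for t in timestamps if window_start <= t < window_end)


def passes_frequency_check(timestamps: Iterable[float],
                           window_start: float,
                           window_length: float,
                           min_count: int,
                           max_count: int) -> bool:
    """True when the appearance count lies within the (hypothetical) preset range."""
    count = appearance_count(timestamps, window_start, window_length)
    return min_count <= count <= max_count
```

With a lower bound above one, a single stray appearance in the window fails the check and the corresponding abnormal image data would not be sent.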

Step S104: sending the audio data matched with the abnormal image data to the server, wherein the audio data matched with the abnormal image data is the audio data of the environment collected when the abnormal image data is generated.

To send the audio data matched with the abnormal image data to the server, an audio data set within a specified time range that includes the audio data matched with the abnormal image data may first be sent to the server.

To obtain the audio data set within the specified time range that includes the audio data matched with the abnormal image data, all audio data collected by the audio playback device may be sent to the server and used as that audio data set. Alternatively, the audio data collected by the audio playback device within a specified time period may be intercepted, and the intercepted audio data set sent to the server, where the intercepted audio data set at least includes the audio data matched with the abnormal image data.

Then, the audio data matched with the abnormal image data is obtained. One way of obtaining the audio data matched with the abnormal image data is as follows. First, audio data with the same acquisition time as the abnormal image data is searched for in the audio data set within the specified time range. Then, the audio data found is used as the audio data matched with the abnormal image data. With this method, the audio data matching each piece of image data uploaded to the server can be found; the image data and its matching audio data are then integrated, and the abnormal event is inferred from the integrated data so that a warning can be issued.
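The lookup of audio data with the same acquisition time can be sketched as follows. The `AudioClip` type and the rule that a clip matches when the image's acquisition time falls inside the clip's time span are assumptions for illustration; the application only requires that the acquisition times coincide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AudioClip:
    start: float    # acquisition start time in seconds
    end: float      # acquisition end time in seconds
    samples: bytes  # raw audio payload


def audio_matching_image(clips: List[AudioClip], image_time: float) -> List[AudioClip]:
    """Return the clips from the audio data set recorded while the image was captured."""
    return [clip for clip in clips if clip.start <= image_time < clip.end]


audio_set = [
    AudioClip(0.0, 5.0, b"quiet"),
    AudioClip(5.0, 10.0, b"scream"),
    AudioClip(10.0, 15.0, b"quiet"),
]
matched = audio_matching_image(audio_set, image_time=7.2)
print([clip.samples for clip in matched])
```

Here the abnormal image captured at 7.2 s is paired with the second clip only.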

After the audio data set within the specified time range that includes the audio data matched with the abnormal image data is obtained, the audio data set may be sent directly to the server, or the audio data may be preliminarily screened first and then sent to the server.

When the audio data is preliminarily screened before being sent to the server, the screening may be performed in the following manner. Since the preliminary screening of audio data is substantially similar to the preliminary screening of image data, reference may be made to the detailed description of the preliminary screening of image data for specifics.

First, current audio data representing a current environment is obtained.

Thereafter, reference audio data representing a reference environment is obtained.

Of course, there is no restriction on the order of obtaining the current audio data representing the current environment and obtaining the reference audio data representing the reference environment, and therefore, the reference audio data representing the reference environment may be obtained first and then the current audio data representing the current environment.

After current audio data used for representing the current environment and reference audio data used for representing the reference environment are obtained, whether the audio features of the current target object in the current audio data are matched with the audio features of the reference target object in the reference audio data or not is judged, and if the audio features of the current target object are not matched with the audio features of the reference target object, the current audio data are sent to a server side.

Specifically, determining whether the audio features of the current target object in the current audio data match the audio features of the reference target object in the reference audio data may be performed in the following manner.

First, sound characteristic information of a reference target object in reference audio data is obtained.

Secondly, sound characteristic information of the current target object in the current audio data is obtained.

Certainly, there is no sequential limitation in obtaining the sound characteristic information of the reference target object in the reference audio data and obtaining the sound characteristic information of the current target object in the current audio data, so the sound characteristic information of the current target object in the current audio data may be obtained first, and then the sound characteristic information of the reference target object in the reference audio data may be obtained.

After obtaining the sound feature information of the reference target object in the reference audio data and obtaining the sound feature information of the current target object in the current audio data, it is determined whether the similarity between the sound feature of the current target object and the sound feature of the reference target object exceeds a predetermined sound similarity threshold.

And if the similarity between the sound characteristic of the current target object and the sound characteristic of the reference target object exceeds a preset sound similarity threshold value, determining that the sound characteristic of the current target object is matched with the sound characteristic of the reference target object, and otherwise, determining that the sound characteristic of the current target object is not matched with the sound characteristic of the reference target object.

And after the sound characteristics of the current target object are determined to be not matched with the sound characteristics of the reference target object, directly sending the current audio data to the server. The current audio data with the audio characteristics of the current target object unmatched with the audio characteristics of the reference target object are sent to the server, the audio data of the current environment are basically subjected to primary screening, and compared with the mode that all the audio data are directly sent to the server, the mode can reduce the pressure of the cloud storage server for storing the audio data to a certain extent.

By adopting the above judging mode, the audio data corresponding to the non-abnormal event in the current audio data can be screened out, and meanwhile, the part of audio data is not uploaded to the server. In the embodiment, the audio data in abnormal situations (including sounds such as help seeking, quarrel, screaming and the like) are mainly processed so as to achieve intelligent monitoring, so that the audio data in a normal state can not be uploaded to the cloud storage server or the server, and the storage pressure of the cloud storage server is reduced.
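The audio screening rule can be sketched with a concrete similarity measure. The application does not name one, so cosine similarity over fixed-length sound feature vectors is an assumption, as are the function names and the example threshold.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_send_audio(current_features: Sequence[float],
                      reference_features: Sequence[float],
                      similarity_threshold: float) -> bool:
    """Send the current audio data only when its sound features do NOT match the reference.

    Features match when the similarity exceeds the preset threshold.
    """
    matched = cosine_similarity(current_features, reference_features) > similarity_threshold
    return not matched
```

With a threshold of 0.8, for example, audio whose features are nearly identical to the reference is withheld, while audio with unrelated features is sent to the server.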

In the first embodiment, a data processing method is provided, and correspondingly, the present application further provides a data processing apparatus. Fig. 2 is a schematic diagram of a data processing apparatus according to a second embodiment of the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

A data processing apparatus of the present embodiment includes:

a real-time image data acquisition unit 201 for acquiring real-time image data representing an environment to be processed;

a reference image data obtaining unit 202 for obtaining reference image data representing a reference environment;

a judging unit 203, configured to judge whether a feature of a target object in the real-time image data matches a feature of a reference target object in the reference image data;

a first sending unit 204, configured to determine that the real-time image data is abnormal image data if the feature of the target object is not matched with the feature of the reference target object, and send the abnormal image data to a server;

a second sending unit 205, configured to send audio data matched with the abnormal image data to the server, where the audio data matched with the abnormal image data is audio data of an environment collected when the abnormal image data is generated.

Optionally, the real-time image data acquiring unit is specifically configured to: obtain, through a camera device provided on an audio playback device, real-time image data representing the environment to be processed around the audio playback device.

Optionally, the reference image data obtaining unit is specifically configured to: obtain default image data that is stored in advance by the audio playback device and represents a default environment around the audio playback device.

Optionally, the real-time image data acquiring unit is specifically configured to:

sending a request for obtaining the real-time image data;

receiving the real-time image data for the request.

Optionally, the determining unit is specifically configured to:

Optionally, the first sending unit is specifically configured to:

Optionally, the second sending unit is further configured to:

and sending the audio data set comprising the audio data matched with the abnormal image data in the specified time range to the server.

Optionally, the second sending unit is further configured to:

obtaining real-time audio data representing an environment to be processed;

obtaining reference audio data representing a reference environment;

Optionally, the second sending unit is specifically configured to:

In the above embodiments, a data processing method and a data processing apparatus are provided, respectively. The present application additionally provides another data processing method. Fig. 3 is a flowchart of a data processing method according to a third embodiment of the present application. Since the specific implementation of the method in this embodiment is substantially similar to that of the method in the first embodiment, it is described relatively briefly; for relevant points, refer to the corresponding description of the first embodiment.

This embodiment can be applied to scenarios in which a client, a remote server, and monitoring equipment with a camera device and a sound acquisition device interact. Fig. 1-B is a schematic diagram of a second application scenario embodiment provided in the present application, in which an industrial plant assembly line is monitored. Each monitor (including monitor 1 and monitor 2) monitors the audio and video of the assembly line in real time. The computer room (local end) is provided with a workstation that screens real-time abnormal audio data on the assembly line (that is, selects audio data meeting preset conditions) and sends the abnormal audio data (target audio) and the image data (that is, video) corresponding to the abnormal audio data to the service cluster (remote server). The remote server receives the target audio and the video, analyzes them using audio recognition technology and image recognition technology respectively, and obtains the recognition result of the abnormal event. Finally, the recognition result of the abnormal event is sent to the client and alarm information is issued. It should be noted that the above application scenario is only an example intended to facilitate understanding of the data processing method of the present application, and does not limit the data processing method of the present application.

The method of this embodiment includes the following steps.


Step S301: real-time audio data representing the environment to be processed is obtained.

Step S302: reference audio data representing a reference environment is obtained.

The manner of acquiring the real-time audio data representing the to-be-processed environment and the reference audio data representing the reference environment in steps S301 to S302 may refer to the manner of acquiring the corresponding image data in the first embodiment, which is not described herein again.

Step S303: Judge whether the audio features of the target object in the real-time audio data match the audio features of the reference target object in the reference audio data; if they do not match, determine that the real-time audio data is abnormal audio data, and send the abnormal audio data to the server.

After obtaining real-time audio data representing the environment to be processed and obtaining reference audio data representing a reference environment, it is determined whether audio features of a target object in the real-time audio data match audio features of a reference target object in the reference audio data.

Specifically, one way to determine whether the audio features of the target object in the real-time audio data match the audio features of the reference target object in the reference audio data is as follows. First, it is determined whether the target object and the reference target object are the same object. If they are the same object, it is then determined whether the similarity between the audio features of the target object and the audio features of the reference target object exceeds a preset similarity threshold. If the similarity exceeds the threshold, the audio features of the target object are determined to match the audio features of the reference target object; otherwise, they are determined not to match.

Another way to determine whether the audio features of the target object in the real-time audio data match the audio features of the reference target object in the reference audio data is as follows. First, it is determined whether the target object and the reference target object are the same object. If they are different objects, the audio features of the target object are determined not to match the audio features of the reference target object.

It should be noted that, in this embodiment, saying that the target object and the reference target object are different objects means either that all target objects in the real-time audio data differ from all reference target objects in the reference audio data, or that the target objects in the real-time audio data are not completely the same as the reference target objects in the reference audio data.
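The two matching approaches described above can be combined into one check, sketched below. This is only an illustrative sketch: the application does not specify how similarity is computed, so cosine similarity, the object identifiers, the function names, and the threshold value of 0.8 are all assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def audio_features_match(
    target_id: str,
    target_features: Sequence[float],
    reference_id: str,
    reference_features: Sequence[float],
    threshold: float = 0.8,  # hypothetical preset similarity threshold
) -> bool:
    # Different objects: the features are treated as not matching.
    if target_id != reference_id:
        return False
    # Same object: match only if the similarity exceeds the threshold.
    return cosine_similarity(target_features, reference_features) > threshold
```

A result of `False` from `audio_features_match` corresponds to the "not matched" branch that leads to the abnormal audio data determination in step S303.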

More specifically, the step of determining that the real-time audio data is abnormal audio data and sending it to the server when the audio features of the target object do not match the audio features of the reference target object may be carried out as follows: if the audio features of the target object do not match those of the reference target object, the audio features of the target object are input into a deep neural network model to obtain a recognition result indicating whether the target object includes a target object needing attention. The deep neural network model is used to recognize, from the audio features of the target object, whether the target object includes a target object needing attention.

If the target object includes a target object needing attention, the real-time audio data is determined to be abnormal audio data and is sent to the server. Specifically, if the target object includes a target object needing attention and that target object matches a preset comparison object, the real-time audio data is determined to be abnormal audio data, and the abnormal audio data is sent to the server.
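The staged decision described above (feature mismatch, then model recognition, then comparison against preset objects) might be sketched as follows. The deep neural network model is replaced by a hypothetical stub, `stub_classifier`, and the label strings and function names are assumptions made purely for illustration.

```python
from typing import Callable, Optional, Sequence, Set


def stub_classifier(features: Sequence[float]) -> Optional[str]:
    """Hypothetical stand-in for the deep neural network model.

    Returns a label for a target object needing attention, or None.
    """
    return "scream" if max(features) > 0.9 else None


def is_abnormal_audio(
    features_match: bool,
    target_features: Sequence[float],
    classify: Callable[[Sequence[float]], Optional[str]],
    comparison_objects: Set[str],
) -> bool:
    # Stage 1: matching features mean the audio is not abnormal.
    if features_match:
        return False
    # Stage 2: ask the model whether a target object needing attention exists.
    label = classify(target_features)
    if label is None:
        return False
    # Stage 3: the object must also match a preset comparison object.
    return label in comparison_objects
```

Only audio for which `is_abnormal_audio` returns `True` would be sent to the server in this sketch.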

Step S304: Send the image data matched with the abnormal audio data to the server, where the image data matched with the abnormal audio data is the image data of the environment collected when the abnormal audio data was generated.

After the abnormal audio data is sent to the server, the method further includes sending to the server an image data set that covers a specified time range and includes the image data matched with the abnormal audio data. Specifically, image data with the same acquisition time as the abnormal audio data is searched for in the image data set within the specified time range, and the image data found is taken as the image data matched with the abnormal audio data.
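The lookup of image data by acquisition time can be illustrated with a short sketch. The `Frame` type, its field names, and the use of integer seconds as the time unit are assumptions; the application does not define a data layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Frame:
    captured_at: int  # acquisition time in seconds (assumed unit)
    payload: bytes    # encoded image data


def images_matching_audio(
    frames: List[Frame],
    audio_captured_at: int,
    range_start: int,
    range_end: int,
) -> List[Frame]:
    """Return frames inside the specified time range whose acquisition
    time equals the acquisition time of the abnormal audio data."""
    return [
        frame
        for frame in frames
        if range_start <= frame.captured_at <= range_end
        and frame.captured_at == audio_captured_at
    ]
```

For example, with frames captured at times 10, 11, and 12 and abnormal audio captured at time 11, only the frame captured at time 11 is selected.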

In addition, so that abnormal image data is still sent to the server even when the corresponding audio data is not abnormal, this embodiment also sends abnormal image data to the server. The specific process of screening abnormal image data is described below.

First, real-time image data representing an environment to be processed is obtained. Then, reference image data representing a reference environment is obtained. Finally, it is determined whether the features of the target object in the real-time image data match the features of the reference target object in the reference image data; if they do not match, the real-time image data is sent to the server.

Further, determining whether the features of the target object in the real-time image data match the features of the reference target object in the reference image data may be performed as follows. First, feature information of the reference target object in the reference image data is obtained. Then, feature information of the target object in the real-time image data is obtained. Finally, it is determined whether the similarity between the features of the target object and the features of the reference target object exceeds a preset similarity threshold. If the similarity exceeds the preset similarity threshold, the features of the target object are determined to match the features of the reference target object; otherwise, they are determined not to match.

The main difference between the third embodiment and the first embodiment is that the first embodiment screens audio data based on image data at the local end, whereas the third embodiment screens image data based on audio data at the local end. Both approaches screen the data before upload, which reduces the amount of data uploaded to the server and thus the pressure on the server.

In the third embodiment, a data processing method is provided, and correspondingly, the present application further provides a data processing apparatus. Fig. 4 is a schematic diagram of a data processing apparatus according to a fourth embodiment of the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

A data processing apparatus of the present embodiment includes:

a real-time audio data acquisition unit 401 for acquiring real-time audio data representing an environment to be processed;

a reference audio data acquisition unit 402 for acquiring reference audio data representing a reference environment;

a determining unit 403, configured to determine whether an audio feature of a target object in the real-time audio data matches an audio feature of a reference target object in the reference audio data;

a first sending unit 404, configured to determine that the real-time audio data is abnormal audio data if the audio feature of the target object is not matched with the audio feature of the reference target object, and send the abnormal audio data to a server;

a second sending unit 405, configured to send image data matched with the abnormal audio data to the server, where the image data matched with the abnormal audio data is image data of an environment collected when the abnormal audio data is generated.

Optionally, the determining unit is specifically configured to:

Optionally, the first sending unit is specifically configured to:

Optionally, the second sending unit is further configured to:

and sending the image data set which comprises the image data matched with the abnormal audio data in the specified time range to the server.

Optionally, the second sending unit is further configured to:

obtaining real-time image data representing an environment to be processed;

obtaining reference image data representing a reference environment;

Optionally, the second sending unit is further configured to:

In the above embodiments, two data processing methods and two data processing apparatuses are provided, respectively. The application also provides a further data processing method. Fig. 5 is a flowchart of a data processing method according to a fifth embodiment of the present application. Since the specific implementation of the method in this embodiment is substantially similar to that of the method in the first embodiment, it is described relatively briefly; for relevant points, refer to the corresponding description of the first embodiment.

The scenario of this embodiment is basically similar to that of the third embodiment: it can be applied to scenarios in which a client, a remote server, and monitoring equipment with a camera device and a sound acquisition device interact. In particular, it can be applied to monitoring an industrial plant assembly line. Each monitor (including monitor 1 and monitor 2) monitors the audio and video of the assembly line in real time. The computer room (local end) is provided with a workstation that screens real-time abnormal audio data or abnormal image data on the assembly line and sends either the abnormal audio data (target audio) together with the image data (that is, video) corresponding to it, or the abnormal image data (target video) together with the audio data (that is, audio) corresponding to it, to the service cluster (remote server). The remote server receives the target audio and its corresponding video, or the target video and its corresponding audio, and performs fine analysis and calculation on them using audio recognition technology and image recognition technology. Compared with the analysis and calculation of the third embodiment, this analysis is more refined and can screen out abnormal audio or video more accurately. An accurate recognition result of the abnormal event is then obtained through this analysis. Finally, the accurate recognition result of the abnormal event is sent to the client and alarm information is issued.

Meanwhile, in the application scenario of the fifth embodiment, the transmission policy is adjusted according to the video bandwidth available for data transmission. For example, if the bandwidth is wide, the monitoring end may also transmit to the remote server target audio data and target video data judged to be only slightly abnormal; if the bandwidth is narrow, the monitoring end may transmit only target audio data and target video data with a serious abnormal condition. This data transmission mode can further improve the speed of the overall data processing. It should be noted that the above application scenario is only an example intended to facilitate understanding of the data processing method of the present application, and does not limit the data processing method of the present application.
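The bandwidth-dependent transmission policy can be sketched as a simple decision function. The severity levels, the function name, and the 10 Mbps boundary between "wide" and "narrow" bandwidth are assumptions for illustration; the application gives no numeric values.

```python
from enum import IntEnum


class Severity(IntEnum):
    NORMAL = 0
    SLIGHT = 1   # judged slightly abnormal
    SEVERE = 2   # judged seriously abnormal


def should_transmit(
    severity: Severity,
    bandwidth_mbps: float,
    wide_threshold_mbps: float = 10.0,  # hypothetical wide/narrow boundary
) -> bool:
    """Decide whether the monitoring end sends data to the remote server."""
    if severity == Severity.NORMAL:
        return False
    if bandwidth_mbps >= wide_threshold_mbps:
        # Wide bandwidth: slightly abnormal data may be sent as well.
        return True
    # Narrow bandwidth: only seriously abnormal data is sent.
    return severity == Severity.SEVERE
```

This sketch reads the policy as "wide bandwidth lowers the bar for transmission", which is one interpretation of the example given in the text.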

The method of this embodiment includes the following steps.

Step S501: an image data set and an audio data set are obtained, and the image data in the image data set and the audio data in the audio data set are used for representing the surrounding environment of the monitoring device.

As one way of obtaining the image data set and the audio data set, first, the image data set is obtained from a cloud storage server in which the image data set is stored in advance.

Then, the audio data set is obtained from a storage device that is arranged on the audio playing device and stores the audio data set in advance.

In the data processing method of the first embodiment, the preliminarily screened image data set is already stored in the cloud storage server, and the audio data set is pre-stored in the storage device on the audio playing device. Therefore, in step S501, the image data set and the audio data set stored in the first embodiment can be obtained directly from the cloud storage server and from the storage device on the audio playing device. Of course, the image data set and the audio data set may be acquired in either order.

As another way to obtain the image data set and the audio data set, a request to obtain the image data set and the audio data set may be sent first; thereafter, a set of image data and a set of audio data for the request are obtained.

Step S502: acquiring an image data set needing attention from the image data set by using a deep neural network model, and acquiring an abnormal audio data set meeting abnormal audio data conditions from the audio data set; the deep neural network model is used for identifying whether the target object comprises a target object needing attention according to the characteristics of the target object in the image data.

After the image data set and the audio data set are obtained, they are finely screened. Specifically, the fine screening conditions may be stored in the cloud storage server in advance, and the image data set, the audio data set, and the fine screening conditions are uploaded to the server side for screening.

The fine screening of the image data set and the audio data set is performed by screening the image data set and the audio data set, respectively. The fine screening of the image data set means that an image data set needing attention is obtained from the image data set; the fine screening of the audio data sets refers to obtaining abnormal audio data sets meeting abnormal audio data conditions from the audio data sets.

In this embodiment, the fine screening of the image data set means that an image data set that needs attention is obtained from the image data set by using a deep neural network model.

Specifically, the step of obtaining an image data set needing attention from the image data set by using the deep neural network model can be performed as described below. Since the specific implementation details of obtaining the target object to be focused from the image data by using the deep neural network model have been described in detail in the first embodiment, the relevant points refer to the description of the first embodiment, and are not described herein again.

First, image data containing a target object needing attention is screened out of the image data set using the deep neural network model.

Then, the screened image data containing the target object needing attention is marked.

Finally, the image data without marks is deleted from the image data set, leaving all the marked image data.

Because the techniques for marking the screened image data and for deleting unmarked image data from the image data set are mature, they are not described in detail in this embodiment.
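The screen, mark, and delete sequence can be illustrated with a minimal sketch. Image records are modeled as dictionaries and the deep neural network model is passed in as a callable; both choices, along with the `"marked"` key, are assumptions for illustration.

```python
from typing import Callable, Dict, List


def screen_images(
    images: List[Dict],
    needs_attention: Callable[[Dict], bool],
) -> List[Dict]:
    """Mark images that contain a target object needing attention,
    then drop every image that carries no mark."""
    # Screening and marking: tag each image the model flags.
    for image in images:
        if needs_attention(image):
            image["marked"] = True
    # Deletion: keep only the marked images.
    return [image for image in images if image.get("marked")]
```

For example, calling `screen_images(images, lambda im: "child" in im["labels"])` keeps only the records whose hypothetical `labels` list contains `"child"`.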

Similarly, the fine screening of the audio data sets refers to obtaining abnormal audio data sets meeting abnormal audio data conditions from the audio data sets.

Specifically, the step of obtaining an abnormal audio data set satisfying the abnormal audio data condition from the audio data sets may be performed as described below.

First, it is determined whether each piece of audio data in the audio data set meets the abnormal audio data condition.

Before this determination is made, the abnormal audio data condition is obtained. The abnormal audio data condition may be that the audio features of the target object of the audio data do not match the audio features of the reference target object. After the abnormal audio data condition is obtained, each piece of audio data in the audio data set is checked against it.

Specifically, whether each piece of audio data in the audio data set meets the abnormal audio data condition is determined by judging whether the audio features of the target object of that piece of audio data match the audio features of the reference target object.

First, sound characteristic information of a target object of each piece of audio data in an audio data set is obtained.

Thereafter, sound characteristic information of the reference target object is obtained.

After the sound feature information of the target object of each piece of audio data in the audio data set and the sound feature information of the reference target object are obtained, it is determined whether the similarity between the sound features of the target object of each piece of audio data and the sound features of the reference target object exceeds a preset sound similarity threshold.

The above determination of whether each piece of audio data in the audio data set meets the abnormal audio data condition mainly extracts audio data from abnormal situations (including sounds such as calls for help, quarrels, and screams) to achieve intelligent monitoring. During audio recognition (that is, when determining whether the abnormal audio data condition is met), technologies such as audio recognition, natural language understanding, natural language generation, and audio synthesis are combined to recognize and record the voiceprint information of different objects, so as to trigger alarm information for stranger sound recognition and abnormal sound recognition. In short, an audio recognition result is obtained after audio recognition, and alarm information can be triggered directly by that result. Alternatively, the audio recognition result and the image recognition result can be combined to trigger the alarm information.

Step S503: If the time at which the monitoring device acquired at least one piece of image data in the image data set needing attention matches the time at which the monitoring device acquired at least one piece of abnormal audio data in the abnormal audio data set, send alarm information to the monitoring device or to the computing device used for displaying the monitoring result.

After the image data set needing attention and the abnormal audio data set meeting the abnormal audio data condition are obtained in step S502, and before step S503 is executed, a step of determining whether each image data in the image data set needing attention and each abnormal audio data in the abnormal audio data set are matched needs to be executed.

Specifically, this determination may be made by judging whether the time at which the monitoring device acquired the image data matches the time at which the monitoring device acquired the audio data. The judgment process is carried out in the following steps.

First, the time at which the monitoring device acquired each piece of audio data in the abnormal audio data set is obtained.

Then, the time at which the monitoring device acquired each piece of image data in the image data set needing attention is obtained.

Next, it is determined whether the acquisition time of each piece of audio data in the abnormal audio data set matches the acquisition time of each piece of image data in the image data set needing attention.

One way to make this determination is to calculate the degree of matching between the acquisition time of each piece of audio data in the abnormal audio data set and the acquisition time of each piece of image data in the image data set needing attention.

Finally, whether the acquisition times match is determined from the calculated degree of matching.

After it is determined that the acquisition time of at least one piece of image data in the image data set needing attention matches the acquisition time of at least one piece of abnormal audio data in the abnormal audio data set, alarm information is sent to the monitoring device or to the computing device used for displaying the monitoring result.
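The time-matching step can be sketched as follows. The application does not define how the degree of matching is calculated, so the linear formula below, the tolerance of one second, and the minimum degree of 0.5 are all assumptions chosen only to make the idea concrete.

```python
from typing import List, Tuple


def time_matching_degree(
    audio_time: float, image_time: float, tolerance: float = 1.0
) -> float:
    """Assumed matching degree: 1.0 for identical times, falling
    linearly to 0.0 once the gap reaches `tolerance` seconds."""
    gap = abs(audio_time - image_time)
    return max(0.0, 1.0 - gap / tolerance)


def matched_pairs(
    audio_times: List[float],
    image_times: List[float],
    min_degree: float = 0.5,  # hypothetical matching threshold
    tolerance: float = 1.0,
) -> List[Tuple[float, float]]:
    """Pair every abnormal audio time with every attention-image time
    whose matching degree reaches the threshold."""
    return [
        (a, i)
        for a in audio_times
        for i in image_times
        if time_matching_degree(a, i, tolerance) >= min_degree
    ]


def should_alarm(audio_times: List[float], image_times: List[float]) -> bool:
    # At least one matched audio/image pair triggers the alarm.
    return bool(matched_pairs(audio_times, image_times))
```

In this sketch, the pairs returned by `matched_pairs` are also the natural input for combining abnormal audio data with the image data needing attention before the alarm is sent.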

A preferred way of sending the alarm information to the monitoring device or the computing device used for displaying the monitoring result is as follows.

First, the abnormal audio data and the image data needing attention whose time information matches are matched and combined according to the time information.

Then, the matched and combined abnormal audio data and image data needing attention are sent to the monitoring device or the computing device used for displaying the monitoring result.

Finally, alarm information is sent to the monitoring device or the computing device used for displaying the monitoring result.

More specifically, one way of sending the alarm information is as follows. First, the type of the abnormal event is determined from the matched and combined abnormal audio data and image data needing attention. Then, the warning mode corresponding to the type of the abnormal event is obtained from the type of the abnormal event and the correspondence between abnormal event types and warning modes. Finally, a warning is issued for the abnormal event using the warning mode obtained for its type.

It should be noted that the above process also requires determining whether the type of the abnormal event is in the list of correspondences between abnormal event types and warning modes. If the type is in the list, the corresponding warning mode is obtained directly from the list; if the type is not in the list, a warning mode corresponding to the type of the abnormal event is obtained according to that type, and the warning mode is added to the list.
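The list lookup with fallback described above might be sketched like this. The correspondence list is modeled as a dictionary, and `derive_mode` is a hypothetical callable standing in for however a warning mode is obtained for an unlisted event type.

```python
from typing import Callable, Dict


def get_warning_mode(
    event_type: str,
    table: Dict[str, str],
    derive_mode: Callable[[str], str],
) -> str:
    """Look up the warning mode for an abnormal event type; if the type
    is not listed, derive a mode and add it to the list."""
    if event_type in table:
        # Type already listed: take the warning mode directly.
        return table[event_type]
    # Type not listed: obtain a warning mode and record it for next time.
    mode = derive_mode(event_type)
    table[event_type] = mode
    return mode
```

After the first lookup of an unlisted type, later lookups of the same type are served directly from the list.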

The matched and combined abnormal audio data and image data needing attention can be used to determine the type of the abnormal event. For example, when the target object needing attention that appears in the image data needing attention is a child, it is preliminarily determined that the target object may need help. If crying is then recognized in the matched abnormal audio data, the abnormal event is determined to be an urgent abnormal event, and alarm information is sent to the monitoring device or the computing device used for displaying the monitoring result accordingly. When the alarm information is issued, an audio interaction system (a loudspeaker and a microphone may be installed on one side of the smart speaker) can also ask the target object needing attention whether help is needed, which makes the monitoring system more intelligent.

The data processing method of this embodiment can be applied to the field of intelligent monitoring, and can also be applied to fields such as instant capture of highlight videos and classification of intelligent video albums.

In a fifth embodiment, a data processing method is provided, and correspondingly, the present application further provides a data processing apparatus. Fig. 6 is a schematic diagram of a data processing apparatus according to a sixth embodiment of the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

A data processing apparatus of the present embodiment includes:

a data set obtaining unit 601, configured to obtain an image data set and an audio data set, where image data in the image data set and audio data in the audio data set are used to represent a surrounding environment of a monitoring device;

a screening unit 602, configured to obtain, by using a deep neural network model, an image data set that needs to be focused from the image data set, and obtain, from the audio data set, an abnormal audio data set that meets an abnormal audio data condition; the deep neural network model is used for identifying whether the target object comprises a target object needing attention or not according to the characteristics of the target object in the image data;

an alarm unit 603, configured to send alarm information to the monitoring device or a computing device for displaying a monitoring result if time information obtained by the monitoring device for at least one image data in the image data set that needs to be focused is matched with time information obtained by the monitoring device for at least one abnormal audio data in the abnormal audio data set.

Optionally, the data set obtaining unit is specifically configured to:

sending a request for obtaining an image data set and an audio data set;

an image data set and an audio data set for the request are obtained.

Optionally, the screening unit is specifically configured to:

marking the screened image data containing the target object needing attention;

Optionally, the screening unit is specifically configured to:

Optionally, the apparatus further includes a determining unit, where the determining unit is specifically configured to:

Optionally, the determining unit is specifically configured to:

Optionally, the alarm unit is specifically configured to:

Optionally, the apparatus further includes an alarm determination unit;

the alarm determination unit is specifically configured to: determine whether the type of the abnormal event is in the list of correspondences between abnormal event types and warning modes, and if the type of the abnormal event is in the list, obtain the warning mode corresponding to the type of the abnormal event directly from the list;

The first embodiment of the present application provides a data processing method, and the seventh embodiment of the present application provides an electronic device corresponding to the data processing method.

Fig. 7 is a schematic diagram of an electronic device according to a seventh embodiment of the present application.

An electronic device of the present embodiment includes:

a processor 701;

a memory 702, configured to store a program of the data processing method, where the program, when run by the processor, executes the following steps:

obtaining real-time image data representing an environment to be processed;

obtaining reference image data representing a reference environment;
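Once the real-time and reference image data have been obtained, the method judges whether the features of the target object match those of the reference target object, and treats the real-time image data as abnormal when they do not. The sketch below assumes the features have already been extracted as numeric vectors and uses cosine similarity with a threshold of 0.9; the similarity measure, the threshold value, and the function names are illustrative assumptions rather than details taken from the application.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_abnormal(realtime_features, reference_features, threshold=0.9):
    """Treat the real-time data as abnormal when the similarity to the
    reference does not exceed the (assumed) threshold."""
    return cosine_similarity(realtime_features, reference_features) <= threshold


reference = [1.0, 0.0, 1.0]           # hypothetical reference feature vector
print(is_abnormal([1.0, 0.0, 1.0], reference))  # False: features match
print(is_abnormal([0.0, 1.0, 0.0], reference))  # True: features differ
```

Only data flagged as abnormal by such a check would then be sent to the server, which is what reduces the upload volume.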

The third embodiment of the present application provides a data processing method, and the eighth embodiment of the present application provides an electronic device corresponding to the data processing method.

Since the eighth embodiment is the same as the seventh embodiment in illustration, please continue to refer to fig. 7, which shows a schematic diagram of an electronic device of the data processing method provided by the eighth embodiment of the present application.

An electronic device of the present embodiment includes:

a processor;

obtaining real-time audio data representing an environment to be processed;

obtaining reference audio data representing a reference environment;

The fifth embodiment of the present application provides a data processing method, and the ninth embodiment of the present application provides an electronic device corresponding to the data processing method.

Since the ninth embodiment is the same as the seventh embodiment in illustration, please continue to refer to fig. 7, which shows a schematic diagram of an electronic device of the data processing method provided by the ninth embodiment of the present application.

An electronic device of the present embodiment includes:

a processor;

A first embodiment of the present application provides a data processing method, and a tenth embodiment of the present application provides a computer storage medium, where a program of the data processing method is stored in the computer storage medium, and the program is executed by a processor to perform the following steps:

obtaining real-time image data representing an environment to be processed;

obtaining reference image data representing a reference environment;

A third embodiment of the present application provides a data processing method, and an eleventh embodiment of the present application provides a computer storage medium, where a program of the data processing method is stored in the computer storage medium, and the program is executed by a processor to perform the following steps:

obtaining real-time audio data representing an environment to be processed;

obtaining reference audio data representing a reference environment;

A fifth embodiment of the present application provides a data processing method, and a twelfth embodiment of the present application provides a computer storage medium, where a program of the data processing method is stored in the computer storage medium, and the program is run by a processor and executes the following steps:

Although the present application has been described with reference to the preferred embodiments, these embodiments are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application should be determined by the claims that follow.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

1. Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

2. As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims

1. A data processing method, comprising:

obtaining real-time image data representing an environment to be processed;

obtaining reference image data representing a reference environment;

2. The data processing method of claim 1, wherein the obtaining real-time image data representing an environment to be processed comprises: obtaining, by a camera device arranged on an audio playing device, real-time image data representing the environment to be processed around the audio playing device.

3. The data processing method of claim 1, wherein the obtaining reference image data representing a reference environment comprises: obtaining default image data that is stored in advance by an audio playing device and represents a default environment around the audio playing device.

4. The data processing method of claim 1, wherein the obtaining real-time image data representing an environment to be processed comprises:

sending a request for obtaining the real-time image data;

receiving the real-time image data for the request.

5. The data processing method of claim 1, wherein the determining whether the features of the target object in the real-time image data match the features of the reference target object in the reference image data comprises:

6. The data processing method of claim 5, wherein the determining whether the similarity between the feature of the target object and the feature of the reference target object exceeds a predetermined similarity threshold comprises:

7. The data processing method of claim 1, wherein the determining whether the features of the target object in the real-time image data match the features of the reference target object in the reference image data comprises:

8. The data processing method according to claim 7, wherein the target object and the reference target object being different objects means that all target objects in the real-time image data differ from all reference target objects in the reference image data, or that the target objects in the real-time image data are not entirely the same as the reference target objects in the reference image data.

9. The data processing method according to claim 1, wherein if the features of the target object do not match the features of the reference target object, determining that the real-time image data is abnormal image data and sending the abnormal image data to a server, includes:

10. The data processing method according to claim 9, wherein if the target object includes a target object that needs to be focused on, determining that the real-time image data is abnormal image data, and sending the abnormal image data to a server, includes:

11. The data processing method of claim 1, further comprising: sending, to the server, an audio data set that includes the audio data matched with the abnormal image data within a specified time range.

12. The data processing method of claim 11, further comprising:

13. The data processing method of claim 1, further comprising:

obtaining real-time audio data representing an environment to be processed;

obtaining reference audio data representing a reference environment;

14. The data processing method of claim 13, wherein the determining whether the audio features of the target object in the real-time audio data match the audio features of the reference target object in the reference audio data comprises:

15. A data processing apparatus, comprising:

16. A data processing method, comprising:

obtaining real-time audio data representing an environment to be processed;

obtaining reference audio data representing a reference environment;

17. The data processing method of claim 16, wherein the determining whether the audio features of the target object in the real-time audio data match the audio features of the reference target object in the reference audio data comprises:

18. The data processing method of claim 16, wherein the determining whether the audio features of the target object in the real-time audio data match the audio features of the reference target object in the reference audio data comprises:

19. The data processing method of claim 18, wherein the target object and the reference target object being different objects means that all target objects in the real-time audio data differ from all reference target objects in the reference audio data, or that the target objects in the real-time audio data are not entirely the same as the reference target objects in the reference audio data.

20. The data processing method according to claim 16, wherein if the audio feature of the target object does not match the audio feature of the reference target object, determining that the real-time audio data is abnormal audio data and sending the abnormal audio data to a server, comprises:

21. The data processing method according to claim 20, wherein if the target object includes a target object that needs to be focused, determining that the real-time audio data is abnormal audio data, and sending the abnormal audio data to a server, includes:

22. The data processing method of claim 16, further comprising: sending, to the server, an image data set that includes the image data matched with the abnormal audio data within a specified time range.

23. The data processing method of claim 22, further comprising:

24. The data processing method of claim 16, further comprising:

obtaining real-time image data representing an environment to be processed;

obtaining reference image data representing a reference environment;

25. The data processing method of claim 24, wherein the determining whether the features of the target object in the real-time image data match the features of the reference target object in the reference image data comprises:

26. A data processing apparatus, comprising:

27. A data processing method, comprising:

28. The data processing method of claim 27, wherein the obtaining the image data set and the audio data set comprises:

29. The data processing method of claim 27, wherein the obtaining the image data set and the audio data set comprises:

sending a request for obtaining an image data set and an audio data set;

obtaining the image data set and the audio data set returned for the request.

30. The data processing method of claim 27, wherein obtaining an image data set requiring attention from the image data set using a deep neural network model comprises:

marking the screened image data containing the target object needing attention;

31. The data processing method of claim 27, wherein obtaining the abnormal audio data set satisfying the abnormal audio data condition from the audio data sets comprises:

32. The data processing method of claim 31, wherein the determining whether each piece of audio data in the set of audio data satisfies an abnormal audio data condition comprises:

33. The data processing method of claim 27, further comprising:

34. The data processing method of claim 33, wherein the determining whether the time information of each audio data of the abnormal audio data set acquired by the monitoring device matches the time information of each image data of the image data set needing attention acquired by the monitoring device comprises:

35. The data processing method of claim 27, wherein the sending of the alarm message to the monitoring device or the computing device for displaying the monitoring result comprises:

36. The data processing method of claim 35, further comprising:

37. The data processing method of claim 36, further comprising:

38. A data processing apparatus, comprising:

a screening unit, configured to obtain, by using a deep neural network model, an image data set requiring attention from the image data set, and to obtain, from the audio data set, an abnormal audio data set that meets an abnormal audio data condition; the deep neural network model is used to identify, according to the features of the target object in the image data, whether the target object includes a target object requiring attention;

39. An electronic device, comprising:

a processor;

obtaining real-time image data representing an environment to be processed;

obtaining reference image data representing a reference environment;

40. An electronic device, comprising:

a processor;

obtaining real-time audio data representing an environment to be processed;

obtaining reference audio data representing a reference environment;

41. An electronic device, comprising:

a processor;

obtaining, by using a deep neural network model, an image data set requiring attention from the image data set, and obtaining, from the audio data set, an abnormal audio data set that meets an abnormal audio data condition; the deep neural network model is used to identify, according to the features of the target object in the image data, whether the target object includes a target object requiring attention;

42. A computer storage medium storing a program of a data processing method, the program being executed by a processor to perform the steps of:

obtaining real-time image data representing an environment to be processed;

obtaining reference image data representing a reference environment;

43. A computer storage medium storing a program of a data processing method, the program being executed by a processor to perform the steps of:

obtaining real-time audio data representing an environment to be processed;

obtaining reference audio data representing a reference environment;

44. A computer storage medium storing a program of a data processing method, the program being executed by a processor to perform the steps of: