CN110263653A - Scene analysis system and method based on deep learning technology - Google Patents

Scene analysis system and method based on deep learning technology Download PDF

Info

Publication number
CN110263653A
CN110263653A
Authority
CN
China
Prior art keywords
deep learning
recognition
module
image
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910433837.8A
Other languages
Chinese (zh)
Inventor
王志宇
杨嘉欣
杨嘉烨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Dingyi Interconnection Technology Co ltd
Original Assignee
Guangdong Dingyi Interconnection Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Dingyi Interconnection Technology Co ltd filed Critical Guangdong Dingyi Interconnection Technology Co ltd
Priority to CN201910433837.8A priority Critical patent/CN110263653A/en
Publication of CN110263653A publication Critical patent/CN110263653A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 - Facial expression recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems

Abstract

The invention discloses a scene analysis system and method based on deep learning technology. The system comprises a data acquisition subsystem and a cloud AI platform. The data acquisition subsystem acquires images and voice. On the cloud AI platform, a face recognition module performs face recognition on the image under test using deep learning; a facial expression analysis module analyzes and judges the facial expression; a speech recognition module performs speech recognition on the audio under test using deep learning; a speech analysis module analyzes and judges the semantics and intonation of the audio; and a comprehensive analysis module jointly analyzes the results obtained by the facial expression analysis module and the speech analysis module. The invention can recognize faces and voice simultaneously and, using deep learning, obtains recognition results for facial expression and for the semantics and intonation of speech, which not only makes the recognition results more accurate but also guarantees recognition speed, further enriching scene analysis technology.

Description

Scene analysis system and method based on deep learning technology
Technical field
The present invention relates to the field of deep learning technology, and more particularly to a scene analysis system and method based on deep learning technology.
Background technique
With the continuous progress of modern science and technology, the era of intelligence has arrived, and natural language processing and facial expression recognition have long been important research topics for those skilled in the art.
However, on the one hand, owing to the limitations of traditional shallow models, traditional natural language processing models require a large amount of linguistic knowledge to construct features by hand. These features are generally guided by the specific application and therefore lack broad applicability: if the specific task changes, new features must be constructed by hand all over again.
On the other hand, current face recognition technology is still mainly implemented with hand-designed feature extraction algorithms. In real, complex environments, face data are affected by many factors, such as illumination, occlusion, and pose variation. Under such conditions, existing face recognition methods based on hand-designed feature extraction have poor robustness and weak resistance to these kinds of interference; such uncontrollable factors cause the recognition performance of existing methods to decline sharply, making it difficult to guarantee the recognition effect and resulting in low face recognition accuracy.
Moreover, although image recognition, speech recognition, and semantic analysis have each been explored in different fields, applications that combine natural language processing, face recognition, and facial expression recognition for scene analysis remain rare; they are still at an early stage of development and cannot recognize accurately.
Therefore, developing an accurate scene analysis system and method based on deep-learning natural language processing and facial expression recognition is a problem that those skilled in the art urgently need to solve.
Summary of the invention
In view of this, the present invention provides a scene analysis system and method based on deep learning technology. Faces and voice are recognized through deep learning, and the facial expression and the semantics and intonation of the voice are further analyzed, which effectively guarantees the accuracy of recognition and analysis.
To achieve the goals above, the present invention adopts the following technical scheme:
A scene analysis system based on deep learning technology, comprising: a data acquisition subsystem, a database, and a cloud AI platform; wherein,
the data acquisition subsystem is used for acquiring images and voice;
the database is used for storing data;
the cloud AI platform comprises a data preprocessing module, a face recognition module, a facial expression analysis module, a speech recognition module, a speech analysis module, and a comprehensive analysis module;
the data preprocessing module is used for preprocessing the images and voice acquired by the data acquisition subsystem;
the face recognition module is used for performing face recognition on the image under test using deep learning technology, judging from the data in the database whether the face in the image under test already exists, and continuously performing face recognition deep learning;
the facial expression analysis module is used for analyzing and judging the facial expression in the image under test, and continuously performing facial expression analysis deep learning;
the speech recognition module is used for performing speech recognition on the audio under test, converting the voice content into text, performing semantic analysis on the voice content, and continuously performing speech recognition deep learning;
the speech analysis module is used for analyzing and judging the semantics and intonation of the audio under test;
the comprehensive analysis module is used for jointly analyzing the results obtained by the facial expression analysis module and the speech analysis module.
Preferably, the preprocessing includes: performing dimension reduction on the images, and performing noise reduction and text output on the audio.
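The preprocessing described above (image dimension reduction, audio noise reduction) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 2x2 average pooling and moving-average filter are assumed stand-ins for whatever dimension-reduction and denoising methods an actual system would use.

```python
def downsample_image(pixels):
    """Reduce image dimensionality by averaging 2x2 blocks of grayscale pixels."""
    h, w = len(pixels), len(pixels[0])
    return [
        [(pixels[r][c] + pixels[r][c + 1] +
          pixels[r + 1][c] + pixels[r + 1][c + 1]) / 4.0
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]

def denoise_audio(samples, window=3):
    """Suppress noise with a simple moving-average filter over the samples."""
    half = window // 2
    out = []
    for i in range(len(samples)):
        lo, hi = max(0, i - half), min(len(samples), i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out
```

A real system would more likely use learned encoders or signal-processing filters, but the data flow (raw capture in, compact cleaned data out to the recognition modules) is the same.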
Preferably, the data acquisition subsystem includes an image acquisition module and an audio acquisition module,
which are respectively used to acquire images and audio and to send the acquired images and audio to the data preprocessing module.
Preferably, the face recognition module includes a first feature extraction unit, a first deep learning model, and a first matching and recognition unit;
the first feature extraction unit is used to extract a face image feature vector from the preprocessed image according to the first deep learning model;
the first matching and recognition unit is used to match the extracted face image feature vector against the face images in the database to obtain a first recognition result, and to send the first recognition result to the database for storage; the first deep learning model is continuously updated as the database is updated.
Preferably, the speech recognition module includes a second feature extraction unit, a second deep learning model, and a second matching and recognition unit;
the second feature extraction unit is used to extract an audio feature vector from the preprocessed audio according to the second deep learning model;
the second matching and recognition unit is used to match the extracted audio feature vector against the audio data in the database to obtain a second recognition result, and to send the second recognition result to the database for storage; the second deep learning model is continuously updated as the database is updated.
Preferably, the speech analysis module includes a semantic analysis unit and an intonation analysis unit;
the semantic analysis unit and the intonation analysis unit respectively perform semantic and intonation analysis on the voice recognized by the speech recognition unit.
A scene analysis method based on deep learning technology, comprising the following steps:
(1) acquiring images and voice;
(2) preprocessing the acquired images and voice;
(3) performing face recognition on the image under test using deep learning technology, judging whether the face in the image under test exists in the database, and analyzing and judging the recognized facial expression;
(4) performing speech recognition on the audio under test using deep learning technology, converting the voice into text, and analyzing and judging the semantics and intonation of the recognized voice;
(5) jointly analyzing the judgment results of steps (3) and (4).
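The five steps above can be orchestrated as in the following sketch. Every recognizer here is a caller-supplied placeholder; the function names (`recognize_face`, `analyze_expression`, `recognize_speech`, `analyze_speech`) are illustrative and not taken from the patent.

```python
def analyze_scene(image, audio,
                  recognize_face, analyze_expression,
                  recognize_speech, analyze_speech):
    """Run the method's steps with caller-supplied models.

    Steps (1)-(2), acquisition and preprocessing, are assumed done upstream;
    step (3) covers face and expression, step (4) speech and its
    semantics/intonation, and step (5) collects everything for joint analysis."""
    face_id = recognize_face(image)                      # step (3): is the face known?
    expression = analyze_expression(image)               # step (3): expression judgment
    text = recognize_speech(audio)                       # step (4): speech-to-text
    semantics, intonation = analyze_speech(text, audio)  # step (4): semantic/intonation
    return {                                             # step (5): comprehensive result
        "face_id": face_id,
        "expression": expression,
        "text": text,
        "semantics": semantics,
        "intonation": intonation,
    }
```

Passing the models in as parameters mirrors the patent's modular structure: each module can be retrained or replaced without changing the pipeline.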
Preferably, the detailed process of face recognition is:
extracting a face image feature vector from the preprocessed image according to the first deep learning model;
matching the extracted face image feature vector against the face images in the database to obtain a first recognition result, and sending the first recognition result to the database for storage; the first deep learning model is continuously updated as the database is updated.
Preferably, the detailed process of speech recognition is:
extracting an audio feature vector from the preprocessed audio according to the second deep learning model;
matching the extracted audio feature vector against the audio data in the database to obtain a second recognition result, and sending the second recognition result to the database for storage; the second deep learning model is continuously updated as the database is updated.
It can be seen from the above technical solutions that, compared with the prior art, the present invention provides a scene analysis system and method based on deep learning technology. First, the system can recognize faces and voice simultaneously and, using deep learning, obtains recognition results for facial expression and for the semantics and intonation of the voice, which not only makes the recognition results more accurate but also guarantees recognition speed, further enriching scene analysis technology. Second, the deep learning models can be iteratively updated during use, which further guarantees the accuracy of the recognition results. The invention can be used in fields such as the service industry and smart cities, and has the advantage of providing timely insight into customer mood so that customer needs can be better met.
Detailed description of the invention
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a schematic diagram of the structure provided by the invention;
Fig. 2 is a schematic diagram of the internal structure of the cloud AI platform provided by the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.
The embodiment of the invention discloses a scene analysis system based on deep learning technology. As shown in Fig. 1, it comprises: a data acquisition subsystem, a database, and a cloud AI platform; wherein,
the data acquisition subsystem is used for acquiring images and voice;
the database is used for storing data;
as shown in Fig. 2, the cloud AI platform includes a data preprocessing module, a face recognition module, a facial expression analysis module, a speech recognition module, a speech analysis module, and a comprehensive analysis module;
the data preprocessing module is used for preprocessing the images and voice acquired by the data acquisition subsystem;
the face recognition module is used for performing face recognition on the image under test using deep learning technology, judging from the data in the database whether the face in the image under test already exists, and continuously performing face recognition deep learning;
the facial expression analysis module is used for analyzing and judging the facial expression in the image under test, and continuously performing facial expression analysis deep learning;
the speech recognition module is used for performing speech recognition on the audio under test, converting the voice content into text, performing semantic analysis on the voice content, and continuously performing speech recognition deep learning;
the speech analysis module is used for analyzing and judging the semantics and intonation of the audio under test;
the comprehensive analysis module is used for jointly analyzing the results obtained by the facial expression analysis module and the speech analysis module.
Preferably, the preprocessing includes: performing dimension reduction on the images, and performing noise reduction and text output on the audio.
Further, the data acquisition subsystem includes an image acquisition module and an audio acquisition module,
which are respectively used to acquire images and audio and to send the acquired images and audio to the data preprocessing module.
Further, the face recognition module includes a first feature extraction unit, a first deep learning model, and a first matching and recognition unit;
the first feature extraction unit is used to extract a face image feature vector from the preprocessed image according to the first deep learning model;
the first matching and recognition unit is used to match the extracted face image feature vector against the face images in the database to obtain a first recognition result, and to send the first recognition result to the database for storage; the first deep learning model is continuously updated as the database is updated.
Further, the speech recognition module includes a second feature extraction unit, a second deep learning model, and a second matching and recognition unit;
the second feature extraction unit is used to extract an audio feature vector from the preprocessed audio according to the second deep learning model;
the second matching and recognition unit is used to match the extracted audio feature vector against the audio data in the database to obtain a second recognition result, and to send the second recognition result to the database for storage; the second deep learning model is continuously updated as the database is updated.
Further, the speech analysis module includes a semantic analysis unit and an intonation analysis unit;
the semantic analysis unit and the intonation analysis unit respectively perform semantic and intonation analysis on the voice recognized by the speech recognition unit.
The working principle of the present invention is as follows:
The image acquisition module and the audio acquisition module send the acquired images and voice to the data preprocessing module, which performs processing such as dimension reduction on the images and noise reduction and text output on the voice. The data preprocessing module then sends the preprocessed image data and voice data to the face recognition module and the speech recognition module respectively. The face recognition module uses the first matching and recognition unit to match the extracted face image feature vector against the data in the database, judges whether the face exists, and obtains a face recognition result; facial expression analysis is then performed on the basis of the face recognition result. The speech recognition module uses the second matching and recognition unit to match the extracted audio feature vector against the data in the database, and then performs semantic and intonation analysis.
The comprehensive analysis module combines the facial expression analysis results with the semantic and intonation analysis results to obtain results such as the mood of the identified person in the current scene, completing the scene analysis. According to the scene analysis result, customer mood can be obtained in real time and customer satisfaction determined, and early warning can be given for emergencies; in addition, for smart-city services, dynamic early warning can help prevent social incidents.
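The comprehensive analysis described here, fusing expression with semantics and intonation into an overall mood estimate and an early-warning signal, might look like the following rule-based sketch. The three-way label set, the majority vote, and the keyword trigger are all assumptions for illustration; a deployed system would presumably learn this fusion rather than hard-code it.

```python
def fuse_mood(expression, semantic_sentiment, intonation):
    """Combine three per-channel judgments ('positive'/'negative'/'neutral')
    into one mood label by majority vote, with 'neutral' as the tiebreaker."""
    votes = [expression, semantic_sentiment, intonation]
    pos = votes.count("positive")
    neg = votes.count("negative")
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

def needs_alert(mood, keywords, text):
    """Raise an early warning when the fused mood is negative and the
    recognized text contains any alert keyword (e.g. for emergencies)."""
    return mood == "negative" and any(k in text for k in keywords)
```

The point of the sketch is the structure: expression and speech channels are judged independently, then reconciled in one place, which is where scene-level decisions such as early warning belong.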
A scene analysis method based on deep learning technology, comprising the following steps:
(1) acquiring images and voice;
(2) preprocessing the acquired images and voice;
(3) performing face recognition on the image under test using deep learning technology, judging whether the face in the image under test exists in the database, and analyzing and judging the recognized facial expression;
(4) performing speech recognition on the audio under test using deep learning technology, converting the voice into text, and analyzing and judging the semantics and intonation of the recognized voice;
(5) jointly analyzing the judgment results of steps (3) and (4).
It should be understood that the order of steps (3) and (4) is not fixed: they can be carried out simultaneously, step (3) can precede step (4) or vice versa, or only one of the two steps can be carried out, as needed.
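The observation that steps (3) and (4) are independent and can run in either order or in parallel can be demonstrated with Python threads. The worker functions here are trivial stand-ins for the face and speech analysis steps.

```python
import threading

def run_steps_concurrently(step3, step4, image, audio):
    """Execute face analysis (step 3) and speech analysis (step 4) in
    parallel threads, since neither depends on the other's output."""
    results = {}
    t3 = threading.Thread(target=lambda: results.update(face=step3(image)))
    t4 = threading.Thread(target=lambda: results.update(speech=step4(audio)))
    t3.start()
    t4.start()
    t3.join()   # wait for both channels before comprehensive analysis (step 5)
    t4.join()
    return results
```

Because step (5) only needs both results to be present, the join points are the single synchronization barrier; running only one step corresponds to launching only one thread.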
Further, the detailed process of face recognition is:
extracting a face image feature vector from the preprocessed image according to the first deep learning model;
matching the extracted face image feature vector against the face images in the database to obtain a first recognition result, and sending the first recognition result to the database for storage; the first deep learning model is continuously updated as the database is updated.
Further, the detailed process of speech recognition is:
extracting an audio feature vector from the preprocessed audio according to the second deep learning model;
matching the extracted audio feature vector against the audio data in the database to obtain a second recognition result, and sending the second recognition result to the database for storage; the second deep learning model is continuously updated as the database is updated.
Each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A scene analysis system based on deep learning technology, characterized by comprising: a data acquisition subsystem, a database, and a cloud AI platform; wherein,
the data acquisition subsystem is used for acquiring images and voice;
the database is used for storing data;
the cloud AI platform comprises a data preprocessing module, a face recognition module, a facial expression analysis module, a speech recognition module, a speech analysis module, and a comprehensive analysis module;
the data preprocessing module is used for preprocessing the images and voice acquired by the data acquisition subsystem;
the face recognition module is used for performing face recognition on the image under test using deep learning technology, judging from the data in the database whether the face in the image under test already exists, and continuously performing face recognition deep learning;
the facial expression analysis module is used for analyzing and judging the facial expression in the image under test, and continuously performing facial expression analysis deep learning;
the speech recognition module is used for performing speech recognition on the audio under test, converting the voice content into text, performing semantic analysis on the voice content, and continuously performing speech recognition deep learning;
the speech analysis module is used for analyzing and judging the semantics and intonation of the audio under test;
the comprehensive analysis module is used for jointly analyzing the results obtained by the facial expression analysis module and the speech analysis module.
2. The scene analysis system based on deep learning technology according to claim 1, characterized in that the preprocessing includes: performing dimension reduction on the images, and performing noise reduction and text output on the audio.
3. The scene analysis system based on deep learning technology according to claim 1, characterized in that the data acquisition subsystem includes an image acquisition module and an audio acquisition module,
which are respectively used to acquire images and audio and to send the acquired images and audio to the data preprocessing module.
4. The scene analysis system based on deep learning technology according to claim 1, characterized in that the face recognition module includes a first feature extraction unit, a first deep learning model, and a first matching and recognition unit;
the first feature extraction unit is used to extract a face image feature vector from the preprocessed image according to the first deep learning model;
the first matching and recognition unit is used to match the extracted face image feature vector against the face images in the database to obtain a first recognition result, and to send the first recognition result to the database for storage; the first deep learning model is continuously updated as the database is updated.
5. The scene analysis system based on deep learning technology according to claim 1, characterized in that the speech recognition module includes a second feature extraction unit, a second deep learning model, and a second matching and recognition unit;
the second feature extraction unit is used to extract an audio feature vector from the preprocessed audio according to the second deep learning model;
the second matching and recognition unit is used to match the extracted audio feature vector against the audio data in the database to obtain a second recognition result, and to send the second recognition result to the database for storage; the second deep learning model is continuously updated as the database is updated.
6. The scene analysis system based on deep learning technology according to claim 1, characterized in that the speech analysis module includes a semantic analysis unit and an intonation analysis unit;
the semantic analysis unit and the intonation analysis unit respectively perform semantic and intonation analysis on the voice recognized by the speech recognition unit.
7. A scene analysis method based on deep learning technology, characterized by comprising the following steps:
(1) acquiring images and voice;
(2) preprocessing the acquired images and voice;
(3) performing face recognition on the image under test using deep learning technology, judging whether the face in the image under test exists in the database, and analyzing and judging the recognized facial expression;
(4) performing speech recognition on the audio under test using deep learning technology, converting the voice into text, and analyzing and judging the semantics and intonation of the recognized voice;
(5) jointly analyzing the judgment results of steps (3) and (4).
8. The scene analysis method based on deep learning technology according to claim 7, characterized in that the detailed process of face recognition is:
extracting a face image feature vector from the preprocessed image according to the first deep learning model;
matching the extracted face image feature vector against the face images in the database to obtain a first recognition result, and sending the first recognition result to the database for storage; the first deep learning model is continuously updated as the database is updated.
9. The scene analysis method based on deep learning technology according to claim 8, characterized in that the detailed process of speech recognition is:
extracting an audio feature vector from the preprocessed audio according to the second deep learning model;
matching the extracted audio feature vector against the audio data in the database to obtain a second recognition result, and sending the second recognition result to the database for storage; the second deep learning model is continuously updated as the database is updated.
CN201910433837.8A 2019-05-23 2019-05-23 Scene analysis system and method based on deep learning technology Pending CN110263653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910433837.8A CN110263653A (en) 2019-05-23 2019-05-23 Scene analysis system and method based on deep learning technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910433837.8A CN110263653A (en) 2019-05-23 2019-05-23 Scene analysis system and method based on deep learning technology

Publications (1)

Publication Number Publication Date
CN110263653A 2019-09-20

Family

ID=67915131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910433837.8A Pending CN110263653A (en) 2019-05-23 2019-05-23 Scene analysis system and method based on deep learning technology

Country Status (1)

Country Link
CN (1) CN110263653A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991329A (en) * 2019-11-29 2020-04-10 上海商汤智能科技有限公司 Semantic analysis method and device, electronic equipment and storage medium
CN112001275A (en) * 2020-08-09 2020-11-27 成都未至科技有限公司 Robot for collecting student information
WO2021134459A1 (en) * 2019-12-31 2021-07-08 Asiainfo Technologies (China) , Inc. Ai intelligentialization based on signaling interaction
CN115328661A (en) * 2022-09-09 2022-11-11 中诚华隆计算机技术有限公司 Computing power balance execution method and chip based on voice and image characteristics
CN115440000A (en) * 2021-06-01 2022-12-06 广东艾檬电子科技有限公司 Campus early warning protection method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104902345A (en) * 2015-05-26 2015-09-09 多维新创(北京)技术有限公司 Method and system for realizing interactive advertising and marketing of products
CN106095903A (en) * 2016-06-08 2016-11-09 成都三零凯天通信实业有限公司 A kind of radio and television the analysis of public opinion method and system based on degree of depth learning art
CN106709804A (en) * 2015-11-16 2017-05-24 优化科技(苏州)有限公司 Interactive wealth planning consulting robot system
WO2018052561A1 (en) * 2016-09-13 2018-03-22 Intel Corporation Speaker segmentation and clustering for video summarization
CN109558935A (en) * 2018-11-28 2019-04-02 黄欢 Emotion recognition and exchange method and system based on deep learning


Similar Documents

Publication Publication Date Title
CN110263653A (en) Scene analysis system and method based on deep learning technology
CN107945805B (en) Intelligent cross-language speech recognition and conversion method
CN111461176B (en) Multi-mode fusion method, device, medium and equipment based on normalized mutual information
CN108197115A (en) Intelligent interactive method, device, computer equipment and computer readable storage medium
CN102509547B (en) Method and system for voiceprint recognition based on vector quantization
CN103366618B (en) Scene device for Chinese learning training based on artificial intelligence and virtual reality
CN107731233A (en) Voiceprint recognition method based on RNN
CN110289003A (en) Voiceprint recognition method, model training method, and server
CN108428446A (en) Speech recognition method and device
CN107972028B (en) Man-machine interaction method and device and electronic equipment
CN107492382A (en) Voiceprint extraction method and device based on neural network
CN104700843A (en) Method and device for identifying ages
CN107622797A (en) Sound-based health determination system and method
CN109036437A (en) Accent recognition method, apparatus, computer device and computer-readable storage medium
CN106776832B (en) Processing method, apparatus and system for question and answer interactive log
CN109036395A (en) Personalized speaker control method, system, smart speaker and storage medium
CN110085220A (en) Intelligent interaction device
CN107871499A (en) Speech recognition method, system, computer device and computer-readable storage medium
CN106512393A (en) Application voice control method and system suitable for virtual reality environment
CN112863529B (en) Speaker voice conversion method based on adversarial learning and related equipment
CN109872713A (en) Voice wake-up method and device
CN110428853A (en) Voice activity detection method, voice activity detection device and electronic equipment
CN109887510A (en) Voiceprint recognition method and device based on empirical mode decomposition and MFCC
CN110136726A (en) Voice gender estimation method, device, system and storage medium
Zhang et al. Voice biometric identity authentication system based on android smart phone

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 519000 office 618, No. 2202, xinxiangjiang Road, Hengqin new area, Zhuhai, Guangzhou, Guangdong

Applicant after: Guangdong Dingyi Interconnection Technology Co.,Ltd.

Address before: 519000 unit 1 and unit 3, 10th floor, convention and Exhibition Center, No. 1, Software Park Road, Zhuhai, Guangdong

Applicant before: Guangdong Dingyi Interconnection Technology Co.,Ltd.