CN110727871A

CN110727871A - Multi-mode data acquisition and comprehensive analysis platform based on convolution decomposition depth model

Info

Publication number: CN110727871A
Application number: CN201910999213.2A
Authority: CN
Inventors: 王钟贤; 姚潇; 刘旭宸; 李朝宇; 徐宁; 刘小峰
Original assignee: Changzhou Campus of Hohai University
Current assignee: Changzhou Campus of Hohai University
Priority date: 2019-10-21
Filing date: 2019-10-21
Publication date: 2020-01-24

Abstract

The invention discloses a multi-modal data acquisition and comprehensive analysis platform based on a convolution decomposition depth model, which comprises the following steps: S1, establishing a data interaction module; S2, establishing a data analysis module; and S3, establishing a user service module. The invention simultaneously supports data in the form of text, voice, pictures and the like. In terms of data collection, the invention assumes by default that users are the main source of data, and therefore provides a good interaction mode and a high-concurrency, high-availability database management mode. In terms of data analysis, pictures are trained and classified using deep-learning CNN and RNN networks, and texts are extracted and merged using a TF-IDF (term frequency-inverse document frequency) word frequency network from NLP (natural language processing); a BP neural network built on the standard keras module and the tf.keras module under TensorFlow is used to collect and accurately classify audio.

Description

Multi-mode data acquisition and comprehensive analysis platform based on convolution decomposition depth model

Technical Field

The invention relates to a multi-modal data acquisition and comprehensive analysis platform based on a convolution decomposition depth model, and belongs to the technical field of computers.

Background

From the current macro environment, first, the artificial intelligence industry is in a period of vigorous development: core algorithms keep achieving breakthroughs, computing capability has improved markedly, and the demand for all kinds of data and learning samples is growing explosively. Second, artificial intelligence has many sub-fields, and its application in smartphones, intelligent vehicles, smart homes, intelligent robots, internet entertainment and social networking inevitably makes the demand for data sets increasingly fine-grained and direction-specific. Consequently, there is a growing demand for a user-oriented AI data acquisition platform that provides refined data. Existing data acquisition platforms often have the following problems. First, users hope to obtain the data set they expect, but often find that the mass data on the platform are too coarse or scattered to meet their refined usage standards. Second, retrieval and matching are often neither optimal nor intelligent, so users frequently lose their way when handling the data. Third, the platform offers little interaction with users and struggles to meet high-concurrency, high-availability design requirements.

Disclosure of Invention

In order to overcome the defects in the prior art, the invention provides a multi-modal data acquisition and comprehensive analysis platform based on a convolution decomposition depth model, which simultaneously supports data in the form of text, voice, pictures and the like. In terms of data collection, the invention assumes by default that users are the main source of data, and therefore provides a good interaction mode and a high-concurrency, high-availability database management mode. In terms of data analysis, pictures are trained and classified using deep-learning CNN and RNN networks, and texts are extracted and merged using a TF-IDF (term frequency-inverse document frequency) word frequency network from NLP (natural language processing); a BP neural network built on the standard keras module and the tf.keras module under TensorFlow is used to collect and accurately classify audio.

The technical path of the invention is as follows:

A data requester publishes a task on the client. The task information (such as the requirements and the theme) is stored in the MySQL database for later retrieval and comparison, and is also displayed on the APP interface. A task taker can upload corresponding data. The background then checks how well the uploaded data match the required data: if the similarity is higher than a preset threshold, the upload succeeds; otherwise it is rejected. A task taker whose upload succeeds can receive the corresponding reward.

The method specifically comprises the following steps:

S1, establishing a data interaction module;

S2, establishing a data analysis module;

and S3, establishing a user service module.

In step S1, a data interaction module is established, which includes the following processes:

S11, web backend data transfer based on Tomcat;

S12, Android development based on IDEA;

S13, horizontal scaling of the database at the service layer, using data hashing and a service connection pool.

Step S2 is the core functional module and includes the following processes:

First, for the recognition and classification of data in the form of pictures and texts, the relevant networks are built with the PyCharm IDE on the TensorFlow framework, and a matching algorithm is constructed.

CNN and RNN neural networks are used for image recognition and character-level text classification, and image recognition based on Inception v3 can identify finer-grained objects. Unlike conventional convolutional networks, it uses a convolution kernel decomposition technique.

Early networks used relatively large convolution kernels: LeNet-5 used 5×5 kernels, and AlexNet used 11×11, 5×5 and 3×3 kernels. In fact, a large convolution kernel is less effective than small ones. To cover the same receptive field, stacking additional small-kernel convolutional layers uses fewer parameters and yields a deeper network; with fewer parameters and more nonlinear transformations, the classification performance of the model improves. It is therefore better to build convolutional layers from 3×3 kernels. For example, two stacked 3×3 convolutional layers cover the same receptive field as one 5×5 convolutional layer, which also means better generalization.
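The parameter saving described above can be checked with a little arithmetic. The following is a minimal sketch, not taken from the patent: the helper names and the channel count of 64 are assumptions chosen for illustration, and biases are ignored.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutional layers applied in sequence."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each stride-1 layer widens the field by k - 1
    return field


channels = 64  # hypothetical channel count, same for input and output

single_5x5 = conv_params(5, channels, channels)
two_3x3 = 2 * conv_params(3, channels, channels)

print("receptive field, two 3x3 layers:", stacked_receptive_field([3, 3]))
print("receptive field, one 5x5 layer: ", stacked_receptive_field([5]))
print("weights, one 5x5 layer: ", single_5x5)
print("weights, two 3x3 layers:", two_3x3)
```

Under these assumptions both arrangements see a 5×5 window, while the stacked pair needs 18 weights per channel pair instead of 25.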

Because the feature points selected by the basic Inception model are somewhat scattered, the recognition of pictures with complex features (such as seaside houses or churches with colored tiles) is improved by additional training at the bottleneck layer.

The improved convolutional neural network is based on the classical AlexNet and comprises a convolutional layer 1, a convolutional layer 2, a convolutional layer 3, a convolutional layer 4, a convolutional layer 5, a pooling layer 1, a pooling layer 2, a pooling layer 3, a pooling layer 4, a pooling layer 5, a fully connected layer 1, a fully connected layer 2, a fully connected layer 3 and a fusion layer. The last fully connected layer outputs twice the number of feature points; for example, if the number of feature points (such as house tiles and the top of a church) is 9, the output size is 18.

The convolutional and pooling layers extract and filter information. The convolution kernels of the convolutional layers are 3×3 with a stride of 1, and the pooling kernels of the max-pooling layers are 2×2. Convolutional layers 2, 3, 4 and 5 each contain two stacked convolutional layers. Two 3×3 convolutional layers in series are equivalent to one 5×5 convolutional layer but have far fewer parameters, which reduces the training time of the whole network.

A Dropout operation is performed after fully connected layers 1 and 2 to improve generalization. LeakyReLU is selected as the activation function; in its standard form,

$$y_i = \begin{cases} x_i, & x_i \ge 0 \\ \alpha x_i, & x_i < 0 \end{cases}$$

where x_i is the input matrix of the convolved image, y_i is the output matrix of the convolved image after the activation function, and α is a small positive slope applied to negative inputs.

The LeakyReLU function converges faster than the conventional ReLU function.
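The difference between the two activation functions can be seen on a few sample values. This is a minimal sketch of the standard definitions, not the patent's implementation; the slope value 0.01 is an assumption, since the text does not state one.

```python
def relu(x):
    """Conventional ReLU: negative inputs are set to zero."""
    return x if x > 0 else 0.0


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: negative inputs keep a small slope alpha (assumed 0.01)."""
    return x if x >= 0 else alpha * x


inputs = [-2.0, -0.5, 0.0, 1.5]
relu_out = [relu(x) for x in inputs]
leaky_out = [leaky_relu(x) for x in inputs]

for x, r, l in zip(inputs, relu_out, leaky_out):
    print(f"x={x:5.2f}  relu={r:6.3f}  leaky_relu={l:6.3f}")
```

Because the negative branch keeps a nonzero slope, a unit that receives negative inputs still passes a gradient backward, which is the usual explanation for the faster convergence noted above.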

As a result, the weight of marginal regions is reduced, the central object block is highlighted, and the recognition accuracy improves to a relatively high level of 98.6%.

Using supervised learning, the texts that users commonly upload are divided into 20 categories, and an RNN and a CNN are constructed and trained on them.

When deep learning is applied to natural language processing, some approaches are phrase-based and others are word-based. Here the neural network does not need prior knowledge about words (a lookup table or word2vec); such word knowledge is often high-dimensional and difficult to apply to a convolutional neural network. Nor does the convolutional neural network need prior knowledge of syntax or semantics. The model classifies text at the character level. The training and test data come from Tsinghua University's THUCNews data set, and the accuracy on the test set exceeds 96%.

Advantageous effects:

1. The convolutional neural network used for text classification and picture recognition is based on AlexNet. It uses the Leaky ReLU activation function and multiple CPUs to speed up training. Overlapping pooling improves accuracy and makes overfitting less likely. Local response normalization smooths the responses and creates competition among the activities of neighboring neurons, so that relatively large responses become more prominent; this improves accuracy and the generalization ability of the model.

2. The invention supports data collection flexibly and diversely: pictures, audio and text in almost all common formats can be uploaded, which ensures diverse and sufficient data sources. For picture recognition, the Inception v3 model widely accepted in the industry is retrained at the bottleneck layer, which reduces the weight of marginal regions and highlights the central object block. The richer spatial features brought by the idea of factorization into small convolutions (splitting a larger two-dimensional convolution into two smaller one-dimensional convolutions) are sharper, and the recognition rate improves by about 3 percentage points.

3. The database is scaled horizontally using a consistent hashing algorithm (derived from DHT) and connection pool technology, which meets the high-concurrency and high-availability requirements and greatly improves the user experience.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a functional diagram of the module of step S1;

FIG. 3 is a functional diagram of the module of step S2;

FIG. 4 is a functional diagram of the audio processing and verifying module in the step S2;

FIG. 5 is a first schematic diagram illustrating a value range of an output value of a hash function;

FIG. 6 is a second schematic diagram illustrating the value range of the output value of the hash function.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

As shown in fig. 1, a multi-modal data collection and comprehensive analysis platform based on a convolutional decomposition depth model specifically includes the following steps:

S1, establishing a data interaction module;

S11, web backend data transfer based on Tomcat;

S12, Android development based on IDEA.

The module of step S1 is shown in fig. 2. The Tomcat part provides the Tomcat-based web backend data transmission of module S11, which is explained in detail below.

The Tomcat server is at the center. Information published by an APP user is packaged as JSON data and sent to the Tomcat server over TCP/IP. The server parses it, stores part of the data in the database, and returns the parsing result to the APP client for presentation to the user.

Taking an APP user uploading a picture for a task as an example:

1. The picture information is obtained on the APP side, and the data are sent to the Tomcat server using OkHttp3.

2. The server parses and stores the picture, and returns the stored picture path to the APP client.

3. A thread is started to package the data and send them over a socket to a Python-based network service. A TensorFlow-based neural network judges whether the picture stored on the server meets the task requirement, and the result is returned to the APP client once the judgment is complete.

4. If the picture is approved, the uploaded picture is displayed; otherwise, the user is told that the submitted picture does not meet the requirements.

5. When the user submits the task, the Tomcat server is accessed again and the data are stored in the database.

For the S13 module, the invention uses the data hash mode and the service connection pool to realize the horizontal expansion of the database by the service layer.

To meet the high-concurrency and high-availability requirements, the database is scaled horizontally based on a consistent hashing algorithm. Consistent hashing, proposed in 1997 at the Massachusetts Institute of Technology, is a distributed hash table (DHT) implementation algorithm designed to solve the hot spot problem on the internet; it is a special hashing scheme. Traditional hashing causes a large amount of data migration when the number of nodes changes, whereas consistent hashing avoids this problem. Consistent hashing was originally the placement algorithm of distributed caching systems, where it remains popular, but it is now widely used in many other fields. The output of any hash function has a value range, and this range can be drawn as a ring, as shown in figs. 5 and 6.

Each node is assigned a position on the ring by the hash function, and each key is also mapped to a position on the ring. A key is placed on the nearest node whose position is equal to or greater than the key's position, that is, on the next node clockwise. Because the hash function is generally uniform and independent of the input, consistent hashing distributes keys evenly when there are many keys and nodes.

With consistent hashing, the database is more robust and performs well under large volumes of load requests.
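The ring placement rule can be sketched in a few lines. This is an illustrative sketch only, not the platform's code: the class name, the node names `db-0` to `db-3`, the use of MD5 as the hash function and the use of 100 virtual positions per node are all assumptions introduced here.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring: a key belongs to the first node position >= its own hash."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual positions per node (assumption)
        self._positions = []      # sorted positions on the ring
        self._owner = {}          # position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def get_node(self, key):
        if not self._positions:
            raise LookupError("the ring has no nodes")
        idx = bisect.bisect_left(self._positions, self._hash(key))
        if idx == len(self._positions):
            idx = 0  # wrap around: the ring is circular
        return self._owner[self._positions[idx]]


ring = ConsistentHashRing(["db-0", "db-1", "db-2"])
keys = [f"task:{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("db-3")  # scale out by one database node
after = {k: ring.get_node(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
print("all moved keys now live on db-3:", all(after[k] == "db-3" for k in moved))
```

Adding a node only takes over keys whose clockwise successor becomes one of the new node's positions, so every other key stays where it was; this is the limited data migration that distinguishes the scheme from traditional modulo hashing.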

In addition, establishing a database connection consumes considerable resources and time, and the number of connections that can be established is limited. The data interaction process is therefore optimized with connection pool technology: the client obtains a connection object from the connection pool, and the physical connection behind that object stays open from the start to the end of use. This reduces the time spent establishing connections and is very beneficial for a multi-user, high-concurrency platform.
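The idea of reusing open connections can be illustrated with a small pool. This is a hedged sketch rather than the platform's implementation: the `ConnectionPool` class and `make_connection` helper are hypothetical, and an in-memory SQLite connection stands in for the MySQL connection so that the example runs with the standard library alone.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are opened once and then reused."""

    def __init__(self, factory, size=4):
        self._idle = queue.Queue(maxsize=size)
        self.created = 0
        for _ in range(size):
            self._idle.put(factory())
            self.created += 1

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # wait if all are in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return to the pool; never closed here


def make_connection():
    # Stand-in for a MySQL connection (assumption for this sketch).
    return sqlite3.connect(":memory:", check_same_thread=False)


pool = ConnectionPool(make_connection, size=2)

results = []
for _ in range(10):
    with pool.connection() as conn:
        results.append(conn.execute("SELECT 1").fetchone()[0])

print("queries served:", len(results), "connections opened:", pool.created)
```

Ten queries are served by two physical connections, so the connection setup cost is paid twice rather than ten times, and the pool size also caps the number of simultaneous connections.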

S2, establishing a data analysis module

As shown in fig. 3, the convolutional neural network CNN is essentially an input-to-output mapping, which can learn a large number of input-to-output mapping relationships without any precise mathematical expression between the input and the output, and the network has the capability of mapping between input-output pairs as long as the convolutional network is trained with a known pattern.

The convolutional neural network is mainly used to recognize two-dimensional patterns that are invariant to shift, scaling and other forms of distortion. Thanks to its special structure of locally shared weights, it has unique advantages in speech recognition and image processing. Its layout is closer to that of a real biological neural network, and weight sharing reduces the complexity of the network. In particular, because an image, as a multi-dimensional input vector, can be fed directly into the network, the complexity of data reconstruction during feature extraction and classification is avoided.

Neural networks are used for text and image analysis and classification. In a CNN, gradient descent often gets trapped in a local optimum, so the final accuracy is not high. Gradient descent presents several challenges.

1. It is difficult to select an appropriate learning rate. Too small a learning rate makes the network converge too slowly, while too large a learning rate can hinder convergence and cause the loss function to fluctuate around the minimum or even diverge.

2. The same learning rate does not suit all parameter updates. If the training data are sparse and feature frequencies differ greatly, the parameters should not all be updated to the same extent; rarely occurring features should receive larger updates.

3. Another key challenge in minimizing the non-convex error function of a neural network is avoiding being trapped in its many suboptimal local minima. In practice, the difficulty stems less from local minima than from saddle points, i.e. points where the surface slopes upward in one dimension and downward in another. These saddle points are usually surrounded by a plateau of equal error values, which makes it hard for the SGD algorithm to escape because the gradient is close to zero in all dimensions.

The following remedies are adopted. An improved BP algorithm is used, with an added momentum term and an adaptive learning rate.

Other optimization algorithms are combined: a genetic algorithm optimizes the initial weights so as to locate the region of the global optimum in advance.

The network is retrained. Because each training run gives a different result, the next run may not be trapped in a local minimum; the learning function and the training function are changed several times and the test is repeated.

Finally, the classification accuracy of the model of the invention is successfully raised to above 95%.
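The effect of a momentum term can be shown on a toy problem. The sketch below is an assumption-laden illustration, not the improved BP algorithm itself: it minimizes the one-dimensional function f(w) = w², and the learning rate 0.01, momentum coefficient 0.9 and step count 200 are values chosen here for the demonstration. The adaptive learning rate is not shown.

```python
def gradient(w):
    """Gradient of the toy loss f(w) = w * w."""
    return 2.0 * w


def descend(w, learning_rate, momentum, steps):
    """Gradient descent with a momentum term; momentum=0 gives plain descent."""
    velocity = 0.0
    for _ in range(steps):
        # The velocity accumulates past gradients, damped by the momentum factor.
        velocity = momentum * velocity - learning_rate * gradient(w)
        w += velocity
    return w


w_plain = descend(5.0, learning_rate=0.01, momentum=0.0, steps=200)
w_momentum = descend(5.0, learning_rate=0.01, momentum=0.9, steps=200)

print(f"plain descent after 200 steps:    w = {w_plain:.6f}")
print(f"momentum descent after 200 steps: w = {w_momentum:.6f}")
```

With the same small learning rate, the run with momentum ends much closer to the minimum at w = 0, because the accumulated velocity keeps the update moving where the gradient alone is small.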

The data analysis module in step S2 shown in fig. 4 includes an audio processing and accuracy verification module. The concrete description is as follows:

A BP neural network constructed with the standard keras module and the tf.keras module under TensorFlow is used to collect and accurately classify the audio.

The first step is to label the GTZAN Genre Collection music data set for classification. The audio is first processed in the frequency domain using the Mel-frequency cepstrum, and MFCC feature points, quantization bits, spectrum, sampling frequency and the like are extracted to represent the audio characteristics and neutral data. The audio data set is then automatically clustered into ten categories using the K-means clustering algorithm, and the ten categories are finally labeled manually.

The second step is to train a BP neural network on the labeled data set to act as a multi-class classifier. First, 25000 MFCC feature points are extracted from the audio as the neurons of the input layer. Then two hidden layers with 10000 nodes each are set up. Using the ReLU nonlinear activation function and the softmax normalized exponential function as the multi-class classifier, the audio to be classified is assigned to the expected audio category.
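The forward pass of such a classifier can be sketched in plain Python. This is only a scaled-down illustration under stated assumptions: the layer sizes (25 inputs and 10 nodes per hidden layer instead of 25000 and 10000), the random untrained weights and all function names are hypothetical, and the back-propagation training step is not shown.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Normalized exponential; subtracting the max keeps exp() stable."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
n_features, n_hidden, n_classes = 25, 10, 10  # scaled-down sizes (assumption)

layer1 = init_layer(n_features, n_hidden, rng)
layer2 = init_layer(n_hidden, n_hidden, rng)
output = init_layer(n_hidden, n_classes, rng)

mfcc = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]  # fake MFCC vector

hidden1 = relu(dense(mfcc, *layer1))
hidden2 = relu(dense(hidden1, *layer2))
probabilities = softmax(dense(hidden2, *output))

predicted = max(range(n_classes), key=lambda i: probabilities[i])
print("predicted category index:", predicted)
```

The softmax output is a probability distribution over the ten categories, and the category with the highest probability is taken as the prediction.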

Before audio classification and speech recognition, it is necessary to process and analyze the input audio to extract features therefrom and facilitate further operations.

In terms of audio spectral analysis and preprocessing:

First, the audio is preprocessed, mainly by automatic cutting and simple denoising. The audio is cut using Python's pydub and wave libraries. The sampling frequency and the number of sampling points of the audio are obtained first, and the binary audio data read are converted into a computable array according to the number of channels and the quantization unit, from which the size of each frame of the sound signal is calculated. The threshold and pass band are then set manually to determine which portions of the audio are of too poor quality and should be cut.

Second, the Mel-frequency cepstrum is used to simulate human hearing and obtain the corresponding MFCC values; the advantage is that the result can be fed directly into the neural network.

At the speech recognition module:

The recording is preprocessed and cut, according to voice intensity and breathing pauses, into multiple audio segments of no more than ten seconds each, and the Speech to Text interface of IBM Bluemix is called. The returned DetailedResponse object is converted to JSON format, and the expected text is read using a regular expression.

Crawling a text containing common words:

Building the word frequency network requires a large amount of text, and the text should cover as many aspects of life as possible, such as daily life, finance and sports. Documents on Baidu were therefore chosen for crawling. The key to crawling Baidu is determining the types of text that need to be crawled. With the Selenium browser automation testing framework, the crawler logs in to a Baidu account and avoids advertisements when opening the relevant texts; it then obtains common words using modules such as regular expressions and copyry, and stores them in text files.

Recognizing the text by using a TF-IDF word frequency network:

Based on the result of picture recognition or audio recognition, the short sentence recognized by the machine is converted using the transform function of the TF-IDF network, combined with the existing dictionary in the word frequency network, to obtain a sparse matrix in CSC (compressed sparse column) format. The specific requirements of the task and the machine's judgment of the user's submission are then both converted through the network into a computable matrix form. Finally, the similarity of the short sentences is computed using cosine similarity or Euclidean distance, and the accuracy of the submission is judged from that similarity.

A keyword extraction function is added to the text processing, based on the existing word frequency network. First, stop words are added in the text preprocessing stage, so that large numbers of meaningless function words (such as auxiliary particles) are not extracted. Then the window size used when reading the text is set according to whether two-word, three-word or longer phrases are required, and the step length for reading the text is specified.
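Stop-word filtering followed by a sliding window can be sketched as follows. This is a minimal illustration, not the platform's code: the function name, the English sample sentence and the stop-word set are assumptions made for the example.

```python
def extract_ngrams(tokens, n=2, step=1, stop_words=frozenset()):
    """Drop stop words, then slide a window of n tokens forward by `step`."""
    kept = [t for t in tokens if t not in stop_words]
    return [tuple(kept[i:i + n]) for i in range(0, len(kept) - n + 1, step)]


# Hypothetical sample sentence and stop-word list.
tokens = "the platform collects the picture and the audio data".split()
stop_words = frozenset({"the", "and"})

bigrams = extract_ngrams(tokens, n=2, step=1, stop_words=stop_words)
strided = extract_ngrams(tokens, n=2, step=2, stop_words=stop_words)

print(bigrams)
print(strided)
```

Here `n` plays the role of the window variable (2 for two-word phrases, 3 for three-word phrases) and `step` is the step length with which the window advances through the text.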

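TF-IDF algorithm

In its standard form, the weight is computed as

$$W_i = \mathrm{TF}(t_i) \times \log\frac{N}{n_i}$$

where:

W_i is the keyword weight of the entry in the sentence;

t_i is an individual entry of the sentence;

TF(t_i) is the frequency of the entry t_i in the sentence;

N is the total number of documents in the corpus within the limited range;

n_i is the number of documents containing the entry t_i.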
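The weighting and the similarity check can be combined in a short example. This is a hedged sketch using the standard TF-IDF form (term frequency times the logarithm of the total number of documents divided by the number of documents containing the term); the three sample phrases, the function names and the threshold 0.25 are assumptions introduced for illustration, not values from the platform.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """One sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


requirement = "red church roof tiles".split()    # hypothetical task requirement
recognized = "church with red tiles".split()     # hypothetical recognition result
unrelated = "jazz music recording".split()       # hypothetical mismatching upload

vec_req, vec_rec, vec_unrel = tfidf_vectors([requirement, recognized, unrelated])

sim_match = cosine(vec_req, vec_rec)
sim_miss = cosine(vec_req, vec_unrel)
threshold = 0.25  # hypothetical acceptance threshold

print(f"matching upload:  {sim_match:.3f} accepted={sim_match > threshold}")
print(f"unrelated upload: {sim_miss:.3f} accepted={sim_miss > threshold}")
```

Terms shared by the requirement and the recognized phrase raise the cosine similarity above the threshold, while an upload that shares no terms scores zero and would be rejected.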

S3, establishing a user service module.

In the user service module, the invention has the following special advantages:

different users, different rights

An ordinary user obtains an account by registering, and uses the account to log in and modify personal information. A task publisher can obtain all answers under the task, which others are not allowed to view; a task taker can see only the information that he or she submitted. An administrator cannot access the answers under a specific task.

Friendly interactive interface

To use the platform, a user first registers an account and logs in with it once registration succeeds. The password retrieval function is convenient and quick, sparing the user unnecessary trouble. The user can fill in information on the personal information page, which serves as an important reference for others to get to know the user. Through the client, the user can also freely view the details of each task, obtain the data he or she needs, or publish a new task and invite other users to provide data. Users and data that do not pass AI scoring can be evaluated manually; data that pass manual evaluation are fed back to the task publisher, and the task taker receives the corresponding reward.

At the notification details page, the user can see the notification and the latest reply of the current platform at a glance. On the publishing page, the user can clearly see the real-time dynamic of the tasks published by the user.

Fuzzy search function of task

The task receiver can search the interested tasks through the search function, and meanwhile, the user can search the data and the related information required by the user through the function.

Diverse data support

The AI data acquisition platform strives for diversity on top of fine-grained handling of each individual data type, and supports data collection in various forms, including pictures, texts and audio. Pictures can be local files or photos taken and uploaded on the spot, and formats such as jpg, bmp, gif and png are recognized; texts can be local documents in formats such as doc and txt; audio can be mp3 or wav files or instant recordings, for the user's convenience.

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.


Claims

1. A multi-modal data acquisition and comprehensive analysis platform based on a convolution decomposition depth model is characterized by comprising the following steps:

S1, establishing a data interaction module;

S2, establishing a data analysis module;

and S3, establishing a user service module.

2. The platform of claim 1, wherein the step S1 is to establish a data interaction module, and comprises the following steps:

S11, web backend data transmission based on Tomcat;

S12, Android development based on IDEA.

3. The platform of claim 1, wherein the step S2 is specifically performed as follows:

for the recognition and classification of data in the form of pictures and texts, a relevant network is built with the PyCharm IDE on the TensorFlow framework, and a matching algorithm is constructed;

the algorithm adopts an improved convolutional neural network based on the classical AlexNet, comprising a convolutional layer 1, a convolutional layer 2, a convolutional layer 3, a convolutional layer 4, a convolutional layer 5, a pooling layer 1, a pooling layer 2, a pooling layer 3, a pooling layer 4, a pooling layer 5, a fully connected layer 1, a fully connected layer 2, a fully connected layer 3 and a fusion layer, wherein the last fully connected layer outputs twice the number of feature points;

the convolutional layers and the pooling layers extract and filter information, wherein the convolution kernel of each convolutional layer is 3×3 with a stride of 1, and the pooling kernel of each max-pooling layer is 2×2; the convolutional layers 2, 3, 4 and 5 each comprise two stacked convolutional layers, the series connection of two 3×3 convolutional layers being equivalent to one 5×5 convolutional layer while having far fewer parameters than a 5×5 convolutional layer, so that the training time of the whole network is reduced;

and performing Dropout operation after the full connection layer 1 and the full connection layer 2 to improve the generalization capability, wherein the activation function selects LeakyReLu: