CN112434714A - Multimedia identification method, device, storage medium and electronic equipment - Google Patents


Info

Publication number: CN112434714A
Authority: CN (China)
Prior art keywords: multimedia, sample data, loss function, training, model
Legal status: Pending
Application number: CN202011416146.6A
Other languages: Chinese (zh)
Inventor: 闫冰程
Current Assignee: Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee: Beijing Xiaomi Pinecone Electronic Co Ltd
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Pinecone Electronic Co Ltd
Priority: CN202011416146.6A
Publication: CN112434714A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Abstract

The present disclosure relates to a multimedia recognition method, apparatus, storage medium, and electronic device. The method comprises: collecting multimedia data through an acquisition component; acquiring data features of the multimedia data; and inputting the data features into a multimedia recognition model to obtain a recognition result. The multimedia recognition model is obtained by training a preset training model according to a multimedia sample data set, a first loss function, and an objective function, where the objective function comprises a sample random function and/or a second loss function. The sample random function is used to obtain sample data features of the multimedia sample data at different amplitudes; the second loss function is used to drive the sum of the weights of the sample data features of the multimedia sample data toward a preset value. This reduces the amount of multimedia sample data required for training, reduces the storage space the sample data occupies, and improves the accuracy of multimedia recognition.

Description

Multimedia identification method, device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a storage medium, and an electronic device for multimedia recognition.
Background
With the development of artificial intelligence technology, understanding of multimedia recognition tasks such as speech recognition and image recognition has deepened, and the requirements on recognition accuracy have risen accordingly. In the related art, a preset training model generally needs to be trained on a large amount of multimedia sample data to obtain a multimedia recognition model that meets the recognition requirements; when little multimedia sample data is available for training, the recognition accuracy of the trained model is low.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method, an apparatus, a storage medium, and an electronic device for multimedia recognition.
According to a first aspect of embodiments of the present disclosure, there is provided a method of multimedia recognition, the method including:
collecting multimedia data through a collecting component;
acquiring data characteristics of the multimedia data;
inputting the data characteristics into a multimedia recognition model to obtain recognition results of the data characteristics, wherein the recognition results are used for representing target information corresponding to the multimedia data;
the multimedia recognition model is obtained after a preset training model is trained according to a multimedia sample data set, a first loss function and an objective function, wherein the multimedia sample data set comprises a plurality of multimedia sample data and a target recognition result corresponding to each multimedia sample data; the objective function comprises a sample random function and/or a second loss function; the sample random function is used for acquiring sample data characteristics of a plurality of multimedia sample data under different amplitudes; the first loss function is used for representing the difference between a training recognition result obtained by a plurality of multimedia sample data through a preset training model and the target recognition result; the second loss function is used for enabling the weight sum of the sample data characteristics of the plurality of multimedia sample data to approach a preset value, and the weight represents the influence degree of the sample data characteristics on the recognition result output by the multimedia recognition model.
With this method, embodiments of the present disclosure can use the sample random function to represent different amplitudes of the sample data features during training of the multimedia recognition model, and/or use the second loss function to reduce the influence of those amplitude differences. This reduces the amount of multimedia sample data required for training, reduces the storage space it occupies, and improves training efficiency; the resulting multimedia recognition model can accurately recognize multimedia data of different amplitudes, improving the accuracy of multimedia recognition.
Optionally, the multimedia recognition model is trained by:
acquiring the multimedia sample data set, wherein the multimedia sample data set comprises a plurality of multimedia sample data and a target identification result corresponding to each multimedia sample data;
acquiring sample data characteristics corresponding to each multimedia sample data;
determining the first loss function according to the training recognition result and the target recognition result;
and training the preset training model according to the sample data characteristics, the target recognition result, the first loss function and the target function to obtain the multimedia recognition model.
In this way, the sample random function represents different amplitudes of the sample data features, and/or the second loss function reduces the influence of those amplitude differences, so that less multimedia sample data is required for training and less storage space is occupied.
Optionally, when the target function includes a sample random function, training the preset training model according to the sample data feature, the target recognition result, the first loss function, and the target function to obtain the multimedia recognition model includes:
calculating to obtain new sample data characteristics according to the sample data characteristics and the sample random function;
and taking the new sample data characteristics as the input of a preset training model, taking the target recognition result as the output of the preset training model, and training the preset training model according to the first loss function to obtain the multimedia recognition model.
In this way, different amplitudes of the sample data features are represented by the sample random function, reducing the storage space occupied by the multimedia sample data.
Optionally, when the objective function includes a second loss function, training the preset training model according to the sample data feature, the target recognition result, the first loss function, and the objective function to obtain the multimedia recognition model includes:
determining a second loss function according to the weight of the sample data characteristics;
and taking the sample data characteristics as input of a preset training model, taking the target recognition result as output of the preset training model, and training the preset training model according to the first loss function and the second loss function to obtain the multimedia recognition model.
In this way, the second loss function reduces the influence of different amplitudes of the sample data features, so that less multimedia sample data is required for training and less storage space is occupied.
Optionally, the obtaining sample data characteristics corresponding to each piece of multimedia sample data includes:
extracting the characteristics of each multimedia sample data, and acquiring the temporary characteristics of the sample data corresponding to each multimedia sample data;
obtaining the standard deviation of the temporary characteristics of the sample data, wherein the standard deviation represents the discrete degree of the temporary characteristics of the sample data;
standardizing each sample data temporary feature according to the standard deviation to obtain the sample data features;
said determining a second loss function according to the weight of the sample data features comprises:
and determining a second loss function according to the weight of the sample data characteristic and the standard deviation.
In this way, when the sample features are standardized, the second loss function is determined from both the weights and the standard deviation of the sample features, which reduces the influence of different amplitudes of the sample data features.
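The standardization step above can be sketched as follows. This is a minimal NumPy sketch under assumptions of our own: the patent does not fix the normalization axes, and the exact form combining the weights with the standard deviation is not given, so `second_loss_with_std` is a hypothetical formulation for illustration only.

```python
import numpy as np

def standardize(temp_features):
    """Standardize temporary sample features by their standard deviation.

    The standard deviation characterizes the dispersion of the temporary
    features; dividing by it normalizes the feature amplitude.
    """
    std = float(np.std(temp_features))
    feats = temp_features / max(std, 1e-10)  # guard against zero dispersion
    return feats, std

def second_loss_with_std(weights, std, preset=1.0):
    """Hypothetical second loss combining feature weights and the standard
    deviation: penalizes deviation of the (std-scaled) weight sum from a
    preset value. The patent only states both quantities are used."""
    return (np.sum(weights) * std - preset) ** 2
```

Here the preset value defaults to 1.0 purely as an illustrative choice.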
Optionally, the multimedia data comprises voice data or image data.
In this way, multimedia recognition tasks such as speech recognition, voice wake-up, and image recognition can be realized.
According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for multimedia recognition, the apparatus comprising:
a data acquisition module configured to acquire multimedia data through an acquisition component;
a feature acquisition module configured to acquire data features of the multimedia data;
the data identification module is configured to input the data characteristics into a multimedia identification model to obtain an identification result of the data characteristics, and the identification result is used for representing target information corresponding to the multimedia data;
the multimedia recognition model is obtained after a preset training model is trained according to a multimedia sample data set, a first loss function and an objective function, wherein the multimedia sample data set comprises a plurality of multimedia sample data and a target recognition result corresponding to each multimedia sample data; the objective function comprises a sample random function and/or a second loss function; the sample random function is used for acquiring sample data characteristics of a plurality of multimedia sample data under different amplitudes; the first loss function is used for representing the difference between a training recognition result obtained by a plurality of multimedia sample data through a preset training model and the target recognition result; the second loss function is used for enabling the weight sum of the sample data characteristics of the plurality of multimedia sample data to approach a preset value, and the weight represents the influence degree of the sample data characteristics on the recognition result output by the multimedia recognition model.
Optionally, the apparatus further comprises a model training module, the model training module comprising:
a sample data set obtaining sub-module configured to obtain the multimedia sample data set, where the multimedia sample data set includes a plurality of multimedia sample data and a target identification result corresponding to each multimedia sample data;
the sample characteristic acquisition sub-module is configured to acquire sample data characteristics corresponding to each piece of multimedia sample data;
a first loss function determination submodule configured to determine the first loss function from the training recognition result and the target recognition result;
and the first training submodule is configured to train the preset training model according to the sample data characteristics, the target recognition result, the first loss function and the target function to obtain the multimedia recognition model.
Optionally, in case the objective function comprises a sample random function, the first training submodule is configured to:
calculating to obtain new sample data characteristics according to the sample data characteristics and the sample random function;
and taking the new sample data characteristics as the input of a preset training model, taking the target recognition result as the output of the preset training model, and training the preset training model according to the first loss function to obtain the multimedia recognition model.
Optionally, in case the objective function comprises a second loss function, the first training submodule is configured to:
determining a second loss function according to the weight of the sample data characteristics;
and taking the sample data characteristics as input of a preset training model, taking the target recognition result as output of the preset training model, and training the preset training model according to the first loss function and the second loss function to obtain the multimedia recognition model.
Optionally, the sample feature acquisition sub-module is configured to:
extracting the characteristics of each multimedia sample data, and acquiring the temporary characteristics of the sample data corresponding to each multimedia sample data;
obtaining the standard deviation of the temporary characteristics of the sample data, wherein the standard deviation represents the discrete degree of the temporary characteristics of the sample data;
standardizing each sample data temporary feature according to the standard deviation to obtain the sample data features;
the first training submodule configured to:
determining a second loss function according to the weight of the sample data characteristic and the standard deviation;
and taking the sample data characteristics as input of a preset training model, taking the target recognition result as output of the preset training model, and training the preset training model according to the first loss function and the second loss function to obtain the multimedia recognition model.
Optionally, the multimedia data comprises voice data or image data.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of the first aspect of the disclosure.
The technical scheme provided by the embodiment of the disclosure can at least achieve the following beneficial effects:
in training the multimedia recognition model, different amplitudes of the sample data features are represented by a sample random function, and/or the influence of those amplitude differences is reduced by a second loss function. This reduces the amount of multimedia sample data required for training, reduces the storage space it occupies, and improves model training efficiency; the multimedia recognition model obtained by this training method can accurately recognize multimedia data of different amplitudes, improving the accuracy of multimedia data recognition.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow chart illustrating a method of multimedia recognition according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method of multimedia recognition model training in accordance with an exemplary embodiment.
Fig. 3 is a schematic diagram illustrating a structure of a multimedia recognition apparatus according to an exemplary embodiment.
Fig. 4 is a schematic diagram illustrating another multimedia recognition apparatus according to an example embodiment.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 6 is a block diagram illustrating another electronic device in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
First, an application scenario of the present disclosure is explained. The disclosure can be applied to artificial intelligence scenarios, in particular multimedia recognition scenarios such as image recognition and voice wake-up. In the related art, when little multimedia sample data is available for training, the recognition accuracy of the trained multimedia recognition model is low. The inventor found that the amplitude of the multimedia data (i.e., the brightness of an image or the volume of audio) strongly affects the accuracy of the multimedia recognition model. To improve accuracy, a large amount of multimedia sample data with different amplitudes must be acquired for model training; however, manually acquiring such data is inefficient, and the acquired data occupies a large amount of storage space.
In the related art, multimedia sample data with different amplitudes is generally obtained by data augmentation: before model training, amplitude transformations are applied to the original multimedia sample data to generate a large amount of sample data with different amplitudes, which is stored and then used for model training. However, this approach still requires a large amount of storage space and reduces the efficiency of model training.
To solve the above problems, the present disclosure provides a multimedia recognition method, apparatus, storage medium, and electronic device. In training the multimedia recognition model, different amplitudes of the sample data features are represented by a sample random function, and/or the influence of those amplitude differences is reduced by a second loss function, so that less multimedia sample data is required for training, less storage space is occupied, and model training efficiency is improved.
It should be noted that multimedia recognition in the embodiments of the present disclosure may include image recognition, voice wake-up, and other multimedia recognition tasks. The multimedia recognition model may be any model that acquires recognition capability through learning from samples, for example a neural network model such as a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network). Other types of machine learning models, such as an SVM (Support Vector Machine), may also be used.
The present disclosure is described below with reference to specific examples.
Fig. 1 is a flow chart illustrating a method of multimedia recognition according to an exemplary embodiment, which may include the steps of, as shown in fig. 1:
and S101, acquiring multimedia data through an acquisition component.
The multimedia data may be voice data or image data, for example, voice data uttered by a user and received by a microphone, or image data acquired by a camera.
And S102, acquiring the data characteristics of the multimedia data.
In this step, data features can be obtained in various ways. For example, when the multimedia data is voice data, features may be extracted using a filter-bank (FBank) method. When the multimedia data is an image, the image pixels may be acquired as data features by scanning the image. In this way, multimedia data is converted into data features that a computer program can process for multimedia recognition.
S103, inputting the data characteristics into a multimedia recognition model to obtain a recognition result of the data characteristics.
The identification result of the data characteristic is used for representing the target information corresponding to the multimedia data. In the case where the multimedia data is voice data, the target information may be target text, target action, target wake-up word, or the like. In the case where the multimedia data is image data, the target information may be target text, target motion, target emotion, target animal or target plant, or the like.
The multimedia recognition model is obtained after a preset training model is trained according to a multimedia sample data set, a first loss function and a target function, wherein the multimedia sample data set comprises a plurality of multimedia sample data and a target recognition result corresponding to each multimedia sample data; the first loss function is used for representing the difference between a training recognition result obtained by a plurality of multimedia sample data through a preset training model and the target recognition result; the objective function comprises a sample random function and/or a second loss function; the sample random function is used for acquiring sample data characteristics of a plurality of multimedia sample data under different amplitudes; the second loss function is used for enabling the weight sum of a plurality of sample data characteristics of the multimedia sample data to approach a preset value, and the weight represents the influence degree of the sample data characteristics on the recognition result output by the multimedia recognition model.
With this method, during training of the multimedia recognition model, different amplitudes of the sample data features are represented by the sample random function, and/or their influence is reduced by the second loss function, so that less multimedia sample data is required for training, less storage space is occupied, and model training efficiency is improved. The resulting multimedia recognition model can accurately recognize multimedia data of different amplitudes, improving recognition accuracy. Taking voice wake-up as an example, the method can significantly improve the wake-up rate in far-field, low-volume environments.
Fig. 2 shows a method for obtaining a multimedia recognition model by training a preset training model according to a multimedia sample data set, a first loss function, and an objective function, where fig. 2 is a flowchart of a method for training a multimedia recognition model according to an exemplary embodiment, and the method may include the following steps:
s201, acquiring a multimedia sample data set.
The multimedia sample data set comprises a plurality of multimedia sample data and a target recognition result corresponding to each multimedia sample data, and can be used for training a preset training model or correcting the multimedia recognition model.
The multimedia sample data set can be obtained by manual acquisition and labeling, or an existing multimedia training data set can be used. For example, the multimedia sample data set may be a voice sample data set or an image sample data set. A voice sample data set contains a plurality of voice sample data and a target recognition result corresponding to each, where the target recognition result may be a target text, target action, target wake-up word, or the like. Similarly, an image sample data set contains a plurality of image sample data and corresponding target recognition results, where the target recognition result may be a target text, target motion, target animal, target plant, or the like.
S202, acquiring sample data characteristics corresponding to each multimedia sample data.
When the multimedia sample data is voice sample data, signal processing and feature extraction can be performed on each voice sample to obtain the corresponding sample data features. For example, voice sample data may contain both the speaker's voice signal and external noise; signal processing such as noise reduction and de-reverberation is first applied to obtain a relatively clean voice signal, and features are then extracted from that signal. Feature extraction may use the filter-bank (FBank) method, whose steps are as follows: the voice sample data is framed, windowed, DC-removed, and pre-emphasized; a Fast Fourier Transform (FFT) is applied; a mel filter bank is applied; and finally the voice sample features are obtained through modulus, squaring, and logarithm operations.
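The FBank extraction steps above can be sketched in NumPy as follows. This is a minimal illustrative sketch, not the patent's implementation: the parameter values (16 kHz sample rate, 400-sample frames, 160-sample hop, 40 mel bands) are common defaults assumed here, and the noise-reduction/de-reverberation stage is omitted.

```python
import numpy as np

def fbank_features(signal, sample_rate=16000, frame_len=400, hop=160,
                   n_fft=512, n_mels=40):
    """Minimal log mel filter-bank (FBank) feature extraction sketch."""
    # Remove the DC component and apply pre-emphasis.
    signal = signal - np.mean(signal)
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # FFT, then modulus and squaring -> power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Build a triangular mel filter bank.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sample_rate / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    # Logarithm of the filter-bank energies (floored to avoid log(0)).
    return np.log(np.maximum(power @ fbank.T, 1e-10))
```

For a one-second signal at 16 kHz this yields a (98, 40) feature matrix: 98 frames of 25 ms with a 10 ms hop, each with 40 log mel energies.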
In the case where the multimedia sample data is image sample data, pixels in the image sample data may be acquired by image scanning as sample data features.
S203, determining a first loss function according to the training recognition result and the target recognition result.
The training recognition result represents a result obtained by the multimedia sample data through a preset training model, the target recognition result is obtained from the multimedia sample data set, and the first loss function is used for representing a difference between the training recognition result and the target recognition result. The goal of model training is to get the value of the first loss function close to a first target loss value, which may be 0 or some predetermined value.
And S204, training a preset training model according to the sample data characteristics, the target recognition result, the first loss function and the target function to obtain a multimedia recognition model.
Wherein the objective function comprises a sample random function and/or a second loss function; the sample random function is used for acquiring sample data characteristics of multimedia sample data under different characteristic amplitudes; the second loss function is used for enabling the weight sum of the sample data characteristics of a plurality of multimedia sample data to approach a preset value, and the weight represents the influence degree of the sample data characteristics on the recognition result output by the multimedia recognition model.
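The second loss function described above can be sketched as a simple penalty term added to the recognition loss. This is a hypothetical formulation under our own assumptions: the patent states only that the sum of the feature weights should approach a preset value, so the squared-error form, the preset value of 1.0, and the weighting coefficient `lam` are illustrative choices, not taken from the source.

```python
import numpy as np

def second_loss(weights, preset=1.0):
    """Penalty driving the sum of per-feature weights toward a preset value.

    `weights` holds the influence degree of each sample data feature on the
    model's recognition result; the penalty is zero when their sum equals
    the preset value."""
    return (np.sum(weights) - preset) ** 2

def total_loss(first_loss, weights, preset=1.0, lam=0.01):
    """Combined training objective: recognition loss (the first loss
    function) plus the amplitude-regularizing second loss."""
    return first_loss + lam * second_loss(weights, preset)
```

In a real training loop `weights` would be trainable parameters of the preset training model, and `first_loss` the difference between the training recognition result and the target recognition result.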
The training can be carried out in the following three ways:
the first training mode is as follows: under the condition that the target function comprises a sample random function, new sample data characteristics of a plurality of multimedia sample data under different characteristic amplitudes can be obtained according to the sample random function; and then, taking the new sample data characteristics as the input of a preset training model, taking the target recognition result as the output of the preset training model, and training the preset training model according to the first loss function to obtain the multimedia recognition model.
Depending on whether the multimedia sample data is voice sample data or image sample data, the sample random function may take different forms.
In the case that the multimedia sample data is voice sample data, the expression of the sample random function may be 2ln(rand()), where rand() denotes obtaining a random variable greater than 0; a new sample data feature may then be obtained through the following first feature update formula:
x2=x1+2ln(rand())
where x1 denotes the sample data feature obtained by the feature extraction, 2ln(rand()) denotes the random offset produced by the sample random function (negative when rand() < 1 and positive when rand() > 1), and x2 denotes the new sample data feature after calculation.
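The first feature update formula can be sketched as follows; the sampling range for rand() is an illustrative assumption, since the text only requires the random variable to be greater than 0:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_log_feature(x1, low=1e-3, high=10.0):
    # x2 = x1 + 2*ln(rand()): shift the log-domain feature by a random
    # scalar offset, equivalent to scaling the waveform amplitude.
    beta = rng.uniform(low, high)      # rand() > 0; range is an assumption
    return x1 + 2.0 * np.log(beta)
```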
It should be noted that, in the related art, amplitude augmentation of the multimedia sample data used for training is performed by multiplying the multimedia sample data by a plurality of preset multiples to obtain a plurality of new multimedia sample data, which are then stored and used as the input of model training. When a preset multiple is greater than 0 and less than 1, the amplitude of the multimedia sample data is reduced to obtain new multimedia sample data; when a preset multiple is greater than 1, the amplitude is increased to obtain new multimedia sample data. To improve the effect of amplitude augmentation, a plurality of preset multiples need to be set, for example, values such as 0.1, 0.5, 0.8, 2, 5 and 10; the multimedia sample data obtained in this way therefore occupies a large amount of storage space.
The inventor finds that, in the manner of acquiring the sample data features in step S202, in the case that the multimedia sample data is voice sample data, the framing, windowing, direct-current component removal, pre-emphasis, FFT and Mel filtering are all linear transformations, while the final modulus, squaring and logarithm operations are non-linear transformations. Therefore, amplitude augmentation by multiplying by a preset multiple may be replaced by adding a preset variable to the sample data feature after feature extraction, where the preset variable is twice the logarithm of the preset multiple. When the preset multiple is a random variable obtained through the sample random function, this yields the manner of amplitude augmentation given by the first feature update formula.
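The equivalence claimed above can be checked directly on the non-linear tail of the pipeline: since all steps before the modulus, squaring and logarithm are linear, scaling the waveform by a factor k shifts every log-power value by the constant 2·ln(k). A minimal check:

```python
import numpy as np

rng = np.random.default_rng(1)
signal = rng.standard_normal(512)
k = 3.0  # amplitude scale factor (illustrative)

def log_power(sig):
    # logarithm of the squared FFT magnitude
    return np.log(np.abs(np.fft.rfft(sig)) ** 2)

# Scaling the waveform by k shifts every log-power feature by 2*ln(k).
shift = log_power(k * signal) - log_power(signal)
```

This is why one additive random offset in the feature domain can replace storing many multiplied copies of the waveform.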
Further, the training of the preset training model may involve multiple rounds of training for each input sample data feature. To reflect the influence of different amplitudes, each sample data feature may be passed through the first feature update formula to obtain a corresponding new sample data feature before each use in model training, and the new sample data feature is then used as the input of the preset training model.
Similarly, when the multimedia sample data is image sample data and the sample data features are pixels, the pixel acquisition of the image sample data is linear, so the amplitude of the sample data feature may be changed by adding a random value. The sample random function may be rand(), where rand() denotes an acquired random variable whose value may be positive or negative, and a new sample data feature may be obtained through the following second feature update formula:
y2=y1+rand()
where y2 represents the new sample data feature after calculation, y1 represents the value of a pixel in the image sample data, and rand() represents the acquired random variable, which may be positive or negative.
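The second feature update formula for the image branch can be sketched in the same way; the offset range below is an illustrative assumption, since the text only requires a signed random value:

```python
import numpy as np

rng = np.random.default_rng(2)

def augment_pixels(y1, max_offset=30.0):
    # y2 = y1 + rand(): shift all pixel values by one signed random offset
    # to simulate image data acquired at a different amplitude level.
    return y1 + rng.uniform(-max_offset, max_offset)
```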
Therefore, by adopting the first training mode, new sample data features representing different amplitudes can be obtained by applying the sample random function after feature extraction, which reduces the occupation of storage space and improves the efficiency of obtaining sample data features; moreover, the multimedia recognition model trained with the new sample data features can accurately recognize multimedia data with different amplitudes.
A second training mode: in the case where the objective function includes a second loss function, the second loss function may be determined first according to the weight of the sample data features; and then, taking the sample data characteristics as the input of a preset training model, taking the target recognition result as the output of the preset training model, and training the preset training model according to the first loss function and the second loss function to obtain the multimedia recognition model.
The second loss function is used for enabling the weight sum of a plurality of sample data characteristics of the multimedia sample data to approach a preset value, and the weight represents the influence degree of the sample data characteristics on the recognition result output by the multimedia recognition model.
Under the condition that a neural network is adopted to train the multimedia recognition model, the output of the first hidden layer is wx1 + b, where x1 represents the sample data features, w represents the model weights, and b represents the model bias. According to the description of the first training mode, a random variable may be obtained through the sample random function and added to the sample data feature to obtain a new sample data feature for training, in which case the output of the first hidden layer becomes wx2 + b = w(x1 + 2ln(β)E) + b = wx1 + b + 2ln(β)·wE, where β represents the random variable obtained through the sample random function, used to characterize an amplitude, and E represents a unit (all-ones) vector. Although the random variable is generated randomly, by making the sum of the weights of the sample data features approach 0, that is, making wE approach 0, the influence of the amplitude variable β on the multimedia recognition model, that is, the influence of amplitude on the recognition result of the multimedia recognition model, can be suppressed. In particular, when wE = 0, the effect of amplitude on the recognition result of the multimedia recognition model is completely eliminated.
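The cancellation when wE = 0 can be verified numerically; the dimensions and the value of β below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# First-hidden-layer weights whose components sum to zero (wE = 0).
w = rng.standard_normal(26)
w -= w.mean()                    # enforce sum(w) == 0
b = 0.5

x1 = rng.standard_normal(26)     # original sample data feature
beta = 4.2                       # amplitude factor from the sample random function
x2 = x1 + 2 * np.log(beta)       # feature shifted to a different amplitude

# Because sum(w) == 0, the additive amplitude term cancels in the output.
out1 = w @ x1 + b
out2 = w @ x2 + b
```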
It should be noted that, when the multimedia recognition model is trained, because the range of the obtained sample data features is often very large and the dispersion is very high, the obtained sample data features can be standardized, so that the convergence speed of model parameters in the training process is increased, and the efficiency of model training is improved. The acquired sample data characteristics and the normalization processing of the acquired sample data characteristics can be performed by the following steps:
firstly, feature extraction is carried out on each multimedia sample data, and sample data temporary features corresponding to each multimedia sample data are obtained.
Secondly, the standard deviation of the sample data temporary features is obtained; the standard deviation is the arithmetic square root of the mean of the squared deviations from the mean, and represents the degree of dispersion of all the sample data temporary features.
And finally, standardizing each sample data temporary characteristic according to the standard deviation to obtain the sample data characteristic.
Of course, during model training, the acquired sample data features may not be normalized. The following describes how to obtain the second loss function according to whether to standardize the sample data features during training of the multimedia recognition model:
the first manner of obtaining the second loss function is as follows: when the multimedia recognition model is trained without normalizing the sample data features, the second loss function may be determined according to the weights of the sample data features, and its expression may include:

L1 = |Σ_{m=1..M} Σ_{n=1..N} Σ_{c=1..C} w_{m,n,c}|

or

L1 = Σ_{c=1..C} |Σ_{m=1..M} Σ_{n=1..N} w_{m,n,c}|

where L1 represents the value of the second loss function, w represents the weights of the sample data features, M represents the length of a convolution kernel of the first hidden layer of the multimedia recognition model, the first hidden layer being the hidden layer directly connected to the input layer of the multimedia recognition model, N represents the width of the convolution kernel, and C represents the number of convolution kernels; M, N and C jointly index the plurality of convolution kernels that partition the sample data features.
The second manner of obtaining the second loss function is as follows: when the multimedia recognition model is trained with normalization of the sample data features, the second loss function may be determined according to the weights of the sample data features and the standard deviation, and its expression may include:

L1 = |Σ_{c=1..C} (w/v)_c|

or

L1 = Σ_{c=1..C} |(w/v)_c|

where L1 represents the value of the second loss function, w represents the weights of the sample data features, v represents the standard deviation, and C represents the number of convolution kernels of the first hidden layer of the multimedia recognition model, the first hidden layer being the hidden layer directly connected to the input layer of the multimedia recognition model.
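Both variants amount to a regularization term that drives the first-hidden-layer weight sums toward the preset value 0. The sketch below is an assumption reconstructed from the textual description, with an optional standard deviation v applied when the features are normalized:

```python
import numpy as np

def second_loss(w, v=None):
    # w: first-hidden-layer convolution kernels, shape (M, N, C).
    # v: optional standard deviation used when features were normalized.
    # Returns the two variants: the absolute value of the joint weight
    # sum, and the sum over kernels of absolute per-kernel weight sums.
    if v is not None:
        w = w / v
    return np.abs(w.sum()), np.abs(w.sum(axis=(0, 1))).sum()
```

During training this term would be added, suitably weighted, to the first loss function.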
Therefore, by adopting the second training mode, the influence of amplitude on the multimedia recognition model is eliminated through the second loss function, so that a relatively accurate multimedia recognition model is obtained without adding multimedia sample data of different amplitudes; this reduces the storage space occupied by the training data, and the trained multimedia recognition model can accurately recognize multimedia data with different amplitudes.
A third training mode: under the condition that the target function comprises both the sample random function and the second loss function, new sample data features of the plurality of multimedia sample data under different feature amplitudes may first be obtained according to the sample random function; then, the new sample data features are used as the input of the preset training model, the target recognition result is used as the output of the preset training model, and the preset training model is trained according to the first loss function and the second loss function to obtain the multimedia recognition model.
In the third mode, the sample random function in the first training mode and the second loss function in the second training mode are used together, and the specific implementation mode may refer to the description in the first training mode and the second training mode, which is not described herein again.
By adopting the third training mode, the recognition accuracy of the multimedia recognition model obtained by training can be further improved.
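One training step of the third mode can be sketched end to end on a toy model. Everything concrete here is an assumption for illustration: a single linear layer with softmax stands in for the preset training model, cross-entropy for the first loss function, absolute weight-column sums for the second loss function, and a narrow range for the sample random function:

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(W, b, x1, target, lr=0.01, lam=0.01):
    # Sample random function: re-derive the feature at a random amplitude.
    x2 = x1 + 2.0 * np.log(rng.uniform(0.5, 2.0))
    probs = softmax(x2 @ W + b)
    # First loss gradient: d(cross-entropy)/d(logits) = probs - one_hot.
    grad_logits = probs.copy()
    grad_logits[target] -= 1.0
    # Second loss: drive each column's weight sum toward 0 (subgradient of |sum|).
    col_sums = W.sum(axis=0)
    gW = np.outer(x2, grad_logits) + lam * np.sign(col_sums)
    W = W - lr * gW
    b = b - lr * grad_logits
    return W, b, probs[target]
```

Repeating such steps trains a model whose recognition is increasingly insensitive to the amplitude offset injected by the sample random function.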
Fig. 3 is a schematic structural diagram illustrating a multimedia recognition apparatus according to an exemplary embodiment, and as shown in fig. 3, the apparatus includes a data obtaining module 301, a feature obtaining module 302, and a data recognition module 303:
a data acquisition module 301 configured to acquire multimedia data by an acquisition component;
a feature obtaining module 302 configured to obtain a data feature of the multimedia data;
the data identification module 303 is configured to input the data characteristics into the multimedia identification model, so as to obtain an identification result of the data characteristics, where the identification result is used to represent target information corresponding to the multimedia data;
the multimedia recognition model is obtained after a preset training model is trained according to a multimedia sample data set, a first loss function and an objective function, wherein the multimedia sample data set comprises a plurality of multimedia sample data and a target recognition result corresponding to each multimedia sample data; the objective function comprises a sample random function and/or a second loss function; the sample random function is used for acquiring sample data characteristics of a plurality of multimedia sample data under different amplitudes; the first loss function is used for representing the difference between a training recognition result obtained by a plurality of multimedia sample data through a preset training model and the target recognition result; the second loss function is used for enabling the weight sum of a plurality of sample data characteristics of the multimedia sample data to approach a preset value, and the weight represents the influence degree of the sample data characteristics on the recognition result output by the multimedia recognition model.
By the device, in the training of the multimedia recognition model, different amplitudes of the sample data characteristics are represented by using the sample random function, and/or the influence of different amplitudes of the sample data characteristics is reduced by using the second loss function, so that the number of multimedia sample data required by training is reduced, the occupation of the multimedia sample data on the storage space is reduced, and the model training efficiency is improved; the multimedia recognition model obtained by the training method can realize accurate recognition aiming at multimedia data with different amplitudes, and improves the accuracy of multimedia recognition.
Optionally, fig. 4 is a schematic structural diagram of a multimedia recognition apparatus according to another exemplary embodiment, and as shown in fig. 4, the apparatus further includes a model training module 401, where the model training module 401 includes:
a sample data set obtaining sub-module 4011, configured to obtain the multimedia sample data set, where the multimedia sample data set includes a plurality of multimedia sample data and a target identification result corresponding to each multimedia sample data;
a sample characteristic obtaining sub-module 4012 configured to obtain a sample data characteristic corresponding to each multimedia sample data;
a first loss function determining sub-module 4013 configured to determine the first loss function according to the training recognition result and the target recognition result;
the first training sub-module 4014 is configured to train the preset training model according to the sample data features, the target recognition result, the first loss function, and the target function to obtain the multimedia recognition model.
Optionally, in case the objective function comprises a sample random function, the first training submodule 4014 is configured to:
calculating to obtain new sample data characteristics according to the sample data characteristics and the sample random function;
and taking the new sample data characteristics as the input of a preset training model, taking the target recognition result as the output of the preset training model, and training the preset training model according to the first loss function to obtain the multimedia recognition model.
Optionally, in case the objective function comprises a second loss function, the first training submodule 4014 is configured to:
determining a second loss function according to the weight of the sample data characteristic;
and taking the sample data characteristics as the input of a preset training model, taking the target recognition result as the output of the preset training model, and training the preset training model according to the first loss function and the second loss function to obtain the multimedia recognition model.
Optionally, the sample feature obtaining sub-module 4012 is configured to:
extracting the characteristics of each multimedia sample data, and acquiring the temporary characteristics of the sample data corresponding to each multimedia sample data;
obtaining the standard deviation of the temporary characteristic of the sample data, wherein the standard deviation represents the discrete degree of the temporary characteristic of the sample data;
standardizing each temporary sample data feature according to the standard deviation to obtain the sample data features;
the first training sub-module 4014 is configured to:
determining a second loss function according to the weight of the sample data characteristic and the standard deviation;
and taking the sample data characteristics as the input of a preset training model, taking the target recognition result as the output of the preset training model, and training the preset training model according to the first loss function and the second loss function to obtain the multimedia recognition model.
Optionally, the multimedia data includes voice data or image data.
By the device, in the training of the multimedia recognition model, different amplitudes of the sample data characteristics are represented by using the sample random function, and/or the influence of different amplitudes of the sample data characteristics is reduced by using the second loss function, so that the number of multimedia sample data required by training is reduced, the occupation of the multimedia sample data on the storage space is reduced, and the model training efficiency is improved; the multimedia recognition model obtained by the training method can realize accurate recognition aiming at multimedia data with different amplitudes, and improves the accuracy of multimedia recognition.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the multimedia recognition method provided by the present disclosure.
Fig. 5 is a block diagram illustrating an electronic device 500 in accordance with an example embodiment. For example, the electronic device 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, electronic device 500 may include one or more of the following components: a processing component 502, a memory 504, a power component 506, a multimedia component 508, an audio component 510, an input/output (I/O) interface 512, a sensor component 514, and a communication component 516.
The processing component 502 generally controls overall operation of the electronic device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 502 may include one or more processors 520 to execute instructions to perform all or a portion of the steps of the multimedia recognition method described above. Further, the processing component 502 can include one or more modules that facilitate interaction between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
The memory 504 is configured to store various types of data to support operations at the electronic device 500. Examples of such data include instructions for any application or method operating on the electronic device 500, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 504 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 506 provides power to the various components of the electronic device 500. Power components 506 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 500.
The multimedia component 508 includes a screen that provides an output interface between the electronic device 500 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 508 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 500 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 510 is configured to output and/or input audio signals. For example, the audio component 510 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 500 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 504 or transmitted via the communication component 516. In some embodiments, audio component 510 further includes a speaker for outputting audio signals.
The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 514 includes one or more sensors for providing various aspects of status assessment for the electronic device 500. For example, the sensor assembly 514 may detect an open/closed state of the electronic device 500, the relative positioning of components, such as a display and keypad of the electronic device 500, the sensor assembly 514 may detect a change in the position of the electronic device 500 or a component of the electronic device 500, the presence or absence of user contact with the electronic device 500, orientation or acceleration/deceleration of the electronic device 500, and a change in the temperature of the electronic device 500. The sensor assembly 514 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate wired or wireless communication between the electronic device 500 and other devices. The electronic device 500 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 516 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 516 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described multimedia recognition method.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 504 comprising instructions, executable by the processor 520 of the electronic device 500 to perform the above-described multimedia recognition method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned multimedia recognition method when executed by the programmable apparatus.
Fig. 6 is a block diagram illustrating another electronic device 600 according to an example embodiment. For example, the apparatus 600 may be provided as a server. Referring to fig. 6, the apparatus 600 includes a processing component 622 that further includes one or more processors, and memory resources, represented by memory 632, for storing instructions, such as applications, that are executable by the processing component 622. The application programs stored in memory 632 may include one or more modules that each correspond to a set of instructions. Further, the processing component 622 is configured to execute instructions to perform the multimedia recognition method described above.
The apparatus 600 may also include a power component 626 configured to perform power management of the apparatus 600, a wired or wireless network interface 650 configured to connect the apparatus 600 to a network, and an input/output (I/O) interface 658. The apparatus 600 may operate based on an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. A method of multimedia recognition, the method comprising:
collecting multimedia data through a collecting component;
acquiring data characteristics of the multimedia data;
inputting the data characteristics into a multimedia recognition model to obtain recognition results of the data characteristics, wherein the recognition results are used for representing target information corresponding to the multimedia data;
the multimedia recognition model is obtained after a preset training model is trained according to a multimedia sample data set, a first loss function and an objective function, wherein the multimedia sample data set comprises a plurality of multimedia sample data and a target recognition result corresponding to each multimedia sample data; the objective function comprises a sample random function and/or a second loss function; the sample random function is used for acquiring sample data characteristics of a plurality of multimedia sample data under different amplitudes; the first loss function is used for representing the difference between a training recognition result obtained by a plurality of multimedia sample data through a preset training model and the target recognition result; the second loss function is used for enabling the weight sum of the sample data characteristics of the plurality of multimedia sample data to approach a preset value, and the weight represents the influence degree of the sample data characteristics on the recognition result output by the multimedia recognition model.
2. The method of claim 1, wherein the multimedia recognition model is trained by:
acquiring the multimedia sample data set, wherein the multimedia sample data set comprises a plurality of multimedia sample data and a target identification result corresponding to each multimedia sample data;
acquiring sample data characteristics corresponding to each multimedia sample data;
determining the first loss function according to the training recognition result and the target recognition result;
and training the preset training model according to the sample data characteristics, the target recognition result, the first loss function and the target function to obtain the multimedia recognition model.
3. The method of claim 2, wherein, in the case that the objective function includes a sample random function, training the preset training model according to the sample data characteristics, the target recognition result, the first loss function, and the objective function to obtain the multimedia recognition model comprises:
calculating to obtain new sample data characteristics according to the sample data characteristics and the sample random function;
and taking the new sample data characteristics as the input of a preset training model, taking the target recognition result as the output of the preset training model, and training the preset training model according to the first loss function to obtain the multimedia recognition model.
4. The method of claim 2, wherein, in the case that the objective function includes a second loss function, training the preset training model according to the sample data features, the target recognition result, the first loss function, and the objective function to obtain the multimedia recognition model comprises:
determining the second loss function according to the weights of the sample data features;
and training the preset training model according to the first loss function and the second loss function, with the sample data features as the input of the preset training model and the target recognition result as the expected output, to obtain the multimedia recognition model.
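A minimal sketch of the joint objective in claim 4, assuming a squared-error first loss, a linear model whose parameters play the role of the feature weights, and the quadratic second loss; the preset value and the balancing factor `lam` are illustrative assumptions:

```python
import numpy as np

def total_loss(w, x, y, preset=1.0, lam=0.1):
    # Assumed joint objective: squared-error "first loss" plus a
    # "second loss" that pulls the sum of the feature weights w
    # toward the preset value, weighted by lam.
    pred = x @ w
    first = np.mean((pred - y) ** 2)
    second = (np.sum(w) - preset) ** 2
    return first + lam * second
```

Minimizing this combined quantity is one way to train "according to the first loss function and the second loss function" at the same time.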
5. The method of claim 4, wherein the acquiring sample data features corresponding to each multimedia sample data comprises:
performing feature extraction on each multimedia sample data to obtain a temporary sample data feature corresponding to each multimedia sample data;
obtaining the standard deviation of the temporary sample data features, wherein the standard deviation characterizes the degree of dispersion of the temporary sample data features;
and normalizing each temporary sample data feature according to the standard deviation to obtain the sample data features;
and the determining the second loss function according to the weights of the sample data features comprises:
determining the second loss function according to the weights of the sample data features and the standard deviation.
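Claim 5's normalization step and standard-deviation-aware second loss could be sketched as follows; dividing each feature by its per-dimension standard deviation, and rescaling each weight by that deviation inside the penalty, are assumed concretizations:

```python
import numpy as np

def standardize(temp_features, eps=1e-8):
    # Normalize each temporary feature by the per-dimension standard
    # deviation, which measures how dispersed the temporary features are.
    std = np.std(temp_features, axis=0)
    return temp_features / (std + eps), std

def second_loss_with_std(weights, std, preset=1.0):
    # Assumed form: rescale each weight by its feature's standard
    # deviation before summing, then compare the sum with the preset
    # value, so the penalty accounts for the normalization above.
    return (np.sum(weights * std) - preset) ** 2
```

Folding the standard deviation back into the penalty keeps the effective (pre-normalization) weight sum near the preset value, even though the model only ever sees normalized features.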
6. The method of any of claims 1 to 5, wherein the multimedia data comprises voice data or image data.
7. An apparatus for multimedia recognition, the apparatus comprising:
a data acquisition module configured to acquire multimedia data through an acquisition component;
a feature acquisition module configured to acquire data features of the multimedia data;
a data identification module configured to input the data features into a multimedia recognition model to obtain a recognition result of the data features, wherein the recognition result is used for representing target information corresponding to the multimedia data;
the multimedia recognition model is obtained by training a preset training model according to a multimedia sample data set, a first loss function, and an objective function, wherein the multimedia sample data set comprises a plurality of multimedia sample data and a target recognition result corresponding to each multimedia sample data; the objective function comprises a sample random function and/or a second loss function; the sample random function is used for obtaining sample data features of the plurality of multimedia sample data at different amplitudes; the first loss function is used for characterizing the difference between training recognition results, obtained by passing the plurality of multimedia sample data through the preset training model, and the target recognition results; and the second loss function is used for driving the sum of the weights of the sample data features of the plurality of multimedia sample data toward a preset value, the weights characterizing the degree to which the sample data features influence the recognition result output by the multimedia recognition model.
8. The apparatus of claim 7, further comprising a model training module, the model training module comprising:
a sample data set obtaining sub-module configured to obtain the multimedia sample data set, where the multimedia sample data set includes a plurality of multimedia sample data and a target identification result corresponding to each multimedia sample data;
the sample characteristic acquisition sub-module is configured to acquire sample data characteristics corresponding to each piece of multimedia sample data;
a first loss function determination submodule configured to determine the first loss function from the training recognition result and the target recognition result;
and a first training submodule configured to train the preset training model according to the sample data features, the target recognition result, the first loss function, and the objective function to obtain the multimedia recognition model.
9. The apparatus of claim 8, wherein in the case that the objective function comprises a sample random function, the first training submodule is configured to:
calculating new sample data features from the sample data features and the sample random function;
and training the preset training model according to the first loss function, with the new sample data features as the input of the preset training model and the target recognition result as the expected output, to obtain the multimedia recognition model.
10. The apparatus of claim 8, wherein in the case that the objective function comprises a second loss function, the first training submodule is configured to:
determining the second loss function according to the weights of the sample data features;
and training the preset training model according to the first loss function and the second loss function, with the sample data features as the input of the preset training model and the target recognition result as the expected output, to obtain the multimedia recognition model.
11. The apparatus of claim 10, wherein the sample feature acquisition sub-module is configured to:
performing feature extraction on each multimedia sample data to obtain a temporary sample data feature corresponding to each multimedia sample data;
obtaining the standard deviation of the temporary sample data features, wherein the standard deviation characterizes the degree of dispersion of the temporary sample data features;
and normalizing each temporary sample data feature according to the standard deviation to obtain the sample data features;
and the first training submodule is further configured to:
determine the second loss function according to the weights of the sample data features and the standard deviation;
and train the preset training model according to the first loss function and the second loss function, with the sample data features as the input of the preset training model and the target recognition result as the expected output, to obtain the multimedia recognition model.
12. The apparatus of any of claims 7 to 11, wherein the multimedia data comprises voice data or image data.
13. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
14. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 6.
CN202011416146.6A 2020-12-03 2020-12-03 Multimedia identification method, device, storage medium and electronic equipment Pending CN112434714A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011416146.6A CN112434714A (en) 2020-12-03 2020-12-03 Multimedia identification method, device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011416146.6A CN112434714A (en) 2020-12-03 2020-12-03 Multimedia identification method, device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112434714A true CN112434714A (en) 2021-03-02

Family

ID=74691696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011416146.6A Pending CN112434714A (en) 2020-12-03 2020-12-03 Multimedia identification method, device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112434714A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114140723A (en) * 2021-12-01 2022-03-04 北京有竹居网络技术有限公司 Multimedia data identification method and device, readable medium and electronic equipment


Similar Documents

Publication Publication Date Title
CN109800737B (en) Face recognition method and device, electronic equipment and storage medium
CN109871896B (en) Data classification method and device, electronic equipment and storage medium
CN107945133B (en) Image processing method and device
CN109543066B (en) Video recommendation method and device and computer-readable storage medium
CN106845398B (en) Face key point positioning method and device
CN109360197B (en) Image processing method and device, electronic equipment and storage medium
CN109670077B (en) Video recommendation method and device and computer-readable storage medium
CN107967459B (en) Convolution processing method, convolution processing device and storage medium
CN106557759B (en) Signpost information acquisition method and device
CN109165738B (en) Neural network model optimization method and device, electronic device and storage medium
CN110837761A (en) Multi-model knowledge distillation method and device, electronic equipment and storage medium
CN107341509B (en) Convolutional neural network training method and device and readable storage medium
CN109685041B (en) Image analysis method and device, electronic equipment and storage medium
CN109543069B (en) Video recommendation method and device and computer-readable storage medium
CN112188091B (en) Face information identification method and device, electronic equipment and storage medium
CN111985635A (en) Method, device and medium for accelerating neural network inference processing
CN113707134A (en) Model training method and device for model training
CN112148980A (en) Item recommendation method, device, equipment and storage medium based on user click
CN109447258B (en) Neural network model optimization method and device, electronic device and storage medium
CN110533006B (en) Target tracking method, device and medium
CN112434714A (en) Multimedia identification method, device, storage medium and electronic equipment
CN113642551A (en) Nail key point detection method and device, electronic equipment and storage medium
CN112308588A (en) Advertisement putting method and device and storage medium
CN111488964A (en) Image processing method and device and neural network training method and device
CN110659726B (en) Image processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination