CN111477212B - Content identification, model training and data processing method, system and equipment - Google Patents

Content identification, model training and data processing method, system and equipment

Info

Publication number
CN111477212B
Authority
CN
China
Prior art keywords
training
loss
model
content
result information
Prior art date
Legal status
Active
Application number
CN201910008803.4A
Other languages
Chinese (zh)
Other versions
CN111477212A (en)
Inventor
李鹏
王炎
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201910008803.4A
Publication of CN111477212A
Application granted
Publication of CN111477212B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training

Abstract

The embodiments of the present application provide a method, system and device for content identification, model training and data processing. The content identification method includes: taking the content to be identified as the input of an application model, executing the application model and outputting first result information; determining a content tag as the recognition result based on the first result information; and executing a corresponding business operation according to the content tag. The application model is obtained by training a training model; during training, the training model calculates at least two loss values after each iteration using at least two loss functions, and completes the updating of its parameters based on the at least two loss values. The technical solution provided by the embodiments of the present application achieves high content-recognition accuracy and, in particular, discriminates better between highly similar content such as near-homophones and homophones.

Description

Content identification, model training and data processing method, system and equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, system, and apparatus for content identification, model training, and data processing.
Background
Content recognition technology enables a machine to recognize and understand content produced by a user. For example, speech recognition technology facilitates both human-to-human communication (HHC) and human-to-machine communication (HMC). In HHC, a voice message sent to another person can be converted into text for convenient reading, and voice input is more convenient than typing; HMC applications include voice search, personal intelligent assistants, voice-controlled games, smart homes, and so on.
Traditional speech recognition depends heavily on manually selected features and has low accuracy. Deep learning, applied to the field of speech recognition, can simulate the way the brain learns and recognizes speech signals and greatly improve recognition accuracy. However, existing speech recognition still frequently misrecognizes near-homophones and homophones, and subsequent correction by a language model often fails to achieve the desired effect.
Disclosure of Invention
Embodiments of the present application provide a content recognition, model training and data processing method, system and device that solve, or at least partially solve, the above problems.
In one embodiment of the present application, a content recognition method is provided. The content identification method comprises the following steps:
taking the content to be identified as the input of an application model, executing the application model and outputting first result information;
determining a content tag as the recognition result based on the first result information;
executing a corresponding business operation according to the content tag;
wherein the application model is obtained by training a training model; during training, the training model calculates at least two loss values after each iteration using at least two loss functions, and completes the updating of its parameters based on the at least two loss values.
In another embodiment of the present application, a model training method is provided. The model training method comprises the following steps:
taking sample content as the input of a training model, executing the training model and outputting second result information;
calculating at least two loss values by using at least two loss functions based on the second result information;
and when it is determined from the at least two loss values that a training convergence condition has been reached, the training model has finished training and can serve as an application model for content identification.
In yet another embodiment of the present application, there is provided a model training method, including:
taking sample content as the input of a training model, executing the training model and outputting second result information;
calculating at least two loss values by using at least two loss functions based on the second result information;
and when it is determined from the at least two loss values that the training convergence condition has not been reached, updating the parameters in the training model according to the at least two loss values, and proceeding to the next iteration.
In yet another embodiment of the present application, a data processing method is provided. The data processing method comprises the following steps:
acquiring data of a service object;
judging, by using an application model, whether the data meets a set requirement;
providing a corresponding service for the service object according to the judgment result;
wherein the application model is obtained by training a training model; during training, the training model calculates at least two loss values after each iteration using at least two loss functions, and completes the updating of its parameters based on the at least two loss values.
In yet another embodiment of the present application, a data processing method is provided. The data processing method comprises the following steps:
providing local data to a service party;
receiving the service that the service party provides based on its judgment result, after the service party judges, by using an application model, whether the data meets a set requirement;
wherein the application model is obtained by training a training model; during training, the training model calculates at least two loss values after each iteration using at least two loss functions, and completes the updating of its parameters based on the at least two loss values.
In yet another embodiment of the present application, a data processing system is provided. The data processing system includes:
a service party, configured to acquire data of a service object, judge whether the data meets a set requirement by using an application model, and provide a corresponding service for the service object according to the judgment result; and
a service object, configured to provide local data to the service party, and to receive the service that the service party provides based on its judgment result after judging, with the application model, whether the data meets the set requirement;
wherein the application model is obtained by training a training model; during training, the training model calculates at least two loss values after each iteration using at least two loss functions, and completes the updating of its parameters based on the at least two loss values.
In yet another embodiment of the present application, an electronic device is provided. The electronic device includes: a memory and a processor; wherein,
the memory is used for storing programs;
the processor, coupled to the memory, is configured to execute the program stored in the memory for:
taking the content to be identified as the input of an application model, executing the application model and outputting first result information;
determining a content tag as the recognition result based on the first result information;
executing a corresponding business operation according to the content tag;
wherein the application model is obtained by training a training model; during training, the training model calculates at least two loss values after each iteration using at least two loss functions, and completes the updating of its parameters based on the at least two loss values.
In yet another embodiment of the present application, an electronic device is provided. The electronic device includes: a memory and a processor; wherein,
the memory is used for storing programs;
the processor, coupled to the memory, is configured to execute the program stored in the memory for:
taking sample content as the input of a training model, executing the training model and outputting second result information;
calculating at least two loss values by using at least two loss functions based on the second result information;
and when it is determined from the at least two loss values that a training convergence condition has been reached, the training model has finished training and can serve as an application model for content identification.
In yet another embodiment of the present application, an electronic device is provided. The electronic device includes: a memory and a processor; wherein,
the memory is used for storing programs;
the processor, coupled to the memory, is configured to execute the program stored in the memory for:
taking sample content as the input of a training model, executing the training model and outputting result information;
calculating at least two loss values by using at least two loss functions based on the result information;
when it is determined from the at least two loss values that the training convergence condition has not been reached, updating the parameters in the training model according to the at least two loss values, and proceeding to the next iteration.
In yet another embodiment of the present application, a server device is provided. The server device includes: a memory and a processor; wherein,
the memory is used for storing programs;
the processor, coupled to the memory, is configured to execute the program stored in the memory for:
acquiring data of a service object;
judging, by using an application model, whether the data meets a set requirement;
providing a corresponding service for the service object according to the judgment result;
wherein the application model is obtained by training a training model; during training, the training model calculates at least two loss values after each iteration using at least two loss functions, and completes the updating of its parameters based on the at least two loss values.
In yet another embodiment of the present application, a service object apparatus is provided. The service object device includes: a memory and a processor; wherein,
the memory is used for storing programs;
the processor, coupled to the memory, is configured to execute the program stored in the memory for:
providing local data to a service party;
receiving the service that the service party provides based on its judgment result, after the service party judges, by using an application model, whether the data meets a set requirement;
wherein the application model is obtained by training a training model; during training, the training model calculates at least two loss values after each iteration using at least two loss functions, and completes the updating of its parameters based on the at least two loss values.
In the technical solution provided by the embodiments of the present application, at least two loss functions are used to calculate the loss values of each iteration, and the calculated loss values drive the parameter updates in the training model at each iteration; once it is determined from the at least two loss values that the training convergence condition has been reached, the training model has finished training and can serve as an application model for content identification. Using this application model for content identification yields high accuracy and, in particular, better discrimination of highly similar content (such as near-homophones and homophones).
In another technical solution provided by the embodiments of the present application, the application model is used to process data to obtain third result information; whether the data meets a set requirement is judged according to the third result information; and a corresponding service is then provided for the service object according to the judgment result. The application model is obtained by training a training model that calculates at least two loss values after each iteration using at least two loss functions and completes its parameter updates based on those loss values. Judging the data with this application model is highly accurate, which improves the quality of the service provided to the service object.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flowchart of a model training method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a model training method according to another embodiment of the present application;
FIG. 3 is a schematic flowchart of a content recognition method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a model training method according to yet another embodiment of the present application;
FIG. 5 is a schematic flowchart of a content recognition method according to another embodiment of the present application;
FIG. 6 is a schematic structural diagram of a data processing system according to an embodiment of the present application;
FIG. 7 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 8 is a schematic flowchart of a data processing method according to another embodiment of the present application;
FIG. 9 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a model training apparatus according to another embodiment of the present application;
FIG. 11 is a schematic structural diagram of a content recognition apparatus according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present application;
FIG. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings.
Some of the flows described in the specification, the claims and the above drawings contain operations that occur in a particular order, but those operations may also be performed out of the order in which they appear herein, or in parallel. The sequence numbers of operations, such as 101 and 102, merely distinguish the operations from one another and do not imply any order of execution. In addition, the flows may include more or fewer operations, which may be performed sequentially or in parallel. The terms "first" and "second" herein distinguish different messages, devices, modules and the like; they do not imply a sequence, nor do they require that the "first" and the "second" be of different types. The following embodiments are only some, not all, of the embodiments of the present application; all other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the present application.
All embodiments provided by the present application involve an application model. The process of obtaining the application model is therefore described first, to support the description of the subsequent embodiments and ease understanding of the scheme.
FIG. 1 is a schematic flowchart of a model training method according to an embodiment of the present application. As shown in FIG. 1, the method includes:
101. Taking the sample content as the input of a training model, and executing the training model to output second result information.
102. Calculating at least two loss values by using at least two loss functions based on the second result information.
103. When it is determined from the at least two loss values that a training convergence condition has been reached, the training model has finished training and can serve as an application model for content identification.
In 101 above, the training model may be a deep learning model, including but not limited to: a convolutional neural network (CNN) learning model, a recurrent neural network (RNN) learning model, a fully connected deep neural network (DNN) learning model, and the like; the embodiments of the present application are not specifically limited in this regard.
In 102 above, before training begins the parameters in the training model are typically simple initialization values. The difference (or distance) between the actual result and the result information output by the training model measures whether the parameters in the training model are appropriate, and this difference can be represented by a loss function. In other words, the loss function represents the gap between the training model under its current parameters and the ideal model, and guides the adjustment of the training model's parameters.
There are many kinds of loss functions, such as cross-entropy loss, edit-distance loss, large-margin softmax loss, center loss, and connectionist temporal classification (CTC) loss; this embodiment does not specifically limit which two (or more) of them are used.
In 103 above, in implementation it may be determined whether each of the at least two loss values is smaller than a preset threshold, and the training convergence condition is deemed reached when all of them are. Alternatively, a composite loss value may be determined from the at least two loss values, and the training convergence condition is deemed reached when the composite loss value is smaller than the preset threshold.
In one possible technical solution, the composite loss value is obtained by calculating a weighted sum of the at least two loss values. This embodiment does not specifically limit the weight of each loss value in the weighted sum; the weights can be chosen according to actual needs.
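For illustration only, the following minimal Python sketch shows such a weighted-sum convergence check; the weights and the threshold are assumed values chosen for the example, not values prescribed by this embodiment.

```python
def composite_loss(loss_values, weights):
    """Weighted sum of the per-iteration loss values (e.g. CTC loss and center loss)."""
    return sum(w * v for w, v in zip(weights, loss_values))

def convergence_reached(loss_values, weights=(1.0, 0.01), threshold=1e-3):
    """Training is deemed converged once the composite loss value
    falls below a preset threshold."""
    return composite_loss(loss_values, weights) < threshold

# Example: a CTC loss of 0.0005 and a center loss of 0.02
print(convergence_reached((0.0005, 0.02)))  # True: 0.0005 + 0.01 * 0.02 < 1e-3
```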
According to the technical solution provided by this embodiment, at least two loss functions are used to calculate the loss values of each iteration, and the calculated loss values drive the parameter updates of the training model at each iteration; once it is determined from the at least two loss values that the training convergence condition has been reached, the training model has finished training and can serve as an application model for content identification. Using this application model for content identification yields high accuracy and, in particular, better discrimination of highly similar content (such as near-homophones and homophones).
It should be noted that the content may be speech, image-text, video, and so on; this embodiment is not specifically limited in this regard. In practice, to use the model to recognize speech, the samples used in training should be speech samples; to recognize image-text, the samples should be image-text samples; and to recognize video, the samples should be video samples.
Further, the method provided in this embodiment may further include the following steps:
104. When it is determined from the at least two loss values that the training convergence condition has not been reached, updating the parameters in the training model according to the at least two loss values, and proceeding to the next iteration.
In one implementation, "updating the parameters in the training model according to the at least two loss values" in the above step may specifically include the following steps:
1041. Determining a composite loss value from the at least two loss values.
In a specific implementation, the composite loss value can be obtained by calculating a weighted sum of the at least two loss values.
1042. Performing layer-by-layer recursive calculation according to the composite loss value to obtain the gradient of each layer contained in the training model.
1043. Updating the parameters in the training model according to the gradients of the layers.
For the implementation of updating the parameters in the training model based on a loss value, reference may be made to the related content in the prior art, which is not repeated here.
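For illustration only, a PyTorch-style sketch of steps 1041 to 1043 is given below; it assumes the two loss values are autograd-tracked tensors and that an optimizer has already been constructed, neither of which is a detail specified by this embodiment.

```python
import torch

def update_parameters(optimizer, loss_values, weights=(1.0, 0.01)):
    """1041: form the composite loss value; 1042: backpropagation performs the
    layer-by-layer recursive computation of each layer's gradient; 1043: the
    optimizer applies the parameter update from those gradients."""
    total = sum(w * v for w, v in zip(weights, loss_values))  # step 1041
    optimizer.zero_grad()
    total.backward()                                          # step 1042
    optimizer.step()                                          # step 1043
    return float(total.detach())
```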
In one implementation, step 102, "calculating at least two loss values by using at least two loss functions based on the second result information", includes:
1021. Acquiring the content tag of the sample content and the center point feature corresponding to the sample content.
1022. Taking the second result information and the content tag as the input of a connectionist temporal classification (CTC) loss function, and calculating a first loss value.
1023. Taking the second result information and the center point feature as the input of a center loss function, and calculating a second loss value.
Further, after the second loss value is calculated in step 1023, the method provided in this embodiment further includes:
105. Updating the center point feature corresponding to the sample content according to the second result information.
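As a concrete illustration of steps 1022 and 1023, the following PyTorch sketch computes a CTC loss and a center loss over random toy tensors; all shapes, and the choice of keeping the centers as a learnable Parameter, are assumptions made for the example rather than details fixed by this embodiment.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center-loss sketch: one center vector per class; the loss is the mean
    squared Euclidean distance from each feature to the center of its class.
    Keeping the centers as a learnable Parameter (updated by the optimizer)
    is one common realization of the center update in step 105."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # features: (N, feat_dim); labels: (N,) integer class ids
        return ((features - self.centers[labels]) ** 2).sum(1).mean()

# Toy shapes, purely illustrative.
T, B, C, D = 50, 4, 30, 256
log_probs = torch.randn(T, B, C).log_softmax(-1)   # second result information
targets = torch.randint(1, C, (B, 10))             # content tags (step 1021)
in_lens = torch.full((B,), T, dtype=torch.long)
tgt_lens = torch.full((B,), 10, dtype=torch.long)

loss1 = nn.CTCLoss(blank=0)(log_probs, targets, in_lens, tgt_lens)  # step 1022
features = torch.randn(B * T, D)                   # per-frame features
labels = torch.randint(0, C, (B * T,))             # their class ids
loss2 = CenterLoss(C, D)(features, labels)         # step 1023
```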
Taking speech recognition as an example: speech recognition involves an acoustic model and a language model, and the algorithmic framework used here to build the acoustic model is a CNN+RNN+CTC structure.
1. CNN (convolutional neural network) covers many types of network structure, such as AlexNet, VGG, Inception, GoogLeNet and ResNet; its main function is to extract a convolution feature map from the speech features.
2. RNN (recurrent neural network) is mainly used for modeling sequence data; here it builds a temporal model from the sequence of convolution feature maps.
3. CTC (connectionist temporal classification) is a way of calculating the loss function; its main advantage is that unaligned data can be aligned automatically, which suits the training of serialized data such as speech recognition and OCR (optical character recognition).
The language model is a model that calculates the probability of a sentence. With a language model, one can determine which word sequence is more likely or, given several words, predict the most likely next word. In a speech recognition task, the language model computes, among the candidate sentences produced by the acoustic model, which has the highest probability; that candidate is the final recognition result. A language model is a knowledge representation of how word sequences are composed and can express the probability that a given word sequence occurs. A common language model in speech recognition is the N-gram, which counts the occurrences of sequences of N words; the N-gram assumes that the probability of a word depends only on the preceding N-1 words.
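For illustration only, a toy bigram (N = 2) language model in Python follows; real systems use large, smoothed N-gram models, and the corpus, the smoothing and the API here are assumptions made for the example.

```python
import math
from collections import Counter

class BigramLM:
    """Toy bigram model: P(w_i | w_{i-1}) estimated from corpus counts,
    with add-one smoothing so unseen bigrams do not yield log(0)."""
    def __init__(self, corpus):
        self.uni = Counter(w for sent in corpus for w in sent)
        self.bi = Counter(p for sent in corpus for p in zip(sent, sent[1:]))

    def sentence_logprob(self, sentence):
        lp, vocab = 0.0, len(self.uni)
        for a, b in zip(sentence, sentence[1:]):
            lp += math.log((self.bi[(a, b)] + 1) / (self.uni[a] + vocab))
        return lp

lm = BigramLM([["turn", "on", "the", "light"], ["turn", "off", "the", "light"]])
# The seen word order scores higher than an unseen reordering.
print(lm.sentence_logprob(["turn", "on", "the", "light"]))
print(lm.sentence_logprob(["light", "the", "on", "turn"]))
```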
When CTC is used to calculate the loss value, there is no strict correspondence between the acoustic features and the actual labels; they are aligned mainly by CTC itself, so recognition of near-homophones and homophones often goes wrong, and even correction by the back-end language model cannot achieve the desired effect. To address this, the present embodiment combines center loss with CTC loss: the distance between the acoustic model's predicted value and the actual value is also taken into account when iterating the network parameters, so that the model parameters tend toward better discrimination of near-homophones and homophones. That is, step 102, "calculating at least two loss values by using at least two loss functions based on the result information", may specifically be:
1021'. Acquiring the speech tag of the sample speech and the phoneme center point feature corresponding to the sample speech.
1022'. Taking the second result information and the speech tag as the input of the CTC loss function, and calculating a first loss value.
1023'. Taking the second result information and the phoneme center point feature as the input of the center loss function, and calculating a second loss value.
In 1022' above, the loss value is calculated using CTC. In 1023', a center loss function is introduced to calculate a second loss value. A phoneme is the smallest unit of speech, determined by analyzing the articulatory actions within a syllable; one action constitutes one phoneme. The phoneme center point feature can be understood simply as the center point feature of the sample class to which the sample speech belongs, and is essentially characterized as a center vector. For the center loss, the feature vector output by the LSTM (long short-term memory network) module serves as the feature value of the corresponding ground-truth label (the correct label data used for supervised learning) in the acoustic module. The center vector is updated as the training steps proceed, and each time a distance is computed between the LSTM module's output vector and the center vector; Euclidean distance or another distance measure may be chosen. That is, in the technical solution provided by this embodiment, after the second loss value is calculated in step 1023', step 105 may specifically be:
105'. Updating the phoneme center point feature corresponding to the sample speech according to the second result information.
The center loss is mainly used to reduce the intra-class distance; while it reduces the intra-class distance it also tends to increase the inter-class distance. The center loss therefore not only separates the feature classes but also clusters similar features together, which gives better generalization to unseen samples and better discrimination of near-homophones and homophones.
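The patent only states that the phoneme center point features are updated from the new output (step 105'); the moving-average rule below, taken from common center-loss practice, is one concrete, assumed choice for that update.

```python
import torch

def update_centers(centers, features, labels, alpha=0.5):
    """Move each class center toward the mean of this batch's features for
    that class, scaled by a rate alpha. centers: (num_classes, feat_dim)
    plain tensor (wrap the call in torch.no_grad() if it is an nn.Parameter)."""
    for c in labels.unique():
        batch_mean = features[labels == c].mean(dim=0)
        centers[c] -= alpha * (centers[c] - batch_mean)
    return centers
```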
Further, the method provided in this embodiment further includes:
106. Determining a composite loss value from the at least two loss values.
In a specific implementation, a weighted sum of the at least two loss values may be calculated to obtain the composite loss value. When calculating the weighted sum, the weight of each loss value may be a set value chosen as needed; this embodiment is not specifically limited in this regard.
107. Determining, according to the composite loss value, whether the training convergence condition has been reached.
In implementation, it can be judged whether the composite loss value is smaller than a preset threshold; if so, the training convergence condition has been reached, and otherwise it has not.
Further, in this embodiment, step 101, "taking the sample content as the input of a training model and executing the training model to obtain second result information", may specifically include:
1011. Extracting feature data from the sample content.
In a specific implementation, the feature data is extracted from the sample content by means of mel-frequency cepstral coefficients (MFCCs), filter banks, or the like.
1012. Extracting a convolution feature map from the feature data.
In one implementation, the feature data is taken as the input of a convolutional neural network model, and the model is executed so that the feature data undergoes multi-layer deep learning computation to obtain the convolution feature map. The convolutional neural network model can be realized with a network such as AlexNet, VGG, Inception, GoogLeNet or ResNet.
1013. Modeling the convolution feature map over the time sequence.
In specific implementations, schemes such as LSTM (long short-term memory network) and GRU (gated recurrent unit network) can be chosen to model the convolution feature map over time.
1014. Extracting a time-series network feature map from the modeling result as the second result information.
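For illustration only, the following PyTorch sketch mirrors steps 1012 to 1014 (step 1011, the MFCC or filter-bank extraction, is assumed to happen upstream); all layer sizes are illustrative and not prescribed by this embodiment.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Minimal CNN+LSTM sketch: a small convolutional stack extracts the
    convolution feature map (1012), a bidirectional LSTM models it over the
    time sequence (1013), and the LSTM outputs form the time-series network
    feature map (1014)."""
    def __init__(self, n_mels=80, hidden=256, num_classes=1000):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.rnn = nn.LSTM(32 * n_mels, hidden,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, feats):                  # feats: (batch, time, n_mels)
        x = self.cnn(feats.unsqueeze(1))       # convolution feature map (1012)
        x = x.permute(0, 2, 1, 3).flatten(2)   # (batch, time, 32 * n_mels)
        seq, _ = self.rnn(x)                   # time-series feature map (1013-1014)
        return self.fc(seq), seq               # per-frame scores + features
```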
FIG. 2 is a schematic flowchart of a model training method according to another embodiment of the present application. As shown in FIG. 2, the method includes:
201. Taking the sample content as the input of a training model, and executing the training model to output second result information.
202. Calculating at least two loss values by using at least two loss functions based on the second result information.
203. When it is determined from the at least two loss values that the training convergence condition has not been reached, updating the parameters in the training model according to the at least two loss values, and proceeding to the next iteration.
For 201 and 202 above, reference may be made to the corresponding content in the foregoing embodiment, which is not repeated here.
In 203 above, "updating the parameters in the training model according to the at least two loss values" may specifically be:
2031. Determining a composite loss value from the at least two loss values.
2032. Performing layer-by-layer recursive calculation according to the composite loss value to obtain the gradient of each layer contained in the training model.
2033. Updating the parameters in the training model according to the gradients of the layers.
Likewise, for the process of updating the parameters in the training model based on the loss values, reference may be made to the corresponding content in the prior art, which is not repeated here.
According to the technical solution provided by this embodiment, at least two loss functions are used to calculate the loss values of each iteration, and the calculated loss values drive the parameter updates of the training model at each iteration; once it is determined from the at least two loss values that the training convergence condition has been reached, the training model has finished training and can serve as an application model for content identification. Using this application model for content identification yields high accuracy and, in particular, better discrimination of highly similar content such as near-homophones and homophones.
FIG. 3 is a schematic flowchart of a content recognition method according to an embodiment of the present application. As shown in FIG. 3, the method includes:
301. Taking the content to be identified as the input of an application model, and executing the application model to output first result information.
302. Determining a content tag as the recognition result based on the first result information.
303. Executing a corresponding business operation according to the content tag.
The application model is obtained by training a training model; during training, the training model calculates at least two loss values after each iteration using at least two loss functions, and completes the updating of its parameters based on the at least two loss values.
The content to be identified may be speech, image-text, video, and so on. It should be noted that when the content to be recognized is speech, the training model must be trained with sample speech to obtain the application model; likewise, to recognize image-text or video, the application model is obtained by training the training model with sample image-text or sample video.
In 301 above, after the rawest audio signal is received, the content to be identified may be obtained through processing that enhances the content, such as removing noise and channel distortion.
In 303 above, the business operation may differ across application scenarios. In the smart speaker field, for example, the smart speaker can play corresponding audio, order food online, shop online, and so on according to the recognized content tag. In a content search scenario, a search operation is performed with the content tag (the recognition result) as the search keyword, and results matching it are returned. In a content-violation judgment scenario, whether the content to be recognized contains prohibited words can be determined based on the content tag serving as the recognition result.
It should be explained here that the application model used in this embodiment is trained by the model training method provided in the embodiments above; for its training process, reference may be made to the corresponding content of those embodiments, which is not repeated here.
According to the technical solution provided by this embodiment, at least two loss functions are used to calculate the loss values of each iteration, and the calculated loss values drive the parameter updates of the training model at each iteration; once it is determined from the at least two loss values that the training convergence condition has been reached, the training model has finished training and can serve as an application model for content identification. Using this application model for content identification yields high accuracy and, in particular, better discrimination of near-homophones and homophones.
Further, when the content to be identified is speech, step 302, "determining a content tag as the recognition result based on the first result information", may specifically include the following steps:
3021. Processing the first result information to obtain a plurality of content tags.
3022. Determining, based on a language model, the content tag to use as the recognition result from among the plurality of content tags.
Specifically, sentence perplexity may be calculated with the language model, and the best content tag selected from the plurality of content tags as the recognition result.
In one possible technical solution, the result information is a time-series network feature map; accordingly, step 3021, "processing the first result information to obtain a plurality of content tags", may specifically include the following steps:
30211. Processing the time-series network feature map to obtain feature vectors.
Specifically, a fully connected layer in the neural network maps the time-series network feature map to feature vectors with the same dimensionality as the dictionary space to be recognized.
30212. Calculating probability vectors based on the feature vectors.
For example, a softmax classifier maps the features to [0, 1] such that the values of each vector sum to 1, corresponding to the probability of each class.
30213. Decoding the probability vectors to obtain the plurality of content tags.
Specifically, the probability vectors are decoded into a plurality of content tags; alternative schemes include greedy decoding and beam-search decoding.
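For illustration only, a sketch of the simpler of the two decoding options, greedy decoding, follows; the input layout is an assumption made for the example.

```python
def ctc_greedy_decode(probs, blank=0):
    """Greedy CTC decoding (step 30213): take the arg-max class per frame,
    collapse consecutive repeats, then drop blanks.
    probs: (time, num_classes) tensor or array of per-frame probabilities."""
    best_path = probs.argmax(-1)
    decoded, prev = [], blank
    for p in best_path.tolist():
        if p != blank and p != prev:
            decoded.append(p)
        prev = p
    return decoded
```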
FIG. 4 shows a flowchart of the training process of the training model. As shown in FIG. 4, during training the input data include sample speech and annotated speech tags, and the phoneme center point coordinates are updated during processing. The specific process includes the following steps:
S11. Extracting feature data from the speech stream data.
Alternative schemes include mel-frequency cepstral coefficients (MFCCs), filter banks, and the like.
S12. Extracting a convolution feature map from the feature data.
This step extracts a convolution feature map from the speech signal features through multi-layer deep learning computation; alternative schemes for this module include AlexNet, VGG, Inception, GoogLeNet, ResNet, a fully connected deep neural network (DNN), and the like.
S13. Modeling the convolution feature map over the time sequence, and extracting a time-series network feature map.
Alternatives include LSTM (long short-term memory network), GRU (gated recurrent unit network), the bidirectional long short-term memory network (BLSTM), and the like.
S14. Taking the time-series network feature map and the speech tag as the input of the CTC loss function, automatically aligning the two sequence signals that are not strictly aligned, and calculating a first loss value.
S15. Taking the time-series network feature map and the phoneme center point feature as the input of the center loss function, and calculating a second loss value.
S16. Updating the phoneme center point feature according to the time-series network feature map.
Note that step S16 must be performed after step S15 has completed.
S17. Calculating a weighted sum of the first loss value and the second loss value as the composite loss value of this training pass.
S18. When it is determined that the composite loss value has not reached the training convergence condition, calculating the gradients of the network, propagating them back layer by layer to update the model parameters, and entering the next iteration after the update.
S19. When it is determined that the composite loss value has reached the training convergence condition, the training model has finished training and can serve as an application model for speech recognition.
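For illustration only, one S11 to S19 training iteration can be sketched as below, reusing the AcousticModel and CenterLoss sketches given earlier. It assumes per-frame phoneme labels are available for the center loss (for example, from a forced alignment); the patent does not spell out how frames are mapped to phoneme centers, and all shapes and the 0.01 loss weight are illustrative.

```python
import torch
import torch.nn as nn

B, T, n_mels, C = 4, 50, 80, 1000
model = AcousticModel(n_mels=n_mels, hidden=256, num_classes=C)
center_loss = CenterLoss(num_classes=C, feat_dim=2 * 256)
ctc = nn.CTCLoss(blank=0)
opt = torch.optim.Adam(list(model.parameters()) + list(center_loss.parameters()))

feats = torch.randn(B, T, n_mels)              # S11: feature data (toy values)
targets = torch.randint(1, C, (B, 10))         # annotated speech tags
frame_labels = torch.randint(0, C, (B * T,))   # assumed per-frame alignment

logits, frame_feats = model(feats)                    # S12-S13
log_probs = logits.log_softmax(-1).transpose(0, 1)    # (T, B, C) for CTC
loss1 = ctc(log_probs, targets,
            torch.full((B,), T, dtype=torch.long),
            torch.full((B,), 10, dtype=torch.long))   # S14: CTC loss
loss2 = center_loss(frame_feats.flatten(0, 1),
                    frame_labels)                     # S15 (S16 via the centers)
loss = loss1 + 0.01 * loss2                           # S17: weighted sum
opt.zero_grad(); loss.backward(); opt.step()          # S18: update parameters
```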
FIG. 5 shows a flowchart of the prediction process. As shown in FIG. 5, during prediction the input data is a speech stream, and the text content of the speech stream is predicted using the application model obtained from the training process. The specific process includes the following steps:
S21. Extracting feature data from the speech to be recognized.
S22. Extracting a convolution feature map from the feature data.
S23. Modeling the convolution feature map over the time sequence, and extracting a time-series network feature map.
S24. Mapping the time-series network feature map to feature vectors with the same dimensionality as the dictionary space to be recognized.
S25. Calculating probability vectors based on the feature vectors.
Specifically, the feature vectors are mapped to [0, 1] such that the values of each vector sum to 1, corresponding to the probability of each class.
S26. Decoding the probability vectors into a plurality of speech tags.
S27. Selecting the best speech tag from the plurality of speech tags as the recognition result, by calculating sentence perplexity with the language model.
It should be noted that the feature extraction in steps S11, S12, S21 and S22 may also be based on a spectrogram.
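For illustration only, the S21 to S27 flow can be glued together as below; extract_features and beam_search are assumed helper functions supplied by the caller (the patent names the techniques but defines no concrete API), and model and lm stand for the acoustic-model and language-model sketches given earlier.

```python
def recognize(audio, extract_features, model, beam_search, lm):
    """End-to-end prediction sketch for the S21-S27 flow."""
    feats = extract_features(audio)                   # S21: e.g. MFCC / filter bank
    logits, _ = model(feats)                          # S22-S24: feature vectors
    probs = logits.softmax(-1)                        # S25: probability vectors
    candidates = beam_search(probs)                   # S26: several speech tags
    return max(candidates, key=lm.sentence_logprob)   # S27: language-model rescoring
```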
According to the technical solution provided by this embodiment, combining center loss and CTC loss takes the distance between the acoustic model's predicted value and the actual value into account when iterating the parameters of the training model, so that the model parameters tend toward better discrimination of near-homophones and homophones.
FIG. 6 is a schematic structural diagram of a data processing system according to an embodiment of the present application. The data processing system includes a service party 401 and a service object 402. Wherein,
the service party 401 is configured to acquire data of the service object 402, judge whether the data meets a set requirement by using an application model, and provide a corresponding service for the service object 402 according to the judgment result;
the service object 402 is configured to provide local data to the service party 401, and to receive the service that the service party 401 provides based on its judgment result after judging, with the application model, whether the data meets the set requirement;
the application model is obtained by training a training model; during training, the training model calculates at least two loss values after each iteration using at least two loss functions, and completes the updating of its parameters based on the at least two loss values.
In practice, the service object may be an e-commerce operator, a video website operator, or the like, and the data of the service object may be content presented or played on its website, such as pictures and videos. The service party may be a merchant or the like that provides a content identification service.
To facilitate understanding of the scheme, the technical solution of the present application is described below taking the service object and the service party in the data processing system in turn as the execution body; that is, the service object and the service party can also implement the methods in the respective embodiments described below.
FIG. 7 is a schematic flowchart of a data processing method according to an embodiment of the present application. The execution body of the method provided in this embodiment may be a server, such as a service-side server or a cloud that provides services for users. Specifically, as shown in FIG. 7, the method includes:
501. Acquiring data of the service object.
502. Judging, by using an application model, whether the data meets a set requirement.
503. Providing a corresponding service for the service object according to the judgment result.
The application model is obtained by training a training model; during training, the training model calculates at least two loss values after each iteration using at least two loss functions, and completes the updating of its parameters based on the at least two loss values.
In 501 above, the service object may be an e-commerce operator, a video website operator, or the like, and its data may be content presented or played on its website, such as pictures and videos. In particular, the data may be captured automatically from the service object's website or uploaded by the service object itself.
In 502 above, judging whether the data meets the set requirement by using the application model may specifically include the following steps:
taking the data as the input of the application model, and executing the application model to obtain third result information;
judging whether the data meets the set requirement according to the third result information.
The set requirement can be defined according to the actual application scenario. For example, in a content-violation judgment scenario, the set requirement may specifically be that no violating content is contained. Assuming the data is speech, a speech tag is determined as the speech recognition result according to the third result information, and it is then judged whether the speech tag contains a violation tag; if it does, the data does not meet the set requirement, and if it does not, the data meets the set requirement.
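For illustration only, a minimal Python sketch of this violation check follows; the tag values in VIOLATION_TAGS are hypothetical examples, not categories defined by this embodiment.

```python
# Hypothetical violation-tag set; the actual categories would be configured
# by the service party according to the set requirement.
VIOLATION_TAGS = {"gambling", "counterfeit", "prohibited_ad"}

def meets_requirement(recognized_tags):
    """The data meets the set requirement exactly when none of the tags
    recognized by the application model is a violation tag."""
    return VIOLATION_TAGS.isdisjoint(recognized_tags)

print(meets_requirement({"music", "greeting"}))   # True
print(meets_requirement({"gambling", "music"}))   # False
```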
In 503 above, the services that may be provided for the service object include: reminders about violating data, provision of the addresses of violating web pages, viewing of violating pages, and a convenient, easy-to-use result display platform on which the service object can quickly handle violating data (for example, by deleting or blocking it); this embodiment is not specifically limited in this regard. Specifically, step 503, providing the corresponding service for the service object according to the judgment result, may include at least one of the following:
when the judgment result is that the data does not meet the set requirement, sending the service object a reminder about the data;
when the judgment result is that the data does not meet the set requirement, providing the service object with a display interface containing the data, so that the service object can operate on the data;
when the judgment result is that the data does not meet the set requirement, providing the service object with a service that blocks the page containing the data.
The page-blocking service performs a blocking operation on a suspected violating page, so that the suspected violating URL (uniform resource locator) is displayed as a blocked page.
In the technical solution provided by this embodiment, the application model is used to process the data to obtain third result information; whether the data meets the set requirement is judged according to the third result information; and a corresponding service is then provided for the service object according to the judgment result. The application model is obtained by training a training model that calculates at least two loss values after each iteration using at least two loss functions and completes its parameter updates based on those loss values. Judging the data with this application model is highly accurate, which improves the quality of the service provided to the service object.
FIG. 8 is a schematic flowchart of a data processing method according to another embodiment of the present application. The execution body of the method provided in this embodiment may be a service object, which may specifically be a user-side server that requests a service. Specifically, as shown in FIG. 8, the method includes:
601. Providing local data to a service party.
602. Receiving the service that the service party provides based on its judgment result, after the service party judges, by using an application model, whether the data meets a set requirement.
The application model is obtained by training a training model; during training, the training model calculates at least two loss values after each iteration using at least two loss functions, and completes the updating of its parameters based on the at least two loss values.
In 602 above, "receiving the service that the service party provides based on its judgment result after judging, with an application model, whether the data meets the set requirement" may include at least one of the following:
receiving and displaying a reminder about the data, sent by the service party when it has processed the data with the application model and found that the data does not meet the set requirement;
displaying a display interface containing the data, provided by the service party when it has processed the data with the application model and found that the data does not meet the set requirement, so that the service object can operate on the data;
receiving, from the service party, a service that blocks the page containing the data, provided when the service party has processed the data with the application model and found that the data does not meet the set requirement.
In the technical solution provided by this embodiment, the application model is used to process the data to obtain third result information; whether the data meets the set requirement is judged according to the third result information; and a corresponding service is then provided for the service object according to the judgment result. The application model is obtained by training a training model that calculates at least two loss values after each iteration using at least two loss functions and completes its parameter updates based on those loss values. Judging the data with this application model is highly accurate, which improves the quality of the service provided to the service object.
FIG. 9 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application. As shown in FIG. 9, the model training apparatus includes an execution module 11 and a processing module 12. The execution module 11 is configured to take sample content as the input of a training model, execute the training model, and output second result information. The processing module 12 is configured to calculate at least two loss values by using at least two loss functions based on the second result information; when it is determined from the at least two loss values that the training convergence condition has been reached, the training model has finished training and can serve as an application model for content identification.
According to the technical solution provided by this embodiment, at least two loss functions are used to calculate the loss values of each iteration, and the calculated loss values drive the parameter updates of the training model at each iteration; once it is determined from the at least two loss values that the training convergence condition has been reached, the training model has finished training and can serve as an application model for content identification. Using this application model for content identification yields high accuracy and, in particular, better discrimination of highly similar content such as near-homophones and homophones.
Further, the processing module 12 is further configured to:
update the parameters in the training model according to the at least two loss values when it is determined from them that the training convergence condition has not been reached, and proceed to the next iteration.
Further, the processing module 12 is further configured to:
acquire the content tag of the sample content and the center point feature corresponding to the sample content;
take the result information and the content tag as the input of the connectionist temporal classification (CTC) loss function, and calculate a first loss value;
take the result information and the center point feature as the input of the center loss function, and calculate a second loss value.
Further, the processing module 12 is further configured to update the center point feature corresponding to the sample content based on the result information.
Further, the processing module 12 is further configured to:
determine a composite loss value from the at least two loss values;
determine whether the training convergence condition has been reached according to the composite loss value.
Further, the execution module 11 is further configured to:
extract feature data from the sample content;
extract a convolution feature map from the feature data;
model the convolution feature map over the time sequence;
extract a time-series network feature map from the modeling result as the second result information.
It should be noted that the model training apparatus provided in the foregoing embodiment can implement the technical solutions described in the foregoing method embodiments; for the specific implementation principles of the modules or units, reference may be made to the corresponding content of the method embodiments, which is not repeated here.
FIG. 10 is a schematic structural diagram of a model training apparatus according to another embodiment of the present application. As shown in FIG. 10, the model training apparatus includes an execution module 21 and a processing module 22. The execution module 21 is configured to take sample content as the input of a training model, execute the training model, and output second result information. The processing module 22 is configured to calculate at least two loss values by using at least two loss functions based on the second result information; when it is determined from the at least two loss values that the training convergence condition has not been reached, to update the parameters in the training model according to the at least two loss values and proceed to the next iteration.
According to the technical solution provided by this embodiment, at least two loss functions are used to calculate the loss values of each iteration, and the calculated loss values drive the parameter updates of the training model at each iteration; once it is determined from the at least two loss values that the training convergence condition has been reached, the training model has finished training and can serve as an application model for content identification. Using this application model for content identification yields high accuracy and, in particular, better discrimination of highly similar content such as near-homophones and homophones.
Further, the processing module 22 is further configured to:
determining a comprehensive loss value according to the at least two loss values;
recursively calculating, layer by layer, according to the comprehensive loss value, the gradient of each layer contained in the training model;
and updating the parameters in the training model according to the gradient of each layer; in an autograd framework this corresponds to the sketch below.
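In a framework with automatic differentiation, the layer-by-layer recursive gradient computation is backpropagation and the per-layer update is the optimizer step; the optimizer choice and learning rate below are illustrative assumptions.

    import torch

    model = TemporalNet()  # the sketch model defined above
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    def train_step(comp_loss):
        optimizer.zero_grad()
        comp_loss.backward()   # layer-by-layer recursive gradient computation
        optimizer.step()       # update each layer's parameters from its gradient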
It should be noted that the model training device provided in the foregoing embodiment can implement the technical solutions described in the foregoing method embodiments, and for the specific implementation principles of the above modules or units, reference may be made to the corresponding content in the foregoing method embodiments, which is not repeated here.
Fig. 11 is a schematic structural diagram of a content recognition apparatus according to an embodiment of the present application. As shown in Fig. 11, the content recognition apparatus includes an execution module 31 and a determining module 32. The execution module 31 is configured to take the content to be identified as the input of an application model, execute the application model, and output first result information; the determining module 32 is configured to determine a content tag as the recognition result based on the first result information. The application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values.
In the technical solution provided by this embodiment, at least two loss functions are used to calculate loss values in each iteration, and the calculated loss values jointly drive the parameter updates of the training model in each iteration; once it is determined from the at least two loss values that the training convergence condition has been reached, training is complete and the trained model can serve as an application model for content recognition. Content recognition with this application model is highly accurate; in particular, it discriminates well between highly similar content such as near-homophones and homophones.
Further, when the content to be identified is voice, the determining module 32 is further configured to:
processing the first result information to obtain a plurality of content tags;
and determining, from the plurality of content tags and based on a language model, the content tag to be used as the recognition result; one simple selection scheme is sketched below.
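A common way to realize this step is to re-score candidate label sequences with a language model and keep the best one; the sketch below assumes each candidate already carries an acoustic score, and lm_score is a hypothetical scoring function rather than an API from the patent.

    def pick_with_language_model(candidates, lm_score, lm_weight=0.3):
        # candidates: list of (label_sequence, acoustic_score) pairs.
        # The recognition result is the sequence with the best combined score.
        best_seq, _ = max(candidates,
                          key=lambda c: c[1] + lm_weight * lm_score(c[0]))
        return best_seq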
Further, the first result information is a temporal network feature map; accordingly, the determining module 32 is further configured to:
processing the temporal network feature map to obtain feature vectors;
calculating probability vectors based on the feature vectors;
and decoding the probability vectors to obtain the plurality of content tags; a greedy decode of this kind is sketched below.
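The following is a hedged sketch of a best-path (greedy) decode consistent with a CTC-trained model: a softmax yields the probability vector at each time step, and repeated and blank symbols are collapsed; the blank index is an assumption.

    import torch

    def greedy_decode(feature_map, blank=0):
        # feature_map: (time, classes) temporal network feature map.
        probs = feature_map.softmax(dim=-1)     # probability vector per step
        best = probs.argmax(dim=-1).tolist()    # most likely label per step
        tags, prev = [], blank
        for b in best:
            if b != prev and b != blank:
                tags.append(b)                  # collapse repeats, drop blanks
            prev = b
        return tags                             # the content tags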
It should be noted that the content recognition device provided in the foregoing embodiment can implement the technical solutions described in the foregoing method embodiments, and for the specific implementation principles of the above modules or units, reference may be made to the corresponding content in the foregoing method embodiments, which is not repeated here.
Fig. 12 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in Fig. 12, the data processing apparatus includes an acquisition module 41, a determination module 42, and a service module 43. The acquisition module 41 is configured to acquire data of a service object; the determination module 42 is configured to determine, using an application model, whether the data meets a set requirement; the service module 43 is configured to provide a corresponding service for the service object according to the determination result. The application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values.
In the technical solution provided by this embodiment, the application model processes the data to obtain third result information; whether the data meets the set requirement is determined from the third result information; and a corresponding service is then provided to the service object according to the determination result. The application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values. Evaluating the data with such an application model is highly accurate, which improves the quality of the service provided to the service object.
Further, the determination module 42 is further configured to:
when the determination result is that the data does not meet the set requirement, sending a reminder for the data to the service object; and/or
when the determination result is that the data does not meet the set requirement, providing the service object with a display interface containing the data, so that the service object can operate on the data; and/or
when the determination result is that the data does not meet the set requirement, providing the service object with a service that blocks the page containing the data.
Fig. 13 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present application. As shown in Fig. 13, the data processing apparatus includes a data providing module 51 and a processing module 52. The data providing module 51 is configured to provide local data to a service party; the processing module 52 is configured to receive the service that the service party provides based on the result of determining, using an application model, whether the data meets a set requirement. The application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values.
In the technical solution provided by this embodiment, the application model processes the data to obtain third result information; whether the data meets the set requirement is determined from the third result information; and a corresponding service is then provided to the service object according to the determination result. The application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values. Evaluating the data with such an application model is highly accurate, which improves the quality of the service provided to the service object.
Further, the processing module 52 is further configured to:
receiving and displaying a reminder for the data, sent by the service party when the service party processes the data with the application model and finds that the data does not meet the set requirement; and/or
displaying a display interface containing the data, provided by the service party when the service party processes the data with the application model and finds that the data does not meet the set requirement, so that the service object can operate on the data; and/or
receiving, from the service party, a service that blocks the page containing the data, provided when the service party processes the data with the application model and finds that the data does not meet the set requirement.
Fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device comprises a memory 61 and a processor 62. The memory 61 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on an electronic device. The memory 61 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The processor 62 is coupled to the memory 61 for executing the program stored in the memory 61 for:
taking the sample content as the input of a training model, executing the training model and outputting second result information;
calculating at least two loss values using at least two loss functions based on the second result information;
and when it is determined from the at least two loss values that the training convergence condition has been reached, training of the training model is complete and the model can be used as an application model for content recognition.
In the technical solution provided by this embodiment, at least two loss functions are used to calculate loss values in each iteration, and the calculated loss values jointly drive the parameter updates of the training model in each iteration; once it is determined from the at least two loss values that the training convergence condition has been reached, training is complete and the trained model can serve as an application model for content recognition. Content recognition with this application model is highly accurate; in particular, it discriminates well between highly similar content such as near-homophones and homophones.
In addition to the above functions, the processor 62 may implement other functions when executing the program in the memory 61; see the description of the foregoing embodiments for details.
Further, as shown in fig. 14, the electronic device further includes: a display 64, a communication component 63, a power supply component 65, an audio component 66, and other components. Only some of the components are schematically shown in fig. 14, which does not mean that the electronic device only comprises the components shown in fig. 14.
Accordingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a computer, can implement the steps or functions of the model training method provided in each of the above embodiments.
An embodiment of the present application further provides another electronic device. The structure of the electronic device provided in this embodiment is similar to that of the above-described electronic device embodiment and is shown in Fig. 14. The electronic device includes a memory and a processor. The memory may be configured to store various other data to support operations on the electronic device; examples of such data include instructions for any application or method operating on the electronic device. The memory may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The processor, coupled to the memory, is configured to execute the program stored in the memory for:
taking the sample content as the input of a training model, executing the training model and outputting second result information;
calculating at least two loss values using at least two loss functions based on the second result information;
when it is determined from the at least two loss values that the training convergence condition has not been reached, updating the parameters in the training model according to the at least two loss values and entering the next iteration; an end-to-end sketch of this loop is given below.
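Pulling the earlier sketches together, the following is one hedged view of the full iterative procedure; the data loader, label count, and the pooling used to obtain a per-sample feature are assumptions made only to give the loop a runnable shape, and the helper functions are the ones sketched in the embodiments above.

    import torch

    num_labels = 10                              # illustrative label count
    centers = torch.zeros(num_labels, 5000)      # feature dim matches model output

    for feats_in, targets, in_lens, tgt_lens, labels in loader:  # loader assumed
        out = model(feats_in)                                # second result information
        log_probs = out.log_softmax(-1).transpose(0, 1)      # (time, batch, classes)
        pooled = out.mean(dim=1)                             # per-sample feature (assumed)
        first, second = two_loss_values(log_probs, targets, in_lens, tgt_lens,
                                        pooled, centers, labels)
        comp = comprehensive_loss(first, second)
        if reached_convergence(comp):
            break                    # training complete: use as the application model
        train_step(comp)             # update parameters, enter the next iteration
        centers = update_centers(centers, pooled.detach(), labels)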
In the technical solution provided by this embodiment, at least two loss functions are used to calculate loss values in each iteration, and the calculated loss values jointly drive the parameter updates of the training model in each iteration; once it is determined from the at least two loss values that the training convergence condition has been reached, training is complete and the trained model can serve as an application model for content recognition. Content recognition with this application model is highly accurate; in particular, it discriminates well between highly similar content such as near-homophones and homophones.
When executing the program in the memory, the processor may also implement functions other than those described above; see the description of the foregoing embodiments for details.
Accordingly, the embodiments of the present application also provide a computer-readable storage medium storing a computer program, where the computer program when executed by a computer can implement the steps or functions of the model training method provided in the foregoing embodiments.
An embodiment of the present application further provides another electronic device. The structure of the electronic device provided in this embodiment is similar to that of the above-described electronic device embodiment and is shown in Fig. 14. The electronic device includes a memory and a processor. The memory may be configured to store various other data to support operations on the electronic device; examples of such data include instructions for any application or method operating on the electronic device. The memory may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The processor, coupled to the memory, is configured to execute the program stored in the memory for:
taking the content to be identified as input of an application model, executing the application model and outputting first result information;
determining a content tag as a recognition result based on the first result information;
and executing a corresponding business operation according to the content tag;
wherein the application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values.
In the technical solution provided by this embodiment, at least two loss functions are used to calculate loss values in each iteration, and the calculated loss values jointly drive the parameter updates of the training model in each iteration; once it is determined from the at least two loss values that the training convergence condition has been reached, training is complete and the trained model can serve as an application model for content recognition. Content recognition with this application model is highly accurate; in particular, it discriminates well between highly similar content such as near-homophones and homophones.
When executing the program in the memory, the processor may also implement functions other than those described above; see the description of the foregoing embodiments for details.
Accordingly, the embodiments of the present application also provide a computer-readable storage medium storing a computer program capable of implementing the steps or functions of the content identification method provided in each of the above embodiments when the computer program is executed by a computer.
An embodiment of the present application further provides a server device. The structure of the server device provided in this embodiment is similar to that of the above-described electronic device embodiment and is shown in Fig. 14. The server device includes a memory and a processor. The memory may be configured to store various other data to support operations on the server device; examples of such data include instructions for any application or method operating on the server device. The memory may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The processor, coupled to the memory, is configured to execute the program stored in the memory for:
acquiring data of a service object;
determining, using an application model, whether the data meets a set requirement;
and providing a corresponding service for the service object according to the determination result;
wherein the application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values.
When executing the program in the memory, the processor may also implement functions other than those described above; see the description of the foregoing embodiments for details.
Accordingly, the embodiments of the present application also provide a computer-readable storage medium storing a computer program capable of implementing the steps or functions of the data processing method provided in the above embodiments when the computer program is executed by a computer.
An embodiment of the present application further provides a service object device. The structure of the service object device provided in this embodiment is similar to that of the above-described electronic device embodiment and is shown in Fig. 14. The service object device includes a memory and a processor. The memory may be configured to store various other data to support operations on the service object device; examples of such data include instructions for any application or method operating on the service object device. The memory may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The processor, coupled to the memory, is configured to execute the program stored in the memory for:
providing local data to a service party;
receiving the service that the service party provides based on the result of determining, using an application model, whether the data meets a set requirement;
wherein the application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values.
When executing the program in the memory, the processor may also implement functions other than those described above; see the description of the foregoing embodiments for details.
Accordingly, the embodiments of the present application also provide a computer-readable storage medium storing a computer program capable of implementing the steps or functions of the data processing method provided in the above embodiments when the computer program is executed by a computer.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the essence of the foregoing technical solution, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in each embodiment or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (21)

1. A content identification method, comprising:
taking the content to be identified as the input of an application model, executing the application model, and outputting first result information, wherein the content to be identified is voice, image-text, or video;
determining a content tag as a recognition result based on the first result information;
and executing a corresponding business operation according to the content tag;
wherein the application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values; the at least two loss values comprise a first loss value and a second loss value;
the training process of the training model comprises: taking sample content as the input of the training model, executing the training model, and outputting second result information, wherein the sample content is voice, image-text, or video used as a training sample;
taking the second result information and the content label of the sample content as the input of a joint temporal classification loss function, and calculating the first loss value;
and taking the second result information and the center point feature of the sample content as the input of a center loss function, and calculating the second loss value.
2. The method according to claim 1, wherein the content to be identified is voice, and wherein determining a content tag as the recognition result based on the first result information comprises:
processing the first result information to obtain a plurality of content tags;
and determining, based on a language model, the content tag to be used as the recognition result from the plurality of content tags.
3. The method of claim 2, wherein the first result information is a temporal network feature map; and
processing the first result information to obtain a plurality of content tags comprises:
processing the temporal network feature map to obtain feature vectors;
calculating probability vectors based on the feature vectors;
and decoding the probability vectors to obtain the plurality of content tags.
4. A method according to any one of claims 1 to 3, further comprising:
taking the sample content as the input of a training model, executing the training model and outputting second result information;
calculating at least two loss values using at least two loss functions based on the second result information;
when it is determined from the at least two loss values that the training convergence condition has been reached, the training model completes training and can be used as an application model for content identification;
and when it is determined from the at least two loss values that the training convergence condition has not been reached, updating the parameters in the training model according to the at least two loss values and entering the next iteration.
5. A method of model training, comprising:
taking sample content as the input of a training model, executing the training model, and outputting second result information, wherein the sample content is voice, image-text, or video used as a training sample;
calculating at least two loss values using at least two loss functions based on the second result information;
when it is determined from the at least two loss values that the training convergence condition has been reached, the training model completes training and can be used as an application model for recognizing the voice, image-text, or video to be identified;
wherein the at least two loss values comprise a first loss value and a second loss value, and calculating the first loss value and the second loss value using two loss functions based on the second result information comprises:
taking the second result information and the content label of the sample content as the input of a joint temporal classification loss function, and calculating the first loss value;
and taking the second result information and the center point feature of the sample content as the input of a center loss function, and calculating the second loss value.
6. The model training method of claim 5, further comprising:
when it is determined from the at least two loss values that the training convergence condition has not been reached, updating the parameters in the training model according to the at least two loss values and entering the next iteration.
7. The method of claim 5, wherein after calculating the second loss value, the method further comprises:
updating the center point feature corresponding to the sample content according to the second result information.
8. The method according to any one of claims 5 to 7, further comprising:
determining a comprehensive loss value according to the at least two loss values;
and determining whether the training convergence condition is reached according to the comprehensive loss value.
9. The method according to any one of claims 5 to 7, wherein taking sample content as the input of a training model and executing the training model to obtain second result information comprises:
extracting feature data from the sample content;
extracting a convolutional feature map from the feature data;
modeling the convolutional feature map over time;
and extracting a temporal network feature map from the modeling result as the second result information.
10. A method of model training, comprising:
taking sample content as the input of a training model, executing the training model, and outputting second result information, wherein the sample content is voice, image-text, or video used as a training sample;
calculating at least two loss values using at least two loss functions based on the second result information;
when it is determined from the at least two loss values that the training convergence condition has not been reached, updating the parameters in the training model according to the at least two loss values; and entering the next iteration;
wherein the at least two loss values comprise a first loss value and a second loss value, and calculating the first loss value and the second loss value using two loss functions based on the second result information comprises:
taking the second result information and the content label of the sample content as the input of a joint temporal classification loss function, and calculating the first loss value;
and taking the second result information and the center point feature of the sample content as the input of a center loss function, and calculating the second loss value.
11. The method of claim 10, wherein updating parameters in the training model based on the at least two loss values comprises:
determining a comprehensive loss value according to the at least two loss values;
recursively calculating, layer by layer, according to the comprehensive loss value, the gradient of each layer contained in the training model;
and updating parameters in the training model according to the gradient of each layer.
12. A method of data processing, comprising:
acquiring data of a service object, wherein the data is voice, image-text, or video;
determining, using an application model, whether the data meets a set requirement;
and providing a corresponding service for the service object according to the determination result;
wherein the application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values; the at least two loss values comprise a first loss value and a second loss value;
the training process of the training model comprises: taking sample content as the input of the training model, executing the training model, and outputting second result information, wherein the sample content is voice, image-text, or video used as a training sample;
taking the second result information and the content label of the sample content as the input of a joint temporal classification loss function, and calculating the first loss value;
and taking the second result information and the center point feature of the sample content as the input of a center loss function, and calculating the second loss value.
13. The method of claim 12, wherein providing the corresponding service for the service object according to the determination result comprises at least one of the following:
when the determination result is that the data does not meet the set requirement, sending a reminder for the data to the service object;
when the determination result is that the data does not meet the set requirement, providing the service object with a display interface containing the data, so that the service object can operate on the data;
and when the determination result is that the data does not meet the set requirement, providing the service object with a service that blocks the page containing the data.
14. A method of data processing, comprising:
providing local data to a service party, wherein the local data is voice, image-text, or video;
receiving the service that the service party provides based on the result of determining, using an application model, whether the data meets a set requirement;
wherein the application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values; the at least two loss values comprise a first loss value and a second loss value;
the training process of the training model comprises: taking sample content as the input of the training model, executing the training model, and outputting second result information, wherein the sample content is voice, image-text, or video used as a training sample;
taking the second result information and the content label of the sample content as the input of a joint temporal classification loss function, and calculating the first loss value;
and taking the second result information and the center point feature of the sample content as the input of a center loss function, and calculating the second loss value.
15. The method of claim 14, wherein receiving the service that the service party provides based on the result of determining, using an application model, whether the data meets the set requirement comprises at least one of the following:
receiving and displaying a reminder for the data, sent by the service party when the service party processes the data with the application model and finds that the data does not meet the set requirement;
displaying a display interface containing the data, provided by the service party when the service party processes the data with the application model and finds that the data does not meet the set requirement, so that the service object can operate on the data;
and receiving, from the service party, a service that blocks the page containing the data, provided when the service party processes the data with the application model and finds that the data does not meet the set requirement.
16. A data processing system, comprising:
a server, configured to acquire data of a service object; determine, using an application model, whether the data meets a set requirement; and provide a corresponding service for the service object according to the determination result, wherein the data is voice, image-text, or video;
and a service object, configured to provide local data to the server and to receive the service that the server provides based on the result of determining, using the application model, whether the data meets the set requirement;
wherein the application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values; the at least two loss values comprise a first loss value and a second loss value;
the training process of the training model comprises: taking sample content as the input of the training model, executing the training model, and outputting second result information, wherein the sample content is voice, image-text, or video used as a training sample;
taking the second result information and the content label of the sample content as the input of a joint temporal classification loss function, and calculating the first loss value;
and taking the second result information and the center point feature of the sample content as the input of a center loss function, and calculating the second loss value.
17. An electronic device, comprising: a memory and a processor; wherein,
the memory is used for storing programs;
the processor, coupled to the memory, is configured to execute the program stored in the memory for:
taking the content to be identified as the input of an application model, executing the application model, and outputting first result information, wherein the content to be identified is voice, image-text, or video;
determining a content tag as a recognition result based on the first result information;
and executing a corresponding business operation according to the content tag;
wherein the application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values; the at least two loss values comprise a first loss value and a second loss value;
the training process of the training model comprises: taking sample content as the input of the training model, executing the training model, and outputting second result information, wherein the sample content is voice, image-text, or video used as a training sample;
taking the second result information and the content label of the sample content as the input of a joint temporal classification loss function, and calculating the first loss value;
and taking the second result information and the center point feature of the sample content as the input of a center loss function, and calculating the second loss value.
18. An electronic device, comprising: a memory and a processor; wherein,
the memory is used for storing programs;
the processor, coupled to the memory, is configured to execute the program stored in the memory for:
taking sample content as the input of a training model, executing the training model, and outputting second result information, wherein the sample content is voice, image-text, or video used as a training sample;
calculating at least two loss values using at least two loss functions based on the second result information;
when it is determined from the at least two loss values that the training convergence condition has been reached, the training model completes training and can be used as an application model for recognizing the voice, image-text, or video to be identified;
wherein the at least two loss values comprise a first loss value and a second loss value, and calculating the first loss value and the second loss value using two loss functions based on the second result information comprises:
taking the second result information and the content label of the sample content as the input of a joint temporal classification loss function, and calculating the first loss value;
and taking the second result information and the center point feature of the sample content as the input of a center loss function, and calculating the second loss value.
19. An electronic device, comprising: a memory and a processor; wherein,
the memory is used for storing programs;
The processor, coupled to the memory, is configured to execute the program stored in the memory for:
taking sample content as the input of a training model, executing the training model, and outputting result information, wherein the sample content is voice, image-text, or video used as a training sample;
calculating at least two loss values using at least two loss functions based on the result information;
when it is determined from the at least two loss values that the training convergence condition has not been reached, updating the parameters in the training model according to the at least two loss values; and entering the next iteration;
wherein the at least two loss values comprise a first loss value and a second loss value, and calculating the first loss value and the second loss value using two loss functions based on the result information comprises:
taking the result information and the content label of the sample content as the input of a joint temporal classification loss function, and calculating the first loss value;
and taking the result information and the center point feature of the sample content as the input of a center loss function, and calculating the second loss value.
20. A server device, comprising: a memory and a processor; wherein,
The memory is used for storing programs;
the processor, coupled to the memory, is configured to execute the program stored in the memory for:
acquiring data of a service object, wherein the data is voice, image-text, or video;
determining, using an application model, whether the data meets a set requirement;
and providing a corresponding service for the service object according to the determination result;
wherein the application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values; the at least two loss values comprise a first loss value and a second loss value;
the training process of the training model comprises: taking sample content as the input of the training model, executing the training model, and outputting second result information, wherein the sample content is voice, image-text, or video used as a training sample;
taking the second result information and the content label of the sample content as the input of a joint temporal classification loss function, and calculating the first loss value;
and taking the second result information and the center point feature of the sample content as the input of a center loss function, and calculating the second loss value.
21. A service object apparatus, characterized by comprising: a memory and a processor; wherein,
the memory is used for storing programs;
the processor, coupled to the memory, is configured to execute the program stored in the memory for:
providing local data to a service party, wherein the local data is voice, image-text, or video;
receiving the service that the service party provides based on the result of determining, using an application model, whether the data meets a set requirement;
wherein the application model is obtained by training a training model; during training, the training model uses at least two loss functions to calculate at least two loss values after each iteration, and parameter updates are completed based on the at least two loss values; the at least two loss values comprise a first loss value and a second loss value;
the training process of the training model comprises: taking sample content as the input of the training model, executing the training model, and outputting second result information, wherein the sample content is voice, image-text, or video used as a training sample;
taking the second result information and the content label of the sample content as the input of a joint temporal classification loss function, and calculating the first loss value;
and taking the second result information and the center point feature of the sample content as the input of a center loss function, and calculating the second loss value.
CN201910008803.4A 2019-01-04 2019-01-04 Content identification, model training and data processing method, system and equipment Active CN111477212B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910008803.4A CN111477212B (en) 2019-01-04 2019-01-04 Content identification, model training and data processing method, system and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910008803.4A CN111477212B (en) 2019-01-04 2019-01-04 Content identification, model training and data processing method, system and equipment

Publications (2)

Publication Number Publication Date
CN111477212A CN111477212A (en) 2020-07-31
CN111477212B true CN111477212B (en) 2023-10-24

Family

ID=71743165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910008803.4A Active CN111477212B (en) 2019-01-04 2019-01-04 Content identification, model training and data processing method, system and equipment

Country Status (1)

Country Link
CN (1) CN111477212B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112214750B (en) * 2020-10-16 2023-04-25 上海携旅信息技术有限公司 Character verification code recognition method, system, electronic equipment and storage medium
CN112633385A (en) * 2020-12-25 2021-04-09 华为技术有限公司 Model training method, data generation method and device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105513589A (en) * 2015-12-18 2016-04-20 百度在线网络技术(北京)有限公司 Speech recognition method and speech recognition device
CN105529027A (en) * 2015-12-14 2016-04-27 百度在线网络技术(北京)有限公司 Voice identification method and apparatus
CN105739831A (en) * 2016-02-01 2016-07-06 珠海市魅族科技有限公司 Display method and device of message contents
CN107146607A (en) * 2017-04-10 2017-09-08 北京猎户星空科技有限公司 Modification method, the apparatus and system of smart machine interactive information
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN108229435A (en) * 2018-02-01 2018-06-29 北方工业大学 Method for pedestrian recognition
CN108256555A (en) * 2017-12-21 2018-07-06 北京达佳互联信息技术有限公司 Picture material recognition methods, device and terminal
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN108682417A (en) * 2018-05-14 2018-10-19 中国科学院自动化研究所 Small data Speech acoustics modeling method in speech recognition
WO2018219016A1 (en) * 2017-06-02 2018-12-06 腾讯科技(深圳)有限公司 Facial detection training method, apparatus and electronic device
CN109002461A (en) * 2018-06-04 2018-12-14 平安科技(深圳)有限公司 Handwriting model training method, text recognition method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Stavros Petridis et al., "Audio-Visual Speech Recognition with a Hybrid CTC/Attention Architecture," SLT 2018. *

Also Published As

Publication number Publication date
CN111477212A (en) 2020-07-31

Similar Documents

Publication Publication Date Title
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
US10657969B2 (en) Identity verification method and apparatus based on voiceprint
CN110797016B (en) Voice recognition method and device, electronic equipment and storage medium
CN111429946A (en) Voice emotion recognition method, device, medium and electronic equipment
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN111429887B (en) Speech keyword recognition method, device and equipment based on end-to-end
US11657225B2 (en) Generating summary content tuned to a target characteristic using a word generation model
CN111402861B (en) Voice recognition method, device, equipment and storage medium
CN112257437B (en) Speech recognition error correction method, device, electronic equipment and storage medium
CN110706692A (en) Training method and system of child voice recognition model
CN111401259B (en) Model training method, system, computer readable medium and electronic device
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN111477212B (en) Content identification, model training and data processing method, system and equipment
US11893813B2 (en) Electronic device and control method therefor
CN109859747A (en) Voice interactive method, equipment and storage medium
CN111414959B (en) Image recognition method, device, computer readable medium and electronic equipment
CN113051384A (en) User portrait extraction method based on conversation and related device
CN116524906A (en) Training data generation method and system for voice recognition and electronic equipment
CN113362829B (en) Speaker verification method, electronic device and storage medium
CN112863518B (en) Method and device for recognizing voice data subject
CN114758330A (en) Text recognition method and device, electronic equipment and storage medium
CN114283786A (en) Speech recognition method, device and computer readable storage medium
CN114121018A (en) Voice document classification method, system, device and storage medium
CN116074574A (en) Video processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant