CN111340117A - CTC model training method, data processing method, device and storage medium

Publication number: CN111340117A
Application number: CN202010124513.9A
Authority: CN (China)
Prior art keywords: probability, likelihood, sequence, output, determining
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 巢林林, 陈景东, 褚崴
Applicant and current assignee: Alipay Hangzhou Information Technology Co Ltd

Classifications

    • G06F18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/048: Neural networks; activation functions
    • G06N3/084: Learning methods; backpropagation, e.g. using gradient descent


Abstract

The present specification relates to a training method and apparatus for a Connectionist Temporal Classification (CTC) model. In the method, feature vectors are input into a first fully-connected layer and a second fully-connected layer, respectively; a joint representation vector of the feature vector and the tag sequence is determined and input into a third fully-connected layer; the CTC loss layer then determines the likelihood distribution of the tag sequence and the prior distribution of the blank characters from the normalized outputs of these layers, thereby determining the gradient values of the training and completing one round of training. The specification also provides a data processing method and apparatus based on the CTC model, an electronic device, and a computer-readable storage medium.

Description

CTC model training method, data processing method, device and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a training method for a CTC model, a data processing method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
Connectionist Temporal Classification (CTC) is a time-series classification algorithm for tasks in which there is no strict alignment between data units and label units; it is currently widely used in Optical Character Recognition (OCR) and speech recognition. The main function of the CTC model is to construct a loss function over sequences and, during back propagation, pass the gradient determined by that loss function back to the preceding layers to complete the training of the model. There is a need for an efficient and highly accurate method of training a CTC model.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a method for training a CTC model. The method may comprise the following steps:
acquiring a feature sequence and an embedded representation vector of a tag sequence corresponding to the feature sequence, wherein the feature sequence comprises at least one feature vector, and the at least one feature vector comprises: feature vectors corresponding to respective time instants, obtained after a text in picture format or a voice signal is processed by a feature extraction network;
sequentially inputting the at least one feature vector into a first fully-connected layer, and normalizing the output of the first fully-connected layer to obtain the prior distribution of the blank character at each time instant;
sequentially inputting the at least one feature vector into a second fully-connected layer, and normalizing the output of the second fully-connected layer to obtain the probability of each element in a dictionary set corresponding to the text or voice signal at each time instant;
determining a joint representation vector of the at least one feature vector and the tag sequence, inputting the joint representation vector into a third fully-connected layer, and normalizing the output of the third fully-connected layer to obtain the posterior approximation probability of the blank character at each time instant;
determining the likelihood distribution of the tag sequence according to the posterior approximation probability of the blank character at each time instant and the probability of each element in the dictionary set at each time instant; and
determining a gradient value of the current training round according to the likelihood distribution of the tag sequence and the prior distribution of the blank characters, and adjusting the weights of the first, second and third fully-connected layers according to the gradient value.
Wherein determining the likelihood distribution of the tag sequence may include: determining the likelihood probability of each element in the dictionary set, and of the blank character, at each time instant; and determining the likelihood distribution of the tag sequence according to these likelihood probabilities. The likelihood probability of the blank character at each time instant is its posterior approximation probability at that time instant; the likelihood probability of each element in the dictionary set at each time instant is the product of the posterior approximation probability of a non-blank character at that time instant and the probability of the element at the corresponding time instant.
Wherein determining the likelihood distribution of the tag sequence may further include: determining the likelihood probabilities of a plurality of output paths of the CTC model according to the likelihood probabilities of each element in the dictionary set and of the blank character at each time instant; summing the likelihood probabilities of the output paths corresponding to the same output sequence, and taking the resulting sum as the likelihood probability of that output sequence; and taking the distribution of the likelihood probabilities of the output sequences of the CTC model as the likelihood distribution of the tag sequence.
Wherein determining the gradient value of the training according to the likelihood distribution of the tag sequence and the prior distribution of the blank characters may include using the following expression as the learning objective (loss function) of the CTC model training:

$$\log p(Y \mid X) = \log \sum_{O_b} p(O_b \mid X)\, p(Y \mid O_b, X)$$

where $p(O_b \mid X)$ is the prior distribution of the blank characters; $p(Y \mid O_b, X)$ is the likelihood distribution of the tag sequence; and $O_b$ denotes a blank-character output sequence. The gradient value of the training is then determined according to this loss function.
Wherein determining the gradient value of the training according to the likelihood distribution of the tag sequence and the prior distribution of the blank characters may alternatively include using the following expression as the loss function of the CTC model training:

$$\mathcal{L} = \mathbb{E}_{q_\psi(O_b \mid X, Y)}\big[\log p(Y \mid O_b, X)\big] - \mathrm{KL}\big(q_\psi(O_b \mid X, Y) \,\|\, p(O_b \mid X)\big)$$

where $p(Y \mid O_b, X)$ is the likelihood distribution of the tag sequence; $q_\psi(O_b \mid X, Y)$ is the posterior approximation distribution of the blank characters; $p(O_b \mid X)$ is the prior distribution of the blank characters; $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback-Leibler divergence; and $\mathbb{E}(\cdot)$ is the expectation operation. The gradient value of the training is then determined according to this loss function.
The embedded representation vector of the tag sequence may be determined as follows: mapping each element in the dictionary set to an initial vector; and averaging the initial vectors corresponding to the elements contained in the tag sequence to obtain the embedded representation vector of the tag sequence.
Wherein determining the joint representation vector of the at least one feature vector and the tag sequence comprises: sequentially computing the Hadamard product of the at least one feature vector and the embedded representation vector of the tag sequence.
Embodiments of the present specification also provide a CTC model-based data processing method, which may include:
obtaining a feature sequence, wherein the feature sequence comprises at least one feature vector, and the at least one feature vector comprises: feature vectors corresponding to different time instants, obtained after the text in picture format or voice signal to be recognized is processed by a feature extraction network;
sequentially inputting the at least one feature vector into a first fully-connected layer, and normalizing the output of the first fully-connected layer to obtain the probability of the blank character at each time instant;
sequentially inputting the at least one feature vector into a second fully-connected layer, and normalizing the output of the second fully-connected layer to obtain the probability of each element in a dictionary set corresponding to the text or voice signal at each time instant;
determining the likelihood probability of each element in the dictionary set, and of the blank character, at each time instant according to the probability of the blank character at each time instant and the probability of each element in the dictionary set at each time instant; and
determining an output sequence corresponding to the feature sequence according to the likelihood probabilities of each element in the dictionary set and of the blank character at each time instant.
Wherein determining the likelihood probability of each element in the dictionary set and of the blank character at each time instant comprises: determining the likelihood probability of the blank character at each time instant as the probability of the blank character at that time instant; and determining the likelihood probability of each element in the dictionary set at each time instant as the product of the probability of a non-blank character at that time instant and the probability of the element at the corresponding time instant.
Wherein determining the output sequence corresponding to the feature sequence according to the likelihood probabilities of each element in the dictionary set and of the blank character at each time instant comprises:
determining the likelihood probabilities of a plurality of output paths of the CTC model according to the likelihood probabilities of each element in the dictionary set and of the blank character at each time instant;
respectively determining at least one output sequence corresponding to the plurality of output paths;
for each output sequence, adding up the likelihood probabilities of all output paths corresponding to that output sequence and taking the sum as the likelihood probability of the output sequence;
and taking the output sequence with the maximum likelihood probability as the output sequence corresponding to the feature sequence.
Wherein respectively determining at least one output sequence corresponding to the plurality of output paths may include: for each of the plurality of output paths, merging the consecutively repeated elements between the blank characters on the output path and removing the blank characters from the output path, so as to obtain the output sequence corresponding to that output path.
Embodiments of the present specification also disclose a CTC model training device, which may include:
a feature vector obtaining module, configured to obtain a feature sequence and an embedded representation vector of a tag sequence corresponding to the feature sequence, where the feature sequence includes at least one feature vector, and the at least one feature vector includes: feature vectors corresponding to each time instant, obtained after a text in picture format or a voice signal is processed by a feature extraction network;
a blank character prior distribution determining module, configured to sequentially input the at least one feature vector into a first fully-connected layer and normalize the output of the first fully-connected layer to obtain the prior distribution of the blank character at each time instant;
a prior probability determining module, configured to sequentially input the at least one feature vector into a second fully-connected layer and normalize the output of the second fully-connected layer to obtain the probability of each element in a dictionary set corresponding to the text or voice signal at each time instant;
a posterior approximation probability determining module, configured to determine a joint representation vector of the at least one feature vector and the tag sequence, input the joint representation vector into a third fully-connected layer, and normalize the output of the third fully-connected layer to obtain the posterior approximation probability of the blank character at each time instant;
a likelihood distribution determining module, configured to determine the likelihood distribution of the tag sequence according to the posterior approximation probability of the blank character at each time instant and the probability of each element in the dictionary set at each time instant; and
a loss determining module, configured to determine a gradient value of the training according to the likelihood distribution of the tag sequence and the prior distribution of the blank characters, and adjust the weights of the first, second and third fully-connected layers according to the gradient value.
The likelihood distribution determining module determines the likelihood probabilities of each element in the dictionary set, and of the blank character, at each time instant, and determines the likelihood distribution of the tag sequence according to these likelihood probabilities. The likelihood probability of the blank character at each time instant is its posterior approximation probability at that time instant; the likelihood probability of each element in the dictionary set at each time instant is the product of the posterior approximation probability of a non-blank character at that time instant and the probability of the element at the corresponding time instant.
The likelihood distribution determining module respectively determines the likelihood probabilities of a plurality of output paths of the CTC model according to the likelihood probabilities of each element in the dictionary set and of the blank character at each time instant; sums the likelihood probabilities of the output paths corresponding to the same output sequence and takes the resulting sum as the likelihood probability of that output sequence; and takes the distribution of the likelihood probabilities of the output sequences of the CTC model as the likelihood distribution of the tag sequence.
Wherein the loss determining module may take the following expression as the learning objective (loss function) of the CTC model training:

$$\log p(Y \mid X) = \log \sum_{O_b} p(O_b \mid X)\, p(Y \mid O_b, X)$$

where $p(O_b \mid X)$ is the prior distribution of the blank characters; $p(Y \mid O_b, X)$ is the likelihood distribution of the tag sequence; and $O_b$ denotes a blank-character output sequence; and determine the gradient value of the training according to this loss function.
Wherein the loss determining module may alternatively take the following expression as the loss function of the CTC model training:

$$\mathcal{L} = \mathbb{E}_{q_\psi(O_b \mid X, Y)}\big[\log p(Y \mid O_b, X)\big] - \mathrm{KL}\big(q_\psi(O_b \mid X, Y) \,\|\, p(O_b \mid X)\big)$$

where $p(Y \mid O_b, X)$ is the likelihood distribution of the tag sequence; $q_\psi(O_b \mid X, Y)$ is the posterior approximation distribution of the blank characters; $p(O_b \mid X)$ is the prior distribution of the blank characters; $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback-Leibler divergence; and $\mathbb{E}(\cdot)$ is the expectation operation; and determine the gradient value of the training according to this loss function.
Embodiments of the present specification also disclose a CTC model-based data processing apparatus, which may include:
a feature vector obtaining module, configured to obtain a feature sequence, where the feature sequence includes at least one feature vector, and the at least one feature vector includes: feature vectors corresponding to each time instant, obtained after the text in picture format or voice signal to be recognized is processed by a feature extraction network;
a blank character prior distribution determining module, configured to sequentially input the at least one feature vector into a first fully-connected layer and normalize the output of the first fully-connected layer to obtain the probability of the blank character at each time instant;
a prior probability determining module, configured to sequentially input the at least one feature vector into a second fully-connected layer and normalize the output of the second fully-connected layer to obtain the probability of each element in a dictionary set corresponding to the text or voice signal at each time instant;
a likelihood probability determining module, configured to determine the likelihood probabilities of each element in the dictionary set, and of the blank character, at each time instant according to the probability of the blank character at each time instant and the probability of each element in the dictionary set at each time instant; and
an output module, configured to determine an output sequence corresponding to the feature sequence according to the likelihood probabilities of each element in the dictionary set and of the blank character at each time instant.
The likelihood probability determining module determines the likelihood probability of the blank character at each time instant as the probability of the blank character at that time instant, and determines the likelihood probability of each element in the dictionary set at each time instant as the product of the probability of a non-blank character at that time instant and the probability of the element at the corresponding time instant.
The output module determines the likelihood probabilities of a plurality of output paths of the CTC model according to the likelihood probabilities of each element in the dictionary set and of the blank character at each time instant; respectively determines at least one output sequence corresponding to the plurality of output paths; for each output sequence, adds up the likelihood probabilities of all output paths corresponding to that output sequence and takes the sum as its likelihood probability; and takes the output sequence with the maximum likelihood probability as the output sequence corresponding to the feature sequence.
An embodiment of the present specification also discloses an electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above method when executing the program.
Embodiments of the present specification also disclose a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the above-described method.
Therefore, with the above CTC model training method and CTC model, the flat, uniform treatment of blank characters and label characters in the CTC algorithm is restructured into a two-stage distribution that first decides whether a blank character or a label character is output and then decides which specific label character is output, which resolves the under-learning of non-blank characters caused by the imbalanced distribution of blank and non-blank characters in the CTC algorithm. Meanwhile, minimizing the divergence between the posterior and prior probabilities of the blank characters acts as a regularizer, so that a more accurate output distribution of the blank characters is obtained at inference time, when the tag sequence is unknown. With this variational method, the output of the CTC model is more fully aligned with the input data, and the discriminative power of the output confidence score is greatly improved, thereby improving the precision and recall of the CTC model and, in turn, the precision and recall of OCR or speech recognition.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present specification, and that other drawings can be obtained by those skilled in the art without creative effort.
FIG. 1 is a schematic flow chart of a CTC model training method according to an embodiment of the present disclosure;
fig. 2 shows an example of a plurality of feature sequences obtained by processing an image text through a feature extraction network according to an embodiment of the present specification;
FIG. 3 is a schematic diagram of the internal structure of a CTC model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the internal structure of a CTC model training device according to an embodiment of the present disclosure;
FIG. 5 is a schematic flow chart of a data processing method based on a CTC model according to an embodiment of the present disclosure;
fig. 6 is a flowchart illustrating a method for determining an output sequence corresponding to a feature sequence according to likelihood probabilities of each element and a blank character in a dictionary set at each time according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of the internal structure of a CTC model for data processing according to an embodiment of the present disclosure; and
fig. 8 is a schematic diagram of an internal structure of a data processing apparatus according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present specification more apparent, the present specification is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that technical terms or scientific terms used in the embodiments of the present specification should have a general meaning as understood by those having ordinary skill in the art to which the present disclosure belongs, unless otherwise defined. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
As previously mentioned, CTC algorithms are currently widely used in Optical Character Recognition (OCR) and speech recognition. The CTC algorithm achieves automatic alignment between the label characters and the data sequence by adding blank characters to the label sequence, thereby completing the recognition of the whole sequence. However, in the output paths of the CTC algorithm, the characters at most time instants are blank characters, and non-blank characters are usually output in the form of spikes. Observation and analysis show that the non-blank label characters are not sufficiently aligned with the data sequence and are therefore under-learned, which reduces the precision or recall of the CTC model and thereby affects the precision or recall of OCR or speech recognition.
Therefore, the embodiments of the present specification provide a training method for a CTC model that, by means of variational inference, restructures the uniform treatment of blank characters and label characters in the CTC algorithm into a first-layer network that first decides whether a blank character or a label character is output and a second-layer network that then decides which specific label character is output. This resolves the under-learning of non-blank characters caused by the imbalanced distribution of blank and non-blank characters in the CTC algorithm, improves the precision and recall of the CTC model, and thereby improves the precision and recall of OCR or speech recognition.
FIG. 1 shows a flowchart of a CTC model training method described in the examples herein. As shown in fig. 1, the method may include:
in step 102, an embedded representation vector of a signature sequence and a tag sequence corresponding to the signature sequence is obtained.
In an embodiment of the present specification, the feature sequence may include at least one feature vector. The at least one feature vector may be a feature vector corresponding to different times obtained after a text in a picture format is processed by a feature extraction network, or a feature vector corresponding to different times obtained after a voice signal is processed by the feature extraction network, or the like.
Fig. 2 shows an example of a feature sequence obtained after a text in picture format is processed by a feature extraction network according to an embodiment of the present specification. As shown in fig. 2, a picture containing the characters "Board!" is subjected to feature extraction by the feature extraction network to obtain a feature sequence containing N feature vectors. In order to recognize the characters in the picture, N generally needs to be greater than the number of characters contained in the picture; for example, in fig. 2, N generally needs to be greater than 5. The feature extraction network may be implemented by various feature extraction models, such as a Recurrent Neural Network (RNN) or a Convolutional Neural Network (CNN).
In an embodiment of the present specification, the tag sequence corresponding to the feature sequence consists of the correct characters to be recognized from the feature sequence, that is, the correct characters that OCR or speech recognition should produce for the image text or voice signal. For example, in the example shown in fig. 2, the tag sequence corresponding to the image text should be "Board!".
To train a CTC model, a tag sequence typically needs to be added to the training process of the CTC model in the form of an embedded representation vector.
Specifically, in the embodiment of the present specification, first, each element in the dictionary set corresponding to the above-mentioned text or speech signal may be mapped to an initial vector in advance, and the initial vectors may be used as the embedded representation vectors of all the elements in the dictionary set. In the embodiments of the present specification, the initial vector may be randomly set.
Here, the dictionary set generally refers to the set of all characters that may appear in the OCR or speech recognition result, i.e., the element library of OCR or speech recognition. For OCR, the dictionary set is a character library, such as a Chinese character library containing a set of Chinese characters or an alphabetic character library containing the English alphabet. For speech recognition, the dictionary set refers to a pronunciation dictionary, which is essentially a set of phonemes: for Chinese, it is the correspondence between pinyin and Chinese characters; for English, the correspondence between phonetic symbols and words. Its purpose is to find the corresponding Chinese characters (or words) from the phonemes recognized by the acoustic model, thus serving as a bridge connecting the acoustic model and the language model.
After the embedded expression vectors of all elements in the dictionary set are obtained, the initial vectors corresponding to all elements included in the tag sequence may be averaged for the tag sequence corresponding to the feature sequence to obtain an average vector, and the average vector may be used as the embedded expression vector of the tag sequence. The embedded representation vector of the tag sequence is applied to a training process of the CTC model as a target value for determining a deviation between a predicted value and a target value of the CTC model output.
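For illustration, the following is a minimal Python/NumPy sketch of the embedding averaging described above; the dictionary size, embedding dimension, and all names are illustrative assumptions rather than values from this specification:

```python
import numpy as np

K, M = 5000, 128                       # assumed dictionary size and embedding dimension
rng = np.random.default_rng(0)
embed_table = rng.normal(size=(K, M))  # one randomly set initial vector per dictionary element

def tag_embedding(label_ids):
    """Average the initial vectors of the elements contained in the tag sequence."""
    return embed_table[np.asarray(label_ids)].mean(axis=0)   # shape (M,)

e_y = tag_embedding([12, 7, 400])      # e.g. a three-character tag sequence
```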
In step 104, the at least one feature vector is sequentially input into a first fully-connected layer.
In implementations of the present description, the first fully-connected layer may include at least 2 output neurons.
In the above first fully-connected layer, the output of each neuron may refer to the following expression (1):

$$z_j = W_p x_t + b_p \tag{1}$$

where $x_t$ is the feature vector corresponding to time $t$, and $W_p$ and $b_p$ are the weights of the first fully-connected layer. When the number of neurons in the first fully-connected layer equals 2, $j$ takes the value 1 or 2.
In step 106, the output of the first fully-connected layer is normalized to obtain the probability of the blank character corresponding to each moment, and the blank character prior distribution is determined accordingly.
In an embodiment of the present specification, the normalization may be implemented by using a first softmax logistic regression layer or a first Sigmoid layer, so as to obtain probabilities of the blank characters corresponding to each time, and determine a priori distribution of the blank characters according to the probabilities.
In an embodiment of the present specification, the output of the above first softmax logistic regression layer may refer to the following expression (2):

$$\big[\,p_t^{\,b},\ p_t^{\,\bar b}\,\big] = \mathrm{softmax}\big(z_1, z_2\big) \tag{2}$$

where $p_t^{\,b}$ is the probability of the blank character corresponding to time $t$, and $p_t^{\,\bar b}$ is the probability of a non-blank character corresponding to time $t$.
The first Sigmoid layer may normalize an output of the first fully-connected layer using a Sigmoid function.
Furthermore, in the embodiments of the present specification, if the time instants of the sequence are assumed to be mutually independent and the distribution at each time instant is defined as a Bernoulli distribution, the prior distribution $p(O_b \mid X)$ of the blank characters can be determined from the probabilities $p_t^{\,b}$ and $p_t^{\,\bar b}$ of blank and non-blank characters at each time instant.
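As a concrete illustration, the following Python sketch implements the blank-character prior branch under the stated independence/Bernoulli assumption (expressions (1) and (2)); the dimensions and weights are random stand-ins, not values from this specification:

```python
import numpy as np

D, T = 64, 6                             # assumed feature dimension and sequence length
rng = np.random.default_rng(1)
W_p, b_p = rng.normal(size=(2, D)) * 0.1, np.zeros(2)   # first fully-connected layer, 2 neurons

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(T, D))              # one feature vector x_t per time instant
p = softmax(X @ W_p.T + b_p)             # expression (2): rows are [p_t^b, p_t^nonblank]
p_blank = p[:, 0]

def prior_O_b(o_b):
    """Prior of a blank output sequence O_b under the independence/Bernoulli assumption."""
    o_b = np.asarray(o_b, dtype=bool)    # True means blank at that time instant
    return float(np.prod(np.where(o_b, p_blank, 1.0 - p_blank)))

print(prior_O_b([True, False, True, True, False, True]))
```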
At step 114, the at least one feature vector is sequentially input into a second fully-connected layer.
In implementations of the present description, the second fully-connected layer may include at least K output neurons, where K is a number of elements included in a dictionary set of CTCs.
In the above second fully-connected layer, the output of each neuron may refer to the following expression (3):

$$z_i = W_c x_t + b_c \tag{3}$$

where $x_t$ is the feature vector corresponding to time $t$, and $W_c$ and $b_c$ are the weights of the second fully-connected layer. When the number of neurons in the second fully-connected layer equals the number of elements contained in the dictionary set, $i$ ranges from 1 to the number of elements $K$.
In step 116, the output of the second fully-connected layer is normalized to obtain the probability of each element in the dictionary set corresponding to each time.
In an embodiment of the present specification, the normalization may be implemented by a second softmax logistic regression layer, so as to obtain a probability that each element in the dictionary set corresponds to each time.
In an embodiment of the present specification, the output of the above second softmax logistic regression layer may refer to the following expression (4):

$$p(a_t \mid x_t) = \mathrm{softmax}(z_i) \tag{4}$$

where $p(a_t \mid x_t)$ is the probability that element $a_t$ in the dictionary set corresponds to time $t$.
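The dictionary branch (expressions (3) and (4)) can be sketched in the same way; again, the dictionary size and weights below are illustrative stand-ins:

```python
import numpy as np

D, T, K = 64, 6, 10                      # assumed feature dim, sequence length, dictionary size
rng = np.random.default_rng(2)
W_c, b_c = rng.normal(size=(K, D)) * 0.1, np.zeros(K)   # second fully-connected layer, K neurons

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(T, D))
p_dict = softmax(X @ W_c.T + b_c)        # expression (4): p(a_t = l_k | x_t), shape (T, K)
assert np.allclose(p_dict.sum(axis=1), 1.0)   # one distribution over the dictionary per time
```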
In step 122, a joint representation vector of the at least one feature vector and the tag sequence is determined.
In an embodiment of the present specification, a dot product may be performed on the at least one feature vector and the embedded representation vector of the tag sequence to obtain the joint representation vector of the at least one feature vector and the tag sequence. Specifically, the dot product may be a Hadamard product; alternatively, the (dimension-converted) feature vector may be concatenated with the embedded representation vector of the tag sequence, or the like.
In the embodiment of the present specification, in order to make the at least one feature vector consistent with the dimension of the embedded representation vector of the tag sequence, the at least one feature vector may be sequentially input into a fully-connected layer (referred to as a fourth fully-connected layer) before the step 122 is performed.
In an embodiment of the present specification, the fourth fully-connected layer may include at least M neurons, where M is a dimension of the embedded representation vector of the tag sequence. The purpose of using the fourth fully-connected layer is to convert the dimensions of the feature vector to conform to the dimensions of the embedded representation vector of the tag sequence.
At step 124, the joint representation vector of the at least one feature vector and the tag sequence is input into a third fully-connected layer.
In an implementation of the present specification, the third fully-connected layer may include at least 2 output neurons.
In the above third fully-connected layer, the output of each neuron may refer to the following expression (5):

$$z_m = W_a \big( f(x_t) \circ e_Y \big) + b_a \tag{5}$$

where $x_t$ is the feature vector corresponding to time $t$; $e_Y$ is the embedded representation vector of the tag sequence; $f(\cdot)$ represents the dimension-conversion operation of the fourth fully-connected layer; $\circ$ represents the Hadamard product operation; and $W_a$ and $b_a$ are the weights of the third fully-connected layer. When the number of neurons in the third fully-connected layer equals 2, $m$ takes the value 1 or 2.
In step 126, the output of the third fully-connected layer is normalized to obtain the posterior approximation probability of each time corresponding to the blank character.
In an embodiment of the present specification, the normalization may be implemented by a third softmax logistic regression layer, whose output may refer to the following expression (6):

$$\big[\,q_t^{\,b},\ q_t^{\,\bar b}\,\big] = \mathrm{softmax}(z_m) \tag{6}$$

where $q_t^{\,b}$ is the posterior approximation probability of the blank character at time $t$, and $q_t^{\,\bar b}$ is the posterior approximation probability of a non-blank character at time $t$.
Further, in the embodiments of the present specification, the posterior approximation distribution $q_\psi(O_b \mid X, Y)$ of the blank characters may also be obtained from the posterior approximation probabilities of the blank and non-blank characters at each time instant.
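The posterior-approximation branch can likewise be sketched as follows, combining the dimension-converted feature vector with the tag-sequence embedding by a Hadamard product (expressions (5) and (6)); all weights are random stand-ins:

```python
import numpy as np

D, M, T = 64, 128, 6                     # assumed feature dim, embedding dim, sequence length
rng = np.random.default_rng(3)
W_f, b_f = rng.normal(size=(M, D)) * 0.1, np.zeros(M)   # fourth FC layer: dimension conversion f(.)
W_a, b_a = rng.normal(size=(2, M)) * 0.1, np.zeros(2)   # third FC layer: 2 output neurons

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(T, D))              # feature vectors x_t
e_y = rng.normal(size=(M,))              # embedded representation vector of the tag sequence
joint = (X @ W_f.T + b_f) * e_y          # expression (5): Hadamard product f(x_t) . e_Y
q = softmax(joint @ W_a.T + b_a)         # expression (6): rows are [q_t^b, q_t^nonblank]
q_blank = q[:, 0]                        # posterior approximation blank probability per time
```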
In step 130, the likelihood distribution of the label sequence is determined according to the posterior approximation probability of the blank character corresponding to each time and the probability of each element in the dictionary set corresponding to each time.
In an embodiment of the present specification, the likelihood probabilities of each element in the dictionary set and each time corresponding to a blank character may be determined according to the posterior approximation probability of each time corresponding to the blank character and the probability of each time corresponding to each element in the dictionary set; then, the likelihood distribution of the label sequence is determined according to the likelihood probability.
The following detailed description will respectively describe a specific method for determining the likelihood probability of each element and blank character in the dictionary set corresponding to each time, and a specific method for determining the likelihood distribution of the label sequence according to the likelihood probability.
In the implementation of the present specification, the likelihood probability of each element and blank character in the dictionary set corresponding to each time can be determined by the following method:
on one hand, for a blank character, the likelihood probability of the blank character at each time can be determined as the posterior approximation probability of the blank character at each time, that is, the likelihood probability of the blank character at each time can be determined by using the following expression (7):
Figure BDA0002394007070000121
wherein the content of the first and second substances,
Figure BDA0002394007070000122
for elements in a dictionary set or for blank characters lkLikelihood probability at time t; lbRepresenting a blank character;
Figure BDA0002394007070000123
is the posterior approximation probability of the space character at the time t.
On the other hand, for non-blank characters, the likelihood probability of each element in the dictionary set at each time instant may be determined as the product of the posterior approximation probability of a non-blank character at that time instant and the probability of the element at the corresponding time instant, i.e., using the following expression (8):

$$\tilde p_t(l_k) = q_t^{\,\bar b} \cdot p(a_t = l_k \mid x_t), \qquad l_k \neq l_b \tag{8}$$

where $\tilde p_t(l_k)$ denotes the likelihood probability of dictionary element $l_k$ at time $t$; $q_t^{\,\bar b}$ is the posterior approximation probability of a non-blank character at time $t$; and $p(a_t = l_k \mid x_t)$ is the probability of dictionary element $l_k$ at time $t$.
In embodiments of the present specification, a sequence of characters that the CTC model may predict as its output at each time instant is referred to as an output path of the CTC model. An output path contains a plurality of characters corresponding to different time instants, each of which may be a blank character or an element of the dictionary set. Because an output path contains blank characters and repeated characters, the characters contained in it need to be processed, with repeated characters merged and blank characters deleted, to obtain an output sequence of the CTC model. That is, the number of characters contained in an output path is usually greater than the length of the output sequence, and multiple output paths correspond to the same output sequence. In general, the mapping from output paths to output sequences is denoted F(·). F(·) achieves the following two goals:
1) merging consecutively repeated characters into one; and
2) deleting the blank character "-" from the output path.
For example, for the output sequence Y = "ab" and an input feature sequence of length 6, F(·) maps output paths such as the following to the output sequence "ab" (with "-" denoting the blank character): "aab---", "-a-b--", "a--b--", "--ab--", "a-bb--", "-aa-b-", and so on.
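A minimal Python implementation of the mapping F(·), following the two goals stated above, might read:

```python
def F(path: str, blank: str = "-") -> str:
    """Map an output path to an output sequence: merge consecutive repeats, drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev:          # 1) merge consecutively repeated characters into one
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != blank)   # 2) delete the blank character "-"

assert F("aab---") == F("-a-b--") == F("a--bb-") == "ab"
```

Note that the merging is applied before blank deletion, so a blank character between two identical labels keeps them as two separate output characters.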
in this case, the likelihood distribution of the above-described tag sequence can be determined by the following procedure:
firstly, the likelihood probabilities of a plurality of output paths of the CTC model are respectively determined according to the likelihood probabilities of each element and blank character in the dictionary set at each moment.
It will be appreciated that in embodiments of the present specification, the likelihood probability of an output path is the product of the likelihood probabilities of the individual characters (which may be elements of a dictionary set or blank characters) on the output path at their corresponding time instants. The likelihood probability of an output path pi may be specifically determined by using the following expression (9):
$$p(\pi \mid O_b, X) = \prod_{t=1}^{T} \tilde p_t(\pi_t) \tag{9}$$

where $p(\pi \mid O_b, X)$ is the likelihood probability of the output path $\pi$; $T$ is the number of characters contained in the output path; and $O_b$ denotes a blank-character output sequence.
Then, the likelihood probabilities of a plurality of output paths corresponding to the same output sequence are summed, and the resulting sum is taken as the likelihood probability of the output sequence.
As described above, in the embodiments of the present specification, a plurality of output paths may correspond to the same output sequence, and in this step, the output sequence corresponding to each output path may be determined by the above-mentioned F ().
In an embodiment of the present specification, the likelihood probabilities of the multiple output sequences of the CTC model may be determined using the following expression (10):

$$p(Y \mid O_b, X) = \sum_{\pi \in F^{-1}(Y)} p(\pi \mid O_b, X) \tag{10}$$
Finally, the distribution of the likelihood probabilities of the multiple output sequences of the CTC model is taken as the likelihood distribution of the tag sequence. It can be seen that the likelihood distribution of the tag sequence covers all output sequences, which are jointly determined by the outputs at the respective time instants.
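For tiny sequence lengths, expressions (9) and (10) can be checked by brute-force enumeration of all output paths, as in the following sketch; the per-time likelihoods are random stand-ins, and practical CTC implementations replace the enumeration with dynamic programming:

```python
from collections import defaultdict
from itertools import product
import numpy as np

T = 4
symbols = ["-", "a", "b"]                # blank plus a tiny two-element dictionary (assumed)
rng = np.random.default_rng(4)
lik = rng.dirichlet(np.ones(len(symbols)), size=T)   # stand-in for the likelihoods p~_t(l_k)

def F(path, blank="-"):
    out, prev = [], None
    for ch in path:
        if ch != prev:
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != blank)

seq_lik = defaultdict(float)
for idx in product(range(len(symbols)), repeat=T):
    path = "".join(symbols[k] for k in idx)
    p_path = float(np.prod([lik[t, k] for t, k in enumerate(idx)]))   # expression (9)
    seq_lik[F(path)] += p_path                                        # expression (10)

print(seq_lik["ab"])                     # likelihood probability of the output sequence "ab"
```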
In step 132, a gradient value of the current training round is determined according to the likelihood distribution of the tag sequence and the prior distribution of the blank characters, and the weights of the first, second and third fully-connected layers are adjusted according to the gradient value, thereby completing one round of training.
In some embodiments of the present description, the loss function of the CTC model training, i.e., the learning objective of the CTC model training, may be determined using the following expression (11):

$$\log p(Y \mid X) = \log \sum_{O_b} p(O_b \mid X)\, p(Y \mid O_b, X) \tag{11}$$

This objective represents the posterior distribution of the tag sequence, i.e., the marginal distribution over the blank-character output sequences.
In other embodiments of the present description, binarized sampling of the blank-character outputs may be performed according to the posterior probabilities $q_t^{\,b}$ and $q_t^{\,\bar b}$ of blank and non-blank characters at each time instant; once the blank-character outputs are determined, the output paths corresponding to the tag sequence and the image are fixed. Therefore, the above loss function of the CTC model training can be further simplified to the following expression (12), which is a lower bound of the objective shown in expression (11). That is, in embodiments of the present description, the learning objective of the CTC model training may be determined using the following expression (12):

$$\mathcal{L} = \mathbb{E}_{q_\psi(O_b \mid X, Y)}\big[\log p(Y \mid O_b, X)\big] - \mathrm{KL}\big(q_\psi(O_b \mid X, Y) \,\|\, p(O_b \mid X)\big) \tag{12}$$

where $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback-Leibler divergence and $\mathbb{E}(\cdot)$ is the expectation operation.
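Under the per-time-instant Bernoulli assumption, the KL term in expression (12) decomposes over time instants, and the expectation can be approximated by the binarized sampling just described. The following sketch illustrates this; `log_lik_given_blanks` is a hypothetical placeholder for the evaluation of log p(Y | O_b, X) via expressions (7) to (10):

```python
import numpy as np

T = 6
rng = np.random.default_rng(5)
q_b = rng.uniform(0.05, 0.95, size=T)    # q_t^b: posterior approximation blank probabilities
p_b = rng.uniform(0.05, 0.95, size=T)    # p_t^b: prior blank probabilities

def bernoulli_kl(q, p):
    """KL(q || p) for independent per-time Bernoulli distributions, summed over time."""
    return float(np.sum(q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))))

def log_lik_given_blanks(o_b):
    """Hypothetical stand-in for log p(Y | O_b, X); a real implementation would
    evaluate expressions (7)-(10) with the blank positions fixed by o_b."""
    return -1.0

o_b = rng.random(T) < q_b                # binarized sampling of the blank-character outputs
loss = -log_lik_given_blanks(o_b) + bernoulli_kl(q_b, p_b)   # negative of expression (12)
```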
In the embodiments of the present description, after determining the loss function of the CTC model training, the loss value of the current training may be further determined, and the gradient value of the current training may be determined by performing a derivative operation.
And then, transmitting the determined gradient value to a first full connection layer, a second full connection layer, a third full connection layer and a fourth full connection layer in the CTC model in a back propagation mode, and adjusting the weight of each full connection layer in a gradient descent mode.
After the current training round is finished, the method may return to step 102 to start the next round of training, until the loss value determined by the loss function reaches its minimum.
FIG. 3 shows the structure of a CTC model described in some embodiments of the present specification. As shown in fig. 3, the above CTC model for training includes:
the first fully-connected layer 302, which includes at least 2 output neurons, is configured to take the feature vector as an input and determine at least 2 outputs according to its own weight.
And a first normalization layer 304, configured to obtain probabilities of the blank characters at each time according to the output of the first fully-connected layer.
As previously described, the first normalization layer 304 described above may be implemented using a first softmax logistic regression layer or a first Sigmoid layer.
The second fully-connected layer 306 comprises at least K output neurons, and is used for taking the feature vectors as input and determining at least K outputs according to the weight of the second fully-connected layer; where K is the number of elements contained in the dictionary set of CTCs.
And a second normalization layer 308, configured to obtain probabilities of each element in the dictionary set at each time according to the output of the second fully-connected layer.
As previously described, the second normalization layer 308 described above may be implemented using a second softmax logistic regression layer.
And a dot product layer 312 for performing a dot product of the feature vector and the embedded expression vector of the tag sequence to obtain a joint expression vector of at least one feature vector and the tag sequence.
The third fully-connected layer 314, which comprises at least 2 output neurons and is configured to take the joint representation vector of the feature vector and the tag sequence as input and determine at least 2 outputs according to its own weights.
And a third normalization layer 316, configured to obtain a posterior probability of the blank character at each time according to the output of the third fully-connected layer.
As previously described, the third normalization layer 316 described above may be implemented using a third softmax logistic regression layer or a third Sigmoid layer.
A CTC loss layer 318, configured to determine the likelihood distribution of the tag sequence according to the posterior approximation probability of the blank character at each time instant and the probability of each element in the dictionary set at each time instant; and to determine a gradient value of the training according to the likelihood distribution of the tag sequence and the prior distribution of the blank characters, adjusting the weights of the first, second and third fully-connected layers according to the gradient value.
In an embodiment of the present specification, the CTC model for data training described above may further include: a fourth fully connected layer 310, including at least M output neurons, for taking the feature vector as input and determining at least M outputs according to its own weight; where M is the dimension of the tag sequence embedding representation vector.
The specific implementation method of each component module in the CTC model may refer to the technical scheme shown in fig. 1, and is not described herein again.
Corresponding to the above-mentioned CTC model training method and CTC model, an embodiment of the present specification provides a CTC model training apparatus, an internal structure of which is shown in fig. 4, including:
a feature vector obtaining module 402, configured to obtain a feature sequence and an embedded representation vector of a tag sequence corresponding to the feature sequence, where the feature sequence includes at least one feature vector, and the at least one feature vector includes: the character or voice signal in the picture format is processed by a feature extraction network to obtain a feature vector corresponding to each moment;
a blank character prior distribution determining module 404, configured to sequentially input the at least one feature vector into a first fully-connected layer, and normalize an output of the first fully-connected layer to obtain a prior distribution of a blank character at each time;
a prior probability determining module 406, configured to sequentially input the at least one feature vector into a second fully-connected layer, and normalize an output of the second fully-connected layer to obtain a probability that each element in a dictionary set corresponding to the text or voice signal corresponds to each time;
a posterior approximation probability determining module 408, configured to determine a joint representation vector of the at least one feature vector and the tag sequence, input the joint representation vector into a third fully-connected layer, and normalize the output of the third fully-connected layer to obtain the posterior approximation probability of the blank character at each time instant;
a likelihood distribution determining module 410, configured to determine likelihood distribution of the tag sequence according to a posterior approximation probability of each time corresponding to the blank character and a probability of each time corresponding to each element in the dictionary set; and
and a loss determining module 412, configured to determine a gradient value of the training according to the likelihood distribution of the tag sequence and the prior distribution of the blank characters, and adjust the weights of the first, second and third fully-connected layers according to the gradient value.
It can be seen that the blank character prior distribution determining module 404 may include the first fully-connected layer 302 and the first normalization layer 304. The prior probability determining module 406 may include the second fully-connected layer 306 and the second normalization layer 308. The posterior approximation probability determining module 408 may include the dot product layer 312, the third fully-connected layer 314, and the third normalization layer 316, and may further include the fourth fully-connected layer 310.
In an embodiment of the present specification, the likelihood distribution determining module may first determine likelihood probabilities of respective elements and blank characters in the dictionary set at respective times, and then determine the likelihood distribution of the label sequence according to the determined likelihood probabilities. The likelihood probability of the blank character corresponding to each moment is the posterior approximation probability of the blank character corresponding to each moment; the likelihood probability of each element in the dictionary set corresponding to each time is the product of the posterior approximation probability of each non-blank character corresponding to each time and the probability of the element at the corresponding time.
In an embodiment of the present specification, the likelihood distribution determining module determines, according to likelihood probabilities of respective elements in the dictionary set and respective times corresponding to blank characters, likelihood probabilities of a plurality of output paths of the CTC model respectively; summing the likelihood probabilities of a plurality of output paths corresponding to the same output sequence to obtain the likelihood probability of the output sequence; and using the distribution of likelihood probabilities of a plurality of output sequences of the CTC model as the likelihood distribution of the tag sequences.
In an embodiment of the present specification, the loss determining module may use expression (11) or expression (12) above as the loss function of the CTC model training, and determine the gradient value of a training round according to that loss function.
It can be seen that the above CTC model training method, CTC model and CTC model training device restructure, by means of variational inference, the uniform treatment of blank characters and label characters in the CTC algorithm into a first-layer network that first decides whether a blank character or a label character is output and a second-layer network that then decides which specific label character is output, thereby resolving the under-learning of non-blank characters caused by the imbalanced distribution of blank and non-blank characters in the CTC algorithm. After the non-blank characters are fully learned, the output of the CTC model is more fully aligned with the input data, and the discriminative power of the output confidence score is greatly improved, thereby improving the precision and recall of the CTC model and, in turn, the precision and recall of OCR or speech recognition.
Embodiments of the present specification also provide a data processing method based on the CTC model, which may be used in optical character recognition or speech recognition. Referring to fig. 5, the data processing process according to the embodiment of the present invention may include:
at step 502, a signature sequence is obtained.
In an embodiment of the present specification, the feature sequence includes at least one feature vector, where the at least one feature vector includes: feature vectors corresponding to different time instants, obtained after the text in picture format or voice signal to be recognized is processed by a feature extraction network.
At step 504, the at least one feature vector is sequentially input into a first fully-connected layer.
In an embodiment of the present specification, the number of neurons in the above-described first fully-connected layer may be equal to 2.
In step 506, the output of the first fully-connected layer is normalized to obtain the probability of the blank character at each time.
In an embodiment of the present specification, the normalization may be implemented by a first softmax logistic regression layer or a first Sigmoid layer.
At step 514, the at least one feature vector is sequentially input into a second fully-connected layer.
In an embodiment of the present specification, the number of neurons in the above-described second fully-connected layer may be equal to the number of elements included in the dictionary set.
In step 516, the output of the second fully-connected layer is normalized to obtain the probability of each element in the dictionary set at each time.
In an embodiment of the present specification, the normalization may be implemented by a second softmax logistic regression layer.
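The two branches of steps 504 through 516 can be sketched as follows (a minimal illustration; the class and variable names are ours, and the Sigmoid variant of the first normalization layer is omitted):

```python
import torch
import torch.nn as nn

class TwoBranchCTCHead(nn.Module):
    """Sketch of the two-branch head: the first branch decides blank vs.
    non-blank, the second scores the K elements of the dictionary set."""

    def __init__(self, feature_dim: int, dict_size: int):
        super().__init__()
        self.blank_fc = nn.Linear(feature_dim, 2)            # first fully-connected layer
        self.element_fc = nn.Linear(feature_dim, dict_size)  # second fully-connected layer

    def forward(self, features: torch.Tensor):
        # features: (T, feature_dim), one feature vector per time step
        blank_probs = torch.softmax(self.blank_fc(features), dim=-1)      # first normalization
        element_probs = torch.softmax(self.element_fc(features), dim=-1)  # second normalization
        p_blank = blank_probs[:, 0]  # assuming index 0 is the blank class
        return p_blank, element_probs
```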
The specific implementations of steps 504, 506 and steps 514, 516 may refer to steps 104, 106 and steps 114, 116 described above, and are not repeated here.
In step 518, the likelihood probabilities of each element in the dictionary set and of the blank character at each time are determined based on the probability of the blank character at each time and the probability of each element in the dictionary set at each time.
In an embodiment of the present specification, determining these likelihood probabilities may include the following two aspects:
in one aspect, the likelihood probability of the blank character at each time is determined to be the probability of the blank character at that time;
in another aspect, the likelihood probability of each element in the dictionary set at each time is determined as the product of the probability of a non-blank character at that time and the probability of the element at the corresponding time.
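Step 518 thus combines the two branches into a single per-time likelihood table; a minimal sketch under the same naming assumptions (column 0 holds the blank character):

```python
import torch

def per_time_likelihoods(p_blank: torch.Tensor,
                         element_probs: torch.Tensor) -> torch.Tensor:
    """Build the (T, K+1) likelihood table of step 518.

    p_blank:       (T,)   probability of the blank character at each time.
    element_probs: (T, K) probability of each dictionary element at each time.
    """
    non_blank = (1.0 - p_blank).unsqueeze(1)  # (T, 1) probability of a non-blank character
    return torch.cat([p_blank.unsqueeze(1),   # column 0: the blank probability itself
                      non_blank * element_probs], dim=1)
```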
In step 520, an output sequence corresponding to the feature sequence is determined based on the likelihood probabilities of each element in the dictionary set and of the blank character at each time.
In an embodiment of the present specification, with reference to fig. 6, the method for determining an output sequence corresponding to the feature sequence according to the likelihood probabilities of each element in the dictionary set and of the blank character at each time may include:
in step 602, a likelihood probability of at least one output path is determined according to likelihood probabilities of each element in the dictionary set and the blank character at each time.
In an embodiment of the present specification, an output path is composed of one element per time, where each element is either an element of the dictionary set or the blank character. The likelihood probability of an output path is the product of the likelihood probabilities of its elements at the corresponding times.
At step 604, at least one output sequence corresponding to the at least one output path is determined.
In an embodiment of the present specification, for each output path of the at least one output path, the repeated elements between the blank characters on the output path are combined, and the blank characters on the output path are then removed, to obtain the output sequence corresponding to that output path.
In step 606, for each output sequence, the likelihood probabilities of all output paths corresponding to the output sequence are added, and the sum is taken as the likelihood probability of the output sequence.
In step 608, the output sequence with the maximum likelihood probability is used as the output sequence corresponding to the feature sequence.
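Steps 602 through 608 can be made concrete with a deliberately naive decoder that enumerates every output path; this is exponential in the sequence length and is shown only to illustrate the collapse-and-sum semantics (practical CTC decoders use the forward-backward dynamic program or beam search instead):

```python
import itertools
from collections import defaultdict

BLANK = 0  # index of the blank character in the likelihood table

def collapse(path):
    """Step 604: merge repeated elements between blanks, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return tuple(out)

def decode(likelihoods):
    """Brute-force illustration of steps 602-608.

    likelihoods: per-time rows of length K+1, column 0 being the blank.
    Returns the output sequence with the maximum likelihood probability,
    together with that probability.
    """
    T, V = len(likelihoods), len(likelihoods[0])
    seq_prob = defaultdict(float)
    for path in itertools.product(range(V), repeat=T):
        p = 1.0
        for t, sym in enumerate(path):
            p *= likelihoods[t][sym]           # step 602: product over time steps
        seq_prob[collapse(path)] += p          # step 606: sum paths per sequence
    return max(seq_prob.items(), key=lambda kv: kv[1])  # step 608
```

For instance, `decode(per_time_likelihoods(p_blank, element_probs).tolist())` would return the step-608 output sequence for the table built in the earlier sketch.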
Corresponding to the above processing method, an embodiment of the present specification further provides a CTC model for performing data processing, also referred to as a CTC model-based data processing apparatus, which, as shown in fig. 7, may include:
A first fully-connected layer 702, which includes at least 2 output neurons and is configured to take the feature vector as input and determine at least 2 outputs according to its own weights.
A first normalization layer 704, configured to obtain the probability of the blank character at each time according to the output of the first fully-connected layer.
In an embodiment of the present specification, the first normalization layer 704 may be implemented by a first softmax logistic regression layer or a first Sigmoid layer.
A second fully-connected layer 706, which includes at least K output neurons and is configured to take the feature vectors as input and determine at least K outputs according to its own weights, where K is the number of elements contained in the dictionary set of the CTC model.
A second normalization layer 708, configured to obtain probabilities of each element in the dictionary set at each time according to the output of the second fully-connected layer.
In an embodiment of the present description, the second normalization layer 708 may be implemented by a second softmax logistic regression layer.
A CTC output layer 710, configured to determine the likelihood probabilities of each element in the dictionary set and of the blank character at each time based on the probability of the blank character at each time and the probability of each element in the dictionary set at each time, and to determine the output sequence corresponding to the feature sequence according to those likelihood probabilities.
The specific implementation method of each component module in the CTC model may refer to the technical solutions shown in fig. 5 and 6, and is not described herein again.
In correspondence with the above-described data processing method and CTC model, an embodiment of the present specification provides a data processing apparatus, an internal structure of which is shown in fig. 8, including:
a second feature vector obtaining module 802, configured to obtain a feature sequence, where the feature sequence includes at least one feature vector, and the at least one feature vector includes: the character or voice signal of the picture format to be recognized is processed by a feature extraction network to obtain a feature vector corresponding to each moment;
a blank character prior distribution determining module 804, configured to sequentially input the at least one feature vector into a first fully-connected layer, and normalize the output of the first fully-connected layer to obtain the probability of the blank character at each time;
a prior probability determining module 806, configured to sequentially input the at least one feature vector into a second fully-connected layer, and normalize the output of the second fully-connected layer to obtain the probability of each element in the dictionary set at each time;
a likelihood probability determining module 808, configured to determine the likelihood probabilities of each element in the dictionary set and of the blank character at each time according to the probability of the blank character at each time and the probability of each element in the dictionary set at each time; and
an output module 810, configured to determine an output sequence corresponding to the feature sequence according to the likelihood probabilities of each element in the dictionary set and of the blank character at each time.
In an embodiment of the present specification, the likelihood probability determining module 808 determines the likelihood probability of the blank character at each time as the probability of the blank character at that time, and determines the likelihood probability of each element in the dictionary set at each time as the product of the probability of the non-blank character at that time and the probability of the element at the corresponding time.
In an embodiment of the present specification, the output module 810 determines the likelihood probabilities of a plurality of output paths of the CTC model according to the likelihood probabilities of each element in the dictionary set and of the blank character at each time; determines at least one output sequence corresponding to the plurality of output paths; for each output sequence, adds the likelihood probabilities of all output paths corresponding to that output sequence and takes the sum as the likelihood probability of the output sequence; and takes the output sequence with the maximum likelihood probability as the output sequence corresponding to the feature sequence.
It can be seen that the uniform distribution over blank characters and label characters in the CTC algorithm is replaced by a two-level decision: first whether a blank character or a label character is output, and then which label character is specifically output. This alleviates the under-learning of non-blank characters caused by the unbalanced distribution of blank and non-blank characters during training of the CTC algorithm. Once the non-blank characters are fully learned, they can be aligned more fully with the input data, so the output confidence scores become more discriminative during data processing with the CTC model, which improves the precision and recall of the CTC model and, in turn, the precision and recall of OCR or speech recognition.
Further, in the embodiments of the present specification, the CTC model may also be regarded as an electronic device, and thus the CTC model may include: a memory, a processor, an input/output interface, a communication interface, and a bus. The processor, the memory, the input/output interface, and the communication interface are communicatively connected to one another inside the device through the bus.
The memory may be implemented in the form of a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory may store an operating system and other application programs, and may also store the various modules of the server provided in the embodiments of the present specification, such as the above-mentioned first fully-connected layers 302 and 702, the first softmax logistic regression layers 304 and 704, the second fully-connected layers 306 and 706, the second softmax logistic regression layers 308 and 708, the third fully-connected layer 310, the joint representation layer 312, the fourth fully-connected layer 314, the third softmax logistic regression layer 316, the CTC loss layer 318, the CTC output layer 710, and the like. When the technical solutions provided in the embodiments of the present specification are implemented by software or firmware, the relevant program code is stored in the memory and is called and executed by the processor.
The processor may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute a relevant program to implement the technical solutions provided in the embodiments of the present specification.
The input/output interface is used for connecting the input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface is used for connecting a communication module (not shown in the figure) to realize the communication interaction of the equipment and other equipment. The communication module can realize communication in a wired mode (for example, USB, network cable, etc.), and can also realize communication in a wireless mode (for example, mobile network, WIFI, bluetooth, etc.).
A bus includes a path that transfers information between the various components of the device, such as the processor, memory, input/output interfaces, and communication interfaces.
It should be noted that although the above-described device shows only a processor, a memory, an input/output interface, a communication interface and a bus, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the electronic device embodiment and the computer storage medium embodiment, since they are substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the context of this description, features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of this description as described above, which are not provided in detail for the sake of brevity.
While the present description has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Claims (20)

1. A method of training a connectionist temporal classification (CTC) model, comprising:
acquiring a feature sequence and an embedded representation vector of a tag sequence corresponding to the feature sequence, wherein the feature sequence comprises at least one feature vector, and the at least one feature vector comprises: a feature vector corresponding to each time, obtained by processing the character in picture format or the voice signal with a feature extraction network;
sequentially inputting the at least one feature vector into a first full-connection layer, and normalizing the output of the first full-connection layer to obtain prior distribution of the blank character corresponding to each moment;
sequentially inputting the at least one feature vector into a second full-connection layer, and normalizing the output of the second full-connection layer to obtain the probability of each element in a dictionary set corresponding to the characters or the voice signals corresponding to each moment;
determining a joint expression vector of the at least one feature vector and the label sequence, inputting the joint expression vector into a third full-connection layer, and normalizing the output of the third full-connection layer to obtain the posterior approximation probability of the blank character corresponding to each moment;
determining the likelihood distribution of the label sequence according to the posterior approximation probability of the blank character corresponding to each moment and the probability of each element in the dictionary set corresponding to each moment; and
and determining a gradient value of the training according to the likelihood distribution of the label sequence and the prior distribution of the blank characters, and adjusting the weights of the first full-connection layer, the second full-connection layer and the third full-connection layer according to the gradient value.
2. The method of claim 1, wherein determining the likelihood distribution of the tag sequence comprises:
determining the likelihood probability of each element in the dictionary set and each time corresponding to the blank character; the likelihood probability of the blank character corresponding to each moment is the posterior approximation probability of the blank character corresponding to each moment; the likelihood probability of each element in the dictionary set corresponding to each moment is the product of the posterior approximation probability of each non-blank character corresponding to each moment and the probability of the element at the corresponding moment;
and determining the likelihood distribution of the label sequence according to the likelihood probability of each element in the dictionary set and each time corresponding to the blank character.
3. The method of claim 2, wherein determining the likelihood distribution of the sequence of labels based on the likelihood probabilities for each element in the dictionary set and each time instance corresponding to a blank character comprises:
determining the likelihood probability of a plurality of output paths of the CTC model according to the likelihood probability of each element in the dictionary set and each time corresponding to the blank character;
summing the likelihood probabilities of a plurality of output paths corresponding to the same output sequence to obtain the likelihood probability of the output sequence; and
and taking the distribution of the likelihood probability of a plurality of output sequences of the CTC model as the likelihood distribution of the label sequence.
4. The method of claim 1, wherein determining the gradient value of the current training from the likelihood distribution of the tag sequence and the prior distribution of the blank characters comprises:
the following expression is used as a loss function for CTC model training:
$$\mathcal{L} = -\log \sum_{O_b} p(O_b \mid X)\, p(Y \mid O_b, X)$$
wherein $p(O_b \mid X)$ is the prior distribution of the blank characters; $p(Y \mid O_b, X)$ is the likelihood distribution of the tag sequence; and $O_b$ denotes a blank character output sequence; and
and determining the gradient value of the training according to the loss function.
5. The method of claim 1, wherein determining the gradient value of the current training from the likelihood distribution of the tag sequence and the prior distribution of the blank characters comprises:
the following expression is used as a loss function for CTC model training:
$$\mathcal{L} = \mathrm{KL}\big(q_\psi(O_b \mid X, Y)\,\|\,p(O_b \mid X)\big) - \mathbb{E}_{q_\psi(O_b \mid X, Y)}\big[\log p(Y \mid O_b, X)\big]$$
wherein $p(Y \mid O_b, X)$ is the likelihood distribution of the tag sequence; $q_\psi(O_b \mid X, Y)$ is the posterior approximation distribution of the blank character; $p(O_b \mid X)$ is the prior distribution of the blank characters; $\mathrm{KL}(\cdot)$ is the divergence calculation; $\mathbb{E}(\cdot)$ is the expectation operation; and
and determining the gradient value of the training according to the loss function.
6. The method of claim 1, wherein the embedded representation vector of the tag sequence is determined by: mapping each element in the dictionary set to an initial vector respectively; averaging initial vectors corresponding to elements contained in the tag sequence to obtain an embedded expression vector of the tag sequence;
said determining a joint representation vector of the at least one feature vector and the tag sequence comprises: computing the Hadamard product of the at least one feature vector and the embedded representation vector of the tag sequence, respectively.
7. A data processing method based on a connectionist temporal classification (CTC) model, comprising:
obtaining a feature sequence, wherein the feature sequence comprises at least one feature vector, and the at least one feature vector comprises: the character or voice signal of the picture format to be recognized is processed by a feature extraction network to obtain a feature vector corresponding to each moment;
sequentially inputting the at least one feature vector into a first full-connection layer, and normalizing the output of the first full-connection layer to obtain the probability of the blank character corresponding to each moment;
sequentially inputting the at least one feature vector into a second full-connection layer, and normalizing the output of the second full-connection layer to obtain the probability of each element in a dictionary set corresponding to the characters or the voice signals corresponding to each moment;
determining the likelihood probability of each element in the dictionary set and each time corresponding to the blank character according to the probability of each time corresponding to the blank character and the probability of each element in the dictionary set corresponding to each time; and
and determining an output sequence corresponding to the characteristic sequence according to the likelihood probability of each element in the dictionary set and each time corresponding to the blank character.
8. The method of claim 7, wherein determining likelihood probabilities for respective instances of time for respective elements of the dictionary set and for a blank character comprises:
determining the likelihood probability of each time corresponding to the blank character as the probability of each time corresponding to the blank character; and
and determining the likelihood probability of each element in the dictionary set at each time as the product of the probability of the non-blank character at each time and the probability of the element at the corresponding time.
9. The method of claim 7, wherein the determining an output sequence corresponding to the sequence of features comprises:
determining the likelihood probability of a plurality of output paths of the CTC model according to the likelihood probability of each element in the dictionary set and each time corresponding to the blank character;
respectively determining at least one output sequence corresponding to the plurality of output paths;
for each output sequence, adding the likelihood probabilities of all output paths corresponding to the output sequence to obtain the likelihood probability of the output sequence;
and taking the output sequence with the maximum likelihood probability as the output sequence corresponding to the characteristic sequence.
10. The method of claim 9, wherein determining at least one output sequence corresponding to the plurality of output paths, respectively, comprises:
and aiming at each output path in the output paths, combining repeated elements among the blank characters on the output path, and removing the blank characters on the output path to obtain an output sequence corresponding to the output path.
11. A connectionist temporal classification (CTC) model training device, the device comprising:
a feature vector obtaining module, configured to obtain a feature sequence and an embedded representation vector of a tag sequence corresponding to the feature sequence, where the feature sequence includes at least one feature vector, and the at least one feature vector includes: the character or voice signal in the picture format is processed by a feature extraction network to obtain a feature vector corresponding to each moment;
the blank character prior distribution determining module is used for sequentially inputting the at least one feature vector into a first full-connection layer, normalizing the output of the first full-connection layer and then obtaining prior distribution of the blank character corresponding to each moment;
the prior probability determination module is used for sequentially inputting the at least one feature vector into a second full-connection layer, normalizing the output of the second full-connection layer and then obtaining the probability of each element in a dictionary set corresponding to the characters or the voice signals corresponding to each moment;
the posterior approximation probability determining module is used for determining a joint expression vector of the at least one characteristic vector and the label sequence, inputting the joint expression vector into a third full-connection layer, and normalizing the output of the third full-connection layer to obtain the posterior approximation probability of the blank character corresponding to each moment;
a likelihood distribution determining module, configured to determine likelihood distribution of the label sequence according to a posterior approximation probability of the blank character corresponding to each time and a probability of each element in the dictionary set corresponding to each time;
and the loss determining module is used for determining a gradient value of the training according to the likelihood distribution of the label sequence and the blank character prior distribution, and adjusting the weight values of the first full-connection layer, the second full-connection layer and the third full-connection layer according to the gradient value.
12. The apparatus according to claim 11, wherein the likelihood distribution determining module determines the likelihood probabilities of each element in the dictionary set and of the blank character at each time according to the posterior approximation probability of the blank character at each time and the probability of each element in the dictionary set at each time, and determines the likelihood distribution of the label sequence according to those likelihood probabilities; the likelihood probability of the blank character at each time is the posterior approximation probability of the blank character at that time; the likelihood probability of each element in the dictionary set at each time is the product of the posterior approximation probability of a non-blank character at that time and the probability of the element at the corresponding time.
13. The apparatus of claim 12, wherein said likelihood distribution determination module determines likelihood probabilities of a plurality of output paths of said CTC model based on the likelihood probabilities of each element in said dictionary set and of said blank character at each time; sums the likelihood probabilities of the output paths corresponding to the same output sequence to obtain the likelihood probability of that output sequence; and uses the distribution of the likelihood probabilities of a plurality of output sequences of said CTC model as the likelihood distribution of the tag sequence.
14. The apparatus of claim 11, wherein the loss determination module is configured to use the following expression as the loss function for the CTC model training:
$$\mathcal{L} = -\log \sum_{O_b} p(O_b \mid X)\, p(Y \mid O_b, X)$$
wherein $p(O_b \mid X)$ is the prior distribution of the blank characters; $p(Y \mid O_b, X)$ is the likelihood distribution of the tag sequence; and $O_b$ denotes a blank character output sequence; and
and determining the gradient value of the training according to the loss function.
15. The apparatus of claim 11, wherein the loss determination module is configured to use the following expression as the loss function for the CTC model training:
$$\mathcal{L} = \mathrm{KL}\big(q_\psi(O_b \mid X, Y)\,\|\,p(O_b \mid X)\big) - \mathbb{E}_{q_\psi(O_b \mid X, Y)}\big[\log p(Y \mid O_b, X)\big]$$
wherein $p(Y \mid O_b, X)$ is the likelihood distribution of the tag sequence; $q_\psi(O_b \mid X, Y)$ is the posterior approximation distribution of the blank character; $p(O_b \mid X)$ is the prior distribution of the blank characters; $\mathrm{KL}(\cdot)$ is the divergence calculation; $\mathbb{E}(\cdot)$ is the expectation operation; and
and determining the gradient value of the training according to the loss function.
16. A data processing apparatus based on a connectionist temporal classification (CTC) model, the apparatus comprising:
a feature vector obtaining module, configured to obtain a feature sequence, where the feature sequence includes at least one feature vector, and the at least one feature vector includes: the character or voice signal of the picture format to be recognized is processed by a feature extraction network to obtain a feature vector corresponding to each moment;
the blank character prior distribution determining module is used for sequentially inputting the at least one feature vector into a first full-connection layer and normalizing the output of the first full-connection layer to obtain the probability of the blank character corresponding to each moment;
the prior probability determination module is used for sequentially inputting the at least one feature vector into a second full-connection layer, normalizing the output of the second full-connection layer and then obtaining the probability of each element in a dictionary set corresponding to the characters or the voice signals corresponding to each moment;
a likelihood probability determination module, configured to determine likelihood probabilities of each element in the dictionary set and each time corresponding to a blank character according to the probability of each time corresponding to the blank character and the probability of each time corresponding to each element in the dictionary set; and
and the output module is used for determining an output sequence corresponding to the characteristic sequence according to the likelihood probability of each element in the dictionary set and each time corresponding to the blank character.
17. The apparatus of claim 16, wherein the likelihood probability determination module determines the likelihood probability of the blank character at each time as the probability of the blank character at that time; and determines the likelihood probability of each element in the dictionary set at each time as the product of the probability of the non-blank character at that time and the probability of the element at the corresponding time.
18. The apparatus of claim 16, wherein said output module determines likelihood probabilities of a plurality of output paths of said CTC model based on the likelihood probabilities of each element in said dictionary set and of said blank character at each time; determines at least one output sequence corresponding to the plurality of output paths; for each output sequence, adds the likelihood probabilities of all output paths corresponding to that output sequence and takes the sum as the likelihood probability of the output sequence; and takes the output sequence with the maximum likelihood probability as the output sequence corresponding to the feature sequence.
19. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, wherein the processor implements the method according to any of claims 1 to 10 when executing the program.
20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 10.
CN202010124513.9A 2020-02-27 2020-02-27 CTC model training method, data processing method, device and storage medium Pending CN111340117A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010124513.9A CN111340117A (en) 2020-02-27 2020-02-27 CTC model training method, data processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010124513.9A CN111340117A (en) 2020-02-27 2020-02-27 CTC model training method, data processing method, device and storage medium

Publications (1)

Publication Number Publication Date
CN111340117A true CN111340117A (en) 2020-06-26

Family

ID=71185627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010124513.9A Pending CN111340117A (en) 2020-02-27 2020-02-27 CTC model training method, data processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111340117A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580343A (en) * 2020-11-03 2021-03-30 北京字节跳动网络技术有限公司 Model generation method, question and answer quality judgment method, device, equipment and medium
CN113468492A (en) * 2021-07-13 2021-10-01 京东科技控股股份有限公司 Verification method and device for verification information and readable storage medium


Similar Documents

Publication Publication Date Title
US11100921B2 (en) Pinyin-based method and apparatus for semantic recognition, and system for human-machine dialog
WO2023024412A1 (en) Visual question answering method and apparatus based on deep learning model, and medium and device
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN111324743A (en) Text relation extraction method and device, computer equipment and storage medium
US20230080671A1 (en) User intention recognition method and apparatus based on statement context relationship prediction
CN113035231B (en) Keyword detection method and device
CN110598210B (en) Entity recognition model training, entity recognition method, entity recognition device, entity recognition equipment and medium
KR102258004B1 (en) Method and server for providing image tlanslation service using a user interface displaying one or more text images
CN111340117A (en) CTC model training method, data processing method, device and storage medium
US11776289B2 (en) Method and electronic device for predicting plurality of multi-modal drawings
Inunganbi et al. Handwritten Meitei Mayek recognition using three‐channel convolution neural network of gradients and gray
CN113449840A (en) Neural network training method and device and image classification method and device
CN113626563A (en) Method and electronic equipment for training natural language processing model and natural language processing
CN113536784A (en) Text processing method and device, computer equipment and storage medium
US20230130662A1 (en) Method and apparatus for analyzing multimodal data
CN113095072A (en) Text processing method and device
CN116127027A (en) Intention recognition method and device, and training method and device of intention recognition model
CN115759262A (en) Visual common sense reasoning method and system based on knowledge perception attention network
CN114357964A (en) Subjective question scoring method, model training method, computer device, and storage medium
CN112509565A (en) Voice recognition method and device, electronic equipment and readable storage medium
CN113095066A (en) Text processing method and device
US20240185840A1 (en) Method of training natural language processing model method of natural language processing, and electronic device
CN113283240B (en) Co-reference digestion method and electronic equipment
CN114817452A (en) Semantic matching method, device and equipment and storage medium
Ahli Towards a Reliable Machine Learning-based Model Designed for Translating Sign Language Videos to Text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200626