CN115147849A - Training method of character coding model, character matching method and device - Google Patents

Training method of character coding model, character matching method and device

Info

Publication number
CN115147849A
Authority
CN
China
Prior art keywords
character string
sample
string
vector
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210686424.2A
Other languages
Chinese (zh)
Inventor
陈珺
孙清清
邹泊滔
赖伟达
郑行
王爱凌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210686424.2A priority Critical patent/CN115147849A/en
Publication of CN115147849A publication Critical patent/CN115147849A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification describe a training method for a character coding model, a character matching method, and corresponding devices. According to the method of the embodiments, sample training sets are first obtained, and encoding is then performed on each sample training set. A loss function value can then be calculated using the characterization vectors obtained from the sample training sets. Finally, the character coding model is trained using the obtained loss function value. Each sample training set used to train the model comprises a standard character string, a positive sample character string, and a negative sample character string, where the positive sample character string represents the same object as the standard character string and the negative sample character string represents a different object. When character strings are encoded by the resulting model, the characterization vectors of character strings representing the same object have high similarity and those of character strings representing different objects have low similarity, so the accuracy of character string matching is improved when character strings are matched.

Description

Training method of character coding model, character matching method and device
Technical Field
One or more embodiments of the present specification relate to the field of computer technology, and in particular to a training method for a character coding model, a character matching method, and corresponding devices.
Background
Character matching is a fuzzy matching method for text, mainly applied to matching names such as names of people, places, and organizations.
However, owing to factors such as language and geographic or cultural diversity, names referring to the same thing may have multiple spellings. For example, Muhammad, Mohammed, and Mohammad are all English transliterations of the same Arabic name. If these factors are not taken into account, different spellings of the same object's name may fail to be matched accurately.
Disclosure of Invention
One or more embodiments of the present specification describe a training method for a character coding model, a character matching method, and corresponding devices, which can improve the accuracy of character string matching.
According to a first aspect, there is provided a training method of a character encoding model, comprising:
obtaining at least two sample training sets, wherein each sample training set comprises a standard character string, a positive sample character string, and a negative sample character string; in each sample training set, the positive sample character string represents the same object as the standard character string, the negative sample character string represents a different object from the standard character string, and the positive sample character string is different from the standard character string;
coding each sample training set to obtain a characterization vector corresponding to each sample training set;
calculating a loss function value by using the characterization vectors of the sample training sets;
and training the character coding model according to the loss function value.
In one possible implementation, the standard character string includes a character string corresponding to the name of an object;
the positive sample character string includes a character string whose corresponding object has the same ID as that of the standard character string and whose spelling form differs from that of the standard character string;
and/or,
the negative sample character string includes a character string whose corresponding object has a different ID from that of the standard character string and whose spelling form differs from that of the standard character string.
In one possible implementation, performing encoding processing on each sample training set to obtain the characterization vector corresponding to each sample training set includes:
for each sample training set, performing:
performing numerical encoding on the standard character string, the positive sample character string, and the negative sample character string in the current sample training set to obtain, respectively, a standard numerical vector corresponding to the standard character string, a positive sample numerical vector corresponding to the positive sample character string, and a negative sample numerical vector corresponding to the negative sample character string of the current sample training set;
mapping the standard numerical vector, the positive sample numerical vector, and the negative sample numerical vector of the current sample training set to a first dimension space to obtain, respectively, a standard characterization vector corresponding to the standard character string, a positive sample characterization vector corresponding to the positive sample character string, and a negative sample characterization vector corresponding to the negative sample character string of the current sample training set; the dimension of the first dimension space is smaller than the dimension of the space in which any one of the standard numerical vector, the positive sample numerical vector, and the negative sample numerical vector lies.
In one possible implementation, the conditions that the loss function value among the standard character string, the positive sample character string, and the negative sample character string satisfies include: the similarity between the positive sample character string and the standard character string in each sample training set is greater than a first similarity threshold, the similarity between the negative sample character string and the standard character string is less than a second similarity threshold, and the first similarity threshold is greater than the second similarity threshold.
In one possible implementation, calculating the loss function value by using the characterization vectors of the respective sample training sets includes:
calculating the loss function value using the following calculation:
L = \frac{1}{N}\sum_{i=1}^{N}\max\left(\left\|f(x_i)-f(x_i^{+})\right\|_2^2-\left\|f(x_i)-f(x_i^{-})\right\|_2^2+\varepsilon,\;0\right)
wherein L characterizes the loss function value; N characterizes the number of sample training sets; f(x_i) characterizes the standard characterization vector of the corresponding standard character string in the i-th sample training set; f(x_i^+) characterizes the positive sample characterization vector of the corresponding positive sample character string in the i-th sample training set; f(x_i^-) characterizes the negative sample characterization vector of the corresponding negative sample character string in the i-th sample training set; and ε characterizes a hyperparameter balancing the similarity measure and the dissimilarity measure, where the similarity measure characterizes the degree of similarity between the positive sample character string and the standard character string, and the dissimilarity measure characterizes the degree of difference between the negative sample character string and the standard character string.
According to a second aspect, there is provided a character matching method comprising:
acquiring a first character string and a second character string to be matched;
inputting the first character string and the second character string respectively into the character coding model trained by the training method of the character coding model according to any one of claims 1 to 5, to obtain a first characterization vector corresponding to the first character string and a second characterization vector corresponding to the second character string;
and calculating the similarity between the first characterization vector and the second characterization vector, and determining the matching degree between the first character string and the second character string.
In one possible implementation, calculating the similarity between the first characterization vector and the second characterization vector and determining the degree of matching between the first character string and the second character string includes:
calculating a cosine value between the first token vector and the second token vector;
if the obtained cosine value is not smaller than a preset first matching threshold value, determining that the first character string is matched with the second character string;
and if the obtained cosine value is smaller than a preset first matching threshold value, determining that the first character string is not matched with the second character string.
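The cosine-threshold decision described above can be sketched as follows; the 0.8 value for the first matching threshold and the helper names (`cosine`, `is_match`) are illustrative assumptions, not values taken from the embodiments:

```python
import math

def cosine(u, v):
    # cosine of the angle between two characterization vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_match(vec1, vec2, threshold=0.8):
    # matched when the cosine value is not smaller than the first matching threshold
    return cosine(vec1, vec2) >= threshold

print(is_match([1.0, 0.0], [0.9, 0.1]))  # nearly parallel vectors: True
print(is_match([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: False
```

In practice the threshold would be tuned on held-out pairs of matching and non-matching strings.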
According to a third aspect, there is provided a training apparatus for a character encoding model, comprising: the device comprises a sample acquisition module, a coding processing module, a loss calculation module and a model training module;
the sample acquisition module is configured to acquire at least two sample training sets; wherein each sample training set comprises: a standard string, a positive sample string and a negative sample string; the positive sample character string in each sample training set is the same as the object represented by the standard character string, the negative sample character string is different from the object represented by the standard character string, and the positive sample character string is different from the standard character string;
the coding processing module is configured to perform coding processing on each sample training set acquired by the sample acquisition module to obtain a characterization vector corresponding to each sample training set;
the loss calculation module is configured to calculate a loss function value by using the characterization vectors of the sample training sets obtained by the encoding processing module;
and the model training module is configured to train the character coding model according to the loss function value obtained by the loss calculation module.
According to a fourth aspect, there is provided a character matching apparatus comprising: the device comprises a character string acquisition module, a vector output module and a similarity calculation module;
the character string acquisition module is configured to acquire a first character string and a second character string to be matched;
the vector output module is configured to input the first character string and the second character string acquired by the character string acquisition module into the character coding model trained by the training apparatus of the character coding model according to claim 8, so as to obtain a first characterization vector corresponding to the first character string and a second characterization vector corresponding to the second character string;
the similarity calculation module is configured to calculate the similarity between the first characterization vector and the second characterization vector output by the vector output module, and determine the degree of matching between the first character string and the second character string.
According to a fifth aspect, there is provided a computing device comprising: a memory having executable code stored therein, and a processor that, when executing the executable code, implements the method of any of the first and second aspects described above.
According to the method and the device provided by the embodiments of this specification, when a character coding model is trained, sample training sets are first obtained, and encoding is then performed on each sample training set. A loss function value can then be calculated using the characterization vectors obtained from each sample training set. Finally, the character coding model is trained using the obtained loss function value. Each sample training set used to train the model comprises a standard character string, a positive sample character string, and a negative sample character string, where the positive sample character string represents the same object as the standard character string and the negative sample character string represents a different object. When character strings are encoded by the resulting model, the characterization vectors of character strings representing the same object have high similarity and those of character strings representing different objects have low similarity, so the accuracy of character string matching is improved when character strings are matched.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present specification, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a training method for a character encoding model according to an embodiment of the present disclosure;
fig. 2 is a flowchart of an encoding processing method according to an embodiment of the present specification;
FIG. 3 is a flow chart of a method for character matching provided by an embodiment of the present description;
FIG. 4 is a flow diagram of another method for character matching provided by one embodiment of the present specification;
FIG. 5 is a diagram illustrating an apparatus for training a character encoding model according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of a character matching apparatus according to an embodiment of the present disclosure.
Detailed Description
As mentioned above, character matching is a kind of fuzzy text matching, mainly applied to matching names such as names of people, places, and organizations.
However, in practical applications, the spelling of the same object name may vary considerably owing to differences in language, geography, and culture. In speech recognition in particular, the same name may be recognized as multiple English spellings. For example, Muhammad, Mohammed, Mohammad, Muhammed, Mohamed, Mohamad, Muhamad, Muhamed, Mohummed, and Mohammod are all English transliterations of the same Arabic person name. Therefore, if the existence of multiple spellings of the same thing is not taken into account, different spellings of the same thing are likely to be judged as non-matching during text matching, and may even be determined to denote different things.
In the present application, when the model is trained, the character coding model is trained using not only the standard character string but also a positive sample character string that represents the same object as the standard character string and a negative sample character string that represents a different object from the standard character string. When the trained character coding model is used to numerically encode character strings, the similarity between character strings representing the same object is high and the similarity between character strings representing different objects is very low, thereby improving the accuracy of character string matching.
As shown in fig. 1, an embodiment of the present specification provides a method for training a character encoding model, which may include the following steps:
step 101: obtaining at least two sample training sets, wherein each sample training set comprises a standard character string, a positive sample character string, and a negative sample character string; in each sample training set, the positive sample character string represents the same object as the standard character string, the negative sample character string represents a different object from the standard character string, and the positive sample character string is different from the standard character string;
step 103: coding each sample training set to obtain a characterization vector corresponding to each sample training set;
step 105: calculating a loss function value by using the characterization vectors of the sample training sets;
step 107: and training the character coding model according to the loss function value.
In this embodiment, when the character coding model is trained, sample training sets are first obtained, and encoding is then performed on each sample training set. A loss function value can then be calculated using the characterization vectors obtained from the sample training sets. Finally, the character coding model can be trained using the obtained loss function value. Each sample training set used to train the model comprises a standard character string, a positive sample character string, and a negative sample character string, where the positive sample character string represents the same object as the standard character string and the negative sample character string represents a different object. When character strings are encoded with the resulting model, the characterization vectors of character strings representing the same object have high similarity and those of character strings representing different objects have low similarity, so the accuracy of character string matching is improved when character strings are matched.
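The data flow of steps 101 to 107 can be made concrete with the following minimal sketch. The encoder here is a placeholder (a bag-of-characters count vector passed through a fixed random projection), not the trained encoder the embodiments describe, and all names and example strings are illustrative assumptions:

```python
import numpy as np

def encode(s, dim=8):
    # Placeholder encoder for illustration only: bag-of-characters counts
    # projected to `dim` dimensions. The embodiments use a trained encoder here.
    counts = np.zeros(128)
    for ch in s.lower():
        counts[ord(ch) % 128] += 1
    proj = np.random.default_rng(42).normal(size=(128, dim))  # fixed projection
    return counts @ proj

# Step 101: sample training sets of (standard, positive sample, negative sample)
triplets = [("Muhammad", "Mohammed", "Beijing"),
            ("Liu ying", "Ying Liu", "Tianjin")]

# Step 103: encode each sample training set into characterization vectors
vecs = [(encode(a), encode(p), encode(n)) for a, p, n in triplets]

# Step 105: compute a triplet-style loss over all sets (eps balances the
# similarity and dissimilarity measures, as described in the embodiments)
eps = 0.2
loss = float(np.mean([max(np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + eps, 0.0)
                      for a, p, n in vecs]))

# Step 107: in a real implementation this loss would drive gradient updates
# of the character coding model's parameters.
```

Step 107 is not implemented here because the placeholder encoder has no trainable parameters; with a real encoder, the loss value would be backpropagated.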
The steps in FIG. 1 are described below with reference to specific examples.
First, in step 101, at least two sample training sets are obtained.
Among the sample training sets obtained in this step, each sample training set should include a standard character string, a positive sample character string, and a negative sample character string, where the positive sample character string and the standard character string represent the same object. That is, among the various character strings representing the same object, any one character string may be used as the standard character string, and any other character string except the standard character string may be used as a positive sample character string; this ensures that the positive sample character string represents the same object as the standard character string while remaining a different string.
For example, the characterized object is the person name Liu Ying: its Chinese pinyin spelling may be Liu ying ying, and its English spelling may be Liu ying or Yingying Liu. Therefore, for a sample training set, any one of these four spellings can be chosen as the standard character string, and any one of the remaining three spellings can be selected as the positive sample character string. It is easy to understand that if each spelling representing the same object serves as the standard character string at least once in the constructed sample training sets, more possibilities can be covered and the information learned by the model becomes more comprehensive, which improves the reliability of the model and the accuracy of subsequent character string matching.
The object represented by a character string may be a person name, a place name, an organization, and so on. When the standard character string includes a character string corresponding to the name of an object, the positive sample character string may include a character string whose corresponding object has the same ID as that of the standard character string and whose spelling form differs from that of the standard character string. For example, when the standard character string is the name of a user, the positive sample character string may be a character string associated with the same identity (such as the same ID number) as that user but spelled differently from the standard character string. A spelling form covers both the characters constituting a character string and the order of those characters; that is, two spelling forms differ when the constituent characters, their order, or both are not identical. For example, Liu ying and Lau ying differ in the characters that form the string; Liu ying and Ying Liu differ in the order of the characters; and Lau ying and Ying Liu differ in both the characters and their order.
In addition, within the same sample training set, the negative sample character string and the standard character string represent different objects. For example, if the characterized object is the place name Beijing, the standard character string may be Beijing, and the negative sample character string may be any place name other than Beijing, such as Tianjin, Shanghai, or Guangzhou; the negative sample character string may even represent something other than a place name, such as the person name Li Ming, the date June, or the weather sunny. Thus, when the standard character string includes a character string corresponding to the name of an object, the negative sample character string may include a character string whose corresponding object has a different ID and whose spelling form differs from that of the standard character string. That is, the negative sample character string may be a character string that characterizes any object other than the object characterized by the standard character string.
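The sampling scheme described above can be sketched as follows. The `build_triplets` helper and the example records are hypothetical, but they follow the stated rules: a positive sample shares the standard string's object ID, a negative sample comes from a different ID, and every spelling serves as the standard string at least once:

```python
import random
from itertools import permutations

def build_triplets(records, rng=random.Random(0)):
    """records maps an object ID to the list of known spellings of its name.
    Returns (standard, positive sample, negative sample) triplets."""
    triplets = []
    ids = list(records)
    for oid, spellings in records.items():
        # every spelling is used as the standard string at least once
        for standard, positive in permutations(spellings, 2):
            other = rng.choice([i for i in ids if i != oid])  # a different object ID
            negative = rng.choice(records[other])
            triplets.append((standard, positive, negative))
    return triplets

names = {
    "id1": ["Muhammad", "Mohammed", "Mohammad"],  # same person, three spellings
    "id2": ["Liu ying", "Ying Liu"],
}
triplets = build_triplets(names)
```

With these toy records the helper yields eight triplets: six ordered (standard, positive) pairs for the three-spelling object and two for the two-spelling object.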
Then, in step 103, coding is performed on each sample training set to obtain a characterization vector corresponding to each sample training set.
After the sample training sets are obtained, encoding processing is performed on each sample training set to encode it into numerical characterization vectors. In one possible implementation, as shown in fig. 2, step 103 may perform the following steps for each sample training set:
step 201: performing numerical encoding on the standard character string, the positive sample character string, and the negative sample character string in the current sample training set to obtain, respectively, a standard numerical vector corresponding to the standard character string, a positive sample numerical vector corresponding to the positive sample character string, and a negative sample numerical vector corresponding to the negative sample character string of the current sample training set;
step 203: mapping the standard numerical vector, the positive sample numerical vector, and the negative sample numerical vector of the current sample training set to a first dimension space to obtain, respectively, a standard characterization vector corresponding to the standard character string, a positive sample characterization vector corresponding to the positive sample character string, and a negative sample characterization vector corresponding to the negative sample character string of the current sample training set; the dimension of the first dimension space is smaller than the dimension of the space in which any one of the standard numerical vector, the positive sample numerical vector, and the negative sample numerical vector lies.
In this embodiment, when a sample training set is encoded, numerical encoding is first performed on the standard character string, the positive sample character string, and the negative sample character string in the sample training set, respectively, to obtain a standard numerical vector, a positive sample numerical vector, and a negative sample numerical vector. The standard numerical vector, the positive sample numerical vector, and the negative sample numerical vector are then each mapped into a space of lower dimension, which reduces the amount of data to be processed and improves the efficiency of model training.
An encoder can convert an input multilingual character string into a continuous numerical vector. Therefore, in step 201, when numerically encoding the standard character string, the positive sample character string, and the negative sample character string in a sample training set, a Transformer encoder with strong representation capability may be used. During encoding, the input to the encoder at each time step is a single character; that is, the string is numerically encoded character by character. Of course, in some possible implementations, the encoder may also be a CNN- or RNN-based encoder.
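As a concrete illustration of the character-per-time-step input, each character can be mapped to an integer index before it enters the encoder. The toy vocabulary and the choice of index 0 for out-of-vocabulary characters are assumptions of this sketch, not details from the embodiments:

```python
def numeric_encode(s, vocab):
    # one character per time step: map each character to an integer index;
    # index 0 is reserved here for characters outside the vocabulary
    return [vocab.get(ch, 0) for ch in s]

# a toy vocabulary: lowercase letters and the space character
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ", start=1)}
print(numeric_encode("liu ying", vocab))  # [12, 9, 21, 27, 25, 9, 14, 7]
```

A Transformer, CNN, or RNN encoder would then turn such an index sequence into a continuous numerical vector.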
After numerical encoding by the encoder, the encoded vectors can be mapped to a low-dimensional space by a fully connected layer. The fully connected head is a module connected immediately after the encoder, and the loss function is attached after the fully connected layer. The fully connected head projects the numerical-vector representation learned by the encoder into a lower-dimensional vector space to obtain the characterization vector, so that the downstream loss function can evaluate it.
In step 203, when the numerical vectors obtained after encoding by the encoder are mapped to the first dimension space, the standard numerical vector, the positive sample numerical vector, and the negative sample numerical vector may each be multiplied by a matrix to obtain the standard characterization vector corresponding to the standard character string, the positive sample characterization vector corresponding to the positive sample character string, and the negative sample characterization vector corresponding to the negative sample character string. The matrix should have a number of rows equal to the length of the numerical vector and a number of columns smaller than its number of rows. For example, if the standard numerical vector obtained from the encoder is a 1 × n vector, it may be multiplied by an n × m matrix to obtain a 1 × m vector. In the n × m matrix, the value of m should be smaller than that of n, so that the encoded numerical vector is mapped into a space of lower dimension.
Of course, in one possible implementation, the encoded standard numerical vector, positive sample numerical vector, and negative sample numerical vector may differ in length. It is therefore easy to understand that the standard numerical vector, the positive sample numerical vector, and the negative sample numerical vector may need to be multiplied by matrices of different sizes when mapped into the same dimension space. For example, suppose the standard numerical vector has dimension 1 × 16, the positive sample numerical vector has dimension 1 × 16, the negative sample numerical vector has dimension 1 × 32, and the required characterization vectors have dimension 1 × 8. Then the matrix multiplied with the standard numerical vector and the positive sample numerical vector has dimension 16 × 8, while the matrix multiplied with the negative sample numerical vector has dimension 32 × 8.
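The shape bookkeeping in the two examples above can be checked directly; a sketch with random stand-in vectors (the values are meaningless, only the dimensions matter):

```python
import numpy as np

rng = np.random.default_rng(0)
standard = rng.normal(size=(1, 16))   # standard numerical vector, 1 x 16
positive = rng.normal(size=(1, 16))   # positive sample numerical vector, 1 x 16
negative = rng.normal(size=(1, 32))   # negative sample numerical vector, 1 x 32

W16 = rng.normal(size=(16, 8))        # rows equal the input length; 8 columns < 16
W32 = rng.normal(size=(32, 8))        # a differently sized matrix for the 1 x 32 input

# all three products land in the same 1 x 8 characterization space
print((standard @ W16).shape, (positive @ W16).shape, (negative @ W32).shape)
# (1, 8) (1, 8) (1, 8)
```

In a trained model W16 and W32 would be the learned weights of the fully connected head rather than random matrices.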
Further, in step 105, a loss function value is calculated using the characterization vectors of the sample training sets.
After each sample training set has been encoded, the loss function value among the standard characterization vector, the positive sample characterization vector, and the negative sample characterization vector is calculated using the characterization vectors obtained from each sample training set. In order for the trained model to learn that the distance between the embedding vectors of two samples of the same class is sufficiently small and the distance between the embedding vectors of two samples of different classes is sufficiently large, the loss function among the standard character string, the positive sample character string, and the negative sample character string should satisfy certain conditions. For example, the similarity between the positive sample character string and the standard character string in each sample training set is greater than a first similarity threshold, the similarity between the negative sample character string and the standard character string is less than a second similarity threshold, and the first similarity threshold is greater than the second similarity threshold.
It will be readily appreciated that in one possible implementation, the first similarity threshold should be much greater than the second similarity threshold. The information thus learned can ensure that the distance between the positive sample string and the standard string is sufficiently small, and the distance between the negative sample string and the standard string is sufficiently large. Therefore, when two character strings representing the same object are coded by using the trained character coding model, the similarity degree of two vectors obtained after coding is very high. And when two character strings representing different objects are encoded, the similarity degree of two vectors obtained after encoding is very low. Thus, whether the two character strings are matched or not can be determined according to the similarity more accurately.
In one possible implementation, step 105 may use the following formula when calculating the loss function value using the characterization vectors of the sample training sets:

L = (1/N) · Σ_{i=1}^{N} max( ‖f(x_i) − f(x_i⁺)‖₂² − ‖f(x_i) − f(x_i⁻)‖₂² + ε, 0 )

wherein L is used for characterizing the loss function value, N is used for characterizing the number of sample training sets, f(x_i) is the standard characterization vector of the corresponding standard character string in the i-th sample training set, f(x_i⁺) is the positive sample characterization vector of the corresponding positive sample character string in the i-th sample training set, f(x_i⁻) is the negative sample characterization vector of the corresponding negative sample character string in the i-th sample training set, and ε is a hyper-parameter balancing the similarity measure and the dissimilarity measure, wherein the similarity measure characterizes the degree of similarity between the positive sample character string and the standard character string, and the dissimilarity measure characterizes the degree of difference between the negative sample character string and the standard character string.
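The loss described here is the standard triplet (ternary) loss; a minimal NumPy sketch, with illustrative names and toy data, might look as follows.

```python
import numpy as np

# L = (1/N) * sum_i max(||f(x_i) - f(x_i+)||^2 - ||f(x_i) - f(x_i-)||^2 + eps, 0)
def triplet_loss(anchor, positive, negative, eps=0.2):
    """anchor/positive/negative: (N, d) arrays of characterization vectors."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # squared distance to positives
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # squared distance to negatives
    return np.mean(np.maximum(d_pos - d_neg + eps, 0.0))

# Toy check: positives close to the anchor, negatives far away -> loss 0.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
p = a + 0.01
n = a + 10.0
print(triplet_loss(a, p, n))  # 0.0, since d_neg dominates d_pos + eps
```

Swapping the roles of the close and the distant samples drives the loss well above 0, matching the behaviour discussed below.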
The ternary (triplet) loss function can be used as the training objective for deep metric learning in a contrastive setting, measuring the similarity and difference of the input name triples. Given an input sample triple (x, x⁺, x⁻), a function f is learned through a neural network that encodes an input vector into a characterization vector, so that the standard characterization vector of the standard character string x is as similar as possible to the positive sample characterization vector of the positive sample character string x⁺ in the same category, and as different as possible from the negative sample characterization vector of the negative sample character string x⁻ in a different category. In this way, the ternary loss function maximizes the distance between the characterization vectors of non-homogeneous samples while minimizing the distance between the characterization vectors of two homogeneous samples.
For example, in the present embodiment, ‖f(x_i) − f(x_i⁺)‖₂² is used to calculate the distance between the characterization vectors corresponding to the positive sample character string and the standard character string, and ‖f(x_i) − f(x_i⁻)‖₂² is used to calculate the distance between the characterization vectors corresponding to the negative sample character string and the standard character string. The loss function value is then obtained by computing the difference between the two and taking the maximum of that difference and 0. When the difference between the two is greater than 0, the larger the difference, the larger the loss function value, and the worse the model performs at that point; when the difference between the two is less than 0, the larger the absolute value of the difference, the closer the loss function value approaches 0, and the better the model performs at that point.
Further, in the above formula for calculating the loss function value, the degree of similarity between the positive sample character string and the standard character string and the degree of difference between the negative sample character string and the standard character string are balanced by the hyper-parameter ε set in advance, which adjusts the difference between ‖f(x_i) − f(x_i⁺)‖₂² and ‖f(x_i) − f(x_i⁻)‖₂² and further improves the reliability of the loss function value. Here ε may be an empirical value.
Finally, in step 107, the character encoding model is trained based on the loss function values.
In this step, once the loss function value is obtained, the character encoding model may be trained using a stochastic gradient descent algorithm.
In addition, in one possible implementation, after the loss function values are obtained, different weights may be assigned to different sample training sets. After the weights corresponding to the different sample training sets are obtained, the weighted sum of the loss functions is taken as the learning objective, and the model parameters are optimized by back-propagation with stochastic gradient descent, so that a character encoding model with better performance is obtained through training.
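As an illustration of this training step, the sketch below runs one stochastic-gradient-descent update for a hypothetical linear encoder f(x) = xW under the triplet loss; the encoder form, shapes, learning rate and margin are all assumptions for the sketch, not the patent's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d_in, d_out, eps, lr = 4, 16, 8, 0.2, 0.01

X  = rng.normal(size=(N, d_in))            # standard numerical vectors
Xp = X + 0.1 * rng.normal(size=(N, d_in))  # positive samples stay close to X
Xn = X + 0.2 * rng.normal(size=(N, d_in))  # toy negatives kept nearby so the margin is active
W  = 0.1 * rng.normal(size=(d_in, d_out))  # hypothetical linear encoder f(x) = x @ W

def loss(W):
    dp = np.sum((X @ W - Xp @ W) ** 2, axis=1)  # squared distances to positives
    dn = np.sum((X @ W - Xn @ W) ** 2, axis=1)  # squared distances to negatives
    return np.mean(np.maximum(dp - dn + eps, 0.0))

def grad(W):
    # Exact gradient of the hinge-style triplet loss w.r.t. W;
    # only triplets inside the margin ("active") contribute.
    A, P, Q = X @ W, Xp @ W, Xn @ W
    active = (np.sum((A - P) ** 2, 1) - np.sum((A - Q) ** 2, 1) + eps) > 0
    g = np.zeros_like(W)
    for i in np.where(active)[0]:
        g += 2.0 * np.outer(X[i] - Xp[i], A[i] - P[i])
        g -= 2.0 * np.outer(X[i] - Xn[i], A[i] - Q[i])
    return g / N

before = loss(W)
W = W - lr * grad(W)    # one SGD step
print(before, loss(W))  # loss before and after the update
```

With a small learning rate the update follows the exact gradient, so the loss should not increase after the step; per-set weights would simply scale each triplet's contribution inside `loss` and `grad`.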
As shown in fig. 3, an embodiment of the present specification further provides a character matching method, which may include the following steps:
step 301: acquiring a first character string and a second character string to be matched;
step 303: inputting the first character string and the second character string into a character coding model trained by the training method of the character coding model provided by any one of the embodiments to obtain a first characterization vector corresponding to the first character string and a second characterization vector corresponding to the second character string;
step 305: and calculating the similarity between the first token vector and the second token vector, and determining the matching degree between the first character string and the second character string.
In this embodiment, when determining whether the obtained first character string and the obtained second character string are matched, the first character string and the second character string may be first input into the character coding model obtained by training in any of the embodiments, respectively, to obtain a first token vector of the first character string and a second token vector of the second character string. Then, the similarity between the first token vector and the second token vector is calculated, and whether the first character string and the second character string are matched can be determined.
The character encoding model provided by any one of the above embodiments fully considers the situation that the same object has multiple spellings during training, and minimizes the distance between the positive sample character string and the standard character string and maximizes the distance between the negative sample character string and the standard character string when solving the loss function. In this way, after the two character strings are input into the character coding model, if the similarity between the two character strings is small, the difference of the two characterization vectors output by the model is larger; and if the similarity between the two character strings is larger, the difference of the two characterization vectors output by the model is smaller, so that whether the two character strings are matched or not can be conveniently judged based on the difference.
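To make the matching flow concrete, the sketch below stands in for the trained character coding model with a toy bag-of-characters encoder and compares two strings by cosine similarity; the encoder and the example names are assumptions — the patent's encoder is a learned model, not a character-count vector.

```python
import math

# Toy stand-in for the trained character coding model: a bag-of-characters
# count vector over 'a'-'z'. For illustration only.
def encode(s):
    vec = [0.0] * 26
    for ch in s.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Two spellings of the same name score higher than two different names.
same = cosine(encode("Muhammad"), encode("Mohammed"))
diff = cosine(encode("Muhammad"), encode("Elizabeth"))
print(same > diff)  # True
```

A trained character coding model plays the role of `encode` here, but produces characterization vectors whose cosine similarity reflects whether the two strings denote the same object.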
In one possible implementation, as shown in fig. 4, when calculating the similarity between the first token vector and the second token vector to determine the matching degree between the first character string and the second character string, step 305 may be implemented by:
step 401: calculating a cosine value between the first token vector and the second token vector;
step 403: judging the size relation between the cosine value obtained by calculation and a preset first matching threshold value;
step 405: if the obtained cosine value is not smaller than a preset first matching threshold value, determining that the first character string is matched with the second character string;
step 407: and if the obtained cosine value is smaller than a preset first matching threshold value, determining that the first character string is not matched with the second character string.
In this embodiment, when determining whether the first character string and the second character string are matched by calculation, a cosine value between the first token vector and the second token vector may be calculated, and a relationship between the cosine value and a preset first matching threshold may be determined. If the obtained cosine value is smaller than the first matching threshold, the similarity of the two vectors is low, namely the first character string and the second character string are not matched. And if the obtained cosine value is not less than the first matching threshold value, the similarity of the two vectors is higher, namely the first character string is matched with the second character string.
It will be readily appreciated that the cosine value lies in the range [-1, 1], and therefore the first matching threshold should also take a value in this range.
In one possible implementation, step 401 may calculate the cosine value between the first characterization vector and the second characterization vector using the following formula:

cos θ = ( Σ_{i=1}^{n} A_i·B_i ) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )

wherein cos θ is used for characterizing the cosine value between the first characterization vector and the second characterization vector, n is used for characterizing the number of elements in the first characterization vector and the second characterization vector, A_i is used for characterizing the i-th element of the first characterization vector, and B_i is used for characterizing the i-th element of the second characterization vector.
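The cosine formula here can be written out directly; the function name and example vectors below are illustrative assumptions.

```python
import math

def cosine_value(a, b):
    """cos(theta) = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_value([1.0, 0.0], [1.0, 0.0]))  # 1.0: identical directions
print(cosine_value([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal vectors
```

The result is then compared with the first matching threshold as described in steps 403 through 407.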
In one possible implementation, in order to improve the accuracy of the matching result, a third matching threshold and a fourth matching threshold may also be set, where the third matching threshold takes a value in (0, 1) and the fourth matching threshold takes a value in (-1, 0). In this way, if the obtained cosine value is greater than or equal to the third matching threshold, it can be determined that the first character string matches the second character string; if the obtained cosine value is less than the fourth matching threshold, it can be determined that the first character string and the second character string do not match; and if the obtained cosine value lies between the fourth matching threshold and the third matching threshold, whether the two character strings match may be further determined by other methods. When setting these thresholds, the third matching threshold should be close to 1 and the fourth matching threshold close to -1, which makes the determination of whether the first character string matches the second character string more accurate.
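The dual-threshold decision described in this paragraph might be sketched as follows; the concrete threshold values and the label for the undecided case are assumptions.

```python
THIRD_THRESHOLD = 0.9    # in (0, 1), close to 1
FOURTH_THRESHOLD = -0.9  # in (-1, 0), close to -1

def match_decision(cosine_value):
    if cosine_value >= THIRD_THRESHOLD:
        return "match"
    if cosine_value < FOURTH_THRESHOLD:
        return "no_match"
    return "uncertain"  # fall back to other methods for a final decision

print(match_decision(0.95))   # match
print(match_decision(-0.95))  # no_match
print(match_decision(0.2))    # uncertain
```

Only clearly similar or clearly dissimilar pairs are decided by the cosine value alone; the middle band is deferred, which is what improves the accuracy of the matching result.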
As shown in fig. 5, an embodiment of the present specification further provides an apparatus for training a character encoding model, where the apparatus may include: a sample acquisition module 501, a coding processing module 502, a loss calculation module 503 and a model training module 504;
a sample obtaining module 501 configured to obtain at least two sample training sets; wherein each sample training set comprises: a standard string, a positive sample string and a negative sample string; the positive sample character string in each sample training set is the same as the object represented by the standard character string, the negative sample character string is different from the object represented by the standard character string, and the positive sample character string is different from the standard character string;
the encoding processing module 502 is configured to perform encoding processing on each sample training set acquired by the sample acquisition module 501 to obtain a characterization vector corresponding to each sample training set;
a loss calculation module 503 configured to calculate a loss function value by using the characterization vectors of the training sets of the samples obtained by the encoding processing module 502;
a model training module 504 configured to train the character coding model according to the loss function value obtained by the loss calculating module 503.
In a possible implementation manner, if the standard character strings in the sample training set acquired by the sample acquisition module 501 include: a string corresponding to a name of an object;
then, the positive sample string may include: a character string whose corresponding object has the same ID as that of the object corresponding to the standard character string and whose spelling form differs from that of the standard character string.
In a possible implementation manner, if the standard character strings in the sample training set acquired by the sample acquisition module 501 include: a string corresponding to a name of an object;
then, the negative sample string may include: a character string whose corresponding object has a different ID from that of the object corresponding to the standard character string and whose spelling form differs from that of the standard character string.
In a possible implementation manner, when the coding processing module 502 performs coding processing on each sample training set to obtain a characterization vector corresponding to each sample training set, the following operations are performed for each sample training set:
carrying out numerical value coding on the standard character string, the positive sample character string and the negative sample character string in the current sample training set to respectively obtain a standard numerical value vector of the corresponding standard character string of the current sample training set, a positive sample numerical value vector of the corresponding positive sample character string and a negative sample numerical value vector of the corresponding negative sample character string;
mapping the standard numerical vector, the positive sample numerical vector and the negative sample numerical vector of the current sample training set to a first dimension space to respectively obtain a standard characterization vector corresponding to the standard character string, a positive sample characterization vector corresponding to the positive sample character string and a negative sample characterization vector corresponding to the negative sample character string of the current sample training set; the dimension of the first dimension space is smaller than the dimension of the dimension space where any one of the standard numerical value vector, the positive sample numerical value vector and the negative sample numerical value vector is located.
In one possible implementation, the condition satisfied by the loss function value among the standard character string, the positive sample character string and the negative sample character string in the loss calculation module 503 may include: the similarity between the positive sample character string and the standard character string in each sample training set is greater than a first similarity threshold, the similarity between the negative sample character string and the standard character string is less than a second similarity threshold, and the first similarity threshold is greater than the second similarity threshold.
In one possible implementation, the loss calculation module 503, when calculating the loss function value using the characterization vectors of the respective sample training sets, is configured to perform the following operations:
the loss function value is calculated using the following calculation:
Figure BDA0003699860830000161
wherein, L is used for representing the loss function value, N is used for representing the number of the sample training set, and f (x) i ) A canonical characterization vector for characterizing the corresponding canonical string in the ith sample training set,
Figure BDA0003699860830000162
a positive sample characterization vector for characterizing a corresponding positive sample string in the ith sample training set,
Figure BDA0003699860830000163
the negative sample characterization vector is used for characterizing a negative sample character string corresponding to the ith sample training set, the epsilon is used for characterizing a hyperparameter balancing similarity measurement and dissimilarity measurement, the similarity measurement is used for characterizing the similarity degree of a positive sample character string and a standard character string, and the dissimilarity measurement is used for characterizing the difference degree of the negative sample character string and the standard character string.
As shown in fig. 6, an embodiment of the present specification further provides a character matching apparatus, which may include: a character string acquisition module 601, a vector output module 602 and a similarity calculation module 603;
a character string obtaining module 601 configured to obtain a first character string and a second character string to be matched;
a vector output module 602, configured to input the first character string and the second character string acquired by the character string acquisition module 601 into the character coding model trained by the training apparatus for the character coding model according to any of the embodiments, respectively, to obtain a first token vector corresponding to the first character string and a second token vector corresponding to the second character string;
a similarity calculation module 603 configured to calculate a similarity between the first token vector and the second token vector output by the vector output module 602, and determine a matching degree between the first character string and the second character string.
In one possible implementation, when calculating the similarity between the first token vector and the second token vector to determine the matching degree between the first character string and the second character string, the similarity calculation module 603 is configured to perform the following operations:
calculating cosine values between the first characterization vector and the second characterization vector;
if the obtained cosine value is not smaller than a preset first matching threshold value, determining that the first character string is matched with the second character string;
and if the obtained cosine value is smaller than a preset first matching threshold value, determining that the first character string is not matched with the second character string.
The present specification also provides a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any one of the embodiments of the specification.
The present specification also provides a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method in any of the embodiments of the specification.
It is to be understood that the illustrated structure of the embodiments of the present specification does not constitute a specific limitation to the training apparatus and the character matching apparatus for the character encoding model. In other embodiments of the description, the training means of the character encoding model and the character matching means may comprise more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
For the information interaction, execution process and other contents between the units in the above-mentioned apparatus, because the same concept is based on as the method embodiment of this specification, specific contents can refer to the description in the method embodiment of this specification, and are not described herein again.
Those skilled in the art will recognize that in one or more of the examples described above, the functions described in this specification can be implemented in hardware, software, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above embodiments further describe the purpose, technical solutions and advantages of this specification in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (10)

1. The training method of the character coding model comprises the following steps:
acquiring at least two sample training sets; wherein, each sample training set comprises: a standard string, a positive sample string and a negative sample string; the positive sample character string in each sample training set is the same as the object represented by the standard character string, the negative sample character string is different from the object represented by the standard character string, and the positive sample character string is different from the standard character string;
coding each sample training set to obtain a characterization vector corresponding to each sample training set;
calculating a loss function value by using the characterization vectors of the sample training sets;
and training the character coding model according to the loss function value.
2. The method of claim 1, wherein the standard string comprises: a string corresponding to a name of an object;
the positive sample string comprises: a character string whose corresponding object has the same ID as that of the object corresponding to the standard character string and whose spelling form differs from that of the standard character string;
and/or,
the negative sample string comprises: a character string whose corresponding object has a different ID from that of the object corresponding to the standard character string and whose spelling form differs from that of the standard character string.
3. The method of claim 1, wherein the encoding for each sample training set to obtain a characterization vector for each sample training set comprises:
for each sample training set, performing:
carrying out numerical value coding on the standard character string, the positive sample character string and the negative sample character string in the current sample training set to respectively obtain a standard numerical value vector of the corresponding standard character string, a positive sample numerical value vector of the corresponding positive sample character string and a negative sample numerical value vector of the corresponding negative sample character string of the current sample training set;
mapping the standard numerical vector, the positive sample numerical vector and the negative sample numerical vector of the current sample training set to a first dimension space to respectively obtain a standard characterization vector corresponding to the standard character string, a positive sample characterization vector corresponding to the positive sample character string and a negative sample characterization vector corresponding to the negative sample character string of the current sample training set; and the dimension of the first dimension space is smaller than the dimension of the dimension space in which any one of the standard numerical value vector, the positive sample numerical value vector and the negative sample numerical value vector is located.
4. The method of claim 1, wherein the condition that the loss function values between the standard string, the positive sample string, and the negative sample string satisfy comprises: the similarity between the positive sample character string and the standard character string in each sample training set is greater than a first similarity threshold, the similarity between the negative sample character string and the standard character string is less than a second similarity threshold, and the first similarity threshold is greater than the second similarity threshold.
5. The method of any one of claims 1 to 4, wherein said calculating loss function values using the characterization vectors of the respective sample training sets comprises:
calculating the loss function value using the following formula:

L = (1/N) · Σ_{i=1}^{N} max( ‖f(x_i) − f(x_i⁺)‖₂² − ‖f(x_i) − f(x_i⁻)‖₂² + ε, 0 )

wherein L is used for characterizing the loss function value, N is used for characterizing the number of the sample training sets, f(x_i) is the standard characterization vector of the corresponding standard character string in the i-th sample training set, f(x_i⁺) is the positive sample characterization vector of the corresponding positive sample character string in the i-th sample training set, f(x_i⁻) is the negative sample characterization vector of the corresponding negative sample character string in the i-th sample training set, and ε is a hyper-parameter balancing the similarity measure and the dissimilarity measure, wherein the similarity measure characterizes the degree of similarity between the positive sample character string and the standard character string, and the dissimilarity measure characterizes the degree of difference between the negative sample character string and the standard character string.
6. The character matching method comprises the following steps:
acquiring a first character string and a second character string to be matched;
inputting the first character string and the second character string into the character coding model trained by the training method of the character coding model according to any one of claims 1 to 5, respectively, to obtain a first token vector corresponding to the first character string and a second token vector corresponding to the second character string;
and calculating the similarity between the first characterization vector and the second characterization vector, and determining the matching degree between the first character string and the second character string.
7. The method of claim 6, wherein said calculating a similarity between the first token vector and the second token vector determines a degree of match between the first string and the second string, comprising:
calculating cosine values between the first token vector and the second token vector;
if the obtained cosine value is not smaller than a preset first matching threshold value, determining that the first character string is matched with the second character string;
and if the obtained cosine value is smaller than a preset first matching threshold value, determining that the first character string is not matched with the second character string.
8. The training device of the character coding model comprises: the device comprises a sample acquisition module, a coding processing module, a loss calculation module and a model training module;
the sample acquisition module is configured to acquire at least two sample training sets; wherein each sample training set comprises: a standard string, a positive sample string and a negative sample string; the positive sample character string in each sample training set is the same as the object represented by the standard character string, the negative sample character string is different from the object represented by the standard character string, and the positive sample character string is different from the standard character string;
the coding processing module is configured to perform coding processing on each sample training set acquired by the sample acquisition module to obtain a characterization vector corresponding to each sample training set;
the loss calculation module is configured to calculate a loss function value by using the characterization vectors of the sample training sets obtained by the encoding processing module;
and the model training module is configured to train the character coding model according to the loss function value obtained by the loss calculation module.
9. A character matching apparatus comprising: the device comprises a character string acquisition module, a vector output module and a similarity calculation module;
the character string acquisition module is configured to acquire a first character string and a second character string to be matched;
the vector output module is configured to input the first character string and the second character string acquired by the character string acquisition module into the character coding model trained by the training device of the character coding model according to claim 8, so as to obtain a first token vector corresponding to the first character string and a second token vector corresponding to the second character string;
the similarity calculation module is configured to calculate similarity between the first token vector and the second token vector output by the vector output module, and determine a matching degree between the first character string and the second character string.
10. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-7.
CN202210686424.2A 2022-06-17 2022-06-17 Training method of character coding model, character matching method and device Pending CN115147849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210686424.2A CN115147849A (en) 2022-06-17 2022-06-17 Training method of character coding model, character matching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210686424.2A CN115147849A (en) 2022-06-17 2022-06-17 Training method of character coding model, character matching method and device

Publications (1)

Publication Number Publication Date
CN115147849A true CN115147849A (en) 2022-10-04

Family

ID=83408922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210686424.2A Pending CN115147849A (en) 2022-06-17 2022-06-17 Training method of character coding model, character matching method and device

Country Status (1)

Country Link
CN (1) CN115147849A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117272937A (en) * 2023-11-03 2023-12-22 腾讯科技(深圳)有限公司 Text coding model training method, device, equipment and storage medium
CN117272937B (en) * 2023-11-03 2024-02-23 腾讯科技(深圳)有限公司 Text coding model training method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2021135910A1 (en) Machine reading comprehension-based information extraction method and related device
CN108052512B (en) Image description generation method based on depth attention mechanism
CN111460807B (en) Sequence labeling method, device, computer equipment and storage medium
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN111931517B (en) Text translation method, device, electronic equipment and storage medium
CN116795973B (en) Text processing method and device based on artificial intelligence, electronic equipment and medium
CN106503231B (en) Search method and device based on artificial intelligence
CN110390049B (en) Automatic answer generation method for software development questions
CN111222330B (en) Chinese event detection method and system
CN113536795A (en) Method, system, electronic device and storage medium for entity relation extraction
CN115658846A (en) Intelligent search method and device suitable for open-source software supply chain
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN115147849A (en) Training method of character coding model, character matching method and device
CN113486174B (en) Model training, reading understanding method and device, electronic equipment and storage medium
CN110929532A (en) Data processing method, device, equipment and storage medium
CN112085091B (en) Short text matching method, device, equipment and storage medium based on artificial intelligence
CN110210035B (en) Sequence labeling method and device and training method of sequence labeling model
CN113705207A (en) Grammar error recognition method and device
CN110287487B (en) Master predicate identification method, apparatus, device, and computer-readable storage medium
CN114239559B (en) Text error correction and text error correction model generation method, device, equipment and medium
CN115630652A (en) Customer service session emotion analysis system, method and computer system
CN114169447B (en) Event detection method based on self-attention convolution bidirectional gating cyclic unit network
CN112966476B (en) Text processing method and device, electronic equipment and storage medium
CN114510561A (en) Answer selection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination