CN111626039A - Training method and device for text similarity recognition model and related equipment - Google Patents

Training method and device for text similarity recognition model and related equipment

Info

Publication number
CN111626039A
CN111626039A (application number CN202010456628.8A)
Authority
CN
China
Prior art keywords
text
sample
similarity
text sample
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010456628.8A
Other languages
Chinese (zh)
Inventor
李小娟
徐国强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202010456628.8A priority Critical patent/CN111626039A/en
Priority to PCT/CN2020/105662 priority patent/WO2021237928A1/en
Publication of CN111626039A publication Critical patent/CN111626039A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/194 - Calculation of difference between files
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of text recognition in artificial intelligence, and provides a training method and device for a text similarity recognition model and related equipment. The method comprises the following steps: acquiring a plurality of groups of first sample groups, each comprising a first text sample and a second text sample; taking each element whose literal similarity with the first text sample reaches a preset threshold as a third text sample; labeling the third text samples to obtain negative text samples and form a plurality of groups of second sample groups; representing the samples in each second sample group with representation vectors; calculating a first similarity and a second similarity; and adjusting model parameters according to the first similarity and the second similarity, repeating the steps from obtaining the representation vectors onwards, to obtain the trained text similarity recognition model. Implementing the method solves the problem of low recognition accuracy in prior-art text similarity recognition methods. The invention also relates to blockchain technology; the first sample groups and the second sample groups may be stored in blockchain nodes.

Description

Training method and device for text similarity recognition model and related equipment
Technical Field
The invention relates to the technical field of text recognition in artificial intelligence, in particular to a training method and a training device for a text similarity recognition model and related equipment.
Background
In recent years, with the continuous development of information technology, the amount of information generated has grown explosively. To better analyze, process and apply this information, various algorithms and models are generally used to calculate the similarity of information so that a search query can locate the target information. For example, in an information retrieval query process, the similarity between texts generally needs to be calculated so that the query can obtain the target text.
At present, the similarity between texts is generally calculated by various similarity algorithms built on keyword matching technology, such as the Jaccard similarity coefficient, cosine distance, Euclidean distance and TF-IDF. Although the existing text similarity calculation methods can obtain text similarity to a certain extent, language is highly complex: for texts that are literally alike but different in meaning, or literally different but the same in meaning, simple keyword matching gradually fails to meet current requirements, and the text similarity recognition accuracy is low.
In summary, the text similarity recognition method in the prior art has the problem of low recognition accuracy.
Disclosure of Invention
The invention provides a training method and a training device for a text similarity recognition model and related equipment, and aims to solve the problem that the recognition accuracy is low in a text similarity recognition method in the prior art.
The invention provides a training method of a text similarity recognition model, which comprises the following steps:
acquiring a plurality of groups of first sample groups, wherein each group of first sample group comprises a first text sample and a second text sample which are labeled in advance, and the first text sample and the second text sample have the same ideograph;
respectively calculating the literal similarity between each first text sample and each element in the LCQMC data set, and taking the element with the literal similarity reaching a preset threshold value as a third text sample;
receiving labeling information of the third text sample, obtaining negative text samples corresponding to the second text samples, and forming a plurality of groups of second sample groups comprising the first text samples, the second text samples and the negative text samples, wherein the negative text samples have different ideograms from the first text samples and the second text samples;
respectively representing the first text sample, the second text sample and the negative text sample in each second sample group with representation vectors through a text similarity recognition model;
calculating a first similarity between the first text sample and the second text sample of each second sample group according to the representation vector, and calculating a second similarity between the negative text sample and the second text sample of each second sample group;
when the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups does not exceed a preset ratio, adjusting the parameters used to obtain the representation vectors in the text similarity recognition model and repeating the steps from obtaining the representation vectors through comparing the ratio with the preset ratio, until the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups exceeds the preset ratio, and determining the current text similarity recognition model as the pre-trained text similarity recognition model.
The invention provides a training device of a text similarity recognition model, which comprises:
the first sample group acquisition module is used for acquiring a plurality of groups of first sample groups, each group of first sample group comprises a first text sample and a second text sample which are labeled in advance, and the first text sample and the second text sample have the same meaning;
the third text sample acquisition module is used for respectively calculating the literal similarity between each first text sample and each element in the LCQMC data set, and taking the element with the literal similarity reaching a preset threshold value as a third text sample;
the second sample group acquisition module is used for receiving labeling information of the third text sample, obtaining negative text samples corresponding to the second text samples, and forming a plurality of groups of second sample groups comprising the first text samples, the second text samples and the negative text samples, wherein the negative text samples have different meanings from the first text samples and the second text samples;
the representation vector obtaining module is used for respectively representing the first text sample, the second text sample and the negative text sample in each group of second sample groups with representation vectors through a text similarity recognition model;
the similarity obtaining module is used for calculating a first similarity between the first text sample and the second text sample of each group of second sample groups according to the representation vector and calculating a second similarity between the negative text sample and the second text sample of each group of second sample groups;
and the model obtaining module is used for: when the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups does not exceed a preset ratio, adjusting the parameters used to obtain the representation vectors in the text similarity recognition model and repeating the steps from obtaining the representation vectors through comparing the ratio with the preset ratio; and when the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups exceeds the preset ratio, determining the current text similarity recognition model as the pre-trained text similarity recognition model.
The invention provides computer equipment which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the training method of the text similarity recognition model.
The invention provides a computer readable storage medium, which stores a computer program, and the computer program is executed by a processor to realize the steps of the training method of the text similarity recognition model provided by the invention.
According to the training method, training device and related equipment for the text similarity recognition model, a plurality of groups of first sample groups are first acquired, each group including a pre-labeled first text sample and second text sample with the same expressed meaning. The literal similarity between each first text sample and each element in the LCQMC data set is then calculated, and each element whose literal similarity reaches a preset threshold is taken as a third text sample. Labeling information for the third text samples is then received to obtain the negative text sample corresponding to each second text sample, forming a plurality of groups of second sample groups each comprising a first text sample, a second text sample and a negative text sample, where the negative text sample differs in expressed meaning from the first text sample and the second text sample. The first text sample, the second text sample and the negative text sample in each second sample group are then represented with representation vectors through the text similarity recognition model, and a first similarity between the first text sample and the second text sample and a second similarity between the negative text sample and the second text sample of each second sample group are calculated from the representation vectors. Finally, when the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups does not exceed a preset ratio, the parameters used to obtain the representation vectors in the text similarity recognition model are adjusted and the steps from obtaining the representation vectors through comparing the ratio with the preset ratio are repeated, until the ratio exceeds the preset ratio, at which point the current text similarity recognition model is determined to be the pre-trained text similarity recognition model. Through this method, samples of three sample types are input simultaneously to train the text similarity recognition model, so that the second text sample, which is similar to the first text sample, is distinguished from the negative text sample, which is dissimilar to it. This avoids the model, in the subsequent actual recognition process, treating texts that are literally alike but different in meaning as similar, and solves the problem of low recognition accuracy in prior-art text similarity recognition methods.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without inventive labor.
FIG. 1 is an illustration of an application environment of a method for training a text similarity recognition model according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a training method of a text similarity recognition model according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating step 12 of a training method for a text similarity recognition model according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating step 122 of a method for training a text similarity recognition model according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating step 14 of a method for training a text similarity recognition model according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating step 15 of a training method for a text similarity recognition model according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating a method for training a text similarity recognition model according to an embodiment of the present invention;
FIG. 8 is a block diagram of a training apparatus for a text similarity recognition model according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of another module of the training apparatus for text similarity recognition model according to the embodiment of the present invention;
FIG. 10 is a block diagram of a computer device of an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any inventive step, are within the scope of the present invention.
The training method of the text similarity recognition model provided by the first embodiment of the present invention can be applied to the application environment shown in fig. 1. A server obtains multiple groups of first sample groups from a client, each group including a pre-labeled first text sample and second text sample with the same expressed meaning. The server then calculates the literal similarity between each first text sample and each element in an LCQMC (Large-scale Chinese Question Matching Corpus) data set, and takes each element whose literal similarity reaches a preset threshold as a third text sample. It receives labeling information for the third text samples, obtains the negative text sample corresponding to each second text sample, and forms multiple groups of second sample groups each including a first text sample, a second text sample and a negative text sample, where the negative text sample differs in expressed meaning from the first text sample and the second text sample. The first text sample, the second text sample and the negative text sample in each second sample group are then represented with representation vectors through the text similarity recognition model, after which a first similarity between the first text sample and the second text sample and a second similarity between the negative text sample and the second text sample of each second sample group are calculated from the representation vectors. Finally, when the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups does not exceed a preset ratio, the parameters used to obtain the representation vectors in the text similarity recognition model are adjusted and the steps from obtaining the representation vectors through comparing the ratio with the preset ratio are repeated, until the ratio exceeds the preset ratio, at which point the current text similarity recognition model is determined to be the pre-trained text similarity recognition model. The server can be a device with data processing capability, and can be implemented by an independent server or a server cluster composed of a plurality of servers. The client can be an independently developed APP, an applet, a web page, a public account, and the like. The client can be used in conjunction with a terminal device, which may be, but is not limited to, various personal computers, notebook computers, smartphones, tablets and portable wearable devices.
In the embodiment of the present invention, as shown in fig. 2, a method for training a text similarity recognition model is provided, which is described by taking the method applied to the server side in fig. 1 as an example, and includes the following steps 11 to 16.
Step 11: and acquiring a plurality of groups of first sample groups, wherein each group of first sample group comprises a first text sample and a second text sample which are labeled in advance, and the first text sample and the second text sample have the same meaning.
The first text samples and the second text samples in each group of first sample groups correspond one to one; in application, the first text samples and the second text samples are labeled in advance, and the expression intentions of a first text sample and its second text sample are close to each other. For example, as shown in Table (1) below:

First text sample | Second text sample
What flour and eggs can do | What the egg and flour can do
How to make the nose higher? | How to make the nose higher
How to feed the pet | How to feed the pet

Table (1)
In this embodiment, a plurality of first sample groups may be obtained from existing text libraries such as data sets and dictionaries.
It should be noted that, in order to further ensure the privacy and security of the first sample group, the first sample group may be stored in a node of a blockchain.
Step 12: and respectively calculating the literal similarity between each first text sample and each element in the LCQMC data set, and taking the element with the literal similarity reaching a preset threshold value as a third text sample.
Specifically, third text samples may be obtained from the elements in the LCQMC data set by way of a coarse preliminary screening.
Further, as an implementation manner of this embodiment, as shown in fig. 3, the step 12 specifically includes the following steps 121 to 123:
step 121: the individual tokens in the first text sample, the average length of the elements in the LCQMC data set, and the length of any element in the LCQMC data set are obtained.
Specifically, word segmentation processing is performed on the first text sample, so that each granulated word segmentation is obtained. In addition, an element may be composed of a series of characters, for example, an element may be a phrase, or the like. The length of an element specifically refers to the character length.
Step 122: and calculating the literal similarity between the first text sample and each element in the LCQMC data set according to each participle, the frequency of occurrence of each participle, the average length of the elements and the length of any element in the LCQMC data set.
In step 122, the literal similarity between the first text sample and an element in the LCQMC data set may be calculated according to the following formula (1):

$$\mathrm{Score}(Q,d)=\sum_{i=1}^{n} W_i\cdot\frac{f_i\,(k_1+1)}{f_i+K}\cdot\frac{qf_i\,(k_2+1)}{qf_i+k_2}\qquad(1)$$

where Score(Q, d) represents the literal similarity between the first text sample and one element of the LCQMC data set, Q represents the first text sample, d represents one element of the LCQMC data set, q_i represents the i-th participle in the first text sample and n the number of participles, W_i represents the weight of q_i given by formula (2) below, f_i represents the number of occurrences of q_i in d, qf_i represents the number of occurrences of q_i in Q, K is the length factor given by formula (4) below (which depends on dl, the length of the element, and avgdl, the average length of the elements in the LCQMC data set), and k_1, k_2 and b are adjustment factors. k_1 and b are usually set empirically, typically k_1 = 2 and b = 0.75.
It is noted that in order to be able to obtain the literal similarity between the first text sample and each element in the LCQMC data set, the calculation is performed using the above equation (1) repeatedly.
Further, as an implementation manner of this embodiment, as shown in fig. 4, the step 122 specifically includes the following steps 1221 to 1222:
step 1221: and calculating according to each participle, the occurrence frequency of each participle and the length of the element to obtain the correlation score between each participle and the element.
Specifically, the relevance weight of each participle can be calculated according to the following formula (2):

$$W_i=\log\frac{N-n(q_i)+0.5}{n(q_i)+0.5}\qquad(2)$$

where N represents the number of elements in the LCQMC data set, q_i represents the i-th participle in the first text sample, and n(q_i) represents the number of elements containing the i-th participle of the first text sample.
Step 1222: and calculating the literal similarity between the first text sample and each element in the LCQMC data set according to the correlation score and the average length of the elements.
Specifically, a correlation score between the i-th participle in the first text sample and an element in the LCQMC data set may first be calculated by the following formula (3):

$$R(q_i,d)=\frac{f_i\,(k_1+1)}{f_i+K}\cdot\frac{qf_i\,(k_2+1)}{qf_i+k_2}\qquad(3)$$

where R(q_i, d) represents the correlation score between the i-th participle in the first text sample and an element in the LCQMC data set, d represents an element in the LCQMC data set, f_i represents the number of occurrences of the participle in the element, qf_i represents the number of occurrences of the participle in the first text sample, k_1 and k_2 represent adjustment factors, and K represents an adjustable parameter greater than 0.
Specifically, the adjustable parameter K may be calculated according to the following formula (4):

$$K=k_1\cdot\left(1-b+b\cdot\frac{dl}{avgdl}\right)\qquad(4)$$

where dl represents the length of the element in the LCQMC data set, avgdl represents the average length of the elements in the LCQMC data set, and k_1 and b are adjustment factors.
Specifically, formula (1) is obtained by combining formula (2), formula (3) and formula (4), which yields the literal similarity between the first text sample and each element in the LCQMC data set.
Through the implementation of the above steps 1221 to 1222, the literal similarity between the first text sample and each element in the LCQMC data set can be calculated.
Step 123: and taking the element with the literal similarity reaching the preset threshold as a third text sample.
The larger the preset threshold, the greater the proportion of literal overlap between the third text sample and the first text sample.
Through the implementation of the above steps 121 to 123, the literal similarity between the first text sample and an element in the LCQMC data set can be calculated, so as to obtain a third text sample with higher literal similarity to the first text sample.
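To make the screening in steps 121 to 123 concrete, the following is a minimal Python sketch of formulas (1) to (4). It assumes whitespace tokenization in place of a real Chinese word segmenter, an in-memory list of LCQMC elements, and illustrative values for k_2 and the preset threshold, none of which the embodiment fixes.

```python
import math
from collections import Counter

def literal_similarity(query, element, corpus, k1=2.0, k2=1.0, b=0.75):
    """BM25-style literal similarity, formulas (1)-(4); query/element are token lists."""
    N = len(corpus)                                  # number of elements in the data set
    avgdl = sum(len(d) for d in corpus) / N          # average element length
    dl = len(element)                                # length of this element
    K = k1 * (1 - b + b * dl / avgdl)                # formula (4)
    tf = Counter(element)                            # f_i: occurrences of q_i in d
    qf = Counter(query)                              # qf_i: occurrences of q_i in Q
    score = 0.0
    for q in set(query):
        n_q = sum(1 for d in corpus if q in d)                    # elements containing q_i
        w = math.log((N - n_q + 0.5) / (n_q + 0.5))               # formula (2): weight W_i
        r = (tf[q] * (k1 + 1) / (tf[q] + K)) \
            * (qf[q] * (k2 + 1) / (qf[q] + k2))                   # formula (3): R(q_i, d)
        score += w * r                                            # formula (1): sum of W_i * R
    return score

# Step 123: keep the elements whose literal similarity reaches the preset threshold.
corpus = [e.split() for e in ["what can flour and eggs do",
                              "how to feed a pet",
                              "how to make the nose smaller"]]
query = "what flour and eggs can do".split()
threshold = 1.0   # assumed value
third_text_samples = [d for d in corpus
                      if literal_similarity(query, d, corpus) >= threshold]
```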
Step 13: and receiving labeling information of the third text sample, obtaining negative text samples corresponding to the second text samples, and forming a plurality of groups of second sample groups comprising the first text samples, the second text samples and the negative text samples, wherein the negative text samples have different ideograms from the first text samples and the second text samples.
Specifically, each third text sample can be finely labeled by manual inspection; a negative text sample obtained by this labeling is a text that has high literal similarity to the second text sample but a different expression intention. For example, the relationship among the first text sample, the second text sample and the negative text sample may be as shown in Table (2) below:

First text sample | Second text sample | Negative text sample
What flour and eggs can do | What the egg and flour can do | Only flour and eggs are eaten
How to make the nose higher | How to make the nose higher | How to make the nose smaller
How to feed the pet | How to feed the pet | How to feed the pet of other people

Table (2)
It should be noted that, in order to further ensure the privacy and security of the second sample set, the second sample set may be stored in a node of a blockchain.
Step 14: and respectively representing the first text sample, the second text sample and the negative text sample in each second sample group by using a text similarity recognition model.
Each text sample (in this embodiment, "text sample" refers to any one of the first text sample, the second text sample and the negative text sample) yields a corresponding representation vector.
Further, as an implementation manner of this embodiment, as shown in fig. 5, the step 14 specifically includes the following steps 141 to 142:
step 141: and respectively obtaining initial vectors of the first text sample, the second text sample and the negative text sample through an embedding layer in the text similarity recognition model.
Specifically, the texts of the first text sample, the second text sample and the negative text sample are first passed through the embedding layer in the text similarity recognition model, which produces the initial vectors corresponding to the characters or phrases in each text. It should be noted that, in this embodiment, an initial vector consists of the feature values, in multiple dimensions, of each character or phrase in the text of the first text sample, the second text sample or the negative text sample, and the feature values in the initial vector are not yet associated with one another.
Step 142: the bidirectional long short-term memory (BiLSTM) network in the text similarity recognition model obtains, from the initial vectors, the representation vector corresponding to the first text sample, the representation vector corresponding to the second text sample, and the representation vector corresponding to the negative text sample.
Specifically, the BiLSTM network in the text similarity recognition model takes the initial vectors corresponding to the characters or phrases in the text of each text sample, that is, the feature values of those characters or phrases in multiple dimensions, and interconnects the contexts within the text sample to obtain feature values of the whole text sample in multiple dimensions; these whole-sample feature values serve as the representation vector.
Through the implementation of steps 141 to 142, each of the first text sample, the second text sample and the negative text sample is treated as a whole, and the context features within each text sample are associated to obtain the representation vector corresponding to the first text sample, the representation vector corresponding to the second text sample and the representation vector corresponding to the negative text sample. Feature extraction for the three samples is therefore more accurate, which improves the accuracy of text similarity recognition.
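As an illustration of steps 141 to 142, the following sketch assumes a PyTorch implementation (the embodiment names no framework): an embedding layer produces the initial vectors and a BiLSTM associates context. Pooling the BiLSTM outputs into a single representation vector by averaging, as well as the vocabulary and dimension sizes, are assumptions made for the example.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embedding layer (step 141) followed by a BiLSTM (step 142)."""
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) character/phrase indices
        initial = self.embedding(token_ids)   # initial vectors: per-token feature values,
                                              # not yet associated with each other
        context, _ = self.bilstm(initial)     # context-associated features
        return context.mean(dim=1)            # one representation vector per text sample

encoder = TextEncoder()
first = torch.randint(0, 5000, (4, 12))      # dummy ids standing in for 4 first text samples
second = torch.randint(0, 5000, (4, 12))
negative = torch.randint(0, 5000, (4, 12))
v_first, v_second, v_neg = encoder(first), encoder(second), encoder(negative)
```

A single encoder with shared weights is applied to all three samples of a second sample group, so that their representation vectors live in the same feature space.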
Step 15: a first similarity between the first text sample and the second text sample of each second sample group is calculated from the representation vector, and a second similarity between the negative text sample and the second text sample of each second sample group is calculated.
Further, as an implementation manner of this embodiment, as shown in fig. 6, the step 15 specifically includes the following steps 151 to 152:
step 151: and respectively obtaining the feature vectors on each dimension in the representation vectors representing the first text sample, the second text sample and the negative text sample.
The feature vector of the first text sample consists of the values of its representation vector in each dimension; likewise for the second text sample and the negative text sample. It should be noted that the representation vectors of the first text sample, the second text sample and the negative text sample should have the same number and kinds of dimensions.
Step 152: and calculating the first similarity and the second similarity according to the feature vectors in all dimensions.
Specifically, the first similarity may be calculated according to the following formula (5):

$$\cos\theta=\frac{\sum_{i=1}^{n}A_i B_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\cdot\sqrt{\sum_{i=1}^{n}B_i^{2}}}\qquad(5)$$

where cos θ represents the similarity between the second text sample and the first text sample, A represents the representation vector of the first text sample, B represents the representation vector of the second text sample, n represents the number of dimensions of the representation vectors, A_i represents the feature value of the representation vector of the first text sample in the i-th dimension, and B_i represents the feature value of the representation vector of the second text sample in the i-th dimension.
In addition, the method for calculating the second similarity between the negative text sample and the second text sample is the same as the method for calculating the first similarity between the second text sample and the first text sample, and is not repeated here.
Through the implementation of the above steps 151 to 152, a first similarity between the first text sample and the second text sample of each second sample group and a second similarity between the negative text sample and the second text sample of each second sample group can be calculated, so that the similarity between the first text sample and the second text sample in the second text sample group and the similarity between the second text sample and the negative text sample can be obtained.
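Formula (5) translates directly into Python; the sketch below treats a representation vector as a plain list of per-dimension feature values (the dummy vectors are illustrative only).

```python
import math

def cosine_similarity(a, b):
    """Formula (5): sum(A_i*B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))."""
    numerator = sum(ai * bi for ai, bi in zip(a, b))
    denominator = (math.sqrt(sum(ai * ai for ai in a))
                   * math.sqrt(sum(bi * bi for bi in b)))
    return numerator / denominator

first_sim = cosine_similarity([0.2, 0.5, 0.1], [0.3, 0.4, 0.2])    # first vs second sample
second_sim = cosine_similarity([0.9, -0.1, 0.0], [0.3, 0.4, 0.2])  # negative vs second sample
```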
Step 16: when the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups does not exceed the preset ratio, the parameters used to obtain the representation vectors in the text similarity recognition model are adjusted, and the steps from obtaining the representation vectors through comparing the ratio with the preset ratio are repeated; when the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups exceeds the preset ratio, the current text similarity recognition model is determined to be the pre-trained text similarity recognition model.
Specifically, the weight values of the features used in obtaining the representation vectors in the text similarity recognition model are adjusted, which in turn adjusts the representation vectors. In this embodiment, the higher the preset ratio, the more accurate the trained text similarity recognition model will be in the subsequent recognition process.
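Below is a sketch of the step 16 loop, continuing the TextEncoder example after step 142. The embodiment only states that the representation parameters are adjusted when too few groups satisfy first similarity > second similarity; the margin ranking (triplet-style) loss, the Adam optimizer and the preset ratio of 0.95 are assumed choices, not specified by the patent.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
preset_ratio = 0.95   # assumed value

for _ in range(100):
    v_first = encoder(first)
    v_second = encoder(second)
    v_neg = encoder(negative)
    first_sim = F.cosine_similarity(v_first, v_second, dim=1)
    second_sim = F.cosine_similarity(v_neg, v_second, dim=1)
    ratio = (first_sim > second_sim).float().mean().item()  # groups meeting the criterion
    if ratio > preset_ratio:
        break   # the current model counts as the pre-trained model
    # Push first_sim above second_sim; this loss is one assumed way to
    # "adjust the parameters used to obtain the representation vectors".
    loss = F.margin_ranking_loss(first_sim, second_sim,
                                 torch.ones_like(first_sim), margin=0.2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```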
Through the implementation of the above steps 11 to 16, samples of three sample types can be input simultaneously to train the text similarity recognition model, so that the second text sample, which is similar to the first text sample, is distinguished from the negative text sample, which is dissimilar to it. This prevents the text similarity recognition model, in the subsequent actual recognition process, from recognizing texts that are literally alike but different in meaning as similar, and solves the problem of low recognition accuracy in prior-art text similarity recognition methods.
Further, as an implementation manner of this embodiment, as shown in fig. 7, the calculating the text similarity between the first text and the second text by the pre-trained text similarity recognition model specifically includes the following steps 21 to 23:
step 21: and acquiring the first text and the second text through a pre-trained text similarity recognition model.
Wherein the first text and the second text may be any two texts having characters.
Step 22: and respectively representing the first text and the second text by text representation vectors through a text similarity recognition model.
The method by which the text similarity recognition model obtains the text representation vectors of the first text and the second text is similar to the method for obtaining the representation vectors of the first text sample, the second text sample and the negative text sample in step 14, and is not repeated here.
Step 23: and calculating the text similarity between the first text and the second text in each group according to the text representation vectors.
The method for calculating the text similarity between the first text and the second text is similar to the method for calculating the first similarity and the method for calculating the second similarity, and is not repeated here.
Through the implementation of the steps 21 to 23, the text similarity recognition model trained in advance can be applied to the actual text recognition process, so that the similarity between two texts can be judged. In addition, when the steps 21 to 23 are applied to the text retrieval process, the retrieval accuracy can be greatly improved, so that the retrieval result meets the intention of the user.
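Continuing the sketches above, applying the pre-trained model in steps 21 to 23 reduces to encoding the two texts and comparing their representation vectors; the dummy token ids stand in for a tokenized first and second text.

```python
ids_first_text = torch.randint(0, 5000, (1, 12))    # tokenized first text (dummy ids)
ids_second_text = torch.randint(0, 5000, (1, 12))   # tokenized second text (dummy ids)
with torch.no_grad():
    text_similarity = F.cosine_similarity(encoder(ids_first_text),
                                          encoder(ids_second_text), dim=1).item()
```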
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
A second embodiment of the present invention provides a training device for a text similarity recognition model, where the training device for the text similarity recognition model corresponds to the above-mentioned training method for the text similarity recognition model one by one.
Further, as shown in fig. 8, the training device of the text similarity recognition model includes a first sample group obtaining module 41, a third text sample obtaining module 42, a second sample group obtaining module 43, a representation vector obtaining module 44, a similarity obtaining module 45 and a model obtaining module 46. The functional modules are explained in detail as follows:
a first sample group obtaining module 41, configured to obtain multiple groups of first sample groups, where each group of first sample group includes a first text sample and a second text sample labeled in advance, and the first text sample and the second text sample have the same meaning;
a third text sample obtaining module 42, configured to calculate a literal similarity between each first text sample and each element in the LCQMC data set, and use an element whose literal similarity reaches a preset threshold as a third text sample;
a second sample group obtaining module 43, configured to receive labeling information on a third text sample, obtain negative text samples corresponding to the second text samples, and form multiple groups of second sample groups including the first text sample, the second text sample, and the negative text sample, where the negative text sample has a different meaning from the first text sample and the second text sample;
a representation vector obtaining module 44, configured to separately represent, by a text similarity recognition model, the first text sample, the second text sample, and the negative text sample in each second sample group with a representation vector;
a similarity obtaining module 45, configured to calculate, according to the representation vector, a first similarity between the first text sample and the second text sample of each second sample group, and calculate a second similarity between the negative text sample and the second text sample of each second sample group;
and a model obtaining module 46, configured to: when the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups does not exceed the preset ratio, adjust the parameters used to obtain the representation vectors in the text similarity recognition model and repeat the steps from obtaining the representation vectors through comparing the ratio with the preset ratio; and when the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups exceeds the preset ratio, determine the current text similarity recognition model as the pre-trained text similarity recognition model.
Further, as an implementation manner of this embodiment, as shown in fig. 9, the above-mentioned representative vector obtaining module 44 specifically includes an initial vector obtaining unit 441 and a representative vector obtaining unit 442. The functional units are explained in detail as follows:
the initial vector obtaining unit 441 is configured to obtain initial vectors of a first text sample, a second text sample, and a negative text sample through an embedding layer in the text similarity recognition model;
the representation vector obtaining unit 442 is configured to obtain, through the bidirectional long short-term memory (BiLSTM) network in the text similarity recognition model and according to the initial vectors, the representation vector corresponding to the first text sample, the representation vector corresponding to the second text sample and the representation vector corresponding to the negative text sample.
Further, as an implementation manner of this embodiment, the third text sample obtaining module 42 specifically includes a first text sample participle obtaining unit, a literal similarity obtaining unit and a third text sample obtaining unit. The functional units are explained in detail as follows:
a first text sample participle obtaining unit, configured to obtain each participle in the first text sample, an average length of an element in the LCQMC data set, and a length of any element in the LCQMC data set;
the literal similarity obtaining unit is used for calculating the literal similarity between the first text sample and each element in the LCQMC data set according to each participle, the occurrence frequency of each participle, the average length of the elements and the length of any element in the LCQMC data set;
and the third text sample acquisition unit is used for taking the element with the literal similarity reaching the preset threshold as a third text sample.
Further, as an implementation manner of this embodiment, the literal similarity obtaining unit specifically includes a correlation score obtaining subunit and a literal similarity obtaining subunit. The functional subunits are described in detail as follows:
the correlation score obtaining subunit is used for calculating to obtain correlation scores between the participles and the elements according to the participles, the occurrence frequency of the participles and the lengths of the elements;
and the literal similarity obtaining subunit is used for calculating the literal similarity between the first text sample and each element in the LCQMC data set according to the correlation score and the average length of the elements.
Further, as an implementation manner of this embodiment, the training device of the text similarity recognition model further includes a text obtaining module, a text representation vector obtaining module, and a text similarity obtaining module. The functional modules are explained in detail as follows:
the text acquisition module is used for acquiring a first text and a second text through a pre-trained text similarity recognition model;
the text representation vector acquisition module is used for respectively representing the first text and the second text by text representation vectors through a text similarity recognition model;
and the text similarity acquisition module is used for calculating the text similarity between each group of the first text and the second text according to the text representation vectors.
Further, as an implementation manner of this embodiment, the similarity obtaining module 45 specifically includes a feature vector obtaining unit and a similarity obtaining unit. The functional units are explained in detail as follows:
the characteristic vector acquisition unit is used for respectively acquiring characteristic vectors on all dimensions in the expression vectors representing the first text sample, the second text sample and the negative text sample;
and the similarity acquisition unit is used for calculating the first similarity and the second similarity according to the feature vectors on all dimensions.
For the specific limitation of the training device for the text similarity recognition model, reference may be made to the above limitation on the training method for the text similarity recognition model, and details are not repeated here. The modules/units in the training apparatus of the text similarity recognition model can be implemented in whole or in part by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
A third embodiment of the present invention provides a computer device, which may be a server, and the internal structure diagram of which may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing data involved in the training method of the text similarity recognition model. The network interface of the computer device is used for communicating with an external terminal through network connection. Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
According to an embodiment of the present application, a computer device is provided, which includes a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the training method for the text similarity recognition model, such as steps 11 to 16 shown in fig. 2, steps 121 to 123 shown in fig. 3, steps 1221 to 1222 shown in fig. 4, steps 141 to 142 shown in fig. 5, steps 151 to 152 shown in fig. 6, and steps 21 to 23 shown in fig. 7.
A fourth embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the training method for a text similarity recognition model provided by an embodiment of the present invention, such as steps 11 to 16 shown in fig. 2, steps 121 to 123 shown in fig. 3, steps 1221 to 1222 shown in fig. 4, steps 141 to 142 shown in fig. 5, steps 151 to 152 shown in fig. 6, and steps 21 to 23 shown in fig. 7. Alternatively, the computer program is executed by a processor to implement the functions of the modules/units of the training method for the text similarity recognition model provided in the first embodiment. To avoid repetition, further description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, which is used for verifying the validity (anti-counterfeiting) of the information and generating a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned functional units and modules are illustrated as being divided, and in practical applications, the above-mentioned functions may be distributed as different functional units and modules according to needs, that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the present invention, and are intended to be included within the scope thereof.

Claims (10)

1. A training method for a text similarity recognition model is characterized by comprising the following steps:
acquiring a plurality of groups of first sample groups, wherein each group of first sample group comprises a first text sample and a second text sample which are labeled in advance, and the first text sample and the second text sample have the same ideograph;
respectively calculating the literal similarity between each first text sample and each element in the LCQMC data set, and taking the element with the literal similarity reaching a preset threshold value as a third text sample;
receiving labeling information of the third text sample, obtaining negative text samples corresponding to the second text samples, and forming a plurality of groups of second sample groups including the first text sample, the second text sample and the negative text sample, wherein the negative text sample has different ideographs from the first text sample and the second text sample;
respectively representing the first text sample, the second text sample and the negative text sample in each second sample group with representation vectors through a text similarity recognition model;
calculating a first similarity between the first text sample and the second text sample of each of the second sample groups according to the representation vector, and calculating a second similarity between the negative text sample and the second text sample of each of the second sample groups;
when the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups does not exceed a preset ratio, adjusting the parameters used to obtain the representation vectors in the text similarity recognition model and repeating the steps from obtaining the representation vectors through comparing the ratio with the preset ratio, until the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups exceeds the preset ratio, and determining the current text similarity recognition model as the pre-trained text similarity recognition model.
2. The method for training the text similarity recognition model according to claim 1, wherein the representing the first text sample, the second text sample, and the negative text sample in each second sample group by the text similarity recognition model with a representation vector comprises:
respectively obtaining initial vectors of the first text sample, the second text sample and the negative text sample through an embedding layer in the text similarity recognition model;
and obtaining, through the bidirectional long short-term memory (BiLSTM) network in the text similarity recognition model and according to the initial vectors, the representation vector corresponding to the first text sample, the representation vector corresponding to the second text sample and the representation vector corresponding to the negative text sample.
3. The method of claim 1, wherein the calculating the literal similarity between each of the first text samples and each element in the LCQMC data set, and the determining the element with the literal similarity reaching a predetermined threshold as the third text sample comprises:
obtaining each participle in the first text sample, an average length of the elements in the LCQMC data set, and a length of any of the elements in the LCQMC data set;
calculating the literal similarity between the first text sample and each element in the LCQMC data set according to the respective participle, the frequency of occurrence of the respective participle, the average length of the element, and the length of any one element in the LCQMC data set;
and taking the element with the literal similarity reaching a preset threshold value as the third text sample.
4. The method of claim 3, wherein the calculating the literal similarity between the first text sample and each element in the LCQMC data set according to the each participle, the frequency of occurrence of each participle, the average length of the element, and the length of any element in the LCQMC data set comprises:
calculating to obtain a correlation score between each participle and the element according to each participle, the frequency of occurrence of each participle and the length of the element;
and calculating the literal similarity between the first text sample and each element in the LCQMC data set according to the correlation score and the average length of the elements.
5. The method for training the text similarity recognition model according to claim 1, wherein the step of calculating the text similarity between the first text and the second text by the pre-trained text similarity recognition model comprises:
acquiring the first text and the second text through the pre-trained text similarity recognition model;
respectively representing the first text and the second text by using text representation vectors through a text similarity recognition model;
and calculating the text similarity between the first text and the second text in each group according to the text representation vectors.
6. The method for training the text similarity recognition model according to claim 1, wherein the calculating a first similarity between the first text sample and the second text sample of each of the second sample groups according to the representation vector and calculating a second similarity between the negative text sample and the second text sample of each of the second sample groups comprises:
respectively acquiring the feature value in each dimension of the representation vectors representing the first text sample, the second text sample, and the negative text sample;
and calculating the first similarity and the second similarity according to the feature values in all dimensions.
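Claim 6 aggregates the features in every dimension of two representation vectors into a single similarity. Cosine similarity is one natural reading (again an assumption, as the claim does not name the measure):

$$\mathrm{sim}(u,v)=\frac{\sum_{k=1}^{n}u_{k}v_{k}}{\sqrt{\sum_{k=1}^{n}u_{k}^{2}}\,\sqrt{\sum_{k=1}^{n}v_{k}^{2}}}$$

where $u_k$ and $v_k$ are the features in dimension $k$. The first similarity takes $u$ and $v$ as the representation vectors of the first and second text samples; the second similarity takes them as the vectors of the negative and second text samples.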
7. A training apparatus for a text similarity recognition model, characterized by comprising:
a first sample group acquisition module, configured to acquire a plurality of first sample groups, wherein each first sample group comprises a first text sample and a second text sample which are labeled in advance, and the first text sample and the second text sample have the same meaning;
a third text sample obtaining module, configured to calculate a literal similarity between each first text sample and each element in the LCQMC data set, and use the element whose literal similarity reaches a preset threshold as a third text sample;
a second sample group obtaining module, configured to receive label information of the third text sample, obtain negative text samples corresponding to the second text samples, and form multiple groups of second sample groups comprising the first text sample, the second text sample, and the negative text sample, wherein the negative text sample differs in meaning from the first text sample and the second text sample;
a representation vector obtaining module, configured to separately represent, through the text similarity recognition model, the first text sample, the second text sample, and the negative text sample in each second sample group with representation vectors;
a similarity obtaining module, configured to calculate, according to the representation vector, a first similarity between the first text sample and the second text sample of each second sample group, and calculate a second similarity between the negative text sample and the second text sample of each second sample group;
and a model obtaining module, configured to: when the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups does not exceed a preset ratio, adjust parameters used to generate the representation vectors in the text similarity recognition model, and repeat the steps of obtaining the representation vectors, calculating the similarities, and comparing the ratio with the preset ratio, until the ratio of the number of groups in which the first similarity is greater than the second similarity to the total number of second sample groups exceeds the preset ratio; and determine the current text similarity recognition model as the pre-trained text similarity recognition model.
8. The apparatus for training a text similarity recognition model according to claim 7, wherein the representation vector obtaining module comprises:
an initial vector obtaining unit, configured to obtain initial vectors of the first text sample, the second text sample, and the negative text sample through an embedding layer in the text similarity recognition model, respectively;
and a representation vector acquisition unit, configured to obtain, through a bidirectional long short-term memory (BiLSTM) network in the text similarity recognition model and according to the initial vectors, the representation vector corresponding to the first text sample, the representation vector corresponding to the second text sample, and the representation vector corresponding to the negative text sample.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the training method of the text similarity recognition model according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the training method of the text similarity recognition model according to any one of claims 1 to 6.
CN202010456628.8A 2020-05-26 2020-05-26 Training method and device for text similarity recognition model and related equipment Pending CN111626039A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010456628.8A CN111626039A (en) 2020-05-26 2020-05-26 Training method and device for text similarity recognition model and related equipment
PCT/CN2020/105662 WO2021237928A1 (en) 2020-05-26 2020-07-30 Training method and apparatus for text similarity recognition model, and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010456628.8A CN111626039A (en) 2020-05-26 2020-05-26 Training method and device for text similarity recognition model and related equipment

Publications (1)

Publication Number Publication Date
CN111626039A true CN111626039A (en) 2020-09-04

Family

ID=72260009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010456628.8A Pending CN111626039A (en) 2020-05-26 2020-05-26 Training method and device for text similarity recognition model and related equipment

Country Status (2)

Country Link
CN (1) CN111626039A (en)
WO (1) WO2021237928A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843818A (en) * 2015-01-15 2016-08-10 富士通株式会社 Training device, training method, determining device, and recommendation device
CN110852056A (en) * 2018-07-25 2020-02-28 中兴通讯股份有限公司 Method, device and equipment for acquiring text similarity and readable storage medium
CN110866095A (en) * 2019-10-10 2020-03-06 重庆金融资产交易所有限责任公司 Text similarity determination method and related equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780449A (en) * 2021-09-16 2021-12-10 平安科技(深圳)有限公司 Text similarity calculation method and device, storage medium and computer equipment
CN113780449B (en) * 2021-09-16 2023-08-25 平安科技(深圳)有限公司 Text similarity calculation method and device, storage medium and computer equipment

Also Published As

Publication number Publication date
WO2021237928A1 (en) 2021-12-02

Similar Documents

Publication Publication Date Title
US10963637B2 (en) Keyword extraction method, computer equipment and storage medium
CN110021439B (en) Medical data classification method and device based on machine learning and computer equipment
US11348249B2 (en) Training method for image semantic segmentation model and server
CN110347835B (en) Text clustering method, electronic device and storage medium
CN110147551B (en) Multi-category entity recognition model training, entity recognition method, server and terminal
CN109815333B (en) Information acquisition method and device, computer equipment and storage medium
CN110569500A (en) Text semantic recognition method and device, computer equipment and storage medium
CN110598206A (en) Text semantic recognition method and device, computer equipment and storage medium
US20220292329A1 (en) Neural architecture search with weight sharing
CN109710921B (en) Word similarity calculation method, device, computer equipment and storage medium
CN113297366B (en) Emotion recognition model training method, device, equipment and medium for multi-round dialogue
CN111090719B (en) Text classification method, apparatus, computer device and storage medium
CN109766418B (en) Method and apparatus for outputting information
CN110688499A (en) Data processing method, data processing device, computer equipment and storage medium
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN111291177A (en) Information processing method and device and computer storage medium
CN111611383A (en) User intention recognition method and device, computer equipment and storage medium
CN113886550A (en) Question-answer matching method, device, equipment and storage medium based on attention mechanism
CN110502620B (en) Method, system and computer equipment for generating guide diagnosis similar problem pairs
CN112397197A (en) Artificial intelligence-based inquiry data processing method and device
CN111723870A (en) Data set acquisition method, device, equipment and medium based on artificial intelligence
CN113821587B (en) Text relevance determining method, model training method, device and storage medium
CN111552810B (en) Entity extraction and classification method, entity extraction and classification device, computer equipment and storage medium
CN111626039A (en) Training method and device for text similarity recognition model and related equipment
CN113343024A (en) Object recommendation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination