CN116796723B - Text set matching method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN116796723B CN116796723B CN202310254504.5A CN202310254504A CN116796723B CN 116796723 B CN116796723 B CN 116796723B CN 202310254504 A CN202310254504 A CN 202310254504A CN 116796723 B CN116796723 B CN 116796723B
- Authority
- CN
- China
- Prior art keywords
- text
- similarity
- texts
- tag
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a text set matching method, a device, electronic equipment and a storage medium. The text set matching method comprises the following steps: acquiring a tag text set, a first text set and a second text set; inputting the tag text set, the first text set and the second text set into a text matching model to obtain a first similarity between any tag text and any first text and a second similarity between any tag text and any second text; and matching the first text set with the second text set according to the first similarity and the second similarity. According to the invention, a separate encoder is trained for each text set, so the semantics of the texts can be expressed more fully and the accuracy of text set matching is improved. In addition, the tag texts are matched against the texts of the different text sets, and the different text sets are then matched through the tag texts, which solves the problem that texts are difficult to align when text sets are matched directly and further reduces the difficulty of text set matching.
Description
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a text set matching method, a device, an electronic device, and a storage medium.
Background
With the rapid development of big data technology and the deployment of artificial intelligence algorithms, various industries are being digitally empowered, which requires a large number of foundational data sets: text sets that are related within the same field range need to be matched to obtain data sets of corresponding texts from different sources. For example, a data set of corresponding cases and legal provisions in the field of basic-level social governance complaints can serve the subsequent construction of a basic-level social governance system.
In the prior art, text matching is often performed by extracting text features of different text information through the same encoder and then judging whether the different text information matches based on the extracted features. However, even within the same field, texts from different application scenarios or with different language structures differ considerably in terminology and expression; for example, the same word may carry different meanings in a case text and in a legal provision text. As a result, the extracted text features cannot accurately reflect the semantics of the text information. In addition, because of these large differences in semantic expression, texts from different text sets may be difficult to align, which negatively affects subsequent matching, so matching text sets is both difficult and inaccurate.
Disclosure of Invention
The invention aims to overcome the defects of high matching difficulty and low matching accuracy of text sets in the prior art, caused by large differences in semantic expression between different text sets, and provides a text set matching method, a device, electronic equipment and a storage medium.
The invention solves the technical problems by the following technical scheme:
according to a first aspect of the present invention, there is provided a text set matching method, the text set matching method comprising:
acquiring a tag text set, a first text set and a second text set; the first text set and the second text set belong to texts with relevance in the same field range; the tag text set comprises a plurality of tag texts, the first text set comprises a plurality of first texts, and the second text set comprises a plurality of second texts;
inputting the tag text set, the first text set and the second text set into a text matching model to obtain a first similarity of any one of the tag texts and any one of the first texts and a second similarity of any one of the tag texts and any one of the second texts; the text matching model comprises a first encoder and a second encoder; the first encoder takes the tag text set and the first text set as inputs and the first similarity as an output; the second encoder takes the tag text set and the second text set as inputs and the second similarity as an output;
And matching the first text set with the second text set according to the first similarity and the second similarity.
Preferably, the tag text has a unique identifier, and the step of matching the first text set and the second text set according to the first similarity and the second similarity includes:
for each of the tag texts, obtaining, from the first text set, a first number of first target texts ranked highest by the first similarity, and obtaining, from the second text set, a second number of second target texts ranked highest by the second similarity;
or, alternatively,
for each of the tag texts, obtaining the first target text with the first similarity greater than a first threshold value from the first text set, and obtaining the second target text with the second similarity greater than a second threshold value from the second text set;
labeling the first target text and the second target text based on the identification so as to match the first target text and the second target text.
Preferably, the text matching model is obtained through training comprising the following steps:
Acquiring a first training text and a second training text;
inputting the first training text and the second training text into a neural network model, wherein a first initial encoder in the neural network model obtains a first encoding vector based on the first training text and a second initial encoder obtains a second encoding vector based on the second training text;
constructing a first tag value according to any two of the first coding vectors and constructing a second tag value according to any two of the second coding vectors;
acquiring third similarity of any two first coding vectors and fourth similarity of any two second coding vectors;
training the neural network model based on the first tag value, the third similarity, the second tag value, and the fourth similarity to generate the text matching model.
Preferably, the text set matching method further comprises:
preprocessing the first target text with the same identifier to obtain a first test text, and preprocessing the second target text with the same identifier to obtain a second test text;
obtaining a fifth similarity between the first target text and the first test text, and obtaining a sixth similarity between the second target text and the second test text;
The first encoder is evaluated based on the fifth similarity and the second encoder is evaluated based on the sixth similarity.
According to a second aspect of the present invention, there is provided a text set matching apparatus, including a first acquisition module, a second acquisition module, and a matching module:
the first acquisition module is used for acquiring a tag text set, a first text set and a second text set; the first text set and the second text set belong to texts with relevance in the same field range; the tag text set comprises a plurality of tag texts, the first text set comprises a plurality of first texts, and the second text set comprises a plurality of second texts;
the second obtaining module is used for inputting the tag text set, the first text set and the second text set into a text matching model to obtain a first similarity between any one tag text and any one first text and a second similarity between any one tag text and any one second text; the text matching model comprises a first encoder and a second encoder; the first encoder takes the tag text set and the first text set as inputs and the first similarity as an output; the second encoder takes the tag text set and the second text set as inputs and the second similarity as an output;
The matching module is used for matching the first text set and the second text set according to the first similarity and the second similarity.
Preferably, the tag text has a unique identifier, and the matching module includes a first obtaining unit and a matching unit:
the first obtaining unit is configured to obtain, for each of the tag texts, a first number of first target texts ranked highest by the first similarity from the first text set, and a second number of second target texts ranked highest by the second similarity from the second text set;
or, alternatively,
the first obtaining unit is configured to obtain, for each of the tag texts, a first target text whose first similarity is greater than a first threshold value from the first text set, and a second target text whose second similarity is greater than a second threshold value from the second text set;
the matching unit is used for marking the first target text and the second target text based on the identification so as to match the first target text and the second target text.
Preferably, the text set matching device further comprises a training module, wherein the training module is used for training to obtain the text matching model, and the training module comprises a second obtaining unit, a coding unit, a construction unit, a third obtaining unit and a training unit:
The second acquisition unit is used for acquiring a first training text and a second training text;
the coding unit is used for inputting the first training text and the second training text into a preset neural network model, a first initial coder in the neural network model obtains a first coding vector based on the first training text, and a second initial coder obtains a second coding vector based on the second training text;
the construction unit is used for constructing a first label value according to any two first coding vectors and constructing a second label value according to any two second coding vectors;
the third obtaining unit is used for obtaining a third similarity between any two first coding vectors and a fourth similarity between any two second coding vectors;
the training unit is configured to train the neural network model based on the first tag value, the third similarity, the second tag value, and the fourth similarity to generate the text matching model.
Preferably, the text set matching device further comprises a processing module, a third acquisition module and an evaluation module:
the processing module is used for preprocessing the first target text with the same identifier to obtain a first test text, and preprocessing the second target text with the same identifier to obtain a second test text;
The third obtaining module is used for obtaining a fifth similarity between the first target text and the first test text and obtaining a sixth similarity between the second target text and the second test text;
the evaluation module is configured to evaluate the first encoder based on the fifth similarity and the second encoder based on the sixth similarity.
According to a third aspect of the present invention there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory for running on the processor, the processor implementing the text set matching method of the present invention when executing the computer program.
According to a fourth aspect of the present invention there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the text set matching method of the present invention.
On the basis of conforming to the common knowledge in the field, the above preferred conditions can be arbitrarily combined to obtain the preferred examples of the invention.
The invention has the positive progress effects that:
For a first text set and a second text set that have relevance within the same field range, the tag text set is used to match them. Specifically, each tag text in the tag text set has a unique identifier; a first encoder in the text matching model obtains the similarity between any first text in the first text set and any tag text, and a second encoder obtains the similarity between any second text in the second text set and any tag text; the first target texts and the second target texts matched with each tag text are obtained according to these similarities and labeled based on the identifier, and the first target texts and the second target texts are then matched through the same identifier. According to the invention, a separate encoder is trained for each text set to process that set, so the semantics of the texts of different text sets can be expressed more fully and the subsequent matching accuracy of the text sets is improved. The tag texts are matched against the texts of the different text sets respectively, and the texts of different text sets are then matched using the same identifier, which solves the problem that texts are difficult to align when different text sets are matched directly and further reduces the matching difficulty of the text sets.
Drawings
Fig. 1 is a flow chart of a text set matching method in embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of a text set matching method according to embodiment 1 of the present invention.
Fig. 3 is a schematic flow chart of training a text matching model in the text set matching method of embodiment 1 of the present invention.
Fig. 4 is a schematic diagram of a framework for calculating similarity in the text set matching method of embodiment 1 of the present invention.
Fig. 5 is a flow chart of step S13 in the text set matching method of embodiment 1 of the present invention.
Fig. 6 is another flow chart of step S13 in the text set matching method of embodiment 1 of the present invention.
Fig. 7 is a schematic flow chart of an evaluation encoder in the text set matching method of embodiment 1 of the present invention.
Fig. 8 is a box diagram drawn based on the similarity between the target text and the test text in the text set matching method of embodiment 1 of the present invention.
Fig. 9 is a schematic structural diagram of a text set matching device in embodiment 2 of the present invention.
Fig. 10 is a schematic structural diagram of a training module 201 in the text set matching device in embodiment 2 of the present invention.
Fig. 11 is a schematic structural diagram of a matching module 23 in the text set matching device in embodiment 2 of the present invention.
Fig. 12 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention.
Detailed Description
The invention is further illustrated by means of the following examples, which are not intended to limit the scope of the invention.
Example 1
This embodiment provides a text set matching method which, as shown in fig. 1, comprises the following steps:
s11, acquiring a label text set, a first text set and a second text set.
In this embodiment, the first text set and the second text set belong to texts having relevance within the same field range; the tag text set includes a plurality of tag texts, the first text set includes a plurality of first texts, and the second text set includes a plurality of second texts.

The field range is used to represent the application scope of the first text set and the second text set, for example, a question text set and an answer text set in the field of tourist services, or a case text set and a legal provision text set in the field of basic-level social governance. Although the first text set and the second text set belong to the same field range, different text sets may belong to different application scenarios, so the same expression may have different semantics in different scenarios. It should be noted that there may be two or more text sets having relevance within the same field range, and this embodiment is not limited to matching only two text sets; for example, in the field of tourism customer service, the customer question text set, the customer service answer text set and the cited policy text set may be matched pairwise.

In this embodiment, the labels refer to the categories covered by the field to which the first text set and the second text set belong; for example, the field of basic-level social governance includes education, land expropriation and demolition, discipline inspection and supervision, road traffic, and the like, and these category words are used as labels. Each tag text represents exactly one label, which serves as its identifier. As an optional implementation, the tag text may directly be a single category word, or the single category word may be embedded into a template sentence to form the tag text. In this embodiment, the appropriateness of the labels directly affects the matching of the text sets, so the first text set and the second text set need to be considered simultaneously.
For convenience of explanation, the field of basic-level social governance is used as an example below, where the first text set is a case text set and the second text set is a legal provision text set. In this embodiment, the case text set includes more than one hundred thousand historical cases related to this field, obtained from the historical case data sets of several regions such as Lishui, and the legal provision text set includes the provisions of the relevant laws and regulations in this field; both text sets are saved in CSV format. As an optional implementation, the labels covered by the tag text set are mainly a number of case types derived from the case texts in the basic-level social governance field, combined with the legal categories of an existing legal classification system, from which the legal categories corresponding to the case types are selected as labels. The legal classification system may be that of Beida Fabao (PKULaw), which covers most of the case types in the case text set.

This embodiment finally settles on 28 categories as labels. Specifically, the main case types in the basic-level social governance field are: political affairs, ecology, urban and rural construction, labor and social security, science and information industry, market supervision, rural agriculture, economic management, transportation, natural resources, health, education, civil affairs and emergency, party and government affairs, civilian travel, army affairs, organizational personnel, discipline inspection, marital and family disputes, damage compensation disputes, property disputes, other labor disputes, other contract disputes, mountain and land disputes, neighborhood disputes, banking disputes, road traffic accident disputes, house and homestead disputes, civil loan disputes, land expropriation and demolition disputes, medical disputes, migrant worker wage arrears disputes, environmental pollution disputes, production and operation disputes, and the like.

According to these case types and in combination with the legal categories under the relevant regulations, the following 28 categories are selected as labels: farmers, public places and environmental sanitation, protection of the elderly, minors, women and the disabled, education, administrative and service charge management, labor safety and labor protection, veteran resettlement, grassroots elections, counterfeit and inferior goods, medical care, relocation and demolition resettlement, support, fostering and maintenance, poverty relief, pensions, inheritance, retirement, household registration and identity cards, illegal construction, traffic safety management, real estate rights management, fire control management, administrative penalties and administrative reconsideration, contract performance, complaints and appeals, reporting and accusations, food sanitation, tax invoices, house leasing, and civil complaints.

Because the case texts and legal provision texts in this embodiment are sentences while a single label is a word, similarity between the tag texts and the case texts and legal provision texts needs to be calculated in subsequent processing. To eliminate interference, a single label is usually embedded into template sentences to obtain the tag texts, from which the tag text set is constructed. It should be noted that the template sentences themselves must not contain any label, so as not to interfere with the 28 label words above. Taking "farmers" as an example, the tag texts may be "this text belongs to farmers", "this text describes farmers", "this text relates to farmers", and so on.
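A minimal sketch of this tag-text construction, assuming illustrative label words and template sentences (the full 28-label set and the actual templates are given above); the function and variable names are not from the patent:

```python
# Minimal sketch of tag-text construction: each label word is embedded into
# several neutral template sentences, and every resulting tag text keeps a
# reference to its unique label (its identifier). Labels/templates here are
# illustrative placeholders, not the full 28-label set.
LABELS = ["farmers", "food sanitation", "traffic safety management"]
TEMPLATES = [
    "this text belongs to {label}",
    "this text describes {label}",
    "this text relates to {label}",
]

def build_tag_text_set(labels, templates):
    """Return a list of (tag_text, label) pairs; the label acts as the identifier."""
    tag_text_set = []
    for label in labels:
        for template in templates:
            tag_text_set.append((template.format(label=label), label))
    return tag_text_set

if __name__ == "__main__":
    for tag_text, label in build_tag_text_set(LABELS, TEMPLATES):
        print(label, "->", tag_text)
```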
S12, inputting the tag text set, the first text set and the second text set into a text matching model to obtain a first similarity of any one tag text and any one first text and a second similarity of any one tag text and any one second text.
Referring to fig. 2, the text matching model includes a first encoder and a second encoder; the first encoder takes the tag text set and the first text set as inputs and takes the first similarity as an output; the second encoder takes as input the tag text set and the second text set, and as output the second similarity.
In this embodiment, the actual process of outputting the first similarity between any one of the tag texts and any one of the first texts includes: and encoding the tag text and the first text through a first encoder to obtain a first tag text vector corresponding to the tag text and a first text vector corresponding to the first text, and calculating the similarity according to the first tag text vector and the first text vector.
Similarly, the actual process of outputting the second similarity between any one of the tag texts and any one of the second texts includes: encoding the tag text and the second text through the second encoder to obtain a second tag text vector corresponding to the tag text and a second text vector corresponding to the second text, and calculating the similarity from the second tag text vector and the second text vector.

As an optional implementation, the output first tag text vectors, second tag text vectors, first text vectors and second text vectors are normalized by their norms, and a vector-to-text dictionary is constructed to facilitate mapping a sentence vector back to its text later. The first tag text vector and the second tag text vector of the same tag text map back to that same tag text, i.e., a first text and a second text can be connected through the same tag text, thereby establishing a matching relationship between the first text and the second text.
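A small sketch of the norm normalization and vector-to-text dictionary described above, assuming NumPy arrays as vectors; after normalization, the inner product of two vectors equals their cosine similarity. The function names are illustrative assumptions:

```python
# Sketch of norm normalization and a vector-to-text dictionary; names are illustrative.
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    """L2-normalize each row so that dot products become cosine similarities."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def build_vector_text_dict(vectors: np.ndarray, texts: list[str]) -> dict:
    """Map each (normalized) vector, stored as a hashable tuple, back to its text."""
    return {tuple(vec): text for vec, text in zip(vectors, texts)}

# usage: cosine similarity between tag text vectors and first text vectors
tag_vecs = normalize(np.random.randn(3, 768))     # stand-in for encoder output
first_vecs = normalize(np.random.randn(5, 768))
similarity = tag_vecs @ first_vecs.T              # shape (3, 5): tag x first text
```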
The text matching model is obtained through training through the following steps, referring to fig. 3, including:
s1011, acquiring a first training text and a second training text.
In this embodiment, the first training text is specifically a case text, and the second training text is specifically a legal provision text. As an optional implementation, the first training text is randomly extracted from the case text set, and the second training text is randomly extracted from the aforementioned legal provision text set.
S1012, inputting the first training text and the second training text into a neural network model, wherein a first initial encoder in the neural network model obtains a first encoding vector based on the first training text and a second initial encoder obtains a second encoding vector based on the second training text.
As an optional implementation, the neural network model adopts ESimCSE (a contrastive sentence-embedding neural network model) as its basic structure, and the original single encoder is replaced with two encoders on this basis; the model is not limited to two encoders, and may be extended to more than two encoders if there are more than two text sets.

As an optional implementation, the first initial encoder and the second initial encoder both use the pre-trained model bert-base-chinese (a Chinese pre-trained language model), which has 12 layers, a hidden dimension of 768, 12 attention heads and a vocabulary size of 21128; the sentence vectors encoded by the pre-trained model are likewise 768-dimensional.
As an optional implementation, the entire preprocessing of the training text is implemented directly in the model. Specifically, a tokenizer is added to the bert-base-chinese model; the tokenizer segments the input text, inserts a [CLS] token at the beginning of the token sequence and a [SEP] token at the end of each segment, where the [CLS] token is used to represent and fuse the overall features of the whole text and the [SEP] token marks the end of each segment.

In this embodiment, the bert-base-chinese model converts the processed tokens and the inserted markers into corresponding embedding vectors; as an optional implementation, the tokens and markers are converted into word embeddings based on an embedding method, which is a mature technique in natural language processing.

In this embodiment, the text vector can be represented in several ways. As an optional implementation, the first-last-avg pooling (the average of the first and last layers) of the bert-base-chinese model is selected, i.e., all word vector representations of the first and last hidden layers of the model are averaged as the sentence vector (i.e., text vector) of the text. Of course, other pooling methods may also be selected in this embodiment, such as averaging all word vectors of the last hidden layer, or using the encoded vector corresponding to [CLS] as the sentence vector representation.
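A sketch of first-last-avg pooling over bert-base-chinese, assuming the HuggingFace transformers and PyTorch libraries; this is an illustrative reading of the pooling described above (treating the first hidden layer as the output of encoder layer 1), not the patented implementation itself:

```python
# Sketch of first-last-avg pooling with bert-base-chinese (assumed libraries:
# transformers + torch). Masked averaging ignores padding tokens.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def encode_first_last_avg(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch, output_hidden_states=True)
    first, last = out.hidden_states[1], out.hidden_states[-1]   # first/last encoder layers
    mask = batch["attention_mask"].unsqueeze(-1).float()        # ignore padding positions
    pooled = ((first + last) / 2 * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, 768)
    return torch.nn.functional.normalize(pooled, dim=-1)

vectors = encode_first_last_avg(["这段文本属于农民", "投诉某商家使用不新鲜食材"])
print(vectors.shape)  # torch.Size([2, 768])
```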
In this embodiment, the embedded vector of the text obtained by the above processing is input into the core structure of the bert_base_Chinese model, i.e. the encoder, to obtain the encoded vector of the corresponding text. As an alternative embodiment, the model uses a batch size of 32, so the data format of the encoded vector is (32, 768).
As an optional implementation, the first initial encoder and the second initial encoder may use Dropout (random deactivation) to adjust the basic model structure, thereby performing data augmentation on the first training text and the second training text. Because of the randomness of Dropout, encoding the same text twice with the encoder yields two different text vectors (i.e., encoding vectors), from which positive and negative samples can be constructed for contrastive learning training. For example, word repetition may also randomly duplicate some words so that the lengths of the two views differ; by then increasing the similarity between the two encoding vectors of the same text in the subsequent contrastive training, text length is prevented from becoming an interfering factor in the semantic expression of the text.

In this embodiment, the first training text (i.e., the case text) may be input into the first initial encoder multiple times to obtain several different first encoding vectors, which are then compared pairwise, i.e., the first encoding vectors obtained from any two passes through the first initial encoder are selected for contrastive learning training.

Similarly, the second training text (i.e., the legal provision text) may be input into the second initial encoder multiple times to obtain several different second encoding vectors, and the second encoding vectors obtained from any two passes through the second initial encoder are selected for contrastive learning training.
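A sketch of obtaining two different encoding vectors for the same batch by running an encoder twice with Dropout active; it reuses the pooling idea from the previous sketch, and the function name and structure are illustrative assumptions rather than the patented code:

```python
# Two forward passes over the same batch with Dropout active give two views
# (text vectors X and Y) of every training text.
import torch

def two_views(encoder, tokenizer, texts):
    encoder.train()  # keep Dropout active so the two forward passes differ
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    mask = batch["attention_mask"].unsqueeze(-1).float()

    def pool(out):
        first, last = out.hidden_states[1], out.hidden_states[-1]
        return ((first + last) / 2 * mask).sum(dim=1) / mask.sum(dim=1)

    x = pool(encoder(**batch, output_hidden_states=True))  # first pass: text vectors X
    y = pool(encoder(**batch, output_hidden_states=True))  # second pass: text vectors Y
    return (torch.nn.functional.normalize(x, dim=-1),
            torch.nn.functional.normalize(y, dim=-1))
```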
S1013, constructing a first label value according to any two first coding vectors and constructing a second label value according to any two second coding vectors.
Taking the construction of a first label value from any two first encoding vectors as an example: the two encoding vectors of the same first training text form a positive text pair, i.e., the first label value is "1", while the encoding vectors of that first training text and of the other first training texts in the same batch form negative text pairs, i.e., the first label value is "0". As an optional implementation, the total number of negative text pairs in a batch is set to 160.
In this embodiment, the steps of constructing the first tag value from any two first encoded vectors and constructing the second tag value from any two second encoded vectors are the same.
S1014, acquiring a third similarity of any two first coding vectors and a fourth similarity of any two second coding vectors.
In this embodiment, similarity is selected as the measure of semantic proximity between texts, and the similarity used is mainly cosine similarity. As an optional implementation, adding a normalization step to the encoding vectors makes computing the cosine similarity equivalent to directly calculating the inner product of the two vectors, i.e., the angle between the two vectors can be obtained from the inner product. Cosine similarity has a clear geometric meaning (the angle between two vectors under a standard orthonormal basis) and is convenient to compute and illustrate. Assuming the two vectors are denoted x and y, x·y = |x||y|cos θ, where x·y is the inner product, θ is the angle between the vectors, and cos θ is the cosine similarity.

Taking the third similarity of any two first encoding vectors as an example, referring to fig. 4, assume there are N first training texts. The first training texts are input into the first initial encoder a first time to obtain first encoding vectors as text vectors X, and input a second time to obtain first encoding vectors as text vectors Y; inner products are then computed between the N text vectors X and the N text vectors Y to obtain the third similarity between any text vector X and any text vector Y. For example, assuming there are two first training samples, the third similarities obtained are x1·y1, x1·y2, x2·y1 and x2·y2.
In this embodiment, the step of acquiring the third similarity of any two first encoded vectors and the step of acquiring the fourth similarity of any two second encoded vectors are the same.
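A sketch of the similarity computation in fig. 4: with N normalized text vectors X (first pass) and N text vectors Y (second pass), the pairwise inner products form an N×N matrix of third similarities. The data here are random stand-ins:

```python
import numpy as np

N, dim = 4, 768
x = np.random.randn(N, dim)                      # text vectors X (first pass)
y = np.random.randn(N, dim)                      # text vectors Y (second pass)
x /= np.linalg.norm(x, axis=1, keepdims=True)    # after normalization, the inner
y /= np.linalg.norm(y, axis=1, keepdims=True)    # product equals cos(theta)

sim = x @ y.T                                    # sim[i, j] = x_i . y_j
# Diagonal entries sim[i, i] are the positive pairs (same text, two passes);
# off-diagonal entries are negative pairs within the batch.
```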
S1015, training a neural network model based on the first label value, the third similarity, the second label value, and the fourth similarity to generate a text matching model.
The text matching model includes a first encoder and a second encoder, and in this embodiment, the process of training the neural network model actually includes: training a first initial encoder based on the first tag value and the third similarity to obtain a first encoder; training the second initial encoder based on the second tag value and the fourth similarity to obtain a second encoder.
In this embodiment, training is performed by increasing the cosine similarity between the two encoded vectors of the positive text pair and decreasing the cosine similarity between the two encoded vectors of the negative text pair.
As an optional implementation, the loss function is designed as a cross-entropy loss. The first label value and the third similarity are substituted into the cross-entropy loss function to obtain the cross-entropy loss of the first initial encoder, and the second label value and the fourth similarity are substituted into the cross-entropy loss function to obtain the cross-entropy loss of the second initial encoder. Iterative training is then performed continuously to minimize the cross-entropy loss of the first initial encoder and that of the second initial encoder; once the training of the first initial encoder and the second initial encoder is completed, the machine learning model at that point is the trained text matching model.

As an optional implementation, a relatively small loss threshold may be preset, and the first initial encoder and the second initial encoder are determined to be fully trained when the cross-entropy loss falls below that threshold; alternatively, a relatively large maximum number of iterations may be preset, and the two encoders are determined to be fully trained when the number of training iterations exceeds that maximum.

As an optional implementation, training can be aided by constructing an identity matrix from the first label values. For example, when the first initial encoder is trained based on the first label value and the third similarity, as shown in fig. 4, the N text vectors X and the N text vectors Y yield an N×N similarity matrix, in which the elements on the diagonal from the upper left corner to the lower right corner are exactly the cosine similarities of the two encoding vectors of each positive text pair. The identity matrix has ones on that diagonal and zeros elsewhere, and any matrix multiplied by the identity matrix equals itself, so the identity matrix can be used to maximize the values on the diagonal of the similarity matrix.
In this embodiment, the step of training the first initial encoder based on the first tag value and the third similarity is the same as the step of training the second initial encoder based on the second tag value and the fourth similarity.
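A sketch of one contrastive training step with a cross-entropy loss, assuming PyTorch: the N×N similarity matrix is treated as N rows of logits, and the identity-matrix labels reduce to the diagonal indices 0..N-1. The temperature scaling is an assumption commonly used with such losses, not something stated in the patent:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(sim_matrix: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """sim_matrix[i, j]: cosine similarity of x_i (first pass) and y_j (second pass)."""
    n = sim_matrix.size(0)
    targets = torch.arange(n)            # identity-matrix labels: row i's positive is column i
    return F.cross_entropy(sim_matrix / temperature, targets)

# usage with a stand-in similarity matrix (in real training it comes from the encoder)
sim = torch.randn(32, 32, requires_grad=True)
loss = contrastive_loss(sim)
loss.backward()
```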
And S13, matching the first text set with the second text set according to the first similarity and the second similarity.
In this embodiment, the process of matching the first text set and the second text set according to the first similarity and the second similarity actually includes: matching the tag text set with the first text set according to the first similarity, so that each first text in the first text set is marked with the label corresponding to a tag text; matching the tag text set with the second text set according to the second similarity, so that each second text in the second text set is marked with the label corresponding to a tag text; and finally matching the first text set with the second text set through the marked labels.

As an optional implementation, the texts corresponding to the text vectors ranked highest by the first similarity and the second similarity may be selected. Specifically, referring to fig. 5, step S13 includes the following steps:

S1311, for each tag text, obtaining, from the first text set, a first number of first target texts ranked highest by the first similarity, and obtaining, from the second text set, a second number of second target texts ranked highest by the second similarity.
As an alternative embodiment, the first number and the second number may be set to be the same or may be set to be different.
Taking the acquisition of a first number of first target texts ranked highest by the first similarity from the first text set as an example: assume the first number is three and the first text set contains 10 first texts, so that 10 similarities with a given tag text are obtained; the 10 first texts are sorted by similarity from high to low, and the top three are taken as the first target texts.
In this embodiment, the step of obtaining a first number of first target texts with a first similarity ranking from the first set of texts is the same as the step of obtaining a second number of second target texts with a second similarity ranking from the second set of texts.
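A sketch of selecting, for one tag text, the top-k most similar texts from a text set (step S1311); the names and stand-in data are illustrative:

```python
import numpy as np

def top_k_targets(tag_vec, text_vecs, texts, k=3):
    """Return the k texts whose vectors have the largest inner product with tag_vec."""
    sims = text_vecs @ tag_vec                 # one similarity per text in the set
    order = np.argsort(sims)[::-1][:k]         # indices of the k highest similarities
    return [(texts[i], float(sims[i])) for i in order]

# usage with stand-in data
texts = [f"case text {i}" for i in range(10)]
text_vecs = np.random.randn(10, 768)
tag_vec = np.random.randn(768)
print(top_k_targets(tag_vec, text_vecs, texts, k=3))
```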
And S132, labeling the first target text and the second target text based on the identification so as to match the first target text and the second target text.
The identifier is the unique label contained in each tag text in this embodiment. It should be noted that different tag texts may carry the same label; for example, assuming the tag texts include "this text belongs to consumer rights protection", "this text describes consumer rights protection" and "this text relates to consumer rights protection", the label corresponding to all three tag texts is "consumer rights protection". In addition, each first target text and each second target text may correspond to several different labels; for example, assuming the first target text is a complaint that a merchant uses stale ingredients to make food, it may be labeled both "food sanitation" and "consumer rights protection".

As an optional implementation, a matching data set is established from the first target texts and the second target texts marked with the same label, so that the two different text sets are fused together, i.e., reorganized by label, yielding a matching data set under each label. For example, a digital government can establish case-to-legal-provision matching data sets under different categories from the case text set and the legal provision text set in the basic-level social governance field, laying a foundation for subsequently building an intelligent basic-level social governance system using the correspondence between cases and legal provisions.
As an optional implementation, the texts corresponding to the text vectors whose first similarity or second similarity is greater than a certain threshold may be selected. Specifically, referring to fig. 6, step S13 includes the following steps:
s1312, for each tag text, obtaining a first target text from the first set of texts having a first similarity greater than a first threshold, and obtaining a second target text from the second set of texts having a second similarity greater than a second threshold.
As an alternative embodiment, the first threshold value and the second threshold value may be set to be the same or may be set to be different.
And S132, labeling the first target text and the second target text based on the identification so as to match the first target text and the second target text.
As an optional implementation, the selection condition may be set to require both that the similarity is greater than a preset threshold and that the number of selected texts reaches a preset number; if the number of texts whose similarity exceeds the threshold does not reach the preset number, the count requirement is satisfied first, and additional texts ranked highest by similarity are selected.
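A sketch of this combined selection rule: take the texts above a similarity threshold, and if fewer than the preset number qualify, fall back to the highest-ranked texts until the count is reached. Names are illustrative:

```python
import numpy as np

def select_targets(sims, texts, threshold, min_count):
    order = np.argsort(sims)[::-1]                       # indices, highest similarity first
    above = [i for i in order if sims[i] > threshold]    # texts above the threshold
    chosen = above if len(above) >= min_count else list(order[:min_count])
    return [texts[i] for i in chosen]
```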
In this embodiment, whether an encoder can adequately express the semantic information of a text directly affects the similarity calculation between different texts and therefore plays a decisive role in the construction of the matched text sets. As shown in fig. 7, the encoders are evaluated using the following steps:
S14, preprocessing the first target text with the same identifier to obtain a first test text, and preprocessing the second target text with the same identifier to obtain a second test text.
In this embodiment, having the same identifier means bearing the same label. Taking the basic-level social governance field as an example, assume the first target texts are case texts labeled with labels such as farmers, public places and environmental sanitation, protection of the elderly, minors, women and the disabled, education, administrative and service charge management, labor safety and labor protection, veteran resettlement, grassroots elections, counterfeit and inferior goods, medical care, relocation and demolition resettlement, support, fostering and maintenance, poverty relief, and so on; for each of the above labels, the case texts bearing that label are acquired.

As an optional implementation, the same preset number of first target texts can be acquired under each label; for example, if the preset number is two hundred, two hundred case texts marked with the "farmers" label are acquired, two hundred case texts marked with the "public places and environmental sanitation" label are acquired, and so on. As an alternative, all the first target texts under each label can be acquired, regardless of the proportion of texts under that label.
In this embodiment, the acquired first target texts with the same identifier are preprocessed. As an optional implementation, the preprocessing specifically includes word segmentation, removal of stop words and further filtering of the segments; the segments finally obtained are organized into a word-list form, which serves as the first test text.

As an optional implementation, jieba (a Chinese word segmentation tool) may be used for word segmentation, and a custom dictionary for the domain of the first target texts may be added to jieba so that the segmentation result is more accurate.

In this embodiment, the token sequence obtained by word segmentation contains some common words that contribute little to the final prediction; to extract semantics more efficiently, these words, called stop words, need to be removed from the sequence. As an optional approach, all stop words in the whole token sequence are removed based on a general stop word list and a domain-specific stop word list, reducing the interference of redundant information. Many mature and complete stop word lists are available, covering common stop words such as conjunctions and prepositions as well as frequently removed proper nouns such as times, places and personal names. For different fields there are also many domain-specific stop words, so a dedicated stop word list can be constructed, for example for recurring professional terms such as "letter" in case texts, or for segments whose word frequency is lower than a preset number of occurrences, where word frequency refers to how often each word appears in the first test text; such words have no practical effect on distinguishing specific case texts.

As an optional implementation, the segments remaining after stop word removal are sorted by word frequency, and a preset number of the top-ranked segments, for example the fifteen segments with the highest word frequency, are taken as the first test text.
In this embodiment, the step of preprocessing the first target text having the same identifier to obtain the first test text is the same as the step of preprocessing the second target text having the same identifier to obtain the second test text.
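A sketch of the preprocessing that turns the target texts under one label into a test text: jieba segmentation, stop word removal, then the most frequent segments. The tiny stop word set and the cutoff of 15 are illustrative assumptions standing in for the full stop word lists described above:

```python
from collections import Counter
import jieba

STOP_WORDS = {"的", "了", "和", "在", "是"}   # stand-in for the general + domain stop word lists

def build_test_text(target_texts, top_n=15):
    counts = Counter()
    for text in target_texts:
        for word in jieba.lcut(text):                    # segmentation
            if word.strip() and word not in STOP_WORDS:  # stop word removal
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_n)]  # word-list form of the test text
```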
S15, obtaining a fifth similarity between the first target text and the first test text, and obtaining a sixth similarity between the second target text and the second test text.
As an optional implementation manner, for the first target text, based on the constructed vector-text corresponding dictionary, the corresponding sentence vector of the first target text is obtained, and of course, the corresponding sentence vector of the first target text may also be obtained through the first encoder.
As an optional implementation manner, for the first test text, mapping each word segment in the first test text into a word vector by the first encoder in turn, and then carrying out average processing on all the obtained word vectors to obtain a corresponding sentence vector of the first test text.
In this embodiment, from the first target texts under the same identifier, a first test text and at least one first target text are obtained, and the inner product of the sentence vector of each first target text and the sentence vector of the first test text is then calculated as the fifth similarity between that first target text and the first test text.
In this embodiment, the step of acquiring the fifth similarity of the first target text and the first test text is the same as the step of acquiring the sixth similarity of the second target text and the second test text.
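A sketch of the fifth similarity: the test text's sentence vector is the mean of its word vectors, and the similarity with each target text is an inner product of normalized vectors. The word vectors here are random stand-ins for what the first encoder would produce:

```python
import numpy as np

def test_text_vector(word_vectors):
    """Average the word vectors of the test text's segments into one sentence vector."""
    vec = word_vectors.mean(axis=0)
    return vec / np.linalg.norm(vec)

def fifth_similarities(target_vecs, test_vec):
    """Inner product of each (normalized) target text vector with the test text vector."""
    return target_vecs @ test_vec

# usage with stand-in vectors
word_vecs = np.random.randn(15, 768)        # one vector per segment of the test text
target_vecs = np.random.randn(200, 768)
target_vecs /= np.linalg.norm(target_vecs, axis=1, keepdims=True)
print(fifth_similarities(target_vecs, test_text_vector(word_vecs)).shape)  # (200,)
```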
S16, evaluating the first encoder based on the fifth similarity and evaluating the second encoder based on the sixth similarity.
As an optional implementation, box plots are drawn from the fifth similarities. Taking the first target texts as case texts as an example, as shown in fig. 8, 28 box plots are obtained for the 28 labels, and the first encoder is evaluated by observing whether the distribution of the box plots matches the ideal distribution.

For example, if the text vectors produced by the first encoder express the semantics sufficiently and the label words are well chosen, there should not be too many outliers and the boxes should be relatively concentrated; otherwise, the first encoder is insufficiently trained. As an optional implementation, the box plots can also be used to judge whether the labels are well chosen; for example, too many outliers indicate that few texts fit the label word, i.e., the label word is not a good choice. Of course, this embodiment is not limited to using box plots to evaluate the first encoder; other visualization methods may also be employed.
In the present embodiment, the step of evaluating the first encoder based on the fifth similarity is the same as the step of evaluating the second encoder based on the sixth similarity.
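A sketch of the box-plot evaluation: one box per label, built from the fifth similarities between that label's target texts and its test text. The similarity values here are random stand-ins for illustration only:

```python
import numpy as np
import matplotlib.pyplot as plt

labels = [f"label_{i + 1}" for i in range(28)]
# stand-in fifth similarities: 200 target texts per label, values in (0, 1)
sims_per_label = [np.random.beta(5, 2, size=200) for _ in labels]

plt.figure(figsize=(12, 4))
plt.boxplot(sims_per_label)
plt.xticks(range(1, len(labels) + 1), labels, rotation=90)
plt.ylabel("similarity to test text")
plt.tight_layout()
plt.savefig("encoder_evaluation_boxplots.png")
```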
This embodiment has the following beneficial effects: for a first text set and a second text set that have relevance within the same field range, the tag text set is used to match them. Specifically, each tag text in the tag text set has a unique identifier; the first encoder in the text matching model obtains the similarity between any first text in the first text set and any tag text, and the second encoder obtains the similarity between any second text in the second text set and any tag text; the first target texts and the second target texts matched with each tag text are obtained according to these similarities and labeled based on the identifier, and the first target texts and the second target texts are then matched through the same identifier.
In this embodiment, a separate encoder is trained for each text set to process that set, so the semantics of the texts of different text sets can be expressed more fully and the subsequent matching accuracy of the text sets is improved. In addition, the text sets are not matched directly; instead, the tag texts are matched against the texts of the different text sets, and the texts of different text sets are then matched using the same identifier, which solves the problem that texts are difficult to align when text sets are matched directly and further reduces the matching difficulty of the text sets.
Example 2
The present embodiment provides a text set matching apparatus, as shown in fig. 9, which includes a first obtaining module 21, a second obtaining module 22, and a matching module 23.
The first obtaining module 21 is configured to obtain a tag text set, a first text set and a second text set. In this embodiment, the first text set and the second text set belong to texts having relevance within the same field range; the tag text set includes a plurality of tag texts, the first text set includes a plurality of first texts, and the second text set includes a plurality of second texts.
The field range is used to represent the application scope of the first text set and the second text set, for example, a question text set and an answer text set in the field of tourist services, or a case text set and a legal provision text set in the field of basic-level social governance. Although the first text set and the second text set belong to the same field range, different text sets may belong to different application scenarios, so the same expression may have different semantics in different scenarios. It should be noted that there may be two or more text sets having relevance within the same field range, and this embodiment is not limited to matching only two text sets; for example, in the field of tourism customer service, the customer question text set, the customer service answer text set and the cited policy text set may be matched pairwise.
In this embodiment, the labels refer to categories included in the fields to which the first text set and the second text set belong, for example, in the field of basic social management, including education, land feature removal, inspection and supervision, road traffic, and the like, and these category words are labeled as labels. Each tag text is used to represent only a unique tag (i.e., label), and as an alternative embodiment, the tag text may be a single category word directly, or the single category word may be embedded into the tag text formed by the template sentence. In this embodiment, the appropriateness of the tag will directly affect the matching of the text sets, requiring simultaneous consideration of the first text set and the second text set.
For convenience of explanation, the basic social management field is explained below, wherein the first text set is a case text set, and the second text set is a French text set. In this embodiment, the case text set includes eleven-ten-five thousands of historical cases related to the base social management field obtained from historical case data sets of still, majors, lishui and the like, the legal text set includes legal data of related laws in the base social management field, and both the case text set and the legal text set are saved in CSV format. As an optional implementation manner, the labels covered in the label text set are mainly a plurality of case types obtained according to the case text in the basic social management field, and then the rule categories in the existing rule classification system are combined to select the rule category corresponding to the case type as the label. The French classification system can be a French classification system of North Dafabao, and can cover most of case types of the case text set.
The embodiment finally concludes 28 categories as labels, specifically, the main case types included in the social management field are: political, ecological, urban and rural construction, labor and social security, scientific and information industry, market supervision, rural agriculture, economic management, transportation, natural resources, health, education, civil and emergency, party government affairs, civilian travel, army affairs, organizational personnel, discipline inspection, marital home disputes, damage reimbursement disputes, property disputes, other labor disputes, other contract disputes, mountain land disputes, neighborhood disputes, banking disputes, road traffic accident disputes, house homeland disputes, civil loan disputes, symptoma removal disputes, medical disputes, delinquent peasant work disputes, environmental pollution disputes, production management disputes, and the like.
According to these case types and the legal categories under the relevant regulations, the first obtaining module 21 selects the following 28 categories as labels: farmers, public places and environmental sanitation, old, young, women and young disability protection, education, administrative and service charge management, labor safety and labor protection, refund soldier placement, base layer election, counterfeit and inferior goods, medical care, relocation, disassembly and placement, support, foster and maintenance, poverty relief, malnutrition, brisk, legacy, retirement and retirement, household and identity card, regulation-breaking building, traffic safety management, real estate rights management, fire control management, administrative penalty and administrative resolution, contract fulfillment, control complaints, inspection and reporting, food sanitation, tax invoice, house borrowing and civil complaint.
Since the case texts and the legal texts in this embodiment are in the form of sentences while a single label is in the form of a word, and the subsequent processing needs to calculate the similarity between the tag texts and the case texts and legal texts, the first obtaining module 21 generally embeds each single label into template sentences to obtain the tag texts, and thereby constructs the tag text set, so as to eliminate interference. It should be noted that these template sentences must not themselves contain any label, so that the above 28 label words are not disturbed. Taking "farmers" as an example, the tag texts may be "this text belongs to farmers", "this text describes farmers", "this text relates to farmers", and the like.
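As a minimal sketch, assuming a small illustrative subset of the labels and a few neutral template sentences (both assumed here, not taken verbatim from this embodiment), the tag text set can be constructed as follows:

```python
# Illustrative sketch: embed each label into neutral template sentences to form
# tag texts; each tag text carries exactly one label. The label list and
# templates below are assumptions for illustration only.
LABELS = ["farmers", "food sanitation", "traffic safety management"]
TEMPLATES = [
    "this text belongs to {}",
    "this text describes {}",
    "this text relates to {}",
]

def build_tag_texts(labels, templates):
    """Return (tag_text, label) pairs forming the tag text set."""
    return [(t.format(label), label) for label in labels for t in templates]

tag_text_set = build_tag_texts(LABELS, TEMPLATES)
```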
The second obtaining module 22 is configured to input the tag text set, the first text set, and the second text set into a text matching model, so as to obtain a first similarity between any one tag text and any one first text, and a second similarity between any one tag text and any one second text.
Referring to fig. 2, the text matching model includes a first encoder and a second encoder; the first encoder takes the tag text set and the first text set as inputs and takes the first similarity as an output; the second encoder takes as input the tag text set and the second text set, and as output the second similarity.
In this embodiment, the actual process of outputting the first similarity between any one of the tag texts and any one of the first texts includes: and encoding the tag text and the first text through a first encoder to obtain a first tag text vector corresponding to the tag text and a first text vector corresponding to the first text, and calculating the similarity according to the first tag text vector and the first text vector.
Similarly, the actual process of outputting the second similarity between any one of the tag texts and any one of the second texts includes: encoding the tag text and the second text through the second encoder to obtain a second tag text vector corresponding to the tag text and a second text vector corresponding to the second text, and calculating the similarity from the second tag text vector and the second text vector.
As an alternative implementation manner, the second obtaining module 22 normalizes the norms of the output first tag text vectors, second tag text vectors, first text vectors and second text vectors, and constructs a corresponding vector-to-text dictionary so that each sentence vector can later be mapped back to its text. The first tag text vector and the second tag text vector can be mapped back to the same tag text; in other words, the same tag text can be used to connect a first text and a second text, thereby establishing a matching relationship between them.
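A minimal sketch of this step is given below, assuming a hypothetical encode_fn that wraps one of the trained encoders and returns one sentence vector per input text; normalization makes the inner product equal to the cosine similarity.

```python
# Sketch only: normalize sentence vectors, build the vector-to-text dictionary,
# and compute tag-text / text similarities as inner products. encode_fn is a
# hypothetical wrapper around the first or second encoder.
import numpy as np

def l2_normalize(vectors):
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def similarities_and_dictionary(tag_texts, texts, encode_fn):
    tag_vecs = l2_normalize(encode_fn(tag_texts))    # shape (T, d)
    text_vecs = l2_normalize(encode_fn(texts))       # shape (N, d)
    vec_to_text = {vec.tobytes(): txt for vec, txt in zip(text_vecs, texts)}
    sims = tag_vecs @ text_vecs.T                    # (T, N) cosine similarities
    return sims, vec_to_text
```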
In this embodiment, as shown in fig. 9, the text set matching device further includes a training module 201, where the training module 201 is configured to train the text matching model. Referring to fig. 10, the training module 201 includes a second obtaining unit 2011, an encoding unit 2012, a construction unit 2013, a third obtaining unit 2014, and a training unit 2015.
The second obtaining unit 2011 is configured to obtain a first training text and a second training text. In this embodiment, the first training text is specifically a case text, and the second training text is specifically a legal text. As an alternative implementation manner, the first training text is randomly extracted from the case text set, and the second training text is randomly extracted from the legal text set.
The encoding unit 2012 is configured to input the first training text and the second training text into a neural network model; a first initial encoder in the neural network model obtains a first encoding vector based on the first training text, and a second initial encoder obtains a second encoding vector based on the second training text.
As an alternative implementation manner, the neural network model adopts the ESimCSE model (a contrastive sentence-embedding neural network model) as its basic structure, and the original single encoder is modified into two encoders on this basis; the modification is not limited to two encoders, and more than two encoders may be used if there are more than two text sets.
As an alternative implementation manner, the first initial encoder and the second initial encoder both use the pre-trained model bert_base_Chinese (a Chinese pre-trained language model), whose parameters are 12 layers, a hidden-layer dimension of 768, and 12 attention heads; the sentence vectors encoded by the pre-trained model are likewise 768-dimensional, and the vocabulary size is 21128.
As an alternative embodiment, the encoding unit 2012 performs the whole preprocessing of the training text directly inside the model. Specifically, a tokenizer structure is added to the bert_base_Chinese model; the tokenizer performs word segmentation on the input text, inserts a [CLS] token at the beginning of the token sequence and a [SEP] token at the end of each segment, where the [CLS] token is used to represent and fuse the overall features of the whole text and the [SEP] token is used to represent and fuse the overall features of each segment.
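For illustration only, and assuming the publicly available Hugging Face transformers implementation of bert-base-chinese (which may differ in detail from the tokenizer structure described in this embodiment), the insertion of [CLS] and [SEP] looks as follows:

```python
# Hedged illustration of [CLS]/[SEP] insertion using the public
# "bert-base-chinese" tokenizer; an assumption, not this embodiment's exact code.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoded = tokenizer("非法占用耕地建房", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
# e.g. ['[CLS]', '非', '法', '占', '用', '耕', '地', '建', '房', '[SEP]']
```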
In this embodiment, the bert_base_Chinese model converts the processed segments and the inserted tokens into corresponding embedded vectors; as an optional implementation manner, the segments and tokens are converted into word Embeddings (embedding vectors) based on an embedding method, which is a mature technology in natural language processing.
In this embodiment, the text vector may be represented in a variety of ways. As an alternative implementation, the first-last-avg pooling (average of the first layer and the last layer) of the bert_base_Chinese model is selected, that is, all word vector representations of the first hidden layer and the last hidden layer of the model are averaged to serve as the sentence vector (i.e., text vector) of the text. Of course, other pooling methods may also be selected in this embodiment, for example averaging all word vectors of the last hidden layer to obtain the sentence vector representation, or using the encoding vector corresponding to [CLS] as the sentence vector representation.
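A minimal sketch of first-last-avg pooling is shown below, assuming the Hugging Face BertModel with output_hidden_states=True; treating hidden_states[1] as the first transformer layer and hidden_states[-1] as the last is an indexing assumption, since hidden_states[0] is the embedding output.

```python
# Sketch of first-last-avg pooling over bert-base-chinese (assumed implementation).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

def first_last_avg(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    first, last = out.hidden_states[1], out.hidden_states[-1]   # (B, L, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)               # ignore padding positions
    summed = ((first + last) / 2 * mask).sum(dim=1)
    return summed / mask.sum(dim=1)                             # (B, 768) sentence vectors
```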
In this embodiment, the encoding unit 2012 inputs the embedded vector of the text obtained by the above processing to the core structure of the bert_base_Chinese model, i.e. the encoder, to obtain the encoded vector of the corresponding text. As an alternative embodiment, the model uses a batch size of 32, so the data format of the encoded vector is (32, 768).
As an alternative embodiment, the first initial encoder and the second initial encoder may use Dropout (random inactivation) technology to adjust the model infrastructure, thereby performing data enhancement on the first training text and the second training text. Because of the randomness of Dropout, encoding the same text twice with the encoder yields two different text vectors (i.e., encoding vectors), from which positive samples and negative samples can be constructed for contrastive learning training. For example, the enhancement may also randomly repeat some words so that the two copies of a text differ in length; by then increasing the similarity between the two encoding vectors of the same text in the subsequent contrastive training, text length is prevented from becoming an interference factor in semantic expression.
In this embodiment, the encoding unit 2012 may input the first training text (i.e., the case text) into the first initial encoder multiple times to obtain multiple different first encoding vectors, and then compare them pairwise, i.e., select the first encoding vectors obtained from any two passes through the first initial encoder for contrastive learning training.
Similarly, the encoding unit 2012 may input the second training text (i.e., the legal text) into the second initial encoder multiple times to obtain multiple different second encoding vectors, and then select the second encoding vectors obtained from any two passes through the second initial encoder for contrastive learning training.
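A minimal sketch of the two-pass encoding is given below; encode_fn is a hypothetical wrapper mapping a batch of texts to sentence vectors, and dropout must remain active so the two passes differ.

```python
# Sketch only: two forward passes of the same batch with Dropout active yield
# the text vectors X and Y used as a positive pair for contrastive training.
def two_views(encode_fn, model, texts):
    model.train()              # keep Dropout randomness enabled
    z1 = encode_fn(texts)      # first pass  -> text vectors X
    z2 = encode_fn(texts)      # second pass -> text vectors Y
    return z1, z2
```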
The construction unit 2013 is configured to construct a first tag value from any two first encoding vectors and a second tag value from any two second encoding vectors.
Taking the construction of the first label value from any two first encoding vectors as an example, the construction unit 2013 takes the two encoding vectors of the same first training text as a positive example text pair, i.e., the first label value is "1", and takes the encoding vectors of that first training text and of the other first training texts in the same batch as negative example text pairs, i.e., the first label value is "0". As an alternative embodiment, the total number of negative text pairs in a batch is set to 160.
In this embodiment, the steps of constructing the first tag value from any two first encoded vectors and constructing the second tag value from any two second encoded vectors are the same.
The third obtaining unit 2014 is configured to obtain a third similarity of any two first encoding vectors and a fourth similarity of any two second encoding vectors.
In this embodiment, similarity is selected as the measure of semantic proximity between texts, and the similarity is mainly cosine similarity. As an alternative implementation, adding a step of normalizing the encoding vectors makes calculating the cosine similarity equivalent to directly calculating the inner product of the two vectors, i.e., the angle between the two vectors can be determined from the inner product. Cosine similarity is adopted because it has a clear geometric meaning (the angle between two vectors under a standard orthonormal basis) and is convenient to compute and illustrate: denoting the two vectors by x and y, x·y = |x||y|cos θ, where x·y is the inner product, θ is the angle between the vectors, and cos θ is the cosine similarity.
Taking the third similarity of any two first encoding vectors as an example, referring to fig. 4, assume there are N first training texts; inputting them into the first initial encoder the first time yields the first encoding vectors as text vectors X, and inputting them the second time yields the first encoding vectors as text vectors Y. The third obtaining unit 2014 performs inner-product operations between the N text vectors X and the N text vectors Y to obtain the third similarity of any text vector X and any text vector Y. For example, with two first training samples, the third similarities obtained are x1·y1, x1·y2, x2·y1 and x2·y2.
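A minimal sketch, assuming the vectors are stacked into two matrices X and Y of shape (N, d): the pairwise inner products of the normalized rows give the N×N cosine-similarity matrix used below.

```python
# Sketch of the third-similarity computation: entry (i, j) is the cosine
# similarity between text vector x_i (first pass) and y_j (second pass).
import torch
import torch.nn.functional as F

def pairwise_cosine(X, Y):
    X = F.normalize(X, dim=-1)
    Y = F.normalize(Y, dim=-1)
    return X @ Y.T   # (N, N) similarity matrix; with N=2: x1·y1, x1·y2, x2·y1, x2·y2
```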
In this embodiment, the step of acquiring the third similarity of any two first encoded vectors and the step of acquiring the fourth similarity of any two second encoded vectors are the same.
The training unit 2015 is configured to train the neural network model based on the first label value, the third similarity, the second label value, and the fourth similarity to generate a text matching model.
The text matching model includes a first encoder and a second encoder, and in this embodiment, the process of training the neural network model by the training unit 2015 actually includes: training a first initial encoder based on the first tag value and the third similarity to obtain a first encoder; training the second initial encoder based on the second tag value and the fourth similarity to obtain a second encoder.
In this embodiment, the training unit 2015 trains by increasing the cosine similarity between the two encoded vectors of the positive text pair and decreasing the cosine similarity between the two encoded vectors of the negative text pair.
As an alternative implementation manner, the loss function is designed as a cross-entropy loss. The training unit 2015 substitutes the first label value and the third similarity into the cross-entropy loss function to obtain the cross-entropy loss of the first initial encoder, substitutes the second label value and the fourth similarity into the cross-entropy loss function to obtain the cross-entropy loss of the second initial encoder, and iteratively trains to minimize both losses; when the training of the first initial encoder and the second initial encoder is completed, the machine learning model at that point is the trained text matching model.
As an alternative implementation manner, the training unit 2015 may preset a relatively small loss threshold and determine that the first initial encoder and the second initial encoder are trained when the cross-entropy loss falls below this threshold; alternatively, it may preset a relatively large maximum number of iterations and determine that the first initial encoder and the second initial encoder are trained when the number of training iterations exceeds this maximum.
As an alternative embodiment, the training unit 2015 may construct an identity matrix from the first label values for auxiliary training. For example, when the first initial encoder is trained based on the first label value and the third similarity, as shown in fig. 4, the N text vectors X and the N text vectors Y yield a similarity matrix of dimension N×N, in which the elements on the diagonal from the upper left corner to the lower right corner are exactly the cosine similarities between the two encoding vectors of each positive text pair. The identity matrix has the property that the elements on this diagonal are all 1 and all other elements are 0, and any matrix multiplied by the identity matrix equals itself; the identity matrix can therefore be used as the label matrix to drive the values on the diagonal of the similarity matrix toward their maximum.
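A minimal sketch of this training step is shown below: the rows of the N×N similarity matrix are treated as logits and the identity labels point at the diagonal, so cross-entropy raises positive-pair similarities and lowers negative-pair similarities; the temperature value is an assumption, not taken from this embodiment.

```python
# Sketch of the contrastive cross-entropy loss over the similarity matrix.
import torch
import torch.nn.functional as F

def contrastive_loss(sim_matrix, temperature=0.05):
    logits = sim_matrix / temperature                                    # (N, N) scaled similarities
    labels = torch.arange(sim_matrix.size(0), device=sim_matrix.device)  # diagonal targets
    return F.cross_entropy(logits, labels)
```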
In this embodiment, the step of training the first initial encoder based on the first tag value and the third similarity is the same as the step of training the second initial encoder based on the second tag value and the fourth similarity.
The matching module 23 is configured to match the first text set and the second text set according to the first similarity and the second similarity.
In this embodiment, the process by which the matching module 23 matches the first text set and the second text set according to the first similarity and the second similarity actually includes: matching the tag text set with the first text set according to the first similarity, so that the label corresponding to each tag text is annotated on the first texts in the first text set; matching the tag text set with the second text set according to the second similarity, so that the label corresponding to each tag text is annotated on the second texts in the second text set; and finally matching the first text set with the second text set through the annotated labels. Referring to fig. 11, the matching module 23 includes a first obtaining unit 231 and a matching unit 232.
As an alternative implementation manner, the matching module 23 may select the texts whose text vectors rank highest in the first similarity and in the second similarity.
The first obtaining unit 231 is configured to obtain, for each tag text, a first number of first target texts ranked highest by the first similarity from the first text set, and a second number of second target texts ranked highest by the second similarity from the second text set.
As an alternative embodiment, the first number and the second number may be set to be the same or may be set to be different.
Taking the selection of a first number of first target texts ranked highest by the first similarity from the first text set as an example: assume the first number is three and the first text set contains 10 first texts, giving 10 first-similarity values between the first texts and a given tag text; the 10 first texts are ranked from high to low similarity, and the first obtaining unit 231 takes the top three as the first target texts.
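A minimal sketch of this top-k selection, assuming sims is a one-dimensional array of similarities between one tag text and every text in the set (both names are illustrative):

```python
# Sketch only: pick the k texts with the highest similarity to a given tag text.
import numpy as np

def top_k_texts(sims, texts, k=3):
    order = np.argsort(sims)[::-1][:k]   # indices of the k largest similarities
    return [(texts[i], float(sims[i])) for i in order]
```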
In the present embodiment, the first acquiring unit 231 acquires a first number of first target texts having a first similarity rank ahead from the first text set and a second number of second target texts having a second similarity rank ahead from the second text set in the same step.
The matching unit 232 is configured to annotate the first target text and the second target text based on the identification, so as to match the first target text and the second target text.
The identifications are the unique labels contained in the tag texts of this embodiment. It should be noted that different tag texts may carry the same label; for example, if the tag texts are "this text belongs to consumer rights protection", "this text describes consumer rights protection" and "this text relates to consumer rights protection", the label corresponding to all three is "consumer rights protection". In addition, each first target text and each second target text may correspond to several different labels; for example, a first target text complaining that a merchant uses stale ingredients to make food may correspond to both "food sanitation" and "consumer rights protection".
As an alternative implementation manner, the matching unit 232 establishes a matching dataset by bringing together the first target texts and the second target texts annotated with the same label, thereby fusing the two different text sets, i.e., re-classifying by label and obtaining a matching dataset under each label. For example, a digital government can establish case-law matching datasets under different categories from the case text set and the legal text set in the field of basic social management, laying a foundation for subsequently using the correspondence between cases and legal provisions to build an intelligent basic social management system.
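A minimal sketch of this fusion step, assuming each selected target text is stored as a (text, label) pair (a data layout chosen here for illustration):

```python
# Sketch only: group annotated target texts from both sets by shared label,
# giving one matching dataset per label.
from collections import defaultdict

def build_matching_dataset(first_targets, second_targets):
    """first_targets / second_targets: iterables of (text, label) pairs."""
    dataset = defaultdict(lambda: {"first": [], "second": []})
    for text, label in first_targets:
        dataset[label]["first"].append(text)
    for text, label in second_targets:
        dataset[label]["second"].append(text)
    return dict(dataset)
```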
As an alternative embodiment, the first obtaining unit 231 may instead select the texts whose text vectors have a first similarity or second similarity greater than a certain threshold value.
The first obtaining unit 231 is configured to obtain, for each tag text, a first target text with a first similarity greater than a first threshold value from a first text set, and a second target text with a second similarity greater than a second threshold value from a second text set.
As an alternative embodiment, the first threshold value and the second threshold value may be set to be the same or may be set to be different.
The matching unit 232 is configured to annotate the first target text and the second target text based on the identification, so as to match the first target text and the second target text.
As an alternative implementation manner, the selection condition may require that the similarity be greater than a preset threshold value and that the number of selected texts reach a preset number at the same time; if the number of texts whose similarity exceeds the threshold does not reach the preset number, the condition on the number of selected texts takes priority, and the texts ranked highest by similarity are additionally selected.
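A minimal sketch of this combined rule, with an assumed threshold and minimum count (both values are illustrative, not taken from this embodiment):

```python
# Sketch only: keep texts above the similarity threshold; if too few qualify,
# fall back to the highest-ranked texts so the preset number is still reached.
import numpy as np

def select_targets(sims, texts, threshold=0.6, min_count=3):
    order = np.argsort(sims)[::-1]                       # indices, highest similarity first
    above = [i for i in order if sims[i] > threshold]
    chosen = above if len(above) >= min_count else list(order[:min_count])
    return [texts[i] for i in chosen]
```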
In this embodiment, whether the encoders can express the semantic information of the texts well directly affects the similarity calculation between different texts and plays a decisive role in the construction of the matched text sets. As shown in fig. 9, the text set matching device therefore further includes a processing module 24, a third obtaining module 25, and an evaluation module 26.
The processing module 24 is configured to preprocess the first target texts with the same identification to obtain a first test text, and to preprocess the second target texts with the same identification to obtain a second test text. In this embodiment, the same identification means the same label; taking the basic social management field as an example, assuming that the first target texts are case texts and that the 28 categories listed above (from farmers through to house lease and civil litigation) are used as labels, the processing module 24 obtains, for each label, the case texts annotated with that label.
As an alternative implementation manner, the processing module 24 may obtain the same preset number of first target texts under each annotated label; for example, if the preset number is two hundred, it obtains two hundred case texts annotated with the "farmer" label, two hundred case texts annotated with the "public places and environmental sanitation" label, and so on. Alternatively, the processing module 24 may obtain all the first target texts under each annotated label, regardless of how many texts fall under that label.
In this embodiment, the processing module 24 preprocesses the obtained first target texts with the same identification. As an optional implementation manner, the preprocessing specifically includes word segmentation, removal of stop words and further filtering of the segmented words, and the resulting words are arranged into a word list, which serves as the first test text.
As an alternative implementation manner, the processing module 24 may use jieba (a Chinese word segmentation tool) for word segmentation, and a custom dictionary for the field of the first target texts may be added to jieba so that the segmentation results are more accurate.
In this embodiment, the segmentation sequence obtained through word segmentation contains some common words that are of little use for the final prediction; to extract semantics more efficiently, these words, known as stop words, need to be removed from the segmentation sequence. Optionally, the processing module 24 removes all stop words from the whole segmentation sequence based on a common stop-word list and a special stop-word list, reducing the interference of redundant information. Many mature and complete stop-word lists are currently available online, covering common stop words such as conjunctions and prepositions, as well as frequently removed proper nouns such as times, places and personal names. Of course, different fields also have their own special stop words, for which a special stop-word list can be constructed, for example covering professional terms such as "letters" in case texts, or words whose word frequency is lower than a preset count (the word frequency being the frequency with which each word appears in the first test text); such words have no practical effect in distinguishing specific case texts.
As an alternative implementation manner, the processing module 24 sorts the segmented words remaining after stop-word removal by word frequency and takes a preset number of top-ranked words, for example the fifteen words with the highest word frequency, as the first test text.
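A minimal sketch of this preprocessing chain is given below; the stop-word set and the cutoff of fifteen words are illustrative assumptions.

```python
# Sketch only: jieba segmentation, stop-word removal, then keep the most
# frequent words as the test text.
from collections import Counter
import jieba

STOP_WORDS = {"的", "了", "在", "是", "和"}   # a full common/special stop-word list would be loaded here

def build_test_text(target_texts, top_n=15):
    words = []
    for text in target_texts:
        words.extend(w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS)
    freq = Counter(words)
    return [w for w, _ in freq.most_common(top_n)]   # word list used as the test text
```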
In this embodiment, the processing module 24 performs preprocessing on the first target text with the same identifier to obtain the first test text, and performs preprocessing on the second target text with the same identifier to obtain the second test text.
The third obtaining module 25 is configured to obtain a fifth similarity between the first target text and the first test text, and obtain a sixth similarity between the second target text and the second test text. As an optional implementation manner, for the first target text, the third obtaining module 25 obtains the corresponding sentence vector of the first target text based on the constructed vector-text corresponding dictionary, and of course, the corresponding sentence vector of the first target text may also be obtained through the first encoder.
As an alternative implementation manner, for the first test text, the third obtaining module 25 maps each word segment in the first test text into a word vector by the first encoder in turn, and then averages all the obtained word vectors to obtain the corresponding sentence vector of the first test text.
In this embodiment, one first test text and at least one first target text are obtained under the same identification; the third obtaining module 25 then calculates the inner product of the sentence vector of each first target text and the sentence vector of the first test text as the fifth similarity between that first target text and the first test text.
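A minimal sketch of this step, assuming target_vectors holds the (already normalized) sentence vectors of the first target texts and test_word_vectors holds the word vectors of the first test text (both names are illustrative):

```python
# Sketch only: the test-text sentence vector is the mean of its word vectors;
# each target text is compared with it by inner product (fifth similarity).
import numpy as np

def fifth_similarities(target_vectors, test_word_vectors):
    test_vec = np.mean(test_word_vectors, axis=0)
    test_vec /= np.linalg.norm(test_vec) + 1e-12
    return [float(v @ test_vec) for v in target_vectors]
```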
In the present embodiment, the step of acquiring the fifth similarity of the first target text and the first test text by the third acquisition module 25 is the same as the step of acquiring the sixth similarity of the second target text and the second test text.
The evaluation module 26 is configured to evaluate the first encoder based on the fifth similarity and the second encoder based on the sixth similarity.
As an alternative implementation manner, the evaluation module 26 draws the fifth similarities as box plots. Assuming the first target texts are case texts, the specific results are shown in fig. 8: for the 28 identifications, 28 box plots are obtained, and the first encoder is evaluated by observing whether their distribution matches the expected distribution. For example, if the text vectors obtained by the first encoder express the semantics sufficiently and the label words are chosen properly, there should not be too many outliers and the boxes should be relatively concentrated; otherwise, the first encoder is insufficiently trained.
As an alternative implementation manner, the box plots may also be used to judge whether the labels are chosen properly; for example, too many outliers indicate that few texts actually fit the label word, i.e., the label word is not chosen properly. Of course, this embodiment is not limited to evaluating the first encoder with box plots, and other visualization methods may also be adopted.
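A minimal sketch of this evaluation, assuming similarities_by_label maps each of the 28 labels to its list of fifth similarities, and that matplotlib is used purely as an illustrative visualization choice:

```python
# Sketch only: one box per label; concentrated boxes with few outliers suggest
# sufficient training and well-chosen label words.
import matplotlib.pyplot as plt

def plot_similarity_boxes(similarities_by_label):
    labels = list(similarities_by_label)
    plt.figure(figsize=(12, 4))
    plt.boxplot([similarities_by_label[l] for l in labels], labels=labels)
    plt.xticks(rotation=90)
    plt.ylabel("fifth similarity")
    plt.tight_layout()
    plt.show()
```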
In the present embodiment, the evaluation module 26 evaluates the first encoder based on the fifth similarity in the same step as the second encoder based on the sixth similarity.
Example 3
The present embodiment provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor; when executing the program, the processor implements the text set matching method of embodiment 1.
The electronic device 30 shown in fig. 12 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present invention.
The electronic device 30 may take the form of a general-purpose computing device, which may be, for example, a server device. Components of the electronic device 30 may include, but are not limited to: at least one processor 31, at least one memory 32, and a bus 33 connecting the different system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus, and a control bus.
The memory 32 may include volatile memory such as Random Access Memory (RAM) 321 and cache memory 322, and may further include Read Only Memory (ROM) 323.
Memory 32 may also include a program tool 325 having a set (at least one) of program modules 324, such program modules 324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The processor 31 executes various functional applications and data processing, such as the text set matching method of embodiment 1 of the present invention, by running a computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34. Such communication may take place through an input/output (I/O) interface 35. The electronic device 30 may also communicate with one or more networks through a network adapter 36. As shown in fig. 12, the network adapter 36 communicates with the other modules of the electronic device 30 via the bus 33. It should be appreciated that, although not shown in fig. 12, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the present invention, the features and functions of two or more units/modules described above may be embodied in one unit/module; conversely, the features and functions of one unit/module described above may be further divided and embodied by a plurality of units/modules.
Example 6
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the text set matching method of embodiment 1.
More specifically, the readable storage medium may include, but is not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In an alternative embodiment, the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the text set matching method of embodiment 1 when said program product is run on the terminal device.
Wherein the program code for carrying out the invention may be written in any combination of one or more programming languages, which program code may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on the remote device or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.
Claims (10)
1. A text set matching method, characterized in that the text set matching method comprises:
acquiring a tag text set, a first text set and a second text set; the first text set and the second text set belong to texts with relevance in the same field range; the set of tab texts comprises a plurality of tab texts, the first set of texts comprises a plurality of first texts, and the second set of texts comprises a plurality of second texts; the tag text has a unique identifier;
Inputting the tag text set, the first text set and the second text set into a text matching model to obtain a first similarity of any one of the tag texts and any one of the first texts and a second similarity of any one of the tag texts and any one of the second texts; the text matching model comprises a first encoder and a second encoder; the first encoder takes the tag text set and the first text set as inputs and the first similarity as an output; the second encoder takes the tag text set and the second text set as inputs and the second similarity as an output;
for each of the tag texts, obtaining a first target text from the first text set according to the first similarity, and obtaining a second target text from the second text set according to the second similarity;
labeling the first target text and the second target text based on the identification so as to match the first target text and the second target text.
2. The text set matching method of claim 1, wherein the step of obtaining a first target text from the first text set according to the first similarity and obtaining a second target text from the second text set according to the second similarity comprises:
Obtaining a first number of first target texts with the first similarity ranking being forward from the first text set, and obtaining a second number of second target texts with the second similarity ranking being forward from the second text set;
or alternatively,
the first target text with the first similarity greater than a first threshold value is obtained from the first text set, and the second target text with the second similarity greater than a second threshold value is obtained from the second text set.
3. The text set matching method of claim 1, wherein the text matching model is obtained by training through the following steps:
acquiring a first training text and a second training text;
inputting the first training text and the second training text into a neural network model, wherein a first initial encoder in the neural network model obtains a first encoding vector based on the first training text, and a second initial encoder in the neural network model obtains a second encoding vector based on the second training text;
constructing a first tag value according to any two first coding vectors; constructing a second tag value according to any two second coding vectors;
Acquiring third similarity of any two first coding vectors and fourth similarity of any two second coding vectors;
training the neural network model based on the first tag value, the third similarity, the second tag value, and the fourth similarity to generate the text matching model.
4. The text set matching method according to claim 2, characterized in that the text set matching method further comprises:
preprocessing the first target text with the same identifier to obtain a first test text, and preprocessing the second target text with the same identifier to obtain a second test text;
obtaining a fifth similarity between the first target text and the first test text, and obtaining a sixth similarity between the second target text and the second test text;
the first encoder is evaluated based on the fifth similarity and the second encoder is evaluated based on the sixth similarity.
5. The text set matching device is characterized by comprising a first acquisition module, a second acquisition module and a matching module:
the first acquisition module is used for acquiring a tag text set, a first text set and a second text set; the first text set and the second text set belong to texts with relevance in the same field range; the set of tab texts comprises a plurality of tab texts, the first set of texts comprises a plurality of first texts, and the second set of texts comprises a plurality of second texts; the tag text has a unique identifier;
The second acquisition module is used for inputting the tag text set, the first text set and the second text set into a text matching model to obtain a first similarity between any one of the tag texts and any one of the first texts and a second similarity between any one of the tag texts and any one of the second texts; the text matching model comprises a first encoder and a second encoder; the first encoder takes the tag text set and the first text set as inputs and the first similarity as an output; the second encoder takes the tag text set and the second text set as inputs and the second similarity as an output;
the matching module comprises a first acquisition unit and a matching unit;
the first obtaining unit is used for obtaining a first target text from the first text set according to the first similarity and obtaining a second target text from the second text set according to the second similarity for each tag text;
the matching unit is used for marking the first target text and the second target text based on the identification so as to match the first target text and the second target text.
6. The text set matching device of claim 5, wherein the first obtaining unit is specifically configured to obtain a first number of first target texts with a first similarity ranking from the first text set, and obtain a second number of second target texts with a second similarity ranking from the second text set;
or alternatively,
the first obtaining unit is specifically configured to obtain, from the first text set, a first target text with the first similarity greater than a first threshold value, and obtain, from the second text set, a second target text with the second similarity greater than a second threshold value.
7. The text set matching device of claim 5, further comprising a training module for training to obtain the text matching model, the training module comprising a second obtaining unit, a coding unit, a constructing unit, a third obtaining unit, and a training unit:
the second acquisition unit is used for acquiring a first training text and a second training text;
the coding unit is used for inputting the first training text and the second training text into a preset Is a neural network model of (2) Type, a first initial encoder in the neural network model is based on the first trainingText obtaining a first encoded vector and a second initial encoder obtaining a second encoded vector based on the second training text;
the construction unit is used for constructing a first label value according to any two first coding vectors and constructing a second label value according to any two second coding vectors;
the third obtaining unit is used for obtaining a third similarity between any two first coding vectors and a fourth similarity between any two second coding vectors;
the training unit is configured to train the neural network model based on the first tag value, the third similarity, the second tag value, and the fourth similarity to generate the text matching model.
8. The text set matching device of claim 6, further comprising a processing module, a third acquisition module, and an evaluation module:
the processing module is used for preprocessing the first target text with the same identifier to obtain a first test text, and preprocessing the second target text with the same identifier to obtain a second test text;
The third obtaining module is used for obtaining a fifth similarity between the first target text and the first test text and obtaining a sixth similarity between the second target text and the second test text;
the evaluation module is configured to evaluate the first encoder based on the fifth similarity and the second encoder based on the sixth similarity.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory for execution on the processor, wherein the processor implements the text set matching method of any of claims 1-4 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the text set matching method of any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310254504.5A CN116796723B (en) | 2023-03-15 | 2023-03-15 | Text set matching method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116796723A CN116796723A (en) | 2023-09-22 |
CN116796723B true CN116796723B (en) | 2024-02-06 |
Family
ID=88033380
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310254504.5A Active CN116796723B (en) | 2023-03-15 | 2023-03-15 | Text set matching method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116796723B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111382250A (en) * | 2018-12-29 | 2020-07-07 | 深圳市优必选科技有限公司 | Question text matching method and device, computer equipment and storage medium |
CN113420128A (en) * | 2021-08-23 | 2021-09-21 | 腾讯科技(深圳)有限公司 | Text matching method and device, storage medium and computer equipment |
CN113836885A (en) * | 2020-06-24 | 2021-12-24 | 阿里巴巴集团控股有限公司 | Text matching model training method, text matching device and electronic equipment |
CN114492661A (en) * | 2022-02-14 | 2022-05-13 | 平安科技(深圳)有限公司 | Text data classification method and device, computer equipment and storage medium |
CN115221284A (en) * | 2022-07-21 | 2022-10-21 | 重庆长安汽车股份有限公司 | Text similarity calculation method and device, electronic equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109146064B (en) * | 2018-09-05 | 2023-07-25 | 腾讯科技(深圳)有限公司 | Neural network training method, device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN116796723A (en) | 2023-09-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |