CN113282709B - Text matching method, device and equipment and computer readable storage medium


Info

Publication number
CN113282709B
CN113282709B (application CN202110608724.4A)
Authority
CN
China
Prior art keywords
text
word vector
vector matrix
word
sample
Prior art date
Legal status
Active
Application number
CN202110608724.4A
Other languages
Chinese (zh)
Other versions
CN113282709A
Inventor
冯豆豆
陈庆伟
Current Assignee
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202110608724.4A
Publication of CN113282709A
Application granted
Publication of CN113282709B

Classifications

    • G06F 16/3344: Query execution using natural language analysis (information retrieval; querying of unstructured textual data)
    • G06F 16/367: Ontology (creation of semantic tools, e.g. ontology or thesauri)
    • G06F 18/22: Matching criteria, e.g. proximity measures (pattern recognition; analysing)
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars (handling natural language data; parsing)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to data processing technology and discloses a text matching method, which comprises the following steps: inputting a first word vector matrix and a second word vector matrix into a first text similarity calculation model for word-vector-based text similarity calculation, and inputting the first word vector matrix and the second word vector matrix into a second text similarity calculation model for text similarity calculation after term information mapping conversion, so as to obtain a first similarity value and a second similarity value respectively; performing weighted summation on the first similarity value and the second similarity value to obtain a final similarity value; and acquiring, from the knowledge graph library, the text to be matched corresponding to the highest final similarity value as the matching result of the input text. The invention also relates to blockchain technology: the second text similarity calculation model is stored in the blockchain. The method and the device can solve the problem in the prior art that matching results are inaccurate because the relationships between entities are ignored.

Description

Text matching method, device and equipment and computer readable storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for text matching, an electronic device, and a computer-readable storage medium.
Background
A knowledge graph contains rich knowledge: not only entities and entity attributes, but also the relationships between entities. For example, a legal knowledge graph is a knowledge graph built from legal provisions, and its distinctive characteristic is that an entity may contain a large number of words. For instance, the entity text of a certain illegal situation is: "failing to regularly and truthfully report the generation, discharge, treatment, storage, cleaning and decontamination, delivery and disposal of radioactive waste in accordance with the regulations of the environmental protection department of the State Council", which comprises 52 words in total, so the entity text is long. For such long texts, the similarity between an entity and a user text is usually measured by the number of matched words, the cosine distance of distributed text representations, and the like, and the entity with the highest similarity is taken as the matching result; commonly used distributed text representations include word vectors, sentence vectors, and so on.
These representations are trained using only plain-text information and ignore the relationships between entities, so the matching results are inaccurate. For example: entity A is "using seepage wells, seepage pits, cracks and karst caves, privately laying concealed pipes, tampering with and forging monitoring data", and entity B is "discharging water pollutants while evading supervision, for example by abnormally operating water pollution prevention and control facilities". Entity A and entity B come from the same legal provision and their penalty results are very similar, so from the legal perspective the two entities are very similar. However, from the viewpoint of the number of overlapping words, word vectors, sentence vectors, and so on, they are not similar, so common algorithms in the industry cannot accurately match the correct result.
Disclosure of Invention
The invention provides a text matching method, a text matching apparatus, an electronic device and a computer-readable storage medium, and mainly aims to solve the problem in the prior art that matching results obtained using only plain-text information ignore the relationships between entities and are therefore inaccurate.
In a first aspect, to achieve the above object, the present invention provides a text matching method, where the method includes:
performing word segmentation vectorization processing on the acquired input text and a text to be matched in a knowledge graph library to respectively obtain a first word vector matrix of the input text and a second word vector matrix of the text to be matched;
inputting the first word vector matrix and the second word vector matrix into a first text similarity calculation model to perform text similarity calculation processing based on word vectors, and inputting the first word vector matrix and the second word vector matrix into a second text similarity calculation model to perform text similarity calculation processing after term information mapping conversion, so as to obtain a first similarity value and a second similarity value respectively;
the second text similarity calculation model comprises a term information mapping conversion layer and a similarity calculation layer, wherein the term information mapping conversion layer is used for mapping and converting the first word vector matrix and the second word vector matrix into an input text distributed representation with professional term information and a to-be-matched text distributed representation with professional term information respectively, and the similarity calculation layer is used for performing text similarity calculation on the input text distributed representation with professional term information and the to-be-matched text distributed representation with professional term information;
performing weighted summation on the first similarity value and the second similarity value to obtain a final similarity value;
and acquiring, from the knowledge graph library, the text to be matched corresponding to the highest final similarity value as the matching result of the input text.
In a second aspect, in order to solve the above problem, the present invention further provides a text matching apparatus, including:
the word segmentation vectorization processing module is used for performing word segmentation vectorization processing on the acquired input text and the text to be matched in the knowledge graph library to respectively obtain a first word vector matrix of the input text and a second word vector matrix of the text to be matched;
the similarity value calculation module is used for inputting the first word vector matrix and the second word vector matrix into a first text similarity calculation model to perform text similarity calculation processing based on word vectors, and inputting the first word vector matrix and the second word vector matrix into a second text similarity calculation model to perform text similarity calculation processing after term information mapping conversion, so as to obtain a first similarity value and a second similarity value respectively;
the second text similarity calculation model comprises a term information mapping conversion layer and a similarity calculation layer, wherein the term information mapping conversion layer is used for mapping and converting the first word vector matrix and the second word vector matrix into an input text distributed representation with professional term information and a to-be-matched text distributed representation with professional term information respectively, and the similarity calculation layer is used for performing text similarity calculation on the input text distributed representation with professional term information and the to-be-matched text distributed representation with professional term information;
the weighted summation processing module is used for performing weighted summation on the first similarity value and the second similarity value to obtain a final similarity value;
and the text matching module is used for acquiring, from the knowledge graph library, the text to be matched corresponding to the highest final similarity value as the matching result of the input text.
In a third aspect, to solve the above problem, the present invention further provides an electronic apparatus, including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the steps of the text matching method described above.
In a fourth aspect, to solve the above problem, the present invention further provides a computer-readable storage medium, in which at least one instruction is stored, and the at least one instruction is executed by a processor in an electronic device to implement the text matching method described above.
In the text matching method, apparatus, electronic device and computer-readable storage medium provided by the invention, word-vector-based text similarity calculation is performed on the first word vector matrix of the input text and the second word vector matrix of the text to be matched through the first text similarity calculation model; text similarity calculation after term information mapping conversion is performed on the same two matrices through the second text similarity calculation model; the first similarity value and the second similarity value obtained by the two models are weighted and summed to obtain a final similarity value; and the matching result is obtained according to the final similarity value. The second text similarity calculation model can map term information onto the general word vectors, thereby bridging the semantic gap, present in the word-vector-based similarity calculation of the first model, between general word vectors and the knowledge graph entities of the professional field, and improving the accuracy of text matching.
Drawings
Fig. 1 is a schematic flowchart of a text matching method according to an embodiment of the present invention;
fig. 2 is a schematic block diagram of a text matching apparatus according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an internal structure of an electronic device implementing a text matching method according to an embodiment of the present invention;
the implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a text matching method. Fig. 1 is a schematic flow chart of a text matching method according to an embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the text matching method includes:
step S110, performing word segmentation vectorization processing on the acquired input text and the text to be matched in the knowledge map library to respectively obtain a first word vector matrix of the input text and a second word vector matrix of the text to be matched.
Specifically, a user enters text through a query box or a similar interface; after obtaining the user's input text, the processor acquires the texts to be matched from the knowledge graph library one by one. Word segmentation vectorization processing is performed on the texts from the two sources (the input text from the user and the texts to be matched from the knowledge graph library) to obtain the first word vector matrix of the input text and the second word vector matrix of the text to be matched respectively.
As an optional embodiment of the present invention, performing word segmentation vectorization processing on the acquired input text and the text to be matched in the knowledge graph library to obtain a first word vector matrix of the input text and a second word vector matrix of the text to be matched respectively specifically includes:
performing word segmentation on the input text and the text to be matched to respectively obtain a first word segmentation set of the input text and a second word segmentation set of the text to be matched;
removing stop words from the first word segmentation set and the second word segmentation set to obtain a first word list and a second word list respectively;
inputting the first word list and the second word list into a pre-training word vector model, and respectively obtaining a first word vector matrix and a second word vector matrix after carrying out word vectorization processing through the pre-training word vector model;
the pre-training word vector model comprises an input layer, a word vector conversion layer and an output layer, wherein the input layer is used for inputting a first word list and a second word list, the word vector conversion layer is used for comparing words in the first word list and the second word list with a pre-training vector dictionary to respectively obtain a first word vector matrix and a second word vector matrix, and the output layer is used for outputting the first word vector matrix and the second word vector matrix.
Specifically, the two texts are first segmented into words; a word segmentation tool such as jieba or jiagu can be used. After word segmentation, a first word segmentation set of the input text and a second word segmentation set of the text to be matched are obtained respectively. To prevent meaningless words from distorting the subsequent similarity calculation (for example, words carrying no semantic information, such as "la"), stop words are removed from the first word segmentation set and the second word segmentation set to obtain a first word list and a second word list respectively. The words in the first word list and the second word list are then converted into corresponding word vectors through the pre-trained word vector model, thereby obtaining the first word vector matrix and the second word vector matrix respectively.
In the process of converting the words in a word list into word vectors, taking the first word list as an example, the pre-trained word vector model stores a pre-trained vector dictionary whose keys are words and whose values are word vectors, i.e. of the form {word: word_vector}. Each word in the first word list is looked up in the pre-trained vector dictionary to retrieve its word vector, yielding a word vector matrix. This produces an n × d-dimensional word vector matrix, where n is the number of words and d is the dimension of the word vector. The expression is: M = PM(word_list), where PM is the pre-trained word vector model, word_list is the word list, and M is the resulting word vector matrix.
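For concreteness, the following is a minimal Python sketch of this segmentation-and-lookup step, assuming jieba for word segmentation; the stop-word set and the vector dictionary here are illustrative placeholders, not values specified by the patent.

```python
# A sketch of word segmentation vectorization: segment, drop stop words,
# then look up each word in the pre-trained vector dictionary.
import jieba
import numpy as np

STOP_WORDS = {"la", "的", "了"}   # illustrative stop words with no semantics

def text_to_matrix(text: str, vector_dict: dict) -> np.ndarray:
    """Return the n x d word vector matrix M = PM(word_list) for one text.

    vector_dict is the pre-trained {word: word_vector} dictionary; words
    absent from it are skipped along with stop words.
    """
    word_list = [w for w in jieba.lcut(text)
                 if w not in STOP_WORDS and w in vector_dict]
    return np.stack([vector_dict[w] for w in word_list])  # shape (n, d)
```

Applied to the input text and to each text to be matched, this yields the first and second word vector matrices respectively.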
Step S120, inputting the first word vector matrix and the second word vector matrix into a first text similarity calculation model to perform text similarity calculation processing based on word vectors, and inputting the first word vector matrix and the second word vector matrix into a second text similarity calculation model to perform text similarity calculation processing after term information mapping conversion to respectively obtain a first similarity value and a second similarity value;
the second text similarity calculation model comprises a term information mapping conversion layer and a similarity calculation layer, wherein the term information mapping conversion layer is used for mapping and converting the first word vector matrix and the second word vector matrix into an input text distributed representation with professional term information and a to-be-matched text distributed representation with professional term information respectively, and the similarity calculation layer is used for performing text similarity calculation on the input text distributed representation with professional term information and the to-be-matched text distributed representation with professional term information.
Specifically, the word-vector-based text similarity of the first word vector matrix and the second word vector matrix can be calculated using the first text similarity calculation model. However, the first similarity value obtained by this model only measures the similarity of the two texts (the input text and the text to be matched) at the literal level; for long texts, or texts whose substantive meaning as professional terminology differs from their literal meaning, the matching accuracy is low. For example, entity A "uses seepage wells, seepage pits, cracks and karst caves, privately lays concealed pipes, tampers with and forges monitoring data" and entity B "discharges water pollutants while evading supervision, for example by abnormally operating water pollution prevention and control facilities" come from the same legal provision and their penalty results are very similar, so from the legal perspective the two are very similar. However, from the viewpoint of the number of overlapping words, word vectors, sentence vectors, and so on, they are not similar, so common algorithms in the industry cannot accurately match the correct result. Therefore, after term information mapping conversion is performed on the first word vector matrix and the second word vector matrix by the second text similarity calculation model, text similarity calculation is performed on the two distributed representations carrying term information to obtain the second similarity value, which improves the matching accuracy.
As an alternative embodiment of the present invention, the first text similarity calculation model includes:
the system comprises an input layer used for inputting a first word vector matrix and a second word vector matrix, a distributed text conversion layer used for respectively carrying out text distributed conversion processing on the first word vector matrix and the second word vector matrix to obtain a distributed input text and a distributed text to be matched, a similarity calculation layer used for carrying out cosine similarity calculation processing on the distributed input text and the distributed text to be matched to obtain a first similarity value, and an output layer used for carrying out output processing on the first similarity value.
Specifically, the first text similarity calculation model performs text similarity calculation based on word vectors: the first similarity value is obtained simply by converting the first word vector matrix and the second word vector matrix into the input text distributed representation and the to-be-matched text distributed representation respectively and then performing cosine similarity calculation on the two representations. Processing with the first text similarity calculation model is therefore fast and saves computation time for the similarity value.
As an optional embodiment of the present invention, inputting the first word vector matrix and the second word vector matrix into the first text similarity calculation model to perform text similarity calculation processing based on word vectors specifically includes:
performing averaging on the first word vector matrix and the second word vector matrix to obtain an average vector of the first word vector matrix and an average vector of the second word vector matrix respectively;
taking the average vector of the first word vector matrix as the input text distributed representation and the average vector of the second word vector matrix as the to-be-matched text distributed representation, and performing similarity calculation between the two using the first cosine similarity calculation formula to obtain a first similarity value; wherein,
the first cosine similarity calculation formula is:

S_common_i = cos(v_user, v_kgi) = (v_user · v_kgi) / (|v_user| * |v_kgi|), i = 1, 2, …, m

where v_user is the distributed representation of the input text, v_kgi is the distributed representation of the i-th text to be matched, and m is the number of texts to be matched.
Specifically, when calculating the similarity of the two matrices with the first text similarity calculation model, each word vector matrix is directly averaged to obtain a 1 × d-dimensional vector. For example: assuming the input text contains 5 words and the word vector dimension of the pre-trained word vector model is 100, a 5 × 100 matrix is obtained, which is then averaged into a 1 × 100 vector. Given the distributed representation v_user of the input text and the distributed representations v_kg1, v_kg2, …, v_kgm of the m texts to be matched in the knowledge graph, compute

S_common_i = cos(v_user, v_kgi), i = 1, 2, …, m

to obtain the similarity between the user text and each of the m texts to be matched in the knowledge graph, which ranges from 0 to 1. Since only general-domain word vectors are used and no professional term information from the knowledge graph is included, this similarity can only indicate whether the input text is similar to the text to be matched from the general-domain perspective (i.e., without knowledge of a specific professional field, such as legal knowledge).
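A minimal NumPy sketch of this averaging-plus-cosine computation follows; the random matrices merely stand in for real word vector matrices.

```python
# First similarity model: average each n x d matrix to a 1 x d vector,
# then score every candidate with cosine similarity.
import numpy as np

def distributed(matrix: np.ndarray) -> np.ndarray:
    """Average an n x d word vector matrix into a d-dimensional vector."""
    return matrix.mean(axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
first_word_vector_matrix = rng.normal(size=(5, 100))   # 5 words, d = 100
candidate_matrices = [rng.normal(size=(n, 100)) for n in (7, 12)]  # m = 2

v_user = distributed(first_word_vector_matrix)
s_common = [cosine(v_user, distributed(m)) for m in candidate_matrices]
```

Note that raw cosine similarity lies in [-1, 1]; the 0-to-1 range stated above holds when the two representations do not point in opposing directions, as is typical for averaged word vectors of natural text.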
As an optional embodiment of the present invention, the second text similarity calculation model is stored in the blockchain, the second text similarity calculation model is a neural network model, and the training of the second text similarity calculation model specifically includes:
selecting a positive sample pair and a negative sample pair from a knowledge graph library, and respectively carrying out true value labeling processing on the positive sample pair and the negative sample pair; wherein, the positive sample pair comprises a text pair which is judged to be similar; the negative sample pairs comprise sample pairs which are judged to be dissimilar;
taking the positive sample pair and the negative sample pair together as a model training sample set;
randomly extracting an A sample and a B sample from a model training sample set, and performing word segmentation vectorization processing on the A sample and the B sample to respectively obtain an A sample word vector matrix and a B sample word vector matrix;
passing the A-sample word vector matrix and the B-sample word vector matrix through the term information mapping conversion layer of a pre-constructed basic neural network model with shared weights, and splicing the output vector corresponding to the last word in the front-to-back direction with the output vector corresponding to the first word in the back-to-front direction of the term information mapping conversion layer to obtain an A-sample distributed representation and a B-sample distributed representation respectively;
performing cosine similarity calculation processing on the sample distribution type A and the sample distribution type B to obtain a cosine similarity value between the sample distribution type A and the sample distribution type B;
taking a cosine similarity value between the sample distribution type A and the sample distribution type B as a predicted value, taking a result of pre-labeling the sample A and the sample B as a true value, and adopting cross entropy as a loss function of the pre-constructed basic neural network model; wherein,
the formula for the calculation of the loss function is:
pred=cos(Bi-LSTM(A),Bi-LSTM(B))
loss = -(label*log(pred)+(1-label)*log(1-pred))
wherein, label is a true value, pred is a predicted value, and loss is a loss function of a pre-constructed basic neural network model;
and according to the loss function, performing optimization training on the pre-constructed basic neural network model to obtain a second text similarity calculation model.
Specifically, the second text similarity calculation model is a knowledge-graph-based text similarity calculation model and can be divided into two stages: a model training stage and a model application stage. In the model training stage, a neural network model is trained using entity texts in the knowledge graph as samples; after the model is trained, its weight parameters are retained and used to map input texts into the semantic space of the knowledge graph. The idea is as follows: word vectors are trained on general texts, whereas the texts in the knowledge graph belong to a specific professional field, so two texts that are similar from the perspective of that professional field may not appear similar when word vectors are used directly.
The LSTM-based neural network model takes as input word vectors that do not contain professional-domain information, and outputs word vectors that contain professional-domain information. The model can be viewed as a function f(x) (where x is a word vector that does not contain terminological information, such as legal information) whose goal is to learn a mapping from general-domain word vectors to the professional domain. The function is trained by first determining which entity texts are similar and then fitting the model to the objective: the distance between similar entity texts should be small, and the distance between dissimilar entity texts should be large.
The training stage of the second text similarity calculation model first constructs a model training data set. Taking the legal field as an example, positive and negative sample pairs are selected from the legal knowledge graph: positive sample pairs comprise sample pairs that should be judged similar, and negative sample pairs comprise sample pairs that should not be judged similar. Under the same legal provision, entity texts with the same penalty result form a positive sample with truth label 1; under different legal provisions, entity texts with different penalty results form a negative sample with truth label 0. Similar sample pairs may also be labeled manually.
The model comprises a Bi-LSTM layer (i.e., the term information mapping conversion layer). An A sample and a B sample are randomly selected from the model training sample set composed of positive and negative sample pairs, and word segmentation vectorization is performed on the A sample and the B sample to obtain an A-sample word vector matrix and a B-sample word vector matrix respectively; the A-sample word vector matrix and the B-sample word vector matrix share the weights of this layer. The output vector of the last word in the front-to-back direction of the Bi-LSTM and the output vector corresponding to the first word in the back-to-front direction are concatenated to obtain the distributed representation of the text, since these two vectors respectively contain information from reading the whole sentence front to back and back to front. After the A-sample and B-sample distributed representations are obtained, their cosine similarity is calculated. The cosine similarity is taken as the predicted value, the labeled result is taken as the true value, and cross entropy is used to compute the loss. The model is then trained with a common machine learning optimization method (such as SGD or Adam) to obtain the second text similarity calculation model, which has the ability to map from the general domain to the professional knowledge graph.
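For concreteness, here is a minimal PyTorch sketch of this training procedure under the stated Bi-LSTM and cross-entropy design; the layer sizes, the clamping that keeps the cosine score inside (0, 1) for the logarithm, and all names are illustrative assumptions rather than details fixed by the patent.

```python
# Training sketch for the second model: a shared-weight Bi-LSTM maps both
# samples, cosine similarity is the prediction, cross entropy is the loss.
import torch
import torch.nn as nn

class TermMappingModel(nn.Module):
    """Bi-LSTM term information mapping conversion layer."""
    def __init__(self, emb_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.bilstm(x)            # (batch, seq_len, 2 * hidden_dim)
        h = out.size(-1) // 2
        fwd_last = out[:, -1, :h]          # forward output of the last word
        bwd_first = out[:, 0, h:]          # backward output of the first word
        return torch.cat([fwd_last, bwd_first], dim=-1)  # distributed repr.

model = TermMappingModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(mat_a: torch.Tensor, mat_b: torch.Tensor,
               label: torch.Tensor) -> float:
    """One optimization step on a labeled (A, B) sample pair (label 0 or 1)."""
    repr_a, repr_b = model(mat_a), model(mat_b)        # shared weights
    pred = torch.cosine_similarity(repr_a, repr_b)
    pred = pred.clamp(1e-6, 1 - 1e-6)                  # keep log() finite
    loss = -(label * torch.log(pred)
             + (1 - label) * torch.log(1 - pred)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```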
As an optional embodiment of the present invention, inputting the first word vector matrix and the second word vector matrix into the second text similarity calculation model to perform text similarity calculation processing after term information mapping conversion specifically includes:
inputting the first word vector matrix and the second word vector matrix into the second text similarity calculation model, and performing term information mapping conversion on them through the term information mapping conversion layer of the second text similarity calculation model to obtain an input text distributed representation with professional term information and a to-be-matched text distributed representation with professional term information respectively;
performing cosine similarity calculation on the input text distributed representation with professional term information and the to-be-matched text distributed representation with professional term information through a second cosine similarity calculation formula to obtain a second similarity value; wherein the second cosine similarity calculation formula is:
S_kg_i = cos(v'_user, v'_kgi), i = 1, 2, …, m

where v'_user is the distributed representation of the input text with professional term information, v'_kgi is the i-th distributed representation of a text to be matched with professional term information, and m is the number of texts to be matched.
Specifically, the first word vector matrix and the second word vector matrix are input into the second text similarity calculation model; term information mapping conversion is performed on them through the term information mapping conversion layer to obtain the input text distributed representation with professional term information and the to-be-matched text distributed representation with professional term information; the second cosine similarity calculation formula is then used to calculate the second similarity value between these two representations, which gives the similarity between the input text and each text to be matched in the knowledge graph.
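Continuing the training sketch above, applying the trained mapping model at matching time might look as follows; the variable names are assumptions carried over from the earlier sketches, and each matrix is assumed to be a (seq_len, emb_dim) float tensor.

```python
# Second similarity model at inference: map both texts into the
# professional-term space, then score each candidate by cosine similarity.
import torch

def second_similarities(model, user_mat, candidate_mats):
    model.eval()
    with torch.no_grad():
        v_user = model(user_mat.unsqueeze(0))          # add a batch dimension
        return [torch.cosine_similarity(v_user,
                                        model(m.unsqueeze(0))).item()
                for m in candidate_mats]               # one score per text
```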
Step S130, performing weighted summation on the first similarity value and the second similarity value to obtain a final similarity value.
Specifically, the two similarity values are weighted and summed to obtain a final similarity score S.
S = w_common * S_common + w_kg * S_kg

where w_common is the weight of the first similarity value and w_kg is the weight of the second similarity value; both can be set according to actual needs. The two similarity values assess, from the general-domain perspective and from the specific professional-domain perspective respectively, whether the input text from the user is similar to a text to be matched from the knowledge graph library; compared with methods commonly used in the industry, this adds the information of the specific professional knowledge graph and is more intelligent.
Step S140, acquiring, from the knowledge graph library, the text to be matched corresponding to the highest final similarity value as the matching result of the input text.
Specifically, the higher the final similarity value, the more similar the user's input text is to the text to be matched in the knowledge graph library; therefore, it suffices to acquire from the knowledge graph library the text to be matched corresponding to the highest final similarity value and use it as the matching result of the input text.
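A minimal sketch of steps S130 and S140 together is given below; the weights and the example scores are placeholders standing in for the outputs of the two models.

```python
# Weighted fusion (S130) and best-match selection (S140).
def match(candidate_texts, s_common, s_kg, w_common=0.5, w_kg=0.5):
    final = [w_common * sc + w_kg * sk for sc, sk in zip(s_common, s_kg)]
    best = max(range(len(final)), key=final.__getitem__)  # highest final value
    return candidate_texts[best], final[best]

# Example: the second candidate wins on the fused score
text, score = match(["legal fact A", "legal fact B"],
                    s_common=[0.42, 0.61], s_kg=[0.35, 0.88])
```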
As an optional embodiment of the present invention, acquiring, from the knowledge graph library, the text to be matched corresponding to the highest final similarity value as the matching result of the input text specifically includes:
when the knowledge graph library is a legal knowledge graph library, the texts to be matched are the legal facts stored in the legal knowledge graph library, and each legal fact is paired with a corresponding legal processing result;
acquiring, from the legal knowledge graph library, the legal fact corresponding to the highest final similarity value as the matching result of the input text, and acquiring the legal processing result paired with that legal fact as the output accompanying the matching result.
Specifically, for the legal field, each legal fact in the legal knowledge graph library can serve as a text to be matched, and each legal fact is paired with a corresponding legal processing result; since similar legal facts tend to have similar processing results, the legal fact most similar to the user's input text is obtained from the legal knowledge graph library as the best matching result, and the legal processing result corresponding to that legal fact can be output alongside it.
Fig. 2 is a functional block diagram of a text matching apparatus according to an embodiment of the present invention.
The text matching apparatus 200 of the present invention may be installed in an electronic device. According to the implemented functions, the text matching device may include a word segmentation vectorization processing module 210, a similarity value calculation module 220, a weighted sum processing module 230, and a text matching module 240. A module according to the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
The word segmentation vectorization processing module 210 is configured to perform word segmentation vectorization processing on the acquired input text and the texts to be matched in the knowledge graph library to obtain a first word vector matrix of the input text and a second word vector matrix of the text to be matched respectively.
Specifically, a user enters text through a query box or a similar interface; after obtaining the user's input text, the processor acquires the texts to be matched from the knowledge graph library one by one. Word segmentation vectorization processing is performed on the texts from the two sources (the input text from the user and the texts to be matched from the knowledge graph library) to obtain the first word vector matrix of the input text and the second word vector matrix of the text to be matched respectively.
As an optional embodiment of the present invention, performing word segmentation vectorization processing on an acquired input text and a text to be matched in a knowledge graph library to obtain a first word vector matrix of the input text and a second word vector matrix of the text to be matched, respectively, specifically including:
performing word segmentation on the input text and the text to be matched to respectively obtain a first word segmentation set of the input text and a second word segmentation set of the text to be matched;
removing stop words from the first word segmentation set and the second word segmentation set to obtain a first word list and a second word list respectively;
inputting the first word list and the second word list into a pre-training word vector model, and respectively obtaining a first word vector matrix and a second word vector matrix after carrying out word vectorization processing through the pre-training word vector model;
the pre-training word vector model comprises an input layer, a word vector conversion layer and an output layer, wherein the input layer is used for inputting a first word list and a second word list, the word vector conversion layer is used for comparing words in the first word list and the second word list with a pre-training vector dictionary to respectively obtain a first word vector matrix and a second word vector matrix, and the output layer is used for outputting the first word vector matrix and the second word vector matrix.
Specifically, the two texts are first segmented into words; a word segmentation tool such as jieba or jiagu can be used. After word segmentation, a first word segmentation set of the input text and a second word segmentation set of the text to be matched are obtained respectively. To prevent meaningless words from distorting the subsequent similarity calculation (for example, words carrying no semantic information, such as "la"), stop words are removed from the first word segmentation set and the second word segmentation set to obtain a first word list and a second word list respectively. The words in the first word list and the second word list are then converted into corresponding word vectors through the pre-trained word vector model, thereby obtaining the first word vector matrix and the second word vector matrix respectively.
In the process of converting the words in a word list into word vectors, taking the first word list as an example, the pre-trained word vector model stores a pre-trained vector dictionary whose keys are words and whose values are word vectors, i.e. of the form {word: word_vector}. Each word in the first word list is looked up in the pre-trained vector dictionary to retrieve its word vector, yielding a word vector matrix. This produces an n × d-dimensional word vector matrix, where n is the number of words and d is the dimension of the word vector. The expression is: M = PM(word_list), where PM is the pre-trained word vector model, word_list is the word list, and M is the resulting word vector matrix.
The similarity value calculation module 220 is configured to input the first word vector matrix and the second word vector matrix into a first text similarity calculation model to perform text similarity calculation processing based on word vectors, and input the first word vector matrix and the second word vector matrix into a second text similarity calculation model to perform text similarity calculation processing after term information mapping conversion, so as to obtain a first similarity value and a second similarity value respectively;
the second text similarity calculation model comprises a term information mapping conversion layer and a similarity calculation layer, wherein the term information mapping conversion layer is used for mapping and converting the first word vector matrix and the second word vector matrix into an input text distributed representation with professional term information and a to-be-matched text distributed representation with professional term information respectively, and the similarity calculation layer is used for performing text similarity calculation on the input text distributed representation with professional term information and the to-be-matched text distributed representation with professional term information.
Specifically, the word-vector-based text similarity of the first word vector matrix and the second word vector matrix can be calculated using the first text similarity calculation model. However, the first similarity value obtained by this model only measures the similarity of the two texts (the input text and the text to be matched) at the literal level; for long texts, or texts whose substantive meaning as professional terminology differs from their literal meaning, the matching accuracy is low. For example, entity A "uses seepage wells, seepage pits, cracks and karst caves, privately lays concealed pipes, tampers with and forges monitoring data" and entity B "discharges water pollutants while evading supervision, for example by abnormally operating water pollution prevention and control facilities" come from the same legal provision and their penalty results are very similar, so from the legal perspective the two are very similar. However, from the viewpoint of the number of overlapping words, word vectors, sentence vectors, and so on, they are not similar, so common algorithms in the industry cannot accurately match the correct result. Therefore, after term information mapping conversion is performed on the first word vector matrix and the second word vector matrix by the second text similarity calculation model, text similarity calculation is performed on the two distributed representations carrying term information to obtain the second similarity value, which improves the matching accuracy.
As an alternative embodiment of the present invention, the first text similarity calculation model includes:
the system comprises an input layer used for inputting a first word vector matrix and a second word vector matrix, a distributed text conversion layer used for respectively carrying out text distributed conversion processing on the first word vector matrix and the second word vector matrix to obtain a distributed input text and a distributed text to be matched, a similarity calculation layer used for carrying out cosine similarity calculation processing on the distributed input text and the distributed text to be matched to obtain a first similarity value, and an output layer used for carrying out output processing on the first similarity value.
Specifically, the first text similarity calculation model performs text similarity calculation based on word vectors: the first similarity value is obtained simply by converting the first word vector matrix and the second word vector matrix into the input text distributed representation and the to-be-matched text distributed representation respectively and then performing cosine similarity calculation on the two representations. Processing with the first text similarity calculation model is therefore fast and saves computation time for the similarity value.
As an optional embodiment of the present invention, inputting the first word vector matrix and the second word vector matrix into the first text similarity calculation model to perform text similarity calculation processing based on word vectors specifically includes:
performing averaging on the first word vector matrix and the second word vector matrix to obtain an average vector of the first word vector matrix and an average vector of the second word vector matrix respectively;
taking the average vector of the first word vector matrix as the input text distributed representation and the average vector of the second word vector matrix as the to-be-matched text distributed representation, and performing similarity calculation between the two using the first cosine similarity calculation formula to obtain a first similarity value; wherein,
the first cosine similarity calculation formula is:

S_common_i = cos(v_user, v_kgi) = (v_user · v_kgi) / (|v_user| * |v_kgi|), i = 1, 2, …, m

where v_user is the distributed representation of the input text, v_kgi is the distributed representation of the i-th text to be matched, and m is the number of texts to be matched.
Specifically, when calculating the similarity of the two matrices with the first text similarity calculation model, each word vector matrix is directly averaged to obtain a 1 × d-dimensional vector. For example: assuming the input text contains 5 words and the word vector dimension of the pre-trained word vector model is 100, a 5 × 100 matrix is obtained, which is then averaged into a 1 × 100 vector. Given the distributed representation v_user of the input text and the distributed representations v_kg1, v_kg2, …, v_kgm of the m texts to be matched in the knowledge graph, compute

S_common_i = cos(v_user, v_kgi), i = 1, 2, …, m

to obtain the similarity between the user text and each of the m texts to be matched in the knowledge graph, which ranges from 0 to 1. Since only general-domain word vectors are used and no professional term information from the knowledge graph is included, this similarity can only indicate whether the input text is similar to the text to be matched from the general-domain perspective (i.e., without knowledge of a specific professional field, such as legal knowledge).
As an optional embodiment of the present invention, the second text similarity calculation model is stored in the blockchain; the second text similarity calculation model is a neural network model, and the training of the second text similarity calculation model specifically includes:
selecting a positive sample pair and a negative sample pair from the knowledge graph library, and performing true-value labeling on the positive sample pair and the negative sample pair respectively; wherein the positive sample pairs comprise text pairs judged to be similar and the negative sample pairs comprise sample pairs judged to be dissimilar;
taking the positive sample pair and the negative sample pair together as a model training sample set;
randomly extracting samples A and samples B from a model training sample set, and performing word segmentation vectorization processing on the samples A and the samples B to respectively obtain a word vector matrix of the samples A and a word vector matrix of the samples B;
passing the A-sample word vector matrix and the B-sample word vector matrix through the term information mapping conversion layer of a pre-constructed basic neural network model with shared weights, and splicing the output vector corresponding to the last word in the front-to-back direction with the output vector corresponding to the first word in the back-to-front direction of the term information mapping conversion layer to obtain an A-sample distributed representation and a B-sample distributed representation respectively;
performing cosine similarity calculation processing on the sample distribution type A and the sample distribution type B to obtain a cosine similarity value between the sample distribution type A and the sample distribution type B;
taking a cosine similarity value between the sample distribution type A and the sample distribution type B as a predicted value, taking a result of pre-labeling the sample A and the sample B as a true value, and adopting cross entropy as a loss function of the pre-constructed basic neural network model; wherein,
the formula for the calculation of the loss function is:
pred=cos(Bi-LSTM(A),Bi-LSTM(B))
loss = -(label*log(pred)+(1-label)*log(1-pred))
wherein, label is a true value, pred is a predicted value, and loss is a loss function of a pre-constructed basic neural network model;
and according to the loss function, carrying out optimization training on the pre-constructed basic neural network model to obtain a second text similarity calculation model.
Specifically, the second text similarity calculation model is a knowledge-graph-based text similarity calculation model and can be divided into two stages: a model training stage and a model application stage. In the model training stage, a neural network model is trained using entity texts in the knowledge graph as samples; after the model is trained, its weight parameters are retained and used to map input texts into the semantic space of the knowledge graph. The idea is as follows: word vectors are trained on general texts, whereas the texts in the knowledge graph belong to a specific professional field, so two texts that are similar from the perspective of that professional field may not appear similar when word vectors are used directly.
The LSTM-based neural network model takes as input word vectors that do not contain professional-domain information, and outputs word vectors that contain professional-domain information. The model can be viewed as a function f(x) (where x is a word vector that does not contain terminological information, such as legal information) whose goal is to learn a mapping from general-domain word vectors to the professional domain. The function is trained by first determining which entity texts are similar and then fitting the model to the objective: the distance between similar entity texts should be small, and the distance between dissimilar entity texts should be large.
The training stage of the second text similarity calculation model first constructs a model training data set. Taking the legal field as an example, positive and negative sample pairs are selected from the legal knowledge graph: positive sample pairs comprise sample pairs that should be judged similar, and negative sample pairs comprise sample pairs that should not be judged similar. Under the same legal provision, entity texts with the same penalty result form a positive sample with truth label 1; under different legal provisions, entity texts with different penalty results form a negative sample with truth label 0. Similar sample pairs may also be labeled manually.
The model comprises a Bi-LSTM layer (i.e., the term information mapping conversion layer). An A sample and a B sample are randomly selected from the model training sample set composed of positive and negative sample pairs, and word segmentation vectorization is performed on the A sample and the B sample to obtain an A-sample word vector matrix and a B-sample word vector matrix respectively; the A-sample word vector matrix and the B-sample word vector matrix share the weights of this layer. The output vector of the last word in the front-to-back direction of the Bi-LSTM and the output vector corresponding to the first word in the back-to-front direction are concatenated to obtain the distributed representation of the text, since these two vectors respectively contain information from reading the whole sentence front to back and back to front. After the A-sample and B-sample distributed representations are obtained, their cosine similarity is calculated. The cosine similarity is taken as the predicted value, the labeled result is taken as the true value, and cross entropy is used to compute the loss. The model is then trained with a common machine learning optimization method (such as SGD or Adam) to obtain the second text similarity calculation model, which has the ability to map from the general domain to the professional knowledge graph.
As an optional embodiment of the present invention, inputting the first word vector matrix and the second word vector matrix into the second text similarity calculation model to perform text similarity calculation processing after term information mapping conversion specifically includes:
inputting the first word vector matrix and the second word vector matrix into the second text similarity calculation model, and performing term information mapping conversion on them through the term information mapping conversion layer of the second text similarity calculation model to obtain an input text distributed representation with professional term information and a to-be-matched text distributed representation with professional term information respectively;
performing cosine similarity calculation on the input text distributed representation with professional term information and the to-be-matched text distributed representation with professional term information through a second cosine similarity calculation formula to obtain a second similarity value; wherein the second cosine similarity calculation formula is:
S_kg_i = cos(v'_user, v'_kgi), i = 1, 2, …, m

where v'_user is the distributed representation of the input text with professional term information, v'_kgi is the i-th distributed representation of a text to be matched with professional term information, and m is the number of texts to be matched.
Specifically, the first word vector matrix and the second word vector matrix are input into the second text similarity calculation model; term information mapping conversion is performed on them through the term information mapping conversion layer to obtain the input text distributed representation with professional term information and the to-be-matched text distributed representation with professional term information; the second cosine similarity calculation formula is then used to calculate the second similarity value between these two representations, which gives the similarity between the input text and each text to be matched in the knowledge graph.
The weighted summation processing module 230 is configured to perform weighted summation processing on the first similarity value and the second similarity value to obtain a final similarity value.
Specifically, the two similarity values are weighted and summed to obtain a final similarity score S.
$$S = w_{common} \cdot S_{common} + w_{kg} \cdot S_{kg}$$

where $w_{common}$ is the weight of the first similarity value and $w_{kg}$ is the weight of the second similarity value; both weights can be set according to actual needs. The two similarity values assess whether the input text from the user is similar to the text to be matched from the knowledge graph library from the perspective of the general domain and of the specific professional domain, respectively; compared with methods commonly used in the industry, this adds the information of the specific professional knowledge graph and is more intelligent.
The text matching module 240 is configured to obtain, from the knowledge graph library, the text to be matched corresponding to the final similarity value with the highest numerical value, as the matching result of the input text.

Specifically, the higher the final similarity value, the more similar the input text from the user is to the text to be matched in the knowledge graph library; it therefore suffices to obtain from the knowledge graph library the text to be matched corresponding to the highest final similarity value and use it as the matching result of the input text.
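By way of illustration only, the fusion and selection performed by modules 230 and 240 may be sketched as follows; the weight values are assumptions and would be set according to actual needs:

```python
# Minimal sketch of weighted fusion and best-match selection (illustrative).
w_common, w_kg = 0.5, 0.5  # illustrative weights

def match(first_similarities, second_similarities, candidates):
    # S = w_common * S_common + w_kg * S_kg for every candidate text
    finals = [w_common * s1 + w_kg * s2
              for s1, s2 in zip(first_similarities, second_similarities)]
    # The candidate with the highest final similarity value is the match.
    best = max(range(len(finals)), key=finals.__getitem__)
    return candidates[best], finals[best]
```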
As an optional embodiment of the present invention, obtaining the text to be matched corresponding to the final similarity value with the highest numerical value from the knowledge graph library as the matching result of the input text specifically includes:

when the knowledge graph library is a legal knowledge graph library, the texts to be matched are the legal facts stored in the legal knowledge graph library, and each legal fact is paired with a corresponding legal processing result;

taking the legal fact corresponding to the final similarity value with the highest numerical value, obtained from the legal knowledge graph library, as the matching result of the input text, and obtaining the legal processing result corresponding to that legal fact as the output of the matching result.
Specifically, in the legal domain, each legal fact in the legal knowledge graph library can serve as a text to be matched, and each legal fact is paired with a corresponding legal processing result. Because similar legal facts are likely to have similar processing results, the legal fact most similar to the user's input text is obtained from the legal knowledge graph library as the best matching result, and the legal processing result corresponding to that legal fact is returned as the output.
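By way of illustration only, the legal-domain output step may be sketched as follows; the legal facts and processing results shown are invented placeholders:

```python
# Minimal sketch of the legal-domain lookup (illustrative data only).
legal_processing_results = {
    "employer withheld wages": "employer shall pay the owed wages",
    "tenant damaged the rented property": "tenant shall compensate the lessor",
}

best_matching_fact = "employer withheld wages"  # output of the matching step
print(best_matching_fact, "->", legal_processing_results[best_matching_fact])
```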
Fig. 3 is a schematic structural diagram of an electronic device implementing a text matching method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a text matching program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of the text matching program, but also to temporarily store data that has been output or is to be output.
In some embodiments the processor 10 may be composed of a single packaged integrated circuit, or of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the control unit of the electronic device: it connects the various components of the whole electronic device through various interfaces and lines, and executes the functions of the electronic device 1 and processes its data by running or executing the programs or modules stored in the memory 11 (e.g., the text matching program) and calling the data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. The bus is arranged to enable communication between the memory 11, the at least one processor 10, and the other components.
Fig. 3 shows only an electronic device with certain components; it will be understood by those skilled in the art that the structure shown in Fig. 3 does not constitute a limitation of the electronic device 1, which may comprise fewer or more components than those shown, combine some components, or arrange the components differently.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used to establish a communication connection between the electronic device 1 and another electronic device.
Optionally, the electronic device 1 may further comprise a user interface, which may be a display, an input unit (such as a keyboard), and optionally a standard wired interface or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying the information processed in the electronic device 1 and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The memory 11 in the electronic device 1 stores a text matching program 12, which is a combination of instructions that, when executed by the processor 10, can implement:
performing word segmentation vectorization processing on the acquired input text and the text to be matched in the knowledge graph library to respectively obtain a first word vector matrix of the input text and a second word vector matrix of the text to be matched;
inputting the first word vector matrix and the second word vector matrix into a first text similarity calculation model to perform text similarity calculation processing based on word vectors, and inputting the first word vector matrix and the second word vector matrix into a second text similarity calculation model to perform text similarity calculation processing after term information mapping conversion, so as to respectively obtain a first similarity value and a second similarity value;
the second text similarity calculation model comprises a term information mapping conversion layer and a similarity calculation layer, wherein the term information mapping conversion layer is used for mapping and converting the first word vector matrix and the second word vector matrix into a distributed representation of the input text carrying professional term information and a distributed representation of the text to be matched carrying professional term information respectively, and the similarity calculation layer is used for performing text similarity calculation processing on the two distributed representations;
carrying out weighted summation processing on the first similarity value and the second similarity value to obtain a final similarity value;
and acquiring the text to be matched corresponding to the final similarity value with the highest numerical value from the knowledge graph library as the matching result of the input text.
Specifically, for the implementation of these instructions by the processor 10, reference may be made to the description of the relevant steps in the embodiment corresponding to Fig. 1, which is not repeated here. It should be emphasized that, to further ensure the privacy and security of the second text similarity calculation model, the second text similarity calculation model may also be stored in a node of a blockchain.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated by cryptographic methods, in which each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as first and second are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A text matching method is applied to an electronic device, and is characterized by comprising the following steps:
performing word segmentation vectorization processing on the acquired input text and a text to be matched in a knowledge graph library to respectively obtain a first word vector matrix of the input text and a second word vector matrix of the text to be matched;
inputting the first word vector matrix and the second word vector matrix into a first text similarity calculation model to perform text similarity calculation processing based on word vectors, and inputting the first word vector matrix and the second word vector matrix into a second text similarity calculation model to perform text similarity calculation processing after term information mapping conversion, so as to obtain a first similarity value and a second similarity value respectively;
wherein the second text similarity calculation model comprises a term information mapping conversion layer and a similarity calculation layer, the term information mapping conversion layer being used for mapping and converting the first word vector matrix and the second word vector matrix into a distributed representation of the input text carrying professional term information and a distributed representation of the text to be matched carrying professional term information respectively, and the similarity calculation layer being used for performing text similarity calculation processing on the two distributed representations; the training of the second text similarity calculation model specifically includes:
selecting positive sample pairs and negative sample pairs from the knowledge graph library, and performing true-value labeling processing on the positive sample pairs and the negative sample pairs respectively, wherein the positive sample pairs comprise text pairs judged to be similar and the negative sample pairs comprise text pairs judged to be dissimilar;
taking the positive sample pairs and the negative sample pairs together as a model training sample set;
randomly extracting a sample A and a sample B from the model training sample set, and performing word segmentation vectorization processing on sample A and sample B to obtain an A-sample word vector matrix and a B-sample word vector matrix respectively;
passing the A-sample word vector matrix and the B-sample word vector matrix through a term information mapping conversion layer of a pre-constructed basic neural network model with shared weights, and concatenating the output vector corresponding to the last word in the front-to-back direction of the term information mapping conversion layer with the output vector corresponding to the first word in the back-to-front direction, to obtain a distributed representation of sample A and a distributed representation of sample B respectively;
performing cosine similarity calculation processing on the distributed representation of sample A and the distributed representation of sample B to obtain a cosine similarity value between them;
taking the cosine similarity value between the distributed representation of sample A and the distributed representation of sample B as the predicted value, taking the result of the pre-labeling of sample A and sample B as the true value, and adopting cross entropy as the loss function of the pre-constructed basic neural network model;
according to the loss function, performing optimization training on the pre-constructed basic neural network model to obtain a second text similarity calculation model;
performing weighted summation processing on the first similarity value and the second similarity value to obtain a final similarity value;
and acquiring the text to be matched corresponding to the final similarity value with the highest numerical value from the knowledge graph library as the matching result of the input text.
2. The text matching method according to claim 1, wherein performing word segmentation vectorization processing on the acquired input text and the text to be matched in the knowledge graph library to obtain a first word vector matrix of the input text and a second word vector matrix of the text to be matched respectively comprises:
performing word segmentation on the input text and the text to be matched to respectively obtain a first word segmentation set of the input text and a second word segmentation set of the text to be matched;
removing stop words from the first word segmentation set and the second word segmentation set to respectively obtain a first word list and a second word list;
inputting the first word list and the second word list into a pre-training word vector model, and respectively obtaining a first word vector matrix and a second word vector matrix after performing word vectorization processing through the pre-training word vector model;
wherein the pre-training word vector model comprises an input layer used for inputting the first word list and the second word list, a word vector conversion layer used for looking up the words in the first word list and the second word list in a pre-trained vector dictionary to obtain the first word vector matrix and the second word vector matrix respectively, and an output layer used for outputting the first word vector matrix and the second word vector matrix.
3. The text matching method according to claim 1, wherein the first text similarity calculation model includes:
an input layer used for inputting the first word vector matrix and the second word vector matrix, a distributed representation conversion layer used for converting the first word vector matrix and the second word vector matrix into a distributed representation of the input text and a distributed representation of the text to be matched respectively, a similarity calculation layer used for performing cosine similarity calculation processing on the two distributed representations to obtain a first similarity value, and an output layer used for outputting the first similarity value.
4. The text matching method according to claim 1, wherein the step of inputting the first word vector matrix and the second word vector matrix into a first text similarity calculation model to perform text similarity calculation processing based on word vectors specifically comprises:
averaging the first word vector matrix and the second word vector matrix to obtain an average vector of the first word vector matrix and an average vector of the second word vector matrix respectively;
taking the average vector of the first word vector matrix as the distributed representation of the input text and the average vector of the second word vector matrix as the distributed representation of the text to be matched, and performing similarity calculation on the two distributed representations with a first cosine similarity calculation formula to obtain a first similarity value; wherein,
the first cosine similarity calculation formula is as follows:
$$S_{common} = \cos\left(v_{user},\, v_{kg_i}\right) = \frac{v_{user} \cdot v_{kg_i}}{\left\|v_{user}\right\|\,\left\|v_{kg_i}\right\|}, \qquad i = 1, 2, \ldots, m$$

where $v_{user}$ is the distributed representation of the input text, $v_{kg_i}$ is the distributed representation of the $i$-th text to be matched, and $m$ is the number of distributed representations of texts to be matched.
5. The text matching method according to claim 1, wherein the second text similarity calculation model is stored in a blockchain, and the second text similarity calculation model is a neural network model; the loss function is calculated as follows:
$$pred = \cos\left(\mathrm{Bi\text{-}LSTM}(A),\ \mathrm{Bi\text{-}LSTM}(B)\right)$$

$$loss = -\left[label \cdot \log(pred) + (1 - label) \cdot \log(1 - pred)\right]$$

where $label$ is the true value and $pred$ is the predicted value, so that $loss$ is the standard binary cross entropy of the pre-constructed basic neural network model.
6. The text matching method according to claim 1, wherein inputting the first word vector matrix and the second word vector matrix into the second text similarity calculation model to perform text similarity calculation processing after term information mapping conversion specifically comprises:
inputting the first word vector matrix and the second word vector matrix into the second text similarity calculation model, and performing term information mapping conversion processing on them through the term information mapping conversion layer of the second text similarity calculation model to obtain, respectively, a distributed representation of the input text carrying professional term information and a distributed representation of the text to be matched carrying professional term information;
performing cosine similarity calculation processing on the distributed representation of the input text carrying professional term information and the distributed representation of the text to be matched carrying professional term information through a second cosine similarity calculation formula to obtain a second similarity value, wherein the second cosine similarity calculation formula is:
$$S_{kg} = \cos\left(v'_{user},\, v'_{kg_i}\right) = \frac{v'_{user} \cdot v'_{kg_i}}{\left\|v'_{user}\right\|\,\left\|v'_{kg_i}\right\|}, \qquad i = 1, 2, \ldots, m$$

where $v'_{user}$ is the distributed representation of the input text carrying professional term information, $v'_{kg_i}$ is the distributed representation of the $i$-th text to be matched carrying professional term information, and $m$ is the number of distributed representations of texts to be matched.
7. The text matching method according to claim 1, wherein obtaining the text to be matched corresponding to the final similarity value with the highest numerical value from the knowledge graph library as the matching result of the input text specifically comprises:

when the knowledge graph library is a legal knowledge graph library, the texts to be matched are the legal facts stored in the legal knowledge graph library, and each legal fact is paired with a corresponding legal processing result;

and taking the legal fact corresponding to the final similarity value with the highest numerical value, obtained from the legal knowledge graph library, as the matching result of the input text, and obtaining the legal processing result corresponding to that legal fact as the output of the matching result.
8. A text matching apparatus, characterized in that the apparatus comprises:
the word segmentation vectorization processing module is used for performing word segmentation vectorization processing on the acquired input text and the text to be matched in the knowledge graph library to obtain a first word vector matrix of the input text and a second word vector matrix of the text to be matched respectively;
the similarity value calculation module is used for inputting the first word vector matrix and the second word vector matrix into a first text similarity calculation model to perform text similarity calculation processing based on word vectors, and inputting the first word vector matrix and the second word vector matrix into a second text similarity calculation model to perform text similarity calculation processing after term information mapping conversion, so as to obtain a first similarity value and a second similarity value respectively;
wherein the second text similarity calculation model comprises a term information mapping conversion layer and a similarity calculation layer, the term information mapping conversion layer being used for mapping and converting the first word vector matrix and the second word vector matrix into a distributed representation of the input text carrying professional term information and a distributed representation of the text to be matched carrying professional term information respectively, and the similarity calculation layer being used for performing text similarity calculation processing on the two distributed representations; the training of the second text similarity calculation model specifically includes:
selecting positive sample pairs and negative sample pairs from the knowledge graph library, and performing true-value labeling processing on the positive sample pairs and the negative sample pairs respectively, wherein the positive sample pairs comprise text pairs judged to be similar and the negative sample pairs comprise text pairs judged to be dissimilar;
taking the positive sample pairs and the negative sample pairs together as a model training sample set;
randomly extracting a sample A and a sample B from the model training sample set, and performing word segmentation vectorization processing on sample A and sample B to obtain an A-sample word vector matrix and a B-sample word vector matrix respectively;
passing the A-sample word vector matrix and the B-sample word vector matrix through a term information mapping conversion layer of a pre-constructed basic neural network model with shared weights, and concatenating the output vector corresponding to the last word in the front-to-back direction of the term information mapping conversion layer with the output vector corresponding to the first word in the back-to-front direction, to obtain a distributed representation of sample A and a distributed representation of sample B respectively;
performing cosine similarity calculation processing on the distributed representation of sample A and the distributed representation of sample B to obtain a cosine similarity value between them;
taking the cosine similarity value between the distributed representation of sample A and the distributed representation of sample B as the predicted value, taking the result of the pre-labeling of sample A and sample B as the true value, and adopting cross entropy as the loss function of the pre-constructed basic neural network model;
according to the loss function, carrying out optimization training on the pre-constructed basic neural network model to obtain a second text similarity calculation model;
the weighted summation processing module is used for performing weighted summation processing on the first similarity value and the second similarity value to obtain a final similarity value;
and the text matching module is used for acquiring the text to be matched corresponding to the final similarity value with the highest numerical value from the knowledge graph library as the matching result of the input text.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the text matching method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a text matching method according to any one of claims 1 to 7.
CN202110608724.4A 2021-06-01 2021-06-01 Text matching method, device and equipment and computer readable storage medium Active CN113282709B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110608724.4A CN113282709B (en) 2021-06-01 2021-06-01 Text matching method, device and equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110608724.4A CN113282709B (en) 2021-06-01 2021-06-01 Text matching method, device and equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113282709A CN113282709A (en) 2021-08-20
CN113282709B true CN113282709B (en) 2022-11-04

Family

ID=77282940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110608724.4A Active CN113282709B (en) 2021-06-01 2021-06-01 Text matching method, device and equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113282709B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569016B (en) * 2021-09-27 2022-01-25 北京语言大学 Bert model-based professional term extraction method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763931B (en) * 2018-05-28 2021-11-16 上海交通大学 Vulnerability detection method based on Bi-LSTM and text similarity
CN112015859B (en) * 2019-05-31 2023-08-18 百度在线网络技术(北京)有限公司 Knowledge hierarchy extraction method and device for text, computer equipment and readable medium
CN110928987B (en) * 2019-10-18 2023-07-25 平安科技(深圳)有限公司 Legal provision retrieval method and related equipment based on neural network hybrid model
CN111539197B (en) * 2020-04-15 2023-08-15 北京百度网讯科技有限公司 Text matching method and device, computer system and readable storage medium

Also Published As

Publication number Publication date
CN113282709A (en) 2021-08-20


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant