CN112115716A - Service discovery method, system and equipment based on multi-dimensional word vector context matching - Google Patents

Service discovery method, system and equipment based on multi-dimensional word vector context matching

Info

Publication number
CN112115716A
Authority
CN
China
Prior art keywords
matching
layer
sentence
word vector
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010982942.XA
Other languages
Chinese (zh)
Inventor
黄昭
赵薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN202010982942.XA priority Critical patent/CN112115716A/en
Publication of CN112115716A publication Critical patent/CN112115716A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a service discovery method, system and device based on multi-dimensional word vector context text matching. First, text information and matching grade labels are extracted from standard data sets for matching-network training and service discovery testing; the extracted text is processed to generate three keyword vector representations: TF-IDF, Word2Vec and ELMo; a three-layer similarity matrix of the text at word granularity is constructed from the multi-dimensional word vectors and used to train a convolutional neural network; each query request is then matched one by one with the services in the test set, and the probability score of the matching level is predicted with the trained matching network, so that a suitable target service is found. By combining the advantages of multi-dimensional word vectors with those of the convolutional neural network, the similarity features of the text at word granularity are obtained comprehensively from keyword feature information mined in multiple dimensions of the processed data, which effectively supports the matching of services with query requests and realizes accurate retrieval of the target service.

Description

Service discovery method, system and equipment based on multi-dimensional word vector context matching
Technical Field
The invention belongs to the field of computer science and technology, and particularly relates to a method, a system and equipment for discovering services based on multi-dimensional word vector context matching.
Background
The core task in the service discovery process is to solve the problem of functional matching between a user query request and candidate services. Since the functional description information of a service is generally expressed in natural language, text matching methods are commonly applied in service discovery research to find the target service according to the similarity matching score between texts. The text matching process mainly comprises two stages: vectorization and similarity calculation. Traditional word vector processing methods generally consider feature information in only a single dimension, either word frequency or semantics; as the number and variety of service resources grow, single-dimensional word vectors fall short in mining text information and cannot fully reflect the weight of keywords in the text. In addition, text similarity calculated in a vector space based on single-dimensional word vectors is simple and easy to implement, but it lacks deep matching and has difficulty capturing comprehensive matching information, which affects matching accuracy. Therefore, to improve text matching precision and realize accurate service discovery, corresponding solutions need to be designed for these problems.
Disclosure of Invention
In order to overcome the defects in the existing service discovery technology, the invention provides a method, a system and equipment for discovering services based on multidimensional word vector context matching, and the method, the system and the equipment can provide optimal target services for users on the basis of text matching by combining the multidimensional word vector and the functional characteristics of a convolutional neural network.
The technical scheme adopted by the invention for realizing the aim is as follows: a service discovery method based on multi-dimensional word vector context matching comprises the following steps:
step 1, respectively extracting question sentence pairs for semantic similarity detection and corresponding matching grades from a Quora data set so as to train a matching network; acquiring function description information of service and query requests from an OWLS-TC4 data set as a test set;
step 2, processing the sentences extracted in step 1 to generate three keyword vector representations: TF-IDF, Word2Vec and ELMo; the three keyword vector representations aim to obtain feature information of the keywords in the term frequency-inverse document frequency, static semantic and dynamic semantic dimensions;
step 3, based on the three keyword vectors obtained in the step 2, calculating cosine similarity between different keyword vectors in each pair of sentences to construct a three-layer similarity matrix, and using the three-layer similarity matrix as input of a matching network;
step 4, initializing the convolutional neural network model parameters, performing convolution, pooling, global average pooling and softmax classification operations on the similarity matrix obtained in step 3, calculating the error loss between the prediction result and the actual matching grade according to a loss function, and then performing backward iterative optimization to obtain the convolutional neural network model, namely the matching network;
step 5, for each query request, performing the operations of the step 2 and the step 3 with the service in the test set one by one, and predicting the probability score of the matching level based on the matching network trained in the step 4; and then, sequencing the predicted and matched candidate services according to the probability scores obtained by prediction, wherein the first N services with the highest probability scores are the target results to be retrieved.
In step 1, the extracted matching grade is used as an output label in the network training process and is used for calculating the error loss of the prediction result.
The step 2 is realized by the following specific steps:
step 21, respectively calculating the lengths of sentence A and sentence B in each sentence pair, recorded as L_a and L_b; setting length thresholds i and j, and retaining only the sentence pairs satisfying i ≤ L_a ≤ j and i ≤ L_b ≤ j, thereby filtering the sentence pairs;
step 22, preprocessing the sentence pair after data filtering, including removing stop words, extracting word stems and segmenting words, obtaining corresponding keywords, and completing feature extraction of the text;
step 23, counting the number of keywords of each sentence in all sentence pairs, and selecting the minimum number of keywords as m; according to the weight value of the keywords in each sentence, the first m keywords are used as representative keywords to realize fixed-length processing;
step 24, based on the obtained representative keywords, generating the TF-IDF word vector representation of each sentence to obtain term frequency-inverse document frequency information;
step 25, generating Word2Vec Word vector representation of each sentence based on the obtained representative keywords to obtain static semantic information;
based on the obtained representative keywords, an ELMo word vector representation of each sentence is generated to obtain dynamic semantic information, step 26.
In the three-layer similarity matrix in step 3, each layer of the feature matrix is formed by the cosine similarities between sentence A and sentence B at word granularity under one of the word vector representations.
In step 4, a three-channel convolution scan is performed in the convolutional layer over the input-layer similarity matrices using two 3 x 2 filters with a stride of 1; the elements of each layer of a filter are multiplied with the elements at the corresponding positions in the receptive field of each input layer and summed, and the sum of the three layers' convolution results is taken as the convolution output element, thereby generating two corresponding two-dimensional output feature matrices, which also serve as the pooling input matrices.
In step 4, Max-Pooling is adopted in the pooling layer: the largest similarity element in the receptive field of each feature matrix is taken as the pooled output feature, and the pooling operation is performed on the two input feature matrices to form the pooled output matrices, completing deep filtering and extraction of the similarity features; also in step 4, a global average pooling layer is arranged after the pooling layer, and all elements of each feature matrix output by pooling are averaged, yielding two feature values corresponding to the two feature matrices.
And step 4, inputting the characteristic value obtained after global average pooling into softmax at an output layer, predicting the matching grade of the sentence A and the sentence B, calculating error loss between the predicted matching result and the actual matching grade at a back propagation stage, and performing layer-by-layer back optimization and adjustment on the weight parameters in the network by adopting a gradient descent method to minimize a loss function so as to determine the model parameters, and finishing training.
The service discovery system based on text matching under multi-dimensional word vectors comprises a data preparation and processing module, a similarity matrix construction module, an iterative training module and a service discovery module; the data preparation and processing module is used for extracting question sentence pairs and corresponding matching grades for semantic similarity detection from the Quora data set, acquiring the functional description information of services and query requests from the OWLS-TC4 data set as the test set, and preprocessing the extracted sentences to generate three keyword vector representations: TF-IDF, Word2Vec and ELMo;
the similarity matrix construction module constructs a three-layer similarity matrix by calculating cosine similarity between different keyword vectors in a sentence pair based on the three keyword vectors and takes the three-layer similarity matrix as the input of a matching network;
the iterative training module performs convolution, pooling, global average pooling and softmax classification operations on the three-layer similarity matrix, calculates the error loss between the prediction result and the actual matching grade according to a loss function, and then performs backward iterative optimization to obtain the convolutional neural network model;
for each query request, the service discovery module performs three keyword vector extractions and three-layer similarity matrix construction with the services in the test set one by one, and predicts the probability score of the matching level based on the convolutional neural network model; and then, sequencing the predicted and matched candidate services according to the probability scores obtained by prediction, wherein the first N services with the highest probability scores are the target results to be retrieved.
A computer device includes, but is not limited to, one or more processors and a memory, where the memory is used to store a computer-executable program, and the processor reads part or all of the computer-executable program from the memory and executes the computer-executable program, and when the processor executes part or all of the computer-executable program, the processor can implement part or all of the steps of the service discovery method based on multi-dimensional word vector context matching according to the present invention.
A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, is capable of implementing the service discovery method based on multi-dimensional word vector context matching according to the present invention.
Compared with existing service discovery methods, the method has the following advantages: forward propagation and backward iterative optimization training are first carried out on a standard text data set, and the model parameters are continuously adjusted according to the error loss between the prediction result and the actual matching grade, which guarantees that the trained matching network has good reliability; based on the processed data form, three keyword vector representations, TF-IDF, Word2Vec and ELMo, are generated to obtain feature information of the keywords in the term frequency-inverse document frequency, static semantic and dynamic semantic dimensions, which effectively supports the text similarity calculation; further feature extraction is then performed with the convolutional neural network on the three-layer similarity matrix of the text at word granularity, and the candidate services predicted as matching are ranked by their predicted probability scores to find the most suitable target services, thereby realizing accurate retrieval of the target service.
Drawings
Fig. 1 is a structural diagram of a service discovery method based on multi-dimensional word vector context matching.
FIG. 2 is a flow chart of a method for training a matching network using a convolutional neural network.
Detailed Description
Fig. 1 is an overall structural diagram of the present invention, and the technical solution of the present invention will be further explained with reference to the drawings and examples.
Multi-dimensional word vectors reflect the feature weights of keywords in the text from different angles and fully mine the feature information of keywords in the term frequency-inverse document frequency, static semantic and dynamic semantic dimensions. Compared with traditional single-dimensional word vectors, multi-dimensional word vectors can not only represent the importance of keywords in the text and capture shallow word meanings, but also realize deep contextualized word representation and extract rich, comprehensive word feature information from the text to complete a specific task. In addition, the convolutional neural network, as an efficient recognition algorithm, has achieved a series of breakthrough research results in computer vision and related fields and has also been widely applied in natural language processing. Its strong ability to learn and represent features, abstracting more essential representations from the input data through local receptive fields, weight sharing and down-sampling, provides an effective method and line of thought for many natural language processing problems.
Based on the advantages and characteristics of the two methods, the method provided by the invention considers the combination of the two methods, and is used for quickly and accurately finding out the target service meeting the user request in the service discovery process.
The invention provides a service discovery method based on multi-dimensional word vector context matching, which comprises two processes, training and testing, with the following specific steps:
Step 1: Data preparation. The Quora data set is a standard data set in text matching applications, containing question sentence pairs and matching levels for semantic similarity detection. The invention extracts question sentence pairs and the corresponding matching grades from the Quora data set for training the matching network. For the testing process, the OWLS-TC4 data set is adopted, which contains 1083 semantic web services and 42 user query requests covering nine domains: education, medical care, food, communication, economy, geography, travel, weapons and simulation. To obtain the matching level between query requests and services, the invention extracts the corresponding functional description information from the profile modules of the OWL-S files of the services and the query requests for text matching.
For example: the partial data information extracted from the Quora dataset and the OWLS-TC4 dataset is as follows:
TABLE 1 Quora data example
Problem 1 Problem 2 Match rating
How can I be a good geologist? What should I do to be a great geologist? 1
How do I read and find my YouTube comments? How can I see all my Youtube comments? 1
What are the questions should not ask on Quora? Which question should I ask on Quora? 0
How to learn Java programming language? How do I learn a computer language like java? 1
How GST affects the CAs and tax officers? Why can't I do my homework? 0
TABLE 2 OWLS-TC4 data examples
(Table 2 is provided as an image in the original publication.)
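As a minimal illustration of this data preparation step, the sketch below loads the public Quora question-pairs file (a TSV with question1, question2 and is_duplicate columns) and assumes the OWL-S service and query descriptions have already been exported to plain-text files; the file names and layout are placeholders, not part of the patent.

```python
import pandas as pd

# Minimal sketch: load question pairs and match labels from the Quora
# question-pairs release (a TSV with columns question1, question2,
# is_duplicate). The path is a placeholder.
quora = pd.read_csv("quora_duplicate_questions.tsv", sep="\t")
train_pairs = quora[["question1", "question2", "is_duplicate"]].dropna()

# For the test side, assume the textual service descriptions and query
# requests have already been pulled out of the OWL-S profile sections
# into plain-text files, one description per line (hypothetical layout).
with open("owls_tc4_services.txt", encoding="utf-8") as f:
    services = [line.strip() for line in f if line.strip()]
with open("owls_tc4_queries.txt", encoding="utf-8") as f:
    queries = [line.strip() for line in f if line.strip()]

print(len(train_pairs), "training pairs;", len(services), "services;", len(queries), "queries")
```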
Step 2: generating multidimensional word vectors
In the text matching process, the text data is first vectorized. Considering the influence of the feature information in the text on matching accuracy, the keyword features of the text are mined in three dimensions, term frequency-inverse document frequency, static semantics and dynamic semantics, to generate three keyword vector representations: TF-IDF, Word2Vec and ELMo. The specific steps are as follows:
step 21: in order to reduce the influence on the reliability of the matching network, the sentence length in the quadra data set is not uniformly distributed, and data filtering is firstly needed. For each pair of sentences extracted, the length of sentence A and sentence B is calculated and recorded as LaAnd Lb(ii) a The length threshold is set as i, j, and only L which satisfies that i is less than or equal to is reservedaJ is not more than j and i is not more than LbThe sentence pairs less than or equal to j realize data filtering;
step 22: and acquiring key phrases of the sentences to finish the feature extraction of the texts. In order to save storage space and improve matching efficiency, stop words need to be removed and word stems need to be extracted first in the text processing process. In addition, the present invention applies convolutional neural networks based on a similarity matrix of text at word granularity. Therefore, the sentence is required to be subjected to word segmentation processing to obtain a corresponding key phrase, and feature extraction of the text is completed.
As an example, suppose the question sentence pair "How to learn Java programming language?" and "How do I learn a computer language like java?" satisfies the length requirement of the data filtering. After text feature extraction, the following key phrases are generated:
TABLE 3 example keyword sets
Problem 1 ['How','learn','Java','program','language']
Problem 2 ['How','learn','computer','language','like','java']
Step 23: in order to ensure good reliability of the trained matching network, the input matrix needs to maintain a consistent size, so that all sentences are determined to correspondingly contain the same number of keywords. After the key word groups of all sentences are obtained, the number of the key words contained in each sentence in the sentence pair is counted, the minimum number m of the key words is recorded, and the first m key words are used as representative key words according to the weight of the key words in each sentence, so that the fixed length processing is realized.
For example, for the above example question sentence pair, when the minimum keyword m is set to 4, the fixed-length keyword group shown in table 4 is generated:
TABLE 4 fixed-length key phrase examples
Problem 1 ['How','learn','Java','program']
Problem 2 ['How','learn','computer','language']
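The preprocessing of steps 21-23 can be sketched as follows, assuming NLTK for stop-word removal, stemming and tokenization; the length thresholds and the use of simple keyword order instead of a weight-based ranking are illustrative choices, not taken from the patent (note that the example in Table 3 keeps "How", so the exact stop-word list used in the patent evidently differs from NLTK's default).

```python
import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize    # requires nltk.download("punkt")

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def keep_pair(sent_a, sent_b, i=3, j=30):
    """Length filtering of step 21: keep the pair only if both sentence
    lengths (in whitespace tokens) fall inside [i, j]. The thresholds
    here are illustrative."""
    la, lb = len(sent_a.split()), len(sent_b.split())
    return i <= la <= j and i <= lb <= j

def extract_keywords(sentence):
    """Step 22: tokenize, drop stop words, stem the remaining tokens."""
    tokens = word_tokenize(re.sub(r"[^A-Za-z0-9 ]", " ", sentence))
    return [STEMMER.stem(t) for t in tokens if t.lower() not in STOP]

def fixed_length(keywords, m):
    """Step 23 (simplified): keep the first m keywords. The patent ranks
    keywords by weight before truncating; plain order is used here."""
    return keywords[:m]
```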
Step 24: based on the selected representative keywords, the invention adopts a TF-IDF feature extraction method to generate word vector representation of each keyword in a sentence so as to obtain word frequency-reverse file frequency, wherein TF-TDF is TF-IDF actually and is used for evaluating the importance degree of words in a sentence set or a corpus. Wherein the content of the first and second substances,
(1) TF is the term frequency, representing how frequently a keyword occurs in a sentence. The calculation formula is:

tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}

In the term frequency formula, n_{i,j} denotes the number of occurrences of keyword t_i in sentence d_j, and the denominator is the total number of occurrences of all keywords in the sentence.
(2) IDF is the inverse document frequency, representing how frequently a keyword occurs across all sentences. The calculation formula is:

idf_i = \log \frac{|D|}{|\{ j : t_i \in d_j \}|}

In the inverse document frequency formula, |D| is the total number of sentences in the corpus and the denominator is the number of sentences containing keyword t_i; here the corpus consists of the sentences in all sentence pairs.
Since the similarity matrix is constructed based on word granularity, for each keyword in a sentence, (tf, idf, tf × idf) is used as a keyword vector for similarity calculation.
For example, for the representative keywords of Question 2 in the example question sentence pair, the TF-IDF word vectors are shown in Table 5:
table 5 represents examples of word vector representations of the keywords TF-IDF
Keyword TF IDF TF-IDF
How 0.1428 1.3664 0.1952
learn 0.1428 3.3524 0.4789
computer 0.1428 4.1997 0.5999
language 0.1428 4.6051 0.6578
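A minimal sketch of the per-keyword (tf, idf, tf·idf) vector of step 24, following the two formulas above; the corpus is the set of keyword lists of all sentences, and the +1 smoothing in the IDF denominator is an assumption, not something the patent specifies.

```python
import math
from collections import Counter

def tfidf_vectors(sentence_keywords, corpus_keywords):
    """Return {keyword: (tf, idf, tf*idf)} for one sentence.

    sentence_keywords: list of keywords of the sentence being vectorized.
    corpus_keywords:   list of keyword lists, one per sentence in the
                       corpus (all sentences of all pairs, as in the patent).
    """
    counts = Counter(sentence_keywords)
    total = sum(counts.values())
    n_docs = len(corpus_keywords)
    vectors = {}
    for kw in sentence_keywords:
        tf = counts[kw] / total
        df = sum(1 for doc in corpus_keywords if kw in doc)
        # +1 guards against keywords absent from the corpus; the exact
        # smoothing used in the patent is not stated.
        idf = math.log(n_docs / (1 + df))
        vectors[kw] = (tf, idf, tf * idf)
    return vectors
```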
Step 25: in order to directly convert the keywords into space vectors and simultaneously mine the static semantic information of the words, the Word2Vec Word vector representation of each keyword in the sentence is generated by adopting genim training based on the selected representative keywords. Similarly, the sentences in all sentence pairs form a training corpus, a training model is called through a gensim packet, corresponding model parameters including model types, an acceleration method, the lowest word frequency, a sliding window and word vector dimensions are set, and the generated model is stored after training is finished; and in the Word vector generation stage, directly importing the text to be processed into a stored generation model so as to obtain Word2Vec vector representation corresponding to the keywords, wherein the dimension of each Word vector is consistent with the dimension attribute set in the training process.
For example, for the representative keywords of Question 2 in the example question sentence pair, the Word2Vec word vectors are shown in Table 6:
table 6 represents an example of Word2Vec Word vector representation for the keyword
Keyword Word2Vec
How [-0.0022,0.0372,0.0149,-0.0047,-0.045,0.0572]
learn [0.0296,0.0049,-0.0028,0.0261,-0.0425,-0.0287]
computer [0.071,0.0111,0.0147,0.0485,0.0686,-0.0168]
language [-0.0079,0.0482,-0.0175,-0.0686,-0.0036,0.0232]
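A hedged gensim sketch of step 25; the hyperparameter values are illustrative (the patent only names the parameter types), and vector_size=6 is chosen only to match the 6-dimensional vectors shown in Table 6. The variable corpus is assumed to hold the keyword lists produced by the preprocessing sketch above.

```python
from gensim.models import Word2Vec

# corpus: list of keyword lists, one per sentence. Hyperparameters are
# illustrative; the patent only says that model type, acceleration
# method, minimum word frequency, window and dimensionality are set.
model = Word2Vec(
    sentences=corpus,
    vector_size=6,   # matches the 6-dim vectors of Table 6 (gensim >= 4.0; use size= on 3.x)
    window=5,
    min_count=1,
    sg=1,            # skip-gram; CBOW (sg=0) is the other option
    workers=4,
)
model.save("w2v_keywords.model")

# Later, in the word-vector generation stage:
model = Word2Vec.load("w2v_keywords.model")
vec_learn = model.wv["learn"]   # 6-dimensional vector for the keyword
```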
Step 26: in addition to TF-IDF and Word2Vec Word vectors, the invention generates ELMo Word vector representation based on the selected representative keywords and according to the context in which the representative keywords are located, and can solve the problem of Word ambiguity while acquiring dynamic semantic information. Firstly, directly inputting a sentence to be processed into an ELMo network which is pre-trained by adopting an official ELMo fragment mode, wherein each keyword in the sentence can obtain three corresponding Embedding, the bottom layer is the original Embedding of the keyword, and the other two layers are the Embedding of the corresponding position of the keyword in an ELMo bidirectional LSTM network structure and are respectively used for obtaining syntax and semantic information of the keyword; then by weighted summation of the three embeddings, an ELMo word vector representation of the keyword in the dynamic context is generated, and the dimension size is 1024.
It should be noted that there is no fixed order among steps 24 to 26; they may be performed simultaneously, in random order, or sequentially from step 24 to step 26.
As an example, for the representative keywords of Question 2 in the example question sentence pair, the ELMo word vectors are shown in Table 7:
table 7 represents an example of keyword ELMo word vector representation
(Table 7 is provided as an image in the original publication.)
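A possible realization of step 26 with AllenNLP's ElmoEmbedder (available in the 0.x releases); equal layer weights are an assumption, since the patent only says the three embeddings are combined by weighted summation.

```python
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder  # allennlp 0.x

elmo = ElmoEmbedder()  # downloads the official pretrained weights on first use

def elmo_vectors(tokens, layer_weights=(1/3, 1/3, 1/3)):
    """Return one 1024-d vector per token.

    embed_sentence returns an array of shape (3, n_tokens, 1024): the
    character-CNN token embedding plus the two biLSTM layers. The patent
    combines the three layers by weighted summation; equal weights are
    an assumption here.
    """
    layers = elmo.embed_sentence(tokens)          # (3, n_tokens, 1024)
    w = np.asarray(layer_weights).reshape(3, 1, 1)
    return (w * layers).sum(axis=0)               # (n_tokens, 1024)

vectors = elmo_vectors(["How", "learn", "computer", "language"])
print(vectors.shape)  # (4, 1024)
```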
Step 3, constructing an input similarity matrix of the matching network
The matching network adopted by the invention mines deep matching information based on the similarity of the texts at word granularity to obtain an accurate matching grade. Step 2 processes every sentence into a key phrase of length m, generates the multi-dimensional word vector representation of each keyword in the sentence, and obtains the feature information of the words in different dimensions. For each class of keyword vector (TF-IDF, Word2Vec and ELMo), the invention constructs an m × m similarity matrix A_{m×m} by calculating the cosine similarity between the keyword vectors of the sentences to be matched (any element a_{ij} of A_{m×m} is the cosine similarity of the i-th keyword of sentence A and the j-th keyword of sentence B). After the three layers of similarity matrices are constructed, they are input into the matching network for similarity feature learning.
The cosine similarity measures the difference between two individuals by using a cosine value of an included angle between two vectors in a vector space. The calculation formula is as follows:
\cos(\theta) = \frac{\vec{A} \cdot \vec{B}}{|\vec{A}|\,|\vec{B}|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}

In this cosine similarity formula, \vec{A} and \vec{B} denote the two vectors, |\vec{A}| and |\vec{B}| their norms, A_i and B_i the i-th components of \vec{A} and \vec{B}, and n the total number of components in each vector.
For example, for the question sentence pair in the above example, three layers of input similarity matrices are generated based on the TF-IDF, Word2Vec and ELMo word vector representations, respectively, as shown in Tables 8-10:
TABLE 8 TF-IDF based input similarity matrix example
Keyword How learn Java program
How 0.9988 0.9987 0.9983 0.9983
learn 0.9968 0.9991 0.9992 0.9992
computer 0.9962 0.9991 0.9992 0.9992
language 0.9962 0.9991 0.9992 0.9992
TABLE 9 input similarity matrix example based on Word2Vec
Keyword How learn Java program
How 1.0 0.5272 0.0789 0.5761
learn 0.5272 1.0 0.3947 0.3252
computer 0.2927 0.5623 0.5928 0.5151
language 0.7277 0.3032 0.2521 0.6271
TABLE 10 ELMo-based input similarity matrix example
Keyword How learn Java program
How 0.9705 0.6567 0.5864 0.5559
learn 0.6378 0.9436 0.6116 0.6521
computer 0.5614 0.6211 0.7199 0.7446
language 0.5555 0.6215 0.6711 0.7636
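The three-layer input of step 3 can be assembled as below: one m × m cosine-similarity matrix per representation, stacked into a 3 × m × m tensor. The dictionary keys 'tfidf', 'word2vec' and 'elmo' are illustrative names, not part of the patent.

```python
import numpy as np

def cosine(u, v, eps=1e-9):
    """Cosine similarity of two keyword vectors, per the formula above."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def similarity_layer(vectors_a, vectors_b):
    """m x m matrix of cosine similarities between the keyword vectors of
    sentence A (rows) and sentence B (columns), for one representation."""
    return np.array([[cosine(u, v) for v in vectors_b] for u in vectors_a])

def build_input_tensor(sent_a_vecs, sent_b_vecs):
    """Stack the TF-IDF, Word2Vec and ELMo layers into a 3 x m x m tensor,
    the three-channel input of the matching network.

    sent_a_vecs / sent_b_vecs: dicts mapping 'tfidf', 'word2vec', 'elmo'
    to lists of m keyword vectors (names are illustrative).
    """
    layers = [
        similarity_layer(sent_a_vecs[k], sent_b_vecs[k])
        for k in ("tfidf", "word2vec", "elmo")
    ]
    return np.stack(layers, axis=0)   # shape (3, m, m)
```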
Step 4, training the matching network
FIG. 2 is a flowchart of training the matching network with a convolutional neural network. The matching network constructed by the invention comprises an input layer, a convolutional layer, a pooling layer, a global average pooling layer and an output layer. When training the matching network, the model parameters are initialized first, including the number of iterations, batch size, learning rate, random weights, filter size, number of filters and stride. The input layer of the matching network consists of the three-layer similarity matrices between sentences obtained in step 3. A three-channel convolution scan is performed over the input similarity matrices using two 3 x 2 filters with a stride of 1: the elements of each layer of a filter are multiplied with the elements at the corresponding positions in the receptive field of each input layer and summed, and the sum of the three layers' convolution results is taken as the convolution output element, thereby generating two corresponding two-dimensional output feature matrices, which also serve as the pooling input matrices.
In the pooling layer, Max-Pooling is adopted: the largest similarity element in the receptive field of each feature matrix is taken as the pooled output feature, and the pooling operation is performed on the two input feature matrices to form the pooled output matrices, completing deep filtering and extraction of the similarity features. To reduce computational complexity and mitigate over-fitting, a global average pooling layer is placed after the pooling layer; all elements of each feature matrix output by pooling are averaged, yielding two feature values, corresponding to the two feature matrices, that are used to predict the matching grade.
In the output layer, the feature values obtained after global average pooling are fed into softmax to predict the matching grade of sentence A and sentence B. The actual matching levels of the training sentence pairs were extracted in step 1 and fall into two classes, 1 and 0 ("1" indicates a match and "0" a non-match). In the back propagation stage, the error loss between the predicted matching result and the actual matching grade is calculated, and gradient descent is used to minimize the loss function, adjusting the weight parameters of the network layer by layer in reverse; the model parameters are thereby determined and training is complete.
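A PyTorch sketch of the matching network described above: three-channel input, two 3 x 2 convolution filters with stride 1, max pooling, global average pooling and a two-way softmax, trained with gradient descent on the cross-entropy loss. The 2 x 2 pooling window and the SGD learning rate are assumptions the patent does not specify.

```python
import torch
import torch.nn as nn

class MatchingNetwork(nn.Module):
    """3-channel similarity input -> two 3x2 conv filters (stride 1) ->
    max pooling -> global average pooling -> 2-way match / non-match
    classifier. The 2x2 pooling window is an assumption."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=(3, 2), stride=1)
        self.pool = nn.MaxPool2d(kernel_size=2)
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling: one value per feature map
        self.fc = nn.Linear(2, 2)            # 2 pooled features -> 2 match classes

    def forward(self, x):                    # x: (batch, 3, m, m)
        x = torch.relu(self.conv(x))
        x = self.pool(x)
        x = self.gap(x).flatten(1)           # (batch, 2)
        return self.fc(x)                    # logits; softmax is applied by the loss or at prediction

# Training step sketch (labels: 1 = match, 0 = non-match)
model = MatchingNetwork()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()            # combines log-softmax and negative log-likelihood

def train_step(batch_x, batch_y):
    optimizer.zero_grad()
    loss = criterion(model(batch_x), batch_y)
    loss.backward()                          # back-propagation of the error loss
    optimizer.step()                         # gradient-descent weight update
    return loss.item()
```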
Step 5, service discovery test
In the testing process, the invention performs text matching on the function description information of the OWLS-TC4 data set service and the query request. And (4) for each query request, performing the operations of the steps 2 and 3 with the services in the test set one by one, and predicting the probability score of the matching level based on the matching network trained in the step 4. Since the present invention aims to find the service that best satisfies the query request, the candidate services that are predicted to match (with the predicted matching level of "1") need to be further ranked according to the probability score obtained by prediction, and the top N services with the highest probability score are the target results to be retrieved.
For example, when the user query request is described as "It's return list of available subscribed services by the given counter," step 2 and step 3 are performed on the services in the test set one by one, and the matching level probability score of the service is predicted through the matching network, so as to generate the following probability prediction example table:
TABLE 11 probabilistic prediction examples Table
(Table 11 is provided as an image in the original publication.)
And then, sorting the services with the predicted matching level of 1 according to the P (1) value, and selecting the top N services with the highest scores as the target results of retrieval.
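A sketch of the service discovery test of step 5, assuming a hypothetical vectorize_pair helper that runs steps 2-3 for a query/service pair and the trained model from the previous sketch; services predicted as matching are ranked by P(1) and the top N are returned.

```python
import torch
import torch.nn.functional as F

def discover(query_text, services, vectorize_pair, model, top_n=5):
    """Score every service against the query and return the top-N matches.

    vectorize_pair(query_text, service_text) is assumed to run steps 2-3
    (keyword extraction, the three word-vector representations, and the
    3 x m x m similarity tensor) and return a torch tensor of that shape.
    """
    model.eval()
    scored = []
    with torch.no_grad():
        for service_text in services:
            x = vectorize_pair(query_text, service_text).unsqueeze(0)  # add batch dimension
            probs = F.softmax(model(x), dim=1)[0]
            p_match = probs[1].item()        # P(matching level = 1)
            if probs.argmax().item() == 1:   # keep only services predicted as matching
                scored.append((service_text, p_match))
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_n]
```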
The invention provides a service discovery system comprising a data preparation and processing module, a similarity matrix construction module, an iterative training module and a service discovery module. The data preparation and processing module is used for extracting question sentence pairs and corresponding matching grades for semantic similarity detection from the Quora data set, acquiring the functional description information of services and query requests from the OWLS-TC4 data set as the test set, and preprocessing the extracted sentences to generate three keyword vector representations: TF-IDF, Word2Vec and ELMo;
the similarity matrix construction module constructs a three-layer similarity matrix by calculating cosine similarity between different keyword vectors in a sentence pair based on the three keyword vectors and takes the three-layer similarity matrix as the input of a matching network;
the iterative training module performs convolution, pooling, global average pooling and softmax classification operations on the three-layer similarity matrix, calculates the error loss between the prediction result and the actual matching grade according to a loss function, and then performs backward iterative optimization to obtain the convolutional neural network model;
for each query request, the service discovery module performs three keyword vector extractions and three-layer similarity matrix construction with the services in the test set one by one, and predicts the probability score of the matching level based on the convolutional neural network model; and then, sequencing the predicted and matched candidate services according to the probability scores obtained by prediction, wherein the first N services with the highest probability scores are the target results to be retrieved.
The present invention also provides a computer device, including but not limited to one or more processors and a memory, where the memory is used to store a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and in doing so can implement part or all of the steps of the service discovery method based on multi-dimensional word vector context matching according to the present invention. The memory is further used to store the Quora data set, the OWLS-TC4 data set, and the intermediate and final results of each step of the method according to the present invention.
A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, is capable of implementing the service discovery method based on multi-dimensional word vector context matching according to the present invention.
The computer device may be a notebook computer, a tablet computer, a desktop computer, a mobile phone or a workstation.
The invention also provides an output device, the output device is connected with the output end of the processor, and the output device is a display or a printer.
The processor may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or a Field-Programmable Gate Array (FPGA).
The memory of the invention can be an internal storage unit of a notebook computer, a tablet computer, a desktop computer, a mobile phone or a workstation, such as a memory and a hard disk; external memory units such as removable hard disks, flash memory cards may also be used.
Computer-readable storage media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. The computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM).

Claims (10)

1. A service discovery method based on multi-dimensional word vector context matching is characterized by comprising the following steps:
step 1, respectively extracting question sentence pairs for semantic similarity detection and corresponding matching grades from a Quora data set so as to train a matching network; acquiring function description information of service and query requests from an OWLS-TC4 data set as a test set;
step 2, processing the sentences extracted in step 1 to generate three keyword vector representations: TF-IDF, Word2Vec and ELMo; the three keyword vector representations aim to obtain feature information of the keywords in the term frequency-inverse document frequency, static semantic and dynamic semantic dimensions;
step 3, based on the three keyword vectors obtained in the step 2, calculating cosine similarity between different keyword vectors in each pair of sentences to construct a three-layer similarity matrix, and using the three-layer similarity matrix as input of a matching network;
step 4, initializing the convolutional neural network model parameters, performing convolution, pooling, global average pooling and softmax classification operations on the similarity matrix obtained in step 3, calculating the error loss between the prediction result and the actual matching grade according to a loss function, and then performing backward iterative optimization to obtain the convolutional neural network model, namely the matching network;
step 5, for each query request, performing the operations of the step 2 and the step 3 with the service in the test set one by one, and predicting the probability score of the matching level based on the matching network trained in the step 4; and then, sequencing the predicted and matched candidate services according to the probability scores obtained by prediction, wherein the first N services with the highest probability scores are the target results to be retrieved.
2. The method for service discovery based on multi-dimensional word vector context matching as claimed in claim 1, wherein in step 1, the extracted matching level is used as an output label in a network training process for calculating error loss of the prediction result.
3. The method for discovering service based on context matching of multi-dimensional word vectors as claimed in claim 1, wherein step 2 is implemented by the following steps:
step 21, respectively calculating the lengths of sentence A and sentence B in each sentence pair, recorded as L_a and L_b; setting length thresholds i and j, and retaining only the sentence pairs satisfying i ≤ L_a ≤ j and i ≤ L_b ≤ j, thereby filtering the sentence pairs;
step 22, preprocessing the sentence pair after data filtering, including removing stop words, extracting word stems and segmenting words, obtaining corresponding keywords, and completing feature extraction of the text;
step 23, counting the number of keywords of each sentence in all sentence pairs, and selecting the minimum number of keywords as m; according to the weight value of the keywords in each sentence, the first m keywords are used as representative keywords to realize fixed-length processing;
step 24, based on the obtained representative keywords, generating the TF-IDF word vector representation of each sentence to obtain term frequency-inverse document frequency information;
step 25, generating Word2Vec Word vector representation of each sentence based on the obtained representative keywords to obtain static semantic information;
based on the obtained representative keywords, an ELMo word vector representation of each sentence is generated to obtain dynamic semantic information, step 26.
4. The method of claim 1, wherein in the three-layer similarity matrix in step 3, each layer of the feature matrix is formed by the cosine similarities between sentence A and sentence B at word granularity under one of the word vector representations.
5. The method of claim 1, wherein in step 4, a three-channel convolution scan is performed in the convolutional layer over the input-layer similarity matrices using two 3 x 2 filters with a stride of 1; the elements of each layer of a filter are multiplied and summed with the elements at the corresponding positions in the receptive field of each input layer, and the sum of the three layers' convolution results is taken as the convolution output element, thereby generating two corresponding two-dimensional output feature matrices that also serve as the pooling input matrices.
6. The method for discovering services based on multi-dimensional word vector context matching according to claim 1, wherein in step 4, Max-Pooling is adopted in the pooling layer: the largest similarity element in the receptive field of each feature matrix is taken as the pooled output feature, and the pooling operation is performed on the two input feature matrices to form the pooled output matrices, completing deep filtering and extraction of the similarity features; in step 4, a global average pooling layer is arranged after the pooling layer, and all elements of each feature matrix output by pooling are averaged, yielding two feature values corresponding to the two feature matrices.
7. The method of claim 1, wherein in step 4, the eigenvalues obtained after the global average pooling are input into softmax in the output layer, the matching grades of the sentence a and the sentence B are predicted, in the back propagation stage, the error loss between the predicted matching result and the actual matching grade is calculated, and a gradient descent method is used to minimize a loss function to perform layer-by-layer back optimization adjustment on the weight parameters in the network, so as to determine the model parameters, and the training is completed.
8. The service discovery system based on the text matching under the multi-dimensional word vector is characterized by comprising a data preparation and processing module, a similarity matrix construction module, an iterative training module and a service discovery module; the data preparation and processing module is used for respectively extracting a problem sentence pair and a corresponding matching grade for semantic similarity detection from the Quora data set, acquiring functional description information of a service and query request from the OWLS-TC4 data set as a test set, and preprocessing the extracted sentences to generate three key Word vector representations of TF-IDF, Word2Vec and ELMo;
the similarity matrix construction module constructs a three-layer similarity matrix by calculating cosine similarity between different keyword vectors in a sentence pair based on the three keyword vectors and takes the three-layer similarity matrix as the input of a matching network;
the iterative training module performs convolution, pooling, global average pooling and softmax classification operations on the three-layer similarity matrix, calculates the error loss between the prediction result and the actual matching grade according to a loss function, and then performs backward iterative optimization to obtain the convolutional neural network model;
for each query request, the service discovery module performs three keyword vector extractions and three-layer similarity matrix construction with the services in the test set one by one, and predicts the probability score of the matching level based on the convolutional neural network model; and then, sequencing the predicted and matched candidate services according to the probability scores obtained by prediction, wherein the first N services with the highest probability scores are the target results to be retrieved.
9. A computer device, comprising but not limited to one or more processors and a memory, wherein the memory is used for storing a computer executable program, the processor reads part or all of the computer executable program from the memory and executes the computer executable program, and the processor can implement part or all of the steps of the service discovery method based on multi-dimensional word vector context matching according to claims 1 to 7 when executing the computer executable program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, is adapted to carry out the method for service discovery based on multi-dimensional word vector context matching according to claims 1-7.
CN202010982942.XA 2020-09-17 2020-09-17 Service discovery method, system and equipment based on multi-dimensional word vector context matching Pending CN112115716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010982942.XA CN112115716A (en) 2020-09-17 2020-09-17 Service discovery method, system and equipment based on multi-dimensional word vector context matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010982942.XA CN112115716A (en) 2020-09-17 2020-09-17 Service discovery method, system and equipment based on multi-dimensional word vector context matching

Publications (1)

Publication Number Publication Date
CN112115716A true CN112115716A (en) 2020-12-22

Family

ID=73800165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010982942.XA Pending CN112115716A (en) 2020-09-17 2020-09-17 Service discovery method, system and equipment based on multi-dimensional word vector context matching

Country Status (1)

Country Link
CN (1) CN112115716A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112768080A (en) * 2021-01-25 2021-05-07 武汉大学 Medical keyword bank establishing method and system based on medical big data
CN112927782A (en) * 2021-03-29 2021-06-08 山东思正信息科技有限公司 Mental and physical health state early warning system based on text emotion analysis
CN113094703A (en) * 2021-03-11 2021-07-09 北京六方云信息技术有限公司 Output content filtering method and system for web intrusion detection
CN113283351A (en) * 2021-05-31 2021-08-20 深圳神目信息技术有限公司 Video plagiarism detection method using CNN to optimize similarity matrix
CN113380359A (en) * 2021-06-01 2021-09-10 上海德衡数据科技有限公司 Medical information query method and device based on Internet of things and electronic equipment
CN114780690A (en) * 2022-06-20 2022-07-22 成都信息工程大学 Patent text retrieval method and device based on multi-mode matrix vector representation
CN116128438A (en) * 2022-12-27 2023-05-16 江苏巨楷科技发展有限公司 Intelligent community management system based on big data record information

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299753A (en) * 2018-12-11 2019-02-01 济南浪潮高新科技投资发展有限公司 A kind of integrated learning approach and system for Law Text information excavating
CN110175221A (en) * 2019-05-17 2019-08-27 国家计算机网络与信息安全管理中心 Utilize the refuse messages recognition methods of term vector combination machine learning
US20200026759A1 (en) * 2018-07-18 2020-01-23 The Dun & Bradstreet Corporation Artificial intelligence engine for generating semantic directions for websites for automated entity targeting to mapped identities
CN110941698A (en) * 2019-11-18 2020-03-31 陕西师范大学 Service discovery method based on convolutional neural network under BERT
CN111538836A (en) * 2020-04-22 2020-08-14 哈尔滨工业大学(威海) Method for identifying financial advertisements in text advertisements

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200026759A1 (en) * 2018-07-18 2020-01-23 The Dun & Bradstreet Corporation Artificial intelligence engine for generating semantic directions for websites for automated entity targeting to mapped identities
CN109299753A (en) * 2018-12-11 2019-02-01 济南浪潮高新科技投资发展有限公司 A kind of integrated learning approach and system for Law Text information excavating
CN110175221A (en) * 2019-05-17 2019-08-27 国家计算机网络与信息安全管理中心 Utilize the refuse messages recognition methods of term vector combination machine learning
CN110941698A (en) * 2019-11-18 2020-03-31 陕西师范大学 Service discovery method based on convolutional neural network under BERT
CN111538836A (en) * 2020-04-22 2020-08-14 哈尔滨工业大学(威海) Method for identifying financial advertisements in text advertisements

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN YANG: "Chinese Sentence Similarity Calculation Based on Convolutional Neural Networks" (基于卷积神经网络的中文句子相似度计算), China Master's Theses Full-text Database, Information Science and Technology, no. 8, pages 138-1277 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112768080A (en) * 2021-01-25 2021-05-07 武汉大学 Medical keyword bank establishing method and system based on medical big data
CN113094703A (en) * 2021-03-11 2021-07-09 北京六方云信息技术有限公司 Output content filtering method and system for web intrusion detection
CN112927782A (en) * 2021-03-29 2021-06-08 山东思正信息科技有限公司 Mental and physical health state early warning system based on text emotion analysis
CN112927782B (en) * 2021-03-29 2023-08-08 山东齐鲁心理健康研究院有限公司 Heart health state early warning system based on text emotion analysis
CN113283351A (en) * 2021-05-31 2021-08-20 深圳神目信息技术有限公司 Video plagiarism detection method using CNN to optimize similarity matrix
CN113283351B (en) * 2021-05-31 2024-02-06 深圳神目信息技术有限公司 Video plagiarism detection method using CNN optimization similarity matrix
CN113380359A (en) * 2021-06-01 2021-09-10 上海德衡数据科技有限公司 Medical information query method and device based on Internet of things and electronic equipment
CN114780690A (en) * 2022-06-20 2022-07-22 成都信息工程大学 Patent text retrieval method and device based on multi-mode matrix vector representation
CN116128438A (en) * 2022-12-27 2023-05-16 江苏巨楷科技发展有限公司 Intelligent community management system based on big data record information

Similar Documents

Publication Publication Date Title
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN109190117B (en) Short text semantic similarity calculation method based on word vector
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
Celikyilmaz et al. LDA based similarity modeling for question answering
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN108280114B (en) Deep learning-based user literature reading interest analysis method
CN109635083B (en) Document retrieval method for searching topic type query in TED (tele) lecture
CN111221944B (en) Text intention recognition method, device, equipment and storage medium
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN111291188B (en) Intelligent information extraction method and system
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN107832326B (en) Natural language question-answering method based on deep convolutional neural network
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN111581949B (en) Method and device for disambiguating name of learner, storage medium and terminal
CN110727765B (en) Problem classification method and system based on multi-attention machine mechanism and storage medium
CN111191002A (en) Neural code searching method and device based on hierarchical embedding
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN112395875A (en) Keyword extraction method, device, terminal and storage medium
CN113220832A (en) Text processing method and device
CN115688024A (en) Network abnormal user prediction method based on user content characteristics and behavior characteristics
Yildiz et al. Improving word embedding quality with innovative automated approaches to hyperparameters
CN111581364A (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination