CN110674256B - Method and system for detecting correlation degree of comment and reply of OTA hotel

Info

Publication number: CN110674256B
Application number: CN201910909573.9A
Authority: CN
Inventors: 江小林; 罗超; 胡泓
Original assignee: Ctrip Computer Technology Shanghai Co Ltd
Current assignee: Ctrip Computer Technology Shanghai Co Ltd
Priority date: 2019-09-25
Filing date: 2019-09-25
Publication date: 2023-05-12
Anticipated expiration: 2039-09-25
Also published as: CN110674256A

Abstract

The invention discloses a method and a system for detecting the degree of correlation between a comment and a reply of an OTA hotel. The detection method comprises the following steps: obtaining a comment and a reply; converting the comment and the reply into a comment vector sequence and a reply vector sequence respectively; encoding the comment vector sequence to obtain a coded comment vector at each moment; encoding the reply vector sequence to obtain a coded reply vector at each moment; matching the coded comment vector at each moment with the coded reply vector at each moment to obtain a plurality of matching vectors; capturing the relation between the matching vectors in the matching vector sequence and aggregating them into a spliced vector according to the relation; inputting the spliced vector into a fully connected layer to obtain a target vector; and calculating the relevance probability of the comment and the reply according to the target vector. The invention can effectively, quickly and accurately determine whether a reply matches the content of the comment it answers, which both helps hotels improve existing products on the basis of valid comments and reduces labor cost.

Description

Method and system for detecting correlation degree of comment and reply of OTA hotel

Technical Field

The invention relates to the service field of OTA (online travel agency) hotels, in particular to a method and a system for detecting correlation degree of comment and reply of an OTA hotel.

Background

For service enterprises, user inquiries and feedback are critical, and many products provide a comment function. User comments, especially negative ones, can fully reflect the problems of a product, so merchants need to respond to them appropriately. When a customer who left a negative comment (malicious negative comments excepted) receives a proper response, the customer feels that the merchant values his or her opinion, and many such customers change their negative attitude. It is therefore necessary to detect, for existing product comments, which replies are irrelevant to the comment and which are targeted answers, and to improve the replies accordingly.

Most current methods for judging the relevance between a comment and a reply rely on manually set keyword rules, and some filter out irrelevant question-answer pairs by setting thresholds.

Disclosure of Invention

The invention aims to overcome the defect of inaccurate matching between the comment of the user and the reply of the merchant in the prior art, and provides a detection method and a detection system capable of efficiently and accurately detecting the correlation degree between the comment and the reply of an OTA hotel.

The invention solves the technical problems by the following technical scheme:

The invention provides a method for detecting the relevance between a comment and a reply of an OTA hotel, which comprises the following steps:

obtaining comments and replies of the OTA hotel;

converting the commentary and the replies into a commentary vector sequence and a reply vector sequence respectively;

coding semantic relations among vectors in the comment vector sequence to obtain a coded comment vector at each moment;

coding semantic relations among vectors in the reply vector sequence to obtain a coded reply vector at each moment;

matching the coded comment vector at each moment with the coded reply vector at each moment to obtain a plurality of matching vectors, wherein the plurality of matching vectors form a matching vector sequence;

capturing the relation between the matching vectors in the vector sequence and aggregating the matching vector sequence into a spliced vector according to the relation;

inputting the spliced vectors to a full-connection layer to obtain target vectors, wherein the dimension of the target vectors is the same as the number of preset categories;

and calculating the relevance probability of the comment and the reply according to the target vector.

A neural network model is used to encode the semantic relations among the words in the comment vector sequence and in the reply vector sequence respectively.

The relevance probability of the comment and the reply is calculated by softmax (the normalized exponential function).
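The softmax step can be sketched as follows. This is a minimal illustration rather than the patented implementation: the two-category target vector and the ordering of its categories are assumptions made for the example only.

```python
import math


def softmax(target_vector):
    """Numerically stable softmax over the entries of the target vector."""
    peak = max(target_vector)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in target_vector]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical target vector with one entry per preset category.
probabilities = softmax([2.0, 0.5])
```

Each output entry is the probability assigned to one preset category, and the entries sum to 1.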

According to the invention, the comment and the reply of the OTA hotel are vectorized, the semantic relation between the vectorized comment and reply is analyzed, and each word in the comment and the reply is analyzed and compared against the whole sentence through machine learning. Whether a reply matches the content of the comment can thereby be calculated effectively, quickly and accurately, which helps the hotel improve existing products on the basis of valid comments, reduces labor cost, and, with improved recognition precision and recall, improves the merchant's service quality and thus benefits the merchant.

Preferably, before the step of inputting the spliced vector to the full-connection layer to obtain the target vector, the method further includes:

calculating the text similarity of each reply and other replies to obtain a similarity sequence;

obtaining a similarity average value according to the similarity sequence;

splicing the similarity average value serving as one dimension in the splicing vector with the splicing vector;

The step of inputting the splice vector to the full connection layer to obtain a target vector includes:

inputting the spliced vector spliced with the similarity average value to a full-connection layer to obtain a target vector;

and/or,

the step of matching the coded comment vector at each time with the coded reply vector at each time to obtain a plurality of matching vectors includes:

and obtaining a plurality of matching vectors according to cosine similarity of the weighted coding comment vector of each dimension at each moment and the weighted coding reply vector of the corresponding dimension at each moment.

The text similarity can be calculated by means such as edit distance;

wherein the calculation formula of the cosine similarity is as follows:

where $v_1$ and $v_2$ are the vectors to be compared, $k$ denotes a dimension of the vector, and $w_k$ is a trainable parameter that is updated by back-propagation through the neural network.
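The cosine similarity formula itself is not reproduced in this text. The sketch below shows one reading that is consistent with the symbol definitions, assuming that each dimension k of both vectors is scaled by the trainable weight w_k before an ordinary cosine similarity is taken. The function names and the use of several weight vectors to produce several matching values are assumptions for illustration, not the patent's stated formula.

```python
import math


def weighted_cosine(v1, v2, w):
    """Cosine similarity after scaling dimension k of both vectors by w[k]."""
    a = [wk * x for wk, x in zip(w, v1)]
    b = [wk * y for wk, y in zip(w, v2)]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matching_vector(v1, v2, weight_vectors):
    """Hypothetical helper: one weighted cosine value per weight vector."""
    return [weighted_cosine(v1, v2, w) for w in weight_vectors]
```

In a trained model the weights would be learned; here they are plain lists so the arithmetic can be checked by hand.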

According to the invention, the text similarity between the current reply of a specific hotel and its other replies is computed, the average of the resulting similarity sequence is obtained, and the average is used as one dimension of the spliced vector, so that the relevance probability calculated from the spliced vector is more accurate and better meets practical requirements.
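A minimal sketch of the similarity-average feature follows. It assumes that text similarity is derived from edit distance by normalizing with the longer string's length, which is one common convention and is not specified in the source. All function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def mean_similarity(reply, other_replies):
    """Average similarity of one reply to the hotel's other replies."""
    if not other_replies:
        return 0.0
    scores = [similarity(reply, other) for other in other_replies]
    return sum(scores) / len(scores)


def append_similarity(spliced_vector, mean_sim):
    """Splice the average in as one extra dimension of the spliced vector."""
    return list(spliced_vector) + [mean_sim]
```

A high average suggests that the reply closely resembles the hotel's other replies, which is the signal this extra dimension carries into the fully connected layer.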

Preferably, the method comprises the steps of,

sequentially matching the coded comment vector at the current moment with the coded reply vector at the last moment from the first comment moment to obtain a first matching vector at each moment;

sequentially matching the coded reply vector at the current moment with the coded comment vector at the last moment from the first reply moment to obtain a second matching vector at each moment;

the plurality of match vectors includes the first match vector and the second match vector;

and/or,

starting from the first comment moment, calculating the coding comment vector at the current moment and the coding reply vector at each moment in sequence to obtain the cosine similarity of each reply moment;

calculating a weighted coding reply vector according to the cosine similarity of each reply time of the current comment time;

starting from the first comment moment, matching the coded comment vector of each comment moment with the corresponding weighted coded reply vector to obtain a third matching vector of each moment;

Starting from the first reply time, sequentially calculating the coding reply vector at the current time and the coding comment vector at each time to obtain the cosine similarity of each comment time;

calculating a weighted coding comment vector according to the cosine similarity of each comment moment of the current reply moment;

starting from the first reply moment, matching the coded reply vector of each reply moment with the corresponding weighted coded comment vector to obtain a fourth matching vector of each moment;

the plurality of match vectors includes the third match vector and the fourth match vector.

The cosine similarity between the comment vector at each comment moment and the reply vector at each reply moment serves as a weight, i.e. the relevance of a given word in the comment to the reply content. Weighting and averaging the reply vectors at all reply moments by this relevance yields the relation of the comment to the reply; likewise, weighting and averaging the comment vectors at all comment moments by the cosine similarity of each comment moment at each reply moment yields the relation of the reply to the comment.

In the invention, starting from the first moment, the vector of the comment at the current moment is fully matched against the vector of the reply at the last moment, and the vector of the reply at the current moment is compared with the vector of the comment at the last moment; in addition, the vectors in the reply or the comment are weighted by their cosine similarity to the comment or the reply. The true relation between the comment and the reply can thereby be obtained, which overcomes the prior-art defect of neglecting fine-grained relevance and yields a more faithful measure of the relevance between comment and reply.
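The two matching strategies can be sketched as follows for the comment-to-reply direction; swapping the arguments gives the reply-to-comment direction. This is a simplified illustration under stated assumptions: "last moment" is read as the final time step of the other sequence, each match is reduced to a single unweighted cosine value instead of a full matching vector, and the weighted average is normalized by the sum of the cosine weights. All function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def full_matching(states, other_states):
    """Match every state against the final state of the other sequence."""
    last = other_states[-1]
    return [cosine(h, last) for h in states]


def attentive_vector(state, other_states):
    """Cosine-weighted average of the other sequence's states."""
    weights = [cosine(state, o) for o in other_states]
    total = sum(weights)
    dim = len(other_states[0])
    if total == 0:
        return [0.0] * dim
    return [sum(w * o[k] for w, o in zip(weights, other_states)) / total
            for k in range(dim)]


def attentive_matching(states, other_states):
    """Match every state against its own weighted view of the other sequence."""
    return [cosine(h, attentive_vector(h, other_states)) for h in states]
```

For example, `full_matching(comment_states, reply_states)` corresponds to the first matching values and `attentive_matching(reply_states, comment_states)` to the fourth.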

Preferably, in the step of encoding semantic relationships among vectors in the comment vector sequence to obtain coded comment vectors at each moment,

the coding comment vector comprises a forward coding comment vector and a reverse coding comment vector;

in the step of encoding the semantic relationships between the vectors in the sequence of reply vectors to obtain encoded reply vectors for each instant,

the coded reply vector comprises a forward coded reply vector and a reverse coded reply vector;

the step of capturing the relationship between the matching vectors in the vector sequence and aggregating the vector sequence into a splice vector according to the relationship comprises the steps of:

inputting the matching vector sequence into a bidirectional LSTM (long short-term memory) model;

obtaining the relation among the matching vectors at each moment according to the bidirectional LSTM model, and intercepting the comment forward relation vector, the comment reverse relation vector, the reply forward relation vector and the reply reverse relation vector at the last moment in the LSTM model;

and aggregating the comment forward relation vector, the comment reverse relation vector, the reply forward relation vector and the reply reverse relation vector into the splicing vector.

In the invention, obtaining both the forward coding comment vector and the reverse coding comment vector avoids the inaccuracy of relying on a unidirectional vector only. Inputting the matching vector sequence into the bidirectional LSTM model and intercepting the four specific vectors captures the complete semantics of the whole sentence, which improves aggregation efficiency and makes the subsequent relevance calculation more accurate.
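The interception and aggregation can be sketched without implementing the LSTM itself. The sketch assumes that the bidirectional model has already produced, for the comment side and the reply side, a forward and a backward list of hidden states aligned to input positions; under that layout the backward pass finishes at position 0. The function names and this layout are assumptions for illustration.

```python
def final_states(forward_outputs, backward_outputs):
    """Return the last-step state of each direction.

    Both lists are indexed by input position, so the forward pass ends at
    the last position and the backward pass ends at position 0.
    """
    return forward_outputs[-1], backward_outputs[0]


def splice(comment_fw, comment_bw, reply_fw, reply_bw):
    """Concatenate the four intercepted relation vectors into one vector."""
    c_f, c_b = final_states(comment_fw, comment_bw)
    r_f, r_b = final_states(reply_fw, reply_bw)
    return list(c_f) + list(c_b) + list(r_f) + list(r_b)
```

The length of the spliced vector is the sum of the four hidden sizes, which fixes the input width of the fully connected layer.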

Preferably, the step of converting the comment and the reply into a comment vector sequence and a reply vector sequence respectively includes:

preprocessing the commentary and the replies;

inputting the commentary and the replies to a word segmentation tool respectively to obtain a first word segmentation commentary sequence and a first word segmentation reply sequence;

Respectively adding preset professional vocabularies in the current scene to the first word segmentation comment sequence and the first word segmentation reply sequence to form a second word segmentation comment sequence and a second word segmentation reply sequence;

respectively inputting the second word segmentation comment sequence and the second word segmentation reply sequence into a word vector model to obtain a comment vector sequence and a reply vector sequence;

the step of preprocessing comprises at least one of the following: filtering special characters, filtering pure numbers, filtering sentences not containing Chinese characters, filtering invalid sentences, and standardizing sentences;

and/or,

after the step of calculating the relevance probability of the comment and the reply according to the target vector, the method further comprises:

judging whether the relevance probability is greater than a preset probability, and if so, determining that the comment does not match the reply.

The word segmentation tool is an open-source word segmentation tool and includes, for example, HanLP.

In the word segmentation process, some preset professional vocabularies in the current scene can be added, for example: under the hotel scene of the OTA industry, the special words of preauthorization, credit, deduction deposit, cash register, large bedroom, account arrival, two-in-one, three-in-one, four-in-one, five-in-one, six-in-one, seven-in-one, eight-in-one, nine-in-one, ten-in-one, full two-in-one, full three-in-one, full four-in-one, full five-in-one, full six-in-one, full seven-in-one, full eight-in-one, full nine-in-one, full ten-in-one, no-in-store price increase, sitting price increase, apartment house, receiver and the like corresponding to the scene are added during word segmentation processing.
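One possible reading of forming the second word segmentation sequence is sketched below: adjacent tokens of the first sequence are merged whenever their concatenation is a preset professional term. This stands in for a segmenter's custom-dictionary feature and does not use any HanLP API. The function name, the span limit and the sample Chinese terms are assumptions made for the example.

```python
def merge_domain_terms(tokens, vocabulary, max_span=4):
    """Greedily merge adjacent tokens that together form a domain term."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest span first so longer terms win over shorter ones.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in vocabulary:
                merged.append(candidate)
                i += span
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical domain terms and a hypothetical first segmentation result.
domain_terms = {"预授权", "大床房"}
first_sequence = ["酒店", "预", "授权", "很", "快"]
second_sequence = merge_domain_terms(first_sequence, domain_terms)
```

Keeping domain terms as single tokens means the word vector model sees one vector for the term instead of vectors for unrelated fragments.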

The word vector model includes word2vec and GloVe.

The preprocessing step includes filtering special characters such as emoticons, filtering sentences that do not contain Chinese characters, collecting some meaningless and invalid sentences and filtering out sentences similar to them as measured by edit distance, and standardizing sentences through full-width to half-width conversion, traditional-to-simplified Chinese conversion, case conversion and the like.
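Part of this preprocessing can be sketched with the Python standard library. The sketch assumes Unicode NFKC normalization as the full-width to half-width conversion and uses an assumed whitelist of allowed characters to drop emoticons; traditional-to-simplified conversion and the edit-distance filter against known invalid sentences are omitted because they need external data. All names are hypothetical.

```python
import re
import unicodedata

CHINESE = re.compile(r"[\u4e00-\u9fff]")
# Assumed whitelist: Chinese characters, ASCII letters and digits,
# whitespace and common punctuation. Everything else is removed.
NOT_ALLOWED = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s,.!?;:，。！？；：、]")


def normalize(text):
    """Standardize one sentence."""
    text = unicodedata.normalize("NFKC", text)  # full-width -> half-width
    text = text.lower()                         # case conversion
    text = NOT_ALLOWED.sub("", text)            # drop emoticons and symbols
    return text.strip()


def is_valid(text):
    """Keep only non-empty sentences that contain a Chinese character.

    Pure-number sentences are rejected by the same check.
    """
    return bool(text) and CHINESE.search(text) is not None


def preprocess(texts):
    cleaned = (normalize(t) for t in texts)
    return [t for t in cleaned if is_valid(t)]
```

For instance, `preprocess(["12345", "nice hotel", "房间很好😀"])` keeps only the third sentence, with the emoticon removed.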

In the invention, preprocessing the comment and the reply improves the accuracy of their conversion into vector sequences, adding preset professional vocabulary improves the accuracy of the word segmentation step, and the preprocessing step prevents extraneous factors from degrading the accuracy of the subsequent relevance judgment.

According to the invention, comparing the predicted relevance probability with the preset probability identifies which replies fail to address the comment, which helps the merchant improve its replies and further avoids the loss of potential customers.

The invention also provides a detection system for correlation degree of comment and reply of the OTA hotel, which comprises: the system comprises an information acquisition module, a conversion module, a comment coding module, a reply coding module, a matching module, a first splicing module, a target vector acquisition module and a probability calculation module;

The information acquisition module is used for acquiring comments and replies of the OTA hotel;

the conversion module is used for converting the commentary and the replies into a commentary vector sequence and a reply vector sequence respectively;

the comment coding module is used for coding semantic relations among vectors in the comment vector sequence to obtain coded comment vectors at each moment;

the reply coding module is used for coding semantic relations among vectors in the reply vector sequence to obtain coded reply vectors at each moment;

the matching module is used for matching the coding comment vector at each moment with the coding reply vector at each moment to obtain a plurality of matching vectors, and the plurality of matching vectors form a matching vector sequence;

the first splicing module is used for capturing the relation between the matching vectors in the vector sequence and aggregating the matching vector sequence into a splicing vector according to the relation;

the target vector acquisition module is used for inputting the spliced vector to the full-connection layer to obtain target vectors, and the dimension of the target vectors is the same as the number of preset categories;

the probability calculation module is used for calculating the relevance probability of the comment and the reply according to the target vector.

The comment coding module and the reply coding module respectively code semantic relations among all words in the comment vector sequence and the reply vector sequence by using a neural network model.

Wherein, the probability calculation module calculates the relevance probability of the comment and the reply through softmax.

In the invention, the conversion module vectorizes the comment and the reply of the OTA hotel, the comment coding module and the reply coding module analyze the semantic relation between the vectorized comment and reply, and the matching module and the first splicing module analyze and compare each word in the comment and the reply against the whole sentence through machine learning. Whether a reply matches the content of the comment can thereby be calculated effectively, quickly and accurately, which helps the hotel improve existing products on the basis of valid comments, reduces cost, and, with improved recognition precision and recall, improves the merchant's service quality and thus benefits the merchant.

Preferably, the detection system further comprises: the device comprises a text similarity calculation module, an average value obtaining module and a second splicing module;

The text similarity calculation module is used for calculating the text similarity of each reply and other replies to obtain a similarity sequence;

the average value obtaining module is used for obtaining a similarity average value according to the similarity sequence;

the second splicing module is used for splicing the similarity average value serving as one dimension of the splicing vector with the splicing vector;

the target vector acquisition module is also used for inputting the spliced vector spliced with the similarity average value to a full-connection layer to obtain a target vector;

and/or,

the matching module is further configured to obtain a plurality of matching vectors according to cosine similarity between the weighted encoded comment vector of each dimension at each moment and the weighted encoded reply vector of the corresponding dimension at each moment.

The text similarity calculation module can calculate the text similarity by means such as edit distance.

The cosine similarity is calculated by the same formula as in the detection method described above.

In the invention, the text similarity calculation module compares the text of the current reply of a specific hotel with the text of its other replies, the average value obtaining module obtains the average of the resulting similarity sequence, and the second splicing module uses this average as one dimension of the splicing vector, so that the relevance probability calculated from the splicing vector is more accurate and better meets practical requirements.

Preferably, the matching module comprises a first comment matching unit and a first reply matching unit;

the first comment matching unit is used for sequentially matching the coded comment vector at the current moment with the coded reply vector at the last moment from the first comment moment to obtain a first matching vector at each moment;

the first reply matching unit is used for sequentially matching the coded reply vector at the current moment with the coded comment vector at the last moment from the first reply moment to obtain a second matching vector at each moment;

and/or,

the matching module comprises a first matching unit and a second matching unit;

the first matching unit is used for sequentially matching the coded comment vector at the current moment with the coded reply vector at the last moment from the first comment moment to obtain a first matching vector at each moment;

the second matching unit is used for sequentially matching the coded reply vector at the current moment with the coded comment vector at the last moment from the first reply moment to obtain a second matching vector at each moment;

and/or,

the matching module comprises a reply cosine computing unit, a weighted reply computing unit, a third matching unit, a comment cosine computing unit, a weighted comment computing unit and a fourth matching unit;

the reply cosine calculation unit is used for sequentially calculating the coding comment vector at the current moment and the coding reply vector at each moment from the first comment moment to obtain the cosine similarity of each reply moment;

the weighted reply calculation unit is used for calculating a weighted coding reply vector according to the cosine similarity of each reply time of the current comment time;

the third matching unit is used for matching the coded comment vector of each comment moment with the corresponding weighted coded reply vector from the first comment moment to obtain a third matching vector of each moment;

the comment cosine calculation unit is used for sequentially calculating the coding reply vector at the current moment and the coding comment vector at each moment from the first reply moment to obtain the cosine similarity of each comment moment;

the weighted comment calculation unit is used for calculating a weighted coding comment vector according to the cosine similarity of each comment moment of the current reply moment;

the fourth matching unit is used for matching the coded reply vector of each reply moment with the corresponding weighted coded comment vector from the first reply moment so as to obtain a fourth matching vector of each moment;

In the invention, starting from the first moment, the first matching unit and the second matching unit fully match the vector of the comment at the current moment against the vector of the reply at the last moment and compare the vector of the reply at the current moment with the vector of the comment at the last moment, and the third matching unit or the fourth matching unit weights the vectors in the reply or the comment by their cosine similarity to the comment or the reply. The true relation between the comment and the reply can thereby be obtained, which overcomes the prior-art defect of neglecting fine-grained relevance and yields a more faithful measure of the relevance between comment and reply.

Preferably, the coding comment vector comprises a forward coding comment vector and a reverse coding comment vector;

The first splicing module comprises: the device comprises an input unit, an intercepting unit and an aggregation unit;

the input unit is used for inputting the matching vector sequence into a bidirectional LSTM model;

the intercepting unit is used for obtaining the relation among the matching vectors at each moment according to the bidirectional LSTM model, and intercepting the comment forward relation vector, the comment backward relation vector, the reply forward relation vector and the reply backward relation vector at the last moment in the LSTM model;

the aggregation unit is used for aggregating the comment forward relation vector, the comment reverse relation vector, the reply forward relation vector and the reply reverse relation vector into the splicing vector.

In the invention, obtaining both the forward coding comment vector and the reverse coding comment vector avoids the inaccuracy of relying on a unidirectional vector only. The input unit inputs the matching vector sequence into the bidirectional LSTM model and the intercepting unit intercepts the four specific vectors, so that the complete semantics of the whole sentence are captured, the aggregation efficiency of the aggregation unit is improved, and the subsequent relevance calculation is more accurate.

Preferably, the conversion module comprises a preprocessing unit, a word segmentation processing unit, a vocabulary adding unit and a vector sequence obtaining unit;

The preprocessing unit is used for preprocessing the commentary and the replies;

the word segmentation processing unit is used for inputting the commentary and the replies to a word segmentation tool respectively to obtain a first word segmentation commentary sequence and a first word segmentation reply sequence;

the vocabulary adding unit is used for respectively adding preset professional vocabularies in the current scene to the first word segmentation comment sequence and the first word segmentation reply sequence to form a second word segmentation comment sequence and a second word segmentation reply sequence;

the vector sequence acquisition unit is used for respectively inputting the second word segmentation comment sequence and the second word segmentation reply sequence into a word vector model to obtain a comment vector sequence and a reply vector sequence;

the preprocessing comprises at least one of filtering special characters, filtering pure numbers, filtering sentences which do not contain Chinese characters, filtering invalid sentences and standardizing sentences;

and/or,

the detection system further comprises a judging module for judging whether the relevance probability is larger than the preset probability, and if yes, the comment is not matched with the reply.

In the word segmentation process, the vocabulary adding unit may add some preset professional vocabularies in the current scene, for example: under the hotel scene of the OTA industry, the professional vocabulary of preauthorization, credit check, deduction deposit, cash register, large bedroom, account check, two-in-one, three-in-one, four-in-one, five-in-one, six-in-one, seven-in-one, eight-in-one, nine-in-one, ten-in-one, full two-in-one, full three-in-one, full four-in-one, full five-in-one, full six-in-one, full seven-in-one, full eight-in-one, full nine-in-one, full ten-in-one, no-store price increase, sitting price, apartment house, receiver and the like corresponding to the scene is added during word segmentation processing.

The word vector model includes word2vec and GloVe.

The preprocessing unit performs preprocessing by filtering special characters such as emoticons, filtering sentences that do not contain Chinese characters, collecting some meaningless and invalid sentences and filtering out sentences similar to them as measured by edit distance, and standardizing sentences through full-width to half-width conversion, traditional-to-simplified Chinese conversion, case conversion and the like.

In the invention, preprocessing by the preprocessing unit improves the accuracy with which the comment and the reply are converted into vector sequences, adding preset professional vocabulary through the vocabulary adding unit improves the accuracy of word segmentation, and the preprocessing performed in the preprocessing unit prevents extraneous factors from degrading the accuracy of the subsequent relevance judgment.

In the invention, the judging module compares the predicted relevance probability with the preset probability to determine which replies fail to address the comment, which helps merchants improve their replies and further avoids the loss of potential customers.

The positive effects of the invention are as follows: the comment and the reply of the OTA hotel are vectorized, the semantic relation between the vectorized comment and reply is analyzed, and each word in the comment and the reply is analyzed and compared against the whole sentence through machine learning. Whether a reply matches the content of the comment can thereby be calculated effectively, quickly and accurately, which helps the hotel improve existing products on the basis of valid comments, reduces labor cost, and, with improved recognition precision and recall, improves the merchant's service quality and thus benefits the merchant.

Drawings

Fig. 1 is a flowchart of a method for detecting relevance between comment and reply of an OTA hotel in embodiment 1 of the invention.

Fig. 2 is a specific flowchart of step 102 in embodiment 2.

Fig. 3 is a specific flowchart of step 104 in embodiment 2.

Fig. 4 is a specific flowchart of step 105 in embodiment 2.

Fig. 5 is a schematic diagram of the detection method in embodiment 2.

Fig. 6 is a schematic module diagram of a system for detecting relevance between comment and reply of an OTA hotel in embodiment 3 of the invention.

Fig. 7 is a schematic block diagram of a conversion module in embodiment 4.

Fig. 8 is a schematic block diagram of a matching module in embodiment 4.

Fig. 9 is a schematic block diagram of a first splicing module in embodiment 4.

Detailed Description

The invention is further illustrated by means of the following examples, which are not intended to limit the scope of the invention.

Example 1

This embodiment provides a method for detecting relevance between comment and reply of an OTA hotel. As shown in fig. 1, the method comprises the following steps:

step 101, obtaining comments and replies of the OTA hotel;

step 102, converting the comment and the reply into a comment vector sequence and a reply vector sequence respectively;

step 103, coding semantic relations among vectors in the comment vector sequence to obtain a coded comment vector at each moment; coding semantic relations among vectors in the reply vector sequence to obtain a coded reply vector at each moment;

step 104, matching the coded comment vector at each moment with the coded reply vector at each moment to obtain a plurality of matching vectors, wherein the plurality of matching vectors form a matching vector sequence;

step 105, capturing the relation between the matching vectors in the vector sequence and aggregating the matching vector sequence into a spliced vector according to the relation;

step 106, inputting the spliced vector to a full-connection layer to obtain a target vector, wherein the dimension of the target vector is the same as the number of preset categories;

step 107, calculating the relevance probability of the comment and the reply according to the target vector.

In step 106, the dimension of the target vector is the same as the number of preset categories.
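Step 106 can be illustrated with a plain fully connected layer. This is a minimal sketch under stated assumptions: the weights and bias are hypothetical fixed numbers in place of trained parameters, and two preset categories are assumed.

```python
def fully_connected(spliced_vector, weights, bias):
    """Map the spliced vector to a target vector with one entry per category.

    weights holds one row per category; each row has one weight per
    dimension of the spliced vector.
    """
    if len(weights) != len(bias):
        raise ValueError("need one weight row and one bias per category")
    target = []
    for row, b in zip(weights, bias):
        if len(row) != len(spliced_vector):
            raise ValueError("weight row length must equal spliced vector length")
        target.append(sum(w * x for w, x in zip(row, spliced_vector)) + b)
    return target


# Hypothetical values: a 3-dimensional spliced vector and 2 preset categories.
target_vector = fully_connected(
    [1.0, 2.0, 3.0],
    [[0.1, 0.2, 0.3], [0.0, -1.0, 1.0]],
    [0.5, 0.0],
)
```

The number of weight rows determines the dimension of the target vector, which is why that dimension equals the number of preset categories.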

In this embodiment, the comment and the reply of the OTA hotel are vectorized, the semantic relation between the vectorized comment and reply is analyzed, and each word in the comment and the reply is analyzed and compared against the whole sentence through machine learning. Whether a reply matches the content of the comment can thereby be calculated effectively, quickly and accurately, which helps the hotel improve existing products on the basis of valid comments, reduces labor cost, and, with improved recognition precision and recall, improves the merchant's service quality and thus benefits the merchant.

Example 2

This embodiment is a further improvement on the basis of embodiment 1, specifically, as shown in fig. 2, in this embodiment, step 102 includes:

step 201, preprocessing the comment and the reply;

step 202, inputting the comment and the reply to a word segmentation tool respectively to obtain a first word segmentation comment sequence and a first word segmentation reply sequence;

step 203, adding a preset professional vocabulary in the current scene to the first word segmentation comment sequence and the first word segmentation reply sequence to form a second word segmentation comment sequence and a second word segmentation reply sequence;

step 204, respectively inputting the second word segmentation comment sequence and the second word segmentation reply sequence into a word vector model to obtain a comment vector sequence and a reply vector sequence.

In step 201, preprocessing includes filtering special characters such as emoticons, filtering sentences that contain no Chinese characters, collecting a set of boring or invalid sentences and filtering sentences whose edit-distance similarity to them is high, and normalizing sentences by full-width to half-width conversion, traditional-to-simplified Chinese conversion, case conversion and the like.
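The preprocessing of step 201 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the template list `INVALID_TEMPLATES`, the 0.8 threshold and all function names are hypothetical, and traditional-to-simplified conversion is omitted because it needs a character mapping table.

```python
import re
import unicodedata

# Hypothetical examples of low-value template sentences; not taken from the source.
INVALID_TEMPLATES = ["谢谢您的支持", "感谢您的入住"]

def normalize(text):
    """Full-width to half-width (NFKC), lower-case, and drop symbol characters."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Category "So" covers most emoji; the "C*" categories are unassigned/private code points.
    dropped = ("So", "Cs", "Co", "Cn")
    return "".join(ch for ch in text if unicodedata.category(ch) not in dropped).strip()

def has_chinese(text):
    """True if the text contains at least one CJK unified ideograph."""
    return re.search(r"[\u4e00-\u9fff]", text) is not None

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """Similarity in [0, 1] derived from the edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def preprocess(text, threshold=0.8):
    """Return the cleaned sentence, or None if the sentence is filtered out."""
    cleaned = normalize(text)
    if not has_chinese(cleaned):
        return None
    if any(edit_similarity(cleaned, t) >= threshold for t in INVALID_TEMPLATES):
        return None
    return cleaned
```

A sentence that survives is returned in normalized form; a sentence without Chinese characters, or one that is close to a known template, yields `None`.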

In step 202, the first word segmentation comment sequence and the first word segmentation reply sequence are obtained with a word segmentation tool, for example HanLP (a word segmentation tool).

In step 203, preset professional vocabulary required by the current scene may be added. For example, in the hotel scene of the OTA industry, professional vocabulary corresponding to the scene, such as preauthorization, credit check, deduction deposit, cash register, large bedroom, account check, two-in-one, three-in-one, four-in-one, five-in-one, six-in-one, seven-in-one, eight-in-one, nine-in-one, ten-in-one, full two-in-one, full three-in-one, full four-in-one, full five-in-one, full six-in-one, full seven-in-one, full eight-in-one, full nine-in-one, full ten-in-one, no-store price increase, sitting price, apartment house, receiver and the like, is added during word segmentation processing.

In step 204, the word vector model includes word2vec and GloVe (both are word vector models).

In this embodiment, preprocessing the comment and the reply improves the accuracy of the subsequent vector sequence conversion, and adding the preset professional vocabulary improves the accuracy of word segmentation, so that the preprocessing steps prevent the subsequent relevance judgment from being affected by such objective factors.

In this embodiment, by comparing the predicted relevance probability with the preset probability, it can be determined which replies are irrelevant answers, that is, replies that do not address the comment, which makes it easier for the merchant to improve and further avoids the loss of potential customers.

In this embodiment, through step 204, a comment vector sequence composed of word vectors in each sentence of comments and a reply vector sequence composed of word vectors in each sentence of replies can be obtained respectively.
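As a small illustration of step 204, the sketch below maps a segmented sentence to a vector sequence by table lookup. The three-dimensional table `EMBEDDINGS` and its values are invented for the example; a real system would load vectors trained with word2vec or GloVe, and mapping unknown words to a zero vector is an assumption.

```python
# Hypothetical toy embedding table; the values carry no meaning.
EMBEDDINGS = {
    "towel": [0.2, 0.1, 0.7],
    "very": [0.0, 0.5, 0.1],
    "dirty": [0.9, 0.3, 0.2],
}

def to_vector_sequence(tokens, table, dim=3):
    """Look up one vector per token; unknown tokens map to a zero vector (assumption)."""
    return [list(table.get(tok, [0.0] * dim)) for tok in tokens]

comment_vectors = to_vector_sequence(["towel", "very", "dirty"], EMBEDDINGS)
```

Each position of `comment_vectors` corresponds to one moment of the comment vector sequence.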

In this embodiment, in order to obtain a more accurate semantic relationship between each vector and the whole sentence between the comment vector sequence and the reply vector sequence, in step 103, the encoded comment vector includes a forward encoded comment vector and a reverse encoded comment vector, and the encoded reply vector includes a forward encoded reply vector and a reverse encoded reply vector.

In this embodiment, in order to more properly match each vector in the encoded comment vectors with the semantics of the encoded reply vectors and to more properly match each vector in the encoded reply vectors with the semantics of the encoded comment vectors, step 104 obtains a plurality of matching vectors according to the cosine similarity between the weighted encoded comment vector of each dimension at each moment and the weighted encoded reply vector of the corresponding dimension at each moment.

Wherein, the calculation formula of the cosine similarity is that
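The formula itself is not reproduced in this text. A form consistent with the symbols defined later in this embodiment (the vectors $v_1$ and $v_2$ to be compared, the dimension index $k$, and the trainable parameter $w_k$) is the per-dimension weighted cosine below. Treating $w_k$ as a weight vector applied element-wise ($\circ$) to both inputs, and the symbol $l$ for the number of dimensions, are assumptions rather than statements of the source.

```latex
m_k = \cos\left(w_k \circ v_1,\; w_k \circ v_2\right)
    = \frac{(w_k \circ v_1)\cdot(w_k \circ v_2)}
           {\lVert w_k \circ v_1 \rVert \,\lVert w_k \circ v_2 \rVert},
\qquad k = 1, \dots, l
```

Under this reading, the matching vector at one moment is $m = [m_1, \dots, m_l]$, one cosine value per dimension $k$.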

As shown in fig. 3, in this embodiment, the step 104 may specifically include the following steps:

step 1041, starting from the first comment time, sequentially matching the coded comment vector at the current time with the coded reply vector at the last time to obtain a first matching vector at each time;

step 1042, starting from the first reply moment, sequentially matching the coded reply vector at the current moment with the coded comment vector at the last moment to obtain a second matching vector at each moment;

step 1043, starting from the first comment time, calculating the coded comment vector at the current time and the coded reply vector at each time in turn to obtain the cosine similarity of each reply time;

step 1044, calculating a weighted encoding reply vector according to the cosine similarity of each reply time of the current comment time;

step 1045, starting from the first comment time, matching the coded comment vector at each comment time with the corresponding weighted coded reply vector to obtain a third matching vector at each time;

step 1046, starting from the first reply time, calculating the coded reply vector at the current time and the coded comment vector at each time in turn to obtain cosine similarity at each comment time;

Step 1047, calculating a weighted coding comment vector according to the cosine similarity of each comment time of the current reply time;

step 1048, starting from the first reply time, matching the coded reply vector at each reply time with the corresponding weighted coded comment vector to obtain a fourth matching vector at each time;

the plurality of match vectors includes the first match vector, the second match vector, the third match vector, and the fourth match vector.

Wherein, the steps 1041-1042 and the steps 1043-1048 can be performed simultaneously.

The matching method between vectors is to match by the above-mentioned cosine similarity formula, that is, after the whole flow of steps 1041-1048, a matching vector sequence composed of the cosine similarities of multiple dimensions at each moment is obtained.
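The full matching of steps 1041-1042 can be sketched as follows. This is an illustration under assumptions: the weight rows in `W` stand in for trained parameters, the function names are hypothetical, and only one encoding direction is shown.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def weighted_match(v1, v2, weights):
    """One weighted cosine per weight row w_k, giving a multi-dimensional matching vector."""
    return [
        cosine([w * a for w, a in zip(wk, v1)], [w * b for w, b in zip(wk, v2)])
        for wk in weights
    ]

def full_matching(seq_a, seq_b, weights):
    """Match every moment of seq_a against the last moment of seq_b."""
    last_b = seq_b[-1]
    return [weighted_match(h, last_b, weights) for h in seq_a]

# Toy encoded sequences and weights (hypothetical values).
comment_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reply_enc = [[0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 1.0], [1.0, 0.0]]

first_matching = full_matching(comment_enc, reply_enc, W)   # step 1041
second_matching = full_matching(reply_enc, comment_enc, W)  # step 1042
```

Each entry of `first_matching` is the matching vector of one comment moment, with one value per weight row.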

Steps 1044 and 1045 compute a weighted average of the reply vectors at all times, using the cosine similarity of each reply time. The cosine similarity is used to calculate the weights, that is, the relevance between a given word in the comment and the reply content; weighting the reply-time vectors by the cosine similarity yields the relationship between the comment and the reply. Similarly, steps 1046 and 1047 compute a weighted average of the comment vectors at all times, using the cosine similarity of each comment time.
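The weighted average described above can be sketched as follows. Normalizing the weights by the sum of the cosine similarities is an assumption (the text only says the similarities are used to calculate the weights), and the final match uses a plain cosine for brevity instead of the per-dimension weighted cosine.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def weighted_average(sims, vectors):
    """Average the vectors, weighting each one by its similarity value."""
    total = sum(sims)
    dim = len(vectors[0])
    if total == 0:
        return [0.0] * dim
    return [sum(s * v[d] for s, v in zip(sims, vectors)) / total for d in range(dim)]

def attentive_vector(h, others):
    """Weighted vector for one moment h: similarities to all moments of the other sentence."""
    sims = [cosine(h, o) for o in others]
    return weighted_average(sims, others)

# Similarities 0.1, 0.2 and 0.3 as in the worked example of this embodiment;
# the reply vectors themselves are hypothetical.
m = weighted_average([0.1, 0.2, 0.3], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Third matching value for one comment moment (plain cosine as a simplification).
h = [1.0, 0.0]
third = cosine(h, attentive_vector(h, [[1.0, 0.0], [0.0, 1.0]]))
```

Swapping the roles of the comment and the reply gives the weighted comment vector and the fourth matching vector.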

In this embodiment, starting from the first moment, the vector at the current moment of the comment is compared with the vector at the last moment of the reply, the vector at the current moment of the reply is compared with the vector at the last moment of the comment, and the reply or comment vectors are weighted by their cosine similarity with the comment or the reply. The real relationship between the comment and the reply is thereby obtained, which overcomes the defect of the prior art of neglecting detail-level correlation and yields a more faithful measure of the correlation between the comment and the reply.

In this embodiment, after obtaining a matched vector sequence, step 105 is performed, as shown in fig. 4, where step 105 specifically includes:

step 1051, inputting the matching vector sequence into a bidirectional LSTM model;

step 1052, obtaining the relationship among the matching vectors at each moment according to the bidirectional LSTM model, and intercepting the comment forward relationship vector, the comment backward relationship vector, the reply forward relationship vector and the reply backward relationship vector at the last moment in the LSTM model;

step 1053, aggregating the comment forward relation vector, the comment backward relation vector, the reply forward relation vector and the reply backward relation vector into the spliced vector.
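Steps 1051-1053 can be sketched with a small hand-written LSTM cell. This is only an illustration: the weights are random and untrained, the hidden size is arbitrary, and sharing one forward cell and one backward cell between the comment-side and reply-side matching sequences is an assumption.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LSTMCell:
    """Minimal LSTM cell with input (i), forget (f), candidate (g) and output (o) gates."""

    def __init__(self, input_size, hidden_size, seed):
        rng = random.Random(seed)
        n = input_size + hidden_size
        self.hidden_size = hidden_size
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(hidden_size)] for g in "ifgo"}
        self.b = {g: [0.0] * hidden_size for g in "ifgo"}

    def step(self, x, h, c):
        z = x + h  # concatenate the input with the previous hidden state

        def gate(name, act):
            return [act(sum(w * v for w, v in zip(row, z)) + b)
                    for row, b in zip(self.W[name], self.b[name])]

        i, f = gate("i", sigmoid), gate("f", sigmoid)
        g, o = gate("g", math.tanh), gate("o", sigmoid)
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new

    def last_state(self, seq):
        """Run over the sequence and return the hidden state at the last moment."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in seq:
            h, c = self.step(x, h, c)
        return h

def aggregate(comment_matches, reply_matches, hidden_size=4):
    """Splice the four last-moment vectors into one fixed-length vector."""
    dim = len(comment_matches[0])
    fwd = LSTMCell(dim, hidden_size, seed=0)
    bwd = LSTMCell(dim, hidden_size, seed=1)
    parts = [
        fwd.last_state(comment_matches),        # comment forward relation vector
        bwd.last_state(comment_matches[::-1]),  # comment backward relation vector
        fwd.last_state(reply_matches),          # reply forward relation vector
        bwd.last_state(reply_matches[::-1]),    # reply backward relation vector
    ]
    return [v for part in parts for v in part]

spliced = aggregate([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [[0.3, 0.2, 0.1]])
```

The length of `spliced` is four times the hidden size regardless of how many moments each matching sequence has, which is what makes it suitable as input to the full-connection layer.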

In this embodiment, obtaining both the forward encoded comment vector and the backward encoded comment vector avoids the inaccuracy of using only a unidirectional vector. Inputting the matching vector sequence into the bidirectional LSTM model and intercepting the four specific vectors not only captures the complete semantics of the whole sentence but also improves the aggregation efficiency, and the bidirectional model makes the subsequent calculation of the correlation degree more accurate.

In addition, the embodiment further comprises the following steps:

judging whether the relevance probability is larger than the preset probability, if so, the comment is not matched with the reply, and if not, the comment is matched with the reply.

For a better understanding of the present embodiment, the principle of the present embodiment will be briefly described below.

In this embodiment, as shown in fig. 5, the comment is first segmented into words and the words are converted into word vectors 301; in the same way, the reply is converted into word vectors 311. The comment word vector sequence composed of the word vectors 301 and the reply word vector sequence are then respectively input into the LSTM model 302 for encoding, so that the relation between each word vector and the whole sentence, including the forward relation and the backward relation of each word vector, is obtained. A matching layer then matches the encoded comment word vectors with the encoded reply word vectors to obtain the relevance between the word vector at each moment in the comment and the reply, and the relevance between the word vector at each moment in the reply and the comment. The matching vectors containing this relevance information are then spliced, input into the bidirectional LSTM model and aggregated into a vector of fixed length: the forward last-moment vector 304 of the comment, the backward last-moment vector 305 of the comment, the forward last-moment vector 314 of the reply and the backward last-moment vector of the reply are intercepted from the model and spliced, which improves the overall efficiency. Finally, the similarity value obtained from the comment-reply similarity calculation is spliced with the spliced vector and fed into a full-connection layer and a softmax layer to obtain the final similarity probability, from which the relation between the comment and the reply is judged.

The present embodiment will be further described with reference to the following specific example.

If the user's comment is "towel very dirty" and the reply to the comment is "need sanitary cleaning", then after the comment and the reply are obtained in step 101, they are first preprocessed, for example by removing special characters such as emoticons from the text. The content is then divided into words by the word segmentation tool in step 202: the comment content "towel very dirty" can be divided into the three words "towel, very, dirty", which form a sequence of three words, and the reply content "need sanitary cleaning" can be divided into the three words "need, sanitary, cleaning", which likewise form a sequence of three words. A preset professional vocabulary of the current scene can then be added to the related comment or reply; for example, if the current scene is an apartment house, the word "apartment house" can be added before the three words "towel, very, dirty". The comment word sequence and the reply word sequence are then respectively input into the word vector model to obtain the comment vector sequence and the reply vector sequence. For example, in this embodiment the three words "towel, very, dirty" form the vectors a_k, b_k and c_k respectively, and these three vectors form the comment vector sequence a_k b_k c_k; similarly, the reply vector sequence is A_k B_k C_k, where k represents different dimensions.

Next, the LSTM model encodes the comment vector sequence a_k b_k c_k and the reply vector sequence A_k B_k C_k respectively, to obtain the relation between each word and the whole sentence in each sentence. The principle of this step is as follows: the comment sentence is regarded as a sequence of words in order, each word is represented by a word embedding, and each position has a corresponding intermediate representation. The intermediate representation of a word represents the meaning from the beginning of the sentence up to that position, and is composed of the word embedding of the current word and the intermediate representation of the previous word; the intermediate representation of the word at the end of the sentence is finally used as the vector representation of the whole sentence. The operation is carried out in the forward and backward directions respectively, and the forward and backward vectors of the same word are fused to obtain the vector representation of the sentence at multiple moments. Similarly, the vector representation at multiple moments is obtained for the reply through the bidirectional LSTM. For example, at time b_k the forward vector obtained is a_k b_k and the backward vector obtained is b_k c_k. In this way, the coded comment vector at each moment and the coded reply vector at each moment can be obtained respectively.

Next, the coded comment vectors obtained in the previous step are matched with the coded reply vectors. In this embodiment there are two matching methods: the full matching method described in steps 1041-1042 and the attentive matching method described in steps 1043-1048. For example, at time b_k of the comment, the full matching method matches the coded comment vector at time b_k, i.e. b_k c_k, with the vector at the last moment of the reply, i.e. A_k B_k C_k. Since this embodiment performs the forward and backward operations separately, time b_k of the comment also has a forward coded vector a_k b_k, and correspondingly the last moment of the reply also has a forward coded vector, so there are essentially 4 comparison values. The attentive matching method calculates, in sequence, the coded comment vector at the current moment against the coded reply vector at each moment to obtain the cosine similarity of each reply moment. For example, starting from time a_k of the comment, the backward coded vector a_k b_k c_k is compared with the backward coded vectors of the reply, namely a_k b_k c_k with A_k B_k C_k, B_k C_k and C_k respectively (there are essentially 4 comparison values; in this embodiment the cosine similarities obtained from one direction are 0.1, 0.2 and 0.3 respectively). The cosine similarities are used to calculate weights, from which the weighted average M_k of the reply vectors is obtained; the comment vector at that moment, a_k b_k c_k, is then matched with the weighted average M_k of the reply vectors. In this embodiment, the matching method between vectors is as follows

Matching is performed, wherein v_1 and v_2 are the vectors to be compared, k denotes a dimension of the vector, and w_k is a trainable parameter that can be updated by back-propagation through the neural network. For example, when a_k b_k c_k is compared with M_k, the weighted cosine similarity of the two vectors is computed for each dimension; if the cosine similarity is 0.1 for the first dimension, 0.2 for the second dimension and 0.3 for the third dimension, a three-dimensional cosine similarity vector is formed at time a_k, and multi-dimensional vectors are likewise formed at the other moments. All the compared vectors are spliced to obtain a matching vector sequence that represents, on the basis of the cosine similarity between two vectors, the relationship between the comment and the reply. The matching vector sequence reflecting the relationship between the comment and the reply is then fed into the bidirectional LSTM model, and four vectors are intercepted from the model: the forward final vector and the backward final vector reflecting the overall semantic relationship of the comment, and the forward final vector and the backward final vector reflecting the overall semantic relationship of the reply. The four vectors are then spliced together to form the spliced vector.

In addition, in this embodiment, the text similarity between the current reply and other replies may be obtained from the edit distance, and the similarity is then spliced, as one additional dimension, onto the vector obtained in the previous step. For example, if a 400-dimensional vector is obtained in the previous step, a 401-dimensional vector is obtained after the similarity calculation in this step.
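The extra similarity dimension can be sketched as follows. Converting the edit distance into a similarity as one minus the distance divided by the longer length, and averaging over the other replies, are assumptions made for the example; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """Similarity in [0, 1]; identical strings give 1.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def append_similarity(spliced, reply, other_replies):
    """Append the mean similarity to other replies as one extra dimension."""
    sims = [edit_similarity(reply, other) for other in other_replies]
    mean = sum(sims) / len(sims) if sims else 0.0
    return spliced + [mean]

# A 400-dimensional placeholder vector becomes 401-dimensional.
extended = append_similarity([0.0] * 400, "abc", ["abc", "abd"])
```

A high value in the last dimension indicates that the reply is close to other replies, for example a template reused across many comments.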

Then the 401-dimensional vector from the previous step is fed into the full-connection layer to obtain a target vector whose dimension is the same as the number of categories in this embodiment (two categories: the first is an irrelevant answer, that is, a reply that does not address the comment, and the second is not an irrelevant answer). The probability of each category is then calculated from the target vector. For example, in this embodiment the probability of an irrelevant answer is 0.6 and the probability of not being an irrelevant answer is 0.4, so the reply in this example is an irrelevant answer.
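The last two steps can be sketched as follows. The weights are chosen by hand so that the output reproduces the 0.6 and 0.4 of the example above; the three-dimensional input and the 0.5 threshold are hypothetical, since the text only speaks of a preset probability.

```python
import math

def fully_connected(x, W, b):
    """Linear layer: one output per row of W, so len(W) equals the number of categories."""
    return [sum(w * v for w, v in zip(row, x)) + bk for row, bk in zip(W, b)]

def softmax(z):
    """Numerically stable softmax over the target vector."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def is_irrelevant(prob_irrelevant, preset_probability=0.5):
    """The reply is treated as not matching when the probability exceeds the preset value."""
    return prob_irrelevant > preset_probability

# Toy 3-dimensional input standing in for the 401-dimensional spliced vector.
x = [1.0, 0.0, 0.0]
W = [[math.log(0.6), 0.0, 0.0],   # category 1: irrelevant answer
     [math.log(0.4), 0.0, 0.0]]   # category 2: not an irrelevant answer
b = [0.0, 0.0]

target = fully_connected(x, W, b)
probs = softmax(target)
flagged = is_irrelevant(probs[0])
```

With these hand-picked weights the softmax output is approximately 0.6 and 0.4, so the reply is flagged as not matching the comment.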

Example 3

The embodiment provides a detection system for the correlation degree of comment and reply of an OTA hotel. As shown in fig. 6, the detection system comprises an information acquisition module 401, a conversion module 402, a comment coding module 403, a reply coding module 404, a matching module 405, a first splicing module 406, a target vector acquisition module 407 and a probability calculation module 408;

the information acquisition module 401 is used for acquiring comments and replies of the OTA hotel;

the conversion module 402 is configured to convert the comment and the reply into a comment vector sequence and a reply vector sequence, respectively;

the comment coding module 403 is configured to code semantic relationships among vectors in the comment vector sequence to obtain a coded comment vector at each moment;

the reply coding module 404 is configured to code semantic relationships among vectors in the reply vector sequence to obtain a coded reply vector at each moment;

the matching module 405 is configured to match the coded comment vector at each time with the coded reply vector at each time to obtain a plurality of matching vectors, where the plurality of matching vectors form a matching vector sequence;

the first stitching module 406 is configured to capture a relationship between matching vectors in the vector sequence and aggregate the matching vector sequence into a stitched vector according to the relationship;

The target vector obtaining module 407 is configured to input the spliced vector to a full connection layer to obtain a target vector, where the dimension of the target vector is the same as the number of preset categories;

the probability calculation module 408 is configured to calculate a relevance probability of the comment and the reply according to the target vector.

The comment encoding module 403 and the reply encoding module 404 encode semantic relationships between all words in the comment vector sequence and the reply vector sequence respectively using a neural network model.

Wherein the probability calculation module 408 calculates a relevance probability of the comment to the reply by softmax.

In this embodiment, the conversion module vectorizes the comment and the reply, the comment coding module and the reply coding module analyze the semantic relation between the vectorized comment and reply, and the matching module and the first splicing module use machine learning to compare each word in the comment and in the reply against the whole sentence of the other. Whether the reply to a comment matches the content of the comment can thus be calculated effectively, quickly and accurately. This helps the hotel improve its existing products according to valid comments, reduces labor cost, and, with improved recognition precision and recall, improves the service quality of the merchant and thereby brings it benefits.

Example 4

This embodiment is a further improvement on the basis of embodiment 3, specifically, as shown in fig. 7, in this embodiment, the conversion module 402 includes: a preprocessing unit 4021, a word segmentation processing unit 4022, a vocabulary adding unit 4023, and a vector sequence acquiring unit 4024;

the preprocessing unit 4021 is configured to preprocess the comment and the reply;

the word segmentation processing unit 4022 is configured to input the comment and the reply to a word segmentation tool to obtain a first word segmentation comment sequence and a first word segmentation reply sequence;

the vocabulary adding unit 4023 is configured to add a preset professional vocabulary in the current scene to the first word segmentation comment sequence and the first word segmentation reply sequence to form a second word segmentation comment sequence and a second word segmentation reply sequence;

the vector sequence obtaining unit 4024 is configured to input the second word segmentation comment sequence and the second word segmentation reply sequence to a word vector model to obtain a comment vector sequence and a reply vector sequence;

the preprocessing unit 4021 is configured to perform preprocessing by at least one means of filtering special characters, filtering pure digits, filtering sentences that do not contain chinese characters, filtering invalid sentences, and normalizing sentences;

The word segmentation tool is an open-source word segmentation tool, for example HanLP.

In the word segmentation process, the vocabulary adding unit 4023 may add some preset specialized vocabularies in the current scene, for example: under the hotel scene of the OTA industry, the professional vocabulary of preauthorization, credit check, deduction deposit, cash register, large bedroom, account check, two-in-one, three-in-one, four-in-one, five-in-one, six-in-one, seven-in-one, eight-in-one, nine-in-one, ten-in-one, full two-in-one, full three-in-one, full four-in-one, full five-in-one, full six-in-one, full seven-in-one, full eight-in-one, full nine-in-one, full ten-in-one, no-store price increase, sitting price, apartment house, receiver and the like corresponding to the scene is added during word segmentation processing.

The word vector model comprises word2vec and GloVe.

The preprocessing unit 4021 is configured to perform preprocessing by filtering special characters such as emoticons, filtering sentences that contain no Chinese characters, collecting a set of boring or invalid sentences and filtering sentences whose edit-distance similarity to them is high, and normalizing sentences by full-width to half-width conversion, traditional-to-simplified Chinese conversion, case conversion and the like.

In this embodiment, processing the comment and the reply with the preprocessing unit improves the accuracy of the subsequent vector sequence conversion, and adding the preset professional vocabulary with the vocabulary adding unit improves the accuracy of word segmentation, so that the preprocessing performed in the preprocessing unit prevents the subsequent relevance judgment from being affected by such objective factors.

In this embodiment, the vector sequence acquisition unit 4024 may obtain the comment vector sequence composed of the word vector in each sentence of the comment and the reply vector sequence composed of the word vector in each sentence of the reply, respectively.

In this embodiment, in order to obtain a more accurate semantic relationship between each vector and the whole sentence between the comment vector sequence and the reply vector sequence, the encoded comment vector includes a forward encoded comment vector and a reverse encoded comment vector, and the encoded reply vector includes a forward encoded reply vector and a reverse encoded reply vector.

In this embodiment, in order to more properly match each vector of the encoded comment vectors with the semantics of the encoded reply vector and to more properly match each vector of the encoded reply vectors with the semantics of the encoded comment vector, the matching module 405 obtains a plurality of matching vectors according to the cosine similarity between the weighted encoded comment vector of each dimension at each time and the weighted encoded reply vector of the corresponding dimension at each time.

Wherein, the calculation formula of the cosine similarity is that

Wherein v_1 and v_2 are the vectors to be compared, k denotes a dimension of the vector, and w_k is a trainable parameter that can be updated by back-propagation through the neural network.

As shown in fig. 8, in this embodiment, the matching module 405 includes: a first matching unit 4051, a second matching unit 4052, a reply cosine calculation unit 4053, a weighted reply calculation unit 4054, a third matching unit 4055, a comment cosine calculation unit 4056, a weighted comment calculation unit 4057, and a fourth matching unit 4058;

the first matching unit 4051 is configured to sequentially match, from the first comment time, the coded comment vector at the current time with the coded reply vector at the last time to obtain a first matching vector at each time;

the second matching unit 4052 is configured to sequentially match, from the first reply time, the coded reply vector at the current time with the coded comment vector at the last time to obtain a second matching vector at each time;

the reply cosine calculation unit 4053 is configured to sequentially calculate, from the first comment time, the encoded comment vector at the current time and the encoded reply vector at each time to obtain a cosine similarity at each reply time;

The weighted reply calculation unit 4054 is configured to calculate a weighted encoded reply vector according to the cosine similarity of each reply time at the current comment time;

the third matching unit 4055 is configured to match the encoded comment vector at each comment time with the corresponding weighted encoded reply vector from the first comment time to obtain a third matching vector at each time;

the comment cosine calculating unit 4056 is configured to sequentially calculate, from the first reply time, the coded reply vector at the current time and the coded comment vector at each time to obtain a cosine similarity at each comment time;

the weighted comment calculation unit 4057 is configured to calculate a weighted encoding comment vector according to the cosine similarity of each comment time at the current reply time;

the fourth matching unit 4058 is configured to match the coded reply vector at each reply time with the corresponding weighted coded comment vector from the first reply time to obtain a fourth matching vector at each time;

the plurality of match vectors includes the first match vector, the second match vector, the third match vector, and the fourth match vector.

The matching method between vectors is to match through the cosine similarity formula, that is, after passing through all modules 4051-4058, a matching vector sequence composed of the cosine similarities of multiple dimensions at each moment is obtained.

The reply cosine calculation unit 4053 and the weighted reply calculation unit 4054 compute, at each comment time, a weighted average of the reply vectors at all times using the cosine similarity of each reply time. The cosine similarity is used to calculate the weights, that is, the relevance between a given word in the comment and the reply content; weighting the reply-time vectors by the cosine similarity yields the relationship between the comment and the reply. Similarly, the comment cosine calculation unit 4056 and the weighted comment calculation unit 4057 compute, at each reply time, a weighted average of the comment vectors at all times using the cosine similarity of each comment time.

In this embodiment, starting from the first moment, the first matching unit and the second matching unit compare the vector at the current moment of the comment with the vector at the last moment of the reply, and the vector at the current moment of the reply with the vector at the last moment of the comment; the third matching unit and the fourth matching unit weight the reply or comment vectors by their cosine similarity with the comment or the reply. The real relationship between the comment and the reply is thereby obtained, which overcomes the defect of the prior art of neglecting detail-level correlation and yields a more faithful measure of the correlation between the comment and the reply.

In this embodiment, after obtaining a matched vector sequence, the first splicing module 406 is called, as shown in fig. 9, where the first splicing module 406 includes an input unit 4061, an interception unit 4062, and an aggregation unit 4063.

The input unit 4061 is configured to input the matching vector sequence into a bidirectional LSTM model;

the intercepting unit 4062 is configured to obtain a relationship between the plurality of matching vectors at each time according to the bidirectional LSTM model, and intercept a comment forward relationship vector, a comment backward relationship vector, a reply forward relationship vector, and a reply backward relationship vector at a last time in the LSTM model;

the aggregation unit 4063 is configured to aggregate the comment forward relation vector, the comment backward relation vector, the reply forward relation vector, and the reply backward relation vector into the spliced vector.

In this embodiment, obtaining both the forward encoded comment vector and the backward encoded comment vector avoids the inaccuracy of using only a unidirectional vector. The input unit inputs the matching vector sequence into the bidirectional LSTM model and the intercepting unit intercepts the four specific vectors, which not only captures the complete semantics of the whole sentence but also improves the aggregation efficiency of the aggregation unit, and the bidirectional model makes the subsequent calculation of the correlation degree more accurate.

In addition, the embodiment further includes a judging unit, configured to judge whether the relevance probability is greater than the preset probability, if yes, the comment is not matched with the reply, and if no, the comment is matched with the reply.

While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.

Claims

1. The method for detecting the correlation degree between the comment and the reply of the OTA hotel is characterized by comprising the following steps:

obtaining comments and replies of the OTA hotel;

calculating the relevance probability of the comment and the reply according to the target vector;

the step of matching the coded comment vector at each time with the coded reply vector at each time to obtain a plurality of matching vectors further includes:

2. The detection method according to claim 1, wherein,

before the step of inputting the spliced vector to the full connection layer to obtain the target vector, the method further comprises the following steps:

obtaining a similarity average value according to the similarity sequence;

and/or,

3. The detection method according to claim 1, wherein,

in the step of encoding semantic relationships among vectors in the comment vector sequence to obtain encoded comment vectors at each time,

inputting the matching vector sequence into a bidirectional LSTM model;

4. The detection method according to claim 1, wherein,

the step of converting the comment and the reply into a comment vector sequence and a reply vector sequence respectively comprises the following steps:

preprocessing the comment and the reply;

and/or,

the step of calculating the relevance probability of the comment and the reply according to the target vector further comprises the following steps:

judging whether the relevance probability is greater than a preset probability, and if so, determining that the comment does not match the reply.

5. A system for detecting the correlation degree between a comment and a reply of an OTA hotel, the system comprising: an information acquisition module, a conversion module, a comment coding module, a reply coding module, a matching module, a first splicing module, a target vector acquisition module and a probability calculation module;

the probability calculation module is used for calculating the relevance probability of the comment and the reply according to the target vector;

the matching module comprises a first matching unit and a second matching unit;

the matching module further comprises a reply cosine computing unit, a weighted reply computing unit, a third matching unit, a comment cosine computing unit, a weighted comment computing unit and a fourth matching unit;

6. The detection system of claim 5, wherein the detection system further comprises: a text similarity calculation module, an average value obtaining module and a second splicing module;

and/or,

7. The detection system of claim 5, wherein,

8. The detection system according to claim 5, wherein the conversion module includes a preprocessing unit, a word segmentation processing unit, a vocabulary addition unit, and a vector sequence acquisition unit;

the preprocessing unit is used for preprocessing by at least one of the following means: filtering special characters, filtering pure numbers, filtering sentences that do not contain Chinese characters, filtering invalid sentences, and standardizing sentences;

and/or,

the detection system further comprises a judging module for judging whether the relevance probability is greater than a preset probability, and if so, determining that the comment does not match the reply.