CN110909545A

CN110909545A - Black tour guide detection method based on gradient boosting algorithm

Info

Publication number: CN110909545A
Application number: CN201911173486.8A
Authority: CN
Inventors: 詹瑾瑜; 余佳雨; 江维; 李响; 杨瑞; 刘昌澍; 李博智; 蔡玉舒; 周巧瑜
Original assignee: Division Big Data Research Institute Co Ltd; University of Electronic Science and Technology of China
Current assignee: Division Big Data Research Institute Co Ltd; University of Electronic Science and Technology of China
Priority date: 2019-11-26
Filing date: 2019-11-26
Publication date: 2020-03-24

Abstract

The invention discloses a black tour guide detection method based on a gradient boosting algorithm, which is applied to the field of data detection and addresses the supervision lag in the existing tourism industry. A word vector model is first obtained through word embedding training on crawled news data; a gradient boosting algorithm is then trained on the basis of this word vector model to obtain a black tour guide category prediction model; finally, a complaint text is input into the prediction model to obtain a predicted category. Compared with existing manual data detection, detection efficiency is significantly improved.

Description

Black tour guide detection method based on gradient boosting algorithm

Technical Field

The invention belongs to the field of big data processing, and particularly relates to a data detection technology based on a gradient boosting algorithm.

Background

Recently, news reports of tourists being fleeced and of black shops and black tour guides in the domestic tourism market have been frequent. They expose problems such as malicious fraud and reflect the lagging supervision of the existing tourism market. As machine learning matures, applying it to the collection, cleaning, and analysis of mass data in order to solve the supervision lag problem has become an inevitable trend, and intelligent supervision of the tourism market has become an active research topic.

Black tour guide detection technology, meaning detection based on the content of complaint texts, has emerged to address these problems, but the prior art lacks an effective detection method.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a black tour guide detection method based on a gradient boosting algorithm, which uses the gradient boosting algorithm to classify a large number of texts into several predefined categories, thereby effectively promoting the tourism market.

The technical scheme adopted by the invention is as follows: a black tour guide detection method based on a gradient boosting algorithm comprises the following steps:

A. acquiring website news URL data, and obtaining a word vector model through word embedding training;

B. based on the word vector model of step A, training with a gradient boosting algorithm to obtain a black tour guide category prediction model;

C. inputting a complaint text into the black tour guide category prediction model obtained in step B to obtain a predicted category.

Further, step A comprises the following substeps:

A1, initiating a request to a travel news website to obtain news URL data;

A2, crawling news content from the news URL data;

A3, performing word segmentation on the news content obtained in step A2 to obtain a word segmentation corpus;

A4, training on the word segmentation corpus to obtain a word vector model.

Further, step A1 specifically includes: simulating an HTTP request with Postman, setting the request parameters so that all results are returned, setting the content type to application/x-www-form-urlencoded, parsing the returned results, and storing each day's news URL data line by line.

Further, step A2 specifically includes: reading a news URL and initiating an HTTP request, parsing the returned HTML content, obtaining the content of the title tag and the content of the body-text tag respectively, storing the title content directly as one line, splitting the body content into sentences at periods, and writing them to a file line by line.

Further, step B comprises the following substeps:

B1, obtaining complaint texts and dividing them into two parts, one used as a training set and the other as a test set;

B2, reading the local word vector model file and parsing it line by line, using each word as a key and its word vector as the value, and storing them in a dictionary variable to obtain a word embedding dictionary;

B3, converting each sentence in the training set and the test set into a training sentence vector and a test sentence vector, respectively, using the word embedding dictionary;

B4, training with a gradient boosting algorithm on the training sentence vectors to generate a black tour guide category prediction model;

B5, verifying the effect of the trained model with the test data, the evaluation indexes comprising precision, recall, and F1 score.

Further, step C comprises the following substeps:

C1, reading the local word vector model file and parsing it line by line, using each word as a key and its word vector as the value, and storing them in a dictionary variable to obtain a word embedding dictionary;

C2, converting the input text into a sentence vector using the word embedding dictionary;

C3, inputting the sentence vector obtained in step C2 into the black tour guide category prediction model trained in step B, and outputting the predicted category.

The beneficial effects of the invention are as follows: the black tour guide prediction model obtained by training with the gradient boosting algorithm can effectively identify black tour guide categories, and it markedly improves the efficiency of classifying tourism data compared with the existing practice of processing complaint texts manually.

Drawings

FIG. 1 is a schematic flow chart of a black tour guide detection method based on a gradient boosting algorithm according to the present invention.

FIG. 2 is a flow diagram of the crawler module of the present invention.

FIG. 3 is a flow diagram of the word vector model module of the present invention.

FIG. 4 is a flow diagram of the word segmentation module of the present invention.

FIG. 5 is a flow diagram of the sentence vector conversion module of the present invention.

FIG. 6 is a flow diagram of a predictive model training module of the present invention.

FIG. 7 is a flow chart of the gradient boost algorithm module of the present invention.

FIG. 8 is a flow diagram of the prediction module of the present invention.

FIG. 9 is a flow diagram of the space vector model construction module of the present invention.

Detailed Description

In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the present invention will be further explained with reference to the accompanying drawings.

The technical scheme of the invention is as follows. As shown in FIG. 1, a black tour guide detection method based on a gradient boosting algorithm mainly comprises three large flow modules (a space vector model construction module, a prediction model training module, and a black tour guide category prediction module) and two reusable tool modules (a word segmentation module, shown in FIG. 4, and a sentence vector conversion module, shown in FIG. 5). The space vector model construction module is itself composed of smaller modules: a crawler module, a word segmentation module, and a word vector model module. The prediction model training module contains the core algorithm module, namely the gradient boosting algorithm module.

The invention classifies tour guides with the following five kinds of characteristics or behaviors as black tour guides: 1) forced shopping/consumption; 2) modifying/terminating the trip; 3) catering/accommodation violations; 4) the tour guide has no qualification/tour guide certificate; 5) assault. When black tour guide prediction is performed on an input complaint text, category matching is carried out according to this classification, and the predicted categories are output ranked by probability.

As shown in FIG. 9, step A of the present invention trains a space vector model by word embedding.

Word embedding, also called neural-network-based distributed representation, is a word vector representation technique that uses a neural network to model the context and the relationship between the context and the target word. Because neural networks are flexible, they can represent complex contexts. If n-grams, which carry word order information, are used as the context, the total number of possible n-grams grows exponentially with n and the curse of dimensionality is encountered. When a neural network represents an n-gram, however, the n words can be combined in some manner so that the number of parameters grows only linearly. With this advantage, a neural network model can model more complex contexts and include richer semantic information in the word vectors.
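The contrast between exponential and linear growth can be made concrete with a small calculation. The sketch below is illustrative only: the network shape (n word embeddings concatenated and fed to one hidden layer) and all sizes are assumptions chosen for the example, not details taken from the invention.

```python
def ngram_table_size(vocab_size: int, n: int) -> int:
    """Number of distinct n-grams over a vocabulary: grows exponentially in n."""
    return vocab_size ** n


def concat_network_params(vocab_size: int, n: int, dim: int, hidden: int) -> int:
    """Parameter count of a hypothetical network that concatenates n word embeddings.

    embedding table (vocab_size * dim) + hidden weights (n * dim * hidden)
    + hidden biases (hidden). Only the middle term depends on n, and linearly.
    """
    return vocab_size * dim + n * dim * hidden + hidden


# Assumed sizes: 20,000 words, 100-dimensional vectors, 128 hidden units.
for n in (1, 2, 3, 4):
    print(n, ngram_table_size(20_000, n), concat_network_params(20_000, n, 100, 128))
```

Each extra context word multiplies the n-gram count by the vocabulary size, while it adds only a fixed block of weights to the network.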

Step A crawls news texts and then constructs a space vector model through the word vector model, and comprises the following steps:

Step A1: as shown in FIG. 2, a request is initiated to a travel news website to obtain a series of news URL data. An HTTP request is simulated with Postman, the request parameters are set so that all results are returned, the content type is application/x-www-form-urlencoded, the returned results are parsed, and each day's news URLs are stored line by line.
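Outside Postman, the same kind of request could be sketched with the Python standard library. This is a minimal sketch under stated assumptions: the endpoint, the parameter names, and the JSON response shape (an array of objects with a `url` field) are all hypothetical, and the content type is assumed to be the standard `application/x-www-form-urlencoded`. The request is only built here, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_news_list_request(endpoint: str, params: dict) -> Request:
    """Build (but do not send) a form-encoded POST request for the news list."""
    body = urlencode(params).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(endpoint, data=body, headers=headers, method="POST")


def urls_to_lines(response_text: str, key: str = "url") -> str:
    """Extract one URL per line from a JSON array of objects (assumed response shape)."""
    items = json.loads(response_text)
    return "\n".join(item[key] for item in items if item.get(key))


# Hypothetical endpoint and parameters, for illustration only.
request = build_news_list_request(
    "https://news.example.com/api/list", {"page": 1, "size": "all"}
)
print(request.get_method(), request.data)
print(urls_to_lines('[{"url": "https://news.example.com/a"}, {"url": "https://news.example.com/b"}]'))
```

Passing the built request to `urllib.request.urlopen` would send it; the returned text could then go through `urls_to_lines` and be written to a file.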

Step A2: news content is crawled through the news URLs and stored in a one-sentence-per-line format. A news URL is read and an HTTP request is initiated, the returned HTML content is parsed, the content of the title tag and the content of the body-text tag are obtained respectively, the title content is stored directly as one line, and the body content is split into sentences at periods and written to a file line by line.
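The parsing and sentence-splitting part of step A2 might look like the following sketch. It assumes that the title lives in a `<title>` element and the body text in `<p>` elements, and that sentences end with the Chinese full stop `。`; the actual tags used by the crawled site are not given in the text.

```python
from html.parser import HTMLParser


class NewsParser(HTMLParser):
    """Collect text inside <title> and <p> elements (assumed page layout)."""

    def __init__(self):
        super().__init__()
        self._tag = None
        self.title = ""
        self.body = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "title":
            self.title += data.strip()
        elif self._tag == "p":
            self.body += data.strip()


def page_to_lines(html: str) -> list:
    """Return the title as the first line, then one body sentence per line."""
    parser = NewsParser()
    parser.feed(html)
    parser.close()
    sentences = [s.strip() for s in parser.body.split("。") if s.strip()]
    return [parser.title] + sentences


sample = "<html><head><title>旅游新闻</title></head><body><p>导游强制购物。游客投诉。</p></body></html>"
print("\n".join(page_to_lines(sample)))
```

Writing the returned list to a file with one entry per line gives the one-sentence-per-line corpus that step A3 consumes.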

Step A3: the use environment of the invention is Chinese. Unlike English, in which words are naturally separated by spaces, Chinese text has no word delimiters, so the Chinese travel news content must be segmented into words. As shown in FIG. 4, the specific steps are as follows. Efficient word graph scanning is implemented on the basis of a prefix dictionary, generating a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in the sentence: a prefix tree is generated from the twenty thousand words trained in the jieba open source library, and the possible segmentations of the sentence are generated against this prefix tree. Dynamic programming is then used to search for the maximum-probability path, finding the most probable segmentation based on word frequency. For words that cannot be found in the prefix tree, an HMM model based on the word-forming capability of Chinese characters is used: each character is labeled with one of four states, B (begin), E (end), M (middle), or S (single-character word), the Viterbi algorithm yields the BEMS sequence with the maximum probability, and the sentence is recombined into words that start at B and end at E, giving the final segmentation result.
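The DAG construction and the maximum-probability path search can be illustrated on a toy dictionary. This is a simplified sketch, not the jieba implementation: the word frequencies are made up, a plain dictionary stands in for the prefix tree, and the HMM/Viterbi handling of unknown words is omitted (unknown characters simply become single-character words).

```python
import math


def build_dag(sentence: str, freq: dict) -> dict:
    """For each start index, list every end index that forms a dictionary word."""
    n = len(sentence)
    dag = {}
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in freq]
        dag[i] = ends or [i]  # unknown character: fall back to a single-character word
    return dag


def segment(sentence: str, freq: dict) -> list:
    """Pick the segmentation with the highest total log probability."""
    n = len(sentence)
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sentence, freq)
    route = {n: (0.0, 0)}
    # Dynamic programming from the end of the sentence back to the start.
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


# Toy frequencies, invented for this example.
FREQ = {"导游": 50, "强制": 30, "购物": 40, "导": 5, "游": 5,
        "强": 3, "制": 3, "购": 4, "物": 4}
print(segment("导游强制购物", FREQ))
```

Because multi-character words carry much higher frequencies than their individual characters, the path through whole words wins over the character-by-character path.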

Step A4: a word vector model is trained on the word segmentation corpus obtained in the previous step. As shown in FIG. 3, a fully connected neural network with only one hidden layer is constructed. First, a sentence is fed to the input layer, where each word is converted into a one-hot vector. Next, the hidden layer applies a simple linear mapping, with no nonlinear activation function. Finally, the third layer is a classifier that uses Softmax regression and outputs the probability corresponding to each word. The trained word vector model is then saved as a local file in a format of a line of words plus a line of word vectors.
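The forward pass of such a network can be sketched in pure Python. The sketch makes several assumptions: the weights are random rather than trained, the training loop (gradient descent on the Softmax cross-entropy) is left out, and the output format shown (one word followed by its vector on each line) is only one possible reading of the file format described above.

```python
import math
import random


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(word_index, w_in, w_out):
    """One-hot input -> linear hidden layer -> Softmax over the vocabulary."""
    # Multiplying a one-hot vector by w_in just selects one row: the word's vector.
    hidden = w_in[word_index]  # linear mapping, no activation function
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return softmax(scores)


def format_vectors(vocab, w_in):
    """Render one 'word v1 v2 ...' line per word (assumed file layout)."""
    return [word + " " + " ".join(f"{v:.4f}" for v in vec)
            for word, vec in zip(vocab, w_in)]


vocab = ["导游", "强制", "购物"]
dim = 2
rng = random.Random(0)
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]   # vocab x dim
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]  # vocab x dim
probs = forward(0, w_in, w_out)
print(probs)
print("\n".join(format_vectors(vocab, w_in)))
```

After training, the rows of `w_in` are the word vectors; the output layer is only needed during training.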

As shown in FIG. 6, step B of the present invention trains a black tour guide category prediction model using a gradient boosting algorithm. Specifically, a trained model is obtained by applying the gradient boosting algorithm to black tour guide complaint texts (the complaint texts are crawled from the comments of travel websites using the crawler module), comprising the following steps:

Step B1: the complaint texts are divided in a 7:3 ratio, with 70% used as the training set and 30% as the test set; dividing in this proportion allows the test set to evaluate the performance of the model correctly.
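A 70/30 split of this kind can be sketched as follows. Shuffling before the split and the fixed random seed are assumptions added for reproducibility; the text does not say whether the complaint texts are shuffled.

```python
import random


def split_train_test(samples, train_ratio=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


complaints = [f"complaint {i}" for i in range(10)]  # placeholder texts
train_set, test_set = split_train_test(complaints)
print(len(train_set), len(test_set))
```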

Step B2: the word vector model is loaded. The local word vector model file is read and parsed line by line, with each word used as a key and its word vector as the value, and the pairs are stored in a dictionary variable.
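Loading the file into a dictionary might be sketched as below. The layout is an assumption: each line is taken to hold a word followed by its space-separated vector components. If the file instead alternates word lines and vector lines, the parsing would need to read lines in pairs.

```python
def parse_word_vectors(lines):
    """Build a word -> vector dictionary from 'word v1 v2 ...' lines (assumed layout)."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table


embeddings = parse_word_vectors(["导游 0.1 0.2", "", "购物 0.3 0.4"])
print(embeddings)
```

The function accepts any iterable of lines, so an open text file can be passed directly, for example `parse_word_vectors(open(path, encoding="utf-8"))` with a hypothetical `path`.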

Step B3: each sentence of the training set and the test set is converted into a training sentence vector and a test sentence vector, respectively, using the word embedding dictionary. As shown in FIG. 5, a sentence is converted into a sentence vector by first segmenting it into words (the detailed steps are the same as in step A3) and then using each word as a key to look up its vector in the dictionary. The word vectors obtained are summed and finally divided by the number of words to eliminate the bias introduced by sentence length.
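The averaging step can be sketched as follows. How words missing from the dictionary are treated is not specified in the text; this sketch assumes they are skipped and that the sum is divided by the number of words actually found.

```python
def sentence_vector(words, embeddings, dim):
    """Average the vectors of the words found in the embedding dictionary."""
    total = [0.0] * dim
    count = 0
    for word in words:
        vec = embeddings.get(word)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are ignored
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total  # no known words: return the zero vector
    return [t / count for t in total]


toy = {"a": [1.0, 3.0], "b": [3.0, 5.0]}
print(sentence_vector(["a", "b", "unknown"], toy, 2))
```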

Step B4: a gradient boosting algorithm is trained on the training sentence vectors to generate the black tour guide category prediction model.

As shown in FIG. 7, the gradient boosting algorithm is implemented as follows. First, a regression tree class is created. The tree stores a root node, the tree height, and its rules; each node stores a predicted value, a left child, a right child, a feature, and a split point. Split points are computed as follows: the MSE after splitting is calculated from the independent variable col, the dependent variable label, and the split point split; all distinct values in a given feature column are traversed, and the value with the minimum MSE is taken as the optimal split point; None is returned if the feature has no distinct elements. The best feature is selected by traversing all features, computing the MSE at each feature's optimal split point, and finding the feature with the minimum MSE together with its split point and the mean values of the left and right child nodes; if no feature has distinct elements, None is returned. Rules: all rules of the regression tree are expressed in text using a queue and breadth-first search, so that the whole tree can be inspected.
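The split search described above can be sketched as follows. The names `col`, `label`, and `split` follow the text; the function names, the use of `x < split` as the left branch, and the choice to skip the smallest value as a candidate (it would leave the left side empty) are assumptions of this sketch.

```python
def split_stats(col, label, split):
    """Return (MSE after splitting, left mean, right mean) for one split point."""
    left = [y for x, y in zip(col, label) if x < split]
    right = [y for x, y in zip(col, label) if x >= split]
    left_mean = sum(left) / len(left)
    right_mean = sum(right) / len(right)
    sse = sum((y - left_mean) ** 2 for y in left) + sum((y - right_mean) ** 2 for y in right)
    return sse / len(label), left_mean, right_mean


def best_split_for_column(col, label):
    """Try every distinct value of the column; None if there is nothing to split on."""
    uniques = sorted(set(col))
    if len(uniques) < 2:
        return None
    best = None
    for split in uniques[1:]:  # the smallest value would leave the left side empty
        mse, left_mean, right_mean = split_stats(col, label, split)
        if best is None or mse < best[0]:
            best = (mse, split, left_mean, right_mean)
    return best


def best_feature(rows, label):
    """Return (mse, feature index, split, left mean, right mean), or None."""
    best = None
    for f in range(len(rows[0])):
        result = best_split_for_column([row[f] for row in rows], label)
        if result is not None and (best is None or result[0] < best[0]):
            best = (result[0], f) + result[1:]
    return best


rows = [[1, 5], [2, 5], [3, 5], [4, 5]]
labels = [0, 0, 1, 1]
print(best_feature(rows, labels))
```

Growing a full regression tree would apply `best_feature` recursively to the left and right subsets until a depth or sample limit is reached.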

Next, the boosting model is initialized, storing the regression trees, the learning rate, the initial predicted value, and the transformation function. For each regression tree, the leaf node to which each training sample belongs is determined, all leaf nodes of the tree are collected, and the leaf nodes together with their corresponding training samples are stored in a dictionary. The predicted value of each leaf node is then recomputed and updated, and the residuals for the current round are calculated. The function for round m is obtained from the predicted values and residuals of round m-1 and is thereby further optimized.
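A compact sketch of such a boosting loop is given below for a single binary category. Several simplifications are assumptions of the sketch rather than details from the text: each tree is a depth-1 stump on a single numeric feature, the loss is the logistic loss, and each leaf value is updated with the usual Newton-style ratio sum(residual) / sum(p * (1 - p)).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sse(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_stump(xs, residuals):
    """Find the threshold that best separates the residuals (minimum squared error)."""
    best_split, best_error = None, float("inf")
    for split in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        error = sse(left) + sse(right)
        if error < best_error:
            best_split, best_error = split, error
    return best_split


def leaf_value(residuals, probs):
    """Newton-style leaf update for the logistic loss."""
    denom = sum(p * (1.0 - p) for p in probs)
    return sum(residuals) / denom if denom > 1e-12 else 0.0


def train(xs, ys, rounds=20, learning_rate=0.5):
    positive = sum(ys) / len(ys)  # assumes both classes are present
    init = math.log(positive / (1.0 - positive))  # initial predicted value (log-odds)
    scores = [init] * len(xs)
    trees = []
    for _ in range(rounds):
        probs = [sigmoid(s) for s in scores]
        residuals = [y - p for y, p in zip(ys, probs)]  # residuals of the previous round
        split = fit_stump(xs, residuals)
        if split is None:
            break
        values = {}
        for is_left in (True, False):
            idx = [i for i, x in enumerate(xs) if (x < split) == is_left]
            values[is_left] = leaf_value([residuals[i] for i in idx], [probs[i] for i in idx])
        trees.append((split, values[True], values[False]))
        scores = [s + learning_rate * values[x < split] for s, x in zip(scores, xs)]
    return init, learning_rate, trees


def predict_proba(x, model):
    init, learning_rate, trees = model
    score = init + sum(learning_rate * (lv if x < split else rv) for split, lv, rv in trees)
    return sigmoid(score)


model = train([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
print(predict_proba(1, model), predict_proba(6, model))
```

With real sentence vectors, each stump would be replaced by a regression tree that searches over all vector dimensions.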

Step B5: the test data are used to validate the trained model; the evaluation metrics are precision, recall, and F1 score. Precision is P = TP / (TP + FP), the proportion of correctly retrieved results (TP) among all results actually retrieved (TP + FP). Recall is R = TP / (TP + FN), the proportion of correctly retrieved results (TP) among all results that should have been retrieved (TP + FN). The F1 score is the harmonic mean of precision and recall, F1 = 2 × P × R / (P + R); it combines P and R, and a higher F1 indicates a more effective method. This test verifies the validity of the model; a poor result may indicate that the training data are insufficient, in which case more training data should be added.

The evaluation results are shown in Table 1. The method predicts well in the categories "forced shopping/consumption" and "modifying/terminating the trip", with recognition errors of no more than 0.1 in both; these are also the most common complaint categories, so the scheme of the invention can identify black tour guides quickly and accurately.
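The three metrics can be computed for one category as in the sketch below. The function names and the choice to return 0.0 when a denominator is zero are assumptions made for the example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(y_true, y_pred, positive):
    """Treat one category as positive and compute P, R, and F1 for it."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


print(precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0], positive=1))
print(round(f1_from(0.91, 0.93), 2))
```

As a consistency check, the precision and recall reported in Table 1 (0.91 and 0.93) do yield the reported F1 of 0.92 under this formula.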

TABLE 1 Model evaluation data

Category	Precision	Recall	F1-score
Forced shopping/consumption	0.91	0.93	0.92
Modifying/terminating the trip	0.90	0.90	0.90

As shown in FIG. 8, step C of the present invention takes a complaint text as input and predicts the black tour guide category. Specifically, the complaint category is predicted with the trained model obtained in step B, comprising the following steps:

Step C1: the word vector model is loaded. The local word vector model file is read and parsed line by line, with each word used as a key and its word vector as the value, and the pairs are stored in a dictionary variable.

Step C2: the input text is converted into a sentence vector using the word embedding dictionary. As shown in FIG. 5, a sentence is converted into a sentence vector by first segmenting it into words (the detailed steps are the same as in step A3) and then using each word as a key to look up its vector in the dictionary variable. The word vectors obtained are summed and finally divided by the number of words to eliminate the bias introduced by sentence length.

Step C3: the black tour guide category prediction model is used to predict the violation category, and the predicted category is output. Prediction uses the mean value of the optimal partition region: the initial value and the predicted values of the m-1 regression trees are summed, and the Sigmoid of the sum gives the prediction y. The violation categories in this step are: 1) forced shopping/consumption; 2) modifying/terminating the trip; 3) catering/accommodation violations; 4) the tour guide has no qualification/tour guide certificate; 5) assault.
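The scoring and ranking in step C3 might be sketched as follows. The text does not say how the five categories are combined, so this sketch assumes one binary model per category (one-vs-rest), each represented as an initial value, a learning rate, and a list of tree functions; the stand-in trees are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def category_probability(vector, init, learning_rate, trees):
    """Initial value plus the scaled tree outputs, passed through the Sigmoid."""
    score = init + sum(learning_rate * tree(vector) for tree in trees)
    return sigmoid(score)


def rank_categories(vector, models):
    """Return (category, probability) pairs sorted from most to least likely."""
    scored = [(category_probability(vector, *model), name) for name, model in models.items()]
    return [(name, prob) for prob, name in sorted(scored, reverse=True)]


def toy_tree(vector):
    """Hypothetical stand-in for a trained regression tree: returns a leaf mean."""
    return 1.0 if vector[0] > 0 else -1.0


def always_negative(vector):
    """Second hypothetical tree that always lands in a negative leaf."""
    return -1.0


models = {
    "forced shopping/consumption": (0.0, 0.5, [toy_tree, toy_tree]),
    "assault": (0.0, 0.5, [always_negative]),
}
ranking = rank_categories([1.0], models)
print(ranking)
```

The first entry of the ranking is the predicted category; the remaining entries give the probability ordering of the other categories.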

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims

1. A black tour guide detection method based on a gradient boosting algorithm, characterized by comprising the following steps:

2. The black tour guide detection method based on a gradient boosting algorithm according to claim 1, wherein step A comprises the following substeps:

A1, initiating a request to a travel news website to obtain news URL data;

A2, crawling news content from the news URL data;

3. The black tour guide detection method based on a gradient boosting algorithm according to claim 2, wherein step A1 specifically comprises: simulating an HTTP request with Postman, setting the request parameters so that all results are returned, setting the content type to application/x-www-form-urlencoded, parsing the returned results, and storing each day's news URL data line by line.

4. The black tour guide detection method based on a gradient boosting algorithm according to claim 2, wherein step A2 specifically comprises: reading a news URL and initiating an HTTP request, parsing the returned HTML content, obtaining the content of the title tag and the content of the body-text tag respectively, storing the title content directly as one line, splitting the body content into sentences at periods, and writing them to a file line by line.

5. The black tour guide detection method based on a gradient boosting algorithm according to claim 1, wherein step B comprises the following substeps:

B1, obtaining complaint texts, and taking a part of the complaint texts as a training set;

B3, converting each sentence of the training set into a training sentence vector using a word embedding dictionary;

B4, training with a gradient boosting algorithm on the training sentence vectors to generate a black tour guide category prediction model.

6. The method according to claim 5, wherein the training set comprises 70% of the obtained complaint texts.

7. The black tour guide detection method based on a gradient boosting algorithm according to claim 1, wherein step C comprises the following substeps: