CN113378558A - RESTful API document topic distribution extraction method based on representative word pairs


Info

Publication number
CN113378558A
Authority
CN
China
Prior art keywords
word
topic
word pair
matrix
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110570270.6A
Other languages
Chinese (zh)
Other versions
CN113378558B (en)
Inventor
陆佳炜
郑嘉弘
赵伟
王小定
朱昊天
徐俊
程振波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202110570270.6A
Publication of CN113378558A
Application granted
Publication of CN113378558B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/44 Statistical methods, e.g. probability models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A RESTful API document topic distribution extraction method based on representative word pairs, the method comprising the following steps. The first step: perform word segmentation on the document, remove stop words, and normalize tenses. The second step: convert the word segmentation results into a word pair set. The third step: compute representative word pairs during the iterative training of the topic model, use the representative word pairs to implement a probability sampling algorithm, complete the training of the topic model, and output the document topic distribution of the RESTful API. The invention provides a RESTful API document topic distribution extraction method based on representative word pairs. It designs a word pair model based on the BTM topic model, searches during training for representative word pairs that are highly relevant to the currently sampled topic through a probability sampling strategy based on topic distribution information, and reduces the interference caused by noise by adjusting the weight of word pairs during sampling.

Description

RESTful API document topic distribution extraction method based on representative word pairs
Technical Field
The invention relates to a RESTful API document topic distribution extraction method based on representative word pairs.
Background
REST, short for Representational State Transfer, is a software architecture style whose idea can be summarized as representing resources with URIs and representing operations on those resources with HTTP methods. A RESTful API is a REST-style API: the front end simply sends a request containing the URI of the relevant resource and switches between the different operations on that resource through the HTTP methods (POST, GET, PUT, DELETE), so the server only needs to define a uniform response interface rather than parse each request in multiple ways. A RESTful API typically returns data in JSON or XML and is accompanied by description documents written in natural language. Because it is lightweight, simply structured and directly resource-oriented, it has gradually become the mainstream form of API service on the internet. Researchers therefore often compute the corresponding API features on the basis of these description documents.
A topic model can automatically obtain the implicit topic distribution of a corpus through iterative sampling, making full use of the implicit semantic information of the documents, and using the document topic distribution obtained by training a topic model as RESTful API feature information is a common approach. However, API description documents have the characteristics of short texts. A short text contains only a few words, yields little word co-occurrence information, and is semantically sparse. When processing short texts, conventional topic models cannot perform well because of this sparsity problem. On the other hand, description documents face the problem of noise interference: the text contains words that are not associated with any functional topic, called noise words, which can negatively affect topic determination. Only by solving these two problems can an effective and reasonable document topic distribution be extracted from the description documents.
The Biterm Topic Model (BTM), proposed in 2013, converts the word set of each segmented document into a set of word pairs (biterms) by combining the words two by two, samples this word pair set, and obtains the corresponding topic distribution through training. For example, the segmented text [create, user, account] yields the biterms (create, user), (create, account) and (user, account). By converting the original corpus into a word pair model, BTM increases the semantic co-occurrence information and thereby alleviates the sparsity problem of short texts.
Disclosure of Invention
To overcome the difficulty that the sparsity and noise of existing RESTful API documents bring to document topic distribution extraction, the invention provides a RESTful API document topic distribution extraction method based on representative word pairs.
The invention adopts the following technical scheme:
a RESTful API document topic distribution extraction method based on representative word pairs, the method comprising the steps of:
the first step is as follows: performing word segmentation processing on the document, and performing stop word removal and temporal normalization;
the second step is that: converting the word segmentation result into a word pair set;
the third step: and calculating a representative word pair in the iterative process of the topic model, realizing a probability sampling algorithm by using the representative word pair, completing the training of the topic model, and outputting the document topic distribution of RESTful API.
Further, the process of the first step is as follows:
1.1 read the RESTful API document information and convert it into a key-value pair D, with the API name as the key and the document content as the value;
1.2 traverse the document contents in D; let the current document content be d and set an empty set word_list; split d into sentences, remove punctuation marks, and then segment each sentence into words;
1.3 during the traversal, examine each segmented word: if it is not composed of special symbols, is not a pure number, and does not appear in the stop word list, normalize the word and store it in the word_list set of step 1.2; after every word has been examined, store word_list in D as the value in place of d.
Further, the process of the second step is as follows:
2.1 traverse the word segmentation results obtained in the first step to generate a non-repeating vocabulary Voc;
2.2 define a word pair (biterm) structure containing the sequence numbers of two different words in Voc; the smaller sequence number is stored as word1 and the larger as word2;
2.3 set an empty set whole_words as the storage set of all word segmentation results; traverse the key-value pair D and store the word_list set corresponding to each key into whole_words in turn;
2.4 traverse all word information in whole_words and convert each word into its sequence number in the vocabulary Voc;
2.5 generate the word pair set B.
Preferably, the step 2.5 is as follows:
2.5.1 traverse whole_words; let single_list be the list of vocabulary sequence numbers of the segmented words of the current document;
2.5.2 set a word pair set B for storing the word pair information;
2.5.3 traverse single_list with current element single_list(i), where single_list(i) is the vocabulary sequence number of the i-th word in single_list and 0 ≤ i < single_list.length; combine each single_list(i) with the vocabulary sequence number single_list(j) of every subsequent word to generate a word pair b, where i < j < single_list.length;
2.5.4 store the generated word pairs in the word pair set B and give each word pair b a sequence number in turn, denoted b.index.
Still further, the process of the third step is as follows:
3.1 set a zero matrix nz of size k × 1 for storing the number of word pairs assigned to each topic, where k is the number of topics; set a zero matrix nwz of size k × |Voc| for storing the number of times each vocabulary word is assigned to each topic, where |Voc| is the number of words in the vocabulary; a zero matrix is a matrix whose elements are all 0;
3.2 randomly assign a topic to each word pair and initialize nz and nwz;
3.3 set the total number of iterations, iteration, and the current iteration counter iter;
3.4 start the first iteration: traverse the word pair set B and sample each word pair b;
3.5 compute the representative word pair matrix S;
3.6 continue iterating: add 1 to the current iteration count iter, traverse the word pair set B, and sample each word pair b;
3.7 repeat the operation of step 3.5;
3.8 check iter, and stop iterating when iter equals iteration;
3.9 compute the document topic distribution θ according to the formula:

p(z | d) = nd_z / Σ_{z'} nd_{z'}

where p(z | d) denotes the probability of topic z for document d, and nd_z denotes the number of word pairs in document d that are assigned topic z.
The step 3.2 is as follows:
3.2.1 traverse the word pair set B and draw a random integer t, 0 ≤ t < k, for each word pair b; take t as the topic of word pair b, denoted b.topic;
3.2.2 traverse the word pair set B with the randomly assigned topics; let the current word pair be b, add 1 to the value at position nz[b.topic] in matrix nz, and add 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, where b.word1 denotes the value of word1 in the word pair and b.word2 denotes the value of word2 in the word pair; this completes the matrix initialization.
The step 3.4 is as follows:
3.4.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] to remove the influence of the current word pair b;
3.4.2 compute the sampling probability of each topic z according to the following formula:

p(z | z_{¬b}, B) ∝ (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

where p(z | z_{¬b}, B) denotes the probability that word pair b belongs to topic z after the influence of b itself has been removed; n_z denotes the number of word pairs assigned to topic z, i.e. the value of nz[z] in matrix nz; α and β are hyper-parameters; n_{wi|z} denotes the number of times the word wi whose sequence number is b.word1 has been assigned to topic z, i.e. the value of nwz[z][b.word1] in matrix nwz; n_{wj|z} denotes the number of times the word wj whose sequence number is b.word2 has been assigned to topic z, i.e. the value of nwz[z][b.word2] in matrix nwz; and M is the number of words in the vocabulary. Store the probabilities obtained for all topics, in order, in a list distribution;
3.4.3 use a roulette wheel operation to obtain the new topic of word pair b, denoted b.topic. The roulette wheel algorithm, also called the proportional selection algorithm, accumulates the probability distribution segment by segment to obtain a cumulative probability for each candidate, generates a random number in the interval [0,1], and outputs the candidate whose cumulative probability is the smallest value greater than or equal to the random number;
3.4.4 add 1 to the value at position nz[b.topic] in matrix nz, and add 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, so that the matrices record the sampling result.
The step 3.5 is as follows:
3.5.1 set a matrix lambda of size |B| × k as the word pair discrimination matrix, where |B| is the number of word pairs in the word pair set, and set a matrix S of size |B| × k as the representative word pair matrix;
3.5.2 traverse the word pair set B; let the current word pair be b, traverse all topics, and compute the word pair probability of word pair b for topic z according to the following formula:

p(z | b) ∝ (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

normalized so that the probabilities over all k topics sum to 1, the symbols having the same meanings as in step 3.4.2. Find the maximum of the probabilities p(z | b) of word pair b over all topics, denoted max(p(z | b)); for each topic z compute the ratio p(z | b)/max(p(z | b)) and store it at position lambda[b.index][z] in matrix lambda;
3.5.3 traverse all values in matrix lambda and judge each value lambda[b.index][z] against a Bernoulli distribution with set probability 0.5, storing the resulting 0 or 1 in the representative word pair matrix S. The Bernoulli distribution is a discrete probability distribution: when the input value is greater than the set probability the result is 1, and when it is less than or equal to the set probability the result is 0.
The step 3.6 is as follows:
3.6.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] to remove the influence of the current word pair b;
3.6.2 traverse each topic; let the current topic be z and judge: if the value of S[b.index][z] is 0, repeat the operations of steps 3.4.2, 3.4.3 and 3.4.4; if the value of S[b.index][z] is 1, replace the formula of step 3.4.2 with the following formula:

p(z | z_{¬b}, B) ∝ μ · (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

where μ is the weight parameter of the representative word pairs, set before training and adjustable to change the training effect of the model; then repeat the operations of steps 3.4.3 and 3.4.4.
The invention has the following beneficial effects: (1) taking RESTful API documents as the research object fits the functional semantic requirements of RESTful APIs and makes the method well suited to feature extraction; (2) the word-pair-based topic model greatly increases the co-occurrence information and overcomes the high sparsity of API document text; (3) the probability sampling algorithm implemented through the computation of representative word pairs reduces the influence of short-text noise words without affecting the time complexity of the algorithm, improving the reliability of the extracted API document topic distribution.
Detailed Description
The present invention is further explained below.
A RESTful API document topic distribution extraction method based on representative word pairs comprises the following steps.
The first step: perform word segmentation on the document, remove stop words, and normalize tenses.
the process of the first step is as follows:
1.1 read the RESTful API document information and convert it into a key-value pair D, with the API name as the key and the document content as the value;
1.2 traverse the document contents in D; let the current document content be d and set an empty set word_list; split d into sentences using the natural language processing library NLTK, remove punctuation marks, and then segment each sentence into words;
1.3 during the traversal, examine each segmented word by means of a regular expression: if it is not composed of special symbols, is not a pure number, and does not appear in the stop word list, normalize the word and store it in the word_list set of step 1.2; after every word has been examined, store word_list in D as the value in place of d.
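For concreteness, the first step could be implemented along the lines of the following sketch, using the NLTK library mentioned above. The helper name preprocess, the sample input raw_documents, and the use of WordNet verb lemmatization for tense normalization are illustrative assumptions, not part of the invention.

```python
import re
import string
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(doc_text):
    """Steps 1.2-1.3: sentence split, strip punctuation, segment, filter, normalize."""
    word_list = []
    for sentence in sent_tokenize(doc_text):
        sentence = sentence.translate(str.maketrans("", "", string.punctuation))
        for word in word_tokenize(sentence.lower()):
            # keep words that are not special symbols, not pure numbers,
            # and not in the stop word list
            if re.fullmatch(r"[a-z]+", word) and word not in stop_words:
                # tense normalization via verb lemmatization (assumed choice)
                word_list.append(lemmatizer.lemmatize(word, pos="v"))
    return word_list

# Step 1.1: key-value pair D keyed by API name (sample input for illustration)
raw_documents = {"weather-api": "Returns current weather data for a given city."}
D = {name: preprocess(text) for name, text in raw_documents.items()}
```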
The second step: convert the word segmentation results into a word pair set.
The process of the second step is as follows:
2.1 traverse the word segmentation results obtained in the first step to generate a non-repeating vocabulary Voc;
2.2 define a word pair (biterm) structure containing the sequence numbers of two different words in Voc; the smaller sequence number is stored as word1 and the larger as word2;
2.3 set an empty set whole_words as the storage set of all word segmentation results; traverse the key-value pair D and store the word_list set corresponding to each key into whole_words in turn;
2.4 traverse all word information in whole_words and convert each word into its sequence number in the vocabulary Voc;
2.5 generate the word pair set B, with the following steps:
2.5.1 traverse whole_words; let single_list be the list of vocabulary sequence numbers of the segmented words of the current document;
2.5.2 set a word pair set B for storing the word pair information;
2.5.3 traverse single_list with current element single_list(i), where single_list(i) is the vocabulary sequence number of the i-th word in single_list and 0 ≤ i < single_list.length; combine each single_list(i) with the vocabulary sequence number single_list(j) of every subsequent word to generate a word pair b, where i < j < single_list.length;
2.5.4 store the generated word pairs in the word pair set B and give each word pair b a sequence number in turn, denoted b.index.
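The conversion of the second step might be sketched as follows, continuing from the D built above; the Biterm dataclass is a hypothetical rendering of the biterm structure of step 2.2.

```python
from dataclasses import dataclass

@dataclass
class Biterm:
    word1: int        # smaller vocabulary sequence number (step 2.2)
    word2: int        # larger vocabulary sequence number
    index: int = -1   # word pair sequence number b.index (step 2.5.4)
    topic: int = -1   # assigned topic b.topic, set during training

# Steps 2.1/2.3: non-repeating vocabulary from all segmentation results
whole_words = list(D.values())
Voc = sorted({w for words in whole_words for w in words})
word_id = {w: i for i, w in enumerate(Voc)}

# Step 2.4: replace each word by its sequence number in Voc
whole_ids = [[word_id[w] for w in words] for words in whole_words]

# Step 2.5: pair each word with every later word of the same document
B = []
for single_list in whole_ids:
    for i in range(len(single_list)):
        for j in range(i + 1, len(single_list)):
            if single_list[i] == single_list[j]:
                continue  # the biterm structure pairs two different words
            w1, w2 = sorted((single_list[i], single_list[j]))
            B.append(Biterm(w1, w2, index=len(B)))
```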
The third step: compute representative word pairs during the iterative training of the topic model, use the representative word pairs to implement a probability sampling algorithm, complete the training of the topic model, and output the document topic distribution of the RESTful API.
the third step process is as follows:
3.1 setting a zero matrix nz with the size k x 1 for storing the word pairs corresponding to each topic, wherein k is the number of topics, setting a zero matrix nwz with the size k x | Voc | for storing the times of each vocabulary divided into each topic, wherein | Voc | represents the number of vocabularies in the vocabulary, and the zero matrix refers to a matrix with matrix elements all being 0;
3.2 randomly assign a topic to each word pair and initialize nz and nwz, with the following steps:
3.2.1 traverse the word pair set B and draw a random integer t, 0 ≤ t < k, for each word pair b; take t as the topic of word pair b, denoted b.topic;
3.2.2 traverse the word pair set B with the randomly assigned topics; let the current word pair be b, add 1 to the value at position nz[b.topic] in matrix nz, and add 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, where b.word1 denotes the value of word1 in the word pair and b.word2 denotes the value of word2 in the word pair; this completes the matrix initialization;
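A minimal sketch of the initialization of step 3.2, assuming the word pair set B built above; the value of k and the use of numpy are illustrative conveniences.

```python
import random
import numpy as np

k = 20                                     # number of topics (assumed value)
nz = np.zeros(k, dtype=int)                # k x 1: word pairs per topic
nwz = np.zeros((k, len(Voc)), dtype=int)   # k x |Voc|: word-topic counts

# Step 3.2.1: random topic 0 <= t < k for every word pair
for b in B:
    b.topic = random.randrange(k)

# Step 3.2.2: accumulate the counts implied by the random assignment
for b in B:
    nz[b.topic] += 1
    nwz[b.topic][b.word1] += 1
    nwz[b.topic][b.word2] += 1
```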
3.3 set the total number of iterations, iteration, and the current iteration counter iter;
3.4 start the first iteration: traverse the word pair set B and perform the sampling operation on each word pair b, with the following steps:
3.4.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] to remove the influence of the current word pair b;
3.4.2 compute the sampling probability of each topic z according to the following formula:

p(z | z_{¬b}, B) ∝ (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

where p(z | z_{¬b}, B) denotes the probability that word pair b belongs to topic z after the influence of b itself has been removed; n_z denotes the number of word pairs assigned to topic z, i.e. the value of nz[z] in matrix nz; α and β are hyper-parameters; n_{wi|z} denotes the number of times the word wi whose sequence number is b.word1 has been assigned to topic z, i.e. the value of nwz[z][b.word1] in matrix nwz; n_{wj|z} denotes the number of times the word wj whose sequence number is b.word2 has been assigned to topic z, i.e. the value of nwz[z][b.word2] in matrix nwz; and M is the number of words in the vocabulary. Store the probabilities obtained for all topics, in order, in a list distribution;
3.4.3 use a roulette wheel operation to obtain the new topic of word pair b, denoted b.topic. The roulette wheel algorithm, also called the proportional selection algorithm, accumulates the probability distribution segment by segment to obtain a cumulative probability for each candidate, generates a random number in the interval [0,1], and outputs the candidate whose cumulative probability is the smallest value greater than or equal to the random number;
3.4.4 add 1 to the value at position nz[b.topic] in matrix nz, and add 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, so that the matrices record the sampling result;
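One sampling pass over a word pair (steps 3.4.1 to 3.4.4) could then be sketched as below, using the conditional probability reconstructed above. The S_row/mu branch anticipates the weighted formula of step 3.6.2, and the values of alpha and beta are illustrative.

```python
alpha, beta = 2.0, 0.01        # hyper-parameters (illustrative values)
M = len(Voc)                   # number of words in the vocabulary

def sample_biterm(b, S_row=None, mu=1.0):
    # 3.4.1 / 3.6.1: remove the influence of the current word pair b
    nz[b.topic] -= 1
    nwz[b.topic][b.word1] -= 1
    nwz[b.topic][b.word2] -= 1

    # 3.4.2: (unnormalized) probability of each topic z
    distribution = []
    for z in range(k):
        p = (nz[z] + alpha) \
            * (nwz[z][b.word1] + beta) * (nwz[z][b.word2] + beta) \
            / (2 * nz[z] + M * beta) ** 2
        if S_row is not None and S_row[z] == 1:
            p *= mu            # weighted formula of step 3.6.2
        distribution.append(p)

    # 3.4.3: roulette wheel selection; scaling the random draw by the
    # total is equivalent to normalizing the distribution first
    r = random.random() * sum(distribution)
    cumulative = 0.0
    for z, p in enumerate(distribution):
        cumulative += p
        if cumulative >= r:
            b.topic = z
            break

    # 3.4.4: accept the sampling result
    nz[b.topic] += 1
    nwz[b.topic][b.word1] += 1
    nwz[b.topic][b.word2] += 1
```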
3.5 compute the representative word pair matrix S, with the following steps:
3.5.1 set a matrix lambda of size |B| × k as the word pair discrimination matrix, where |B| is the number of word pairs in the word pair set, and set a matrix S of size |B| × k as the representative word pair matrix;
3.5.2 traverse the word pair set B; let the current word pair be b, traverse all topics, and compute the word pair probability of word pair b for topic z according to the following formula:

p(z | b) ∝ (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

normalized so that the probabilities over all k topics sum to 1, the symbols having the same meanings as in step 3.4.2. Find the maximum of the probabilities p(z | b) of word pair b over all topics, denoted max(p(z | b)); for each topic z compute the ratio p(z | b)/max(p(z | b)) and store it at position lambda[b.index][z] in matrix lambda;
3.5.3 traverse all values in matrix lambda and judge each value lambda[b.index][z] against a Bernoulli distribution with set probability 0.5, storing the resulting 0 or 1 in the representative word pair matrix S; the Bernoulli distribution is a discrete probability distribution: when the input value is greater than the set probability the result is 1, and when it is less than or equal to the set probability the result is 0;
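The representative word pair matrix of step 3.5 might be computed as in this sketch; the hard 0.5 threshold realizes the Bernoulli judgment with set probability 0.5 described above, and the normalization constant cancels in the ratio, so the unnormalized probabilities suffice.

```python
def compute_S():
    lam = np.zeros((len(B), k))           # word pair discrimination matrix
    S = np.zeros((len(B), k), dtype=int)  # representative word pair matrix
    for b in B:
        # 3.5.2: p(z|b) up to a constant factor, for every topic
        probs = [(nz[z] + alpha)
                 * (nwz[z][b.word1] + beta) * (nwz[z][b.word2] + beta)
                 / (2 * nz[z] + M * beta) ** 2
                 for z in range(k)]
        p_max = max(probs)
        for z in range(k):
            lam[b.index][z] = probs[z] / p_max
            # 3.5.3: result is 1 above the set probability, 0 otherwise
            S[b.index][z] = 1 if lam[b.index][z] > 0.5 else 0
    return S
```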
3.6 continue iterating: add 1 to the current iteration count iter, traverse the word pair set B, and sample each word pair b, with the following steps:
3.6.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] to remove the influence of the current word pair b;
3.6.2 traverse each topic; let the current topic be z and judge: if the value of S[b.index][z] is 0, repeat the operations of steps 3.4.2, 3.4.3 and 3.4.4; if the value of S[b.index][z] is 1, replace the formula of step 3.4.2 with the following formula:

p(z | z_{¬b}, B) ∝ μ · (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

where μ is the weight parameter of the representative word pairs, set before training and adjustable to change the training effect of the model; then repeat the operations of steps 3.4.3 and 3.4.4;
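Steps 3.3 to 3.8 can then be driven by a loop like the following sketch: the first pass samples without S (step 3.4), every pass recomputes S (steps 3.5 and 3.7), and later passes use it for weighted sampling (step 3.6). The values of iteration and mu are illustrative.

```python
iteration = 500          # total number of iterations (assumed value)
S = None
for iter_ in range(iteration):
    for b in B:
        row = S[b.index] if S is not None else None
        sample_biterm(b, S_row=row, mu=1.5)  # mu: representative-pair weight
    S = compute_S()      # steps 3.5 / 3.7: recompute after each pass
```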
3.7 repeat the operation of step 3.5;
3.8 check iter, and stop iterating when iter equals iteration;
3.9 compute the document topic distribution θ according to the formula:

p(z | d) = nd_z / Σ_{z'} nd_{z'}

where p(z | d) denotes the probability of topic z for document d, and nd_z denotes the number of word pairs in document d that are assigned topic z.
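Finally, the document topic distribution θ of step 3.9 might be obtained per document as below, assuming the word pairs of each document can be recovered (for example by recording per-document index ranges when B is built). The resulting vector is the document topic distribution of the RESTful API that the method outputs.

```python
def topic_distribution(doc_biterms):
    # nd_z: number of the document's word pairs assigned to topic z
    nd = np.zeros(k)
    for b in doc_biterms:
        nd[b.topic] += 1
    return nd / nd.sum()   # theta: p(z|d) for every topic z
```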
The embodiments described in this specification merely illustrate implementations of the inventive concept and are given for purposes of illustration only. The scope of the present invention should not be construed as limited to the particular forms set forth in the embodiments, but also covers the equivalents that those skilled in the art can conceive on the basis of the inventive concept.

Claims (9)

1. A RESTful API document topic distribution extraction method based on representative word pairs, characterized in that the method comprises the following steps:
the first step: perform word segmentation on the document, remove stop words, and normalize tenses;
the second step: convert the word segmentation results into a word pair set;
the third step: compute representative word pairs during the iterative training of the topic model, use the representative word pairs to implement a probability sampling algorithm, complete the training of the topic model, and output the document topic distribution of the RESTful API.
2. The method of claim 1, wherein the process of the first step is as follows:
1.1 read the RESTful API document information and convert it into a key-value pair D, with the API name as the key and the document content as the value;
1.2 traverse the document contents in D; let the current document content be d and set an empty set word_list; split d into sentences, remove punctuation marks, and then segment each sentence into words;
1.3 during the traversal, examine each segmented word: if it is not composed of special symbols, is not a pure number, and does not appear in the stop word list, normalize the word and store it in the word_list set of step 1.2; after every word has been examined, store word_list in D as the value in place of d.
3. The method of claim 2, wherein the process of the second step is as follows:
2.1 traverse the word segmentation results obtained in the first step to generate a non-repeating vocabulary Voc;
2.2 define a word pair (biterm) structure containing the sequence numbers of two different words in Voc; the smaller sequence number is stored as word1 and the larger as word2;
2.3 set an empty set whole_words as the storage set of all word segmentation results; traverse the key-value pair D and store the word_list set corresponding to each key into whole_words in turn;
2.4 traverse all word information in whole_words and convert each word into its sequence number in the vocabulary Voc;
2.5 generate the word pair set B.
4. The method of claim 3, wherein the step 2.5 is as follows:
2.5.1 traverse whole_words; let single_list be the list of vocabulary sequence numbers of the segmented words of the current document;
2.5.2 set a word pair set B for storing the word pair information;
2.5.3 traverse single_list with current element single_list(i), where single_list(i) is the vocabulary sequence number of the i-th word in single_list and 0 ≤ i < single_list.length; combine each single_list(i) with the vocabulary sequence number single_list(j) of every subsequent word to generate a word pair b, where i < j < single_list.length;
2.5.4 store the generated word pairs in the word pair set B and give each word pair b a sequence number in turn, denoted b.index.
5. The RESTful API document topic distribution extraction method based on representative word pairs according to one of claims 1 to 4, characterized in that the process of the third step is as follows:
3.1 set a zero matrix nz of size k × 1 for storing the number of word pairs assigned to each topic, where k is the number of topics; set a zero matrix nwz of size k × |Voc| for storing the number of times each vocabulary word is assigned to each topic, where |Voc| is the number of words in the vocabulary; a zero matrix is a matrix whose elements are all 0;
3.2 randomly assign a topic to each word pair and initialize nz and nwz;
3.3 set the total number of iterations, iteration, and the current iteration counter iter;
3.4 start the first iteration: traverse the word pair set B and sample each word pair b;
3.5 compute the representative word pair matrix S;
3.6 continue iterating: add 1 to the current iteration count iter, traverse the word pair set B, and sample each word pair b;
3.7 repeat the operation of step 3.5;
3.8 check iter, and stop iterating when iter equals iteration;
3.9 compute the document topic distribution θ according to the formula:

p(z | d) = nd_z / Σ_{z'} nd_{z'}

where p(z | d) denotes the probability of topic z for document d, and nd_z denotes the number of word pairs in document d that are assigned topic z.
6. The method of claim 5, wherein the step 3.2 is as follows:
3.2.1 traverse the word pair set B and draw a random integer t, 0 ≤ t < k, for each word pair b; take t as the topic of word pair b, denoted b.topic;
3.2.2 traverse the word pair set B with the randomly assigned topics; let the current word pair be b, add 1 to the value at position nz[b.topic] in matrix nz, and add 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, where b.word1 denotes the value of word1 in the word pair and b.word2 denotes the value of word2 in the word pair; this completes the matrix initialization.
7. The method of claim 5, wherein the step 3.4 is as follows:
3.4.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] to remove the influence of the current word pair b;
3.4.2 compute the sampling probability of each topic z according to the following formula:

p(z | z_{¬b}, B) ∝ (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

where p(z | z_{¬b}, B) denotes the probability that word pair b belongs to topic z after the influence of b itself has been removed; n_z denotes the number of word pairs assigned to topic z, i.e. the value of nz[z] in matrix nz; α and β are hyper-parameters; n_{wi|z} denotes the number of times the word wi whose sequence number is b.word1 has been assigned to topic z, i.e. the value of nwz[z][b.word1] in matrix nwz; n_{wj|z} denotes the number of times the word wj whose sequence number is b.word2 has been assigned to topic z, i.e. the value of nwz[z][b.word2] in matrix nwz; and M is the number of words in the vocabulary; store the probabilities obtained for all topics, in order, in a list distribution;
3.4.3 use a roulette wheel operation to obtain the new topic of word pair b, denoted b.topic; the roulette wheel algorithm, also called the proportional selection algorithm, accumulates the probability distribution segment by segment to obtain a cumulative probability for each candidate, generates a random number in the interval [0,1], and outputs the candidate whose cumulative probability is the smallest value greater than or equal to the random number;
3.4.4 add 1 to the value at position nz[b.topic] in matrix nz, and add 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, so that the matrices record the sampling result.
8. The method of claim 7, wherein the step 3.5 is as follows:
3.5.1 set a matrix lambda of size |B| × k as the word pair discrimination matrix, where |B| is the number of word pairs in the word pair set, and set a matrix S of size |B| × k as the representative word pair matrix;
3.5.2 traverse the word pair set B; let the current word pair be b, traverse all topics, and compute the word pair probability of word pair b for topic z according to the following formula:

p(z | b) ∝ (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

normalized so that the probabilities over all k topics sum to 1, the symbols having the same meanings as in step 3.4.2; find the maximum of the probabilities p(z | b) of word pair b over all topics, denoted max(p(z | b)); for each topic z compute the ratio p(z | b)/max(p(z | b)) and store it at position lambda[b.index][z] in matrix lambda;
3.5.3 traverse all values in matrix lambda and judge each value lambda[b.index][z] against a Bernoulli distribution with set probability 0.5, storing the resulting 0 or 1 in the representative word pair matrix S; the Bernoulli distribution is a discrete probability distribution: when the input value is greater than the set probability the result is 1, and when it is less than or equal to the set probability the result is 0.
9. The method of claim 7, wherein the step 3.6 is as follows:
3.6.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] to remove the influence of the current word pair b;
3.6.2 traverse each topic; let the current topic be z and judge: if the value of S[b.index][z] is 0, repeat the operations of steps 3.4.2, 3.4.3 and 3.4.4; if the value of S[b.index][z] is 1, replace the formula of step 3.4.2 with the following formula:

p(z | z_{¬b}, B) ∝ μ · (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

where μ is the weight parameter of the representative word pairs, set before training and adjustable to change the training effect of the model; then repeat the operations of steps 3.4.3 and 3.4.4.
CN202110570270.6A 2021-05-25 2021-05-25 RESTful API document topic distribution extraction method based on representative word pairs Active CN113378558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110570270.6A CN113378558B (en) 2021-05-25 2021-05-25 RESTful API document topic distribution extraction method based on representative word pairs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110570270.6A CN113378558B (en) 2021-05-25 2021-05-25 RESTful API document topic distribution extraction method based on representative word pairs

Publications (2)

Publication Number Publication Date
CN113378558A true CN113378558A (en) 2021-09-10
CN113378558B CN113378558B (en) 2024-04-16

Family

ID=77571838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110570270.6A Active CN113378558B (en) RESTful API document topic distribution extraction method based on representative word pairs

Country Status (1)

Country Link
CN (1) CN113378558B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197144A (en) * 2017-11-28 2018-06-22 河海大学 A kind of much-talked-about topic based on BTM and Single-pass finds method
CN110647626A (en) * 2019-07-30 2020-01-03 浙江工业大学 REST data service clustering method based on Internet service domain
CN111191036A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Short text topic clustering method, device, equipment and medium
CN111475609A (en) * 2020-02-28 2020-07-31 浙江工业大学 Improved K-means service clustering method around topic modeling
CN112632215A (en) * 2020-12-01 2021-04-09 重庆邮电大学 Community discovery method and system based on word-pair semantic topic model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAN X et al., "A biterm topic model for short texts", Proceedings of the 22nd International Conference on World Wide Web, Rio: IW3C2 *
CHEN Ting et al., "Research on a Web service clustering method based on the BTM topic model", Computer Engineering & Science, vol. 40, no. 10 *

Also Published As

Publication number Publication date
CN113378558B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN111144131B (en) Network rumor detection method based on pre-training language model
CN111709243B (en) Knowledge extraction method and device based on deep learning
US8200671B2 (en) Generating a dictionary and determining a co-occurrence context for an automated ontology
CN101079031A (en) Web page subject extraction system and method
CN107357777B (en) Method and device for extracting label information
CN110263154A (en) A kind of network public-opinion emotion situation quantization method, system and storage medium
WO2017161749A1 (en) Method and device for information matching
JP4534666B2 (en) Text sentence search device and text sentence search program
CN115186654B (en) Method for generating document abstract
CN108491381B (en) Syntax analysis method of Chinese binary structure
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN106610952A (en) Mixed text feature word extraction method
CN109885641A (en) A kind of method and system of database Chinese Full Text Retrieval
CN111859950A (en) Method for automatically generating lecture notes
CN107092595A (en) New keyword extraction techniques
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
CN107102986A (en) Multi-threaded keyword extraction techniques in document
CN112632272A (en) Microblog emotion classification method and system based on syntactic analysis
CN113378558B (en) RESTful API document topic distribution extraction method based on representative word pairs
CN112784536B (en) Processing method, system and storage medium of mathematical application problem solving model
CN114328865A (en) Improved TextRank multi-feature fusion education resource keyword extraction method
CN114595684A (en) Abstract generation method and device, electronic equipment and storage medium
CN107092669A (en) A kind of method for setting up intelligent robot interaction
Maheswari et al. Rule based morphological variation removable stemming algorithm
CN113361270B (en) Short text optimization topic model method for service data clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant