CN113378558A - RESTful API document theme distribution extraction method based on representative word pairs - Google Patents
- Publication number: CN113378558A
- Application number: CN202110570270.6A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F40/44 — Statistical methods, e.g. probability models
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A RESTful API document topic distribution extraction method based on representative word pairs, the method comprising the steps of: first, segmenting the document into words, removing stop words, and normalizing tenses; second, converting the segmentation result into a word pair set; third, computing representative word pairs during the iterative training of the topic model, using them to implement a probabilistic sampling algorithm, completing the training of the topic model, and outputting the document topic distribution of the RESTful API. The invention provides a RESTful API document topic distribution extraction method based on representative word pairs, which builds a word pair model on top of the BTM topic model; during training, a probabilistic sampling strategy based on topic distribution information finds the word pairs most strongly associated with the topic currently being sampled, and the weights of these word pairs in the sampling process are adjusted to reduce the interference caused by noise words.
Description
Technical Field
The invention relates to a method for extracting the topic distribution of RESTful API documents based on representative word pairs.
Background
REST, short for Representational State Transfer, is a software architecture style whose core idea is to identify resources with URIs and to express operations on those resources with HTTP methods. A RESTful API is an API in the REST style: the front end sends a request containing the URI of the relevant resource and selects the operation through the HTTP method (POST, GET, PUT, DELETE), so the server only needs to expose a uniform response interface instead of parsing each request in an ad hoc way. A RESTful API typically returns data in JSON or XML, accompanied by a description document written in natural language. Because of its light weight, simple structure, and direct resource orientation, the RESTful API has gradually become the mainstream form of API service on the Internet, and researchers commonly compute API features from these description documents.
A topic model can automatically learn the latent topic distribution of a corpus through iterative sampling, making full use of the implicit semantic information of the documents, and using the document topic distribution obtained by training a topic model as RESTful API feature information is a common practice. However, API description documents are short texts: they contain only a few words, provide little word co-occurrence information, and are semantically sparse. Because of this sparsity, conventional topic models perform poorly on short texts. Description documents also face noise interference: they contain words unrelated to any functional topic, called noise words, which can mislead topic assignment. Only by solving these two problems can an effective and reasonable document topic distribution be extracted from a description document.
The Biterm Topic Model (BTM), proposed in 2013, pairs up every two words of the segmented corpus, converting the word sets into biterm (word pair) sets, and then samples over the biterms to train the corresponding topic distributions. By turning the original corpus into a word pair model, BTM increases the available semantic co-occurrence information and alleviates the sparsity problem of short texts.
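As a minimal sketch of the biterm conversion BTM performs (the example document and word IDs are invented for illustration), every two words of a document can be paired as follows:

```python
from itertools import combinations

def to_biterms(word_ids):
    """Convert a document's word-ID sequence into its list of unordered
    word pairs (biterms); each pair is stored with the smaller ID first."""
    return [tuple(sorted(p)) for p in combinations(word_ids, 2)]

# Hypothetical four-word document mapped to vocabulary IDs 0..3.
print(to_biterms([0, 1, 2, 3]))
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

A document of n words thus yields n(n-1)/2 biterms, which is where the extra co-occurrence information comes from.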
Disclosure of Invention
To overcome the difficulties that the sparsity and noise of RESTful API documents create for topic distribution extraction, the invention provides a RESTful API document topic distribution extraction method based on representative word pairs.
The invention adopts the following technical scheme:
a RESTful API document topic distribution extraction method based on representative word pairs, the method comprising the steps of:
the first step: segment the document into words, remove stop words, and normalize tenses;
the second step: convert the segmentation result into a word pair set;
the third step: compute representative word pairs during the iterative training of the topic model, use them to implement a probabilistic sampling algorithm, complete the training of the topic model, and output the document topic distribution of the RESTful API.
Further, the first step process is as follows:
1.1 read the RESTful API document information and convert it into a key-value map D, using the API name as the key and the document content as the value;
1.2 traverse the document contents in D; let the current document content be d and create an empty list word_list; split d into sentences, remove punctuation marks, and then split each sentence into words;
1.3 during the traversal, test each segmented word: if it is not composed of special symbols, is not a pure number, and is not in the stop word list, normalize it and append it to the word_list created in step 1.2; after every word has been tested, store word_list back into D as the value for the current key, replacing d.
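Steps 1.1–1.3 can be sketched in Python as follows. The detailed description mentions an NLP library for sentence splitting and normalization, so the regex tokenizer, the toy stop word list, and the crude suffix-stripping `normalize` below are stand-in assumptions, not the patented implementation:

```python
import re

# Toy stop-word list; the patent filters against a full stop word list.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of"}

def normalize(word):
    # Crude stand-in for tense normalization: strip a few common
    # suffixes (a real implementation would lemmatize, e.g. with NLTK).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(api_docs):
    """Steps 1.1-1.3: api_docs maps API name -> raw document text;
    returns the map D of API name -> cleaned token list (word_list)."""
    D = {}
    for name, text in api_docs.items():
        word_list = []
        for sentence in re.split(r"[.!?]+", text):             # 1.2 sentence split
            for w in re.findall(r"[a-z]+", sentence.lower()):  # drop punctuation/numbers
                if w not in STOP_WORDS:                        # 1.3 stop-word filter
                    word_list.append(normalize(w))
        D[name] = word_list                                    # replace d with word_list
    return D

docs = {"PetStore": "Returns the list of pets. Creating a pet is supported."}
print(preprocess(docs))
# {'PetStore': ['return', 'list', 'pet', 'creat', 'pet', 'support']}
```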
Further, the process of the second step is as follows:
2.1 traverse the segmentation results obtained in the first step to generate a non-repeating vocabulary Voc;
2.2 define a biterm (word pair) structure containing the sequence numbers of two different words in Voc, with the smaller sequence number stored as word1 and the larger as word2;
2.3 create an empty list whole_words as the container for all segmentation results; traverse the key-value map D and append the word_list of each key to whole_words in order;
2.4 traverse all words in whole_words and convert each word into its sequence number in the vocabulary Voc;
2.5 generate the word pair set B.
Preferably, the step 2.5 is as follows:
2.5.1 traverse whole_words; let single_list be the list of vocabulary sequence numbers of the current document's segmented words;
2.5.2 create a word pair set B for storing word pair information;
2.5.3 traverse single_list; let the current element be single_list(i), the vocabulary sequence number of the i-th word, where 0 ≤ i < single_list.length; for each single_list(i), combine it with every later element single_list(j), where i < j < single_list.length, to generate a word pair b;
2.5.4 store the generated word pairs in the word pair set B, and assign each word pair b a sequential word pair index, denoted b.index.
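Steps 2.1–2.5 can be sketched as follows, assuming a simple list-based layout in which each biterm is a tuple (index, word1, word2); the example tokens are invented:

```python
from itertools import combinations

def build_biterms(whole_words):
    """Steps 2.1-2.5: whole_words is a list of token lists (one per
    document).  Returns the vocabulary Voc and the corpus-wide biterm
    set B, where each biterm is (b.index, word1, word2), word1 < word2."""
    voc = sorted({w for doc in whole_words for w in doc})  # 2.1 non-repeating vocabulary
    word_id = {w: i for i, w in enumerate(voc)}
    B = []
    for doc in whole_words:
        single_list = [word_id[w] for w in doc]            # 2.4 word -> sequence number
        for w1, w2 in combinations(single_list, 2):        # 2.5.3 pair every i < j
            B.append((len(B), min(w1, w2), max(w1, w2)))   # 2.5.4 b.index, b.word1, b.word2
    return voc, B

voc, B = build_biterms([["get", "pet", "list"]])
print(voc)  # ['get', 'list', 'pet']
print(B)    # [(0, 0, 2), (1, 0, 1), (2, 1, 2)]
```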
Still further, the process of the third step is as follows:
3.1 create a zero matrix nz of size k × 1 that stores the number of word pairs assigned to each topic, where k is the number of topics, and a zero matrix nwz of size k × |Voc| that stores the number of times each word is assigned to each topic, where |Voc| is the number of words in the vocabulary; a zero matrix is a matrix whose elements are all 0;
3.2 randomly assign a topic to each word pair and initialize nz and nwz;
3.3 set the total number of iterations iterating and initialize the current iteration counter iter;
3.4 start the first iteration: traverse the word pair set B and sample each word pair b;
3.5 compute the representative word pair matrix S;
3.6 continue iterating: add 1 to the current iteration counter iter, traverse the word pair set B, and sample each word pair b;
3.7 repeat the operation of step 3.5;
3.8 check iter and stop iterating when iter equals iterating;
3.9 compute the document topic distribution θ according to the formula:
p(z | d) = nd_z / Σ_{z'=1}^{k} nd_{z'}
where p(z | d) is the probability of topic z for document d and nd_z is the number of words in document d assigned to topic z.
The step 3.2 is as follows:
3.2.1 traverse the word pair set B; for each word pair b, draw a random integer t with 0 ≤ t < k and take t as the topic of b, denoted b.topic;
3.2.2 traverse the word pair set B with its randomly assigned topics; let the current word pair be b, add 1 to nz[b.topic] in the matrix nz, and add 1 to nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in the matrix nwz, where b.word1 and b.word2 are the values of word1 and word2 in the word pair; this completes the matrix initialization.
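A sketch of the initialization in steps 3.1–3.2, assuming the biterm layout (index, word1, word2) from step 2.5:

```python
import random

def init_topics(B, k, voc_size, seed=0):
    """Steps 3.1-3.2: randomly assign a topic to every biterm and build
    the count matrices nz (size k) and nwz (size k x |Voc|)."""
    rng = random.Random(seed)
    nz = [0] * k                                   # 3.1 zero matrices
    nwz = [[0] * voc_size for _ in range(k)]
    topics = []                                    # b.topic for each biterm
    for _, w1, w2 in B:
        t = rng.randrange(k)                       # 3.2.1: 0 <= t < k
        topics.append(t)
        nz[t] += 1                                 # 3.2.2: one more biterm under t
        nwz[t][w1] += 1
        nwz[t][w2] += 1
    return topics, nz, nwz

topics, nz, nwz = init_topics([(0, 0, 2), (1, 0, 1)], k=2, voc_size=3)
```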
The step 3.4 is as follows:
3.4.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1], and nwz[b.topic][b.word2] to exclude the influence of the current word pair b;
3.4.2 for each topic z, compute the sampling weight by the formula:
p(z | b, z_{-b}) ∝ (n_z + α) · (n_{w_i|z} + β)(n_{w_j|z} + β) / ((Σ_w n_{w|z} + Mβ)(Σ_w n_{w|z} + Mβ + 1))
where p(z | b, z_{-b}) is the probability that the word pair b belongs to topic z after the influence of b itself has been removed, ∝ denotes proportionality, n_z is the number of word pairs assigned to topic z, i.e. the value of nz[z] in the matrix nz, α and β are hyper-parameters, n_{w_i|z} is the number of times the word w_i with sequence number b.word1 is assigned to topic z, i.e. the value of nwz[z][b.word1], n_{w_j|z} is the number of times the word w_j with sequence number b.word2 is assigned to topic z, i.e. the value of nwz[z][b.word2], and M is the number of words in the vocabulary; store the weights obtained for all topics, in order, in a list distribution;
3.4.3 use a roulette wheel operation to draw a new topic for the word pair b and set it as b.topic; the roulette wheel algorithm, also called proportional selection, accumulates the probability distribution segment by segment to obtain a cumulative probability for each candidate, generates a random number in the interval [0, 1], and selects as output the first candidate whose cumulative probability is greater than or equal to that random number;
3.4.4 add 1 to nz[b.topic] in the matrix nz, and add 1 to nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in the matrix nwz, so that the matrices absorb the sampling result.
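Steps 3.4.1–3.4.4 can be sketched as follows. The conditional used below is the standard BTM Gibbs-sampling formula; the patent's own equation is not reproduced in this text, so treat the exact form as an assumption:

```python
import random

def sample_topic(b, topics, nz, nwz, alpha, beta, M, rng):
    """Steps 3.4.1-3.4.4 for one biterm b = (index, word1, word2)."""
    idx, w1, w2 = b
    old = topics[idx]
    nz[old] -= 1; nwz[old][w1] -= 1; nwz[old][w2] -= 1   # 3.4.1 exclude b

    distribution = []                                    # 3.4.2 weight per topic
    for z in range(len(nz)):
        total = sum(nwz[z]) + M * beta
        p = (nz[z] + alpha) * (nwz[z][w1] + beta) * (nwz[z][w2] + beta) \
            / (total * (total + 1))
        distribution.append(p)

    r = rng.random() * sum(distribution)                 # 3.4.3 roulette wheel:
    cum, new = 0.0, len(nz) - 1                          # first topic whose
    for z, p in enumerate(distribution):                 # cumulative weight
        cum += p                                         # reaches r wins
        if cum >= r:
            new = z
            break

    topics[idx] = new                                    # 3.4.4 accept the sample
    nz[new] += 1; nwz[new][w1] += 1; nwz[new][w2] += 1
    return new

# Tiny demo with hypothetical counts: two biterms, two topics, |Voc| = 3.
rng = random.Random(1)
topics, nz, nwz = [0, 0], [2, 0], [[2, 2, 0], [0, 0, 0]]
sample_topic((0, 0, 1), topics, nz, nwz, alpha=0.5, beta=0.1, M=3, rng=rng)
assert sum(nz) == 2   # total biterm count is conserved
```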
The step of 3.5 is as follows:
3.5.1 create a matrix lambda of size |B| × k as the word pair discrimination matrix, where |B| is the number of word pairs in the word pair set, and a matrix S of size |B| × k as the representative word pair matrix;
3.5.2 traverse the word pair set B; let the current word pair be b, traverse all topics, and compute the probability of word pair b for topic z by the formula:
p(z | b) ∝ (n_z + α) · (n_{w_i|z} + β)(n_{w_j|z} + β) / ((Σ_w n_{w|z} + Mβ)(Σ_w n_{w|z} + Mβ + 1))
where the symbols have the same meaning as in step 3.4.2; find the maximum of p(z | b) over all topics, denoted max(p(z | b)), compute the ratio p(z | b)/max(p(z | b)) for each topic z, and store it at position lambda[b.index][z] in the matrix lambda;
3.5.3 traverse all values in the matrix lambda, threshold each lambda[b.index][z] by a Bernoulli decision with set probability 0.5, and store the resulting 0 or 1 in the representative word pair matrix S; the Bernoulli distribution is a discrete probability distribution, and here the decision returns 1 when the input value is greater than the set probability and 0 when it is less than or equal to the set probability.
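A sketch of the representative word pair matrix of steps 3.5.1–3.5.3; `topic_prob` stands in for the per-topic probability formula of step 3.5.2, and the 0.5 threshold implements the Bernoulli decision described above:

```python
def representative_matrix(B, topic_prob):
    """Steps 3.5.1-3.5.3: topic_prob(b) returns p(z|b) for every topic
    (the step-3.5.2 formula); each ratio lambda[b.index][z] =
    p(z|b) / max_z p(z|b) is thresholded at 0.5 into the 0/1 matrix S."""
    S = []
    for b in B:
        probs = topic_prob(b)
        peak = max(probs)
        row = [1 if (p / peak) > 0.5 else 0 for p in probs]  # 3.5.3 decision
        S.append(row)
    return S

# Hypothetical per-topic probabilities for a single biterm.
print(representative_matrix([(0, 0, 1)], lambda b: [0.7, 0.4, 0.1]))
# [[1, 1, 0]]
```

The biterm counts as representative for a topic exactly when its probability for that topic is more than half of its best topic's probability.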
The step 3.6 is as follows:
3.6.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1], and nwz[b.topic][b.word2] to exclude the influence of the current word pair b;
3.6.2 traverse every topic; let the current topic be z; if the value of S[b.index][z] is 0, repeat the operations of steps 3.4.2, 3.4.3, and 3.4.4; if the value of S[b.index][z] is 1, replace the formula in step 3.4.2 with:
p(z | b, z_{-b}) ∝ μ · (n_z + α) · (n_{w_i|z} + β)(n_{w_j|z} + β) / ((Σ_w n_{w|z} + Mβ)(Σ_w n_{w|z} + Mβ + 1))
where μ is the weight parameter of the representative word pairs, set before training and adjustable to change the training effect of the model; then repeat the operations of steps 3.4.3 and 3.4.4.
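The exact weighted formula is not reproduced in this text; one plausible reading, sketched here as an assumption, is that the step-3.4.2 sampling weight of a topic is scaled by μ exactly when the biterm is representative for that topic (S[b.index][z] = 1):

```python
def weighted_distribution(distribution, s_row, mu):
    """Assumed form of step 3.6.2: scale the step-3.4.2 weight by mu for
    topics where the biterm is representative (S[b.index][z] == 1);
    other topics keep the plain weight."""
    return [p * mu if flag == 1 else p
            for p, flag in zip(distribution, s_row)]

print(weighted_distribution([0.2, 0.5, 0.3], [1, 0, 0], mu=2.0))
# [0.4, 0.5, 0.3]
```

With μ > 1 this boosts topics for which the word pair is representative; μ < 1 would instead dampen them, which is one way the parameter can "adjust the weight information of the word pair" against noise.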
The beneficial effects of the invention are: (1) taking RESTful API documents as the research object fits the functional-semantic requirements of RESTful APIs and suits the feature extraction task; (2) the word-pair-based topic model greatly increases co-occurrence information and overcomes the severe text sparsity of API documents; (3) the probabilistic sampling algorithm built on representative word pair computation reduces the influence of short-text noise words without increasing the time complexity of the algorithm, improving the reliability of the extracted API document topic distribution.
Detailed Description
The present invention is further explained below.
A RESTful API document topic distribution extraction method based on representative word pairs comprises the following steps:
the first step: segment the document into words, remove stop words, and normalize tenses;
the process of the first step is as follows:
1.1 read the RESTful API document information and convert it into a key-value map D, using the API name as the key and the document content as the value;
1.2 traverse the document contents in D; let the current document content be d and create an empty list word_list; split d into sentences using the natural language processing library NLTK, remove punctuation marks, and then split each sentence into words;
1.3 during the traversal, test each segmented word with regular expressions: if it is not composed of special symbols, is not a pure number, and is not in the stop word list, normalize it and append it to the word_list created in step 1.2; after every word has been tested, store word_list back into D as the value for the current key, replacing d;
the second step is that: converting the word segmentation result into a word pair set;
the process of the second step is as follows:
2.1 traverse the segmentation results obtained in the first step to generate a non-repeating vocabulary Voc;
2.2 define a biterm (word pair) structure containing the sequence numbers of two different words in Voc, with the smaller sequence number stored as word1 and the larger as word2;
2.3 create an empty list whole_words as the container for all segmentation results; traverse the key-value map D and append the word_list of each key to whole_words in order;
2.4 traverse all words in whole_words and convert each word into its sequence number in the vocabulary Voc;
2.5 generate the word pair set B as follows:
2.5.1 traverse whole_words; let single_list be the list of vocabulary sequence numbers of the current document's segmented words;
2.5.2 create a word pair set B for storing word pair information;
2.5.3 traverse single_list; let the current element be single_list(i), the vocabulary sequence number of the i-th word, where 0 ≤ i < single_list.length; for each single_list(i), combine it with every later element single_list(j), where i < j < single_list.length, to generate a word pair b;
2.5.4 store the generated word pairs in the word pair set B, and assign each word pair b a sequential word pair index, denoted b.index.
The third step: calculating a representative word pair in the iterative process of the topic model, realizing a probability sampling algorithm by using the representative word pair, completing the training of the topic model, and outputting the document topic distribution of RESTful API;
the third step process is as follows:
3.1 create a zero matrix nz of size k × 1 that stores the number of word pairs assigned to each topic, where k is the number of topics, and a zero matrix nwz of size k × |Voc| that stores the number of times each word is assigned to each topic, where |Voc| is the number of words in the vocabulary; a zero matrix is a matrix whose elements are all 0;
3.2 randomly assign a topic to each word pair and initialize nz and nwz as follows:
3.2.1 traverse the word pair set B; for each word pair b, draw a random integer t with 0 ≤ t < k and take t as the topic of b, denoted b.topic;
3.2.2 traverse the word pair set B with its randomly assigned topics; let the current word pair be b, add 1 to nz[b.topic] in the matrix nz, and add 1 to nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in the matrix nwz, where b.word1 and b.word2 are the values of word1 and word2 in the word pair; this completes the matrix initialization;
3.3 set the total number of iterations iterating and initialize the current iteration counter iter;
3.4 start the first iteration: traverse the word pair set B and sample each word pair b as follows:
3.4.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1], and nwz[b.topic][b.word2] to exclude the influence of the current word pair b;
3.4.2 for each topic z, compute the sampling weight by the formula:
p(z | b, z_{-b}) ∝ (n_z + α) · (n_{w_i|z} + β)(n_{w_j|z} + β) / ((Σ_w n_{w|z} + Mβ)(Σ_w n_{w|z} + Mβ + 1))
where p(z | b, z_{-b}) is the probability that the word pair b belongs to topic z after the influence of b itself has been removed, ∝ denotes proportionality, n_z is the number of word pairs assigned to topic z, i.e. the value of nz[z] in the matrix nz, α and β are hyper-parameters, n_{w_i|z} is the number of times the word w_i with sequence number b.word1 is assigned to topic z, i.e. the value of nwz[z][b.word1], n_{w_j|z} is the number of times the word w_j with sequence number b.word2 is assigned to topic z, i.e. the value of nwz[z][b.word2], and M is the number of words in the vocabulary; store the weights obtained for all topics, in order, in a list distribution;
3.4.3 use a roulette wheel operation to draw a new topic for the word pair b and set it as b.topic; the roulette wheel algorithm, also called proportional selection, accumulates the probability distribution segment by segment to obtain a cumulative probability for each candidate, generates a random number in the interval [0, 1], and selects as output the first candidate whose cumulative probability is greater than or equal to that random number;
3.4.4 add 1 to nz[b.topic] in the matrix nz, and add 1 to nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in the matrix nwz, so that the matrices absorb the sampling result;
3.5 compute the representative word pair matrix S as follows:
3.5.1 create a matrix lambda of size |B| × k as the word pair discrimination matrix, where |B| is the number of word pairs in the word pair set, and a matrix S of size |B| × k as the representative word pair matrix;
3.5.2 traverse the word pair set B; let the current word pair be b, traverse all topics, and compute the probability of word pair b for topic z by the formula:
p(z | b) ∝ (n_z + α) · (n_{w_i|z} + β)(n_{w_j|z} + β) / ((Σ_w n_{w|z} + Mβ)(Σ_w n_{w|z} + Mβ + 1))
where the symbols have the same meaning as in step 3.4.2; find the maximum of p(z | b) over all topics, denoted max(p(z | b)), compute the ratio p(z | b)/max(p(z | b)) for each topic z, and store it at position lambda[b.index][z] in the matrix lambda;
3.5.3 traverse all values in the matrix lambda, threshold each lambda[b.index][z] by a Bernoulli decision with set probability 0.5, and store the resulting 0 or 1 in the representative word pair matrix S; the Bernoulli distribution is a discrete probability distribution, and here the decision returns 1 when the input value is greater than the set probability and 0 when it is less than or equal to the set probability;
3.6 continue iterating: add 1 to the current iteration counter iter, traverse the word pair set B, and sample each word pair b as follows:
3.6.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1], and nwz[b.topic][b.word2] to exclude the influence of the current word pair b;
3.6.2 traverse every topic; let the current topic be z; if the value of S[b.index][z] is 0, repeat the operations of steps 3.4.2, 3.4.3, and 3.4.4; if the value of S[b.index][z] is 1, replace the formula in step 3.4.2 with:
p(z | b, z_{-b}) ∝ μ · (n_z + α) · (n_{w_i|z} + β)(n_{w_j|z} + β) / ((Σ_w n_{w|z} + Mβ)(Σ_w n_{w|z} + Mβ + 1))
where μ is the weight parameter of the representative word pairs, set before training and adjustable to change the training effect of the model; then repeat the operations of steps 3.4.3 and 3.4.4;
3.7 repeating the operation of step 3.5;
3.8 check iter and stop iterating when iter equals iterating;
3.9 compute the document topic distribution θ according to the formula:
p(z | d) = nd_z / Σ_{z'=1}^{k} nd_{z'}
where p(z | d) is the probability of topic z for document d and nd_z is the number of words in document d assigned to topic z.
The embodiments described in this specification merely illustrate implementations of the inventive concept and are intended for purposes of illustration only. The scope of the invention should not be construed as limited to the particular forms set forth in the embodiments, but extends to equivalents that those skilled in the art can conceive from the inventive concept.
Claims (9)
1. A RESTful API document topic distribution extraction method based on representative word pairs is characterized by comprising the following steps:
the first step: segment the document into words, remove stop words, and normalize tenses;
the second step: convert the segmentation result into a word pair set;
the third step: compute representative word pairs during the iterative training of the topic model, use them to implement a probabilistic sampling algorithm, complete the training of the topic model, and output the document topic distribution of the RESTful API.
2. The method of claim 1, wherein the first step is performed by:
1.1 read the RESTful API document information and convert it into a key-value map D, using the API name as the key and the document content as the value;
1.2 traverse the document contents in D; let the current document content be d and create an empty list word_list; split d into sentences, remove punctuation marks, and then split each sentence into words;
1.3 during the traversal, test each segmented word: if it is not composed of special symbols, is not a pure number, and is not in the stop word list, normalize it and append it to the word_list created in step 1.2; after every word has been tested, store word_list back into D as the value for the current key, replacing d.
3. The method of claim 2, wherein the second step is performed by the following process:
2.1 traverse the segmentation results obtained in the first step to generate a non-repeating vocabulary Voc;
2.2 define a biterm (word pair) structure containing the sequence numbers of two different words in Voc, with the smaller sequence number stored as word1 and the larger as word2;
2.3 create an empty list whole_words as the container for all segmentation results; traverse the key-value map D and append the word_list of each key to whole_words in order;
2.4 traverse all words in whole_words and convert each word into its sequence number in the vocabulary Voc;
2.5 generate the word pair set B.
4. The method of claim 3, wherein the step of 2.5 is as follows:
2.5.1 traverse whole_words; let single_list be the list of vocabulary sequence numbers of the current document's segmented words;
2.5.2 create a word pair set B for storing word pair information;
2.5.3 traverse single_list; let the current element be single_list(i), the vocabulary sequence number of the i-th word, where 0 ≤ i < single_list.length; for each single_list(i), combine it with every later element single_list(j), where i < j < single_list.length, to generate a word pair b;
2.5.4 store the generated word pairs in the word pair set B, and assign each word pair b a sequential word pair index, denoted b.index.
5. The RESTful API document topic distribution extraction method based on representative word pairs according to one of claims 1 to 4, characterized in that the procedure of the third step is as follows:
3.1 create a zero matrix nz of size k × 1 that stores the number of word pairs assigned to each topic, where k is the number of topics, and a zero matrix nwz of size k × |Voc| that stores the number of times each word is assigned to each topic, where |Voc| is the number of words in the vocabulary; a zero matrix is a matrix whose elements are all 0;
3.2 randomly assign a topic to each word pair and initialize nz and nwz;
3.3 set the total number of iterations iterating and initialize the current iteration counter iter;
3.4 start the first iteration: traverse the word pair set B and sample each word pair b;
3.5 compute the representative word pair matrix S;
3.6 continue iterating: add 1 to the current iteration counter iter, traverse the word pair set B, and sample each word pair b;
3.7 repeat the operation of step 3.5;
3.8 check iter and stop iterating when iter equals iterating;
3.9 calculating the document topic distribution theta according to the formula:

p(z|d) = nd_z / Σ_z' nd_z'

wherein p(z|d) denotes the probability of topic z for document d, and nd_z denotes the number of words in document d that are assigned topic z.
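The normalization of step 3.9 can be sketched as follows, assuming nd is the length-k vector of per-document topic counts nd_z; the function name and the uniform fallback for empty documents are illustrative assumptions.

```python
def document_topic_distribution(nd, k):
    """Normalize per-document topic counts nd (length k) into
    theta, i.e. p(z|d) = nd[z] / sum(nd), as in step 3.9."""
    total = sum(nd)
    if total == 0:
        return [1.0 / k] * k  # assumed fallback for empty documents
    return [n / total for n in nd]
```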
6. The method of claim 5, wherein the procedure of step 3.2 is as follows:
3.2.1 traversing the word pair set B and, for each word pair b, randomly drawing an integer t with 0 ≤ t < k, and taking t as the topic of word pair b, recorded as b.topic;
3.2.2 traversing the word pair set B with the randomly assigned topics; for the current word pair b, adding 1 to the value at position nz[b.topic] in the matrix nz, and adding 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in the matrix nwz, wherein b.word1 and b.word2 denote the vocabulary indices of the first and second words of the pair, thereby completing the matrix initialization.
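Steps 3.2.1 and 3.2.2 amount to a random initialization of the count matrices; a minimal sketch, assuming word pairs are dicts with word1/word2 fields (the function name and seed argument are illustrative assumptions):

```python
import random

def init_topics(B, k, vocab_size, seed=0):
    """Randomly assign a topic to every word pair and build the
    count matrices nz (k x 1) and nwz (k x |Voc|) of steps 3.1-3.2."""
    rng = random.Random(seed)
    nz = [0] * k
    nwz = [[0] * vocab_size for _ in range(k)]
    for b in B:
        b["topic"] = rng.randrange(k)     # random integer t, 0 <= t < k
        nz[b["topic"]] += 1               # one more pair on this topic
        nwz[b["topic"]][b["word1"]] += 1  # count both words of the pair
        nwz[b["topic"]][b["word2"]] += 1
    return nz, nwz
```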
7. The method of claim 5, wherein the procedure of step 3.4 is as follows:
3.4.1 subtracting 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2], so as to exclude the influence of the current word pair b;
3.4.2 calling the following formula to compute the sampling probability for each topic z:

p(z|b) ∝ (n_z + α) × (n_{wi|z} + β) × (n_{wj|z} + β) / (Σ_w n_{w|z} + Mβ)^2

wherein p(z|b) denotes the probability that word pair b belongs to topic z after the influence of the current word pair b has been removed; n_z denotes the number of word pairs belonging to topic z, i.e. the value of nz[z] in the matrix nz; α and β are hyper-parameters; n_{wi|z} denotes the number of times the word w_i with index b.word1 is assigned to topic z, i.e. the value of nwz[z][b.word1] in the matrix nwz; n_{wj|z} denotes the number of times the word w_j with index b.word2 is assigned to topic z, i.e. the value of nwz[z][b.word2]; and M is the number of words in the vocabulary; the probabilities obtained for all topics are stored in order into a list distribution;
3.4.3 obtaining a new topic for word pair b by a roulette operation and recording it as b.topic; the roulette algorithm, also called proportional selection, obtains the cumulative probability of each candidate by piecewise accumulation of the probability distribution, generates a random number in the interval [0,1], and selects as output the candidate whose cumulative probability is greater than or equal to the random number and whose difference from it is smallest;
3.4.4 adding 1 to the value at position nz[b.topic] in the matrix nz, and adding 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in the matrix nwz, so that the matrices record the sampling result.
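Steps 3.4.1 to 3.4.4 describe one Gibbs-style sampling pass for a word pair. The sketch below uses the standard biterm-model probability as reconstructed from the claim text, with roulette selection as in step 3.4.3; the function name and rng argument are illustrative assumptions.

```python
import random

def sample_pair(b, nz, nwz, alpha, beta, M, rng=None):
    """One sampling step for word pair b (steps 3.4.1-3.4.4):
    remove b's counts, compute the topic distribution, draw a new
    topic by roulette selection, then record the result."""
    rng = rng or random.Random(0)
    k = len(nz)
    w1, w2 = b["word1"], b["word2"]
    z_old = b["topic"]
    # 3.4.1: exclude the influence of the current word pair
    nz[z_old] -= 1
    nwz[z_old][w1] -= 1
    nwz[z_old][w2] -= 1
    # 3.4.2: p(z|b) proportional to
    # (n_z+a)(n_{w1|z}+b)(n_{w2|z}+b) / (sum_w n_{w|z} + M*b)^2
    distribution = []
    for z in range(k):
        denom = sum(nwz[z]) + M * beta
        distribution.append(
            (nz[z] + alpha) * (nwz[z][w1] + beta) * (nwz[z][w2] + beta)
            / (denom * denom))
    # 3.4.3: roulette (proportional) selection on the distribution
    r = rng.random() * sum(distribution)
    cum, z_new = 0.0, k - 1
    for z, p in enumerate(distribution):
        cum += p
        if cum >= r:
            z_new = z
            break
    # 3.4.4: record the sampling result in the matrices
    b["topic"] = z_new
    nz[z_new] += 1
    nwz[z_new][w1] += 1
    nwz[z_new][w2] += 1
    return z_new
```

Note that the counts are removed and re-added inside one call, so nz and nwz stay consistent across a full traversal of B.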
8. The method of claim 7, wherein the procedure of step 3.5 is as follows:
3.5.1 setting a matrix lambda of size |B| x k as the word pair discrimination matrix, wherein |B| denotes the number of word pairs in the word pair set, and setting a matrix S of size |B| x k as the representative word pair matrix;
3.5.2 traversing the word pair set B; for the current word pair b, traversing all topics and calculating the word pair probability of b for topic z according to the formula:

p(z|b) ∝ (n_z + α) × (n_{wi|z} + β) × (n_{wj|z} + β) / (Σ_w n_{w|z} + Mβ)^2

then finding the maximum of the probabilities p(z|b) of word pair b over all topics, denoted max(p(z|b)), calculating for each topic z the ratio p(z|b)/max(p(z|b)), and storing the ratio at position lambda[b.index][z] in the matrix lambda;
3.5.3 traversing all values in the matrix lambda, judging each value lambda[b.index][z] against a Bernoulli rule with a set probability of 0.5, and storing the resulting 0 or 1 into the representative word pair matrix S, wherein the Bernoulli distribution is a discrete probability distribution; here 1 is returned when the input probability is greater than the set probability, and 0 is returned when it is less than or equal to the set probability.
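Steps 3.5.1 to 3.5.3 can be sketched as follows; `p_z_given_b` is a caller-supplied callable standing in for the word pair probability of step 3.5.2, and the threshold comparison implements the greater-than-0.5 judgment of step 3.5.3 (all names are illustrative assumptions):

```python
def representative_matrix(B, p_z_given_b, k, threshold=0.5):
    """Build the discrimination matrix lambda and the representative
    word pair matrix S of steps 3.5.1-3.5.3.

    lam[b.index][z] holds the ratio p(z|b)/max(p(z|b));
    S[b.index][z] is 1 when that ratio exceeds the set probability.
    """
    lam = [[0.0] * k for _ in range(len(B))]
    S = [[0] * k for _ in range(len(B))]
    for b in B:
        probs = [p_z_given_b(b, z) for z in range(k)]
        mx = max(probs)
        for z in range(k):
            ratio = probs[z] / mx if mx > 0 else 0.0
            lam[b["index"]][z] = ratio
            S[b["index"]][z] = 1 if ratio > threshold else 0
    return lam, S
```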
9. The method of claim 7, wherein the procedure of step 3.6 is as follows:
3.6.1 subtracting 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2], so as to exclude the influence of the current word pair b;
3.6.2 traversing each topic, setting the current topic as z, and judging: if the value of S[b.index][z] is 0, repeating the operations of steps 3.4.2, 3.4.3 and 3.4.4; if the value of S[b.index][z] is 1, replacing the formula in step 3.4.2 with the following formula:
wherein mu is a weight parameter for representative word pairs that is set before training and can be adjusted to change the training effect of the model, after which the operations of steps 3.4.3 and 3.4.4 are repeated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110570270.6A CN113378558B (en) | 2021-05-25 | 2021-05-25 | RESTful API document theme distribution extraction method based on representative word pairs |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113378558A true CN113378558A (en) | 2021-09-10 |
CN113378558B CN113378558B (en) | 2024-04-16 |
Family
ID=77571838
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110570270.6A Active CN113378558B (en) | 2021-05-25 | 2021-05-25 | RESTful API document theme distribution extraction method based on representative word pairs |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113378558B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108197144A (en) * | 2017-11-28 | 2018-06-22 | 河海大学 | Hot topic discovery method based on BTM and Single-pass |
CN110647626A (en) * | 2019-07-30 | 2020-01-03 | 浙江工业大学 | REST data service clustering method based on Internet service domain |
CN111191036A (en) * | 2019-12-30 | 2020-05-22 | 杭州远传新业科技有限公司 | Short text topic clustering method, device, equipment and medium |
CN111475609A (en) * | 2020-02-28 | 2020-07-31 | 浙江工业大学 | Improved K-means service clustering method around topic modeling |
CN112632215A (en) * | 2020-12-01 | 2021-04-09 | 重庆邮电大学 | Community discovery method and system based on word-pair semantic topic model |
Non-Patent Citations (2)
Title |
---|
YAN X et al.: "A biterm topic model for short texts", Proceedings of the 22nd International Conference on World Wide Web, Rio: IW3C2 *
陈婷 et al.: "Research on Web Service Clustering Method Based on BTM Topic Model" (基于BTM主题模型的Web服务聚类方法研究), Computer Engineering and Science (计算机工程与科学), vol. 40, no. 10, p. 1 *
Also Published As
Publication number | Publication date |
---|---|
CN113378558B (en) | 2024-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111144131B (en) | Network rumor detection method based on pre-training language model | |
CN111709243B (en) | Knowledge extraction method and device based on deep learning | |
US8200671B2 (en) | Generating a dictionary and determining a co-occurrence context for an automated ontology | |
CN101079031A (en) | Web page subject extraction system and method | |
CN107357777B (en) | Method and device for extracting label information | |
CN110263154A (en) | Network public opinion emotion situation quantization method, system and storage medium | |
WO2017161749A1 (en) | Method and device for information matching | |
JP4534666B2 (en) | Text sentence search device and text sentence search program | |
CN115186654B (en) | Method for generating document abstract | |
CN108491381B (en) | Syntax analysis method of Chinese binary structure | |
CN115759119B (en) | Financial text emotion analysis method, system, medium and equipment | |
CN106610952A (en) | Mixed text feature word extraction method | |
CN109885641A (en) | Method and system for Chinese full-text retrieval in databases | |
CN111859950A (en) | Method for automatically generating lecture notes | |
CN107092595A (en) | New keyword extraction techniques | |
CN112434533A (en) | Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium | |
CN107102986A (en) | Multi-threaded keyword extraction techniques in document | |
CN112632272A (en) | Microblog emotion classification method and system based on syntactic analysis | |
CN113378558B (en) | RESTful API document theme distribution extraction method based on representative word pairs | |
CN112784536B (en) | Processing method, system and storage medium of mathematical application problem solving model | |
CN114328865A (en) | Improved TextRank multi-feature fusion education resource keyword extraction method | |
CN114595684A (en) | Abstract generation method and device, electronic equipment and storage medium | |
CN107092669A (en) | Method for establishing intelligent robot interaction | |
Maheswari et al. | Rule based morphological variation removable stemming algorithm | |
CN113361270B (en) | Short text optimization topic model method for service data clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |