CN113378558A - RESTful API document topic distribution extraction method based on representative word pairs


Info

Publication number
CN113378558A
Authority
CN
China
Prior art keywords
word
topic
word pair
matrix
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110570270.6A
Other languages
Chinese (zh)
Other versions
CN113378558B (en)
Inventor
陆佳炜
郑嘉弘
赵伟
王小定
朱昊天
徐俊
程振波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202110570270.6A
Publication of CN113378558A
Application granted
Publication of CN113378558B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/44 Statistical methods, e.g. probability models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A RESTful API document topic distribution extraction method based on representative word pairs, the method comprising the following steps. The first step: perform word segmentation on the document, remove stop words, and normalize tenses. The second step: convert the word segmentation results into a word pair set. The third step: compute representative word pairs during the iterative training of the topic model, use the representative word pairs to implement a probability sampling algorithm, complete the training of the topic model, and output the document topic distribution of the RESTful API. The invention provides a RESTful API document topic distribution extraction method based on representative word pairs. It designs a word pair model based on the BTM topic model, searches during training for representative word pairs that are highly relevant to the currently sampled topic through a probability sampling strategy based on topic distribution information, and reduces the interference caused by noise by adjusting the weight of word pairs during sampling.

Description

RESTful API document topic distribution extraction method based on representative word pairs
Technical Field
The invention relates to a RESTful API document topic distribution extraction method based on representative word pairs.
Background
REST, short for Representational State Transfer, is a software architecture style whose idea can be summarized as representing resources with URIs and representing operations on those resources with HTTP methods. A RESTful API is a REST-style API: the front end simply sends a request containing the URI of the relevant resource and switches between the different operations on that resource through the HTTP methods (POST, GET, PUT, DELETE), so the server only needs to define a uniform response interface rather than parse each request in multiple ways. A RESTful API typically returns data in JSON or XML and is accompanied by description documents written in natural language. Because it is lightweight, simply structured and directly resource-oriented, it has gradually become the mainstream form of API service on the internet. Researchers therefore often compute the corresponding API features on the basis of these description documents.
A topic model can automatically obtain the implicit topic distribution of a corpus through iterative sampling, making full use of the implicit semantic information of the documents, and using the document topic distribution obtained by training a topic model as RESTful API feature information is a common approach. However, API description documents have the characteristics of short texts. A short text contains only a few words, yields little word co-occurrence information, and is semantically sparse. When processing short texts, conventional topic models cannot perform well because of this sparsity problem. On the other hand, description documents face the problem of noise interference: the text contains words that are not associated with any functional topic, called noise words, which can negatively affect topic determination. Only by solving these two problems can an effective and reasonable document topic distribution be extracted from the description documents.
The Biterm Topic Model (BTM), proposed in 2013, converts the word set of each segmented document into a set of word pairs (biterms) by combining the words two by two, samples this word pair set, and obtains the corresponding topic distribution through training. For example, the segmented text [create, user, account] yields the biterms (create, user), (create, account) and (user, account). By converting the original corpus into a word pair model, BTM increases the semantic co-occurrence information and thereby alleviates the sparsity problem of short texts.
Disclosure of Invention
To overcome the difficulty that the sparsity and noise of existing RESTful API documents bring to document topic distribution extraction, the invention provides a RESTful API document topic distribution extraction method based on representative word pairs.
The invention adopts the following technical scheme:
a RESTful API document topic distribution extraction method based on representative word pairs, the method comprising the steps of:
the first step is as follows: performing word segmentation processing on the document, and performing stop word removal and temporal normalization;
the second step is that: converting the word segmentation result into a word pair set;
the third step: and calculating a representative word pair in the iterative process of the topic model, realizing a probability sampling algorithm by using the representative word pair, completing the training of the topic model, and outputting the document topic distribution of RESTful API.
Further, the process of the first step is as follows:
1.1 read the RESTful API document information and convert it into a key-value pair D, with the API name as the key and the document content as the value;
1.2 traverse the document contents in D; let the current document content be d and set an empty set word_list; split d into sentences, remove punctuation marks, and then segment each sentence into words;
1.3 during the traversal, examine each segmented word: if it is not composed of special symbols, is not a pure number, and does not appear in the stop word list, normalize the word and store it in the word_list set of step 1.2; after every word has been examined, store word_list in D as the value in place of d.
Further, the process of the second step is as follows:
2.1 traverse the word segmentation results obtained in the first step to generate a non-repeating vocabulary Voc;
2.2 define a word pair (biterm) structure containing the sequence numbers of two different words in Voc; the smaller sequence number is stored as word1 and the larger as word2;
2.3 set an empty set whole_words as the storage set of all word segmentation results; traverse the key-value pair D and store the word_list set corresponding to each key into whole_words in turn;
2.4 traverse all word information in whole_words and convert each word into its sequence number in the vocabulary Voc;
2.5 generate the word pair set B.
Preferably, the step 2.5 is as follows:
2.5.1 traverse whole_words; let single_list be the list of vocabulary sequence numbers of the segmented words of the current document;
2.5.2 set a word pair set B for storing the word pair information;
2.5.3 traverse single_list with current element single_list(i), where single_list(i) is the vocabulary sequence number of the i-th word in single_list and 0 ≤ i < single_list.length; combine each single_list(i) with the vocabulary sequence number single_list(j) of every subsequent word to generate a word pair b, where i < j < single_list.length;
2.5.4 store the generated word pairs in the word pair set B and give each word pair b a sequence number in turn, denoted b.index.
Still further, the process of the third step is as follows:
3.1 set a zero matrix nz of size k × 1 for storing the number of word pairs assigned to each topic, where k is the number of topics; set a zero matrix nwz of size k × |Voc| for storing the number of times each vocabulary word is assigned to each topic, where |Voc| is the number of words in the vocabulary; a zero matrix is a matrix whose elements are all 0;
3.2 randomly assign a topic to each word pair and initialize nz and nwz;
3.3 set the total number of iterations, iteration, and the current iteration counter iter;
3.4 start the first iteration: traverse the word pair set B and sample each word pair b;
3.5 compute the representative word pair matrix S;
3.6 continue iterating: add 1 to the current iteration count iter, traverse the word pair set B, and sample each word pair b;
3.7 repeat the operation of step 3.5;
3.8 check iter, and stop iterating when iter equals iteration;
3.9 compute the document topic distribution θ according to the formula:

p(z | d) = nd_z / Σ_{z'} nd_{z'}

where p(z | d) denotes the probability of topic z for document d, and nd_z denotes the number of word pairs in document d that are assigned topic z.
The step 3.2 is as follows:
3.2.1 traverse the word pair set B and draw a random integer t, 0 ≤ t < k, for each word pair b; take t as the topic of word pair b, denoted b.topic;
3.2.2 traverse the word pair set B with the randomly assigned topics; let the current word pair be b, add 1 to the value at position nz[b.topic] in matrix nz, and add 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, where b.word1 denotes the value of word1 in the word pair and b.word2 denotes the value of word2 in the word pair; this completes the matrix initialization.
The step 3.4 is as follows:
3.4.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] to remove the influence of the current word pair b;
3.4.2 compute the sampling probability of each topic z according to the following formula:

p(z | z_{¬b}, B) ∝ (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

where p(z | z_{¬b}, B) denotes the probability that word pair b belongs to topic z after the influence of b itself has been removed; n_z denotes the number of word pairs assigned to topic z, i.e. the value of nz[z] in matrix nz; α and β are hyper-parameters; n_{wi|z} denotes the number of times the word wi whose sequence number is b.word1 has been assigned to topic z, i.e. the value of nwz[z][b.word1] in matrix nwz; n_{wj|z} denotes the number of times the word wj whose sequence number is b.word2 has been assigned to topic z, i.e. the value of nwz[z][b.word2] in matrix nwz; and M is the number of words in the vocabulary. Store the probabilities obtained for all topics, in order, in a list distribution;
3.4.3 use a roulette wheel operation to obtain the new topic of word pair b, denoted b.topic. The roulette wheel algorithm, also called the proportional selection algorithm, accumulates the probability distribution segment by segment to obtain a cumulative probability for each candidate, generates a random number in the interval [0,1], and outputs the candidate whose cumulative probability is the smallest value greater than or equal to the random number;
3.4.4 add 1 to the value at position nz[b.topic] in matrix nz, and add 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, so that the matrices record the sampling result.
The step 3.5 is as follows:
3.5.1 set a matrix lambda of size |B| × k as the word pair discrimination matrix, where |B| is the number of word pairs in the word pair set, and set a matrix S of size |B| × k as the representative word pair matrix;
3.5.2 traverse the word pair set B; let the current word pair be b, traverse all topics, and compute the word pair probability of word pair b for topic z according to the following formula:

p(z | b) ∝ (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

normalized so that the probabilities over all k topics sum to 1, the symbols having the same meanings as in step 3.4.2. Find the maximum of the probabilities p(z | b) of word pair b over all topics, denoted max(p(z | b)); for each topic z compute the ratio p(z | b)/max(p(z | b)) and store it at position lambda[b.index][z] in matrix lambda;
3.5.3 traverse all values in matrix lambda and judge each value lambda[b.index][z] against a Bernoulli distribution with set probability 0.5, storing the resulting 0 or 1 in the representative word pair matrix S. The Bernoulli distribution is a discrete probability distribution: when the input value is greater than the set probability the result is 1, and when it is less than or equal to the set probability the result is 0.
The step 3.6 is as follows:
3.6.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] to remove the influence of the current word pair b;
3.6.2 traverse each topic; let the current topic be z and judge: if the value of S[b.index][z] is 0, repeat the operations of steps 3.4.2, 3.4.3 and 3.4.4; if the value of S[b.index][z] is 1, replace the formula of step 3.4.2 with the following formula:

p(z | z_{¬b}, B) ∝ μ · (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

where μ is the weight parameter of the representative word pairs, set before training and adjustable to change the training effect of the model; then repeat the operations of steps 3.4.3 and 3.4.4.
The invention has the following beneficial effects: (1) taking RESTful API documents as the research object fits the functional semantic requirements of RESTful APIs and makes the method well suited to feature extraction; (2) the word-pair-based topic model greatly increases the co-occurrence information and overcomes the high sparsity of API document text; (3) the probability sampling algorithm implemented through the computation of representative word pairs reduces the influence of short-text noise words without affecting the time complexity of the algorithm, improving the reliability of the extracted API document topic distribution.
Detailed Description
The present invention is further explained below.
A RESTful API document topic distribution extraction method based on representative word pairs comprises the following steps.
The first step: perform word segmentation on the document, remove stop words, and normalize tenses.
the process of the first step is as follows:
1.1 read the RESTful API document information and convert it into a key-value pair D, with the API name as the key and the document content as the value;
1.2 traverse the document contents in D; let the current document content be d and set an empty set word_list; split d into sentences using the natural language processing library NLTK, remove punctuation marks, and then segment each sentence into words;
1.3 during the traversal, examine each segmented word by means of a regular expression: if it is not composed of special symbols, is not a pure number, and does not appear in the stop word list, normalize the word and store it in the word_list set of step 1.2; after every word has been examined, store word_list in D as the value in place of d.
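For concreteness, the first step could be implemented along the lines of the following sketch, using the NLTK library mentioned above. The helper name preprocess, the sample input raw_documents, and the use of WordNet verb lemmatization for tense normalization are illustrative assumptions, not part of the invention.

```python
import re
import string
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(doc_text):
    """Steps 1.2-1.3: sentence split, strip punctuation, segment, filter, normalize."""
    word_list = []
    for sentence in sent_tokenize(doc_text):
        sentence = sentence.translate(str.maketrans("", "", string.punctuation))
        for word in word_tokenize(sentence.lower()):
            # keep words that are not special symbols, not pure numbers,
            # and not in the stop word list
            if re.fullmatch(r"[a-z]+", word) and word not in stop_words:
                # tense normalization via verb lemmatization (assumed choice)
                word_list.append(lemmatizer.lemmatize(word, pos="v"))
    return word_list

# Step 1.1: key-value pair D keyed by API name (sample input for illustration)
raw_documents = {"weather-api": "Returns current weather data for a given city."}
D = {name: preprocess(text) for name, text in raw_documents.items()}
```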
The second step: convert the word segmentation results into a word pair set.
The process of the second step is as follows:
2.1 traverse the word segmentation results obtained in the first step to generate a non-repeating vocabulary Voc;
2.2 define a word pair (biterm) structure containing the sequence numbers of two different words in Voc; the smaller sequence number is stored as word1 and the larger as word2;
2.3 set an empty set whole_words as the storage set of all word segmentation results; traverse the key-value pair D and store the word_list set corresponding to each key into whole_words in turn;
2.4 traverse all word information in whole_words and convert each word into its sequence number in the vocabulary Voc;
2.5 generate the word pair set B, with the following steps:
2.5.1 traverse whole_words; let single_list be the list of vocabulary sequence numbers of the segmented words of the current document;
2.5.2 set a word pair set B for storing the word pair information;
2.5.3 traverse single_list with current element single_list(i), where single_list(i) is the vocabulary sequence number of the i-th word in single_list and 0 ≤ i < single_list.length; combine each single_list(i) with the vocabulary sequence number single_list(j) of every subsequent word to generate a word pair b, where i < j < single_list.length;
2.5.4 store the generated word pairs in the word pair set B and give each word pair b a sequence number in turn, denoted b.index.
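The conversion of the second step might be sketched as follows, continuing from the D built above; the Biterm dataclass is a hypothetical rendering of the biterm structure of step 2.2.

```python
from dataclasses import dataclass

@dataclass
class Biterm:
    word1: int        # smaller vocabulary sequence number (step 2.2)
    word2: int        # larger vocabulary sequence number
    index: int = -1   # word pair sequence number b.index (step 2.5.4)
    topic: int = -1   # assigned topic b.topic, set during training

# Steps 2.1/2.3: non-repeating vocabulary from all segmentation results
whole_words = list(D.values())
Voc = sorted({w for words in whole_words for w in words})
word_id = {w: i for i, w in enumerate(Voc)}

# Step 2.4: replace each word by its sequence number in Voc
whole_ids = [[word_id[w] for w in words] for words in whole_words]

# Step 2.5: pair each word with every later word of the same document
B = []
for single_list in whole_ids:
    for i in range(len(single_list)):
        for j in range(i + 1, len(single_list)):
            if single_list[i] == single_list[j]:
                continue  # the biterm structure pairs two different words
            w1, w2 = sorted((single_list[i], single_list[j]))
            B.append(Biterm(w1, w2, index=len(B)))
```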
The third step: compute representative word pairs during the iterative training of the topic model, use the representative word pairs to implement a probability sampling algorithm, complete the training of the topic model, and output the document topic distribution of the RESTful API.
the third step process is as follows:
3.1 setting a zero matrix nz with the size k x 1 for storing the word pairs corresponding to each topic, wherein k is the number of topics, setting a zero matrix nwz with the size k x | Voc | for storing the times of each vocabulary divided into each topic, wherein | Voc | represents the number of vocabularies in the vocabulary, and the zero matrix refers to a matrix with matrix elements all being 0;
3.2 randomly assign a topic to each word pair and initialize nz and nwz, with the following steps:
3.2.1 traverse the word pair set B and draw a random integer t, 0 ≤ t < k, for each word pair b; take t as the topic of word pair b, denoted b.topic;
3.2.2 traverse the word pair set B with the randomly assigned topics; let the current word pair be b, add 1 to the value at position nz[b.topic] in matrix nz, and add 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, where b.word1 denotes the value of word1 in the word pair and b.word2 denotes the value of word2 in the word pair; this completes the matrix initialization;
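A minimal sketch of the initialization of step 3.2, assuming the word pair set B built above; the value of k and the use of numpy are illustrative conveniences.

```python
import random
import numpy as np

k = 20                                     # number of topics (assumed value)
nz = np.zeros(k, dtype=int)                # k x 1: word pairs per topic
nwz = np.zeros((k, len(Voc)), dtype=int)   # k x |Voc|: word-topic counts

# Step 3.2.1: random topic 0 <= t < k for every word pair
for b in B:
    b.topic = random.randrange(k)

# Step 3.2.2: accumulate the counts implied by the random assignment
for b in B:
    nz[b.topic] += 1
    nwz[b.topic][b.word1] += 1
    nwz[b.topic][b.word2] += 1
```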
3.3 set the total number of iterations, iteration, and the current iteration counter iter;
3.4 start the first iteration: traverse the word pair set B and perform the sampling operation on each word pair b, with the following steps:
3.4.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] to remove the influence of the current word pair b;
3.4.2 compute the sampling probability of each topic z according to the following formula:

p(z | z_{¬b}, B) ∝ (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

where p(z | z_{¬b}, B) denotes the probability that word pair b belongs to topic z after the influence of b itself has been removed; n_z denotes the number of word pairs assigned to topic z, i.e. the value of nz[z] in matrix nz; α and β are hyper-parameters; n_{wi|z} denotes the number of times the word wi whose sequence number is b.word1 has been assigned to topic z, i.e. the value of nwz[z][b.word1] in matrix nwz; n_{wj|z} denotes the number of times the word wj whose sequence number is b.word2 has been assigned to topic z, i.e. the value of nwz[z][b.word2] in matrix nwz; and M is the number of words in the vocabulary. Store the probabilities obtained for all topics, in order, in a list distribution;
3.4.3 use a roulette wheel operation to obtain the new topic of word pair b, denoted b.topic. The roulette wheel algorithm, also called the proportional selection algorithm, accumulates the probability distribution segment by segment to obtain a cumulative probability for each candidate, generates a random number in the interval [0,1], and outputs the candidate whose cumulative probability is the smallest value greater than or equal to the random number;
3.4.4 add 1 to the value at position nz[b.topic] in matrix nz, and add 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, so that the matrices record the sampling result;
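One sampling pass over a word pair (steps 3.4.1 to 3.4.4) could then be sketched as below, using the conditional probability reconstructed above. The S_row/mu branch anticipates the weighted formula of step 3.6.2, and the values of alpha and beta are illustrative.

```python
alpha, beta = 2.0, 0.01        # hyper-parameters (illustrative values)
M = len(Voc)                   # number of words in the vocabulary

def sample_biterm(b, S_row=None, mu=1.0):
    # 3.4.1 / 3.6.1: remove the influence of the current word pair b
    nz[b.topic] -= 1
    nwz[b.topic][b.word1] -= 1
    nwz[b.topic][b.word2] -= 1

    # 3.4.2: (unnormalized) probability of each topic z
    distribution = []
    for z in range(k):
        p = (nz[z] + alpha) \
            * (nwz[z][b.word1] + beta) * (nwz[z][b.word2] + beta) \
            / (2 * nz[z] + M * beta) ** 2
        if S_row is not None and S_row[z] == 1:
            p *= mu            # weighted formula of step 3.6.2
        distribution.append(p)

    # 3.4.3: roulette wheel selection; scaling the random draw by the
    # total is equivalent to normalizing the distribution first
    r = random.random() * sum(distribution)
    cumulative = 0.0
    for z, p in enumerate(distribution):
        cumulative += p
        if cumulative >= r:
            b.topic = z
            break

    # 3.4.4: accept the sampling result
    nz[b.topic] += 1
    nwz[b.topic][b.word1] += 1
    nwz[b.topic][b.word2] += 1
```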
3.5 compute the representative word pair matrix S, with the following steps:
3.5.1 set a matrix lambda of size |B| × k as the word pair discrimination matrix, where |B| is the number of word pairs in the word pair set, and set a matrix S of size |B| × k as the representative word pair matrix;
3.5.2 traverse the word pair set B; let the current word pair be b, traverse all topics, and compute the word pair probability of word pair b for topic z according to the following formula:

p(z | b) ∝ (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

normalized so that the probabilities over all k topics sum to 1, the symbols having the same meanings as in step 3.4.2. Find the maximum of the probabilities p(z | b) of word pair b over all topics, denoted max(p(z | b)); for each topic z compute the ratio p(z | b)/max(p(z | b)) and store it at position lambda[b.index][z] in matrix lambda;
3.5.3 traverse all values in matrix lambda and judge each value lambda[b.index][z] against a Bernoulli distribution with set probability 0.5, storing the resulting 0 or 1 in the representative word pair matrix S; the Bernoulli distribution is a discrete probability distribution: when the input value is greater than the set probability the result is 1, and when it is less than or equal to the set probability the result is 0;
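The representative word pair matrix of step 3.5 might be computed as in this sketch; the hard 0.5 threshold realizes the Bernoulli judgment with set probability 0.5 described above, and the normalization constant cancels in the ratio, so the unnormalized probabilities suffice.

```python
def compute_S():
    lam = np.zeros((len(B), k))           # word pair discrimination matrix
    S = np.zeros((len(B), k), dtype=int)  # representative word pair matrix
    for b in B:
        # 3.5.2: p(z|b) up to a constant factor, for every topic
        probs = [(nz[z] + alpha)
                 * (nwz[z][b.word1] + beta) * (nwz[z][b.word2] + beta)
                 / (2 * nz[z] + M * beta) ** 2
                 for z in range(k)]
        p_max = max(probs)
        for z in range(k):
            lam[b.index][z] = probs[z] / p_max
            # 3.5.3: result is 1 above the set probability, 0 otherwise
            S[b.index][z] = 1 if lam[b.index][z] > 0.5 else 0
    return S
```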
3.6 continue iterating: add 1 to the current iteration count iter, traverse the word pair set B, and sample each word pair b, with the following steps:
3.6.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] to remove the influence of the current word pair b;
3.6.2 traverse each topic; let the current topic be z and judge: if the value of S[b.index][z] is 0, repeat the operations of steps 3.4.2, 3.4.3 and 3.4.4; if the value of S[b.index][z] is 1, replace the formula of step 3.4.2 with the following formula:

p(z | z_{¬b}, B) ∝ μ · (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

where μ is the weight parameter of the representative word pairs, set before training and adjustable to change the training effect of the model; then repeat the operations of steps 3.4.3 and 3.4.4;
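Steps 3.3 to 3.8 can then be driven by a loop like the following sketch: the first pass samples without S (step 3.4), every pass recomputes S (steps 3.5 and 3.7), and later passes use it for weighted sampling (step 3.6). The values of iteration and mu are illustrative.

```python
iteration = 500          # total number of iterations (assumed value)
S = None
for iter_ in range(iteration):
    for b in B:
        row = S[b.index] if S is not None else None
        sample_biterm(b, S_row=row, mu=1.5)  # mu: representative-pair weight
    S = compute_S()      # steps 3.5 / 3.7: recompute after each pass
```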
3.7 repeat the operation of step 3.5;
3.8 check iter, and stop iterating when iter equals iteration;
3.9 compute the document topic distribution θ according to the formula:

p(z | d) = nd_z / Σ_{z'} nd_{z'}

where p(z | d) denotes the probability of topic z for document d, and nd_z denotes the number of word pairs in document d that are assigned topic z.
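Finally, the document topic distribution θ of step 3.9 might be obtained per document as below, assuming the word pairs of each document can be recovered (for example by recording per-document index ranges when B is built). The resulting vector is the document topic distribution of the RESTful API that the method outputs.

```python
def topic_distribution(doc_biterms):
    # nd_z: number of the document's word pairs assigned to topic z
    nd = np.zeros(k)
    for b in doc_biterms:
        nd[b.topic] += 1
    return nd / nd.sum()   # theta: p(z|d) for every topic z
```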
The embodiments described in this specification merely illustrate implementations of the inventive concept and are given for purposes of illustration only. The scope of the present invention should not be construed as limited to the particular forms set forth in the embodiments, but also covers the equivalents that those skilled in the art can conceive on the basis of the inventive concept.

Claims (9)

1. A RESTful API document topic distribution extraction method based on representative word pairs, characterized in that the method comprises the following steps:
the first step: perform word segmentation on the document, remove stop words, and normalize tenses;
the second step: convert the word segmentation results into a word pair set;
the third step: compute representative word pairs during the iterative training of the topic model, use the representative word pairs to implement a probability sampling algorithm, complete the training of the topic model, and output the document topic distribution of the RESTful API.
2. The method of claim 1, wherein the process of the first step is as follows:
1.1 read the RESTful API document information and convert it into a key-value pair D, with the API name as the key and the document content as the value;
1.2 traverse the document contents in D; let the current document content be d and set an empty set word_list; split d into sentences, remove punctuation marks, and then segment each sentence into words;
1.3 during the traversal, examine each segmented word: if it is not composed of special symbols, is not a pure number, and does not appear in the stop word list, normalize the word and store it in the word_list set of step 1.2; after every word has been examined, store word_list in D as the value in place of d.
3. The method of claim 2, wherein the process of the second step is as follows:
2.1 traverse the word segmentation results obtained in the first step to generate a non-repeating vocabulary Voc;
2.2 define a word pair (biterm) structure containing the sequence numbers of two different words in Voc; the smaller sequence number is stored as word1 and the larger as word2;
2.3 set an empty set whole_words as the storage set of all word segmentation results; traverse the key-value pair D and store the word_list set corresponding to each key into whole_words in turn;
2.4 traverse all word information in whole_words and convert each word into its sequence number in the vocabulary Voc;
2.5 generate the word pair set B.
4. The method of claim 3, wherein the step 2.5 is as follows:
2.5.1 traverse whole_words; let single_list be the list of vocabulary sequence numbers of the segmented words of the current document;
2.5.2 set a word pair set B for storing the word pair information;
2.5.3 traverse single_list with current element single_list(i), where single_list(i) is the vocabulary sequence number of the i-th word in single_list and 0 ≤ i < single_list.length; combine each single_list(i) with the vocabulary sequence number single_list(j) of every subsequent word to generate a word pair b, where i < j < single_list.length;
2.5.4 store the generated word pairs in the word pair set B and give each word pair b a sequence number in turn, denoted b.index.
5. The RESTful API document topic distribution extraction method based on representative word pairs according to one of claims 1 to 4, characterized in that the process of the third step is as follows:
3.1 set a zero matrix nz of size k × 1 for storing the number of word pairs assigned to each topic, where k is the number of topics; set a zero matrix nwz of size k × |Voc| for storing the number of times each vocabulary word is assigned to each topic, where |Voc| is the number of words in the vocabulary; a zero matrix is a matrix whose elements are all 0;
3.2 randomly assign a topic to each word pair and initialize nz and nwz;
3.3 set the total number of iterations, iteration, and the current iteration counter iter;
3.4 start the first iteration: traverse the word pair set B and sample each word pair b;
3.5 compute the representative word pair matrix S;
3.6 continue iterating: add 1 to the current iteration count iter, traverse the word pair set B, and sample each word pair b;
3.7 repeat the operation of step 3.5;
3.8 check iter, and stop iterating when iter equals iteration;
3.9 compute the document topic distribution θ according to the formula:

p(z | d) = nd_z / Σ_{z'} nd_{z'}

where p(z | d) denotes the probability of topic z for document d, and nd_z denotes the number of word pairs in document d that are assigned topic z.
6. The method of claim 5, wherein the step 3.2 is as follows:
3.2.1 traverse the word pair set B and draw a random integer t, 0 ≤ t < k, for each word pair b; take t as the topic of word pair b, denoted b.topic;
3.2.2 traverse the word pair set B with the randomly assigned topics; let the current word pair be b, add 1 to the value at position nz[b.topic] in matrix nz, and add 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, where b.word1 denotes the value of word1 in the word pair and b.word2 denotes the value of word2 in the word pair; this completes the matrix initialization.
7. The method of claim 5, wherein the step 3.4 is as follows:
3.4.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] to remove the influence of the current word pair b;
3.4.2 compute the sampling probability of each topic z according to the following formula:

p(z | z_{¬b}, B) ∝ (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

where p(z | z_{¬b}, B) denotes the probability that word pair b belongs to topic z after the influence of b itself has been removed; n_z denotes the number of word pairs assigned to topic z, i.e. the value of nz[z] in matrix nz; α and β are hyper-parameters; n_{wi|z} denotes the number of times the word wi whose sequence number is b.word1 has been assigned to topic z, i.e. the value of nwz[z][b.word1] in matrix nwz; n_{wj|z} denotes the number of times the word wj whose sequence number is b.word2 has been assigned to topic z, i.e. the value of nwz[z][b.word2] in matrix nwz; and M is the number of words in the vocabulary; store the probabilities obtained for all topics, in order, in a list distribution;
3.4.3 use a roulette wheel operation to obtain the new topic of word pair b, denoted b.topic; the roulette wheel algorithm, also called the proportional selection algorithm, accumulates the probability distribution segment by segment to obtain a cumulative probability for each candidate, generates a random number in the interval [0,1], and outputs the candidate whose cumulative probability is the smallest value greater than or equal to the random number;
3.4.4 add 1 to the value at position nz[b.topic] in matrix nz, and add 1 to the values at positions nwz[b.topic][b.word1] and nwz[b.topic][b.word2] in matrix nwz, so that the matrices record the sampling result.
8. The method of claim 7, wherein the step 3.5 is as follows:
3.5.1 set a matrix lambda of size |B| × k as the word pair discrimination matrix, where |B| is the number of word pairs in the word pair set, and set a matrix S of size |B| × k as the representative word pair matrix;
3.5.2 traverse the word pair set B; let the current word pair be b, traverse all topics, and compute the word pair probability of word pair b for topic z according to the following formula:

p(z | b) ∝ (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

normalized so that the probabilities over all k topics sum to 1, the symbols having the same meanings as in step 3.4.2; find the maximum of the probabilities p(z | b) of word pair b over all topics, denoted max(p(z | b)); for each topic z compute the ratio p(z | b)/max(p(z | b)) and store it at position lambda[b.index][z] in matrix lambda;
3.5.3 traverse all values in matrix lambda and judge each value lambda[b.index][z] against a Bernoulli distribution with set probability 0.5, storing the resulting 0 or 1 in the representative word pair matrix S; the Bernoulli distribution is a discrete probability distribution: when the input value is greater than the set probability the result is 1, and when it is less than or equal to the set probability the result is 0.
9. The method of claim 7, wherein the step 3.6 is as follows:
3.6.1 subtract 1 from each of nz[b.topic], nwz[b.topic][b.word1] and nwz[b.topic][b.word2] to remove the influence of the current word pair b;
3.6.2 traverse each topic; let the current topic be z and judge: if the value of S[b.index][z] is 0, repeat the operations of steps 3.4.2, 3.4.3 and 3.4.4; if the value of S[b.index][z] is 1, replace the formula of step 3.4.2 with the following formula:

p(z | z_{¬b}, B) ∝ μ · (n_z + α) · (n_{wi|z} + β)(n_{wj|z} + β) / (2n_z + Mβ)²

where μ is the weight parameter of the representative word pairs, set before training and adjustable to change the training effect of the model; then repeat the operations of steps 3.4.3 and 3.4.4.
CN202110570270.6A 2021-05-25 2021-05-25 RESTful API document topic distribution extraction method based on representative word pairs Active CN113378558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110570270.6A CN113378558B (en) 2021-05-25 2021-05-25 RESTful API document topic distribution extraction method based on representative word pairs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110570270.6A CN113378558B (en) 2021-05-25 2021-05-25 RESTful API document topic distribution extraction method based on representative word pairs

Publications (2)

Publication Number Publication Date
CN113378558A true CN113378558A (en) 2021-09-10
CN113378558B CN113378558B (en) 2024-04-16

Family

ID=77571838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110570270.6A Active CN113378558B (en) RESTful API document topic distribution extraction method based on representative word pairs

Country Status (1)

Country Link
CN (1) CN113378558B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197144A (en) * 2017-11-28 2018-06-22 河海大学 A kind of much-talked-about topic based on BTM and Single-pass finds method
CN110647626A (en) * 2019-07-30 2020-01-03 浙江工业大学 REST data service clustering method based on Internet service domain
CN111191036A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Short text topic clustering method, device, equipment and medium
CN111475609A (en) * 2020-02-28 2020-07-31 浙江工业大学 Improved K-means service clustering method around topic modeling
CN112632215A (en) * 2020-12-01 2021-04-09 重庆邮电大学 Community discovery method and system based on word-pair semantic topic model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAN X et al., "A biterm topic model for short texts", Proceedings of the 22nd International Conference on World Wide Web, Rio: IW3C2 *
CHEN Ting et al., "Research on a Web service clustering method based on the BTM topic model", Computer Engineering & Science, vol. 40, no. 10 *

Also Published As

Publication number Publication date
CN113378558B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN111144131B (en) Network rumor detection method based on pre-training language model
CN111709243B (en) Knowledge extraction method and device based on deep learning
US8200671B2 (en) Generating a dictionary and determining a co-occurrence context for an automated ontology
CN101079031A (en) Web page subject extraction system and method
CN107357777B (en) Method and device for extracting label information
CN110263154A (en) A kind of network public-opinion emotion situation quantization method, system and storage medium
WO2017161749A1 (en) Method and device for information matching
JP4534666B2 (en) Text sentence search device and text sentence search program
CN115186654B (en) Method for generating document abstract
CN108491381B (en) Syntax analysis method of Chinese binary structure
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN106610952A (en) Mixed text feature word extraction method
CN109885641A (en) A kind of method and system of database Chinese Full Text Retrieval
CN111859950A (en) Method for automatically generating lecture notes
CN107092595A (en) New keyword extraction techniques
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
CN107102986A (en) Multi-threaded keyword extraction techniques in document
CN112632272A (en) Microblog emotion classification method and system based on syntactic analysis
CN113378558B (en) RESTful API document topic distribution extraction method based on representative word pairs
CN112784536B (en) Processing method, system and storage medium of mathematical application problem solving model
CN114328865A (en) Improved TextRank multi-feature fusion education resource keyword extraction method
CN114595684A (en) Abstract generation method and device, electronic equipment and storage medium
CN107092669A (en) A kind of method for setting up intelligent robot interaction
Maheswari et al. Rule based morphological variation removable stemming algorithm
CN113361270B (en) Short text optimization topic model method for service data clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant