CN113139056A - Network data clustering method, clustering device, electronic device and medium - Google Patents

Network data clustering method, clustering device, electronic device and medium

Info

Publication number
CN113139056A
CN113139056A
Authority
CN
China
Prior art keywords
network data
word
data object
text
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110450664.8A
Other languages
Chinese (zh)
Inventor
朱书苗
颜开华
邓洁
经纬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis


Abstract

The disclosure provides a clustering method of network data, and belongs to the technical field of cloud computing. The method comprises the following steps: firstly, obtaining description texts and labels of Q network data objects, wherein Q is an integer greater than 1; then, aiming at each network data object, obtaining a semantic association coefficient corresponding to a word based on semantic similarity between the word in the description text of the network data object and a label of the network data object, wherein the semantic association coefficient is used for measuring the correlation degree of the corresponding word and the service function of the network data object, and processing the description text of the network data object based on the semantic association coefficient corresponding to each word to obtain a text feature vector of the network data object; and then clustering the Q network data objects based on the text feature vectors of the Q network data objects. The disclosure also provides a clustering device of network data, an electronic device, and a computer-readable storage medium.

Description

Network data clustering method, clustering device, electronic device and medium
Technical Field
The present disclosure belongs to the technical field of cloud computing, and more particularly, to a method and an apparatus for clustering network data, an electronic device, and a medium.
Background
With the development of Web 2.0 technology, many innovative applications and software systems, such as blogs, wikis, network maps, online shopping and search systems, etc., have emerged on the internet. With the rapid increase of network data resources on the internet, how to find out network data needed by a user quickly and accurately becomes an urgent problem to be solved.
According to a large amount of research work, similar network data can be gathered in the same cluster after the network data objects are clustered, so that when the network data are searched or called, the network data can be searched in one or a plurality of related clusters according to the keyword information, the search space of the network data can be effectively reduced, and the discovery of the network data is promoted. In this process, the accuracy of network data clustering is critical to accurately and quickly discovering network data.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a network data clustering method, a clustering device, an electronic device, and a medium, which can improve the clustering accuracy of network data to a certain extent, so that the network data clustering result is more consistent with the real service function distribution of the network data.
In one aspect of the disclosed embodiments, a method for clustering network data is provided. The method comprises the following steps: obtaining description texts and labels of Q network data objects, wherein Q is an integer greater than 1; for each network data object, obtaining a semantic association coefficient corresponding to a word based on semantic similarity between the word in the description text of the network data object and a label of the network data object, wherein the semantic association coefficient is used for measuring the degree of correlation between the corresponding word and a service function of the network data object, and processing the description text of the network data object based on the semantic association coefficient corresponding to each word to obtain a text feature vector of the network data object; and clustering the Q network data objects based on the text feature vectors of the Q network data objects.
According to an embodiment of the present disclosure, the obtaining a semantic association coefficient corresponding to a word based on semantic similarity between the word in the description text of the network data object and a tag of the network data object includes obtaining a text semantic weight corresponding to the word based on accumulation of similarity between a feature word vector corresponding to each word and each tag word vector in a tag word vector set, where the semantic association coefficient includes the text semantic weight. The label word vector set is a set of word vectors obtained after processing labels of the network data objects, and the number of the label word vectors contained in the label word vector set is equal to or greater than the number of label words in the labels of the network data objects.
According to an embodiment of the present disclosure, the obtaining a semantic association coefficient corresponding to a word based on semantic similarity between the word in the description text of the network data object and a tag of the network data object further includes obtaining a semantic word frequency fusion weight of the word based on the text semantic weight corresponding to each word and a word frequency-inverse text frequency TF-IDF value statistically obtained by the word in Q network data objects; and the semantic association coefficient comprises the semantic word frequency fusion weight.
According to an embodiment of the present disclosure, obtaining a semantic association coefficient corresponding to a word based on semantic similarity between the word in the description text of the network data object and the tag of the network data object further includes: processing the description text of the network data object by using a Word2vec model to obtain a feature Word vector set corresponding to the network data object, wherein the feature Word vector set is a set formed by feature Word vectors corresponding to words in the description text of the network data object; and processing the label of the network data object by using the Word2vec model to obtain the label Word vector set.
According to an embodiment of the present disclosure, the processing the label of the network data object by using the Word2vec model to obtain the label Word vector set includes: searching R words most similar to each label Word in the labels of the network data object by using the Word2vec model, wherein R is an integer greater than or equal to 1; combining all tag words in the tags of the network data object and R words most similar to each tag word together to obtain an expansion tag set; and processing the extended tag set by using the Word2vec model to obtain the tagged Word vector set.
According to an embodiment of the present disclosure, the processing the description text of the network data object based on the semantic relation coefficient corresponding to each word to obtain the text feature vector of the network data object includes: processing the description text of the network data object to obtain a feature word vector set corresponding to the network data object, wherein the feature word vector set is a set formed by feature word vectors corresponding to words in the description text of the network data object; and taking the semantic association coefficient corresponding to each word as the weight of the feature word vector corresponding to the word, and performing weighting processing on the feature word vectors in the feature word vector set to obtain the text feature vector.
According to an embodiment of the present disclosure, the clustering of the Q network data objects based on their text feature vectors includes: constructing a Q × Q similarity matrix based on pairwise similarities between the text feature vectors of the Q network data objects, wherein the (i, j)-th element of the similarity matrix represents the similarity between the text feature vector of the i-th network data object and the text feature vector of the j-th network data object; and taking the similarity matrix as the input of a k-means algorithm to cluster the Q network data objects.
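The similarity-matrix step can be sketched as follows. This is a minimal illustration that assumes cosine similarity as the pairwise measure (the disclosure does not fix a particular similarity function) and uses tiny hand-made vectors in place of real text feature vectors:

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(text_vectors):
    # Q x Q matrix: entry (i, j) is the similarity between the text
    # feature vectors of network data objects i and j
    q = len(text_vectors)
    return [[cosine(text_vectors[i], text_vectors[j]) for j in range(q)]
            for i in range(q)]

# Toy text feature vectors for Q = 3 network data objects
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
m = similarity_matrix(vecs)
```

The resulting matrix is symmetric with a unit diagonal, and would then be fed to a k-means-style algorithm as described above.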
According to an embodiment of the present disclosure, the network data object comprises a Mashup service.
In another aspect of the disclosed embodiments, a network data clustering device is provided. The device comprises an acquisition module, a feature extraction module and a clustering module. The feature extraction module comprises a semantic association extraction submodule and a text feature extraction submodule. The acquisition module is used for acquiring description texts and labels of Q network data objects, wherein Q is an integer larger than 1.
The semantic association extraction submodule is used for obtaining a semantic association coefficient corresponding to a word based on semantic similarity between the word in the description text of the network data object and the label of the network data object aiming at each network data object, and the semantic association coefficient is used for measuring the degree of correlation between the word and the service function of the network data object. The text feature extraction submodule is used for processing the description text of the network data object based on the semantic association coefficient corresponding to each word aiming at each network data object to obtain a text feature vector of the network data object. And the clustering module is used for clustering the Q network data objects based on the text characteristic vectors of the Q network data objects.
According to an embodiment of the present disclosure, the feature extraction module further includes a word vector extraction sub-module. The word vector extraction sub-module is to: processing the description text of the network data object by using a Word2vec model to obtain a feature Word vector set corresponding to the network data object, wherein the feature Word vector set is a set formed by feature Word vectors corresponding to words in the description text of the network data object; and processing the label of the network data object by using the Word2vec model to obtain the label Word vector set.
According to an embodiment of the present disclosure, the feature extraction module further includes a tag expansion submodule. The tag expansion submodule is used for searching R words most similar to each tag Word in the tags of the network data object by using the Word2vec model, wherein R is an integer greater than or equal to 1; and combining all the label words in the labels of the network data object and the R words most similar to each label word together to obtain an expansion label set. And the Word vector extraction submodule is also used for processing the extended tag set by using the Word2vec model to obtain the tag Word vector set.
In another aspect of the disclosed embodiments, an electronic device is provided. The electronic device includes one or more memories, and one or more processors. The memory stores executable instructions. The processor executes the executable instructions to implement the method as described above.
In another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, storing computer-executable instructions, which when executed, implement the method as described above.
In another aspect of the disclosed embodiments, there is provided a computer program comprising computer executable instructions for implementing the method as described above when executed.
One or more of the above-described embodiments may provide the following advantages or benefits: in the process of clustering the network data objects, the semantic coefficients describing the correlation degree of words in the text and the service functions of the network data objects are combined to extract the characteristics of the network data objects, so that the distribution of text characteristic vectors of the network data objects in a vector space is more consistent with the distribution of real service functions of the network data objects, and the accuracy of clustering the network data objects is improved.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
fig. 1 schematically illustrates an application scenario of a clustering method and apparatus for network data according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a method of clustering network data according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a conceptual illustration of extracting textual feature vectors of a network data object according to an embodiment of the disclosure;
FIG. 4 schematically shows a flow chart for obtaining semantic relevance coefficients according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a flow chart of a method of clustering network data according to another embodiment of the present disclosure;
fig. 6 schematically illustrates a flow chart of a method of Mashup service clustering according to an embodiment of the present disclosure;
fig. 7 schematically illustrates a word vector transition diagram in Mashup service clustering according to an embodiment of the present disclosure;
fig. 8 schematically shows a block diagram of a clustering arrangement of network data according to an embodiment of the present disclosure; and
FIG. 9 schematically illustrates a block diagram of an electronic device suitable for implementing network data clustering in accordance with an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
In the related art of network data clustering, when calculating the similarity between network data objects, the weight calculation for keywords and non-keywords in the description text usually considers only the frequency with which words appear in the description text and ignores the semantic information of the words, which may cause the distribution of the text feature vectors of the network data objects in the vector space to deviate from the distribution of the real service functions of the network data objects.
In view of this, embodiments of the present disclosure provide a clustering method for network data, a clustering device, an electronic device, and a computer-readable storage medium, which can perform clustering in conjunction with text semantic information of a network data object. The method comprises the following steps: firstly, obtaining description texts and labels of Q network data objects, wherein Q is an integer greater than 1; then, aiming at each network data object, obtaining a semantic association coefficient corresponding to a word based on semantic similarity between the word in the description text of the network data object and a label of the network data object, wherein the semantic association coefficient is used for measuring the correlation degree of the corresponding word and the service function of the network data object, and processing the description text of the network data object based on the semantic association coefficient corresponding to each word to obtain a text feature vector of the network data object; and then clustering the Q network data objects based on the text feature vectors of the Q network data objects.
According to the embodiment of the disclosure, the semantic coefficient of the correlation degree between the words in the description text and the service function of the network data object is combined to extract the features of the description text of the network data object, so that the distribution of the text feature vector of the network data object in the vector space is more consistent with the distribution of the real service function of the network data object, and the accuracy of clustering the network data object is improved.
It should be noted that the network data clustering method and clustering device provided in the embodiments of the present disclosure may be applied to cloud computing in the financial field, and may also be applied to any field other than the financial field.
Fig. 1 schematically illustrates an application scenario 100 of the clustering method and apparatus of network data according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the application scenario 100 may include a cloud platform 101. Wherein a large number of network data objects (e.g., network data objects 1, 2, 3, 4, 5.) may be searched or connected to by the cloud platform 101. The cloud platform 101 may provide a search service, a call service, or a network data object assembly service for network data objects, and the like.
According to the embodiment of the present disclosure, the cloud platform 101 may execute the network data clustering method of the embodiments of the present disclosure to cluster the network data objects 1, 2, 3, 4, 5, wherein, in the clustering process, the text feature vectors of the network data objects 1, 2, 3, 4, 5 are extracted in combination with text semantic information, so that the clustering result matches the distribution of the service functions of the network data objects 1, 2, 3, 4, 5. In this way, when a certain network data object is searched for or called through the cloud platform 101, the search can be performed in the associated clusters according to the search keyword, effectively reducing the search space of network data objects.
In accordance with embodiments of the present disclosure, a network data object (i.e., any of network data objects 1, 2, 3, 4, 5 ...) may be any network data that includes a description text and a tag. The network data object may be a software application (e.g., a Mashup service, or various microservices), an official account, or a document (e.g., news articles, blogs, microblogs), etc. The presentation form or naming of the description text and the tag may differ among network data objects in practical applications. For example, the description text of a Mashup service is the text introducing the function of that Mashup service, and the tag is the tag provided when the Mashup service is published online. The description text of an official account may be, for example, its profile, and the tag may be tag information provided for the account by its operating platform, or tag information used by the cloud platform 101 to mark the account. The description text of a news article may be, for example, its abstract or body, and the tag may be, for example, keywords provided with the article.
Mashup is a type of software service that has emerged in recent years, formed by composing services or resource files on the internet. With Mashup, software applications meeting users' personalized requirements can be conveniently built without the SOAP protocol or an SDK, and it offers advantages such as extensibility, easy access, and a user-centered design. When the method of the embodiments of the disclosure is applied to Mashup service clustering, the clustering results can match the real service function distribution of a large number of Mashup services as closely as possible, effectively improving Mashup clustering precision and the efficiency with which the cloud platform 101 accurately finds the required Mashup service.
Fig. 2 schematically illustrates a flow chart of a method 200 of clustering network data according to an embodiment of the present disclosure.
As shown in fig. 2, the clustering method 200 according to the embodiment of the present disclosure may include operations S210 to S240.
In operation S210, description texts and tags of Q network data objects are obtained, where Q is an integer greater than 1. A basic data set is built from the description texts and tags of the Q network data objects; its content may be organized as shown in Table 1.
TABLE 1

| Network data object   | Description text      | Label (T)             |
| --------------------- | --------------------- | --------------------- |
| Network data object 1 | D = (w1, w2, ..., wm) | T = (t1, t2, ..., tr) |
| ...                   | ...                   | ...                   |

In Table 1, the description text and label of one network data object are exemplified by network data object 1. The description text D may include a plurality of words, exemplified as w1, w2, ..., wm, where each wi represents a single word. For example, when the description text is Chinese, wi is a term obtained after word segmentation of the description text; when the description text is English, wi is a word; for description texts in other languages, wi is the smallest independently meaningful unit of that language, obtained by a corresponding method. The tag T may also include one or more words (referred to herein as tag words), illustrated as t1, t2, ..., tr.
Next, in operation S220, for each network data object, a semantic association coefficient corresponding to a word in the description text of the network data object is obtained based on semantic similarity between the word and the tag of the network data object. The semantic relevance coefficient is used to measure the degree of relevance of the corresponding word to the service function of the network data object.
In some embodiments, the words in the description text D may be preprocessed before operation S220, for example by specific word extraction, root reduction (stemming), stop-word removal, and the like. The preprocessed description text is denoted herein as the feature item set DP = (s1, s2, ..., sl), where l < m. Then, in operation S220, semantic association coefficients may be obtained only for the words s1, s2, ..., sl in DP, thereby reducing the data processing amount of operation S220, reducing processing noise, and improving text processing efficiency.
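As an illustration of this preprocessing step, the sketch below performs only lowercasing and stop-word removal over a toy English description text; the stop-word list and helper name are hypothetical, and root reduction and specific-word extraction are omitted for brevity:

```python
# Hypothetical minimal stop-word list; a real system would use a fuller one
STOP_WORDS = {"the", "a", "of", "and", "to", "for"}

def preprocess(description_text):
    """Turn the description text D = (w1, ..., wm) into the feature
    item set DP = (s1, ..., sl), with l < m."""
    words = description_text.lower().split()
    # Keep only non-stop-words, stripping trailing punctuation
    return [w.strip(".,") for w in words if w not in STOP_WORDS]

dp = preprocess("A map service for the visualization of shops")
```

Here `dp` retains only the content-bearing words of the description text, which are the words later assigned semantic association coefficients.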
Then, in operation S230, for each network data object, the description text of the network data object is processed based on the semantic association coefficient corresponding to each word, so as to obtain a text feature vector of the network data object. If the description text is preprocessed before operation S220, a text feature vector may be obtained based on the preprocessed description text (i.e., the feature item set DP) in operation S230.
Then, in operation S240, the Q network data objects are clustered based on their text feature vectors.
Fig. 3 schematically shows a conceptual illustration of extracting text feature vectors of a network data object according to an embodiment of the disclosure. Operations S220 and S230 are described below with reference to fig. 3.
Referring to fig. 2 and 3 together, before operation S220 the description text D = (w1, w2, ..., wm) is first preprocessed to obtain the feature item set DP = (s1, s2, ..., sl), with l < m.

Then, in operation S220, a semantic association coefficient may be obtained for each word in the feature item set DP = (s1, s2, ..., sl) (taking s1 as an example) based on its semantic similarity with the tag T = (t1, t2, ..., tr), yielding the semantic association coefficient f1 corresponding to s1. In this way, the semantic association coefficients f1, f2, ..., fl corresponding one-to-one to all words in the feature item set DP can be obtained.
According to the embodiment of the disclosure, the implementation method for calculating the semantic similarity between a word and a tag can be arbitrarily selected according to the needs of those skilled in the art.
In some embodiments, in the process of calculating the similarity between a word (e.g., s1) and the tag T = (t1, t2, ..., tr), a group of synonyms, alternative words, antonyms, and the like may be generated for each tag word in T and combined to form corresponding categories; the word s1 is then classified against the categories formed by the tag words and their synonyms, alternatives, or antonyms, and the semantic association coefficient f1 is determined based on the classification result of s1.
In other embodiments, in the process of calculating the similarity between a word and a label, the word and all label words in the label may be compared one by one to obtain corresponding values, and then all the values may be accumulated.
For example, a value may be obtained by measuring, according to a preset rule, the degree of semantic closeness between a word (e.g., s1) and each tag word in the tag T = (t1, t2, ..., tr).

According to an embodiment of the present disclosure, the preset rule may be, for example, as follows: convert each word in the feature item set DP = (s1, s2, ..., sl), and each tag word in the tag T = (t1, t2, ..., tr), into a corresponding word vector; compute the similarity between the feature word vector dw1 corresponding to each word (e.g., s1) in DP and each tag word vector in the tag word vector set corresponding to T; and accumulate the obtained similarities to obtain the text semantic weight TSWeight(dw1) corresponding to the word s1. The tag word vector set corresponding to the tag T is the set of word vectors obtained after processing the labels of the network data object; the number of tag word vectors it contains is equal to or greater than the number of tag words in the labels of the network data object. For example, when the number of tag words in T is small (the specific threshold is determined according to actual needs), T may be expanded with synonyms or near-synonyms of each tag word, and word vectors extracted for all words in the expanded tag, so that the number of word vectors in the resulting tag word vector set exceeds the number of words in T.
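The accumulation just described can be sketched as follows, assuming cosine similarity between word vectors (one common choice; the disclosure does not mandate a particular measure) and toy two-dimensional vectors standing in for trained Word2vec vectors:

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ts_weight(feature_word_vec, tag_word_vecs):
    # TSWeight(dw): accumulate the similarity of the feature word vector
    # to every tag word vector in the tag word vector set
    return sum(cosine(feature_word_vec, t) for t in tag_word_vecs)

dw1 = [1.0, 0.0]                     # toy feature word vector for s1
tag_set = [[1.0, 0.0], [0.0, 1.0]]   # toy tag word vector set
w = ts_weight(dw1, tag_set)
```

A word whose vector lies close to many tag word vectors accumulates a large TSWeight, reflecting a strong tie to the object's service function.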
In one embodiment, after the text semantic weight corresponding to a word is obtained, it may be used directly as the semantic association coefficient for that word. For example, TSWeight(dw1) is taken as the semantic association coefficient f1 corresponding to the word s1.
In another embodiment, after the text semantic weight of a word is obtained, it may further be combined with the term frequency-inverse document frequency (TF-IDF) value of the word, computed statistically over the data set (i.e., over the description texts and/or tags of the Q network data objects), to obtain the semantic word frequency fusion weight of the word. For example, the text semantic weight TSWeight(dw1) is multiplied by the TF-IDF value to obtain the semantic word frequency fusion weight STFWeight(dw1), and STFWeight(dw1) is then taken as the semantic association coefficient f1 corresponding to the word s1.
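A minimal sketch of this fusion, using a plain TF-IDF formulation (tf × log(N/df)); the disclosure does not specify an exact TF-IDF variant, so the formula and helper names here are illustrative:

```python
from math import log

def tf_idf(word, doc, corpus):
    # Plain TF-IDF: term frequency in this document times the log-scaled
    # inverse document frequency over the whole corpus
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in corpus if word in d)
    idf = log(len(corpus) / df) if df else 0.0
    return tf * idf

def stf_weight(ts_weight_value, tfidf_value):
    # STFWeight(dw) = TSWeight(dw) * TF-IDF
    return ts_weight_value * tfidf_value

# Toy corpus: preprocessed description texts of Q = 3 network data objects
corpus = [["map", "service"], ["photo", "service"], ["map", "search"]]
v = tf_idf("map", corpus[0], corpus)
f1 = stf_weight(2.0, v)  # assume TSWeight(dw1) = 2.0 for illustration
```

Multiplying the two terms boosts words that are both semantically tied to the tags and statistically distinctive in the corpus.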
The text semantic weight corresponding to the word is further processed through the TF-IDF value of the word, and the word frequency characteristic is combined, so that the proportion of keywords in the text characteristic vector is enhanced to a great extent, the influence of non-keywords on the text characteristic vector is weakened, the difference between the distribution of the generated text characteristic vector in a vector space and the distribution of real service functions is smaller, and the clustering accuracy is improved.
With continued reference to FIG. 3, after the semantic association coefficients f_1, f_2, ..., f_l are obtained, when text features are extracted from the description text of the network object in operation S230, the preprocessed feature item set DP = (s_1, s_2, ..., s_l) and the semantic association coefficients f_1, f_2, ..., f_l corresponding to its words may together serve as the input of a feature extraction model (such as a Word2vec model, a recurrent neural network, or a long short-term memory network LSTM), with f_1, f_2, ..., f_l adjusting the importance of each word during feature extraction, to obtain the text feature vector of the network data object.
In this way, according to the embodiment of the disclosure, when network data is clustered, features of the network data objects are extracted by taking into account the relationship between the semantics of the words in each object and the functions the object provides, so that the clustering result better matches the distribution of the real service functions of the network data objects and the precision of clustering is improved.
Fig. 4 schematically shows a flowchart for obtaining the semantic relation coefficient in operation S220 according to an embodiment of the present disclosure.
As shown in fig. 4, operation S220 may include operations S401 to S404 according to an embodiment of the present disclosure.
Firstly, in operation S401, a Word2vec model is used to process a description text of a network data object, and a feature Word vector set corresponding to the network data object is obtained, where the feature Word vector set is a set formed by feature Word vectors corresponding to words in the description text of the network data object.
Then, in operation S402, the Word2vec model is used to process the label of the network data object, so as to obtain a label Word vector set.
According to an embodiment of the present disclosure, considering that in some scenarios the number of tag words contained in the tag of a network data object may be small (e.g., typically only 2 to 4, or even only 1), so that the amount of data available for extracting the semantic association coefficient is limited, the tag of the network data object may be expanded with Word2vec. Specifically, the Word2vec model may be used to search for the R words most similar to each tag word in the tags of the network data object, where R is an integer greater than or equal to 1; all tag words in the tags of the network data object and the R words most similar to each tag word are then combined together to obtain an expanded tag set. Next, in operation S402, the expanded tag set is processed with the Word2vec model to obtain the tag word vector set.
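The expansion step can be sketched as follows. A small hand-made word-vector table replaces a trained Word2vec model here, and `most_similar` mimics the model's most-similar lookup; all names and vectors are illustrative:

```python
import numpy as np

# Toy word-vector table standing in for a trained Word2vec model.
WORD_VECS = {
    "map":     np.array([1.0, 0.1]),
    "mapping": np.array([0.9, 0.2]),
    "chart":   np.array([0.8, 0.3]),
    "music":   np.array([0.0, 1.0]),
}

def most_similar(word, topn):
    # Return the topn words closest to `word` by cosine similarity,
    # mimicking Word2vec's most-similar lookup.
    v = WORD_VECS[word]
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(w, cos(v, u)) for w, u in WORD_VECS.items() if w != word]
    return [w for w, _ in sorted(scored, key=lambda p: -p[1])[:topn]]

def expand_tags(tags, r):
    # Combine the original tag words with the R most similar words each,
    # yielding the expanded tag set.
    expanded = list(tags)
    for t in tags:
        expanded.extend(most_similar(t, r))
    return expanded

print(expand_tags(["map"], 2))  # ['map', 'mapping', 'chart']
```

With a real model, `most_similar` would be replaced by the library's own lookup over the trained vocabulary.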
Next, in operation S403, a text semantic weight corresponding to each word is obtained based on the accumulation of the similarity between the feature word vector corresponding to each word and each tag word vector in the tag word vector set.
Then, in operation S404, the semantic word frequency fusion weight of each word is obtained based on the text semantic weight corresponding to the word and the term frequency-inverse document frequency (TF-IDF) value computed for the word over the description texts of the Q network data objects. In this embodiment, the semantic word frequency fusion weight may be used as the semantic association coefficient.
According to an embodiment of the present disclosure, the Word2vec model may be trained before it is used to process the description text and tags of a network data object. For the case where the network data object is written in English, the English Wikipedia corpus may be used as the word vector training corpus to train the Word2vec model, and the word vector dimension may, for example, be set to 300 according to conventional experience. For the case where the network data object is written in another language (e.g., Chinese), the Word2vec model may be trained using a collected Chinese corpus as the word vector training corpus. The present disclosure is not limited thereto.
In embodiments of the present disclosure, a word vector library can be constructed with a Word2vec model and used to calculate the semantic similarity between the words in the description text of a network data object and the tags of that object; the semantic association coefficient of each word in the description text is then calculated from this similarity and used as the weight of the corresponding word vector when constructing the text feature vector. Clustering is finally completed based on the obtained text feature vectors of the Q network data objects, which can effectively improve clustering precision.
Fig. 5 schematically illustrates a flow chart of a method 500 of clustering network data according to another embodiment of the present disclosure.
As shown in fig. 5, the clustering method 500 according to the embodiment may include operations S210 to S220, operations S231 to S232, and operations S241 to S242.
First, in operation S210, description texts and tags of Q network data objects are obtained, where Q is an integer greater than 1.
Then, in operation S220, for each network data object, based on semantic similarity between words in the description text of the network data object and tags of the network data object, a semantic association coefficient corresponding to the word is obtained, and the semantic association coefficient is used to measure a degree of correlation between the corresponding word and a service function of the network data object.
The operations S210 and S220 may refer to the above description, and are not described herein again.
Next, in operation S231, the description text of the network data object is processed to obtain a feature word vector set corresponding to the network data object, where the feature word vector set is a set composed of feature word vectors corresponding to words in the description text of the network data object. In one embodiment, a Word2vec model may be used to derive a set of feature Word vectors, as described above.
Then, in operation S232, the semantic association coefficient corresponding to each word is used as the weight of the feature word vector corresponding to the word, and the feature word vectors in the feature word vector set are weighted to obtain the text feature vector.
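Operation S232 can be sketched as a weighted sum of the feature word vectors; whether the sum is additionally normalized is not fixed by the text, so the plain weighted sum is shown here with toy vectors:

```python
import numpy as np

def text_feature_vector(word_vecs, weights):
    # Weight each feature word vector by its semantic association
    # coefficient and sum, yielding the text feature vector.
    return sum(w * v for w, v in zip(weights, word_vecs))

vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
weights = [0.75, 0.25]  # semantic association coefficients f_1, f_2
print(text_feature_vector(vecs, weights))  # [0.75 0.25]
```

Because the coefficients scale each word's contribution, words strongly related to the service function dominate the resulting vector.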
Next, in operation S241, a Q × Q similarity matrix M is constructed based on the pairwise similarities between the text feature vectors of the Q network data objects, where the (i, j)-th element of the similarity matrix represents the similarity between the text feature vector of the i-th network data object and that of the j-th network data object.
Thereafter, in operation S242, the Q network data objects are clustered with the similarity matrix M as the input of the k-means algorithm; the output of the k-means algorithm is the cluster assignment of the Q network data objects. The k-means clustering algorithm is an iterative clustering method proposed by MacQueen et al. Its main procedure is as follows: K points are randomly selected from the data set as initial cluster centers; the distance from every data object to each of the K centers is computed with a distance formula, and each object is assigned to the cluster of its nearest center, completing one clustering iteration; after each iteration, the mean of the data objects in each cluster is computed to update the cluster center, and the objects are clustered again. When the cluster centers of two consecutive iterations no longer change, the adjustment of the data objects is finished, the clustering criterion function has converged, and the algorithm terminates.
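A sketch of operation S242 using scikit-learn's k-means (an assumption; the disclosure describes the classical algorithm, not a particular library). Each row of the similarity matrix M serves as the feature vector of one network data object, so objects with similar similarity profiles fall into the same cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 4x4 similarity matrix: objects 0 and 1 are alike, as are 2 and 3.
M = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.9],
    [0.2, 0.1, 0.9, 1.0],
])

# Feed the rows of M to k-means as feature vectors.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(M)
print(labels)
```

The output is the cluster label of each of the Q network data objects, i.e., the cluster distribution described above.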
Fig. 6 schematically shows a flowchart of a Mashup service clustering method according to an embodiment of the present disclosure. Fig. 7 schematically illustrates a word vector transformation diagram in Mashup service clustering according to an embodiment of the present disclosure. Fig. 6 and fig. 7 illustrate a specific application example of the network data clustering method according to the embodiment of the present disclosure, by taking a network data object as Mashup service as an example. It is understood that the flow is exemplary and does not constitute any limitation on the aspects of the present disclosure.
As shown in fig. 6 and 7, the Mashup service clustering method according to the embodiment of the present disclosure first extracts the service tags and service description texts from a database and obtains the corresponding word vectors through a Word2vec model. It then calculates the text semantic weights from the similarity between the tag word vectors and the text word vectors, generates the semantic word frequency fusion weights in combination with the TF-IDF algorithm, and constructs the text feature vectors. Finally, it builds a similarity matrix from the text feature vectors of the services and completes clustering by using the similarity matrix as the input of the k-means algorithm. The method may specifically include steps S1 to S11.
Step S1, collecting a large amount of Mashup service information, including description texts and labels, and constructing a data set based on the description texts and the labels.
Step S2, training a Word2vec model using the English Wikipedia corpus as the word vector training corpus, with the word vector dimension set to 300.
Step S3, preprocessing the Mashup service description text D = (w_1, w_2, ..., w_m) with operations such as word extraction, stemming, and stop-word removal; the preprocessed description text is represented as the feature item set DP = (s_1, s_2, ..., s_l), l < m.
Step S4, converting the words in the description text into word vectors with the Word2vec model to obtain the feature word vector set DPW = (dw_1, dw_2, ..., dw_l) corresponding to the feature item set DP. The tag set of the Mashup service is then denoted T = (t_1, t_2, ..., t_r), and the corresponding tag word vector set obtained through the Word2vec model is denoted TW = (tw_1, tw_2, ..., tw_r).
Step S5, from the crawled Mashup service tag data it is observed that most Mashup services carry only 2 to 4 tags, and some services may have only 1. To remedy the insufficient tag semantic information caused by the small number of service tags, the tags need to be expanded: using the Word2vec word vector model, the top N most similar words for each tag of a Mashup service are found in the model and combined into an expanded tag set. The steps for expanding the tags are as follows:
5.1 Inputting the tag set T into the Word2vec model and constructing, for each tag t_i in T, the set of its most similar words Vt_i = (v_1, v_2, ..., v_M), M ≥ N.
5.2 The expanded tag set is denoted T_enrich-N = {(t_1, Vt_1), (t_2, Vt_2), ..., (t_r, Vt_r)}; for example, taking N = 10 yields the expanded tag set denoted T_enrich-10.
5.3 Obtaining, through the Word2vec model, the expanded tag word vector set TW_enrich-10 = (tw_1, tw_2, ..., tw_u) corresponding to the expanded tag set T_enrich-10, where u is the total number of tags and expansion words.
Step S6, after obtaining the expanded tag word vector set TW_enrich-10 and the feature word vector set DPW, calculating the text semantic weight TSWeight(dw_j) using cosine similarity, with the following formula:
TSWeight(dw_j) = Σ_{i=1}^{u} cosineSim(dw_j, tw_i)
where dw_j denotes the j-th feature word vector in the feature word vector set DPW, and tw_i denotes the i-th tag word vector in the expanded tag word vector set TW_enrich-10.
Step S7, on the basis of the text semantic weight TSWeight(dw_j) obtained in step S6, calculating the semantic word frequency fusion weight STFWeight in combination with the TF-IDF algorithm, with the following formula:
STFWeight(dw_j) = TSWeight(dw_j) * TF-IDF(dw_j)
where TF-IDF(dw_j) denotes the TF-IDF value of dw_j.
Step S8, constructing the text feature vector vec(DPW) of the Mashup service description text from the word vectors of the feature items and the semantic word frequency fusion weights STFWeight calculated in step S7, with the following formula:
vec(DPW) = Σ_{j=1}^{l} STFWeight(dw_j) · dw_j
In step S9, the similarity between Mashup services is calculated from the text feature vectors vec(DPW) constructed in step S8. The formula is as follows:
SimD(M_i, M_j) = cosineSim(vec(DPW_i), vec(DPW_j))
where cosineSim(vec(DPW_i), vec(DPW_j)) denotes the cosine similarity between services i and j.
Step S10, constructing the similarity matrix M according to the inter-service similarities calculated in step S9, the matrix being expressed as:
M =
| s_11  s_12  ...  s_1n |
| s_21  s_22  ...  s_2n |
|  ...   ...  ...   ... |
| s_n1  s_n2  ...  s_nn |
where s_ij denotes the similarity between Mashup service i and service j, and n is the total number of Mashup services in the data set.
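Steps S9 and S10 together can be sketched as follows, building the n × n matrix M from pairwise cosine similarities of toy text feature vectors (names and vectors are illustrative):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two text feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_matrix(feature_vecs):
    # Pairwise cosine similarity between the text feature vectors of
    # all n Mashup services, giving the n x n similarity matrix M.
    n = len(feature_vecs)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            M[i, j] = cosine_sim(feature_vecs[i], feature_vecs[j])
    return M

vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
M = similarity_matrix(vecs)
print(np.round(M, 3))
```

The matrix is symmetric with ones on the diagonal, since each service is maximally similar to itself.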
Step S11, taking the similarity matrix M constructed in step S10 as the input of the k-means algorithm for clustering.
Fig. 8 schematically shows a block diagram of a clustering apparatus 800 of network data according to an embodiment of the present disclosure.
As shown in fig. 8, a clustering apparatus 800 of network data according to an embodiment of the present disclosure may include an obtaining module 810, a feature extracting module 820, and a clustering module 830. The feature extraction module 820 includes a semantic association extraction sub-module 821 and a text feature extraction sub-module 822. The apparatus 800 may be used to implement the methods described with reference to fig. 2-7.
The obtaining module 810 is configured to obtain description texts and tags of Q network data objects, where Q is an integer greater than 1.
The semantic association extracting sub-module 821 is configured to, for each network data object, obtain a semantic association coefficient corresponding to a word based on semantic similarity between the word in the description text of the network data object and a tag of the network data object, where the semantic association coefficient is used to measure a degree of correlation between the word and a service function of the network data object.
The text feature extraction submodule 822 is configured to, for each network data object, process the description text of the network data object based on the semantic association coefficient corresponding to each word, and obtain a text feature vector of the network data object.
The clustering module 830 is configured to cluster the Q network data objects based on the text feature vectors of the Q network data objects.
According to other embodiments of the present disclosure, the feature extraction module 820 may further include a word vector extraction sub-module. The word vector extraction submodule is used for: processing the description text of the network data object by using a Word2vec model to obtain a feature Word vector set corresponding to the network data object, wherein the feature Word vector set is a set formed by feature Word vectors corresponding to words in the description text of the network data object; and processing the label of the network data object by using a Word2vec model to obtain a label Word vector set.
According to still further embodiments of the present disclosure, the feature extraction module 820 may further include a tag expansion submodule. The tag expansion submodule is used for searching R words most similar to each tag Word in the tag of the network data object by using a Word2vec model, wherein R is an integer greater than or equal to 1; and combining all the label words in the labels of the network data object and the R words most similar to each label word together to obtain an expanded label set. Correspondingly, the Word vector extraction submodule is also used for processing the expanded tag set by using a Word2vec model to obtain a tag Word vector set.
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any number of the obtaining module 810, the feature extraction module 820, the clustering module 830, the semantic association extraction submodule 821, the text feature extraction submodule 822, the word vector extraction submodule, and the tag expansion submodule may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of these modules and submodules may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented by any one of the three implementation manners of software, hardware, and firmware, or by a suitable combination of any of them. Alternatively, at least one of these modules and submodules may be at least partially implemented as a computer program module which, when executed, performs the corresponding function.
FIG. 9 schematically illustrates a block diagram of an electronic device 900 suitable for implementing network data clustering in accordance with an embodiment of the present disclosure. The computer system of the electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 9, an electronic apparatus 900 according to an embodiment of the present disclosure includes a processor 901 which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. Processor 901 may comprise, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 901 may also include on-board memory for caching purposes. The processor 901 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic apparatus 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. The processor 901 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the programs may also be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
Electronic device 900 may also include input/output (I/O) interface 905, input/output (I/O) interface 905 also connected to bus 904, according to an embodiment of the present disclosure. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The computer program, when executed by the processor 901, performs the above-described functions defined in the system of the embodiment of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 902 and/or the RAM 903 described above and/or one or more memories other than the ROM 902 and the RAM 903.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method provided by the embodiments of the present disclosure; when the computer program product is run on an electronic device, the program code is adapted to cause the electronic device to carry out the network data clustering method provided by the embodiments of the present disclosure.
The computer program, when executed by the processor 901, performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed in the form of a signal on a network medium, and downloaded and installed through the communication section 909 and/or installed from the removable medium 911. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In accordance with embodiments of the present disclosure, program code for carrying out the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, and the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (11)

1. A method of clustering network data, comprising:
obtaining description texts and labels of Q network data objects, wherein Q is an integer greater than 1;
for each of the network data objects,
obtaining a semantic association coefficient corresponding to a word based on semantic similarity between the word in the description text of the network data object and a label of the network data object, wherein the semantic association coefficient is used for measuring the degree of correlation between the corresponding word and a service function of the network data object; and
processing the description text of the network data object based on the semantic association coefficient corresponding to each word to obtain a text feature vector of the network data object; and
and clustering the Q network data objects based on the text characteristic vectors of the Q network data objects.
2. The method of claim 1, wherein the deriving a semantic relevance coefficient corresponding to a word in the description text of the network data object based on semantic similarity of the word to a tag of the network data object comprises:
based on the accumulation of the similarity of the feature word vector corresponding to each word and each label word vector in the label word vector set, obtaining the text semantic weight corresponding to the word; the semantic association coefficient comprises the text semantic weight;
wherein,
the label word vector set is a set of word vectors obtained after processing the labels of the network data objects, and the number of the label word vectors contained in the label word vector set is equal to or greater than the number of the label words in the labels of the network data objects.
3. The method of claim 2, wherein the deriving a semantic relevance coefficient corresponding to a word in the description text of the network data object based on semantic similarity of the word to a tag of the network data object further comprises:
obtaining the semantic word frequency fusion weight of each word based on the text semantic weight corresponding to the word and a term frequency-inverse document frequency (TF-IDF) value obtained statistically for the word over the Q network data objects; and the semantic association coefficient comprises the semantic word frequency fusion weight.
4. The method of claim 2, wherein the obtaining a semantic association coefficient corresponding to a word in the description text of the network data object based on the semantic similarity between the word and the label of the network data object further comprises:
processing the description text of the network data object by using a Word2vec model to obtain a feature word vector set corresponding to the network data object, wherein the feature word vector set is a set of the feature word vectors corresponding to the words in the description text of the network data object; and
processing the labels of the network data object by using the Word2vec model to obtain the label word vector set.
5. The method of claim 4, wherein the processing the labels of the network data object by using the Word2vec model to obtain the label word vector set comprises:
searching, by using the Word2vec model, for the R words most similar to each label word in the labels of the network data object, wherein R is an integer greater than or equal to 1;
combining all the label words in the labels of the network data object with the R words most similar to each label word to obtain an expanded label set; and
processing the expanded label set by using the Word2vec model to obtain the label word vector set.
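The label expansion in claim 5 can be sketched without a trained model by ranking a toy vocabulary with cosine similarity. Here `vocab_vecs` stands in for a Word2vec model's word-vector lookup, and every name is illustrative rather than taken from the patent:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_labels(label_words, vocab_vecs, r):
    # For each label word, add the R most similar vocabulary words,
    # playing the role of Word2vec's most-similar lookup in the claim
    expanded = set(label_words)
    for word in label_words:
        if word not in vocab_vecs:
            continue
        ranked = sorted(
            (w for w in vocab_vecs if w != word),
            key=lambda w: cosine(vocab_vecs[word], vocab_vecs[w]),
            reverse=True,
        )
        expanded.update(ranked[:r])
    return expanded
```

Expanding sparse labels this way gives the similarity accumulation in claim 2 more label vectors to compare against, which matters when an object carries only one or two tags.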
6. The method according to any one of claims 1 to 5, wherein the processing the description text of the network data object based on the semantic association coefficient corresponding to each word to obtain a text feature vector of the network data object comprises:
processing the description text of the network data object to obtain a feature word vector set corresponding to the network data object, wherein the feature word vector set is a set of the feature word vectors corresponding to the words in the description text of the network data object; and
weighting the feature word vectors in the feature word vector set, with the semantic association coefficient corresponding to each word as the weight of the feature word vector of that word, to obtain the text feature vector.
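A minimal sketch of the weighting step in claim 6, assuming the "weighting processing" is a weighted sum; a weighted mean (dividing by the coefficient total) would be an equally plausible reading:

```python
def text_feature_vector(feature_word_vecs, coefficients):
    # Weighted sum of the feature word vectors, using each word's
    # semantic association coefficient as that word's weight
    dim = len(feature_word_vecs[0])
    out = [0.0] * dim
    for vec, w in zip(feature_word_vecs, coefficients):
        for i in range(dim):
            out[i] += w * vec[i]
    return out
```

Because label-related words carry larger coefficients, the resulting text feature vector is pulled toward the directions of the function-relevant words rather than an unweighted average of the whole description.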
7. The method of claim 1, wherein the clustering the Q network data objects based on the text feature vectors of the Q network data objects comprises:
constructing a Q x Q similarity matrix based on the pairwise similarities between the text feature vectors of the Q network data objects, wherein the (i, j)-th element of the similarity matrix represents the similarity between the text feature vector of the i-th network data object and the text feature vector of the j-th network data object; and
clustering the Q network data objects by taking the similarity matrix as the input of a k-means algorithm.
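The pipeline of claim 7, a pairwise similarity matrix fed to k-means, can be sketched as below. Cosine similarity and the deterministic farthest-point seeding are illustrative choices not fixed by the claim:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(text_feature_vecs):
    # Q x Q matrix; element (i, j) is the similarity between the text
    # feature vectors of the i-th and j-th network data objects
    q = len(text_feature_vecs)
    return [[cosine(text_feature_vecs[i], text_feature_vecs[j])
             for j in range(q)] for i in range(q)]

def kmeans(rows, k, iters=20):
    # Minimal k-means over the rows of the similarity matrix, seeded
    # deterministically with farthest-point initialisation
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(d2(r, c) for c in centers)))
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: d2(row, centers[c]))
                  for row in rows]
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Feeding k-means the rows of the Q x Q matrix, rather than the raw feature vectors, groups objects by how similarly they relate to every other object.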
8. The method of claim 1, wherein the network data object comprises a Mashup service.
9. An apparatus for clustering network data, comprising:
an acquisition module configured to acquire description texts and labels of Q network data objects, wherein Q is an integer greater than 1;
a feature extraction module comprising a semantic association extraction submodule and a text feature extraction submodule, wherein
the semantic association extraction submodule is configured to obtain, for each network data object, a semantic association coefficient corresponding to each word in the description text of the network data object based on the semantic similarity between the word and the label of the network data object, the semantic association coefficient measuring the degree of correlation between the word and the service function of the network data object, and
the text feature extraction submodule is configured to process, for each network data object, the description text of the network data object based on the semantic association coefficient corresponding to each word to obtain a text feature vector of the network data object;
and
a clustering module configured to cluster the Q network data objects based on the text feature vectors of the Q network data objects.
10. An electronic device, comprising:
one or more memories storing executable instructions; and
one or more processors executing the executable instructions to implement the method of any one of claims 1 to 8.
11. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 8.
CN202110450664.8A 2021-04-25 2021-04-25 Network data clustering method, clustering device, electronic device and medium Pending CN113139056A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110450664.8A CN113139056A (en) 2021-04-25 2021-04-25 Network data clustering method, clustering device, electronic device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110450664.8A CN113139056A (en) 2021-04-25 2021-04-25 Network data clustering method, clustering device, electronic device and medium

Publications (1)

Publication Number Publication Date
CN113139056A true CN113139056A (en) 2021-07-20

Family

ID=76812483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110450664.8A Pending CN113139056A (en) 2021-04-25 2021-04-25 Network data clustering method, clustering device, electronic device and medium

Country Status (1)

Country Link
CN (1) CN113139056A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114090796A (en) * 2021-11-18 2022-02-25 中国电子科技集团公司第三十研究所 Method for building and perfecting iteration of multi-platform text secondary label system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475607A (en) * 2020-02-28 2020-07-31 浙江工业大学 Web data clustering method based on Mashup service function characteristic representation and density peak detection
CN111475608A (en) * 2020-02-28 2020-07-31 浙江工业大学 Mashup service characteristic representation method based on functional semantic correlation calculation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU, SHUMIAO: "Research on an API Recommendation Method Based on Clustering of Semantic Representations of Mashup Services", China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology, no. 2020, 15 August 2020 (2020-08-15), pages 138-867 *

Similar Documents

Publication Publication Date Title
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
CN109815308B (en) Method and device for determining intention recognition model and method and device for searching intention recognition
CN107832414B (en) Method and device for pushing information
CN109145219B (en) Method and device for judging validity of interest points based on Internet text mining
CN106951422B (en) Webpage training method and device, and search intention identification method and device
CN111539197B (en) Text matching method and device, computer system and readable storage medium
CN110390044B (en) Method and equipment for searching similar network pages
CN106960030B (en) Information pushing method and device based on artificial intelligence
CN108090178B (en) Text data analysis method, text data analysis device, server and storage medium
CN109002432B (en) Synonym mining method and device, computer readable medium and electronic equipment
CN110990532A (en) Method and device for processing text
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN109947903B (en) Idiom query method and device
CN111274822A (en) Semantic matching method, device, equipment and storage medium
CN114021577A (en) Content tag generation method and device, electronic equipment and storage medium
CN114416998A (en) Text label identification method and device, electronic equipment and storage medium
CN111428486B (en) Article information data processing method, device, medium and electronic equipment
CN117971873A (en) Method and device for generating Structured Query Language (SQL) and electronic equipment
CN111737607B (en) Data processing method, device, electronic equipment and storage medium
CN113139056A (en) Network data clustering method, clustering device, electronic device and medium
CN110674388A (en) Mapping method and device for push item, storage medium and terminal equipment
CN110851560B (en) Information retrieval method, device and equipment
CN116719915A (en) Intelligent question-answering method, device, equipment and storage medium
CN108875014B (en) Precise project recommendation method based on big data and artificial intelligence and robot system
CN108733702B (en) Method, device, electronic equipment and medium for extracting upper and lower relation of user query

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination