CN109670187B

CN109670187B - Software commonality feature extraction method based on Internet text description data

Info

Publication number: CN109670187B
Application number: CN201811625340.8A
Authority: CN
Inventors: 刘春�; 喻杰
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2021-06-22
Anticipated expiration: 2038-12-28
Also published as: CN109670187A

Abstract

The invention relates to the technical field of Internet information, and in particular to a software commonality feature extraction method based on Internet text description data. The method comprises the following steps: acquiring the software description text, user rating data, and download amount data from the publicly available Internet text for the software; obtaining software features from the software description text; selecting important software features according to the user rating data; mining user-oriented feature association relationships according to the download amount data; and searching the mined user-oriented feature association relationships for other software features related to the important software features. Using natural language processing on the publicly available Internet text about software, the method mines the common features of software in a specific domain and identifies the more important ones, so that designers can quickly analyze the important common features of software in that domain.

Description

Software commonality feature extraction method based on Internet text description data

Technical Field

The invention relates to the technical field of internet information, in particular to a software commonality feature extraction method based on internet text description data.

Background

According to the Kano model, the features of a piece of software can be divided into three categories: basic features, expected features, and excitement features. Basic features are those that a user takes for granted in any software of a given domain; without them, the user would consider the software unacceptable. Expected features are those that a user expects the software's designers and producers to implement and refine as far as possible: the better they are implemented, the more satisfied the user is with the software. Excitement features are those that the user did not anticipate but that the software nevertheless provides; when such features are absent, user satisfaction is not affected. Basic and expected features are thus shared by all software in a domain and constitute the common features that any software in the domain should have. It is therefore important for a software designer about to enter a domain to analyze the common features of the software currently in that domain.

For inexpensive everyday consumer products such as toothpaste and mobile phones, the designer of a new product can purchase commercially available products of the same type and analyze their common features manually. However, for software that is expensive, functionally complex, available in large numbers, or dependent on a specific operating environment, such manual analysis is both difficult and costly. In this situation, it is valuable to crawl the software description information published on the Internet and use natural language processing to extract the common features of software in a domain automatically from the text.

However, current feature extraction methods often focus only on extracting a specified number of software features and do not further analyze the importance of the extracted features. In practice, when a large number of software features are extracted, distinguishing the important features from the unimportant ones is of clear practical significance.

Disclosure of Invention

The invention aims to provide a software commonality feature extraction method based on Internet text description data. Using natural language processing on the publicly available Internet text about software, the method mines the common software features of a specific domain and identifies the more important ones, so that designers can quickly analyze the important common features of software in that domain.

In order to achieve the above purpose, the invention provides the following technical scheme:

The invention provides a software commonality feature extraction method based on Internet text description data, which comprises the following steps:

acquiring the software description text, user rating data, and download amount data from the publicly available Internet text for the software;

obtaining software features from the software description text;

selecting important software features according to user scoring data;

mining a user-oriented feature association relation according to the download amount data;

preferably, after mining the user-oriented feature association relationship according to the download amount, the method further includes:

and searching other software features related to the important software features in the mined user-oriented feature association relation.

In the foregoing method for extracting software commonality features based on Internet text description data, as a preferred solution, obtaining the software features from the software description text includes:

preprocessing a software description text;

constructing a sentence similarity network;

discovering a sentence community in a sentence similarity network;

determining feature descriptors for the sentence communities.

In the foregoing method for extracting software commonality features based on internet text description data, as a preferred solution, the preprocessing the software description text includes:

carrying out redundancy removal processing on the software description text;

and performing sentence segmentation, stop-word removal, stemming, and dimension reduction on the software description text.

In the above method for extracting software commonality features based on internet text description data, as a preferred solution, the constructing a sentence similarity network includes:

the similarity between sentences in the software description text is measured by the following formula:

where idf is the inverse document frequency, s_i and s_j are two sentences, and w_k is the k-th word in a sentence.

In the above method for extracting software commonality features based on internet text description data, as a preferred solution, the finding sentence communities in the sentence similarity network includes:

the sentence similarity network is a weighted network whose edges represent the similarity between sentences; a node that has not yet been assigned to a community and that is attached to the edge with the largest weight in the sentence similarity network is selected as the seed node for sentence community discovery.

In the above method for extracting software commonality features based on internet text description data, as a preferred solution, after discovering a sentence community in a sentence similarity network, the method further includes:

the suitability of a node for a community is measured by the following formula:

where E_in is the set of edges between nodes inside the community, and E_out is the set of edges connecting nodes inside the community with nodes outside it.

In the foregoing software commonality feature extraction method based on internet text description data, as a preferred solution, the determining a feature descriptor of a sentence community includes:

the entropy of each sentence community is measured by the following formula:

where the quantity in the formula denotes the number of sentence communities that contain sentence s_i; the entropy of a sentence community represents the degree of overlap between that community and the other sentence communities;

and, from the sentence communities for which no feature descriptor has yet been selected, selecting the community with the minimum entropy and choosing its feature descriptor.

In the above method for extracting software commonality features based on Internet text description data, as a preferred solution, selecting the community with the smallest entropy and choosing its feature descriptor includes:

regarding each sentence community for which no feature descriptor has yet been selected as a document that contains all the sentences of that community;

calculating the TF-IDF value of each word in the sentence community with the current minimum entropy;

converting each sentence in that community into a TF-IDF vector, and selecting the sentence closest to the center of the sentence community as the feature descriptor of the community.

In the above method for extracting software commonality features based on internet text description data, as a preferred scheme, the selecting important software features according to user score data includes:

calculating the average user rating of all the software, and discretizing the user rating of each software into 1 or 0 according to how it compares with the average, that is, a rating greater than the average becomes 1 and a rating less than the average becomes 0;

constructing a matrix of software features and user ratings according to the download amount data and the user rating data of the publicly available software text;

based on the constructed matrix of software features and user ratings, scoring the extracted software features using randomized logistic regression, and retaining the software features whose score is not 0;

reducing the constructed matrix of software features and user ratings, and using logistic regression to learn the relationship between the software features and the user ratings, so as to assign each software feature a coefficient related to the user rating;

selecting important software features according to the size of the coefficient of each software feature, that is, the larger the coefficient of a software feature, the higher its priority for selection as an important software feature.

In the foregoing method for extracting software commonality features based on internet text description data, as a preferred solution, the mining a user-oriented feature association relationship according to download amount data includes:

constructing a matrix of software features and download amounts according to the download amount data and the software features of the publicly available software text;

and mining the user-oriented feature association relationships from this matrix using an association rule mining method.

Compared with the closest prior art, the technical scheme provided by the invention has the following beneficial effects:

The invention provides a software commonality feature extraction method based on Internet text description data. Using natural language processing on the publicly available Internet text about software, the method mines the common software features of a specific domain; it identifies important software features by reference to users' rating data for the software; and it mines the association relationships between features from the user's perspective using the download amount data of the software.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. Wherein:

FIG. 1 is a schematic flow chart of data processing of a software commonality feature extraction method based on Internet text description data according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of steps of a method for extracting common features of software based on Internet text description data according to an embodiment of the present invention;

FIG. 3 is an exemplary diagram of storing mined frequent feature sets by a directed graph in an embodiment of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

In the description of the present invention, the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, which are for convenience of description of the present invention only and do not require that the present invention must be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. The terms "connected" and "connected" used herein should be interpreted broadly, and may include, for example, a fixed connection or a detachable connection; they may be directly connected or indirectly connected through intermediate members, and specific meanings of the above terms will be understood by those skilled in the art as appropriate.

In the context of the present invention, it is worth mentioning:

TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique in information retrieval and data mining. It is a statistical method for evaluating how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus. Search engines often apply various forms of TF-IDF weighting as a measure of the relevance between a document and a user query. In addition to TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results. The main idea of TF-IDF is that if a word or phrase occurs with high frequency (TF) in one article and rarely occurs in other articles, it has good class discrimination ability and is suitable for classification. TF (term frequency) is the frequency with which a term appears in document d. The main idea of IDF (inverse document frequency) is that the fewer documents contain the term t, that is, the smaller the number n of documents containing t, the larger the IDF and the better the class discrimination ability of t. If m documents of a certain class C contain the term t, and k documents of other classes contain t, then the total number of documents containing t is n = m + k. When m is large, n is also large, and the IDF value obtained from the IDF formula is small, which suggests that the term t has weak class discrimination ability. In practice, however, if a term appears frequently in the documents of one class, it represents the characteristics of that class well; such terms should be given higher weight and selected as characteristic words that distinguish the class from others. This is a deficiency of IDF.
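As a minimal sketch of the weighting described above, the following Python computes TF-IDF weights for a toy corpus. The exact variant is an assumption (tf as count divided by document length, idf as log(N / df)); the patent text does not state which form it uses, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["share", "photo", "friend"],
    ["edit", "photo", "filter"],
    ["share", "video", "friend"],
]
w = tf_idf(docs)
```

In this toy corpus, "edit" occurs in a single document and therefore receives a larger weight in the second document than "photo", which occurs in two.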

The nodes in the present invention refer to sentences constituting a sentence similarity network.

Association rule mining is a rule-based machine learning technique for finding interesting relationships in large databases; its goal is to use some measure of interestingness to identify the strong rules present in the database.

Step S101: acquire the software description text, user rating data, and download amount data from the publicly available Internet text for the software. Crawler software is used to crawl, from relevant websites on the Internet, the description text information of the software in a specific domain, including the feature descriptions of the software, its user rating data, and its download amount data.

Step S102: preprocess the software description text. The crawled text data is preprocessed using natural language processing techniques: redundant software description texts are removed, and the description texts undergo sentence segmentation, stop-word removal, stemming, and dimension reduction.
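The preprocessing step can be sketched roughly as follows. This is an illustration only: the stop-word list, the crude suffix-stripping stemmer, and the function names are assumptions standing in for whatever natural language processing tools the inventors actually used, and dimension reduction is not shown.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "with",
              "your", "you", "is", "in", "for"}

def split_sentences(text):
    """Split text after sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(descriptions):
    """Drop duplicate descriptions, then return one token list per sentence."""
    seen, sentences = set(), []
    for text in descriptions:
        key = text.strip().lower()
        if key in seen:  # redundancy removal: skip exact duplicates
            continue
        seen.add(key)
        for sentence in split_sentences(text):
            tokens = re.findall(r"[a-z]+", sentence.lower())
            tokens = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
            if tokens:
                sentences.append(tokens)
    return sentences

# Hypothetical crawled descriptions; the second is an exact duplicate.
sentences = preprocess([
    "Share photos with your friends. Edit photos using filters!",
    "Share photos with your friends. Edit photos using filters!",
    "Sharing videos is easy.",
])
```

The duplicate description is dropped, and the remaining text yields three tokenized sentences. The crude stemmer maps "sharing" to "shar" but leaves "share" unchanged, which is one reason a proper stemmer would be preferred in practice.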

Step S103: construct a sentence similarity network. When measuring the similarity between sentences, the method considers not only the importance of the words shared between the sentences but also the importance of those words within each sentence. When the sentence similarity network is constructed, the similarity between sentences in the software description text is measured by the following formula:

where idf is the inverse document frequency, s_i and s_j are two sentences, and w_k is the k-th word in a sentence.

Step S104: discover sentence communities in the sentence similarity network. Each sentence in the software description is treated as a potential software feature descriptor, and the sentence communities present in the set of description sentences are discovered using an overlapping community discovery algorithm. Each sentence community represents one software feature. The sentence similarity network is a weighted network whose edges represent the similarity between sentences. A node that has not yet been assigned to a community and that is attached to the edge with the largest weight is selected as the seed node for sentence community discovery, because two nodes connected by the edge with the largest weight are more likely to belong to the same sentence community. After sentence communities in the sentence similarity network are discovered, the suitability of a node for a community is measured by the following formula:

Step S105: determine the feature descriptor of each sentence community. A representative sentence is selected from each sentence community as the descriptor of the feature that the community represents. The entropy of each sentence community is calculated to measure the degree of overlap between that community and the other communities, and selection starts from the community that has no feature descriptor yet and has the minimum entropy. To select the sentence closest to the center of the current community, each community is treated as a document, the weight of each word in the current community is calculated using TF-IDF, and each sentence in the current community is converted into a TF-IDF vector. The entropy of each sentence community is measured by the following formula:

where the quantity in the formula denotes the number of sentence communities that contain sentence s_i, and the entropy of a sentence community represents the degree of overlap between that community and the other sentence communities. From the sentence communities for which no feature descriptor has yet been selected, the community with the minimum entropy is selected. Each sentence community for which no feature descriptor has yet been selected is treated as a document containing all the sentences of that community; the TF-IDF value of each word in the minimum-entropy community is calculated; each sentence in that community is converted into a TF-IDF vector; and the sentence closest to the center of the community is selected as its feature descriptor.

Step S106: select important software features according to the user rating data. The extracted features are first reduced using randomized logistic regression, and important software features are then identified by logistic regression on the reduced feature set and the user rating data. Specifically: the average user rating of all the software is calculated, and the user rating of each software is discretized to 1 or 0 according to how it compares with the average, that is, a rating greater than the average becomes 1 and a rating less than the average becomes 0; a matrix of software features and user ratings is constructed from the download amount data and the user rating data of the publicly available software text; based on this matrix, the extracted software features are scored using randomized logistic regression, and the software features whose score is not 0 are retained; the matrix is reduced accordingly, and logistic regression is used to learn the relationship between the software features and the user ratings, which assigns each software feature a coefficient related to the user rating; finally, important software features are selected according to the size of these coefficients: the larger the coefficient of a software feature, the higher its priority for selection as an important software feature.
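The rating-based selection can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the randomized step is approximated by fitting an L1-penalized logistic regression on random row subsamples and counting how often each feature keeps a non-zero coefficient, the hyperparameters are arbitrary, and the toy matrix is invented. It is not the inventors' implementation.

```python
import math
import random

def discretize(ratings):
    """Map each rating to 1 if it is above the mean rating, else 0."""
    mean = sum(ratings) / len(ratings)
    return [1 if r > mean else 0 for r in ratings]

def fit_logistic(X, y, l1=0.0, lr=0.5, epochs=500):
    """Logistic regression by proximal gradient descent with an optional L1 penalty."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        for j in range(d):
            wj = w[j] - lr * grad_w[j]
            # Soft-thresholding drives weak coefficients to exactly zero.
            w[j] = math.copysign(max(abs(wj) - lr * l1, 0.0), wj)
    return w

def stability_scores(X, y, l1=0.05, rounds=30, fraction=0.75, seed=0):
    """Share of random subsamples in which each feature keeps a non-zero coefficient."""
    rng = random.Random(seed)
    d = len(X[0])
    hits = [0] * d
    for _ in range(rounds):
        rows = rng.sample(range(len(X)), max(2, int(fraction * len(X))))
        w = fit_logistic([X[i] for i in rows], [y[i] for i in rows], l1=l1)
        for j in range(d):
            hits[j] += w[j] != 0.0
    return [h / rounds for h in hits]

# Hypothetical data: rows are software, columns are features (1 = feature present).
ratings = [4.8, 4.6, 4.5, 4.7, 3.0, 3.2, 2.9, 3.1]
X = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0],
     [0, 1, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
y = discretize(ratings)

scores = stability_scores(X, y)
kept = [j for j, s in enumerate(scores) if s > 0]       # drop features scored 0
X_reduced = [[row[j] for j in kept] for row in X]
coefs = fit_logistic(X_reduced, y)                      # plain logistic regression
ranking = sorted(zip(kept, coefs), key=lambda t: t[1], reverse=True)
```

In this toy data, feature 0 is present exactly in the highly rated software, so it is retained and ranked first, while feature 2 never occurs and is dropped with a score of 0.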

Step S107: mine user-oriented feature association relationships according to the download amount data. The download amount reflects the popularity of the software among users; searching for important software features according to the download amount makes it easier to find, from the user's perspective, the associated features that users expect or accept. An inverted, compressed frequent feature set graph is used to store the mined frequent feature sets and feature association rules, which saves storage space. When mining the user-oriented feature association relationships, a matrix of software features and download amounts is constructed from the download amount data and the software features of the publicly available software text, and an association rule mining method is then applied to this matrix.

Step S108: search the mined user-oriented feature association relationships for other software features related to the important software features. A software designer may enter the identified important software features to discover other features that are strongly associated with them.

The invention provides a method for mining common characteristics of software in a specific field and identifying more important characteristics of the common characteristics by using a natural language processing method based on software description text data disclosed on the Internet.

First, the method exploits the overlapping community discovery algorithm LMF to mine potentially all software features from the software product description text.

1.1 Each sentence in the software description text is regarded as a descriptor of a software feature. The software features contained in the description text are identified by constructing a sentence similarity network and discovering overlapping sentence communities in it. Each sentence community represents one software feature, and the most representative sentence in the community is selected as the descriptor of that feature.

1.2 When the sentence similarity network is constructed, the similarity between sentences is measured by the following formula, where idf is the inverse document frequency, s_i and s_j are two sentences, and w_k is the k-th word in a sentence. The inverse document frequency (IDF) is the reciprocal of the document frequency. In measuring the similarity between sentences, this formula considers not only the importance of the words shared between sentences but also the influence of those shared words within each sentence.
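The similarity formula itself is not reproduced in this text. As a sketch under a stated assumption, the following Python uses an idf-weighted cosine similarity, which has the two properties described above: shared words contribute in proportion to their idf, and each sentence is normalized by the idf-weighted magnitude of its own words. The function names, the log(N / df) form of idf, and the threshold parameter are hypothetical.

```python
import math
from collections import Counter

def idf_table(sentences):
    """idf(w) = log(N / df(w)), with each sentence treated as a document."""
    n = len(sentences)
    df = Counter(w for s in sentences for w in set(s))
    return {w: math.log(n / d) for w, d in df.items()}

def similarity(s_i, s_j, idf):
    """idf-weighted cosine between two tokenized sentences (assumed form)."""
    tf_i, tf_j = Counter(s_i), Counter(s_j)
    shared = set(tf_i) & set(tf_j)
    num = sum(tf_i[w] * tf_j[w] * idf[w] ** 2 for w in shared)
    norm_i = math.sqrt(sum((tf_i[w] * idf[w]) ** 2 for w in tf_i))
    norm_j = math.sqrt(sum((tf_j[w] * idf[w]) ** 2 for w in tf_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return num / (norm_i * norm_j)

def build_network(sentences, threshold=0.0):
    """Return {(i, j): weight} for sentence pairs whose similarity exceeds threshold."""
    idf = idf_table(sentences)
    edges = {}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            weight = similarity(sentences[i], sentences[j], idf)
            if weight > threshold:
                edges[(i, j)] = weight
    return edges

# Hypothetical preprocessed sentences.
sentences = [
    ["share", "photo", "friend"],
    ["share", "photo", "family"],
    ["edit", "video", "filter"],
]
edges = build_network(sentences)
```

Here only the first two sentences share words, so the resulting weighted network has a single edge between nodes 0 and 1.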

1.3 The conventional overlapping community discovery algorithm LMF measures the suitability of a node for a community using the following formula. Since LMF targets unweighted networks, k_in here is the connectivity among the nodes inside the community, and k_out is the connectivity between the nodes inside the community and the nodes outside it.

Because the sentence similarity network constructed by the method is a weighted network whose edges represent the similarity between sentences, the method of the invention modifies this formula as follows, where E_in is the set of edges between nodes inside the community, and E_out is the set of edges connecting nodes inside the community with nodes outside it.

1.4 The conventional overlapping community discovery algorithm LMF randomly selects a node that has not been assigned to any community as the seed node for community discovery, whereas the method of the invention selects an unassigned node that is attached to the edge with the largest weight. The method assumes that two nodes connected by the edge with the largest weight are more likely to belong to the same community. Selecting a node attached to that edge as the seed node therefore accelerates the discovery of overlapping communities.
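Neither fitness formula is reproduced in this text, so the sketch below rests on an assumption: it takes a local fitness of the form W_in / (W_in + W_out)^alpha, where W_in and W_out are the summed weights of the edges in E_in and E_out, and measures a node's suitability as the change in that fitness when the node is included. The parameter alpha, the greedy growth loop, and all function names are hypothetical; a full overlapping-community algorithm would also re-examine and remove nodes, which is omitted here.

```python
def community_fitness(community, edges, alpha=1.0):
    """Weighted fitness W_in / (W_in + W_out) ** alpha (assumed form).

    W_in sums the weights of edges with both endpoints in the community (E_in);
    W_out sums the weights of edges with exactly one endpoint inside (E_out).
    """
    w_in = w_out = 0.0
    for (u, v), weight in edges.items():
        inside = (u in community) + (v in community)
        if inside == 2:
            w_in += weight
        elif inside == 1:
            w_out += weight
    total = w_in + w_out
    return w_in / total ** alpha if total > 0 else 0.0

def node_fitness(node, community, edges, alpha=1.0):
    """Change in community fitness when `node` is included versus excluded."""
    with_node = community_fitness(community | {node}, edges, alpha)
    without_node = community_fitness(community - {node}, edges, alpha)
    return with_node - without_node

def pick_seed(edges, assigned):
    """Pick an unassigned node attached to the heaviest edge that still has one."""
    for (u, v), _ in sorted(edges.items(), key=lambda kv: kv[1], reverse=True):
        for node in (u, v):
            if node not in assigned:
                return node
    return None

def grow_community(seed, edges, alpha=1.0):
    """Greedily add the neighbor with the largest positive fitness gain."""
    community = {seed}
    while True:
        neighbors = {n for e in edges for n in e
                     if (e[0] in community) != (e[1] in community)} - community
        gains = {n: node_fitness(n, community, edges, alpha) for n in neighbors}
        if not gains or max(gains.values()) <= 0:
            return community
        community.add(max(gains, key=gains.get))

# Hypothetical weighted sentence similarity network: {(node, node): weight}.
edges = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.7, (2, 3): 0.1, (3, 4): 0.85}
first = grow_community(pick_seed(edges, set()), edges)
next_seed = pick_seed(edges, first)
```

In this toy network the first seed is node 0 (an endpoint of the heaviest edge), the community grows to nodes 0, 1, and 2, and the next seed is node 3, which sits on the heaviest edge that still has an unassigned endpoint.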

1.5 When a sentence is selected for each community as its feature descriptor, the entropy of each community, which represents the amount of overlap between that community and the other communities, is measured using the following formula, in which one term denotes the number of communities that contain sentence s_i. The method of the invention iteratively selects, from the communities for which no feature descriptor has yet been selected, the community with the smallest entropy and chooses its feature descriptor.

1.6 When the feature descriptor is selected for the community with the current minimum entropy, each community for which no feature descriptor has yet been selected is regarded as a document containing all the sentences of that community. The TF-IDF value of each word in the minimum-entropy community is then calculated, each sentence in the community is converted into a TF-IDF vector, and the sentence closest to the center of the community is selected as the community's feature descriptor.
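The descriptor selection in 1.6 can be sketched as follows. Several details are assumptions, since the text does not specify them: the center is taken to be the mean of the sentence vectors, closeness is measured by cosine similarity, and a smoothed idf is used so that words found in every community keep a small weight. The function name and sample data are hypothetical.

```python
import math
from collections import Counter

def select_descriptor(target, other_communities):
    """Return the index of the sentence in `target` closest to its TF-IDF centroid.

    Each community (a list of tokenized sentences) is treated as one document
    when computing idf; tf is counted within each sentence of `target`.
    """
    documents = [target] + list(other_communities)
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update({w for sentence in doc for w in sentence})
    # Smoothed idf (an assumption) so that no word gets a zero weight.
    idf = {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}

    vocab = sorted({w for sentence in target for w in sentence})
    vectors = []
    for sentence in target:
        tf = Counter(sentence)
        vectors.append([tf[w] * idf[w] for w in vocab])

    centroid = [sum(col) / len(vectors) for col in zip(*vectors)]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    return max(range(len(vectors)), key=lambda i: cosine(vectors[i], centroid))

# Hypothetical minimum-entropy community and one other community.
target = [
    ["share", "photo", "friend"],
    ["share", "photo"],
    ["share", "photo", "family", "album"],
]
others = [[["edit", "video"]]]
best = select_descriptor(target, others)
```

For this community the second sentence is selected, because it consists only of the words that all three sentences have in common and so lies nearest the centroid.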

Second, based on the publicly available user rating data, which it treats as a reflection of software popularity, the method first reduces the software features using randomized logistic regression, then learns a logistic regression model of the relationship between the reduced software features and software popularity, and identifies important software features by referring to the coefficient of each feature in the model.

Thirdly, the method mines the association relation among software features based on the download amount data of the software.

3.1 this method distinguishes between user-oriented feature association rules and designer-oriented feature association rules. The feature association rule reflects the co-occurrence relationship between features, i.e., whether a set of features co-occur frequently. The designer-oriented feature association rule represents whether a group of features are frequently included by software in the field from the perspective of software designers in the field. The user-oriented feature association rule shows whether a group of features are frequently liked by users in the field from the perspective of the users. The method of the invention considers the user-oriented feature association rules to be more meaningful for the recommendation of software features, because the purpose of the designer designing the software is to hope to contain the features that the user likes as much as possible.

3.2 The method regards each download of a software as one user's support for that software's group of features, and then extracts the user-oriented association rules between software features using an association rule mining method.
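One way to read 3.2 is as frequent-set mining in which each software is weighted by its download amount. The sketch below follows that reading with a simple level-wise (Apriori-style) search; the thresholds, the confidence-based rule lookup, and the function names are assumptions, and the patent does not name the specific mining algorithm.

```python
from itertools import combinations

def weighted_frequent_sets(software, min_support=0.3, max_size=3):
    """Return {frozenset(features): support}, weighting each software by its downloads.

    `software` is a list of (feature_set, download_amount) pairs. The support of a
    feature set is the share of all downloads that went to software containing it.
    """
    total = sum(downloads for _, downloads in software)
    all_features = sorted(set().union(*(features for features, _ in software)))
    frequent = {}
    for size in range(1, max_size + 1):
        found = False
        for candidate in combinations(all_features, size):
            items = frozenset(candidate)
            # Apriori pruning: every (size - 1)-subset must already be frequent.
            if size > 1 and any(items - {f} not in frequent for f in items):
                continue
            support = sum(d for features, d in software if items <= features) / total
            if support >= min_support:
                frequent[items] = support
                found = True
        if not found:
            break
    return frequent

def rules_for(feature, frequent, min_confidence=0.6):
    """Rules {feature} -> other, with confidence = support(both) / support(feature)."""
    base = frequent.get(frozenset([feature]))
    if not base:
        return []
    rules = []
    for items, support in frequent.items():
        if len(items) == 2 and feature in items:
            (other,) = items - {feature}
            confidence = support / base
            if confidence >= min_confidence:
                rules.append((other, round(confidence, 3)))
    return sorted(rules, key=lambda r: r[1], reverse=True)

# Hypothetical software: (features, download amount).
software = [
    ({"share", "filter", "album"}, 500),
    ({"share", "filter"}, 300),
    ({"album", "backup"}, 150),
    ({"backup"}, 50),
]
frequent = weighted_frequent_sets(software)
related = rules_for("share", frequent)
```

A designer who enters the important feature "share" would, under these assumptions, be shown "filter" first and "album" second, because the software that users download most combines those features.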

3.3 The method uses a directed graph, as shown in FIG. 3, to store the mined frequent feature sets; the association rules between features are implied by the relationships between the nodes in the graph. Each node in FIG. 3 stores one frequent feature set and its support. The graph resembles an inverted, compressed tree. Compression means that every stored node has a support different from that of any parent node that contains it: if a parent of a node is present in the tree and has the same support as the node, the node is not stored. Inversion means that the nodes are layered so that, apart from the root node, higher levels tend to store feature sets with more elements. The purpose of the inversion is to filter out, at insertion time, frequent feature sets that have the same support as their parent nodes, which saves storage space.
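The compression rule in 3.3 can be illustrated without modeling the graph layout of FIG. 3. The sketch below simply drops every frequent feature set that has a frequent superset with the same support, and shows that the dropped supports remain recoverable; the flat dictionary store, the tolerance used to compare supports, and the function names are assumptions.

```python
def compress(frequent):
    """Keep only sets that have no frequent superset with identical support."""
    kept = {}
    for items, support in frequent.items():
        absorbed = any(
            items < other and abs(support - other_support) < 1e-12
            for other, other_support in frequent.items()
        )
        if not absorbed:
            kept[items] = support
    return kept

def support_of(items, kept):
    """Recover the support of a frequent set from the compressed store."""
    candidates = [s for stored, s in kept.items() if items <= stored]
    return max(candidates) if candidates else None

# Hypothetical frequent feature sets with their supports.
frequent = {
    frozenset({"album"}): 0.65,
    frozenset({"filter"}): 0.8,
    frozenset({"share"}): 0.8,
    frozenset({"album", "filter"}): 0.5,
    frozenset({"album", "share"}): 0.5,
    frozenset({"filter", "share"}): 0.8,
    frozenset({"album", "filter", "share"}): 0.5,
}
kept = compress(frequent)
```

Seven frequent sets shrink to three stored sets here, and the support of a dropped set such as {"share"} is recovered as the largest support among the stored sets that contain it.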

The following description will be given by way of specific examples of the operation of the method of the present invention in practical use:

Step S1: use crawler software to crawl, from relevant websites on the Internet, the description text information of the software in a specific domain, including the feature descriptions of the software, its user rating data, and its download data.

Step S2: preprocess the crawled text data using natural language processing techniques: remove redundant software description texts, and perform sentence segmentation, stop-word removal, stemming, and dimension reduction on the description texts.

Step S3: construct a sentence similarity network, discover the sentence communities in it, and merge communities with high similarity.

Step S4: a feature descriptor is selected for each sentence community.

Step S5: identify important software features based on logistic regression:

1) Acquiring user scoring data of software;

2) for software whose user rating data is missing, calculating the similarity between that software and the other software according to their features, and taking the user rating of the most similar software as its user rating;

3) calculating the average user rating of all the software, and discretizing the user rating of each software into 1 or 0 according to how it compares with the average: a rating greater than the average becomes 1, and a rating less than the average becomes 0;

4) constructing a matrix of software features and user scores;

5) based on the constructed software features and the matrix of user scores, scoring the extracted software features by using random logistic regression, and reserving the features with the score not being 0;

6) reducing the constructed software features and the user scoring matrix based on the reduced feature set, and learning a relation model between the software features and the software user scoring by using logistic regression based on the reduced software features and the user scoring matrix;

7) based on the learned model, the more important software features are selected according to the coefficient size of each feature in the model.
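Sub-steps 2), 3), 4), 6) and 7) can be sketched in plain Python as follows. Several choices are assumptions for illustration: Jaccard overlap is used as the software similarity, a score equal to the average is mapped to 0, the logistic regression is fitted by simple gradient descent with a small L2 penalty, and features are ranked by signed coefficient. The randomized logistic regression of sub-step 5) is omitted for brevity, and the sample data is invented.

```python
import math

def jaccard(a, b):
    """Set-overlap similarity (an assumed stand-in for the similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def impute_ratings(features, ratings):
    """Sub-step 2: copy the rating of the most similar rated software."""
    rated = [app for app, r in ratings.items() if r is not None]
    filled = dict(ratings)
    for app, r in ratings.items():
        if r is None:
            nearest = max(rated, key=lambda o: jaccard(features[app], features[o]))
            filled[app] = ratings[nearest]
    return filled

def binarize(ratings):
    """Sub-step 3: 1 if above the mean rating, else 0 (ties go to 0 here)."""
    mean = sum(ratings.values()) / len(ratings)
    return {app: int(r > mean) for app, r in ratings.items()}

def fit_logistic(X, y, lr=0.5, epochs=2000, l2=0.01):
    """Sub-step 6: logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical software, features, and ratings; "E" has no rating.
features = {
    "A": {"share", "sync"},
    "B": {"share", "ads"},
    "C": {"sync", "backup"},
    "D": {"ads"},
    "E": {"share", "sync", "chat"},
}
ratings = {"A": 4.5, "B": 2.0, "C": 4.0, "D": 1.5, "E": None}

filled = impute_ratings(features, ratings)
labels = binarize(filled)
names = sorted(set().union(*features.values()))
apps = sorted(features)
# Sub-step 4: binary software-by-feature matrix plus the label vector.
X = [[1.0 if f in features[a] else 0.0 for f in names] for a in apps]
y = [labels[a] for a in apps]
weights, bias = fit_logistic(X, y)
# Sub-step 7: rank features by learned coefficient, largest first.
ranked = sorted(zip(names, weights), key=lambda t: t[1], reverse=True)
```

With this toy data the feature present in every above-average software receives the largest coefficient, and the feature present only in below-average software receives a negative one.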

Step S6: Mine user-oriented feature association rules based on the software download data.

Step S7: based on the mined user-oriented feature association rules, a software designer may enter the identified important software features to discover other features that have strong associations with the identified important software features.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A software commonality feature extraction method based on internet text description data is characterized by comprising the following steps:

obtaining the software characteristics of the Internet open text according to the software description text;

selecting important software features according to user scoring data;

wherein,

the obtaining of the software features of the Internet public text according to the software description text comprises:

preprocessing a software description text;

constructing a sentence similarity network, and measuring the similarity between sentences in the software description text by the following formula:

where idf is the inverse document frequency, s_i and s_j are two sentences, and w_k is the k-th word in a sentence;

finding sentence communities in the sentence similarity network: the sentence similarity network is a weighted network, and edges in the sentence similarity network represent the similarity between sentences; selecting, as the seed node for sentence community discovery, a node that has not been assigned to any community and is attached to the edge with the maximum weight in the sentence similarity network;

the suitability of a node for a community is measured by the following formula:

wherein E_in is the set of edges between nodes in the community, and E_out is the set of edges connecting nodes in the community with nodes outside the community;

determining feature descriptors of sentence communities: the entropy of each sentence community is measured by the following formula:

wherein,

2. The method for extracting the software commonality feature based on the internet text description data as claimed in claim 1, wherein the preprocessing the software description text comprises:

carrying out redundancy removal processing on the software description text;

3. The method of claim 1, wherein selecting the community with the least entropy for feature descriptor comprises

4. The method for extracting software commonality features based on internet text description data as claimed in claim 1, wherein the selecting important software features according to user score data comprises:

based on the constructed matrix of software features and user scores,

5. The method for extracting software commonality features based on internet text description data as claimed in claim 4, wherein the mining of user-oriented feature association relationship according to download amount data comprises:

constructing a matrix of the software features and the download quantity according to the download quantity data and the software features of the public text;