CN109670187B - Software commonality feature extraction method based on Internet text description data - Google Patents


Info

Publication number
CN109670187B
Authority
CN
China
Prior art keywords
software
sentence
community
user
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811625340.8A
Other languages
Chinese (zh)
Other versions
CN109670187A (en
Inventor
刘春�
喻杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University
Priority to CN201811625340.8A
Publication of CN109670187A
Application granted
Publication of CN109670187B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files

Abstract

The invention relates to the technical field of Internet information, and in particular to a software commonality feature extraction method based on Internet text description data. The method comprises the following steps: acquiring software description text, user rating data, and download count data from publicly available Internet text about software; obtaining software features from the software description text; selecting important software features according to the user rating data; mining user-oriented feature association relations according to the download count data; and searching the mined user-oriented feature association relations for other software features related to the important software features. Working from publicly available Internet text about software, the method uses natural language processing to mine the common features of software in a specific domain and to identify the more important ones, so that designers can quickly analyze the important common features of software in that domain.

Description

Software commonality feature extraction method based on Internet text description data
Technical Field
The invention relates to the technical field of internet information, in particular to a software commonality feature extraction method based on internet text description data.
Background
Based on the Kano model, the features of a piece of software can be divided into three categories: basic features, expected features, and excitement features. Basic features are features that users take for granted in the software of a domain; without them, users consider the software unacceptable. Expected features are features that users expect software designers and producers to implement and refine as far as possible: the better they are implemented and refined, the more satisfied users are with the software. Excitement features are features that users do not anticipate but that the software nevertheless provides; when such features are absent, user satisfaction is not affected. Clearly, basic and expected features are shared by all software in a domain and are the common features that all software in the domain should have. It is therefore important for a software designer who is about to enter a domain to analyze the common features of the software currently in that domain.
For inexpensive everyday products such as toothpaste or mobile phones, the designer of a new product can purchase comparable products on the market and analyze their common features manually. However, for software that is expensive, functionally complex, available in large numbers, or dependent on a specific operating environment, such manual analysis is difficult to carry out and costly. In this situation, it is valuable to crawl the software description information published on the Internet and to use natural language processing to extract the common features of the software in a domain automatically from that text.
However, current feature extraction methods often focus only on extracting a specified number of software features and lack further analysis of the importance of the extracted features. In practice, when a large number of software features are extracted, distinguishing the important features from the unimportant ones has clear practical significance.
Disclosure of Invention
The invention aims to provide a software commonality feature extraction method based on Internet text description data. Working from publicly available Internet text about software, the method uses natural language processing to mine the common features of software in a specific domain and to identify the more important ones, so that designers can quickly analyze the important common features of software in that domain.
To achieve the above purpose, the invention provides the following technical solution:
The invention provides a software commonality feature extraction method based on Internet text description data, which comprises the following steps:
acquiring software description text, user rating data, and download count data from publicly available Internet text about software;
obtaining software features from the software description text;
selecting important software features according to the user rating data;
mining user-oriented feature association relations according to the download count data.
Preferably, after mining the user-oriented feature association relations according to the download count data, the method further includes:
searching the mined user-oriented feature association relations for other software features related to the important software features.
In the foregoing method for extracting software commonality features based on Internet text description data, as a preferred solution, obtaining the software features from the software description text includes:
preprocessing the software description text;
constructing a sentence similarity network;
discovering sentence communities in the sentence similarity network;
determining a feature descriptor for each sentence community.
In the foregoing method for extracting software commonality features based on Internet text description data, as a preferred solution, preprocessing the software description text includes:
removing redundant software description texts;
performing sentence segmentation, stop-word removal, stemming, and dimension reduction on the software description text.
In the above method for extracting software commonality features based on Internet text description data, as a preferred solution, constructing the sentence similarity network includes:
measuring the similarity between sentences in the software description text by the following formula:
Figure GDA0003038476330000031
where idf is the inverse document frequency, s_i and s_j are two sentences, and w_k is the k-th word in a sentence.
In the above method for extracting software commonality features based on Internet text description data, as a preferred solution, discovering sentence communities in the sentence similarity network includes:
the sentence similarity network is a weighted network whose edges represent the similarity between sentences; a node that has not yet been assigned to a community and that is attached to the edge with the largest weight in the sentence similarity network is selected as the seed node for sentence community discovery.
In the above method for extracting software commonality features based on Internet text description data, as a preferred solution, after discovering the sentence communities in the sentence similarity network, the method further includes:
measuring the suitability of a node for a community by the following formula:
Figure GDA0003038476330000032
where E_in is the set of edges between nodes inside the community, and E_out is the set of edges connecting nodes inside the community with nodes outside it.
In the foregoing method for extracting software commonality features based on Internet text description data, as a preferred solution, determining the feature descriptor of a sentence community includes:
measuring the entropy of each sentence community by the following formula:
Figure GDA0003038476330000033
where
Figure GDA0003038476330000041
denotes the number of sentence communities that contain sentence s_i; the entropy of a sentence community represents the degree of overlap between that community and the other sentence communities;
selecting, from the sentence communities for which no feature descriptor has yet been selected, the community with the smallest entropy and selecting its feature descriptor.
In the above method for extracting software commonality features based on Internet text description data, as a preferred solution, selecting the community with the smallest entropy and selecting its feature descriptor includes:
regarding each sentence community for which no feature descriptor has yet been selected as a document that contains all the sentences in that community;
calculating the TF-IDF value of each word in the sentence community with the smallest current entropy;
converting each sentence in that community into a TF-IDF vector, and selecting the sentence closest to the center of the community as the feature descriptor of the community.
In the above method for extracting software commonality features based on Internet text description data, as a preferred solution, selecting important software features according to the user rating data includes:
calculating the average user rating over all software, and discretizing the user rating of each piece of software into 1 or 0 by comparing it with the average: a user rating greater than the average becomes 1, and a user rating less than the average becomes 0;
constructing a matrix of software features and user ratings according to the download count data and the user rating data from the publicly available software text;
based on the constructed matrix of software features and user ratings, scoring the extracted software features using randomized logistic regression and retaining the software features whose score is not 0;
reducing the constructed matrix of software features and user ratings accordingly, and using logistic regression to learn the relationship between the software features and the software user ratings, so that each software feature is assigned a coefficient related to the user rating;
selecting important software features according to the size of the coefficient of each software feature: the larger the coefficient of a software feature, the higher its priority for selection as an important software feature.
In the foregoing method for extracting software commonality features based on Internet text description data, as a preferred solution, mining the user-oriented feature association relations according to the download count data includes:
constructing a matrix of software features and download counts according to the download count data and the software features from the publicly available software text;
mining the user-oriented feature association relations from the matrix of software features and download counts using an association rule mining method.
Compared with the closest prior art, the technical solution provided by the invention has the following beneficial effects:
The invention provides a software commonality feature extraction method based on Internet text description data. Working from publicly available Internet text about software, it uses natural language processing to mine the common features of software in a specific domain; it also draws on users' rating data for the software to identify important software features; furthermore, it uses the download count data of the software to mine the association relations between features from the user's perspective.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. Wherein:
FIG. 1 is a schematic flow chart of data processing of a software commonality feature extraction method based on Internet text description data according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of steps of a method for extracting common features of software based on Internet text description data according to an embodiment of the present invention;
FIG. 3 is an exemplary diagram of storing mined frequent feature sets by a directed graph in an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In the description of the present invention, the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, which are for convenience of description of the present invention only and do not require that the present invention must be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. The terms "connected" and "connected" used herein should be interpreted broadly, and may include, for example, a fixed connection or a detachable connection; they may be directly connected or indirectly connected through intermediate members, and specific meanings of the above terms will be understood by those skilled in the art as appropriate.
In the context of the present invention, it is worth mentioning:
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for evaluating how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document and decreases with the frequency with which it appears across the corpus. Search engines often apply various forms of TF-IDF weighting as a measure of the relevance between a document and a user query; in addition to TF-IDF, Internet search engines use ranking methods based on link analysis to determine the order in which documents appear in search results. The main idea of TF-IDF is that if a word or phrase occurs with a high frequency (TF) in one article and rarely occurs in other articles, it is considered to have good class discrimination ability and to be suitable for classification. TF stands for term frequency and IDF for inverse document frequency. TF represents the frequency with which a term appears in a document d. The main idea of IDF is that the fewer documents contain a term t, that is, the smaller the number n of documents containing t, the larger the IDF and the better the category-distinguishing ability of t. If m documents in a class C contain the term t and k documents in other classes contain t, then the number of documents containing t is n = m + k. When m is large, n is also large, and the IDF value obtained from the IDF formula is small, which suggests that the category-distinguishing ability of t is weak.
In practice, however, if a term appears frequently in the documents of one class, the term represents the characteristics of that class well; such terms should be given a higher weight and selected as characteristic words of the class to distinguish its documents from those of other classes. This is a shortcoming of IDF.
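As a small worked illustration of the TF-IDF weighting described above, the following Python sketch computes weights for a toy corpus. It assumes one common variant, tf = count / document length and idf = log(N / n), where N is the number of documents and n the number of documents containing the term; the description does not state which variant the method uses. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # n for each term: the number of documents that contain it.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus: three already tokenized documents.
docs = [
    ["photo", "edit", "filter"],
    ["photo", "share"],
    ["photo", "edit", "crop"],
]
weights = tf_idf(docs)
```

In this toy corpus "photo" occurs in every document, so under the assumed variant its weight is zero, while "filter", which occurs in only one document, is weighted above "edit".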
The nodes in the present invention refer to sentences constituting a sentence similarity network.
Association rule mining is a rule-based machine learning method for discovering interesting relationships in large databases; its goal is to identify, by means of some metric, the strong rules present in the database.
The invention provides a software commonality feature extraction method based on Internet text description data, which comprises the following steps:
Step S101: acquiring software description text, user rating data, and download count data from publicly available Internet text about software. Crawler software is used to crawl the description text information of the relevant software in a specific domain from relevant websites on the Internet; this information comprises the feature descriptions of the software, the user rating data of the software, and the download count data of the software.
Step S102: preprocessing the software description text. The crawled text data are preprocessed using natural language processing techniques, including removing redundant software description texts and performing sentence segmentation, stop-word removal, stemming, and dimension reduction on the description texts.
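The preprocessing in step S102 can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the patent's implementation: the stop-word list, the toy suffix stripper `crude_stem` (a stand-in for a real stemmer), and the use of a minimum sentence frequency as the dimension-reduction step are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "your", "with", "for"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(descriptions, min_doc_freq=1):
    # 1) Redundancy removal: drop exact duplicate descriptions, keeping order.
    unique = list(dict.fromkeys(d.strip() for d in descriptions))
    # 2) Sentence segmentation on '.', '!' and '?'.
    sentences = [s.strip() for d in unique
                 for s in re.split(r"[.!?]+", d) if s.strip()]
    # 3) Tokenization, stop-word removal, and stemming.
    tokenized = [
        [crude_stem(w) for w in re.findall(r"[a-z]+", s.lower())
         if w not in STOP_WORDS]
        for s in sentences
    ]
    # 4) Dimension reduction (assumed form): drop words that occur in fewer
    #    than min_doc_freq sentences.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    return [[w for w in tokens if df[w] >= min_doc_freq] for tokens in tokenized]


sentences = preprocess([
    "Edit photos quickly. Share photos!",
    "Edit photos quickly. Share photos!",  # duplicate description, removed
])
```

Each returned item is one sentence represented as a list of stemmed words, which is the form the later sketches in this description assume.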
Step S103: constructing a sentence similarity network. When measuring the similarity between sentences, the method considers not only the importance of the words shared by the sentences but also the importance of those words within each sentence. When the sentence similarity network is constructed, the similarity between sentences in the software description text is measured by the following formula:
Figure GDA0003038476330000071
where idf is the inverse document frequency, s_i and s_j are two sentences, and w_k is the k-th word in a sentence.
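The similarity formula itself is an image that is not reproduced in this text, so the following Python sketch substitutes an assumed measure: an idf-weighted overlap in which the idf mass of the shared words is normalized by the idf mass of both sentences. It is meant only to show how a weighted sentence similarity network could be built from tokenized sentences; the smoothed idf, the function names, and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def idf_table(sentences):
    """Smoothed idf per word, treating each sentence as one document."""
    n = len(sentences)
    df = Counter(w for s in sentences for w in set(s))
    return {w: math.log(1 + n / df[w]) for w in df}


def similarity(s_i, s_j, idf):
    """Assumed measure: shared idf mass over the idf mass of both sentences."""
    shared = set(s_i) & set(s_j)
    total = sum(idf[w] for w in set(s_i)) + sum(idf[w] for w in set(s_j))
    return 2 * sum(idf[w] for w in shared) / total if total else 0.0


def build_network(sentences, threshold=0.0):
    """Return weighted edges {(i, j): similarity} with i < j."""
    idf = idf_table(sentences)
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        weight = similarity(sentences[i], sentences[j], idf)
        if weight > threshold:
            edges[(i, j)] = weight
    return edges


sentences = [["edit", "photo"], ["share", "photo"], ["sync", "cloud"]]
edges = build_network(sentences)  # only sentences 0 and 1 share a word
```

Nodes are sentence indices and edge weights are similarities, matching the weighted network that the community discovery step operates on.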
Step S104: discovering sentence communities in the sentence similarity network. Each sentence in the software description is taken as a potential software feature descriptor, and the sentence communities present in the sentence set of the software description text are discovered with an overlapping community discovery algorithm. Each sentence community represents one software feature. The sentence similarity network is a weighted network whose edges represent the similarity between sentences. A node that has not yet been assigned to a community and that is attached to the edge with the largest weight in the network is selected as the seed node for sentence community discovery, because two nodes connected by the edge with the largest weight are more likely to belong to the same sentence community. After the sentence communities in the sentence similarity network are discovered, the suitability of a node for a community is measured by the following formula:
Figure GDA0003038476330000072
where E_in is the set of edges between nodes inside the community, and E_out is the set of edges connecting nodes inside the community with nodes outside it.
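The suitability formula is likewise an image that is not reproduced here. The Python sketch below therefore assumes the fitness form commonly used in local fitness-based overlapping community detection, applied to edge weights: the total weight of E_in divided by the total weight of E_in and E_out raised to a power alpha, with a node's suitability taken as the change in community fitness when the node is included. The parameter `alpha` and all function names are assumptions; only the seed rule (an unassigned node attached to the heaviest edge) is taken directly from the text.

```python
def community_fitness(edges, community, alpha=1.0):
    """edges: {(u, v): weight}; community: set of nodes."""
    w_in = sum(w for (u, v), w in edges.items()
               if u in community and v in community)
    w_out = sum(w for (u, v), w in edges.items()
                if (u in community) != (v in community))
    total = w_in + w_out
    return w_in / total ** alpha if total else 0.0


def node_fitness(edges, community, node, alpha=1.0):
    """Fitness of the community with `node` minus its fitness without it."""
    return (community_fitness(edges, community | {node}, alpha)
            - community_fitness(edges, community - {node}, alpha))


def pick_seed(edges, assigned):
    """Unassigned node attached to the heaviest edge, or None if all assigned."""
    candidates = [(w, u if u not in assigned else v)
                  for (u, v), w in edges.items()
                  if u not in assigned or v not in assigned]
    return max(candidates, key=lambda c: c[0])[1] if candidates else None


edges = {(0, 1): 0.9, (1, 2): 0.4, (2, 3): 0.8}
seed = pick_seed(edges, assigned=set())   # endpoint of the 0.9 edge
gain = node_fitness(edges, {seed}, 1)     # positive: node 1 fits the seed's community
```

A community would be grown from the seed by repeatedly adding the neighbor with the largest positive suitability; that loop is omitted to keep the sketch short.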
Step S105: determining the feature descriptor of each sentence community. A representative sentence is selected from each sentence community as the descriptor of the feature that the community represents. The entropy of each sentence community is calculated to measure the degree of overlap between that community and the other communities, and selection starts from the community that has no feature descriptor yet and has the smallest entropy. To select the sentence closest to the center of the current community, each community is treated as a document, the weight of each word in the current community is calculated with the TF-IDF method, and each sentence in the current community is converted into a TF-IDF vector. When determining the feature descriptors of the sentence communities, the entropy of each sentence community is measured by the following formula:
Figure GDA0003038476330000081
where
Figure GDA0003038476330000082
denotes the number of sentence communities that contain sentence s_i. The entropy of a sentence community represents the degree of overlap between that community and the other sentence communities. The community with the smallest entropy is selected from the sentence communities for which no feature descriptor has yet been selected; each sentence community for which no feature descriptor has yet been selected is regarded as a document containing all the sentences in that community; the TF-IDF value of each word in the community with the smallest current entropy is calculated; each sentence in that community is converted into a TF-IDF vector; and the sentence closest to the center of the community is selected as its feature descriptor.
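The descriptor selection at the end of step S105 can be sketched as follows. The sketch treats each community as one document, weights the words of the target community by TF-IDF, turns each sentence into a vector over those weights, and returns the sentence nearest the mean vector. The text does not say how "closest to the center" is measured, so cosine similarity is an assumption here, as are the smoothed idf and the function name `select_descriptor`.

```python
import math
from collections import Counter


def select_descriptor(communities, target):
    """communities: list of communities, each a list of tokenized sentences.
    Returns the index (within communities[target]) of the sentence closest
    to the community center."""
    n = len(communities)
    # Each community is one document made of all of its sentences.
    docs = [[w for sentence in c for w in sentence] for c in communities]
    df = Counter(w for d in docs for w in set(d))
    tf = Counter(docs[target])
    weight = {w: (tf[w] / len(docs[target])) * math.log(1 + n / df[w]) for w in tf}
    vocab = sorted(weight)
    # Sentence vector: the TF-IDF weight of each vocabulary word it contains.
    vectors = [[weight[w] if w in sentence else 0.0 for w in vocab]
               for sentence in communities[target]]
    center = [sum(column) / len(vectors) for column in zip(*vectors)]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    return max(range(len(vectors)), key=lambda i: cosine(vectors[i], center))


communities = [
    [["edit", "photo"], ["edit", "photo", "filter"], ["crop", "photo"]],
    [["share", "link"], ["share", "photo"]],
]
best = select_descriptor(communities, target=0)
descriptor = communities[0][best]
```

Euclidean distance to the center would be an equally plausible reading of the text and may pick a different sentence in borderline cases.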
Step S106: selecting important software features according to the user rating data. The extracted features are first reduced using randomized logistic regression, and the important software features are then identified using logistic regression on the reduced feature set and the user rating data. Specifically: the average user rating over all software is calculated, and the user rating of each piece of software is discretized into 1 or 0 by comparing it with the average, so that a user rating greater than the average becomes 1 and a user rating less than the average becomes 0; a matrix of software features and user ratings is constructed according to the download count data and the user rating data from the publicly available software text; based on this matrix, the extracted software features are scored using randomized logistic regression, and the software features whose score is not 0 are retained; the matrix of software features and user ratings is reduced accordingly, and logistic regression is used to learn the relationship between the software features and the software user ratings, so that each software feature is assigned a coefficient related to the user rating; important software features are then selected according to the size of these coefficients: the larger the coefficient of a software feature, the higher its priority for selection as an important software feature.
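The rating discretization and coefficient-based ranking in step S106 can be sketched as follows. To stay self-contained, the sketch fits logistic regression with plain batch gradient descent in pure Python and omits the preceding randomized logistic regression reduction step entirely; a production system would more likely rely on an established library. The feature names, the ratings, the hyperparameters `lr` and `epochs`, and the treatment of a rating exactly equal to the average as 0 are all assumptions.

```python
import math


def discretize(ratings):
    """1 if a rating is above the mean rating, otherwise 0."""
    mean = sum(ratings) / len(ratings)
    return [1 if r > mean else 0 for r in ratings]


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Plain batch gradient descent; returns (coefficients, intercept)."""
    n, d = len(X), len(X[0])
    coef, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            z = bias + sum(c * x for c, x in zip(coef, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += err
            for k in range(d):
                grad_w[k] += err * row[k]
        bias -= lr * grad_b / n
        coef = [c - lr * g / n for c, g in zip(coef, grad_w)]
    return coef, bias


def rank_features(names, X, ratings):
    """Features sorted by learned coefficient, largest first."""
    coef, _ = fit_logistic(X, discretize(ratings))
    return sorted(zip(names, coef), key=lambda pair: pair[1], reverse=True)


# Hypothetical data: rows are software products, columns are binary features.
names = ["sync", "ads", "filters"]
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0], [1, 1, 0]]
ratings = [4.8, 4.5, 2.1, 1.9, 3.0]
ranking = rank_features(names, X, ratings)
```

Features near the top of `ranking` are the candidates for important software features in the sense of step S106.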
Step S107: mining the user-oriented feature association relations according to the download count data. The download count reflects how popular the software is among users; by searching for important software features according to the download count, the associated features that users expect or accept can be found more easily from the user's perspective. A reverse-ordered, compressed frequent feature set graph is used to store the mined frequent feature sets and feature association rules, which saves storage space. When the user-oriented feature association relations are mined, a matrix of software features and download counts is constructed according to the download count data and the software features from the publicly available software text, and the user-oriented feature association relations are mined from this matrix using an association rule mining method.
Step S108: searching the mined user-oriented feature association relations for other software features related to the important software features. The software designer may enter the identified important software features to discover other features that are strongly associated with them.
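One way to read steps S107 and S108 is that every download counts as one user "vote" for the feature set of the downloaded software, so the support of a feature set is the share of all downloads that went to software containing it. The Python sketch below follows that reading with a brute-force enumeration of single features and feature pairs; it is not Apriori or FP-growth, and the thresholds, function names, and sample data are hypothetical.

```python
from itertools import combinations


def weighted_support(products, max_size=2):
    """products: list of (feature_set, downloads).
    Support of a feature set = share of all downloads that include it."""
    total = sum(downloads for _, downloads in products)
    support = {}
    for features, downloads in products:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(features), size):
                key = frozenset(combo)
                support[key] = support.get(key, 0.0) + downloads / total
    return support


def mine_rules(products, min_support=0.3, min_confidence=0.7):
    """Return rules (a, b, support, confidence) meaning 'a implies b'."""
    support = weighted_support(products)
    rules = []
    for itemset, sup in support.items():
        if len(itemset) != 2 or sup < min_support:
            continue
        for a in itemset:
            (b,) = itemset - {a}
            confidence = sup / support[frozenset([a])]
            if confidence >= min_confidence:
                rules.append((a, b, round(sup, 3), round(confidence, 3)))
    return rules


def related_features(rules, feature):
    """Step S108: features strongly associated with a given important feature."""
    return sorted(b for a, b, _, _ in rules if a == feature)


# Hypothetical data: (features of a software product, its download count).
products = [
    ({"sync", "share"}, 700),
    ({"sync", "share", "filters"}, 200),
    ({"filters"}, 100),
]
rules = mine_rules(products)
partners = related_features(rules, "sync")
```

Weighting by downloads is what makes the mined rules user-oriented: a feature pair found in a few very popular products can outweigh one found in many rarely downloaded products.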
The invention provides a method that, based on the software description text data published on the Internet, uses natural language processing to mine the common features of software in a specific domain and to identify the more important ones among them.
First, the method uses the overlapping community discovery algorithm LMF to mine, as far as possible, all software features from the software product description text.
1.1 Each sentence in the software description text is regarded as a descriptor of a software feature, and the software features contained in the software description text are identified by constructing a sentence similarity network and discovering overlapping sentence communities in it. Each sentence community represents one software feature, and the most representative sentence in the community is selected as the descriptor of that feature.
1.2 When the sentence similarity network is constructed, the similarity between sentences is measured by the following formula, where idf is the inverse document frequency, s_i and s_j are two sentences, and w_k is the k-th word in a sentence. The inverse document frequency (IDF) is the reciprocal of the document frequency. In measuring the similarity between sentences, this formula considers not only the importance of the words shared by the sentences but also the influence of those shared words within each sentence.
Figure GDA0003038476330000091
1.3 The conventional overlapping community discovery algorithm LMF measures the suitability of a node for a community using the following formula. Since LMF is designed for unweighted networks, k_in is the degree of connectivity between the nodes inside the community, and k_out is the degree of connectivity between the nodes inside the community and the nodes outside it.
Figure GDA0003038476330000101
Because the sentence similarity network constructed by the method is a weighted network in which edges represent the similarity between sentences, the method of the invention modifies this formula as follows, where E_in is the set of edges between nodes inside the community and E_out is the set of edges connecting nodes inside the community with nodes outside it.
Figure GDA0003038476330000102
1.4 The conventional overlapping community discovery algorithm LMF randomly selects a node that has not been assigned to any community as the seed node for community discovery, whereas the method of the invention selects a node that has not been assigned to any community and that is attached to the edge with the largest weight. The method assumes that two nodes connected by the edge with the largest weight are more likely to belong to the same community; selecting a node attached to that edge as the seed node therefore accelerates the discovery of overlapping communities.
1.5 When a sentence is selected from each community as its feature descriptor, the entropy of each community is measured by the formula given below; the entropy represents the degree of overlap between one community and the other communities. Here
Figure GDA0003038476330000103
denotes the number of communities that contain sentence s_i. The method of the invention iteratively selects, from the communities for which no feature descriptor has yet been selected, the community with the smallest entropy and selects its feature descriptor.
Figure GDA0003038476330000104
1.6 When the feature descriptor is selected for the community with the smallest current entropy, each community for which no feature descriptor has yet been selected is regarded as a document containing all the sentences in that community. The TF-IDF value of each word in the community with the smallest current entropy is then calculated, each sentence in that community is converted into a TF-IDF vector, and the sentence closest to the center of the community is selected as the feature descriptor of the community.
Secondly, based on the published user rating data, which it regards as a reflection of software popularity, the method first reduces the software features using randomized logistic regression, then learns a logistic regression model relating the reduced software features to software popularity, and identifies important software features by referring to the coefficient of each feature in the model.
Thirdly, the method mines the association relations among software features based on the download count data of the software.
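The order in which communities receive their descriptors (item 1.5) can be sketched as follows. Only the membership count, that is, the number of communities containing each sentence, is taken from the text; the entropy formula is an image that is not reproduced, so `overlap_score` below is a stand-in chosen for the example (the mean logarithm of the membership counts, which is zero for a community that shares no sentence with any other). The static sort is also a simplification of the iterative selection described above.

```python
import math
from collections import Counter


def membership_counts(communities):
    """n[s]: the number of communities that contain sentence id s."""
    return Counter(s for community in communities for s in set(community))


def overlap_score(community, n):
    # Stand-in for the entropy (assumption): mean log membership count.
    return sum(math.log(n[s]) for s in community) / len(community)


def processing_order(communities):
    """Community indices, least overlapping first."""
    n = membership_counts(communities)
    return sorted(range(len(communities)),
                  key=lambda i: overlap_score(communities[i], n))


# Hypothetical communities given as lists of sentence ids; sentence 2 is shared.
communities = [[0, 1, 2], [2, 3], [4, 5]]
order = processing_order(communities)
```

Processing the least overlapping communities first means their descriptors are chosen before those of communities whose sentences are shared with several others.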
3.1 The method distinguishes between user-oriented and designer-oriented feature association rules. A feature association rule reflects the co-occurrence relationship between features, that is, whether a group of features frequently occur together. A designer-oriented feature association rule indicates, from the perspective of software designers in the domain, whether a group of features is frequently included in the software of the domain. A user-oriented feature association rule indicates, from the perspective of users, whether a group of features is frequently liked by users in the domain. The method of the invention considers user-oriented feature association rules more meaningful for recommending software features, because a designer's purpose in designing software is to include as many features that users like as possible.
3.2 The method regards each download of the software as one user's support for a group of software features, and then extracts the user-oriented association rules between software features using an association rule mining method.
3.3 The method uses a directed graph, as shown in FIG. 3, to store the mined frequent feature sets; the association rules between features are also contained in the relationships between the nodes of the graph. Each node in FIG. 3 stores one frequent feature set and its support. The graph resembles a reverse-ordered, compressed tree structure. The compression means that each node stored in the tree differs in support from the parent nodes that contain it: if a parent node of a node appears in the tree and has the same support as that node, the node is not stored in the tree. The reverse order means that the nodes of the tree are layered so that, apart from the root node, higher-level nodes tend to store feature sets with more elements; its purpose is to filter out, at insertion time, frequent feature sets that have the same support as their parent nodes, thereby saving storage space.
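The compression rule in item 3.3 can be illustrated without building the full graph of FIG. 3. The Python sketch below keeps a frequent feature set only if no larger set that contains it has the same support, and it inserts larger sets first so that each new set needs to be compared only with sets already kept. The flat dictionary used in place of the directed graph, the function name `compress`, and the sample supports (given as integer download counts) are assumptions made for the example.

```python
def compress(frequent):
    """frequent: {frozenset_of_features: support}.
    Drop any set that has a proper superset with the same support, because
    its support can be read off that superset."""
    # Reverse order: larger sets are inserted before the sets they contain.
    ordered = sorted(frequent.items(), key=lambda kv: len(kv[0]), reverse=True)
    kept = {}
    for itemset, sup in ordered:
        redundant = any(itemset < other and sup == other_sup
                        for other, other_sup in kept.items())
        if not redundant:
            kept[itemset] = sup
    return kept


# Hypothetical frequent feature sets with supports as download counts.
frequent = {
    frozenset({"sync"}): 900,
    frozenset({"share"}): 900,
    frozenset({"sync", "share"}): 900,
    frozenset({"filters"}): 300,
}
stored = compress(frequent)  # keeps {"sync", "share"} and {"filters"}
```

Comparing only against kept sets is sufficient: if a superset was itself dropped, a still larger kept set with the same support exists and also contains the set being inserted.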
The operation of the method in practical use is described below by way of a specific example:
Step S1: Use crawler software to crawl, from relevant websites on the Internet, the descriptive text information of software in a specific field, including the feature descriptions of the software, its user rating data, and its download data.
Step S2: Preprocess the crawled text data using natural language processing techniques, including removing redundant software description texts and performing sentence segmentation, stop-word removal, stemming, and dimensionality reduction on the description texts.
Step S3: Construct a sentence similarity network, discover sentence communities in it, and merge communities with high similarity.
Step S4: Select a feature descriptor for each sentence community.
Step S5: Identify important software features based on logistic regression:
1) Acquire the user scoring data of the software;
2) For software whose user score is missing, compute the similarity between that software and the other software based on their features, and use the user score of the most similar software as its user score;
3) Compute the average user score over all software, and discretize the user score of each software to 1 or 0 by comparing it with the average: 1 if it is greater than the average user score, 0 if it is less;
4) Construct a matrix of software features and user scores;
5) Based on this matrix, score the extracted software features using randomized logistic regression, and retain the features whose score is not 0;
6) Reduce the matrix of software features and user scores to the retained feature set, and use logistic regression on the reduced matrix to learn a model of the relationship between software features and software user scores;
7) Based on the learned model, select the more important software features according to the magnitude of each feature's coefficient in the model.
Step S6: Mine user-oriented feature association rules based on the software download data.
Step S7: Using the mined user-oriented feature association rules, a software designer can input the identified important software features to discover other features that are strongly associated with them.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A software commonality feature extraction method based on Internet text description data, characterized by comprising the following steps:
acquiring the software description text, user scoring data, and download amount data of Internet software public texts;
obtaining software features of the Internet public texts according to the software description text;
selecting important software features according to the user scoring data;
mining user-oriented feature association relationships according to the download amount data;
wherein,
obtaining the software features of the Internet public texts according to the software description text comprises:
preprocessing the software description text;
constructing a sentence similarity network, wherein the similarity between sentences in the software description text is measured by the following formula:
Figure FDA0003038476320000011
where idf is the inverse document frequency, s_i and s_j are two sentences, and w_k is the k-th word in a sentence;
discovering sentence communities in the sentence similarity network: the sentence similarity network is a weighted network whose edges represent the similarity between sentences; a node that has not yet been assigned to a community and is attached to the edge with the maximum weight in the sentence similarity network is selected as the seed node for sentence community discovery;
the suitability of a node for a community is measured by the following formula:
Figure FDA0003038476320000012
wherein E_in is the set of edges between nodes within a community, and E_out is the set of edges connecting nodes in the community with nodes outside the community;
determining feature descriptors of sentence communities: the entropy of each sentence community is measured by the following formula:
Figure FDA0003038476320000021
wherein
Figure FDA0003038476320000022
represents the sentence communities containing sentence s_i; the entropy of a sentence community reflects the degree of overlap between that sentence community and the other sentence communities;
and selecting, from the sentence communities for which no feature descriptor has yet been selected, the community with the minimum entropy, and selecting a feature descriptor for it.
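The seed-selection rule stated in claim 1 (pick an unassigned node attached to the heaviest edge) can be sketched as follows. The similarity, suitability, and entropy formulas are given only as figures and are not reproduced here, so the edge weights below are hypothetical values; the function name pick_seed and the edge dictionary layout are likewise assumptions.

```python
def pick_seed(edges, assigned):
    """Return an unassigned node attached to the heaviest edge that touches one.

    edges maps (u, v) pairs to similarity weights; assigned is the set of nodes
    already placed in a community. Ties are broken by first occurrence.
    """
    best_node, best_weight = None, float("-inf")
    for (u, v), weight in edges.items():
        for node in (u, v):
            if node not in assigned and weight > best_weight:
                best_node, best_weight = node, weight
    return best_node


# Hypothetical sentence similarity network with four sentences.
edges = {("s1", "s2"): 0.9, ("s2", "s3"): 0.4, ("s3", "s4"): 0.7}

print(pick_seed(edges, assigned=set()))         # s1
print(pick_seed(edges, assigned={"s1", "s2"}))  # s3
```

Once s1 and s2 belong to a community, the heaviest edge touching an unassigned node is s3-s4, so the next community grows from s3.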
2. The software commonality feature extraction method based on Internet text description data according to claim 1, wherein preprocessing the software description text comprises:
removing redundancy from the software description text;
and performing sentence segmentation, stop-word removal, stemming, and dimensionality reduction on the software description text.
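The preprocessing of claim 2 can be illustrated with the standard library alone. This is a rough sketch: the stop-word list is a tiny hypothetical sample, naive_stem is a crude suffix stripper standing in for a real stemmer, redundancy removal is approximated by dropping sentences that reduce to identical tokens, and dimensionality reduction is not shown.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "with", "your", "you", "can", "is"}


def naive_stem(word):
    """Strip a few common suffixes; a stand-in for a proper stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Split text into sentences, drop stop words, stem, and skip duplicate sentences."""
    seen, result = set(), []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = [naive_stem(word)
                  for word in re.findall(r"[a-z]+", sentence.lower())
                  if word not in STOP_WORDS]
        key = tuple(tokens)
        if tokens and key not in seen:
            seen.add(key)
            result.append(tokens)
    return result


text = "Share photos with your friends. Share photos with your friends! Backing up files is easy."
sentences = preprocess(text)
print(sentences)
```

The repeated sentence is kept only once, and each remaining sentence becomes a list of stemmed content words that later steps can compare.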
3. The method according to claim 1, wherein selecting the feature descriptor for the community with the minimum entropy comprises:
regarding each sentence community for which no feature descriptor has yet been selected as a document, the document containing all sentences in that sentence community;
calculating, by the TF-IDF method, the TF-IDF value of each word in the sentence community that currently has the minimum entropy;
converting each sentence in that community into a TF-IDF vector, and selecting the sentence closest to the center of the sentence community as the feature descriptor of the sentence community.
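The descriptor selection of claim 3 can be sketched as follows. The claim does not fix the exact TF-IDF variant or the distance measure, so this example assumes a smoothed IDF, binary word presence per sentence, and Euclidean distance to the mean vector; the function name pick_descriptor and the sample communities are hypothetical.

```python
import math
from collections import Counter


def pick_descriptor(communities, target):
    """Return the index of the sentence in communities[target] closest to its TF-IDF centre.

    communities is a list of communities, each a list of tokenised sentences.
    Every community is treated as one document when computing TF-IDF.
    """
    docs = [Counter(word for sentence in community for word in sentence)
            for community in communities]
    term_freq = docs[target]
    total = sum(term_freq.values())
    weight = {}
    for word, count in term_freq.items():
        doc_freq = sum(1 for doc in docs if word in doc)
        idf = math.log((1 + len(docs)) / (1 + doc_freq)) + 1  # smoothed IDF (assumption)
        weight[word] = (count / total) * idf
    vocab = sorted(weight)
    vectors = [[weight[word] if word in sentence else 0.0 for word in vocab]
               for sentence in communities[target]]
    centre = [sum(column) / len(vectors) for column in zip(*vectors)]
    distances = [math.dist(vector, centre) for vector in vectors]
    return distances.index(min(distances))


# Hypothetical communities of already preprocessed sentences.
communities = [
    [["share", "photo"], ["share", "photo", "friend"], ["share", "video", "friend", "album"]],
    [["backup", "file"], ["backup", "photo"]],
]

index = pick_descriptor(communities, target=0)
print(communities[0][index])
```

The chosen sentence is the one whose vocabulary is most typical of the community as a whole, which is why it serves as a compact description of the feature.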
4. The software commonality feature extraction method based on Internet text description data according to claim 1, wherein selecting important software features according to the user scoring data comprises:
calculating the average user score of all software, and discretizing the user score of each software into 1 or 0 according to its magnitude relative to the average user score, i.e., a software user score greater than the average user score becomes 1, and a software user score less than the average user score becomes 0;
constructing a matrix of software features and user scores according to the download amount data and user scoring data of the software public text;
based on the constructed matrix of software features and user scores, scoring the extracted software features by using randomized logistic regression, and retaining the software features whose score is not 0;
reducing the constructed matrix of software features and user scores, and using logistic regression to learn the relationship between software features and software user scores, so as to assign each software feature a coefficient related to the software user score;
selecting important software features according to the magnitude of each software feature's coefficient, i.e., the larger a software feature's coefficient, the higher its priority for selection as an important software feature.
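The final ranking step of claim 4 can be illustrated with a small logistic regression fitted by gradient descent in plain Python. This sketch covers only the coefficient-based ranking; the preceding randomized logistic regression filter is not shown, and the learning rate, epoch count, L2 penalty, feature names, and data are all assumptions chosen for the example.

```python
import math


def fit_logistic(X, y, lr=0.5, epochs=2000, l2=0.01):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, m = len(X), len(X[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * m, 0.0
        for row, label in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += err
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
        # L2 penalty on the weights keeps coefficients finite on separable data.
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def rank_features(names, X, y):
    """Sort features by learned coefficient, largest first."""
    weights, _ = fit_logistic(X, y)
    return sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)


# Hypothetical matrix: rows are software, columns mark presence of each feature,
# and y holds the discretized user scores.
names = ["sync", "share", "ads"]
X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
y = [1, 1, 0, 0, 1, 0]

ranking = rank_features(names, X, y)
print([name for name, _ in ranking])
```

In this toy data the highly scored software is exactly the software that has "sync", so that feature receives the largest coefficient and is ranked first.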
5. The software commonality feature extraction method based on Internet text description data according to claim 4, wherein mining the user-oriented feature association relationships according to the download amount data comprises:
constructing a matrix of software features and download amounts according to the download amount data and software features of the public text;
and mining the user-oriented feature association relationships from the matrix of software features and download amounts by using an association rule mining method.
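The mining step of claim 5 can be sketched as follows. The claim does not name a specific association rule mining algorithm, so this example simply enumerates feature pairs and computes download-weighted support and confidence; the thresholds, the function name mine_rules, and the sample data are assumptions.

```python
from itertools import combinations


def mine_rules(apps, min_support=0.3, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) rules between pairs of features.

    apps is a list of (features, downloads); each download counts as one
    supporting transaction for the features of the downloaded software.
    """
    total = sum(downloads for _, downloads in apps)

    def support(items):
        return sum(downloads for features, downloads in apps if items <= features) / total

    all_features = sorted(set().union(*(features for features, _ in apps)))
    rules = []
    for a, b in combinations(all_features, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, round(pair_support, 3), round(confidence, 3)))
    return rules


# Hypothetical feature sets and download counts.
apps = [
    ({"sync", "share", "backup"}, 500),
    ({"sync", "share"}, 300),
    ({"backup"}, 200),
]

rules = mine_rules(apps)
for rule in rules:
    print(rule)
```

A designer who enters an important feature such as "share" can then look up the rules with that feature on the left-hand side to find strongly associated features.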

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811625340.8A CN109670187B (en) 2018-12-28 2018-12-28 Software commonality feature extraction method based on Internet text description data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811625340.8A CN109670187B (en) 2018-12-28 2018-12-28 Software commonality feature extraction method based on Internet text description data

Publications (2)

Publication Number Publication Date
CN109670187A CN109670187A (en) 2019-04-23
CN109670187B true CN109670187B (en) 2021-06-22

Family

ID=66146559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811625340.8A Active CN109670187B (en) 2018-12-28 2018-12-28 Software commonality feature extraction method based on Internet text description data

Country Status (1)

Country Link
CN (1) CN109670187B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115711A (en) * 2020-07-30 2020-12-22 中国民用航空上海航空器适航审定中心 Extraction of airworthiness instruction problem features based on natural language

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001312419A (en) * 2000-02-22 2001-11-09 Fujitsu Ltd Software overlap degree evaluating device and recording medium with recorded software overlap degree evaluating program
CN102968408A (en) * 2012-11-23 2013-03-13 西安电子科技大学 Method for identifying substance features of customer reviews
US10437894B2 (en) * 2015-05-07 2019-10-08 TCL Research America Inc. Method and system for app search engine leveraging user reviews

Also Published As

Publication number Publication date
CN109670187A (en) 2019-04-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant