WO2017028566A1 - Method and apparatus for collecting cloud environment resource focus point, and server

Info

Publication number
WO2017028566A1
WO2017028566A1 PCT/CN2016/082253 CN2016082253W WO2017028566A1 WO 2017028566 A1 WO2017028566 A1 WO 2017028566A1 CN 2016082253 W CN2016082253 W CN 2016082253W WO 2017028566 A1 WO2017028566 A1 WO 2017028566A1
Authority
WO
WIPO (PCT)
Prior art keywords
vocabulary
preset condition
satisfies
resource
log file
Prior art date
Application number
PCT/CN2016/082253
Other languages
French (fr)
Chinese (zh)
Inventor
周莉
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2017028566A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/06 Generation of reports
    • H04L43/062 Generation of reports related to network traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/50 Testing arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Definitions

  • The present invention relates to, but is not limited to, the field of cloud computing resource technology.
  • In the related art, the "cloud" is a resource pool composed of self-maintaining, self-managing virtual computing resources such as computing servers, storage servers, bandwidth resources, software, and applications.
  • "Cloud computing" centralizes all computing resources into a highly virtualized resource pool that is created dynamically. How to determine which cloud environment resources (both physical and virtual) operators pay attention to, so that the cloud can provide resources to them efficiently, is therefore a matter of concern.
  • In the related art, algorithms for obtaining cloud environment resource focus points generally run in stand-alone mode.
  • This document provides a method, an apparatus, and a server for collecting cloud environment resource focus points, which can reduce the time needed to obtain those focus points.
  • An embodiment of the present invention provides a method for collecting a focus of a cloud environment resource, including:
  • Each of the importance degree feature vectors is used to represent a weight of each of the words corresponding to the first preset condition in a uniform resource locator URL webpage.
  • the step of obtaining a cloud environment resource focus point according to the importance degree feature vector includes:
  • the corresponding cloud environment resource attention point is obtained.
  • the step of obtaining, by the summary, the vocabulary that satisfies the first preset condition comprises:
  • the obtained vocabulary that satisfies the third preset condition is saved as the vocabulary that satisfies the first preset condition.
  • before collecting the vocabulary that satisfies the first preset condition, the collecting method further includes:
  • the text to be classified is segmented to obtain the vocabulary that satisfies the third preset condition.
  • the step of performing the word segmentation of the to-be-classified text to obtain the vocabulary that meets the third preset condition comprises:
  • the step of obtaining the vocabulary that meets the third preset condition according to the order parameter includes:
  • the vocabulary satisfying the third preset condition is obtained by using the order parameter and the ranking algorithm.
  • before extracting the resource-related Uniform Resource Locator (URL) from the sample log file, the collecting method further includes:
  • the sample log file is obtained according to log data of the initial log file.
  • the step of obtaining the sample log file according to the log data of the initial log file includes:
  • the information required to open the corresponding web page is saved as the sample log file.
  • An embodiment of the present invention further provides a device for collecting a focus of a cloud environment resource, including:
  • a first processing module configured to collectively obtain a vocabulary that satisfies a first preset condition
  • a calculation module configured to calculate an importance degree feature vector of the vocabulary that satisfies the first preset condition
  • a second processing module configured to obtain a cloud environment resource focus point according to the importance degree feature vector
  • Each of the importance degree feature vectors is used to represent a weight of each of the words corresponding to the first preset condition in a uniform resource locator URL webpage.
  • the second processing module includes:
  • a transform submodule configured to perform data transformation on the importance degree feature vector to obtain a corresponding frequency
  • a first sorting submodule configured to sequentially arrange the frequencies
  • the first obtaining submodule is configured to sequentially acquire the frequency that satisfies the second preset condition after the arrangement
  • the first processing submodule is configured to obtain a corresponding cloud environment resource concern point according to the obtained frequency.
  • the first processing module includes:
  • a summary sub-module configured to summarize the vocabulary satisfying the third preset condition and the corresponding frequency of occurrence in the resource-related URL webpage
  • a second sorting submodule configured to sort the words satisfying the third preset condition according to the frequency
  • a second acquiring sub-module configured to sequentially acquire the sorted vocabulary that satisfies the third preset condition until the frequency corresponding to the acquired vocabulary, relative to the frequency corresponding to all vocabulary satisfying the third preset condition, reaches a preset threshold;
  • here, reaching the preset threshold means that the ratio of the frequency corresponding to the acquired vocabulary that satisfies the third preset condition to the sum of the frequencies corresponding to all vocabulary satisfying the third preset condition reaches the preset threshold;
  • the first saving submodule is configured to save the acquired vocabulary that satisfies the third preset condition as the vocabulary that satisfies the first preset condition.
  • the collecting device further includes:
  • An extraction module configured to extract a resource-related URL from a sample log file before the first processing module performs a related operation
  • a crawling module configured to crawl the webpage content of the resource-related URL, and use the crawled webpage content as the text to be classified;
  • a third processing module configured to perform word segmentation on the to-be-classified text, to obtain the vocabulary that meets the third preset condition.
  • the third processing module includes:
  • a second processing sub-module configured to perform word segmentation on the to-be-classified text to obtain a resource-related vocabulary
  • a transformation sub-module configured to convert the resource-related vocabulary into a digital vector
  • a third processing submodule configured to process the digital vector to obtain a parameter feature vector
  • a fourth processing submodule configured to obtain an order parameter according to the parameter feature vector
  • a fifth processing submodule configured to obtain the vocabulary that satisfies the third preset condition according to the order parameter.
  • the fifth processing submodule is configured to:
  • the vocabulary satisfying the third preset condition is obtained by using the order parameter and the ranking algorithm.
  • the collecting device further includes:
  • the collecting module is configured to periodically collect the initial log file before the extracting module performs related operations
  • the fourth processing module is configured to obtain the sample log file according to the log data of the initial log file.
  • the fourth processing module includes:
  • a third obtaining submodule configured to: when receiving the information request sent by the network client according to the webpage open instruction, obtain, according to the information request, information required to open the corresponding webpage from the initial log file;
  • the second saving submodule is configured to save the information required to open the corresponding webpage as the sample log file.
  • the embodiment of the invention further provides a server, comprising: the foregoing collecting device of a cloud environment resource focus point.
  • the method for collecting cloud environment resource focus points collects the vocabulary that satisfies the first preset condition, calculates the importance degree feature vector of that vocabulary, and from it obtains the cloud environment resource focus points.
  • cloud environment resource focus points can thus be calculated, analyzed, mined, and extracted reliably and efficiently, which addresses the long time that traditional algorithms in the related art take to obtain them.
  • FIG. 1 is a schematic flowchart of a method for collecting a focus point of a cloud environment resource according to Embodiment 1 of the present invention
  • FIG. 2 is a schematic structural diagram of a device for collecting a cloud environment resource focus point according to Embodiment 2 of the present invention.
  • the traditional algorithms in the related art for obtaining cloud environment resource focus points usually run in stand-alone mode. They are easily limited by hardware performance such as processor speed and storage capacity, take a long time, and scale poorly.
  • as user logs grow, the complexity of such an algorithm grows polynomially and its performance deteriorates.
  • a method for collecting a cloud environment resource focus point in Embodiment 1 of the present invention includes:
  • Step 11: Collect the vocabulary that satisfies the first preset condition
  • Step 12: Calculate the importance degree feature vector of the vocabulary that satisfies the first preset condition
  • Step 13: Obtain a cloud environment resource focus point according to the importance degree feature vector
  • Each feature value of the importance degree feature vector is used to represent a weight of a corresponding vocabulary that satisfies the first preset condition in a uniform resource locator URL webpage, and may be a TF-IDF feature vector.
  • the first preset condition is essentially a condition for obtaining the number of words.
  • the method for collecting cloud environment resource focus points provided by Embodiment 1 of the present invention collects the vocabulary that satisfies the first preset condition, calculates the importance degree feature vector of that vocabulary, and from it obtains the cloud environment resource focus points.
  • it thereby calculates, analyzes, mines, and extracts cloud environment resource focus points reliably and efficiently, which addresses the long time that traditional algorithms in the related art take to obtain them.
  • the step of obtaining a cloud environment resource focus point according to the importance degree feature vector includes: performing data transformation on the importance degree feature vector to obtain corresponding frequencies; arranging the frequencies in order (optionally in descending order); sequentially acquiring the arranged frequencies that satisfy the second preset condition; and obtaining the corresponding cloud environment resource focus points according to the acquired frequencies.
  • sequentially acquiring the arranged frequencies that satisfy the second preset condition includes acquiring them in descending order.
  • the second preset condition is essentially the limit condition of the number of cloud environment resource concerns that the administrator needs to obtain.
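The ranking step above can be sketched in Python. This is a minimal illustration, not the patented implementation: it assumes the "data transformation" is a per-keyword sum of feature values across pages, and the names `top_focus_points`, `feature_vectors`, and `limit` are hypothetical.

```python
from collections import defaultdict

def top_focus_points(feature_vectors, limit):
    """Return the `limit` keywords with the largest summed weights.

    feature_vectors maps a URL page to {keyword: weight}; summing the
    weights per keyword is an assumed form of the "data transformation".
    """
    totals = defaultdict(float)
    for weights in feature_vectors.values():
        for keyword, weight in weights.items():
            totals[keyword] += weight
    # Descending by summed weight; ties broken alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [keyword for keyword, _ in ranked[:limit]]

vectors = {
    "page1": {"storage": 0.5, "bandwidth": 0.1},
    "page2": {"storage": 0.25, "cpu": 0.5},
}
print(top_focus_points(vectors, 2))
```

Here `limit` plays the role of the second preset condition, that is, the number of focus points the administrator wants.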
  • the step of collecting the vocabulary that satisfies the first preset condition includes: summarizing the vocabulary that satisfies the third preset condition together with the frequency with which each word appears in the resource-related URL webpages; sorting that vocabulary by frequency; sequentially acquiring the sorted vocabulary until the frequency corresponding to the acquired vocabulary, relative to the frequency corresponding to all vocabulary satisfying the third preset condition, reaches a preset threshold (optionally 9/10); and saving the acquired vocabulary as the vocabulary that satisfies the first preset condition.
  • here, reaching the preset threshold means that the ratio of the frequency corresponding to the acquired vocabulary that satisfies the third preset condition to the sum of the frequencies corresponding to all vocabulary satisfying the third preset condition reaches the preset threshold.
  • the frequency corresponding to the acquired vocabulary that satisfies the third preset condition refers to the frequency of the word currently being acquired;
  • the frequency corresponding to all vocabulary satisfying the third preset condition refers to the highest frequency among the frequencies of all such vocabulary.
  • the sequentially acquiring the sorted words satisfying the third preset condition includes:
  • the sorted words satisfying the third preset condition are sequentially acquired in descending order.
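The selection loop can be sketched as follows. This is an illustration under one stated assumption: it follows the reading given later in this document, in which the sum of the frequencies of the words acquired so far is compared with the sum of the frequencies of all candidate words. The function name and sample counts are hypothetical.

```python
def select_high_frequency(freqs, threshold=0.9):
    """Take words in descending frequency order until the words taken
    so far account for at least `threshold` of the total frequency."""
    total = sum(freqs.values())
    selected = []
    if total == 0:
        return selected
    covered = 0
    # Descending by frequency; ties broken alphabetically for determinism.
    for word, count in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        selected.append(word)
        covered += count
        if covered >= threshold * total:
            break
    return selected

freqs = {"storage": 50, "cpu": 30, "bandwidth": 15, "misc": 5}
print(select_high_frequency(freqs))
```

With the optional threshold of 9/10, the three most frequent words cover 95 of 100 occurrences, so the loop stops before the rare word.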
  • the collecting method further includes: extracting resource-related URLs from the sample log file; crawling the webpage content of the resource-related URLs (with a web crawler) and using the crawled webpage content as the text to be classified; and segmenting the text to be classified to obtain the vocabulary that satisfies the third preset condition.
  • the third preset condition is essentially a lower limit condition for the number of times the vocabulary appears in the content of the web page.
  • the step of segmenting the text to be classified to obtain the vocabulary that satisfies the third preset condition includes: segmenting the text to be classified to obtain a resource-related vocabulary; converting the resource-related vocabulary into digital vectors; processing the digital vectors to obtain a parameter feature vector; obtaining an order parameter according to the parameter feature vector; and obtaining the vocabulary that satisfies the third preset condition according to the order parameter.
  • processing the digital vectors to obtain the parameter feature vector includes: aligning the digital vectors, then applying zero-mean processing and normalization to obtain the parameter feature vector.
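A minimal sketch of this preprocessing, under assumptions the text does not confirm: "alignment" is taken to mean padding or truncating to a common length, normalization is taken to mean scaling to unit Euclidean length, and character code points stand in for the keyword-to-number conversion. The function name is hypothetical.

```python
import math

def to_parameter_vector(codes, length):
    """Align a numeric vector to `length`, subtract its mean,
    and scale it to unit Euclidean norm."""
    if length <= 0:
        raise ValueError("length must be positive")
    # Alignment (assumed): zero-pad short vectors, truncate long ones.
    padded = (list(codes) + [0] * length)[:length]
    mean = sum(padded) / length
    centered = [x - mean for x in padded]
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0:
        return centered  # constant input: nothing to scale
    return [x / norm for x in centered]

word_codes = [ord(ch) for ch in "cpu"]  # hypothetical encoding of a keyword
vec = to_parameter_vector(word_codes, 4)
print(vec)
```

After this step every keyword vector has the same length, zero mean, and unit norm, which is the usual input form for prototype-matching pattern recognizers.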
  • the step of obtaining the vocabulary that satisfies the third preset condition according to the order parameter comprises: using the order parameter and the ranking algorithm to obtain the vocabulary that satisfies the third preset condition.
  • the collecting method further includes: periodically collecting an initial log file (the cloud environment log file to be analyzed); and obtaining the sample log file (a file composed of the information required to open a web page) according to the log data of the initial log file.
  • periodically collecting the initial log file includes: installing a system timer on each node that needs to collect logs, starting the system timer, and setting a system timer task that periodically collects the initial log file.
  • the step of obtaining the sample log file according to the log data of the initial log file includes: when an information request sent by the network client in response to a webpage open instruction is received, obtaining from the initial log file, according to the information request, the information required to open the corresponding webpage; and saving that information as the sample log file.
  • the vocabulary satisfying the first preset condition corresponds to the high-frequency best keywords
  • the frequencies satisfying the second preset condition are the frequencies that meet the threshold set by the user
  • the vocabulary satisfying the third preset condition corresponds to the best keywords
  • the resource-related vocabulary corresponds to the vocabulary of the webpage content of the resource-related URLs.
  • the apparatus for collecting cloud environment resource focus points in Embodiment 2 of the present invention includes:
  • the first processing module 21 is configured to collectively obtain a vocabulary that satisfies the first preset condition
  • the calculating module 22 is configured to calculate an importance degree feature vector of the vocabulary that satisfies the first preset condition
  • the second processing module 23 is configured to obtain a cloud environment resource focus point according to the importance degree feature vector
  • Each feature value of the importance degree feature vector represents the weight, in a Uniform Resource Locator (URL) webpage, of the corresponding vocabulary that satisfies the first preset condition; the vector is optionally a TF-IDF feature vector.
  • the first preset condition is essentially a condition for obtaining the number of words.
  • the collecting device for cloud environment resource focus points provided by Embodiment 2 of the present invention collects the vocabulary that satisfies the first preset condition, calculates the importance degree feature vector of that vocabulary, and from it obtains the cloud environment resource focus points. It can thus calculate, analyze, mine, and extract cloud environment resource focus points reliably and efficiently, reducing the time that traditional algorithms in the related art take to obtain them.
  • the second processing module includes: a transform submodule configured to perform data transformation on the importance degree feature vector to obtain corresponding frequencies; a first sorting submodule configured to arrange the frequencies in order (optionally in descending order); a first acquiring submodule configured to sequentially acquire the arranged frequencies that satisfy the second preset condition; and a first processing submodule configured to obtain the corresponding cloud environment resource focus points according to the acquired frequencies.
  • the second preset condition is essentially the limit condition of the number of cloud environment resource concerns that the administrator needs to obtain.
  • the first processing module includes: a summary submodule configured to summarize the vocabulary that satisfies the third preset condition together with the frequency with which each word appears in the resource-related URL webpages; a second sorting submodule configured to sort that vocabulary by frequency; a second acquiring submodule configured to sequentially acquire the sorted vocabulary until the frequency corresponding to the acquired vocabulary, relative to the frequency corresponding to all vocabulary satisfying the third preset condition, reaches a preset threshold (optionally 9/10); and a first saving submodule configured to save the acquired vocabulary that satisfies the third preset condition as the vocabulary that satisfies the first preset condition.
  • the collecting device further includes: an extracting module configured to extract resource-related URLs from the sample log file before the first processing module performs its operation; a crawling module configured to crawl the webpage content of the resource-related URLs with a web crawler and to use the crawled webpage content as the text to be classified; and a third processing module configured to segment the text to be classified to obtain the vocabulary that satisfies the third preset condition.
  • the third preset condition is essentially a lower limit condition for the number of times the vocabulary appears in the content of the web page.
  • the third processing module includes: a second processing submodule configured to segment the text to be classified to obtain a resource-related vocabulary; a transformation submodule configured to convert the resource-related vocabulary into digital vectors; a third processing submodule configured to process the digital vectors to obtain a parameter feature vector (by aligning the digital vectors and then applying zero-mean processing and normalization); a fourth processing submodule configured to obtain an order parameter according to the parameter feature vector; and a fifth processing submodule configured to obtain the vocabulary that satisfies the third preset condition according to the order parameter.
  • the fifth processing submodule is configured to obtain the vocabulary that satisfies the third preset condition by using the order parameter and the ranking algorithm.
  • the collecting device further includes: a collecting module configured to periodically collect the initial log file (the cloud environment log file to be analyzed) before the extracting module performs its operation (by installing a system timer on each node that needs to collect logs, starting the system timer, and setting a system timer task); and a fourth processing module configured to obtain the sample log file (a file composed of the information required to open a web page) according to the log data of the initial log file.
  • the fourth processing module includes: a third acquiring submodule configured to, when an information request sent by the network client in response to a webpage open instruction is received, obtain from the initial log file, according to the information request, the information required to open the corresponding webpage; and a second saving submodule configured to save the information required to open the corresponding webpage as the sample log file.
  • the vocabulary satisfying the first preset condition corresponds to the high-frequency best keywords
  • the frequencies satisfying the second preset condition are the frequencies that meet the threshold set by the user
  • the vocabulary satisfying the third preset condition corresponds to the best keywords
  • the resource-related vocabulary corresponds to the vocabulary of the webpage content of the resource-related URLs.
  • the collecting device for cloud environment resource focus points provided by Embodiment 2 of the present invention can be realized by improving a server in the related art so that the server implements the functions of the collecting device.
  • the front-end server installs the system timer CRON on each node in the cloud environment that needs to collect logs, adds CRON to the startup script and starts the CRON service, and edits the /etc/crontab file to set the tasks that the system performs periodically,
  • that is, the periodic collection of log files on each node. Note that editing this file requires root privileges.
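As an illustration of such a periodic task, the sketch below formats one line in the system-wide /etc/crontab layout (minute, hour, day of month, month, day of week, user, command). The ten-minute schedule and the script path are hypothetical; the text does not specify them.

```python
def crontab_entry(minute, hour, user, command):
    """Build one /etc/crontab line that runs `command` as `user`
    on every day, at the given minute and hour fields."""
    return f"{minute} {hour} * * * {user} {command}"

# Hypothetical collection script, run as root every ten minutes.
entry = crontab_entry("*/10", "*", "root", "/opt/collect_logs.sh")
print(entry)
```

The user field is present because this is the system crontab format; per-user crontabs omit it.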
  • the periodically collected log files are saved in a unified format and the log data is preprocessed. Preprocessing includes: when a web page is opened (that is, when a user opens a web page of the cloud platform) and the network client sends a request assembled in a certain format (such as a string), obtaining from the log file (the initial log file) the information required to open the corresponding webpage. The required information includes any one or more of the following:
  • operation start time, operation end time, client IP, user information, and access address.
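A record carrying these five fields could be parsed as sketched below. The tab-separated layout, the timestamp format, and the sample values are assumptions made for illustration; the text only says that the logs are saved in a unified format.

```python
from dataclasses import dataclass
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout

@dataclass
class LogRecord:
    start: datetime
    end: datetime
    client_ip: str
    user: str
    url: str

def parse_record(line):
    """Split one tab-separated line into the five fields listed above."""
    start, end, client_ip, user, url = line.rstrip("\n").split("\t")
    return LogRecord(
        datetime.strptime(start, TIME_FORMAT),
        datetime.strptime(end, TIME_FORMAT),
        client_ip,
        user,
        url,
    )

sample = ("2016-05-16 10:00:00\t2016-05-16 10:00:02\t10.0.0.8\talice\t"
          "http://cloud.example/resource/vm/42")
record = parse_record(sample)
print(record.url, (record.end - record.start).total_seconds())
```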
  • the front-end server stores the above information (the required information obtained) as a log file in a unified format (the sample log file), transfers it across the network to the HDFS (distributed file system) of the cloud platform, and stores it in LZO format.
  • LZO, short for Lempel-Ziv-Oberhumer, is a data compression algorithm that focuses on decompression speed.
  • for the log file (sample log file) stored in the HDFS, the back-end server extracts the resource-related URLs from the accessed URLs, crawls the webpage content corresponding to each resource URL with a web crawler, and keeps that content as the text to be classified;
  • it segments the resource URL webpage content with word segmentation technology to obtain keywords (the resource-related vocabulary); looks the keywords up in the international code library to convert them into digital vectors; aligns the digital vectors and then applies zero-mean processing and normalization
  • to obtain the parameter feature vector; and performs synergetic (cooperative) neural network pattern recognition on the parameter feature vector to obtain the order parameter, which is used to obtain the best keywords from the database.
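The first of these steps, extracting resource-related URLs from the accessed URLs, might look like the sketch below. The text does not say how "resource-related" is decided, so the path-prefix rule and the sample URLs are purely hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical rule: a URL is resource-related if its path starts with
# one of these prefixes. The patent text does not define the criterion.
RESOURCE_PREFIXES = ("/resource/",)

def resource_urls(urls):
    """Keep resource-related URLs, dropping duplicates, in first-seen order."""
    seen = set()
    result = []
    for url in urls:
        path = urlparse(url).path
        if path.startswith(RESOURCE_PREFIXES) and url not in seen:
            seen.add(url)
            result.append(url)
    return result

accessed = [
    "http://cloud.example/resource/vm/42",
    "http://cloud.example/login",
    "http://cloud.example/resource/vm/42",
    "http://cloud.example/resource/storage/7",
]
print(resource_urls(accessed))
```

Deduplicating before crawling avoids fetching the same page once per log line.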
  • the back-end server summarizes the best keywords and feeds them into MapReduce (a distributed computing framework) to obtain the frequency with which each best keyword appears.
  • the best keywords are sorted by frequency of occurrence from largest to smallest.
  • "the ratio of the word frequency of the selected words to the total word frequency reaches 9 to 10" means that the ratio of the sum of the frequencies of all acquired words satisfying the third preset condition to
  • the sum of the frequencies of all words satisfying the third preset condition reaches 9 to 10.
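The frequency count can be pictured as a map step that emits one pair per keyword occurrence and a reduce step that sums the pairs per keyword. The sketch below runs both steps in a single Python process to show the data flow only; it is not Hadoop code, and the sample pages are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(page_keywords):
    """Map: emit a (keyword, 1) pair for every keyword occurrence on a page."""
    return [(keyword, 1) for keyword in page_keywords]

def reduce_phase(pairs):
    """Reduce: sum the emitted counts for each keyword."""
    counts = defaultdict(int)
    for keyword, n in pairs:
        counts[keyword] += n
    return dict(counts)

pages = [["storage", "cpu", "storage"], ["storage", "bandwidth"]]
mapped = chain.from_iterable(map_phase(page) for page in pages)
frequencies = reduce_phase(mapped)
print(frequencies)
```

In a real cluster the map calls would run on different machines and the framework would group pairs by keyword before reducing.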
  • TF-IDF stands for term frequency-inverse document frequency.
  • Equation (1), the formula for calculating the TF-IDF feature value of a high-frequency best keyword, is:
  • Equation (2), the formula for calculating the tf value, is:
  • Equation (3), the formula for calculating the idf value, is:
  • D is the collection of all URL web pages
  • d is a specific URL web page
  • t_n is the n-th high-frequency word, that is, a feature
  • N is the total number of selected best keywords
  • FeatureVector is the feature vector
  • Number of Times is the number of times.
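The three formulas referred to above are not reproduced in this text. The following is a reconstruction offered as an assumption, not as the original equations: it uses the symbols defined above and the standard TF-IDF definition, which matches the prose description of TF and IDF given later in this document.

```latex
\begin{align}
\mathrm{tfidf}(t_n, d, D) &= \mathrm{tf}(t_n, d) \cdot \mathrm{idf}(t_n, D) \\
\mathrm{tf}(t_n, d) &= \mathrm{NumberOfTimes}(t_n, d) \\
\mathrm{idf}(t_n, D) &= \log \frac{|D|}{\left|\{\, d \in D : t_n \in d \,\}\right|}
\end{align}
```

Under the same assumption, the feature vector of a page d would be FeatureVector(d) = (tfidf(t_1, d, D), ..., tfidf(t_N, d, D)), with one component per selected best keyword.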
  • the TF-IDF feature vectors are transformed with MapReduce (the distributed parallel computing model of the Hadoop framework) to obtain the frequency sum of each group of feature values (the multiple TF-IDF feature values of one high-frequency best keyword). The sums are arranged in descending order and, according to the administrator's setting, the corresponding number of cloud environment resource focus points is taken in that order.
  • MapReduce is a key technology of cloud computing. It is a software architecture and programming model proposed by Google for parallel computation over large-scale data. MapReduce decomposes all operations of the system into a mapping function, Map, and a reduction function, Reduce. The Map function splits large-scale data into multiple small data sets and distributes them to multiple machines for parallel processing. The Reduce function aggregates the results of the Map operations from each machine. Together, Map and Reduce achieve distributed parallel computing;
  • TF is the number of times the keyword appears in a URL page.
  • IDF is a measure of the general importance of the keyword. It can be obtained by dividing the total number of sample files by the number of sample files containing the keyword and taking the logarithm of the quotient. Multiplying TF by IDF gives the importance of a word for a URL page.
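The calculation described here can be sketched directly in Python. This is a single-process illustration under stated assumptions: TF is the raw count, the logarithm is natural, and a keyword that appears on no page gets an IDF of zero. The function name and sample pages are hypothetical.

```python
import math

def tf_idf(pages, keywords):
    """Return {url: [tf * idf for each keyword]}.

    tf is the number of times the keyword occurs on the page;
    idf is log(total pages / pages containing the keyword).
    """
    n_pages = len(pages)
    vectors = {}
    for url, words in pages.items():
        vector = []
        for keyword in keywords:
            tf = words.count(keyword)
            containing = sum(1 for w in pages.values() if keyword in w)
            idf = math.log(n_pages / containing) if containing else 0.0
            vector.append(tf * idf)
        vectors[url] = vector
    return vectors

pages = {"u1": ["storage", "storage", "cpu"], "u2": ["cpu"]}
vectors = tf_idf(pages, ["storage", "cpu"])
print(vectors)
```

A keyword that appears on every page, such as "cpu" here, gets an IDF of zero and therefore does not distinguish any page.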
  • the front-end server and the back-end server mentioned in the third embodiment of the present invention may be integrated into one server, or may exist in two servers, which is not limited herein.
  • The method for collecting cloud environment resource focus points provided by the third embodiment of the present invention filters a large number of URLs, extracts the URLs related to cloud resources, and performs TF-IDF feature extraction on the URL webpage content using MapReduce. This not only addresses the time, storage, and computation bottlenecks of finding cloud resource focus points in massive log analysis, but also finds the cloud environment resource focus points accurately and improves the utilization of cloud environment resources.
  • The solution provided by the embodiment of the present invention can accurately and efficiently compute, analyze, and mine a large number of user logs, and can thereby efficiently extract, in real time, the cloud environment resource focus points that users care about most in the logs;
  • The algorithm has a short running time and is scalable. It addresses the problems of traditional algorithms that run in stand-alone mode: they are easily constrained by hardware performance such as processor speed and storage capacity, and as user logs grow, the complexity of the algorithm grows polynomially and its performance keeps degrading.
  • Many of the functional components described in this specification are referred to as modules/sub-modules to more particularly emphasize the independence of their implementation.
  • the modules/sub-modules may be implemented in software for execution by various types of processors.
  • An identified executable code module can comprise one or more physical or logical blocks of computer instructions, which can be constructed, for example, as an object, procedure, or function. Nonetheless, the executable code of an identified module need not be physically located together; it may consist of different instructions stored in different locations that, when logically combined, constitute the module and achieve the stated purpose of the module.
  • the executable code module can be a single instruction or a plurality of instructions, and can even be distributed across multiple different code segments, distributed among different programs, and distributed across multiple memory devices.
  • operational data may be identified within the modules and may be implemented in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed at different locations (including on different storage devices), and may at least partially exist as an electronic signal on a system or network.
  • Where a module can be implemented in software, then, considering the level of hardware technology in the related art, those skilled in the art can, cost aside, build corresponding hardware circuits to implement the corresponding functions, including conventional Very Large Scale Integration (VLSI) circuits or gate arrays and related semiconductors such as logic chips, transistors, or other discrete components.
  • VLSI Very Large Scale Integration
  • the modules can also be implemented with programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices, and the like.
  • All or part of the steps of the above embodiments may also be implemented using integrated circuits. These steps may be fabricated separately as individual integrated circuit modules, or multiple modules or steps may be fabricated as a single integrated circuit module.
  • the devices/function modules/functional units in the above embodiments may be implemented by a general-purpose computing device, which may be centralized on a single computing device or distributed over a network of multiple computing devices.
  • When the device/function module/functional unit in the above embodiment is implemented in the form of a software function module and sold or used as a stand-alone product, it can be stored in a computer readable storage medium.
  • the above mentioned computer readable storage medium may be a read only memory, a magnetic disk or an optical disk or the like.
  • The words that satisfy the first preset condition are collected, the importance feature vector of those words is calculated, and the cloud environment resource focus points are obtained from it. The focus points can thus be computed, analyzed, mined, and extracted reliably and efficiently, which reduces the long time that existing algorithms take to obtain cloud environment resource focus points.
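The equations referenced in the list above are not reproduced in this text. For orientation only, the verbal descriptions of TF and IDF given there match the standard TF-IDF formulation, which can be written as below. The symbol $n_{t_n,d}$ (the number of times $t_n$ appears in page $d$) and the absence of any further normalization are assumptions, not taken from the patent's own equations.

```latex
\mathrm{tf}(t_n, d) = n_{t_n,d}, \qquad
\mathrm{idf}(t_n, D) = \log\frac{|D|}{\left|\{\, d \in D : t_n \in d \,\}\right|}, \qquad
\mathrm{tfidf}(t_n, d, D) = \mathrm{tf}(t_n, d)\cdot\mathrm{idf}(t_n, D)
```

Here $D$, $d$, and $t_n$ are as defined in the list, with $n = 1, \dots, N$.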

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and an apparatus for collecting a cloud environment resource focus point, and a server. The method for collecting a cloud environment resource focus point comprises: summarizing words satisfying a first preset condition; calculating importance eigenvectors of the words satisfying the first preset condition; and obtaining a cloud environment resource focus point according to the importance eigenvectors, the importance eigenvector being used to indicate a weight of a corresponding word satisfying the first condition in a uniform resource locator (URL) webpage.

Description

Method, apparatus, and server for collecting cloud environment resource focus points
Technical Field
The present invention relates to, but is not limited to, the field of cloud computing resource technology.
Background
As is well known, the "cloud" in the related art consists of virtual computing resources, such as computing servers, storage servers, bandwidth resources, software, and applications, that can maintain and manage themselves; the "cloud" is a resource pool. "Cloud computing" gathers all computing resources into a dynamically created, highly virtualized resource pool. How to identify operators' focus points regarding cloud environment resources (both physical and virtual), and thereby help operators use cloud resources efficiently, is therefore a question that has drawn much attention.
In the related art, algorithms for obtaining cloud environment resource focus points usually run in stand-alone mode.
Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
Provided herein are a method, an apparatus, and a server for collecting cloud environment resource focus points, which can reduce the time needed to obtain cloud environment resource focus points.
An embodiment of the present invention provides a method for collecting cloud environment resource focus points, including:
collecting words that satisfy a first preset condition;
calculating an importance feature vector of the words that satisfy the first preset condition;
obtaining cloud environment resource focus points according to the importance feature vector;
wherein each feature value in the importance feature vector represents the weight, in one Uniform Resource Locator (URL) webpage, of the corresponding word that satisfies the first preset condition.
Optionally, the step of obtaining cloud environment resource focus points according to the importance feature vector includes:
performing a data transformation on the importance feature vector to obtain corresponding frequencies;
sorting the frequencies;
obtaining, in order, the sorted frequencies that satisfy a second preset condition;
obtaining the corresponding cloud environment resource focus points according to the obtained frequencies.
Optionally, the step of collecting words that satisfy the first preset condition includes:
collecting the words that satisfy a third preset condition, together with the frequency at which each occurs in the resource-related URL webpages;
sorting the words that satisfy the third preset condition according to the frequency;
obtaining, in order, the sorted words that satisfy the third preset condition until the frequency corresponding to the obtained words, relative to the frequency corresponding to all the words that satisfy the third preset condition, reaches a preset threshold;
saving the obtained words that satisfy the third preset condition as the words that satisfy the first preset condition.
Optionally, before collecting the words that satisfy the first preset condition, the collection method further includes:
extracting resource-related URLs from a sample log file;
crawling the webpage content of the resource-related URLs and using the crawled webpage content as the text to be classified;
segmenting the text to be classified into words to obtain the words that satisfy the third preset condition.
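The first of these steps, extracting resource-related URLs from a sample log file, can be sketched as follows. The patent does not state how a URL is judged to be resource-related, so the keyword list `RESOURCE_HINTS`, the function name, and the log line format are all hypothetical choices made for illustration.

```python
import re

# Matches http/https URLs up to the next whitespace character.
URL_RE = re.compile(r"https?://\S+")

# Hypothetical hints that mark a URL as resource-related (an assumption,
# not taken from the patent).
RESOURCE_HINTS = ("cpu", "memory", "storage", "vm", "bandwidth")


def extract_resource_urls(log_lines, hints=RESOURCE_HINTS):
    """Return the distinct resource-related URLs, in order of first appearance."""
    seen = set()
    urls = []
    for line in log_lines:
        for url in URL_RE.findall(line):
            lowered = url.lower()
            if url not in seen and any(hint in lowered for hint in hints):
                seen.add(url)
                urls.append(url)
    return urls
```

The returned URLs would then be handed to a crawler, which is outside the scope of this sketch.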
Optionally, the step of segmenting the text to be classified into words to obtain the words that satisfy the third preset condition includes:
segmenting the text to be classified into words to obtain resource-related words;
converting the resource-related words into numeric vectors;
processing the numeric vectors to obtain parameter feature vectors;
obtaining order parameters according to the parameter feature vectors;
obtaining the words that satisfy the third preset condition according to the order parameters.
Optionally, the step of obtaining the words that satisfy the third preset condition according to the order parameters includes:
obtaining the words that satisfy the third preset condition by using the order parameters and a ranking algorithm.
Optionally, before extracting the resource-related Uniform Resource Locator (URL) from the sample log file, the collection method further includes:
periodically collecting an initial log file;
obtaining the sample log file according to the log data of the initial log file.
Optionally, the step of obtaining the sample log file according to the log data of the initial log file includes:
upon receiving an information request sent by a web client in response to a webpage-open instruction, obtaining from the initial log file, according to the information request, the information required to open the corresponding webpage;
saving the information required to open the corresponding webpage as the sample log file.
An embodiment of the present invention further provides an apparatus for collecting cloud environment resource focus points, including:
a first processing module, configured to collect words that satisfy a first preset condition;
a calculation module, configured to calculate an importance feature vector of the words that satisfy the first preset condition;
a second processing module, configured to obtain cloud environment resource focus points according to the importance feature vector;
wherein each feature value in the importance feature vector represents the weight, in one Uniform Resource Locator (URL) webpage, of the corresponding word that satisfies the first preset condition.
Optionally, the second processing module includes:
a transform submodule, configured to perform a data transformation on the importance feature vector to obtain corresponding frequencies;
a first sorting submodule, configured to sort the frequencies;
a first obtaining submodule, configured to obtain, in order, the sorted frequencies that satisfy a second preset condition;
a first processing submodule, configured to obtain the corresponding cloud environment resource focus points according to the obtained frequencies.
Optionally, the first processing module includes:
a collecting submodule, configured to collect the words that satisfy a third preset condition, together with the frequency at which each occurs in the resource-related URL webpages;
a second sorting submodule, configured to sort the words that satisfy the third preset condition according to the frequency;
a second obtaining submodule, configured to obtain, in order, the sorted words that satisfy the third preset condition until the frequency corresponding to the obtained words, relative to the frequency corresponding to all the words that satisfy the third preset condition, reaches a preset threshold;
wherein this stopping criterion means: until the ratio of the sum of the frequencies of the obtained words that satisfy the third preset condition to the sum of the frequencies of all the words that satisfy the third preset condition reaches a preset threshold;
a first saving submodule, configured to save the obtained words that satisfy the third preset condition as the words that satisfy the first preset condition.
Optionally, the collection apparatus further includes:
an extraction module, configured to extract resource-related URLs from a sample log file before the first processing module performs its operations;
a crawling module, configured to crawl the webpage content of the resource-related URLs and use the crawled webpage content as the text to be classified;
a third processing module, configured to segment the text to be classified into words to obtain the words that satisfy the third preset condition.
Optionally, the third processing module includes:
a second processing submodule, configured to segment the text to be classified into words to obtain resource-related words;
a conversion submodule, configured to convert the resource-related words into numeric vectors;
a third processing submodule, configured to process the numeric vectors to obtain parameter feature vectors;
a fourth processing submodule, configured to obtain order parameters according to the parameter feature vectors;
a fifth processing submodule, configured to obtain the words that satisfy the third preset condition according to the order parameters.
Optionally, the fifth processing submodule is configured to:
obtain the words that satisfy the third preset condition by using the order parameters and a ranking algorithm.
Optionally, the collection apparatus further includes:
a collecting module, configured to periodically collect an initial log file before the extraction module performs its operations;
a fourth processing module, configured to obtain the sample log file according to the log data of the initial log file.
Optionally, the fourth processing module includes:
a third obtaining submodule, configured to, upon receiving an information request sent by a web client in response to a webpage-open instruction, obtain from the initial log file, according to the information request, the information required to open the corresponding webpage;
a second saving submodule, configured to save the information required to open the corresponding webpage as the sample log file.
An embodiment of the present invention further provides a server, including the above apparatus for collecting cloud environment resource focus points.
The beneficial effects of the above technical solutions of the embodiments of the present invention are as follows:
In the above solution, the method for collecting cloud environment resource focus points collects the words that satisfy the first preset condition, calculates the importance feature vector of those words, and from it obtains the cloud environment resource focus points. It can compute, analyze, mine, and extract cloud environment resource focus points reliably and efficiently, and addresses the problem in the related art that traditional algorithms take a long time to obtain them.
Other aspects will become apparent upon reading and understanding the drawings and the detailed description.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the method for collecting cloud environment resource focus points according to Embodiment 1 of the present invention;
FIG. 2 is a schematic structural diagram of the apparatus for collecting cloud environment resource focus points according to Embodiment 2 of the present invention.
Preferred Embodiments of the Invention
To make the technical problems to be solved, the technical solutions, and the advantages of the embodiments of the present invention clearer, a detailed description follows with reference to the drawings and specific embodiments.
In the related art, traditional algorithms for obtaining cloud environment resource focus points usually run in stand-alone mode. They are easily constrained by hardware performance such as processor speed and storage capacity, they are slow and scale poorly, and as user logs grow, their complexity grows polynomially and their performance keeps degrading.
To address the problem in the related art that traditional algorithms take a long time to obtain cloud environment resource focus points, this document provides several solutions, including:
Embodiment 1
Referring to FIG. 1, the method for collecting cloud environment resource focus points in Embodiment 1 of the present invention includes:
Step 11: collecting words that satisfy a first preset condition;
Step 12: calculating an importance feature vector of the words that satisfy the first preset condition;
Step 13: obtaining cloud environment resource focus points according to the importance feature vector;
wherein each feature value in the importance feature vector represents the weight, in one Uniform Resource Locator (URL) webpage, of the corresponding word that satisfies the first preset condition; the importance feature vector may optionally be a TF-IDF feature vector.
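Step 12 can be illustrated for the optional TF-IDF case with the sketch below. It assumes the standard TF-IDF definition (raw count multiplied by the logarithm of total pages over pages containing the word), and that each page has already been segmented into a list of tokens. The function name and data layout are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(pages, keywords):
    """Build one TF-IDF feature vector per URL page.

    pages: dict mapping URL -> list of tokens from that page (assumed input).
    keywords: list of words satisfying the first preset condition.
    Returns a dict mapping URL -> list of weights, aligned with `keywords`.
    """
    n_pages = len(pages)
    token_sets = {url: set(tokens) for url, tokens in pages.items()}
    # Document frequency: number of pages that contain each keyword.
    doc_freq = {
        word: sum(1 for toks in token_sets.values() if word in toks)
        for word in keywords
    }
    vectors = {}
    for url, tokens in pages.items():
        counts = Counter(tokens)
        vectors[url] = [
            counts[word] * math.log(n_pages / doc_freq[word])
            if doc_freq[word]
            else 0.0
            for word in keywords
        ]
    return vectors
```

A keyword that appears in every page receives weight zero under this definition, which is the usual TF-IDF behavior rather than anything specific to the patent.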
The first preset condition is essentially a limit on the number of words obtained.
The collection method provided by Embodiment 1 of the present invention collects the words that satisfy the first preset condition, calculates the importance feature vector of those words, and from it obtains the cloud environment resource focus points. It thereby computes, analyzes, mines, and extracts cloud environment resource focus points reliably and efficiently, and addresses the problem in the related art that traditional algorithms take a long time to obtain them.
Optionally, the step of obtaining cloud environment resource focus points according to the importance feature vector includes: performing a data transformation on the importance feature vector to obtain corresponding frequencies; sorting the frequencies (optionally in descending order); obtaining the sorted frequencies that satisfy a second preset condition; and obtaining the corresponding cloud environment resource focus points according to the obtained frequencies.
Optionally, obtaining the sorted frequencies that satisfy the second preset condition includes:
obtaining, in order, the sorted frequencies that satisfy the second preset condition.
Optionally, obtaining, in order, the sorted frequencies that satisfy the second preset condition includes:
obtaining them in order from largest to smallest.
The second preset condition is essentially the limit, set by the administrator, on the number of cloud environment resource focus points finally to be obtained.
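The transform, sort, and select steps can be sketched in a map/reduce style as below. This is plain Python that imitates the two phases rather than a Hadoop job. Treating the "frequency" of a keyword as the sum of its TF-IDF values over all pages is an assumption, as are the function names and the tie-break by word.

```python
from collections import defaultdict


def map_phase(vectors, keywords):
    """Emit (keyword, weight) pairs from every page's feature vector."""
    for vector in vectors.values():
        for word, weight in zip(keywords, vector):
            yield word, weight


def reduce_phase(pairs):
    """Sum the weights emitted for each keyword."""
    totals = defaultdict(float)
    for word, weight in pairs:
        totals[word] += weight
    return dict(totals)


def top_focus_points(vectors, keywords, k):
    """Return the k keywords with the largest summed weight, in descending order."""
    totals = reduce_phase(map_phase(vectors, keywords))
    # Sort by descending total; break ties alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]
```

Here `k` plays the role of the second preset condition, that is, the number of focus points requested by the administrator.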
The step of collecting words that satisfy the first preset condition includes: collecting the words that satisfy a third preset condition, together with the frequency at which each word occurs in the resource-related URL webpages; sorting the words that satisfy the third preset condition according to the frequency; obtaining, in order, the sorted words that satisfy the third preset condition until the frequency corresponding to the obtained words, relative to the frequency corresponding to all the words that satisfy the third preset condition, reaches a preset threshold (optionally 9/10); and saving the obtained words that satisfy the third preset condition as the words that satisfy the first preset condition.
This stopping criterion means: until the ratio of the sum of the frequencies of the obtained words that satisfy the third preset condition to the sum of the frequencies of all the words that satisfy the third preset condition reaches a preset threshold.
Optionally, in one example, the frequency corresponding to the obtained word refers to the frequency of the one word currently being obtained, and the frequency corresponding to all the words that satisfy the third preset condition refers to the highest of the frequencies of all those words.
Optionally, obtaining, in order, the sorted words that satisfy the third preset condition includes:
obtaining the sorted words that satisfy the third preset condition in order from largest to smallest.
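The ratio-of-sums stopping criterion can be illustrated with a short sketch. It assumes integer occurrence counts, assumes that the word whose frequency pushes the cumulative share past the threshold is itself included, and uses the optional 9/10 value as the default. The function name is hypothetical.

```python
from fractions import Fraction


def select_high_frequency_words(freqs, threshold=Fraction(9, 10)):
    """Take words in descending frequency until their share of the total reaches threshold.

    freqs: dict mapping word -> integer occurrence count.
    """
    total = sum(freqs.values())
    if total == 0:
        return []
    # Descending by count; ties broken alphabetically for determinism.
    ranked = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
    selected = []
    running = 0
    for word, count in ranked:
        selected.append(word)
        running += count
        # Exact rational comparison avoids floating-point edge cases at 9/10.
        if Fraction(running, total) >= threshold:
            break
    return selected
```

With counts of 50, 30, 15, and 5, the cumulative shares are 0.5, 0.8, and 0.95, so the first three words are kept.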
Optionally, before collecting the words that satisfy the first preset condition, the collection method further includes: extracting resource-related URLs from a sample log file; crawling (with a web crawler) the webpage content of the resource-related URLs and using the crawled webpage content as the text to be classified; and segmenting the text to be classified into words to obtain the words that satisfy the third preset condition.
The third preset condition is essentially a lower limit on the number of times a word appears in the webpage content.
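Read purely as a lower limit on occurrence count, the third preset condition can be sketched as follows. The regular-expression tokenizer is a simple stand-in for a real word segmenter (Chinese text in particular would need one), and `min_count` is a hypothetical parameter name. The order-parameter and ranking steps that the embodiment describes are not modeled here.

```python
import re
from collections import Counter


def candidate_words(text, min_count=2):
    """Return the words that occur at least min_count times, with their counts."""
    # Lowercase alphanumeric runs serve as tokens in this sketch.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return {word: count for word, count in counts.items() if count >= min_count}
```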
The step of segmenting the text to be classified into words to obtain the words that satisfy the third preset condition includes: segmenting the text to be classified into words to obtain resource-related words; converting the resource-related words into numeric vectors; processing the numeric vectors to obtain parameter feature vectors; obtaining order parameters according to the parameter feature vectors; and obtaining the words that satisfy the third preset condition according to the order parameters.
Optionally, processing the numeric vectors to obtain parameter feature vectors includes: aligning the numeric vectors, then applying zero-mean processing and normalization to obtain the parameter feature vectors.
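The align, zero-mean, and normalize sequence can be sketched as below. The text does not specify how alignment or normalization is performed, so padding shorter vectors with zeros and scaling to unit Euclidean length are both assumptions, and the function name is hypothetical.

```python
import math


def to_parameter_vectors(vectors):
    """Align, center, and normalize a list of numeric vectors."""
    width = max((len(v) for v in vectors), default=0)
    result = []
    for vector in vectors:
        # Alignment (assumed): pad with zeros to the longest vector's length.
        padded = list(vector) + [0.0] * (width - len(vector))
        # Zero-mean: subtract the vector's own mean from every component.
        mean = sum(padded) / width if width else 0.0
        centered = [x - mean for x in padded]
        # Normalization (assumed): scale to unit Euclidean norm when possible.
        norm = math.sqrt(sum(x * x for x in centered))
        result.append([x / norm for x in centered] if norm else centered)
    return result
```

A constant vector becomes all zeros after centering, so it is returned unscaled instead of being divided by a zero norm.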
Optionally, the step of obtaining the words that satisfy the third preset condition according to the order parameters includes: obtaining the words that satisfy the third preset condition by using the order parameters and a ranking algorithm.
Optionally, before extracting the resource-related Uniform Resource Locator (URL) from the sample log file, the collection method further includes: periodically collecting an initial log file (the cloud environment log file to be analyzed); and obtaining the sample log file (a file composed of the information required to open webpages) according to the log data of the initial log file.
Optionally, periodically collecting the initial log file includes: setting up a system timer on each node whose logs are to be collected, starting the system timer, and assigning the timer a task so that the initial log file is collected periodically.
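The timer-driven collection can be illustrated with an in-process timer. This sketch uses Python's `threading.Timer` as a stand-in for whatever system timer a node provides. The `collect` callback, the `max_runs` bound (added so that the example terminates), and the function name are assumptions.

```python
import threading


def start_periodic_collection(collect, interval_s, max_runs):
    """Call collect() every interval_s seconds, max_runs times in total.

    Returns an Event that is set after the final run.
    """
    finished = threading.Event()
    runs = {"count": 0}

    def tick():
        collect()  # e.g. copy the node's current log into the initial log file
        runs["count"] += 1
        if runs["count"] < max_runs:
            schedule()
        else:
            finished.set()

    def schedule():
        timer = threading.Timer(interval_s, tick)
        timer.daemon = True  # do not keep the process alive for pending timers
        timer.start()

    schedule()
    return finished
```

A production deployment would more likely rely on an operating-system scheduler; the sketch only shows the set, start, and run-task sequence described above.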
Optionally, the step of obtaining the sample log file according to the log data of the initial log file includes: upon receiving an information request sent by a web client in response to a webpage-open instruction, obtaining from the initial log file, according to the information request, the information required to open the corresponding webpage; and saving the information required to open the corresponding webpage as the sample log file.
The words that satisfy the first preset condition correspond to the high-frequency best keywords; the frequencies that satisfy the second preset condition are those that meet the threshold set by the user; the words that satisfy the third preset condition correspond to the best keywords; and the resource-related words correspond to the words in the webpage content of the resource-related URLs.
Embodiment 2
Referring to FIG. 2, the apparatus for collecting cloud environment resource focus points in Embodiment 2 of the present invention includes:
a first processing module 21, configured to collect words that satisfy a first preset condition;
a calculation module 22, configured to calculate an importance feature vector of the words that satisfy the first preset condition;
a second processing module 23, configured to obtain cloud environment resource focus points according to the importance feature vector;
wherein each feature value in the importance feature vector represents the weight, in one Uniform Resource Locator (URL) webpage, of the corresponding word that satisfies the first preset condition; the importance feature vector may optionally be a TF-IDF feature vector.
The first preset condition is essentially a limit on the number of words obtained.
The collection apparatus provided by Embodiment 2 of the present invention collects the words that satisfy the first preset condition, calculates the importance feature vector of those words, and from it obtains the cloud environment resource focus points. It can compute, analyze, mine, and extract cloud environment resource focus points reliably and efficiently, and reduces the time that traditional algorithms in the related art take to obtain them.
Optionally, the second processing module includes: a transformation submodule, configured to perform data transformation on the importance feature vector to obtain the corresponding frequencies; a first sorting submodule, configured to sort the frequencies (optionally in descending order); a first obtaining submodule, configured to obtain, in order, the sorted frequencies that satisfy the second preset condition; and a first processing submodule, configured to obtain the corresponding cloud environment resource focus points according to the obtained frequencies.
The second preset condition is essentially a limit, set by the administrator, on the number of cloud environment resource focus points that are finally to be obtained.
The first processing module includes: a summarizing submodule, configured to summarize the vocabulary that satisfies the third preset condition together with the frequency with which each word occurs in the resource-related URL web pages; a second sorting submodule, configured to sort the vocabulary that satisfies the third preset condition according to the frequency; a second obtaining submodule, configured to obtain, in order, the sorted vocabulary that satisfies the third preset condition until the ratio of the sum of the frequencies of the words obtained so far to the sum of the frequencies of all words satisfying the third preset condition reaches a preset threshold (optionally 9/10); and a first saving submodule, configured to save the obtained vocabulary as the vocabulary that satisfies the first preset condition.
Optionally, the collection apparatus further includes: an extraction module, configured to extract resource-related URLs from the sample log file before the first processing module performs its operations; a crawling module, configured to crawl (by means of a web crawler) the web page content of the resource-related URLs and to use the crawled web page content as the text to be classified; and a third processing module, configured to perform word segmentation on the text to be classified to obtain the vocabulary that satisfies the third preset condition.
The third preset condition is essentially a lower limit on the number of times a word appears in the web page content.
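Read as a lower bound on occurrence counts, the third preset condition can be illustrated with a minimal Python sketch. The function name `words_meeting_min_count` and the parameter `min_count` are hypothetical, and whitespace splitting stands in for the word segmentation step, which the source does not specify; the embodiments additionally derive these words through order parameters, which this sketch does not model.

```python
from collections import Counter


def words_meeting_min_count(text, min_count):
    """Return {word: count} for words that occur at least min_count times.

    Assumption: whitespace tokenization is a stand-in for a real word
    segmenter; min_count plays the role of the lower limit.
    """
    counts = Counter(text.split())
    return {word: n for word, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    page_text = "vm disk vm network vm disk"
    print(words_meeting_min_count(page_text, min_count=2))
```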
The third processing module includes: a second processing submodule, configured to perform word segmentation on the text to be classified to obtain resource-related vocabulary; a conversion submodule, configured to convert the resource-related vocabulary into numeric vectors; a third processing submodule, configured to process the numeric vectors to obtain parameter feature vectors (the numeric vectors are aligned and then subjected to zero-mean processing and normalization); a fourth processing submodule, configured to obtain order parameters according to the parameter feature vectors; and a fifth processing submodule, configured to obtain the vocabulary that satisfies the third preset condition according to the order parameters.
Optionally, the fifth processing submodule is configured to obtain the vocabulary that satisfies the third preset condition by using the order parameters and a ranking algorithm.
Optionally, the collection apparatus further includes: a collection module, configured to periodically collect initial log files (the cloud environment log files to be analyzed) before the extraction module performs its operations (by installing a system timer on each node whose logs are to be collected, starting the timer, and setting the timer's tasks); and a fourth processing module, configured to obtain the sample log file (a file composed of the information required to open web pages) according to the log data of the initial log files.
Optionally, the fourth processing module includes: a third obtaining submodule, configured to, upon receiving an information request sent by a network client in response to a web page opening instruction, obtain from the initial log file, according to the information request, the information required to open the corresponding web page; and a second saving submodule, configured to save the information required to open the corresponding web page as the sample log file.
The vocabulary satisfying the first preset condition corresponds to the high-frequency best keywords; the frequency satisfying the second preset condition is the frequency that meets the threshold set by the user; the vocabulary satisfying the third preset condition corresponds to the best keywords; and the resource-related vocabulary corresponds to the words in the web page content of the resource-related URLs.
The apparatus for collecting cloud environment resource focus points provided by Embodiment 2 of the present invention may be obtained by modifying a server in the related art so that the server implements the functions of the apparatus described in Embodiment 2.
Embodiment 3
The flow of the method for collecting cloud environment resource focus points provided by Embodiment 3 of the present invention includes:
First, the front-end server installs the system timer CRON on each node in the cloud environment whose logs need to be collected, adds CRON to the startup script, and starts the CRON service. It then edits the /etc/crontab file to set the tasks that the system executes periodically; in Embodiment 3 of the present invention, this sets the log files that each node collects periodically. Note that root privileges are required to edit this file.
Then, the periodically collected log files (initial log files) are saved in a unified format and the log data is preprocessed. The preprocessing includes: upon receiving an information request that the network client assembles into a certain format (for example, a string) and sends when the user opens a web page of the cloud platform, obtaining from the log files (initial log files) the information required to open the corresponding web page. The required information includes any one or more of the following:
the start time of the operation, the end time of the operation, the client IP, the user information, and the access address.
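As a rough illustration of the preprocessing output, the sketch below parses one record of a unified-format log into the five fields listed above. The tab-separated layout, the field order, the `AccessRecord` type, and the sample values are all assumptions made for illustration; the source does not define the actual log format.

```python
from dataclasses import dataclass


@dataclass
class AccessRecord:
    start_time: str   # start time of the operation
    end_time: str     # end time of the operation
    client_ip: str    # client IP
    user: str         # user information
    url: str          # access address


def parse_record(line: str) -> AccessRecord:
    """Parse one tab-separated line (hypothetical format) into a record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return AccessRecord(*fields)


if __name__ == "__main__":
    sample = ("2015-08-19T10:00:00\t2015-08-19T10:00:02\t"
              "10.0.0.5\talice\thttp://cloud.example/vm/list")
    print(parse_record(sample))
```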
The front-end server stores the above information (the obtained required information) as a log file in a unified format (the sample log file), transfers it across networks into the HDFS (distributed file system) of the cloud platform, and stores it in the HDFS file system in LZO format. LZO, short for Lempel-Ziv-Oberhumer, is a data compression algorithm focused on decompression speed.
The back-end server extracts the resource-related URLs from the accessed URLs in the log file (sample log file) stored in HDFS; crawls the web page content of the corresponding resource URLs with a web crawler and retains it as the text to be classified; segments the web page content of the resource URLs with word segmentation technology to obtain keywords (resource-related vocabulary); looks up the international code library to convert the keywords into numeric vectors; aligns the numeric vectors and then applies zero-mean processing and normalization to obtain parameter feature vectors; and applies synergetic neural network pattern recognition to the parameter feature vectors to obtain order parameters, which are then used to retrieve the best keywords from a database.
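The alignment, zero-mean, and normalization steps can be sketched as follows. Several choices here are assumptions rather than details from the source: Unicode code points stand in for the "international code library", alignment is done by zero-padding to the longest vector, and normalization scales each centered vector to unit Euclidean length. The synergetic neural network step is not modeled.

```python
import math


def to_code_vector(word):
    """Map a word to a numeric vector (assumption: Unicode code points)."""
    return [float(ord(ch)) for ch in word]


def align(vectors):
    """Zero-pad every vector to the length of the longest one."""
    width = max(len(v) for v in vectors)
    return [v + [0.0] * (width - len(v)) for v in vectors]


def zero_mean_normalize(vector):
    """Subtract the mean, then scale to unit Euclidean norm."""
    mean = sum(vector) / len(vector)
    centered = [x - mean for x in vector]
    norm = math.sqrt(sum(x * x for x in centered))
    return centered if norm == 0 else [x / norm for x in centered]


def feature_vectors(words):
    """Produce one parameter feature vector per word."""
    aligned = align([to_code_vector(w) for w in words])
    return [zero_mean_normalize(v) for v in aligned]


if __name__ == "__main__":
    for vec in feature_vectors(["vm", "disk"]):
        print(vec)
```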
The back-end server summarizes the best keywords and inputs them into MapReduce (a distributed computing framework) to obtain the frequency of each best keyword. The best keywords are sorted by frequency in descending order and then selected one by one in that order (as high-frequency best keywords) until the ratio of the word frequency of the selected words to the total word frequency reaches 9 to 10.
Here, "the ratio of the word frequency of the selected words to the total word frequency reaches 9 to 10" means that the ratio of the sum of the frequencies of all words obtained so far that satisfy the third preset condition to the sum of the frequencies of all words that satisfy the third preset condition reaches 9 to 10.
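The cumulative selection rule can be sketched in a few lines of Python. This is a single-machine illustration under stated assumptions, not the MapReduce job itself; the function name `select_high_frequency` and the sample counts are hypothetical, and the default threshold of 0.9 corresponds to the 9-to-10 ratio.

```python
def select_high_frequency(freqs, threshold=0.9):
    """Pick words in descending frequency order until the selected words
    account for at least `threshold` of the total frequency."""
    total = sum(freqs.values())
    selected = []
    running = 0
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    for word, freq in ranked:
        # Stop once the words already selected reach the threshold share.
        if total == 0 or running / total >= threshold:
            break
        selected.append(word)
        running += freq
    return selected


if __name__ == "__main__":
    counts = {"vm": 50, "disk": 30, "network": 15, "quota": 5}
    print(select_high_frequency(counts))
```

With the sample counts, the first two words cover 80 of 100 occurrences, so a third word is still taken; the loop stops before the fourth because the selected share has already passed the threshold.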
Next, the TF-IDF (term frequency-inverse document frequency) feature vector of each selected word is calculated (for all high-frequency best keywords of each URL web page, their importance with respect to all sample log files is calculated, generating a TF-IDF feature vector for each high-frequency best keyword):
$$\mathrm{FeatureVector} = \{f_1, f_2, f_3, f_4, \ldots, f_n\} \tag{1}$$
In equation (1), the TF-IDF feature value of a high-frequency best keyword is calculated as:
$$f_n = \text{tf-idf}(t_n, d, D) = \mathrm{tf}(t_n, d) \times \mathrm{idf}(t_n, D) \tag{2}$$
In equation (2), the tf value is calculated as:
$$\mathrm{tf}(t_n, d) = \mathrm{NumberofTimes}(t_n) \tag{3}$$
In equation (2), the idf value is calculated as:
Figure PCTCN2016082253-appb-000001
In equations (2), (3), and (4), D is the set of all URL web pages, d is one specific URL web page, t_n is the n-th high-frequency word (that is, one feature), and N is the total number of selected best keywords; FeatureVector denotes the feature vector and NumberofTimes denotes the number of occurrences.
Finally, MapReduce (the distributed parallel computing model in the Hadoop framework) is used to perform data transformation on the TF-IDF feature vectors, yielding the frequency of each feature vector (the sum of the multiple TF-IDF feature values of one high-frequency best keyword). The frequencies are sorted in descending order, and the number of cloud environment resource focus points set by the administrator is taken in that order.
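The final ranking step can be sketched as follows, assuming the TF-IDF feature values of each keyword are already available as a list. The function name `top_focus_points`, the parameter `k` (standing for the administrator's setting), and the sample values are hypothetical; the source runs this transformation on MapReduce rather than in a single process.

```python
def top_focus_points(tfidf_by_word, k):
    """Sum each keyword's TF-IDF feature values into one frequency,
    sort in descending order, and keep the first k keywords."""
    frequency = {word: sum(values) for word, values in tfidf_by_word.items()}
    ranked = sorted(frequency.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    vectors = {
        "vm": [0.5, 0.25],
        "disk": [0.125],
        "network": [0.5, 0.5],
    }
    print(top_focus_points(vectors, k=2))
```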
Note that MapReduce is a key cloud computing technology: a software architecture and programming model proposed by Google for parallel computation over large-scale data. MapReduce decomposes every operation on the data into two steps, a mapping function Map and a reduction function Reduce. The Map function splits large-scale data into many small data sets that are distributed to multiple machines and processed in parallel; the Reduce function aggregates the results of the Map function from each machine. Together, Map and Reduce achieve distributed parallel computation.
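The Map and Reduce roles described above can be imitated in a single Python process. The sketch below counts keyword occurrences and only illustrates the programming model; it does not use Hadoop, and the function names and the explicit `shuffle` grouping step are illustrative choices, not taken from the source.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one data chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)


def word_count(chunks):
    """Run the map, shuffle, and reduce steps over a list of chunks."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["vm disk vm", "disk network"]))
```

In a real cluster each chunk would be mapped on a different machine; here the chunks are simply processed one after another.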
TF is the number of times the keyword appears in one URL web page. IDF measures the general importance of the keyword; it can be obtained by dividing the total number of sample files by the number of sample files containing the keyword and taking the logarithm of the quotient. Multiplying TF by IDF gives the importance of a word to one URL web page.
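The verbal definition of TF and IDF above can be turned into a small Python sketch. It treats each URL web page as one document, uses the natural logarithm, and returns 0 for a keyword that appears in no document; these three choices, along with the function names, are assumptions, since the source does not fix the logarithm base or the handling of absent keywords.

```python
import math


def tf(term, page_tokens):
    """Number of times the term appears in one page."""
    return page_tokens.count(term)


def idf(term, pages):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for tokens in pages if term in tokens)
    if containing == 0:
        return 0.0  # assumption: an absent term carries no weight
    return math.log(len(pages) / containing)


def tfidf_vector(term, pages):
    """One TF-IDF feature value per page for the given term."""
    weight = idf(term, pages)
    return [tf(term, tokens) * weight for tokens in pages]


if __name__ == "__main__":
    pages = [["vm", "vm", "disk"], ["disk"], ["network"], ["disk", "vm"]]
    print(tfidf_vector("vm", pages))
```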
The front-end server and the back-end server mentioned in Embodiment 3 of the present invention may be integrated in one server or may be two separate servers; this is not limited here.
As can be seen from the above, the method for collecting cloud environment resource focus points provided by Embodiment 3 of the present invention filters a large number of URLs and extracts the URLs related to cloud resources, and uses MapReduce to perform TF-IDF feature extraction on the URL web page content. This not only resolves the timeliness, storage, and computation bottlenecks of extracting cloud environment resource focus points in massive log analysis, but also locates the focus points accurately and improves the utilization of cloud environment resources.
In summary, the solution provided by the embodiments of the present invention calculates, analyzes, and mines massive user logs reliably and efficiently, and can therefore extract, efficiently and in real time, the cloud environment resource focus points that users care about most in the logs. Compared with other algorithms, it takes less time and scales better. It addresses the problems that conventional algorithms run in single-machine mode and are easily constrained by hardware performance such as processor speed and storage capacity, and that, as user logs grow, algorithmic complexity grows polynomially and performance steadily degrades.
Many of the functional components described in this specification are referred to as modules/submodules in order to emphasize more particularly the independence of their implementation.
In the embodiments of the present invention, a module/submodule may be implemented in software for execution by various types of processors. For example, an identified module of executable code may comprise one or more physical or logical blocks of computer instructions, which may, for instance, be organized as an object, a procedure, or a function. Nevertheless, the executable code of an identified module need not be physically located together; it may comprise different instructions stored in different locations which, when logically joined together, constitute the module and achieve its stated purpose.
Indeed, a module of executable code may be a single instruction or many instructions, and may even be distributed over several different code segments, among different programs, and across multiple memory devices. Similarly, operational data may be identified within modules, may be implemented in any suitable form, and may be organized within any suitable type of data structure. The operational data may be collected as a single data set or may be distributed over different locations (including over different storage devices), and may exist, at least partially, merely as electronic signals on a system or network.
Where a module can be implemented in software, then, considering the level of hardware technology in the related art, a person skilled in the art can, cost aside, build a corresponding hardware circuit to implement the corresponding function. Such hardware circuits include conventional very-large-scale integration (VLSI) circuits or gate arrays, as well as semiconductors in the related art such as logic chips and transistors, or other discrete components. A module may also be implemented with programmable hardware devices such as field-programmable gate arrays, programmable array logic, and programmable logic devices.
The foregoing describes optional implementations of the embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles described in the embodiments of the present invention, and these improvements and refinements shall also fall within the protection scope of the embodiments of the present invention.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented as a computer program flow. The computer program may be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, equipment, an apparatus, or a device); when executed, it performs one or a combination of the steps of the method embodiments.
Optionally, all or some of the steps of the above embodiments may also be implemented with integrated circuits. The steps may each be made into a separate integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module.
The apparatuses/functional modules/functional units in the above embodiments may be implemented with general-purpose computing devices; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices.
When the apparatuses/functional modules/functional units in the above embodiments are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
Industrial Applicability
With the solution of the embodiments of the present invention, the vocabulary that satisfies the first preset condition is collected, the importance feature vector of that vocabulary is calculated, and the cloud environment resource focus points are thereby obtained. Cloud environment resource focus points can be calculated, analyzed, mined, and extracted reliably and efficiently, which alleviates the problem that obtaining them algorithmically takes a long time.

Claims (17)

  1. A method for collecting cloud environment resource focus points, comprising:
    collecting and obtaining vocabulary that satisfies a first preset condition;
    calculating an importance feature vector of the vocabulary that satisfies the first preset condition;
    obtaining cloud environment resource focus points according to the importance feature vector;
    wherein each feature value in the importance feature vector represents the weight of a corresponding word satisfying the first preset condition in one Uniform Resource Locator (URL) web page.
  2. The collection method according to claim 1, wherein the step of obtaining cloud environment resource focus points according to the importance feature vector comprises:
    performing data transformation on the importance feature vector to obtain corresponding frequencies;
    sorting the frequencies;
    obtaining the sorted frequencies that satisfy a second preset condition;
    obtaining the corresponding cloud environment resource focus points according to the obtained frequencies.
  3. The collection method according to claim 1, wherein the step of collecting and obtaining vocabulary that satisfies the first preset condition comprises:
    summarizing vocabulary that satisfies a third preset condition and the frequency with which each word occurs in resource-related URL web pages;
    sorting the vocabulary that satisfies the third preset condition in descending order of the frequency;
    obtaining, in order, the sorted vocabulary that satisfies the third preset condition until the ratio of the sum of the frequencies of the obtained words that satisfy the third preset condition to the sum of the frequencies of all words that satisfy the third preset condition reaches a preset threshold;
    saving the obtained vocabulary that satisfies the third preset condition as the vocabulary that satisfies the first preset condition.
  4. The collection method according to claim 3, further comprising, before collecting and obtaining the vocabulary that satisfies the first preset condition:
    extracting resource-related URLs from a sample log file;
    crawling the web page content of the resource-related URLs and using the crawled web page content as text to be classified;
    performing word segmentation on the text to be classified to obtain the vocabulary that satisfies the third preset condition.
  5. The collection method according to claim 4, wherein performing word segmentation on the text to be classified to obtain the vocabulary that satisfies the third preset condition comprises:
    performing word segmentation on the text to be classified to obtain resource-related vocabulary;
    converting the resource-related vocabulary into numeric vectors;
    processing the numeric vectors to obtain parameter feature vectors;
    obtaining order parameters according to the parameter feature vectors;
    obtaining the vocabulary that satisfies the third preset condition according to the order parameters.
  6. The collection method according to claim 5, wherein obtaining the vocabulary that satisfies the third preset condition according to the order parameters comprises:
    obtaining the vocabulary that satisfies the third preset condition by using the order parameters and a ranking algorithm.
  7. The collection method according to claim 4, further comprising, before extracting the resource-related Uniform Resource Locator (URL) from the sample log file:
    periodically collecting an initial log file;
    obtaining the sample log file according to log data of the initial log file.
  8. The collection method according to claim 7, wherein the step of obtaining the sample log file according to the log data of the initial log file comprises:
    upon receiving an information request sent by a network client in response to a web page opening instruction, obtaining from the initial log file, according to the information request, the information required to open the corresponding web page;
    saving the information required to open the corresponding web page as the sample log file.
  9. 一种云环境资源关注点的采集装置,包括:A collection device for a cloud environment resource focus includes:
    第一处理模块,设置为汇总获取满足第一预设条件的词汇;a first processing module, configured to collectively obtain a vocabulary that satisfies a first preset condition;
    计算模块,设置为计算所述满足第一预设条件的词汇的重要程度特征向量;a calculation module, configured to calculate an importance degree feature vector of the vocabulary that satisfies the first preset condition;
    第二处理模块,设置为根据所述重要程度特征向量得到云环境资源关注点;a second processing module, configured to obtain a cloud environment resource focus point according to the importance degree feature vector;
    其中,所述重要程度特征向量中的每个特征值分别用以表征一个对应的所述满足第一预设条件的词汇在一个统一资源定位器URL网页中所占的权 重。Each of the importance degree feature vectors is used to represent a corresponding vocabulary that satisfies the first preset condition in a uniform resource locator URL webpage. weight.
  10. 如权利要求9所述的采集装置,其中,所述第二处理模块包括:The collection device of claim 9, wherein the second processing module comprises:
    变换子模块,设置为将所述重要程度特征向量进行数据变换,得到对应的频度;a transform submodule, configured to perform data transformation on the importance degree feature vector to obtain a corresponding frequency;
    第一排序子模块,设置为将所述频度进行顺序排列;a first sorting submodule, configured to sequentially arrange the frequencies;
    第一获取子模块,设置为获取排列后满足第二预设条件的频度;The first obtaining submodule is configured to obtain a frequency that satisfies the second preset condition after the arrangement;
    第一处理子模块,设置为根据获取到的频度得到对应的云环境资源关注点。The first processing submodule is configured to obtain a corresponding cloud environment resource concern point according to the obtained frequency.
  11. 如权利要求9所述的采集装置,其中,所述第一处理模块包括:The collection device of claim 9, wherein the first processing module comprises:
    汇总子模块,设置为汇总满足第三预设条件的词汇以及其在资源相关URL网页中对应出现的频率;a summary sub-module, configured to summarize the vocabulary satisfying the third preset condition and the corresponding frequency of occurrence in the resource-related URL webpage;
    第二排序子模块,设置为根据所述频率对所述满足第三预设条件的词汇从大到小进行排序;a second sorting submodule, configured to sort the vocabulary satisfying the third preset condition from large to small according to the frequency;
    第二获取子模块,设置为依次获取排序后所述满足第三预设条件的词汇,直到获取到的所述满足第三预设条件的词汇对应的频率之和,与所有所述满足第三预设条件的词汇对应的频率之和的比例达到一预设阈值;a second obtaining sub-module, configured to sequentially obtain the vocabulary that satisfies the third preset condition after the sorting, until the sum of the frequencies corresponding to the vocabulary that meets the third preset condition is obtained, and all the said meet the third The ratio of the sum of the frequencies corresponding to the vocabulary of the preset condition reaches a preset threshold;
    第一保存子模块,设置为将获取到的所述满足第三预设条件的词汇保存为所述满足第一预设条件的词汇。The first saving submodule is configured to save the acquired vocabulary that satisfies the third preset condition as the vocabulary that satisfies the first preset condition.
  12. 如权利要求11所述的采集装置,还包括:The collection device of claim 11 further comprising:
    提取模块,设置为所述第一处理模块执行相关操作之前,从样本日志文件中提取资源相关URL;An extraction module, configured to extract a resource-related URL from a sample log file before the first processing module performs a related operation;
    爬取模块,设置为爬取资源相关URL的网页内容,将爬取到的所述网页内容作为待分类文本;a crawling module, configured to crawl the webpage content of the resource-related URL, and use the crawled webpage content as the text to be classified;
    第三处理模块,设置为将所述待分类文本进行分词,获得所述满足第三预设条件的词汇。And a third processing module, configured to perform word segmentation on the to-be-classified text, to obtain the vocabulary that meets the third preset condition.
  13. 如权利要求12所述的采集装置,其中,所述第三处理模块包括:The collection device of claim 12, wherein the third processing module comprises:
    第二处理子模块,设置为将所述待分类文本进行分词,获得资源相关词汇;a second processing sub-module, configured to perform word segmentation on the to-be-classified text to obtain a resource-related vocabulary;
    转化子模块,设置为将所述资源相关词汇转化为数字向量; a transformation sub-module configured to convert the resource-related vocabulary into a digital vector;
    第三处理子模块,设置为将所述数字向量进行处理,得到参数特征向量;a third processing submodule configured to process the digital vector to obtain a parameter feature vector;
    第四处理子模块,设置为根据所述参数特征向量得到序参量;a fourth processing submodule, configured to obtain a sequence parameter according to the parameter feature vector;
    第五处理子模块,设置为根据所述序参量得到所述满足第三预设条件的词汇。The fifth processing submodule is configured to obtain the vocabulary that satisfies the third preset condition according to the order parameter.
  14. 如权利要求13所述的采集装置,其中,所述第五处理子模块根据所述序参量得到所述满足第三预设条件的词汇包括:The collection device according to claim 13, wherein the fifth processing sub-module obtains the vocabulary that satisfies the third preset condition according to the order parameter includes:
    利用所述序参量和排名算法得到所述满足第三预设条件的词汇。The vocabulary satisfying the third preset condition is obtained by using the order parameter and the ranking algorithm.
  15. 如权利要求12所述的采集装置,还包括:The collection device of claim 12, further comprising:
    采集模块,设置为所述提取模块执行相关操作之前,定期采集初始日志文件;The collecting module is configured to periodically collect the initial log file before the extracting module performs related operations;
    第四处理模块,设置为根据所述初始日志文件的日志数据得到所述样本日志文件。The fourth processing module is configured to obtain the sample log file according to the log data of the initial log file.
  16. The collection device according to claim 15, wherein the fourth processing module comprises:
    a third obtaining submodule configured to, upon receiving an information request sent by a network client according to a web page opening instruction, obtain from the initial log file, according to the information request, the information required to open the corresponding web page; and
    a second saving submodule configured to save the information required to open the corresponding web page as the sample log file.
  17. A server, comprising the collection device for cloud environment resource focus points according to any one of claims 9 to 16.
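The submodule chain recited in claims 13 and 14 (word segmentation, conversion to a numeric vector, scoring, ranking) can be illustrated with a minimal sketch. The claims do not define how the parameter feature vector or the order parameter is computed, nor which ranking algorithm is used, so the sketch below substitutes plain term counts as the score and a simple sort as the ranking step. The function names `segment`, `to_count_vectors` and `rank_terms`, the regex tokenizer, and the sample texts are all hypothetical and are not taken from the application.

```python
import re
from collections import Counter


def segment(text):
    """Split text into lowercase word tokens (stand-in for a real word segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_count_vectors(docs_tokens):
    """Map each token list to a count vector over a shared, sorted vocabulary."""
    vocab = sorted({t for tokens in docs_tokens for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for tokens in docs_tokens:
        vec = [0] * len(vocab)
        for term, count in Counter(tokens).items():
            vec[index[term]] = count
        vectors.append(vec)
    return vocab, vectors


def rank_terms(vocab, vectors, top_k):
    """Score each term by its total count (assumed stand-in for the order
    parameter) and return the top_k terms, ties broken alphabetically."""
    totals = [sum(column) for column in zip(*vectors)]
    ranked = sorted(zip(vocab, totals), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:top_k]]


# Hypothetical to-be-classified texts extracted from a sample log file.
texts = ["cpu memory cpu disk", "memory cpu network", "disk cpu"]
vocab, vectors = to_count_vectors([segment(t) for t in texts])
top_terms = rank_terms(vocab, vectors, top_k=2)
```

Here `top_k` plays the role of the "third preset condition" only by assumption; any other threshold on the score would fit the same structure.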
PCT/CN2016/082253 2015-08-19 2016-05-16 Method and apparatus for collecting cloud environment resource focus point, and server WO2017028566A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510511018.2A CN106470130A (en) 2015-08-19 2015-08-19 A kind of acquisition method of cloud environment resource focus, device and server
CN201510511018.2 2015-08-19

Publications (1)

Publication Number Publication Date
WO2017028566A1 (en) 2017-02-23

Family

ID=58050721

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/082253 WO2017028566A1 (en) 2015-08-19 2016-05-16 Method and apparatus for collecting cloud environment resource focus point, and server

Country Status (2)

Country Link
CN (1) CN106470130A (en)
WO (1) WO2017028566A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102236867A (en) * 2011-08-15 2011-11-09 悠易互通(北京)广告有限公司 Cloud computing-based audience behavioral analysis advertisement targeting system
CN102402606A (en) * 2011-11-28 2012-04-04 中国科学院计算机网络信息中心 High-efficiency text data mining method
CN103186612A (en) * 2011-12-30 2013-07-03 中国移动通信集团公司 Lexical classification method and system and realization method
CN104424308A (en) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 Web page classification standard acquisition method and device and web page classification method and device
CN104504024A (en) * 2014-12-11 2015-04-08 中国科学院计算技术研究所 Method and system for mining keywords based on microblog content

Also Published As

Publication number Publication date
CN106470130A (en) 2017-03-01

Similar Documents

Publication Publication Date Title
US10831562B2 (en) Method and system for operating a data center by reducing an amount of data to be processed
US10235376B2 (en) Merging metadata for database storage regions based on overlapping range values
US9197665B1 (en) Similarity search and malware prioritization
US10205627B2 (en) Method and system for clustering event messages
US9081861B2 (en) Uniform resource locator canonicalization
US9619564B2 (en) Method and system for providing recommended terms
US20120158724A1 (en) Automated web page classification
US11036764B1 (en) Document classification filter for search queries
US20230086966A1 (en) Search systems and methods utilizing search based user clustering
Tan et al. Hadoop framework: impact of data organization on performance
WO2022165168A1 (en) Configuring an instance of a software program using machine learning
US11182386B2 (en) Offloading statistics collection
US8954438B1 (en) Structured metadata extraction
US20210004389A1 (en) Cloud service categorization
CN106777140B (en) Method and device for searching unstructured document
US11500945B2 (en) System and method of crawling wide area computer network for retrieving contextual information
CN112597369A (en) Webpage spider theme type search system based on improved cloud platform
WO2023030184A1 (en) Data retrieval method and related device
US10380195B1 (en) Grouping documents by content similarity
WO2017028566A1 (en) Method and apparatus for collecting cloud environment resource focus point, and server
CN106776654B (en) Data searching method and device
US10839028B2 (en) System for querying web pages using a real time entity authentication engine
Xu et al. The application of web crawler in city image research
CN111858918A (en) News classification method and device, network element and storage medium
JP7106924B2 (en) CLUSTER ANALYSIS SYSTEM, CLUSTER ANALYSIS METHOD AND CLUSTER ANALYSIS PROGRAM

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16836431; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 16836431; Country of ref document: EP; Kind code of ref document: A1)