TWI534640B - Chinese network information monitoring and analysis system and its method - Google Patents


Info

Publication number
TWI534640B
Authority
TW
Taiwan
Prior art keywords
webpage
information
word
chinese
data
Prior art date
Application number
TW102115477A
Other languages
Chinese (zh)
Other versions
TW201333735A (en)
Inventor
zhong-bin Li
Original Assignee
zhong-bin Li
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by zhong-bin Li
Priority to TW102115477A
Publication of TW201333735A
Application granted
Publication of TWI534640B


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Description

Chinese network information monitoring and analysis system and method thereof

The present invention relates to a Chinese network information monitoring and analysis system and a method thereof, and more particularly to a system and method for monitoring and analyzing information published on the Internet by online communities, various media, forums, and the like.

As network technology advances, the number of Internet users keeps growing and daily life moves increasingly online. Browsing, commenting, chatting, expressing feelings, and many other online activities have become part of the everyday routine of Internet users.

With the Internet now highly pervasive, one-to-many short-form social communication such as microblogs and micro-media continues to flourish. Unlike traditional instant messaging, chat rooms, and bulletin boards, microblogs and micro-media let people communicate through a "stream of daily moments." Such communication consists of the thoughts, opinions, and comments that arise from real-life experiences people share through Internet-connected electronic devices.

This vast volume of information often reveals social trends, for example how widely a product is discussed in the market, whether it is rated well or poorly, and even how many times it has been discussed over the past few days. If such information can be effectively obtained and analyzed, it yields very useful analytical data about the target of interest.

One object of the present invention is to obtain and analyze network information from the various social media, forum websites, and the like on the Internet.

Another object of the present invention is to achieve accurate monitoring and analysis results by combining a specific algorithm with a specific analysis method.

To achieve the above and other objects, the present invention provides a Chinese network information monitoring and analysis system for performing monitoring and analysis on the Internet according to at least one item of input Chinese target information. The system comprises: a thesaurus storage device storing a plurality of Chinese word segmentation table information, Chinese synonym information, and Chinese entailment-word information; a computer computing device connected to the thesaurus storage device and comprising a target information processing module, a network information processing and analysis module, and a network information organizing module; and a database storage device connected to the computer computing device, which stores the captured web page data and its word segmentation information according to the code string corresponding to each captured web page, thereby classifying the captured web page data. The target information processing module receives the at least one item of Chinese target information. The network information processing and analysis module searches the Internet and generates analysis results. The network information organizing module selects the corresponding category in the database storage device according to the Chinese target information and compares the word segmentation information of the captured web pages; when a web page matches, it retrieves that page to generate captured web page data and provides that data.

The network information processing and analysis module comprises: a network information capturing unit that captures web page data on the Internet; a word and sentence segmentation unit that performs first-stage segmentation of the captured web page data according to punctuation marks and the Chinese word segmentation table information, then performs second-stage processing according to the maximum matching method, to produce a segmentation result for the web page; a word frequency processing unit that, according to the Chinese synonym information and the Chinese entailment-word information, counts the occurrence frequency of the corresponding words and phrases in the segmentation result to produce a word frequency count result for the web page; and a web page fingerprint processing unit that classifies the captured web page data by page attributes. The fingerprint unit uses the tags in the page's source code as the nodes for splitting paragraphs, applies a TF/IDF weight analysis component to the word frequency count result to produce a weight for each segmented word, selects a predetermined number of top words from the weights ranked from largest to smallest, reorders them by their characters to produce word segmentation information for the page, and finally uses a hash algorithm to convert the selected characters into a code string of a predetermined number of bits, yielding the code string of the web page.

In one embodiment of the present invention, the network information capturing unit is further configured to log in, using preset login information, to network platforms that require login, and to capture web page data there.

The present invention further provides a Chinese network information monitoring and analysis method for performing monitoring and analysis on the Internet according to input Chinese target information, comprising the following steps: capturing web page data on the Internet; performing word and sentence segmentation on the web page data, with first-stage processing based on punctuation marks and the Chinese word segmentation table information in a preset thesaurus storage device, and second-stage processing by the maximum matching method, to produce a segmentation result; performing word frequency processing on the web page data, using the Chinese synonym information and Chinese entailment-word information in the thesaurus storage device to count the words and phrases in the segmentation result that correspond to entries in the thesaurus storage device, together with their frequencies, to produce a word frequency count result; performing web page fingerprint processing, in which the tags in the source code of the captured web page data serve as the nodes for splitting paragraphs, a TF/IDF weight analysis component is applied to the word frequency count result to produce a weight for each segmented word, a predetermined number of top words are selected from the weights ranked from largest to smallest and reordered by their characters to produce word segmentation information, and a hash algorithm converts the selected characters into a code string of a predetermined number of bits, yielding the code string corresponding to the captured web page data; storing the word segmentation information and code string corresponding to the web page data; and, within the stored word segmentation information and code strings, selecting the corresponding category according to the Chinese target information and comparing the word segmentation information of the captured web pages, retrieving a page when it matches to generate captured web page data, and providing that data.

Accordingly, the present invention can be applied to capture all types of social media data on the network, including unstructured data (such as text), semi-structured data (such as HTML files), and structured data (such as tables), and to analyze, filter, transform, extract, pattern-analyze, and semantically analyze that data. This supports various kinds of monitoring and investigation of the Chinese target information, for example understanding customer behavior, assessing a company's brand and product reputation, and gauging the effectiveness of specific micro-media, which in turn helps quantify the success of marketing campaigns. It can also help prevent a company from inadvertently disclosing customers' personal data on the Internet.

1‧‧‧Chinese network information monitoring and analysis system

2‧‧‧Internet

3‧‧‧User's electronic communication device

100‧‧‧Thesaurus storage device

200‧‧‧Computer computing device

210‧‧‧Target information processing module

230‧‧‧Network information processing and analysis module

231‧‧‧Network information capturing unit

233‧‧‧Word and sentence segmentation unit

235‧‧‧Word frequency processing unit

237‧‧‧Web page fingerprint processing unit

250‧‧‧Network information organizing module

300‧‧‧Database storage device

S101~S111‧‧‧Steps

FIG. 1 is a system block diagram of a Chinese network information monitoring and analysis system according to an embodiment of the present invention.

FIG. 2 is a flowchart of a method for operating the Chinese network information monitoring and analysis system according to an embodiment of the present invention.

FIG. 3 is a system block diagram of a Chinese network information monitoring and analysis system according to another embodiment of the present invention.

To provide a full understanding of the objects, features, and effects of the present invention, the invention is described in detail below through specific embodiments in conjunction with the accompanying drawings.

The technique of the present invention starts from the vector space model. After web page data is captured and classified, the feature words and phrases that represent the page, together with their weights, are stored in the database storage device. The feature words and weights stored for each web page then serve as the objects of comparison, and these feature items are used to evaluate how relevant unknown text in web page data is to a topic. Selecting the feature words and their weights is called feature selection for the topic sample. Because words and phrases show different frequency distributions in documents with different content, feature selection and weight evaluation can be performed according to the frequency characteristics of the words. This allows the invention to monitor and investigate Chinese target information accurately, and to store the results in the database storage device so that users can search for and obtain captured web page data in the database.

Referring first to FIG. 1, a system block diagram of a Chinese network information monitoring and analysis system according to an embodiment of the present invention is shown. The Chinese network information monitoring and analysis system 1 is connected to the Internet 2 and accepts connections from the user's electronic communication device 3. The drawing shows a single user as an example, but the invention is not limited to a single user; the number of simultaneous connections depends on the equipment class of the computer computing device 200. The user's electronic communication device 3 may be a desktop computer, smartphone, personal digital assistant, tablet, or any other electronic communication device that can connect to the Internet directly or indirectly.

The Chinese network information monitoring and analysis system 1 of the present invention comprises a thesaurus storage device 100, a computer computing device 200, and a database storage device 300. The computer computing device 200 is connected to the thesaurus storage device 100 and the database storage device 300 so that captured web page data can be searched for and obtained from these storage devices.

The thesaurus storage device 100 stores a plurality of Chinese word segmentation table information, Chinese synonym information, and Chinese entailment-word information. The Chinese word segmentation table information includes a large number of common words that will not become feature items. To improve the operating efficiency of the Chinese network information monitoring and analysis system 1, the system builds a large Chinese word segmentation table in the thesaurus storage device 100, which significantly improves operating efficiency while preserving the accuracy of feature selection. In addition, to account for the diversity of natural language, auxiliary lexicons such as a Chinese synonym lexicon and a Chinese entailment (Conditional Connective) lexicon are built in the thesaurus storage device 100 to improve the accuracy of information matching during word frequency statistics.

The database storage device 300 is connected to the computer computing device 200. According to the code string corresponding to each web page captured by the computer computing device 200, it stores the captured web page data and its word segmentation information, and thereby classifies the captured web page data. By continuously updating the data in the database storage device 300, the invention builds up complete captured web page data and its classification (through the frequency characteristics and weight evaluation of words and phrases), and can perform further searching and capturing on the Internet according to the Chinese target information submitted by the user.

The computer computing device 200 is connected to the thesaurus storage device 100 and the database storage device 300, and comprises a target information processing module 210, a network information processing and analysis module 230, and a network information organizing module 250. The target information processing module 210 receives the Chinese target information submitted by the user. The network information processing and analysis module 230 searches the Internet 2 and generates analysis results. The network information organizing module 250 selects the corresponding category in the database storage device 300 according to the Chinese target information and compares the word segmentation information of the captured web pages; when a web page matches, it retrieves that page to generate captured web page data and provides that data to the user's electronic communication device 3.

The network information processing and analysis module 230 further comprises a network information capturing unit 231, a word and sentence segmentation unit 233, a word frequency processing unit 235, and a web page fingerprint processing unit 237.

The network information capturing unit 231 captures web page data on the Internet 2. Through the Internet 2, it captures web page data from publicly available information such as website pages, search engines, and micro-media for subsequent analysis and classification.

The word and sentence segmentation unit 233 performs first-stage segmentation of the captured web page data according to punctuation marks and the Chinese word segmentation table information, then performs second-stage processing according to the maximum matching method, to produce a segmentation result for the web page. In the first stage, the text is first split into sentences at punctuation marks, and keywords are then cut out of the page text according to the Chinese word segmentation table information to obtain preliminary feature words and phrases. The maximum matching method is then applied for fine-grained segmentation to achieve higher accuracy.
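The punctuation-based part of the first stage can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `split_sentences` and the particular set of punctuation marks are hypothetical and are not specified in the disclosure, and the keyword cutting against the word segmentation table is not shown here.

```python
import re

# Assumed set of sentence-breaking marks (full-width and ASCII); the patent
# does not enumerate which punctuation marks are used.
_PUNCT = re.compile(r"[,。!?;:、,.!?;:\s]+")

def split_sentences(text):
    """Split page text into fragments at punctuation, dropping empty pieces."""
    return [part for part in _PUNCT.split(text) if part]

print(split_sentences("今天天氣很好,我們去公園。你來嗎?"))
```

Each fragment returned here would then be handed to the lexicon-based segmentation of the second stage.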

The maximum matching method is a conventional mechanical word segmentation method. Following a given strategy, it matches the Chinese character string to be analyzed against the entries in a "sufficiently large" lexicon (for example, the thesaurus storage device 100 of the present invention); if a string is found in the lexicon, the match succeeds and a word is recognized. By scanning direction, string-matching segmentation methods can be divided into forward matching and reverse matching; by which length is matched first, they can be divided into maximum (longest) matching and minimum (shortest) matching. One feature of the present invention is the choice of a particular matching method so that it combines well with the subsequent analysis. The invention therefore uses "maximum" matching, and in one implementation applies forward and reverse maximum matching separately and lets the two compete to produce the best segmentation result.
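Forward and reverse maximum matching can be sketched in Python as below. The toy lexicon, the `max_len` window, and in particular the rule used to let the two results compete (prefer the segmentation with fewer words, and prefer the reverse result on a tie) are assumptions made for illustration; the disclosure does not state how the competition is resolved.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, always taking the longest lexicon entry available."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted as a fallback when nothing matches.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, lexicon, max_len=4):
    """Scan right to left, always taking the longest lexicon entry available."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words

def segment(text, lexicon, max_len=4):
    """Assumed tie-break: fewer words wins; on a tie, use the reverse result."""
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    return bwd if len(bwd) <= len(fwd) else fwd

lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", lexicon))
print(backward_max_match("研究生命起源", lexicon))
```

On this example the forward scan greedily takes 研究生 and strands 命, while the reverse scan recovers 研究 / 生命 / 起源, which shows why running both directions can help.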

The word frequency processing unit 235, according to the Chinese synonym information and the Chinese entailment-word information, counts the occurrence frequency of the corresponding words and phrases in the segmentation result to produce a word frequency count result for the web page.
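One way to picture synonym-aware counting is to map each token to a canonical form before counting. The sketch below assumes the synonym information is available as a plain dictionary from variant to canonical word; that representation and the function name `count_terms` are hypothetical, and entailment words are not modeled.

```python
from collections import Counter

def count_terms(tokens, synonyms):
    """Count tokens after folding each synonym into its canonical word."""
    canonical = [synonyms.get(tok, tok) for tok in tokens]
    return Counter(canonical)

tokens = ["手機", "行動電話", "手機", "電池"]
synonyms = {"行動電話": "手機"}  # assumed mapping: variant -> canonical word
print(count_terms(tokens, synonyms))
```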

The web page fingerprint processing unit 237 classifies the captured web page data by page attributes. It uses the tags in the source code (HTML document) of the captured web page as the nodes for splitting paragraphs, applies the TF/IDF weight analysis component to the word frequency count result to produce a weight for each segmented word of the page, selects a predetermined number of top words from the weights ranked from largest to smallest, reorders them by their characters to produce word segmentation information for the page, and finally uses a hash algorithm to convert the selected characters into a code string of a predetermined number of bits, yielding the code string of the web page. The TF/IDF weight analysis component applies the conventional TF/IDF formula to compute the weight of each word in the captured web page data after segmentation. The TF/IDF formula is shown in equation (1):

$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right) \tag{1}$$

In equation (1), $i$ denotes a particular word, $j$ denotes the document containing that word, $tf_{i,j}$ is the frequency with which word $i$ appears in document $j$, $N$ is the total number of documents in the collection, $df_i$ is the number of documents containing word $i$, the log term is the IDF value, and $w_{i,j}$ is the computed weight of word $i$ in document $j$. Applying the TF/IDF weight analysis component in this way yields the weight of every word appearing in a document.
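The weight calculation can be sketched in Python, assuming the conventional form w = tf × log(N / df) that the definitions above describe. Treating tf as a raw count and using the natural logarithm are assumptions, since the text fixes neither the normalization nor the log base; the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document, with w = tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts once per word
    weights = []
    for tokens in docs:
        tf = Counter(tokens)    # raw occurrence count (assumed meaning of tf)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["手機", "電池", "手機"], ["電池", "螢幕"]]
for w in tf_idf(docs):
    print(w)
```

A word that occurs in every document, such as 電池 here, gets weight zero because log(N / df) is zero, which is how common words drop out of the feature set.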

To improve operating efficiency, the system reduces the dimensionality of the feature vector, keeping only the higher-weighted words and phrases as the document's feature items and thus forming a lower-dimensional target feature vector. In one embodiment, the web page fingerprint processing unit 237 selects the top 10 features as the feature string of the captured web page data, and the predetermined number of bits in the web page fingerprint processing step may be set to 128 bits.
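The fingerprint step (keep the top-weighted words, reorder them by character, hash to 128 bits) might look like the following sketch. The disclosure names no specific hash algorithm; MD5 is used here only because it produces a 128-bit digest, and the function name `page_fingerprint` and the joining of words without a separator are likewise assumptions.

```python
import hashlib

def page_fingerprint(weights, top_n=10):
    """Pick the top_n heaviest words, sort them by character, and hash them."""
    # Rank by descending weight; ties are broken by the word itself so the
    # selection is deterministic.
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    terms = sorted(term for term, _ in top)
    # MD5 is an assumed choice that yields 128 bits (32 hex characters).
    digest = hashlib.md5("".join(terms).encode("utf-8")).hexdigest()
    return terms, digest

terms, code = page_fingerprint({"手機": 1.4, "電池": 0.2, "螢幕": 0.9}, top_n=2)
print(terms, code)
```

Because the selected words are sorted before hashing, two pages with the same top words get the same code string regardless of the order in which the words were encountered.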

Referring next to FIG. 2, a flowchart of a method for operating the Chinese network information monitoring and analysis system according to an embodiment of the present invention is shown.

First, in step S101, web page data is captured on the Internet. Next, in step S103, word and sentence segmentation is performed on the web page data, with first-stage processing based on punctuation marks and the Chinese word segmentation table information in the preset thesaurus storage device, and second-stage processing by the maximum matching method, to produce a segmentation result. Next, in step S105, word frequency processing is performed on the web page data, using the Chinese synonym information and Chinese entailment-word information in the thesaurus storage device to count the words and phrases in the segmentation result that correspond to entries in the thesaurus storage device, together with their frequencies, to produce a word frequency count result. Next, in step S107, web page fingerprint processing is performed: the tags in the source code of the captured web page data serve as the nodes for splitting paragraphs, the TF/IDF weight analysis component is applied to the word frequency count result to produce a weight for each segmented word, a predetermined number of top words are selected from the weights ranked from largest to smallest and reordered by their characters to produce word segmentation information, and a hash algorithm converts the selected characters into a code string of a predetermined number of bits, yielding the code string corresponding to the captured web page data. Next, in step S109, the word segmentation information and code string corresponding to the web page data are stored. Finally, in step S111, within the stored word segmentation information and code strings, the corresponding category is selected according to the Chinese target information and the word segmentation information of the captured web pages is compared, producing a match result and its corresponding information. When a web page matches, that page is retrieved to generate captured web page data, and that data is provided.
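The comparison in step S111 can be pictured with a small Python sketch. The record layout (a list of dictionaries with `code`, `category`, `terms`, and `url` fields), the function name `find_matches`, and the rule that any shared word counts as a match are all hypothetical; the disclosure does not specify the storage schema or the matching criterion.

```python
def find_matches(target_terms, category, records):
    """Return URLs of stored pages in the category that share a target word."""
    wanted = set(target_terms)
    hits = []
    for rec in records:
        if rec["category"] != category:
            continue  # only compare pages in the selected category
        if wanted & set(rec["terms"]):
            hits.append(rec["url"])
    return hits

# Hypothetical stored records keyed by code string.
records = [
    {"code": "a1", "category": "3C", "terms": ["手機", "螢幕"], "url": "https://example.com/1"},
    {"code": "b2", "category": "3C", "terms": ["電池"], "url": "https://example.com/2"},
    {"code": "c3", "category": "food", "terms": ["手機"], "url": "https://example.com/3"},
]
print(find_matches(["手機"], "3C", records))
```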

Referring next to FIG. 3, a system block diagram of a Chinese network information monitoring and analysis system according to another embodiment of the present invention is shown. The network information capturing unit 231 may further log in, using preset login information, to network platforms that require login, and capture web page data there. As shown in FIG. 3, the unit can log in to social media to capture, analyze, and classify web page data, which is then stored in the database storage device 300.

In summary, the present invention can select the target's feature information according to the Chinese target information submitted by the user, automatically collect data on the Internet according to that feature information, classify and organize the collected web page data, and import it into the database. Through the system's automatic operation and updating, it provides a personalized search service for Chinese network information.

The present invention has been disclosed above by way of preferred embodiments. Those skilled in the art should understand, however, that these embodiments serve only to illustrate the invention and should not be read as limiting its scope. All changes and substitutions equivalent to these embodiments should be regarded as falling within the scope of the present invention. The scope of protection of the present invention is therefore defined by the appended claims.


Claims (7)

1. A Chinese network information monitoring and analysis system for performing monitoring and analysis on the Internet according to at least one item of Chinese target information submitted from a user's electronic communication device, comprising:
a thesaurus storage device storing a plurality of Chinese word segmentation table information, Chinese synonym information, and Chinese entailment-word information;
a computer computing device connected to the thesaurus storage device and comprising:
a target information processing module that receives the at least one item of Chinese target information;
a network information processing and analysis module that searches the Internet and generates analysis results, comprising: a network information capturing unit that captures web page data on the Internet; a word and sentence segmentation unit that performs first-stage segmentation of the captured web page data according to punctuation marks and the Chinese word segmentation table information, then performs second-stage processing according to the maximum matching method, to produce a segmentation result for the web page; a word frequency processing unit that, according to the Chinese synonym information and the Chinese entailment-word information, counts the occurrence frequency of the corresponding words and phrases in the segmentation result to produce a word frequency count result for the web page; and a web page fingerprint processing unit that classifies the captured web page data by page attributes, using the tags in the source code of the captured web page as the nodes for splitting paragraphs, applying a TF/IDF weight analysis component to the word frequency count result to produce a weight for each segmented word of the web page, selecting a predetermined number of top words from the weights ranked from largest to smallest and reordering them by their characters to produce word segmentation information for the web page, and finally using a hash algorithm to convert the selected characters into a code string of a predetermined number of bits, yielding the code string of the web page; and
a network information organizing module that selects the corresponding category in a database storage device according to the Chinese target information and compares the word segmentation information of the captured web pages, retrieves a page when it matches to generate captured web page data, and provides that data;
wherein the database storage device is connected to the computer computing device and stores the captured web page data and its word segmentation information according to the code string corresponding to each captured web page, thereby classifying the captured web page data.

2. The Chinese network information monitoring and analysis system of claim 1, wherein the network information capturing unit is further configured to log in, using preset login information, to network platforms that require login, and to capture web page data there.

3. The Chinese network information monitoring and analysis system of claim 1, wherein the maximum matching method used by the word and sentence segmentation unit includes forward and reverse maximum matching.

4. A Chinese network information monitoring and analysis method for performing monitoring and analysis on the Internet according to input Chinese target information, comprising the following steps:
capturing web page data on the Internet;
performing word and sentence segmentation on the web page data, with first-stage processing based on punctuation marks and the Chinese word segmentation table information in a preset thesaurus storage device, and second-stage processing by the maximum matching method, to produce a segmentation result;
performing word frequency processing on the web page data, using the Chinese synonym information and Chinese entailment-word information in the thesaurus storage device to count the words and phrases in the segmentation result that correspond to entries in the thesaurus storage device, together with their frequencies, to produce a word frequency count result;
performing web page fingerprint processing, in which the tags in the source code of the captured web page data serve as the nodes for splitting paragraphs, a TF/IDF weight analysis component is applied to the word frequency count result to produce a weight for each segmented word, a predetermined number of top words are selected from the weights ranked from largest to smallest and reordered by their characters to produce word segmentation information, and a hash algorithm converts the selected characters into a code string of a predetermined number of bits, yielding the code string corresponding to the captured web page data;
storing the word segmentation information and code string corresponding to the web page data; and
within the stored word segmentation information and code strings, selecting the corresponding category according to the Chinese target information and comparing the word segmentation information of the captured web pages, retrieving a page when it matches to generate captured web page data, and providing that data.
frequency counting result, to generate the weight of each word segment, and The pre-determined number of word segments are selected from the arrangement of the largest and smallest weights, and are rearranged according to the characters to generate a piece of word information, and finally the selected characters are converted into the number of reserved bits according to the hash algorithm. Serial code, which further generates a serial code corresponding to the captured webpage data; stores the word segmentation information and the serial code corresponding to the webpage data; and in the stored segmentation information and the serial code, selects a corresponding category according to the Chinese target information and compares The segmentation information of the captured webpage is retrieved from the webpage of the webpage to obtain a webpage retrieval data to provide the captured webpage retrieval data. 如申請專利範圍第4項所述之中文網路資訊監測分析方法,其中於網際網路上進行網頁資料擷取的步驟中更包含於需登錄資訊的網路平台中進行網頁資料擷取的步驟。 For example, the Chinese network information monitoring and analysis method described in claim 4, wherein the step of extracting webpage data on the Internet further includes the step of extracting webpage data in a web platform that needs to log in information. 如申請專利範圍第4項所述之中文網路資訊監測分析方法,其中於進行網頁指紋處理的步驟中,所選取之前預定數量的分詞係為選取前10個。 For example, in the Chinese network information monitoring and analysis method described in claim 4, in the step of performing webpage fingerprint processing, the predetermined predetermined number of word segmentation is selected as the first ten. 如申請專利範圍第6項所述之中文網路資訊監測分析方法,其中於進行網頁指紋處理的步驟中,該預訂位元數係為128位元。 For example, in the Chinese network information monitoring and analysis method described in claim 6, wherein in the step of performing webpage fingerprint processing, the number of reserved bits is 128 bits.
Priority Applications (1)

Application Number Priority Date Filing Date Title
TW102115477A TWI534640B (en) 2013-04-30 2013-04-30 Chinese network information monitoring and analysis system and its method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW102115477A TWI534640B (en) 2013-04-30 2013-04-30 Chinese network information monitoring and analysis system and its method

Publications (2)

Publication Number Publication Date
TW201333735A TW201333735A (en) 2013-08-16
TWI534640B TWI534640B (en) 2016-05-21

Family

ID=49479526

Family Applications (1)

Application Number Title Priority Date Filing Date
TW102115477A TWI534640B (en) 2013-04-30 2013-04-30 Chinese network information monitoring and analysis system and its method

Country Status (1)

Country Link
TW (1) TWI534640B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI462534B (en) * 2009-02-13 2014-11-21 Alibaba Group Holding Ltd Information transmission method and device in instant messaging
CN110020422B (en) * 2018-11-26 2020-08-04 阿里巴巴集团控股有限公司 Feature word determining method and device and server

Also Published As

Publication number Publication date
TW201333735A (en) 2013-08-16


Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees