Summary of the invention
Accordingly, the technical problem to be solved by the present invention is that, in the prior art, a client must be installed on an Internet-access device before the Internet data packets generated by the device's owner while online can be obtained. The invention therefore provides a method and system for identifying the language of network text information that can obtain the Internet data packets, and identify the language of the text information in them, without installing a client on the Internet-access device.
To solve the above technical problem, the technical solution of the present invention is as follows:
The invention provides a method for identifying the language of network text information, comprising the following steps:
Collecting, at a network access position, the Internet data packets generated by an Internet-access device while it is online;
Obtaining the text information contained in the Internet data packets;
Identifying, according to the text information, the language of the text information generated by the Internet-access device.
In the method for identifying the language of network text information according to the present invention, the step of obtaining the text information contained in the Internet data packets comprises:
Reassembling the Internet data packets into a transport-layer session data stream according to the transport-layer protocol;
Parsing out the data contained in the transport-layer session data stream according to the Hypertext Markup Language (HTML) protocol;
Extracting the text information contained in the data.
In the method for identifying the language of network text information according to the present invention, the step of identifying, according to the text information, the language of the text information generated by the Internet-access device comprises:
Parsing out the character code in Unicode corresponding to each character contained in the text information;
Obtaining the coding range of the text information in Unicode according to the character codes;
Identifying the language of the text information according to the coding range.
The method for identifying the language of network text information according to the present invention further comprises, after the step of identifying the language of the text information generated by the Internet-access device:
Storing the text information in categories according to its language.
The invention also provides a system for identifying the language of network text information, comprising:
A collection device, configured to collect, at a network access position, the Internet data packets generated by an Internet-access device while it is online;
An acquisition device, configured to obtain the text information contained in the Internet data packets;
An identification device, configured to identify, according to the text information, the language of the text information generated by the Internet-access device.
In the system for identifying the language of network text information according to the present invention, the acquisition device comprises:
A reassembly unit, configured to reassemble the Internet data packets into a transport-layer session data stream according to the transport-layer protocol;
A first parsing unit, configured to parse out the data contained in the transport-layer session data stream according to the Hypertext Markup Language (HTML) protocol;
An extraction unit, configured to extract the text information contained in the data.
In the system for identifying the language of network text information according to the present invention, the identification device comprises:
A second parsing unit, configured to parse out the character code in Unicode corresponding to each character contained in the text information;
A range acquisition unit, configured to obtain the coding range of the text information in Unicode according to the character codes;
A language identification unit, configured to identify the language of the text information according to the coding range.
The system for identifying the language of network text information according to the present invention further comprises:
A classified storage device, configured to store the text information in categories according to its language.
Compared with the prior art, the above technical solution of the present invention has the following advantages:
The invention provides a method and system for identifying the language of network text information. Internet data packets generated by an Internet-access device while it is online are collected at a network access position; the text information contained in the packets is then obtained, and the language of the text information generated by the device is identified from that text information. The method and system can therefore obtain the Internet data packets and identify the language of the text information in them without installing a client on the Internet-access device. The nationality of the device's owner can further be inferred from the language, which allows a security department to monitor specific groups (for example, people within a certain country) in a targeted manner, improves supervision efficiency, helps the security department obtain information related to terrorist activity in a timely manner, and safeguards social stability.
Detailed description of the embodiments
Embodiment 1
This embodiment provides a method for identifying the language of network text information which, as shown in Figure 1, comprises the following steps:
S1. Collect, at a network access position, the Internet data packets generated by an Internet-access device while it is online.
S2. Obtain the text information contained in the Internet data packets.
S3. Identify, according to the text information, the language of the text information generated by the Internet-access device.
Preferably, the method further comprises the following step after step S3:
S4. Store the text information in categories according to its language.
Specifically, the Internet data packets generated by the Internet-access device while it is online can be collected by a data collection node arranged at the network access position. The packets of each Internet-access device can be collected by polling.
Specifically, the Internet data packets may be stored first, after which the operations above are performed on the stored packets to identify the language of the text information generated by the Internet-access device; once the language is identified, the stored data are labeled by language. Alternatively, the operations above may be performed first and the text information then stored in categories according to its language. In short, data storage may take place either before or after identification, and the choice can be made according to the system architecture when the system is built.
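The two orderings described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the record layout and the `identify` callable (standing in for step S3) are all assumptions made for illustration, and in-memory containers stand in for whatever persistent storage a real system would use.

```python
from collections import defaultdict


def identify_then_store(texts, identify):
    """Identify first, then store each text under its language (step S4)."""
    store = defaultdict(list)
    for text in texts:
        store[identify(text)].append(text)
    return dict(store)


def store_then_label(texts, identify):
    """Store first with no label, then attach a language label to each record."""
    # Hypothetical record layout: one dict per stored text.
    records = [{"text": t, "language": None} for t in texts]
    for record in records:
        record["language"] = identify(record["text"])
    return records
```

Both functions produce the same classification; they differ only in whether the data exist in storage before the language is known.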
Preferably, as shown in Figure 2, the step of obtaining the text information contained in the Internet data packets may comprise:
S21. Reassemble the Internet data packets into a transport-layer session data stream according to the transport-layer protocol.
S22. Parse out the data contained in the transport-layer session data stream according to the Hypertext Markup Language (HTML) protocol.
S23. Extract the text information contained in the data.
Specifically, when the owner of an Internet-access device uses it to send e-mail, chat, post on online forums and so on, text is generally entered, so the Internet data packets generated by the device while it is online contain this text information. After the packets are collected, they are reassembled into a transport-layer session data stream according to the transport-layer protocol, and the data contained in the stream can be parsed out according to the Hypertext Markup Language (HTML) protocol. These data include the MAC address of the Internet-access device, the type of network access (sending e-mail, browsing web pages, posting on forums, chatting, etc.) and the Internet content (e-mail content, URL addresses, website post content, chat partners, chat content), among others. The text information they contain, such as e-mail content, chat content and post content, can therefore be extracted from these data.
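Steps S21 to S23 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of the invention: the `Packet` record and its field names are hypothetical, sessions are keyed only by sender and receiver, sequence-number wraparound and missing segments are ignored, and the payload is assumed to be UTF-8 encoded HTML. The text extraction uses the standard-library `html.parser.HTMLParser`.

```python
from collections import defaultdict
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class Packet:
    """Hypothetical captured segment; field names are illustrative only."""
    src: str        # "ip:port" of the sender
    dst: str        # "ip:port" of the receiver
    seq: int        # transport-layer sequence number
    payload: bytes  # bytes carried by the segment


def reassemble(packets):
    """S21: group segments by (src, dst) and join payloads in sequence order."""
    sessions = defaultdict(dict)
    for p in packets:
        # Keying by seq drops retransmitted copies of the same segment.
        sessions[(p.src, p.dst)][p.seq] = p.payload
    return {
        key: b"".join(segments[s] for s in sorted(segments))
        for key, segments in sessions.items()
    }


class _TextCollector(HTMLParser):
    """S22/S23: collect the character data found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(stream, encoding="utf-8"):
    """Decode a reassembled stream and return its text content."""
    parser = _TextCollector()
    parser.feed(stream.decode(encoding, errors="replace"))
    parser.close()
    return " ".join(parser.chunks)
```

For example, segments that arrive out of order or more than once for the same session are merged by `reassemble` into one byte stream, which `extract_text` then reduces to the text between the tags.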
Preferably, as shown in Figure 3, the step of identifying, according to the text information, the language of the text information generated by the Internet-access device may comprise:
S31. Parse out the character code in Unicode corresponding to each character contained in the text information.
S32. Obtain the coding range of the text information in Unicode according to the character codes.
S33. Identify the language of the text information according to the coding range.
Specifically, the character code in Unicode (also known as the universal code) corresponding to each character contained in the text information is parsed out, and the coding range of the text information in Unicode is obtained from these character codes. For example, when the coding range falls within 4E00-9FBF, the language of the corresponding text information can be identified as Chinese by lookup and comparison against this range; when the coding range falls within 0600-06FF, 0750-077F, FB50-FDFF or FE70-FEFF, the language can be identified as Arabic; when the coding range falls within 1800-18AF, the language can be identified as Mongolian; and so on. From the language of the text information, the nationality of the device's owner can in turn be inferred, that is, whether the owner is Chinese, Arab, Mongolian, or a person of another country or ethnic group. After the language is confirmed, the text information is stored in categories according to its language, for example as Chinese information, English information, Tibetan information, Uyghur information, mixed Chinese-English information, mixed Chinese-Uyghur information and so on, and displayed accordingly, which facilitates later querying and monitoring.
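The range lookup of steps S31 to S33 can be sketched in Python as below. The range table copies only the three examples given in the description; the function names and the majority-vote rule in `dominant_language` are assumptions added for illustration, since the description does not say how text spanning several ranges is resolved.

```python
from collections import Counter

# Code-point ranges quoted in the description (inclusive, hexadecimal).
SCRIPT_RANGES = {
    "Chinese": [(0x4E00, 0x9FBF)],
    "Arabic": [(0x0600, 0x06FF), (0x0750, 0x077F),
               (0xFB50, 0xFDFF), (0xFE70, 0xFEFF)],
    "Mongolian": [(0x1800, 0x18AF)],
}


def language_of_char(ch):
    """S31/S32: map one character's code point to a language label, or None."""
    code_point = ord(ch)
    for language, ranges in SCRIPT_RANGES.items():
        if any(low <= code_point <= high for low, high in ranges):
            return language
    return None


def identify_languages(text):
    """Count how many characters of the text fall into each known range."""
    counts = Counter()
    for ch in text:
        language = language_of_char(ch)
        if language is not None:
            counts[language] += 1
    return counts


def dominant_language(text):
    """S33 (assumed rule): pick the language with the most matching characters."""
    counts = identify_languages(text)
    return counts.most_common(1)[0][0] if counts else None
```

Keeping the per-language counts, rather than only the winner, makes it possible to label mixed text (for example Chinese-English) as its own category. Note that a code-point range identifies a script, so the table maps ranges to language labels only in the same sense as the description does.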
The method for identifying the language of network text information described in this embodiment can obtain the Internet data packets and identify the language of the text information in them without installing a client on the Internet-access device. The nationality of the device's owner can further be inferred from the language, which allows a security department to monitor specific groups (for example, people within a certain country) in a targeted manner, improves supervision efficiency, helps the security department obtain information related to terrorist activity in a timely manner, and safeguards social stability.
Embodiment 2
This embodiment provides a system for identifying the language of network text information which, as shown in Figure 4, comprises:
A collection device 1, configured to collect, at a network access position, the Internet data packets generated by an Internet-access device while it is online.
An acquisition device 2, configured to obtain the text information contained in the Internet data packets.
An identification device 3, configured to identify, according to the text information, the language of the text information generated by the Internet-access device.
Preferably, the system may further comprise a classified storage device 4, configured to store the text information in categories according to its language.
Preferably, the acquisition device 2 may comprise:
A reassembly unit 21, configured to reassemble the Internet data packets into a transport-layer session data stream according to the transport-layer protocol.
A first parsing unit 22, configured to parse out the data contained in the transport-layer session data stream according to the Hypertext Markup Language (HTML) protocol.
An extraction unit 23, configured to extract the text information contained in the data.
Preferably, the identification device 3 may comprise:
A second parsing unit 31, configured to parse out the character code in Unicode corresponding to each character contained in the text information.
A range acquisition unit 32, configured to obtain the coding range of the text information in Unicode according to the character codes.
A language identification unit 33, configured to identify the language of the text information according to the coding range.
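One way to picture how devices 1 to 4 fit together is the following Python sketch. It is only an architectural illustration under assumptions: the class name, the use of plain callables for devices 1 to 3, and the dictionary standing in for device 4 are all hypothetical and are not prescribed by this embodiment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class LanguageIdentificationSystem:
    """Hypothetical wiring of devices 1 to 4 as pluggable components."""
    collect: Callable[[], Iterable[bytes]]   # device 1: yields captured data
    obtain_text: Callable[[bytes], str]      # device 2: data -> text information
    identify: Callable[[str], str]           # device 3: text -> language label
    storage: Dict[str, List[str]] = field(default_factory=dict)  # device 4

    def run(self):
        """Pass every collected item through devices 2, 3 and 4 in order."""
        for data in self.collect():
            text = self.obtain_text(data)
            language = self.identify(text)
            self.storage.setdefault(language, []).append(text)
        return self.storage
```

Because each device is passed in as a callable, a hardware-backed collector or a different identification rule can be substituted without changing the rest of the pipeline.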
The system for identifying the language of network text information described in this embodiment requires no client to be installed on the Internet-access device: the Internet data packets are obtained by the collection device 1, and the language of the text information in them is identified by the acquisition device 2 and the identification device 3. The nationality of the device's owner can further be inferred from the language, which allows a security department to monitor specific groups (for example, people within a certain country) in a targeted manner, improves supervision efficiency, helps the security department obtain information related to terrorist activity in a timely manner, and safeguards social stability.
Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system or a computer program product. Accordingly, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.