CN103853701A

CN103853701A - Neural-network-based self-learning semantic detection method and system

Info

Publication number: CN103853701A
Application number: CN201210505765.1A
Authority: CN
Inventors: 苏青; 苗光胜; 牛温佳; 唐晖; 慈松; 谭红艳
Original assignee: Wasu Media & Network Co ltd; Institute of Acoustics CAS
Current assignee: Wasu Media & Network Co ltd; Institute of Acoustics CAS
Priority date: 2012-11-30
Filing date: 2012-11-30
Publication date: 2014-06-11

Abstract

The invention discloses a neural-network-based self-learning semantic detection method and system. The method comprises the following steps. Step 101: a dictionary library is imported to segment the filename to be recognized into keywords, and a probability item is calculated for each keyword with a Bayesian algorithm; the probability items are derived from analysis of filenames already judged good or bad. Step 102: the product of the probabilities with which all keywords occur in good-semantics string names is multiplied by the prior probability of good-semantics string names to give a first product; the product of the probabilities with which all keywords occur in bad-semantics string names is multiplied by the prior probability of bad-semantics string names to give a second product. Step 103: the two products are compared; if the first product is larger than the second, the string is judged to have good semantics, otherwise it is judged to have bad semantics; the judgment result is stored in a storage medium.

Description

A neural-network-based self-learning semantic detection method and system

Technical field

The invention belongs to the field of network information processing and analysis, in particular to the automatic judgment of the nature and tendency of textual content, and specifically relates to a neural-network-based self-learning semantic detection method and system.

Background technology

Automatic processing and analysis of network information is an important part of analyzing, detecting and managing web content, and is significant for content handling and for building security systems.

As network technology develops and operators provide ever more bandwidth, users can easily access and download all kinds of information. The added bandwidth gives information transmission a broader stage, but it also makes harmful information easier to spread. In recent years harmful information such as obscene, pornographic and reactionary content has spread widely on the network. Traditional network information processing schemes often need a great deal of manpower and material resources to identify such information and, limited by objective conditions, fall far short of the practical need to discover and handle it.

The Internet is like a huge water system formed by many converging rivers, with all kinds of content flowing through it at high speed, and network users access it much as they would draw water from a river. The flow of this water system is enormous and extremely fast, and hundreds of millions of users are connected to it. Traditional network information processing and analysis schemes cannot analyze the nature of network information automatically and intelligently, so large numbers of staff must analyze and judge it manually. The prior art simply defines certain segmented words as good or bad and judges a filename bad if it contains a bad word, rather than performing a Bayesian probability calculation. The workload of defining words this way is very large and the definitions are hard to update, whereas the present system can update itself by self-learning at any time, which avoids missed or wrong judgments caused by newly emerging words. The present system also adds a feedback stage that guards against incomplete or incorrect segmentation and so improves the success rate. In terms of system composition, existing judgment systems essentially have only a word segmentation module and a discrimination module: they perform simple segmentation and then check whether a bad keyword is present in order to judge the attribute of the filename, and their success rate is often low.

Under present conditions, manual methods cannot keep up with real-time analysis of the huge volume of Internet content. There is an urgent need for a network information processing and identification scheme with intelligent analysis capability that can automatically detect and judge specific attributes of network information.

Summary of the invention

To overcome the problems above, the invention provides a neural-network-based self-learning semantic detection method and system.

To achieve this object, the invention provides a neural-network-based self-learning semantic detection method, comprising:

Step 101) importing a dictionary library to segment the filename to be identified, obtaining the keywords in the filename, and calculating the probability item of each keyword with a Bayesian algorithm, where the probability items are obtained by analyzing filenames already judged good or bad;

Step 102) obtaining the product of the probabilities with which all keywords occur in good-semantics string names and the prior probability of good-semantics string names, and multiplying these two quantities to obtain a first product; and

obtaining the product of the probabilities with which all keywords occur in bad-semantics string names and the prior probability of bad-semantics string names, and multiplying these two quantities to obtain a second product;

Step 103) comparing the first product with the second product; if the first product is greater than the second, the string has good semantics, otherwise it has bad semantics; the judgment result is stored in a storage medium.

The probability items are: the proportion P(V_j) of each of the two classes, good and bad, and the probability P(W_k \mid V_j) that a word drawn at random from a filename of class V_j is W_k.

Here P(V_j) is computed as (number of filenames in the subset of V whose target value is V_j) / (total number of filenames in V), where V is the set of filenames.

P(W_k \mid V_j) is computed as:

P(W_k \mid V_j) = \frac{n_k + 1}{n + |V|}

where n is the total number of distinct keywords in Text_j; Text_j is the single document formed by concatenating all members of docs_j; docs_j is the subset of filenames in V whose target value is V_j, with V_j being good or bad; n_k is the number of times the word W_k appears in Text_j; and |V| is the number of filenames in V.
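As a hypothetical illustration (the numbers are invented for this example and do not come from the source): suppose W_k appears n_k = 3 times in Text_j, Text_j contains n = 10 distinct keywords, and V holds |V| = 20 filenames. Then

P(W_k \mid V_j) = \frac{3 + 1}{10 + 20} = \frac{4}{30} \approx 0.133

A keyword that never appears in Text_j (n_k = 0) still receives \frac{1}{30} \approx 0.033 rather than zero, so a single unseen word cannot force the whole product of step 102) to zero.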

In step 102), the product of the probabilities with which all keywords occur in good-semantics string names is

\prod_k P(W_k \mid V_j)

where V_j is the good class and each W_k in the product is a keyword in the filename; the prior probability of good-semantics string names is P = P(V_j). The product of the probabilities with which all keywords occur in bad-semantics string names is

\prod_k P(W_k \mid V_j)

where V_j is the bad class; the prior probability of bad-semantics string names is P = P(V_j).

Preferably, between step 101) and step 102) the method further comprises: using a feedback strategy to ensure that all keywords in the filename are segmented completely. The good or bad judgment results on which the probability items are based may be obtained by manual review.

Based on the above method, the invention also provides a neural-network-based self-learning semantic detection system, comprising:

a probability item acquisition module, which imports a dictionary library to segment the filename to be identified, obtains the keywords in the filename, and calculates the probability item of each keyword with a Bayesian algorithm, where the probability items are obtained by analyzing good or bad judgment results;

a processing module, which obtains the product of the probabilities with which all keywords occur in good-semantics string names and the prior probability of good-semantics string names and multiplies the two; and which obtains the product of the probabilities with which all keywords occur in bad-semantics string names and the prior probability of bad-semantics string names and multiplies the two;

a comparison and judgment module, which acts on the output of the processing module as follows:

if the keyword probability product for good-semantics string names multiplied by the prior probability of good-semantics string names is greater than the keyword probability product for bad-semantics string names multiplied by the prior probability of bad-semantics string names, the string has good semantics; otherwise it has bad semantics. The judgment result is stored in a storage medium.

The probability items comprise the class proportion P(V_j) and the probability P(W_k \mid V_j) that a word drawn at random from a filename of class V_j is W_k,

where P(V_j) is computed as (number of filenames in the subset of V whose target value is V_j) / (total number of filenames in V), and V is the set of filenames.

P(W_k \mid V_j) is computed as:

P(W_k \mid V_j) = \frac{n_k + 1}{n + |V|}

where n is the total number of distinct keywords in Text_j, Text_j is the single document formed by concatenating all members of docs_j, docs_j is the subset of filenames in V whose target value is V_j, with V_j being good or bad, and n_k is the number of times the word W_k appears in Text_j.

The processing module further comprises:

a first processing submodule, which uses \prod_k P(W_k \mid V_j) and P = P(V_j), with V_j the good class, to obtain the product of the keyword probability product for good-semantics string names and the prior probability of good-semantics string names:

P(V_j) \cdot \prod_k P(W_k \mid V_j);

and

a second processing submodule, which uses \prod_k P(W_k \mid V_j) and P = P(V_j), with V_j the bad class, to obtain the product of the keyword probability product for bad-semantics string names and the prior probability of bad-semantics string names:

P(V_j) \cdot \prod_k P(W_k \mid V_j).

Preferably, the system also comprises a feedback module between the probability item acquisition module and the processing module. The feedback module checks whether keyword segmentation is complete and restarts keyword segmentation where it is not.

In short, the innovation of the invention is to build a model of a neural network and its neurons and to exploit the ability of a neural network to produce action strategies through self-organized self-learning. Accurate probability item products (P(V_j) \cdot \prod_k P(W_k \mid V_j)) are obtained by computation over sample network information, and these products are then used to identify the partial semantics of network information over a larger and wider range. The advantage is that the sample data can be updated at any time, so that newly emerging network information receives a correct semantic judgment in good time, and the way the sample data is segmented can be improved as well. With slight changes to the sample environment, a greater variety of information can be judged. The invention therefore has a wide scope of application. Because it uses self-learning detection that is continually updated through ongoing manual judgment, it can detect the true semantics of network information even when the attributes of that information keep changing. It solves the problem that semantics cannot be detected correctly in a network where semantics change, and it provides a technical platform for strengthening network information processing and identification and for automatically detecting and judging specific attributes of network information.

The composition of the present system is comparatively complex. It includes a self-learning training module: a set of sample filenames is judged manually, training is carried out on the results of that judgment, and the probability item of each keyword is derived; these probability items are then used to judge a larger set of filenames automatically and output the results, generally with a higher success rate. It also includes a feedback module that corrects incomplete or incorrect segmentation and thereby improves accuracy. Because the self-learning detection process runs automatically, when the network structure or deployment changes, updated detection results can be obtained promptly simply by updating the configuration file.

Brief description of the drawings

Fig. 1 is a schematic block diagram of the self-learning judgment method of the invention;

Fig. 2 is a schematic block diagram of the discrimination module provided by the invention.

Embodiment

Embodiments of the invention are described below, taking the judgment of filename strings as an example.

Technical content of the invention: a neural-network-based self-learning semantic detection method, with the following steps:

1. Strings carrying network information are stored in a storage medium; the human-computer interaction module accesses the storage medium for manual review, and the review results are stored back in the storage medium;

2. The manual review results are used to train on the sample strings: sample filenames are segmented, and each segmented word and its probability are stored in the storage medium, so as to build a neural network theoretical model containing neurons that identifies the partial semantics of strings carrying network information;

3. The dictionary and the user dictionary library are imported, the sample filenames to be judged are segmented, the filenames are judged according to the dictionary library and the segmentation results, and the results are stored in the database;

4. Through the feedback module, each segmentation result is checked manually for completeness and correctness; if it is not complete and correct, keywords are added to the user dictionary library and training on the sample filenames is repeated;

5. The segmentation results and probabilities in the sample database are imported into the storage medium holding the data to be judged, and the sample segmentation results are used to judge the partial semantics of strings carrying network information.
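The source does not say which segmentation algorithm is applied once the dictionary and user dictionary are imported in step 3. One common dictionary-based approach is forward maximum matching, sketched here purely as an assumption; the function name `segment` and the sample dictionaries are hypothetical.

```python
def segment(text, dictionary):
    """Forward maximum matching: at each position, take the longest dictionary word.

    This is an assumed stand-in; the source only says a dictionary library is used.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            match = text[i]  # unknown character: emit it on its own
        tokens.append(match)
        i += len(match)
    return tokens


base_dictionary = {"free", "movie", "download"}  # hypothetical dictionary library
user_dictionary = {"hd"}                         # hypothetical user dictionary
print(segment("freehdmovie", base_dictionary | user_dictionary))
```

Merging the two sets before segmenting reflects the idea that words added through the feedback module take effect on the next segmentation pass.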

The information collected by the invention can be organized and stored in a database.

The network information can include: string name, IP address, hash value, site address.

The self-learning information can include: manual review result, manual review flag, keyword library, keyword probability items, prior probability items.

The discrimination information can include: machine review result, machine review flag, analysis result for the partial semantics of the string.

The human-computer interaction module works as follows. It first accesses the storage medium holding the sample strings and requests the list of strings that have not yet been reviewed manually. The human-computer interaction interface displays the returned list together with the manual review options, so that the strings can be reviewed one by one. The judgment results are stored in the storage medium for the subsequent self-learning training process. These review results are the evaluation that the external environment gives the system in neural network self-learning, and the system's judgment performance is reinforced through them.

The self-learning training module works as follows. It first analyzes the review results produced by the human-computer interaction module and imports the dictionary library, then analyzes each segmented word using the naive Bayes algorithm for learning and classifying text, calculates the probability item of each segmented word, and stores the results in the storage medium. This is the core embodiment of neural network self-organized self-learning. In neural network learning, supervised learning can quickly compute accurate weights from concrete data samples. In the invention the data samples are the sample strings described above, which carry network information; through self-organized learning, the partial semantics of a larger body of network information can be identified.

The machine judgment module works as follows. It first imports the dictionary library for segmentation and accesses the database, stores the segmentation results in the database, and requests the probability items of the segmented words. It then compares two products: the keyword probability product for good-semantics string names multiplied by the prior probability of good-semantics string names, and the keyword probability product for bad-semantics string names multiplied by the prior probability of bad-semantics string names. If the product for good-semantics strings is greater than the product for bad-semantics strings, the string has good semantics; otherwise it has bad semantics. The judgment result is stored in the storage medium.

The feedback module works as follows. It first accesses the storage medium holding the segmentation results and requests the strings and their segmentation results, which are displayed in the human-computer interaction interface. A reviewer checks whether each segmentation is complete and correct; if not, the correct new words are added to the dictionary library. This continues until the segmentation of all filenames has been reviewed, which improves the success rate of the judgment results.

The invention collects data such as strings carrying network information and their hash values and stores them in a medium, which gives the overall network information a structure. Then, with the string hash value as the identity ID, the string as input, and the human-computer interaction interface linking human and machine as the communication medium, the naive Bayes algorithm for learning and classifying text serves as the basis for calculating the probability item of each segmented word, which makes the method genuinely self-learning. Finally, the probability items of the segmented words are used to judge filenames good or bad and to draw conclusions such as the attribute of the filename, which gives the system real emotion recognition capability.

The invention provides a neural-network-based self-organized learning semantic detection method. In neural network self-learning, the system reinforces rewarded actions according to the evaluation provided by the external environment in order to improve its own performance, and it produces a series of action strategies through learning. This is done by establishing a theoretical model of neurons and the neural network and, on the basis of that theoretical study, constructing a concrete neural network model for computer simulation, including the study of self-organized network learning. The direct purpose is to detect strings of network information through self-organized learning, training and updating, and to judge part of the semantics of that information, thereby identifying the nature of the information automatically.

Embodiment

The model collects data such as filenames and filename hash values and stores them in database tables, which gives the overall network information a structure. Then, with the filename hash value as the identity ID, the filename as input, and the interaction between human and machine as the communication medium, the naive Bayes algorithm for learning and classifying text serves as the basis for calculating the probability item of each segmented word, which makes the system genuinely self-learning. Finally, the probability items of the segmented words are used to judge filenames good or bad and to draw conclusions such as the attribute of the filename, which gives the system real emotion recognition capability. The detailed process is given below with reference to Fig. 1.

Step 1: Collect network information and store it in a database in organized form, so that the network information is structured. The information falls into two parts: system input information and system output information.

1. System input network information is the data fed into the system, such as the filenames to be processed, for use in the self-learning detection of later stages.

filename (the filename to be judged)

infohash (the hash value of the filename; it serves as the filename ID and uniquely identifies the filename)

All of this information can be obtained with known network packet capture, making full use of existing results whose soundness and completeness have been verified through testing.

2. System output network information is what the system outputs after it has judged the input information through a series of processes such as self-learning and review judgment.

man_checked (whether the filename has been reviewed manually)

man_result (the result of manual review)

machine_checked (whether the filename has been reviewed by the machine)

machine_result (the result of machine review)

keywords_checked (whether the keywords have been fully reviewed)

To store the above information, the invention creates a database for it. MySQL is used because its storage is powerful and flexible, easy to update and extend, and easy to convert to various interface forms.

Step 2: Using the link between the manual review module and the stored data, review the sample filename database manually. Segment the filenames with the user dictionary library, analyze the probability items of the segmented words with the Bayesian algorithm, and store them in the database. Use the feedback module to check whether the keywords are complete; if a keyword has not been segmented completely, add a suitable keyword to the user dictionary library.

1. Manual review:

The main client for manual review is a web page written in PHP with a friendly interface that lets reviewers judge sample filenames more accurately and completely. When the page is opened, it first accesses the filename database and reads the records in table filename whose man_checked field is 0, then creates links for judging each filename, named good and bad. When the user clicks a link, the man_checked and man_result fields of that filename are updated according to the click, and the page is then refreshed so that the review page shows the database contents in real time.
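The review flow above can be sketched as two database operations: list the records that still have man_checked = 0, then record a reviewer's verdict. The sketch uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the MySQL and PHP setup the source describes. The column types, the helper names `pending_reviews` and `record_review`, and the sample rows are assumptions; only the table and field names come from the text.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database named in the source.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE filename (
        infohash         TEXT PRIMARY KEY,            -- unique ID of the filename
        filename         TEXT NOT NULL,               -- filename to be judged
        man_checked      INTEGER NOT NULL DEFAULT 0,  -- 1 once manually reviewed
        man_result       TEXT,                        -- 'good' or 'bad'
        machine_checked  INTEGER NOT NULL DEFAULT 0,
        machine_result   TEXT,
        keywords_checked INTEGER NOT NULL DEFAULT 0
    )
    """
)
conn.executemany(
    "INSERT INTO filename (infohash, filename) VALUES (?, ?)",
    [("h1", "holiday photos"), ("h2", "some other name")],  # invented sample rows
)


def pending_reviews(conn):
    """Return (infohash, filename) for records not yet reviewed manually."""
    cursor = conn.execute(
        "SELECT infohash, filename FROM filename "
        "WHERE man_checked = 0 ORDER BY infohash"
    )
    return cursor.fetchall()


def record_review(conn, infohash, verdict):
    """Store the reviewer's click ('good' or 'bad') and mark the record reviewed."""
    if verdict not in ("good", "bad"):
        raise ValueError("verdict must be 'good' or 'bad'")
    conn.execute(
        "UPDATE filename SET man_checked = 1, man_result = ? WHERE infohash = ?",
        (verdict, infohash),
    )
    conn.commit()


record_review(conn, "h1", "good")
print(pending_reviews(conn))  # only the unreviewed record remains
```

Re-running `pending_reviews` after each update plays the role of the page refresh, since the list always reflects the current database state.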

The manual review module is the foundation of the self-learning semantic detection method. Only manual review can determine accurately whether a filename is good or bad, and only with the manual review results as a basis can the probability items of the segmented words be analyzed accurately, which in turn raises the success rate of the machine review in later stages.

2. Self-learning training

The self-learning training module takes the results produced by the manual review module, segments the filenames with the user dictionary library, analyzes the probability items of the filename keywords with the Bayesian algorithm, and stores the organized results in the database. These results are the basis for the machine judgment module in later stages.

The user dictionary library is loaded for segmentation and the segmented words are stored in the database. The probability item of each keyword is then calculated with the Bayesian algorithm. Let V be the set of filenames. The probability item P(W_k \mid V_j) is calculated first, and the value of the corresponding field in the database table is updated. It describes the probability that a word drawn at random from a filename of class V_j is W_k (V_j being legal or illegal). The prior probabilities of the two classes, legal files and illegal files, are calculated at the same time.

The required probability items P(V_j) and P(W_k \mid V_j) are calculated as follows:

1) docs_j: the subset of filenames in V whose target value is V_j

2) P(V_j): (number of filenames in docs_j) / (total number of filenames in V)

3) Text_j: the single document formed by concatenating all members of docs_j

4) n: the total number of distinct keywords in Text_j

5) n_k: the number of times the word W_k appears in Text_j

6) P(W_k \mid V_j) = \frac{n_k + 1}{n + |V|}
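The six quantities above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function names `train` and `keyword_probability`, the model layout, and the sample data are hypothetical. It follows the definitions as the text states them, reading n as the number of distinct keywords in Text_j and |V| as the number of filenames in V. The usual textbook Laplace-smoothed estimate puts the vocabulary size in the denominator instead, so treat that reading as an assumption.

```python
from collections import Counter


def train(samples):
    """Build probability items from manually reviewed samples.

    samples: non-empty list of (keywords, label) pairs, where keywords is the
    segmented filename and label is "good" or "bad".
    """
    total_files = len(samples)  # |V|: number of filenames in V, per the text
    model = {"total_files": total_files, "classes": {}}
    for label in ("good", "bad"):
        docs_j = [keywords for keywords, lab in samples if lab == label]
        text_j = [word for keywords in docs_j for word in keywords]
        counts = Counter(text_j)  # n_k for every word W_k seen in Text_j
        model["classes"][label] = {
            "prior": len(docs_j) / total_files,  # P(V_j)
            "counts": counts,
            "n": len(counts),  # distinct keywords in Text_j (assumed reading)
        }
    return model


def keyword_probability(model, word, label):
    """P(W_k | V_j) = (n_k + 1) / (n + |V|); unseen words have n_k = 0."""
    cls = model["classes"][label]
    return (cls["counts"][word] + 1) / (cls["n"] + model["total_files"])


# Invented sample data for illustration only.
samples = [
    (["holiday", "photos"], "good"),
    (["family", "photos"], "good"),
    (["spam", "offer"], "bad"),
]
model = train(samples)
print(model["classes"]["good"]["prior"])
print(keyword_probability(model, "photos", "good"))
```

Keeping the raw counts in the model, rather than only the finished probabilities, means a keyword first seen at judgment time still gets a nonzero estimate from the same formula.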

The self-learning training module is the key to the self-learning semantic detection method. Only by analyzing the manual review results for the sample filenames can the probability item of each keyword be obtained, and only on that basis can more filenames be judged.

3. Feedback module

The feedback module guards against keywords that have not been segmented correctly and completely, and so improves the success rate of the system's judgments. It first accesses the database and reads the records in table filename whose keywords_checked field is zero. The user then checks the segmentation and enters any missing words in a text box. The entered text is split on spaces and line breaks, the user dictionary file is traversed to see whether each word is already present, and words the dictionary lacks are added to the user dictionary library.
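The dictionary update described above amounts to splitting the reviewer's input on whitespace and adding whatever is missing. A minimal sketch, assuming the user dictionary is held in memory as a Python set; the function name `add_user_words` and the sample words are hypothetical, and the source stores the dictionary in a file instead.

```python
def add_user_words(user_input, user_dictionary):
    """Split reviewer input on whitespace and add words not yet in the dictionary.

    Returns the words that were actually added, in input order.
    """
    added = []
    for word in user_input.split():  # split() covers spaces and line breaks
        if word not in user_dictionary:
            user_dictionary.add(word)
            added.append(word)
    return added


user_dictionary = {"movie"}  # hypothetical existing user dictionary
new_words = add_user_words("movie trailer\nsoundtrack", user_dictionary)
print(new_words)
```

Returning the list of added words lets a caller decide whether retraining on the sample filenames is needed, since nothing changes when the list is empty.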

The feedback module runs only during training on the sample files. Its purpose is more complete segmentation, so that filenames are judged better and more accurately and the system's success rate improves.

Step 3: Use the keyword probability items obtained by the self-learning training module to judge more filenames.

The judgment module judges filenames on the basis of the information produced by the modules above, reaches a conclusion for each filename, and stores the output in the database.

The judgment module proceeds as follows:

1) Vp(legal): the product of the probabilities with which all keywords occur in legal filenames

2) Vp(illegal): the product of the probabilities with which all keywords occur in illegal filenames

3) resultP(legal): the joint probability for legal, calculated as:

Vp(legal) * P(prior probability of legal filenames)

4) resultP(illegal): the joint probability for illegal, calculated as:

Vp(illegal) * P(prior probability of illegal filenames)

5) Compare resultP(legal) with resultP(illegal) to obtain the result.

If resultP(legal) is greater than resultP(illegal), the system considers the file legal; otherwise the system considers the file illegal. The relevant fields in the database are updated accordingly.
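The five steps above can be sketched as follows. The model layout, the names `joint_probability` and `judge`, and the numbers are hypothetical; each keyword probability is computed as (n_k + 1) / (n + |V|), as defined for the training stage, so a keyword that was never seen in training contributes a small nonzero factor.

```python
import math


def joint_probability(keywords, prior, counts, n, total_files):
    """resultP = Vp * prior, where Vp is the product of (n_k + 1) / (n + |V|)."""
    vp = math.prod((counts.get(word, 0) + 1) / (n + total_files) for word in keywords)
    return vp * prior


def judge(keywords, model):
    """Return the verdict and both joint probabilities for a segmented filename."""
    result = {
        label: joint_probability(
            keywords, cls["prior"], cls["counts"], cls["n"], model["total_files"]
        )
        for label, cls in model["classes"].items()
    }
    # Legal only when strictly greater; every other case is illegal, per the text.
    verdict = "legal" if result["legal"] > result["illegal"] else "illegal"
    return verdict, result


# Hypothetical trained model with invented counts.
model = {
    "total_files": 3,
    "classes": {
        "legal": {"prior": 2 / 3, "counts": {"holiday": 1, "photos": 2, "family": 1}, "n": 3},
        "illegal": {"prior": 1 / 3, "counts": {"spam": 1, "offer": 1}, "n": 2},
    },
}
verdict, scores = judge(["family", "photos"], model)
print(verdict, scores)
```

For filenames with many keywords the raw product can underflow to zero in floating point. Comparing sums of logarithms gives the same ordering and avoids that, although the source itself describes the plain product.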

The judgment module is the core of the self-learning detection method. It uses the segmented words and their probability items to judge each filename, so that the system fully achieves its intended purpose.

Those skilled in the art can also design their own algorithms and methods from the flow chart of this system to achieve the best effect in a specific environment, and so carry out self-learning semantic detection comprehensively.

Fig. 2 is a schematic block diagram of the discrimination module of the invention.

The module first imports the dictionary library for segmentation and accesses the database, stores the segmentation results in the database, and requests the probability items of the segmented words. It then compares two products: the keyword probability product for good-semantics string names multiplied by the prior probability of good-semantics string names, and the keyword probability product for bad-semantics string names multiplied by the prior probability of bad-semantics string names. If the product for good-semantics strings is greater than the product for bad-semantics strings, the string has good semantics; otherwise it has bad semantics. The judgment result is stored in the storage medium.

In short, the neural-network-based self-learning semantic detection method provided by the invention belongs to the field of network information processing and analysis, in particular to the automatic judgment of the nature and tendency of textual content. In this neural-network-based self-organized learning semantic detection method, the system reinforces rewarded actions according to the evaluation provided by the external environment in order to improve its own performance, and it produces a series of action strategies through learning. This is done by establishing a theoretical model of neurons and the neural network and, on the basis of that theoretical study, constructing a concrete neural network model for computer simulation, including the study of self-organized network learning. The direct purpose is to detect strings of network information through self-organized learning, training and updating, and to judge part of the semantics of that information, thereby identifying the nature of the information automatically.

Although specific embodiments of the invention and the accompanying drawings are disclosed for the purpose of illustration, to help in understanding and implementing the invention, those skilled in the art will appreciate that various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should therefore not be limited to the content disclosed in the preferred embodiments and the drawings, and the scope of protection sought is that defined by the claims.

Claims

1. A neural-network-based self-learning semantic detection method, the method comprising:

2. The neural-network-based self-learning semantic detection method according to claim 1, characterized in that the probability items are: the proportion P(V_j) of each of the two classes, good and bad, and the probability P(W_k \mid V_j) that a word drawn at random from a filename of class V_j is W_k;

P(W_k \mid V_j) is computed as:

P(W_k \mid V_j) = \frac{n_k + 1}{n + |V|}

3. The neural-network-based self-learning semantic detection method according to claim 2, characterized in that:

in step 102), the product of the probabilities with which all keywords occur in good-semantics string names is \prod_k P(W_k \mid V_j), where V_j is the good class and each W_k in the product is a keyword in the filename;

the prior probability of good-semantics string names is P = P(V_j);

the product of the probabilities with which all keywords occur in bad-semantics string names is \prod_k P(W_k \mid V_j), where V_j is the bad class;

the prior probability of bad-semantics string names is P = P(V_j).

4. The neural-network-based self-learning semantic detection method according to claim 1, characterized in that between step 101) and step 102) the method further comprises:

using a feedback strategy to ensure that all keywords in the filename are segmented completely.

5. A neural-network-based self-learning semantic detection system, the system comprising:

6. The neural-network-based self-learning semantic detection system according to claim 5, characterized in that the probability items comprise the class proportion P(V_j) and the probability P(W_k \mid V_j) that a word drawn at random from a filename of class V_j is W_k;

P(W_k \mid V_j) is computed as:

P(W_k \mid V_j) = \frac{n_k + 1}{n + |V|}

7. The neural-network-based self-learning semantic detection system according to claim 6, characterized in that the processing module further comprises:

a first processing submodule, which operates according to

P(V_j) \cdot \prod_k P(W_k \mid V_j);

and

a second processing submodule, which operates according to

P(V_j) \cdot \prod_k P(W_k \mid V_j).

8. The neural-network-based self-learning semantic detection system according to claim 5, characterized in that the system also comprises a feedback module between the probability item acquisition module and the processing module; the feedback module checks whether keyword segmentation is complete and restarts keyword segmentation where it is not.