CN103853701A - Neural-network-based self-learning semantic detection method and system - Google Patents

Neural-network-based self-learning semantic detection method and system

Info

Publication number
CN103853701A
Authority
CN
China
Prior art keywords
character string
probability
bad
semantic
good
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210505765.1A
Other languages
Chinese (zh)
Inventor
苏青
苗光胜
牛温佳
唐晖
慈松
谭红艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wasu Media & Network Co ltd
Institute of Acoustics CAS
Original Assignee
Wasu Media & Network Co ltd
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wasu Media & Network Co ltd, Institute of Acoustics CAS filed Critical Wasu Media & Network Co ltd
Priority to CN201210505765.1A priority Critical patent/CN103853701A/en
Publication of CN103853701A publication Critical patent/CN103853701A/en
Pending legal-status Critical Current

Abstract

The invention discloses a neural-network-based self-learning semantic detection method and system. The method comprises the following steps. Step 101: a dictionary is imported and used to segment each filename to be recognized into keywords, and a probability term is calculated for each keyword with a Bayesian algorithm; the probability terms are derived from earlier judgments of filenames as good or bad. Step 102: the product of the probabilities of all keywords occurring in good-semantics strings is multiplied by the prior probability of good-semantics strings to give a first product, and the product of the probabilities of all keywords occurring in bad-semantics strings is multiplied by the prior probability of bad-semantics strings to give a second product. Step 103: the two products are compared; if the first product is larger than the second, the string is judged to have good semantics, and otherwise bad semantics. The judgment is stored in a storage medium.

Description

Neural-network-based self-learning semantic detection method and system
Technical field
The invention belongs to the field of network information processing and analysis, in particular the automatic judgment of the nature and tendency of textual content, and specifically relates to a neural-network-based self-learning semantic detection method and system.
Background
Automatic processing and analysis of network information is an important part of analyzing, detecting and managing network content, and is significant for building content-handling and security systems.
As network technology develops and operators provide ever more bandwidth, users can easily access and download all kinds of information on the network. Higher bandwidth gives information transmission a broader stage, but it also makes the spread of harmful information easier. In recent years harmful information such as obscene, pornographic and reactionary content has spread widely on the network. Traditional network information processing schemes need a great deal of manpower and material resources to identify such information and, constrained by objective conditions, fall far short of the real need to find and handle it.
The Internet is like a huge river system formed by many converging rivers, with all kinds of content flowing through it at high speed, and network users access it as if drawing water from the river. The flow of this river system is enormous and very fast, and hundreds of millions of users are connected to it. Traditional network information processing and analysis schemes cannot analyze the nature of network information automatically and intelligently, so large numbers of staff must analyze and judge it manually. The prior art simply defines each word as good or bad and judges a filename bad if it contains a bad word, rather than computing Bayesian probabilities. Defining words this way is laborious and hard to keep up to date, whereas the present system can update itself through self-learning at any time, so that newly emerging words do not cause missed or wrong judgments. The present system also adds a feedback step that guards against incomplete or incorrect word segmentation and raises the success rate. In terms of composition, existing judgment systems basically have only a word segmentation module and a discrimination module: they segment the filename, check whether it contains a bad keyword, and judge the filename on that basis, so their success rate is often low.
In the current situation, with the huge volume of content on the Internet, manual methods cannot provide real-time analysis. A network information processing and identification scheme with intelligent analysis capability is urgently needed to detect and judge specific attributes of network information automatically.
Summary of the invention
To overcome the above problems, the invention provides a neural-network-based self-learning semantic detection method and system.
To achieve this object, the invention provides a neural-network-based self-learning semantic detection method, comprising:
Step 101) importing a dictionary to segment the filename to be identified into words, obtaining the keywords in the filename, and calculating a probability term for each keyword based on a Bayesian algorithm, the probability terms being obtained by analyzing judgments of filenames as good or bad;
Step 102) obtaining the product of the probabilities with which all the keywords occur in good-semantics string names and the prior probability of good-semantics string names, and multiplying these two quantities to obtain a first product; and
obtaining the product of the probabilities with which all the keywords occur in bad-semantics string names and the prior probability of bad-semantics string names, and multiplying these two quantities to obtain a second product;
Step 103) comparing the first product with the second product: if the first product is greater than the second, the string has good semantics, otherwise it has bad semantics; and storing the judgment in a storage medium.
The probability terms are the proportion $P(V_j)$ of each of the two classes, good and bad, and the probability $P(W_k \mid V_j)$ that a word drawn at random from a filename of class $V_j$ is $W_k$.
Here $P(V_j)$ is the number of filenames in the subset of $V$ whose target value is $V_j$ divided by the total number of filenames in $V$, where $V$ is the set of filenames.
$P(W_k \mid V_j)$ is calculated as:
$$P(W_k \mid V_j) = \frac{n_k + 1}{n + |V|}$$
where $n$ is the total number of distinct keywords in $Text_j$; $Text_j$ is the single document formed by concatenating all members of $docs_j$; $docs_j$ is the subset of filenames in $V$ whose target value is $V_j$, with $V_j$ being good or bad; $n_k$ is the number of times the word $W_k$ appears in $Text_j$; and $|V|$ is the number of filenames in $V$.
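As a worked illustration of the formula above, with purely hypothetical numbers that do not come from the patent: suppose $V$ holds $|V| = 13$ filenames, the concatenated document $Text_j$ for the good class contains $n = 7$ distinct keywords, and the keyword $W_k$ appears $n_k = 3$ times in it. Then

$$P(W_k \mid V_j) = \frac{n_k + 1}{n + |V|} = \frac{3 + 1}{7 + 13} = \frac{4}{20} = 0.2$$

A keyword that never appears in $Text_j$ has $n_k = 0$ and still receives a non-zero probability, $\frac{0 + 1}{7 + 13} = 0.05$, so a single unseen word does not force the whole product to zero.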
In step 102), the product of the probabilities with which all the keywords occur in good-semantics string names is $\prod_k P(W_k \mid V_j)$ with $V_j$ = good, where each $W_k$ in the product is a keyword of the filename, and the prior probability of good-semantics string names is $P = P(V_j)$. The product of the probabilities with which all the keywords occur in bad-semantics string names is $\prod_k P(W_k \mid V_j)$ with $V_j$ = bad, and the prior probability of bad-semantics string names is $P = P(V_j)$.
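Taken together, steps 102) and 103) amount to a single comparison. In the sketch below, the subscripts good and bad on $V$ are shorthand introduced for this note and are not the patent's own notation:

$$\text{good semantics} \iff P(V_{good}) \prod_k P(W_k \mid V_{good}) > P(V_{bad}) \prod_k P(W_k \mid V_{bad})$$

Every factor is strictly positive because $n_k + 1 > 0$, and the logarithm is strictly increasing, so the same verdict is obtained by comparing sums of logarithms:

$$\log P(V_{good}) + \sum_k \log P(W_k \mid V_{good}) > \log P(V_{bad}) + \sum_k \log P(W_k \mid V_{bad})$$

The patent describes only the product form; the logarithmic form is mentioned here as an equivalent that an implementation might prefer when filenames contain many keywords and the products become very small.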
Preferably, between step 101) and step 102) the method further comprises using a feedback strategy to ensure that all keywords in the filename have been completely segmented. The good-or-bad judgments on which the probability terms are based can be obtained by manual review.
Based on the above method, the invention also provides a neural-network-based self-learning semantic detection system, comprising:
a probability term acquisition module, for importing a dictionary to segment the filename to be identified, obtaining the keywords in the filename, and calculating a probability term for each keyword based on a Bayesian algorithm, the probability terms being obtained by analyzing good-or-bad judgments;
a processing module, for obtaining the product of the probabilities with which all the keywords occur in good-semantics string names and the prior probability of good-semantics string names and multiplying the two, and for obtaining the product of the probabilities with which all the keywords occur in bad-semantics string names and the prior probability of bad-semantics string names and multiplying the two; and
a comparison and judgment module, which acts on the output of the processing module as follows:
if the result for good-semantics string names (the probability product multiplied by the prior probability) is greater than the result for bad-semantics string names (the probability product multiplied by the prior probability), the string has good semantics, otherwise it has bad semantics; the judgment is stored in a storage medium.
The probability terms comprise the class proportion $P(V_j)$ and the probability $P(W_k \mid V_j)$ that a word drawn at random from a filename of class $V_j$ is $W_k$.
Here $P(V_j)$ is the number of filenames in the subset of $V$ whose target value is $V_j$ divided by the total number of filenames in $V$, where $V$ is the set of filenames.
$P(W_k \mid V_j)$ is calculated as:
$$P(W_k \mid V_j) = \frac{n_k + 1}{n + |V|}$$
where $n$ is the total number of distinct keywords in $Text_j$; $Text_j$ is the single document formed by concatenating all members of $docs_j$; $docs_j$ is the subset of filenames in $V$ whose target value is $V_j$, with $V_j$ being good or bad; and $n_k$ is the number of times the word $W_k$ appears in $Text_j$.
The processing module further comprises:
a first processing submodule, for obtaining, from $\prod_k P(W_k \mid V_j)$ and $P = P(V_j)$ with $V_j$ = good, the product of the probabilities with which all the keywords occur in good-semantics string names multiplied by the prior probability of good-semantics string names, $P(V_j) \cdot \prod_k P(W_k \mid V_j)$; and
a second processing submodule, for obtaining, from $\prod_k P(W_k \mid V_j)$ and $P = P(V_j)$ with $V_j$ = bad, the product of the probabilities with which all the keywords occur in bad-semantics string names multiplied by the prior probability of bad-semantics string names, $P(V_j) \cdot \prod_k P(W_k \mid V_j)$.
Preferably, the system further comprises a feedback module between the probability term acquisition module and the processing module; the feedback module checks whether keyword segmentation is complete and restarts segmentation for filenames that were not completely segmented.
In short, the innovation of the invention is to build a model of a neural network and its neurons and to use the ability of a neural network to self-organize and self-learn an action policy: accurate probability term products ($P(V_j) \cdot \prod_k P(W_k \mid V_j)$) are obtained by computation on sample network information, and these products are then used to identify part of the semantics of network information over a larger and broader range. The advantage is that the sample data can be updated at any time, so newly emerging network information receives a timely and correct semantic judgment, and the word segmentation of the sample data can be improved at the same time; with slight changes to the sample environment, more kinds of information can be judged. The invention therefore has a wide scope of application. Because the self-learning detection method is continually updated through ongoing manual review, it can detect the true semantics of network information even as its attributes keep changing. It addresses the problem that semantics cannot be detected correctly in a network where semantics change, and provides a technical platform for strengthening network information processing and identification and for automatically detecting and judging specific attributes of network information.
The composition of the present system is relatively complex. It includes a self-learning training module: a sample set of filenames is judged manually, training is performed on the results, and a probability term is derived for each keyword; these terms are then used to judge a larger set of filenames automatically and output the results, generally with a higher success rate. There is also a feedback module that corrects incomplete and incorrect word segmentation to improve accuracy. Because the self-learning detection process can run automatically, when the network structure or deployment changes, updating the configuration file is enough to obtain up-to-date detection results promptly.
Brief description of the drawings
Fig. 1 is a schematic block diagram of the self-learning judgment method of the invention;
Fig. 2 is a schematic block diagram of the discrimination module provided by the invention.
Detailed description
Embodiments of the invention are described below, taking the judgment of filename strings as an example.
Technical content of the invention: a neural-network-based self-learning semantic detection method, with the following steps:
1. Strings carrying network information are stored in a storage medium; the human-computer interaction module accesses the storage medium for manual review, and the results of the manual review are written back to the storage medium;
2. The manual review results are used to train on the sample strings: the sample filenames are segmented into words, and each word and its probability are stored in the storage medium, so as to build a neural network theoretical model containing neurons that identifies part of the semantics of strings carrying network information;
3. The dictionary and the user dictionary are imported, the sample filenames to be judged are segmented, the filenames are judged from the dictionary and the segmentation results, and the results are stored in the database;
4. Through the feedback module, a person checks one by one whether each segmentation result is complete and correct; if not, keywords are added to the user dictionary and the sample filenames are retrained;
5. The segmentation results and probabilities in the sample database are imported into the storage medium holding the strings to be judged, and the sample results are used to judge part of the semantics of strings carrying network information.
The information collected by the invention can be organized and stored in a database.
The network information can include: string name, IP address, hash value and station address.
The self-learning information can include: manual review result, manual review flag, keyword library, keyword probability terms and prior probability terms.
The discrimination information can include: machine review result, machine review flag and the analysis result for part of the string's semantics.
The human-computer interaction module works as follows: it first accesses the storage medium holding the sample strings and requests the list of strings that have not yet been manually reviewed; the human-computer interface displays the returned list together with the manual review options, so that the strings can be reviewed one after another; the judgments are stored in the storage medium for the subsequent self-learning training process. These review results are the evaluation that the external environment gives the system in neural network self-learning, and the system's judgment performance can be reinforced through them.
The self-learning training module works as follows: it first analyzes the review results produced by the human-computer interaction module and at the same time imports the dictionary, analyzes each word with the naive Bayes algorithm for learning and classifying text, calculates the probability term of each word, and stores the analysis results in the storage medium. This is the core of the neural network's self-organizing self-learning: in supervised learning a neural network can quickly compute accurate weights from concrete data samples. In the invention the data samples are the sample strings mentioned above; these strings carry network information, and through self-organizing learning part of the semantics of a larger body of network information can be identified.
The machine judgment module works as follows: it first imports the dictionary for word segmentation and accesses the database, stores the segmentation results in the database, and requests the probability terms of the words. It then compares two quantities: the product of the probabilities with which all the keywords occur in good-semantics string names multiplied by the prior probability of good-semantics string names, and the product of the probabilities with which all the keywords occur in bad-semantics string names multiplied by the prior probability of bad-semantics string names. If the good-semantics product is greater than the bad-semantics product, the string has good semantics; otherwise it has bad semantics. The judgment is stored in the storage medium.
The feedback module works as follows: it first accesses the storage medium holding the segmentation results, requests the strings and their segmentation results, and displays them on the human-computer interface so that a person can check whether each segmentation is complete and correct; if it is incomplete, the correct new words are added to the dictionary. This continues until the segmentation of all filenames has been reviewed, which improves the success rate of the judgments.
The invention collects data such as strings carrying network information and their hash values and stores them in a medium, giving the network information as a whole a structure. Then, with the string hash value as the identity ID, the string as input, and the human-computer interface linking the human and the machine as the communication medium, the probability term of each word is calculated on the basis of the naive Bayes algorithm for learning and classifying text, making the system a genuinely self-learning method. Finally, the word probability terms are used to judge filenames as good or bad and to reach conclusions such as the attribute of the filename, giving the system a genuine emotion recognition capability.
The invention provides a neural-network-based self-organizing-learning semantic detection method. In neural network self-learning, the system reinforces the actions that are rewarded, according to the evaluation given by the external environment, in order to improve its own performance, and produces a series of action policies through learning. This is done by building theoretical models of neurons and neural networks and, on the basis of the theoretical model research, constructing a concrete neural network model for computer simulation, including the study of self-organizing network learning. The direct purpose is to detect strings carrying network information through training and updating by self-organizing learning, and to judge part of the semantics of that information, thereby identifying the nature of the information automatically.
Embodiment
The model collects data such as filenames and filename hash values and stores them in database tables, giving the network information as a whole a structure. Then, with the filename hash value as the identity ID, the filename as input, and the link between human and machine as the communication medium, the probability term of each word is calculated on the basis of the naive Bayes algorithm for learning and classifying text, making the system a genuinely self-learning method. Finally, the word probability terms are used to judge filenames as good or bad and to reach conclusions such as the attribute of the filename, giving the system a genuine emotion recognition capability. The detailed process is given below with reference to Fig. 1.
Step 1: collect the network information and store it in a database in organized form, so that the network information is structured. The information is divided into two parts: system input information and system output information.
1. The network information input to the system, that is, data such as the filenames the system is to process, which is used for self-learning detection in the later stages.
filename (the filename to be judged)
infohash (the hash value of the filename; it serves as the filename ID and uniquely identifies the filename)
All of this information can be obtained with known network packet capture, making full use of existing results whose soundness and completeness have already been tested.
2. The network information output by the system, that is, the result the system outputs after judging the input information through a series of processes such as self-learning and review.
man_checked (whether the filename has been manually reviewed)
man_result (the result of the manual review)
machine_checked (whether the filename has been reviewed by machine)
machine_result (the result of the machine review)
keywords_checked (whether the keywords have been reviewed for completeness)
To store the above information, the invention sets up a database for each kind of information, because MySQL storage is powerful and flexible, easy to update and extend, and easy to convert into various interface formats.
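The fields listed above could be held in a single table, as in the minimal sketch below. It is an illustration under stated assumptions, not the patent's schema: the patent names MySQL, while this sketch uses Python's built-in `sqlite3` as a stand-in so that it runs without a server, and the column types, defaults and the helper name `create_filename_table` are hypothetical. Only the field names and the table name `filename` come from the text.

```python
import sqlite3


def create_filename_table(conn):
    """Create a table holding the input and output fields named in the text.

    Column types and defaults are assumptions made for this sketch.
    """
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS filename (
            infohash         TEXT PRIMARY KEY,            -- hash of the filename, used as its ID
            filename         TEXT NOT NULL,               -- the filename to be judged
            man_checked      INTEGER NOT NULL DEFAULT 0,  -- 1 once manually reviewed
            man_result       TEXT,                        -- result of the manual review
            machine_checked  INTEGER NOT NULL DEFAULT 0,  -- 1 once reviewed by machine
            machine_result   TEXT,                        -- result of the machine review
            keywords_checked INTEGER NOT NULL DEFAULT 0   -- 1 once segmentation was reviewed
        )
        """
    )


# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
create_filename_table(conn)
conn.execute(
    "INSERT INTO filename (infohash, filename) VALUES (?, ?)",
    ("abc123", "holiday photos.zip"),
)
row = conn.execute(
    "SELECT man_checked, machine_checked, keywords_checked FROM filename"
).fetchone()
```

A newly inserted filename starts with all three review flags at 0, which is what the later modules rely on when they look for unreviewed records.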
Step 2: using the link between the manual review module and the stored data, manually review the sample filename database. Segment the filenames with the user dictionary, analyze the probability term of each word according to the Bayesian algorithm, and store the results in the database. Use the feedback module to check whether the keywords are complete; if a filename has not been completely segmented, add suitable keywords to the user dictionary.
1. Manual review:
The main client for manual review is a web page written in PHP, which provides a friendly interface so that a person can judge the sample filenames more accurately and completely. When the page is opened, it first accesses the filename database and reads the records whose man_checked field is 0, then creates judgment links for each filename, labeled good and bad. When the user clicks a link, the man_checked and man_result fields of that filename are changed according to the value clicked, and the page is then refreshed so that the review page shows the information in the database in real time.
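The two database operations behind the review page can be sketched as follows. The patent's client is a PHP page backed by MySQL; this sketch uses Python with `sqlite3` purely for illustration, and the function names `unaudited_filenames` and `record_manual_verdict`, the reduced table layout and the sample rows are all hypothetical.

```python
import sqlite3


def unaudited_filenames(conn):
    """Return (infohash, filename) for records whose man_checked flag is still 0."""
    return conn.execute(
        "SELECT infohash, filename FROM filename "
        "WHERE man_checked = 0 ORDER BY infohash"
    ).fetchall()


def record_manual_verdict(conn, infohash, verdict):
    """Store the reviewer's click: mark the record as reviewed and save the verdict."""
    if verdict not in ("good", "bad"):
        raise ValueError("verdict must be 'good' or 'bad'")
    conn.execute(
        "UPDATE filename SET man_checked = 1, man_result = ? WHERE infohash = ?",
        (verdict, infohash),
    )
    conn.commit()


# Reduced table with only the columns this step touches (assumed layout).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE filename (infohash TEXT PRIMARY KEY, filename TEXT, "
    "man_checked INTEGER DEFAULT 0, man_result TEXT)"
)
conn.executemany(
    "INSERT INTO filename (infohash, filename) VALUES (?, ?)",
    [("h1", "lecture notes.pdf"), ("h2", "spam offer.exe")],
)

record_manual_verdict(conn, "h2", "bad")   # reviewer clicked the "bad" link
remaining = unaudited_filenames(conn)      # what a refreshed page would list
```

After the verdict is recorded, only the still-unreviewed filename is returned, which mirrors the page refresh described above.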
The manual review module is the foundation of the self-learning semantic detection method. Only through manual review can it be determined accurately whether a filename is good or bad, and only with the manual review results as a basis can the word probability terms be analyzed accurately, which in turn raises the success rate of machine review in the later stage.
2. Self-learning training
The self-learning training module takes the results produced by the manual review module, segments the filenames with the user dictionary, analyzes the probability terms of the filename keywords based on the Bayesian algorithm, and stores the results of the analysis in the database in organized form. These results are the basis for the machine judgment module in the later stage.
Load the user dictionary and perform word segmentation, store the words in the database, and calculate the probability term of each keyword based on the Bayesian algorithm. Let $V$ be the set of filenames. First calculate the probability term $P(W_k \mid V_j)$ and update the corresponding field in the database table. It is the probability that a word drawn at random from a filename of class $V_j$ is $W_k$ ($V_j$ being legal file or illegal file). Also calculate the prior probabilities of the two classes, that is, the prior probabilities of legal files and illegal files.
The required probability terms $P(V_j)$ and $P(W_k \mid V_j)$ are calculated as follows:
1) $docs_j$: the subset of filenames in $V$ whose target value is $V_j$
2) $P(V_j)$: the number of filenames in $docs_j$ divided by the total number of filenames in $V$
3) $Text_j$: the single document formed by concatenating all members of $docs_j$
4) $n$: the total number of distinct keywords in $Text_j$
5) $n_k$: the number of times the word $W_k$ appears in $Text_j$
6) $P(W_k \mid V_j) = \dfrac{n_k + 1}{n + |V|}$
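The six quantities above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes the filenames have already been segmented into keyword lists and manually labeled, the function name `train` and the sample data are hypothetical, and the denominator follows the definitions as stated in the text ($n$ as the number of distinct keywords in $Text_j$, $|V|$ as the number of filenames), which differs from the more common textbook estimate that uses total word occurrences plus vocabulary size.

```python
from collections import Counter


def train(samples, labels=("good", "bad")):
    """Estimate P(V_j) and P(W_k | V_j) from manually labeled samples.

    samples: list of (keywords, label) pairs, where keywords is the list of
    words obtained by segmenting one filename.
    Returns (priors, likelihood), where likelihood(word, label) applies
    (n_k + 1) / (n + |V|) with the definitions given in the text.
    """
    total = len(samples)                     # |V|: number of filenames in V
    priors, counts, distinct = {}, {}, {}
    for label in labels:
        docs_j = [kw for kw, lab in samples if lab == label]   # docs_j
        priors[label] = len(docs_j) / total                    # P(V_j)
        text_j = [w for kw in docs_j for w in kw]              # Text_j
        counts[label] = Counter(text_j)                        # n_k for every word
        distinct[label] = len(counts[label])                   # n

    def likelihood(word, label):
        # Counter returns 0 for unseen words, so n_k + 1 is never zero.
        return (counts[label][word] + 1) / (distinct[label] + total)

    return priors, likelihood


# Hypothetical, already segmented and manually reviewed sample filenames.
samples = [
    (["free", "movie"], "good"),
    (["movie", "trailer"], "good"),
    (["adult", "movie"], "bad"),
]
priors, likelihood = train(samples)
p_movie_good = likelihood("movie", "good")   # (2 + 1) / (3 + 3)
p_free_bad = likelihood("free", "bad")       # (0 + 1) / (2 + 3)
```

Returning `likelihood` as a function rather than a table means unseen keywords are handled by the same formula, with $n_k = 0$.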
The self-learning training module is the key to the self-learning semantic detection method: the probability term of each keyword is obtained by analyzing the manual review results for the sample filenames, and only on this basis can more filenames be judged.
3. Feedback module
The feedback module prevents keywords from being segmented incorrectly or incompletely and so improves the success rate of the system's judgments. The feedback module first accesses the database and reads the records in the table filename whose keywords_checked field is 0. The user then checks the segmentation and enters additional words in a text box; the words entered are split on spaces and line breaks, the user dictionary file is traversed to see whether each word is already present, and any word missing from the dictionary is added to the user dictionary.
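The dictionary update performed by the feedback module can be sketched as below. The function name `add_user_words` is hypothetical, and the user dictionary is modeled as an in-memory set rather than the dictionary file the text describes.

```python
def add_user_words(user_dictionary, raw_input):
    """Add reviewer-supplied words to the user dictionary.

    raw_input is the text-box content; it is split on whitespace, which
    covers both spaces and line breaks. Words already in the dictionary
    are skipped. Returns the list of words that were actually added.
    """
    added = []
    for word in raw_input.split():
        if word not in user_dictionary:
            user_dictionary.add(word)
            added.append(word)
    return added


user_dictionary = {"movie", "trailer"}
added = add_user_words(user_dictionary, "trailer remix\nremix bootleg")
```

Here "trailer" is skipped because it is already known, and the second "remix" is skipped because the first occurrence has just been added.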
The feedback module is run only during training on the sample files. Its purpose is more complete word segmentation, so that filenames are judged better and more accurately and the success rate of the system improves.
Step 3: use the keyword probability terms obtained by the self-learning training module to judge more filenames.
The judgment module judges filenames on the basis of the information produced by the modules above, reaches a verdict for each filename, and stores the output in the database.
The judgment module proceeds as follows:
1) Vp(legal): the product of the probabilities with which all the keywords occur in legal filenames
2) Vp(illegal): the product of the probabilities with which all the keywords occur in illegal filenames
3) resultP(legal): the joint probability for legal, calculated as:
Vp(legal) * P(prior probability of legal filenames)
4) resultP(illegal): the joint probability for illegal, calculated as:
Vp(illegal) * P(prior probability of illegal filenames)
5) Compare resultP(legal) with resultP(illegal) to obtain the result.
If resultP(legal) is greater than resultP(illegal), the system considers the file legal; otherwise the system considers it illegal, and the relevant fields in the database are updated accordingly.
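The five steps of the judgment module can be sketched as a single function. This is a hedged illustration: the name `judge`, the dictionary-based inputs, the `unseen_prob` fallback for keywords with no stored probability, and all numeric values are hypothetical and stand in for the probability terms read from the database.

```python
import math


def judge(keywords, priors, word_probs, unseen_prob):
    """Return (verdict, result) for one segmented filename.

    priors[label]            -> prior probability of that class
    word_probs[label][word]  -> stored probability term of the word in that class
    unseen_prob[label]       -> fallback for words with no stored term (assumption)
    """
    result = {}
    for label in ("legal", "illegal"):
        # Vp(label): product of the keyword probabilities for this class.
        vp = math.prod(
            word_probs[label].get(word, unseen_prob[label]) for word in keywords
        )
        # resultP(label) = Vp(label) * prior probability of the class.
        result[label] = vp * priors[label]
    # Strictly greater means legal; a tie falls to illegal, as in the text.
    verdict = "legal" if result["legal"] > result["illegal"] else "illegal"
    return verdict, result


priors = {"legal": 0.5, "illegal": 0.5}
word_probs = {
    "legal": {"lecture": 0.4, "video": 0.2},
    "illegal": {"lecture": 0.1, "video": 0.2},
}
unseen_prob = {"legal": 0.1, "illegal": 0.1}

verdict, result = judge(["lecture", "video"], priors, word_probs, unseen_prob)
```

With many keywords the products can become very small; comparing sums of logarithms gives the same verdict because every factor is positive, though the patent itself describes only the product form.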
Judging module is the core of self study detection method, is to utilize the probability item of participle and participle to adjudicate each filename, and what make that system can be complete accomplishes the end in view.
For professional person, can also and utilize method according to the algorithm of the Flow Chart Design of this system oneself, in specific environment, reach best effect, thereby carry out the semantic detection of self study comprehensively.
As shown in Figure 2, the schematic block diagram that this figure is discrimination module of the present invention.
First import dictionary library participle, accessing database simultaneously, word segmentation result is deposited in database, and the probability item of participle is returned in request, the product of the prior probability of the long-pending and bad semantic character string name of the probability occurring in bad semantic character string name that all keywords of sum of products of the prior probability of the long-pending and good semantic character string name of the probability occurring in good semantic character string name that more all keywords are corresponding are corresponding, if the product term of good semantic character string is greater than the product term of bad semantic character string, this character string is good semanteme, otherwise be bad semanteme, court verdict is deposited in storage medium.
In a word, the semantic detection method of a kind of self study based on neural network provided by the invention, belongs to network information processing and analysis field, refers more particularly to Word message content character and tendentious automatic judgement field.The semantic detection method of this self-organized learning based on neural network, Neural Network Self-learning is that the evaluation that system provided according to external environment condition is strengthened action which is rewarded to improve self performance, produce a series of actions strategy by study, its mode is set up neuron and neural network theory model exactly, on the basis of theoretical model investigation, construct concrete neural network model, to realize computer simulation, comprise the research of self-organization of network study.Its direct object is exactly to train more to newly arrive by self-organized learning to detect the character string of some network informations, judges the some parts semanteme of this information, reaches the effect of automatic identification information character with this.
Although specific embodiments and drawings of the invention are disclosed for purposes of illustration, they are intended to help the reader understand the content of the invention and implement it accordingly. Those skilled in the art will appreciate that various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should therefore not be limited to the content disclosed in the preferred embodiments and drawings; the scope of protection is defined by the claims.

Claims (8)

1. A neural-network-based self-learning semantic detection method, the method comprising:
Step 101) importing a dictionary library to segment a file name to be recognized, obtaining the keywords in the file name, and calculating a probability term for each keyword based on a Bayesian algorithm, wherein the probability terms are obtained by analyzing the results of judging file names as good or bad;
Step 102) obtaining the product of the probabilities of all keywords occurring in good semantic string names and the prior probability of good semantic string names, and multiplying these two values to obtain a first product; and
obtaining the product of the probabilities of all keywords occurring in bad semantic string names and the prior probability of bad semantic string names, and multiplying these two values to obtain a second product;
Step 103) comparing the first product with the second product: if the first product is greater than the second product, the string has good semantics, otherwise it has bad semantics; and storing the judgment result in a storage medium.
2. The neural-network-based self-learning semantic detection method according to claim 1, characterized in that the probability terms are: the proportion P(Vj) of each of the two classes, good and bad, and the probability P(Wk|Vj) that a word drawn at random from a file name of class Vj is Wk;
wherein P(Vj) is calculated as the number of file names in the subset of V whose target value is Vj, divided by the total number of file names in V, where V is the set of file names;
and P(Wk|Vj) is calculated as:
$$P(W_k \mid V_j) = \frac{n_k + 1}{n + |V|}$$
wherein n is the total number of the different keywords in Textj; Textj is the single document formed by concatenating all members of docsj; docsj is the subset of file names in V whose target value is Vj, where Vj is good or bad; nk is the number of times the word Wk appears in Textj; and |V| is the number of file names in V.
3. The neural-network-based self-learning semantic detection method according to claim 2, characterized in that,
in step 102), the product of the probabilities of all keywords occurring in good semantic string names is $\prod_k P(W_k \mid V_j)$ with Vj taken as the good class, where Wk in this product formula is each keyword in the file name;
the prior probability of good semantic string names is P = P(Vj);
the product of the probabilities of all keywords occurring in bad semantic string names is $\prod_k P(W_k \mid V_j)$ with Vj taken as the bad class;
the prior probability of bad semantic string names is P = P(Vj).
4. The neural-network-based self-learning semantic detection method according to claim 1, characterized in that, between step 101) and step 102), the method further comprises:
using a feedback strategy to ensure that all keywords in the file name are completely segmented.
5. A neural-network-based self-learning semantic detection system, the system comprising:
a probability term acquisition module, configured to import a dictionary library to segment a file name to be recognized, obtain the keywords in the file name, and calculate a probability term for each keyword based on a Bayesian algorithm, wherein the probability terms are obtained by analyzing the results of judging file names as good or bad;
a processing module, configured to obtain the product of the probabilities of all keywords occurring in good semantic string names and the prior probability of good semantic string names, and to multiply that product by that prior probability; and to obtain the product of the probabilities of all keywords occurring in bad semantic string names and the prior probability of bad semantic string names, and to multiply that product by that prior probability;
a comparison and judgment module, configured to perform the following operation according to the output of the processing module:
if the product of the probabilities of occurrence in good semantic string names multiplied by the prior probability of good semantic string names is greater than the product of the probabilities of occurrence in bad semantic string names multiplied by the prior probability of bad semantic string names, the string has good semantics, otherwise it has bad semantics; the judgment result is stored in a storage medium.
6. The neural-network-based self-learning semantic detection system according to claim 5, characterized in that the probability terms comprise the class proportion P(Vj) and the probability P(Wk|Vj) that a word drawn at random from a file name of class Vj is Wk;
wherein P(Vj) is calculated as the number of file names in the subset of V whose target value is Vj, divided by the total number of file names in V, where V is the set of file names;
and P(Wk|Vj) is calculated as:
$$P(W_k \mid V_j) = \frac{n_k + 1}{n + |V|}$$
wherein n is the total number of the different keywords in Textj; Textj is the single document formed by concatenating all members of docsj; docsj is the subset of file names in V whose target value is Vj, where Vj is good or bad; and nk is the number of times the word Wk appears in Textj.
7. The neural-network-based self-learning semantic detection system according to claim 6, characterized in that the processing module further comprises:
a first processing submodule, configured to obtain, from $\prod_k P(W_k \mid V_j)$ and P = P(Vj), the product of the probabilities of all keywords occurring in good semantic string names and the prior probability of good semantic string names, $P(V_j) \cdot \prod_k P(W_k \mid V_j)$; and
a second processing submodule, configured to obtain, from $\prod_k P(W_k \mid V_j)$ and P = P(Vj), the product of the probabilities of all keywords occurring in bad semantic string names and the prior probability of bad semantic string names, $P(V_j) \cdot \prod_k P(W_k \mid V_j)$.
8. The neural-network-based self-learning semantic detection system according to claim 5, characterized in that the system further comprises a feedback module between the probability term acquisition module and the processing module, the feedback module being configured to check whether keyword segmentation is complete and to restart keyword segmentation when it is not.
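The probability terms defined in claims 2 and 6 can be estimated from file names that have already been judged. The Python sketch below is illustrative only: the name `estimate_terms`, the input format (a list of keyword lists, each with a "good" or "bad" label) and the sample data are assumptions introduced here. The translated definition of n is ambiguous; the sketch assumes n counts every keyword occurrence in Textj, which keeps the estimate at or below 1, and it takes |V| as the number of file names in V, as the claims state.

```python
from collections import Counter


def estimate_terms(labeled_names):
    """Estimate P(Vj) and P(Wk|Vj) from judged file names (illustrative sketch).

    labeled_names: non-empty list of (keywords, label) pairs,
                   with label either "good" or "bad" (hypothetical format).
    Returns (prior, cond_prob), where cond_prob(word, label) -> P(word | label).
    """
    total_names = len(labeled_names)  # |V|: number of file names in V
    prior, counts, n = {}, {}, {}
    for label in ("good", "bad"):
        # docs_j: the file names in V whose target value is Vj
        docs = [kws for kws, lab in labeled_names if lab == label]
        prior[label] = len(docs) / total_names  # P(Vj)
        # Text_j: all members of docs_j concatenated into one document
        text = [w for kws in docs for w in kws]
        counts[label] = Counter(text)  # n_k for each word Wk
        n[label] = len(text)           # n (assumed: all keyword occurrences)

    def cond_prob(word, label):
        # (n_k + 1) / (n + |V|); Counter yields 0 for words never seen
        return (counts[label][word] + 1) / (n[label] + total_names)

    return prior, cond_prob


# Invented sample data, for illustration only.
samples = [
    (["holiday", "photo"], "good"),
    (["family", "photo"], "good"),
    (["casino", "bonus"], "bad"),
]
prior, cond_prob = estimate_terms(samples)
print(prior["good"])               # 2 of the 3 file names are good
print(cond_prob("photo", "good"))  # (2 + 1) / (4 + 3)
```

The added 1 in the numerator means a keyword never seen in a class still receives a small nonzero probability, so a single unseen keyword does not force a product to zero. The more common form of this smoothed estimate uses the vocabulary size in the denominator; that value can be substituted for `total_names` if it is the intended reading.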
CN201210505765.1A 2012-11-30 2012-11-30 Neural-network-based self-learning semantic detection method and system Pending CN103853701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210505765.1A CN103853701A (en) 2012-11-30 2012-11-30 Neural-network-based self-learning semantic detection method and system

Publications (1)

Publication Number Publication Date
CN103853701A true CN103853701A (en) 2014-06-11

Family

ID=50861369

Country Status (1)

Country Link
CN (1) CN103853701A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140611