CN105653649A - Identification method and device of low-proportion information in mass texts - Google Patents

Identification method and device of low-proportion information in mass texts

Info

Publication number
CN105653649A
Authority
CN
China
Prior art keywords
information
model
analytical model
new
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201511002761.1A
Other languages
Chinese (zh)
Other versions
CN105653649B (en
Inventor
倪时龙
苏江文
宋立华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Yirong Information Technology Co Ltd
Original Assignee
Fujian Yirong Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Yirong Information Technology Co Ltd
Priority to CN201511002761.1A
Publication of CN105653649A
Application granted
Publication of CN105653649B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

The invention discloses a method and device for identifying low-proportion information in massive text. The method comprises the following steps: dividing material information into training information and test information, converting the material information into an analyzable mathematical matrix through feature selection and vectorization, and feeding the matrix into an ensemble learning model for model training. The model training comprises: building a first analysis model from the training information; feeding the test information into the analysis model to evaluate the performance of the first analysis model and obtain an evaluation value; and adjusting the training-information allocation weight of each algorithm in the first analysis model according to the evaluation value to obtain a new analysis model. The method solves the problem of identifying a small amount of low-proportion, to-be-identified information in massive data.

Description

Method and device for identifying low-proportion information in massive text
Technical field
The present invention relates to big-data processing methods, and in particular to a method and device for identifying information in massive data in which the information to be identified makes up a very small proportion.
Background
With the development of the internet, online public opinion (blogs, forums, microblogs, WeChat official accounts, and so on) has displaced print media as an important source of public opinion. Analyzing online public opinion matters to enterprises. In marketing a new product, for example, an enterprise that collects and analyzes sentiment information from the internet can manage customer experience and company feedback more thoroughly, understand what the public needs, improve its products, formulate production strategies that better suit users, and provide better service. For large organizations such as government bodies and central state-owned enterprises, online public opinion increasingly affects their brand image, so it must be monitored and guided in a targeted way to keep untrue, damaging opinion from spreading widely. This creates broad demand for monitoring negative speech on the internet, and in particular for identifying and monitoring negative public opinion.
A complete system for monitoring negative public opinion on the internet involves internet information collection, relevance judgment, negative-tendency analysis, and visual presentation:
1. Internet information collection. A web crawler captures the latest public opinion information from specified websites such as news portals, forums, blogs, and microblogs.
2. Relevance judgment. The collected public opinion is checked for relevance (whether it relates to the target organization, for example whether it concerns "XX enterprise"), and irrelevant information is discarded.
3. Negative-tendency analysis. The tendency of the public opinion related to the target organization is judged. The tendency may be positive, neutral, or negative; the negative items are the valuable ones.
4. Visual presentation. The monitored negative public opinion is summarized and displayed as tables, charts with text, and similar forms for use by public opinion monitoring staff.
In practice, however, we found that applying the mature text-analysis algorithms of the machine learning field directly to identifying negative public opinion on the internet works poorly. The main reason is that negative public opinion makes up a very small share of all public opinion, so conventional machine learning algorithms struggle to identify it accurately; that is, the analysis process exhibits "underfitting".
For example, as shown in Fig. 1, statistics from a public opinion monitoring system that we operate for a large central state-owned enterprise show that about 10 million relevant public opinion items are collected each year, of which no more than 50,000 are negative, a share of less than 0.5%. As noted above, conventional machine learning algorithms rely on pattern-correlation judgment: the "public opinion to be analyzed" is compared for correlation with both the "positive or neutral public opinion pattern" and the "negative public opinion pattern", and whether an item is judged negative depends on whether its correlation with the "negative public opinion pattern" is the higher of the two. When "positive or neutral" articles make up the vast majority, the small amount of negative public opinion is often hard to identify. This phenomenon is usually called "underfitting".
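The figures quoted above can be checked with a short calculation, which also illustrates why such a small negative share is hard on a classifier. This is an illustrative sketch only: the function names and the "label everything non-negative" baseline are assumptions introduced here, not part of the patent.

```python
def negative_share(total_items, negative_items):
    """Fraction of all collected items that are negative."""
    return negative_items / total_items


def majority_baseline(total_items, negative_items):
    """Scores of a hypothetical classifier that labels every item non-negative.

    Returns (accuracy, recall on the negative class).
    """
    accuracy = (total_items - negative_items) / total_items
    recall = 0 / negative_items  # no negative item is ever found
    return accuracy, recall


# Figures from the text: about 10 million items per year, at most 50,000 negative.
share = negative_share(10_000_000, 50_000)
accuracy, recall = majority_baseline(10_000_000, 50_000)
print(f"negative share: {share:.2%}")
print(f"baseline accuracy: {accuracy:.2%}, negative recall: {recall:.0%}")
```

With these figures the baseline is correct on 99.5% of items while finding none of the negative ones, which is why recall is reported alongside accuracy.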
In summary, in judging the negative tendency of internet public opinion information, existing schemes have defects such as "a dictionary must be maintained, and because dictionary updates cannot keep pace, misjudgments and missed judgments result" and "negative public opinion makes up a small share, so directly applying conventional machine learning algorithms easily produces overfitting", and therefore cannot solve the tendency-judgment problem for negative public opinion well. This patent proposes a combined optimization approach built on conventional machine learning algorithms that can effectively solve the above problem in public opinion sentiment-tendency analysis.
Summary of the invention
It is therefore desirable to provide a method for identifying a small amount of to-be-identified information in big data.
To achieve the above object, the inventors provide a method for identifying low-proportion information in massive text, comprising the following steps: dividing material information into training information and test information, converting the material information into an analyzable mathematical matrix through feature selection and vectorization, and feeding it into an ensemble learning model for model training;
the model training comprises the step of building a first analysis model from the training information;
feeding the test information into the analysis model, evaluating the performance of the first analysis model to obtain an evaluation value, and adjusting the training-information allocation weight of each algorithm in the first analysis model according to the evaluation value to obtain a new analysis model;
feeding the test information into the analysis model and evaluating the performance of the new analysis model to obtain a new evaluation value; if the new evaluation value has not converged, adjusting again the training-information allocation weight of each algorithm in the new analysis model according to the new evaluation value to obtain a new analysis model, and evaluating and judging again; if the new evaluation value has converged, stopping the judgment and taking the new analysis model as the optimized analysis model;
deploying the optimized model and performing identification analysis on target information.
Preferably, after the material information is divided into training information and test information, the method further comprises the step of additionally adding material to be identified to the test information.
Specifically, the target information or the material information comprises noise material and material to be identified, and in the target information the ratio of noise material to material to be identified is greater than 50.
Specifically, the first analysis model comprises at least two of the SVM, kNN, multinomial_nb, Bernoulli_nb, NearestCentroid, and Ridge algorithms.
A device for identifying low-proportion information in massive text comprises a material processing module, a model building module, an evaluation and judgment module, and a model application module;
the material processing module is configured to divide material information into training information and test information, convert the material information into an analyzable mathematical matrix through feature selection and vectorization, and feed it into an ensemble learning model for model training;
the model building module is configured to build a first analysis model from the training information;
the evaluation and judgment module is configured to feed the test information into the analysis model and evaluate the performance of the first analysis model to obtain an evaluation value; the model building module is further configured to adjust the training-information allocation weight of each algorithm in the first analysis model according to the evaluation value to obtain a new analysis model;
the evaluation and judgment module is further configured to feed the test information into the analysis model and evaluate the performance of the new analysis model to obtain a new evaluation value;
the model building module is further configured, when the new evaluation value has not converged, to adjust again the training-information allocation weight of each algorithm in the new analysis model according to the new evaluation value to obtain a new analysis model, and to enable the evaluation and judgment module to evaluate and judge again; and, when the new evaluation value has converged, to take the new analysis model as the optimized analysis model;
the model application module is configured to deploy the optimized model and perform identification analysis on target information.
Preferably, the material processing module is further configured to additionally add material to be identified to the test information.
Specifically, the target information or the material information comprises noise material and material to be identified, and in the target information the ratio of noise material to material to be identified is greater than 50.
Specifically, the first analysis model comprises at least two of the SVM, kNN, multinomial_nb, Bernoulli_nb, NearestCentroid, and Ridge algorithms.
Here, the text includes material in multiple forms such as written text, pictures, and web pages, to all of which the scheme set out in this method applies. In contrast to the prior art, the above technical scheme makes improvements in two respects: it replaces a single analysis algorithm with an ensemble analysis method, and it adopts an "oversampling" technique for model training. Together these reduce the misjudgment rate and the miss rate of the analysis process, so that the automatic analysis of negative public opinion on the internet is greatly improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of internet public opinion as described in the background of the invention;
Fig. 2 is a flowchart of negative public opinion analysis according to an embodiment of the invention;
Fig. 3 is a flowchart of the improved negative public opinion analysis according to an embodiment of the invention;
Fig. 4 is a flowchart of the low-proportion information identification method according to an embodiment of the invention;
Fig. 5 is a module diagram of the low-proportion information identification device according to an embodiment of the invention;
Fig. 6 shows the system for analyzing negative public opinion on the internet according to an embodiment of the invention.
Description of reference numerals:
500, material processing module;
502, model construction module;
504, evaluation and judgment module;
506, models applying module.
Detailed description of the embodiments
To explain in detail the technical content, structural features, objects, and effects of the technical scheme, a detailed description follows in conjunction with specific embodiments and the accompanying drawings.
Refer first to Fig. 2. Some embodiments shown in Fig. 2 give the basic flow of negative public opinion analysis using machine learning, where the public opinion is a carrier of public opinion information and includes multiple forms such as written text, pictures, and web pages.
The flow is as follows:
1) Data preparation: from historical internet public opinion data, manual labeling produces a "positive and neutral" public opinion data set and a "negative public opinion" data set.
2) Text features: with reference to Fig. 4, a flowchart of a low-proportion information identification method, this step corresponds to step S400, in which the material information is processed. All public opinion texts undergo Chinese word segmentation, feature selection, and vectorization (all using mature techniques), which converts them into an analyzable mathematical matrix. The data is then split into a "training set" (80%) and a "test set" (20%); the former is used to train models and the latter to test their effect.
3) Model training and model evaluation: each time, one mature machine learning algorithm is selected (for example SVM, Naive Bayes, or Ridge), and the corresponding model (for example an SVM model) is "trained" on the training set. Each trained model is then evaluated on the test set to obtain its performance indicators (usually accuracy and recall).
4) Application deployment. The model that performs best in the model evaluation step is deployed to the production environment to run sentiment analysis on new public opinion data.
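Steps 2) and 3) of this flow can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the patent's implementation: word segmentation is assumed to have been done already (documents arrive as token lists), feature selection is omitted, and all function names are hypothetical. The machine translation leaves it unclear whether the first indicator means accuracy or precision, so `evaluate` reports both alongside recall.

```python
import random


def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(tokenized_docs, vocab):
    """Turn token lists into a term-count matrix (one row per document)."""
    matrix = []
    for tokens in tokenized_docs:
        row = [0] * len(vocab)
        for token in tokens:
            if token in vocab:
                row[vocab[token]] += 1
        matrix.append(row)
    return matrix


def split_train_test(rows, labels, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into training and test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train = [(rows[i], labels[i]) for i in indices[:cut]]
    test = [(rows[i], labels[i]) for i in indices[cut:]]
    return train, test


def evaluate(y_true, y_pred, target="negative"):
    """Accuracy over all samples, plus precision and recall for the target label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical, already-segmented documents.
docs = [["service", "bad"], ["service", "good"], ["bad", "bad"]]
vocab = build_vocabulary(docs)
matrix = vectorize(docs, vocab)
```

Any of the classifiers named in the text could be trained on the resulting rows; the training step itself is left out here because the patent defers it to existing mature algorithms.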
In the further embodiment shown in Fig. 3, two notable improvements are made in the "model training" and "model evaluation" stages to make the flow better suited to negative public opinion analysis.
1) The first improvement introduces the "ensemble learning" method. In the model training stage, instead of using a single algorithm, several mature machine learning algorithms are used jointly and combined into an "ensemble learning" algorithm. This addresses the high miss rate and misjudgment rate of a single algorithm in the scenario where negative public opinion makes up a very small share.
So-called "ensemble learning" means training multiple analysis models based on different analysis algorithms and then combining these classification models to achieve better predictive performance. This patent demonstrates several supervised analysis algorithms, including SVM, kNN, multinomial_nb, Bernoulli_nb, NearestCentroid, and Ridge.
The flow of the ensemble learning method is as follows:
a. Prepare training data. The training data set is split according to weight proportions and distributed to the different algorithms. (Under the initial weights, every algorithm receives the same amount of training data; for example, if 5 algorithms take part in the evaluation, each is allocated 20%.)
b. Create analysis models. Using the prepared algorithms, build the corresponding analysis models from the training data set.
c. Evaluate the analysis models. Based on the test data set, evaluate the performance of each analysis model to obtain an evaluation value.
d. Adjust training data weights and rerun. According to the evaluation value of each analysis model, adjust the allocation weights of the training data set (the better an algorithm performs, the higher its weight and the more training data it is allocated), then return to step b. Repeat until the result of step c converges (that is, the evaluation values of successive rounds stabilize and no longer change). This yields the weight of each analysis model.
e. Combine into the "ensemble analysis algorithm". The ensemble analysis model is built from the weights of the analysis models at convergence. When a piece of data is analyzed, every analysis model in the ensemble takes part and produces a result, and the final result is determined according to the weight of each algorithm in the ensemble learning model.
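Steps a to e can be sketched as a short loop. The sketch below makes several assumptions that the text does not fix: weights are set proportional to each model's test accuracy, convergence means the evaluation values changed by less than a tolerance between rounds, and the final result is a weighted vote. The two toy learners (`fit_centroid`, `fit_majority`) merely stand in for the algorithms the patent names, and every identifier is hypothetical.

```python
from collections import Counter


def fit_centroid(samples):
    """Toy nearest-centroid learner over one numeric feature."""
    groups = {}
    for x, label in samples:
        groups.setdefault(label, []).append(x)
    centroids = {label: sum(xs) / len(xs) for label, xs in groups.items()}
    if not centroids:
        return lambda x: None
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))


def fit_majority(samples):
    """Toy learner that always predicts the most common training label."""
    counts = Counter(label for _, label in samples)
    top = counts.most_common(1)[0][0] if counts else None
    return lambda x: top


def allocate(train, weights):
    """Step a: give each algorithm a slice of the training data sized by its weight."""
    shares, start = [], 0
    for i, weight in enumerate(weights):
        last = i == len(weights) - 1
        end = len(train) if last else start + round(weight * len(train))
        shares.append(train[start:end])
        start = end
    return shares


def score(model, test):
    """Step c: evaluation value, here plain accuracy on the test set."""
    return sum(1 for x, label in test if model(x) == label) / len(test)


def fit_ensemble(trainers, train, test, tol=1e-3, max_rounds=20):
    weights = [1 / len(trainers)] * len(trainers)  # equal initial weights
    previous = None
    for _ in range(max_rounds):
        shares = allocate(train, weights)
        models = [fit(share) for fit, share in zip(trainers, shares)]  # step b
        scores = [score(model, test) for model in models]  # step c
        total = sum(scores)
        if total == 0:
            break  # nothing to reweight by; keep the current weights
        weights = [s / total for s in scores]  # step d
        if previous is not None and max(abs(a - b) for a, b in zip(scores, previous)) < tol:
            break  # evaluation values have stabilized
        previous = scores
    return models, weights


def predict_ensemble(models, weights, x):
    """Step e: weighted vote over the member models."""
    votes = {}
    for model, weight in zip(models, weights):
        label = model(x)
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)


# Hypothetical one-feature data: "negative" items sit near 10, others near 0.
train = [(0.0, "other"), (0.5, "other"), (1.0, "other"), (10.0, "negative")] * 10
test = [(1.0, "other"), (0.0, "other"), (0.5, "other"), (0.2, "other"), (9.0, "negative")]
models, weights = fit_ensemble([fit_centroid, fit_majority], train, test)
print(weights, predict_ensemble(models, weights, 9.5))
```

In this toy run the centroid learner scores higher than the majority learner, so it ends with the larger weight and decides the vote when the two disagree.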
The specific embodiment shown in Fig. 4 combines the advantages of the embodiments above and introduces a method for identifying low-proportion information, comprising the following steps. Step S400: divide material information into training information and test information, convert the material information into an analyzable mathematical matrix through feature selection and vectorization, and feed it into an ensemble learning model for model training;
the model training comprises step S402: build a first analysis model from the training information;
step S404: feed the test information into the analysis model, evaluate the performance of the first analysis model to obtain an evaluation value, and adjust the training-information allocation weight of each algorithm in the first analysis model according to the evaluation value to obtain a new analysis model;
step S406: feed the test information into the analysis model and evaluate the performance of the new analysis model to obtain a new evaluation value; if the new evaluation value has not converged, adjust again the training-information allocation weight of each algorithm in the new analysis model according to the new evaluation value to obtain a new analysis model, and evaluate and judge again; if the new evaluation value has converged, stop the judgment and proceed to step S408, taking the new analysis model built according to the converged evaluation value as the optimized analysis model;
Finally, the optimized model is deployed and identification analysis is performed on target information. The target information or the material information comprises noise material and material to be identified; in the target information the ratio of noise material to material to be identified is greater than 50, where the material may be measured in number of characters, number of information items, number of pages, and so on. Taking public opinion information on the internet as an example of target information, negative public opinion generally accounts for only about one in two hundred items. Most negative public opinion, the material to be identified, is buried in the flood of information on the internet; the positive and neutral public opinion is so voluminous that it is treated as noise material in this embodiment. By integrating multiple algorithms to build the analysis model, the above method raises the identification rate for low-proportion material to be identified within the large information volume of big data, and effectively identifies a small amount of to-be-identified information in big data; see the experimental example below.
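The "ratio greater than 50" condition can be expressed as a small check. This is a sketch with hypothetical function names; it measures material by item count, which is only one of the measures the text allows, and it treats "greater than" as excluding the threshold itself, as the description specifies later on.

```python
def noise_ratio(noise_count, target_count):
    """Ratio of noise material to material to be identified."""
    return noise_count / target_count


def is_low_proportion(noise_count, target_count, threshold=50):
    """True when the noise-to-target ratio is strictly greater than the threshold."""
    return noise_ratio(noise_count, target_count) > threshold


# Data set B of the experimental example: 24,182 items, 259 of them negative.
ratio_b = noise_ratio(24182 - 259, 259)
print(round(ratio_b, 1), is_low_proportion(24182 - 259, 259))
```

A share of about one in two hundred corresponds to a ratio of 199, comfortably above the threshold.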
In other preferred embodiments, after the material information is divided into training information and test information, the method further comprises the step of additionally adding material to be identified to the test information. This improvement introduces "oversampling". Specifically, for negative public opinion on the internet, a certain amount of historical negative public opinion data is prepared separately. In the model evaluation and application deployment stages, this negative public opinion is additionally added to the test information, artificially raising the share of negative public opinion in the test information, and the artificially altered test information is then fed into the evaluation and rebuilding of the analysis model. This process is called the "oversampling" technique. Experiments confirm that it does effectively improve the efficiency of identifying low-proportion information.
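Adding separately prepared to-be-identified material to raise its share can be sketched as follows. The function names and the sample data are hypothetical; which data set receives the extra items is left to the caller.

```python
import random


def minority_share(samples, target="negative"):
    """Fraction of samples carrying the target (to-be-identified) label."""
    return sum(1 for _, label in samples if label == target) / len(samples)


def oversample(samples, extra_target_samples, count, seed=0):
    """Return a copy of `samples` with up to `count` extra target samples appended."""
    rng = random.Random(seed)
    picked = rng.sample(extra_target_samples, min(count, len(extra_target_samples)))
    return list(samples) + picked


# Hypothetical data: 1 negative item among 200, plus a separate pool of
# historical negative items prepared in advance.
base = [("doc", "other")] * 199 + [("doc", "negative")]
history = [("old doc", "negative")] * 50
boosted = oversample(base, history, count=20)
print(minority_share(base), minority_share(boosted))
```

Here the negative share rises from 1 in 200 to 21 in 220, and the original items are left untouched.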
Experimental example 1:
Verification data
Based on the historical data collected for a large central state-owned enterprise, the following two data sets were selected for the subsequent verification.
Data set A: historical negative public opinion, 4,406 items.
Data set B: public opinion related to this enterprise from July 2015, 24,182 items, of which 259 are negative.
Verification method
The non-negative and the negative public opinion in data set B were randomly split 8:2, with 80% forming training data set C and 20% forming test data set D. The experiment was set up as follows:
Oversampling: a certain number of negative public opinion items from data set A were added to C.
The experiment used multiple models, such as svm and knn; the ensemble analysis model that combines these evaluated models is listed in the table under the name "ensemble learning".
All experiments were repeated 20 times, so all results below are averages.
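The protocol (an 8:2 random split of data set B, repeated 20 times and averaged) can be sketched as below. Splitting each class separately is an assumption read from the wording about non-negative and negative items; the function names are hypothetical, and `run_once` stands for whatever training and scoring routine is being measured.

```python
import random
from statistics import mean


def stratified_split(samples, train_fraction=0.8, seed=0):
    """Split each label group separately so both parts keep the class mix."""
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


def repeat_experiment(samples, run_once, repeats=20):
    """Average run_once(train, test) over independent random splits."""
    results = []
    for seed in range(repeats):
        train, test = stratified_split(samples, seed=seed)
        results.append(run_once(train, test))
    return mean(results)


# Stand-in for data set B: only the labels are modeled, not the documents.
data_b = [("doc", "negative")] * 259 + [("doc", "other")] * (24182 - 259)
train_c, test_d = stratified_split(data_b)
print(len(train_c), len(test_d))
```

With these counts the test set holds only 52 negative items, so a single split is noisy and averaging over repeated runs is a sensible precaution.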
Experimental results
As shown in the table, the analysis results are measured by accuracy and recall, the indicators commonly used in the industry; for both, larger is better. The experimental results show that: first, with oversampling, the same analysis model performs better, and the more negative public opinion is added, the better the effect; second, ensemble learning analyzes better than a single algorithm such as svm or Ridge. This matches the expected outcome and confirms the validity of the method of the invention.
The embodiment shown in Fig. 5 presents a device for identifying low-proportion information in massive text, comprising a material processing module 500, a model building module 502, an evaluation and judgment module 504, and a model application module 506;
the material processing module 500 is configured to divide material information into training information and test information, convert the material information into an analyzable mathematical matrix through feature selection and vectorization, and feed it into an ensemble learning model for model training;
the model building module 502 is configured to build a first analysis model from the training information;
the evaluation and judgment module 504 is configured to feed the test information into the analysis model and evaluate the performance of the first analysis model to obtain an evaluation value; the model building module 502 is further configured to adjust the training-information allocation weight of each algorithm in the first analysis model according to the evaluation value to obtain a new analysis model;
the evaluation and judgment module 504 is further configured to feed the test information into the analysis model and evaluate the performance of the new analysis model to obtain a new evaluation value;
the model building module 502 is further configured, when the new evaluation value has not converged, to adjust again the training-information allocation weight of each algorithm in the new analysis model according to the new evaluation value to obtain a new analysis model, and to enable the evaluation and judgment module to evaluate and judge again; and, when the new evaluation value has converged, to take the new analysis model as the optimized analysis model;
the model application module 506 is configured to deploy the optimized model and perform identification analysis on target information. The above device effectively achieves the identification of low-proportion information.
In a preferred embodiment, the material processing module 500 is further configured to additionally add material to be identified to the test information. This module arrangement better solves the problem of identifying low-proportion information in big data.
Specifically, the target information or the material information comprises noise material and material to be identified, and in the target information the ratio of noise material to material to be identified is greater than 50.
Specifically, the first analysis model comprises at least two of the SVM, kNN, multinomial_nb, Bernoulli_nb, NearestCentroid, and Ridge algorithms. The device of the invention can build an analysis model against a big-data background and identify the material to be identified that makes up a very small share of the target information.
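The division of work among modules 500 to 506 can be sketched as four small classes and a driver loop. This is only a structural illustration: the class and method names are hypothetical, the learner inside `ModelBuildingModule` is a placeholder that predicts the most frequent label, and the reweighting that a real module 502 would perform is reduced to a comment.

```python
class MaterialProcessingModule:
    """Module 500: splits material into training and test information."""

    def split(self, samples, train_fraction=0.8):
        cut = int(len(samples) * train_fraction)
        return samples[:cut], samples[cut:]


class ModelBuildingModule:
    """Module 502: builds (and rebuilds) the analysis model."""

    def build(self, train, feedback=None):
        # Placeholder learner: predicts the most frequent training label.
        # A real module would reweight its member algorithms using `feedback`.
        labels = sorted(label for _, label in train)
        top = max(labels, key=labels.count)
        return lambda item: top


class EvaluationModule:
    """Module 504: scores a model and decides whether the scores have converged."""

    def evaluate(self, model, test):
        return sum(1 for item, label in test if model(item) == label) / len(test)

    def converged(self, previous, current, tol=1e-3):
        return previous is not None and abs(current - previous) < tol


class ModelApplicationModule:
    """Module 506: applies the optimized model to target information."""

    def identify(self, model, targets):
        return [model(item) for item in targets]


def run_device(samples, targets, max_rounds=10):
    processing, building = MaterialProcessingModule(), ModelBuildingModule()
    evaluation, application = EvaluationModule(), ModelApplicationModule()
    train, test = processing.split(samples)
    previous = None
    model = building.build(train)
    for _ in range(max_rounds):
        current = evaluation.evaluate(model, test)
        if evaluation.converged(previous, current):
            break  # the latest model becomes the optimized analysis model
        previous = current
        model = building.build(train, feedback=current)
    return application.identify(model, targets)
```

Keeping evaluation separate from model building mirrors the description, in which module 504 produces the evaluation value and module 502 consumes it.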
Fig. 6 presents a system for analyzing negative public opinion on the internet, built on comprehensively improved machine learning algorithms. The system mainly comprises the following modules:
1. Internet public opinion collection module. Uses web crawler technology to collect data from specified internet sites.
2. Internet public opinion analysis module. Uses text-analysis techniques such as machine learning to analyze the collected public opinion texts and identify negative public opinion.
3. Infrastructure. Supports temporary storage and distributed computing for massive data. Distributed computing uses the open-source software Apache Spark.
4. Data storage module. Persists the analysis results (the identified negative public opinion information). The database is the open-source software MongoDB.
5. Visual presentation. Displays negative public opinion information, statistics, and so on through a web interface.
The internet public opinion collection, infrastructure, data storage, and visual presentation are all built with relatively mature technology. The system optimizes the machine-learning-based method for analyzing negative public opinion on the internet, avoids the problems of existing schemes, and reduces the misjudgment rate and miss rate of automatic negative public opinion analysis, and so better solves the problem of analyzing negative tendency in massive internet public opinion.
It should be noted that, herein, the such as relational terms of first and second grades and so on is only used for separating an entity or operation with another entity or operational zone, and not necessarily requires or imply to there is any this kind of actual relation or sequentially between these entities or operation. And, term " comprises ", " comprising " or its any other variant are intended to contain comprising of nonexcludability, so that comprise the process of a series of key element, method, article or terminating unit not only comprise those key elements, but also comprise other key elements clearly do not listed, or also comprise the key element intrinsic for this kind of process, method, article or terminating unit. When not more restrictions, the key element limited by statement " comprising ... " or " comprising ... ", and be not precluded within process, method, article or the terminating unit comprising described key element and also there is other key element. In addition, herein, " being greater than ", " being less than ", " exceeding " etc. are interpreted as and do not comprise this number; " more than ", " below ", " within " etc. be interpreted as and comprise this number.
Those skilled in the art are it should be appreciated that the various embodiments described above can be provided as method, device or computer program. These embodiments can adopt the form of complete hardware embodiment, completely software implementation or the embodiment in conjunction with software and hardware aspect. All or part of step in the method that the various embodiments described above relate to can be completed by the hardware that program carrys out instruction relevant, described program can be stored in the storage media that computer equipment can read, for performing all or part of step described in the various embodiments described above method. Described computer equipment, includes but not limited to: Personal Computer, server, multi-purpose computer, special purpose computer, the network equipment, embedded equipment, programmable device, intelligent mobile terminal, intelligent home device, wearable intelligent equipment, vehicle intelligent equipment etc.; Described storage media, includes but not limited to: the storage of RAM, ROM, magnetic disc, tape, CD, flash memory, USB flash disk, portable hard drive, storage card, memory stick, the webserver, network cloud storage etc.
The above embodiments are described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a computer device to produce a machine, such that the instructions executed by the processor of the computer device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-device-readable memory that can direct the computer device to operate in a particular manner, such that the instructions stored in that memory produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer device, causing a series of operational steps to be performed on the computer device to produce a computer-implemented process, such that the instructions executed on the computer device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although the above embodiments have been described, those skilled in the art, once they grasp the basic inventive concept, can make further changes and modifications to these embodiments. The foregoing therefore describes only embodiments of the present invention and does not thereby limit the scope of patent protection of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (8)

1. A method for identifying low-proportion information in mass texts, characterized by comprising the following steps: dividing material information into training information and test information, converting the material information into an analyzable mathematical matrix through feature selection and vectorization, and substituting it into an ensemble learning model for model training;
the model training comprises the steps of: building a first analytical model from the training information;
substituting the test information into the first analytical model, evaluating the operating performance of the first analytical model to obtain an assessed value, and adjusting, according to the assessed value, the training-information distribution weight of each algorithm in the first analytical model to obtain a new analytical model;
substituting the test information into the new analytical model and evaluating the operating performance of the new analytical model to obtain a new assessed value; if the new assessed value has not converged, again adjusting, according to the new assessed value, the training-information distribution weight of each algorithm in the new analytical model to obtain a new analytical model, and performing the evaluation and judgment again; if the new assessed value has converged, stopping the judgment and taking the new analytical model as the optimized analytical model;
deploying the optimized analytical model for application, and performing identification and analysis on target information.
2. The method for identifying low-proportion information in mass texts according to claim 1, characterized in that, after dividing the material information into training information and test information, the method further comprises the step of additionally adding material to be identified to the test information.
3. The method for identifying low-proportion information in mass texts according to claim 1, characterized in that the target information or the material information comprises noise material and material to be identified, and in the target information the ratio of noise material to material to be identified is greater than 50.
4. The method for identifying low-proportion information in mass texts according to claim 1, characterized in that the first analytical model comprises at least two of the following algorithms: SVM, kNN, multinomial_nb, Bernoulli_nb, NearestCentroid, and Ridge.
5. A device for identifying low-proportion information in mass texts, characterized by comprising a material processing module, a model construction module, an assessment and judgment module, and a model application module;
the material processing module is configured to divide material information into training information and test information, convert the material information into an analyzable mathematical matrix through feature selection and vectorization, and substitute it into an ensemble learning model for model training;
the model construction module is configured to build a first analytical model from the training information;
the assessment and judgment module is configured to substitute the test information into the first analytical model and evaluate the operating performance of the first analytical model to obtain an assessed value; the model construction module is further configured to adjust, according to the assessed value, the training-information distribution weight of each algorithm in the first analytical model to obtain a new analytical model;
the assessment and judgment module is further configured to substitute the test information into the new analytical model and evaluate the operating performance of the new analytical model to obtain a new assessed value;
the model construction module is further configured, when the new assessed value has not converged, to again adjust, according to the new assessed value, the training-information distribution weight of each algorithm in the new analytical model to obtain a new analytical model, and to enable the assessment and judgment module to perform the evaluation and judgment again; and is further configured, when the new assessed value has converged, to take the new analytical model as the optimized analytical model;
the model application module is configured to deploy the optimized analytical model for application and to perform identification and analysis on target information.
6. The device for identifying low-proportion information in mass texts according to claim 5, characterized in that the material processing module is further configured to additionally add material to be identified to the test information.
7. The method for identifying low-proportion information in mass texts according to claim 5, characterized in that the target information or the material information comprises noise material and material to be identified, and in the target information the ratio of noise material to material to be identified is greater than 50.
8. The method for identifying low-proportion information in mass texts according to claim 5, characterized in that the first analytical model comprises at least two of the following algorithms: SVM, kNN, multinomial_nb, Bernoulli_nb, NearestCentroid, and Ridge.
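
The train, evaluate, adjust, and re-evaluate loop recited in claim 1 can be pictured with the minimal Python sketch below. It is only one possible reading, under several assumptions that the claims do not specify: feature selection is replaced by a small hand-picked vocabulary (`VOCAB`); the ensemble consists of two simplified naive Bayes scorers (a count-based one and a variant with counts clipped to 0/1, which is not a full Bernoulli model); the "training-information distribution weight" is interpreted as the training weight given to each rare-class sample; the assessed value is an F1 score on the test information; the additive update rule with `step` and the tolerance `tol` are illustrative. All function and variable names are hypothetical.

```python
import math

VOCAB = ["outage", "fault", "sale", "news", "weather"]  # assumed, hand-picked features


def vectorize(texts):
    """Convert texts into a count matrix over VOCAB (one row per text)."""
    return [[text.split().count(word) for word in VOCAB] for text in texts]


def fit_nb(X, y, rare_weight, binary=False):
    """Train a weighted naive Bayes scorer; returns log-odds of the rare class.

    rare_weight is the training weight of each rare-class sample (noise
    samples keep weight 1.0). binary=True clips counts to 0/1.
    """
    prep = (lambda row: [min(v, 1) for v in row]) if binary else (lambda row: row)
    counts = {0: [1.0] * len(VOCAB), 1: [1.0] * len(VOCAB)}  # Laplace smoothing
    prior = {0: 1.0, 1: 1.0}
    for row, label in zip(X, y):
        w = rare_weight if label == 1 else 1.0
        prior[label] += w
        for j, v in enumerate(prep(row)):
            counts[label][j] += w * v
    total = {c: sum(counts[c]) for c in (0, 1)}
    delta = [math.log(counts[1][j] / total[1]) - math.log(counts[0][j] / total[0])
             for j in range(len(VOCAB))]
    bias = math.log(prior[1] / prior[0])
    return lambda row: bias + sum(d * v for d, v in zip(delta, prep(row)))


def build_model(X, y, weights):
    """One analytical model: two algorithms, each trained with its own weight."""
    members = [fit_nb(X, y, weights[0]), fit_nb(X, y, weights[1], binary=True)]
    return lambda row: sum(m(row) for m in members) > 0


def f1_score(model, X, y):
    """Assessed value (assumption): F1 of the rare class on the test information."""
    preds = [model(row) for row in X]
    tp = sum(1 for p, t in zip(preds, y) if p and t == 1)
    fp = sum(1 for p, t in zip(preds, y) if p and t == 0)
    fn = sum(1 for p, t in zip(preds, y) if not p and t == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def optimize(X_train, y_train, X_test, y_test, step=6.0, tol=1e-3, max_rounds=20):
    """Adjust the weights until the assessed value stops changing."""
    weights = [1.0, 1.0]
    model = build_model(X_train, y_train, weights)        # first analytical model
    history = [f1_score(model, X_test, y_test)]
    for _ in range(max_rounds):
        # Assumed update rule: the lower the score, the more the rare class is weighted.
        weights = [w + step * (1.0 - history[-1]) for w in weights]
        model = build_model(X_train, y_train, weights)    # new analytical model
        history.append(f1_score(model, X_test, y_test))
        if abs(history[-1] - history[-2]) < tol:          # convergence check
            break
    return model, weights, history


# Toy material information: label 1 marks the rare material to be identified.
train_texts = ["sale news today", "weather news", "big sale sale",
               "news weather sale", "weather today", "grid outage fault"]
train_labels = [0, 0, 0, 0, 0, 1]
test_texts = ["sale weather", "news news", "outage reported", "fault news"]
test_labels = [0, 0, 1, 1]

model, weights, history = optimize(vectorize(train_texts), train_labels,
                                   vectorize(test_texts), test_labels)
print(history, weights)
```

The returned `model` plays the role of the optimized analytical model and can then be applied to vectorized target information, for example `model(vectorize(["fault news"])[0])`. In practice the scorers would be replaced by implementations of the algorithms named in claim 4, and the update rule by whatever weighting scheme the implementation actually uses.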
CN201511002761.1A 2015-12-28 2015-12-28 Identification method and device of low-proportion information in mass texts Active CN105653649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511002761.1A CN105653649B (en) 2015-12-28 2015-12-28 Identification method and device of low-proportion information in mass texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511002761.1A CN105653649B (en) 2015-12-28 2015-12-28 Identification method and device of low-proportion information in mass texts

Publications (2)

Publication Number Publication Date
CN105653649A true CN105653649A (en) 2016-06-08
CN105653649B CN105653649B (en) 2019-05-21

Family

ID=56477070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511002761.1A Active CN105653649B (en) 2015-12-28 2015-12-28 Identification method and device of low-proportion information in mass texts

Country Status (1)

Country Link
CN (1) CN105653649B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609019A (en) * 2017-08-07 2018-01-19 国网辽宁省电力有限公司 A kind of method that corporate information based on internet public information obtains
CN108090040A (en) * 2016-11-23 2018-05-29 北京国双科技有限公司 A kind of text message sorting technique and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004764A (en) * 2010-11-04 2011-04-06 中国科学院计算机网络信息中心 Internet bad information detection method and system
CN102567304A (en) * 2010-12-24 2012-07-11 北大方正集团有限公司 Filtering method and device for network malicious information
CN102929897A (en) * 2011-08-12 2013-02-13 北京千橡网景科技发展有限公司 Method and equipment for detecting bad information from text
CN103793503A (en) * 2014-01-24 2014-05-14 北京理工大学 Opinion mining and classification method based on web texts

Also Published As

Publication number Publication date
CN105653649B (en) 2019-05-21

Similar Documents

Publication Publication Date Title
CN106980573B (en) Method, device and system for constructing test case request object
CN105069470A (en) Classification model training method and device
CN103106262B (en) The method and apparatus that document classification, supporting vector machine model generate
CN105740404A (en) Label association method and device
CN105068993B (en) A method of assessment text difficulty
CN109522562B (en) Webpage knowledge extraction method based on text image fusion recognition
CN111400499A (en) Training method of document classification model, document classification method, device and equipment
CN106610970A (en) Collaborative filtering-based content recommendation system and method
CN101819585A (en) Device and method for constructing forum event dissemination pattern
US20160170993A1 (en) System and method for ranking news feeds
CN104346425A (en) Method and system of hierarchical internet public sentiment indication system
CN106156372A (en) The sorting technique of a kind of internet site and device
CN111309910A (en) Text information mining method and device
CN103957116A (en) Decision-making method and system of cloud failure data
WO2024067387A1 (en) User portrait generation method based on characteristic variable scoring, device, vehicle, and storage medium
CN109168051A (en) A kind of network direct broadcasting platform supervision evidence-obtaining system based on blue-ray storage
WO2021103401A1 (en) Data object classification method and apparatus, computer device and storage medium
CN113723737A (en) Enterprise portrait-based policy matching method, device, equipment and medium
CN110689211A (en) Method and device for evaluating website service capability
CN105653649A (en) Identification method and device of low-proportion information in mass texts
CN103207804A (en) MapReduce load simulation method based on cluster job logging
CN103093236B (en) A kind of pornographic filter method of mobile terminal analyzed based on image, semantic
CN103279549A (en) Method and device for acquiring target data of target objects
CN111026940A (en) Network public opinion and risk information monitoring system and electronic equipment for power grid electromagnetic environment
CN102567425B (en) Method and device for processing data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Ni Shilong; Su Jiangwen; Wu Fei; Wang Qiulin; Song Lihua
Inventor before: Ni Shilong; Su Jiangwen; Song Lihua
GR01 Patent grant