CN109460477A - Information collection and classification system and method, and retrieval and integration method thereof

Publication number
CN109460477A
Authority
CN
China
Prior art keywords: potential, information, new words, module, retrieval
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811258103.2A
Other languages
Chinese (zh)
Other versions
CN109460477B (en
Inventor
刘默驰
武亚宁
郭剑南
顾益智
武永清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HAINAN METAIMAG TECHNOLOGY Ltd
Original Assignee
HAINAN METAIMAG TECHNOLOGY Ltd
Application filed by HAINAN METAIMAG TECHNOLOGY Ltd
Priority to CN201811258103.2A
Publication of CN109460477A
Application granted
Publication of CN109460477B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an information collection and classification system and method, and to a retrieval and integration method thereof. The retrieval and integration method comprises the following steps. Step S1: obtain a potential new word and look it up in the knowledge graph; if the potential new word already exists in the knowledge graph, proceed directly to step S2; if it does not, integrate the potential new word and all triples (e1, r, e2) related to it into the knowledge graph, where e1 denotes the potential new word, e2 denotes a word that has an entity relationship with the potential new word, and r denotes the relationship type between e1 and e2. Step S2: perform word vector integration for the obtained potential new word. Step S3: repeat steps S1 and S2 until all potential new words have been retrieved and integrated. The invention integrates information effectively and incrementally and triggers retraining only when necessary, which reduces system cost and streamlines the system flow while preserving the quality of knowledge aggregation.

Description

Information collection and classification system and method, and retrieval and integration method thereof
Technical field
The present invention relates to the technical field of information collection and processing, and in particular to an information collection and classification system and method, and a retrieval and integration method thereof.
Background
Companies located in small and micro enterprise startup zones and industrial parks pay particular attention to the latest national hot news, preferential policies, and the subdivided fields of their own business, so the information corpus they care about is highly specific. Establishing an effective system for collecting, classifying, retrieving, and pushing messages in the subdivided fields they follow allows them to discover valuable information promptly and avoid being submerged in a vast sea of messages.
Traditional corpus classification models and word vector models are usually trained on large general-purpose corpora, for example word vectors trained on Wikipedia or the Sogou Labs news text classification corpus. They provide insufficient support for the subdivided fields of national policy, and they cannot react quickly to, or effectively handle, new words that appear in professional and subdivided fields.
Therefore, a new information processing method is needed.
Summary of the invention
To remedy the shortcomings of the prior art, the present invention provides an information retrieval and integration method comprising the following steps:
Step S1: obtain a potential new word and look it up in the knowledge graph. If the potential new word exists in the knowledge graph, proceed directly to step S2; if it does not, integrate the potential new word and all triples (e1, r, e2) related to it into the knowledge graph, where e1 denotes the potential new word, e2 denotes a word that has an entity relationship with the potential new word, and r denotes the relationship type between e1 and e2;
Step S2: perform word vector integration for the obtained potential new word;
Step S3: repeat steps S1 and S2 until all potential new words have been retrieved and integrated.
Step S2 comprises the following steps:
Step S21: look up the potential new word in the word vector library; if it exists, return to step S1 and obtain the next potential new word; if it does not exist, proceed to step S22;
Step S22: determine whether the number n of kinds of potential new words accumulated so far is greater than or equal to the threshold threshold_ALL; if so, reset n to zero, retrain the entire word vector model, then return to step S1 and obtain the next potential new word; if not, proceed to step S23;
Step S23: update n and the value n_i corresponding to the potential new word, where n_i denotes the cumulative number of times the obtained potential new word has entered the system;
Step S24: determine whether the n_i of the potential new word is greater than or equal to the threshold threshold_ONE; if not, return to step S1 and obtain the next potential new word; if so, proceed to step S25;
Step S25: integrate the word vector of the potential new word into the word vector library.
Step S25 comprises: retrieving, in the knowledge graph, entity words related to the potential new word;
if any are retrieved, storing the weighted average of the word vectors of the related entity words in the library as the word vector of the potential new word, and returning to step S1;
if none are retrieved, looking up synonyms, near-synonyms, or antonyms of the potential new word in at least one of a synonym dictionary, a near-synonym dictionary, and an antonym dictionary; if any are retrieved, storing the weighted average of the word vectors of at least one of those synonyms, near-synonyms, and antonyms in the library as the word vector of the potential new word, and returning to step S1; if none are retrieved, inserting a preset default word vector for the potential new word into the library.
In step S22, when the number n of kinds of potential new words is greater than or equal to the threshold threshold_ALL, n is reset to zero; during retrieval and integration of subsequent potential new words, only kinds of potential new words that first appear after the reset are counted toward n;
In step S23, n is updated as follows: if the obtained potential new word has appeared in the system before, n remains unchanged; if it has not, n is incremented by 1. n_i is updated by incrementing it by 1.
The present invention further provides an information classification method comprising the following steps:
Step S1, information crawling: crawl relevant text from relevant news sources, websites, and databases with web crawlers to obtain information;
Step S2: text preprocessing;
Step S3: discover potential new words and potential new relations in the preprocessed information;
Step S4, information retrieval and integration: perform information retrieval and integration on the discovered potential new words and potential new relations;
Step S5: classify the integrated information;
wherein the information retrieval and integration in step S4 is performed according to any of the information retrieval and integration methods described above.
In step S1, information is crawled with a Python web crawler such as scrapy or urllib. During crawling, a scheduled start mechanism ensures that the latest data is crawled, a crawl history management mechanism ensures that only incremental data is crawled, and a push or storage mechanism either pushes the crawled data to subsequent modules or stores it;
In step S2, text preprocessing includes removing html tags, word segmentation, or applying a stop word list to remove stop words.
Step S3 comprises:
Step S31, discovering potential new words: obtain the several keywords that occur most frequently in the text by word-frequency-based feature ranking; obtain proper terms by means of characteristic characters; obtain all words related to the proper terms by syntactic analysis; and remove special-meaning entities, such as names, by an entity recognition method;
Step S32, discovering potential new relations: obtain all sentences that contain a potential new word, obtain the relation words in them with a relation extraction method, classify the relation words with a classifier, and obtain triples (e1, r, e2) with classified relations.
Step S5 comprises:
Step S51: obtain the training model features;
Step S52: fuse the training model features into one large feature vector through a Concat layer;
Step S53: output a single class vector from the training model features through a Fully Connected layer;
Step S54: normalize the output class vector through a Softmax layer and finally convert it into a result of the form (0, 0, ..., 1, ..., 0), where the i-th element being 1 indicates that the text belongs to the i-th category.
The present invention also provides an information collection and classification system, comprising:
an information crawling module, for crawling relevant text from relevant news sources, websites, and databases to obtain information;
a text preprocessing module, connected to the information crawling module, for preprocessing the text of the obtained information;
a discovery module, connected to the text preprocessing module, for discovering potential new words and potential new relations in the preprocessed information;
an information retrieval and integration module, connected to the discovery module, for performing information retrieval and integration on the discovered potential new words and potential new relations;
a classification module, for classifying the integrated information;
wherein the information retrieval and integration module performs information retrieval and integration according to any of the information retrieval and integration methods described above.
The information crawling module comprises a policy information crawling module and a business information crawling module, which use different web crawlers to crawl relevant text from relevant news sources, websites, and databases;
The discovery module comprises a potential new word discovery module and a potential new relation discovery module, which discover potential new words and potential new relations, respectively, in the preprocessed information;
The information retrieval and integration module comprises a knowledge graph retrieval and integration module and a word vector retrieval and integration module, wherein the knowledge graph retrieval and integration module performs step S1 of any of the information retrieval and integration methods described above, and the word vector retrieval and integration module performs step S2 of any of those methods.
The potential new word discovery module works as follows: obtain the several keywords that occur most frequently in the text by word-frequency-based feature ranking; obtain proper terms by means of characteristic characters; obtain all words related to the proper terms by syntactic analysis; and remove special-meaning entities, such as names, by an entity recognition method;
The potential new relation discovery module works as follows: obtain all sentences that contain a potential new word, obtain the relation words in them with a relation extraction method, classify the relation words with a classifier, and obtain triples (e1, r, e2) with classified relations.
The classification module classifies the integrated information by the following method:
Step Sa: obtain the training model features;
Step Sb: fuse the training model features into one large feature vector through a Concat layer;
Step Sc: output a single class vector from the training model features through a Fully Connected layer;
Step Sd: normalize the output class vector through a Softmax layer and finally convert it into a result of the form (0, 0, ..., 1, ..., 0), where the i-th element being 1 indicates that the text belongs to the i-th category.
In step Sa, the training model features obtained include:
mixed word-and-sentence-level features, formed by the mean of the word vectors of the several most frequent keywords in the text, obtained by word frequency statistics;
article-level features, formed by the topic model embedding of the text, obtained by training on the text; and
in-article logical features, formed by the knowledge graph embedding obtained with the TransE or TransR algorithm.
When classifying the integrated information, the classification module further performs, between steps Sb and Sc: normalizing the fused feature vector through a Batch Normalization layer, and randomly deactivating some nodes during model training through at least one Dropout layer.
The information collection and classification system further comprises a storage module connected to the classification module, for storing the text titles, body text, keywords, topic model embedding vectors, knowledge graph embedding vectors, class vectors, and classification results obtained throughout the information collection and classification process.
The information collection and classification system further comprises a user interaction module connected to the storage module, for providing users with intelligent search and customized push services based on the information stored in the storage module.
The information collection and classification system and method and the retrieval and integration method provided by the invention integrate information effectively and incrementally and trigger retraining only when necessary, which reduces system cost and streamlines the system flow while preserving the quality of knowledge aggregation.
Brief description of the drawings
Fig. 1: system architecture diagram of the information collection and classification system provided by the invention.
Fig. 2: logic flow chart of the working method of the discovery module of the invention.
Fig. 3: work flow chart of the retrieval and integration module of the invention.
Fig. 4: work flow chart of the classification module of the invention.
Reference numerals
10 - information crawling module, 11 - policy information crawling module, 12 - business information crawling module;
20 - text preprocessing module;
30 - discovery module, 31 - potential new word discovery module, 32 - potential new relation discovery module;
40 - retrieval and integration module, 41 - knowledge graph retrieval and integration module, 42 - word vector retrieval and integration module;
50 - classification module, 60 - storage module;
70 - user interaction module, 71 - customizable push module, 72 - natural language search module;
80 - user.
Detailed description of the embodiments
To give a further understanding of the technical solution of the invention and its beneficial effects, the technical solution and the beneficial effects it produces are described in detail below with reference to the drawings.
Fig. 1 is the system architecture diagram of the information collection and classification system provided by the invention. As shown in Fig. 1, the system mainly comprises, connected in sequence: information crawling module 10, text preprocessing module 20, discovery module 30, information retrieval and integration module 40, classification module 50, storage module 60, and user interaction module 70. These modules successively perform information acquisition, preprocessing, discovery of potential new words and potential new relations, information retrieval and integration, text classification, and storage of the information generated in each of these modules, so as to provide end users 80 with first-hand, personalized, and accurate message services. The specific working mode of each module and the cooperation between modules are described in detail below.
I. Information crawling module
Information crawling module 10 comprises a policy information crawling module 11 and a business information crawling module 12, which can run in parallel. Using any technology that provides web crawling, such as the existing Python packages scrapy and urllib, it crawls large numbers of:
1. hot news and policy websites and databases (crawled by policy information crawling module 11);
2. in-depth data on professional websites of the subdivided fields (crawled by business information crawling module 12).
The pages crawled include each web page and the links referenced in it. The invention limits the number of pages crawled by a preset maximum depth. For database data, not only the source data is crawled but also the primary key and foreign key relationships between the data.
Its data sources may include:
1. official websites of the national, provincial, and municipal governments, and websites, channels, or open databases for news, policy, finance and economics, finance, funds, industry, and science and technology;
2. official websites of the enterprises and park clients served, and consulting websites, channels, or open databases related to their main business and technologies;
3. other relevant websites or databases, which can be preset in the information collection and classification system of the invention.
To ensure that information is crawled efficiently, accurately, and in depth, information crawling module 10 should have the following mechanisms:
1. a scheduled start mechanism: specifically, an external timed scheduler periodically launches information crawling module 10 to ensure that the latest data is crawled;
2. a crawl history management mechanism, which identifies which web page content has already been crawled and has not been updated, so that only incremental web page data is crawled;
3. a push or storage mechanism, which pushes the crawled data to subsequent modules or stores it for consumption by subsequent modules.
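The depth limit and the crawl history mechanism described above can be illustrated with a minimal sketch. It is not the patented implementation: the `fetch` callable, the in-memory `history` dictionary, and the use of a SHA-256 content hash to detect updates are all assumptions made for illustration. In practice `fetch` might wrap urllib or a scrapy spider.

```python
import hashlib
from collections import deque

def crawl(start_urls, fetch, max_depth, history):
    """Breadth-first crawl limited by max_depth; returns only new or changed pages.

    fetch(url) -> (text, links) is a hypothetical callable supplied by the caller.
    history maps url -> content hash from earlier runs and is updated in place.
    """
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    fresh = {}
    while queue:
        url, depth = queue.popleft()
        text, links = fetch(url)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if history.get(url) != digest:      # new page or updated content
            history[url] = digest
            fresh[url] = text
        if depth < max_depth:               # follow links only up to the preset depth
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return fresh
```

Running `crawl` twice against unchanged pages returns an empty dictionary the second time, which is the incremental behavior the history mechanism is meant to provide. The returned pages can then be pushed to the next module or stored.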
II. Text preprocessing module
The invention mainly uses common text preprocessing techniques, such as removing html tags (with beautifulsoup), word segmentation (with jieba), and applying a stop word list to remove stop words.
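The three preprocessing steps can be sketched as follows. To keep the example self-contained, the standard-library `html.parser` stands in for beautifulsoup and a simple regular-expression splitter stands in for jieba; both substitutions are assumptions of this sketch, and a real Chinese segmenter can be passed in through the hypothetical `tokenize` parameter.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (stand-in for BeautifulSoup)."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(html):
    """Return the text content of an html fragment with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(part.strip() for part in parser.parts if part.strip())

def preprocess(html, stop_words, tokenize=None):
    """Remove tags, segment the text, and drop stop words."""
    text = strip_html(html)
    # Assumption: a word-character splitter stands in for a real Chinese
    # segmenter such as jieba; pass one in via `tokenize` if available.
    tokenize = tokenize or (lambda s: re.findall(r"\w+", s))
    return [tok for tok in tokenize(text) if tok not in stop_words]
```

For example, `preprocess("<p>The new <b>policy</b> supports the park</p>", {"The", "the"})` yields the tokens `new`, `policy`, `supports`, `park`.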
III. Discovery module
Fig. 2 is the logic flow chart of the working method of the discovery module of the invention. With reference to Fig. 1 and Fig. 2, discovery module 30 of the invention comprises potential new word discovery module 31 and potential new relation discovery module 32.
1. Potential new word discovery module
Text preprocessing module 20 first sends the processed text to different application process nodes in potential new word discovery module 31, and each node processes the text assigned to it.
Specifically, the top N keywords in the text are obtained by word-frequency-based feature ranking such as tf-idf and bag-of-words (BOW); potential policy and subdivided-field proper terms are obtained by means of special characters such as quotation marks, book title marks, and brackets; all words that have a predicate relationship with the above words are obtained by syntactic analysis; and entity recognition techniques determine whether the above words are special-meaning entities such as personal names, place names, or company names, which are rejected if so. The union of the above words is taken as the potential new words and is pushed to potential new relation discovery module 32 together with auxiliary information (metadata, for example whether the word is in quotation marks or whether it is in the top N).
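A simplified sketch of this discovery step is given below. It covers tf-idf ranking, extraction of terms enclosed in special characters, entity filtering, and the attached metadata; the syntactic analysis step is omitted. The smoothed idf formula, the regular expression, and the `is_named_entity` callback are assumptions of the sketch, since the text does not fix any of them.

```python
import math
import re
from collections import Counter

def top_n_tfidf(doc_tokens, corpus_tokens, n):
    """Return the n tokens of doc_tokens with the highest tf-idf score."""
    tf = Counter(doc_tokens)
    num_docs = len(corpus_tokens)

    def idf(tok):
        df = sum(1 for doc in corpus_tokens if tok in doc)
        return math.log((1 + num_docs) / (1 + df)) + 1.0   # smoothed idf (assumed)

    scores = {tok: (count / len(doc_tokens)) * idf(tok) for tok, count in tf.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:n]]

# Quotation marks, book title marks, and brackets that may enclose proper terms.
_MARKED = re.compile(r'“([^”]+)”|"([^"]+)"|《([^》]+)》|（([^）]+)）|\(([^)]+)\)')

def marked_terms(text):
    """Return terms enclosed in quotation marks, book title marks, or brackets."""
    return [next(g for g in m.groups() if g) for m in _MARKED.finditer(text)]

def potential_new_words(text, doc_tokens, corpus_tokens, n, is_named_entity):
    """Union of top-N keywords and marked terms, minus named entities, with metadata."""
    top = top_n_tfidf(doc_tokens, corpus_tokens, n)
    marked = marked_terms(text)
    result = {}
    for word in top + marked:
        if is_named_entity(word):           # hypothetical entity recognition callback
            continue
        result[word] = {"top_n": word in top, "marked": word in marked}
    return result
```

The returned dictionary maps each potential new word to its metadata, which is the form in which it could be handed to the relation discovery step.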
2. Potential new relation discovery module
After the potential new words of each node are obtained, all sentences containing those potential new words are obtained. The relation words in them are obtained with relation extraction techniques, and a pre-trained classifier then classifies the relation words. The triples (e1, r, e2) containing the classified relations are sent to retrieval and integration module 40.
In the invention, the triple (e1, r, e2) is the final output form of discovery module 30. Each process of discovery module 30 outputs several such entity-relation triples for every document it handles. The union of the entities involved in all triples is the full set of potential new words, and the union of all new relations involved is the set of potential new relations. Both are input into retrieval and integration module 40 for subsequent processing.
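The output form described above can be pictured with a small data structure. The `Triple` type and the `summarize` helper are illustrative names introduced here, not names from the original text.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    e1: str   # the potential new word
    r: str    # relation type between e1 and e2
    e2: str   # word that has an entity relationship with e1

def summarize(triples):
    """Return the potential new word set and the potential new relation set."""
    entities = {t.e1 for t in triples} | {t.e2 for t in triples}
    relations = {t.r for t in triples}
    return entities, relations
```

Both sets are simple unions over the triples produced for a document, which matches the description of what is passed on to the retrieval and integration module.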
In the invention, the information that discovery module 30 passes to retrieval and integration module 40 is not limited to the discovered potential new words and potential new relations themselves; it may also include the way each potential new word was obtained. That is, the acquisition path of a potential new word can itself be passed to subsequent modules as auxiliary information and serve as a reference for retrieval and integration.
IV. Retrieval and integration module
1. Basis of retrieval and integration
The invention uses an existing Chinese model as the base word vectors, for example a Chinese Word2Vec model pre-trained on the Sogou whole-web news corpus (http://www.sogou.com/labs/resource/ca.php). It also uses an existing knowledge graph, for example Fudan University's Chinese CN-DBpedia (http://kw.fudan.edu.cn/cndbpedia/intro/), as the Chinese knowledge graph.
Fig. 3 is the work flow chart of the retrieval and integration module of the invention. With reference to Fig. 1 and Fig. 3, retrieval and integration module 40 of the invention comprises knowledge graph retrieval and integration module 41 and word vector retrieval and integration module 42, which perform retrieval and integration for the knowledge graph and the word vectors, respectively.
2. Knowledge graph retrieval and integration module
In the invention, potential new words are retrieved and integrated one at a time: a potential new word is obtained, knowledge graph and word vector retrieval and integration are performed on it in turn, and then the next potential new word is obtained. The same potential new word may appear repeatedly in different retrieval and integration cycles; the more often it repeats, the more frequently it occurs in the text. The invention uses n_i to denote the number of times the corresponding potential new word has appeared, where i is the index of that potential new word among the distinct potential new words that have appeared. For example, if the new word "A" is obtained in the first retrieval and integration cycle, its n_1 = 1; if the new word "B" is obtained in the second cycle, its n_2 = 1; if the new word "A" is obtained again in the third cycle, its n_1 is incremented once and becomes 2. The invention uses n to denote the number of kinds of potential new words the system has accumulated; in the example above, n takes the values 1, 2, and 2 over the three cycles.
After a potential new word is obtained, knowledge graph retrieval and integration is performed first. The system first determines whether the word is in the knowledge graph. If it is not, the word and all triples (e1, r, e2) associated with it are integrated into the knowledge graph. Word vector retrieval and integration follows once this step is complete.
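The knowledge graph check described above amounts to a lookup followed by a conditional insert. The sketch below uses a small in-memory triple store; the `KnowledgeGraph` class and its methods are hypothetical and only stand in for a real knowledge graph such as CN-DBpedia.

```python
class KnowledgeGraph:
    """Minimal in-memory triple store keyed by entity (illustrative only)."""

    def __init__(self):
        self.by_entity = {}          # entity -> set of (e1, r, e2) triples

    def contains(self, word):
        return word in self.by_entity

    def add(self, e1, r, e2):
        for entity in (e1, e2):
            self.by_entity.setdefault(entity, set()).add((e1, r, e2))

    def related_entities(self, word):
        """Entities that share a triple with word, excluding word itself."""
        out = set()
        for e1, _, e2 in self.by_entity.get(word, ()):
            out.update((e1, e2))
        out.discard(word)
        return out

def integrate_into_graph(graph, word, triples):
    """Add word and its related triples only if word is not yet in the graph."""
    if graph.contains(word):
        return False                 # already known: go straight to word vectors
    graph.by_entity.setdefault(word, set())
    for e1, r, e2 in triples:
        if word in (e1, e2):
            graph.add(e1, r, e2)
    return True
```

The `related_entities` method is included because the later word vector step looks up entity words related to the potential new word in the same graph.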
3. Word vector retrieval and integration module
After knowledge graph retrieval and integration is complete, word vector retrieval and integration is performed.
(1) First, check whether the potential new word exists in the word vector library. If it does, end this retrieval and integration cycle and obtain the next potential new word; if it does not, proceed to step (2).
(2) Determine whether the number of kinds of potential new words accumulated so far is greater than or equal to the preset threshold threshold_ALL. If it is, a large amount of new corpus and new words has appeared in the text, and the word vector retraining process must be triggered to obtain a more accurate subdivided-field word vector model. In that case n is reset to zero, and after retraining the current retrieval and integration cycle ends and the next potential new word is obtained.
Note that once n has been reset, in later retrieval and integration cycles n is incremented only when a new word appears that has never appeared before. For example, with the preset threshold threshold_ALL = 3, if the system obtains the words "A", "B", and "C" in three retrieval and integration cycles and none of them appeared in earlier cycles, n rises to 3 and the retraining process is triggered. If one of these words appears again in the fourth cycle, it is not counted toward the number of kinds of potential new words, and n remains 0. In other words, once n has been reset, words that appeared before are no longer considered when n is calculated.
The setting of threshold_ALL, together with the n value, determines when retraining of the full model starts.
(3) If n has not reached the threshold threshold_ALL, update n and n_i according to the principles above: n_i is incremented by 1; n is incremented by 1 if the potential new word has never appeared and stays unchanged if it has. After the update, determine whether the n_i of the obtained potential new word has reached the threshold threshold_ONE. If it has not, the potential new word is not yet considered valuable, and the current retrieval and integration cycle ends. If it has, the potential new word is integrated into the word vector library as described in the following steps.
(4) Retrieve the entity words related to the potential new word in the knowledge graph and determine whether the number m(i) of entity words obtained is greater than 0. If so, compute the weighted average of the word vectors of these entity words, take it as the word vector of the potential new word, and insert it into the word vector library. If not, try to look up synonyms, near-synonyms, or antonyms of the potential new word (collectively called equivalents below) in a synonym or antonym dictionary, compute the weighted average of the word vectors of the equivalents, take it as the word vector of the potential new word, and insert it into the word vector library.
(5) If no valid word vector can be obtained in the ways above, insert a preset word vector into the word vector library, for example assign (0, 0, 0, ..., 0) and insert it into the library, or store the weighted average of all word vectors in the library as the word vector of the potential new word.
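Steps (1) to (5) above can be sketched as a small state machine. This is a sketch under stated assumptions, not the patented implementation: the class name, the `retrain` callback, the equal weights used in the weighted average, and the zero vector as the preset default are all choices made for illustration. The caller is assumed to supply the related entity words (from the knowledge graph) and the equivalents (from the dictionaries).

```python
def mean_vector(vectors, weights=None):
    """Weighted average of equal-length vectors (equal weights by default)."""
    weights = weights or [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) / total for k in range(dim)]

class WordVectorIntegrator:
    def __init__(self, vectors, dim, threshold_all, threshold_one, retrain):
        self.vectors = vectors              # word vector library: word -> list of floats
        self.dim = dim
        self.threshold_all = threshold_all
        self.threshold_one = threshold_one
        self.retrain = retrain              # hypothetical callback that retrains the full model
        self.n = 0                          # kinds of potential new words since the last reset
        self.counts = {}                    # n_i for each potential new word
        self.seen = set()                   # words already counted, kept across resets

    def process(self, word, related_words=(), equivalents=()):
        if word in self.vectors:                       # step (1)
            return "known"
        if self.n >= self.threshold_all:               # step (2)
            self.n = 0
            self.retrain()
            return "retrained"
        self.counts[word] = self.counts.get(word, 0) + 1   # step (3): n_i += 1
        if word not in self.seen:                      # n grows only for unseen words
            self.seen.add(word)
            self.n += 1
        if self.counts[word] < self.threshold_one:
            return "deferred"
        self.vectors[word] = self._approximate(related_words, equivalents)
        return "integrated"

    def _approximate(self, related_words, equivalents):
        # Step (4): related entity words first, then dictionary equivalents.
        for candidates in (related_words, equivalents):
            known = [self.vectors[w] for w in candidates if w in self.vectors]
            if known:
                return mean_vector(known)
        return [0.0] * self.dim                        # step (5): preset default vector
```

Keeping `seen` across resets reflects the rule that words which appeared before a reset are not counted toward n again. A lower `threshold_one` makes the system integrate new words sooner, and a lower `threshold_all` makes it retrain more often.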
In summary, the invention introduces the thresholds threshold_ALL and threshold_ONE. threshold_ALL controls how often the system retrains, and threshold_ONE controls the system's sensitivity to new words. With these two thresholds, the invention provides an approximate method of computing word vectors: the system need not retrain every time a potential new word is obtained, but can compute the word's vector from the vectors of related words and retrain only after n reaches the set threshold. This lets the invention retrieve and integrate valuable new words and new relations while saving computing resources, and it speeds up the integration of new words and new relations so that the subsequent classification model and the search and push modules can use them immediately, which improves the accuracy of prediction and search.
Five, categorization module
Fig. 4 is the work flow diagram of categorization module of the invention, incorporated by reference to shown in Fig. 1 and Fig. 4, categorization module 50 of the invention with Retrieval and integration module 40 connect, and the basis of train classification models is inputted training pattern feature, comprising:
1, statement mix level characteristics-TopN term vector feature: this feature be above by word frequency statistics method (tf-idf or BOW etc.) TopN word obtained, by retrieving the weighted average with integration module term vector obtained.
2, agent model is embedded in feature: being obtained by LDA training text, inputs categorization module as article level characteristics 50。
3, knowledge picture is embedded in feature: by using TransE, TransR scheduling algorithm is obtained, special as logic in article Sign input categorization module 50.
Necessary tool used in the training process of the aspect of model is Concat layers and Fully shown in figure Connected layers:
1, Concat layers: for being a big feature vector by above-mentioned three kinds of Fusion Features, as the defeated of subsequent neural network Enter.
2, Fully Connected layers (full articulamentum): after the features described above for receiving input, export single classification to Amount.
In the present invention, at least one Dropout layer (dropout regularization layer) and a Batch Normalization layer may be placed between the Concat layer and the Fully Connected layer. No particular order is required between the Dropout layer and the Batch Normalization layer; that is, the arrangements of the two between the Concat layer and the Fully Connected layer may include the following three:
1. Dropout layer, Batch Normalization layer;
2. Batch Normalization layer, Dropout layer;
3. Dropout layer, Batch Normalization layer, Dropout layer.
The Batch Normalization layer normalizes the fused feature vector in order to stabilize the data distribution and speed up convergence.
The Dropout layer randomly deactivates some of the nodes during model training in order to avoid overfitting.
Finally, the output class vector is normalized by a Softmax layer and processed into the form (0, 0, ..., 1, ..., 0), where the i-th element being 1 indicates that the text belongs to the i-th class.
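The layer sequence described above can be sketched in plain Python. This is only an illustration under stated assumptions: the batch normalization here has no learned scale or shift, the dropout is the common "inverted" variant applied during training only, and all function names are hypothetical rather than taken from the patent.

```python
import math
import random


def concat(*features):
    """Concat layer: join several feature vectors into one long vector."""
    return [x for feature in features for x in feature]


def batch_norm(batch, eps=1e-5):
    """Normalize every dimension to zero mean and unit variance over the batch."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dim)]
        for row in batch
    ]


def dropout(vec, rate, rng):
    """Zero each node with probability `rate`; rescale the rest (training only)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def fully_connected(vec, weights, bias):
    """Fully Connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Normalize logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(probs):
    """Turn probabilities into (0, ..., 1, ..., 0) with 1 at the largest entry."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1 if k == best else 0 for k in range(len(probs))]
```

For a single sample, one of the three orderings listed above would be `one_hot(softmax(fully_connected(dropout(normalized, rate, rng), W, b)))`, where `normalized` is that sample's row of `batch_norm` applied to the batch of concatenated features.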
Six, memory module
After text classification training is complete, each new input sample is processed by the Text Pretreatment module 20, the discovery module 30, the retrieval and integration module 40 and the categorization module 50 described above, and is finally stored in the memory module 60. The information stored here includes the article title, the body text, the Top-N keywords (with their word vectors), the topic model embedding vector, the knowledge graph embedding vector, the class vector output by the Fully Connected layer, and the class of the text.
Seven, user interactive module
After the memory module 60 has stored the above information in a file system or database, an intelligent search module can be provided on the basis of this information; information can also be pushed purposefully and selectively according to each user's subscriptions to, and interest in, different information.
As shown in Fig. 1, the user interactive module 70 includes a customized push module 71 and a natural language search module 72. The former mainly pushes new information daily, for example delivering the classified information a user has subscribed to to the user 80 in real time via WeChat, SMS or email. The latter provides real-time search over indexed news and policy information, and supports searching by keyword, by category and by natural language question.
In the present invention, "scrapy" and "urllib" are commonly used web crawling tools for the "python" language.
In the present invention, "tf-idf" refers to "term frequency-inverse document frequency" and "bow" refers to the "bag-of-words model". Both are existing general-purpose text processing techniques based mainly on word frequency. They compute frequency-level text features over words, phrases or N-grams; the computed features can be ranked to obtain the Top N words or phrases, which serve as one source of potential new words.
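A minimal sketch of tf-idf ranking, assuming the documents have already been segmented into word lists by the preprocessing step. The function name `tf_idf_top_n` and the plain `log(N / df)` form of the inverse document frequency are choices made for this example; the source does not specify a particular variant.

```python
import math
from collections import Counter


def tf_idf_top_n(docs, doc_index, n):
    """Return the n (term, score) pairs with the highest tf-idf in one document.

    docs is a list of token lists; doc_index selects the document to rank.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # count each term once per document

    target = docs[doc_index]
    term_counts = Counter(target)
    scores = {
        term: (count / len(target)) * math.log(len(docs) / doc_freq[term])
        for term, count in term_counts.items()
    }
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

A term that occurs in every document gets a score of zero, which is why frequent but uninformative words drop out of the Top N.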
In the present invention, "syntactic analysis" refers to parsing a sentence into a number of nested syntax trees, from which the following are extracted: 1. subject-copula-predicative relations, 2. subject-predicate-object relations, 3. modification relations, and 4. other relations. Because relation words take many forms, classifying them into a few broad categories facilitates subsequent retrieval and integration.
In the present invention, "entity recognition" refers to named entity recognition, which can identify special entities such as company names, personal names, dates and place names. These entities are of little help to model prediction and are not treated as potential new words, so they need to be removed from the potential new words.
In the present invention, the "classifier" used may be any existing classifier, for example a text classifier trained on the Sogou news corpus.
In the present invention, techniques such as "TransE" and "TransR" are vector representations of a knowledge graph, used mainly to turn the relations and logic within a text into vector features.
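For orientation, TransE models a triple (e1, r, e2) as a translation in vector space, so that e1 + r lies close to e2 for triples that hold. The sketch below shows only this scoring step with hand-written vectors; training the embeddings is out of scope, and the function name `transe_distance` is hypothetical.

```python
import math


def transe_distance(head, relation, tail):
    """Euclidean distance ||head + relation - tail||; smaller means more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def most_plausible_tail(head, relation, candidates):
    """Pick the candidate tail entity whose vector best completes the triple.

    candidates maps entity name -> embedding vector.
    """
    return min(candidates, key=lambda name: transe_distance(head, relation, candidates[name]))
```

TransR differs in that it first projects the entity vectors into a relation-specific space before applying the same translation idea.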
In the present invention, training on text with "LDA" means modeling the word distribution of a text with a topic model, so that an article is represented as a vector describing its distribution over a number of topics.
In the present invention, "Concat layer" refers to a feature concatenation layer, which joins two or more features along the corresponding dimension.
The beneficial effects of the present invention are as follows:
1. Through the potential new word discovery module and the potential new relation discovery module, a fast and targeted method for discovering new words and new relations is proposed, which better and more quickly meets enterprises' demand for trending information. Many new words appear quickly, occur at high frequency over a short period and have a short life cycle. If they are discovered and added to the system only when retraining is triggered after a certain amount of corpus and data has accumulated, the best moment to discover, classify and index these new words and new relations is very likely to be missed; frequent retraining, on the other hand, increases the enterprise's computational burden. With the potential new word discovery module and the potential new relation discovery module, the present invention therefore speeds up the discovery of valuable new words and new relations while avoiding frequent retraining of the model. Discovery quality is maintained, efficiency and speed are improved, and computing cost is reduced.
2. Through the knowledge graph retrieval and integration module and the word vector retrieval and integration module, an iterative and efficient method for integrating new words and new relations is provided: a threshold on the number of occurrences of a given new word (threshold_ONE) and a threshold on the cumulative number of new words (threshold_ALL) are introduced to define when word vector retraining is triggered. Before retraining is triggered, the weighted average of the word vectors of related words is used directly as the word vector of the new word. The accompanying information of potential new relations is taken into account during integration as a reference for integration. Information can thus be integrated effectively and incrementally, with retraining triggered only when necessary, which reduces system cost and streamlines the system flow while maintaining the quality of knowledge integration.
3. Through the categorization module, an information classification model that combines several non-complete text features is provided. Three kinds of features are used to extract features from unstructured text information along three dimensions: the keyword dimension (keyword word vectors), the document word-distribution dimension (topic model vector), and the dimension of logic within the document (knowledge graph embedding vector). Because each of these is extracted and trained separately by unsupervised learning in the system, each is non-complete on its own; combining the three dimensions improves prediction accuracy.
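The approximate word vector mentioned in the second effect above (a weighted average of related words' vectors, used before retraining is triggered) can be sketched as follows, following the fallback order given in the claims: knowledge graph neighbours first, then a thesaurus, then a default vector. Uniform weights, the function name `approximate_vector` and the data layout are assumptions made for this example.

```python
def approximate_vector(word, triples, thesaurus, vectors, default):
    """Estimate a vector for `word` without retraining.

    triples:   iterable of (e1, r, e2) knowledge graph triples
    thesaurus: dict mapping word -> synonyms, near-synonyms or antonyms
    vectors:   dict mapping known word -> vector
    default:   vector used when no related word is found
    Uniform averaging weights are an assumption; the source only says "weighted".
    """
    # 1. Related entity words from the knowledge graph.
    related = [e2 for e1, _, e2 in triples if e1 == word and e2 in vectors]
    # 2. Otherwise fall back to thesaurus entries that already have vectors.
    if not related:
        related = [w for w in thesaurus.get(word, []) if w in vectors]
    # 3. Otherwise use the default vector.
    if not related:
        return list(default)
    dim = len(vectors[related[0]])
    return [sum(vectors[w][i] for w in related) / len(related) for i in range(dim)]
```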
Although the present invention has been described with reference to the above preferred embodiments, these do not limit its scope of protection. Any changes and modifications made to the above embodiments by a person skilled in the art without departing from the spirit and scope of the present invention still fall within the scope protected by the present invention; the scope of protection is therefore defined by the claims.

Claims (16)

1. An information retrieval and integration method, characterized by comprising the following steps:
Step S1: obtaining a potential new word and searching the knowledge graph for it; if the potential new word exists in the knowledge graph, proceeding directly to step S2; if it does not exist, integrating the potential new word and all triples (e1, r, e2) related to it into the knowledge graph, wherein e1 denotes the potential new word, e2 denotes a word having an entity relation with the potential new word, and r denotes the relation type between e1 and e2;
Step S2: performing word vector integration on the obtained potential new word;
Step S3: repeating steps S1 to S2 until all potential new words have been retrieved and integrated.
2. The information retrieval and integration method according to claim 1, characterized in that step S2 comprises the following steps:
Step S21: searching the word vector library for the potential new word; if it exists, returning to step S1 to obtain the next potential new word; if it does not exist, proceeding to step S22;
Step S22: judging whether the number n of potential new word types accumulated so far is greater than or equal to the threshold threshold_ALL; if so, resetting the number n of potential new word types to zero, retraining all word vectors, and returning to step S1 to obtain the next potential new word; if not, proceeding to step S23;
Step S23: updating the value of n and the value n_i corresponding to the potential new word, wherein n_i denotes the cumulative number of times the obtained potential new word has entered the system;
Step S24: judging whether the n_i value corresponding to the potential new word is greater than or equal to the threshold threshold_ONE; if not, returning to step S1 to obtain the next potential new word; if so, proceeding to step S25;
Step S25: integrating the word vector of the potential new word into the word vector library.
3. The information retrieval and integration method according to claim 2, characterized in that step S25 comprises: searching the knowledge graph for entity words related to the potential new word;
if any are found, storing the weighted average of the word vectors of the related entity words as the word vector of the potential new word, and returning to step S1;
if none are found, searching at least one of a synonym dictionary, a near-synonym dictionary and an antonym dictionary for synonyms, near-synonyms or antonyms of the potential new word; if any are found, storing the weighted average of the word vectors of at least one of the synonyms, near-synonyms and antonyms of the potential new word as the word vector of the potential new word, and returning to step S1; if none are found, inserting a default word vector for the potential new word into the lexicon.
4. The information retrieval and integration method according to claim 2, characterized in that: in step S22, when the number n of new word types is greater than or equal to the threshold threshold_ALL, the number n of potential new word types is reset to zero, and in the retrieval and integration of subsequent potential new words, n is accumulated only over potential new word types that newly appear after the reset;
in step S23, the rule for updating n is: if the obtained potential new word has appeared in the system before, n is unchanged; if the obtained potential new word has not appeared in the system before, n is incremented by 1; and the rule for updating n_i is that n_i is incremented by 1.
5. An information classification method, characterized by comprising the following steps:
Step S1, information crawling: crawling the related text on relevant news sources, websites and databases with a web crawler to obtain information;
Step S2: text preprocessing;
Step S3: discovering potential new words and potential new relations from the preprocessed information;
Step S4, information retrieval and integration: performing information retrieval and integration on the discovered potential new words and potential new relations;
Step S5: classifying the integrated information;
wherein the information retrieval and integration in step S4 is carried out by the information retrieval and integration method according to any one of claims 1-4.
6. The information classification method according to claim 5, characterized in that:
in step S1, information is crawled by a python web crawler based on scrapy or urllib; during crawling, the latest data are crawled by means of a scheduled start mechanism, a crawl-history management mechanism ensures that only incremental data are crawled, and the crawled data are either pushed to subsequent modules by a push mechanism or stored by a storage mechanism;
in step S2, text preprocessing includes removing html tags, word segmentation, or applying a stop-word list to remove stop words.
7. The information classification method according to claim 5, characterized in that step S3 comprises:
Step S31, discovering potential new words: obtaining the several keywords with the highest frequency of occurrence in the text by feature ranking based on word frequency; obtaining proper terms by means of characteristic characters, obtaining all words related to the proper terms by syntactic analysis, and removing special entities, including names, by an entity recognition method;
Step S32, discovering potential new relations: obtaining all sentences that contain potential new words, obtaining the relation words in them with a relation extraction method, and classifying the relation words with a classifier to obtain triples (e1, r, e2) of classified relations.
8. The information classification method according to claim 5, characterized in that step S5 comprises:
Step S51: obtaining training model features;
Step S52: fusing the training model features into one large feature vector by a Concat layer;
Step S53: outputting a single class vector from the training model features by a Fully Connected layer;
Step S54: normalizing the output class vector by a Softmax layer and processing it into a result of the form (0, 0, ..., 1, ..., 0), wherein the i-th element being 1 indicates that the text belongs to the i-th class.
9. An information collection and classification system, characterized by comprising:
an information crawler module, for crawling the related text on relevant news sources, websites and databases to obtain information;
a Text Pretreatment module, connected to the information crawler module, for preprocessing the text of the obtained information;
a discovery module, connected to the Text Pretreatment module, for discovering potential new words and potential new relations from the preprocessed information;
an information retrieval and integration module, connected to the discovery module, for performing information retrieval and integration on the discovered potential new words and potential new relations;
a categorization module, for classifying the integrated information;
wherein the information retrieval and integration module performs information retrieval and integration by the information retrieval and integration method according to any one of claims 1-4.
10. The information collection and classification system according to claim 9, characterized in that:
the information crawler module includes a policy information crawling module and a business information crawling module, which respectively crawl the related text on relevant news sources, websites and databases with different web crawlers;
the discovery module includes a potential new word discovery module and a potential new relation discovery module, which respectively discover potential new words and potential new relations from the preprocessed information;
the information retrieval and integration module includes a knowledge graph retrieval and integration module and a word vector retrieval and integration module, wherein the knowledge graph retrieval and integration module carries out step S1 of the information retrieval and integration method according to any one of claims 1-4, and the word vector retrieval and integration module carries out step S2 of the information retrieval and integration method according to any one of claims 1-4.
11. The information collection and classification system according to claim 10, characterized in that the working mechanism of the potential new word discovery module includes: obtaining the several keywords with the highest frequency of occurrence in the text by feature ranking based on word frequency; obtaining proper terms by means of characteristic characters, obtaining all words related to the proper terms by syntactic analysis, and removing special entities, including names, by an entity recognition method;
the working mechanism of the potential new relation discovery module includes: obtaining all sentences that contain potential new words, obtaining the relation words in them with a relation extraction method, and classifying the relation words with a classifier to obtain triples (e1, r, e2) of classified relations.
12. The information collection and classification system according to claim 9, characterized in that the categorization module classifies the integrated information by the following method:
Step Sa: obtaining training model features;
Step Sb: fusing the training model features into one large feature vector by a Concat layer;
Step Sc: outputting a single class vector from the training model features by a Fully Connected layer;
Step Sd: normalizing the output class vector by a Softmax layer and processing it into a result of the form (0, 0, ..., 1, ..., 0), wherein the i-th element being 1 indicates that the text belongs to the i-th class.
13. The information collection and classification system according to claim 12, characterized in that in step Sa the obtained training model features include:
a mixed word/sentence-level feature formed from the mean of the word vectors of the several keywords with the highest frequency of occurrence in the text, obtained by word-frequency statistics;
an article-level feature formed from the topic model embedding feature of the text, obtained by training on the text; and
a feature of the logic within the article, formed from the knowledge graph embedding feature obtained by the TransE or TransR algorithm.
14. The information collection and classification system according to claim 12, characterized in that, when the categorization module classifies the integrated information, the method further comprises, between steps Sb and Sc: normalizing the fused feature vector by a Batch Normalization layer, and randomly deactivating some of the nodes during model training by at least one Dropout layer.
15. The information collection and classification system according to any one of claims 9-14, characterized in that the system further includes a memory module, connected to the categorization module, for storing the text title, body text, keywords, topic model embedding vector, knowledge graph embedding vector, class vector and classification result collected during the entire information collection and classification process.
16. The information collection and classification system according to claim 15, characterized in that the system further includes a user interactive module, connected to the memory module, for providing users with an intelligent search service and a customized push service based on the information stored in the memory module.
CN201811258103.2A 2018-10-26 2018-10-26 Information collection and classification system and method and retrieval and integration method thereof Expired - Fee Related CN109460477B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811258103.2A CN109460477B (en) 2018-10-26 2018-10-26 Information collection and classification system and method and retrieval and integration method thereof

Publications (2)

Publication Number Publication Date
CN109460477A true CN109460477A (en) 2019-03-12
CN109460477B CN109460477B (en) 2022-03-29

Family

ID=65608499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811258103.2A Expired - Fee Related CN109460477B (en) 2018-10-26 2018-10-26 Information collection and classification system and method and retrieval and integration method thereof

Country Status (1)

Country Link
CN (1) CN109460477B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160092448A1 (en) * 2014-09-26 2016-03-31 International Business Machines Corporation Method For Deducing Entity Relationships Across Corpora Using Cluster Based Dictionary Vocabulary Lexicon
US20170255694A1 (en) * 2014-09-26 2017-09-07 International Business Machines Corporation Method For Deducing Entity Relationships Across Corpora Using Cluster Based Dictionary Vocabulary Lexicon
CN107818164A (en) * 2017-11-02 2018-03-20 东北师范大学 A kind of intelligent answer method and its system
CN108509654A (en) * 2018-04-18 2018-09-07 上海交通大学 The construction method of dynamic knowledge collection of illustrative plates

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄恒琪等: "知识图谱研究综述", 《计算机系统应用》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110609903A (en) * 2019-08-01 2019-12-24 华为技术有限公司 Information presentation method and device
WO2021018154A1 (en) * 2019-08-01 2021-02-04 华为技术有限公司 Information representation method and apparatus
CN110765235A (en) * 2019-09-09 2020-02-07 深圳市人马互动科技有限公司 Training data generation method and device, terminal and readable medium
CN110765235B (en) * 2019-09-09 2023-09-05 深圳市人马互动科技有限公司 Training data generation method, device, terminal and readable medium
CN112347343A (en) * 2020-09-25 2021-02-09 北京淇瑀信息科技有限公司 Customized information pushing method and device and electronic equipment
CN112347343B (en) * 2020-09-25 2024-05-28 北京淇瑀信息科技有限公司 Custom information pushing method and device and electronic equipment
CN112035653A (en) * 2020-11-05 2020-12-04 北京智源人工智能研究院 Policy key information extraction method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN109460477B (en) 2022-03-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220329