CN109033076A - information mining method and device - Google Patents
- Publication number
- CN109033076A (application CN201810716210.9A)
- Authority
- CN
- China
- Prior art keywords
- query statement
- template
- query
- high frequency
- particular category
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present invention provide an information mining method and device. The method comprises: mining the query statements of each specific category from a search log; specifying seed entities for the specific category; generating, according to the seed entities of the specific category and the query statements, the expression template corresponding to each query statement of the specific category; and mining, according to the query statements of each category and their corresponding expression templates, high-frequency query statements and high-frequency expression templates from the search log. Because the user search log serves as the data source, the resulting high-frequency statements and high-frequency expressions are rich, cover the expression habits of a wide range of users, and can include content, such as colloquial expressions, that manually accumulated templates cannot cover.
Description
Technical field
The present invention relates to the technical field of information retrieval, and in particular to an information mining method and device.
Background
In a human-machine interaction system, users express their requests to the robot in many different ways. An existing template-based parsing module needs the full set of user query statements (queries) before it can improve the recall and the parsing accuracy of user understanding. User expressions have the following characteristics, which cause many problems for the traditional approach of manually accumulating rules and vocabularies.
(1) Expressions are varied. Users phrase the same question in many forms, and the expression habits of different users also differ, so manual accumulation cannot cover all expressions.
(2) Expressions are colloquial. User expressions are heavily colloquial, and manually accumulated templates cannot cover them.
(3) The vocabulary of each dimension is huge, and a vocabulary of that magnitude cannot be built by hand.
Given these characteristics, manually accumulating rules and vocabularies incurs high time and labor costs, low efficiency, and poor parsing quality, which leads to a poorly performing user-understanding module and a poor human-machine interaction experience. In addition, manual accumulation cannot produce a large-scale, complete vocabulary, so parsing recall is low. Nor can it produce a large-scale, complete set of expression templates and colloquial expressions, so parsing recall and accuracy are low, user expressions cannot be understood, accurate answers cannot be given, and user satisfaction is low.
Summary of the invention
Embodiments of the present invention provide an information mining method and device to solve one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides an information mining method, comprising:
Mining the query statements of each specific category from a search log;
Specifying seed entities for the specific category;
Generating, according to the seed entities of the specific category and the query statements, the expression template corresponding to each query statement of the specific category;
Mining, according to the query statements of each category and their corresponding expression templates, high-frequency query statements and high-frequency expression templates from the search log.
With reference to the first aspect, in a first implementation of the first aspect, generating the expression template corresponding to each query statement of the specific category according to the seed entities of the specific category and the query statements comprises:
If a query statement of the specific category contains a seed entity, replacing the seed entity with a wildcard to obtain the corresponding expression template.
With reference to the first aspect, in a second implementation of the first aspect, the method further comprises:
Mining entities from the search log using the expression templates, to obtain high-frequency words and/or colloquial words; and/or
Extracting the full-set words belonging to the specific category from the full data of a selected website.
With reference to the second implementation of the first aspect, in a third implementation of the first aspect, the method further comprises:
Performing singular value decomposition (SVD) dimensionality reduction on the expression templates of the mined entities to obtain corresponding feature vectors;
Clustering the feature vectors corresponding to multiple expression templates to obtain the expression templates that the specific category includes.
With reference to the first aspect or any one of its implementations, in a fourth implementation of the first aspect, the method further comprises:
Screening the search log to obtain relevant query statements and expression templates.
With reference to the fourth implementation of the first aspect, in a fifth implementation of the first aspect, mining high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates comprises:
Obtaining labeled query statements from the relevant query statements, each labeled query statement including a category label;
Calculating the semantic similarity of two of the labeled query statements according to their word vectors;
If the semantic similarity of the two query statements is greater than a threshold, establishing a link between them;
Building a statement-template relation graph according to the links between query statements and the connections between each query statement and its corresponding expression template;
Calculating a parameter of a random algorithm according to the category label of each labeled query statement and the total number of labeled query statements;
Running the random algorithm on the statement-template relation graph to obtain a ranking of the query statements and their corresponding expression templates;
Selecting the high-frequency query statements and high-frequency query templates according to the ranking result.
In a second aspect, an embodiment of the present invention provides an information mining device, comprising:
A statement mining module, configured to mine the query statements of each specific category from a search log;
An entity specification module, configured to specify seed entities for the specific category;
A template generation module, configured to generate, according to the seed entities of the specific category and the query statements, the expression template corresponding to each query statement of the specific category;
A high-frequency mining module, configured to mine, according to the query statements of each category and their corresponding expression templates, high-frequency query statements and high-frequency expression templates from the search log.
With reference to the second aspect, in a first implementation of the second aspect, the template generation module is further configured to, if a query statement of the specific category contains a seed entity, replace the seed entity with a wildcard to obtain the corresponding expression template.
With reference to the second aspect, in a second implementation of the second aspect, the device further comprises:
A query word mining module, configured to mine entities from the search log using the expression templates, to obtain high-frequency words and/or colloquial words; and/or
A full-set word extraction module, configured to extract the full-set words belonging to the specific category from the full data of a selected website.
With reference to the second implementation of the second aspect, in a third implementation of the second aspect, the query word mining module includes:
A dimensionality reduction submodule, configured to perform singular value decomposition (SVD) dimensionality reduction on the expression templates of the mined entities to obtain corresponding feature vectors;
A clustering submodule, configured to cluster the feature vectors corresponding to multiple expression templates to obtain the expression templates that the specific category includes.
With reference to the second aspect or any one of its implementations, in a fourth implementation of the second aspect, the device further comprises:
A relevant expression mining module, configured to screen the search log to obtain relevant query statements and expression templates.
With reference to the fourth implementation of the second aspect, in a fifth implementation of the second aspect, the high-frequency mining module includes:
A labeled statement acquisition submodule, configured to obtain labeled query statements from the relevant query statements, each labeled query statement including a category label;
A similarity calculation submodule, configured to calculate the semantic similarity of two of the labeled query statements according to their word vectors;
A link establishment submodule, configured to establish a link between the two query statements if their semantic similarity is greater than a threshold;
A relation graph establishment submodule, configured to build a statement-template relation graph according to the links between query statements and the connections between each query statement and its corresponding expression template;
A parameter calculation submodule, configured to calculate a parameter of a random algorithm according to the category label of each labeled query statement and the total number of labeled query statements;
A ranking submodule, configured to run the random algorithm on the statement-template relation graph to obtain a ranking of the query statements and their corresponding expression templates;
A high-frequency screening submodule, configured to select the high-frequency query statements and high-frequency query templates according to the ranking result.
In a third aspect, an embodiment of the present invention provides an information mining device whose functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions.
In one possible design, the structure of the information mining device includes a processor and a memory. The memory is configured to store a program that supports the information mining device in executing the above information mining method, and the processor is configured to execute the program stored in the memory. The information mining device may further include a communication interface for communicating with other devices or a communication network.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium for storing the computer software instructions used by the information mining device, including a program for executing the above information mining method.
One of the above technical solutions has the following advantage or beneficial effect: because the user search log serves as the data source, the resulting high-frequency statements and high-frequency expressions are rich, cover the expression habits of a wide range of users, and can include content, such as colloquial expressions, that manually accumulated templates cannot cover.
Another of the above technical solutions has the following advantage or beneficial effect: by combining several data mining and artificial intelligence techniques, it can mine a large-scale vocabulary, mine relevant expression templates, extract user expression templates by clustering, and extract high-frequency user expression templates with a random algorithm, thereby achieving high efficiency, high recall, and high parsing accuracy.
The above summary is for the purpose of the description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent by reference to the drawings and the following detailed description.
Brief description of the drawings
In the drawings, unless otherwise specified, the same reference numerals denote the same or similar components or elements throughout the several figures. The drawings are not necessarily drawn to scale. It should be understood that the drawings depict only some embodiments disclosed according to the present invention and should not be taken as limiting the scope of the present invention.
Fig. 1 shows a flowchart of an information mining method according to an embodiment of the present invention.
Fig. 2 shows a flowchart of an information mining method according to an embodiment of the present invention.
Fig. 3 shows a flowchart of an information mining method according to an embodiment of the present invention.
Fig. 4 shows a structural block diagram of an information mining device according to an embodiment of the present invention.
Fig. 5 shows a structural block diagram of an information mining device according to an embodiment of the present invention.
Fig. 6 shows a structural block diagram of an information mining device according to an embodiment of the present invention.
Fig. 7 shows a schematic diagram of the general form of the statement-template relation graph.
Fig. 8 shows a schematic diagram of an example of the statement-template relation graph.
Fig. 9 shows a structural block diagram of an information mining device according to an embodiment of the present invention.
Detailed description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various ways without departing from the spirit or scope of the present invention. Accordingly, the drawings and the description are to be regarded as illustrative in nature and not restrictive.
Fig. 1 shows a flowchart of an information mining method according to an embodiment of the present invention. As shown in Fig. 1, the information mining method may include the following steps:
Step 101: mine the query statements of each specific category from a search log;
Step 102: specify seed entities for the specific category;
Step 103: generate, according to the seed entities of the specific category and the query statements, the expression template corresponding to each query statement of the specific category;
Step 104: mine, according to the query statements of each category and their corresponding expression templates, high-frequency query statements and high-frequency expression templates from the search log.
In the embodiments of the present invention, the search log may include information related to users' search behavior, for example the query statement entered for a search, the search results, and the search results the user actually clicked. Query statements belonging to a given specific category can be mined from the search log. For example, if the specific category is film, the query statements containing film-related information such as film titles, stars, and roles can be found and used as the query statements of the film category.
The specified seed entities may include entities belonging to the specific category. For example, the entities belonging to the film category include role A, star B, movie name C, and so on.
In one possible implementation, step 103 includes: if a query statement of the specific category contains a seed entity, replacing the seed entity with a wildcard to obtain the corresponding expression template.
Specifically, the seed entities can be matched against each query statement of the category, and each seed entity matched in a query statement is replaced with a wildcard to generate the corresponding expression template.
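The matching-and-replacement step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `generate_template`, the plain substring matching, and the choice to replace only the first seed entity found are all assumptions made for the example.

```python
def generate_template(query, seed_entities, wildcard="*"):
    """Return the expression template for `query`, or None if no seed entity occurs.

    Hypothetical helper: matching is plain substring search, and only the
    first matching seed entity is replaced.
    """
    # Try longer entities first so a short entity cannot split a longer one.
    for entity in sorted(seed_entities, key=len, reverse=True):
        if entity in query:
            return query.replace(entity, wildcard)
    return None


seeds = ["movie name C", "star B"]
template = generate_template("movie name C first broadcast", seeds)
print(template)  # * first broadcast
```

Queries that contain no seed entity yield `None` and simply contribute no template at this stage.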
For example, if query statement Q1 is "movie name C first broadcast", the "movie name C" in Q1 can be replaced with a wildcard such as "*" to generate the expression template <* first broadcast>.
As another example, if query statement Q2 is "star B participates in film festival", the "star B" in Q2 can be replaced with a wildcard such as "*" to generate the expression template <* participates in film festival>.
In one possible implementation, as shown in Fig. 2, after the expression templates are obtained, the method further includes: Step 201, mining entities from the search log using the expression templates, to obtain high-frequency words and/or colloquial words.
After a large number of expression templates are obtained, they can be used to mine the search log and obtain all entities that match these templates.
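One way to picture this reverse step is to turn each template into a regular expression and capture whatever fills the wildcard slot. The sketch below assumes each template contains exactly one `*` wildcard and that the whole query must match the template; `template_to_regex` and `mine_entities` are hypothetical names, not part of the source.

```python
import re


def template_to_regex(template, wildcard="*"):
    """Compile a one-wildcard template into a regex with one capturing group."""
    if template.count(wildcard) != 1:
        raise ValueError("this sketch assumes exactly one wildcard per template")
    prefix, suffix = template.split(wildcard)
    # Escape the literal text; the wildcard captures the entity.
    return re.compile("^" + re.escape(prefix) + "(.+)" + re.escape(suffix) + "$")


def mine_entities(templates, queries):
    """Map each template to the entities it captures from the queries."""
    found = {}
    for template in templates:
        pattern = template_to_regex(template)
        for query in queries:
            match = pattern.match(query)
            if match:
                found.setdefault(template, []).append(match.group(1))
    return found


log_queries = [
    "movie name C1 first broadcast",
    "movie name C2 first broadcast",
    "star B1 participates in film festival",
    "weather today",
]
entities = mine_entities(
    ["* first broadcast", "* participates in film festival"], log_queries
)
```

Counting how often each captured entity appears would then give a simple handle on which mined words are high-frequency.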
For example, using the template <* first broadcast>, the query statements Q11 "movie name C1 first broadcast", Q12 "movie name C2 first broadcast", Q13 "movie name C3 first broadcast", and so on are mined from the log, which yields the entities "movie name C1", "movie name C2", and "movie name C3".
As another example, using the template <* participates in film festival>, the query statements Q21 "star B1 participates in film festival", Q22 "star B2 participates in film festival", Q23 "star B3 participates in film festival", and so on are mined from the log, which yields the entities "star B1", "star B2", and "star B3".
In one possible implementation, as shown in Fig. 2, the full-set words of the specific category can also be mined from the offline full data of encyclopedia websites such as Wikipedia and Baidu Baike. The method may include: Step 202, extracting the full-set words belonging to the specific category from the full data of a selected website. For example, all entries of the film category can be extracted from an encyclopedia website, and these entries can then be further classified based on the encyclopedia's abstracts, tables of contents, and so on.
In one possible implementation, mining the corresponding query words from the search log according to the expression templates includes:
Mining entities from the search log using the expression templates;
Performing singular value decomposition (SVD) dimensionality reduction on the expression templates of the mined entities to obtain corresponding feature vectors;
Clustering the feature vectors corresponding to the expression templates to obtain the expression templates that the specific category includes.
Here, using the expression templates as the features of each entity gives a sparse feature. Applying SVD dimensionality reduction to the expression templates to obtain the corresponding feature vectors makes clustering both more effective and more efficient.
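As a hedged illustration of the dimensionality-reduction step, assume it is a truncated singular value decomposition of an entity-by-template co-occurrence matrix. The symbols X, m, n, k, U_k, Σ_k, and V_k below, and the matrix layout itself, are assumptions introduced for this sketch and do not appear in the source.

```latex
% Assumed layout: X_{ij} = number of times entity i was mined with template j.
X \in \mathbb{R}^{m \times n}, \qquad
X \approx U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k},\;
k \ll \min(m, n)
% Dense k-dimensional vectors that can be passed to clustering:
%   entity i   -> row i of U_k \Sigma_k
%   template j -> row j of V_k \Sigma_k
```

Under this reading, clustering operates on the dense k-dimensional rows instead of the sparse rows or columns of X, which is consistent with the stated gain in clustering quality and efficiency.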
In one possible implementation, as shown in Fig. 3, the method further includes:
Step 301: screen the search log to obtain relevant query statements and expression templates.
In one possible implementation, step 104 may include mining high-frequency query statements and high-frequency expression templates from the relevant query statements and expression templates, according to the query statements of each category and their corresponding expression templates. Specifically, this may include:
Obtaining labeled query statements from the relevant query statements, each labeled query statement including a category label;
Calculating the semantic similarity of two of the labeled query statements according to their word vectors, for example by taking the cosine similarity of the two vectors as the semantic similarity;
If the semantic similarity of the two query statements is greater than a threshold, establishing a link between them;
Building a statement-template relation graph according to the links between query statements and the connections between each query statement and its corresponding expression template;
Calculating a parameter of a random algorithm according to the category label of each labeled query statement and the total number of labeled query statements, for example R value = value of the category label / total number of labeled query statements;
Running the random algorithm on the statement-template relation graph using the above R value, to obtain a ranking of the query statements and their corresponding expression templates;
Selecting the high-frequency query statements and high-frequency query templates according to the ranking result.
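The similarity, linking, and graph-building steps in the list above can be sketched as follows. This is only an illustration under stated assumptions: query vectors are plain lists of floats supplied by some upstream encoder, each query has at most one template, the graph is an undirected adjacency map, and the names `cosine_similarity` and `build_relation_graph` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_relation_graph(query_vectors, query_templates, threshold=0.9):
    """Build an undirected graph over query nodes ("q", text) and template nodes ("t", text)."""
    graph = {("q", q): set() for q in query_vectors}

    def add_edge(a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    # Query-query links: only pairs whose similarity exceeds the threshold.
    for q1, q2 in combinations(query_vectors, 2):
        if cosine_similarity(query_vectors[q1], query_vectors[q2]) > threshold:
            add_edge(("q", q1), ("q", q2))
    # Query-template links: each query connects to its own template.
    for q, template in query_templates.items():
        add_edge(("q", q), ("t", template))
    return graph


vectors = {"a": [1.0, 0.0], "b": [0.99, 0.05], "c": [0.0, 1.0]}
templates = {"a": "* first broadcast", "b": "* first broadcast"}
graph = build_relation_graph(vectors, templates)
```

Tagging nodes as `("q", ...)` or `("t", ...)` keeps query statements and templates in one graph while still letting a later ranking step tell them apart.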
The embodiment of the present invention uses the user search log as the data source, so the resulting high-frequency statements and high-frequency expressions are rich, cover the expression habits of a wide range of users, and can include content, such as colloquial expressions, that manually accumulated templates cannot cover.
In addition, by combining several data mining and artificial intelligence techniques, the embodiment can mine a large-scale vocabulary, mine relevant expression templates, extract user expression templates by clustering, and extract high-frequency user expression templates with a random algorithm, thereby achieving high efficiency, high recall, and high parsing accuracy. It can therefore provide accurate answers, improve user satisfaction, and deliver a good human-machine interaction experience.
Fig. 4 shows a structural block diagram of an information mining device according to an embodiment of the present invention. As shown in Fig. 4, the information mining device may include:
A statement mining module 41, configured to mine the query statements of each specific category from a search log;
An entity specification module 42, configured to specify seed entities for the specific category;
A template generation module 43, configured to generate, according to the seed entities of the specific category and the query statements, the expression template corresponding to each query statement of the specific category;
A high-frequency mining module 44, configured to mine, according to the query statements of each category and their corresponding expression templates, high-frequency query statements and high-frequency expression templates from the search log.
In one possible implementation, the template generation module 43 is further configured to, if a query statement of the specific category contains a seed entity, replace the seed entity with a wildcard to obtain the corresponding expression template.
In one possible implementation, as shown in Fig. 5, the device further includes:
A query word mining module 51, configured to mine entities from the search log using the expression templates, to obtain high-frequency words and/or colloquial words; and/or
A full-set word extraction module 52, configured to extract the full-set words belonging to the specific category from the full data of a selected website.
In one possible implementation, the query word mining module 51 includes:
A dimensionality reduction submodule, configured to perform singular value decomposition (SVD) dimensionality reduction on the expression templates of the mined entities to obtain corresponding feature vectors;
A clustering submodule, configured to cluster the feature vectors corresponding to multiple expression templates to obtain the expression templates that the specific category includes.
In one possible implementation, as shown in Fig. 6, the device further includes: a relevant expression mining module 61, configured to screen the search log to obtain relevant query statements and expression templates.
In one possible implementation, the high-frequency mining module 44 includes:
A labeled statement acquisition submodule, configured to obtain labeled query statements from the relevant query statements, each labeled query statement including a category label;
A similarity calculation submodule, configured to calculate the semantic similarity of two of the labeled query statements according to their word vectors;
A link establishment submodule, configured to establish a link between the two query statements if their semantic similarity is greater than a threshold;
A relation graph establishment submodule, configured to build a statement-template relation graph according to the links between query statements and the connections between each query statement and its corresponding expression template;
A parameter calculation submodule, configured to calculate a parameter of a random algorithm according to the category label of each labeled query statement and the total number of labeled query statements;
A ranking submodule, configured to run the random algorithm on the statement-template relation graph to obtain a ranking of the query statements and their corresponding expression templates;
A high-frequency screening submodule, configured to select the high-frequency query statements and high-frequency query templates according to the ranking result.
For the functions of the modules in the devices of the embodiments of the present invention, refer to the corresponding description of the above method; they are not repeated here.
By combining several data mining and artificial intelligence techniques, the embodiment of the present invention can mine a large-scale vocabulary, mine relevant expression templates, extract user expression templates by clustering, and extract high-frequency user expression templates with a random algorithm (random walk), thereby achieving high efficiency, high recall, and high parsing accuracy.
In one application example, the information mining method of the embodiment of the present invention may include the following parts:
One: large-scale core vocabulary mining. The mined vocabulary may include high-frequency words, colloquial words, and full-set words.
1. Mining high-frequency words and colloquial words:
1.1 Mine all query statements (queries) of the specific category from the search log.
1.2 Specify a small number of seed entities (which can be understood as specific query objects in a given field). If an entity appears in a query, replace it with a wildcard to generate a corresponding expression template. For example, given the seed entity "transformer" and the query statement "transformer online HD viewing", the expression template (pattern) <* online HD viewing> is generated.
1.3 After the previous step yields a large number of expression templates, use them to mine all entities from the search log.
1.4 Use the expression templates as the features of each entity. This feature is very sparse, so clustering on it directly works poorly and is inefficient. Therefore, singular value decomposition (SVD) dimensionality reduction can first be applied to the expression templates, and the reduced feature vectors are then clustered.
2. excavating full dose word: from wikipedia (Wikipedia) or the offline full dose data pick-up certain kinds of Baidupedia
Other all entries carry out the classification made a summary based on Wikipedia;
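Steps 1.2 and 1.3 above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the embodiment: the function names `make_templates` and `mine_entities`, the wildcard string, and the sample queries are assumptions introduced for the example, and exact substring matching stands in for whatever entity matching is actually used.

```python
import re
from collections import Counter

WILDCARD = "<*>"  # assumed wildcard notation, following the "<*>" example in the text


def make_templates(queries, seed_entities):
    """Step 1.2: replace a seed entity found in a query with the wildcard."""
    templates = Counter()
    for query in queries:
        for entity in seed_entities:
            if entity in query:
                templates[query.replace(entity, WILDCARD)] += 1
    return templates


def mine_entities(queries, templates):
    """Step 1.3: match each template against the log and collect the wildcard fillers."""
    found = Counter()
    for template in templates:
        prefix, suffix = template.split(WILDCARD, 1)
        pattern = re.compile("^" + re.escape(prefix) + "(.+)" + re.escape(suffix) + "$")
        for query in queries:
            match = pattern.match(query)
            if match:
                found[match.group(1)] += 1
    return found


# Hypothetical search-log queries, for illustration only.
log_queries = [
    "transformers online hd viewing",
    "titanic online hd viewing",
    "inception online hd viewing",
    "weather today",
]
templates = make_templates(log_queries, ["transformers"])
entities = mine_entities(log_queries, templates)
```

Starting from a single seed entity, the one generated template is enough to pull the other two titles out of the sample log, which is the bootstrapping effect the steps describe.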
Two: mining expression queries across the whole web
Mine the search log (for example, Baidu's web search click log) and retain only the log entries related to clicks on mainstream websites, thereby filtering out the relevant user query statements (queries) and expression templates (which may be collectively referred to as expression modes).
Three: extracting high-frequency expression queries and expression templates
Annotate a batch of queries with class labels. The class label of a query indicates whether the query belongs to the particular category; for example, for the movie category the label is 1 if the query belongs to it and 0 otherwise.
Use label/sum (that is, the label value 0 or 1 divided by the total number of annotated texts) as the R value of each query; the R value is a parameter of the random algorithm.
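As a small worked illustration of the label/sum rule, the sketch below computes an R value for each annotated query. The function name `r_values` and the sample labels are assumptions made for the example.

```python
def r_values(labels):
    """Return label / total for each annotated query (label is 0 or 1)."""
    total = len(labels)
    if total == 0:
        return {}
    return {query: label / total for query, label in labels.items()}


# Hypothetical annotations for the movie category: 1 = belongs, 0 = does not.
annotated = {
    "transformers online hd viewing": 1,
    "titanic online hd viewing": 1,
    "weather today": 0,
    "inception online hd viewing": 1,
}
r = r_values(annotated)
```

With four annotated queries, each query in the category receives 1/4 and each query outside it receives 0.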
Represent the semantics of each query with its word vector (for example, lstm_encoding), and compute the semantic similarity between every two queries using cosine similarity. Build a link between any two queries whose semantic similarity exceeds a threshold, for example 0.9.
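The linking rule can be sketched as follows. The three-dimensional vectors are made-up stand-ins for real query encodings, and the names `cosine` and `build_links` are assumptions for the example; only the cosine measure and the 0.9 threshold come from the text.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_links(vectors, threshold=0.9):
    """Link every pair of queries whose similarity is strictly above the threshold."""
    links = []
    for (q1, v1), (q2, v2) in combinations(vectors.items(), 2):
        similarity = cosine(v1, v2)
        if similarity > threshold:
            links.append((q1, q2, similarity))
    return links


# Toy vectors standing in for the query encodings.
query_vectors = {
    "jobs in chicago": [1.0, 0.1, 0.0],
    "jobs in boston": [0.9, 0.2, 0.0],
    "401k plans": [0.0, 0.1, 1.0],
}
links = build_links(query_vectors)
```

Comparing all pairs is quadratic in the number of queries, so a real log would likely need an approximate nearest-neighbour search instead; the exhaustive loop is used here only for clarity.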
Combine the links between queries with the connections between each query and its expression template (pattern) to build the sentence template relational graph (QQT-Graph for short).
Run a random algorithm (such as random walk) on the QQT-Graph using the R values above to obtain the final ranking of queries and patterns.
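The text does not spell out how the R values enter the random walk. One plausible reading, sketched below purely as an assumption, is a random walk with restart in which the normalized R values form the restart distribution over the nodes. The function name `random_walk_rank`, the damping factor, the iteration count, and the sample graph are all illustrative choices, not details from the embodiment.

```python
def random_walk_rank(edges, restart, damping=0.85, iterations=50):
    """Rank nodes of an undirected weighted graph by random walk with restart.

    edges:   iterable of (node_a, node_b, positive_weight)
    restart: dict mapping node -> R value (nodes not listed count as 0)
    """
    adjacency = {}
    for a, b, weight in edges:
        adjacency.setdefault(a, {})[b] = weight
        adjacency.setdefault(b, {})[a] = weight
    nodes = list(adjacency)

    # Normalize the R values into a restart distribution; fall back to uniform.
    total = sum(restart.get(n, 0.0) for n in nodes)
    if total > 0:
        jump = {n: restart.get(n, 0.0) / total for n in nodes}
    else:
        jump = {n: 1.0 / len(nodes) for n in nodes}

    score = dict(jump)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * jump[n] for n in nodes}
        for n in nodes:
            out_weight = sum(adjacency[n].values())
            for m, weight in adjacency[n].items():
                nxt[m] += damping * score[n] * weight / out_weight
        score = nxt
    return sorted(score.items(), key=lambda item: item[1], reverse=True)


# Hypothetical QQT-Graph: q* are queries, t* are templates.
sample_edges = [
    ("q1", "t1", 1.0),
    ("q2", "t1", 1.0),
    ("q1", "q2", 1.0),
    ("q3", "t2", 1.0),
]
ranking = random_walk_rank(sample_edges, {"q1": 0.5, "q2": 0.5, "q3": 0.0})
```

Under this reading, a template scores highly when it is reachable from many in-category queries, while a component that contains no in-category query (q3 and t2 here) receives no score at all.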
Finally, select the high-frequency expression queries and expression templates and use them for parsing and recall coverage, so as to improve the parsing rate.
Fig. 7 shows the general form of the sentence template relational graph, in which q denotes a query statement, s denotes a link between query statements, and t denotes an expression template. Cqs denotes the weight of the edge connecting s and q, Cs denotes the score of s, and Cq denotes the score of q. Iqt denotes the weight of the edge connecting q and t, Iq denotes the score of q, and It denotes the score of t. In addition, when building the sentence template relational graph, it is also possible not to use s to represent the connection between two query statements, and instead to establish a direct edge between the two query statements q and q.
Fig. 8 shows an exemplary sentence template relational graph. Assume that the query statements in it are as follows:
q1: jobs in chicago
q2: jobs in boston
q3: jobs in microsoft
q4: jobs in motorola
q5: marketing jobs in motorola
q6: 401k plans
q7: illinois employment statistics
By calculating the semantic similarity of the query statements, links can be established between pairs of statements. Examples of links are as follows:
s1: monster.com
s2: motorola.com
s3: us401k.com
Here, the links between q1 and q7, q1 and q2, q1 and q3, and q1 and q6 are s1; the link between q4 and q5 is s2; and the link between q6 and q7 is s3.
Based on the relationship between the previously mined expression templates and the query statements, connections between the templates and the query statements are established. Examples of expression templates are as follows:
t1: jobs in #location
t2: jobs in #company
t3: #category jobs in #company
t4: #location employment statistics
Here, q1 and q2 are connected to t1; q2, q3, and q4 are connected to t2; q5 is connected to t3; and q7 is connected to t4.
In the sentence template relational graph shown in Fig. 8, the numerical values 1, 2, 5, 4, 10, 12, etc. denote the weights of the corresponding edges.
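The connections listed for Fig. 8 can be written down as a small adjacency structure, as in the sketch below. It follows the variant in which two linked query statements are joined by a direct edge. Edge weights are left out deliberately, because the text gives the weight values without saying which edge each belongs to; the helper name `build_graph` is an assumption for the example.

```python
from collections import defaultdict

# Query-to-query links, grouped by the link (s) that connects each pair.
QUERY_LINKS = {
    "s1": [("q1", "q7"), ("q1", "q2"), ("q1", "q3"), ("q1", "q6")],
    "s2": [("q4", "q5")],
    "s3": [("q6", "q7")],
}

# Template-to-query connections as stated in the text.
TEMPLATE_EDGES = {
    "t1": ["q1", "q2"],
    "t2": ["q2", "q3", "q4"],
    "t3": ["q5"],
    "t4": ["q7"],
}


def build_graph(query_links, template_edges):
    """Build an undirected, unweighted adjacency map over queries and templates."""
    graph = defaultdict(set)
    for pairs in query_links.values():
        for a, b in pairs:
            graph[a].add(b)
            graph[b].add(a)
    for template, queries in template_edges.items():
        for query in queries:
            graph[query].add(template)
            graph[template].add(query)
    return graph


graph = build_graph(QUERY_LINKS, TEMPLATE_EDGES)
```

In this structure q1 ("jobs in chicago") is the best-connected query, with four query neighbours and one template, and q6 ("401k plans") is the only query with no template attached.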
The information mining method and device of the embodiments of the present invention have the following clear advantages:
Time and labor costs are saved: machine mining makes it possible to quickly complete the cold-start knowledge mining for a new category.
Multiple data mining and artificial intelligence techniques are combined: a large-scale vocabulary can be mined, related expression templates are mined across the whole web, user expression templates are extracted by clustering, and high-frequency user expression templates are extracted by random walk, thereby achieving high efficiency, high recall, and high parsing accuracy and improving the user experience.
Fig. 9 shows a structural block diagram of an information mining device according to an embodiment of the present invention. As shown in Fig. 9, the device includes a memory 910 and a processor 920, and the memory 910 stores a computer program that can run on the processor 920. When executing the computer program, the processor 920 implements the information mining method of the above embodiments. There may be one or more memories 910 and one or more processors 920.
The device further includes:
a communication interface 930, configured to communicate with external devices for data exchange.
The memory 910 may include high-speed RAM, and may also include non-volatile memory, for example at least one magnetic disk memory.
If the memory 910, the processor 920, and the communication interface 930 are implemented independently, they can be connected to one another through a bus and communicate with one another through it. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 9, but this does not mean that there is only one bus or only one type of bus.
Optionally, in a specific implementation, if the memory 910, the processor 920, and the communication interface 930 are integrated on a single chip, they can communicate with one another through internal interfaces.
An embodiment of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements any of the methods in the above embodiments.
In the description of this specification, a description that refers to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those different embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more, unless otherwise specifically limited.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented with software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logical functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or some of the steps of the methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, may each exist physically on their own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are merely specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.
Claims (14)
1. An information mining method, characterized by comprising:
mining the query statements of each particular category from a search log;
specifying seed entities of the particular category;
generating, according to the seed entities of the particular category and the query statements, an expression template corresponding to each query statement of the particular category;
mining high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates.
2. The method according to claim 1, characterized in that generating, according to the seed entities of the particular category and the query statements, the expression template corresponding to each query statement of the particular category comprises:
if a query statement of the particular category includes a seed entity, replacing the seed entity with a wildcard to obtain the corresponding expression template.
3. The method according to claim 1, characterized by further comprising:
mining various entities from the search log using the expression templates, to obtain high-frequency words and/or colloquial words; and/or
extracting full-volume words belonging to the particular category from the full data of a selected website.
4. The method according to claim 3, characterized by further comprising:
performing singular value decomposition (SVD) dimension reduction on the expression templates of the mined entities to obtain corresponding feature vectors;
clustering the feature vectors corresponding to multiple expression templates to obtain the expression templates included in the particular category.
5. The method according to any one of claims 1 to 4, characterized by further comprising:
screening the search log to obtain relevant query statements and expression templates.
6. The method according to claim 5, characterized in that mining high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates comprises:
obtaining annotated query statements from the relevant query statements, the annotated query statements including class labels;
calculating the semantic similarity of two query statements among the annotated query statements according to their word vectors;
if the semantic similarity of the two query statements is greater than a threshold, establishing a link between them;
establishing a sentence template relational graph according to the links between the query statements and the connections between the query statements and their corresponding expression templates;
calculating a parameter of a random algorithm according to the class label of each annotated query statement and the total number of annotated query statements;
applying the random algorithm to the sentence template relational graph to obtain a ranking of the query statements and their corresponding expression templates;
selecting high-frequency query statements and high-frequency query templates according to the ranking result.
7. An information mining device, characterized by comprising:
a statement mining module, configured to mine the query statements of each particular category from a search log;
an entity specification module, configured to specify seed entities of the particular category;
a template generation module, configured to generate, according to the seed entities of the particular category and the query statements, an expression template corresponding to each query statement of the particular category;
a high-frequency mining module, configured to mine high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates.
8. The device according to claim 7, characterized in that the template generation module is further configured to, if a query statement of the particular category includes a seed entity, replace the seed entity with a wildcard to obtain the corresponding expression template.
9. The device according to claim 7, characterized by further comprising:
a query word mining module, configured to mine various entities from the search log using the expression templates, to obtain high-frequency words and/or colloquial words; and/or
a full-volume word extraction module, configured to extract full-volume words belonging to the particular category from the full data of a selected website.
10. The device according to claim 9, characterized in that the query word mining module comprises:
a dimension reduction submodule, configured to perform singular value decomposition (SVD) dimension reduction on the expression templates of the mined entities to obtain corresponding feature vectors;
a clustering submodule, configured to cluster the feature vectors corresponding to multiple expression templates to obtain the expression templates included in the particular category.
11. The device according to any one of claims 7 to 10, characterized by further comprising:
a related expression mining module, configured to screen the search log to obtain relevant query statements and expression templates.
12. The device according to claim 11, characterized in that the high-frequency mining module comprises:
an annotated statement acquisition submodule, configured to obtain annotated query statements from the relevant query statements, the annotated query statements including class labels;
a similarity calculation submodule, configured to calculate the semantic similarity of two query statements among the annotated query statements according to their word vectors;
a link establishment submodule, configured to establish a link between the two query statements if their semantic similarity is greater than a threshold;
a relational graph establishment submodule, configured to establish a sentence template relational graph according to the links between the query statements and the connections between the query statements and their corresponding expression templates;
a parameter calculation submodule, configured to calculate a parameter of a random algorithm according to the class label of each annotated query statement and the total number of annotated query statements;
a sorting submodule, configured to apply the random algorithm to the sentence template relational graph to obtain a ranking of the query statements and their corresponding expression templates;
a high-frequency screening submodule, configured to select high-frequency query statements and high-frequency query templates according to the ranking result.
13. An information mining device, characterized by comprising:
one or more processors;
a storage device, configured to store one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 6.
14. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810716210.9A CN109033076A (en) | 2018-06-29 | 2018-06-29 | information mining method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109033076A true CN109033076A (en) | 2018-12-18 |
Family
ID=65521476
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810716210.9A Pending CN109033076A (en) | 2018-06-29 | 2018-06-29 | information mining method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033076A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102419778A (en) * | 2012-01-09 | 2012-04-18 | 中国科学院软件研究所 | Information searching method for mining and clustering sub-topics of query sentences |
CN103425714A (en) * | 2012-05-25 | 2013-12-04 | 北京搜狗信息服务有限公司 | Query method and system |
-
2018
- 2018-06-29 CN CN201810716210.9A patent/CN109033076A/en active Pending
Non-Patent Citations (1)
Title |
---|
伍大勇: "搜索引擎中命名实体查询处理相关技术研究", 《中国博士学位论文全文数据库》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110990451A (en) * | 2019-11-15 | 2020-04-10 | 浙江大华技术股份有限公司 | Data mining method, device and equipment based on sentence embedding and storage device |
CN110990451B (en) * | 2019-11-15 | 2023-05-12 | 浙江大华技术股份有限公司 | Sentence embedding-based data mining method, device, equipment and storage device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181218 |