CN103870446B - Descriptor screening method and device - Google Patents
- Publication number: CN103870446B
- Application number: CN201210551720.8A
- Authority: CN (China)
- Prior art keywords: descriptor, statistical value, dictionary, business object, description information
- Legal status: Active (the listed status is an assumption and is not a legal conclusion)
- Landscapes: Information Retrieval, Db Structures And Fs Structures Therefor
Abstract
This application discloses a descriptor screening method and device. For each business object among multiple business objects, the method extracts from the title content of the business object the descriptors that are present in a descriptor dictionary, and determines whether each extracted descriptor also exists in the specified description information of that business object. If it does, a first statistical value corresponding to the descriptor is updated by a set increment; if it does not, a second statistical value corresponding to the descriptor is updated by the set increment. After these statistics have been collected for every business object, the descriptors in the descriptor dictionary are screened according to the first statistical value and the second statistical value corresponding to each descriptor, yielding an updated descriptor dictionary. The scheme provided by the embodiments of this application improves the accuracy with which the descriptors of business objects are determined.
Description
Technical field
This application relates to the fields of Internet technology and computer technology, and in particular to a descriptor screening method and device.
Background
In existing Internet technology, a website typically publishes business objects for logged-in users to browse and to perform further operations on. For example, on an e-commerce website, a business object may be a product published by a seller, and the information of the business object may be description information for the product's various features, such as its type, price, performance and brand. A user logged in to the e-commerce website can browse the information of a published product to learn its details, and can then bookmark it, buy it or recommend it to other users. On a community website, a business object may be a post published by a community user, and the information of the business object may be the post's description information, its content and so on. A user logged in to the community website can browse the information of a published post to learn its details, and can then bookmark it, reply to it or recommend it to other users.
In practice, the description information of a business object may be entered by the provider of the business object when publishing it. For various reasons, such as operating errors or insufficient knowledge of the business object, the description information that a provider enters may be inaccurate. For example, when entering brand information, a provider who is unfamiliar with the exact brand name of the business object, or who misidentifies it, may enter a brand word that does not correspond to any real brand. If a brand word list extracted from such erroneous brand information is then used in brand recognition for business objects, the recognition results will also be inaccurate and will need to be corrected afterwards, which wastes processing resources and reduces the efficiency of brand recognition.
Summary of the invention
In view of this, the embodiments of this application provide a descriptor screening method and device to solve the problem in the prior art that the descriptors of business objects are determined inaccurately.
The embodiments of this application are implemented through the following technical solutions:
An embodiment of this application provides a descriptor screening method, including:
For each business object among multiple business objects, performing the following step A and step B:
Step A: based on the descriptors included in a descriptor dictionary, extracting from the title content of the business object the descriptors that are present in the descriptor dictionary;
Step B: determining whether each extracted descriptor exists in the specified description information of the business object; if it exists, updating the first statistical value corresponding to the descriptor by a set increment; if it does not exist, updating the second statistical value corresponding to the descriptor by the set increment;
After step A and step B have been performed for each business object among the multiple business objects, screening the descriptors included in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to each descriptor, to obtain an updated descriptor dictionary.
An embodiment of this application also provides a descriptor screening device, including:
A first extracting unit, configured to, for each business object among multiple business objects, extract from the title content of the business object the descriptors that are present in a descriptor dictionary, based on the descriptors included in the descriptor dictionary;
A statistic unit, configured to determine whether each extracted descriptor exists in the specified description information of the business object; if it exists, to update the first statistical value corresponding to the descriptor by a set increment; if it does not exist, to update the second statistical value corresponding to the descriptor by the set increment;
A screening unit, configured to, after each business object among the multiple business objects has been processed by the first extracting unit and the statistic unit, screen the descriptors included in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to each descriptor, to obtain an updated descriptor dictionary.
In at least one of the above technical solutions provided by the embodiments of this application, the descriptors included in the descriptor dictionary are screened as follows. First, for each business object among multiple business objects, the descriptors present in the descriptor dictionary are extracted from the title content of the business object. It is then determined whether each extracted descriptor exists in the specified description information of the business object: if it does, the first statistical value corresponding to the descriptor is updated by the set increment; if it does not, the second statistical value corresponding to the descriptor is updated by the set increment. When a descriptor exists in both the title content and the specified description information of a business object, this indicates that the descriptor is accurate to some extent. Conversely, when a descriptor exists only in the title content of a business object and not in its specified description information, this indicates that the descriptor is inaccurate to some extent. Therefore, after these statistics have been collected for all of the business objects, each descriptor in the descriptor dictionary has a first statistical value and a second statistical value; the larger the first statistical value, the more accurate the descriptor, and the larger the second statistical value, the less accurate the descriptor. Screening the descriptors in the descriptor dictionary according to these two values removes inaccurate descriptors and yields an updated descriptor dictionary whose descriptors are more accurate, which improves the accuracy of the determined descriptors.
Other features and advantages of this application will be set forth in the description that follows, and will in part become apparent from the description or be understood by implementing this application. The objectives and other advantages of this application can be realized and obtained through the structures specifically pointed out in the written description, the claims and the accompanying drawings.
Brief description of the drawings
The accompanying drawings provide a further understanding of this application and form part of the description. Together with the embodiments of this application they serve to explain this application and do not limit it. In the drawings:
Fig. 1 is a flow chart of the descriptor screening method provided by the embodiments of this application;
Fig. 2 is a flow chart of the descriptor screening method provided in Embodiment 1 of this application;
Fig. 3 is a flow chart of the descriptor identification processing provided in Embodiment 1 of this application;
Fig. 4 is a schematic structural diagram of the descriptor screening device provided in Embodiment 2 of this application.
Detailed description of the invention
To improve the accuracy with which the descriptors of business objects are determined, the embodiments of this application provide a descriptor screening method and device. This technical solution can be applied when determining the descriptor dictionary of business objects, and can be implemented either as a method or as a device. The preferred embodiments of this application are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here only illustrate and explain this application and do not limit it. Where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another.
An embodiment of this application provides a descriptor screening method which, as shown in Fig. 1, includes:
For each business object among multiple business objects, performing the following step 101 and step 102:
Step 101: based on the descriptors included in the descriptor dictionary, extract from the title content of the business object the descriptors that are present in the descriptor dictionary.
Step 102: determine whether each extracted descriptor exists in the specified description information of the business object; if it exists, update the first statistical value corresponding to the descriptor by the set increment; if it does not exist, update the second statistical value corresponding to the descriptor by the set increment.
Step 103: after step 101 and step 102 have been performed for each business object among the multiple business objects, screen the descriptors included in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to each descriptor, to obtain an updated descriptor dictionary.
The descriptors included in the descriptor dictionary may consist of the descriptors that occur in the specified description information of the multiple business objects.
Further, in the above method, after the updated descriptor dictionary is obtained, the descriptor screening shown in Fig. 1 may be applied again to the updated descriptor dictionary, so that the descriptors it includes are screened once more and their accuracy is further improved.
Further, in the above method, after the updated descriptor dictionary is obtained, descriptor identification processing can be performed on a business object based on the descriptors included in the updated descriptor dictionary, in order to supplement the specified description information of the business object or to correct inaccurate descriptors in it. For a business object to be processed, this may specifically include:
Based on the descriptors included in the updated descriptor dictionary, extracting from the title content of the business object to be processed the descriptors that are present in the updated descriptor dictionary;
When an extracted descriptor does not exist in the specified description information of the business object to be processed, adding the extracted descriptor to the specified description information of the business object to be processed, or replacing the descriptor in the specified description information of the business object to be processed with the extracted descriptor.
The method and device provided by this application are described in detail below with specific embodiments, with reference to the accompanying drawings.
Embodiment 1:
Fig. 2 is a flow chart of the descriptor screening method provided in Embodiment 1 of this application, which specifically includes the following processing steps:
Step 201: obtain the title content of each business object among multiple business objects, and the specified description information of each business object among the multiple business objects.
The multiple business objects may belong to the same category. For example, on an e-commerce website, the multiple business objects may belong to the same product category, such as men's clothing, women's clothing or mobile phones.
The specified description information may be attribute information of the business object. For example, when the business object is a commodity, the specified description information may be the brand information of the commodity. The specified description information may correspond to the type of descriptor to be screened subsequently; for example, when the descriptors to be screened are brand words, the specified description information may be brand information.
Step 202: determine the descriptors in the specified description information of the multiple business objects.
Step 203: form the descriptor dictionary from the descriptors that occur in the specified description information of the multiple business objects.
In this step, the number of times each descriptor in the descriptor dictionary occurs in the specified description information of the multiple business objects may also be counted. This count can be used in the subsequent screening of the descriptors in the descriptor dictionary.
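Steps 202 and 203 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each business object is a dict whose hypothetical `brand` field holds the specified description information as a single descriptor; the field names and sample data are not from the patent.

```python
from collections import Counter

def build_descriptor_dictionary(business_objects):
    """Collect descriptors from the specified description field and count them.

    Returns a Counter mapping each descriptor to the number of business
    objects whose specified description information contains it.
    The "brand" field name is a hypothetical choice for this sketch.
    """
    counts = Counter()
    for obj in business_objects:
        descriptor = obj.get("brand", "").strip()
        if descriptor:  # skip objects with empty description information
            counts[descriptor] += 1
    return counts

objects = [
    {"title": "Acme running shoes", "brand": "Acme"},
    {"title": "Acme trail shoes", "brand": "Acme"},
    {"title": "Zenit sneakers", "brand": "Zenti"},  # mistyped brand entry
    {"title": "Plain sandals", "brand": ""},
]
dictionary = build_descriptor_dictionary(objects)
```

The initial dictionary deliberately still contains the mistyped entry; removing such entries is the job of the later screening step.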
After the initial descriptor dictionary is obtained, each business object among the multiple business objects is taken in turn as the business object to be processed, and the following steps 204 to 207 are performed.
Step 204: based on the descriptors included in the descriptor dictionary, extract from the title content of the business object currently being processed the descriptors that are present in the descriptor dictionary.
In this step, the title content of the business object currently being processed may be segmented into words, and each resulting word checked for presence in the descriptor dictionary.
Alternatively, for each descriptor in the descriptor dictionary, this step may determine whether the descriptor exists in the title content of the business object currently being processed. In this case, the title content does not need to be segmented into words.
Preferably, this step may be implemented with the Aho-Corasick algorithm. Aho-Corasick is a dictionary-based multi-pattern string matching algorithm that uses a finite state machine built on a Trie-like tree structure.
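As an illustration of how Aho-Corasick matching could be used for the extraction in step 204, the following is a compact, self-contained Python sketch. It is a textbook-style construction and not the patent's implementation; the function names `build_automaton` and `find_descriptors` are hypothetical.

```python
from collections import deque

def build_automaton(words):
    """Build goto/fail/output tables for a set of dictionary words."""
    goto, fail, out = [{}], [0], [[]]  # node 0 is the root
    for word in words:
        if not word:
            continue  # ignore empty descriptors
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(word)
    # Breadth-first pass: compute failure links and merge outputs.
    queue = deque(goto[0].values())  # depth-1 nodes fail to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_descriptors(title, automaton):
    """Return the set of dictionary words occurring anywhere in the title."""
    goto, fail, out = automaton
    found, node = set(), 0
    for ch in title:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        found.update(out[node])
    return found

automaton = build_automaton(["acme", "me", "zenit"])
matches = find_descriptors("acme shoes", automaton)
```

The title is scanned once regardless of dictionary size. Note that this sketch matches raw substrings, so a short descriptor such as "me" is reported inside "acme"; case folding and word-boundary checks would need to be added where that matters.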
Step 205: for each descriptor determined in step 204, that is, a descriptor that exists in the descriptor dictionary and also in the title content of the business object currently being processed, determine whether the descriptor exists in the specified description information of the business object currently being processed. If it exists, go to step 206; if it does not exist, go to step 207.
Step 206: update the first statistical value corresponding to the descriptor by the set increment.
That is, the sum of the set increment and the descriptor's previous first statistical value is taken as the descriptor's updated first statistical value.
When statistics are later collected for other business objects according to steps 204 to 207, the updated first statistical value obtained in this step is carried into the next statistical calculation for the descriptor. That is, the next update of the descriptor's first statistical value is made on the basis of the updated first statistical value obtained in this step.
In other words, the value of the descriptor's first statistical value before an update is the value left by the previous update. When statistics are collected for the descriptor for the first time, its first statistical value is an initial value, which may be set to 0. The set increment may be set to 1.
Step 207: update the second statistical value corresponding to the descriptor by the set increment.
That is, the sum of the set increment and the descriptor's previous second statistical value is taken as the descriptor's updated second statistical value.
When statistics are later collected for other business objects according to steps 204 to 207, the updated second statistical value obtained in this step is carried into the next statistical calculation for the descriptor. That is, the next update of the descriptor's second statistical value is made on the basis of the updated second statistical value obtained in this step.
In other words, the value of the descriptor's second statistical value before an update is the value left by the previous update. When statistics are collected for the descriptor for the first time, its second statistical value is an initial value, which may be set to 0. The set increment may be set to 1.
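The counting pass of steps 204 to 207 can be sketched as follows. This is a minimal illustration under stated assumptions: plain substring tests stand in for the extraction step and for the "exists in the specified description information" check, and the `title` and `brand` field names are hypothetical.

```python
from collections import defaultdict

SET_INCREMENT = 1  # the set increment; the text suggests 1

def collect_statistics(business_objects, dictionary):
    """Return (first, second) statistical values per descriptor.

    first[d]  grows when d is in both the title and the description info.
    second[d] grows when d is in the title but not in the description info.
    Both start from an initial value of 0.
    """
    first = defaultdict(int)
    second = defaultdict(int)
    for obj in business_objects:
        for descriptor in dictionary:
            if descriptor not in obj["title"]:
                continue  # step 204: only descriptors found in the title
            if descriptor in obj["brand"]:          # step 205
                first[descriptor] += SET_INCREMENT  # step 206
            else:
                second[descriptor] += SET_INCREMENT  # step 207
    return first, second

objects = [
    {"title": "Acme running shoes", "brand": "Acme"},
    {"title": "Acme Sport trail shoes", "brand": "Acme"},
    {"title": "Sport socks", "brand": "Generic"},
]
first, second = collect_statistics(objects, {"Acme", "Sport"})
```

Here "Sport" appears in two titles but never as a brand, so only its second statistical value grows, which marks it as a likely non-brand word.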
Step 208: after steps 204 to 207 have been performed for each business object among the multiple business objects, each descriptor in the descriptor dictionary has a corresponding first statistical value and second statistical value. In this step, the descriptors in the descriptor dictionary are screened according to the first statistical value and the second statistical value corresponding to each descriptor, to obtain an updated descriptor dictionary. Specifically, one of the following approaches may be used:
First approach: first, determine a comprehensive score for each descriptor in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to that descriptor;
The higher the first statistical value corresponding to a descriptor, the more accurate the descriptor, so the comprehensive score may increase as the first statistical value increases. Conversely, the higher the second statistical value corresponding to a descriptor, the less accurate the descriptor, so the comprehensive score may decrease as the second statistical value increases;
The comprehensive score can be calculated in various ways as required. For example, it may be the first statistical value minus the second statistical value, or the ratio of the first statistical value to the sum of the first statistical value and the second statistical value;
Then, screen the descriptors in the descriptor dictionary according to their comprehensive scores. For example, remove from the descriptor dictionary the descriptors whose comprehensive score is below a preset score threshold and retain those whose comprehensive score is not below the preset score threshold, thereby obtaining the updated descriptor dictionary.
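The two example scores mentioned for the first approach, the difference and the ratio, can be sketched in Python as below. The threshold values, the sample statistics and the handling of a zero denominator are illustrative assumptions, not values from the patent.

```python
def difference_score(first, second):
    """First statistical value minus second statistical value."""
    return first - second

def ratio_score(first, second):
    """Share of the first statistical value in the sum of both values."""
    total = first + second
    return first / total if total else 0.0  # assumption: no data scores 0

def screen_by_score(stats, score_fn, threshold):
    """Keep descriptors whose score is not below the preset threshold.

    stats maps descriptor -> (first_value, second_value).
    """
    return {d for d, (p, n) in stats.items() if score_fn(p, n) >= threshold}

stats = {"Acme": (8, 2), "Sport": (1, 9)}
kept_by_difference = screen_by_score(stats, difference_score, 0)
kept_by_ratio = screen_by_score(stats, ratio_score, 0.5)
```

The ratio is bounded between 0 and 1, which makes a threshold easier to choose across categories of different size, while the difference rewards descriptors with more supporting observations.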
Second approach: from the descriptors in the descriptor dictionary, select those whose first statistical value satisfies a first preset statistical value condition and whose second statistical value satisfies a second preset statistical value condition, and form the updated descriptor dictionary from them;
The first preset statistical value condition and the second preset statistical value condition can be set flexibly according to actual needs. For example, because a higher first statistical value indicates a more accurate descriptor and a higher second statistical value indicates a less accurate descriptor, the first preset statistical value condition may be that the first statistical value is not less than a first preset statistical value threshold, and the second preset statistical value condition may be that the second statistical value is less than a second preset statistical value threshold.
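The second approach amounts to two independent threshold tests. A minimal sketch, with hypothetical parameter names and illustrative threshold values:

```python
def screen_by_conditions(stats, min_first, max_second):
    """Keep descriptors with first value >= min_first and second value < max_second.

    stats maps descriptor -> (first_value, second_value).
    """
    return {
        d for d, (first, second) in stats.items()
        if first >= min_first and second < max_second
    }

stats = {"Acme": (8, 2), "Sport": (5, 9), "Rare": (1, 0)}
kept = screen_by_conditions(stats, min_first=3, max_second=5)
```

In this example "Sport" fails the second condition and "Rare" fails the first, so only "Acme" is retained.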
When step 203 has also counted the number of times each descriptor in the descriptor dictionary occurs in the specified description information of the multiple business objects, this step may screen the descriptors in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to each descriptor together with that number of occurrences, to obtain an updated descriptor dictionary. Specifically, one of the following approaches may be used:
Third approach: first, determine a comprehensive score for each descriptor in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to that descriptor and the number of times the descriptor occurs in the specified description information of the multiple business objects;
The higher the first statistical value corresponding to a descriptor, the more accurate the descriptor, so the comprehensive score may increase as the first statistical value increases. Conversely, the higher the second statistical value corresponding to a descriptor, the less accurate the descriptor, so the comprehensive score may decrease as the second statistical value increases. In addition, the more often a descriptor occurs in the specified description information of the multiple business objects, the more accurate the descriptor, so the comprehensive score may increase as this number of occurrences increases;
The comprehensive score can be calculated in various ways as required. For example, it may be the number of occurrences plus the first statistical value minus the second statistical value, or a weighted sum of the number of occurrences and the difference between the first statistical value and the second statistical value;
Preferably, the embodiments of this application propose calculating the comprehensive score of a descriptor with the following formula:
Score = log(C + n1) + ((P + n2) / (N + n2) + 1) / Th;
where Score is the comprehensive score of a descriptor, C is the number of times the descriptor occurs in the specified description information of the multiple business objects, P is the first statistical value corresponding to the descriptor, N is the second statistical value corresponding to the descriptor, and Th is an adjustment threshold. n1 and n2 are smoothing coefficients whose purpose is to smooth the data; for example, n1 may be set to 2 and n2 may be set to 1. The adjustment threshold Th can be set flexibly according to actual needs and the actual statistics, and is used to filter out noise;
Then, screen the descriptors in the descriptor dictionary according to their comprehensive scores. For example, remove from the descriptor dictionary the descriptors whose comprehensive score is below a preset score threshold and retain those whose comprehensive score is not below the preset score threshold, thereby obtaining the updated descriptor dictionary.
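The preferred formula and the threshold screening that follows it can be sketched as below. The sketch assumes the natural logarithm, since the text does not state the base, and the values chosen for Th and the score threshold are purely illustrative.

```python
import math

def comprehensive_score(c, p, n, th, n1=2, n2=1):
    """Score = log(C + n1) + ((P + n2) / (N + n2) + 1) / Th.

    c: occurrences in the specified description information
    p: first statistical value, n: second statistical value
    th: adjustment threshold; n1, n2: smoothing coefficients
    Assumption: natural logarithm (the base is not given in the text).
    """
    return math.log(c + n1) + ((p + n2) / (n + n2) + 1) / th

def screen(stats, th, score_threshold):
    """Keep descriptors whose comprehensive score is not below the threshold.

    stats maps descriptor -> (C, P, N).
    """
    return {
        d for d, (c, p, n) in stats.items()
        if comprehensive_score(c, p, n, th) >= score_threshold
    }

stats = {"Acme": (10, 8, 2), "Sport": (10, 2, 8)}
kept = screen(stats, th=2.0, score_threshold=4.0)
```

With n2 = 1 the ratio term stays defined even when the second statistical value is 0, and with n1 = 2 the logarithm stays positive even for a descriptor that never occurs in the specified description information.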
Fourth approach: from the descriptors in the descriptor dictionary, select those whose first statistical value satisfies a first preset statistical value condition, whose second statistical value satisfies a second preset statistical value condition, and whose number of occurrences in the specified description information of the multiple business objects satisfies a preset count condition, and form the updated descriptor dictionary from them;
The first preset statistical value condition, the second preset statistical value condition and the preset count condition can be set flexibly according to actual needs. For example, a higher first statistical value indicates a more accurate descriptor, a higher second statistical value indicates a less accurate descriptor, and a larger number of occurrences in the specified description information of the multiple business objects indicates a more accurate descriptor. Accordingly, the first preset statistical value condition may be that the first statistical value is not less than a first preset statistical value threshold; the second preset statistical value condition may be that the second statistical value is less than a second preset statistical value threshold; and the preset count condition may be that the number of occurrences in the specified description information of the multiple business objects is not less than a preset count threshold.
With the descriptor screening method provided in Embodiment 1 of this application, when a descriptor exists in both the title content and the specified description information of a business object, this indicates that the descriptor is accurate to some extent. Conversely, when a descriptor exists only in the title content of a business object and not in its specified description information, this indicates that the descriptor is inaccurate to some extent. Therefore, after statistics have been collected for all of the business objects, each descriptor in the descriptor dictionary has a first statistical value and a second statistical value; the larger the first statistical value, the more accurate the descriptor, and the larger the second statistical value, the less accurate the descriptor. Screening the descriptors in the descriptor dictionary according to these two values removes inaccurate descriptors and yields an updated descriptor dictionary whose descriptors are more accurate, which improves the accuracy of the determined descriptors.
In the embodiments of this application, after the updated descriptor dictionary is obtained through the above descriptor screening method, the descriptors in the updated descriptor dictionary may also be sorted for display. They may be sorted in descending order of the comprehensive score determined by the third approach above.
In the embodiments of this application, after the updated descriptor dictionary is obtained through the above descriptor screening method, descriptor identification processing may also be performed on a business object based on the descriptors included in the updated descriptor dictionary, in order to supplement the specified description information of the business object or to correct inaccurate descriptors in it. For a business object to be processed, as shown in Fig. 3, this may specifically include the following processing steps:
Step 301: Based on the descriptors included in the updated descriptor dictionary, extract from the title content of the pending business object the descriptors that are present in the updated descriptor dictionary.
This step may perform word segmentation on the title content of the business object and determine whether each resulting token is present in the updated descriptor dictionary.
Alternatively, for each descriptor in the updated descriptor dictionary, this step may determine whether that descriptor is present in the title content of the business object, in which case the title content does not need to be segmented.
Preferably, this step may be implemented using the Aho-Corasick algorithm.
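The application names the Aho-Corasick algorithm but gives no implementation. The following is a minimal, self-contained Python sketch of that algorithm applied to title extraction; the class name, method names, and example words are hypothetical, and matching here is case-sensitive and character-based, which is an assumption.

```python
from collections import deque


class AhoCorasick:
    """Finds every dictionary word that occurs in a text in one pass."""

    def __init__(self, words):
        self.goto = [{}]   # per-node transitions: character -> child node
        self.fail = [0]    # per-node failure link
        self.out = [[]]    # per-node list of words ending at this node
        for word in words:
            if word:       # skip empty strings, which would match everywhere
                self._add(word)
        self._build_failure_links()

    def _add(self, word):
        node = 0
        for ch in word:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(word)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit the words that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find(self, text):
        """Return the distinct dictionary words present in text."""
        found, node = [], 0
        for ch in text:
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.out[node]:
                if word not in found:
                    found.append(word)
        return found
```

The automaton is built once from the descriptor dictionary and then reused for every title, so no word segmentation of the title content is needed.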
Step 302: Determine whether the extracted descriptor is present in the specified description information of the pending business object. If it is not, go to step 303; if it is, go to step 304.
Step 303: Update the specified description information of the pending business object according to the extracted descriptor.
Specifically, the extracted descriptor may be added to the specified description information of the pending business object, or may replace a descriptor in that specified description information.
Specifically, if the specified description information of the pending business object is empty, the extracted descriptor may be added to it. If it is not empty and the extracted descriptor is similar to a descriptor in it, for example the two share identical characters, the extracted descriptor may replace that descriptor.
Step 304: Keep the specified description information of the pending business object unchanged.
Because the descriptors in the updated descriptor dictionary are more accurate, performing descriptor recognition on business objects based on the updated dictionary improves the accuracy of recognition and avoids later correction of the recognition results, which reduces wasted processing resources and improves the efficiency of descriptor recognition.
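Steps 301 to 304 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the specified description information is modeled as a list of descriptors, step 301 uses plain substring matching, and similarity is taken to mean sharing a whitespace-separated token. Adding the descriptor when the information is non-empty and nothing similar is found is also an assumption, since the application leaves that case open. All names are hypothetical.

```python
def _tokens(text):
    """Lower-cased whitespace-separated tokens (assumed notion of 'identical words')."""
    return set(text.lower().split())


def is_similar(a, b):
    """Assumed similarity test: the two descriptors share at least one token."""
    return bool(_tokens(a) & _tokens(b))


def update_description(title, description, dictionary):
    """Return the specified description information after descriptor recognition.

    title: title content of the pending business object
    description: list of descriptors currently in its specified description information
    dictionary: iterable of descriptors from the updated descriptor dictionary
    """
    updated = list(description)
    for word in dictionary:
        if word not in title:      # step 301: keep only descriptors found in the title
            continue
        if word in updated:        # step 302 -> step 304: already present, keep unchanged
            continue
        # Step 303: replace a similar descriptor if there is one, otherwise add.
        for i, old in enumerate(updated):
            if is_similar(old, word):
                updated[i] = word
                break
        else:
            updated.append(word)
    return updated
```

For example, an empty description for the title "Nike Air running shoes" becomes ["Nike"], while a description that already contains "Nike" is returned unchanged.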
Embodiment 2:
Based on the same inventive concept as the descriptor screening method provided in the above embodiment, Embodiment 2 of the application correspondingly provides a descriptor screening device. Its structure is shown schematically in Fig. 4, and it specifically includes:
A first extraction unit 401, configured to, for each business object among multiple business objects, extract from the title content of the business object, based on the descriptors included in a descriptor dictionary, the descriptors that are present in the descriptor dictionary;
A statistics unit 402, configured to determine whether an extracted descriptor is present in the specified description information of the business object, and, if it is, update the first statistical value corresponding to that descriptor by a set increment, or, if it is not, update the second statistical value corresponding to that descriptor by the set increment;
A screening unit 403, configured to, after every business object among the multiple business objects has been processed by the first extraction unit and the statistics unit, screen the descriptors included in the descriptor dictionary according to the first and second statistical values corresponding to each descriptor, to obtain an updated descriptor dictionary.
Further, the device also includes:
A dictionary determination unit 404, configured to determine the descriptors in the specified description information of the multiple business objects, and to form the descriptor dictionary from the descriptors that appear in the specified description information of the multiple business objects.
Further, the device also includes:
A count determination unit 405, configured to count, for each descriptor included in the descriptor dictionary, the number of times it appears in the specified description information of the multiple business objects;
The screening unit 403 is specifically configured to screen the descriptors included in the descriptor dictionary according to the first and second statistical values corresponding to each descriptor and the number of times each descriptor appears in the specified description information of the multiple business objects.
Further, the screening unit 403 is specifically configured to determine a comprehensive score for each descriptor included in the descriptor dictionary according to the first and second statistical values corresponding to that descriptor, and to screen the descriptors included in the descriptor dictionary according to how high their comprehensive scores are.
Further, the screening unit 403 is specifically configured to select, from the descriptors included in the descriptor dictionary, those whose corresponding first statistical value satisfies a first preset statistical value condition and whose corresponding second statistical value satisfies a second preset statistical value condition, and to form the updated descriptor dictionary from them.
Further, the device also includes:
A second extraction unit 406, configured to extract, based on the descriptors included in the updated descriptor dictionary, the descriptors that are present in the updated descriptor dictionary from the title content of a pending business object;
A descriptor supplementing unit 407, configured to, when an extracted descriptor is not present in the specified description information of the pending business object, add the extracted descriptor to that specified description information, or replace a descriptor in that specified description information with the extracted descriptor.
The functions of the above units correspond to the respective processing steps in the flows shown in Fig. 1 to Fig. 3 and are not repeated here.
In summary, the scheme provided by the embodiments of the application includes: for each business object among multiple business objects, extracting from the title content of the business object, based on the descriptors included in a descriptor dictionary, the descriptors that are present in the descriptor dictionary, and determining whether each extracted descriptor is present in the specified description information of the business object; if it is, updating the first statistical value corresponding to that descriptor by a set increment; if it is not, updating the second statistical value corresponding to that descriptor by the set increment. After these statistics have been collected for every business object among the multiple business objects, the descriptors included in the descriptor dictionary are screened according to the first and second statistical values corresponding to each descriptor, to obtain an updated descriptor dictionary. The scheme provided by the embodiments of the application improves the accuracy with which descriptors are determined for business objects.
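The statistics collection and screening summarized above can be sketched in Python as follows. This is a minimal illustration, not the application's implementation: business objects are modeled as (title, descriptors) pairs, extraction uses plain substring matching, and the two preset statistical value conditions are assumed to be a lower bound on the first statistical value and an upper bound on the second. All names and threshold values are hypothetical.

```python
from collections import Counter


def collect_statistics(objects, dictionary, increment=1):
    """Return (first, second) statistical values per descriptor.

    objects: iterable of (title, description) pairs, where description is the
             collection of descriptors in the specified description information
    dictionary: iterable of descriptors in the descriptor dictionary
    """
    first, second = Counter(), Counter()
    dictionary = list(dictionary)
    for title, description in objects:
        for word in dictionary:
            if word not in title:          # descriptor not extracted from this title
                continue
            if word in description:        # title and description agree
                first[word] += increment
            else:                          # descriptor in title but not in description
                second[word] += increment
    return first, second


def screen(dictionary, first, second, min_first=1, max_second=2):
    """Keep descriptors meeting both assumed preset conditions."""
    return [w for w in dictionary
            if first[w] >= min_first and second[w] <= max_second]
```

A word such as "new", which appears in many titles without being recorded in the specified description information, accumulates a large second statistical value and is removed by the screening.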
The screening device provided in the embodiments of the application may be implemented by a computer program. Those skilled in the art should understand that the module division described above is only one of many possible module divisions; whether the device is divided into other modules or not divided into modules at all, it falls within the protection scope of the application as long as the screening device has the functions described above.
The application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the application without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the application and their technical equivalents, the application is also intended to include them.
Claims (15)
1. A descriptor screening method, characterized by comprising:
for each business object among multiple business objects, performing the following step A and step B:
step A: based on the descriptors included in a descriptor dictionary, extracting from the title content of the business object the descriptors that are present in the descriptor dictionary;
step B: determining whether an extracted descriptor is present in the specified description information of the business object; if it is, updating the first statistical value corresponding to the extracted descriptor by a set increment; if it is not, updating the second statistical value corresponding to the extracted descriptor by the set increment;
after step A and step B have been performed for every business object among the multiple business objects, screening the descriptors included in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to each descriptor, to obtain an updated descriptor dictionary, wherein a larger corresponding first statistical value indicates that the descriptor is more accurate, and a larger corresponding second statistical value indicates that the descriptor is less accurate.
2. The method of claim 1, characterized in that the descriptor dictionary is determined by:
determining the descriptors in the specified description information of the multiple business objects;
forming the descriptor dictionary from the descriptors that appear in the specified description information of the multiple business objects.
3. The method of claim 2, characterized by further comprising:
counting, for each descriptor included in the descriptor dictionary, the number of times it appears in the specified description information of the multiple business objects;
wherein screening the descriptors included in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to each descriptor is specifically:
screening the descriptors included in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to each descriptor and the number of times each descriptor appears in the specified description information of the multiple business objects.
4. The method of claim 1, characterized in that screening the descriptors included in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to each descriptor specifically comprises:
determining a comprehensive score for each descriptor included in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to that descriptor; and screening the descriptors included in the descriptor dictionary according to how high their comprehensive scores are.
5. The method of claim 4, characterized in that the comprehensive score is calculated using the following formula:
Score = log(C + n1) + ((P + n2)/(N + n2) + 1)/Th;
wherein Score is the comprehensive score of a descriptor, C is the number of times the descriptor appears in the specified description information of the multiple business objects, P is the first statistical value corresponding to the descriptor, N is the second statistical value corresponding to the descriptor, Th is an adjustment threshold, and n1 and n2 are smoothing adjustment coefficients.
6. The method of claim 1, characterized in that screening the descriptors included in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to each descriptor specifically comprises:
selecting, from the descriptors included in the descriptor dictionary, the descriptors whose corresponding first statistical value satisfies a first preset statistical value condition and whose corresponding second statistical value satisfies a second preset statistical value condition, to form the updated descriptor dictionary.
7. The method of claim 1, characterized by further comprising:
based on the descriptors included in the updated descriptor dictionary, extracting from the title content of a pending business object the descriptors that are present in the updated descriptor dictionary;
when an extracted descriptor is not present in the specified description information of the pending business object, adding the extracted descriptor to the specified description information of the pending business object, or replacing a descriptor in the specified description information of the pending business object with the extracted descriptor.
8. The method of any one of claims 1-7, characterized in that the multiple business objects are multiple business objects belonging to the same category.
9. The method of any one of claims 1-7, characterized in that each descriptor included in the descriptor dictionary is a brand word, and the specified description information is brand information.
10. A descriptor screening device, characterized by comprising:
a first extraction unit, configured to, for each business object among multiple business objects, extract from the title content of the business object, based on the descriptors included in a descriptor dictionary, the descriptors that are present in the descriptor dictionary;
a statistics unit, configured to determine whether an extracted descriptor is present in the specified description information of the business object, and, if it is, update the first statistical value corresponding to the extracted descriptor by a set increment, or, if it is not, update the second statistical value corresponding to the extracted descriptor by the set increment;
a screening unit, configured to, after every business object among the multiple business objects has been processed by the first extraction unit and the statistics unit, screen the descriptors included in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to each descriptor, to obtain an updated descriptor dictionary, wherein a larger corresponding first statistical value indicates that the descriptor is more accurate, and a larger corresponding second statistical value indicates that the descriptor is less accurate.
11. The device of claim 10, characterized by further comprising:
a dictionary determination unit, configured to determine the descriptors in the specified description information of the multiple business objects, and to form the descriptor dictionary from the descriptors that appear in the specified description information of the multiple business objects.
12. The device of claim 11, characterized by further comprising:
a count determination unit, configured to count, for each descriptor included in the descriptor dictionary, the number of times it appears in the specified description information of the multiple business objects;
wherein the screening unit is specifically configured to screen the descriptors included in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to each descriptor and the number of times each descriptor appears in the specified description information of the multiple business objects.
13. The device of claim 10, characterized in that the screening unit is specifically configured to determine a comprehensive score for each descriptor included in the descriptor dictionary according to the first statistical value and the second statistical value corresponding to that descriptor, and to screen the descriptors included in the descriptor dictionary according to how high their comprehensive scores are.
14. The device of claim 10, characterized in that the screening unit is specifically configured to select, from the descriptors included in the descriptor dictionary, the descriptors whose corresponding first statistical value satisfies a first preset statistical value condition and whose corresponding second statistical value satisfies a second preset statistical value condition, to form the updated descriptor dictionary.
15. The device of claim 10, characterized by further comprising:
a second extraction unit, configured to extract, based on the descriptors included in the updated descriptor dictionary, the descriptors that are present in the updated descriptor dictionary from the title content of a pending business object;
a descriptor supplementing unit, configured to, when an extracted descriptor is not present in the specified description information of the pending business object, add the extracted descriptor to the specified description information of the pending business object, or replace a descriptor in the specified description information of the pending business object with the extracted descriptor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210551720.8A CN103870446B (en) | 2012-12-18 | 2012-12-18 | A kind of descriptor screening technique and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103870446A CN103870446A (en) | 2014-06-18 |
CN103870446B true CN103870446B (en) | 2016-12-28 |
Family
ID=50908990
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210551720.8A Active CN103870446B (en) | 2012-12-18 | 2012-12-18 | A kind of descriptor screening technique and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103870446B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106469184B (en) * | 2015-08-20 | 2019-12-27 | 阿里巴巴集团控股有限公司 | Data object label processing and displaying method, server and client |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101101599A (en) * | 2007-06-20 | 2008-01-09 | 精实万维软件(北京)有限公司 | Method for extracting advertisement main information from web page |
CN102473190A (en) * | 2009-07-30 | 2012-05-23 | 阿尔卡特朗讯 | Keyword assignment to a web page |
CN102682001A (en) * | 2011-03-09 | 2012-09-19 | 阿里巴巴集团控股有限公司 | Method and device for determining suggest word |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7152059B2 (en) * | 2002-08-30 | 2006-12-19 | Emergency24, Inc. | System and method for predicting additional search results of a computerized database search user based on an initial search query |
JP2007104312A (en) * | 2005-10-04 | 2007-04-19 | Toshiba Corp | Information processing method using electronic guide information and apparatus thereof |
Non-Patent Citations (3)
Title |
---|
一种基于词典的搜索引擎系统动态更新模型;雷鸣 等;《计算机研究与发展》;20001031;第37卷(第10期);第1265-1270页 * |
基于机器学习的文本聚类描述算法研究;章成志;《第三届全国信息检索与内容安全学术会议论文集》;20071101;第216-225页 * |
词表的自动丰富-从元数据中提取关键词及其定位;王军;《中文信息学报》;20051125;第19卷(第6期);第36-43页 * |
Also Published As
Publication number | Publication date |
---|---|
CN103870446A (en) | 2014-06-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |