CN107609094A - Data disambiguation method, device and computer equipment - Google Patents

Info

Publication number
CN107609094A
Authority
CN
China
Prior art keywords
data
feature
url
classification
sorted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710807103.2A
Other languages
Chinese (zh)
Other versions
CN107609094B (en)
Inventor
刘琼琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201710807103.2A
Publication of CN107609094A
Application granted
Publication of CN107609094B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a data disambiguation method, apparatus and computer device. The method includes: annotating each item of training data with respect to a category to be classified, to obtain a plurality of first data items annotated as belonging to the category and a plurality of second data items annotated as not belonging to it; determining, based on a user click behavior log, the features related to each first data item as first features and the features related to each second data item as second features; and training on the annotation corresponding to each first data item / each second data item according to the first features and the second features. The invention mines the data in the user click behavior log in depth, extracts the data in it that is useful as a reference for analysis, and incorporates application scenarios from multiple angles. This substantially improves disambiguation accuracy while reducing the time and cost of data disambiguation and improving its degree of automation.

Description

Data disambiguation method, device and computer equipment
Technical field
The present invention relates to the technical field of data processing, and in particular to a data disambiguation method, apparatus and computer device.
Background
In the related art, the category of data is generally disambiguated using machine learning methods and dictionaries, or named entity recognition techniques are used to identify categories such as person names, place names and organization names. With these approaches, category recognition is not comprehensive enough and is not tied to actual scenarios, a large amount of manual effort is required, and the disambiguation results are poor.
Summary of the Invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, one object of the present invention is to provide a data disambiguation method that mines the data in the user click behavior log in depth, extracts the data in it that is useful as a reference for analysis, and incorporates scenarios from multiple angles, thereby substantially improving disambiguation accuracy while reducing the time and cost of data disambiguation and improving its degree of automation.
Another object of the present invention is to provide a data disambiguation apparatus.
Another object of the present invention is to provide a computer device.
Another object of the present invention is to provide a non-transitory computer-readable storage medium.
Another object of the present invention is to provide a computer program product.
To achieve the above objects, the data disambiguation method provided by the embodiment of the first aspect of the present invention includes: constructing training data; annotating each item of the training data with respect to a category to be classified, to obtain a plurality of first data items annotated as belonging to the category to be classified and a plurality of second data items annotated as not belonging to the category to be classified; determining, based on a user click behavior log, the features related to each first data item as first features and the features related to each second data item as second features, the first features / the second features including literal features and user behavior features; and training on the annotation corresponding to each first data item / each second data item according to the first features and the second features.
With the data disambiguation method of the embodiment of the first aspect, training data is constructed; each item of training data is annotated with respect to the category to be classified, yielding first data items annotated as belonging to that category and second data items annotated as not belonging to it; the features related to each first data item are determined from the user click behavior log as first features, and the features related to each second data item as second features, each comprising literal features and user behavior features; and training is performed on the annotations of the first and second data items according to the first and second features. In this way the user click behavior log is mined in depth, the data in it that is useful as a reference is extracted for analysis, and scenarios are incorporated from multiple angles, which substantially improves disambiguation accuracy, reduces the time and cost of data disambiguation, and improves its degree of automation.
To achieve the above objects, the data disambiguation apparatus provided by the embodiment of the second aspect of the present invention includes: a construction module for constructing training data; an annotation module for annotating each item of the training data with respect to a category to be classified, to obtain a plurality of first data items annotated as belonging to the category to be classified and a plurality of second data items annotated as not belonging to the category to be classified; a feature determination module for determining, based on a user click behavior log, the features related to each first data item as first features and the features related to each second data item as second features, the first features / the second features including literal features and user behavior features; and a training module for training on the annotation corresponding to each first data item / each second data item according to the first features and the second features.
With the data disambiguation apparatus of the embodiment of the second aspect, training data is constructed; each item of training data is annotated with respect to the category to be classified, yielding first data items annotated as belonging to that category and second data items annotated as not belonging to it; the features related to each first data item are determined from the user click behavior log as first features, and the features related to each second data item as second features, each comprising literal features and user behavior features; and training is performed on the annotations of the first and second data items according to the first and second features. In this way the user click behavior log is mined in depth, the data in it that is useful as a reference is extracted for analysis, and scenarios are incorporated from multiple angles, which substantially improves disambiguation accuracy, reduces the time and cost of data disambiguation, and improves its degree of automation.
To achieve the above objects, the computer device provided by the embodiment of the third aspect of the present invention includes: a processor, a memory, a power supply circuit, a multimedia component, an audio component, an input/output (I/O) interface, a sensor component, and a communication component. A circuit board is placed inside the space enclosed by a housing, and the processor and the memory are arranged on the circuit board. The power supply circuit supplies power to each circuit or device of the computer device. The memory stores executable program code. The processor reads the executable program code stored in the memory and runs the program corresponding to that code, so as to perform: constructing training data; annotating each item of the training data with respect to a category to be classified, to obtain a plurality of first data items annotated as belonging to the category to be classified and a plurality of second data items annotated as not belonging to the category to be classified; determining, based on a user click behavior log, the features related to each first data item as first features and the features related to each second data item as second features, the first features / the second features including literal features and user behavior features; and training on the annotation corresponding to each first data item / each second data item according to the first features and the second features.
With the computer device of the embodiment of the third aspect, training data is constructed; each item of training data is annotated with respect to the category to be classified, yielding first data items annotated as belonging to that category and second data items annotated as not belonging to it; the features related to each first data item are determined from the user click behavior log as first features, and the features related to each second data item as second features, each comprising literal features and user behavior features; and training is performed on the annotations of the first and second data items according to the first and second features. In this way the user click behavior log is mined in depth, the data in it that is useful as a reference is extracted for analysis, and scenarios are incorporated from multiple angles, which substantially improves disambiguation accuracy, reduces the time and cost of data disambiguation, and improves its degree of automation.
To achieve the above objects, the embodiment of the fourth aspect of the present invention provides a non-transitory computer-readable storage medium. When the instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform a data disambiguation method, the method including: constructing training data; annotating each item of the training data with respect to a category to be classified, to obtain a plurality of first data items annotated as belonging to the category to be classified and a plurality of second data items annotated as not belonging to the category to be classified; determining, based on a user click behavior log, the features related to each first data item as first features and the features related to each second data item as second features, the first features / the second features including literal features and user behavior features; and training on the annotation corresponding to each first data item / each second data item according to the first features and the second features.
With the non-transitory computer-readable storage medium of the embodiment of the fourth aspect, training data is constructed; each item of training data is annotated with respect to the category to be classified, yielding first data items annotated as belonging to that category and second data items annotated as not belonging to it; the features related to each first data item are determined from the user click behavior log as first features, and the features related to each second data item as second features, each comprising literal features and user behavior features; and training is performed on the annotations of the first and second data items according to the first and second features. In this way the user click behavior log is mined in depth, the data in it that is useful as a reference is extracted for analysis, and scenarios are incorporated from multiple angles, which substantially improves disambiguation accuracy, reduces the time and cost of data disambiguation, and improves its degree of automation.
To achieve the above objects, the embodiment of the fifth aspect of the present invention provides a computer program product. When the instructions in the computer program product are executed by a processor, a data disambiguation method is performed, the method including: constructing training data; annotating each item of the training data with respect to a category to be classified, to obtain a plurality of first data items annotated as belonging to the category to be classified and a plurality of second data items annotated as not belonging to the category to be classified; determining, based on a user click behavior log, the features related to each first data item as first features and the features related to each second data item as second features, the first features / the second features including literal features and user behavior features; and training on the annotation corresponding to each first data item / each second data item according to the first features and the second features.
With the computer program product of the embodiment of the fifth aspect, training data is constructed; each item of training data is annotated with respect to the category to be classified, yielding first data items annotated as belonging to that category and second data items annotated as not belonging to it; the features related to each first data item are determined from the user click behavior log as first features, and the features related to each second data item as second features, each comprising literal features and user behavior features; and training is performed on the annotations of the first and second data items according to the first and second features. In this way the user click behavior log is mined in depth, the data in it that is useful as a reference is extracted for analysis, and scenarios are incorporated from multiple angles, which substantially improves disambiguation accuracy, reduces the time and cost of data disambiguation, and improves its degree of automation.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of the data disambiguation method according to one embodiment of the present invention;
Fig. 2 is a schematic flowchart of the data disambiguation method according to another embodiment of the present invention;
Fig. 3 is a schematic flowchart of the data disambiguation method according to another embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the data disambiguation apparatus according to one embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the data disambiguation apparatus according to another embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended only to explain the present invention and are not to be construed as limiting it. On the contrary, the embodiments of the invention include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.
Fig. 1 is a schematic flowchart of the data disambiguation method according to one embodiment of the present invention.
Referring to Fig. 1, the method includes:
S11: Construct training data.
In the embodiments of the present invention, proper nouns are used as an example of training data; this is not limiting.
A proper noun may be, for example, "rank of nobility mark" or "dotey".
Commonly used proper nouns (for example, "rank of nobility mark" and "dotey") may be collected to obtain the training data. Training data here means proper nouns that have not yet been trained on, typically a word or a phrase, for example "rank of nobility mark", "dotey", "Peking University" or "summer warm wind".
S12: Annotate each item of the training data with respect to the category to be classified, to obtain a plurality of first data items annotated as belonging to the category to be classified and a plurality of second data items annotated as not belonging to the category to be classified.
The category to be classified is a category to which an item of training data may belong. For example, for the training data item "rank of nobility mark", the category to be classified may be "video"; this is not limiting.
In the related art, the category of data is generally disambiguated using machine learning methods and dictionaries, or named entity recognition techniques are used to identify categories such as person names, place names and organization names. With these approaches, category recognition is not comprehensive enough and is not tied to actual scenarios, a large amount of manual effort is required, and the disambiguation results are poor.
In the embodiments of the present invention, by contrast, training data can be constructed and each item annotated with respect to the category to be classified. Specifically, the category of each item may be initially annotated based on practical annotation experience, or based on named entity recognition techniques; training on the annotation of each item is then triggered so as to disambiguate the training data.
For example, suppose the category to be classified is "video". The category of "rank of nobility mark" in the training data is initially annotated, and the resulting annotation is that "rank of nobility mark" belongs to the category "video". The category of "dotey" is initially annotated, and the resulting annotation is that "dotey" does not belong to the category "video". The category of "Peking University" is initially annotated, and the resulting annotation is that "Peking University" does not belong to the category "video". The category of "summer warm wind" is initially annotated, and the resulting annotation is that "summer warm wind" belongs to the category "video", and so on.
Further, after each item of the training data has been annotated with respect to the category to be classified, the items that belong to the category are taken as the first data and the items that do not belong to it are taken as the second data.
For example, the training data items "rank of nobility mark" and "summer warm wind" are taken as the first data, and the training data items "dotey" and "Peking University" are taken as the second data.
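The split into first data and second data can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `split_by_label` and the boolean label encoding are assumptions, and the labels simply restate the "video" example above.

```python
from typing import Dict, List, Tuple


def split_by_label(labels: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Split annotated items into first data (belongs) and second data (does not)."""
    first_data = [item for item, belongs in labels.items() if belongs]
    second_data = [item for item, belongs in labels.items() if not belongs]
    return first_data, second_data


# Initial annotations for the category "video", restating the example in the text.
labels = {
    "rank of nobility mark": True,
    "dotey": False,
    "Peking University": False,
    "summer warm wind": True,
}
first_data, second_data = split_by_label(labels)
```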
S13: Based on the user click behavior log, determine the features related to each first data item as first features and the features related to each second data item as second features, the first features / second features including literal features and user behavior features.
The user click behavior log is generated automatically by the back-end server system of an electronic device in actual application scenarios.
The user click behavior log can record users' click behaviors on the internet, for example clicks on website URLs, clicks on pictures and clicks on hyperlinks.
In the embodiments of the present invention, because the user click behavior log records users' click behaviors on the internet and is generated automatically in actual application scenarios, training on the annotations of the training data with reference to this log allows the category of the training data to be recognized in the context of those scenarios, which makes category recognition for the training data more general.
A literal feature may be, for example, the word segmentation result and the number of segments of the first data item "rank of nobility mark": the segmentation result is "rank of nobility mark" and the number of segments is 1. The literal features of different training data items may be the same or different.
A user behavior feature may be, for example, a click feature, a search feature or a display feature; this is not limiting.
In the embodiments of the present invention, a segmenter based on a dictionary matching algorithm, a segmenter based on a learning algorithm, or the like may be used to determine the literal features of each training data item.
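As a rough illustration of a dictionary-matching segmenter, the sketch below uses forward maximum matching, which is one common dictionary-based technique; the patent does not say which algorithm is used. The function names, the tiny vocabulary and the `max_len` parameter are all hypothetical.

```python
from typing import Dict, List, Set, Union


def forward_max_match(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: take the longest dictionary word at each step."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def literal_feature(text: str, vocab: Set[str]) -> Dict[str, Union[int, List[str]]]:
    """Literal feature as described in the text: segmentation result plus segment count."""
    tokens = forward_max_match(text, vocab)
    return {"tokens": tokens, "token_count": len(tokens)}


# Hypothetical dictionary; a real one would be far larger.
vocab = {"北京", "大学", "北京大学"}
feature = literal_feature("北京大学", vocab)
```

A learning-based segmenter could replace `forward_max_match` without changing the shape of the returned feature.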
In the embodiments of the present invention, in order to train on the annotations of the training data with reference to the user click behavior log, the features related to each first data item can be determined from the log as first features, and the features related to each second data item as second features. That is, multiple general website URLs matching each training data item are first obtained from a search engine: the training data item is entered as a search term into the search box of the search engine, a search is triggered, and multiple general website URLs are taken from the search results. Then the number of times users clicked each general website URL is counted from the user click behavior log and used as the click feature corresponding to that training data item, that is, as a user behavior feature.
Alternatively, the number of times users searched each general website URL may be counted from the user behavior log and used as the search feature corresponding to that training data item.
Alternatively, the number of times applications on the internet displayed each general website URL may be counted from the user behavior log and used as the display feature corresponding to that training data item; this is not limiting.
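The counting described above can be sketched as follows. The log layout (a sequence of `(action, url)` pairs with the action names `"click"`, `"search"` and `"display"`) and the example URLs are assumptions made for illustration; the patent does not specify a log format.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

LogRecord = Tuple[str, str]  # (action, url); hypothetical log layout


def behaviour_counts(log: Iterable[LogRecord], urls: List[str]) -> Dict[str, Dict[str, int]]:
    """Count click, search and display events for each general website URL of one item."""
    wanted = set(urls)
    counts = {"click": Counter(), "search": Counter(), "display": Counter()}
    for action, url in log:
        if action in counts and url in wanted:
            counts[action][url] += 1
    # Counter returns 0 for URLs that never appeared, so every URL gets an entry.
    return {action: {u: c[u] for u in urls} for action, c in counts.items()}


log = [
    ("click", "https://a.example/1"),
    ("click", "https://a.example/1"),
    ("display", "https://a.example/1"),
    ("search", "https://b.example/2"),
    ("click", "https://c.example/3"),  # not one of this item's URLs, so ignored
]
urls = ["https://a.example/1", "https://b.example/2"]
result = behaviour_counts(log, urls)
```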
Optionally, in some embodiments, referring to Fig. 2, S13 may include:
S201: Determine the length feature of each first data item / each second data item.
The length feature may be, for example, the number of characters in each first data item / each second data item; this is not limiting.
S202: Segment each first data item / each second data item into words to obtain a segmentation result, and use the length feature and the segmentation result as the literal features.
S203: From a preset category keyword library, determine the category keywords that belong to the category to be classified, and generate a first keyword set from them.
The preset category keyword library may be built from large-scale statistics. For example, back-end staff may compile statistics on users' search behavior on a search engine, and store in the library both the frequently searched keywords that may belong to the category to be classified and the keywords that may not belong to it. Alternatively, the library may be built by machine learning; for example, web techniques such as crawlers may be used to obtain from web pages the keywords that users search frequently and that may belong to the category to be classified, as well as the keywords that may not belong to it, and these are stored in the preset category keyword library. This is not limiting.
S204: From the preset category keyword library, determine the category keywords that do not belong to the category to be classified, and generate a second keyword set from them.
S205: From the user click behavior log, determine the category urls that belong to the category to be classified, and generate a first url set from them.
For example, the category to be classified may be entered as a search term into the search box of a search engine and a search triggered, and multiple general website URLs taken from the search results. Then, according to the user click behavior log, the first url set is generated from those category urls among the general website URLs that are actually links of the category to be classified. That is, if the category to be classified is "video", the category urls that are actually "video" links are determined from the general website URLs corresponding to "video", and the first url set is generated from those category urls.
S206: According to a general URL negative example set, determine the category urls that do not belong to the category to be classified, and generate a second url set from them.
Optionally, the general URL negative example set may be generated in advance from the general website URLs. A general URL negative example is a URL, among the general website URLs clicked by users, other than the general website URLs that match the category to be classified and each first data item / each second data item.
For example, the second url set may be generated from those category urls in the general URL negative example set that are actually not links of the category to be classified. That is, if the category to be classified is "video", the category urls that are actually not "video" links are determined from the general URL negative example set, and the second url set is generated from those category urls.
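The construction of the general URL negative example set and the second url set can be sketched with plain set operations. The text does not say how a URL is judged to be "actually" a link of the category, so the predicate `is_category_link` below is a hypothetical stand-in, as are the example URLs.

```python
from typing import Callable, Iterable, Set


def build_negative_set(clicked_urls: Iterable[str], matched_urls: Iterable[str]) -> Set[str]:
    """Negative examples: clicked general URLs that matched neither the category nor any item."""
    return set(clicked_urls) - set(matched_urls)


def build_second_url_set(negative_set: Set[str], is_category_link: Callable[[str], bool]) -> Set[str]:
    """Keep the negative examples that are not links of the category to be classified."""
    return {url for url in negative_set if not is_category_link(url)}


clicked = [
    "https://v.example/video/1",
    "https://news.example/a",
    "https://shop.example/b",
    "https://clips.example/video/9",
]
matched = ["https://v.example/video/1"]

negative_set = build_negative_set(clicked, matched)
# Hypothetical rule for the category "video": the path contains "/video/".
second_url_set = build_second_url_set(negative_set, lambda url: "/video/" in url)
```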
S207: Use the first keyword set and the first url set as a first related recommendation corresponding to the first data, and use the second keyword set and the second url set as a second related recommendation corresponding to the second data.
In the embodiments of the present invention, by using the first keyword set and the first url set as the first related recommendation corresponding to the first data, and the second keyword set and the second url set as the second related recommendation corresponding to the second data, the category of the training data can be recognized by combining the scenario with the literal features of the training data. This makes category recognition for the training data more general and improves recognition accuracy.
S208: According to the user click behavior log, determine first counts, namely the number of times users click the first related recommendation, the number of times users search the first related recommendation, and the number of times the titles corresponding to the website URLs in the first url set contain category keywords from the first keyword set, and use the first counts as the click feature corresponding to the first data.
S209: According to the user click behavior log, determine second counts, namely the number of times users click the second related recommendation, the number of times users search the second related recommendation, and the number of times the titles corresponding to the website URLs in the second url set contain category keywords from the second keyword set, and use the second counts as the click feature corresponding to the second data.
S210: Use the literal features and click feature of each first data item as the corresponding first features, and use the literal features and click feature of each second data item as the corresponding second features.
The embodiments of the present invention thus provide a way of determining, based on the user click behavior log, the literal features and user behavior features related to each training data item. The references this method takes into account are relatively complete: it combines the click feature determined from the user click behavior log with the keywords in the preset category keyword library, and, according to the user click behavior log, determines the number of times users click a related recommendation, the number of times users search a related recommendation, and the number of times the titles corresponding to the website URLs in the first url set / second url set contain category keywords from the first keyword set / second keyword set, and uses these counts as the click feature corresponding to the training data. By mining the user click behavior log in depth, extracting the data in it that is useful as a reference for analysis, and incorporating scenarios from multiple angles, the accuracy of proper noun recognition is substantially improved.
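The three counts that make up the click feature in S208/S209 can be sketched as below. Several details are assumptions because the text leaves them open: the log is taken to be a list of `(action, target)` pairs, a click counts when its target is in the url set, a search counts when its query is in the keyword set, and a title counts once if it contains any keyword. The counts are returned as a tuple rather than combined.

```python
from typing import Dict, List, Set, Tuple


def click_feature(
    log: List[Tuple[str, str]],
    keywords: Set[str],
    urls: Set[str],
    titles: Dict[str, str],
) -> Tuple[int, int, int]:
    """Return (clicks on the recommendation, searches for it, titles containing a keyword)."""
    clicks = sum(1 for action, target in log if action == "click" and target in urls)
    searches = sum(1 for action, target in log if action == "search" and target in keywords)
    title_hits = sum(
        1 for url in urls if any(keyword in titles.get(url, "") for keyword in keywords)
    )
    return clicks, searches, title_hits


# Hypothetical related recommendation (keyword set + url set) and log.
keywords = {"movie", "trailer"}
urls = {"https://v.example/1", "https://v.example/2"}
titles = {
    "https://v.example/1": "Official movie trailer",
    "https://v.example/2": "Contact us",
}
log = [
    ("click", "https://v.example/1"),
    ("click", "https://v.example/1"),
    ("search", "movie"),
    ("click", "https://x.example/9"),
    ("search", "weather"),
]
feature = click_feature(log, keywords, urls, titles)
```

Calling the same function with the second keyword set and second url set would give the click feature for the second data.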
S14: Train the labels corresponding to each piece of first data/each piece of second data according to the first feature and the second feature.
Optionally, in some embodiments, referring to Fig. 3, S14 may include:
S301: Generate a first candidate URL set according to the click feature of the first data, and generate a second candidate URL set according to the click feature of the second data.
The first candidate URL set may include: among the multiple general website URLs corresponding to the first data, the general website URLs that users actually clicked. The second candidate URL set may include: among the multiple general website URLs corresponding to the second data, the general website URLs that users actually clicked.
S302: Filter the first candidate URL set and the second candidate URL set respectively according to the general URL negative example set, obtaining a first current URL set and a second current URL set.
In an embodiment of the present invention, the website URLs in the first candidate URL set that also belong to the general URL negative example set may be deleted, and the second candidate URL set is processed in a similar way. This can effectively improve the accuracy of using the user click behavior log as a matching reference.
The general URL negative example set may be obtained statistically from the user click behavior log, which is not limited here.
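The filtering in S302 amounts to removing every candidate URL that also appears in the general URL negative example set. A minimal sketch, with the function name `filter_candidates` chosen here for illustration only:

```python
def filter_candidates(candidate_urls, negative_urls):
    """Drop candidate URLs that belong to the general URL negative example set."""
    negatives = set(negative_urls)
    # Keep the original order of the candidates that survive the filter.
    return [u for u in candidate_urls if u not in negatives]
```

The same function would be applied once to the first candidate URL set and once to the second.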
S303: From the first current URL set and the second current URL set respectively, select the URLs whose number of clicks is greater than or equal to a first preset value as the first target URL set and the second target URL set.
The first preset value is set in advance.
The first preset value may be set by the user as required, or may be preset by the factory program of the device that executes the data disambiguation method, which is not limited here.
In the embodiment of the present invention, screening the URLs in the first current URL set and the second current URL set can further improve the accuracy of using the user click behavior log as a matching reference, and improves the reference value of the user click behavior log.
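The click-count screening of S303 can be sketched as a simple threshold test. The `click_counts` mapping (URL to number of clicks) and the function name are assumptions made for this example; the source does not say how click counts are stored.

```python
def select_target_urls(click_counts, current_urls, first_preset_value):
    """Keep URLs clicked at least first_preset_value times."""
    # URLs missing from click_counts are treated as having zero clicks.
    return {u for u in current_urls
            if click_counts.get(u, 0) >= first_preset_value}
```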
S304: Judge whether the similarity between the first target URL set/second target URL set and the historically mined candidate URL set satisfies a preset condition.
The preset condition is set in advance.
The preset condition may be set by the user as required, or may be preset by the factory program of the device that executes the data disambiguation method, which is not limited here.
The preset condition is: the URLs in the intersection of the first target URL set/second target URL set with the historically mined candidate URL set account for a proportion of the first target URL set/second target URL set that is greater than or equal to a preset threshold.
S305: Take the URLs that satisfy the preset condition as the first final URL set and the second final URL set.
In the embodiment of the present invention, taking the URLs that satisfy the preset condition as the first final URL set and the second final URL set can further improve the accuracy of using the user click behavior log as a matching reference, and improves the reference value of the user click behavior log.
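The preset condition of S304 is an overlap ratio: the size of the intersection divided by the size of the target URL set. The sketch below applies the test to a whole target set, which is one reading of the text; the function name and the handling of an empty target set are assumptions.

```python
def meets_preset_condition(target_urls, historical_urls, threshold):
    """True if the share of target URLs also found in the historically
    mined candidate URL set is at least the preset threshold."""
    target = set(target_urls)
    if not target:
        # Assumption: an empty target set cannot satisfy the condition.
        return False
    ratio = len(target & set(historical_urls)) / len(target)
    return ratio >= threshold
```

For example, with target URLs {a, b, c, d} and historical URLs {a, b, x}, the intersection holds two of the four target URLs, so the ratio is 0.5.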
S306: Take the first final URL set and the second final URL set, together with the first feature and the second feature, as the input of the GBDT decision tree algorithm, and take the output of the algorithm as the classification model corresponding to the category to be classified.
Each URL in the first final URL set and the second final URL set may be taken as the input of the GBDT decision tree algorithm to obtain the output of the algorithm; meanwhile, the first feature and the second feature are taken as the input of the GBDT decision tree algorithm to obtain the output of the algorithm. Because the first final URL set and the second final URL set are related (that is, they are the first final URL set and the second final URL set partitioned for the same category to be classified, for example "video"), the actual category to which each piece of training data belongs can be trained by taking the first final URL set and the second final URL set, together with the first feature and the second feature, as the input of the GBDT decision tree algorithm. That is, the output of the GBDT decision tree algorithm is taken as the classification model corresponding to the category to be classified, so as to correct the label corresponding to each piece of training data.
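To make the GBDT step concrete, the following is a deliberately small gradient-boosting sketch in plain Python. It is not the implementation described in the patent: it assumes each training item has already been encoded as a numeric feature vector (how the final URL sets are encoded is not specified in the source), uses depth-1 regression trees (stumps) with squared-error loss, and all function names and default parameters are hypothetical.

```python
def fit_stump(X, residuals):
    """Find the one-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:] if best else None


def predict_raw(base, rate, stumps, row):
    """Base score plus the shrunken sum of all stump outputs."""
    return base + rate * sum(lm if row[j] <= t else rm
                             for j, t, lm, rm in stumps)


def fit_gbdt(X, y, rounds=20, rate=0.5):
    """Fit boosted stumps to 0/1 labels; each round fits the residuals."""
    base = sum(y) / len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - predict_raw(base, rate, stumps, row)
                     for row, yi in zip(X, y)]
        stump = fit_stump(X, residuals)
        if stump is None:  # no feature can be split any further
            break
        stumps.append(stump)
    return base, rate, stumps


def predict_label(model, row):
    """Label 1 (belongs to the category) if the raw score is at least 0.5."""
    base, rate, stumps = model
    return 1 if predict_raw(base, rate, stumps, row) >= 0.5 else 0
```

In this sketch a row could hold, for instance, the three click counts of one piece of data, with label 1 for first data and 0 for second data. A production system would more likely rely on an established gradient-boosting library than on hand-written stumps.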
S307: Based on the classification model corresponding to the category to be classified, train the labels corresponding to each piece of first data/each piece of second data.
Optionally, the first features of the multiple pieces of first data labeled as belonging to the category to be classified, and the second features of the multiple pieces of second data labeled as not belonging to the category to be classified, are respectively taken as the input of the classification model to obtain the classification labels, output by the classification model, that correspond to the first features and/or the second features; the labels corresponding to each piece of first data/each piece of second data are then trained according to the classification labels corresponding to each first feature and/or each second feature.
In this embodiment, training data is constructed and each piece of data in the training data is labeled based on the category to be classified, yielding multiple pieces of first data labeled as belonging to the category to be classified and multiple pieces of second data labeled as not belonging to it. Based on the user click behavior log, the features related to each piece of first data are determined as first features and the features related to each piece of second data are determined as second features, where the first features/second features include literal features and user behavior features. The labels corresponding to each piece of first data/each piece of second data are then trained according to the first features and the second features. In this way, the data in the user click behavior log can be mined in depth and the referenceable data in it extracted for analysis, so that scenarios can be incorporated from multiple angles and the accuracy of data disambiguation is significantly improved; at the same time, the time and expense of data disambiguation are reduced, lowering cost and improving the automation of data disambiguation.
Fig. 4 is a schematic structural diagram of a data disambiguation device according to an embodiment of the present invention.
Referring to Fig. 4, the device 400 includes: a construction module 401, a labeling module 402, a feature determination module 403, and a training module 404, wherein:
The construction module 401 is configured to construct training data.
The labeling module 402 is configured to label each piece of data in the training data based on the category to be classified, obtaining multiple pieces of first data labeled as belonging to the category to be classified and multiple pieces of second data labeled as not belonging to the category to be classified.
The feature determination module 403 is configured to determine, based on the user click behavior log, the features related to each piece of first data as first features and the features related to each piece of second data as second features, where the first features/second features include literal features and user behavior features.
The training module 404 is configured to train the labels corresponding to each piece of first data/each piece of second data according to the first features and the second features.
Optionally, in some embodiments, the user behavior feature is a click feature, and the feature determination module 403 is specifically configured to:
determine the length feature of each piece of first data/each piece of second data respectively;
segment each piece of first data/each piece of second data into words respectively to obtain a word segmentation result, and take the length feature and the word segmentation result as the literal feature;
determine, from the preset category keyword library, the category keywords that belong to the category to be classified, and generate a first keyword set from the category keywords that belong to the category to be classified;
determine, from the preset category keyword library, the category keywords that do not belong to the category to be classified, and generate a second keyword set from the keywords that do not belong to the category to be classified;
determine, from the user click behavior log, the category URLs that belong to the category to be classified, and generate a first URL set from the category URLs that belong to the category to be classified;
determine, according to the general URL negative example set, the category URLs that do not belong to the category to be classified, and generate a second URL set from the category URLs that do not belong to the category to be classified;
take the first keyword set and the first URL set as the first related recommendation corresponding to the first data, and take the second keyword set and the second URL set as the second related recommendation corresponding to the second data;
determine, according to the user click behavior log, the first counts, namely the number of times users click the first related recommendation, the number of times users search for the first related recommendation, and the number of times the titles corresponding to the website URLs in the first URL set contain category keywords from the first keyword set, and take the first counts as the click feature corresponding to the first data;
determine, according to the user click behavior log, the second counts, namely the number of times users click the second related recommendation, the number of times users search for the second related recommendation, and the number of times the titles corresponding to the website URLs in the second URL set contain category keywords from the second keyword set, and take the second counts as the click feature corresponding to the second data;
take the literal feature and the click feature of each piece of first data as the corresponding first feature, and take the literal feature and the click feature of each piece of second data as the corresponding second feature.
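The literal feature described above pairs a length feature with a word segmentation result. A minimal sketch follows; the source does not name a segmenter, so whitespace splitting stands in here as a placeholder (Chinese text would need a real word segmenter passed in as `segment`), and the dictionary layout is an assumption.

```python
def literal_feature(text, segment=str.split):
    """Build the literal feature of one piece of data.

    segment -- word segmentation function; whitespace splitting is only a
               stand-in for whatever segmenter a real system would use.
    """
    return {"length": len(text), "tokens": segment(text)}
```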
The training module 404 is specifically configured to:
generate a first candidate URL set according to the click feature of the first data, and generate a second candidate URL set according to the click feature of the second data;
filter the first candidate URL set and the second candidate URL set respectively according to the general URL negative example set, obtaining a first current URL set and a second current URL set;
select, from the first current URL set and the second current URL set respectively, the URLs whose number of clicks is greater than or equal to the first preset value as the first target URL set and the second target URL set;
judge whether the similarity between the first target URL set/second target URL set and the historically mined candidate URL set satisfies the preset condition;
take the URLs that satisfy the preset condition as the first final URL set and the second final URL set;
take the first final URL set and the second final URL set, together with the first features and the second features, as the input of the GBDT decision tree algorithm, and take the output of the algorithm as the classification model corresponding to the category to be classified;
train, based on the classification model corresponding to the category to be classified, the labels corresponding to each piece of first data/each piece of second data.
The training module 404 is further specifically configured to:
take the first features of the multiple pieces of first data labeled as belonging to the category to be classified, and the second features of the multiple pieces of second data labeled as not belonging to the category to be classified, respectively as the input of the classification model, and obtain the classification labels, output by the classification model, that correspond to the first features and/or the second features;
train the labels corresponding to each piece of first data/each piece of second data according to the classification labels corresponding to each first feature and/or each second feature.
Optionally, in some embodiments, referring to Fig. 5, the device 400 further includes:
a generation module 405, configured to generate the general URL negative example set according to general website URLs, where a general URL negative example is: among the general website URLs clicked by users, a URL other than the multiple general website URLs that match both the category to be classified and each piece of first data/each piece of second data.
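Read as a set operation, the general URL negative example set is the set of clicked general website URLs minus those that match the category and the data. A minimal sketch under that reading, with hypothetical names:

```python
def build_negative_set(clicked_general_urls, matched_urls):
    """Clicked general website URLs that do not match the category to be
    classified and the first/second data form the negative example set."""
    return set(clicked_general_urls) - set(matched_urls)
```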
It should be noted that the explanations of the data disambiguation method embodiments in Figs. 1 to 3 above also apply to the data disambiguation device 400 of this embodiment; the implementation principles are similar and are not repeated here.
In this embodiment, training data is constructed and each piece of data in the training data is labeled based on the category to be classified, yielding multiple pieces of first data labeled as belonging to the category to be classified and multiple pieces of second data labeled as not belonging to it. Based on the user click behavior log, the features related to each piece of first data are determined as first features and the features related to each piece of second data are determined as second features, where the first features/second features include literal features and user behavior features. The labels corresponding to each piece of first data/each piece of second data are then trained according to the first features and the second features. In this way, the data in the user click behavior log can be mined in depth and the referenceable data in it extracted for analysis, so that scenarios can be incorporated from multiple angles and the accuracy of data disambiguation is significantly improved; at the same time, the time and expense of data disambiguation are reduced, lowering cost and improving the automation of data disambiguation.
The embodiment of the present invention further provides a computer device. Referring to Fig. 6, the computer device 700 may include one or more of the following components: a processor 701, a memory 702, a power supply circuit 703, a multimedia component 704, an audio component 705, an input/output (I/O) interface 706, a sensor component 707, and a communication component 708.
The power supply circuit 703 supplies power to each circuit or component of the computer device; the memory 702 stores executable program code; the processor 701 runs the program corresponding to the executable program code by reading the executable program code stored in the memory 702, so as to perform the following steps:
constructing training data;
labeling each piece of data in the training data based on the category to be classified, obtaining multiple pieces of first data labeled as belonging to the category to be classified and multiple pieces of second data labeled as not belonging to the category to be classified;
determining, based on the user click behavior log, the features related to each piece of first data as first features and the features related to each piece of second data as second features, where the first features/second features include literal features and user behavior features;
training the labels corresponding to each piece of first data/each piece of second data according to the first features and the second features.
It should be noted that the explanations of the data disambiguation method embodiments in Figs. 1 to 3 above also apply to the computer device 700 of this embodiment; the implementation principles are similar and are not repeated here.
In this embodiment, training data is constructed and each piece of data in the training data is labeled based on the category to be classified, yielding multiple pieces of first data labeled as belonging to the category to be classified and multiple pieces of second data labeled as not belonging to it. Based on the user click behavior log, the features related to each piece of first data are determined as first features and the features related to each piece of second data are determined as second features, where the first features/second features include literal features and user behavior features. The labels corresponding to each piece of first data/each piece of second data are then trained according to the first features and the second features. In this way, the data in the user click behavior log can be mined in depth and the referenceable data in it extracted for analysis, so that scenarios can be incorporated from multiple angles and the accuracy of data disambiguation is significantly improved; at the same time, the time and expense of data disambiguation are reduced, lowering cost and improving the automation of data disambiguation.
To implement the above embodiments, the present invention further proposes a non-transitory computer-readable storage medium. When the instructions in the storage medium are executed by the processor of a terminal, the terminal is enabled to execute a data disambiguation method, the method including:
constructing training data;
labeling each piece of data in the training data based on the category to be classified, obtaining multiple pieces of first data labeled as belonging to the category to be classified and multiple pieces of second data labeled as not belonging to the category to be classified;
determining, based on the user click behavior log, the features related to each piece of first data as first features and the features related to each piece of second data as second features, where the first features/second features include literal features and user behavior features;
training the labels corresponding to each piece of first data/each piece of second data according to the first features and the second features.
With the non-transitory computer-readable storage medium of this embodiment, training data is constructed and each piece of data in the training data is labeled based on the category to be classified, yielding multiple pieces of first data labeled as belonging to the category to be classified and multiple pieces of second data labeled as not belonging to it. Based on the user click behavior log, the features related to each piece of first data are determined as first features and the features related to each piece of second data are determined as second features, where the first features/second features include literal features and user behavior features. The labels corresponding to each piece of first data/each piece of second data are then trained according to the first features and the second features. In this way, the data in the user click behavior log can be mined in depth and the referenceable data in it extracted for analysis, so that scenarios can be incorporated from multiple angles and the accuracy of data disambiguation is significantly improved; at the same time, the time and expense of data disambiguation are reduced, lowering cost and improving the automation of data disambiguation.
To implement the above embodiments, the present invention further proposes a computer program product. When the instructions in the computer program product are executed by a processor, a data disambiguation method is executed, the method including:
constructing training data;
labeling each piece of data in the training data based on the category to be classified, obtaining multiple pieces of first data labeled as belonging to the category to be classified and multiple pieces of second data labeled as not belonging to the category to be classified;
determining, based on the user click behavior log, the features related to each piece of first data as first features and the features related to each piece of second data as second features, where the first features/second features include literal features and user behavior features;
training the labels corresponding to each piece of first data/each piece of second data according to the first features and the second features.
With the computer program product of this embodiment, training data is constructed and each piece of data in the training data is labeled based on the category to be classified, yielding multiple pieces of first data labeled as belonging to the category to be classified and multiple pieces of second data labeled as not belonging to it. Based on the user click behavior log, the features related to each piece of first data are determined as first features and the features related to each piece of second data are determined as second features, where the first features/second features include literal features and user behavior features. The labels corresponding to each piece of first data/each piece of second data are then trained according to the first features and the second features. In this way, the data in the user click behavior log can be mined in depth and the referenceable data in it extracted for analysis, so that scenarios can be incorporated from multiple angles and the accuracy of data disambiguation is significantly improved; at the same time, the time and expense of data disambiguation are reduced, lowering cost and improving the automation of data disambiguation.
It should be noted that in the description of the present invention, the terms "first", "second", and so on are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise stated, "multiple" means two or more.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
It should be understood that the parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, a description that refers to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims (13)

1. A data disambiguation method, characterized by comprising the following steps:
constructing training data;
labeling each piece of data in the training data based on a category to be classified, obtaining multiple pieces of first data labeled as belonging to the category to be classified and multiple pieces of second data labeled as not belonging to the category to be classified;
determining, based on a user click behavior log, features related to each piece of first data as first features and features related to each piece of second data as second features, the first features/second features including: literal features and user behavior features;
training the labels corresponding to each piece of first data/each piece of second data according to the first features and the second features.
2. The data disambiguation method according to claim 1, characterized in that the user behavior feature is a click feature, and the determining, based on the user click behavior log, of features related to each piece of first data as first features and features related to each piece of second data as second features, the first features/second features including literal features and user behavior features, comprises:
determining the length feature of each piece of first data/each piece of second data respectively;
segmenting each piece of first data/each piece of second data into words respectively to obtain a word segmentation result, and taking the length feature and the word segmentation result as the literal feature;
determining, from a preset category keyword library, the category keywords that belong to the category to be classified, and generating a first keyword set from the category keywords that belong to the category to be classified;
determining, from the preset category keyword library, the category keywords that do not belong to the category to be classified, and generating a second keyword set from the keywords that do not belong to the category to be classified;
determining, from the user click behavior log, the category URLs that belong to the category to be classified, and generating a first URL set from the category URLs that belong to the category to be classified;
determining, according to a general URL negative example set, the category URLs that do not belong to the category to be classified, and generating a second URL set from the category URLs that do not belong to the category to be classified;
taking the first keyword set and the first URL set as a first related recommendation corresponding to the first data, and taking the second keyword set and the second URL set as a second related recommendation corresponding to the second data;
determining, according to the user click behavior log, first counts, namely the number of times users click the first related recommendation, the number of times users search for the first related recommendation, and the number of times the titles corresponding to the website URLs in the first URL set contain category keywords from the first keyword set, and taking the first counts as the click feature corresponding to the first data;
determining, according to the user click behavior log, second counts, namely the number of times users click the second related recommendation, the number of times users search for the second related recommendation, and the number of times the titles corresponding to the website URLs in the second URL set contain category keywords from the second keyword set, and taking the second counts as the click feature corresponding to the second data;
taking the literal feature and the click feature of each piece of first data as the corresponding first feature, and taking the literal feature and the click feature of each piece of second data as the corresponding second feature.
3. The data disambiguation method according to claim 2, characterized in that the training of the labels corresponding to each piece of first data/each piece of second data according to the first features and the second features comprises:
generating a first candidate URL set according to the click feature of the first data, and generating a second candidate URL set according to the click feature of the second data;
filtering the first candidate URL set and the second candidate URL set respectively according to the general URL negative example set, obtaining a first current URL set and a second current URL set;
selecting, from the first current URL set and the second current URL set respectively, the URLs whose number of clicks is greater than or equal to a first preset value as a first target URL set and a second target URL set;
judging whether the similarity between the first target URL set/second target URL set and a historically mined candidate URL set satisfies a preset condition;
taking the URLs that satisfy the preset condition as a first final URL set and a second final URL set;
taking the first final URL set and the second final URL set, together with the first features and the second features, as the input of a GBDT decision tree algorithm, and taking the output of the algorithm as a classification model corresponding to the category to be classified;
training, based on the classification model corresponding to the category to be classified, the labels corresponding to each piece of first data/each piece of second data.
4. The data disambiguation method as claimed in claim 3, characterized in that the training, based on the classification model corresponding to the category to be classified, of the label corresponding to each piece of first data/each piece of second data comprises:
taking the first features of the plurality of pieces of first data labeled as belonging to the category to be classified, and the second features of the plurality of pieces of second data labeled as not belonging to the category to be classified, respectively as the input of the classification model, to obtain the classification labels output by the classification model that correspond to the first features and/or the second features;
training the label corresponding to each piece of first data/each piece of second data according to the classification labels corresponding to each first feature and/or each second feature.
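To illustrate how labeled feature vectors could be fed to a GBDT-style model and classification labels read back, the following is a minimal pure-Python sketch of gradient boosting with depth-1 regression trees (stumps) under squared-error loss. It is a simplified stand-in, not the model described in the patent: the feature layout (text length, click count), the 0.5 decision threshold, and all function names are assumptions, and the URL sets that the claims also pass to the algorithm are omitted.

```python
from typing import List, Tuple

Stump = Tuple[int, float, float, float]   # (feature index, threshold, left value, right value)
Model = Tuple[float, List[Stump], float]  # (base score, stumps, learning rate)


def fit_stump(X: List[List[float]], r: List[float]) -> Stump:
    """Find the single split that best fits the residuals r (squared error)."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r[i] for i, row in enumerate(X) if row[j] <= t]
            right = [r[i] for i, row in enumerate(X) if row[j] > t]
            lv = sum(left) / len(left)  # left is never empty: t is an observed value
            rv = sum(right) / len(right) if right else lv
            err = sum((v - lv) ** 2 for v in left) + sum((v - rv) ** 2 for v in right)
            if err < best_err:
                best, best_err = (j, t, lv, rv), err
    return best


def raw_score(model: Model, x: List[float]) -> float:
    """Base score plus the shrunken contribution of every stump."""
    base, stumps, lr = model
    return base + sum(lr * (lv if x[j] <= t else rv) for j, t, lv, rv in stumps)


def fit_gbdt(X: List[List[float]], y: List[int],
             n_rounds: int = 20, lr: float = 0.5) -> Model:
    """Each round fits a stump to the residuals, i.e. the negative gradient
    of the squared-error loss at the current predictions."""
    base = sum(y) / len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - raw_score((base, stumps, lr), xi) for xi, yi in zip(X, y)]
        stumps.append(fit_stump(X, residuals))
    return base, stumps, lr


def classify(model: Model, x: List[float]) -> int:
    """Return 1 (belongs to the category) or 0 (does not)."""
    return 1 if raw_score(model, x) >= 0.5 else 0
```

In this sketch, rows for first data carry label 1 and rows for second data carry label 0; after `fit_gbdt`, calling `classify` on each feature vector yields the classification labels that can be compared against the original annotations.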
5. The data disambiguation method as claimed in claim 2 or 3, characterized by further comprising:
generating the general URL negative example set according to general website URLs, wherein the general URL negative examples are: among the general website URLs clicked by users, the URLs other than the multiple general website URLs that match the category to be classified and match each piece of first data/each piece of second data.
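A minimal sketch of building the general URL negative example set described in claim 5 is given below. How a "general website" is recognized and how a URL is judged to match the category and the data are not specified in the claim; here both are assumed to be supplied as precomputed sets (`general_hosts` and `matched_urls`, hypothetical names).

```python
from typing import Iterable, Set
from urllib.parse import urlparse


def general_negative_examples(clicked_urls: Iterable[str],
                              general_hosts: Set[str],
                              matched_urls: Set[str]) -> Set[str]:
    """Collect clicked URLs on general websites that do not match the
    category to be classified or the labeled data."""
    negatives = set()
    for url in clicked_urls:
        host = urlparse(url).netloc  # e.g. "portal.example"
        # Keep only general-website URLs that were not judged a match.
        if host in general_hosts and url not in matched_urls:
            negatives.add(url)
    return negatives
```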
6. A data disambiguation device, characterized by comprising:
a construction module, configured to construct training data;
a labeling module, configured to label each piece of data in the training data based on a category to be classified, obtaining a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified;
a feature determination module, configured to determine, based on a user click behavior log, the features related to each piece of first data as a first feature, and the features related to each piece of second data as a second feature, the first feature/second feature comprising: a literal feature and a user behavior feature;
a training module, configured to train, according to the first feature and the second feature, the label corresponding to each piece of first data/each piece of second data.
7. The data disambiguation device as claimed in claim 6, characterized in that the user behavior feature is a click feature, and the feature determination module is specifically configured to:
determine the length feature of each piece of first data/each piece of second data respectively;
segment each piece of first data/each piece of second data into words respectively to obtain a word segmentation result, and take the length feature and the word segmentation result as the literal feature;
determine, from a preset category keyword library, the category keywords belonging to the category to be classified, and generate a first keyword set from the category keywords belonging to the category to be classified;
determine, from the preset category keyword library, the category keywords not belonging to the category to be classified, and generate a second keyword set from the keywords not belonging to the category to be classified;
determine, from the user click behavior log, the category URLs belonging to the category to be classified, and generate a first URL set from the category URLs belonging to the category to be classified;
determine, according to the general URL negative example set, the category URLs not belonging to the category to be classified, and generate a second URL set from the category URLs not belonging to the category to be classified;
take the first keyword set and the first URL set as the first related recommendation corresponding to the first data, and take the second keyword set and the second URL set as the second related recommendation corresponding to the second data;
determine, according to the user click behavior log, a first count of the times users click the first related recommendation, the times users search for the first related recommendation, and the times the category keywords in the first keyword set are contained in the titles corresponding to the website URLs in the first URL set, and take the first count as the click feature corresponding to the first data;
determine, according to the user click behavior log, a second count of the times users click the second related recommendation, the times users search for the second related recommendation, and the times the category keywords in the second keyword set are contained in the titles corresponding to the website URLs in the second URL set, and take the second count as the click feature corresponding to the second data;
take the literal feature and the click feature of each piece of first data as the corresponding first feature, and take the literal feature and the click feature of each piece of second data as the corresponding second feature.
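The feature determination recited in claim 7 can be sketched as below. This is a sketch under stated assumptions: whitespace splitting stands in for a real word segmenter, the click log is modeled as (query, clicked URL) pairs, the search log as a list of query strings, and the title count is taken as the number of titles containing at least one category keyword. The names `literal_feature` and `click_feature` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def literal_feature(text: str) -> Dict[str, object]:
    """Length feature plus word segmentation result."""
    return {"length": len(text), "tokens": text.split()}


def click_feature(click_log: List[Tuple[str, str]],
                  search_log: List[str],
                  keywords: Set[str],
                  urls: Set[str],
                  url_titles: Dict[str, str]) -> Dict[str, int]:
    """Counts derived from the click behavior log for one related
    recommendation (keyword set plus URL set)."""
    recommendation = keywords | urls
    # Times users clicked an item of the related recommendation.
    clicks = sum(1 for _, clicked in click_log if clicked in recommendation)
    # Times users searched for an item of the related recommendation.
    searches = sum(1 for query in search_log if query in recommendation)
    # Titles of URLs in the URL set that contain a category keyword.
    title_hits = sum(1 for url in urls
                     if any(k in url_titles.get(url, "") for k in keywords))
    return {"clicks": clicks, "searches": searches, "title_hits": title_hits}
```

Calling `click_feature` with the first keyword set and first URL set gives the click feature of the first data; the second sets give that of the second data. Combined with `literal_feature`, these form the first and second features.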
8. The data disambiguation device as claimed in claim 7, characterized in that the training module is specifically configured to:
generate a first candidate URL set according to the click feature of the first data, and generate a second candidate URL set according to the click feature of the second data;
filter the first candidate URL set and the second candidate URL set respectively according to the general URL negative example set, to obtain a first current URL set and a second current URL set;
select, from the first current URL set and the second current URL set respectively, the URLs whose number of clicks is greater than or equal to a first preset value, as a first target URL set and a second target URL set;
judge whether the similarity between the first target URL set and the second target URL set and a historically mined candidate URL set meets a preset condition;
take the URLs that meet the preset condition as a first final URL set and a second final URL set;
take the first final URL set and the second final URL set, together with the first feature and the second feature, as the input of a GBDT (gradient boosting decision tree) algorithm, and take the output of the algorithm as the classification model corresponding to the category to be classified;
train, based on the classification model corresponding to the category to be classified, the label corresponding to each piece of first data/each piece of second data.
9. The data disambiguation device as claimed in claim 8, characterized in that the training module is further specifically configured to:
take the first features of the plurality of pieces of first data labeled as belonging to the category to be classified, and the second features of the plurality of pieces of second data labeled as not belonging to the category to be classified, respectively as the input of the classification model, to obtain the classification labels output by the classification model that correspond to the first features and/or the second features;
train the label corresponding to each piece of first data/each piece of second data according to the classification labels corresponding to each first feature and/or each second feature.
10. The data disambiguation device as claimed in claim 7 or 8, characterized by further comprising:
a generation module, configured to generate the general URL negative example set according to general website URLs, wherein the general URL negative examples are: among the general website URLs clicked by users, the URLs other than the multiple general website URLs that match the category to be classified and match each piece of first data/each piece of second data.
11. A computer device, characterized by comprising one or more of the following components: a processor, a memory, a power supply circuit, a multimedia component, an audio component, an input/output (I/O) interface, a sensor component, and a communication component; wherein a circuit board is arranged inside the space enclosed by a housing, and the processor and the memory are arranged on the circuit board; the power supply circuit is configured to supply power to each circuit or device of the computer device; the memory is configured to store executable program code; the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform:
constructing training data;
labeling each piece of data in the training data based on a category to be classified, obtaining a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified;
determining, based on a user click behavior log, the features related to each piece of first data as a first feature, and the features related to each piece of second data as a second feature, the first feature/second feature comprising: a literal feature and a user behavior feature;
training, according to the first feature and the second feature, the label corresponding to each piece of first data/each piece of second data.
12. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the data disambiguation method as claimed in any one of claims 1-5.
13. A computer program product, wherein, when instructions in the computer program product are executed by a processor, a data disambiguation method is performed, the method comprising:
constructing training data;
labeling each piece of data in the training data based on a category to be classified, obtaining a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified;
determining, based on a user click behavior log, the features related to each piece of first data as a first feature, and the features related to each piece of second data as a second feature, the first feature/second feature comprising: a literal feature and a user behavior feature;
training, according to the first feature and the second feature, the label corresponding to each piece of first data/each piece of second data.
CN201710807103.2A 2017-09-08 2017-09-08 Data disambiguation method and device and computer equipment Active CN107609094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710807103.2A CN107609094B (en) 2017-09-08 2017-09-08 Data disambiguation method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN107609094A (en) 2018-01-19
CN107609094B (en) 2020-12-04

Family

ID=61062901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710807103.2A Active CN107609094B (en) 2017-09-08 2017-09-08 Data disambiguation method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN107609094B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021259002A1 (en) * 2020-06-23 2021-12-30 平安科技(深圳)有限公司 Decision tree-based method and apparatus for outputting abnormal data sources, and computer device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110231347A1 (en) * 2010-03-16 2011-09-22 Microsoft Corporation Named Entity Recognition in Query
US20120226641A1 (en) * 2010-09-29 2012-09-06 Yahoo! Inc. Training a search query intent classifier using wiki article titles and a search click log
CN103617239A (en) * 2013-11-26 2014-03-05 百度在线网络技术(北京)有限公司 Method and device for identifying named entity and method and device for establishing classification model
CN104915399A (en) * 2015-05-29 2015-09-16 百度在线网络技术(北京)有限公司 Recommended data processing method based on news headline and recommended data processing method system based on news headline
CN105095187A (en) * 2015-08-07 2015-11-25 广州神马移动信息科技有限公司 Search intention identification method and device
CN106339756A (en) * 2016-08-25 2017-01-18 北京百度网讯科技有限公司 Training data generation method and device and searching method and device

Also Published As

Publication number Publication date
CN107609094B (en) 2020-12-04

Similar Documents

Publication Publication Date Title
WO2019227710A1 (en) Network public opinion analysis method and apparatus, and computer-readable storage medium
CN109271493B (en) Language text processing method and device and storage medium
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN103914478B (en) Webpage training method and system, webpage Forecasting Methodology and system
CN108777674B (en) Phishing website detection method based on multi-feature fusion
CN104408093B (en) A kind of media event key element abstracting method and device
CN107437038B (en) Webpage tampering detection method and device
CN109145216A (en) Network public-opinion monitoring method, device and storage medium
CN109376963B (en) Criminal case and criminal name and criminal law joint prediction method based on neural network
CN110032632A (en) Intelligent customer service answering method, device and storage medium based on text similarity
CN104679825B (en) Macroscopic abnormity of earthquake acquisition of information based on network text and screening technique
CN104102721A (en) Method and device for recommending information
CN111899089A (en) Enterprise risk early warning method and system based on knowledge graph
CN107341399A (en) Assess the method and device of code file security
CN111581956B (en) Sensitive information identification method and system based on BERT model and K nearest neighbor
CN104915399A (en) Recommended data processing method based on news headline and recommended data processing method system based on news headline
CN110222260A (en) A kind of searching method, device and storage medium
CN108229170A (en) Utilize big data and the software analysis method and device of neural network
CN108763313A (en) On-line training method, server and the storage medium of model
CN108984514A (en) Acquisition methods and device, storage medium, the processor of word
CN113971398A (en) Dictionary construction method for rapid entity identification in network security field
CN115659008A (en) Information pushing system and method for big data information feedback, electronic device and medium
CN115757991A (en) Webpage identification method and device, electronic equipment and storage medium
Gopal et al. Machine learning based classification of online news data for disaster management
CN108595466B (en) Internet information filtering and internet user information and network card structure analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant