CN105760526A - News classification method and device - Google Patents

News classification method and device

Info

Publication number
CN105760526A
Authority
CN
China
Prior art keywords
score value
press release
area name
matching result
obtains
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610115723.5A
Other languages
Chinese (zh)
Other versions
CN105760526B (en
Inventor
钱烽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN201610115723.5A
Publication of CN105760526A
Application granted
Publication of CN105760526B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Abstract

Embodiments of the invention provide a news classification method and device. The method comprises: extracting the title of a news article; performing target-category matching on the title to obtain a first matching result; calculating a first score of the first matching result; and, when the first score is judged to satisfy a first preset condition, assigning the news article to the target category corresponding to the first matching result. With this scheme, nobody has to read each news article and classify it according to labels assigned from its content, which overcomes the low efficiency, poor timeliness and relatively low accuracy of the prior art.

Description

A news classification method and device
Technical field
Embodiments of the present invention relate to the field of computer technology and, more specifically, to a news classification method and device.
Background
This section is intended to provide background or context for the embodiments of the invention recited in the claims. Nothing described here is admitted to be prior art merely because it is included in this section.
News is information disseminated through media channels such as newspapers, radio stations, broadcasts, television stations and the Internet. It mainly reports facts that have occurred recently or recent changes in facts, so timeliness is particularly important for news.
In daily life, news needs to be classified so that readers can quickly find the items that interest them. The classification method commonly used at present is mainly manual: a person reads each news article and classifies it according to labels assigned on the basis of its content. For example, each article is labelled with its corresponding region according to its content, and articles are grouped by region to form the local news for that region.
Summary of the invention
However, because the current method requires manual processing, it suffers from low efficiency, poor timeliness and relatively low accuracy, and it is a very laborious process.
An improved news classification method and device are therefore highly desirable, in order to overcome the low efficiency, poor timeliness and relatively low accuracy of the prior art.
In this context, embodiments of the present invention are intended to provide a news classification method and device.
In a first aspect of embodiments of the present invention, a news classification method is provided, comprising:
Extracting the headline of a news article;
Performing target-category matching on the headline to obtain a first matching result;
Calculating a first score of the first matching result and, when the first score is judged to satisfy a first preset condition, assigning the news article to the target category corresponding to the first matching result.
In one embodiment, in the method according to the above embodiment of the present invention, performing target-category matching on the headline to obtain the first matching result comprises:
Performing region-name matching on the headline to obtain at least one region name;
Calculating the first score of the first matching result comprises:
For each region name of the at least one region name, respectively:
Determining the base score corresponding to the region name and the number of times the region name occurs in the headline;
Taking the product of the base score and the number of occurrences as the first initial score corresponding to the region name;
Determining the maximum and the second-largest value among all first initial scores, the second-largest value being the first initial score that is less than the maximum and greater than every other first initial score apart from the maximum;
Taking the ratio obtained by dividing the maximum by the second-largest value as the first score;
Assigning the news article to the target category corresponding to the first matching result comprises:
Assigning the news article to the region name corresponding to the largest of all first initial scores.
In some embodiments, in the method according to any of the above embodiments of the present invention, judging that the first score satisfies the first preset condition comprises:
Judging that the first score is greater than or equal to 1.5.
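The headline scoring described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the plain substring counting, and the handling of a single matched name or a tie for the maximum are assumptions, since the text does not specify them. Base scores are assumed to be positive.

```python
from typing import Dict, Optional

FIRST_THRESHOLD = 1.5  # from the text: the first score must be >= 1.5


def initial_scores(text: str, base_scores: Dict[str, float]) -> Dict[str, float]:
    """Base score times occurrence count for every region name found in the text."""
    scores = {}
    for name, base in base_scores.items():
        count = text.count(name)  # assumption: plain substring counting
        if count > 0:
            scores[name] = base * count
    return scores


def classify_by_title(title: str, base_scores: Dict[str, float],
                      threshold: float = FIRST_THRESHOLD) -> Optional[str]:
    """Return the winning region name, or None if the headline is not decisive."""
    scores = initial_scores(title, base_scores)
    if not scores:
        return None  # no region name matched
    best_name = max(scores, key=scores.get)
    best = scores[best_name]
    if sum(1 for s in scores.values() if s == best) > 1:
        return None  # assumption: a tie for the maximum is treated as undecided
    lower = [s for s in scores.values() if s < best]
    if not lower:
        return best_name  # assumption: a single matched name is decisive
    first_score = best / max(lower)  # maximum divided by second-largest value
    return best_name if first_score >= threshold else None
```

A return value of None signals that the headline alone does not settle the region, in which case a later stage would have to decide.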
In some embodiments, in the method according to any of the above embodiments of the present invention, if the first score is judged not to satisfy the first preset condition, the method further comprises:
Extracting the body content of the news article;
Performing target-category matching on the body content to obtain a second matching result;
Calculating a second score of the second matching result and, when the second score is judged to satisfy a second preset condition, assigning the news article to the target category corresponding to the second matching result.
In some embodiments, in the method according to any of the above embodiments of the present invention, performing target-category matching on the body content to obtain the second matching result comprises:
Performing region-name matching on the body content to obtain at least one region name;
Calculating the second score of the second matching result comprises:
For each region name of the at least one region name, respectively:
Determining the base score corresponding to the region name and the number of times the region name occurs in the body content;
Taking the product of the base score and the number of occurrences as the second initial score corresponding to the region name;
Determining the maximum among all second initial scores and the number of times a target region name occurs in the body content, the target region name being the region name corresponding to the maximum;
Taking, as the second score, the value obtained by subtracting the number of occurrences corresponding to each of the remaining region names from the number of times the target region name occurs in the body content;
Wherein the remaining region names are the region names of the at least one region name other than the region name corresponding to the maximum;
Assigning the news article to the target category corresponding to the second matching result comprises:
Assigning the news article to the region name corresponding to the largest of all second initial scores.
In some embodiments, in the method according to any of the above embodiments of the present invention, judging that the second score satisfies the second preset condition comprises:
Judging that the second score is greater than or equal to 3.
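A minimal sketch of the body-content scoring described above. The text says the occurrence count of each remaining region name is subtracted from the count of the target region name; this sketch assumes that means subtracting the sum of the remaining counts, which is only one possible reading. The function name and the substring counting are likewise assumptions.

```python
from typing import Dict, Optional

SECOND_THRESHOLD = 3  # from the text: the second score must be >= 3


def classify_by_body(body: str, base_scores: Dict[str, float],
                     threshold: float = SECOND_THRESHOLD) -> Optional[str]:
    """Return the winning region name, or None if the body is not decisive."""
    counts = {name: body.count(name) for name in base_scores}
    counts = {name: c for name, c in counts.items() if c > 0}
    if not counts:
        return None  # no region name matched
    # Second initial score: base score times number of occurrences.
    initial = {name: base_scores[name] * c for name, c in counts.items()}
    target = max(initial, key=initial.get)
    # Assumption: the counts of all remaining region names are summed and subtracted.
    remaining = sum(c for name, c in counts.items() if name != target)
    second_score = counts[target] - remaining
    return target if second_score >= threshold else None
```

Note that the target is chosen by initial score (base score times count), while the second score itself compares raw occurrence counts, as the text describes.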
In some embodiments, in the method according to any of the above embodiments of the present invention, after the second score is judged not to satisfy the second preset condition, the method further comprises:
Predicting, according to a classification model, the probability of the region to which the news article belongs;
When the probability is judged to be greater than a threshold, treating the news article as a news article of that region.
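The probability check above can be sketched as a small helper. The text does not give the threshold value, so the default of 0.5 here is purely hypothetical, as are the function name and the dictionary of per-region probabilities assumed to come from the classification model.

```python
from typing import Dict, Optional


def classify_by_model(probabilities: Dict[str, float],
                      threshold: float = 0.5) -> Optional[str]:
    """Pick the most probable region if its probability exceeds the threshold.

    `probabilities` maps region names to model-predicted probabilities
    (hypothetical interface); the 0.5 default is an illustrative value only.
    """
    if not probabilities:
        return None
    region = max(probabilities, key=probabilities.get)
    # The text requires the probability to be strictly greater than the threshold.
    return region if probabilities[region] > threshold else None
```

Returning None leaves the article unassigned when the model is not confident enough.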
In some embodiments, in the method according to any of the above embodiments of the present invention, before predicting the probability of the region to which the news article belongs according to the classification model, the method further comprises:
Obtaining a corpus, the corpus comprising the news articles for which the first score was judged to satisfy the first preset condition together with their corresponding region names, and/or the news articles for which the second score was judged to satisfy the second preset condition together with their corresponding region names; and
Obtaining the classification model based on the corpus.
In some embodiments, in the method according to any of the above embodiments of the present invention, obtaining the classification model based on the corpus comprises:
Extracting keywords from each news article in the corpus using a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
Encoding each news article as a feature vector according to its article attributes and keywords;
Performing feature selection and feature combination on the corpus encoded as feature vectors;
Training a multi-class logistic regression model on the corpus after feature selection and feature combination to obtain the classification model.
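The keyword-extraction and encoding steps above can be illustrated with a small TF-IDF sketch. It assumes whitespace tokenisation (Chinese text would need a word segmenter), an unsmoothed inverse document frequency of log(N / df), and a binary keyword-presence encoding; none of these details are given in the text, and article attributes, feature selection and feature combination are left out.

```python
import math
from collections import Counter
from typing import List


def tfidf_keywords(docs: List[str], top_k: int = 3) -> List[List[str]]:
    """Return the top_k highest-weighted terms of each document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    keywords = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending weight, breaking ties alphabetically.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        keywords.append(ranked[:top_k])
    return keywords


def encode(keywords: List[str], vocabulary: List[str]) -> List[int]:
    """Binary feature vector: 1 where a vocabulary term is among the keywords."""
    present = set(keywords)
    return [1 if term in present else 0 for term in vocabulary]
```

Terms that appear in every document get a weight of zero under this formula, so they rank last among the keyword candidates.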
In a second aspect of embodiments of the present invention, a news classification device is provided, comprising:
An extraction unit, configured to extract the headline of a news article;
A matching unit, configured to perform target-category matching on the headline to obtain a first matching result;
A computing unit, configured to calculate a first score of the first matching result;
A judging unit, configured to judge whether the first score satisfies a first preset condition;
A classification unit, configured to assign the news article to the target category corresponding to the first matching result when the judging unit judges that the first score satisfies the first preset condition.
In one embodiment, in the device according to the above embodiment of the present invention, when performing target-category matching on the headline to obtain the first matching result, the matching unit is specifically configured to:
Perform region-name matching on the headline to obtain at least one region name;
The computing unit comprises a determining unit and a product computing unit, wherein:
The determining unit is configured to, for each region name of the at least one region name, respectively determine the base score corresponding to the region name and the number of times the region name occurs in the headline;
The product computing unit is configured to take the product of the base score and the number of occurrences as the first initial score corresponding to the region name;
The determining unit is further configured to determine the maximum and the second-largest value among all first initial scores, the second-largest value being the first initial score that is less than the maximum and greater than every other first initial score apart from the maximum;
The determining unit is further configured to take the ratio obtained by dividing the maximum by the second-largest value as the first score;
The classification unit is specifically configured to assign the news article to the region name corresponding to the largest of all first initial scores.
In some embodiments, in the device according to any of the above embodiments of the present invention, when judging that the first score satisfies the first preset condition, the judging unit is specifically configured to:
Judge that the first score is greater than or equal to 1.5.
In some embodiments, in the device according to any of the above embodiments of the present invention, the extraction unit is further configured to extract the body content of the news article;
The matching unit is further configured to perform target-category matching on the body content to obtain a second matching result;
The computing unit is further configured to calculate a second score of the second matching result;
The judging unit is further configured to judge whether the second score satisfies a second preset condition;
The classification unit is further configured to assign the news article to the target category corresponding to the second matching result when the judging unit judges that the second score satisfies the second preset condition.
In some embodiments, in the device according to any of the above embodiments of the present invention, when performing target-category matching on the body content to obtain the second matching result, the matching unit is specifically configured to:
Perform region-name matching on the body content to obtain at least one region name;
The computing unit comprises a determining unit and a product computing unit, wherein:
The determining unit is configured to, for each region name of the at least one region name, respectively determine the base score corresponding to the region name and the number of times the region name occurs in the body content;
The product computing unit is configured to take the product of the base score and the number of occurrences as the second initial score corresponding to the region name;
The determining unit is further configured to determine the maximum among all second initial scores and the number of times a target region name occurs in the body content, the target region name being the region name corresponding to the maximum;
The computing unit is further configured to take, as the second score, the value obtained by subtracting the number of occurrences corresponding to each of the remaining region names from the number of times the target region name occurs in the body content;
Wherein the remaining region names are the region names of the at least one region name other than the region name corresponding to the maximum;
The classification unit is specifically configured to assign the news article to the region name corresponding to the largest of all second initial scores.
In some embodiments, in the device according to any of the above embodiments of the present invention, when judging that the second score satisfies the second preset condition, the judging unit is specifically configured to:
Judge that the second score is greater than or equal to 3.
In some embodiments, in the device according to any of the above embodiments of the present invention, the device further comprises an algorithm unit, configured to predict, according to a classification model, the probability of the region to which the news article belongs and, when the probability is judged to be greater than a threshold, to treat the news article as a news article of that region.
In some embodiments, in the device according to any of the above embodiments of the present invention, the algorithm unit comprises an acquiring unit and a training unit, wherein:
The acquiring unit is configured to obtain a corpus, the corpus comprising the news articles for which the first score was judged to satisfy the first preset condition together with their corresponding region names, and/or the news articles for which the second score was judged to satisfy the second preset condition together with their corresponding region names;
The training unit is configured to obtain the classification model based on the corpus.
In some embodiments, in the device according to any of the above embodiments of the present invention, the algorithm unit further comprises a coding unit and a feature processing unit, wherein:
The extraction unit is further configured to extract keywords from each news article in the corpus using a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
The coding unit is configured to encode each news article as a feature vector according to its article attributes and keywords;
The feature processing unit is configured to perform feature selection and feature combination on the corpus encoded as feature vectors;
The training unit is further configured to train a multi-class logistic regression model on the corpus after feature selection and feature combination to obtain the classification model.
In a third aspect of embodiments of the present invention, a news classification method is provided, comprising:
Performing target-category matching on a news article to obtain a matching result;
Calculating a score of the matching result and, when the score is judged to satisfy a preset condition, assigning the news article to the target category corresponding to the matching result;
Training a classification model based on the news article and its corresponding target category; and, when the score is judged not to satisfy the preset condition, classifying the news article based on the classification model.
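The overall flow of this aspect, rule-based matching first, with a model trained on the rule-labelled articles as a fallback, might be sketched as below. The class, the callable interfaces and the toy rule and trainer are all hypothetical stand-ins for illustration; the text does not prescribe any of them.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Corpus = List[Tuple[str, str]]            # (article text, target category)
Rule = Callable[[str], Optional[str]]     # returns a category, or None if undecided
Model = Callable[[str], Optional[str]]
Trainer = Callable[[Corpus], Model]


class CascadeClassifier:
    """Rule-based matching first; a model trained on rule-labelled articles as fallback."""

    def __init__(self, rule: Rule, trainer: Trainer) -> None:
        self.rule = rule
        self.trainer = trainer
        self.corpus: Corpus = []
        self.model: Optional[Model] = None

    def classify(self, article: str) -> Optional[str]:
        category = self.rule(article)
        if category is not None:
            # Score met the preset condition: keep the pair as training data.
            self.corpus.append((article, category))
            return category
        # Score did not meet the preset condition: fall back to the model, if any.
        return self.model(article) if self.model is not None else None

    def retrain(self) -> None:
        """Rebuild the model from the accumulated corpus."""
        if self.corpus:
            self.model = self.trainer(self.corpus)


# Toy stand-ins, for illustration only.
def toy_rule(article: str) -> Optional[str]:
    return "Hangzhou" if "Hangzhou" in article else None


def toy_trainer(corpus: Corpus) -> Model:
    most_common = Counter(label for _, label in corpus).most_common(1)[0][0]
    return lambda article: most_common
```

Keeping the rule and the trainer as injected callables means the same cascade works whether the rule is headline matching, body matching, or both in sequence.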
In one embodiment, in the method according to the above embodiment of the present invention, performing target-category matching on a news article to obtain a matching result comprises:
Extracting the headline of the news article; and
Performing target-category matching on the headline to obtain a first matching result;
Calculating a score of the matching result and, when the score is judged to satisfy a preset condition, assigning the news article to the target category corresponding to the matching result comprises:
Calculating a first score of the first matching result and, when the first score is judged to satisfy a first preset condition, assigning the news article to the target category corresponding to the first matching result.
In some embodiments, in the method according to any of the above embodiments of the present invention, performing target-category matching on the headline to obtain the first matching result comprises:
Performing region-name matching on the headline to obtain at least one region name;
Calculating the first score of the first matching result comprises:
For each region name of the at least one region name, respectively:
Determining the base score corresponding to the region name and the number of times the region name occurs in the headline;
Taking the product of the base score and the number of occurrences as the first initial score corresponding to the region name;
Determining the maximum and the second-largest value among all first initial scores, the second-largest value being the first initial score that is less than the maximum and greater than every other first initial score apart from the maximum;
Taking the ratio obtained by dividing the maximum by the second-largest value as the first score;
Assigning the news article to the target category corresponding to the first matching result comprises:
Assigning the news article to the region name corresponding to the largest of all first initial scores.
In some embodiments, in the method according to any of the above embodiments of the present invention, if the first score is judged not to satisfy the first preset condition, the method further comprises:
Extracting the body content of the news article;
Performing target-category matching on the body content to obtain a second matching result;
Calculating a second score of the second matching result and, when the second score is judged to satisfy a second preset condition, assigning the news article to the target category corresponding to the second matching result.
In some embodiments, in the method according to any of the above embodiments of the present invention, performing target-category matching on the body content to obtain the second matching result comprises:
Performing region-name matching on the body content to obtain at least one region name;
Calculating the second score of the second matching result comprises:
For each region name of the at least one region name, respectively:
Determining the base score corresponding to the region name and the number of times the region name occurs in the body content;
Taking the product of the base score and the number of occurrences as the second initial score corresponding to the region name;
Determining the maximum among all second initial scores and the number of times a target region name occurs in the body content, the target region name being the region name corresponding to the maximum;
Taking, as the second score, the value obtained by subtracting the number of occurrences corresponding to each of the remaining region names from the number of times the target region name occurs in the body content;
Wherein the remaining region names are the region names of the at least one region name other than the region name corresponding to the maximum;
Assigning the news article to the target category corresponding to the second matching result comprises:
Assigning the news article to the region name corresponding to the largest of all second initial scores.
In some embodiments, in the method according to any of the above embodiments of the present invention, training a classification model based on the news article and its corresponding target category comprises:
Obtaining a corpus, the corpus comprising the news articles for which the first score was judged to satisfy the first preset condition together with their corresponding region names, and/or the news articles for which the second score was judged to satisfy the second preset condition together with their corresponding region names; and
Obtaining the classification model based on the corpus.
In some embodiments, in the method according to any of the above embodiments of the present invention, obtaining the classification model based on the corpus comprises:
Extracting keywords from each news article in the corpus using a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
Encoding each news article as a feature vector according to its article attributes and keywords;
Performing feature selection and feature combination on the corpus encoded as feature vectors;
Training a multi-class logistic regression model on the corpus after feature selection and feature combination to obtain the classification model.
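The final training step names a multi-class logistic regression model. A minimal pure-Python sketch of such a model (softmax regression) is given below. Plain stochastic gradient descent, the learning rate, the epoch count and all function names are assumptions for illustration; the text does not say how the model is fitted.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights: List[List[float]], biases: List[float],
                  x: List[float]) -> List[float]:
    """Class probabilities for one feature vector."""
    logits = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


def train_softmax(X: List[List[float]], y: List[int], n_classes: int,
                  lr: float = 0.5, epochs: int = 200
                  ) -> Tuple[List[List[float]], List[float]]:
    """Fit one weight row and one bias per class with plain SGD (assumed optimiser)."""
    n_features = len(X[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = predict_proba(weights, biases, x)
            for k in range(n_classes):
                # Gradient of the cross-entropy loss with respect to logit k.
                grad = probs[k] - (1.0 if k == label else 0.0)
                for j in range(n_features):
                    weights[k][j] -= lr * grad * x[j]
                biases[k] -= lr * grad
    return weights, biases
```

Here each class index would stand for one region, and the probabilities returned by predict_proba are what a probability threshold would then be applied to.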
In some embodiments, in the method according to any of the above embodiments of the present invention, classifying the news article based on the classification model comprises:
Predicting, according to the classification model, the probability of the region to which the news article belongs;
When the probability is judged to be greater than a threshold, treating the news article as a news article of that region.
In some embodiments, in the method according to any of the above embodiments of the present invention, training a classification model based on the news article and its corresponding target category comprises:
Periodically training the classification model based on the news article and its corresponding target category.
In a fourth aspect of embodiments of the present invention, a news classification device is provided, comprising:
A matching unit, configured to perform target-category matching on a news article to obtain a matching result;
A computing unit, configured to calculate a score of the matching result;
A judging unit, configured to judge whether the score satisfies a preset condition;
A classification unit, configured to assign the news article to the target category corresponding to the matching result when the judging unit judges that the score satisfies the preset condition;
An algorithm unit, configured to train a classification model based on the news article and its corresponding target category;
The classification unit is further configured to classify the news article based on the classification model when the judging unit judges that the score does not satisfy the preset condition.
In one embodiment, in the device according to the above embodiment of the present invention, the device further comprises an extraction unit, configured to extract the headline of the news article;
The matching unit is specifically configured to perform target-category matching on the headline to obtain a first matching result;
The computing unit is specifically configured to calculate a first score of the first matching result;
The classification unit is specifically configured to assign the news article to the target category corresponding to the first matching result when the judging unit judges that the first score satisfies a first preset condition.
In some embodiments, in the device according to any of the above embodiments of the present invention, when performing target-category matching on the headline to obtain the first matching result, the matching unit is specifically configured to:
Perform region-name matching on the headline to obtain at least one region name;
The computing unit comprises a determining unit and a product computing unit, wherein:
The determining unit is configured to, for each region name of the at least one region name, respectively determine the base score corresponding to the region name and the number of times the region name occurs in the headline;
The product computing unit is configured to take the product of the base score and the number of occurrences as the first initial score corresponding to the region name;
The determining unit is further configured to determine the maximum and the second-largest value among all first initial scores, the second-largest value being the first initial score that is less than the maximum and greater than every other first initial score apart from the maximum;
The determining unit is further configured to take the ratio obtained by dividing the maximum by the second-largest value as the first score;
The classification unit is specifically configured to assign the news article to the region name corresponding to the largest of all first initial scores.
In some embodiments, in the device according to any of the above embodiments of the present invention, the extraction unit is further configured to extract the body content of the news article;
The matching unit is further configured to perform target-category matching on the body content to obtain a second matching result;
The computing unit is further configured to calculate a second score of the second matching result;
The judging unit is further configured to judge whether the second score satisfies a second preset condition;
The classification unit is further configured to assign the news article to the target category corresponding to the second matching result when the judging unit judges that the second score satisfies the second preset condition.
In some embodiments, in the device according to any of the above embodiments of the present invention, when performing target-category matching on the body content to obtain the second matching result, the matching unit is specifically configured to:
Perform region-name matching on the body content to obtain at least one region name;
The computing unit comprises a determining unit and a product computing unit, wherein:
The determining unit is configured to, for each region name of the at least one region name, respectively determine the base score corresponding to the region name and the number of times the region name occurs in the body content;
The product computing unit is configured to take the product of the base score and the number of occurrences as the second initial score corresponding to the region name;
The determining unit is further configured to determine the maximum among all second initial scores and the number of times a target region name occurs in the body content, the target region name being the region name corresponding to the maximum;
The computing unit is further configured to take, as the second score, the value obtained by subtracting the number of occurrences corresponding to each of the remaining region names from the number of times the target region name occurs in the body content;
Wherein the remaining region names are the region names of the at least one region name other than the region name corresponding to the maximum;
The classification unit is specifically configured to assign the news article to the region name corresponding to the largest of all second initial scores.
In some embodiments, in the device according to any of the above embodiments of the present invention, the algorithm unit includes an acquiring unit and a training unit, wherein:
The acquiring unit is configured to obtain a corpus, the corpus including the news articles for which the first score was judged to satisfy the first preset condition together with their corresponding area names, and/or the news articles for which the second score was judged to satisfy the second preset condition together with their corresponding area names; and
The training unit is configured to obtain the classification model based on the corpus.
In some embodiments, in the device according to any of the above embodiments of the present invention, the algorithm unit further includes a coding unit and a feature processing unit, wherein:
The extraction unit is further configured to extract keywords from each news article in the corpus using a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
The coding unit is configured to encode each news article into a feature vector according to its article attributes and keywords;
The feature processing unit is configured to perform feature selection and feature combination on the corpus encoded as feature vectors;
The training unit is further configured to train on the corpus, after feature selection and feature combination, using a multi-class logistic regression model to obtain the classification model.
In some embodiments, in the device according to any of the above embodiments of the present invention, the algorithm unit is specifically configured to predict, according to the classification model, the probability of the area to which the news article belongs, and, when the probability is judged to be greater than a threshold, to treat the news article as a news article of that area.
In some embodiments, in the device according to any of the above embodiments of the present invention, the algorithm unit is specifically configured to periodically train the classification model based on the news articles and their corresponding target categories.
In the embodiments of the present invention, a news classification method is proposed: extract the headline of a news article; perform target-category matching on the headline to obtain a first matching result; calculate a first score of the first matching result and, when the first score is judged to satisfy a first preset condition, assign the news article to the target category corresponding to the first matching result. Because this solution avoids manually reading every news article and classifying it by labeling its content, it overcomes the low efficiency, poor timeliness, and low accuracy of the prior art;
In the embodiments of the present invention, another news classification method is also proposed: perform target-category matching on a news article to obtain a matching result; calculate a score of the matching result and, when the score is judged to satisfy a preset condition, assign the news article to the target category corresponding to the matching result; train a classification model based on the news articles and their corresponding target categories; and, when the score is judged not to satisfy the preset condition, classify the news article based on the classification model. Because this solution likewise avoids manually reading every news article and classifying it by labeling its content, it overcomes the low efficiency, poor timeliness, and low accuracy of the prior art.
Brief Description of the Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present invention will become easier to understand by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present invention are shown by way of example and not limitation, in which:
Fig. 1A schematically shows a flowchart of news classification according to an embodiment of the present invention;
Fig. 1B schematically shows a flowchart of classification according to body content according to an embodiment of the present invention;
Fig. 1C schematically shows a flowchart of obtaining a classification model according to an embodiment of the present invention;
Fig. 1D schematically shows a flowchart of classifying a news article according to a classification model according to an embodiment of the present invention;
Fig. 2 schematically shows a flowchart of news classification according to an embodiment of the present invention;
Fig. 3 schematically shows a schematic diagram of a news classification device according to an embodiment of the present invention;
Fig. 4 schematically shows another schematic diagram of a news classification device according to another embodiment of the present invention;
Fig. 5 schematically shows another schematic diagram of a news classification device according to another embodiment of the present invention;
Fig. 6 schematically shows another schematic diagram of a news classification device according to another embodiment of the present invention;
In the drawings, identical or corresponding reference numerals denote identical or corresponding parts.
Detailed Description
The principles and spirit of the present invention are described below with reference to several exemplary embodiments. It should be understood that these embodiments are provided only to enable those skilled in the art to better understand and implement the present invention, and not to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure is thorough and complete and fully conveys its scope to those skilled in the art. "An embodiment" or "an implementation" in the description may refer either to a single embodiment or implementation or to several of them.
Those skilled in the art know that embodiments of the present invention may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, microcode, etc.), or in a combination of hardware and software.
According to embodiments of the present invention, a news classification method and device are proposed.
It should be noted that the number of any element in the drawings is illustrative rather than limiting, and that any naming is used only for distinction and carries no limiting meaning.
The technical terms involved in the present invention are briefly described below so that the relevant personnel can better understand this solution.
Supervised machine learning classification algorithm: given a set of training data labeled with categories, a mathematical model and an optimization algorithm are used to fit the training data; the resulting mathematical model can then be used to predict the category of samples whose category is unknown. Examples include logistic regression, naive Bayes, and support vector machines.
Classification model: the mathematical model obtained after a supervised machine learning classification algorithm has fitted a training data set.
Corpus: a training data set of text labeled with categories.
Bootstrapping: a way of reaching a certain effect at system start-up by relying on the system's own strategy, without external resources.
Accuracy: after a classification model trained with a supervised machine learning classification algorithm predicts a set of test samples of unknown category, the proportion of predictions that agree with the true categories of the test samples. Accuracy can be used to measure the classification ability of a classification algorithm.
AC automaton (Aho-Corasick) algorithm: an algorithm that builds a dictionary tree (trie) to quickly find the number of occurrences of words in a text. It is often used by search engine systems for word-frequency matching in text, and its search efficiency is higher than that of a hash table.
Threshold: also called a critical value; the lowest or highest value at which an effect can be produced.
Multi-class logistic regression model: a supervised machine learning classification algorithm that adopts the sigmoid function as its fitting hypothesis and can classify two or more categories.
Vector space model: reduces the processing of text content to vector operations in a vector space and expresses semantic similarity as spatial similarity. When documents are represented as vectors in the document space, the similarity between documents can be measured by computing the cosine distance between the vectors.
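As an illustration of the vector space model described above, the following minimal sketch computes the cosine similarity between two documents that have already been encoded as equal-length numeric vectors. The function name and the sample vectors are assumptions made for this example; the source does not specify how the vectors are built.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors over a three-word vocabulary.
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]   # same direction as doc_a, so similarity is (numerically) 1
doc_c = [0, 0, 3]   # shares no terms with doc_a, so similarity is 0

similar = cosine_similarity(doc_a, doc_b)
unrelated = cosine_similarity(doc_a, doc_c)
```

Documents that use the same words in the same proportions point in the same direction regardless of length, which is why the cosine rather than the raw dot product is used.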
TF-IDF (term frequency-inverse document frequency) algorithm: a commonly used weighting technique for information retrieval and text mining, used to assess how important each word is to an article. The importance of a word increases in proportion to the number of times it occurs in the article, but decreases in inverse proportion to the frequency with which it occurs in the corpus. TF-IDF is often applied by search engines as a measure of the relevance between a document and a user query.
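The weighting described above can be sketched as follows. This is one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency); the source does not state which variant is used, and the function name and sample tokens are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already tokenized articles.
docs = [
    ["hangzhou", "football", "hangzhou"],
    ["ningbo", "football"],
]
weights = tf_idf(docs)
# "football" occurs in every document, so its weight is 0;
# "hangzhou" is frequent in the first article and absent elsewhere, so it ranks highest there.
```

Keywords for an article can then be taken as the terms with the largest weights in its dictionary.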
Feature selection: selecting the N most important features from the original M features of the training data set so as to optimize the classification performance of the machine learning algorithm. Feature selection is the process of choosing the most effective features from the original features in order to reduce the dimensionality of the data set; it is an important means of improving the performance of a learning algorithm and a key data preprocessing step in pattern recognition.
Feature combination: applying linear or nonlinear combinations to the original M features of the training data set to obtain N new features, which are then concatenated with the original features. The resulting M+N features are used by the machine learning classification algorithm to optimize its performance.
Overview of the Invention
The inventor found that the prior art classifies news manually and therefore suffers from low efficiency, poor timeliness, and low accuracy; avoiding manual classification can improve the efficiency, timeliness, and accuracy of news classification.
Having described the basic principle of the present invention, various non-limiting embodiments of the invention are introduced in detail below.
Overview of Application Scenarios
For example, for a news article titled "Anhui long-distance female passenger loses contact midway", the headline "Anhui long-distance female passenger loses contact midway" is first extracted; target-category matching is then performed on the headline to obtain a first matching result; a first score of the first matching result is calculated; and, when the first score is determined to satisfy the first preset condition, the news article is assigned to the target category corresponding to the first matching result.
The news articles involved in the present invention may be NetEase news articles or other news articles, which is not specifically limited here.
Exemplary Method
The news classification method according to exemplary embodiments of the present invention is described below with reference to Fig. 1A and Fig. 2. It should be noted that the above application scenario is shown only to facilitate understanding of the spirit and principle of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, the embodiments of the present invention may be applied to any applicable scenario.
Fig. 1A schematically shows a flowchart of a method 10 for news classification according to an embodiment of the present invention. As shown in Fig. 1A, the method may include steps 100, 110, and 120.
Method 10 begins at step 100, in which the headline of a news article is extracted.
The news article in the embodiments of the present invention may be a NetEase news article or, of course, a news article from other media, which is not specifically limited here.
In the embodiments of the present invention, the headline of a news article can be extracted in various ways, which are not specifically limited here.
After step 100, step 110 may be performed, in which target-category matching is performed on the headline to obtain a first matching result.
In the embodiments of the present invention, when target-category matching is performed on the headline to obtain the first matching result, the following approach may optionally be adopted:
Perform area name matching on the headline to obtain at least one area name.
For example, if the headline is "Beijing house prices compared with those in Shanghai and Shenzhen", three area names are obtained, because the headline matches three area names.
In the embodiments of the present invention, target-category matching on the headline may be implemented with the AC automaton algorithm; other approaches may of course also be used and are not described in detail here.
It should be noted that some proper nouns may also contain area names. To improve the accuracy of news classification, the area names contained in such proper nouns are not treated as matched area names in the present invention.
For example, the area names inside proper nouns such as "Hangzhou Road" and "Shanghai Volkswagen" are not treated as area names in the present invention.
In the embodiments of the present invention, the proper nouns that contain area names may be stored; for example, they may be obtained by extracting the words related to area names from a public dictionary.
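The matching and exclusion steps above can be sketched as follows. Note that this sketch deliberately uses plain substring counting rather than the AC automaton mentioned in the text, which keeps it short but is slower for large dictionaries; the function name, the English stand-in headlines, and the word lists are assumptions for illustration.

```python
def count_area_names(text, area_names, proper_nouns):
    """Count area-name occurrences, ignoring those inside stored proper nouns."""
    # Blank out proper nouns first (longest first) so the area names
    # embedded in them can no longer match.
    for noun in sorted(proper_nouns, key=len, reverse=True):
        text = text.replace(noun, " " * len(noun))
    counts = {}
    for name in area_names:
        n = text.count(name)  # non-overlapping occurrences
        if n:
            counts[name] = n
    return counts

# Hypothetical dictionaries; a real system would load them from storage.
AREA_NAMES = ["Beijing", "Shanghai", "Shenzhen", "Hangzhou"]
PROPER_NOUNS = ["Hangzhou Road", "Shanghai Volkswagen"]

three_names = count_area_names(
    "Beijing house prices compared with those in Shanghai and Shenzhen",
    AREA_NAMES, PROPER_NOUNS)
excluded = count_area_names(
    "Shanghai Volkswagen opens a plant in Hangzhou",
    AREA_NAMES, PROPER_NOUNS)
```

In the second headline only "Hangzhou" is reported, because "Shanghai" occurs solely inside the stored proper noun.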
After step 110, step 120 may be performed, in which a first score of the first matching result is calculated and, when the first score is judged to satisfy a first preset condition, the news article is assigned to the target category corresponding to the first matching result.
In the embodiments of the present invention, the first matching result is obtained by performing area name matching on the headline to obtain at least one area name. When the first score of the first matching result is calculated, the following approach may optionally be adopted:
For each area name in the at least one area name, perform the following:
Determine the base score corresponding to that area name and the number of times that area name occurs in the headline;
Take the product of the base score and the number of occurrences as the first initial score corresponding to that area name;
Determine the maximum and the second-largest value among all the first initial scores, the second-largest value being the first initial score that is less than the maximum and greater than all the other remaining first initial scores;
Take the ratio obtained by dividing the maximum by the second-largest value as the first score;
Assigning the news article to the target category corresponding to the first matching result includes:
Assigning the news article to the area name corresponding to the largest of all the first initial scores.
In the embodiments of the present invention, area names can be divided into three levels: provincial level, city level, and district level, as shown in Table 1.
Table 1: Classification of area names
No.   Area name    Parent area   Area level
1     Beijing      None          Provincial level
2     Zhejiang     None          Provincial level
3     Hangzhou     Zhejiang      City level
4     Ningbo       Zhejiang      City level
5     West Lake    Hangzhou      District level
In the embodiments of the present invention, area names of different levels may have different base scores. For example, the base score of a provincial-level area name may be greater than that of a city-level area name, and the base score of a city-level area name may be greater than that of a district-level area name.
In the embodiments of the present invention, when determining that the first score satisfies the first preset condition, the following approach may optionally be adopted:
Judge that the first score is greater than or equal to 1.5.
For example, suppose a headline matches three place names, "Wuhan", "Nanjing", and "Hangzhou", with "Wuhan" and "Nanjing" each occurring once and "Hangzhou" occurring twice. Assuming the base score of each of the three place names is 10, their initial scores are 10, 10, and 20 respectively. The maximum is then 20, corresponding to "Hangzhou"; the second-largest value is 10, corresponding to "Wuhan" or "Nanjing"; and the ratio of the maximum to the second-largest value is 2, which is greater than 1.5 and therefore satisfies the preset condition. The news article is then assigned to the category of "Hangzhou", the place name corresponding to the maximum.
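The headline scoring rule and the example above can be sketched as follows. The function and variable names are assumptions; so is the handling of two cases the text does not address, namely a headline that matches only one area name and a tie for the maximum. Base scores are assumed to be positive.

```python
FIRST_THRESHOLD = 1.5  # threshold given in the text

def title_score(counts, base_scores):
    """Return (best_area, first_score) from area-name counts in a headline."""
    # First initial score = base score x number of occurrences.
    initial = {name: base_scores[name] * n for name, n in counts.items()}
    if not initial:
        return None, 0.0
    best = max(initial, key=initial.get)
    maximum = initial[best]
    # Second-largest value: the largest initial score strictly below the maximum.
    lower = [s for s in initial.values() if s < maximum]
    if len(lower) < len(initial) - 1:
        # Assumption: several names tie for the maximum, so there is no clear winner.
        return best, 1.0
    if not lower:
        # Assumption: a single matched name has no second-largest value
        # and is treated as a clear winner.
        return best, float("inf")
    return best, maximum / max(lower)

# The example from the text: base score 10 for every place name.
counts = {"Wuhan": 1, "Nanjing": 1, "Hangzhou": 2}
base_scores = {"Wuhan": 10, "Nanjing": 10, "Hangzhou": 10}
area, score = title_score(counts, base_scores)   # ("Hangzhou", 2.0)
classified = area if score >= FIRST_THRESHOLD else None
```

Level-dependent base scores, such as a higher value for provincial-level names, only require different entries in `base_scores`.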
Besides the above classification method, any other classification method may be used; for example, the number of occurrences may be taken as the first score, and the news article assigned to the place name that occurs most often.
The foregoing describes classifying news according to the headline. In practice, when the first score does not satisfy the first preset condition, the article cannot be classified according to its headline; in that case, the news may further be classified according to its body content. Therefore, in the embodiments of the present invention, if the first score is judged not to satisfy the first preset condition, the method further includes the following operations:
Extract the body content of the news article;
Perform target-category matching on the body content to obtain a second matching result;
Calculate a second score of the second matching result and, when the second score is judged to satisfy a second preset condition, assign the news article to the target category corresponding to the second matching result.
That is, classification is first attempted with the first matching result obtained from the headline of the news article; if the news article cannot be classified according to its headline, its body content can then be used to classify it, as shown in Fig. 1B.
In the embodiments of the present invention, when target-category matching is performed on the body content to obtain the second matching result, the following approach may optionally be adopted:
Perform area name matching on the body content to obtain at least one area name;
Calculating the second score of the second matching result includes:
For each area name in the at least one area name, perform the following:
Determine the base score corresponding to that area name and the number of times that area name occurs in the body content;
Take the product of the base score and the number of occurrences as the second initial score corresponding to that area name;
Determine the maximum of all the second initial scores and the number of times the target area name occurs in the body content, the target area name being the area name corresponding to the maximum;
Take, as the second score, the value obtained by subtracting from the number of times the target area name occurs in the body content the number of occurrences of each of the remaining area names;
Here, the remaining area names are the area names in the at least one area name other than the area name corresponding to the maximum;
Assigning the news article to the target category corresponding to the second matching result includes:
Assigning the news article to the area name corresponding to the largest of all the second initial scores.
For example, suppose the body content is "Last Sunday, the 2015 AFC Vision China Hangzhou Project Elite League kicked off in heavy rain. Eight Hangcheng amateur football teams gathered at the Tonglu football training base, where they will compete for three weeks. The event is sponsored by the Hangzhou Football Association and organized by the Hangzhou Football Administration Center, and is also part of Hangzhou's 34th West Lake Cup Super League. Besides local fans, the spectators included representatives of amateur football clubs from Ningbo and other areas". The area names matched in this body content are Hangzhou and Ningbo.
In the embodiments of the present invention, there are various ways to determine that the second score satisfies the second preset condition; optionally, the following approach may be adopted:
Judge that the second score is greater than or equal to 3.
For example, matching the body content quoted in the example above yields the area names Hangzhou and Ningbo, with "Hangzhou" occurring 4 times and "Ningbo" occurring once. With a base score of 10 for both, the initial score of Hangzhou is 10 × 4 = 40 and the initial score of Ningbo is 10 × 1 = 10. It is first judged whether the value obtained by dividing the initial score of Hangzhou by the initial score of Ningbo is greater than 1.5; here it is, so the news article is classified as a local news article for Hangzhou. If that value were less than 1.5, it would then be judged whether the number of occurrences of Hangzhou minus the number of occurrences of Ningbo is greater than or equal to 3. In this example, the number of occurrences of Hangzhou minus the number of occurrences of Ningbo is 4 - 1 = 3, so the news article would likewise be classified as a local news article for Hangzhou.
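The body-content rule can be sketched as follows, reading the text as subtracting the occurrence count of every remaining area name from the count of the target area name. The function and variable names are assumptions for illustration.

```python
SECOND_THRESHOLD = 3  # threshold given in the text

def body_score(counts, base_scores):
    """Return (target_area, second_score) from area-name counts in the body."""
    # Second initial score = base score x number of occurrences.
    initial = {name: base_scores[name] * n for name, n in counts.items()}
    if not initial:
        return None, 0
    target = max(initial, key=initial.get)
    # Target count minus the count of each remaining area name.
    others = sum(n for name, n in counts.items() if name != target)
    return target, counts[target] - others

# The example from the text: "Hangzhou" occurs 4 times, "Ningbo" once.
counts = {"Hangzhou": 4, "Ningbo": 1}
base_scores = {"Hangzhou": 10, "Ningbo": 10}
area, score = body_score(counts, base_scores)   # ("Hangzhou", 3)
classified = area if score >= SECOND_THRESHOLD else None
```

A difference of counts, unlike the ratio used for headlines, requires the leading area name to dominate in absolute terms, which suits longer texts in which many place names may be mentioned in passing.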
The process described above is bootstrapped: it requires no manual input, can handle large-scale local news classification requests in real time, offers high efficiency and good timeliness, and meets the functional needs of Internet news products.
The foregoing describes classifying first according to the headline and, if that is not possible, then according to the body content. If classification according to the body content is also not possible, classification can be performed according to a classification model. Therefore, in the embodiments of the present invention, after the second score is judged not to satisfy the second preset condition, the method further includes the following operations:
Predict, according to the classification model, the probability of the area to which the news article belongs;
When the probability is judged to be greater than a threshold, treat the news article as a news article of that area.
In the embodiments of the present invention, before the probability of the area to which the news article belongs is predicted according to the classification model, the method further includes the following operations:
Obtain a corpus, the corpus including the news articles for which the first score was judged to satisfy the first preset condition together with their corresponding area names, and/or the news articles for which the second score was judged to satisfy the second preset condition together with their corresponding area names; and
Obtain the classification model based on the corpus.
Here, "before" denotes a logical order. In practice, the step of predicting the probability of the area to which the news article belongs according to the classification model, and the steps of obtaining the corpus and obtaining the classification model based on the corpus, are carried out in parallel according to their respective needs.
In the embodiments of the present invention, when the classification model is obtained based on the corpus, the following approach may optionally be adopted:
Extract keywords from each news article in the corpus using a vector space model and the TF-IDF algorithm;
Encode each news article into a feature vector according to its article attributes and keywords;
Perform feature selection and feature combination on the corpus encoded as feature vectors;
Train on the corpus, after feature selection and feature combination, using a multi-class logistic regression model to obtain the classification model.
Fig. 1C shows the main process of obtaining the classification model according to one embodiment: obtain the corpus; extract keywords from each news article in the corpus; encode each news article into a feature vector according to its article attributes and keywords; perform feature selection and feature combination on the corpus encoded as feature vectors; and then train on the corpus after feature selection and feature combination to obtain the classification model.
In the embodiments of the present invention, the classification model may be updated periodically, for example once a day.
In the embodiments of the present invention, the article attributes include information about the media that published the article and/or the time at which it was published.
In the embodiments of the present invention, if the news article cannot be classified according to the classification model either, it can be determined that the news article cannot be classified.
Fig. 1D is a flowchart of classifying a news article according to the classification model: the classification model predicts the area to which the news article belongs and the corresponding probability; it is then judged whether the probability is greater than the threshold; if so, the news article is treated as a news article of that area; otherwise, the article is considered unclassifiable.
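The decision step of Fig. 1D can be sketched as follows, assuming the classification model has already produced a probability for each candidate area. The function name, the sample probabilities, and the threshold value 0.8 are hypothetical; the text does not give a threshold value.

```python
def classify_by_model(probabilities, threshold):
    """probabilities: {area: predicted probability}. Returns an area or None."""
    if not probabilities:
        return None
    area = max(probabilities, key=probabilities.get)
    # Accept the prediction only when it is strictly above the threshold;
    # otherwise the article is regarded as unclassifiable.
    return area if probabilities[area] > threshold else None

# Hypothetical model outputs for two articles.
confident = classify_by_model({"Hangzhou": 0.91, "Ningbo": 0.09}, threshold=0.8)
uncertain = classify_by_model({"Hangzhou": 0.55, "Ningbo": 0.45}, threshold=0.8)
```

Returning `None` rather than the most likely area keeps low-confidence predictions out of the local news channels, at the cost of leaving some articles unclassified.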
The above method first uses scoring rules to bootstrap a set of local news articles classified with high accuracy, and then uses a supervised machine learning algorithm to train a classification model on these articles, which provides supplementary classification for the other news articles. No manual input is needed, large-scale local news classification requests can be handled in real time, and the functional needs of Internet news products are met.
Fig. 2 schematically shows a flowchart of a method 20 for news classification according to an embodiment of the present invention. As shown in Fig. 2, the method may include steps 200, 210, and 220.
Method 20 begins at step 200, in which target-category matching is performed on a news article to obtain a matching result.
The news article in the embodiments of the present invention may be a NetEase news article or, of course, a news article from other media, which is not specifically limited here.
In one embodiment, performing target-category matching on the news article includes performing target-category matching on the title of the news article.
In the embodiments of the present invention, the headline of a news article can be extracted in various ways, which are not specifically limited here.
In one embodiment, performing target-category matching on the news article includes performing target-category matching on the body content of the news article.
In one embodiment, performing target-category matching on the news article includes performing target-category matching on the full text of the news article. In one embodiment, performing target-category matching on the news article includes first performing target-category matching on the title of the news article and, if classification cannot be achieved according to the title, then performing target-category matching on the body content of the news article.
After step 200, step 210 may be performed, in which a score of the matching result is calculated and, when the score is judged to satisfy a preset condition, the news article is assigned to the target category corresponding to the matching result.
After step 210, step 220 may be performed, in which a classification model is trained based on the news articles and their corresponding target categories; when the score is judged not to satisfy the preset condition, the news article is classified based on the classification model.
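The overall flow of steps 200 to 220 can be sketched as a cascade in which rule-based stages run first, their confident results are collected as training data, and the model is consulted only as a fallback. All names here are assumptions for illustration; each stage is passed in as a function that returns an area name or `None`.

```python
def classify_article(article, title_stage, body_stage, model_stage, corpus):
    """Try the rule-based stages in order, then fall back to the model.

    Each stage maps an article to an area name or None.
    Rule-based hits are appended to `corpus` as (article, area) pairs.
    """
    for stage in (title_stage, body_stage):
        area = stage(article)
        if area is not None:
            # A confident rule-based label becomes training data for the model.
            corpus.append((article, area))
            return area
    # May still be None, in which case the article is unclassifiable.
    return model_stage(article)

# Hypothetical stand-in stages that read precomputed results from the article.
corpus = []
by_title = lambda a: a.get("title_area")
by_body = lambda a: a.get("body_area")
by_model = lambda a: a.get("model_area")

first = classify_article({"title_area": "Hangzhou"}, by_title, by_body, by_model, corpus)
second = classify_article({"model_area": "Ningbo"}, by_title, by_body, by_model, corpus)
```

Only the first article ends up in `corpus`: model predictions are not fed back as training data in this sketch, which matches the text's description of training on the rule-classified articles.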
In the embodiment of the present invention, optionally, target category matching may be performed on the press release to obtain the matching result in the following manner:
Extract the headline of the press release; and
Perform target category matching on the headline to obtain a first matching result.
Optionally, calculating the score value of the matching result and, when the score value is judged to meet the preset condition, assigning the press release to the target category corresponding to the matching result may be done in the following manner:
Calculate a first score value of the first matching result and, when the first score value is judged to meet a first preset condition, assign the press release to the target category corresponding to the first matching result.
In the embodiment of the present invention, target category matching on the headline may be implemented with the Aho-Corasick (AC) automaton algorithm; other approaches may of course also be used and are not described in detail here.
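As an illustration of the AC automaton approach named above (read here as the Aho-Corasick multi-pattern matching algorithm), the following sketch counts how often each area name occurs in a text in a single pass. The function names `build_automaton` and `count_matches` are hypothetical, and the English strings stand in for the original Chinese text.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Breadth-first traversal so that shallower failure links are ready first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target (suffix matches).
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def count_matches(text, automaton):
    """Return {pattern: number of occurrences in text}."""
    goto, fail, out = automaton
    counts, state = {}, 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            counts[pattern] = counts.get(pattern, 0) + 1
    return counts


automaton = build_automaton(["Beijing", "Shanghai", "Shenzhen"])
print(count_matches("Beijing housing prices compared with Shanghai and Shenzhen", automaton))
# {'Beijing': 1, 'Shanghai': 1, 'Shenzhen': 1}
```

The automaton is built once from the area-name dictionary and reused for every headline, which keeps matching time linear in the text length regardless of how many area names are stored.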
In the embodiment of the present invention, optionally, target category matching may be performed on the headline to obtain the first matching result in the following manner:
Perform area name matching on the headline to obtain at least one area name.
For example, if the headline is "Beijing housing prices compared with Shanghai and Shenzhen housing prices", the headline matches 3 area names, so 3 area names are obtained.
It should be noted that some proper nouns may also contain area names. To improve the accuracy of news classification, area names contained in such proper nouns are not treated as matched area names in the present invention.
For example, proper nouns such as "Hangzhou Road" and "Shanghai Volkswagen" are not treated as area names in the present invention.
In the embodiment of the present invention, proper nouns that contain area names may be stored; for example, they may be obtained by extracting words related to area names from a public dictionary.
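One simple way to keep area names inside stored proper nouns from being counted is to mask those proper nouns before matching. The sketch below assumes this masking strategy and uses plain substring counting instead of an automaton; the function name `match_area_names` and the sample data are hypothetical.

```python
def match_area_names(text, area_names, proper_nouns):
    """Count area names in text, ignoring occurrences inside stored proper nouns."""
    # Mask longer proper nouns first so that nested ones are handled consistently.
    for noun in sorted(proper_nouns, key=len, reverse=True):
        text = text.replace(noun, "\x00")
    counts = {}
    for name in area_names:
        n = text.count(name)
        if n > 0:
            counts[name] = n
    return counts


areas = ["Beijing", "Shanghai", "Hangzhou"]
nouns = ["Hangzhou Road", "Shanghai Volkswagen"]
print(match_area_names("Shanghai Volkswagen opens a showroom in Beijing", areas, nouns))
# {'Beijing': 1}
```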
Calculating the first score value of the first matching result includes:
For each area name among the at least one area name, performing respectively:
Determining the basic score value corresponding to the area name and the number of times the area name occurs in the headline;
Taking the product of the basic score value and the number of occurrences as the first initial score value corresponding to the area name;
Determining the maximum value and the second-largest value among all first initial score values, where the second-largest value is the first initial score value that is less than the maximum value and greater than all remaining first initial score values other than the maximum value;
Taking the ratio obtained by dividing the maximum value by the second-largest value as the first score value.
Assigning the press release to the target category corresponding to the first matching result includes:
Assigning the press release to the area name corresponding to the largest first initial score value among all first initial score values.
In the embodiment of the present invention, area names may be divided into three levels: provincial level, city level and district level, as shown in Table 1.
In the embodiment of the present invention, the basic score values corresponding to area names of different levels may differ. For example, the basic score value of a provincial-level area name may be greater than that of a city-level area name, and the basic score value of a city-level area name may be greater than that of a district-level area name.
In the embodiment of the present invention, optionally, whether the first score value meets the first preset condition may be determined in the following manner:
Judge whether the first score value is greater than or equal to 1.5.
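The headline scoring rule can be sketched as below. The basic score values in `BASE_SCORE` are hypothetical (Table 1 is not reproduced in this section), and the handling of the case where no second-largest value exists is an assumption, since the text does not define the ratio there.

```python
import math

# Hypothetical basic score values per administrative level.
BASE_SCORE = {"province": 30, "city": 20, "district": 10}
FIRST_THRESHOLD = 1.5  # first preset condition stated in the text


def first_score(counts, levels, base_score=BASE_SCORE):
    """Return (best area name, first score value) for a headline.

    counts: area name -> number of occurrences in the headline
    levels: area name -> "province" | "city" | "district"
    """
    # First initial score value = basic score value x number of occurrences.
    initial = {name: base_score[levels[name]] * n for name, n in counts.items()}
    if not initial:
        return None, 0.0
    best = max(initial, key=initial.get)
    top = initial[best]
    lower = [value for value in initial.values() if value < top]
    if lower:
        return best, top / max(lower)
    # Assumption: a single matched name is treated as unambiguous,
    # while a tie for the maximum is treated as ambiguous (ratio 1.0).
    return best, (math.inf if len(initial) == 1 else 1.0)


def meets_first_condition(score):
    return score >= FIRST_THRESHOLD
```

With three city-level names that each occur once, the ratio under these assumptions is 1.0, so the headline alone does not decide the category and a later stage has to.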
The above describes classifying news according to the headline. In practical applications, when the first score value does not meet the first preset condition, the news cannot be classified according to the headline; in that case, the news may further be classified according to the body content. Alternatively, the news may be classified directly according to the body content, or directly according to its full text, where classification according to the full text may use the same method as classification according to the headline or the same method as classification according to the body content.
In one embodiment, the method of classifying news according to the body content includes:
Extract the body content of the press release;
Perform target category matching on the body content to obtain a second matching result;
Calculate a second score value of the second matching result and, when the second score value is judged to meet a second preset condition, assign the press release to the target category corresponding to the second matching result.
In one embodiment, the press release is first classified using the first matching result of its headline; if the press release cannot be classified according to its headline, the body content of the press release may then be used to classify it, as shown in Fig. 1B.
In the embodiment of the present invention, optionally, target category matching may be performed on the body content to obtain the second matching result in the following manner:
Perform area name matching on the body content to obtain at least one area name.
Optionally, the second score value of the second matching result may be calculated in the following manner:
For each area name among the at least one area name, performing respectively:
Determining the basic score value corresponding to the area name and the number of times the area name occurs in the body content;
Taking the product of the basic score value and the number of occurrences as the second initial score value corresponding to the area name;
Determining the maximum value among all second initial score values and the number of times a target area name occurs in the body content, where the target area name is the area name corresponding to the maximum value;
Taking the value obtained by subtracting, from the number of times the target area name occurs in the body content, the number of occurrences of each of the remaining area names, as the second score value;
Where the remaining area names are the area names, among the at least one area name, other than the area name corresponding to the maximum value.
Optionally, the press release may be assigned to the target category corresponding to the second matching result in the following manner:
Assign the press release to the area name corresponding to the largest second initial score value among all second initial score values.
For example, suppose the body content is "Last Sunday, the 2015 China "Prospect" Hangzhou Project Elite League of the Asian Football Association kicked off in the rain. Eight Hangcheng amateur football teams gathered at the Tonglu football training base, where they will compete for three weeks. The event is hosted by the Hangzhou Football Association and organized by the Hangzhou football administrative center, and is also part of Hangzhou's 34th West Lake Cup super league. Besides local fans, the spectators also included representatives of amateur football clubs from Ningbo and other areas". The area names obtained by matching are Hangzhou and Ningbo.
In the embodiment of the present invention, whether the second score value meets the second preset condition may be determined in various ways; optionally, the following manner may be used:
Judge whether the second score value is greater than or equal to 3.
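The body-content rule can be sketched in the same style: the target area name is the one with the largest second initial score value, and the second score value is its occurrence count minus the occurrence counts of all remaining area names. The function names and the per-name `base_scores` mapping are hypothetical.

```python
SECOND_THRESHOLD = 3  # second preset condition stated in the text


def second_score(counts, base_scores):
    """Return (target area name, second score value) for body content.

    counts: area name -> number of occurrences in the body content
    base_scores: area name -> basic score value (hypothetical values)
    """
    # Second initial score value = basic score value x number of occurrences.
    initial = {name: base_scores[name] * n for name, n in counts.items()}
    if not initial:
        return None, 0
    target = max(initial, key=initial.get)
    others = sum(n for name, n in counts.items() if name != target)
    return target, counts[target] - others


def meets_second_condition(score):
    return score >= SECOND_THRESHOLD


print(second_score({"Hangzhou": 4, "Ningbo": 1}, {"Hangzhou": 10, "Ningbo": 10}))
# ('Hangzhou', 3)
```

Using a difference of raw counts rather than a ratio makes the body rule less sensitive to a single incidental mention of another area in a long text.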
For example, suppose the body content is "Last Sunday, the 2015 China "Prospect" Hangzhou Project Elite League of the Asian Football Association kicked off in the rain. Eight Hangcheng amateur football teams gathered at the Tonglu football training base, where they will compete for three weeks. The event is hosted by the Hangzhou Football Association and organized by the Hangzhou football administrative center, and is also part of Hangzhou's 34th West Lake Cup super league. Besides local fans, the spectators also included representatives of amateur football clubs from Ningbo and other areas". By matching the above body content, the area names obtained are Hangzhou and Ningbo; "Hangzhou" occurs 4 times and "Ningbo" occurs 1 time, so the first initial score value of Hangzhou is 10 × 4 = 40 and the first initial score value of Ningbo is 10 × 1 = 10. It is first judged that the value obtained by dividing the first initial score value of Hangzhou by the first initial score value of Ningbo (40 / 10 = 4) is greater than 1.5; the press release is therefore classified as a local news article for Hangzhou. If the value obtained by dividing the first initial score value of Hangzhou by the first initial score value of Ningbo were less than 1.5, it would be judged whether the number of occurrences of Hangzhou minus the number of occurrences of Ningbo is greater than or equal to 3. In the above example, the number of occurrences of Hangzhou minus the number of occurrences of Ningbo is 4 - 1 = 3; therefore, the press release would be classified as a local news article for Hangzhou.
In the embodiment of the present invention, optionally, the classification model may be trained based on the press release and its corresponding target category in the following manner:
Obtain a corpus, where the corpus includes press releases for which the first score value was judged to meet the first preset condition, together with their corresponding area names, and/or press releases for which the second score value was judged to meet the second preset condition, together with their corresponding area names; and
Obtain the classification model based on the corpus.
In the embodiment of the present invention, optionally, the classification model may be obtained based on the corpus in the following manner:
Use a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm to extract keywords from each press release in the corpus;
Encode each press release into a feature vector according to its article attributes and keywords;
Perform feature selection and feature combination on the corpus encoded as feature vectors;
Train a multi-class logistic regression model on the corpus after feature selection and feature combination to obtain the classification model.
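The keyword-extraction and encoding steps can be sketched in pure Python as follows. This is an illustration under several assumptions: the press releases are already tokenized (word segmentation is not specified in the text), the TF-IDF weight is the plain `tf * log(N / df)` form, and the feature naming scheme (`kw=...`, `media=...`) and the function names are hypothetical. Feature selection and combination are omitted.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """docs: list of token lists. Returns the top_k TF-IDF keywords per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    keywords = []
    for tokens in docs:
        if not tokens:
            keywords.append([])
            continue
        tf = Counter(tokens)
        weights = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
        # Highest weight first; ties broken alphabetically for determinism.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        keywords.append(ranked[:top_k])
    return keywords


def encode(keywords, attributes, vocabulary):
    """Encode keywords plus article attributes as a 0/1 feature vector."""
    active = {f"kw={k}" for k in keywords}
    active |= {f"{name}={value}" for name, value in attributes.items()}
    return [1.0 if feature in active else 0.0 for feature in vocabulary]


docs = [
    ["hangzhou", "football", "league", "hangzhou"],
    ["ningbo", "port", "football"],
    ["beijing", "housing", "prices"],
]
print(tfidf_keywords(docs, top_k=1))
# [['hangzhou'], ['ningbo'], ['beijing']]
```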
In the embodiment of the present invention, optionally, the press release may be classified based on the classification model in the following manner:
Predict, according to the classification model, the probability of the area to which the press release belongs;
When the probability is judged to be greater than a threshold, treat the press release as a press release of that area.
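The threshold decision can be sketched as below, assuming the model produces one raw score per area that is converted to probabilities with a softmax. The threshold value 0.8 is hypothetical; the text does not state one.

```python
import math


def softmax(scores):
    """Convert {area: raw score} into {area: probability}."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {area: math.exp(value - peak) for area, value in scores.items()}
    total = sum(exps.values())
    return {area: e / total for area, e in exps.items()}


def classify_by_model(scores, threshold=0.8):
    """Return the predicted area, or None when the probability is not above the threshold."""
    probs = softmax(scores)
    area = max(probs, key=probs.get)
    return area if probs[area] > threshold else None
```

Returning `None` corresponds to the case in which the press release is treated as unclassifiable.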
In the embodiment of the present invention, optionally, the classification model may be trained based on the press release and its corresponding target category in the following manner:
Periodically train the classification model based on the press release and its corresponding target category.
Fig. 1C shows the main process of obtaining the classification model: a corpus is obtained; keywords are extracted from each press release in the corpus; each press release is encoded into a feature vector according to its article attributes and keywords; feature selection and feature combination are then performed on the corpus encoded as feature vectors; finally, the corpus after feature selection and feature combination is used for training to obtain the classification model.
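The final training step can be illustrated with a small multi-class logistic (softmax) regression trained by stochastic gradient descent. This is a pure-Python sketch on already encoded feature vectors; the hyperparameters `epochs` and `lr`, the function names, and the toy data are all hypothetical, and a production system would more likely rely on an established machine learning library.

```python
import math


def predict_proba(W, b, x):
    """Return {class: probability} for feature vector x."""
    logits = {c: b[c] + sum(w * xj for w, xj in zip(W[c], x)) for c in W}
    peak = max(logits.values())
    exps = {c: math.exp(v - peak) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def train_softmax(X, y, classes, epochs=200, lr=0.5):
    """Fit one weight vector and bias per class by minimizing cross-entropy."""
    n_features = len(X[0])
    W = {c: [0.0] * n_features for c in classes}
    b = {c: 0.0 for c in classes}
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = predict_proba(W, b, x)
            for c in classes:
                # Gradient of the cross-entropy loss with respect to the logit of class c.
                err = probs[c] - (1.0 if c == label else 0.0)
                for j, xj in enumerate(x):
                    W[c][j] -= lr * err * xj
                b[c] -= lr * err
    return W, b


# Toy corpus: feature 0 marks a Hangzhou keyword, feature 1 a Ningbo keyword,
# feature 2 a publishing-medium attribute shared by both areas.
X = [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
y = ["Hangzhou", "Hangzhou", "Ningbo", "Ningbo"]
W, b = train_softmax(X, y, classes=["Hangzhou", "Ningbo"])
probs = predict_proba(W, b, [1.0, 0.0, 0.0])
```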
In the embodiment of the present invention, the classification model may be updated periodically, for example once a day.
In the embodiment of the present invention, the article attributes include information about the medium that published the article and/or the time at which the article was published, etc.
In the embodiment of the present invention, if the press release cannot be classified according to the classification model, it may be determined that the press release cannot be classified.
Fig. 1D is a schematic flow chart of classifying a press release according to the classification model: the classification model predicts the area to which the press release belongs and the corresponding probability, and it is judged whether the probability is greater than the threshold; if so, the press release is treated as a press release of that area; otherwise, the press release is considered unclassifiable.
Exemplary devices
Having described the method of the exemplary embodiment of the present invention, devices 30 and 40 for news classification according to exemplary embodiments of the present invention are described next with reference to Fig. 3 and Fig. 4, respectively. Device 30 includes an extraction unit 300, a matching unit 310, a computing unit 320, a judging unit 330 and a classification unit 340, wherein:
The extraction unit 300 is configured to extract the headline of a press release;
The matching unit 310 is configured to perform target category matching on the headline to obtain a first matching result;
The computing unit 320 is configured to calculate a first score value of the first matching result;
The judging unit 330 is configured to judge whether the first score value meets a first preset condition;
The classification unit 340 is configured to assign the press release to the target category corresponding to the first matching result when the judging unit 330 judges that the first score value meets the first preset condition.
The press release in the embodiment of the present invention may be a NetEase press release or, of course, a press release from other media; no specific limitation is made here.
In the embodiment of the present invention, the extraction unit 300 may extract the headline of the press release in various ways, which are not specifically limited here.
In the embodiment of the present invention, when the matching unit 310 performs target category matching on the headline to obtain the first matching result, it specifically:
Performs area name matching on the headline to obtain at least one area name.
For example, if the headline is "Beijing housing prices compared with Shanghai and Shenzhen housing prices", the headline matches 3 area names, so 3 area names are obtained.
In the embodiment of the present invention, the matching unit 310 may perform target category matching on the headline with the Aho-Corasick (AC) automaton algorithm; other approaches may of course also be used and are not described in detail here.
It should be noted that some proper nouns may also contain area names. To improve the accuracy of news classification, area names contained in such proper nouns are not treated as matched area names in the present invention.
For example, proper nouns such as "Hangzhou Road" and "Shanghai Volkswagen" are not treated as area names in the present invention.
In the embodiment of the present invention, proper nouns that contain area names may be stored; for example, they may be obtained by extracting words related to area names from a public dictionary.
In the embodiment of the present invention, optionally, the computing unit 320 includes a determining unit 320A and a product computing unit 320B, wherein:
The determining unit 320A is configured to perform, for each area name among the at least one area name: determining the basic score value corresponding to the area name and the number of times the area name occurs in the headline;
The product computing unit 320B is configured to take the product of the basic score value and the number of occurrences as the first initial score value corresponding to the area name;
The determining unit 320A is further configured to determine the maximum value and the second-largest value among all first initial score values, where the second-largest value is the first initial score value that is less than the maximum value and greater than all remaining first initial score values other than the maximum value;
The determining unit 320A is further configured to take the ratio obtained by dividing the maximum value by the second-largest value as the first score value;
The classification unit 340 is specifically configured to assign the press release to the area name corresponding to the largest first initial score value among all first initial score values.
In the embodiment of the present invention, area names may be divided into three levels: provincial level, city level and district level, as shown in Table 1.
In the embodiment of the present invention, the basic score values corresponding to area names of different levels may differ. For example, the basic score value of a provincial-level area name may be greater than that of a city-level area name, and the basic score value of a city-level area name may be greater than that of a district-level area name.
In the embodiment of the present invention, optionally, when the judging unit 330 judges whether the first score value meets the first preset condition, it specifically:
Judges whether the first score value is greater than or equal to 1.5.
For example, for a headline such as "People from Wuhan and Nanjing living in Hangzhou, the new Hangzhou", three place names are matched: "Wuhan", "Nanjing" and "Hangzhou". Assuming the basic score value of each of these three place names is 10, their initial score values are 10, 10 and 20, respectively. The maximum value is then 20, corresponding to "Hangzhou", and the second-largest value is 10, corresponding to "Wuhan" or "Nanjing". The ratio of the maximum value to the second-largest value is 2, which is greater than 1.5 and meets the preset condition. The press release is therefore assigned to the category of the place name corresponding to the maximum value, "Hangzhou".
In addition to the above classification method, any other classification method may also be used, for example taking the number of occurrences as the first score value and assigning the press release to the place name with the most occurrences.
The above describes classifying news according to the headline. In practical applications, when the first score value does not meet the first preset condition, the news cannot be classified according to the headline; in that case, the news may further be classified according to the body content. Therefore, the extraction unit 300 is further configured to extract the body content of the press release;
The matching unit 310 is further configured to perform target category matching on the body content to obtain a second matching result;
The computing unit 320 is further configured to calculate a second score value of the second matching result;
The judging unit 330 is further configured to judge whether the second score value meets a second preset condition;
The classification unit 340 is further configured to assign the press release to the target category corresponding to the second matching result when the judging unit 330 judges that the second score value meets the second preset condition.
That is, the classification unit 340 first classifies using the first matching result of the headline of the press release; if the press release cannot be classified according to its headline, the body content of the press release may then be used to classify it, as shown in Fig. 1B.
In the embodiment of the present invention, optionally, when the matching unit 310 performs target category matching on the body content to obtain the second matching result, it specifically:
Performs area name matching on the body content to obtain at least one area name;
The computing unit 320 includes a determining unit 320A and a product computing unit 320B, wherein:
The determining unit 320A is configured to perform, for each area name among the at least one area name: determining the basic score value corresponding to the area name and the number of times the area name occurs in the body content;
The product computing unit 320B is configured to take the product of the basic score value and the number of occurrences as the second initial score value corresponding to the area name;
The determining unit 320A is further configured to determine the maximum value among all second initial score values and the number of times a target area name occurs in the body content, where the target area name is the area name corresponding to the maximum value;
The computing unit 320 is further configured to take the value obtained by subtracting, from the number of times the target area name occurs in the body content, the number of occurrences of each of the remaining area names, as the second score value;
Where the remaining area names are the area names, among the at least one area name, other than the area name corresponding to the maximum value;
The classification unit 340 is specifically configured to assign the press release to the area name corresponding to the largest second initial score value among all second initial score values.
For example, suppose the body content is "Last Sunday, the 2015 China "Prospect" Hangzhou Project Elite League of the Asian Football Association kicked off in the rain. Eight Hangcheng amateur football teams gathered at the Tonglu football training base, where they will compete for three weeks. The event is hosted by the Hangzhou Football Association and organized by the Hangzhou football administrative center, and is also part of Hangzhou's 34th West Lake Cup super league. Besides local fans, the spectators also included representatives of amateur football clubs from Ningbo and other areas". The area names obtained by matching the body content are Hangzhou and Ningbo.
In the embodiment of the present invention, optionally, when the judging unit 330 judges whether the second score value meets the second preset condition, it specifically:
Judges whether the second score value is greater than or equal to 3.
For example, suppose the body content is "Last Sunday, the 2015 China "Prospect" Hangzhou Project Elite League of the Asian Football Association kicked off in the rain. Eight Hangcheng amateur football teams gathered at the Tonglu football training base, where they will compete for three weeks. The event is hosted by the Hangzhou Football Association and organized by the Hangzhou football administrative center, and is also part of Hangzhou's 34th West Lake Cup super league. Besides local fans, the spectators also included representatives of amateur football clubs from Ningbo and other areas". By matching the above body content, the area names obtained are Hangzhou and Ningbo; "Hangzhou" occurs 4 times and "Ningbo" occurs 1 time, so the first initial score value of Hangzhou is 10 × 4 = 40 and the first initial score value of Ningbo is 10 × 1 = 10. The judging unit 330 first judges that the value obtained by dividing the first initial score value of Hangzhou by the first initial score value of Ningbo (40 / 10 = 4) is greater than 1.5; the classification unit 340 then classifies the press release as a local news article for Hangzhou. If the judging unit 330 judged that the value obtained by dividing the first initial score value of Hangzhou by the first initial score value of Ningbo was less than 1.5, it would judge whether the number of occurrences of Hangzhou minus the number of occurrences of Ningbo is greater than or equal to 3. In the above example, the number of occurrences of Hangzhou minus the number of occurrences of Ningbo is 4 - 1 = 3; therefore, the classification unit 340 would classify the press release as a local news article for Hangzhou.
The procedure set forth above is bootstrapped and requires no manual input. It can process large-scale local news classification requests in real time, is efficient and timely, and meets the functional needs of Internet news products.
The above describes the classification unit 340 first classifying according to the headline and, if classification according to the headline is not possible, then classifying according to the body content. If classification according to the body content is also not possible, classification may be performed according to the classification model. Therefore, in the embodiment of the present invention, the device further includes an algorithm unit 350, configured to predict, according to the classification model, the probability of the area to which the press release belongs and, when the probability is judged to be greater than a threshold, to treat the press release as a press release of that area.
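The three-stage fallback described above (headline, then body content, then classification model) can be sketched as a single decision function. The thresholds 1.5 and 3 come from the text; the probability threshold 0.8, the function name `cascade_classify` and the `(area, value)` input format are hypothetical.

```python
def cascade_classify(title_result, body_result, model_result,
                     first_threshold=1.5, second_threshold=3, prob_threshold=0.8):
    """Each *_result is an (area, value) pair; area is None when nothing matched.

    Returns (area, stage), where stage names the rule that decided the category.
    """
    area, ratio = title_result
    if area is not None and ratio >= first_threshold:
        return area, "title"

    area, difference = body_result
    if area is not None and difference >= second_threshold:
        return area, "body"

    area, probability = model_result
    if area is not None and probability > prob_threshold:
        return area, "model"

    return None, "unclassified"
```

In a real pipeline the body and model stages would be evaluated lazily, only after the earlier stage fails; they are passed in precomputed here to keep the sketch short.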
In the embodiment of the present invention, optionally, the algorithm unit 350 includes an acquiring unit 350A and a training unit 350B, wherein:
The acquiring unit 350A is configured to obtain a corpus, where the corpus includes press releases for which the first score value was judged to meet the first preset condition, together with their corresponding area names, and/or press releases for which the second score value was judged to meet the second preset condition, together with their corresponding area names;
The training unit 350B is configured to obtain the classification model based on the corpus.
It should be noted that the step of predicting, according to the classification model, the probability of the area to which the press release belongs and the steps of obtaining the corpus and obtaining the classification model based on the corpus may be carried out in parallel according to their respective needs.
In the embodiment of the present invention, optionally, the algorithm unit 350 further includes a coding unit 350C and a feature processing unit 350D, wherein:
The extraction unit 300 is further configured to use a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm to extract keywords from each press release in the corpus;
The coding unit 350C is further configured to encode each press release into a feature vector according to its article attributes and keywords;
The feature processing unit 350D is configured to perform feature selection and feature combination on the corpus encoded as feature vectors;
The training unit 350B is further configured to train a multi-class logistic regression model on the corpus after feature selection and feature combination to obtain the classification model.
Fig. 1C shows the main process of obtaining the classification model according to an embodiment: the acquiring unit 350A obtains a corpus; the extraction unit 300 extracts keywords from each press release in the corpus; the coding unit 350C encodes each press release into a feature vector according to its article attributes and keywords; the feature processing unit 350D then performs feature selection and feature combination on the corpus encoded as feature vectors; finally, the training unit 350B trains on the corpus after feature selection and feature combination to obtain the classification model.
In the embodiment of the present invention, the classification model may be updated periodically, for example once a day.
In the embodiment of the present invention, the article attributes include information about the medium that published the article and/or the time at which the article was published, etc.
In the embodiment of the present invention, if the classification unit 340 also cannot classify the press release according to the classification model, it may be determined that the press release cannot be classified.
Fig. 1D is a schematic flow chart of classifying a press release according to the classification model: the classification model predicts the area to which the press release belongs and the corresponding probability, and it is judged whether the probability is greater than the threshold; if so, the press release is treated as a press release of that area; otherwise, the press release is considered unclassifiable.
The above scheme first uses scoring rules to bootstrap a set of local news articles classified with high accuracy, and then trains a classification model on these articles with a supervised machine learning algorithm to provide supplementary classification for the other press releases. It requires no manual input, can process large-scale local news classification requests in real time, and meets the functional needs of Internet news products.
Referring to Fig. 4, device 40 includes a matching unit 400, a computing unit 410, a judging unit 420, a classification unit 430 and an algorithm unit 440, wherein:
The matching unit 400 is configured to perform target category matching on a press release to obtain a matching result;
The computing unit 410 is configured to calculate a score value of the matching result;
The judging unit 420 is configured to judge whether the score value meets a preset condition;
The classification unit 430 is configured to assign the press release to the target category corresponding to the matching result when the judging unit 420 judges that the score value meets the preset condition;
The algorithm unit 440 is configured to train a classification model based on the press release and its corresponding target category;
The classification unit 430 is further configured to classify the press release based on the classification model when the judging unit 420 judges that the score value does not meet the preset condition.
The press release in the embodiment of the present invention may be a NetEase press release or, of course, a press release from other media; no specific limitation is made here.
In one embodiment, the matching unit 400 performing target category matching on the press release includes performing target category matching on the title of the press release.
In the embodiment of the present invention, the headline of the press release may be extracted in various ways, which are not specifically limited here.
In one embodiment, the matching unit 400 performing target category matching on the press release includes performing target category matching on the body content of the press release.
In one embodiment, the matching unit 400 performing target category matching on the press release includes performing target category matching on the full text of the press release.
In one embodiment, the matching unit 400 performing target category matching on the press release includes first performing target category matching on the title of the press release and, if classification cannot be achieved from the title, then performing target category matching on the body content of the press release.
In the embodiments of the present invention, optionally, the device further includes an extraction unit 450 configured to extract the headline of the press release;
The matching unit 400 is specifically configured to perform target category matching on the headline to obtain a first matching result;
The computing unit 410 is specifically configured to calculate a first score value of the first matching result;
The classification unit 430 is specifically configured to, when the judging unit 420 judges that the first score value meets a first preset condition, assign the press release to the target category corresponding to the first matching result.
In the embodiments of the present invention, when the matching unit 400 performs target category matching on the headline, the Aho-Corasick (AC) automaton algorithm may be used; of course, other approaches may also be used, which are not described in detail here.
In the embodiments of the present invention, optionally, when the matching unit 400 performs target category matching on the headline to obtain the first matching result, specifically:
Area name matching is performed on the headline to obtain at least one area name.
For example, if the headline is "Comparison of Beijing housing prices with Shanghai and Shenzhen housing prices", the headline matches 3 area names, so 3 area names are obtained.
It should be noted that some proper nouns may also contain an area name. To improve the accuracy of news classification, an area name contained in such a proper noun is not treated as a matched area name in the present invention.
For example, proper nouns such as "Hangzhou Road" and "Shanghai Volkswagen" are not treated as area names in the present invention.
In the embodiments of the present invention, the proper nouns that contain area names may be stored in advance; for example, they may be obtained by extracting the words related to area names from a public dictionary.
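The exclusion of proper nouns during area name matching can be illustrated with a minimal sketch. It assumes plain substring counting rather than the AC automaton mentioned above, and the dictionaries `AREA_NAMES` and `PROPER_NOUNS`, the function name `match_area_names` and the Chinese sample strings are hypothetical, chosen only for illustration.

```python
# Hypothetical dictionaries; the source does not list their contents.
AREA_NAMES = ["北京", "上海", "深圳", "杭州"]
PROPER_NOUNS = ["杭州路", "上海大众"]  # contain an area name but are not one


def match_area_names(text, area_names=AREA_NAMES, proper_nouns=PROPER_NOUNS):
    """Return {area_name: occurrences}, ignoring occurrences inside proper nouns."""
    # Blank out stored proper nouns (longest first) so that the area names
    # embedded in them can no longer match.
    for noun in sorted(proper_nouns, key=len, reverse=True):
        text = text.replace(noun, "\0" * len(noun))
    counts = {}
    for name in area_names:
        n = text.count(name)
        if n:
            counts[name] = n
    return counts
```

With these assumed dictionaries, a headline that mentions Beijing, Shanghai and Shenzhen yields three area names, while a text containing only "上海大众" and "杭州路" yields none.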
The computing unit 410 includes a determining unit 410A and a product computing unit 410B, wherein:
The determining unit 410A is configured to perform the following for each area name in the at least one area name: determine the basic score value corresponding to the area name and the number of times the area name occurs in the headline;
The product computing unit 410B is configured to take the product of the basic score value and the number of times as the first initial score value corresponding to the area name;
The determining unit 410A is further configured to determine the maximum value and the second-largest value among all first initial score values, where the second-largest value is the first initial score value that is less than the maximum value and greater than every other first initial score value apart from the maximum value;
The determining unit 410A is further configured to take the ratio obtained by dividing the maximum value by the second-largest value as the first score value;
The classification unit 430 is specifically configured to assign the press release to the area name corresponding to the largest first initial score value among all first initial score values.
In the embodiments of the present invention, area names can be divided into three levels: province level, city level and district level, as shown in Table 1.
In the embodiments of the present invention, the basic score values corresponding to area names of different levels may differ. For example, the basic score value of a province-level area name may be greater than that of a city-level area name, and the basic score value of a city-level area name may be greater than that of a district-level area name.
In the embodiments of the present invention, the judging unit 420 may optionally judge that the first score value meets the first preset condition in the following way:
Judging that the first score value is greater than or equal to 1.5.
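The headline scoring rule described above (basic score value times occurrence count, then the ratio of the maximum to the second-largest value, compared with 1.5) can be sketched as follows. This is only an illustration: the function names and the base score table passed in are hypothetical, and the handling of the two cases the text does not specify is an assumption (a single matched area name is treated as passing, and a tie for the maximum is treated as a ratio of 1).

```python
import math

FIRST_THRESHOLD = 1.5  # from the text: the first score value must be >= 1.5


def first_score(counts, base_scores):
    """counts: {area: occurrences in headline}; base_scores: {area: basic score}.

    Returns (best_area, first_score_value), or (None, 0.0) if nothing matched.
    """
    initial = {a: base_scores[a] * n for a, n in counts.items() if n > 0}
    if not initial:
        return None, 0.0
    ranked = sorted(initial.items(), key=lambda kv: kv[1], reverse=True)
    best_area, top = ranked[0]
    if len(ranked) == 1:
        return best_area, math.inf  # assumption: a lone area name always passes
    if ranked[1][1] == top:
        return best_area, 1.0       # assumption: a tie for the maximum gives ratio 1
    return best_area, top / ranked[1][1]


def classify_by_headline(counts, base_scores, threshold=FIRST_THRESHOLD):
    """Return the area name if the first preset condition is met, else None."""
    area, score = first_score(counts, base_scores)
    return area if area is not None and score >= threshold else None
```

Under these assumptions, a headline in which three same-level area names each occur once is not classified by this rule, because no area name dominates.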
The above describes classifying news according to the headline. In practical applications, when the first score value does not meet the first preset condition, the news cannot be classified according to the headline. In that case, the news may further be classified directly according to the body content, or directly according to the full text of the news, where classification according to the full text may use the same classification method as classification according to the headline, or the same method as classification according to the body content.
In one embodiment, the extraction unit 450 is further configured to extract the body content of the press release;
The matching unit 400 is further configured to perform target category matching on the body content to obtain a second matching result;
The computing unit 410 is further configured to calculate a second score value of the second matching result;
The judging unit 420 is further configured to judge whether the second score value meets a second preset condition;
The classification unit 430 is further configured to, when the judging unit 420 judges that the second score value meets the second preset condition, assign the press release to the target category corresponding to the second matching result.
In one embodiment, the press release is first classified using the first matching result of its headline; if the press release cannot be classified according to its headline, its body content can then be used to classify it, as shown in Fig. 1B.
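The fallback order described here (headline first, then body content) can be sketched as a simple cascade. The sketch below is an assumption about how such a flow might be wired together; `classify_cascade` and the stand-in stage functions `by_headline` and `by_body` are hypothetical and do not implement the scoring rules themselves.

```python
from typing import Callable, Optional, Sequence, Tuple

# A stage is a (name, function) pair; the function returns an area name or None.
Stage = Tuple[str, Callable[[dict], Optional[str]]]


def classify_cascade(article: dict, stages: Sequence[Stage]):
    """Try each stage in order; return (area, stage_name) for the first decision."""
    for name, stage in stages:
        area = stage(article)
        if area is not None:
            return area, name
    return None, None  # no stage could classify the press release


# Hypothetical stand-ins for the real headline and body-content rules.
def by_headline(article):
    return "Hangzhou" if "Hangzhou" in article["headline"] else None


def by_body(article):
    return "Ningbo" if article["body"].count("Ningbo") >= 3 else None


STAGES = [("headline", by_headline), ("body", by_body)]
```

A further stage, such as a trained classification model, could be appended to the list in the same way.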
In the embodiments of the present invention, optionally, when the matching unit 400 performs target category matching on the body content to obtain the second matching result, specifically:
Area name matching is performed on the body content to obtain at least one area name;
The computing unit 410 includes a determining unit 410A and a product computing unit 410B, wherein:
The determining unit 410A is configured to perform the following for each area name in the at least one area name: determine the basic score value corresponding to the area name and the number of times the area name occurs in the body content;
The product computing unit 410B is configured to take the product of the basic score value and the number of times as the second initial score value corresponding to the area name;
The determining unit 410A is further configured to determine the maximum value among all second initial score values and the number of times the target area name occurs in the body content, where the target area name is the area name corresponding to the maximum value;
The computing unit 410 is further configured to take, as the second score value, the value obtained by subtracting the number of times corresponding to each remaining area name from the number of times the target area name occurs in the body content;
Wherein the remaining area names are the area names in the at least one area name other than the area name corresponding to the maximum value;
The classification unit 430 is specifically configured to assign the press release to the area name corresponding to the largest second initial score value among all second initial score values.
For example, the body content is "Last Sunday, the elite league of the 2015 AFC Vision China Hangzhou project kicked off in heavy rain. Eight amateur football teams from Hangcheng gathered at the Tonglu football training base, where they will compete for three weeks. The event is sponsored by the Hangzhou Football Association and organized by the Hangzhou Football Administration Center, and is also part of the 34th Hangzhou West Lake Cup Super League. Besides local fans, the spectators also included representatives of amateur football clubs from Ningbo and other areas", and the matched area names are Hangzhou and Ningbo.
In the embodiments of the present invention, the judging unit 420 can judge that the second score value meets the second preset condition in multiple ways; optionally, the following way may be used:
Judging that the second score value is greater than or equal to 3.
For example, the body content is "Last Sunday, the elite league of the 2015 AFC Vision China Hangzhou project kicked off in heavy rain. Eight amateur football teams from Hangcheng gathered at the Tonglu football training base, where they will compete for three weeks. The event is sponsored by the Hangzhou Football Association and organized by the Hangzhou Football Administration Center, and is also part of the 34th Hangzhou West Lake Cup Super League. Besides local fans, the spectators also included representatives of amateur football clubs from Ningbo and other areas". By matching the above body content, the matched area names are Hangzhou and Ningbo; "Hangzhou" occurs 4 times and "Ningbo" occurs 1 time. The first initial score value of Hangzhou is then 10 × 4 = 40, and the first initial score value of Ningbo is 10 × 1 = 10. It is first judged whether the value obtained by dividing the first initial score value of Hangzhou by that of Ningbo is greater than 1.5; here 40 / 10 = 4, so the press release is classified as a local press release of Hangzhou. If that value were less than 1.5, it would then be judged whether the number of occurrences of Hangzhou minus the number of occurrences of Ningbo is greater than or equal to 3. In the above example, the number of occurrences of Hangzhou minus that of Ningbo is 4 - 1 = 3, so the press release is classified as a local press release of Hangzhou.
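The body-content rule can be sketched in the same style as the headline rule. The sketch reads "subtracting the number of times corresponding to each remaining area name" as subtracting the sum of all remaining counts, which is an interpretation; the function names and the base score table are hypothetical.

```python
SECOND_THRESHOLD = 3  # from the text: the second score value must be >= 3


def second_score(counts, base_scores):
    """counts: {area: occurrences in body}; base_scores: {area: basic score}.

    Returns (target_area, second_score_value), or (None, 0) if nothing matched.
    """
    initial = {a: base_scores[a] * n for a, n in counts.items() if n > 0}
    if not initial:
        return None, 0
    # The target area name is the one with the largest second initial score value.
    target = max(initial, key=initial.get)
    # Target count minus the counts of every remaining area name.
    remaining = sum(n for a, n in counts.items() if a != target)
    return target, counts[target] - remaining


def classify_by_body(counts, base_scores, threshold=SECOND_THRESHOLD):
    """Return the area name if the second preset condition is met, else None."""
    area, score = second_score(counts, base_scores)
    return area if area is not None and score >= threshold else None
```

For the Hangzhou and Ningbo example above, the counts 4 and 1 give a second score value of 3, which meets the threshold.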
In the embodiments of the present invention, optionally, the algorithm unit 440 includes an acquiring unit 440A and a training unit 440B, wherein:
The acquiring unit 440A is configured to obtain a training corpus, the training corpus including the press releases and corresponding area names for which the first score value was judged to meet the first preset condition, and/or the press releases and corresponding area names for which the second score value was judged to meet the second preset condition; and
The training unit 440B is configured to obtain the classification model based on the training corpus.
In the embodiments of the present invention, optionally, the algorithm unit 440 further includes an encoding unit 440C and a feature processing unit 440D, wherein:
The extraction unit 450 is further configured to extract keywords from each press release in the training corpus using a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
The encoding unit 440C is configured to encode each press release into a feature vector according to its press release attributes and keywords;
The feature processing unit 440D is configured to perform feature selection and feature combination on the training corpus encoded as feature vectors;
The training unit 440B is configured to train a multi-class logistic regression model on the training corpus after feature selection and feature combination, to obtain the classification model.
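The keyword extraction step can be illustrated with a small TF-IDF sketch. It assumes that each press release has already been tokenized, and it uses the plain variant idf = log(N / df); the source does not state which TF-IDF variant is used, and the function name `tfidf_keywords` is hypothetical. The subsequent encoding, feature selection and model training steps are not shown.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """docs: list of token lists. Returns the top_k TF-IDF tokens per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    result = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        scores = {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result
```

Tokens that occur in every document receive a score of zero under this variant, so they are ranked last among the candidates.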
In the embodiments of the present invention, optionally, the algorithm unit 440 is specifically configured to predict, according to the classification model, the probability of the area to which the press release belongs, and, when the probability is judged to be greater than a threshold, take the press release as a press release of that area.
In the embodiments of the present invention, optionally, the algorithm unit 440 is specifically configured to periodically train the classification model based on the press releases and the corresponding target categories.
Fig. 1C shows the main process of obtaining the classification model: the acquiring unit 440A obtains the training corpus; the extraction unit 450 extracts keywords from each press release in the training corpus; the encoding unit 440C encodes each press release into a feature vector according to its press release attributes and keywords; the feature processing unit 440D then performs feature selection and feature combination on the training corpus encoded as feature vectors; finally, the training unit 440B trains on the training corpus after feature selection and feature combination to obtain the classification model.
In the embodiments of the present invention, the classification model may be updated periodically, for example once a day.
In the embodiments of the present invention, the press release attributes include the publishing media information and/or the publishing time information of the press release, etc.
In the embodiments of the present invention, if a press release cannot be classified according to the classification model, it can be determined that the press release cannot be classified.
Fig. 1D is a schematic flowchart of classifying a press release according to the classification model: the classification model predicts the area to which the press release belongs and the corresponding probability, and it is judged whether the probability is greater than a threshold. If so, the press release is taken as a press release of that area; otherwise, the press release is considered unclassifiable.
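The threshold decision of Fig. 1D can be sketched as follows. The sketch assumes that the model produces one linear score per area and that a softmax turns these scores into probabilities, as is usual for multi-class logistic regression; the threshold value 0.5 and the function names are hypothetical, since the source does not give them.

```python
import math

PROBABILITY_THRESHOLD = 0.5  # hypothetical value; the text only says "a threshold"


def softmax(logits):
    """Convert {area: linear score} into {area: probability}."""
    m = max(logits.values())  # subtract the maximum for numerical stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def classify_with_model(probabilities, threshold=PROBABILITY_THRESHOLD):
    """Return the most probable area if its probability exceeds the threshold."""
    if not probabilities:
        return None
    area = max(probabilities, key=probabilities.get)
    # Otherwise the press release is considered unclassifiable.
    return area if probabilities[area] > threshold else None
```

A press release whose probability mass is spread evenly over several areas therefore stays unclassified rather than being assigned to a weak best guess.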
Exemplary device
Having described the method and device of exemplary embodiments of the present invention, a device for news classification according to another exemplary embodiment of the present invention is introduced next.
Those of ordinary skill in the art will understand that aspects of the present invention may be implemented as a system, method or program product. Therefore, aspects of the present invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may collectively be referred to herein as a "circuit", "module" or "system".
In some possible embodiments, the device for news classification according to the present invention may include at least one processing unit and at least one storage unit. The storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps of the news classification method according to the various exemplary embodiments of the present invention described in the "Exemplary method" section of this specification. For example, the processing unit may perform, as shown in Fig. 1A, step 100: extract the headline of the press release; step 110: perform target category matching on the headline to obtain a first matching result; step 120: calculate a first score value of the first matching result and, when the first score value is judged to meet the first preset condition, assign the press release to the target category corresponding to the first matching result. As another example, the processing unit may perform, as shown in Fig. 2, step 200: perform target category matching on the press release to obtain a matching result; step 210: calculate a score value of the matching result and, when the score value is judged to meet the preset condition, assign the press release to the target category corresponding to the matching result; step 220: train a classification model based on the press release and the corresponding target category, and, when the score value is judged not to meet the preset condition, classify the press release based on the classification model.
Referring to Fig. 5, the device 50 for news classification according to this embodiment of the present invention is described. The device 50 for news classification shown in Fig. 5 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 5, the device 50 for news classification takes the form of a general-purpose computing device. The components of the device 50 for news classification may include, but are not limited to: the at least one processing unit 516, the at least one storage unit 528, and a bus 518 connecting the different system components (including the storage unit 528 and the processing unit 516).
The bus 518 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
The storage unit 528 may include readable media in the form of volatile memory, such as random access memory (RAM) 530 and/or cache memory 532, and may further include read-only memory (ROM) 534.
The storage unit 528 may also include a program/utility 540 having a set of (at least one) program modules 542. Such program modules 542 include, but are not limited to: an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The device 50 for news classification may also communicate with one or more external devices 514 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the device 50 for news classification, and/or with any device (such as a router, a modem, etc.) that enables the device 50 for news classification to communicate with one or more other computing devices. Such communication can take place through an input/output (I/O) interface 522. In addition, the device 50 for news classification can communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 520. As shown in the figure, the network adapter 520 communicates with the other modules of the device 50 for news classification through the bus 518. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the device 50 for news classification, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, etc.
Exemplary program product
In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product that includes program code. When the program product runs on a terminal device, the program code causes the terminal device to perform the steps of the news classification method according to the various exemplary embodiments of the present invention described in the "Exemplary method" section of this specification. For example, the terminal device may perform, as shown in Fig. 1A, step 100: extract the headline of the press release; step 110: perform target category matching on the headline to obtain a first matching result; step 120: calculate a first score value of the first matching result and, when the first score value is judged to meet the first preset condition, assign the press release to the target category corresponding to the first matching result. As another example, the processing unit may perform, as shown in Fig. 2, step 200: perform target category matching on the press release to obtain a matching result; step 210: calculate a score value of the matching result and, when the score value is judged to meet the preset condition, assign the press release to the target category corresponding to the matching result; step 220: train a classification model based on the press release and the corresponding target category, and, when the score value is judged not to meet the preset condition, classify the press release based on the classification model.
The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
As shown in Fig. 6, a program product 60 for news classification according to an embodiment of the present invention is described. It may take the form of a portable compact disc read-only memory (CD-ROM) that includes program code, and may run on a terminal device such as a personal computer. However, the program product of the present invention is not limited to this. In this document, a readable storage medium may be any tangible medium that contains or stores a program, where the program can be used by or in connection with an instruction execution system, apparatus or device.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium, and that readable medium can send, propagate or transport a program for use by or in connection with an instruction execution system, apparatus or device.
The program code contained on a readable medium may be transmitted over any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
The program code for carrying out the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
It should be noted that although several devices or sub-devices of the equipment for news classification are mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to the embodiments of the present invention, the features and functions of two or more of the devices described above may be embodied in one device. Conversely, the features and functions of one device described above may be further divided so as to be embodied by multiple devices.
In addition, although the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be merged into one step, and/or one step may be broken down into multiple steps.
Although the spirit and principles of the present invention have been described with reference to several specific embodiments, it should be understood that the present invention is not limited to the disclosed embodiments, and the division into aspects does not mean that the features in these aspects cannot be combined to advantage; this division is made only for convenience of presentation. The present invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method for news classification, including:
Extracting the headline of a press release;
Performing target category matching on the headline to obtain a first matching result;
Calculating a first score value of the first matching result and, when the first score value is judged to meet a first preset condition, assigning the press release to the target category corresponding to the first matching result.
2. The method of claim 1, wherein performing target category matching on the headline to obtain the first matching result includes:
Performing area name matching on the headline to obtain at least one area name;
Calculating the first score value of the first matching result includes:
For each area name in the at least one area name, performing the following respectively:
Determining the basic score value corresponding to the area name and the number of times the area name occurs in the headline;
Taking the product of the basic score value and the number of times as the first initial score value corresponding to the area name;
Determining the maximum value and the second-largest value among all first initial score values, where the second-largest value is the first initial score value that is less than the maximum value and greater than every other first initial score value apart from the maximum value;
Taking the ratio obtained by dividing the maximum value by the second-largest value as the first score value;
Assigning the press release to the target category corresponding to the first matching result includes:
Assigning the press release to the area name corresponding to the largest first initial score value among all first initial score values.
3. The method of claim 1, wherein if it is determined that the first score value does not meet the first preset condition, the method further includes:
Extracting the body content of the press release;
Performing target category matching on the body content to obtain a second matching result;
Calculating a second score value of the second matching result and, when the second score value is judged to meet a second preset condition, assigning the press release to the target category corresponding to the second matching result.
4. The method of claim 3, wherein performing target category matching on the body content to obtain the second matching result includes:
Performing area name matching on the body content to obtain at least one area name;
Calculating the second score value of the second matching result includes:
For each area name in the at least one area name, performing the following respectively:
Determining the basic score value corresponding to the area name and the number of times the area name occurs in the body content;
Taking the product of the basic score value and the number of times as the second initial score value corresponding to the area name;
Determining the maximum value among all second initial score values and the number of times the target area name occurs in the body content, where the target area name is the area name corresponding to the maximum value;
Taking, as the second score value, the value obtained by subtracting the number of times corresponding to each remaining area name from the number of times the target area name occurs in the body content;
Wherein the remaining area names are the area names in the at least one area name other than the area name corresponding to the maximum value;
Assigning the press release to the target category corresponding to the second matching result includes:
Assigning the press release to the area name corresponding to the largest second initial score value among all second initial score values.
5. The method of claim 3, wherein after it is determined that the second score value does not meet the second preset condition, the method further includes:
Predicting, according to a classification model, the probability of the area to which the press release belongs;
When the probability is judged to be greater than a threshold, taking the press release as a press release of that area.
6. The method of claim 5, wherein before predicting, according to the classification model, the probability of the area to which the press release belongs, the method further includes:
Obtaining a training corpus, the training corpus including the press releases and corresponding area names for which the first score value was judged to meet the first preset condition, and/or the press releases and corresponding area names for which the second score value was judged to meet the second preset condition; and
Obtaining the classification model based on the training corpus.
7. The method of claim 6, wherein obtaining the classification model based on the training corpus includes:
Extracting keywords from each press release in the training corpus using a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
Encoding each press release into a feature vector according to its press release attributes and keywords;
Performing feature selection and feature combination on the training corpus encoded as feature vectors;
Training a multi-class logistic regression model on the training corpus after feature selection and feature combination, to obtain the classification model.
8. A device for news classification, including:
An extraction unit, configured to extract the headline of a press release;
A matching unit, configured to perform target category matching on the headline to obtain a first matching result;
A computing unit, configured to calculate a first score value of the first matching result;
A judging unit, configured to judge whether the first score value meets a first preset condition;
A classification unit, configured to, when the judging unit judges that the first score value meets the first preset condition, assign the press release to the target category corresponding to the first matching result.
9. A news classification method, including:
Performing target category matching on a news article to obtain a matching result;
Calculating a score value of the matching result and, when the score value is judged to satisfy a preset condition, assigning the news article to the target category corresponding to the matching result;
Training a classification model based on the news article and its corresponding target category; and, when the score value is judged not to satisfy the preset condition, classifying the news article based on the classification model.
Described training is it is anticipated that obtain being based on described training and expect.
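The two-stage flow of this claim, rule matching first and a trained model as the fallback, might look roughly like the sketch below. The scoring rule (counting occurrences of the category name), the preset condition (`min_score`), and the `OverlapModel` class are hypothetical stand-ins. In particular, `OverlapModel` is a deliberately simple word-overlap voter used in place of a real classification model, and it is updated incrementally rather than trained in a separate batch step.

```python
from collections import Counter, defaultdict

CATEGORIES = ["hangzhou", "beijing"]  # hypothetical target categories

def match_score(tokens, category):
    """Hypothetical score value: how often the category name occurs in the article."""
    return tokens.count(category)

def rule_classify(tokens, min_score=1):
    """Return the best-matching category if its score meets the preset condition."""
    best = max(CATEGORIES, key=lambda c: match_score(tokens, c))
    return best if match_score(tokens, best) >= min_score else None

class OverlapModel:
    """Stand-in classification model: votes by word counts seen per category."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, tokens, category):
        self.counts[category].update(tokens)

    def predict(self, tokens):
        if not self.counts:
            return None  # nothing learned yet
        return max(self.counts,
                   key=lambda c: sum(self.counts[c][t] for t in tokens))

def classify_stream(articles):
    """Classify tokenized articles, training the model on confident rule matches."""
    model, results = OverlapModel(), []
    for tokens in articles:
        category = rule_classify(tokens)
        if category is not None:
            model.train(tokens, category)     # confident match becomes training data
        else:
            category = model.predict(tokens)  # fall back to the trained model
        results.append(category)
    return results

articles = [
    ["hangzhou", "west", "lake", "festival"],
    ["beijing", "palace", "museum", "exhibit"],
    ["west", "lake", "boat", "tour"],
]
assigned = classify_stream(articles)
```

The third article names no category, so it is classified by the stand-in model, which has only seen the words of the first two articles.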
10. A news classification device, including:
A matching unit, for performing target category matching on a news article to obtain a matching result;
A computing unit, for calculating a score value of the matching result;
A judging unit, for judging whether the score value satisfies a preset condition;
A classification unit, for assigning the news article to the target category corresponding to the matching result when the judging unit judges that the score value satisfies the preset condition;
An algorithm unit, for training a classification model based on the news article and its corresponding target category;
The classification unit is further configured to classify the news article based on the classification model when the judging unit judges that the score value does not satisfy the preset condition.
CN201610115723.5A 2016-03-01 2016-03-01 News classification method and device Active CN105760526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610115723.5A CN105760526B (en) 2016-03-01 2016-03-01 News classification method and device

Publications (2)

Publication Number Publication Date
CN105760526A true CN105760526A (en) 2016-07-13
CN105760526B CN105760526B (en) 2019-05-07

Family

ID=56332195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610115723.5A Active CN105760526B (en) 2016-03-01 2016-03-01 News classification method and device

Country Status (1)

Country Link
CN (1) CN105760526B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101174273A (en) * 2007-12-04 2008-05-07 清华大学 News event detecting method based on metadata analysis
CN104346411A (en) * 2013-08-09 2015-02-11 北大方正集团有限公司 Method and equipment for clustering multiple manuscripts
CN104424308A (en) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 Web page classification standard acquisition method and device and web page classification method and device
CN104881458A (en) * 2015-05-22 2015-09-02 国家计算机网络与信息安全管理中心 Labeling method and device for web page topics

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202057A (en) * 2016-08-30 2016-12-07 东软集团股份有限公司 The recognition methods of similar news information and device
CN106202057B (en) * 2016-08-30 2019-07-12 东软集团股份有限公司 The recognition methods of similar news information and device
CN108090099A (en) * 2016-11-22 2018-05-29 科大讯飞股份有限公司 A kind of text handling method and device
CN108090099B (en) * 2016-11-22 2022-02-25 科大讯飞股份有限公司 Text processing method and device
CN106503266A (en) * 2016-11-30 2017-03-15 政和科技股份有限公司 Document Classification Method and device
CN109816134A (en) * 2017-11-22 2019-05-28 北京京东尚科信息技术有限公司 Shipping address prediction technique, device and storage medium
CN109816134B (en) * 2017-11-22 2021-07-20 北京京东尚科信息技术有限公司 Method and device for predicting delivery address and storage medium
CN107889068A (en) * 2017-12-11 2018-04-06 成都欧督系统科技有限公司 Message broadcast controlling method based on radio communication
CN108090201A (en) * 2017-12-20 2018-05-29 珠海市君天电子科技有限公司 A kind of method, apparatus and electronic equipment of article content classification
CN110674290A (en) * 2019-08-09 2020-01-10 国家计算机网络与信息安全管理中心 Relationship prediction method, device and storage medium for overlapping community discovery
CN110674290B (en) * 2019-08-09 2023-03-10 国家计算机网络与信息安全管理中心 Relationship prediction method, device and storage medium for overlapping community discovery
CN110750697A (en) * 2019-10-30 2020-02-04 汉海信息技术(上海)有限公司 Merchant classification method, device, equipment and storage medium
CN111209390A (en) * 2020-01-06 2020-05-29 北大方正集团有限公司 News display method and system, and computer readable storage medium
CN111209390B (en) * 2020-01-06 2023-09-05 新方正控股发展有限责任公司 News display method and system and computer readable storage medium
CN111324735A (en) * 2020-02-20 2020-06-23 湖南芒果听见科技有限公司 Method and terminal for automatically classifying hourly essentials

Also Published As

Publication number Publication date
CN105760526B (en) 2019-05-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant