CN109918501A - Method, apparatus, device, and storage medium for news article classification - Google Patents
Method, apparatus, device, and storage medium for news article classification
- Publication number: CN109918501A (application CN201910046633.9A)
- Authority: CN (China)
- Prior art keywords: model, training, language, language model, parameter
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Landscapes: Information Retrieval, DB Structures and FS Structures Therefor (AREA)
Abstract
This application relates to the field of artificial intelligence and provides a method, apparatus, and storage medium for news article classification. The method comprises: obtaining a first data set and preprocessing it to obtain a training set and a test set; pre-training a classification model on the training set using hierarchical characterization to obtain the parameters of the classification model; adjusting the parameters of the classification model to obtain its optimal model parameters, and then constructing a language model; testing the test set with the language model, and if the detected accuracy of the language model exceeds a preset threshold, determining that the language model satisfies the classification condition; and classifying a second data set input to the language model according to the language model. This scheme improves the accuracy of news article classification and improves the classification performance of word2vec pre-trained word-vector models on news articles.
Description
Technical field
This application relates to the field of artificial intelligence, and in particular to a method, apparatus, device, and storage medium for news article classification.
Background technique
News article classification frequently uses a word-vector model (word to vector, word2vec) for pre-training. The word2vec model embeds words into vectors; the embedding is placed only in the first layer of a neural network, while the rest of the network must still be re-trained from scratch. Classification therefore becomes detached from the surrounding context, and the resulting word-vector model classifies poorly.
Summary of the invention
This application provides a method, apparatus, device, and storage medium for news article classification, which can solve the problem of poor accuracy and poor effectiveness of news article classification in the prior art.
In a first aspect, the application provides a method of news article classification, the method comprising:
obtaining a first data set, where the first data set may include multiple news articles;
preprocessing the first data set to obtain a training set and a test set;
pre-training a classification model on the training set using hierarchical characterization to obtain the parameters of the classification model;
adjusting the parameters of the classification model to obtain the optimal model parameters of the classification model, and then constructing a language model;
testing the test set with the language model, and if the detected accuracy of the language model exceeds a preset threshold, determining that the language model satisfies the classification condition;
classifying a second data set input to the language model according to the language model.
In one possible design, pre-training the classification model on the training set using hierarchical characterization to obtain the parameters of the classification model comprises:
following the general hierarchical structure of features in the training set, from edges to shapes, learning features at all levels from low to high, and extracting the internal relationships of continuous text in the training set and the expressive power of the language structure, so as to train the classification model;
wherein the parameters of the classification model represent the weights of a neural network, and the parameters of the classification model serve as the vectorized representation of the words input to the language model.
In one possible design, adjusting the parameters of the classification model to obtain the optimal model parameters of the classification model comprises:
searching the hypothesis space (also called the simulation space) to match the hypothesis that best fits the training set, thereby obtaining a set of optimal model parameters.
In one possible design, searching the hypothesis space to match the hypothesis that best fits the training set and obtain the set of optimal model parameters comprises:
inputting the training set into the hypothesis space;
training the classification model with the training set in the hypothesis space, the training yielding the set of optimal model parameters that best fits the training set.
In one possible design, the parameters of the language model include a classification independent variable and a classification dependent variable, and constructing the language model comprises:
setting the news title and news author as the classification independent variables, and setting the news category as the classification dependent variable;
constructing the language model according to the news title, the news author, the news category, and the optimal model parameters.
In one possible design, the classification model includes an ELMo model, an OpenAI GPT model, or a BERT model.
In one possible design, the preprocessing includes stratified sampling, handling of missing values in the data, and feature sorting and screening.
In a second aspect, the application provides an apparatus for classifying news articles, which has the function of implementing the method of news article classification provided in the first aspect above. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function, and the modules may be software and/or hardware.
In one possible design, the apparatus includes:
an input/output module for obtaining a first data set, where the first data set may include multiple news articles;
a processing module for preprocessing the first data set to obtain a training set and a test set; pre-training a classification model on the training set using hierarchical characterization to obtain the parameters of the classification model; adjusting the parameters of the classification model to obtain the optimal model parameters of the classification model, and then constructing a language model; testing the test set with the language model, and if the detected accuracy of the language model exceeds a preset threshold, determining that the language model satisfies the classification condition; and classifying, according to the language model, a second data set input to the language model via the input/output module.
In one possible design, the processing module is specifically configured to:
follow the general hierarchical structure of features in the training set, from edges to shapes, learning features at all levels from low to high, and extract the internal relationships of continuous text in the training set and the expressive power of the language structure, so as to train the classification model;
wherein the parameters of the classification model represent the weights of a neural network, and the parameters of the classification model serve as the vectorized representation of the words input to the language model.
In one possible design, the processing module is specifically configured to:
search the hypothesis space to match the hypothesis that best fits the training set, thereby obtaining a set of optimal model parameters.
In one possible design, the processing module is specifically configured to:
input the training set into the hypothesis space via the input/output module;
train the classification model with the training set in the hypothesis space, the training yielding the set of optimal model parameters that best fits the training set.
In one possible design, the parameters of the language model include a classification independent variable and a classification dependent variable, and the processing module is specifically configured to:
set the news title and news author as the classification independent variables, and set the news category as the classification dependent variable;
construct the language model according to the news title, the news author, the news category, and the optimal model parameters.
In one possible design, the classification model includes an ELMo model, an OpenAI GPT model, or a BERT model. The preprocessing includes stratified sampling, handling of missing values in the data, and feature sorting and screening.
In another aspect, the application provides a computer device comprising a processor, at least one connected memory, and an input/output unit, wherein the memory stores program code and the processor calls the program code in the memory to execute the method described in the first aspect above.
In another aspect, the application provides a computer storage medium comprising instructions which, when run on a computer, cause the computer to execute the method described in the first aspect above.
Compared with the prior art, in the scheme provided by this application, the first data set is preprocessed to obtain a training set and a test set; a classification model is pre-trained on the training set using hierarchical characterization to obtain the parameters of the classification model; the parameters of the classification model are adjusted to obtain its optimal model parameters, after which a language model is constructed; the test set is tested with the language model, and if the detected accuracy of the language model exceeds a preset threshold, the language model is determined to satisfy the classification condition; a second data set input to the language model is then classified according to the language model. This scheme improves the accuracy of news article classification and improves the classification performance of word2vec pre-trained word-vector models on news articles.
Brief description of the drawings
Fig. 1 is a flow diagram of a method of news article classification in an embodiment of the present application;
Fig. 2 is a kind of structural schematic diagram of the device in the embodiment of the present application for classifying to news article;
Fig. 3 is another structural schematic diagram of the device in the embodiment of the present application for classifying to news article.
The realization, functional characteristics, and advantages of the purpose of the application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Specific embodiment
It should be appreciated that the specific embodiments described herein are only intended to explain the application, not to limit it. The terms "first", "second", etc. in the specification, claims, and drawings of this application are used to distinguish similar objects, not to describe a particular order or precedence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device containing a series of steps or modules is not necessarily limited to the steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to that process, method, product, or device. The division into modules in this application is only a logical division; other divisions are possible in practical implementation, for example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed.
The application provides a method, apparatus, device, and storage medium for news article classification, which can be used for news classification. To solve the above technical problem, the application mainly provides the following technical scheme:
The acquired news articles are preprocessed and a pre-trained language model is constructed (for example, using an ELMo model, an OpenAI GPT model, or a BERT model). A pre-trained language model can process very large texts or very large corpora, making full use of a large-scale monolingual corpus. Classifying these news articles with the pre-trained language model solves the technical problem in the background section above and improves the accuracy of news article classification.
Referring to Fig. 1, a method of news article classification provided by the application is described below. The method comprises:
101. Obtain a first data set.
The first data set may include multiple news articles, and mainly comprises information such as article titles, article abstracts, and article summaries. News articles from each news platform can be obtained by means of a web crawler.
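As a rough illustration of the collection step, the sketch below extracts article titles from an already-downloaded page using only the standard library. The tag name and `class` attribute are invented for illustration; the patent does not specify any page structure.

```python
# Minimal sketch: collect article titles from crawled HTML pages.
# Assumes pages are already fetched; markup details are illustrative only.
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the text inside <h2 class="headline"> elements."""
    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline and data.strip():
            self.titles.append(data.strip())

page = ('<h2 class="headline">City opens new park</h2><p>...</p>'
        '<h2 class="headline">Team wins final</h2>')
collector = TitleCollector()
collector.feed(page)
print(collector.titles)
```

In a real crawler the same parser would be fed each fetched platform page, and the titles, abstracts, and summaries would be stored as records of the first data set.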
102. Preprocess the first data set to obtain a training set and a test set.
The training set is used to construct the language model, and the test set is used to check the accuracy of the constructed language model.
In some embodiments, the preprocessing includes stratified sampling, handling of missing values in the data, and feature sorting and screening.
Stratified sampling means randomly sampling each category separately, to guarantee uniformity and representativeness in the sample space or in the selection of types. For example, news articles may be divided according to story labels (such as travel, entertainment, or society).
Handling missing values in the data includes extraction, cleaning, conversion, integration, and filling.
Feature sorting and screening may use denoising.
In some embodiments, the preprocessing may use the hold-out method or cross-validation. The hold-out method divides the first data set into two mutually exclusive sets, i.e. a test set and a training set. Cross-validation divides the first data set into k exclusive subsets of similar size, each subset preserving the data distribution as much as possible, i.e. each subset is obtained from the first data set by stratified sampling. Then, each time, the union of k-1 subsets is used as the training set and the remaining subset as the test set; this yields k training/test splits, so that k rounds of training and testing can be performed, and the mean of the k test results is finally returned.
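The stratified k-fold split described above can be sketched as follows, using only the standard library; the labels and k value are illustrative, not from the patent.

```python
# Sketch of a stratified k-fold split: each fold draws proportionally
# from every class, preserving the data distribution per fold.
from collections import defaultdict

def stratified_kfold(labels, k):
    """Return k folds of indices, dealing each class's indices round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):  # round-robin within each class
            folds[pos % k].append(idx)
    return folds

labels = ["travel", "travel", "sport", "sport", "travel", "sport"]
folds = stratified_kfold(labels, 3)
for fold in folds:
    print(sorted(labels[i] for i in fold))  # each fold: one of each class
```

For each of the k rounds, the union of k-1 folds would serve as the training set and the remaining fold as the test set, and the k test accuracies would be averaged.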
103. Pre-train the classification model on the training set using hierarchical characterization to obtain the parameters of the classification model.
The parameters of the classification model become the weights of a neural network; the parameters of the classification model serve as the vectorized representation of the words input to the language model, i.e. word vectors. Word vectors are used to measure the similarity between words.
Taking the word2vec model as an example, this application pre-trains the entire word2vec model using hierarchical characterization, abandoning the current practice of using word2vec only to initialize the first layer of the model. Learning word vectors is comparable to learning image edges in computer vision; hierarchical characterization then resembles the general hierarchical structure of learning image features, from edges to shapes and on to high-level semantic concepts. When the entire word2vec model is pre-trained with hierarchical characterization, the pre-training acquires both low-level and high-level features, enabling the word2vec model to learn higher-level nuances of the text, much as general image features are learned.
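To make the idea of hierarchical characterization concrete: instead of keeping only the first-layer embedding, every layer's activations are retained as features. The toy two-layer network and its weights below are invented purely for illustration.

```python
# Illustrative sketch: keep the activations of every layer as features,
# low-level (edge-like) first, high-level last. All values are invented.
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(w, v):
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def forward_all_layers(x, weights):
    """Return the activations of every layer, not just the first."""
    features = []
    h = x
    for w in weights:
        h = relu(matvec(w, h))
        features.append(h)
    return features

embedding = [1.0, 2.0]                  # toy "word vector" input
weights = [[[1.0, 0.0], [0.0, 1.0]],    # layer 1: low-level features
           [[1.0, 1.0], [0.5, -0.5]]]   # layer 2: higher-level features
layers = forward_all_layers(embedding, weights)
print(layers)  # both the low-level and the high-level features are retained
```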
In some embodiments, the classification model may be a classifier such as an ELMo model, an OpenAI GPT (Generative Pre-Training) model, or a BERT model.
ELMo is a novel deep contextualized word representation that models complex characteristics of words (such as syntax and semantics) and the variation of words across linguistic contexts (i.e. it models polysemy). The word vectors in this application are a function of the internal states of a deep bidirectional language model (biLM), pre-trained on a large text corpus. Specifically, a complete language model is trained first, and then used to process the text to be trained, generating the corresponding word vectors; the ELMo model can generate different word vectors for the same word in different sentences. Once this language model has been pre-trained, ELMo forms the word representation by formula: in essence, each intermediate layer of the bidirectional language model is summed. In the simplest case, the top-layer representation alone can also be used as ELMo. Then, in a supervised NLP task, ELMo can be concatenated directly with the word-vector input of the task-specific model, or with its top-layer representation.
ELMo learns its language model from the entire corpus; the word vectors it generates are equivalent to word vectors learned on the entire corpus, and can therefore represent the meaning of a word more accurately.
The BERT model is a multi-layer bidirectional Transformer encoder based on fine-tuning (i.e. a pre-trained language representation method); its input representation encodes a single text sentence or a pair of texts in one word sequence, for example a word sequence expressed as [question, answer]. The BERT model makes it possible to train a general "language understanding" model on a large text corpus (such as Wikipedia).
104. Adjust the parameters of the classification model to obtain the optimal model parameters of the classification model, and then construct the language model.
In some embodiments, adjusting the parameters of the classification model to obtain its optimal model parameters comprises:
searching the hypothesis space to match the hypothesis that best fits the training set, thereby obtaining a set of optimal model parameters. The hypothesis space may also be called the simulation space.
For example, the training set is input into the hypothesis space, the classification model is trained with the training set in the hypothesis space, and the final training yields the set of optimal model parameters that best fits the training set.
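A minimal picture of this hypothesis-space search: enumerate candidate parameter settings and keep the one that best fits the training set. The hypotheses below are simple threshold classifiers on a single score, and all values are invented for illustration.

```python
# Toy hypothesis-space search: score each candidate parameter setting
# on the training set and keep the best-fitting one.
def accuracy(threshold, training_set):
    """Fraction of (score, label) pairs the threshold rule gets right."""
    correct = sum(1 for score, label in training_set
                  if (score >= threshold) == label)
    return correct / len(training_set)

training_set = [(0.1, False), (0.4, False), (0.6, True), (0.9, True)]
candidates = [0.2, 0.5, 0.8]  # the hypothesis space being searched
best = max(candidates, key=lambda t: accuracy(t, training_set))
print(best, accuracy(best, training_set))
```

In the patent's setting the "candidates" would be neural-network weight configurations explored by training rather than a small explicit list, but the selection criterion — best fit to the training set — is the same.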
105. Test the test set with the language model; if the detected accuracy of the language model exceeds a preset threshold, determine that the language model satisfies the classification condition.
The classification condition may include: at least one text feature is identical or similar, where a text feature may be a label of a news article, such as finance, entertainment, sports, technology, military, or home. A text feature may also be a region name (e.g. Wuhan, Shenzhen), the platform providing the news article (e.g. newspaper, radio station, broadcast, internet), or a regional level (e.g. city level, provincial level, or district level).
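The acceptance check in step 105 reduces to computing accuracy on the held-out test set and comparing it with the preset threshold. The labels and the threshold value below are assumptions for illustration; the patent does not fix a threshold.

```python
# Sketch of step 105: accuracy on the test set vs. a preset threshold.
def model_accuracy(predictions, true_labels):
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)

PRESET_THRESHOLD = 0.75  # assumed value, not from the patent
predictions = ["sports", "finance", "sports", "tech", "finance"]
true_labels = ["sports", "finance", "sports", "finance", "finance"]
acc = model_accuracy(predictions, true_labels)
meets_condition = acc > PRESET_THRESHOLD  # model satisfies the classification condition
print(acc, meets_condition)
```

Only a model passing this gate would proceed to step 106 and classify the second data set.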
106. Classify a second data set input to the language model according to the language model.
Compared with the current mechanism, in the embodiment of the present application, the first data set is preprocessed to obtain a training set and a test set; the classification model is pre-trained on the training set using hierarchical characterization to obtain its parameters; the parameters of the classification model are adjusted to obtain the optimal model parameters, after which a language model is constructed; the test set is tested with the language model, and if the detected accuracy of the language model exceeds a preset threshold, the language model is determined to satisfy the classification condition; the second data set input to the language model is then classified according to the language model. This scheme improves the accuracy of news article classification and improves the classification performance of the word2vec pre-trained word-vector model on news articles.
Optionally, in some embodiments of the present application, pre-training the classification model on the training set using hierarchical characterization to obtain the parameters of the classification model comprises:
following the general hierarchical structure of features in the training set, from edges to shapes, learning features at all levels from low to high, and extracting the internal relationships of continuous text in the training set and the expressive power of the language structure, so as to train the classification model;
wherein the parameters of the classification model represent the weights of a neural network, and the parameters of the classification model serve as the vectorized representation of the words input to the language model.
Optionally, in some embodiments of the present application, the parameters of the language model include a classification independent variable and a classification dependent variable, and constructing the language model comprises:
setting the news title and news author as the classification independent variables, and setting the news category as the classification dependent variable;
constructing the language model according to the news title, the news author, the news category, and the optimal model parameters.
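Arranging the variables named above amounts to mapping each article record to an (X, y) pair: title and author as the independent variables, category as the dependent variable. The records below are invented examples.

```python
# Sketch: title + author as independent variables (X), category as
# the dependent variable (y). Records are invented examples.
records = [
    {"title": "Stocks rally on rate cut", "author": "A. Wu", "category": "finance"},
    {"title": "Derby ends in late winner", "author": "B. Li", "category": "sports"},
]

def to_xy(record):
    x = (record["title"], record["author"])  # classification independent variables
    y = record["category"]                   # classification dependent variable
    return x, y

pairs = [to_xy(r) for r in records]
print(pairs[0])
```

These (X, y) pairs, together with the optimal model parameters from step 104, are what the language model is built from.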
The technical features mentioned in the above embodiments apply equally to the embodiments corresponding to Fig. 2 and Fig. 3 in this application; similar points are not repeated below.
A method of news article classification in this application has been described above; the apparatus that executes the above method of news article classification is described below.
Fig. 2 shows a structural schematic diagram of an apparatus 20 for classifying news articles, which can be applied to news article classification. The apparatus 20 in the embodiment of the present application can implement the steps of the method of news article classification performed in the embodiment corresponding to Fig. 1 above. The function implemented by the apparatus 20 may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function, and the modules may be software and/or hardware. The apparatus 20 may include an input/output module 201 and a processing module 202; for the functions of the processing module 202 and the input/output module 201, reference can be made to the operations performed in the embodiment corresponding to Fig. 1, which are not repeated here. The processing module can be used to control the input or output operations of the input/output module 201.
In some embodiments, the input/output module 201 can be used to obtain a first data set, where the first data set may include multiple news articles;
the processing module 202 can be used to preprocess the first data set to obtain a training set and a test set; pre-train the classification model on the training set using hierarchical characterization to obtain its parameters; adjust the parameters of the classification model to obtain the optimal model parameters of the classification model, and then construct a language model; test the test set with the language model, and if the detected accuracy of the language model exceeds a preset threshold, determine that the language model satisfies the classification condition; and classify a second data set input to the language model according to the language model.
In the embodiment of the present application, the processing module 202 preprocesses the first data set to obtain a training set and a test set; pre-trains the classification model on the training set using hierarchical characterization to obtain its parameters; adjusts the parameters of the classification model to obtain the optimal model parameters, and then constructs a language model; tests the test set with the language model, and if the detected accuracy of the language model exceeds a preset threshold, determines that the language model satisfies the classification condition; and classifies a second data set input to the language model according to the language model. This scheme improves the accuracy of news article classification and improves the classification performance of the word2vec pre-trained word-vector model on news articles.
In some embodiments, the processing module 202 is specifically configured to:
follow the general hierarchical structure of features in the training set, from edges to shapes, learning features at all levels from low to high, and extract the internal relationships of continuous text in the training set and the expressive power of the language structure, so as to train the classification model;
wherein the parameters of the classification model represent the weights of a neural network, and the parameters of the classification model serve as the vectorized representation of the words input to the language model.
In some embodiments, the processing module 202 is specifically configured to:
search the hypothesis space to match the hypothesis that best fits the training set, thereby obtaining a set of optimal model parameters.
In some embodiments, the processing module 202 is specifically configured to:
input the training set into the hypothesis space via the input/output module 201;
train the classification model with the training set in the hypothesis space, the training yielding the set of optimal model parameters that best fits the training set.
In some embodiments, the parameters of the language model include a classification independent variable and a classification dependent variable, and the processing module 202 is specifically configured to:
set the news title and news author as the classification independent variables, and set the news category as the classification dependent variable;
construct the language model according to the news title, the news author, the news category, and the optimal model parameters.
In some embodiments, the classification model includes an ELMo model, an OpenAI GPT model, or a BERT model. The preprocessing includes stratified sampling, handling of missing values in the data, and feature sorting and screening.
The above describes, from the perspective of modular functional entities, the apparatus for news article classification in the embodiment of the present application. A computer device is introduced below from the hardware perspective, as shown in Fig. 3, comprising: a processor, a memory, an input/output unit, and a computer program stored in the memory and runnable on the processor. For example, the computer program may be the program corresponding to the method of news article classification in the embodiment corresponding to Fig. 1. For example, when the computer device implements the functions of the apparatus 20 shown in Fig. 2, the processor, when executing the computer program, implements each step of the method of news article classification executed by the apparatus 20 in the embodiment corresponding to Fig. 2 above; alternatively, the processor, when executing the computer program, implements the function of each module in the apparatus 20 of the embodiment corresponding to Fig. 2 above.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the computer device and connects the various parts of the entire computer device using various interfaces and lines.
The memory can be used to store the computer program and/or modules; the processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly comprise a program storage area and a data storage area: the program storage area may store the operating system and the application programs required for at least one function (such as a sound playback function, an image playback function, etc.); the data storage area may store data created according to the use of the mobile phone (such as audio data, video data, etc.). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one disk storage device, flash device, or other volatile solid-state storage component.
The input/output unit may also be replaced by an input unit and an output unit, which may be the same or different physical entities; when they are the same physical entity, they may be collectively referred to as the input/output unit. The input/output unit may be a transceiver. The memory may be integrated in the processor or provided separately from the processor.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be realized by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better embodiment. Based on this understanding, the technical solution of the application in essence, or the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM), including several instructions to cause a terminal (which may be a mobile phone, computer, server, network device, etc.) to execute the method described in each embodiment of the application.
The embodiments of the application have been described above in conjunction with the drawings, but the application is not limited to the above specific embodiments. The above embodiments are only illustrative, not restrictive; under the enlightenment of this application, those skilled in the art can make many further forms without departing from the purpose of the application and the scope of the claimed protection. All equivalent structures or equivalent process transformations made using the contents of the specification and drawings of the application, applied directly or indirectly in other related technical fields, fall within the protection of the application.
Claims (10)
1. A method for classifying news articles, characterized in that the method comprises:
obtaining a first data set, wherein the first data set may comprise a plurality of news articles;
preprocessing the first data set to obtain a training set and a test set;
pre-training a classification model with the training set in a hierarchical-representation manner to obtain parameters of the classification model;
adjusting the parameters of the classification model, and after obtaining optimal model parameters of the classification model, constructing a language model;
testing the test set with the language model, and if it is detected that the accuracy of the language model is higher than a preset threshold, determining that the language model satisfies a classification condition; and
classifying, according to the language model, a second data set input into the language model.
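Read as an algorithm rather than a claim, the method above can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation: the toy corpus, the lookup stand-in for the pre-trained language model, and the 0.9 threshold are all invented for the example, since the claim does not fix these specifics.

```python
import random

ACCURACY_THRESHOLD = 0.9  # stands in for the "preset threshold"; the claim gives no value

def preprocess(articles, test_ratio=0.2, seed=42):
    """Preprocess the first data set into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = articles[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def evaluate(predict, test_set):
    """Accuracy of the language model on the held-out test set."""
    correct = sum(1 for text, label in test_set if predict(text) == label)
    return correct / len(test_set)

# Toy corpus of (text, category) pairs standing in for news articles.
corpus = [("stocks rally on earnings", "finance"),
          ("team wins the final", "sports")] * 10
train_set, test_set = preprocess(corpus)

# Stand-in "language model": a lookup learned from the training set; the real
# method would pre-train and fine-tune a neural classification model here.
lookup = {text: label for text, label in train_set}
def predict(text):
    return lookup.get(text, "unknown")

accuracy = evaluate(predict, test_set)
meets_classification_condition = accuracy > ACCURACY_THRESHOLD
```

Only when `meets_classification_condition` holds would the model then be applied to a second, unlabeled data set.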
2. The method according to claim 1, characterized in that the pre-training of the classification model with the training set in the hierarchical-representation manner to obtain the parameters of the classification model comprises:
learning features at all levels of the general hierarchical structure of the features in the training set, from edges to shapes and from low level to high level, and extracting the internal relations of continuous text in the training set and the expressive capacity of its language structure, so as to train the classification model;
wherein the parameters of the classification model represent the weights of a neural network, and the parameters of the classification model serve as vectorized representations of the words input into the language model.
3. The method according to claim 1 or 2, characterized in that the adjusting of the parameters of the classification model to obtain the optimal model parameters of the classification model comprises:
searching a simulation space for a hypothesis that best matches the training set, so as to obtain a set of optimal model parameters.
4. The method according to claim 3, characterized in that the searching of the simulation space for the hypothesis that best matches the training set to obtain the set of optimal model parameters comprises:
inputting the training set into the simulation space; and
training the classification model in the simulation space with the training set, whereby the training yields the set of optimal model parameters that best matches the training set.
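Read narrowly, searching the space for the best-matching hypothesis amounts to scoring candidate parameter sets against the training set and keeping the minimizer. A minimal grid-search sketch follows; the toy loss function and the parameter grid are invented for illustration and are not taken from the application.

```python
def training_loss(params):
    """Toy stand-in for how poorly a hypothesis fits the training set;
    invented so that the minimum sits at lr=0.1, num_layers=2."""
    lr, num_layers = params
    return abs(lr - 0.1) + abs(num_layers - 2)

# The "simulation space" modeled as a finite grid of candidate model parameters.
search_space = [(lr, n) for lr in (0.01, 0.1, 1.0) for n in (1, 2, 3)]

# The hypothesis that best matches the training set.
optimal_model_parameters = min(search_space, key=training_loss)
```

A real implementation would replace the grid with gradient-based training, but the selection principle is the same.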
5. The method according to claim 4, characterized in that the parameters of the language model comprise classification independent variables and a classification dependent variable, and the constructing of the language model comprises:
setting the news title and the news author as the classification independent variables, and setting the news category as the classification dependent variable; and
constructing the language model according to the news title, the news author, the news category, and the optimal model parameters.
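The variable assignment in the claim above can be shown with a small data-shaping sketch; the example records (titles, authors, categories) are hypothetical values invented for illustration.

```python
# Hypothetical records of (title, author, category); the field values are invented.
records = [
    ("Markets hit a record high", "Li Wei", "finance"),
    ("Cup final goes to penalties", "Zhang Min", "sports"),
]

# Classification independent variables: news title and news author.
X = [{"title": title, "author": author} for title, author, _category in records]

# Classification dependent variable: news category.
y = [category for _title, _author, category in records]
```

`X` and `y` are then the inputs and targets from which the language model would be fitted together with the optimal model parameters.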
6. The method according to claim 1, characterized in that the classification model comprises an ELMo model, an OpenAI GPT model, or a BERT model.
7. The method according to claim 1, characterized in that the preprocessing comprises stratified sampling, handling of missing values in the data, and feature combing and screening.
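Two of the preprocessing steps named in the claim above, stratified sampling and missing-value handling, can be sketched with standard-library code. The sampling fraction, default fill value, and example rows are illustrative assumptions, not details from the application.

```python
import random
from collections import defaultdict

def fill_missing(rows, key, default=""):
    """Missing-value handling: replace None in a field with a default value."""
    return [{**row, key: default if row[key] is None else row[key]} for row in rows]

def stratified_sample(rows, label_key, frac, seed=0):
    """Stratified sampling: draw the same fraction from every category so the
    class balance of the sample matches that of the full data set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sample = []
    for group in by_label.values():
        k = max(1, int(len(group) * frac))
        sample.extend(rng.sample(group, k))
    return sample

# Illustrative rows; one title is missing.
rows = [
    {"title": "A", "category": "sports"},
    {"title": None, "category": "sports"},
    {"title": "C", "category": "finance"},
    {"title": "D", "category": "finance"},
]
cleaned = fill_missing(rows, "title")
sample = stratified_sample(cleaned, "category", frac=0.5)
```

Feature combing and screening would follow on the cleaned rows before the train/test split.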
8. A device for classifying news articles, characterized in that the device comprises:
an input/output module, configured to obtain a first data set, wherein the first data set may comprise a plurality of news articles; and
a processing module, configured to preprocess the first data set to obtain a training set and a test set; pre-train a classification model with the training set in a hierarchical-representation manner to obtain parameters of the classification model; adjust the parameters of the classification model, and after obtaining optimal model parameters of the classification model, construct a language model; test the test set with the language model, and if it is detected that the accuracy of the language model is higher than a preset threshold, determine that the language model satisfies a classification condition; and
classify, according to the language model, a second data set input into the language model via the input/output module.
9. A computer device, characterized in that the device comprises:
at least one processor, a memory, and an input-output unit;
wherein the memory is configured to store program code, and the processor is configured to call the program code stored in the memory to perform the method according to any one of claims 1 to 7.
10. A computer storage medium, characterized in that it comprises instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910046633.9A CN109918501A (en) | 2019-01-18 | 2019-01-18 | Method, apparatus, equipment and the storage medium of news article classification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109918501A true CN109918501A (en) | 2019-06-21 |
Family
ID=66960321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910046633.9A Pending CN109918501A (en) | 2019-01-18 | 2019-01-18 | Method, apparatus, equipment and the storage medium of news article classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109918501A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107025284A (en) * | 2017-04-06 | 2017-08-08 | 中南大学 | The recognition methods of network comment text emotion tendency and convolutional neural networks model |
CN107301246A (en) * | 2017-07-14 | 2017-10-27 | 河北工业大学 | Chinese Text Categorization based on ultra-deep convolutional neural networks structural model |
CN108363753A (en) * | 2018-01-30 | 2018-08-03 | 南京邮电大学 | Comment text sentiment classification model is trained and sensibility classification method, device and equipment |
Non-Patent Citations (2)
Title |
---|
SEBASTIAN RUDER: "NLP's ImageNet moment has arrived", pages 1-20, Retrieved from the Internet <URL:https://www.ruder.io/nlp-imagenet/> *
雷峰网 (Leiphone): "The wave of pre-trained models brought by ImageNet is about to sweep into NLP", pages 1-9, Retrieved from the Internet <URL:https://www.leiphone.com/category/ai/5hn7YLV31DEREEeA.html> *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110347840A (en) * | 2019-07-18 | 2019-10-18 | 携程计算机技术(上海)有限公司 | Complain prediction technique, system, equipment and the storage medium of text categories |
CN110390110A (en) * | 2019-07-30 | 2019-10-29 | 阿里巴巴集团控股有限公司 | The method and apparatus that pre-training for semantic matches generates sentence vector |
CN110390110B (en) * | 2019-07-30 | 2023-06-27 | 创新先进技术有限公司 | Method and apparatus for pre-training generation of sentence vectors for semantic matching |
CN110930022A (en) * | 2019-11-20 | 2020-03-27 | 携程计算机技术(上海)有限公司 | Hotel static information detection method and system, electronic equipment and storage medium |
CN110991535A (en) * | 2019-12-04 | 2020-04-10 | 中山大学 | pCR prediction method based on multi-type medical data |
CN111221974A (en) * | 2020-04-22 | 2020-06-02 | 成都索贝数码科技股份有限公司 | Method for constructing news text classification model based on hierarchical structure multi-label system |
CN111221974B (en) * | 2020-04-22 | 2020-08-14 | 成都索贝数码科技股份有限公司 | Method for constructing news text classification model based on hierarchical structure multi-label system |
CN111966828B (en) * | 2020-07-27 | 2022-05-03 | 电子科技大学 | Newspaper and magazine news classification method based on text context structure and attribute information superposition network |
CN111966828A (en) * | 2020-07-27 | 2020-11-20 | 电子科技大学 | Newspaper and magazine news classification method based on text context structure and attribute information superposition network |
CN111930947A (en) * | 2020-08-26 | 2020-11-13 | 施建军 | System and method for identifying authors of modern Chinese written works |
CN112632283A (en) * | 2020-12-30 | 2021-04-09 | 北京有竹居网络技术有限公司 | Model generation method, text classification method, device, equipment and medium |
CN113722493A (en) * | 2021-09-09 | 2021-11-30 | 北京百度网讯科技有限公司 | Data processing method, device, storage medium and program product for text classification |
CN113722493B (en) * | 2021-09-09 | 2023-10-13 | 北京百度网讯科技有限公司 | Text classification data processing method, apparatus and storage medium |
CN115496062A (en) * | 2022-11-10 | 2022-12-20 | 杭州费尔斯通科技有限公司 | Method, system, computer device and storage medium for recognizing enterprise address selection willingness |
CN115496062B (en) * | 2022-11-10 | 2023-02-28 | 杭州费尔斯通科技有限公司 | Method and system for identifying enterprise address selection willingness, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109918501A (en) | Method, apparatus, equipment and the storage medium of news article classification | |
Lines et al. | Time series classification with HIVE-COTE: The hierarchical vote collective of transformation-based ensembles | |
CN112632385B (en) | Course recommendation method, course recommendation device, computer equipment and medium | |
Zhang et al. | MOOCRC: A highly accurate resource recommendation model for use in MOOC environments | |
CN110263324A (en) | Text handling method, model training method and device | |
CN111160350B (en) | Portrait segmentation method, model training method, device, medium and electronic equipment | |
CN111741330B (en) | Video content evaluation method and device, storage medium and computer equipment | |
CN106126492B (en) | Sentence recognition methods and device based on two-way LSTM neural network | |
Wen et al. | Dynamic interactive multiview memory network for emotion recognition in conversation | |
Hong et al. | Sentiment analysis with deeply learned distributed representations of variable length texts | |
CN108960407A (en) | Recurrent neural network language model training method, device, equipment and medium | |
CN110852360A (en) | Image emotion recognition method, device, equipment and storage medium | |
CN110825850B (en) | Natural language theme classification method and device | |
CN108416032A (en) | A kind of file classification method, device and storage medium | |
CN106502979A (en) | A kind of data processing method of natural language information and device | |
CN109933782A (en) | User emotion prediction technique and device | |
CN109284370A (en) | A kind of mobile application description and permission fidelity determination method and device based on deep learning | |
CN113392640B (en) | Title determination method, device, equipment and storage medium | |
CN114154570A (en) | Sample screening method and system and neural network model training method | |
Jia et al. | A multimodal emotion recognition model integrating speech, video and MoCAP | |
CN117390497B (en) | Category prediction method, device and equipment based on large language model | |
CN112948710A (en) | Graph neural network-based punishment education recommendation method, system and storage medium | |
CN108021565A (en) | A kind of analysis method and device of the user satisfaction based on linguistic level | |
CN116304063B (en) | Simple emotion knowledge enhancement prompt tuning aspect-level emotion classification method | |
CN116757195A (en) | Implicit emotion recognition method based on prompt learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||