CN110019776B - Article classification method and device and storage medium - Google Patents

Article classification method and device and storage medium

Info

Publication number
CN110019776B
CN110019776B (granted from application CN201710792136.4A)
Authority
CN
China
Prior art keywords
article
articles
feature words
target category
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710792136.4A
Other languages
Chinese (zh)
Other versions
CN110019776A (en)
Inventor
王树伟
温旭
花少勇
何鑫
姜国华
殷乐
花贵春
范欣
胡博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Beijing Co Ltd
Original Assignee
Tencent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Beijing Co Ltd filed Critical Tencent Technology Beijing Co Ltd
Priority to CN201710792136.4A priority Critical patent/CN110019776B/en
Publication of CN110019776A publication Critical patent/CN110019776A/en
Application granted granted Critical
Publication of CN110019776B publication Critical patent/CN110019776B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention disclose an article classification method, an article classification device and a storage medium. The method comprises the following steps: matching the feature words of the articles included in the test set against the feature words included in the feature word stock of the target category to obtain successfully matched feature words; determining the score with which an article belongs to the target category according to the scores and corresponding weights of the article's successfully matched feature words; determining, according to the article's score, a first judgment result indicating whether the article belongs to the target category; and inputting the features of the remaining articles into a classifier model to determine a second judgment result indicating whether the remaining articles belong to the target category, the remaining articles being those determined according to the first judgment result not to belong to the target category.

Description

Article classification method and device and storage medium
Technical Field
The present invention relates to computer technologies, and in particular, to a method and apparatus for classifying articles, and a storage medium.
Background
Information on the Internet is growing explosively, and the Internet has become an important channel for obtaining different types of articles, such as news, public-account articles and microblogs, covering all aspects of daily work, life and study.
To attract readers or increase click-through rates, some publishers release offending articles with pornographic, vulgar or socially negative content, adding wording with obvious sexual innuendo or vulgar overtones, particularly in titles. Such articles have an adverse social effect on users, especially underage users.
Articles are published on the Internet in massive volume and at high speed, so manual review cannot keep up with users' demand for rapid publication. As for machine review, offending articles often use borderline or implicit expressions precisely to evade both manual and machine review at publication time, making accurate identification difficult.
In summary, there is no effective solution for classifying articles accurately and efficiently so as to filter the publication of offending articles on the Internet.
Disclosure of Invention
The embodiment of the invention provides an article classification method, an article classification device and a storage medium, which can accurately and efficiently classify articles.
The technical scheme of the embodiment of the invention is realized as follows:
The embodiment of the invention provides an article classification method, which comprises the following steps:
matching the feature words of the articles included in the test set against the feature words included in the feature word stock of the target category to obtain successfully matched feature words;
determining the score with which an article belongs to the target category according to the scores and corresponding weights of the article's successfully matched feature words;
determining, according to the article's score, a first judgment result indicating whether the article belongs to the target category;
and inputting the features of the remaining articles into a classifier model to determine a second judgment result indicating whether the remaining articles belong to the target category, the remaining articles being those determined according to the first judgment result not to belong to the target category.
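As a minimal, purely illustrative sketch (not the patent's implementation; all names and values are assumed), the two-stage flow above — a lexicon-score prefilter followed by a classifier over the remaining articles — can be expressed as:

```python
def lexicon_score(article_words, lexicon):
    """Sum the scores of lexicon feature words found in the article,
    each multiplied by its corresponding weight."""
    return sum(score * weight
               for word, (score, weight) in lexicon.items()
               if word in article_words)

def two_stage_classify(articles, lexicon, threshold, classifier):
    """Return {article_id: bool} for membership in the target category."""
    results = {}
    remaining = []
    for aid, words in articles.items():
        if lexicon_score(words, lexicon) > threshold:
            results[aid] = True          # first judgment: clearly in category
        else:
            remaining.append(aid)        # defer to the classifier model
    for aid in remaining:
        results[aid] = classifier(articles[aid])  # second judgment
    return results

# Toy example with made-up feature words, scores and weights
lexicon = {"spamword": (2.0, 1.0), "badword": (1.5, 1.0)}
articles = {"a1": {"spamword", "badword", "news"}, "a2": {"news", "sports"}}
out = two_stage_classify(articles, lexicon, threshold=3.0,
                         classifier=lambda words: "badword" in words)
```

Here `a1` scores 3.5 and is accepted at stage one, while `a2` falls through to the (here trivial) classifier.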
The embodiment of the invention provides an article classification device, which comprises:
the word stock recognition unit, configured to match the feature words of the articles included in the test set against the feature words included in the feature word stock of the target category to obtain successfully matched feature words;
the word stock recognition unit being further configured to determine the score with which an article belongs to the target category according to the scores and corresponding weights of the article's successfully matched feature words;
the word stock recognition unit being further configured to determine, according to the article's score, a first judgment result indicating whether the article belongs to the target category;
and the classifier model recognition unit, configured to input the features of the remaining articles into the classifier model and to determine a second judgment result indicating whether the remaining articles belong to the target category, the remaining articles being those determined according to the first judgment result not to belong to the target category.
In the above solution, the classifier model recognition unit is further configured to:
obtain feature words from the feature word stock, the feature words satisfying the following conditions: they do not appear in the articles included in the training set; and the distance between their word vectors and the word vectors of the feature words of the articles included in the training set is smaller than a distance threshold;
and take the word vectors of the feature words thus obtained for an article, together with the word vectors of the article's own feature words, as the sample features of the article.
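A hedged sketch of this feature augmentation (toy 2-d vectors stand in for real word embeddings; function and variable names are illustrative, not from the patent):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def augment_sample_features(article_words, word_vectors, lexicon_vectors,
                            dist_threshold):
    """Collect vectors of lexicon feature words that do NOT occur in the
    article but lie within dist_threshold of some article feature word's
    vector, and return them together with the article's own vectors."""
    extra = [vec for word, vec in lexicon_vectors.items()
             if word not in article_words
             and any(euclidean(vec, word_vectors[w]) < dist_threshold
                     for w in article_words)]
    return [word_vectors[w] for w in article_words] + extra

# "kitten" is close to "cat" and gets pulled in; "car" is too far away
word_vectors = {"cat": (0.0, 0.0)}
lexicon_vectors = {"kitten": (0.1, 0.0), "car": (5.0, 5.0)}
features = augment_sample_features(["cat"], word_vectors, lexicon_vectors, 1.0)
```

The effect is that near-synonyms of an article's feature words enrich its sample features even when they never appear in its text.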
In the above scheme, the classifier model recognition unit is further configured to determine, according to the reading histories of users who read the articles included in the training set, the proportion and/or number of articles of the target category in those reading histories, and to add the reading proportion and/or number to the sample features of the corresponding articles.
In the above scheme, the word stock recognition unit is further configured to perform word segmentation on each article included in the test set, and to screen punctuation marks, common words and stop words out of the segmentation results.
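A minimal sketch of this segment-and-screen step. Whitespace splitting stands in for a real segmenter (for Chinese text, a dedicated word-segmentation tool would be used), and the stop-word and common-word lists are illustrative:

```python
import string

STOP_WORDS = {"the", "a", "of", "and"}   # illustrative stop-word list
COMMON_WORDS = {"said", "new"}           # illustrative high-frequency words

def extract_candidate_words(text):
    """Tokenise, then screen out punctuation, common words and stop words,
    leaving candidate feature words."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    return [t for t in tokens
            if t and t not in STOP_WORDS and t not in COMMON_WORDS]

words = extract_candidate_words("The mayor said: a new bridge, and a park!")
```

Only the topic-bearing words survive the screening.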
In the above scheme, the classifier model recognition unit is further configured to construct training samples by taking the word vectors of the feature words of the articles included in the training set as sample features and the corresponding classification results as sample labels;
to iteratively train classifier models for the different categories with the constructed training samples until an iteration stop condition is met;
and to fit the prediction results of the classifier models to the classification results of the articles included in the training set, so as to obtain the fitting relationship between the classifier models.
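As a purely illustrative sketch of iterative binary training, a simple linear learner stands in for the SVM-style classifier the description mentions, and a fixed epoch count stands in for the iteration stop condition (both are assumptions, not the patent's method):

```python
def train_binary_classifier(samples, labels, epochs=20, lr=0.1):
    """Iteratively fit a linear decision rule (perceptron-style updates)
    on (feature vector, 0/1 label) pairs until the stop condition,
    here a fixed number of epochs, is met."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            if err:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b

# One such classifier would be trained per target category
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]
w, b = train_binary_classifier(samples, labels)
pred = 1 if sum(wi * xi for wi, xi in zip(w, [0.95, 0.05])) + b > 0 else 0
```

On this linearly separable toy data the learned rule classifies a new in-category sample correctly.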
In the above scheme, the classifier model recognition unit is further configured to obtain feature words from the feature word stock, the feature words satisfying the following conditions: they do not appear in the articles included in the training set; and the distance between their word vectors and the word vectors of the feature words of the articles included in the training set is smaller than a distance threshold;
and to take the word vectors of the feature words thus obtained for an article, together with the word vectors of the article's own feature words, as the sample features of the article.
In the above scheme, the classifier model recognition unit is further configured to determine, in the reading histories of users who read the articles included in the training set, the proportion and/or number of articles of the target category;
and to add the reading proportion and/or number to the sample features of the corresponding articles.
The embodiment of the invention provides an article classification device, which comprises:
a memory for storing an executable program;
and the processor is used for realizing the article classification method provided by the embodiment of the invention by executing the executable program stored in the memory.
The embodiment of the invention provides a readable storage medium, which stores an executable program, and the executable program realizes the article classification method provided by the embodiment of the invention when being executed by a processor.
The embodiment of the invention has the following beneficial effects:
1) By matching the feature words of the target category against the feature words of the articles included in the test set and computing scores, the articles in the test set that obviously belong to the target category are identified quickly and to the greatest possible extent;
2) For the articles remaining in the test set, i.e. those not judged to belong to the target category, a classifier model is used to judge whether they belong to the target category; the classifier model's accurate classification precisely identifies the articles among them that do belong to the target category;
3) By combining the score with the classifier model, classification accuracy is guaranteed while classification efficiency is improved.
Drawings
FIG. 1A is a schematic diagram of an optional application scenario in which content review is performed at a server when an article classification application provided by an embodiment of the present invention is used in a news push service;
fig. 1B is an optional application scenario schematic diagram of content review performed at a server when an article classification application provided in an embodiment of the present invention is used for publishing a microblog/blog article;
FIG. 1C is a schematic diagram of an alternative application scenario in which an article classification application provided by an embodiment of the present invention performs content review at a client in web encyclopedia publishing;
FIG. 2 is a schematic diagram of an alternative hardware structure of the article classification apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an alternative functional architecture of an application provided by an embodiment of the present invention;
FIG. 4 is a schematic flow chart of an alternative article classification method according to an embodiment of the present invention;
FIG. 5A is a schematic diagram of an alternative implementation of the article classification method according to an embodiment of the present invention;
FIG. 5B is a schematic flow chart of an alternative article classification provided by an embodiment of the present invention;
FIG. 6 is a schematic flow chart of an alternative article classification method according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart of an alternative method for training a classifier model provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Before explaining the present invention in further detail, terms and terminology involved in the embodiments of the present invention are explained, and the terms and terminology involved in the embodiments of the present invention are applicable to the following explanation.
1) Articles: media in various forms available on the Internet that include text (other media forms such as video and audio may of course be embedded), for example articles posted on news websites, microblogs, blogs and public accounts, and instant messaging messages transmitted in social clients.
2) Offending articles: articles violating relevant network laws and regulations, for example articles involving the following categories of offending content:
articles containing pornographic content (e.g. content that directly exposes and describes human sexual organs);
articles containing vulgar content (e.g. content that exhibits or alludes to sexual behavior, is sexually teasing or insulting, or describes sexual behavior, sexual processes or sexual techniques in sexually suggestive or teasing language);
articles containing politically reactionary content (e.g. anti-government, anti-state content).
3) Word segmentation, also called tokenization: splitting an article into individual words according to a certain segmentation strategy.
4) Stop words: words filtered out of articles, when classifying them, because they do not affect the classification decision; filtering them saves storage space and improves processing efficiency. Such commonly used words have no definite meaning on their own (they only acquire one within a complete sentence), e.g. pronouns, articles and numerals, modal particles, adverbs, prepositions and conjunctions.
5) Feature words: after an article has been segmented and the stop words filtered out, the words extracted from the remaining words that represent the article's topic; they can be used to classify the article.
The feature words include words that affect whether an article is classified as an offending article. They may include words that give an article a tendency to be classified as offending (hereinafter, vulgar words); for example, words related to sexual innuendo tend toward the offending class. Similarly, the feature words may include words that give an article a tendency to be classified as non-offending; for example, words associated with public-welfare assistance tend toward the non-offending class.
6) Common words, i.e. high-frequency words: words whose frequency of occurrence, counted over the article samples, exceeds a frequency threshold. Because they occur so frequently, common words contribute little to predicting the topic an article expresses and can be ignored.
7) Machine learning: extracting, from the article samples (samples for short) of a training set, sample features and labels of whether the samples belong to a target category (such as vulgar articles), and training a classifier model so that the trained model can judge whether the article samples of a test set belong to the target category.
8) Classifier model, also called a classifier: a model for classification obtained by means of machine learning, used to predict from the sample features of the articles included in the test set whether an article belongs to the target category, indicating the probability that it does. The classifier model here may be a support vector machine (SVM) classifier, a bag-of-words classifier, a classifier based on prior probabilities and sparse features, a classifier based on neural networks and deep learning, etc., and is used for binary classification, i.e. judging whether an article to be classified belongs to the target category.
9) Word vectors: vectors obtained by mapping the prior feature words with a word-to-vector model such as Word2Vec according to the semantic similarity between the feature words. The distance (e.g. Euclidean distance) between the word vectors of different feature words is inversely related to their semantic similarity: the smaller the distance, the more similar the semantics.
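The inverse relation between distance and similarity can be illustrated with toy 2-d vectors (these stand in for real Word2Vec embeddings and are assumed, not taken from the patent):

```python
import math

def euclidean_distance(u, v):
    """Smaller distance between word vectors = more similar semantics,
    per the definition above."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# "cat" and "kitten" are semantically close; "car" is not
vectors = {"cat": (1.0, 0.2), "kitten": (0.9, 0.3), "car": (-0.8, 0.9)}
d_close = euclidean_distance(vectors["cat"], vectors["kitten"])
d_far = euclidean_distance(vectors["cat"], vectors["car"])
```

In a real system the vectors would come from a model trained on a large corpus, but the distance comparison works the same way.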
10) Training set: the articles used for training the classifier model. Their vector representations and prior classification results are used to construct training samples, so that the classifier model acquires the ability to binary-classify the articles under test with respect to the target category.
11) Test set: the articles to be tested (classified). Their vector representations are input into the classifier model to predict a score of belonging to the target category.
The embodiment of the invention provides an article classification method, an article classification device for implementing the article classification method, and a storage medium storing an executable program for implementing the article classification method. With respect to implementation of the article classification method, embodiments of the present invention provide a terminal-side implementation and a server-side implementation, and an exemplary implementation scenario of article classification will be described.
Referring to fig. 1A, fig. 1A is a schematic diagram of an optional application scenario in which content review is performed at a server when the article classification application provided by the embodiment of the present invention is applied to a news push service. By installing a news client on a terminal, a user can read news articles on different topics, such as current-affairs and financial news, aggregated by a background server. The background server regularly pulls news articles from different article sources into a database, classifies them as offending or compliant, and screens out news articles that do not comply with relevant regulations, such as vulgar articles. The background server computes user preferences from the data of different users (e.g. browsing history and the personal profile registered by the user) and pushes news articles matching those preferences to the user.
Referring to fig. 1B, fig. 1B is a schematic diagram of an optional application scenario in which content review is performed at a server when the article classification application provided by the embodiment of the present invention is used for publishing microblog/blog articles. A microblog/blog publisher transmits an article to a server in the microblog/blog background through a client, and the background server stores the article. The article classification method provided by the embodiment of the present invention judges whether the article is an offending article; if so, the article is placed in a locked state, and an article in the locked state will not be pushed to other users; if not, the article is pushed to followers according to the follow relationships.
Referring to fig. 1C, fig. 1C is a schematic diagram of an optional application scenario in which an article is published in a web encyclopedia and content review is performed at the client. A publisher transmits an article explaining a new term to the server through the client, and the background server stores the article. The article classification method provided by the embodiment of the present invention judges whether the article is an offending article, such as a vulgar article; if so, the article is placed in a locked state, and other users visiting the web encyclopedia will not see it unless the uploading user revises it and it is recognized as a compliant article.
In the above application scenarios, the target category is offending articles, such as vulgar articles, but the invention is not limited to this. For example, in the web-encyclopedia scenario, the content is maintained jointly by visitors and often covers multiple topics, such as science, technology, history and literature. To prevent users from uploading articles that do not match a topic, the target category may be the different topics: the client determines the target category of the article a user is about to upload with respect to the chosen topic; if the article matches the corresponding target category, the user is prompted to continue uploading, and otherwise the user is prompted to reselect the topic.
Continuing to describe the implementation of the article classification device provided in the embodiment of the present invention, as described above, the article classification device provided in the embodiment of the present invention may be implemented on a terminal side or a server side, and with respect to the hardware structure of the article classification device, referring to fig. 2, fig. 2 is a schematic diagram of an alternative hardware structure of the article classification device provided in the embodiment of the present invention, and the article classification device 100 may be implemented as various types of terminals, including: smart phones, tablet devices, personal digital assistants, and the like; the article classification device 100 may also be implemented as a server for implementing various applications, including servers of different applications as shown in fig. 1A-1C.
The article classification apparatus 100 shown in fig. 2 includes: at least one processor 101, a memory 102, at least one communication interface 104, and a user interface 103. The various components in the article classification device 100 are coupled together by a bus system 105. It is understood that the bus system 105 is used to enable connected communications between these components. The bus system 105 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various buses are labeled as bus system 105 in fig. 2.
The user interface 103 may include, among other things, a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, or touch screen, etc.
It will be appreciated that the memory 102 may be volatile memory or non-volatile memory, and may include both volatile and non-volatile memory. The non-volatile memory may be, among others, a read-only memory (ROM) or a programmable read-only memory (PROM). The volatile memory may be a random access memory (RAM), which serves as an external cache; by way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM) and synchronous static random access memory (SSRAM). The memory 102 described in the embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The memory 102 in embodiments of the present invention is used to store various categories of data to support the operation of the article classification device 100. Examples of such data include: any executable programs for operating on the article classification device 100, such as an operating system 1021 and application programs 1022; contact data; telephone book data; a message; a picture; video, etc. The operating system 1021 contains various system programs, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks. The application programs 1022 may include various application programs such as a Media Player (Media Player), a Browser (Browser), etc., for implementing various application services. The method of implementing article classification according to embodiments of the present invention may be included in the application 1022.
The method disclosed in the above embodiments of the present invention may be applied to the processor 101 or implemented by the processor 101. The processor 101 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. The processor 101 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, any conventional processor or the like. The steps of the article classification method provided by the embodiments of the present invention may be embodied directly as execution by a hardware decoding processor, or as execution by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium in the memory 102; the processor 101 reads the information in the memory 102 and, in combination with its hardware, performs the steps of the above method.
In an exemplary embodiment, the article classification apparatus 100 may be implemented by one or more application specific integrated circuits (ASICs, application Specific Integrated Circuit), DSPs, programmable logic devices (PLDs, programmable Logic Device), complex programmable logic devices (CPLDs, complex Programmable Logic Device) for performing the aforementioned methods.
In an exemplary embodiment, referring to fig. 3, fig. 3 is an optional structural schematic diagram of an application 1022 provided in an embodiment of the present invention, including: a word stock recognition unit 10221 and a classifier model recognition unit 10222; as an example, the word stock recognition unit 10221 and the classifier model recognition unit 10222 may also be implemented as functional modules of the application 1022.
In the following, the processing of the article classification apparatus 100 is described taking as an example the determination of whether each article in the test set belongs to the target category. It can be understood that whether part of the articles or all articles in the test set are classified depends on the actual application requirements, and the processing described for each article in the test set also applies to the scenario of classifying some of the articles, or a single article.
The word stock recognition unit 10221 is configured to match the feature words in each article included in the test set against the feature words in the feature word stock of the target category to obtain successfully matched feature words; to determine the score with which the corresponding article belongs to the target category according to the scores and corresponding weights of each article's successfully matched feature words; and to judge from each article's score whether it belongs to the target category, obtaining a first judgment result. The first judgment result represents a score-based classification of each article in the test set, i.e. whether it belongs to the target category; for an article judged by the first judgment result to belong to the target category, that judgment is taken as the article's final classification result.
As for judging whether articles belong to the target category based on the scores, the word stock recognition unit 10221 is specifically configured, for example, to compare each article's score with the score threshold of the target category, determining that the article belongs to the target category when the threshold is exceeded, and that it does not when the threshold is not exceeded.
For extracting feature words from articles, for example, the thesaurus recognition unit 10221 is further configured to perform at least one of the following to extract feature words in articles: matching the feature words in each article included in the test set with the feature words included in the feature word library of the target class one by one to obtain a single feature word successfully matched; and combining the feature words in the articles included in the test set according to the appearance sequence, and matching the feature words with the combined feature words included in the feature word library of the target category to obtain successfully matched combined feature words.
The classifier model identifying unit 10222 is configured to input, according to the first determination result, a feature of an article (hereinafter, also referred to as a remaining article) that is determined not to belong to the target category into a classifier model, and determine whether the remaining article belongs to the target category again, to obtain a second determination result, where the second determination result is a final classification result indicating whether the remaining article belongs to the target category.
In an alternative embodiment of the present invention, the word stock recognition unit 10221 is further configured to perform word segmentation processing on the articles before extracting feature words from each article included in the test set, and to filter punctuation marks, common words and stop words out of the word segmentation results; by removing the words and symbols that commonly appear across articles, the feature words extracted against the word stock are able to discriminate an article from other articles, so that the feature vector constructed from the feature words can accurately represent the subject the article intends to express.
In an alternative embodiment of the present invention, the word stock recognition unit 10221 is specifically configured to: when the single feature words in each article included in the test set are matched one by one against the single feature words included in the feature word library of the target category, add up the scores of the successfully matched single feature words according to the corresponding weight, to obtain a score for the single-feature-word dimension; when the combined feature words in each article included in the test set are matched one by one against the combined feature words included in the feature word library of the target category, add up the scores of the successfully matched combined feature words according to the corresponding weight, to obtain a score for the combined-feature-word dimension; and add the scores of the different dimensions of each article to obtain the score of the corresponding article belonging to the target category. Whether an article belongs to the target category is then judged according to its score: for example, the article is determined to belong to the target category when its score exceeds the score threshold of the target category, and not to belong to the target category when its score does not exceed the threshold.
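The per-dimension weighted scoring described above can be sketched as follows; all feature words, scores, weights and the threshold are illustrative assumptions, not values from the patent, and matching is simplified to set membership:

```python
# Hypothetical lexicons: single feature word -> score, combined feature words -> score.
SINGLE_LEXICON = {"wordA": 0.6, "wordB": 0.8}
COMBINED_LEXICON = {("wordA", "wordC"): 0.9}
SINGLE_WEIGHT, COMBINED_WEIGHT = 1.0, 1.5   # weights of the two dimensions
SCORE_THRESHOLD = 1.0                       # score threshold of the target category

def article_score(feature_words):
    """Weighted sum of the single-word dimension and the combined-word dimension.
    Order/adjacency of combined words is ignored in this simplified sketch."""
    words = set(feature_words)
    single = sum(s for w, s in SINGLE_LEXICON.items() if w in words)
    combined = sum(s for pair, s in COMBINED_LEXICON.items()
                   if all(w in words for w in pair))
    return SINGLE_WEIGHT * single + COMBINED_WEIGHT * combined

def belongs_to_target(feature_words):
    """First judgment: compare the article's score with the category threshold."""
    return article_score(feature_words) > SCORE_THRESHOLD
```

An article matching "wordA" alone scores 0.6 and is rejected, while one also matching the combination ("wordA", "wordC") scores 0.6 + 1.5 × 0.9 and passes the threshold.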
In an alternative embodiment of the present invention, the classifier model identifying unit 10222 is specifically configured to input the features of the articles determined not to belong to the target category into different classifier models, to obtain the scores predicted by the classifier models of different categories; and to fit the scores predicted by the classifier models according to the fitting relation of the classifier models, to obtain the scores of the remaining articles belonging to the target category.
For example, in an alternative embodiment of the present invention, the classifier model identifying unit 10222 is further configured to construct training samples by taking the word vectors of the feature words of each article included in the training set as sample features and the corresponding classification results as sample labels; to iteratively train classifier models of different categories with the constructed training samples until an iteration stop condition is met; and to initialize the fitting relation of the classifiers of different categories in the classifier model, fit the scores predicted by the classifiers of different categories according to the fitting relation as the classifier model's prediction result for articles belonging to the target category, and adjust the fitting relation by fitting the classifier model's prediction results to the prior classification results of the samples included in the training set, so that the prediction results (e.g., scores) obtained by fitting according to the fitting relation are consistent with, or close to, the actual scores of the articles belonging to the target category.
For the training samples constructed from the articles of the training set, in order to alleviate the problem of sparse feature vector distribution, in an alternative embodiment of the present invention the classifier model identifying unit 10222 is further configured to obtain, from the feature word library, feature words meeting the following conditions: they do not occur in the articles included in the training set, and the distance between their word vector and a word vector of the feature words of an article included in the training set is smaller than a distance threshold; the word vectors of the feature words thus obtained for each article, together with the word vectors of the feature words of the corresponding article, are taken as the sample features of the corresponding article.
In an alternative embodiment of the present invention, the classifier model identifying unit 10222 is further configured to obtain the reading histories of the users who read the articles included in the training set, count the reading data of the target classification in the reading histories of the different users, and add the reading data for each article to the sample features of the corresponding article.
In an exemplary embodiment, the present invention further provides a readable storage medium, for example, the memory 102 including an executable program, where the executable program may be executed by the processor 101 to complete the article classification method provided by the embodiment of the present invention. The readable storage medium may be an FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM; it may also be one of various devices including one or any combination of the above memories, such as a mobile phone, a computer, a tablet device, a personal digital assistant, and the like.
In the following, taking the target category as the vulgar article category as an example, an optional article classification scheme provided by the embodiment of the invention is described, in which a classifier model is used to classify whether articles belong to the vulgar article category.
Before the classifier model is used for two-class classification, a training set needs to be constructed for training. The articles included in the training set are usually subjected to word segmentation, stop words, common words and punctuation marks are removed, and feature words are then extracted; the word vectors of the feature words are used as the sample features, and the prior (i.e., known in advance) classification result of each article (i.e., whether the article belongs to the vulgar article category) is used as the sample label for training the classifier model.
Taking an SVM classifier model as an example: for an article to be classified, the word vectors of the article's feature words are input into the SVM classifier model, which outputs a score for the article being a vulgar article; if the score exceeds a set score threshold, the article to be identified is determined to be a vulgar article. Referring to fig. 4, fig. 4 is an optional flow diagram of the article classification method provided by the embodiment of the present invention; each step will be described in turn.
And step 101, segmenting the articles included in the test set.
The embodiment of the invention does not exclude the use of any word segmentation strategy; optional word segmentation strategies are illustrated below.
For articles whose morphemes are words, such as Latin-based articles like English, the word segmentation strategy may be to segment the articles using spaces between words as natural word segmentation markers.
For an article in which the morphemes are single characters, such as an article in a language of the Sino-Tibetan family like Chinese, the word segmentation strategy can be as follows:
1) Character matching method: the character strings of the article to be analyzed are matched one by one against the entries of a machine dictionary; if a character string is found in the dictionary, the match succeeds and a word is identified.
2) Understanding method: the basic idea is to perform syntactic and semantic analysis while segmenting the text, and to use the syntactic and semantic information to resolve ambiguity.
3) Statistical method: since words are stable combinations of characters, the more often adjacent characters co-occur in the context of an article, the more likely they are to form a word. Word identification can therefore be based on the frequency or probability of adjacent co-occurrence of characters: the frequencies of adjacent co-occurring character combinations in the articles are counted, and the mutual information of adjacent co-occurring characters is calculated. For example, the co-occurrence information of two Chinese characters X and Y is defined from their adjacent co-occurrence probability; this co-occurrence information reflects how tightly the characters are bound, and when the tightness exceeds a certain threshold, the character pair can be considered to form a word.
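The adjacent co-occurrence statistic above can be sketched as pointwise mutual information over character bigrams; the function below is a minimal illustration of the idea, not the patent's exact formula:

```python
import math
from collections import Counter

def adjacent_mi(text, x, y):
    """Pointwise mutual information of characters x, y occurring adjacently:
    log(p(xy) / (p(x) * p(y))). Higher values indicate a tighter combination,
    i.e. the pair is more likely to form a word."""
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    p_xy = bigrams[x + y] / (len(text) - 1)
    if p_xy == 0:
        return float("-inf")   # the pair never occurs adjacently
    p_x, p_y = chars[x] / len(text), chars[y] / len(text)
    return math.log(p_xy / (p_x * p_y))
```

A character pair whose statistic exceeds a chosen threshold would then be treated as a word.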
4) Machine learning method: based on a large volume of already-segmented text, a statistical machine learning model is used to learn the rules of word segmentation (this is called training), so that unknown text can then be segmented; such methods make full use of knowledge related to word segmentation.
In one example of article segmentation, the following sentence (rendered here from the Chinese original) is segmented: Dragging my tired body, I walked in the door and ran into my wife and her sister cooking together; the corresponding word segmentation result is (words separated by slashes): Dragging / my / tired / body / I / walked / in / the / door / and / ran / into / my / wife / and / her / sister / cooking / together.
And 102, filtering common words, punctuation marks and stop words in the word segmentation result.
Since common words, punctuation marks and stop words contribute nothing to predicting the classification of an article, filtering them out improves the reliability of the subsequently extracted feature words in deciding whether the article belongs to the target category.
Continuing with the article of the previous example, the result after filtering common words, punctuation marks and stop words is: tired / I / door / wife / sister / cooking.
And step 103, extracting the characteristic words from the filtering result.
In an alternative embodiment of the invention, words meeting the following condition are extracted as feature words according to the term frequency-inverse document frequency (TF-IDF) algorithm: if a word occurs with high frequency in an article (e.g., with a frequency higher than the prior frequency of the word in articles of the target category) and rarely occurs in other articles, the word is considered to have category discrimination capability and is therefore taken as a feature word.
For the words remaining in an article after filtering the common words, punctuation marks and stop words, the quantified importance under the above condition is expressed as the product of the term frequency (Term Frequency) and the inverse document frequency (Inverse Document Frequency), called term frequency-inverse document frequency, and a preset number (or proportion) of words with the largest term frequency-inverse document frequency values are selected as feature words.
Term frequency refers to the frequency with which a word occurs in an article; for example, the term frequency of word $t_i$ for article $d_j$ is expressed as:

$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$  (1)

wherein $n_{i,j}$ denotes the number of occurrences of word $t_i$ in article $d_j$, and $\sum_k n_{k,j}$ denotes the total number of occurrences of all words in article $d_j$.
The inverse document frequency is obtained by dividing the number of articles included in the training set by the number of articles containing the word, and then taking the logarithm of the quotient; for example, the inverse document frequency of word $t_i$ is expressed as:

$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$  (2)

where $|D|$ denotes the number of articles included in the training set and $|\{j : t_i \in d_j\}|$ denotes the number of articles in the training set that contain the word $t_i$.
Combining equation (1) and equation (2), the term frequency-inverse document frequency $tfidf_{i,j}$ of word $t_i$ is expressed as:

$tfidf_{i,j} = tf_{i,j} \times idf_i$  (3)
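The tf-idf computation above can be sketched directly; the fragment below treats each document as a list of tokens and applies the formulas as written (no smoothing), assuming the word occurs in at least one document:

```python
import math
from collections import Counter

def tf_idf(word, doc, corpus):
    """tf-idf of `word` for document `doc` within `corpus` (a list of token lists):
    tf = occurrences of the word / total tokens in the document;
    idf = log(number of documents / number of documents containing the word)."""
    tf = Counter(doc)[word] / len(doc)
    df = sum(1 for d in corpus if word in d)   # assumed non-zero here
    return tf * math.log(len(corpus) / df)
```

For example, with three two-token documents, a word occurring once in a document and in two of the three documents gets tf = 1/2 and idf = log(3/2).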
in yet another alternative embodiment of the present invention, a Chi-Square Test (Chi-Square good ss-of-Fit Test) is used to perform feature extraction, a hypothesis is made by generally assuming that the words are not related to the target category (i.e., the hypotheses are independent of each other), the deviation degree of the hypothesis with respect to the observed value (i.e., the correlation degree of the two) is calculated, the relevance of the words illustrated with greater deviation degree to the target category is the largest, and a predetermined number of words with the largest relevance are selected as feature words.
For example, let the total number of documents be N; for a given word, count A, the number of target-category documents in which the word occurs; B, the number of non-target-category documents in which it occurs; C, the number of target-category documents in which it does not occur; and D, the number of non-target-category documents in which it does not occur. The chi-square value of the word is then expressed as:

$\chi^2 = \frac{N (AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}$  (4)
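The chi-square value above is straightforward to evaluate from the four document counts; a minimal sketch:

```python
def chi_square(n, a, b, c, d):
    """Chi-square statistic of a word's relevance to the target category:
    a = target-category documents containing it, b = other documents containing it,
    c = target-category documents without it, d = other documents without it,
    n = a + b + c + d, the total number of documents."""
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

A word spread evenly across categories yields 0 (independence), while a word concentrated in target-category documents yields a large value.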
of course, other algorithms for feature word extraction are not excluded, and for the examples of the foregoing articles, the result of feature word extraction is:
feature words: i, old men, sister and cooking.
And step 104, obtaining all articles included in the training set, representing each article by the word vectors of its feature words, and constructing training samples.
The word vector of a feature word is represented in the form of word vector + feature word sequence number, giving:

I, sequence number 1, word vector 1, weight 0.0453;

wife, sequence number 29, word vector 29, weight 0.0233;

sister, sequence number 34, word vector 34, weight 0.018;

cooking, sequence number 109, word vector 109, weight 0.13.
The weight of a feature word represents the degree (or probability) to which the feature word semantically tends toward the target category; the higher the weight, the more likely the feature word is to express the subject of the target category. The vector of the article is denoted (word vector 1, word vector 29, word vector 34, word vector 109), the classification label of the article is "vulgar article", and the correspondingly constructed training sample is: (word vector 1, word vector 29, word vector 34, word vector 109; score of belonging to the vulgar article category). It will be appreciated that a corresponding training sample can be constructed in the above manner for each article included in the training set.
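A hypothetical construction of such a training sample; the feature words and word vectors below are toy placeholders (real systems would use trained embeddings), and the label is a prior score:

```python
# Toy 3-dimensional word vectors keyed by feature word (illustrative only).
WORD_VECTORS = {
    "I": [0.1, 0.0, 0.2], "wife": [0.3, 0.1, 0.0],
    "sister": [0.2, 0.2, 0.1], "cooking": [0.0, 0.4, 0.3],
}

def build_sample(feature_words, label_score):
    """Concatenate the word vectors of the feature words as the sample feature,
    and attach the prior classification score as the sample label."""
    feature = [component for w in feature_words for component in WORD_VECTORS[w]]
    return {"feature": feature, "label": label_score}

sample = build_sample(["I", "wife", "sister", "cooking"], label_score=1.0)
```

The resulting feature vector simply stacks the four word vectors, and the label carries the article's prior classification.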
And step 105, training an SVM classifier model by using the training samples.
The SVM classifier model is trained iteratively until its loss function for classifying the test samples (an estimate of the error of the SVM classifier model's two-class classification) converges, or the number of iterations reaches a preset value.
And 106, predicting classification results of the articles to be tested by using an SVM classifier model.
Feature words are extracted from the article to be tested, the vector representation of the article is constructed from the corresponding word vectors and passed into the SVM classifier model to predict the score of belonging to the target classification; if the score exceeds a set threshold, the article is judged to be a vulgar article.
Take the following article to be tested as an example: a food and cooking program enjoyed by wives, children and the general public. The classifier model's score for the article under the vulgar classification is 0.18892; assuming a score threshold of 0.5, the article to be tested belongs to the non-vulgar articles.
In the above article classification scheme, when the feature words of the articles included in the training set cover the feature words of the article to be tested, the classification of the article to be tested has high accuracy. In practical applications, however, in order to evade content review, the publisher of an article of a certain category (such as a vulgar article) may use more obscure feature words to express the intended subject (such as a vulgar-category subject); the purpose of such expression is to stir the reader's imagination toward the subject of the target category without using feature words that semantically tend obviously toward it.
In these cases, the training samples can hardly cover the feature words used by such expressions, so the trained SVM classifier model produces large errors in its classification results for the articles to be tested.
Regarding the above problem: when a publisher adopts a more obscure way of expressing the intended subject, the common approach is to use feature words that are merely associated with the subject of the target category; feature words used in this way are often semantically distant from the target subject, i.e., because the expression is semantically obscure, it is difficult to classify the article accurately by semantic distance alone.
In view of this, referring to fig. 5A, fig. 5A is an optional implementation schematic diagram of an article classification method provided by an embodiment of the present invention. In order to classify both the articles that semantically tend obviously toward the target category and the articles that use obscure expression accurately and efficiently, the embodiment of the present invention proceeds as follows. For the articles included in the test set that semantically tend obviously toward the target category, a prior score is assigned to each feature word or combination of feature words corresponding to the target category, indicating the degree to which the feature word or combination semantically tends toward the target category; the sum of the scores of the feature words of an article to be tested that hit (i.e., match) the target category is compared against a prior score threshold to judge whether the article belongs to the target category, thereby realizing rapid classification of the articles that semantically tend obviously toward the target category. The articles identified by this classification as not belonging to the target category may still include potential target-category articles, i.e., articles that express the subject of the target category in an obscure way; for these remaining articles included in the test set, a classifier model trained by machine learning, whose classification performance is obtained by training on training samples constructed from the articles included in the training set, continues the classification for the target category.
For example, referring to fig. 5B, fig. 5B is a schematic flow chart of an alternative article classification provided in an embodiment of the present invention, including the following steps:
and step 201, matching the feature words in each article included in the test set with each feature word in the feature word bank of the target class to obtain successfully matched feature words.
Step 202, determining the score of the corresponding article belonging to the target category according to the score of the feature word successfully matched by each article and the corresponding weight.
And step 203, judging whether the article belongs to the target category according to the score of each article, and obtaining a first judgment result.
And step 204, inputting the features of the articles judged in the first judgment result not to belong to the target category into the classifier model, which judges again whether the corresponding articles belong to the target category, to obtain a second judgment result.
It can be seen that the articles in the test set that obviously belong to the target category are rapidly identified, to the greatest extent, by matching the feature words of the target category against the feature words of the articles to be classified and computing scores; for the remaining articles in the test set, the classifier model is used to accurately judge whether they belong to the target category. By judging articles with feature word matching together with the classifier model, the judgment efficiency is significantly improved over using the classifier model alone, while the judgment precision is ensured.
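The two-stage decision of steps 201 to 204 can be sketched end to end as follows; the scoring functions and thresholds are stand-ins for the lexicon scorer and the trained classifier model:

```python
def classify_articles(articles, lexicon_score, classifier_score,
                      lexicon_threshold, classifier_threshold):
    """Stage 1: lexicon scoring flags articles obviously in the target category.
    Stage 2: only the remaining articles are passed to the classifier model."""
    results, remaining = {}, []
    for article in articles:
        if lexicon_score(article) > lexicon_threshold:
            results[article] = True          # first judgment: target category
        else:
            remaining.append(article)
    for article in remaining:                # second judgment via classifier model
        results[article] = classifier_score(article) > classifier_threshold
    return results
```

Only the articles that fail the cheap lexicon check incur the cost of the classifier model, which is where the efficiency gain described above comes from.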
Continuing with the article classification method shown in fig. 5B, referring to fig. 6, fig. 6 is an optional flowchart of the article classification method provided by the embodiment of the present invention, in which two-class classification is performed on the articles to be tested with respect to the target category, i.e., whether each article to be tested belongs to the target category. The feature word library of the target category contains feature words that semantically tend toward the target category together with their corresponding scores; first, by matching the feature words of an article to be tested included in the test set against the feature words in the feature word library, the hit feature words and their corresponding scores are obtained, the scores of the hit feature words are accumulated to obtain the article's score of belonging to the target category, and articles whose scores are higher than a score threshold are identified as target-category articles. For articles with scores below the score threshold, the scheme of secondary recognition based on the classifier model is described in connection with the following steps.
Step 301, building feature word libraries of different categories.
The feature word lexicon of a category is also referred to herein as a rule base. For example, the feature word lexicon of each category (e.g., the vulgar category, the entertainment category, the game category, etc.) includes at least one of the following two types:

1) Single feature words and their corresponding scores, where a score represents the degree to which the feature word tends toward the subject of the target category; the two have a positive correlation, so the higher the score, the more the feature word tends toward the subject of the target category;

2) Combined feature words, i.e., collocations of two or more feature words, including combinations of feature words that occur consecutively or separated by at most a predetermined number of characters (e.g., 3), such as adjective-noun combinations, adverb-verb collocations, verb-noun collocations, etc., together with their corresponding scores; the score indicates the degree to which the combined feature word tends toward the subject of the target category, again with a positive correlation: the higher the score, the more the combined feature word tends toward the subject of the target category.
Step 302, training a classifier model.
In an alternative embodiment of the present invention, referring to fig. 7, fig. 7 is a schematic flow chart of an alternative process for training a classifier model according to an embodiment of the present invention, involving the following steps:
in step 3021, training samples are constructed by taking word vectors of feature words of each article included in the training set as sample features and corresponding classification results as sample marks.
For example, for article 1 included in the training set, the a priori classification result thereof is that the article belongs to the target category, and if the feature words 1 to 3 are extracted, an exemplary structure of the positive sample is as follows: { word vector of feature word 1; word vectors for feature word 2; word vectors for feature word 3; score belonging to the target class }.
For another example, for the article 2 included in the training set, the a priori classification result is that the article does not belong to the target category, and if the feature words 4 to 6 are extracted, an exemplary structure of the negative sample is as follows: { word vector of feature word 4; word vectors for feature word 5; word vectors of feature words 6; score not belonging to the target class }.
In an alternative embodiment of the invention, to address the problem that the word vectors of the feature words of the training-set articles are sparsely distributed in the vector space, and to improve the hit rate of feature words expressing the target-category subject in the articles to be tested, semantically similar feature words are selected in the vector space for the feature words extracted from the training-set articles according to their word vectors; for example, feature words meeting the following conditions are selected: they do not occur in any article included in the training set, and the distance between their word vector and a word vector of the feature words of an article included in the training set is smaller than a distance threshold. The word vectors of the feature words thus selected for each article, together with the word vectors of the feature words of the corresponding article, are taken as the sample features of the corresponding article.
For example, for the aforementioned article 1, feature word 11, feature word 12 and feature word 13, whose word vectors are within the distance threshold of word vector 1 of feature word 1, are selected, and the positive sample constructed for the article is generalized as follows:
{ word vector of feature word 1; word vectors for feature word 2; word vectors for feature word 3; word vectors of feature words 11; word vectors of feature words 12; word vectors of feature words 13; score belonging to the target class }.
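This generalization step can be sketched as a nearest-neighbour lookup in word-vector space; the vectors, words and distance threshold below are illustrative assumptions:

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def generalize_sample(article_words, word_vectors, training_words, threshold=0.3):
    """Return lexicon words that do not occur in any training-set article but
    whose vector lies within `threshold` cosine distance of one of the
    article's own feature words."""
    extra = []
    for word, vec in word_vectors.items():
        if word in training_words or word in article_words:
            continue
        if any(cosine_distance(vec, word_vectors[w]) < threshold
               for w in article_words if w in word_vectors):
            extra.append(word)
    return extra
```

The returned words' vectors would be appended to the sample's features, enlarging the region of vector space the positive sample covers.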
From the above, the method and the device effectively improve the hit rate of the feature words of the target category in the sample to be tested by generalizing the vector space of the word vectors of the training sample, particularly the positive sample, and are beneficial to realizing the accurate identification of the article of the target category.
In an alternative embodiment of the present invention, instead of including only the word vectors of the extracted feature words in a training sample, features of the reading histories of the users who read the article are introduced into the training sample; that is, for the training sample correspondingly constructed for each document included in the training set, reading history data of the users who read the article is added, including the proportion/number of target-category articles in each user's reading history. Specifically, for each article included in the training set, the reading histories of the users who read the article are obtained, the reading proportion or number of the target classification in each corresponding user's reading history is counted, and that reading proportion or number is added to the sample features of the corresponding article as a feature of a new dimension.
Taking sample 1 included in the training set as an example, suppose the users who read article 1 are user 1 and user 2; the proportion of target-category articles in user 1's reading history is counted as proportion 1, and the proportion of target-category articles in user 2's reading history as proportion 2. Then, for the sample constructed from article 1, features of the reading history dimension are added, giving the following training sample:

{ word vector of feature word 1; word vector of feature word 2; word vector of feature word 3; proportion 1; proportion 2; score belonging to the target category }.
By adding the characteristics of the reading habit dimension into the training sample, the article to be tested can be judged to belong to the target category according to the reading habits of reading users of different categories.
In addition, the collection of reading history data may be aligned with the collection period of the test set. For example, taking a week as the unit, crawler technology is used to collect the N articles newly published on the official account platform of a social network, constructing the last week's training set D_T = {t_1, ..., t_N}; the M users who read the N newly published articles within the last week are queried from the social network background, denoted U_T = {u_1, ..., u_M}; the reading history data of the M users over the last week is collected, and the proportion of target-category articles in the reading history of each of the M users is counted according to the prior classification results of the articles on the platform, denoted R_T = {r_1, ..., r_M}.
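Counting the target-category share of one user's reading history can be sketched as follows; the input shapes (article ids and a prior-label lookup) are assumptions for illustration:

```python
def target_ratio(history, is_target):
    """history: article ids one user read in the period;
    is_target: prior label per article id (True if target category).
    Returns the proportion r_m of target-category articles read."""
    if not history:
        return 0.0
    return sum(1 for article in history if is_target[article]) / len(history)

# Computing one ratio per user u_1..u_M yields R_T, whose values are
# appended to the corresponding samples as reading-history features.
```
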
And 3022, respectively and iteratively training classifier models of different categories by using the constructed training samples until the iteration suspension condition is met.
For example, training an SVM classifier model with constructed training samples, the SVM classifier model having the following predictive model for the training samples:
feature vector of training sample x parameter vector = prediction score of training sample.
Taking the gradient descent method as an example, in each iteration the loss function constructed from the SVM classifier model's classification predictions and the actual classification results is differentiated to compute updated values of the classifier model parameters, until the loss function meets a convergence condition or the number of training iterations over the training samples reaches a threshold.
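A minimal (sub)gradient-descent training loop for a linear SVM on the hinge loss, sketching the iteration described above; the hyperparameters are illustrative:

```python
def train_linear_svm(samples, labels, lr=0.1, epochs=200, reg=0.01):
    """samples: list of feature vectors; labels: +1 / -1.
    Each update follows the subgradient of hinge loss plus an L2 regularizer."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                                   # hinge loss is active
                w = [wi + lr * (y * xi - reg * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:                                            # only the regularizer acts
                w = [wi * (1 - lr * reg) for wi in w]
    return w, b

def svm_score(w, b, x):
    """Prediction score: feature vector x parameter vector, plus the bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

On linearly separable toy data the trained score is positive for points on the positive side and negative on the other.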
And 3023, fitting the prediction result of the classifier model to the prior classification result of each sample included in the training set to obtain a fitting relation of each classifier model.
In the fitting process, the weights applied to the prediction scores of the classifier models of different categories for different samples are adjusted; since these weights determine the final classification result of a training sample, they are tuned until the weighted prediction scores come as close as possible to the prior scores of all the training samples.
Taking fig. 5A as an example, assume the classifier models of the different classes are combined with a linear fitting mathematical model (the embodiment of the present invention does not, of course, exclude nonlinear mathematical models). The weights of the classifier models of the different classes, namely the SVM classifier model, the bag-of-words classifier model, the prior probability classifier model and the neural network classifier model, are denoted weight 1, weight 2, weight 3 and weight 4 respectively, and the corresponding mathematical model is:
prediction score = SVM classifier model prediction score × weight 1 + bag-of-words classifier model prediction score × weight 2 + prior probability classifier model prediction score × weight 3 + neural network classifier model prediction score × weight 4.
By adjusting weights 1 through 4 to their optimal values (e.g. with a curve fitting algorithm or a multiple regression fitting algorithm), the prediction score of the mathematical model is made as close as possible to the prior prediction score of the training sample.
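One way to compute the optimal values of weights 1 through 4 is a least-squares fit, one instance of the regression-style fitting algorithms mentioned above; all numbers below are illustrative:

```python
import numpy as np

# Columns: prediction scores of the SVM, bag-of-words, prior-probability
# and neural-network classifier models for five training samples.
P = np.array([
    [0.9, 0.8, 0.7, 0.85],
    [0.2, 0.3, 0.4, 0.25],
    [0.8, 0.7, 0.6, 0.75],
    [0.1, 0.2, 0.3, 0.15],
    [0.6, 0.5, 0.5, 0.55],
])
prior = np.array([1.0, 0.0, 1.0, 0.0, 0.5])   # prior scores of the samples

# Least-squares fit of weights 1..4 so that P @ weights ≈ prior
weights, *_ = np.linalg.lstsq(P, prior, rcond=None)
fitted = P @ weights
```

The fitted weighted sum is, by construction, at least as close to the prior scores as any single classifier model's scores alone.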
Fig. 7 shows a scheme of training classifier models of different classes and fitting the trained classifier models to one another; it will be appreciated that, as an alternative, a classifier model of a single class may be trained to make the classification decision on the articles included in the test set.
And step 303, performing word segmentation processing on each article included in the test set.
It will be appreciated that the word segmentation process may use the character matching method, understanding method, statistical method, etc. described above.
Step 304, punctuation marks, common words and stop words are filtered out of the word segmentation result, and feature words are extracted.
And step 305, matching the feature words in the articles included in the test set with the feature words in the feature word library of the target category to obtain successfully matched feature words.
According to whether the feature word library of the target category includes single feature words and/or combined feature words, the following matching is correspondingly executed:
matching the feature words in each article included in the test set with the feature words included in the feature word library of the target class one by one to obtain a single feature word successfully matched;
and combining the feature words in each article included in the test set according to the appearance sequence, and matching the feature words with the combined feature words included in the feature word library of the target category to obtain successfully matched combined feature words.
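The two matching modes above can be sketched as follows; the function name and the toy word libraries are hypothetical, and combination entries are tuples of words that must appear in the article's order of appearance:

```python
from itertools import combinations

def match_feature_words(article_words, single_lib, combo_lib, max_len=2):
    """Match an article's feature words (in appearance order) against a
    single-word library and a combined-feature-word library."""
    singles = [w for w in article_words if w in single_lib]
    combos = []
    for size in range(2, max_len + 1):
        # combinations() preserves the order of appearance in article_words
        for combo in combinations(article_words, size):
            if combo in combo_lib and combo not in combos:
                combos.append(combo)
    return singles, combos

singles, combos = match_feature_words(
    ["alpha", "beta", "gamma"],
    single_lib={"beta"},
    combo_lib={("alpha", "gamma")},
)
```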
And step 306, determining the score of the corresponding article belonging to the target category according to the score of the feature word successfully matched by each article and the corresponding weight.
In an alternative embodiment of the invention, according to the condition that the articles to be tested included in the test set are matched with the feature word stock of the target category, the score of the articles to be tested belonging to the target category is correspondingly calculated:
when the feature words in each article included in the test set are matched one by one with the feature words included in the feature word library of the target category and successfully matched single feature words are obtained, the scores of the successfully matched single feature words are added according to the corresponding weights to obtain the score of the single-feature-word dimension;
when the combined feature words in each article included in the test set are matched with the combined feature words included in the feature word library of the target category one by one, and the successfully matched combined feature words are obtained, the scores of the successfully matched combined feature words are added according to the corresponding weights, and the score of the dimension of the combined feature words is obtained;
and adding the scores of the different dimensions of each article to obtain the score of the corresponding article belonging to the target category.
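The score computation of step 306 can be sketched as the weighted accumulation below; the dimension names, weights and threshold are illustrative assumptions:

```python
def article_score(matched, weights):
    """Weighted sum over all dimensions (single feature words, combined
    feature words) of the scores of the matched feature words; `matched`
    maps a dimension name to (word, score) pairs, `weights` maps the
    dimension name to its weight."""
    return sum(weights[dim] * sum(score for _, score in pairs)
               for dim, pairs in matched.items())

matched = {
    "single": [("word_a", 2), ("word_b", 3)],     # hypothetical hits
    "combined": [(("word_a", "word_c"), 4)],
}
score = article_score(matched, {"single": 1.0, "combined": 1.5})
is_target = score > 8.0   # compare with the score threshold of the category
```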
Step 307, whether an article belongs to the target category is judged according to its score: when the score of the article exceeds the score threshold, processing proceeds to step 310 and the article is determined to belong to the target category; when the score does not exceed the score threshold, the article is provisionally determined not to belong to the target category, and processing proceeds to step 308 for secondary classification based on the classifier models.
Step 308, the features of the remaining articles (those judged not to belong to the target category) are input into the classifier model to judge whether they belong to the target category: if the score output by the classifier model for a remaining article belonging to the target category exceeds the score threshold, processing proceeds to step 310 and the article is determined to belong to the target category; otherwise, processing proceeds to step 309 and the article is determined not to belong to the target category.
In an alternative embodiment of the invention, the features of the articles judged not to belong to the target category are input into classifier models of different categories to obtain the scores predicted by each; the predicted scores are fitted according to the fitting relation of the classifier models to obtain the score of each remaining article for the target category; when the score exceeds the score threshold corresponding to the target category, the remaining article is judged to belong to the target category, and when it does not, the remaining article is judged not to belong to the target category.
Step 309, determining that the article does not belong to the target category.
In step 310, it is determined that the article belongs to the target category.
The article classification method provided by the embodiment of the invention will be described below based on classification into the low-custom category (i.e., whether an article is a low-custom article or a non-low-custom article).
As described above, in the scheme for classifying articles of the low-custom category provided in fig. 4 of the embodiment of the present invention, when the feature words of the articles included in the training set cover the feature words of the article to be tested, classification of the article to be tested has high accuracy, because the feature words of the article to be tested that are already covered by the training samples can, through the weights of the feature words representing the low-custom category, accurately represent the degree to which an article is low-custom. In practical applications, however, in order to evade content review, the publisher of an article may express a sensitive subject (such as a low-custom subject) in a more combinatorial, implicit or edge-skirting manner. In such cases the training samples can hardly cover the feature words used in these expressions, so the trained SVM classifier model cannot classify well, which leads to large errors in the classification results of the articles to be tested.
In the following, taking classification of whether articles belong to the low-custom category (i.e., identification of low-custom articles) as an example, and aiming at the above problems, the technical scheme of the embodiment of the invention identifies low-custom articles in two ways: based on a low-custom word library and collocation library, and based on classifier models; these are described in turn.
First, identifying the low-custom articles based on the low-custom word stock and the collocation stock.
In practical use, articles hitting the low-custom word library or the collocation library are identified directly first. For this rule-library-based scheme, a low-custom word library (used for storing single low-custom words) and a collocation library (used for storing collocations of low-custom words) are pre-established; the two libraries are mainly generated by statistics over articles from low-custom and pornographic websites, so the relevant low-custom words can be extracted from more specialized sources with wide coverage.
To identify low-custom articles with such libraries, word segmentation is performed on the article to be identified, punctuation marks, common words and stop words are removed, and feature words are extracted from the result. If the extracted feature words hit a certain number of low-custom words and low-custom word collocations in the low-custom word library and the collocation library, and the weighted sum of the scores of the hit words and collocations is greater than the score threshold, the article is identified as a low-custom article. For the remaining articles in the test set, i.e. those not identified as the low-custom category, classifier-model voting and scoring is then used to continue identifying whether they are low-custom articles.
Described with a specific example: a rule library is first established, comprising a library of low-custom words and a library of low-custom word collocations.
The low-custom word library records different low-custom words and scores for their degree of low-custom; the more low-custom the word, the higher the score. For example, a certain article to be tested hits the following low-custom words and corresponding scores in the low-custom word library:
(table of hit low-custom words with scores 2, 3, 3 and 1)
Then the score of the article for matched low-custom words is: 2 + 3 + 3 + 1 = 9.
The collocation library records word collocations with a low-custom tendency and corresponding scores; the more low-custom the collocation, the higher the score. For example, a certain article to be tested hits the following low-custom word collocations and corresponding scores in the collocation library:
father {0,5} aunt 2
Father aunt et heat 3
I's public name wash 3
Here "*" denotes a wildcard, and {0,5} denotes the minimum (0) and maximum (5) number of characters allowed between the words; the article's score against the low-custom collocation library is then: 2 + 3 + 3 = 8.
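The wildcard collocation notation can, for example, be realized by compiling each collocation into a regular expression; this sketch (with English placeholder words) is an assumption about one possible implementation, not the patent's:

```python
import re

def collocation_regex(words, min_gap=0, max_gap=5):
    """Build a regex for a collocation whose words must appear in order
    with between min_gap and max_gap arbitrary characters in between
    (the '*{0,5}' notation in the example above)."""
    gap = ".{%d,%d}" % (min_gap, max_gap)
    return re.compile(gap.join(re.escape(w) for w in words))

pattern = collocation_regex(["father", "aunt"])
hit = bool(pattern.search("the father and aunt arrived"))   # gap of 5 chars
```

Hits where the words appear out of order, or with a gap longer than 5 characters, are rejected by the same pattern.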
Word hit statistics and collocation hit statistics are performed on the word segmentation result of an article, and the two scores are added according to certain weights to obtain the article's score representing its degree of low-custom at the feature word level and the feature word collocation level; when the score exceeds the set score threshold, the article is considered a low-custom article.
Second, a low-custom article is identified based on the classifier model.
For the scheme of identifying low-custom articles based on classifier models, training samples are constructed first: a number of low-custom articles and non-low-custom articles are selected from the network and divided into groups; for each low-custom article, feature words are extracted and their word vectors are assembled into the article's features (in vector form), and positive samples (samples belonging to the low-custom category) are constructed in combination with the prior classification results (whether each article is a low-custom article) of the articles included in the training set.
Furthermore, the following phenomenon is exploited: if an article is low-custom, the proportion of low-custom articles among the other articles read by the users who read it tends to be relatively high. A feature of the reading-habit dimension, such as the proportion of low-custom articles in the reading history of the reading users, is therefore added to the features of the training samples used to train the classifier model. This overcomes the limitation of using only word vectors of feature words as the features of the training samples, and allows the classifier model to accurately predict the low-custom score of an article to be tested based on two different feature dimensions: word vectors and reading history.
In addition, to address the situation where the feature words of the training samples are sparse and can hardly cover the feature words of the articles to be tested, a Word2Vec vector model is introduced: words semantically similar to the existing feature words, i.e. words whose word vectors lie within a distance threshold of the existing word vectors, are added to the features of the training samples. The word vectors of the training samples are thereby generalized to a larger vector space, which covers feature words related to low-custom subjects in the samples to be tested to the greatest possible extent; even if some novel low-custom words appear in an article to be tested, the generalized vector space can still hit them.
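The distance-threshold generalization can be sketched as follows; the function name, toy vectors and threshold are illustrative assumptions (in a real system the vocabulary vectors would come from the trained Word2Vec model):

```python
import math

def generalize(feature_vecs, vocab, dist_threshold=0.5):
    """Return every vocabulary word whose vector lies within
    dist_threshold (Euclidean distance) of an existing feature-word
    vector; vocab maps word -> vector."""
    added = []
    for word, vec in vocab.items():
        for f in feature_vecs:
            if math.dist(vec, f) < dist_threshold:
                added.append(word)
                break
    return added

vocab = {"near_word": [1.0, 0.1], "far_word": [5.0, 5.0]}   # toy vectors
added = generalize([[1.0, 0.0]], vocab)
```

The words returned are added to the sample features, enlarging the vector space the classifier is trained on.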
In addition, for articles that may include pictures, features of the image dimension, such as texture features and contour features, can be added to the features of the training samples; words in the images can even be extracted through character recognition as feature words, and their word vectors, together with the word vectors of the feature words extracted from the article's text, form the vector representation of the article.
One or more categories of classifier models are trained with the constructed training samples to predict the score of an article to be tested belonging to the low-custom category. Each category of classifier model is trained individually (that is, the classifier models of the different categories each predict their own score for the article to be tested). Classifier models of different categories here means classifier models with different theoretical underpinnings, e.g. an SVM classifier model based on support vectors, and classifier models based on bag-of-words, prior probability, sparse features, neural networks, deep learning, etc.; the score of the article to be tested is predicted independently by each, and the optimal classification result is obtained by voting over the scores output by the classifier models of the various categories.
For example, in the training stage of the classifier models of the multiple categories, the scores predicted for the training samples are fitted to the prior classification results of the training samples to obtain the weights of the classifier models of the different categories; the results of the four classifier models are fitted using a logistic regression method, and if the resulting score exceeds a certain threshold, the article is determined to be low-custom.
The identification of the remaining articles using the classifier models is described with a specific example.
Low-custom and non-low-custom articles are mixed in a certain proportion; word segmentation is performed, stop words are removed, and feature words are extracted; the reading history of the reading users of each article is obtained, and the proportion of low-custom articles in each user's reading history is counted; this proportion, combined with the word vectors of the article's feature words, forms the features of a training sample, and the prior classification result of the article is used to label the training sample. Word vectors of low-custom words whose distance in the vector space is smaller than the distance threshold are added to the training sample corresponding to each article, generalizing the word vectors in the training samples to a specific vector space.
Multiple classifier models such as the SVM are trained with the training samples; the scores predicted by the classifier models of the multiple categories are fitted according to their fitting relation to obtain the final score of the article, and whether the article belongs to the low-custom category is determined by comparison with the score threshold.
A practical application scenario combining the above schemes: in the reading list of a news aggregation APP, low-custom articles are recognized according to the users' reading histories and the features of the articles in the reading list; the low-custom articles are, for example, automatically filtered or flagged, enhancing the users' reading experience.
In summary, the embodiment of the invention has the following beneficial effects:
1) The method has a very good recognition effect on obvious low-custom expressions; in particular, the SVM has very strong recognition capability, and owing to the combination of the rule library and the classifier models there is, visually, almost no missed low-custom content;
2) For more obscure low-custom expressions, a very good generalization effect can be obtained through the different classifier models, i.e., more related expressions can be covered by expansion from the existing training samples.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (13)

1. An article classification method, comprising:
matching the feature words of the articles included in the test set with the feature words included in the feature word bank of the target class to obtain successfully matched feature words;
when feature words of articles included in the test set are matched with feature words included in a feature word library of a target class, and single feature words which are successfully matched are obtained, the scores of the single feature words which are successfully matched are added according to corresponding weights, and the score of the single feature word dimension is obtained;
when the combined feature words in the articles included in the test set are matched with the combined feature words included in the feature word library of the target category, and the successfully matched combined feature words are obtained, the scores of the successfully matched combined feature words are added according to the corresponding weights, and the score of the dimension of the combined feature words is obtained;
adding the scores of the different dimensions of the article to obtain the score of the article belonging to the target category;
determining whether the article belongs to a first judgment result of the target category according to the score of the article;
and inputting the characteristics of the remaining articles into a classifier model, and determining whether the remaining articles belong to a second judging result of the target category, wherein the remaining articles are articles which are determined according to the first judging result and do not belong to the target category.
2. The method of claim 1, wherein matching the feature words of the articles included in the test set with feature words included in the feature word bank of the target category to obtain successfully matched feature words, comprises:
matching the feature words of the articles included in the test set with the feature words included in the feature word library of the target class to obtain single feature words successfully matched; and/or,
and combining the feature words of the articles included in the test set according to the appearance sequence, and matching the feature words with the combined feature words included in the feature word library of the target category to obtain successfully matched combined feature words.
3. The method of claim 1, wherein the determining whether the article belongs to the first determination result of the target category according to the score of the article comprises:
when the score of the article exceeds the score threshold of the target category, determining that the article belongs to the target category,
and when the score of the article does not exceed the score threshold value of the target category, determining that the article does not belong to the target category.
4. The method of claim 1, wherein inputting the features of the remaining articles into a classifier model, determining whether the articles belong to a second determination of the target category comprises:
Fitting the scores predicted by the classifier models of different types according to the fitting relation of the classifier models of different types to obtain the scores of the residual articles corresponding to the target categories, wherein the predicted scores are predicted based on the characteristics of the residual articles;
and when the score of the article exceeds the score threshold value of the target category, determining that the article belongs to the target category, and when the score of the article does not exceed the score threshold value of the target category, determining that the article does not belong to the target category.
5. The method of any one of claims 1 to 4, further comprising:
constructing a training sample by taking word vectors of feature words of articles included in the training set as sample features and corresponding classification results as sample marks;
respectively and iteratively training classifier models of different categories by using the constructed training samples until the iteration suspension condition is met;
fitting the prediction result of the classifier model to the classification result of the article included in the training set to obtain a fitting relation between the classifier models.
6. The method as recited in claim 5, further comprising:
Obtaining feature words from a feature word library, wherein the feature words meet the following conditions: they are not present in the articles included in the training set; and the distance between their word vectors and the word vectors of the feature words of the articles included in the training set is smaller than a distance threshold;
and taking the word vector of the feature words obtained for the article and the word vector of the feature words of the article as sample features of the article.
7. The method as recited in claim 5, further comprising:
determining the reading proportion and/or the number of target classifications in the reading histories of reading users of articles included in the training set;
and adding the reading proportion and/or the reading quantity of the articles to the sample characteristics corresponding to the articles.
8. An article classification apparatus, comprising:
the word stock identification unit is used for matching the feature words of the articles included in the test set with the feature words included in the feature word stock of the target class to obtain successfully matched feature words;
the word stock identification unit is further used for adding the scores of the successfully matched single feature words according to the corresponding weights when the feature words of the articles included in the test set are matched with the feature words included in the feature word stock of the target category and the single feature words which are successfully matched are obtained, so that the scores of the single feature word dimensions are obtained; when the combined feature words in the articles included in the test set are matched with the combined feature words included in the feature word library of the target category, and the successfully matched combined feature words are obtained, the scores of the successfully matched combined feature words are added according to the corresponding weights, and the score of the dimension of the combined feature words is obtained; adding the scores of the different dimensions of the article to obtain the score of the article belonging to the target category;
The word stock identification unit is further used for determining whether the article belongs to a first judgment result of the target category according to the score of the article;
the classifier model identification unit is used for inputting the characteristics of the remaining articles into the classifier model, determining whether the remaining articles belong to the second judgment result of the target category, wherein the remaining articles are articles which are determined according to the first judgment result and do not belong to the target category.
9. The apparatus of claim 8, wherein,
the word stock recognition unit is further configured to:
matching the feature words of the articles included in the test set with the feature words included in the feature word library of the target class to obtain single feature words successfully matched; and/or,
and combining the feature words of the articles included in the test set according to the appearance sequence, and matching the feature words with the combined feature words included in the feature word library of the target category to obtain successfully matched combined feature words.
10. The apparatus of claim 8, wherein,
the word stock recognition unit is further configured to:
when the score of the article exceeds the score threshold of the target category, determining that the article belongs to the target category,
And when the score of the article does not exceed the score threshold value of the target category, determining that the article does not belong to the target category.
11. The device according to any one of claims 8 to 10, wherein,
the classifier model identification unit is further configured to:
fitting the scores predicted by the classifier models of different types according to the fitting relation of the classifier models of different types to obtain the scores of the residual articles corresponding to the target categories, wherein the predicted scores are predicted based on the characteristics of the residual articles;
and when the score of the article exceeds the score threshold value of the target category, determining that the article belongs to the target category, and when the score of the article does not exceed the score threshold value of the target category, determining that the article does not belong to the target category.
12. An article classification apparatus, comprising:
a memory for storing an executable program;
a processor for implementing the article classification method of any one of claims 1 to 7 by executing an executable program stored in the memory.
13. A computer-readable storage medium, characterized in that a computer-executable program is stored, which, when executed by a processor, implements the article classification method of any one of claims 1 to 7.
CN201710792136.4A 2017-09-05 2017-09-05 Article classification method and device and storage medium Active CN110019776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710792136.4A CN110019776B (en) 2017-09-05 2017-09-05 Article classification method and device and storage medium


Publications (2)

Publication Number Publication Date
CN110019776A CN110019776A (en) 2019-07-16
CN110019776B true CN110019776B (en) 2023-04-28

Family

ID=67186213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710792136.4A Active CN110019776B (en) 2017-09-05 2017-09-05 Article classification method and device and storage medium

Country Status (1)

Country Link
CN (1) CN110019776B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598157B (en) * 2019-09-20 2023-01-03 北京字节跳动网络技术有限公司 Target information identification method, device, equipment and storage medium
CN111260145A (en) * 2020-01-20 2020-06-09 中国人民大学 Method and system for predicting reading amount of WeChat public number article
CN111462733B (en) * 2020-03-31 2024-04-16 科大讯飞股份有限公司 Multi-modal speech recognition model training method, device, equipment and storage medium
CN111859915B (en) * 2020-07-28 2023-10-24 北京林业大学 English text category identification method and system based on word frequency significance level
CN111931060B (en) * 2020-08-25 2023-11-03 腾讯科技(深圳)有限公司 Evaluation method of influence of release platform, related device and computer storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007048053A1 (en) * 2005-10-21 2007-04-26 Coifman Robert E Method and apparatus for improving the transcription accuracy of speech recognition software
CN101059796A (en) * 2006-04-19 2007-10-24 中国科学院自动化研究所 Two-stage combined file classification method based on probability subject
JP2010044585A (en) * 2008-08-12 2010-02-25 Yahoo Japan Corp Advertisement distribution device, advertisement distribution method and advertisement distribution control program
CN102541958A (en) * 2010-12-30 2012-07-04 百度在线网络技术(北京)有限公司 Method, device and computer equipment for identifying short text category information
CN103514174A (en) * 2012-06-18 2014-01-15 北京百度网讯科技有限公司 Text categorization method and device
CN103699521A (en) * 2012-09-27 2014-04-02 腾讯科技(深圳)有限公司 Text analysis method and device
US8713007B1 (en) * 2009-03-13 2014-04-29 Google Inc. Classifying documents using multiple classifiers
CN103813279A (en) * 2012-11-14 2014-05-21 中国移动通信集团设计院有限公司 Junk short message detecting method and device
CN103985381A (en) * 2014-05-16 2014-08-13 清华大学 Voice frequency indexing method based on parameter fusion optimized decision
CN105975478A (en) * 2016-04-09 2016-09-28 北京交通大学 Word vector analysis-based online article belonging event detection method and device
CN106095845A (en) * 2016-06-02 2016-11-09 腾讯科技(深圳)有限公司 File classification method and device
CN106168968A (en) * 2016-06-29 2016-11-30 杭州华三通信技术有限公司 A kind of Website classification method and device
CN106909654A (en) * 2017-02-24 2017-06-30 北京时间股份有限公司 A kind of multiclass classification system and method based on newsletter archive information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4743312B2 (en) * 2009-07-29 2011-08-10 株式会社デンソー Image identification device


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
An improved method of term weighting for text classification; Hua Jiang et al.; 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems; 294-298 *
A semi-supervised method for filtering Chinese spam microblogs; Yao Ziyu et al.; Journal of Chinese Information Processing; Vol. 30, No. 05; 176-186 *
Research on a fast support vector machine classification algorithm; Liu Xiangdong et al.; Journal of Computer Research and Development; Vol. 41, No. 08; 1327-1332 *
Research on the application of feature fusion in microblog data mining; Wang Heyong et al.; Journal of Modern Information; Vol. 35, No. 05; 68-72, 77 *

Also Published As

Publication number Publication date
CN110019776A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
CN110019776B (en) Article classification method and device and storage medium
Montejo-Ráez et al. Ranked wordnet graph for sentiment polarity classification in twitter
US9645995B2 (en) Language identification on social media
Nie et al. Multimedia answering: enriching text QA with media information
Freeman Using naive bayes to detect spammy names in social networks
US20160335234A1 (en) Systems and Methods for Generating Summaries of Documents
US20160098645A1 (en) High-precision limited supervision relationship extractor
WO2022095374A1 (en) Keyword extraction method and apparatus, and terminal device and storage medium
US20150154286A1 (en) Method for disambiguated features in unstructured text
US7711673B1 (en) Automatic charset detection using SIM algorithm with charset grouping
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
CN110008309B (en) Phrase mining method and device
Ashraf et al. Abusive language detection in youtube comments leveraging replies as conversational context
CN105760363B (en) Word sense disambiguation method and device for text file
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
Syaifudin et al. Twitter data mining for sentiment analysis on peoples feedback against government public policy
CN113392331A (en) Text processing method and equipment
CN111259156A (en) Hot spot clustering method facing time sequence
Winters et al. Automatic joke generation: Learning humor from examples
CN114118087A (en) Entity determination method, entity determination device, electronic equipment and storage medium
EP3956781A1 (en) Irrelevancy filtering
Matsumoto et al. Emotion Recognition of Emoticons Based on Character Embedding.
de Souza Viana et al. A message classifier based on multinomial Naive Bayes for online social contexts
Sheykhlan et al. Pars-HAO: Hate Speech and Offensive Language Detection on Persian Social Media Using Ensemble Learning
Baktash et al. Tuning Traditional Language Processing Approaches for Pashto Text Classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant