CN112487194A - Document classification rule updating method, device, equipment and storage medium - Google Patents

Publication number
CN112487194A
Authority
CN
China
Prior art keywords
dimension
value
document
target
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011502638.7A
Other languages
Chinese (zh)
Inventor
钱宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Consumer Finance Co Ltd
Original Assignee
Ping An Consumer Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Consumer Finance Co Ltd filed Critical Ping An Consumer Finance Co Ltd
Priority to CN202011502638.7A priority Critical patent/CN112487194A/en
Publication of CN112487194A publication Critical patent/CN112487194A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention provides a method, a device, equipment and a storage medium for updating document classification rules. The method comprises: obtaining, for each of a plurality of documents to be classified, the dimension value corresponding to each dimension; classifying each document in each dimension according to the current classification rule of that dimension and the corresponding dimension value; detecting whether a target dimension category exists in the classification result of each dimension; reformulating the classification rule for the corresponding target dimension according to the target dimension category to obtain a reformulated dimension grade gradient; and updating the current classification rule of the target dimension with the dimension grade gradient to obtain a new classification rule. The documents are first classified under the original classification rule using their dimension values, and any dimension grade that holds an unreasonable proportion of the classified documents is then refined, so that documents can be classified better according to the types of the existing documents.

Description

Document classification rule updating method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a device, and a storage medium for updating a document classification rule.
Background
With the development of computer technology, society has entered the big data era, and every industry and practitioner is concerned with how to acquire valuable information accurately and efficiently from such a huge and complicated body of information. At present, the same classification rule is often applied whenever data is classified. When numerous documents are classified under the same rule, too many documents may fall into a single dimension category, so the document classification precision cannot be improved.
Disclosure of Invention
The main purpose of the invention is to provide a method, a device, equipment and a storage medium for updating a document classification rule, so as to solve the problem that, when the same classification rule is used to classify a plurality of documents, too many documents may fall into a certain dimension category, so that the document classification precision cannot be improved.
The invention provides a method for updating document classification rules, which comprises the following steps:
obtaining dimension values of a plurality of documents to be classified, which correspond to each dimension respectively;
classifying the documents to be classified in each dimension according to the current classification rule of each dimension and the corresponding dimension value;
detecting whether a target dimension category exists in the classification result of each dimension, the target dimension category being a category in which the proportion of the number of documents reaches a preset document number proportion threshold value;
reformulating a classification rule for the corresponding target dimension according to the target dimension category to obtain a reformulated dimension grade gradient;
and updating the dimension grade gradient to the current classification rule corresponding to the target dimension to obtain a new classification rule.
Further, the step of reformulating the classification rule for the corresponding target dimension according to the target dimension category to obtain a reformulated dimension grade gradient includes:
obtaining the dimension value corresponding to each document in the target dimension, and establishing a dimension set corresponding to each dimension according to the dimension value of the document in each dimension;
calculating the distance between every two dimension values in each dimension set;
calculating, according to the formula
Figure BDA0002844065020000021
the density of each dimension value from the calculated distances, where ρ(j) represents the density of the j-th dimension value, c = max[d(j, i)], d(j, i) represents the distance between the j-th dimension value and the i-th dimension value, and max[d(j, i)] represents the distance between the maximum and the minimum of the dimension values;
calculating, according to the formula
Figure BDA0002844065020000022
the dispersion of each dimension value, where LOF(j) represents the dispersion of the j-th dimension value;
calculating, according to the formula
Figure BDA0002844065020000023
the dimension grade gradient, where f(x) represents the relation function between the average dispersion of the dimension values and the dimension grade gradient.
Further, the step of obtaining the dimension values of the plurality of documents to be classified respectively corresponding to the dimensions includes:
performing word segmentation processing on each document to be classified through a regular expression and a word segmentation tool to obtain a plurality of corresponding words;
extracting entity nouns in the words according to a semantic recognition technology;
clustering the extracted entity nouns to obtain entity nouns corresponding to all dimensions respectively;
and calculating the dimension value of the document to be classified in each dimension based on the entity nouns corresponding to each dimension respectively.
Further, the step of classifying each document to be classified in each dimension according to the current classification rule of each dimension and the corresponding dimension value includes:
obtaining the dimension value of each document to be classified in each category according to the corresponding relation between the entity nouns and the dimension values after the clustering;
and classifying according to the dimension value of the document to be classified in each dimension and the current classification rule.
Further, the step of detecting whether a target dimension category exists in the classification result of each dimension includes:
acquiring the number of documents corresponding to each dimension grade in a first dimension;
comparing the number of the documents in each dimension grade with the total number of the documents in the first dimension to obtain the document number proportion corresponding to each dimension grade;
judging whether each document number proportion exceeds the preset document number proportion threshold value, the category whose document number proportion exceeds the preset document number proportion threshold value being the target dimension category.
Further, the step of reformulating the classification rule for the corresponding target dimension according to the target dimension category to obtain a reformulated dimension grade gradient includes:
recording the dimension grade of which the document number proportion exceeds a document number proportion threshold value corresponding to the dimension grade as a first dimension grade;
acquiring a first dimension value corresponding to each document in the first dimension level;
calculating the variance of all first dimension values in the first dimension grade;
setting a plurality of corresponding sub-dimension grades for the first dimension grade according to the variance, thereby obtaining the refined dimension grade gradient; wherein the range of each of the sub-dimension levels is within the range of the first dimension level.
The invention also provides a device for updating the document classification rule, which comprises:
the dimension value module is used for acquiring the dimension values of the documents to be classified, which correspond to the dimensions respectively;
the document to be classified acquisition module is used for classifying the documents to be classified in each dimension according to the current classification rule of each dimension and the corresponding dimension value;
the target dimension acquisition module is used for detecting whether a target dimension category exists in the classification result of each dimension, the target dimension category being a category in which the proportion of the number of documents reaches a preset document number proportion threshold value;
the target classification acquisition module is used for re-formulating a classification rule for the corresponding target dimension according to the target dimension category to obtain a re-formulated dimension grade gradient;
and the target classification rule module is used for updating the dimension grade gradient to the current classification rule corresponding to the target dimension to obtain a new classification rule.
Further, the target classification obtaining module includes:
the dimension set establishing sub-module is used for acquiring the dimension values corresponding to the documents in the target dimension and establishing the dimension sets corresponding to the dimensions according to the dimension values of the documents in the dimensions;
the distance submodule is used for calculating the distance between every two dimension values in each dimension set;
a density calculation submodule, used for calculating, according to the formula
Figure BDA0002844065020000041
the density of each dimension value from the calculated distances, where ρ(j) represents the density of the j-th dimension value, c = max[d(j, i)], d(j, i) represents the distance between the j-th dimension value and the i-th dimension value, and max[d(j, i)] represents the distance between the maximum and the minimum of the dimension values;
a dispersion calculation submodule, used for calculating, according to the formula
Figure BDA0002844065020000042
the dispersion of each dimension value, where LOF(j) represents the dispersion of the j-th dimension value;
a dimension grade gradient calculation submodule, used for calculating, according to the formula
Figure BDA0002844065020000043
the dimension grade gradient, where f(x) represents the relation function between the average dispersion of the dimension values and the dimension grade gradient.
The invention also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of any of the above methods when the processor executes the computer program.
The invention also provides a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method of any of the above.
The invention has the following beneficial effects: the documents to be classified are first classified under the original classification rule using their dimension values, and any dimension grade that holds an unreasonable proportion of the classified documents is then refined, so that documents can be classified better according to the types of the existing documents.
Drawings
FIG. 1 is a flowchart illustrating a method for updating document classification rules according to an embodiment of the present invention;
FIG. 2 is a block diagram schematically illustrating an apparatus for updating document classification rules according to an embodiment of the present invention;
FIG. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that all directional indicators (such as up, down, left, right, front, back, etc.) in the embodiments of the present invention are only used to explain the relative position relationship between the components, the motion situation, etc. in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indicator is changed accordingly, and the connection may be a direct connection or an indirect connection.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone.
In addition, the descriptions related to "first", "second", etc. in the present invention are only for descriptive purposes and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
Referring to FIG. 1, the present invention provides a method for updating a document classification rule, including:
S1: obtaining dimension values of a plurality of documents to be classified, which correspond to each dimension respectively;
S2: classifying the documents to be classified in each dimension according to the current classification rule of each dimension and the corresponding dimension value;
S3: detecting whether a target dimension category exists in the classification result of each dimension, the target dimension category being a category in which the proportion of the number of documents reaches a preset document number proportion threshold value;
S4: reformulating a classification rule for the corresponding target dimension according to the target dimension category to obtain a reformulated dimension grade gradient;
S5: updating the dimension grade gradient to the current classification rule corresponding to the target dimension to obtain a new classification rule.
As described in step S1, the dimension values of the documents to be classified in each dimension are obtained. They may be obtained by natural language processing techniques, which are described in detail later. Each dimension has its own dimension levels, and the documents to be classified are classified within a dimension according to those levels. The dimensions may include, among others, an emotion dimension, a logic dimension, a content dimension and a feature dimension.
As described in step S2, each document to be classified is classified in each dimension according to the current classification rule of that dimension and the corresponding dimension value. The current classification rule is the classification rule currently stored in the system or device. Specifically, in some embodiments, each document to be classified obtains a dimension value in each dimension according to its content, and the document is assigned to the dimension level whose range in the current classification rule contains that value. For example, if the happy dimension level of the emotion dimension corresponds to the numerical range 80-90 and the emotion dimension value of a document to be classified is 85, the document is classified into the happy dimension level.
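The range lookup in step S2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the rule layout (a dictionary of inclusive ranges), the function name `classify`, and every level other than the "happy" range of 80-90 are assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical rule for the emotion dimension: level name -> inclusive (low, high).
# Only the "happy" range 80-90 comes from the text; the other levels are invented.
EMOTION_RULE: Dict[str, Tuple[float, float]] = {
    "neutral": (40, 79),
    "happy": (80, 90),
    "elated": (91, 100),
}


def classify(value: float, rule: Dict[str, Tuple[float, float]]) -> Optional[str]:
    """Return the first level whose inclusive range contains the value, else None."""
    for level, (low, high) in rule.items():
        if low <= value <= high:
            return level
    return None


print(classify(85, EMOTION_RULE))  # happy
```

A value that falls outside every range returns `None` here; how such documents should be handled is not stated in the text.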
As described in step S3, it is detected whether a target dimension category exists in the classification result of each dimension. For a given dimension, if the number of documents in one category far exceeds the number in the other categories, the current classification rule is not suitable for the current document classification; that is, the proportion of documents in that single dimension level exceeds the document number proportion threshold. The threshold is a preset value and may be set to 30%, for example. Note that the document number proportion is not computed over the documents to be classified alone: it is the proportion over all documents in the dimension after the documents to be classified have been classified, the documents to be classified being only a part of that data.
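A minimal sketch of the check in step S3, assuming the level assigned to every document in the dimension is available as a flat list. The function name `find_target_levels` is hypothetical, and the strict "exceeds" comparison follows this paragraph (the claims say "reaches", which would be `>=`).

```python
from collections import Counter
from typing import Dict, List


def find_target_levels(levels: List[str], threshold: float = 0.30) -> Dict[str, float]:
    """Return {level: proportion} for every level whose share of documents
    exceeds the threshold. `levels` holds one entry per document in the dimension,
    covering all documents, not only the newly classified ones."""
    total = len(levels)
    if total == 0:
        return {}
    counts = Counter(levels)
    return {lvl: n / total for lvl, n in counts.items() if n / total > threshold}


# Ten documents: 6 in level "a", 3 in "b", 1 in "c".
assigned = ["a"] * 6 + ["b"] * 3 + ["c"]
print(find_target_levels(assigned))  # {'a': 0.6}
```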
As described in step S4, the classification rule is reformulated for the corresponding target dimension according to the target dimension category, so as to obtain a reformulated dimension level gradient. One way to refine is to re-divide all existing dimension levels. For example, if the dimension values range from 1 to 100 and the current classification rule assigns one dimension level to every 10 values (1-10, …, 91-100), the re-divided dimension levels may be 1-5, …, 96-100, or 1-4, …, 97-100. Another way is to assign sub-dimension levels only to the dimension level that exceeds the document number proportion threshold in the current classification rule. For example, if the dimension level exceeding the threshold is 21-30, sub-dimension levels such as 21-22, …, 29-30 may be set for it, so that the dimension level is further refined and the resulting classification is more accurate.
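Both refinement examples in this paragraph amount to cutting an inclusive integer range into narrower consecutive ranges. The sketch below assumes integer dimension values; the function name `split_level` and the `width` parameter are illustrative and do not appear in the original.

```python
from typing import List, Tuple


def split_level(low: int, high: int, width: int) -> List[Tuple[int, int]]:
    """Split the inclusive integer range [low, high] into consecutive
    sub-ranges that each cover `width` values (the last one may be shorter)."""
    return [
        (start, min(start + width - 1, high))
        for start in range(low, high + 1, width)
    ]


# Sub-levels for the overfull level 21-30.
print(split_level(21, 30, 2))  # [(21, 22), (23, 24), (25, 26), (27, 28), (29, 30)]
```

Calling `split_level(1, 100, 5)` yields the re-divided levels 1-5, …, 96-100 from the first example.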
As described in step S5, the dimension level gradient is updated into the current classification rule corresponding to the target dimension, so as to obtain a new classification rule, and the documents are reclassified according to the refined dimension level gradient. The classification obtained is more accurate, and because only the dimension levels of the target dimension that exceed their document number proportion threshold are refined, the classified data does not become excessively redundant.
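Step S5 can be pictured as replacing one entry of the stored rule with its sub-levels while leaving every other level untouched. The rule layout, the function name `update_rule`, and the `level.k` naming of sub-levels are all assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

Rule = Dict[str, Tuple[int, int]]


def update_rule(rule: Rule, target_level: str, sub_ranges: List[Tuple[int, int]]) -> Rule:
    """Return a new rule in which `target_level` is replaced by numbered
    sub-levels; all other levels are copied unchanged."""
    new_rule: Rule = {}
    for level, bounds in rule.items():
        if level == target_level:
            for k, sub in enumerate(sub_ranges, start=1):
                new_rule[f"{level}.{k}"] = sub
        else:
            new_rule[level] = bounds
    return new_rule


current = {"L2": (11, 20), "L3": (21, 30)}
print(update_rule(current, "L3", [(21, 25), (26, 30)]))
# {'L2': (11, 20), 'L3.1': (21, 25), 'L3.2': (26, 30)}
```

Returning a new dictionary rather than mutating the stored rule keeps the previous rule available for comparison; that is a design choice of the sketch, not something the text prescribes.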
In an embodiment, the step S4 of refining the dimension level of the dimension to obtain a refined dimension level gradient includes:
S401: obtaining the dimension value corresponding to each document in the target dimension, and establishing a dimension set corresponding to each dimension according to the dimension value of the document in each dimension;
S402: calculating the distance between every two dimension values in the dimension set;
S403: calculating, according to the formula
Figure BDA0002844065020000071
the density of each dimension value from the calculated distances, where ρ(j) represents the density of the j-th dimension value, c = max[d(j, i)], d(j, i) represents the distance between the j-th dimension value and the i-th dimension value, and max[d(j, i)] represents the distance between the maximum and the minimum of the dimension values;
S404: calculating, according to the formula
Figure BDA0002844065020000081
the dispersion of each dimension value, where LOF(j) represents the dispersion of the j-th dimension value;
S405: calculating, according to the formula
Figure BDA0002844065020000082
the dimension level gradient, where f(x) represents the relation function between the average dispersion of the dimension values and the dimension level gradient.
As described in step S401, the dimension value corresponding to each document in the target dimension is obtained. Because the documents have already been classified in the target dimension according to their dimension values, these values can be obtained directly.
As described in step S402, the distance between each dimension value and every other dimension value is calculated. The distance is the absolute value of the difference between the two dimension values. It reflects how much a dimension value differs from the others and can be used to judge whether the distribution within the target dimension is reasonable.
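The distance of step S402 is simple enough to write out directly. The sketch below builds the full matrix d(j, i) = |v_j - v_i| for one dimension set; the function name `pairwise_distances` is an assumption. It stops at the distances, because the density and dispersion formulas are given only as figures.

```python
from typing import List


def pairwise_distances(values: List[float]) -> List[List[float]]:
    """d[j][i] is the absolute difference between the j-th and i-th dimension values."""
    n = len(values)
    return [[abs(values[j] - values[i]) for i in range(n)] for j in range(n)]


values = [1, 4, 9]
d = pairwise_distances(values)
print(d)  # [[0, 3, 8], [3, 0, 5], [8, 5, 0]]

# The largest entry equals max(values) - min(values), the quantity the text calls c.
c = max(max(row) for row in d)
print(c)  # 8
```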
As described in step S403, the density is calculated according to the formula. Because the formula takes into account both the relationships between the dimension values and the range c of the dimension values, the calculated density is relatively accurate, and results derived from it are closer to the true values.
As described in steps S404-S405, according to the formula
Figure BDA0002844065020000083
the dispersion of each dimension value is calculated. The dispersion reflects the fluctuation of a single data point, so the dimension grade gradient is calculated from the average dispersion of the dimension values. Here f(x) is a preset functional relationship between the average dispersion and the dimension grade gradient, which may be linear, nonlinear, and so on. A dimension grade gradient adapted to the documents in the target dimension is thereby obtained.
In an embodiment, the step S1 of obtaining the dimension values of the documents to be classified respectively corresponding to the dimensions includes:
S101: performing word segmentation processing on each document to be classified through a regular expression and a word segmentation tool to obtain a plurality of corresponding words;
S102: extracting entity nouns in the words according to a semantic recognition technology;
S103: clustering the extracted entity nouns to obtain entity nouns corresponding to all dimensions respectively;
S104: calculating the dimension value of the document to be classified in each dimension based on the entity nouns corresponding to each dimension respectively.
As described in steps S101-S104, the dimension values of each document to be classified are acquired. Specifically, word segmentation is performed with a regular expression (a character sequence that defines a search pattern) and a word segmentation tool, which may be any one of jieba, SnowNLP, THULAC and NLPIR, so that each document to be classified is split into a plurality of words. Each word is then recognized with a semantic recognition technique, which may specifically be natural language processing (NLP), and the entity nouns are extracted and clustered. The clustering preferably uses the K-means or CLARANS clustering algorithm. The dimension corresponding to each cluster of entity nouns is then identified: the semantics of the entity nouns in each cluster are recognized and compared for similarity with the categories preset in the system, for example by cosine similarity. The dimension information contained in the document is obtained in this way, that is, the dimension values of the document to be classified in all dimensions.
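The last part of this pipeline, matching a cluster of entity nouns to a preset category by cosine similarity, can be sketched without any NLP library if the semantic vectors are taken as given. The vectors, the dimension names and the function names below are assumptions for illustration; the text does not say how the semantics are represented.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_dimension(cluster_vec: Sequence[float],
                      dimension_vecs: Dict[str, Sequence[float]]) -> str:
    """Return the preset dimension whose vector is most similar to the cluster vector."""
    return max(dimension_vecs,
               key=lambda name: cosine_similarity(cluster_vec, dimension_vecs[name]))


# Hypothetical 2-D semantic vectors for two preset dimensions.
presets = {"emotion": [1.0, 0.0], "logic": [0.0, 1.0]}
print(closest_dimension([0.9, 0.1], presets))  # emotion
```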
In one embodiment, the step S2 of classifying the documents to be classified in each dimension according to the current classification rule and the corresponding dimension value of each dimension includes:
S201: obtaining the dimension value of each document to be classified in each category according to the corresponding relation between the entity nouns and the dimension values after the clustering;
S202: classifying according to the dimension value of the document to be classified in each dimension and the current classification rule.
As described in steps S201-S202, the documents to be classified are classified in each dimension. The correspondence between entity nouns and dimension values is preset. Because an entity noun may have similar words, the entity nouns are first preprocessed, that is, similar words are converted into the words stored in the database, and the words are then converted into dimension values according to the preset correspondence so that they are easy to distinguish. The closer the semantics of two entity nouns, the closer their dimension values. Classification is then performed according to the dimension values, which realizes automatic classification of each document in each dimension.
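A rough sketch of the preprocessing and lookup in step S201. The synonym table, the noun-to-value table and their contents are invented for the example, and averaging the values of a document's nouns is an assumption: the text does not state how several nouns in one dimension are combined into a single dimension value.

```python
from typing import Dict, List, Optional

# Hypothetical preset tables; the real correspondence is not given in the text.
SYNONYMS: Dict[str, str] = {"joyful": "happy", "glad": "happy"}
NOUN_VALUES: Dict[str, float] = {"happy": 85, "sad": 20}


def dimension_value(nouns: List[str]) -> Optional[float]:
    """Map each noun to its canonical database word, look up its preset value,
    and average the values found (assumed aggregation). None if nothing matches."""
    values = []
    for noun in nouns:
        canonical = SYNONYMS.get(noun, noun)
        if canonical in NOUN_VALUES:
            values.append(NOUN_VALUES[canonical])
    return sum(values) / len(values) if values else None


print(dimension_value(["joyful", "sad"]))  # 52.5
```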
In an embodiment, the step S3 of detecting whether there is a target dimension category in the classification result of each dimension includes:
S301: acquiring the number of documents corresponding to each dimension grade in a first dimension;
S302: comparing the number of the documents in each dimension grade with the total number of the documents in the first dimension to obtain the document number proportion corresponding to each dimension grade;
S303: judging whether each document number proportion exceeds the preset document number proportion threshold value, the category whose document number proportion exceeds the preset document number proportion threshold value being the target dimension category.
As described in steps S301-S303, it is detected whether the classification in each dimension is reasonable. The number of documents corresponding to each dimension level in the first dimension and the total number of documents in the first dimension are obtained, and the document number proportion of each dimension level is calculated from them. It should be understood that the document number proportion is not computed over the documents to be classified alone: it is the proportion over all documents in the dimension after the documents to be classified have been classified, the documents to be classified being only a part of that data. The category whose document number proportion exceeds the preset document number proportion threshold is recorded as the target dimension category, and the dimension in which it lies is refined. Dimensions without a target dimension category need no further processing; their document number proportions are calculated only to determine which dimensions need to be refined.
In an embodiment, the step S4 of reformulating the classification rule for the corresponding target dimension according to the target dimension category to obtain a reformulated dimension level gradient includes:
S411: recording the dimension grade of which the document number proportion exceeds a document number proportion threshold value corresponding to the dimension grade as a first dimension grade;
S412: acquiring a first dimension value corresponding to each document in the first dimension level;
S413: calculating the variance of all first dimension values in the first dimension grade;
S414: setting a plurality of corresponding sub-dimension grades for the first dimension grade according to the variance, thereby obtaining the refined dimension grade gradient; wherein the range of each of the sub-dimension levels is within the range of the first dimension level.
As described in steps S411 to S414 above, the dimension level gradient is obtained. First, the first dimension value of each document in the first dimension level is obtained, and a plurality of sub-dimension levels are set for the first dimension level according to the variance of all the first dimension values. The correspondence between the variance and the number of sub-dimension levels is set in advance, so the number of sub-dimension levels can be obtained from the calculated variance. For example, if the variance is 0.1, the corresponding number of sub-dimension levels is 5, and the ranges of the 5 sub-dimension levels may divide the range of the first dimension level evenly: if the range of the first dimension level is 0 to 10, the sub-dimension levels are 0 to 2, 2 to 4, 4 to 6, 6 to 8, and 8 to 10. The classification rule is thus refined on the basis of the current classification rule, the classification result is more detailed, and the result is better suited to the existing technical scheme.
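Steps S411 to S414 can be sketched as follows. The preset table `VARIANCE_TO_COUNT` is hypothetical; the text only states that a correspondence between variance and the number of sub-dimension levels is set in advance and that a variance of 0.1 corresponds to 5 sub-levels. The use of the population variance and of equal-width sub-levels are likewise assumptions of this sketch.

```python
from statistics import pvariance

# Hypothetical preset correspondence: (upper bound on variance, sub-level count).
# Only the "0.1 -> 5" pairing comes from the example in the description.
VARIANCE_TO_COUNT = [(0.05, 2), (0.5, 5), (float("inf"), 10)]


def sub_level_count(variance):
    """Look up the preset number of sub-levels for a given variance."""
    for upper, count in VARIANCE_TO_COUNT:
        if variance <= upper:
            return count
    raise ValueError("variance must be a non-negative number")


def split_level(low, high, values):
    """Split the level [low, high] into equal-width sub-levels.

    values: first dimension values of the documents in the level.
    The sub-levels always stay inside the range of the original level.
    """
    count = sub_level_count(pvariance(values))
    width = (high - low) / count
    edges = [low + i * width for i in range(count)] + [high]
    return list(zip(edges[:-1], edges[1:]))


sub_levels = split_level(0, 10, [4.7, 5.0, 5.3])
print(sub_levels)
```

With the hypothetical table above, the three sample values have a variance of about 0.06, which maps to 5 sub-levels covering 0 to 2, 2 to 4, 4 to 6, 6 to 8, and 8 to 10, matching the example in the description.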
Referring to fig. 2, the present invention further provides an apparatus for updating document classification rules, including:
the dimension value module 10 is configured to obtain a dimension value corresponding to each dimension of the plurality of documents to be classified;
a to-be-classified document obtaining module 20, configured to classify each to-be-classified document in each dimension according to the current classification rule of each dimension and the corresponding dimension value;
a target dimension obtaining module 30, configured to detect whether a target dimension category exists in the classification result of each dimension; wherein the target dimension category is a category whose document number proportion reaches a preset document number proportion threshold;
the target classification acquisition module 40 is configured to reformulate a classification rule for a target dimension corresponding to the target dimension category according to the target dimension category, so as to obtain a reformulated dimension grade gradient;
and the target classification rule module 50 is configured to update the dimension level gradient to the current classification rule corresponding to the target dimension, so as to obtain a new classification rule.
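The update performed by the target classification rule module 50 can be illustrated with a short sketch. The representation of a classification rule as a mapping from dimension name to an ordered list of `(low, high)` ranges, and the function name `update_rule`, are assumptions chosen for the example; the application does not specify a data structure.

```python
def update_rule(rule, dimension, target_level, sub_levels):
    """Return a new rule in which target_level of dimension is replaced by sub_levels.

    rule: {dimension name: ordered list of (low, high) level ranges}
    target_level: the (low, high) range that was found to be over-populated
    sub_levels: the refined ranges, each lying within target_level
    """
    low, high = target_level
    if any(lo < low or hi > high for lo, hi in sub_levels):
        raise ValueError("sub-levels must lie within the target level")
    # Copy so that the current rule is left untouched.
    new_rule = {dim: list(levels) for dim, levels in rule.items()}
    levels = new_rule[dimension]
    index = levels.index(target_level)
    levels[index:index + 1] = sub_levels
    return new_rule


current_rule = {"amount": [(0, 10), (10, 20)]}
new_rule = update_rule(current_rule, "amount", (0, 10), [(0, 5), (5, 10)])
print(new_rule)
```

Returning a copy rather than modifying the rule in place keeps the previous rule available, which is convenient if the refinement needs to be compared with or rolled back to the earlier version.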
In one embodiment, the target classification obtaining module 40 includes:
the dimension set establishing sub-module is used for acquiring the dimension values corresponding to the documents in the target dimension and establishing the dimension sets corresponding to the dimensions according to the dimension values of the documents in the dimensions;
the distance calculation submodule is used for calculating the distance between every two dimension values in each dimension set;
a density calculation submodule for calculating a density according to a formula
Figure BDA0002844065020000111
calculating a density for each dimension value from the calculated distances; where ρ(j) represents the density of the j-th dimension value, c = max[d(j, i)], d(j, i) represents the distance between the j-th dimension value and the i-th dimension value, and max[d(j, i)] represents the distance between the maximum value and the minimum value among the dimension values;
a dispersion calculation submodule for calculating dispersion according to a formula
Figure BDA0002844065020000121
Calculating the dispersion of each dimension value; wherein LOF (j) represents the dispersion of the j-th dimension value;
a dimension grade gradient calculation submodule for calculating a dimension grade gradient according to a formula
Figure BDA0002844065020000122
calculating the dimension level gradient, wherein f(x) represents a relation function between the average dispersion of the dimension values and the dimension level gradient.
In one embodiment, the dimension value module 10 includes:
the word segmentation processing sub-module is used for respectively carrying out word segmentation processing on the documents to be classified through a regular expression and a word segmentation tool to obtain a plurality of corresponding words;
the semantic recognition submodule is used for extracting entity nouns in the words according to a semantic recognition technology;
the entity noun extraction submodule is used for clustering the extracted entity nouns to obtain entity nouns corresponding to all dimensions respectively;
and the dimension value calculation submodule is used for calculating the dimension value of the document to be classified in each dimension based on the entity nouns corresponding to each dimension.
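The four submodules above can be sketched end to end as follows. Everything concrete here is a stand-in: the regular expression replaces the unspecified word segmentation tool, the hypothetical table `NOUN_TO_DIMENSION` stands for the output of the semantic recognition and clustering steps, and taking the dimension value to be the number of matching entity nouns is an assumption, since the text does not state how the value is calculated.

```python
import re
from collections import Counter

# Hypothetical output of the clustering step: entity noun -> dimension.
NOUN_TO_DIMENSION = {
    "loan": "finance",
    "interest": "finance",
    "contract": "legal",
}


def tokenize(text):
    """Stand-in word segmentation: lower-case alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def dimension_values(text):
    """Count, per dimension, the entity nouns of that dimension found in text."""
    counts = Counter()
    for word in tokenize(text):
        dimension = NOUN_TO_DIMENSION.get(word)
        if dimension is not None:
            counts[dimension] += 1
    return dict(counts)


values = dimension_values("The loan contract sets the interest on the loan.")
print(values)
```

For Chinese documents the tokenizer would need to be replaced by a real word segmentation tool, and the noun table would be produced by the clustering step rather than written by hand.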
In one embodiment, the to-be-classified document obtaining module 20 includes:
the clustering processing submodule is used for obtaining the dimension value of each document to be classified in each category according to the corresponding relation between the entity nouns and the dimension values after clustering processing;
and the rule classification submodule is used for classifying the documents to be classified according to the current classification rule according to the dimension values of the documents in all dimensions.
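Classifying a document within one dimension amounts to finding the level whose range contains the document's dimension value. A minimal sketch, assuming the current classification rule is stored as a sorted list of level boundaries and that a value lying on a shared boundary belongs to the higher level (the text gives ranges such as 0 to 2 and 2 to 4 without saying which level owns the boundary):

```python
from bisect import bisect_right


def classify(value, boundaries):
    """Return the index of the level containing value.

    boundaries: sorted level edges, e.g. [0, 2, 4] describes the levels
    0 to 2 and 2 to 4. The top edge is assigned to the last level.
    """
    if not boundaries[0] <= value <= boundaries[-1]:
        raise ValueError("value lies outside the range covered by the rule")
    index = bisect_right(boundaries, value) - 1
    return min(index, len(boundaries) - 2)


boundaries = [0, 2, 4, 6, 8, 10]
level_of_3 = classify(3, boundaries)
print(level_of_3)
```

Binary search keeps the lookup logarithmic in the number of levels, which matters little for a handful of levels but stays cheap as levels are repeatedly refined.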
In one embodiment, the target dimension obtaining module 30 includes:
the dimension grade acquisition submodule is used for acquiring the number of the documents corresponding to each dimension grade in the first dimension;
the document comparison submodule is used for comparing the number of the documents in each dimension grade with the total number of the documents in the first dimension to obtain the document number proportion corresponding to each dimension grade;
the preset submodule is used for judging whether the number proportion of the documents exceeds the preset document number proportion threshold value or not; and the category corresponding to the document number proportion exceeding the preset document number proportion threshold value is the target dimension category.
In one embodiment, the target classification obtaining module 40 includes:
the recording submodule is used for recording the dimension level whose document number proportion exceeds the document number proportion threshold corresponding to that dimension level as a first dimension level;
the dimension value acquisition submodule is used for acquiring a first dimension value corresponding to each document in the first dimension level;
the variance submodule is used for calculating the variance of all the first dimension values in the first dimension grade;
the level setting submodule is used for setting a plurality of corresponding sub-dimension levels for the first dimension level according to the variance so as to obtain the refined dimension level gradient; wherein the range of each of the sub-dimension levels is within the range of the first dimension level.
The invention has the following beneficial effects: by detecting the dimensions of the documents to be classified and classifying the documents based on the original classification rule, the dimension levels whose document number proportion is unreasonable after classification are refined, so that documents can be better classified according to the types of the existing documents.
Referring to fig. 3, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing various documents and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. When executed by the processor, the computer program can implement the method for updating the document classification rule according to any one of the above embodiments.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.
The embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for updating the document classification rule according to any of the embodiments above may be implemented.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored on a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (10)

1. A method for updating document classification rules, comprising:
obtaining dimension values of a plurality of documents to be classified, which correspond to each dimension respectively;
classifying the documents to be classified in each dimension according to the current classification rule of each dimension and the corresponding dimension value;
detecting whether a target dimension category exists in the classification result of each dimension; wherein the target dimension category is a category whose document number proportion reaches a preset document number proportion threshold;
reformulating a classification rule for the corresponding target dimension according to the target dimension category to obtain a reformulated dimension grade gradient;
and updating the dimension grade gradient to the current classification rule corresponding to the target dimension to obtain a new classification rule.
2. The method for updating document classification rules according to claim 1, wherein the step of reformulating the classification rules for the corresponding target dimensions according to the target dimension categories to obtain reformulated dimension level gradients comprises:
obtaining the dimension value corresponding to each document in the target dimension, and establishing a dimension set corresponding to each dimension according to the dimension value of the document in each dimension;
calculating the distance between every two dimension values in each dimension set;
according to the formula
Figure FDA0002844065010000011
calculating a density for each dimension value from the calculated distances; where ρ(j) represents the density of the j-th dimension value, c = max[d(j, i)], d(j, i) represents the distance between the j-th dimension value and the i-th dimension value, and max[d(j, i)] represents the distance between the maximum value and the minimum value among the dimension values;
according to the formula
Figure FDA0002844065010000012
Calculating the dispersion of each dimension value; wherein LOF (j) represents the dispersion of the j-th dimension value;
according to the formula
Figure FDA0002844065010000021
Calculating the dimension level gradient, wherein f (x) represents a relation function of the average dispersion of each dimension value and the dimension level gradient.
3. The method for updating the document classification rule according to claim 1, wherein the step of obtaining the dimension values of the documents to be classified respectively corresponding to the dimensions comprises:
performing word segmentation processing on each document to be classified through a regular expression and a word segmentation tool to obtain a plurality of corresponding words;
extracting entity nouns in the words according to a semantic recognition technology;
clustering the extracted entity nouns to obtain entity nouns corresponding to all dimensions respectively;
and calculating the dimension value of the document to be classified in each dimension based on the entity nouns corresponding to each dimension respectively.
4. The method for updating the document classification rule according to claim 3, wherein the step of classifying each document to be classified in each dimension according to the current classification rule and the corresponding dimension value of each dimension comprises:
obtaining the dimension value of each document to be classified in each category according to the corresponding relation between the entity nouns and the dimension values after the clustering;
and classifying according to the dimension value of the document to be classified in each dimension and the current classification rule.
5. The method for updating document classification rules according to claim 1, wherein the step of detecting whether the classification result of each dimension has the target dimension category comprises:
acquiring the number of documents corresponding to each dimension grade in a first dimension;
comparing the number of the documents in each dimension grade with the total number of the documents in the first dimension to obtain the document number proportion corresponding to each dimension grade;
judging whether the number proportion of each document exceeds the preset document number proportion threshold value or not; and the category corresponding to the document number proportion exceeding the preset document number proportion threshold value is the target dimension category.
6. The method for updating document classification rules according to claim 1, wherein the step of reformulating the classification rules for the corresponding target dimensions according to the target dimension categories to obtain reformulated dimension level gradients comprises:
recording the dimension grade of which the document number proportion exceeds a document number proportion threshold value corresponding to the dimension grade as a first dimension grade;
acquiring a first dimension value corresponding to each document in the first dimension level;
calculating the variance of all first dimension values in the first dimension grade;
setting a plurality of corresponding sub-dimension grades for the first dimension grade according to the variance, thereby obtaining the refined dimension grade gradient; wherein the range of each of the sub-dimension levels is within the range of the first dimension level.
7. An apparatus for updating document classification rules, comprising:
the dimension value module is used for acquiring the dimension values of the documents to be classified, which correspond to the dimensions respectively;
the document to be classified acquisition module is used for classifying the documents to be classified in each dimension according to the current classification rule of each dimension and the corresponding dimension value;
the target dimension acquisition module is used for detecting whether a target dimension category exists in the classification result of each dimension; wherein the target dimension category is a category whose document number proportion reaches a preset document number proportion threshold;
the target classification acquisition module is used for re-formulating a classification rule for the corresponding target dimension according to the target dimension category to obtain a re-formulated dimension grade gradient;
and the target classification rule module is used for updating the dimension grade gradient to the current classification rule corresponding to the target dimension to obtain a new classification rule.
8. The apparatus for updating document classification rules according to claim 7, wherein the target classification obtaining module includes:
the dimension set establishing sub-module is used for acquiring the dimension values corresponding to the documents in the target dimension and establishing the dimension sets corresponding to the dimensions according to the dimension values of the documents in the dimensions;
the distance calculation submodule is used for calculating the distance between every two dimension values in each dimension set;
a density calculation submodule for calculating a density according to a formula
Figure FDA0002844065010000041
calculating a density for each dimension value from the calculated distances; where ρ(j) represents the density of the j-th dimension value, c = max[d(j, i)], d(j, i) represents the distance between the j-th dimension value and the i-th dimension value, and max[d(j, i)] represents the distance between the maximum value and the minimum value among the dimension values;
a dispersion calculation submodule for calculating dispersion according to a formula
Figure FDA0002844065010000042
Calculating the dispersion of each dimension value; wherein LOF (j) represents the dispersion of the j-th dimension value;
a dimension level gradient calculation submodule for calculating a dimension level gradient according to a formula
Figure FDA0002844065010000043
Calculating the dimension level gradient, wherein f (x) represents a relation function of the average dispersion of each dimension value and the dimension level gradient.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202011502638.7A 2020-12-17 2020-12-17 Document classification rule updating method, device, equipment and storage medium Pending CN112487194A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011502638.7A CN112487194A (en) 2020-12-17 2020-12-17 Document classification rule updating method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112487194A true CN112487194A (en) 2021-03-12

Family

ID=74914801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011502638.7A Pending CN112487194A (en) 2020-12-17 2020-12-17 Document classification rule updating method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112487194A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6137911A (en) * 1997-06-16 2000-10-24 The Dialog Corporation Plc Test classification system and method
US20050044487A1 (en) * 2003-08-21 2005-02-24 Apple Computer, Inc. Method and apparatus for automatic file clustering into a data-driven, user-specific taxonomy
US20160092549A1 (en) * 2014-09-26 2016-03-31 International Business Machines Corporation Information Handling System and Computer Program Product for Deducing Entity Relationships Across Corpora Using Cluster Based Dictionary Vocabulary Lexicon
CN105786898A (en) * 2014-12-24 2016-07-20 中国移动通信集团公司 Domain ontology construction method and apparatus
CN105956031A (en) * 2016-04-25 2016-09-21 深圳市永兴元科技有限公司 Text classification method and apparatus
CN106126734A (en) * 2016-07-04 2016-11-16 北京奇艺世纪科技有限公司 The sorting technique of document and device
CN107943984A (en) * 2017-11-30 2018-04-20 广东欧珀移动通信有限公司 Image processing method, device, computer equipment and computer-readable recording medium
CN109101633A (en) * 2018-08-15 2018-12-28 北京神州泰岳软件股份有限公司 A kind of hierarchy clustering method and device
CN111475647A (en) * 2020-03-19 2020-07-31 平安国际智慧城市科技股份有限公司 Document processing method and device and server

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王志华: "面向实体发现的网络信息聚类技术研究与实现", 《中国优秀硕士学位论文全文数据库(信息科技辑)》, pages 138 - 6340 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination