CN106897454B - File classification method and device - Google Patents

File classification method and device

Info

Publication number
CN106897454B
Authority
CN
China
Prior art keywords
picture
classification
file
feature vector
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710138149.XA
Other languages
Chinese (zh)
Other versions
CN106897454A (en)
Inventor
赵毅强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing time Ltd.
Original Assignee
Beijing Time Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Time Co ltd
Publication of CN106897454A
Application granted
Publication of CN106897454B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Abstract

The invention discloses a file classification method and device, relating to the technical field of file classification. The method comprises: acquiring picture information contained in a file; determining a picture classification result corresponding to the picture information through a preset picture classification model; generating a file feature vector corresponding to the file according to the picture classification result; and determining a file classification result corresponding to the file feature vector through a preset file classification model. The method and device thereby solve the prior-art problem that news cannot be classified according to its picture content, and allow news to be classified more accurately and precisely by considering its text content and picture content together.

Description

File classification method and device
Technical Field
The invention relates to the technical field of file classification, in particular to a file classification method and device.
Background
News is the general term for information transmitted through media such as newspapers, radio, television, and the internet. It mainly reports facts that have newly occurred or newly changed, so timeliness is very important. In daily life, news needs to be classified so that readers can quickly find the items that interest them. Current classification generally relies on simple text screening, or on further screening by key information such as news source and language, after which the news is classified according to that information. This classification approach is also widely applied to files other than news.
However, in the process of implementing the present invention, the inventors found that at least the following problems exist in the prior art: the prior art can only classify according to the text content of files such as news. As society develops, news contains more and more picture content. On self-media platforms such as Weibo and WeChat, many news items are displayed directly as pictures (for example, an entire text article is converted into a picture and attached to a Weibo post or a WeChat Moments post), or contain two-dimensional codes and the like. Existing news classification technology cannot recognize these pictures and cannot classify news according to picture content, which reduces the accuracy of news classification. The existing file classification methods therefore suffer from defects such as a single classification basis and a narrow application range.
Disclosure of Invention
In view of the above, the present invention has been made to provide a file classifying method and a corresponding apparatus that overcome or at least partially solve the above problems.
According to an aspect of the present invention, there is provided a file classification method, including: acquiring picture information contained in a file; determining a picture classification result corresponding to the picture information through a preset picture classification model; generating a file feature vector corresponding to the file according to the picture classification result; and determining a file classification result corresponding to the file feature vector through a preset file classification model.
According to another aspect of the present invention, there is provided a file classification apparatus, including: an acquisition module, configured to acquire picture information contained in a file; a picture classification module, configured to determine a picture classification result corresponding to the picture information through a preset picture classification model; a feature vector module, configured to generate a file feature vector corresponding to the file according to the picture classification result; and a file classification module, configured to determine a file classification result corresponding to the file feature vector through a preset file classification model.
According to the file classification method and device provided by the invention, the picture classification result corresponding to the picture information contained in the file can be determined through the preset picture classification model, and the file is classified according to the picture classification result, so that the problem that the classification result is inaccurate because the existing file classification mode can only classify according to a single text characteristic is solved, the accuracy of the classification result is further improved, and the application range of the scheme is widened.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a file classification method according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating a file classification method according to a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a file classification apparatus according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a file classification apparatus according to a fourth embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The invention provides a file classification method and device, which can at least solve the technical problems of inaccurate classification result and narrow application range of a file classification mode in the prior art.
Embodiment One
Fig. 1 shows a flowchart of a file classification method provided in an embodiment of the present invention, where the method includes:
Step S110: acquiring picture information contained in the file.
The way the picture information is acquired can be chosen according to how the picture is embedded in the file. The invention does not limit the acquisition mode, and those skilled in the art may implement it in various ways. For example, if the picture is embedded in the file as a thumbnail, the complete picture corresponding to the thumbnail may be obtained first, and the corresponding picture information then determined from the complete picture. As another example, if the picture is embedded in the file as a hyperlink, the original picture may be obtained by following the hyperlink, and the corresponding picture information then determined from the original picture.
In addition, the acquired picture content can be used directly as the picture information for subsequent processing. Alternatively, a preset information extraction operation can first be performed on the acquired picture content, and the extracted important information used as the picture information. On the one hand, this reduces the workload of subsequent processing and improves processing speed; on the other hand, it filters out irrelevant information in the picture content, so that the subsequent classification operation is more targeted. The invention does not limit the specific implementation of the information extraction operation or the specific representation of the picture information.
Step S120: determining a picture classification result corresponding to the picture information through a preset picture classification model.
In the embodiment of the present invention, the preset picture classification model may be obtained through various machine learning algorithms, such as a deep learning algorithm, or through a conventional algorithm.
In addition, to facilitate statistics on and identification of the picture classification result, the result may be represented in various forms, such as a picture feature vector or a picture feature matrix. The invention does not limit the specific representation of the picture classification result.
Step S130: generating a file feature vector corresponding to the file according to the picture classification result.
When the file to be classified contains only picture content, the picture feature vector obtained in step S120 can be used directly as the file feature vector. When the file contains both picture content and text content, a text feature vector may be generated from the text content through a vector space model; the text feature vector and the picture feature vector obtained in step S120 are then combined according to a preset rule, and the file feature vector is generated from the combination result. This way of generating the text feature vector is only an example; in practice, those skilled in the art may choose any suitable method, and the invention is not limited in this respect.
Step S140: determining a file classification result corresponding to the file feature vector through a preset file classification model.
The file classification model is determined by a preset machine learning algorithm, and the machine learning algorithm can be a linear classification algorithm, a neural network classification algorithm or a deep learning algorithm. The present invention is not limited to this, and those skilled in the art can flexibly select the machine learning algorithm according to the actual situation.
Specifically, the file feature vector obtained in step S130 is input into a file classification model, and the file classification model obtains a file classification result corresponding to the file feature vector according to a corresponding rule and algorithm.
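The flow of steps S120 to S140 can be sketched as a small pipeline. This is only an illustration under stated assumptions: `classify_file`, `toy_picture_model`, and `toy_file_model` are hypothetical names, the two models are stand-in callables rather than trained models, and the combination rule (appending the picture vector to the text vector) is just one of the options the description allows.

```python
def classify_file(picture_info, text_vector, picture_model, file_model):
    """Run steps S120-S140 on picture information already acquired in S110."""
    # Step S120: picture classification result, here a probability vector.
    picture_vector = picture_model(picture_info)
    # Step S130: file feature vector. With an empty text vector (picture-only
    # file) the file vector is simply the picture vector.
    file_vector = list(text_vector) + list(picture_vector)
    # Step S140: file classification result.
    return file_model(file_vector)


# Hypothetical stand-in models, for illustration only.
def toy_picture_model(picture_info):
    # Pretend bright pictures are "sports" and dark ones are "finance".
    brightness = sum(picture_info) / len(picture_info)
    p_sports = min(max(brightness / 255.0, 0.0), 1.0)
    return [p_sports, 1.0 - p_sports]


def toy_file_model(file_vector):
    # Uses only the last two dimensions, i.e. the picture part of the vector.
    return "sports" if file_vector[-2] >= file_vector[-1] else "finance"


print(classify_file([250, 240, 255], [0.2, 0.8], toy_picture_model, toy_file_model))
```

Passing the models in as callables keeps the sketch independent of any particular machine learning library.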
Therefore, the file classification method provided by the invention can identify both the text content and the picture content in a file and classify the file through the file classification model according to both. This solves the prior-art problem that news cannot be classified according to picture content, and allows news to be classified more accurately and precisely by considering its text content and picture content together.
Embodiment Two
Fig. 2 shows a flowchart of a file classification method provided in the second embodiment of the present invention, where the method includes:
Step S210: performing machine learning on a pre-acquired picture training set through a machine learning algorithm, and generating the preset picture classification model according to the learning result.
Specifically, a targeted picture training set is prepared according to the categories required for picture classification, and machine learning is then performed on the training set through a machine learning algorithm, so that a picture classification model is obtained from the learning result. This process is called the training process of the picture classification model. After training is complete, a picture to be classified is input into the picture classification model to obtain its predicted classification result. The machine learning algorithm may be a deep learning algorithm or a neural network algorithm. In practice, to improve the accuracy of the picture classification model, each picture classification result can be added to the picture training set, so that the model is adjusted dynamically and its classification accuracy improves continuously.
Step S220: acquiring picture information contained in the file.
To reduce file size, the pictures included in a file such as a news item are usually thumbnails compressed from the original pictures, so some information is lost. To ensure the accuracy of picture classification, the related information of the original picture needs to be acquired. Specifically, the file content is first parsed to obtain the target thumbnail; the target thumbnail is then parsed to obtain the hyperlink associated with the original picture, such as a URL (uniform resource locator); finally, the original picture is located through the hyperlink, downloaded, and parsed to obtain the complete picture information.
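As a hedged illustration of the thumbnail-to-original lookup, the sketch below assumes the file is an HTML page in which each thumbnail is an `<img>` element wrapped in an `<a>` element whose `href` points at the original picture. That markup convention, and the names `ThumbnailLinkParser` and `find_original_picture_urls`, are assumptions for the example and are not stated in the source. Downloading the picture is omitted.

```python
from html.parser import HTMLParser


class ThumbnailLinkParser(HTMLParser):
    """Collects (original_url, thumbnail_url) pairs from <a href><img src></a> markup."""

    def __init__(self):
        super().__init__()
        self._href_stack = []  # hrefs of the <a> elements currently open
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self._href_stack.append(attributes.get("href"))
        elif tag == "img":
            thumbnail = attributes.get("src")
            if not thumbnail:
                return
            # Use the enclosing link as the original picture; if there is no
            # enclosing link, fall back to the thumbnail itself.
            enclosing = self._href_stack[-1] if self._href_stack else None
            self.pairs.append((enclosing or thumbnail, thumbnail))

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            self._href_stack.pop()


def find_original_picture_urls(html):
    """Return the URL of the original picture for every thumbnail in the page."""
    parser = ThumbnailLinkParser()
    parser.feed(html)
    return [original for original, _ in parser.pairs]


page = (
    '<p><a href="http://example.com/full.jpg">'
    '<img src="http://example.com/thumb.jpg"></a>'
    '<img src="http://example.com/plain.png"></p>'
)
print(find_original_picture_urls(page))
```

The returned URLs would then be fetched and decoded by whatever download and image-parsing facilities the implementation uses.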
When the file to be classified includes a dynamic picture or a video, acquiring the picture information in step S220 specifically includes: acquiring the dynamic picture or video contained in the file, extracting at least one picture frame from it, and determining the picture information corresponding to each picture frame. To ensure that the extracted frames are representative and easy to recognize, key frames may be extracted from the dynamic picture or video according to a certain rule, and the picture information corresponding to each key frame then determined. Because a key frame contains the complete data of the current picture, picture information acquired from key frames is more complete and accurate. The extraction rule may be to extract one key frame every fixed number of frames, or to compare adjacent frames and extract a key frame whenever their difference reaches or exceeds a preset threshold. The invention does not limit the specific key frame extraction rule, and those skilled in the art may set it according to the actual situation.
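The two extraction rules can be sketched as follows. This is a minimal illustration, assuming each frame is already decoded into a flat list of grayscale pixel values of equal length; the mean absolute difference used as the difference measure, and the choice to always keep the first frame, are assumptions rather than details from the source.

```python
def frame_difference(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def keyframes_fixed_interval(frames, interval):
    """Rule 1: indices of one key frame every `interval` frames."""
    return list(range(0, len(frames), interval))


def keyframes_by_difference(frames, threshold):
    """Rule 2: index i is a key frame when frame i differs from frame i-1
    by at least `threshold`. The first frame is always kept (assumption)."""
    if not frames:
        return []
    selected = [0]
    for i in range(1, len(frames)):
        if frame_difference(frames[i - 1], frames[i]) >= threshold:
            selected.append(i)
    return selected


frames = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],      # almost identical to the previous frame
    [10, 10, 10, 10],  # scene change
    [10, 10, 10, 10],
    [0, 0, 0, 0],      # scene change
]
print(keyframes_fixed_interval(frames, 2))
print(keyframes_by_difference(frames, 5))
```

Each selected index would then be passed on as one picture for step S230.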
In addition, in this embodiment, after the picture content is acquired, a preset information extraction operation is performed on it, and the extracted important information is used as the picture information for subsequent processing. Specifically, in the process of implementing the invention, the inventor found at least the following technical obstacle to using picture content as a basis for file classification: although picture coding can compress pictures, whatever coding method is adopted, the data volume of picture content is far larger than that of text content and often reaches tens or even hundreds of megabytes, which makes pictures difficult to analyze and process. Accordingly, in this embodiment, the information extraction operation can filter out irrelevant information such as the background of the picture. It can also extract the important information in the picture, for example by taking an area with significant pixel change as a key area, or by identifying the key area with various picture recognition algorithms. The information extraction operation therefore greatly reduces the workload of subsequent processing and improves processing speed, and it makes the subsequent classification operation more targeted and more accurate.
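One possible reading of "an area with significant pixel change" is sketched below: mark every pixel whose value differs from its right or lower neighbour by at least a threshold, and crop the picture to the bounding box of the marked pixels. The gradient test, the bounding-box choice, and the function names are all assumptions made for illustration; the source does not specify the extraction algorithm.

```python
def key_region(image, threshold):
    """Bounding box (top, left, bottom, right), inclusive, of pixels with a
    strong change relative to the right or lower neighbour; None if no pixel
    qualifies. `image` is a list of rows of grayscale values."""
    rows, cols = len(image), len(image[0])
    hits = []
    for r in range(rows):
        for c in range(cols):
            right = abs(image[r][c] - image[r][c + 1]) if c + 1 < cols else 0
            down = abs(image[r][c] - image[r + 1][c]) if r + 1 < rows else 0
            if max(right, down) >= threshold:
                hits.append((r, c))
    if not hits:
        return None
    hit_rows = [r for r, _ in hits]
    hit_cols = [c for _, c in hits]
    return (min(hit_rows), min(hit_cols), max(hit_rows), max(hit_cols))


def crop(image, box):
    """Keep only the pixels inside the inclusive bounding box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 100, 100, 0],
    [0, 0, 100, 100, 0],
    [0, 0, 0, 0, 0],
]
box = key_region(image, 50)
print(box, crop(image, box))
```

Cropping discards the uniform background, so later stages handle fewer pixels.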
Step S230: determining a picture classification result corresponding to the picture information through a preset picture classification model.
In the embodiment of the invention, the preset picture classification model is preferably obtained through a deep learning algorithm. Because deep learning algorithms have a strong capacity for autonomous learning, picture classification is more precise, which in turn improves the precision of the subsequent file classification.
Because the picture information is different from the text information, in order to facilitate statistics and identification of the obtained picture classification result, in the embodiment of the invention, the picture classification result is represented by the picture feature vector. The picture feature vector is a probability vector for representing the picture classification, and the probability vector is a multi-dimensional feature vector, each dimension of which represents a picture category. There are many expression rules for the feature vector, and the present invention is not limited to this, and those skilled in the art can flexibly set the expression rules according to specific situations. For convenience of understanding, the following representative methods for representing the feature vector of the picture are provided in the embodiments of the present invention.
In the first representation method, a picture to be classified may belong to several categories. Assume the total number of picture classifications corresponding to the picture classification model is N, where N is a natural number greater than 1. The probability that the picture information belongs to each of the N picture classifications is determined first, and an N-dimensional vector is then set as the feature vector of the picture, where the weight of each dimension is determined by the probability that the picture information belongs to the corresponding picture classification. For example, when the picture categories are airplane, car, person, and cat, N is 4, and airplane, car, person, and cat are defined as the 0th, 1st, 2nd, and 3rd dimensions of the picture feature vector. If the probabilities of the categories to which a certain picture belongs are 0.7 (that is, the picture is 70% likely to be an airplane), 0.2 (car), 0.03 (person), and 0.07 (cat), the picture feature vector representing the classification result of that picture is [0.7, 0.2, 0.03, 0.07]. In this representation, one picture can belong to several categories at once, and the probability for each category is given, so the picture category is reflected more accurately and more accurate information is provided for the subsequent file classification. This method is particularly suitable for application scenarios in which the picture classification result is not clear-cut.
In the second representation method, a picture to be classified is assigned to only one category, namely the category with the highest probability among the picture classifications. For example, when the picture categories are airplane, car, person, and cat, they are defined as the 0th, 1st, 2nd, and 3rd dimensions of the picture feature vector. If the probabilities of the categories to which a certain picture belongs are 0.7 (that is, the picture is 70% likely to be an airplane), 0.2 (car), 0.03 (person), and 0.07 (cat), the probability of airplane is the greatest, so the picture feature vector representing the classification result of the picture is [1, 0, 0, 0]. In this representation, one picture can belong to only one category, which is particularly suitable for application scenarios in which the picture classification result is clear-cut.
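The first two representations can be written down directly from the airplane, car, person, cat example. In this sketch the function names and the dictionary input format are illustrative assumptions; the category order and the numbers come from the text.

```python
CATEGORIES = ["airplane", "car", "person", "cat"]  # dimensions 0, 1, 2, 3


def probability_vector(probabilities, categories=CATEGORIES):
    """First representation: one dimension per category, weighted by probability."""
    return [probabilities.get(name, 0.0) for name in categories]


def one_hot_vector(vector):
    """Second representation: 1 for the most probable category, 0 elsewhere.
    Ties are broken in favour of the lowest dimension (assumption)."""
    best = max(range(len(vector)), key=vector.__getitem__)
    return [1 if i == best else 0 for i in range(len(vector))]


scores = {"airplane": 0.7, "car": 0.2, "person": 0.03, "cat": 0.07}
soft = probability_vector(scores)
print(soft)
print(one_hot_vector(soft))
```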
In the third representation method, assume the total number of picture classifications corresponding to the picture classification model is N, where N is a natural number greater than 2. The probability that the picture information belongs to each of the N picture classifications is determined, and M picture classifications are selected in order of probability from high to low as the picture classification result of the picture information, where M is a natural number less than N. A picture classification number is assigned in advance to each possible picture classification result; the number corresponding to the result for the picture information is then determined, and the corresponding picture feature vector is generated from that number. In a specific application, if the total number of classifications is very large, it is not always necessary to output predicted values for all of them. In that case a fixed number of categories is selected in order of predicted value from high to low, for example the 5 categories with the highest predicted values for each piece of picture information, and the picture classification result is determined by those 5 categories. To represent such a result, a unique picture classification number may be assigned in advance to each possible result, so that the result is represented by its number. When the total number of categories is fixed and the number of categories selected each time is also fixed, the total number of possible picture classification results is fixed as well (it equals the number of different combinations obtained when the fixed number of categories is taken from all the categories).
The picture classification result can thus be represented as a one-dimensional variable (whose value ranges from zero to one less than this number of combinations), which can serve as one dimension of the input to the subsequent classification (that is, one dimension of the file feature vector). Besides a one-dimensional variable, the picture classification result can also be represented as a multidimensional vector, for example by binary-coding the picture classification number. Suppose there are 4 output categories, numbered 0, 1, 2, and 3, and each prediction outputs two categories, so that there are 6 possibilities (0 and 1, 0 and 2, 0 and 3, 1 and 2, 1 and 3, 2 and 3). These can be represented by the binary numbers 000, 001, 010, 011, 100, and 101, each of which can itself be regarded as a three-dimensional vector whose dimensions take only the value 0 or 1. Such a three-dimensional vector can also form part of the input to the subsequent classification.
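The numbering scheme of the third representation can be reproduced with `itertools.combinations`, which happens to enumerate the pairs in the same order as the example (0 and 1, 0 and 2, ..., 2 and 3). The function names are illustrative assumptions, and looking the combination up in a list is chosen for clarity rather than efficiency.

```python
import math
from itertools import combinations


def top_m_categories(probabilities, m):
    """Indices of the m most probable categories, in ascending index order."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    return tuple(sorted(ranked[:m]))


def classification_number(selected, n):
    """Position of `selected` among all m-element combinations of n categories."""
    return list(combinations(range(n), len(selected))).index(tuple(selected))


def binary_vector(number, n, m):
    """Binary coding of the classification number as a vector of 0s and 1s."""
    total = math.comb(n, m)                    # number of possible results
    width = max(1, (total - 1).bit_length())   # bits needed for 0..total-1
    return [int(bit) for bit in format(number, f"0{width}b")]


probabilities = [0.1, 0.2, 0.3, 0.4]           # N = 4 categories
selected = top_m_categories(probabilities, 2)  # M = 2
number = classification_number(selected, 4)
print(selected, number, binary_vector(number, 4, 2))
```

With N = 4 and M = 2 there are 6 combinations, so the number lies between 0 and 5 and fits in three binary dimensions.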
The above-mentioned several representation methods may be applied alone or in combination, and the present invention is not limited thereto.
Step S240: acquiring text information contained in the file, and generating a text feature vector corresponding to the text information; generating a picture feature vector corresponding to the picture classification result; combining the text feature vector and the picture feature vector; and generating the file feature vector according to the combination result.
The text feature vector may be generated as follows: the text information is preprocessed, and a number of feature words are obtained from the preprocessing result; a weight is assigned to each feature word, and the text feature vector is generated from the feature words and their weights. Preprocessing can take many forms, for example converting Latin-script text to lowercase, filtering out junk and advertising information, and deleting stop words; these operations make the extraction of text feature words simpler and more reliable. The text feature vector may also be generated from word vectors, and a word vector corresponding to the text information may be generated with the word2vec tool. These ways of generating the text feature vector are only examples; in practice, those skilled in the art may choose any suitable method, and the invention is not limited in this respect.
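A minimal sketch of the feature-word approach follows. The stop-word list, the fixed vocabulary, and the use of relative term frequency as the weight are illustrative assumptions; the source only says that feature words are given weights, not how.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # illustrative only


def feature_words(text):
    """Preprocess: lowercase, split into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def text_feature_vector(text, vocabulary):
    """One dimension per vocabulary word, weighted by relative term frequency."""
    words = feature_words(text)
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero for empty text
    return [counts[word] / total for word in vocabulary]


vocabulary = ["plane", "car", "cat"]
print(text_feature_vector("The plane and THE car; the plane!", vocabulary))
```

Other weighting schemes, or word vectors as mentioned above, could replace the term-frequency weights without changing the rest of the pipeline.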
In addition, the text feature vector and the picture feature vector may be combined in various ways, which the invention does not limit. For example, the picture feature vector may be appended to the tail of the text feature vector, prepended to its head, or inserted at a specified position in its middle; each yields a combined file feature vector. The picture feature vector may also be combined after any necessary deletion or conversion. Those skilled in the art can select a suitable mode according to the characteristics of the file classification model.
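The three placement options reduce to a single list operation. The function name and the `position` parameter are illustrative assumptions.

```python
def combine_feature_vectors(text_vector, picture_vector, position=None):
    """Insert the picture vector into the text vector at `position`.

    position=None appends to the tail, position=0 prepends to the head,
    and any other index inserts in the middle.
    """
    if position is None:
        position = len(text_vector)
    return text_vector[:position] + picture_vector + text_vector[position:]


text_vector = [0.5, 0.25, 0.25]
picture_vector = [0.7, 0.3]
print(combine_feature_vectors(text_vector, picture_vector))               # tail
print(combine_feature_vectors(text_vector, picture_vector, position=0))   # head
print(combine_feature_vectors(text_vector, picture_vector, position=1))   # middle
```

Whichever position is chosen, it has to be the same at training time and at prediction time so that each dimension keeps its meaning.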
Step S250: determining a file classification result corresponding to the file feature vector through a preset file classification model.
The file classification model is determined by a preset machine learning algorithm, and the machine learning algorithm can be a linear classification algorithm, a neural network classification algorithm or a deep learning algorithm. The present invention is not limited to this, and those skilled in the art can flexibly select the machine learning algorithm according to the actual situation.
The machine learning generally comprises two parts of training and forecasting, wherein the training is to train a file classification model through a preset training set, so that the file classification model can have a function of identifying file types after being trained through a large amount of data; the prediction is that after the text to be classified is input into the trained file classification model, the classification result of the text can be obtained. In order to continuously improve the classification accuracy of the file classification model, a correction process such as supervised learning and the like realized by utilizing back propagation and the like can be further included, so that the model can be dynamically corrected.
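The training and prediction halves can be illustrated with a multi-class perceptron, used here only as a simple stand-in for the "linear classification algorithm" option; the source does not prescribe a particular algorithm, and the sample data, labels, and function names below are made up for the example.

```python
def predict(weights, features):
    """Prediction: return the class whose weight row scores highest."""
    extended = list(features) + [1.0]  # constant 1.0 acts as the bias input
    scores = [sum(w * x for w, x in zip(row, extended)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def train_perceptron(samples, labels, num_classes, epochs=20):
    """Training: multi-class perceptron updates over the training set."""
    dimension = len(samples[0]) + 1  # +1 for the bias input
    weights = [[0.0] * dimension for _ in range(num_classes)]
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            guess = predict(weights, features)
            if guess != label:
                extended = list(features) + [1.0]
                for j, value in enumerate(extended):
                    weights[label][j] += value  # pull the true class closer
                    weights[guess][j] -= value  # push the wrong class away
    return weights


# File feature vectors: two text dimensions followed by two picture dimensions.
samples = [
    [1.0, 0.0, 0.9, 0.1],
    [0.8, 0.1, 0.7, 0.3],
    [0.0, 1.0, 0.1, 0.9],
    [0.1, 0.9, 0.2, 0.8],
]
labels = [0, 0, 1, 1]  # hypothetical file categories
model = train_perceptron(samples, labels, num_classes=2)
print([predict(model, s) for s in samples], predict(model, [0.9, 0.0, 0.8, 0.2]))
```

Adding newly classified files to `samples` and retraining would correspond to the dynamic correction described above, although the perceptron update is not back propagation.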
Therefore, the file classification method provided by the invention firstly determines the picture classification result corresponding to the picture information in the file through the picture classification model, can quickly and accurately quantize the picture through the method, and quantizes the picture with huge data volume and variable form into the corresponding picture classification result. In a word, the invention solves the problems that the picture content is inconvenient to analyze and express through a layer of picture classification model which is constructed in advance, and further applies the picture information to a file classification mode, thereby improving the accuracy of file classification and widening the application range.
Example three
Fig. 3 is a schematic structural diagram of a file classifying device according to a third embodiment of the present invention, where the device includes: an acquisition module 310, a picture classification module 320, a feature vector module 330, and a file classification module 340.
The obtaining module 310 is configured to obtain picture information included in a file.
The specific manner of acquiring the picture information can be determined flexibly according to how the picture is embedded in the file; the invention does not limit it, and those skilled in the art may implement it in various ways. For example, if the picture is embedded in the file as a thumbnail icon, the obtaining module 310 may first obtain the complete picture corresponding to the thumbnail and then determine the corresponding picture information from the complete picture. As another example, if the picture is embedded in the file as a hyperlink, the obtaining module 310 may first obtain the corresponding original picture through the hyperlink and then determine the corresponding picture information from the original picture.
In addition, the image content directly acquired by the acquisition module 310 may be used as image information for subsequent processing, or the acquisition module 310 may perform a preset information extraction operation on the acquired image content first, and then perform subsequent processing on the extracted important information as image information, so that on one hand, the workload during subsequent processing can be reduced, and the processing speed can be increased; on the other hand, irrelevant information in the picture content can be filtered, so that the subsequent classification operation is more targeted. The invention does not limit the concrete implementation mode of the information extraction operation and the concrete representation form of the picture information.
The picture classification module 320 is configured to determine a picture classification result corresponding to the picture information according to a preset picture classification model.
In the embodiment of the present invention, the preset image classification model may be obtained by various machine learning algorithms such as a deep learning algorithm, and may also be obtained by using a conventional algorithm.
In addition, in order to facilitate statistics and identification of the obtained image classification result, in the embodiment of the present invention, the image classification result may be flexibly represented in various manners such as an image feature vector, an image feature matrix, and the like, and a specific representation form of the image classification result is not limited by the present invention.
The feature vector module 330 is configured to generate a file feature vector corresponding to the file according to the picture classification result.
When the file to be classified contains only picture content, the picture feature vector obtained by the picture classification module 320 can be used directly as the file feature vector corresponding to the file. When the file to be classified contains both picture content and text content, a text feature vector may be generated from the text content through a vector space model; the picture feature vector obtained by the picture classification module 320 and the text feature vector are then combined according to a preset rule, and the file feature vector corresponding to the file to be classified is generated from the combination result. The above method for generating the text feature vector is only an example; in practical applications, a person skilled in the art may flexibly choose a method for generating the text feature vector according to the actual situation, and the present invention is not limited in this respect.
The file classification module 340 is configured to determine a file classification result corresponding to the file feature vector through a preset file classification model.
The file classification model is determined by a preset machine learning algorithm, and the machine learning algorithm can be a linear classification algorithm, a neural network classification algorithm or a deep learning algorithm. The present invention is not limited to this, and those skilled in the art can flexibly select the machine learning algorithm according to the actual situation.
Specifically, the file feature vector obtained by the feature vector module 330 is input into a file classification model in the file classification module 340, and the file classification model obtains a file classification result corresponding to the file feature vector according to a corresponding rule and algorithm.
For the functional description of each module, reference may be made to the description of the corresponding part of each step in the foregoing method embodiment, and details are not described here.
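The cooperation of the four modules can be sketched as a simple chain. This is only an illustrative structure: the class name `FileClassifier`, the representation of a file as a dictionary, and the idea of injecting each module as a callable are assumptions made for the example, not details given by the invention.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class FileClassifier:
    """Chains the four modules; each stage is an injected, hypothetical callable."""
    acquire_pictures: Callable[[dict], List[bytes]]        # obtaining module 310
    classify_picture: Callable[[bytes], Sequence[float]]   # picture classification module 320
    build_file_vector: Callable[[dict, List[Sequence[float]]], List[float]]  # feature vector module 330
    classify_file: Callable[[List[float]], str]            # file classification module 340

    def classify(self, file: dict) -> str:
        # 310: obtain the picture information contained in the file.
        pictures = self.acquire_pictures(file)
        # 320: determine a picture classification result for each picture.
        picture_results = [self.classify_picture(p) for p in pictures]
        # 330: generate the file feature vector from the picture results.
        file_vector = self.build_file_vector(file, picture_results)
        # 340: determine the file classification result from the vector.
        return self.classify_file(file_vector)
```

Keeping each module behind a callable mirrors the statement that any suitable algorithm may be chosen for each stage, since one stage can be swapped without touching the others.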
Therefore, the file classification device provided by the invention can identify the text content and the picture content in a file and classify the file through the file classification model according to both, which solves the problem in the prior art that news cannot be classified according to its picture content and achieves the beneficial effect of classifying news more accurately by combining the text content and the picture content it contains.
Example four
Fig. 4 is a schematic structural diagram of a file classifying device according to a fourth embodiment of the present invention, where the device includes: the image classification model building module 410, the obtaining module 420, the image classification module 430, the feature vector module 440, and the file classification module 450, wherein the feature vector module 440 further includes: a text feature vector sub-module 441, a picture feature vector sub-module 442, and a combination sub-module 443.
The image classification model building module 410 is configured to perform machine learning on a pre-obtained image training set through a machine learning algorithm, and generate a preset image classification model according to a learning result.
Specifically, the image classification model building module 410 sets a targeted image training set according to the category required for image classification, and then performs machine learning on the image training set through a machine learning algorithm, so as to obtain an image classification model according to the learning result, which is called a training process of the image classification model; after the training is finished, the picture to be classified is input into the picture classification model, and then the prediction classification result of the picture to be classified can be obtained. The machine learning algorithm can be a deep learning algorithm or a neural network algorithm. In practical application, in order to improve the accuracy of the image classification model, each image classification result can be added into the image training set, so that the dynamic adjustment of the image classification model can be realized, and the classification accuracy of the image classification model is continuously improved.
The obtaining module 420 is configured to obtain picture information included in the file.
To reduce file size, the pictures included in a file such as a news item are generally thumbnails compressed from the original pictures, so some information is lost. To ensure the accuracy of picture classification, the invention needs to acquire the information of the original picture. Specifically, the obtaining module 420 first parses the file content to obtain the target thumbnail picture; it then analyzes the target thumbnail picture to obtain a hyperlink associated with the original picture, such as a URL (uniform resource locator); finally, it locates the original picture through the hyperlink, downloads it, and parses it to obtain the complete picture information.
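The step of finding the hyperlink associated with a thumbnail can be sketched as follows. The sketch assumes, purely for illustration, that the file is an HTML page in which each thumbnail `<img>` is wrapped in an `<a href="...">` pointing at the original picture; the invention does not specify the file format, and the names `OriginalPictureLinks` and `original_picture_urls` are hypothetical. Downloading the original picture is left out.

```python
from html.parser import HTMLParser
from typing import List, Optional


class OriginalPictureLinks(HTMLParser):
    """Collect the href of every <a> element that wraps an <img> thumbnail."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []
        self._open_href: Optional[str] = None  # href of the <a> currently open

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self._open_href = attributes.get("href")
        elif tag == "img" and self._open_href:
            # A thumbnail inside a link: remember where the original lives.
            self.links.append(self._open_href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._open_href = None


def original_picture_urls(html: str) -> List[str]:
    """Parse the file content and return the hyperlinks to original pictures."""
    parser = OriginalPictureLinks()
    parser.feed(html)
    return parser.links
```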
When the file to be classified includes a dynamic picture or a video, the obtaining module 420 obtains the dynamic picture or video included in the file, extracts at least one picture frame from it, and then determines the picture information corresponding to each picture frame. Specifically, to ensure that the extracted picture frames are representative and easy to identify, key frames in the dynamic picture or video may be extracted according to a certain rule, and the picture information corresponding to each key frame is then determined. Because a key frame contains the complete data of the current picture, the picture information acquired through key frames is more complete and accurate. The rule for extracting key frames may be to extract one key frame every fixed number of frames, or to compare the difference between adjacent frames and extract a key frame when the difference reaches or exceeds a preset threshold. The present invention does not limit the specific key frame extraction rule, and those skilled in the art can set it flexibly according to the actual situation.
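The two key frame extraction rules mentioned above can be sketched as follows. The sketch assumes that each frame is already decoded into a flat sequence of grayscale values, uses the mean absolute pixel difference as the difference measure, and always keeps the first frame; these choices and the function names are illustrative assumptions rather than details fixed by the invention.

```python
from typing import List, Sequence

Frame = Sequence[int]  # flattened grayscale pixels, an assumed representation


def every_nth_frame(frames: List[Frame], step: int) -> List[int]:
    """Rule 1: pick one frame index every `step` frames."""
    return list(range(0, len(frames), step))


def mean_abs_difference(a: Frame, b: Frame) -> float:
    """Average absolute pixel difference between two frames of equal size."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def frames_by_difference(frames: List[Frame], threshold: float) -> List[int]:
    """Rule 2: pick a frame when it differs from the previous frame by >= threshold."""
    if not frames:
        return []
    selected = [0]  # assumption: the first frame is always kept
    for i in range(1, len(frames)):
        if mean_abs_difference(frames[i - 1], frames[i]) >= threshold:
            selected.append(i)
    return selected
```

The first rule is cheaper, while the second adapts to the content and tends to skip runs of nearly identical frames.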
In addition, in this embodiment, after the obtaining module 420 obtains the picture content, it performs a preset information extraction operation on that content and uses the extracted important information as the picture information for subsequent processing. Specifically, in the process of implementing the invention, the inventor found that at least the following technical obstacle exists when picture content is used as a basis for file classification: although picture coding can compress a picture, whatever coding method is adopted, the data volume of picture content is far larger than that of text content and often reaches tens or even hundreds of megabytes, which makes pictures difficult to analyze and process. Accordingly, in this embodiment, the information extraction operation of the obtaining module 420 can filter out irrelevant information, such as the overall background information of the picture. It can also extract the important information in the picture, for example by treating an area with significant pixel change as an important area, or by identifying the important area with various picture identification algorithms. The information extraction operation of the obtaining module 420 therefore greatly reduces the workload of subsequent processing and increases the processing speed, and it makes the subsequent classification operation more targeted and more accurate.
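One way to treat an area with significant pixel change as the important area is sketched below. It assumes a picture given as rows of grayscale values, measures change as the absolute difference to the right and lower neighbours, and returns the bounding box of all pixels whose change reaches a threshold. The measure, the threshold, and the function names are assumptions chosen for the example.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # rows of grayscale values, an assumed representation
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def important_region(image: Image, threshold: int) -> Optional[Box]:
    """Bound the pixels whose change to a right or lower neighbour is >= threshold."""
    rows, cols = len(image), len(image[0])
    hits = []
    for r in range(rows):
        for c in range(cols):
            right = abs(image[r][c] - image[r][c + 1]) if c + 1 < cols else 0
            down = abs(image[r][c] - image[r + 1][c]) if r + 1 < rows else 0
            if max(right, down) >= threshold:
                hits.append((r, c))
    if not hits:
        return None  # uniform picture: nothing stands out
    return (min(r for r, _ in hits), min(c for _, c in hits),
            max(r for r, _ in hits), max(c for _, c in hits))


def crop(image: Image, box: Box) -> Image:
    """Keep only the important area, discarding the surrounding background."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

Cropping to the returned box before classification is what reduces the data volume passed to later stages.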
The picture classification module 430 is configured to determine a picture classification result corresponding to the picture information according to a preset picture classification model.
In the embodiment of the invention, the preset image classification model is preferably obtained through a deep learning algorithm, and the image classification precision is higher due to the strong autonomous learning capability of the deep learning algorithm, so that the subsequent file classification precision is improved.
Because the picture information is different from the text information, in order to make the obtained picture classification result convenient for statistics and identification, in the embodiment of the present invention, the picture classification module 430 represents the picture classification result by using the picture feature vector. The picture feature vector is a probability vector for representing the picture classification, and the probability vector is a multi-dimensional feature vector, each dimension of which represents a picture category. There are many expression rules for the feature vector, and the present invention is not limited to this, and those skilled in the art can flexibly set the expression rules according to specific situations.
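A probability vector of this kind, together with the selection of the M most probable of N categories recited in the claims, can be sketched as follows. The sketch assumes that the picture classification model outputs one raw score per category, that softmax turns the scores into probabilities, that the index of a dimension serves as the picture classification number, and that the feature vector keeps the selected probabilities and zeroes the rest. All of these are illustrative assumptions.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw model scores into a probability vector, one entry per category."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_m_categories(probabilities: Sequence[float], m: int) -> List[int]:
    """Return the classification numbers of the M most probable categories."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return order[:m]


def picture_feature_vector(probabilities: Sequence[float], m: int) -> List[float]:
    """Keep the top-M probabilities in place and zero out the other dimensions."""
    keep = set(top_m_categories(probabilities, m))
    return [p if i in keep else 0.0 for i, p in enumerate(probabilities)]
```

For example, with the probabilities `[0.1, 0.6, 0.05, 0.25]` and M = 2, categories 1 and 3 are selected and the resulting vector is `[0.0, 0.6, 0.0, 0.25]`.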
The feature vector module 440 is configured to generate a file feature vector corresponding to the file.
When the file to be classified includes both picture content and text content, the feature vector module 440 may further include a text feature vector sub-module 441, a picture feature vector sub-module 442, and a combination sub-module 443. The text feature vector submodule 441 is configured to obtain text information included in the file, and generate a text feature vector corresponding to the text information; a picture feature vector submodule 442, configured to generate a picture feature vector corresponding to the picture classification result; and the combining submodule 443 is configured to combine the text feature vector and the picture feature vector, and generate a file feature vector according to a combination result.
The text feature vector sub-module 441 generates the text feature vector as follows: the text information is preprocessed, and a plurality of feature words are obtained from the preprocessing result; a corresponding weight is then assigned to each feature word, and the text feature vector is generated from the feature words and their weights. Many preprocessing operations are possible, for example uniformly converting Western-language text to lowercase, filtering out junk information and advertisement information, and deleting stop words; these operations make the extraction of text feature words smoother and more convenient. The text feature vector may be generated from word vectors, and the word vectors corresponding to the text information may be generated with the word2vec tool. The above method for generating the text feature vector is only an example; in practical applications, a person skilled in the art may flexibly choose a method for generating the text feature vector according to the actual situation, and the present invention is not limited in this respect.
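The preprocessing and weighting steps can be sketched as follows. The sketch assumes English text, a small illustrative stop word list, relative term frequency as the weight, and a fixed vocabulary that defines the dimensions of the vector. None of these choices is prescribed by the invention, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # illustrative list


def extract_feature_words(text: str) -> List[str]:
    """Preprocess: lowercase the text, split it into words, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def weight_feature_words(words: List[str]) -> Dict[str, float]:
    """Weight each feature word by its relative frequency in the text."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def text_feature_vector(text: str, vocabulary: List[str]) -> List[float]:
    """Lay the weights out in vocabulary order; words not in the text get 0."""
    weights = weight_feature_words(extract_feature_words(text))
    return [weights.get(word, 0.0) for word in vocabulary]
```

A weighting such as TF-IDF, or word vectors produced by a tool such as word2vec, could replace the relative frequency used here.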
In addition, when the combining sub-module 443 combines the text feature vector and the picture feature vector, the combination may be implemented in various ways, which is not specifically limited by the present invention. For example, the image feature vector may be directly added to the tail of the text feature vector, so as to obtain a combined file feature vector; or adding the picture characteristic vector into the head of the text characteristic vector to obtain a combined file characteristic vector; the image feature vector can also be inserted into a specified position in the middle of the text feature vector, so that a combined file feature vector is obtained. In addition, the image feature vectors can be combined after necessary deletion and conversion, the method is not limited by the invention, and a person skilled in the art can select a proper mode according to the characteristics of the file classification model.
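The three combination options (tail, head, and a specified middle position) can be covered by one small function. The name `combine` and the `position` parameter are hypothetical; the sketch simply splices one list into another.

```python
from typing import List, Sequence


def combine(text_vector: Sequence[float], picture_vector: Sequence[float],
            position: int) -> List[float]:
    """Insert the picture feature vector into the text feature vector.

    position == len(text_vector) appends to the tail,
    position == 0 prepends to the head,
    any value in between inserts at that position in the middle.
    """
    if not 0 <= position <= len(text_vector):
        raise ValueError("position must lie within the text vector")
    text = list(text_vector)
    return text[:position] + list(picture_vector) + text[position:]
```

Whichever position is chosen, it has to be the same during training and prediction so that each dimension of the file feature vector keeps a fixed meaning for the file classification model.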
The file classification module 450 is configured to determine a file classification result corresponding to the file feature vector through a preset file classification model.
The document classification model in the document classification module 450 is determined by a preset machine learning algorithm, which may be a linear classification algorithm, a neural network classification algorithm, or a deep learning algorithm. The present invention is not limited to this, and those skilled in the art can flexibly select the machine learning algorithm according to the actual situation.
Machine learning generally comprises two parts, training and prediction. Training fits the file classification model on a preset training set, so that after being trained on a large amount of data the model is able to identify file types. Prediction means that once a file to be classified is input into the trained file classification model, the classification result of that file is obtained. To keep improving the classification accuracy of the file classification model, a correction process, such as supervised learning implemented with back propagation, may also be included so that the model can be corrected dynamically.
For the functional description of each module, reference may be made to the description of the corresponding part of each step in the foregoing method embodiment, and details are not described here.
Therefore, the file classification device provided by the invention first determines the picture classification result corresponding to the picture information in the file through the picture classification model. In this way a picture, which has a huge data volume and variable form, can be quickly and accurately quantized into a corresponding picture classification result. Because the picture classification result has a small data volume, is fast to process, and classifies well, determining the file type from it is fast and yields accurate classification results. In short, by means of a picture classification model constructed in advance, the invention solves the problem that picture content is inconvenient to analyze and express, and then applies the picture information to file classification, thereby improving the accuracy of file classification and widening its range of application.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the file classification device according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names.

Claims (10)

1. A file classification method, comprising:
acquiring picture information contained in a file;
determining a picture classification result corresponding to the picture information through a preset picture classification model;
generating a file feature vector corresponding to the file according to the picture classification result;
determining a file classification result corresponding to the file feature vector through a preset file classification model;
the step of generating a file feature vector corresponding to the file according to the picture classification result specifically includes:
acquiring text information contained in the file, and generating a text feature vector corresponding to the text information;
generating a picture feature vector corresponding to the picture classification result, combining the text feature vector with the picture feature vector, and generating the file feature vector according to the combination result;
the total number of picture classifications corresponding to the picture classification model is N, wherein N is a natural number greater than 2; the step of determining the picture classification result corresponding to the picture information specifically includes: determining the probability that the picture information belongs to each of the N picture classifications, and selecting, in descending order of probability, M picture classifications as the picture classification result of the picture information, wherein M is a natural number smaller than N;
the step of generating the picture feature vector corresponding to the picture classification result specifically includes: setting a corresponding picture classification number for each picture classification result in advance; determining the picture classification number corresponding to the picture classification result of the picture information; and generating the corresponding picture feature vector according to the picture classification number.
2. The method of claim 1, wherein prior to performing the method, further comprising:
performing machine learning on a pre-acquired picture training set through a machine learning algorithm, and generating the preset picture classification model according to a learning result; wherein the machine learning algorithm comprises: deep learning algorithms, and neural network algorithms.
3. The method according to claim 1, wherein the step of generating a text feature vector corresponding to the text information specifically comprises:
preprocessing the text information, and obtaining a plurality of feature words according to the preprocessing result;
and assigning a corresponding weight to each feature word, and generating the text feature vector according to the feature words and their weights.
4. The method according to any one of claims 1-3, wherein the file classification model is determined by a preset machine learning algorithm, wherein the machine learning algorithm comprises: linear classification algorithms, neural network classification algorithms, and deep learning algorithms.
5. The method according to any one of claims 1-3, wherein the step of acquiring the picture information contained in the file specifically comprises: acquiring a dynamic picture contained in the file, extracting at least one picture frame contained in the dynamic picture, and determining the picture information corresponding to each picture frame.
6. A file classification apparatus, comprising:
the acquisition module is used for acquiring picture information contained in the file;
the picture classification module is used for determining a picture classification result corresponding to the picture information through a preset picture classification model;
the feature vector module is used for generating a file feature vector corresponding to the file according to the picture classification result;
the file classification module is used for determining a file classification result corresponding to the file feature vector through a preset file classification model;
wherein, the feature vector module specifically comprises:
the text feature vector submodule is used for acquiring text information contained in the file and generating a text feature vector corresponding to the text information;
the picture feature vector submodule is used for generating a picture feature vector corresponding to the picture classification result;
the combination submodule is used for combining the text feature vector and the picture feature vector and generating the file feature vector according to the combination result;
the total number of picture classifications corresponding to the picture classification model is N, wherein N is a natural number greater than 2; the picture classification module is specifically configured to: determine the probability that the picture information belongs to each of the N picture classifications, and select, in descending order of probability, M picture classifications as the picture classification result of the picture information, wherein M is a natural number smaller than N;
the picture feature vector submodule is specifically configured to: set a corresponding picture classification number for each picture classification result in advance; determine the picture classification number corresponding to the picture classification result of the picture information; and generate the corresponding picture feature vector according to the picture classification number.
7. The apparatus of claim 6, further comprising:
the image classification model building module is used for performing machine learning on a pre-acquired image training set through a machine learning algorithm and generating a preset image classification model according to a learning result; wherein the machine learning algorithm comprises: deep learning algorithms, and neural network algorithms.
8. The apparatus of claim 6, wherein the text feature vector submodule is specifically configured to:
preprocess the text information, and obtain a plurality of feature words according to the preprocessing result;
and assign a corresponding weight to each feature word, and generate the text feature vector according to the feature words and their weights.
9. The apparatus according to any one of claims 6-8, wherein the file classification model is determined by a preset machine learning algorithm, wherein the machine learning algorithm comprises: linear classification algorithms, neural network classification algorithms, and deep learning algorithms.
10. The apparatus of any of claims 6-8, wherein the acquisition module is further configured to: acquire a dynamic picture contained in the file, extract at least one picture frame contained in the dynamic picture, and determine the picture information corresponding to each picture frame.
CN201710138149.XA 2017-02-15 2017-03-09 File classification method and device Active CN106897454B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2017100810619 2017-02-15
CN201710081061 2017-02-15

Publications (2)

Publication Number Publication Date
CN106897454A CN106897454A (en) 2017-06-27
CN106897454B CN106897454B (en) 2020-07-03

Family

ID=59185860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710138149.XA Active CN106897454B (en) 2017-02-15 2017-03-09 File classification method and device

Country Status (1)

Country Link
CN (1) CN106897454B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391751A (en) * 2017-08-15 2017-11-24 郑州云海信息技术有限公司 A kind of file classifying method and device
CN110209920A (en) * 2018-05-02 2019-09-06 腾讯科技(深圳)有限公司 Treating method and apparatus, storage medium and the electronic device of media resource
CN108765664B (en) * 2018-05-25 2021-03-16 Oppo广东移动通信有限公司 Fingerprint unlocking method and device, terminal and storage medium
CN111444966B (en) * 2018-09-14 2023-04-07 腾讯科技(深圳)有限公司 Media information classification method and device
CN112508094B (en) * 2020-07-24 2023-10-20 完美世界(北京)软件科技发展有限公司 Garbage picture identification method, device and equipment
CN112214603A (en) * 2020-10-26 2021-01-12 Oppo广东移动通信有限公司 Image-text resource classification method, device, terminal and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923561A (en) * 2010-05-24 2010-12-22 中国科学技术信息研究所 Automatic document classifying method
CN101937445A (en) * 2010-05-24 2011-01-05 中国科学技术信息研究所 Automatic file classification system
CN201796362U (en) * 2010-05-24 2011-04-13 中国科学技术信息研究所 Automatic file classifying system
CN103646072A (en) * 2013-12-10 2014-03-19 河南博仕达通信技术有限公司 Automatic file classification method and mobile terminal
CN103745201A (en) * 2014-01-06 2014-04-23 Tcl集团股份有限公司 Method and device for program recognition

Also Published As

Publication number Publication date
CN106897454A (en) 2017-06-27

Similar Documents

Publication Publication Date Title
CN106897454B (en) File classification method and device
US11132555B2 (en) Video detection method, server and storage medium
CN108229504B (en) Image analysis method and device
JP6005837B2 (en) Image analysis apparatus, image analysis system, and image analysis method
CN108734159B (en) Method and system for detecting sensitive information in image
Ayed et al. MapReduce based text detection in big data natural scene videos
CN107193974B (en) Regional information determination method and device based on artificial intelligence
KR102002024B1 (en) Method for processing labeling of object and object management server
CN113963147B (en) Key information extraction method and system based on semantic segmentation
CN111461101B (en) Method, device, equipment and storage medium for identifying work clothes mark
CN111783712A (en) Video processing method, device, equipment and medium
CN110210038B (en) Core entity determining method, system, server and computer readable medium thereof
CN103020645A (en) System and method for junk picture recognition
CN114169381A (en) Image annotation method and device, terminal equipment and storage medium
CA2267828A1 (en) Multiple size reductions for image segmentation
CN112084812A (en) Image processing method, image processing device, computer equipment and storage medium
CN113033269A (en) Data processing method and device
CN111191591A (en) Watermark detection method, video processing method and related equipment
CN111783812A (en) Method and device for identifying forbidden images and computer readable storage medium
CN109145879B (en) Method, equipment and storage medium for identifying printing font
CN111651674A (en) Bidirectional searching method and device and electronic equipment
CN115588150A (en) Pet dog video target detection method and system based on improved YOLOv5-L
CN104462422A (en) Object processing method and device
CN108369559B (en) File structure analysis device applying image processing
CN114241253A (en) Model training method, system, server and storage medium for illegal content identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100089 710, 7 / F, building 1, zone 1, No.3, Xisanhuan North Road, Haidian District, Beijing

Patentee after: Beijing time Ltd.

Address before: 100089 710, 7 / F, building 1, zone 1, No.3, Xisanhuan North Road, Haidian District, Beijing

Patentee before: BEIJING TIME Co.,Ltd.