CN113515522A - Automatic label classification method based on data mining technology - Google Patents
Automatic label classification method based on data mining technology
- Publication number
- CN113515522A (application CN202110812540.XA)
- Authority
- CN
- China
- Prior art keywords
- information
- label
- olp
- classification
- database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/26—Visual data mining; Browsing structured data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses an automatic label classification method based on data mining technology, comprising the following steps: acquiring film-viewing record log information and extracting semantic keywords from it using jieba word segmentation; obtaining OLP information from the semantic keywords based on a database and a label rule base; judging whether the obtained OLP information is complete; if it is complete, constructing an OLP relationship tree, constructing a label category structure tree from the OLP relationship tree, and completing classification; if it is not complete, performing a similarity query on the OLP information and judging whether it is noise content; if it is noise, rejecting the noise keyword information and ending the classification process; if it is not noise, looking up the unrecognized keyword information with a web crawler, setting it as label information to be defined, and defining it according to its similarity to the label information in the history library. The invention achieves automatic identification and classification of labels and timely updating of the label system, and greatly improves classification efficiency and timeliness.
Description
Technical Field
The invention belongs to the field of automatic label classification, and particularly relates to an automatic label classification method based on a data mining technology.
Background
At present, label systems are mostly designed so that labels are classified manually. Manual classification is time-consuming, labor-intensive, and inefficient. Because of these limitations, label systems often become outdated, cannot be updated in time, and cannot meet the requirements of real-time label construction.
Disclosure of Invention
Purpose of the invention: to overcome the deficiencies of the prior art, an automatic label classification method based on data mining technology is provided that achieves automatic identification and classification of labels and timely updating of the label system, greatly improving classification efficiency and timeliness.
Technical solution: to achieve the above purpose, the invention provides an automatic label classification method based on data mining technology, comprising the following steps:
S1: acquiring film-viewing record log information and extracting semantic keywords from it using jieba word segmentation;
S2: obtaining OLP information from the semantic keywords based on a database and a label rule base;
S3: judging whether the obtained OLP information is complete; if so, proceeding to step S4; if not, proceeding to step S5;
S4: constructing an OLP relationship tree, constructing a label category structure tree from the OLP relationship tree, and completing classification;
S5: based on a historical information label library, performing a similarity query on the OLP information and judging whether it is noise content; if so, rejecting the noise keyword information and ending the classification process; if not, looking up the unrecognized keyword information with a web crawler, setting it as label information to be defined, and defining it according to its similarity to the label information in the history library.
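The branching in steps S3 to S5 can be sketched as a small decision function. This is only an illustration of the control flow: the names `NextAction` and `next_action` are hypothetical, and the two boolean inputs stand in for the completeness check and the noise judgment, whose internals the method does not spell out here.

```python
from enum import Enum


class NextAction(Enum):
    """Possible outcomes after OLP information has been obtained (names are illustrative)."""
    BUILD_TREES = "S4: build the OLP relationship tree and the label category tree"
    REJECT_NOISE = "S5: reject noise keywords and end classification"
    DEFINE_LABEL = "S5: crawl, mark as label to be defined, compare with history"


def next_action(olp_complete: bool, is_noise: bool) -> NextAction:
    """Choose the next step from the S3 completeness result and the S5 noise result."""
    if olp_complete:
        # S3 answered "complete": the noise check is never reached.
        return NextAction.BUILD_TREES
    if is_noise:
        return NextAction.REJECT_NOISE
    return NextAction.DEFINE_LABEL
```

Note that the noise judgment only matters for incomplete records, which is why `olp_complete` is tested first.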
Further, the OLP information in step S2 is obtained as follows: the database and the label rule base are queried with the semantic keywords to obtain the corresponding OLP information. The OLP information includes the relationships among the subject, the object, and the attributes.
Further, in step S4, the OLP relationship tree is constructed, and the label category structure tree is constructed from it, as follows: the database comprises a personnel database and a photo database; if a match is found in the personnel database, the subject's information is determined to be complete; if a match is found in the photo database, the object's information is determined to be complete; the type of the film is determined from the film's name; and labels for the person are created automatically using the OLP information rules, according to information such as viewing time and film classification.
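As a rough sketch of the matching and label creation just described, the following assumes two tiny in-memory lookups in place of the personnel and photo databases. The contents of `PERSONNEL_DB` and `PHOTO_DB`, the film type, and the "morning viewer" rule are all invented for illustration; the actual OLP information rules are not given in the text.

```python
from datetime import datetime
from typing import Dict, List

# Hypothetical stand-ins for the personnel (subject) and photo (object) databases.
PERSONNEL_DB = {"000001"}
PHOTO_DB = {"Iron Man": "science fiction"}  # film name -> film type


def check_completeness(account: str, film: str) -> Dict[str, bool]:
    """Subject is complete if the account matches; object is complete if the film matches."""
    return {"subject": account in PERSONNEL_DB, "object": film in PHOTO_DB}


def create_labels(account: str, film: str, watched_at: datetime) -> List[str]:
    """Create labels for a person from film type and viewing time (rules are assumptions)."""
    status = check_completeness(account, film)
    if not all(status.values()):
        return []  # incomplete records are handled by step S5 instead
    labels = [f"likes {PHOTO_DB[film]}"]  # film type determined from the film's name
    period = "morning" if watched_at.hour < 12 else "afternoon/evening"
    labels.append(f"{period} viewer")
    return labels
```

A real rule base would likely hold many such rules; the point here is only that complete subject and object information is a precondition for creating labels.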
Further, the label information to be defined in step S5 is obtained as follows: the unrecognized keyword information is looked up on the Internet by a web crawler to obtain its classification, and it is automatically placed among the label information to be defined.
Further, in step S5, the label information to be defined is processed as follows: the cosine similarity between the label information to be defined and the label information in the history library is calculated; label information to be defined whose similarity is greater than a set value is maintained directly in the label rule base, while label information to be defined whose similarity is lower than the set value is first reviewed manually and then maintained in the label rule base.
The classification method addresses the automatic construction of subjects, objects, actions, attributes, and association relationships in the data, and uses semantic recognition to analyze the content and obtain its keywords. The label rules corresponding to the keywords are matched by retrieval; unmatched content is normalized a second time using a web crawler, and its similarity to historically normalized record information is summarized. In this way a label system can be constructed quickly, achieving real-time classification of data labels.
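The keyword extraction step might look like the sketch below. `jieba.lcut` is jieba's list-returning segmentation call; everything else (the function names, the stop-word set, filtering by stop words at all) is an assumption made for illustration, since the text does not say how keywords are selected from the segmented tokens. jieba is a third-party package and is imported lazily so that the filtering helper can be used without it.

```python
from typing import Iterable, List, Set


def segment(text: str) -> List[str]:
    """Segment Chinese text into tokens with jieba (third-party: pip install jieba)."""
    import jieba  # lazy import; only needed when segmentation is actually run
    return jieba.lcut(text)


def filter_keywords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Keep non-blank tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t.strip() and t not in stopwords]
```

For a log line such as "账号000001在2020-06-02观看钢铁侠", one would call `filter_keywords(segment(line), {"在"})`. Film titles that jieba does not know can be registered beforehand with `jieba.add_word`, which is one use of the user-defined dictionary support.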
Beneficial effects: compared with the prior art, the invention intelligently classifies film-viewing logs using data mining technology and can construct OLP relationships through semantic analysis of the viewing records of different accounts. It achieves automatic identification and classification of labels and timely updating of the label system, solves the high workload, low efficiency, and outdated label systems of existing manual classification, greatly reduces manual maintenance work, and can meet the requirements of real-time label construction.
Drawings
FIG. 1 is a schematic flow chart of the classification method of the present invention;
FIG. 2 is a diagram illustrating the OLP information in an embodiment.
Detailed Description
The invention is further illustrated below with reference to the drawings and specific embodiments. These embodiments are illustrative only and do not limit the scope of the invention; equivalent modifications made by those skilled in the art after reading this specification fall within the scope defined by the appended claims.
This embodiment provides an automatic label classification method based on data mining technology. For ease of understanding, the terms involved in the classification process are explained as follows:
1. OLP: a method for splitting, recombining, and connecting various kinds of data in a highly abstract way; specifically, it comprises Object (entity objects), Property (attributes), and Link (association relationships among entities).
2. jieba word segmentation: jieba ("stutter") is a Python Chinese word segmentation library that performs word segmentation, part-of-speech tagging, keyword extraction, and similar functions on Chinese text, and supports user-defined dictionaries.
3. Web crawler: a program that automatically retrieves web pages from the World Wide Web; it downloads pages for a search engine and is an important component of search engines.
4. Label category tree: a hierarchy of labels and label objects. The leaf nodes of the tree are specific attributes and the non-leaf nodes are label names; the directory tree expresses how objects are linked and shows the path from one object to another.
5. Data mining: the process of using algorithms to find information hidden in large amounts of data; many algorithms are used, including semantic analysis, crawling, classification, and clustering.
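The label category tree in term 4 can be pictured with a minimal data structure such as the one below. The class name `LabelNode`, its methods, and the sample label names are hypothetical; the sketch only shows non-leaf nodes acting as label names, leaves acting as specific attributes, and a path being read off between two nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LabelNode:
    """One node of a label category tree (illustrative structure only)."""
    name: str
    children: Dict[str, "LabelNode"] = field(default_factory=dict)

    def add_path(self, path: List[str]) -> None:
        """Insert a chain of label names below this node, reusing existing nodes."""
        node = self
        for name in path:
            node = node.children.setdefault(name, LabelNode(name))

    def find_path(self, target: str) -> Optional[List[str]]:
        """Return the names from this node down to `target`, or None if it is absent."""
        if self.name == target:
            return [self.name]
        for child in self.children.values():
            sub = child.find_path(target)
            if sub is not None:
                return [self.name] + sub
        return None
```

For example, after `root = LabelNode("viewer")` and `root.add_path(["viewing time", "morning"])`, calling `root.find_path("morning")` walks from the root through the label name "viewing time" to the attribute "morning".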
As shown in fig. 1, the method for automatically classifying tags based on a data mining technology provided in this embodiment specifically includes the following steps:
S1: acquiring film-viewing record log information and extracting semantic keywords from it using jieba word segmentation;
S2: obtaining OLP information from the semantic keywords based on a database and a label rule base:
The database and the label rule base are queried with the semantic keywords to obtain the corresponding OLP information. The OLP information includes the relationships among the subject, the object, and the attributes.
S3: judging whether the obtained OLP information is complete; if so, proceeding to step S4; if not, proceeding to step S5;
S4: constructing an OLP relationship tree, constructing a label category structure tree from the OLP relationship tree, and completing classification:
The database comprises a personnel database (subject database), a photo database (object database), and the like. If a match is found in the personnel database, the subject's information is determined to be complete; if a match is found in the photo database, the object's information is determined to be complete; the type of the film is determined from the film's name; and labels for the person are created automatically using the OLP information rules, according to information such as viewing time and film classification.
S5: based on a historical information label library, performing a similarity query on the OLP information and judging whether it is noise content; if so, rejecting the noise keyword information and ending the classification process; if not, looking up the unrecognized keyword information with a web crawler, setting it as label information to be defined, and defining it according to its similarity to the label information in the history library.
The label information to be defined is obtained as follows: the unrecognized keyword information is looked up on the Internet by a web crawler to obtain its classification, and it is automatically placed among the label information to be defined.
The label information to be defined is processed as follows: the cosine similarity between the label information to be defined and the label information in the history library is calculated; label information to be defined whose similarity is greater than 70% is maintained directly in the label rule base, while label information to be defined whose similarity is lower than 70% is first reviewed manually and then maintained in the label rule base.
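The 70% cosine-similarity rule could be sketched as follows. The text does not say how label information is turned into vectors, so this example assumes simple token-count vectors; the function names, the sample history entries, and the choice to send a similarity of exactly 70% to manual review are likewise assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[str], b: List[str]) -> float:
    """Cosine similarity of two token lists, using token counts as vector components."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def route_pending_label(
    pending: List[str], history: Dict[str, List[str]], threshold: float = 0.7
) -> Tuple[str, Optional[str], float]:
    """Compare a label to be defined with every history label and pick a handling route."""
    best_label, best_score = None, 0.0
    for label, tokens in history.items():
        score = cosine_similarity(pending, tokens)
        if score > best_score:
            best_label, best_score = label, score
    # Above the threshold: maintain directly; otherwise (including exactly 0.7,
    # which the text leaves unspecified) send to manual review first.
    decision = "auto" if best_score > threshold else "manual"
    return decision, best_label, best_score
```

In both routes the label ends up in the label rule base; the threshold only decides whether a person reviews it first.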
As shown in FIG. 2, the OLP information obtained in step S2 in this embodiment is as follows:
The record "Account 000001 watched Iron Man at 2020-06-02 10:20:23" is segmented by the word segmentation technique into:
Account 000001 / at / 2020-06-02 10:20:23 / watch / Iron Man.
The OLP relationships are then extracted: the subject entity Object (000001), the object entity Object (Iron Man), the Link relation between them, and the Property attribute (time: 2020-06-02 10:20:23).
Hidden Property attributes are constructed from the database and the label rule base.
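To make the extraction concrete, the sketch below turns the segmented record of this embodiment, written here as "Account 000001/at/2020-06-02 10:20:23/watch/Iron Man", into an OLP structure. The class `OLPRecord`, the vocabularies `LINK_WORDS` and `FUNCTION_WORDS`, and the token-matching rules are hypothetical; in the method itself these lookups come from the database and the label rule base.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional

LINK_WORDS = {"watch"}    # hypothetical vocabulary of Link relations
FUNCTION_WORDS = {"at"}   # tokens that carry no OLP information


@dataclass
class OLPRecord:
    subject: Optional[str] = None   # subject entity Object, e.g. the account
    link: Optional[str] = None      # Link relation between subject and object
    obj: Optional[str] = None       # object entity Object, e.g. the film
    properties: Dict[str, str] = field(default_factory=dict)  # Property attributes


def is_timestamp(token: str) -> bool:
    """True if the token has the form YYYY-MM-DD HH:MM:SS."""
    try:
        datetime.strptime(token, "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False


def extract_olp(segmented: str) -> OLPRecord:
    """Build an OLP record from a '/'-separated segmented log line."""
    record = OLPRecord()
    for token in (t.strip() for t in segmented.split("/")):
        if not token or token in FUNCTION_WORDS:
            continue
        if token.startswith("Account "):
            record.subject = token.split(" ", 1)[1]
        elif is_timestamp(token):
            record.properties["time"] = token
        elif token in LINK_WORDS:
            record.link = token
        elif record.link is not None and record.obj is None:
            record.obj = token  # first unclassified token after the link word
    return record
```

A record whose subject, link, or object is still `None` after extraction is the kind of incomplete OLP information that step S3 would pass on to step S5.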
This embodiment also provides an automatic label classification system based on data mining technology, comprising a network interface, a memory, and a processor. The network interface receives and sends signals when exchanging information with other external network elements; the memory stores computer program instructions executable on the processor; and the processor, when executing the computer program instructions, performs the steps of the classification method described above.
The present embodiment also provides a computer storage medium storing a computer program that when executed by a processor can implement the method described above. The computer-readable medium may be considered tangible and non-transitory. Non-limiting examples of a non-transitory tangible computer-readable medium include a non-volatile memory circuit (e.g., a flash memory circuit, an erasable programmable read-only memory circuit, or a mask read-only memory circuit), a volatile memory circuit (e.g., a static random access memory circuit or a dynamic random access memory circuit), a magnetic storage medium (e.g., an analog or digital tape or hard drive), and an optical storage medium (e.g., a CD, DVD, or blu-ray disc), among others. The computer program includes processor-executable instructions stored on at least one non-transitory tangible computer-readable medium. The computer program may also comprise or rely on stored data. The computer programs may include a basic input/output system (BIOS) that interacts with the hardware of the special purpose computer, a device driver that interacts with specific devices of the special purpose computer, one or more operating systems, user applications, background services, background applications, and the like.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Claims (6)
1. An automatic label classification method based on data mining technology, characterized by comprising the following steps:
S1: acquiring film-viewing record log information and extracting semantic keywords from it using jieba word segmentation;
S2: obtaining OLP information from the semantic keywords based on a database and a label rule base;
S3: judging whether the obtained OLP information is complete; if so, proceeding to step S4; if not, proceeding to step S5;
S4: constructing an OLP relationship tree, constructing a label category structure tree from the OLP relationship tree, and completing classification;
S5: based on a historical information label library, performing a similarity query on the OLP information and judging whether it is noise content; if so, rejecting the noise keyword information and ending the classification process; if not, looking up the unrecognized keyword information with a web crawler, setting it as label information to be defined, and defining it according to its similarity to the label information in the history library.
2. The method of claim 1, wherein the OLP information in step S2 is obtained as follows: the database and the label rule base are queried with the semantic keywords to obtain the corresponding OLP information.
3. The method of claim 1, wherein in step S4 the OLP relationship tree is constructed, and the label category structure tree is constructed from it, as follows: the database comprises a personnel database and a photo database; if a match is found in the personnel database, the subject's information is determined to be complete; if a match is found in the photo database, the object's information is determined to be complete; the type of the film is determined from the film's name; and labels for the person are created automatically using the OLP information rules, according to the viewing time and film classification information.
4. The method of claim 1, wherein the label information to be defined in step S5 is obtained as follows: the unrecognized keyword information is looked up on the Internet by a web crawler to obtain its classification, and it is automatically placed among the label information to be defined.
5. The method of claim 1, wherein in step S5 the label information to be defined is processed as follows: the cosine similarity between the label information to be defined and the label information in the history library is calculated; label information to be defined whose similarity is greater than a set value is maintained directly in the label rule base, while label information to be defined whose similarity is lower than the set value is first reviewed manually and then maintained in the label rule base.
6. The method of claim 1, wherein the OLP information in step S2 includes the relationships among a subject, an object, and attributes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110812540.XA CN113515522A (en) | 2021-07-19 | 2021-07-19 | Automatic label classification method based on data mining technology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110812540.XA CN113515522A (en) | 2021-07-19 | 2021-07-19 | Automatic label classification method based on data mining technology |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113515522A (en) | 2021-10-19 |
Family
ID=78067323
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110812540.XA Pending CN113515522A (en) | 2021-07-19 | 2021-07-19 | Automatic label classification method based on data mining technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113515522A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107844548A (en) * | 2017-10-30 | 2018-03-27 | 北京锐安科技有限公司 | A kind of data label method and apparatus |
KR20180129001A (en) * | 2017-05-24 | 2018-12-05 | 한국과학기술원 | Method and System for Entity summarization based on multilingual projected entity space |
CN109635171A (en) * | 2018-12-13 | 2019-04-16 | 成都索贝数码科技股份有限公司 | A kind of fusion reasoning system and method for news program intelligent label |
CN109657013A (en) * | 2018-11-30 | 2019-04-19 | 杭州数澜科技有限公司 | A kind of systematization generates the method and system of label |
CN111444334A (en) * | 2019-01-16 | 2020-07-24 | 阿里巴巴集团控股有限公司 | Data processing method, text recognition device and computer equipment |
CN111967262A (en) * | 2020-06-30 | 2020-11-20 | 北京百度网讯科技有限公司 | Method and device for determining entity tag |
CN112541066A (en) * | 2020-12-11 | 2021-03-23 | 清华大学 | Text-structured-based medical and technical report detection method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||