WO2022048363A1 - Website classification method and apparatus, computer device, and storage medium - Google Patents

Website classification method and apparatus, computer device, and storage medium

Info

Publication number
WO2022048363A1
WO2022048363A1 PCT/CN2021/109553
Authority
WO
WIPO (PCT)
Prior art keywords
text
model
website
classification
sample
Prior art date
Application number
PCT/CN2021/109553
Other languages
English (en)
French (fr)
Inventor
吴满芳
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022048363A1

Classifications

    • G — Physics
    • G06 — Computing; Calculating or Counting
    • G06F — Electric digital data processing
    • G06F 16/35 — Information retrieval of unstructured textual data; Clustering; Classification
    • G06F 16/958 — Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F 18/214 — Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 40/216 — Natural language analysis; Parsing using statistical methods
    • G06F 40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/044 — Neural networks; Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Neural networks; Combinations of networks

Definitions

  • the present application relates to the field of data analysis, and in particular, to a website classification method, apparatus, computer equipment and storage medium.
  • the application provides a website classification method, including:
  • the training sample set includes at least one sample text, and the sample text is website introduction text identifying the website type;
  • the text of the website to be classified is classified by using the text classification model, so as to obtain the category of the text of the website to be classified.
  • the application also provides a website classification device, including:
  • a training unit configured to train an initial classification model by using a training sample set, and obtain a text classification model, wherein the training sample set includes at least one sample text, and the sample text is a website profile text identifying a website type;
  • an obtaining unit used to obtain the text of the website to be classified
  • a prediction unit configured to use the text classification model to classify the text of the website to be classified, so as to obtain the category of the text of the website to be classified.
  • the present application also provides a computer device, which includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when the processor executes the computer-readable instructions, the above website classification method is implemented, including:
  • the training sample set includes at least one sample text, and the sample text is the website profile text identifying the website type;
  • the text of the website to be classified is classified by using the text classification model, so as to obtain the category of the text of the website to be classified.
  • the present application also provides a computer-readable storage medium on which computer-readable instructions are stored, and when the computer-readable instructions are executed by a processor, the above-mentioned website classification method is implemented, including:
  • the training sample set includes at least one sample text, and the sample text is website introduction text identifying the website type;
  • the text of the website to be classified is classified by using the text classification model, so as to obtain the category of the text of the website to be classified.
  • the website classification method, apparatus, computer equipment and storage medium provided by this application use the website profile text identifying the website type as training samples to train the initial classification model and obtain the text classification model, which greatly reduces the storage occupied during training and improves training speed; the text classification model is used to classify the text of the website to be classified, so as to obtain its category, thereby achieving the purpose of quickly and accurately identifying the website type based on the website introduction text.
  • FIG. 2 is a flowchart of an embodiment in which the application adopts a training sample set to train an initial classification model to obtain a text classification model;
  • FIG. 3 is a flowchart of an embodiment of obtaining the first classification vector of each sample text through the first LightGBM model
  • FIG. 5 is a flowchart of an embodiment of using a text classification model to classify website texts to be classified to obtain categories of website texts to be classified;
  • FIG. 6 is a flowchart of another embodiment of obtaining a fourth classification vector of the website text to be classified by using the second LightGBM model
  • FIG. 7 is a flowchart of another embodiment of obtaining the fifth classification vector of the website text to be classified by the second Bi-LSTM model
  • FIG. 8 is a block diagram of an embodiment of the website classification device described in this application.
  • FIG. 9 is a block diagram of an embodiment of the training unit described in the application.
  • FIG. 10 is a block diagram of an embodiment of the prediction unit described in this application.
  • FIG. 11 is a hardware architecture diagram of an embodiment of the computer device of the present application.
  • the website classification method, device, computer equipment and storage medium provided by this application are suitable for fields such as insurance business and financial business.
  • This application relates to artificial intelligence.
  • the website introduction text identifying the website type is used as a training sample to train the initial classification model and obtain the text classification model, which greatly reduces the storage occupied during training and improves training speed; the text classification model is used to classify the text of the website to be classified, so as to obtain its category, thereby achieving the purpose of quickly and accurately identifying the website type based on the website introduction text.
  • a website classification method of the present embodiment includes:
  • the training sample set includes at least one sample text, and the sample text is website introduction text identifying the website type;
  • the initial classification model includes the first LightGBM model and the first Bi-LSTM model
  • the text classification model includes the second LightGBM model and the second Bi-LSTM model
  • in step S1, the first LightGBM model and the first Bi-LSTM model are trained by using the training sample set, and the second LightGBM model and the second Bi-LSTM model are obtained.
  • for step S1, as shown in FIG. 2, the following steps may be included:
  • the training sample set can be stored in the nodes of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
  • a blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-tampering) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • for step S11, as shown in FIG. 3, the following steps may be included:
  • Preprocessing the sample text in this step includes: removing Chinese and English punctuation marks in the sample text, deleting English characters and stop words, and segmenting the sample text to obtain all the segmented words in the sample text.
  • the word2vec function is used to train the corpus, and each word segment is mapped to a training word vector through the word2vec function, thereby obtaining a word segmentation matrix composed of the training word vectors.
  • because each word segment is represented by a training word vector of a preset dimension (for example, 300 dimensions), the sample text is represented as a two-dimensional word segmentation matrix.
  • the TF-IDF value of each word segment is calculated, and the TF-IDF value is used as the weight of that word segment.
  • the discrete training word vectors are mapped into continuous vectors by embedding method, and weighted with the corresponding weights, so as to obtain the first word segmentation vector of the sample text.
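The weighting step described above can be sketched roughly as follows. This is an illustration only: the tokenized documents and the toy word vectors are assumptions (in practice the word vectors would come from a gensim Word2Vec model trained on the corpus), and the formula used is the plain TF-IDF product.

```python
import math

def tfidf_weights(docs):
    """TF-IDF weight of every word segment in each tokenized document."""
    n = len(docs)
    df = {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return [{w: (doc.count(w) / len(doc)) * math.log(n / df[w]) for w in doc}
            for doc in docs]

def sentence_vector(doc, word_vecs, weights):
    """Weighted sum of word vectors -> the first word-segmentation vector."""
    dim = len(next(iter(word_vecs.values())))
    vec = [0.0] * dim
    for w in doc:
        for i, x in enumerate(word_vecs[w]):
            vec[i] += weights[w] * x
    return vec

docs = [["网站", "分类"], ["网站", "简介"]]          # toy tokenized sample texts
weights = tfidf_weights(docs)
vecs = {"网站": [1.0, 0.0], "分类": [0.0, 2.0], "简介": [0.5, 0.5]}
v = sentence_vector(docs[0], vecs, weights[0])
```

Note that a word appearing in every document receives weight 0 (its IDF is log(n/n)), so only discriminative word segments contribute to the sentence vector.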
  • using the first LightGBM model for training has the advantages of fast training speed, low memory consumption, and high accuracy, and can effectively improve the classification accuracy of the sample text.
  • for step S12, as shown in FIG. 4, it may include:
  • Preprocessing the sample text in this step includes: removing Chinese and English punctuation marks in the sample text, deleting English characters and stop words, and segmenting the sample text to obtain all the segmented words in the sample text.
  • the full name of the BERT model is Bidirectional Encoder Representations from Transformers.
  • the input of the BERT model is the original word vector of each word in the sample text, which can be initialized randomly or pre-trained by algorithms such as Word2Vec as the initial value; the output is, for each word in the sample text, a vector that integrates the semantic information of the full text, namely the second word segmentation vector.
  • an attention mechanism is added to the first Bi-LSTM model, that is, an Attention layer.
  • without the Attention layer, the output vector of the last time step would be used as the feature vector for softmax classification; the Attention layer instead first calculates a weight for each time step, takes the weighted sum of all time-step vectors as the feature vector, and then performs softmax classification.
  • the accuracy of the second classification vector can be effectively improved by adding an Attention layer.
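The Bi-LSTM-with-Attention structure described above might be sketched as follows. This is a PyTorch illustration under assumed dimensions, not the application's actual implementation; the attention scoring via a single linear layer is one common choice among several.

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, embed_dim=300, hidden=128, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)    # scores each time step
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, seq_len, embed_dim)
        out, _ = self.lstm(x)                  # (batch, seq_len, 2*hidden)
        # Attention layer: a weight per time step, then a weighted sum of all steps
        w = torch.softmax(self.att(out).squeeze(-1), dim=1)
        feat = (out * w.unsqueeze(-1)).sum(dim=1)
        return torch.softmax(self.fc(feat), dim=-1)   # classification vector

net = BiLSTMAttention()
scores = net(torch.randn(2, 10, 300))          # 2 texts, 10 word vectors each
```

The weighted sum over all time steps replaces the last-time-step output as the feature vector, which is exactly the change the Attention layer introduces.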
  • the first classification vector and the second classification vector of the same sample text are weighted and summed to obtain the training classification of the sample text;
  • specifically, the weighted sum yields the third classification vector of the sample text, and the dimension with the largest probability in the third classification vector is taken as the training classification result of the sample text.
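The fusion of the two models' outputs can be sketched in plain Python. The equal 0.5/0.5 weights below are an assumption; the application does not specify the weighting.

```python
def fuse_classifications(vec_a, vec_b, w_a=0.5, w_b=0.5):
    """Weighted sum of two classification vectors; the index of the largest
    entry of the fused (third) vector is the training classification result."""
    fused = [w_a * a + w_b * b for a, b in zip(vec_a, vec_b)]
    return fused, max(range(len(fused)), key=fused.__getitem__)

# vec_a from the LightGBM model, vec_b from the Bi-LSTM model (toy values)
third_vec, category = fuse_classifications([0.1, 0.7, 0.2], [0.2, 0.5, 0.3])
```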
  • the preset threshold can be set as required, such as 90%, 95%, and the like.
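The threshold check that decides whether to keep updating the model parameters might look like the following sketch; the category labels are purely illustrative.

```python
def match_rate(train_classifications, type_labels):
    """Fraction of sample texts whose training classification matches the
    website-type identifier attached to the sample text."""
    pairs = list(zip(train_classifications, type_labels))
    return sum(p == t for p, t in pairs) / len(pairs)

threshold = 0.90   # preset as required, e.g. 90% or 95%
rate = match_rate(["news", "shop", "news"], ["news", "shop", "blog"])
keep_training = rate < threshold   # below threshold: update parameters and retrain
```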
  • the text of the website to be classified may be obtained based on the website introduction in the website ranking.
  • for step S3, as shown in FIG. 5, it may include:
  • for step S31, as shown in FIG. 6, the following steps may be included:
  • preprocessing the text of the website to be classified includes: removing Chinese and English punctuation marks, deleting English characters and stop words, and segmenting the text, so as to obtain all the word segments of the website text to be classified.
  • S312: adopt the gensim module to convert each word segment into a word vector, to obtain the word segmentation matrix of the website text to be classified;
  • the word2vec function is used to train the corpus, and each word segment is mapped to a word vector through the word2vec function, thereby obtaining a word segmentation matrix composed of word vectors.
  • in step S312, because each word segment is represented by a word vector of a preset dimension, the website text to be classified is represented as a two-dimensional word segmentation matrix.
  • the TF-IDF value of each word segment is calculated, and the TF-IDF value is used as the weight of that word segment.
  • generating the sentence vector by weighting word vectors with TF-IDF values: TF-IDF evaluates the importance of a word in a document set or corpus, while the pre-trained word vector embedding method focuses on the semantics of the word; combining the two enriches and expands the textual information represented.
  • each word segment is weighted with its corresponding weight to obtain the third word segmentation vector of the website text to be classified;
  • the discrete word vectors are mapped into continuous vectors by the embedding method and weighted with the corresponding weights, so as to obtain the third word segmentation vector of the website text to be classified.
  • Using the second LightGBM model for prediction in this step has the advantages of fast prediction speed, low memory consumption and high accuracy, and can effectively improve the classification accuracy of the website text to be classified.
  • TF-IDF evaluates the importance of a word in a document set or corpus, while the pre-trained word vector embedding method focuses on the semantics of the word; combining the two enriches and expands the textual information represented.
  • for step S32, as shown in FIG. 7, the following steps may be included:
  • preprocessing the text of the website to be classified includes: removing Chinese and English punctuation marks, deleting English characters and stop words, and segmenting the text, so as to obtain all the word segments of the website text to be classified.
  • the full name of the BERT model is Bidirectional Encoder Representations from Transformers.
  • the input of the BERT model is the original word vector of each word in the text of the website to be classified, which can be initialized randomly or pre-trained by algorithms such as Word2Vec as the initial value; the output is, for each word, a vector that integrates the semantic information of the full text, namely the fourth word segmentation vector.
  • an attention mechanism is added to the second Bi-LSTM model, that is, an Attention layer.
  • without the Attention layer, the output vector of the last time step would be used as the feature vector for softmax classification; the Attention layer instead first calculates a weight for each time step, takes the weighted sum of all time-step vectors as the feature vector, and then performs softmax classification.
  • the accuracy of the fifth classification vector can be effectively improved by adding an Attention layer.
  • the sixth classification vector of the website text to be classified is obtained by weighting and summing the fourth classification vector and the fifth classification vector; the dimension with the largest probability in the sixth classification vector is taken as the classification result of the website text to be classified.
  • the website classification method uses the website profile text identifying the website type as training samples to train the initial classification model and obtain the text classification model, which greatly reduces the storage occupied during training and improves training speed; the text classification model is used to classify the text of the website to be classified, so as to obtain its category, thereby achieving the purpose of quickly and accurately identifying the website type based on the website introduction text.
  • the website classification method combines the Bi-LSTM model and the LightGBM model to classify and predict the website introduction, which improves classification accuracy. Compared with existing corpus-based training, this application starts from the website introduction, which contains less content, so the overhead of the model is greatly reduced.
  • a website classification apparatus 1 of this embodiment includes: a training unit 11, an acquisition unit 12 and a prediction unit 13; wherein:
  • the training unit 11 is used to train the initial classification model by using a training sample set and obtain a text classification model, wherein the training sample set includes at least one sample text, and the sample text is website profile text identifying the website type;
  • the training sample set includes at least one sample text, and the sample text is a website introduction text identifying the website type; the training sample set can be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
  • a blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-tampering) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the initial classification model includes the first LightGBM model and the first Bi-LSTM model
  • the text classification model includes the second LightGBM model and the second Bi-LSTM model
  • the training unit 11 uses the training sample set to train the first LightGBM model and the first Bi-LSTM model, and obtains the second LightGBM model and the second Bi-LSTM model.
  • the training unit 11 may include: a first training module 111 , a second training module 112 , a first summation module 113 , a matching module 114 and a processing module 115 .
  • a first training module 111 configured to input at least one of the sample texts in the training sample set into the first LightGBM model, and obtain a first classification vector of each of the sample texts;
  • obtaining the first classification vector of each of the sample texts through the first LightGBM model may include the following steps:
  • adopt the gensim module to convert each word segment into a training word vector, to obtain the word segmentation matrix of the sample text;
  • the second training module 112 is configured to input at least one of the sample texts of the training sample set into the first Bi-LSTM model, and obtain the second classification vector of each of the sample texts;
  • obtaining the second classification vector of each of the sample texts through the first Bi-LSTM model may include the following steps:
  • an attention mechanism is added to the first Bi-LSTM model, that is, an Attention layer.
  • without the Attention layer, the output vector of the last time step would be used as the feature vector for softmax classification; the Attention layer instead first calculates a weight for each time step, takes the weighted sum of all time-step vectors as the feature vector, and then performs softmax classification.
  • the accuracy of the second classification vector can be effectively improved by adding an Attention layer.
  • the first summation module 113 is configured to perform a weighted summation of the first classification vector and the second classification vector of the same sample text, to obtain the training classification of the sample text;
  • specifically, the weighted sum yields the third classification vector of the sample text, and the dimension with the largest probability in the third classification vector is taken as the training classification result of the sample text.
  • a matching module 114 configured to match the training classification of each of the sample texts with the website type identifiers of the sample texts;
  • the processing module 115 is used to judge whether the matching degree of the training classifications of all the sample texts is greater than a preset threshold; if not, the parameter values of the first LightGBM model and of the first Bi-LSTM model are updated until training of the two models is completed, and the second LightGBM model and the second Bi-LSTM model are obtained.
  • the preset threshold can be set as required, such as 90%, 95%, and the like.
  • an obtaining unit 12 for obtaining the text of the website to be classified
  • the text of the website to be classified may be obtained based on the website introduction in the website ranking.
  • the prediction unit 13 is configured to use the text classification model to classify the text of the website to be classified, so as to obtain the category of the text of the website to be classified.
  • the prediction unit 13 may include: a first prediction module 131 , a second prediction module 132 and a second summation module 133 .
  • the first prediction module 131 is configured to input the text of the website to be classified into the second LightGBM model, and obtain a fourth classification vector of the text of the website to be classified;
  • the gensim module is adopted to convert each word segment into a word vector, to obtain the word segmentation matrix of the website text to be classified;
  • TF-IDF evaluates the importance of a word in a document set or corpus, while the pre-trained word vector embedding method focuses on the semantics of the word; combining the two enriches and expands the textual information represented.
  • the second prediction module 132 is configured to input the text of the website to be classified into the second Bi-LSTM model, and obtain the fifth classification vector of the text of the website to be classified;
  • obtaining the fifth classification vector of the website text to be classified may include the following steps:
  • the second summation module 133 is configured to perform weighted summation of the fourth classification vector and the fifth classification vector of the website text to be classified, so as to obtain the classification of the website text to be classified.
  • the sixth classification vector of the website text to be classified is obtained by weighting and summing the fourth classification vector and the fifth classification vector; the dimension with the largest probability in the sixth classification vector is taken as the classification result of the website text to be classified.
  • the website classification device 1 uses the website profile text identifying the website type as training samples to train the initial classification model and obtain the text classification model, which greatly reduces the storage occupied during training and improves training speed; the text classification model is used to classify the text of the website to be classified, so as to obtain its category, thereby achieving the purpose of quickly and accurately identifying the website type based on the website introduction text.
  • the website classification device combines the Bi-LSTM model and the LightGBM model to classify and predict the website introduction, which improves classification accuracy. Compared with existing corpus-based training, this application starts from the website introduction, which contains less content, so the overhead of the model is greatly reduced.
  • the present application also provides a computer device 2; the computer device 2 may comprise a plurality of computer devices 2, and the components of the website classification device 1 of the second embodiment may be distributed across different computer devices 2. The computer device 2 may be a smartphone, a tablet computer, a laptop computer, a desktop computer, a rack server, a blade server, a tower server, or a rack-mounted server (including an independent server, or a server cluster composed of multiple servers) that executes programs.
  • the computer device 2 in this embodiment at least includes, but is not limited to: a memory 21, a processor 23, a network interface 22 and a website classification device 1 that can be communicatively connected to each other through a system bus (refer to FIG. 11). It should be pointed out that FIG. 11 only shows the computer device 2 having components 21-23, but it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead.
  • the memory 21 includes at least one type of computer-readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, etc.
  • the memory 21 may be an internal storage unit of the computer device 2 , such as a hard disk or a memory of the computer device 2 .
  • the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device.
  • the memory 21 is generally used to store the operating system and various application software installed on the computer device 2 , such as the program code of the website classification method in the first embodiment.
  • the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 23 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 23 is generally used to control the overall operation of the computer device 2 , such as performing control and processing related to data interaction or communication with the computer device 2 .
  • the processor 23 is configured to run the program code or process data stored in the memory 21, for example, run the website classification apparatus 1 and the like.
  • the network interface 22 may comprise a wireless network interface or a wired network interface, and the network interface 22 is generally used to establish a communication connection between the computer device 2 and other computer devices 2 .
  • the network interface 22 is used to connect the computer device 2 with an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 2 and the external terminal.
  • the network can be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.
  • FIG. 11 only shows the computer device 2 having components 21-23, but it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead.
  • the website classification apparatus 1 stored in the memory 21 may also be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 23) to complete the present application.
  • the present application also provides a computer-readable storage medium, which includes multiple storage media such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, server, application store, etc., on which computer-readable instructions are stored; when the instructions are executed by the processor 23, corresponding functions are realized.
  • the computer-readable storage medium of this embodiment is used to store the website classification apparatus 1, and when executed by the processor 23, implements the website classification method of the first embodiment.
  • the computer-readable storage medium may be non-volatile or volatile.


Abstract

提供了一种网站分类方法、装置、计算机设备及存储介质。网站分类方法通过采用标识网站类型的网站简介文本作为训练样本,对初始分类模型进行训练,获取文本分类模型;采用文本分类模型对待分类网站文本进行分类,以得到待分类网站文本的类别,从而实现基于网站简介文本快速准确地识别网站类型的目的。

Description

网站分类方法、装置、计算机设备及存储介质
本申请要求于2020年9月2日递交的申请号为202010910928.9、名称为“网站分类方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据分析领域,尤其涉及网站分类方法、装置、计算机设备及存储介质。
背景技术
互联网网站天然存在各种有价值的信息,但当前互联网技术发展迅猛,每天都有大量旧网站消失、新网站出现。面对如此庞大且日新月异的网站数量和网站种类,如何高效准确地筛选出特定类别的网站是挖掘网站信息的重要前提之一。
发明人意识到,目前网站分类的方法很多,大致可以分为三种情况:人工标注网站类别,人工成本高、效率低;人工维护网站类别以及相应关键字的字典,使用规则进行网站分类,需要耗费大量的人力去整理维护字典,准确率不高;利用机器学习的方法对网站数据进行分类,虽然大大释放了人力,但无法保证分类的准确率。
发明内容
针对现有网站分类方法准确率低的问题,现提供一种旨在提高网站分类准确率的网站分类方法、装置、计算机设备及存储介质。
为实现上述目的,本申请提供一种网站分类方法,包括:
采用训练样本集合对初始分类模型进行训练,获取文本分类模型;
其中,所述训练样本集合包括至少一个样本文本,所述样本文本为标识网站类型的网站简介文本;
获取待分类网站文本;
采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别。
为实现上述目的,本申请还提供了一种网站分类装置,包括:
训练单元,用于采用训练样本集合对初始分类模型进行训练,获取文本分类模型,其中,所述训练样本集合包括至少一个样本文本,所述样本文本为标识网站类型的网站简介文本;
获取单元,用于获取待分类网站文本;
预测单元,用于采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别。
为实现上述目的,本申请还提供了一种计算机设备,所述计算机设备包括存储器、处理器以及存储在存储器上并可在处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现上述网站分类方法,包括:
采用训练样本集合对初始分类模型进行训练,获取文本分类模型;
其中,所述训练样本集合包括至少一个样本文本,所述样本文本为标识网站类型的网站简介文本;
获取待分类网站文本;
采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别。
为实现上述目的,本申请还提供了一种计算机可读存储介质,其上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现上述网站分类方法,包括:
采用训练样本集合对初始分类模型进行训练,获取文本分类模型;
其中,所述训练样本集合包括至少一个样本文本,所述样本文本为标识网站类型的网站简介文本;
获取待分类网站文本;
采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别。
本申请提供的网站分类方法、装置、计算机设备及存储介质通过采用标识网站类型的网站简介文本作为训练样本,对初始分类模型进行训练,获取文本分类模型,极大地减少了训练过程中占用的存储量,提高了训练的速度;采用文本分类模型对待分类网站文本进行分类,以得到待分类网站文本的类别,从而实现基于网站简介文本快速准确地识别网站类型的目的。
附图说明
图1为本申请所述的网站分类方法的一种实施例的流程图;
图2为本申请采用训练样本集合对初始分类模型进行训练获取文本分类模型的一种实施例的流程图;
图3为通过第一LightGBM模型获取每个样本文本的第一分类向量的一种实施例的流程图;
图4为通过第一Bi-LSTM模型获取每个样本文本的第二分类向量的一种实施例的流程图;
图5为采用文本分类模型对待分类网站文本进行分类以获取待分类网站文本的类别的一种实施例的流程图;
图6为通过第二LightGBM模型获取待分类网站文本的第四分类向量的另一种实施例的流程图;
图7为通过第二Bi-LSTM模型获取待分类网站文本的第五分类向量的另一种实施例的流程图;
图8为本申请所述的网站分类装置的一种实施例的模块图;
图9为本申请所述训练单元的一种实施例的模块图;
图10为本申请所述预测单元的一种实施例的模块图;
图11为本申请计算机设备的一个实施例的硬件架构图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处所描述的具体实施例仅用以解释本申请,并不用于限定本申请。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
需要说明的是,在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互组合。
本申请提供的网站分类方法、装置、计算机设备及存储介质,适用于保险业务、金融业务等领域。本申请涉及人工智能,在机器学习中通过采用标识网站类型的网站简介文本作为训练样本,对初始分类模型进行训练,获取文本分类模型,极大地减少了训练过程中占用的存储量,提高了训练的速度;采用文本分类模型对待分类网站文本进行分类,以得到待分类网站文本的类别,从而实现基于网站简介文本快速准确地识别网站类型的目的。
实施例一
请参阅图1,本实施例的一种网站分类方法,包括:
S1.采用训练样本集合对初始分类模型进行训练,获取文本分类模型;
其中,所述训练样本集合包括至少一个样本文本,所述样本文本为标识网站类型的网站简介文本;
需要说明的是:所述初始分类模型包括第一LightGBM模型和第一Bi-LSTM模型,所述文本分类模型包括第二LightGBM模型和第二Bi-LSTM模型;
在步骤S1中采用所述训练样本集合对所述第一LightGBM模型和所述第一Bi-LSTM模型进行训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型。
具体地,参阅图2所示步骤S1可包括以下步骤:
S11.将所述训练样本集合中的至少一个所述样本文本输入所述第一LightGBM模型,获取每个所述样本文本的第一分类向量;
需要说明的是:训练样本集合可以存储于一区块链的节点中。本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。
进一步地,参阅图3所示步骤S11可包括以下步骤:
S111.对所述样本文本进行预处理,获取所述样本文本中的所有分词;
在本步骤中对样本文本进行预处理包括:去除样本文本中的中文及英文标点符号,删除英文字符及停用词,对样本文本进行分词,以得到样本文本中的所有分词。
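上述预处理步骤可以用如下Python代码示意。需要说明:这只是一个最小示意,真实场景中的中文分词应使用jieba等分词工具,此处假设文本已按空格粗分;停用词表stopwords为示例自定义,并非文中给出的实际词表:

```python
import re

def preprocess(text, stopwords=None):
    """对文本做预处理:去除中英文标点、英文字符及停用词,返回分词列表。

    注:真实场景应使用jieba等中文分词工具,此处为示意,
    假设输入已按空格粗分词。
    """
    stopwords = stopwords or set()
    # 删除英文字符
    text = re.sub(r"[A-Za-z]+", " ", text)
    # 去除中英文标点符号(非单词、非空白字符)及下划线
    text = re.sub(r"[^\w\s]|_", " ", text)
    # 按空格切分并过滤停用词
    return [t for t in text.split() if t and t not in stopwords]

tokens = preprocess("金融 资讯, 提供 stock 行情 与 数据!", stopwords={"与"})
# tokens == ["金融", "资讯", "提供", "行情", "数据"]
```

该函数的输出即为后续步骤S112中转换为词向量的分词序列。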
S112.采用gensim模块分别将每个所述分词转换为训练词向量,以得到所述样本文本的分词矩阵;
在gensim模块中采用word2vec函数训练语料,通过word2vec函数将每一个分词分别映射到一个训练词向量,从而得到由训练词向量组成的分词矩阵。在步骤S112中由于分词采用预设维度表示的训练词向量(如:训练词向量为300维),因此样本文本被表示成一个二维的分词矩阵。
S113.计算所述分词矩阵中每一个所述分词的词频-逆文件频率TF-IDF,将所述词频-逆文件频率作为相应的所述分词的权重;
由于机器学习的需求是将样本文本表示为一个一维的向量,因此,需要进一步将二维的分词矩阵转变成一个一维向量。在本步骤中计算每个分词的TF-IDF值,将TF-IDF值作为每个分词的权重。
S114.将所述分词矩阵中每个所述分词分别与相应的权重进行加权,以得到所述样本文本的第一分词向量;
在本步骤中,采用embedding方式将离散的训练词向量映射为连续向量,并与相应的权重进行加权,从而得到样本文本的第一分词向量。
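步骤S112至S114的流程(为每个分词取得词向量,再以TF-IDF作为权重对词向量加权求和,得到一维句向量)可以用如下纯Python代码示意。需要说明:词向量实际由gensim的word2vec训练得到且维度可为300维,此处为演示直接给定3维向量;TF-IDF的平滑方式各实现不尽相同,此处取一种常见写法,并非文中给出的具体实现:

```python
import math
from collections import Counter

def tfidf_weights(doc, corpus):
    """计算doc中每个分词的TF-IDF权重。

    doc为某一文档的分词列表,corpus为全部文档的分词列表的列表。
    """
    tf = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)
        idf = math.log(n_docs / (1 + df)) + 1  # 加1平滑,避免除零与零权重
        weights[term] = (count / len(doc)) * idf
    return weights

def sentence_vector(doc, word_vectors, weights):
    """按TF-IDF权重对词向量加权求和,得到一维句向量(文中的第一分词向量)。"""
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for term in doc:
        wv = word_vectors.get(term)
        if wv is None:
            continue  # 词表外的分词跳过
        weight = weights.get(term, 0.0)
        vec = [v + weight * x for v, x in zip(vec, wv)]
    return vec
```

得到的句向量即可作为LightGBM模型的输入特征。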
S115.将所述第一分词向量输入所述第一LightGBM模型进行训练,得到所述样本文本的第一分类向量。
在本步骤中采用第一LightGBM模型进行训练具有训练速度快、内存消耗低、准确率高的优点,可有效地提高样本文本的分类精准度。
S12.将所述训练样本集合的至少一个所述样本文本输入所述第一Bi-LSTM模型,获取每个所述样本文本的第二分类向量;
进一步地,参阅图4所示步骤S12可包括:
S121.对所述样本文本进行预处理,获取所述样本文本中的所有分词;
在本步骤中对样本文本进行预处理包括:去除样本文本中的中文及英文标点符号,删除英文字符及停用词,对样本文本进行分词,以得到样本文本中的所有分词。
S122.采用bert模型将所述样本文本中的所有分词转换为第二分词向量;
bert模型全称是:Bidirectional Encoder Representations from Transformers。bert模型的输入是样本文本中各个字/词的原始词向量,该向量既可以随机初始化,也可以利用Word2Vec等算法进行预训练以作为初始值;输出是样本文本中各个字/词融合了全文语义信息后的向量,即:第二分词向量。
S123.将所述第二分词向量输入所述第一Bi-LSTM模型进行训练,得到所述样本文本的第二分类向量。
在本实施例中,第一Bi-LSTM模型中增加了注意力机制,即:Attention层。未加入Attention层时,第一Bi-LSTM模型将最后一个时序的输出向量作为特征向量进行softmax分类;加入Attention层后,先计算每个时序的权重,再将所有时序的输出向量加权求和作为特征向量,然后进行softmax分类。在本实施例中通过增加Attention层可有效地提高第二分类向量的准确性。
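上述Attention层的计算过程(先计算每个时序的权重,再对所有时序的输出向量加权求和得到特征向量)可以用如下纯Python代码示意。其中各时序的打分scores在真实模型中由可学习参数计算得到,此处作为输入直接给定,仅为示意:

```python
import math

def softmax(scores):
    """对打分做softmax归一化,得到各时序的权重。"""
    m = max(scores)  # 减去最大值,保证数值稳定
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, scores):
    """hidden_states为各时序的输出向量列表,scores为各时序的打分。

    返回按softmax权重加权求和后的特征向量,供后续softmax分类使用。
    """
    weights = softmax(scores)
    dim = len(hidden_states[0])
    feat = [0.0] * dim
    for w, h in zip(weights, hidden_states):
        feat = [f + w * x for f, x in zip(feat, h)]
    return feat
```

当所有时序打分相同时,该函数退化为对各时序输出向量取平均。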
S13.分别将同一所述样本文本的所述第一分类向量和所述第二分类向量进行加权求和,以得到所述样本文本的训练分类;
例如:假设第一分类向量Y1:[y1,y2,…,yn],第二分类向量Y2:[y1,y2,…,yn],则样本文本进行加权求和Y=w1×Y1+(1-w1)×Y2;其中w1为预设的权重值,属于[0,1]。
在本实施例中,通过将同一所述样本文本的第一分类向量和第二分类向量进行加权求和,以获取样本文本的第三分类向量;将第三分类向量中概率分布最大的维度作为样本文本的训练分类结果。
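上述加权求和及取最大概率维度的过程可以用如下Python代码示意(函数名fuse_and_classify为示意性命名,w1为文中所述属于[0,1]的预设权重):

```python
def fuse_and_classify(y1, y2, w1=0.5):
    """按 Y = w1*Y1 + (1-w1)*Y2 加权求和得到第三分类向量,
    并取其中概率最大的维度作为分类结果。"""
    fused = [w1 * a + (1 - w1) * b for a, b in zip(y1, y2)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return fused, label

fused, label = fuse_and_classify([0.1, 0.7, 0.2], [0.3, 0.4, 0.3], w1=0.5)
# fused == [0.2, 0.55, 0.25], label == 1
```

步骤S33中对第四、第五分类向量的融合与此完全同理。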
S14.将每一个所述样本文本的所述训练分类与所述样本文本的网站类型标识进行匹配;
S15.判断所有所述样本文本的所述训练分类的匹配度是否大于预设阈值,若否,更新所述第一LightGBM模型的参数值,以及所述第一Bi-LSTM模型的参数值,直至完成对所述第一LightGBM模型和所述第一Bi-LSTM模型的训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型。
在本实施例中,预设阈值可以根据需要设定,如:90%,95%等。
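步骤S14、S15中匹配度的计算及阈值判断可以用如下Python代码示意(以匹配样本占比作为匹配度,函数名为示意性命名,阈值如文中所述可取90%、95%等):

```python
def training_converged(pred_labels, true_labels, threshold=0.9):
    """计算训练分类与网站类型标识的匹配度(匹配样本占比),
    并判断其是否大于预设阈值;未达到阈值时应继续更新
    第一LightGBM模型与第一Bi-LSTM模型的参数。"""
    matched = sum(1 for p, t in zip(pred_labels, true_labels) if p == t)
    match_rate = matched / len(true_labels)
    return match_rate, match_rate > threshold
```

返回False时对应文中"更新参数值,直至完成训练"的迭代分支。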
S2.获取待分类网站文本;
在本实施例中,待分类网站文本可以基于网站排行榜中的网站简介获得。
S3.采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别。
具体地,参阅图5所示步骤S3可包括:
S31.将所述待分类网站文本输入所述第二LightGBM模型,获取所述待分类网站文本的第四分类向量;
进一步地,参阅图6所示步骤S31可包括以下步骤:
S311.对所述待分类网站文本进行预处理,获取所述待分类网站文本中的所有分词;
在本步骤中对待分类网站文本进行预处理包括:去除待分类网站文本中的中文及英文标点符号,删除英文字符及停用词,对待分类网站文本进行分词,以得到待分类网站文本中的所有分词。
S312.采用gensim模块分别将每个所述分词转换为词向量,以得到所述待分类网站文本的分词矩阵;
在gensim模块中采用word2vec函数训练语料,通过word2vec函数将每一个分词分别映射到一个词向量,从而得到由词向量组成的分词矩阵。在步骤S312中因为分词采用预设维度表示的词向量,所以待分类网站文本被表示成一个二维的分词矩阵。
S313.计算所述分词矩阵中每一个所述分词的词频-逆文件频率,将所述词频-逆文件频率作为相应的所述分词的权重;
由于机器学习的需求是将待分类网站文本表示为一个一维的向量,因此,需要进一步将二维的分词矩阵转变成一个一维向量。在本步骤中采用计算每个分词的IT-IDF值,将IT-IDF值作为每个分词的权重。
TF-IDF值与词向量加权生成句向量的表征方法:TF-IDF用于评估一个单词在一个文档集合或语料库中的重要程度,预训练词向量的embedding方式则关注单词的语义,将两者结合起来表示文本,是对文本信息的丰富和扩充。
S314.将所述分词矩阵中每个所述分词分别与相应的权重进行加权,以得到所述待分类网站文本的第三分词向量;
在本步骤中,采用embedding方式将离散的词向量映射为连续向量,并与相应的权重进行加权,从而得到待分类网站文本的第三分词向量。
S315.将所述第三分词向量输入所述第二LightGBM模型进行预测,得到所述待分类网站文本的第四分类向量。
在本步骤中采用第二LightGBM模型进行预测具有预测速度快、内存消耗低、准确率高的优点,可有效地提高待分类网站文本的分类精准度。
S32.将所述待分类网站文本输入所述第二Bi-LSTM模型,获取所述待分类网站文本的第五分类向量;
进一步地,参阅图7所示步骤S32可包括以下步骤:
S321.对所述待分类网站文本进行预处理,获取所述待分类网站文本中的所有分词;
在本步骤中对待分类网站文本进行预处理包括:去除待分类网站文本中的中文及英文标点符号,删除英文字符及停用词,对待分类网站文本进行分词,以得到待分类网站文本中的所有分词。
S322.采用bert模型将所述待分类网站文本中的所有分词转换为第四分词向量;
bert模型全称是:Bidirectional Encoder Representations from Transformers。bert模型的输入是待分类网站文本中各个字/词的原始词向量,该向量既可以随机初始化,也可以利用Word2Vec等算法进行预训练以作为初始值;输出是待分类网站文本中各个字/词融合了全文语义信息后的向量,即:第四分词向量。
S323.将所述第四分词向量输入所述第二Bi-LSTM模型进行预测,得到所述待分类网站文本的第五分类向量。
在本实施例中,第二Bi-LSTM模型中增加了注意力机制,即:Attention层。未加入Attention层时,第二Bi-LSTM模型将最后一个时序的输出向量作为特征向量进行softmax分类;加入Attention层后,先计算每个时序的权重,再将所有时序的输出向量加权求和作为特征向量,然后进行softmax分类。在本实施例中通过增加Attention层可有效地提高第五分类向量的准确性。
S33.将所述待分类网站文本的所述第四分类向量和所述第五分类向量进行加权求和,以得到所述待分类网站文本的分类。
在本实施例中,通过将待分类网站文本的第四分类向量和第五分类向量进行加权求和,以获取待分类网站文本的第六分类向量;将第六分类向量中概率分布最大的维度作为待分类网站文本的分类结果。
在本实施例中,网站分类方法通过采用标识网站类型的网站简介文本作为训练样本,对初始分类模型进行训练,获取文本分类模型,极大地减少了训练过程中占用的存储量,提高了训练的速度;采用文本分类模型对待分类网站文本进行分类,以得到待分类网站文本的类别,从而实现基于网站简介文本快速准确地识别网站类型的目的。网站分类方法结合了Bi-LSTM模型和LightGBM模型对网站简介进行分类预测,提高了分类的准确率;相比于现有的语料训练,本申请从网站简介出发,该内容概括性强且短小精悍,占用内存少,大大降低了模型的开销。
实施例二
请参阅图8,本实施例的一种网站分类装置1,包括:训练单元11、获取单元12和预测单元13;其中:
训练单元11,用于采用训练样本集合对初始分类模型进行训练,获取文本分类模型,其中,所述训练样本集合包括至少一个样本文本,所述样本文本为标识网站类型的网站简介文本;
其中,所述训练样本集合包括至少一个样本文本,所述样本文本为标识网站类型的网站简介文本;训练样本集合可以存储于一区块链的节点中。本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。
需要说明的是:所述初始分类模型包括第一LightGBM模型和第一Bi-LSTM模型,所述文本分类模型包括第二LightGBM模型和第二Bi-LSTM模型;
训练单元11采用所述训练样本集合对所述第一LightGBM模型和所述第一Bi-LSTM模型进行训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型。
具体地,参阅图9所示训练单元11可包括:第一训练模块111、第二训练模块112、第一求和模块113、匹配模块114和处理模块115。
第一训练模块111,用于将所述训练样本集合中的至少一个所述样本文本输入所述第一LightGBM模型,获取每个所述样本文本的第一分类向量;
进一步地,参阅图3所示第一训练模块111通过第一LightGBM模型获取每个所述样本文本的第一分类向量可包括以下步骤:
S111.对所述样本文本进行预处理,获取所述样本文本中的所有分词;
S112.采用gensim模块分别将每个所述分词转换为训练词向量,以得到所述样本文本的分词矩阵;
S113.计算所述分词矩阵中每一个所述分词的词频-逆文件频率TF-IDF,将所述词频-逆文件频率作为相应的所述分词的权重;
S114.将所述分词矩阵中每个所述分词分别与相应的权重进行加权,以得到所述样本文本的第一分词向量;
S115.将所述第一分词向量输入所述第一LightGBM模型进行训练,得到所述样本文本的第一分类向量。
第二训练模块112,用于将所述训练样本集合的至少一个所述样本文本输入所述第一Bi-LSTM模型,获取每个所述样本文本的第二分类向量;
进一步地,参阅图4所示第二训练模块112通过第一Bi-LSTM模型获取每个所述样本文本的第二分类向量可包括以下步骤:
S121.对所述样本文本进行预处理,获取所述样本文本中的所有分词;
S122.采用bert模型将所述样本文本中的所有分词转换为第二分词向量;
S123.将所述第二分词向量输入所述第一Bi-LSTM模型进行训练,得到所述样本文本的第二分类向量。
在本实施例中,第一Bi-LSTM模型中增加了注意力机制,即:Attention层。未加入Attention层时,第一Bi-LSTM模型将最后一个时序的输出向量作为特征向量进行softmax分类;加入Attention层后,先计算每个时序的权重,再将所有时序的输出向量加权求和作为特征向量,然后进行softmax分类。在本实施例中通过增加Attention层可有效地提高第二分类向量的准确性。
第一求和模块113,用于分别将同一所述样本文本的所述第一分类向量和所述第二分类向量进行加权求和,以得到所述样本文本的训练分类;
例如:假设第一分类向量Y1:[y1,y2,…,yn],第二分类向量Y2:[y1,y2,…,yn],则样本文本进行加权求和Y=w1×Y1+(1-w1)×Y2;其中w1为预设的权重值,属于[0,1]。
在本实施例中,通过将同一所述样本文本的第一分类向量和第二分类向量进行加权求和,以获取样本文本的第三分类向量;将第三分类向量中概率分布最大的维度作为样本文本的训练分类结果。
匹配模块114,用于将每一个所述样本文本的所述训练分类与所述样本文本的网站类型标识进行匹配;
处理模块115,用于判断所有所述样本文本的所述训练分类的匹配度是否大于预设阈值,若否,更新所述第一LightGBM模型的参数值,以及所述第一Bi-LSTM模型的参数值,直至完成对所述第一LightGBM模型和所述第一Bi-LSTM模型的训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型。
在本实施例中,预设阈值可以根据需要设定,如:90%,95%等。
获取单元12,用于获取待分类网站文本;
在本实施例中,待分类网站文本可以基于网站排行榜中的网站简介获得。
预测单元13,用于采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别。
具体地,参阅图10所示预测单元13可包括:第一预测模块131、第二预测模块132和第二求和模块133。
第一预测模块131,用于将所述待分类网站文本输入所述第二LightGBM模型,获取所述待分类网站文本的第四分类向量;
进一步地,参阅图6所示第一预测模块131通过第二LightGBM模型获取所述待分类网站文本的第四分类向量可包括以下步骤:
S311.对所述待分类网站文本进行预处理,获取所述待分类网站文本中的所有分词;
S312.采用gensim模块分别将每个所述分词转换为词向量,以得到所述待分类网站文本的分词矩阵;
S313.计算所述分词矩阵中每一个所述分词的词频-逆文件频率,将所述词频-逆文件频率作为相应的所述分词的权重;
S314.将所述分词矩阵中每个所述分词分别与相应的权重进行加权,以得到所述待分类网站文本的第三分词向量;
S315.将所述第三分词向量输入所述第二LightGBM模型进行预测,得到所述待分类网站文本的第四分类向量。
在本实施例中,TF-IDF值与词向量加权生成句向量的表征方法:TF-IDF用于评估一个单词在一个文档集合或语料库中的重要程度,预训练词向量的embedding方式则关注单词的语义,将两者结合起来表示文本,是对文本信息的丰富和扩充。
第二预测模块132,用于将所述待分类网站文本输入所述第二Bi-LSTM模型,获取所述待分类网站文本的第五分类向量;
进一步地,参阅图7所示第二预测模块132通过第二Bi-LSTM模型,获取所述待分类网站文本的第五分类向量可包括以下步骤:
S321.对所述待分类网站文本进行预处理,获取所述待分类网站文本中的所有分词;
S322.采用bert模型将所述待分类网站文本中的所有分词转换为第四分词向量;
S323.将所述第四分词向量输入所述第二Bi-LSTM模型进行预测,得到所述待分类网站文本的第五分类向量。
第二求和模块133,用于将所述待分类网站文本的所述第四分类向量和所述第五分类向量进行加权求和,以得到所述待分类网站文本的分类。
在本实施例中,通过将待分类网站文本的第四分类向量和第五分类向量进行加权求和,以获取待分类网站文本的第六分类向量;将第六分类向量中概率分布最大的维度作为待分类网站文本的分类结果。
在本实施例中,网站分类装置1通过采用标识网站类型的网站简介文本作为训练样本,对初始分类模型进行训练,获取文本分类模型,极大地减少了训练过程中占用的存储量,提高了训练的速度;采用文本分类模型对待分类网站文本进行分类,以得到待分类网站文本的类别,从而实现基于网站简介文本快速准确地识别网站类型的目的。网站分类方法结合了Bi-LSTM模型和LightGBM模型对网站简介进行分类预测,提高了分类的准确率;相比于现有的语料训练,本申请从网站简介出发,该内容概括性强且短小精悍,占用内存少,大大降低了模型的开销。
实施例三
为实现上述目的,本申请还提供一种计算机设备2,该计算机设备2包括多个计算机设备2,实施例二的网站分类装置1的组成部分可分散于不同的计算机设备2中,计算机设备2可以是执行程序的智能手机、平板电脑、笔记本电脑、台式计算机、机架式服务器、刀片式服务器、塔式服务器或机柜式服务器(包括独立的服务器,或者多个服务器所组成的服务器集群)等。本实施例的计算机设备2至少包括但不限于:可通过系统总线相互通信连接的存储器21、处理器23、网络接口22以及网站分类装置1(参考图11)。需要指出的是,图11仅示出了具有组件21-23的计算机设备2,但是应理解的是,并不要求实施所有示出的组件,可以替代地实施更多或者更少的组件。
本实施例中,所述存储器21至少包括一种类型的计算机可读存储介质,所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。在一些实施例中,存储器21可以是计算机设备2的内部存储单元,例如该计算机设备2的硬盘或内存。在另一些实施例中,存储器21也可以是计算机设备2的外部存储设备,例如该计算机设备2上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,所述存储器21还可以既包括计算机设备2的内部存储单元也包括其外部存储设备。本实施例中,存储器21通常用于存储安装于计算机设备2的操作系统和各类应用软件,例如实施例一的网站分类方法的程序代码等。此外,存储器21还可以用于暂时地存储已经输出或者将要输出的各类数据。
所述处理器23在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器23通常用于控制计算机设备2的总体操作,例如执行与所述计算机设备2进行数据交互或者通信相关的控制和处理等。本实施例中,所述处理器23用于运行所述存储器21中存储的程序代码或者处理数据,例如运行所述的网站分类装置1等。
所述网络接口22可包括无线网络接口或有线网络接口,该网络接口22通常用于在所述计算机设备2与其他计算机设备2之间建立通信连接。例如,所述网络接口22用于通过网络将所述计算机设备2与外部终端相连,在所述计算机设备2与外部终端之间建立数据传输通道和通信连接等。所述网络可以是企业内部网(Intranet)、互联网(Internet)、全球移动通讯系统(Global System of Mobile communication,GSM)、宽带码分多址(Wideband Code Division Multiple Access,WCDMA)、4G网络、5G网络、蓝牙(Bluetooth)、Wi-Fi等无线或有线网络。
在本实施例中,存储于存储器21中的所述网站分类装置1还可以被分割为一个或者多个程序模块,所述一个或者多个程序模块被存储于存储器21中,并由一个或多个处理器(本实施例为处理器23)所执行,以完成本申请。
实施例四
为实现上述目的,本申请还提供一种计算机可读存储介质,其包括多个存储介质,如闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘、服务器、App应用商城等等,其上存储有计算机可读指令,程序被处理器23执行时实现相应功能。本实施例的计算机可读存储介质用于存储网站分类装置1,被处理器23执行时实现实施例一的网站分类方法。所述计算机可读存储介质可以是非易失性,也可以是易失性。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。
以上仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。

Claims (20)

  1. 一种网站分类方法,其中,包括:
    采用训练样本集合对初始分类模型进行训练,获取文本分类模型;
    其中,所述训练样本集合包括至少一个样本文本,所述样本文本为标识网站类型的网站简介文本;
    获取待分类网站文本;
    采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别。
  2. 根据权利要求1所述的网站分类方法,其中,所述初始分类模型包括第一LightGBM模型和第一Bi-LSTM模型,所述文本分类模型包括第二LightGBM模型和第二Bi-LSTM模型;
    所述采用训练样本集合对初始分类模型进行训练,获取文本分类模型,包括:
    采用所述训练样本集合对所述第一LightGBM模型和所述第一Bi-LSTM模型进行训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型。
  3. 根据权利要求2所述的网站分类方法,其中,所述采用所述训练样本集合对所述第一LightGBM模型和所述第一Bi-LSTM模型进行训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型,包括:
    将所述训练样本集合中的至少一个所述样本文本输入所述第一LightGBM模型,获取每个所述样本文本的第一分类向量;
    将所述训练样本集合的至少一个所述样本文本输入所述第一Bi-LSTM模型,获取每个所述样本文本的第二分类向量;
    分别将同一所述样本文本的所述第一分类向量和所述第二分类向量进行加权求和,以得到所述样本文本的训练分类;
    将每一个所述样本文本的所述训练分类与所述样本文本的网站类型标识进行匹配;
    判断所有所述样本文本的所述训练分类的匹配度是否大于预设阈值,若否,更新所述第一LightGBM模型的参数值,以及所述第一Bi-LSTM模型的参数值,直至完成对所述第一LightGBM模型和所述第一Bi-LSTM模型的训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型。
  4. 根据权利要求3所述的网站分类方法,其中,所述将所述训练样本集合中的至少一个所述样本文本输入所述第一LightGBM模型,获取每个所述样本文本的第一分类向量,包括:
    对所述样本文本进行预处理,获取所述样本文本中的所有分词;
    采用gensim模块分别将每个所述分词转换为训练词向量,以得到所述样本文本的分词矩阵;
    计算所述分词矩阵中每一个所述分词的词频-逆文件频率,将所述词频-逆文件频率作为相应的所述分词的权重;
    将所述分词矩阵中每个所述分词分别与相应的权重进行加权,以得到所述样本文本的第一分词向量;
    将所述第一分词向量输入所述第一LightGBM模型进行训练,得到所述样本文本的第一分类向量。
  5. 根据权利要求3所述的网站分类方法,其中,所述将所述训练样本集合的至少一个所述样本文本输入所述第一Bi-LSTM模型,获取每个所述样本文本的第二分类向量,包括:
    对所述样本文本进行预处理,获取所述样本文本中的所有分词;
    采用bert模型将所述样本文本中的所有分词转换为第二分词向量;
    将所述第二分词向量输入所述第一Bi-LSTM模型进行训练,得到所述样本文本的第二分类向量。
  6. 根据权利要求2所述的网站分类方法,其中,所述采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别,包括:
    将所述待分类网站文本输入所述第二LightGBM模型,获取所述待分类网站文本的第四分类向量;
    将所述待分类网站文本输入所述第二Bi-LSTM模型,获取所述待分类网站文本的第五分类向量;
    将所述待分类网站文本的所述第四分类向量和所述第五分类向量进行加权求和,以得到所述待分类网站文本的分类。
  7. 根据权利要求6所述的网站分类方法,其中,所述将所述待分类网站文本输入所述第二LightGBM模型,获取所述待分类网站文本的第四分类向量,包括:
    对所述待分类网站文本进行预处理,获取所述待分类网站文本中的所有分词;
    采用gensim模块分别将每个所述分词转换为词向量,以得到所述待分类网站文本的分词矩阵;
    计算所述分词矩阵中每一个所述分词的词频-逆文件频率,将所述词频-逆文件频率作为相应的所述分词的权重;
    将所述分词矩阵中每个所述分词分别与相应的权重进行加权,以得到所述待分类网站文本的第三分词向量;
    将所述第三分词向量输入所述第二LightGBM模型进行预测,得到所述待分类网站文本的第四分类向量。
  8. 一种网站分类装置,其中,包括:
    训练单元,用于采用训练样本集合对初始分类模型进行训练,获取文本分类模型,其中,所述训练样本集合包括至少一个样本文本,所述样本文本为标识网站类型的网站简介文本;
    获取单元,用于获取待分类网站文本;
    预测单元,用于采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别。
  9. 一种计算机设备,其中,所述计算机设备包括存储器、处理器以及存储在存储器上并可在处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现一种网站分类方法包括:
    采用训练样本集合对初始分类模型进行训练,获取文本分类模型;
    其中,所述训练样本集合包括至少一个样本文本,所述样本文本为标识网站类型的网站简介文本;
    获取待分类网站文本;
    采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别。
  10. 根据权利要求9所述的计算机设备,其中,所述初始分类模型包括第一LightGBM模型和第一Bi-LSTM模型,所述文本分类模型包括第二LightGBM模型和第二Bi-LSTM模型;
    所述采用训练样本集合对初始分类模型进行训练,获取文本分类模型,包括:
    采用所述训练样本集合对所述第一LightGBM模型和所述第一Bi-LSTM模型进行训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型。
  11. 根据权利要求10所述的计算机设备,其中,所述采用所述训练样本集合对所述第一LightGBM模型和所述第一Bi-LSTM模型进行训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型,包括:
    将所述训练样本集合中的至少一个所述样本文本输入所述第一LightGBM模型,获取每个所述样本文本的第一分类向量;
    将所述训练样本集合的至少一个所述样本文本输入所述第一Bi-LSTM模型,获取每个所述样本文本的第二分类向量;
    分别将同一所述样本文本的所述第一分类向量和所述第二分类向量进行加权求和,以得到所述样本文本的训练分类;
    将每一个所述样本文本的所述训练分类与所述样本文本的网站类型标识进行匹配;
    判断所有所述样本文本的所述训练分类的匹配度是否大于预设阈值,若否,更新所述第一LightGBM模型的参数值,以及所述第一Bi-LSTM模型的参数值,直至完成对所述第一LightGBM模型和所述第一Bi-LSTM模型的训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型。
  12. 根据权利要求11所述的计算机设备,其中,所述将所述训练样本集合中的至少一个所述样本文本输入所述第一LightGBM模型,获取每个所述样本文本的第一分类向量,包括:
    对所述样本文本进行预处理,获取所述样本文本中的所有分词;
    采用gensim模块分别将每个所述分词转换为训练词向量,以得到所述样本文本的分词矩阵;
    计算所述分词矩阵中每一个所述分词的词频-逆文件频率,将所述词频-逆文件频率作为相应的所述分词的权重;
    将所述分词矩阵中每个所述分词分别与相应的权重进行加权,以得到所述样本文本的第一分词向量;
    将所述第一分词向量输入所述第一LightGBM模型进行训练,得到所述样本文本的第一分类向量。
  13. 根据权利要求11所述的计算机设备,其中,所述将所述训练样本集合的至少一个所述样本文本输入所述第一Bi-LSTM模型,获取每个所述样本文本的第二分类向量,包括:
    对所述样本文本进行预处理,获取所述样本文本中的所有分词;
    采用bert模型将所述样本文本中的所有分词转换为第二分词向量;
    将所述第二分词向量输入所述第一Bi-LSTM模型进行训练,得到所述样本文本的第二分类向量。
  14. 根据权利要求10所述的计算机设备,其中,所述采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别,包括:
    将所述待分类网站文本输入所述第二LightGBM模型,获取所述待分类网站文本的第四分类向量;
    将所述待分类网站文本输入所述第二Bi-LSTM模型,获取所述待分类网站文本的第五分类向量;
    将所述待分类网站文本的所述第四分类向量和所述第五分类向量进行加权求和,以得到所述待分类网站文本的分类。
  15. 一种计算机可读存储介质,其上存储有计算机可读指令,其中:所述计算机可读指令被处理器执行时实现一种网站分类方法包括:
    采用训练样本集合对初始分类模型进行训练,获取文本分类模型;
    其中,所述训练样本集合包括至少一个样本文本,所述样本文本为标识网站类型的网站简介文本;
    获取待分类网站文本;
    采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别。
  16. 根据权利要求15所述的计算机可读存储介质,其中,所述初始分类模型包括第一LightGBM模型和第一Bi-LSTM模型,所述文本分类模型包括第二LightGBM模型和第二Bi-LSTM模型;
    所述采用训练样本集合对初始分类模型进行训练,获取文本分类模型,包括:
    采用所述训练样本集合对所述第一LightGBM模型和所述第一Bi-LSTM模型进行训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型。
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述采用所述训练样本集合对所述第一LightGBM模型和所述第一Bi-LSTM模型进行训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型,包括:
    将所述训练样本集合中的至少一个所述样本文本输入所述第一LightGBM模型,获取每个所述样本文本的第一分类向量;
    将所述训练样本集合的至少一个所述样本文本输入所述第一Bi-LSTM模型,获取每个所述样本文本的第二分类向量;
    分别将同一所述样本文本的所述第一分类向量和所述第二分类向量进行加权求和,以得到所述样本文本的训练分类;
    将每一个所述样本文本的所述训练分类与所述样本文本的网站类型标识进行匹配;
    判断所有所述样本文本的所述训练分类的匹配度是否大于预设阈值,若否,更新所述第一LightGBM模型的参数值,以及所述第一Bi-LSTM模型的参数值,直至完成对所述第一LightGBM模型和所述第一Bi-LSTM模型的训练,获取所述第二LightGBM模型和所述第二Bi-LSTM模型。
  18. 根据权利要求17所述的计算机可读存储介质,其中,所述将所述训练样本集合中的至少一个所述样本文本输入所述第一LightGBM模型,获取每个所述样本文本的第一分类向量,包括:
    对所述样本文本进行预处理,获取所述样本文本中的所有分词;
    采用gensim模块分别将每个所述分词转换为训练词向量,以得到所述样本文本的分词矩阵;
    计算所述分词矩阵中每一个所述分词的词频-逆文件频率,将所述词频-逆文件频率作为相应的所述分词的权重;
    将所述分词矩阵中每个所述分词分别与相应的权重进行加权,以得到所述样本文本的第一分词向量;
    将所述第一分词向量输入所述第一LightGBM模型进行训练,得到所述样本文本的第一分类向量。
  19. 根据权利要求17所述的计算机可读存储介质,其中,所述将所述训练样本集合的至少一个所述样本文本输入所述第一Bi-LSTM模型,获取每个所述样本文本的第二分类向量,包括:
    对所述样本文本进行预处理,获取所述样本文本中的所有分词;
    采用bert模型将所述样本文本中的所有分词转换为第二分词向量;
    将所述第二分词向量输入所述第一Bi-LSTM模型进行训练,得到所述样本文本的第二分类向量。
  20. 根据权利要求16所述的计算机可读存储介质,其中,所述采用所述文本分类模型对所述待分类网站文本进行分类,以获取所述待分类网站文本的类别,包括:
    将所述待分类网站文本输入所述第二LightGBM模型,获取所述待分类网站文本的第四分类向量;
    将所述待分类网站文本输入所述第二Bi-LSTM模型,获取所述待分类网站文本的第五分类向量;
    将所述待分类网站文本的所述第四分类向量和所述第五分类向量进行加权求和,以得到所述待分类网站文本的分类。
PCT/CN2021/109553 2020-09-02 2021-07-30 网站分类方法、装置、计算机设备及存储介质 WO2022048363A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010910928.9 2020-09-02
CN202010910928.9A CN111984792A (zh) 2020-09-02 2020-09-02 网站分类方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022048363A1 true WO2022048363A1 (zh) 2022-03-10

Family

ID=73448456

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109553 WO2022048363A1 (zh) 2020-09-02 2021-07-30 网站分类方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN111984792A (zh)
WO (1) WO2022048363A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115146704A (zh) * 2022-05-27 2022-10-04 中睿信数字技术有限公司 基于分布式数据库和机器学习的事件自动分类方法和系统
CN115982505A (zh) * 2023-03-16 2023-04-18 北京匠数科技有限公司 基于vlm的网站检测方法和装置
CN117591674A (zh) * 2024-01-18 2024-02-23 交通运输部公路科学研究所 基于文本分类模型对桥梁检评文本的自动分类方法

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111984792A (zh) * 2020-09-02 2020-11-24 深圳壹账通智能科技有限公司 网站分类方法、装置、计算机设备及存储介质
CN113051500B (zh) * 2021-03-25 2022-08-16 武汉大学 一种融合多源数据的钓鱼网站识别方法及系统
CN113360657B (zh) * 2021-06-30 2023-10-24 安徽商信政通信息技术股份有限公司 一种公文智能分发办理方法、装置及计算机设备
CN113656738A (zh) * 2021-08-25 2021-11-16 成都知道创宇信息技术有限公司 网站分类方法、装置、电子设备及可读存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106960040A (zh) * 2017-03-27 2017-07-18 北京神州绿盟信息安全科技股份有限公司 一种url的类别确定方法及装置
US20170337266A1 (en) * 2016-05-19 2017-11-23 Conduent Business Services, Llc Method and system for data processing for text classification of a target domain
US20180240012A1 (en) * 2017-02-17 2018-08-23 Wipro Limited Method and system for determining classification of text
CN110442823A (zh) * 2019-08-06 2019-11-12 北京智游网安科技有限公司 网站分类方法、网站类型判断方法、存储介质及智能终端
CN111428034A (zh) * 2020-03-23 2020-07-17 京东数字科技控股有限公司 分类模型的训练方法、评论信息的分类方法及装置
CN111554268A (zh) * 2020-07-13 2020-08-18 腾讯科技(深圳)有限公司 基于语言模型的语言识别方法、文本分类方法和装置
CN111984792A (zh) * 2020-09-02 2020-11-24 深圳壹账通智能科技有限公司 网站分类方法、装置、计算机设备及存储介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170337266A1 (en) * 2016-05-19 2017-11-23 Conduent Business Services, Llc Method and system for data processing for text classification of a target domain
US20180240012A1 (en) * 2017-02-17 2018-08-23 Wipro Limited Method and system for determining classification of text
CN106960040A (zh) * 2017-03-27 2017-07-18 北京神州绿盟信息安全科技股份有限公司 一种url的类别确定方法及装置
CN110442823A (zh) * 2019-08-06 2019-11-12 北京智游网安科技有限公司 网站分类方法、网站类型判断方法、存储介质及智能终端
CN111428034A (zh) * 2020-03-23 2020-07-17 京东数字科技控股有限公司 分类模型的训练方法、评论信息的分类方法及装置
CN111554268A (zh) * 2020-07-13 2020-08-18 腾讯科技(深圳)有限公司 基于语言模型的语言识别方法、文本分类方法和装置
CN111984792A (zh) * 2020-09-02 2020-11-24 深圳壹账通智能科技有限公司 网站分类方法、装置、计算机设备及存储介质

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115146704A (zh) * 2022-05-27 2022-10-04 中睿信数字技术有限公司 基于分布式数据库和机器学习的事件自动分类方法和系统
CN115146704B (zh) * 2022-05-27 2023-11-07 中睿信数字技术有限公司 基于分布式数据库和机器学习的事件自动分类方法和系统
CN115982505A (zh) * 2023-03-16 2023-04-18 北京匠数科技有限公司 基于vlm的网站检测方法和装置
CN117591674A (zh) * 2024-01-18 2024-02-23 交通运输部公路科学研究所 基于文本分类模型对桥梁检评文本的自动分类方法
CN117591674B (zh) * 2024-01-18 2024-04-26 交通运输部公路科学研究所 基于文本分类模型对桥梁检评文本的自动分类方法

Also Published As

Publication number Publication date
CN111984792A (zh) 2020-11-24

Similar Documents

Publication Publication Date Title
WO2022048363A1 (zh) 网站分类方法、装置、计算机设备及存储介质
CN111177569B (zh) 基于人工智能的推荐处理方法、装置及设备
CN112818093B (zh) 基于语义匹配的证据文档检索方法、系统及存储介质
WO2020147409A1 (zh) 一种文本分类方法、装置、计算机设备及存储介质
CN112632278A (zh) 一种基于多标签分类的标注方法、装置、设备及存储介质
CN113051914A (zh) 一种基于多特征动态画像的企业隐藏标签抽取方法及装置
CN112883730B (zh) 相似文本匹配方法、装置、电子设备及存储介质
CN113656547B (zh) 文本匹配方法、装置、设备及存储介质
CN112686022A (zh) 违规语料的检测方法、装置、计算机设备及存储介质
CN112149387A (zh) 财务数据的可视化方法、装置、计算机设备及存储介质
CN111709225B (zh) 一种事件因果关系判别方法、装置和计算机可读存储介质
CN112231416A (zh) 知识图谱本体更新方法、装置、计算机设备及存储介质
CN112686053A (zh) 一种数据增强方法、装置、计算机设备及存储介质
CN111985212A (zh) 文本关键字识别方法、装置、计算机设备及可读存储介质
WO2021012958A1 (zh) 原创文本甄别方法、装置、设备与计算机可读存储介质
CN114266255B (zh) 基于聚类模型的语料分类方法、装置、设备及存储介质
CN112529743B (zh) 合同要素抽取方法、装置、电子设备及介质
CN111767399B (zh) 一种基于不均衡文本集的情感分类器构建方法、装置、设备和介质
CN114676307A (zh) 基于用户检索的排序模型训练方法、装置、设备及介质
CN114840872A (zh) 秘密文本脱敏方法、装置、计算机设备及可读存储介质
CN114328894A (zh) 文档处理方法、装置、电子设备及介质
CN112199954A (zh) 基于语音语义的疾病实体匹配方法、装置及计算机设备
CN110909777A (zh) 一种多维特征图嵌入方法、装置、设备及介质
CN114238574B (zh) 基于人工智能的意图识别方法及其相关设备
CN112732913B (zh) 一种非均衡样本的分类方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21863431

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 30/06/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21863431

Country of ref document: EP

Kind code of ref document: A1