WO2019205374A1 - Model online training method, server, and storage medium - Google Patents

Model online training method, server, and storage medium

Info

Publication number
WO2019205374A1
WO2019205374A1 (PCT application PCT/CN2018/102114)
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
model
training
words
server
Prior art date
Application number
PCT/CN2018/102114
Other languages
English (en)
French (fr)
Inventor
吴壮伟 (Wu Zhuangwei)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2019205374A1 publication Critical patent/WO2019205374A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • the present application relates to the field of data processing technologies, and in particular, to a model online training method, a server, and a computer readable storage medium.
  • with the rapid development of artificial intelligence (AI), deep learning by means of model training has been widely applied in various fields.
  • current model training relies mainly on static data for offline training, that is, training based on static data obtained internally.
  • this offline training mechanism is poorly suited to scenarios with high real-time requirements, and is also unsuitable for scenarios that require large amounts of external data for deep learning.
  • in view of the above, the present application provides a model online training method, a server, and a computer readable storage medium, whose main purpose is to capture external data in a targeted manner and implement fully automated online training of a model.
  • the present application provides an online training method for a model, the method comprising:
  • Crawling step: receiving the seed webpage sent by the user, and using a crawler tool to crawl all the webpage links in the seed webpage;
  • Screening step: using a preset filter to filter out the webpage links that meet the requirements;
  • Pre-processing step: obtaining the webpage source code from the webpages pointed to by the filtered webpage links, pre-processing the source code, and obtaining the available word set of the webpage;
  • Invoking step: using the ETL processing method to store the obtained available word set into a database in a preset format as training data, and calling the model to perform online training on the training data;
  • Generation step: adjusting the model parameters, verifying the model accuracy, and generating a complete model;
  • Storage step: storing the generated model to a specified directory of the model server.
  • in addition, the present application further provides a server, comprising a memory, a processor, and a display, wherein the memory stores a model online training program which, when executed by the processor, implements the following steps:
  • Crawling step: receiving the seed webpage sent by the user, and using a crawler tool to crawl all the webpage links in the seed webpage;
  • Screening step: using a preset filter to filter out the webpage links that meet the requirements;
  • Pre-processing step: obtaining the webpage source code from the webpages pointed to by the filtered webpage links, pre-processing the source code, and obtaining the available word set of the webpage;
  • Invoking step: using the ETL processing method to store the obtained available word set into a database in a preset format as training data, and calling the model to perform online training on the training data;
  • Generation step: adjusting the model parameters, verifying the model accuracy, and generating a complete model;
  • Storage step: storing the generated model to a specified directory of the model server.
  • in addition, the present application further provides a computer readable storage medium, which includes a model online training program; when the model online training program is executed by a processor, any step of the model online training method described above can be implemented.
  • the model online training method, server, and computer readable storage medium proposed by the present application use a crawler tool to crawl all webpage links from a user-specified seed webpage and filter out the links that meet the requirements. The webpage source code is then obtained and pre-processed, and the resulting available word set is stored into a database. Finally, the data in the database is used for training, the parameters are adjusted, the accuracy is verified, and the complete model is stored to a specified directory, so that online training of the model can be completed automatically from dynamic data.
  • FIG. 1 is a schematic diagram of a preferred embodiment of a server of the present application.
  • FIG. 2 is a block diagram showing a preferred embodiment of the model online training program of FIG. 1;
  • FIG. 3 is a schematic diagram of the function of the program module of Figure 2;
  • FIG. 4 is a flow chart of a preferred embodiment of an online training method for a model of the present application.
  • those skilled in the art will appreciate that the embodiments of the present application can be implemented as a method, apparatus, device, system, or computer program product. Accordingly, the present application can be embodied entirely in hardware, entirely in software (including firmware, resident software, microcode, etc.), or as a combination of hardware and software.
  • FIG. 1 is a schematic diagram of a preferred embodiment of the server 1 of the present application.
  • the server 1 refers to a product service platform, and may be a server, a tablet computer, a personal computer, a portable computer, or another electronic device having computing functions.
  • the server 1 includes a memory 11, a processor 12, a display 13, a network interface 14, and a communication bus 15.
  • the network interface 14 can optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • Communication bus 15 is used to implement connection communication between these components.
  • the memory 11 includes at least one type of readable storage medium.
  • the at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card type memory, or the like.
  • the memory 11 may be an internal storage unit of the server 1, such as a hard disk of the server 1.
  • in other embodiments, the memory 11 may also be an external storage unit of the server 1, such as a plug-in hard disk equipped on the server 1, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, etc.
  • the processor 12, in some embodiments, may be a Central Processing Unit (CPU), microprocessor, or other data processing chip for running the program code stored in the memory 11 or processing data, such as executing the model online training program 10.
  • Display 13 can be referred to as a display screen or display unit.
  • the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch display, or the like.
  • the display 13 is used to display information processed in the server 1 and a work interface for displaying visualizations, such as displaying available sets of words obtained after pre-processing.
  • the server 1 may further include a user interface
  • the user interface may include an input unit such as a keyboard, and an audio output device such as a speaker or a headset.
  • the user interface may further include a standard wired interface and a wireless interface.
  • the server 1 further comprises a touch sensor.
  • the area provided by the touch sensor for the user to perform a touch operation is referred to as a touch area.
  • the touch sensor described herein may be a resistive touch sensor, a capacitive touch sensor, or the like.
  • the touch sensor includes not only a contact type touch sensor but also a proximity type touch sensor or the like.
  • the touch sensor may be a single sensor or a plurality of sensors arranged, for example, in an array. The user can launch the model online training program 10 by touching the touch area.
  • the server 1 may also include radio frequency (RF) circuits, sensors, audio circuits, and the like, and details are not described herein.
  • the memory 11, as a computer storage medium, stores the program code of the model online training program 10.
  • when the processor 12 executes the program code of the model online training program 10, the following steps are implemented:
  • Crawling step: receiving the seed webpage sent by the user, and using a crawler tool to crawl all the webpage links in the seed webpage;
  • Screening step: using a preset filter to filter out the webpage links that meet the requirements;
  • Pre-processing step: obtaining the webpage source code from the webpages pointed to by the filtered webpage links, pre-processing the source code, and obtaining the available word set of the webpage;
  • Invoking step: using the ETL processing method to store the obtained available word set into a database in a preset format as training data, and calling the model to perform online training on the training data;
  • Generation step: adjusting the model parameters, verifying the model accuracy, and generating a complete model;
  • Storage step: storing the generated model to a specified directory of the model server.
  • for the specific principles, refer to the following description of FIG. 2, a block diagram of the modules of the model online training program 10, and FIG. 4, a flowchart of a preferred embodiment of the model online training method.
  • FIG. 2 is a block diagram of a preferred embodiment of the model online training program 10 of FIG. 1.
  • a module as referred to in this application refers to a series of computer program instructions that are capable of performing a particular function.
  • the model online training program 10 includes: a crawling module 110, a screening module 120, a preprocessing module 130, a calling module 140, a generating module 150, and a storage module 160.
  • the crawling module 110 is configured to receive the seed webpage sent by the user 2, and use the crawler tool to crawl all the webpage links in the seed webpage.
  • the seed webpage is a webpage preset by the user. Crawling the links in the seed webpage with the crawler tool is untargeted; every webpage link contained in the seed webpage is crawled out.
  • the crawler tool may be one or more of Beautiful Soup, Scrapy, mechanize, and cola.
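The crawling step can be sketched in a few lines. This is a minimal illustration only: it uses the Python standard library's `HTMLParser` as a stand-in for the crawler tools named above (Beautiful Soup, Scrapy, etc.), and it parses an in-memory page rather than fetching one over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCrawler(HTMLParser):
    """Collects every href found in a page, mirroring the untargeted
    crawl of all webpage links inside the seed webpage."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the seed webpage.
                    self.links.append(urljoin(self.base_url, value))


def crawl_links(seed_url, html):
    """Return all webpage links contained in the seed webpage's HTML."""
    parser = LinkCrawler(seed_url)
    parser.feed(html)
    return parser.links


page = ('<a href="/travels/xinjiang100008/3643336.html">travelogue</a>'
        '<a href="http://you.ctrip.com/asks/shanghai2/5483046.html">Q&A</a>')
print(crawl_links("http://www.ctrip.com/", page))
```

In a real deployment the page body would be fetched from the seed URL first; only the link-extraction logic is shown here.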
  • the screening module 120 is configured to filter out the webpage links that meet the requirements by using a preset filter. Specifically, the screening module 120 sets up the corresponding URL library 31 according to the specific model being trained; the filter accesses the page pointed to by each webpage link and assigns the visited link to the corresponding URL library 31. For example, for the travel recommendation model, a travelogue URL library and a question-and-answer URL library are set up: the travelogue URL library stores travelogue links, and the question-and-answer URL library stores question-and-answer links.
  • the filter analyzes the page pointed to by each webpage link, determines whether the link is a travelogue link or a question-and-answer link, and assigns it to the corresponding URL library 31.
  • the filters include, but are not limited to, jQuery filters, and may be other webpage intent filters.
  • it should be understood, however, that the URL library 31 corresponding to the travel recommendation model includes, but is not limited to, the travelogue URL library and the question-and-answer URL library.
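The screening step above can be sketched as follows. The patent does not give concrete filter rules, so the URL path patterns below (`/travels/`, `/asks/`) are hypothetical stand-ins for a real webpage intent filter, which would inspect the page content itself.

```python
import re

# Hypothetical URL-library patterns: travelogue links vs. Q&A links.
URL_LIBRARIES = {
    "travelogue": re.compile(r"/travels/"),
    "question_answer": re.compile(r"/asks/"),
}


def assign_to_libraries(links):
    """Assign each crawled link to the matching URL library; links that
    match no library are dropped as not meeting the requirements."""
    libraries = {name: [] for name in URL_LIBRARIES}
    for link in links:
        for name, pattern in URL_LIBRARIES.items():
            if pattern.search(link):
                libraries[name].append(link)
                break
    return libraries


libs = assign_to_libraries([
    "http://you.ctrip.com/travels/xinjiang100008/3643336.html",
    "http://you.ctrip.com/asks/shanghai2/5483046.html",
    "http://www.ctrip.com/about.html",
])
print(libs)
```

Adding a new URL library (e.g. for the gourmet or ranking libraries mentioned later) is then just another entry in the pattern table.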
  • the pre-processing module 130 is configured to obtain the source code of the webpage from the webpage pointed to by the filtered webpage link, and pre-process the webpage source code to obtain a set of available words of the webpage.
  • the pre-processing includes webpage cleaning, word segmentation, and stop-word removal.
  • the webpage cleaning includes: using regular expressions to extract the text of the title, keywords, and description from the webpage; and using regular expressions to clean out the information in that text that is unrelated to the webpage content. Besides the topic information and the title, a webpage contains a large amount of content unrelated to its topic.
  • therefore, regular expressions are used to extract the text of the three tags <title>, <keywords>, and <description> from the webpage; these three tags represent the title, keywords, and description of the webpage, respectively.
  • the extracted text still contains information unrelated to the webpage content, such as JavaScript code, CSS style code, and HTML tags; therefore, regular expressions are used for further cleaning to obtain noise-free webpage body text.
  • word segmentation is then performed on the noise-free webpage text; the segmentation may adopt one or more of dictionary-based, understanding-based, and statistics-based word segmentation methods.
  • preferably, the present application uses the jieba word segmenter to segment the filtered text.
  • the specific processing of stop-word removal includes: storing words that carry little information, such as "的" ("of"), "是" ("is"), "都" ("all"), "而且" ("moreover"), "已经" ("already"), and the like. Then, each word in the segmented webpage content is compared with the stored words, and the words that match are removed, leaving the words with high information content.
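The three pre-processing stages (cleaning, segmentation, stop-word removal) can be sketched together. A whitespace split stands in for the jieba segmenter used in the application, and the stop-word list mixes a few Chinese and English words purely for illustration.

```python
import re

# Illustrative stop-word list; a real deployment would load a full list.
STOP_WORDS = {"的", "是", "都", "而且", "已经", "the", "a", "has", "and"}


def clean_webpage(source):
    """Extract the <title>, <keywords>, <description> text with regular
    expressions, then strip scripts, styles, and remaining HTML tags."""
    parts = []
    for tag in ("title", "keywords", "description"):
        parts.extend(re.findall(
            r"<%s[^>]*>(.*?)</%s>" % (tag, tag), source, re.S | re.I))
    text = " ".join(parts)
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", text,
                  flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", text).strip()


def available_words(text):
    """Segment the cleaned text and drop stop words, yielding the
    available word set. Whitespace/slash splitting stands in for jieba."""
    words = [w for w in re.split(r"\s+|/", text) if w]
    return [w for w in words if w not in STOP_WORDS]


src = ("<html><title>Lombok beach</title>"
       "<description>has white-sand beach and gentle waves</description></html>")
print(available_words(clean_webpage(src)))
```

For Chinese webpage text the `available_words` tokenizer would be replaced by `jieba.cut`, but the surrounding extract-clean-filter pipeline is unchanged.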
  • the calling module 140 is configured to use the ETL processing method to store the obtained available word set into the database 32 in a preset format as training data, and to call the model to perform online training on the training data.
  • the ETL processing method refers to processing data by extraction, transformation, and loading. For example, before storage in the database 32, the required data is extracted, transformed into the preset format, and loaded into the database 32. When a call is needed, the required training data can likewise be extracted from the database 32 and loaded into the model for training.
  • the ETL tool may be one or more of Informatica, DataStage, OWB, Microsoft DTS, Beeload, Kettle, and the like.
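A minimal sketch of the load-then-extract round trip, with an in-memory SQLite database standing in for the database 32 and the dedicated ETL tools named above; the `(webpage_url, word)` table layout is a hypothetical "preset format".

```python
import sqlite3


def load_words(conn, webpage_url, words):
    """Transform the available word set into (webpage_url, word) rows
    and load them into the training-data table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS training_data"
        " (webpage_url TEXT, word TEXT)")
    conn.executemany(
        "INSERT INTO training_data VALUES (?, ?)",
        [(webpage_url, w) for w in words])
    conn.commit()


def extract_training_words(conn):
    """Extract stage of a later model call: pull the stored words back
    out of the database for training."""
    return [row[0] for row in
            conn.execute("SELECT word FROM training_data")]


conn = sqlite3.connect(":memory:")
load_words(conn, "http://you.ctrip.com/travels/xinjiang100008/3643336.html",
           ["Lombok", "beach", "waves"])
print(extract_training_words(conn))
```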
  • the generating module 150 is configured to adjust model parameters, verify model accuracy, and generate a complete model.
  • the specific online training process includes: dividing the training data into a training set and a verification set; substituting the training set into the variables X1, X2, ..., Xn of the constructed multiple regression model Y = A + B1X1 + B2X2 + ... + BnXn to automatically adjust the model parameters B1, B2, ..., Bn; and substituting the verification set into the model for verification to obtain a complete model. Assume the preset threshold is 98%: the verification set is substituted into the model, and when the verification accuracy reaches 98%, the travel recommendation model is a complete model.
  • the model includes, but is not limited to, a multiple regression model, and may also include a Long Short-Term Memory (LSTM) model, a Convolutional Neural Network (CNN) model, a Recurrent Neural Network (RNN) model, and so on.
  • the storage module 160 is configured to store the generated model to a specified directory of the model server, for example, storing the complete travel recommendation model to the specified directory. Further, when the data in the database 32 is updated, the updated data is fetched for training and the parameters are adjusted automatically; the newly generated travel recommendation model then replaces the previous one and is saved to the specified directory.
  • in another embodiment, after model training is completed, feedback information may also be sent to the user in a preset manner to notify the user that the required model training has been completed, for example, sending a message to the user's mailbox: "The model you need has been trained; please check it in the specified directory. Thank you!"
  • FIG. 4 is a flowchart of a preferred embodiment of the model online training method of the present application.
  • the model online training method implemented by executing the model online training program 10 includes steps S10 to S60:
  • step S10: the seed webpage sent by the user 2 is received, and the crawling module 110 uses a crawler tool to crawl all the webpage links in the seed webpage.
  • the seed webpage is set by the user. Crawling the links in the seed webpage with the crawler tool is untargeted; every webpage link contained in the seed webpage is crawled out.
  • the crawler tool may be one or more of Beautiful Soup, Scrapy, mechanize, and cola.
  • for example, for the travel recommendation model, the Ctrip website URL sent by the user is received: http://www.ctrip.com/, and all webpage links on the Ctrip site are fetched, such as the travelogue link http://you.ctrip.com/travels/xinjiang100008/3643336.html and the question-and-answer link http://you.ctrip.com/asks/shanghai2/5483046.html, and so on.
  • step S20: the screening module 120 uses the preset filter to filter out the webpage links that meet the requirements. Specifically, the screening module 120 sets up the corresponding URL library 31 according to the specific model being trained; the filter accesses the page pointed to by each webpage link and assigns the visited link to the corresponding URL library 31. For example, for the travel recommendation model, a travelogue URL library and a question-and-answer URL library are set up: the travelogue URL library stores travelogue links, and the question-and-answer URL library stores question-and-answer links.
  • the filter analyzes the page pointed to by each webpage link, determines whether the link is a travelogue link or a question-and-answer link, and assigns it to the corresponding URL library 31.
  • the filters include, but are not limited to, jQuery filters, and may be other webpage intent filters.
  • it should be understood, however, that the URL library 31 corresponding to the travel recommendation model includes, but is not limited to, the travelogue URL library and the question-and-answer URL library.
  • for example, the URL library 31 corresponding to the travel model may further include a 美食林 (gourmet guide) URL library and a 口碑榜 (word-of-mouth ranking) URL library.
  • step S30: the pre-processing module 130 obtains the webpage source code from the webpages pointed to by the filtered webpage links and pre-processes it to obtain the available word set of the webpage.
  • the pre-processing includes webpage cleaning, word segmentation, and stop-word removal.
  • the webpage cleaning includes: using regular expressions to extract the text of the title, keywords, and description from the webpage; and using regular expressions to clean out the information in that text that is unrelated to the webpage content. Besides the topic information and the title, a webpage contains a large amount of content unrelated to its topic.
  • therefore, regular expressions are used to extract the text of the three tags <title>, <keywords>, and <description> from the webpage; these three tags represent the title, keywords, and description of the webpage, respectively.
  • the extracted text still contains information unrelated to the webpage content, such as JavaScript code, CSS style code, and HTML tags; therefore, regular expressions are used for further cleaning to obtain noise-free webpage body text.
  • word segmentation is then performed on the noise-free webpage text; the segmentation may adopt one or more of dictionary-based, understanding-based, and statistics-based word segmentation methods.
  • preferably, the present application uses the jieba word segmenter to segment the filtered text.
  • after segmentation, the webpage content is split into individual words, with slashes as separators between words. For example, "The Moon-Bay-shaped Lombok beach has a delicate white sand beach and gentle waves." becomes "Moon-Bay-shaped / Lombok / beach / has / delicate / white sand beach / and / gentle / waves". Finally, stop-word removal is used to remove the words carrying little information, yielding the available word set.
  • the words with little information refer to words such as "的" ("of"), "了" (aspect particle), "是" ("is"), "而且" ("moreover"), "已经" ("already"), "都" ("all"), as well as the punctuation marks among the words.
  • the specific processing of stop-word removal includes: storing words that carry little information; then comparing each word in the segmented webpage content with the stored words and removing the matches, leaving the words with high information content. For example, after stop-word removal the example sentence becomes "Lombok / beach / has / white sand beach / gentle / waves".
  • step S40: the calling module 140 uses the ETL processing method to store the obtained available word set into the database 32 in the preset format as training data, and calls the model to perform online training on the training data.
  • the ETL processing method refers to processing data by extraction, transformation, and loading. Before storage in the database 32, the required data is extracted, transformed into the preset format, and loaded into the database 32. For example, the number of days, time, per-capita cost, companions, and the title and body information of the travelogue pages described above are stored field by field in a table of the database 32. When needed, the ETL processing method can likewise be used to extract the required training data from the database 32 and load it into the model for training.
  • the ETL tool may be one or more of Informatica, DataStage, OWB, Microsoft DTS, Beeload, Kettle, and the like.
  • step S50: the generating module 150 adjusts the model parameters, verifies the model accuracy, and generates a complete model.
  • the specific online training process includes: dividing the training data into a training set and a verification set; substituting the training set into the variables X1, X2, ..., Xn of the constructed multiple regression model Y = A + B1X1 + B2X2 + ... + BnXn; and substituting the verification set into the model for verification to obtain a complete model.
  • here X1, X2, ..., Xn represent variables such as the number of travelers, the number of attractions, and the number of travel days; the model parameters B1, B2, ..., Bn are adjusted automatically, and the verification set is substituted in for verification to obtain a travel recommendation model with high accuracy (e.g., 98%).
  • the trained model can then recommend cost-effective tourist attractions (by recommendation index) based on information such as the user's tourist attractions, travel days, per-capita spending, and companions.
  • the model includes, but is not limited to, a multiple regression model, and may also include an LSTM model, a CNN model, an RNN model, and the like.
  • for different types of training, different types of models are selected for model training; for example, classification of hotspot events may use an LSTM model.
  • step S60: the storage module 160 stores the generated model to a specified directory of the model server, for example, storing the complete travel recommendation model to the specified directory. Further, when the data in the database 32 is updated, the updated data is fetched for training and the parameters are adjusted automatically; the newly generated travel recommendation model then replaces the previous one and is saved to the specified directory.
  • after model training is completed, feedback information may also be sent to the user in a preset manner to notify the user that the required model training has been completed, for example, sending a message to the user's mailbox: "The model you need has been trained; please check it in the specified directory. Thank you!"
  • the model online training method proposed in the foregoing embodiments obtains suitable webpage links from the user's seed webpage, pre-processes the webpage content pointed to by those links, and obtains the available word set. Then, using the ETL processing method, the available word set is stored into the database 32, the model is called for training, and the parameters are adjusted to obtain the complete model, which is saved to the specified directory; the model can thus obtain a large amount of external data for online training, making model training fully automated.
  • the embodiment of the present application further provides a computer readable storage medium, which includes a model online training program 10; when the model online training program 10 is executed by a processor, the following operations are implemented:
  • Crawling step: receiving the seed webpage sent by the user, and using a crawler tool to crawl all the webpage links in the seed webpage;
  • Screening step: using a preset filter to filter out the webpage links that meet the requirements;
  • Pre-processing step: obtaining the webpage source code from the webpages pointed to by the filtered webpage links, pre-processing the source code, and obtaining the available word set of the webpage;
  • Invoking step: using the ETL processing method to store the obtained available word set into a database in a preset format as training data, and calling the model to perform online training on the training data;
  • Generation step: adjusting the model parameters, verifying the model accuracy, and generating a complete model;
  • Storage step: storing the generated model to a specified directory of the model server.
  • the screening step further comprises:
  • according to the specific model training, the corresponding URL library is set; the filter accesses the page pointed to by each webpage link, and the visited webpage link is assigned to the corresponding URL library.
  • the pre-processing includes webpage cleaning, word segmentation, and stop-word removal.
  • the step of cleaning the webpage comprises:
  • using regular expressions to extract the text of the title, keywords, and description from the webpage, and using regular expressions to clean out the information in that text that is unrelated to the webpage content.
  • the ETL processing method refers to processing data by extraction, transformation, and loading.
  • the generating step comprises:
  • the training data is divided into a training set and a verification set
  • the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as the ROM/RAM, magnetic disk, or optical disk described above), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An online training method for a model, a server, and a storage medium. According to a seed webpage sent by a user, the method uses a crawler tool to crawl all webpage links in the seed webpage (S10) and uses a filter to screen out the webpage links that meet the requirements (S20). Next, the webpage source code is obtained from the webpages pointed to by the filtered links and pre-processed to obtain an available word set (S30). Then, the available word set is stored into a database in a preset format using the ETL processing method, and a model is called to perform online training on the training data (S40). Finally, the model parameters are adjusted, the model accuracy is verified, a complete model is generated (S50), and the generated model is stored to a specified directory of the model server (S60). With this method, model training can be performed online.

Description

Model online training method, server, and storage medium
This application claims priority to Chinese patent application No. 201810386021.X, filed with the Chinese Patent Office on April 26, 2018 and entitled "Model online training method, server, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a model online training method, a server, and a computer readable storage medium.
Background
With the rapid development of artificial intelligence (AI), deep learning by means of model training has been widely applied in various fields. Current model training relies mainly on static data for offline training, that is, training based on static data obtained internally. This offline training mechanism is poorly suited to scenarios with high real-time requirements, and is also unsuitable for scenarios that require large amounts of external data for deep learning.
Summary
In view of the above, the present application provides a model online training method, a server, and a computer readable storage medium, whose main purpose is to capture external data in a targeted manner and implement fully automated online training of a model.
To achieve the above object, the present application provides a model online training method, comprising:
a crawling step: receiving a seed webpage sent by a user, and using a crawler tool to crawl all webpage links in the seed webpage;
a screening step: using a preset filter to filter out the webpage links that meet the requirements;
a pre-processing step: obtaining the webpage source code from the webpages pointed to by the filtered webpage links, pre-processing the source code, and obtaining the available word set of the webpage;
an invoking step: using the ETL processing method to store the obtained available word set into a database in a preset format as training data, and calling a model to perform online training on the training data;
a generation step: adjusting the model parameters, verifying the model accuracy, and generating a complete model;
a storage step: storing the generated model to a specified directory of the model server.
In addition, the present application further provides a server, comprising a memory, a processor, and a display, wherein the memory stores a model online training program which, when executed by the processor, implements the following steps:
a crawling step: receiving a seed webpage sent by a user, and using a crawler tool to crawl all webpage links in the seed webpage;
a screening step: using a preset filter to filter out the webpage links that meet the requirements;
a pre-processing step: obtaining the webpage source code from the webpages pointed to by the filtered webpage links, pre-processing the source code, and obtaining the available word set of the webpage;
an invoking step: using the ETL processing method to store the obtained available word set into a database in a preset format as training data, and calling a model to perform online training on the training data;
a generation step: adjusting the model parameters, verifying the model accuracy, and generating a complete model;
a storage step: storing the generated model to a specified directory of the model server.
In addition, to achieve the above object, the present application further provides a computer readable storage medium, which includes a model online training program; when the model online training program is executed by a processor, any step of the model online training method described above can be implemented.
The model online training method, server, and computer readable storage medium proposed by the present application use a crawler tool to crawl all webpage links from a user-specified seed webpage and filter out the links that meet the requirements. The webpage source code is then obtained and pre-processed, and the resulting available word set is stored into a database. Finally, the data in the database is used for training, the parameters are adjusted, the accuracy is verified, and the complete model is stored to a specified directory, so that online training of the model can be completed automatically from dynamic data.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a preferred embodiment of the server of the present application;
FIG. 2 is a block diagram of a preferred embodiment of the model online training program in FIG. 1;
FIG. 3 is a functional schematic diagram of the program modules in FIG. 2;
FIG. 4 is a flowchart of a preferred embodiment of the model online training method of the present application.
The realization of the objects, functional features, and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
Those skilled in the art will appreciate that the embodiments of the present application can be implemented as a method, apparatus, device, system, or computer program product. Accordingly, the present application can be embodied entirely in hardware, entirely in software (including firmware, resident software, microcode, etc.), or as a combination of hardware and software.
As shown in FIG. 1, it is a schematic diagram of a preferred embodiment of the server 1 of the present application.
In this embodiment, the server 1 refers to a product service platform, and may be a server, a tablet computer, a personal computer, a portable computer, or another electronic device having computing functions.
The server 1 includes a memory 11, a processor 12, a display 13, a network interface 14, and a communication bus 15, wherein the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface), and the communication bus 15 is used to implement connection and communication between these components.
The memory 11 includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, or a card-type memory. In some embodiments, the memory 11 may be an internal storage unit of the server 1, such as a hard disk of the server 1. In other embodiments, the memory 11 may also be an external storage unit of the server 1, such as a plug-in hard disk equipped on the server 1, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, etc.
In this embodiment, the memory 11 can be used to store not only the application software installed on the server 1 but also various kinds of data, such as the model online training program 10, webpage links, and available word sets.
The processor 12, in some embodiments, may be a Central Processing Unit (CPU), microprocessor, or other data processing chip for running the program code stored in the memory 11 or processing data, such as executing the computer program code of the model online training program 10.
The display 13 may be called a display screen or display unit. In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch display, or the like. The display 13 is used to display the information processed in the server 1 and to display a visualized work interface, for example, displaying the available word set obtained after pre-processing.
FIG. 1 shows only the server 1 with components 11-15 and the model online training program 10, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
Optionally, the server 1 may further include a user interface, which may include an input unit such as a keyboard and an audio output device such as a speaker or a headset; optionally, the user interface may also include a standard wired interface and a wireless interface.
Optionally, the server 1 further includes a touch sensor. The area provided by the touch sensor for the user to perform touch operations is called the touch area. The touch sensor may be a resistive touch sensor, a capacitive touch sensor, or the like, and includes not only contact-type touch sensors but also proximity-type touch sensors. The touch sensor may be a single sensor or a plurality of sensors arranged, for example, in an array. The user can launch the model online training program 10 by touching the touch area.
In addition, the area of the display of the electronic device 1 may be the same as or different from that of the touch sensor. Optionally, the display and the touch sensor are stacked to form a touch display screen, on the basis of which the device detects touch operations triggered by the user.
The server 1 may also include radio frequency (RF) circuits, sensors, audio circuits, and the like, and details are not described herein.
In the embodiment of the server 1 shown in FIG. 1, the memory 11, as a computer storage medium, stores the program code of the model online training program 10; when the processor 12 executes this program code, the following steps are implemented:
a crawling step: receiving a seed webpage sent by a user, and using a crawler tool to crawl all webpage links in the seed webpage;
a screening step: using a preset filter to filter out the webpage links that meet the requirements;
a pre-processing step: obtaining the webpage source code from the webpages pointed to by the filtered webpage links, pre-processing the source code, and obtaining the available word set of the webpage;
an invoking step: using the ETL processing method to store the obtained available word set into a database in a preset format as training data, and calling a model to perform online training on the training data;
a generation step: adjusting the model parameters, verifying the model accuracy, and generating a complete model;
a storage step: storing the generated model to a specified directory of the model server.
For the specific principles, refer to the following description of FIG. 2, a block diagram of a preferred embodiment of the model online training program 10, and FIG. 4, a flowchart of a preferred embodiment of the model online training method.
As shown in FIG. 2, it is a block diagram of a preferred embodiment of the model online training program 10 in FIG. 1.
In this embodiment, a travel recommendation model is taken as an example to explain the technical concept of the model online training method, program, and server provided by the present application; other types of models are equally applicable.
A module referred to in this application is a series of computer program instruction segments capable of performing a particular function.
In this embodiment, the model online training program 10 includes: a crawling module 110, a screening module 120, a pre-processing module 130, a calling module 140, a generating module 150, and a storage module 160.
The functions of the modules 110-160 are described below with reference to the functional schematic diagram of the program modules in FIG. 3:
The crawling module 110 is configured to receive the seed webpage sent by the user 2 and to use a crawler tool to crawl all the webpage links in the seed webpage. The seed webpage is a webpage preset by the user. Crawling the links in the seed webpage with the crawler tool is untargeted; every webpage link contained in the seed webpage is crawled out. The crawler tool may be one or more of Beautiful Soup, Scrapy, mechanize, and cola.
The screening module 120 is configured to filter out the webpage links that meet the requirements by using a preset filter. Specifically, the screening module 120 sets up the corresponding URL library 31 according to the specific model being trained; the filter accesses the page pointed to by each webpage link and assigns the visited link to the corresponding URL library 31. For example, for the travel recommendation model, a travelogue URL library and a question-and-answer URL library are set up: the travelogue URL library stores travelogue links, and the question-and-answer URL library stores question-and-answer links. The filter analyzes the page pointed to by each webpage link, determines whether the link is a travelogue link or a question-and-answer link, and assigns it to the corresponding URL library 31. The filters include, but are not limited to, jQuery filters, and may be other webpage intent filters. It should be understood, however, that the URL library 31 corresponding to the travel recommendation model includes, but is not limited to, the travelogue URL library and the question-and-answer URL library.
The preprocessing module 130 is configured to obtain webpage source code from the webpages pointed to by the filtered webpage links and preprocess the source code to obtain the usable word set of each webpage. The preprocessing includes webpage cleaning, word segmentation, and stopword removal. The webpage cleaning includes: using regular expressions to extract the text concerning the title, keywords, and description of the webpage; and using regular expressions to clean information irrelevant to the webpage content out of that text. Besides the topic information and title, a webpage contains a large amount of content unrelated to its topic. Therefore, regular expressions are used to extract the text portions of the three tags <title>, <keywords>, and <description>, which represent the title, keywords, and description of the webpage, respectively. The extracted text, however, still contains information irrelevant to the webpage content, such as Javascript script code, CSS style code, and HTML tags, so regular expressions are applied for further cleaning to obtain noise-free webpage body text. Word segmentation is then performed on the noise-free body text; the segmentation may adopt one or more of dictionary-based, understanding-based, and statistics-based segmentation methods. Preferably, the present application uses jieba segmentation to segment the filtered text. After segmentation, the webpage content is split into individual words, with slashes between words as separators. Finally, stopword removal is used to remove words carrying little information, yielding the usable word set. The specific processing of stopword removal includes: storing low-information words, such as "的", "是", "都", "而且", and "已经"; then comparing each word of the segmented webpage content against the stored words, and removing from the webpage content any word identical to a stored word, leaving the high-information words.
The invoking module 140 is configured to store the obtained usable word sets into the database 32 in a preset format as training data by means of ETL processing, and to invoke a model to perform online training on the training data. ETL processing refers to the process of handling data through extraction, transformation, and loading. For example, before storage into the database 32, the required data are extracted by ETL processing, transformed into the preset format, and loaded into the database 32. When the data need to be invoked, ETL processing may likewise be used to extract the required training data from the database 32 and load them into the model for training. The ETL tool may be one or more of Informatica, Datastage, OWB, Microsoft DTS, Beeload, Kettle, and the like.
The generating module 150 is configured to adjust the model parameters and verify the model accuracy to generate a complete model. The specific online training process includes: dividing the training data into a training set and a validation set; substituting the training set into the variables X1, X2, ..., Xn of the constructed multivariate regression model Y = A + B1X1 + B2X2 + ... + BnXn for training, and automatically adjusting the model parameters B1, B2, ..., Bn; and substituting the validation set into the model for verification to obtain the complete model. Suppose the preset threshold is 98%: when the validation set is substituted into the model and the verified accuracy reaches 98%, the travel recommendation model is a complete model. It should be understood, however, that the model includes, but is not limited to, the multivariate regression model, and may also include a Long-Short Term Memory (LSTM) model, a Convolutional Neural Network (CNN) model, a Recurrent Neural Network (RNN) model, and the like. Different types of models are selected for different types of training; for example, the classification of trending events may use an LSTM model.
The storing module 160 is configured to store the generated model into a specified directory of the model server. For example, the complete travel recommendation model is stored into the specified directory. Further, when the data in the database 32 are updated, the data in the database 32 are fetched for training and the parameters are automatically adjusted, and the newly generated travel recommendation model replaces the previous travel recommendation model and is saved to the specified directory.
In another embodiment, after the model training is completed, feedback may further be sent to the user in a preset manner to notify the user that the required model training is finished. For example, a notification is sent to the user's mailbox by email: "The model you requested has been trained; please check it in the specified directory. Thank you!"
FIG. 4 is a flowchart of a preferred embodiment of the model online training method of the present application.
In this embodiment, the model online training method implemented when the processor 12 executes the computer program of the model online training program 10 stored in the memory 11 includes steps S10 to S60:
Step S10: a seed webpage sent by a user 2 is received, and the crawling module 110 crawls all webpage links within the seed webpage by using a crawler tool. The seed webpage is set by the user. The crawling of webpage links within the seed webpage by the crawler tool is non-directional: every webpage link within the seed webpage is crawled. The crawler tool may be one or more of Beautiful Soup, Scrapy, mechanize, and cola. For example, for the travel recommendation model, the Ctrip URL sent by the user, http://www.ctrip.com/, is received, and all webpage links within the Ctrip site are fetched, such as the travel-note link http://you.ctrip.com/travels/xinjiang100008/3643336.html and the Q&A link http://you.ctrip.com/asks/shanghai2/5483046.html.
Step S20: the filtering module 120 filters out qualified webpage links by using a preset filter. Specifically, the filtering module 120 sets up corresponding URL libraries 31 according to the specific model training; the filter accesses the pages pointed to by the webpage links and assigns the accessed webpage links to the corresponding URL libraries 31. For example, a travel-notes URL library and a Q&A URL library are set up for the travel recommendation model: the travel-notes URL library stores travel-note links, and the Q&A URL library stores Q&A links. The filter analyzes the page pointed to by each webpage link, determines whether the link is a travel-note link or a Q&A link, and assigns each link to the corresponding URL library 31. The filter includes, but is not limited to, a JQuery filter, and may also be another webpage intent filter. It should be understood, however, that the URL libraries 31 corresponding to the travel recommendation model include, but are not limited to, the travel-notes URL library and the Q&A URL library. For example, the URL libraries 31 corresponding to the travel model may further include a Meishilin (restaurant guide) URL library and a word-of-mouth ranking URL library.
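As a toy stand-in for the filter's assignment of links to URL libraries, the sketch below classifies links by URL pattern alone. The library names and patterns are assumptions for illustration; the filter described above classifies links by analyzing the pages they point to.

```python
import re

# Hypothetical URL patterns standing in for real page-intent analysis.
URL_LIBRARIES = {
    "travel_notes": re.compile(r"/travels/"),
    "qa": re.compile(r"/asks/"),
}

def assign_to_library(link):
    """Return the name of the URL library a link belongs to, or None."""
    for library, pattern in URL_LIBRARIES.items():
        if pattern.search(link):
            return library
    return None

print(assign_to_library("http://you.ctrip.com/travels/xinjiang100008/3643336.html"))  # travel_notes
print(assign_to_library("http://you.ctrip.com/asks/shanghai2/5483046.html"))          # qa
```

Further libraries (for example a restaurant-guide library) would be added as extra entries in `URL_LIBRARIES`.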
Step S30: the preprocessing module 130 obtains webpage source code from the webpages pointed to by the filtered webpage links and preprocesses the source code to obtain the usable word set of each webpage. The preprocessing includes webpage cleaning, word segmentation, and stopword removal. The webpage cleaning includes: using regular expressions to extract the text concerning the title, keywords, and description of the webpage; and using regular expressions to clean information irrelevant to the webpage content out of that text. Besides the topic information and title, a webpage contains a large amount of content unrelated to its topic. Therefore, regular expressions are used to extract the text portions of the three tags <title>, <keywords>, and <description>, which represent the title, keywords, and description of the webpage, respectively. The extracted text, however, still contains information irrelevant to the webpage content, such as Javascript script code, CSS style code, and HTML tags, so regular expressions are applied for further cleaning to obtain noise-free webpage body text. Word segmentation is then performed on the noise-free body text; the segmentation may adopt one or more of dictionary-based, understanding-based, and statistics-based segmentation methods. Preferably, the present application uses jieba segmentation to segment the filtered text. After segmentation, the webpage content is split into individual words, with slashes between words as separators. For example, "月湾形的龙目岛沙滩拥有着细腻的白色沙滩与温柔的海浪。" is segmented into "月湾形/的/龙目岛/沙滩/拥有着/细腻/的/白色沙滩/与/温柔/的/海浪/。". Finally, stopword removal is used to remove words carrying little information, yielding the usable word set. The low-information words are words such as "的", "以", "是", "而且", "已经", "都", and "与", together with punctuation marks. The specific processing of stopword removal includes: storing the low-information words; then comparing each word of the segmented webpage content against the stored words, and removing from the webpage content any word identical to a stored word, leaving the high-information words. For example, after stopword removal the above sentence becomes "龙目岛/沙滩/拥有/沙滩/和/海浪".
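A minimal Python sketch of the cleaning and stopword-removal steps described above. The regular expressions and the stopword list are illustrative assumptions, and the jieba segmentation named in the text is stood in for by a pre-segmented token list so the example stays self-contained.

```python
import re

# Illustrative stopword list; a real deployment would store a much larger one.
STOPWORDS = {"的", "以", "是", "都", "而且", "已经", "与", "。"}

def extract_fields(page_source):
    """Pull the title, keywords, and description text out of raw page source."""
    fields = {
        "title": re.search(r"<title>(.*?)</title>", page_source, re.S),
        "keywords": re.search(r'<meta\s+name="keywords"\s+content="(.*?)"', page_source, re.S),
        "description": re.search(r'<meta\s+name="description"\s+content="(.*?)"', page_source, re.S),
    }
    return {k: (m.group(1).strip() if m else "") for k, m in fields.items()}

def remove_stopwords(tokens):
    """Drop low-information words, keeping the usable word set."""
    return [t for t in tokens if t not in STOPWORDS]

page = '<html><head><title>龙目岛游记</title><meta name="keywords" content="沙滩,海浪"></head></html>'
print(extract_fields(page)["title"])  # 龙目岛游记

# Tokens as a segmenter such as jieba might produce for the example sentence
# (pre-segmented here to avoid the external dependency).
tokens = ["月湾形", "的", "龙目岛", "沙滩", "拥有着", "细腻", "的", "白色沙滩", "与", "温柔", "的", "海浪", "。"]
print("/".join(remove_stopwords(tokens)))  # 月湾形/龙目岛/沙滩/拥有着/细腻/白色沙滩/温柔/海浪
```

The slash-joined output mirrors the slash separators the text uses to mark word boundaries.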
Step S40: the invoking module 140 stores the obtained usable word sets into the database 32 in a preset format as training data by means of ETL processing, and invokes a model to perform online training on the training data. ETL processing refers to the process of handling data through extraction, transformation, and loading. Before storage into the database 32, the required data are extracted by ETL processing, transformed into the preset format, and loaded into the database 32. For example, the days, time, per-capita cost, and companions of the travel-note website above, together with the topic and body information, are stored as fields in tables of the database 32. When the data need to be invoked, ETL processing may likewise be used to extract the required training data from the database 32 and load them into the model for training. The ETL tool may be one or more of Informatica, Datastage, OWB, Microsoft DTS, Beeload, Kettle, and the like.
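A compact illustration of the extract-transform-load flow into a database, using an in-memory SQLite table. The field names (days, cost per person, companions, body) follow the travel-note example above, but the schema itself is an assumption for illustration; a production pipeline would use one of the named ETL tools against a real database.

```python
import sqlite3

def etl_load(records, conn):
    """Extract the needed fields, transform them to the preset schema, and load them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS training_data "
        "(days INTEGER, cost_per_person REAL, companions TEXT, body TEXT)"
    )
    rows = [
        # Transform: join the usable word set into a single body field.
        (r["days"], r["cost"], r["companions"], " ".join(r["words"]))
        for r in records  # Extract: one dict per crawled travel note.
    ]
    # Load: bulk-insert the transformed rows.
    conn.executemany("INSERT INTO training_data VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
etl_load([{"days": 5, "cost": 3200.0, "companions": "family", "words": ["龙目岛", "沙滩"]}], conn)
print(conn.execute("SELECT days, body FROM training_data").fetchall())  # [(5, '龙目岛 沙滩')]
```

When training is invoked, the same table is queried to pull the rows back out as training data.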
Step S50: the generating module 150 adjusts the model parameters and verifies the model accuracy to generate a complete model. The specific online training process includes: dividing the training data into a training set and a validation set; substituting the training set into the variables X1, X2, ..., Xn of the constructed multivariate regression model Y = A + B1X1 + B2X2 + ... + BnXn for training, and automatically adjusting the model parameters B1, B2, ..., Bn; and substituting the validation set into the model for verification to obtain the complete model. For example, the attraction information in the database 32 (including the number of travel notes and the number of attractions), the guide information (including travel days, travel time, per-capita spending, and companion information), and the (manually annotated) recommendation index of each attraction are divided into a training set and a validation set. The training set is substituted into the constructed multivariate regression model Y = A + B1X1 + B2X2 + ... + BnXn for training, where X1, X2, ..., Xn represent variables such as the number of travel notes, the number of attractions, and travel days; the model parameters B1, B2, ..., Bn are adjusted automatically, and the validation set is substituted for verification, yielding a travel recommendation model with high accuracy (for example, 98%). This model can recommend cost-effective attractions (with a high recommendation index) to a user based on information such as the user's attractions, travel days, per-capita spending, and companions. It should be understood, however, that the model includes, but is not limited to, the multivariate regression model, and may also include LSTM, CNN, and RNN models. Different types of models are selected for different types of training; for example, the classification of trending events may use an LSTM model.
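The training step can be sketched as an ordinary least-squares fit of the multivariate regression Y = A + B1X1 + ... + BnXn. The solver below is a plain normal-equations implementation in Python, and the synthetic data are an assumption constructed to lie exactly on a known model; a real run would also hold out a validation set and check accuracy against the preset threshold.

```python
def fit_linear(X, y):
    """Least-squares fit of Y = A + B1*X1 + ... + Bn*Xn.

    Solves the normal equations (Zt Z) beta = Zt y by Gaussian elimination,
    where Z is X with a leading intercept column of ones.
    """
    Z = [[1.0] + list(row) for row in X]
    m, n = len(Z), len(Z[0])
    # Build the normal-equations system M * beta = v.
    M = [[sum(Z[k][i] * Z[k][j] for k in range(m)) for j in range(n)] for i in range(n)]
    v = [sum(Z[k][i] * y[k] for k in range(m)) for i in range(n)]
    for i in range(n):  # forward elimination with partial pivoting
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        v[i], v[p] = v[p], v[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n):
                M[r][c] -= f * M[i][c]
            v[r] -= f * v[i]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        beta[i] = (v[i] - sum(M[i][j] * beta[j] for j in range(i + 1, n))) / M[i][i]
    return beta  # [A, B1, ..., Bn]

# Synthetic rows lying exactly on Y = 1 + 2*X1 + 3*X2; in practice the held-out
# validation rows would be used to verify accuracy against the 98% threshold.
X = [[1, 0], [0, 1], [1, 1], [2, 3]]
y = [3.0, 4.0, 6.0, 14.0]
print([round(c, 6) for c in fit_linear(X, y)])  # [1.0, 2.0, 3.0]
```

The returned coefficients correspond to the automatically adjusted parameters A, B1, ..., Bn of the text's regression model.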
Step S60: the storing module 160 stores the generated model into a specified directory of the model server. For example, the complete travel recommendation model is stored into the specified directory. Further, when the data in the database 32 are updated, the data in the database 32 are fetched for training and the parameters are automatically adjusted, and the newly generated travel recommendation model replaces the previous travel recommendation model and is saved to the specified directory.
In another embodiment, after the model training is completed, feedback may further be sent to the user in a preset manner to notify the user that the required model training is finished. For example, a notification is sent to the user's mailbox by email: "The model you requested has been trained; please check it in the specified directory. Thank you!"
The model online training method proposed in the above embodiments obtains suitable webpage links from the user's seed webpage and preprocesses the content of the webpages pointed to by those links to obtain usable word sets. The usable word sets are then stored into the database 32 by ETL processing, and a model is invoked for training; the parameters are adjusted to obtain a complete model, which is saved to the specified directory. The model can thus acquire large amounts of external data for online training, achieving fully automated model training.
In addition, an embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium includes a model online training program 10, and when the model online training program 10 is executed by a processor, the following operations are implemented:
Crawling step: receiving a seed webpage sent by a user, and crawling all webpage links within the seed webpage by using a crawler tool.
Filtering step: filtering out qualified webpage links by using a preset filter.
Preprocessing step: obtaining webpage source code from the webpages pointed to by the filtered webpage links, and preprocessing the webpage source code to obtain usable word sets of the webpages.
Invoking step: storing the obtained usable word sets into a database in a preset format as training data by means of ETL processing, and invoking a model to perform online training on the training data.
Generating step: adjusting model parameters and verifying model accuracy to generate a complete model.
Storing step: storing the generated model into a specified directory of a model server.
Preferably, the filtering step further includes:
setting up corresponding URL libraries according to the specific model training, the filter accessing the pages pointed to by the webpage links and assigning the accessed webpage links to the corresponding URL libraries.
Preferably, the preprocessing includes webpage cleaning, word segmentation, and stopword removal.
Preferably, the webpage cleaning includes:
using regular expressions to extract the text concerning the title, keywords, and description of the webpage; and
using regular expressions to clean information irrelevant to the webpage content out of the text.
Preferably, the ETL processing refers to the process of handling data through extraction, transformation, and loading.
Preferably, the generating step includes:
dividing the training data into a training set and a validation set;
substituting the training set into the variables X1, X2, ..., Xn of the constructed multivariate regression model Y = A + B1X1 + B2X2 + ... + BnXn for training, and automatically adjusting the model parameters B1, B2, ..., Bn; and
substituting the validation set into the model for verification to obtain the complete model.
The specific implementation of the computer-readable storage medium of the present application is substantially the same as that of the model online training method described above and is not repeated here.
The serial numbers of the above embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments.
It should be noted that, as used herein, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, apparatus, article, or method that includes that element.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, and of course may also be implemented by hardware, but in many cases the former is the better implementation. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. A model online training method, applied to a server, wherein the method comprises:
    a crawling step: receiving a seed webpage sent by a user, and crawling all webpage links within the seed webpage by using a crawler tool;
    a filtering step: filtering out qualified webpage links by using a preset filter;
    a preprocessing step: obtaining webpage source code from the webpages pointed to by the filtered webpage links, and preprocessing the webpage source code to obtain usable word sets of the webpages;
    an invoking step: storing the obtained usable word sets into a database in a preset format as training data by means of ETL processing, and invoking a model to perform online training on the training data;
    a generating step: adjusting model parameters and verifying model accuracy to generate a complete model; and
    a storing step: storing the generated model into a specified directory of a model server.
  2. The model online training method according to claim 1, wherein the filtering step further comprises:
    setting up corresponding URL libraries according to the specific model training, the filter accessing the pages pointed to by the webpage links and assigning the accessed webpage links to the corresponding URL libraries.
  3. The model online training method according to claim 1, wherein the preprocessing comprises webpage cleaning, word segmentation, and stopword removal.
  4. The model online training method according to claim 1 or 3, wherein the webpage cleaning comprises:
    using regular expressions to extract the text concerning the title, keywords, and description of the webpage; and
    using regular expressions to clean information irrelevant to the webpage content out of the text.
  5. The model online training method according to claim 1 or 3, wherein the stopword removal comprises:
    setting low-information words, and comparing the words of the segmented webpage content with the low-information words; and
    if a word of the webpage content is identical to one of the low-information words, removing that word from the webpage content to obtain the high-information words of the webpage content.
  6. The model online training method according to claim 1, wherein the ETL processing refers to the process of handling data through extraction, transformation, and loading.
  7. The model online training method according to claim 1, wherein the generating step comprises:
    dividing the training data into a training set and a validation set;
    substituting the training set into the variables X1, X2, ..., Xn of the constructed multivariate regression model Y = A + B1X1 + B2X2 + ... + BnXn for training, and automatically adjusting the model parameters B1, B2, ..., Bn; and
    substituting the validation set into the model for verification to obtain the complete model.
  8. A server, wherein the server comprises a memory, a processor, and a display, the memory storing a model online training program which, when executed by the processor, implements the following steps:
    a crawling step: receiving a seed webpage sent by a user, and crawling all webpage links within the seed webpage by using a crawler tool;
    a filtering step: filtering out qualified webpage links by using a preset filter;
    a preprocessing step: obtaining webpage source code from the webpages pointed to by the filtered webpage links, and preprocessing the webpage source code to obtain usable word sets of the webpages;
    an invoking step: storing the obtained usable word sets into a database in a preset format as training data by means of ETL processing, and invoking a model to perform online training on the training data;
    a generating step: adjusting model parameters and verifying model accuracy to generate a complete model; and
    a storing step: storing the generated model into a specified directory of a model server.
  9. The server according to claim 8, wherein the filtering step further comprises:
    setting up corresponding URL libraries according to the specific model training, the filter accessing the pages pointed to by the webpage links and assigning the accessed webpage links to the corresponding URL libraries.
  10. The server according to claim 8, wherein the preprocessing comprises webpage cleaning, word segmentation, and stopword removal.
  11. The server according to claim 8 or 10, wherein the webpage cleaning comprises:
    using regular expressions to extract the text concerning the title, keywords, and description of the webpage; and
    using regular expressions to clean information irrelevant to the webpage content out of the text.
  12. The server according to claim 8 or 10, wherein the stopword removal comprises:
    setting low-information words, and comparing the words of the segmented webpage content with the low-information words; and
    if a word of the webpage content is identical to one of the low-information words, removing that word from the webpage content to obtain the high-information words of the webpage content.
  13. The server according to claim 8, wherein the ETL processing refers to the process of handling data through extraction, transformation, and loading.
  14. The server according to claim 8, wherein the generating step comprises:
    dividing the training data into a training set and a validation set;
    substituting the training set into the variables X1, X2, ..., Xn of the constructed multivariate regression model Y = A + B1X1 + B2X2 + ... + BnXn for training, and automatically adjusting the model parameters B1, B2, ..., Bn; and
    substituting the validation set into the model for verification to obtain the complete model.
  15. A computer-readable storage medium, wherein the computer-readable storage medium includes a model online training program which, when executed by a processor, implements the following steps:
    a crawling step: receiving a seed webpage sent by a user, and crawling all webpage links within the seed webpage by using a crawler tool;
    a filtering step: filtering out qualified webpage links by using a preset filter;
    a preprocessing step: obtaining webpage source code from the webpages pointed to by the filtered webpage links, and preprocessing the webpage source code to obtain usable word sets of the webpages;
    an invoking step: storing the obtained usable word sets into a database in a preset format as training data by means of ETL processing, and invoking a model to perform online training on the training data;
    a generating step: adjusting model parameters and verifying model accuracy to generate a complete model; and
    a storing step: storing the generated model into a specified directory of a model server.
  16. The computer-readable storage medium according to claim 15, wherein the filtering step further comprises:
    setting up corresponding URL libraries according to the specific model training, the filter accessing the pages pointed to by the webpage links and assigning the accessed webpage links to the corresponding URL libraries.
  17. The computer-readable storage medium according to claim 15, wherein the preprocessing comprises webpage cleaning, word segmentation, and stopword removal.
  18. The computer-readable storage medium according to claim 15 or 17, wherein the webpage cleaning comprises:
    using regular expressions to extract the text concerning the title, keywords, and description of the webpage; and
    using regular expressions to clean information irrelevant to the webpage content out of the text.
  19. The computer-readable storage medium according to claim 15, wherein the ETL processing refers to the process of handling data through extraction, transformation, and loading.
  20. The computer-readable storage medium according to claim 15, wherein the generating step comprises:
    dividing the training data into a training set and a validation set;
    substituting the training set into the variables X1, X2, ..., Xn of the constructed multivariate regression model Y = A + B1X1 + B2X2 + ... + BnXn for training, and automatically adjusting the model parameters B1, B2, ..., Bn; and
    substituting the validation set into the model for verification to obtain the complete model.
PCT/CN2018/102114, filed 2018-08-24, priority 2018-04-26: Model online training method, server and storage medium (WO2019205374A1)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810386021.X 2018-04-26
CN201810386021.XA 2018-04-26 Model online training method, server and storage medium (published as CN108763313A)