WO2020140613A1 - Model training method, apparatus, computer device, and computer-readable storage medium - Google Patents

Model training method, apparatus, computer device, and computer-readable storage medium

Info

Publication number
WO2020140613A1
WO2020140613A1 PCT/CN2019/117295 CN2019117295W
Authority
WO
WIPO (PCT)
Prior art keywords
model
sub
preset
corpus
data source
Prior art date
Application number
PCT/CN2019/117295
Other languages
English (en)
French (fr)
Inventor
吴壮伟
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020140613A1 publication Critical patent/WO2020140613A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44: Arrangements for executing specific programs
    • G06F9/455: Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Definitions

  • the present application relates to the technical field of model construction, and in particular to a model training method, device, computer equipment, and computer-readable storage medium.
  • Embodiments of the present application provide a model training method, apparatus, computer equipment, and computer-readable storage medium, which can solve the problem of low model training efficiency caused by computer hardware bottlenecks during model training in traditional technology.
  • an embodiment of the present application provides a method for model training.
  • the method includes: obtaining a training corpus in a first preset manner; dividing the corpus according to preset conditions to obtain multiple data blocks; inputting the data blocks into the corresponding sub-models according to a preset correspondence relationship to train each sub-model and obtain trained sub-models; and synthesizing the trained sub-models in a second preset manner to obtain a synthetic model.
  • an embodiment of the present application further provides a model training apparatus, including: an acquisition unit for acquiring a corpus used for training in a first preset manner; a segmentation unit for dividing the corpus according to preset conditions to obtain multiple data blocks; a training unit for inputting the data blocks into the corresponding sub-models according to a preset correspondence relationship to train each sub-model and obtain trained sub-models; and a synthesis unit for synthesizing the trained sub-models in a second preset manner to obtain a synthetic model.
  • an embodiment of the present application further provides a computer device, which includes a memory and a processor, a computer program is stored on the memory, and the model training method is implemented when the processor executes the computer program.
  • an embodiment of the present application further provides a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, causes the processor to perform the model training method.
  • FIG. 1 is a schematic diagram of an application scenario of a model training method provided by an embodiment of this application.
  • FIG. 2 is a schematic flowchart of a model training method provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of data processing flow of a model training method provided by an embodiment of this application.
  • FIG. 4 is a schematic diagram of a sub-process of a model training method provided by an embodiment of this application.
  • FIG. 5 is a schematic block diagram of a model training device provided by an embodiment of this application.
  • FIG. 6 is another schematic block diagram of a model training device provided by an embodiment of this application.
  • FIG. 7 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of an application scenario of a model training method provided by an embodiment of the present application.
  • the application scenario includes: (1) Terminal. An application program interface is installed on the terminal shown in FIG. 1, and R&D personnel interact with the master server through the application program interface to provide content that requires manual input, such as an initial data source list, which is uploaded to the server's Docker container so that the master server can execute the steps of the model training method.
  • the terminal may be an electronic device such as a notebook computer, tablet computer, or desktop computer, and the terminal in FIG. 1 is connected to the master server.
  • (2) Server. The server includes a master server and slave servers.
  • the embodiment of the present application adopts a distributed system: multiple Docker containers are deployed to different slave servers through the master server, multiple sub-models are trained in parallel by the multiple slave servers, and the master server serially synthesizes the multiple trained sub-models to obtain a synthetic model, thereby improving the efficiency of model training.
  • the master server in FIG. 1 is connected to the terminal and the slave servers, respectively.
  • the working process of each subject in FIG. 1 is as follows: the master server obtains the training corpus in the first preset manner; the training corpus may be obtained from the terminal, or may be crawled from the Internet according to the data source list obtained from the terminal. The master server divides the corpus according to preset conditions to obtain multiple data blocks; the data blocks are input, according to the preset correspondence, into the corresponding sub-models on the slave servers so that the slave servers train each sub-model. The master server obtains the trained sub-models and synthesizes them in the second preset manner to obtain a synthetic model.
  • FIG. 1 only illustrates a desktop computer as a terminal.
  • the type of the terminal is not limited to that shown in FIG. 1.
  • the terminal may also be an electronic device such as a mobile phone, notebook computer, or tablet computer.
  • the application scenario of the above model training method is only used to illustrate the technical solution of the present application, and is not used to limit the technical solution of the present application.
  • the above connection relationship may also have other forms.
  • FIG. 2 is a schematic flowchart of a model training method provided by an embodiment of this application.
  • the model training method is applied to the master server in FIG. 1 to complete all or part of the functions of the model training method.
  • FIG. 2 is a schematic flowchart of a model training method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a data processing flow of a model training method provided by an embodiment of the present application. As shown in FIG. 2, the method includes the following steps S210-S240:
  • the server obtains the corpus used for training in the first preset manner.
  • the first preset manner includes receiving training data sent by a terminal, or obtaining Internet data crawled by a crawler system.
  • the training data sent by the terminal may be training data obtained through a storage device or data obtained through the Internet.
  • the crawler system can crawl through a single computer device, or can adopt a distributed crawler system to improve the efficiency of crawling data.
  • the preset conditions include division according to preset data size, data type, data structure, data source, processing depth, and time distribution.
  • division according to preset data size means dividing the data in units of stored data. Commonly used data units include megabytes (1 MB = 1024 KB), gigabytes (1 GB = 1024 MB), terabytes (1 TB = 1024 GB, where 1024 = 2^10), and petabytes (1 PB = 1024 TB).
  • the data structure includes balanced structure corpus and natural random structure corpus.
  • Processing depth includes annotated corpus and non-annotated corpus.
  • Time distribution includes diachronic corpus and synchronic corpus.
  • Data sources include websites, books, newspapers and periodicals, etc., and data types include technology, sports, finance, and other types of data.
  • the complex model is decomposed into subdivided sub-models, multiple sub-models and data blocks are correspondingly deployed on different machines, and finally the sub-models are aggregated and combined to output a complete model, which improves the efficiency of model training.
  • the corpus is divided according to preset conditions to obtain multiple data blocks. Specifically, the following steps are performed. First, a templated operation needs to be prepared in advance; the data division operation can be started through a customized Shell script, where a Shell script is a program file in which various commands are placed in advance for convenient one-time execution. Second, associative retrieval of the corpus is conducted to obtain a rich and comprehensive corpus from multiple sources. For example, if the corpus is a news report about a hot event originating from one website, the crawler system can be used to crawl the corpus of the hot event as reported on other related websites. Since different websites report the hot event from different angles, forming different corpora of the hot event, a rich and comprehensive corpus can be obtained from multiple corpus sources through associative retrieval of the hot event.
  • the input parameter is the number of divided data subsets, {D1, D2, D3, ..., Dn}, which are saved to temporary directory 1 so that subsequent steps can obtain the data blocks from temporary directory 1.
  • dividing the corpus according to preset conditions to obtain multiple data blocks may be: classifying the corpus by data type to obtain classified data, and then dividing the classified data in the time order of a preset cycle to obtain multiple data blocks.
  • alternatively, the corpus may be classified by data type to obtain classified data, and the classified data may be divided according to a preset number of blocks or a preset size to obtain multiple data blocks; different flexible approaches may be adopted according to actual needs.
  • the data block is obtained by dividing the data of the user corpus.
  • the settings of the data blocks can be customized in advance according to the model training conditions, which provides flexibility to meet different training conditions. For example, if a single computer device has a high hardware resource configuration, each data block can be made larger and the number of data blocks kept small; if a single computer device has a low hardware resource configuration but the number of computer devices is sufficient, each data block can be made smaller but the number of data blocks large.
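As an illustration of the block-division step described above, a minimal Python sketch is given below; the record format, function names, and the choice of classifying by a `"type"` field are assumptions for illustration, not code from the patent:

```python
from collections import defaultdict

def classify_corpus(corpus, type_key="type"):
    """Group corpus records by data type (e.g. technology, sports, finance)."""
    classified = defaultdict(list)
    for record in corpus:
        classified[record[type_key]].append(record)
    return dict(classified)

def split_into_blocks(records, n_blocks):
    """Divide a list of records into n_blocks roughly equal data blocks."""
    size = max(1, -(-len(records) // n_blocks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

# Toy corpus: 8 finance documents and 4 sports documents.
corpus = [{"type": "finance", "text": f"doc{i}"} for i in range(8)] + \
         [{"type": "sports", "text": f"doc{i}"} for i in range(4)]
classified = classify_corpus(corpus)
blocks = split_into_blocks(classified["finance"], n_blocks=4)
```

A preset block size could be used instead of a preset block count by computing `size` directly, matching the "preset number of blocks or preset size" alternatives above.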
  • the preset correspondence refers to the correspondence between the data blocks and the sub-models, that is, which sub-model processes which corresponding data blocks.
  • the embodiment of the present application obtains the final synthetic model by combining multiple parallelized sub-modules with a serial model that synthesizes the multiple sub-modules.
  • parallel training is used to train multiple sub-models to reduce the hardware resource requirements of a single computer device and improve the training efficiency of complex models.
  • in the embodiment of the present application, based on the division of the big data file, evenly divided data blocks are obtained, and the data source corresponding to each data block has a model mechanism, that is, a sub-model; by inputting the multiple data blocks into multiple sub-models on different machines, the multiple sub-models are trained in parallel. Before training the sub-models, the parameters of the sub-models need to be set.
  • the preset correspondence between the data block and the sub-model is obtained from the storage file or the database, and the data blocks are respectively input to the corresponding sub-model according to the preset correspondence.
  • each sub-model is trained by a parallel architecture, and computing resources of a multi-core CPU or multiple computer devices can be used to improve the training efficiency of the model.
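The parallel training of sub-models can be sketched as follows. This is a single-machine approximation using a thread pool; in the patent the sub-models run in Docker containers on separate slave servers, and the toy "training" here (fitting a mean predictor per data block) is purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def train_sub_model(model_id, data_block):
    """Toy 'training': fit a mean predictor on one data block. This stands in
    for the real sub-model training that runs in a container on a slave server."""
    mean = sum(data_block) / len(data_block)
    return {"model_id": model_id, "param": mean}

# Preset correspondence: data block i is processed by sub-model i.
data_blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
with ThreadPoolExecutor(max_workers=3) as pool:
    trained = list(pool.map(train_sub_model, range(len(data_blocks)), data_blocks))
```

In the distributed setting described by the patent, `pool.map` would be replaced by dispatching each (block, sub-model) pair to a Docker container on a different machine.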
  • the second preset mode includes a fusion model or a preset combination mode.
  • the acquired trained sub-models are aggregated to synthesize the final synthetic model, which is the final result of the complex model.
  • the step of synthesizing the trained sub-models in a second preset manner to obtain a synthesized model includes: aggregating a plurality of trained sub-models through a fusion model to obtain a synthesized model.
  • after each sub-model is trained, the results of the trained sub-models are output, and the sub-models undergo a fusion model for the final merge to obtain a synthetic model; that is, through the parallel sub-models, the result of each corresponding sub-model is obtained, and by weighting and averaging the sub-models, the final synthetic model is obtained.
  • the weights used in the embodiments of the present application are calculated based on the model accuracy of each sub-model. For example, for regression problems, a regression model is used.
  • for the regression model, the evaluation metrics include MSE (Mean Square Error), RMSE (Root Mean Square Error), MAE (Mean Absolute Error), and R-Squared, etc.
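The accuracy-weighted fusion described above can be sketched as below. The patent only states that the weights are calculated from each sub-model's accuracy; using the reciprocal of MSE as the weight is an assumption chosen for illustration:

```python
def mse(y_true, y_pred):
    """Mean square error between true values and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fuse_predictions(sub_preds, y_true):
    """Weighted average of sub-model predictions; weights proportional to 1/MSE
    (assumes every sub-model has nonzero error on the validation data)."""
    weights = [1.0 / mse(y_true, p) for p in sub_preds]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize to sum to 1
    fused = [sum(w * p[i] for w, p in zip(weights, sub_preds))
             for i in range(len(y_true))]
    return fused, weights

# Two toy sub-models: the first is much more accurate than the second.
y_true = [1.0, 2.0, 3.0]
sub_preds = [[1.1, 2.1, 3.1], [0.5, 1.5, 2.5]]
fused, weights = fuse_predictions(sub_preds, y_true)
```

The more accurate sub-model receives the larger weight, so the fused prediction stays close to its output.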
  • the serial model flow refers to passing through model A first and then model B. It also involves the read-in address of each sub-model and the output of the model prediction, and outputs the model prediction to a preset address.
  • the code for model prediction based on the multiple sub-models is encapsulated in a Docker container, such as Docker 3, and several Docker 3 containers are created and started. After each sub-model finishes training, the sub-models {model1, model2, ..., modeln} are imported from temporary directory 2, and the code for model prediction based on the multiple sub-models is started, that is, the results of the multiple sub-models are averaged. The input parameter is the data file in temporary directory 3, and the output is the model prediction result, that is, the final synthetic model, which is saved to temporary directory 4.
  • the results of the model prediction and the output address need to be output to the corresponding location, for example, stored in a table in the specified database.
  • each data block formed from the subdivided data set is entered into the specified model for construction and training to obtain a subdivided sub-model, and finally the multiple subdivided sub-models are combined through serial computation to constitute the user's final multi-layer model.
  • the construction or update of the user's final multi-layer model is thus the result of combining parallel model training with serial model computation.
  • the model flow and data block can be customized, which has flexibility.
  • the embodiments of the present application are also based on Docker deployment, and a parallel architecture is used to complete the training and update of the model on multiple machines in parallel, which can increase the utilization of multi-core CPU or multi-machine computing resources, reduce the memory requirements of the server, and improve the convenience of environment deployment.
  • the step of acquiring the corpus used for training in the first preset manner includes: S211, acquiring the initial data source website list of the target object; S212, classifying the initial data source website list according to preset conditions to obtain lists of different types of data source websites; S213, packaging the data source website lists into the corresponding Docker containers; S214, starting the Docker containers so that the Docker containers can obtain new data source websites by crawling; S215, adding the new data source websites to the corresponding data source website lists to update the data source websites of the target object; S216, crawling a corpus satisfying the preset conditions as the training corpus based on the updated data source websites.
  • the master server obtains the initial data source website list of the configured target object, and the crawler system automatically classifies the initial data source website list according to preset conditions to obtain lists of different types of data source websites; the data source websites are divided into different types according to the website identifier. The data source website lists are then packaged into the corresponding Docker containers, the Docker containers are deployed to different servers, and the Docker containers are started so that they obtain rich new data source websites by crawling; the new data source websites are added to the corresponding data source website lists to update and improve the target object's data source websites.
  • it includes the following sub-steps:
  • the code includes the part that extracts the URL of the website, and the code that matches the URL with the corresponding crawler program, so that the URL automatically corresponds to the crawler program.
  • the crawler crawls the website with the corresponding URL.
  • it is necessary to build an index relationship between URLs and crawler programs, and to prepare web crawlers for all URL types in advance, so that different types of URLs correspond to different crawler programs.
  • the container Docker1 classifies and segments the total input list through the crawler code, saves the data source lists of the same type to form lists to be crawled, and waits for crawling.
  • the input website URL list is classified according to URL type; the list segmentation code is then started to divide the different data source lists into several lists corresponding to the Docker containers on different machines.
  • the crawler program mines new URLs from the acquired URLs; that is, starting from a URL, the crawler program mines new URLs and stores them in the URL list to be crawled to supplement the URL list. At the same time, it can also be checked whether an error is reported during the data crawling process; if an error is reported, the crawling process for this website ends.
  • each type of URL list has a corresponding regular expression, and whether a URL belongs to a type is determined by judging whether the returned match result is empty: if the returned result is not empty, the URL is judged to be of this type; if the result is empty, it is judged not to be of this type.
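The "non-empty match result means this URL type" judgment above can be sketched as follows; the URL patterns here are hypothetical examples for illustration, not ones given in the patent:

```python
import re

# Hypothetical URL-type patterns; each URL type has its own regular expression.
URL_PATTERNS = {
    "news":  re.compile(r"^https?://news\.[\w.-]+/"),
    "forum": re.compile(r"^https?://bbs\.[\w.-]+/"),
}

def classify_url(url):
    """Return the first URL type whose regex matches, mirroring the
    'non-empty match result means this type' judgment; None if no type matches."""
    for url_type, pattern in URL_PATTERNS.items():
        if pattern.match(url):  # non-empty match result -> this type of URL
            return url_type
    return None
```

Each classified URL would then be routed to the crawler program indexed under its type.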
  • the training corpus corresponding to the sub-model is obtained from the training corpus according to the input requirements of the sub-model in an extraction manner, thereby improving the training efficiency of the sub-model.
  • the data samples selected in the above embodiments of the present application are full user data samples.
  • data blocks can also be extracted by equidistant sampling as the data blocks used to train the sub-models, to reduce the amount of training data and improve the training efficiency of the model. For example, if 100 data blocks are obtained after the corpus is divided into blocks and 10 models are generated in the embodiment of the present application, then blocks 1, 11, 21, and so on make up the 10 samples. The user-related corpus is summarized, and the obtained corpus is then classified and saved to form a user data set, where classification refers to extracting and saving data sets from multiple sources according to different model input requirements; the classification here depends on the input content of the model.
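The equidistant sampling scheme above (100 blocks, 10 models, blocks 1, 11, 21, ...) can be sketched as below, using zero-based indices in place of the document's one-based numbering:

```python
def equidistant_sample(blocks, n_models):
    """Assign data blocks to sub-models by equidistant sampling: with 100 blocks
    and 10 models, model 0 gets blocks 0, 10, 20, ..., model 1 gets 1, 11, 21, ..."""
    return [blocks[i::n_models] for i in range(n_models)]

blocks = list(range(100))           # 100 data blocks after corpus division
samples = equidistant_sample(blocks, 10)
```

Each of the 10 resulting samples covers the whole corpus at a regular stride, so every sub-model trains on a tenth of the data without clustering on any one region.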
  • for example, if the input of the corresponding sub-model needs to include the latitude and longitude of driving data, the training efficiency of the model can be improved by extracting and saving the data including the latitude and longitude of the driving data as the training corpus.
  • the model training methods described in the above embodiments can recombine the technical features contained in different embodiments as needed to obtain combined implementations, all of which fall within the scope of protection claimed by this application.
  • FIG. 5 is a schematic block diagram of a model training apparatus provided by an embodiment of the present application.
  • an embodiment of the present application further provides a model training device.
  • the model training apparatus includes a unit for performing the above model training method, and the apparatus may be configured in a computer device such as a server.
  • the model training device 500 includes an acquisition unit 501, a segmentation unit 502, a training unit 503 and a synthesis unit 504.
  • the obtaining unit 501 is used to obtain the training corpus in a first preset manner; the segmentation unit 502 is used to divide the corpus according to preset conditions to obtain multiple data blocks; the training unit 503 is used to input the data blocks into the corresponding sub-models according to a preset correspondence relationship to train each sub-model and obtain trained sub-models; and the synthesis unit 504 is used to synthesize the trained sub-models in a second preset manner to obtain a synthetic model.
  • the acquiring unit 501 includes: an acquiring subunit 5011 for acquiring the initial data source website list of a target object; a classification subunit 5012 for classifying the initial data source website list according to preset conditions to obtain lists of different types of data source websites; a packaging subunit 5013 for encapsulating the data source website lists into the corresponding Docker containers; a startup subunit 5014 for starting the Docker containers so that they can obtain new data source websites by crawling; an adding subunit 5015 for adding the new data source websites to the corresponding data source website lists to update the data source websites of the target object; and a subunit 5016 configured to crawl a corpus satisfying preset conditions as the training corpus based on the updated data source websites.
  • the model training device 500 further includes: a first extraction unit 505, configured to extract the training corpus corresponding to a sub-model from the training corpus according to the input requirements of the sub-model.
  • the segmentation unit 502 includes: a first classification subunit for classifying the corpus according to data type to obtain classified data; and a first segmentation subunit for dividing the classified data in the time order of a preset cycle to obtain multiple data blocks.
  • alternatively, the segmentation unit 502 includes: a second classification subunit for classifying the corpus according to data type to obtain classified data; and a second segmentation subunit for dividing the classified data according to a preset number of blocks or a preset size to obtain multiple data blocks.
  • the model training device 500 further includes: a second extraction unit 506 for extracting, by equidistant sampling, the data blocks used for training the sub-models.
  • the synthesizing unit 504 is configured to aggregate multiple sub-models after training through a fusion model to obtain a synthetic model.
  • the model training apparatus 500 further includes: a re-acquisition unit, which is used to obtain the correspondence between the data block and the sub-model.
  • the above model training device may be implemented in the form of a computer program, and the computer program may run on the computer device shown in FIG. 7.
  • FIG. 7 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 700 may be a computer device such as a desktop computer or a server, or may be a component or part in other devices.
  • the computer device 700 includes a processor 702, a memory, and a network interface 705 connected through a system bus 701, where the memory may include a non-volatile storage medium 703 and an internal memory 704.
  • the non-volatile storage medium 703 can store an operating system 7031 and a computer program 7032.
  • the processor 702 can execute one of the above model training methods.
  • the processor 702 is used to provide computing and control capabilities to support the operation of the entire computer device 700.
  • the internal memory 704 provides an environment for the operation of the computer program 7032 in the non-volatile storage medium 703.
  • the processor 702 can execute the above model training method.
  • the network interface 705 is used for network communication with other devices.
  • the specific computer device 700 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are the same as those in the embodiment shown in FIG. 7 and will not be repeated here.
  • the processor 702 is used to run a computer program 7032 stored in the memory, so as to implement the model training method of the embodiment of the present application.
  • the processor 702 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • the storage medium stores a computer program which, when executed by the processor, causes the processor to perform the steps of the model training method described in the above embodiments.
  • the storage medium is a physical, non-transitory storage medium, for example, a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk, or various other physical storage media that can store computer programs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a model training method, apparatus, computer device, and computer-readable storage medium. The method includes: obtaining a training corpus in a first preset manner; dividing the corpus according to preset conditions to obtain multiple data blocks; inputting the data blocks into the corresponding sub-models according to a preset correspondence relationship to train each sub-model and obtain trained sub-models; and synthesizing the trained sub-models in a second preset manner to obtain a synthetic model. When implementing model training, the embodiments of the present application combine parallel and serial processing: the corpus is divided into data blocks, the data blocks are input into the corresponding sub-models according to a preset correspondence, each sub-model is trained in parallel, and the multiple subdivided sub-models are then combined through serial computation to form the final multi-layer synthetic model.

Description

Model training method, apparatus, computer device, and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on January 4, 2019, with application number 201910008124.7 and the title "Model training method, apparatus, computer device, and computer-readable storage medium", the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the technical field of model construction, and in particular to a model training method, apparatus, computer device, and computer-readable storage medium.
Background
With the development and use of big data, user models trained with big data are becoming increasingly complex. In the traditional technology of model training, as the complexity of user models and the volume of training data grow, bottlenecks in the hardware resources of computer devices are frequently encountered, which affects the efficiency of model training.
Summary
Embodiments of the present application provide a model training method, apparatus, computer device, and computer-readable storage medium, which can solve the problem in the traditional technology of low model training efficiency caused by computer hardware bottlenecks during model training.
In a first aspect, an embodiment of the present application provides a model training method, the method including: obtaining a training corpus in a first preset manner; dividing the corpus according to preset conditions to obtain multiple data blocks; inputting the data blocks into the corresponding sub-models according to a preset correspondence relationship to train each sub-model and obtain trained sub-models; and synthesizing the trained sub-models in a second preset manner to obtain a synthetic model.
In a second aspect, an embodiment of the present application further provides a model training apparatus, including: an obtaining unit configured to obtain a corpus used for training in a first preset manner; a dividing unit configured to divide the corpus according to preset conditions to obtain multiple data blocks; a training unit configured to input the data blocks into the corresponding sub-models according to a preset correspondence relationship to train each sub-model and obtain trained sub-models; and a synthesizing unit configured to synthesize the trained sub-models in a second preset manner to obtain a synthetic model.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory and a processor. A computer program is stored in the memory, and the processor implements the model training method when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the model training method.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art may obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the model training method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the model training method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the data processing flow of the model training method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a sub-flow of the model training method provided by an embodiment of the present application;
FIG. 5 is a schematic block diagram of the model training apparatus provided by an embodiment of the present application;
FIG. 6 is another schematic block diagram of the model training apparatus provided by an embodiment of the present application; and
FIG. 7 is a schematic block diagram of the computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some rather than all of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
Please refer to FIG. 1, which is a schematic diagram of an application scenario of the model training method provided by an embodiment of the present application. The application scenario includes: (1) Terminal. An application program interface is installed on the terminal shown in FIG. 1. R&D personnel interact with the master server through the application program interface to provide content requiring manual input, such as an initial data source list, which is uploaded to the server's Docker container so that the master server can execute the steps of the model training method. The terminal may be an electronic device such as a notebook computer, tablet computer, or desktop computer; the terminal in FIG. 1 is connected to the master server. (2) Server. The server includes a master server and slave servers. The embodiment of the present application adopts a distributed system: the master server deploys multiple Docker containers to different slave servers, multiple sub-models are trained in parallel by the multiple slave servers, and the master server serially synthesizes the multiple trained sub-models to obtain a synthetic model, thereby improving the efficiency of model training. The master server in FIG. 1 is connected to the terminal and the slave servers, respectively.
The working process of each subject in FIG. 1 is as follows: the master server obtains a training corpus in a first preset manner; the training corpus may be obtained from the terminal, or may be crawled from the Internet according to a data source list obtained from the terminal. The master server divides the corpus according to preset conditions to obtain multiple data blocks; the data blocks are input, according to a preset correspondence, into the corresponding sub-models on the slave servers so that the slave servers train each sub-model; the master server obtains the trained sub-models and synthesizes them in a second preset manner to obtain a synthetic model.
It should be noted that FIG. 1 only illustrates a desktop computer as the terminal. In actual operation, the type of the terminal is not limited to that shown in FIG. 1; the terminal may also be an electronic device such as a mobile phone, notebook computer, or tablet computer. The above application scenario of the model training method is only used to illustrate the technical solution of the present application and is not intended to limit it; the above connection relationships may also take other forms.
FIG. 2 is a schematic flowchart of the model training method provided by an embodiment of the present application. The model training method is applied to the master server in FIG. 1 to complete all or part of the functions of the model training method.
Please refer to FIG. 2 and FIG. 3. FIG. 2 is a schematic flowchart of the model training method provided by an embodiment of the present application, and FIG. 3 is a schematic diagram of the data processing flow of the model training method provided by an embodiment of the present application. As shown in FIG. 2, the method includes the following steps S210-S240:
S210. Obtain a corpus used for training in a first preset manner.
Specifically, the server obtains the corpus used for training in a first preset manner. The first preset manner includes receiving training data sent by a terminal, or obtaining Internet data crawled by a crawler system. The training data sent by the terminal may be training data obtained through a storage device or data obtained through the Internet. When the crawler system crawls the corpus, it may crawl through a single computer device, or it may adopt a distributed crawler system to improve the efficiency of crawling data.
S220. Divide the corpus according to preset conditions to obtain multiple data blocks.
The preset conditions include division according to preset data size, data type, data structure, data source, processing depth, and time distribution. Division according to preset data size means dividing the data in units of stored data; commonly used data units include megabytes, gigabytes, terabytes, and petabytes, where 1 MB (megabyte) = 1024 KB, 1 GB (gigabyte) = 1024 MB, 1 TB (terabyte) = 1024 GB (where 1024 = 2^10), and 1 PB (petabyte) = 1024 TB. For example, the data may be divided by a data size of 500 MB, 2 GB, or 0.5 TB; especially in the case of big data, a suitable data size needs to be set according to the specific data volume. The data structure includes balanced-structure corpora and natural random-structure corpora. Processing depth includes annotated corpora and non-annotated corpora. Time distribution includes diachronic corpora and synchronic corpora. Data sources include websites, books, newspapers and periodicals, etc., and data types include technology, sports, finance, and other types of data.
Specifically, when training a complex model on big data, deploying the complex model on a single machine to process a large amount of data places high demands on the computer hardware. In practice, hardware resource bottlenecks are common and reduce model training efficiency. In the embodiments of the present application, the big data is divided into appropriately sized blocks, the complex model is decomposed into fine-grained sub-models, the sub-models and data blocks are deployed correspondingly onto different machines, and finally the sub-models are aggregated and combined to output the complete model, which improves training efficiency.
In the embodiments of the present application, after the corpus is obtained, it is split according to preset conditions to obtain multiple data blocks. Specifically, the following steps are performed. First, a templated operation is prepared in advance: the data-splitting operation can be launched through a customized Shell script. A Shell script is a program file in which a series of commands are placed in advance so that they can be executed in one go. Second, associated retrieval of the corpus is performed to obtain a rich and comprehensive corpus from multiple sources. For example, if the corpus is news coverage of a hot event from one website, the crawler system can crawl coverage of the same event from other related websites; since different websites report the event from different angles and thus form different corpora for it, associated retrieval of the event yields a rich and comprehensive corpus from multiple sources. Third, the dataset formed from the obtained user corpus is divided to obtain the data blocks. For example, the corpus may be divided in time order to obtain time-series data, using the user's time-series data as the sample set; or the data may be classified by data type; or time order and data type may be combined, i.e., data of different types is split in time order to obtain the split data blocks. After the dataset is divided into data blocks, they are saved to a preset address so that subsequent steps can fetch them from that address. Before dividing the dataset, the dividing code also needs to be packaged into a Docker container. For example, the code is packaged into Docker1; Docker1 is created and started, the dataset is divided, a single user's corpus is obtained, and a single large data block is divided into several data block subsets {D1, D2, D3, ..., Dn}, with the number of subsets as an input parameter; the subsets are saved to temporary directory 1 for retrieval in subsequent steps.
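The Docker1 splitting step described above (divide one large corpus into n sub-blocks {D1, ..., Dn}) can be sketched as follows. This is a minimal illustration only; the function name, the record representation, and the even-split rule are assumptions, since the patent leaves the concrete splitting logic configurable.

```python
def split_corpus(records, n_blocks):
    """Divide a list of corpus records into n roughly equal blocks,
    a sketch of the 'split into D1..Dn' step (block count is an
    input parameter, as in the text)."""
    if n_blocks <= 0:
        raise ValueError("n_blocks must be positive")
    size, rem = divmod(len(records), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        # the first `rem` blocks absorb one extra record each
        end = start + size + (1 if i < rem else 0)
        blocks.append(records[start:end])
        start = end
    return blocks

# Example: 10 records divided into 3 blocks of sizes 4, 3, 3
blocks = split_corpus(list(range(10)), 3)
```

In a real deployment each returned block would then be written to the preset address (temporary directory 1) for the training containers to pick up.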
Further, splitting the corpus according to preset conditions to obtain multiple data blocks may be done by first classifying the corpus by data type to obtain classified data and then splitting the classified data in time order by a preset period to obtain multiple data blocks; or by classifying the corpus by data type to obtain classified data and then splitting the classified data into a preset number of blocks or by a preset size to obtain multiple data blocks. Different approaches may be adopted flexibly according to actual needs.
In the embodiments of the present application, the sub-models are first initialized, and then, based on the constructed dataset, trained and built. By dividing the user corpus data, data blocks are obtained. The configuration of the data blocks can be customized in advance according to the model training conditions, offering the flexibility to meet different training conditions. For example, if a single computer device has a high hardware configuration, each data block can be made large and the number of blocks kept small; if a single device has a low configuration but devices are plentiful, each block can be made small and the number of blocks large.
S230: Feed the data blocks into the corresponding sub-models according to a preset correspondence to train each sub-model and obtain the trained sub-models.
Here, the preset correspondence refers to the mapping between data blocks and sub-models, that is, which sub-model processes which data blocks.
Specifically, the embodiments of the present application obtain the final composite model by combining multiple parallelized sub-modules with a serial model that merges them. Parallel training is adopted first: multiple sub-models are trained in parallel to lower the hardware requirements on any single computer device and improve the training efficiency of the complex model. In the embodiments, the big data file is split into evenly divided data blocks, and the data source of each block has its own model mechanism, i.e., a sub-model; the blocks are fed into sub-models on different machines and the sub-models are trained in parallel. Before training, the sub-models' parameters also need to be set, e.g., which sub-model corresponds to which data blocks at which address, or to which types of data blocks. After the split blocks are obtained, the preset block-to-sub-model correspondence is read from a storage file or database, and the blocks are fed into the corresponding sub-models accordingly.
Before training the sub-models, the code for building and updating them also needs to be packaged into Docker containers, e.g., Docker2. Several Docker2 containers are created, started, and deployed onto multiple computer devices, and the build-and-update code is launched; for example, the input parameters are the data subsets from temporary directory 1, and the output sub-models are saved to temporary directory 2. Each sub-model consumes the divided data blocks, i.e., sub-datasets of the obtained corpus, and the sub-models' inputs and outputs must be kept consistent. By training the sub-models with a parallel architecture, the computing resources of multi-core CPUs or multiple computer devices can be used, improving training efficiency.
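The block-to-sub-model correspondence and parallel training described above can be sketched in miniature as follows. This is an illustrative assumption, not the patent's implementation: the "training" here is a trivial stand-in (fitting a mean), and the correspondence is shown as an in-memory dict rather than a stored file or database.

```python
from concurrent.futures import ThreadPoolExecutor

def train_sub_model(model_id, data_block):
    """Stand-in for one sub-model's training run; here it just
    'fits' the mean of its assigned block."""
    return model_id, sum(data_block) / len(data_block)

# Preset correspondence: which sub-model consumes which data block
correspondence = {
    "model1": [1.0, 2.0],
    "model2": [3.0, 5.0],
    "model3": [4.0, 4.0],
}

# Train all sub-models in parallel (in deployment, each would run
# in its own Docker2 container on a separate machine)
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(train_sub_model, mid, blk)
               for mid, blk in correspondence.items()]
    trained = dict(f.result() for f in futures)
```

Each trained sub-model would then be saved to the shared location (temporary directory 2) for the fusion step.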
S240: Combine the trained sub-models in a second preset manner to obtain a composite model.
The second preset manner includes a fusion model, a preset combination scheme, or the like.
Specifically, the trained sub-models are aggregated and combined into the final composite model, which is the resulting complex model.
In one embodiment, the step of combining the trained sub-models in a second preset manner to obtain the composite model includes: aggregating the trained sub-models through a fusion model to obtain the composite model.
Specifically, after each sub-model is trained, its result is output, and the sub-models then pass through a fusion model for the final merge: the parallel sub-models produce their respective results, and the composite model is obtained by taking a weighted average over the sub-models. The weights used in the embodiments of the present application are computed from each sub-model's accuracy. For example, for a regression problem a regression model is used, whose evaluation metrics include MSE (Mean Square Error), RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and R-Squared; for classification, classification accuracy is used. The higher a sub-model's accuracy, the higher its weight in the fusion model; the lower its accuracy, the lower its weight.
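The accuracy-weighted averaging just described can be sketched as follows. The function name and the example accuracy values are illustrative assumptions; the point is only that each sub-model's weight is its accuracy normalized over all sub-models, so more accurate sub-models contribute more to the fused prediction.

```python
def fuse_predictions(sub_preds, accuracies):
    """Accuracy-weighted average of sub-model predictions:
    weight_m = accuracy_m / sum(accuracies), so a more accurate
    sub-model receives a proportionally higher weight."""
    total = sum(accuracies.values())
    return sum(sub_preds[m] * (acc / total)
               for m, acc in accuracies.items())

# Two sub-models predicting 10.0 and 20.0, with accuracies 0.9 and 0.6:
# weights are 0.6 and 0.4, so the fused prediction is 14.0
pred = fuse_predictions({"m1": 10.0, "m2": 20.0}, {"m1": 0.9, "m2": 0.6})
```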
In the embodiments of the present application, the final composite model is obtained through the parallel processing of the sub-models followed by serial processing over them, and is saved to a preset address. A serial model flow means passing through model A first and then model B. The read addresses of the sub-models and the output of model prediction are also involved; the prediction output is written to a preset address. In a specific implementation, the prediction code based on the multiple sub-user models is packaged into a Docker container, e.g., Docker3; several Docker3 containers are created and started; once all sub-models finish training, the sub-models {model1, model2, ..., modeln} are imported from temporary directory 2 and the multi-sub-model prediction code is launched, i.e., the results of the sub-models are averaged. The input parameters are the data files in temporary directory 3, and the output is the prediction result, i.e., the final composite model, saved to temporary directory 4. The prediction result and its output address both need to be written to the corresponding locations, e.g., a particular table in a specified database.
By dividing the user data into blocks and training models in parallel, each data block formed from the subdivided dataset is fed into a designated model for construction and training to obtain a fine-grained sub-model, and the sub-models are finally combined through joint computation to form the user's final multi-layer model; building or updating this final model is largely the result of combining parallel model training with serial model data flow. Both the model flow and the data blocks can be customized, offering flexibility. The embodiments are also deployed on Docker, completing model training and updating in parallel across multiple machines, which improves the utilization of multi-core CPU or multi-machine computing resources, reduces server memory requirements, and makes environment deployment more convenient.
Referring to FIG. 4, FIG. 4 is a schematic sub-flowchart of the model training method provided by an embodiment of the present application. As shown in FIG. 4, in this embodiment, the step of obtaining the training corpus in a first preset manner includes: S211: obtaining an initial data source website list for a target object; S212: classifying the initial data source website list according to preset conditions to obtain data source website lists of different types; S213: packaging the data source website lists into corresponding Docker containers; S214: starting the Docker containers so that they obtain new data source websites by crawling; S215: adding the new data source websites to the corresponding data source website lists to update the target object's data source websites; S216: crawling corpus satisfying preset conditions from the updated data source websites as the training corpus.
Specifically, the master server obtains the configured initial data source website list for the target; the crawler system automatically classifies the initial list according to its preset conditions to obtain lists of different types, e.g., classifying the data source websites by website identifier. The lists are then packaged into corresponding Docker containers, which are deployed onto different servers and started so that they obtain rich new data source websites by crawling; the new websites are added to the corresponding lists to update and complete the target's data source websites. Specifically, this includes the following sub-steps:
First, the initial website list is obtained. This list can be configured manually, i.e., the initial data source websites are provided by a person.
Second, the written crawler code is packaged into a Docker container. The code includes a part that extracts website URLs and a part that matches URLs to their corresponding crawling programs, so that each URL is automatically paired with the program that crawls it. An index between URLs and crawler programs needs to be built, and crawlers for all URL types prepared in advance, so that URLs of different types correspond to different crawler programs.
Third, container Docker1 is started, and the crawler code classifies and splits the overall input list, saving data source lists of the same type to form to-be-crawled lists awaiting crawling. Specifically, URL classification-and-splitting code is launched to classify the input website URL list by URL type; then list-splitting code is launched to divide the different data source lists into several lists, each corresponding to a Docker container on a different machine.
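The third sub-step above (group URLs by type, then split each type's list across containers) can be sketched as follows. The function names, the round-robin split, and the example URLs are assumptions for illustration; the patent does not fix a particular splitting rule.

```python
def partition_urls(urls, classify, n_containers):
    """Group source URLs by type, then split each type's list into
    n_containers sub-lists (round-robin), one per crawler container."""
    by_type = {}
    for u in urls:
        by_type.setdefault(classify(u), []).append(u)
    return {t: [lst[i::n_containers] for i in range(n_containers)]
            for t, lst in by_type.items()}

urls = ["http://a.news/1", "http://b.blog/2", "http://c.news/3"]
# A toy classifier standing in for the per-type regular expressions
parts = partition_urls(urls, lambda u: "news" if ".news" in u else "blog", 2)
```

Each sub-list in `parts` would then be handed to one Docker container on one machine as its to-be-crawled list.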
Fourth, container Docker2 is started. Using the obtained data source lists, each URL is matched to its crawler program — for example, website X corresponds to crawling-and-parsing code for website X, so passing in website X suffices to crawl it — the external network is accessed, the corresponding data is fetched separately, and the data is returned to the database.
Further, the crawler program mines new URLs from the URLs already obtained, i.e., it discovers new URLs from the ones it processes and stores them in the to-be-crawled URL list to complete the URL list. At the same time, it can be checked whether any errors were reported while crawling; if an error occurred, the crawling process for that website ends.
URLs can be classified through preset regular expressions. Each class of URL list has a corresponding regular expression, and whether a URL belongs to a class is decided by whether the match result is non-empty: if the returned result is non-empty, the URL is judged to belong to that class; if the result is empty, it is judged not to belong to that class.
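The regex-based classification rule just described ("non-empty match → that class, empty → not that class") can be sketched as follows. The class names and patterns here are illustrative assumptions, not the patent's actual URL taxonomy.

```python
import re

# One preset regular expression per URL class; patterns are examples only.
URL_PATTERNS = {
    "news": re.compile(r"^https?://news\."),
    "video": re.compile(r"^https?://video\."),
}

def classify_url(url):
    """Return the first class whose pattern yields a non-empty match,
    mirroring the non-empty-result test in the text."""
    for url_type, pattern in URL_PATTERNS.items():
        if pattern.search(url):   # non-empty match -> belongs to this class
            return url_type
    return "unknown"              # empty result for every class
```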
Fifth, the operation stops once the to-be-crawled website lists of all Docker2 containers are empty.
Sixth, the target's corpus is crawled from the data source websites to obtain the corpus used for training.
Further, in one embodiment, the training corpus corresponding to each sub-model is extracted from the training corpus according to that sub-model's input requirements, thereby improving the sub-model's training efficiency.
Specifically, the above embodiments use the user's full data sample. Further, data blocks may be sampled at equal intervals as the blocks used to train the sub-models, reducing the amount of training data and improving training efficiency. For example, if the corpus has been divided into 100 data blocks and 10 models are to be generated, blocks 1, 11, 21, and so on are taken to form 10 samples. The user-related corpus is aggregated, and the obtained corpus is then classified and saved to form the user dataset. Here, classification means extracting and saving data from multiple sources according to the input requirements of the different models; it depends on each model's input content. For example, for driving-trajectory prediction, the corresponding sub-model needs inputs including the vehicle's latitude and longitude, so the data containing those coordinates is extracted and saved. This improves the accuracy of the training data and removes noisy data, which can improve model training efficiency.
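The equal-interval sampling example above (100 blocks, 10 models, take blocks 1, 11, 21, ...) can be sketched as follows; the function name and the use of integer block labels are illustrative assumptions.

```python
def equidistant_sample(blocks, n_models):
    """Take every (len(blocks)//n_models)-th block: with 100 blocks
    and 10 models this selects blocks 1, 11, 21, ... (1-based),
    as in the text."""
    step = len(blocks) // n_models
    return [blocks[i] for i in range(0, len(blocks), step)][:n_models]

# Blocks labeled 1..100, sampled down to 10 training blocks
sample = equidistant_sample(list(range(1, 101)), 10)
```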
It should be noted that the model training methods described in the above embodiments may recombine the technical features contained in different embodiments as needed to obtain combined implementations, all of which fall within the protection scope claimed by the present application.
Referring to FIG. 5, FIG. 5 is a schematic block diagram of the model training apparatus provided by an embodiment of the present application. Corresponding to the above model training method, an embodiment of the present application further provides a model training apparatus. As shown in FIG. 5, the apparatus includes units for performing the above model training method, and may be configured in a computer device such as a server. Specifically, the model training apparatus 500 includes an obtaining unit 501, a splitting unit 502, a training unit 503, and a combining unit 504. The obtaining unit 501 is configured to obtain the training corpus in a first preset manner; the splitting unit 502 is configured to split the corpus according to preset conditions to obtain multiple data blocks; the training unit 503 is configured to feed the data blocks into the corresponding sub-models according to a preset correspondence to train each sub-model and obtain the trained sub-models; and the combining unit 504 is configured to combine the trained sub-models in a second preset manner to obtain a composite model.
Referring to FIG. 6, FIG. 6 is another schematic block diagram of the model training apparatus provided by an embodiment of the present application. As shown in FIG. 6, in this embodiment, the obtaining unit 501 includes: an obtaining subunit 5011 configured to obtain an initial data source website list for a target object; a classifying subunit 5012 configured to classify the initial data source website list according to preset conditions to obtain data source website lists of different types; a packaging subunit 5013 configured to package the data source website lists into corresponding Docker containers; a starting subunit 5014 configured to start the Docker containers so that they obtain new data source websites by crawling; an adding subunit 5015 configured to add the new data source websites to the corresponding lists to update the target object's data source websites; and a crawling subunit 5016 configured to crawl corpus satisfying preset conditions from the updated data source websites as the training corpus.
Continuing with FIG. 6, in this embodiment the model training apparatus 500 further includes: a first extracting unit 505 configured to extract, from the training corpus, the training corpus corresponding to each sub-model according to that sub-model's input requirements.
Continuing with FIG. 6, in this embodiment the splitting unit 502 includes: a first classifying subunit configured to classify the corpus by data type to obtain classified data; and a first splitting subunit configured to split the classified data in time order by a preset period to obtain multiple data blocks.
Continuing with FIG. 6, in this embodiment the splitting unit 502 includes: a second classifying subunit configured to classify the corpus by data type to obtain classified data; and a second splitting subunit configured to split the classified data into a preset number of blocks or by a preset size to obtain multiple data blocks.
Continuing with FIG. 6, in this embodiment the model training apparatus 500 further includes: a second extracting unit 506 configured to sample data blocks at equal intervals as the data blocks used to train the sub-models.
In one embodiment, the combining unit 504 is configured to aggregate the trained sub-models through a fusion model to obtain the composite model.
In one embodiment, the model training apparatus 500 further includes: a re-obtaining unit configured to obtain the correspondence between data blocks and sub-models.
It should be noted that a person skilled in the art can clearly understand that, for the specific implementation of the above model training apparatus and its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity of description, details are not repeated here. Moreover, the division and connection of the units in the above apparatus are only illustrative; in other embodiments, the apparatus may be divided into different units as needed, or its units may adopt different connection orders and manners, to complete all or part of the functions of the apparatus.
The above model training apparatus may be implemented in the form of a computer program that can run on a computer device as shown in FIG. 7.
Referring to FIG. 7, FIG. 7 is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 700 may be a computer device such as a desktop computer or a server, or a component or part of another device.
Referring to FIG. 7, the computer device 700 includes a processor 702, a memory, and a network interface 705 connected through a system bus 701, where the memory may include a non-volatile storage medium 703 and an internal memory 704. The non-volatile storage medium 703 may store an operating system 7031 and a computer program 7032. When executed, the computer program 7032 may cause the processor 702 to perform the above model training method.
The processor 702 provides computing and control capabilities to support the operation of the entire computer device 700. The internal memory 704 provides an environment for running the computer program 7032 in the non-volatile storage medium 703; when executed by the processor 702, the computer program 7032 may cause the processor 702 to perform the above model training method. The network interface 705 is used for network communication with other devices. A person skilled in the art can understand that the structure shown in FIG. 7 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device 700 to which the solution is applied; a specific computer device 700 may include more or fewer components than shown, combine certain components, or have a different arrangement of components. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structures and functions of the memory and processor are consistent with the embodiment shown in FIG. 7 and are not repeated here.
The processor 702 is configured to run the computer program 7032 stored in the memory to implement the model training method of the embodiments of the present application.
It should be understood that in the embodiments of the present application, the processor 702 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or any conventional processor.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program, which may be stored in a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the steps of the above method embodiments.
Therefore, an embodiment of the present application further provides a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of the model training method described in the above embodiments.
The storage medium is a physical, non-transitory storage medium, for example, a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, an optical disc, or any other physical storage medium capable of storing a computer program.
A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A model training method, comprising:
    obtaining a training corpus in a first preset manner;
    splitting the corpus according to preset conditions to obtain multiple data blocks;
    feeding the data blocks into corresponding sub-models according to a preset correspondence to train each sub-model and obtain trained sub-models; and
    combining the trained sub-models in a second preset manner to obtain a composite model.
  2. The model training method according to claim 1, wherein the step of obtaining a training corpus in a first preset manner comprises:
    obtaining an initial data source website list for a target object;
    classifying the initial data source website list according to preset conditions to obtain data source website lists of different types;
    packaging the data source website lists into corresponding Docker containers;
    starting the Docker containers so that the Docker containers obtain new data source websites by crawling;
    adding the new data source websites to the corresponding data source website lists to update the target object's data source websites; and
    crawling corpus satisfying preset conditions from the updated data source websites as the training corpus.
  3. The model training method according to claim 2, wherein, after the step of crawling corpus satisfying preset conditions from the updated data source websites as the training corpus, the method further comprises:
    extracting, from the training corpus, the training corpus corresponding to each sub-model according to the sub-model's input requirements.
  4. The model training method according to claim 3, wherein the step of splitting the corpus according to preset conditions to obtain multiple data blocks comprises:
    classifying the corpus by data type to obtain classified data; and
    splitting the classified data in time order by a preset period to obtain multiple data blocks.
  5. The model training method according to claim 3, wherein the step of splitting the corpus according to preset conditions to obtain multiple data blocks comprises:
    classifying the corpus by data type to obtain classified data; and
    splitting the classified data into a preset number of blocks or by a preset size to obtain multiple data blocks.
  6. The model training method according to claim 1, wherein, before the step of feeding the data blocks into corresponding sub-models according to a preset correspondence to train each sub-model and obtain trained sub-models, the method further comprises:
    sampling data blocks at equal intervals as the data blocks used to train the sub-models.
  7. The model training method according to claim 1, wherein the step of combining the trained sub-models in a second preset manner to obtain a composite model comprises:
    aggregating the trained sub-models through a fusion model to obtain the composite model.
  8. The model training method according to claim 1, wherein, before the step of feeding the data blocks into corresponding sub-models according to a preset correspondence to train each sub-model, the method further comprises:
    obtaining the correspondence between the data blocks and the sub-models.
  9. A model training apparatus, comprising:
    an obtaining unit configured to obtain a corpus used for training in a first preset manner;
    a splitting unit configured to split the corpus according to preset conditions to obtain multiple data blocks;
    a training unit configured to feed the data blocks into corresponding sub-models according to a preset correspondence to train each sub-model and obtain trained sub-models; and
    a combining unit configured to combine the trained sub-models in a second preset manner to obtain a composite model.
  10. The model training apparatus according to claim 9, wherein the obtaining unit comprises:
    an obtaining subunit configured to obtain an initial data source website list for a target object;
    a classifying subunit configured to classify the initial data source website list according to preset conditions to obtain data source website lists of different types;
    a packaging subunit configured to package the data source website lists into corresponding Docker containers;
    a starting subunit configured to start the Docker containers so that the Docker containers obtain new data source websites by crawling;
    an adding subunit configured to add the new data source websites to the corresponding data source website lists to update the target object's data source websites; and
    a crawling subunit configured to crawl corpus satisfying preset conditions from the updated data source websites as the training corpus.
  11. A computer device, comprising a memory and a processor connected to the memory, wherein the memory is configured to store a computer program, and the processor is configured to run the computer program stored in the memory to perform the following steps:
    obtaining a training corpus in a first preset manner;
    splitting the corpus according to preset conditions to obtain multiple data blocks;
    feeding the data blocks into corresponding sub-models according to a preset correspondence to train each sub-model and obtain trained sub-models; and
    combining the trained sub-models in a second preset manner to obtain a composite model.
  12. The computer device according to claim 11, wherein the step of obtaining a training corpus in a first preset manner comprises:
    obtaining an initial data source website list for a target object;
    classifying the initial data source website list according to preset conditions to obtain data source website lists of different types;
    packaging the data source website lists into corresponding Docker containers;
    starting the Docker containers so that the Docker containers obtain new data source websites by crawling;
    adding the new data source websites to the corresponding data source website lists to update the target object's data source websites; and
    crawling corpus satisfying preset conditions from the updated data source websites as the training corpus.
  13. The computer device according to claim 12, wherein, after the step of crawling corpus satisfying preset conditions from the updated data source websites as the training corpus, the steps further comprise:
    extracting, from the training corpus, the training corpus corresponding to each sub-model according to the sub-model's input requirements.
  14. The computer device according to claim 13, wherein the step of splitting the corpus according to preset conditions to obtain multiple data blocks comprises:
    classifying the corpus by data type to obtain classified data; and
    splitting the classified data in time order by a preset period to obtain multiple data blocks.
  15. The computer device according to claim 13, wherein the step of splitting the corpus according to preset conditions to obtain multiple data blocks comprises:
    classifying the corpus by data type to obtain classified data; and
    splitting the classified data into a preset number of blocks or by a preset size to obtain multiple data blocks.
  16. The computer device according to claim 11, wherein, before the step of feeding the data blocks into corresponding sub-models according to a preset correspondence to train each sub-model and obtain trained sub-models, the steps further comprise:
    sampling data blocks at equal intervals as the data blocks used to train the sub-models.
  17. The computer device according to claim 11, wherein the step of combining the trained sub-models in a second preset manner to obtain a composite model comprises:
    aggregating the trained sub-models through a fusion model to obtain the composite model.
  18. The computer device according to claim 11, wherein, before the step of feeding the data blocks into corresponding sub-models according to a preset correspondence to train each sub-model, the steps further comprise:
    obtaining the correspondence between the data blocks and the sub-models.
  19. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the following steps:
    obtaining a training corpus in a first preset manner;
    splitting the corpus according to preset conditions to obtain multiple data blocks;
    feeding the data blocks into corresponding sub-models according to a preset correspondence to train each sub-model and obtain trained sub-models; and
    combining the trained sub-models in a second preset manner to obtain a composite model.
  20. The computer-readable storage medium according to claim 19, wherein the step of obtaining a training corpus in a first preset manner comprises:
    obtaining an initial data source website list for a target object;
    classifying the initial data source website list according to preset conditions to obtain data source website lists of different types;
    packaging the data source website lists into corresponding Docker containers;
    starting the Docker containers so that the Docker containers obtain new data source websites by crawling;
    adding the new data source websites to the corresponding data source website lists to update the target object's data source websites; and
    crawling corpus satisfying preset conditions from the updated data source websites as the training corpus.
PCT/CN2019/117295 2019-01-04 2019-11-12 Model training method, apparatus, computer device and computer-readable storage medium WO2020140613A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910008124.7A CN109885378A (zh) 2019-01-04 2019-01-04 模型训练方法、装置、计算机设备及计算机可读存储介质
CN201910008124.7 2019-01-04

Publications (1)

Publication Number Publication Date
WO2020140613A1 true WO2020140613A1 (zh) 2020-07-09

Family

ID=66925610

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117295 WO2020140613A1 (zh) 2019-01-04 2019-11-12 Model training method, apparatus, computer device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN109885378A (zh)
WO (1) WO2020140613A1 (zh)



Also Published As

Publication number Publication date
CN109885378A (zh) 2019-06-14

