TWI761331B - Sample serialization method and apparatus

Info

Publication number: TWI761331B
Application number: TW106104783A
Authority: TW
Inventors: 周俊
Original assignee: 香港商阿里巴巴集團服務有限公司
Priority date: 2016-03-11
Filing date: 2017-02-14
Publication date: 2022-04-21
Also published as: CN107180017B; CN107180017A; WO2017152766A1; TW201734838A

Abstract

Embodiments of the present invention provide a sample serialization method and apparatus, relating to the technical field of machine training. The method includes: acquiring each string in the samples to be serialized; determining the management server corresponding to each string according to the correspondence between strings and management servers; sending each string to its corresponding management server, so that each management server converts the received strings into corresponding serialized IDs according to the mapping table it maintains, where the strings in the mapping tables maintained by different management servers are mutually distinct; receiving the serialized ID corresponding to each string returned by the management servers; and converting the strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs. The invention shortens the lookup time for a string's serialized ID, which reduces the time needed to serialize samples and improves serialization efficiency.

Description

Sample serialization method and apparatus

The present invention relates to the technical field of machine training, and in particular to a sample serialization method and a sample serialization apparatus.

On the Internet, users' network behavior generates a large amount of data. To study users' behavioral habits, various models may be built, and machine learning systems are generally used to train them. In a machine learning system, the strings in each dimension of the sample data may not themselves be serialized IDs: rather than being numeric IDs, they are named according to business requirements. Training directly on the strings of the sample data therefore involves a relatively large amount of computation and high resource consumption.

Therefore, to reduce the amount of computation, all strings in the sample data need to be converted into serialized IDs, such as numeric IDs, before training. For example, a piece of sample data has the following two-column format. The first column is the label column, which records whether the user clicked: 1 means the user clicked and 0 means the user did not click. The second column is the feature column, which holds all the features of the sample, separated by commas. For example:

1 user_id_123,age_1,sex_1,age_comb_city3

The strings "user_id_123,age_1,sex_1,age_comb_city3" then all need to be converted into numeric IDs; that is, the following mapping needs to be established: {set of strings} -> {set of numbers}

The mapping obtained for "user_id_123,age_1,sex_1,age_comb_city3" is then: user_id_123 -> number X, age_1 -> number Y, sex_1 -> number Z, age_comb_city3 -> number F.
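For reference, the single-machine mapping {set of strings} -> {set of numbers} can be sketched as follows. The function name `build_mapping` and the first-seen, incrementing ID scheme are assumptions made for illustration; the text only requires that each string map to some number.

```python
def build_mapping(samples):
    """Assign a numeric ID to each distinct feature string, in first-seen order."""
    mapping = {}
    for line in samples:
        # Each sample has two columns: the label and a comma-separated feature list.
        _label, features = line.split(maxsplit=1)
        for feature in features.split(","):
            if feature not in mapping:
                mapping[feature] = len(mapping)  # assumed scheme: 0, 1, 2, ...
    return mapping


samples = ["1 user_id_123,age_1,sex_1,age_comb_city3"]
mapping = build_mapping(samples)
```

In this sketch the whole table lives in one process, which is the arrangement the following paragraphs identify as the bottleneck.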

However, the inventor found in practice that when the string set has a very large number of elements, it does not fit in the memory of a single machine and serializing the sample data takes a very long time. For example, with 2 billion strings, each machine needs to load the complete mapping table, which requires more than 40 GB of memory, and serialization is also very slow.

In view of the above problems, embodiments of the present invention are proposed to provide a sample serialization method and a corresponding sample serialization apparatus that overcome the above problems or at least partially solve them.

To solve the above problems, the present invention discloses a sample serialization method, including: acquiring each string in the samples to be serialized; determining the management server corresponding to each string according to the correspondence between strings and management servers; sending each string to its corresponding management server, so that each management server converts the received strings into corresponding serialized IDs according to the mapping table it maintains, where the strings in the mapping tables maintained by different management servers are mutually distinct; receiving the serialized ID corresponding to each string returned by the management servers; and converting the strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

The present invention also discloses a sample serialization method, including: receiving strings, where the strings are sent by a serialization server according to the correspondence between strings and management servers and are obtained by the serialization server from sample data; converting the received strings into serialized IDs according to a locally maintained mapping table, where the strings in the mapping tables maintained by different management servers are mutually distinct; and returning the serialized IDs corresponding to the strings to the corresponding serialization server, so that the serialization server converts the strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

The present invention also discloses a sample serialization apparatus, including: a string extraction module, configured to acquire each string in the samples to be serialized; a management server determination module, configured to determine the management server corresponding to each string according to the correspondence between strings and management servers; a string sending module, configured to send each string to its corresponding management server, so that each management server converts the received strings into corresponding serialized IDs according to the mapping table it maintains, where the strings in the mapping tables maintained by different management servers are mutually distinct; a serialized ID receiving module, configured to receive the serialized ID corresponding to each string returned by the management servers; and a sample serialization module, configured to convert the strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

The present invention also discloses a sample serialization apparatus, including: a string receiving module, configured to receive strings, where the strings are sent by a serialization server according to the correspondence between strings and management servers and are obtained by the serialization server from sample data; a string conversion module, configured to convert the received strings into serialized IDs according to a locally maintained mapping table, where the strings in the mapping tables maintained by different management servers are mutually distinct; and a digitized ID return module, configured to return the serialized IDs corresponding to the strings to the corresponding serialization server, so that the serialization server converts the strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

Embodiments of the present invention include the following advantages. The mapping table required for serialization is distributed across multiple management servers; the strings maintained in the mapping tables of different management servers are mutually distinct, and so are the digitized IDs of those strings. The serialization server only needs to send the strings in the samples to be serialized to the corresponding management servers according to the correspondence between strings and management servers; the management server then obtains each string's serialized ID, such as a numeric ID, and returns it to the serialization server. The serialization server can thus convert the samples into digitized samples for subsequent training. In this way, the serialization server does not need to load the mapping table, which avoids running out of memory on the serialization server. In addition, because the mapping table is spread over multiple management servers, each management server takes less time to look up a string's serialized ID, which reduces the time needed to serialize samples and improves serialization efficiency.

110‧‧‧Step
120‧‧‧Step
130‧‧‧Step
140‧‧‧Step
150‧‧‧Step
210‧‧‧Step
220‧‧‧Step
230‧‧‧Step
310‧‧‧Step
312‧‧‧Step
314‧‧‧Step
316‧‧‧Step
318‧‧‧Step
320‧‧‧Step
322‧‧‧Step
324‧‧‧Step
326‧‧‧Step
328‧‧‧Step
330‧‧‧Step
332‧‧‧Step
410‧‧‧String extraction module
420‧‧‧Management server determination module
430‧‧‧String sending module
440‧‧‧Serialized ID receiving module
450‧‧‧Sample serialization module
510‧‧‧String receiving module
520‧‧‧String conversion module
530‧‧‧Digitized ID return module
600‧‧‧Scheduling server
601‧‧‧Notification module
700‧‧‧Serialization server
701‧‧‧Sample acquisition module
702‧‧‧String extraction module
703‧‧‧String remainder module
704‧‧‧First remainder determination module
705‧‧‧String sending module
706‧‧‧Serialized ID receiving module
707‧‧‧Sample serialization module
708‧‧‧Output module
800‧‧‧Management server (each)
801‧‧‧String receiving module
802‧‧‧String conversion module
803‧‧‧Digitized ID return module

FIG. 1 is a flow chart of the steps of an embodiment of a sample serialization method of the present invention, described from the serialization server side; FIG. 2 is a flow chart of the steps of an embodiment of a sample serialization method of the present invention, described from the management server side; FIG. 3 is a flow chart of the steps of an embodiment of a sample serialization method of the present invention; FIG. 4 is a structural block diagram of an embodiment of a sample serialization apparatus of the present invention; FIG. 5 is a structural block diagram of an embodiment of a sample serialization apparatus of the present invention; FIG. 6 is a structural block diagram of an embodiment of a sample serialization system of the present invention.

To make the above objects, features, and advantages of the present invention easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.

One of the core ideas of the embodiments of the present invention is to distribute the mapping table required for serialization across multiple management servers, where the strings maintained in the mapping tables of different management servers are mutually distinct, and so are the serialized IDs of those strings. For the sample data to be serialized, the serialization server only needs to extract the strings from the sample data and send each string to the corresponding management server according to the correspondence between strings and management servers; the management server then obtains the string's serialized ID and returns it to the serialization server. The serialization server can then convert the samples into digitized samples for subsequent training. In this way, the serialization server does not need to load the mapping table, which avoids running out of memory on the serialization server. In addition, because the mapping table is spread over multiple management servers, each management server takes less time to look up a string's serialized ID, which reduces the time needed to serialize samples and improves serialization efficiency.

Embodiment 1

Referring to FIG. 1, a flow chart of the steps of an embodiment of a sample serialization method of the present invention is shown, which may specifically include the following steps:

Step 110: acquire each string in the samples to be serialized.

In the embodiment of the present invention, the serialization server first receives the sample data to be serialized. In a preferred embodiment, before step 110, the method further includes:

Step S100: acquire the sample data to be serialized.

An embodiment of the present invention may have one or more serialization servers (slaves). Each serialization server can, upon notification from the scheduling server (coordinator), fetch the batch of sample data that it is to process.

In the embodiment of the present invention, the serialization servers, the management servers, and the scheduling server can together form a training cluster for machine training.

In another preferred embodiment of the present invention, the step of acquiring the sample data to be serialized includes:

Sub-step S11: acquire the batch of sample data that belongs to the current serialization server after the scheduling server has distributed all the sample data evenly.

For example, suppose the training cluster contains two serialization servers, serialization server A and serialization server B, and there are 10,000 pieces of sample data in total. The scheduling server can divide the 10,000 pieces into two parts of 5,000 each and notify serialization server A and serialization server B to fetch their respective 5,000 pieces.
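The even distribution in sub-step S11 can be sketched as below. The function name `split_evenly` and the choice of contiguous batches are assumptions for illustration; the text does not say how the scheduling server partitions the samples beyond making the shares equal.

```python
def split_evenly(samples, num_servers):
    """Split samples into num_servers contiguous batches whose sizes differ by at most one."""
    base, extra = divmod(len(samples), num_servers)
    batches, start = [], 0
    for i in range(num_servers):
        # The first `extra` servers take one additional sample when the split is uneven.
        size = base + (1 if i < extra else 0)
        batches.append(samples[start:start + size])
        start += size
    return batches


# Two serialization servers, 10,000 samples: 5,000 each.
batches = split_evenly(list(range(10000)), 2)
```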

Of course, sub-step S11 is only one preferred approach; other allocation schemes are also possible, and the embodiment of the present invention does not limit them. For example, allocation can be based on the performance of the serialization servers: after receiving the uploaded sample data and before allocating it to the serialization servers, the scheduling server first obtains each serialization server's hardware performance, and a server whose hardware performance falls within a given range is allocated a corresponding proportion of the sample data.

Further, in the embodiment of the present invention, each serialization server, after acquiring the sample data it is to serialize, extracts strings from the samples. For example, one sample is as follows:

1 user_id_123,age_1,sex_1,age_comb_city3

The sample data has two columns. The first column is the label column, which indicates whether the user clicked: a value of 1 means the user clicked and a value of 0 means the user did not click. The second column is the feature column, whose value is all the features of the sample, separated by commas.

The serialization server of the present invention then extracts "user_id_123", "age_1", "sex_1", and "age_comb_city3" from the feature column.

It should be understood that the above is merely an example of the extracted strings; the present invention is not limited to it, and sample data in other formats can also be used.

Note that, in the embodiment of the present invention, the strings extracted from the sample data are those that are not purely numeric, such as the aforementioned "user_id_123", "age_1", "sex_1", and "age_comb_city3". A feature in the feature column that is a pure number is not extracted.
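A minimal sketch of this extraction step follows. It assumes the two-column, whitespace-then-comma format of the example sample and treats "pure number" as a feature made up only of digits; both are assumptions, and `extract_strings` is a hypothetical name.

```python
def extract_strings(sample_line):
    """Return the feature strings of one sample, skipping purely numeric features."""
    # First column is the label; the rest is the comma-separated feature column.
    _label, features = sample_line.split(maxsplit=1)
    # str.isdigit() is True only when every character is a digit.
    return [f for f in features.split(",") if f and not f.isdigit()]


strings = extract_strings("1 user_id_123,age_1,sex_1,age_comb_city3")
```

A feature such as `42` would be left out of the result, since it is already numeric.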

In the embodiment of the present invention, the format of the sample data can be analyzed in advance to determine how strings are to be extracted from it, for example which extraction template to use. Of course, the scheduling server can determine in advance how strings are to be extracted from the sample data and then notify each serialization server.

Of course, in the embodiment of the present invention, sample data can be serialized one piece at a time: the strings of one piece of sample data are extracted and sent to the corresponding management servers, and the next piece is serialized after that piece is done. Serialization can also be done in batches, that is, the strings of a batch of sample data are sent to the corresponding management servers at once.

Step 120: determine the management server corresponding to each string according to the correspondence between strings and management servers.

The serialization server of the embodiment of the present invention can send the extracted strings to the corresponding management server (master). In the embodiment of the present invention, each string is maintained in the mapping table of a particular management server. The correspondence between strings and management servers can be agreed upon in some manner.

In a preferred embodiment of the present invention, the step of determining the management server corresponding to each string according to the correspondence between strings and management servers includes:

Sub-step S21: divide the hash value of the string by the number of management servers to obtain the remainder.

Sub-step S22: determine the management server corresponding to the string according to the correspondence between remainders and management servers.

In the embodiment of the present invention, taking the aforementioned string "user_id_123" as an example, the hash value hash_value of the string is computed, and hash_value is then divided by the total number P of management servers to obtain the remainder, that is, hash_value % P.

In the embodiment of the present invention, the correspondence between each possible remainder and a management server is set in advance.

For example, with 2 management servers, the possible remainders are 0 and 1. Remainder 0 can be mapped to management server A and remainder 1 to management server B. Strings whose hash_value leaves remainder 0 when divided by 2 are then all sent to management server A, and strings whose hash_value leaves remainder 1 are all sent to management server B.

In the embodiment of the present invention, to make the correspondence between remainders and management servers direct, the management servers can be named after the remainders themselves, so that once the remainder is computed, the target management server is known immediately.
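Sub-steps S21 and S22 amount to computing hash_value % P and looking up the server for that remainder. The sketch below assumes CRC-32 as the hash function, since the text does not name one; any hash works as long as every serialization server uses the same one. Python's built-in `hash()` is avoided here because it is salted per process for strings. The names `server_for` and `SERVERS` are hypothetical.

```python
import zlib

# Remainder -> management server, set in advance (remainder 0 is A, remainder 1 is B).
SERVERS = ["A", "B"]


def server_for(string, num_servers):
    """Return hash_value % P for a string, where P is the number of management servers."""
    hash_value = zlib.crc32(string.encode("utf-8"))  # assumed hash function
    return hash_value % num_servers


remainder = server_for("user_id_123", len(SERVERS))
target = SERVERS[remainder]
```

Indexing the server list by the remainder mirrors the suggestion of naming management servers after their remainders.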

In another preferred embodiment of the present invention, after the step of acquiring each string in the samples to be serialized, the method further includes:

Step S31: deduplicate the strings.

In the embodiment of the present invention, to reduce the computation load on the management servers and the network usage, the strings can be deduplicated first.

As a result, the strings sent to a management server in each request are unique: no duplicate strings are sent, no duplicate serialized IDs are returned, and no extra network bandwidth is consumed. The strings a management server receives each time are likewise unique, so each string is processed only once per request, which reduces the management server's computation.
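Deduplication combined with routing can be sketched as follows: duplicates are dropped first, then the remaining strings are bucketed per management server so that each server receives each string at most once. CRC-32 and the name `group_by_server` are assumptions for illustration.

```python
import zlib
from collections import defaultdict


def group_by_server(strings, num_servers):
    """Deduplicate strings, then bucket them by crc32(string) % num_servers."""
    buckets = defaultdict(list)
    # dict.fromkeys keeps first-seen order and drops repeated strings.
    for s in dict.fromkeys(strings):
        remainder = zlib.crc32(s.encode("utf-8")) % num_servers
        buckets[remainder].append(s)
    return dict(buckets)


requests = group_by_server(["age_1", "sex_1", "age_1", "user_id_123", "sex_1"], 2)
```

Each value in `requests` is the payload for one management server; a string that occurs many times in a batch of samples is sent only once.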

Step 130: send each string to its corresponding management server, so that each management server converts the received strings into corresponding serialized IDs according to the mapping table it maintains, where the strings in the mapping tables maintained by different management servers are mutually distinct.

In the embodiment of the present invention, each management server can obtain in advance the strings it is responsible for maintaining and then build its own mapping table. The mapping table is a lookup table from strings to serialized IDs.

In the embodiment of the present invention, the serialized ID is a numeric ID, because numbers are the easiest to plug into formulas for computation during training.

In the embodiment of the present invention, for each string, the hash value of the string can be divided by the number of management servers and the remainder taken; the remainder corresponds to a management server. With 2 management servers as above, 0 corresponds to management server A and 1 to management server B. The string can then be sent to the corresponding management server according to this correspondence, and that management server can build its mapping table from the strings it receives.

In practice, after obtaining its samples, each serialization server first extracts all the strings from all the samples, computes the hash value of each string, divides each hash value by the total number of management servers and takes the remainder, and then sends each string to the corresponding management server according to the correspondence between remainders and management servers.

After receiving a string, the management server generates a serialized ID for it and then builds the mapping table from the strings and their serialized IDs.

On receiving a string, the management server looks up the string's serialized ID in its locally maintained mapping table and returns that serialized ID to the serialization server. In practice, the management server can return the string together with its serialized ID.
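The management server's behavior can be sketched as a small in-memory class. The text does not say how a management server generates IDs that differ from those of other servers; the interleaved scheme below (local count times P, plus the server's own remainder) is one assumed way to do it, and the class and method names are hypothetical.

```python
class ManagementServer:
    """Holds one shard of the string -> serialized ID mapping table."""

    def __init__(self, index, num_servers):
        self.index = index              # the remainder this server is responsible for
        self.num_servers = num_servers  # P, the total number of management servers
        self.table = {}

    def lookup(self, strings):
        """Return {string: ID}, assigning an ID to any string not yet in the table."""
        result = {}
        for s in strings:
            if s not in self.table:
                # Assumed scheme: IDs on this server are congruent to index mod P,
                # so two servers can never hand out the same ID.
                self.table[s] = len(self.table) * self.num_servers + self.index
            result[s] = self.table[s]
        return result


server_a = ManagementServer(index=0, num_servers=2)
server_b = ManagementServer(index=1, num_servers=2)
ids_a = server_a.lookup(["x", "y", "x"])
ids_b = server_b.lookup(["z"])
```

Returning a dictionary keyed by string matches the note that the string can be returned together with its serialized ID.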

Step 140: receive the serialized ID corresponding to each string returned by the management servers.

After sending the strings of the sample data, the serialization server can receive the serialized IDs corresponding to those strings from the management servers.

Step 150: convert the strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

After receiving the serialized ID of each string, the serialization server converts the strings in the sample data into the corresponding serialized IDs. For example, suppose the serialized ID of "user_id_123" is 11, that of "age_1" is 13, that of "sex_1" is 24, and that of "age_comb_city3" is 55. The converted, serialized sample data is then:

1 11,13,24,55
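Step 150 can be sketched as a substitution over the feature column, using the example IDs given above. The function name `serialize_sample` and the output layout (label, a space, comma-separated IDs) are assumptions for illustration.

```python
def serialize_sample(sample_line, id_of):
    """Replace each feature string with its serialized ID.

    Features without an entry in id_of (for example, features that were
    already pure numbers and so never sent out) are passed through unchanged.
    """
    label, features = sample_line.split(maxsplit=1)
    converted = [str(id_of.get(f, f)) for f in features.split(",")]
    return f"{label} {','.join(converted)}"


ids = {"user_id_123": 11, "age_1": 13, "sex_1": 24, "age_comb_city3": 55}
line = serialize_sample("1 user_id_123,age_1,sex_1,age_comb_city3", ids)
```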

The serialized sample data can then be used for subsequent machine training, which speeds up training and improves training efficiency.

In the embodiment of the present invention, first, the mapping table required for serialization is distributed across multiple management servers; the strings maintained in the mapping tables of different management servers are mutually distinct, and so are the digitized IDs of those strings. Because the complete mapping table is spread over multiple management servers, each management server takes less time to look up a string's serialized ID, which reduces the time needed to serialize samples and improves serialization efficiency.

Second, the serialization server only needs to send the strings in the samples to be serialized to the corresponding management servers according to the correspondence between strings and management servers; the management server then obtains each string's serialized ID and returns it to the serialization server. The serialization server therefore does not itself store the complete mapping table required for serialization, which avoids running out of memory on the serialization server and improves its performance.

Embodiment 2

Referring to FIG. 2, a flow chart of the steps of an embodiment of a sample serialization method of the present invention is shown, which may specifically include the following steps:

Step 210: receive strings, where the strings are sent by a serialization server according to the correspondence between strings and management servers and are obtained by the serialization server from sample data.

In the embodiment of the present invention, each management server receives strings sent by one or more serialization servers.

In the embodiment of the present invention, on the serialization server side, strings are extracted from the sample data to be serialized, the management server is determined according to the correspondence between strings and management servers, and the strings are then sent to that management server.

對於各個序列化伺服器而言，其根據字符串與各管理伺服器之間的對應關係確定管理伺服器，將字符串發送至該管理伺服器包括：子步驟S51，將字符串對應的哈希值除以管理伺服器的個數，得到餘數；子步驟S52，根據餘數與管理伺服器的對應關係，確定字符串對應的管理伺服器。 For each serialization server, it determines the management server according to the corresponding relationship between the character string and each management server, and sending the character string to the management server includes: sub-step S51, hashing the hash corresponding to the character string Divide the value by the number of management servers to get the remainder; Sub-step S52, according to the corresponding relationship between the remainder and the management server, determine the management server corresponding to the character string.
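Sub-steps S51 and S52 can be illustrated with a minimal Python sketch. The source does not name a hash function, so the use of CRC32 here is an assumption, as are the helper names `stable_hash` and `route` and the sample strings.

```python
import zlib


def stable_hash(s: str) -> int:
    """Deterministic hash of a string (CRC32 is an assumption, not from the source)."""
    return zlib.crc32(s.encode("utf-8"))


def route(s: str, servers: list[str]) -> str:
    """Sub-steps S51/S52: the remainder of hash / server count selects the server."""
    remainder = stable_hash(s) % len(servers)
    return servers[remainder]


servers = ["A", "B"]  # remainder 0 -> management server A, remainder 1 -> B
for s in ["user_id=42", "city=Hangzhou"]:
    print(s, "->", route(s, servers))
```

A hash that is stable across processes matters here: every serialization server must compute the same remainder for the same string, otherwise one string could end up in two mapping tables.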

In a preferred embodiment of the present invention, the mapping table that each management server needs to maintain can be built in real time, in which case the method further includes, before step 210:

Step S201, obtaining a batch of strings belonging to the current management server, where the batch of strings belonging to the current management server differs from the strings belonging to other management servers.

In this embodiment of the present invention, multiple management servers may be provided; each management server obtains its own batch of strings, and the strings obtained by different management servers are mutually distinct.

Each management server may obtain in advance the strings it is responsible for maintaining and then build its own mapping table.

For each string, the hash value of the string may be divided by the total number of management servers and the remainder taken; each remainder corresponds to one management server. For example, with two management servers, remainder 0 corresponds to management server A and remainder 1 corresponds to management server B. The string is then sent to the management server that corresponds to its remainder, and that management server can build its mapping table from the strings it receives.

In practice, after each serialization server obtains its samples, it first extracts all strings from all of its samples, computes the hash value of each string, divides that hash value by the total number of management servers and takes the remainder, and then sends the string to the corresponding management server according to the correspondence between remainders and management servers.

Here, the remainder of every string in the batch belonging to the current management server is a remainder assigned to the current management server; the remainder is obtained by dividing the hash value of the string by the number of management servers.

Step S202, serializing the strings and building a mapping table between strings and serialization IDs.

After receiving the strings, the management server generates a serialization ID for each string and then builds the mapping table from the strings and their serialization IDs.

Preferably, the step of serializing the strings and building the mapping table between strings and serialization IDs includes:

Sub-step S41, obtaining a first total number N1 of strings held by the management servers that precede the current management server in order.

For example, suppose there are management servers A, B, and C, ordered A, B, C. The first management server A has 110 strings, the second management server B has 90 strings, and the third management server C has 100 strings.

No management server precedes management server A, so for A the first total number is N1 = 0.

Management server A precedes management server B, so for B the first total number is N1 = 110.

Management servers A and B precede management server C, so for C the first total number is N1 = 110 + 90 = 200.

Sub-step S42, adding the number M of strings of the current management server to the first total number N1 to obtain a second total number N2;

Sub-step S43, using [N1+1, N2] as the range of serialization IDs that the current management server assigns to its strings.

Management server A has M = 110 strings, so its serialization range is [1, 110], and the strings in management server A are assigned the serialization IDs 1 to 110 in order.

Management server B has 90 strings, so its serialization range is [111, 200], and the strings in management server B are assigned the serialization IDs 111 to 200 in order.

Management server C has 100 strings, so its serialization range is [201, 300], and the strings in management server C are assigned the serialization IDs 201 to 300 in order.
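The range computation of sub-steps S41 to S43 amounts to a running sum over the per-server string counts. The following Python sketch reproduces the A, B, C example; the function name `id_ranges` is hypothetical.

```python
from itertools import accumulate


def id_ranges(counts: list[int]) -> list[tuple[int, int]]:
    """For each server, in order, return its inclusive ID range [N1 + 1, N2]."""
    ends = list(accumulate(counts))   # N2 for each server
    starts = [0] + ends[:-1]          # N1 for each server
    return [(n1 + 1, n2) for n1, n2 in zip(starts, ends)]


# Management servers A, B, C hold 110, 90, and 100 strings respectively.
print(id_ranges([110, 90, 100]))  # [(1, 110), (111, 200), (201, 300)]
```

Because each range starts right after the previous one ends, the ranges never overlap and the IDs are contiguous across all management servers.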

Step 220, converting the received string into a serialization ID according to the locally maintained mapping table, where the mapping tables maintained by different management servers hold mutually distinct strings.

Each management server maintains a mapping table that records strings and their serialization IDs. Because the strings it receives are the ones it is responsible for, the management server can convert each received string into a serialization ID using its local mapping table: it looks up the numeric ID that corresponds to the string in the mapping table and returns that ID to the corresponding serialization server.

In another preferred embodiment of the present invention, the step of converting the received string into a serialization ID according to the locally maintained mapping table includes:

Sub-step S61, querying whether the string exists in the locally maintained mapping table;

Sub-step S62, if the string exists in the locally maintained mapping table, obtaining the serialization ID corresponding to the string;

Sub-step S63, if the string does not exist in the locally maintained mapping table, generating a serialization ID for the string and adding the string and its serialization ID to the mapping table.

In this embodiment of the present invention, the samples obtained by a serialization server may contain strings not yet recorded in the mapping table of the management server. In that case, the management server may generate a serialization ID for the string, record the string and the serialization ID in the mapping table, and return the serialization ID to the corresponding serialization server.

In practice, mutually non-overlapping serialization ranges may be reserved in advance for the management servers, and a management server assigns the new string a serialization ID from its own range. If its range is used up, another unique serialization range may be allocated to it.
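A minimal Python sketch of sub-steps S61 to S63, combined with a reserved ID range, might look as follows. The class name `ManagementServer`, its fields, and the choice to raise an error when the range is exhausted are all assumptions; the source only says that a further unique range may be allocated.

```python
class ManagementServer:
    """Holds one shard of the string -> serialization ID mapping table."""

    def __init__(self, first_id: int, last_id: int):
        self.table: dict[str, int] = {}
        self.next_id = first_id   # next unused ID in the reserved range
        self.last_id = last_id    # inclusive end of the reserved range

    def get_or_assign(self, s: str) -> int:
        # Sub-steps S61/S62: return the recorded ID if the string is known.
        if s in self.table:
            return self.table[s]
        # Sub-step S63: otherwise assign the next ID from the reserved range.
        if self.next_id > self.last_id:
            # Assumption: the caller must obtain a new unique range here.
            raise RuntimeError("ID range exhausted; a new unique range is needed")
        self.table[s] = self.next_id
        self.next_id += 1
        return self.table[s]


server_b = ManagementServer(111, 200)
print(server_b.get_or_assign("city=Hangzhou"))  # 111
print(server_b.get_or_assign("city=Hangzhou"))  # 111 (already recorded)
print(server_b.get_or_assign("city=Beijing"))   # 112
```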

Step 230, returning the serialization ID corresponding to the string to the corresponding serialization server, so that the serialization server converts the strings in each piece of sample data into the corresponding serialization IDs according to the serialization IDs it receives.

Of course, in this embodiment of the present invention, after receiving a string, the management server may record which serialization server sent it; once the serialization ID of the string has been found, the management server can use that record to return the string and its serialization ID to the corresponding serialization server.

Embodiment 3

Referring to FIG. 3, a flowchart of the steps of a preferred embodiment of a sample serialization method of the present invention is shown.

To describe the serialization method more clearly, this embodiment is described from the perspective of the overall architecture comprising the scheduling server, the serialization servers, and the management servers.

In this embodiment of the present invention, the scheduling server and the serialization servers may cooperate to create the mapping table of each management server, as in steps S32 to S38.

Step S32, the scheduling server distributes all the sample data evenly and, according to the allocation result, notifies each serialization server to obtain its own batch of sample data.

Before training begins, once the scheduling server has obtained the identification information of all the sample data, it can distribute the sample data evenly, for example by assigning the sample data to N serialization servers according to the serial numbers of the sample data. The scheduling server notifies each serialization server of the allocation result so that each serialization server fetches its own sample data. At the same time, the scheduling server notifies the serialization servers to carry out the string serialization process first and to hold off serializing the sample data, because the management servers do not yet have mapping tables.
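One simple way to distribute samples evenly by serial number is round-robin assignment, sketched below in Python. The source only states that the distribution is even and based on serial numbers; the modulo rule and the function name `assign_samples` are assumptions.

```python
def assign_samples(sample_ids: list[int], n_servers: int) -> dict[int, list[int]]:
    """Round-robin: the sample with serial number i goes to server i % n_servers."""
    shards: dict[int, list[int]] = {k: [] for k in range(n_servers)}
    for sid in sample_ids:
        shards[sid % n_servers].append(sid)
    return shards


# Seven samples with serial numbers 0..6 spread over three serialization servers.
print(assign_samples(list(range(7)), 3))  # {0: [0, 3, 6], 1: [1, 4], 2: [2, 5]}
```

With consecutive serial numbers, the shard sizes differ by at most one.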

Step S34, each serialization server obtains its own batch of sample data according to the notification from the scheduling server, collects all the strings in that sample data, and sends them to the management servers.

In practice, after each serialization server obtains its share of the evenly distributed sample data, it can extract all the strings of that batch according to preconfigured extraction rules, deduplicate the strings, and then send the deduplicated strings to the management servers according to a sending rule. The sending rule includes: dividing the hash value of the string by the total number of management servers to obtain a remainder; and sending each string to the management server that corresponds to its remainder, according to the correspondence between remainders and management servers.
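The extract, deduplicate, and send sequence can be sketched in Python as a grouping step that produces one batch of strings per management server. The whitespace-splitting extraction rule, the CRC32 hash, and the function names are illustrative assumptions; the source leaves both the extraction rule and the hash function unspecified.

```python
import zlib
from collections import defaultdict


def extract_strings(sample: str) -> list[str]:
    """Hypothetical extraction rule: whitespace-separated tokens are the strings."""
    return sample.split()


def batches_for_servers(samples: list[str], n_servers: int) -> dict[int, set[str]]:
    """Deduplicate all extracted strings, then group them by hash remainder."""
    unique = {s for sample in samples for s in extract_strings(sample)}
    batches: dict[int, set[str]] = defaultdict(set)
    for s in unique:
        remainder = zlib.crc32(s.encode("utf-8")) % n_servers
        batches[remainder].add(s)   # batch to send to the server for this remainder
    return dict(batches)


samples = ["age=30 city=Hangzhou", "age=30 city=Beijing"]
for remainder, batch in sorted(batches_for_servers(samples, 2).items()):
    print(remainder, sorted(batch))
```

Deduplicating before sending means each string crosses the network at most once per serialization server.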

Step S36, each management server receives the strings sent by the serialization servers;

Step S38, after a management server has received all the strings belonging to it, it serializes the strings and builds a mapping table between strings and serialization IDs.

In this embodiment of the present invention, each serialization server can send strings to a management server over a network connection and disconnect from that management server once all its strings have been sent. The management server can then use the closing of the connection to judge whether that serialization server has finished sending its strings. When the management server determines that all serialization servers have finished sending strings, it can serialize the strings and build the mapping table between strings and serialization IDs.

Of course, in practice the management server may use other ways to determine that it has received all the strings belonging to it. For example, a completion flag may be agreed on in advance: after a serialization server has sent all its strings, it sends the completion flag to each management server, and each management server records the completion flag of that serialization server. Once the completion flags of all serialization servers have been received, the management server determines that it has received all the strings belonging to it. The embodiments of the present invention do not limit the specific method.
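The completion-flag variant can be sketched in Python as a small tracker kept by each management server. The class name `CompletionTracker` and the server identifiers are hypothetical; how the flag is transmitted is left open, as in the source.

```python
class CompletionTracker:
    """Tracks which serialization servers have sent their completion flag."""

    def __init__(self, serialization_servers: set[str]):
        self.pending = set(serialization_servers)

    def mark_done(self, server: str) -> bool:
        """Record one completion flag; return True once every server has reported."""
        self.pending.discard(server)
        return not self.pending


tracker = CompletionTracker({"ser-1", "ser-2"})
print(tracker.mark_done("ser-1"))  # False: still waiting for ser-2
print(tracker.mark_done("ser-2"))  # True: mapping-table construction can start
```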

After the management servers have built the mapping tables, the scheduling server can coordinate the serialization servers to serialize the sample data, as in steps 310 to 332.

Step 310, the scheduling server notifies each serialization server to obtain its own sample data.

Each serialization server performs the following steps:

Step 312, reading the sample data according to the notification;

Step 314, extracting the strings from the sample data. Of course, in practice the extracted strings are also deduplicated before step 316 is performed.

Step 316, for each string, dividing the hash value of the string by the number of management servers to obtain a remainder;

Step 318, determining the management server corresponding to the string according to the correspondence between remainders and management servers;

Step 320, sending the string to the corresponding management server.

Each management server performs the following steps:

Step 322, receiving a string, namely the string sent by the serialization server in step 320.

Step 324, converting the received string into a serialization ID according to the locally maintained mapping table.

The mapping table has already been built in steps S32 to S38.

Step 326, returning the serialization ID corresponding to the string to the corresponding serialization server.

Each serialization server then performs the following steps:

Step 328, receiving the serialization IDs, returned by the management servers, that correspond to the strings;

Step 330, converting the strings in each piece of sample data into the corresponding serialization IDs according to the received serialization IDs.

Step 332, outputting the serialized sample data.

The serialized sample data can then be used for machine training.
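The whole flow, from building the sharded mapping tables (steps S32 to S38) to serializing the samples (steps 310 to 332), can be simulated in a single Python process. This is a sketch under stated assumptions: dictionaries stand in for management servers, direct lookups stand in for network calls, CRC32 is the assumed hash, strings within a shard are numbered in sorted order, and all function names are hypothetical.

```python
import zlib


def shard(s: str, n: int) -> int:
    """Remainder of the (assumed CRC32) hash divided by the server count."""
    return zlib.crc32(s.encode("utf-8")) % n


def build_tables(samples: list[list[str]], n: int) -> list[dict[str, int]]:
    """Steps S32-S38: dedup strings, group by shard, number each shard in [N1+1, N2]."""
    groups = [sorted({s for smp in samples for s in smp if shard(s, n) == k})
              for k in range(n)]
    tables, n1 = [], 0
    for group in groups:
        tables.append({s: n1 + i + 1 for i, s in enumerate(group)})
        n1 += len(group)   # N1 for the next management server
    return tables


def serialize_samples(samples: list[list[str]],
                      tables: list[dict[str, int]]) -> list[list[int]]:
    """Steps 314-330: route each string to its shard and replace it with its ID."""
    n = len(tables)
    return [[tables[shard(s, n)][s] for s in smp] for smp in samples]


samples = [["age=30", "city=Hangzhou"], ["city=Hangzhou", "age=41"]]
tables = build_tables(samples, n=2)
print(serialize_samples(samples, tables))   # step 332: output serialized samples
```

The same string always maps to the same ID, and the IDs across all shards form the contiguous range 1 to the number of distinct strings.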

For a sample to be serialized, the serialization server only needs to send each string in the sample to the corresponding management server, according to the correspondence between strings and management servers; that management server then obtains the serialization ID of the string and returns it to the serialization server. The serialization server therefore does not store the complete mapping table required for serialization, which avoids memory exhaustion on the serialization server and improves its performance.

In addition, taking steps S32 to S38 into account: first, during construction of the mapping tables, the strings of all samples are extracted by multiple serialization servers in parallel, so extraction is fast and the mapping tables are built sooner. Second, construction of the mapping table is spread across multiple management servers; each management server builds only part of the mapping table rather than the complete table, which further speeds up construction. Third, the mapping tables are built on the management servers, so the serialization servers that traditionally performed serialization neither build nor store the mapping table, which reduces their load.

It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations. However, those of ordinary skill in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in another order or simultaneously. Those of ordinary skill in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.

Embodiment 4

Referring to FIG. 4, a structural block diagram of an embodiment of a sample serialization apparatus of the present invention is shown, which may specifically include the following modules:

a string extraction module 410, configured to obtain the strings in the samples to be serialized, where the apparatus further includes, before the string extraction module 410, a sample data acquisition module S400, configured to obtain the sample data to be serialized;

a management server determination module 420, configured to determine the management server corresponding to each string according to the correspondence between strings and management servers;

a string sending module 430, configured to send each string to the corresponding management server, so that each management server converts the received strings into the corresponding serialization IDs according to the mapping table it maintains, where the mapping tables maintained by different management servers hold mutually distinct strings;

a serialization ID receiving module 440, configured to receive the serialization IDs, returned by the management servers, that correspond to the strings;

a sample serialization module 450, configured to convert the strings in each piece of sample data into the corresponding serialization IDs according to the received serialization IDs.

In another preferred embodiment of the present invention, the management server determination module 420 includes: a string remainder module, configured to divide the hash value of a string by the number of management servers to obtain a remainder; and a first remainder determination module, configured to determine the management server corresponding to the string according to the correspondence between remainders and management servers.

In another preferred embodiment of the present invention, the apparatus further includes, after the string extraction module 410: a deduplication module, configured to deduplicate the strings.

In another preferred embodiment of the present invention, the apparatus includes, before the string extraction module 410: a first sample data acquisition module, configured to obtain the batch of sample data that belongs to the current serialization server after the scheduling server has distributed all the sample data evenly.

This embodiment can be applied on the serialization server side.

In this embodiment of the present invention, the mapping table required for serialization is distributed across multiple management servers. The mapping tables of different management servers hold mutually distinct strings, and the serialization IDs assigned to those strings are likewise distinct. Because each management server holds only part of the complete mapping table, looking up the serialization ID of a string takes less time, which shortens sample serialization and improves serialization efficiency.

Embodiment 5

Referring to FIG. 5, a structural block diagram of another embodiment of a sample serialization apparatus of the present invention is shown, which may specifically include the following modules:

a string receiving module 510, configured to receive a string, where the string is sent by a serialization server according to the correspondence between strings and management servers, and is obtained by the serialization server from sample data;

a string conversion module 520, configured to convert the received string into a serialization ID according to the locally maintained mapping table, where the mapping tables maintained by different management servers hold mutually distinct strings;

a digitized ID return module 530, configured to return the serialization ID corresponding to the string to the corresponding serialization server, so that the serialization server converts the strings in each piece of sample data into the corresponding serialization IDs according to the received serialization IDs.

This embodiment can be applied on the management server side.

In a preferred embodiment of the present invention, the apparatus includes, before the string receiving module 510: a string acquisition module, configured to obtain a batch of strings belonging to the current management server, where the batch of strings belonging to the current management server differs from the strings belonging to other management servers; and a mapping table building module, configured to serialize the strings and build a mapping table between strings and serialization IDs.

In another preferred embodiment of the present invention, the mapping table building module includes: a first quantity acquisition module, configured to obtain a first total number N1 of strings held by the management servers that precede the current management server in order; a second quantity acquisition module, configured to add the number M of strings of the current management server to the first total number N1 to obtain a second total number N2; and a serialization range determination module, configured to use [N1+1, N2] as the range of serialization IDs that the current management server assigns to its strings.

In another preferred embodiment of the present invention, the string conversion module includes: a query module, configured to query whether the string exists in the locally maintained mapping table; a first digitized ID acquisition module, configured to obtain the serialization ID corresponding to the string if the string exists in the locally maintained mapping table; and a generation module, configured to generate a serialization ID for the string and add the string and its serialization ID to the mapping table if the string does not exist in the locally maintained mapping table.

In another preferred embodiment of the present invention, for the batch of strings belonging to the current management server, the remainder of each string in the batch is a remainder assigned to the current management server; the remainder is obtained by dividing the hash value of the string by the number of management servers.

Embodiment 6

Referring to FIG. 6, a structural block diagram of an embodiment of a sample serialization system of the present invention is shown, which may specifically include: a scheduling server 600, multiple serialization servers 700, and multiple management servers 800. The figure shows only three serialization servers 700 and three management servers 800; the number of each kind of server can be set according to actual needs.

The scheduling server 600 includes: a notification module 601, configured to notify each serialization server to obtain its own sample data.

In a preferred embodiment of the present invention, in practice, the scheduling server 600 further includes: an even distribution module, configured to distribute all the sample data evenly and, according to the allocation result, notify each serialization server to obtain its own batch of sample data.

Before training begins, the notification module of the scheduling server 600 is further configured to notify the serialization servers to carry out the string serialization process first and to hold off serializing the sample data, because the management servers do not yet have mapping tables.

其中，每個序列化伺服器700包括：樣本獲取模組701，用於根據所述通知，讀取樣本資料；字符串提取模組702，用於從樣本資料中提取各個字符串；當然，實際應用中，字符串提取模組702還用於對於提取的字符串，還會對其進行去重，然後進入。 Wherein, each serialization server 700 includes: a sample acquisition module 701 for reading sample data according to the notification; a character string extraction module 702 for extracting each character string from the sample data; In applications, the character string extraction module 702 is also used to deduplicate the extracted character string before entering.

字符串取餘模組703，用於對各字符串，將字符串對應的哈希值除以管理伺服器的個數，得到餘數；第一餘數確定模組704，用於根據所述餘數與管理伺服器的對應關係，確定字符串對應的管理伺服器。 The character string remainder module 703 is used to divide the hash value corresponding to the character string by the number of management servers for each character string to obtain the remainder; the first remainder determination module 704 is used to determine the remainder according to the remainder and management server The corresponding relationship of the server, to determine the management server corresponding to the string.

字符串發送模組705，用於將所述字符串發送至相應的管理伺服器；序列化ID接收模組706，用於接收各個管理伺服器返回的對應各個字符串的序列化ID；樣本序列化模組707，用於根據接收到的各字符串對應的序列化ID，將各個樣本資料中的字符串轉換為相應的序列化ID。 The character string sending module 705 is used for sending the character string to the corresponding management server; the serialization ID receiving module 706 is used for receiving the serialized ID corresponding to each character string returned by each management server; the sample sequence The transformation module 707 is configured to convert the strings in each sample data into corresponding serialization IDs according to the received serialization IDs corresponding to the respective strings.

An output module 708, configured to output the serialized sample data.
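Taken together, modules 702 to 707 form a pipeline on each serialization server: extract, deduplicate, route, send, receive, and replace. The sketch below models that flow in a single process under several assumptions: a sample is a list of strings, each management server is represented by a plain callable that takes a batch of strings and returns a `{string: id}` dictionary (a real system would use network calls), and all names (`serialize_samples`, `make_fake_server`, `server_index`) are hypothetical.

```python
import zlib
from collections import defaultdict


def server_index(string, num_servers):
    # Stand-in hash (CRC32) modulo the number of management servers.
    return zlib.crc32(string.encode("utf-8")) % num_servers


def serialize_samples(samples, servers):
    """Replace every string in `samples` with its serialized ID.

    samples: list of samples, each a list of strings.
    servers: list of callables; servers[i](batch) returns {string: id}.
    """
    # Module 702: extract the strings and deduplicate them.
    unique_strings = {s for sample in samples for s in sample}

    # Modules 703/704: group the strings by responsible management server.
    batches = defaultdict(list)
    for s in unique_strings:
        batches[server_index(s, len(servers))].append(s)

    # Modules 705/706: send each batch and collect the returned IDs.
    ids = {}
    for index, batch in batches.items():
        ids.update(servers[index](batch))

    # Module 707: substitute IDs for strings, preserving sample layout.
    return [[ids[s] for s in sample] for sample in samples]


def make_fake_server(offset):
    """Hypothetical stand-in for a management server, for demonstration.

    IDs start at offset + 1; offsets must be spaced widely enough that
    the ID ranges of different fake servers cannot overlap.
    """
    table = {}

    def lookup(batch):
        for s in batch:
            table.setdefault(s, offset + len(table) + 1)
        return {s: table[s] for s in batch}

    return lookup


if __name__ == "__main__":
    servers = [make_fake_server(0), make_fake_server(1000)]
    samples = [["age=30", "city=Hangzhou"], ["city=Hangzhou", "age=41"]]
    print(serialize_samples(samples, servers))
```

Deduplicating before sending means each distinct string is queried once per batch, however many samples contain it.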

In another embodiment of the present invention, to support the management servers in creating their mapping tables, the serialization server 700 includes an integrated sending module, by which each serialization server, according to the notification from the scheduling server, obtains its own batch of sample data and sends all the character strings in that sample data, consolidated, to the management servers.

Each management server 800 includes: a character string receiving module 801, configured to receive character strings, namely the character strings sent by the character string sending module 705.

A character string conversion module 802, configured to convert the received character string into a serialized ID according to the locally maintained mapping table; and a digitized ID return module 803, configured to return the serialized ID corresponding to the character string to the corresponding serialization server.

In another embodiment of the present invention, the management server 800 further creates the mapping table through the following modules: a character string acquisition module, configured to obtain the batch of character strings belonging to the current management server, where the batch of character strings belonging to the current management server differs from the character strings belonging to other management servers; the character strings obtained by this module may come from the integrated sending module of the serialization servers.

A mapping table building module, configured to serialize the character strings and build a mapping table between character strings and serialized IDs.
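A management server's table construction and lookup might look like the sketch below. It uses the ID range [N1+1, N2] described in the claims, where N1 is the total number of strings held by the management servers ordered before the current one and N2 = N1 + M for the current server's M strings. The class name `ManagementServer`, its method names, and the choice to sort strings before numbering them are assumptions made for illustration; the handling of strings missing from the table is left out because the ID allocation strategy for that case is not specified here.

```python
class ManagementServer:
    """Hypothetical in-memory model of one management server."""

    def __init__(self, strings, n1):
        """Build the mapping table.

        strings: the batch of strings owned by this server.
        n1: total number of strings held by servers ordered before this one.
        """
        # Deduplicate; sorting only makes the numbering reproducible
        # (an assumption, not something the source requires).
        unique = sorted(set(strings))
        m = len(unique)
        n2 = n1 + m
        self.id_range = (n1 + 1, n2)  # inclusive range [N1+1, N2]
        self.table = {s: n1 + 1 + i for i, s in enumerate(unique)}

    def convert(self, batch):
        """Return {string: serialized ID} for strings already in the table."""
        return {s: self.table[s] for s in batch}


if __name__ == "__main__":
    first = ManagementServer(["x", "y", "x"], n1=0)
    # The second server starts numbering after the first server's strings.
    second = ManagementServer(["p", "q", "r"], n1=len(first.table))
    print(first.id_range, second.id_range)
    print(second.convert(["q"]))
```

Giving each server a contiguous, non-overlapping range keeps serialized IDs globally unique without any coordination at lookup time.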

Because the apparatus embodiments are basically similar to the method embodiments, they are described relatively briefly; for related details, refer to the corresponding description of the method embodiments.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another.

Those of ordinary skill in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Accordingly, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.

In a typical configuration, the computer device includes one or more processors (CPUs), an input/output interface, a network interface, and memory. The memory may include non-permanent storage in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing terminal equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal equipment produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable storage capable of directing a computer or other programmable data processing terminal equipment to operate in a particular manner, such that the instructions stored in the computer-readable storage produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal equipment, so that a series of operational steps are performed on the computer or other programmable terminal equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those of ordinary skill in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.

Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprising", "including", and any other variants thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.

The sample serialization method and sample serialization device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, for those of ordinary skill in the art, the specific implementations and the scope of application may change in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

A sample serialization method, comprising: acquiring each character string in samples to be serialized; determining the management server corresponding to each character string according to the correspondence between character strings and management servers; sending each character string to the corresponding management server, so that each management server converts the received character strings into corresponding serialized IDs according to the mapping table it maintains, wherein the character strings in the mapping tables maintained by different management servers are different from one another; receiving the serialized IDs, corresponding to the respective character strings, returned by the management servers; and converting the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs; wherein, after the step of acquiring each character string in the samples to be serialized, the method further comprises: deduplicating the character strings.

The method according to claim 1, wherein the step of determining the management server corresponding to each character string according to the correspondence between character strings and management servers comprises: dividing the hash value corresponding to the character string by the number of management servers to obtain a remainder; and determining the management server corresponding to the character string according to the correspondence between remainders and management servers.

The method according to claim 1 or 2, wherein, before the step of acquiring each character string in the samples to be serialized, the method further comprises: acquiring the batch of sample data that belongs to the current serialization server after the scheduling server has evenly distributed all the sample data.

A sample serialization method, comprising: receiving a character string, wherein the character string is sent by a serialization server according to the correspondence between character strings and management servers, and is obtained by the serialization server from sample data; converting the received character string into a serialized ID according to a locally maintained mapping table, wherein the character strings in the mapping tables maintained by different management servers are different from one another; and returning the serialized ID corresponding to the character string to the corresponding serialization server, so that the serialization server converts the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs; wherein, before the step of receiving a character string, the method further comprises: obtaining a batch of character strings belonging to the current management server, wherein the batch of character strings belonging to the current management server is different from the character strings belonging to other management servers; and serializing the character strings and building a mapping table between character strings and serialized IDs.

The method according to claim 4, wherein the step of serializing the character strings and building a mapping table between character strings and serialized IDs comprises: obtaining a first total number N1 of character strings in the management servers ordered before the current management server; adding the number M of character strings of the current management server to the first total number N1 to obtain a second total number N2; and taking [N1+1, N2] as the range over which the current management server serializes its character strings.

The method according to claim 4 or 5, wherein the step of converting the received character string into a serialized ID according to a locally maintained mapping table comprises: querying whether the character string exists in the locally maintained mapping table; if the character string exists in the locally maintained mapping table, obtaining the serialized ID corresponding to the character string; and if the character string does not exist in the locally maintained mapping table, generating a serialized ID for the character string and adding the character string and the corresponding serialized ID to the mapping table.

The method according to claim 4 or 5, wherein, for the batch of character strings belonging to the current management server, the remainder corresponding to each character string in the batch belongs to the current management server, the remainder being obtained by dividing the hash value corresponding to the character string by the number of management servers.

A sample serialization device, comprising: a character string extraction module, configured to acquire each character string in samples to be serialized; a management server determination module, configured to determine the management server corresponding to each character string according to the correspondence between character strings and management servers; a character string sending module, configured to send each character string to the corresponding management server, so that each management server converts the received character strings into corresponding serialized IDs according to the mapping table it maintains, wherein the character strings in the mapping tables maintained by different management servers are different from one another; a serialized ID receiving module, configured to receive the serialized IDs, corresponding to the respective character strings, returned by the management servers; and a sample serialization module, configured to convert the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs; wherein the device further comprises, after the character string extraction module: a deduplication module, configured to deduplicate the character strings.

The device according to claim 8, wherein the management server determination module comprises: a character string remainder module, configured to divide the hash value corresponding to the character string by the number of management servers to obtain a remainder; and a first remainder determination module, configured to determine the management server corresponding to the character string according to the correspondence between remainders and management servers.

The device according to claim 8 or 9, further comprising, before the character string extraction module: a first sample data acquisition module, configured to acquire the batch of sample data that belongs to the current serialization server after the scheduling server has evenly distributed all the sample data.

A sample serialization device, comprising: a character string receiving module, configured to receive a character string, wherein the character string is sent by a serialization server according to the correspondence between character strings and management servers, and is obtained by the serialization server from sample data; a character string conversion module, configured to convert the received character string into a serialized ID according to a locally maintained mapping table, wherein the character strings in the mapping tables maintained by different management servers are different from one another; and a digitized ID return module, configured to return the serialized ID corresponding to the character string to the corresponding serialization server, so that the serialization server converts the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs; wherein the device further comprises, before the character string receiving module: a character string acquisition module, configured to obtain a batch of character strings belonging to the current management server, wherein the batch of character strings belonging to the current management server is different from the character strings belonging to other management servers; and a mapping table building module, configured to serialize the character strings and build a mapping table between character strings and serialized IDs.

The device according to claim 11, wherein the mapping table building module comprises: a first quantity acquisition module, configured to obtain a first total number N1 of character strings in the management servers ordered before the current management server; a second quantity acquisition module, configured to add the number M of character strings of the current management server to the first total number N1 to obtain a second total number N2; and a serialization range determination module, configured to take [N1+1, N2] as the range over which the current management server serializes its character strings.

The device according to claim 11 or 12, wherein the character string conversion module comprises: a query module, configured to query whether the character string exists in the locally maintained mapping table; a first digitized ID acquisition module, configured to obtain the serialized ID corresponding to the character string if the character string exists in the locally maintained mapping table; and a generation module, configured to, if the character string does not exist in the locally maintained mapping table, generate a serialized ID for the character string and add the character string and the corresponding serialized ID to the mapping table.

The device according to claim 11 or 12, wherein, for the batch of character strings belonging to the current management server, the remainder corresponding to each character string in the batch belongs to the current management server, the remainder being obtained by dividing the hash value corresponding to the character string by the number of management servers.