CN113609843A - Sentence and word probability calculation method and system based on gradient boosting decision tree - Google Patents

Sentence and word probability calculation method and system based on gradient boosting decision tree

Info

Publication number
CN113609843A
Authority
CN
China
Prior art keywords
participle
decision tree
elements
sequence
serial number
Prior art date
Legal status
Granted
Application number
CN202111184159.XA
Other languages
Chinese (zh)
Other versions
CN113609843B (en)
Inventor
蓝建敏
申鑫
Current Assignee
Excellence Information Technology Co ltd
Original Assignee
Excellence Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Excellence Information Technology Co ltd
Priority claimed from CN202111184159.XA
Publication of CN113609843A
Application granted
Publication of CN113609843B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/216: Parsing using statistical methods
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates


Abstract

The invention provides a sentence and word probability calculation method and system based on a gradient boosting decision tree. A text data set is cleaned to obtain a preprocessed data set; character strings read from it are segmented with HanLP, the set of participles is combined into a participle tagging set, and the participles are stored as an input data set in json format. The input data set is processed into parallel data and input in batches into a plurality of gradient boosting decision tree models for learning; during learning, sharing weights among the models are calculated from the batch inputs and used to adjust each model's loss function, yielding a plurality of discrimination models. Data to be monitored, in the same format as the parallel data, is input into the discrimination models, and the discrimination scores they output are combined with the sharing weights to obtain a final overall discrimination score. Parallelism and efficiency of sentence and word probability calculation are thereby improved, and data parallelization reduces the computation cost and time complexity of evaluating all candidate participles.

Description

Sentence and word probability calculation method and system based on gradient boosting decision tree
Technical Field
The disclosure belongs to the technical field of big data, and particularly relates to a sentence and word probability calculation method and system based on a gradient boosting decision tree.
Background
In modern society, text data is growing rapidly, and so is the demand for the ability to analyze text files. Preprocessing of text data is an important part of modern electronic production and daily life; at the same time, as data accumulates ever faster, the time and computation costs of analyzing it grow steeply, and the processing capacity and speed of existing text-data approaches are markedly inadequate for tasks involving large numbers of segmented character strings. Error-correction computation over large-scale sequences of segmented character strings produces too many computation points, and using a confusion set to evaluate all candidate words is prohibitively expensive in both computation cost and time complexity. Patent document CN107861936A provides a method and apparatus for analyzing the polarity probability of a sentence, which computes sentence dimensions from the sentence's word vectors and determines the polarity probability with a trained model, improving the accuracy of probability analysis to some extent; however, the parallelism and efficiency of sentence and word probability calculation remain insufficient.
Disclosure of Invention
The invention aims to provide a sentence and word probability calculation method and system based on a gradient boosting decision tree, so as to solve one or more technical problems in the prior art and at least provide a beneficial alternative or a condition for creation.
Error-correction computation over large-scale sequences of segmented character strings produces too many computation points, and using a confusion set to evaluate all candidate words is prohibitively expensive in both computation cost and time complexity.
The invention provides a sentence and word probability calculation method and system based on a gradient boosting decision tree. A text data set is cleaned to obtain a preprocessed data set. Character strings in the preprocessed data set are read and segmented with HanLP; the set of participles is combined into a participle tagging set, and the participles are stored as an input data set in json format. The input data set is read and processed into parallel data, which is input in batches into a plurality of gradient boosting decision tree models for learning. During learning, sharing weights among the models are calculated from the batch inputs of the parallel data, and each model's loss function is adjusted by its sharing weight; a plurality of discrimination models is obtained after learning. Data to be monitored, in the same format as the parallel data, is input into each discrimination model to obtain a plurality of output discrimination scores, and these scores are combined with the sharing weights to obtain a final overall discrimination score. Parallelism and efficiency of sentence and word probability calculation are thereby improved, and data parallelization reduces the computation cost and time complexity of evaluating all candidate participles.
In order to achieve the above object, according to an aspect of the present disclosure, there is provided a sentence and word probability calculation method based on a gradient boosting decision tree, the method comprising the steps of:
s100, performing data cleaning on a text data set, removing punctuation and invalid non-Chinese characters from it, to obtain a preprocessed data set;
s200, reading the character strings in the preprocessed data set, segmenting the read character strings with HanLP, combining the set of participles into a participle tagging set, and storing the obtained participles as an input data set in json format;
s300, reading the input data set and processing it into parallel data;
s400, inputting the parallel data in batches into a plurality of gradient boosting decision tree models for learning;
s500, during the learning of the plurality of gradient boosting decision tree models, calculating the sharing weights among the models from the batch inputs of the parallel data, adjusting each model's loss function by its sharing weight, and obtaining a plurality of discrimination models after learning;
s600, inputting data to be monitored, in the same format as the parallel data, into each of the discrimination models to obtain a plurality of output discrimination scores, and combining the discrimination score output by each discrimination model with its sharing weight to obtain a final overall discrimination score.
Further, in S100, the method of cleaning the text data set to remove punctuation and invalid non-Chinese characters and obtain the preprocessed data set is: obtain the text data set from a database of a server cluster, where the text data set is a table in the database, the field type of the table's data is character string, the character strings in the table are character strings read from text files, and each row of the table is the character string of one text; convert each character of the table's data to its decimal value according to the ASCII comparison table and delete every character whose decimal value lies in the range [0,31] (equivalently, delete from the characters of the table's data fields every character whose ASCII decimal value is in [0,31]); store the cleaned table as one or more csv files, which constitute the preprocessed data set.
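For illustration, a minimal Python sketch of this cleaning step follows; the pandas-based I/O and the file name are assumptions of the example, and only the ASCII-range deletion explicitly defined above is implemented (punctuation filtering would be handled analogously):

import pandas as pd

def clean_rows(rows):
    """Delete every character whose ASCII decimal value lies in [0, 31]
    from each row's character string (the S100 cleaning rule)."""
    return ["".join(ch for ch in s if not 0 <= ord(ch) <= 31) for s in rows]

# hypothetical usage: one text per row, stored as a csv preprocessed data set
rows = ["第一行文本\x01", "第二行文本\x1f"]
pd.DataFrame({"text": clean_rows(rows)}).to_csv("preprocessed.csv", index=False)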
Further, in S200, the method of reading the character strings in the preprocessed data set, segmenting them with HanLP, combining the set of participles into a participle tagging set, and storing the obtained participles as an input data set in json format is: read each row of the preprocessed data set as one character string; segment each read character string with HanLP, and take the participles obtained from each character string as a participle array, each participle being a character string composed of some of the characters of the original string; denote the set of all distinct participles across all participle arrays as the set Aset, called the participle tagging set; denote the number of elements of Aset as n and the serial number of an element of Aset as i, i ∈ [1, n]; mark each participle in each participle array with its serial number in Aset to obtain a participle tagging sequence, i.e. an element of a participle tagging sequence consists of a participle and that participle's serial number in the participle tagging set Aset: if a participle's serial number in Aset is i, the participle with serial number i in Aset is denoted Aset(i), and an element of a participle tagging sequence is the pair of the serial number i and the participle Aset(i), written <i, Aset(i)>; the set of all participle tagging sequences is stored in json format as the input data set.
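A sketch of S200 follows, using the pyhanlp binding as one way to call HanLP from Python; the binding choice and the list-of-pairs json layout are assumptions of the example:

import json
from pyhanlp import HanLP  # assumed Python binding for HanLP

def build_input_dataset(lines):
    """Segment each line with HanLP, build the participle tagging set Aset,
    and encode each line as a tagging sequence of pairs <i, Aset(i)>."""
    arrays = [[term.word for term in HanLP.segment(line)] for line in lines]
    aset = sorted({w for arr in arrays for w in arr})   # distinct participles
    index = {w: i + 1 for i, w in enumerate(aset)}      # serial numbers 1..n
    sequences = [[[index[w], w] for w in arr] for arr in arrays]
    return aset, sequences

aset, sequences = build_input_dataset(["今天天气很好", "文本数据增长迅速"])
with open("input_dataset.json", "w", encoding="utf-8") as f:
    json.dump(sequences, f, ensure_ascii=False)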
Further, in S300, the method of reading the input data set and processing it into parallel data is: read the set formed by the participle tagging sequences in the input data set as the set Bset, where the number of elements of Bset is m, the serial number of an element of Bset is the variable j, j ∈ [1, m], and the element of Bset with serial number j is denoted B_j; define the function len() as the function that returns the number of elements of the array, sequence, or vector given as its input, so that len(B_j) is the number of elements of B_j; denote the set formed by the element counts of the elements of each serial number in Bset as the set Lset, Lset = { len(B_j) | j ∈ [1, m] }; the function max() returns the element with the largest value of the array, sequence, or vector given as its input, so max(Lset) is the element of Lset with the largest value;
defining a function Han(), wherein the calculation process of the function Han() comprises the following steps:
s301-1, the function Han() takes the set Bset as input;
s301-2, create an empty set Cset;
s301-3, obtain len(B_j) for the element of each serial number in the set Bset through the function len(), obtain the largest value max(Lset) of the set Lset formed by the len(B_j) of the elements of each serial number, and denote max(Lset) as ml;
s301-4, for every element B_j in the set Bset that satisfies the constraint len(B_j) < ml, append ml - len(B_j) elements, each consisting of an empty character string and the serial number 0, to the end of the participle tagging sequence B_j; an element consisting of an empty character string and the serial number 0 is called a filling element; that is, the length of B_j is extended to ml by appending ml - len(B_j) filling elements after the original len(B_j) elements, so that the element B_j of every serial number in Bset is processed into a participle tagging sequence of length ml; within a participle tagging sequence of length ml, the serial number of an element is il, il ∈ [1, ml];
s301-5, add each element of the processed set Bset into Cset;
s301-6, the function Han() outputs Cset;
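A minimal sketch of Han() under the same list-of-pairs representation (an assumption of the example); it pads every tagging sequence to the maximum length ml with filling elements [0, ""]:

def han(bset):
    """Pad every participle tagging sequence in Bset to the maximum
    length ml with filling elements [0, ""] (steps S301-1 to S301-6)."""
    ml = max(len(b) for b in bset)                   # S301-3: ml = max(Lset)
    cset = []
    for b in bset:                                   # S301-4
        cset.append(b + [[0, ""]] * (ml - len(b)))   # S301-5
    return cset                                      # S301-6

cset = han([[[1, "今天"], [2, "天气"], [3, "很好"]], [[4, "文本"]]])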
furthermore, from the output set Cset: the number of elements of Cset equals the number of elements of Bset, m, and the serial number of an element of Cset equals the serial number of the corresponding element of Bset, j; each element of Cset is a participle tagging sequence of length ml; take the m participle tagging sequences of length ml in Cset, in order, as the m columns of a matrix of ml rows and m columns (that is, treat each length-ml tagging sequence as a one-dimensional matrix and its transpose as one column), and denote this ml × m matrix as the matrix Cmat;
each column of the matrix Cmat is an element of the set Cset: the number of columns of Cmat is m, the same as the number of elements of Cset, and the serial number of a column of Cmat is j, the same as the serial number of the corresponding element of Cset; each column of Cmat is a participle tagging sequence of length ml, so the number of rows of Cmat is ml, and the serial number of a row of Cmat is il, the same as the serial number of an element within a participle tagging sequence; row il of Cmat consists of the elements with serial number il of the tagging sequences with serial numbers 1 to m in Cset, and the element in row il and column j of Cmat is the element with serial number il of the tagging sequence with serial number j in Cset;
the element in row il and column j of the matrix Cmat is denoted Cmat[il, j]; it consists of a participle and that participle's serial number in the participle tagging set; the participle in Cmat[il, j] is denoted Ctr(il, j), and its serial number in the participle tagging set is denoted Cid(il, j), so that Cmat[il, j] = <Cid(il, j), Ctr(il, j)>;
replace the participle Ctr(il, j) in each element of the matrix Cmat with the word vector obtained from Ctr(il, j) by a word vector algorithm; the word vector obtained from Ctr(il, j) is denoted Cvec(il, j), the number of dimensions of a word vector is k, and the serial number of a dimension is v, v ∈ [1, k]; the word vector obtained from the empty character string is the all-zero vector. The array obtained by replacing Ctr(il, j) with Cvec(il, j) in every element of Cmat is denoted Ctensor; row il of Ctensor is denoted Ctensor[il], column j of Ctensor is denoted Ctensor[j], and the element in row il and column j is denoted Ctensor[il, j]; Ctensor[il, j] consists of Cid(il, j) and Cvec(il, j), i.e. Ctensor[il, j] = <Cid(il, j), Cvec(il, j)>; the value of the v-th dimension of Cvec(il, j) is denoted Cvec(il, j, v). Ctensor is the parallel data.
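The word vector algorithm is not named in the text; the sketch below uses gensim's Word2Vec purely as a stand-in, mapping the empty string to the all-zero vector as specified:

import numpy as np
from gensim.models import Word2Vec  # assumed choice of word vector algorithm

def to_parallel_data(cset, k=50):
    """Build Ctensor by replacing each participle Ctr(il, j) with its
    k-dimensional word vector Cvec(il, j); empty strings map to zeros."""
    corpus = [[w for _, w in seq if w] for seq in cset]
    w2v = Word2Vec(corpus, vector_size=k, min_count=1)
    def vec(w):
        return w2v.wv[w] if w else np.zeros(k)
    # Ctensor[il, j] = <Cid(il, j), Cvec(il, j)>; each sequence is a column
    return [[(i, vec(w)) for i, w in seq] for seq in cset]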
Further, in S400, the method of inputting the parallel data in batches into a plurality of gradient boosting decision tree models for learning is as follows:
create q different gradient boosting decision tree models; define int() as the round-down (floor) function; obtain the participle tagging set and from it the number n of its elements; obtain the input data set and from it the number m of its elements; the value of q is calculated as:
[Equation rendered as an image in the original: the formula giving q via int() from m and n.]
obtain the parallel data, denoted Ctensor; the specific steps for dividing the parallel data Ctensor into batches are as follows:
s401, obtain the number of columns of Ctensor, which is m; set a variable p with initial value 0; set an empty set gap; go to S402;
s402, define the number of columns per batch as epi; divide m by q; if m divided by q leaves a remainder, go to S403, otherwise go to S405;
s403, set the value of p equal to the remainder of m divided by q; randomly take p columns out of Ctensor, so that the number of columns in Ctensor is m - p; go to S404;
s404, let g be the result of q - p; randomly take g columns from Ctensor and put them into gap; add all elements of gap back into Ctensor; empty the set gap; go to S405;
s405, obtain the number of columns of Ctensor; set the value of epi equal to the number of columns of Ctensor divided by q; go to S406;
s406, divide Ctensor into q batches, each batch being epi columns of Ctensor; input each batch into a gradient boosting decision tree model for learning, the gradient boosting decision tree model being an XGBoost model, so that q different gradient boosting decision tree models are obtained through learning;
in S406, the specific method by which each batch is input into a gradient boosting decision tree model for learning is as follows: during learning, randomly delete an element at some row and column of the parallel data and mark the vacated position with a placeholder; have the gradient boosting decision tree model predict, from the other elements that were not deleted, the element that occupied the placeholder before deletion, and output the probability of that prediction; compute, in the XGBoost manner, the cross entropy between the predicted probability and the element before deletion, obtain the output of the loss function through a sigmoid activation function, and use the loss output to optimize the model by gradient descent.
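For illustration, a sketch of the column-wise batch split follows; the remainder handling of S403 and S404 is ambiguous in the translated text, so the sketch simply pads the column count to a multiple of q by duplicating random columns, an assumed reading:

import random

def split_batches(columns, q):
    """Divide the columns of Ctensor into q batches of epi columns each,
    padding with duplicated random columns when m is not divisible by q
    (assumed reading of the S403/S404 remainder handling)."""
    cols = list(columns)
    p = len(cols) % q
    if p:
        cols += random.sample(cols, q - p)
    epi = len(cols) // q
    return [cols[b * epi:(b + 1) * epi] for b in range(q)]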
Further, in S500, the method of calculating the sharing weights among the gradient boosting decision tree models from the batch inputs of the parallel data is: denote the parallel data as Ctensor, divided by columns into a plurality of batches; the number of batches into which the parallel data is divided by columns is q, and the serial number of a batch is qi, qi ∈ [1, q]; the batch of Ctensor with serial number qi is Ctensor(qi); every batch has the same number of rows, ml, and the serial number of a row within a batch is il, il ∈ [1, ml]; the number of columns of Ctensor(qi) is mqi, and the serial number of a column is jqi, jqi ∈ [1, mqi]; the row of Ctensor(qi) with serial number il is Ctensor(qi)(il), the column with serial number jqi is Ctensor(qi)(jqi), and the element in row il and column jqi is Ctensor(qi)(il, jqi), which consists of a serial number and a word vector: the serial number in Ctensor(qi)(il, jqi) is denoted Cid(qi)(il, jqi) and the word vector is denoted Cvec(qi)(il, jqi); the sharing weight is defined as a value measuring the frequency of the information content of each batch input; the function that calculates a sharing weight from a batch input is denoted Cov(), and the sharing weight of Ctensor(qi), Cov(qi), is calculated by the formula:
[Equation rendered as an image in the original: the formula of Cov(qi), computed over the elements of the batch Ctensor(qi).]
the calculated Cov(qi) is the sharing weight of the batch input with serial number qi in Ctensor; the array formed by the sharing weights of the batch inputs of every serial number in Ctensor is denoted Covs, Covs = [Cov(qi) | qi ∈ [1, q]], and the arithmetic mean of the elements of Covs is denoted Covg; the function exp() is the exponential function with the natural constant e as base; the number of different gradient boosting decision tree models is set to q, the same as the number of batch inputs, and their serial numbers to qi, the same as the serial numbers of the batch inputs; the gradient boosting decision tree model with serial number qi is denoted M(qi), and the set formed by the models is denoted Ms; the output of the loss function of M(qi) is denoted loss(qi), and adjusting loss(qi) by the sharing weight of the model yields the adjusted result; the return value of the loss function is the quantity used to optimize a model by gradient descent during machine-learning training; the return value of the loss function of M(qi) is denoted L(qi), and L(qi) is used to optimize M(qi); the array formed by the return values of the loss functions of the models of every serial number is the array Loss; the formulas by which the sharing weights adjust the loss functions of the plurality of gradient boosting decision tree models are:
[Two equations rendered as images in the original: the formulas adjusting each loss(qi) by the sharing weights Cov(qi) and their mean Covg, using exp(), to yield the return values L(qi).]
thus, the L(qi) corresponding to each M(qi) is used to optimize that model, so that q discrimination models are obtained after optimization.
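The adjustment formulas themselves are rendered as images in the original and cannot be recovered here; the sketch below shows only one plausible shape of the computation, an exp-based rescaling of each batch loss by the gap between its sharing weight and the mean Covg, and is labeled as a guess throughout:

import math

def adjusted_losses(losses, covs):
    """Hypothetical reconstruction: rescale each loss(qi) by
    exp(Cov(qi) - Covg), Covg being the mean sharing weight; the
    patent's actual image-only formulas may differ."""
    covg = sum(covs) / len(covs)        # arithmetic mean of the array Covs
    return [l * math.exp(c - covg) for l, c in zip(losses, covs)]

# L = adjusted_losses(Loss, Covs); L[qi-1] then optimizes model M(qi)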
Further, in S600, the method of inputting data to be monitored, in the same format as the parallel data, into each of the discrimination models to obtain a plurality of output discrimination scores, and combining the score output by each discrimination model with its sharing weight to obtain a final overall discrimination score is: the number and serial numbers of the discrimination models equal those of the gradient boosting decision tree models before optimization, i.e. there are q discrimination models with serial numbers qi, qi ∈ [1, q]; the output of a discrimination model is the discrimination score of XGBoost: an element at a random row and column of the data is deleted and its position marked with a placeholder, the gradient boosting decision tree model predicts, from the other elements that were not deleted, the element that occupied the placeholder before deletion, and outputs the probability of that prediction; the output of the discrimination model with serial number qi is score(qi), the probability of predicting the element before deletion; the sharing weight corresponding to the discrimination model with serial number qi is Cov(qi); the final overall discrimination score is denoted score and is calculated by the formula:
[Equation rendered as an image in the original: the formula for score, combining each score(qi) with its sharing weight Cov(qi).]
the calculated score is the final overall discrimination score obtained by combining the discrimination scores output by the discrimination models with the sharing weights, and the calculated score is stored as the sentence probability.
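The combination formula for score is likewise an image in the original; a natural reading is a sharing-weight-weighted average of the model scores, implemented below as an assumption:

def overall_score(scores, covs):
    """Hypothetical reconstruction of the final overall discrimination
    score: a Cov-weighted average of the scores score(qi); the patent's
    actual image-only formula may differ."""
    return sum(s * c for s, c in zip(scores, covs)) / sum(covs)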
The present disclosure also provides a sentence and word probability calculation system based on a gradient boosting decision tree, which comprises a processor, a memory, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the steps of the above sentence and word probability calculation method based on a gradient boosting decision tree. The system can run on computing devices such as desktop computers, notebooks, palmtop computers, and cloud data centers; the runnable system may include, but is not limited to, a processor, a memory, and a server cluster. The processor executes the computer program to run in the following units of the system:
the preprocessing unit is used for cleaning the text data set, removing punctuation and invalid non-Chinese characters from it, to obtain a preprocessed data set;
the input data set unit is used for reading the character strings in the preprocessed data set, segmenting the read character strings with HanLP, combining the set of participles into a participle tagging set, and storing the obtained participles as an input data set in json format;
the parallel data processing unit is used for reading the input data set and processing it into parallel data;
the model learning unit is used for inputting the parallel data in batches into a plurality of gradient boosting decision tree models for learning;
the batch input unit is used for calculating, during the learning of the plurality of gradient boosting decision tree models, the sharing weights among the models from the batch inputs of the parallel data, adjusting each model's loss function by its sharing weight, and obtaining a plurality of discrimination models after learning;
and the probability output unit is used for inputting data to be monitored, in the same format as the parallel data, into each of the discrimination models to obtain a plurality of output discrimination scores, and combining the discrimination score output by each discrimination model with its sharing weight to obtain a final overall discrimination score.
The beneficial effects of the disclosure are: the invention provides a sentence and word probability calculation method and system based on a gradient boosting decision tree, which improve the parallelism and efficiency of sentence and word probability calculation and, through data parallelization, reduce the computation cost and time complexity of evaluating all candidate participles.
Drawings
The foregoing and other features of the present disclosure will become more apparent from the detailed description of the embodiments given in conjunction with the drawings, in which like reference characters designate the same or similar elements throughout the several views. It is apparent that the drawings described below are merely some examples of the present disclosure, and that those skilled in the art may derive other drawings from them without inventive effort. In the drawings:
FIG. 1 is a flowchart of the sentence and word probability calculation method based on a gradient boosting decision tree;
FIG. 2 is a system diagram of the sentence and word probability calculation system based on a gradient boosting decision tree.
Detailed Description
The conception, specific structure, and technical effects of the present disclosure are clearly and completely described below in conjunction with the embodiments and the accompanying drawings, so that the objects, aspects, and effects of the present disclosure can be fully understood. It should be noted that the embodiments of the present application and the features of the embodiments may be combined with each other without conflict.
In the description of the present invention, "several" means one or more, and "a plurality" means two or more; "greater than", "less than", "exceeding", and the like are understood to exclude the stated number, while "above", "below", "within", and the like are understood to include it. If "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating the precedence of the indicated technical features.
Fig. 1 is a flowchart illustrating the sentence and word probability calculation method based on a gradient boosting decision tree according to the present invention; a sentence and word probability calculation method and system based on a gradient boosting decision tree according to an embodiment of the present invention is described below with reference to fig. 1.
The disclosure provides a sentence and word probability calculation method based on a gradient boosting decision tree, which specifically comprises the following steps:
s100, performing data cleaning on a text data set, removing punctuation and invalid non-Chinese characters from it, to obtain a preprocessed data set;
s200, reading the character strings in the preprocessed data set, segmenting the read character strings with HanLP, combining the set of participles into a participle tagging set, and storing the obtained participles as an input data set in json format;
s300, reading the input data set and processing it into parallel data;
s400, inputting the parallel data in batches into a plurality of gradient boosting decision tree models for learning;
s500, during the learning of the plurality of gradient boosting decision tree models, calculating the sharing weights among the models from the batch inputs of the parallel data, adjusting each model's loss function by its sharing weight, and obtaining a plurality of discrimination models after learning;
s600, inputting data to be monitored, in the same format as the parallel data, into each of the discrimination models to obtain a plurality of output discrimination scores, and combining the discrimination score output by each discrimination model with its sharing weight to obtain a final overall discrimination score.
Further, in S100, the method of cleaning the text data set to remove punctuation and invalid non-Chinese characters and obtain the preprocessed data set is: obtain the text data set from a database of a server cluster, where the text data set is a table in the database, the field type of the table's data is character, the characters in the table are the characters of character strings read from text files, and each row of data in the table is the character string of one text; delete from the characters of the table's data fields every character whose decimal value in the ASCII comparison table lies in [0,31]; store the cleaned table as one or more csv files, which serve as the preprocessed data set.
Further, in S200, the method of reading the character strings in the preprocessed data set, segmenting them with HanLP, combining the set of participles into a participle tagging set, and storing the obtained participles as an input data set in json format is: read each row of the preprocessed data set as one character string; segment each read character string with HanLP, and take the participles obtained from each character string as a participle array, each participle being a character string composed of some of the characters of the original string; denote the set of all distinct participles across all participle arrays as the set Aset, called the participle tagging set; denote the number of elements of Aset as n and the serial number of an element of Aset as i, i ∈ [1, n]; mark each participle in each participle array with its serial number in Aset to obtain a participle tagging sequence, i.e. an element of a participle tagging sequence consists of a participle and that participle's serial number in the participle tagging set Aset: if a participle's serial number in Aset is i, the participle with serial number i in Aset is denoted Aset(i), and an element of a participle tagging sequence is the pair of the serial number i and the participle Aset(i), written <i, Aset(i)>; the set of all participle tagging sequences is stored in json format as the input data set.
Further, in S300, the method of reading the input data set and processing it into parallel data is: read the set formed by the participle tagging sequences in the input data set as the set Bset, where the number of elements of Bset is m, the serial number of an element of Bset is the variable j, j ∈ [1, m], and the element of Bset with serial number j is denoted B_j; define the function len() as the function that returns the number of elements of the array, sequence, or vector given as its input, so that len(B_j) is the number of elements of B_j, where the number of elements of B_j is greater than 0; denote the set formed by the element counts of the elements of each serial number in Bset as the set Lset, Lset = { len(B_j) | j ∈ [1, m] }; the function max() returns the element with the largest value of the array, sequence, or vector given as its input, so max(Lset) is the element of Lset with the largest value;
defining a function Han(), wherein the calculation process of the function Han() comprises the following steps:
s301-1, the function Han() takes the set Bset as input;
s301-2, create an empty set Cset;
s301-3, obtain len(B_j) for the element of each serial number in the set Bset through the function len(), obtain the largest value max(Lset) of the set Lset formed by the len(B_j) of the elements of each serial number, and denote max(Lset) as ml;
s301-4, for every element B_j in the set Bset that satisfies the constraint len(B_j) < ml, append ml - len(B_j) elements, each consisting of an empty character string and the serial number 0, to the end of the participle tagging sequence B_j; an element consisting of an empty character string and the serial number 0 is called a filling element; that is, the length of B_j is extended to ml by appending ml - len(B_j) filling elements after the original len(B_j) elements, so that the element B_j of every serial number in Bset is processed into a participle tagging sequence of length ml; within a participle tagging sequence of length ml, the serial number of an element is il, il ∈ [1, ml];
s301-5, add each element of the processed set Bset into Cset;
s301-6, the function Han() outputs Cset;
furthermore, from the output set Cset: the number of elements of Cset equals the number of elements of Bset, m; the serial number of an element of Cset equals the serial number of the corresponding element of Bset, j; each element of Cset is a participle tagging sequence of length ml; take the m participle tagging sequences of length ml in Cset as the columns of a matrix of ml rows and m columns, denoted the matrix Cmat;
each column of the matrix Cmat is an element of the set Cset: the number of columns of Cmat is m, the same as the number of elements of Cset, and the serial number of a column of Cmat is j, the same as the serial number of the corresponding element of Cset; each column of Cmat is a participle tagging sequence of length ml, so the number of rows of Cmat is ml, and the serial number of a row of Cmat is il, the same as the serial number of an element within a participle tagging sequence; row il of Cmat consists of the elements with serial number il of the tagging sequences with serial numbers 1 to m in Cset, and the element in row il and column j of Cmat is the element with serial number il of the tagging sequence with serial number j in Cset;
the element in row il and column j of the matrix Cmat is denoted Cmat[il, j]; it consists of a participle and that participle's serial number in the participle tagging set; the participle in Cmat[il, j] is denoted Ctr(il, j), and its serial number in the participle tagging set is denoted Cid(il, j), so that Cmat[il, j] = <Cid(il, j), Ctr(il, j)>;
replace the participle Ctr(il, j) in each element of the matrix Cmat with the word vector obtained from Ctr(il, j) by a word vector algorithm; the word vector obtained from Ctr(il, j) is denoted Cvec(il, j), the number of dimensions of a word vector is k, and the serial number of a dimension is v, v ∈ [1, k]; the word vector obtained from the empty character string is the all-zero vector. The array obtained by replacing Ctr(il, j) with Cvec(il, j) in every element of Cmat is denoted Ctensor; row il of Ctensor is denoted Ctensor[il], column j of Ctensor is denoted Ctensor[j], and the element in row il and column j is denoted Ctensor[il, j]; Ctensor[il, j] consists of Cid(il, j) and Cvec(il, j), i.e. Ctensor[il, j] = <Cid(il, j), Cvec(il, j)>; the value of the v-th dimension of Cvec(il, j) is denoted Cvec(il, j, v), and Ctensor is the parallel data;
wherein, preferably, the code used may include:
import random
import utils  # assumed helper module exposing the matrix Cmat

class ParalDel:
    def __init__(self, il, ml):
        """Replace the participle Ctr(il, j) in each element of the matrix
        Cmat with its word vector Cvec(il, j) to obtain the array Ctensor."""
        self.cmat = utils.cmat
        self.json_id = ml
        self.count = il * ml - 1
        self.pool = set()
        self.used_count = 0

    def pre_mer(self):
        """Draw a not-yet-used flat index and return its (row, column)
        position: row il and column j of Ctensor, whose element is
        Ctensor[il, j] = <Cid(il, j), Cvec(il, j)>."""
        while True:
            ar = random.randint(self.used_count, self.count)
            if ar not in self.pool:
                self.pool.add(ar)
                return divmod(ar, self.json_id)

    def paralellize(self):
        # reset the pool of used positions
        self.pool = set()
Parallel data are thus obtained.
Further, in S400, the method of inputting the parallel data in batches into a plurality of gradient boosting decision tree models for learning is as follows:
create q different gradient boosting decision tree models; define int() as the round-down (floor) function; obtain the participle tagging set and from it the number n of its elements; obtain the input data set and from it the number m of its elements; the value of q is calculated as:
[Equation rendered as an image in the original: the formula giving q via int() from m and n.]
obtain the parallel data, denoted Ctensor; the specific steps for dividing the parallel data Ctensor into batches are as follows:
s401, obtain the number of columns of Ctensor, which is m; set a variable p with initial value 0; set an empty set gap; go to S402;
s402, define the number of columns per batch as epi; divide m by q; if m divided by q leaves a remainder, go to S403, otherwise go to S405;
s403, set the value of p equal to the remainder of m divided by q; randomly take p columns out of Ctensor, so that the number of columns in Ctensor is m - p; go to S404;
s404, let g be the result of q - p; randomly take g columns from Ctensor and put them into gap; add all elements of gap back into Ctensor; empty the set gap; go to S405;
s405, obtain the number of columns of Ctensor; set the value of epi equal to the number of columns of Ctensor divided by q; go to S406;
s406, divide Ctensor into q batches, each batch being epi columns of Ctensor; input each batch into a gradient boosting decision tree model for learning, the gradient boosting decision tree model being an XGBoost model, so that q different gradient boosting decision tree models are obtained through learning;
in S406, the specific method by which each batch is input into a gradient boosting decision tree model for learning is as follows: during learning, randomly delete an element at some row and column of the parallel data and mark the vacated position with a placeholder; have the gradient boosting decision tree model predict, from the other elements that were not deleted, the element that occupied the placeholder before deletion, and output the probability of that prediction; compute, in the XGBoost manner, the cross entropy between the predicted probability and the element before deletion, obtain the output of the loss function through a sigmoid activation function, and use the loss output to optimize the model by gradient descent.
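A sketch of one masked-prediction learning and scoring step with the xgboost library follows; flattening the unmasked elements into a feature row and using XGBClassifier are assumptions of the example, standing in for the "XGBoost manner" the text names without detailing:

import numpy as np
import xgboost as xgb

def train_on_batch(masked_features, deleted_ids):
    """Fit one gradient boosting decision tree model: each training row
    encodes a batch column with one element deleted (the placeholder),
    and the label is the serial number Cid of the deleted element."""
    model = xgb.XGBClassifier()  # multiclass objective chosen automatically
    model.fit(np.asarray(masked_features), np.asarray(deleted_ids))
    return model

def placeholder_probability(model, features, deleted_id):
    """Probability the model assigns to the element that occupied the
    placeholder before deletion."""
    proba = model.predict_proba(np.asarray([features]))[0]
    idx = list(model.classes_).index(deleted_id)
    return float(proba[idx])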
Further, in S500, the method of calculating the sharing weights among the gradient boosting decision tree models from the batch inputs of the parallel data is: denote the parallel data as Ctensor, divided by columns into a plurality of batches; the number of batches into which the parallel data is divided by columns is q, and the serial number of a batch is qi, qi ∈ [1, q]; the batch of Ctensor with serial number qi is Ctensor(qi); every batch has the same number of rows, ml, and the serial number of a row within a batch is il, il ∈ [1, ml]; the number of columns of Ctensor(qi) is mqi, and the serial number of a column is jqi, jqi ∈ [1, mqi]; the row of Ctensor(qi) with serial number il is Ctensor(qi)(il), the column with serial number jqi is Ctensor(qi)(jqi), and the element in row il and column jqi is Ctensor(qi)(il, jqi), which consists of a serial number and a word vector: the serial number in Ctensor(qi)(il, jqi) is denoted Cid(qi)(il, jqi) and the word vector is denoted Cvec(qi)(il, jqi); the sharing weight is defined as a value measuring the frequency of the information content of each batch input; the function that calculates a sharing weight from a batch input is denoted Cov(), and the sharing weight of Ctensor(qi), Cov(qi), is calculated by the formula:
[Equation rendered as an image in the original: the formula of Cov(qi), computed over the elements of the batch Ctensor(qi).]
the calculated Cov(qi) is the sharing weight of the batch input with serial number qi in Ctensor; the array formed by the sharing weights of the batch inputs of every serial number in Ctensor is denoted Covs, Covs = [Cov(qi) | qi ∈ [1, q]], and the arithmetic mean of the elements of Covs is denoted Covg; the function exp() is the exponential function with the natural constant e as base; the number of different gradient boosting decision tree models is set to q, the same as the number of batch inputs, and their serial numbers to qi, the same as the serial numbers of the batch inputs; the gradient boosting decision tree model with serial number qi is denoted M(qi), and the set formed by the models is denoted Ms; the output of the loss function of M(qi) is denoted loss(qi), and adjusting loss(qi) by the sharing weight of the model yields the adjusted result; the return value of the loss function is the quantity used to optimize a model by gradient descent during machine-learning training; the return value of the loss function of M(qi) is denoted L(qi), and L(qi) is used to optimize M(qi); the array formed by the return values of the loss functions of the models of every serial number is the array Loss; the formulas by which the sharing weights adjust the loss functions of the plurality of gradient boosting decision tree models are:
[Two equations rendered as images in the original: the formulas adjusting each loss(qi) by the sharing weights Cov(qi) and their mean Covg, using exp(), to yield the return values L(qi).]
thus, the L(qi) corresponding to each M(qi) is used to optimize that model, so that q discrimination models are obtained after optimization.
Further, in S600, the method of inputting data to be monitored, in the same format as the parallel data, into each of the discrimination models to obtain a plurality of output discrimination scores, and combining the score output by each discrimination model with its sharing weight to obtain a final overall discrimination score is:
the data to be monitored can be input character-string data or a matrix of the same size as the parallel data, whose elements consist of a participle's serial number in the participle tagging set and the word vector obtained from the participle by the word vector algorithm; the number and serial numbers of the discrimination models equal those of the gradient boosting decision tree models before optimization, i.e. there are q discrimination models with serial numbers qi, qi ∈ [1, q]; the output of a discrimination model is the discrimination score of XGBoost: an element at a random row and column of the data is deleted and its position marked with a placeholder, the gradient boosting decision tree model predicts, from the other elements that were not deleted, the element that occupied the placeholder before deletion, and outputs the probability of that prediction; the output of the discrimination model with serial number qi is score(qi), the probability of predicting the element before deletion; the sharing weight corresponding to the discrimination model with serial number qi is Cov(qi); the final overall discrimination score is denoted score and is calculated by the formula:
[Equation rendered as an image in the original: the formula for score, combining each score(qi) with its sharing weight Cov(qi).]
the calculated score is the final overall discrimination score obtained by combining the discrimination scores output by the discrimination models with the sharing weights, and the calculated score is stored as the sentence probability.
The sentence and word probability calculation system based on a gradient boosting decision tree can run on computing devices such as desktop computers, notebooks, palmtop computers, and cloud data centers; the runnable system may include, but is not limited to, a processor, a memory, and a server cluster.
As shown in fig. 2, the sentence and word probability calculation system based on a gradient boosting decision tree according to an embodiment of the present disclosure comprises: a processor, a memory, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the steps of the embodiment of the sentence and word probability calculation method based on a gradient boosting decision tree described above, and the processor executes the computer program to run in the following units of the system:
the preprocessing unit is used for cleaning the text data set, removing punctuation and invalid non-Chinese characters from it, to obtain a preprocessed data set;
the input data set unit is used for reading the character strings in the preprocessed data set, segmenting the read character strings with HanLP, combining the set of participles into a participle tagging set, and storing the obtained participles as an input data set in json format;
the parallel data processing unit is used for reading the input data set and processing it into parallel data;
the model learning unit is used for inputting the parallel data in batches into a plurality of gradient boosting decision tree models for learning;
the batch input unit is used for calculating, during the learning of the plurality of gradient boosting decision tree models, the sharing weights among the models from the batch inputs of the parallel data, adjusting each model's loss function by its sharing weight, and obtaining a plurality of discrimination models after learning;
and the probability output unit is used for inputting data to be monitored, in the same format as the parallel data, into each of the discrimination models to obtain a plurality of output discrimination scores, and combining the discrimination score output by each discrimination model with its sharing weight to obtain a final overall discrimination score.
The sentence and word probability calculation system based on a gradient boosting decision tree can run on computing devices such as desktop computers, notebooks, palmtop computers, and cloud data centers. The system comprises a processor and a memory. Those skilled in the art will appreciate that this example is only an example of the sentence and word probability calculation method and system based on a gradient boosting decision tree and does not limit them; the system may include more or fewer components, combine certain components, or use different components; for example, it may further include input and output devices, network access devices, buses, and the like.
The Processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the sentence and word probability calculation system based on a gradient boosting decision tree, and connects all parts of the whole system using various interfaces and lines.
The memory can be used to store the computer program and/or modules, and the processor implements the various functions of the sentence and word probability calculation method and system based on a gradient boosting decision tree by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function); the data storage area may store data created according to use (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
Note: if a variable or code is defined repeatedly in this application, the scope of that variable is limited to its own paragraph.
Although the present disclosure has been described in considerable detail with reference to several illustrated embodiments, it is not intended to be limited to those details or embodiments, but is intended to effectively cover the full scope of the disclosure. The foregoing describes the disclosure in terms of embodiments foreseen by the inventors for which an enabling description was available; insubstantial modifications of the disclosure not presently foreseen may nonetheless represent equivalents thereto.

Claims (8)

1. A sentence probability calculation method based on a gradient boosting decision tree, characterized by comprising the following steps:
S100, performing data cleaning on a text data set, removing punctuation marks and invalid non-Chinese characters therein, to obtain a preprocessed data set;
S200, reading the character strings in the preprocessed data set, segmenting the read character strings with HanLP, combining the resulting participles into a participle labeling set, and storing the obtained participles in json format as an input data set;
s300, reading and processing an input data set into parallel data;
S400, inputting the parallel data into a plurality of gradient boosting decision tree models in batches for learning;
S500, during the learning of the plurality of gradient boosting decision tree models, calculating the sharing weight among the models from the batch input of the parallel data, adjusting the loss function of each model according to its sharing weight, and obtaining a plurality of discrimination models after learning;
S600, inputting data to be monitored, in the same format as the parallel data, into each of the plurality of discrimination models to obtain a plurality of output discrimination scores, and combining the discrimination score output by each discrimination model with its sharing weight to obtain a final overall discrimination score.
2. The sentence probability calculation method based on the gradient boosting decision tree as claimed in claim 1, wherein in S100 the method for cleaning the text data set, removing punctuation marks and invalid non-Chinese characters, and obtaining the preprocessed data set is as follows: a text data set is obtained from a database of a server cluster, the text data set being a table in the database whose data fields are of string type, with each row of the table holding the character string of one text; each character in the table data is converted into a decimal value according to the ASCII code table, characters whose decimal value falls within the range [0, 31] (i.e., control characters) are deleted from the data, and the table data after deletion is used as the preprocessed data set.
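For illustration, a minimal Python sketch of this cleaning rule follows; the function name clean_rows is hypothetical, and only the stated deletion of characters with decimal ASCII values in [0, 31] is implemented.

    def clean_rows(rows):
        # Keep only characters whose code point exceeds 31, removing the
        # ASCII control characters named in the claim.
        return ["".join(ch for ch in row if ord(ch) > 31) for row in rows]

    # Example: the tab and newline (ASCII 9 and 10) are removed.
    assert clean_rows(["a\tb\nc"]) == ["abc"]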
3. The sentence probability calculation method based on the gradient boosting decision tree as claimed in claim 1, wherein in S200 the method of reading the character strings in the preprocessed data set, segmenting them with HanLP, combining the participles into a participle labeling set, and storing the result in json format as the input data set is as follows: each row of the preprocessed data set is read as one character string; each character string is segmented with HanLP, and the participles obtained from one character string form a participle array, each participle being a character string composed of some of the characters of the original string; the set of all distinct participles occurring in all participle arrays is denoted Aset and called the participle labeling set; the number of elements of Aset is denoted n and the sequence number of an element of Aset is denoted i, i ∈ [1, n]; each participle in each participle array is labeled with its sequence number in the participle labeling set Aset, yielding a participle labeling sequence, i.e., each element of a participle labeling sequence consists of a participle and that participle's sequence number in Aset: if the sequence number of a participle in Aset is i, the participle is denoted Aset(i), and the corresponding element of the participle labeling sequence is the pair <i, Aset(i)>; the set of all participle labeling sequences is stored in json format as the input data set.
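A minimal Python sketch of this step, assuming the pyhanlp binding for HanLP; the helper name build_input_dataset and the list-of-pairs json layout are illustrative assumptions.

    import json
    from pyhanlp import HanLP  # assumes the pyhanlp package is installed

    def build_input_dataset(rows, path="input_dataset.json"):
        # Segment each row into a participle array.
        arrays = [[term.word for term in HanLP.segment(row)] for row in rows]
        # Aset: the distinct participles; sequence numbers i start at 1.
        aset = sorted({w for arr in arrays for w in arr})
        index = {w: i for i, w in enumerate(aset, start=1)}
        # Each labeling-sequence element is the pair <i, Aset(i)>.
        sequences = [[[index[w], w] for w in arr] for arr in arrays]
        with open(path, "w", encoding="utf-8") as f:
            json.dump(sequences, f, ensure_ascii=False)
        return aset, sequences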
4. The sentence probability calculation method based on the gradient boosting decision tree according to claim 1, wherein in S300 the method for reading and processing the input data set into parallel data is as follows: the set formed by the participle labeling sequences in the input data set is read as a set Bset; the number of elements of Bset is m, the sequence number of an element of Bset is the variable j, j ∈ [1, m], and the element of Bset with sequence number j is denoted B_j; the function len() returns the number of elements of the array, sequence, or vector given as its input, so len(B_j) is the number of elements of B_j; the set formed by the element counts of the elements of each sequence number in Bset is denoted Lset, i.e., Lset = {len(B_j) | j ∈ [1, m]}; the function max() returns the largest-valued element of the array, sequence, or vector given as its input, so max(Lset) is the largest element of Lset;
defining a function Han(), the calculation process of which comprises the following steps:
S301-1, the function Han() takes the set Bset as its input;
S301-2, setting an empty set Cset;
S301-3, obtaining len(B_j) for the element of each sequence number in the set Bset through the function len(), obtaining the largest value max(Lset) of the set Lset formed by these len(B_j), and denoting max(Lset) as ml;
S301-4, if an element B_j of the set Bset satisfies the constraint len(B_j) < ml, appending ml - len(B_j) elements, each consisting of an empty character string and the sequence number 0, to the tail of the participle labeling sequence B_j; an element consisting of an empty character string and the sequence number 0 is called a filling element; that is, the length of the participle labeling sequence B_j is extended to ml by appending ml - len(B_j) filling elements after the original len(B_j) elements, so that the element B_j of each sequence number in the set Bset is processed into a participle labeling sequence of length ml; in a participle labeling sequence of length ml, the sequence number of an element is il, il ∈ [1, ml];
S301-5, adding each element of the processed set Bset to Cset;
S301-6, the function Han() outputs Cset;
furthermore, on the basis of the output set Cset: the number of elements of Cset is m, the same as the number of elements of Bset; the sequence number of an element of Cset is j, the same as in Bset; each element of Cset is a participle labeling sequence of length ml; the m participle labeling sequences of length ml in Cset are taken in order as the m columns of a matrix, giving a matrix of ml rows and m columns denoted Cmat;
each column of the matrix Cmat is one element of the set Cset: the number of columns of Cmat is m, the same as the number of elements of Cset, and the sequence number of a column of Cmat is j, the same as the sequence number of the corresponding element of Cset; each column of Cmat is a participle labeling sequence of length ml, so the number of rows of Cmat is ml, and the sequence number of a row of Cmat is il, the same as the sequence number of an element within a participle labeling sequence; row il of Cmat consists of the il-th element of each of the participle labeling sequences of Cset, for j from 1 to m; row il of Cmat is denoted Cmat[il,] and column j of Cmat is denoted Cmat[,j];
the element in row il and column j of the matrix Cmat is denoted Cmat[il, j]; Cmat[il, j] consists of a participle and that participle's sequence number in the participle labeling set; the participle in Cmat[il, j] is denoted Ctr(il, j), and its sequence number in the participle labeling set is denoted Cid(il, j), so the element in row il and column j can be written Cmat[il, j] = <Cid(il, j), Ctr(il, j)>;
the participle Ctr(il, j) in each element of the matrix Cmat is replaced by the word vector obtained from Ctr(il, j) through a word vector algorithm; the word vector obtained from Ctr(il, j) is denoted Cvec(il, j); the number of dimensions of a word vector is k and the sequence number of a dimension is v, v ∈ [1, k]; the word vector obtained from an empty character string is the all-zero vector; replacing Ctr(il, j) with Cvec(il, j) in every element of Cmat yields an array denoted Ctensor; row il of Ctensor is denoted Ctensor[il,], column j of Ctensor is denoted Ctensor[,j], and the element in row il and column j of Ctensor is denoted Ctensor[il, j], which consists of Cid(il, j) and Cvec(il, j); the value of the v-th dimension of Cvec(il, j) is denoted Cvec(il, j, v); Ctensor is the parallel data.
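A minimal Python sketch of the padding and word-vector replacement, assuming gensim's Word2Vec as the word vector algorithm (the claim does not name one); the helper name to_parallel_data and the column-list representation of Ctensor are illustrative.

    import numpy as np
    from gensim.models import Word2Vec

    def to_parallel_data(sequences, k=50):
        ml = max(len(seq) for seq in sequences)            # ml = max(Lset)
        # S301-4: pad every labeling sequence to length ml with filling
        # elements <0, "">.
        padded = [seq + [[0, ""]] * (ml - len(seq)) for seq in sequences]
        w2v = Word2Vec([[w for _, w in seq] for seq in sequences],
                       vector_size=k, min_count=1)
        zero = np.zeros(k)                                 # empty string -> all-zero vector
        # Each padded sequence is one column of Ctensor; an element pairs
        # Cid(il, j) with the k-dimensional word vector Cvec(il, j).
        columns = [[(i, w2v.wv[w] if w else zero) for i, w in seq]
                   for seq in padded]
        return columns, ml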
5. The sentence probability calculation method based on the gradient boosting decision tree according to claim 1, wherein in S400 the method of inputting the parallel data into a plurality of gradient boosting decision tree models in batches for learning is as follows:
q different gradient boosting decision tree models are created; int() is defined as the round-down (floor) function; the participle labeling set is obtained, giving its number of elements n; the input data set is obtained, giving its number of elements m; the value of q is calculated as:
(Formula for q in terms of m and n: rendered as an image in the original publication and not recoverable from the text.)
the parallel data is obtained and denoted Ctensor; the specific steps for dividing the parallel data Ctensor into batches are as follows:
S401, acquiring the number of columns of Ctensor, namely m; setting the initial value of a variable p to 0; setting an empty set gap; going to S402;
S402, defining the batch size as epi; dividing m by q; judging whether m divided by q leaves a remainder, and if so, going to S403, otherwise going to S405;
S403, setting the value of p equal to the remainder of m divided by q; randomly removing q columns from Ctensor, so that the number of columns of Ctensor becomes m - q; going to S404;
S404, letting g be the result of q - p; randomly selecting g columns of Ctensor, placing copies of them in gap, and appending all elements of gap to Ctensor as additional columns; emptying the set gap; going to S405;
S405, acquiring the number of columns of Ctensor; setting the value of epi equal to the number of columns of Ctensor divided by q; going to S406;
S406, dividing Ctensor into q batches, each batch consisting of epi columns of Ctensor, and inputting each batch into one gradient boosting decision tree model for learning, the gradient boosting decision tree model being an XGBoost model, whereby q different gradient boosting decision tree models are obtained through learning.
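A minimal Python sketch of this batch division; since the wording of S403-S404 is ambiguous, the reading below (remove q random columns, then append q - p duplicated columns, leaving a column count divisible by q) is an assumption.

    import random

    def split_batches(columns, q):
        cols = list(columns)
        m = len(cols)
        p = m % q
        if p:                                   # m is not divisible by q
            for _ in range(q):                  # S403: remove q random columns
                cols.pop(random.randrange(len(cols)))
            gap = random.sample(cols, q - p)    # S404: duplicate q - p columns
            cols.extend(gap)                    # now len(cols) = m - p
        epi = len(cols) // q                    # S405: columns per batch
        return [cols[b * epi:(b + 1) * epi] for b in range(q)]  # S406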
6. The sentence probability calculation method based on the gradient boosting decision tree according to claim 1, wherein in S500 the method for calculating the sharing weight between the gradient boosting decision tree models from the batch input of the parallel data is as follows: the parallel data is denoted Ctensor and is divided into a plurality of batches by columns; the number of batches is q and the sequence number of a batch is qi, qi ∈ [1, q]; the batch of Ctensor with sequence number qi is denoted Ctensor(qi); the number of rows of every batch is the same, namely ml, and the sequence number of a row is il, il ∈ [1, ml]; the number of columns of Ctensor(qi) is mqi and the sequence number of a column of Ctensor(qi) is jqi, jqi ∈ [1, mqi]; the row of Ctensor(qi) with sequence number il is denoted Ctensor(qi)(il,), the column with sequence number jqi is denoted Ctensor(qi)(,jqi), and the element in row il and column jqi is denoted Ctensor(qi)(il, jqi); Ctensor(qi)(il, jqi) consists of a sequence number and a word vector, the sequence number being denoted Cid(qi)(il, jqi) and the word vector being denoted Cvec(qi)(il, jqi); the sharing weight is defined as a value measuring the information content of each batch input; the function computing the sharing weight from a batch input is denoted Cov(), and the sharing weight of Ctensor(qi) computed by Cov() is Cov(qi), with the formula:
(Formula for Cov(qi): rendered as an image in the original publication and not recoverable from the text.)
the computed Cov(qi) is the sharing weight of the batch input with sequence number qi in Ctensor; the array formed by the sharing weights of the batch inputs of all sequence numbers is denoted Covs, Covs = [Cov(qi) | qi ∈ [1, q]]; the arithmetic mean of the elements of Covs is denoted Covg; the function exp() is the exponential function with the natural constant e as its base; the number of distinct gradient boosting decision tree models is set to q, the same as the number of batch inputs, and their sequence numbers are set to qi, the same as the sequence numbers of the batch inputs; the gradient boosting decision tree model with sequence number qi is denoted M(qi), and the set formed by all the models is denoted Ms; the output of the loss function of M(qi) is denoted loss(qi); adjusting the loss function of each gradient boosting decision tree model by its sharing weight yields the adjusted result for that model; the return value of a loss function is the quantity used to optimize the model by gradient descent during the model training process of machine learning; the return value of the loss function of M(qi) is denoted L(qi), and L(qi) is used to optimize M(qi); the array formed by the return values of the loss functions of the models of all sequence numbers is denoted Loss; the formulas for adjusting the loss functions of the plurality of gradient boosting decision tree models by the sharing weights are:
(Formulas for Loss and L(qi): rendered as images in the original publication and not recoverable from the text.)
thus each L(qi) is used to optimize the corresponding gradient boosting decision tree model M(qi), and q discrimination models are obtained after optimization.
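Because the Cov() and loss-adjustment formulas survive only as images, the Python sketch below substitutes stand-ins and labels them as such: Cov(qi) is approximated by the variance of the word-vector entries in batch qi, as a proxy for its information content, and each model's loss is rescaled by exp(Cov(qi) - Covg). Neither choice is taken from the patent text.

    import numpy as np

    def sharing_weights(batches):
        # Assumed proxy for Cov(qi): variance of all word-vector entries
        # in the batch (the original formula is not recoverable).
        return [float(np.var([vec for col in batch for _, vec in col]))
                for batch in batches]

    def adjusted_losses(raw_losses, covs):
        covg = float(np.mean(covs))        # Covg: arithmetic mean of Covs
        # Assumed adjustment: scale each loss by exp(Cov(qi) - Covg), so
        # batches carrying more information weigh more during training.
        return [l * float(np.exp(c - covg)) for l, c in zip(raw_losses, covs)]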
7. The sentence probability calculation method based on the gradient boosting decision tree as claimed in claim 6, wherein in S600 the data to be monitored, in the same format as the parallel data, is input into each of the plurality of discrimination models to obtain a plurality of output discrimination scores, and the method for combining the discrimination score output by each discrimination model with its sharing weight to obtain the final overall discrimination score is as follows: the number and the sequence numbers of the discrimination models are equal to those of the gradient boosting decision tree models before optimization, i.e., the number of discrimination models is q and the sequence number of a discrimination model is qi, qi ∈ [1, q]; the output of a discrimination model is an XGBoost discrimination score; an element at a random row and column of the parallel data is deleted, a placeholder is put in the position vacated by the deleted element, and the gradient boosting decision tree model predicts the element that occupied the position before deletion from the other, undeleted elements and outputs the probability of the predicted element; the output of the discrimination model with sequence number qi is denoted score(qi), which is the probability of the element predicted for the deleted position; the sharing weight corresponding to the discrimination model with sequence number qi is denoted Cov(qi); the final overall discrimination score is denoted score, and the calculation formula of score is:
(Formula for score: rendered as an image in the original publication and not recoverable from the text.)
and the computed score is the final overall discrimination score obtained by combining the discrimination scores output by the discrimination models with the sharing weights, and it is stored as the sentence probability.
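The score formula itself is an image in the original; the Python sketch below assumes a sharing-weight-normalized sum, which matches the surrounding description of combining the discrimination scores with the sharing weights.

    def overall_score(scores, covs):
        # Weighted combination of the q discrimination scores; dividing by
        # the weight sum (normalization) is an assumption of this sketch.
        total = sum(covs)
        return sum(s * c for s, c in zip(scores, covs)) / total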
8. A sentence probability calculation system based on a gradient boosting decision tree, characterized in that the system comprises a processor and a memory, the processor executing a computer program stored in the memory to implement the steps of the sentence probability calculation method based on the gradient boosting decision tree of claim 1; the system runs on computing devices such as desktop computers, notebook computers, palmtop computers, and cloud data centers, and the operable system comprises the processor, the memory, and a server cluster.
CN202111184159.XA 2021-10-12 2021-10-12 Sentence and word probability calculation method and system based on gradient lifting decision tree Active CN113609843B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111184159.XA CN113609843B (en) 2021-10-12 2021-10-12 Sentence and word probability calculation method and system based on gradient lifting decision tree


Publications (2)

Publication Number Publication Date
CN113609843A true CN113609843A (en) 2021-11-05
CN113609843B (en) 2022-02-01

Family

ID=78310943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111184159.XA Active CN113609843B (en) 2021-10-12 2021-10-12 Sentence and word probability calculation method and system based on gradient lifting decision tree

Country Status (1)

Country Link
CN (1) CN113609843B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743110A (en) * 2021-11-08 2021-12-03 京华信息科技股份有限公司 Word missing detection method and system based on fine-tuning generation type confrontation network model
CN114091624A (en) * 2022-01-18 2022-02-25 蓝象智联(杭州)科技有限公司 Federal gradient lifting decision tree model training method without third party
CN117725437A (en) * 2024-02-18 2024-03-19 南京汇卓大数据科技有限公司 Machine learning-based data accurate matching analysis method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279252A (en) * 2015-10-12 2016-01-27 广州神马移动信息科技有限公司 Related word mining method, search method and search system
CN107423339A (en) * 2017-04-29 2017-12-01 天津大学 Popular microblogging Forecasting Methodology based on extreme Gradient Propulsion and random forest
CN107608961A (en) * 2017-09-08 2018-01-19 广州汪汪信息技术有限公司 Sentiment analysis method, electronic equipment, storage medium, system based on visual angle
WO2018107921A1 (en) * 2016-12-15 2018-06-21 腾讯科技(深圳)有限公司 Answer sentence determination method, and server
CN109086412A (en) * 2018-08-03 2018-12-25 北京邮电大学 A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT
US20190220514A1 (en) * 2017-02-23 2019-07-18 Tencent Technology (Shenzhen) Company Ltd Keyword extraction method, computer equipment and storage medium


Also Published As

Publication number Publication date
CN113609843B (en) 2022-02-01

Similar Documents

Publication Publication Date Title
CN113609843B (en) Sentence and word probability calculation method and system based on gradient lifting decision tree
CN109271521B (en) Text classification method and device
CN112883190A (en) Text classification method and device, electronic equipment and storage medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN115146865A (en) Task optimization method based on artificial intelligence and related equipment
CN112883730B (en) Similar text matching method and device, electronic equipment and storage medium
CN113722512A (en) Text retrieval method, device and equipment based on language model and storage medium
CN112783825B (en) Data archiving method, device, computer device and storage medium
CN115018588A (en) Product recommendation method and device, electronic equipment and readable storage medium
CN110083731B (en) Image retrieval method, device, computer equipment and storage medium
CN114969387A (en) Document author information disambiguation method and device and electronic equipment
US9008974B2 (en) Taxonomic classification system
CN111553442B (en) Optimization method and system for classifier chain tag sequence
CN111651625A (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN113705201B (en) Text-based event probability prediction evaluation algorithm, electronic device and storage medium
CN113627157B (en) Probability threshold value adjusting method and system based on multi-head attention mechanism
CN115952800A (en) Named entity recognition method and device, computer equipment and readable storage medium
CN113505117A (en) Data quality evaluation method, device, equipment and medium based on data indexes
CN113724779A (en) SNAREs protein identification method, system, storage medium and equipment based on machine learning technology
CN113268571A (en) Method, device, equipment and medium for determining correct answer position in paragraph
CN113343102A (en) Data recommendation method and device based on feature screening, electronic equipment and medium
CN113469237A (en) User intention identification method and device, electronic equipment and storage medium
CN112632264A (en) Intelligent question and answer method and device, electronic equipment and storage medium
CN107622129B (en) Method and device for organizing knowledge base and computer storage medium
CN113934842A (en) Text clustering method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant