WO2023039925A1 - Intelligent construction method, apparatus, device and medium for an enterprise classification model - Google Patents

Intelligent construction method, apparatus, device and medium for an enterprise classification model Download PDF

Info

Publication number
WO2023039925A1
WO2023039925A1 · PCT/CN2021/120254 (CN2021120254W)
Authority
WO
WIPO (PCT)
Prior art keywords
classification model
sample feature
training data
data set
sample
Prior art date
Application number
PCT/CN2021/120254
Other languages
English (en)
French (fr)
Inventor
谢翀
罗伟杰
陈永红
黄开梅
Original Assignee
深圳前海环融联易信息科技服务有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海环融联易信息科技服务有限公司
Publication of WO2023039925A1 publication Critical patent/WO2023039925A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0202 Market predictions or forecasting for commercial activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06 Asset management; Financial planning or analysis

Definitions

  • The present application relates to the technical field of artificial intelligence, and in particular to a method, apparatus, device and medium for intelligently constructing an enterprise classification model.
  • The industry classification of an enterprise is an important part of business analysis and investment decision-making. By understanding the specific industry in which potential customers operate, enterprises can segment target markets and carry out targeted marketing; by predicting the future development trends of industries, researchers can provide a sound basis for the decisions of investment institutions.
  • However, manual industry classification is inefficient, and its results are easily affected by subjective factors.
  • The inventors found that, due to factors such as the unbalanced distribution of corpus samples and inconsistent manual judgment standards, enterprise classification models based on traditional machine learning cannot be trained accurately and efficiently.
  • Because of the low training quality, the final enterprise classification model performs poorly and is difficult to apply in practice. Therefore, prior-art methods cannot construct a high-performance enterprise classification model.
  • The embodiments of the present application provide a method, apparatus, device and medium for intelligently constructing an enterprise classification model, aiming to solve the problem that a high-performance enterprise classification model cannot be constructed with prior-art methods.
  • An embodiment of the present application provides a method for intelligently constructing an enterprise classification model, which includes:
  • converting the corpus samples contained in a corpus database according to preset text processing rules to obtain the corresponding sample feature sequences;
  • An embodiment of the present application also provides an apparatus for intelligently constructing an enterprise classification model, which includes:
  • a sample feature sequence acquisition unit, configured to, if an input corpus database is received, convert the corpus samples contained in the corpus database according to preset text processing rules to obtain the corresponding sample feature sequences;
  • a training data set generation unit, configured to generate a plurality of training data sets corresponding to preset data set generation rules according to the sample feature sequences;
  • a model output information acquisition unit, configured to sequentially input the sample feature sequences in each of the training data sets into the initial classification model for computation, so as to obtain the model output information corresponding to each of the sample feature sequences;
  • a model training unit, configured to iteratively adjust the parameter values in the initial classification model according to preset parameter adjustment rules and the model output information of the sample feature sequences contained in each of the training data sets, so as to obtain a classification model trained from the initial classification model.
  • An embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, it implements the method for intelligently constructing an enterprise classification model described in the first aspect above.
  • An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method for intelligently constructing an enterprise classification model described in the first aspect above.
  • Embodiments of the present application provide a method, apparatus, computer device and readable storage medium for intelligently constructing an enterprise classification model.
  • The corpus samples contained in the corpus database are converted to obtain sample feature sequences; multiple training data sets are generated according to the sample feature sequences; the sample feature sequences in the training data sets are sequentially input into the initial classification model for computation to obtain the corresponding model output information; and the parameter values in the initial classification model are iteratively adjusted according to the parameter adjustment rules and the model output information of the sample feature sequences contained in each training data set, to obtain the trained classification model.
  • In this way, the parameter values in the initial classification model can be iteratively adjusted based on parameter adjustment rules that include a weighted loss calculation formula, thereby avoiding the noise interference that factors such as unbalanced corpus sample distribution and inconsistent manual judgment standards introduce into the training of the initial classification model, and greatly improving the performance of the constructed classification model.
  • Fig. 1 is a schematic flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
  • Fig. 2 is a schematic sub-flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
  • Fig. 3 is another schematic sub-flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
  • Fig. 4 is another schematic sub-flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
  • Fig. 5 is another schematic sub-flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
  • Fig. 6 is another schematic sub-flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
  • Fig. 7 is another schematic flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
  • Fig. 8 is a schematic block diagram of the apparatus for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
  • Fig. 9 is a schematic block diagram of the computer device provided by an embodiment of the present application.
  • Fig. 1 is a schematic flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application; the method is applied in a user terminal or a management server and is executed by application software installed in the user terminal or the management server.
  • The management server is a server that can execute the method to intelligently construct the enterprise classification model, for example a server built inside an enterprise or a government department.
  • The user terminal is a terminal device that can execute the method to intelligently construct the enterprise classification model, such as a desktop computer, a notebook computer, a tablet computer or a mobile phone.
  • The method includes steps S110 to S140.
  • The corpus samples contained in the corpus database are converted according to the preset text processing rules to obtain the corresponding sample feature sequences.
  • The input corpus database can be received; it contains multiple corpus samples, and each corpus sample corresponds to one enterprise.
  • A corpus sample may be a piece of text containing information such as a description of the company profile, an overall introduction to the company, or the company's annual report.
  • The corpus samples can be converted according to the text processing rules to obtain sample feature sequences; a sample feature sequence is a coded sequence that represents the character features contained in a corpus sample by means of digital encoding.
  • The text processing rules include an invalid character set and a feature lexicon.
  • In one embodiment, step S110 includes sub-steps S111 and S112.
  • Specifically, the invalid characters contained in the text information of each corpus sample are filtered out first. The invalid character set contains the invalid characters that need to be removed, such as symbols and spaces; filtering these out of the text information of the corpus sample yields valid text information containing only valid characters.
  • The valid text information is then converted into feature words.
  • The feature words contained in the feature lexicon may consist of one or more characters, and each feature word corresponds to a one-hot encoding. The characters in the valid text information are matched against the feature words; if a character or a combination of consecutive characters in the valid text information matches a feature word, the one-hot encoding corresponding to that feature word is retrieved for the conversion. Sorting the one-hot encodings obtained from this conversion according to the positions of the characters in the valid text information yields the corresponding sample feature sequence.
  • The feature lexicon may be a jieba ("stammer") word segmentation lexicon containing 166,000 feature words, which correspondingly contains 166,000 one-hot encodings.
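  • As a concrete illustration, the following is a minimal Python sketch of this conversion step. The jieba tokenizer and the toy vocab dictionary are assumptions; the patent only specifies a 166,000-word segmentation lexicon whose entries map to one-hot encodings, represented here by vocabulary indices:

```python
import re
import jieba  # Chinese word segmentation library; stands in for the 166,000-word lexicon

# Hypothetical feature lexicon: feature word -> index of its one-hot encoding (toy stand-in)
vocab = {word: idx for idx, word in enumerate(["金融", "房地产", "科技"])}

INVALID_CHARS = re.compile(r"[\s\W]+")  # symbols and spaces to filter out (step S111)

def to_feature_sequence(text: str) -> list[int]:
    """Filter invalid characters, segment into feature words, map to encoding indices."""
    valid_text = INVALID_CHARS.sub("", text)   # keep only valid characters
    words = jieba.lcut(valid_text)             # step S112: match feature words
    # Keep the encodings in the order the characters appear in the valid text
    return [vocab[w] for w in words if w in vocab]
```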
  • Multiple training data sets corresponding to preset data set generation rules are generated according to the sample feature sequences.
  • A plurality of training data sets can be generated based on the sample feature sequences; the generated training data sets must satisfy the data set generation rules, which include a sample count.
  • In one embodiment, step S120 includes sub-steps S121 and S122.
  • The corpus samples contain text information and the target classification labels corresponding to that text information.
  • The target classification labels are the true classification labels of the industry categories to which each sample's text information belongs.
  • Each industry label among the target classification labels corresponds to one specific industry category, and the target classification labels may contain one or more industry labels.
  • For example, the target classification labels corresponding to an enterprise might be "finance, real estate".
  • The target classification labels corresponding to each sample feature sequence can be counted to obtain the number of corpus samples corresponding to each industry label in the corpus database; collecting the sample count of every industry label yields the label statistics.
  • The resulting label statistics can be denoted N, a 1×K one-dimensional array, where K is the total number of industry categories and N_k is the number of corpus samples carrying the industry label of the k-th industry category.
  • A number of sample feature sequences equal to the sample count are randomly drawn and combined into a training data set, so each training data set contains that many sample feature sequences. For example, if the corpus database contains 10,000 corpus samples and the sample count is 500, then 500 of the sample feature sequences corresponding to the 10,000 corpus samples can be randomly drawn and combined into one training data set, and the random drawing is performed again on the sample feature sequences of the remaining 9,500 corpus samples.
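  • A minimal sketch of this generation step, assuming samples are stored alongside a binary label matrix; the function names and the use of NumPy are illustrative, not from the patent:

```python
import random
import numpy as np

def label_statistics(labels: np.ndarray) -> np.ndarray:
    """labels: (num_samples, K) binary matrix of target classification labels.
    Returns N, the 1 x K array of per-industry-category sample counts."""
    return labels.sum(axis=0)

def make_training_sets(samples: list, sample_count: int = 500) -> list:
    """Randomly partition the samples into training data sets of `sample_count` each."""
    pool = samples[:]              # copy so the corpus database itself is untouched
    random.shuffle(pool)           # random draws without replacement
    return [pool[i:i + sample_count] for i in range(0, len(pool), sample_count)]
```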
  • The sample feature sequences in each of the training data sets are sequentially input into the initial classification model for computation, so as to obtain the model output information corresponding to each of the sample feature sequences.
  • The initial classification model is a neural network model based on a multilayer perceptron structure.
  • The initial classification model may be a four-layer fully connected neural network in which the first three layers use batch normalization and dropout (with the retention probability set to 50%), so that the model converges quickly and avoids overfitting; the first three layers use the hyperbolic tangent as the activation function, and the last layer is the multi-label classification output layer, which uses batch normalization and a Sigmoid activation function.
  • The input layer contains 166,000 neurons; the first fully connected layer has 640 neurons; the second and third layers have 4,096 neurons each; and the number of neurons in the last layer (that is, the output layer) equals the total number of industry categories.
  • The neurons in adjacent layers are connected by linear functions, whose parameter values in the initial classification model all start from the same default value.
  • The sample feature sequences in a training data set are sequentially input into the initial classification model for computation, and the corresponding model output information can be obtained from the last layer of the initial classification model.
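  • The following is a minimal PyTorch sketch of the architecture just described (a 166,000-dimensional input, hidden layers of 640, 4,096 and 4,096 units with batch normalization, tanh and 50%-retention dropout, and a Sigmoid multi-label output of K units). PyTorch, the illustrative value of K and the class name are assumptions; the patent does not name a framework:

```python
import torch
import torch.nn as nn

K = 100  # total number of industry categories (illustrative value)

class InitialClassificationModel(nn.Module):
    """Four-layer fully connected network as described in the embodiment."""
    def __init__(self, vocab_size: int = 166_000, num_classes: int = K):
        super().__init__()
        def block(n_in: int, n_out: int) -> nn.Sequential:
            # batch normalization + tanh activation + dropout (50% retention probability)
            return nn.Sequential(nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out),
                                 nn.Tanh(), nn.Dropout(p=0.5))
        self.hidden = nn.Sequential(block(vocab_size, 640),
                                    block(640, 4096),
                                    block(4096, 4096))
        # multi-label classification output layer: batch normalization + Sigmoid
        self.out = nn.Sequential(nn.Linear(4096, num_classes),
                                 nn.BatchNorm1d(num_classes), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(self.hidden(x))  # model output information l_k in [0, 1]
```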
  • In one embodiment, step S130 includes sub-steps S131 and S132: the sample feature sequences are sequentially input into the input layer of the initial classification model, association computation is performed through the association formulas between neurons, and the model output information is obtained from the output layer.
  • S140: Iteratively adjust the parameter values in the initial classification model according to preset parameter adjustment rules and the model output information of the sample feature sequences contained in each of the training data sets, so as to obtain a classification model trained from the initial classification model.
  • The parameter adjustment rules include a weighted loss calculation formula and a gradient calculation formula.
  • In one embodiment, step S140 includes sub-steps S141, S142, S143, S144 and S145.
  • According to the weighted loss calculation formula and the label statistics, a weighted calculation is performed on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels, yielding the weighted loss value corresponding to each sample feature sequence.
  • The weighted loss calculation formula is the formula used to compute the weighted loss value corresponding to a sample feature sequence.
  • The label statistics are introduced and a special weighting scheme is used to obtain the weighted loss value.
  • The weighted loss value better mitigates the extremely unbalanced distribution of industry sample labels and the noise in the sample labels, so this way of obtaining loss values greatly improves the efficiency and quality of training the initial classification model.
  • In formula (1), k denotes the k-th industry category, K is the total number of industry categories, N is the total number of corpus samples in the corpus database, and σ is the Sigmoid activation function of the output layer.
  • The update value of each parameter in the initial classification model can be computed according to the gradient calculation formula to update that parameter's original value. Specifically, the calculated value produced when a parameter of the initial classification model processes a sample feature sequence is input into the gradient calculation formula and combined with the weighted loss value above to compute the update value corresponding to that parameter; this computation is also called gradient descent.
  • The gradient calculation formula can be expressed as formula (2), which, given the variable definitions in the description, is the standard gradient-descent update ω_t′ = ω_t − η · ∂L/∂ω_t.
  • One sample feature sequence yields one update of all the parameter values in the initial classification model, and the multiple sample feature sequences contained in a training data set yield multiple iterative updates of all the parameter values.
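  • A minimal sketch of one such per-sample gradient-descent update applied to the model above; the use of PyTorch autograd and a plain SGD step is an assumption, since the patent only specifies the update rule ω_t′ = ω_t − η · ∂L/∂ω_t:

```python
import torch

def sgd_step(model: torch.nn.Module, loss: torch.Tensor, lr: float = 0.01) -> None:
    """Apply w <- w - lr * dL/dw to every parameter of the model for one sample."""
    model.zero_grad()
    loss.backward()                      # partial derivatives of the weighted loss
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad         # gradient-descent update of each parameter
```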
  • It is then judged whether there is a training data set that has not been used for training; if so, the process returns to step S141, and if not, the classification model obtained from the current training is taken as the trained classification model.
  • In one embodiment, the parameter adjustment rules further include a typical sample set; as shown in Fig. 6, step S140 includes sub-steps S1401, S1402, S1403, S1404, S1405 and S1406.
  • The typical sample set contains multiple typical samples; a typical sample is a sample corresponding to a strongly representative enterprise.
  • Each sample feature sequence in a training data set can be judged against the typical sample set and classified accordingly, to determine whether it is a typical sample; judging and counting the sample feature sequences contained in each training data set in this way yields the number of typical samples and the number of atypical samples in each training data set.
  • For example, after classifying and counting one training data set, the number of typical samples is 200 and the number of atypical samples is 300.
  • According to the weighted loss calculation formula, the label statistics, the number of typical samples and the number of atypical samples, a weighted calculation is performed on the model output information of the sample feature sequences in the training data set and the corresponding target classification labels, yielding the high-confidence weighted loss value corresponding to each sample feature sequence.
  • The label statistics, the number of typical samples and the number of atypical samples are introduced, and a special weighting scheme is used to obtain high-confidence weighted loss values. These not only better mitigate the extremely unbalanced distribution of industry sample labels and the noise in the sample labels, but also greatly improve the confidence of the computed loss values; this acquisition process further improves the efficiency and quality of training the initial classification model.
  • In formula (3), k denotes the k-th industry category, K is the total number of industry categories, N is the total number of corpus samples in the corpus database, and σ is the Sigmoid activation function of the output layer.
  • According to the gradient calculation formula, the high-confidence weighted loss value of each sample feature sequence, and the calculated value that each parameter in the initial classification model produces for the sample feature sequence, the update value of each parameter is obtained to iteratively update the initial classification model.
  • The update value of each parameter in the initial classification model can be computed according to the gradient calculation formula to update that parameter's original value.
  • The specific process of updating the parameter values in the initial classification model has been described in detail in the steps above.
  • One sample feature sequence yields one update of all the parameter values in the initial classification model, and the multiple sample feature sequences contained in a training data set yield multiple iterative updates of all the parameter values.
  • It is then judged whether there is still a training data set that has not been used for training; if so, the process returns to step S1401, and if not, the classification model obtained from the current training is taken as the trained classification model.
  • In one embodiment, steps S150, S160, S170 and S180 are further included after step S140.
  • The classification request information is input request information for which industry classification is required.
  • The classification request information includes an enterprise name, and the enterprise description information corresponding to the enterprise name can be obtained according to information collection rules.
  • The information collection rules include multiple collection addresses, and the description information related to the enterprise name can be collected from these addresses to obtain the enterprise description information.
  • A collection address may be the website of a government department for industry and commerce, the website of a third-party enterprise inquiry institution, and the like.
  • The obtained enterprise description information is a piece of text.
  • The enterprise description information can be converted through the text processing rules; specifically, the process of converting the enterprise description information is the same as the process of converting the text information contained in a corpus sample, and is not repeated here.
  • The obtained enterprise description feature information quantitatively represents the specific characteristics of the enterprise description information.
  • The corresponding feature output information can be obtained from the output layer of the trained classification model.
  • The specific process of obtaining the feature output information is the same as the process of obtaining the model output information described above, and is not repeated here.
  • The enterprise classification label information corresponding to the classification request information can be obtained from the feature output information based on label acquisition rules.
  • The enterprise classification label information includes at least one enterprise label.
  • The label acquisition rules may include a probability threshold and an acquisition count.
  • Based on the probability threshold, the enterprise labels of the neurons whose probability values are not less than the threshold can be taken from the feature output information as the corresponding enterprise classification label information; alternatively, based on the acquisition count, the enterprise labels of that many highest-probability neurons can be taken from the feature output information as the corresponding enterprise classification label information.
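  • A minimal sketch of both label acquisition rules; the threshold of 0.5 and the top-3 count are illustrative values, not from the patent:

```python
import numpy as np

def labels_by_threshold(probs: np.ndarray, names: list, thresh: float = 0.5) -> list:
    """Keep every industry label whose output probability is not less than the threshold."""
    return [names[k] for k in np.flatnonzero(probs >= thresh)]

def labels_by_count(probs: np.ndarray, names: list, count: int = 3) -> list:
    """Keep the `count` industry labels with the highest output probabilities."""
    top = np.argsort(probs)[::-1][:count]
    return [names[k] for k in top]
```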
  • In the method provided by the embodiments of the present application, the corpus samples contained in the corpus database are converted according to the text processing rules to obtain sample feature sequences, and multiple training data sets are generated according to the sample feature sequences.
  • The sample feature sequences in the training data sets are sequentially input into the initial classification model for computation to obtain the corresponding model output information, and the parameter values in the initial classification model are iteratively adjusted according to the parameter adjustment rules and the model output information of the sample feature sequences contained in each training data set, to obtain the trained classification model.
  • In this way, the parameter values in the initial classification model can be iteratively adjusted based on parameter adjustment rules that include a weighted loss calculation formula, thereby avoiding the noise interference that factors such as unbalanced corpus sample distribution and inconsistent manual judgment standards introduce into the training of the initial classification model, and greatly improving the performance of the constructed classification model.
  • An embodiment of the present application also provides an apparatus for intelligently constructing an enterprise classification model. The apparatus can be configured in a user terminal or a management server, and is used to execute any embodiment of the aforementioned method for intelligently constructing an enterprise classification model.
  • Fig. 8 is a schematic block diagram of the apparatus for intelligently constructing an enterprise classification model provided by an embodiment of the present application.
  • The apparatus 100 for intelligently constructing an enterprise classification model includes a sample feature sequence acquisition unit 110, a training data set generation unit 120, a model output information acquisition unit 130 and a model training unit 140.
  • The sample feature sequence acquisition unit 110 is configured to, if an input corpus database is received, convert the corpus samples contained in the corpus database according to preset text processing rules to obtain the corresponding sample feature sequences.
  • In a specific embodiment, the sample feature sequence acquisition unit 110 includes subunits: a valid text information acquisition unit, configured to filter the invalid characters out of the text information of each corpus sample according to the invalid character set to obtain the corresponding valid text information; and a feature word conversion unit, configured to perform feature word conversion on the valid text information according to the feature lexicon to obtain the sample feature sequence corresponding to each corpus sample.
  • The training data set generation unit 120 is configured to generate a plurality of training data sets corresponding to preset data set generation rules according to the sample feature sequences.
  • In a specific embodiment, the training data set generation unit 120 includes subunits: a label statistics acquisition unit, configured to count the target classification labels corresponding to each sample feature sequence to obtain the corresponding label statistics; and a training data set acquisition unit, configured to randomly draw a number of sample feature sequences equal to the sample count and combine them into a training data set.
  • The model output information acquisition unit 130 is configured to sequentially input the sample feature sequences in each of the training data sets into the initial classification model for computation, so as to obtain the model output information corresponding to each of the sample feature sequences.
  • In a specific embodiment, the model output information acquisition unit 130 includes subunits: a sample feature sequence input unit, configured to sequentially input the sample feature sequences into the input layer of the initial classification model; and an association computation unit, configured to perform association computation on the sample feature sequences through the association formulas between neurons in the initial classification model, and obtain the model output information corresponding to each sample feature sequence from the output layer of the initial classification model.
  • The model training unit 140 is configured to iteratively adjust the parameter values in the initial classification model according to preset parameter adjustment rules and the model output information of the sample feature sequences contained in each of the training data sets, so as to obtain a classification model trained from the initial classification model.
  • In a specific embodiment, the model training unit 140 includes subunits: a weighted loss value calculation unit, configured to perform a weighted calculation on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels according to the weighted loss calculation formula and the label statistics, to obtain the weighted loss value corresponding to each sample feature sequence; a first iterative update unit, configured to obtain the update value of each parameter according to the gradient calculation formula, the weighted loss value of each sample feature sequence and the calculated value that each parameter in the initial classification model produces for the sample feature sequence, so as to iteratively update the initial classification model; a first judging unit, configured to judge whether there is a training data set that has not been used for training; a first return execution unit, configured to, if such a training data set exists, return to the step of performing the weighted calculation on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels according to the weighted loss calculation formula and the label statistics, to obtain the weighted loss value corresponding to each sample feature sequence; and a first classification model determination unit, configured to, if no such training data set exists, take the current initial classification model as the trained classification model.
  • In a specific embodiment, the model training unit 140 includes subunits: a classification and statistics unit, configured to classify and count the sample feature sequences in one training data set according to the typical sample set, to obtain the number of typical samples and the number of atypical samples corresponding to the training data set; a high-confidence weighted loss value calculation unit, configured to perform a weighted calculation on the model output information of the sample feature sequences in the training data set and the corresponding target classification labels according to the weighted loss calculation formula, the label statistics, the number of typical samples and the number of atypical samples, to obtain the high-confidence weighted loss value corresponding to each sample feature sequence; a second iterative update unit, configured to obtain the update value of each parameter according to the gradient calculation formula, the high-confidence weighted loss value of each sample feature sequence and the calculated value that each parameter in the initial classification model produces for the sample feature sequence, so as to iteratively update the initial classification model; a second judging unit, configured to judge whether there is a training data set that has not been used for training; a second return execution unit, configured to, if such a training data set exists, return to the step of classifying and counting the sample feature sequences in one training data set according to the typical sample set; and a second classification model determination unit, configured to, if no such training data set exists, take the current initial classification model as the trained classification model.
  • In a specific embodiment, the apparatus 100 for intelligently constructing an enterprise classification model further includes subunits: an enterprise description information acquisition unit, configured to, if input classification request information is received, obtain the enterprise description information corresponding to the enterprise name contained in the classification request information according to preset information collection rules; an enterprise description feature information acquisition unit, configured to convert the enterprise description information according to the text processing rules to obtain the corresponding enterprise description feature information; a feature output information acquisition unit, configured to input the enterprise description feature information into the trained classification model for computation to obtain the corresponding feature output information; and an enterprise classification label information acquisition unit, configured to obtain the enterprise classification label information corresponding to the classification request information from the feature output information according to preset label acquisition rules.
  • The apparatus for intelligently constructing an enterprise classification model provided in the embodiments of the present application applies the above method: the corpus samples contained in the corpus database are converted according to the text processing rules to obtain sample feature sequences, multiple training data sets are generated according to the sample feature sequences, the sample feature sequences in the training data sets are sequentially input into the initial classification model for computation to obtain the corresponding model output information, and the parameter values in the initial classification model are iteratively adjusted according to the parameter adjustment rules and the model output information of the sample feature sequences contained in each training data set, to obtain the trained classification model.
  • In this way, the parameter values in the initial classification model can be iteratively adjusted based on parameter adjustment rules that include a weighted loss calculation formula, thereby avoiding the noise interference that factors such as unbalanced corpus sample distribution and inconsistent manual judgment standards introduce into the training of the initial classification model, and greatly improving the performance of the constructed classification model.
  • The above apparatus for intelligently constructing an enterprise classification model can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in Fig. 9.
  • Fig. 9 is a schematic block diagram of the computer device provided by an embodiment of the present application.
  • The computer device may be a user terminal or a management server used to execute the method for intelligently constructing an enterprise classification model.
  • The computer device 500 includes a processor 502, a memory and a network interface 505 connected through a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
  • The storage medium 503 can store an operating system 5031 and a computer program 5032.
  • When the computer program 5032 is executed, it can cause the processor 502 to perform the method for intelligently constructing an enterprise classification model; the storage medium 503 may be a volatile or a non-volatile storage medium.
  • The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
  • The internal memory 504 provides an environment for running the computer program 5032 stored in the storage medium 503; when the computer program 5032 is executed by the processor 502, it can cause the processor 502 to perform the method for intelligently constructing an enterprise classification model.
  • The network interface 505 is used for network communication, such as providing transmission of data information.
  • Those skilled in the art will understand that Fig. 9 is only a block diagram of part of the structure related to the solution of this application and does not constitute a limitation on the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
  • The processor 502 is configured to run the computer program 5032 stored in the memory, so as to implement the corresponding functions in the above method for intelligently constructing an enterprise classification model.
  • The embodiment of the computer device shown in Fig. 9 does not constitute a limitation on the specific composition of the computer device; in other embodiments, the computer device may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
  • For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structures and functions of the memory and the processor are consistent with the embodiment shown in Fig. 9 and are not repeated here.
  • The processor 502 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on.
  • The general-purpose processor may be a microprocessor, or any conventional processor.
  • In another embodiment of the present application, a computer-readable storage medium is provided; it may be a volatile or a non-volatile computer-readable storage medium.
  • The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps included in the above method for intelligently constructing an enterprise classification model.
  • In the several embodiments provided in this application, the disclosed devices, apparatuses and methods may be implemented in other ways.
  • The apparatus embodiments described above are only illustrative.
  • The division into units is only a division by logical function; in actual implementation there may be other ways of dividing them, units with the same function may be combined into one unit, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • The mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
  • Each functional unit in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The readable storage medium includes several instructions that enable a computer device (which may be a personal computer, a server or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned computer-readable storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, apparatus, device and medium for intelligently constructing an enterprise classification model. The method includes: converting the corpus samples contained in a corpus database according to text processing rules to obtain sample feature sequences; generating multiple training data sets according to the sample feature sequences; inputting the sample feature sequences into an initial classification model for computation to obtain corresponding model output information; and iteratively adjusting the parameter values in the initial classification model according to parameter adjustment rules and the model output information of the sample feature sequences contained in each training data set, to obtain a trained classification model.

Description

Intelligent construction method, apparatus, device and medium for an enterprise classification model
This application claims priority to the Chinese patent application filed with the China Patent Office on September 15, 2021, with application number 202111077364.6 and invention title "企业分类模型智能构建方法、装置、设备及介质" (Intelligent construction method, apparatus, device and medium for an enterprise classification model), the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a method, apparatus, device and medium for intelligently constructing an enterprise classification model.
Background
The industry classification of an enterprise is an important part of business analysis and investment decision-making. By understanding the specific industry in which potential customers operate, enterprises can segment target markets and carry out targeted marketing; by predicting the future development trends of industries, researchers can provide a sound basis for the decisions of investment institutions. However, because industry classification standards vary and require expertise in many fields, manual industry classification is inefficient and its results are easily affected by subjective factors. The inventors found that enterprise classification models based on traditional machine learning suffer from noise interference caused by factors such as the unbalanced distribution of corpus samples and inconsistent manual judgment standards, so the classification model cannot be trained accurately and efficiently; because of the low training quality, the resulting enterprise classification model performs poorly and is difficult to apply in practice. Therefore, prior-art methods cannot construct a high-performance enterprise classification model.
Summary
The embodiments of the present application provide a method, apparatus, device and medium for intelligently constructing an enterprise classification model, aiming to solve the problem that a high-performance enterprise classification model cannot be constructed with prior-art methods.
In a first aspect, an embodiment of the present application provides a method for intelligently constructing an enterprise classification model, which includes:
if an input corpus database is received, converting the corpus samples contained in the corpus database according to preset text processing rules to obtain corresponding sample feature sequences;
generating a plurality of training data sets corresponding to preset data set generation rules according to the sample feature sequences;
sequentially inputting the sample feature sequences in each of the training data sets into an initial classification model for computation, so as to obtain the model output information corresponding to each of the sample feature sequences;
iteratively adjusting the parameter values in the initial classification model according to preset parameter adjustment rules and the model output information of the sample feature sequences contained in each of the training data sets, so as to obtain a classification model trained from the initial classification model.
In a second aspect, an embodiment of the present application provides an apparatus for intelligently constructing an enterprise classification model, which includes:
a sample feature sequence acquisition unit, configured to, if an input corpus database is received, convert the corpus samples contained in the corpus database according to preset text processing rules to obtain corresponding sample feature sequences;
a training data set generation unit, configured to generate a plurality of training data sets corresponding to preset data set generation rules according to the sample feature sequences;
a model output information acquisition unit, configured to sequentially input the sample feature sequences in each of the training data sets into an initial classification model for computation, so as to obtain the model output information corresponding to each of the sample feature sequences;
a model training unit, configured to iteratively adjust the parameter values in the initial classification model according to preset parameter adjustment rules and the model output information of the sample feature sequences contained in each of the training data sets, so as to obtain a classification model trained from the initial classification model.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the method for intelligently constructing an enterprise classification model described in the first aspect.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the method for intelligently constructing an enterprise classification model described in the first aspect.
The embodiments of the present application provide a method, an apparatus, a computer device and a readable storage medium for intelligently constructing an enterprise classification model. The corpus samples contained in the corpus database are converted according to the text processing rules to obtain sample feature sequences; multiple training data sets are generated according to the sample feature sequences; the sample feature sequences in the training data sets are sequentially input into the initial classification model for computation to obtain the corresponding model output information; and the parameter values in the initial classification model are iteratively adjusted according to the parameter adjustment rules and the model output information of the sample feature sequences contained in each training data set, to obtain the trained classification model. With this method, the parameter values in the initial classification model can be iteratively adjusted based on parameter adjustment rules that include a weighted loss calculation formula, thereby avoiding the noise interference that factors such as unbalanced corpus sample distribution and inconsistent manual judgment standards introduce into the training of the initial classification model, and greatly improving the performance of the constructed classification model.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
Fig. 2 is a schematic sub-flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
Fig. 3 is another schematic sub-flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
Fig. 4 is another schematic sub-flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
Fig. 5 is another schematic sub-flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
Fig. 6 is another schematic sub-flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
Fig. 7 is another schematic flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
Fig. 8 is a schematic block diagram of the apparatus for intelligently constructing an enterprise classification model provided by an embodiment of the present application;
Fig. 9 is a schematic block diagram of the computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or combinations thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should further be understood that the term "and/or" used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Please refer to Fig. 1, which is a schematic flowchart of the method for intelligently constructing an enterprise classification model provided by an embodiment of the present application. The method is applied in a user terminal or a management server and is executed by application software installed in the user terminal or the management server. The management server is a server that can execute the method to intelligently construct the enterprise classification model, for example a server built inside an enterprise or a government department; the user terminal is a terminal device that can execute the method to intelligently construct the enterprise classification model, such as a desktop computer, a notebook computer, a tablet computer or a mobile phone. As shown in Fig. 1, the method includes steps S110 to S140.
S110: If an input corpus database is received, convert the corpus samples contained in the corpus database according to preset text processing rules to obtain corresponding sample feature sequences.
The input corpus database can be received; it contains multiple corpus samples, each of which corresponds to one enterprise. A corpus sample may be a piece of text containing information such as a description of the company profile, an overall introduction to the company, or the company's annual report. The corpus samples can be converted according to the text processing rules to obtain sample feature sequences; a sample feature sequence is a coded sequence that represents the character features contained in a corpus sample by means of digital encoding. The text processing rules include an invalid character set and a feature lexicon.
In one embodiment, as shown in Fig. 2, step S110 includes sub-steps S111 and S112.
S111: Filter the invalid characters out of the text information contained in each corpus sample according to the invalid character set to obtain corresponding valid text information.
Specifically, the invalid characters contained in the text information of a corpus sample are filtered out first. The invalid character set contains the invalid characters that need to be removed, such as symbols and spaces; filtering these out of the text information of the corpus sample yields valid text information containing only valid characters.
S112: Perform feature word conversion on the valid text information according to the feature lexicon to obtain the sample feature sequence corresponding to each corpus sample.
Feature word conversion is performed on the valid text information according to the feature lexicon. The feature words contained in the feature lexicon may consist of one or more characters, and each feature word corresponds to a one-hot encoding. The characters contained in the valid text information are matched against the feature words; if a character or a combination of consecutive characters in the valid text information matches a feature word, the one-hot encoding corresponding to that feature word is retrieved for the conversion. Sorting the one-hot encodings obtained from this conversion according to the positions of the characters in the valid text information yields the corresponding sample feature sequence. The feature lexicon may be a jieba ("stammer") word segmentation lexicon containing 166,000 feature words, which correspondingly contains 166,000 one-hot encodings.
S120: Generate a plurality of training data sets corresponding to preset data set generation rules according to the sample feature sequences.
A plurality of training data sets can be generated based on the sample feature sequences; the generated training data sets must satisfy the data set generation rules, which include a sample count.
In one embodiment, as shown in Fig. 3, step S120 includes sub-steps S121 and S122.
S121: Count the target classification labels corresponding to each sample feature sequence to obtain corresponding label statistics.
A corpus sample contains text information and the target classification labels corresponding to that text information. The target classification labels are the true classification labels of the industry categories to which each sample's text information belongs; each industry label among the target classification labels corresponds to one specific industry category, and the target classification labels may contain one or more industry labels. For example, the target classification labels corresponding to an enterprise might be "finance, real estate". The target classification labels corresponding to each sample feature sequence can be counted to obtain the number of corpus samples corresponding to each industry label in the corpus database; collecting the sample count of every industry label yields the label statistics. For example, the resulting label statistics can be denoted N, a 1×K one-dimensional array, where K is the total number of industry categories and N_k is the number of corpus samples carrying the industry label of the k-th industry category.
S122: Randomly draw a number of sample feature sequences equal to the sample count and combine them into a training data set.
Randomly drawing and combining a number of sample feature sequences equal to the sample count yields a training data set, so each training data set contains that many sample feature sequences. For example, if the corpus database contains 10,000 corpus samples and the sample count is 500, then 500 of the sample feature sequences corresponding to the 10,000 corpus samples can be randomly drawn and combined into one training data set, and the random drawing is performed again on the sample feature sequences of the remaining 9,500 corpus samples.
S130: Sequentially input the sample feature sequences in each training data set into the initial classification model for computation, so as to obtain the model output information corresponding to each sample feature sequence.
The initial classification model is a neural network model built on a multilayer perceptron structure. It may be a four-layer fully connected neural network in which the first three layers use batch normalization and dropout (with the retention probability set to 50%), so that the model converges quickly and avoids overfitting; the first three layers use the hyperbolic tangent as the activation function, and the last layer is the multi-label classification output layer, which uses batch normalization and a Sigmoid activation function. The input layer contains 166,000 neurons; the first fully connected layer has 640 neurons; the second and third layers have 4,096 neurons each; and the number of neurons in the last layer (that is, the output layer) equals the total number of industry categories. The neurons in adjacent layers are connected by linear functions, whose parameter values in the initial classification model all start from the same default value. Sequentially inputting the sample feature sequences of a training data set into the initial classification model for computation yields the corresponding model output information from the last layer of the initial classification model.
In one embodiment, as shown in Fig. 4, step S130 includes sub-steps S131 and S132.
S131: Sequentially input the sample feature sequences into the input layer of the initial classification model. S132: Perform association computation on the sample feature sequences through the association formulas between neurons in the initial classification model, and obtain the model output information corresponding to each sample feature sequence from the output layer of the initial classification model.
S140: Iteratively adjust the parameter values in the initial classification model according to preset parameter adjustment rules and the model output information of the sample feature sequences contained in each training data set, so as to obtain a classification model trained from the initial classification model.
The parameter values in the initial classification model can be adjusted with one training data set after another, so multiple training data sets allow the parameter values to be adjusted iteratively; the process of adjusting the parameter values in the initial classification model is the specific process of training it. After the initial classification model has been iteratively trained with multiple training data sets, the trained classification model is obtained. The parameter adjustment rules include a weighted loss calculation formula and a gradient calculation formula.
In one embodiment, as shown in Fig. 5, step S140 includes sub-steps S141, S142, S143, S144 and S145.
S141: According to the weighted loss calculation formula and the label statistics, perform a weighted calculation on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels, to obtain the weighted loss value corresponding to each sample feature sequence.
The weighted loss calculation formula is the formula used to compute the weighted loss value corresponding to a sample feature sequence. In this embodiment, the label statistics are introduced and a special weighting scheme is used to obtain the weighted loss value; the weighted loss value better mitigates the extremely unbalanced distribution of industry sample labels and the noise in the sample labels, so this way of obtaining loss values greatly improves the efficiency and quality of training the initial classification model.
Specifically, the weighted loss calculation formula can be expressed as formula (1):
[Formula (1) is reproduced as an image (PCTCN2021120254-appb-000001) in the original publication.]
where k denotes the k-th industry category, k ∈ [1, K]; K is the total number of industry categories; N is the total number of corpus samples in the corpus database; σ is the Sigmoid activation function of the output layer; y_k is the target classification label (y_k takes the value 0 or 1: if the target classification labels of the sample feature sequence include the industry label corresponding to the k-th industry category, then y_k = 1; otherwise y_k = 0); and l_k is the model output information (with l_k taking values in [0, 1]).
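For orientation, the following Python sketch shows a per-sample weighted multi-label loss in the spirit of formula (1). Because formula (1) itself is published only as an image, the inverse-frequency weighting N/(K·N_k) used here is an assumed illustration rather than the patent's exact formula, and the function name weighted_loss is likewise illustrative:

```python
import numpy as np

def weighted_loss(l: np.ndarray, y: np.ndarray, N_stats: np.ndarray, N_total: int) -> float:
    """Per-sample weighted multi-label loss.

    l:       model output information, shape (K,), values in [0, 1]
    y:       target classification labels, shape (K,), values in {0, 1}
    N_stats: label statistics N, shape (K,), per-category sample counts
    N_total: total number of corpus samples N

    The inverse-frequency weight N_total / (K * N_k) is an assumed scheme;
    the patent's exact formula (1) is published only as an image.
    """
    eps = 1e-12
    w = N_total / (len(l) * (N_stats + eps))  # down-weight over-represented industries
    bce = -(y * np.log(l + eps) + (1 - y) * np.log(1 - l + eps))
    return float(np.mean(w * bce))
```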
S142: Obtain the update value of each parameter according to the gradient calculation formula, the weighted loss value of each sample feature sequence and the calculated value that each parameter in the initial classification model produces for the sample feature sequence, so as to iteratively update the initial classification model.
The update value of each parameter in the initial classification model can be computed according to the gradient calculation formula to update that parameter's original value. Specifically, the calculated value produced when a parameter of the initial classification model processes a sample feature sequence is input into the gradient calculation formula and combined with the weighted loss value above to compute the update value corresponding to that parameter; this computation is also called gradient descent.
Specifically, the gradient calculation formula can be expressed as formula (2), which, given the variable definitions below, is the standard gradient-descent update:
ω_t′ = ω_t − η · ∂L/∂ω_t    (2)
where ω_t′ is the computed update value of parameter t, ω_t is the original value of parameter t, η is the learning rate preset in the gradient calculation formula, and ∂L/∂ω_t is the partial derivative of the loss value L with respect to parameter t (this computation uses the calculated value corresponding to parameter t).
One sample feature sequence yields one update of all the parameter values in the initial classification model, and the multiple sample feature sequences contained in a training data set yield multiple iterative updates of all the parameter values.
S143: Judge whether there is a training data set that has not been used for training. S144: If there is a training data set that has not been used for training, return to the step of performing the weighted calculation on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels according to the weighted loss calculation formula and the label statistics, to obtain the weighted loss value corresponding to each sample feature sequence. S145: If there is no training data set that has not been used for training, take the current initial classification model as the trained classification model.
That is, it is judged whether there is still a training data set that has not been used for training; if so, the process returns to step S141, and if not, the classification model obtained from the current training is taken as the trained classification model.
In one embodiment, the parameter adjustment rules further include a typical sample set; as shown in Fig. 6, step S140 includes sub-steps S1401, S1402, S1403, S1404, S1405 and S1406.
S1401: Classify and count the sample feature sequences in one training data set according to the typical sample set, to obtain the number of typical samples and the number of atypical samples corresponding to the training data set.
The typical sample set contains multiple typical samples; a typical sample is a sample corresponding to a strongly representative enterprise. Each sample feature sequence in a training data set can be judged against the typical sample set and classified accordingly, to determine whether it is a typical sample; judging and counting the sample feature sequences contained in each training data set in this way yields the number of typical samples and the number of atypical samples in each training data set.
For example, after classifying and counting one training data set, the number of typical samples is 200 and the number of atypical samples is 300.
S1402: According to the weighted loss calculation formula, the label statistics, the number of typical samples and the number of atypical samples, perform a weighted calculation on the model output information of the sample feature sequences in the training data set and the corresponding target classification labels, to obtain the high-confidence weighted loss value corresponding to each sample feature sequence.
In this embodiment, the label statistics, the number of typical samples and the number of atypical samples are introduced, and a special weighting scheme is used to obtain high-confidence weighted loss values. These not only better mitigate the extremely unbalanced distribution of industry sample labels and the noise in the sample labels, but also greatly improve the confidence of the computed loss values; this acquisition process further improves the efficiency and quality of training the initial classification model.
Specifically, the high-confidence weighted loss calculation formula can be expressed as formula (3):
[Formula (3) is reproduced as an image (PCTCN2021120254-appb-000005) in the original publication.]
where k denotes the k-th industry category, k ∈ [1, K]; K is the total number of industry categories; N is the total number of corpus samples in the corpus database; σ is the Sigmoid activation function of the output layer; y_k is the target classification label (y_k takes the value 0 or 1: if the target classification labels of the sample feature sequence include the industry label corresponding to the k-th industry category, then y_k = 1; otherwise y_k = 0); l_k is the model output information (with l_k taking values in [0, 1]); V_k is the number of typical samples in the current training data set; and U_k is the number of atypical samples in the current training data set.
Specifically, before the high-confidence weighted loss value of a sample feature sequence is computed, it is first judged whether the sample feature sequence is a typical sample. If it is, formula (4) is used to compute the value of φ_k(N_k, V_k, U_k):
[Formula (4) is reproduced as an image (PCTCN2021120254-appb-000006) in the original publication.]
If the sample feature sequence is not a typical sample, formula (5) is used to compute the value of φ_k(N_k, V_k, U_k):
[Formula (5) is reproduced as an image (PCTCN2021120254-appb-000007) in the original publication.]
The latter part of the high-confidence weighted loss calculation formula is computed in the same way as the weighted loss value described above, and is not repeated here.
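The dispatch between formulas (4) and (5) can be sketched as follows. The bodies of phi_typical and phi_atypical are hypothetical placeholders, because the real formulas (4) and (5) are published only as images; only the control flow (typical samples use formula (4), atypical samples use formula (5)) comes from the text:

```python
def phi_typical(N_k: int, V_k: int, U_k: int) -> float:
    # Hypothetical placeholder body; the real formula (4) is an image in the filing.
    return V_k / max(V_k + U_k, 1)

def phi_atypical(N_k: int, V_k: int, U_k: int) -> float:
    # Hypothetical placeholder body; the real formula (5) is an image in the filing.
    return U_k / max(V_k + U_k, 1)

def confidence_weight(sample_id: str, typical_set: set, N_k: int, V_k: int, U_k: int) -> float:
    """Typical samples use formula (4); all other samples use formula (5)."""
    if sample_id in typical_set:
        return phi_typical(N_k, V_k, U_k)
    return phi_atypical(N_k, V_k, U_k)
```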
S1403: Obtain the update value of each parameter according to the gradient calculation formula, the high-confidence weighted loss value of each sample feature sequence and the calculated value that each parameter in the initial classification model produces for the sample feature sequence, so as to iteratively update the initial classification model.
The update value of each parameter in the initial classification model can be computed according to the gradient calculation formula to update that parameter's original value; the specific process of updating the parameter values has been described in detail in the steps above. One sample feature sequence yields one update of all the parameter values in the initial classification model, and the multiple sample feature sequences contained in a training data set yield multiple iterative updates of all the parameter values.
S1404: Judge whether there is a training data set that has not been used for training. S1405: If there is a training data set that has not been used for training, return to the step of classifying and counting the sample feature sequences in one training data set according to the typical sample set, to obtain the number of typical samples and the number of atypical samples corresponding to the training data set. S1406: If there is no training data set that has not been used for training, take the current initial classification model as the trained classification model.
That is, it is judged whether there is still a training data set that has not been used for training; if so, the process returns to step S1401, and if not, the classification model obtained from the current training is taken as the trained classification model.
In one embodiment, as shown in Fig. 7, steps S150, S160, S170 and S180 are further included after step S140.
S150: If input classification request information is received, obtain the enterprise description information corresponding to the enterprise name contained in the classification request information according to preset information collection rules.
The classification request information is input request information for which industry classification is required. It contains an enterprise name, so the enterprise description information corresponding to the enterprise name can be obtained according to the information collection rules. The information collection rules contain multiple collection addresses, and the description information related to the enterprise name can be collected from these addresses to obtain the enterprise description information; a collection address may be the website of a government department for industry and commerce, the website of a third-party enterprise inquiry institution, and the like. The obtained enterprise description information is a piece of text.
S160: Convert the enterprise description information according to the text processing rules to obtain corresponding enterprise description feature information.
The enterprise description information can be converted through the text processing rules; specifically, the conversion process is the same as the process of converting the text information contained in a corpus sample, and is not repeated here. The obtained enterprise description feature information quantitatively represents the specific characteristics of the enterprise description information.
S170: Input the enterprise description feature information into the trained classification model for computation to obtain corresponding feature output information.
Inputting the enterprise description feature information into the trained classification model for computation yields the corresponding feature output information from the output layer of the trained classification model. The specific process of obtaining the feature output information is the same as the process of obtaining the model output information described above, and is not repeated here.
S180: Obtain the enterprise classification label information corresponding to the classification request information from the feature output information according to preset label acquisition rules.
The enterprise classification label information corresponding to the classification request information can be obtained from the feature output information based on the label acquisition rules; it contains at least one enterprise label. Specifically, the label acquisition rules may include a probability threshold and an acquisition count. For example, based on the probability threshold, the enterprise labels of the neurons whose probability values are not less than the threshold can be taken from the feature output information as the corresponding enterprise classification label information; alternatively, based on the acquisition count, the enterprise labels of that many highest-probability neurons can be taken from the feature output information as the corresponding enterprise classification label information. After the enterprise classification label information is obtained, it can be fed back according to the classification request information, so that the sender of the classification request obtains the corresponding enterprise classification label information.
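Putting steps S150 to S180 together, the following is a minimal end-to-end inference sketch. Here collect_description is a hypothetical stand-in for crawling the preset collection addresses, to_feature_sequence refers to the conversion sketched earlier, and aggregating the one-hot codes into a single multi-hot input vector is an assumption about the model's input format:

```python
import numpy as np

def collect_description(name: str) -> str:
    # Hypothetical stand-in for S150: crawling the preset collection addresses for `name`
    return "示例企业简介文本"

def classify_enterprise(name: str, model, vocab_size: int, labels: list,
                        thresh: float = 0.5) -> list:
    text = collect_description(name)                 # S150: enterprise description information
    seq = to_feature_sequence(text)                  # S160: same conversion as corpus samples
    x = np.zeros((1, vocab_size), dtype=np.float32)  # multi-hot input vector (assumed format)
    x[0, seq] = 1.0
    probs = model(x)[0]                              # S170: feature output information
    return [labels[k] for k, p in enumerate(probs) if p >= thresh]  # S180: threshold rule
```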
In the method for intelligently constructing an enterprise classification model provided by the embodiments of the present application, the corpus samples contained in the corpus database are converted according to the text processing rules to obtain sample feature sequences; multiple training data sets are generated according to the sample feature sequences; the sample feature sequences in the training data sets are sequentially input into the initial classification model for computation to obtain the corresponding model output information; and the parameter values in the initial classification model are iteratively adjusted according to the parameter adjustment rules and the model output information of the sample feature sequences contained in each training data set, to obtain the trained classification model. With this method, the parameter values in the initial classification model can be iteratively adjusted based on parameter adjustment rules that include a weighted loss calculation formula, thereby avoiding the noise interference that factors such as unbalanced corpus sample distribution and inconsistent manual judgment standards introduce into the training of the initial classification model, and greatly improving the performance of the constructed classification model.
An embodiment of the present application also provides an apparatus for intelligently constructing an enterprise classification model. The apparatus can be configured in a user terminal or a management server, and is used to execute any embodiment of the aforementioned method. Specifically, please refer to Fig. 8, which is a schematic block diagram of the apparatus for intelligently constructing an enterprise classification model provided by an embodiment of the present application.
As shown in Fig. 8, the apparatus 100 for intelligently constructing an enterprise classification model includes a sample feature sequence acquisition unit 110, a training data set generation unit 120, a model output information acquisition unit 130 and a model training unit 140.
The sample feature sequence acquisition unit 110 is configured to, if an input corpus database is received, convert the corpus samples contained in the corpus database according to preset text processing rules to obtain corresponding sample feature sequences.
In a specific embodiment, the sample feature sequence acquisition unit 110 includes subunits: a valid text information acquisition unit, configured to filter the invalid characters out of the text information contained in each corpus sample according to the invalid character set to obtain corresponding valid text information; and a feature word conversion unit, configured to perform feature word conversion on the valid text information according to the feature lexicon to obtain the sample feature sequence corresponding to each corpus sample.
The training data set generation unit 120 is configured to generate a plurality of training data sets corresponding to preset data set generation rules according to the sample feature sequences.
In a specific embodiment, the training data set generation unit 120 includes subunits: a label statistics acquisition unit, configured to count the target classification labels corresponding to each sample feature sequence to obtain corresponding label statistics; and a training data set acquisition unit, configured to randomly draw a number of sample feature sequences equal to the sample count and combine them into a training data set.
The model output information acquisition unit 130 is configured to sequentially input the sample feature sequences in each training data set into the initial classification model for computation, so as to obtain the model output information corresponding to each sample feature sequence.
In a specific embodiment, the model output information acquisition unit 130 includes subunits: a sample feature sequence input unit, configured to sequentially input the sample feature sequences into the input layer of the initial classification model; and an association computation unit, configured to perform association computation on the sample feature sequences through the association formulas between neurons in the initial classification model, and obtain the model output information corresponding to each sample feature sequence from the output layer of the initial classification model.
The model training unit 140 is configured to iteratively adjust the parameter values in the initial classification model according to preset parameter adjustment rules and the model output information of the sample feature sequences contained in each training data set, so as to obtain a classification model trained from the initial classification model.
In a specific embodiment, the model training unit 140 includes subunits: a weighted loss value calculation unit, configured to perform a weighted calculation on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels according to the weighted loss calculation formula and the label statistics, to obtain the weighted loss value corresponding to each sample feature sequence; a first iterative update unit, configured to obtain the update value of each parameter according to the gradient calculation formula, the weighted loss value of each sample feature sequence and the calculated value that each parameter in the initial classification model produces for the sample feature sequence, so as to iteratively update the initial classification model; a first judging unit, configured to judge whether there is a training data set that has not been used for training; a first return execution unit, configured to, if such a training data set exists, return to the step of performing the weighted calculation on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels according to the weighted loss calculation formula and the label statistics, to obtain the weighted loss value corresponding to each sample feature sequence; and a first classification model determination unit, configured to, if no such training data set exists, take the current initial classification model as the trained classification model.
In a specific embodiment, the model training unit 140 includes subunits: a classification and statistics unit, configured to classify and count the sample feature sequences in one training data set according to the typical sample set, to obtain the number of typical samples and the number of atypical samples corresponding to the training data set; a high-confidence weighted loss value calculation unit, configured to perform a weighted calculation on the model output information of the sample feature sequences in the training data set and the corresponding target classification labels according to the weighted loss calculation formula, the label statistics, the number of typical samples and the number of atypical samples, to obtain the high-confidence weighted loss value corresponding to each sample feature sequence; a second iterative update unit, configured to obtain the update value of each parameter according to the gradient calculation formula, the high-confidence weighted loss value of each sample feature sequence and the calculated value that each parameter in the initial classification model produces for the sample feature sequence, so as to iteratively update the initial classification model; a second judging unit, configured to judge whether there is a training data set that has not been used for training; a second return execution unit, configured to, if such a training data set exists, return to the step of classifying and counting the sample feature sequences in one training data set according to the typical sample set, to obtain the number of typical samples and the number of atypical samples corresponding to the training data set; and a second classification model determination unit, configured to, if no such training data set exists, take the current initial classification model as the trained classification model.
In a specific embodiment, the apparatus 100 for intelligently constructing an enterprise classification model further includes subunits: an enterprise description information acquisition unit, configured to, if input classification request information is received, obtain the enterprise description information corresponding to the enterprise name contained in the classification request information according to preset information collection rules; an enterprise description feature information acquisition unit, configured to convert the enterprise description information according to the text processing rules to obtain the corresponding enterprise description feature information; a feature output information acquisition unit, configured to input the enterprise description feature information into the trained classification model for computation to obtain the corresponding feature output information; and an enterprise classification label information acquisition unit, configured to obtain the enterprise classification label information corresponding to the classification request information from the feature output information according to preset label acquisition rules.
The apparatus for intelligently constructing an enterprise classification model provided in the embodiments of the present application applies the above method: the corpus samples contained in the corpus database are converted according to the text processing rules to obtain sample feature sequences; multiple training data sets are generated according to the sample feature sequences; the sample feature sequences in the training data sets are sequentially input into the initial classification model for computation to obtain the corresponding model output information; and the parameter values in the initial classification model are iteratively adjusted according to the parameter adjustment rules and the model output information of the sample feature sequences contained in each training data set, to obtain the trained classification model. With this method, the parameter values in the initial classification model can be iteratively adjusted based on parameter adjustment rules that include a weighted loss calculation formula, thereby avoiding the noise interference that factors such as unbalanced corpus sample distribution and inconsistent manual judgment standards introduce into the training of the initial classification model, and greatly improving the performance of the constructed classification model.
The above apparatus for intelligently constructing an enterprise classification model can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in Fig. 9.
Please refer to Fig. 9, which is a schematic block diagram of the computer device provided by an embodiment of the present application. The computer device may be a user terminal or a management server used to execute the method for intelligently constructing an enterprise classification model.
Referring to Fig. 9, the computer device 500 includes a processor 502, a memory and a network interface 505 connected through a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 can store an operating system 5031 and a computer program 5032. When the computer program 5032 is executed, it can cause the processor 502 to perform the method for intelligently constructing an enterprise classification model; the storage medium 503 may be a volatile or a non-volatile storage medium.
The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the storage medium 503; when the computer program 5032 is executed by the processor 502, it can cause the processor 502 to perform the method for intelligently constructing an enterprise classification model.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art will understand that the structure shown in Fig. 9 is only a block diagram of part of the structure related to the solution of this application and does not constitute a limitation on the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory, so as to implement the corresponding functions of the above intelligent enterprise classification model construction method.
Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 9 does not limit the specific construction of the computer device; in other embodiments, the computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components. For example, in some embodiments, the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are the same as those of the embodiment shown in FIG. 9 and are not repeated here.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Another embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps included in the above intelligent enterprise classification model construction method are implemented.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods for each particular application to implement the described functions, but such implementations should not be considered beyond the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed devices, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For instance, the division of the units is only a logical functional division; there may be other ways of division in actual implementation, units with the same function may be combined into one unit, multiple units or components may be combined or integrated into another system, and some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may also be electrical, mechanical, or other forms of connection.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the various embodiments of the present application. The aforementioned computer-readable storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
The above are only specific implementations of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and such modifications or replacements shall all fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (20)

  1. An intelligent enterprise classification model construction method, the method comprising:
    upon receiving an input corpus database, converting the corpus samples contained in the corpus database according to a preset text processing rule to obtain corresponding sample feature sequences;
    generating, from the sample feature sequences, a plurality of training data sets conforming to a preset data set generation rule;
    inputting the sample feature sequences in each training data set into an initial classification model in turn for computation, so as to obtain the model output information corresponding to each sample feature sequence; and
    iteratively adjusting the parameter values in the initial classification model according to a preset parameter adjustment rule and the model output information of the sample feature sequences contained in each training data set, so as to obtain a classification model resulting from training the initial classification model, wherein the parameter adjustment rule comprises a weighted loss value calculation formula and a gradient calculation formula.
  2. The intelligent enterprise classification model construction method according to claim 1, wherein the text processing rule comprises an invalid character set and a feature word library, and converting the corpus samples contained in the corpus database according to the preset text processing rule to obtain the corresponding sample feature sequences comprises:
    filtering out the invalid characters in the text information contained in each corpus sample according to the invalid character set to obtain corresponding valid text information; and
    performing feature word conversion on the valid text information according to the feature word library to obtain the sample feature sequence corresponding to each corpus sample.
  3. The intelligent enterprise classification model construction method according to claim 1, wherein the data set generation rule comprises a sample count, and generating, from the sample feature sequences, the plurality of training data sets conforming to the preset data set generation rule comprises:
    counting the target classification labels corresponding to each sample feature sequence to obtain corresponding label statistical information; and
    randomly selecting sample feature sequences equal in number to the sample count and combining them into a training data set.
  4. The intelligent enterprise classification model construction method according to claim 1, wherein inputting the sample feature sequences in each training data set into the initial classification model in turn for computation to obtain the model output information corresponding to each sample feature sequence comprises:
    inputting the sample feature sequences into the input layer of the initial classification model in turn; and
    performing association calculation on the sample feature sequences through the association formulas between the neurons in the initial classification model, and obtaining the model output information corresponding to the sample feature sequences from the output layer of the initial classification model.
  5. The intelligent enterprise classification model construction method according to claim 3, wherein iteratively adjusting the parameter values in the initial classification model according to the preset parameter adjustment rule and the model output information of the sample feature sequences contained in each training data set to obtain the classification model resulting from training the initial classification model comprises:
    performing weighted calculation on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels according to the weighted loss value calculation formula and the label statistical information, so as to obtain the weighted loss value corresponding to each sample feature sequence;
    obtaining the update value of each parameter according to the gradient calculation formula, the weighted loss value of each sample feature sequence, and the value computed from the sample feature sequence by each parameter in the initial classification model, so as to iteratively update the initial classification model;
    judging whether any training data set has not yet been used for training;
    if a training data set has not yet been used for training, returning to the step of performing weighted calculation on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels according to the weighted loss value calculation formula and the label statistical information to obtain the weighted loss value corresponding to each sample feature sequence; and
    if no training data set remains unused for training, taking the current initial classification model as the trained classification model.
  6. The intelligent enterprise classification model construction method according to claim 3, wherein the parameter adjustment rule further comprises a typical sample set, and iteratively adjusting the parameter values in the initial classification model according to the preset parameter adjustment rule and the model output information of the sample feature sequences contained in each training data set to obtain the classification model resulting from training the initial classification model comprises:
    performing classification statistics on the sample feature sequences in one training data set according to the typical sample set, so as to obtain the typical sample count and the atypical sample count corresponding to the training data set;
    performing weighted calculation on the model output information of the sample feature sequences in the training data set and the corresponding target classification labels according to the weighted loss value calculation formula, the label statistical information, the typical sample count, and the atypical sample count, so as to obtain the high-confidence weighted loss value corresponding to each sample feature sequence;
    obtaining the update value of each parameter according to the gradient calculation formula, the high-confidence weighted loss value of each sample feature sequence, and the value computed from the sample feature sequence by each parameter in the initial classification model, so as to iteratively update the initial classification model;
    judging whether any training data set has not yet been used for training;
    if a training data set has not yet been used for training, returning to the step of performing classification statistics on the sample feature sequences in one training data set according to the typical sample set to obtain the typical sample count and the atypical sample count corresponding to the training data set; and
    if no training data set remains unused for training, taking the current initial classification model as the trained classification model.
  7. The intelligent enterprise classification model construction method according to claim 1, wherein after iteratively adjusting the parameter values in the initial classification model according to the preset parameter adjustment rule and the model output information of the sample feature sequences contained in each training data set to obtain the classification model resulting from training the initial classification model, the method further comprises:
    upon receiving input classification request information, obtaining, according to a preset information collection rule, the enterprise description information corresponding to the enterprise name contained in the classification request information;
    converting the enterprise description information according to the text processing rule to obtain corresponding enterprise description feature information;
    inputting the enterprise description feature information into the trained classification model for computation to obtain corresponding feature output information; and
    obtaining, according to a preset label acquisition rule, the enterprise classification label information corresponding to the classification request information from the feature output information.
  8. An intelligent enterprise classification model construction apparatus, comprising:
    a sample feature sequence acquisition unit, configured to, upon receiving an input corpus database, convert the corpus samples contained in the corpus database according to a preset text processing rule to obtain corresponding sample feature sequences;
    a training data set generation unit, configured to generate, from the sample feature sequences, a plurality of training data sets conforming to a preset data set generation rule;
    a model output information acquisition unit, configured to input the sample feature sequences in each training data set into an initial classification model in turn for computation, so as to obtain the model output information corresponding to each sample feature sequence; and
    a model training unit, configured to iteratively adjust the parameter values in the initial classification model according to a preset parameter adjustment rule and the model output information of the sample feature sequences contained in each training data set, so as to obtain a classification model resulting from training the initial classification model.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    upon receiving an input corpus database, converting the corpus samples contained in the corpus database according to a preset text processing rule to obtain corresponding sample feature sequences;
    generating, from the sample feature sequences, a plurality of training data sets conforming to a preset data set generation rule;
    inputting the sample feature sequences in each training data set into an initial classification model in turn for computation, so as to obtain the model output information corresponding to each sample feature sequence; and
    iteratively adjusting the parameter values in the initial classification model according to a preset parameter adjustment rule and the model output information of the sample feature sequences contained in each training data set, so as to obtain a classification model resulting from training the initial classification model, wherein the parameter adjustment rule comprises a weighted loss value calculation formula and a gradient calculation formula.
  10. The computer device according to claim 9, wherein the text processing rule comprises an invalid character set and a feature word library, and converting the corpus samples contained in the corpus database according to the preset text processing rule to obtain the corresponding sample feature sequences comprises:
    filtering out the invalid characters in the text information contained in each corpus sample according to the invalid character set to obtain corresponding valid text information; and
    performing feature word conversion on the valid text information according to the feature word library to obtain the sample feature sequence corresponding to each corpus sample.
  11. The computer device according to claim 9, wherein the data set generation rule comprises a sample count, and generating, from the sample feature sequences, the plurality of training data sets conforming to the preset data set generation rule comprises:
    counting the target classification labels corresponding to each sample feature sequence to obtain corresponding label statistical information; and
    randomly selecting sample feature sequences equal in number to the sample count and combining them into a training data set.
  12. The computer device according to claim 9, wherein inputting the sample feature sequences in each training data set into the initial classification model in turn for computation to obtain the model output information corresponding to each sample feature sequence comprises:
    inputting the sample feature sequences into the input layer of the initial classification model in turn; and
    performing association calculation on the sample feature sequences through the association formulas between the neurons in the initial classification model, and obtaining the model output information corresponding to the sample feature sequences from the output layer of the initial classification model.
  13. The computer device according to claim 11, wherein iteratively adjusting the parameter values in the initial classification model according to the preset parameter adjustment rule and the model output information of the sample feature sequences contained in each training data set to obtain the classification model resulting from training the initial classification model comprises:
    performing weighted calculation on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels according to the weighted loss value calculation formula and the label statistical information, so as to obtain the weighted loss value corresponding to each sample feature sequence;
    obtaining the update value of each parameter according to the gradient calculation formula, the weighted loss value of each sample feature sequence, and the value computed from the sample feature sequence by each parameter in the initial classification model, so as to iteratively update the initial classification model;
    judging whether any training data set has not yet been used for training;
    if a training data set has not yet been used for training, returning to the step of performing weighted calculation on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels according to the weighted loss value calculation formula and the label statistical information to obtain the weighted loss value corresponding to each sample feature sequence; and
    if no training data set remains unused for training, taking the current initial classification model as the trained classification model.
  14. The computer device according to claim 11, wherein the parameter adjustment rule further comprises a typical sample set, and iteratively adjusting the parameter values in the initial classification model according to the preset parameter adjustment rule and the model output information of the sample feature sequences contained in each training data set to obtain the classification model resulting from training the initial classification model comprises:
    performing classification statistics on the sample feature sequences in one training data set according to the typical sample set, so as to obtain the typical sample count and the atypical sample count corresponding to the training data set;
    performing weighted calculation on the model output information of the sample feature sequences in the training data set and the corresponding target classification labels according to the weighted loss value calculation formula, the label statistical information, the typical sample count, and the atypical sample count, so as to obtain the high-confidence weighted loss value corresponding to each sample feature sequence;
    obtaining the update value of each parameter according to the gradient calculation formula, the high-confidence weighted loss value of each sample feature sequence, and the value computed from the sample feature sequence by each parameter in the initial classification model, so as to iteratively update the initial classification model;
    judging whether any training data set has not yet been used for training;
    if a training data set has not yet been used for training, returning to the step of performing classification statistics on the sample feature sequences in one training data set according to the typical sample set to obtain the typical sample count and the atypical sample count corresponding to the training data set; and
    if no training data set remains unused for training, taking the current initial classification model as the trained classification model.
  15. The computer device according to claim 9, wherein after iteratively adjusting the parameter values in the initial classification model according to the preset parameter adjustment rule and the model output information of the sample feature sequences contained in each training data set to obtain the classification model resulting from training the initial classification model, the steps further comprise:
    upon receiving input classification request information, obtaining, according to a preset information collection rule, the enterprise description information corresponding to the enterprise name contained in the classification request information;
    converting the enterprise description information according to the text processing rule to obtain corresponding enterprise description feature information;
    inputting the enterprise description feature information into the trained classification model for computation to obtain corresponding feature output information; and
    obtaining, according to a preset label acquisition rule, the enterprise classification label information corresponding to the classification request information from the feature output information.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, performs the following operations:
    upon receiving an input corpus database, converting the corpus samples contained in the corpus database according to a preset text processing rule to obtain corresponding sample feature sequences;
    generating, from the sample feature sequences, a plurality of training data sets conforming to a preset data set generation rule;
    inputting the sample feature sequences in each training data set into an initial classification model in turn for computation, so as to obtain the model output information corresponding to each sample feature sequence; and
    iteratively adjusting the parameter values in the initial classification model according to a preset parameter adjustment rule and the model output information of the sample feature sequences contained in each training data set, so as to obtain a classification model resulting from training the initial classification model, wherein the parameter adjustment rule comprises a weighted loss value calculation formula and a gradient calculation formula.
  17. The computer-readable storage medium according to claim 16, wherein the text processing rule comprises an invalid character set and a feature word library, and converting the corpus samples contained in the corpus database according to the preset text processing rule to obtain the corresponding sample feature sequences comprises:
    filtering out the invalid characters in the text information contained in each corpus sample according to the invalid character set to obtain corresponding valid text information; and
    performing feature word conversion on the valid text information according to the feature word library to obtain the sample feature sequence corresponding to each corpus sample.
  18. The computer-readable storage medium according to claim 16, wherein the data set generation rule comprises a sample count, and generating, from the sample feature sequences, the plurality of training data sets conforming to the preset data set generation rule comprises:
    counting the target classification labels corresponding to each sample feature sequence to obtain corresponding label statistical information; and
    randomly selecting sample feature sequences equal in number to the sample count and combining them into a training data set.
  19. The computer-readable storage medium according to claim 16, wherein inputting the sample feature sequences in each training data set into the initial classification model in turn for computation to obtain the model output information corresponding to each sample feature sequence comprises:
    inputting the sample feature sequences into the input layer of the initial classification model in turn; and
    performing association calculation on the sample feature sequences through the association formulas between the neurons in the initial classification model, and obtaining the model output information corresponding to the sample feature sequences from the output layer of the initial classification model.
  20. The computer-readable storage medium according to claim 18, wherein iteratively adjusting the parameter values in the initial classification model according to the preset parameter adjustment rule and the model output information of the sample feature sequences contained in each training data set to obtain the classification model resulting from training the initial classification model comprises:
    performing weighted calculation on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels according to the weighted loss value calculation formula and the label statistical information, so as to obtain the weighted loss value corresponding to each sample feature sequence;
    obtaining the update value of each parameter according to the gradient calculation formula, the weighted loss value of each sample feature sequence, and the value computed from the sample feature sequence by each parameter in the initial classification model, so as to iteratively update the initial classification model;
    judging whether any training data set has not yet been used for training;
    if a training data set has not yet been used for training, returning to the step of performing weighted calculation on the model output information of the sample feature sequences in one training data set and the corresponding target classification labels according to the weighted loss value calculation formula and the label statistical information to obtain the weighted loss value corresponding to each sample feature sequence; and
    if no training data set remains unused for training, taking the current initial classification model as the trained classification model.
PCT/CN2021/120254 2021-09-15 2021-09-24 Intelligent enterprise classification model construction method, apparatus, device and medium WO2023039925A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111077364.6A CN113535964B (zh) 2021-09-15 2021-09-15 Intelligent enterprise classification model construction method, apparatus, device and medium
CN202111077364.6 2021-09-15

Publications (1)

Publication Number Publication Date
WO2023039925A1 (zh)

Family

ID=78092584

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120254 WO2023039925A1 (zh) 2021-09-15 2021-09-24 Intelligent enterprise classification model construction method, apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN113535964B (zh)
WO (1) WO2023039925A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113990324A (zh) 2021-11-24 2022-01-28 深圳市品索科技有限公司 Voice-controlled smart home control system

Citations (4)

Publication number Priority date Publication date Assignee Title
CN106095996A (zh) * 2016-06-22 2016-11-09 量子云未来(北京)信息科技有限公司 Method for text classification
CN109902722A (zh) * 2019-01-28 2019-06-18 北京奇艺世纪科技有限公司 Classifier, neural network model training method, data processing device, and medium
US20200117712A1 (en) * 2018-10-12 2020-04-16 Siemens Healthcare Gmbh Sentence generation
CN111625645A (zh) * 2020-05-14 2020-09-04 北京字节跳动网络技术有限公司 Training method, apparatus, and electronic device for a text generation model

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
US7970718B2 (en) * 2001-05-18 2011-06-28 Health Discovery Corporation Method for feature selection and for evaluating features identified as significant for classifying data
US11604981B2 (en) * 2019-07-01 2023-03-14 Adobe Inc. Training digital content classification models utilizing batchwise weighted loss functions and scaled padding based on source density
CN110705607B (zh) * 2019-09-12 2022-10-25 西安交通大学 Industry multi-label denoising method based on a cyclic re-labeling bootstrap approach
CN111859171A (zh) * 2019-09-24 2020-10-30 北京嘀嘀无限科技发展有限公司 Information push method and apparatus, electronic device, and storage medium
CN111078871A (zh) * 2019-11-21 2020-04-28 深圳前海环融联易信息科技服务有限公司 Artificial-intelligence-based automatic contract classification method and system
CN111461180A (zh) * 2020-03-12 2020-07-28 平安科技(深圳)有限公司 Sample classification method and apparatus, computer device, and storage medium
CN111581385B (zh) * 2020-05-06 2024-04-02 西安交通大学 Chinese text category recognition system and method for unbalanced data sampling
CN112214605A (zh) * 2020-11-05 2021-01-12 腾讯科技(深圳)有限公司 Text classification method and related apparatus
CN112766320B (zh) * 2020-12-31 2023-12-22 平安科技(深圳)有限公司 Classification model training method and computer device
CN112765358B (zh) * 2021-02-23 2023-04-07 西安交通大学 Taxpayer industry classification method based on noisy label learning

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN116975626A (zh) * 2023-06-09 2023-10-31 浙江大学 Automatic update method and apparatus for a supply chain data model
CN116975626B (zh) * 2023-06-09 2024-04-19 浙江大学 Automatic update method and apparatus for a supply chain data model
CN117172792A (zh) * 2023-11-02 2023-12-05 赞塔(杭州)科技有限公司 Customer information management method and apparatus

Also Published As

Publication number Publication date
CN113535964B (zh) 2021-12-24
CN113535964A (zh) 2021-10-22

Similar Documents

Publication Title
WO2023039925A1 (zh) Intelligent enterprise classification model construction method, apparatus, device and medium
WO2021155706A1 (zh) Method and apparatus for training a business prediction model using unbalanced positive and negative samples
US11244205B2 (en) Generating multi modal image representation for an image
CN114930318B (zh) Classifying data using aggregated information from multiple classification modules
WO2019200782A1 (zh) Sample data classification method, model training method, electronic device, and storage medium
WO2022048173A1 (zh) Artificial-intelligence-based customer intention recognition method, apparatus, device, and medium
US11227217B1 (en) Entity transaction attribute determination method and apparatus
WO2021164481A1 (zh) Method and apparatus for automatic verification of a handwritten signature based on a neural network model
CN112487199B (zh) User feature prediction method based on user purchasing behavior
CN108427729A (zh) Large-scale image retrieval method based on a deep residual network and hash coding
CN112749274B (zh) Chinese text classification method based on an attention mechanism and interference word deletion
Bai et al. NHL Pathological Image Classification Based on Hierarchical Local Information and GoogLeNet‐Based Representations
CN109840413B (zh) Phishing website detection method and apparatus
CN111259140A (zh) Fake review detection method based on LSTM multi-entity feature fusion
US20230049817A1 (en) Performance-adaptive sampling strategy towards fast and accurate graph neural networks
CN114511576B (zh) Image segmentation method and system based on a scale-adaptive feature-enhanced deep neural network
CN112348079B (zh) Data dimensionality reduction processing method, apparatus, computer device, and storage medium
WO2020024444A1 (zh) Crowd performance grade recognition method, apparatus, storage medium, and computer device
Koklu et al. Estimation of credit card customers payment status by using kNN and MLP
CN113269647A (zh) Graph-based detection method for abnormally associated transaction users
CN115062732A (zh) Resource sharing cooperation recommendation method and system based on big-data user label information
CN114491084B (zh) Autoencoder-based relational network information mining method, apparatus, and device
WO2023024408A1 (zh) User feature vector determination method, related device, and medium
CN114970751A (zh) Autoencoder-based adaptive target classification method, system, and electronic device
WO2022267167A1 (zh) Intelligent text type recognition method, apparatus, device, and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21957200

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE