WO2019200806A1 - Device and method for generating a text classification model, and computer-readable storage medium

Publication number
WO2019200806A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2018/102400
Other languages
English (en)
Chinese (zh)
Inventor
王健宗
吴天博
黄章成
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Definitions

  • the present application relates to the field of text classification technologies, and in particular, to a device, a method, and a computer readable storage medium for generating a text classification model.
  • the present application provides a device and a method for generating a text classification model, and a computer readable storage medium. The main purpose is to generate a text classification model that can be used for sentiment classification of texts in the financial field, solving the problem in the prior art that sentiment classification of financial-field texts cannot be achieved.
  • the present application provides a device for generating a text classification model, the device comprising a memory and a processor, wherein the memory stores a model generation program executable on the processor, and the model generation program implements the following steps when executed by the processor:
  • word vectors are extracted and, based on the AdaBoost algorithm, the word vectors corresponding to the training samples and the labeled category information are input into a plurality of preset weak classifiers for training, and the trained weak classifiers are combined into a text classification model for the financial field.
  • the present application further provides a method for generating a text classification model, the method comprising:
  • word vectors are extracted and, based on the AdaBoost algorithm, the word vectors corresponding to the training samples and the labeled category information are input into a plurality of preset weak classifiers for training, and the trained weak classifiers are combined into a text classification model for the financial field.
  • the present application further provides a computer readable storage medium having a model generation program stored thereon, the model generation program being executable by one or more processors to implement the steps of the method for generating a text classification model as described above.
  • FIG. 1 is a schematic diagram of a preferred embodiment of a device for generating a text classification model of the present application
  • FIG. 2 is a schematic diagram of a program module of a model generation program in an embodiment of a device for generating a text classification model of the present application
  • FIG. 3 is a flow chart of a preferred embodiment of a method for generating a text classification model of the present application.
  • the application provides a device for generating a text classification model.
  • referring to FIG. 1, a schematic diagram of a preferred embodiment of the device for generating a text classification model of the present application is shown.
  • the generating device of the text classification model may be a PC (Personal Computer), or may be a terminal device such as a smart phone, a tablet computer, or a portable computer.
  • the text classification model generating apparatus 1 includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (for example, an SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 11 may in some embodiments be an internal storage unit of the text classification model generating device 1, such as a hard disk of the device.
  • the memory 11 may in other embodiments be an external storage device of the text classification model generating device 1, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the device.
  • the memory 11 may also include both an internal storage unit of the text classification model generating device 1 and an external storage device.
  • the memory 11 can be used not only to store application software installed on the text classification model generating device 1 and various types of data, such as the code of the model generation program 01, but also to temporarily store data that has been output or is to be output.
  • the processor 12 may in some embodiments be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run program code stored in the memory 11 or to process data, for example to execute the model generation program 01.
  • Communication bus 13 is used to implement connection communication between these components.
  • the network interface 14 can optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the device 1 and other electronic devices.
  • Figure 1 shows only the text classification model generating device 1 with the components 11-14 and the model generation program 01, but it should be understood that not all of the illustrated components are required, and alternative implementations may have more or fewer components.
  • the device 1 may further include a user interface. The user interface may include a display and an input unit such as a keyboard, and the optional user interface may further include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like.
  • the display may also be referred to as a display screen or display unit, and is used to display the information processed in the text classification model generating device 1 and to display a visual user interface.
  • the model generation program 01 is stored in the memory 11; when the processor 12 executes the model generation program 01 stored in the memory 11, the following steps are implemented:
  • A1. Obtain a word segmentation dictionary in the financial field based on collected financial domain vocabulary, and a text corpus of a preset financial field.
  • a general-domain word segmentation dictionary is obtained, and the collected financial-field vocabulary is added to it to form the financial-field word segmentation dictionary.
  • the financial-field vocabulary mainly comes from three sources: financial terminology, such as "Williams indicator", "moving average", and "convertible bonds"; financial forum terms, such as the words that users of stock market forums use when commenting on stocks; and network terms and specific symbols applied to the financial field, such as "junk stocks".
  • A2. Select candidate new words from the text corpus according to a preset algorithm, and add them to the word segmentation dictionary.
  • step A2 includes:
  • A21. based on the word segmentation dictionary, performing word segmentation on the text corpus using the word segmentation algorithm, and acquiring a candidate word set from the word segmentation result;
  • A22. calculating the information gain of each candidate word in the candidate word set, selecting the candidate words whose information gain is greater than a first preset threshold as first candidate new words, and adding the first candidate new words to the word segmentation dictionary;
  • A23. using the word segmentation dictionary to which the first candidate new words have been added, segmenting the text corpus with the word segmentation algorithm, and training a word vector model on the segmented text corpus;
  • A24. using the trained word vector model to calculate the semantic similarity between the words in the word segmentation result and the first candidate new words;
  • A25. taking the words whose semantic similarity is greater than a second preset threshold as second candidate new words, and adding the second candidate new words to the word segmentation dictionary.
  • the text corpus used to expand the word segmentation dictionary is obtained as follows. A web crawler captures a large amount of financial news text related to the financial theme to be analyzed from financial websites to form the text corpus.
  • the captured data is pre-processed: useless information such as garbled symbols and web escape symbols is removed, and the remaining text data is retained as the text corpus.
  • the sentiment orientation of a large amount of text data in the text corpus is classified by manual labeling, that is, category labeling information is added to the text data.
  • the current word segmentation dictionary is used as the dictionary of the preset word segmentation algorithm, and the text corpus is segmented. The stop words in the word segmentation result are then filtered out according to a preset stop word list to remove irrelevant words. The candidate word set is composed of the remaining word segmentation results.
  • the category labeling information of each word segmentation result is the same as the category labeling information of the text data it comes from.
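The segmentation and stop-word filtering described above can be sketched as follows. The source does not name the word segmentation algorithm, so forward maximum matching is used here purely as an assumption, and the names `segment`, `candidate_words`, and `max_len` are hypothetical.

```python
def segment(text, dictionary, max_len=6):
    """Forward maximum matching (an assumed algorithm, not named in the source).

    At each position, take the longest substring found in the dictionary;
    fall back to a single character when nothing matches.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def candidate_words(text, dictionary, stop_words):
    """Segment the text, then drop stop words to obtain the candidate word set."""
    return [tok for tok in segment(text, dictionary) if tok not in stop_words]
```

A richer dictionary changes the result directly: without "junk" in the dictionary, the same input would be split into single characters at that position.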
  • information gain is an entropy-based evaluation method. When used for feature selection, it measures how much information the presence or absence of a word provides for judging whether a text belongs to a certain class. It is defined as the difference in the amount of information before and after a feature item appears in the documents, and is calculated as:

    $$IG(t_i) = -\sum_{j=1}^{m} P(C_j)\log P(C_j) + P(t_i)\sum_{j=1}^{m} P(C_j \mid t_i)\log P(C_j \mid t_i) + P(\bar{t}_i)\sum_{j=1}^{m} P(C_j \mid \bar{t}_i)\log P(C_j \mid \bar{t}_i)$$

  • $P(C_j)$ represents the probability that the category $C_j$ appears in the data set
  • $P(t_i)$ represents the probability that the feature item $t_i$ appears in the data set
  • $P(C_j \mid t_i)$ represents the probability that a document is determined to be of category $C_j$ when the feature item $t_i$ appears in it
  • $P(\bar{t}_i)$ indicates the probability that the feature item $t_i$ does not appear, and $P(C_j \mid \bar{t}_i)$ the probability of category $C_j$ when $t_i$ does not appear
  • $m$ is the total number of categories.
  • here, the category refers to the sentiment orientation class, and the feature item is the candidate word.
  • the above probability values can be calculated by counting the occurrences of the candidate words in the text corpus.
  • the usefulness of each candidate word is judged by its calculated information gain: the larger the information gain, the more useful the word is for classification.
  • the candidate words whose information gain is greater than the first preset threshold are added to the current word segmentation dictionary as first candidate new words, which expands the word segmentation dictionary.
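As a minimal sketch of this selection step, the code below computes information gain in its equivalent entropy form, H(C) - P(t) H(C|t) - P(not t) H(C|not t), over a labeled corpus. The document representation (a set of words plus a label), the base-2 logarithm, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(word, docs):
    """docs is a list of (set_of_words, label) pairs."""
    all_labels = [label for _, label in docs]
    with_word = [label for words, label in docs if word in words]
    without_word = [label for words, label in docs if word not in words]
    p_t = len(with_word) / len(docs)
    return (entropy(all_labels)
            - p_t * entropy(with_word)
            - (1 - p_t) * entropy(without_word))


def first_candidate_new_words(candidates, docs, threshold):
    """Keep the candidates whose information gain exceeds the first threshold."""
    return {w for w in candidates if information_gain(w, docs) > threshold}
```

In a toy corpus where a word occurs only in negative documents, its gain equals the full class entropy, while a word spread evenly across classes scores zero.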
  • using the word segmentation dictionary to which the first candidate new words have been added, the same word segmentation algorithm is applied to the same text corpus to obtain a word segmentation result.
  • a word vector model is trained on the segmented text corpus, and the trained word vector model is used to calculate the word vector of each word in the word segmentation result.
  • the semantic similarity between each word in the word segmentation result and the first candidate new words is calculated from the word vectors. A word whose semantic similarity is greater than the second preset threshold is taken as a second candidate new word.
  • the second candidate new words selected from the word segmentation result are added to the word segmentation dictionary, which expands the dictionary again.
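The similarity-based selection can be sketched as below. The source does not say which similarity measure is used, so cosine similarity is an assumption here; the `vectors` mapping stands in for the output of a trained word vector model, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def second_candidate_new_words(vectors, first_candidates, threshold):
    """Select words whose similarity to any first candidate exceeds the threshold.

    vectors: dict mapping each segmented word to its word vector.
    """
    selected = set()
    for word, vec in vectors.items():
        if word in first_candidates:
            continue
        if any(cosine(vec, vectors[c]) > threshold
               for c in first_candidates if c in vectors):
            selected.add(word)
    return selected
```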
  • the stop words in the word segmentation result are deleted using the stop word list because they are noise and carry no meaning for text classification. Deleting them improves the accuracy of text classification and reduces the amount of computation when selecting candidate new words.
  • in effect, the word segmentation dictionary is expanded three times. The first expansion adds manually collected financial-field vocabulary, the second selects new words by calculating information gain, and the third selects new words again by calculating semantic similarity from word vectors.
  • the second and third expansions each build on the dictionary produced by the previous expansion.
  • the word segmentation algorithm with the expanded word segmentation dictionary is used to segment the training samples of the classification model. The richer the financial-field vocabulary in the word segmentation dictionary, the more accurate the segmentation of financial-field text, and the higher the classification accuracy of the trained classification model.
  • the word frequency of each second candidate new word in the text corpus is calculated, and the word frequency is used as the weight of that word in the word segmentation dictionary.
  • the word frequency of each first candidate new word can be calculated in the same way and used as its weight in the word segmentation dictionary.
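A short sketch of this weighting step, assuming the dictionary is represented as a mapping from word to weight and that frequency means the raw occurrence count in the segmented corpus; both are assumptions, and the function name is hypothetical.

```python
from collections import Counter


def add_new_words_with_weights(dictionary, new_words, segmented_corpus):
    """Add each new word to the dictionary with its corpus frequency as weight.

    dictionary: dict mapping word -> weight (modified in place and returned).
    segmented_corpus: list of token lists, one per document.
    """
    counts = Counter(tok for doc in segmented_corpus for tok in doc)
    for word in new_words:
        dictionary[word] = counts[word]
    return dictionary
```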
  • A3. Acquire a sample set, and label the category of the training samples in the sample set according to a preset sentiment orientation classification mode.
  • a sample set for training the text classification model is obtained. For each piece of training data in the sample set, multiple labels are collected according to the preset sentiment orientation classification mode, and the label that occurs most often among them is used as the labeling result of that training data.
  • the user can set a sentiment orientation classification mode according to the financial problem to be analyzed, for example, dividing stock forum text into holding, selling, and buying; dividing stock discussion text in microblogs or forums into positive, negative, and neutral; or dividing financial news text into positive, negative, and neutral.
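Choosing the most frequent label among several annotations is a majority vote, which can be sketched as follows. The function name is hypothetical, and the tie-breaking rule (the label seen first wins) is an assumption, since the source does not address ties.

```python
from collections import Counter


def majority_label(annotations):
    """Return the label that occurs most often among the annotations.

    Ties go to the label encountered first, because Counter.most_common
    keeps first-seen order among equal counts.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    return Counter(annotations).most_common(1)[0][0]
```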
  • A4. Perform word segmentation on the training samples in the sample set using the preset word segmentation algorithm, based on the word segmentation dictionary to which the candidate new words have been added.
  • A5. Extract word vectors from the word segmentation result and, based on the AdaBoost algorithm, input the word vectors corresponding to the training samples and the labeled category information into a plurality of preset weak classifiers for training, and combine the trained weak classifiers into a text classification model for the financial field.
  • Adaboost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) for the same training set, and then combine these weak classifiers to form a stronger final classifier (strong classifier).
  • the training samples are segmented using the preset word segmentation algorithm, and the word vectors of the word segmentation results are extracted using the trained word vector model. Note that the same word segmentation algorithm is used throughout this embodiment.
  • the word vectors are extracted using the word2vec model and the GloVe (Global Vectors for Word Representation) model respectively, so that two word vectors are obtained for each word segmentation result.
  • a classifier based on a convolutional neural network, a classifier based on a recurrent neural network, and a classifier based on a long short-term memory network are used as the weak classifiers.
  • feeding each of the two kinds of word vectors into each of the three classifiers yields six weak classification models.
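The count of six follows from pairing every embedding with every architecture, as this small sketch shows. The string identifiers and the configuration format are hypothetical placeholders, not names from the source.

```python
from itertools import product

EMBEDDINGS = ["word2vec", "glove"]       # two word vector models
ARCHITECTURES = ["cnn", "rnn", "lstm"]   # three classifier types


def weak_classifier_configs():
    """Enumerate every (embedding, architecture) pair: 2 x 3 = 6 weak models."""
    return [{"embedding": e, "architecture": a}
            for e, a in product(EMBEDDINGS, ARCHITECTURES)]
```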
  • each weak classifier is trained using samples from the sample set.
  • if a sample is classified correctly, its weight is reduced when the next training set is constructed; if a sample is not classified correctly, its weight is raised.
  • the sample set with updated weights is used to train the next classifier, and the whole training process proceeds iteratively in this way.
  • the weight of a weak classifier with a small classification error rate is increased so that it plays a greater role in the final classification function, and the weight of a weak classifier with a large classification error rate is reduced so that it plays a smaller role.
  • each weak classifier is iteratively trained in accordance with the above process.
  • the weak classifiers obtained from training are combined to form the final text classification model.
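The reweighting scheme above can be illustrated with a minimal discrete AdaBoost sketch. Several simplifications are assumptions: labels are binary (+1 or -1) rather than the multi-class sentiment labels of the source, the weak learners are one-dimensional threshold stumps rather than neural networks, and the classifier weight uses the standard formula alpha = 0.5 * ln((1 - err) / err), which the source does not state. All function names are hypothetical.

```python
import math


def fit_stump(xs, ys, weights):
    """Toy weak learner: the threshold and sign with the lowest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (sign if x >= thr else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: sign if x >= thr else -sign


def adaboost(xs, ys, fit_functions):
    """Train one weak classifier per fit function, reweighting samples each round."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for fit in fit_functions:
        predict = fit(xs, ys, weights)
        err = sum(w for x, y, w in zip(xs, ys, weights) if predict(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)    # low error -> large classifier weight
        # Correct samples (y * prediction = +1) shrink; misclassified ones grow.
        weights = [w * math.exp(-alpha * y * predict(x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((alpha, predict))
    return ensemble


def classify(ensemble, x):
    """Weighted vote of all weak classifiers."""
    score = sum(alpha * predict(x) for alpha, predict in ensemble)
    return 1 if score >= 0 else -1
```

In the setting of this embodiment, each entry in `fit_functions` would correspond to training one of the six weak models on the weighted samples.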
  • the text classification model can be used to classify the sentiment orientation of the financial domain text, and to judge whether the stock discussion text in the forum is negative, positive or neutral.
  • the device for generating a text classification model proposed in this embodiment mines a financial-field text corpus to select as many new financial-field words as possible and adds them to the word segmentation dictionary, which expands the financial-field word segmentation dictionary. It then uses the expanded dictionary to segment the training samples in the sample set, labels the sample data according to the preset sentiment orientation classification mode, and finally trains a text classification model that can be applied to sentiment orientation classification in the financial field.
  • the model generation program may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to implement the present application. A module in the present application refers to a series of computer program instruction segments capable of performing a specific function, and is used to describe the execution process of the model generation program in the text classification model generating device.
  • referring to FIG. 2, a schematic diagram of the program modules of the model generation program in an embodiment of the device for generating a text classification model is shown.
  • the model generation program can be divided into a data acquisition module 10, a new word selection module 20, a sample labeling module 30, a sample word segmentation module 40, and a model training module 50. By way of example:
  • the data obtaining module 10 is configured to: obtain a word segment dictionary of a financial field constructed based on the collected financial domain vocabulary, and a text corpus of a preset financial field;
  • the new word selection module 20 is configured to: select a candidate new word from the text corpus according to a preset algorithm, and add to the word segment dictionary;
  • the sample labeling module 30 is configured to: acquire a sample set, and perform category labeling on the training samples in the sample set according to a preset sentiment orientation classification mode;
  • the sample word segmentation module 40 is configured to perform word segmentation processing on the training samples in the sample set by using a preset word segmentation algorithm based on the word segment dictionary to which the candidate new words are added;
  • the model training module 50 is configured to: extract word vectors from the word segmentation result and, based on the AdaBoost algorithm, input the word vectors corresponding to the training samples and the labeled category information into a plurality of preset weak classifiers for training, and combine the trained weak classifiers into a text classification model for the financial field.
  • the present application also provides a method for generating a text classification model.
  • referring to FIG. 3, a flowchart of a preferred embodiment of the method for generating a text classification model of the present application is shown. The method can be performed by a device, and the device can be implemented by software and/or hardware.
  • the method for generating a text classification model includes:
  • Step S10 Obtain a word segmentation dictionary of a financial field constructed based on the collected financial domain vocabulary, and a text corpus of a preset financial field.
  • a general-domain word segmentation dictionary is obtained, and the collected financial-field vocabulary is added to it to form the financial-field word segmentation dictionary.
  • the financial-field vocabulary mainly comes from three sources: financial terminology, such as "Williams indicator", "moving average", and "convertible bonds"; financial forum terms, such as the words that users of stock market forums use when commenting on stocks; and network terms and specific symbols applied to the financial field, such as "junk stocks".
  • Step S20 Select a candidate new word from the text corpus according to a preset algorithm, and add to the word segment dictionary.
  • step S20 includes: performing word segmentation processing on the text corpus using the word segmentation algorithm based on the word segmentation dictionary, acquiring a candidate word set according to the word segmentation result; and calculating an information gain of each candidate word in the candidate word set.
  • the text corpus used to expand the word segmentation dictionary is obtained as follows. A web crawler captures a large amount of financial news text related to the financial theme to be analyzed from financial websites to form the text corpus.
  • the captured data is pre-processed: useless information such as garbled symbols and web escape symbols is removed, and the remaining text data is retained as the text corpus.
  • the sentiment orientation of a large amount of text data in the text corpus is classified by manual labeling, that is, category labeling information is added to the text data.
  • the current word segmentation dictionary is used as the dictionary of the preset word segmentation algorithm, and the text corpus is segmented. The stop words in the word segmentation result are then filtered out according to a preset stop word list to remove irrelevant words. The candidate word set is composed of the remaining word segmentation results.
  • the category labeling information of each word segmentation result is the same as the category labeling information of the text data it comes from.
  • information gain is an entropy-based evaluation method. When used for feature selection, it measures how much information the presence or absence of a word provides for judging whether a text belongs to a certain class. It is defined as the difference in the amount of information before and after a feature item appears in the documents, and is calculated as:

    $$IG(t_i) = -\sum_{j=1}^{m} P(C_j)\log P(C_j) + P(t_i)\sum_{j=1}^{m} P(C_j \mid t_i)\log P(C_j \mid t_i) + P(\bar{t}_i)\sum_{j=1}^{m} P(C_j \mid \bar{t}_i)\log P(C_j \mid \bar{t}_i)$$

  • $P(C_j)$ represents the probability that the category $C_j$ appears in the data set
  • $P(t_i)$ represents the probability that the feature item $t_i$ appears in the data set
  • $P(C_j \mid t_i)$ represents the probability that a document is determined to be of category $C_j$ when the feature item $t_i$ appears in it
  • $P(\bar{t}_i)$ indicates the probability that the feature item $t_i$ does not appear, and $P(C_j \mid \bar{t}_i)$ the probability of category $C_j$ when $t_i$ does not appear
  • $m$ is the total number of categories.
  • here, the category refers to the sentiment orientation class, and the feature item is the candidate word.
  • the above probability values can be calculated by counting the occurrences of the candidate words in the text corpus.
  • the usefulness of each candidate word is judged by its calculated information gain: the larger the information gain, the more useful the word is for classification.
  • the candidate words whose information gain is greater than the first preset threshold are added to the current word segmentation dictionary as first candidate new words, which expands the word segmentation dictionary.
  • using the word segmentation dictionary to which the first candidate new words have been added, the same word segmentation algorithm is applied to the same text corpus to obtain a word segmentation result.
  • a word vector model is trained on the segmented text corpus, and the trained word vector model is used to calculate the word vector of each word in the word segmentation result.
  • the semantic similarity between each word in the word segmentation result and the first candidate new words is calculated from the word vectors. A word whose semantic similarity is greater than the second preset threshold is taken as a second candidate new word.
  • the second candidate new words selected from the word segmentation result are added to the word segmentation dictionary, which expands the dictionary again.
  • the stop words in the word segmentation result are deleted using the stop word list because they are noise and carry no meaning for text classification. Deleting them improves the accuracy of text classification and reduces the amount of computation when selecting candidate new words.
  • in effect, the word segmentation dictionary is expanded three times. The first expansion adds manually collected financial-field vocabulary, the second selects new words by calculating information gain, and the third selects new words again by calculating semantic similarity from word vectors.
  • the second and third expansions each build on the dictionary produced by the previous expansion.
  • the word segmentation algorithm with the expanded word segmentation dictionary is used to segment the training samples of the classification model. The richer the financial-field vocabulary in the word segmentation dictionary, the more accurate the segmentation of financial-field text, and the higher the classification accuracy of the trained classification model.
  • the word frequency of each second candidate new word in the text corpus is calculated, and the word frequency is used as the weight of that word in the word segmentation dictionary.
  • the word frequency of each first candidate new word can be calculated in the same way and used as its weight in the word segmentation dictionary.
  • Step S30 Acquire a sample set, and perform category labeling on the training samples in the sample set according to a preset sentiment orientation classification mode.
  • a sample set for training the text classification model is obtained. For each piece of training data in the sample set, multiple labels are collected according to the preset sentiment orientation classification mode, and the label that occurs most often among them is used as the labeling result of that training data.
  • the user may set a sentiment orientation classification mode according to the financial problem to be analyzed, for example, dividing stock forum text into holding, selling, and buying; dividing microblog stock discussion text into positive, negative, and neutral; or dividing financial news text into positive, negative, and neutral.
  • Step S40 Perform word segmentation processing on the training samples in the sample set by using a preset word segmentation algorithm based on the word segment dictionary to which the candidate new words are added.
  • Step S50: Extract word vectors from the word segmentation result and, based on the AdaBoost algorithm, input the word vectors corresponding to the training samples and the labeled category information into a plurality of preset weak classifiers for training, and combine the trained weak classifiers into a text classification model for the financial field.
  • the training samples are segmented using the preset word segmentation algorithm, and the word vectors of the word segmentation results are extracted using the trained word vector model. Note that the same word segmentation algorithm is used throughout this embodiment.
  • the word vectors are extracted using the word2vec model and the GloVe model respectively, so that two word vectors are obtained for each word segmentation result.
  • a classifier based on a convolutional neural network, a classifier based on a recurrent neural network, and a classifier based on a long short-term memory network are used as the weak classifiers.
  • feeding each of the two kinds of word vectors into each of the three classifiers yields six weak classification models.
  • each weak classifier is trained using samples from the sample set.
  • the weight of the sample is reduced; if a sample is not accurately classified, then in the next sample set, the sample is raised. Weight.
  • the weighted updated sample set is used to train the next classifier, and the entire training process proceeds so iteratively.
  • The weight of a weak classifier with a small classification error rate is increased, so that it plays a greater role in the final classification function, and the weight of a weak classifier with a large classification error rate is reduced, so that it plays a smaller role in the final classification function.
  • Each weak classifier is iteratively trained in accordance with the above process.
  • The weak classifiers obtained from each round of training are combined to form the final text classification model.
  • The text classification model can be used to classify the sentiment orientation of financial-domain text, for example to judge whether stock discussion posts in a forum are negative, positive, or neutral.
  • The method for generating a text classification model proposed in this embodiment mines a financial-domain text corpus to filter out as many new financial-domain words as possible and adds them to the word segmentation dictionary, thereby expanding the financial-domain dictionary. The expanded dictionary is then used to segment the training samples in the sample set, the sample data are labeled according to a preset sentiment orientation classification scheme, and a text classification model is finally obtained by training. The model can be applied to sentiment orientation classification in the financial field.
  • An embodiment of the present application further provides a computer-readable storage medium on which a model generation program is stored; the model generation program may be executed by one or more processors to implement the following operations:
  • Extracting word vectors and, based on the AdaBoost algorithm, inputting the word vectors corresponding to the training samples and the labeled category information into a preset plurality of weak classifiers for training, and combining the trained weak classifiers into a text classification model for the financial field.
  • The specific embodiments of the computer-readable storage medium of the present application are substantially the same as the embodiments of the apparatus and method for generating a text classification model described above, and are not repeated here.
  • The part of the technical solution of the present application that is essential, or that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium (such as the ROM/RAM, magnetic disk, or optical disc described above) and comprising a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods described in the various embodiments of the present application.
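
The AdaBoost procedure described in the embodiment above (reweighting samples after each weak classifier, then weighting each classifier by its error rate) can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the application: the multi-class SAMME form of the classifier weight is used because the text names three sentiment classes but gives no formula; the `ThresholdStump` class is a hypothetical toy stand-in for the CNN, RNN, and LSTM weak classifiers; and the two-dimensional feature tuples stand in for word vectors.

```python
import math
from collections import defaultdict


class ThresholdStump:
    """Toy weak learner (hypothetical stand-in for the CNN/RNN/LSTM classifiers).

    It thresholds a single feature and predicts one label on each side.
    """

    def __init__(self, feature):
        self.feature = feature
        self.threshold = None
        self.labels = (None, None)  # (label if value <= threshold, label otherwise)

    def fit(self, X, y, w):
        best_err = float("inf")
        for t in sorted({x[self.feature] for x in X}):
            sides = (defaultdict(float), defaultdict(float))
            for x, label, wi in zip(X, y, w):
                sides[x[self.feature] > t][label] += wi
            # Each side predicts its weighted-majority label.
            picks = [max(s, key=s.get) if s else None for s in sides]
            correct = sum(s[p] for s, p in zip(sides, picks) if p is not None)
            err = sum(w) - correct
            if err < best_err:
                best_err, self.threshold = err, t
                # The left side is never empty, so it is a safe fallback label.
                self.labels = tuple(p if p is not None else picks[0] for p in picks)
        return self

    def predict(self, X):
        return [self.labels[x[self.feature] > self.threshold] for x in X]


def train_adaboost(weak_learners, X, y, n_classes):
    """Train the preset weak learners in sequence on reweighted samples."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for learner in weak_learners:
        learner.fit(X, y, w)
        miss = [p != t for p, t in zip(learner.predict(X), y)]
        err = sum(wi for wi, m in zip(w, miss) if m)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        # SAMME classifier weight (assumed): a lower error gives a larger weight.
        alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
        if alpha <= 0:
            continue  # no better than random guessing, so skip this learner
        # Raise the weights of misclassified samples, then renormalize, which
        # lowers the relative weight of correctly classified samples.
        w = [wi * math.exp(alpha) if m else wi for wi, m in zip(w, miss)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, learner))
    return ensemble


def predict(ensemble, X):
    """Combine the weak learners by weighted vote."""
    results = []
    for x in X:
        votes = defaultdict(float)
        for alpha, learner in ensemble:
            votes[learner.predict([x])[0]] += alpha
        results.append(max(votes, key=votes.get))
    return results


# Illustrative data only: two-dimensional stand-ins for word vectors.
X = [(0.1, 0.9), (0.2, 0.8), (0.5, 0.5), (0.55, 0.45), (0.8, 0.2), (0.9, 0.1)]
y = ["negative", "negative", "neutral", "neutral", "positive", "positive"]
learners = [ThresholdStump(0), ThresholdStump(1), ThresholdStump(0)]
ensemble = train_adaboost(learners, X, y, n_classes=3)
print(predict(ensemble, X))
```

No single stump can separate three classes, since each predicts only two labels, but the weighted vote of three stumps trained on successively reweighted samples can. This is the effect the embodiment relies on when it combines its weak classifiers.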

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a device for generating a text classification model. The device comprises a memory and a processor. A model generation program executable on the processor is stored in the memory. The following steps are performed when the program is executed by the processor: acquiring a word segmentation dictionary for the financial domain and a text corpus for the financial domain; selecting candidate new words from the text corpus and adding them to the word segmentation dictionary; acquiring a sample set and labeling the training samples in the sample set by class; and, based on the word segmentation dictionary to which the candidate new words have been added and using a preset word segmentation algorithm, performing word segmentation on the training samples in the sample set, extracting word vectors, inputting the word vectors and the labeled class information into multiple weak classifiers for training on the basis of the AdaBoost algorithm, and obtaining a text classification model. The present invention further relates to a method for generating a text classification model and to a computer-readable storage medium. The present invention addresses the lack, in the prior art, of methods for classifying the sentiment orientation of financial-domain text.
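
The effect of adding candidate new words to the word segmentation dictionary before segmenting can be sketched as follows. The application only refers to a "preset word segmentation algorithm", so forward maximum matching is an assumption made for illustration, and the function name `forward_max_match`, the dictionary contents, and the sample text are all hypothetical.

```python
def forward_max_match(text, dictionary):
    """Segment text greedily, taking the longest dictionary word at each position.

    Forward maximum matching is an assumed stand-in for the unspecified
    "preset word segmentation algorithm".
    """
    max_len = max(map(len, dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical dictionary contents and sample text, for illustration only.
base_dictionary = {"股票", "市场"}
candidate_new_words = {"科创板"}
text = "科创板股票市场"

before = forward_max_match(text, base_dictionary)
after = forward_max_match(text, base_dictionary | candidate_new_words)
print(before)  # ['科', '创', '板', '股票', '市场']
print(after)   # ['科创板', '股票', '市场']
```

Without the new word, the domain term is broken into single characters that carry little meaning for the downstream word vectors; with the expanded dictionary it survives as one token.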
PCT/CN2018/102400 2018-04-20 2018-08-27 Dispositif de génération d'un modèle de classification de texte, procédé et support d'informations lisible par ordinateur WO2019200806A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810361702.0 2018-04-20
CN201810361702.0A CN108804512B (zh) 2018-04-20 2018-04-20 文本分类模型的生成装置、方法及计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2019200806A1 true WO2019200806A1 (fr) 2019-10-24

Family

ID=64093733

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102400 WO2019200806A1 (fr) 2018-04-20 2018-08-27 Dispositif de génération d'un modèle de classification de texte, procédé et support d'informations lisible par ordinateur

Country Status (2)

Country Link
CN (1) CN108804512B (fr)
WO (1) WO2019200806A1 (fr)

Cited By (71)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837732A (zh) * 2019-10-31 2020-02-25 北京奇艺世纪科技有限公司 目标人物间亲密度识别方法、装置、电子设备及存储介质
CN110879934A (zh) * 2019-10-31 2020-03-13 杭州电子科技大学 一种高效的Wide & Deep深度学习模型
CN110968702A (zh) * 2019-11-29 2020-04-07 北京明略软件系统有限公司 一种事理关系提取方法及装置
CN110991612A (zh) * 2019-11-29 2020-04-10 交通银行股份有限公司 一种基于词向量的国际惯例实时推理模型的报文分析方法
CN111046177A (zh) * 2019-11-26 2020-04-21 方正璞华软件(武汉)股份有限公司 一种仲裁案件自动预判方法及装置
CN111078883A (zh) * 2019-12-13 2020-04-28 北京明略软件系统有限公司 危险指数分析方法、装置、电子设备和存储介质
CN111078546A (zh) * 2019-12-05 2020-04-28 北京云聚智慧科技有限公司 一种表达页面特征的方法和电子设备
CN111125317A (zh) * 2019-12-27 2020-05-08 携程计算机技术(上海)有限公司 对话型文本分类的模型训练、分类、系统、设备和介质
CN111125323A (zh) * 2019-11-21 2020-05-08 腾讯科技(深圳)有限公司 一种聊天语料标注方法、装置、电子设备及存储介质
CN111159589A (zh) * 2019-12-30 2020-05-15 中国银联股份有限公司 分类字典建立方法、商户数据分类方法、装置及设备
CN111191119A (zh) * 2019-12-16 2020-05-22 绍兴市上虞区理工高等研究院 一种基于神经网络的科技成果自学习方法及装置
CN111221950A (zh) * 2019-12-30 2020-06-02 航天信息股份有限公司 一种用户弱感情的分析方法及装置
CN111259148A (zh) * 2020-01-19 2020-06-09 北京松果电子有限公司 信息处理方法、装置及存储介质
CN111310464A (zh) * 2020-02-17 2020-06-19 北京明略软件系统有限公司 词向量获取模型生成方法、装置及词向量获取方法、装置
CN111309920A (zh) * 2020-03-26 2020-06-19 清华大学深圳国际研究生院 一种文本分类方法、终端设备及计算机可读存储介质
CN111309859A (zh) * 2020-01-21 2020-06-19 上饶市中科院云计算中心大数据研究院 一种景区网络口碑情感分析方法及装置
CN111309855A (zh) * 2019-12-24 2020-06-19 中国银行股份有限公司 一种文本信息的处理方法及系统
CN111325562A (zh) * 2020-02-17 2020-06-23 武汉轻工大学 粮食安全追溯系统及方法
CN111339268A (zh) * 2020-02-19 2020-06-26 北京百度网讯科技有限公司 实体词识别方法和装置
CN111367962A (zh) * 2020-02-28 2020-07-03 北京金堤科技有限公司 数据库的更新方法及装置、计算机可读存储介质、电子设备
CN111523308A (zh) * 2020-03-18 2020-08-11 大箴(杭州)科技有限公司 中文分词的方法、装置及计算机设备
CN111601314A (zh) * 2020-05-27 2020-08-28 北京亚鸿世纪科技发展有限公司 预训练模型加短信地址双重判定不良短信的方法和装置
CN111652281A (zh) * 2020-04-30 2020-09-11 中国平安财产保险股份有限公司 信息数据的分类方法、装置及可读存储介质
CN111680225A (zh) * 2020-04-26 2020-09-18 国家计算机网络与信息安全管理中心 基于机器学习的微信金融消息分析方法及系统
CN111680804A (zh) * 2020-06-02 2020-09-18 中国电力科学研究院有限公司 一种运检工作票生成方法、设备以及计算机可读介质
CN111680803A (zh) * 2020-06-02 2020-09-18 中国电力科学研究院有限公司 一种运检工作票生成系统
CN111680155A (zh) * 2020-05-13 2020-09-18 新华网股份有限公司 文本分类方法、装置、电子设备及计算机存储介质
CN111709233A (zh) * 2020-05-27 2020-09-25 西安交通大学 基于多注意力卷积神经网络的智能导诊方法及系统
CN111737993A (zh) * 2020-05-26 2020-10-02 浙江华云电力工程设计咨询有限公司 一种配电网设备的故障缺陷文本提取设备健康状态方法
CN111737999A (zh) * 2020-06-24 2020-10-02 深圳前海微众银行股份有限公司 一种序列标注方法、装置、设备及可读存储介质
CN111753091A (zh) * 2020-06-30 2020-10-09 北京小米松果电子有限公司 分类方法、分类模型的训练方法、装置、设备及存储介质
CN111783451A (zh) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 用于增强文本样本的方法和装置
CN111782803A (zh) * 2020-06-05 2020-10-16 京东数字科技控股有限公司 一种工单的处理方法、装置、电子设备及存储介质
CN111832292A (zh) * 2020-06-03 2020-10-27 北京百度网讯科技有限公司 文本识别处理方法、装置、电子设备和存储介质
CN111930942A (zh) * 2020-08-07 2020-11-13 腾讯云计算(长沙)有限责任公司 文本分类方法、语言模型训练方法、装置及设备
CN111966944A (zh) * 2020-08-17 2020-11-20 中电科大数据研究院有限公司 一种多层级用户评论安全审核的模型构建方法
CN112015895A (zh) * 2020-08-26 2020-12-01 广东电网有限责任公司 一种专利文本分类方法及装置
CN112016319A (zh) * 2020-09-08 2020-12-01 平安科技(深圳)有限公司 预训练模型获取、疾病实体标注方法、装置及存储介质
CN112101015A (zh) * 2020-09-08 2020-12-18 腾讯科技(深圳)有限公司 一种识别多标签对象的方法及装置
CN112287639A (zh) * 2020-10-30 2021-01-29 上海中通吉网络技术有限公司 一种智能客服工单分类方法
CN112364131A (zh) * 2020-11-10 2021-02-12 中国平安人寿保险股份有限公司 一种语料处理方法及其相关装置
CN112529743A (zh) * 2020-12-18 2021-03-19 平安银行股份有限公司 合同要素抽取方法、装置、电子设备及介质
CN112528022A (zh) * 2020-12-09 2021-03-19 广州摩翼信息科技有限公司 主题类别对应的特征词提取和文本主题类别识别方法
CN112650837A (zh) * 2020-12-28 2021-04-13 上海风秩科技有限公司 结合分类算法与非监督算法的文本质量控制方法及系统
CN112765936A (zh) * 2020-12-31 2021-05-07 出门问问(武汉)信息科技有限公司 一种基于语言模型进行运算的训练方法及装置
CN112784061A (zh) * 2021-01-27 2021-05-11 数贸科技(北京)有限公司 知识图谱的构建方法、装置、计算设备及存储介质
CN112861533A (zh) * 2019-11-26 2021-05-28 阿里巴巴集团控股有限公司 实体词识别方法及装置
CN112948573A (zh) * 2021-02-05 2021-06-11 北京百度网讯科技有限公司 文本标签的提取方法、装置、设备和计算机存储介质
CN112948583A (zh) * 2021-02-26 2021-06-11 中国光大银行股份有限公司 数据的分类方法及装置、存储介质、电子装置
CN112989032A (zh) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 实体关系分类方法、装置、介质及电子设备
CN113011183A (zh) * 2021-03-23 2021-06-22 北京科东电力控制系统有限责任公司 一种电力调控领域非结构化文本数据处理方法及系统
CN113032573A (zh) * 2021-04-30 2021-06-25 《中国学术期刊(光盘版)》电子杂志社有限公司 一种结合主题语义与tf*idf算法的大规模文本分类方法及系统
CN113033198A (zh) * 2021-03-25 2021-06-25 平安国际智慧城市科技股份有限公司 相似文本推送方法、装置、电子设备及计算机存储介质
CN113052191A (zh) * 2019-12-26 2021-06-29 航天信息股份有限公司 一种神经语言网络模型的训练方法、装置、设备及介质
CN113095068A (zh) * 2021-04-30 2021-07-09 平安国际智慧城市科技股份有限公司 基于权重字典的情感分析方法、系统、装置及存储介质
CN113302683A (zh) * 2019-12-24 2021-08-24 深圳市优必选科技股份有限公司 多音字预测方法及消歧方法、装置、设备及计算机可读存储介质
CN113377965A (zh) * 2021-06-30 2021-09-10 中国农业银行股份有限公司 感知文本关键词的方法及相关装置
CN113392209A (zh) * 2020-10-26 2021-09-14 腾讯科技(深圳)有限公司 一种基于人工智能的文本聚类方法、相关设备及存储介质
CN113449097A (zh) * 2020-03-24 2021-09-28 百度在线网络技术(北京)有限公司 一种对抗样本的生成方法、装置、电子设备和存储介质
CN113468292A (zh) * 2021-06-29 2021-10-01 中国银联股份有限公司 方面级情感分析方法、装置及计算机可读存储介质
CN113627530A (zh) * 2021-08-11 2021-11-09 中国平安人寿保险股份有限公司 相似问题文本生成方法、装置、设备及介质
CN113761882A (zh) * 2020-06-08 2021-12-07 北京沃东天骏信息技术有限公司 一种词典构建方法和装置
CN114004234A (zh) * 2020-07-28 2022-02-01 深圳Tcl数字技术有限公司 一种语义识别方法、存储介质及终端设备
CN114090601A (zh) * 2021-11-23 2022-02-25 北京百度网讯科技有限公司 一种数据筛选方法、装置、设备以及存储介质
CN114281928A (zh) * 2020-09-28 2022-04-05 中国移动通信集团广西有限公司 基于文本数据的模型生成方法、装置及设备
CN114443849A (zh) * 2022-02-09 2022-05-06 北京百度网讯科技有限公司 一种标注样本选取方法、装置、电子设备和存储介质
CN114638195A (zh) * 2022-01-21 2022-06-17 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) 一种基于多任务学习的立场检测方法
CN115952290A (zh) * 2023-03-09 2023-04-11 太极计算机股份有限公司 基于主动学习和半监督学习的案情特征标注方法、装置和设备
CN116307792A (zh) * 2022-10-12 2023-06-23 广州市阿尔法软件信息技术有限公司 一种面向城市体检主题场景的评估方法及装置
CN116361463A (zh) * 2023-03-27 2023-06-30 应急管理部国家减灾中心(应急管理部卫星减灾应用中心) 一种地震灾情信息提取方法、装置、设备及介质
CN117093715A (zh) * 2023-10-18 2023-11-21 湖南财信数字科技有限公司 词库扩充方法、系统、计算机设备及存储介质

Families Citing this family (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299276B (zh) * 2018-11-15 2021-11-19 创新先进技术有限公司 一种将文本转化为词嵌入、文本分类方法和装置
CN109614499B (zh) * 2018-11-22 2023-02-17 创新先进技术有限公司 一种词典生成方法、新词发现方法、装置及电子设备
CN109783800B (zh) * 2018-12-13 2024-04-12 北京百度网讯科技有限公司 情感关键词的获取方法、装置、设备及存储介质
CN109684634B (zh) * 2018-12-17 2023-07-25 北京百度网讯科技有限公司 情感分析方法、装置、设备及存储介质
CN109741190A (zh) * 2018-12-27 2019-05-10 清华大学 一种个股公告分类的方法、系统及设备
CN111401030B (zh) * 2018-12-28 2024-01-09 北京嘀嘀无限科技发展有限公司 服务异常识别方法、装置、服务器及可读存储介质
CN109685156B (zh) * 2018-12-30 2021-11-05 杭州灿八科技有限公司 一种用于识别情绪的分类器的获取方法
CN110008464A (zh) * 2019-01-02 2019-07-12 阿里巴巴集团控股有限公司 业务词库的构建方法、装置、服务器及可读存储介质
CN109871444A (zh) * 2019-01-16 2019-06-11 北京邮电大学 一种文本分类方法及系统
CN109871889B (zh) * 2019-01-31 2019-12-24 内蒙古工业大学 突发事件下大众心理评估方法
CN109783632B (zh) * 2019-02-15 2023-07-18 腾讯科技(深圳)有限公司 客服信息推送方法、装置、计算机设备及存储介质
CN110059187B (zh) * 2019-04-10 2022-06-07 华侨大学 一种集成浅层语义预判模态的深度学习文本分类方法
CN110232914A (zh) * 2019-05-20 2019-09-13 平安普惠企业管理有限公司 一种语义识别方法、装置以及相关设备
CN110347821B (zh) * 2019-05-29 2023-08-25 华东理工大学 一种文本类别标注的方法、电子设备和可读存储介质
CN110210028B (zh) * 2019-05-30 2023-04-28 杭州远传新业科技股份有限公司 针对语音转译文本的领域特征词提取方法、装置、设备及介质
CN110188204B (zh) * 2019-06-11 2022-10-04 腾讯科技(深圳)有限公司 一种扩展语料挖掘方法、装置、服务器及存储介质
CN110674289A (zh) * 2019-07-04 2020-01-10 南瑞集团有限公司 基于分词权重判断文章所属分类的方法、装置和存储介质
CN110457475B (zh) * 2019-07-25 2023-06-30 创新先进技术有限公司 一种用于文本分类体系构建和标注语料扩充的方法和系统
CN110489556A (zh) * 2019-08-22 2019-11-22 重庆锐云科技有限公司 关于跟进记录的质量评价方法、装置、服务器及存储介质
CN112445907A (zh) * 2019-09-02 2021-03-05 顺丰科技有限公司 文本情感分类方法、装置、设备、及存储介质
CN110704581B (zh) * 2019-09-11 2024-03-08 创新先进技术有限公司 计算机执行的文本情感分析方法及装置
CN110597958B (zh) * 2019-09-12 2022-03-25 思必驰科技股份有限公司 文本分类模型训练和使用方法及装置
CN112579768A (zh) * 2019-09-30 2021-03-30 北京国双科技有限公司 一种情感分类模型训练方法、文本情感分类方法及装置
CN111104510B (zh) * 2019-11-15 2023-05-09 南京中新赛克科技有限责任公司 一种基于词嵌入的文本分类训练样本扩充方法
CN110990567A (zh) * 2019-11-25 2020-04-10 国家电网有限公司 一种增强领域特征的电力审计文本分类方法
CN111177403B (zh) * 2019-12-16 2023-06-23 恩亿科(北京)数据科技有限公司 样本数据的处理方法和装置
CN111177378B (zh) * 2019-12-20 2023-09-26 北京淇瑀信息科技有限公司 一种文本挖掘方法、装置及电子设备
CN111144097B (zh) * 2019-12-25 2023-08-18 华中科技大学鄂州工业技术研究院 一种对话文本的情感倾向分类模型的建模方法和装置
CN111143569B (zh) * 2019-12-31 2023-05-02 腾讯科技(深圳)有限公司 一种数据处理方法、装置及计算机可读存储介质
CN114556328A (zh) * 2019-12-31 2022-05-27 深圳市欢太科技有限公司 数据处理方法、装置、电子设备和存储介质
CN111198948B (zh) * 2020-01-08 2024-06-14 深圳前海微众银行股份有限公司 文本分类校正方法、装置、设备及计算机可读存储介质
CN111325033B (zh) * 2020-03-20 2023-07-11 中国建设银行股份有限公司 实体识别方法、装置、电子设备及计算机可读存储介质
CN111444326B (zh) * 2020-03-30 2023-10-20 腾讯科技(深圳)有限公司 一种文本数据处理方法、装置、设备以及存储介质
CN113111175A (zh) * 2020-04-28 2021-07-13 北京明亿科技有限公司 基于深度学习模型极端行为识别方法与装置、设备及介质
CN111368555B (zh) * 2020-05-27 2020-08-28 腾讯科技(深圳)有限公司 一种数据识别方法、装置、存储介质和电子设备
CN112417860A (zh) * 2020-12-08 2021-02-26 携程计算机技术(上海)有限公司 训练样本增强方法、系统、设备及存储介质
CN112632971B (zh) * 2020-12-18 2023-08-25 上海明略人工智能(集团)有限公司 一种用于实体匹配的词向量训练方法与系统
CN112926631A (zh) * 2021-02-01 2021-06-08 大箴(杭州)科技有限公司 金融文本的分类方法、装置及计算机设备
CN113051401A (zh) * 2021-04-06 2021-06-29 明品云(北京)数据科技有限公司 一种文本结构化标注方法、系统、设备和介质
CN113240485A (zh) * 2021-05-10 2021-08-10 北京沃东天骏信息技术有限公司 文本生成模型的训练方法、文本生成方法和装置
CN113177109A (zh) * 2021-05-27 2021-07-27 中国平安人寿保险股份有限公司 文本的弱标注方法、装置、设备以及存储介质
CN113723114A (zh) * 2021-08-31 2021-11-30 平安普惠企业管理有限公司 基于多意图识别的语义分析方法、装置、设备及存储介质
CN113642678B (zh) * 2021-10-12 2022-01-07 南京山猫齐动信息技术有限公司 一种对抗消息样本生成的方法、装置及存储介质
CN114091469B (zh) * 2021-11-23 2022-08-19 杭州萝卜智能技术有限公司 基于样本扩充的网络舆情分析方法
CN114936282B (zh) * 2022-04-28 2024-06-11 北京中科闻歌科技股份有限公司 金融风险线索确定方法、装置、设备和介质
CN115861606B (zh) * 2022-05-09 2023-09-08 北京中关村科金技术有限公司 一种针对长尾分布文档的分类方法、装置及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023967A (zh) * 2010-11-11 2011-04-20 清华大学 一种面向股票领域的文本情感分类方法
US20130018824A1 (en) * 2011-07-11 2013-01-17 Accenture Global Services Limited Sentiment classifiers based on feature extraction
CN103559174A (zh) * 2013-09-30 2014-02-05 东软集团股份有限公司 语义情感分类特征值提取方法及系统
WO2016085409A1 (fr) * 2014-11-24 2016-06-02 Agency For Science, Technology And Research Procédé et système de classification de sentiments et de classification d'émotions
CN106547738A (zh) * 2016-11-02 2017-03-29 北京亿美软通科技有限公司 一种基于文本挖掘的金融类逾期短信智能判别方法

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104142913A (zh) * 2013-05-07 2014-11-12 株式会社日立制作所 词语极性的判别方法和判别系统
CN104331506A (zh) * 2014-11-20 2015-02-04 北京理工大学 一种面向双语微博文本的多类情感分析方法与系统
CN105022725B (zh) * 2015-07-10 2018-04-20 河海大学 一种应用于金融Web领域的文本情感倾向分析方法
CN105740349B (zh) * 2016-01-25 2019-03-08 重庆邮电大学 一种结合Doc2vec和卷积神经网络的情感分类方法
CN107436875B (zh) * 2016-05-25 2020-12-04 华为技术有限公司 文本分类方法及装置
RU2657173C2 (ru) * 2016-07-28 2018-06-08 Общество с ограниченной ответственностью "Аби Продакшн" Сентиментный анализ на уровне аспектов с использованием методов машинного обучения
CN106598940A (zh) * 2016-11-01 2017-04-26 四川用联信息技术有限公司 基于全局优化关键词质量的文本相似度求解算法
CN107122382B (zh) * 2017-02-16 2021-03-23 江苏大学 一种基于说明书的专利分类方法
CN107491531B (zh) * 2017-08-18 2019-05-17 华南师范大学 基于集成学习框架的中文网络评论情感分类方法
CN107729374A (zh) * 2017-09-13 2018-02-23 厦门快商通科技股份有限公司 一种情感词典的扩充方法及文本情感识别方法

Cited By (117)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879934A (zh) * 2019-10-31 2020-03-13 杭州电子科技大学 一种高效的Wide & Deep深度学习模型
CN110837732A (zh) * 2019-10-31 2020-02-25 北京奇艺世纪科技有限公司 目标人物间亲密度识别方法、装置、电子设备及存储介质
CN110879934B (zh) * 2019-10-31 2023-05-23 杭州电子科技大学 一种基于Wide&Deep深度学习模型的文本预测方法
CN110837732B (zh) * 2019-10-31 2024-01-26 北京奇艺世纪科技有限公司 目标人物间亲密度识别方法、装置、电子设备及存储介质
CN111125323A (zh) * 2019-11-21 2020-05-08 腾讯科技(深圳)有限公司 一种聊天语料标注方法、装置、电子设备及存储介质
CN111125323B (zh) * 2019-11-21 2024-01-19 腾讯科技(深圳)有限公司 一种聊天语料标注方法、装置、电子设备及存储介质
CN111046177A (zh) * 2019-11-26 2020-04-21 方正璞华软件(武汉)股份有限公司 一种仲裁案件自动预判方法及装置
CN112861533A (zh) * 2019-11-26 2021-05-28 阿里巴巴集团控股有限公司 实体词识别方法及装置
CN110991612A (zh) * 2019-11-29 2020-04-10 交通银行股份有限公司 一种基于词向量的国际惯例实时推理模型的报文分析方法
CN110968702A (zh) * 2019-11-29 2020-04-07 北京明略软件系统有限公司 一种事理关系提取方法及装置
CN110968702B (zh) * 2019-11-29 2023-05-09 北京明略软件系统有限公司 一种事理关系提取方法及装置
CN111078546A (zh) * 2019-12-05 2020-04-28 北京云聚智慧科技有限公司 一种表达页面特征的方法和电子设备
CN111078883A (zh) * 2019-12-13 2020-04-28 北京明略软件系统有限公司 危险指数分析方法、装置、电子设备和存储介质
CN111191119A (zh) * 2019-12-16 2020-05-22 绍兴市上虞区理工高等研究院 一种基于神经网络的科技成果自学习方法及装置
CN111191119B (zh) * 2019-12-16 2023-12-12 绍兴市上虞区理工高等研究院 一种基于神经网络的科技成果自学习方法及装置
CN112989032A (zh) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 实体关系分类方法、装置、介质及电子设备
CN111309855A (zh) * 2019-12-24 2020-06-19 中国银行股份有限公司 一种文本信息的处理方法及系统
CN113302683A (zh) * 2019-12-24 2021-08-24 深圳市优必选科技股份有限公司 多音字预测方法及消歧方法、装置、设备及计算机可读存储介质
CN113302683B (zh) * 2019-12-24 2023-08-04 深圳市优必选科技股份有限公司 多音字预测方法及消歧方法、装置、设备及计算机可读存储介质
CN113052191A (zh) * 2019-12-26 2021-06-29 航天信息股份有限公司 一种神经语言网络模型的训练方法、装置、设备及介质
CN111125317A (zh) * 2019-12-27 2020-05-08 携程计算机技术(上海)有限公司 对话型文本分类的模型训练、分类、系统、设备和介质
CN111159589B (zh) * 2019-12-30 2023-10-20 中国银联股份有限公司 分类字典建立方法、商户数据分类方法、装置及设备
CN111221950A (zh) * 2019-12-30 2020-06-02 航天信息股份有限公司 一种用户弱感情的分析方法及装置
CN111159589A (zh) * 2019-12-30 2020-05-15 中国银联股份有限公司 分类字典建立方法、商户数据分类方法、装置及设备
CN111259148A (zh) * 2020-01-19 2020-06-09 北京松果电子有限公司 信息处理方法、装置及存储介质
CN111259148B (zh) * 2020-01-19 2024-03-26 北京小米松果电子有限公司 信息处理方法、装置及存储介质
CN111309859A (zh) * 2020-01-21 2020-06-19 上饶市中科院云计算中心大数据研究院 一种景区网络口碑情感分析方法及装置
CN111309859B (zh) * 2020-01-21 2023-07-07 上饶市中科院云计算中心大数据研究院 一种景区网络口碑情感分析方法及装置
CN111310464B (zh) * 2020-02-17 2024-02-02 北京明略软件系统有限公司 词向量获取模型生成方法、装置及词向量获取方法、装置
CN111310464A (zh) * 2020-02-17 2020-06-19 北京明略软件系统有限公司 词向量获取模型生成方法、装置及词向量获取方法、装置
CN111325562A (zh) * 2020-02-17 2020-06-23 武汉轻工大学 粮食安全追溯系统及方法
CN111325562B (zh) * 2020-02-17 2023-08-01 武汉轻工大学 粮食安全追溯系统及方法
CN111339268B (zh) * 2020-02-19 2023-08-15 北京百度网讯科技有限公司 实体词识别方法和装置
CN111339268A (zh) * 2020-02-19 2020-06-26 北京百度网讯科技有限公司 实体词识别方法和装置
CN111367962B (zh) * 2020-02-28 2024-01-30 北京金堤科技有限公司 数据库的更新方法及装置、计算机可读存储介质、电子设备
CN111367962A (zh) * 2020-02-28 2020-07-03 北京金堤科技有限公司 数据库的更新方法及装置、计算机可读存储介质、电子设备
CN111523308A (zh) * 2020-03-18 2020-08-11 大箴(杭州)科技有限公司 中文分词的方法、装置及计算机设备
CN111523308B (zh) * 2020-03-18 2024-01-26 大箴(杭州)科技有限公司 中文分词的方法、装置及计算机设备
CN113449097A (zh) * 2020-03-24 2021-09-28 百度在线网络技术(北京)有限公司 一种对抗样本的生成方法、装置、电子设备和存储介质
CN111309920B (zh) * 2020-03-26 2023-03-24 清华大学深圳国际研究生院 一种文本分类方法、终端设备及计算机可读存储介质
CN111309920A (zh) * 2020-03-26 2020-06-19 清华大学深圳国际研究生院 一种文本分类方法、终端设备及计算机可读存储介质
CN111680225A (zh) * 2020-04-26 2020-09-18 国家计算机网络与信息安全管理中心 基于机器学习的微信金融消息分析方法及系统
CN111680225B (zh) * 2020-04-26 2023-08-18 国家计算机网络与信息安全管理中心 基于机器学习的微信金融消息分析方法及系统
CN111652281A (zh) * 2020-04-30 2020-09-11 中国平安财产保险股份有限公司 信息数据的分类方法、装置及可读存储介质
CN111652281B (zh) * 2020-04-30 2023-08-18 中国平安财产保险股份有限公司 信息数据的分类方法、装置及可读存储介质
CN111680155A (zh) * 2020-05-13 2020-09-18 新华网股份有限公司 文本分类方法、装置、电子设备及计算机存储介质
CN111737993B (zh) * 2020-05-26 2024-04-02 浙江华云电力工程设计咨询有限公司 一种配电网设备的故障缺陷文本提取设备健康状态方法
CN111737993A (zh) * 2020-05-26 2020-10-02 浙江华云电力工程设计咨询有限公司 一种配电网设备的故障缺陷文本提取设备健康状态方法
CN111601314A (zh) * 2020-05-27 2020-08-28 北京亚鸿世纪科技发展有限公司 预训练模型加短信地址双重判定不良短信的方法和装置
CN111709233B (zh) * 2020-05-27 2023-04-18 西安交通大学 基于多注意力卷积神经网络的智能导诊方法及系统
CN111709233A (zh) * 2020-05-27 2020-09-25 西安交通大学 基于多注意力卷积神经网络的智能导诊方法及系统
CN111601314B (zh) * 2020-05-27 2023-04-28 北京亚鸿世纪科技发展有限公司 预训练模型加短信地址双重判定不良短信的方法和装置
CN111680803A (zh) * 2020-06-02 2020-09-18 中国电力科学研究院有限公司 一种运检工作票生成系统
CN111680804B (zh) * 2020-06-02 2023-09-01 中国电力科学研究院有限公司 一种运检工作票生成方法、设备以及计算机可读介质
CN111680804A (zh) * 2020-06-02 2020-09-18 中国电力科学研究院有限公司 一种运检工作票生成方法、设备以及计算机可读介质
CN111680803B (zh) * 2020-06-02 2023-09-01 中国电力科学研究院有限公司 一种运检工作票生成系统
CN111832292A (zh) * 2020-06-03 2020-10-27 北京百度网讯科技有限公司 文本识别处理方法、装置、电子设备和存储介质
CN111832292B (zh) * 2020-06-03 2024-02-02 北京百度网讯科技有限公司 文本识别处理方法、装置、电子设备和存储介质
CN111782803A (zh) * 2020-06-05 2020-10-16 京东数字科技控股有限公司 一种工单的处理方法、装置、电子设备及存储介质
CN113761882A (zh) * 2020-06-08 2021-12-07 北京沃东天骏信息技术有限公司 一种词典构建方法和装置
CN111737999A (zh) * 2020-06-24 2020-10-02 深圳前海微众银行股份有限公司 一种序列标注方法、装置、设备及可读存储介质
CN111753091A (zh) * 2020-06-30 2020-10-09 北京小米松果电子有限公司 分类方法、分类模型的训练方法、装置、设备及存储介质
CN111783451A (zh) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 用于增强文本样本的方法和装置
CN114004234A (zh) * 2020-07-28 2022-02-01 深圳Tcl数字技术有限公司 一种语义识别方法、存储介质及终端设备
CN111930942A (zh) * 2020-08-07 2020-11-13 腾讯云计算(长沙)有限责任公司 文本分类方法、语言模型训练方法、装置及设备
CN111930942B (zh) * 2020-08-07 2023-08-15 腾讯云计算(长沙)有限责任公司 文本分类方法、语言模型训练方法、装置及设备
CN111966944B (zh) * 2020-08-17 2024-04-09 中电科大数据研究院有限公司 一种多层级用户评论安全审核的模型构建方法
CN111966944A (zh) * 2020-08-17 2020-11-20 中电科大数据研究院有限公司 一种多层级用户评论安全审核的模型构建方法
CN112015895A (zh) * 2020-08-26 2020-12-01 广东电网有限责任公司 一种专利文本分类方法及装置
CN112101015B (zh) * 2020-09-08 2024-01-26 腾讯科技(深圳)有限公司 一种识别多标签对象的方法及装置
CN112016319A (zh) * 2020-09-08 2020-12-01 平安科技(深圳)有限公司 预训练模型获取、疾病实体标注方法、装置及存储介质
CN112016319B (zh) * 2020-09-08 2023-12-15 平安科技(深圳)有限公司 预训练模型获取、疾病实体标注方法、装置及存储介质
CN112101015A (zh) * 2020-09-08 2020-12-18 腾讯科技(深圳)有限公司 一种识别多标签对象的方法及装置
CN114281928A (zh) * 2020-09-28 2022-04-05 中国移动通信集团广西有限公司 基于文本数据的模型生成方法、装置及设备
CN113392209B (zh) * 2020-10-26 2023-09-19 腾讯科技(深圳)有限公司 一种基于人工智能的文本聚类方法、相关设备及存储介质
CN113392209A (zh) * 2020-10-26 2021-09-14 腾讯科技(深圳)有限公司 一种基于人工智能的文本聚类方法、相关设备及存储介质
CN112287639A (zh) * 2020-10-30 2021-01-29 上海中通吉网络技术有限公司 一种智能客服工单分类方法
CN112364131A (zh) * 2020-11-10 2021-02-12 中国平安人寿保险股份有限公司 一种语料处理方法及其相关装置
CN112364131B (zh) * 2020-11-10 2024-05-17 中国平安人寿保险股份有限公司 一种语料处理方法及其相关装置
CN112528022A (zh) * 2020-12-09 2021-03-19 广州摩翼信息科技有限公司 主题类别对应的特征词提取和文本主题类别识别方法
CN112529743A (zh) * 2020-12-18 2021-03-19 平安银行股份有限公司 合同要素抽取方法、装置、电子设备及介质
CN112529743B (zh) * 2020-12-18 2023-08-08 平安银行股份有限公司 合同要素抽取方法、装置、电子设备及介质
CN112650837A (zh) * 2020-12-28 2021-04-13 上海风秩科技有限公司 结合分类算法与非监督算法的文本质量控制方法及系统
CN112650837B (zh) * 2020-12-28 2023-12-12 上海秒针网络科技有限公司 结合分类算法与非监督算法的文本质量控制方法及系统
CN112765936A (zh) * 2020-12-31 2021-05-07 出门问问(武汉)信息科技有限公司 一种基于语言模型进行运算的训练方法及装置
CN112765936B (zh) * 2020-12-31 2024-02-23 出门问问(武汉)信息科技有限公司 一种基于语言模型进行运算的训练方法及装置
CN112784061A (zh) * 2021-01-27 2021-05-11 数贸科技(北京)有限公司 知识图谱的构建方法、装置、计算设备及存储介质
CN112948573A (zh) * 2021-02-05 2021-06-11 北京百度网讯科技有限公司 文本标签的提取方法、装置、设备和计算机存储介质
CN112948573B (zh) * 2021-02-05 2024-04-02 北京百度网讯科技有限公司 文本标签的提取方法、装置、设备和计算机存储介质
CN112948583A (zh) * 2021-02-26 2021-06-11 中国光大银行股份有限公司 数据的分类方法及装置、存储介质、电子装置
CN113011183B (zh) * 2021-03-23 2023-09-05 北京科东电力控制系统有限责任公司 一种电力调控领域非结构化文本数据处理方法及系统
CN113011183A (zh) * 2021-03-23 2021-06-22 北京科东电力控制系统有限责任公司 一种电力调控领域非结构化文本数据处理方法及系统
CN113033198B (zh) * 2021-03-25 2022-08-26 平安国际智慧城市科技股份有限公司 相似文本推送方法、装置、电子设备及计算机存储介质
CN113033198A (zh) * 2021-03-25 2021-06-25 平安国际智慧城市科技股份有限公司 相似文本推送方法、装置、电子设备及计算机存储介质
CN113032573A (zh) * 2021-04-30 2021-06-25 《中国学术期刊(光盘版)》电子杂志社有限公司 一种结合主题语义与tf*idf算法的大规模文本分类方法及系统
CN113095068A (zh) * 2021-04-30 2021-07-09 平安国际智慧城市科技股份有限公司 基于权重字典的情感分析方法、系统、装置及存储介质
CN113032573B (zh) * 2021-04-30 2024-01-23 同方知网数字出版技术股份有限公司 一种结合主题语义与tf*idf算法的大规模文本分类方法及系统
CN113468292A (zh) * 2021-06-29 2021-10-01 中国银联股份有限公司 方面级情感分析方法、装置及计算机可读存储介质
CN113377965B (zh) * 2021-06-30 2024-02-23 中国农业银行股份有限公司 感知文本关键词的方法及相关装置
CN113377965A (zh) * 2021-06-30 2021-09-10 中国农业银行股份有限公司 感知文本关键词的方法及相关装置
CN113627530A (zh) * 2021-08-11 2021-11-09 中国平安人寿保险股份有限公司 相似问题文本生成方法、装置、设备及介质
CN113627530B (zh) * 2021-08-11 2023-09-15 中国平安人寿保险股份有限公司 相似问题文本生成方法、装置、设备及介质
CN114090601A (zh) * 2021-11-23 2022-02-25 北京百度网讯科技有限公司 一种数据筛选方法、装置、设备以及存储介质
CN114090601B (zh) * 2021-11-23 2023-11-03 北京百度网讯科技有限公司 一种数据筛选方法、装置、设备以及存储介质
CN114638195A (zh) * 2022-01-21 2022-06-17 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) 一种基于多任务学习的立场检测方法
CN114638195B (zh) * 2022-01-21 2022-11-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) 一种基于多任务学习的立场检测方法
CN114443849A (zh) * 2022-02-09 2022-05-06 北京百度网讯科技有限公司 一种标注样本选取方法、装置、电子设备和存储介质
CN114443849B (zh) * 2022-02-09 2023-10-27 北京百度网讯科技有限公司 一种标注样本选取方法、装置、电子设备和存储介质
US11907668B2 (en) 2022-02-09 2024-02-20 Beijing Baidu Netcom Science Technology Co., Ltd. Method for selecting annotated sample, apparatus, electronic device and storage medium
CN116307792B (zh) * 2022-10-12 2024-03-12 广州市阿尔法软件信息技术有限公司 一种面向城市体检主题场景的评估方法及装置
CN116307792A (zh) * 2022-10-12 2023-06-23 广州市阿尔法软件信息技术有限公司 一种面向城市体检主题场景的评估方法及装置
CN115952290B (zh) * 2023-03-09 2023-06-02 太极计算机股份有限公司 基于主动学习和半监督学习的案情特征标注方法、装置和设备
CN115952290A (zh) * 2023-03-09 2023-04-11 太极计算机股份有限公司 基于主动学习和半监督学习的案情特征标注方法、装置和设备
CN116361463B (zh) * 2023-03-27 2023-12-08 应急管理部国家减灾中心(应急管理部卫星减灾应用中心) 一种地震灾情信息提取方法、装置、设备及介质
CN116361463A (zh) * 2023-03-27 2023-06-30 应急管理部国家减灾中心(应急管理部卫星减灾应用中心) 一种地震灾情信息提取方法、装置、设备及介质
CN117093715A (zh) * 2023-10-18 2023-11-21 湖南财信数字科技有限公司 词库扩充方法、系统、计算机设备及存储介质
CN117093715B (zh) * 2023-10-18 2023-12-29 湖南财信数字科技有限公司 词库扩充方法、系统、计算机设备及存储介质

Also Published As

Publication number Publication date
CN108804512A (zh) 2018-11-13
CN108804512B (zh) 2020-11-24

Similar Documents

Publication Publication Date Title
WO2019200806A1 (fr) Dispositif de génération d'un modèle de classification de texte, procédé et support d'informations lisible par ordinateur
AU2017243270B2 (en) Method and device for extracting core words from commodity short text
US11093854B2 (en) Emoji recommendation method and device thereof
WO2019184217A1 (fr) Procédé et appareil de classification d'événement de point d'accès sans fil, et support de stockage
CN110263248B (zh) 一种信息推送方法、装置、存储介质和服务器
US8676730B2 (en) Sentiment classifiers based on feature extraction
WO2019041521A1 (fr) Appareil et procédé d'extraction de mot-clé d'utilisateur et support de mémoire lisible par ordinateur
WO2019218514A1 (fr) Procédé permettant d'extraire des informations cibles de page web, dispositif et support d'informations
US11106718B2 (en) Content moderation system and indication of reliability of documents
CN111126069B (zh) 一种基于视觉对象引导的社交媒体短文本命名实体识别方法
WO2017167067A1 (fr) Procédé et dispositif pour une classification de texte de page internet, procédé et dispositif pour une reconnaissance de texte de page internet
WO2022095374A1 (fr) Procédé et appareil d'extraction de mots-clés, ainsi que dispositif terminal et support de stockage
US20120136812A1 (en) Method and system for machine-learning based optimization and customization of document similarities calculation
CN103927309B (zh) 一种对业务对象标注信息标签的方法及装置
CN111460153A (zh) 热点话题提取方法、装置、终端设备及存储介质
CN110083832B (zh) 文章转载关系的识别方法、装置、设备及可读存储介质
CN112632226B (zh) 基于法律知识图谱的语义搜索方法、装置和电子设备
CN110287314B (zh) 基于无监督聚类的长文本可信度评估方法及系统
US10417578B2 (en) Method and system for predicting requirements of a user for resources over a computer network
CN111753082A (zh) 基于评论数据的文本分类方法及装置、设备和介质
WO2019214142A1 (fr) Dispositif électronique, procédé de prédiction basée sur des données de rapport de recherche, programme et support de stockage informatique
WO2022183991A1 (fr) Procédé et appareil de classification de documents et dispositif électronique
CN115248890B (zh) 用户兴趣画像的生成方法、装置、电子设备以及存储介质
Wang et al. A novel calibrated label ranking based method for multiple emotions detection in Chinese microblogs
CN109753646B (zh) 一种文章属性识别方法以及电子设备

Legal Events

Date Code Title Description
NENP Non-entry into the national phase Ref country code: DE
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23/02/2021)
122 Ep: pct application non-entry in european phase Ref document number: 18915109; Country of ref document: EP; Kind code of ref document: A1