WO2020140632A1 - Hidden feature extraction method, apparatus, computer device and storage medium - Google Patents

Hidden feature extraction method, apparatus, computer device and storage medium

Info

Publication number
WO2020140632A1
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
word vector
feature
self
hidden
Application number
PCT/CN2019/118242
Other languages
French (fr)
Chinese (zh)
Inventor
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date
2019-01-04 (priority claimed from Chinese application No. 201910007711.4)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020140632A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • The present application relates to the technical field of text classification, and in particular to a hidden feature extraction method, apparatus, computer device, and computer-readable storage medium.
  • Embodiments of the present application provide a hidden feature extraction method, apparatus, computer device, and computer-readable storage medium, which can solve the problem of low text classification efficiency in the conventional technology.
  • An embodiment of the present application provides a hidden feature extraction method, the method including: acquiring a first corpus for hidden feature extraction; performing word embedding on the first corpus to convert the first corpus into word vectors; extracting word vector features of the word vectors through a convolutional neural network; and encoding the word vector features by self-encoding to extract hidden features of the word vector features.
  • An embodiment of the present application further provides a computer device, which includes a memory and a processor; a computer program is stored in the memory, and the hidden feature extraction method is implemented when the processor executes the computer program.
  • An embodiment of the present application further provides a computer-readable storage medium that stores a computer program which, when executed by a processor, causes the processor to perform the hidden feature extraction method.
  • FIG. 1 is a schematic diagram of an application scenario of the hidden feature extraction method provided by an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of the hidden feature extraction method provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of word vectors in the hidden feature extraction method provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of corpus display in the hidden feature extraction method provided by an embodiment of the present application.
  • Each subject in FIG. 1 works as follows: the terminal acquires a first corpus for hidden feature extraction, performs word embedding on the first corpus to convert it into word vectors, extracts the word vector features of the word vectors through a convolutional neural network, and encodes the word vector features by self-encoding to extract the hidden features of the word vector features.
  • FIG. 1 only illustrates a desktop computer as a terminal.
  • the type of the terminal is not limited to that shown in FIG. 1.
  • the terminal may also be an electronic device such as a mobile phone, notebook computer, or tablet computer.
  • the application scenarios of the above implicit feature extraction method are only used to illustrate the technical solution of the present application, and are not used to limit the technical solution of the present application.
  • The server acquires the first corpus for hidden feature extraction.
  • The first corpus may be a preset corpus crawled from a designated website, and the crawling rules may be preset according to actual needs; for example, a rule may target the corpus of a certain web page or the corpus related to a certain subject.
  • The first corpus may also be a corpus provided by a corpus database, such as user data accumulated on a website.
  • Word embedding is a type of word representation in which words with similar meanings have similar representations; it is the general term for methods that map vocabulary to vectors of real numbers.
  • Specifically, word embedding is a class of techniques in which each word is represented as a real-valued vector in a predefined vector space, with every word mapped to a vector.
  • FIG. 3 is a schematic diagram of word vectors in a method for extracting hidden features provided by an embodiment of the present application.
  • The text corpus is converted into pre-trained word vectors; that is, the input natural language is encoded into word vectors in preparation for the pre-trained word vectors.
  • If pre-trained word vectors are used, there are a static method and a non-static method: the static method does not adjust the word vector parameters during TextCNN training, while the non-static method adjusts them during training, so the results of the non-static method are better than those of the static method.
  • TextCNN (Text Convolutional Neural Network) is a text classification model based on a convolutional neural network; that is, it uses a convolutional neural network to classify text.
  • Rather than adjusting the embedding layer in every batch, it can be adjusted once every 100 batches, which reduces training time while still fine-tuning the word vectors.
  • the first corpus can be word embedded using a trained preset word vector dictionary to convert the first corpus into word vectors.
  • the word vector may use Word2Vec pre-trained word vectors, that is, each vocabulary has a corresponding vector representation, and such vector representations can express vocabulary information in data form.
  • Word2vec ("word to vector") is a software tool for training word vectors.
  • A convolutional neural network (CNN) is a class of feedforward neural networks with a deep structure that involves convolution or related computations; it is one of the representative algorithms of deep learning. Because a convolutional neural network can perform shift-invariant classification, it is also called a "shift-invariant artificial neural network" (SIANN).
  • a convolutional neural network is established, and the features of the corpus are extracted using the convolutional neural network.
  • Convolutional neural networks capture local text information through multiple scale convolution kernels.
  • The height of the first-layer convolution kernels can be chosen from several scales between 1 and 5, corresponding to the number of words captured, while the width stays equal to the word vector dimension.
  • After the first convolutional layer, one-dimensional convolutional layers with heights chosen according to the text length can further refine the information.
  • The self-encoding method refers to encoding through a self-encoding (autoencoder) structure.
  • The self-encoding structure is an unsupervised learning method that uses a neural network to learn hidden features; it is an artificial neural network used for efficient coding in unsupervised learning.
  • The purpose of self-encoding is to learn a representation (an encoding, generally described by numbers) of a set of data; it is usually used for dimensionality reduction, and self-encoding can also be used in generative models of data. Please refer to FIG. 4.
  • FIG. 4 is a schematic diagram of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 4, the self-encoding structure generally includes an input layer, a hidden layer, and an output layer.
  • the input layer receives external input data, encodes through the hidden layer in the middle to learn hidden features, and decodes and outputs the hidden features through the output layer.
  • The hidden layer can be expressed as a functional relationship, such as H_{w,b}(x), where H denotes the hidden feature, x is the input variable, and w and b are parameters.
  • The hidden-layer structure in the self-encoding structure may consist of a single layer (called a single hidden layer) or of multiple layers (called multiple hidden layers).
  • The hidden layer shown in FIG. 4 is a single layer; it may also consist of multiple layers, such as 2, 3, or 4 layers.
  • The self-encoding structure can be built with the TensorFlow library in Python; once built, the network structure can be trained, and the trained self-encoding structure can then be put into use.
  • The word vector features are encoded by a self-encoding function to obtain the hidden features of the word vector features. That is, the terminal encodes the word vector features through the hidden layer of the self-encoding structure to obtain a dimension-reduced numerical description of the first corpus, where the hidden layer refers to a layer that, through unsupervised neural-network learning, converts the text corpus into a numerical representation, implicitly expressing the meaning of the text in non-literal form, so that a large amount of corpus can be extracted and later accurately restored.
  • the hidden layer is an intermediate layer between the input layer and the output layer of the neural network. Each hidden layer contains a certain number of hidden units, and there are connections between the hidden units and the input and output layers.
  • The self-encoding structure can also be understood as the following conversion of a text corpus: 10 dimensions (Chinese characters) → 5 dimensions (numbers) → 10 dimensions (Chinese characters), where the 5-dimensional middle stage means the hidden features of the text occupy 5 dimensions (for example, 5 rows), and training determines how accurately those 5 dimensions represent the text.
  • A neural network realizes the following process: text representation → replacement by the hidden layer with a numerical representation (the meaning of the text expressed in numbers) → restored text representation.
  • FIG. 5 is a schematic flowchart of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 5, a self-encoding network structure is built.
  • the embodiments of the present application belong to the technical field of text classification.
  • When implementing hidden feature extraction, the embodiments of the present application acquire a first corpus for hidden feature extraction, perform word embedding on the first corpus to convert it into word vectors, and extract the word vector features of the word vectors through a convolutional neural network, thereby clustering and describing the corpus with an unsupervised algorithm; the word vector features are then encoded by self-encoding to extract their hidden features, reducing the dimensionality of the corpus data. The hidden features of the corpus are thus extracted through unsupervised learning, which can improve the accuracy of subsequent learning and modeling and overcome the influence of the amount of training data.
  • After the step of decoding the hidden features to obtain a decoded second corpus, the method further includes: displaying the second corpus in a preset form.
  • Since the second corpus has been clustered, it exhibits a certain regularity; it can be displayed in the form of a table or a chart, so that the user can obtain information about the second corpus from the table or graph.
  • Table 1 is an example of the second corpus displayed in table form.
  • FIG. 6 is a schematic diagram of corpus display in the hidden feature extraction method provided by an embodiment of the present application; it shows an example of the second corpus displayed in chart form.
  • Before the step of encoding the word vector features through the self-encoding function to obtain the hidden features of the word vector features, the method further includes: training the self-encoding function using a training corpus.
  • The step of training the self-encoding function using the training corpus includes: S710, inputting the word vector features of the training corpus into the self-encoding function; S720, encoding the word vector features of the training corpus through the self-encoding function to extract the hidden features of the word vector features; S730, decoding the hidden features to obtain a decoded third corpus; S740, determining whether the similarity between the training corpus and the third corpus is greater than or equal to a preset similarity threshold; S750, if the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold, determining that training of the self-encoding structure is complete; S760, if the similarity between the training corpus and the third corpus is less than the preset similarity threshold, adjusting the parameters of the self-encoding function and continuing to train it until the similarity is greater than or equal to the preset similarity threshold.
  • Before the self-encoding structure is used to learn the hidden features of text, it needs to be trained; once the hidden features it extracts from the corpus meet the accuracy requirement, training is complete.
  • The trained self-encoding network structure can then be used for text feature extraction, learning the hidden features of the text so that the extracted hidden features of the corpus can be used for modeling and other purposes.
  • During training, the loss function of the self-encoding structure is MSE (mean squared error), which measures the average of the squared differences between predicted values and true values.
  • The training method is Adam (adaptive moment estimation), with a learning rate of 0.001.
  • The learning rate controls the learning progress of the model.
  • The trained self-encoding network structure can then be used to extract the hidden features of text. Specifically, the training process of the self-encoding structure is as follows:
  • The training corpus is a text corpus; for example, the obtained training corpus includes: cat 1, dog 1, dog 3, person, cat 2, dog 2.
  • The training corpus is converted into word vectors through the word embedding layer; that is, the text corpus is converted into word vectors. For example, after conversion the above training corpus becomes: 1′ (cat 1), 2′ (dog 1), 2″ (dog 3), 3 (person), 1″ (cat 2), 2‴ (dog 2).
  • The word vector features of the word vectors are extracted through the convolutional neural network to achieve an unsupervised clustering representation; that is, the word vectors converted from the training corpus are extracted and grouped by the convolutional neural network to obtain the features of the training corpus. For example, the word vector features obtained from the above word vectors are: 1′ and 1″ (cat 1, cat 2); 2′, 2″ and 2‴ (dog 1, dog 2, dog 3); 3 (person).
  • The word vector features of the training corpus are encoded by the hidden layer to learn the hidden features of the training corpus, with the self-encoding structure established on the output of the convolutional neural network. That is, the word vector features of the training corpus are fed through the input layer of the self-encoding structure into its hidden layer, i.e., input into the self-encoding function for encoding, so that the meaning of the text corpus is expressed in numerical form, which is an implicit representation relative to the text form.
  • For example, the hidden features learned from the above training corpus are: 1 (1′ and 1″), 2 (2′, 2″ and 2‴), 3 (3).
  • The hidden features of the training corpus are decoded through the output layer of the self-encoding structure to obtain the decoded third corpus; that is, the numerical form of the hidden features is restored to text form through the neural network of the self-encoding structure.
  • Decoding is achieved when the restored corpus and the text of the original training corpus meet the similarity requirement; that is, the numerical hidden features are restored to textual meaning through the self-encoding structure, and the final restored content is required to meet the similarity requirement with the original text.
  • For example, the corpus restored from the above hidden features is: cat 1, cat 2, dog 1, dog 2, dog 3, person, or cat 1, dog 1, dog 3, person, cat 2, dog 2.
  • The convolutional neural network is established during the training process; it is pre-trained so that feature extraction of the text can be realized using the convolutional neural network.
  • The hidden features of the text refer to the features generated by the hidden layer shown in FIG. 5. During the training process, the self-encoding structure, the convolutional neural network structure, and the word vectors are all updated. Finally, when the similarity between the training corpus and the third corpus meets the preset similarity threshold, the hidden layer of the trained self-encoding structure can reflect the hidden features of the text and can be used for multiple purposes.
  • The embodiment of the present application extracts the hidden features of text using an unsupervised algorithm: the text is first converted into pre-trained word vectors, the convolutional neural network extracts the features of the text, and a self-encoding structure is then built on the output of the convolutional neural network to learn the hidden features of the text.
  • During training, the self-encoding structure, the convolutional neural network structure, and the word vectors are updated. The hidden layer of the trained self-encoding structure can reflect the hidden features of the text, extracting them through an unsupervised algorithm, and can be used for multiple purposes; the information obtained can improve the accuracy of subsequent supervised learning modeling and overcome the impact of the amount of training data.
  • The hidden feature extraction model established by the method of the embodiments of the present application is suitable for supervised training with a small number of training samples. Since deep learning is prone to overfitting, a small amount of training sample data will seriously affect the generalization ability of the model. Therefore, through the method of the embodiments of the present application, a hidden feature extraction model can be built from a large amount of unlabeled training data to learn the hidden features of the text; the hidden features from this model are then combined with annotated training data for supervised learning modeling, improving the accuracy of supervised learning modeling.
  • FIG. 8 is a schematic block diagram of an apparatus for extracting hidden features provided by an embodiment of the present application.
  • an embodiment of the present application further provides a hidden feature extraction device.
  • the hidden feature extraction device includes a unit for performing the aforementioned hidden feature extraction method, and the device may be configured in a computer device such as a terminal or a server.
  • the hidden feature extraction device 800 includes an acquisition unit 801, a conversion unit 802, a first extraction unit 803 and a second extraction unit 804.
  • The obtaining unit 801 is used to obtain a first corpus for hidden feature extraction; the conversion unit 802 is used to perform word embedding on the first corpus to convert it into word vectors; the first extraction unit 803 is used to extract word vector features of the word vectors through a convolutional neural network; and the second extraction unit 804 is used to encode the word vector features by self-encoding to extract hidden features of the word vector features.
  • the second extraction unit 804 is configured to encode the word vector feature through a self-encoding function to obtain the hidden feature of the word vector feature.
  • The hidden feature extraction device 800 further includes: a decoding unit 805 for decoding the hidden features to obtain a decoded second corpus; a display unit 806 for displaying the second corpus in a preset form; and a training unit 807 for training the self-encoding function using a training corpus.
  • The training unit 807 includes: an input subunit 8071 for inputting the word vector features of the training corpus into the self-encoding function; an encoding subunit 8072 for encoding the word vector features of the training corpus through the self-encoding function to extract the hidden features of the word vector features; a decoding subunit 8073 for decoding the hidden features to obtain a decoded third corpus; a judgment subunit 8074 for judging whether the similarity between the training corpus and the third corpus is greater than or equal to a preset similarity threshold; a determination subunit 8075 for determining that training of the self-encoding structure is complete if the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold; and an adjustment subunit 8076 for adjusting the parameters of the self-encoding function and continuing training if the similarity between the training corpus and the third corpus is less than the preset similarity threshold.
  • the display unit is configured to display the second corpus in a table form or a chart form.
  • the conversion unit 802 is configured to embed the first corpus into words using a trained preset word vector dictionary to convert the first corpus into word vectors.
  • the division and connection of the units in the above hidden feature extraction device are only for illustration.
  • The hidden feature extraction device may be divided into different units as needed, or the units in the device may adopt different connection sequences and methods, to complete all or part of the functions of the hidden feature extraction device.
  • the above-mentioned hidden feature extraction device may be implemented in the form of a computer program, and the computer program may run on the computer device shown in FIG. 10.
  • FIG. 10 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 1000 may be a computer device such as a desktop computer or a server, or may be a component or part in other devices.
  • the computer device 1000 includes a processor 1002, a memory, and a network interface 1005 connected through a system bus 1001, where the memory may include a non-volatile storage medium 1003 and an internal memory 1004.
  • the non-volatile storage medium 1003 can store an operating system 10031 and a computer program 10032.
  • When executed, the computer program 10032 may cause the processor 1002 to perform the aforementioned hidden feature extraction method.
  • the processor 1002 is used to provide computing and control capabilities to support the operation of the entire computer device 1000.
  • the internal memory 1004 provides an environment for the operation of the computer program 10032 in the non-volatile storage medium 1003.
  • the processor 1002 can execute the above-mentioned hidden feature extraction method.
  • the network interface 1005 is used for network communication with other devices.
  • the structure shown in FIG. 10 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 1000 to which the solution of the present application is applied.
  • the specific computer device 1000 may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 10, and details are not described herein again.
  • the processor 1002 is used to run the computer program 10032 stored in the memory to implement the hidden feature extraction method in the embodiment of the present application.
  • the processor 1002 may be a central processing unit (Central Processing Unit, CPU), and the processor 1002 may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), Application specific integrated circuit (Application Specific Integrated Circuit, ASIC), ready-made programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • When executed by the processor, the computer program causes the processor to perform the steps of the hidden feature extraction method described in the foregoing embodiments.
  • The storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc, or any of various other physical storage media that can store a computer program.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

Provided by the embodiments of the present application are a hidden feature extraction method, an apparatus, a computer device, and a computer-readable storage medium. The embodiments of the present application relate to the technical field of text classification. In the embodiments of the present application, when hidden feature extraction is performed, a first corpus for performing hidden feature extraction is acquired, word embedding is performed on the first corpus so as to convert the first corpus into a word vector, a word vector feature of the word vector is extracted by means of a convolutional neural network, the word vector is clustered and described by using an unsupervised algorithm, and then the word vector feature is encoded by means of self-encoding so as to extract a hidden feature of the word vector feature.

Description

Hidden feature extraction method, apparatus, computer device and storage medium
This application claims priority to Chinese patent application No. 201910007711.4, filed with the China Patent Office on January 4, 2019 and entitled "Hidden feature extraction method, apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of text classification, and in particular to a hidden feature extraction method, apparatus, computer device, and computer-readable storage medium.
Background
The traditional text classification model is a supervised learning model. A supervised learning model refers to the process of adjusting the parameters of a classifier using a set of samples of known categories so that it achieves the required performance; it is also called a supervised training model or learning-with-a-teacher model. A supervised learning model therefore classifies text based on samples of known categories, so text classification with a supervised learning model requires a large amount of labeled data in order to classify the text according to the annotations, and processing that large amount of labeled data makes text classification relatively inefficient.
Summary of the Invention
Embodiments of the present application provide a hidden feature extraction method, apparatus, computer device, and computer-readable storage medium, which can solve the problem of low text classification efficiency in the conventional technology.
In a first aspect, an embodiment of the present application provides a hidden feature extraction method. The method includes: acquiring a first corpus for hidden feature extraction; performing word embedding on the first corpus to convert the first corpus into word vectors; extracting word vector features of the word vectors through a convolutional neural network; and encoding the word vector features by self-encoding to extract hidden features of the word vector features.
In a second aspect, an embodiment of the present application further provides a hidden feature extraction apparatus, including: an acquiring unit for acquiring a first corpus for hidden feature extraction; a conversion unit for performing word embedding on the first corpus to convert the first corpus into word vectors; a first extraction unit for extracting word vector features of the word vectors through a convolutional neural network; and a second extraction unit for encoding the word vector features by self-encoding to extract hidden features of the word vector features.
In a third aspect, an embodiment of the present application further provides a computer device, including a memory and a processor; a computer program is stored in the memory, and the processor implements the hidden feature extraction method when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the hidden feature extraction method.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description illustrate some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
FIG. 1 is a schematic diagram of an application scenario of the hidden feature extraction method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the hidden feature extraction method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of word vectors in the hidden feature extraction method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of corpus display in the hidden feature extraction method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a sub-process in the hidden feature extraction method provided by an embodiment of the present application;
FIG. 8 is a schematic block diagram of the hidden feature extraction apparatus provided by an embodiment of the present application;
FIG. 9 is another schematic block diagram of the hidden feature extraction apparatus provided by an embodiment of the present application; and
FIG. 10 is a schematic block diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.
Please refer to FIG. 1, which is a schematic diagram of an application scenario of the hidden feature extraction method provided by an embodiment of the present application. The application scenario includes: (1) a terminal. An application program is installed on the terminal shown in FIG. 1, and a developer performs the steps of the hidden feature extraction method through the application program. The terminal may be an electronic device such as a notebook computer, tablet computer, or desktop computer, and the terminal application environment shown in FIG. 1 may also be replaced with a computer device such as a server. If the application environment in FIG. 1 is a server, the server may be a server cluster or a cloud server. The server cluster may also adopt a distributed system, whose servers may in turn include a master server and slave servers, so that the master server performs the steps of the hidden feature extraction method using the obtained corpus.
Each subject in FIG. 1 works as follows: the terminal acquires a first corpus for hidden feature extraction, performs word embedding on the first corpus to convert it into word vectors, extracts the word vector features of the word vectors through a convolutional neural network, and encodes the word vector features by self-encoding to extract the hidden features of the word vector features.
It should be noted that FIG. 1 only illustrates a desktop computer as the terminal. In actual operation, the type of the terminal is not limited to that shown in FIG. 1; the terminal may also be an electronic device such as a mobile phone, notebook computer, or tablet computer. The above application scenario of the hidden feature extraction method is only used to illustrate the technical solution of the present application and is not intended to limit it.
FIG. 2 is a schematic flowchart of the hidden feature extraction method provided by an embodiment of the present application. The hidden feature extraction method is applied to the terminal in FIG. 1 to complete all or part of the functions of the method.
Please refer to FIG. 2, which is a schematic flowchart of the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 2, the method includes the following steps S210-S240:
S210: Acquire a first corpus for hidden feature extraction.
Specifically, the server acquires the first corpus for hidden feature extraction. The first corpus may be a preset corpus crawled from a designated website on the web; the crawling rules may be preset according to actual needs, for example targeting the corpus of a certain web page or the corpus related to a certain subject. The first corpus may also be a corpus provided by a corpus database, such as user data accumulated on a website.
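As one illustration of such a crawling rule, the following sketch collects paragraph text from one designated page. It is only a sketch: the requests and BeautifulSoup libraries and the URL are assumptions for illustration, not part of the embodiment.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical crawling rule: collect all paragraph text from one designated page.
url = "https://example.com/corpus-page"  # placeholder URL
html = requests.get(url, timeout=10).text
first_corpus = [p.get_text(strip=True)
                for p in BeautifulSoup(html, "html.parser").find_all("p")]
```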
S220: Perform word embedding on the first corpus to convert the first corpus into word vectors.
Word embedding is a type of word representation in which words with similar meanings have similar representations; it is the general term for methods that map vocabulary to vectors of real numbers. Specifically, word embedding is a class of techniques in which each word is represented as a real-valued vector in a predefined vector space, with every word mapped to a vector. Please refer to FIG. 3, a schematic diagram of word vectors in the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 3, suppose a text contains the words "cat", "dog", and "love", and these words are mapped into a vector space: the vector for "cat" is (0.1, 0.2, 0.3), the vector for "dog" is (0.2, 0.2, 0.4), and the vector for "love" is (-0.4, -0.5, -0.2) (these values are only illustrative). Mapping a text X{x1, x2, x3, x4, x5, ..., xn} to a multidimensional vector space Y{y1, y2, y3, y4, y5, ..., yn} in this way is called word embedding. The reason for turning each word into a vector is to facilitate computation. Take the three words "cat", "dog", and "love": a person knows that "cat" and "dog" both denote animals while "love" denotes an emotion, but to a machine these three words are merely binary strings of 0s and 1s that cannot be computed with. Once word embedding converts the words into word vectors, the machine can compute on them and obtain the similarity between words from the cosine of the angle between their vectors; for example, in FIG. 3, since cos α < cos β, "cat" and "dog" are more similar, while "cat" and "love" differ considerably.
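The cosine comparison just described can be reproduced in a few lines of Python. The sketch below uses the illustrative vectors from FIG. 3 (numpy assumed; the vector values are demonstration figures only, not real embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative vectors from FIG. 3 (for demonstration only).
cat = np.array([0.1, 0.2, 0.3])
dog = np.array([0.2, 0.2, 0.4])
love = np.array([-0.4, -0.5, -0.2])

print(cosine_similarity(cat, dog))   # close to 1: "cat" and "dog" are similar
print(cosine_similarity(cat, love))  # negative: "cat" and "love" differ
```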
Specifically, the text corpus is converted into pre-trained word vectors; that is, the input natural language is encoded into word vectors in preparation for the pre-trained word vectors. In a specific implementation, pre-trained word vectors can be used, or a set of word vectors can be trained directly while training TextCNN; however, using pre-trained word vectors is more than 100 times faster than training a set of word vectors during TextCNN training. If pre-trained word vectors are used, there are a static method and a non-static method: the static method does not adjust the word vector parameters during TextCNN training, while the non-static method adjusts them during training, so the results of the non-static method are better than those of the static method. TextCNN (Text Convolutional Neural Network) is a text classification model based on a convolutional neural network; that is, it uses a convolutional neural network to classify text.
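The static/non-static distinction corresponds to whether the embedding weights are trainable. A sketch in Keras terms (an assumed library; the vocabulary size and dimension are hypothetical):

```python
import tensorflow as tf

# Static method: the pre-trained word vectors are frozen during TextCNN training.
static_embedding = tf.keras.layers.Embedding(
    input_dim=50000, output_dim=300, trainable=False)

# Non-static method: the same layer, but fine-tuned during training,
# which generally yields better results.
non_static_embedding = tf.keras.layers.Embedding(
    input_dim=50000, output_dim=300, trainable=True)
```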
Furthermore, instead of adjusting the embedding layer in every batch, it can be adjusted once every 100 batches, which reduces training time while still fine-tuning the word vectors.
Further still, the first corpus can be word-embedded using a trained preset word vector dictionary to convert the first corpus into word vectors. In one embodiment, the word vectors may be Word2Vec pre-trained word vectors; that is, each vocabulary item has a corresponding vector representation, and such vector representations can express vocabulary information in data form. Word2vec ("word to vector") is a software tool for training word vectors.
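As an illustration of this lookup, the sketch below maps tokens to vectors through a pre-trained Word2Vec dictionary. The gensim library, the file name, and the fallback for unknown words are assumptions; the embodiment only specifies Word2Vec-style pre-trained vectors.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to a pre-trained Word2Vec word vector dictionary.
wv = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

def embed(tokens, dim=300):
    """Map each token to its pre-trained vector; unknown words fall back to zeros."""
    return np.stack([wv[t] if t in wv else np.zeros(dim, dtype=np.float32)
                     for t in tokens])

word_vectors = embed(["cat", "dog", "love"])  # shape: (3, dim)
```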
S230: Extract word vector features of the word vectors through a convolutional neural network.
A convolutional neural network (CNN) is a class of feedforward neural networks with a deep structure that involves convolution or related computations; it is one of the representative algorithms of deep learning. Because a convolutional neural network can perform shift-invariant classification, it is also called a "shift-invariant artificial neural network" (SIANN).
Specifically, a convolutional neural network is established, and the features of the corpus are extracted with it. The convolutional neural network captures local text information through convolution kernels of multiple scales. In practice, the height of the first-layer convolution kernels can be chosen from several scales between 1 and 5, corresponding to the number of words captured, while the width stays equal to the word vector dimension. After the first convolutional layer, one-dimensional convolutional layers with heights chosen according to the text length can further refine the information.
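A minimal sketch of this multi-scale convolution step, assuming TensorFlow/Keras (consistent with the TensorFlow library mentioned below); the sequence length, embedding dimension, and filter count are hypothetical values:

```python
import tensorflow as tf

def build_text_feature_extractor(seq_len=50, emb_dim=300, filters=128):
    """Multi-scale convolution over a sequence of word vectors.

    Kernel sizes 1-5 correspond to the number of words each kernel captures;
    Conv1D convolves across the full embedding dimension, matching the design
    in which the kernel width equals the word vector dimension.
    """
    inputs = tf.keras.Input(shape=(seq_len, emb_dim))
    branches = []
    for k in range(1, 6):  # kernel heights 1..5
        x = tf.keras.layers.Conv1D(filters, k, activation="relu",
                                   padding="same")(inputs)
        x = tf.keras.layers.GlobalMaxPooling1D()(x)
        branches.append(x)
    features = tf.keras.layers.Concatenate()(branches)  # word vector features
    return tf.keras.Model(inputs, features)

extractor = build_text_feature_extractor()
```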
S240: Encode the word vector features by self-encoding to extract hidden features of the word vector features.
The self-encoding method refers to encoding through a self-encoding (autoencoder) structure. The self-encoding structure is an unsupervised learning method that uses a neural network to learn hidden features; it is an artificial neural network used for efficient coding in unsupervised learning. The purpose of self-encoding is to learn a representation (an encoding, generally described by numbers) of a set of data; it is usually used for dimensionality reduction, and self-encoding can also be used in generative models of data. Please refer to FIG. 4, a schematic diagram of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 4, the self-encoding structure generally includes an input layer, a hidden layer, and an output layer. The input layer receives external input data, the hidden layer in the middle encodes it to learn hidden features, and the output layer decodes and outputs the hidden features. The hidden layer can be expressed as a functional relationship, such as H_{w,b}(x), where H denotes the hidden feature, x is the input variable, and w and b are parameters. The hidden-layer structure in the self-encoding structure may consist of a single layer (called a single hidden layer) or of multiple layers (called multiple hidden layers); the hidden layer shown in FIG. 4 is a single layer, but it may also have 2, 3, 4, or more layers. The self-encoding structure can be built with the TensorFlow library in Python; once built, the network structure can be trained, and the trained self-encoding structure can then be put into use.
Specifically, after the self-encoding structure is built, it is established on the output of the convolutional neural network. In the embodiments of the present application, both the input and the output of the self-encoding structure are the output information of the convolutional neural network, while the one or more hidden layers in the middle can be regarded as the hidden features. Through encoding, the trained self-encoding structure converts the input into hidden information, and by decoding the hidden information it obtains an output close to the original input. In this case, the hidden-layer units can record a large amount of textual information.
In one embodiment, the word vector features are encoded by a self-encoding function to obtain the hidden features of the word vector features. That is, the terminal encodes the word vector features through the hidden layer of the self-encoding structure to obtain a dimension-reduced numerical description of the first corpus, where the hidden layer refers to a layer that, through unsupervised neural-network learning, converts the text corpus into a numerical representation, implicitly expressing the meaning of the text in non-literal form, so that a large amount of corpus can be extracted and later accurately restored. The hidden layer is an intermediate layer between the input layer and the output layer of the neural network; each hidden layer contains a certain number of hidden units, and the hidden units are connected to the input and output layers. The self-encoding structure can also be understood as the following conversion of a text corpus: 10 dimensions (Chinese characters) → 5 dimensions (numbers) → 10 dimensions (Chinese characters), where the 5-dimensional middle stage means the hidden features of the text occupy 5 dimensions (for example, 5 rows), and training determines how accurately those 5 dimensions represent the text. A neural network realizes the following process: text representation → replacement by the hidden layer with a numerical representation (the meaning of the text expressed in numbers) → restored text representation. Please refer to FIG. 5, a schematic flowchart of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 5, a self-encoding network structure is built: for example, the input and output dimensions of the convolutional neural network are both 384×1, i.e., a structure of 384 rows and 1 column, and the goal is to learn hidden features of dimension 100×1, i.e., a structure of 100 rows and 1 column. The input and output layers of the self-encoding network structure both have 384 dimensions, and the self-encoding structure contains 3 layers in total: the two 384-dimensional input and output layers, and a 100-dimensional intermediate hidden layer, where the intermediate hidden layer may comprise more than one layer and can be set to multiple layers according to actual needs, such as 2, 3, or 4 layers.
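A minimal sketch of the 384 → 100 → 384 structure just described, assuming the TensorFlow (Keras) library mentioned in the text; the single 100-dimensional hidden layer matches FIG. 5, while the activation choice is an assumption:

```python
import tensorflow as tf

# Encoder: 384-dimensional CNN output -> 100-dimensional hidden features.
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(384,)),
    tf.keras.layers.Dense(100, activation="relu", name="hidden_features"),
])

# Decoder: restore a 384-dimensional representation from the hidden features.
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(384, name="reconstruction"),
])

# Self-encoding structure: input and output are both the CNN's output information.
autoencoder = tf.keras.Sequential([encoder, decoder])
```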
Furthermore, after the word vector features are encoded by self-encoding to extract their hidden features, the hidden features can subsequently be decoded when needed to obtain a decoded second corpus. Specifically, in big data processing, a large amount of raw data can be encoded by the self-encoding structure so that the data is dimension-reduced and compressed, reducing the size of the corpus, facilitating data storage, and improving reading efficiency. When the big data later needs to be analyzed to extract the information it implies, the hidden features can be decoded to obtain the decoded second corpus. For example, a shopping website inevitably accumulates a large amount of user purchase data; to facilitate storage and analysis of the data, the hidden features of that large amount of data can be learned. When the user base later needs to be analyzed to learn user information such as purchasing habits and preferences, the hidden features obtained from the large amount of raw user data can be decoded to obtain the decoded second corpus. Since the second corpus has undergone cluster analysis and compression, the efficiency of analyzing and processing the corpus can be improved.
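Continuing the sketch above, compressing a corpus to its hidden features for storage and decoding it later for analysis might look as follows (the feature matrix is a random stand-in for real CNN output):

```python
import numpy as np

# Stand-in for the CNN output features of a large raw corpus.
features = np.random.rand(1000, 384).astype("float32")

# Compress: keep only the 100-dimensional hidden features for storage.
hidden = encoder.predict(features, verbose=0)          # shape: (1000, 100)

# Later analysis: decode the hidden features to obtain the second-corpus
# representation, close to the original 384-dimensional features.
second_corpus_features = decoder.predict(hidden, verbose=0)
```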
The embodiments of the present application belong to the technical field of text classification. When implementing hidden feature extraction, a first corpus for hidden feature extraction is acquired, word embedding is performed on the first corpus to convert it into word vectors, and word vector features of the word vectors are extracted through a convolutional neural network, so that the corpus is clustered and described with an unsupervised algorithm; the word vector features are then encoded by self-encoding to extract their hidden features, reducing the dimensionality of the corpus data. The hidden features of the corpus are thus extracted through unsupervised learning, which can improve the accuracy of subsequent learning and modeling and overcome the influence of the amount of training data.
In one embodiment, after the step of decoding the hidden features to obtain the decoded second corpus, the method further includes:
displaying the second corpus in a preset form.
Specifically, since the corpus has been clustered, the second corpus exhibits a certain regularity; it can be displayed in the form of a table or a chart, so that the user can obtain information about the second corpus from the table or graph. Please refer to Table 1 and FIG. 6: Table 1 is an example of the second corpus displayed in table form, and FIG. 6, a schematic diagram of corpus display in the hidden feature extraction method provided by an embodiment of the present application, is an example of the second corpus displayed in chart form; a sketch of such a chart follows Table 1 below.
Table 1

Topic     Occurrences
Cat       100
Dog       60
Love      80
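Where a chart rather than a table is wanted, something like the matplotlib sketch below could render the topic counts from Table 1; the library and the bar-chart form are assumptions, since the embodiments only require display "in a preset form".

```python
import matplotlib.pyplot as plt

topics = ["Cat", "Dog", "Love"]    # topics from Table 1
occurrences = [100, 60, 80]

plt.bar(topics, occurrences)
plt.xlabel("Topic")
plt.ylabel("Occurrences")
plt.title("Second corpus displayed as a chart")
plt.savefig("second_corpus.png")   # or plt.show() for interactive display
```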
In one embodiment, before the step of encoding the word vector features through the self-encoding function to obtain the hidden features of the word vector features, the method further includes: training the self-encoding function using a training corpus.
Further, please refer to FIG. 7, which is a schematic diagram of a sub-flow in the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 7, in this embodiment the step of training the self-encoding function using the training corpus includes: S710, inputting the word vector features of the training corpus into the self-encoding function; S720, encoding the word vector features of the training corpus through the self-encoding function to extract their hidden features; S730, decoding the hidden features to obtain a decoded third corpus; S740, determining whether the similarity between the training corpus and the third corpus is greater than or equal to a preset similarity threshold; S750, if the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold, determining that training of the self-encoding structure is complete; S760, if the similarity between the training corpus and the third corpus is less than the preset similarity threshold, adjusting the parameters of the self-encoding function and continuing to train it until the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold.
Specifically, before the self-encoding structure is used to learn the hidden features of text, it must be trained; once the hidden features it extracts from the corpus meet the accuracy requirement, training is complete. The trained self-encoding network structure can then be used for text feature extraction: the hidden features of text are learned according to the self-encoding structure so that the extracted hidden features of the corpus can be used for modeling and other purposes.
Further, when the self-encoding structure is trained, its loss function is MSE (mean-square error), which computes the sum of squared distances between predicted values and true values; the training method is ADAM (Adaptive Moment Estimation); and the learning rate, which controls how quickly the model learns, is 0.001. The trained self-encoding network structure can then be used to extract the hidden features of text. Specifically, the self-encoding structure is trained as follows (a minimal code sketch putting these steps together follows step 6 below):
1) Obtain the training corpus; here the training corpus is plain text, for example: cat 1, dog 1, dog 3, person, cat 2, dog 2.
2) Convert the training corpus into word vectors through the word embedding layer, that is, convert the text corpus into word vectors. For example, the training text corpus above becomes, after conversion: 1' (cat 1), 2' (dog 1), 2" (dog 3), 3 (person), 1" (cat 2), 2"' (dog 2).
3) Extract the word vector features of the word vectors through the convolutional neural network to obtain an unsupervised clustered representation; that is, the word vectors converted from the training corpus are put through feature extraction and classification by the convolutional neural network to obtain the features of the training corpus. For example, the word vector features obtained from the word vectors above are: 1' and 1" (cat 1, cat 2); 2', 2" and 2"' (dog 1, dog 2, dog 3); 3 (person).
4) Encode the word vector features of the training corpus through the self-encoding function to extract their hidden features. Specifically, the word vector features of the training corpus are encoded by the hidden layer to learn the hidden features of the training corpus: a self-encoding structure is built on the output of the convolutional neural network, and the word vector features of the training corpus are fed through the input layer of the self-encoding structure into its hidden layer, that is, into the self-encoding function, for encoding. The meaning of the text corpus is thereby represented numerically, a representation that is implicit relative to the text form. For example, the hidden features learned from the training corpus above are: 1 (1' and 1"), 2 (2', 2" and 2"'), 3 (3).
5) Decode the hidden features to obtain a decoded third corpus. Specifically, the hidden features of the training corpus are decoded through the output layer of the self-encoding structure to obtain the decoded third corpus; that is, the neural network of the self-encoding structure restores the numeric form of the hidden features to text form, and decoding succeeds when the restored corpus and the text of the original training corpus meet the similarity requirement. For example, the hidden features above decode to: cat 1, cat 2, dog 1, dog 2, dog 3, person, or cat 1, dog 1, dog 3, person, cat 2, dog 2.
6) Determine whether the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold. If it is, training of the self-encoding structure is complete; if it is less than the threshold, adjust the parameters of the self-encoding function and continue training until the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold.
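Putting steps 1) through 6) together, a minimal, non-authoritative training sketch might look like the following. It assumes PyTorch, uses cosine similarity as the similarity measure (the embodiments do not fix a particular measure), and treats the layer sizes, corpus size, and threshold as illustrative values; only the MSE loss, the ADAM optimizer, and the 0.001 learning rate come from the embodiment above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, HID = 64, 16          # assumed feature / hidden-layer sizes
THRESHOLD = 0.95            # assumed preset similarity threshold

# Self-encoding structure: the encoder produces the hidden features (S720),
# the decoder reconstructs the word vector features (S730).
autoencoder = nn.Sequential(
    nn.Linear(FEAT, HID), nn.ReLU(),
    nn.Linear(HID, FEAT),
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=0.001)  # ADAM, lr 0.001
loss_fn = nn.MSELoss()      # MSE loss, as in the embodiment

features = torch.randn(512, FEAT)   # stand-in for training-corpus features (S710)

for step in range(10_000):
    reconstruction = autoencoder(features)
    loss = loss_fn(reconstruction, features)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()        # S760: adjust the self-encoding parameters

    # S740/S750: stop once the reconstruction is similar enough to the input.
    sim = F.cosine_similarity(reconstruction.detach(), features, dim=1).mean().item()
    if sim >= THRESHOLD:
        break
```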
During training, the convolutional neural network is already established; it is pre-trained so that it can extract the features of text. The hidden features of text are obtained according to the self-encoding structure: given a piece of corpus, the numeric, word-vector representation obtained for it constitutes its hidden features, namely the features generated by the hidden layer shown in FIG. 5. During training, the self-encoding structure, the convolutional neural network structure, and the word vectors are all updated. Finally, when the similarity between the training corpus and the third corpus meets the preset similarity threshold, the hidden layer of the trained self-encoding structure reflects the hidden features of text and can be used for many purposes.
The embodiments of the present application extract the hidden features of text with an unsupervised algorithm: the text is first converted into pre-trained word vectors, a convolutional neural network extracts the features of the text, and a self-encoding structure is then built on the output of the convolutional neural network to learn the hidden features of the text. During training, the self-encoding structure, the convolutional neural network structure, and the word vectors are all updated, so that the hidden layer of the trained self-encoding structure ultimately reflects the hidden features of text and can serve many purposes. The resulting information can improve the accuracy of subsequent supervised learning and modeling and overcomes the effect of limited training data. In practice, the hidden feature extraction model built by the method of these embodiments suits supervised training with few training samples: because deep learning is prone to overfitting, a small amount of training data severely limits a model's ability to generalize. A hidden feature extraction model can therefore be built from a large amount of unlabelled training data by the method of these embodiments to learn the hidden features of text, and the hidden features from that model can then be combined with labelled training data for supervised learning and modeling, improving its accuracy.
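As a final illustrative sketch of that workflow, the snippet below freezes an encoder assumed to have been trained as above on unlabelled data and fits a small supervised classifier on the hidden features it produces; the layer sizes, class count, and data are placeholders, not details taken from this application.

```python
import torch
import torch.nn as nn

# Assumed: the encoder half of an autoencoder already trained on unlabelled data.
encoder = nn.Sequential(nn.Linear(64, 16), nn.ReLU())
for p in encoder.parameters():
    p.requires_grad = False          # keep the unsupervised hidden features fixed

classifier = nn.Linear(16, 3)        # e.g. three text classes; illustrative
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

labelled_features = torch.randn(100, 64)   # small labelled sample
labels = torch.randint(0, 3, (100,))

for _ in range(200):
    logits = classifier(encoder(labelled_features))
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```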
It should be noted that the hidden feature extraction methods described in the above embodiments may recombine the technical features of different embodiments as needed to obtain combined implementations, all of which fall within the scope of protection claimed by this application.
Please refer to FIG. 8, a schematic block diagram of the hidden feature extraction apparatus provided by an embodiment of the present application. Corresponding to the hidden feature extraction method above, an embodiment of the present application further provides a hidden feature extraction apparatus. As shown in FIG. 8, the apparatus includes units for performing the hidden feature extraction method above and may be configured in a computer device such as a terminal or a server. Specifically, the hidden feature extraction apparatus 800 includes an acquisition unit 801, a conversion unit 802, a first extraction unit 803, and a second extraction unit 804. The acquisition unit 801 acquires a first corpus for hidden feature extraction; the conversion unit 802 performs word embedding on the first corpus to convert it into word vectors; the first extraction unit 803 extracts the word vector features of the word vectors through a convolutional neural network; and the second extraction unit 804 encodes the word vector features by self-encoding to extract their hidden features.
In one embodiment, the second extraction unit 804 encodes the word vector features through a self-encoding function to obtain their hidden features.
Please refer to FIG. 9, another schematic block diagram of the hidden feature extraction apparatus provided by an embodiment of the present application. As shown in FIG. 9, in this embodiment the hidden feature extraction apparatus 800 further includes: a decoding unit 805 for decoding the hidden features to obtain a decoded second corpus; a display unit 806 for displaying the second corpus in a preset form; and a training unit 807 for training the self-encoding function using a training corpus.
Still referring to FIG. 9, in this embodiment the training unit 807 includes: an input subunit 8071 for inputting the word vector features of the training corpus into the self-encoding function; an encoding subunit 8072 for encoding the word vector features of the training corpus through the self-encoding function to extract their hidden features; a decoding subunit 8073 for decoding the hidden features to obtain a decoded third corpus; a judgment subunit 8074 for determining whether the similarity between the training corpus and the third corpus is greater than or equal to a preset similarity threshold; a determination subunit 8075 for determining, if the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold, that training of the self-encoding structure is complete; and an adjustment subunit 8076 for adjusting the parameters of the self-encoding function and continuing to train it, if the similarity between the training corpus and the third corpus is less than the preset similarity threshold, until that similarity is greater than or equal to the preset similarity threshold.
In one embodiment, the display unit displays the second corpus in tabular form or in chart form.
In one embodiment, the conversion unit 802 performs word embedding on the first corpus using a trained preset word vector dictionary to convert the first corpus into word vectors.
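A "trained preset word vector dictionary" of the kind the conversion unit uses can be pictured as a token-to-vector mapping; the toy dictionary below is an assumption for illustration, where a real system would load pre-trained word2vec- or GloVe-style vectors.

```python
import numpy as np

# Toy stand-in for a trained preset word vector dictionary.
word_vector_dict = {
    "cat":  np.array([0.9, 0.1, 0.0]),
    "dog":  np.array([0.8, 0.2, 0.1]),
    "love": np.array([0.0, 0.9, 0.4]),
}
UNK = np.zeros(3)   # fallback vector for out-of-vocabulary tokens

def embed(tokens):
    """Convert a tokenised first corpus into a matrix of word vectors."""
    return np.stack([word_vector_dict.get(t, UNK) for t in tokens])

vectors = embed(["cat", "dog", "love"])   # shape (3, 3)
```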
It should be noted that those skilled in the art can clearly understand that, for the specific implementation of the hidden feature extraction apparatus and its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity, they are not repeated here.
Meanwhile, the division and connection of the units in the hidden feature extraction apparatus above are only illustrative: in other embodiments, the apparatus may be divided into different units as needed, or its units may be connected in a different order or manner, to perform all or part of the functions of the apparatus.
The hidden feature extraction apparatus above may be implemented in the form of a computer program, which can run on the computer device shown in FIG. 10.
Please refer to FIG. 10, a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 1000 may be a computer device such as a desktop computer or a server, or a component or part of another device.
Referring to FIG. 10, the computer device 1000 includes a processor 1002, a memory, and a network interface 1005 connected through a system bus 1001, where the memory may include a non-volatile storage medium 1003 and an internal memory 1004.
The non-volatile storage medium 1003 can store an operating system 10031 and a computer program 10032. When executed, the computer program 10032 can cause the processor 1002 to perform the hidden feature extraction method described above.
The processor 1002 provides the computing and control capabilities that support the operation of the entire computer device 1000.
The internal memory 1004 provides an environment for running the computer program 10032 stored in the non-volatile storage medium 1003; when the computer program 10032 is executed by the processor 1002, it can cause the processor 1002 to perform the hidden feature extraction method described above.
The network interface 1005 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in FIG. 10 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device 1000 to which the solution is applied: a specific computer device 1000 may include more or fewer components than shown, combine certain components, or arrange components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structure and function of the memory and processor are consistent with the embodiment shown in FIG. 10 and are not repeated here.
The processor 1002 runs the computer program 10032 stored in the memory to implement the hidden feature extraction method of the embodiments of the present application.
It should be understood that, in the embodiments of the present application, the processor 1002 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Those of ordinary skill in the art will understand that all or part of the flow of the methods in the above embodiments can be completed by a computer program, which may be stored in a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the steps of the methods described in the above embodiments.
Accordingly, the embodiments of the present application also provide a computer-readable storage medium. It may be a non-volatile computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the steps of the hidden feature extraction method described in the above embodiments.
The storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disc, or any other physical storage medium that can store a computer program.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled practitioners may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of this application.
The above is only a specific implementation of this application, but the scope of protection of this application is not limited to it; any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and such modifications or substitutions shall be covered by the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.

Claims (20)

  1. A hidden feature extraction method, comprising:
    acquiring a first corpus for hidden feature extraction;
    performing word embedding on the first corpus to convert the first corpus into word vectors;
    extracting word vector features of the word vectors through a convolutional neural network;
    encoding the word vector features by self-encoding to extract hidden features of the word vector features.
  2. The hidden feature extraction method according to claim 1, wherein the step of encoding the word vector features by self-encoding to extract the hidden features of the word vector features comprises:
    encoding the word vector features through a self-encoding function to obtain the hidden features of the word vector features.
  3. The hidden feature extraction method according to claim 1, wherein after the step of encoding the word vector features by self-encoding to extract the hidden features of the word vector features, the method further comprises:
    decoding the hidden features to obtain a decoded second corpus.
  4. The hidden feature extraction method according to claim 3, wherein after the step of decoding the hidden features to obtain the decoded second corpus, the method further comprises:
    displaying the second corpus in a preset form.
  5. The hidden feature extraction method according to claim 4, wherein the step of displaying the second corpus in a preset form comprises:
    displaying the second corpus in tabular form or in chart form.
  6. The hidden feature extraction method according to claim 2, wherein before the step of encoding the word vector features through the self-encoding function to obtain the hidden features of the word vector features, the method further comprises:
    training the self-encoding function using a training corpus.
  7. The hidden feature extraction method according to claim 6, wherein the step of training the self-encoding function using a training corpus comprises:
    inputting word vector features of the training corpus into the self-encoding function;
    encoding the word vector features of the training corpus through the self-encoding function to extract hidden features of the word vector features;
    decoding the hidden features to obtain a decoded third corpus;
    determining whether the similarity between the training corpus and the third corpus is greater than or equal to a preset similarity threshold;
    if the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold, determining that training of the self-encoding structure is complete;
    if the similarity between the training corpus and the third corpus is less than the preset similarity threshold, adjusting parameters of the self-encoding function and continuing to train the self-encoding function until the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold.
  8. The hidden feature extraction method according to claim 1, wherein the step of performing word embedding on the first corpus to convert the first corpus into word vectors comprises:
    performing word embedding on the first corpus using a trained preset word vector dictionary to convert the first corpus into word vectors.
  9. A hidden feature extraction apparatus, comprising:
    an acquisition unit, configured to acquire a first corpus for hidden feature extraction;
    a conversion unit, configured to perform word embedding on the first corpus to convert the first corpus into word vectors;
    a first extraction unit, configured to extract word vector features of the word vectors through a convolutional neural network;
    a second extraction unit, configured to encode the word vector features by self-encoding to extract hidden features of the word vector features.
  10. The hidden feature extraction apparatus according to claim 9, wherein the second extraction unit is configured to encode the word vector features through a self-encoding function to obtain the hidden features of the word vector features.
  11. A computer device, comprising a memory and a processor connected to the memory, wherein the memory is configured to store a computer program and the processor is configured to run the computer program stored in the memory to perform the following steps:
    acquiring a first corpus for hidden feature extraction;
    performing word embedding on the first corpus to convert the first corpus into word vectors;
    extracting word vector features of the word vectors through a convolutional neural network;
    encoding the word vector features by self-encoding to extract hidden features of the word vector features.
  12. The computer device according to claim 11, wherein the step of encoding the word vector features by self-encoding to extract the hidden features of the word vector features comprises:
    encoding the word vector features through a self-encoding function to obtain the hidden features of the word vector features.
  13. The computer device according to claim 11, wherein after the step of encoding the word vector features by self-encoding to extract the hidden features of the word vector features, the steps further comprise:
    decoding the hidden features to obtain a decoded second corpus.
  14. The computer device according to claim 13, wherein after the step of decoding the hidden features to obtain the decoded second corpus, the steps further comprise:
    displaying the second corpus in a preset form.
  15. The computer device according to claim 14, wherein the step of displaying the second corpus in a preset form comprises:
    displaying the second corpus in tabular form or in chart form.
  16. The computer device according to claim 12, wherein before the step of encoding the word vector features through the self-encoding function to obtain the hidden features of the word vector features, the steps further comprise:
    training the self-encoding function using a training corpus.
  17. The computer device according to claim 16, wherein the step of training the self-encoding function using a training corpus comprises:
    inputting word vector features of the training corpus into the self-encoding function;
    encoding the word vector features of the training corpus through the self-encoding function to extract hidden features of the word vector features;
    decoding the hidden features to obtain a decoded third corpus;
    determining whether the similarity between the training corpus and the third corpus is greater than or equal to a preset similarity threshold;
    if the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold, determining that training of the self-encoding structure is complete;
    if the similarity between the training corpus and the third corpus is less than the preset similarity threshold, adjusting parameters of the self-encoding function and continuing to train the self-encoding function until the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold.
  18. The computer device according to claim 11, wherein the step of performing word embedding on the first corpus to convert the first corpus into word vectors comprises:
    performing word embedding on the first corpus using a trained preset word vector dictionary to convert the first corpus into word vectors.
  19. A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to implement the following steps:
    acquiring a first corpus for hidden feature extraction;
    performing word embedding on the first corpus to convert the first corpus into word vectors;
    extracting word vector features of the word vectors through a convolutional neural network;
    encoding the word vector features by self-encoding to extract hidden features of the word vector features.
  20. The storage medium according to claim 19, wherein the step of encoding the word vector features by self-encoding to extract the hidden features of the word vector features comprises:
    encoding the word vector features through a self-encoding function to obtain the hidden features of the word vector features.