WO2020140632A1 - Hidden feature extraction method, apparatus, computer device and storage medium - Google Patents

Hidden feature extraction method, apparatus, computer device and storage medium

Info

Publication number
WO2020140632A1
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
word vector
feature
self
hidden
Application number
PCT/CN2019/118242
Other languages
French (fr)
Chinese (zh)
Inventor
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date
2019-01-04 (priority claimed from Chinese application No. 201910007711.4)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020140632A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • The present application relates to the technical field of text classification, and in particular to a hidden feature extraction method, apparatus, computer device, and computer-readable storage medium.
  • Embodiments of the present application provide a hidden feature extraction method, apparatus, computer device, and computer-readable storage medium, which can solve the problem of low text classification efficiency in the conventional technology.
  • An embodiment of the present application provides a hidden feature extraction method, the method including: acquiring a first corpus for hidden feature extraction; performing word embedding on the first corpus to convert the first corpus into word vectors; extracting word vector features of the word vectors through a convolutional neural network; and encoding the word vector features by self-encoding to extract hidden features of the word vector features.
  • An embodiment of the present application further provides a computer device, which includes a memory and a processor; a computer program is stored in the memory, and the hidden feature extraction method is implemented when the processor executes the computer program.
  • An embodiment of the present application further provides a computer-readable storage medium that stores a computer program which, when executed by a processor, causes the processor to perform the hidden feature extraction method.
  • FIG. 1 is a schematic diagram of an application scenario of the hidden feature extraction method provided by an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of the hidden feature extraction method provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of word vectors in the hidden feature extraction method provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of corpus display in the hidden feature extraction method provided by an embodiment of the present application.
  • Each subject in FIG. 1 works as follows: the terminal acquires a first corpus for hidden feature extraction, performs word embedding on the first corpus to convert it into word vectors, extracts the word vector features of the word vectors through a convolutional neural network, and encodes the word vector features by self-encoding to extract the hidden features of the word vector features.
  • FIG. 1 only illustrates a desktop computer as a terminal.
  • the type of the terminal is not limited to that shown in FIG. 1.
  • the terminal may also be an electronic device such as a mobile phone, notebook computer, or tablet computer.
  • the application scenarios of the above implicit feature extraction method are only used to illustrate the technical solution of the present application, and are not used to limit the technical solution of the present application.
  • The server acquires the first corpus for hidden feature extraction.
  • The first corpus may be a preset corpus crawled from a designated website, and the crawling rules may be preset according to actual needs; for example, a rule may target the corpus of a certain web page or the corpus related to a certain subject.
  • The first corpus may also be a corpus provided by a corpus database, such as user data accumulated on a website.
  • Word embedding is a type of word representation in which words with similar meanings have similar representations; it is the general term for methods that map vocabulary to vectors of real numbers.
  • Specifically, word embedding is a class of techniques in which each word is represented as a real-valued vector in a predefined vector space, with every word mapped to a vector.
  • FIG. 3 is a schematic diagram of word vectors in a method for extracting hidden features provided by an embodiment of the present application.
  • The text corpus is converted into pre-trained word vectors; that is, the input natural language is encoded into word vectors in preparation for the pre-trained word vectors.
  • If pre-trained word vectors are used, there are a static method and a non-static method: the static method does not adjust the word vector parameters during TextCNN training, while the non-static method adjusts them during training, so the results of the non-static method are better than those of the static method.
  • TextCNN (Text Convolutional Neural Network) is a text classification model based on a convolutional neural network; that is, it uses a convolutional neural network to classify text.
  • Rather than adjusting the embedding layer in every batch, it can be adjusted once every 100 batches, which reduces training time while still fine-tuning the word vectors.
  • the first corpus can be word embedded using a trained preset word vector dictionary to convert the first corpus into word vectors.
  • the word vector may use Word2Vec pre-trained word vectors, that is, each vocabulary has a corresponding vector representation, and such vector representations can express vocabulary information in data form.
  • Word2vec ("word to vector") is a software tool for training word vectors.
  • A convolutional neural network (CNN) is a class of feedforward neural networks with a deep structure that involves convolution or related computations; it is one of the representative algorithms of deep learning. Because a convolutional neural network can perform shift-invariant classification, it is also called a "shift-invariant artificial neural network" (SIANN).
  • a convolutional neural network is established, and the features of the corpus are extracted using the convolutional neural network.
  • Convolutional neural networks capture local text information through multiple scale convolution kernels.
  • The height of the first-layer convolution kernels can be chosen from several scales between 1 and 5, corresponding to the number of words captured, while the width stays equal to the word vector dimension.
  • After the first convolutional layer, one-dimensional convolutional layers with heights chosen according to the text length can further refine the information.
  • The self-encoding method refers to encoding through a self-encoding (autoencoder) structure.
  • The self-encoding structure is an unsupervised learning method that uses a neural network to learn hidden features; it is an artificial neural network used for efficient coding in unsupervised learning.
  • The purpose of self-encoding is to learn a representation (an encoding, generally described by numbers) of a set of data; it is usually used for dimensionality reduction, and self-encoding can also be used in generative models of data. Please refer to FIG. 4.
  • FIG. 4 is a schematic diagram of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 4, the self-encoding structure generally includes an input layer, a hidden layer, and an output layer.
  • the input layer receives external input data, encodes through the hidden layer in the middle to learn hidden features, and decodes and outputs the hidden features through the output layer.
  • The hidden layer can be expressed as a functional relationship, such as H_{w,b}(x), where H denotes the hidden feature, x is the input variable, and w and b are parameters.
  • The hidden-layer structure in the self-encoding structure may consist of a single layer (called a single hidden layer) or of multiple layers (called multiple hidden layers).
  • The hidden layer shown in FIG. 4 is a single layer; it may also consist of multiple layers, such as 2, 3, or 4 layers.
  • The self-encoding structure can be built with the TensorFlow library in Python; once built, the network structure can be trained, and the trained self-encoding structure can then be put into use.
  • The word vector features are encoded by a self-encoding function to obtain the hidden features of the word vector features. That is, the terminal encodes the word vector features through the hidden layer of the self-encoding structure to obtain a dimension-reduced numerical description of the first corpus, where the hidden layer refers to a layer that, through unsupervised neural-network learning, converts the text corpus into a numerical representation, implicitly expressing the meaning of the text in non-literal form, so that a large amount of corpus can be extracted and later accurately restored.
  • the hidden layer is an intermediate layer between the input layer and the output layer of the neural network. Each hidden layer contains a certain number of hidden units, and there are connections between the hidden units and the input and output layers.
  • The self-encoding structure can also be understood as the following conversion of a text corpus: 10 dimensions (Chinese characters) → 5 dimensions (numbers) → 10 dimensions (Chinese characters), where the 5-dimensional middle stage means the hidden features of the text occupy 5 dimensions (for example, 5 rows), and training determines how accurately those 5 dimensions represent the text.
  • A neural network realizes the following process: text representation → replacement by the hidden layer with a numerical representation (the meaning of the text expressed in numbers) → restored text representation.
  • FIG. 5 is a schematic flowchart of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 5, a self-encoding network structure is built.
  • the embodiments of the present application belong to the technical field of text classification.
  • When implementing hidden feature extraction, the embodiments of the present application acquire a first corpus for hidden feature extraction, perform word embedding on the first corpus to convert it into word vectors, and extract the word vector features of the word vectors through a convolutional neural network, thereby clustering and describing the corpus with an unsupervised algorithm; the word vector features are then encoded by self-encoding to extract their hidden features, reducing the dimensionality of the corpus data. The hidden features of the corpus are thus extracted through unsupervised learning, which can improve the accuracy of subsequent learning and modeling and overcome the influence of the amount of training data.
  • After the step of decoding the hidden features to obtain a decoded second corpus, the method further includes: displaying the second corpus in a preset form.
  • Since the second corpus has been clustered, it exhibits a certain regularity; it can be displayed in the form of a table or a chart, so that the user can obtain information about the second corpus from the table or graph.
  • Table 1 is an example of the second corpus displayed in table form.
  • FIG. 6 is a schematic diagram of corpus display in the hidden feature extraction method provided by an embodiment of the present application; it shows an example of the second corpus displayed in chart form.
  • Before the step of encoding the word vector features through the self-encoding function to obtain the hidden features of the word vector features, the method further includes: training the self-encoding function using a training corpus.
  • The step of training the self-encoding function using the training corpus includes: S710, inputting the word vector features of the training corpus into the self-encoding function; S720, encoding the word vector features of the training corpus through the self-encoding function to extract the hidden features of the word vector features; S730, decoding the hidden features to obtain a decoded third corpus; S740, determining whether the similarity between the training corpus and the third corpus is greater than or equal to a preset similarity threshold; S750, if the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold, determining that training of the self-encoding structure is complete; S760, if the similarity between the training corpus and the third corpus is less than the preset similarity threshold, adjusting the parameters of the self-encoding function and continuing to train it until the similarity is greater than or equal to the preset similarity threshold.
  • Before the self-encoding structure is used to learn the hidden features of text, it needs to be trained; once the hidden features it extracts from the corpus meet the accuracy requirement, training is complete.
  • The trained self-encoding network structure can then be used for text feature extraction, learning the hidden features of the text so that the extracted hidden features of the corpus can be used for modeling and other purposes.
  • During training, the loss function of the self-encoding structure is MSE (mean squared error), which measures the average of the squared differences between predicted values and true values.
  • The training method is Adam (adaptive moment estimation), with a learning rate of 0.001.
  • The learning rate controls the learning progress of the model.
  • The trained self-encoding network structure can then be used to extract the hidden features of text. Specifically, the training process of the self-encoding structure is as follows:
  • The training corpus is a text corpus; for example, the obtained training corpus includes: cat 1, dog 1, dog 3, person, cat 2, dog 2.
  • The training corpus is converted into word vectors through the word embedding layer; that is, the text corpus is converted into word vectors. For example, after conversion the above training corpus becomes: 1′ (cat 1), 2′ (dog 1), 2″ (dog 3), 3 (person), 1″ (cat 2), 2‴ (dog 2).
  • The word vector features of the word vectors are extracted through the convolutional neural network to achieve an unsupervised clustering representation; that is, the word vectors converted from the training corpus are extracted and grouped by the convolutional neural network to obtain the features of the training corpus. For example, the word vector features obtained from the above word vectors are: 1′ and 1″ (cat 1, cat 2); 2′, 2″ and 2‴ (dog 1, dog 2, dog 3); 3 (person).
  • The word vector features of the training corpus are encoded by the hidden layer to learn the hidden features of the training corpus, with the self-encoding structure established on the output of the convolutional neural network. That is, the word vector features of the training corpus are fed through the input layer of the self-encoding structure into its hidden layer, i.e., input into the self-encoding function for encoding, so that the meaning of the text corpus is expressed in numerical form, which is an implicit representation relative to the text form.
  • For example, the hidden features learned from the above training corpus are: 1 (1′ and 1″), 2 (2′, 2″ and 2‴), 3 (3).
  • The hidden features of the training corpus are decoded through the output layer of the self-encoding structure to obtain the decoded third corpus; that is, the numerical form of the hidden features is restored to text form through the neural network of the self-encoding structure.
  • Decoding is achieved when the restored corpus and the text of the original training corpus meet the similarity requirement; that is, the numerical hidden features are restored to textual meaning through the self-encoding structure, and the final restored content is required to meet the similarity requirement with the original text.
  • For example, the corpus restored from the above hidden features is: cat 1, cat 2, dog 1, dog 2, dog 3, person, or cat 1, dog 1, dog 3, person, cat 2, dog 2.
  • The convolutional neural network is established during the training process; it is pre-trained so that feature extraction of the text can be realized using the convolutional neural network.
  • The hidden features of the text refer to the features generated by the hidden layer shown in FIG. 5. During the training process, the self-encoding structure, the convolutional neural network structure, and the word vectors are all updated. Finally, when the similarity between the training corpus and the third corpus meets the preset similarity threshold, the hidden layer of the trained self-encoding structure can reflect the hidden features of the text and can be used for multiple purposes.
  • The embodiment of the present application extracts the hidden features of text using an unsupervised algorithm: the text is first converted into pre-trained word vectors, the convolutional neural network extracts the features of the text, and a self-encoding structure is then built on the output of the convolutional neural network to learn the hidden features of the text.
  • During training, the self-encoding structure, the convolutional neural network structure, and the word vectors are updated. The hidden layer of the trained self-encoding structure can reflect the hidden features of the text, extracting them through an unsupervised algorithm, and can be used for multiple purposes; the information obtained can improve the accuracy of subsequent supervised learning modeling and overcome the impact of the amount of training data.
  • The hidden feature extraction model established by the method of the embodiments of the present application is suitable for supervised training with a small number of training samples. Since deep learning is prone to overfitting, a small amount of training sample data will seriously affect the generalization ability of the model. Therefore, through the method of the embodiments of the present application, a hidden feature extraction model can be built from a large amount of unlabeled training data to learn the hidden features of the text; the hidden features from this model are then combined with annotated training data for supervised learning modeling, improving the accuracy of supervised learning modeling.
  • FIG. 8 is a schematic block diagram of an apparatus for extracting hidden features provided by an embodiment of the present application.
  • an embodiment of the present application further provides a hidden feature extraction device.
  • the hidden feature extraction device includes a unit for performing the aforementioned hidden feature extraction method, and the device may be configured in a computer device such as a terminal or a server.
  • the hidden feature extraction device 800 includes an acquisition unit 801, a conversion unit 802, a first extraction unit 803 and a second extraction unit 804.
  • The obtaining unit 801 is used to obtain a first corpus for hidden feature extraction; the conversion unit 802 is used to perform word embedding on the first corpus to convert it into word vectors; the first extraction unit 803 is used to extract word vector features of the word vectors through a convolutional neural network; and the second extraction unit 804 is used to encode the word vector features by self-encoding to extract hidden features of the word vector features.
  • the second extraction unit 804 is configured to encode the word vector feature through a self-encoding function to obtain the hidden feature of the word vector feature.
  • The hidden feature extraction device 800 further includes: a decoding unit 805 for decoding the hidden features to obtain a decoded second corpus; a display unit 806 for displaying the second corpus in a preset form; and a training unit 807 for training the self-encoding function using a training corpus.
  • The training unit 807 includes: an input subunit 8071 for inputting the word vector features of the training corpus into the self-encoding function; an encoding subunit 8072 for encoding the word vector features of the training corpus through the self-encoding function to extract the hidden features of the word vector features; a decoding subunit 8073 for decoding the hidden features to obtain a decoded third corpus; a judgment subunit 8074 for judging whether the similarity between the training corpus and the third corpus is greater than or equal to a preset similarity threshold; a determination subunit 8075 for determining that training of the self-encoding structure is complete if the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold; and an adjustment subunit 8076 for adjusting the parameters of the self-encoding function and continuing training if the similarity between the training corpus and the third corpus is less than the preset similarity threshold.
  • the display unit is configured to display the second corpus in a table form or a chart form.
  • the conversion unit 802 is configured to embed the first corpus into words using a trained preset word vector dictionary to convert the first corpus into word vectors.
  • the division and connection of the units in the above hidden feature extraction device are only for illustration.
  • The hidden feature extraction device may be divided into different units as needed, or the units in the device may adopt different connection sequences and methods, to complete all or part of the functions of the hidden feature extraction device.
  • the above-mentioned hidden feature extraction device may be implemented in the form of a computer program, and the computer program may run on the computer device shown in FIG. 10.
  • FIG. 10 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 1000 may be a computer device such as a desktop computer or a server, or may be a component or part in other devices.
  • the computer device 1000 includes a processor 1002, a memory, and a network interface 1005 connected through a system bus 1001, where the memory may include a non-volatile storage medium 1003 and an internal memory 1004.
  • the non-volatile storage medium 1003 can store an operating system 10031 and a computer program 10032.
  • When executed, the computer program 10032 may cause the processor 1002 to perform the aforementioned hidden feature extraction method.
  • the processor 1002 is used to provide computing and control capabilities to support the operation of the entire computer device 1000.
  • the internal memory 1004 provides an environment for the operation of the computer program 10032 in the non-volatile storage medium 1003.
  • the processor 1002 can execute the above-mentioned hidden feature extraction method.
  • the network interface 1005 is used for network communication with other devices.
  • the structure shown in FIG. 10 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 1000 to which the solution of the present application is applied.
  • the specific computer device 1000 may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 10, and details are not described herein again.
  • the processor 1002 is used to run the computer program 10032 stored in the memory to implement the hidden feature extraction method in the embodiment of the present application.
  • the processor 1002 may be a central processing unit (Central Processing Unit, CPU), and the processor 1002 may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), Application specific integrated circuit (Application Specific Integrated Circuit, ASIC), ready-made programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • When executed by the processor, the computer program causes the processor to perform the steps of the hidden feature extraction method described in the foregoing embodiments.
  • The storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc, or any of various other physical storage media that can store a computer program.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

Provided by the embodiments of the present application are a hidden feature extraction method, an apparatus, a computer device, and a computer-readable storage medium. The embodiments of the present application relate to the technical field of text classification. In the embodiments of the present application, when hidden feature extraction is performed, a first corpus for performing hidden feature extraction is acquired, word embedding is performed on the first corpus so as to convert the first corpus into a word vector, a word vector feature of the word vector is extracted by means of a convolutional neural network, the word vector is clustered and described by using an unsupervised algorithm, and then the word vector feature is encoded by means of self-encoding so as to extract a hidden feature of the word vector feature.

Description

Hidden feature extraction method, apparatus, computer device and storage medium
This application claims priority to Chinese patent application No. 201910007711.4, filed with the China Patent Office on January 4, 2019 and entitled "Hidden feature extraction method, apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of text classification, and in particular to a hidden feature extraction method, apparatus, computer device, and computer-readable storage medium.
Background
The traditional text classification model is a supervised learning model. A supervised learning model refers to the process of adjusting the parameters of a classifier using a set of samples of known categories so that it achieves the required performance; it is also called a supervised training model or learning-with-a-teacher model. A supervised learning model therefore classifies text based on samples of known categories, so text classification with a supervised learning model requires a large amount of labeled data in order to classify the text according to the annotations, and processing that large amount of labeled data makes text classification relatively inefficient.
Summary of the Invention
Embodiments of the present application provide a hidden feature extraction method, apparatus, computer device, and computer-readable storage medium, which can solve the problem of low text classification efficiency in the conventional technology.
In a first aspect, an embodiment of the present application provides a hidden feature extraction method. The method includes: acquiring a first corpus for hidden feature extraction; performing word embedding on the first corpus to convert the first corpus into word vectors; extracting word vector features of the word vectors through a convolutional neural network; and encoding the word vector features by self-encoding to extract hidden features of the word vector features.
In a second aspect, an embodiment of the present application further provides a hidden feature extraction apparatus, including: an acquiring unit for acquiring a first corpus for hidden feature extraction; a conversion unit for performing word embedding on the first corpus to convert the first corpus into word vectors; a first extraction unit for extracting word vector features of the word vectors through a convolutional neural network; and a second extraction unit for encoding the word vector features by self-encoding to extract hidden features of the word vector features.
In a third aspect, an embodiment of the present application further provides a computer device, including a memory and a processor; a computer program is stored in the memory, and the processor implements the hidden feature extraction method when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the hidden feature extraction method.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description illustrate some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
FIG. 1 is a schematic diagram of an application scenario of the hidden feature extraction method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the hidden feature extraction method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of word vectors in the hidden feature extraction method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of corpus display in the hidden feature extraction method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a sub-process in the hidden feature extraction method provided by an embodiment of the present application;
FIG. 8 is a schematic block diagram of the hidden feature extraction apparatus provided by an embodiment of the present application;
FIG. 9 is another schematic block diagram of the hidden feature extraction apparatus provided by an embodiment of the present application; and
FIG. 10 is a schematic block diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.
Please refer to FIG. 1, which is a schematic diagram of an application scenario of the hidden feature extraction method provided by an embodiment of the present application. The application scenario includes: (1) a terminal. An application program is installed on the terminal shown in FIG. 1, and a developer performs the steps of the hidden feature extraction method through the application program. The terminal may be an electronic device such as a notebook computer, tablet computer, or desktop computer, and the terminal application environment shown in FIG. 1 may also be replaced with a computer device such as a server. If the application environment in FIG. 1 is a server, the server may be a server cluster or a cloud server. The server cluster may also adopt a distributed system, whose servers may in turn include a master server and slave servers, so that the master server performs the steps of the hidden feature extraction method using the obtained corpus.
Each subject in FIG. 1 works as follows: the terminal acquires a first corpus for hidden feature extraction, performs word embedding on the first corpus to convert it into word vectors, extracts the word vector features of the word vectors through a convolutional neural network, and encodes the word vector features by self-encoding to extract the hidden features of the word vector features.
It should be noted that FIG. 1 only illustrates a desktop computer as the terminal. In actual operation, the type of the terminal is not limited to that shown in FIG. 1; the terminal may also be an electronic device such as a mobile phone, notebook computer, or tablet computer. The above application scenario of the hidden feature extraction method is only used to illustrate the technical solution of the present application and is not intended to limit it.
FIG. 2 is a schematic flowchart of the hidden feature extraction method provided by an embodiment of the present application. The hidden feature extraction method is applied to the terminal in FIG. 1 to complete all or part of the functions of the method.
Please refer to FIG. 2, which is a schematic flowchart of the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 2, the method includes the following steps S210-S240:
S210: Acquire a first corpus for hidden feature extraction.
Specifically, the server acquires the first corpus for hidden feature extraction. The first corpus may be a preset corpus crawled from a designated website on the web; the crawling rules may be preset according to actual needs, for example targeting the corpus of a certain web page or the corpus related to a certain subject. The first corpus may also be a corpus provided by a corpus database, such as user data accumulated on a website.
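As one illustration of such a crawling rule, the following sketch collects paragraph text from one designated page. It is only a sketch: the requests and BeautifulSoup libraries and the URL are assumptions for illustration, not part of the embodiment.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical crawling rule: collect all paragraph text from one designated page.
url = "https://example.com/corpus-page"  # placeholder URL
html = requests.get(url, timeout=10).text
first_corpus = [p.get_text(strip=True)
                for p in BeautifulSoup(html, "html.parser").find_all("p")]
```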
S220: Perform word embedding on the first corpus to convert the first corpus into word vectors.
Word embedding is a type of word representation in which words with similar meanings have similar representations; it is the general term for methods that map vocabulary to vectors of real numbers. Specifically, word embedding is a class of techniques in which each word is represented as a real-valued vector in a predefined vector space, with every word mapped to a vector. Please refer to FIG. 3, a schematic diagram of word vectors in the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 3, suppose a text contains the words "cat", "dog", and "love", and these words are mapped into a vector space: the vector for "cat" is (0.1, 0.2, 0.3), the vector for "dog" is (0.2, 0.2, 0.4), and the vector for "love" is (-0.4, -0.5, -0.2) (these values are only illustrative). Mapping a text X{x1, x2, x3, x4, x5, ..., xn} to a multidimensional vector space Y{y1, y2, y3, y4, y5, ..., yn} in this way is called word embedding. The reason for turning each word into a vector is to facilitate computation. Take the three words "cat", "dog", and "love": a person knows that "cat" and "dog" both denote animals while "love" denotes an emotion, but to a machine these three words are merely binary strings of 0s and 1s that cannot be computed with. Once word embedding converts the words into word vectors, the machine can compute on them and obtain the similarity between words from the cosine of the angle between their vectors; for example, in FIG. 3, since cos α < cos β, "cat" and "dog" are more similar, while "cat" and "love" differ considerably.
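The cosine comparison just described can be reproduced in a few lines of Python. The sketch below uses the illustrative vectors from FIG. 3 (numpy assumed; the vector values are demonstration figures only, not real embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative vectors from FIG. 3 (for demonstration only).
cat = np.array([0.1, 0.2, 0.3])
dog = np.array([0.2, 0.2, 0.4])
love = np.array([-0.4, -0.5, -0.2])

print(cosine_similarity(cat, dog))   # close to 1: "cat" and "dog" are similar
print(cosine_similarity(cat, love))  # negative: "cat" and "love" differ
```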
Specifically, the text corpus is converted into pre-trained word vectors; that is, the input natural language is encoded into word vectors in preparation for the pre-trained word vectors. In a specific implementation, pre-trained word vectors can be used, or a set of word vectors can be trained directly while training TextCNN; however, using pre-trained word vectors is more than 100 times faster than training a set of word vectors during TextCNN training. If pre-trained word vectors are used, there are a static method and a non-static method: the static method does not adjust the word vector parameters during TextCNN training, while the non-static method adjusts them during training, so the results of the non-static method are better than those of the static method. TextCNN (Text Convolutional Neural Network) is a text classification model based on a convolutional neural network; that is, it uses a convolutional neural network to classify text.
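The static/non-static distinction corresponds to whether the embedding weights are trainable. A sketch in Keras terms (an assumed library; the vocabulary size and dimension are hypothetical):

```python
import tensorflow as tf

# Static method: the pre-trained word vectors are frozen during TextCNN training.
static_embedding = tf.keras.layers.Embedding(
    input_dim=50000, output_dim=300, trainable=False)

# Non-static method: the same layer, but fine-tuned during training,
# which generally yields better results.
non_static_embedding = tf.keras.layers.Embedding(
    input_dim=50000, output_dim=300, trainable=True)
```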
Furthermore, instead of adjusting the embedding layer in every batch, it can be adjusted once every 100 batches, which reduces training time while still fine-tuning the word vectors.
Further still, the first corpus can be word-embedded using a trained preset word vector dictionary to convert the first corpus into word vectors. In one embodiment, the word vectors may be Word2Vec pre-trained word vectors; that is, each vocabulary item has a corresponding vector representation, and such vector representations can express vocabulary information in data form. Word2vec ("word to vector") is a software tool for training word vectors.
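As an illustration of this lookup, the sketch below maps tokens to vectors through a pre-trained Word2Vec dictionary. The gensim library, the file name, and the fallback for unknown words are assumptions; the embodiment only specifies Word2Vec-style pre-trained vectors.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to a pre-trained Word2Vec word vector dictionary.
wv = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

def embed(tokens, dim=300):
    """Map each token to its pre-trained vector; unknown words fall back to zeros."""
    return np.stack([wv[t] if t in wv else np.zeros(dim, dtype=np.float32)
                     for t in tokens])

word_vectors = embed(["cat", "dog", "love"])  # shape: (3, dim)
```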
S230: Extract word vector features of the word vectors through a convolutional neural network.
A convolutional neural network (CNN) is a class of feedforward neural networks with a deep structure that involves convolution or related computations; it is one of the representative algorithms of deep learning. Because a convolutional neural network can perform shift-invariant classification, it is also called a "shift-invariant artificial neural network" (SIANN).
Specifically, a convolutional neural network is established, and the features of the corpus are extracted with it. The convolutional neural network captures local text information through convolution kernels of multiple scales. In practice, the height of the first-layer convolution kernels can be chosen from several scales between 1 and 5, corresponding to the number of words captured, while the width stays equal to the word vector dimension. After the first convolutional layer, one-dimensional convolutional layers with heights chosen according to the text length can further refine the information.
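A minimal sketch of this multi-scale convolution step, assuming TensorFlow/Keras (consistent with the TensorFlow library mentioned below); the sequence length, embedding dimension, and filter count are hypothetical values:

```python
import tensorflow as tf

def build_text_feature_extractor(seq_len=50, emb_dim=300, filters=128):
    """Multi-scale convolution over a sequence of word vectors.

    Kernel sizes 1-5 correspond to the number of words each kernel captures;
    Conv1D convolves across the full embedding dimension, matching the design
    in which the kernel width equals the word vector dimension.
    """
    inputs = tf.keras.Input(shape=(seq_len, emb_dim))
    branches = []
    for k in range(1, 6):  # kernel heights 1..5
        x = tf.keras.layers.Conv1D(filters, k, activation="relu",
                                   padding="same")(inputs)
        x = tf.keras.layers.GlobalMaxPooling1D()(x)
        branches.append(x)
    features = tf.keras.layers.Concatenate()(branches)  # word vector features
    return tf.keras.Model(inputs, features)

extractor = build_text_feature_extractor()
```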
S240: Encode the word vector features by self-encoding to extract hidden features of the word vector features.
The self-encoding method refers to encoding through a self-encoding (autoencoder) structure. The self-encoding structure is an unsupervised learning method that uses a neural network to learn hidden features; it is an artificial neural network used for efficient coding in unsupervised learning. The purpose of self-encoding is to learn a representation (an encoding, generally described by numbers) of a set of data; it is usually used for dimensionality reduction, and self-encoding can also be used in generative models of data. Please refer to FIG. 4, a schematic diagram of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 4, the self-encoding structure generally includes an input layer, a hidden layer, and an output layer. The input layer receives external input data, the hidden layer in the middle encodes it to learn hidden features, and the output layer decodes and outputs the hidden features. The hidden layer can be expressed as a functional relationship, such as H_{w,b}(x), where H denotes the hidden feature, x is the input variable, and w and b are parameters. The hidden-layer structure in the self-encoding structure may consist of a single layer (called a single hidden layer) or of multiple layers (called multiple hidden layers); the hidden layer shown in FIG. 4 is a single layer, but it may also have 2, 3, 4, or more layers. The self-encoding structure can be built with the TensorFlow library in Python; once built, the network structure can be trained, and the trained self-encoding structure can then be put into use.
Specifically, after the self-encoding structure is built, it is established on the output of the convolutional neural network. In the embodiments of the present application, both the input and the output of the self-encoding structure are the output information of the convolutional neural network, while the one or more hidden layers in the middle can be regarded as the hidden features. Through encoding, the trained self-encoding structure converts the input into hidden information, and by decoding the hidden information it obtains an output close to the original input. In this case, the hidden-layer units can record a large amount of textual information.
In one embodiment, the word vector features are encoded by a self-encoding function to obtain the hidden features of the word vector features. That is, the terminal encodes the word vector features through the hidden layer of the self-encoding structure to obtain a dimension-reduced numerical description of the first corpus, where the hidden layer refers to a layer that, through unsupervised neural-network learning, converts the text corpus into a numerical representation, implicitly expressing the meaning of the text in non-literal form, so that a large amount of corpus can be extracted and later accurately restored. The hidden layer is an intermediate layer between the input layer and the output layer of the neural network; each hidden layer contains a certain number of hidden units, and the hidden units are connected to the input and output layers. The self-encoding structure can also be understood as the following conversion of a text corpus: 10 dimensions (Chinese characters) → 5 dimensions (numbers) → 10 dimensions (Chinese characters), where the 5-dimensional middle stage means the hidden features of the text occupy 5 dimensions (for example, 5 rows), and training determines how accurately those 5 dimensions represent the text. A neural network realizes the following process: text representation → replacement by the hidden layer with a numerical representation (the meaning of the text expressed in numbers) → restored text representation. Please refer to FIG. 5, a schematic flowchart of the self-encoding structure in the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 5, a self-encoding network structure is built: for example, the input and output dimensions of the convolutional neural network are both 384×1, i.e., a structure of 384 rows and 1 column, and the goal is to learn hidden features of dimension 100×1, i.e., a structure of 100 rows and 1 column. The input and output layers of the self-encoding network structure both have 384 dimensions, and the self-encoding structure contains 3 layers in total: the two 384-dimensional input and output layers, and a 100-dimensional intermediate hidden layer, where the intermediate hidden layer may comprise more than one layer and can be set to multiple layers according to actual needs, such as 2, 3, or 4 layers.
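A minimal sketch of the 384 → 100 → 384 structure just described, assuming the TensorFlow (Keras) library mentioned in the text; the single 100-dimensional hidden layer matches FIG. 5, while the activation choice is an assumption:

```python
import tensorflow as tf

# Encoder: 384-dimensional CNN output -> 100-dimensional hidden features.
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(384,)),
    tf.keras.layers.Dense(100, activation="relu", name="hidden_features"),
])

# Decoder: restore a 384-dimensional representation from the hidden features.
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(384, name="reconstruction"),
])

# Self-encoding structure: input and output are both the CNN's output information.
autoencoder = tf.keras.Sequential([encoder, decoder])
```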
Furthermore, after the word vector features are encoded by self-encoding to extract their hidden features, the hidden features can subsequently be decoded when needed to obtain a decoded second corpus. Specifically, in big data processing, a large amount of raw data can be encoded by the self-encoding structure so that the data is dimension-reduced and compressed, reducing the size of the corpus, facilitating data storage, and improving reading efficiency. When the big data later needs to be analyzed to extract the information it implies, the hidden features can be decoded to obtain the decoded second corpus. For example, a shopping website inevitably accumulates a large amount of user purchase data; to facilitate storage and analysis of the data, the hidden features of that large amount of data can be learned. When the user base later needs to be analyzed to learn user information such as purchasing habits and preferences, the hidden features obtained from the large amount of raw user data can be decoded to obtain the decoded second corpus. Since the second corpus has undergone cluster analysis and compression, the efficiency of analyzing and processing the corpus can be improved.
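Continuing the sketch above, compressing a corpus to its hidden features for storage and decoding it later for analysis might look as follows (the feature matrix is a random stand-in for real CNN output):

```python
import numpy as np

# Stand-in for the CNN output features of a large raw corpus.
features = np.random.rand(1000, 384).astype("float32")

# Compress: keep only the 100-dimensional hidden features for storage.
hidden = encoder.predict(features, verbose=0)          # shape: (1000, 100)

# Later analysis: decode the hidden features to obtain the second-corpus
# representation, close to the original 384-dimensional features.
second_corpus_features = decoder.predict(hidden, verbose=0)
```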
The embodiments of the present application belong to the technical field of text classification. When implementing hidden feature extraction, a first corpus for hidden feature extraction is acquired, word embedding is performed on the first corpus to convert it into word vectors, and word vector features of the word vectors are extracted through a convolutional neural network, so that the corpus is clustered and described with an unsupervised algorithm; the word vector features are then encoded by self-encoding to extract their hidden features, reducing the dimensionality of the corpus data. The hidden features of the corpus are thus extracted through unsupervised learning, which can improve the accuracy of subsequent learning and modeling and overcome the influence of the amount of training data.
In one embodiment, after the step of decoding the hidden features to obtain the decoded second corpus, the method further includes:
displaying the second corpus in a preset form.
Specifically, since the corpus has been clustered, the second corpus exhibits a certain regularity; it can be displayed in the form of a table or a chart, so that the user can obtain information about the second corpus from the table or graph. Please refer to Table 1 and FIG. 6: Table 1 is an example of the second corpus displayed in table form, and FIG. 6, a schematic diagram of corpus display in the hidden feature extraction method provided by an embodiment of the present application, is an example of the second corpus displayed in chart form; a sketch of such a chart follows Table 1 below.
Table 1

Topic     Occurrences
Cat       100
Dog       60
Love      80
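Where a chart rather than a table is wanted, something like the matplotlib sketch below could render the topic counts from Table 1; the library and the bar-chart form are assumptions, since the embodiments only require display "in a preset form".

```python
import matplotlib.pyplot as plt

topics = ["Cat", "Dog", "Love"]    # topics from Table 1
occurrences = [100, 60, 80]

plt.bar(topics, occurrences)
plt.xlabel("Topic")
plt.ylabel("Occurrences")
plt.title("Second corpus displayed as a chart")
plt.savefig("second_corpus.png")   # or plt.show() for interactive display
```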
In one embodiment, before the step of encoding the word vector features through the self-encoding function to obtain the hidden features of the word vector features, the method further includes: training the self-encoding function using a training corpus.
Further, please refer to FIG. 7, which is a schematic diagram of a sub-flow in the hidden feature extraction method provided by an embodiment of the present application. As shown in FIG. 7, in this embodiment the step of training the self-encoding function using the training corpus includes: S710, inputting the word vector features of the training corpus into the self-encoding function; S720, encoding the word vector features of the training corpus through the self-encoding function to extract their hidden features; S730, decoding the hidden features to obtain a decoded third corpus; S740, determining whether the similarity between the training corpus and the third corpus is greater than or equal to a preset similarity threshold; S750, if the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold, determining that training of the self-encoding structure is complete; S760, if the similarity between the training corpus and the third corpus is less than the preset similarity threshold, adjusting the parameters of the self-encoding function and continuing to train it until the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold.
Specifically, before the self-encoding structure is used to learn the hidden features of text, it must be trained; once the hidden features it extracts from the corpus meet the accuracy requirement, training is complete. The trained self-encoding network structure can then be used for text feature extraction: the hidden features of text are learned according to the self-encoding structure so that the extracted hidden features of the corpus can be used for modeling and other purposes.
Further, when the self-encoding structure is trained, its loss function is MSE (mean-square error), which computes the sum of squared distances between predicted values and true values; the training method is ADAM (Adaptive Moment Estimation); and the learning rate, which controls how quickly the model learns, is 0.001. The trained self-encoding network structure can then be used to extract the hidden features of text. Specifically, the self-encoding structure is trained as follows (a minimal code sketch putting these steps together follows step 6 below):
1) Obtain the training corpus; here the training corpus is plain text, for example: cat 1, dog 1, dog 3, person, cat 2, dog 2.
2) Convert the training corpus into word vectors through the word embedding layer, that is, convert the text corpus into word vectors. For example, the training text corpus above becomes, after conversion: 1' (cat 1), 2' (dog 1), 2" (dog 3), 3 (person), 1" (cat 2), 2"' (dog 2).
3) Extract the word vector features of the word vectors through the convolutional neural network to obtain an unsupervised clustered representation; that is, the word vectors converted from the training corpus are put through feature extraction and classification by the convolutional neural network to obtain the features of the training corpus. For example, the word vector features obtained from the word vectors above are: 1' and 1" (cat 1, cat 2); 2', 2" and 2"' (dog 1, dog 2, dog 3); 3 (person).
4) Encode the word vector features of the training corpus through the self-encoding function to extract their hidden features. Specifically, the word vector features of the training corpus are encoded by the hidden layer to learn the hidden features of the training corpus: a self-encoding structure is built on the output of the convolutional neural network, and the word vector features of the training corpus are fed through the input layer of the self-encoding structure into its hidden layer, that is, into the self-encoding function, for encoding. The meaning of the text corpus is thereby represented numerically, a representation that is implicit relative to the text form. For example, the hidden features learned from the training corpus above are: 1 (1' and 1"), 2 (2', 2" and 2"'), 3 (3).
5) Decode the hidden features to obtain a decoded third corpus. Specifically, the hidden features of the training corpus are decoded through the output layer of the self-encoding structure to obtain the decoded third corpus; that is, the neural network of the self-encoding structure restores the numeric form of the hidden features to text form, and decoding succeeds when the restored corpus and the text of the original training corpus meet the similarity requirement. For example, the hidden features above decode to: cat 1, cat 2, dog 1, dog 2, dog 3, person, or cat 1, dog 1, dog 3, person, cat 2, dog 2.
6) Determine whether the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold. If it is, training of the self-encoding structure is complete; if it is less than the threshold, adjust the parameters of the self-encoding function and continue training until the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold.
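Putting steps 1) through 6) together, a minimal, non-authoritative training sketch might look like the following. It assumes PyTorch, uses cosine similarity as the similarity measure (the embodiments do not fix a particular measure), and treats the layer sizes, corpus size, and threshold as illustrative values; only the MSE loss, the ADAM optimizer, and the 0.001 learning rate come from the embodiment above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, HID = 64, 16          # assumed feature / hidden-layer sizes
THRESHOLD = 0.95            # assumed preset similarity threshold

# Self-encoding structure: the encoder produces the hidden features (S720),
# the decoder reconstructs the word vector features (S730).
autoencoder = nn.Sequential(
    nn.Linear(FEAT, HID), nn.ReLU(),
    nn.Linear(HID, FEAT),
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=0.001)  # ADAM, lr 0.001
loss_fn = nn.MSELoss()      # MSE loss, as in the embodiment

features = torch.randn(512, FEAT)   # stand-in for training-corpus features (S710)

for step in range(10_000):
    reconstruction = autoencoder(features)
    loss = loss_fn(reconstruction, features)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()        # S760: adjust the self-encoding parameters

    # S740/S750: stop once the reconstruction is similar enough to the input.
    sim = F.cosine_similarity(reconstruction.detach(), features, dim=1).mean().item()
    if sim >= THRESHOLD:
        break
```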
During training, the convolutional neural network is already established; it is pre-trained so that it can extract the features of text. The hidden features of text are obtained according to the self-encoding structure: given a piece of corpus, the numeric, word-vector representation obtained for it constitutes its hidden features, namely the features generated by the hidden layer shown in FIG. 5. During training, the self-encoding structure, the convolutional neural network structure, and the word vectors are all updated. Finally, when the similarity between the training corpus and the third corpus meets the preset similarity threshold, the hidden layer of the trained self-encoding structure reflects the hidden features of text and can be used for many purposes.
The embodiments of the present application extract the hidden features of text with an unsupervised algorithm: the text is first converted into pre-trained word vectors, a convolutional neural network extracts the features of the text, and a self-encoding structure is then built on the output of the convolutional neural network to learn the hidden features of the text. During training, the self-encoding structure, the convolutional neural network structure, and the word vectors are all updated, so that the hidden layer of the trained self-encoding structure ultimately reflects the hidden features of text and can serve many purposes. The resulting information can improve the accuracy of subsequent supervised learning and modeling and overcomes the effect of limited training data. In practice, the hidden feature extraction model built by the method of these embodiments suits supervised training with few training samples: because deep learning is prone to overfitting, a small amount of training data severely limits a model's ability to generalize. A hidden feature extraction model can therefore be built from a large amount of unlabelled training data by the method of these embodiments to learn the hidden features of text, and the hidden features from that model can then be combined with labelled training data for supervised learning and modeling, improving its accuracy.
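As a final illustrative sketch of that workflow, the snippet below freezes an encoder assumed to have been trained as above on unlabelled data and fits a small supervised classifier on the hidden features it produces; the layer sizes, class count, and data are placeholders, not details taken from this application.

```python
import torch
import torch.nn as nn

# Assumed: the encoder half of an autoencoder already trained on unlabelled data.
encoder = nn.Sequential(nn.Linear(64, 16), nn.ReLU())
for p in encoder.parameters():
    p.requires_grad = False          # keep the unsupervised hidden features fixed

classifier = nn.Linear(16, 3)        # e.g. three text classes; illustrative
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

labelled_features = torch.randn(100, 64)   # small labelled sample
labels = torch.randint(0, 3, (100,))

for _ in range(200):
    logits = classifier(encoder(labelled_features))
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```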
It should be noted that the hidden feature extraction methods described in the above embodiments may recombine the technical features of different embodiments as needed to obtain combined implementations, all of which fall within the scope of protection claimed by this application.
Please refer to FIG. 8, a schematic block diagram of the hidden feature extraction apparatus provided by an embodiment of the present application. Corresponding to the hidden feature extraction method above, an embodiment of the present application further provides a hidden feature extraction apparatus. As shown in FIG. 8, the apparatus includes units for performing the hidden feature extraction method above and may be configured in a computer device such as a terminal or a server. Specifically, the hidden feature extraction apparatus 800 includes an acquisition unit 801, a conversion unit 802, a first extraction unit 803, and a second extraction unit 804. The acquisition unit 801 acquires a first corpus for hidden feature extraction; the conversion unit 802 performs word embedding on the first corpus to convert it into word vectors; the first extraction unit 803 extracts the word vector features of the word vectors through a convolutional neural network; and the second extraction unit 804 encodes the word vector features by self-encoding to extract their hidden features.
In one embodiment, the second extraction unit 804 encodes the word vector features through a self-encoding function to obtain their hidden features.
Please refer to FIG. 9, another schematic block diagram of the hidden feature extraction apparatus provided by an embodiment of the present application. As shown in FIG. 9, in this embodiment the hidden feature extraction apparatus 800 further includes: a decoding unit 805 for decoding the hidden features to obtain a decoded second corpus; a display unit 806 for displaying the second corpus in a preset form; and a training unit 807 for training the self-encoding function using a training corpus.
Still referring to FIG. 9, in this embodiment the training unit 807 includes: an input subunit 8071 for inputting the word vector features of the training corpus into the self-encoding function; an encoding subunit 8072 for encoding the word vector features of the training corpus through the self-encoding function to extract their hidden features; a decoding subunit 8073 for decoding the hidden features to obtain a decoded third corpus; a judgment subunit 8074 for determining whether the similarity between the training corpus and the third corpus is greater than or equal to a preset similarity threshold; a determination subunit 8075 for determining, if the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold, that training of the self-encoding structure is complete; and an adjustment subunit 8076 for adjusting the parameters of the self-encoding function and continuing to train it, if the similarity between the training corpus and the third corpus is less than the preset similarity threshold, until that similarity is greater than or equal to the preset similarity threshold.
In one embodiment, the display unit displays the second corpus in tabular form or in chart form.
In one embodiment, the conversion unit 802 performs word embedding on the first corpus using a trained preset word vector dictionary to convert the first corpus into word vectors.
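A "trained preset word vector dictionary" of the kind the conversion unit uses can be pictured as a token-to-vector mapping; the toy dictionary below is an assumption for illustration, where a real system would load pre-trained word2vec- or GloVe-style vectors.

```python
import numpy as np

# Toy stand-in for a trained preset word vector dictionary.
word_vector_dict = {
    "cat":  np.array([0.9, 0.1, 0.0]),
    "dog":  np.array([0.8, 0.2, 0.1]),
    "love": np.array([0.0, 0.9, 0.4]),
}
UNK = np.zeros(3)   # fallback vector for out-of-vocabulary tokens

def embed(tokens):
    """Convert a tokenised first corpus into a matrix of word vectors."""
    return np.stack([word_vector_dict.get(t, UNK) for t in tokens])

vectors = embed(["cat", "dog", "love"])   # shape (3, 3)
```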
It should be noted that those skilled in the art can clearly understand that, for the specific implementation of the hidden feature extraction apparatus and its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity, they are not repeated here.
Meanwhile, the division and connection of the units in the hidden feature extraction apparatus above are only illustrative: in other embodiments, the apparatus may be divided into different units as needed, or its units may be connected in a different order or manner, to perform all or part of the functions of the apparatus.
The hidden feature extraction apparatus above may be implemented in the form of a computer program, which can run on the computer device shown in FIG. 10.
Please refer to FIG. 10, a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 1000 may be a computer device such as a desktop computer or a server, or a component or part of another device.
Referring to FIG. 10, the computer device 1000 includes a processor 1002, a memory, and a network interface 1005 connected through a system bus 1001, where the memory may include a non-volatile storage medium 1003 and an internal memory 1004.
The non-volatile storage medium 1003 can store an operating system 10031 and a computer program 10032. When executed, the computer program 10032 can cause the processor 1002 to perform the hidden feature extraction method described above.
The processor 1002 provides the computing and control capabilities that support the operation of the entire computer device 1000.
The internal memory 1004 provides an environment for running the computer program 10032 stored in the non-volatile storage medium 1003; when the computer program 10032 is executed by the processor 1002, it can cause the processor 1002 to perform the hidden feature extraction method described above.
The network interface 1005 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in FIG. 10 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device 1000 to which the solution is applied: a specific computer device 1000 may include more or fewer components than shown, combine certain components, or arrange components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structure and function of the memory and processor are consistent with the embodiment shown in FIG. 10 and are not repeated here.
The processor 1002 runs the computer program 10032 stored in the memory to implement the hidden feature extraction method of the embodiments of the present application.
It should be understood that, in the embodiments of the present application, the processor 1002 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Those of ordinary skill in the art will understand that all or part of the flow of the methods in the above embodiments can be completed by a computer program, which may be stored in a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the steps of the methods described in the above embodiments.
Accordingly, the embodiments of the present application also provide a computer-readable storage medium. It may be a non-volatile computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the steps of the hidden feature extraction method described in the above embodiments.
The storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disc, or any other physical storage medium that can store a computer program.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled practitioners may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of this application.
The above is only a specific implementation of this application, but the scope of protection of this application is not limited to it; any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and such modifications or substitutions shall be covered by the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.

Claims (20)

  1. A hidden feature extraction method, comprising:
    acquiring a first corpus for hidden feature extraction;
    performing word embedding on the first corpus to convert the first corpus into word vectors;
    extracting word vector features of the word vectors through a convolutional neural network;
    encoding the word vector features by self-encoding to extract hidden features of the word vector features.
  2. The hidden feature extraction method according to claim 1, wherein the step of encoding the word vector features by self-encoding to extract the hidden features of the word vector features comprises:
    encoding the word vector features through a self-encoding function to obtain the hidden features of the word vector features.
  3. The hidden feature extraction method according to claim 1, wherein after the step of encoding the word vector features by self-encoding to extract the hidden features of the word vector features, the method further comprises:
    decoding the hidden features to obtain a decoded second corpus.
  4. The hidden feature extraction method according to claim 3, wherein after the step of decoding the hidden features to obtain the decoded second corpus, the method further comprises:
    displaying the second corpus in a preset form.
  5. The hidden feature extraction method according to claim 4, wherein the step of displaying the second corpus in a preset form comprises:
    displaying the second corpus in tabular form or in chart form.
  6. The hidden feature extraction method according to claim 2, wherein before the step of encoding the word vector features through the self-encoding function to obtain the hidden features of the word vector features, the method further comprises:
    training the self-encoding function using a training corpus.
  7. The hidden feature extraction method according to claim 6, wherein the step of training the self-encoding function using a training corpus comprises:
    inputting word vector features of the training corpus into the self-encoding function;
    encoding the word vector features of the training corpus through the self-encoding function to extract hidden features of the word vector features;
    decoding the hidden features to obtain a decoded third corpus;
    determining whether the similarity between the training corpus and the third corpus is greater than or equal to a preset similarity threshold;
    if the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold, determining that training of the self-encoding structure is complete;
    if the similarity between the training corpus and the third corpus is less than the preset similarity threshold, adjusting parameters of the self-encoding function and continuing to train the self-encoding function until the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold.
  8. The hidden feature extraction method according to claim 1, wherein the step of performing word embedding on the first corpus to convert the first corpus into word vectors comprises:
    performing word embedding on the first corpus using a trained preset word vector dictionary to convert the first corpus into word vectors.
  9. A hidden feature extraction apparatus, comprising:
    an acquisition unit, configured to acquire a first corpus for hidden feature extraction;
    a conversion unit, configured to perform word embedding on the first corpus to convert the first corpus into word vectors;
    a first extraction unit, configured to extract word vector features of the word vectors through a convolutional neural network;
    a second extraction unit, configured to encode the word vector features by self-encoding to extract hidden features of the word vector features.
  10. The hidden feature extraction apparatus according to claim 9, wherein the second extraction unit is configured to encode the word vector features through a self-encoding function to obtain the hidden features of the word vector features.
  11. A computer device, comprising a memory and a processor connected to the memory, wherein the memory is configured to store a computer program and the processor is configured to run the computer program stored in the memory to perform the following steps:
    acquiring a first corpus for hidden feature extraction;
    performing word embedding on the first corpus to convert the first corpus into word vectors;
    extracting word vector features of the word vectors through a convolutional neural network;
    encoding the word vector features by self-encoding to extract hidden features of the word vector features.
  12. The computer device according to claim 11, wherein the step of encoding the word vector features by self-encoding to extract the hidden features of the word vector features comprises:
    encoding the word vector features through a self-encoding function to obtain the hidden features of the word vector features.
  13. The computer device according to claim 11, wherein after the step of encoding the word vector features by self-encoding to extract the hidden features of the word vector features, the steps further comprise:
    decoding the hidden features to obtain a decoded second corpus.
  14. The computer device according to claim 13, wherein after the step of decoding the hidden features to obtain the decoded second corpus, the steps further comprise:
    displaying the second corpus in a preset form.
  15. The computer device according to claim 14, wherein the step of displaying the second corpus in a preset form comprises:
    displaying the second corpus in tabular form or in chart form.
  16. The computer device according to claim 12, wherein before the step of encoding the word vector features through the self-encoding function to obtain the hidden features of the word vector features, the steps further comprise:
    training the self-encoding function using a training corpus.
  17. The computer device according to claim 16, wherein the step of training the self-encoding function using a training corpus comprises:
    inputting word vector features of the training corpus into the self-encoding function;
    encoding the word vector features of the training corpus through the self-encoding function to extract hidden features of the word vector features;
    decoding the hidden features to obtain a decoded third corpus;
    determining whether the similarity between the training corpus and the third corpus is greater than or equal to a preset similarity threshold;
    if the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold, determining that training of the self-encoding structure is complete;
    if the similarity between the training corpus and the third corpus is less than the preset similarity threshold, adjusting parameters of the self-encoding function and continuing to train the self-encoding function until the similarity between the training corpus and the third corpus is greater than or equal to the preset similarity threshold.
  18. The computer device according to claim 11, wherein the step of performing word embedding on the first corpus to convert the first corpus into word vectors comprises:
    performing word embedding on the first corpus using a trained preset word vector dictionary to convert the first corpus into word vectors.
  19. A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to implement the following steps:
    acquiring a first corpus for hidden feature extraction;
    performing word embedding on the first corpus to convert the first corpus into word vectors;
    extracting word vector features of the word vectors through a convolutional neural network;
    encoding the word vector features by self-encoding to extract hidden features of the word vector features.
  20. The storage medium according to claim 19, wherein the step of encoding the word vector features by self-encoding to extract the hidden features of the word vector features comprises:
    encoding the word vector features through a self-encoding function to obtain the hidden features of the word vector features.