CN112632994A - Method, device and equipment for determining basic attribute characteristics based on text information

Method, device and equipment for determining basic attribute characteristics based on text information

Info

Publication number: CN112632994A (this application); CN112632994B (granted publication)
Application number: CN202011394269.4A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 刘泽城
Current and original assignee: Dazhu Hangzhou Technology Co., Ltd.
Legal status: Granted; Active
Prior art keywords: user, text, vector, sample, feature vector
Events: application filed by Dazhu Hangzhou Technology Co., Ltd.; priority to CN202011394269.4A; publication of CN112632994A; application granted; publication of CN112632994B

Classifications

    • G06F40/295 Named entity recognition (under G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities; G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06F16/35 Clustering; Classification (under G06F16/30 Information retrieval of unstructured textual data)
    • G06F40/169 Annotation, e.g. comment data or footnotes (under G06F40/10 Text processing; G06F40/166 Editing, e.g. inserting or deleting)
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application belongs to the field of data processing, and discloses a method, a device and equipment for determining basic attribute characteristics based on text information. Relevant text information about a user is acquired from a network and processed by a language identification model to obtain a user text feature vector; a sequence labeling model extracts entity feature data from the vector, a classification model extracts category feature data, and the two are combined into the basic attribute characteristics of the user. The processing and analyzing process can simplify the steps of obtaining the basic attribute characteristics and improve the time utilization rate.

Description

Method, device and equipment for determining basic attribute characteristics based on text information
Technical Field
The present application relates to the field of data processing, and in particular, to a method, an apparatus, and a device for determining basic attribute characteristics based on text information.
Background
A user's basic attributes, such as name, gender, address and age, are the most fundamental data for analyzing the user's characteristics, and are generally required whenever feature analysis is performed on the user.
However, at present, a user's basic attributes are obtained through direct input by the user, which takes up the user's time; moreover, for users who are unwilling to enter this information, the basic attributes cannot be obtained at all.
Therefore, how to obtain a user's basic attribute characteristics from the text information about the user available on the network has become a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of this, the present application provides a method, an apparatus, and a device for determining basic attribute features based on text information. The method mainly aims to solve the technical problem of how to obtain the basic attribute characteristics of the user according to the related text information about the user on the network at present.
According to a first aspect of the present application, a method for determining basic attribute features based on text information is provided, the steps including:
acquiring related text information of a user through a network;
inputting the relevant text information of the user into a language identification model for processing to obtain a user text feature vector, wherein a plurality of sample text information is used for training a GPT2 model in advance to obtain the language identification model capable of identifying the text feature vector in the text information;
inputting the user text feature vector into a sequence labeling model for processing to obtain entity feature data, wherein the sequence labeling model capable of identifying entity features in the text feature vector is constructed in advance;
inputting the user text feature vector into a classification model for processing to obtain class feature data, wherein the classification model capable of identifying class features in the text feature vector is constructed in advance;
and combining the entity characteristic data with the category characteristic data to obtain the basic attribute characteristics of the user.
Further, before the inputting the relevant text information of the user into the language identification model for processing to obtain the user text feature vector, the method further includes:
pre-creating a GPT2 model having a plurality of input paths;
creating a query vector, a key vector, and a value vector for each input path;
obtaining a plurality of sample text messages, and labeling text feature vectors for each sample text message in advance;
inputting the sample text information through an input path, and determining a corresponding sample query vector, a sample key vector and a sample value vector for a sample word in the sample text information according to the query vector, the key vector and the value vector;
multiplying the sample query vector of any sample word with the key vectors of other sample words to obtain the attention score corresponding to the sample word;
multiplying the attention score corresponding to the sample word with the sample value vector corresponding to the sample word, and then summing to obtain a sample text feature vector;
comparing the sample text feature vector with a pre-marked text feature vector, if the comparison is not consistent, adjusting the created query vector, key vector and value vector to enable the sample text feature vector to be consistent with the pre-marked text feature vector, otherwise, inputting the next sample text information;
and the GPT2 model obtained after all sample text information is completely processed is used as a language identification model.
Further, the inputting the relevant text information of the user into a language identification model for processing to obtain a user text feature vector specifically includes:
inputting the relevant text information of the user through an input path, and determining a corresponding user query vector, a user key vector and a user value vector for a user word in the relevant text information of the user according to the query vector, the key vector and the value vector;
multiplying the user query vector of any user word with the key vectors of other user words to obtain the attention score corresponding to the user word;
and multiplying the attention score corresponding to the user word by the user value vector corresponding to the user word, and then summing to obtain the user text feature vector.
Further, the inputting the user text feature vector into a sequence labeling model for processing to obtain entity feature data specifically includes:
inputting the user text feature vector from an input port of the sequence labeling model;
the processing layer of the sequence labeling model receives the user text characteristic vectors transmitted from an input port, and labels are carried out on each word vector in the user text characteristic vectors to obtain a corresponding label set;
and extracting word vectors corresponding to name labels and/or address labels according to the label set, and sorting the word vectors into corresponding name features and address features for output.
Further, when the classification model is a DPCNN model, the inputting the user text feature vector into the classification model for processing to obtain category feature data specifically includes:
inputting the user text feature vector into a DPCNN first layer for convolution processing to obtain a corresponding feature map, and embedding the feature map as the text of the user text feature vector;
the DPCNN second layer receives text embedding sent by the DPCNN first layer and stacks the text embedding;
the third layer of the DPCNN receives the stacked text data, integrates the text data into an integrated vector, extracts the features of the integrated vector, extracts the gender features and/or the age features, and outputs the gender features and/or the age features.
Further, when the classification model is a Transformer model, the user text feature vector is input into the classification model for processing to obtain category feature data, and the method specifically includes:
performing word embedding on the user text feature vector, and inputting the user text feature vector after word embedding into an encoder and a decoder;
the encoder encodes the user text feature vector after the words are embedded, and performs residual error connection and normalization processing on the encoded data to obtain encoded text data;
the decoder decodes the user text characteristic vector after the word is embedded, and performs residual error connection and normalization processing on the decoded data to obtain decoded text data;
processing the coded text data and the decoded text data by using a multi-head attention mechanism, and performing residual error connection and normalization processing on the data processed by the multi-head attention mechanism to obtain output data;
inputting the output data into a full connection layer for processing to obtain the probability of the corresponding class characteristics;
and inputting the probability belonging to the corresponding class characteristics into a logistic regression layer for processing, and outputting the processed probability belonging to the gender characteristics and/or the age characteristics.
Further, before the combining the entity feature data and the category feature data to obtain the basic attribute feature of the user, the method further includes:
correcting wrongly written characters in the entity characteristic data and/or the category characteristic data, and clearing punctuation marks in the entity characteristic data and/or the category characteristic data.
According to a second aspect of the present application, there is provided an apparatus for determining basic attribute features based on text information, comprising:
the acquisition module is used for acquiring relevant text information of a user through a network;
the feature vector processing module is used for inputting the relevant text information of the user into a language identification model for processing to obtain a user text feature vector, wherein the GPT2 model is trained by utilizing a plurality of sample text information in advance to obtain the language identification model capable of identifying the text feature vector in the text information;
the entity feature processing module is used for inputting the user text feature vector into a sequence labeling model for processing to obtain entity feature data, wherein the sequence labeling model capable of identifying entity features in the text feature vector is constructed in advance;
the category feature processing module is used for inputting the user text feature vector into a classification model for processing to obtain category feature data, wherein the classification model capable of identifying category features in the text feature vector is constructed in advance;
and the combination module is used for combining the entity characteristic data and the category characteristic data to obtain the basic attribute characteristics of the user.
According to a third aspect of the present application, a storage medium is proposed, on which a computer program is stored which, when being executed by a processor, carries out the method of the first aspect.
According to a fourth aspect of the present application, a front-end server device is proposed, comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, the processor implementing the method of the first aspect when executing the program.
By means of the technical scheme, the method, the device and the equipment for determining the basic attribute characteristics based on the text information can obtain the relevant text information about the user from a network, perform language identification by using a language identification model to obtain the text characteristic vector of the user, process the text characteristic vector of the user by using a sequence marking model to obtain the entity characteristic data, perform classification processing on the text characteristic vector of the user by using a classification model to obtain the category characteristic data, and integrate the entity characteristic data and the category characteristic data to obtain the corresponding basic attribute characteristics of the user. The processing and analyzing process can simplify the steps of obtaining the basic attribute characteristics and improve the time utilization rate.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow diagram of a method for determining basic attribute features based on textual information according to one embodiment of the present application;
FIG. 2 is a block diagram of an apparatus for determining basic attribute features based on text information according to another embodiment of the present application;
fig. 3 is a block diagram of a terminal device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in fig. 1, an embodiment of the present application provides a method for determining basic attribute features based on text information, including the steps of:
step 101, obtaining relevant text information of a user through a network.
The acquired relevant text information of the user includes: wiki datasets, network open-source datasets, and the like.
Step 102, inputting relevant text information of a user into a language identification model for processing to obtain a user text feature vector, wherein a plurality of sample text information is used for training a GPT2 model in advance to obtain a language identification model capable of identifying the text feature vector in the text information.
The GPT2(Generative Pre-Training 2) model is a model for performing feature processing on text content and converting the text content into text feature vectors.
Step 103, inputting the user text feature vector into a sequence labeling model for processing to obtain entity feature data, wherein the sequence labeling model capable of identifying entity features in the text feature vector is constructed in advance.
Wherein the sequence annotation model comprises at least one of: a NER (named entity recognition) model, a LSTM (Long Short-Term Memory) model, a CRF (conditional random field algorithm) model, an HMM (hidden Markov) model, a CNN (convolutional neural network) model, and a BILSTM (Bi-directional Long Short-Term Memory) model.
After the sequence labeling model processing, corresponding entity characteristic data can be obtained, wherein the entity characteristic data comprises name characteristics and address characteristics.
And 104, inputting the user text feature vector into a classification model for processing to obtain class feature data, wherein the classification model capable of identifying class features in the text feature vector is constructed in advance.
The classification model can be a neural network model. A large amount of text sample data is used to train the neural network for classification: each time a piece of sample text data is trained, the classification feature results of all of its words are output, and the neural network is continuously corrected according to the actual classification results, so that its classification precision becomes higher and the classification results more accurate.
After the classification processing of the classification model, the obtained class characteristic data comprises: gender characteristics and age characteristics.
And 105, combining the entity characteristic data with the category characteristic data to obtain the basic attribute characteristics of the user.
The name feature and the address feature obtained in step 103, and the gender feature and the age feature obtained in step 104, are combined together to form the basic attribute features of the user, namely the user's name, address, gender and age.
These characteristics of the user are output and displayed, so that further analysis can be carried out on the basis of the user's basic attribute characteristics, for example analyzing the user's purchasing patterns, personality traits and the like.
According to the scheme, the relevant text information about the user can be acquired from the network, the language identification model is used for carrying out language identification to obtain the user text feature vector, the user text feature vector is processed by the sequence marking model to obtain the entity feature data, the user text feature vector is classified by the classification model to obtain the category feature data, and the entity feature data and the category feature data are integrated to obtain the corresponding basic attribute feature of the user. The processing and analyzing process can simplify the steps of obtaining the basic attribute characteristics and improve the time utilization rate.
In a particular embodiment, prior to step 102, the method further comprises:
step A1, pre-create a GPT2 model with multiple input paths.
A GPT2 model with a preset number of layers is constructed in advance. The specific number of layers can be selected according to actual needs; for example, a small GPT2 model has 12 layers, a medium GPT2 model 24 layers, a large GPT2 model 36 layers, and an extra-large GPT2 model 48 layers.
Step A2, a query vector, a key vector, and a value vector are created for each input path.
The query vector represents the current word and is used to score all the other words against it by means of their keys. The key vector acts like a label for each word in the segment and is what is matched against when searching for related words. The value vector is the actual representation of the word; once the relevance of every word has been scored, the weighted sum of the value vectors is used to represent the current word.
Step A3, obtaining a plurality of sample text messages, and labeling text feature vectors for each sample text message in advance.
The sample text information can be collected from the network, or passages describing personal information can be excerpted from articles. Each piece of sample text information is labeled in advance with its correct text feature vector, so that the correctness of the processing result can later be judged against these labels.
Step A4, inputting the sample text information through the input path, and determining corresponding sample query vector, sample key vector and sample value vector for the sample word in the sample text information according to the query vector, key vector and value vector.
After all the words of the sample text information have passed through the input path, the query vector, the key vector and the value vector are used to assign a sample query vector, a sample key vector and a sample value vector to each sample word in the sample text information.
Step A5, multiplying the sample query vector of any sample word with the key vectors of other sample words to obtain the attention score corresponding to the sample word.
Here, there are generally a plurality of such sample words.
Step A6, multiplying the attention score corresponding to the sample word by the sample value vector corresponding to the sample word, and then summing to obtain the sample text feature vector.
Step A7, comparing the sample text feature vector with the pre-marked text feature vector, if the comparison is not consistent, adjusting the created query vector, key vector and value vector to make the sample text feature vector consistent with the pre-marked text feature vector, otherwise, inputting the next sample text information.
And step A8, using the GPT2 model obtained after all sample text information is completely processed as a language recognition model.
In the above embodiment, the language identification model is obtained by training the GPT2 model. The language identification model can additionally be tested with a number of test texts carrying vector labels: the accuracy with which the test output matches the corresponding vector labels is evaluated, and if the accuracy is lower than a preset probability threshold (for example, 98 percent), which shows that the obtained language identification model does not yet meet the accuracy requirement, the model is retrained using the test texts. This process is repeated until the test accuracy of the language identification model is greater than or equal to the preset probability threshold, which further improves the recognition accuracy of the model.
In a specific embodiment, step 102 specifically includes:
step 1021, inputting the relevant text information of the user through the input path, and determining a corresponding user query vector, a user key vector and a user value vector for the user words in the relevant text information of the user according to the query vector, the key vector and the value vector.
Step 1022, multiply the user query vector of any user term by the key vector of other user terms to obtain the attention score corresponding to the user term.
And 1023, multiplying the attention score corresponding to the user word with the user value vector corresponding to the user word, and then summing to obtain the user text feature vector.
Through the scheme, the language identification model obtained by training the GPT2 model is used for processing the relevant text information of the user, so that the obtained text feature vector of the user can be more accurate, and a more accurate processing result can be obtained when the text feature vector of the user is used for further processing.
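As an illustration of steps 1021-1023, the following is a minimal sketch in Python/NumPy of how per-word text feature vectors can be produced from query, key and value vectors. The matrix shapes, the sqrt(d_k) scaling and the softmax normalization are standard GPT2-style attention details assumed here for completeness; they are not spelled out in the text above.

```python
import numpy as np

def text_feature_vectors(words_emb, Wq, Wk, Wv):
    """Compute per-word feature vectors with the query/key/value scheme
    of steps 1021-1023 (illustrative shapes only).

    words_emb: (n_words, d_model) embeddings of the user's words
    Wq, Wk, Wv: (d_model, d_k) projections created for the input path
    """
    Q = words_emb @ Wq            # user query vectors
    K = words_emb @ Wk            # user key vectors
    V = words_emb @ Wv            # user value vectors

    # Attention score of each word against every other word.
    # The sqrt(d_k) scaling and softmax below are standard GPT2 details,
    # included here as assumptions.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

    # Weighted sum of value vectors gives the text feature vector per word.
    return weights @ V

# Toy example with random data
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 16))                     # 5 words, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(text_feature_vectors(emb, Wq, Wk, Wv).shape)  # (5, 16)
```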
In a specific embodiment, step 103 specifically includes:
and step 1031, inputting the user text feature vector from an input port of the sequence labeling model.
And 1032, receiving the user text characteristic vectors transmitted from the input port by the processing layer of the sequence labeling model, and performing label labeling on each word vector in the user text characteristic vectors to obtain a corresponding label set.
And 1033, extracting word vectors corresponding to the name tags and/or the address tags according to the tag set, and sorting the word vectors into corresponding name features and address features for output.
The labels include: BA indicates that a character is the first character of an address, MA a middle character of an address, and EA the last character of an address; BO indicates the first character of an organization name, MO a middle character of an organization name, and EO the last character of an organization name; BP indicates the first character of a person's name, MP a middle character of the name, EP the last character of the name, and O indicates that the character does not belong to any named entity. For a user text feature vector X = {x1, x2, x3, …, xi, …, xn}, the label set obtained after labeling is Y = {y1, y2, y3, …, yi, …, yn}.
For example, for the text "a certain king of black Buddha head village", the label set obtained after processing is "black/BA Buddha/MA head/MA village/EA of/O king/BP certain/EP". The corresponding name feature is "a certain king" and the address feature is "black Buddha head village"; these two features are then output.
Through the scheme, the entity characteristic content belonging to the name label or the address label can be extracted by labeling the label, and the identification rate of the entity characteristic is further improved.
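To make the label-based extraction of steps 1031-1033 concrete, the following is a minimal Python sketch that collects the name and address spans from a label set using the BA/MA/EA and BP/MP/EP conventions described above. The function name and data layout are illustrative assumptions.

```python
def extract_entities(tokens, tags):
    """Collect address (BA/MA/EA) and person-name (BP/MP/EP) spans from a
    label set, following the labelling convention of steps 1031-1033."""
    spans = {"address": [], "name": []}
    current, kind = [], None
    for tok, tag in zip(tokens, tags):
        if tag == "BA":
            current, kind = [tok], "address"
        elif tag == "BP":
            current, kind = [tok], "name"
        elif tag in ("MA", "MP") and current:
            current.append(tok)
        elif tag in ("EA", "EP") and current:
            current.append(tok)
            spans[kind].append(" ".join(current))
            current, kind = [], None
        else:                              # "O" or organization tags
            current, kind = [], None
    return spans

# The translated example from the description above
tokens = ["black", "Buddha", "head", "village", "of", "king", "certain"]
tags = ["BA", "MA", "MA", "EA", "O", "BP", "EP"]
print(extract_entities(tokens, tags))
# {'address': ['black Buddha head village'], 'name': ['king certain']}
```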
In a specific embodiment, when the classification model is a DPCNN (Deep Pyramid Convolutional Neural Network) model, step 104 specifically includes:
step 1041, inputting the user text feature vector into the first layer of DPCNN for convolution processing to obtain a corresponding feature map, and embedding the feature map as a text of the user text feature vector.
And 1042, the DPCNN second layer receives the text embedding sent by the DPCNN first layer and stacks the text embedding.
And 1043, receiving the stacked text data by the third layer of the DPCNN, integrating the text data into an integration vector, extracting the features of the integration vector, and extracting the gender features and/or the age features to output.
The DPCNN model has a pyramid structure, so that the model can find the long-range dependency relationship in the text, and the DPCNN model is used for processing the text feature vector of the user according to the steps, so that the accuracy of the extracted gender feature and the age feature can be ensured.
In addition, the gender feature takes two values, male and female, and the age feature takes 89 values, from 2 to 90.
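As a rough illustration of the pyramid structure mentioned above, the following is a hedged PyTorch sketch of one repeated DPCNN-style block (a pooling step that roughly halves the sequence length, two equal-width convolutions, and a residual connection). The channel count, kernel sizes and module names are assumptions and are not taken from this application.

```python
import torch
import torch.nn as nn

class DPCNNBlock(nn.Module):
    """One repeated block of a DPCNN-style classifier (illustrative)."""
    def __init__(self, channels=250):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):            # x: (batch, channels, seq_len)
        x = self.pool(x)             # downsample the sequence ("pyramid")
        shortcut = x
        x = self.conv1(self.act(x))
        x = self.conv2(self.act(x))
        return x + shortcut          # residual connection

block = DPCNNBlock()
features = block(torch.randn(8, 250, 64))
print(features.shape)                # torch.Size([8, 250, 31])
```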
In a specific embodiment, when the classification model is a Transformer model, step 104 specifically includes:
and step 1044, performing word embedding on the user text feature vector, and inputting the user text feature vector after word embedding into an encoder and a decoder.
And step 1045, the encoder encodes the user text feature vector after the word embedding, and performs residual connection and normalization on the encoded data to obtain encoded text data.
And step 1046, decoding the user text feature vector after the word embedding by the decoder, and performing residual connection and normalization processing on the decoded data to obtain decoded text data.
And step 1047, processing the coded text data and the decoded text data by using a multi-head attention mechanism, and performing residual error connection and normalization processing on the data processed by the multi-head attention mechanism to obtain output data.
Step 1048, inputting the output data into the full link layer for processing to obtain the probability of the corresponding class feature.
And 1049, inputting the probabilities belonging to the corresponding class features into a logistic regression layer for processing, and outputting the processed gender features and/or age features.
Through the scheme, the user text feature vectors are classified by using the Transformer model, so that the classification speed is higher, the classification result is more accurate, and the obtained gender feature and/or age feature is more fit with the actual situation.
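For steps 1048-1049, the following is a small hypothetical sketch of the classification heads: a fully connected layer per attribute followed by a softmax that yields the probabilities of the corresponding class features. The two gender classes and 89 age classes mirror the value ranges mentioned for the DPCNN variant; the hidden size is an assumption.

```python
import torch
import torch.nn as nn

class AttributeHeads(nn.Module):
    """Hypothetical classification heads on top of the Transformer output."""
    def __init__(self, d_model=512):
        super().__init__()
        self.gender = nn.Linear(d_model, 2)     # male / female
        self.age = nn.Linear(d_model, 89)       # ages 2 .. 90

    def forward(self, pooled):                  # pooled: (batch, d_model)
        gender_prob = torch.softmax(self.gender(pooled), dim=-1)
        age_prob = torch.softmax(self.age(pooled), dim=-1)
        return gender_prob, age_prob

heads = AttributeHeads()
g, a = heads(torch.randn(4, 512))
print(g.shape, a.shape)                          # (4, 2) (4, 89)
```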
In a specific embodiment, before combining the entity feature data and the category feature data to obtain the basic attribute feature of the user, the method further includes:
correcting wrongly written words in the entity characteristic data and/or the class characteristic data, and clearing punctuation marks in the entity characteristic data and/or the class characteristic data.
In the above scheme, in order to ensure the accuracy of the output result, obvious errors are corrected, for example correcting a wrongly written gender value to "female", and punctuation marks are removed from the output result. This avoids errors in subsequent user analysis caused by inaccurate output results.
The obtained basic characteristics of the user can further be combined with the user's other characteristics to analyze the user's preferences, social relationships and the like, and to formulate a corresponding precision-marketing strategy for the user.
This specifically comprises the following steps:
and S1, adding corresponding label values for the acquired feature texts of at least one user, wherein each user corresponds to at least one feature text.
The categories of the user's feature texts include: name, favorite color, occupation, educational background, school, age, address, hobby, gender, height, weight, and other personal information. Different features within each category correspond to different label values. The rules for obtaining the label values of different features are stored in advance in a label-value repository, and the label-value adding rule of the corresponding category is called directly to add a suitable label value to the corresponding feature text.
For example, for the age category, the corresponding label-value adding rule is: a label value of 1 for ages 0-10, 2 for ages 11-18, 3 for ages 19-45, 4 for ages 46-65, and 5 for ages above 65.
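The age rule above can be expressed directly as a small lookup, as in the following illustrative Python sketch; the rule table and function name are hypothetical.

```python
AGE_TAG_RULES = [          # (low, high, label value) taken from the example above
    (0, 10, 1),
    (11, 18, 2),
    (19, 45, 3),
    (46, 65, 4),
    (66, 200, 5),          # "above 65"; the upper bound is an assumption
]

def age_label_value(age):
    """Return the label value of the age category for a given age."""
    for low, high, value in AGE_TAG_RULES:
        if low <= age <= high:
            return value
    return None

print(age_label_value(35))   # 3
```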
S2, matching a corresponding label weight for each label value.
And S3, constructing a label matrix set by taking the users as rows and the label weights as columns.
And S4, calculating the correlation coefficient among the users according to the label matrix set, determining the correlation value among the users according to the correlation coefficient, and constructing a correlation network matrix according to the correlation value.
S5, obtaining at least one social network matrix, and combining the related network matrix and the at least one social network matrix to construct a multiple similarity network matrix.
And S6, receiving the information of the active users, marking the active users in the multiple similarity network matrix according to the information of the active users, taking the active users as seed users, and calculating the path distance between other users in the multiple similarity network matrix and the seed users.
And S7, taking other users with the path distance less than or equal to the set threshold value as target users, and acquiring and displaying the personal text information of the target users.
With this scheme, corresponding label values can be added to the users' various characteristics and matched with corresponding label weights, a relation network matrix is built from the correlations among the users, and the relation network matrix is combined with other social network matrices to form a multiple similarity network matrix. Related users of the active users used as seed users can then be found from the multiple similarity network matrix, and the success rate of promoting products to these users is comparatively high. In addition, the hidden characteristics of individual users, the similarity between them and their social attributes can be mined from the correlations shown in the constructed multiple similarity network matrix, which facilitates crowd clustering or precision marketing based on expansion from the seed users.
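The target-user selection of S6-S7 amounts to a shortest-path search from the seed users over the multiple similarity network matrix. The sketch below, shown for reference, uses a plain breadth-first search over a 0/1 adjacency matrix; the representation and threshold handling are illustrative assumptions.

```python
from collections import deque

def users_within(matrix, seeds, max_dist):
    """Return users whose path distance from any seed user is <= max_dist.
    matrix: symmetric 0/1 adjacency matrix (list of lists) of the multiple
    similarity network; seeds: indices of the active (seed) users."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue
        for v, connected in enumerate(matrix[u]):
            if connected and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [u for u, d in dist.items() if u not in seeds and d <= max_dist]

m = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(users_within(m, seeds=[0], max_dist=2))   # [1, 2]
```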
In a specific embodiment, S1 specifically includes:
and S11, adding corresponding labels to the feature texts by using a label prediction model or a label adding rule.
In this step, the label adding rules include extracting labels by keyword matching based on keyword preferences, and extracting information from high-frequency templates based on structuring. The label prediction model performs prediction using classification or regression models. The label prediction model and the label adding rules are two different approaches; in practical applications, problems that are difficult for the label prediction model to solve can often be handled well by a simple label adding rule.
S12, determining corresponding label values for the labels of the feature text from a plurality of dimensions, wherein the plurality of dimensions include: a frequency dimension, a label importance dimension obtained by statistically evaluating the label with the tf-idf algorithm, and a dimension of different data granularities or specific behaviors.
In the above scheme, the tf-idf algorithm (term frequency - inverse document frequency) is a statistical method for evaluating the importance of a word to a document within a document set or corpus.
The frequency dimension is the frequency of the label. For the label importance dimension, all label values of a user's personal feature text are regarded as a document and a specific label as a term, and tf-idf is computed to describe the importance of that label. As for the dimension of different data granularities or specific behaviors: in retail, for example, purchasing, adding to favorites and following are specific behaviors, while different time windows are different granularities of the data.
Through the scheme, the accuracy and the coverage rate of the label value determination of the feature text of the user can be ensured as much as possible by combining information of different layers.
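The tf-idf-based label importance of S12 can be prototyped as in the following Python sketch, which treats each user's set of label values as a document and each label as a term. The exact tf-idf variant (here with a smoothed idf) is an assumption, since the text does not fix one.

```python
import math
from collections import Counter

def label_importance(user_labels):
    """tf-idf style importance of each label per user.
    user_labels: list of per-user label lists, e.g. [["sports", "finance"], ...].
    Returns {user_index: {label: score}}."""
    n_users = len(user_labels)
    doc_freq = Counter(lab for labels in user_labels for lab in set(labels))
    scores = {}
    for i, labels in enumerate(user_labels):
        tf = Counter(labels)
        scores[i] = {
            lab: (count / len(labels))
                 * math.log((1 + n_users) / (1 + doc_freq[lab]))   # smoothed idf
            for lab, count in tf.items()
        }
    return scores

print(label_importance([["sports", "finance"], ["sports"], ["pets", "sports"]]))
```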
In a specific embodiment, S2 specifically includes:
and S21, setting corresponding basic weight values for each label value.
The corresponding basic weight value a can be set according to the accuracy or importance of different sources of the feature text.
S22, a corresponding time attenuation coefficient b is set for each label value.
And S23, multiplying the basic weight value a by the time attenuation coefficient b, and then carrying out normalization processing to obtain the corresponding label weight.
By the scheme, the obtained label weight can better accord with the characteristics of the characteristic text of the user, and the operation directly carried out according to the label matrix set obtained by the label weight can be more accurate.
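A minimal sketch of S21-S23 follows: each label value is given a basic weight a and a time attenuation coefficient b, the two are multiplied, and the products are normalized into the final label weights. Normalizing by the sum of the products is one plausible reading of the normalization step, not the only possible one.

```python
def label_weights(base_weights, decay_coeffs):
    """base_weights (a) and decay_coeffs (b): parallel lists for one user's
    label values. Returns the normalized label weights a*b / sum(a*b)."""
    raw = [a * b for a, b in zip(base_weights, decay_coeffs)]
    total = sum(raw)
    return [r / total for r in raw] if total else raw

print(label_weights([0.8, 0.5, 1.0], [1.0, 0.6, 0.3]))
```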
In a specific embodiment, after S3, the method specifically includes:
and S31, receiving the characteristic feedback text with the user mark, and searching the characteristic text of the user corresponding to the characteristic feedback text according to the user mark.
And S32, comparing whether the characteristic feedback text is the same as the characteristic text, if not, matching the corresponding feedback label weight according to the label value corresponding to the characteristic feedback text, replacing the label weight corresponding to the characteristic text in the label matrix set with the feedback label weight to form a new label matrix set, and if so, not processing.
For example, the labels tag1, tag2, tag3, … of user A take the values 1, 2, 3, …, and those of user B take the values 2, 3, 4, …. The label matrix set (also called a user portrait) is a matrix with the users as rows and the label values as columns.
In the scheme, in the marketing process, a marketer feeds back a statistical result of a corresponding user, performs data backflow according to the feedback result and the steps, and performs iterative fitting on the label weight, so that a corresponding label matrix set is adjusted, and monitoring and detection of the label matrix set are realized.
For example, data such as gender and the like are fed back, corresponding label values are determined directly according to the fed-back gender, corresponding label weights are further determined as feedback label weights, and the corresponding label weights in the label matrix set are replaced by the feedback label weights.
As another example, marketing to some paying game users may work well while for others it is less effective. This can be treated as a classification task: the importance of different features to the classification is judged (for example by calculating Gini indexes), and the corresponding basic weight values are changed according to that importance, thereby adjusting the corresponding label weights.
Through the scheme, the label matrix set can be adjusted in time according to the feedback result of later-stage marketing, and the accuracy of the label matrix set is guaranteed.
In a specific embodiment, S4 specifically includes:
and S41, calculating correlation coefficients among the users according to the label matrix set by using a Pearson algorithm.
The Pearson correlation coefficient of two variables X and Y is calculated as:

r = (N·ΣXiYi - ΣXi·ΣYi) / sqrt((N·ΣXi² - (ΣXi)²)·(N·ΣYi² - (ΣYi)²))

wherein N is the number of variable values and Xi, Yi are the i-th values of X and Y.
The Pearson correlation coefficient measures whether two data sets lie on one line, i.e. the linear relationship between interval-scaled variables. The larger the absolute value of the correlation coefficient, the stronger the correlation: the closer the coefficient is to 1 or -1, the stronger the correlation, and the closer it is to 0, the weaker the correlation.
And S42, when the correlation coefficient is greater than or equal to the set correlation threshold, determining that a connecting edge exists between two users corresponding to the correlation coefficient, the correlation value between the two users corresponding to the correlation coefficient is 1, and when the correlation coefficient is smaller than the set correlation threshold, the correlation value between the two users corresponding to the correlation coefficient is 0.
And S43, constructing the related network matrix by taking each user as the row and the column of the related network matrix and taking the related value as the value of the related network matrix.
For example, if users 1 and 2 have a connecting edge, then in the network matrix M[1,2] = 1 and M[2,1] = 1 (a symmetric matrix).
Through this scheme, the association relationships among all users can be obtained directly from the relation network matrix, which makes it convenient to formulate a precision marketing strategy according to these associations.
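Putting S41-S43 together, the following NumPy sketch computes the Pearson correlation between the users' label-weight rows, thresholds it into 0/1 correlation values, and assembles the symmetric relation network matrix. The threshold value is illustrative.

```python
import numpy as np

def relation_network(label_matrix, threshold=0.8):
    """label_matrix: (n_users, n_labels) label-weight matrix (rows = users).
    Returns the symmetric 0/1 relation network matrix of S43."""
    corr = np.corrcoef(label_matrix)          # Pearson correlation between rows
    m = (corr >= threshold).astype(int)
    np.fill_diagonal(m, 0)                    # no self edges
    return m

weights = np.array([[0.5, 0.3, 0.2],
                    [0.5, 0.2, 0.3],
                    [0.1, 0.1, 0.8]])
print(relation_network(weights))
```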
In a particular embodiment, after S5, the method further includes:
and S51, detecting a target network matrix from each network matrix of the multiple similarity network matrices by using a Louvain algorithm.
The Louvain algorithm is a community discovery algorithm and a modularity-based graph algorithm; it is fast, and its clustering effect is relatively pronounced on a multiple similarity network matrix with few nodes and many edges.
And carrying out community detection on the multiple similarity network matrix by using a Louvain algorithm, and processing the detected target network matrix as a new node.
And S52, determining a network correlation coefficient between each target network matrix by taking the target network matrix as a network node, determining that a connecting edge exists between two target network matrices corresponding to the network correlation coefficient when the network correlation coefficient is greater than or equal to a set network correlation threshold, wherein the network correlation value between the two target network matrices corresponding to the network correlation coefficient is 1, and the network correlation value between the two target network matrices corresponding to the network correlation coefficient is 0 when the network correlation coefficient is less than the set correlation threshold.
And S53, constructing the newly constructed network matrix by taking each target network matrix as the row and column of the newly constructed network matrix and taking the network correlation value as the value of the newly constructed network matrix.
S54, when a plurality of newly constructed network matrixes are obtained correspondingly, the newly constructed network matrixes are detected by using the Louvain algorithm, the previous community network is reconstructed by using the detected target newly constructed network matrixes, and the multilevel community network is obtained through iterative processing.
Through the scheme, a multi-level community network can be obtained, the relation and the preference among all users can be determined more conveniently according to the community network, and then a better specified accurate marketing strategy is convenient to use.
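S51-S54 can be prototyped with an off-the-shelf Louvain implementation. The sketch below uses networkx's louvain_communities (available in recent networkx releases) as one possible choice; that library choice, and treating each detected community as the node set of a target network matrix, are assumptions rather than requirements of this application.

```python
import networkx as nx
import numpy as np

def detect_target_networks(relation_matrix):
    """One illustrative way to run the community detection of S51: build a
    graph from the 0/1 relation network matrix and apply the Louvain method."""
    g = nx.from_numpy_array(np.asarray(relation_matrix))
    communities = nx.community.louvain_communities(g, seed=42)
    # Each community is a set of user indices; each set can then be treated
    # as a new node (a "target network matrix") when building the next level.
    return [sorted(c) for c in communities]

m = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]])
print(detect_target_networks(m))    # e.g. [[0, 1, 2], [3, 4]]
```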
In a particular embodiment, after S54, the method further includes:
S55, calculating the numerical average value P1 of each type of label weight within a pending community network in the multilevel community network, and the numerical average value P2 of the same types of label weights over the other community networks in the multilevel community network.
S56, if P1 is larger than P2 for one or more types of label weights, the label categories corresponding to those label weights are used to mark the pending community network.
For example, if the average finance-category label weight of pending community network A is significantly higher than the average finance-category label weight of all users, community network A can be regarded as an investment-oriented group, and investment-related marketing promotion can then be directed at community network A.
Through the scheme, the characteristics of crowd clustering of each community network can be determined according to the calculation of the label weight, and then the community networks are marked according to the characteristics, so that targeted marketing popularization is conveniently performed on users in the community networks.
Other community networks may likewise be associated with a plurality of marks, including: a favorite-sports mark, a favorite-investment mark, a favorite-shopping mark, a favorite-pet mark, and the like.
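The marking rule of S55-S56 compares, per label category, the average label weight inside a pending community network (P1) with the average over all other users (P2) and keeps every category where P1 exceeds P2, as in the following illustrative sketch; the names are hypothetical.

```python
import numpy as np

def community_marks(label_weight_matrix, community, categories):
    """Mark a pending community network with every label category whose
    average weight inside the community (P1) exceeds the average over the
    remaining users (P2), per S55-S56."""
    inside = np.zeros(len(label_weight_matrix), dtype=bool)
    inside[list(community)] = True
    p1 = label_weight_matrix[inside].mean(axis=0)
    p2 = label_weight_matrix[~inside].mean(axis=0)
    return [cat for cat, a, b in zip(categories, p1, p2) if a > b]

weights = np.array([[0.9, 0.1],    # users 0 and 1 form the pending community
                    [0.8, 0.2],
                    [0.2, 0.7],
                    [0.3, 0.6]])
print(community_marks(weights, community=[0, 1], categories=["finance", "sports"]))
# ['finance']
```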
Based on the embodiment shown in fig. 1, an apparatus for determining basic attribute features based on text information is provided, as shown in fig. 2, including:
the obtaining module 21 is configured to obtain relevant text information of a user through a network.
And the feature vector processing module 22 is configured to input the relevant text information of the user into the language identification model for processing, so as to obtain a user text feature vector, where a plurality of sample text information is used to train the GPT2 model in advance, so as to obtain a language identification model capable of identifying the text feature vector in the text information.
And the entity feature processing module 23 is configured to input the user text feature vector into a sequence tagging model for processing to obtain entity feature data, where a sequence tagging model capable of identifying entity features in the text feature vector is pre-constructed.
And the category feature processing module 24 is configured to input the user text feature vector into a classification model for processing to obtain category feature data, where a classification model capable of identifying category features in the text feature vector is pre-constructed.
And the combining module 25 is configured to combine the entity feature data and the category feature data to obtain the basic attribute feature of the user.
In a specific embodiment, the apparatus further comprises a model construction module.
The model construction module specifically includes:
a creating unit configured to create a GPT2 model having a plurality of input paths in advance; a query vector, a key vector, and a value vector are created for each input path.
And the acquisition unit is used for acquiring a plurality of sample text messages and marking text feature vectors for each sample text message in advance.
And the input unit is used for inputting the sample text information through an input path, and determining corresponding sample query vectors, sample key vectors and sample value vectors for sample words in the sample text information according to the query vectors, the key vectors and the value vectors.
The calculation unit is used for multiplying the sample query vector of any sample word with the key vectors of other sample words to obtain the attention score corresponding to the sample word; and multiplying the attention score corresponding to the sample word by the sample value vector corresponding to the sample word, and then summing to obtain the sample text feature vector.
And the comparison unit is used for comparing the sample text feature vector with the pre-marked text feature vector, if the comparison is inconsistent, adjusting the created query vector, the key vector and the value vector to enable the sample text feature vector to be consistent with the pre-marked text feature vector, and otherwise, inputting the next sample text information.
And the model determining unit is used for obtaining a GPT2 model as a language recognition model after all sample text information is completely processed.
In a specific embodiment, the feature vector processing module 22 specifically includes:
and the vector determining unit is used for inputting the relevant text information of the user through an input path, and determining a corresponding user query vector, a user key vector and a user value vector for the user words in the relevant text information of the user according to the query vector, the key vector and the value vector.
And the attention score calculating unit is used for multiplying the user query vector of any user word with the key vectors of other user words to obtain the attention score corresponding to the user word.
And the summation calculation unit is used for multiplying the attention score corresponding to the user word with the user value vector corresponding to the user word and then carrying out summation processing to obtain the user text feature vector.
In a specific embodiment, the entity feature processing module 23 is specifically configured to: inputting a user text feature vector from an input port of the sequence labeling model; a processing layer of the sequence labeling model receives the user text characteristic vectors transmitted from the input port, and labels are carried out on each word vector in the user text characteristic vectors to obtain a corresponding label set; and extracting word vectors corresponding to the name labels and/or the address labels according to the label set, and sorting the word vectors into corresponding name features and address features for output.
In a specific embodiment, when the classification model is a DPCNN model, the category feature processing module 24 is specifically configured to:
inputting the user text feature vector into a DPCNN first layer for convolution processing to obtain a corresponding feature map, and embedding the feature map as the text of the user text feature vector; the DPCNN second layer receives text embedding sent by the DPCNN first layer and stacks the text embedding; the third layer of the DPCNN receives the stacked text data, integrates the text data into an integrated vector, extracts the features of the integrated vector, extracts the gender features and/or the age features, and outputs the gender features and/or the age features.
In a specific embodiment, when the classification model is a Transformer model, the classification feature processing module 24 is specifically configured to:
performing word embedding on the user text characteristic vector, and inputting the user text characteristic vector after word embedding into an encoder and a decoder; the encoder encodes the user text feature vector after word embedding, and performs residual error connection and normalization processing on the encoded data to obtain encoded text data; the decoder decodes the user text characteristic vector after word embedding, and performs residual error connection and normalization processing on the decoded data to obtain decoded text data; processing the coded text data and the decoded text data by using a multi-head attention mechanism, and performing residual connection and normalization processing on the data processed by the multi-head attention mechanism to obtain output data; inputting the output data into a full connection layer for processing to obtain the probability of the corresponding class characteristics; and inputting the probability belonging to the corresponding class characteristics into a logistic regression layer for processing, and outputting the processed probability belonging to the gender characteristics and/or the age characteristics.
In a specific embodiment, the apparatus further comprises:
and the error correction module is used for correcting wrongly written words in the entity characteristic data and/or the category characteristic data and clearing punctuation marks in the entity characteristic data and/or the category characteristic data.
Based on the foregoing embodiments of the method shown in fig. 1 and the apparatus shown in fig. 2, the present application correspondingly further provides a storage medium on which a computer program is stored which, when executed by a processor, implements the foregoing method for determining basic attribute features based on text information shown in fig. 1.
Based on the foregoing embodiments of the method shown in fig. 1 and the apparatus shown in fig. 2, in order to achieve the foregoing object, an embodiment of the present application further provides a terminal device, as shown in fig. 3, including a memory 32 and a processor 31, where the memory 32 and the processor 31 are both disposed on a bus 33, the memory 32 stores a computer program, and the processor 31 implements the method for determining the basic attribute characteristics based on the text information shown in fig. 1 when executing the computer program.
The storage medium may further include an operating system and a network communication module. An operating system is a program that manages the hardware and software resources of a computer device, supporting the operation of information handling programs, as well as other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and other hardware and software in the computer equipment.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile memory (which may be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the implementation scenarios of the present application.
Optionally, the device may also be connected to a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, a WI-FI module, and so forth. The user interface may include a Display screen (Display), an input unit such as a keypad (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a bluetooth interface, WI-FI interface), etc.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware.
By applying the technical scheme of the application, the relevant text information about the user can be acquired from the network, the language identification model is used for carrying out language identification to obtain the user text characteristic vector, then the user text characteristic vector is processed by using the sequence marking model to obtain the entity characteristic data, the user text characteristic vector is classified by using the classification model to obtain the category characteristic data, and the entity characteristic data and the category characteristic data are integrated to obtain the corresponding basic attribute characteristic of the user. The processing and analyzing process can simplify the steps of obtaining the basic attribute characteristics and improve the time utilization rate.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The serial numbers of the foregoing embodiments of the present application are for description only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims (10)

1. A method for determining basic attribute features based on text information, comprising the steps of:
acquiring related text information of a user through a network;
inputting the relevant text information of the user into a language identification model for processing to obtain a user text feature vector, wherein a GPT2 model is trained in advance with a plurality of pieces of sample text information to obtain the language identification model capable of identifying text feature vectors in text information;
inputting the user text feature vector into a sequence labeling model for processing to obtain entity feature data, wherein the sequence labeling model capable of identifying entity features in the text feature vector is constructed in advance;
inputting the user text feature vector into a classification model for processing to obtain class feature data, wherein the classification model capable of identifying class features in the text feature vector is constructed in advance;
and combining the entity characteristic data with the category characteristic data to obtain the basic attribute characteristics of the user.
2. The method of claim 1, wherein before the inputting the relevant text information of the user into the language identification model for processing, the method further comprises:
pre-creating a GPT2 model having a plurality of input paths;
creating a query vector, a key vector, and a value vector for each input path;
obtaining a plurality of pieces of sample text information, and labeling a text feature vector for each piece of sample text information in advance;
inputting the sample text information through an input path, and determining a corresponding sample query vector, a sample key vector and a sample value vector for a sample word in the sample text information according to the query vector, the key vector and the value vector;
multiplying the sample query vector of any sample word by the key vectors of the other sample words to obtain the attention scores corresponding to the sample word;
multiplying the attention scores corresponding to the sample word by the corresponding sample value vectors, and summing the products to obtain a sample text feature vector;
comparing the sample text feature vector with the pre-labeled text feature vector; if they are inconsistent, adjusting the created query vector, key vector and value vector until the sample text feature vector is consistent with the pre-labeled text feature vector; otherwise, inputting the next piece of sample text information;
and the GPT2 model obtained after all sample text information is completely processed is used as a language identification model.
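Claim 2 above amounts to computing single-head self-attention from learned query, key and value projections and checking the resulting feature vectors against pre-labeled ones. The NumPy sketch below illustrates only the forward attention computation; the scaling and softmax steps are assumptions borrowed from standard GPT-2 practice and are not recited in the claim, and the parameter-adjustment (training) loop is omitted.

    import numpy as np

    def self_attention_features(X, W_q, W_k, W_v):
        """Single-head self-attention over one piece of sample text.

        X: (n_words, d_model) word embeddings; W_q, W_k, W_v: learned projections.
        Returns an (n_words, d_model) matrix of sample text feature vectors.
        """
        Q = X @ W_q                      # sample query vectors
        K = X @ W_k                      # sample key vectors
        V = X @ W_v                      # sample value vectors
        scores = Q @ K.T                 # attention scores between word pairs
        # Scaling and softmax are assumptions (not in the claim) so the
        # attention weights form a proper distribution, as in GPT-2.
        weights = np.exp(scores / np.sqrt(K.shape[1]))
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ V               # weighted sum of value vectors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))          # 5 sample words, 8-dimensional embeddings
    W_q = rng.normal(size=(8, 8))
    W_k = rng.normal(size=(8, 8))
    W_v = rng.normal(size=(8, 8))
    print(self_attention_features(X, W_q, W_k, W_v).shape)  # (5, 8)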
3. The method according to claim 2, wherein the inputting the relevant text information of the user into a language identification model for processing to obtain a user text feature vector specifically comprises:
inputting the relevant text information of the user through an input path, and determining a corresponding user query vector, a user key vector and a user value vector for a user word in the relevant text information of the user according to the query vector, the key vector and the value vector;
multiplying the user query vector of any user word by the key vectors of the other user words to obtain the attention scores corresponding to the user word;
and multiplying the attention scores corresponding to the user word by the corresponding user value vectors, and summing the products to obtain the user text feature vector.
4. The method according to claim 1, wherein the inputting the user text feature vector into a sequence labeling model for processing to obtain entity feature data specifically comprises:
inputting the user text feature vector from an input port of the sequence labeling model;
the processing layer of the sequence labeling model receives the user text feature vector transmitted from the input port and labels each word vector in the user text feature vector to obtain a corresponding label set;
and extracting the word vectors corresponding to name labels and/or address labels according to the label set, and organizing them into corresponding name features and address features for output.
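To illustrate the extraction step of claim 4, the sketch below groups labeled tokens into name and address features. The BIO-style tag names (B-NAME, I-ADDR and so on) are assumptions introduced for the example; the claim itself only refers to name labels and address labels produced by the sequence labeling model.

    from typing import Dict, List

    def collect_entities(tokens: List[str], labels: List[str]) -> Dict[str, List[str]]:
        """Group tokens carrying name/address labels into entity features."""
        key_for = {"NAME": "name", "ADDR": "address"}   # assumed tag names
        entities: Dict[str, List[str]] = {"name": [], "address": []}
        span: List[str] = []
        span_tag = None

        def flush() -> None:
            nonlocal span, span_tag
            if span_tag is not None and span:
                entities[key_for[span_tag]].append("".join(span))
            span, span_tag = [], None

        for token, label in zip(tokens, labels):
            if label.startswith("B-") and label[2:] in key_for:
                flush()                              # close any open span
                span, span_tag = [token], label[2:]
            elif label.startswith("I-") and label[2:] == span_tag:
                span.append(token)                   # continue the current span
            else:
                flush()                              # token outside any entity
        flush()
        return entities

    print(collect_entities(
        ["张", "三", "住", "在", "杭", "州"],
        ["B-NAME", "I-NAME", "O", "O", "B-ADDR", "I-ADDR"],
    ))
    # {'name': ['张三'], 'address': ['杭州']}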
5. The method according to claim 1, wherein when the classification model is a DPCNN model, the inputting the user text feature vector into the classification model for processing to obtain category feature data specifically includes:
inputting the user text feature vector into the DPCNN first layer for convolution processing to obtain a corresponding feature map, and using the feature map as the text embedding of the user text feature vector;
the DPCNN second layer receives text embedding sent by the DPCNN first layer and stacks the text embedding;
the DPCNN third layer receives the stacked text data, integrates the text data into an integrated vector, performs feature extraction on the integrated vector to obtain gender features and/or age features, and outputs the gender features and/or the age features.
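A loose PyTorch sketch of the three DPCNN stages in claim 5 follows. The channel sizes, the single-block simplification and the two-way output head are assumptions made for brevity; a full DPCNN repeats convolution and downsampling blocks with shortcut connections.

    import torch
    import torch.nn as nn

    class TinyDPCNNHead(nn.Module):
        """Simplified three-stage DPCNN-style classifier."""

        def __init__(self, embed_dim: int = 64, channels: int = 32, num_classes: int = 2):
            super().__init__()
            # Stage 1: convolution over the user text feature vectors, producing
            # the feature map that serves as the text embedding.
            self.region_conv = nn.Conv1d(embed_dim, channels, kernel_size=3, padding=1)
            # Stage 2: a further convolution stacked on the text embedding.
            self.stack_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
            # Stage 3: integrate into a single vector and classify.
            self.pool = nn.AdaptiveMaxPool1d(1)
            self.classifier = nn.Linear(channels, num_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, embed_dim) user text feature vectors
            x = x.transpose(1, 2)                 # -> (batch, embed_dim, seq_len)
            x = torch.relu(self.region_conv(x))   # stage 1: feature map / text embedding
            x = torch.relu(self.stack_conv(x))    # stage 2: stacked convolution
            x = self.pool(x).squeeze(-1)          # stage 3: integrated vector
            return self.classifier(x)             # gender and/or age logits

    logits = TinyDPCNNHead()(torch.randn(4, 20, 64))
    print(logits.shape)  # torch.Size([4, 2])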
6. The method according to claim 1, wherein when the classification model is a Transformer model, the inputting the user text feature vector into the classification model for processing to obtain category feature data specifically comprises:
performing word embedding on the user text feature vector, and inputting the user text feature vector after word embedding into an encoder and a decoder;
the encoder encodes the user text feature vector after word embedding, and performs residual connection and normalization processing on the encoded data to obtain encoded text data;
the decoder decodes the user text feature vector after word embedding, and performs residual connection and normalization processing on the decoded data to obtain decoded text data;
processing the encoded text data and the decoded text data with a multi-head attention mechanism, and performing residual connection and normalization processing on the data processed by the multi-head attention mechanism to obtain output data;
inputting the output data into a fully connected layer for processing to obtain probabilities of the corresponding category features;
and inputting the probabilities of the corresponding category features into a logistic regression layer for processing, and outputting the processed probabilities of the gender features and/or the age features.
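The encoder/decoder flow of claim 6 is sketched below with PyTorch's built-in Transformer layers, which internally perform the multi-head attention, residual connection and layer normalization steps recited in the claim. The dimensions, the mean pooling over the sequence and the sigmoid standing in for the logistic regression layer are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class TransformerCategoryHead(nn.Module):
        """Encoder/decoder classification head producing category probabilities."""

        def __init__(self, d_model: int = 64, nhead: int = 4, num_classes: int = 2):
            super().__init__()
            self.encoder = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.decoder = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            self.fc = nn.Linear(d_model, num_classes)   # fully connected layer

        def forward(self, embedded: torch.Tensor) -> torch.Tensor:
            # embedded: (batch, seq_len, d_model) word-embedded user text feature vector
            encoded = self.encoder(embedded)            # encoded text data
            decoded = self.decoder(embedded, encoded)   # decoded text data attends to it
            pooled = decoded.mean(dim=1)                # summarize the sequence
            return torch.sigmoid(self.fc(pooled))       # gender/age probabilities

    probs = TransformerCategoryHead()(torch.randn(4, 20, 64))
    print(probs.shape)  # torch.Size([4, 2])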
7. The method of claim 1, wherein before the combining the entity characteristic data and the category characteristic data to obtain the basic attribute characteristics of the user, the method further comprises:
correcting wrongly written words in the entity characteristic data and/or the category characteristic data, and removing punctuation marks from the entity characteristic data and/or the category characteristic data.
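A minimal sketch of the correction and punctuation-removal step of claim 7 follows. The correction table and the example string are hypothetical; a production system would use a dedicated spelling-correction component rather than a fixed lookup.

    import re
    from typing import Dict

    # Toy correction table (hypothetical entries).
    CORRECTIONS: Dict[str, str] = {"Hangzou": "Hangzhou", "adress": "address"}

    def clean_feature_text(text: str) -> str:
        """Correct known misspellings and strip punctuation marks."""
        for wrong, right in CORRECTIONS.items():
            text = text.replace(wrong, right)
        # \w is Unicode-aware, so this removes ASCII and full-width punctuation alike.
        return re.sub(r"[^\w\s]", "", text)

    print(clean_feature_text("Name: Zhang San, adress: Hangzou。"))
    # Name Zhang San address Hangzhou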
8. An apparatus for determining basic attribute features based on text information, comprising:
the acquisition module is used for acquiring relevant text information of a user through a network;
the feature vector processing module is used for inputting the relevant text information of the user into a language identification model for processing to obtain a user text feature vector, wherein a GPT2 model is trained in advance with a plurality of pieces of sample text information to obtain the language identification model capable of identifying text feature vectors in text information;
the entity feature processing module is used for inputting the user text feature vector into a sequence labeling model for processing to obtain entity feature data, wherein the sequence labeling model capable of identifying entity features in the text feature vector is constructed in advance;
the category feature processing module is used for inputting the user text feature vector into a classification model for processing to obtain category feature data, wherein the classification model capable of identifying category features in the text feature vector is constructed in advance;
and the combination module is used for combining the entity characteristic data and the category characteristic data to obtain the basic attribute characteristics of the user.
9. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 7.
10. A terminal device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 7 when executing the program.
CN202011394269.4A 2020-12-03 2020-12-03 Method, device and equipment for determining basic attribute characteristics based on text information Active CN112632994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011394269.4A CN112632994B (en) 2020-12-03 2020-12-03 Method, device and equipment for determining basic attribute characteristics based on text information

Publications (2)

Publication Number Publication Date
CN112632994A true CN112632994A (en) 2021-04-09
CN112632994B CN112632994B (en) 2023-09-01

Family

ID=75307509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011394269.4A Active CN112632994B (en) 2020-12-03 2020-12-03 Method, device and equipment for determining basic attribute characteristics based on text information

Country Status (1)

Country Link
CN (1) CN112632994B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100145678A1 (en) * 2008-11-06 2010-06-10 University Of North Texas Method, System and Apparatus for Automatic Keyword Extraction
CN108595708A (en) * 2018-05-10 2018-09-28 北京航空航天大学 A kind of exception information file classification method of knowledge based collection of illustrative plates
CN110674641A (en) * 2019-10-06 2020-01-10 武汉鸿名科技有限公司 GPT-2 model-based Chinese electronic medical record entity identification method
CN111522959A (en) * 2020-07-03 2020-08-11 科大讯飞(苏州)科技有限公司 Entity classification method, system and computer readable storage medium
US20210027165A1 (en) * 2018-09-05 2021-01-28 Tencent Technology (Shenzhen) Company Limited Neural network training method and apparatus, computer device, and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
J. LI, A. SUN, J. HAN and C. LI: "A Survey on Deep Learning for Named Entity Recognition", IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, vol. 34, no. 1, pages 50-70, XP011892039, DOI: 10.1109/TKDE.2020.2981314 *
ZHANG HUALI; KANG XIAODONG; LI BO; WANG YAGE; LIU HANQING; BAI FANG: "Named entity recognition in Chinese electronic medical records with Bi-LSTM-CRF combined with an attention mechanism", Journal of Computer Applications (计算机应用), no. 1, pages 103-107 *
YANG DANHAO; WU YUEXIN; FAN CHUNXIAO: "A keyword extraction model for Chinese short texts based on an attention mechanism", Computer Science (计算机科学), no. 01, pages 199-204 *
GU LINGYUN: "Chinese named entity recognition based on multi-attention", Information & Computer (Theoretical Edition) (信息与电脑(理论版)), no. 09, pages 47-50 *

Also Published As

Publication number Publication date
CN112632994B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
US11586637B2 (en) Search result processing method and apparatus, and storage medium
CN110580335B (en) User intention determining method and device
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN112632225B (en) Semantic searching method and device based on case and event knowledge graph and electronic equipment
US11409964B2 (en) Method, apparatus, device and storage medium for evaluating quality of answer
CN112434691A (en) HS code matching and displaying method and system based on intelligent analysis and identification and storage medium
CN114648392B (en) Product recommendation method and device based on user portrait, electronic equipment and medium
CN112632226B (en) Semantic search method and device based on legal knowledge graph and electronic equipment
CN112560504B (en) Method, electronic equipment and computer readable medium for extracting information in form document
CN111079432A (en) Text detection method and device, electronic equipment and storage medium
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN110874534A (en) Data processing method and data processing device
CN116070632A (en) Informal text entity tag identification method and device
CN114255096A (en) Data requirement matching method and device, electronic equipment and storage medium
CN110275953B (en) Personality classification method and apparatus
CN112463966B (en) False comment detection model training method, false comment detection model training method and false comment detection model training device
CN113536785B (en) Text recommendation method, intelligent terminal and computer readable storage medium
CN111523311B (en) Search intention recognition method and device
CN112632275B (en) Crowd clustering data processing method, device and equipment based on personal text information
CN112632994B (en) Method, device and equipment for determining basic attribute characteristics based on text information
CN108304366B (en) Hypernym detection method and device
CN115618875A (en) Public opinion scoring method, system and storage medium based on named entity recognition
CN114067343A (en) Data set construction method, model training method and corresponding device
CN112069821A (en) Named entity extraction method and device, electronic equipment and storage medium
CN111695922A (en) Potential user determination method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant