CN107832304A - Method and system for determining user gender based on message text - Google Patents
- Publication number
- CN107832304A (application CN201711184662.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- measured
- word
- user
- sex
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
A method for determining user gender based on message text comprises the following steps: receiving a text to be tested; extracting text features of the text to be tested; and feeding the extracted text features into a classification model to determine the gender of the user corresponding to the text to be tested, wherein the classification model is based on a long short-term memory (LSTM) neural network and the text features include the word frequencies and Huffman codes of the text to be tested. A system for determining user gender based on message text is also provided. Based on the significant differences between male and female users in language expression, the invention uses the contextual semantic relations, word frequencies, and Huffman codes of the text sent by a user to achieve higher gender discrimination.
Description
Technical field
The present invention relates to the technical field of natural language processing, and more particularly to a method and system for determining user gender based on message text.
Background art
With the development of networks, online games have emerged, and these internet-based electronic games have rapidly opened up a market. An online game generally refers to a video game in which multiple players interact for entertainment over a computer network. Online games depend on the internet and allow many people to take part at once, achieving communication, entertainment, and leisure through interpersonal interaction. Because personal attribute data involves privacy, users registering a game account often leave the gender field blank or enter false gender information to conceal their personal information. As a result, gender-related pushes cannot be matched to users' needs. In addition, gender identification is widely used in information retrieval and recommendation, social surveys, psychological diagnosis, and other areas, so inferring user gender has broad research prospects and practical value.
According to our survey, a known approach in the industry decides gender from the male-to-female distribution across the types of games played (see Chinese invention patent application publication No. CN106844687A, "A method and system for determining user gender based on game logs"). This class of methods can be summarized as follows. First, for users of known gender, the proportion in which each game appears in the gender-specific tables is counted, yielding the gender polarity of each game, that is, the male-to-female ratio in each game. For a user of unknown gender, the list of games played is collected, and the user's gender is determined from that list by cumulative summation. The factors this method relies on are limited, and the male-to-female ratio of a game discriminates poorly, so the accuracy of the method is low. For example, determining gender from whether a user plays "Honor of Kings" is clearly not very accurate.
Summary of the invention
The invention aims to solve the problem that existing user gender identification methods have low discrimination. According to a first aspect of the invention, a method for determining user gender based on message text is provided, comprising the following steps: receiving a text to be tested; extracting text features of the text to be tested; and feeding the extracted text features into a classification model to determine the gender of the user corresponding to the text to be tested, wherein the classification model is based on a long short-term memory (LSTM) neural network and the text features include the word frequencies and Huffman codes of the text to be tested.
Further, the step of extracting the text features of the text to be tested includes the following sub-steps: segmenting the text to be tested to generate one or more words to be tested; counting the word frequency of each word to be tested; encoding the words to be tested based on a Huffman tree to generate the Huffman code of each word; and, based on the word frequencies and Huffman codes of the words to be tested, using a CBOW model to output the corresponding embedding vectors, the text features including the embedding vectors.
Further, the step of segmenting the text to be tested includes the following sub-steps: building a directed acyclic graph of the text to be tested based on a segmentation dictionary, wherein words not included in the segmentation dictionary are segmented using the Viterbi algorithm of an HMM model; finding the maximum-probability path of the directed acyclic graph by dynamic programming; and outputting the segmentation result corresponding to the maximum-probability path.
Further, before the step of counting the word frequencies of the words to be tested, the method also includes: matching the words to be tested against a dictionary to filter out stop words, words above a preset word frequency, and words below a preset word frequency.
Further, before the step of using the CBOW model to output the embedding vectors based on the word frequencies and Huffman codes, the method also includes: weighting the word frequencies of the modal particles among the words to be tested to increase their weight.
Further, the text features also include the proportion of interrogative sentences in the text to be tested.
Further, if the text to be tested contains an interrogative word and ends with a modal particle, the text to be tested is judged to be an interrogative sentence.
According to a second aspect of the invention, a system for determining user gender based on message text is provided, comprising: a first module that receives a text to be tested; a second module that extracts text features of the text to be tested; and a third module that feeds the extracted text features into a classification model to determine the gender of the user corresponding to the text to be tested, wherein the classification model is based on a long short-term memory (LSTM) neural network and the text features include the word frequencies and Huffman codes of the text to be tested.
According to a third aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the steps of the method of the first aspect of the invention.
The beneficial effect of the invention is that, based on the significant differences between male and female users in language expression, it uses the contextual semantic relations, word frequencies, and Huffman codes of the text sent by a user to achieve higher gender discrimination.
Brief description of the drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a general flowchart of one or more embodiments of the invention;
Fig. 2 is a detailed flowchart of one or more embodiments of the invention;
Fig. 3 is a flowchart of training the classification model;
Fig. 4 is a schematic diagram of the mathematical principle of LSTM.
Detailed description of embodiments
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the invention.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. For example, the device embodiments described above are only illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the embodiments of the invention may be integrated in one processing unit, may each exist physically on their own, or two or more units may be integrated in one unit.
If the functions are implemented as software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
Those skilled in the art will understand that, in the description of this application, terms such as "first", "second", "firstly", "secondly", "step one", and "step two" do not limit the order unless otherwise stated. For example, "step three" may come before "step one", and "step four" may be carried out at the same time as "step two".
According to commonly held social perceptions, male and female users show different preferences in games. In general terms, men are described as preferring active, intense activities and as seeking stimulation, dominance, and risk, whereas women are described as more sentimental and conservative, preferring stability to stimulation and a dependent role. Men relatively prefer action-adventure games, while women relatively prefer casual, fun games. To improve game recommendation, different types of products can be recommended to users of different genders in a personalized way, which helps optimize the user experience and raises the success rate of pushes. For example, stimulating games such as war games can be recommended to male users, and leisure games such as flight games to female users.
However, existing games have limited channels for learning a user's gender. For example, some games do not force users to fill in real gender information, and even when gender is filled in, its authenticity cannot be verified. How to determine the gender of game users is a technical problem in the industry, and it must be solved to achieve accurately targeted game distribution.
According to the known literature, little research addresses this problem because data that is discriminative and indicative of user gender is scarce. Our survey found that one class of methods decides by the male-to-female distribution across game types: for example, if many men play a certain type of game, a user who plays that type is judged male by default. The basis for such judgments is crude, and the accuracy is very low. For example, the ratio of male to female players of "Honor of Kings" is roughly equal, so judging by whether a user plays that game is clearly inaccurate.
The inventors recognized that, given the significant differences between men and women in language expression, the text a user sends in chat rooms and similar settings can be used to determine the user's gender accurately, effectively solving this industry problem. Further, the identification result can be applied to accurately targeted pushes, effectively raising click-through rates and creating considerable commercial value, so the approach has industrial applicability.
The differences between male and female users in language expression are described below. Through long-term statistics and summarization, the inventors identified relatively significant statistical gender differences in at least the following four aspects:
(1) Gender differences in vocabulary: these mainly show up as differences in word choice. Gender variation in language is most abundant in vocabulary, and according to the inventors women show several characteristic patterns in language use:
1) Women favor modifying words such as adjectives and adverbs, for example "excellent", "does not also come", "what a nuisance", "always", "greatly", "very beautiful", and "good general". Men's language, by contrast, tends to be more casual, concise, and decisive.
2) Women use more modal particles, such as "eh", "ow", "tsk-tsk", and "uh-uh", while men use them less. According to the inventors, women are more prone to mood swings than men, these swings show between the lines, and modal particles carry strong expressive appeal; over time this becomes habitual, producing different pragmatic habits in the two genders.
3) Women use euphemisms and polite expressions. According to the inventors, women's feelings are generally more delicate and their temperament gentler; they generally do not refuse in harsh terms, their wording is milder and more indirect, and they take the listener's face into account. When expressing opinions they mostly use a tone of consultation and discussion that does not come across as aggressive. Even imperative sentences more often use a tone of request. Such requests in a mild tone carry no sense of compulsion or command and serve communication well.
4) Women use color words more commonly than men. According to the inventors, women are especially sensitive to color and unconsciously use many color words in their language. In childhood, parents dress girls in bright, pretty clothes, so girls encounter a richer range of colors than boys. With more education and a more developed aesthetic sense, sensitivity to color grows stronger, and the use of color words can also express delicate feelings.
(2) Gender differences in grammar: these mainly show up in women using interrogative sentences more than men and tending to follow established conventions for each linguistic rule. Women tend to use socially approved forms whose linguistic form is close to the standard language; men tend to use forms that society does not approve of, and are more innovative and variable.
(3) Differences in emotional expression: according to the inventors, women tend to be more emotional and men more rational, and this often shows in conversation. When describing an experience of their own, men tend to describe the process calmly and analyze its gains and losses objectively. Women tend to focus on describing and evaluating particular fragments, sometimes suddenly rejoicing that something turned out well or lamenting the emotional hurt that something caused, so the narration is often interrupted by expressions of emotion. Likewise, when recounting an unfortunate experience, men focus more on its objective consequences, often repeating "What to do? What to do? What should I do?", showing anxiety to find a way out of the difficulty. Women focus more on the subjective feelings the event triggers, often sighing "Oh my god, how am I supposed to go on living!" Thus men are described as comparatively rational and women as comparatively emotional.
(4) Speaking directly versus indirectly: according to the inventors, women are more guarded and often do not speak directly but talk around the point. For example, "This room is so stuffy!" implies wanting to go out for a walk. Men speak frankly and to the point. Women therefore use euphemisms far more often than men.
Based on the above findings, one purpose of this disclosure is to build a new algorithm that understands the chat text of game users in order to infer user gender. Those skilled in the art should understand, however, that the technical solution of this disclosure is equally applicable to identifying user gender from other text messages such as short messages (SMS), email, and instant messaging (IM).
Referring to Figs. 1-3, according to one aspect of the disclosure, a method for determining user gender based on message text is provided, comprising the following steps: receiving a text to be tested; extracting text features of the text to be tested; and feeding the extracted text features into a classification model to determine the gender of the user corresponding to the text to be tested, wherein the classification model is based on a long short-term memory (LSTM) neural network and the text features include the word frequencies and Huffman coding of the text to be tested. Based on the significant differences between male and female users in language expression, the method uses the contextual semantic relations, word frequencies, and Huffman codes of the text sent by a user to achieve higher gender discrimination.
On receiving the text to be tested: the chat room messages, short messages (SMS), emails, and instant messaging (IM) text messages sent by a user can be captured, together with the user's unique ID, as raw data.
On the step of extracting the text features of the text to be tested: for text content, the feature construction method commonly used in the industry is to perform Chinese word segmentation first, then word screening and feature representation. Specifically, Chinese word segmentation uses the open-source tool jieba; word screening uses the commonly used TF-IDF ranking method to filter out words with low discrimination; and feature generation uses the commonly used 0-1 word representation. TF-IDF (Term Frequency-Inverse Document Frequency) is a common weighting technique for information retrieval and data mining. It is a statistical method for assessing how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases as the word becomes more frequent across the corpus. TF-IDF ranking therefore filters out words with low discrimination and highlights words that occur frequently in one sentence or article but rarely in the whole dictionary. For formal text with strong domain specificity, this method yields features that discriminate well. For irregular internet chat text, however, the generated features discriminate very poorly and their mathematical expressive power is weak.
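The TF-IDF weighting described above can be sketched as follows. This is a minimal illustration, assuming documents are already tokenized and using the plain logarithmic form of the inverse document frequency; the function name `tf_idf` and the toy documents are hypothetical and do not come from the patent.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Term frequency times the log of the inverse document frequency.
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical toy corpus of already-segmented chat messages.
docs = [["good", "game", "good"], ["bad", "game"], ["good", "team", "game"]]
scores = tf_idf(docs)
```

A word that occurs in every document, such as "game" here, gets a score of zero, which is the sense in which TF-IDF ranking screens out words with low discrimination.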
To build more discriminative features, one or more embodiments of the disclosure use a word-embedding feature construction method; see Fig. 2 for the detailed flow. Word embedding can be understood through an example, such as deciding whether the part of speech of a word is verb or noun. In engineering terms, there is a series of samples (x, y), where x is a word and y is its part of speech; the goal is to build a mapping f(x) -> y, where the mathematical model f accepts only numeric input. Abstractly, converting symbols (such as Chinese, English, or Latin) into numeric form can be likened to embedding them into a mathematical space; this kind of embedding is called word embedding. We used Word2vec in our experiments, which is one kind of word embedding.
The step of extracting the text features of the text to be tested includes the following sub-steps: segmenting the text to be tested to generate one or more words to be tested; counting the word frequency of each word to be tested; encoding the words to be tested based on a Huffman tree to generate the Huffman code of each word; and, based on the word frequencies and Huffman codes of the words to be tested, using a CBOW model to output the corresponding embedding vectors. The details are as follows.
On Chinese word segmentation: in one or more embodiments, the step of segmenting the text to be tested includes the following sub-steps: building a directed acyclic graph of the text to be tested based on a segmentation dictionary, wherein words not included in the segmentation dictionary are segmented using the Viterbi algorithm of an HMM model; finding the maximum-probability path of the directed acyclic graph by dynamic programming; and outputting the segmentation result corresponding to the maximum-probability path. Word segmentation means cutting a sequence of Chinese characters into individual words, that is, recombining a continuous character sequence into a word sequence according to a given specification. In Latin-script languages such as English, spaces act as natural delimiters between words. In Chinese, characters, sentences, and paragraphs can be separated by obvious delimiters, but words have no formal delimiter. Although English also faces the problem of delimiting phrases, at the word level Chinese is much more complex and difficult than English. For example, in the Chinese sentence meaning "those who spit everywhere shall be punished", the phrase for "those who spit everywhere" could be treated as one word or split into several words in different ways (for example "everywhere" and "spitter", or "spit everywhere" and "phlegm person"). Correct segmentation is therefore vital to understanding a sentence.
In this example, a word-graph scan is first performed based on a prefix dictionary (the segmentation dictionary), generating a directed acyclic graph (DAG) of all the possible ways the characters in the sentence can form words. A prefix dictionary arranges its words in prefix order: for example, once "上" (on) appears in the dictionary, the words beginning with it follow in the same part, such as "上海" (Shanghai) and then "上海市" (Shanghai City), forming a hierarchical containment structure. In one or more embodiments, the segmentation dictionary includes the filter dictionary described later; because the content of the filter dictionary can be adjusted dynamically by an offline iterative update unit, the segmentation efficiency of the dictionary can be improved. In addition, in one or more embodiments there are multiple segmentation dictionaries with different contents, and the dictionary used to segment a text is selected according to the credit value of the user who sent it. For example, when the user's credit value is high, a dictionary with coarser segmentation granularity is chosen (for instance, segmentation stops at "Shanghai City" without splitting further into "Shanghai" and "City") to simplify the segmentation process; when the credit value is low, a dictionary with finer granularity is chosen to achieve more precise segmentation. The concept of the credit value is explained in more detail below.
Then, dynamic programming is used to search for the maximum-probability path and find the most probable segmentation based on word frequency. For words not registered in the segmentation dictionary, a hidden Markov model (HMM) based on the word-forming ability of Chinese characters is used together with the Viterbi algorithm. The final output is the segmentation result corresponding to the maximum-probability path, which serves as the basis for the subsequent classification model.
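The dictionary-based part of this segmentation flow can be sketched as below: build the DAG of candidate words, then pick the maximum-probability path by dynamic programming from the end of the sentence. This is only an illustration under stated assumptions. The dictionary `FREQ` and its frequencies are hypothetical, and words missing from the dictionary simply fall back to single characters here instead of going through the HMM and Viterbi step that the text describes.

```python
import math

# Hypothetical toy segmentation dictionary: word -> frequency.
FREQ = {"上": 5, "海": 5, "市": 5, "长": 5, "上海": 50, "上海市": 30, "市长": 40}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1) if sentence[i:j] in FREQ]
        # Assumption: unknown characters become single-character words.
        dag[i] = ends or [i + 1]
    return dag


def segment(sentence):
    """Return the segmentation with the highest product of word probabilities."""
    dag = build_dag(sentence)
    n = len(sentence)
    # best[i] = (log-probability of the best split of sentence[i:], end of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + best[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


result = segment("上海市长")
```

With these toy frequencies, the split into "上海" and "市长" scores higher than "上海市" followed by "长", which shows how word frequency resolves an ambiguous segmentation.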
On counting word frequency: the number of occurrences of each word is counted, and stop words, very high-frequency words, and very low-frequency words then need to be removed. In one or more embodiments of the disclosure, the words to be tested are matched against a dictionary to filter out stop words, words above a preset word frequency, and words below a preset word frequency. The dictionary may be a blacklist containing stop words, words above the preset frequency, and words below the preset frequency; any words to be tested that appear in it are filtered out and take no part in subsequent word-frequency statistics and coding. The dictionary may also be a whitelist containing preset high-weight words.
According to one or more embodiments of the disclosure, during or after word-frequency counting, the word frequencies of the modal particles among the words to be tested are weighted to increase their weight. As mentioned above, the inventors found by statistical induction that male and female game players show relatively significant statistical differences in several aspects of wording; for example, the proportion of modal particles used by female users is significantly higher than that of male users. (Modal particles are function words that express mood, commonly used at the end of a sentence or at a pause within it to express various moods, such as "eh", "ow", "tsk-tsk", and "uh-uh".) To improve the precision of gender discrimination, in this example words meeting the modal particle standard are preferably weighted.
Specifically, a modal particle dictionary is built in advance; during word-frequency counting, the word-frequency weight of any word to be tested that matches the modal particle dictionary is increased. As an example, the word frequency of a word meeting the modal particle standard is multiplied by 1.5 before being output.
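The weighting rule is simple enough to show directly. In this sketch the factor 1.5 is the example value given in the text, while the contents of `MODAL_PARTICLES` are hypothetical placeholders for the prebuilt modal particle dictionary.

```python
from collections import Counter

# Hypothetical stand-in for the prebuilt modal particle dictionary.
MODAL_PARTICLES = {"啊", "呀", "哦", "嗯嗯"}
MODAL_WEIGHT = 1.5  # example factor from the text


def weighted_frequencies(words):
    """Count words, multiplying the counts of modal particles by MODAL_WEIGHT."""
    counts = Counter(words)
    return {
        word: count * MODAL_WEIGHT if word in MODAL_PARTICLES else float(count)
        for word, count in counts.items()
    }


freqs = weighted_frequencies(["好", "啊", "啊"])
```

Here the particle "啊" occurs twice and is reported with a weighted frequency of 3.0, while the ordinary word keeps its raw count.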
There are multiple standards defining modal particles in Modern Chinese, and the number generally recognized in academia ranges from about 10 to 59. In this embodiment, following rules summarized over a long period, the default modal particle dictionary is built, as an example, on the classification of 17 modal particles in "Eight Hundred Words of Modern Chinese" (Lü Shuxiang, The Commercial Press, 1980), supplemented by compound modal particles (such as "and Ow" and "letting it pass"), reduplicated forms of the foregoing modal particles (such as "ha-ha", "tsk-tsk", and "uh-uh"), and variant forms (such as "ouch", "well", "ow", "aiyo", and "haha"). Other embodiments may use other modal particle standards, but those skilled in the art should understand that the scope of the modal particle dictionary is definite.
In addition, as can be understood from the above, the inventors found through statistical induction that female game players (users) use interrogative sentences at a significantly higher rate than male players. The text features may therefore also include the proportion of interrogative sentences in the text to be measured, to improve the accuracy of gender identification. The inventors also found that in Internet chat, game players do not use punctuation very rigorously; for example, punctuation at the end of a passage, including question marks, is usually omitted, so it is difficult to judge the sentence type solely by whether a question mark is present. The inventors further found the following statistical regularity: when female game players use interrogative sentences, a higher proportion of those sentences end with a modal particle. This embodiment therefore identifies interrogative sentences, by way of example, by combining interrogative words with modal particles: if the text to be measured contains an interrogative word (such as "what", "how", "who", "why") and ends with a modal particle, the text to be measured is judged to be an interrogative sentence.
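The interrogative rule above can be sketched as follows. The word lists are assumptions made for illustration: the interrogatives are taken to be the Chinese equivalents of "what", "how", "who" and "why", and the particle tuple is a hypothetical subset of the preset modal particle dictionary.

```python
# Assumed Chinese equivalents of "what", "how", "who", "why".
INTERROGATIVES = ("什么", "怎么", "谁", "为什么")
# Hypothetical subset of the preset modal particle dictionary.
ENDING_PARTICLES = ("吗", "呢", "啊", "吧")


def is_interrogative(text):
    """True if the text contains an interrogative word and ends with a modal particle."""
    text = text.strip()
    has_interrogative = any(word in text for word in INTERROGATIVES)
    ends_with_particle = text.endswith(ENDING_PARTICLES)
    return has_interrogative and ends_with_particle


def interrogative_ratio(messages):
    """Proportion of messages judged to be interrogative sentences."""
    if not messages:
        return 0.0
    return sum(is_interrogative(m) for m in messages) / len(messages)
```

The rule deliberately ignores question marks, since the description notes that chat messages often omit them.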
Regarding text encoding. A Huffman tree is built and used to encode the text. Every non-leaf node stores a parameter vector, and every leaf node represents one word in the dictionary. The initial value of each parameter vector is 0. After the Huffman tree has been built, the corresponding Huffman code is assigned to each word.
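As an illustration of how Huffman codes may be assigned from word frequencies, the following minimal sketch builds the tree with a priority queue. The function name and the choice of "0" for the first-popped (lower-frequency) branch are assumptions; the per-node parameter vectors mentioned above are omitted for brevity.

```python
import heapq
from itertools import count


def build_huffman_codes(word_freq):
    """Return {word: bit string}; more frequent words receive shorter codes."""
    tiebreak = count()  # avoids comparing tree nodes when frequencies are equal
    heap = [(freq, next(tiebreak), word) for word, freq in word_freq.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    # Repeatedly merge the two least frequent nodes into a new inner node.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}

    def walk(node, prefix):
        # Inner nodes are (left, right) tuples; leaves are the words themselves.
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes
```

With frequencies `{"a": 5, "b": 2, "c": 1}`, the most frequent word "a" sits one level below the root, while "b" and "c" share a deeper inner node.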
Regarding the embedding vector. The embedding vector is output using the CBOW model. In CBOW, the input of the training set is the sum of the word vectors of several surrounding words, and the output is the middle word. Starting from the root node, logistic classification is performed repeatedly along the Huffman tree; each classification moves one layer further down the Huffman tree and corrects the word vector, until a leaf node is finally reached. If the classification result is 1, the word should be assigned to the left node; otherwise it should be assigned to the right node.
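The chain of logistic classifications along the Huffman tree can be sketched as below. This is a simplified illustration of hierarchical softmax, not the disclosed training procedure: it only evaluates the probability of reaching a leaf and performs no vector updates. The function names are assumptions, and a code bit of "1" is taken here to mean the branch selected when the classifier outputs 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def context_vector(word_vectors):
    """CBOW input: element-wise sum of the surrounding words' vectors."""
    return [sum(components) for components in zip(*word_vectors)]


def path_probability(context_vec, path_params, code):
    """Probability of reaching the leaf with Huffman code `code`.

    `path_params` holds the parameter vector of each inner node on the path
    from the root; one logistic decision is made per inner node.
    """
    prob = 1.0
    for theta, bit in zip(path_params, code):
        p_one = sigmoid(sum(t * x for t, x in zip(theta, context_vec)))
        prob *= p_one if bit == "1" else 1.0 - p_one
    return prob
```

With all parameter vectors at their initial value of 0, every decision is a coin flip, so a leaf at depth 2 has probability 0.25.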
Regarding the step of substituting the extracted text features of the text to be measured into the classification model to judge the gender of the user corresponding to that text. In the present disclosure the classification model is based on a long short-term memory neural network (Long Short-Term Memory, LSTM), which is one kind of recurrent neural network (Recurrent Neural Network, RNN). RNNs, also called recurrent neural network sequences, are special neural networks that call themselves along a time series or a character sequence (depending on the application scenario); once unrolled along the sequence, an RNN becomes an ordinary three-layer neural network. RNNs are commonly applied in the fields of text and speech recognition. Because the semantics of a text are determined both by its words and by the context of those words, one or more embodiments of the present disclosure model both kinds of information at the same time using LSTM, a sequence-learning method from academia. This method can learn both long-range and short-range contextual dependencies in text, and is currently regarded in academia as one of the methods that models text information with the most complete fidelity. Specifically, the words in a chat text have an order, and different words express different meanings in different contexts. Conventional classification methods generally treat each word as an independent feature and therefore cannot model this kind of order-dependent contextual information, which leads to insufficient classification accuracy. LSTM, by contrast, learns the sequential contextual information between words through a specially constructed neural network with memory. For the mathematical principle of LSTM, refer to Fig. 4; the basic ideas are as follows:
● Forgetting mechanism: for example, when a scene ends, the model should reset the information related to that scene, such as location and time; and if a character dies, the model should also remember this. The model has an independent forgetting/memory mechanism, so that when there is a new input, the model knows which information should be discarded.
● Memory mechanism: when the model sees a piece of text, it can learn whether the text contains information worth using and saving.
● When there is a new input, the model first forgets the long-term memory information that is no longer of use, then learns which information in the new input is worth using, and stores it in long-term memory.
● Long-term memory is focused into working memory: the model learns which parts of long-term memory can be put to use immediately. It does not use the complete long-term memory all the time, but needs to know which parts are the key points.
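The four ideas listed above correspond to the forget gate, the input gate, the cell-state update and the output gate of a standard LSTM cell. The following single-step sketch uses the commonly published LSTM formulation, which is an assumption here since Fig. 4 is not reproduced; the parameter layout (`W`, `b` keyed by gate name) is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, W, b):
    """One time step of a standard LSTM cell, using plain Python lists.

    W[gate] is a list of weight rows, each of length len(x) + len(h_prev);
    b[gate] is the matching list of biases. Gates: "f", "i", "o", "g".
    """
    z = x + h_prev  # concatenate the input with the previous working memory

    def affine(gate):
        return [
            sum(w * v for w, v in zip(row, z)) + bias
            for row, bias in zip(W[gate], b[gate])
        ]

    f = [sigmoid(a) for a in affine("f")]    # forget gate: what to drop from long-term memory
    i = [sigmoid(a) for a in affine("i")]    # input gate: what in the new input is worth saving
    o = [sigmoid(a) for a in affine("o")]    # output gate: which part of memory to use now
    g = [math.tanh(a) for a in affine("g")]  # candidate content from the new input
    # Long-term memory: keep part of the old state and add the selected new content.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Working memory: a focused view of the long-term memory.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

In practice a deep-learning framework would supply this cell; the sketch only makes the role of each gate explicit.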
For the LSTM, the input is a vector whose length is the number of input-layer units; in the specific application, this vector is the embedding vector. Its length is unrelated to the number of time steps (i.e. the length of the sentence). The output at each time step is a probability distribution vector, and the index of its maximum value determines which word is output.
(1) Gender Classification training unit
A sample training set is generated from the players' chat texts and the players' gender labels, and is used to generate an LSTM-based user gender classification model. For the specific training flow, refer to Fig. 3. The flow consists of three parts: generating the training set, constructing the text features, and the classification algorithm. Constructing the text features is essentially the same as the above step of extracting the text features of the text to be measured, and is therefore not repeated here.
Regarding generating the training set. The training data that make up the training set consist of two parts: the text content and the player gender label corresponding to that text. First, the text content sent by each player in the chat room is collected to build a chat text library; this is the basic information source for judging player gender. Then the gender-labeled data are constructed: the present invention uses players' ID card information to label their gender. In the game registration information, some players have recorded ID card information. The gender of a player can be judged from the ID card information by rule (the user's ID card information is assumed here by default to be authentic and valid). The rules are as follows: for a 15-digit ID card number, the 15th digit represents gender, odd for male and even for female; for an 18-digit ID card number, the 17th digit represents gender, odd for male and even for female. These two parts of data constitute the training set used to train the user gender classification model. In training the classification model, the present disclosure uses the players who have recorded ID card information to provide a gender annotation library, and trains a classifier on that library and the corresponding chat texts; the classifier is then used to identify the gender of players who have not recorded ID card information.
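The labeling rule can be sketched as follows. The function name and the `"male"`/`"female"`/`None` return values are assumptions for illustration; like the description, the sketch assumes the ID card number is authentic and does not validate the checksum.

```python
def gender_from_id_number(id_number):
    """Return "male", "female", or None if the number has an unexpected form.

    15-digit numbers: the 15th digit encodes gender.
    18-digit numbers: the 17th digit encodes gender.
    Odd means male, even means female.
    """
    id_number = id_number.strip()
    if len(id_number) == 15:
        digit = id_number[14]
    elif len(id_number) == 18:
        digit = id_number[16]
    else:
        return None
    if digit not in "0123456789":
        return None
    return "male" if int(digit) % 2 == 1 else "female"
```

Players for whom the function returns `None` would simply be left out of the annotation library.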
Finally, the text features, including the embedding vector, are substituted into the trained LSTM-based user gender classification model, and the gender of the user corresponding to the text to be measured can then be judged.
Recognition effect authentication unit
To verify the technical effect of the recognition method, the present disclosure uses a recognition effect verification unit to verify the recognition rate of the method. The experimental setup and results are as follows. 10,000 player accounts were screened at random; these players had registered an ID card at game registration. The gender of each player (type: male/female) can be determined by the ID card rules. The text content of these players in the chat room over the most recent month was collected. The labeled data were compared with the prediction results output by the present invention, and the accuracy was computed, where accuracy is defined as the number of correctly predicted samples divided by the total number of samples. Verified on real data, the accuracy of the recognition method of the present disclosure is 81.5%. By comparison, the accuracy of the conventional method (judging gender from game-list proportions) is only 58%. It can be seen that the technical solution of the present disclosure is significantly better than the conventional method in terms of prediction stability.
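The accuracy measure used in this verification can be written out as a short sketch. The function name and the label strings are assumptions; the toy data in the usage note are illustrative only and are not the experimental data.

```python
def accuracy(labels, predictions):
    """Number of correctly predicted samples divided by the total number of samples."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        return 0.0
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)
```

For four labeled players of whom three are predicted correctly, the function returns 0.75.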
According to a second aspect of the present disclosure, a system for judging user gender based on message text is provided, comprising: a first module, which receives a text to be measured; a second module, which extracts text features of the text to be measured; and a third module, which substitutes the extracted text features of the text to be measured into a classification model to judge the gender of the user corresponding to the text to be measured, wherein the classification model is based on a long short-term memory neural network, and the text features include the word frequency and Huffman encoding of the text to be measured.
According to a third aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the steps of the method described in the first aspect of the present disclosure.
The foregoing is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that can readily be conceived by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. The protection scope of the present invention shall therefore be defined by the scope of the claims.
Claims (9)
1. A method for judging user gender based on message text, characterized by comprising the following steps:
Receiving a text to be measured;
Extracting text features of the text to be measured;
Substituting the extracted text features of the text to be measured into a classification model, and judging the gender of the user corresponding to the text to be measured,
Wherein the classification model is based on a long short-term memory neural network;
The text features include the word frequency and Huffman encoding of the text to be measured.
2. The method according to claim 1, characterized in that the step of extracting the text features of the text to be measured comprises the following sub-steps:
Segmenting the text to be measured into words, to generate one or more words to be measured corresponding to the text to be measured;
Counting the word frequency of the words to be measured;
Encoding the words to be measured based on a Huffman tree, to generate the Huffman encoding corresponding to the words to be measured;
Outputting the corresponding embedding vector using a CBOW model, based on the word frequency and Huffman encoding of the words to be measured,
The text features including the embedding vector.
3. The method according to claim 2, characterized in that the step of segmenting the text to be measured comprises the following sub-steps:
Building a directed acyclic graph of the text to be measured based on a segmentation dictionary, wherein words not included in the segmentation dictionary are segmented using the Viterbi algorithm of an HMM model;
Finding the maximum-probability path of the directed acyclic graph using dynamic programming;
Outputting the segmentation result corresponding to the maximum-probability path.
4. The method according to claim 2, characterized in that, before the step of counting the word frequency of the words to be measured, the method further comprises the following step:
Matching the words to be measured against a dictionary, to filter out stop words, words above a preset word frequency and words below a preset word frequency.
5. The method according to claim 2, characterized in that, before the step of outputting the corresponding embedding vector using the CBOW model based on the word frequency and Huffman encoding of the words to be measured, the method further comprises the following step:
Weighting the word frequency of the modal particles among the words to be measured, to increase the weight of the word frequency of modal particles.
6. The method according to claim 1, characterized in that the text features further include the proportion of interrogative sentences in the text to be measured.
7. The method according to claim 6, characterized in that, if the text to be measured contains an interrogative word and ends with a modal particle, the text to be measured is judged to be an interrogative sentence.
8. A system for judging user gender based on message text, characterized by comprising:
A first module, which receives a text to be measured;
A second module, which extracts text features of the text to be measured;
A third module, which substitutes the extracted text features of the text to be measured into a classification model and judges the gender of the user corresponding to the text to be measured,
Wherein the classification model is based on a long short-term memory neural network;
The text features include the word frequency and Huffman encoding of the text to be measured.
9. A computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the steps of the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711184662.9A CN107832304A (en) | 2017-11-23 | 2017-11-23 | A kind of method and system that user's sex is judged based on Message-text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107832304A true CN107832304A (en) | 2018-03-23 |
Family
ID=61652484
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711184662.9A Pending CN107832304A (en) | 2017-11-23 | 2017-11-23 | A kind of method and system that user's sex is judged based on Message-text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107832304A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
USH2187H1 (en) * | 2002-06-28 | 2007-04-03 | Unisys Corporation | System and method for gender identification in a speech application environment |
CN106844687A (en) * | 2017-01-23 | 2017-06-13 | 炫彩互动网络科技有限公司 | A kind of method and system that user's sex is determined based on games log |
CN107085581A (en) * | 2016-02-16 | 2017-08-22 | 腾讯科技(深圳)有限公司 | Short text classification method and device |
CN107229610A (en) * | 2017-03-17 | 2017-10-03 | 咪咕数字传媒有限公司 | The analysis method and device of a kind of affection data |
Non-Patent Citations (2)
Title |
---|
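戴斌 等 (Dai Bin et al.): "基于多类型文本的半监督性别分类方法研究" (Research on a semi-supervised gender classification method based on multi-type text), 《山西大学学报自然科学版》 (Journal of Shanxi University, Natural Science Edition) * |
曹志赟 (Cao Zhiyun): "语气词运用的性别差异" (Gender differences in the use of modal particles), 《语文研究》 * |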
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110532560A (en) * | 2019-08-30 | 2019-12-03 | 海南车智易通信息技术有限公司 | A kind of method and calculating equipment of generation text header |
CN112446210A (en) * | 2020-11-27 | 2021-03-05 | 广州三七互娱科技有限公司 | User gender prediction method and device and electronic equipment |
CN112446210B (en) * | 2020-11-27 | 2024-01-09 | 广州三七互娱科技有限公司 | User gender prediction method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180323 |