CN107832304A - Method and system for determining user gender based on message text - Google Patents

Method and system for determining user gender based on message text

Info

Publication number
CN107832304A
Authority
CN
China
Prior art keywords
text
measured
word
user
sex
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711184662.9A
Other languages
Chinese (zh)
Inventor
余建兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Xishan Mobile Game Technology Co Ltd
Zhuhai Kingsoft Online Game Technology Co Ltd
Original Assignee
Zhuhai Xishan Mobile Game Technology Co Ltd
Zhuhai Kingsoft Online Game Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Xishan Mobile Game Technology Co Ltd and Zhuhai Kingsoft Online Game Technology Co Ltd
Priority to CN201711184662.9A
Publication of CN107832304A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A method for determining user gender based on message text, comprising the following steps: receiving a text to be tested; extracting text features of the text to be tested; and inputting the extracted text features into a classification model to determine the gender of the user corresponding to the text to be tested, wherein the classification model is based on a long short-term memory (LSTM) neural network, and the text features include the word frequency and Huffman coding of the text to be tested. A system for determining user gender based on message text is also provided. Based on the significant differences between male and female users in language expression, the invention uses the contextual semantic relations, word frequency, and Huffman coding of the text content sent by users to achieve a higher degree of gender discrimination.

Description

Method and system for determining user gender based on message text
Technical Field
The present invention relates to the field of natural language processing, and in particular to a method and system for determining user gender based on message text.
Background
With the development of networks, online games have emerged and, delivered over the Internet, have rapidly opened up a market. An online game generally refers to a video game in which multiple players interact and are entertained over a computer network. Online games depend on the Internet and allow many people to participate at the same time, achieving communication, entertainment, and leisure through interpersonal interaction. When registering a game account, users often leave the gender field blank or fill in false gender information to hide their personal information, because personal attribute data raises privacy concerns. As a result, gender-related pushes cannot be matched to the user's needs. Gender identification is also widely applied in information retrieval and recommendation, social surveys, psychological diagnosis, and other areas, so inferring user gender has broad research prospects and practical value.
According to investigation, a method known in the industry makes the determination from the proportions of men and women who play each type of game (see Chinese invention patent application Publication No. CN106844687A, "A method and system for determining user gender based on game logs"). Such methods can be summarized as follows: first, count the proportion in which each game appears among users of known gender, to obtain the gender polarity of each game, that is, the male-to-female ratio for each game. For a user of unknown gender, collect the list of games the user has played, and determine the user's gender from that list by cumulative summation. This method relies on a single factor, and the male-to-female ratio of a game discriminates poorly, so accuracy is low. For example, judging gender by whether a user plays 《Honor of Kings》 is clearly not accurate.
Summary of the Invention
The invention aims to solve the problem that existing user gender identification methods discriminate poorly. According to a first aspect of the invention, a method for determining user gender based on message text is provided, comprising the following steps: receiving a text to be tested; extracting text features of the text to be tested; and inputting the extracted text features into a classification model to determine the gender of the user corresponding to the text to be tested, wherein the classification model is based on a long short-term memory neural network, and the text features include the word frequency and Huffman coding of the text to be tested.
Further, the step of extracting the text features of the text to be tested includes the following sub-steps: segmenting the text to be tested to generate one or more words to be tested corresponding to the text; counting the word frequency of the words to be tested; encoding the words to be tested based on a Huffman tree to generate the Huffman code of each word; and, based on the word frequency and Huffman coding of the words to be tested, outputting the corresponding embedding vector using a CBOW model, the text features including the embedding vector.
Further, the step of segmenting the text to be tested includes the following sub-steps: building a directed acyclic graph of the text to be tested based on a segmentation dictionary, wherein words not included in the segmentation dictionary are segmented using the Viterbi algorithm of an HMM model; finding the maximum-probability path of the directed acyclic graph using dynamic programming; and outputting the segmentation result corresponding to the maximum-probability path.
Further, before the step of counting the word frequency of the words to be tested, the method also includes the following step: matching the words to be tested against a dictionary to filter out stop words, words above a preset word frequency, and words below a preset word frequency.
Further, before the step of outputting the corresponding embedding vector using the CBOW model based on the word frequency and Huffman coding of the words to be tested, the method also includes the following step: weighting the word frequency of the modal particles among the words to be tested, to increase the weight of their word frequency.
Further, the text features also include the proportion of interrogative sentences in the text to be tested.
Further, if the text to be tested contains an interrogative word and ends with a modal particle, the text to be tested is judged to be an interrogative sentence.
According to a second aspect of the invention, a system for determining user gender based on message text is provided, including: a first module that receives a text to be tested; a second module that extracts text features of the text to be tested; and a third module that inputs the extracted text features into a classification model to determine the gender of the user corresponding to the text to be tested, wherein the classification model is based on a long short-term memory neural network, and the text features include the word frequency and Huffman coding of the text to be tested.
According to a third aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the steps of the method of the first aspect of the invention.
The beneficial effects of the invention are as follows: based on the significant differences between male and female users in language expression, the contextual semantic relations, word frequency, and Huffman coding of the text content sent by users are used to achieve a higher degree of gender discrimination.
Brief Description of the Drawings
To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a general flowchart of one or more embodiments of the invention;
Fig. 2 is a detailed diagram of one or more embodiments of the invention;
Fig. 3 is a flowchart of training the classification model;
Fig. 4 is a schematic diagram of the mathematical principle of LSTM.
Detailed Description
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled practitioners may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the invention.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided herein, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation. Multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in each embodiment of the invention may be integrated in one processing unit, each unit may exist physically on its own, or two or more units may be integrated in one unit.
If the functions are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
Those skilled in the art will understand that, in the description of the present application, terms such as "first", "second", "firstly", "secondly", "step one", and "step two" do not limit the order of execution unless otherwise stated. For example, "step three" may come before "step one", and "step four" may be carried out at the same time as "step two".
According to the general perception of society, male and female users show different trends in game preference. In general, men prefer active, intense activities: they like seeking excitement, taking the lead, and taking risks. Women, by contrast, are described as more sentimental and conservative: they prefer stability to excitement, prefer to be led, and prefer to be dependent. Men relatively prefer action and adventure games, while women relatively prefer casual-fun games. To improve game recommendation, different types of products can be recommended to users of different genders, which helps optimize the user experience and raises the success rate of pushes. For example, exciting games such as war games can be recommended to male users, and casual games such as flight games to female users.
On the other hand, existing games have limited channels for learning a user's gender. For example, some games do not require users to fill in real gender information; even when it is filled in, its authenticity cannot be verified. How to determine the gender of a game user is a technical problem in the industry, and solving it is necessary for accurate game distribution.
According to known literature, this problem has rarely been studied, because little data is discriminative and indicative of user gender. Investigation found one class of methods that makes the determination from the distribution of men and women over game types: for example, if most players of a certain type of game are men, a user who plays such a game is judged male by default. The basis for judgment in such methods is crude, and accuracy is very low. For example, the proportions of male and female users playing 《Honor of Kings》 are roughly equal, so whether a user plays that game is clearly not an accurate basis for judgment.
The inventor, on the other hand, recognized that, because men and women differ significantly in language expression, the text content users produce in chat rooms and similar channels can be used to determine user gender accurately, effectively solving this industry problem. Further, the identification result can be applied to the precise targeting of pushes, effectively raising the click-through rate and creating substantial commercial value, and so has industrial applicability.
The differences between male and female users in language expression are described below. Through long-term statistics and summarization, the inventor concluded that statistically significant gender differences exist in at least the following four aspects:
(1) Gender differences in vocabulary: these appear mainly as differences in word choice between men and women. Gender variation in language is richest in vocabulary, and women's language use shows several features:
1) Women favor modifiers such as adjectives and adverbs, for example "excellent", "what a nuisance", "greatly", "very beautiful", and the like. Men's language tends to be more casual, concise, and decisive.
2) Women use more modal particles, such as "eh", "ow", "tsk-tsk" (clicking the tongue), and "uh-uh", while men use them less. Women are described as more prone to mood swings than men; these swings show in their wording, and modal particles carry strong emotional appeal. Over time the habit becomes second nature, producing different pragmatic habits in the two genders.
3) Women use euphemistic and polite words. Women's emotions are typically more delicate and their temperament gentler; they generally do not refuse in harsh terms, their wording is milder and more indirect, and they take the listener's face into account. When expressing an opinion, they mostly use a tone of asking and discussing, which does not come across as aggressive. Even imperative sentences more often use a tone of request. Such requests in a mild tone carry no sense of compulsion or command and serve communication well.
4) Women use color words more commonly than men. Women are especially sensitive to color and, often unconsciously, use many color words in their language. From childhood, parents dress girls in bright, pretty clothes, so girls are exposed to a richer range of colors than boys. As education deepens and aesthetic taste improves, sensitivity to color grows stronger, and the use of color words can also express delicate emotion.
(2) Gender differences in grammar: women use interrogative sentences more than men and tend to follow established conventions in their use of language. Women are more inclined to use socially approved styles, and their linguistic forms are closer to the standard language; men tend to use styles that are not socially approved and are more innovative and variable.
(3) Differences in emotional expression: women are often more emotional and men more rational, and this frequently shows in conversation. When describing a personal experience, a man tends to recount the process calmly and analyze its gains and losses objectively, while a woman tends to focus on describing and evaluating particular fragments. She may suddenly break off to rejoice at how well some event turned out, or to grieve over the emotional trauma some event caused, so the narration is often interrupted by expressions of emotion. Likewise, when recounting a misfortune, a man focuses more on its objective consequences and may say repeatedly, "What should I do? What should I do? What should I do?", showing anxiety to find a way out of a difficult situation, while a woman focuses more on the subjective feelings the event triggered and may sigh, "My god, how am I supposed to go on living!" Thus men's speech is comparatively rational, while women's speech often carries emotion.
(4) Directness versus indirectness: women are more guarded, often do not say things directly, and prefer to talk around a point. For example, "This room is really stuffy!" may imply a wish to go out for a walk. Men speak frankly and to the point. The frequency with which women use euphemism is therefore much higher than that of men.
Based on the above findings, one purpose of the disclosure is to build a new algorithm that understands the chat text of game users in order to infer user gender. Those skilled in the art should understand, however, that the technical solution of the disclosure applies equally to identifying the gender of users from text messages such as short messages (SMS), email, and instant messaging (IM).
Referring to Figs. 1-3, according to one aspect of the disclosure, a method for determining user gender based on message text is provided, comprising the following steps: receiving a text to be tested; extracting text features of the text to be tested; and inputting the extracted text features into a classification model to determine the gender of the user corresponding to the text to be tested, wherein the classification model is based on a long short-term memory (Long Short-Term Memory, LSTM) neural network, and the text features include the word frequency and Huffman coding of the text to be tested. Based on the significant differences between male and female users in language expression, the method uses the contextual semantic relations, word frequency, and Huffman coding of the text content sent by users to achieve a higher degree of gender discrimination.
Regarding receiving the text to be tested: chat room messages, short messages (SMS), emails, or instant messaging (IM) text messages sent by a user can be captured, together with the user's unique ID, as raw data.
Regarding the step of extracting the text features of the text to be tested: for text content, the feature construction method commonly used in the industry is to perform Chinese word segmentation first and then perform word screening and feature representation. Specifically, Chinese word segmentation uses the open-source tool Jieba; word screening uses the TF-IDF ranking method commonly used in the industry to filter out words with little discriminating power; feature generation uses the common 0-1 word representation. TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for assessing how important a word is to one document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. TF-IDF screening therefore highlights words that occur frequently in one sentence or article but rarely in the corpus as a whole. This method discriminates well on well-formed text from specialized domains. For informal Internet chat text, however, the features it generates discriminate very poorly, and their mathematical expressive power is weak.
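The TF-IDF weighting described above can be illustrated with a minimal sketch. The function name `tf_idf` and the toy documents are hypothetical, and the formula used here (relative term frequency multiplied by the natural logarithm of the inverse document frequency) is one common variant, not necessarily the exact one used in the industry tools the text refers to.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document (docs are lists of words)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        scores.append({
            # tf = share of the document taken by the word; idf = log(N / df)
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return scores

# Toy corpus (assumed data): "game" occurs in every document, so it scores 0.
toy_docs = [["game", "raid", "raid"], ["game", "shop"], ["game", "chat"]]
toy_scores = tf_idf(toy_docs)
```

In this sketch a word that appears in every document receives a score of zero, which matches the idea that such words carry little discriminating power and can be screened out.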
For this reason, to build more discriminative features, one or more embodiments of the disclosure adopt a word embedding feature construction method; see Fig. 2 for the detailed flow. Word embedding can be understood through an example: judging whether the part of speech of a word is verb or noun. In engineering terms, there is a series of samples (x, y), where x is a word and y is its part of speech; the goal is to build a mapping f(x) -> y, where the mathematical model f accepts only numeric input. Abstractly, converting symbols (such as Chinese, English, or Latin) into numeric form can be likened to embedding them into a mathematical space; this kind of embedding is called word embedding. Word2vec, which is one kind of word embedding, was used in the experiments.
The step of extracting the text features of the text to be tested includes the following sub-steps: segmenting the text to be tested to generate one or more words to be tested corresponding to the text; counting the word frequency of the words to be tested; encoding the words to be tested based on a Huffman tree to generate the Huffman code of each word; and, based on the word frequency and Huffman coding of the words to be tested, outputting the corresponding embedding vector using a CBOW model. The specific steps are as follows.
Regarding Chinese word segmentation: in one or more embodiments, the step of segmenting the text to be tested includes the following sub-steps: building a directed acyclic graph of the text to be tested based on a segmentation dictionary, wherein words not included in the segmentation dictionary are segmented using the Viterbi algorithm of an HMM model; finding the maximum-probability path of the directed acyclic graph using dynamic programming; and outputting the segmentation result corresponding to the maximum-probability path. Word segmentation refers to cutting a sequence of Chinese characters into individual words, that is, recombining a continuous character sequence into a word sequence according to certain rules. In Latin-script writing, represented by English, spaces act as natural delimiters between words. In Chinese, characters, sentences, and paragraphs can be demarcated simply by obvious delimiters, but words have no formal delimiter. Although English also has a phrase segmentation problem, at the word level Chinese is much more complex and difficult than English. For example, to understand the Chinese text "give and punish to the person of spitting everywhere", it must be decided whether "the person of spitting everywhere" is itself one word or several words (for example "everywhere" and "the person of spitting", or "spitting everywhere" and "phlegm person"). Correct segmentation therefore has a vital influence on sentence understanding. In this example, a word-graph scan is first performed based on a prefix dictionary (the segmentation dictionary), generating a directed acyclic graph (Directed Acyclic Graph, abbreviated DAG) of all the possible ways the Chinese characters in the sentence can form words.
A prefix dictionary means that the words in the dictionary are arranged according to the prefixes they contain. For example, once the character "上" appears in the dictionary, the words beginning with "上" all appear in that part, such as "上海" (Shanghai), followed by "上海市" (Shanghai City), forming a hierarchical containment structure. In one or more embodiments, the segmentation dictionary includes the filter dictionary described later; because the content of the filter dictionary can be adjusted dynamically by an offline iterative update unit, the efficiency of segmentation with the dictionary can be improved. In addition, in one or more embodiments, there are multiple segmentation dictionaries with different contents, and the dictionary used to segment a text to be tested is selected according to the credit value of the user who sent it. For example, when the user's credit value is high, a segmentation dictionary with coarser granularity is selected (for example, segmentation stops at "Shanghai City" and does not split further into "Shanghai" and "City"), to simplify the segmentation process; when the user's credit value is low, a dictionary with finer granularity is selected, to achieve more precise segmentation. The concept of credit value is explained in more detail below.
Dynamic programming is then used to find the maximum-probability path, which gives the best segmentation combination based on word frequency. For words not registered in the segmentation dictionary, a hidden Markov model (Hidden Markov Model, abbreviated HMM) based on the word-forming ability of Chinese characters is used together with the Viterbi algorithm. The final output is the segmentation result corresponding to the maximum-probability path, which serves as the basis for the subsequent classification model.
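The DAG construction and the dynamic-programming search for the maximum-probability path can be sketched as follows. This is a simplified illustration under stated assumptions: `TOY_FREQ` is a hypothetical dictionary with made-up counts, the function names are illustrative, and the HMM/Viterbi handling of out-of-dictionary words is omitted (an unknown character simply falls back to a single-character word with an assumed count of 1).

```python
import math

# Hypothetical segmentation dictionary: word -> frequency (assumed counts).
TOY_FREQ = {"上": 5, "海": 5, "市": 5, "长": 5, "上海": 50, "上海市": 30, "市长": 20}

def build_dag(sentence, freq):
    """Map each start index to the end indices (inclusive) of dictionary words."""
    n = len(sentence)
    dag = {}
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in freq]
        dag[i] = ends or [i]  # fallback: treat an unknown character as a word
    return dag

def segment(sentence, freq):
    """Return the segmentation with the highest product of word probabilities."""
    n = len(sentence)
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sentence, freq)
    # best[i] = (best log-probability of sentence[i:], end index of first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):  # fill the table from right to left
        best[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - log_total + best[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:  # follow the recorded best choices from left to right
        j = best[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words

toy_result = segment("上海市长", TOY_FREQ)
```

With these assumed counts, the path "上海" + "市长" has a higher probability than "上海市" + "长" or any finer split, so the sketch prefers it. Changing the counts changes the preferred path, which is the sense in which the best segmentation is "based on word frequency".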
Regarding counting word frequency: the number of occurrences of each word is counted, and stop words, very high-frequency words, and very low-frequency words then need to be removed. In one or more embodiments of the disclosure, the words to be tested are matched against a dictionary to filter out stop words, words above a preset word frequency, and words below a preset word frequency. The dictionary can be a blacklist containing stop words, words above the preset word frequency, and words below the preset word frequency; any words to be tested that appear in the dictionary are filtered out and take no part in the subsequent word frequency statistics, encoding, and so on. The dictionary can also be a whitelist containing preset high-weight words.
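A minimal sketch of the counting and filtering step is given below. The function name, the thresholds, and the sample words are assumptions for illustration; the patent describes a prebuilt blacklist dictionary, whereas this sketch applies the stop-word set and the frequency bounds directly to the counted words.

```python
from collections import Counter

def count_and_filter(words, stop_words, min_count, max_count):
    """Count words, then drop stop words and words outside [min_count, max_count]."""
    counts = Counter(words)
    return {
        word: count
        for word, count in counts.items()
        if word not in stop_words and min_count <= count <= max_count
    }

# Assumed sample: "的" is a stop word, "哈" is too frequent, "副本" is too rare.
sample_words = ["的"] * 4 + ["哈"] * 9 + ["组队"] * 3 + ["副本"]
kept = count_and_filter(sample_words, stop_words={"的"}, min_count=2, max_count=5)
```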
According to one or more embodiments of the disclosure, during or after word frequency counting, the word frequency of the modal particles among the words to be tested is weighted to increase its weight. As mentioned above, the inventor found by statistical induction that male and female game players show statistically significant differences in several aspects of word use. For example, female users use modal particles (function words that express mood, commonly placed at the end of a sentence or at a pause within it, such as "eh", "ow", "tsk-tsk", and "uh-uh") at a significantly higher rate than male users. To improve the precision of gender discrimination, this example preferably weights the words that meet the modal particle standard.
Specifically, a modal particle dictionary is built in advance, and during word frequency counting the weight of the word frequency is raised for any word to be tested that matches the modal particle dictionary. As an example, the word frequency of a word to be tested that meets the modal particle standard is multiplied by 1.5 before it is output, to increase the weight of the word frequency of modal particles.
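The weighting rule above is simple enough to sketch directly. The factor 1.5 comes from the text; the function name and the small modal particle set are hypothetical stand-ins for the prebuilt modal particle dictionary.

```python
def weight_modal_particles(counts, modal_particles, factor=1.5):
    """Multiply the frequency of every modal particle by `factor`."""
    return {
        word: count * factor if word in modal_particles else float(count)
        for word, count in counts.items()
    }

# Assumed modal particle dictionary and word counts, for illustration only.
toy_particles = {"啊", "呢", "吧"}
weighted = weight_modal_particles({"啊": 2, "组队": 4}, toy_particles)
```

Here the count of the particle "啊" rises from 2 to 3.0 while the ordinary word keeps its count of 4.0.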
Modern Chinese has several standards for defining modal particles, and the number of modal particles generally recognized by scholars is between about 10 and 59. In this embodiment, according to rules summarized over a long period, the 17 modal particles of 《Eight Hundred Words of Modern Chinese》 (Lü Shuxiang, Commercial Press, 1980) are used by way of example as the basic classification standard, supplemented by compound modal particles (such as "Ow" and "letting it pass"), repeated forms of the foregoing modal particles (such as "ho-ho", "tsk-tsk", "uh-uh"), and variant written forms (such as "ouch", "well", "aiyo", "haha"), to form the preset modal particle dictionary. Other embodiments may use other modal particle standards, but those skilled in the art should understand that the scope of the modal particle dictionary is determinate.
In addition, according to above it will be appreciated that, inventor by statistical induction and the women game player (user) that finds using doubting The ratio of question sentence is significantly higher than male player.Therefore, text feature can also include the ratio of the interrogative sentence in text to be measured, with Improve the accuracy of Sexual discriminating.In addition, inventor also found in Internet chat, game player in punctuate use not Ten sectional specifications, such as the sentence tail of one section of word would generally omit punctuate including question mark, and whether therefore, it is difficult to only with having and ask Number judge sentence pattern type.In addition, inventor also found following statistical law:Women game player is when using interrogative sentence, companion With having, the ratio that modal particle ends up is higher.Therefore, the present embodiment is exemplarily using interrogative by the way of modal particle is combined Identify interrogative sentence sentence pattern, i.e., if in text to be measured with interrogative (such as " what ", " how ", " who ", " why ") And ended up with modal particle, judge the text to be measured for query sentence pattern.
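The interrogative rule described above can be sketched as follows. This is a hedged illustration: the Chinese renderings of "what", "how", "who", "why", the particle list, and the stripping of trailing punctuation are all assumptions made for the example, and the function names are hypothetical.

```python
# Assumed Chinese forms of the interrogative words named in the text.
INTERROGATIVES = ("为什么", "什么", "怎么", "谁")
# Hypothetical subset of the modal-particle dictionary.
MODAL_PARTICLES = ("啊", "吗", "呢", "吧", "呀")
# Assumption: trailing punctuation, when present, is ignored.
TRAILING_PUNCT = "?？!！。.~ "


def is_interrogative(text):
    """True if the text contains an interrogative word and ends with a modal particle."""
    stripped = text.rstrip(TRAILING_PUNCT)
    has_question_word = any(word in stripped for word in INTERROGATIVES)
    ends_with_particle = stripped.endswith(MODAL_PARTICLES)
    return has_question_word and ends_with_particle


def interrogative_ratio(messages):
    """Proportion of messages judged interrogative; 0.0 for an empty list."""
    if not messages:
        return 0.0
    return sum(is_interrogative(m) for m in messages) / len(messages)
```

The ratio returned by `interrogative_ratio` corresponds to the additional text feature mentioned in the paragraph above.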
On text encoding. The text is encoded by building a Huffman tree. Every non-leaf node stores a parameter vector, and each leaf node represents one word in the dictionary. The initial value of each parameter vector is 0. Once the Huffman tree has been built, the corresponding Huffman code is assigned to each word.
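As an illustration of the encoding step, the sketch below builds a Huffman tree from word frequencies and derives a code for each word. It is a generic Huffman construction under stated assumptions, not the applicant's code: the name `huffman_codes` is hypothetical, the bit convention (1 for the left branch, 0 for the right) is chosen to match the left/right convention mentioned in the next paragraph, and the per-node parameter vectors are omitted.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Map each word (a string) to its Huffman code, given a word -> frequency dict."""
    tie = count()  # unique tie-breaker so heap entries never compare nodes
    heap = [(f, next(tie), word) for word, f in freqs.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    # Repeatedly merge the two least frequent nodes into an internal node.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "1")
            walk(node[1], prefix + "0")
        else:  # leaf: a word
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


codes = huffman_codes({"a": 5, "b": 2, "c": 1})
```

More frequent words end up closer to the root and therefore receive shorter codes.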
On the embedding vector. The embedding vector is output by a CBOW model. In CBOW, the training input is the sum of the word vectors of several surrounding words, and the output is the word in the middle. Starting from the root node, logistic classification is performed repeatedly along the Huffman tree; each classification step moves one level down the tree and corrects the word vectors, until a leaf node is finally reached. If the classification result is 1, the sample is assigned to the left child; otherwise it is assigned to the right child.
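The walk down the Huffman tree can be sketched as a product of logistic decisions. This is an illustrative reading of the paragraph above, under assumptions: the probability of branching left is taken to be the sigmoid of the dot product between the node's parameter vector and the summed context vector, training updates are omitted, and all names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def context_vector(word_vectors):
    """Sum the vectors of the surrounding words, as in the CBOW input."""
    return [sum(components) for components in zip(*word_vectors)]


def path_probability(context, path):
    """Probability of reaching a leaf.

    `path` is a list of (node_parameter_vector, bit) pairs from the root down;
    bit 1 means "go left", bit 0 means "go right" (assumed convention).
    """
    prob = 1.0
    for theta, bit in path:
        p_left = sigmoid(sum(t * x for t, x in zip(theta, context)))
        prob *= p_left if bit == 1 else 1.0 - p_left
    return prob


ctx = context_vector([[1.0, 2.0], [3.0, 4.0]])
# With parameter vectors still at their initial value 0, every branch is a coin flip.
p = path_probability(ctx, [([0.0, 0.0], 1), ([0.0, 0.0], 0), ([0.0, 0.0], 1)])
```

In training, the parameter vectors and word vectors would be adjusted so that the path of the true middle word becomes more probable; that update step is not shown here.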
On the step of feeding the extracted text features of the text under test into the classification model and determining the gender of the user corresponding to that text. In this disclosure the classification model is based on a long short-term memory neural network (Long Short-Term Memory, LSTM), which is a kind of recurrent neural network (Recurrent Neural Network, RNN). An RNN is a special neural network that calls itself along a time sequence or a character sequence, depending on the application scenario; once unrolled along the sequence, it becomes an ordinary three-layer neural network. RNNs are commonly applied to text and speech recognition. Because the semantics of a text are determined both by its words and by the context in which the words appear, one or more embodiments of the present disclosure model both kinds of information at once by using LSTM, a sequence-learning method from academia. LSTM can learn contextual information with long-range dependencies in text and is described as one of the most complete methods currently available in academia for faithfully modeling textual information. Specifically, the words in a chat text occur in a definite order, and the same word can express different meanings in different contexts. Conventional classification methods usually treat each word as an independent feature, cannot model this kind of order-dependent contextual information, and therefore have insufficient classification accuracy. LSTM instead learns the sequential context between words by building a customized neural network with memory. The mathematical principle of LSTM is shown in Fig. 4; the basic ideas are as follows:
● Forgetting mechanism: for example, when a scene ends, the model should reset the information related to that scene, such as location and time; and when a character dies, the model should remember that fact. The model has an independent forgetting/memory mechanism, so when new input arrives it knows which information should be discarded.
● Memory mechanism: when the model sees a piece of text, it learns whether the text contains information worth using and saving.
● When new input arrives, the model forgets the long-term memory that is no longer useful, learns which information in the new input is worth using, and stores it in long-term memory.
● Long-term memory is focused into working memory. The model learns which parts of long-term memory are immediately useful. It does not always use the complete long-term memory; instead it knows which parts matter most.
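The four ideas above map naturally onto the gates of a standard LSTM cell: the forget gate discards long-term memory, the input gate and candidate values write new information, and the output gate focuses long-term memory into working memory. The sketch below implements one time step of the textbook LSTM formulation in plain Python. It is an assumption that this matches the formulation in Fig. 4, and the parameter layout (`params[name] = (W, U, b)`) is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    x: input vector; h_prev: previous working memory (hidden state);
    c_prev: previous long-term memory (cell state).
    params[name] = (W, U, b) for name in "f", "i", "o", "g", where W is
    hidden x input, U is hidden x hidden, and b has length hidden.
    """

    def affine(name):
        W, U, b = params[name]
        return [
            sum(w * xi for w, xi in zip(W[k], x))
            + sum(u * hj for u, hj in zip(U[k], h_prev))
            + b[k]
            for k in range(len(b))
        ]

    f = [sigmoid(v) for v in affine("f")]    # forget gate: what to drop
    i = [sigmoid(v) for v in affine("i")]    # input gate: what to store
    o = [sigmoid(v) for v in affine("o")]    # output gate: what to expose
    g = [math.tanh(v) for v in affine("g")]  # candidate new information
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

A sequence is processed by calling `lstm_step` once per time step and passing the returned `h` and `c` into the next call.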
For the LSTM, the input is a vector whose length equals the number of input-layer units; in a concrete application this vector is the embedding vector. Its length is unrelated to the number of time steps (i.e. the length of the sentence). The output at each time step is a probability-distribution vector, and the index of its maximum value determines which word is output.
(1) Gender classification training unit
Based on players' chat text and their gender labels, a training sample set is generated and used to produce an LSTM-based user gender classification model. The training flow is shown in Fig. 3. It consists of three parts: generating the training set, constructing text features, and the classification algorithm. Constructing text features is essentially the same as the step of extracting the text features of the text under test described above, so it is not repeated here.
On generating the training set. The training data consist of two parts: the text content and the gender label of the player who sent it. First, the text that each player sends in the chat room is collected to build a chat text corpus; this is the basic information source for judging a player's gender. Next, the gender-labeled data are constructed: the present invention uses players' ID card information to label their gender. Some players have recorded ID card information in their game registration details. A player's gender can be determined from this information by rule (the user's ID card information is assumed here to be authentic and valid), as follows: for a 15-digit ID card number, the 15th digit indicates gender; for an 18-digit ID card number, the 17th digit indicates gender; in both cases an odd digit means male and an even digit means female. These two parts together form the training set for the user gender classification model. In training the classification model, the disclosure uses the players who have recorded ID card information to provide the gender annotation corpus, trains the classifier on that corpus and the corresponding chat text, and uses it to identify the gender of players who have not recorded ID card information.
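The labeling rule can be sketched in a few lines. The function name `gender_from_id` is hypothetical, the ID strings used below are synthetic placeholders rather than real numbers, and, as in the text, the ID information is assumed to be authentic; no checksum or date validation is attempted.

```python
def gender_from_id(id_number):
    """Return "male", "female", or None if the rule cannot be applied.

    15-digit numbers: the 15th digit encodes gender.
    18-digit numbers: the 17th digit encodes gender.
    Odd means male, even means female.
    """
    s = id_number.strip()
    if len(s) == 15:
        digit = s[14]
    elif len(s) == 18:
        digit = s[16]
    else:
        return None
    if not digit.isdigit():
        return None
    return "male" if int(digit) % 2 == 1 else "female"


label_a = gender_from_id("0" * 14 + "1")        # synthetic 15-digit number
label_b = gender_from_id("0" * 16 + "2" + "X")  # synthetic 18-digit number
```

Players for whom the function returns `None` would simply be left out of the labeled training set.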
Finally, the text features, including the embedding vector, are fed into the trained LSTM-based user gender classification model to determine the gender of the user corresponding to the text under test.
Recognition-effect verification unit
To verify the technical effect of the recognition method, the disclosure uses a recognition-effect verification unit to verify its recognition rate. The experimental setup and results are as follows. 10,000 player accounts were selected at random; these players had registered ID card information when registering for the game, so each player's gender (male/female) could be determined by the ID card rule. The text these players sent in the chat room during the most recent month was collected. The labeled data were compared with the predictions output by the present invention, and the accuracy was computed. Accuracy is defined as the number of correctly classified samples divided by the total number of samples. Verified on real data, the recognition method of this disclosure achieved an accuracy of 81.5%. By comparison, the conventional method (judging gender from game-list proportions) achieved an accuracy of only 58%. The technical solution of this disclosure is therefore significantly better than the conventional method in prediction stability.
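The accuracy metric defined above is simply the share of matching labels. A minimal sketch follows; the function name is hypothetical and the four-sample data are illustrative only, not figures from the experiment.

```python
def accuracy(predicted, actual):
    """Number of correctly classified samples divided by the total number of samples."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy example with made-up labels.
score = accuracy(
    ["male", "female", "male", "male"],
    ["male", "female", "female", "male"],
)
```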
According to a second aspect of the disclosure, a system for determining a user's gender based on message text is provided, comprising: a first module that receives the text under test; a second module that extracts the text features of the text under test; and a third module that feeds the extracted text features of the text under test into a classification model and determines the gender of the user corresponding to that text, wherein the classification model is based on a long short-term memory neural network, and the text features include the word frequencies and Huffman codes of the text under test.
According to a third aspect of the disclosure, a computer-readable storage medium is provided on which a computer program is stored; when executed by a processor, the program implements the steps of the method of the first aspect of the disclosure.
The foregoing is only a specific embodiment of the invention, and the protection scope of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. The protection scope of the invention shall therefore be determined by the scope of the claims.

Claims (9)

1. A method for determining a user's gender based on message text, characterized in that it comprises the following steps:
receiving a text under test;
extracting the text features of the text under test;
feeding the extracted text features of the text under test into a classification model, and determining the gender of the user corresponding to the text under test,
wherein the classification model is based on a long short-term memory neural network; and
the text features include the word frequencies and Huffman codes of the text under test.
2. The method according to claim 1, characterized in that the step of extracting the text features of the text under test comprises the following sub-steps:
segmenting the text under test into words, to generate one or more words under test corresponding to the text under test;
counting the word frequencies of the words under test;
encoding the words under test based on a Huffman tree, to generate the Huffman codes corresponding to the words under test;
outputting the corresponding embedding vector using a CBOW model, based on the word frequencies and Huffman codes of the words under test,
wherein the text features include the embedding vector.
3. The method according to claim 2, characterized in that the step of segmenting the text under test comprises the following sub-steps:
building a directed acyclic graph of the text under test based on a segmentation dictionary, wherein words not included in the segmentation dictionary are segmented using the Viterbi algorithm of an HMM model;
finding the maximum-probability path of the directed acyclic graph using dynamic programming;
outputting the segmentation result corresponding to the maximum-probability path.
4. The method according to claim 2, characterized in that, before the step of counting the word frequencies of the words under test, the method further comprises the following step:
matching the words under test against a dictionary, to filter out stop words, words above a preset word frequency, and words below a preset word frequency.
5. The method according to claim 2, characterized in that, before the step of outputting the corresponding embedding vector using the CBOW model based on the word frequencies and Huffman codes of the words under test, the method further comprises the following step:
weighting the word frequencies of the modal particles among the words under test, to increase the weight of modal-particle frequencies.
6. The method according to claim 1, characterized in that the text features further include the proportion of interrogative sentences in the text under test.
7. The method according to claim 6, characterized in that, if the text under test contains an interrogative word and ends with a modal particle, the text under test is judged to be an interrogative sentence.
8. A system for determining a user's gender based on message text, characterized in that it comprises:
a first module that receives a text under test;
a second module that extracts the text features of the text under test;
a third module that feeds the extracted text features of the text under test into a classification model and determines the gender of the user corresponding to the text under test,
wherein the classification model is based on a long short-term memory neural network; and
the text features include the word frequencies and Huffman codes of the text under test.
9. A computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the steps of the method of any one of claims 1-7.
CN201711184662.9A 2017-11-23 2017-11-23 A kind of method and system that user's sex is judged based on Message-text Pending CN107832304A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711184662.9A CN107832304A (en) 2017-11-23 2017-11-23 A kind of method and system that user's sex is judged based on Message-text

Publications (1)

Publication Number Publication Date
CN107832304A true CN107832304A (en) 2018-03-23

Family

ID=61652484

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711184662.9A Pending CN107832304A (en) 2017-11-23 2017-11-23 A kind of method and system that user's sex is judged based on Message-text

Country Status (1)

Country Link
CN (1) CN107832304A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
USH2187H1 (en) * 2002-06-28 2007-04-03 Unisys Corporation System and method for gender identification in a speech application environment
CN106844687A (en) * 2017-01-23 2017-06-13 炫彩互动网络科技有限公司 A kind of method and system that user's sex is determined based on games log
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device
CN107229610A (en) * 2017-03-17 2017-10-03 咪咕数字传媒有限公司 The analysis method and device of a kind of affection data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
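Dai Bin et al.: "Research on a semi-supervised gender classification method based on multiple types of text" (基于多类型文本的半监督性别分类方法研究), Journal of Shanxi University (Natural Science Edition) (《山西大学学报自然科学版》) *
Cao Zhiyun: "Gender differences in the use of modal particles" (语气词运用的性别差异), Linguistic Research (《语文研究》) *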

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532560A (en) * 2019-08-30 2019-12-03 海南车智易通信息技术有限公司 A kind of method and calculating equipment of generation text header
CN112446210A (en) * 2020-11-27 2021-03-05 广州三七互娱科技有限公司 User gender prediction method and device and electronic equipment
CN112446210B (en) * 2020-11-27 2024-01-09 广州三七互娱科技有限公司 User gender prediction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180323
