CN113342959A - User topic label generation method and system based on enterprise and micro discussion group - Google Patents

Publication number
CN113342959A
Authority
CN
China
Prior art keywords
word
discussion
priority
word vector
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110757295.7A
Other languages
Chinese (zh)
Inventor
黄楷
梁新敏
陈羲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Minglue Zhaohui Technology Co Ltd
Original Assignee
Beijing Minglue Zhaohui Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a method and a system for generating user topic labels based on an enterprise WeChat discussion group. The method comprises: a word vector training step, in which different discussion labels are preset according to industry type, corresponding external word vectors are obtained by screening according to the discussion labels, and word vectors are trained from the session archive in combination with the external word vectors; a priority word obtaining step, in which the similarity between each word vector and the discussion labels is calculated and the words whose word vectors meet a preset condition are added to a priority word list; a session marking step, in which a word segmentation system scans the session archive and marks it according to the priority words and preset stop words; and a topic label generation step, in which the marking result is processed to obtain the user topic labels. The application thereby quickly constructs the topic labels discussed by users from massive user session information.

Description

User topic label generation method and system based on enterprise WeChat discussion group
Technical Field
The application relates to the technical field of data processing, and in particular to a method and a system for generating user topic labels based on an enterprise WeChat discussion group.
Background
In consumer-facing (toC) enterprise WeChat services, a company operator typically uses enterprise WeChat to add clients to an enterprise WeChat discussion group in order to run marketing campaigns. In this setting, operators can market actively through the discussion group, for example by responding to user topics or advertising product efficacy. Users can also discuss everyday topics in the enterprise WeChat discussion group.
The operator can record the text that users post in the enterprise WeChat discussion group by using the session archiving function of enterprise WeChat. Correctly identifying the topics that users discuss helps the operator construct user labels. Based on these labels, the operator can carry out marketing targeted at different types of users, or discover the topics that users discuss and use them to assist the preparation of operation materials.
In the traditional method for constructing user labels from discussion group information, an operator generally reads the chat information manually and marks it. However, this approach has the following bottlenecks:
when the enterprise WeChat session archive reaches a certain scale (for example, many messages or many discussion groups), manual marking progresses slowly; meanwhile, different operators apply different marking standards and cannot judge whether a topic discussed by a user is a popular one, so the resulting labels may be long-tail data (for example, a label that hits only one person) and do little to help operators carry out subsequent marketing activities.
At present, the related technology offers no effective solution to the slow progress of manual marking.
Disclosure of Invention
The embodiments of the application provide a user topic label generation method and system based on an enterprise WeChat discussion group, so as to at least solve the problem of slow manual marking in the related technology.
In a first aspect, an embodiment of the present application provides a method for generating user topic labels based on an enterprise WeChat discussion group, including the following steps:
a word vector training step, wherein different discussion labels are preset according to industry types, corresponding external word vectors are obtained through screening according to the different discussion labels, and the external word vectors are combined with session archiving training word vectors;
a priority word obtaining step, namely calculating the similarity between the word vector and the discussion label, and adding the word vector meeting preset conditions into a priority word list;
a conversation marking step, namely scanning a conversation archive by using a word segmentation system and marking the conversation archive according to the priority words and preset stop words;
and a step of generating a topic label, namely processing the marking result to obtain the user topic label.
In some of these embodiments, the word vector training step further comprises:
an external word vector screening step, namely acquiring the Tencent AI Lab word vectors from Tencent AI Lab, calculating the Euclidean distance between each Tencent AI Lab word vector and the vector corresponding to the discussion label, and screening according to the Euclidean distance to obtain the external word vectors;
and a word vector output step, namely preprocessing the session archive to obtain the corresponding one-hot vectors, inputting the one-hot vectors and the external word vectors into a Word2vec model, and outputting the word vectors.
In some embodiments, the priority word acquiring step specifically includes:
the cosine similarity of the word vector and the label word vector corresponding to the discussion label is calculated by the following formula,
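$$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$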
wherein A, B represents the word vector and the word vector corresponding to the discussion tag, respectively, n represents the total dimension, i represents the ith dimension,
and when the cosine similarity is greater than a preset threshold value, adding the participle corresponding to the word vector into the priority word list.
In some embodiments, the session marking step specifically includes:
a session archiving and scanning step, namely adding the priority words and preset stop words into a jieba word segmentation system, and scanning the session archiving by using the jieba word segmentation system;
and an information output step, namely deleting the corresponding participles in the conversation archive according to the preset stop words, and outputting the corresponding priority words, the discussion tags and the speaking users which are hit in the rest part of the conversation archive.
In some embodiments, the topic tag generating step specifically includes:
an information duplication removing step, namely removing duplication of the corresponding priority words and the speaking users output in the information output step to obtain hit priority words and the number of the corresponding speaking users;
a low-frequency data cleaning step, namely counting the variance and the mean of the number of the hit priority words and the speaking users, and filtering by using a 3sigma principle;
and a topic label generation step, namely outputting the information of the speaking user-discussion label according to the filtering result, and obtaining the user topic label according to the information.
In some of these embodiments, the filtering conditions in the low frequency data washing step further include:
and when the number of speaking users corresponding to a hit priority word is less than a set value, filtering out the hit priority word, wherein the set value is the mean minus 3 standard deviations.
In a second aspect, an embodiment of the present application provides a system for generating user topic labels based on an enterprise WeChat discussion group, which applies the user topic label generation method of the first aspect and includes:
the word vector training module presets different discussion labels according to the industry types, corresponding external word vectors are obtained through screening according to the different discussion labels, and the external word vectors are combined with session archiving training word vectors;
the priority word acquisition module is used for calculating the similarity between the word vectors and the discussion labels and adding the word vectors meeting the preset conditions into a priority word list;
the session marking module is used for scanning the session archive by using a word segmentation system and marking the session archive according to the priority words and the preset stop words;
and the topic label generating module is used for processing the marking result to obtain the user topic label.
In some of these embodiments, the topic tag generation module comprises:
the information duplication removing unit is used for receiving the corresponding priority words and the speaking users output by the session marking module and removing duplication of the corresponding priority words and the speaking users to obtain hit priority words and the number of the corresponding speaking users;
the low-frequency data cleaning unit is used for counting the variance and the mean value of the number of the hit priority words and the number of the speaking users and filtering by using a 3sigma principle;
and the topic label generating unit outputs the information of the speaking user-discussion label according to the filtering result and obtains the user topic label according to the information.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the method for generating user topic labels based on an enterprise WeChat discussion group as described in the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method for generating user topic labels based on an enterprise WeChat discussion group as described in the first aspect above.
Compared with the related technology, the method and the system for generating the user topic labels based on the enterprise and micro discussion group can be applied to the technical field of data processing and the technical field of data mining.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flowchart of a method for generating user topic labels based on an enterprise WeChat discussion group according to an embodiment of the present application;
FIG. 2 is a flowchart of the word vector training step according to an embodiment of the present application;
FIG. 3 is a flowchart of the session marking step according to an embodiment of the present application;
FIG. 4 is a flowchart of the topic label generation step according to an embodiment of the present application;
FIG. 5 is a flowchart of another method for generating user topic labels based on an enterprise WeChat discussion group according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for generating user topic labels based on an enterprise WeChat discussion group according to a preferred embodiment of the present application;
FIG. 7 is a block diagram of a system for generating user topic labels based on an enterprise WeChat discussion group according to an embodiment of the present application;
FIG. 8 is a block diagram of a preferred structure of a system for generating user topic labels based on an enterprise WeChat discussion group according to an embodiment of the present application;
FIG. 9 is a hardware configuration diagram of a computer device according to an embodiment of the present application.
Description of reference numerals:
a word vector training module 1; a priority word acquisition module 2; a session marking module 3;
a topic label generation module 4; an information deduplication unit 41; a low frequency data washing unit 42;
a topic tag generation unit 43; an external word vector screening unit 11; a word vector output unit 12;
a session archive scanning unit 31; an information output unit 32; a processor 81;
a memory 82; a communication interface 83; a bus 80.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
This embodiment provides a user topic label generation method based on an enterprise WeChat discussion group. FIG. 1 is a flowchart of a method for generating user topic labels based on an enterprise WeChat discussion group according to an embodiment of the present application. As shown in FIG. 1, the flow includes the following steps:
a word vector training step S1, presetting different discussion labels according to industry types, and screening to obtain corresponding external word vectors according to the different discussion labels, wherein the external word vectors are combined with session archiving training word vectors;
a priority word obtaining step S2 of calculating similarity between the word vector and the discussion tag, and adding the word vector satisfying a preset condition to a priority word list;
a session marking step S3, scanning the session archive by using a word segmentation system and marking the session archive according to the priority words and the preset stop words;
and a topic label generation step S4, wherein the user topic label is obtained after the marking result is processed.
Through the above steps, user labels are constructed from the enterprise WeChat session archive using the preset discussion labels and the word vectors, which accelerates the construction of discussion labels from user sessions and reduces the marking workload of operators.
It should be noted that word vector technology describes the similarity between words well; that is, each word or phrase of the vocabulary is mapped to a vector of real numbers.
In some embodiments, FIG. 2 is a flowchart of the word vector training step according to an embodiment of the present application. As shown in FIG. 2, the word vector training step S1 further includes:
an external word vector screening step S11, acquiring the Tencent AI Lab word vectors from Tencent AI Lab, calculating the Euclidean distance between each Tencent AI Lab word vector and the vector corresponding to the discussion label, and screening according to the Euclidean distance to obtain the external word vectors;
the Euclidean distance is the actual distance between two points in an n-dimensional space. Similarity is calculated from the Euclidean distance: the smaller the Euclidean distance, the greater the similarity. The specific calculation method is as follows:
given two points A = (a_1, a_2, ..., a_n) and B = (b_1, b_2, ..., b_n), the Euclidean distance between A and B is
$$d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$
and a word vector output step S12, preprocessing the session archive to obtain the corresponding one-hot vectors, inputting the one-hot vectors and the external word vectors into a Word2vec model, and outputting the word vectors.
In the above steps, the word vectors are obtained from the session archive with the assistance of external word vectors. The Tencent AI Lab word vectors are first compared with the preset discussion labels by similarity calculation to obtain the external word vectors similar to the discussion labels, and the external word vectors then supplement the training process of the word vectors.
This addresses the problem that a single message in a discussion session is usually short, so that word vectors trained directly on the discussion sessions alone have difficulty characterizing word information in human language.
The external word vector method may be replaced by another Chinese near-synonym search technique or by manual judgment.
It should be noted that the external word vector is generally a dense vector whose dimension may be set to 200, but the present invention is not limited thereto.
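For illustration only, the external word vector screening of step S11 might be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the `max_distance` cut-off and the toy 3-dimensional vectors are hypothetical and do not come from the application, which only states that screening is performed according to the Euclidean distance.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def screen_external_vectors(external_vectors, label_vectors, max_distance):
    """Keep external words lying within max_distance of any discussion label.

    max_distance is an assumed screening rule; the source does not give one.
    """
    selected = {}
    for word, vec in external_vectors.items():
        for label_vec in label_vectors.values():
            if euclidean_distance(vec, label_vec) <= max_distance:
                selected[word] = vec
                break
    return selected


# Toy vectors (hypothetical); real external vectors might have 200 dimensions.
external = {
    "lip gloss": [0.9, 0.1, 0.0],
    "foundation": [0.7, 0.3, 0.1],
    "football": [0.0, 0.1, 0.9],
}
labels = {"lipstick": [1.0, 0.0, 0.0]}
screened = screen_external_vectors(external, labels, max_distance=0.5)
```

In this toy data, "football" lies far from the "lipstick" label vector and is therefore screened out, while the two makeup-related words are kept.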
In some embodiments, the priority word obtaining step S2 specifically includes:
the cosine similarity of the word vector and the label word vector corresponding to the discussion label is calculated by the following formula,
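$$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$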
wherein A, B represents the word vector and the word vector corresponding to the discussion tag, respectively, n represents the total dimension, i represents the ith dimension,
and when the cosine similarity is greater than a preset threshold value, adding the participle corresponding to the word vector into the priority word list.
The similarity between the word corresponding to a word vector and a discussion label may alternatively be judged by the Euclidean distance or the Hamming distance between the vectors.
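The priority word obtaining step S2 can be illustrated with the following minimal sketch. The function names and toy vectors are hypothetical, and assigning each word to its single most similar label is an assumption made here for simplicity; the application only requires the cosine similarity to exceed a preset threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def build_priority_words(word_vectors, label_vectors, threshold):
    """Map each word whose similarity exceeds threshold to its closest label."""
    priority = {}
    for word, vec in word_vectors.items():
        best_label, best_sim = None, threshold
        for label, label_vec in label_vectors.items():
            sim = cosine_similarity(vec, label_vec)
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_label is not None:
            priority[word] = best_label
    return priority


# Toy 3-dimensional vectors (hypothetical).
word_vectors = {
    "lip gloss": [0.9, 0.2, 0.0],
    "moisturizer": [0.1, 0.9, 0.1],
    "football": [0.0, 0.1, 0.9],
}
label_vectors = {"lipstick": [1.0, 0.0, 0.0], "skin care": [0.0, 1.0, 0.0]}
priority_words = build_priority_words(word_vectors, label_vectors, threshold=0.6)
```

Keeping the label alongside each priority word means a later marking step can report both the hit word and the discussion label it represents.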
In some embodiments, fig. 3 is a flowchart of a session marking step according to an embodiment of the present application, and as shown in fig. 3, the session marking step S3 specifically includes:
a session archiving and scanning step S31, adding the priority words and the preset stop words into a jieba word segmentation system, and scanning the session archiving by using the jieba word segmentation system;
and an information output step S32, deleting the corresponding participles in the conversation archive according to the preset stop words, and outputting the corresponding priority words, discussion labels and speaking users which are hit in the rest part of the conversation archive.
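As a rough sketch of steps S31 and S32, the marking logic could look like the code below. To keep the example self-contained, whitespace splitting stands in for the jieba word segmentation system; with jieba installed, one would presumably register the priority words via `jieba.add_word` and pass `jieba.lcut` as the `segment` argument. The function name, record fields and sample messages are hypothetical.

```python
def mark_sessions(messages, priority_words, stop_words, segment=str.split):
    """Return one hit record per priority word found in the session archive.

    messages       -- iterable of (user, text) pairs from the session archive
    priority_words -- dict mapping priority word -> discussion label
    stop_words     -- set of tokens removed before matching
    segment        -- tokenizer; whitespace split is a stand-in for a segmenter
    """
    hits = []
    for user, text in messages:
        # Drop stop words first, then look for priority words in what remains.
        tokens = [t for t in segment(text) if t not in stop_words]
        for token in tokens:
            label = priority_words.get(token)
            if label is not None:
                hits.append(
                    {"session": text, "word": token, "label": label, "user": user}
                )
    return hits


messages = [
    ("u1", "we love this lipstick"),
    ("u2", "the serum of the year"),
    ("u3", "see you tomorrow"),
]
priority = {"lipstick": "lipstick", "serum": "skin care"}
stop_words = {"we", "of", "the"}
hits = mark_sessions(messages, priority, stop_words)
```

Each record carries the original session text, the hit priority word, the discussion label and the speaking user, which are the fields the information output step is described as producing.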
In some embodiments, FIG. 4 is a flowchart of the topic label generation step according to an embodiment of the present application. As shown in FIG. 4, the topic label generation step S4 specifically includes:
an information duplication removing step S41, wherein the corresponding priority words and the speaking users output in the information output step are subjected to duplication removal to obtain hit priority words and the number of corresponding speaking users;
a low-frequency data cleaning step S42, counting the variance and mean of the hit priority words and the number of speaking users, and filtering by using a 3sigma principle;
a topic label generation step S43, which outputs information of the speaking user-discussion label according to the filtering result, and obtains the user topic label accordingly.
The 3sigma principle states that, for a normal distribution, the probability that a value falls within (μ - 3σ, μ + 3σ) is approximately 0.9973, where σ denotes the standard deviation, μ denotes the mean, and x = μ is the axis of symmetry of the density curve. Filtering out the hit priority words whose speaking-user counts fall outside (μ - 3σ, μ + 3σ) effectively reduces the information redundancy and interference caused by low-frequency session information.
The low-frequency data cleaning step may alternatively take the labels themselves, rather than the hit priority words, as the cleaning target.
Through the above steps, the obtained hit words are cleaned, so that labels discussed by few users are filtered out and the more popular discussion topic labels are retained.
In some of these embodiments, the filtering condition in the low frequency data washing step S42 further includes:
and when the number of speaking users corresponding to a hit priority word is less than a set value, filtering out the hit priority word, wherein the set value is the mean minus 3 standard deviations.
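Steps S41 and S42 might be sketched as follows. This is only an illustration: the function names and sample data are hypothetical, and using the population standard deviation is an assumption, since the application does not say which estimator is meant.

```python
import statistics
from collections import defaultdict


def count_users_per_word(hits):
    """Deduplicate (word, user) pairs and count distinct users per hit word."""
    users = defaultdict(set)
    for hit in hits:
        users[hit["word"]].add(hit["user"])
    return {word: len(speakers) for word, speakers in users.items()}


def three_sigma_filter(user_counts):
    """Keep words whose user count lies inside (mean - 3*std, mean + 3*std)."""
    if not user_counts:
        return {}
    counts = list(user_counts.values())
    mean = statistics.fmean(counts)
    std = statistics.pstdev(counts)  # population standard deviation (assumed)
    if std == 0:
        return dict(user_counts)  # all counts equal, nothing to filter
    lower, upper = mean - 3 * std, mean + 3 * std
    return {w: c for w, c in user_counts.items() if lower < c < upper}


# Repeated hits by the same user count once after deduplication.
sample_hits = [
    {"word": "lipstick", "user": "u1"},
    {"word": "lipstick", "user": "u1"},
    {"word": "lipstick", "user": "u2"},
]
sample_counts = count_users_per_word(sample_hits)

# Ten widely discussed words and one word mentioned by a single user.
user_counts = {f"word{i}": 100 for i in range(10)}
user_counts["rare"] = 1
kept = three_sigma_filter(user_counts)
```

Note that when user counts are strongly skewed, mean - 3*std can be negative, in which case the lower bound removes nothing; a practical system might therefore need an additional minimum-count rule, which is outside what the application describes.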
This embodiment also provides a user topic label generation method based on an enterprise WeChat discussion group. FIG. 5 is a flowchart of another method for generating user topic labels based on an enterprise WeChat discussion group according to an embodiment of the present application. As shown in FIG. 5, the flow includes the following steps:
S501, stop word data preparation
In NLP tasks, there are often a large number of words that carry no useful information, for example "of" and "we". Such words are defined as stop words.
S502, presetting a discussion label
For the session archive of user discussions, the operator needs to preset different discussion tags according to the industry.
S503, obtaining word vectors
1. Using the large-scale word vectors open-sourced by Tencent AI Lab, candidate hit words similar to the discussion tags of S502 are searched for to obtain the corresponding external word vectors, which supplement the word vector training described below.
In this process, the degree of similarity between words can be described by the Euclidean distance between their vectors.
2. The session archive is preprocessed. The preprocessing generally depends on the type of the session archive and on the purpose at hand. For example, if the sessions are in English, operations such as case conversion and spelling checks may be needed; if the sessions are in Chinese or Japanese, word segmentation is needed.
3. After the session archive has been preprocessed, its one-hot vectors and the external word vectors are used as the input of a word2vec model, and low-dimensional word vectors (word embeddings) are trained with the word2vec model.
In practical applications, the above steps may adopt either of two training models (CBOW and Skip-gram) and either of two acceleration algorithms (Negative Sampling and Hierarchical Softmax).
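For illustration, the input preparation described in S503 might look like the sketch below, which builds a vocabulary, produces one-hot vectors and enumerates the (center, context) pairs that a Skip-gram model is trained on. The function names, the window size and the toy sessions are hypothetical; the word2vec training itself is not shown and would be delegated to an existing word2vec implementation.

```python
def build_vocabulary(segmented_sessions):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for tokens in segmented_sessions:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """One-hot vector with a single 1 at the token's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


def skipgram_pairs(tokens, window=2):
    """(center, context) pairs within the given window on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Sessions after word segmentation and stop word removal (toy data).
sessions = [["love", "this", "lipstick"], ["lipstick", "color"]]
vocab = build_vocabulary(sessions)
lipstick_vec = one_hot("lipstick", vocab)
pairs = skipgram_pairs(sessions[0], window=1)
```

The one-hot vectors are as long as the vocabulary, which is why the trained word2vec embeddings are described as low-dimensional by comparison.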
S504, adding the priority words
Using the word vectors obtained through S502 and S503, the words whose word vectors have a cosine similarity greater than a threshold a with the tag word vector of a preset discussion tag are identified and added to the priority word list of the word segmentation system. Words of close similarity are thus set to represent the preset topic tag; that is, one topic tag may correspond to multiple words. The cosine similarity is calculated as follows:
$$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
where A and B represent the word vectors corresponding to different words, n represents the total dimension (200 dimensions in S503), and i represents the i-th dimension.
S505, session marking
In this step, the stop words of S501 and the priority words of S504 are added to the jieba word segmentation system, which is then used to scan the session archive. If a priority word is hit, the related information is output and a corresponding table is generated; the related information includes the original session, the hit priority word, the hit tag and the speaking user.
S506, data cleaning
1. The table obtained in S505 is generally sparse because the discussion topics are scattered. For each hit word, the records are deduplicated by speaking user to obtain the hit word and its number of speaking users.
2. For all hit words and their user counts, the mean μ and the variance (and hence the standard deviation σ) are computed. The information is then filtered using the 3sigma principle: hit words whose user counts fall outside (μ - 3σ, μ + 3σ) are filtered out.
3. If the number of users of a hit word is less than the mean minus 3 standard deviations, the hit word is rarely discussed. Rows of the S505 table whose hit word is not among the hit words retained in S506 are filtered out.
S507, constructing a user session label
After the table of S506 is obtained, the speaking user-hit tag information is output, and the user session tags are constructed from it to support downstream tasks such as subsequent user marketing and activity planning.
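A minimal sketch of S507 is given below, assuming hit records shaped like the table of S505 (speaking user, hit word, hit tag) and a set of hit words retained by the cleaning of S506. The function name, field names and sample data are hypothetical.

```python
from collections import defaultdict


def build_user_labels(hits, retained_words):
    """Collect, per speaking user, the tags of hit words that survived cleaning."""
    labels = defaultdict(set)
    for hit in hits:
        if hit["word"] in retained_words:
            labels[hit["user"]].add(hit["label"])
    # Sort the tags so that the output is deterministic.
    return {user: sorted(tags) for user, tags in labels.items()}


table = [
    {"user": "u1", "word": "lipstick", "label": "lipstick"},
    {"user": "u1", "word": "serum", "label": "skin care"},
    {"user": "u2", "word": "rareword", "label": "travel"},
]
user_labels = build_user_labels(table, retained_words={"lipstick", "serum"})
```

In this toy data, user "u2" only hit a word that was removed during cleaning and therefore receives no session tag.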
The embodiments of the present application are described and illustrated below by means of preferred embodiments.
FIG. 6 is a flowchart of a method for generating user topic labels based on an enterprise WeChat discussion group according to the preferred embodiment of the present application.
S601, preparing stop word data and adding it to the word segmentation system
In NLP tasks, there are often a large number of words that carry no useful information, for example "of" and "we". Such words are defined as stop words and added to the word segmentation system, and the corresponding words are deleted from the session archive.
S602, presetting a discussion label in the field of beauty and make-up
In the beauty and make-up field, a user's daily-activity information and cosmetics information may be of interest. Based on this, the preset discussion tags may be as follows:
'working', 'love', 'school', 'shopping', 'travel', 'home', 'white', 'lipstick', 'game', 'color cosmetics', 'skin care'
S603, training word vector
1. Using the large-scale word vectors open-sourced by Tencent AI Lab, search for candidate hit words similar to the discussion labels in S502 to obtain the corresponding external word vectors, which supplement the word-vector training described below. The external word vectors are dense and may be set to 200 dimensions.
In this process, the degree of similarity between words can be described by the Euclidean distance between their vectors.
2. The session archive is preprocessed.
3. After the session archive has been processed, its one-hot vectors and the external word vectors are used as the input of a word2vec model, and low-dimensional word vectors (word embeddings) are trained with the word2vec model.
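The Euclidean-distance screening in item 1 can be sketched as follows. The toy 3-dimensional vectors stand in for the 200-dimensional external vectors, and the function `nearest_words` and its parameter `k` are assumptions; the application does not state how many candidates are kept.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_words(label_vector, vectors, k):
    """Return the k words whose vectors lie closest to the label vector."""
    ranked = sorted(
        vectors, key=lambda word: euclidean_distance(vectors[word], label_vector)
    )
    return ranked[:k]


# Toy vectors; real external vectors would have 200 dimensions.
external_vectors = {
    "lip gloss": [0.9, 0.1, 0.0],
    "foundation": [0.6, 0.5, 0.1],
    "football": [0.0, 0.1, 0.9],
}
label_vector = [1.0, 0.0, 0.0]  # hypothetical vector of the label "lipstick"
candidates = nearest_words(label_vector, external_vectors, k=2)
```

A smaller distance means a more similar word, so the candidates are returned in order of increasing distance from the label.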
S604, adding priority words for the make-up field
Through S602 and S603, the word vectors can be used to find words whose cosine similarity to the tag word vector of a preset discussion tag is greater than a threshold a; these words are added to the priority word list of the word segmentation system. Words with high similarity are taken to represent the preset topic. Typically, a is set to 0.6, and the cosine similarity is computed as follows:
$$\cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where A and B denote the word vectors corresponding to two different words, n denotes the total number of dimensions (200 in S603), and i denotes the i-th dimension.
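The thresholding described in S604 can be sketched in Python as below. The 2-dimensional vectors, the helper `select_priority_words`, and the rule of assigning each word to its single most similar tag are illustrative assumptions; only the cosine similarity and the threshold of 0.6 come from the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def select_priority_words(word_vectors, tag_vectors, threshold=0.6):
    """Return {word: tag} for words whose best tag similarity exceeds the threshold."""
    priority = {}
    for word, vec in word_vectors.items():
        best_tag, best_sim = None, threshold
        for tag, tag_vec in tag_vectors.items():
            sim = cosine_similarity(vec, tag_vec)
            if sim > best_sim:
                best_tag, best_sim = tag, sim
        if best_tag is not None:
            priority[word] = best_tag
    return priority


# Toy 2-dimensional vectors for illustration only.
tag_vectors = {"lipstick": [1.0, 0.0], "travel": [0.0, 1.0]}
word_vectors = {
    "lip glaze": [0.9, 0.1],
    "airport": [0.2, 0.8],
    "invoice": [-1.0, 0.2],
}
priority_words = select_priority_words(word_vectors, tag_vectors)
```

In this toy data "invoice" is similar to neither tag, so it is not added to the priority word list.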
S605, marking the session archive
In this step, the stop words of S601 and the priority words of S604 are added to the jieba word segmentation system, which is used to scan the session archive. If a priority word is hit, the related information is output and a table is generated, as shown below:
Figure BDA0003147612140000102
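The marking step can be sketched as follows. This sketch assumes each message has already been segmented into tokens (in practice a segmenter such as jieba would do this), and the tuple layout `(user, text, tokens)`, the row keys, and the function name `mark_session` are hypothetical.

```python
def mark_session(messages, priority_words, stop_words):
    """Scan segmented messages and emit one row per priority-word hit.

    messages: iterable of (user, original_text, tokens)
    priority_words: mapping from priority word to its discussion tag
    """
    rows = []
    for user, text, tokens in messages:
        for token in tokens:
            if token in stop_words:
                continue  # stop words are dropped before matching
            tag = priority_words.get(token)
            if tag is not None:
                rows.append(
                    {"conversation": text, "hit_word": token, "tag": tag, "user": user}
                )
    return rows


# Hypothetical pre-segmented session archive.
messages = [
    ("u1", "I bought a new lip glaze", ["I", "bought", "a", "new", "lip glaze"]),
    ("u2", "see you at the airport", ["see", "you", "at", "the", "airport"]),
    ("u3", "ok", ["ok"]),
]
priority_words = {"lip glaze": "lipstick", "airport": "travel"}
stop_words = {"I", "a", "the"}
rows = mark_session(messages, priority_words, stop_words)
```

Each output row carries the four fields named in the text: the original conversation, the hit priority word, the hit tag, and the speaking user.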
S606, cleaning the output data of S605
After the table of S605 is obtained, because the discussed topics are usually scattered, the speaking users are deduplicated for each hit word to obtain the hit word together with its number of speaking users. For example: lipstick + 2 hit users.
Over the user counts of all hit words, the mean μ and the standard deviation σ are computed. The information is then filtered using the 3-sigma principle: the related information within (μ-3σ, μ+3σ) is retained and a table is generated.
If the number of users of a hit word is less than the mean minus 3 standard deviations, the hit word is rarely discussed and is filtered out;
if a hit word of the table in S605 is not included in the hit words of the table generated in S606, it is filtered out.
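The 3-sigma filtering can be sketched as below, under stated assumptions: the population standard deviation is used, and the interval bounds are treated as inclusive so that the degenerate case of identical counts (σ = 0) does not discard every word. Neither choice is specified in the application, and the function name `three_sigma_filter` is hypothetical.

```python
import statistics


def three_sigma_filter(user_counts):
    """Keep hit words whose user count lies within mean +/- 3 standard deviations."""
    counts = list(user_counts.values())
    mu = statistics.mean(counts)
    sigma = statistics.pstdev(counts)  # population standard deviation (assumption)
    low, high = mu - 3 * sigma, mu + 3 * sigma
    # Inclusive bounds (assumption) so that sigma == 0 keeps all words.
    return {word: c for word, c in user_counts.items() if low <= c <= high}


# Toy data: twenty ordinary words and one word mentioned by far more users.
user_counts = {f"word{i}": 5 for i in range(20)}
user_counts["hello"] = 100
kept = three_sigma_filter(user_counts)
```

In this toy data the lower bound μ-3σ is negative, so only the unusually frequent word is removed; whether the lower bound ever filters anything depends on how the counts are distributed.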
S607, constructing a user session tag
After the table of S606 is obtained, the speaking user-hit label information is output to assist downstream tasks such as subsequent user marketing and activity planning.
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
This embodiment also provides a system for generating a user topic tag based on an enterprise and micro discussion group. The system is used to implement the above embodiments and preferred embodiments; what has already been described is not repeated. As used hereinafter, the terms "module," "unit," "subunit," and the like may refer to a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
Fig. 7 is a block diagram of a user topic tag generation system based on an enterprise discussion group according to an embodiment of the present application, and as shown in fig. 7, the apparatus includes:
the word vector training module 1, which presets different discussion labels according to industry type, obtains corresponding external word vectors by screening according to the different discussion labels, and trains word vectors by combining the external word vectors with the session archive;
the priority word acquisition module 2, which calculates the similarity between the word vectors and the discussion labels and adds the words whose word vectors meet preset conditions to a priority word list;
the session marking module 3, which scans the session archive using a word segmentation system and marks the session archive according to the priority words and preset stop words;
and the topic label generating module 4, which processes the marking result to obtain the user topic label.
The embodiment of the application uses means such as NLP to extract the chat information of the user, can quickly mine the user label in the session file, helps operators to mark the user automatically, and filters the session information with less discussion.
In some of these embodiments, the topic tag generation module 4 includes:
the information deduplication unit 41, which receives the priority words and speaking users output by the session marking module and deduplicates them to obtain the hit priority words and the corresponding numbers of speaking users;
the low-frequency data cleaning unit 42, which computes the variance and the mean of the numbers of speaking users of the hit priority words and filters using the 3-sigma principle;
the topic tag generation unit 43, which outputs the speaking user-discussion tag information according to the filtering result and obtains the user topic tag from this information.
Fig. 8 is a block diagram of a preferred structure of a system for generating user topic labels based on an enterprise discussion group according to an embodiment of the present application, and as shown in fig. 8, the apparatus includes all the modules shown in fig. 7, and further includes:
the word vector training module 1 further comprises:
an external word vector screening unit 11, which acquires Tencent AI Lab word vectors from Tencent AI Lab, calculates the Euclidean distance between each Tencent AI Lab word vector and the vector corresponding to the discussion label, and obtains the external word vectors by screening according to the Euclidean distance;
and a word vector output unit 12, which preprocesses the session archive to obtain corresponding one-hot vectors, inputs the one-hot vectors and the external word vectors into the word2vec model, and outputs the word vectors.
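To illustrate how a one-hot vector relates to a dense word vector, the sketch below multiplies a one-hot vector by an embedding matrix, which simply selects one row of the matrix. This shows only the input-layer lookup of a word2vec-style model, not training. How the external word vectors enter the model is not detailed in the application; using them as initial rows of the embedding matrix is one possible reading and is an assumption here, as are all names and values in the code.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(one_hot_vec, embedding_matrix):
    """Multiply a one-hot row vector by the embedding matrix."""
    dim = len(embedding_matrix[0])
    return [
        sum(one_hot_vec[r] * embedding_matrix[r][d] for r in range(len(embedding_matrix)))
        for d in range(dim)
    ]


vocab = ["lipstick", "travel", "game"]
# Hypothetical 2-dimensional rows, e.g. initialised from external vectors (assumption).
embedding_matrix = [[0.9, 0.1], [0.1, 0.8], [0.3, 0.3]]
vec = project(one_hot(vocab.index("travel"), len(vocab)), embedding_matrix)
```

Because only one entry of the one-hot vector is non-zero, the product equals the matrix row for that word, which is why the trained rows serve as the low-dimensional word vectors.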
The priority word obtaining module 2 calculates the cosine similarity between the word vector and the discussion label through the following formula,
$$\cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein A, B represents the word vector and the word vector corresponding to the discussion tag, respectively, n represents the total dimension,
and when the cosine similarity is greater than a preset threshold value, adding the participle corresponding to the word vector into the priority word list.
The session marking module 3 specifically includes:
the session archive scanning unit 31, which adds the priority words and the preset stop words to the jieba word segmentation system and scans the session archive using the jieba word segmentation system;
the information output unit 32, which deletes the corresponding segmented words from the session archive according to the preset stop words and outputs the priority words, discussion tags, and speaking users hit in the remaining part of the session archive to the information deduplication unit 41.
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
In addition, the method for generating the user topic label based on the enterprise discussion group in the embodiment of the application described in conjunction with fig. 1 can be implemented by computer equipment. Fig. 9 is a hardware configuration diagram of a computer device according to an embodiment of the present application.
The computer device may comprise a processor 81 and a memory 82 in which computer program instructions are stored.
Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
Memory 82 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 82 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 82 may include removable or non-removable (or fixed) media, where appropriate. The memory 82 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 82 is a non-volatile memory. In particular embodiments, memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory (FLASH), or a combination of two or more of these, where appropriate. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode Dynamic Random-Access Memory (FPMDRAM), an Extended Data Out Dynamic Random-Access Memory (EDODRAM), a Synchronous Dynamic Random-Access Memory (SDRAM), and the like.
The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.
The processor 81 reads and executes the computer program instructions stored in the memory 82 to implement any one of the above-described embodiments of the method for generating the user topic tags based on the groups of discussion of enterprise and micro.
In some of these embodiments, the computer device may also include a communication interface 83 and a bus 80. As shown in fig. 9, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 and communicate with one another through it.
The communication interface 83 is used for implementing communication between modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 83 may also carry out data communication with other components such as external equipment, image/data acquisition equipment, a database, external storage, and an image/data processing workstation.
Bus 80 includes hardware, software, or both to couple the components of the computer device to each other. Bus 80 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example, and not limitation, bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable bus, or a combination of two or more of these. Bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
The computer device may execute the session marking step in the embodiment of the present application based on the acquired session archive, thereby implementing the method for generating the user topic tag based on the enterprise and micro discussion group described in conjunction with fig. 1.
In addition, in combination with the method for generating the user topic tag based on the enterprise and micro discussion group in the above embodiments, the embodiments of the present application may provide a computer-readable storage medium for implementation. The computer-readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any of the above embodiments of the method for generating user topic tags based on an enterprise discussion group.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A user topic label generation method based on an enterprise and micro discussion group is characterized by comprising the following steps:
a word vector training step, wherein different discussion labels are preset according to industry type, corresponding external word vectors are obtained by screening according to the different discussion labels, and word vectors are trained by combining the external word vectors with the session archive;
a priority word obtaining step, namely calculating the similarity between the word vectors and the discussion labels, and adding the words corresponding to the word vectors that meet preset conditions to a priority word list;
a session marking step, namely scanning the session archive using a word segmentation system and marking the session archive according to the priority words and preset stop words;
and a topic label generation step, namely processing the marking result to obtain the user topic label.
2. The method for generating user topic labels based on an enterprise and micro discussion group according to claim 1, wherein the word vector training step further comprises:
an external word vector screening step, namely acquiring Tencent AI Lab word vectors from Tencent AI Lab, calculating the Euclidean distance between each Tencent AI Lab word vector and the vector corresponding to the discussion label, and screening according to the Euclidean distance to obtain the external word vectors;
and a word vector output step, namely preprocessing the session archive to obtain corresponding one-hot vectors, inputting the one-hot vectors and the external word vectors into a word2vec model, and outputting the word vectors.
3. The method for generating the user topic tag according to claim 1, wherein the priority word acquiring step specifically includes:
calculating the cosine similarity of the word vector and the label word vector corresponding to the discussion label through the following formula,
$$\cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein A, B represents the word vector and the word vector corresponding to the discussion tag, respectively, n represents the total dimension, i represents the ith dimension,
and adding the participles corresponding to the word vectors into the priority word list when the cosine similarity is greater than a preset threshold value.
4. The method for generating the user topic tag according to claim 1, wherein the session marking step specifically comprises:
a conversation archiving scanning step, namely adding the priority words and the preset stop words into a jieba word segmentation system, and scanning the conversation archive by using the jieba word segmentation system;
and an information output step, namely deleting the corresponding segmented words from the session archive according to the preset stop words, and outputting the priority words, the discussion tags, and the speaking users hit in the remaining part of the session archive.
5. The method for generating the user topic tag according to claim 1, wherein the topic tag generation step specifically comprises:
an information duplication removing step, namely removing duplication of the corresponding priority words and the speaking users output in the information output step to obtain hit priority words and the number of the corresponding speaking users;
a low-frequency data cleaning step, namely counting the variance and the mean of the hit priority words and the number of the speaking users, and filtering by using a 3sigma principle;
and a topic label generation step, namely outputting the information of the speaking user-discussion label according to the filtering result, and obtaining the user topic label according to the information.
6. The method for generating the user topic tags based on the enterprise micro discussion group as claimed in claim 5, wherein the filtering condition in the low frequency data washing step further comprises:
and when the number of speaking users corresponding to a hit priority word is less than a set value, filtering out the hit priority word, wherein the set value is the mean minus 3 standard deviations.
7. A system for generating a user topic tag based on an enterprise and micro discussion group, which applies the method for generating a user topic tag of any one of claims 1 to 6, and is characterized by comprising:
the word vector training module, which presets different discussion labels according to industry type, obtains corresponding external word vectors by screening according to the different discussion labels, and trains word vectors by combining the external word vectors with the session archive;
the priority word acquisition module, which calculates the similarity between the word vectors and the discussion labels and adds the words corresponding to the word vectors that meet preset conditions to a priority word list;
the session marking module, which scans the session archive using a word segmentation system and marks the session archive according to the priority words and preset stop words;
and the topic label generating module, which processes the marking result to obtain the user topic label.
8. The system of claim 7, wherein the topic tag generation module comprises:
the information duplication removing unit is used for receiving the corresponding priority words and the speaking users output by the session marking module and removing duplication of the corresponding priority words and the speaking users to obtain hit priority words and the number of the corresponding speaking users;
the low-frequency data cleaning unit is used for counting the variance and the mean value of the hit priority words and the number of the speaking users and filtering by using a 3sigma principle;
and the topic label generating unit outputs the information of the speaking user-discussion label according to the filtering result and obtains the user topic label according to the information.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the user topic tag generation method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the user topic tag generation method as recited in any one of claims 1 to 6.
CN202110757295.7A 2021-07-05 2021-07-05 User topic label generation method and system based on enterprise and micro discussion group Pending CN113342959A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110757295.7A CN113342959A (en) 2021-07-05 2021-07-05 User topic label generation method and system based on enterprise and micro discussion group

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110757295.7A CN113342959A (en) 2021-07-05 2021-07-05 User topic label generation method and system based on enterprise and micro discussion group

Publications (1)

Publication Number Publication Date
CN113342959A true CN113342959A (en) 2021-09-03

Family

ID=77482409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110757295.7A Pending CN113342959A (en) 2021-07-05 2021-07-05 User topic label generation method and system based on enterprise and micro discussion group

Country Status (1)

Country Link
CN (1) CN113342959A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118071403A (en) * 2024-04-18 2024-05-24 湖北华中电力科技开发有限责任公司 Marketing system development method and system based on micro-service architecture and middle-stage technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination