CN111859148A - Theme extraction method, device and equipment and computer readable storage medium - Google Patents

Info

Publication number
CN111859148A
Authority
CN
China
Prior art keywords
search
conversation
topic
determining
session
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010756727.8A
Other languages
Chinese (zh)
Inventor
姜迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202010756727.8A
Publication of CN111859148A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention relates to the technical field of financial technology, and discloses a topic extraction method, apparatus, equipment and computer-readable storage medium. The topic extraction method comprises the following steps: acquiring search log information and search statements; determining a session in the search log information according to the search statements, wherein the search statements in the session are semantically associated; determining the topic distribution of the session through a topic model, and determining a uniform resource locator corresponding to the session according to the topic distribution; and extracting a target topic from the topics corresponding to the session according to the uniform resource locator corresponding to the session. The invention improves the accuracy of topic extraction.

Description

Theme extraction method, device and equipment and computer readable storage medium
Technical Field
The invention relates to the technical field of financial technology (Fintech), in particular to a method, a device and equipment for extracting a theme and a computer-readable storage medium.
Background
With the development of computer technology, more and more technologies are applied in the financial field, and the traditional financial industry is gradually changing to financial technology (Fintech), but higher requirements are also put forward on the technologies due to the requirements of the financial industry on safety and real-time performance.
Topics implicit in search log information can be used to improve search engine functions. For example, researchers have applied implicit topics to spelling correction and search personalization in search engines, so search log information is a valuable information base for improving search quality.
However, existing schemes for mining implicit topics search for them globally over the entire log, even though some topics in the log are unrelated to one another, so topic extraction is not accurate enough.
Disclosure of Invention
The invention mainly aims to provide a method, a device and equipment for extracting a theme and a computer readable storage medium, aiming at solving the problem of inaccurate theme extraction.
In order to achieve the above object, the present invention provides a method for extracting a theme, including:
acquiring search log information and search sentences;
determining a session in the search log information according to each search statement, wherein each search statement in the session is semantically associated;
determining the topic distribution of the session through a topic model, and determining a uniform resource locator corresponding to the session according to the topic distribution;
and extracting a target topic from the topics corresponding to the session according to the uniform resource locator corresponding to the session.
Optionally, the step of determining a session in the search log information according to each search statement includes:
determining reference parameters corresponding to adjacent search sentences in the search log information, wherein the reference parameters comprise at least one of keywords of the adjacent search sentences and semantic similarity of the adjacent search sentences;
and extracting sessions from the search log information according to adjacent search sentences corresponding to the reference parameters meeting semantic conditions, wherein the search sentences in the sessions are adjacent in sequence and associated semantically.
Optionally, the step of extracting a session in the search log information according to the adjacent search statement corresponding to the reference parameter meeting the semantic condition includes:
determining interval duration between search time points of adjacent search sentences in the search log information;
and extracting a session from the search log information according to the interval duration and the adjacent search sentences whose reference parameters meet the semantic condition, wherein the search sentences in the session are adjacent in sequence and semantically associated, and the interval duration corresponding to adjacent search sentences in the session is less than a preset duration.
Optionally, the semantic condition includes at least one of:
the semantic similarity of the adjacent search sentences is greater than a preset similarity;
and the keywords in the adjacent search sentences are the same.
Optionally, the step of determining the uniform resource locator corresponding to the session according to the topic distribution includes:
determining the marginal probability corresponding to each topic in the session according to the topic distribution;
and determining the uniform resource locator corresponding to the session according to the marginal probability corresponding to the session.
Optionally, before the step of determining the marginal probability corresponding to each topic in the session according to the topic distribution, the method further includes:
determining whether the session is associated with a click operation record;
and when it is determined that the session is associated with a click operation record, executing the step of determining the marginal probability corresponding to each topic in the session according to the topic distribution.
Optionally, before the step of determining the topic distribution of the conversation through the topic model, the method further includes:
obtaining pieces of document information, wherein each piece of document information comprises labels, and the labels comprise a topic distribution label, a word distribution label and a uniform resource locator distribution label of the document information;
inputting each piece of document information into a preset model so as to train the preset model;
and when the convergence value of the preset model is smaller than a preset convergence value, stopping training the preset model, and storing the trained preset model as the topic model.
Optionally, the step of obtaining the search log information and the search statements includes:
acquiring search log information and determining a search engine corresponding to the search log information;
and acquiring a search record of the search engine, and determining the search statement according to the search record.
Optionally, the step of extracting a target topic from each topic corresponding to the session according to the uniform resource locator corresponding to the session includes:
performing parameter inference on the session according to the uniform resource locator corresponding to the session, to obtain the probability of each topic in the session;
and determining the topic with the maximum probability in the session as the target topic, and extracting the target topic from the session.
Optionally, when a plurality of sessions are extracted from the search log information, the parameter inference for the sessions is processed in parallel.
In order to achieve the above object, the present invention further provides a topic extraction device, including:
the acquisition module is used for acquiring search log information and search sentences;
a determining module, configured to determine a session in the search log information according to each search statement, where each search statement in the session is semantically associated;
the determining module is further configured to determine topic distribution of the session through a topic model, and determine a uniform resource locator corresponding to the session according to the topic distribution;
and the extraction module is used for extracting target topics from the topics corresponding to the conversation according to the uniform resource locators corresponding to the conversation.
In order to achieve the above object, the present invention further provides topic extraction equipment, including: a topic model, a memory, a processor, and an extraction program stored in the memory and executable on the processor, wherein the topic model is connected to the processor, and the extraction program, when executed by the processor, implements the steps of the topic extraction method described above.
To achieve the above object, the present invention also provides a computer-readable storage medium storing an extraction program which, when executed by a processor, implements the steps of the topic extraction method described above.
The invention provides a topic extraction method, apparatus, equipment and computer-readable storage medium. Search log information is split into sessions of semantically associated search statements; the topic distribution of each session is determined through a topic model; the uniform resource locator of the session is then determined from the topic distribution so that each topic in the session can be located accurately; and finally a target topic is accurately extracted from the topics of the session. Whereas the prior art searches the entire log globally to extract implicit topics, the invention groups semantically associated search statements into one session so that all topics in the session are related, and any topic extracted from the session can therefore represent the implicit topic information of the session. This overcomes the prior-art defect that topic extraction is inaccurate because topic information in the log is unrelated, and improves the accuracy of topic extraction.
Drawings
FIG. 1 is a schematic structural diagram of the apparatus in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of the topic extraction method of the present invention;
FIG. 3 is a schematic functional block diagram of a first embodiment of the topic extraction apparatus of the present invention.
The implementation, functional features and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
The equipment involved in this embodiment of the invention is a topic extraction apparatus.
As shown in fig. 1, the topic extraction apparatus may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 does not limit the topic extraction apparatus, which may include more or fewer components than shown, combine some components, or arrange the components differently.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and an extraction program.
In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting a server and performing data communication with the server; the user interface 1003 is mainly used for connecting a client and performing data communication with the client; and the processor 1001 may be configured to call the extraction program stored in the memory 1005 and perform the following operations:
acquiring search log information and search sentences;
determining a session in the search log information according to each search statement, wherein each search statement in the session is semantically associated;
determining topic distribution of the conversation through a topic model, and determining a uniform resource locator corresponding to the conversation according to the topic distribution;
and extracting target topics from the topics corresponding to the conversation according to the uniform resource locators corresponding to the conversation.
In one embodiment, the processor 1001 may call the extraction program stored in the memory 1005, and further perform the following operations:
determining reference parameters corresponding to adjacent search sentences in the search log information, wherein the reference parameters comprise at least one of keywords of the adjacent search sentences and semantic similarity of the adjacent search sentences;
and extracting sessions from the search log information according to adjacent search sentences corresponding to the reference parameters meeting semantic conditions, wherein the search sentences in the sessions are adjacent in sequence and associated semantically.
In one embodiment, the processor 1001 may call the extraction program stored in the memory 1005, and further perform the following operations:
determining interval duration between search time points of adjacent search sentences in the search log information;
and extracting a conversation from the search log information according to the interval duration and adjacent search sentences corresponding to the reference parameters meeting semantic conditions, wherein the search sentences in the conversation are adjacent in sequence and associated semantically, and the interval duration corresponding to the adjacent search sentences in the conversation is less than a preset duration.
In one embodiment, the semantic condition applied when the processor 1001 calls the extraction program stored in the memory 1005 includes at least one of:
the semantic similarity of the adjacent search sentences is greater than the preset similarity;
and the keywords in the adjacent search sentences are the same.
In one embodiment, the processor 1001 may call the extraction program stored in the memory 1005, and further perform the following operations:
determining the marginal probability corresponding to each topic in the session according to the topic distribution;
and determining the uniform resource locator corresponding to the session according to the marginal probability corresponding to the session.
In one embodiment, the processor 1001 may call the extraction program stored in the memory 1005, and further perform the following operations:
determining whether the session is associated with a click operation record;
and when it is determined that the session is associated with a click operation record, executing the step of determining the marginal probability corresponding to each topic in the session according to the topic distribution.
In one embodiment, the processor 1001 may call the extraction program stored in the memory 1005, and further perform the following operations:
obtaining each piece of document information, wherein the document information comprises a label, and the label comprises a theme distribution label, a word distribution label and a uniform resource locator distribution label of the document information;
inputting each document information into a preset model so as to train the preset model;
and when the convergence value of the preset model is smaller than the preset convergence value, stopping training the preset model, and storing the preset model of which the training is stopped as a theme model.
In one embodiment, the processor 1001 may call the extraction program stored in the memory 1005, and further perform the following operations:
acquiring search log information and determining a search engine corresponding to the search log information;
and acquiring a search record of the search engine, and determining the search statement according to the search record.
In one embodiment, the processor 1001 may call the extraction program stored in the memory 1005, and further perform the following operations:
performing parameter inference on the session according to the uniform resource locator corresponding to the session, to obtain the probability of each topic in the session;
and determining the topic with the maximum probability in the session as the target topic, and extracting the target topic from the session.
In one embodiment, the processor 1001 may call the extraction program stored in the memory 1005, and further perform the following operations:
when a plurality of sessions are extracted from the search log information, performing parallel processing on the parameter inference of the sessions.
Based on the above hardware structure, embodiments of the topic extraction method of the present invention are presented.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the topic extraction method of the present invention, and the topic extraction method includes:
Step S10, obtaining search log information and each search statement;
In this embodiment, the method is executed by the topic extraction apparatus, which for convenience of description is hereinafter referred to simply as the apparatus. The apparatus may collect search log information, which is a log generated when a user uses an application that has a search engine. The search log information includes search statements; a search statement may be a statement entered by a user on a search interface of the application. The application records such statements to form a search operation record, which is then sent to the apparatus. The apparatus stores the search operation record in association with the application. The search operation record contains only search statements.
When the apparatus needs to extract topics, it obtains the search log information and determines the application that generated it, that is, the search engine that generated the search log information. It then determines the search record corresponding to that search engine and determines the search statements from the search record. A search statement may be a sentence or a single word.
Step S20, determining a session in the search log information according to each search statement, wherein the search statements in the session are semantically associated;
After determining the search statements, the apparatus may determine the sessions in the search log information. Specifically, the search statements are distributed through the search log information, and each has a corresponding position there; the positions indicate which search statements are adjacent to one another.
The apparatus marks each search statement in the search log information and then determines the reference parameters of adjacent search statements according to the marks. The reference parameters include at least one of the keywords of the adjacent search statements and the semantic similarity of the adjacent search statements. The apparatus determines whether the reference parameters of two adjacent search statements meet the semantic condition; if so, the two adjacent search statements are placed in the same session. In other words, the apparatus extracts each session from the search log information according to the adjacent search statements whose reference parameters meet the semantic condition, and the search statements in a session must be both adjacent in sequence and semantically associated.
The semantic condition includes at least one of the following: the semantic similarity of the adjacent search statements is greater than a preset similarity; the keywords in the adjacent search statements are the same. The apparatus is provided with a model for calculating semantic similarity: the apparatus inputs adjacent search statements into the model, the model performs semantic analysis on them, and the apparatus obtains the semantic similarity that the model outputs. The apparatus may also extract the keywords of a search statement, where a keyword may be a proper noun in the statement; for example, if the search statement is "high-speed rail to seek to a achievement", the keywords may be the proper nouns "achievement" and "high-speed rail". The apparatus determines the keywords of each search statement, and if the keywords of adjacent search statements are the same, the adjacent search statements are semantically associated. If a search statement has three or more keywords, more than one keyword must be shared by the adjacent search statements before they are considered semantically associated. For example, if search statement A has three keywords a, b and c, and the adjacent search statement B has two keywords a and b, then A and B share the two keywords a and b and can be considered semantically associated.
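The semantic condition described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the default threshold of 0.8, and the rule that a single shared keyword suffices when neither statement has three or more keywords are hypothetical choices not given in the text, and the similarity score is assumed to come from some external model.

```python
from typing import Optional, Set


def keywords_match(kw_a: Set[str], kw_b: Set[str]) -> bool:
    """Check the keyword part of the semantic condition for two adjacent statements."""
    shared = len(kw_a & kw_b)
    # Assumption: if either statement has three or more keywords, more than one
    # keyword must be shared; otherwise one shared keyword is taken as enough.
    required = 2 if max(len(kw_a), len(kw_b)) >= 3 else 1
    return shared >= required


def meets_semantic_condition(
    kw_a: Set[str],
    kw_b: Set[str],
    similarity: Optional[float] = None,
    preset_similarity: float = 0.8,  # hypothetical threshold
) -> bool:
    """Return True if at least one of the two conditions holds."""
    if similarity is not None and similarity > preset_similarity:
        return True
    return keywords_match(kw_a, kw_b)
```

With the example from the text, `keywords_match({"a", "b", "c"}, {"a", "b"})` holds because two keywords are shared.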
The apparatus checks in sequence whether the reference parameters of each pair of adjacent search statements meet the semantic condition. If both the previous pair and the current pair meet the semantic condition, the three search statements involved belong to the same session. If the previous pair meets the semantic condition but the current pair does not, the new search statement starts a new session. For example, if the previous pair is AB and the current pair is BC, then C is the new search statement. In this way the apparatus can determine multiple sessions; that is, it extracts sessions from the search log information on the basis that the search statements in a session are adjacent in sequence and semantically associated. Note that the interval duration between adjacent search statements must also be considered when determining sessions: if the interval is long, the two adjacent search statements can be taken to belong to different sessions. The apparatus therefore determines the interval duration between the search time points of adjacent search statements, and extracts each session from the search log information according to both the interval duration and the adjacent search statements whose reference parameters meet the semantic condition. Specifically, each search statement is associated with a search time point, from which the apparatus can determine the interval duration between adjacent search statements.
After obtaining the interval duration, the apparatus determines whether it is less than a preset duration, where the preset duration may be any suitable value, for example 30 minutes. If the interval duration is less than the preset duration, the apparatus acquires the reference parameters of the adjacent search statements and determines whether they are in the same session according to the reference parameters and the semantic condition. If the interval duration is greater than or equal to the preset duration, the two adjacent search statements can be directly determined to be in different sessions. Alternatively, the apparatus may first determine the sessions according to the reference parameters and then check the interval duration between adjacent search statements within each session, splitting new sessions out of those already determined.
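A minimal sketch of this segmentation, assuming timestamps in seconds and a caller-supplied `is_related` predicate that stands in for the semantic condition. The `SearchStatement` type and the function name are hypothetical; only the 30-minute default comes from the example above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SearchStatement:
    text: str
    timestamp: float  # search time point, in seconds (assumed unit)


def split_sessions(
    statements: List[SearchStatement],
    is_related: Callable[[SearchStatement, SearchStatement], bool],
    preset_gap_seconds: float = 30 * 60,
) -> List[List[SearchStatement]]:
    """Group time-ordered statements into sessions of adjacent, related statements."""
    sessions: List[List[SearchStatement]] = []
    for stmt in statements:
        if sessions:
            prev = sessions[-1][-1]
            gap = stmt.timestamp - prev.timestamp
            # Same session only if the gap is short AND the pair is semantically related.
            if gap < preset_gap_seconds and is_related(prev, stmt):
                sessions[-1].append(stmt)
                continue
        # Otherwise the statement starts a new session.
        sessions.append([stmt])
    return sessions
```

Checking the time gap first avoids calling the (possibly expensive) semantic check for pairs that are already too far apart, which matches the first ordering described in the text.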
Step S30, determining the topic distribution of the session through a topic model, and determining the uniform resource locator corresponding to the session according to the topic distribution;
The apparatus is provided with a topic model, which can generate the topics corresponding to each session. The topic model is obtained by training. Specifically, the apparatus acquires multiple pieces of document information, each of which carries labels; the labels comprise a topic distribution label, a word distribution label and a uniform resource locator distribution label of the document information.
The apparatus inputs the labeled document information into the preset model to train it. When the convergence value of the preset model is smaller than the preset convergence value, training can be stopped, and the trained preset model is stored as the topic model.
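The stopping rule can be illustrated with a small Python sketch. The text does not define the convergence value, so this sketch assumes it is the absolute change in a training objective between two consecutive iterations; `step`, `preset_convergence` and `max_iterations` are hypothetical names, and `step` stands for one training pass that returns the current objective value.

```python
from typing import Callable, Tuple


def train_until_converged(
    step: Callable[[], float],
    preset_convergence: float = 1e-4,
    max_iterations: int = 1000,
) -> Tuple[int, float]:
    """Run training steps until the objective changes by less than the preset value.

    Returns the number of steps executed and the last objective value.
    """
    previous = step()
    for iteration in range(1, max_iterations):
        current = step()
        # Assumed convergence value: change in the objective between iterations.
        if abs(current - previous) < preset_convergence:
            return iteration + 1, current
        previous = current
    return max_iterations, previous
```

The `max_iterations` cap is an added safeguard so that the loop terminates even if the objective never settles.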
The topic model first generates a topic distribution theta for each document; for each session, the apparatus can then determine the topics of the session from the session's topic distribution. The apparatus may determine the marginal probability P(z | theta) corresponding to each topic z in the session based on the topic distribution of the session. Specifically, the apparatus determines the positions of the topics in the session according to the topic distribution, then randomly selects a marginal probability for each position; the selected marginal probability is the marginal probability of the topic at that position. The topic distribution can be understood as the positions of the topics in the session, so the apparatus can determine the position of each topic directly from the topic distribution. The randomly selected marginal probability may be preset or generated on the fly. The random selection may be made by rolling a die: each face (or number of pips) of the die is associated with a preset marginal probability, and the marginal probability associated with the face shown after the roll is the marginal probability for that position.
After determining the marginal probability corresponding to each topic in the session, the apparatus can determine the uniform resource locator corresponding to the session according to those marginal probabilities. Specifically, a session generally contains click behavior by the user, which can be understood as clicking to search for a search statement in the input box. The click behavior is recorded in a click operation record, which can be associated with the search statement and hence with the session. The click operation record contains the user's click operations on search statements. If a click operation record is associated with the session, a topic z is randomly determined in the session, and a uniform resource locator P(URL | z) is obtained according to the marginal probability of that topic.
Specifically, the apparatus stores multiple uniform resource locators P and multiple correspondences, each correspondence relating a face of the die to a uniform resource locator P. When the apparatus determines that the session is associated with a click operation record, it obtains the marginal probability of the randomly determined topic, determines from that marginal probability the face shown after the die roll (the marginal probability is itself determined by the face shown after the roll, as described above), retrieves the correspondence for that face, and finally obtains the uniform resource locator P from the correspondence and the face. The uniform resource locator P so obtained is the uniform resource locator corresponding to the session.
It can be understood that, when the session is associated with a click operation record, the marginal probability corresponding to each topic in the session is determined according to the topic distribution, and the uniform resource locator corresponding to the session is determined according to those marginal probabilities. In this manner, the apparatus generates a uniform resource locator for each session that has click behavior.
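One way to read the die-roll description is as sampling from categorical distributions: first a topic z from P(z | theta), then a URL from P(URL | z), and only for sessions that have a click operation record. The Python sketch below follows that reading; it is an interpretation, not a confirmed implementation, and the function name, dictionary layout and example URLs are hypothetical.

```python
import random
from typing import Dict, Optional, Tuple


def sample_session_url(
    theta: Dict[str, float],                       # P(z | theta) for each topic z
    url_given_topic: Dict[str, Dict[str, float]],  # P(URL | z) for each topic z
    has_click_record: bool,
    rng: random.Random,
) -> Optional[Tuple[str, str]]:
    """Sample a (topic, URL) pair for a session, or None if it has no clicks."""
    if not has_click_record:
        return None
    topics = list(theta)
    # "Rolling the die" over topics: a weighted random choice.
    z = rng.choices(topics, weights=[theta[t] for t in topics], k=1)[0]
    urls = list(url_given_topic[z])
    # A second weighted choice over the URLs associated with the sampled topic.
    url = rng.choices(urls, weights=[url_given_topic[z][u] for u in urls], k=1)[0]
    return z, url


theta = {"travel": 0.7, "finance": 0.3}
url_given_topic = {
    "travel": {"https://example.com/rail": 0.6, "https://example.com/hotel": 0.4},
    "finance": {"https://example.com/loan": 1.0},
}
sample = sample_session_url(theta, url_given_topic, True, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible for a given seed.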
Step S40: extracting a target topic from the topics corresponding to the session according to the uniform resource locator corresponding to the session.
After determining the uniform resource locator, the device may locate the positions of the respective topics in the session according to the uniform resource locator, thereby extracting the target topic of the session. It should be noted that, because the search statements in the session are semantically related, each topic can represent an implicit topic of the session, and thus the target topic can be any topic in the session.
In addition, the device can determine the topic with the maximum probability as the target topic; the topic with the maximum probability is the implicit topic most closely related to the session. To this end, after calculating the uniform resource locator of the session, the device performs parameter inference on the session by means of variational inference to obtain the probability of each topic in the session, and then extracts the topic with the maximum probability in each session as the target topic. Parameter inference requires iterative computation over multiple parameters, so its computational workload is large. Because the search log information is generally divided into multiple sessions, the device can process the parameter inference of the sessions in parallel to reduce the workload. Specifically, the device adopts the MapReduce computing framework to perform variational inference in parallel on multiple machines. Processing the training data in parallel on multiple machines effectively improves training efficiency, that is, it reduces the parameter-inference workload that each machine bears for the multiple sessions.
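As an illustrative, non-limiting sketch, the selection of the maximum-probability topic and the per-session parallelism can be expressed as follows. A local thread pool stands in for the MapReduce framework, and the `infer` callable stands in for variational inference; both substitutions, and the names `target_topic` and `extract_targets`, are assumptions of this sketch rather than the implementation described in the specification.

```python
from concurrent.futures import ThreadPoolExecutor


def target_topic(topic_probs):
    """Return the topic with the maximum inferred probability."""
    return max(topic_probs, key=topic_probs.get)


def extract_targets(sessions, infer, max_workers=4):
    """Run `infer` on every session in parallel, then keep each top topic.

    `infer` is a placeholder for per-session parameter inference; it must
    return a {topic: probability} mapping for the session it receives.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The "map" step: sessions are independent, so inference can run
        # concurrently. pool.map returns results in input order.
        all_probs = list(pool.map(infer, sessions))
    # The "reduce" step here is a per-session argmax.
    return [target_topic(probs) for probs in all_probs]


if __name__ == "__main__":
    # Precomputed probabilities stand in for real inference output.
    precomputed = {
        "session-1": {"loans": 0.7, "travel": 0.3},
        "session-2": {"loans": 0.1, "travel": 0.9},
    }
    print(extract_targets(list(precomputed), precomputed.get))
```

Because each session is inferred independently, the same structure carries over to a distributed map step in which each machine handles a subset of the sessions.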
In the technical solution provided by this embodiment, the device splits the search log information into sessions whose search statements are semantically associated, determines the topic distribution of each session through the topic model, determines the uniform resource locator of the session based on the topic distribution so as to accurately locate the respective topics in the session, and finally accurately extracts the target topic from the topics of the session. Compared with the prior art, which extracts implicit topics by searching the log information as a whole, the invention groups semantically related search statements into one session, so that all topics in the session are related and any topic extracted from the session can represent the implicit topic information of the session. This overcomes the defect of the prior art that topic extraction is not accurate enough because the topic information in the log information is unrelated, and improves the accuracy of topic extraction.
The invention also provides a topic extraction device.
Referring to fig. 3, fig. 3 is a functional block diagram of a first embodiment of the topic extraction device of the present invention.
As shown in fig. 3, the topic extraction device includes:
an obtaining module 10, configured to obtain search log information and search statements;
a determining module 20, configured to determine a session in the search log information according to each search statement, where each search statement in the session is semantically associated;
the determining module 20 is further configured to determine topic distribution of the session through a topic model, and determine a uniform resource locator corresponding to the session according to the topic distribution;
and the extracting module 30 is configured to extract a target topic from each topic corresponding to the session according to the uniform resource locator corresponding to the session.
In one embodiment, the topic extracting device includes:
the determining module 20 is further configured to determine, in the search log information, reference parameters corresponding to adjacent search sentences, where the reference parameters include at least one of keywords of the adjacent search sentences and semantic similarity of the adjacent search sentences;
the extracting module 30 is further configured to extract a session from the search log information according to adjacent search statements corresponding to the reference parameter that satisfy the semantic condition, where the search statements in the session are sequentially adjacent and semantically associated.
In one embodiment, the topic extracting device includes:
the determining module 20 is further configured to determine, in the search log information, an interval duration between search time points of adjacent search statements;
the extracting module 30 is further configured to extract a session from the search log information according to the interval duration and the adjacent search sentences corresponding to the reference parameters meeting the semantic condition, where the search sentences in the session are sequentially adjacent and semantically associated, and the interval duration corresponding to the adjacent search sentences in the session is less than a preset duration.
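As an illustrative, non-limiting sketch, the session extraction performed by the determining and extracting modules can be expressed as a single pass over time-ordered search statements. Token-overlap (Jaccard) similarity is used here only as a simple stand-in for the semantic similarity of adjacent search statements, and the threshold values `min_similarity` and `max_gap_seconds` are hypothetical presets.

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def split_sessions(log, min_similarity=0.2, max_gap_seconds=1800):
    """Group adjacent search statements into sessions.

    log: list of (timestamp_seconds, search_statement), sorted by time.
    A statement joins the current session only if it is similar enough to
    the previous statement AND the interval is below the preset duration.
    """
    sessions = []
    for ts, query in log:
        if sessions:
            prev_ts, prev_query = sessions[-1][-1]
            related = jaccard(prev_query, query) > min_similarity
            close_in_time = (ts - prev_ts) < max_gap_seconds
            if related and close_in_time:
                sessions[-1].append((ts, query))
                continue
        # Otherwise the statement starts a new session.
        sessions.append([(ts, query)])
    return sessions


if __name__ == "__main__":
    log = [
        (0, "cheap flights paris"),
        (60, "paris flights deals"),
        (120, "python list sort"),
        (5000, "python sort dict"),
    ]
    for session in split_sessions(log):
        print([query for _, query in session])
```

In this example the last two statements share keywords but are separated by more than the preset duration, so they fall into different sessions.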
In one embodiment, the topic extracting device includes:
the determining module 20 is further configured to determine the marginal probability corresponding to each topic in the session according to the topic distribution;
the determining module 20 is further configured to determine a uniform resource locator corresponding to the session according to the marginal probability corresponding to the session.
In one embodiment, the topic extracting device includes:
the determining module is used for determining whether the session is associated with a click operation record;
and the execution module is used for executing the step of determining the marginal probability corresponding to each topic in the session according to the topic distribution when it is determined that the session is associated with a click operation record.
In an embodiment, the topic extraction device further includes an obtaining module, an input module, and a saving module:
the obtaining module is used for obtaining each piece of document information, wherein the document information comprises labels, and the labels comprise a topic distribution label, a word distribution label and a uniform resource locator distribution label of the document information;
the input module is used for inputting each piece of document information into a preset model so as to train the preset model;
and the saving module is used for stopping the training of the preset model when the convergence value of the preset model is smaller than a preset convergence value, and saving the preset model whose training has stopped as the topic model.
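As an illustrative, non-limiting sketch, the stopping rule applied by the saving module can be written as a loop that ends once the convergence value falls below the preset convergence value. The specification does not define how the convergence value is computed; this sketch assumes each training step returns it (for example, the change in the training objective between iterations). The names `train_until_converged` and `make_toy_step`, and the `max_iters` safeguard, are hypothetical.

```python
def train_until_converged(step, preset_convergence, max_iters=1000):
    """Call step() until its convergence value drops below the threshold.

    step: callable that performs one training update and returns the
          current convergence value (assumed to shrink as training settles).
    Returns (iterations_run, last_convergence_value).
    """
    value = float("inf")
    for iteration in range(1, max_iters + 1):
        value = step()
        if value < preset_convergence:
            # Training stops here; the model would be saved as the topic model.
            return iteration, value
    # Safeguard of this sketch: give up after max_iters updates.
    return max_iters, value


def make_toy_step(start=1.0):
    """Toy stand-in for a training update: halves the value on every call."""
    state = {"delta": start}

    def step():
        state["delta"] /= 2
        return state["delta"]

    return step


if __name__ == "__main__":
    print(train_until_converged(make_toy_step(), preset_convergence=0.1))
```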
In one embodiment, the topic extraction device further includes:
the obtaining module is used for obtaining search log information and determining a search engine corresponding to the search log information;
and the obtaining module is further used for obtaining the search record of the search engine and determining the search statements according to the search record.
In one embodiment, the topic extraction device further comprises an inference module:
the inference module is used for performing parameter inference on the session according to the uniform resource locator corresponding to the session to obtain the probability of each topic in the session;
and the determining module is used for determining the topic with the maximum probability in the session as the target topic and extracting the target topic from the session.
In an embodiment, the topic extraction device further includes a processing module:
and the processing module is used for performing parallel processing on the parameter inference of the sessions when a plurality of sessions are extracted from the search log information.
The function implementation of each module in the topic extraction device corresponds to the steps in the embodiments of the topic extraction method, and the functions and implementation processes are not described in detail here.
The present invention also provides a computer-readable storage medium having stored thereon an extraction program which, when executed by a processor, implements the steps of the topic extraction method as described in any one of the above embodiments.
The specific embodiments of the computer-readable storage medium of the present invention are substantially the same as the embodiments of the topic extraction method, and are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (13)

1. A topic extraction method, comprising:
acquiring search log information and search statements;
determining a session in the search log information according to each search statement, wherein the search statements in the session are semantically associated;
determining a topic distribution of the session through a topic model, and determining a uniform resource locator corresponding to the session according to the topic distribution;
and extracting a target topic from the topics corresponding to the session according to the uniform resource locator corresponding to the session.
2. The topic extraction method according to claim 1, wherein the step of determining a session in the search log information according to each search statement comprises:
determining reference parameters corresponding to adjacent search statements in the search log information, wherein the reference parameters comprise at least one of keywords of the adjacent search statements and semantic similarity of the adjacent search statements;
and extracting a session from the search log information according to the adjacent search statements whose reference parameters satisfy a semantic condition, wherein the search statements in the session are sequentially adjacent and semantically associated.
3. The topic extraction method according to claim 2, wherein the step of extracting a session from the search log information according to the adjacent search statements whose reference parameters satisfy a semantic condition comprises:
determining the interval duration between the search time points of adjacent search statements in the search log information;
and extracting a session from the search log information according to the interval duration and the adjacent search statements whose reference parameters satisfy the semantic condition, wherein the search statements in the session are sequentially adjacent and semantically associated, and the interval duration corresponding to adjacent search statements in the session is less than a preset duration.
4. The topic extraction method according to claim 2, wherein the semantic condition includes at least one of:
the semantic similarity of the adjacent search statements is greater than a preset similarity;
and the keywords in the adjacent search statements are the same.
5. The topic extraction method according to claim 1, wherein the step of determining the uniform resource locator corresponding to the session according to the topic distribution comprises:
determining the marginal probability corresponding to each topic in the session according to the topic distribution;
and determining the uniform resource locator corresponding to the session according to the marginal probability corresponding to the session.
6. The topic extraction method according to claim 5, wherein before the step of determining the marginal probability corresponding to each topic in the session according to the topic distribution, the method further comprises:
determining whether the session is associated with a click operation record;
and when it is determined that the session is associated with a click operation record, executing the step of determining the marginal probability corresponding to each topic in the session according to the topic distribution.
7. The topic extraction method according to claim 1, wherein before the step of determining a topic distribution of the session through a topic model, the method further comprises:
obtaining each piece of document information, wherein the document information comprises labels, and the labels comprise a topic distribution label, a word distribution label and a uniform resource locator distribution label of the document information;
inputting each piece of document information into a preset model so as to train the preset model;
and when the convergence value of the preset model is smaller than a preset convergence value, stopping training the preset model, and saving the preset model whose training has stopped as the topic model.
8. The topic extraction method according to claim 1, wherein the step of acquiring search log information and search statements comprises:
acquiring search log information and determining a search engine corresponding to the search log information;
and acquiring a search record of the search engine, and determining the search statements according to the search record.
9. The topic extraction method according to any one of claims 1 to 8, wherein the step of extracting a target topic from the topics corresponding to the session according to the uniform resource locator corresponding to the session comprises:
performing parameter inference on the session according to the uniform resource locator corresponding to the session to obtain the probability of each topic in the session;
and determining the topic with the maximum probability in the session as the target topic, and extracting the target topic from the session.
10. The topic extraction method according to claim 9, wherein the step of performing parameter inference on the session to obtain the probability of each topic in the session comprises:
when a plurality of sessions are extracted from the search log information, performing parallel processing on the parameter inference of the sessions to obtain the probability of each topic in each session.
11. A topic extraction apparatus, characterized in that the topic extraction apparatus comprises:
an obtaining module, configured to obtain search log information and search statements;
a determining module, configured to determine a session in the search log information according to each search statement, wherein the search statements in the session are semantically associated;
the determining module is further configured to determine a topic distribution of the session through a topic model, and determine a uniform resource locator corresponding to the session according to the topic distribution;
and an extracting module, configured to extract a target topic from the topics corresponding to the session according to the uniform resource locator corresponding to the session.
12. A topic extraction apparatus, characterized in that the topic extraction apparatus comprises: a topic model, a memory, a processor and an extraction program stored on the memory and executable on the processor, the topic model being connected to the processor, the extraction program when executed by the processor implementing the steps of the topic extraction method of any one of claims 1 to 10.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon an extraction program which, when executed by a processor, implements the steps of the topic extraction method of any one of claims 1 to 10.
CN202010756727.8A 2020-07-30 2020-07-30 Theme extraction method, device and equipment and computer readable storage medium Pending CN111859148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010756727.8A CN111859148A (en) 2020-07-30 2020-07-30 Theme extraction method, device and equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010756727.8A CN111859148A (en) 2020-07-30 2020-07-30 Theme extraction method, device and equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111859148A true CN111859148A (en) 2020-10-30

Family

ID=72952667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010756727.8A Pending CN111859148A (en) 2020-07-30 2020-07-30 Theme extraction method, device and equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111859148A (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1645370A (en) * 2004-01-23 2005-07-27 微软公司 Building and using subwebs for focused search
US20090063461A1 (en) * 2007-03-01 2009-03-05 Microsoft Corporation User query mining for advertising matching
CN101599071A (en) * 2009-07-10 2009-12-09 华中科技大学 The extraction method of conversation text topic
CN102332006A (en) * 2011-08-03 2012-01-25 百度在线网络技术(北京)有限公司 Information push control method and device
CN103268348A (en) * 2013-05-28 2013-08-28 中国科学院计算技术研究所 Method for identifying user query intention
CN103744970A (en) * 2014-01-10 2014-04-23 北京奇虎科技有限公司 Method and device for determining subject term of picture
CN104050235A (en) * 2014-03-27 2014-09-17 浙江大学 Distributed information retrieval method based on set selection
US20140280150A1 (en) * 2013-03-15 2014-09-18 Xerox Corporation Multi-source contextual information item grouping for document analysis
US20160203140A1 (en) * 2015-01-14 2016-07-14 General Electric Company Method, system, and user interface for expert search based on case resolution logs
CN106407169A (en) * 2016-09-09 2017-02-15 北京工商大学 Topic model-based document tagging method
CN106649818A (en) * 2016-12-29 2017-05-10 北京奇虎科技有限公司 Recognition method and device for application search intentions and application search method and server
CN106708803A (en) * 2016-12-21 2017-05-24 东软集团股份有限公司 Feature extraction method and device
WO2018036555A1 (en) * 2016-08-25 2018-03-01 腾讯科技(深圳)有限公司 Session processing method and apparatus
CN110083774A (en) * 2019-05-10 2019-08-02 腾讯科技(深圳)有限公司 Using determination method, apparatus, computer equipment and the storage medium of recommendation list

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
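周雨佳 et al.: "Dynamic personalized search algorithm based on recurrent neural network and attention mechanism", Chinese Journal of Computers (计算机学报), vol. 43, no. 5, 31 May 2020 (2020-05-31), pages 812 - 826 *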

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination