CN113761125A - Dynamic summary determination method and device, computing equipment and computer storage medium

Info

Publication number
CN113761125A
Authority
CN
China
Prior art keywords: word, keywords, determining, keyword, similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110577211.1A
Other languages
Chinese (zh)
Inventor
康战辉 (Kang Zhanhui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110577211.1A
Publication of CN113761125A
Legal status: Pending

Classifications

    • G06F 16/3344 — Information retrieval of unstructured textual data; query execution using natural language analysis
    • G06F 16/35 — Information retrieval of unstructured textual data; clustering; classification
    • G06F 40/194 — Handling natural language data; calculation of difference between files
    • G06F 40/279 — Handling natural language data; recognition of textual entities
    • G06F 40/30 — Handling natural language data; semantic analysis
    • G06N 3/04 — Neural networks; architecture, e.g. interconnection topology
    • G06N 3/08 — Neural networks; learning methods

Abstract

The application provides a dynamic summary determination method and apparatus, a computing device, and a computer storage medium. The method includes: acquiring a current document retrieved based on search content, the current document including a title part and a body part; extracting a plurality of keywords from the search content; filtering, from the plurality of keywords, those not included in the title part of the current document to obtain a first keyword set; extracting keywords from each sentence of the body part of the current document to form a corresponding second keyword set for each sentence; traversing the sentences in the body part and determining the similarity between the first keyword set and the second keyword set of each traversed sentence; and, in response to the similarity being greater than a similarity threshold, determining a portion of a dynamic summary for the current document based on the traversed sentence.

Description

Dynamic summary determination method and device, computing equipment and computer storage medium
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a dynamic summary determination method and apparatus, a computing device, and a computer storage medium.
Background
With the development of computer technology, dynamic summaries are widely used, for example, in search result summary display, document key-sentence highlighting, and the display of content related to search queries. As an example, the same document may have different dynamic summaries for different search content. In a conventional dynamic summary determination method, which sentences of a document should serve as its dynamic summary is generally decided according to how many keywords of the search content each sentence contains.
However, in dynamic summaries produced by such conventional methods, it often happens that some keywords of the search content appear repeatedly in both the document title and the dynamic summary while other keywords appear in neither, which makes the determined dynamic summary less accurate. The title and summary are then insufficient to present information related to the search content as a whole, and may even stray far from the true query intention expressed by the search content.
Disclosure of Invention
In view of the above, the present disclosure provides a dynamic summary determination method and apparatus, a computing device, and a computer storage medium, which desirably overcome some or all of the above-referenced disadvantages and possibly others.
According to a first aspect of the present disclosure, there is provided a dynamic summary determination method, including: acquiring a current document retrieved based on search content, the current document including a title part and a body part; extracting a plurality of keywords from the search content; filtering, from the plurality of keywords, those not included in the title part of the current document to obtain a first keyword set; extracting keywords from each sentence of the body part of the current document to form a corresponding second keyword set for each sentence; traversing the sentences in the body part and determining the similarity between the first keyword set and the second keyword set of each traversed sentence; and, in response to the similarity being greater than a similarity threshold, determining a portion of a dynamic summary for the current document based on the traversed sentence.
In some embodiments, traversing the sentence in the body portion, determining a similarity between the first set of keywords and a second set of keywords of the traversed sentence, comprises: determining a word vector of each keyword in the first keyword set; determining a first feature vector of the first keyword set based on the word vector of each keyword in the first keyword set; determining word vectors of all keywords in a second keyword set of the traversed sentences; determining a second feature vector of the second keyword set based on the word vector of each keyword in the second keyword set; determining a similarity between the first set of keywords and a second set of keywords of the traversed sentence based on the first and second feature vectors.
In some embodiments, determining the first feature vector of the first keyword set based on the word vectors of its keywords includes: performing bit-wise (element-wise) accumulation of the word vectors of the keywords in the first keyword set to obtain the first feature vector of the first keyword set. Likewise, determining the second feature vector of the second keyword set based on the word vectors of its keywords includes: performing bit-wise accumulation of the word vectors of the keywords in the second keyword set to obtain the second feature vector of the second keyword set.
In some embodiments, determining the similarity between the first keyword set and the second keyword set of the traversed sentence based on the first and second feature vectors includes: determining the similarity based on the distance between the first feature vector and the second feature vector, where the distance is one of the cosine distance, the Euclidean distance, and the Manhattan distance.
In some embodiments, traversing the sentence in the body portion, determining a similarity between the first set of keywords and a second set of keywords of the traversed sentence, comprises: traversing the sentences in the body portion, and determining a similarity between the first set of keywords and a second set of keywords of the traversed sentences when a current number of words of the dynamic summary is less than a word number threshold.
In some embodiments, in response to the similarity being greater than a similarity threshold, determining a portion of a dynamic summary for a current document based on the traversed sentences includes: in response to the similarity being greater than a similarity threshold and the sum of the current word count of the traversed sentence and the dynamic summary being greater than a word count threshold, determining a portion of the traversed sentence as a portion of the dynamic summary for the current document such that the sum of the word count of the portion of the traversed sentence and the current word count of the dynamic summary equals the word count threshold.
In some embodiments, extracting a plurality of keywords from the search content includes: segmenting the search content to obtain a first word set comprising a plurality of words; removing stop words from the words in the first word set to obtain a second word set; determining a word weight for each word in the second word set; and removing words whose word weights are smaller than a word weight threshold from the second word set, so as to obtain the plurality of keywords of the search content.
In some embodiments, determining a word weight for each word in the second word set includes: determining an inverse document frequency value for each word in the second word set; and taking the inverse document frequency value of each word as that word's weight.
In some embodiments, determining a word weight for each word in the second word set includes: determining an inverse document frequency value for each word in the second word set; and determining the word weight of each word based on its inverse document frequency value together with at least one of its part of speech, its position in the search content, its historical number of searches, and its historical click-through rate.
In some embodiments, determining the inverse document frequency value for each word in the second word set includes: acquiring a query log containing D pieces of search content; determining, for each word in the second word set, the number d of pieces of search content in the query log that contain that word; and taking the logarithm of the quotient of the total number D of pieces of search content in the query log and the number d, to obtain the inverse document frequency value of that word.
In some embodiments, determining a word vector for each keyword in the first keyword set includes: determining the word vector based on a trained word embedding model, where the trained word embedding model is obtained by: acquiring a query log and segmenting the search content in the query log to obtain a plurality of participles; and training the word embedding model either by taking each respective participle as the input of the model and the context participles of that participle as the output, or by taking each respective participle as the output of the model and its context participles as the input, to arrive at the trained word embedding model.
According to a second aspect of the present disclosure, there is provided a dynamic summary determination apparatus, including: a current document acquisition module configured to acquire a current document searched based on search content, the current document including a title part and a body part; a first keyword extraction module configured to extract a plurality of keywords in the search content; a first keyword set determination module configured to filter keywords, which are not included in a title portion of the current document, from the plurality of keywords as a first keyword set; a second keyword extraction module configured to extract keywords for each sentence of the body part of the current document to correspondingly form a second keyword set for each sentence; a similarity determination module configured to traverse the sentences in the body part, determine a similarity between the first set of keywords and a second set of keywords of the traversed sentences; a dynamic summary determination module configured to determine a portion of a dynamic summary for a current document based on the traversed sentences in response to the similarity being greater than a similarity threshold.
According to a third aspect of the present disclosure, there is provided a computing device including: a memory configured to store computer-executable instructions; and a processor configured to perform any of the methods described above when the computer-executable instructions are executed by the processor.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed, perform any of the methods described above.
In the dynamic summary determination method and apparatus, the computing device, and the computer storage medium claimed in the present disclosure, the keywords of the search content already contained in the title of the current document are identified and excluded from further consideration when determining the dynamic summary, which prevents certain keywords from appearing repeatedly in both the document title and the dynamic summary. Then, by traversing each sentence of the document body and comparing the similarity between the keywords of each sentence and the set of search-content keywords not contained in the title, it is efficiently determined which sentences should become part of the dynamic summary. Because the hit rate of the search content as a whole across the article title and the dynamic summary is taken into account, the accuracy of the determined dynamic summary is improved: repeated keywords are avoided, the document title and the dynamic summary together sufficiently present information related to the search content as a whole, and the user experience is further improved.
These and other advantages of the present disclosure will become apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Embodiments of the present disclosure will now be described in more detail and with reference to the accompanying drawings, in which:
fig. 1 illustrates an exemplary application scenario in which a technical solution according to an embodiment of the present disclosure may be implemented;
FIG. 2 illustrates a schematic flow chart diagram of a dynamic summary determination method according to one embodiment of the present disclosure;
FIG. 3 illustrates a schematic flow chart diagram of a method of determining similarity between two sets of keywords involved in the present disclosure, according to one embodiment of the present disclosure;
FIG. 4 illustrates a schematic flow chart diagram of a method of extracting a plurality of keywords in search content according to one embodiment of the present disclosure;
FIG. 5 illustrates an exemplary detailed schematic framework diagram of a word embedding model according to one embodiment of the present disclosure;
FIG. 6 illustrates a schematic effect diagram of a dynamic summary determined using the related art;
FIG. 7 illustrates a schematic effect diagram of a dynamic summary determined using a dynamic summary determination method according to one embodiment of the present disclosure;
FIG. 8 illustrates an exemplary block diagram of a dynamic summary determination apparatus according to an embodiment of the present disclosure;
fig. 9 illustrates an example system that includes an example computing device that represents one or more systems and/or devices that may implement the various techniques described herein.
Detailed Description
The following description provides specific details of various embodiments of the disclosure so that those skilled in the art can fully understand and practice them. It is understood that aspects of the disclosure may be practiced without some of these details. In some instances, well-known structures or functions are not shown or described in detail to avoid obscuring the description of the embodiments with unnecessary detail. The terminology used in the present disclosure should be interpreted in its broadest reasonable manner, even when used in conjunction with a particular embodiment.
First, some terms referred to in the embodiments of the present application are explained to facilitate understanding by those skilled in the art.
Dynamic summary (dynamic abstract): a search engine term for a technique that dynamically displays the main content of a retrieved document. In response to the user inputting search content, the search engine extracts the text surrounding the search content in a document according to its position in the document and returns it as a dynamic summary. Since the same document may be recalled by different search content, the dynamic summarization technique may form different dynamic summaries for the same document according to the different search content.
Search content: i.e., a query. To find a particular document, web site, record, or series of records in a database, the user enters into a search engine the words, sentences, or any other suitable content used to retrieve data from the database.
Query log (query log): a log is a record used to document the work done each day. In computer science, a log refers to the operation records (e.g., a server log) of a computer device such as a server, or of software. When problems occur in computer equipment or software, logs are an important basis for troubleshooting. A query log records information related to the search content input by users and received from clients.
Stop words (Stop Words): high-frequency words that do not carry any subject information, such as "the", "also", and "having". In information retrieval, it is preferable to filter out these words when processing natural language data (or text) in order to save storage space and improve search efficiency. Stop words are entered manually rather than generated automatically, and together they form a stop word list. When stop words are later filtered out, the stop words in a document can be identified by querying the stop word list.
Tokenizer (word segmenter): a tool that analyzes a piece of text input by a user into logical tokens. Common tokenizers include English tokenizers and Chinese tokenizers. The processing flow of an English tokenizer is generally: input text → split into tokens → remove stop words → lemmatize → convert to lowercase. A Chinese tokenizer segments a sequence of Chinese characters into individual words; in other words, word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain rules. During this process, stop words, i.e., words that do not affect the semantics, can be identified. Commonly used tokenizers include jieba, Mmseg4j, and Ansj.
Natural language processing: Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like. Natural language processing is mainly applied to machine translation, public opinion monitoring, automatic summarization, opinion extraction, text classification, question answering, text semantic comparison, speech recognition, Chinese OCR, and so on.
Word embedding: word embedding is a kind of representation of words in which words with similar meanings have similar representations; it is a general term for methods that map words to real-valued vectors. Conceptually, it embeds a high-dimensional space whose dimension is the number of all words into a continuous vector space of much lower dimension, with each word or phrase mapped to a vector over the real numbers. Common word embedding methods include Word2Vec, fastText, and GloVe. Word2Vec is a group of related models used to generate word vectors: shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. After training is complete, a Word2Vec model can map each word to a vector, which is taken from the hidden layer of the neural network. GloVe (Global Vectors for Word Representation) is a word representation tool based on global word frequency statistics (count-based and overall statistics) that represents a word as a vector of real numbers capturing semantic properties between words, such as similarity and analogy. fastText is a fast text classification algorithm with two advantages over neural-network-based classification algorithms: it speeds up training and testing while maintaining high accuracy, and it does not require pre-trained word vectors.
Inverse document frequency (inverse document frequency): a common weighting technique used in information retrieval and text mining. It is a statistical measure used to evaluate how important a word is to a document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to the frequency with which it appears across the corpus. Search engines often apply various forms of inverse-document-frequency weighting as a measure of the degree of relevance between a document and a user query.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making. Artificial intelligence is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers can simulate or realize human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, convolutional neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
Deep Learning (DL) is a new research direction in the field of Machine Learning (ML); it was introduced into machine learning to bring it closer to the original goal, Artificial Intelligence (AI). Deep learning learns the intrinsic laws and representation levels of sample data, and the information obtained in this process is very helpful for interpreting data such as text, images, and sounds. Its ultimate goal is to enable machines to analyze and learn like humans, and to recognize data such as text, images, and sounds.
The technical solution provided by this application relates to natural language processing technology, and in particular to dynamic summary technology.
Fig. 1 illustrates an exemplary application scenario 100 in which a technical solution according to an embodiment of the present disclosure may be implemented. As shown in fig. 1, the illustrated application scenario includes a terminal 110, a server 120, the terminal 110 communicatively coupled with the server 120 through a network 130.
The terminal 110 may, for example, act as an input device for inputting search content and transmitting the search content to the server 120 via the network 130. The search content may be, for example, search content for searching for related documents.
As an example, the server 120 may obtain a current document searched based on the search content, extract a plurality of keywords from the search content, and filter the keywords that are not included in the title portion of the current document from the plurality of keywords as a first keyword set; then, the server 120 may extract keywords for each sentence of the body part of the current document to correspondingly form a second keyword set for each sentence; next, the server 120 may traverse the sentences in the body part and determine the similarity between the first keyword set and the second keyword set of the traversed sentence; finally, in response to the similarity being greater than a similarity threshold, the server 120 may determine a portion of the dynamic summary for the current document based on the traversed sentence. The server 120 may send the determined dynamic summary to the terminal 110 for presentation.
The scenario described above is only one example in which the embodiments of the present disclosure may be implemented, and is not limiting. For example, in some embodiment scenarios, it is also possible that a dynamic summary determination process may be implemented on the terminal 110.
For example, the terminal 110 may serve as an input device for inputting search content, and the current document acquired based on the search content may be saved in the background of the terminal. The terminal 110 then extracts a plurality of keywords from the search content and filters those not included in the title part of the current document as the first keyword set; next, it extracts keywords for each sentence of the body part of the current document to form the corresponding second keyword set for each sentence; it then traverses the sentences in the body part and determines the similarity between the first keyword set and the second keyword set of the traversed sentence; finally, in response to the similarity being greater than a similarity threshold, the terminal 110 may determine a portion of the dynamic summary for the current document based on the traversed sentence.
It should be noted that the terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein. The network 130 may be, for example, a Wide Area Network (WAN), a Local Area Network (LAN), a wireless network, a public telephone network, an intranet, or any other type of network known to those skilled in the art.
In some embodiments, the application scenario 100 described above may be a distributed system consisting of clusters of terminals 110 and servers 120, which may, for example, constitute a blockchain system. For example, in the application scenario 100, the determination and storage of the dynamic summary may be performed in a blockchain system, thereby achieving a decentralized effect. As an example, after the dynamic summary is determined, it may be stored in the blockchain system for subsequent retrieval when the same search is conducted. The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptography, each block containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
Fig. 2 illustrates a schematic flow diagram of a dynamic summary determination method 200 according to one embodiment of the present disclosure. The dynamic digest determination method may be implemented by the terminal 110 or the server 120 as shown in fig. 1, for example. As shown in fig. 2, the method 200 includes the following steps.
In step 210, a current document searched based on the search content is obtained, the current document including a title portion and a body portion. Search content may typically include a plurality of terms in order to more clearly characterize the user's query intent. The search engine may retrieve a plurality of related documents for the search content as a search result for the search content. In an embodiment of the present disclosure, a related document may be acquired from the plurality of related documents searched based on the search content as the current document.
In step 220, a plurality of keywords in the search content are extracted. In embodiments of the present disclosure, various technical means may be used to extract a plurality of keywords in the search content, which is not limiting. For example, the search content may be segmented using the above-described various segmenters, and then a plurality of keywords in the search content may be extracted from the obtained segmentations according to the importance of the obtained segmentations.
In step 230, keywords that are not included in the title portion of the current document are filtered from the plurality of keywords as a first keyword set. As an example, suppose the keywords "car", "maintenance", "standard", and "manual" are extracted from the search content, and the title of the current document (a search result for that content) contains "car" and "maintenance"; the first keyword set will then contain "standard" and "manual". When the dynamic summary is determined according to this first keyword set, it will contain one or both of "standard" and "manual", so the dynamic summary and the title together will contain 3/4 or 4/4 of the search content keywords, and the hit rate of the search content as a whole across the title and the dynamic summary is 75%-100%. In contrast, in the conventional related art the dynamic summary is determined according to all keywords of the search content, and it is highly likely to include one or both of "car" and "maintenance" rather than "standard" and "manual"; the dynamic summary and the title together will then contain only 1/4 or 2/4 of the search content keywords, and the hit rate of the search content as a whole across the title and the dynamic summary is 25%-50%.
Because the keywords already contained in the title of the current document are excluded when determining the first keyword set, they are no longer considered in the subsequent step of determining the dynamic summary. This prevents some keywords of the search content from appearing repeatedly in the document title and the dynamic summary while other keywords appear in neither, and thus raises the hit rate of the search content as a whole across the article title and the dynamic summary. Because step 230 takes this overall hit rate into account, the document title and dynamic summary extracted by the disclosed method can sufficiently present information related to the search content as a whole; the accuracy of the determined dynamic summary is improved, and the user experience is improved.
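Purely as an illustration (the patent prescribes the logic of step 230, not any particular implementation), the filtering can be sketched as a set difference in Python; the function name and the sample title words are hypothetical:

```python
# A minimal sketch of step 230, assuming the query keywords and the title
# have already been tokenized. All names here are illustrative.
def build_first_keyword_set(query_keywords, title_words):
    """Keep only the query keywords that do not appear in the title."""
    title_set = set(title_words)
    return [kw for kw in query_keywords if kw not in title_set]

# Usage with the example above ("cost" and "guide" are assumed title words):
query_keywords = ["car", "maintenance", "standard", "manual"]
title_words = ["car", "maintenance", "cost", "guide"]
print(build_first_keyword_set(query_keywords, title_words))
# -> ['standard', 'manual']
```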
At step 240, keywords are extracted for each sentence of the body part of the current document to correspondingly form a second set of keywords for each sentence. In the embodiments of the present disclosure, the keywords may be extracted for each sentence of the body part of the current document using various technical means, which is not restrictive.
In some embodiments, extracting keywords for each sentence of the body part of the current document may proceed in the same way as extracting the plurality of keywords from the search content in step 220. As an example, the steps of extracting keywords for the sentence "The benefit of Internet technology is obvious" in the current document may be as follows: first, the sentence is segmented to obtain a first word set comprising "Internet", "technology", "of", "benefit", "is", and "obvious"; then, the stop words ("of" and "is") are removed to obtain a second word set; next, a word weight (which may be calculated, for example, from historical occurrences) is determined for each word in the second word set, here 0.7, 0.5, 0.6, and 0.2 for "Internet", "technology", "benefit", and "obvious", respectively; finally, words whose word weight is less than a predetermined threshold (set to 0.4 here) are removed from the second word set, so "obvious" is removed and the keywords of the sentence are finally determined to be "Internet", "technology", and "benefit". A minimal sketch of this per-sentence extraction follows.
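The sketch below mirrors the worked example; the stop-word table and the weight lookup are stand-ins assumed for illustration, not components defined by the patent:

```python
# Illustrative sketch of per-sentence keyword extraction (step 240).
STOP_WORDS = {"of", "is"}                        # assumed stop-word table
WORD_WEIGHTS = {"internet": 0.7, "technology": 0.5,
                "benefit": 0.6, "obvious": 0.2}  # assumed historical weights

def extract_sentence_keywords(tokens, weight_threshold=0.4):
    """Remove stop words, then keep tokens whose weight meets the threshold."""
    filtered = [t for t in tokens if t not in STOP_WORDS]
    return [t for t in filtered if WORD_WEIGHTS.get(t, 0.0) >= weight_threshold]

tokens = ["internet", "technology", "of", "benefit", "is", "obvious"]
print(extract_sentence_keywords(tokens))
# -> ['internet', 'technology', 'benefit']
```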
At step 250, the sentence in the body portion is traversed, and a similarity between the first set of keywords and a second set of keywords of the traversed sentence is determined. Various different methods may be used to determine the similarity between the first set of keywords and the second set of keywords of the traversed sentence, without limitation.
As an example, suppose the first keyword set contains four words including "jet", "airplane", and "cost", and the second keyword set of the traversed sentence includes "airplane", "airport", "cost", and "hill". In some embodiments, the similarity is determined as the ratio of the number of words contained in both the first and second keyword sets to the number of words contained in the first keyword set. For example, here the two sets share 2 words ("airplane" and "cost") and the first keyword set contains 4 words, so the similarity is 2 ÷ 4 = 0.5. In other embodiments, a similarity matrix is determined based on the word vectors corresponding to the words in the first keyword set and those corresponding to the words in the second keyword set, and the similarity is extracted from the similarity matrix.
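A sketch of the word-overlap variant of the similarity, assuming both sets are already extracted (the fourth word of the first set is assumed here, since the source lists only three of the four):

```python
# Share of first-set keywords that also occur in the sentence's keyword set.
def overlap_similarity(first_set, second_set):
    first, second = set(first_set), set(second_set)
    return len(first & second) / len(first) if first else 0.0

print(overlap_similarity(["jet", "airplane", "cost", "fuel"],   # "fuel" assumed
                         ["airplane", "airport", "cost", "hill"]))
# -> 0.5  (two shared words out of four)
```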
In some embodiments, traversing the sentence in the body portion, determining a similarity between the first set of keywords and a second set of keywords of the traversed sentence, may include: traversing the sentences in the body portion, and determining a similarity between the first set of keywords and a second set of keywords of the traversed sentences when a current number of words of the dynamic summary is less than a word number threshold. As an example, if the threshold of the number of words of the dynamic summary is 100 words, traversing the sentence in the body part, and if the current number of words of the dynamic summary is less than the threshold of the number of words, for example, the current number of words of the dynamic summary is 90 words, determining a similarity between the first set of keywords and the second set of keywords of the traversed sentence; if the current word count of the dynamic summary is not less than the word count threshold, for example, the current word count of the dynamic summary is 100 words, the similarity between the first keyword set and the second keyword set of the traversed sentence is not determined.
At step 260, in response to the similarity being greater than a similarity threshold, a portion of the dynamic summary for the current document is determined based on the traversed sentence. The similarity threshold may be preset as needed, and is not limited.
In some embodiments, when determining a portion of the dynamic summary for the current document based on the traversed sentences, a portion or all of the traversed sentences may be determined to be part of the dynamic summary of the current document. As an example, some phrases, sentence stems, or the traversed sentence as a whole may be extracted or determined as part of a dynamic summary for the current document. A final dynamic summary may be formed as the sentences in the body portion are traversed and a portion of the dynamic summary for the current document is determined based on the traversed sentences.
In some embodiments, in response to the similarity being greater than a similarity threshold, determining a portion of a dynamic summary for a current document based on the traversed sentences may include: in response to the similarity being greater than a similarity threshold and the sum of the current word count of the traversed sentence and the dynamic summary being greater than a word count threshold, determining a portion of the traversed sentence as a portion of the dynamic summary for the current document such that the sum of the word count of the portion of the traversed sentence and the current word count of the dynamic summary equals the word count threshold. As an example, assuming that the threshold number of words of the dynamic summary is 100 words, the current number of words of the dynamic summary is 95 words, and the traversed sentence is 15 words, 5 words of the traversed sentence (which may be the head, the tail, the middle extracted word, etc. of the sentence, without limitation) are determined as a part of the dynamic summary.
The method 200 considers the keywords of the search content already contained in the title of the current document and does not consider them again when determining the dynamic summary, which prevents certain keywords from appearing repeatedly in the document title and the dynamic summary. It then traverses each sentence of the document body, compares the similarity between the sentence's keywords and the set of search-content keywords not contained in the title, and judges whether the sentence belongs in the dynamic summary according to whether that similarity exceeds the similarity threshold. Because the hit rate of the search content as a whole across the article title and the dynamic summary is taken into account, the accuracy of the determined dynamic summary is improved while repeated keywords are avoided; the document title and the dynamic summary can sufficiently present information related to the search content as a whole, further improving the user experience.
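Under assumed helper functions, the traversal of method 200 (including the word-budget variants described above) can be sketched end to end as follows; the patent fixes the logic, not this code:

```python
# A sketch of steps 250-260 with the word-count threshold and truncation
# variants. sim_fn and extract_fn are assumed helpers (similarity between
# keyword sets, and per-sentence keyword extraction, respectively).
def build_dynamic_summary(sentences, first_set, sim_fn, extract_fn,
                          sim_threshold=0.5, word_limit=100):
    summary_parts, word_count = [], 0
    for sentence in sentences:              # traverse the body part
        if word_count >= word_limit:        # stop once the budget is used up
            break
        second_set = extract_fn(sentence)   # second keyword set of the sentence
        if sim_fn(first_set, second_set) > sim_threshold:
            remaining = word_limit - word_count
            part = sentence[:remaining]     # truncate to fit the word budget
            summary_parts.append(part)
            word_count += len(part)
    return "".join(summary_parts)
```

Here `len()` counts characters, which stands in for the word count of the patent's Chinese text; a tokenized word count could be substituted.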
FIG. 3 illustrates a schematic flow chart diagram of a method 300 of determining similarity between two sets of keywords according to one embodiment of the present disclosure. The two sets of keywords comprise the first set of keywords and a second set of keywords of the traversed sentence. The method 300 may be used, for example, to implement step 250 described with reference to fig. 2. As shown in fig. 3, the method 300 includes the following steps.
In step 310, a word vector for each keyword in the first keyword set is determined. The word vector may be determined using a trained word embedding model. The trained model may be obtained by training a word embedding model on an open-source corpus (e.g., an open corpus provided by Google) as the training set, or on a specific corpus (e.g., a training set built from words of a certain domain) as the training set, which is not limited herein. The word embedding model may be a common one, such as Word2Vec, fastText, or GloVe, which is likewise not limited herein.
In some embodiments, determining a word vector for each keyword in the first keyword set includes: determining the word vector based on the trained word embedding model, where the trained model is obtained as follows: acquire a query log (which records information related to the search content input by users and received from clients) and segment the search content in the query log to obtain a plurality of participles; then train the word embedding model either by taking each respective participle as the input of the model and the context participles of that participle as the output, or by taking each respective participle as the output and its context participles as the input, to arrive at the trained word embedding model. Because the participles used for training come from the query log, which records the search content actually input by users, the trained word embedding model can better extract the features of search content, and the resulting word vectors better represent each word in the search content.
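One possible realization (an assumption; the patent does not name a library) trains the model with gensim's Word2Vec on the segmented query log, where the `sg` flag switches between the two training directions described above:

```python
# Hedged sketch: training a word embedding model on query-log segmentations,
# assuming gensim 4.x. The sample queries are illustrative.
from gensim.models import Word2Vec

segmented_queries = [
    ["car", "maintenance", "standard", "manual"],
    ["hiking", "route", "mountain"],
    # ... many more segmented search contents from the query log
]

# sg=1: skip-gram (the current participle predicts its context participles);
# sg=0: CBOW (the context participles predict the current participle).
model = Word2Vec(segmented_queries, vector_size=200, window=5,
                 min_count=1, sg=1, epochs=5)
vector = model.wv["car"]  # a 200-dimensional word vector
```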
In step 320, a first feature vector of the first keyword set is determined based on the word vector of each keyword in the first keyword set.
In some embodiments, determining the first feature vector of the first keyword set based on the word vectors of its keywords includes: accumulating the word vectors of all keywords in the first keyword set bit-wise (i.e., element by element) to obtain the first feature vector. As an example, if the first keyword set includes the words "fine dried noodles", "easy", and "pan pasting", then after the 200-dimensional word vectors corresponding to these three words are determined, the three vectors are accumulated element-wise to obtain a single 200-dimensional vector, which is the first feature vector of the first keyword set.
At step 330, a word vector for each keyword in the second keyword set of the traversed sentence is determined. These word vectors may be determined using the trained word embedding model described above or another suitable word embedding model, such as Word2Vec, fastText, or GloVe, which is not limited herein.
In step 340, a second feature vector of the second keyword set is determined based on the word vectors of its keywords. In some embodiments, this includes accumulating the word vectors of all keywords in the second keyword set bit-wise to obtain the second feature vector. As an example, if the second keyword set includes the words "handmade noodles", "fine dried noodles", "sliced noodles", and "burnt pot", then after the 200-dimensional word vectors corresponding to these four words are determined, the four vectors are accumulated element-wise to obtain a single 200-dimensional vector, which is the second feature vector of the second keyword set.
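The bit-wise (element-wise) accumulation of steps 320 and 340 can be sketched as a simple vector sum; the dict-of-vectors lookup is an assumption for illustration:

```python
# Sketch of accumulating word vectors into a feature vector, element by element.
import numpy as np

def feature_vector(keywords, word_vectors, dim=200):
    """word_vectors: dict mapping a keyword to its dim-dimensional vector."""
    vec = np.zeros(dim)
    for kw in keywords:
        vec += word_vectors[kw]   # bit-wise (element-wise) accumulation
    return vec

# first_vec  = feature_vector(["fine dried noodles", "easy", "pan pasting"], wv)
# second_vec = feature_vector(["handmade noodles", "fine dried noodles",
#                              "sliced noodles", "burnt pot"], wv)
```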
In step 350, the similarity between the first keyword set and the second keyword set of the traversed sentence is determined based on the first and second feature vectors. As an example, with the first and second feature vectors each being 200-dimensional, the cosine similarity of the two vectors is calculated and used as the similarity between the two keyword sets. Cosine similarity between vectors, commonly used to measure text similarity, depends on the cosine distance between the vectors, which maps the vectors into a vector space according to their coordinate values. The formula for the cosine distance between vectors a and b is:
$$\cos(\theta) = \frac{a \cdot b}{\|a\|\,\|b\|}$$
Let the coordinates of the vectors a and b in two-dimensional space be
$$a = (x_1, y_1), \quad b = (x_2, y_2).$$
Then the cosine distance between the vectors a and b in two-dimensional space is expressed as:
$$\cos(\theta) = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}.$$
Let the coordinates of the vectors a and b in n-dimensional space be $a = (A_1, A_2, \ldots, A_n)$ and $b = (B_1, B_2, \ldots, B_n)$; then the cosine distance between a and b in n-dimensional space is expressed as:
$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}.$$
in some embodiments, determining a similarity between the first set of keywords and the second set of keywords of the traversed sentence based on the first and second feature vectors may include: and determining the similarity between the first keyword set and the second keyword set of the traversed sentence based on the distance between the first feature vector and the second feature vector, wherein the distance comprises one of cosine distance, Euclidean distance and Manhattan distance. As an example, the distance may be a euclidean distance, i.e. a euclidean distance between the first feature vector and the second feature vector is calculated, and the euclidean distance is taken as a similarity between the first keyword set and the second keyword set of the traversed sentence.
The method 300 determines a similarity between the first set of keywords and the second set of keywords of the traversed sentence by comparing the first feature vector based on the first set of keywords and the second feature vector based on the second set of keywords. The similarity extracted in this way can more accurately characterize the similarity between the first set of keywords and the second set of keywords of the traversed sentence, so as to determine a dynamic summary according to the similarity in the subsequent steps.
Fig. 4 illustrates a schematic flow chart diagram of a method 400 of extracting a plurality of keywords in search content according to one embodiment of the present disclosure. The method 400 may be used, for example, to implement step 220 described with reference to fig. 2. As shown in fig. 4, the method 400 includes the following steps.
At step 410, the search content is segmented to obtain a first word set comprising a plurality of words. In embodiments of the present disclosure, the search content may be segmented using various technical means, which are not limiting. For example, the search content may be segmented using any of the tokenizers described above (e.g., jieba, Mmseg4j, Ansj, etc.).
At step 420, stop words are removed from the plurality of words in the first word set to obtain a second word set. As previously mentioned, stop words are high-frequency words that do not carry any subject information, such as "the", "also", and "having". Removing stop words here saves storage space, improves search efficiency, and reduces interference in determining the dynamic summary. In some embodiments, stop words in the first word set can be identified and removed by querying a pre-constructed stop word list.
At step 430, a word weight is determined for each word in the second word set. In some embodiments, this may include: determining an inverse document frequency value for each word in the second word set; and taking the inverse document frequency value of each word as that word's weight. The inverse document frequency is used to evaluate the importance of each word in the second word set.
In some embodiments, determining a word weight for each word in the second word set may include: determining an inverse document frequency value for each word; and determining the word weight of each word based on its inverse document frequency value together with at least one of its part of speech, its position in the search content, its historical number of searches, and its historical click-through rate. Because the word weight considers not only the inverse document frequency value but also at least one of these additional signals, the final word weight reflects the importance of each word in the second word set more comprehensively and accurately.
In some embodiments, determining an inverse document frequency value for each word in the second word set may include: acquiring a query log containing D pieces of search content; determining, for each word in the second word set, the number d of pieces of search content in the query log that contain that word; and taking the logarithm of the quotient of the total number D and the number d to obtain the inverse document frequency value of the word. As an example, if the query log includes 1000 pieces of search content and 300 of them contain the corresponding word, i.e., D = 1000 and d = 300, then the inverse document frequency value of the word is log_e(D/d) = log_e(1000/300) ≈ 1.204.
By way of example, the inverse document frequency may be determined by querying an inverse document frequency dictionary. The dictionary can be computed from the query log, with the inverse document frequency $\mathrm{idf}_i$ of a word $t_i$ in the dictionary calculated according to the following formula:
$$\mathrm{idf}_i = \log \frac{|D|}{\left|\{\, j : t_i \in d_j \,\}\right|}$$
where the numerator |D| represents the total number of pieces of search content in the query log, and the denominator represents the number of pieces of search content that contain the word $t_i$.
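A sketch of this inverse document frequency computation over a query log, matching the formula above; the data structures are illustrative:

```python
import math

def idf(word, query_log):
    """query_log: a list of token lists, one per piece of search content."""
    D = len(query_log)
    d = sum(1 for query in query_log if word in query)
    return math.log(D / d) if d else 0.0

# With D = 1000 queries of which d = 300 contain the word:
# idf -> log(1000 / 300) = 1.204 (natural logarithm)
```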
In step 440, words whose word weights are less than a word weight threshold are removed from the second word set to obtain the plurality of keywords of the search content. The word weight threshold may be set as desired and is not limiting. As an example, suppose the second word set includes the words "Sima Qian", "Records of the Grand Historian", "Western Han", and "celebrity" with corresponding word weights 0.6, 0.5, 0.7, and 0.2, and the word weight threshold is set to 0.4; then "celebrity" is removed, and the keywords of the search content are finally determined to be "Sima Qian", "Records of the Grand Historian", and "Western Han".
The method 400 extracts a plurality of keywords in search content by segmenting the search content, removing stop words, determining weights, and removing words whose weights are less than a weight threshold. Compared with the search content, the extracted keywords can more accurately and briefly represent the meaning of the search content, and the dynamic abstract can be conveniently determined in the subsequent steps according to the meaning of the search content.
FIG. 5 illustrates an exemplary detailed schematic framework diagram of a word embedding model according to one embodiment of the present disclosure. As shown in fig. 5, the word embedding model embeds a high-dimensional space, whose dimension equals the number of all words, into a continuous vector space of much lower dimension, so that each word or phrase is mapped to a vector over the real numbers. The model may be a three-layer neural network comprising an input layer, a hidden layer, and an output layer.
The input layer receives an input vector, usually a one-hot vector. The hidden layer processes the input vector through its nodes; for example, to represent each word with 300 features (i.e., as a 300-dimensional vector), the hidden layer has 300 nodes, whose weights can be represented as a matrix with 300 columns and a number of rows that depends on the dimension of the input vector. The output layer processes the output of the hidden layer into a probability distribution; it can be a softmax regression classifier, in which each node outputs a value (probability) between 0 and 1 and the probabilities of all output-layer neuron nodes sum to 1.
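A minimal numpy sketch of such a three-layer network; the vocabulary size here is an arbitrary assumption, and the one-hot multiplication is factored into a row lookup, as real implementations do:

```python
import numpy as np

vocab_size, embed_dim = 10000, 300     # 300 features per word, as in the text

rng = np.random.default_rng(0)
W_in = rng.normal(0.0, 0.01, (vocab_size, embed_dim))   # hidden-layer weights
W_out = rng.normal(0.0, 0.01, (embed_dim, vocab_size))  # output-layer weights

def forward(word_index):
    """One-hot input -> linear hidden layer -> softmax over the vocabulary."""
    h = W_in[word_index]        # multiplying a one-hot vector by W_in simply
                                # selects the row for that word
    logits = h @ W_out
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()      # probabilities between 0 and 1, summing to 1

# After training, W_in[word_index] serves as the 300-dimensional word vector.
```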
In training a word embedding model, training is typically based on word pairs: each training sample is an (input word, output word) pair in which the input word is used to predict the output word, and both words are encoded as one-hot vectors. Typically, the input word and the output word have a context relationship (e.g., they are adjacent words); that is, the trained word embedding model is obtained by exploiting the context relationships among the words in the corpus.
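A sketch of generating such training pairs from a segmented corpus; the symmetric window size is an assumption, since the disclosure only requires the words to have a context relationship such as adjacency:

```python
def context_pairs(tokens, window=1):
    """Yield (input_word, output_word) pairs from neighbouring words."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (center, tokens[j])   # predict context from center;
                                            # swapping the two gives the
                                            # opposite training direction

print(list(context_pairs(["Mount Lu", "hiking", "route"])))
# [('Mount Lu', 'hiking'), ('hiking', 'Mount Lu'),
#  ('hiking', 'route'), ('route', 'hiking')]
```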
Fig. 6 illustrates a schematic effect diagram of a dynamic summary determined using the related art. As shown in fig. 6, the dynamic summary determination scheme of the related art does not consider what the title already hits; that is, the displayed information as a whole pays no attention to the global hit experience. This is more pronounced when the search engine faces longer search content (as shown in fig. 6). By way of example, when searching for a "Mount Lu hiking route", only "Mount Lu" and "hiking" are hit, whether in the title or in the dynamic summary, and "route" appears in neither. This is because the related art, when traversing the sentences of the body to determine the dynamic summary, considers only the similarity between the traversed sentence and the search content, ignoring that the title of the current document already contains part of the search content. As a result, part of the search content appears repeatedly in the title and the dynamic summary of the current document, while another part appears in neither.
As shown in fig. 6, the related art does not consider the hit rate of the search content as a whole across the article title and the dynamic summary, so some keywords in the search content ("Mount Lu", "hiking") appear repeatedly in both the document title and the dynamic summary, while other keywords ("route") appear in neither, making the document title and dynamic summary extracted by the related art insufficient to present information related to the search content as a whole. By way of example, a hit of a search-content keyword in the current document may be indicated by bolding its font, or by highlighting the related text, underlining the keyword, skewing or enlarging its font, or marking it in a color different from that of the body or the title, which is not limited here.
Fig. 7 illustrates a schematic effect diagram of a dynamic summary determined using a dynamic summary determination method according to one embodiment of the present disclosure. The present embodiment is likewise directed to the dynamic summary of a current document retrieved for "Mount Lu hiking route", where the document title already contains "Mount Lu" and "hiking". As shown in fig. 7, when the system selects the body segments to include in the dynamic summary, it knows that "Mount Lu" and "hiking" are already present in the title, and therefore tends to extract body segments that at least contain "route" as the dynamic summary segments and to bold them. This allows the user to determine, without clicking through to read the full text, that the article is about a "Mount Lu hiking route" rather than merely "Mount Lu hiking".
As can be seen by comparing fig. 6 and fig. 7, compared with the conventional dynamic summary determination method, the method proposed by the present disclosure sets aside, when determining the dynamic summary, the search-content keywords ("Mount Lu" and "hiking") already contained in the title of the current document. This avoids repeated occurrence of those keywords in the document title and the dynamic summary, and increases the overall hit rate of the search content across the title (hitting "Mount Lu" and "hiking") and the dynamic summary (hitting "route"), so that the title and the dynamic summary extracted by the present disclosure present information related to the search content as a whole. The method thus improves the accuracy of the determined dynamic summary while avoiding repetition, allowing the document title and the dynamic summary to sufficiently present information related to the entire search content, thereby improving the user experience.
Meanwhile, as can be seen from fig. 7, the words hit in the body need not be literally identical to the search-content keywords that are absent from the title; only their semantics must essentially match. For example, in fig. 7 a word in the body is hit according to the keyword "route" in the search content; although the two are different words, their semantics are basically the same. This is because the present disclosure compares the similarity between the keywords of a sentence and the set of search-content keywords not included in the title, rather than requiring the sentence keywords to be exactly identical to those keywords, which makes the dynamic summary determined by the method of the present disclosure more robust and accurate.
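This semantic, rather than literal, matching falls out of the vector comparison: the keywords of each set are accumulated element-wise into a feature vector and compared by a distance such as the cosine distance. A sketch with toy vectors standing in for the trained embeddings of fig. 5:

```python
import numpy as np

def set_vector(keywords, word_vectors):
    """Element-wise ("bit-wise") accumulation of the keywords' word vectors."""
    return np.sum([word_vectors[k] for k in keywords], axis=0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors; real ones come from the word embedding model.
word_vectors = {
    "route":   np.array([0.9, 0.1, 0.0]),
    "path":    np.array([0.8, 0.2, 0.1]),   # near-synonym of "route"
    "weather": np.array([0.0, 0.1, 0.9]),
}
first = set_vector(["route"], word_vectors)   # keywords missing from the title
print(cosine_similarity(first, set_vector(["path"], word_vectors)))     # ~0.98
print(cosine_similarity(first, set_vector(["weather"], word_vectors)))  # ~0.01
# A sentence containing "path" clears a 0.4 similarity threshold even though
# it never contains the literal keyword "route".
```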
Fig. 8 illustrates an exemplary block diagram of a dynamic summary determination apparatus 800 according to an embodiment of the present disclosure. As shown in fig. 8, the dynamic summary determination apparatus includes: a current document acquisition module 810, a first keyword extraction module 820, a first keyword set determination module 830, a second keyword extraction module 840, a similarity determination module 850, and a dynamic summary determination module 860.
A current document acquisition module 810 configured to acquire a current document searched based on the search content, the current document including a title part and a body part.
A first keyword extraction module 820 configured to extract a plurality of keywords in the search content.
A first keyword set determination module 830 configured to filter keywords not included in the title portion of the current document from the plurality of keywords as a first keyword set.
A second keyword extraction module 840 configured to extract keywords for each sentence of the body part of the current document to correspondingly form a second set of keywords for each sentence.
A similarity determination module 850 configured to traverse the sentences in the body part, determine similarities between the first set of keywords and a second set of keywords of the traversed sentences.
A dynamic summary determination module 860 configured to determine a portion of a dynamic summary for a current document based on the traversed sentence in response to the similarity being greater than a similarity threshold.
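Taken together, the modules implement the pipeline of the method described above; a condensed sketch of how they compose, where the injected helper functions and the 0.4 threshold are illustrative assumptions:

```python
def dynamic_summary(search_content, title, body_sentences,
                    extract_keywords, sentence_keywords,
                    similarity, threshold=0.4):
    """End-to-end flow of modules 810-860 (helpers and threshold assumed)."""
    query_keywords = extract_keywords(search_content)           # module 820
    first_set = [k for k in query_keywords if k not in title]   # module 830
    summary = []
    for sentence in body_sentences:                             # module 850
        second_set = sentence_keywords(sentence)                # module 840
        if similarity(first_set, second_set) > threshold:       # module 860
            summary.append(sentence)
    return summary
```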
Fig. 9 illustrates an example system 900 that includes an example computing device 910 that represents one or more systems and/or devices that can implement the various techniques described herein. The computing device 910 may be, for example, a server of a service provider, a device associated with a server, a system on a chip, and/or any other suitable computing device or computing system. The dynamic summary determination apparatus 800 described above with reference to fig. 8 may take the form of a computing device 910. Alternatively, the dynamic summary determination apparatus 800 may be implemented as a computer program in the form of an application 916.
The example computing device 910 as illustrated includes a processing system 911, one or more computer-readable media 912, and one or more I/O interfaces 913 communicatively coupled to each other. Although not shown, the computing device 910 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
The processing system 911 represents functionality to perform one or more operations using hardware. Accordingly, the processing system 911 is illustrated as including hardware elements 914 that may be configured as processors, functional blocks, and the like. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. Hardware element 914 is not limited by the material from which it is formed or the processing mechanisms employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable medium 912 is illustrated as including a memory/storage 915. Memory/storage 915 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 915 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 915 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 912 may be configured in various other ways as further described below.
One or more I/O interfaces 913 represent functionality that allows a user to enter commands and information to computing device 910 using various input devices, and optionally also allows information to be presented to the user and/or other components or devices using various output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), a camera (which may detect motion that does not involve touch as gestures, using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, a haptic response device, and so forth. Thus, the computing device 910 may be configured in various ways to support user interaction, as described further below.
The computing device 910 also includes an application 916. The application 916 may be, for example, a software instance of the dynamic summary determination apparatus 800 and implement the techniques described herein in combination with other elements in the computing device 910.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can include a variety of media that can be accessed by computing device 910. By way of example, and not limitation, computer-readable media may comprise "computer-readable storage media" and "computer-readable signal media".
"computer-readable storage medium" refers to a medium and/or device, and/or a tangible storage apparatus, capable of persistently storing information, as opposed to mere signal transmission, carrier wave, or signal per se. Accordingly, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits or other data. Examples of computer readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or an article of manufacture suitable for storing the desired information and accessible by a computer.
"computer-readable signal medium" refers to a signal-bearing medium configured to transmit instructions to hardware of computing device 910, such as via a network. Signal media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave, data signal or other transport mechanism. Signal media also includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
As previously described, hardware element 914 and computer-readable medium 912 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware that, in some embodiments, may be used to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or systems-on-chips, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or components of other hardware devices. In this context, a hardware element may serve as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device for storing instructions for execution, such as the computer-readable storage medium described previously.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 914. The computing device 910 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Thus, a module implemented as software executable by the computing device 910 may be realized at least partially in hardware, for example, using computer-readable storage media and/or the hardware elements 914 of the processing system. The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 910 and/or processing system 911) to implement the techniques, modules, and examples described herein.
In various implementations, the computing device 910 may assume a variety of different configurations. For example, the computing device 910 may be implemented as a computer-like device including a personal computer, a desktop computer, a multi-screen computer, a laptop computer, a netbook, and so forth. The computing device 910 may also be implemented as a mobile device-like device including mobile devices such as mobile telephones, portable music players, portable gaming devices, tablet computers, multi-screen computers, and the like. The computing device 910 may also be implemented as a television-like device that includes or is connected to a device having a generally larger screen in a casual viewing environment. These devices include televisions, set-top boxes, game consoles, and the like.
The techniques described herein may be supported by these various configurations of the computing device 910 and are not limited to specific examples of the techniques described herein. Functionality may also be implemented in whole or in part on "cloud" 920 through the use of a distributed system, such as through platform 922 as described below.
Cloud 920 includes and/or is representative of a platform 922 for resources 924. The platform 922 abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud 920. The resources 924 may include applications and/or data that may be used when executing computer processes on servers remote from the computing device 910. The resources 924 may also include services provided over the internet and/or over a subscriber network such as a cellular or Wi-Fi network.
The platform 922 may abstract resources and functionality to connect the computing device 910 with other computing devices. The platform 922 may also serve to abstract the scaling of resources so as to provide a level of scale corresponding to the demand encountered for the resources 924 implemented via the platform 922. Thus, in interconnected device embodiments, implementation of the functions described herein may be distributed throughout the system 900. For example, the functionality may be implemented in part on the computing device 910 and in part by the platform 922 that abstracts the functionality of the cloud 920.
A computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The processor of the computing device reads the computer instructions from the computer-readable storage medium and executes them, causing the computing device to perform the dynamic summary determination method provided in the various alternative implementations described above.
It will be appreciated that embodiments of the disclosure have been described with reference to different functional units for clarity. However, it will be apparent that the functionality of each functional unit may be implemented in a single unit, in a plurality of units or as part of other functional units without departing from the disclosure. For example, functionality illustrated to be performed by a single unit may be performed by a plurality of different units. Thus, references to specific functional units are only to be seen as references to suitable units for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single unit or may be physically and functionally distributed between different units and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components or sections, these devices, elements, components or sections should not be limited by these terms. These terms are only used to distinguish one device, element, component or section from another device, element, component or section.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the accompanying claims. Additionally, although individual features may be included in different claims, these may possibly advantageously be combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be performed. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the terms "a" or "an" do not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims (14)

1. A method for dynamic summary determination, comprising:
acquiring a current document searched based on search content, wherein the current document comprises a title part and a text part;
extracting a plurality of keywords in the search content;
screening keywords which are not included in the title part of the current document from the plurality of keywords to obtain a first keyword set;
extracting keywords from each sentence of the body part of the current document to correspondingly form a second keyword set for each sentence;
traversing sentences in the text part, and determining the similarity between the first keyword set and a second keyword set of the traversed sentences;
in response to the similarity being greater than a similarity threshold, a portion of a dynamic summary for a current document is determined based on the traversed sentences.
2. The method of claim 1, wherein traversing the sentence in the body portion, determining a similarity between the first set of keywords and a second set of keywords of the traversed sentence, comprises:
determining a word vector of each keyword in the first keyword set;
determining a first feature vector of the first keyword set based on the word vector of each keyword in the first keyword set;
determining word vectors of all keywords in a second keyword set of the traversed sentences;
determining a second feature vector of the second keyword set based on the word vector of each keyword in the second keyword set;
determining a similarity between the first set of keywords and a second set of keywords of the traversed sentence based on the first and second feature vectors.
3. The method of claim 2, wherein determining a first feature vector for the first set of keywords based on a word vector for each keyword in the first set of keywords comprises:
performing bit-wise accumulation on the word vectors of all the keywords in the first keyword set to obtain a first feature vector of the first keyword set,
and determining a second feature vector of the second keyword set based on the word vector of each keyword in the second keyword set, including:
and performing bit-wise accumulation on the word vectors of the keywords in the second keyword set to obtain a second feature vector of the second keyword set.
4. The method of claim 2, wherein determining a similarity between the first set of keywords and a second set of keywords of the traversed sentence based on the first and second feature vectors comprises:
and determining the similarity between the first keyword set and the second keyword set of the traversed sentence based on the distance between the first feature vector and the second feature vector, wherein the distance comprises one of cosine distance, Euclidean distance and Manhattan distance.
5. The method of claim 1, wherein traversing the sentence in the body portion, determining a similarity between the first set of keywords and a second set of keywords of the traversed sentence, comprises:
traversing the sentences in the body portion, and determining a similarity between the first set of keywords and a second set of keywords of the traversed sentences when a current number of words of the dynamic summary is less than a word number threshold.
6. The method of claim 1, wherein determining a portion of a dynamic summary for a current document based on the traversed sentences in response to the similarity being greater than a similarity threshold comprises:
in response to the similarity being greater than a similarity threshold and the sum of the word count of the traversed sentence and the current word count of the dynamic summary being greater than a word count threshold, determining a portion of the traversed sentence as a portion of the dynamic summary for the current document such that the sum of the word count of the portion of the traversed sentence and the current word count of the dynamic summary equals the word count threshold.
7. The method of claim 1, wherein extracting a plurality of keywords in the search content comprises:
segmenting the search content to obtain a first segmentation set comprising a plurality of words;
removing stop words from a plurality of words in the first word segmentation set to obtain a second word segmentation set;
determining a word weight of each word in the second word segmentation set;
and removing words with word weights smaller than a word weight threshold value from the second word segmentation set so as to obtain a plurality of keywords in the search content.
8. The method of claim 7, wherein determining a word weight for each word in the second word segmentation set comprises:
determining an inverse document frequency value of each word in the second word segmentation set;
and determining the inverse document frequency value of each word in the second word segmentation set as the word weight of the word.
9. The method of claim 7, wherein determining a word weight for each word in the second word segmentation set comprises:
determining an inverse document frequency value of each word in the second word segmentation set;
determining the word weight of each word in the second word segmentation set based on at least one of the word's part of speech, its word position in the search content, its historical number of searches, and its historical click rate, together with its inverse document frequency value.
10. The method of claim 8 or 9, wherein determining an inverse document frequency value for each word in the second word segmentation set comprises:
acquiring a query log, wherein the query log comprises D pieces of search content;
determining, for each corresponding word in the second word segmentation set, the number d of pieces of search content in the query log that contain the corresponding word;
and determining the quotient of the total number D of pieces of search content contained in the query log and the number d of pieces of search content containing the corresponding word, and taking the logarithm of the quotient to obtain the inverse document frequency value of the corresponding word.
11. The method of claim 2, wherein determining a word vector for each keyword in the first set of keywords comprises: determining a word vector for each keyword in the first set of keywords based on the trained word embedding model, and wherein the trained word embedding model is trained by:
acquiring a query log, and segmenting the search contents in the query log to obtain a plurality of participles;
training the word embedding model by taking each respective participle of the plurality of participles as the input of the word embedding model and a context participle of the respective participle as the output of the word embedding model, or by taking each respective participle of the plurality of participles as the output of the word embedding model and a context participle of the respective participle as the input of the word embedding model, to obtain the trained word embedding model.
12. A dynamic summary determination apparatus, comprising:
a current document acquisition module configured to acquire a current document searched based on search content, the current document including a title part and a body part;
a first keyword extraction module configured to extract a plurality of keywords in the search content;
a first keyword set determination module configured to filter keywords, which are not included in a title portion of the current document, from the plurality of keywords as a first keyword set;
a second keyword extraction module configured to extract keywords for each sentence of the body part of the current document to correspondingly form a second keyword set for each sentence;
a similarity determination module configured to traverse the sentences in the body part, determine a similarity between the first set of keywords and a second set of keywords of the traversed sentences;
a dynamic summary determination module configured to determine a portion of a dynamic summary for a current document based on the traversed sentences in response to the similarity being greater than a similarity threshold.
13. A computing device, comprising:
a memory configured to store computer-executable instructions;
a processor configured to perform the method of any one of claims 1-11 when the computer-executable instructions are executed by the processor.
14. A computer-readable storage medium storing computer-executable instructions that, when executed, perform the method of any one of claims 1-11.
CN202110577211.1A 2021-05-26 2021-05-26 Dynamic summary determination method and device, computing equipment and computer storage medium Pending CN113761125A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110577211.1A CN113761125A (en) 2021-05-26 2021-05-26 Dynamic summary determination method and device, computing equipment and computer storage medium


Publications (1)

Publication Number Publication Date
CN113761125A true CN113761125A (en) 2021-12-07

Family

ID=78787218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110577211.1A Pending CN113761125A (en) 2021-05-26 2021-05-26 Dynamic summary determination method and device, computing equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113761125A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115688771A (en) * 2023-01-05 2023-02-03 京华信息科技股份有限公司 Document content comparison performance improving method and system
CN115688771B (en) * 2023-01-05 2023-03-21 京华信息科技股份有限公司 Document content comparison performance improving method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination