CN112925967A - Method, device and equipment for generating expanded query words and storage medium - Google Patents


Info

Publication number
CN112925967A
Authority
CN
China
Prior art keywords
words
word
expanded query
category
expanded
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110168746.3A
Other languages
Chinese (zh)
Inventor
李新亮
刘建华
刘冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dingcheng Shitong Technology Co ltd
Original Assignee
Beijing Dingcheng Shitong Technology Co ltd
Application filed by Beijing Dingcheng Shitong Technology Co ltd
Priority to CN202110168746.3A
Publication of CN112925967A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a device, equipment and a storage medium for generating expanded query words. The method comprises: acquiring real-time search engine data according to a query word input by a user; preprocessing the search engine data to obtain a word frequency matrix; and classifying the words in the word frequency matrix, sorting the words in each category by their TF-IDF values, and selecting a preset number of words in each category as the generated classified expanded query words. By acquiring real-time search engine data, the method can expand query words in real time; by classifying the expanded words, it can better meet user requirements and improve search effectiveness; and the trend of the generated expanded query words can be calculated and used as input data for subsequent applications.

Description

Method, device and equipment for generating expanded query words and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating an expanded query term.
Background
The rapid development of the internet has brought explosive, exponential growth of internet data; in the new era of the digital surge, about 370 EB of data is generated every day. With this exponential growth, the problem is how to retrieve information accurately and efficiently from massive internet data and to intelligently recommend relevant information to users.
In the current information retrieval field, academia focuses on exploring various linguistic methods to improve retrieval effectiveness, while industry focuses on making full use of historical data and recommending relevant information to users by statistical methods. However, a model based on linguistic analysis needs a relatively complete training data set and an additionally maintained dictionary, and its transferability is relatively weak; a statistics-based model relies on the historical behavior data of search users; and none of the prior-art methods can predict the trend of the associated expanded words.
Disclosure of Invention
The embodiment of the disclosure provides a method, a device, equipment and a storage medium for generating an expanded query term. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, an embodiment of the present disclosure provides a method for generating an expanded query term, including:
acquiring real-time search engine data according to a query word input by a user;
preprocessing search engine data to obtain a word frequency matrix;
classifying the words in the word frequency matrix, sorting the words in each category by their TF-IDF values, and selecting a preset number of words in each category as the generated classified expanded query words.
In an optional embodiment, obtaining real-time search engine data according to a query word input by a user includes:
acquiring a query word input by a user;
acquiring search engine data in real time by adopting a crawler technology;
and parsing and combining the data from a plurality of search engines to form text data to be processed.
In an optional embodiment, the preprocessing the search engine data to obtain a word frequency matrix includes:
performing word segmentation on text data to be processed, and labeling the part of speech of each word;
and generating a word frequency matrix according to the occurrence frequency of each word in each article.
In an optional embodiment, classifying the words in the word frequency matrix includes:
and dividing the words in the word frequency matrix into four categories, namely a person name category, a verb category, an adjective category and an organization name category, according to the labeled part of speech.
In an optional embodiment, after selecting a preset number of terms in each category as the generated classification expansion query term, the method further includes:
and determining the trend of the expanded query words according to the trend threshold of the expanded query words and the categories of the expanded query words.
In an alternative embodiment, determining a trend of the expanded query term based on a trend threshold of the expanded query term and its category includes:
when the TF-IDF value of the expanded query word is greater than or equal to the trend threshold value of the category of the expanded query word, determining the expanded query word to be in an ascending trend;
and when the TF-IDF value of the expanded query word is smaller than the trend threshold value of the category of the expanded query word, determining the expanded query word to be in a descending trend.
In an optional embodiment, after determining the trend of the expanded query term according to the trend threshold of the expanded query term and the category thereof, the method further includes:
and determining the planned topics of the news department according to the expanded query words in the rising trend.
In a second aspect, an embodiment of the present disclosure provides an apparatus for generating an expanded query term, including:
the acquisition module is used for acquiring real-time search engine data according to the query words input by the user;
the preprocessing module is used for preprocessing the search engine data to obtain a word frequency matrix;
and the expanded query word generation module is used for classifying the words in the word frequency matrix, sorting the words in each category by their TF-IDF values, and selecting a preset number of words in each category as the generated classified expanded query words.
In a third aspect, the disclosed embodiment provides an expanded query term generation device, including a processor and a memory storing program instructions, where the processor is configured to execute the expanded query term generation method provided in the foregoing embodiment when executing the program instructions.
In a fourth aspect, the present disclosure provides a computer-readable medium, on which computer-readable instructions are stored, where the computer-readable instructions are executable by a processor to implement an extended query term generation method provided in the foregoing embodiments.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the method for generating the expanded query word, provided by the embodiment of the disclosure, the timeliness of the expanded query word is ensured by acquiring the data of the search engine in real time, the problem that the query word cannot be expanded in real time is solved, and the problem that the statistical method depends on the historical behavior data of the search user is also solved. By classifying the expanded query words according to the parts of speech, more query words meeting the requirements of the user can be provided for the user, and the searching effectiveness is improved; the trend of the generated expanded query words can be calculated, and the trend result is used as input data of subsequent application, and the method can be applied to the fields of real-time portrait of people, intelligent recommendation of search engines, association of similar words and the like.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flowchart illustrating a method of generating expanded query terms in accordance with an exemplary embodiment;
FIG. 2 is a flowchart illustrating a method of generating expanded query terms in accordance with an exemplary embodiment;
FIG. 3 is a diagram illustrating a method for obtaining search engine data in real-time according to an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a data pre-processing according to an exemplary embodiment;
FIG. 5 is a diagram illustrating a word frequency matrix in accordance with an exemplary embodiment;
FIG. 6 is a diagram illustrating one type of generating expanded query terms and trends in accordance with an illustrative embodiment;
FIG. 7 is a block diagram illustrating an apparatus for generating expanded query terms in accordance with an illustrative embodiment;
FIG. 8 is a block diagram illustrating an apparatus for generating expanded query terms in accordance with an illustrative embodiment;
FIG. 9 is a schematic diagram illustrating a computer storage medium in accordance with an exemplary embodiment.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of systems and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The method for generating the expanded query term according to the embodiment of the present application will be described in detail below with reference to fig. 1 to 6. Fig. 1 is a flowchart illustrating a method for generating an expanded query term according to an exemplary embodiment, and referring to fig. 1, the method specifically includes the following steps.
S101, obtaining real-time search engine data according to the query words input by the user.
In one possible implementation, the user inputs the query word through a search platform, for example a network platform such as Baidu, 360 or Sogou, and the search result data is obtained. Preferably, the input query word is a noun or a verb.
Further, the search engine interfaces are called and search engine data is acquired in real time through a crawler technology. The obtained result data is parsed: the titles and abstracts in the result list are extracted, html tags are removed, advertisements are removed, and so on. The data from the plurality of search engines is then combined to form the text data to be processed.
Fig. 3 is a schematic diagram illustrating a method for acquiring search engine data in real time according to an exemplary embodiment, where as shown in fig. 3, a query word is first input, then a plurality of search engine interfaces are called in real time to obtain a plurality of result data sets, the obtained plurality of result data sets are parsed, for example, titles and abstracts in a list are extracted, html tags are removed, advertisements are removed, and finally, a plurality of search engine data are combined to form text data to be processed.
By acquiring the data of the search engine in real time, the problem that the query words cannot be expanded in real time is solved, and the problem that the statistical method depends on the historical behavior data of the search user is also solved.
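By way of illustration only, the parsing and merging part of step S101 might be sketched in Python as follows. The crawler itself is omitted; the result records and their keys `title`, `abstract` and `is_ad` are hypothetical and are not specified by the embodiment.

```python
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment: str) -> str:
    """Remove html tags and collapse runs of whitespace."""
    stripper = _TagStripper()
    stripper.feed(fragment)
    stripper.close()
    return " ".join("".join(stripper.parts).split())


def merge_results(result_sets):
    """Combine the result lists of several search engines into text documents.

    Each result is a dict with the hypothetical keys 'title', 'abstract'
    and 'is_ad'; one document is produced per non-advertisement result.
    """
    documents = []
    for results in result_sets:
        for item in results:
            if item.get("is_ad"):
                continue  # advertisements are removed
            text = strip_html(item["title"] + " " + item["abstract"])
            if text:
                documents.append(text)
    return documents
```

Each returned string plays the role of one "article" in the later word frequency matrix; that mapping is an assumption of this sketch.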
S102, preprocessing the search engine data to obtain a word frequency matrix.
Fig. 4 is a schematic diagram of data preprocessing according to an exemplary embodiment. As shown in fig. 4, preprocessing starts from the text data to be processed obtained in step S101, then performs word segmentation, part-of-speech tagging, stop word removal and the like on that text data, and finally generates a word frequency matrix from the segmentation and tagging result.
Specifically, the text data to be processed is first segmented into words, and each word is then labeled with its part of speech: a word that is a person's name is labeled as a person name, a verb is labeled as a verb, an adjective or adverb is labeled as an adjective or adverb, and an organization name is labeled as an organization name. Stop words and other meaningless words in the text data set are deleted.
Furthermore, according to the word segmentation labeling result, word frequency is counted, a word frequency matrix is generated, and the number in the word frequency matrix represents the number of times that the corresponding word appears in the corresponding article.
Fig. 5 is a schematic diagram of a word frequency matrix according to an exemplary embodiment. As shown in fig. 5, two articles are assumed: the number 0 in the first row and first column indicates that the word "any president" appears 0 times in article 1, and the number 1 in the second row and first column indicates that the word "any president" appears once in article 2. The number of times each word appears in each article can thus be read from the numbers in the word frequency matrix.
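As a minimal sketch of step S102, assuming the text has already been segmented and tagged by some external tool, a word frequency matrix with one row per article and one column per word could be built as follows. The stop-word list, the tag strings and the function name are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a"}  # hypothetical stop-word list


def build_word_frequency_matrix(tagged_articles, stop_words=STOP_WORDS):
    """Build a word frequency matrix from segmented, tagged articles.

    tagged_articles: list of articles, each a list of (word, pos) pairs.
    Returns (vocabulary, pos_of, matrix), where matrix[i][j] is the number
    of times vocabulary[j] appears in article i.
    """
    counts = [
        Counter(word for word, _ in article if word not in stop_words)
        for article in tagged_articles
    ]
    pos_of = {}
    for article in tagged_articles:
        for word, pos in article:
            if word not in stop_words:
                pos_of.setdefault(word, pos)  # keep the first tag seen
    vocabulary = sorted(pos_of)
    matrix = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, pos_of, matrix
```

With two articles this yields a two-row matrix of the kind shown in fig. 5, in which a 0 means the word does not occur in that article.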
S103, classifying the words in the word frequency matrix, sorting the words in each category by their TF-IDF values, and selecting a preset number of words in each category as the generated classified expanded query words.
In one possible implementation, the TF-IDF value of each word in the word frequency matrix is calculated. TF-IDF (term frequency-inverse document frequency) is a statistical method: if a word or phrase occurs with a high frequency (TF) in one article and rarely occurs in other articles, the word or phrase is considered to have good category-distinguishing capability and to be suitable for classification. TF represents the frequency with which a term occurs in a document d. The main idea of IDF is that the fewer documents contain a term t, the larger its IDF, and the better the term t distinguishes categories. A high term frequency within a particular document, combined with a low document frequency of that term across the document collection, yields a high TF-IDF weight; TF-IDF therefore tends to filter out common words and retain important words.
In the disclosed embodiment, the higher the TF-IDF value of a term, the higher the association between the term and the input query term.
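The embodiment does not fix a particular TF-IDF formula. The sketch below assumes one common variant, TF = count / article length and IDF = ln(N / document frequency), and, as a further assumption, reduces the per-article values to a single score per word by taking the maximum over articles.

```python
import math


def tf_idf_scores(matrix):
    """Return one TF-IDF score per word (column) of a word frequency matrix.

    matrix[i][j] is the count of word j in article i. The TF and IDF
    definitions and the max-over-articles reduction are assumptions.
    """
    n_docs = len(matrix)
    n_words = len(matrix[0]) if matrix else 0
    doc_lengths = [sum(row) for row in matrix]
    scores = []
    for j in range(n_words):
        doc_freq = sum(1 for row in matrix if row[j] > 0)
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
        best = 0.0
        for row, length in zip(matrix, doc_lengths):
            if length:
                best = max(best, row[j] / length * idf)
        scores.append(best)
    return scores
```

Under this variant a word that appears in every article receives a score of 0, which matches the stated tendency of TF-IDF to filter out common words.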
Further, the words in the word frequency matrix are divided into four categories, namely person names, verbs, adjectives and organization names, according to the labeled parts of speech. The words in each category are then sorted by TF-IDF value in descending order, and a preset number of words in each category are selected as the generated classified expanded query words.
In a possible implementation, the 10 words with the highest TF-IDF values in each category are selected as the expanded query words of that category, giving a person name expanded query word set, a verb expanded query word set, an adjective/adverb expanded query word set and an organization name expanded query word set. The number of expanded query words in each set may be the same or different, and can be set by a person skilled in the art as needed.
By classifying the expanded query words according to the parts of speech, the requirements of users can be better met, and the searching effectiveness is improved.
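The classification and selection just described can be sketched as follows. The part-of-speech tag strings in `CATEGORY_OF_TAG` are hypothetical and would depend on the tagger actually used; `top_k=10` follows the exemplary value given above.

```python
# Hypothetical mapping from part-of-speech tags to the four categories.
CATEGORY_OF_TAG = {
    "nr": "person_name",
    "v": "verb",
    "a": "adjective_adverb",
    "d": "adjective_adverb",
    "nt": "organization_name",
}


def top_words_per_category(scores, pos_of, top_k=10):
    """Group words by category and keep the top_k by TF-IDF in each.

    scores: dict word -> TF-IDF value; pos_of: dict word -> tag.
    Words whose tag is not in CATEGORY_OF_TAG are discarded.
    """
    grouped = {}
    for word, score in scores.items():
        category = CATEGORY_OF_TAG.get(pos_of.get(word))
        if category is None:
            continue
        grouped.setdefault(category, []).append((word, score))
    return {
        category: sorted(items, key=lambda pair: pair[1], reverse=True)[:top_k]
        for category, items in grouped.items()
    }
```

A different `top_k` per category could be passed in by calling the function once per category, since the number of words per set need not be the same.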
In an optional embodiment, after selecting a preset number of terms in each category as the generated classified expanded query term, determining a trend of the expanded query term according to a trend threshold of the expanded query term and its category.
Specifically, when the TF-IDF value of the expanded query word is greater than or equal to the trend threshold of the category of the expanded query word, the expanded query word is determined to be in an ascending trend, and when the TF-IDF value of the expanded query word is smaller than the trend threshold of the category of the expanded query word, the expanded query word is determined to be in a descending trend.
The trend threshold for each category is obtained by analyzing historical data. In an exemplary scenario, the person name trend threshold is 0.35, the verb trend threshold is 0.43, the adjective/adverb trend threshold is 0.28, and the organization name trend threshold is 0.24. If the TF-IDF value of an expanded query word is smaller than the trend threshold of its category, the expanded query word is determined to be in a descending trend, that is, it is expected to be in a descending trend over a subsequent period of time.
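The threshold comparison can be illustrated with a short sketch using the exemplary threshold values from this scenario; the category keys and the function name are hypothetical.

```python
# Exemplary per-category trend thresholds from the scenario above.
TREND_THRESHOLDS = {
    "person_name": 0.35,
    "verb": 0.43,
    "adjective_adverb": 0.28,
    "organization_name": 0.24,
}


def trend_of(score, category, thresholds=TREND_THRESHOLDS):
    """Return 'ascending' if score >= the category threshold, else 'descending'."""
    return "ascending" if score >= thresholds[category] else "descending"
```

For example, a verb with a TF-IDF value of 0.42 falls below the verb threshold of 0.43 and is labeled descending, while a person name at exactly 0.35 is labeled ascending.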
Furthermore, after the various expanded query words and their trends are obtained, they can be used as input data of various subsequent applications. In one exemplary scenario, when a news department plans topics, expanded query words that are highly related and in a rising trend may be selected from the expanded query word list as the basis for the topics. In another exemplary scenario, when a knowledge graph is supplemented and refined, expanded query words that are highly related and in a rising trend may be selected from the expanded query word list as node data of the knowledge graph. In a further exemplary scenario, when a real-time portrait of a person is generated, expanded query words that are highly related and in a rising trend may also be selected as input data.
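A downstream consumer of this kind might filter the list as sketched below. The record layout (keys `word`, `score`, `trend`) and the `min_score` cut-off standing in for "highly related" are assumptions of the sketch, not part of the embodiment.

```python
def select_rising_terms(expanded, min_score=0.0):
    """Return words that are ascending and score at least min_score.

    expanded: list of dicts with the hypothetical keys 'word', 'score'
    and 'trend'. The result is ordered by descending score.
    """
    rising = [
        entry for entry in expanded
        if entry["trend"] == "ascending" and entry["score"] >= min_score
    ]
    rising.sort(key=lambda entry: entry["score"], reverse=True)
    return [entry["word"] for entry in rising]
```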
By calculating the trend of various expanded query words, the accuracy of subsequent applications such as intelligent recommendation, related word association, planning and selecting questions and the like can be greatly improved.
Fig. 6 is a schematic diagram of generating expanded query words and trends according to an exemplary embodiment. As shown in fig. 6, the generated word frequency matrix is first obtained, and the TF-IDF value of each word in it is calculated. The words are divided by part of speech into four categories, namely a person name category, a verb category, an adjective category and an organization name category; the words in each category are then sorted by TF-IDF value in descending order, and a preset number of words are selected from each category as the generated classified expanded query words.
Further, the TF-IDF value of each expanded query word is compared with the trend threshold of its category: when the TF-IDF value is greater than or equal to the threshold, the expanded query word is determined to be in an ascending trend; when it is smaller, the expanded query word is determined to be in a descending trend. Finally, the character expanded query words and trends, the attribute expanded query words and trends, the state expanded query words and trends, and the organization expanded query words and trends are obtained. The relevance threshold in fig. 6 refers to the trend threshold and may be obtained by analyzing historical data.
In order to facilitate understanding of the method for generating the expanded query term provided in the embodiment of the present application, the following description is made with reference to fig. 2. As shown in fig. 2, the method includes:
First, real-time search engine data is acquired according to a query word input by a user, and the acquired data is preprocessed by operations such as word segmentation, part-of-speech labeling and word frequency matrix generation.
Further, the TF-IDF value of each word in the word frequency matrix is calculated, and the words are divided into four categories, namely a person name category, a verb category, an adjective category and an organization name category, according to the labeled part of speech. The words in each category are then sorted by TF-IDF value in descending order, and a preset number of words in each category are selected as the generated classified expanded query words.
Further, the trend of the expanded query word is judged according to the trend threshold of the expanded query word and the category of the expanded query word, when the TF-IDF value of the expanded query word is larger than or equal to the trend threshold of the category of the expanded query word, the expanded query word is determined to be in an ascending trend, and when the TF-IDF value of the expanded query word is smaller than the trend threshold of the category of the expanded query word, the expanded query word is determined to be in a descending trend.
And finally, the generated expanded query words and the trends thereof are used as input data of various subsequent applications, for example, a portrait is generated according to the generated expanded query words, intelligent recommendation is performed, a knowledge graph is perfected, and the like.
According to the method for generating the expanded query words, the query words can be expanded in real time by acquiring real-time search engine data, the expanded words can be classified, user requirements can be better met, search effectiveness is improved, the trend of the generated expanded query words can be calculated, and the trend result is used as input data of subsequent application.
The embodiment of the present disclosure further provides a device for generating an expanded query term, where the device is configured to execute the method for generating an expanded query term in the foregoing embodiment, and as shown in fig. 7, the device includes:
an obtaining module 701, configured to obtain real-time search engine data according to a query word input by a user;
a preprocessing module 702, configured to preprocess search engine data to obtain a word frequency matrix;
the expanded query term generating module 703 is configured to classify the terms in the word frequency matrix, sort the terms in each category by their TF-IDF values, and select a preset number of terms in each category as the generated classified expanded query terms.
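As a purely illustrative sketch of this three-module division, the device can be modeled as a container that chains three interchangeable callables. The class name, field names and type signatures are hypothetical; any concrete acquisition, preprocessing or generation routine could be plugged in.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ExpandedQueryWordDevice:
    """Chains the three modules of fig. 7 (names here are hypothetical)."""

    acquire: Callable[[str], List[str]]               # obtaining module 701
    preprocess: Callable[[List[str]], Any]            # preprocessing module 702
    generate: Callable[[Any], Dict[str, List[str]]]   # generating module 703

    def run(self, query_word: str) -> Dict[str, List[str]]:
        documents = self.acquire(query_word)      # real-time search data
        matrix = self.preprocess(documents)       # word frequency matrix
        return self.generate(matrix)              # classified expanded words
```

Passing the modules in as callables reflects the note that the division of functions is only an example and may be reassigned as needed.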
It should be noted that, when the device for generating expanded query words provided in the foregoing embodiment executes the method for generating expanded query words, the division into the above function modules is merely an example. In practical applications, the functions may be allocated to different function modules as needed, that is, the internal structure of the device may be divided into different function modules to complete all or part of the functions described above. In addition, the device and the method for generating expanded query words provided by the above embodiments belong to the same concept; the implementation process is detailed in the method embodiment and is not described again here.
The embodiment of the present disclosure further provides an electronic device corresponding to the method for generating an expanded query term provided in the foregoing embodiment, so as to execute the method for generating an expanded query term.
Referring to fig. 8, a schematic diagram of an electronic device provided in some embodiments of the present application is shown. As shown in fig. 8, the electronic apparatus includes: a processor 800, a memory 801, a bus 802 and a communication interface 803, the processor 800, the communication interface 803 and the memory 801 being connected by the bus 802; the memory 801 stores a computer program that can be executed on the processor 800, and the processor 800 executes the method for generating the expanded query term provided in any of the foregoing embodiments of the present application when executing the computer program.
The memory 801 may include a high-speed random access memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 803 (which may be wired or wireless), and the internet, a wide area network, a local area network, a metropolitan area network, etc. may be used.
Bus 802 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 801 is used for storing a program, and the processor 800 executes the program after receiving an execution instruction, and the method for generating an expanded query term disclosed in any embodiment of the present application may be applied to the processor 800, or implemented by the processor 800.
The processor 800 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 800. The processor 800 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 801, and the processor 800 reads the information in the memory 801 and completes the steps of the method in combination with its hardware.
The electronic device provided by the embodiment of the present application is based on the same inventive concept as the method for generating an expanded query term provided by the embodiment of the present application, and has the same beneficial effects as the method that it adopts, runs, or implements.
Referring to Fig. 9, the computer-readable storage medium is shown as an optical disc 900 on which a computer program (i.e., a program product) is stored. When executed by a processor, the computer program performs the method for generating an expanded query term provided in any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above embodiment of the present application is based on the same inventive concept as the method for generating an expanded query term provided by the embodiment of the present application, and has the same beneficial effects as the method adopted, run, or implemented by the application program stored on it.
The technical features of the above embodiments may be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The above examples show only some embodiments of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A method for generating an expanded query term, comprising:
acquiring real-time search engine data according to a query word input by a user;
preprocessing the search engine data to obtain a word frequency matrix;
and classifying the words in the word frequency matrix, sorting the words in each category according to the TF-IDF values of the words, and selecting a preset number of words in each category as the generated classified expanded query words.
2. The method of claim 1, wherein obtaining real-time search engine data based on a query term input by a user comprises:
acquiring a query word input by a user;
acquiring search engine data in real time by adopting a crawler technology;
and parsing and merging the plural pieces of search engine data to form text data to be processed.
3. The method of claim 2, wherein preprocessing the search engine data to obtain a word frequency matrix comprises:
performing word segmentation on the text data to be processed, and labeling the part of speech of each word;
and generating a word frequency matrix according to the occurrence frequency of each word in each article.
4. The method of claim 3, wherein classifying the words in the word frequency matrix comprises:
and dividing the words in the word frequency matrix into four categories, namely a noun category, a verb category, an adjective category and an organization name category, according to the labeled parts of speech.
5. The method of claim 1, wherein after selecting a preset number of words in each category as the generated classified expanded query words, the method further comprises:
and determining the trend of the expanded query term according to the trend threshold of the expanded query term and the category of the expanded query term.
6. The method of claim 5, wherein determining the trend of the expanded query term based on a trend threshold of the expanded query term and its category comprises:
when the TF-IDF value of the expanded query word is larger than or equal to the trend threshold value of the category of the expanded query word, determining that the expanded query word is in an ascending trend;
and when the TF-IDF value of the expanded query word is smaller than the trend threshold value of the category of the expanded query word, determining that the expanded query word is in a descending trend.
7. The method of claim 5, wherein after determining the trend of the expanded query term based on the trend threshold of the expanded query term and its category, the method further comprises:
and determining the planned topics of the news department according to the expanded query words in the rising trend.
8. An apparatus for generating an expanded query term, comprising:
the acquisition module is used for acquiring real-time search engine data according to the query words input by the user;
the preprocessing module is used for preprocessing the search engine data to obtain a word frequency matrix;
and the expanded query word generation module is used for classifying the words in the word frequency matrix, sorting the words in each category according to the TF-IDF values of the words, and selecting a preset number of words in each category as the generated classified expanded query words.
9. An expanded query term generation apparatus comprising a processor and a memory storing program instructions, the processor being configured to execute the expanded query term generation method according to any one of claims 1 to 7 when executing the program instructions.
10. A computer-readable medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a processor to implement the expanded query term generation method as claimed in any one of claims 1 to 7.
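
The processing chain recited in claims 1 to 6 (word frequency matrix, TF-IDF ranking within each part-of-speech category, selection of a preset number of words, and threshold-based trend labeling) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the applicant's implementation: the function names, the category labels, the input format (articles already segmented and tagged upstream) and the particular TF-IDF formula (relative in-article frequency with a smoothed inverse document frequency, summed over articles) are all hypothetical choices made for the example. Crawling and word segmentation are left out.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labels for the four categories named in claim 4.
CATEGORIES = ("noun", "verb", "adjective", "organization")


def word_frequency_matrix(articles):
    """Return one Counter per article mapping (word, category) -> count.

    `articles` is a list of articles, each a list of (word, category)
    pairs assumed to come from an upstream segmenter and POS tagger.
    """
    return [Counter(article) for article in articles]


def tfidf_scores(matrix):
    """Aggregate one TF-IDF score per (word, category) over all articles.

    Assumption: tf is the relative frequency within an article and idf is
    the smoothed form log((1 + N) / (1 + df)) + 1; the claims do not fix
    a particular formula.
    """
    n_articles = len(matrix)
    doc_freq = Counter()
    for counts in matrix:
        doc_freq.update(counts.keys())  # each article counts once per word
    scores = defaultdict(float)
    for counts in matrix:
        total = sum(counts.values())
        for key, count in counts.items():
            idf = math.log((1 + n_articles) / (1 + doc_freq[key])) + 1.0
            scores[key] += (count / total) * idf
    return dict(scores)


def expanded_query_words(scores, top_n):
    """Rank words inside each category and keep the top_n of each."""
    by_category = defaultdict(list)
    for (word, category), score in scores.items():
        if category in CATEGORIES:
            by_category[category].append((word, score))
    result = {}
    for category, items in by_category.items():
        # Highest score first; ties broken alphabetically for determinism.
        items.sort(key=lambda item: (-item[1], item[0]))
        result[category] = items[:top_n]
    return result


def trend(score, threshold):
    """Label a word as rising when its score reaches the category threshold."""
    return "rising" if score >= threshold else "falling"


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        [("vaccine", "noun"), ("vaccine", "noun"), ("approve", "verb")],
        [("vaccine", "noun"), ("trial", "noun"), ("safe", "adjective")],
    ]
    ranked = expanded_query_words(tfidf_scores(word_frequency_matrix(sample)), 2)
    for category, items in ranked.items():
        print(category, [(w, trend(s, 0.5)) for w, s in items])
```

In this sketch a single threshold is passed per call; a per-category threshold table, as the wording of claim 6 suggests, would simply map each category label to its own value before calling `trend`.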

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110168746.3A CN112925967A (en) 2021-02-07 2021-02-07 Method, device and equipment for generating expanded query words and storage medium

Publications (1)

Publication Number Publication Date
CN112925967A (en) 2021-06-08

Family

ID=76171131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110168746.3A Pending CN112925967A (en) 2021-02-07 2021-02-07 Method, device and equipment for generating expanded query words and storage medium

Country Status (1)

Country Link
CN (1) CN112925967A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295319A (en) * 2008-06-24 2008-10-29 北京搜狗科技发展有限公司 Method and device for expanding query, search engine system
CN102902806A (en) * 2012-10-17 2013-01-30 深圳市宜搜科技发展有限公司 Method and system for performing inquiry expansion by using search engine
CN106294868A (en) * 2016-08-23 2017-01-04 达而观信息科技(上海)有限公司 A kind of personalized recommendation method based on search engine and system
CN107291914A (en) * 2017-06-27 2017-10-24 达而观信息科技(上海)有限公司 A kind of method and system for generating search engine inquiry expansion word

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination