CN109740161B - Data generalization method, device, equipment and medium - Google Patents


Publication number
CN109740161B
Authority
CN
China
Prior art keywords
search term
target
search
terms
term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910015940.0A
Other languages
Chinese (zh)
Other versions
CN109740161A (en
Inventor
周环宇
冯欣伟
余淼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910015940.0A
Publication of CN109740161A
Application granted
Publication of CN109740161B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the invention discloses a data generalization method, device, equipment and medium, and relates to the technical field of data processing. The method comprises the following steps: grouping a search term set, which comprises a target search term to be generalized and historical search terms, according to the words in each search term; and determining the generalization search term of the target search term from the historical search terms according to the grouping result. The data generalization method, device, equipment and medium provided by the embodiment of the invention achieve reasonable and broad generalization of the search term to be generalized.

Description

Data generalization method, device, equipment and medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a data generalization method, a device, equipment and a medium.
Background
A search term (query) expressing a given meaning often has more than one form of expression; mining as many of these expressions as possible is what is meant by generalizing the query.
Currently, query generalization relies mainly on keyword replacement based on synonyms.
However, although keyword replacement can produce some generalizations, these are not comprehensive. The problems that keyword replacement can solve are limited, and people can always come up with unexpected ways of asking a question.
In addition, keyword replacement may produce errors for specific subjects. For example, "who" and "which person" are synonymous in most cases. However, for the query "who is the 2018 World Cup champion", generalizing it to "which person is the 2018 World Cup champion" is clearly unsuitable.
Disclosure of Invention
The embodiment of the invention provides a data generalization method, a device, equipment and a medium, which are used for reasonably and widely generalizing a search term to be generalized.
In a first aspect, an embodiment of the present invention provides a data generalization method, where the method includes:
grouping search term sets comprising target search terms to be generalized and history search terms according to terms in each search term;
and determining the generalized search term of the target search term from the historical search terms according to the grouping result.
In a second aspect, an embodiment of the present invention further provides a data generalization apparatus, where the apparatus includes:
the grouping module is used for grouping search term sets comprising target search terms to be generalized and historical search terms according to words in each search term;
and the generalization module is used for determining the generalization search term of the target search term from the historical search terms according to the grouping result.
In a third aspect, an embodiment of the present invention further provides an apparatus, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a data generalization method as described in any of the embodiments of the present invention.
In a fourth aspect, embodiments of the present invention further provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data generalization method according to any of the embodiments of the present invention.
According to the embodiment of the invention, the search term set comprising the target search term to be generalized and the historical search terms is grouped according to the words in each search term, so that the target search term is generalized. The generalization search term of the target search term is determined from the historical search terms according to the grouping result, so that the generalization search term conforms to the way users actually phrase questions, and unreasonable generalization search terms caused by direct replacement are avoided.
Drawings
FIG. 1 is a flowchart of a data generalization method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a data generalization method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a data generalization method according to a third embodiment of the present invention;
FIG. 4 is a flowchart of a data generalization method according to a fourth embodiment of the present invention;
FIG. 5 is another schematic flowchart of a data generalization method according to a fourth embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a data generalization apparatus according to a fifth embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an apparatus according to a sixth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
Fig. 1 is a flowchart of a data generalization method according to a first embodiment of the present invention. The present embodiment is applicable to the case of generalizing a search term. The method may be performed by a data generalization apparatus, and the apparatus may be implemented in software and/or hardware. Referring to fig. 1, the data generalization method provided by the embodiment of the present invention includes:
S110, grouping a search term set comprising the target search term to be generalized and historical search terms according to the words in each search term.
The historical search terms include search terms historically input by users.
The inventors found, in the course of implementing the present invention, that over a one-month span, queries that are searched more than once per day on average account for no more than 1% of all queries. Queries with an average of no more than one search per day are typically ignored, yet a large number of different expressions of queries are in fact found among these ignored queries. Thus, the historical search terms may include search terms with a high search frequency as well as search terms with a low search frequency.
Typically, the historical search terms include a large collection of search terms input in the past. The volume of historical search terms may be on the order of billions.
Specifically, grouping the search term set comprising the target search term to be generalized and the historical search terms according to the words in each search term includes:
matching words in each search term;
if at least two search terms comprise at least one same word, dividing the at least two search terms into a group.
Optionally, if at least two search terms include at least one synonym, the at least two search terms are divided into a group.
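The matching-and-grouping step above can be illustrated with a minimal Python sketch. It assumes the search terms have already been segmented into words; the function name `group_by_shared_words` and the sample data are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def group_by_shared_words(search_terms):
    """Group search terms that contain at least one identical word.

    `search_terms` maps each search term to its list of words
    (word segmentation is assumed to have been done elsewhere).
    Returns a mapping from a shared word to the set of search terms containing it.
    """
    groups = defaultdict(set)
    for term, words in search_terms.items():
        for word in words:
            groups[word].add(term)
    # A group only exists when at least two search terms share the word.
    return {word: terms for word, terms in groups.items() if len(terms) >= 2}


# Hypothetical, pre-segmented search terms.
search_terms = {
    "li bai dynasty": ["li bai", "dynasty"],
    "li bai birthplace": ["li bai", "birthplace"],
    "du fu dynasty": ["du fu", "dynasty"],
}
groups = group_by_shared_words(search_terms)
```

An inverted index of this kind is only one possible design; it avoids comparing every pair of search terms, which matters when the historical set is very large.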
In order to ensure that the semantics of the search items in the group after grouping are the same or similar, the grouping of the search item set comprising the target search item to be generalized and the history search item according to the words in each search item comprises the following steps:
determining the importance of each term in the search term;
taking the words with importance greater than a first importance threshold value in the search term as target words;
and grouping a search term set comprising target search terms to be generalized and historical search terms according to the target terms.
The importance of each word in the search term may be determined by any feasible method.
Typically, the importance of each word in a search term may be determined from the sentence components of the search term. For example, the importance of a word serving as the subject is greater than that of a word serving as another sentence component.
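As a rough illustration of selecting target words by importance, the sketch below assumes that an importance score per word is already available from some upstream method. The scores, the threshold value and the name `select_target_words` are hypothetical.

```python
def select_target_words(word_importance, first_importance_threshold):
    """Keep the words whose importance is greater than the threshold, in original order."""
    return [
        word
        for word, score in word_importance.items()
        if score > first_importance_threshold
    ]


# Hypothetical importance scores for the words of one search term.
word_importance = {"li bai": 0.9, "which": 0.2, "dynasty": 0.7, "the": 0.05}
target_words = select_target_words(word_importance, 0.5)
```

Only the selected target words would then be used as grouping keys, so that search terms are not grouped merely because they share unimportant words.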
S120, determining the generalized search term of the target search term from the historical search terms according to the grouping result.
Specifically, the history search term in the same group as the target search term may be taken as a generalization search term of the target search term.
To further increase the synonym degree of the generalization search term and the target search term, determining the generalization search term of the target search term from the historical search terms according to the grouping result comprises:
determining the sentence similarity or conversion loss between the target search term and each historical search term located in the same group as the target search term;
selecting, according to the determined similarity or conversion loss, the historical search terms that have the same semantics as the target search term from the historical search terms in the same group as the target search term;
and taking the historical search terms with the same semantics as the generalization search terms of the target search term.
According to the technical scheme of the embodiment of the invention, the search term set comprising the target search term to be generalized and the historical search terms is grouped according to the words in each search term, so that comprehensive generalization of the target search term is achieved. The generalization search term of the target search term is determined from the historical search terms according to the grouping result, so that the generalization search term conforms to the way users actually phrase questions, and unreasonable generalization search terms caused by direct replacement are avoided.
Example 2
Fig. 2 is a flowchart of a data generalization method according to a second embodiment of the present invention. This embodiment is an alternative to the embodiments described above. Referring to fig. 2, the data generalization method provided in the present embodiment includes:
S210, unifying synonyms and/or words identifying the same entity included in the search term.
Wherein the words identifying the same entity may be aliases of the entities.
Illustratively, the search terms are "which era does Li Bai belong to" and "in what dynasty did the Poet Immortal live", where "Poet Immortal" is an alias of Li Bai. After unification, the search terms become "the dynasty to which Li Bai belongs" and "the dynasty in which Li Bai lived".
To avoid introducing excessive unification errors, unifying synonyms and/or words identifying the same entity included in the search term includes:
unifying, in the search term, the synonyms whose importance is greater than a second importance threshold and/or the words identifying the same entity.
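A minimal sketch of this unification step is given below. It assumes a lookup table from aliases or synonyms to a canonical form and a per-word importance score. Both tables, the threshold and the function name `unify_words` are hypothetical examples rather than anything specified in the patent.

```python
def unify_words(words, canonical_form, importance, second_importance_threshold):
    """Replace sufficiently important words by their canonical form.

    Words at or below the threshold are left untouched, which limits
    the number of unification errors that can be introduced.
    """
    unified = []
    for word in words:
        if importance.get(word, 0.0) > second_importance_threshold:
            unified.append(canonical_form.get(word, word))
        else:
            unified.append(word)
    return unified


# Hypothetical alias/synonym table and importance scores.
canonical_form = {"nyc": "new york city", "big apple": "new york city", "what": "which"}
importance = {"nyc": 0.9, "big apple": 0.9, "population": 0.8, "what": 0.1}

unified = unify_words(["what", "population", "nyc"], canonical_form, importance, 0.5)
```

In this example the low-importance word "what" is deliberately not rewritten even though the table contains an entry for it.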
S220, according to the words in each search term, grouping search term sets comprising target search terms to be generalized and historical search terms.
To improve the accuracy of grouping, according to terms in each search term, a search term set including a target search term to be generalized and a history search term is grouped, including:
and grouping the search term sets comprising the target search term to be generalized and the history search term according to the words in each search term and the answer type of each search term.
Specific answer types may include time, person name, and so on. The present embodiment does not limit the method of determining the answer type. Typically, the answer type of a search term may be determined based on a pre-trained model.
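The following sketch shows one way grouping by both words and answer type could look. The answer-type labels are assumed to come from an upstream classifier that is not shown; the data and the name `group_by_answer_type_and_word` are hypothetical.

```python
from collections import defaultdict


def group_by_answer_type_and_word(terms):
    """Group search terms by the pair (answer type, word).

    `terms` is a list of (text, answer_type, words) tuples; the answer
    type is assumed to have been predicted elsewhere.
    """
    groups = defaultdict(list)
    for text, answer_type, words in terms:
        for word in words:
            groups[(answer_type, word)].append(text)
    return dict(groups)


# Hypothetical labelled search terms.
terms = [
    ("when was li bai born", "time", ["li bai"]),
    ("who was li bai's teacher", "person", ["li bai"]),
    ("li bai birth year", "time", ["li bai"]),
]
typed_groups = group_by_answer_type_and_word(terms)
```

Search terms that share a word but expect different kinds of answers end up in different groups, which is the accuracy gain described above.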
S230, determining the generalized search term of the target search term from the historical search terms according to the grouping result.
According to the technical scheme provided by the embodiment of the invention, the synonyms and/or the words identifying the same entity in the search term are unified, which improves the accuracy of grouping search terms with the same semantics and, in turn, the accuracy of determining the generalization search term.
Example 3
Fig. 3 is a flowchart of a data generalization method according to a third embodiment of the present invention. This embodiment is an alternative to the embodiments described above. Referring to fig. 3, the data generalization method provided by the embodiment of the present invention includes:
S310, grouping a search term set comprising the target search term to be generalized and historical search terms according to the words in each search term.
S320, determining the generalization search term of the target search term from the history search terms according to the conversion loss between the target search term and the history search term in the target search term group.
The conversion loss refers to the loss incurred in converting one search term into another. The smaller the loss, the more similar the two search terms are and the more likely they are to express the same meaning.
Since the conversion losses of the matched identical words (i.e., the same words) are the same, in order to reduce the calculation amount, before determining the generalization search term of the target search term from the history search terms according to the conversion loss between the target search term and the history search term in the target search term group, the method further comprises:
matching the words in the target search term with the words in the history search term;
taking the words that fail to match as target words;
and determining the conversion loss between the target search term and the historical search term in the target search term group according to the target words in the target search term and in the historical search term.
The method determines the conversion loss between the target search term and the historical search term in the target search term group based on the words that do not match. This ensures that the conversion loss is calculated accurately while avoiding redundant calculation for identical words, which improves the efficiency of calculating the conversion loss.
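The step of keeping only the non-matching words can be sketched as follows. The function name `unmatched_words` and the sample words are hypothetical, and exact string equality is assumed as the matching criterion.

```python
def unmatched_words(target_words, history_words):
    """Return the words of each search term that do not appear in the other one."""
    shared = set(target_words) & set(history_words)
    target_rest = [w for w in target_words if w not in shared]
    history_rest = [w for w in history_words if w not in shared]
    return target_rest, history_rest


target_rest, history_rest = unmatched_words(
    ["li bai", "which", "dynasty"],
    ["li bai", "what", "era"],
)
```

Only the remaining words would then enter the conversion loss calculation, so identical words such as "li bai" are not compared again.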
Specifically, before determining the generalization search term of the target search term from the history search terms according to the conversion loss between the target search term and the history search term in the target search term group, the method further comprises:
determining a first conversion loss from the target search term to other search terms and a second conversion loss from the other search terms to the target search term according to the similarity of each word in the target search term and each word in the history search term in the target search term group;
and determining the conversion loss of the target retrieval item and the historical retrieval item according to the first conversion loss and the second conversion loss.
Typically, the first conversion loss is different from the second conversion loss. For example, the first search term is "which dynasty did Li Bai live in", and the second search term is "which dynasty did Li Bai and Du Fu live in". The conversion loss from the first search term to the second search term is 0; however, since the second search term includes "Du Fu", which is not present in the first search term, the conversion loss from the second search term to the first search term is not 0.
To avoid determining two search terms that have a containment relationship to be synonymous search terms (that is, determining their conversion loss to be 0), determining the conversion loss of the target search term and the historical search term based on the first conversion loss and the second conversion loss comprises: taking the maximum of the first conversion loss and the second conversion loss as the conversion loss of the target search term and the other search term.
Further, the determining the first conversion loss from the target search term to other search terms and the second conversion loss from other search terms to the target search term according to the similarity between each word in the target search term and each word in the history search term in the target search term group includes:
taking the sum of minimum values of the similarity of each word in the target search term and each word in other search terms as a first conversion loss for converting the target search term into other search terms;
and taking the sum of the minimum values of the similarity between each word in the other search terms and each word in the target search term as a second conversion loss for converting the other search terms into the target search term.
For example, suppose the target search term comprises m words and the other search term comprises n words. The similarities between the first word in the target search term and each word in the other search term are denoted a11, a12, …, a1n. The similarities between the second word in the target search term and each word in the other search term are denoted a21, a22, …, a2n. The similarities between the mth word in the target search term and each word in the other search term are denoted am1, am2, …, amn. The minimum of a11, a12, …, a1n, the minimum of a21, a22, …, a2n, and so on up to the minimum of am1, am2, …, amn are determined, and the sum of these minimum values is taken as the first conversion loss.
The similarities between the first word in the other search term and each word in the target search term are denoted a11, a21, …, am1. The similarities between the second word in the other search term and each word in the target search term are denoted a12, a22, …, am2. The similarities between the nth word in the other search term and each word in the target search term are denoted a1n, a2n, …, amn. The minimum of a11, a21, …, am1, the minimum of a12, a22, …, am2, and so on up to the minimum of a1n, a2n, …, amn are determined, and the sum of these minimum values is taken as the second conversion loss.
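The row-wise and column-wise sums described above can be sketched in a few lines of Python. The matrix entries are treated here as per-word-pair values in which a lower value means a closer match, which is how the text uses them when summing minima into a loss. The matrix values and the name `conversion_losses` are hypothetical.

```python
def conversion_losses(a):
    """Compute the first, second and combined conversion loss from an m x n matrix.

    a[i][j] relates word i of the target search term to word j of the
    other search term (lower means closer).
    """
    first = sum(min(row) for row in a)            # target -> other: row minima
    second = sum(min(col) for col in zip(*a))     # other -> target: column minima
    return first, second, max(first, second)


# Hypothetical matrix: the target has 2 words, the other search term has 3.
# The third word of the other search term has no close match in the target.
a = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.7],
]
first, second, combined = conversion_losses(a)
```

In this example the first loss is 0 because every word of the target has an exact match, while the second loss is not 0 because of the extra word. Taking the maximum prevents the pair from being treated as synonymous.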
In order to amplify the data characteristics, determining the conversion loss of the target search term and the other search terms according to the similarity of each word in the target search term and each word in the other search terms comprises:
applying a nonlinear transformation to the similarity of each word in the target search term and each word in the other search terms;
and determining the conversion loss of the target search term and the other search terms according to the result of the nonlinear transformation.
According to the technical scheme provided by the embodiment of the invention, a large number of historical search terms are used in order to make the generalization of search terms comprehensive, which leads to a heavy computational load. To solve this problem, the present embodiment groups the search term set comprising the target search term to be generalized and the historical search terms according to the words in each search term, thereby achieving a coarse clustering of similar search sentences. The target search term group, that is, the group in the grouping result that includes the target search term, is then determined, and the generalization search term of the target search term is determined from the historical search terms according to the conversion loss between the target search term and the historical search terms in the target search term group. This reduces the amount of calculation and improves the accuracy of determining the generalization search term of the target search term.
Example 4
Fig. 4 is a flowchart of a data generalization method according to a fourth embodiment of the present invention. This embodiment is an alternative to the embodiments described above. Referring to fig. 4, the data generalization method provided in the present embodiment includes:
S410, extracting keywords from the target search term to be generalized and the historical search terms.
Specifically, word importance analysis is performed on each word in the search term to identify trunk words, strong qualifiers, weak qualifiers and redundant words. Trunk words carry strong expressive power in the search term; strong qualifiers are typically important content-level qualifiers, such as time or place; weak qualifiers are typically resource-level qualifiers, such as "download", "video" or "material", or content-level qualifiers with little effect; redundant words are words that can be discarded directly.
The trunk words with higher importance are unified based on an entity alias list. At the same time, English appearing in the search term is uniformly rewritten in lower case, and numbers are uniformly rewritten as Chinese numerals. Redundant words with lower importance are then ignored directly. Here, the importance of some words is set empirically to ensure that important words are not ignored.
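The normalization described above might be sketched as follows. The digit-by-digit conversion to Chinese numerals is an assumption made for this illustration (the text does not say how numbers are rewritten), and the alias list, redundant-word set and name `normalize_words` are hypothetical.

```python
# Assumption: digits are rewritten one by one, e.g. "2018" -> "二零一八".
DIGIT_TO_CHINESE = str.maketrans("0123456789", "零一二三四五六七八九")


def normalize_words(words, alias_to_canonical, redundant_words):
    """Drop redundant words, lowercase English, rewrite digits, and unify aliases."""
    normalized = []
    for word in words:
        if word in redundant_words:
            continue  # redundant words are ignored directly
        word = word.lower().translate(DIGIT_TO_CHINESE)
        normalized.append(alias_to_canonical.get(word, word))
    return normalized


# Hypothetical alias list and redundant words.
alias_to_canonical = {"诗仙": "李白"}
redundant_words = {"请问"}

normalized = normalize_words(["请问", "NBA", "2018", "诗仙"], alias_to_canonical, redundant_words)
```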
S420, grouping target search terms and history search terms to be generalized based on the extracted keywords.
Specifically, the answer types of the target search term to be generalized and of the historical search terms are determined first, and non-question-and-answer search terms are excluded. Then, search terms with the same answer type and the same trunk word or strong qualifier are placed in the same group.
Since this embodiment uses massive amounts of data, some groups will naturally not include the target search term. These groups may be ignored directly.
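A minimal sketch of this grouping step, restricted to the groups that contain the target search term, is shown below. It assumes each search term has already been annotated with an answer type (`None` marking a non-question-and-answer term), trunk words and strong qualifiers. The record layout and the name `group_for_target` are hypothetical.

```python
from collections import defaultdict


def group_for_target(target, history):
    """Return the historical search terms that fall into a group with the target."""

    def keys(term):
        words = term["trunk_words"] + term["strong_qualifiers"]
        return {(term["answer_type"], word) for word in words}

    index = defaultdict(set)
    for term in history:
        if term["answer_type"] is None:
            continue  # non-question-and-answer search terms are excluded
        for key in keys(term):
            index[key].add(term["text"])

    # Groups that do not share a key with the target are simply never visited.
    matched = set()
    for key in keys(target):
        matched |= index.get(key, set())
    return sorted(matched)


target = {"text": "which dynasty did li bai live in", "answer_type": "time",
          "trunk_words": ["li bai"], "strong_qualifiers": []}
history = [
    {"text": "li bai era", "answer_type": "time",
     "trunk_words": ["li bai"], "strong_qualifiers": []},
    {"text": "li bai poems download", "answer_type": None,
     "trunk_words": ["li bai"], "strong_qualifiers": []},
    {"text": "li bai teacher", "answer_type": "person",
     "trunk_words": ["li bai"], "strong_qualifiers": []},
    {"text": "du fu era", "answer_type": "time",
     "trunk_words": ["du fu"], "strong_qualifiers": []},
]
same_group = group_for_target(target, history)
```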
S430, determining conversion losses of the target search term and each history search term which are positioned in the same group.
The amount of conversion loss calculation that needs to be performed has been greatly reduced by grouping.
Conversion loss here refers to the loss incurred in transforming one search term into another. The smaller the loss, the more similar the two search terms are and the more likely they are to express the same meaning.
Specifically, the words in a search term other than the identical words used for grouping are taken as target words, and the target words in the two search terms whose conversion loss is to be determined are converted into word vectors.
Based on the converted word vectors, the cosine similarity of every pair of words (denoted $q_{1i}$ and $q_{2j}$) in the two search terms whose conversion loss is to be determined is computed. The cosine similarity is normalized to between 0 and 1 (the normalized result is denoted $r_{ij}$). The normalized result is then made nonlinear, specifically by multiplying it by
Figure BDA0001939061080000101
and taking the tangent, which generates a nonlinear matrix. The conversion loss from the first search term to the second search term is the sum of the minimum values of each row of the nonlinear matrix, and the conversion loss from the second search term to the first search term is the sum of the minimum values of each column. Finally, the maximum of the two loss values is taken as the conversion loss of the first search term and the second search term.
The formulas are expressed as follows:
$q_1 = [q_{11}, q_{12}, q_{13}, \ldots, q_{1m}]$;
$q_2 = [q_{21}, q_{22}, q_{23}, \ldots, q_{2n}]$.
$\cos_{ij} = \cos(q_{1i}, q_{2j})$
$r_{ij} = (\cos_{ij} + 1)/2$
Figure BDA0001939061080000111
$Y = [[y_{11}, y_{12}, \ldots, y_{1n}], [y_{21}, y_{22}, \ldots, y_{2n}], \ldots, [y_{m1}, y_{m2}, \ldots, y_{mn}]]$
$\mathrm{trans\_cost} = \max\left(\sum_i \min_j y_{ij},\ \sum_j \min_i y_{ij}\right)$.
Here $q_1$ is the word vector representation of the first search term, and $q_{11}, q_{12}, q_{13}, \ldots, q_{1m}$ are the vector representations of the words comprised by the first search term; $q_1$ does not include the vector representations of the identical words used for grouping. Likewise, $q_2$ is the word vector representation of the second search term, and $q_{21}, q_{22}, q_{23}, \ldots, q_{2n}$ are the vector representations of the words comprised by the second search term; $q_2$ also does not include the vector representations of the identical words used for grouping. trans_cost represents the conversion loss of the first search term and the second search term.
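The calculation can be sketched end to end in plain Python. One caveat: the exact expression for y_ij is given only in a formula image that is not reproduced in this text, so the nonlinearity is passed in as a parameter. The default used in the example, `assumed_nonlinear`, is purely an assumption chosen so that identical words cost 0; it should not be read as the patent's formula. The word vectors are hypothetical as well.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assumed_nonlinear(r):
    # ASSUMPTION: stand-in for the formula image; maps r = 1 (identical) to 0.
    return math.tan((1.0 - r) * math.pi / 4)


def trans_cost(q1, q2, nonlinear=assumed_nonlinear):
    """max(sum of row minima, sum of column minima) of the matrix Y."""
    y = [[nonlinear((cosine(u, v) + 1) / 2) for v in q2] for u in q1]
    row_loss = sum(min(row) for row in y)          # first -> second
    col_loss = sum(min(col) for col in zip(*y))    # second -> first
    return max(row_loss, col_loss)


# Hypothetical 2-dimensional word vectors.
q1 = [[1.0, 0.0]]
q2 = [[1.0, 0.0], [0.0, 1.0]]
cost = trans_cost(q1, q2)
```

With these vectors the first search term is fully covered by the second, so the row-wise loss is 0, while the extra word in the second search term makes the column-wise loss positive.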
S440, if trans_cost is lower than the set synonym threshold, the first search term and the second search term are considered synonymous, and the historical search term serving as the second search term is determined to be a generalization search term of the target search term serving as the first search term.
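The threshold decision in S440 amounts to a simple filter. In the sketch below, the loss values, the threshold and the name `select_generalizations` are hypothetical.

```python
def select_generalizations(costs, synonym_threshold):
    """Return the historical search terms whose trans_cost is below the threshold,
    ordered from lowest to highest loss."""
    kept = [(cost, term) for term, cost in costs.items() if cost < synonym_threshold]
    return [term for cost, term in sorted(kept)]


# Hypothetical conversion losses between the target and its group members.
costs = {
    "li bai era": 0.10,
    "which dynasty did li bai and du fu live in": 0.70,
    "dynasty of li bai": 0.05,
}
generalized = select_generalizations(costs, 0.3)
```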
To expand recall, the synonym threshold and the importance of the words on which grouping is based can be adjusted as required.
Typically, the importance of the words on which grouping is based is set lower, and the synonym threshold is set higher.
Specifically, at S420, search terms with the same answer type and the same trunk word, strong qualifier or weak qualifier are placed in the same group, so that more historical search terms are grouped into the target search term group.
When S440 is executed, a higher synonym threshold is set, thereby ensuring the accuracy of determining the generalization search term.
The specific flow of the above method can also be seen in fig. 5.
According to the technical scheme provided by the embodiment of the invention, the synonymous search terms can be found more widely. In addition, since the search terms are all input by the user, the search terms can be directly hit and displayed.
It should be noted that, given the technical teaching of this embodiment, those skilled in the art are motivated to combine schemes of any of the embodiments described in the foregoing embodiments to achieve broad and reasonable generalization of search terms.
Example 5
Fig. 6 is a schematic structural diagram of a data generalization apparatus according to a fifth embodiment of the present invention. Referring to fig. 6, the data generalization apparatus provided in the present embodiment includes: a grouping module 10 and a generalization module 20.
Wherein, the grouping module 10 is used for grouping the search term set comprising the target search term to be generalized and the history search term according to the words in each search term;
the generalization module 20 is configured to determine a generalization search term of the target search term from the history search terms according to the grouping result.
According to the technical scheme of the embodiment of the invention, the search term set comprising the target search term to be generalized and the historical search terms is grouped according to the words in each search term, so that the target search term is generalized. The generalization search term of the target search term is determined from the historical search terms according to the grouping result, so that the generalization search term conforms to the way users actually phrase questions, and unreasonable generalization search terms caused by direct replacement are avoided.
Further, the grouping module includes: an importance determining unit, a target determining unit and a grouping unit.
The importance determining unit is used for determining the importance of each word in the search term;
the target determining unit is used for taking the words in the search term whose importance is greater than a first importance threshold as target words;
and the grouping unit is used for grouping the search term set comprising the target search term to be generalized and the historical search terms according to the target words.
Further, the apparatus further comprises: a word unification module.
The word unification module is used for unifying synonyms and/or words identifying the same entity included in the search terms before the search term set comprising the target search term to be generalized and the historical search terms is grouped according to the words in each search term.
Further, the generalization module includes: a generalization unit.
The generalization unit is used for determining the generalization search term of the target search term from the historical search terms according to the conversion loss between the target search term and the historical search terms in the target search term group.
Further, the apparatus further comprises: a different loss determination module and a final loss determination module.
The different loss determination module is used for determining, according to the similarity between each word in the target search term and each word in the historical search terms in the target search term group, a first conversion loss from the target search term to the other search terms and a second conversion loss from the other search terms to the target search term, before the generalization search term of the target search term is determined from the historical search terms according to the conversion loss between the target search term and the historical search terms in the target search term group;
and the final loss determination module is used for determining the conversion loss of the target search term and the historical search term according to the first conversion loss and the second conversion loss.
Further, the apparatus further comprises a matching module, a target word determining module and a loss determination module.
The matching module is configured to match the words in the target search term with the words in the historical search term, before the generalized search term of the target search term is determined from the historical search terms according to the conversion loss between the target search term and the historical search terms in the target search term group;
the target word determining module is configured to take the words that fail to match as target words;
and the loss determination module is configured to determine the conversion loss between the target search term and the historical search term in the target search term group according to the target words in the target search term and in the historical search term.
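The matching and target word steps above can be illustrated with a minimal sketch. It assumes each search term has already been segmented into a list of words and that "matching" means exact string equality; the function name `unmatched_words` is hypothetical and does not come from the patent. How the conversion loss is then computed from these target words is not fixed by this passage, so the sketch stops at extracting them.

```python
def unmatched_words(target_words, history_words):
    """Return the words of each search term that have no identical word in the other.

    Assumption: both inputs are pre-segmented word lists and matching is exact
    string equality. These unmatched words play the role of "target words".
    """
    target_set = set(target_words)
    history_set = set(history_words)
    # Words of the target search term that do not appear in the historical one.
    target_only = [w for w in target_words if w not in history_set]
    # Words of the historical search term that do not appear in the target one.
    history_only = [w for w in history_words if w not in target_set]
    return target_only, history_only
```

Restricting the loss computation to the unmatched words means that words shared by both search terms contribute nothing, which keeps the comparison focused on where the two search terms actually differ.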
Further, the word unification module includes a word unification unit.
The word unification unit is configured to unify, in the search terms, synonymous words whose importance is greater than a second importance threshold and/or words identifying the same entity.
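As a rough illustration of the word unification unit, the sketch below rewrites a word to a canonical form only when its importance exceeds a threshold. The canonical-form table, the importance scores and the name `unify_words` are all assumptions made for the example; the text does not say how synonyms, entities or importance are obtained, and it is an assumption here that one table and one threshold cover both synonyms and same-entity words.

```python
def unify_words(words, canonical, importance, threshold):
    """Replace sufficiently important words by their canonical form.

    words      -- pre-segmented words of one search term
    canonical  -- hypothetical mapping from a word to its unified form
                  (synonym or same-entity alias)
    importance -- hypothetical mapping from a word to an importance score
    threshold  -- stands in for the "second importance threshold"
    """
    unified = []
    for word in words:
        if importance.get(word, 0.0) > threshold:
            # Important word: map to its unified form if one is known.
            unified.append(canonical.get(word, word))
        else:
            # Unimportant word: left unchanged.
            unified.append(word)
    return unified
```

For example, with `canonical = {"nyc": "new_york"}` and an importance of 0.9 for "nyc", the word is rewritten when the threshold is 0.5, while a low-importance word keeps its original form even if a canonical form exists.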
Further, the different loss determination module includes a first loss determination unit and a second loss determination unit.
The first loss determination unit is configured to take the sum of the minimum values of the similarity between each word in the target search term and each word in the other search term as the first conversion loss for converting the target search term into the other search term;
and the second loss determination unit is configured to take the sum of the minimum values of the similarity between each word in the other search term and each word in the target search term as the second conversion loss for converting the other search term into the target search term.
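The two directional losses can be sketched as follows, following the wording literally: for every word of the source search term, take the minimum of its scores against all words of the other search term, then sum these minima. The word-level scoring function is passed in as a parameter because this passage does not define it; the names `directional_loss` and `conversion_losses` are hypothetical, and how the two losses are combined into the final conversion loss is left open here.

```python
def directional_loss(source_words, other_words, similarity):
    """Sum, over the source words, of the minimum score against the other words.

    similarity -- caller-supplied function (word, word) -> float; its
                  definition is an assumption and is not given in the text.
    """
    return sum(
        min(similarity(src, oth) for oth in other_words)
        for src in source_words
    )


def conversion_losses(target_words, other_words, similarity):
    """Return (first conversion loss, second conversion loss)."""
    first = directional_loss(target_words, other_words, similarity)   # target -> other
    second = directional_loss(other_words, target_words, similarity)  # other -> target
    return first, second
```

The two values generally differ when the search terms have different lengths, because each direction sums over a different set of words; this is why both directions are computed before a final loss is derived.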
Further, the generalization module includes a word matching unit and a grouping unit.
The word matching unit is configured to match the words in each search term;
and the grouping unit is configured to divide at least two search terms into one group if the at least two search terms include at least one identical word.
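The grouping rule above (search terms that share at least one identical word fall into the same group) can be sketched with an inverted index from words to search terms. Whitespace tokenization and the name `group_by_shared_word` are assumptions for the example; a real system would use its own word segmentation.

```python
from collections import defaultdict


def group_by_shared_word(search_terms):
    """Group search terms that contain at least one identical word.

    Assumption: words are obtained by whitespace splitting. Returns a mapping
    from each shared word to the search terms containing it; only groups with
    at least two search terms are kept.
    """
    index = defaultdict(list)
    for term in search_terms:
        # dict.fromkeys removes repeated words while preserving their order.
        for word in dict.fromkeys(term.split()):
            index[word].append(term)
    return {word: terms for word, terms in index.items() if len(terms) >= 2}
```

With this layout a search term may appear in several groups, one per shared word, so the candidates for a target search term are the members of every group that contains it.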
The data generalization apparatus provided by the embodiment of the invention can execute the data generalization method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method.
Example six
Fig. 7 is a schematic structural diagram of an apparatus according to a sixth embodiment of the present invention. Fig. 7 shows a block diagram of an exemplary device 12 suitable for use in implementing embodiments of the present invention. The device 12 shown in fig. 7 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 7, device 12 is in the form of a general purpose computing device. Components of device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. Device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 7, commonly referred to as a "hard disk drive"). Although not shown in fig. 7, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
Device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with device 12, and/or any devices (e.g., network card, modem, etc.) that enable device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Also, device 12 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, via network adapter 20. As shown, network adapter 20 communicates with other modules of device 12 over bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the data generalization method provided by the embodiments of the present invention.
Example seven
The seventh embodiment of the present invention further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data generalization method according to any of the embodiments of the present invention, the method comprising:
grouping search term sets comprising target search terms to be generalized and history search terms according to terms in each search term;
and determining the generalized search term of the target search term from the historical search terms according to the grouping result.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (10)

1. A method of generalizing data, comprising:
grouping search term sets comprising target search terms to be generalized and history search terms according to terms in each search term;
matching the words in the target search term with the words in the history search term;
taking the word which is inconsistent in matching as a target word;
determining the conversion loss of the target search term and the history search term in the target search term group according to the target words in the target search term and in the history search term;
determining a generalization search term of a target search term from a history search term according to conversion loss between the target search term and the history search term in a target search term group; wherein the conversion loss refers to the loss required for transforming from one search term to another search term; wherein before determining the generalization search term of the target search term from the history search terms according to the conversion loss between the target search term and the history search term in the target search term group, the method further comprises:
taking the sum of minimum values of the similarity of each word in the target search term and each word in other search terms as a first conversion loss for converting the target search term into other search terms;
taking the sum of the minimum values of the similarity between each word in the other search terms and each word in the target search term as a second conversion loss for converting the other search terms into the target search term;
and determining the conversion loss of the target search term and the history search term according to the first conversion loss and the second conversion loss.
2. The method of claim 1, wherein grouping the search term set including the target search term and the history search term to be generalized according to the terms in each search term, comprises:
determining the importance of each term in the search term;
taking the words with importance greater than a first importance threshold value in the search term as target words;
and grouping a search term set comprising target search terms to be generalized and historical search terms according to the target terms.
3. The method of claim 1, wherein grouping the search term set including the target search term and the history search term to be generalized according to the terms in each search term, comprises:
matching words in each search term;
if at least two search terms comprise at least one same word, dividing the at least two search terms into a group.
4. The method according to claim 1, wherein before grouping the search term set including the target search term and the history search term to be generalized according to the terms in each search term, further comprising:
and unifying synonyms and/or words identifying the same entity included in the search term.
5. The method according to claim 4, wherein unifying synonyms included in the search term and/or words identifying the same entity comprises:
and unifying synonym words with importance greater than a second importance threshold and/or words identifying the same entity in the search term.
6. A data generalization apparatus, comprising:
the grouping module is used for grouping search term sets comprising target search terms to be generalized and historical search terms according to words in each search term;
a generalization module comprising:
a generalization unit, configured to determine a generalization search term of a target search term from a history search term according to a conversion loss between the target search term and the history search term in a target search term group, where the conversion loss is a loss required for transforming from one search term to another search term;
the different loss determination module is used for taking, before the generalization search term of the target search term is determined from the history search terms according to the conversion loss between the target search term and the history search term in the target search term group, the sum of the minimum values of the similarity between each word in the target search term and each word in other search terms as a first conversion loss for converting the target search term into other search terms; and taking the sum of the minimum values of the similarity between each word in the other search terms and each word in the target search term as a second conversion loss for converting the other search terms into the target search term;
a final loss determination module for determining the conversion loss of the target search term and the history search term according to the first conversion loss and the second conversion loss;
the matching module is used for matching the words in the target search term with the words in the history search term before determining the generalization search term of the target search term from the history search terms according to the conversion loss between the target search term and the history search term in the target search term group;
the target word determining module is used for taking the words which are inconsistent in matching as target words;
and the loss determination module is used for determining the conversion loss of the target search term and the history search term in the target search term group according to the target words in the target search term and in the history search term.
7. The apparatus of claim 6, wherein the grouping module comprises:
an importance determining unit for determining the importance of each term in the search term;
the target determining unit is used for taking the words with importance greater than a first importance threshold value in the search term as target words;
and the grouping unit is used for grouping the search term set comprising the target search term to be generalized and the history search term according to the target word.
8. The apparatus as recited in claim 6, further comprising:
and the word unification module is used for unifying synonymous words included in the search terms and/or words identifying the same entity before grouping the search term sets including the target search term to be generalized and the history search term according to the words in each search term.
9. An electronic device, the device comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data generalization method of any of claims 1-5.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements a data generalization method according to any of claims 1-5.
CN201910015940.0A 2019-01-08 2019-01-08 Data generalization method, device, equipment and medium Active CN109740161B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910015940.0A CN109740161B (en) 2019-01-08 2019-01-08 Data generalization method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN109740161A CN109740161A (en) 2019-05-10
CN109740161B true CN109740161B (en) 2023-06-20

Family

ID=66363958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910015940.0A Active CN109740161B (en) 2019-01-08 2019-01-08 Data generalization method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN109740161B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1400901A2 (en) * 2002-09-19 2004-03-24 Microsoft Corporation Method and system for retrieving confirming sentences

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110289081A1 (en) * 2010-05-20 2011-11-24 Intelliresponse Systems Inc. Response relevance determination for a computerized information search and indexing method, software and device
CN104598473B (en) * 2013-10-31 2018-07-06 联想(北京)有限公司 A kind of information processing method and electronic equipment
CN105808685B (en) * 2016-03-02 2021-09-28 腾讯科技(深圳)有限公司 Promotion information pushing method and device
CN105912630B (en) * 2016-04-07 2020-01-31 北京搜狗信息服务有限公司 information expansion method and device
CN108153785B (en) * 2016-12-06 2022-04-29 百度在线网络技术(北京)有限公司 Method and device for generating display information
CN108572971B (en) * 2017-03-09 2022-11-01 百度在线网络技术(北京)有限公司 Method and device for mining keywords related to search terms
CN108509474B (en) * 2017-09-15 2022-01-07 腾讯科技(深圳)有限公司 Synonym expansion method and device for search information
CN107958078A (en) * 2017-12-13 2018-04-24 北京百度网讯科技有限公司 Information generating method and device
CN108052659B (en) * 2017-12-28 2022-03-11 北京百度网讯科技有限公司 Search method and device based on artificial intelligence and electronic equipment
CN109117475B (en) * 2018-07-02 2022-08-16 武汉斗鱼网络科技有限公司 Text rewriting method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant