CN112800315B - Data processing method, device, equipment and storage medium - Google Patents

Data processing method, device, equipment and storage medium

Info

Publication number
CN112800315B
Authority
CN
China
Prior art keywords
word
abnormal
word pair
processed
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110130043.1A
Other languages
Chinese (zh)
Other versions
CN112800315A (en)
Inventor
连义江
杨新涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110130043.1A priority Critical patent/CN112800315B/en
Publication of CN112800315A publication Critical patent/CN112800315A/en
Application granted granted Critical
Publication of CN112800315B publication Critical patent/CN112800315B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9038Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/904Browsing; Visualisation therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method, a device, equipment and a storage medium, which relate to the technical field of computers and further relate to artificial intelligence technologies such as deep learning and intelligent searching. The specific implementation scheme is as follows: identifying word pairs to be processed in the historical search display log according to abnormal seed word pairs in the seed library; wherein, the word pairs comprise search requests submitted by requesters and reference keywords provided by data providers; according to the identification result, updating an abnormal word list, wherein the abnormal word list is used for data searching, so that the influence of abnormal matching conditions on the search result is greatly reduced, and the accuracy of the search result is improved.

Description

Data processing method, device, equipment and storage medium
Technical Field
The application relates to the technical field of computers, in particular to the technical field of artificial intelligence such as deep learning and intelligent search.
Background
Three roles may be involved in the data search field: a requester, a data provider, and a search engine. The requester submits a search request to the search engine; the data provider provides reference keywords and content (e.g., advertising creatives) to the search engine; and the search engine designs a matching mechanism between search requests and reference keywords. When a search request submitted by a requester matches a reference keyword provided by a data provider, the content of the data provider (such as an advertising creative) is presented on the requester's search results page, so the matching between the search request and the reference keyword is critical. However, some abnormal matches inevitably occur when the search engine matches search requests against reference keywords, which seriously affects the accuracy of the search results.
Disclosure of Invention
The application provides a data processing method, a data processing device, data processing equipment and a storage medium.
According to an aspect of the present application, there is provided a data processing method, the method including:
identifying word pairs to be processed in the historical search display log according to abnormal seed word pairs in the seed library; wherein, the word pairs comprise search requests submitted by requesters and reference keywords provided by data providers;
and updating an abnormal word list according to the identification result, wherein the abnormal word list is used for data searching.
According to another aspect of the present application, there is provided a data processing apparatus comprising:
the recognition module is used for recognizing word pairs to be processed in the historical search display log according to the abnormal seed word pairs in the seed library; wherein, the word pairs comprise search requests submitted by requesters and reference keywords provided by data providers;
and the list updating module is used for updating an abnormal word list according to the identification result, wherein the abnormal word list is used for data searching.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data processing method of any one of the embodiments of the present application.
According to another aspect of the present application, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the data processing method according to any of the embodiments of the present application.
According to another aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a data processing method as described in any of the embodiments of the present application.
According to the technology of the application, the influence of abnormal matching conditions on the search results is greatly reduced, and the accuracy of the search results is improved.
It should be understood that the description of this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
FIG. 1A is a flow chart of a data processing method provided according to an embodiment of the present application;
FIG. 1B is a schematic diagram of a data processing method provided in accordance with an embodiment of the present application;
FIG. 2A is a flow chart of another data processing method provided in accordance with an embodiment of the present application;
FIG. 2B is a schematic diagram of a synonym metric model provided according to embodiments of the present disclosure;
FIG. 3 is a flow chart of yet another data processing method provided in accordance with an embodiment of the present application;
FIG. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 5 is a block diagram of an electronic device for implementing a data processing method of an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1A is a flow chart of a data processing method provided according to an embodiment of the present application; FIG. 1B is a schematic diagram of a data processing method according to an embodiment of the present application. This embodiment is applicable to processing data in the field of data searching so as to reduce the influence of abnormal matching on search results. The method of this embodiment is applied to a search engine and may be performed by a data processing apparatus, which may be implemented in software and/or hardware and may be integrated in an electronic device provided with search engine functionality. As shown in FIGS. 1A and 1B, the data processing method includes:
S101, identifying word pairs to be processed in a historical search display log according to abnormal seed word pairs in a seed library; wherein the word pairs include search requests submitted by requesters and reference keywords provided by data providers.
In this embodiment, for any search request submitted by a requester, the search engine can match at least one reference keyword from the reference keywords provided by data providers based on a matching mechanism; each matched reference keyword and the search request together form a word pair. Further, for any word pair, if the search request and the reference keyword in the word pair are irrelevant to each other, or the degree of relevance between them is less than a set threshold, the word pair may be regarded as an abnormal word pair.
The seed library, which may also be called an abnormal seed library, is dedicated to storing abnormal seed word pairs, which are determined from the abnormal word pairs fed back by data providers. Optionally, all abnormal word pairs fed back by data providers may be added to the seed library as abnormal seed word pairs; to improve the accuracy of the search results, this embodiment preferably adds only those fed-back abnormal word pairs with a low degree of relevance to the seed library as abnormal seed word pairs. Because data providers can feed abnormal word pairs back to the search engine in real time, the seed library may be updated periodically, in an offline state, according to the abnormal word pairs fed back over a period of time, which improves the accuracy of the search results without affecting the search function of the search engine.
The search display log is a log generated while the search engine produces search results in response to a search request submitted by a requester. Specifically, the historical search display log may be all search display logs accumulated by the search engine before the current time, the search display logs accumulated over a period of time, or a set number of search display logs accumulated historically, and so on. Optionally, each historical search display log may include a search request, a reference keyword, the content corresponding to the reference keyword (such as an advertising creative), and so on. The word pairs to be processed are word pairs recorded in the historical search display log for which it has yet to be determined whether they are abnormal word pairs.
Optionally, in this embodiment, the identification operation may be performed according to a set condition, for example, a set period elapsing or the number of historical search display logs reaching a set number. When the current state meets the set condition (for example, the current time matches the set period), the historical search display log is acquired, the word pairs to be processed are extracted from it, and they are identified according to the abnormal seed word pairs in the seed library. In this case, the historical search display log is specifically the log accumulated by the search engine between the previous identification operation and the current one.
Further, this embodiment can identify the word pairs to be processed in the historical search display log according to the abnormal seed word pairs in the seed library based on a pre-trained synonymous metric model. Specifically, each word pair to be processed may be input into the synonymous metric model together with the abnormal seed word pairs in the seed library; the model outputs the degree of correlation between the word pair to be processed and each abnormal seed word pair, and whether the word pair to be processed is an abnormal word pair is determined from that degree of correlation.
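The set condition described above can be sketched as a simple check. This is only an illustrative sketch: the function name should_run_identification and its parameters (period, pending_logs, log_limit) are hypothetical and do not come from the application.

```python
from datetime import datetime, timedelta


def should_run_identification(now: datetime, last_run: datetime,
                              period: timedelta, pending_logs: int,
                              log_limit: int) -> bool:
    """Return True when either assumed set condition is met."""
    # Condition 1: the set period has elapsed since the last identification.
    period_elapsed = now - last_run >= period
    # Condition 2: the number of accumulated logs has reached the set number.
    enough_logs = pending_logs >= log_limit
    return period_elapsed or enough_logs


last_run = datetime(2021, 1, 1, 0, 0)
now = datetime(2021, 1, 1, 6, 0)
run_by_period = should_run_identification(now, last_run, timedelta(hours=6), 10, 1000)
run_by_count = should_run_identification(now, last_run, timedelta(days=1), 1000, 1000)
skipped = should_run_identification(now, last_run, timedelta(days=1), 10, 1000)
```

Treating the two conditions as alternatives (logical OR) is an assumption; an implementation could equally require only one of them.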
By way of example, the method and the device can identify word pairs to be processed in the history search display log according to abnormal seed word pairs in the seed library in an offline state, so that the online search function of a search engine can be ensured not to be influenced, and reasonable utilization of system resources is realized.
Furthermore, to improve identification efficiency, the word pairs to be processed in the historical search display log can be deduplicated before they are identified. That is, if a word pair to be processed occurs two or more times, only one occurrence is retained.
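As a minimal sketch of this deduplication step, assuming each word pair is represented as a (search request, reference keyword) tuple (a representation chosen here for illustration only):

```python
from typing import Iterable, List, Tuple

WordPair = Tuple[str, str]  # (search request, reference keyword), an assumed layout


def dedupe_word_pairs(pairs: Iterable[WordPair]) -> List[WordPair]:
    """Keep the first occurrence of each word pair, preserving log order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(pairs))


log_pairs = [
    ("cause of syncope", "what is epilepsy"),
    ("DNA detection", "DNA pregnancy identification paternity"),
    ("cause of syncope", "what is epilepsy"),
]
unique_pairs = dedupe_word_pairs(log_pairs)
```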
S102, updating an abnormal word list according to the identification result, wherein the abnormal word list is used for data searching.
In this embodiment, the abnormal word list is specially used for storing abnormal word pairs, and can be applied to data searching as a basis for screening search results; further, the abnormal word list may or may not include abnormal word pairs fed back by the data provider.
The identification result may include, for each word pair to be processed, whether it is an abnormal word pair; alternatively, the identification result may include only those word pairs to be processed that were identified as abnormal word pairs, and so on.
Optionally, according to the identification result, this embodiment may add the word pairs to be processed that were identified as abnormal word pairs to the abnormal word list, thereby updating the abnormal word list.
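The update can be sketched as follows, assuming the abnormal word list is held as a set of (search request, reference keyword) tuples and the identification result maps each word pair to a boolean; both representations and the name update_abnormal_list are illustrative assumptions.

```python
from typing import Dict, Set, Tuple

WordPair = Tuple[str, str]


def update_abnormal_list(abnormal_list: Set[WordPair],
                         identification_result: Dict[WordPair, bool]) -> int:
    """Add newly identified abnormal word pairs in place; return how many were added."""
    newly_abnormal = {pair for pair, is_abnormal in identification_result.items()
                      if is_abnormal}
    added = newly_abnormal - abnormal_list
    abnormal_list.update(added)
    return len(added)


abnormal_list = {("cause of syncope", "what is epilepsy")}
result = {
    ("cause of syncope", "what is epilepsy"): True,          # already listed
    ("how does syncope happen", "cause of epilepsy"): True,  # new abnormal pair
    ("DNA detection", "DNA paternity test"): False,          # not abnormal
}
added_count = update_abnormal_list(abnormal_list, result)
```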
It should be noted that, in the stage of matching search requests against reference keywords, some abnormal matches inevitably occur in the search engine, that is, abnormal word pairs exist. These seriously affect the accuracy of the search results, so abnormal word pairs need to be masked during searching. The prior art generally optimizes the search engine based on abnormal word pairs fed back by data providers to prevent similar problems from recurring. However, because of the diversity of language expression, an abnormal word list constructed only from the abnormal word pairs fed back by data providers (that is, a limited set of abnormal word pairs), or maintained through manual intervention, cannot mask all abnormal word pairs of the same type. For example, suppose the abnormal word pair fed back by the data provider is <cause of syncope, what is epilepsy>, where the search request is "cause of syncope" and the reference keyword is "what is epilepsy". Based on this abnormal word pair alone, the prior art cannot mask paraphrases such as <how does syncope happen, what is the cause of epilepsy> and/or <what is the cause of syncope, what is epilepsy caused by>.
It is worth noting that, even when the number of abnormal seed word pairs is relatively small, this embodiment periodically identifies the word pairs to be processed in the historical search display log, so that a rich abnormal word list can be obtained without relying on manual effort. For example, given one abnormal seed word pair <cause of syncope, what is epilepsy>, this embodiment can, by identifying the word pairs to be processed in the historical search display log, recognize word pairs such as <how does syncope happen, what is the cause of epilepsy>, <what is the cause of syncope, what is epilepsy caused by>, <cause of syncope, how does epilepsy happen>, and <how does syncope happen, cause of epilepsy> as abnormal word pairs and add them to the abnormal word list. This embodiment provides a new approach to obtaining a rich abnormal word list and lays a foundation for identifying a whole class of abnormal word pairs from a single abnormal word pair provided by a data provider. In addition, the abnormal word list changes dynamically as the historical search display log changes, which widens the identification range during data searching.
According to the technical solution of this embodiment, a seed library is introduced, the abnormal seed word pairs in the seed library are used as references to identify the word pairs to be processed in the historical search display log, and the abnormal word list is updated based on the identification result. Compared with the prior art, this embodiment can obtain a rich abnormal word list without relying on manual effort and provides a new approach to obtaining such a list. In addition, using the rich abnormal word list for data searching makes it possible to identify a whole class of abnormal word pairs from a single abnormal word pair provided by a data provider, which greatly reduces the influence of abnormal matching on the search results and improves their accuracy.
Optionally, as an optional implementation of this embodiment of the present application, after the abnormal word list is updated, data searching may be performed based on the updated abnormal word list. For example, candidate keywords may be selected from the reference keywords provided by data providers according to a target search request submitted by a requester, and the target search request and the candidate keywords may then be identified according to the updated abnormal word list.
Specifically, a requester with a search need may submit a target search request to the search engine. Based on the matching mechanism, the search engine may select at least one candidate keyword (optionally, a plurality of candidate keywords in this embodiment) from the reference keywords provided by data providers, and treat each candidate keyword together with the target search request as a candidate word pair. A candidate word pair is masked if it hits the updated abnormal word list, or if the similarity between it and any abnormal word pair in the updated abnormal word list is greater than a set similarity threshold. The search engine then presents to the requester only the content (such as advertising creatives) corresponding to the unmasked candidate word pairs, which greatly reduces the influence of abnormal matching on the search results, improves their accuracy, and improves the user experience.
FIG. 2A is a flowchart of another data processing method according to an embodiment of the present application. On the basis of the above embodiment, this embodiment further explains how to identify the word pairs to be processed in the historical search display log according to the abnormal seed word pairs in the seed library. As shown in FIG. 2A, the data processing method includes:
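The masking logic above might look like the following sketch. The function filter_candidates, the default threshold of 0.8, and the toy token_overlap similarity are all assumptions made for illustration; in particular, token_overlap is only a stand-in and is not the similarity measure of the application.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple

WordPair = Tuple[str, str]


def filter_candidates(
    target_request: str,
    candidate_keywords: Iterable[str],
    abnormal_list: Set[WordPair],
    similarity: Optional[Callable[[WordPair, WordPair], float]] = None,
    threshold: float = 0.8,
) -> List[str]:
    """Return the candidate keywords whose word pair is not masked."""
    kept = []
    for keyword in candidate_keywords:
        pair = (target_request, keyword)
        # Mask on an exact hit in the abnormal word list.
        if pair in abnormal_list:
            continue
        # Optionally mask when the pair is too similar to any listed abnormal pair.
        if similarity is not None and any(
            similarity(pair, bad) > threshold for bad in abnormal_list
        ):
            continue
        kept.append(keyword)
    return kept


def token_overlap(a: WordPair, b: WordPair) -> float:
    """Toy stand-in similarity: Jaccard overlap of the tokens of two word pairs."""
    tokens_a = set(" ".join(a).split())
    tokens_b = set(" ".join(b).split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


abnormal = {("cause of syncope", "what is epilepsy")}
shown_exact = filter_candidates(
    "cause of syncope", ["what is epilepsy", "syncope treatment"], abnormal)
shown_similar = filter_candidates(
    "cause of syncope", ["epilepsy is what", "syncope treatment"], abnormal,
    similarity=token_overlap)
```

Only the content for the keywords returned by filter_candidates would then be shown to the requester.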
S201, determining core words of word pairs to be processed in the history search display log.
In this embodiment, for any word pair, the core word of the word pair characterizes the central idea of the word pair; further, the core word of the word pair is composed of the core word of the search request in the word pair and the core word of the reference keyword in the word pair. For example, if a word pair is <DNA detection, DNA pregnancy identification paternity>, the core word of the word pair is (DNA+detection) + (DNA+pregnancy+identification+paternity).
Optionally, for each word pair to be processed in the history search display log, a core word sequence labeling tool may be used to identify a core word of the word pair to be processed, or other manners, such as a pre-built core word identification model, may be used to identify a core word of the word pair to be processed, and so on.
S202, selecting a target word pair of the word pair to be processed from abnormal seed word pairs of the seed library according to the inverted index of the core word associated with the seed library and the core word of the word pair to be processed.
An inverted index is a common indexing mechanism. In this embodiment, a core word sequence labeling tool may be used to identify the core word of each abnormal seed word pair in the seed library, and, based on the core words of all abnormal seed word pairs, a core word inverted index may be constructed with the core word as the index word and the abnormal seed word pairs as the index content. For example, if abnormal seed word pair 1 and abnormal seed word pair 2 have the same core word, (DNA+detection) + (DNA+pregnancy+identification+paternity), then using (DNA+detection) + (DNA+pregnancy+identification+paternity) as the index word, the abnormal seed word pairs sharing that core word, namely abnormal seed word pair 1 and abnormal seed word pair 2, can be obtained from the seed library through the core word inverted index.
Furthermore, when the abnormal seed word pairs in the seed library change, the core word inverted index can be updated dynamically. For example, when one or more abnormal seed word pairs are newly added to the seed library, the core word of each newly added abnormal seed word pair can be identified and used as an index word. If that core word already exists as an index word in the pre-constructed core word inverted index, the newly added abnormal seed word pair is added to the corresponding index content. If that core word does not yet exist as an index word, a new index pair is added, with the core word as the index word and the newly added abnormal seed word pair as the index content. The core word inverted index associated with the seed library may include at least one index pair.
Optionally, for each word pair to be processed, this embodiment may use the core word of the word pair to be processed as an index word to query the core word inverted index associated with the seed library, and obtain the abnormal seed word pairs whose core word is identical to that of the word pair to be processed. There may be one or more such abnormal seed word pairs. If there is exactly one, it can be used directly as the target word pair of the word pair to be processed; if there are several, one target word pair can be selected from them through the following steps:
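A minimal sketch of such a core word inverted index with dynamic updates is given below. Core word extraction itself is left to the labeling tool mentioned above, so the caller supplies the core words; the class name CoreWordInvertedIndex, the tuple-based key layout, and the second seed pair are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

WordPair = Tuple[str, str]
# Assumed key layout: (core words of the search request, core words of the keyword).
CoreKey = Tuple[Tuple[str, ...], Tuple[str, ...]]


class CoreWordInvertedIndex:
    """Maps a core word (index word) to the abnormal seed word pairs sharing it."""

    def __init__(self) -> None:
        self._index: DefaultDict[CoreKey, List[WordPair]] = defaultdict(list)

    def add(self, core_key: CoreKey, seed_pair: WordPair) -> None:
        # An existing index word gets the pair appended to its index content;
        # an unseen index word creates a new index pair automatically.
        bucket = self._index[core_key]
        if seed_pair not in bucket:
            bucket.append(seed_pair)

    def lookup(self, core_key: CoreKey) -> List[WordPair]:
        # .get avoids creating an empty entry for unknown index words.
        return list(self._index.get(core_key, []))


key = (("DNA", "detection"), ("DNA", "pregnancy", "identification", "paternity"))
index = CoreWordInvertedIndex()
index.add(key, ("DNA detection", "DNA pregnancy identification paternity"))
# Made-up second seed pair that is assumed to share the same core word.
index.add(key, ("detection of DNA", "paternity identification by DNA in pregnancy"))
candidates = index.lookup(key)
missing = index.lookup((("syncope",), ("epilepsy",)))
```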
Step A: selecting candidate word pairs of the word pair to be processed from the abnormal seed word pairs of the seed library according to the core word inverted index associated with the seed library and the core word of the word pair to be processed.
Optionally, for each word pair to be processed, this embodiment may use the abnormal seed word pairs in the seed library whose core word is the same as that of the word pair to be processed as the candidate word pairs of the word pair to be processed, where there are a plurality of candidate word pairs.
Step B: determining the distance between the word pair to be processed and each candidate word pair based on the synonymous metric model.
The synonymous metric model may also be referred to as a similarity metric model. To increase calculation speed, the synonymous metric model in this embodiment is trained based on the idea of a twin (Siamese) network. For example, as shown in FIG. 2B, the network structure is a Transformer, word embedding is used to convert words into vectors, and network parameters are shared: the network parameters on the positive sample keyword side are the same as those on the search request side, and the network parameters on the search request side are the same as those on the negative sample keyword side. Further, training is performed on sample pairs (i.e., pairwise training), and a training loss function is used during training to predict the relative distance between input samples.
Optionally, a reference keyword clicked by the requester in the historical search display log may be used as a positive sample keyword, and the word pair <search request, positive sample keyword> is used as a positive sample. Meanwhile, some of the reference keywords not clicked by the requester in the historical search display log may be used as negative sample keywords, and the word pair <search request, negative sample keyword> is used as a negative sample. To ensure the accuracy of the model, a random factor may be introduced; for example, a set number of <search request, random reference keyword> word pairs may be randomly selected from the historical search display log and also used as negative samples. In this embodiment, the positive and negative samples may be input into the initial model (i.e., the untrained synonymous metric model) for training to obtain the synonymous metric model. Specifically, word embedding is performed on the positive sample keyword, the search request, and the negative sample keyword to obtain word vectors; the word vectors are input into the Transformer to obtain a positive sample keyword vector, a search request vector, and a negative sample keyword vector, respectively; metric learning is then performed between the positive sample keyword vector and the search request vector, and between the negative sample keyword vector and the search request vector; a loss is obtained using the training loss function, and the model is trained based on this loss to obtain the synonymous metric model. To further ensure the accuracy of the model, manually labeled synonymous and non-synonymous samples can be used as fine-tuning samples to fine-tune the network parameters, yielding a synonymous metric model with higher accuracy.
Optionally, in this embodiment, each word pair to be processed may be input into the synonymous metric model together with its candidate word pairs, and the synonymous metric model outputs the distance between the word pair to be processed and each candidate word pair. Optionally, this distance characterizes the degree of correlation between the word pair to be processed and each candidate word pair: the larger the distance, the smaller the correlation.
Optionally, as an optional implementation of this embodiment of the present application, determining the distance between the word pair to be processed and a candidate word pair may be: calculating a first distance between the search request in the word pair to be processed and the search request in the candidate word pair, and a second distance between the reference keyword in the word pair to be processed and the reference keyword in the candidate word pair; and determining the distance between the word pair to be processed and the candidate word pair according to the first distance and the second distance.
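The application does not name the training loss function. As one common choice for pairwise metric learning, the sketch below uses a margin-based (triplet-style) hinge loss on Euclidean distances; both the loss form and the margin value are assumptions, and the vectors stand in for the encoder outputs described above.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_margin_loss(request_vec: Sequence[float],
                         positive_vec: Sequence[float],
                         negative_vec: Sequence[float],
                         margin: float = 1.0) -> float:
    """Assumed loss: zero once the positive keyword is closer than the negative by `margin`."""
    d_pos = euclidean(request_vec, positive_vec)
    d_neg = euclidean(request_vec, negative_vec)
    return max(0.0, d_pos - d_neg + margin)


# Toy vectors standing in for the search request and keyword encodings.
request = [0.0, 0.0]
positive = [0.0, 1.0]
well_separated = pairwise_margin_loss(request, positive, [3.0, 4.0])
too_close = pairwise_margin_loss(request, positive, [0.0, 1.5])
```

In an actual training setup this quantity would be computed on the outputs of the shared-parameter encoder and minimized by gradient descent; that part is omitted here.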
Specifically, for each candidate word pair of each word pair to be processed, a first distance between the search request in the candidate word pair and the search request in the word pair to be processed may be calculated, and a second distance between the reference keyword in the candidate word pair and the reference keyword in the word pair to be processed may be calculated, and a sum of the first distance and the second distance may be used as a distance between the candidate word pair and the word pair to be processed; alternatively, the first weight and the second weight may be preset, and then the first weight and the first distance may be multiplied, the second weight and the second distance may be multiplied, and the sum of the products of the two may be used as the distance between the candidate word pair and the word pair to be processed.
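Both variants described above (plain sum and weighted sum) can be expressed by one small function. This is a sketch only; the function name pair_distance and the default weights of 1.0 are illustrative assumptions, with the defaults reducing to the plain sum.

```python
def pair_distance(first_distance: float, second_distance: float,
                  first_weight: float = 1.0, second_weight: float = 1.0) -> float:
    """Combine the request-side and keyword-side distances into one pair distance."""
    # With both weights at 1.0 this is the plain sum of the two distances.
    return first_weight * first_distance + second_weight * second_distance


plain_sum = pair_distance(0.2, 0.3)
weighted = pair_distance(0.2, 0.3, first_weight=0.7, second_weight=0.3)
```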
Step C: selecting a target word pair of the word pair to be processed from the candidate word pairs according to the distance.
Specifically, for each word pair to be processed, after determining the distance between the word pair to be processed and each candidate word pair of the word pair to be processed, the candidate word pair corresponding to the minimum distance may be used as the target word pair of the word pair to be processed.
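Selecting the candidate with the minimum distance can be sketched as follows, assuming the distances have already been computed and are held in a dictionary keyed by candidate word pair (an illustrative representation; the distance values are made up).

```python
from typing import Dict, Optional, Tuple

WordPair = Tuple[str, str]


def select_target_pair(
    distances: Dict[WordPair, float]
) -> Optional[Tuple[WordPair, float]]:
    """Return the candidate word pair with the smallest distance, or None if empty."""
    if not distances:
        return None
    target = min(distances, key=distances.get)
    return target, distances[target]


candidate_distances = {
    ("cause of syncope", "what is epilepsy"): 0.12,
    ("syncope causes", "epilepsy explained"): 0.35,
}
target = select_target_pair(candidate_distances)
```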
In the embodiment, the core word and the inverted index of the core word are introduced to screen abnormal seed word pairs of the seed library, so that the distance calculation amount is reduced, and the recognition efficiency is improved; meanwhile, a synonymous measurement model based on twin network idea training is introduced, and the recognition efficiency is greatly improved under the condition that the distance calculation accuracy can be ensured.
S203, identifying whether the word pair to be processed is an abnormal word pair according to the distance between the word pair to be processed and the target word pair.
Optionally, for each word pair to be processed, if the distance between the word pair to be processed and its target word pair is smaller than or equal to a set distance value, the word pair to be processed is determined to be an abnormal word pair. If the distance between the word pair to be processed and its target word pair is larger than the set distance value, the word pair to be processed is determined not to be an abnormal word pair.
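Selecting the target word pair by minimum distance and then applying the set distance value can be sketched as follows. The function and parameter names are hypothetical, and the handling of an empty candidate set (treated here as not abnormal) is an assumption that the application does not specify.

```python
def identify_abnormal(candidate_distances, distance_threshold):
    """Pick the nearest candidate as the target pair and apply the threshold.

    candidate_distances: dict mapping each candidate word pair to its distance
                         from the word pair to be processed.
    Returns (target_pair, is_abnormal).
    """
    if not candidate_distances:
        # Assumption: with no candidates there is no evidence of abnormality.
        return None, False
    target = min(candidate_distances, key=candidate_distances.get)
    is_abnormal = candidate_distances[target] <= distance_threshold
    return target, is_abnormal
```

A word pair is flagged only when its nearest abnormal seed word pair lies within the set distance value, which matches the "smaller than or equal to" rule above.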
S204, updating an abnormal word list according to the identification result, wherein the abnormal word list is used for data searching.
According to the technical solution of this embodiment, a seed library is introduced, and the word pairs to be processed in the historical search display log are identified by taking the abnormal seed word pairs in the seed library as a reference. During identification, core words and the core word inverted index are used to screen the abnormal seed word pairs in the seed library, which reduces the number of distance calculations and improves identification efficiency. Compared with the prior art, this embodiment can obtain a rich abnormal word list without relying on manual work, and provides a new approach to obtaining such a list. In addition, the abnormal word list is updated based on the identification result and is used for data searching, so that a whole class of abnormal word pairs can be identified from a single abnormal word pair provided by a data provider, which greatly reduces the influence of abnormal matching on search results and improves their accuracy.
Fig. 3 is a flowchart of yet another data processing method provided according to an embodiment of the present application. This embodiment adds the operation of updating the seed library on the basis of the above embodiments. As shown in fig. 3, the data processing method includes:
S301, updating a seed library according to abnormal word pairs fed back by a data provider.
Optionally, this embodiment may add all abnormal word pairs fed back by the data provider within a period of time to the seed library as abnormal seed word pairs, so as to update the seed library. Alternatively, to improve the accuracy of the search results, this embodiment may add only a part of the abnormal word pairs fed back by the data provider to the seed library as abnormal seed word pairs.
Optionally, selecting a part of the abnormal word pairs fed back by the data provider as abnormal seed word pairs and adding them to the seed library may be implemented through the following steps:
Step 1, determining the confidence of the abnormal word pairs fed back by the data provider.
Optionally, for each abnormal word pair fed back by the data provider, the similarity between the search request and the reference keyword in the abnormal word pair may be determined, and the confidence of the abnormal word pair may be determined according to the similarity. For example, the similarity may be normalized to a value between 0 and 1, and 1 minus the normalized value may be used as the confidence. Alternatively, the similarity may be substituted into a set confidence calculation formula to obtain the confidence. Optionally, the confidence is negatively correlated with the similarity, that is, the greater the similarity, the lower the confidence.
Step 2, selecting abnormal seed word pairs from the abnormal word pairs fed back by the data provider according to the confidence.
Optionally, the abnormal word pairs fed back by the data provider may be sorted in descending order of confidence, and the top preset number of abnormal word pairs may be used as abnormal seed word pairs, which are then added to the seed library to update it.
Step 3, adding the abnormal seed word pairs to the seed library.
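Steps 1 to 3 can be sketched as follows, using the "1 minus the normalized similarity" example for the confidence. The function name and parameters are hypothetical, and the similarity function is assumed to be supplied by the caller and to return a value between 0 and 1; the application does not fix a particular similarity measure here.

```python
def select_seed_pairs(feedback_pairs, similarity, top_n):
    """Return the top_n fed-back abnormal word pairs with the highest confidence.

    feedback_pairs: list of (search_request, reference_keyword) tuples
    similarity:     assumed callable returning a value in [0, 1]
    Confidence is taken as 1 - similarity, so a less similar pair is
    considered a more reliable abnormal seed word pair.
    """
    scored = [(1.0 - similarity(request, keyword), (request, keyword))
              for request, keyword in feedback_pairs]
    # Sort in descending order of confidence; the sort is stable for ties.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:top_n]]
```

The returned pairs would then be added to the seed library as abnormal seed word pairs.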
It should be noted that, in this embodiment, abnormal seed word pairs are selected from the abnormal word pairs fed back by the data provider based on the confidence and are added to the seed library, which ensures the accuracy of the abnormal seed word pairs in the seed library and lays a foundation for obtaining an accurate abnormal word list.
S302, identifying word pairs to be processed in a historical search display log according to the abnormal seed word pairs in the updated seed library; wherein each word pair includes a search request submitted by a requester and a reference keyword provided by a data provider.
S303, updating an abnormal word list according to the identification result, wherein the abnormal word list is used for data searching.
According to the technical solution of this embodiment, dynamically updating the seed library lays a foundation for obtaining a rich abnormal word list. In addition, by taking the abnormal seed word pairs in the updated seed library as a reference to identify word pairs to be processed in the historical search display log, a rich abnormal word list can be obtained without relying on manual work, which provides a new approach to obtaining such a list. The abnormal word list is updated based on the identification result and is used for data searching, so that a whole class of abnormal word pairs can be identified from a single abnormal word pair provided by a data provider, which greatly reduces the influence of abnormal matching on search results and improves their accuracy.
Fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. This embodiment is applicable to processing data in the field of data searching so as to reduce the influence of abnormal matching on search results. The apparatus can implement the data processing method according to any embodiment of the present application. As shown in fig. 4, the data processing apparatus includes:
the recognition module 401 is configured to recognize word pairs to be processed in the historical search display log according to abnormal seed word pairs in the seed library; wherein each word pair comprises a search request submitted by a requester and a reference keyword provided by a data provider;
the list updating module 402 is configured to update an abnormal word list according to the recognition result, where the abnormal word list is used for data searching.
According to the technical solution of this embodiment, a seed library is introduced, the abnormal seed word pairs in the seed library are used as a reference to identify word pairs to be processed in the historical search display log, and the abnormal word list is updated based on the identification result. Compared with the prior art, this embodiment can obtain a rich abnormal word list without relying on manual work, and provides a new approach to obtaining such a list. In addition, the rich abnormal word list is used for data searching, so that a whole class of abnormal word pairs can be identified from a single abnormal word pair provided by a data provider, which greatly reduces the influence of abnormal matching on search results and improves their accuracy.
Illustratively, the recognition module 401 includes:
the core word determining unit is used for determining core words of word pairs to be processed;
the target selection unit is used for selecting target word pairs of word pairs to be processed from abnormal seed word pairs of the seed library according to the core word inverted index associated with the seed library and the core words of the word pairs to be processed;
the anomaly identification unit is used for identifying whether the word pair to be processed is an anomaly word pair according to the distance between the word pair to be processed and the target word pair.
Illustratively, the target selection unit includes:
a candidate selecting subunit, configured to select a candidate word pair of the word pair to be processed from the abnormal seed word pair of the seed library according to the inverted index of the core word associated with the seed library and the core word of the word pair to be processed;
a distance determining subunit, configured to determine a distance between the word pair to be processed and the candidate word pair based on the synonym metric model;
and the target selecting subunit is used for selecting a target word pair of the word pair to be processed from the candidate word pairs according to the distance.
Illustratively, the distance determination subunit is specifically configured to:
calculating a first distance between a search request in a word pair to be processed and a search request in a candidate word pair, and a second distance between a reference keyword in the word pair to be processed and a reference keyword in the candidate word pair;
and determining the distance between the word pair to be processed and the candidate word pair according to the first distance and the second distance.
Illustratively, the apparatus further comprises:
and the seed library updating module is used for updating the seed library according to the abnormal word pairs fed back by the data provider.
Illustratively, the seed library update module is specifically configured to:
determining the confidence level of abnormal word pairs fed back by a data provider;
according to the confidence level, selecting abnormal seed word pairs from the abnormal word pairs fed back by the data provider;
adding the abnormal seed word pairs to the seed library.
Illustratively, the apparatus further comprises:
the keyword selection module is used for selecting candidate keywords from the reference keywords provided by the data provider according to the target search request submitted by the requester;
the recognition module 401 is further configured to recognize the target search request and the candidate keyword according to the updated abnormal word list.
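How the updated abnormal word list might be applied at search time can be sketched as follows. This is an illustration only: the function name is hypothetical, the abnormal word list is assumed to be stored as a set of (search request, reference keyword) tuples, and discarding the recognized pairs is an assumed follow-up action that the module description above does not spell out.

```python
def filter_candidate_keywords(target_request, candidate_keywords, abnormal_word_list):
    """Keep only candidate keywords that do not form a listed abnormal word pair.

    abnormal_word_list: assumed to be a set of (search_request, reference_keyword)
                        tuples built from the identification results.
    """
    return [keyword for keyword in candidate_keywords
            if (target_request, keyword) not in abnormal_word_list]
```

Storing the list as a set keeps each lookup at constant expected cost, which matters because the check runs for every candidate keyword of every target search request.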
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 5 illustrates a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the electronic device 500 includes a computing unit 501 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic device 500 may also be stored. The computing unit 501, ROM 502, and RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
A number of components in electronic device 500 are connected to I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the respective methods and processes described above, such as a data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. When a computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the data processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the high management difficulty and weak service scalability of traditional physical hosts and VPS services.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, which is not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A data processing method, comprising:
determining core words of word pairs to be processed; wherein, the word pairs comprise search requests submitted by requesters and reference keywords provided by data providers;
selecting candidate word pairs of the word pairs to be processed from abnormal seed word pairs of the seed library according to the core word inverted index associated with the seed library and the core word of the word pairs to be processed;
calculating a first distance between the search request in the word pair to be processed and the search request in the candidate word pair and a second distance between the reference keyword in the word pair to be processed and the reference keyword in the candidate word pair based on the synonym metric model;
determining the distance between the word pair to be processed and the candidate word pair according to the first distance and the second distance;
selecting a target word pair of the word pair to be processed from the candidate word pairs according to the distance;
identifying whether the word pair to be processed is an abnormal word pair according to the distance between the word pair to be processed and the target word pair; for any word pair, if the search request in the word pair and the reference keyword are irrelevant or the degree of correlation is smaller than a set threshold, the word pair is an abnormal word pair;
and updating an abnormal word list according to the identification result, wherein the abnormal word list is used for data searching.
2. The method of claim 1, further comprising:
and updating the seed library according to the abnormal word pairs fed back by the data provider.
3. The method of claim 2, wherein updating the seed library based on the abnormal word pairs fed back by the data provider comprises:
determining the confidence of the abnormal word pairs fed back by the data provider;
selecting abnormal seed word pairs from the abnormal word pairs fed back by the data provider according to the confidence;
adding the abnormal seed word pairs to the seed library.
4. The method of claim 1, further comprising, after updating the list of abnormal words based on the recognition result:
selecting candidate keywords from the reference keywords provided by the data provider according to the target search request submitted by the requester;
and identifying the target search request and the candidate keywords according to the updated abnormal word list.
5. A data processing apparatus comprising:
the recognition module comprises a core word determining unit, a target selecting unit and an abnormality recognizing unit;
the core word determining unit is used for determining core words of word pairs to be processed; wherein, the word pairs comprise search requests submitted by requesters and reference keywords provided by data providers;
the target selection unit comprises a candidate selection subunit, a distance determination subunit and a target selection subunit;
the candidate selecting subunit is configured to select, according to a core word inverted index associated with a seed library and a core word of the word pair to be processed, a candidate word pair of the word pair to be processed from an abnormal seed word pair of the seed library;
the distance determining subunit is configured to calculate, based on a synonym metric model, a first distance between the search request in the word pair to be processed and the search request in the candidate word pair, and a second distance between the reference keyword in the word pair to be processed and the reference keyword in the candidate word pair; and to determine the distance between the word pair to be processed and the candidate word pair according to the first distance and the second distance;
the target selection subunit is used for selecting a target word pair of the word pair to be processed from the candidate word pairs according to the distance;
the anomaly identification unit is used for identifying whether the word pair to be processed is an anomaly word pair according to the distance between the word pair to be processed and the target word pair; for any word pair, if the search request in the word pair and the reference keyword are irrelevant or the degree of correlation is smaller than a set threshold, the word pair is an abnormal word pair;
and the list updating module is used for updating an abnormal word list according to the identification result, wherein the abnormal word list is used for data searching.
6. The apparatus of claim 5, further comprising:
and the seed library updating module is used for updating the seed library according to the abnormal word pairs fed back by the data provider.
7. The apparatus of claim 6, wherein the seed library update module is specifically configured to:
determining the confidence of the abnormal word pairs fed back by the data provider;
selecting abnormal seed word pairs from the abnormal word pairs fed back by the data provider according to the confidence;
adding the abnormal seed word pairs to the seed library.
8. The apparatus of claim 5, further comprising:
the keyword selection module is used for selecting candidate keywords from the reference keywords provided by the data provider according to the target search request submitted by the requester;
and the identification module is also used for identifying the target search request and the candidate keywords according to the updated abnormal word list.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data processing method of any one of claims 1-4.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the data processing method according to any one of claims 1-4.
CN202110130043.1A 2021-01-29 2021-01-29 Data processing method, device, equipment and storage medium Active CN112800315B (en)


Publications (2)

Publication Number Publication Date
CN112800315A (en) 2021-05-14
CN112800315B (en) 2023-08-04

Family

ID=75813030

Country Status (1)

Country Link
CN (1) CN112800315B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761133A (en) * 2021-09-10 2021-12-07 未鲲(上海)科技服务有限公司 System abnormity monitoring method and device based on artificial intelligence and related equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491518A (en) * 2017-08-15 2017-12-19 北京百度网讯科技有限公司 Method and apparatus, server, storage medium are recalled in one kind search
WO2018040503A1 (en) * 2016-08-30 2018-03-08 北京百度网讯科技有限公司 Method and system for obtaining search results
CN111291069A (en) * 2018-12-07 2020-06-16 北京搜狗科技发展有限公司 Data processing method and device and electronic equipment
CN111950254A (en) * 2020-09-22 2020-11-17 北京百度网讯科技有限公司 Method, device and equipment for extracting word features of search sample and storage medium
CN112115232A (en) * 2020-09-24 2020-12-22 腾讯科技(深圳)有限公司 Data error correction method and device and server

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101609184B1 (en) * 2014-05-27 2016-04-06 네이버 주식회사 Method, system and recording medium for providing dictionary function and file distribution system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HowNet word similarity computation based on abstract concepts; Zhu Xinhua et al.; Computer Engineering and Design; Vol. 38 (No. 3); full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant