CN112749313A - Label labeling method and device, computer equipment and storage medium - Google Patents

Publication number
CN112749313A
Authority
CN
China
Prior art keywords
search
data
classification
category
result data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010772268.2A
Other languages
Chinese (zh)
Inventor
黄剑辉
梁龙军
刘海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010772268.2A
Publication of CN112749313A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Abstract

The application relates to a label labeling method, a label labeling device, computer equipment and a storage medium. The method comprises: obtaining a search record to be labeled in a target search system; obtaining search result data corresponding to the search record from an external search platform; inputting the search result data into a preset classification model to obtain a classification result corresponding to the search result data, wherein the classification category of the classification model is the same as the classification category of an index database of the target search system; and labeling the search record with a category label according to the classification result. Because the search result data corresponding to the search record is obtained from the external search platform and the search record is labeled with a category label through classification, semi-supervised label labeling is achieved. By relying on the external search platform, the method does not depend on whether the search system has a large amount of historical search click data and is therefore suitable for a search system in the cold start stage, so that the labeling efficiency of search records is effectively improved and training data is accumulated quickly.

Description

Label labeling method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a label labeling method and apparatus, a computer device, and a storage medium.
Background
With the development of artificial intelligence technology, various artificial intelligence data processing models have developed rapidly. Training is an important step in constructing a model, and model training depends on training data carrying labels.
Taking an intention analysis model in a search system as an example, traditional data labeling approaches include manual annotation and the construction of training data from historical search click data. However, a search system in the cold start stage lacks historical exposure and click data, so constructing training data from historical search click data is not suitable for such a system, while manual labeling suffers from low labeling efficiency.
Disclosure of Invention
In view of the above, there is a need to provide a label labeling method, apparatus, computer device and storage medium capable of improving training label labeling efficiency of a search system in a cold start phase.
A label labeling method comprises the following steps:
acquiring a search record to be marked in a target search system;
acquiring search result data corresponding to the search record from an external search platform;
inputting the search result data into a preset classification model to obtain a classification result corresponding to the search result data, wherein the classification category of the classification model is the same as that of an index database of the target search system;
and according to the classification result, performing category label labeling on the search record.
A label labelling apparatus, the apparatus comprising:
the search record acquisition module is used for acquiring search records to be marked in the target search system;
the external data acquisition module is used for acquiring search result data corresponding to the search records from an external search platform;
the data classification module is used for inputting the search result data into a preset classification model to obtain a classification result corresponding to the search result data, and the classification category of the classification model is the same as that of an index database of the target search system;
and the label labeling module is used for labeling the category labels of the search records according to the classification result.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a search record to be marked in a target search system;
acquiring search result data corresponding to the search record from an external search platform;
inputting the search result data into a preset classification model to obtain a classification result corresponding to the search result data, wherein the classification category of the classification model is the same as that of an index database of the target search system;
and according to the classification result, performing category label labeling on the search record.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a search record to be marked in a target search system;
acquiring search result data corresponding to the search record from an external search platform;
inputting the search result data into a preset classification model to obtain a classification result corresponding to the search result data, wherein the classification category of the classification model is the same as that of an index database of the target search system;
and according to the classification result, performing category label labeling on the search record.
According to the label labeling method, the label labeling device, the computer equipment and the storage medium, the search result data corresponding to the search record is obtained from the external search platform, which expands the search results available for the search record. The search result data is input into the classification model to obtain classification results whose categories are the same as the classification categories of the index database of the target search system, which associates the external search result data with the target search system. The search record is then labeled with a category label according to the classification results, so that semi-supervised label labeling is achieved. By relying on the external search platform, the method does not depend on whether the search system has a large amount of historical search click data and is suitable for a search system in the cold start stage, so that the labeling efficiency of search records is effectively improved and training data is accumulated quickly.
Drawings
FIG. 1 is a diagram of an application environment of the label labeling method in one embodiment;
FIG. 2 is a flow chart illustrating a label tagging method according to an embodiment;
FIG. 3 is a flow chart illustrating a label labeling method according to another embodiment;
FIG. 4 is a flowchart illustrating a label labeling method according to still another embodiment;
FIG. 5 is a flowchart illustrating a label labeling method according to yet another embodiment;
FIG. 6 is a flowchart illustrating a label tagging method according to an embodiment;
FIG. 7 is a flowchart illustrating a label labeling method according to still another embodiment;
FIG. 8 is a flowchart illustrating a label labeling method according to yet another embodiment;
FIG. 9 is a schematic diagram of a UI interface of an application scenario of the tag labeling method in one embodiment;
FIG. 10 is a flowchart illustrating a label labeling method according to another embodiment;
FIG. 11 is a flowchart illustrating a label labeling method according to yet another embodiment;
FIG. 12 is a block diagram showing the structure of a label labeling apparatus according to an embodiment;
FIG. 13 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The label labeling method provided by the application can be applied to the application environment shown in fig. 1, in which the search system 102 in the cold start stage communicates with the server 104 over a network, and the server 104 communicates with an external search platform over a network. The server 104 obtains a search record of the search system 102, obtains the search result data corresponding to the search record from the external search platform, inputs the search result data into a preset classification model to obtain a classification result corresponding to the search result data, where the classification category of the classification model is the same as the classification category of the index database of the target search system, and labels the search record with a category label according to the classification result. The search system 102 in the cold start stage and the external search platform may each be installed in a terminal in the form of an application program, and may be installed in the same terminal. The terminal may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer or portable wearable device in which the corresponding application programs are installed. The server 104 may be implemented by an independent server or by a server cluster formed by a plurality of servers. It is understood that in other embodiments, the search system 102 in the cold start stage and the external search platform may be installed in different terminals in the form of applications.
In one embodiment, as shown in fig. 2, a label labeling method is provided. The method is described by taking its application to the server in fig. 1 as an example, and includes the following steps 202 to 208.
Step 202, obtaining a search record to be labeled in the target search system.
The target search system is a search system in the cold start stage. The initial stage of a new search system is called the cold start stage, in which the search system lacks users and related resources. For search data input in the cold start stage, the lack of historical search click data makes it difficult to analyze the user's search intention, so search results that closely match the search data cannot be fed back. Query intention understanding is one of the core basic technologies of a search system and is the key to understanding the user's search intention, matching and ranking, and returning the most relevant results. It can be implemented with supervised training, for example with a classification model, but this often faces a lack of annotated data. In the cold start stage in particular, there is often not enough historical click data to use, so acquiring enough annotated query data becomes the key difficulty of intention model training.
The search record includes search data entered in the target search system. In an embodiment, the search record of the time period can be obtained by traversing the search record log file of the preset time period in the system. The preset time period may be the last week, the last month, etc. The search record includes search data entered each time, and the search data may be specifically entered text data. Based on the intent understanding and analysis of the text data, the search results most relevant to the text data may be obtained.
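A minimal sketch of collecting search records from a log within a preset time period follows. The log line format (an ISO timestamp, a tab, then the query text), the function name `recent_queries` and the removal of duplicate queries are assumptions made for illustration; the source does not specify any of them.

```python
from datetime import datetime, timedelta


def recent_queries(log_lines, now, window=timedelta(days=7)):
    """Return distinct queries logged within `window` before `now`.

    Assumed (hypothetical) line format: '<ISO timestamp>\t<query text>'.
    """
    seen, queries = set(), []
    for line in log_lines:
        stamp, _, query = line.rstrip("\n").partition("\t")
        query = query.strip()
        if not query:
            continue  # skip malformed or empty lines
        logged_at = datetime.fromisoformat(stamp)
        if now - window <= logged_at <= now and query not in seen:
            seen.add(query)
            queries.append(query)
    return queries


log = [
    "2020-08-01T10:00:00\tjump one jump",
    "2020-07-01T09:00:00\tan older query",
    "2020-08-02T11:30:00\tjump one jump",
    "2020-08-03T08:00:00\thow to open WeChat jump one jump",
]
print(recent_queries(log, now=datetime(2020, 8, 4)))
```

Passing `window=timedelta(days=30)` would correspond to the "last month" period mentioned above.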
Step 204, acquiring search result data corresponding to the search record from the external search platform.
The external search platform refers to a mature, widely used search platform, such as Baidu search or Tencent video search. In particular, it may be a search platform of the same type as the target search system. For example, if the target search system is a document search system, a mature document search platform should be selected as the external search platform. If the target search system is a video search system, a mature video search platform should be selected, such as a video playing platform that provides a search function.
In one embodiment, the external search platform has a mature search system and a large amount of historical search click data, so after a user inputs search data and obtains search results, the user can click and select among the results according to the desired data type. For example, taking a video search system as the external search platform, the search interface of the client provides a search box and video type selection buttons, which may include "all", "game", "cartoon", "drama" and "other". After the search content is input in the search box and the search is confirmed, the user can click the video type selection button as needed. The search intention corresponding to the search data can then be determined from the user's actual video playing records, the search results can be optimized based on a large amount of historical data, and the videos most relevant to the input search data are displayed first.
The search result data corresponding to the search record is a result obtained by searching the search record as the input search content in the external search platform. Each search record may correspond to a plurality of search result data.
In one embodiment, a set number of search results may be taken, in the order in which they appear in the search result list of the external search platform, as the search result data corresponding to the search record.
Step 206, inputting the search result data into a preset classification model to obtain a classification result corresponding to the search result data.
In an embodiment, the search result data may be text data, such as the title of a document or the title or brief introduction of a video or picture. Such text generally gives a short introduction to the document, video or picture so that its content is easy to understand.
The classification category of the classification model is the same as that of the index database of the target search system. In one embodiment, BERT may be used as the preset classification model; in other embodiments, a multi-category text classification model such as a CNN or LSTM may be used instead. With the classification model, the search result data acquired from the external search platform can be classified according to the classification categories of the index database of the target search system, yielding classification results that closely match the target search system. Specifically, the preset classification model may be obtained by training on sample data carrying classification labels that correspond to the classification categories of the index database of the target search system.
Each piece of data in the search result data corresponds to one classification result. The text data is input into the preset classification model, which performs classification analysis on the text and outputs the probability that the data belongs to each category. It is understood that the category the classification model outputs for a piece of data is the category with the highest probability in that classification analysis.
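The selection of the highest-probability category can be sketched as follows. The probability dictionary stands in for the output of the preset classification model (BERT or otherwise); the function name `top_category` and the example values are hypothetical.

```python
def top_category(probabilities):
    """Return (category, probability) for the most probable category.

    `probabilities` maps each index-database category to the probability
    assigned by the classification model (values here are made up).
    """
    category = max(probabilities, key=probabilities.get)
    return category, probabilities[category]


model_output = {"game": 0.7, "funny": 0.2, "skill teaching": 0.1}
print(top_category(model_output))
```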
Step 208, labeling the search record with a category label according to the classification result.
In an embodiment, the search result data carries a classification result obtained by the classification model analysis, and the search result data is obtained based on the search record and has a corresponding relationship with the search record. In an embodiment, the category to which the search record corresponds may be determined according to the number of search result data. Specifically, when the search result data includes only one piece of data, the classification result corresponding to the search result data may be set as the classification category corresponding to the search record, and the category label may be labeled. When the search result data comprises a plurality of pieces of data, each piece of data has a corresponding classification result, and an equal voting mechanism is adopted to determine the classification category finally corresponding to the search record.
According to the label labeling method, the search result data corresponding to the search record is obtained from the external search platform, which expands the search results available for the search record. The search result data is input into the classification model to obtain classification results whose categories are the same as the classification categories of the index database of the target search system, which associates the external search result data with the target search system. The search record is then labeled with a category label according to the classification results, so that semi-supervised label labeling is achieved. The method does not depend on whether the search system has a large amount of historical search click data, is suitable for a search system in the cold start stage, effectively improves the labeling efficiency of search records, and accumulates training data quickly.
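Steps 202 to 208 can be sketched end to end as below. This is an illustration under stated assumptions, not the patented implementation: `search_external` and `classify` are injected callables standing in for the external search platform and the preset classification model, and the keyword rule and titles used here are toy placeholders.

```python
from collections import Counter


def label_search_record(query, search_external, classify, top_n=5):
    """Label one search record with a category, or None if no results.

    search_external(query) -> list of result titles (hypothetical stand-in)
    classify(title)        -> category name (hypothetical stand-in)
    """
    titles = search_external(query)[:top_n]     # step 204, top-N results
    if not titles:
        return None
    votes = Counter(classify(t) for t in titles)  # step 206
    return votes.most_common(1)[0][0]             # step 208, equal voting


def fake_search(query):
    # Toy result list; a real system would query the external platform.
    return [
        f"{query}: tips for a high score",
        f"{query} animation",
        f"{query} funny real-life version",
        f"{query} playthrough",
        f"{query} funny clips",
    ]


def toy_classify(title):
    # Placeholder rule in place of a trained model.
    return "funny" if "funny" in title else "game"


print(label_search_record("jump one jump", fake_search, toy_classify))
```

Injecting the two callables keeps the sketch runnable without network access or a trained model.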
In one embodiment, as shown in fig. 3, before the search result data corresponding to the search record is obtained from the external search platform, i.e., before step 204, steps 302 to 304 are further included.
Step 302, identify the index repository data type of the target search system.
Step 304, determining an external search platform matched with the data type of the index database according to the data type of the index database.
The data type of the index database of the target search system refers to the data type of data that the index database of the target search system can provide when inputting search data in the target search system. The data type may be a document, an image, a video, etc., and in an embodiment, the data type may include a plurality of data types at the same time or only include one data type.
Take video as an example data type. When search data is input in the target search system, the search result data fed back consists of videos related to the search data, and there may be many such videos. For example, if a user inputs "jump one jump" and searches, the related videos obtained may be game videos of "jump one jump", skill teaching videos for "jump one jump", or funny videos related to "jump one jump". When the target search system is a video search system, the data in its index database is video data, so the search result data is always videos related to the search data.
An external search platform matching the data type is determined according to the data type of the index database; in essence, this selects the external search platform that corresponds to the target search system. By means of the external search platform, the large amount of historical search click data held by platforms providing similar or identical search services can be used to make up for the target search system's lack of historical search click data in the cold start stage, and the classification labels corresponding to the search records of the target search system are obtained by taking the data of the external search platform as a reference. Compared with manual labeling in the prior art, this greatly reduces manual participation and lowers the cost of label labeling.
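Steps 302 to 304 amount to a lookup from index database data type to a matching external platform. The sketch below assumes a hand-maintained registry; the platform identifiers and the function name `match_external_platforms` are hypothetical.

```python
# Hypothetical registry: index database data type -> external platform id.
EXTERNAL_PLATFORMS = {
    "document": "mature-document-search",
    "image": "mature-image-search",
    "video": "mature-video-search",
}


def match_external_platforms(index_data_types):
    """Return one external platform per data type held by the index database."""
    unknown = [t for t in index_data_types if t not in EXTERNAL_PLATFORMS]
    if unknown:
        raise ValueError(f"no external platform registered for: {unknown}")
    return [EXTERNAL_PLATFORMS[t] for t in index_data_types]


print(match_external_platforms(["video"]))
```

An index database holding several data types would simply yield several platforms, one per type.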
In one embodiment, as shown in FIG. 4, search result data corresponding to the search record is obtained from an external search platform, i.e., step 204 includes steps 402 through 404.
Step 402, building a crawling task according to the search record.
Step 404, executing the crawling task, and performing data crawling on the external search platform to obtain search result data corresponding to the search record.
A crawling task is a task in which a crawler crawls data within a specified range according to specified content. A crawler is a program or script that automatically captures data according to certain rules. In an embodiment, the specified content is a search record in the target search system, the specified range is search click data in the external search platform, and the crawled data is the search result data obtained by inputting the search data of the search record into the external search platform. Crawling the data with a crawler improves data acquisition efficiency, and the search result data corresponding to each search record is acquired record by record.
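A hedged sketch of steps 402 to 404 is given below. The URL layout, the `q` parameter, the `CrawlTask` type and the injected `fetch_titles` callable are all assumptions; a real crawler would add an HTTP client, HTML parsing, rate limiting and compliance with the platform's terms of use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
from urllib.parse import urlencode


@dataclass(frozen=True)
class CrawlTask:
    query: str   # search record from the target search system
    url: str     # search URL on the external platform (assumed layout)
    top_n: int   # number of leading results to keep


def build_tasks(queries: List[str], base_url: str, top_n: int = 5) -> List[CrawlTask]:
    """Step 402: build one crawling task per search record."""
    return [CrawlTask(q, f"{base_url}?{urlencode({'q': q})}", top_n) for q in queries]


def run_tasks(tasks: List[CrawlTask],
              fetch_titles: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Step 404: execute tasks; keep the first top_n titles in list order."""
    return {t.query: fetch_titles(t.url)[: t.top_n] for t in tasks}


def fake_fetch(url: str) -> List[str]:
    # Stand-in for a real HTTP fetch plus parsing of the result list.
    return [f"result {i} for {url}" for i in range(1, 8)]


tasks = build_tasks(["jump one jump"], "https://search.example.com/s")
results = run_tasks(tasks, fake_fetch)
print(tasks[0].url, len(results["jump one jump"]))
```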
In one embodiment, the number of search result data is plural. As shown in fig. 5, the search record is labeled with a category label according to the classification result, i.e., step 208 includes steps 502 to 508.
Step 502, obtain the classification result of each item of search result data.
Step 504, performing classification statistics on the classification results of the search result data to obtain a classification statistical result.
Step 506, determining the target category according to the classification statistical result.
Step 508, marking the target category as the category label of the search record, thereby labeling the search record with a category label.
Classification statistics are performed on the search result data to determine how many pieces of search result data belong to each category. These statistics implement an equal voting mechanism that determines the target category, which is then marked as the category label of the search record.
For example, as shown in fig. 6, for the search record "jump one jump", the first 5 corresponding search videos are obtained from the external search platform, and each video has corresponding text data (a video title). The text data corresponding to the first video is "jump one jump: a simple trick taught to you, easily get 500 points"; the text data corresponding to the second video is "original animation: jump one jump, an animation made to play an interesting character"; the text data corresponding to the third video is "real-life version of jump one jump, have you seen it"; the text data corresponding to the fourth video is "work out the psychological shadow area of playing jump one jump"; the text data corresponding to the fifth video is "jump one jump funny video, something different and too funny". The five pieces of search result data are respectively input into the classification model, and the results output by the classification model are: the first video is classified as "game", the second as "game", the third as "funny", the fourth as "game" and the fifth as "funny".
Classification statistics give 3 game videos and 2 funny videos, so "game" is determined to be the target category corresponding to "jump one jump", and "game" is then marked as the category label of "jump one jump".
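The equal voting in this example can be reproduced with a counter over the five classification results. The list below simply restates the categories given above; the variable names are illustrative.

```python
from collections import Counter

# Classification results of the five videos, in ranked order.
classification_results = ["game", "game", "funny", "game", "funny"]

# Steps 502 to 504: count how many results fall into each category.
statistics = Counter(classification_results)

# Steps 506 to 508: the category with the most votes becomes the label.
target_category, votes = statistics.most_common(1)[0]

print(dict(statistics), target_category)
```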
In one embodiment, as shown in FIG. 7, the target class is determined based on the classification statistics, i.e., step 506 includes steps 702 through 706.
Step 702, according to the classification statistical result, the category with the largest number of contained search result data is screened out.
Step 704, when the number of the categories is one, determining the category as the target category.
Step 706, when the number of the categories is multiple, respectively obtaining category probability data corresponding to the search result data included in each category, and selecting a target category from the screened categories according to the category probability data corresponding to the search result data included in each category.
The search result data is counted by classification category, grouping together the data that shares the same classification result, to obtain the number of pieces of search result data in each category. The classification statistical results are then sorted by this number to determine the category with the largest number, that is, the category containing the most search result data.
In this embodiment there are two cases. In the first, exactly one category has the largest number. In the second, two or more categories are tied for the largest number; the category with the higher probability is then selected from the tied categories as the target category according to the category probability data corresponding to each piece of search result data.
In a specific embodiment, the categories containing the largest number of search result data are taken as candidate categories, the category probability data of the search result data in each candidate category is summed, and the candidate category with the largest sum is taken as the target category.
In other embodiments, the classification statistics may instead be performed as follows: the search result data is grouped by classification category, the category probability data of the search result data in each category is summed to obtain a probability sum for each category, the categories are sorted by probability sum, and the category with the largest probability sum is determined to be the target category.
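Both variants can be sketched as follows. Each result is assumed to be a `(category, probability)` pair produced by the classification model; the function names and example probabilities are hypothetical.

```python
from collections import Counter, defaultdict


def target_by_count_then_probability(results):
    """Steps 702 to 706: most votes wins; ties go to the larger probability sum."""
    counts = Counter(category for category, _ in results)
    most = max(counts.values())
    candidates = [c for c, n in counts.items() if n == most]
    if len(candidates) == 1:
        return candidates[0]
    sums = defaultdict(float)
    for category, probability in results:
        if category in candidates:
            sums[category] += probability
    return max(candidates, key=lambda c: sums[c])


def target_by_probability_sum(results):
    """Alternative embodiment: the largest per-category probability sum wins."""
    sums = defaultdict(float)
    for category, probability in results:
        sums[category] += probability
    return max(sums, key=sums.get)


tied = [("game", 0.6), ("funny", 0.9), ("game", 0.7), ("funny", 0.8)]
skewed = [("game", 0.4), ("game", 0.4), ("funny", 0.95)]
print(target_by_count_then_probability(tied))
print(target_by_count_then_probability(skewed), target_by_probability_sum(skewed))
```

The `skewed` input shows that the two variants can disagree: counting favors the category with two low-confidence votes, while summing probabilities favors the single high-confidence result.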
In one embodiment, as shown in FIG. 8, the target class is determined based on the classification statistics, i.e., step 506 includes steps 802 through 806.
Step 802, obtaining a ranking position of search result data in a search result list of an external search platform.
Step 804, determining the weight data corresponding to each piece of search result data according to a preset association between ranking position and weight data.
Step 806, determining the target category according to the classification statistical result and the weight data.
In the embodiment, different weight data are set for the search result data at different sorting positions by presetting the incidence relation of the weight data at the sorting positions, and specifically, the numerical value of the corresponding weight data is larger for the search result data at the front sorting position. And performing weighted calculation on the data in the classification statistical result based on the weight data, and taking the class with the maximum calculation result as a target class.
Specifically, when the classification statistics are based on the number of search result data in the same category, the base value of each item of search result data is 1, and determining the target category according to the classification statistical result and the weight data includes: for each category, calculating the product of the base value and the weight data of each item of search result data in the category, and accumulating these products to obtain the calculation result of the category. Similarly, the classification statistics may be based on the probability data of the search result data in the same category, in which case the base value of each item of search result data is its probability data.
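The weighted calculation can be sketched as below. The weight table, the function name, and the `(rank, category, probability)` input format are hypothetical; the text only requires that earlier ranking positions receive larger weights.

```python
from collections import defaultdict

# Hypothetical preset association between ranking position and weight:
# earlier positions get larger weights.
WEIGHTS = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}


def weighted_target_category(results, weight_for_rank, use_probability=False):
    """Return the category with the largest weighted score.

    `results` is a non-empty list of (rank, category, probability) tuples,
    where rank 1 is the top result on the external search platform.
    The base value is 1 for count-based statistics, or the category
    probability when `use_probability` is True.
    """
    scores = defaultdict(float)
    for rank, category, probability in results:
        base = probability if use_probability else 1.0
        scores[category] += base * weight_for_rank[rank]
    return max(scores, key=scores.get)


results = [
    (1, "game", 0.9),
    (2, "game", 0.6),
    (3, "tutorial", 0.7),
    (4, "tutorial", 0.8),
    (5, "tutorial", 0.9),
]
uniform = {rank: 1 for rank in WEIGHTS}
```

In this example an unweighted count would pick "tutorial" (three results against two), but with `WEIGHTS` the two top-ranked "game" results score 9 against 6, so "game" becomes the target category.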
In one embodiment, the method further comprises: inputting the search data carrying the category labels into an initial search intention classification model as training data; and performing model training on the initial search intention classification model to obtain a search intention classification model for performing search intention classification processing on input search data.
In one embodiment, as shown in FIG. 9, a user inputs search data, for example "jump one jump". The search intention classification model performs intention classification analysis on the input search data and determines that the corresponding search intention is "game". Game-category video resource documents corresponding to "jump one jump" are then obtained from the index library corresponding to the game category, the obtained video resource documents are ranked through rough ranking and fine ranking, and video resources related to the search intention are returned to the user. As further examples, when the input search data is "how to open WeChat jump one jump", the intention category obtained by the search intention classification model is "skill teaching"; when the input search data is "old story", the intention category obtained is "TV play".
The application also provides an application scenario, and the application scenario applies the label labeling method. Specifically, the application of the label labeling method in the application scenario is as follows:
Take an applet video sub-search system as an example. A basic flow for producing intention labeling data is built on an external, mature video search platform. First, query data of the target search system is given; for example, recent queries in the system are collected as the basic query data for directed crawling of the external platform. Next, a crawler task is built for each given query, and the ranking results of the mature video search platform are crawled; for example, the first 5 docs (text data) in the ranking results returned by the external video search platform are taken as input to a pre-trained classification model. The pre-trained classification model, for example BERT, then classifies the crawled docs and predicts a category label for each doc, and the category label that appears most often among the 5 docs is selected as the final label of the query. In this way, a semi-supervised labeling scheme for query intention training data is built with the help of the external search platform in the cold-start stage. With this process, labeling data of good quality can be obtained quickly and effectively for training the query intention model even when query-doc historical exposure and click data and manually labeled training data are lacking. The method is suitable for obtaining query intention labeling data in most search systems, and is an effective way to reduce labeling cost and accumulate training data quickly.
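The flow in this scenario can be sketched end to end as follows. The crawler and the classifier are passed in as plain functions, and the stand-ins `fake_fetch` and `fake_classify` are hypothetical placeholders for a real crawler and a real pre-trained model such as BERT; none of these names come from the patent.

```python
from collections import Counter
from typing import Callable, List, Optional


def label_query(query: str,
                fetch_top_docs: Callable[[str, int], List[str]],
                classify: Callable[[str], str],
                top_k: int = 5) -> Optional[str]:
    """Label a query with the most frequent category among its top-k docs."""
    docs = fetch_top_docs(query, top_k)
    if not docs:
        return None  # nothing crawled, so the query stays unlabeled
    votes = Counter(classify(doc) for doc in docs)
    # most_common(1) returns the category with the highest vote count.
    return votes.most_common(1)[0][0]


def fake_fetch(query: str, k: int) -> List[str]:
    """Hypothetical stand-in for crawling an external video search platform."""
    docs = [
        "jump one jump high score gameplay",
        "jump one jump game walkthrough",
        "how to open jump one jump tutorial",
        "jump one jump game record",
        "jump one jump funny clip",
    ]
    return docs[:k]


def fake_classify(doc: str) -> str:
    """Hypothetical keyword classifier standing in for a pre-trained model."""
    if "tutorial" in doc:
        return "skill teaching"
    if "game" in doc:
        return "game"
    return "other"


label = label_query("jump one jump", fake_fetch, fake_classify)
```

Here three of the five docs are classified as "game", so `label` is "game". Keeping the crawler and classifier as parameters is only a convenience for the sketch; it lets the voting logic be exercised without network access or a trained model.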
In one embodiment, a label labeling method is provided. As shown in FIG. 10, the label labeling method includes the following steps 1002 to 1024.
Step 1002, obtaining a search record to be labeled in the target search system.
Step 1004, constructing a crawling task according to the search record.
Step 1006, identifying the index library data type of the target search system, and determining an external search platform matching the index library data type.
Step 1008, executing the crawling task, and performing data crawling processing on the external search platform to obtain search result data corresponding to the search record.
Step 1010, inputting the search result data into a preset classification model to obtain a classification result of each item of search result data.
Step 1012, performing classification statistics on the classification results of the search result data to obtain a classification statistical result.
Step 1014, screening out the category containing the largest number of search result data according to the classification statistical result.
Step 1016, when the number of screened categories is one, determining that category to be the target category.
Step 1018, when the number of screened categories is more than one, obtaining the category probability data corresponding to the search result data contained in each category, and selecting the target category from the screened categories according to the category probability data corresponding to the search result data contained in each category.
Step 1020, marking the target category as the category label of the search record, and labeling the search record with the category label.
Step 1022, inputting the search data carrying the category label as training data into the initial search intention classification model.
Step 1024, performing model training on the initial search intention classification model to obtain a search intention classification model for performing search intention classification processing on input search data.
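Steps 1004 and 1006 above (building a crawling task and matching an external platform to the index library data type) might look like the sketch below. The platform URLs, the `q` query parameter, and the `CrawlTask` structure are all invented for illustration; the patent does not specify them, and actually executing the crawl (step 1008) is left out.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# Hypothetical mapping from index library data type to an external platform.
# The example.com URLs are placeholders, not real search endpoints.
PLATFORM_BY_DATA_TYPE = {
    "video": "https://video-search.example.com/search",
    "article": "https://article-search.example.com/search",
}


@dataclass(frozen=True)
class CrawlTask:
    query: str   # the search record to be labeled
    url: str     # request URL on the matched external platform
    top_k: int   # how many ranked results to keep


def build_crawl_task(search_record: str, index_data_type: str,
                     top_k: int = 5) -> CrawlTask:
    """Match a platform to the index data type and build the crawl request."""
    base_url = PLATFORM_BY_DATA_TYPE[index_data_type]  # KeyError if unknown
    url = f"{base_url}?{urlencode({'q': search_record})}"
    return CrawlTask(query=search_record, url=url, top_k=top_k)


task = build_crawl_task("jump one jump", "video")
```

For a video index library, `task.url` points at the placeholder video platform with the search record URL-encoded as the query string.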
In another embodiment, a label labeling method is also provided. As shown in FIG. 11, the label labeling method includes the following steps 1102 to 1124.
Step 1102, obtaining a search record to be labeled in the target search system.
Step 1104, constructing a crawling task according to the search record.
Step 1106, identifying the index library data type of the target search system, and determining an external search platform matching the index library data type.
Step 1108, executing the crawling task, and performing data crawling processing on the external search platform to obtain search result data corresponding to the search record.
Step 1110, inputting the search result data into a preset classification model to obtain a classification result of each item of search result data.
Step 1112, performing classification statistics on the classification results of the search result data to obtain a classification statistical result.
Step 1114, obtaining the ranking position of each item of search result data in the search result list of the external search platform.
Step 1116, determining the weight data corresponding to each item of search result data according to a preset association between ranking positions and weight data.
Step 1118, determining the target category according to the classification statistical result and the weight data.
Step 1120, marking the target category as the category label of the search record, and labeling the search record with the category label.
Step 1122, inputting the search data carrying the category label as training data into the initial search intention classification model.
Step 1124, performing model training on the initial search intention classification model to obtain a search intention classification model for performing search intention classification processing on input search data.
It should be understood that although the steps in the flowcharts of FIGS. 2-5, 7-8, and 10-11 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-5, 7-8, and 10-11 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 12, a label labeling apparatus 1200 is provided. The apparatus may be implemented as part of a computer device using software modules, hardware modules, or a combination of the two, and specifically includes a search record acquisition module 1202, an external data acquisition module 1204, a data classification module 1206, and a label labeling module 1208, wherein:
The search record acquisition module 1202 is configured to obtain a search record to be labeled in the target search system.
The external data acquisition module 1204 is configured to obtain search result data corresponding to the search record from an external search platform.
The data classification module 1206 is configured to input the search result data into a preset classification model to obtain a classification result corresponding to the search result data, where the classification categories of the classification model are the same as the classification categories of the index library of the target search system.
The label labeling module 1208 is configured to label the search record with a category label according to the classification result.
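One way to picture how the four modules cooperate is the composition sketch below. It is only an illustration under stated assumptions: the class name, the callable-based module interfaces, and the `run` method are invented here, and a real apparatus may be built from hardware modules or differently structured software.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LabelLabelingApparatus:
    # Each field stands in for one module of apparatus 1200 (assumed interfaces).
    acquire_search_records: Callable[[], List[str]]    # module 1202
    acquire_external_data: Callable[[str], List[str]]  # module 1204
    classify_data: Callable[[str], str]                # module 1206
    label_record: Callable[[List[str]], str]           # module 1208

    def run(self) -> Dict[str, str]:
        """Label every search record that has external search result data."""
        labels: Dict[str, str] = {}
        for record in self.acquire_search_records():
            docs = self.acquire_external_data(record)
            results = [self.classify_data(doc) for doc in docs]
            if results:
                labels[record] = self.label_record(results)
        return labels


# Toy stand-ins for the four modules, used only to exercise the sketch.
apparatus = LabelLabelingApparatus(
    acquire_search_records=lambda: ["q1", "q2"],
    acquire_external_data=lambda q: ["a game", "b game", "c show"] if q == "q1" else [],
    classify_data=lambda doc: "game" if "game" in doc else "tv",
    label_record=lambda results: max(set(results), key=results.count),
)
labels = apparatus.run()
```

In this toy run, "q1" is labeled "game" by majority, and "q2" stays unlabeled because no external search result data was obtained for it.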
In one embodiment, the label labeling apparatus further comprises an external search platform determination module, configured to identify the index library data type of the target search system, and to determine an external search platform matching the index library data type according to the index library data type.
In one embodiment, the external data acquisition module is further configured to construct a crawling task according to the search record, and to execute the crawling task and perform data crawling processing on the external search platform to obtain search result data corresponding to the search record.
In one embodiment, there are multiple items of search result data; the label labeling module is further configured to obtain the classification result of each item of search result data; perform classification statistics on the classification results of the search result data to obtain a classification statistical result; determine a target category according to the classification statistical result; mark the target category as the category label of the search record; and label the search record with the category label.
In one embodiment, the label labeling module is further configured to obtain the ranking position of each item of search result data in the search result list of the external search platform; determine the weight data corresponding to each item of search result data according to a preset association between ranking positions and weight data; and determine the target category according to the classification statistical result and the weight data.
In one embodiment, the label labeling module is further configured to screen out the category containing the largest number of search result data according to the classification statistical result; when the number of screened categories is one, determine that category to be the target category; and when the number of screened categories is more than one, obtain the category probability data corresponding to the search result data contained in each category, and select the target category from the screened categories according to the category probability data corresponding to the search result data contained in each category.
In one embodiment, the label labeling apparatus further comprises a search intention classification model training module, configured to input the search data carrying the category label as training data into the initial search intention classification model, and to perform model training on the initial search intention classification model to obtain a search intention classification model for performing search intention classification processing on input search data.
The label labeling apparatus obtains search result data corresponding to the search records by means of an external search platform, thereby expanding the search results of the search records. By inputting the search result data into the classification model, it obtains classification results whose categories are the same as the classification categories of the index library of the target search system, thereby associating the external search result data with the target search system. It then labels the search records with category labels according to the classification results, achieving semi-supervised label labeling. The approach does not depend on whether the search system has a large amount of historical search click data, is suitable for a search system in the cold-start stage, effectively improves the labeling efficiency of search records, and accumulates training data quickly.
For specific limitations of the label labeling apparatus, reference may be made to the above limitations of the label labeling method, which are not repeated here. The modules in the label labeling apparatus can be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in FIG. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing search result data. The network interface of the computer device is used for communicating with an external terminal through a network connection. When executed by the processor, the computer program implements a label labeling method.
Those skilled in the art will appreciate that the architecture shown in FIG. 13 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this specification.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A label labeling method, comprising:
acquiring a search record to be labeled in a target search system;
acquiring search result data corresponding to the search record from an external search platform;
inputting the search result data into a preset classification model to obtain a classification result corresponding to the search result data, wherein the classification category of the classification model is the same as the classification category of an index library of the target search system;
and performing category label labeling on the search record according to the classification result.
2. The method of claim 1, further comprising, prior to obtaining search result data corresponding to the search record from an external search platform:
identifying an index library data type of the target search system;
and determining an external search platform matched with the index library data type according to the index library data type.
3. The method of claim 1, wherein the obtaining search result data corresponding to the search record from an external search platform comprises:
building a crawling task according to the search record;
and executing the crawling task, and performing data crawling processing on an external search platform to obtain search result data corresponding to the search records.
4. The method of claim 1, wherein the number of the search result data is plural;
the step of labeling the search records with category labels according to the classification result comprises:
obtaining the classification result of each item of search result data;
performing classification statistics on the classification results of the search result data to obtain a classification statistical result;
determining a target category according to the classification statistical result;
and marking the target category as a category label of the search record, and labeling the search record with the category label.
5. The method of claim 4, wherein determining a target category according to the classification statistical result comprises:
acquiring the ranking position of the search result data in a search result list of the external search platform;
determining weight data corresponding to each item of search result data according to a preset association between ranking positions and weight data;
and determining a target category according to the classification statistical result and the weight data.
6. The method of claim 4, wherein determining a target category according to the classification statistical result comprises:
screening out the category containing the largest number of search result data according to the classification statistical result;
when the number of the categories is multiple, respectively acquiring category probability data corresponding to search result data contained in each category;
and selecting a target category from the screened categories according to category probability data corresponding to the search result data contained in each category.
7. The method of claim 1, further comprising:
inputting the search data with the category labels as training data into an initial search intention classification model;
and carrying out model training on the initial search intention classification model to obtain a search intention classification model for carrying out search intention classification processing on input search data.
8. A label labeling device, characterized in that the device comprises:
a search record acquisition module, configured to acquire a search record to be labeled in a target search system;
an external data acquisition module, configured to acquire search result data corresponding to the search record from an external search platform;
a data classification module, configured to input the search result data into a preset classification model to obtain a classification result corresponding to the search result data, wherein the classification category of the classification model is the same as the classification category of an index library of the target search system;
and a label labeling module, configured to label the search record with a category label according to the classification result.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202010772268.2A 2020-08-04 2020-08-04 Label labeling method and device, computer equipment and storage medium Pending CN112749313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010772268.2A CN112749313A (en) 2020-08-04 2020-08-04 Label labeling method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112749313A true CN112749313A (en) 2021-05-04

Family

ID=75645263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010772268.2A Pending CN112749313A (en) 2020-08-04 2020-08-04 Label labeling method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112749313A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899065A (en) * 2015-06-11 2015-09-09 武汉虹信通信技术有限责任公司 Method and system for batch online recovery and software online upgrading
WO2017024884A1 (en) * 2015-08-07 2017-02-16 广州神马移动信息科技有限公司 Search intention identification method and device
CN111078885A (en) * 2019-12-18 2020-04-28 腾讯科技(深圳)有限公司 Label classification method, related device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344078A (en) * 2021-06-09 2021-09-03 北京三快在线科技有限公司 Model training method and device
CN113344078B (en) * 2021-06-09 2022-11-04 北京三快在线科技有限公司 Model training method and device

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40048680; Country of ref document: HK)
SE01 Entry into force of request for substantive examination