CN111950254A - Method, device and equipment for extracting word features of search sample and storage medium - Google Patents


Info

Publication number
CN111950254A
CN111950254A (application CN202011003276.7A; granted publication CN111950254B)
Authority
CN
China
Prior art keywords
search
word
sample
samples
words
Prior art date
Legal status
Granted
Application number
CN202011003276.7A
Other languages
Chinese (zh)
Other versions
CN111950254B (en)
Inventor
徐思琪 (Xu Siqi)
钟辉强 (Zhong Huiqiang)
陈亮辉 (Chen Lianghui)
方军 (Fang Jun)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011003276.7A priority Critical patent/CN111950254B/en
Publication of CN111950254A publication Critical patent/CN111950254A/en
Application granted granted Critical
Publication of CN111950254B publication Critical patent/CN111950254B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 40/00 Handling natural language data
                    • G06F 40/20 Natural language analysis
                        • G06F 40/205 Parsing
                        • G06F 40/216 Parsing using statistical methods
                    • G06F 40/30 Semantic analysis
                • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
                    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
                        • G06F 16/35 Clustering; Classification
                            • G06F 16/355 Class or cluster creation or modification
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                        • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this application disclose a method, apparatus, device, and storage medium for extracting word features of search samples, relating to the field of artificial intelligence and, in particular, to natural language processing, big data, and related technical fields. One embodiment of the method comprises: determining filtered search terms based on the labels of the search samples and the search-frequency information of the search terms in the search samples; obtaining semantic vectors of the filtered search terms and clustering them to obtain a word packet for each cluster; and using the features of a search sample's word packets as the input of a characterization model, using the sample's label as the training target for supervised training, and taking an intermediate-layer result of the characterization model as the characterization features of the sample's word packets. By filtering and clustering the search terms, subsequent training of machine-learning models is facilitated and model performance is improved.

Description

Method, device and equipment for extracting word features of search sample and storage medium
Technical Field
The present application relates to the field of computer technologies, in particular to the field of artificial intelligence, and further to, but not limited to, natural language processing, big data, and related technical fields; specifically, it relates to a method, apparatus, device, and storage medium for extracting word features of search samples.
Background
At present, word-feature mining methods based on search-engine search samples are mainly unsupervised. That is: first, given a dictionary and candidate search samples, the occurrence frequency of words in the candidate search samples over a period of time is counted; then the words are sorted by frequency in descending order and divided evenly into several word packets; finally, word-packet features are constructed from the word packets, where the feature of a word packet is the number of distinct words from it that appear in the frequency statistics.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for extracting word features of a search sample.
In a first aspect, an embodiment of the present application provides a method for extracting word features of a search sample, including: determining the screened search terms based on the labels of the search samples and the search frequency information of the search terms in the search samples; obtaining semantic vectors of the screened search words, and clustering the semantic vectors of the screened search words to obtain word packets of each cluster; and taking the characteristics of the word packet of the search sample as the input of the characterization model, taking the label of the search sample as a training target of the characterization model for supervised training, and taking at least one intermediate layer result of the trained characterization model as the characterization characteristics of the word packet of the search sample.
In a second aspect, an embodiment of the present application provides a word feature extraction apparatus for a search sample, including: a search term determining module configured to determine the filtered search terms based on the labels of the search samples and the search-frequency information of the search terms in the search samples; a clustering module configured to obtain the semantic vectors of the filtered search terms and cluster them to obtain the word packet of each cluster; and a characterization model training module configured to take the features of the word packet of the search sample as the input of the characterization model, take the label of the search sample as the training target for supervised training, and take at least one intermediate-layer result of the trained model as the characterization features of the word packet of the search sample.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.
In a fourth aspect, embodiments of the present application propose a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described in any one of the implementations of the first aspect.
According to the method, apparatus, device, and storage medium for extracting word features of search samples, first the filtered search terms are determined based on the labels of the search samples and the search-frequency information of the search terms in the samples; then the semantic vectors of the filtered search terms are obtained and clustered to obtain the word packet of each cluster; finally, the features of a search sample's word packets are used as the input of the characterization model, the sample's label as the training target for supervised training, and at least one intermediate-layer result of the trained model as the characterization features of the sample's word packets. Filtering and clustering the search terms facilitates subsequent training of machine-learning models and improves their performance.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings. The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a schematic flow chart diagram illustrating one embodiment of a method for word feature extraction for a search sample according to the present application;
FIG. 3 is a flow diagram of one embodiment of a method of screening search terms according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a word feature extraction apparatus for searching a sample according to the present application;
fig. 5 is a block diagram of an electronic device for implementing a word feature extraction method for a search sample according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings. Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the search sample word feature extraction method or search sample word feature extraction apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
Terminal device 101 may interact with server 103 through network 102. The terminal device 101 may provide a search sample with tag data, including but not limited to a database, a user terminal, and the like.
The server 103 may provide various services. For example, the server 103 may analyze and otherwise process data such as labeled search samples acquired from the terminal device 101, and generate a processing result (e.g., a trained characterization model).
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for extracting word features of a search sample provided in the embodiment of the present application is generally executed by the server 103, and accordingly, the device for extracting word features of a search sample is generally disposed in the server 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for word feature extraction of a search sample according to the present application is shown. The method for extracting the word features of the search sample comprises the following steps of:
step 201, determining the screened search terms based on the labels of the search samples and the search frequency information of the search terms in the search samples.
In this embodiment, the executing subject of the word feature extraction method (for example, the server 103 shown in fig. 1) may determine the filtered search terms based on the label of each search sample and the search-frequency information of each search term in each search sample. A search sample may be a user's search record on a search engine, for example a user's search sample on Baidu Search. Since this embodiment uses supervised learning, the executing subject needs the label data of the search samples; for example, a search sample labeled "1" may serve as a positive sample and a search sample labeled "0" as a negative sample. The search-frequency information of each search term in each search sample refers to its search frequency within a specific time period; for example, the search frequency of each term may be counted by day for each sample. After the search frequency of each term is obtained, the N terms with the highest search frequency in the positive samples may be selected as the filtered search terms, where N is a positive integer.
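As a minimal illustrative sketch (not taken from the patent; the helper name and data layout are assumptions), selecting the N highest-frequency terms from the positive samples could look like:

```python
from collections import Counter

def top_positive_terms(samples, n):
    """samples: list of (label, list_of_search_terms) pairs.
    Returns the n terms searched most frequently in positive (label == 1) samples."""
    freq = Counter()
    for label, terms in samples:
        if label == 1:
            freq.update(terms)          # accumulate per-term search counts
    return [w for w, _ in freq.most_common(n)]
```

In practice the counts would be aggregated per day over the chosen time window before ranking.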
Step 202, obtaining semantic vectors of the screened search terms, and clustering the semantic vectors of the screened search terms to obtain word packets of each cluster.
In this embodiment, the executing subject may obtain the semantic vectors of the filtered search terms and cluster them to obtain a word packet for each cluster. The semantic vector of a filtered search term may be obtained through natural language processing tools, such as Baidu's Natural Language Processing Cloud (NLPC), the Language Technology Platform (LTP) of Harbin Institute of Technology, or the Stanford NLP toolkit. K-means clustering is then performed on the semantic vectors of the filtered search terms, so that the word packet of each cluster is semantically consistent. Illustratively, the number of clusters may be set to 300, with each cluster forming a word packet. The K-means algorithm is a classical partition-based clustering method. Its basic idea is: take k points in the space as centers and assign each semantic vector to the nearest center, then iteratively update each cluster center until the best clustering result is obtained. Semantic vectors within the same cluster have high semantic similarity, while semantic vectors in different clusters have low semantic similarity.
And 203, taking the characteristics of the word packet of the search sample as the input of the characterization model, taking the label of the search sample as the training target of the characterization model for supervised training, and taking the intermediate layer result of the characterization model as the characterization characteristics of the word packet of the search sample.
In this embodiment, a word packet contains multiple search terms, each with corresponding features. The features of the search terms in a word packet can be aggregated, and these statistics used as the features of the word packet. For example, the features of one word packet of a search sample may have 4 dimensions: the number of distinct search terms of the sample appearing in the word packet, the total frequency of all its search terms, the proportion of distinct search terms, and the proportion of total frequency, where a proportion is the ratio of a given dimension's value in this word packet to the sum of that dimension's values over all word-packet features. The word-packet features of the training data are taken as input, the predicted probability of the label of the corresponding search sample as output, and the initial characterization model is trained in a supervised manner with the labels of the search samples, yielding the trained characterization model. The characterization vector of the word-packet features output by at least one intermediate layer of the trained model is taken as the word features of the corresponding search sample. The initial model here may be an untrained characterization model or one whose training is not yet complete. Each layer of an untrained characterization model may be given initial parameters, which are adjusted continually during training.
The untrained characterization model may be any type of untrained or not fully trained artificial neural network (also called a multilayer perceptron), or a combination of such networks; for example, it may be an untrained convolutional neural network, an untrained recurrent neural network, or a combination of an untrained convolutional neural network, an untrained recurrent neural network, and an untrained fully connected layer. Characterizing the word-packet features with a multilayer perceptron turns the sparse word-packet features into dense ones, which facilitates subsequent machine-learning modeling.
In this embodiment, the characterization model is trained in a supervised manner. For example, the labels of the search samples can be used as the target (y) values of the characterization model while it characterizes the word-packet features; the model is trained and its initial parameters adjusted according to how well its output fits the labels.
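The supervised characterization step can be sketched as a tiny one-hidden-layer perceptron trained on word-packet features, whose hidden activation is then read out as the dense representation. This is an illustrative assumption about the architecture, not the patent's actual model:

```python
import numpy as np

def train_characterization_model(X, y, hidden=8, lr=0.5, epochs=2000, seed=0):
    """X: word-packet feature matrix, y: sample labels (0/1).
    Trains a one-hidden-layer MLP with a sigmoid output; after training,
    the hidden activation serves as the dense characterization feature
    (the 'intermediate-layer result')."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    t = y.reshape(-1, 1).astype(float)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                      # intermediate layer
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # predicted label probability
        dlogit = (p - t) / len(X)                     # binary cross-entropy gradient
        dh = (dlogit @ W2.T) * (1.0 - h ** 2)         # backprop through tanh
        W2 -= lr * (h.T @ dlogit)
        b2 -= lr * dlogit.sum(axis=0)
        W1 -= lr * (X.T @ dh)
        b1 -= lr * dh.sum(axis=0)
    embed = lambda Xq: np.tanh(Xq @ W1 + b1)          # characterization features
    predict = lambda Xq: 1.0 / (1.0 + np.exp(-(embed(Xq) @ W2 + b2)))
    return embed, predict
```

After training, `embed` (the hidden-layer output) replaces the sparse word-packet features as input to downstream models, while `predict` is discarded.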
The embodiments of this application provide a search-term feature mining scheme based on supervised data, involving intelligent search, machine learning, and related technical fields. Compared with the prior art, the search terms are filtered, which avoids giving low-discrimination and high-discrimination words within a word packet of one dimension the same importance; and the word packets are constructed by clustering, so that the word packet of each dimension is semantically consistent. This facilitates subsequent training of machine-learning models and improves their performance.
With continued reference to FIG. 3, a flowchart of one embodiment of a method of screening search terms is shown, according to an embodiment of the present application. As shown in fig. 3, the method includes:
step 301, counting at least one item of data: the total number of positive samples of the search samples and the total number of negative samples of the search samples, the number of positive samples of any search word and the number of negative samples of any search word, the total number of positive sample searches of any search word and the total number of negative sample searches of any search word, and the total number of searches of all search words of the positive samples and the total number of searches of all search words of the negative samples.
In this embodiment, the executing subject may perform the statistical operations of step 301. Specifically, a set of search samples is first given, with positive samples labeled 1 and negative samples labeled 0. Relevant statistics are then computed for each word appearing in the search data of this batch of samples. Here, the total number of positive samples is the total count of positive samples in the given batch; the total number of negative samples is the total count of negative samples in the batch; the positive-sample count of a word is the number of positive samples in which that word is searched; the negative-sample count of a word is the number of negative samples in which that word is searched; the total positive-sample search count of all search terms is the sum of each term's positive-sample search counts; and the total negative-sample search count of all search terms is the sum of each term's negative-sample search counts.
Step 302: calculate the discrimination index value of any search term based on the statistical data.

In this embodiment, the discrimination index value characterizes the importance of a search term: the higher a term's positive-sample search frequency, the higher its discrimination index value and importance; likewise, the more positive samples a term appears in, the higher its discrimination index value and importance. Specifically, the total number of positive samples t_p_sc, the total number of negative samples t_n_sc, the positive-sample count p_sc of any search term, and its negative-sample count n_sc may be counted. The positive/negative sample discrimination index rate_sc of the term is then calculated as:

rate_sc = py_sc / pn_sc, where py_sc = p_sc / t_p_sc and pn_sc = n_sc / t_n_sc.

Alternatively, the total positive-sample search count p_wc of any search term, its total negative-sample search count n_wc, the total search count t_p_wc of all terms in positive samples, and the total search count t_n_wc of all terms in negative samples may be counted. The search-count discrimination index rate_wc of the term is then calculated as:

rate_wc = py_wc / pn_wc, where py_wc = p_wc / t_p_wc and pn_wc = n_wc / t_n_wc.

Optionally, rate_sc and rate_wc may be combined by a weighted sum, and the result used as the term's discrimination index, for example: 0.7 * rate_sc + 0.3 * rate_wc.
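The two discrimination indices and their weighted combination can be expressed directly in code (a sketch; the epsilon guard against division by zero is an addition, not part of the patent's formulas):

```python
def discrimination_index(p_sc, n_sc, t_p_sc, t_n_sc,
                         p_wc, n_wc, t_p_wc, t_n_wc,
                         w_sc=0.7, w_wc=0.3, eps=1e-9):
    """Weighted combination of the sample-level index rate_sc and the
    search-count index rate_wc for one search term."""
    py_sc = p_sc / t_p_sc            # share of positive samples containing the term
    pn_sc = n_sc / t_n_sc            # share of negative samples containing the term
    rate_sc = py_sc / (pn_sc + eps)
    py_wc = p_wc / t_p_wc            # term's share of all positive-sample searches
    pn_wc = n_wc / t_n_wc            # term's share of all negative-sample searches
    rate_wc = py_wc / (pn_wc + eps)
    return w_sc * rate_sc + w_wc * rate_wc
```

A term that is searched proportionally more often in positive samples than in negative ones receives an index above 1, marking it as discriminative.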
And step 303, determining the screened search terms based on the discrimination index values of the search terms.
In this embodiment, search terms may be screened according to a preset value range of the discrimination index: terms whose index value falls within the predetermined range are filtered out, for example with a predetermined range of 0.2 to 0.8.
In this embodiment of the application, screening the search terms differentiates their importance, preventing low-discrimination terms from being treated as equally important as high-discrimination terms within the same dimension.
In some optional implementations of this embodiment, clustering the semantic vectors of the filtered search terms to obtain the word packet of each cluster includes: selecting the N search terms with the highest discrimination index values in the word packet of any cluster as candidate seed words, and expanding to the K nearest-neighbor words of the candidate seed words' semantic vectors to obtain target seed words, where N and K are positive integers; and clustering the semantic vectors again using the target seed words. Each search term is represented by a semantic vector, and the distance between the semantic vectors of two similar search terms is very small. The k semantic vectors closest to a seed word's vector yield the corresponding k closest search terms. For example, for the word packet of each cluster, the 3 terms with the highest index values in the cluster may be selected as seed words and expanded to the k nearest neighbors of their semantic vectors (e.g., k = 10); after expansion, the semantic vectors are clustered again to form the new word packet of the cluster. Expanding the constructed word packets with additional words makes the expression of each word packet richer.
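The k-nearest-neighbor expansion of a seed word can be sketched as follows (Euclidean distance is an assumption; the patent only says the vectors are "closest"):

```python
import math

def nearest_words(seed_vec, vocab, k):
    """vocab: dict mapping word -> semantic vector (tuple of floats).
    Returns the k words whose vectors are closest to seed_vec."""
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(seed_vec, v)))
    return sorted(vocab, key=lambda w: dist(vocab[w]))[:k]
```

The union of each seed word with its k neighbors forms the target seed words used for re-clustering.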
In some optional implementations of this embodiment, the number of active days of each search term in each search sample may be obtained before performing step 201. Specifically, the active state of each search word of each search sample may be counted by day first; and summarizing the number of active days of each search term in a certain time interval, for example, summarizing the number of active days of each search term of each sample monthly. In this embodiment, step 202 further includes: clustering the semantic vectors; and constructing a word packet with a plurality of dimensional characteristics for each class cluster based on the searching frequency information of each searching word and the active days of each searching word. For example, a 6-dimensional feature may be constructed for one word packet of a search sample based on frequency information of each search word and an active day number of each search word, where the 6-dimensional feature is the number of different search words appearing in the word packet in the search sample, the total frequency of all search words, the total active day number of all search words, the ratio of the number of different search words, the ratio of the total frequency of all search words, and the ratio of the total active day number of all search words, where the ratio is the ratio of a certain dimensional feature in the word packet to the sum of corresponding dimensional features in all word packet features. In this embodiment, when constructing the feature of the word package, the feature of one word package is considered from multiple dimensions, and the feature of one word package not only includes the number of different words in the dimension word package appearing in the statistical data, but also includes the total search frequency, the number of active days, and the proportion feature of different words in the dimension word package, thereby enriching the feature of the word package.
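The 6-dimensional word-packet feature construction described above might be sketched like this (names and data shapes are illustrative, not from the patent):

```python
def word_packet_features(packets, sample_stats):
    """packets: {cluster_id: set of words}. sample_stats: {word: (frequency,
    active_days)} for one search sample. Returns, per packet, a 6-dim feature:
    distinct-word count, total frequency, total active days, plus each value's
    share of the corresponding total across all packets."""
    raw = {}
    for cid, words in packets.items():
        hits = [w for w in words if w in sample_stats]
        raw[cid] = (len(hits),
                    sum(sample_stats[w][0] for w in hits),
                    sum(sample_stats[w][1] for w in hits))
    # per-dimension totals over all packets (guarded so empty stats divide by 1)
    totals = [sum(v[d] for v in raw.values()) or 1 for d in range(3)]
    return {cid: v + tuple(v[d] / totals[d] for d in range(3))
            for cid, v in raw.items()}
```

The three proportion dimensions normalize each packet's counts against all packets, so the features of a sample sum to 1 along each proportional dimension.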
In some optional implementations of this embodiment, a user search sample may be input into the characterization model trained in step 203 to obtain the word features of the search sample. The word features of the search sample are then used as the input of an information push model, the pushed information as its expected output, and the information push model is trained to obtain a trained push model. Using the characterization vector of the search sample output by the model of step 203 as the input of the information push model allows the user's interests and preferences to be fully mined, thereby improving the accuracy of the information pushed by the model.
With further reference to fig. 4, as an implementation of the method shown in the above figures, the present application provides an embodiment of a word feature extraction apparatus for a search sample. This apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 4, the word feature extraction apparatus 400 of the search sample of the present embodiment may include: a search term determining module 401, a clustering module 402 and a characterization model training module 403. The search term determining module 401 is configured to determine the filtered search terms based on the tags of the search samples and the search frequency information of the search terms in the search samples; a clustering module 402, configured to obtain semantic vectors of the screened search terms, and cluster the semantic vectors of the screened search terms to obtain a word packet of each cluster; and the characterization model training module 403 is configured to use the features of the word packet of the search sample as input of the characterization model, use the label of the search sample as a training target of the characterization model for supervised training, and use the intermediate layer result of the characterization model as the characterization features of the word packet of the search sample.
In this embodiment, for the specific processing of the search term determining module 401, the clustering module 402, and the characterization model training module 403 in the word feature extraction apparatus 400, and its technical effects, reference may be made to the descriptions of steps 201 to 203 in the embodiment corresponding to fig. 2, which are not repeated here.
In some optional implementations of this embodiment, the search term determination module is further configured to: collect statistics on at least one of the following items: the total number of positive samples and the total number of negative samples among the search samples; the number of positive samples and the number of negative samples containing any search word; the total number of searches for any search word in the positive samples and in the negative samples; and the total number of searches for all search words in the positive samples and in the negative samples; calculate a discrimination index value for any search word based on the statistical data; and determine the filtered search words based on the discrimination index values of the search words.
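The discrimination index formula itself is not fixed by this application. A minimal Python sketch, assuming a chi-square-style statistic over the positive/negative counts listed above (the function names and the threshold value are illustrative assumptions):

```python
def discrimination_index(pos_hits, neg_hits, pos_total, neg_total):
    """Chi-square-style discrimination score for one search word.

    pos_hits/neg_hits: positive/negative samples containing the word;
    pos_total/neg_total: total positive/negative samples.
    """
    a, b = pos_hits, neg_hits                            # samples with the word
    c, d = pos_total - pos_hits, neg_total - neg_hits    # samples without it
    n = pos_total + neg_total
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def filter_search_words(word_stats, pos_total, neg_total, threshold):
    """Keep words whose discrimination index meets the threshold.

    word_stats maps each word to its (pos_hits, neg_hits) pair.
    """
    return [w for w, (p, q) in word_stats.items()
            if discrimination_index(p, q, pos_total, neg_total) >= threshold]
```

A word searched equally often by positive and negative samples scores zero and is filtered out, while a word concentrated in one class scores highly and is retained.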
In some optional implementations of this embodiment, the clustering module is further configured to: select the N search words with the highest discrimination index values in the word packet of any cluster as candidate seed words, and expand the K nearest neighbor words of the semantic vectors of the candidate seed words to obtain target seed words, where N and K are positive integers; and cluster the semantic vectors using the target seed words.
In some optional implementations of this embodiment, the apparatus further includes: an acquisition module configured to acquire the number of active days of each search word in each search sample. The clustering module includes: a clustering submodule configured to cluster the semantic vectors; and a feature construction module configured to construct, for each cluster, a word packet with features in a plurality of dimensions based on the search frequency information and the number of active days of each search word.
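The multi-dimensional word-packet features can be sketched as follows. The four feature dimensions chosen here are illustrative; the application only requires that both search frequency and active-day information contribute:

```python
def build_word_packet_features(clusters, search_counts, active_days):
    """For each cluster, build a feature dict from one sample's per-word
    search frequencies and active-day counts.

    clusters: cluster id -> list of words in that cluster's word packet;
    search_counts / active_days: word -> count for the current sample.
    """
    features = {}
    for cid, words in clusters.items():
        hits = [w for w in words if search_counts.get(w, 0) > 0]
        total = sum(search_counts.get(w, 0) for w in words)
        days = sum(active_days.get(w, 0) for w in words)
        features[cid] = {
            "matched_words": len(hits),          # distinct cluster words searched
            "total_searches": total,             # summed search frequency
            "total_active_days": days,           # summed active days
            "avg_searches_per_day": total / days if days else 0.0,
        }
    return features
```

A sample that searched "loan" four times over two active days and "credit" twice over one day would yield a finance-cluster feature of two matched words, six total searches, three active days, and two searches per day.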
In some optional implementations of this embodiment, the characterization model is a multi-layer neural network model.
In some optional implementations of this embodiment, the characterization model application module is configured to input a user search sample to the trained characterization model to obtain the word features of the search sample; and the information push model training module is configured to take the word features of the search sample as the input of the information push model, take the pushed information as the expected output of the information push model, and train the information push model to obtain a trained information push model.
Fig. 5 is a block diagram of an electronic device for a word feature extraction method for a search sample according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 5, the electronic device includes: one or more processors 501, a memory 502, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, as desired, along with multiple memories. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 5, one processor 501 is taken as an example.
Memory 502 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform a method of word feature extraction for a search sample as provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the word feature extraction method of a search sample provided by the present application.
The memory 502, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the word feature extraction method of the search sample in the embodiments of the present application (e.g., the search word determination module 401, the clustering module 402, the characterization model training module 403 shown in fig. 4). The processor 501 executes various functional applications of the server and data processing, namely, implements the word feature extraction method of the search sample in the above-described method embodiment, by running the non-transitory software programs, instructions, and modules stored in the memory 502.
The memory 502 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required by at least one function, and the storage data area may store data created according to the use of the electronic device of the word feature extraction method for a search sample, and the like. Further, the memory 502 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 502 optionally includes memory located remotely from the processor 501, and these remote memories may be connected over a network to the electronic device of the word feature extraction method for a search sample. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the word feature extraction method for a search sample may further include: an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503 and the output device 504 may be connected by a bus or other means; fig. 5 illustrates the connection by a bus as an example.
The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device of the word feature extraction method for a search sample; examples include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball and a joystick. The output device 504 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the present application, the filtered search words are first determined based on the labels of the search samples and the search frequency information of the search words in the search samples; semantic vectors of the filtered search words are then obtained and clustered to obtain a word packet for each cluster; finally, the features of the word packet of a search sample are used as the input of the characterization model, the label of the search sample is used as the training target for supervised training, and the intermediate-layer result of the characterization model is used as the characterization features of the word packet of the search sample. Filtering and clustering the search words facilitates subsequent training of machine learning models and improves model performance.
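The supervised training scheme summarized above can be sketched with a minimal pure-Python network. The one-hidden-layer size, learning rate, and logistic loss are illustrative assumptions; the essential point is that the intermediate (hidden) layer output is reused as the characterization features:

```python
import math
import random

class TinyCharacterizationModel:
    """Minimal one-hidden-layer network: trained supervised on sample labels;
    the intermediate-layer result (hidden activations) is reused as the
    sample's characterization features. Sizes and recipe are illustrative."""

    def __init__(self, n_in, n_hidden, seed=0):
        rnd = random.Random(seed)
        self.w1 = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)]

    @staticmethod
    def _sig(z):
        return 1.0 / (1.0 + math.exp(-z))

    def hidden(self, x):
        # Intermediate-layer result: used as the characterization features.
        return [self._sig(sum(w * xi for w, xi in zip(row, x)))
                for row in self.w1]

    def predict(self, x):
        h = self.hidden(x)
        return self._sig(sum(w2j * hj for w2j, hj in zip(self.w2, h)))

    def train(self, xs, ys, lr=0.5, epochs=2000):
        # Plain SGD on the logistic loss; the sample label is the target.
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                h = self.hidden(x)
                p = self._sig(sum(w2j * hj for w2j, hj in zip(self.w2, h)))
                err = p - y  # dLoss/d(output pre-activation)
                for j, hj in enumerate(h):
                    back = err * self.w2[j] * hj * (1.0 - hj)
                    self.w2[j] -= lr * err * hj
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * back * xi
```

After training on labeled word-packet feature vectors, `hidden(x)` yields the characterization features that downstream models (such as the information push model) consume.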
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A method for extracting word features of a search sample comprises the following steps:
determining the screened search terms based on the labels of the search samples and the search frequency information of the search terms in the search samples;
obtaining semantic vectors of the screened search words, and clustering the semantic vectors of the screened search words to obtain word packets of each cluster;
and taking the characteristics of the word packet of the search sample as the input of the characterization model, taking the label of the search sample as a training target of the characterization model for supervised training, and taking at least one intermediate layer result of the trained characterization model as the characterization characteristics of the word packet of the search sample.
2. The method of claim 1, wherein the determining the filtered search terms based on the labels of the respective search samples and the search frequency information of the respective search terms in the respective search samples comprises:
counting at least one item of data as follows: the total number of positive samples of the search samples and the total number of negative samples of the search samples, the number of positive samples of any search word and the number of negative samples of any search word, the total number of positive sample searches of any search word and the total number of negative sample searches of any search word, and the total number of searches of all search words of the positive samples and the total number of searches of all search words of the negative samples;
calculating a discrimination index value of any search term based on the statistical data;
and determining the screened search terms based on the discrimination index values of the search terms.
3. The method of claim 1, wherein clustering the semantic vectors of the filtered search terms to obtain a word package for each cluster, comprises:
selecting N search words with the highest discrimination index values in the word packet of any cluster as candidate seed words, and expanding K nearest neighbor words of semantic vectors of the candidate seed words to obtain target seed words, wherein N and K are positive integers;
and clustering the semantic vectors by adopting the target seed words.
4. The method of claim 1, further comprising:
acquiring the number of active days of each search word in each search sample;
the clustering the semantic vectors of the screened search terms to obtain the word packet of each cluster further comprises:
clustering the semantic vectors;
and constructing a word packet with a plurality of dimensional characteristics for each class cluster based on the searching frequency information of each searching word and the active days of each searching word.
5. The method of claim 1, wherein the characterization model is a multilayer perceptron.
6. The method according to any one of claims 1-5, further comprising:
inputting a user search sample into the trained characterization model to obtain word characteristics of the search sample;
and taking the word characteristics of the search sample as the input of the information pushing model, taking the pushed information as the expected output of the information pushing model, and training the information pushing model to obtain the trained information pushing model.
7. An apparatus for extracting word features of a search sample, the apparatus comprising:
the search term determining module is configured to determine the screened search terms based on the labels of the search samples and the search frequency information of the search terms in the search samples;
the clustering module is configured to acquire the semantic vectors of the screened search words and cluster the semantic vectors of the screened search words to obtain a word packet of each cluster;
and the characterization model training module is configured to take the characteristics of the word packet of the search sample as the input of the characterization model, take the label of the search sample as the training target of the characterization model for supervised training, and take at least one intermediate layer result of the trained characterization model as the characterization characteristics of the word packet of the search sample.
8. The apparatus of claim 7, wherein the search term determination module is further configured to:
counting at least one item of data as follows: the total number of positive samples of the search samples and the total number of negative samples of the search samples, the number of positive samples of any search word and the number of negative samples of any search word, the total number of positive sample searches of any search word and the total number of negative sample searches of any search word, and the total number of searches of all search words of the positive samples and the total number of searches of all search words of the negative samples;
calculating a discrimination index value of any search term based on the statistical data;
and determining the screened search terms based on the discrimination index values of the search terms.
9. The apparatus of claim 7, wherein the clustering module is further configured to:
selecting N search words with the highest discrimination index values in the word packet of any cluster as candidate seed words, and expanding K nearest neighbor words of semantic vectors of the candidate seed words to obtain target seed words, wherein N and K are positive integers;
and clustering the semantic vectors by adopting the target seed words.
10. The apparatus of claim 7, further comprising:
the acquisition module is configured to acquire the number of active days of each search word in each search sample;
the clustering module comprises:
a clustering submodule configured to cluster the semantic vectors;
and the characteristic construction module is configured to construct a word packet with a plurality of dimensional characteristics for each class cluster based on the search frequency information of each search word and the active days of each search word.
11. The apparatus of claim 7, wherein the characterization model is a multi-layer neural network model.
12. The apparatus according to any one of claims 7-11, further comprising:
the characterization model application module is configured to input a user search sample into the trained characterization model to obtain word characteristics of the search sample;
and the information push model training module is configured to take word characteristics of the search sample as input of the information push model, take pushed information as expected output of the information push model, train the information push model and obtain the trained information push model.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202011003276.7A 2020-09-22 2020-09-22 Word feature extraction method, device and equipment for searching samples and storage medium Active CN111950254B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011003276.7A CN111950254B (en) 2020-09-22 2020-09-22 Word feature extraction method, device and equipment for searching samples and storage medium


Publications (2)

Publication Number Publication Date
CN111950254A true CN111950254A (en) 2020-11-17
CN111950254B CN111950254B (en) 2023-07-25

Family

ID=73356814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011003276.7A Active CN111950254B (en) 2020-09-22 2020-09-22 Word feature extraction method, device and equipment for searching samples and storage medium

Country Status (1)

Country Link
CN (1) CN111950254B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328891A (en) * 2020-11-24 2021-02-05 北京百度网讯科技有限公司 Method for training search model, method for searching target object and device thereof
CN112560425A (en) * 2020-12-24 2021-03-26 北京百度网讯科技有限公司 Template generation method and device, electronic equipment and storage medium
CN112800315A (en) * 2021-01-29 2021-05-14 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN113033194A (en) * 2021-03-09 2021-06-25 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of semantic representation graph model
CN113408301A (en) * 2021-07-12 2021-09-17 北京沃东天骏信息技术有限公司 Sample processing method, device, equipment and medium
CN113553398A (en) * 2021-07-15 2021-10-26 杭州网易云音乐科技有限公司 Search word correcting method and device, electronic equipment and computer storage medium
CN114238573A (en) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 Information pushing method and device based on text countermeasure sample
CN114298122A (en) * 2021-10-22 2022-04-08 腾讯科技(深圳)有限公司 Data classification method, device, equipment, storage medium and computer program product
CN115248847A (en) * 2022-09-22 2022-10-28 竹间智慧科技(北京)有限公司 Search data set construction method and device, electronic equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080140806A1 (en) * 2006-12-12 2008-06-12 Yahoo! Inc. Configuring a search engine results page with environment-specific information
CN101980210A (en) * 2010-11-12 2011-02-23 百度在线网络技术(北京)有限公司 Marked word classifying and grading method and system
CN107577763A (en) * 2017-09-04 2018-01-12 北京京东尚科信息技术有限公司 Search method and device
CN109472490A (en) * 2018-11-06 2019-03-15 北京京航计算通讯研究所 Military project group personal information labeling system based on cluster
CN109508414A (en) * 2018-11-13 2019-03-22 北京奇艺世纪科技有限公司 A kind of synonym method for digging and device
CN110162593A (en) * 2018-11-29 2019-08-23 腾讯科技(深圳)有限公司 A kind of processing of search result, similarity model training method and device
CN110442718A (en) * 2019-08-08 2019-11-12 腾讯科技(深圳)有限公司 Sentence processing method, device and server and storage medium
CN110781305A (en) * 2019-10-30 2020-02-11 北京小米智能科技有限公司 Text classification method and device based on classification model and model training method
CN111191102A (en) * 2019-12-28 2020-05-22 合肥长远知识产权管理有限公司 Fast search model training method based on big data retrieval and semantic analysis
CN111368525A (en) * 2020-03-09 2020-07-03 深圳市腾讯计算机系统有限公司 Information searching method, device, equipment and storage medium
CN111476036A (en) * 2020-04-10 2020-07-31 电子科技大学 Word embedding learning method based on Chinese word feature substrings


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
来斯惟;徐立恒;陈玉博;刘康;赵军;: "基于表示学习的中文分词算法探索", 中文信息学报, no. 05 *


Also Published As

Publication number Publication date
CN111950254B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN111950254B (en) Word feature extraction method, device and equipment for searching samples and storage medium
CN111639710B (en) Image recognition model training method, device, equipment and storage medium
CN111582453B (en) Method and device for generating neural network model
CN111241282B (en) Text theme generation method and device and electronic equipment
CN111859982B (en) Language model training method and device, electronic equipment and readable storage medium
CN111522967B (en) Knowledge graph construction method, device, equipment and storage medium
CN111967262A (en) Method and device for determining entity tag
CN109960726A (en) Textual classification model construction method, device, terminal and storage medium
CN111582454B (en) Method and device for generating neural network model
CN112148881B (en) Method and device for outputting information
CN112016633A (en) Model training method and device, electronic equipment and storage medium
CN111667056A (en) Method and apparatus for searching model structure
CN111666751B (en) Training text expansion method, device, equipment and storage medium
CN111859953B (en) Training data mining method and device, electronic equipment and storage medium
CN111931509A (en) Entity chain finger method, device, electronic equipment and storage medium
CN111667057A (en) Method and apparatus for searching model structure
CN111539209B (en) Method and apparatus for entity classification
CN112329453B (en) Method, device, equipment and storage medium for generating sample chapter
CN112348107A (en) Image data cleaning method and apparatus, electronic device, and medium
CN111914994A (en) Method and device for generating multilayer perceptron, electronic equipment and storage medium
CN112115313B (en) Regular expression generation and data extraction methods, devices, equipment and media
CN111400456B (en) Information recommendation method and device
CN113033458A (en) Action recognition method and device
CN113961765A (en) Searching method, device, equipment and medium based on neural network model
CN111310058A (en) Information theme recommendation method and device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant