CN110276001B - Checking page identification method and device, computing equipment and medium - Google Patents

Checking page identification method and device, computing equipment and medium

Info

Publication number
CN110276001B
Authority
CN
China
Prior art keywords
title
vector
word
training text
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910537402.8A
Other languages
Chinese (zh)
Other versions
CN110276001A (en)
Inventor
潘禄
陈玉光
彭卫华
罗雨
刘远圳
韩翠云
施茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910537402.8A
Publication of CN110276001A
Application granted
Publication of CN110276001B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a checking page identification method and device, a computing device and a medium. The method comprises: determining a first title vector of a training text title based on the relevance among the words in the training text title; determining a second title vector of the training text title by using a preset language model, wherein the preset language model determines different word vectors for the same word at different positions in the training text title; and training a checking page recognition model with the first title vector and the second title vector as input and the checking page labeling result of the training text title as output, so that the checking page recognition model can be used to determine whether a target text title is a checking page title. The embodiment makes it possible to accurately identify, from massive network information, checking pages that each contain at least two events or topics, which helps improve the information recommendation results of downstream services.

Description

Checking page identification method and device, computing equipment and medium
Technical Field
The embodiment of the invention relates to the technical field of internet information processing, in particular to a checking page identification method, a checking page identification device, computing equipment and a medium.
Background
With the rapid popularization of the internet, network information has grown explosively, so netizens must spend considerable effort to pick out the information they need from the mass of available information.
One type of network information is produced through secondary processing: content about different past or current topics is processed, screened, and then combined into a single item for presentation. For example, an item titled "Meng Wanzhou sues the Canadian government for compensation; Apple prepares to release a foldable iPhone; ..." contains 3 events or topics with low relevance to one another, each of which could be reported as an independent item. When a query related to "Meng Wanzhou", "Apple" or "WeChat" is entered in a search engine, this item may appear on each of the search result pages. In practice, information that contains multiple loosely related events or topics is a low-quality search result, because much of its content is unrelated to the search terms.
A network media text that contains two or more different topics or events that are not closely related to one another is referred to here as a checking page. If checking pages can be accurately identified from massive network information and demoted, their ranking in search results can be adjusted so that results that are not checking pages are pushed first, which improves the search results of the search engine and the user experience. However, the prior art lacks an effective scheme for accurately identifying checking pages from massive network information.
Disclosure of Invention
The embodiment of the invention provides an inventory page identification method, an inventory page identification device, computing equipment and a medium, which are used for accurately identifying an information inventory page containing at least two events or topics from massive network information.
In a first aspect, an embodiment of the present invention provides a checking page identification method, where the method includes:
determining a first title vector of a training text title based on the relevance among all words in the training text title;
determining a second title vector of the training text title by using a preset language model, wherein the preset language model determines different word vectors for the same word at different positions in the training text title;
and taking the first title vector and the second title vector as input, taking the checking page marking result of the training text title as output, training a checking page recognition model, and determining whether the target text title is the checking page title or not by using the checking page recognition model.
In a second aspect, an embodiment of the present invention further provides an inventory page identification apparatus, where the apparatus includes:
the first vector determination module is used for determining a first title vector of the training text title based on the relevance among all the words in the training text title;
the second vector determining module is used for determining a second title vector of the training text title by using a preset language model, wherein the preset language model determines different word vectors for the same word at different positions in the training text title;
and the model training module is used for taking the first title vector and the second title vector as input, taking the checking page marking result of the training text title as output, training a checking page recognition model, and determining whether the target text title is the checking page title or not by using the checking page recognition model.
In a third aspect, an embodiment of the present invention further provides a computing device, including:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a checking page identification method as in any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements an inventory page identification method according to any embodiment of the present invention.
The embodiment of the invention determines the vector representation of the training text title in two ways, combining a title vector determined from the relevance between the words in the training text title with a title vector determined by a model-based neural network language model (namely the preset language model). This ensures that the checking page recognition model, trained following the deep learning approach, sees complete title features. It addresses the lack, in the prior art, of an effective technical scheme for identifying checking pages that contain at least two events or topics from massive network information, achieves accurate identification of checking pages from massive network information, enables effective screening of massive network information at the title level, and helps improve the information recommendation results of downstream services.
Drawings
Fig. 1 is a flowchart of an inventory page identification method according to an embodiment of the present invention;
Fig. 2 is a flowchart of an inventory page identification method according to a second embodiment of the present invention;
Fig. 3 is a schematic diagram of a training process of an inventory page recognition model according to a second embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an inventory page recognition device according to a third embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of an inventory page identification method according to an embodiment of the present invention, where this embodiment is applicable to a case where an inventory page is identified by mining a large amount of network information. The method may be performed by an inventory page identification apparatus, which may be implemented in software and/or hardware, and may be integrated on any computing device, including but not limited to a server.
As shown in fig. 1, the method for identifying an inventory page provided in this embodiment may include:
S110, determining a first title vector of the training text title based on the relevance among all the words in the training text title.
Before a model is trained following the deep learning approach, training texts need to be prepared in advance. The training texts can be any social media texts, such as news or information published on microblogs, web pages, official accounts and similar platforms. Each training text has a title, which can be regarded as an independent sentence summarizing the events or topics described in the text. Each training text title is then manually labeled as to whether it is a checking page title, that is, whether it covers at least two events or topics; manual labeling ensures the accuracy of the labeling result. Training text titles that are checking page titles may be used as positive samples, and those that are not may be used as negative samples.
For each training text title, the words in the title can be obtained through word segmentation, and a first title vector of the title is then determined by taking the semantic relevance of the words in the title into account, for example by using a traditional language model such as word2vec. Note that the traditional language model used to determine the first title vector produces the same word vector for the same word at different positions in the training text title, unlike the preset language model used below.
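The position-independent behaviour of a traditional word vector table can be illustrated with a minimal sketch. The table below is filled with random numbers as a stand-in for a trained model, and the function names, the example title, and the dimension of 4 are all assumptions made for illustration only.

```python
import random


def build_static_table(vocab, dim=4, seed=0):
    """Give every vocabulary word one fixed vector (stand-in for a trained table)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocab}


def embed_title(words, table):
    """Look up one vector per word; the word's position plays no role."""
    return [table[word] for word in words]


# Hypothetical, already segmented title in which "apple" occurs twice.
title_words = ["apple", "releases", "phone", "apple", "denies", "rumor"]
table = build_static_table(sorted(set(title_words)))
vectors = embed_title(title_words, table)

# Both occurrences of "apple" receive exactly the same vector.
same_vector = vectors[0] == vectors[3]
print(same_vector)
```

Because the lookup depends only on the word itself, any position information has to be supplied separately if the downstream model needs it.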
S120, determining a second title vector of the training text title by using a preset language model, wherein the preset language model determines different word vectors for the same word at different positions in the training text title.
The preset language model includes, but is not limited to, model-based neural network language models such as the BERT language model (Bidirectional Encoder Representations from Transformers, a deep bidirectional pre-trained Transformer for language understanding), the ELMo language model (Embeddings from Language Models, built on multi-layer bidirectional language models), and the ERNIE language model (Enhanced Representation through kNowledge IntEgration). For the same word at different positions in the same training text title, these models give different vector representations depending on the specific title, that is, they represent each word vector dynamically. A word in this embodiment comprises at least one language element; in Chinese, for example, a word may consist of a single character. In addition, operations S110 and S120 need not be executed in any strict order.
Optionally, determining a second heading vector of the training text heading by using the preset language model includes:
determining a word vector of each word in the training text title by using a preset language model, and combining the word vectors of each word to serve as a second title vector of the training text title;
or
adding an identification word at a specific position of the training text title, determining the word vector of the identification word in the training text title by using the preset language model, and taking the word vector of the identification word as the second title vector of the training text title.
The specific position of the training text title includes the beginning or the end of the title (adding an identification word at such a position does not break the semantic integrity of the title sentence itself), and the identification word can be any predefined word that can be used to distinguish different titles; for example, the identification word can be [SEP]. Illustratively, the identification word [SEP] is added at the beginning of each training text title, and each training text title is then input into the preset language model to obtain a multi-layer vector representation of each word in the title. For the BERT language model, for example, the Transformer has 12 layers, and either a combination of the multi-layer vectors or the last-layer vector can be used as the feature vector of each word; the word vector at the position of "[SEP]" can be taken as the encoding vector of the training text title, namely the second title vector.
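The two options for deriving the second title vector can be sketched as follows. The encoder here is a toy stand-in, not BERT, ELMo or ERNIE: it adds a sinusoidal position signal to random static vectors and applies one parameter-free self-attention pass, which is enough to make the same word receive different vectors at different positions. All names, the dimension of 8, and the choice of averaging as the way to "combine" word vectors are assumptions for illustration.

```python
import math
import random


def toy_contextual_encoder(tokens, dim=8, seed=0):
    """Hypothetical stand-in for a pretrained contextual language model."""
    rng = random.Random(seed)
    table = {}
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    # Add a sinusoidal position signal so identical tokens differ by position.
    inputs = []
    for pos, tok in enumerate(tokens):
        row = []
        for i, value in enumerate(table[tok]):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(value + (math.sin(angle) if i % 2 == 0 else math.cos(angle)))
        inputs.append(row)
    # One self-attention pass: each output is a softmax-weighted mix of all inputs.
    outputs = []
    for query in inputs:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in inputs]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * key[i] for w, key in zip(weights, inputs)) / total
            for i in range(dim)
        ])
    return outputs


def combine_word_vectors(word_vectors):
    """Option 1: combine all word vectors, here by element-wise averaging."""
    count = len(word_vectors)
    return [sum(column) / count for column in zip(*word_vectors)]


def marker_vector(tokens, word_vectors, marker="[SEP]"):
    """Option 2: take the vector at the position of the identification word."""
    return word_vectors[tokens.index(marker)]


# Hypothetical title with the identification word added at the beginning.
tokens = ["[SEP]", "apple", "releases", "phone", "apple", "denies", "rumor"]
encoded = toy_contextual_encoder(tokens)
option_all_words = combine_word_vectors(encoded[1:])
option_marker = marker_vector(tokens, encoded)
```

In a real system the toy encoder would be replaced by the chosen pretrained model, and the layer or layer combination used for each word vector would be a design decision as described above.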
S130, taking the first title vector and the second title vector as input, taking the checking page marking result of the training text title as output, training a checking page recognition model, and determining whether the target text title is the checking page title or not by using the checking page recognition model.
For each training text title, the vector representation of the title is determined in two ways, so that the two sets of title vector features complement each other and the checking page recognition model, trained following the deep learning approach, sees complete title features. The target text title includes the title of any network media text and can be obtained by extracting titles from social media texts crawled from the Internet. The target text title is input into the checking page recognition model to determine whether it is a checking page title; if so, the text corresponding to the checking page title is sent to a downstream service. This screens massive network information efficiently at the title level, improves the data processing efficiency of internet data mining, provides targeted data support for downstream services, and improves their information recommendation results. For example, in a search scenario, information can be retrieved based on event keywords entered by a user, and texts whose titles are non-checking page titles are pushed to the user first, thereby improving the search results of the search engine.
The technical solution of this embodiment determines the vector representation of the training text title in two ways, combining a title vector determined from the relevance between the words in the training text title with a title vector determined by a model-based neural network language model (namely the preset language model). This ensures that the checking page recognition model, trained following the deep learning approach, sees complete title features. It addresses the lack, in the prior art, of an effective technical scheme for identifying checking pages that contain at least two events or topics from massive network information, achieves accurate identification of checking pages from massive network information, enables effective screening of massive network information at the title level, improves the data processing efficiency of internet data mining, and helps improve the information recommendation results of downstream services.
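The search scenario described above can be sketched as a stable re-ranking step. The classifier below is a deliberately crude placeholder (it only looks for semicolon-separated items) standing in for the trained recognition model; the function names and the result format are assumptions for illustration.

```python
def rerank(results, is_checking_page_title):
    """Push non-checking-page results first, keeping the original order within each group."""
    # sorted() is stable and False sorts before True.
    return sorted(results, key=lambda result: is_checking_page_title(result["title"]))


def placeholder_classifier(title):
    """Hypothetical stand-in for the trained checking page recognition model."""
    return ";" in title


results = [
    {"title": "Event A; Event B; Event C"},
    {"title": "Apple prepares to release a foldable iPhone"},
    {"title": "Event X; Event Y"},
    {"title": "WeChat update"},
]
reranked_titles = [r["title"] for r in rerank(results, placeholder_classifier)]
print(reranked_titles)
```

A stable sort keeps the search engine's original relevance order inside each group, so the recognition model only demotes checking pages and does not otherwise disturb the ranking.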
Example two
Fig. 2 is a flowchart of an inventory page identification method according to a second embodiment of the present invention, which is further optimized based on the above-mentioned embodiment. As shown in fig. 2, the method may include:
S210, segmenting the training text title into words, and determining a word vector, a position vector and a part-of-speech vector of each resulting word by using a word vector analysis model.
In this embodiment, the vector representation of each word obtained by segmenting the training text title is formed by concatenating three vectors: a word vector (Word Embeddings), a position vector (Position Embeddings), and a part-of-speech vector (POS Embeddings). The word vector can be obtained from a pre-trained unsupervised model such as a word2vec model; the unsupervised model can be based on existing open-source word vectors or trained on a self-built training corpus that includes titles and body text from network social media texts. The position vector represents the position of each word in the training text title and may be a vector representation of the position of the current word relative to the potential event subject. For example, if the current word is the 4th word in the training text title and the event subject is at position 7 in the title sentence, the position of the current word relative to the event subject is -3; this value is then mapped to a fixed-dimension vector drawn from a normal distribution to obtain the position vector of the current word, with different values mapped to different vectors. The part-of-speech vector is obtained by mapping the part of speech of each word to a multi-dimensional vector, and words with the same part of speech are initialized with the same vector.
S220, determining a first title vector of the training text title, based on the word vector, the position vector and the part-of-speech vector, by taking into account the relevance of the words in the training text title.
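A minimal sketch of how the three vectors might be concatenated per word is given below. The tables are random stand-ins for trained embeddings, the relative position is computed as the word's index minus the event subject's index, and the example title, tag set, dimensions (4, 3 and 2) and the choice of the first word as event subject are all assumptions for illustration.

```python
import random


def make_table(keys, dim, seed):
    """Map each key to a fixed vector drawn from a normal distribution."""
    rng = random.Random(seed)
    return {key: [rng.gauss(0.0, 1.0) for _ in range(dim)] for key in keys}


def build_word_inputs(words, pos_tags, subject_index, word_table, position_table, tag_table):
    """Concatenate word, relative-position and part-of-speech vectors for each word."""
    rows = []
    for index, (word, tag) in enumerate(zip(words, pos_tags)):
        relative = index - subject_index  # position relative to the event subject
        rows.append(word_table[word] + position_table[relative] + tag_table[tag])
    return rows


# Hypothetical segmented title with hypothetical part-of-speech tags.
words = ["apple", "prepares", "to", "release", "a", "foldable", "iphone"]
pos_tags = ["NOUN", "VERB", "PART", "VERB", "DET", "ADJ", "NOUN"]
subject_index = 0  # assumed: the first word is treated as the event subject

word_table = make_table(sorted(set(words)), dim=4, seed=1)
position_table = make_table(range(-len(words), len(words) + 1), dim=3, seed=2)
tag_table = make_table(sorted(set(pos_tags)), dim=2, seed=3)

rows = build_word_inputs(words, pos_tags, subject_index, word_table, position_table, tag_table)
```

Each row has 4 + 3 + 2 = 9 values; the two verbs share the same part-of-speech segment but have different position segments.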
By considering the relevance among the words, the semantic correctness of the training text title sentences can be ensured.
Optionally, determining a first heading vector of the training text heading by considering the relevance of each word in the training text heading based on the word vector, the position vector, and the part-of-speech vector, including:
performing convolution calculation in the convolution layer by adopting a preset number of convolution kernels based on the word vector, the position vector and the part-of-speech vector, and extracting local features in the training text title;
and pooling the extracted local features, and performing nonlinear transformation on a pooling result to obtain a first title vector of the training text title.
Fig. 3 is a schematic diagram of the training process of the checking page recognition model provided in this embodiment, taking a convolutional neural network as an example. As shown in Fig. 3, the word vector, position vector and part-of-speech vector of each word in the training text title are fed into the input layer. In the convolutional layer, local features are extracted by a plurality of convolution kernels (feature maps). To avoid an excessive number of parameters in the network, this embodiment extracts features with a convolutional layer whose convolution window is 3; the number of extracted features depends on predefined parameters. This embodiment can also use equal-length convolution, so that the convolution result has the same width as the input. The convolution features (namely the extracted local features) are then pooled. The purpose of pooling is to find the most important feature information at the same position; this embodiment can use max pooling, which takes the maximum value along each dimension and outputs the pooled result. In the fully connected layer, a nonlinear transformation is applied to the pooled result to obtain the first title vector of the training text title. Because the first title vector takes into account the semantic relevance of the words in the title sentence, it may also be called the title sentence context vector (the context vector represents the context features of the entire title sentence). The nonlinear transformation includes, but is not limited to, applying an activation function such as tanh.
S230, determining a second title vector of the training text title by using a preset language model, wherein the preset language model determines different word vectors for the same word at different positions in the training text title.
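The convolution, pooling and nonlinear transformation steps can be sketched in plain Python as follows. Weights are random rather than trained, zero padding is used as one way to realize equal-length convolution, and the sizes (6 words, 9 input values per word, 5 kernels, 4 output values) and all function names are assumptions for illustration.

```python
import math
import random


def conv1d_same(rows, kernels, biases):
    """Window-3 convolution with zero padding, so the output is as long as the input."""
    zero = [0.0] * len(rows[0])
    padded = [zero] + rows + [zero]
    features = []
    for t in range(len(rows)):
        window = padded[t] + padded[t + 1] + padded[t + 2]  # flattened 3-word window
        features.append([
            sum(w * x for w, x in zip(kernel, window)) + bias
            for kernel, bias in zip(kernels, biases)
        ])
    return features


def max_pool(features):
    """Take the maximum over all positions for each feature map."""
    return [max(column) for column in zip(*features)]


def dense_tanh(vector, weights, biases):
    """Fully connected layer followed by a tanh nonlinearity."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + bias)
        for row, bias in zip(weights, biases)
    ]


rng = random.Random(0)
seq_len, in_dim, n_kernels, out_dim = 6, 9, 5, 4
rows = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(seq_len)]
kernels = [[rng.gauss(0.0, 0.1) for _ in range(3 * in_dim)] for _ in range(n_kernels)]
kernel_biases = [0.0] * n_kernels
dense_weights = [[rng.gauss(0.0, 0.1) for _ in range(n_kernels)] for _ in range(out_dim)]
dense_biases = [0.0] * out_dim

features = conv1d_same(rows, kernels, kernel_biases)   # seq_len x n_kernels
pooled = max_pool(features)                            # n_kernels values
first_title_vector = dense_tanh(pooled, dense_weights, dense_biases)
```

Max pooling over positions yields a fixed-size vector regardless of title length, which is what lets titles of different lengths feed the same fully connected layer.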
S240, taking the first heading vector and the second heading vector as input, taking the checking page marking result of the training text title as output, training a checking page recognition model, and determining whether the target text title is the checking page title or not by using the checking page recognition model.
Finally, the first title vector and the second title vector are concatenated into one multi-dimensional vector that serves as the input of the fully connected layer, and the output of the output layer is one of the predefined title categories: checking page title or non-checking page title.
The technical solution of this embodiment determines the vector representation of the training text title in two ways, combining a title vector determined from the relevance between the words in the training text title with a title vector determined by a neural network language model (namely the preset language model). This ensures that the checking page recognition model, trained following the deep learning approach, sees complete title features. It addresses the lack, in the prior art, of an effective technical scheme for identifying checking pages that contain at least two events or topics from massive network information, achieves accurate identification of checking pages from massive network information, enables effective screening of massive network information at the title level, and helps improve the information recommendation results of downstream services.
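The final classification step can be sketched as a softmax layer over the concatenated vector, together with one gradient step on a cross-entropy loss. The source does not specify the loss function or optimizer, so cross-entropy and plain gradient descent are assumptions here, as are the vector sizes, learning rate and function names.

```python
import math
import random

LABELS = ["checking page title", "non-checking page title"]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(first_vec, second_vec, weights, biases):
    """Concatenate both title vectors and apply a fully connected softmax layer."""
    joint = first_vec + second_vec  # list concatenation
    logits = [sum(w * x for w, x in zip(row, joint)) + b for row, b in zip(weights, biases)]
    return joint, softmax(logits)


def sgd_step(joint, probs, label_index, weights, biases, lr=0.1):
    """One gradient-descent update of the output layer under cross-entropy loss."""
    for k in range(len(weights)):
        error = probs[k] - (1.0 if k == label_index else 0.0)
        weights[k] = [w - lr * error * x for w, x in zip(weights[k], joint)]
        biases[k] -= lr * error


rng = random.Random(0)
first_vec = [rng.uniform(-1.0, 1.0) for _ in range(4)]   # stand-in first title vector
second_vec = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # stand-in second title vector
weights = [[rng.gauss(0.0, 0.1) for _ in range(8)] for _ in LABELS]
biases = [0.0, 0.0]
label_index = 0  # this hypothetical sample is labeled as a checking page title

joint, probs_before = predict(first_vec, second_vec, weights, biases)
loss_before = -math.log(probs_before[label_index])
sgd_step(joint, probs_before, label_index, weights, biases)
_, probs_after = predict(first_vec, second_vec, weights, biases)
loss_after = -math.log(probs_after[label_index])
```

In a full model the gradient would also flow back into the layers that produce the first title vector; this sketch updates only the output layer to keep the example short.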
Example three
Fig. 4 is a schematic structural diagram of an inventory page recognition device according to a third embodiment of the present invention, where this embodiment is applicable to a case where an inventory page is recognized from a large amount of network information by mining the network information. The apparatus may be implemented in software and/or hardware and may be integrated on any computing device, including but not limited to a server.
As shown in fig. 4, the checking page recognition apparatus provided in this embodiment may include a first vector determination module 310, a second vector determination module 320, and a model training module 330, wherein:
a first vector determination module 310, configured to determine a first heading vector of the training text heading based on a relevance between words in the training text heading;
the second vector determining module 320 is configured to determine a second heading vector of the training text heading by using a preset language model, wherein the preset language model determines different word vectors for the same word at different positions in the training text heading;
the model training module 330 is configured to take the first heading vector and the second heading vector as inputs, take the checking page labeling result of the training text title as an output, train the checking page recognition model, and determine whether the target text title is the checking page title by using the checking page recognition model.
Optionally, the second vector determining module 320 is specifically configured to:
determining a word vector of each word in the training text title by using a preset language model, and combining the word vectors of each word to serve as a second title vector of the training text title;
or
adding an identification word at a specific position of the training text title, determining the word vector of the identification word in the training text title by using the preset language model, and taking the word vector of the identification word as the second title vector of the training text title.
Optionally, the first vector determining module 310 includes:
the word segmentation unit is used for segmenting words of the training text titles and determining word vectors, position vectors and part-of-speech vectors of each word obtained through word segmentation by using a word vector analysis model;
and the association unit is used for determining a first title vector of the training text title by considering the association of each word in the training text title based on the word vector, the position vector and the part of speech vector.
Optionally, the associating unit includes:
the convolution calculation subunit is used for performing convolution calculation in the convolution layer by adopting a preset number of convolution kernels based on the word vector, the position vector and the part-of-speech vector, and extracting local features in the training text title;
and the pooling and nonlinear transformation subunit is used for pooling the extracted local features and performing nonlinear transformation on a pooling result to obtain a first title vector of the training text title.
Optionally, the target text title includes a title of the network media text.
The checking page identification device provided by the embodiment of the invention can execute the checking page identification method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. Reference may be made to the description of any method embodiment of the invention not specifically described in this embodiment.
Example four
Fig. 5 is a schematic structural diagram of a computing device according to a fourth embodiment of the present invention. FIG. 5 illustrates a block diagram of an exemplary computing device 412 suitable for use in implementing embodiments of the present invention. The computing device 412 shown in FIG. 5 is only one example and should not impose any limitations on the functionality or scope of use of embodiments of the present invention. Computing device 412 may be any device with computing capabilities including, but not limited to, a server.
As shown in fig. 5, computing device 412 is in the form of a general purpose computing device. Components of computing device 412 may include, but are not limited to: one or more processors 416, a storage device 428, and a bus 418 that couples the various system components including the storage device 428 and the processors 416.
Bus 418 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computing device 412 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computing device 412 and includes both volatile and nonvolatile media, removable and non-removable media.
Storage device 428 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 430 and/or cache memory 432. The computing device 412 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 434 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk such as a Compact Disc Read-Only Memory (CD-ROM), Digital Video Disc Read-Only Memory (DVD-ROM) or other optical media may be provided. In these cases, each drive may be connected to bus 418 by one or more data media interfaces. Storage device 428 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 440 having a set (at least one) of program modules 442 may be stored, for instance, in storage device 428. Such program modules 442 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 442 generally perform the functions and/or methodologies of the described embodiments of the invention.
The computing device 412 may also communicate with one or more external devices 414 (e.g., a keyboard, a pointing device, a display 424, etc.), with one or more devices that enable a user to interact with the computing device 412, and/or with any devices (e.g., a network card, a modem, etc.) that enable the computing device 412 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 422. Moreover, computing device 412 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) through network adapter 420. As shown in FIG. 5, network adapter 420 communicates with the other modules of computing device 412 over bus 418. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computing device 412, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (Redundant Arrays of Independent Disks) systems, tape drives, and data backup storage systems, among others.
The processor 416 executes various functional applications and performs data processing by running programs stored in the storage device 428, for example implementing the inventory page identification method provided by any embodiment of the present invention. The method may include:
determining a first title vector of a training text title based on the relevance among the words in the training text title;
determining a second title vector of the training text title by using a preset language model, wherein the preset language model determines different word vectors for the same word at different positions in the training text title; and
taking the first title vector and the second title vector as input and the inventory page labeling result of the training text title as output, training an inventory page recognition model, and determining, by using the inventory page recognition model, whether a target text title is an inventory page title.
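The three steps above can be illustrated with a minimal sketch. It assumes the two title vectors have already been computed and simply concatenates them; a small logistic-regression classifier stands in for the recognition model, whose actual architecture is not specified in this passage. All function names, the toy vectors, and the hyperparameters are hypothetical.

```python
import math


def concat(first_vec, second_vec):
    """Join the first and second title vectors into one input feature vector."""
    return list(first_vec) + list(second_vec)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_recognizer(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression stand-in for the recognition model (assumption)."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def is_inventory_title(model, first_vec, second_vec, threshold=0.5):
    """Apply the trained model to the vectors of a target text title."""
    weights, bias = model
    x = concat(first_vec, second_vec)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= threshold


# Hand-made toy data: (first title vector, second title vector, label).
# Label 1 marks an inventory page title, label 0 any other title.
train = [
    ([1.0, 0.9], [0.8, 1.0], 1),
    ([0.9, 1.0], [1.0, 0.7], 1),
    ([0.1, 0.0], [0.2, 0.1], 0),
    ([0.0, 0.2], [0.1, 0.0], 0),
]
features = [concat(f, s) for f, s, _ in train]
labels = [y for _, _, y in train]
model = train_recognizer(features, labels)
```

Concatenation is only one plausible way to feed both vectors to the model; the text above requires only that both serve as input.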
EXAMPLE five
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the inventory page identification method provided by any embodiment of the present invention. The method may include:
determining a first title vector of a training text title based on the relevance among the words in the training text title;
determining a second title vector of the training text title by using a preset language model, wherein the preset language model determines different word vectors for the same word at different positions in the training text title; and
taking the first title vector and the second title vector as input and the inventory page labeling result of the training text title as output, training an inventory page recognition model, and determining, by using the inventory page recognition model, whether a target text title is an inventory page title.
The computer storage medium of embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made by those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the spirit of the invention; the scope of the present invention is determined by the scope of the appended claims.

Claims (12)

1. An inventory page identification method, comprising:
determining a first title vector of a training text title based on the relevance among all words in the training text title;
determining a second title vector of the training text title by using a preset model-based neural network language model, wherein the neural network language model determines different word vectors for the same word at different positions in the training text title;
taking the first title vector and the second title vector as input and the inventory page labeling result of the training text title as output, training an inventory page recognition model, and determining, by using the inventory page recognition model, whether a target text title is an inventory page title;
wherein the inventory page is a page containing at least two events or topics; and
in the first title vector, the word vector representations of the same word at different positions in the training text title are the same; in the second title vector, the word vector representations of the same word at different positions in the training text title are different.
2. The method of claim 1, wherein determining the second title vector of the training text title by using the preset language model comprises:
determining a word vector of each word in the training text title by using the preset language model, and combining the word vectors of the words to serve as the second title vector of the training text title; or
adding an identification word at a specific position of the training text title, determining a word vector of the identification word in the training text title by using the preset language model, and taking the word vector of the identification word as the second title vector of the training text title.
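The two alternatives in claim 2 can be sketched as follows. The toy encoder below is not a real language model: it is a hypothetical stand-in that mixes a fixed per-word embedding with a position encoding and a title-wide context average, so that the same word receives different vectors at different positions. Mean pooling is assumed for "combining", which the claim does not specify, and the marker string `[MARK]` and its placement at the start of the title are likewise assumptions.

```python
import math
import random

DIM = 4
MARKER = "[MARK]"  # hypothetical identification word


def static_embedding(word):
    """Deterministic toy embedding; a string seed gives the same vector on every run."""
    rng = random.Random(word)
    return [rng.uniform(-1, 1) for _ in range(DIM)]


def position_encoding(index):
    """Toy position signal that differs from one token position to the next."""
    return [math.sin((index + 1) * (k + 1)) for k in range(DIM)]


def contextual_vectors(tokens):
    """Stand-in for the language model: word + position + whole-title context."""
    static = [static_embedding(t) for t in tokens]
    context = [sum(col) / len(tokens) for col in zip(*static)]
    return [
        [math.tanh(s + p + c) for s, p, c in zip(vec, position_encoding(i), context)]
        for i, vec in enumerate(static)
    ]


def second_vector_by_combining(tokens):
    """Option 1: combine the per-word vectors (mean pooling assumed)."""
    vectors = contextual_vectors(tokens)
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def second_vector_by_marker(tokens):
    """Option 2: prepend an identification word and use its vector as the title vector."""
    return contextual_vectors([MARKER] + list(tokens))[0]


title = ["top", "ten", "films", "and", "top", "ten", "songs"]
vectors = contextual_vectors(title)
# "top" occurs at positions 0 and 4 and receives a different vector at each.
same_word_differs = vectors[0] != vectors[4]
```

In the second option the marker's vector depends on the rest of the title only through the context term, which is why the toy encoder includes one.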
3. The method of claim 1, wherein determining the first title vector of the training text title based on the relevance among the words in the training text title comprises:
segmenting the training text title into words, and determining a word vector, a position vector and a part-of-speech vector of each word obtained by the segmentation by using a word vector analysis model;
determining the first title vector of the training text title by considering the relevance of each word in the training text title based on the word vector, the position vector, and the part-of-speech vector.
4. The method of claim 3, wherein determining the first title vector of the training text title by considering the relevance of each word in the training text title based on the word vector, the position vector, and the part-of-speech vector comprises:
performing convolution calculation in a convolution layer by adopting a preset number of convolution kernels based on the word vector, the position vector and the part-of-speech vector, and extracting local features in the training text title;
pooling the extracted local features, and carrying out nonlinear transformation on a pooling result to obtain a first title vector of the training text title.
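The convolution, pooling, and nonlinear transformation of claim 4 can be sketched in plain Python. The sketch assumes that the three per-word vectors are concatenated into one feature row per word, that pooling is max-over-time, and that the nonlinear transformation is `tanh`; none of these choices is fixed by the claim. The dimensions, kernel count, and random values are hypothetical.

```python
import math
import random


def concat_features(word_vecs, position_vecs, pos_tag_vecs):
    """One feature row per word: word vector + position vector + part-of-speech vector."""
    return [w + p + t for w, p, t in zip(word_vecs, position_vecs, pos_tag_vecs)]


def conv1d(rows, kernel, bias=0.0):
    """Slide one kernel (window x feature_dim) over the word sequence."""
    window = len(kernel)
    outputs = []
    for start in range(len(rows) - window + 1):
        total = bias
        for offset in range(window):
            total += sum(k * x for k, x in zip(kernel[offset], rows[start + offset]))
        outputs.append(total)
    return outputs


def first_title_vector(rows, kernels):
    """Convolve with each kernel, max-pool over positions, then apply tanh."""
    pooled = [max(conv1d(rows, kernel)) for kernel in kernels]
    return [math.tanh(value) for value in pooled]


rng = random.Random(0)


def random_vectors(count, dim):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(count)]


# A five-word title with 3-dim word, 2-dim position and 2-dim part-of-speech vectors.
rows = concat_features(random_vectors(5, 3), random_vectors(5, 2), random_vectors(5, 2))
# Four kernels (the "preset number"), each covering a window of two words.
kernels = [[[rng.uniform(-0.5, 0.5) for _ in range(7)] for _ in range(2)] for _ in range(4)]
title_vector = first_title_vector(rows, kernels)
```

Each kernel contributes one component, so the length of the resulting title vector equals the number of kernels. The title must contain at least as many words as the kernel window.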
5. The method of claim 1, wherein the target text title comprises a title of a network media text.
6. An inventory page identification device, comprising:
a first vector determination module, configured to determine a first title vector of a training text title based on the relevance among the words in the training text title;
a second vector determination module, configured to determine a second title vector of the training text title by using a preset model-based neural network language model, wherein the neural network language model determines different word vectors for the same word at different positions in the training text title;
a model training module, configured to take the first title vector and the second title vector as input and the inventory page labeling result of the training text title as output, train an inventory page recognition model, and determine, by using the inventory page recognition model, whether a target text title is an inventory page title;
wherein the inventory page is a page containing at least two events or topics; and
in the first title vector, the word vector representations of the same word at different positions in the training text title are the same; in the second title vector, the word vector representations of the same word at different positions in the training text title are different.
7. The apparatus of claim 6, wherein the second vector determination module is specifically configured to:
determine a word vector of each word in the training text title by using the preset language model, and combine the word vectors of the words to serve as the second title vector of the training text title; or
add an identification word at a specific position of the training text title, determine a word vector of the identification word in the training text title by using the preset language model, and take the word vector of the identification word as the second title vector of the training text title.
8. The apparatus of claim 6, wherein the first vector determination module comprises:
a word segmentation unit, configured to segment the training text title into words and determine a word vector, a position vector and a part-of-speech vector of each word obtained by the segmentation by using a word vector analysis model;
an association unit, configured to determine the first title vector of the training text title by considering the relevance of each word in the training text title based on the word vector, the position vector and the part-of-speech vector.
9. The apparatus of claim 8, wherein the association unit comprises:
a convolution calculation subunit, configured to perform convolution calculation in a convolution layer by using a preset number of convolution kernels based on the word vector, the position vector, and the part-of-speech vector, and extract local features in the training text title;
a pooling and nonlinear transformation subunit, configured to pool the extracted local features and perform nonlinear transformation on the pooling result to obtain the first title vector of the training text title.
10. The apparatus of claim 6, wherein the target text title comprises a title of a network media text.
11. A computing device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the inventory page identification method as recited in any one of claims 1-5.
12. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the inventory page identification method as claimed in any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910537402.8A CN110276001B (en) 2019-06-20 2019-06-20 Checking page identification method and device, computing equipment and medium

Publications (2)

Publication Number Publication Date
CN110276001A CN110276001A (en) 2019-09-24
CN110276001B true CN110276001B (en) 2021-10-08

Family

ID=67961334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910537402.8A Active CN110276001B (en) 2019-06-20 2019-06-20 Checking page identification method and device, computing equipment and medium

Country Status (1)

Country Link
CN (1) CN110276001B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737559B (en) * 2020-05-29 2024-05-31 北京百度网讯科技有限公司 Resource ordering method, method for training ordering model and corresponding device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101097580A (en) * 2007-06-20 2008-01-02 精实万维软件(北京)有限公司 Process for ordering network advertisement
CN102567290A (en) * 2010-12-30 2012-07-11 百度在线网络技术(北京)有限公司 Method, device and equipment for expanding short text to be processed
CN103020067A (en) * 2011-09-21 2013-04-03 北京百度网讯科技有限公司 Method and device for determining webpage type
CN103294781A (en) * 2013-05-14 2013-09-11 百度在线网络技术(北京)有限公司 Method and equipment used for processing page data
DE102016013372A1 (en) * 2016-01-13 2017-07-13 Adobe Systems Incorporated Image labeling with weak monitoring
CN108304379A (en) * 2018-01-15 2018-07-20 腾讯科技(深圳)有限公司 A kind of article recognition methods, device and storage medium
US10042880B1 (en) * 2016-01-06 2018-08-07 Amazon Technologies, Inc. Automated identification of start-of-reading location for ebooks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290634B (en) * 2008-06-03 2010-07-28 北京搜狗科技发展有限公司 Method for recognizing repeated miniature, device and its uses in search engine
CN107294834A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of method and apparatus for recognizing spam

Also Published As

Publication number Publication date
CN110276001A (en) 2019-09-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant