CN111966946A - Method, device, equipment and storage medium for identifying authority value of page - Google Patents

Info

Publication number
CN111966946A
Authority
CN
China
Prior art keywords
page
authority
feature
space
link
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010947270.9A
Other languages
Chinese (zh)
Inventor
郑小裕
刘昊
和为
刘准
何伯磊
李雅楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010947270.9A
Publication of CN111966946A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The application discloses a method, a device, equipment and a storage medium for identifying authority values of pages, and relates to the technical field of natural language processing and deep learning. The specific implementation scheme is as follows: acquiring page attribute features, features of the space to which a page belongs, and page chain index relationship features of the page; and inputting the page attribute features, the features of the space to which the page belongs and the page chain index relationship features into a pre-trained page authority value recognition model, and outputting the authority value of the page. The method can analyze the importance of pages and return high-quality, high-authority pages to the user, thereby helping the user make effective use of the pages.

Description

Method, device, equipment and storage medium for identifying authority value of page
Technical Field
The application relates to the technical field of computers, in particular to an artificial intelligence and deep learning technology, and specifically relates to a method, a device, equipment and a storage medium for identifying authority values of pages.
Background
With the development of internet technology, the trend toward online and electronic office work is clear. Taking an enterprise as an example, after years of operation an enterprise accumulates a large amount of experience and knowledge about production, research and development, operations and daily work, including regulations, project documents, work experience and the like, and this material, stored on some carrier, becomes part of the accumulated knowledge assets of the enterprise. Such documents can be created, collaboratively edited by multiple persons, saved and browsed in the form of electronic pages. A typical example is an enterprise-level wiki system, i.e., an authoring system in which multiple persons collaborate at the enterprise level.
When a large number of pages has accumulated, service functions such as page search and recommendation can be provided to the user, but these functions can supply appropriate knowledge to the user only if the value and importance of each page are known.
Therefore, a technical solution capable of effectively evaluating and showing the importance and value of the page is needed.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for identifying authority values of pages, so as to effectively identify and display the importance and the value of the pages.
In a first aspect, an embodiment of the present application provides a method for identifying a page authority value, where the method includes:
acquiring page attribute characteristics, space characteristics and page chain index relationship characteristics of a page;
and inputting the page attribute feature, the spatial feature to which the page belongs and the page chain index relation feature into a pre-trained page authority value recognition model, and outputting the authority value of the page.
In a second aspect, an embodiment of the present application provides an apparatus for identifying a page authority value, where the apparatus includes:
the characteristic acquisition module is used for acquiring page attribute characteristics, spatial characteristics of pages and page chain index relationship characteristics of the pages;
and the authority value determining module is used for inputting the page attribute characteristics, the spatial characteristics of the page and the page chain index relationship characteristics into a pre-trained page authority value identification model and outputting the authority value of the page.
In a third aspect, an embodiment of the present application provides an electronic device, including:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the method for identifying page authority values according to any embodiment of the present application.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for identifying authority values of pages according to any embodiment of the present application.
The embodiment of the application provides a method, a device, equipment and a storage medium for identifying a page authority value, and the method comprises the steps of obtaining page attribute characteristics, space characteristics to which a page belongs and page chain index relation characteristics of the page; and inputting the page attribute feature, the spatial feature to which the page belongs and the page chain index relation feature into a pre-trained page authority value recognition model, and outputting the authority value of the page. The method for identifying the authority value of the page can realize comprehensive and multidimensional importance analysis on the page, and feeds back the true high-quality and high-authority page to the user, thereby assisting the user in effectively utilizing the page.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1A is a schematic diagram illustrating a method for identifying authority values of pages according to an embodiment of the present application;
FIG. 1B is a schematic structural diagram of a page authority value recognition model according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a method for identifying authority values of pages according to an embodiment of the present application;
FIG. 3 is a schematic overall flowchart of a method for identifying authority values of pages according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an apparatus for identifying authority values of pages according to an embodiment of the present application;
FIG. 5 is a block diagram of an electronic device for implementing a method for identifying authority values of pages according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1A is a schematic diagram of a method for identifying authority values of pages according to an embodiment of the present application, where the embodiment is applicable to a case of identifying authority values of pages in an electronic document system, and the electronic document system may be any internet-based online document system, and is preferably a document system operated by multiple persons in a collaborative manner, such as a wiki system in an enterprise. The method for identifying a page authority value provided by this embodiment may be executed by the apparatus for identifying a page authority value provided by this embodiment of the present application, and the apparatus may be implemented in a software and/or hardware manner and integrated in an electronic device such as a server.
Referring to fig. 1A, the method of the present embodiment includes:
S110, acquiring page attribute features, features of the space to which the page belongs, and page chain index relationship features of the page.
In the specific embodiment of the application, page authority value identification can be performed for any internet-based electronic document system. Preferably, the page is a page written collaboratively by multiple persons, and the page is configured with at least one user operation authority; the space to which the page belongs is preferably an enterprise office space. A typical example is an enterprise-level wiki system: pages in the wiki system are created and operated on by users of the space to which they belong, and each page is configured with user operation permissions that control which editing operations users can perform on it. Any enterprise information can be recorded in a page. The page may take the form of an html page, but is not limited to html, and various attachments, plug-ins and the like can be inserted into it.
The embodiment of the application takes an enterprise-level wiki page as an example for explanation. In order to perform authority analysis on the wiki page, the page features of the wiki page, including the page attribute feature, the spatial feature to which the page belongs, and the page chain index relationship feature, need to be obtained.
The page attribute features describe the quality of the page content and can be subdivided into page behavior features, page structure features and page content features. The features of the space to which the page belongs describe the users of that space, such as which user created the page in the space and which users of the space have operation authority.
A wiki space is created by the organization or individual to which the space belongs, and the usage rights of the space are set according to the rights of the creating account and the rights granted to users after creation. Illustratively, an enterprise-level wiki has its own space-based organization, such as department spaces, team spaces, personal spaces, and the like.
A page link is an element that allows a viewer to jump between the current page and other pages or sites. It is a connection from one page to a target, which may be another page, a different location on the same page, a picture, an email address, a file, or even an application. The object used as a page link in a page may be a piece of text or a picture. The page chain index relationship refers to the link relationship between the current page and other pages, including in-links and out-links. An in-link means that the browser jumps from another page to the current page when the link is clicked. An out-link means that the browser jumps from the current page to another page when the link is clicked. The page chain index relationship reflects the jump relationship between the current page and other pages; for example, a large number of in-links indicates that the current page is more important, since it is referenced and jumped to by many other pages.
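As an illustration of the in-link and out-link notions described above, the following is a minimal sketch that counts both for every page from a list of links. The function name `link_counts` and the `(source, target)` edge representation are assumptions made for this example and are not taken from the application.

```python
from collections import Counter

def link_counts(edges):
    """Return (in_link_counts, out_link_counts) for a list of (source, target) links."""
    in_counts = Counter()
    out_counts = Counter()
    for source, target in edges:
        out_counts[source] += 1  # the source page gains one out-link
        in_counts[target] += 1   # the target page gains one in-link
    return in_counts, out_counts

# Hypothetical pages A, B, C, D: A and B both link to C, and C links to D.
edges = [("A", "C"), ("B", "C"), ("C", "D")]
in_counts, out_counts = link_counts(edges)
print(in_counts["C"], out_counts["C"])  # 2 1
```

A `Counter` returns 0 for pages that never appear as a target or a source, so pages without in-links or out-links need no special handling.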
And S120, inputting the page attribute characteristics, the spatial characteristics to which the page belongs and the page chain index relationship characteristics into a pre-trained page authority value recognition model, and outputting the authority value of the page.
In a specific embodiment of the application, the page attribute features, the features of the space to which the page belongs, and the page chain index relationship features are extracted through the above steps, and the feature values are processed and input into a pre-trained page authority value recognition model. The model predicts and scores the quality and importance of the page and outputs the authority value of the page. Optionally, the output authority value may be used as a page-authority feature in downstream stages such as web page search ranking, a knowledge center, knowledge recommendation, page screening optimization, page ranking, and the like.
In the specific embodiment of the application, the wiki page belongs to a hypertext system that is open on a network and can be collaboratively authored by multiple people. Members of the space to which a page belongs can freely create, modify and delete pages, and changes to pages in the space can be observed by visitors. The wiki page is configured with at least one user operation authority. Illustratively, for an enterprise wiki, the operation authority for a page in a team space belongs to all members of the team, and the operation authority for a page in a personal space belongs to the individual. In this embodiment, the space to which the page belongs is an enterprise office space, i.e., a wiki system for multi-user collaborative writing.
Optionally, the page authority value recognition model is a machine learning model, and preferably includes a vector mapping layer, a vector splicing module and at least one fully connected layer, which are connected in sequence.
In a specific embodiment of the present application, as shown in FIG. 1B, the page authority value recognition model includes a vector mapping layer (embedding layer, serving as the input layer), at least one fully connected layer (FC layer), and an output layer, which are connected in sequence. The vector mapping layer maps the discrete features among the page attribute features, the features of the space to which the page belongs and the page chain index relationship features into vectors. The vectors output by the vector mapping layer are then spliced (concatenated) with the continuous features among the page attribute features, the features of the space to which the page belongs and the page chain index relationship features to form a single input vector, which serves as the input of the first fully connected layer. The at least one fully connected layer applies weights to the input data, and the page authority value is output from the output layer. Each fully connected layer is composed of a plurality of neurons, for example 400 neural unit nodes per FC layer; the data of all nodes of the previous layer is input into each neuron of the current layer, and all outputs of the current layer serve as the input data of the next layer. The output layer may include two nodes to distinguish between high and low authority values; for authority value outputs with other value ranges, a different number of output nodes may be used. The parameters or weights of the vector mapping layer and the fully connected layers are obtained by training the model. In this embodiment, multiple experiments showed that three fully connected layers give the best results.
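The layer arrangement described above can be sketched as a forward pass in NumPy. This is only an illustration with random, untrained weights: the three 400-node FC layers and the two output nodes follow the text, while the vocabulary size, embedding width, feature counts, ReLU activation and softmax output are assumptions introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes; only HIDDEN = 400, three FC layers and two output nodes come from the text.
NUM_DISCRETE_VALUES = 50  # size of the discrete-feature vocabulary (assumed)
EMBED_DIM = 8             # width of each embedding vector (assumed)
NUM_DISCRETE = 4          # discrete features per page (assumed)
NUM_CONTINUOUS = 10       # normalized continuous features per page (assumed)
HIDDEN = 400
NUM_CLASSES = 2

# Vector mapping (embedding) table and FC weights, randomly initialized for the sketch.
embedding = rng.normal(0.0, 0.1, (NUM_DISCRETE_VALUES, EMBED_DIM))
sizes = [NUM_DISCRETE * EMBED_DIM + NUM_CONTINUOUS, HIDDEN, HIDDEN, HIDDEN, NUM_CLASSES]
layers = [(rng.normal(0.0, 0.05, (m, n)), np.zeros(n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(discrete_ids, continuous):
    """Map discrete ids to vectors, splice with continuous features, apply FC layers."""
    embedded = embedding[discrete_ids].reshape(-1)  # vector mapping layer
    x = np.concatenate([embedded, continuous])      # vector splicing
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)                  # ReLU between layers (assumed)
    exp = np.exp(x - x.max())                       # softmax over the output nodes (assumed)
    return exp / exp.sum()

probs = forward(np.array([3, 17, 0, 42]), rng.random(NUM_CONTINUOUS))
print(probs)  # two probabilities: low authority vs. high authority (ordering assumed)
```

In practice the weights would be learned by training rather than drawn at random, and a deep learning framework would normally replace the hand-written forward pass.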
According to the technical scheme provided by this embodiment, a method for analyzing page authority in an enterprise knowledge search scenario is realized by extracting features of the wiki page such as page quality, space characteristics and the chain index relationship between pages. This solves problems such as low authority score coverage and low discrimination that arise when a page importance analysis method based on link relationships is applied to an enterprise wiki scenario. The method enables importance analysis of enterprise-level wiki pages and returns genuinely high-quality, high-authority pages to the user, thereby helping the user make effective use of the pages.
The embodiment of the application refines the features on the basis of the above embodiment. The page attribute features, the features of the space to which the page belongs, and the page chain index relationship features of the page are explained in detail below.
Firstly, the page attribute characteristics comprise page behavior characteristics, page structure characteristics and page content characteristics.
In a specific embodiment of the present application, the page attribute feature is used to describe a quality status of a page, wherein the page attribute feature can be further subdivided into a page behavior feature, a page structure feature and a page content feature. The page behavior feature is statistical data of at least one operation behavior executed on the page by a user. The page structure feature is document structure data that the page includes that conforms to at least one protocol or format. The page content characteristics are characteristics of the content included in the page.
Optionally, the statistical data of the operation behavior includes at least one of: the number of likes and the number of comments. The page structure features include at least one of: the number of plug-ins included in the page, the number of attachments included in the page, the number of lists included in the page, the number of titles included in the page, the number of tables included in the page, the number of table rows in the page, whether a directory plug-in exists, whether a sub-page plug-in exists, and html structural features. The page content features include at least one of: whether preset positive sample keywords are hit, whether preset negative sample keywords are hit, title word segments, body text length, number of body paragraphs, number of pictures included in the page, the page address level of the page in the space to which it belongs, and title tag attributes.
In a specific embodiment of the present application, the operation behaviors underlying the page behavior features are interaction behaviors between users, and the number of likes and the number of comments are continuous features. The page structure features describe the document structure data of the page: the number of plug-ins, attachments, lists, titles and tables included in the page and the number of table rows in the page are continuous features, while whether a directory plug-in exists and whether a sub-page plug-in exists are discrete features. The page content features describe the content contained in the page: the title word segments and the body text word segments are text features; the body text length, the number of body paragraphs, the number of pictures included in the page, the page address level of the page in the space to which it belongs, and the title tag attributes are continuous features; whether preset positive sample keywords are hit and whether preset negative sample keywords are hit are discrete features. The page attribute features are thus refined by feature design into specific continuous, discrete or text features, and their values are extracted for data processing. The advantage of this arrangement is that the needs of the user can be identified more clearly, and the accuracy, standardization and authority of the knowledge found by the user are ensured.
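Some of the continuous page structure features listed above could, for an html page, be counted with the Python standard library. The sketch below is one possible way to do this; the class name `StructureCounter`, the feature keys and the choice of tags are assumptions, and plug-ins and attachments are omitted because their markup depends on the wiki system.

```python
from html.parser import HTMLParser

class StructureCounter(HTMLParser):
    """Count a few structural elements of an html page."""

    def __init__(self):
        super().__init__()
        self.counts = {"tables": 0, "table_rows": 0, "lists": 0, "titles": 0, "pictures": 0}

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.counts["tables"] += 1
        elif tag == "tr":
            self.counts["table_rows"] += 1
        elif tag in ("ul", "ol"):
            self.counts["lists"] += 1
        elif tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.counts["titles"] += 1
        elif tag == "img":
            self.counts["pictures"] += 1

def structure_features(html):
    """Return the structural counts for one html document."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    return parser.counts

# Hypothetical page body used only to exercise the counter.
sample = (
    "<h1>Guide</h1><ul><li>a</li></ul>"
    "<table><tr><td>1</td></tr><tr><td>2</td></tr></table>"
    "<img src='x.png'>"
)
features = structure_features(sample)
print(features)
```

The resulting counts are continuous features and would be normalized before being fed to the model, whereas yes/no features such as the presence of a directory plug-in would be treated as discrete.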
Secondly, the space to which the page belongs is the user organization space to which the users of the page belong, and the features of the space to which the page belongs include at least one of the following: the importance level feature of the creating user of the page, whether the page is the default home page, and the space operation permission configured for the page.
In a specific embodiment of the present application, the space to which the page belongs is the user organization space to which the users of the page belong, and is set by the user at creation time. The importance level feature of the creating user, whether the page is the default home page, and the space operation permission configured for the page are discrete features. In the user organization space, if a page is set as the default home page or the space home page, the feature carries a higher weight. The space operation permission of a team space is held by the team members; the space operation permission of a personal space is held by the individual.
Optionally, the importance level feature includes, from high to low importance, a team space level and a personal space level, and the team space level includes at least one level corresponding to the level of the team.
In a specific embodiment of the present application, there is a one-to-one correspondence between the space to which the page belongs and the enterprise team users. The importance of the team carries over to the page: the higher the level of a team, the higher the authority of the corresponding team space, the higher the team space level, and the higher the weight of the feature. Similarly, a personal space is less authoritative, has a lower space level, and its features carry less weight than those of a team space. Dividing the features of the space to which the page belongs by importance allows the authority level of a page to be determined from the importance level of its space, so that the authority of enterprise-level wiki pages can be determined more clearly.
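Because the space features are discrete, they can be encoded as integer ids before being passed to a vector mapping layer. The following sketch shows one such encoding; the level names, the `SpaceInfo` fields and the two-valued permission are hypothetical and serve only to illustrate the ordering of personal and team space levels.

```python
from dataclasses import dataclass

# Hypothetical importance levels, ordered from low to high; the application only
# states that team space levels rank above the personal space level.
IMPORTANCE_LEVELS = ["personal", "team_low", "team_mid", "team_high"]
PERMISSIONS = ["personal", "team"]  # who holds the space operation permission (assumed values)

@dataclass
class SpaceInfo:
    creator_level: str     # importance level of the creating user's space
    is_default_home: bool  # whether the page is the default home page
    permission: str        # space operation permission configured for the page

def encode_space(info):
    """Turn the discrete space features into integer ids."""
    return [
        IMPORTANCE_LEVELS.index(info.creator_level),
        int(info.is_default_home),
        PERMISSIONS.index(info.permission),
    ]

ids = encode_space(SpaceInfo("team_high", True, "team"))
print(ids)  # [3, 1, 1]
```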
Thirdly, the page chain index relationship features include an in-link feature and an out-link feature of the page.
In a specific embodiment of the present application, the page chain index relationship features include an in-link feature and an out-link feature of the page. The in-link feature and the out-link feature each include a number of links and a link weight. The number of links refers to the number of in-links through which the browser jumps from other pages to the current page, or the number of out-links through which the browser jumps from the current page to other pages. The link weight refers to the weight value of the link feature. The more in-links the current page has, the higher the authority of the current page and the higher the weight value of the corresponding chain index relationship; the more out-links the current page has, the lower the authority of the current page and the lower the weight value of the chain index relationship. In this embodiment, edges between wiki pages in the same space are removed when the link graph is constructed, because pages in the same space link to one another heavily, which would impair the discrimination of space authority. The advantage of this arrangement is that the importance of the current page can be determined from the number of links and the link weights, ensuring the accuracy, standardization and authority of the knowledge found by the user.
Optionally, the link weights of the in-link feature and the out-link feature each include the space weight of the space to which the page pointed to by the link belongs.
In an embodiment of the present application, the link weight of the in-link feature refers to the space weight of the space to which the page pointed to by the in-link belongs: the higher the authority of that space, the higher the weight value of the chain index relationship of the in-link. The link weight of the out-link feature refers to the space weight of the space to which the page pointed to by the out-link belongs: the higher the authority of that space, the higher the weight value of the chain index relationship of the out-link. Optionally, when an in-link or out-link of the current page connects to another page that belongs to the same space as the current page, the mutual influence within the same space is small, so the weight of the chain index relationship is reduced. In this embodiment, the space weight of the space to which the link points is taken into account. The advantage of this arrangement is that, through the chain index relationships of pages, the authority of the current page can draw on the authority of the spaces that its links point to; the more chain index relationships exist among pages, the more spaces the authority of the current page can draw on, and the more accurate the authority of the current page becomes.
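The link count and link weight features can be sketched as follows. This is an illustration under stated assumptions, not the computation of the application: edges between pages of the same space are dropped as described above, the weight of each remaining link is taken from the space of the page at the other end of the link, and the weights are simply summed. The function name and the example spaces and weights are hypothetical.

```python
def weighted_link_features(page, edges, page_space, space_weight):
    """Compute in-link/out-link counts and space-weighted sums for one page."""
    in_count = out_count = 0
    in_weight = out_weight = 0.0
    for source, target in edges:
        if page_space[source] == page_space[target]:
            continue  # remove edges between pages in the same space
        if target == page:
            in_count += 1
            in_weight += space_weight[page_space[source]]   # weight from the linking page's space
        elif source == page:
            out_count += 1
            out_weight += space_weight[page_space[target]]  # weight from the linked page's space
    return {"in_count": in_count, "in_weight": in_weight,
            "out_count": out_count, "out_weight": out_weight}

# Hypothetical spaces and weights.
page_space = {"A": "team1", "B": "personal_x", "C": "team2", "D": "team2"}
space_weight = {"team1": 1.0, "team2": 0.8, "personal_x": 0.2}
edges = [("A", "C"), ("B", "C"), ("D", "C"), ("C", "A")]

link_features = weighted_link_features("C", edges, page_space, space_weight)
print(link_features)
```

In this example the link from D to C is ignored because both pages are in the same space, so page C ends up with two in-links and one out-link.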
According to the technical scheme provided by the embodiment, the feature design is refined on the basis of the page attribute feature, the spatial feature to which the page belongs and the page chain index relationship feature, and the feature values are extracted for data processing, so that the problem of how to fully and effectively utilize, extract and organize the features for authority analysis is solved, the authority score coverage rate and the discrimination of knowledge search in an enterprise-level wiki scene are improved, and the accuracy, the standard and the authority of knowledge searched by a user can be ensured.
Fig. 2 is a schematic diagram of a method for identifying a page authority value according to an embodiment of the present application. The embodiment of the application is based on the above embodiment, and further adds the processes of sample training and model building, and the embodiment explains the identification process of the page authority value in detail.
Referring to fig. 2, the method of the present embodiment includes:
S210, obtaining original pages, and classifying the authority levels of the original pages according to a preset page authority value identification rule.
In a specific embodiment of the application, the preset page authority value identification rule is a sample preprocessing step applied before samples are labeled with authority values; its purpose is to reduce the workload and cost of manual labeling. Optionally, the preset rule may be a rule for identifying high authority level samples, a rule for identifying low authority level samples, or a rule for identifying both. An original page is a page that has not yet been processed by the preset page authority value identification rule. According to the rule, the original pages are classified into a high authority level and a low authority level. For example, several specific rules may be set: the number of in-links reaches a set threshold; the space to which the page belongs is not a personal space; and the like.
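A rule-based pre-classification of this kind could look like the sketch below. The threshold value, the dictionary keys and the low-authority rule are assumptions made for illustration; the application only gives the in-link threshold and the non-personal space as example rules.

```python
IN_LINK_THRESHOLD = 5  # hypothetical threshold; the application does not give a value

def preclassify(page):
    """Return 'high', 'low', or None when no rule applies to the page."""
    if page["in_links"] >= IN_LINK_THRESHOLD and page["space_type"] != "personal":
        return "high"
    # Assumed counterpart rule for low authority candidates.
    if page["in_links"] == 0 and page["space_type"] == "personal":
        return "low"
    return None

pages = [
    {"id": "p1", "in_links": 12, "space_type": "team"},
    {"id": "p2", "in_links": 0, "space_type": "personal"},
    {"id": "p3", "in_links": 2, "space_type": "team"},
]
# Pages that match neither rule are not kept as candidates for manual labeling.
candidates = {p["id"]: preclassify(p) for p in pages if preclassify(p) is not None}
print(candidates)  # {'p1': 'high', 'p2': 'low'}
```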
S220, taking the original pages with authority levels respectively conforming to the high authority levels and the low authority levels as candidate pages, and carrying out authority value marking.
In a specific embodiment of the present application, a candidate page is a page that has been identified by the preset page authority value identification rule. The original pages whose authority levels conform to the high authority level or the low authority level are taken as candidate pages and labeled with authority values. Authority value labeling means manually labeling the authority of a page according to its page features, distinguishing a high authority level from a low authority level. Illustratively, if the space of the page is an authoritative space and the page quality is high, the page is labeled as high authority level; if the space is a personal space and the page quality is low, the page is labeled as low authority level. Optionally, the high authority level and low authority level samples are subjected to appropriate random oversampling and random downsampling so that their ratio is about 1:1.
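The balancing step can be sketched with the standard `random` module. The function name `balance`, the common `target` size and the fixed seed are assumptions; the application only states that random oversampling and random downsampling bring the ratio to about 1:1.

```python
import random

def balance(high, low, target, seed=0):
    """Resample both classes to `target` items each, giving a 1:1 ratio."""
    rng = random.Random(seed)

    def resample(samples):
        if len(samples) >= target:
            return rng.sample(samples, target)  # random downsampling, without replacement
        extra = rng.choices(samples, k=target - len(samples))
        return samples + extra                  # random oversampling, with replacement

    return resample(high), resample(low)

# Hypothetical sample ids: 3 high authority pages and 20 low authority pages.
high = ["h1", "h2", "h3"]
low = [f"l{i}" for i in range(20)]
balanced_high, balanced_low = balance(high, low, target=10)
print(len(balanced_high), len(balanced_low))  # 10 10
```

Oversampling repeats existing samples, so it should be applied only to the training portion of the data to avoid duplicates leaking into the validation set.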
S230, acquiring the page attribute feature, the spatial feature to which the page belongs and the page chain index relationship feature of each candidate page, together with the authority value labeling result, as a training sample.
In a specific embodiment of the application, the page attribute feature, the spatial feature to which the page belongs, the page chain index relationship feature and the authority value labeling result of each candidate page are obtained and used as a training sample. The authority value labeling result includes a high authority level and a low authority level. Optionally, the authority value labeling result may instead have five levels, for example: a high authority level, a higher authority level, a medium authority level, a lower authority level, and a low authority level. The data needs to be preprocessed before the training samples are input into the page authority value recognition model. The data preprocessing includes two steps. First, the training samples are divided into a training set and a verification set at a ratio of 9:1. Second, the continuous features are normalized; for each feature, the maximum value used in normalization is not the global maximum but the feature value at the 90% position after sorting (roughly the 90th percentile), which limits the influence of extreme values.
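The two preprocessing steps can be sketched as follows. The exact index convention for "the 90% position" and the clipping of larger values to 1.0 are assumptions; the text only states that the value at the 90% position after sorting replaces the global maximum.

```python
import random


def split_train_val(samples, val_ratio=0.1, seed=0):
    """Shuffle and split samples into (training set, verification set), 9:1 by default."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_ratio))
    return shuffled[n_val:], shuffled[:n_val]


def p90_max(values):
    """Value at the 90% position of the sorted list (assumed convention: ceil(0.9 * n) - 1)."""
    ordered = sorted(values)
    idx = max(0, (9 * len(ordered) + 9) // 10 - 1)  # integer ceil(0.9 * n) - 1
    return ordered[idx]


def normalize(values, cap=None):
    """Scale a continuous feature by its 90%-position value; larger values are clipped to 1.0."""
    cap = p90_max(values) if cap is None else cap
    if cap <= 0:
        return [0.0 for _ in values]
    return [min(v / cap, 1.0) for v in values]
```

For a feature with values 1 to 9 plus one outlier of 1000, the cap is 9 rather than 1000, so the ordinary values still spread over the range (0, 1] instead of being squeezed toward zero.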
S240, inputting the training sample into the page authority value recognition model, and performing model training by adopting a cross entropy loss function.
In a specific embodiment of the present application, a cross entropy loss function is used to describe the loss on the samples. Optionally, the effect of the model can be evaluated with two common classification metrics, ACC (accuracy) and AUC (Area Under the Curve). ACC is the ratio of the number of correctly classified samples to the total number of samples. AUC is the area enclosed between the ROC curve and the coordinate axes; for a useful classifier its value generally lies in [0.5, 1], where 0.5 corresponds to random guessing, and the closer the AUC is to 1.0, the better the effect. The preprocessed training samples are input into the page authority value recognition model, and model training is performed using the cross entropy loss function. After training, the effect of the page authority value recognition model is evaluated; if the result is not good, the process returns to feature design for modification until the evaluation result is good. Optionally, the page authority value recognition model may perform online prediction and scoring of the quality and authority of wiki pages, and provide the output quality and authority scores, as features related to page importance, to modules such as web page search and ranking, knowledge center, knowledge recommendation, page screening optimization, and page ranking. The page authority value recognition model may be as described with reference to FIG. 1B.
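For the two-level case (high versus low authority), the loss and the two metrics mentioned above can be computed as in this minimal sketch. The function names and the 0.5 decision threshold are assumptions; the AUC is computed through its pairwise-ranking interpretation, which is equivalent to the area under the ROC curve.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """ACC: correctly classified samples divided by all samples."""
    correct = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return correct / len(y_true)


def auc(y_true, y_prob):
    """AUC: fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0 for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))
```

The `auc` helper assumes that both classes are present in the evaluated set; otherwise the pair count is zero and the metric is undefined.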
Specifically, Fig. 3 is a schematic overall flowchart of the method for identifying page authority values according to this embodiment. Referring to Fig. 3, the overall process of the method is explained in detail below.
S310, feature design: carrying out detailed feature design for the page attribute feature, the spatial feature to which the page belongs and the page chain index relationship feature of wiki pages.
S320, sample construction: classifying the original pages according to a preset page authority value identification rule and labeling their authority values.
S330, constructing a page authority value identification model: carrying out importance analysis on pages, and establishing a model for identifying page authority.
S340, model training: inputting the training samples into the page authority value recognition model, and performing model training using a cross entropy loss function.
S350, effect evaluation: judging whether the page authority value identification model performs well.
If yes, executing S360; if not, returning to S310.
S360, online prediction: performing online prediction and scoring of the quality and authority of wiki pages using the page authority value identification model.
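The control flow of steps S310 to S360 can be summarized in a short sketch. Every callable passed in is a hypothetical placeholder for the corresponding step, and the `good_enough` score and `max_rounds` limit are assumptions; the text itself only says to return to S310 until the evaluation is good.

```python
from typing import Any, Callable, Tuple


def run_pipeline(
    design_features: Callable[[int], Any],
    build_samples: Callable[[Any], Any],
    build_model: Callable[[Any], Any],
    train: Callable[[Any, Any], None],
    evaluate: Callable[[Any, Any], float],
    good_enough: float = 0.9,
    max_rounds: int = 5,
) -> Tuple[Any, int]:
    """Repeat S310-S350 until the evaluation score is acceptable, then return the model."""
    for round_no in range(1, max_rounds + 1):
        features = design_features(round_no)  # S310: feature design
        samples = build_samples(features)     # S320: sample construction
        model = build_model(features)         # S330: model construction
        train(model, samples)                 # S340: model training
        score = evaluate(model, samples)      # S350: effect evaluation
        if score >= good_enough:
            return model, round_no            # model is ready for S360
    raise RuntimeError("no acceptable model within max_rounds")
```

The returned model would then be used for online prediction in S360.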
According to the technical scheme provided by this embodiment, the original pages are labeled with authority values and then used as training samples, the training samples are input into the page authority value recognition model, and a cross entropy loss function is used for model training. This yields a page authority analysis and modeling scheme, addresses issues such as sample labeling, training and data processing, and provides a page authority analysis model suitable for enterprise-level wiki knowledge search.
Fig. 4 is a schematic diagram of an apparatus for identifying a page authority value according to an embodiment of the present application, and as shown in fig. 4, the apparatus may include:
The feature obtaining module 410 is configured to obtain a page attribute feature, a spatial feature to which the page belongs, and a page chain index relationship feature of a page.
The authority value determining module 420 is configured to input the page attribute feature, the spatial feature to which the page belongs, and the page chain index relationship feature into a pre-trained page authority value identification model, and output the authority value of the page.
According to the technical scheme provided by this embodiment, the page attribute feature, the spatial feature to which the page belongs and the page chain index relationship feature of the page are acquired and input into a pre-trained page authority value identification model, and the authority value of the page is output. This solves problems such as low authority score coverage and low discrimination that arise when a page importance analysis method based on link relationships is applied to an enterprise wiki scenario. Through importance analysis of enterprise-level wiki pages, genuinely high-quality and high-authority pages are returned to the user, thereby improving the user's search experience.
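The division into modules 410 and 420 might be mirrored in code as follows. This is a structural sketch only: the `PageFeatures` container, the in-memory feature store and the representation of the trained model as a plain callable are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PageFeatures:
    attribute: List[float]  # page attribute feature
    space: List[float]      # spatial feature to which the page belongs
    link: List[float]       # page chain index relationship feature


class FeatureObtainingModule:
    """Counterpart of module 410: looks up the three feature groups of a page."""

    def __init__(self, store: Dict[str, PageFeatures]):
        self._store = store  # hypothetical precomputed feature store

    def obtain(self, page_id: str) -> PageFeatures:
        return self._store[page_id]


class AuthorityValueDeterminingModule:
    """Counterpart of module 420: feeds the features to a pre-trained model."""

    def __init__(self, model: Callable[[List[float]], float]):
        self._model = model  # any callable mapping a feature vector to a score

    def determine(self, features: PageFeatures) -> float:
        vector = features.attribute + features.space + features.link
        return self._model(vector)
```

Keeping the model behind a callable means the same two modules work regardless of how the page authority value identification model is implemented.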
Optionally, the page is a multi-user collaborative writing page, and the page is configured with an operation right of at least one user to the page; the space to which the page belongs is an enterprise office space.
Optionally, the page attribute feature includes a page behavior feature, a page structure feature and a page content feature; the page behavior feature is statistical data of at least one operation behavior executed on the page by a user; the page structure characteristic is document structure data which is included in the page and conforms to at least one protocol or format; the page content features are features of content included in the page.
Optionally, the statistical data of the operation behavior includes at least one of: the number of likes and the number of comments; the page structure features include at least one of: the number of plug-ins included in the page, the number of attachments included in the page, the number of lists included in the page, the number of titles included in the page, the number of tables included in the page, the number of table rows in the page, whether a directory plug-in exists, whether a sub-page plug-in exists, and html structural features; the page content features include at least one of: whether preset positive sample keywords are hit, whether preset negative sample keywords are hit, title word segmentation, text length, number of text paragraphs, number of pictures included in the page, the page address hierarchy of the page in the space to which the page belongs, and title tag attributes.
Optionally, the space to which the page belongs is a user organization space to which a user of the page belongs, and the spatial feature to which the page belongs includes at least one of: an importance level feature of the user who created the page, whether the page is the default home page, and the space operation permission configured for the page.
Optionally, the importance level feature includes, in descending order of importance, a team space level and a personal space level, where the team space level includes at least one level corresponding to a team level.
Optionally, the page chain index relationship feature includes: an in-link feature and an out-link feature of the page, wherein each of the in-link feature and the out-link feature includes a link quantity and a link weight.
Optionally, the link weight of each of the in-link feature and the out-link feature includes a space weight of the space to which the page pointed to by the link belongs.
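The in-link and out-link features could be computed as in the following sketch. It assumes links are given as (source, target) pairs and that each link is weighted by the space weight of the page at its other end, which is one possible reading of the text; the function name, the example space weights and the `default_weight` fallback are hypothetical.

```python
from typing import Dict, List, Tuple


def link_features(
    page_id: str,
    links: List[Tuple[str, str]],
    page_space: Dict[str, str],
    space_weight: Dict[str, float],
    default_weight: float = 0.0,
) -> Dict[str, float]:
    """Return link quantity and summed link weight for in-links and out-links."""

    def weight(other_page: str) -> float:
        # Space weight of the space the other page belongs to (assumed lookup).
        return space_weight.get(page_space.get(other_page, ""), default_weight)

    in_pages = [src for src, dst in links if dst == page_id]
    out_pages = [dst for src, dst in links if src == page_id]
    return {
        "in_count": float(len(in_pages)),
        "in_weight": float(sum(weight(p) for p in in_pages)),
        "out_count": float(len(out_pages)),
        "out_weight": float(sum(weight(p) for p in out_pages)),
    }
```

Under this reading, a link involving a team-space page contributes more to the feature than one involving a personal-space page, so the link quantity alone does not determine the score.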
Optionally, the apparatus further includes a model training module, where the model training module includes: the system comprises an original page training unit, a candidate page labeling unit, a training sample generating unit and a model training unit, wherein:
The original page training unit is configured to classify the authority levels of the original pages according to a preset page authority value identification rule.
The candidate page labeling unit is configured to take the original pages whose authority levels conform to the high authority level or the low authority level as candidate pages and label their authority values.
The training sample generating unit is configured to acquire the page attribute feature, the spatial feature to which the page belongs and the page chain index relationship feature of each candidate page, together with the authority value labeling result, as a training sample; the authority value labeling result includes a high authority level and a low authority level.
The model training unit is configured to input the training samples into the page authority value identification model and perform model training using a cross entropy loss function.
The device for identifying the authority value of the page provided by the embodiment can be applied to the method for identifying the authority value of the page provided by any embodiment, and has corresponding functions and beneficial effects.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in Fig. 5, the electronic device includes a processor 510, a storage device 520, and a communication device 530. The number of processors 510 in the electronic device may be one or more, and one processor 510 is taken as an example in Fig. 5. The processor 510, the storage device 520 and the communication device 530 in the electronic device may be connected by a bus or in other ways, and connection by a bus is taken as an example in Fig. 5.
The storage device 520 is a computer-readable storage medium and can be used to store software programs, computer-executable programs, and modules, such as the modules corresponding to the page authority value identification method in the embodiment of the present application (for example, the feature obtaining module 410 and the authority value determining module 420 in the identification device for page authority values). The processor 510 executes various functional applications and data processing of the electronic device by executing software programs, instructions and modules stored in the storage device 520, that is, implements the above-mentioned identification method of the page authority value.
The storage device 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the storage device 520 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the storage device 520 may further include memory located remotely from the processor 510, which may be connected to the electronic device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The communication device 530 is configured to implement a network connection or a mobile data connection between servers.
The electronic device provided by the embodiment can be used for executing the method for identifying the authority value of the page provided by any embodiment, and has corresponding functions and beneficial effects.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a method for identifying a page authority value according to any embodiment of the present application, where the method specifically includes:
acquiring a page attribute feature, a spatial feature to which the page belongs and a page chain index relationship feature of a page;
and inputting the page attribute feature, the spatial feature to which the page belongs and the page chain index relation feature into a pre-trained page authority value recognition model, and outputting the authority value of the page.
Of course, the storage medium provided in the embodiments of the present application contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the method for identifying page authority values provided in any embodiment of the present application.
From the above description of the embodiments, it is obvious for those skilled in the art that the present application can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
It should be noted that, in the embodiment of the apparatus for identifying a page authority value, each unit and each module included in the apparatus is only divided according to functional logic, but is not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the application.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (13)

1. A method for identifying an authority value of a page, comprising:
acquiring a page attribute feature, a spatial feature to which the page belongs and a page chain index relationship feature of a page;
inputting the page attribute feature, the spatial feature to which the page belongs and the page chain index relationship feature into a pre-trained page authority value recognition model, and outputting the authority value of the page.
2. The method of claim 1, wherein:
the page is a multi-person collaborative writing page, and the page is configured with at least one user operation authority for the page;
the space to which the page belongs is an enterprise office space.
3. The method of claim 1, wherein the page attribute features include page behavior features, page structure features, and page content features;
the page behavior feature is statistical data of at least one operation behavior executed on the page by a user;
the page structure characteristic is document structure data which is included in the page and conforms to at least one protocol or format;
the page content features are features of content included in the page.
4. The method of claim 3, wherein:
the statistical data of the operation behavior comprises at least one of: the number of likes and the number of comments;
the page structure features include at least one of: the number of plug-ins included in the page, the number of attachments included in the page, the number of lists included in the page, the number of titles included in the page, the number of tables included in the page, the number of table rows in the page, whether a directory plug-in exists, whether a sub-page plug-in exists, and html structural features;
the page content features include at least one of: whether preset positive sample keywords are hit, whether preset negative sample keywords are hit, title word segmentation, text length, number of text paragraphs, number of pictures included in the page, the page address hierarchy of the page in the space to which the page belongs, and title tag attributes.
5. The method of claim 1, wherein the space to which the page belongs is a user organization space to which a user of the page belongs, and the spatial feature to which the page belongs includes at least one of: an importance level feature of the user who created the page, whether the page is the default home page, and the space operation permission configured for the page.
6. The method of claim 5, wherein the importance level characteristics include, by importance, a team spatial level and an individual spatial level from high to low, the team spatial level including at least one level corresponding to a team level.
7. The method of claim 1, wherein the page chain index relationship feature comprises: an in-link feature and an out-link feature of the page, wherein each of the in-link feature and the out-link feature comprises a link quantity and a link weight.
8. The method of claim 7, wherein the link weight of each of the in-link feature and the out-link feature comprises a space weight of the space to which the page pointed to by the link belongs.
9. The method of claim 1, wherein the method further comprises a training process of the page authority value recognition model, the training process comprising:
acquiring an original page, and classifying authority levels of the original page according to a preset page authority value identification rule;
taking the original pages with authority levels respectively conforming to the high authority levels and the low authority levels as candidate pages, and carrying out authority value marking;
acquiring the page attribute feature, the spatial feature to which the page belongs and the page chain index relationship feature of each candidate page, together with the authority value labeling result, as a training sample; the authority value labeling result comprises a high authority level and a low authority level;
and inputting the training sample into the page authority value recognition model, and performing model training by adopting a cross entropy loss function.
10. The method of claim 1 or 9, wherein the page authority value identification model comprises: a vector mapping layer, at least one fully connected layer and an output layer which are connected in sequence;
the vector mapping layer is used for performing vector mapping on the discrete page attribute feature, spatial feature to which the page belongs and page chain index relationship feature;
the at least one fully connected layer is used for performing weight processing on the input data to output the page authority value.
11. An apparatus for identifying authority values of pages, comprising:
a feature obtaining module, configured to acquire a page attribute feature, a spatial feature to which the page belongs and a page chain index relationship feature of a page;
an authority value determining module, configured to input the page attribute feature, the spatial feature to which the page belongs and the page chain index relationship feature into a pre-trained page authority value identification model and output the authority value of the page.
12. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for identifying a page authority value according to any one of claims 1-10.
13. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method for identifying authority values of pages according to any one of claims 1 to 10.
CN202010947270.9A 2020-09-10 2020-09-10 Method, device, equipment and storage medium for identifying authority value of page Pending CN111966946A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010947270.9A CN111966946A (en) 2020-09-10 2020-09-10 Method, device, equipment and storage medium for identifying authority value of page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010947270.9A CN111966946A (en) 2020-09-10 2020-09-10 Method, device, equipment and storage medium for identifying authority value of page

Publications (1)

Publication Number Publication Date
CN111966946A true CN111966946A (en) 2020-11-20

Family

ID=73392865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010947270.9A Pending CN111966946A (en) 2020-09-10 2020-09-10 Method, device, equipment and storage medium for identifying authority value of page

Country Status (1)

Country Link
CN (1) CN111966946A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030225750A1 (en) * 2002-05-17 2003-12-04 Xerox Corporation Systems and methods for authoritativeness grading, estimation and sorting of documents in large heterogeneous document collections
US20100057717A1 (en) * 2008-09-02 2010-03-04 Parashuram Kulkami System And Method For Generating A Search Ranking Score For A Web Page
CN102298583A (en) * 2010-06-22 2011-12-28 腾讯科技(深圳)有限公司 Method and system for evaluating webpage quality of electronic bulletin board
CN102541947A (en) * 2010-12-31 2012-07-04 百度在线网络技术(北京)有限公司 Method and equipment for updating authority score of webpage based on friefox event
CN102541949A (en) * 2010-12-31 2012-07-04 百度在线网络技术(北京)有限公司 Method and equipment for determining authority values on basis of preset link relation of pages
CN103116660A (en) * 2013-03-15 2013-05-22 人民搜索网络股份公司 Method and device for acquiring website authority values
CN108460158A (en) * 2018-03-28 2018-08-28 天津大学 Differentiation Web page sequencing method based on PageRank
CN110866170A (en) * 2019-10-18 2020-03-06 中国科学院信息工程研究所 Importance evaluation method, search method and system for Tor darknet service based on site quality
CN111475750A (en) * 2020-03-04 2020-07-31 百度在线网络技术(北京)有限公司 Page preloading control method, device, system, equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554762A (en) * 2021-06-25 2021-10-26 广东技术师范大学 Short video style image generation method, device and system based on deep learning
CN113554762B (en) * 2021-06-25 2023-12-29 广州市粤拍粤精广告有限公司 Short video style image generation method, device and system based on deep learning
CN114416513A (en) * 2022-03-25 2022-04-29 百度在线网络技术(北京)有限公司 Processing method and device for search data, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111444428B (en) Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
Giannoulakis et al. Evaluating the descriptive power of Instagram hashtags
CN106599022B (en) User portrait forming method based on user access data
WO2019043381A1 (en) Content scoring
CN110888990B (en) Text recommendation method, device, equipment and medium
EP2369505A1 (en) Text classifier system
CN111310476B (en) Public opinion monitoring method and system using aspect-based emotion analysis method
CN110968684A (en) Information processing method, device, equipment and storage medium
CN112183056A (en) Context-dependent multi-classification emotion analysis method and system based on CNN-BilSTM framework
CN111625715A (en) Information extraction method and device, electronic equipment and storage medium
CN111966946A (en) Method, device, equipment and storage medium for identifying authority value of page
US20130019163A1 (en) System
CN114238573A (en) Information pushing method and device based on text countermeasure sample
CN108241867A (en) A kind of sorting technique and device
CN111709225A (en) Event cause and effect relationship judging method and device and computer readable storage medium
Sperrle et al. Learning Contextualized User Preferences for Co‐Adaptive Guidance in Mixed‐Initiative Topic Model Refinement
CN113569118A (en) Self-media pushing method and device, computer equipment and storage medium
Bucur Opinion Mining platform for Intelligence in business
Kotenko et al. The intelligent system for detection and counteraction of malicious and inappropriate information on the Internet
Kamel et al. Robust sentiment fusion on distribution of news
Masri et al. A novel approach for Arabic business email classification based on deep learning machines
CN115130453A (en) Interactive information generation method and device
CN114547435A (en) Content quality identification method, device, equipment and readable storage medium
Kuzar et al. Slovak blog clustering enhanced by mining the web comments
JP5129082B2 (en) Citation determination method and reputation extraction method using the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination