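WO2019057191A1 - Content retrieval method, terminal, server, electronic device, and storage medium - Google Patents

Content retrieval method, terminal, server, electronic device, and storage medium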

Info

Publication number
WO2019057191A1
Authority
WO
WIPO (PCT)
Prior art keywords
content
page
retrieval
entity
knowledge map
Prior art date
Application number
PCT/CN2018/107273
Other languages
English (en)
French (fr)
Inventor
金刚铭
叶骏
徐羽
范跃伟
胡博
李未
周疏影
王剑
钭伟雨
刘秀芳
吕雪
何枫
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2019057191A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of data processing, and in particular, to a content retrieval method, a terminal, a server, an electronic device, and a storage medium.
  • A search engine can present content entities related to a keyword on its results page, for example by means of a knowledge map, to help the user learn about the content.
  • Embodiments of the present application provide a content retrieval method, a content retrieval apparatus (including a terminal, a server, and an electronic device), and a computer-readable storage medium, so as to broaden the range of application scenarios for content retrieval and improve retrieval efficiency.
  • the embodiment of the present application provides a content retrieval method, which is executed by a terminal device, and includes:
  • the content entity knowledge map is displayed such that the terminal device transmits content selected by the user from the content entity knowledge map to a server for content retrieval operations.
  • the embodiment of the present application further provides a content retrieval method, which is executed by a server, and includes:
  • the embodiment of the present application further provides a content retrieval terminal, including:
  • a trigger instruction receiving module configured to acquire a page content retrieval trigger instruction
  • a page address obtaining module configured to acquire, according to the page content retrieval triggering instruction, a page address of a page content currently displayed by the content retrieval terminal;
  • a knowledge map generating module configured to acquire a content entity knowledge map corresponding to the page content based on the page address
  • a map display module is configured to display the content entity knowledge map, so that the content retrieval terminal sends the content selected by the user from the content entity knowledge map to a server for content retrieval operation.
  • the embodiment of the present application further provides a content retrieval server, including:
  • a page address receiving module configured to receive a page address of the page content from the retrieval terminal
  • a page content extraction module configured to extract page content according to the page address
  • a content entity extraction module configured to extract a content entity of the page content
  • a knowledge map creation module configured to create the content entity knowledge map according to the extracted content entity and the association between the content entities
  • a knowledge map sending module configured to send the content entity knowledge map to the retrieval terminal for presentation, so that the retrieval terminal sends the content selected by the user from the content entity knowledge map to a server for content retrieval operation.
  • the embodiment of the present application further provides a computer readable storage medium having stored therein processor executable instructions loaded by one or more processors to perform the content retrieval method described above.
  • the embodiment of the present application further provides an electronic device including a processor and a memory, wherein the memory has a computer program, wherein the processor is configured to execute the content retrieval method described above by calling the computer program.
  • FIG. 1A is a system architecture diagram related to the present application.
  • FIG. 1B is a flowchart of a content retrieval method in some embodiments of the present application.
  • FIG. 2 is a flow chart of a content retrieval method in some embodiments of the present application.
  • FIG. 3 is a flowchart of generating a content entity knowledge map of a page content by a background server of a content retrieval method in some embodiments of the present application;
  • FIG. 5 is a schematic structural diagram of a content retrieval terminal in some embodiments of the present application.
  • FIG. 6 is a schematic structural diagram of a content retrieval terminal in some embodiments of the present application.
  • FIG. 7 is a schematic structural diagram of a background server corresponding to a content retrieval terminal in some embodiments of the present application.
  • FIG. 8 is a schematic structural diagram of a page content extraction module of a background server corresponding to a content retrieval terminal according to some embodiments of the present disclosure
  • FIG. 9 is a schematic structural diagram of a content retrieval server in some embodiments of the present application.
  • FIG. 10 is a schematic structural diagram of a page content extraction module of a content retrieval server in some embodiments of the present application.
  • FIG. 11 is a sequence diagram of the content retrieval process of the content retrieval method, the content retrieval terminal, and the content retrieval server in some embodiments of the present application;
  • FIG. 12a is a schematic diagram of page content for the content retrieval method, the content retrieval terminal, and the content retrieval server in some embodiments of the present application;
  • FIG. 12b and FIG. 12c are schematic diagrams of content entity knowledge maps for the content retrieval method, the content retrieval terminal, and the content retrieval server in some embodiments of the present application;
  • FIG. 13 is a schematic structural diagram of a working environment of a content retrieval terminal and an electronic device where a content retrieval server is located in some embodiments of the present application.
  • The content retrieval method, terminal, and server of the present application may be deployed in any electronic device to perform a content retrieval operation on page content provided by the user; the content retrieval operation covers a wide range of application scenarios and has high retrieval efficiency.
  • The electronic device includes, but is not limited to, a wearable device, a headset, a healthcare platform, a personal computer, a server computer, a handheld or laptop device, a mobile device (such as a mobile phone, a personal digital assistant (PDA), or a media player), a multiprocessor system, consumer electronics, a minicomputer, a mainframe computer, a distributed computing environment including any of the above systems or devices, and the like.
  • the content retrieval terminal is a mobile terminal
  • the content retrieval server is preferably a content retrieval background server.
  • In the content retrieval method of the present application, the content retrieval terminal determines the page content to be retrieved, and the background server performs keyword extraction on that page content and builds a knowledge map from it. This broadens the range of application scenarios for content retrieval on the content retrieval terminal and improves retrieval efficiency.
  • the application provides a content retrieval method, a terminal, a server, an electronic device, and a storage medium.
  • FIG. 1A is a system architecture diagram of the present application.
  • the server 102 provides a retrieval service.
  • Server 102 provides page services to a plurality of users via one or more networks 106, wherein the plurality of users operate their respective terminal devices 104 (e.g., terminal devices 104a-c).
  • Each user connects to the server 102 through a client application 108 (e.g., client applications 108a-c) executing on the terminal device 104.
  • the client application 108 can be a browser or a social application, for example, WeChat, QQ, Weibo, etc.
  • the client application 108 can also be a multimedia application such as a video application or an article application.
  • A page retrieval trigger prompt may be displayed on the page, and the client application 108 sends the page address of the page to the server 102 in response to the user triggering that prompt.
  • The server 102 determines the content entity knowledge map according to the page address of the page, and sends the content entity knowledge map to the client application 108 for display.
  • For a video page, for example, the content entity knowledge map may include the main characters in the video and the associations between those characters.
  • terminal devices 104 include, but are not limited to, palmtop computers, wearable computing devices, personal digital assistants (PDAs), tablet computers, notebook computers, desktop computers, mobile phones, smart phones, enhanced general packet radio service (EGPRS) mobiles.
  • Examples of one or more networks 106 include a local area network (LAN) and a wide area network (WAN) such as the Internet.
  • One or more networks 106 may be implemented using any well-known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, WiFi, Voice over IP (VoIP), Wi-MAX, or any other suitable communication protocol.
  • Each terminal device 104 optionally includes one or more internal peripheral modules, or may be connected by wire or wirelessly to one or more peripheral devices (e.g., a navigation system, health monitor, climate controller, smart sports equipment, Bluetooth headset, smart watch, etc.).
  • FIG. 1B is a flowchart of a content retrieval method according to the present application.
  • The content retrieval method in this embodiment may be implemented by the terminal device 104.
  • The content retrieval method in this embodiment includes:
  • Step S101: acquiring a page content retrieval trigger instruction;
  • Step S102: acquiring, according to the page content retrieval trigger instruction, a page address of the page content currently displayed by the terminal device;
  • Step S103: acquiring a content entity knowledge map corresponding to the page content based on the page address;
  • Step S104: displaying the content entity knowledge map, so that the terminal device sends the content selected by the user from the content entity knowledge map to the server for a content retrieval operation.
  • In step S101, the content retrieval terminal (terminal device 104) acquires a page content retrieval trigger instruction, that is, an instruction that triggers sending the page content selected by the user to the background server for content retrieval.
  • The user can generate the page content retrieval trigger instruction in various ways, such as clicking a search button at a set position on the page, or performing a touch operation on the current page content, for example a pull-down or zoom operation.
  • In step S102, the content retrieval terminal acquires, according to the page content retrieval trigger instruction acquired in step S101, the page address of the page content it is currently displaying.
  • In step S103, the content retrieval terminal acquires the content entity knowledge map corresponding to the page content based on the page address acquired in step S102. Specifically, the content retrieval terminal may send the page address to the corresponding background server; the background server obtains the page content at that address, extracts the page content keywords, and generates a content entity knowledge map of the page content from those keywords.
  • Alternatively, the content retrieval terminal can itself generate the content entity knowledge map corresponding to the page content according to the page address.
  • The content entity knowledge map here is a visual description of the associations between multiple content entities in the page content.
  • The content entity knowledge map describes the page content graphically, so that the user can more easily grasp the keywords of the page content and the associations between them. A content entity represents an object contained in the page content, such as a person, an animal, or a non-living object, for example a work or an accommodation apartment.
  • In step S104, the content retrieval terminal receives the content entity knowledge map from the background server and displays it on the screen of the content retrieval terminal.
  • The content retrieval terminal may also generate the content entity knowledge map corresponding to the page content itself, according to the page address. The user can perform a keyword content retrieval operation by selecting keywords on the content entity knowledge map.
  • The content retrieval method of this embodiment generates a content entity knowledge map from the page content, and the user performs content retrieval operations through keywords in that map. The user therefore does not need to type keywords, and can even run retrieval operations on multiple keywords of the page content at once, which broadens the range of application scenarios for content retrieval and improves retrieval efficiency.
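  • The terminal-side flow of steps S101 to S104 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the application's implementation: the function name `run_page_retrieval` and the three injected callables (`fetch_knowledge_map`, `pick_entity`, `search`) are hypothetical stand-ins for the server request, the on-screen selection, and the retrieval request, and the knowledge map is reduced to a list of entity names.

```python
from typing import Callable, List


def run_page_retrieval(
    page_address: str,
    fetch_knowledge_map: Callable[[str], List[str]],
    pick_entity: Callable[[List[str]], str],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Run once a page content retrieval trigger instruction (S101) arrives."""
    # S102: page_address is the address of the page currently displayed.
    # S103: obtain the knowledge map for that address (hypothetical callable).
    entities = fetch_knowledge_map(page_address)
    # S104: display the map and let the user pick one entity from it.
    selected = pick_entity(entities)
    # The selected content is sent to the server for content retrieval.
    return search(selected)
```

  • Passing the three operations in as callables keeps the sketch independent of any particular network or user-interface library.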
  • FIG. 2 is a flowchart of a content retrieval method according to the present application.
  • The content retrieval method in this embodiment may be implemented by the terminal device 104.
  • The content retrieval method in this embodiment includes:
  • Step S201: receiving a page content retrieval list from the background server, and displaying a page content retrieval trigger prompt according to the page content retrieval list;
  • Step S202: generating a page content retrieval trigger instruction according to a touch operation performed by the user on the page content display interface; that is, the terminal device obtains the page content retrieval trigger instruction from a touch operation that the user performs on the page content display interface in response to the page content retrieval trigger prompt;
  • Step S203: acquiring, according to the page content retrieval trigger instruction, the page address of the page content currently displayed by the terminal device;
  • Step S204: acquiring a content entity knowledge map corresponding to the page content based on the page address; the terminal device 104 may generate the content entity knowledge map itself, or may receive a content entity knowledge map generated and sent by the server 102;
  • Step S205: displaying the content entity knowledge map so that the user can perform a keyword content retrieval operation, that is, so that the terminal device sends the content selected by the user from the content entity knowledge map to the server for a content retrieval operation.
  • In step S201, not all page content supports the page content retrieval operation; for example, some pages cannot be extracted by the page crawler. The content retrieval terminal (terminal device 104) therefore receives a page content retrieval list from the background server, which indicates which pages support the page content retrieval operation.
  • The page content retrieval list may be a whitelist of pages, for example marking page content under www.qq.com as available for page content retrieval; or a blacklist of pages, for example marking page content under www.163.com as unavailable for page content retrieval. It may also be a combined blacklist and whitelist of pages, or a blacklist and whitelist of page types, for example treating pages with the cn suffix as a whitelisted website type that supports page content retrieval and pages with the org suffix as a blacklisted website type that does not.
  • The content retrieval terminal then displays a page content retrieval trigger prompt on the page the user is currently browsing, according to the page content retrieval list, so that the user can issue a page content retrieval trigger instruction in response. That is, if the current page supports the page content retrieval operation, the prompt is displayed at a preset position of the page, for example "retrievable" in the upper right corner; if it does not, "unretrievable" is shown in the upper right corner instead.
  • The display style of the page content retrieval trigger prompt can be modified as required.
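  • As a rough illustration of how a terminal might consult such a list, the Python sketch below checks a page address against domain and suffix lists. The function name `is_retrievable`, its parameters, and the precedence rule (blacklist entries win, and unlisted pages are treated as not retrievable) are assumptions made for this example; the application does not specify them.

```python
from urllib.parse import urlsplit
from typing import Collection


def is_retrievable(
    page_address: str,
    whitelist_domains: Collection[str] = (),
    blacklist_domains: Collection[str] = (),
    whitelist_suffixes: Collection[str] = (),
    blacklist_suffixes: Collection[str] = (),
) -> bool:
    """Return True if the page may show the 'retrievable' prompt."""
    host = (urlsplit(page_address).hostname or "").lower()
    # Assumption: an explicit domain entry is checked before suffix rules.
    if host in blacklist_domains:
        return False
    if host in whitelist_domains:
        return True
    # Fall back to the website type, taken here as the last host label.
    suffix = host.rsplit(".", 1)[-1]
    if suffix in blacklist_suffixes:
        return False
    return suffix in whitelist_suffixes
```

  • With `whitelist_domains={"www.qq.com"}` and `blacklist_suffixes={"org"}`, a page under www.qq.com is reported as retrievable and any page on an org host is not.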
  • In step S202, if the page the user is currently browsing supports the page content retrieval operation, the content retrieval terminal may receive the user's touch operation on the page display interface and generate a page content retrieval trigger instruction, for example when the user clicks a search button at a set position of the page, or performs a pull-down or zoom operation on the page.
  • The page content retrieval trigger instruction here is an instruction that triggers sending the page content selected by the user to the background server for content retrieval.
  • The touch operation is configured in advance: when the user performs the configured touch operation and the current page supports the page content retrieval operation, the content retrieval terminal generates a page content retrieval trigger instruction.
  • In step S203, the content retrieval terminal acquires, according to the page content retrieval trigger instruction generated in step S202, the page address of the page content it is currently displaying.
  • In step S204, the content retrieval terminal obtains a content entity knowledge map corresponding to the page content based on the page address acquired in step S203. Specifically, the content retrieval terminal sends the page address to the corresponding background server, so that the background server can generate a content entity knowledge map of the page content from the page address.
  • FIG. 3 is a flowchart of generating a content entity knowledge map of a page content by a background server of the content retrieval method of the present application.
  • the step S204 includes:
  • Step S301: the background server extracts the page content according to the obtained page address.
  • The background server may first normalize the obtained page address. The normalization operation maps page addresses that reach the same page through different domain names to a single page address, so that the background server can recognize them as the same page.
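  • One way to picture the normalization operation is an alias table that maps alternative host names onto a canonical one. The Python sketch below assumes such a table; `DOMAIN_ALIASES`, the example.com hosts, and the extra clean-up steps (lower-casing, dropping the fragment and any port) are hypothetical choices for illustration, not details given in the application.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table: alternative host name -> canonical host name.
DOMAIN_ALIASES = {
    "m.example.com": "www.example.com",
    "example.com": "www.example.com",
}


def normalize_page_address(page_address: str, aliases=DOMAIN_ALIASES) -> str:
    """Map addresses of the same page under different domains to one address."""
    parts = urlsplit(page_address)
    host = (parts.hostname or "").lower()
    # Replace a known alias with its canonical host; keep other hosts as-is.
    host = aliases.get(host, host)
    path = parts.path or "/"
    # The fragment is dropped because it does not identify a different page.
    # Simplification: any port or user info in the address is discarded too.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
```

  • The normalized address can then serve as the single key under which the page content is stored and looked up.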
  • The background server then checks whether its local storage holds the page content corresponding to the normalized page address. If it does, the background server extracts the page content directly from local storage, which avoids the slowness of real-time page extraction and improves extraction performance. If it does not, the background server extracts the page content directly from the page address.
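  • The local-storage check amounts to a cache lookup keyed by the normalized address. In the Python sketch below, `local_store` is a plain dictionary and `fetch` is a hypothetical callable standing in for the page crawler; writing the fetched content back into the store is an assumption of this example, since the text only describes reading from local storage.

```python
from typing import Callable, Dict


def get_page_content(
    normalized_address: str,
    local_store: Dict[str, str],
    fetch: Callable[[str], str],
) -> str:
    """Return page content from local storage if present, else fetch it."""
    cached = local_store.get(normalized_address)
    if cached is not None:
        # Fast path: no real-time extraction is needed.
        return cached
    # Slow path: extract the page content directly from the page address.
    content = fetch(normalized_address)
    # Assumption: keep the result so that later requests hit local storage.
    local_store[normalized_address] = content
    return content
```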
  • The background server uses the page crawler to extract content entities from the page content. Specifically, the title, subtitle, author, and body text of the page content can be extracted. Text processing operations such as word segmentation, named entity recognition (NER), and term frequency-inverse document frequency (TF-IDF) weighting are then applied to the title and body text, abstracting the page content into several content entities. These content entities effectively reflect the whole of the page content.
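  • To make the TF-IDF step concrete, the Python sketch below ranks the terms of a page against a small background corpus. It is only an illustration: the regular-expression tokenizer stands in for real word segmentation (Chinese text would need a proper segmenter), NER is not modelled, and the smoothed IDF formula and the names `tokenize` and `top_keywords` are choices made for this example.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Very rough stand-in for word segmentation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_keywords(page_text: str, background_docs: List[str], k: int = 3) -> List[str]:
    """Return the k terms of page_text with the highest TF-IDF score."""
    tokens = tokenize(page_text)
    term_counts = Counter(tokens)
    doc_terms = [set(tokenize(doc)) for doc in background_docs]
    n_docs = len(doc_terms) + 1  # the page itself counts as one document
    scores = {}
    for term, count in term_counts.items():
        doc_freq = 1 + sum(term in terms for terms in doc_terms)
        idf = math.log(n_docs / doc_freq) + 1.0  # smoothed, always positive
        scores[term] = (count / len(tokens)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

  • Terms that are frequent on the page but rare in the background corpus score highest, which is why names tend to surface as candidate content entities.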
  • Step S303: the background server uses each content entity as a search term, extracts the specific data of the content entity from the background database using search engine technology, and acquires the associations between the content entities. That is, it obtains the entity attributes of each content entity (entity name, entity type, entity information, etc.) and the entity relationships between related content entities (for example, both being singers, performing together, or being husband and wife).
  • For example, if the content entity is Andy Lau, the background server uses Andy Lau as the search term to extract the specific data of the content entity from the back-end database through search engine technology, such as his occupations (actor, singer), debut time, and representative works.
  • It also obtains the relationship between Andy Lau and another content entity, Jacky Cheung: for example, both are Hong Kong singers, and both starred in the movie "Jianghu". This establishes the entity relationship between the two content entities, Andy Lau and Jacky Cheung.
  • The entity relationships here can resemble the character relationship map of a TV series together with the real-life relationship map of its actors.
  • The name of the drama and the names of the actors are entity attributes of the content entities.
  • The relationships between characters in the drama (such as father and child) and the relationships between the actors are entity relationships of the content entities.
  • The background server can then create a content entity knowledge map from the content entities and the associations between them.
  • The content entity knowledge map here is a visual description of the interrelationships between multiple content entities in the page content.
  • The content entity knowledge map describes the page content graphically, so that the user can more easily grasp the keywords of the page content and the associations between them.
  • The content entity knowledge map can represent the connections between content entities through a multi-level hierarchy. More important content entities should be placed at the top level of the hierarchy, so that the entity attributes and entity relationships of the content entities are displayed more clearly.
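  • A content entity knowledge map of this kind can be held in a small graph structure: entities with attributes as nodes, and labelled relationships as edges. The Python sketch below is one possible representation, assuming hypothetical class and method names (`Entity`, `KnowledgeMap`, `add_entity`, `relate`, `neighbours`); it reuses the Andy Lau and Jacky Cheung example from the text and does not model the display hierarchy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entity:
    name: str                      # entity name
    entity_type: str               # entity type, e.g. "singer" or "movie"
    info: Dict[str, str] = field(default_factory=dict)  # other entity information


@dataclass
class KnowledgeMap:
    entities: Dict[str, Entity] = field(default_factory=dict)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.name] = entity

    def relate(self, a: str, b: str, label: str) -> None:
        """Record an undirected, labelled relationship between two entities."""
        if a not in self.entities or b not in self.entities:
            raise KeyError("both entities must be added before relating them")
        self.relations.append((a, b, label))

    def neighbours(self, name: str) -> List[Tuple[str, str]]:
        """Return (other entity, relationship label) pairs for one entity."""
        result = []
        for a, b, label in self.relations:
            if a == name:
                result.append((b, label))
            elif b == name:
                result.append((a, label))
        return result
```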
  • In step S304, because the page content may contain too many content entities, a content entity knowledge map with only a few levels cannot convey the associations between all of them.
  • The background server therefore reads the user portrait of the content retrieval terminal's user. The user portrait can be preset in the background server or in the content retrieval terminal; it records, based on the user's content browsing, content search, and content purchase behavior, how interested the user is in different content entities. For example, some users are more interested in movies, and others are more interested in songs.
  • The background server can adjust the priority of the content entities in the content entity knowledge map acquired in step S303 according to the preset user portrait: it determines a priority for each content entity from the user portrait, and then determines how each content entity is displayed in the content entity knowledge map from that priority. That is, the content entity knowledge map preferentially displays the content entities the user is most interested in, places content entities of lesser interest at the second or third level, and removes content entities the user is not interested in from the map altogether.
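  • The priority adjustment can be sketched as a mapping from interest values to display levels. In the Python example below, the user portrait is assumed to be a dictionary of interest values per entity type, and the function name `assign_levels` and the thresholds `high` and `low` are hypothetical; the application does not say how interest values are scaled or where the cut-offs lie.

```python
from typing import Dict


def assign_levels(
    entity_types: Dict[str, str],
    portrait: Dict[str, float],
    high: float = 0.6,
    low: float = 0.2,
) -> Dict[str, int]:
    """Map each entity name to a display level; uninteresting ones are dropped."""
    levels = {}
    for name, entity_type in entity_types.items():
        # Entity types missing from the portrait count as no interest.
        interest = portrait.get(entity_type, 0.0)
        if interest >= high:
            levels[name] = 1   # shown first, at the top level
        elif interest >= low:
            levels[name] = 2   # shown at a lower level
        # Below `low`: the entity is removed from the knowledge map.
    return levels
```

  • For a user whose portrait favors movies over singers, a movie entity lands on level 1, a singer entity on level 2, and an entity of a type the user ignores is left out.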
  • In step S205, the content retrieval terminal receives the priority-adjusted content entity knowledge map from the background server and displays it on the screen of the content retrieval terminal. The user can select keywords (content entities) on the content entity knowledge map to perform a keyword content retrieval operation, or to directly generate a new content entity knowledge map from the selected keywords.
  • In this embodiment, generating the page content retrieval trigger instruction through a touch operation increases the variety of ways the instruction can be issued; the page retrieval process can run on the background server while the content retrieval terminal only displays the content entity knowledge map, which improves the performance of the content retrieval terminal.
  • FIG. 4 is a flowchart of a content retrieval method in an embodiment of the present application.
  • The content retrieval method in this embodiment may be implemented by the content retrieval server.
  • The content retrieval method in this embodiment includes:
  • Step S401: receiving a page address of the page content from the terminal device;
  • Step S402: extracting the page content according to the page address;
  • Step S403: extracting content entities of the page content;
  • Step S404: creating a content entity knowledge map according to the extracted content entities and the associations between them;
  • Optionally, step S405: adjusting content entity priority in the content entity knowledge map based on a preset user portrait;
  • Step S406: sending the content entity knowledge map to the retrieval terminal for display, so that the user can perform a keyword content retrieval operation, that is, so that the terminal device sends the content selected by the user from the content entity knowledge map to the server for a content retrieval operation.
  • In step S401, the content retrieval server receives the page address of the page content from the retrieval terminal, that is, the page address of the page content the retrieval terminal is currently displaying.
  • In step S402, the content retrieval server extracts the page content according to the page address acquired in step S401.
  • The content retrieval server may first normalize the obtained page address, so that it can recognize addresses that reach the same page through different domain names as the same page address.
  • The content retrieval server then checks whether its local storage holds the page content corresponding to the normalized page address. If it does, the content retrieval server extracts the page content directly from local storage, which avoids the slowness of real-time page extraction and improves extraction performance. If it does not, the content retrieval server extracts the page content directly from the page address.
  • The content retrieval server uses the page crawler to extract content entities from the page content. Specifically, the title, subtitle, author, and body text of the page content can be extracted. Text processing operations such as word segmentation, named entity recognition (NER), and term frequency-inverse document frequency (TF-IDF) weighting are then applied to the title and body text, abstracting the page content into several content entities. These content entities effectively reflect the whole of the page content.
  • the content retrieval server extracts the specific data of the content entity from the background database by using the content entity as the search term, and acquires the association between the content entities. That is, the entity attributes (entity name, entity type, entity information, etc.) of the content entity and the entity relationship between the related content entities (such as singers, performers, and relationship between husband and wife) are obtained.
  • For example, the background server uses Andy Lau as the search term and, through search engine technology, extracts the specific data of this content entity from the background database, such as his occupations (actor, singer), debut time, and representative works;
  • it also obtains the relationships between Andy Lau and another content entity, Jacky Cheung: both are Hong Kong singers, and both starred in the movie "Jianghu". This establishes the entity relationship between the two content entities, Andy Lau and Jacky Cheung.
  • The entity relationships here can resemble the character relationship map of a TV series together with the real-life relationship map of its actors.
  • The name of the drama and the names of the actors are entity attributes of the content entities.
  • The relationships between characters in the drama, such as a father-child relationship, and the relationships between the actors in the drama are entity relationships of the content entities.
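  • One way to hold entity attributes and entity relationships in memory is sketched below, using the Andy Lau and Jacky Cheung example from the text. The class names, field names, and relation labels are hypothetical and chosen only for illustration; the embodiment does not prescribe a data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ContentEntity:
    """Entity attributes: name, type, and free-form entity information."""
    name: str
    entity_type: str
    info: Dict[str, str] = field(default_factory=dict)


@dataclass
class EntityGraph:
    """Content entities plus (entity, relation, entity) triples between them."""
    entities: Dict[str, ContentEntity] = field(default_factory=dict)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_entity(self, entity: ContentEntity) -> None:
        self.entities[entity.name] = entity

    def relate(self, a: str, relation: str, b: str) -> None:
        # Both ends must already be known content entities.
        if a not in self.entities or b not in self.entities:
            raise KeyError("both entities must be added before relating them")
        self.relations.append((a, relation, b))

    def relations_of(self, name: str) -> List[Tuple[str, str, str]]:
        return [r for r in self.relations if name in (r[0], r[2])]


graph = EntityGraph()
graph.add_entity(ContentEntity("Andy Lau", "person", {"occupations": "actor, singer"}))
graph.add_entity(ContentEntity("Jacky Cheung", "person", {"occupations": "singer"}))
graph.relate("Andy Lau", "fellow Hong Kong singer", "Jacky Cheung")
graph.relate("Andy Lau", "co-starred in Jianghu", "Jacky Cheung")
```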
  • The content retrieval server can create a content entity knowledge map according to the content entities and the associations between them.
  • The content entity knowledge map here is a visual description of the interrelationships between multiple content entities in the page content.
  • The page content can thus be described graphically by its content entity knowledge map, so that the user can more easily grasp the keywords of the page content and the associations between them.
  • The content entity knowledge map can represent the interconnections between different content entities through a multi-level hierarchy. More important content entities should be placed at the highest level of the hierarchy, so that the entity attributes and entity relationships of the content entities are shown clearly.
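  • The multi-level layout can be pictured with the short sketch below, which fills the levels of a map from the most important entity downward. The importance scores, the level sizes, and the function name `assign_levels` are all hypothetical; the text does not say how importance is measured.

```python
from typing import Dict, List


def assign_levels(importance: Dict[str, float], level_sizes: List[int]) -> List[List[str]]:
    """Fill levels top-down with entities in descending order of importance."""
    ranked = sorted(importance, key=lambda name: (-importance[name], name))
    levels: List[List[str]] = []
    start = 0
    for size in level_sizes:
        levels.append(ranked[start:start + size])
        start += size
    return levels  # entities beyond the total capacity are left out


importance = {"Andy Lau": 0.9, "Jacky Cheung": 0.7, "Jianghu": 0.6, "Hong Kong": 0.2}
levels = assign_levels(importance, level_sizes=[1, 2, 3])
```

  • With these illustrative scores, Andy Lau occupies the highest level, Jacky Cheung and Jianghu the second, and Hong Kong the third.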
  • In step S405, because the page content may contain too many content entities, a content entity knowledge map with few levels cannot reflect the associations between all of them.
  • The background server reads the user portrait of the content retrieval terminal user. The user portrait can be preset in the background server or in the content retrieval terminal. It refers to the user's interest values for different content entities, derived from behaviors such as content browsing, content search, and content purchase.
  • For example, some users have a greater interest in movies, and some users have a greater interest in songs.
  • The content retrieval server can then adjust the priority of the content entities in the content entity knowledge map acquired in step S404 according to the preset user portrait. That is, the content entity knowledge map preferentially displays the content entities the user is most interested in; content entities of lower interest are placed at the second or third level of the map; and content entities the user is not interested in are deleted from the map directly, and so on.
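  • A minimal sketch of this priority adjustment follows. It assumes the user portrait is a mapping from entity type to an interest value between 0 and 1 and that two thresholds separate "most interested", "less interested" and "not interested"; the thresholds, names, and sample values are hypothetical.

```python
from typing import Dict, List


def adjust_priority(
    entity_types: Dict[str, str],
    portrait: Dict[str, float],
    high: float = 0.6,
    low: float = 0.2,
) -> Dict[int, List[str]]:
    """Place entities on level 1 or 2 by user interest; drop uninteresting ones."""
    levels: Dict[int, List[str]] = {1: [], 2: []}
    for name, entity_type in entity_types.items():
        interest = portrait.get(entity_type, 0.0)
        if interest >= high:
            levels[1].append(name)  # shown first
        elif interest >= low:
            levels[2].append(name)  # demoted to a lower level
        # below `low`: removed from the knowledge map
    return levels


entity_types = {"Jianghu": "movie", "Andy Lau": "person", "Hong Kong": "place"}
portrait = {"movie": 0.8, "person": 0.3}  # no recorded interest in places
adjusted = adjust_priority(entity_types, portrait)
```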
  • In step S406, the content retrieval server sends the priority-adjusted content entity knowledge map to the retrieval terminal for presentation, so that the user of the content retrieval terminal can perform a keyword content retrieval operation by selecting keywords on the content entity knowledge map, or directly generate a new content entity knowledge map from the selected keywords.
  • The content retrieval method of this embodiment generates a corresponding content entity knowledge map from the page content, and the user can perform content retrieval operations using the keywords in that map. The user therefore does not need to enter keywords actively and can even retrieve multiple keywords of the page content at the same time, which broadens the application scenarios of content retrieval and improves retrieval efficiency.
  • the page retrieval process can be performed on the background server, and the content retrieval terminal only performs the display operation on the content entity knowledge map, thereby effectively improving the performance of the corresponding content retrieval terminal.
  • FIG. 5 is a schematic structural diagram of a content retrieval terminal according to an embodiment of the present application.
  • the content retrieval terminal of the present embodiment can be implemented by using the content retrieval method described above.
  • the content retrieval terminal 50 of the present embodiment includes a trigger instruction receiving module 51, a page address obtaining module 52, a knowledge map generating module 53, and a map display module 54.
  • the triggering instruction receiving module 51 is configured to acquire a page content retrieval triggering instruction;
  • the page address obtaining module 52 is configured to acquire a page address of the page content currently displayed by the terminal device according to the page content retrieval triggering instruction;
  • the knowledge map generating module 53 is configured to acquire, based on the page address, the content entity knowledge map corresponding to the page content;
  • the map display module 54 is configured to receive and display the content entity knowledge map, so that the terminal device sends the content selected by the user from the content entity knowledge map to the server for content retrieval operation.
  • The triggering instruction receiving module 51 first receives the page content retrieval triggering instruction, that is, an instruction that triggers sending the page content selected by the user to the background server for content retrieval.
  • The user can generate the page content retrieval trigger instruction in various ways, such as clicking a search button at a set position on the page, or performing a touch operation on the current page content, for example a pull-down operation or a zoom operation.
  • the page address obtaining module 52 then acquires the page address of the page content being displayed by the current content retrieval terminal according to the page content retrieval trigger instruction acquired by the trigger instruction receiving module 51.
  • the knowledge map generating module 53 then generates a content entity knowledge map corresponding to the page content based on the page address acquired by the page address obtaining module 52. Specifically, the knowledge map generating module 53 sends the page address obtained by the page address obtaining module 52 to the corresponding background server.
  • the background server can obtain the corresponding page content for the page address, and then the background server can obtain the page content keyword (content entity) of the page content, and generate a content entity knowledge map of the page content according to the page content keyword.
  • the knowledge map generation module 53 can also generate a content entity knowledge map corresponding to the page content according to the page address.
  • The content entity knowledge map here is a visual description of the interrelationships between multiple content entities in the page content.
  • The page content can thus be described graphically by its content entity knowledge map, so that the user can more easily grasp the keywords of the page content and the associations between them.
  • Finally, the map display module 54 receives the content entity knowledge map from the background server and displays it on the screen of the content retrieval terminal, and the user can perform a keyword content retrieval operation by selecting keywords on the content entity knowledge map.
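  • The terminal-side flow of modules 51 to 54 can be summarized in the sketch below. It is only an outline under stated assumptions: the class `ContentRetrievalTerminal`, the callable `request_map` standing in for the background server call, and the dictionary shape of the knowledge map are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

KnowledgeMap = Dict[str, List[str]]  # keyword -> related keywords (assumed shape)


@dataclass
class ContentRetrievalTerminal:
    current_page_address: str
    request_map: Callable[[str], KnowledgeMap]  # stands in for the background server
    displayed_map: Optional[KnowledgeMap] = None

    def on_trigger(self) -> KnowledgeMap:
        """Handle a page content retrieval trigger instruction."""
        address = self.current_page_address             # page address obtaining step
        self.displayed_map = self.request_map(address)  # map comes back from the server
        return self.displayed_map                       # a real terminal would render it

    def select_keyword(self, keyword: str) -> str:
        """Return the keyword to send to the server for content retrieval."""
        if self.displayed_map is None or keyword not in self.displayed_map:
            raise KeyError(keyword)
        return keyword


def fake_server(address: str) -> KnowledgeMap:
    # Stub response; a real server would build this from the page content.
    return {"Andy Lau": ["Jacky Cheung", "Jianghu"], "Jacky Cheung": ["Andy Lau"]}


terminal = ContentRetrievalTerminal("http://www.example.com/news/1", fake_server)
shown = terminal.on_trigger()
query = terminal.select_keyword("Andy Lau")
```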
  • The content retrieval terminal of this embodiment generates a corresponding content entity knowledge map from the page content, and the user can perform content retrieval operations using the keywords in that map. The user therefore does not need to enter keywords actively and can even retrieve multiple keywords of the page content at the same time, which broadens the application scenarios of content retrieval and improves retrieval efficiency.
  • FIG. 6 is a schematic structural diagram of a content retrieval terminal according to the present application.
  • the content retrieval terminal of the present embodiment can be implemented by using the content retrieval method described above.
  • The content retrieval terminal 60 of the present embodiment includes a retrieval trigger prompting module 61, a triggering instruction receiving module 62, a page address obtaining module 63, a knowledge map generating module 64, and a map display module 65.
  • The retrieval triggering prompting module 61 is configured to receive a page content retrieval list from the background server and to display a page content retrieval triggering prompt according to the content of that list, so that the user issues a page content retrieval triggering instruction according to the prompt; the terminal device then acquires the page content retrieval triggering instruction from the touch operation that the user performs on the page content display interface in response to the prompt.
  • the triggering instruction receiving module 62 is configured to generate a page content retrieval triggering instruction according to a touch operation performed by the user on the page content display interface.
  • The page address obtaining module 63 is configured to obtain the page address of the page content according to the page content retrieval triggering instruction; the knowledge map generating module 64 is configured to generate the content entity knowledge map corresponding to the page content based on the page address; the map display module 65 is configured to display the content entity knowledge map so that the user can perform keyword content retrieval operations.
  • FIG. 7 is a schematic structural diagram of the background server corresponding to the content retrieval terminal of the present application.
  • the background server 70 includes a page content extraction module 71, a content entity extraction module 72, a knowledge map creation module 73, and a knowledge map priority adjustment module 74.
  • The page content extraction module 71 is configured to extract page content according to the page address; the content entity extraction module 72 is configured to extract the content entities of the page content using the page crawler; the knowledge map creation module 73 is configured to create a content entity knowledge map according to the extracted content entities and the associations between them.
  • the knowledge map priority adjustment module 74 is configured to perform content entity priority adjustment on the content entity knowledge map based on the preset user portrait.
  • FIG. 8 is a schematic structural diagram of a page content extraction module of a background server corresponding to the content retrieval terminal of the present application.
  • the page content extraction module 71 includes a page address normalization unit 81, a page content storage determination unit 82, a first page content extraction unit 83, and a second page content extraction unit 84.
  • The page address normalization unit 81 is configured to perform a normalization operation on the page address; the page content storage determining unit 82 is configured to determine whether the server local memory stores the page content corresponding to the normalized page address; the first page content extracting unit 83 is configured to extract the page content from the server local memory if that page content is stored there; and the second page content extracting unit 84 is configured to extract the page content according to the page address if it is not.
  • The retrieval trigger prompting module 61 receives a page content retrieval list from the background server 70; the page content retrieval list indicates which pages support the page content retrieval operation.
  • The page content retrieval list may be a whitelist of pages, for example setting the pages under www.qq.com as whitelisted for page content retrieval; or a blacklist of pages, for example setting the pages under www.163.com as blacklisted so that page content retrieval cannot be performed on them; or a combined blacklist and whitelist of pages, or a list of blacklisted and whitelisted page types, for example setting pages with the cn suffix as a whitelisted website type on which page content retrieval can be performed and pages with the org suffix as a blacklisted website type on which it cannot.
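  • A check against such a list might look like the sketch below, which uses the hosts and suffixes named in the text. The function name, the rule that blacklist entries take precedence, and the default of "not retrievable" for unlisted pages are assumptions made for the example.

```python
from typing import Iterable, Set
from urllib.parse import urlsplit


def is_retrievable(
    url: str,
    allowed_hosts: Set[str],
    blocked_hosts: Set[str],
    allowed_suffixes: Iterable[str] = (),
    blocked_suffixes: Iterable[str] = (),
) -> bool:
    """Decide whether page content retrieval may be offered for this page."""
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: a blacklist match wins over any whitelist match.
    if host in blocked_hosts or host.endswith(tuple(blocked_suffixes)):
        return False
    # Assumption: pages matching no rule are treated as not retrievable.
    return host in allowed_hosts or host.endswith(tuple(allowed_suffixes))


rules = dict(
    allowed_hosts={"www.qq.com"},
    blocked_hosts={"www.163.com"},
    allowed_suffixes=(".cn",),
    blocked_suffixes=(".org",),
)
results = {
    url: is_retrievable(url, **rules)
    for url in (
        "https://www.qq.com/news",
        "https://www.163.com/",
        "https://example.cn/page",
        "https://example.org/page",
        "https://example.net/",
    )
}
```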
  • The retrieval triggering prompting module 61 then displays a page content retrieval triggering prompt on the page the user is currently browsing according to the content of the page content retrieval list, so that the user issues a page content retrieval triggering instruction according to the prompt. That is, if the page the user is currently browsing supports the page content retrieval operation, the prompt is displayed at a preset position on the page, for example "retrievable" in the upper right corner; if it does not, "unretrievable" is shown in the upper right corner of the page.
  • the display method of the page content retrieval trigger prompt can be modified as required.
  • The trigger instruction receiving module 62 can receive the user's touch operation on the page display interface and generate a page content retrieval trigger instruction from it, for example when the user clicks the search button at a set position on the currently browsed page, or performs a pull-down or zoom operation on that page.
  • the page content retrieval trigger instruction herein refers to an instruction for triggering the content of the page selected by the user to be sent to the background server for content retrieval.
  • The touch operation needs to be set in advance; that is, when the user performs the preset touch operation and the page the user is currently browsing supports the page content retrieval operation, the content retrieval terminal generates a page content retrieval trigger instruction.
  • the page address obtaining module 63 then acquires the page address of the page content being displayed by the current content retrieval terminal according to the page content retrieval trigger instruction generated by the trigger instruction receiving module 62.
  • the knowledge map generation module 64 then generates a content entity knowledge map corresponding to the page content based on the page address acquired by the page address acquisition module 63. Specifically, the knowledge map generation module 64 sends the page address obtained by the page address acquisition module 63 to the corresponding background server. Thus, the background server 70 can generate a content entity knowledge map of the page content according to the page address.
  • the specific process includes:
  • the page content extraction module 71 of the background server 70 extracts the page content based on the acquired page address.
  • the page address normalization unit 81 of the page content extraction module 71 may perform a normalization operation on the obtained page address, so that the background server can better identify the same page address represented by different domain names.
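  • The text does not specify the normalization rules, so the sketch below only illustrates the idea of mapping different spellings and domain names of one page to a single address. The alias table `HOST_ALIASES`, the function name, and the specific rules (lowercasing, dropping default ports, trailing slashes and fragments) are assumptions; addresses without a scheme are not handled.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical table of domain names known to serve the same pages.
HOST_ALIASES = {"m.example.com": "www.example.com", "example.com": "www.example.com"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_page_address(url: str) -> str:
    """Map equivalent spellings of a page address to one canonical form."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)  # fold alias domains together
    port = parts.port
    keep_port = port is not None and port != DEFAULT_PORTS.get(scheme)
    netloc = f"{host}:{port}" if keep_port else host
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))  # fragment dropped


a = normalize_page_address("HTTP://M.Example.com:80/news/1/#top")
b = normalize_page_address("http://example.com/news/1")
c = normalize_page_address("https://example.com:8443/")
```

  • Here `a` and `b` normalize to the same address, so a page stored under one of them is found when the other is requested.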
  • The page content storage determining unit 82 of the page content extraction module 71 determines whether the server local storage holds the page content corresponding to the normalized page address. If it does, the first page content extraction unit 83 of the page content extraction module 71 extracts the page content directly from the server local storage, which avoids the slowness of real-time page content extraction and improves extraction performance. If it does not, the second page content extraction unit 84 of the page content extraction module 71 extracts the page content directly according to the page address.
  • The content entity extraction module 72 of the background server 70 uses the page crawler to perform content entity extraction on the page content. Specifically, the title, subtitle, author, and body text of the page content can be extracted. Text processing operations such as word segmentation, named entity recognition (NER), and term frequency-inverse document frequency (TF-IDF) weighting are then applied to the title and body text, abstracting the page content into several content entities. These content entities can effectively reflect the full content of the page.
  • Using each content entity as a search term, the knowledge map creation module 73 of the background server 70 extracts the specific data (related data) of the content entity from the background database and acquires the associations between content entities. That is, it obtains the entity attributes of each content entity (entity name, entity type, entity information, etc.) and the entity relationships between related content entities (such as singer, performer, or husband-and-wife relationships).
  • For example, the background server uses Andy Lau as the search term and, through search engine technology, extracts the specific data of this content entity from the background database, such as his occupations (actor, singer), debut time, and representative works;
  • it also obtains the relationships between Andy Lau and another content entity, Jacky Cheung: both are Hong Kong singers, and both starred in the movie "Jianghu". This establishes the entity relationship between the two content entities, Andy Lau and Jacky Cheung.
  • The entity relationships here can resemble the character relationship map of a TV series together with the real-life relationship map of its actors.
  • The name of the drama and the names of the actors are entity attributes of the content entities.
  • The relationships between characters in the drama, such as a father-child relationship, and the relationships between the actors in the drama are entity relationships of the content entities.
  • The knowledge map creation module 73 can create a content entity knowledge map according to the content entities and the associations between them.
  • The content entity knowledge map here is a visual description of the interrelationships between multiple content entities in the page content.
  • The page content can thus be described graphically by its content entity knowledge map, so that the user can more easily grasp the keywords of the page content and the associations between them.
  • The content entity knowledge map can represent the interconnections between different content entities through a multi-level hierarchy. More important content entities should be placed at the highest level of the hierarchy, so that the entity attributes and entity relationships of the content entities are shown clearly.
  • The knowledge map priority adjustment module 74 of the background server 70 reads the user portrait of the content retrieval terminal user. The user portrait can be preset in the background server or in the content retrieval terminal. It refers to the user's interest values for different content entities, derived from behaviors such as content browsing, content search, and content purchase. For example, some users have a greater interest in movies, and some users have a greater interest in songs.
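  • How interest values are derived from user behavior is not detailed in the text. As one possible reading, the sketch below weights each recorded behavior and normalizes the totals per entity type; the weights in `BEHAVIOR_WEIGHTS`, the event format, and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical weights: a purchase signals more interest than a page view.
BEHAVIOR_WEIGHTS = {"browse": 1.0, "search": 2.0, "purchase": 5.0}


def build_user_portrait(events: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Turn (behavior, entity_type) events into interest values that sum to 1."""
    totals: Dict[str, float] = defaultdict(float)
    for behavior, entity_type in events:
        totals[entity_type] += BEHAVIOR_WEIGHTS.get(behavior, 0.0)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {entity_type: value / grand_total for entity_type, value in totals.items()}


events = [
    ("browse", "movie"),
    ("search", "movie"),
    ("purchase", "movie"),
    ("browse", "song"),
    ("browse", "song"),
]
user_portrait = build_user_portrait(events)
```

  • With these sample events the user's interest in movies comes out four times as high as the interest in songs.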
  • The knowledge map priority adjustment module 74 can then adjust the priority of the content entities in the content entity knowledge map acquired by the knowledge map creation module 73 according to the preset user portrait. That is, the content entity knowledge map preferentially displays the content entities the user is most interested in; content entities of lower interest are placed at the second or third level of the map; and content entities the user is not interested in are deleted from the map directly, and so on.
  • The map display module 65 then receives the priority-adjusted content entity knowledge map from the background server 70 and displays it on the screen of the content retrieval terminal 60, and the user can perform a keyword content retrieval operation by selecting keywords on the content entity knowledge map.
  • The content retrieval terminal of this embodiment uses the page content retrieval list and the page content retrieval triggering prompt to filter out pages on which page content retrieval cannot be performed, further improving the efficiency of page content retrieval. The page content retrieval trigger instruction is generated from the user's touch operation on the page content display interface, which increases the variety of ways the instruction can be triggered. The page retrieval process can be performed on the background server, with the content retrieval terminal only displaying the content entity knowledge map, which improves the performance of the content retrieval terminal.
  • FIG. 9 is a schematic structural diagram of an embodiment of a content retrieval server according to the present application.
  • the content search server of this embodiment can be implemented using the content search method described above.
  • the content retrieval server 90 of the present embodiment includes a page address receiving module 91, a page content extraction module 92, a content entity extraction module 93, a knowledge map creation module 94, a knowledge map priority adjustment module 95, and a knowledge map transmission module 96.
  • The page address receiving module 91 is configured to receive the page address of the page content from the retrieval terminal; the page content extraction module 92 is configured to extract the page content according to the page address; the content entity extraction module 93 is configured to extract the content entities of the page content using the page crawler; the knowledge map creation module 94 is configured to create a content entity knowledge map according to the extracted content entities and the associations between them;
  • the knowledge map priority adjustment module 95 is configured to adjust the content entity priority of the content entity knowledge map based on the preset user portrait;
  • the knowledge map sending module 96 is configured to send the content entity knowledge map to the retrieval terminal for presentation, so that the retrieval terminal sends the content the user selects from the content entity knowledge map to the server for content retrieval.
  • FIG. 10 is a schematic structural diagram of a page content extraction module according to an embodiment of a content retrieval server of the present application.
  • the page content extraction module 92 includes a page address normalization unit 1001, a page content storage determination unit 1002, a first page content extraction unit 1003, and a second page content extraction unit 1004.
  • The page address normalization unit 1001 is configured to perform a normalization operation on the page address; the page content storage determining unit 1002 is configured to determine whether the server local memory stores the page content corresponding to the normalized page address; the first page content extracting unit 1003 is configured to extract the page content from the server local memory if that page content is stored there; and the second page content extracting unit 1004 is configured to extract the page content according to the page address if it is not.
  • The page address receiving module 91 first receives the page address of the page content from the retrieval terminal, that is, the page address of the page content that the retrieval terminal is currently displaying.
  • the page content extraction module 92 then extracts the page content according to the page address obtained by the page address receiving module 91.
  • the page address normalization unit 1001 of the page content extraction module 92 can perform a normalization operation on the obtained page address, so that the content retrieval server can better identify the same page address represented by different domain names.
  • The page content storage determining unit 1002 of the page content extraction module 92 determines whether the server local storage holds the page content corresponding to the normalized page address. If it does, the first page content extraction unit 1003 of the page content extraction module 92 extracts the page content directly from the server local storage, which avoids the slowness of real-time page content extraction and improves extraction performance. If it does not, the second page content extraction unit 1004 of the page content extraction module 92 extracts the page content according to the page address.
  • The content entity extraction module 93 uses the page crawler to perform content entity extraction on the page content. Specifically, the title, subtitle, author, and body text of the page content can be extracted. Text processing operations such as word segmentation, named entity recognition (NER), and term frequency-inverse document frequency (TF-IDF) weighting are then applied to the title and body text, abstracting the page content into several content entities. These content entities can effectively reflect the full content of the page.
  • Using each content entity as a search term, the knowledge map creation module 94 extracts the specific data (related data) of the content entity from the background database and acquires the associations between content entities. That is, it obtains the entity attributes of each content entity (entity name, entity type, entity information, etc.) and the entity relationships between related content entities (such as singer, performer, or husband-and-wife relationships).
  • For example, the background server uses Andy Lau as the search term and, through search engine technology, extracts the specific data of this content entity from the background database, such as his occupations (actor, singer), debut time, and representative works;
  • it also obtains the relationships between Andy Lau and another content entity, Jacky Cheung: both are Hong Kong singers, and both starred in the movie "Jianghu". This establishes the entity relationship between the two content entities, Andy Lau and Jacky Cheung.
  • The entity relationships here can resemble the character relationship map of a TV series together with the real-life relationship map of its actors.
  • The name of the drama and the names of the actors are entity attributes of the content entities.
  • The relationships between characters in the drama, such as a father-child relationship, and the relationships between the actors in the drama are entity relationships of the content entities.
  • The knowledge map creation module 94 can create a content entity knowledge map according to the content entities and the associations between them.
  • The content entity knowledge map here is a visual description of the interrelationships between multiple content entities in the page content.
  • The page content can thus be described graphically by its content entity knowledge map, so that the user can more easily grasp the keywords of the page content and the associations between them.
  • The content entity knowledge map can represent the interconnections between different content entities through a multi-level hierarchy. More important content entities should be placed at the highest level of the hierarchy, so that the entity attributes and entity relationships of the content entities are shown clearly.
  • The knowledge map priority adjustment module 95 reads the user portrait of the content retrieval terminal user. The user portrait can be preset in the content retrieval server or in the content retrieval terminal. It refers to the user's interest values for different content entities, derived from behaviors such as content browsing, content search, and content purchase.
  • The knowledge map priority adjustment module 95 can then adjust the priority of the content entities in the content entity knowledge map acquired by the knowledge map creation module 94 according to the preset user portrait. That is, the content entity knowledge map preferentially displays the content entities the user is most interested in; content entities of lower interest are placed at the second or third level of the map; and content entities the user is not interested in are deleted from the map directly, and so on.
  • Finally, the knowledge map sending module 96 sends the priority-adjusted content entity knowledge map to the retrieval terminal for display, so that the user of the content retrieval terminal can perform a keyword content retrieval operation by selecting keywords on the content entity knowledge map, or directly generate a new content entity knowledge map from the selected keywords.
  • The content retrieval server of this embodiment generates a corresponding content entity knowledge map from the page content, and the user can perform content retrieval operations using the keywords in that map. The user therefore does not need to enter keywords actively and can even retrieve multiple keywords of the page content at the same time, which broadens the application scenarios of content retrieval and improves retrieval efficiency.
  • the page retrieval process is performed by the content retrieval server, and the content retrieval terminal only performs the display operation on the content entity knowledge map, thereby effectively improving the performance of the corresponding content retrieval terminal.
  • FIG. 11 is a sequence diagram of a content retrieval process of a content retrieval method, a content retrieval terminal, and a content retrieval server according to a specific embodiment of the present application.
  • the content retrieval terminal is a mobile terminal of the user
  • the content retrieval server is a background server of the browser application.
  • the content retrieval process of this embodiment includes:
  • Step S1101: when the mobile terminal user sees page content of interest in the browser application, and a page content retrieval trigger prompt is set on that page content, the user can issue a page content retrieval trigger instruction by performing a pull-down operation on the page content.
  • Step S1102: the mobile terminal obtains, according to the page content retrieval trigger instruction, the address of the page currently browsed in the browser application, and sends the page address to the background server of the browser application.
  • Step S1103: after normalizing the received page address, the background server obtains the corresponding page content from its local cache or directly from the page address.
  • Step S1104: the background server uses a page crawler to extract content entities from the page content, for example extracting the title, subtitle, author, and body content. Text processing operations such as word segmentation, named entity recognition (NER), and term frequency-inverse document frequency (TF-IDF) weighting are then applied to the title and body content, abstracting the page content into several content entities.
  • FIG. 12a shows a promotional page for TV series A; the text in the figure may be a synopsis of the series, and the picture may be a promotional still.
  • Content entities such as the series title "A", the character "B", and the lead actor "C" can be extracted from the page content.
  • Step S1105: the background server uses the content entities as search terms, extracts the specific data of each content entity from the background database using search engine technology, and creates the content entity knowledge map corresponding to the page content based on the associations between the content entities, as shown in FIG. 12b and FIG. 12c.
  • Step S1106: the background server determines, from the user portrait formed from the mobile terminal user's previous page browsing records, the user's degree of interest in each content entity in the content entity knowledge map, and adjusts the position and priority of the content entities in the map accordingly. If the user is more interested in TV series A, the content entity knowledge map shown in FIG. 12b is generated; if the user is more interested in the lead actor, the content entity knowledge map shown in FIG. 12c is generated.
  • Step S1107: the background server sends the adjusted content entity knowledge map to the mobile terminal for display, and the mobile terminal user can perform a keyword content retrieval operation by selecting any keyword on the content entity knowledge map.
  • for example, the user can tap the lead actor's content entity in FIG. 12b to run a retrieval operation with that actor as the keyword, or to switch to the new content entity knowledge map of FIG. 12c related to that actor.
  • the content retrieval method, content retrieval terminal, content retrieval server, and electronic device of the present application generate a corresponding content entity knowledge map from the page content, and the user can perform a content retrieval operation using keywords in the content entity knowledge map. This broadens the range of application scenarios of content retrieval and improves retrieval efficiency, solving the technical problem that existing content retrieval methods and content retrieval apparatuses have a narrow range of application scenarios and relatively low retrieval efficiency.
  • the terms "component", "module", "system", "interface", "process", and the like, as used in this application, generally refer to a computer-related entity: hardware, a combination of hardware and software, software, or software in execution.
  • for example, a component can be, but is not limited to being, a process running on a processor, a processor, an object, an executable application, a thread of execution, a program, and/or a computer.
  • by way of illustration, both an application running on a controller and the controller can be components.
  • One or more components can reside within a process and/or thread of execution, and the components can be located on a computer and/or distributed between two or more computers.
  • Example electronic device 1312 includes, but is not limited to, a wearable device, a head-mounted device, a healthcare platform, a personal computer, a server computer, a handheld or laptop device, a mobile device (such as a mobile phone, a personal digital assistant (PDA), or a media player), a multiprocessor system, a consumer electronic device, a minicomputer, a mainframe computer, a distributed computing environment including any of the above systems or devices, and the like.
  • Computer readable instructions may be distributed via computer readable media (discussed below).
  • Computer readable instructions may be implemented as program modules, such as functions, objects, application programming interfaces (APIs), data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the functionality of the computer readable instructions can be combined or distributed at will in various environments.
  • FIG. 13 illustrates an example of an electronic device 1312 that includes one or more embodiments of the content retrieval terminal and content retrieval server of the present application.
  • electronic device 1312 includes at least one processing unit 1316 and memory 1318.
  • memory 1318 can be volatile (such as RAM), non-volatile (such as ROM, flash memory, etc.) or some combination of the two. This configuration is illustrated in Figure 13 by dashed line 1314.
  • electronic device 1312 may include additional features and/or functionality.
  • device 1312 may also include additional storage devices (eg, removable and/or non-removable) including, but not limited to, magnetic storage devices, optical storage devices, and the like.
  • such additional storage is illustrated by storage device 1320 in FIG. 13.
  • computer readable instructions for implementing one or more embodiments provided herein may be in storage device 1320.
  • Storage device 1320 can also store other computer readable instructions for implementing an operating system, applications, and the like.
  • Computer readable instructions may be loaded into memory 1318 for execution by, for example, processing unit 1316.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions or other data.
  • Memory 1318 and storage device 1320 are examples of computer storage media.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage device, magnetic tape cassette, magnetic tape, magnetic disk storage device or other magnetic storage device, or any other medium that can be used to store desired information and that can be accessed by electronic device 1312. Any such computer storage media may be part of the electronic device 1312.
  • the electronic device 1312 may also include a communication connection 1326 that allows the electronic device 1312 to communicate with other devices.
  • Communication connection 1326 may include, but is not limited to, a modem, a network interface card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interface for connecting electronic device 1312 to other electronic devices.
  • Communication connection 1326 can include a wired connection or a wireless connection.
  • Communication connection 1326 can transmit and/or receive communication media.
  • Computer readable medium can include a communication medium.
  • Communication media typically embodies computer readable instructions or other data in "modulated data signals" such as carrier waves or other transport mechanisms, and includes any information delivery media.
  • modulated data signal can include a signal that one or more of the signal characteristics are set or changed in such a manner as to encode the information into the signal.
  • the electronic device 1312 can include an input device 1324, such as a keyboard, mouse, pen, voice input device, touch input device, infrared camera, video input device, and/or any other input device.
  • Output device 1322 such as one or more displays, speakers, printers, and/or any other output device, may also be included in device 1312.
  • Input device 1324 and output device 1322 can be coupled to electronic device 1312 via a wired connection, a wireless connection, or any combination thereof.
  • an input device or output device from another electronic device can be used as the input device 1324 or output device 1322 of the electronic device 1312.
  • the components of electronic device 1312 can be connected by various interconnects, such as a bus.
  • such interconnects may include Peripheral Component Interconnect (PCI) (such as PCI Express), Universal Serial Bus (USB), FireWire (IEEE 1394), an optical bus architecture, and the like.
  • the components of electronic device 1312 can be interconnected by a network.
  • memory 1318 can be composed of multiple physical memory units that are located in different physical locations and interconnected by a network.
  • storage devices for storing computer readable instructions may be distributed across a network.
  • electronic device 1330 accessible via network 1328 can store computer readable instructions for implementing one or more embodiments provided herein.
  • the electronic device 1312 can access the electronic device 1330 and download a portion or all of the computer readable instructions for execution.
  • electronic device 1312 can download a plurality of computer readable instructions as needed, or some of the instructions can be executed at electronic device 1312 and some of the instructions can be executed at electronic device 1330.
  • the one or more operations may constitute computer readable instructions stored on one or more computer readable media that, when executed by an electronic device, cause the electronic device to perform the operations.
  • the order in which some or all of the operations are described should not be construed as implying that the operations must be performed in that order. Those skilled in the art, having the benefit of this specification, will appreciate alternative orderings. Moreover, it should be understood that not all operations must be present in every embodiment provided herein.
  • Each functional unit in the embodiment of the present application may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • if the integrated module is implemented as a software functional module and sold or used as a standalone product, it may also be stored in a computer readable storage medium.
  • the above mentioned storage medium may be a read only memory, a magnetic disk or an optical disk or the like.


Abstract

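Embodiments of the present application provide a content retrieval method, including: obtaining a page content retrieval trigger instruction; obtaining a page address of page content according to the page content retrieval trigger instruction; generating, based on the page address, a content entity knowledge map corresponding to the page content; and displaying the content entity knowledge map, the content entity knowledge map including keywords for performing a content retrieval operation. The present application further provides a content retrieval terminal and a content retrieval server.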

Description

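Content retrieval method, terminal, server, electronic device, and storage medium
This application claims priority to Chinese Patent Application No. 201710872842.X, filed with the Chinese Patent Office on September 25, 2017 and entitled "Content retrieval method, terminal, server, electronic device, and storage medium", which is incorporated herein by reference in its entirety.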
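Technical Field
The present application relates to the field of data processing, and in particular to a content retrieval method, a terminal, a server, an electronic device, and a storage medium.
Background
With the development of technology, people rely increasingly on the Internet and can obtain all kinds of information through it at any time. When a user wants to learn about some content, the user can enter a corresponding keyword into a search engine, and the search engine can provide an introduction to the content entities related to that keyword on its results page, for example by presenting a knowledge map that helps the user understand the content.
However, all of these approaches require the user to enter a content keyword. If the user cannot enter a keyword (for example, because the input method is inconvenient) or does not know the keyword (for example, the user wants information about a particular actor in a film), the search engine cannot provide a good content search service. The user may then give up searching for the content, or spend considerable time finding a keyword for it.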
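Summary
Embodiments of the present application provide a content retrieval method, a content retrieval apparatus (including a terminal, a server, and an electronic device), and a computer-readable storage medium, so as to broaden the range of application scenarios of content retrieval and improve content retrieval efficiency.
An embodiment of the present application provides a content retrieval method performed by a terminal device, including:
obtaining a page content retrieval trigger instruction;
obtaining, according to the page content retrieval trigger instruction, the page address of the page content currently displayed by the terminal device;
obtaining, based on the page address, a content entity knowledge map corresponding to the page content; and
displaying the content entity knowledge map, so that the terminal device sends content selected by the user from the content entity knowledge map to a server for a content retrieval operation.
An embodiment of the present application further provides a content retrieval method performed by a server, including:
receiving the page address of page content from a terminal device;
extracting the page content according to the page address;
extracting content entities from the page content;
creating the content entity knowledge map according to the extracted content entities and the associations between the content entities; and
sending the content entity knowledge map to the retrieval terminal for display, so that the terminal device sends content selected by the user from the content entity knowledge map to the server for a content retrieval operation.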
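An embodiment of the present application further provides a content retrieval terminal, including:
a trigger instruction receiving module, configured to obtain a page content retrieval trigger instruction;
a page address obtaining module, configured to obtain, according to the page content retrieval trigger instruction, the page address of the page content currently displayed by the content retrieval terminal;
a knowledge map generating module, configured to obtain, based on the page address, a content entity knowledge map corresponding to the page content; and
a map display module, configured to display the content entity knowledge map, so that the content retrieval terminal sends content selected by the user from the content entity knowledge map to a server for a content retrieval operation.
An embodiment of the present application further provides a content retrieval server, including:
a page address receiving module, configured to receive the page address of page content from a retrieval terminal;
a page content extraction module, configured to extract the page content according to the page address;
a content entity extraction module, configured to extract content entities from the page content;
a knowledge map creation module, configured to create the content entity knowledge map according to the extracted content entities and the associations between the content entities; and
a knowledge map sending module, configured to send the content entity knowledge map to the retrieval terminal for display, so that the retrieval terminal sends content selected by the user from the content entity knowledge map to the server for a content retrieval operation.
An embodiment of the present application further provides a computer-readable storage medium storing processor-executable instructions, the instructions being loaded by one or more processors to perform the content retrieval method described above.
An embodiment of the present application further provides an electronic device including a processor and a memory, the memory storing a computer program, where the processor performs the content retrieval method described above by invoking the computer program.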
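Brief Description of the Drawings
FIG. 1A is a system architecture diagram of the present application;
FIG. 1B is a flowchart of a content retrieval method in some embodiments of the present application;
FIG. 2 is a flowchart of a content retrieval method in some embodiments of the present application;
FIG. 3 is a flowchart of a background server generating the content entity knowledge map of page content in the content retrieval method in some embodiments of the present application;
FIG. 4 is a flowchart of a content retrieval method in some embodiments of the present application;
FIG. 5 is a schematic structural diagram of a content retrieval terminal in some embodiments of the present application;
FIG. 6 is a schematic structural diagram of a content retrieval terminal in some embodiments of the present application;
FIG. 7 is a schematic structural diagram of the background server corresponding to the content retrieval terminal in some embodiments of the present application;
FIG. 8 is a schematic structural diagram of the page content extraction module of the background server corresponding to the content retrieval terminal in some embodiments of the present application;
FIG. 9 is a schematic structural diagram of a content retrieval server in some embodiments of the present application;
FIG. 10 is a schematic structural diagram of the page content extraction module of the content retrieval server in some embodiments of the present application;
FIG. 11 is a sequence diagram of the content retrieval flow of the content retrieval method, content retrieval terminal, and content retrieval server in some embodiments of the present application;
FIG. 12a is a schematic diagram of page content for the content retrieval method, content retrieval terminal, and content retrieval server in some embodiments of the present application;
FIG. 12b and FIG. 12c are schematic diagrams of content entity knowledge maps for the content retrieval method, content retrieval terminal, and content retrieval server in some embodiments of the present application;
FIG. 13 is a schematic structural diagram of the working environment of the electronic device in which the content retrieval terminal and the content retrieval server reside in some embodiments of the present application.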
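Detailed Description
Referring to the drawings, in which like reference numerals denote like components, the principles of the present application are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the present application and should not be regarded as limiting other specific embodiments not detailed herein.
In the following description, specific embodiments of the present application are described with reference to steps and symbolic representations of operations performed by one or more computers, unless otherwise stated. It will therefore be understood that these steps and operations, which are at times referred to as being computer-executed, include the manipulation by a computer processing unit of electrical signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the memory system of the computer, which reconfigures or otherwise alters the operation of the computer in a manner well understood by those skilled in the art. The data structures in which the data is maintained are physical locations of the memory that have particular properties defined by the format of the data. However, although the principles of the present application are described in the foregoing terms, this is not meant to be limiting, and those skilled in the art will appreciate that the various steps and operations described below may also be implemented in hardware.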
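The content retrieval method, terminal, and server of the present application may be provided in any electronic device and are used to perform a content retrieval operation on page content supplied by a user; the content retrieval operation covers a wide range of application scenarios and has high retrieval efficiency. The electronic device includes, but is not limited to, a wearable device, a head-mounted device, a healthcare platform, a personal computer, a server computer, a handheld or laptop device, a mobile device (such as a mobile phone, a personal digital assistant (PDA), or a media player), a multiprocessor system, a consumer electronic device, a minicomputer, a mainframe computer, a distributed computing environment including any of the above systems or devices, and the like. The content retrieval terminal is preferably a mobile terminal, and the content retrieval server is preferably a content retrieval background server. In the content retrieval method of the present application, the content retrieval terminal determines the page content to be retrieved, and the background server extracts keywords from the page content and builds the knowledge map, which broadens the range of application scenarios of content retrieval on the terminal and improves retrieval efficiency.
The present application provides a content retrieval method, a terminal, a server, an electronic device, and a storage medium. FIG. 1A is a system architecture diagram of the present application. As shown in FIG. 1A, a server 102 provides a retrieval service. The server 102 provides page services over one or more networks 106 to multiple users, each of whom operates a respective terminal device 104 (for example, terminal devices 104a-c).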
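In some embodiments, each user connects to the server 102 through a client application 108 (for example, client applications 108a-c) running on the terminal device 104. The client application 108 may be a browser, a social application such as WeChat, QQ, or Weibo, or a multimedia application such as a video application or an article application. When the user requests page data from the server 102 through the client application 108 on the terminal device 104, the server 102 sends the corresponding page data to the client application 108, which displays the corresponding page on the display screen of the terminal device 104 according to the received page data. When the page is a retrievable page, a page retrieval trigger prompt may be displayed on the page. In response to the page retrieval trigger prompt being triggered, the client application 108 sends the page address of the page to the server 102, and the server 102 determines a content entity knowledge map according to the page address and sends it to the client application 108 for display. For example, when the page is a promotional page for a video, the content entity knowledge map may include the main characters in the video and the associations between them.
Examples of the terminal device 104 include, but are not limited to, a palmtop computer, a wearable computing device, a personal digital assistant (PDA), a tablet computer, a notebook computer, a desktop computer, a mobile phone, a smartphone, an Enhanced General Packet Radio Service (EGPRS) mobile phone, a media player, a navigation device, a game console, a television, or a combination of any two or more of these or other data processing devices.
Examples of the one or more networks 106 include a local area network (LAN) and a wide area network (WAN) such as the Internet. Optionally, the one or more networks 106 may be implemented using any well-known network protocol, including various wired or wireless protocols such as Ethernet, Universal Serial Bus (USB), FireWire, Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi, Voice over IP (VoIP), WiMAX, or any other suitable communication protocol.
Each terminal device 104 optionally includes one or more internal peripheral modules, or may be connected by wire or wirelessly to one or more peripheral devices (for example, a navigation system, a health monitor, a climate controller, smart sports equipment, a Bluetooth headset, or a smartwatch).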
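Referring to FIG. 1B, which is a flowchart of a content retrieval method of the present application, the content retrieval method of this embodiment may be implemented using the terminal device 104 described above and includes:
Step S101: obtaining a page content retrieval trigger instruction;
Step S102: obtaining, according to the page content retrieval trigger instruction, the page address of the page content currently displayed by the terminal device;
Step S103: obtaining, based on the page address, a content entity knowledge map corresponding to the page content;
Step S104: displaying the content entity knowledge map, so that the terminal device sends content selected by the user from the content entity knowledge map to a server for a content retrieval operation.
The specific flow of each step of the content retrieval method of this embodiment is described in detail below.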
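In step S101, the content retrieval terminal (terminal device 104) obtains a page content retrieval trigger instruction, that is, an instruction that triggers sending the page content selected by the user to the background server for content retrieval. The user can generate the instruction in various ways, for example by tapping a search button at a set position on a page, or by performing a touch operation on the current page content, such as a pull-down operation or a zoom operation.
In step S102, the content retrieval terminal (terminal device 104) obtains, according to the page content retrieval trigger instruction obtained in step S101, the page address of the page content that the content retrieval terminal is currently displaying.
In step S103, the content retrieval terminal obtains, based on the page address obtained in step S102, the content entity knowledge map corresponding to the page content. Specifically, the content retrieval terminal may send the page address to the corresponding background server; the background server obtains the page content for that page address, extracts the page content keywords, and generates the content entity knowledge map of the page content from those keywords. Alternatively, the content retrieval terminal may itself generate the content entity knowledge map from the page address.
The content entity knowledge map describes, in a visual manner, the interconnections (associations) among the multiple content entities in the page content. It gives a graphical description of the page content so that the user can more easily obtain the keywords of the page content and the associations between them. A content entity represents an object contained in the page content information, for example a person, an animal, or another non-living object such as a creative work or a residential apartment.
In step S104, the content retrieval terminal receives the content entity knowledge map from the background server and displays it on the screen of the content retrieval terminal; alternatively, the content retrieval terminal may generate the content entity knowledge map itself from the page address. The user can perform a keyword content retrieval operation by selecting a keyword on the content entity knowledge map.
This completes the page content retrieval process of the content retrieval method of this embodiment.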
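The terminal-side flow of steps S101 to S104 can be sketched as follows. This is only an illustrative outline, not the applicant's implementation: the gesture names, the function `handle_gesture`, and the three injected callables (`request_map`, `show_map`, `search`) are all hypothetical stand-ins for the trigger detection, the server round trip, the display, and the follow-up keyword search.

```python
# Hypothetical gesture names; the source only says the trigger may be a
# search-button tap, a pull-down operation, or a zoom operation.
TRIGGER_GESTURES = {"search_button_tap", "pull_down", "zoom"}


def handle_gesture(gesture, page_address, request_map, show_map, search):
    """Run steps S101 to S104 for one user gesture.

    The three callables are assumptions made for this sketch:
    request_map(page_address) returns the knowledge map (from the server
    or built locally), show_map(knowledge_map) displays it and returns the
    keyword the user picked (or None), and search(keyword) sends that
    keyword to the server and returns the results.
    """
    if gesture not in TRIGGER_GESTURES:          # S101: not a trigger instruction
        return None
    knowledge_map = request_map(page_address)    # S102 and S103
    selected = show_map(knowledge_map)           # S104: display the map
    if selected is None:                         # user picked nothing
        return None
    return search(selected)
```

Passing the three operations in as callables keeps the sketch independent of any particular UI toolkit or network layer, and it also covers the variant in which the terminal builds the map itself.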
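The content retrieval method of this embodiment generates a content entity knowledge map from the page content, and the user can perform content retrieval operations using the keywords in that map. The user therefore does not need to enter keywords, and can even retrieve multiple keywords in the page content at once, which broadens the range of application scenarios of content retrieval and improves retrieval efficiency.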
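Referring to FIG. 2, which is a flowchart of a content retrieval method of the present application, the content retrieval method of this embodiment may be implemented using the terminal device 104 described above and includes:
Step S201: receiving a page content retrieval list from the background server, and giving a page content retrieval trigger prompt according to the content of the list;
Step S202: generating a page content retrieval trigger instruction according to the user's touch operation on the page content display interface; that is, the terminal device obtains the page content retrieval trigger instruction in response to a touch operation that the user performs on the page content display interface according to the page content retrieval trigger prompt;
Step S203: obtaining the page address of the page content according to the page content retrieval trigger instruction, namely the page address of the page content currently displayed by the terminal device;
Step S204: obtaining, based on the page address, the content entity knowledge map corresponding to the page content; the terminal device 104 may generate the content entity knowledge map itself, or may receive a content entity knowledge map generated and sent by the server 102;
Step S205: displaying the content entity knowledge map so that the user can perform a keyword content retrieval operation, that is, so that the terminal device sends content selected by the user from the content entity knowledge map to the server for a content retrieval operation.
The specific flow of each step of the content retrieval method of this embodiment is described in detail below.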
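In step S201, not all page content can undergo a page content retrieval operation; for example, the content of some pages cannot be extracted by a page crawler. The content retrieval terminal (terminal device 104) therefore receives a page content retrieval list from the background server, which indicates which pages support the page content retrieval operation.
The page content retrieval list may be a whitelist of pages, for example one that marks page content under www.qq.com as retrievable; a blacklist of pages, for example one that marks page content under www.163.com as not retrievable; a combined blacklist and whitelist of pages; or a list of blacklisted and whitelisted site categories, for example one that marks all pages with the cn suffix as a whitelisted, retrievable category and all pages with the org suffix as a blacklisted, non-retrievable category.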
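One way to evaluate such a page content retrieval list on the terminal is sketched below. The function name `is_retrievable` and its parameters are hypothetical, and the rule that a blacklist entry overrides a whitelist entry is an assumption of this sketch; the source does not say which list wins when both match.

```python
from urllib.parse import urlsplit


def is_retrievable(url, whitelist_hosts=(), blacklist_hosts=(),
                   whitelist_suffixes=(), blacklist_suffixes=()):
    """Return True if the page at `url` may show the retrieval trigger prompt."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return False                      # not an absolute URL with a host
    suffix = host.rsplit(".", 1)[-1]      # e.g. "cn", "org", "com"
    # Assumption: a blacklist match takes precedence over a whitelist match.
    if host in blacklist_hosts or suffix in blacklist_suffixes:
        return False
    return host in whitelist_hosts or suffix in whitelist_suffixes


# Example using the hosts and suffixes named in the text.
print(is_retrievable("https://www.qq.com/a/1.html",
                     whitelist_hosts={"www.qq.com"}))
```

The terminal could call this check whenever the displayed page changes and show "retrievable" or "not retrievable" accordingly.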
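The content retrieval terminal then gives a page content retrieval trigger prompt for the page the user is currently browsing according to the content of the list, so that the user can issue a page content retrieval trigger instruction in response to the prompt. If the current page supports page content retrieval, the prompt is shown at a preset position on the page, for example by marking "retrievable" in the upper right corner; if it does not, "not retrievable" is shown in the upper right corner. The way the prompt is displayed can of course be modified as required.
In step S202, if the page the user is currently browsing supports page content retrieval, the content retrieval terminal can receive the user's touch operation on the page display interface and generate the page content retrieval trigger instruction, for example when the user taps a search button at a set position on the current page or performs a pull-down or zoom operation on it. The page content retrieval trigger instruction is an instruction that triggers sending the page content selected by the user to the background server for content retrieval. The touch operation must be set in advance: when the terminal detects that the user has performed the touch operation and the current page supports page content retrieval, it generates the page content retrieval trigger instruction.
In step S203, the content retrieval terminal obtains, according to the page content retrieval trigger instruction generated in step S202, the page address of the page content that the terminal is currently displaying.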
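In step S204, the content retrieval terminal generates, based on the page address obtained in step S203, the content entity knowledge map corresponding to the page content. Specifically, the content retrieval terminal sends the page address to the corresponding background server, which generates the content entity knowledge map of the page content from the page address. Referring to FIG. 3, which is a flowchart of the background server generating the content entity knowledge map of the page content in the content retrieval method of the present application, step S204 includes:
Step S301: the background server extracts the page content according to the obtained page address.
Specifically, the background server may first normalize the obtained page address. The normalization maps page addresses that use different domain names but correspond to the same page to a single page address, so that the background server can recognize the same page expressed under different domain names.
The background server then determines whether its local storage holds the page content corresponding to the normalized page address. If it does, the background server extracts the page content directly from local storage, which avoids the slowness of extracting page content in real time and improves extraction performance. If it does not, the background server extracts the page content directly from the page address.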
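Step S301 (normalize the page address, then try local storage before fetching) can be sketched as follows. The alias table, the function names, and the `fetch` callable are hypothetical; writing the fetched content back into the cache is also an assumption of this sketch, since the source only describes the lookup.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table: domain names known to serve the same pages.
DOMAIN_ALIASES = {
    "m.example.com": "www.example.com",
    "example.com": "www.example.com",
}


def normalize_page_address(url, aliases=DOMAIN_ALIASES):
    """Map addresses that differ only by an aliased domain to one address."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host = aliases.get(host, host)
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment is dropped because it does not identify a different page.
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))


def get_page_content(url, cache, fetch):
    """Return page content from the local cache, fetching it only on a miss."""
    key = normalize_page_address(url)
    if key in cache:
        return cache[key]
    content = fetch(key)      # slow path: extract the page in real time
    cache[key] = content      # assumption: remember it for the next request
    return content
```

With this shape, two requests for `http://m.example.com/a` and `http://example.com/a` share one cache entry, which is the effect the normalization step is meant to achieve.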
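Step S302: the background server uses a page crawler to extract content entities from the page content. Specifically, the title, subtitle, author, and body content of the page are extracted. Text processing operations such as word segmentation, named entity recognition (NER), and term frequency-inverse document frequency (TF-IDF) weighting are then applied to the title and body content, abstracting the page content into several content entities. These content entities effectively reflect the whole of the page content.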
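The TF-IDF part of step S302 can be illustrated with a small pure-Python sketch. It assumes word segmentation (and any NER filtering) has already produced token lists; the function name `tfidf_keywords`, the background corpus, and the particular IDF variant (log of document count over document frequency, with the page itself counted as one document) are choices made for this sketch and are not specified by the source.

```python
import math
from collections import Counter


def tfidf_keywords(page_tokens, corpus_tokens, top_k=3):
    """Rank the tokens of one page by TF-IDF against a background corpus.

    page_tokens: list of tokens from the page title and body.
    corpus_tokens: list of token lists for other (background) documents.
    """
    tf = Counter(page_tokens)
    n_docs = len(corpus_tokens) + 1          # background documents plus the page
    scores = {}
    for term, count in tf.items():
        # Document frequency: the page itself always contains the term.
        df = 1 + sum(1 for doc in corpus_tokens if term in doc)
        idf = math.log(n_docs / df)
        scores[term] = (count / len(page_tokens)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
```

Terms that appear in every background document get an IDF of zero and fall to the bottom, so frequent but uninformative words are not chosen as content entities.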
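Step S303: the background server uses the content entities as search terms, extracts the specific data of each content entity from the background database using search engine technology, and obtains the associations between the content entities. That is, it obtains the entity attributes of each content entity (entity name, entity type, entity information, and so on) and the entity relationships between related content entities (such as singer, performer, or spouse relationships).
For example, if the content entity is 刘德华 (Andy Lau), the background server uses 刘德华 as the search term and extracts the entity's specific data from the background database using search engine technology, such as the facts that he is an actor and a singer, his debut date, and his representative works. It can also extract the relationship between 刘德华 and another content entity, 张学友 (Jacky Cheung): both are Hong Kong singers, and the two appeared together in the film "江湖". In this way an entity relationship between the two content entities can be established.
The entity relationships here may be, for example, the relationship map of the characters that the actors play in a TV series, or the relationship map of the actors in real life. The name of the TV series and the names of the actors are entity attributes of the content entities; the spouse and father-son relationships between characters in the series, and the cast relationship between an actor and the series, are entity relationships of the content entities.
The background server can thus create the content entity knowledge map from the content entities and the associations between them. The content entity knowledge map describes, in a visual manner, the interconnections among the multiple content entities in the page content, giving a graphical description of the page content so that the user can more easily obtain its keywords and the associations between them. The map can use a multi-level hierarchy to represent the interconnections between different content entities; more important content entities should be placed at the highest level of the hierarchy so that their entity attributes and entity relationships are displayed well.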
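A content entity knowledge map of this kind (that is, a small knowledge graph of entities, attributes, and labeled relationships) could be held in a structure like the one below. The class and field names are hypothetical, chosen only to mirror the entity attributes (name, type, information) and entity relationships described in step S303; the source does not prescribe a data structure.

```python
from dataclasses import dataclass, field


@dataclass
class ContentEntity:
    name: str                                  # entity name
    kind: str                                  # entity type, e.g. "person"
    info: dict = field(default_factory=dict)   # other entity information
    level: int = 1                             # display level, 1 = top


@dataclass
class KnowledgeMap:
    entities: dict = field(default_factory=dict)    # name -> ContentEntity
    relations: list = field(default_factory=list)   # (source, relation, target)

    def add_entity(self, entity):
        self.entities[entity.name] = entity

    def relate(self, source, relation, target):
        if source not in self.entities or target not in self.entities:
            raise KeyError("both entities must be added before relating them")
        self.relations.append((source, relation, target))

    def neighbors(self, name):
        """Return (relation, other entity) pairs touching `name`."""
        out = []
        for source, relation, target in self.relations:
            if source == name:
                out.append((relation, target))
            elif target == name:
                out.append((relation, source))
        return out


# The example from the text: two entities linked through a film.
kmap = KnowledgeMap()
kmap.add_entity(ContentEntity("刘德华", "person", {"roles": ["actor", "singer"]}))
kmap.add_entity(ContentEntity("张学友", "person", {"roles": ["singer"]}))
kmap.relate("刘德华", "co-starred in 江湖 with", "张学友")
```

Storing relations as labeled triples keeps the structure easy to serialize when the server sends the map to the terminal.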
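Step S304: the page content may contain too many content entities for a content entity knowledge map with few levels to reflect all the associations between them. In that case the background server reads the user portrait of the user of the content retrieval terminal. The user portrait may be preset in the background server or in the content retrieval terminal, and consists of the user's interest values for different content entities, derived from behavior such as content browsing, content searching, and content purchasing. For example, some users are more interested in films and others in songs.
The background server can then adjust the priority of the content entities in the content entity knowledge map obtained in step S303 according to the preset user portrait. In doing so, it determines the priority of each content entity from the preset user portrait, and determines from that priority how the content entity is displayed in the content entity knowledge map. The map can thus display first the content entities the user is most interested in, place content entities of lower interest at the second or third level, and delete outright the content entities judged to be of no interest to the user.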
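The priority adjustment of step S304 can be sketched as a mapping from interest values to display levels. The function name, the representation of the user portrait as a dictionary of interest values per entity type, and the three threshold values are all assumptions of this sketch; the source only states that high-interest entities are shown first, lower-interest ones move to the second or third level, and uninteresting ones are deleted.

```python
def adjust_priorities(entity_kinds, interest, thresholds=(0.7, 0.4, 0.1)):
    """Assign each entity a display level (1 = top) or drop it.

    entity_kinds: dict mapping entity name -> entity type.
    interest: user portrait, dict mapping entity type -> interest value in [0, 1].
    thresholds: hypothetical lower bounds for levels 1, 2, and 3.
    """
    levels = {}
    for name, kind in entity_kinds.items():
        score = interest.get(kind, 0.0)
        for level, bound in enumerate(thresholds, start=1):
            if score >= bound:
                levels[name] = level
                break
        # Entities below the last threshold are treated as uninteresting
        # and left out of the adjusted map.
    return levels
```

For a user portrait such as `{"tv_series": 0.9, "actor": 0.5, "song": 0.15}`, a TV series entity would land on level 1, an actor on level 2, a song on level 3, and an entity of any other type would be removed.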
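This completes the process by which the background server generates the content entity knowledge map of the page content.
Step S205: the content retrieval terminal receives the priority-adjusted content entity knowledge map from the background server and displays it on its screen. The user can perform a keyword content retrieval operation by selecting a keyword (a content entity) on the map, or can directly have a new content entity knowledge map generated from the selected keyword.
This completes the page content retrieval process of the content retrieval method of this embodiment.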
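When the user selects a keyword to generate a new content entity knowledge map, one simple way to lay out the new map is to put the selected entity on the top level and place the other entities by their hop distance from it. This is purely an illustrative assumption: the source does not describe how the new map is arranged, and the function name `recenter` and the `max_level` cutoff are hypothetical.

```python
from collections import deque


def recenter(relations, selected, max_level=3):
    """Assign display levels by hop distance from the selected entity.

    relations: iterable of (source, relation, target) triples.
    Returns a dict mapping entity name -> level, with the selected entity
    on level 1; entities farther away than max_level are omitted.
    """
    adjacency = {}
    for source, _relation, target in relations:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)

    levels = {selected: 1}
    queue = deque([selected])
    while queue:                              # breadth-first traversal
        node = queue.popleft()
        if levels[node] == max_level:
            continue                          # do not expand past the last level
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in levels:
                levels[neighbor] = levels[node] + 1
                queue.append(neighbor)
    return levels
```

A breadth-first traversal guarantees that each entity receives the smallest possible level, so the entities most directly related to the selected keyword stay closest to it.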
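With the technical solution of this embodiment, the page content retrieval list and the page content retrieval trigger prompt filter out pages that cannot undergo page content retrieval, which further improves retrieval efficiency; generating the page content retrieval trigger instruction from the user's touch operation on the page content display interface increases the variety of ways the instruction can be triggered; and because the page retrieval process can run on the background server while the content retrieval terminal only displays the content entity knowledge map, the performance of the content retrieval terminal is improved.
Referring to FIG. 4, which is a flowchart of a content retrieval method in some embodiments of the present application, the content retrieval method of this embodiment may be implemented using the content retrieval server described above and includes: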
步骤S401,从终端设备接收页面内容的页面地址;
步骤S402,根据页面地址提取页面内容;
步骤S403,提取页面内容的内容实体;
步骤S404,根据提取的内容实体以及内容实体之间的关联性,创建内容实体知识图谱;
在一些实例中,在步骤S404之后,还可以包括步骤S405,基于预设用户画像,对内容实体知识图谱进行内容实体优先级调整;
步骤S406,将内容实体知识图谱发送至检索终端进行展示,以便用户进行关键词内容检索操作。即以便终端设备将用户从所述内容实体知识图谱中选定的内容发送至服务器进行内容检索操作。
下面详细说明本实施例的内容检索方法的各步骤的具体流程。
在步骤S401中,内容检索服务器从检索终端接收页面内容的页面地址,即检索终端当前正在显示的页面内容的页面地址。
在步骤S402中,内容检索服务器根据步骤S401获取的页面地址提取页面内容。
具体的,这里内容检索服务器可先对获取的页面地址进行归一化操作,以便内容检索服务器可较好的识别不同域名表示的相同页面地址。
随后内容检索服务器会判断服务器本地存储器是否存储有该归一化操作后的页面地址对应的页面内容。如服务器本地存储器存储有归一化操作后的页面地址对应的页面内容,则后台服务器可直接从服务器本地存储器提取该页面内容,这样可以较好的避免实时页面内容提取速度慢的问题,提高了页面内容的提取性能。如服务器本地存储器未存储归一化操作后的页面地址对应的页面内容,则后台服务器直接从页面地址提取上述页面内容。
在步骤S403中,内容检索服务器使用页面爬虫对页面内容进行内容实体提取。具体可将页面内容中的标题、副标题、作者以及具体内容提取出来。随后对上述标题以及具体内容进行分词、命名实体识别(NER,Named Entity Recognition)以及词频-逆向文件频率(TF-IDF,term frequency–inverse document frequency)等文本处理操作,将页面内容抽象成若干个内容实体。这些内容实体可有效的反馈该页面内容的所有内容。
在步骤S404中,内容检索服务器以上述内容实体作为检索词,通过搜索引擎技术从后台数据库中提取所述内容实体的具体数据,并获取内容实体之间的关联性。 即获取内容实体的实体属性(实体名称、实体种类以及实体信息等)以及相关内容实体之间的实体关系(如演唱者、表演者以及夫妻人物关系等)。
如内容实体为刘德华,则后台服务器以刘德华作为检索词,通过搜索引擎技术从后台数据库中提取内容实体的具体数据,如刘德华为演员、歌手、刘德华的出道时间、代表作品等;还可提取到刘德华与另一内容实体张学友的关系,如刘德华和张学友均为香港歌手,刘德华和张学友一起出演过电影“江湖”等。这样即可建立刘德华和张学友两个内容实体之间的实体关系。
这里的实体关系可如某部电视剧的演员在剧中的人物关系图谱以及演员在现实生活中的人物关系图谱等。电视剧的名称以及演员的名称即为内容实体的实体属性,剧中人物之间的夫妻关系、父子关系以及演员与该电视剧的演员关系即为内容实体的实体关系。
这样内容检索服务器可根据上述内容实体以及内容实体之间的关联性,创建内容实体知识图谱。这里的内容实体知识图谱是指用可视化的方式描述该页面内容中多个内容实体之间的相互联系。这里可通过页面内容的内容实体知识图谱对页面内容进行图形化的描述,以便用户更好的获取页面内容的关键词以及关键词之间的关联性。这里的内容实体知识图谱可通过多个层级结构来表示不同内容实体之间的相互联系,较为重要的内容实体应放置在层级结构的最高层级,以便对该内容实体的实体属性以及实体关系进行较好的展示。
在步骤S405中,由于页面内容包含的内容实体可能会过多,这样导致无法通过一个较少层级的内容实体知识图谱反馈所有的内容实体之间的关联性。这时后台服务器会读取内容检索终端用户的用户画像,该用户画像可预设在后台服务器或预设在内容检索终端中,该用户画像是指通过用户的如内容浏览、内容搜索以及内容购买等行为得出的用户对不同内容实体的兴趣值。如某些用户对电影兴趣较大,某些用户对歌曲兴趣较大等。
这样内容检索服务器可根据预设用户画像,对步骤S404获取的内容实体知识图谱中的内容实体进行优先级调整。即使得内容实体知识图谱可以优先显示用户最感兴趣的内容实体,将用户兴趣较差的内容实体放置到内容实体知识图谱的第二层级或第三层级,将判断用户不感兴趣的内容实体直接从内容实体知识图谱中删除等。
在步骤S406中,内容检索服务器将进行优先级调整后的内容实体知识图谱发送至检索终端进行展示,这样内容检索终端的用户可通过选定内容实体知识图谱上的关键词进行关键词内容检索操作或直接以用户选定的关键词再次生成新的内容实体知识图谱。
这样即完成了本实施例的内容检索方法的页面内容检索过程。
本实施例的内容检索方法通过页面内容生成对应的内容实体知识图谱,用户可通过内容实体知识图谱中的关键词进行内容检索操作,这样用户不需要主动输入关键词,甚至可一次性对页面内容中的多个关键词同时进行检索操作,从而扩大了内容检索的应用场景范围,同时提高了内容检索的检索效率。
且页面检索过程可在后台服务器进行,内容检索终端仅仅对内容实体知识图谱进行展示操作,因此可有效提高对应的内容检索终端的性能。
本申请实施例还提供一种内容检索终端,请参照图5,图5为本申请实施例提供的内容检索终端的结构示意图。本实施例的内容检索终端可使用上述的内容检索方法进行实施,本实施例的内容检索终端50包括触发指令接收模块51、页面地址获取模块52、知识图谱生成模块53以及图谱展示模块54。
触发指令接收模块51用于获取页面内容检索触发指令;页面地址获取模块52用于根据页面内容检索触发指令获取终端设备当前显示的页面内容的页面地址;知识图谱生成模块53用于基于页面地址获取页面内容对应的内容实体知识图谱;图谱展示模块54用于接收并展示内容实体知识图谱,以便所述终端设备将用户从所述内容实体知识图谱中选定的内容发送至服务器进行内容检索操作。
本实施例的内容检索终端50使用时,首先触发指令接收模块51接收页面内容检索触发指令,这里的页面内容检索触发指令是指用于触发将用户选定的页面内容发送至后台服务器进行内容检索的指令。用户可通过各种方式生成该页面内容检索触发指令,如通过点击某个页面设定位置的检索按键或对当前的页面内容进行触控操作,如通过触控操作对页面内容进行下拉操作,或通过触控操作对页面内容进行缩放操作等。
随后页面地址获取模块52根据触发指令接收模块51获取的页面内容检索触发指令,获取当前内容检索终端正在显示的页面内容的页面地址。
然后知识图谱生成模块53基于页面地址获取模块52获取的页面地址生成页面内容对应的内容实体知识图谱;具体的,知识图谱生成模块53将页面地址获取模块52获取的页面地址发送至对应的后台服务器,这样后台服务器可针对该页面地址获取对应的页面内容,随后后台服务器可获取该页面内容的页面内容关键词(内容实体),并根据上述页面内容关键词生成该页面内容的内容实体知识图谱。当然这里知识图谱生成模块53也可自行根据页面地址生成该页面内容对应的内容实体知识图谱。
这里的内容实体知识图谱是指用可视化的方式描述该页面内容中多个内容实体之间的相互联系。这里可通过页面内容的内容实体知识图谱对页面内容进行图形化的描述,以便用户更好的获取页面内容的关键词以及关键词之间的关联性。
最后图谱展示模块54从后台服务器接收内容实体知识图谱,并在内容检索终端的屏幕展示该内容实体知识图谱,用户可通过选定内容实体知识图谱上的关键词进行关键词内容检索操作。
这样即完成了本实施例的内容检索终端50的页面内容检索过程。
本实施例的内容检索终端通过页面内容生成对应的内容实体知识图谱,用户可通过内容实体知识图谱中的关键词进行内容检索操作,这样用户不需要主动输入关键词,甚至可一次性对页面内容中的多个关键词同时进行检索操作,从而扩大了内容检索的应用场景范围,同时提高了内容检索的检索效率。
请参照图6,图6为本申请的内容检索终端的结构示意图。本实施例的内容检索终端可使用上述的内容检索方法进行实施,本实施例的内容检索终端60包括检索触发提示模块61、触发指令接收模块62、页面地址获取模块63、知识图谱生成模块64以及图谱展示模块65。
检索触发提示模块61用于从后台服务器接收页面内容检索列表,并根据页面内容检索列表的内容进行页面内容检索触发提示,响应于对所述页面内容检索触发提示的操作,获取所述页面内容检索触发指令,以便用户根据页面内容检索触发提示发出页面内容检索触发指令,以便所述终端设备响应于用户根据所述页面内容检索触发提示对页面内容展示界面上的触控操作获取所述页面内容检索触发指令。触发指令接收模块62用于根据用户在页面内容展示界面上的触控操作,生成页面内容检 索触发指令。页面地址获取模块63用于根据页面内容检索触发指令获取页面内容的页面地址;知识图谱生成模块用于基于页面地址生成所述页面内容对应的内容实体知识图谱;图谱展示模块65用于展示内容实体知识图谱,以便用户进行关键词内容检索操作。
请参照图7,图7为本申请的内容检索终端的对应的后台服务器的结构示意图。该后台服务器70包括页面内容提取模块71、内容实体提取模块72、知识图谱创建模块73以及知识图谱优先级调整模块74。
页面内容提取模块71用于根据页面地址提取页面内容;内容实体提取模块72用于使用页面爬虫提取页面内容的内容实体;知识图谱创建模块73用于根据提取的内容实体以及内容实体之间的关联性,创建内容实体知识图谱。知识图谱优先级调整模块74用于基于预设用户画像,对内容实体知识图谱进行内容实体优先级调整。
请参照图8,图8为本申请的内容检索终端对应的后台服务器的页面内容提取模块的结构示意图。该页面内容提取模块71包括页面地址归一化单元81、页面内容存储判断单元82、第一页面内容提取单元83以及第二页面内容提取单元84。
页面地址归一化单元81用于对页面地址进行归一化操作;页面内容存储判断单元82用于判断服务器本地存储器是否存储有归一化操作后的页面地址对应的页面内容;第一页面内容提取单元83用于如存储有归一化操作后的页面地址对应的页面内容,则从服务器本地存储器提取页面内容;第二页面内容提取单元84用于如未存储有归一化操作后的页面地址对应的页面内容,则根据页面地址提取页面内容。
本优选实施例的内容检索终端60使用时,由于并非所有的页面内容均可以进行页面内容检索操作,如某些页面无法通过页面爬虫进行页面内容提取。因此检索触发提示模块61会从后台服务器70接收页面内容检索列表,该页面内容检索列表用来表示那些页面可以进行页面内容检索操作。
该页面内容检索列表可以是页面的白名单列表,如将www.qq.com下的页面内容设置为可进行页面内容检索的白名单列表;也可以是页面的黑名单列表,如将www.163.com下的页面内容设置为不可进行页面内容检索的黑名单列表;也可以是页面的黑白名单列表,或页面的黑白名单种类的列表,如将cn后缀的页面均设置为可进行页面内容检索的白名单网站种类,将org后缀的页面均设置为不可进行页面 内容检索的黑名单网站种类等。
随后检索触发提示模块61会根据该页面内容检索列表的内容对用户当前浏览页面进行页面内容检索触发提示,以便用户根据该页面内容检索触发提示发出页面内容检索触发指令。即如用户当前浏览页面可进行页面内容检索操作,则在该浏览页面的预设位置上进行页面内容检索触发提示,例如在页面的右上角标明“可检索”等;如用户当前浏览页面不可进行页面内容检索操作,则在页面的右上角表明“不可检索”。当然这里页面内容检索触发提示的展示方式可根据要求进行修改。
然后如用户当前浏览页面可进行页面内容检索操作,则触发指令接收模块62可接收用户在页面展示界面上的触控操作,以生成页面内容检索触发指令。如通过点击用户当前浏览页面设定位置的检索按键或对用户当前浏览页面进行下拉操作或缩放操作等。这里的页面内容检索触发指令是指用于触发将用户选定的页面内容发送至后台服务器进行内容检索的指令。该触控操作需预先进行设定,即检测到用户进行上述触控操作且用户当前浏览页面可进行页面内容检索操作,则内容检索终端生成页面内容检索触发指令。
随后页面地址获取模块63根据触发指令接收模块62生成的页面内容检索触发指令,获取当前内容检索终端正在显示的页面内容的页面地址。
然后知识图谱生成模块64基于页面地址获取模块63获取的页面地址生成页面内容对应的内容实体知识图谱,具体的,知识图谱生成模块64将页面地址获取模块63获取的页面地址发送至对应的后台服务器,这样后台服务器70可根据页面地址生成页面内容的内容实体知识图谱。具体过程包括:
后台服务器70的页面内容提取模块71根据获取的页面地址提取页面内容。
具体的,页面内容提取模块71的页面地址归一化单元81可先对获取的页面地址进行归一化操作,以便后台服务器可较好的识别不同域名表示的相同页面地址。
随后页面内容提取模块71的页面内容存储判断单元82会判断服务器本地存储器是否存储有该归一化操作后的页面地址对应的页面内容。如服务器本地存储器存储归一化操作后的页面地址对应的页面内容,则页面内容提取模块71的第一页面内容提取单元83可直接从服务器本地存储器提取该页面内容,这样可以较好的避免实时页面内容提取速度慢的问题,提高了页面内容的提取性能。如服务器本地存储器 未存储归一化操作后的页面地址对应的页面内容,则页面内容提取模块71的第二页面内容提取单元84直接根据页面地址提取上述页面内容。
然后后台服务器70的内容实体提取模块72使用页面爬虫对页面内容进行内容实体提取。具体可将页面内容中的标题、副标题、作者以及具体内容提取出来。随后对上述标题以及具体内容进行分词、命名实体识别(NER,Named Entity Recognition)以及词频-逆向文件频率(TF-IDF,term frequency–inverse document frequency)等文本处理操作,将页面内容抽象成若干个内容实体。这些内容实体可有效的反馈该页面内容的所有内容。
随后后台服务器70的知识图谱创建模块73以上述内容实体作为检索词,通过搜索引擎技术从后台数据库中提取所述内容实体的具体数据(相关数据),并获取内容实体之间的关联性。即获取内容实体的实体属性(实体名称、实体种类以及实体信息等)以及相关内容实体之间的实体关系(如演唱者、表演者以及夫妻人物关系等)。
如内容实体为刘德华,则后台服务器以刘德华作为检索词,通过搜索引擎技术从后台数据库中提取内容实体的具体数据,如刘德华为演员、歌手、刘德华的出道时间、代表作品等;还可提取到刘德华与另一内容实体张学友的关系,如刘德华和张学友均为香港歌手,刘德华和张学友一起出演过电影“江湖”等。这样即可建立刘德华和张学友两个内容实体之间的实体关系。
这里的实体关系可如某部电视剧的演员在剧中的人物关系图谱以及演员在现实生活中的人物关系图谱等。电视剧的名称以及演员的名称即为内容实体的实体属性,剧中人物之间的夫妻关系、父子关系以及演员与该电视剧的演员关系即为内容实体的实体关系。
这样知识图谱创建模块73可根据上述内容实体以及内容实体之间的关联性,创建内容实体知识图谱。这里的内容实体知识图谱是指用可视化的方式描述该页面内容中多个内容实体之间的相互联系。这里可通过页面内容的内容实体知识图谱对页面内容进行图形化的描述,以便用户更好的获取页面内容的关键词以及关键词之间的关联性。这里的内容实体知识图谱可通过多个层级结构来表示不同内容实体之间的相互联系,较为重要的内容实体应放置在层级结构的最高层级,以便对该内容实 体的实体属性以及实体关系进行较好的展示。
由于页面内容包含的内容实体可能会过多,这样导致无法通过一个较少层级的内容实体知识图谱反馈所有的内容实体之间的关联性。最后后台服务器70的知识图谱优先级调整模块74会读取内容检索终端用户的用户画像,该用户画像可预设在后台服务器或预设在内容检索终端中,该用户画像是指通过用户的如内容浏览、内容搜索以及内容购买等行为得出的用户对不同内容实体的兴趣值。如某些用户对电影兴趣较大,某些用户对歌曲兴趣较大等。
这样知识图谱优先级调整模块74可根据预设用户画像,对知识图谱创建模块73获取的内容实体知识图谱中的内容实体进行优先级调整。即使得内容实体知识图谱可以优先显示用户最感兴趣的内容实体,将用户兴趣较差的内容实体放置到内容实体知识图谱的第二层级或第三层级,将判断用户不感兴趣的内容实体直接从内容实体知识图谱中删除等。
这样即完成后台服务器70生成页面内容的实体知识图谱的过程。
随后图谱展示模块65从后台服务器70接收进行优先级调整的内容实体知识图谱,并在内容检索终端60的屏幕展示该内容实体知识图谱,用户可通过选定内容实体知识图谱上的关键词进行关键词内容检索操作或直接以用户选定的关键词再次生成新的内容实体知识图谱。
这样即完成了本实施例的内容检索终端60的页面内容检索过程。
本实施例的内容检索终端通过页面内容检索列表以及页面内容检索触发提示将无法进行页面内容检索的页面进行了过滤,进一步提高了页面内容检索的检索效率;通过用户在页面内容展示界面上的触控操作生成页面内容检索触发指令,提高了页面内容检索触发指令的多样性;页面检索过程可在后台服务器进行,内容检索终端仅仅对内容实体知识图谱进行展示操作,因此提高了内容检索终端的性能。
本申请还提供一种内容检索服务器,请参照图9,图9为本申请的内容检索服务器的一实施例的结构示意图。本实施例的内容检索服务器可使用上述的内容检索方法进行实施。本实施例的内容检索服务器90包括页面地址接收模块91、页面内容提取模块92、内容实体提取模块93、知识图谱创建模块94、知识图谱优先级调整模块95以及知识图谱发送模块96。
页面地址接收模块91用于从检索终端接收页面内容的页面地址;页面内容提取模块92用于根据页面地址提取页面内容;内容实体提取模块93用于使用页面爬虫提取页面内容的内容实体;知识图谱创建模块94用于根据提取的内容实体以及内容实体之间的关联性,创建内容实体知识图谱;知识图谱优先级调整模块95用于基于预设用户画像,对内容实体知识图谱进行内容实体优先级调整;知识图谱发送模块96用于将内容实体知识图谱发送至检索终端进行展示,以便用户进行关键词内容检索操作,以便所述检索终端将用户从所述内容实体知识图谱中选定的内容发送至服务器进行内容检索操作。
请参照图10,图10为本申请的内容检索服务器的一实施例的页面内容提取模块的结构示意图。该页面内容提取模块92包括页面地址归一化单元1001、页面内容存储判断单元1002、第一页面内容提取单元1003以及第二页面内容提取单元1004。
页面地址归一化单元1001用于对页面地址进行归一化操作;页面内容存储判断单元1002用于判断服务器本地存储器是否存储有归一化操作后的页面地址对应的页面内容;第一页面内容提取单元1003用于如存储归一化操作后的页面地址对应的页面内容,则从服务器本地存储器提取页面内容;第二页面内容提取单元1004用于如未存储归一化操作后的页面地址对应的页面内容,则根据页面地址提取页面内容。
本实施例的内容检索服务器90使用时,首先页面地址接收模块91从检索终端接收页面内容的页面地址,即检索终端当前正在显示的页面内容的页面地址。
随后页面内容提取模块92根据页面地址接收模块91获取的页面地址提取页面内容。
具体的,这里页面内容提取模块92的页面地址归一化单元1001可先对获取的页面地址进行归一化操作,以便内容检索服务器可较好的识别不同域名表示的相同页面地址。
随后页面内容提取模块92的页面内容存储判断单元1002会判断服务器本地存储器是否存储该归一化操作后的页面地址对应的页面内容。如服务器本地存储器存储归一化操作后的页面地址对应的页面内容,则页面内容提取模块92的第一页面内容提取单元1003可直接从服务器本地存储器提取该页面内容,这样可以较好的避免 实时页面内容提取速度慢的问题,提高了页面内容的提取性能。如服务器本地存储器未存储归一化操作后的页面地址对应的页面内容,则页面内容提取模块92的第二页面内容提取单元1004根据页面地址提取上述页面内容。
然后内容实体提取模块93使用页面爬虫对页面内容进行内容实体提取。具体可将页面内容中的标题、副标题、作者以及具体内容提取出来。随后对上述标题以及具体内容进行分词、命名实体识别(NER,Named Entity Recognition)以及词频-逆向文件频率(TF-IDF,term frequency–inverse document frequency)等文本处理操作,将页面内容抽象成若干个内容实体。这些内容实体可有效的反馈该页面内容的所有内容。
随后知识图谱创建模块94以上述内容实体作为检索词,通过搜索引擎技术从后台数据库中提取所述内容实体的具体数据(相关数据),并获取内容实体之间的关联性。即获取内容实体的实体属性(实体名称、实体种类以及实体信息等)以及相关内容实体之间的实体关系(如演唱者、表演者以及夫妻人物关系等)。
如内容实体为刘德华,则后台服务器以刘德华作为检索词,通过搜索引擎技术从后台数据库中提取内容实体的具体数据,如刘德华为演员、歌手、刘德华的出道时间、代表作品等;还可提取到刘德华与另一内容实体张学友的关系,如刘德华和张学友均为香港歌手,刘德华和张学友一起出演过电影“江湖”等。这样即可建立刘德华和张学友两个内容实体之间的实体关系。
这里的实体关系可如某部电视剧的演员在剧中的人物关系图谱以及演员在现实生活中的人物关系图谱等。电视剧的名称以及演员的名称即为内容实体的实体属性,剧中人物之间的夫妻关系、父子关系以及演员与该电视剧的演员关系即为内容实体的实体关系。
这样知识图谱创建模块94可根据上述内容实体以及内容实体之间的关联性,创建内容实体知识图谱。这里的内容实体知识图谱是指用可视化的方式描述该页面内容中多个内容实体之间的相互联系。这里可通过页面内容的内容实体知识图谱对页面内容进行图形化的描述,以便用户更好的获取页面内容的关键词以及关键词之间的关联性。这里的内容实体知识图谱可通过多个层级结构来表示不同内容实体之间的相互联系,较为重要的内容实体应放置在层级结构的最高层级,以便对该内容实 体的实体属性以及实体关系进行较好的展示。
由于页面内容包含的内容实体可能会过多,这样导致无法通过一个较少层级的内容实体知识图谱反馈所有的内容实体之间的关联性。这时知识图谱优先级调整模块会读取内容检索终端用户的用户画像,该用户画像可预设在内容检索服务器或预设在内容检索终端中,该用户画像是指通过用户的如内容浏览、内容搜索以及内容购买等行为得出的用户对不同内容实体的兴趣值。如某些用户对电影兴趣较大,某些用户对歌曲兴趣较大等。
这样知识图谱优先级调整模块95可根据预设用户画像,对知识图谱创建模块94获取的内容实体知识图谱中的内容实体进行优先级调整。即使得内容实体知识图谱可以优先显示用户最感兴趣的内容实体,将用户兴趣较差的内容实体放置到内容实体知识图谱的第二层级或第三层级,将判断用户不感兴趣的内容实体直接从内容实体知识图谱中删除等。
最后知识图谱发送模块96将进行优先级调整后的内容实体知识图谱发送至检索终端进行展示,这样内容检索终端的用户可通过选定内容实体知识图谱上的关键词进行关键词内容检索操作或直接以用户选定的关键词再次生成新的内容实体知识图谱。
这样即完成了本实施例的内容检索服务器90的页面内容检索过程。
本实施例的内容检索服务器通过页面内容生成对应的内容实体知识图谱,用户可通过内容实体知识图谱中的关键词进行内容检索操作,这样用户不需要主动输入关键词,甚至可一次性对页面内容中的多个关键词同时进行检索操作,从而扩大了内容检索的应用场景范围,同时提高了内容检索的检索效率。
且页面检索过程在内容检索服务器进行,内容检索终端仅仅对内容实体知识图谱进行展示操作,因此可有效提高对应的内容检索终端的性能。
下面通过一具体实施例说明本申请的内容检索方法、内容检索终端以及内容检索服务器的工作原理。请参照图11,图11为本申请的内容检索方法、内容检索终端以及内容检索服务器的具体实施例的内容检索流程时序图。本具体实施例中,内容检索终端为用户的移动终端,内容检索服务器为浏览器应用的后台服务器。本具体实施例的内容检索流程包括:
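Step S1101: when the mobile terminal user sees page content of interest in the browser application and that page content carries a page content retrieval trigger prompt, the user can issue a page content retrieval trigger instruction by pulling down on the page content.
Step S1102: in response to the page content retrieval trigger instruction, the mobile terminal obtains the page address currently being browsed in the browser application and sends it to the back-end server of the browser application.
Step S1103: after normalizing the received page address, the back-end server obtains the corresponding page content either from its local cache or directly via the page address.
Step S1104: the back-end server uses a page crawler to extract content entities from the page content, for example by extracting the title, subtitle, author, and body text of the page content. Text processing operations such as word segmentation, named entity recognition (NER), and term frequency-inverse document frequency (TF-IDF) weighting are then applied to the title and body text, abstracting the page content into a number of content entities.
FIG. 12a shows a promotional page for television drama A, in which the text can be a synopsis of the drama and the picture can be a publicity still. Content entities such as the drama title "A", the character "B", and the lead actor "C" can be extracted from the page content.
Step S1105: the back-end server uses the content entities as search terms, extracts the specific data of each content entity from the back-end database by means of search engine technology, and creates the content entity knowledge graph corresponding to the page content based on the associations between content entities, as shown in FIG. 12b and FIG. 12c.
Step S1106: based on a user profile built from the mobile terminal user's earlier page browsing records, the back-end server determines the user's degree of interest in the content entities of the content entity knowledge graph and adjusts their position and priority in the graph accordingly. If the user is more interested in television drama A, the content entity knowledge graph shown in FIG. 12b is generated; if the user is more interested in the lead actor, the content entity knowledge graph shown in FIG. 12c is generated.
Step S1107: the back-end server sends the adjusted content entity knowledge graph to the mobile terminal for display, and the mobile terminal user can select any keyword on the content entity knowledge graph to perform a keyword content retrieval operation. Here the user can tap the lead actor's content entity in FIG. 12b to run a retrieval with that actor as the keyword, or switch to the new content entity knowledge graph related to that actor shown in FIG. 12c.
This completes the page content retrieval process of the content retrieval method, content retrieval terminal, and content retrieval server of this specific embodiment.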
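The server-side portion of the flow above (steps S1103 to S1107) can be summarized as a chain of stages. The sketch below only shows the ordering of those stages; every function and parameter name in it is hypothetical, and each stage is passed in as a callable because the application does not fix how any of them is implemented.

```python
def handle_retrieval_request(page_url, user_id, *, get_page, extract_entities,
                             build_graph, load_profile, adjust):
    """Run the server-side stages in the order the steps describe them.

    Every stage is an injected callable (all hypothetical):
      get_page(url) -> page content               (S1103)
      extract_entities(page) -> content entities  (S1104)
      build_graph(entities) -> knowledge graph    (S1105)
      load_profile(user_id) -> user profile       (S1106)
      adjust(graph, profile) -> adjusted graph    (S1106)
    The return value is what would be sent to the terminal in S1107.
    """
    page = get_page(page_url)
    entities = extract_entities(page)
    graph = build_graph(entities)
    profile = load_profile(user_id)
    return adjust(graph, profile)


# Usage with trivial stand-in stages.
result = handle_retrieval_request(
    "http://example.com/a",
    "u1",
    get_page=lambda url: "Drama A stars C as B",
    extract_entities=lambda page: ["A", "B", "C"],
    build_graph=lambda ents: {e: [] for e in ents},
    load_profile=lambda uid: {"A": 0.9, "C": 0.5},
    adjust=lambda graph, profile: [e for e in graph if e in profile],
)
```

Keeping the stages as separate callables reflects the module split in the server embodiment, where extraction, graph creation, and priority adjustment are distinct modules.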
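The content retrieval method, content retrieval terminal, content retrieval server, and electronic device of this application generate a corresponding content entity knowledge graph from the page content, and the user can perform content retrieval operations through the keywords in that graph. This widens the range of application scenarios for content retrieval and improves retrieval efficiency, and it solves the technical problem that existing content retrieval methods and apparatuses have a narrow range of application scenarios and low retrieval efficiency.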
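As used in this application, the terms "component", "module", "system", "interface", "process", and the like generally refer to a computer-related entity: hardware, a combination of hardware and software, software, or software in execution. For example, a component can be, but is not limited to, a process running on a processor, a processor, an object, an executable application, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller itself can be components. One or more components can reside within a process and/or thread of execution, and a component can be located on one computer and/or distributed between two or more computers.
FIG. 13 and the following discussion provide a brief, general description of the operating environment of an electronic device in which the content retrieval terminal and content retrieval server described in this application are implemented. The operating environment of FIG. 13 is only one example of a suitable operating environment and is not intended to suggest any limitation as to the scope of use or functionality of the operating environment. Example electronic devices 1312 include, but are not limited to, wearable devices, head-mounted devices, healthcare platforms, personal computers, server computers, handheld or laptop devices, mobile devices (such as mobile phones, personal digital assistants (PDAs), media players, and the like), multiprocessor systems, consumer electronics, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on.
Although not required, the embodiments are described in the general context of "computer-readable instructions" being executed by one or more electronic devices. Computer-readable instructions can be distributed via computer-readable media (discussed below). Computer-readable instructions can be implemented as program modules, such as functions, objects, application programming interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types. Typically, the functionality of the computer-readable instructions can be combined or distributed as desired in various environments.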
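FIG. 13 illustrates an example of an electronic device 1312 that includes one or more embodiments of the content retrieval terminal and content retrieval server of this application. In one configuration, the electronic device 1312 includes at least one processing unit 1316 and memory 1318. Depending on the exact configuration and type of electronic device, the memory 1318 can be volatile (such as RAM), non-volatile (such as ROM or flash memory), or some combination of the two. This configuration is illustrated in FIG. 13 by dashed line 1314.
In other embodiments, the electronic device 1312 can include additional features and/or functionality. For example, the device 1312 can also include additional storage (for example, removable and/or non-removable), including but not limited to magnetic storage, optical storage, and the like. Such additional storage is illustrated in FIG. 13 by storage 1320. In one embodiment, computer-readable instructions for implementing one or more embodiments provided herein can reside in the storage 1320. The storage 1320 can also store other computer-readable instructions for implementing an operating system, application programs, and the like. Computer-readable instructions can be loaded into the memory 1318 for execution by, for example, the processing unit 1316.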
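The term "computer-readable media" as used herein includes computer storage media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions or other data. The memory 1318 and the storage 1320 are examples of computer storage media. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by the electronic device 1312. Any such computer storage medium can be part of the electronic device 1312.
The electronic device 1312 can also include a communication connection 1326 that allows the electronic device 1312 to communicate with other devices. The communication connection 1326 can include, but is not limited to, a modem, a network interface card (NIC), an integrated network interface, a radio-frequency transmitter/receiver, an infrared port, a USB connection, or another interface for connecting the electronic device 1312 to other electronic devices. The communication connection 1326 can include a wired connection or a wireless connection. The communication connection 1326 can transmit and/or receive communication media.
The term "computer-readable media" can include communication media. Communication media typically embody computer-readable instructions or other data in a "modulated data signal" such as a carrier wave or other transport mechanism, and include any information delivery media. The term "modulated data signal" can include a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.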
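The electronic device 1312 can include an input device 1324, such as a keyboard, mouse, pen, voice input device, touch input device, infrared camera, video input device, and/or any other input device. The device 1312 can also include an output device 1322, such as one or more displays, speakers, printers, and/or any other output device. The input device 1324 and the output device 1322 can be connected to the electronic device 1312 via a wired connection, a wireless connection, or any combination thereof. In one embodiment, an input device or output device from another electronic device can be used as the input device 1324 or output device 1322 of the electronic device 1312.
The components of the electronic device 1312 can be connected by various interconnects, such as a bus. Such interconnects can include Peripheral Component Interconnect (PCI) (such as PCI Express), Universal Serial Bus (USB), FireWire (IEEE 1394), optical bus structures, and the like. In another embodiment, the components of the electronic device 1312 can be interconnected by a network. For example, the memory 1318 can consist of multiple physical memory units located in different physical locations and interconnected by a network.
Those skilled in the art will recognize that storage devices used to store computer-readable instructions can be distributed across a network. For example, an electronic device 1330 accessible via a network 1328 can store computer-readable instructions for implementing one or more embodiments provided in this application. The electronic device 1312 can access the electronic device 1330 and download some or all of the computer-readable instructions for execution. Alternatively, the electronic device 1312 can download pieces of the computer-readable instructions as needed, or some instructions can be executed at the electronic device 1312 and some at the electronic device 1330.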
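Various operations of the embodiments are provided herein. In one embodiment, one or more of the operations described can constitute computer-readable instructions stored on one or more computer-readable media which, when executed by an electronic device, cause that device to perform the operations. The order in which some or all of the operations are described should not be construed as implying that these operations are necessarily order dependent. Alternative orderings will be appreciated by those skilled in the art having the benefit of this description. Further, it should be understood that not all operations are necessarily present in each embodiment provided herein.
Moreover, although the present disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to those skilled in the art upon reading and understanding this specification and the accompanying drawings. The present disclosure includes all such modifications and alterations and is limited only by the scope of the appended claims. In particular, with regard to the various functions performed by the above-described components (for example, elements, resources, and so on), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component that performs the specified function of the described component (that is, one that is functionally equivalent), even if it is not structurally equivalent to the disclosed structure that performs the function in the exemplary implementations of the disclosure shown herein. In addition, although a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such a feature can be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "includes", "having", "has", or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising".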
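The functional units in the embodiments of this application can be integrated into one processing module, each unit can exist physically on its own, or two or more units can be integrated into one module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as a stand-alone product, it can also be stored in a computer-readable storage medium. The storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like. Each of the above apparatuses or systems can perform the method in the corresponding method embodiment.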
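In summary, although this application has been disclosed above by way of embodiments, the ordinal numbers preceding the embodiments are used only for convenience of description and do not limit the order of the embodiments. Moreover, the above embodiments are not intended to limit this application; a person of ordinary skill in the art can make various changes and refinements without departing from the spirit and scope of this application, and the scope of protection of this application is therefore defined by the claims.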

Claims (15)

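  1. A content retrieval method, performed by a terminal device, wherein the method comprises:
    obtaining a page content retrieval trigger instruction;
    obtaining, according to the page content retrieval trigger instruction, the page address of the page content currently displayed by the terminal device;
    obtaining, based on the page address, a content entity knowledge graph corresponding to the page content; and
    displaying the content entity knowledge graph, so that the terminal device sends content selected by a user from the content entity knowledge graph to a server for a content retrieval operation.
  2. The content retrieval method according to claim 1, wherein the obtaining a page content retrieval trigger instruction comprises:
    generating the page content retrieval trigger instruction in response to a touch operation on a page content display interface.
  3. The content retrieval method according to claim 1, wherein, before the obtaining a page content retrieval trigger instruction, the content retrieval method further comprises:
    receiving a page content retrieval list from a back-end server, and presenting a page content retrieval trigger prompt according to the content of the page content retrieval list, so that the terminal device obtains the page content retrieval trigger instruction in response to a touch operation performed by the user on the page content display interface according to the page content retrieval trigger prompt.
  4. The content retrieval method according to claim 1, wherein the obtaining, based on the page address, a content entity knowledge graph corresponding to the page content comprises:
    sending the page address to a server, and receiving from the server the content entity knowledge graph of the page content generated according to the page address.
  5. The content retrieval method according to claim 1, wherein the generating, based on the page address, a content entity knowledge graph corresponding to the page content comprises:
    obtaining the content entities of the page content corresponding to the page address;
    obtaining, according to the content entities, the associations between the content entities;
    generating the content entity knowledge graph according to the content entities and the associations between the content entities.
  6. The method according to claim 5, wherein the obtaining, according to the content entities, the associations between the content entities comprises:
    using the content entities of the page content as search terms to look up related data of the content entities;
    determining the associations between the content entities according to the related data.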
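  7. A content retrieval method, performed by a server, wherein the method comprises:
    receiving, from a terminal device, the page address of page content;
    extracting the page content according to the page address;
    extracting content entities of the page content;
    creating the content entity knowledge graph according to the extracted content entities and the associations between the content entities; and
    sending the content entity knowledge graph to the retrieval terminal for display, so that the terminal device sends content selected by a user from the content entity knowledge graph to a server for a content retrieval operation.
  8. The method according to claim 7, wherein, before the creating the content entity knowledge graph according to the extracted content entities and the associations between the content entities, the method further comprises: obtaining the associations between the content entities;
    wherein the obtaining the associations between the content entities comprises:
    using the content entities of the page content as search terms to look up related data of the content entities;
    determining the associations between the content entities according to the related data.
  9. The content retrieval method according to claim 7, wherein, after the creating the content entity knowledge graph, the method further comprises:
    adjusting the priority of content entities in the content entity knowledge graph based on a preset user profile.
  10. The method according to claim 9, wherein the adjusting the priority of content entities in the content entity knowledge graph based on a preset user profile comprises:
    determining the priority of the content entities according to the preset user profile;
    determining, according to the priority of the content entities, how the content entities are displayed in the content entity knowledge graph.
  11. The content retrieval method according to claim 7, wherein the extracting the page content according to the page address comprises:
    normalizing the page address;
    when the server stores the page content corresponding to the normalized page address, extracting the page content from the server; and
    when the server does not store the page content corresponding to the normalized page address, extracting the page content according to the page address.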
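  12. A content retrieval terminal, comprising:
    a trigger instruction receiving module, configured to obtain a page content retrieval trigger instruction;
    a page address obtaining module, configured to obtain, according to the page content retrieval trigger instruction, the page address of the page content currently displayed by the content retrieval terminal;
    a knowledge graph generation module, configured to obtain, based on the page address, a content entity knowledge graph corresponding to the page content; and
    a graph display module, configured to display the content entity knowledge graph, so that the content retrieval terminal sends content selected by a user from the content entity knowledge graph to a server for a content retrieval operation.
  13. A content retrieval server, comprising:
    a page address receiving module, configured to receive, from a retrieval terminal, the page address of page content;
    a page content extraction module, configured to extract the page content according to the page address;
    a content entity extraction module, configured to extract content entities of the page content;
    a knowledge graph creation module, configured to create the content entity knowledge graph according to the extracted content entities and the associations between the content entities; and
    a knowledge graph sending module, configured to send the content entity knowledge graph to the retrieval terminal for display, so that the retrieval terminal sends content selected by a user from the content entity knowledge graph to a server for a content retrieval operation.
  14. A storage medium storing processor-executable instructions, the instructions being loaded by one or more processors to perform the content retrieval method according to any one of claims 1 to 11.
  15. An electronic device, comprising a processor and a memory, the memory storing a computer program, characterized in that the processor, by invoking the computer program, is configured to perform the content retrieval method according to any one of claims 1 to 11.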
PCT/CN2018/107273 2017-09-25 2018-09-25 Content retrieval method, terminal, server, electronic device, and storage medium WO2019057191A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710872842.XA CN109948073B (zh) 2017-09-25 2017-09-25 Content retrieval method, terminal, server, electronic device, and storage medium
CN201710872842.X 2017-09-25

Publications (1)

Publication Number Publication Date
WO2019057191A1 true WO2019057191A1 (zh) 2019-03-28

Family

ID=65809522

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/107273 WO2019057191A1 (zh) 2017-09-25 2018-09-25 Content retrieval method, terminal, server, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN109948073B (zh)
WO (1) WO2019057191A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
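CN110134796A (zh) * 2019-04-19 2019-08-16 平安科技(深圳)有限公司 Knowledge-graph-based clinical trial retrieval method, apparatus, computer device, and storage medium
CN111309872A (zh) * 2020-03-26 2020-06-19 北京百度网讯科技有限公司 Search processing method, apparatus, and device
CN112015281A (zh) * 2019-05-29 2020-12-01 北京搜狗科技发展有限公司 Cloud-based association method and related apparatus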

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
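CN113127574A (zh) * 2020-01-15 2021-07-16 京东方科技集团股份有限公司 Knowledge-graph-based service data display method, system, device, and medium
CN111522967B (zh) * 2020-04-27 2023-09-15 北京百度网讯科技有限公司 Knowledge graph construction method, apparatus, device, and storage medium
CN111931928B (zh) * 2020-07-16 2022-12-27 成都井之丽科技有限公司 Scene graph generation method, apparatus, and device
CN112182239A (zh) * 2020-09-22 2021-01-05 中国建设银行股份有限公司 Information retrieval method and apparatus
CN113722434B (zh) * 2021-08-30 2024-05-03 平安科技(深圳)有限公司 Text data processing method and apparatus, computer device, and storage medium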

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
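CN103577595A (zh) * 2013-11-15 2014-02-12 北京奇虎科技有限公司 Keyword push method and apparatus based on the currently browsed page
CN104102713A (zh) * 2014-07-16 2014-10-15 百度在线网络技术(北京)有限公司 Method and apparatus for presenting recommendation results
CN105302881A (zh) * 2015-10-14 2016-02-03 上海大学 Method for generating search prompt words for a literature search system
CN106294596A (zh) * 2016-07-29 2017-01-04 北京小米移动软件有限公司 Information search method and apparatus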

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598613B (zh) * 2015-01-30 2017-11-03 百度在线网络技术(北京)有限公司 Method and apparatus for constructing concept relationships for a vertical domain
WO2016176099A1 (en) * 2015-04-28 2016-11-03 Alibaba Group Holding Limited Information search navigation method and apparatus
CN106156244B (zh) * 2015-04-28 2020-08-28 阿里巴巴集团控股有限公司 Information search navigation method and apparatus
CN106817271B (zh) * 2015-11-30 2020-05-22 阿里巴巴集团控股有限公司 Method and apparatus for forming a traffic graph
CN107169010A (zh) * 2017-03-31 2017-09-15 北京奇艺世纪科技有限公司 Method and apparatus for determining recommended search keywords

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
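CN110134796A (zh) * 2019-04-19 2019-08-16 平安科技(深圳)有限公司 Knowledge-graph-based clinical trial retrieval method, apparatus, computer device, and storage medium
CN110134796B (zh) * 2019-04-19 2023-06-02 平安科技(深圳)有限公司 Knowledge-graph-based clinical trial retrieval method, apparatus, computer device, and storage medium
CN112015281A (zh) * 2019-05-29 2020-12-01 北京搜狗科技发展有限公司 Cloud-based association method and related apparatus
CN111309872A (zh) * 2020-03-26 2020-06-19 北京百度网讯科技有限公司 Search processing method, apparatus, and device
CN111309872B (zh) * 2020-03-26 2023-08-08 北京百度网讯科技有限公司 Search processing method, apparatus, and device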

Also Published As

Publication number Publication date
CN109948073B (zh) 2023-05-23
CN109948073A (zh) 2019-06-28

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18859492

Country of ref document: EP

Kind code of ref document: A1