CN111274358A - Text processing method and device, electronic equipment and storage medium - Google Patents

Text processing method and device, electronic equipment and storage medium

Info

Publication number
CN111274358A
Authority
CN
China
Prior art keywords
text
word
nodes
keywords
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010066891.6A
Other languages
Chinese (zh)
Inventor
陈诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010066891.6A priority Critical patent/CN111274358A/en
Publication of CN111274358A publication Critical patent/CN111274358A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text processing method, a text processing device, electronic equipment and a storage medium; the method comprises the following steps: performing word segmentation on a text to be processed, and forming words obtained by word segmentation into word sequences; carrying out dependency syntax processing on the word sequence to obtain word dependency relationship among words in the word sequence; mapping the words in the word sequence into nodes, and mapping the word dependency relationship into edges between corresponding nodes to obtain a candidate keyword graph formed by connecting the nodes and the edges; propagating node weights of nodes in the candidate keyword graph according to edges in the candidate keyword graph; and determining the nodes meeting the weight conditions in the propagated candidate keyword graph as target nodes, and determining words corresponding to the target nodes as the keywords of the text to be processed. By the method and the device, the accuracy of the determined keywords can be improved, namely the processing precision of natural language processing is improved.

Description

Text processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to data processing technologies, and in particular, to a text processing method and apparatus, an electronic device, and a storage medium.
Background
Artificial Intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. Natural Language Processing (NLP) is an important direction in artificial intelligence; it mainly studies theories and methods that enable efficient communication between a person and a computer using natural language.
Keyword determination is an important application in natural language processing, and the obtained keywords can be used for scenes such as text classification. In the solutions provided by the related technologies, keywords in a text are usually determined through unsupervised learning, and particularly, a co-occurrence relationship between words in the text is determined in a sliding window manner, so that more important keywords are determined. However, the syntactic structure of the text may be complex, so that the relationship between words in the text cannot be effectively reflected by a sliding window, and the accuracy of determining the keywords is low.
Disclosure of Invention
Embodiments of the present invention provide a text processing method and apparatus, an electronic device, and a storage medium, which can improve accuracy of determining a keyword and accuracy of performing relevant processing on a text according to the keyword.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a text processing method, which comprises the following steps:
performing word segmentation on a text to be processed, and forming words obtained by word segmentation into word sequences;
carrying out dependency syntax processing on the word sequence to obtain word dependency relationship among words in the word sequence;
mapping the words in the word sequence into nodes, and mapping the word dependency relationship into edges between corresponding nodes to obtain a candidate keyword graph formed by connecting the nodes and the edges;
propagating node weights of nodes in the candidate keyword graph according to edges in the candidate keyword graph;
determining the nodes meeting the weight condition in the propagated candidate keyword graph as target nodes, and
determining the words corresponding to the target nodes as the keywords of the text to be processed.
An embodiment of the present invention provides a text processing apparatus, including:
the word segmentation module is used for performing word segmentation processing on the text to be processed and forming words obtained by the word segmentation processing into word sequences;
the syntax processing module is used for carrying out dependency syntax processing on the word sequence to obtain word dependency relationship among words in the word sequence;
the mapping module is used for mapping the words in the word sequence into nodes and mapping the word dependency relationship into edges between corresponding nodes so as to obtain a candidate keyword graph formed by connecting the nodes and the edges;
a propagation module, configured to propagate node weights of nodes in the candidate keyword graph according to edges in the candidate keyword graph;
a keyword determining module, configured to determine the nodes satisfying a weight condition in the propagated candidate keyword graph as target nodes, and
to determine the words corresponding to the target nodes as the keywords of the text to be processed.
An embodiment of the present invention provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the text processing method provided by the embodiment of the invention when the executable instructions stored in the memory are executed.
The embodiment of the invention provides a storage medium, which stores executable instructions and is used for causing a processor to execute so as to realize the text processing method provided by the embodiment of the invention.
The embodiment of the invention has the following beneficial effects:
according to the embodiment of the invention, the word dependency relationship among the words in the text to be processed is obtained through dependency syntax processing, the candidate keyword graph is constructed according to the word dependency relationship, and after the node weight in the candidate keyword graph is propagated and completed, the keyword in the text to be processed is determined according to the node weight, so that the accuracy of the determined keyword is improved, and when the device uses the obtained keyword in various scenes to perform relevant processing on the text, the processing accuracy can be remarkably improved.
Drawings
FIG. 1 is an alternative architectural diagram of a text processing system provided by an embodiment of the present invention;
FIG. 2 is an alternative architecture diagram of an electronic device provided by an embodiment of the invention;
FIG. 3 is a block diagram of an alternative architecture of a text processing apparatus according to an embodiment of the present invention;
FIG. 4A is a schematic flow chart of an alternative text processing method according to an embodiment of the present invention;
FIG. 4B is a schematic flow chart of an alternative text processing method according to an embodiment of the present invention;
FIG. 4C is a schematic flow chart of an alternative text processing method according to an embodiment of the present invention;
FIG. 4D is a schematic flow chart of an alternative text processing method according to an embodiment of the present invention;
FIG. 4E is a schematic flow chart of an alternative text processing method according to an embodiment of the present invention;
FIG. 5 is an alternative schematic diagram of determining keywords in a text according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the description that follows, references to the terms "first", "second", and the like, are intended only to distinguish similar objects and not to indicate a particular ordering for the objects, it being understood that "first", "second", and the like may be interchanged under certain circumstances or sequences of events to enable embodiments of the invention described herein to be practiced in other than the order illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before the embodiments of the present invention are described in further detail, the terms and expressions used in the embodiments are explained; they are subject to the following explanations.
1) Word segmentation processing: a word is the smallest meaningful language component that can be used independently within a language unit, where the language unit is usually a sentence. Word segmentation is a core technology of natural language processing that converts a sentence into a sequence of words. For example, performing word segmentation on the sentence "Zhang San explicitly states his position" yields the word sequence "Zhang San / explicit / state position".
2) Part-of-speech tagging: the part of speech of a word is labeled according to its meaning and context. When performing a part-of-speech tagging task, a word sequence is usually first obtained by performing word segmentation on a sentence, and part-of-speech tagging is then performed based on that word sequence. For example, after part-of-speech tagging is performed on the sentence "Zhang San explicitly states his position", the part of speech of "Zhang San" is a person name, the part of speech of "explicit" is an adjective, and the part of speech of "state position" is a verb.
3) Dependency syntax processing: the dependency relationships between language components in a language unit are analyzed to reveal its syntactic structure. The word dependency relationships obtained by dependency syntax processing include the subject-predicate relationship, the verb-object relationship, the indirect-object relationship and the like. For example, in the sentence "Zhang San explicitly states his position", there is a subject-predicate relationship between "Zhang San" and "state position".
4) Candidate keyword graph: a graph formed by connecting nodes and edges, where the nodes correspond to words and the edges correspond to the word dependency relationships between the words.
5) Node weight: represents the degree of importance of the corresponding word in the text; the higher the node weight, the more important the corresponding word.
6) Web page ranking (PageRank): an algorithm for ranking web pages in search engine results. It measures the importance of a particular web page relative to the other web pages in the results; specifically, the importance of a web page is determined by the link relationships between web pages.
7) Text ranking (TextRank): an algorithm obtained by introducing the idea of the PageRank algorithm into the field of natural language processing. Specifically, it determines the co-occurrence relationships between words and performs tasks such as keyword extraction according to those co-occurrence relationships.
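For reference, the link-based importance described in item 6) is commonly written as the following recursion. The symbols are not taken from this document and are assumptions of the standard PageRank formulation: N is the number of pages, d is a damping factor (often set to 0.85), M(p_i) is the set of pages linking to p_i, and L(p_j) is the number of outgoing links of p_j.

```latex
PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
```

TextRank applies the same kind of recursion to a graph whose nodes are words rather than web pages.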
In the related art, the task of determining keywords in a text is generally implemented by the TextRank algorithm. Specifically, a fixed window parameter is set and a sliding window operation is performed over the text according to that parameter; if two words fall within the same window, they are determined to have a co-occurrence relationship, and the keywords in the text are extracted according to these co-occurrence relationships. However, this approach is not suitable for long sentences with complex syntactic structures. Take a fixed window parameter of 5 as an example: two words in a text may be unrelated to each other, yet the TextRank algorithm establishes a co-occurrence relationship between them merely because the distance between them is less than 5, so the established co-occurrence relationship is inaccurate. Conversely, two words may be grammatically closely related, yet the TextRank algorithm fails to establish a co-occurrence relationship between them merely because the distance between them is greater than 5. In summary, determining keywords according to the scheme provided by the related art yields low accuracy and applies poorly to texts with complex syntactic structures.
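The sliding-window behaviour criticized above can be illustrated with a minimal sketch. The function name `cooccurrence_edges` and the placeholder words are hypothetical and chosen only for illustration; the point is that the relation depends purely on distance, not on syntax.

```python
from itertools import combinations


def cooccurrence_edges(words, window=5):
    """Return the unordered word pairs that share at least one sliding window."""
    edges = set()
    # Slide a fixed-size window over the word sequence, one position at a time.
    for start in range(max(1, len(words) - window + 1)):
        span = words[start:start + window]
        for a, b in combinations(span, 2):
            if a != b:
                edges.add(frozenset((a, b)))
    return edges


# Placeholder words; a window of 3 keeps the example small.
words = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]
edges = cooccurrence_edges(words, window=3)
```

Here "w1" and "w3" are linked because they fit in one window, while "w1" and "w4" are never linked, regardless of how the two words relate grammatically.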
Embodiments of the present invention provide a text processing method and apparatus, an electronic device, and a storage medium, which can improve accuracy of determining a keyword and accuracy of performing relevant processing on a text according to the keyword.
An exemplary application of the electronic device provided by the embodiment of the present invention is described below. The electronic device may be a server, for example a server deployed in a cloud, which, according to a text to be processed submitted by a user, provides the user with a remote keyword determination function and further functions based on the obtained keywords, such as a similar text recommendation function or a query function. The electronic device may also be a terminal device, such as a similar text retrieval device, which judges whether two texts are similar by comparing their keywords; it may even be a handheld terminal or the like. By running the text processing scheme provided by the embodiment of the present invention, the electronic device can improve the accuracy of text processing, that is, improve its own performance, and the scheme is suitable for various text processing application scenarios.
Referring to fig. 1, fig. 1 is an alternative architecture diagram of a text processing system 100 according to an embodiment of the present invention, in order to support a text processing application, a terminal device 400 (an exemplary terminal device 400-1 and a terminal device 400-2 are shown) is connected to a server 200 through a network 300, and the server 200 is connected to a database 500, where the network 300 may be a wide area network or a local area network, or a combination of both. For ease of understanding, the architecture shown in FIG. 1 is illustrated in a context of a similar text recommendation.
In some embodiments, after acquiring the text to be processed input or selected by the user, the terminal device 400 may locally execute the text processing method provided by the embodiment of the present invention to obtain the keywords in the text to be processed. Meanwhile, the terminal device 400 locally determines keywords of at least two candidate texts, determines similarity between the text to be processed and the candidate texts according to the text to be processed and the keywords of the candidate texts, determines the candidate texts with the similarity satisfying the similarity condition as similar texts, and executes recommendation operation on the similar texts. It should be noted that the terminal device 400 may obtain the text from the local storage, or may send a request to the server 200 through the network 300, so as to obtain the text from the database 500, where the text refers to the text to be processed or the candidate text.
In some embodiments, the server 200 may also execute the text processing method provided in the embodiments of the present invention, specifically, obtain the text to be processed from the terminal device 400, and determine the keywords in the text to be processed. Meanwhile, the server 200 obtains at least two candidate texts from the database 500 and determines a keyword of each candidate text. The server 200 screens out similar texts of the text to be processed from the at least two candidate texts based on the text to be processed and the keywords of the candidate texts, and performs a recommendation operation on the similar texts, such as sending the similar texts to the terminal device 400.
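The screening of similar texts by keywords can be sketched as follows. The document does not state which similarity measure or similarity condition is used, so the Jaccard index and the threshold `min_similarity` below are assumptions made purely for illustration, as are the function names and sample keywords.

```python
def jaccard(keywords_a, keywords_b):
    """Jaccard index of two keyword collections (assumed similarity measure)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_texts(query_keywords, candidates, min_similarity=0.5):
    """Return the names of candidate texts whose similarity meets the condition."""
    return [
        name
        for name, keywords in candidates.items()
        if jaccard(query_keywords, keywords) >= min_similarity
    ]


candidates = {
    "text 1": ["graph", "node", "edge"],
    "text 2": ["music", "player"],
}
result = similar_texts(["graph", "keyword", "node"], candidates)
```

Any other set or vector similarity could be substituted without changing the surrounding flow.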
The terminal device 400 may display various results in the text processing process in a graphical interface 410 (the graphical interface 410-1 and the graphical interface 410-2 are exemplarily shown), such as keywords of the text to be processed, similar texts screened out, and the like, and in fig. 1, only the similar texts are taken as an example, and a similar text 1 and a similar text 2 are shown.
The following continues to illustrate exemplary applications of the electronic device provided by embodiments of the present invention. The electronic device may be implemented as various types of terminal devices such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device), and the like, and may also be implemented as a server.
Referring to fig. 2, fig. 2 is a schematic diagram of an architecture of an electronic device 600 (for example, the electronic device 600 may be the server 200 or the terminal device 400 shown in fig. 1) provided in an embodiment of the present invention, where the electronic device 600 shown in fig. 2 includes: at least one processor 610, memory 650, at least one network interface 620, and a user interface 630. The various components in electronic device 600 are coupled together by a bus system 640. It is understood that bus system 640 is used to enable communications among the components. Bus system 640 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 640 in fig. 2.
The Processor 610 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 630 includes one or more output devices 631 including one or more speakers and/or one or more visual displays that enable the presentation of media content. The user interface 630 also includes one or more input devices 632, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 650 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 650 optionally includes one or more storage devices physically located remote from processor 610.
The memory 650 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The depicted memory 650 of embodiments of the invention is intended to comprise any suitable type of memory.
In some embodiments, memory 650 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 651 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and for handling hardware-based tasks;
a network communication module 652 for reaching other computing devices via one or more (wired or wireless) network interfaces 620, exemplary network interfaces 620 including: Bluetooth, Wireless Fidelity (WiFi), and Universal Serial Bus (USB), etc.;
a presentation module 653 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 631 (e.g., display screens, speakers, etc.) associated with the user interface 630;
an input processing module 654 for detecting one or more user inputs or interactions from one of the one or more input devices 632 and translating the detected inputs or interactions.
In some embodiments, the text processing apparatus provided by the embodiments of the present invention may be implemented in software. Fig. 2 shows the text processing apparatus 655 stored in the memory 650, which may be software in the form of programs, plug-ins and the like, and which includes the following software modules: a word segmentation module 6551, a syntax processing module 6552, a mapping module 6553, a propagation module 6554, and a keyword determination module 6555. These modules are logical, and may therefore be arbitrarily combined or further split according to the functions implemented. The functions of the respective modules are explained below.
In other embodiments, the text processing apparatus provided in the embodiments of the present invention may be implemented in hardware, and by way of example, the text processing apparatus provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the text processing method provided in the embodiments of the present invention, for example, the processor in the form of the hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The electronic device executing the text processing method may be various types of devices, for example, the text processing method provided by the embodiment of the present invention may be executed by the server, or may be executed by a terminal device (for example, terminal device 400-1 or terminal device 400-2 shown in fig. 1), or may be executed by both the server and the terminal device.
In the following, a process of implementing the text processing method by the embedded text processing apparatus in the electronic device will be described with reference to the exemplary application and structure of the electronic device described above.
Referring to fig. 3 and fig. 4A, fig. 3 is an alternative architecture schematic diagram of the text processing apparatus 655 according to the embodiment of the present invention, which shows a flow of determining a keyword through a series of modules, and fig. 4A is a flow schematic diagram of a text processing method according to the embodiment of the present invention, and the steps shown in fig. 4A will be described with reference to fig. 3.
In step 101, a text to be processed is subjected to word segmentation processing, and words obtained by the word segmentation processing are formed into a word sequence.
As an example, referring to fig. 3, in the word segmentation module 6551, a text to be processed is obtained, where the text to be processed may be input by a user or obtained from a database or local storage. Word segmentation is performed on the text to be processed, and the obtained words are formed into an ordered word sequence. Taking the text to be processed "Zhang San explicitly states his position" as an example, the obtained words are "Zhang San", "explicit" and "state position", and the resulting word sequence is "Zhang San / explicit / state position". The embodiment of the present invention does not limit the manner of word segmentation processing; for example, word segmentation may be performed by the Language Technology Platform (LTP) tool, the jieba tool, or the Han Language Processing (HanLP) tool.
In some embodiments, performing word segmentation on the text to be processed and forming the resulting words into word sequences may be implemented as follows: performing sentence segmentation on the text to be processed to obtain at least one sentence; and performing word segmentation on each sentence obtained, and forming the words obtained from each sentence into the word sequence corresponding to that sentence.
As an example, referring to fig. 3, in the word segmentation module 6551, since word segmentation is usually performed in units of sentences, the obtained text to be processed is first subjected to sentence segmentation. For example, the text to be processed is split at set punctuation marks, such as commas and periods, to obtain at least one sentence. Word segmentation is then performed on each sentence obtained, yielding a word sequence corresponding to each sentence. In this way, word segmentation is performed in an orderly manner, and the method is applicable to a text to be processed that includes a plurality of sentences.
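A minimal sketch of this two-stage segmentation is given below. The delimiter set and the function names are assumptions; the word segmenter is passed in as a callable so that a tool such as LTP, jieba or HanLP could be plugged in, and plain whitespace splitting stands in for it here.

```python
import re

# Assumed set of sentence-ending punctuation marks (full-width and ASCII).
SENTENCE_DELIMITERS = r"[，。！？；,.!?;]"


def split_sentences(text):
    """Split text at the set punctuation marks and drop empty pieces."""
    parts = re.split(SENTENCE_DELIMITERS, text)
    return [part.strip() for part in parts if part.strip()]


def to_word_sequences(text, segment):
    """Apply the given word segmenter to each sentence, one sequence per sentence."""
    return [segment(sentence) for sentence in split_sentences(text)]


# Whitespace splitting is only a placeholder for a real word segmentation tool.
sequences = to_word_sequences("the cat sat, the dog barked.", str.split)
```

Splitting at every period is a simplification; a production splitter would need to handle decimals and abbreviations.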
In some embodiments, after step 101, further comprising: and performing part-of-speech tagging processing according to the word sequence to obtain the part of speech of each word in the word sequence.
For example, referring to fig. 3, in the word segmentation module 6551, part-of-speech tagging is performed on the word sequence; this can also be implemented by the word segmentation tools mentioned above, which is not described here again. The word sequence is then updated according to the tagging result so that it includes the part of speech of each word, for example "Zhang San (person name) / explicit (adjective) / state position (verb)". Determining the part of speech of each word in the word sequence facilitates subsequent screening of the words.
In step 102, dependency syntax processing is performed on the word sequence to obtain word dependency relationships between words in the word sequence.
As an example, referring to fig. 3, in the syntax processing module 6552, the word sequence obtained in step 101 is subjected to dependency syntax processing to obtain the word dependency relationships between the words in the word sequence, where the word sequence may be one that has undergone part-of-speech tagging. Dependency syntax processing can also be implemented by the word segmentation tools mentioned above, which is not described here again. It should be noted that word dependency relationships are directional and do not necessarily exist between every two words in the word sequence. For example, there is a subject-predicate relationship from "Zhang San" to "state position", and a dependency relationship from "explicit" to "state position".
In step 103, words in the word sequence are mapped to nodes, and word dependencies are mapped to edges between corresponding nodes, so as to obtain a candidate keyword graph formed by connecting nodes and edges.
Here, for each word sequence, each word in the word sequence is mapped as a node, and the word dependency relationship is mapped as an edge between the corresponding nodes. And after mapping all word sequences and all word dependency relations, obtaining a candidate keyword graph formed by connecting nodes and edges.
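The mapping in step 103 can be sketched as follows. The data layout is an assumption: each dependency is a hypothetical `(source, target, relation)` triple, the graph is stored as a node set plus an adjacency dictionary, and the relation label "other" is a placeholder for a relation type the text does not name.

```python
from collections import defaultdict


def build_candidate_graph(word_sequences, dependencies):
    """Map words to nodes and directed word dependencies to edges."""
    nodes = set()
    for sequence in word_sequences:
        nodes.update(sequence)

    edges = defaultdict(set)
    for source, target, _relation in dependencies:
        # Only connect words that were actually mapped to nodes.
        if source in nodes and target in nodes:
            edges[source].add(target)
    return nodes, dict(edges)


sequences = [["Zhang San", "explicit", "state position"]]
dependencies = [
    ("Zhang San", "state position", "subject-predicate"),
    ("explicit", "state position", "other"),  # placeholder relation label
]
nodes, edges = build_candidate_graph(sequences, dependencies)
```

Because identical words from different sentences map to the same node, the per-sentence dependency trees merge into a single graph for the whole text.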
In some embodiments, the above mapping of words in the word sequence to nodes may be implemented as follows: mapping the words in the word sequence whose part of speech satisfies a part-of-speech condition to corresponding nodes.
For example, referring to fig. 3, in the mapping module 6553, on the basis of the part-of-speech tagging performed on the word sequence, the words in the word sequence are screened according to the part-of-speech condition; that is, only the words whose part of speech satisfies the part-of-speech condition are mapped as nodes. The part-of-speech condition may be set according to the actual application scenario; for example, it may require a part of speech other than conjunctions, auxiliary words, adverbs, prepositions, stop words, numerals, locative words and pronouns. It is worth noting that, when mapping edges, a word dependency relationship between words that have been mapped as nodes is mapped as an edge between the corresponding nodes, whereas a word dependency relationship involving a word that has not been mapped as a node (that is, a word that does not satisfy the part-of-speech condition) is disregarded. This part-of-speech filtering effectively removes words that obviously cannot be keywords, which improves the accuracy of keyword determination.
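The part-of-speech screening can be sketched as follows. The tag names in `EXCLUDED_POS` are assumptions written out in English for readability; a real tagger such as LTP or jieba uses its own tag set, which would have to be mapped accordingly.

```python
# Assumed tag names; real taggers use their own abbreviations.
EXCLUDED_POS = {
    "conjunction", "auxiliary", "adverb", "preposition",
    "numeral", "locative", "pronoun",
}


def filter_by_pos(tagged_words, dependencies, excluded=EXCLUDED_POS,
                  stop_words=frozenset()):
    """Keep words whose part of speech passes the condition, plus edges among them."""
    kept = {
        word
        for word, pos in tagged_words
        if pos not in excluded and word not in stop_words
    }
    # A dependency survives only if both of its words were kept as nodes.
    kept_dependencies = [
        (source, target)
        for source, target in dependencies
        if source in kept and target in kept
    ]
    return kept, kept_dependencies


tagged = [
    ("Zhang San", "person name"),
    ("explicit", "adjective"),
    ("state position", "verb"),
    ("and", "conjunction"),
]
deps = [("Zhang San", "state position"), ("and", "state position")]
kept, kept_deps = filter_by_pos(tagged, deps)
```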
In step 104, the node weights of the nodes in the candidate keyword graph are propagated according to the edges in the candidate keyword graph.
By way of example, referring to FIG. 3, in the propagation module 6554, a node weight for each node in the candidate keyword graph is propagated along the edges in the candidate keyword graph, the node weight representing the degree of importance of the corresponding word in the text to be processed.
In step 105, the nodes meeting the weight condition in the propagated candidate keyword graph are determined as target nodes, and the words corresponding to the target nodes are determined as the keywords of the text to be processed.
After the propagation of the node weight is completed, the node meeting the weight condition in the candidate keyword graph is determined as a target node, and the weight condition can be set according to an actual application scene, for example, the node weight is set to exceed a weight threshold. For the determined target node, the word corresponding to the target node is determined as the keyword of the text to be processed, and the obtained keyword can be used in various application scenarios of text processing, which will be described in detail later.
In some embodiments, the above determining of nodes that satisfy the weight condition in the propagated candidate keyword graph as target nodes may be implemented as follows: sorting the nodes in the propagated candidate keyword graph according to node weight to obtain a node sequence; and determining the nodes in the node sequence as target nodes one by one according to an access order until a set number of target nodes is obtained, wherein the access order is the descending order of the node weights of the nodes in the node sequence.
For example, referring to fig. 3, in the keyword determination module 6555, the nodes in the propagated candidate keyword graph are sorted according to the ascending order or the descending order of the node weight, so as to obtain a node sequence. Acquiring the set number of the keywords to be determined, and determining the nodes in the node sequence as target nodes one by one according to an access sequence until the target nodes with the set number are obtained, wherein the access sequence is a descending order of the node weight, namely, determining the nodes with higher importance degree as the target nodes as much as possible. By the method, the orderliness of the determined target node is improved.
In some embodiments, after determining the nodes in the node sequence one by one as the target node according to the access order, the method further includes: marking the determined target node as accessed; all nodes in the node sequence are marked as not-accessed during initialization;
after step 105, the method further comprises: when at least two keywords have adjacent relation in the text to be processed, merging the at least two keywords; determining the number of keywords in the text to be processed; and when the number of the key words is less than the set number, determining the nodes which are not accessed in the node sequence as target nodes according to the access sequence until the number of the obtained key words is equal to the set number.
When the node sequence is obtained by sorting, all nodes in the node sequence are marked as not accessed. After the set number of target nodes has been determined according to the access order, those target nodes are marked as accessed to prevent the same keyword from being determined repeatedly. During the word segmentation of the embodiment of the invention, words that belong together may be wrongly separated; therefore, when at least two determined keywords are adjacent in the text to be processed, they are merged to obtain a keyword in the form of a new phrase. For example, if the determined keywords include "machine" and "learning" and the two keywords are adjacent in the text to be processed, they are merged to obtain the keyword "machine learning". After merging, the number of keywords may fall below the expected set number, so nodes in the node sequence that have not been accessed are determined as target nodes according to the access order until the number of keywords obtained equals the set number. Merging improves the completeness of the meaning expressed by the keywords, and the replenishment ensures that the final number of keywords reaches the set number.
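The selection, merging, and replenishment described above can be sketched as follows. This is one possible reading under stated assumptions: the helper names `merge_adjacent` and `select_keywords` are hypothetical, the text is given as a flat token list, nodes are added one at a time in descending weight order, and merging is repeated after every addition (the embodiment does not say whether merging is repeated after replenishment).

```python
from typing import Dict, List, Set


def merge_adjacent(tokens: List[str], selected: Set[str]) -> List[str]:
    """Join runs of consecutive selected words in the text into phrases."""
    phrases: List[str] = []
    run: List[str] = []
    for token in tokens + [None]:  # the trailing None flushes the last run
        if token is not None and token in selected:
            run.append(token)
        elif run:
            phrase = " ".join(run)
            if phrase not in phrases:
                phrases.append(phrase)
            run = []
    return phrases


def select_keywords(weights: Dict[str, float], tokens: List[str], target_count: int) -> List[str]:
    """Pick nodes in descending weight order until target_count keywords remain after merging."""
    node_sequence = sorted(weights, key=weights.get, reverse=True)
    accessed = {node: False for node in node_sequence}  # all nodes start as not accessed
    selected: Set[str] = set()
    keywords: List[str] = []
    for node in node_sequence:
        if len(keywords) >= target_count:
            break
        accessed[node] = True  # prevents the same node from being chosen twice
        selected.add(node)
        keywords = merge_adjacent(tokens, selected)
    return keywords


tokens = ["machine", "learning", "improves", "keyword", "extraction", "for", "news"]
weights = {"machine": 0.30, "learning": 0.25, "extraction": 0.20, "keyword": 0.15, "news": 0.10}
keywords = select_keywords(weights, tokens, target_count=2)
print(keywords)
```

With a set number of 2, "machine" and "learning" merge into a single keyword, so the next node in the access order, "extraction", is taken to restore the count.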
In some embodiments, after step 105, further comprising: determining the number of keywords included in the sentence; and when the number of the keywords included in the sentence meets the number condition, determining the sentence as the text abstract of the text to be processed.
The keywords obtained in step 105 may be used to determine a text abstract of the text to be processed. Specifically, the number of keywords included in each sentence of the text to be processed is determined, and when the number of keywords included in a sentence satisfies a number condition, that sentence is determined as the text abstract of the text to be processed. The number condition may be set according to the actual application scenario; for example, the number of keywords may be required to exceed K, or to be one of the N largest keyword counts, where K and N are both integers greater than 0. In this way, the accuracy of the determined text abstract is improved, and the text abstract can effectively express the meaning of the text to be processed.
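A minimal sketch of this abstract selection is given below. It assumes that a sentence's keyword count is the number of distinct keywords it contains, which the text does not specify; the function names and the parameters `k` and `n` are hypothetical stand-ins for K and N.

```python
from typing import List, Optional, Set


def keyword_count(sentence_tokens: List[str], keywords: Set[str]) -> int:
    """Number of distinct keywords that occur in one sentence (an assumed counting rule)."""
    return len(set(sentence_tokens) & keywords)


def select_abstract(
    sentences: List[List[str]],
    keywords: Set[str],
    k: Optional[int] = None,
    n: int = 1,
) -> List[int]:
    """Return indices of abstract sentences.

    If k is given, keep sentences whose keyword count exceeds k;
    otherwise keep the n sentences with the largest keyword counts.
    """
    counts = [keyword_count(tokens, keywords) for tokens in sentences]
    if k is not None:
        return [i for i, count in enumerate(counts) if count > k]
    ranked = sorted(range(len(sentences)), key=lambda i: counts[i], reverse=True)
    return sorted(ranked[:n])  # restore document order


sentences = [
    ["orange", "explores", "scene"],
    ["weather", "is", "fine"],
    ["orange", "scene", "media", "meeting"],
]
keywords = {"orange", "scene", "media"}
print(select_abstract(sentences, keywords, k=1))
print(select_abstract(sentences, keywords, n=1))
```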
In some embodiments, after step 105, the method further comprises: in response to a query request including query words, determining texts whose keywords match the query words as texts to be recommended; obtaining popularity data of the texts to be recommended, and sorting the popularity data to obtain a popularity order; and performing a recommendation operation on the texts to be recommended according to the popularity order.
After the keywords of the text to be processed are determined, the keywords can be stored, and a correspondence between the text to be processed and its keywords is established. When a query request including query words is received, the texts whose keywords match the query words are determined among all stored texts, where keywords matching the query words means that the keywords include all of the query words. For ease of distinction, the texts so determined are called texts to be recommended.
Because there may be at least two texts to be recommended, the embodiment of the present invention further determines a recommendation order. Specifically, popularity data of the texts to be recommended is obtained, where the popularity data may be a weighted result of the click count, forward count, and comment count of a text. The popularity data is sorted in descending order of value to obtain a popularity order, and the recommendation operation on the texts to be recommended is performed according to the popularity order, for example, by recommending at least two texts to be recommended in the form of a list. In this way, query requests are answered effectively and applicability to query scenarios is improved, so the method can be applied to a search engine.
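The query matching and ordering can be sketched as follows. The weighting coefficients in `popularity_score` are purely hypothetical, since the text only says the value may be a weighted result of clicks, forwards, and comments; the function and variable names are likewise assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple


def popularity_score(clicks: int, forwards: int, comments: int,
                     weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted combination of the three counts; the coefficients are illustrative only."""
    w_click, w_forward, w_comment = weights
    return w_click * clicks + w_forward * forwards + w_comment * comments


def recommend(query_words: Iterable[str],
              text_keywords: Dict[str, Set[str]],
              stats: Dict[str, Tuple[int, int, int]]) -> List[str]:
    """Return ids of texts whose keywords include every query word, most popular first."""
    query = set(query_words)
    matched = [text_id for text_id, kws in text_keywords.items() if query <= kws]
    return sorted(matched, key=lambda text_id: popularity_score(*stats[text_id]), reverse=True)


text_keywords = {
    "t1": {"orange", "scene"},
    "t2": {"orange", "media"},
    "t3": {"orange", "scene", "media"},
}
stats = {"t1": (100, 10, 5), "t2": (500, 80, 40), "t3": (300, 50, 10)}
ranking = recommend(["orange", "scene"], text_keywords, stats)
print(ranking)
```

Text "t2" is excluded because its keywords do not include every query word, even though it has the highest popularity.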
As can be seen from the above exemplary implementation of fig. 4A in the embodiment of the present invention, a candidate keyword graph is constructed according to the word dependency relationship, and the keyword is determined according to the propagated node weight, so that the accuracy of determining the keyword is improved, and the accuracy of performing related text processing according to the keyword is also improved.
In some embodiments, referring to fig. 4B, fig. 4B is an optional flowchart of the text processing method according to the embodiment of the present invention, and step 103 shown in fig. 4A may be implemented by steps 201 to 202, which will be described with reference to the steps.
In step 201, words in the sequence of words are mapped to nodes.
In step 202, word dependencies are mapped to edges between corresponding nodes to obtain a candidate keyword graph formed by connecting nodes and edges.
In fig. 4B, step 202 may be implemented by any one of steps 2021 to 2023, and each step will be described.
In step 2021, word dependencies are mapped to undirected and unweighted edges between corresponding nodes to obtain a candidate keyword graph composed of nodes and edges connected.
The embodiment of the invention provides three ways of edge mapping, wherein the first way is to map the word dependency relationship into undirected and unweighted edges among corresponding nodes, and unweighted means that edge weight does not exist.
In step 2022, the word dependency relationship is mapped as undirected edges between the corresponding nodes, and edge weights of the mapped undirected edges are determined according to the frequency of occurrence of the word dependency relationship in the word sequence, wherein the edge weights are positively correlated with the frequency of occurrence.
The second way is to map the word dependency to an undirected weighted edge between the corresponding nodes, where the edge weight is determined by the frequency with which the word dependency occurs across all word sequences. The edge weight is positively correlated with the frequency of occurrence, and the correlation may be linear or non-linear. Specifically, the frequency of occurrence may be used directly as the edge weight, or, to avoid an edge weight of 0, the frequency of occurrence may be smoothed, for example by computing ln(1 + frequency of occurrence), where ln is the natural logarithm; other determination methods may of course also be applied. For example, if the subject-predicate relation from "Zhang San" to "state" occurs 5 times across all word sequences, the edge weight of the edge between the node corresponding to "Zhang San" and the node corresponding to "state" may be set to 5, or it may be set to ln 6. It should be noted that the edge weight indicates the importance of the corresponding edge: the larger the edge weight, the larger the share of node weight distributed along that edge.
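The two edge-weight options mentioned above (raw frequency, or ln(1 + frequency)) can be sketched as follows. The function name `edge_weights` and the representation of an undirected edge as a `frozenset` of two words are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple


def edge_weights(dependencies: List[Tuple[str, str]],
                 smooth: bool = True) -> Dict[FrozenSet[str], float]:
    """Weight each undirected edge by how often its dependency occurs.

    smooth=True applies ln(1 + frequency); smooth=False uses the raw frequency.
    """
    counts = Counter(frozenset(pair) for pair in dependencies if pair[0] != pair[1])
    return {
        edge: math.log(1 + count) if smooth else float(count)
        for edge, count in counts.items()
    }


# A dependency between the placeholder words "a" and "b" that occurs 5 times.
dependencies = [("a", "b")] * 5 + [("b", "c")]
smoothed = edge_weights(dependencies)
raw = edge_weights(dependencies, smooth=False)
print(raw[frozenset(("a", "b"))], smoothed[frozenset(("a", "b"))])
```

Either option keeps the weight positively correlated with the frequency; the logarithm simply compresses the range so that very frequent dependencies do not dominate.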
In step 2023, the word dependencies are mapped to directed edges in the same direction between the corresponding nodes according to the direction indicated by the word dependencies.
The third way is to map the word dependency, according to the direction it represents, to a directed edge in the same direction between the corresponding nodes; the directed edge may be a directed unweighted edge. For example, for a subject-predicate relation from "Zhang San" to "state", a directed edge is constructed from the node corresponding to "Zhang San" to the node corresponding to "state".
In fig. 4B, step 104 shown in fig. 4A can be implemented by steps 203 to 204, and will be described with reference to the respective steps.
In step 203, the node weights of the nodes in the candidate keyword graph are initialized.
After a candidate keyword graph is formed according to the nodes and the edges, initializing the node weights of all nodes in the candidate keyword graph to 1/M, wherein M is the number of the nodes included in the candidate keyword graph.
In step 204, the nodes in the candidate keyword graph are traversed iteratively, and the node weight of each traversed node is distributed to the nodes that have a connection relationship with it, so that each such node sums the node weights distributed to it to obtain an updated node weight, until an iteration stop condition is met; wherein the types of connection relationship include undirected-edge connection and outgoing-edge connection.
During propagation, at least two iterations are performed. In each iteration, the nodes in the candidate keyword graph are traversed, and the node weight of each traversed node is distributed to the nodes that have a connection relationship with it. The connection relationship depends on the type of edge in the candidate keyword graph: when the edges are undirected, the connection relationship is undirected-edge connection; when the edges are directed, the connection relationship is outgoing-edge connection, that is, the node having the connection relationship is the end point of a directed edge that starts at the traversed node.
Each node that has a connection relationship sums all the node weights distributed to it, and its node weight for the current iteration is updated according to the sum. The iteration is repeated until the iteration stop condition is met. The iteration stop condition may be that a set number of iterations is reached, or that the error value of any node is smaller than an error threshold, where the error value is the absolute value of the difference between the node weight obtained by the node in the current iteration and the node weight obtained in the previous iteration.
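The propagation described above can be sketched as a PageRank-style iteration. Several points are assumptions rather than details stated here: each node splits its weight equally among its connected nodes, a damping factor of 0.85 is borrowed from the usual PageRank formulation (without it, the weights can oscillate on tree-shaped graphs), a node with no outgoing connection spreads its weight over all nodes, and the error-based stop condition is read as requiring every node's error to fall below the threshold.

```python
from typing import Dict, List


def propagate(out_neighbors: Dict[str, List[str]],
              damping: float = 0.85,
              max_iterations: int = 100,
              error_threshold: float = 1e-6) -> Dict[str, float]:
    """Iteratively distribute node weights along edges.

    out_neighbors maps each node to the nodes it is connected to: for undirected
    edges list both directions, for directed edges list only the edge end points.
    """
    nodes = list(out_neighbors)
    m = len(nodes)
    weights = {node: 1.0 / m for node in nodes}  # initialise every node to 1/M
    for _ in range(max_iterations):
        updated = {node: (1.0 - damping) / m for node in nodes}
        for node in nodes:
            targets = out_neighbors[node]
            if targets:
                share = damping * weights[node] / len(targets)
                for target in targets:
                    updated[target] += share  # receivers sum what they are given
            else:
                # Assumption: a node without connections spreads its weight evenly.
                share = damping * weights[node] / m
                for target in nodes:
                    updated[target] += share
        error = max(abs(updated[node] - weights[node]) for node in nodes)
        weights = updated
        if error < error_threshold:
            break
    return weights


# Undirected star: "explore" is connected to "orange" and "scene".
star = {"explore": ["orange", "scene"], "orange": ["explore"], "scene": ["explore"]}
result = propagate(star)
print(result)
```

The central word receives weight from both neighbours and ends with the largest node weight, which is the behaviour the keyword selection relies on.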
As can be seen from the above exemplary implementation of fig. 4B, the embodiment of the present invention improves the flexibility of the edge mapping process, can map the word dependency relationship into an edge by applying any of the above manners according to the actual application scenario, and improves the accuracy and effectiveness of the propagation process by an iterative traversal manner.
In some embodiments, referring to fig. 4C, fig. 4C is an optional flowchart of the text processing method according to an embodiment of the present invention, and based on fig. 4A, after step 105, at least two candidate texts may also be obtained in step 301, and the keywords of the candidate texts are determined.
The keywords can be applied to similar-text recommendation scenarios. Specifically, at least two candidate texts are obtained; the embodiment of the present invention does not limit the source of the candidate texts, which may, for example, be obtained from a database, local storage, or a blockchain network. To narrow the range within which similar texts are determined, the candidate texts may be restricted to texts of the same text type as the text to be processed. For example, if the text to be processed is social news, all social news in the database other than the text to be processed is determined as candidate texts; as another example, if the text to be processed is a paper in a certain field, all papers in the database that belong to that field, other than the text to be processed, are determined as candidate texts. For each candidate text obtained, the keywords of the candidate text are determined in a manner similar to steps 101 to 105.
In step 302, an intersection between the keywords of the text to be processed and the keywords of the candidate text is determined, and a first number of keywords included in the intersection is determined.
For each candidate text, determining an intersection between the keywords of the text to be processed and the keywords of the candidate text, and naming the number of the keywords included in the intersection as a first number for easy distinction.
In step 303, a union between the keywords of the text to be processed and the keywords of the candidate text is determined, and a second number of keywords comprised by the union is determined.
For each candidate text, a union between the keywords of the text to be processed and the keywords of the candidate text is also determined, and the number of keywords included in the union is named a second number for the sake of distinction.
In step 304, the ratio between the first number and the second number is determined as the similarity between the text to be processed and the candidate text.
Here, the ratio of the first number to the second number is determined as the similarity between the text to be processed and the candidate text, and the similar text is screened out according to the similarity.
In step 305, candidate texts with similarity satisfying the similarity condition are determined as similar texts, and recommendation operation on the similar texts is performed.
Here, the similarity condition may be set according to the actual application scenario, for example, that the similarity exceeds a similarity threshold, or that the similarity is one of the R largest similarities, where R is an integer greater than 0. Among the at least two similarities, those that satisfy the similarity condition are determined, the candidate texts corresponding to them are determined as similar texts of the text to be processed, and a recommendation operation is performed on the similar texts. The embodiment of the present invention does not limit the specific manner of the recommendation operation, which may be, for example, front-end presentation, mail push, or short message push.
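Steps 302 to 305 amount to computing the ratio of intersection size to union size (the Jaccard index) over keyword sets and then filtering. A minimal sketch follows; the function names, the `threshold` and `top_r` parameters, and the convention of returning 0.0 when both keyword sets are empty are assumptions made for the example.

```python
from typing import Dict, List, Optional, Set


def keyword_similarity(first: Set[str], second: Set[str]) -> float:
    """Ratio of the intersection size (first number) to the union size (second number)."""
    union = first | second
    if not union:
        return 0.0  # assumed convention for two empty keyword sets
    return len(first & second) / len(union)


def similar_texts(target_keywords: Set[str],
                  candidates: Dict[str, Set[str]],
                  threshold: Optional[float] = None,
                  top_r: int = 1) -> List[str]:
    """Keep candidates above the threshold, or the top_r most similar if no threshold is given."""
    scores = {text_id: keyword_similarity(target_keywords, kws)
              for text_id, kws in candidates.items()}
    if threshold is not None:
        return [text_id for text_id, score in scores.items() if score > threshold]
    return sorted(scores, key=scores.get, reverse=True)[:top_r]


target = {"orange", "scene", "media"}
candidates = {"c1": {"orange", "scene", "price"}, "c2": {"weather"}}
print(keyword_similarity(target, candidates["c1"]))
print(similar_texts(target, candidates, threshold=0.3))
```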
As can be seen from the above exemplary implementation of fig. 4C, in the embodiment of the present invention, by determining the intersection and union of the keywords, the similarity between the text to be processed and the candidate text is determined, and the similar text is screened according to the similarity, so that the accuracy of recommending the similar text is improved.
In some embodiments, referring to fig. 4D, fig. 4D is an optional flowchart of the text processing method according to the embodiment of the present invention, based on fig. 4A, after step 105, in step 401, a sample text and a type of the sample text may also be obtained, and a keyword of the sample text is determined.
The keywords of the text to be processed can also be used to determine its text type. Text types differ according to the basis on which texts are divided; for example, the text types may include entertainment news and social news, or may include field-A papers and field-B papers. In the embodiment of the invention, the text type of the text to be processed is determined by a classification model; before this determination, sample texts and their labeled sample text types are obtained in order to train the classification model. The embodiment of the present invention does not limit the type of the classification model, which may be, for example, a neural network model.
In step 402, classifying the keywords of the sample text by the classification model to obtain the text type to be compared.
Feed-forward classification processing is performed on the keywords of the sample text using the weight parameters of the classification model to obtain the text type to be compared.
In step 403, according to the difference between the sample text type and the text type to be compared, back propagation is performed in the classification model, and in the process of back propagation, the weight parameters of the classification model are updated.
Here, the difference between the sample text type and the text type to be compared is determined according to the loss function of the classification model, back propagation is performed in the classification model according to the difference, and in the process of back propagation, the weight parameter of the classification model is updated along the gradient descending direction until a set update completion condition is met, wherein the update completion condition can be a set iteration number or a set accuracy threshold.
In step 404, the keywords of the text to be processed are classified by the classification model to obtain the text type.
After the training of the classification model is completed, the keywords of the text to be processed are subjected to feedforward classification processing through the weight parameters of the classification model, and the text type of the text to be processed is obtained.
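The train-then-classify flow of steps 401 to 404 can be illustrated with a deliberately small stand-in model. The sketch below uses a single-layer logistic regression over keyword presence, trained by gradient descent on the cross-entropy loss; this is a swapped-in simplification for illustration, not the neural network mentioned as an example, and the sample data, function names, and hyperparameters are all hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def train_classifier(samples: List[Set[str]], labels: List[int],
                     epochs: int = 200,
                     learning_rate: float = 0.5) -> Tuple[Dict[str, float], float]:
    """Fit one weight per keyword plus a bias; labels are 0 or 1."""
    vocabulary = sorted({keyword for keywords in samples for keyword in keywords})
    weights = {keyword: 0.0 for keyword in vocabulary}
    bias = 0.0
    for _ in range(epochs):  # stop after a set number of iterations
        for keywords, label in zip(samples, labels):
            # Forward pass: predicted probability of type 1.
            score = bias + sum(weights[keyword] for keyword in keywords)
            predicted = 1.0 / (1.0 + math.exp(-score))
            # Gradient of the cross-entropy loss with respect to the score.
            gradient = predicted - label
            # Update the parameters along the direction of gradient descent.
            bias -= learning_rate * gradient
            for keyword in keywords:
                weights[keyword] -= learning_rate * gradient
    return weights, bias


def classify(keywords: Set[str], weights: Dict[str, float], bias: float) -> int:
    """Return 1 or 0; keywords unseen during training contribute nothing."""
    score = bias + sum(weights.get(keyword, 0.0) for keyword in keywords)
    return 1 if score > 0 else 0


# Hypothetical labeled samples: 1 = entertainment news, 0 = social news.
samples = [{"film", "actor"}, {"concert", "singer"}, {"traffic", "policy"}, {"housing", "policy"}]
labels = [1, 1, 0, 0]
weights, bias = train_classifier(samples, labels)
print(classify({"film", "singer"}, weights, bias))
```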
In step 405, an index relationship is established between the text to be processed and the text type, so as to respond to the query request including the text type according to the index relationship.
An index relationship is established between the text to be processed and the text type, and when a query request including the text type is received, the text to be processed is returned in response to the query request. For example, if the classification model determines that the text type of the text to be processed is entertainment news, an index relationship between the text to be processed and entertainment news is established in the database, so that when a user's query request for entertainment news is received, the texts that have an index relationship with entertainment news are looked up in the database and the text to be processed that is found is recommended to the user.
As can be seen from the above exemplary implementation of fig. 4D, in the embodiment of the present invention, the text type of the text to be processed is determined by the classification model, so that text classification is effectively and accurately achieved, and the classification effect is improved.
In some embodiments, referring to fig. 4E, fig. 4E is an optional flowchart of the text processing method according to the embodiment of the present invention, based on fig. 4A, after step 105, in step 501, sample browsing data of the user may also be obtained; the sample browsing data comprises satisfaction, user characteristics and keywords of texts browsed by the user.
The keywords of a text can also be applied to active recommendation scenarios, in which recommendation is performed by a text recommendation model. The embodiment of the present invention does not limit the type of the text recommendation model, which may be, for example, a random forest model, a support vector machine model, or a neural network model. The first stage of active recommendation is to obtain training data for the text recommendation model, that is, sample browsing data of the user to whom texts are to be recommended. The sample browsing data includes a satisfaction degree, user features, and the keywords of the texts browsed by the user, where the satisfaction degree represents how satisfied the user is with a browsed text, and the user features represent attributes of the user, such as age, gender, and city. It is worth noting that the satisfaction degree may be divided into two cases, satisfied and dissatisfied, with the value for satisfied set to 1 and the value for dissatisfied set to 0. The satisfaction degree in the sample browsing data may be set actively by the user; if the user has not set it, a default value of 1 may be applied, that is, the user is assumed by default to be satisfied with the browsed text.
In step 502, the text recommendation model is updated based on the sample browsing data.
Prediction processing is performed on the user features and the keywords in the sample browsing data by the text recommendation model to obtain a satisfaction degree to be compared. Then, according to the loss function of the text recommendation model, the difference between the satisfaction degree to be compared and the satisfaction degree in the sample browsing data is determined, back propagation is performed in the text recommendation model according to the difference, and the weight parameters of the text recommendation model are updated during back propagation until a set update completion condition is met.
In step 503, at least two texts are obtained, and the user features and the keywords of the texts are combined into a sample to be processed.
Here, at least two texts are obtained from a database, a local storage or other storage location, and the keywords of each text are determined in the manner of steps 101 to 105. For each text, combining the user features and the keywords of the text into a sample to be processed.
In step 504, the prediction processing is performed on the sample to be processed through the updated text recommendation model, so as to obtain the prediction satisfaction.
Prediction processing is performed on the sample to be processed using the updated weight parameters of the text recommendation model to obtain the predicted satisfaction degree.
In step 505, the text with the predicted satisfaction degree meeting the satisfaction degree condition is determined as the text to be recommended, and the recommendation operation of the text to be recommended is executed.
The prediction processing yields a predicted satisfaction degree for each text. The texts are then screened by a set satisfaction condition to obtain the texts to be recommended that the user is more likely to be interested in, where the satisfaction condition may be that the predicted satisfaction degree exceeds a satisfaction threshold, or that it is one of the S largest predicted satisfaction degrees, where S is an integer greater than 0. Finally, a recommendation operation is performed on the texts to be recommended, such as front-end presentation, mail recommendation, or short message recommendation.
As can be seen from the above exemplary implementation of fig. 4E in the embodiment of the present invention, according to the user characteristics and the keywords of the text, the text to be recommended that the user is more likely to be interested in is screened out, so that the pertinence to different users and the accuracy of active recommendation are improved.
In the following, an exemplary application of the embodiments of the present invention in a practical application scenario will be described.
An embodiment of the present invention provides a schematic diagram of keyword determination as shown in fig. 5. For ease of understanding, the case where the text to be processed includes only one sentence is described; it should be understood that in an actual application scenario the text to be processed includes at least one sentence. In fig. 5, the text to be processed 51 is "Zhang San explicitly stated in the media communication during the two sessions: Orange is exploring scenes", where "Orange" is the name of a company. In determining the keywords of the text to be processed, word segmentation and part-of-speech tagging are first performed on the text to be processed 51 to obtain a word sequence 52, namely "Zhang San/person name, in/preposition, two sessions/abbreviation, period/locative word, in/auxiliary word, media/general noun, communication/verb, middle/locative word, explicit/adjective, state/verb, :/punctuation, Orange/proper noun, now/adverb, exploration/verb, scene/general noun", where the part-of-speech tagging uses the 863 part-of-speech tag set.
Then, dependency syntax processing is performed on the word sequence 52, for example by means of the LTP tool, to obtain a word dependency graph 53. The word dependency graph 53 includes the word dependencies between the words in the word sequence 52; the virtual word in the word dependency graph 53 is a virtual root (ROOT) of the word sequence 52, which is used to represent the core relation in the word sequence 52 and has no actual meaning. A candidate keyword graph G(V, E) is constructed from the word sequence 52 and the obtained word dependencies. Taking the case where the edges in the candidate keyword graph G are directed edges as an example, V is the set of nodes corresponding to the words in the word sequence 52, and E is the set of directed edges. For example, in the word dependency graph 53, the word dependency between "Zhang San" and "state" is a subject-predicate relation, which corresponds to a directed edge pointing from the node corresponding to "Zhang San" to the node corresponding to "state", and so on.
The node weight of each node in the candidate keyword graph G is propagated in the weight propagation manner of PageRank and TextRank until an iteration stop condition is met. The iteration stop condition may be that a set number of iterations is reached, or that the error value of any node is smaller than an error threshold, where the error value is the absolute value of the difference between the node weight obtained by the node in the current iteration and the node weight obtained in the previous iteration.
After the iteration of the node weights in the candidate keyword graph G is completed, nodes are selected as target nodes in descending order of node weight until the set number of target nodes is obtained. The words corresponding to the target nodes are then determined as the keywords of the text to be processed, which completes keyword extraction. The determined keywords can be used in natural language processing application scenarios such as text query/search, text classification, similar text recommendation, active text recommendation, and abstract extraction.
The following continues the description of an exemplary structure in which the text processing apparatus 655 provided by the embodiments of the present invention is implemented as software modules. In some embodiments, as shown in fig. 2, the software modules of the text processing apparatus 655 stored in the memory 650 may include: a word segmentation module 6551, configured to perform word segmentation on the text to be processed and form the words obtained by word segmentation into a word sequence; a syntax processing module 6552, configured to perform dependency syntax processing on the word sequence to obtain the word dependencies between the words in the word sequence; a mapping module 6553, configured to map the words in the word sequence to nodes and map the word dependencies to edges between the corresponding nodes, so as to obtain a candidate keyword graph formed by connecting the nodes and edges; a propagation module 6554, configured to propagate the node weights of the nodes in the candidate keyword graph according to the edges in the candidate keyword graph; and a keyword determining module 6555, configured to determine the nodes that satisfy a weight condition in the propagated candidate keyword graph as target nodes, and determine the words corresponding to the target nodes as the keywords of the text to be processed.
In some embodiments, the word segmentation module 6551 is further configured to: performing sentence segmentation processing on the text to be processed to obtain at least one sentence; performing word segmentation processing on each sentence obtained by sentence segmentation processing, and forming a plurality of words obtained by word segmentation processing into a word sequence corresponding to the sentence;
the text processing apparatus 655 further includes: a first abstract determining module, configured to determine the number of keywords included in the sentence; and a second abstract determining module, configured to determine the sentence as the text abstract of the text to be processed when the number of keywords included in the sentence satisfies a number condition.
In some embodiments, the propagation module 6554 is further configured to: initialize the node weights of the nodes in the candidate keyword graph; and iteratively traverse the nodes in the candidate keyword graph and distribute the node weight of each traversed node to the nodes that have a connection relationship with it, so that each such node sums the node weights distributed to it to obtain an updated node weight, until an iteration stop condition is met; wherein the types of connection relationship include undirected-edge connection and outgoing-edge connection.
In some embodiments, the text processing means 655 further comprises: the labeling module is used for performing part-of-speech labeling processing according to the word sequence to obtain the part-of-speech of each word in the word sequence;
a mapping module 6553, further configured to: and mapping the words of which the word property meets the word property condition in the word sequence into corresponding nodes.
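A minimal sketch of the part-of-speech filter follows. The tag names and the allowed set (nouns, verbs, adjectives) are hypothetical; the embodiments only require that the part of speech meet a "part-of-speech condition", and the tagger itself is outside this sketch.

```python
# Hypothetical part-of-speech condition: keep content words only.
ALLOWED_POS = {"NOUN", "VERB", "ADJ"}

def candidate_nodes(tagged_words, allowed_pos=ALLOWED_POS):
    """Map words whose part of speech is allowed to graph nodes (one per distinct word).

    `tagged_words` is a list of (word, part_of_speech) pairs from any tagger.
    """
    seen = set()
    nodes = []
    for word, pos in tagged_words:
        if pos in allowed_pos and word not in seen:
            seen.add(word)
            nodes.append(word)
    return nodes
```

Filtering before graph construction keeps function words such as articles from receiving or passing on node weight.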
In some embodiments, the mapping module 6553 is further configured to: any one of the following processes is performed: mapping the word dependency relationship to undirected unweighted edges between corresponding nodes; mapping the word dependency relationship to undirected edges between corresponding nodes, and determining the edge weight of the mapped undirected edges according to the frequency of the word dependency relationship in the word sequence; wherein the edge weight is positively correlated with the frequency of occurrence; and mapping the word dependency relationship into directed edges in the same direction between corresponding nodes according to the direction represented by the word dependency relationship.
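The three edge mapping modes can be sketched as one function with a `mode` switch. The mode names, the adjacency-dict output format, the use of the raw occurrence count as the edge weight (any positively correlated function would do), and the convention that each dependency pair is given as `(source, target)` in the direction of the dependency are assumptions of this sketch.

```python
from collections import Counter

def build_edges(dependencies, mode="undirected"):
    """Map word dependency pairs to edges of a candidate keyword graph.

    `dependencies` is a list of (source, target) word pairs.
    mode "undirected": undirected, unweighted edges.
    mode "weighted":   undirected edges weighted by occurrence count.
    mode "directed":   directed edges following the dependency direction.
    Returns a dict: node -> {connected node: edge weight}.
    """
    if mode not in ("undirected", "weighted", "directed"):
        raise ValueError(f"unknown mode: {mode}")
    edges = {}
    if mode == "directed":
        for source, target in dependencies:
            edges.setdefault(source, {})[target] = 1.0
            edges.setdefault(target, {})  # keep nodes without outgoing edges
        return edges
    # Undirected modes: count each unordered pair once per occurrence.
    counts = Counter(tuple(sorted(pair)) for pair in dependencies)
    for (a, b), count in counts.items():
        weight = float(count) if mode == "weighted" else 1.0
        edges.setdefault(a, {})[b] = weight
        edges.setdefault(b, {})[a] = weight
    return edges
```

Returning the same adjacency format for all three modes means the propagation step does not need to know which mapping was chosen.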
In some embodiments, the keyword determination module 6555 is further configured to: ordering the nodes in the propagated candidate keyword graph according to the node weight to obtain a node sequence; determining the nodes in the node sequence as target nodes one by one according to the access sequence until a set number of target nodes are obtained; wherein the access sequence is a descending order of node weights of the nodes in the node sequence.
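Selecting target nodes in descending order of weight can be sketched in a few lines. The function names and the alphabetical tie-break are assumptions added so that the result is deterministic.

```python
def rank_nodes(weights):
    """Return the node sequence: nodes sorted by descending node weight.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    return sorted(weights, key=lambda node: (-weights[node], node))

def select_target_nodes(weights, set_number):
    """Take nodes one by one in access order until `set_number` are obtained."""
    return rank_nodes(weights)[:set_number]
```

For example, with weights `{"graph": 1.9, "keyword": 1.2, "the": 0.3}` and a set number of 2, the target nodes are `graph` and `keyword`.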
In some embodiments, the text processing means 655 further comprises: the marking module is used for marking the determined target node as accessed; wherein, the nodes in the node sequence are all marked as not accessed during initialization;
the text processing means 655 further includes: the merging module is used for merging at least two keywords when the keywords have adjacent relation in the text to be processed; the quantity determining module is used for determining the quantity of the keywords in the text to be processed; and the continuous access module is used for determining nodes which are not accessed in the node sequence as target nodes according to the access sequence when the number of the keywords is less than the set number until the number of the obtained keywords is equal to the set number.
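The interplay of marking, merging, and continued access can be sketched as follows. This is one possible reading under stated assumptions: keywords that appear consecutively in the text are joined with a space into a single phrase, each distinct phrase counts as one keyword, and `ranked_nodes` is the node sequence already sorted by descending weight. In unusual inputs a newly accessed node can add more than one phrase, so the final count may exceed the set number; the sketch does not handle that case.

```python
from itertools import islice

def merge_adjacent(words, selected):
    """Merge runs of consecutive selected words in `words` into phrases."""
    phrases, run = [], []
    for word in list(words) + [None]:  # None is a sentinel that flushes the last run
        if word in selected:
            run.append(word)
        elif run:
            phrase = " ".join(run)
            if phrase not in phrases:
                phrases.append(phrase)
            run = []
    return phrases

def extract_keywords(words, ranked_nodes, set_number):
    """Access nodes in order, merge adjacent keywords, and keep accessing
    unvisited nodes while fewer than `set_number` keywords remain."""
    remaining = iter(ranked_nodes)
    visited = list(islice(remaining, set_number))  # nodes marked as accessed
    keywords = merge_adjacent(words, set(visited))
    while len(keywords) < set_number:
        node = next(remaining, None)
        if node is None:  # every node has been accessed
            break
        visited.append(node)
        keywords = merge_adjacent(words, set(visited))
    return keywords
```

With the text `candidate keyword graph uses dependency edges`, the node sequence `graph, keyword, dependency, ...`, and a set number of 2, the first two nodes merge into the single phrase `keyword graph`, so a third node is accessed to bring the count back to 2.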
In some embodiments, the text processing means 655 further comprises: the candidate text acquisition module is used for acquiring at least two candidate texts and determining keywords of the candidate texts; the intersection processing module is used for determining the intersection between the keywords of the text to be processed and the keywords of the candidate text and determining a first number of the keywords included in the intersection; the union set processing module is used for determining a union set between the keywords of the text to be processed and the keywords of the candidate text and determining a second number of the keywords included in the union set; a comparing module, configured to determine a ratio between the first number and the second number as a similarity between the text to be processed and the candidate text; and the first recommending module is used for determining the candidate texts with the similarity meeting the similarity condition as similar texts and performing recommending operation on the similar texts.
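The ratio of intersection size to union size described above is the Jaccard similarity of the two keyword sets. The sketch below computes it and filters candidates; the function names and the `threshold` standing in for the "similarity condition" are assumptions.

```python
def keyword_similarity(keywords_a, keywords_b):
    """Ratio of shared keywords (first number) to all keywords (second number)."""
    a, b = set(keywords_a), set(keywords_b)
    union = a | b
    if not union:
        return 0.0  # two empty keyword sets are treated as dissimilar
    return len(a & b) / len(union)

def similar_texts(target_keywords, candidates, threshold=0.5):
    """Return the ids of candidate texts whose similarity meets the threshold.

    `candidates` maps a text id to that text's keywords; `threshold` is a
    hypothetical similarity condition.
    """
    return [
        text_id
        for text_id, keywords in candidates.items()
        if keyword_similarity(target_keywords, keywords) >= threshold
    ]
```

For instance, the keyword sets {graph, keyword, parser} and {graph, keyword, index} share 2 of 4 distinct keywords, giving a similarity of 0.5.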
In some embodiments, the text processing means 655 further comprises: the classification module is used for classifying the keywords of the text to be processed through a classification model to obtain a text type; and the relation establishing module is used for establishing an index relation between the text to be processed and the text type so as to respond to a query request comprising the text type according to the index relation.
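The index relationship can be pictured as a mapping from text type to text ids. In the sketch below the classification model is replaced by `toy_classifier`, a hypothetical rule-based stand-in; any callable that maps keywords to a text type could be passed instead.

```python
from collections import defaultdict

def toy_classifier(keywords):
    """Hypothetical stand-in for the classification model."""
    return "finance" if {"stock", "market"} & set(keywords) else "other"

def build_type_index(texts, classify):
    """Build the index relationship: text type -> ids of texts of that type.

    `texts` maps a text id to the keywords extracted from that text.
    """
    index = defaultdict(list)
    for text_id, keywords in texts.items():
        index[classify(keywords)].append(text_id)
    return dict(index)

def query_by_type(index, text_type):
    """Respond to a query request that names a text type."""
    return index.get(text_type, [])
```

A query for a type then becomes a single dictionary lookup instead of a re-classification of every stored text.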
In some embodiments, the text processing apparatus 655 further comprises: a sample acquisition module, configured to acquire a sample text and a sample text type and determine keywords of the sample text; a comparison classification module, configured to classify the keywords of the sample text through the classification model to obtain a text type to be compared; and a back propagation module, configured to perform back propagation in the classification model according to the difference between the sample text type and the text type to be compared, and update the weight parameters of the classification model during back propagation.
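The embodiments do not specify the architecture of the classification model. As a minimal sketch, the code below trains a single-layer softmax classifier over bag-of-keywords features by gradient descent, which is the simplest case of back propagation: the difference between the predicted and the sample text type is propagated back to the weights. The model choice, learning rate, and epoch count are all assumptions.

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_classifier(samples, vocab, types, lr=0.5, epochs=200):
    """Train on (keywords, text_type) samples; returns (weights, word index)."""
    index = {word: j for j, word in enumerate(vocab)}
    # weights[t][j]: contribution of vocabulary word j to the score of type t
    weights = [[0.0] * len(vocab) for _ in types]
    for _ in range(epochs):
        for keywords, text_type in samples:
            active = [index[k] for k in set(keywords) if k in index]
            probs = softmax([sum(row[j] for j in active) for row in weights])
            for t, row in enumerate(weights):
                # Cross-entropy gradient w.r.t. score t: predicted minus target.
                error = probs[t] - (1.0 if types[t] == text_type else 0.0)
                for j in active:
                    row[j] -= lr * error  # weight parameter update
    return weights, index

def classify(keywords, weights, index, types):
    """Return the text type with the highest score for the given keywords."""
    active = [index[k] for k in set(keywords) if k in index]
    scores = [sum(row[j] for j in active) for row in weights]
    return types[scores.index(max(scores))]
```

A deeper network would apply the same error signal layer by layer; the single layer is used here only to keep the update rule visible.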
In some embodiments, the text processing means 655 further comprises: the browsing data acquisition module is used for acquiring sample browsing data of a user; the sample browsing data comprises satisfaction, user characteristics and keywords of texts browsed by the user; the model updating module is used for updating a text recommendation model according to the sample browsing data; the combination module is used for acquiring at least two texts and combining the user characteristics and the keywords of the texts into a sample to be processed; the prediction module is used for performing prediction processing on the sample to be processed through the updated text recommendation model to obtain prediction satisfaction; and the second recommending module is used for determining the text of which the predicted satisfaction meets the satisfaction condition as a text to be recommended and executing the recommending operation of the text to be recommended.
In some embodiments, the text processing apparatus 655 further comprises: a response module, configured to, in response to a query request comprising query terms, determine a text whose keywords match the query terms as a text to be recommended; a heat ranking module, configured to acquire heat data of the texts to be recommended and rank the heat data to obtain a heat sequence; and a third recommending module, configured to perform a recommending operation on the texts to be recommended according to the heat sequence.
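A minimal sketch of query matching followed by heat ranking is given below. It assumes that a text "matches" when it shares at least one keyword with the query terms, that heat is a single number per text (for example a view count), and that hotter texts are recommended first; none of these details is fixed by the embodiments.

```python
def recommend_for_query(query_terms, text_keywords, heat):
    """Return ids of matching texts, ordered from highest to lowest heat.

    `text_keywords` maps a text id to its keywords.
    `heat` maps a text id to a numeric heat value; missing ids count as 0.
    """
    query = set(query_terms)
    matched = [
        text_id
        for text_id, keywords in text_keywords.items()
        if query & set(keywords)  # at least one shared keyword (assumption)
    ]
    return sorted(matched, key=lambda text_id: heat.get(text_id, 0), reverse=True)
```

Because Python's sort is stable, texts with equal heat keep the order in which they were stored.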
Embodiments of the present invention provide a storage medium storing executable instructions, which when executed by a processor, will cause the processor to perform a text processing method provided by embodiments of the present invention, for example, a text processing method as shown in fig. 4A, 4B, 4C, 4D, or 4E.
In some embodiments, the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, e.g., in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the following technical effects can be achieved by the embodiments of the present invention:
1) According to the embodiment of the invention, the candidate keyword graph is constructed according to the word dependency relationship, and the keywords are determined according to the propagated node weights, so that the accuracy of the determined keywords is improved, and the method is also suitable for sentences with more complex syntactic structures.
2) The embodiment of the invention provides three edge mapping modes, which improves the flexibility of the edge mapping process, and the iterative traversal improves the accuracy and effectiveness of the propagation process.
3) The determined keywords can be applied to a query scene, specifically used for determining texts to be recommended, and for the obtained texts to be recommended, ordered recommendation can be performed through the heat data, so that effective response to a query request is realized.
4) The determined keywords can be applied to abstract extraction, so that the accuracy of the determined text abstract is improved, and the text abstract can effectively express the meaning of the text.
5) The keywords can be applied to similar-text recommendation scenarios: the intersection-over-union ratio between the keywords of two texts is used as the similarity between the two texts, and similar texts are screened out according to the similarity, so that the accuracy of similar-text recommendation is improved.
6) The text type of the text can be determined through the keywords, so that text classification is facilitated, and response to a query request based on the text type is facilitated.
7) Whether the user is interested in the text can be predicted according to the keywords and the user characteristics of the text, so that whether the text is recommended to the user is judged, and the pertinence and the accuracy of active recommendation to the user are improved.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (15)

1. A method of text processing, comprising:
performing word segmentation on a text to be processed, and forming words obtained by word segmentation into word sequences;
carrying out dependency syntax processing on the word sequence to obtain word dependency relationship among words in the word sequence;
mapping the words in the word sequence into nodes, and mapping the word dependency relationship into edges between corresponding nodes to obtain a candidate keyword graph formed by connecting the nodes and the edges;
propagating node weights of nodes in the candidate keyword graph according to edges in the candidate keyword graph;
determining the nodes meeting the weight condition in the propagated candidate keyword graph as target nodes, and
determining the words corresponding to the target nodes as the keywords of the text to be processed.
2. The text processing method according to claim 1,
wherein performing word segmentation on the text to be processed and forming the words obtained by word segmentation into a word sequence comprises:
performing sentence segmentation processing on the text to be processed to obtain at least one sentence;
performing word segmentation processing on each sentence obtained by sentence segmentation processing, and forming a plurality of words obtained by word segmentation processing into a word sequence corresponding to the sentence;
after determining the word corresponding to the target node as the keyword of the text to be processed, the method further includes:
determining the number of keywords included in the sentence;
and when the number of the keywords included in the sentence meets the number condition, determining the sentence to be the text abstract of the text to be processed.
3. The text processing method of claim 1, wherein propagating node weights for nodes in the candidate keyword graph based on edges in the candidate keyword graph comprises:
initializing the node weights of the nodes in the candidate keyword graph;
iteratively traversing nodes in the candidate keyword graph, and
distributing the node weight of the traversed node to the node which has a connection relation with the traversed node so as to enable the node which has the connection relation to sum the distributed node weight to obtain an updated node weight until an iteration stop condition is met;
wherein the types of the connection relationship include: connection by an undirected edge; and connection by an outgoing edge.
4. The text processing method according to claim 1,
after performing word segmentation on the text to be processed and forming the words obtained by word segmentation into a word sequence, the method further comprises:
performing part-of-speech tagging processing according to the word sequence to obtain the part-of-speech of each word in the word sequence;
the mapping words in the word sequence into nodes includes:
and mapping the words of which the word property meets the word property condition in the word sequence into corresponding nodes.
5. The text processing method of claim 1, wherein the mapping the word dependencies to edges between corresponding nodes comprises:
any one of the following processes is performed:
mapping the word dependency relationship to undirected unweighted edges between corresponding nodes;
mapping the word dependency relationship to undirected edges between corresponding nodes, and
determining the edge weight of the mapped undirected edge according to the occurrence frequency of the word dependency relationship in the word sequence;
wherein the edge weight is positively correlated with the frequency of occurrence;
and mapping the word dependency relationship into directed edges in the same direction between corresponding nodes according to the direction represented by the word dependency relationship.
6. The method according to claim 1, wherein the determining the node satisfying the weight condition in the propagated candidate keyword graph as a target node comprises:
ordering the nodes in the propagated candidate keyword graph according to the node weight to obtain a node sequence;
determining the nodes in the node sequence as target nodes one by one according to the access sequence until a set number of target nodes are obtained;
wherein the access sequence is a descending order of node weights of the nodes in the node sequence.
7. The text processing method according to claim 6,
after the nodes in the node sequence are determined as the target nodes one by one according to the access sequence, the method further includes:
marking the determined target node as accessed; wherein, the nodes in the node sequence are all marked as not accessed during initialization;
after determining the word corresponding to the target node as the keyword of the text to be processed, the method further includes:
when at least two keywords have adjacent relation in the text to be processed, merging the at least two keywords;
determining the number of keywords in the text to be processed;
and when the number of the keywords is smaller than the set number, determining nodes which are not accessed in the node sequence as target nodes according to the access sequence until the number of the obtained keywords is equal to the set number.
8. The text processing method according to any one of claims 1 to 7, further comprising:
acquiring at least two candidate texts, and determining keywords of the candidate texts;
determining an intersection between the keywords of the text to be processed and the keywords of the candidate text, and determining a first number of the keywords included in the intersection;
determining a union set between the keywords of the text to be processed and the keywords of the candidate text, and determining a second number of the keywords included in the union set;
determining the ratio of the first quantity to the second quantity as the similarity between the text to be processed and the candidate text;
and determining the candidate texts with the similarity meeting the similarity condition as similar texts, and executing recommendation operation on the similar texts.
9. The text processing method according to any one of claims 1 to 7, further comprising:
classifying the keywords of the text to be processed through a classification model to obtain a text type;
establishing an index relationship between the text to be processed and the text type, so as to
respond to a query request comprising the text type according to the index relationship.
10. The text processing method according to claim 9, further comprising:
acquiring a sample text and a sample text type, and determining keywords of the sample text;
classifying the keywords of the sample text through the classification model to obtain a text type to be compared;
performing back propagation in the classification model according to the difference between the sample text type and the text type to be compared, and
updating the weight parameters of the classification model in the process of back propagation.
11. The text processing method according to any one of claims 1 to 7, further comprising:
acquiring sample browsing data of a user; the sample browsing data comprises satisfaction, user characteristics and keywords of texts browsed by the user;
updating a text recommendation model according to the sample browsing data;
acquiring at least two texts, and combining the user characteristics and the keywords of the texts into a sample to be processed;
predicting the to-be-processed sample through the updated text recommendation model to obtain prediction satisfaction;
and determining the text with the predicted satisfaction degree meeting the satisfaction degree condition as a text to be recommended, and executing recommendation operation on the text to be recommended.
12. The text processing method according to any one of claims 1 to 7, further comprising:
responding to a query request comprising query words, and determining a text with keywords matched with the query words as a text to be recommended;
acquiring heat data of the text to be recommended, and sequencing the heat data to obtain a heat sequence;
and executing recommendation operation on the text to be recommended according to the heat sequence.
13. A text processing apparatus, comprising:
the word segmentation module is used for performing word segmentation processing on the text to be processed and forming words obtained by the word segmentation processing into word sequences;
the syntax processing module is used for carrying out dependency syntax processing on the word sequence to obtain word dependency relationship among words in the word sequence;
the mapping module is used for mapping the words in the word sequence into nodes and mapping the word dependency relationship into edges between corresponding nodes so as to obtain a candidate keyword graph formed by connecting the nodes and the edges;
a propagation module, configured to propagate node weights of nodes in the candidate keyword graph according to edges in the candidate keyword graph;
a keyword determining module, configured to determine a node satisfying a weight condition in the propagated candidate keyword graph as a target node, and
determine the word corresponding to the target node as a keyword of the text to be processed.
14. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the text processing method of any one of claims 1 to 12 when executing executable instructions stored in the memory.
15. A storage medium having stored thereon executable instructions for causing a processor to perform the method of text processing of any of claims 1 to 12 when executed.
CN202010066891.6A 2020-01-20 2020-01-20 Text processing method and device, electronic equipment and storage medium Pending CN111274358A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010066891.6A CN111274358A (en) 2020-01-20 2020-01-20 Text processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010066891.6A CN111274358A (en) 2020-01-20 2020-01-20 Text processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111274358A true CN111274358A (en) 2020-06-12

Family

ID=71001846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010066891.6A Pending CN111274358A (en) 2020-01-20 2020-01-20 Text processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111274358A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
US20170330087A1 (en) * 2016-05-11 2017-11-16 International Business Machines Corporation Automated Distractor Generation by Identifying Relationships Between Reference Keywords and Concepts
CN108319627A (en) * 2017-02-06 2018-07-24 腾讯科技(深圳)有限公司 Keyword extracting method and keyword extracting device
CN109815333A (en) * 2019-01-14 2019-05-28 金蝶软件(中国)有限公司 Information acquisition method, device, computer equipment and storage medium
CN110046236A (en) * 2019-03-20 2019-07-23 腾讯科技(深圳)有限公司 A kind of search method and device of unstructured data
CN110705282A (en) * 2019-09-04 2020-01-17 东软集团股份有限公司 Keyword extraction method and device, storage medium and electronic equipment

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651462A (en) * 2020-06-23 2020-09-11 烟台大学 Thesis indexing system based on block chain technology
WO2022156730A1 (en) * 2021-01-22 2022-07-28 北京有竹居网络技术有限公司 Text processing method and apparatus, device, and medium
CN112836529A (en) * 2021-02-19 2021-05-25 北京沃东天骏信息技术有限公司 Method and device for generating target corpus sample
CN112836529B (en) * 2021-02-19 2024-04-12 北京沃东天骏信息技术有限公司 Method and device for generating target corpus sample
CN113033196A (en) * 2021-03-19 2021-06-25 北京百度网讯科技有限公司 Word segmentation method, device, equipment and storage medium
CN113033196B (en) * 2021-03-19 2023-08-15 北京百度网讯科技有限公司 Word segmentation method, device, equipment and storage medium
CN113297354A (en) * 2021-06-16 2021-08-24 深圳前海微众银行股份有限公司 Text matching method, device, equipment and storage medium
CN113468878A (en) * 2021-07-13 2021-10-01 腾讯科技(深圳)有限公司 Part-of-speech tagging method and device, electronic equipment and storage medium
CN113536772A (en) * 2021-07-15 2021-10-22 浙江诺诺网络科技有限公司 Text processing method, device, equipment and storage medium
CN113657113A (en) * 2021-08-24 2021-11-16 北京字跳网络技术有限公司 Text processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40023578; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200612)