CN116956818A - Text material processing method and device, electronic equipment and storage medium - Google Patents

Text material processing method and device, electronic equipment and storage medium

Info

Publication number
CN116956818A
CN116956818A (application number CN202211564278.2A)
Authority
CN
China
Prior art keywords
text
vocabulary
information
materials
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211564278.2A
Other languages
Chinese (zh)
Inventor
姚波怀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211564278.2A priority Critical patent/CN116956818A/en
Publication of CN116956818A publication Critical patent/CN116956818A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a text material processing method and device, an electronic device, a computer program product, and a computer-readable storage medium, relating to artificial intelligence. The method includes: acquiring a text material; encoding the text material to obtain a semantic coding sequence corresponding to the text material; performing vocabulary decoding based on the semantic coding sequence to obtain a vocabulary decoding sequence, where the vocabulary decoding sequence includes at least one vocabulary word; combining the vocabulary words in the vocabulary decoding sequence to obtain the writing theme information characterized by the text material; obtaining the matching degree between the writing theme information and each of a plurality of reference materials in a text material library; and selecting at least one reference material from the plurality of reference materials as a writing material based on the matching degree of each reference material. The method and device improve the accuracy of acquiring recommended writing materials.

Description

Text material processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to artificial intelligence technology, and in particular, to a method and apparatus for processing text materials, an electronic device, and a storage medium.
Background
Artificial Intelligence (AI) refers to the theory, methods, techniques, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in ways similar to human intelligence. Research in artificial intelligence covers the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, machine learning, and deep learning.
When authoring articles, users often cite materials or examples to make their articles more persuasive. To obtain such materials, users have to search for them themselves, which wastes labor and writing time. In the related art, materials containing user-entered keywords can be provided, but the relevance between those materials and the user's writing intention is low, and users still need to screen the materials themselves, which reduces writing efficiency.
In the related art, there is no effective way to provide users with accurate text materials.
Disclosure of Invention
Embodiments of the application provide a text material processing method and device, an electronic device, a computer storage medium, and a computer program product, which can improve the accuracy of acquiring recommended writing materials.
The technical solutions of the embodiments of the application are implemented as follows:
An embodiment of the application provides a text material processing method, which includes the following steps:
acquiring a text material;
encoding the text material to obtain a semantic coding sequence corresponding to the text material;
performing vocabulary decoding based on the semantic coding sequence to obtain a vocabulary decoding sequence, where the vocabulary decoding sequence includes at least one vocabulary word;
combining the vocabulary words in the vocabulary decoding sequence to obtain the writing theme information characterized by the text material;
obtaining the matching degree between the writing theme information and each of a plurality of reference materials in a text material library; and
selecting at least one reference material from the plurality of reference materials as a writing material based on the matching degree of each reference material.
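For concreteness, the claimed steps can be sketched as a toy, stdlib-only Python pipeline. Everything here (the whitespace tokenizer, the length-based "decoding" rule, the word-overlap matching degree, and the tiny in-memory material library) is an illustrative assumption, not the patented model:

```python
# Toy sketch of the claimed pipeline: encode -> decode vocabulary ->
# combine into theme info -> match against a material library -> select.
# Every function below is a stand-in for illustration only.

def encode(text):
    # Stand-in "semantic coding sequence": one token per word.
    return text.split()

def decode_vocabulary(coding_sequence):
    # Stand-in vocabulary decoding: keep words longer than 3 characters
    # as candidate theme words.
    return [tok for tok in coding_sequence if len(tok) > 3]

def combine_theme(vocab_sequence):
    # Combine the decoded words into one theme string.
    return " ".join(vocab_sequence)

def matching_degree(theme, material):
    # Stand-in matching degree: count of overlapping words.
    return len(set(theme.split()) & set(material.split()))

def recommend(text, library, top_k=1):
    theme = combine_theme(decode_vocabulary(encode(text)))
    ranked = sorted(library, key=lambda m: matching_degree(theme, m),
                    reverse=True)
    return ranked[:top_k]

library = [
    "hometown rivers and mountains essay",
    "city traffic planning report",
]
print(recommend("my hometown with beautiful rivers", library))
```

The sketch only shows the data flow between the six steps; the embodiments described later replace each stand-in with a learned model or a real material library.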
An embodiment of the application provides a text material processing device, which includes:
a material acquisition module configured to acquire a text material;
an encoding module configured to encode the text material to obtain a semantic coding sequence corresponding to the text material;
a decoding module configured to perform vocabulary decoding based on the semantic coding sequence to obtain a vocabulary decoding sequence, where the vocabulary decoding sequence includes at least one vocabulary word;
the decoding module being further configured to combine the vocabulary words in the vocabulary decoding sequence to obtain the writing theme information characterized by the text material;
a material recommendation module configured to obtain the matching degree between the writing theme information and each of a plurality of reference materials in a text material library; and
the material recommendation module being further configured to select at least one reference material from the plurality of reference materials as a writing material based on the matching degree of each reference material.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer-executable instructions; and
a processor for implementing the text material processing method provided by the embodiments of the application when executing the computer-executable instructions stored in the memory.
An embodiment of the application provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the text material processing method provided by the embodiments of the application.
An embodiment of the application provides a computer program product including a computer program or computer-executable instructions that, when executed by a processor, implement the text material processing method provided by the embodiments of the application.
The embodiments of the application have the following beneficial effects:
The writing theme information corresponding to the text material is obtained by encoding the text material and performing vocabulary prediction; because the writing theme information can characterize the writing intention, the accuracy of obtaining writing materials is improved. The writing theme information is matched against the text materials in a writing material library to obtain at least one text material as a writing material; matching by theme information improves the accuracy of the obtained writing materials. Because the writing theme information is used, material retrieval over the entire text material is avoided, which improves the efficiency of obtaining writing materials and saves the computing resources required to obtain them.
Drawings
Fig. 1 is an application mode schematic diagram of a processing method of text material provided by an embodiment of the present application;
fig. 2A is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2B is a schematic diagram of a topic prediction model provided by an embodiment of the present application;
fig. 3A to fig. 3G are schematic flow diagrams of a text material processing method according to an embodiment of the present application;
fig. 4 is a first schematic diagram of a display interface of a terminal device according to an embodiment of the present application;
fig. 5 is a second schematic diagram of a display interface of a terminal device according to an embodiment of the present application;
fig. 6 is a third schematic diagram of a display interface of a terminal device according to an embodiment of the present application;
fig. 7 is a flow chart of a processing method of text material according to an embodiment of the present application;
fig. 8 is an interactive flow chart of a processing method of text material according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the application clearer, the application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third", and the like are used only to distinguish similar objects and do not denote a particular order. It should be understood that, where permitted, the order or sequence denoted by "first", "second", and "third" may be interchanged so that the embodiments of the application described herein can be practiced in orders other than those illustrated or described herein.
It should be noted that, in the embodiments of the application, when data related to users (such as user information and user feedback data) is involved in specific products or technologies, user permission or consent must be obtained, and the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing the embodiments of the application in further detail, the terms involved in the embodiments of the application are explained as follows.
1) Writing: the creative mental process in which a person uses language and written symbols to reflect things, express thoughts and emotions, convey knowledge and information, and achieve communication. A user may write through a terminal device, for example: the user edits a document using office software on a computer, or edits a blog using social platform software on a mobile phone.
2) Theme: the central idea expressed in a literary work, social activity, or the like; it generally refers to the main content. In this application, the writing theme is the main or core content of what the user is writing, i.e., the user's writing intention.
3) Natural Language Processing (NLP): an important direction in the fields of computer science and artificial intelligence that studies theories and methods enabling effective communication between humans and computers in natural language. NLP is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. NLP techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graph techniques.
4) Sequence-to-sequence (Seq2Seq) model: a model built from recurrent neural network variants that comprises two parts, an Encoder and a Decoder. The sequence-to-sequence model is an important model in natural language processing and can be used for machine translation, dialogue systems, and text summarization.
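As a purely illustrative caricature of the Encoder/Decoder split described above (a real Seq2Seq model learns both parts as neural networks; here the "hidden state" is just a bag-of-words counter, so only the data flow is shown):

```python
# Stdlib-only caricature of the Encoder/Decoder split in a Seq2Seq model.
# The "hidden state" below is merely a word-count bag, so this example
# illustrates only the shape of the computation:
# input sequence -> fixed hidden state -> output sequence.
from collections import Counter

def encoder(tokens):
    # Compress the input sequence into a fixed "hidden state".
    return Counter(tokens)

def decoder(hidden_state, max_len=3):
    # Greedily emit the most salient tokens, one step at a time.
    output = []
    state = hidden_state.copy()
    for _ in range(max_len):
        if not state:
            break
        word, _ = state.most_common(1)[0]
        output.append(word)
        del state[word]
    return output

hidden = encoder(["my", "hometown", "my", "rivers"])
print(decoder(hidden))  # most frequent token comes first
```

The point of the split is that the decoder never sees the raw input, only the encoder's state, which is also how the topic prediction model in fig. 2B is organized.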
The embodiments of the application provide a text material processing method and device, an electronic device, a computer-readable storage medium, and a computer program product, which can improve the accuracy of acquiring recommended writing materials.
Exemplary applications of the electronic device provided by the embodiments of the application are described below. The electronic device may be implemented as a notebook computer, tablet computer, desktop computer, set-top box, mobile device (for example, a mobile phone, portable music player, personal digital assistant, dedicated messaging device, or portable game device), vehicle-mounted terminal, Virtual Reality (VR) device, Augmented Reality (AR) device, or any other type of user terminal, and may also be implemented as a server. An exemplary application in which the device is implemented as a server is described below.
Referring to fig. 1, fig. 1 is an application mode schematic diagram of the text material processing method according to an embodiment of the application. By way of example, fig. 1 involves a recommendation server 200, a database 500, a network 300, and a terminal device 400. The terminal device 400 is connected to the server 200 via the network 300; the network 300 may be a wide area network, a local area network, or a combination of the two.
In some embodiments, the server 200 may be a recommendation server, the user is a user who is writing, the terminal device 400 is the device the user uses to edit documents, and the database 500 is a writing material library. The writing materials in the library may be obtained from the network. The text material may be at least part of the content of a document the user has edited, or information such as videos and web pages the user browses through the terminal device 400; as exemplified above, the writing materials may be data in the form of text, video, and the like.
In some embodiments, when the user edits text using the terminal device 400, the content of the current document is obtained as the text material, and the terminal device 400 sends the text material to the server 200. The server 200 obtains the writing theme information corresponding to the text material and retrieves the corresponding writing materials from the database 500 based on the writing theme information. The server 200 then sends the writing materials to the terminal device 400, and the terminal device 400 displays them to assist the user's writing.
The embodiments of the application may be implemented with blockchain technology: the writing theme information obtained by the embodiments of the application may be uploaded to a blockchain for storage, and a consensus algorithm ensures its reliability. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked by cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The embodiments of the application may also be implemented with database technology. A database can be regarded, in short, as an electronic filing cabinet, i.e., a place where electronic files are stored, in which users can add, query, update, and delete data. A database is a collection of data that is stored together in a way that can be shared by multiple users, has as little redundancy as possible, and is independent of applications.
A Database Management System (DBMS) is a computer software system designed for managing databases, and generally provides basic functions such as storage, retrieval, security, and backup. Database management systems can be classified by the database model they support, e.g., relational or XML (Extensible Markup Language); by the type of computer they support, e.g., server cluster or mobile phone; by the query language they use, e.g., Structured Query Language (SQL) or XQuery; or by their performance emphasis, e.g., maximum scale or maximum operating speed. Regardless of the classification used, some DBMSs span categories, for example by supporting multiple query languages simultaneously.
The embodiments of the application may also be implemented with cloud technology. Cloud technology is a general term for the network, information, integration, management platform, and application technologies applied under the cloud computing business model; it can form resource pools that are used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The backend services of technical network systems, such as video websites, image websites, and many portals, require large amounts of computing and storage resources. With the development of the internet industry and the demands of search services, social networks, mobile commerce, and open collaboration, each item of data may carry its own hash-code identifier that must be transmitted to a backend system for logical processing; data of different levels is processed separately, and the various kinds of industry data require strong backend system support, which can only be realized through cloud computing.
In some embodiments, the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms. The terminal device may be, but is not limited to, a smartphone, tablet computer, notebook computer, desktop computer, smart speaker, or smart watch. The terminal device and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in the embodiments of the application.
Referring to fig. 2A, fig. 2A is a schematic structural diagram of an electronic device provided by an embodiment of the application. The electronic device shown in fig. 2A is the server 200, which includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The components of the server 200 are coupled together by a bus system 440. It should be understood that the bus system 440 enables connection and communication between these components. In addition to the data bus, the bus system 440 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as bus system 440 in fig. 2A.
The processor 410 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor (e.g., a microprocessor or any conventional processor), a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or non-volatile memory, and may include both. The non-volatile memory may be a Read-Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 450 described in the embodiments of the application is intended to include any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451, including system programs such as a framework layer, a core library layer, and a driver layer, for implementing various basic system services and handling hardware-related tasks;
a network communication module 452 for reaching other electronic devices via one or more (wired or wireless) network interfaces 420; exemplary network interfaces 420 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
in some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2A shows a processing apparatus 455 of text material stored in a memory 450, which may be software in the form of a program and a plug-in, and includes the following software modules: the material acquisition module 4551, the encoding module 4552, the decoding module 4553, and the material recommendation module 4554 are logical, and thus may be arbitrarily combined or further split according to the functions implemented. The functions of the respective modules will be described hereinafter.
The text material processing method provided by the embodiments of the application is described below. As mentioned above, the electronic device implementing the method may be a server; the execution subject of each step is therefore not repeated below.
Referring to fig. 3A, fig. 3A is a flowchart of a processing method of text material according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3A.
In step 301, text material is acquired.
By way of example, with reference to fig. 1, the text material is sent by the terminal device 400 to the server 200. The text material may be at least part of the content of the document in the edited state (e.g., at least one word, at least one sentence, or at least one paragraph of text), video data, audio data, etc.
In some embodiments, step 301 is implemented by obtaining the text material in at least one of the following ways:
Mode 1: at least part of the content of the current document to be edited (e.g., the current paragraph, the paragraphs before the current paragraph, the current sentence, or the sentences before the current sentence) is taken as the text material.
For ease of explanation, fig. 5 is a second schematic diagram of a display interface of the terminal device according to an embodiment of the application. The user edits a document in a document interface 501 displayed on the terminal device, and a piece of text entered by the user, "my hometown … I love my hometown", is displayed in a document editing area 502. After the user edits the text material, the terminal device transmits it to the server.
Mode 2: a historical document edited before the current document is obtained, and at least part of its content is taken as the text material.
By way of example, historical documents include, e.g., edited documents with similar names in the same storage location within a time window. Continuing with fig. 5: suppose that, before editing the document displayed in fig. 5, the user edited other documents through the terminal device; those other documents are taken as historical documents, and at least part of their content is taken as the text material. A historical document may be stored in the terminal device, or in a cloud server corresponding to the document editing software and forwarded to the server 200 by that cloud server.
In some embodiments, referring to fig. 3B, fig. 3B is a flow chart of a processing method of text material according to an embodiment of the present application; step 301 in fig. 3A is implemented by the following steps 3011 to 3012, which are described in detail below.
In step 3011, an information browsing record of the terminal device is acquired.
By way of example, the information types referred to by the information browse record include: web pages (e.g., news), social dynamics (e.g., blogs), video (e.g., short videos), audio (e.g., music, broadcast).
The information browsing record of the terminal device is obtained only when the user has set the relevant permission for obtaining the information browsing record in the terminal device.
In step 3012, the text material is extracted from at least one piece of browsed information based on the information browsing record.
For example, when the information corresponding to the information browsing record is multimedia data, text may not be directly obtainable from it, and the text material may instead be extracted by keyword extraction, for example: extracting keywords from text based on a semantic understanding model; performing speech recognition on video and audio based on a speech recognition model to obtain text; or extracting text from an image based on an image recognition model, or extracting an image of a target object and taking the entity name corresponding to the target object as the text material.
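A minimal frequency-based keyword extractor can stand in for the keyword extraction mentioned above; the stop-word list and the ranking rule are illustrative assumptions, and a production system would use the semantic-understanding, speech-recognition, or image-recognition models described in the text:

```python
# Toy stand-in for "extract text material by keyword extraction":
# rank words by frequency after dropping stop words. The stop-word
# list is an illustrative assumption, not part of the patent.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def extract_keywords(text, top_k=2):
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(top_k)]

print(extract_keywords("The hometown video shows the hometown rivers"))
```

The same interface would apply to text obtained from speech or image recognition: once a text string is recovered from the multimedia data, keywords can be extracted from it to form the text material.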
In some embodiments, step 3012 may be implemented as follows: information meeting relevant conditions is screened from the information browsing record, and the text material is extracted from that information.
For example, the relevant conditions include: the information and other information in the browsing record contain the same keywords; or the similarity between the information and other information in the browsing record is greater than a first similarity threshold.
For example, the similarity may be computed from feature vectors; for video, the feature vector is extracted from at least one modality among image, speech, and text (the video's metadata).
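The similarity filter can be sketched with cosine similarity between feature vectors. The example vectors and the 0.9 threshold are assumptions for illustration only, since the embodiment fixes neither the metric nor the threshold value:

```python
# Sketch of the "similarity greater than a first similarity threshold"
# filter using cosine similarity. The vectors and the threshold are
# illustrative assumptions.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

FIRST_SIMILARITY_THRESHOLD = 0.9  # assumed value for illustration

record = [0.9, 0.1, 0.3]  # feature vector of one browsed item
other = [0.8, 0.2, 0.3]   # feature vector of another browsed item
sim = cosine_similarity(record, other)
print(sim > FIRST_SIMILARITY_THRESHOLD)
```

Items whose similarity to other browsed items exceeds the threshold would be kept as candidates for text material extraction, on the assumption that repeatedly browsed topics reflect the user's interest.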
In the embodiments of the application, using the information browsing record as the text material makes it possible to predict the user's writing intention before the user edits any text and to determine the recommendation target in advance, improving recommendation efficiency and user experience.
With continued reference to fig. 3A, in step 302, the text material is encoded to obtain a semantic coding sequence corresponding to the text material.
For example, the encoding process extracts the text from the text material and converts it into semantic information that can be processed directly by a computer.
In some embodiments, referring to fig. 3C, fig. 3C is a flow chart of a processing method of text material according to an embodiment of the present application; step 302 in fig. 3A is implemented by the following steps 3021A to 3023A, which are described in detail below.
In step 3021A, vocabulary extraction processing is performed on the text material, so as to obtain vocabularies in the text material.
By way of example, the vocabulary may consist of individual characters or words. For example: the text material is at least one paragraph in the document edited by the user, and each word in the paragraph is extracted to obtain the vocabularies in the text material.
In step 3022A, each word is encoded to obtain a word embedding vector for each word.
For example, a word embedding vector (Word Embedding) converts the characters corresponding to a word into numbers; the encoding process assigns each word a vector representation of fixed length, where the fixed length can be set according to the actual requirements of the application scenario. The angle between the word embedding vectors of two words can measure how closely the two words are related.
In step 3023A, each word embedding vector is combined to obtain a semantic coding sequence corresponding to the text material.
For example, assuming that words 1 to N currently exist, N is a natural number greater than 1, and according to the corresponding sequence of words 1 to N in the text material, word embedded vectors of words 1 to N are combined into a sequence, so as to obtain a semantic coding sequence corresponding to the text material.
In the embodiment of the application, the text material is converted into the semantic coding sequence, so that the topic information of the text material is conveniently obtained, and the accuracy of obtaining the writing material is improved.
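Steps 3021A to 3023A above can be sketched minimally in Python; the embedding table and its deterministic pseudo-random vectors are illustrative stand-ins, not the application's actual encoding model:

```python
import numpy as np

def encode_text_material(words, embedding_table, dim=8):
    """Map each word to its embedding vector and combine the vectors,
    in their order of appearance, into a semantic coding sequence
    (an N x dim matrix)."""
    vectors = []
    for word in words:
        if word not in embedding_table:
            # unseen words get a deterministic pseudo-random vector
            rng = np.random.default_rng(abs(hash(word)) % (2**32))
            embedding_table[word] = rng.standard_normal(dim)
        vectors.append(embedding_table[word])
    return np.stack(vectors)  # shape: (N, dim)

table = {}
sequence = encode_text_material(["my", "hometown", "is", "beautiful"], table)
print(sequence.shape)  # (4, 8)
```

A real system would use a trained embedding matrix rather than random vectors, but the combine-in-order structure is the same.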
In some embodiments, referring to fig. 3D, fig. 3D is a flow chart of a processing method of text material provided by an embodiment of the present application; step 302 in fig. 3A is implemented by the following steps 3021B to 3024B, which are described in detail below.
In step 3021B, text extraction processing is performed on the text material, so as to obtain text content in the text material.
By way of example, the text extraction process may be implemented by a model related to natural language processing, such as calling a text extraction model to obtain each piece of text content in the text material.
For example, the text material may be content such as a document being edited by the user, a history document, or an information browsing record; such content is mixed, and encoding it as a whole requires considerable computing resources.
In step 3022B, key sentences are extracted from the text content.
By way of example, a key sentence is a sentence in the text content whose similarity to other sentences is greater than a similarity threshold. A similarity to other sentences greater than the similarity threshold indicates that at least part of the key sentence's content occurs with a certain frequency in the text content and can therefore represent the text content.
In some embodiments, step 3022B is implemented by: acquiring the similarity (e.g., cosine similarity or edit distance) between every two sentences in the text content; for each sentence, acquiring its number of related sentences, i.e., other sentences whose similarity to the sentence is greater than a similarity threshold; sorting the sentences in descending order of their number of related sentences to obtain a first descending-order list; and taking at least one sentence from the top of the first descending-order list as at least one key sentence.
Following the above, for example, assuming 5 sentences exist in the text content: the similarity between every two sentences is obtained; each sentence is taken as a node, the similarity between sentences as the weight of the edge between them, and edges whose similarity is greater than the similarity threshold as effective edges; the number of effective edges corresponding to each sentence, i.e., its number of related sentences, is counted. The sentences are sorted in descending order of their number of related sentences to obtain the first descending-order list, and the first sentence in the list is taken as the key sentence; alternatively, several sentences from the top of the list are taken as key sentences.
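The key-sentence selection described above (count the related sentences above a similarity threshold, sort descending, take the top) can be sketched as follows; the word-overlap similarity and the example sentences are hypothetical stand-ins for the cosine similarity or edit distance the application mentions:

```python
import itertools

def extract_key_sentences(sentences, similarity, threshold, top_k=1):
    """Count, for each sentence, how many other sentences exceed the
    similarity threshold (its 'related sentences'), sort descending,
    and return the top-k sentences as key sentences."""
    related = {s: 0 for s in sentences}
    for a, b in itertools.combinations(sentences, 2):
        if similarity(a, b) > threshold:
            related[a] += 1
            related[b] += 1
    ranked = sorted(sentences, key=lambda s: related[s], reverse=True)
    return ranked[:top_k]

# Toy similarity: Jaccard overlap of words (a stand-in for cosine similarity).
def word_overlap(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

docs = [
    "my hometown is beautiful",
    "my hometown has four seasons",
    "football is fun",
    "my hometown is green",
]
key = extract_key_sentences(docs, word_overlap, threshold=0.3)
print(key)  # ['my hometown is beautiful']
```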
In step 3023B, each word in each key sentence is encoded to obtain a word embedding vector for each word.
The principle of step 3023B is the same as that of step 3022A, and will not be described again.
In step 3024B, each word embedding vector is combined to obtain a semantic coding sequence corresponding to the text material.
The principle of step 3024B is the same as that of step 3023A, and will not be described again.
In the embodiment of the application, the process of extracting the key sentences and acquiring the semantic coding sequence based on the key sentences is adopted, so that a large amount of texts are prevented from being decoded, the calculation efficiency is improved, and the calculation resources are saved.
With continued reference to fig. 3A, in step 303, a vocabulary decoding process is performed based on the semantic coding sequence, resulting in a vocabulary decoding sequence.
Illustratively, the vocabulary decoding sequence includes at least one vocabulary. And carrying out vocabulary decoding processing, namely predicting the vocabulary in the subject information based on the semantic coding sequence to obtain a vocabulary decoding sequence. The topic information may be characterized as at least one word, at least one sentence, or a hierarchical structure (e.g., a collection of words).
In some embodiments, referring to fig. 3E, fig. 3E is a flow chart of a processing method of text material provided by an embodiment of the present application; step 303 in fig. 3A is implemented by the following steps 3031 to 3035, which are described in detail below.
In step 3031, vocabulary prediction processing is performed based on the semantic coding sequence and the vocabulary, so as to obtain the occurrence probability of each vocabulary in the vocabulary.
By way of example, the vocabulary includes a plurality of words and a word embedding vector for each word. The occurrence probability of each word in the vocabulary, i.e., the probability that it is a word in the topic information, is predicted based on the content of the semantic coding sequence. The occurrence probabilities can be predicted by a normalization function (e.g., softmax).
In step 3032, the vocabulary with the largest occurrence probability is obtained as the first target vocabulary in the vocabulary decoding sequence.
For example, the vocabulary with the largest occurrence probability is taken as the target vocabulary, and the target vocabulary is the vocabulary in the subject information.
In step 3033, the semantic code sequence and the word embedding vector of the first target word are spliced to obtain a spliced sequence.
For example, the concatenation process adds the word embedding vector of the target vocabulary to the tail of the semantic coding sequence to form a new sequence, i.e., the concatenation sequence. For example: the first concatenation yields a concatenation sequence that can be characterized as [semantic coding sequence, word embedding vector of the first target vocabulary].
In step 3034, multiple vocabulary prediction processes are performed based on the concatenation sequence and the vocabulary, so as to obtain the occurrence probability of each vocabulary in the vocabulary corresponding to each vocabulary prediction process.
By way of example, the concatenation sequence input to each vocabulary prediction process includes the following information: the concatenation sequence used in the previous vocabulary prediction process, and the word embedding vector of the target word with the highest occurrence probability obtained by that previous process.
For example, with i being a positive integer, the concatenation sequence input to the i-th vocabulary prediction process may be characterized as [the concatenation sequence used by the (i-1)-th vocabulary prediction process, the word embedding vector of the target vocabulary obtained by the (i-1)-th vocabulary prediction process]. The vocabulary prediction process stops when, in its results, the occurrence probability of every word is smaller than a preset minimum probability. The preset minimum probability can be determined according to the actual application scenario.
In step 3035, the target vocabulary in the result of each vocabulary prediction processing is obtained, and each target vocabulary is combined into a vocabulary decoding sequence.
For example, the target vocabulary in the result of each vocabulary prediction processing is used as the vocabulary in the subject information, and the target vocabulary is combined into a vocabulary decoding sequence.
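Steps 3031 to 3035 amount to a greedy autoregressive decoding loop: predict the most probable word, splice its embedding onto the sequence tail, repeat until no word clears the minimum probability. A minimal sketch, where the scoring function and its fixed schedule are toy stand-ins for a real decoder:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def greedy_decode(semantic_seq, vocab, vocab_embeddings, score_fn,
                  min_prob=0.2, max_steps=10):
    """Repeatedly predict the most probable vocabulary word, append its
    embedding to the sequence, and stop once no word clears min_prob."""
    seq = list(semantic_seq)
    decoded = []
    for _ in range(max_steps):
        probs = softmax(score_fn(seq))   # one probability per vocab word
        best = int(np.argmax(probs))
        if probs[best] < min_prob:       # every word below the floor: stop
            break
        decoded.append(vocab[best])
        seq.append(vocab_embeddings[best])  # splice onto the sequence tail
    return decoded

# Toy scorer: a fixed logit schedule indexed by the current sequence length.
schedule = [np.array([3.0, 0.0, 0.0]),   # step 1: word 0 wins
            np.array([0.0, 3.0, 0.0]),   # step 2: word 1 wins
            np.array([0.0, 0.0, 0.0])]   # step 3: uniform, all below floor

def toy_scores(seq):
    return schedule[min(len(seq) - 1, len(schedule) - 1)]

vocab = ["hometown", "scenery", "football"]
emb = np.eye(3)
out = greedy_decode([np.zeros(3)], vocab, emb, toy_scores, min_prob=0.4)
print(out)  # ['hometown', 'scenery']
```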
According to the embodiment of the application, the accuracy of acquiring the theme information is improved through the occurrence probability of the predicted vocabulary, the integrity of the theme information is ensured through repeated vocabulary prediction processing, the correlation of the theme information and the text material is ensured, and the accuracy of recommending the writing material can be improved.
With continued reference to FIG. 3A, in step 304, each vocabulary in the vocabulary decoding sequence is combined to obtain the authoring subject information for the text material characterization.
By way of example, the authoring subject information may be a subject word, a subject phrase, a subject sentence, a hierarchical structure, or a combination of these forms. For example: the topic of the text material is the scenery of the hometown, and the topic information can be characterized as any of the following forms or a combination of them: subject word: "hometown"; subject phrase: "my hometown"; subject sentence: "my hometown is beautiful throughout the year"; hierarchical structure: "four seasons of hometown {spring of hometown, summer of hometown, autumn of hometown, winter of hometown}".
With continued reference to fig. 3A, in step 305, the matching degree of the authoring subject information with the plurality of reference materials in the text material library, respectively, is obtained.
By way of example, the reference material may originate from a network, and the reference material may be text, pictures, short videos, etc. The text material library comprises: reference material, text subject information of the reference material, and correspondence of the reference material and the text subject information. The text subject information may be in the same format as the authoring subject information or in a different format.
In some embodiments, referring to fig. 3F, fig. 3F is a flow chart of a processing method of text material provided by an embodiment of the present application; step 305 in fig. 3A is implemented by the following steps 3051A to 3054A, which are described in detail below.
In step 3051A, text topic information corresponding to each reference material in the text material library is obtained.
For example, the server 200 obtains text topic information corresponding to each reference material from the text material library.
In step 3052A, feature extraction processing is performed on the authoring subject information to obtain a first text feature.
For example, the first text feature may be characterized in feature vector form; the feature extraction process may be implemented by a text feature extraction model, which converts each word in the authoring subject information into a vector and combines the vectors in order to obtain the first text feature. If the vocabulary decoding sequence is already characterized in feature vector form, it may be used directly as the first text feature.
In step 3053A, feature extraction processing is performed on each text topic information, so as to obtain a second text feature of each text topic information.
For example, the principle of step 3053A is the same as that of step 3052A; the first text feature and the second text features share the same format, which facilitates obtaining their similarity.
In step 3054A, a similarity between the first text feature and each of the second text features is obtained, and each similarity is used as a matching degree between the authoring subject information and each of the reference materials.
For example, the similarity may be a cosine similarity. And acquiring cosine similarity between the feature vector corresponding to the first text feature and the vector corresponding to the second text feature, and taking the cosine similarity as the matching degree between the writing subject information and the reference materials corresponding to each second text feature.
In the embodiment of the application, the accuracy of acquiring the writing materials is improved by taking the feature extraction and the similarity as the matching degree.
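The matching in steps 3052A to 3054A reduces to cosine similarity between feature vectors; a minimal sketch, where the query vector and reference features are hypothetical:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def matching_degrees(first_feature, second_features):
    # matching degree = cosine similarity between the authoring-subject
    # feature and each reference material's text-subject feature
    return {name: cosine_similarity(first_feature, vec)
            for name, vec in second_features.items()}

query = [1.0, 0.0, 1.0]                  # hypothetical first text feature
refs = {"material_A": [1.0, 0.0, 1.0],   # same direction -> similarity 1
        "material_B": [0.0, 1.0, 0.0]}   # orthogonal     -> similarity 0
scores = matching_degrees(query, refs)
print(scores)
```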
In some embodiments, the text corpus comprises: reference material, text subject information of the reference material, and correspondence of the reference material and the text subject information.
Referring to fig. 3G, fig. 3G is a flow chart of a processing method of text material according to an embodiment of the present application; step 305 in fig. 3A is implemented by the following steps 3051B to 3052B, which are described in detail below.
In step 3051B, text topic information corresponding to each reference material in the text material library is obtained.
For example, the correspondence between the reference material and the text subject information is stored in the form of key-value pairs. For example: the Key Value pair is formed by taking the reference material as a Value (Value) and the text subject information as a Key (Key). The corresponding value, i.e. the corresponding reference material, may be acquired based on the key.
In step 3052B, the number of identical characters shared by each text subject information and the authoring subject information is determined, and that character count is taken as the matching degree between the authoring subject information and each reference material.
By way of example, reference materials related to the text subject information can be recalled by retrieval, where the retrieval is performed by character matching; this saves the computing resources required to obtain the matching degree of the reference materials and improves the efficiency of obtaining reference materials.
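The character-matching retrieval of steps 3051B and 3052B can be sketched as counting the distinct characters the authoring subject shares with each text-subject key; the library contents below are hypothetical:

```python
def match_by_characters(authoring_subject, material_library):
    # material_library maps a text-subject key to its reference material;
    # matching degree = number of distinct characters the authoring
    # subject shares with each text-subject key
    return {key: len(set(authoring_subject) & set(key))
            for key in material_library}

library = {"hometown scenery": "material_1",
           "football training": "material_2"}
degrees = match_by_characters("my hometown", library)
print(degrees)  # {'hometown scenery': 9, 'football training': 4}
```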
With continued reference to fig. 3A, in step 306, at least one reference material is selected from the plurality of reference materials as the authoring material based on the corresponding degree of matching for each reference material.
For example, the reference materials may be obtained in descending order based on the matching degree, or the reference materials with the matching degree greater than the preset matching degree may be used as the writing materials.
In some embodiments, step 306 may be implemented by: sorting the reference materials in descending order of matching degree to obtain a second descending-order list; and taking at least one reference material from the top of the second descending-order list as at least one authoring material.
For example: a preset number or preset proportion of reference materials is selected from the top of the second descending-order list as authoring materials. The preset number or preset proportion is set in advance according to the actual application scenario.
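The descending-order selection of step 306 can be sketched as a simple top-k pick over the matching degrees (the material names and scores here are illustrative):

```python
def select_authoring_materials(matching_degrees, top_k=2):
    # sort reference materials by matching degree, descending,
    # and keep the first top_k as authoring materials
    ranked = sorted(matching_degrees.items(),
                    key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]

degrees = {"material_1": 0.91, "material_2": 0.34, "material_3": 0.78}
chosen = select_authoring_materials(degrees)
print(chosen)  # ['material_1', 'material_3']
```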
According to the embodiment of the application, the writing materials in the reference materials are selected by descending order, and are preferentially acquired, so that the writing materials better accord with the writing intention of a user, better writing experience is provided for the user, the user can conveniently use the writing materials as writing references, and the writing efficiency is improved.
In some embodiments, the encoding process and vocabulary decoding process are implemented by invoking a topic prediction model based on the text material, i.e., steps 302 through 304 may be implemented by the topic prediction model; referring to fig. 2B, fig. 2B is a schematic structural diagram of a topic prediction model provided by an embodiment of the present application; the topic prediction model 210 includes: an encoder 211, a decoder 212; wherein the encoder 211 is used for performing an encoding process and the decoder 212 is used for performing a vocabulary decoding process.
In some embodiments, the manner in which the model is trained may be an overall training, by training a topic prediction model in the following manner: obtaining a sample material set, wherein the sample material set comprises: sample text materials and actual writing theme information corresponding to each sample text material; invoking a theme prediction model to be trained based on each sample text material, and sequentially performing coding processing and vocabulary decoding processing to obtain prediction writing theme information corresponding to each sample text material; determining a loss function of a theme predictive model to be trained based on the difference between each piece of predicted authoring theme information and the corresponding piece of actual authoring theme information; updating parameters of the theme predictive model to be trained based on the loss function to obtain the theme predictive model after training.
By way of example, updating parameters may be accomplished as follows: the gradients of the parameters of the model to be trained are computed layer by layer through back propagation (the parameters can be obtained by a gradient descent method, which searches for the minimum of the loss function along the direction of its gradient descent to obtain optimal parameters), and the updated parameters of each layer of the model to be trained are calculated based on the gradients. The corresponding parameters in the model to be trained are replaced with the updated parameters to obtain the updated model.
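The parameter update described above follows ordinary gradient descent; a minimal one-parameter sketch, using a toy quadratic loss rather than the topic prediction model's actual loss function:

```python
def gradient_descent_step(params, grads, learning_rate=0.1):
    # move each parameter against its gradient, i.e. in the direction
    # in which the loss function decreases
    return [p - learning_rate * g for p, g in zip(params, grads)]

# Toy example: minimise loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = [0.0]
for _ in range(100):
    w = gradient_descent_step(w, [2.0 * (w[0] - 3.0)])
print(round(w[0], 4))  # 3.0 (the minimiser of the loss)
```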
In some embodiments, the model may be trained in stages by training a topic prediction model in the following manner: acquiring a first training set and a second training set, wherein the first training set comprises: sample text material, sample abstract information corresponding to each sample text material, and the second training set comprises: sample text materials and sample writing theme information corresponding to each sample text material; calling a text abstract model to carry out training treatment based on the first training set to obtain a theme prediction model to be trained; and calling the theme prediction model to be trained based on the second training set to carry out training treatment, so as to obtain the trained theme prediction model.
For example, the training sets adopted by different training modes are different, but the principle of training processing of each stage of model training by stages is the same as that of overall training, and will not be described in detail here.
According to the embodiment of the application, the theme prediction model is trained in different modes, so that the training precision of the theme prediction model is improved, the accuracy of acquiring the theme information by the theme prediction model can be improved, and the accuracy of acquiring the writing materials based on the theme information is further improved.
In some embodiments, after step 306, the server 200 sends the authoring material to the terminal device 400 to cause the terminal device 400 to display the authoring material, wherein the terminal device is the source of the text material.
In some embodiments, referring to fig. 8, fig. 8 is an interactive flow chart of a processing method of text material according to an embodiment of the present application.
The terminal device 400 performs step 801 of transmitting text material to the server 200 in response to satisfaction of the authoring material acquisition condition.
Illustratively, the authoring material acquisition conditions include: the user actively searches for authoring materials; part of the content in the document currently being edited is selected; the current document has received no new editing operation within a preset time; or the document editing software is started and the document editing area is displayed.
The server 200 performs step 802 of determining authoring subject information based on the text material and acquiring corresponding authoring material based on the authoring subject information.
For example, the server 200 executes the processing method of the text material provided by the embodiment of the application to obtain the authoring subject information of the text material, and obtains the authoring material from the database based on the subject information.
The server 200 performs step 803 to transmit the authoring material to the terminal apparatus 400. The terminal device 400 performs step 804 of displaying the authoring material.
In order to facilitate understanding of the above-mentioned interaction flow chart, the following explains the application scenario of the embodiment of the present application.
1. When the user selects part of the content which is edited in the document, automatically recommending materials; or if the user pauses inputting the characters for a period of time, recommending the writing material, and if the preset time is reached, automatically disappearing the writing material. Referring to fig. 5, fig. 5 is a second schematic diagram of a display interface of a terminal device according to an embodiment of the present application; the user performs document editing in a document interface 501 displayed in the terminal device, and a piece of text "my hometown … … me loves my hometown" entered by the user is displayed in a document editing area 502. After the user edits the text material, the terminal device sends the text material to the server, the server obtains the corresponding writing material based on the text material, and sends the writing material to the terminal device, and the terminal device displays the writing material 4, the writing material 5 and the writing material 6 in the document interface 501.
2. The user copies text into a search box to display recommended materials; referring to fig. 4, fig. 4 is a first schematic diagram of a display interface of a terminal device according to an embodiment of the present application; the steps of fig. 4 are described with reference to fig. 1. An authoring assistant interface 401 is displayed in the man-machine interaction interface of the terminal device 400, where the authoring assistant interface 401 includes: a text material input box 402 and an authoring material display box 404. After the user inputs the text material "i like playing football … …" in the text material input box 402 of the authoring assistant application of the terminal device 400 and clicks the confirmation control 403, the terminal device 400 sends the text material to the server 200; the server 200 calls the topic prediction model to obtain topic information, determines matched authoring materials based on the topic information, and sends the authoring materials to the terminal device 400, where authoring material 1, authoring material 2 and authoring material 3 are respectively displayed in the authoring material display box 404.
3. The user generates recommended text based on browsing records prior to authoring. Referring to fig. 6, fig. 6 is a third schematic diagram of a display interface of a terminal device according to an embodiment of the present application; the user performs document editing in the document interface 601 displayed in the terminal device, and no text is currently input in the document editing area 602. The terminal device obtains the user's prior browsing records of other multimedia data (such as web pages, social dynamics and videos) on the terminal device and sends the browsing records to the server as text materials; the server obtains corresponding authoring materials based on the text materials and sends them to the terminal device, and the terminal device displays authoring material 7, authoring material 8 and authoring material 9 in the document interface 601.
According to the embodiment of the application, the text material is subjected to encoding processing and vocabulary prediction processing to obtain the authoring subject information corresponding to the text material; the authoring subject information can represent the authoring intention, which improves the accuracy of obtaining authoring materials. The authoring subject information is matched against text materials in the authoring material library to obtain at least one text material as the authoring material, and matching by subject information further improves the accuracy of obtaining authoring materials. By obtaining the authoring subject information, material retrieval based on the whole text material is avoided, saving the computing resources required to obtain authoring materials and improving the efficiency of obtaining them.
An exemplary application of the text material processing method according to the embodiment of the present application in an actual application scenario will be described below.
People have writing requirements in study and work. When writing, users often want to quote classics but encounter situations such as not knowing how to write, not being able to recall a famous sentence, or feeling that what they have written is not good enough. The writing-material recommendation function of the related art requires users to actively input keywords and then searches for related writing materials based on those keywords; this requires continuous active searching by users, which is time-consuming and laborious, and moreover cannot continuously provide the material results the users require.
The embodiment of the application understands the user's authoring subject information based on the user's current writing content (the subject information can characterize the user's authoring intent) and recommends relevant authoring materials for the user based on the user's authoring subject. That is, the embodiment of the application can automatically understand the user's authoring intention and recommend referenceable authoring materials during the user's writing process, or when and before the user finishes writing.
For example, a user wants to record in a diary the fun of making dumplings for the New Year. The related art, on the one hand, may not understand that the user's authoring subject information is "making dumplings", and on the other hand, may not sufficiently understand the subjects of the materials to be recommended, and therefore cannot provide materials of higher relevance. The embodiment of the application can predict the user's authoring subject information based on the user's edited text content or browsing records, can understand the authoring subjects of the text materials, and recommends the authoring materials matching those subjects to the user based on the user's authoring subject information.
Referring to fig. 7, fig. 7 is a flowchart of a processing method of text material according to an embodiment of the present application, and the following steps are explained with the server 200 in fig. 1 as an execution body.
In step 701, a training sample for training a topic prediction model is obtained.
By way of example, the topic information may be a topic word, a topic phrase, a topic sentence, a hierarchical structure, or a combination of these forms. For example: the topic of the text material is the scenery of the hometown, and the topic information can be characterized as any of the following forms or a combination of them: topic word: "hometown"; topic phrase: "my hometown"; topic sentence: "my hometown is beautiful throughout the year"; hierarchical structure: "four seasons of hometown {spring of hometown, summer of hometown, autumn of hometown, winter of hometown}".
For example, in practical application, which form or combination of forms is specifically adopted depends on the requirement of the application scenario, the structural complexity of the text to be understood, and which form of training sample can be obtained.
The training sample includes materials used as references and the topic information corresponding to each material. Training samples can be obtained in the following ways: obtaining unlabeled reference materials, manually labeling each text with its topic information, and taking the labeled reference materials as training samples; or acquiring reference materials from network resources.
By way of example, some training samples with available topics can be mined from network resources for model training. Taking compositions as an example: (1) many composition websites classify compositions by topic, for example "compositions about hometown", "compositions describing scenery", etc.; by crawling and analyzing these compositions, the correspondence between composition text and topic can be obtained for training the model. (2) The titles of many compositions are fairly specific and can serve as topic information, such as "my hometown", "beautiful scenery", etc.; such compositions and their titles can also be acquired as training samples.
In step 702, a training process is performed on a topic prediction model to be trained based on training samples.
By way of example, the topic prediction model may be a sequence-to-sequence (Sequence to Sequence, Seq2Seq) model, which consists of an encoder (Encoder) and a decoder (Decoder). Its working mechanism is: first, the encoder encodes the input content into a semantic space to obtain a vector of fixed dimension that represents the input semantics; this semantic vector is then decoded using the decoder to obtain the desired output content; if the output content is text, the decoder is typically a language model.
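The encoder-decoder division of labor described here can be sketched structurally; the toy encoder and decoder below are placeholders for real networks, and the topic strings are illustrative:

```python
class Seq2SeqTopicModel:
    """Skeleton of a sequence-to-sequence topic model: the encoder maps
    the input into a fixed-dimension semantic representation, and the
    decoder expands that representation into the output text."""

    def __init__(self, encoder, decoder):
        self.encoder = encoder
        self.decoder = decoder

    def predict_topic(self, text_material):
        semantic_vector = self.encoder(text_material)  # fixed-dim semantics
        return self.decoder(semantic_vector)           # text output

# Toy stand-ins for real networks: the encoder averages character codes,
# the decoder maps that scalar onto one of two canned topics.
toy_encoder = lambda text: sum(map(ord, text)) / max(len(text), 1)
toy_decoder = lambda v: "hometown" if v < 110 else "football"

model = Seq2SeqTopicModel(toy_encoder, toy_decoder)
topic = model.predict_topic("home")
print(topic)  # hometown
```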
The training mode of the topic prediction model includes integral training and training in stages, and is explained below.
By way of example, the overall training may be achieved by: and calling a theme prediction model to be trained based on the training sample to perform coding processing and decoding processing to obtain predicted theme information, comparing the predicted theme information with actual theme information corresponding to the training sample to obtain a difference of the theme information, determining a loss function of the theme prediction model to be trained, and performing back propagation processing on the theme prediction model to be trained based on the loss function to obtain the theme prediction model after updating parameters.
By way of example, staged training may be achieved as follows: the principle of obtaining topic information is very similar to that of obtaining a text summary, so the capability of the topic prediction model can be further improved based on text summarization technology. A conventional text summarization model, such as t5-pegasus or bertsum, can be adopted as the topic prediction model to be trained; the training samples include text materials, the summaries corresponding to the text materials, and topic information. The topic prediction model to be trained is pre-trained using the text summary data, and the pre-trained topic prediction model is then fine-tuned using the topic information data.
In step 703, the topic prediction model is invoked on the text material to perform the following processing: the encoder encodes the text material to obtain a coding sequence, and the decoder decodes the coding sequence to obtain the topic information corresponding to the text material.
By way of example, weight values are obtained through an attention mechanism, and the semantic vector output by the encoder is labeled with these weight values to form a weighted semantic vector; each weight value characterizes the degree of correlation between a word in the semantic vector and the text. The decoder performs vocabulary prediction based on the weighted semantic vector: for each dimension of the decoding sequence, it produces a probability distribution over the vocabulary ids in a preset vocabulary table, and the vocabulary id with the highest probability is selected as the vocabulary id of that dimension, yielding the decoding sequence. Finally, based on the mapping between vocabulary ids and words, the decoder maps each vocabulary id in the decoding sequence to its corresponding word, and the words are combined to obtain the topic information.
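The per-dimension argmax over the vocabulary-id probability distributions, followed by the id-to-word mapping, might look like this sketch (the probabilities and vocabulary table are made-up examples):

```python
def greedy_decode(step_probs, id_to_word):
    """step_probs: one {vocab_id: probability} distribution per decoding
    dimension. Select the highest-probability id at each dimension, then
    map the ids to words and combine them into the topic string."""
    ids = [max(dist, key=dist.get) for dist in step_probs]
    return "".join(id_to_word[i] for i in ids)
```

A real decoder would also handle an end-of-sequence id and possibly beam search, but the greedy id-selection and id-to-word mapping are as shown.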
In some embodiments, the sequence-to-sequence model may obtain the topic in an extract-then-generate manner, which may be implemented as follows: sentences in the text material are scored with a Page Rank algorithm, that is, each sentence in the text material is taken as a node, the similarity between sentences is taken as the weight of the edge between nodes, and a node graph is constructed. The score of each node is then calculated; for example, the score of a sentence can be the number of its edges whose weight is larger than a preset value. The high-scoring sentences are encoded by the encoder to obtain semantic vectors, and the decoder performs vocabulary prediction based on the weighted semantic vectors: for each dimension of the decoding sequence, it produces a probability distribution over the vocabulary ids in the preset vocabulary table, and the vocabulary id with the highest probability is selected as the vocabulary id of that dimension, yielding the decoding sequence. Finally, based on the mapping between vocabulary ids and words, the decoder maps each vocabulary id in the decoding sequence to its corresponding word, and the words are combined to obtain the topic information.
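The sentence-scoring step — counting, for each sentence, the edges whose similarity weight exceeds a preset value — can be sketched as follows. Jaccard word overlap stands in here for whatever sentence-similarity measure an implementation actually uses:

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (assumed measure)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def score_sentences(sentences, threshold=0.2):
    """Score each sentence (node) by the number of other sentences to
    which its edge weight (similarity) exceeds the preset threshold."""
    scores = []
    for i, s in enumerate(sentences):
        n = sum(1 for j, t in enumerate(sentences)
                if i != j and jaccard(s, t) > threshold)
        scores.append(n)
    return scores
```

The highest-scoring sentences would then be handed to the encoder, as described above.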
In step 704, the authoring material is obtained from a database based on the topic information and sent to the terminal device that is the source of the text material.
For example, common material retrieval is divided into two stages, recall and ranking. The topic prediction model can be applied in both stages, while the techniques adopted in the other parts are not limited; how the topic prediction model improves the recall and ranking effects is described below.
For example, in the offline state, the topic prediction model predicts the topic information of all reference material contents; each piece of topic information is stored with its reference material and used as the label of that reference material. In the online state, the topic prediction model predicts the topic information corresponding to the user's input content. Related materials are then recalled from the material library based on the topic information, and materials irrelevant to the topic are filtered out by judging the degree of matching between the topic of the user's input text and the topic of each material, thereby improving the topic matching degree of the recalled materials. For example, the materials are sorted by matching degree in descending order, the last several materials in the order are deleted, and the remaining materials are recommended to the user as writing materials. Judging the topic matching degree is equivalent to judging the matching degree between texts, so text similarity techniques (edit distance, deep matching models, and the like) can be used.
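One of the text-similarity techniques mentioned above, edit distance, can serve directly as the topic matching degree. A sketch of the rank-and-filter step, with made-up material records, might look like this:

```python
def edit_distance(a, b):
    """Levenshtein distance between two topic strings; a smaller
    distance means a better topic match."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def filter_materials(user_topic, materials, keep=2):
    """Sort recalled materials by topic distance and drop the worst,
    keeping only the best-matching ones for recommendation."""
    ranked = sorted(materials,
                    key=lambda m: edit_distance(user_topic, m["topic"]))
    return ranked[:keep]
```

The `"topic"` field name and the fixed `keep` count are assumptions for the sketch; a deployed system would tune the cutoff.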
By way of example, the authoring material corresponding to the topic information may also be retrieved through key-value pairs. In the offline state, the topic prediction model predicts the topics of all material contents, and key-value pairs are established with the topic as the key and the material as the value. In the online state, the topic prediction model predicts the topic corresponding to the user's input content. Finally, the topic corresponding to the user text is taken as the query index, matching content is retrieved from the keys of all key-value pairs, and the values of the retrieved keys are returned as the authoring materials. The retrieval may be implemented as follows: the characters of each topic key are matched against the characters of the query topic to obtain the number of identical characters, the keys are sorted in descending order of this number, and the authoring material corresponding to the topic key with the largest number of matching characters is returned.
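The character-matching retrieval over key-value pairs described above might be sketched as follows; multiset character intersection is one possible reading of "the same character number":

```python
from collections import Counter

def shared_chars(a, b):
    """Number of characters the two topic strings have in common
    (multiset intersection, so repeated characters count)."""
    return sum((Counter(a) & Counter(b)).values())

def retrieve(query_topic, kv_store, top_k=1):
    """kv_store maps topic keys to material values. Return the materials
    whose topic keys share the most characters with the query topic."""
    ranked = sorted(kv_store,
                    key=lambda k: shared_chars(query_topic, k),
                    reverse=True)
    return [kv_store[k] for k in ranked[:top_k]]
```

In practice the key-value store would be a database or cache rather than an in-memory dict, but the query-against-keys logic is the same.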
The topic prediction model can also be used to improve the accuracy of the input parameters of the ranking model in the recommendation process. In the offline state, the topic prediction model predicts the topic information of all reference material contents; each piece of topic information is stored with its reference material and used as the label of that reference material. In the online state, the topic prediction model predicts the topic information corresponding to the user's input content. Related materials are recalled from the material library based on the topic information, the degree of matching between the topic of the user's input text and the topic of each material is judged, and the calculated matching-degree score is fed into the ranking model as one of the features on which ranking is based, thereby influencing the ranking result.
In some embodiments, the product forms and user texts of authoring material recommendation can vary with user requirements, authoring habits, and the like. The text material processing method provided by the embodiments of the present application can be applied to scenarios such as active retrieval by users and system recommendation.
In some embodiments, in the active search scenario, the user actively inputs some content in a search box, or selects a piece of content, to obtain the authoring materials associated with the input content. The user may actively input text at word granularity to search for materials, such as "make dumplings" or "responsibility"; may input sentences such as "I like playing football" or "the spring of the West Lake is particularly beautiful"; or may select paragraph or chapter contents to search for materials. Referring to fig. 4, fig. 4 is a first schematic diagram of a display interface of a terminal device according to an embodiment of the present application. The authoring assistant interface 401 is displayed in the human-computer interaction interface of the terminal device 400, and the authoring assistant interface 401 includes a text material input box 402 and an authoring material display box 404. After the user inputs the text material "I like playing football … …" in the text material input box 402 of the authoring assistant application of the terminal device 400 and clicks the confirmation control 403, the terminal device 400 sends the text material to the server 200; the server 200 invokes the topic prediction model to obtain the topic information, determines the matched authoring materials based on the topic information, and sends them to the terminal device 400, and authoring material 1, authoring material 2, and authoring material 3 are displayed in the authoring material display box 404.
In some embodiments, in the system recommendation scenario, the system automatically understands the global or local topic of the user's writing content during the writing process (including when the user pastes in the whole text after finishing writing), and automatically recommends materials for the user based on the global or local topic, so as to help the user finish writing or optimize the global or local text content; the text entered by the user may take the form of words, sentences, paragraphs, chapters, and the like. In the automatic system recommendation scenario, the text form is usually a chapter, or paragraphs and sentences are used to recommend materials matching local information for the user. Referring to fig. 5, fig. 5 is a second schematic diagram of a display interface of a terminal device according to an embodiment of the present application. The user performs document editing in a document interface 501 displayed on the terminal device, and a piece of text "my hometown … … I love my hometown" entered by the user is displayed in a document editing area 502. After the user edits the text material, the terminal device sends the text material to the server; the server obtains the corresponding authoring materials based on the text material and sends them to the terminal device, and the terminal device displays authoring material 4, authoring material 5, and authoring material 6 in the document interface 501.
In some embodiments, the topic information may also be obtained based on a browsing record in the terminal device, and the corresponding authoring material obtained based on that topic information. Referring to fig. 6, fig. 6 is a third schematic diagram of a display interface of a terminal device according to an embodiment of the present application. The user edits a document in the document interface 601 displayed on the terminal device but has not yet input text in the document editing area 602; the terminal device obtains the user's previous browsing records of other multimedia data on the terminal device (such as web pages, social dynamics, and videos) and sends the browsing records to the server as text material; the server obtains the corresponding authoring materials based on the text material and sends them to the terminal device, and the terminal device displays authoring material 7, authoring material 8, and authoring material 9 in the document interface 601.
The embodiments of the present application can be applied to various product forms and user text forms. By understanding what the writing topic of the material is and recommending materials related to that topic, text writing efficiency can be improved. When the user's text is long, it is difficult for the related art to obtain recommended materials corresponding to the whole text based on keywords and key sentences, whereas the solution provided by the embodiments of the present application can understand the user's writing topic. The embodiments of the present application can achieve the following effects:
1) A supervised training method is adopted, which performs better and improves the accuracy of the topic information obtained by the trained topic prediction model.
2) The text is understood and generated from a global view, without being influenced by entities and semantics irrelevant to the topic information, thereby improving the accuracy of obtaining the topic information.
3) The topic prediction model uses text generation technology, so the number of topic categories is not limited, and the topic words are not required to appear in the text content. Global writing topics are captured in a more effective, focused, and flexible manner, which can effectively improve the quality of the recommended authoring materials.
Continuing with the description below of an exemplary architecture of the text material processing device 455 implemented as a software module provided by embodiments of the present application, in some embodiments, as shown in fig. 2A, the software modules stored in the text material processing device 455 of the memory 450 may include: a material acquisition module 4551 configured to acquire text material; an encoding module 4552 configured to perform encoding processing on the text material to obtain a semantic coding sequence corresponding to the text material; a decoding module 4553 configured to perform vocabulary decoding processing based on the semantic coding sequence to obtain a vocabulary decoding sequence, where the vocabulary decoding sequence includes at least one vocabulary; the decoding module 4553 is further configured to combine each vocabulary in the vocabulary decoding sequence to obtain the authoring subject information characterized by the text material; a material recommendation module 4554 configured to obtain the matching degrees of the authoring subject information with a plurality of reference materials in the text material library respectively; the material recommendation module 4554 is further configured to select at least one reference material from the plurality of reference materials as the authoring material based on the matching degree corresponding to each reference material.
In some embodiments, the encoding module 4552 is configured to perform vocabulary extraction processing on the text material to obtain the vocabularies in the text material; perform encoding processing on each vocabulary to obtain the word embedding vector of each vocabulary; and combine the word embedding vectors to obtain the semantic coding sequence corresponding to the text material.
In some embodiments, the encoding module 4552 is configured to perform text extraction processing on the text material to obtain the text content in the text material; extract key sentences from the text content, where a key sentence is a sentence whose similarity with other sentences in the text content is greater than a similarity threshold; perform encoding processing on each vocabulary in each key sentence to obtain the word embedding vector of each vocabulary; and combine the word embedding vectors to obtain the semantic coding sequence corresponding to the text material.
In some embodiments, the encoding module 4552 is configured to obtain the similarity between every two sentences in the text content; for each sentence, obtain the number of related sentences, i.e., other sentences whose similarity with the sentence is greater than the similarity threshold; sort the sentences in descending order based on the number of related sentences to obtain a first descending-order list; and take at least one sentence from the first position of the first descending-order list as at least one key sentence.
In some embodiments, the decoding module 4553 is configured to perform vocabulary prediction processing based on the semantic coding sequence and a vocabulary table to obtain the occurrence probability of each vocabulary in the vocabulary table, where the vocabulary table includes a plurality of vocabularies and the word embedding vector of each vocabulary; acquire the vocabulary with the largest occurrence probability as the first target vocabulary in the vocabulary decoding sequence; splice the semantic coding sequence and the word embedding vector of the first target vocabulary to obtain a spliced sequence; and perform vocabulary prediction processing multiple times based on the spliced sequence and the vocabulary table, obtaining the occurrence probability of each vocabulary in the vocabulary table in each vocabulary prediction processing, where the spliced sequence input to each vocabulary prediction processing includes the spliced sequence used in the previous vocabulary prediction processing and the word embedding vector of the target vocabulary with the largest occurrence probability obtained by the previous vocabulary prediction processing; and acquire the target vocabulary in the result of each vocabulary prediction processing, combining the target vocabularies into the vocabulary decoding sequence.
In some embodiments, the material acquisition module 4551 is configured to acquire text material by at least one of: taking at least part of the content in the current document to be edited as text material; and acquiring the historical document edited before the current document, and taking at least part of contents in the historical document as text materials.
In some embodiments, the material acquisition module 4551 is configured to acquire an information browsing record, where the information types involved in the information browsing record include: web pages, social dynamics, video, and audio; and extract text material from at least one piece of browsed information based on the information browsing record.
In some embodiments, the material acquisition module 4551 is configured to screen out information satisfying relevant conditions from the information browsing record, and extract text material from the information satisfying the relevant conditions; where the relevant conditions include: the information and other information in the information browsing record include the same keywords; and the similarity between the information and other information in the information browsing record is greater than a first similarity threshold.
In some embodiments, the text material library comprises: reference materials, text topic information of the reference materials, and the correspondence between the reference materials and the text topic information; the material recommendation module 4554 is configured to acquire the text topic information corresponding to each reference material in the text material library; perform feature extraction processing on the authoring subject information to obtain a first text feature; perform feature extraction processing on each piece of text topic information to obtain a second text feature of each piece of text topic information; and obtain the similarity between the first text feature and each second text feature, taking each similarity as the matching degree between the authoring subject information and the corresponding reference material.
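The similarity between the first text feature and each second text feature could, for instance, be cosine similarity; the sketch below assumes the features have already been extracted as plain numeric vectors:

```python
def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def match_degrees(query_feature, topic_features):
    """Matching degree of the authoring-subject feature against the
    topic feature of each reference material."""
    return [cosine(query_feature, f) for f in topic_features]
```

Any feature extractor (TF-IDF, a sentence encoder, etc.) could supply the vectors; cosine similarity is just one reasonable choice for the comparison itself.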
In some embodiments, the text material library comprises: reference materials, text topic information of the reference materials, and the correspondence between the reference materials and the text topic information; the material recommendation module 4554 is configured to acquire the text topic information corresponding to each reference material in the text material library; and determine the number of identical characters shared by each piece of text topic information and the authoring subject information, taking that number of characters as the matching degree between the authoring subject information and the corresponding reference material.
In some embodiments, the material recommendation module 4554 is configured to perform a descending order ranking process on the matching degree of each reference material, to obtain a second descending order ranking list; and taking at least one reference material from the first position in the second descending order of the ordered list as at least one writing material.
In some embodiments, the material recommendation module 4554 is configured to send the authoring material to the terminal device after selecting at least one reference material from the plurality of reference materials as the authoring material based on the matching degree corresponding to each reference material, such that the terminal device displays the authoring material, wherein the terminal device is a source of the text material.
In some embodiments, the encoding process and vocabulary decoding process are implemented by invoking a topic prediction model based on text material;
The topic prediction model comprises: an encoder and a decoder; wherein the encoder is used for executing encoding processing, and the decoder is used for executing vocabulary decoding processing.
In some embodiments, the material recommendation module 4554 is configured to train the topic prediction model by:
obtaining a sample material set, wherein the sample material set comprises: sample text materials and actual writing theme information corresponding to each sample text material;
invoking a theme prediction model to be trained based on each sample text material, and sequentially performing coding processing and vocabulary decoding processing to obtain prediction writing theme information corresponding to each sample text material;
determining a loss function of a theme predictive model to be trained based on the difference between each piece of predicted authoring theme information and the corresponding piece of actual authoring theme information;
updating parameters of the theme predictive model to be trained based on the loss function to obtain the theme predictive model after training.
In some embodiments, the material recommendation module 4554 is configured to train the topic prediction model by: acquiring a first training set and a second training set, where the first training set includes sample text materials and the sample summary information corresponding to each sample text material, and the second training set includes sample text materials and the sample authoring subject information corresponding to each sample text material; calling a text summarization model for training processing based on the first training set to obtain the topic prediction model to be trained; and calling the topic prediction model to be trained for training processing based on the second training set to obtain the trained topic prediction model.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the electronic device executes the processing method of the text material according to the embodiment of the application.
The embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions or a computer program which, when executed by a processor, cause the processor to perform the method for processing text material provided by the embodiments of the present application, for example, the method for processing text material shown in fig. 3A.
In some embodiments, the computer-readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM, or may be various devices including one of the above memories or any combination thereof.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example in one or more scripts in a hypertext markup language (HTML, Hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or, alternatively, on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiments of the present application, the authoring subject information corresponding to the text material is obtained through encoding processing and vocabulary prediction processing on the text material; the authoring subject information can represent the authoring intention, improving the accuracy of obtaining the authoring material. The authoring subject information is matched against the text materials in the authoring material library to obtain at least one text material as the authoring material, and matching by subject information improves the accuracy of obtaining the authoring material. By acquiring the authoring subject information, material retrieval based on the whole text material is avoided, saving the computing resources required for acquiring the authoring material and improving the efficiency of acquiring the authoring material.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (19)

1. A method of processing text material, the method comprising:
acquiring text materials;
coding the text material to obtain a semantic coding sequence corresponding to the text material;
Performing vocabulary decoding processing based on the semantic coding sequence to obtain a vocabulary decoding sequence, wherein the vocabulary decoding sequence comprises at least one vocabulary;
combining each vocabulary in the vocabulary decoding sequence to obtain the writing theme information of the text material characterization;
obtaining the matching degree of the writing theme information and a plurality of reference materials in a text material library respectively;
and selecting at least one reference material from the plurality of reference materials as a writing material based on the matching degree corresponding to each reference material.
2. The method according to claim 1, wherein the encoding the text material to obtain a semantic coding sequence corresponding to the text material includes:
performing vocabulary extraction processing on the text material to obtain vocabularies in the text material;
coding each vocabulary to obtain word embedding vectors of each vocabulary;
and combining each word embedded vector to obtain a semantic coding sequence corresponding to the text material.
3. The method according to claim 1, wherein the encoding the text material to obtain a semantic coding sequence corresponding to the text material includes:
Performing text extraction processing on the text material to obtain text content in the text material;
extracting key sentences from the text content, wherein the key sentences are sentences in the text content, and the similarity between the key sentences and other sentences is greater than a similarity threshold value;
carrying out coding processing on each vocabulary in each key sentence to obtain a word embedding vector of each vocabulary;
and combining each word embedded vector to obtain a semantic coding sequence corresponding to the text material.
4. The method of claim 3, wherein the extracting key sentences from the text content comprises:
obtaining the similarity between every two sentences in the text content;
for each sentence, acquiring the number of related sentences of other sentences with similarity to the sentence being greater than a similarity threshold;
sorting each sentence in a descending order based on the number of related sentences to obtain a first descending order sorting list;
and taking at least one sentence from the first position in the first descending order of the ordered list as at least one key sentence.
5. The method of claim 1, wherein the performing the vocabulary decoding process based on the semantic coding sequence to obtain a vocabulary decoding sequence comprises:
Performing vocabulary prediction processing based on the semantic coding sequence and a vocabulary table to obtain the occurrence probability of each vocabulary in the vocabulary table, wherein the vocabulary table comprises a plurality of vocabularies and word embedding vectors of each vocabulary;
acquiring a vocabulary with the largest occurrence probability as a first target vocabulary in the vocabulary decoding sequence;
splicing the semantic coding sequence and the word embedding vector of the first target word to obtain a spliced sequence;
performing multiple vocabulary prediction processing based on the spliced sequence and the vocabulary, and obtaining the occurrence probability of each vocabulary in the vocabulary corresponding to each vocabulary prediction processing;
the splice sequence input by each vocabulary prediction processing comprises the following information: the word embedding vector of the target word with the maximum occurrence probability is obtained by the splicing sequence used in the previous word prediction processing and the previous word prediction processing;
and obtaining target words in the result of each word prediction processing, and combining each target word into the word decoding sequence.
6. The method of claim 1, wherein the obtaining text material comprises:
The text material is obtained by at least one of the following means:
taking at least part of the content in the current document to be edited as text material;
and acquiring a historical document edited before the current document, and taking at least part of contents in the historical document as text materials.
7. The method of claim 1, wherein the obtaining text material comprises:
obtaining an information browsing record, wherein the information type related to the information browsing record comprises: web pages, social dynamics, video, audio;
and extracting text materials from at least one piece of browsed information based on the information browsing record.
8. The method of claim 7, wherein extracting text material from the browsed at least one piece of information based on the information browse record comprises:
screening out information meeting the related conditions from the information browsing records, and extracting text materials from the information meeting the related conditions;
wherein the correlation condition includes: the information and other information in the information browsing record comprise the same keywords; and the similarity between the information and other information in the information browsing record is greater than a first similarity threshold.
9. The method of claim 1, wherein the text material library comprises: the method comprises the steps of referencing materials, text theme information of the referencing materials and corresponding relation between the referencing materials and the text theme information;
the obtaining the matching degree of the writing theme information and a plurality of reference materials in a text material library respectively comprises the following steps:
acquiring text theme information corresponding to each reference material in the text material library;
performing feature extraction processing on the writing theme information to obtain a first text feature;
performing feature extraction processing on each piece of text theme information to obtain a second text feature of each piece of text theme information;
and obtaining the similarity between the first text feature and each second text feature, and taking each similarity as the matching degree between the writing theme information and each reference material.
10. The method of claim 1, wherein the text material library comprises: reference materials, text theme information of the reference materials, and a correspondence between the reference materials and the text theme information;
the obtaining the matching degree between the writing theme information and the reference materials in the text material library comprises:
acquiring text theme information corresponding to each reference material in the text material library;
and determining a number of identical characters shared by each piece of text theme information and the writing theme information, and taking the number of characters as the matching degree between the writing theme information and each reference material.
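Claim 10's alternative matching degree is lighter weight: a count of characters shared between the two strings, with no feature extraction at all. A one-line sketch (the function name is ours, and we count distinct shared characters; the claim's exact counting convention is not specified):

```python
def char_overlap(theme: str, topic_info: str) -> int:
    """Number of distinct characters the writing theme information and
    a piece of text theme information have in common."""
    return len(set(theme) & set(topic_info))
```

This is a plausible fallback for Chinese text, where single characters carry topical meaning, and it avoids invoking any model at matching time.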
11. The method of claim 1, wherein the selecting at least one of the reference materials from the plurality of reference materials as a writing material based on the matching degree corresponding to each of the reference materials comprises:
performing descending sort processing on the matching degree of each reference material to obtain a second descending-order sorted list;
and taking at least one reference material starting from a first position in the second descending-order sorted list as the at least one writing material.
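The selection in claim 11 is a standard top-k over the matching degrees. A sketch with illustrative names (k is a free parameter the claim leaves open as "at least one"):

```python
def select_writing_materials(materials, degrees, k=2):
    """Sort reference materials by matching degree, descending, and take
    the top-k from the head of the sorted list as writing materials."""
    ranked = sorted(zip(materials, degrees), key=lambda p: p[1], reverse=True)
    return [m for m, _ in ranked[:k]]
```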
12. The method of claim 1, wherein after the selecting at least one of the reference materials from the plurality of reference materials as a writing material based on the matching degree corresponding to each of the reference materials, the method further comprises:
sending the writing material to a terminal device, so that the terminal device displays the writing material, wherein the terminal device is a source of the text materials.
13. The method of claim 1, wherein the encoding processing and the vocabulary decoding processing are implemented by invoking a topic prediction model based on the text materials;
the topic prediction model comprises: an encoder and a decoder; wherein the encoder is configured to perform the encoding processing, and the decoder is configured to perform the vocabulary decoding processing.
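The data flow of claim 13 can be shown structurally. The encoder and decoder below are trivial stand-ins (a real implementation would be a trained sequence-to-sequence network); only the pipeline shape is from the claims: text materials → semantic coding sequence → vocabulary decoding sequence → combined writing theme information. The class, the character-sum code, and the lookup table are all hypothetical.

```python
class TopicPredictionModel:
    def __init__(self, decoder_table):
        # Hypothetical mapping from semantic codes to vocabulary items.
        self.decoder_table = decoder_table

    def encode(self, text: str) -> list[int]:
        """Encoding processing: map each token to a semantic code
        (deterministic toy code; a real encoder is learned)."""
        return [sum(map(ord, tok)) % 1000 for tok in text.split()]

    def decode(self, codes: list[int]) -> list[str]:
        """Vocabulary decoding processing: map codes to vocabularies,
        keeping only codes the table knows."""
        return [self.decoder_table[c] for c in codes if c in self.decoder_table]

    def predict_theme(self, text: str) -> str:
        """Combine each vocabulary in the vocabulary decoding sequence
        into the writing theme information."""
        return " ".join(self.decode(self.encode(text)))
```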
14. The method of claim 13, wherein the method further comprises:
training the topic prediction model by:
obtaining a sample material set, wherein the sample material set comprises: sample text materials and actual writing theme information corresponding to each sample text material;
invoking, based on each sample text material, the topic prediction model to be trained to sequentially perform encoding processing and vocabulary decoding processing, to obtain predicted writing theme information corresponding to each sample text material;
determining a loss function of the topic prediction model to be trained based on a difference between each piece of predicted writing theme information and the corresponding actual writing theme information;
and updating parameters of the topic prediction model to be trained based on the loss function, to obtain the trained topic prediction model.
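Claim 14 does not fix a specific loss form; a conventional choice for a sequence model is average negative log-likelihood of the actual theme tokens. A hedged sketch of that loss computation alone (parameter updates would then follow by backpropagation in a real framework):

```python
import math

def theme_loss(predicted_probs: list[float]) -> float:
    """Average negative log-likelihood. predicted_probs[i] is the
    probability the model assigned to the i-th token of the actual
    writing theme information; eps guards against log(0)."""
    eps = 1e-12
    return -sum(math.log(p + eps) for p in predicted_probs) / len(predicted_probs)
```

A perfectly confident correct prediction gives a loss near zero, and the loss grows as the model's probability on the actual theme tokens shrinks, which is the "difference" the claim bases the update on.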
15. The method of claim 13, wherein the method further comprises:
training the topic prediction model by:
acquiring a first training set and a second training set, wherein the first training set comprises: sample text materials and sample abstract information corresponding to each sample text material, and the second training set comprises: sample text materials and sample writing theme information corresponding to each sample text material;
invoking a text summarization model for training processing based on the first training set, to obtain the topic prediction model to be trained;
and invoking the topic prediction model to be trained for training processing based on the second training set, to obtain the trained topic prediction model.
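Claim 15 is a two-stage recipe: pretrain on summarization, then fine-tune on theme prediction. The orchestration can be sketched schematically; `train_fn` is a stand-in for a model-specific training loop and everything here is illustrative, not the patent's code.

```python
def two_stage_training(summary_set, theme_set, train_fn):
    """Stage 1: train a text summarization model on (material, abstract)
    pairs; its weights become the topic prediction model to be trained.
    Stage 2: continue training on (material, writing theme) pairs."""
    model = train_fn("summarization", summary_set, init=None)
    model = train_fn("theme", theme_set, init=model)
    return model
```

The design rationale is the usual transfer-learning one: summarization data is plentiful and teaches the encoder-decoder to condense text, so the scarcer theme-labeled data only has to adapt an already capable model.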
16. A processing apparatus for text materials, the apparatus comprising:
a material acquisition module, configured to acquire text materials;
an encoding module, configured to perform encoding processing on the text materials to obtain a semantic coding sequence corresponding to the text materials;
a decoding module, configured to perform vocabulary decoding processing based on the semantic coding sequence to obtain a vocabulary decoding sequence, wherein the vocabulary decoding sequence comprises at least one vocabulary;
the decoding module being further configured to combine each vocabulary in the vocabulary decoding sequence to obtain writing theme information represented by the text materials;
a material recommendation module, configured to obtain matching degrees between the writing theme information and a plurality of reference materials in a text material library respectively;
the material recommendation module being further configured to select at least one reference material from the plurality of reference materials as a writing material based on the matching degree corresponding to each reference material.
17. An electronic device, the electronic device comprising:
a memory for storing computer executable instructions;
a processor for implementing the method of processing text material of any of claims 1 to 15 when executing computer-executable instructions or computer programs stored in the memory.
18. A computer-readable storage medium storing computer-executable instructions or a computer program which, when executed by a processor, implement the method of any one of claims 1 to 15.
19. A computer program product comprising computer-executable instructions or a computer program, which, when executed by a processor, implements the method of any one of claims 1 to 15.
CN202211564278.2A 2022-12-07 2022-12-07 Text material processing method and device, electronic equipment and storage medium Pending CN116956818A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211564278.2A CN116956818A (en) 2022-12-07 2022-12-07 Text material processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116956818A true CN116956818A (en) 2023-10-27

Family

ID=88448044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211564278.2A Pending CN116956818A (en) 2022-12-07 2022-12-07 Text material processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116956818A (en)

Similar Documents

Publication Publication Date Title
CN107220352B (en) Method and device for constructing comment map based on artificial intelligence
Khribi et al. Automatic recommendations for e-learning personalization based on web usage mining techniques and information retrieval
US10217058B2 (en) Predicting interesting things and concepts in content
CN111324728A (en) Text event abstract generation method and device, electronic equipment and storage medium
US20090216741A1 (en) Prioritizing media assets for publication
CN108846138B (en) Question classification model construction method, device and medium fusing answer information
US20160328403A1 (en) Method and system for app search engine leveraging user reviews
CN112188312B (en) Method and device for determining video material of news
CN104978314A (en) Media content recommendation method and device
CN112749341A (en) Key public opinion recommendation method, readable storage medium and data processing device
CN113360646A (en) Text generation method and equipment based on dynamic weight and storage medium
CN110717038A (en) Object classification method and device
CN113032552A (en) Text abstract-based policy key point extraction method and system
WO2019139727A1 (en) Accuracy determination for media
CN111382563A (en) Text relevance determining method and device
Zemlyanskiy et al. DOCENT: Learning self-supervised entity representations from large document collections
Gendarmi et al. Community-driven ontology evolution based on folksonomies
Da et al. Deep learning based dual encoder retrieval model for citation recommendation
Hu et al. Aspect-guided syntax graph learning for explainable recommendation
US20230090601A1 (en) System and method for polarity analysis
CN113434789B (en) Search sorting method based on multi-dimensional text features and related equipment
CN117009578A (en) Video data labeling method and device, electronic equipment and storage medium
US9305103B2 (en) Method or system for semantic categorization
CN114328820A (en) Information searching method and related equipment
CN116956818A (en) Text material processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication