US10083157B2 - Text classification and transformation based on author - Google Patents

Text classification and transformation based on author

Info

Publication number
US10083157B2
Authority
US
United States
Prior art keywords
training
language model
author
input text
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US15/229,743
Other versions
US20170039174A1 (en)
Inventor
Brian Patrick Strope
Matthew Steedman Henderson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Priority to US15/229,743
Assigned to GOOGLE INC. Assignment of assignors interest (see document for details). Assignors: STROPE, Brian Patrick; HENDERSON, Matthew Steedman
Publication of US20170039174A1
Assigned to GOOGLE LLC. Change of name (see document for details). Assignor: GOOGLE INC.
Application granted
Publication of US10083157B2
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F17/2264
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G06F17/24
    • G06F17/274
    • G06F17/28
    • G06F17/30705
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0454
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

FIG. 3 shows an example system 300 for transforming input text 302 into an output text 312 rewritten according to the style of a particular author 304. The depicted interaction generally occurs after the encoder language model 162 and the decoder language models 164 are trained, e.g., as shown in FIG. 2. The system 300 includes an author transformation decoder 310, which is one of the decoder language models 164 configured to perform a transformation of input text 302 to the style of a particular author.

The encoder language model 162 is presented with input text 302, which in this example is "87 years ago." The encoder language model 162 analyzes the input text 302, produces a vector stream 308 representing the input text 302, and presents the vector stream 308 to the author transformation decoder 310 along with the input text 302. The author transformation decoder 310 also takes the author 304 as input, which in this example is "Abraham Lincoln." In some cases, the encoder language model 162 receives the author 304 as input and passes it along to the author transformation decoder 310. The author transformation decoder 310 then produces the output text 312 representing the input text 302 rewritten in the style of the requested author 304. Here, the input text "87 years ago" has been transformed into the output text "Four score and seven years ago," which represents the input text as it would likely have been written by Abraham Lincoln.
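The FIG. 3 interaction can be made concrete with a small sketch. The Python below is illustrative only: a deterministic hash-based embedding stands in for the trained encoder language model 162, and a nearest-neighbor phrase table stands in for the author transformation decoder 310; the function names and the tiny Lincoln vocabulary are assumptions, not the patented implementation.

```python
# Toy sketch of the FIG. 3 flow: encoder -> vector stream -> author
# transformation decoder. The hash-based embedding, phrase table, and
# tiny Lincoln vocabulary are illustrative assumptions, not the trained
# neural network models described in the text.
import hashlib
import numpy as np

DIM = 64

def embed(word: str) -> np.ndarray:
    """Deterministic stand-in for a learned word embedding."""
    seed = int(hashlib.md5(word.lower().encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).standard_normal(DIM)

def encode(text: str) -> list[np.ndarray]:
    """Stand-in for encoder language model 162: text -> vector stream."""
    return [embed(w) for w in text.split()]

class AuthorTransformationDecoder:
    """Stand-in for decoder 310: given a vector from the stream, emit the
    word or phrase the requested author would most likely use (here by
    nearest-neighbor lookup in a per-author table)."""

    def __init__(self, usage: dict[str, dict[str, str]]):
        # usage[author][source_word] = phrase the author tends to use;
        # in the patent, this mapping is learned from training texts.
        self.tables = {
            author: [(embed(src), phrase) for src, phrase in pairs.items()]
            for author, pairs in usage.items()
        }

    def transform(self, stream: list[np.ndarray], author: str) -> str:
        out = []
        for v in stream:
            # pick the phrase whose source vector is closest to v
            best = max(self.tables[author], key=lambda kv: float(np.dot(kv[0], v)))
            out.append(best[1])
        return " ".join(out)

decoder = AuthorTransformationDecoder({
    "Abraham Lincoln": {"87": "four score and seven", "years": "years", "ago": "ago"},
})
stream = encode("87 years ago")                      # vector stream 308
print(decoder.transform(stream, "Abraham Lincoln"))  # four score and seven years ago
```

A trained decoder would generate these phrases from learned distributions rather than look them up, but the flow of data (text to vector stream to author-conditioned output) is the same.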
FIG. 4 shows an example system 400 for producing a classification 406 of an input text 402. The depicted interaction generally occurs after the encoder language model 162 and the decoder language models 164 are trained, e.g., as shown in FIG. 2. The system 400 includes a classification decoder 410, which is one of the decoder language models 164 configured to classify input text as satire or non-satire.

The encoder language model 162 is presented with input text 402, analyzes it, and produces a vector stream 408 representing the input text 402. The encoder language model 162 presents the vector stream 408 to the classification decoder 410 along with the input text 402. The classification decoder 410 also takes the author 404 as input; in some cases, the encoder language model 162 receives the author 404 as input and passes it along to the classification decoder 410. The classification decoder 410 produces the classification 406 for the input text 402 based on the vector stream 408, the input text 402, and the author 404. For example, the input text 402 may be text purporting to be a news story, and the classification 406 may be an indication of whether the news story is legitimate news or satire.
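The FIG. 4 flow admits a similar sketch. Here a centroid comparison stands in for the trained classification decoder 410; the embedding helpers are repeated so the sketch stays self-contained, and the example texts and labels are invented for illustration.

```python
# Toy sketch of the FIG. 4 classification flow; the embedding, centroid
# rule, and example texts are illustrative stand-ins for the trained
# classification decoder 410, not the patented implementation.
import hashlib
import numpy as np

DIM = 64

def embed(word: str) -> np.ndarray:
    """Deterministic stand-in for a learned word embedding."""
    seed = int(hashlib.md5(word.lower().encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).standard_normal(DIM)

def encode(text: str) -> list[np.ndarray]:
    """Stand-in for encoder language model 162: text -> vector stream."""
    return [embed(w) for w in text.split()]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class ClassificationDecoder:
    """Stand-in for classification decoder 410: compares the mean of the
    vector stream to class centroids built from labeled example texts."""

    def __init__(self, examples: dict[str, list[str]]):
        self.centroids = {
            label: np.mean([v for t in texts for v in encode(t)], axis=0)
            for label, texts in examples.items()
        }

    def classify(self, stream: list[np.ndarray], author: str) -> str:
        # A trained decoder also conditions on the author; this toy ignores it.
        mean = np.mean(stream, axis=0)
        return max(self.centroids, key=lambda lbl: cosine(self.centroids[lbl], mean))

decoder = ClassificationDecoder({
    "satire": ["area man wins argument with toaster"],
    "non-satire": ["senate passes budget bill after long debate"],
})
stream = encode("area man declares victory over toaster")  # vector stream 408
print(decoder.classify(stream, author="unknown"))  # satire (shared vocabulary)
```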
FIG. 5 is a flow diagram of an example process 500 for transforming input text into an output text rewritten according to the style of a particular author. An input text including one or more words, and a name of a requested author, are received (502). A vector stream representing the input text is generated using an encoder language model (504). The vector stream includes one or more multi-dimensional vectors, each associated with one or more associated words of the words of the input text and representing a distribution of contexts in which the associated words occurred in a plurality of training texts processed by the encoder language model. An output text representing a particular transformation of the input text is then produced based at least in part on a decoder language model, the generated vector stream, and the name of the requested author (506). The decoder language model stores distributions of words used by particular authors in the plurality of training texts that caused the encoder language model to produce particular vectors representing the words. In some cases, the particular transformation of the input text is a transformation of the input text into text written in the style of the requested author.

In some implementations, an original author of the input text is also received, and producing the output text is performed based at least in part on the original author. The encoder language model and decoder language model may be neural network models. The process 500 may also include training the encoder language model using at least the plurality of training texts, and training the decoder language model using at least a vector stream generated by the encoder language model representing the plurality of training texts, the plurality of training texts, and a particular author associated with each training text. The particular author may include one or more co-authors of the associated training text, or may be an anonymous author associated with training texts for which an author is not known. In some cases, the requested author includes one or more of the particular authors of the plurality of training texts.
FIG. 6 is a flow diagram of an example process 600 for producing a classification of an input text. An input text including one or more words, and data identifying a requested author, are received (602). A vector stream representing the input text is generated based on an encoder language model (604). The vector stream includes one or more multi-dimensional vectors, each associated with one or more associated words of the words of the input text and representing a distribution of contexts in which the associated words occurred in a plurality of training texts processed by the encoder language model. A classification of the input text is then produced based on a decoder language model, the generated vector stream, the input text, and the author (606). The decoder language model stores distributions of words used by particular authors in the plurality of training texts that caused the encoder language model to produce particular vectors representing the words. In some cases, the classification of the input text includes a satire indication, a non-satire indication, a predicted author indication, or a relevance indication.

In some implementations, an original author of the input text is also received, and producing the classification is performed based at least in part on the original author. The encoder language model and decoder language model may be neural network models. In some cases, the process 600 includes training the encoder language model using at least the plurality of training texts, and training the decoder language model using at least a vector stream generated by the encoder language model representing the plurality of training texts, the plurality of training texts, and a particular author associated with each training text. The particular author may include one or more co-authors of the associated training text, or may be an anonymous author associated with training texts for which an author is not known. The requested author may include one or more of the particular authors of the plurality of training texts.
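For the predicted-author indication mentioned above, a minimal sketch (again with illustrative hash-based embeddings and invented corpora in place of trained models) can score the input against per-author centroids:

```python
# Toy sketch of a predicted-author classification: score the input's
# mean vector against per-author centroids built from each author's
# training text. Embeddings, corpora, and names are illustrative.
import hashlib
import numpy as np

DIM = 64

def embed(word: str) -> np.ndarray:
    """Deterministic stand-in for a learned word embedding."""
    seed = int(hashlib.md5(word.lower().encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).standard_normal(DIM)

def mean_vector(text: str) -> np.ndarray:
    return np.mean([embed(w) for w in text.split()], axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_author(text: str, corpus: dict[str, str]) -> str:
    v = mean_vector(text)
    return max(corpus, key=lambda author: cosine(mean_vector(corpus[author]), v))

corpus = {
    "Abraham Lincoln": "four score and seven years ago our fathers",
    "William Shakespeare": "what light through yonder window breaks",
}
print(predict_author("score seven years ago", corpus))  # Abraham Lincoln
```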
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

The term "data processing apparatus" encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, subprograms, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA or an ASIC.

Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media, and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user, for example, by sending web pages to a web browser on the user's device in response to requests received from the web browser.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a backend component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a frontend component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such backend, middleware, or frontend components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for transforming and classifying text based on analysis of training texts from particular authors. One of the methods includes receiving an input text including one or more words and a requested author; generating a vector stream representing the input text based on an encoder language model and including one or more multi-dimensional vectors associated with associated words of the words of the input text and representing a distribution of contexts in which the associated words occurred in a plurality of training texts; and producing an output text representing a particular transformation of the input text based at least in part on a decoder language model, the generated vector stream, and the requested author.

Description

CROSS REFERENCE TO RELATED APPLICATION
This application claims the benefit of U.S. Provisional Application No. 62/202,601, filed Aug. 7, 2015, the contents of which are hereby incorporated by reference in their entirety.
BACKGROUND
This specification describes technologies that relate to transforming and classifying text based on analysis of training texts from particular authors.
Text authoring applications, e.g., word processors, email clients, web browsers, and other applications, accept text input from a user via a keyboard or other input device. In some cases, these applications may allow text to be formatted and arranged by the users. Some applications may analyze the input text to identify common errors, for example, spelling errors, grammar errors, or formatting errors.
SUMMARY
This specification describes technologies that relate to rewriting text in a requested linguistic style. In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving an input text including one or more words and a name of a requested author; generating a vector stream representing the input text based on an encoder language model, wherein the vector stream includes one or more multi-dimensional vectors each associated with one or more associated words of the words of the input text and representing a distribution of contexts in which the associated words occurred in a plurality of training texts processed by the encoder language model; and producing an output text representing a particular transformation of the input text based at least in part on a decoder language model, the generated vector stream, and the requested author, wherein the decoder language model stores distributions of words used by particular authors in the plurality of training texts that caused the encoder language model to produce particular vectors representing the words.
Another innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving an input text including one or more words and a name of a requested author; generating a vector stream representing the input text based on an encoder language model, wherein the vector stream includes one or more multi-dimensional vectors each associated with one or more associated words of the words of the input text and representing a distribution of contexts in which the associated words occurred in a plurality of training texts processed by the encoder language model; and producing a classification of the input text based on a decoder language model, the generated vector stream, the input text and the author, wherein the decoder language model stores distributions of words used by particular authors in the plurality of training texts that caused the encoder language model to produce particular vectors representing the words.
The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. By allowing a user to transform input text to the style of a particular author, the input text may be changed to use words and phrases common for a particular type of writing associated with the target author, which may make it more likely that the text will be understood by an audience expecting that type of writing. Further, input text may be transformed to a style expected by an audience for the text, making it more likely that the text will be well received by that audience. For example, an input text could be transformed to a style used by an intended recipient of an email containing the input text, based on email messages previously sent by the intended recipient. Moreover, an author of an input text may be able to improve the quality of the input text by transforming it to the style of a respected author, for example, in the case of an input text author who is not a native speaker of the language of the input text.
The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows an example system for transforming and classifying text using language models trained with text from different authors.
FIG. 2 shows an example system for training an encoder language model and decoder language models.
FIG. 3 shows an example system for transforming input text into an output text rewritten according to the style of a particular author.
FIG. 4 shows an example system for producing a classification of an input text.
FIG. 5 is a flow diagram of an example process for transforming input text into an output text rewritten according to the style of a particular author.
FIG. 6 is a flow diagram of an example process for producing a classification of an input text.
Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
This specification describes techniques for transforming and classifying text using language models trained with text from different authors. For example, an input text provided by a user can be transformed into an output text written in the style of a particular author requested by the user. The transformation can be performed using language models that have previously analyzed texts written by the particular author and modeled the words the author used in the context of those texts. From this information, the language models can predict the most likely words the particular author would use in the context of the input text, and produce an output text reflecting these predictions. The output text, therefore, is a transformation of the input text into the linguistic style of the particular author. For example, given an input text of “what is that light in the window,” and a requested author of “William Shakespeare,” the input text may be transformed into an output text representing how William Shakespeare would likely have written the input text based on language models generated from analysis of his work. In such a case, the input text of “what is that light in the window” could be transformed, for example, into “what light through yonder window breaks.”
In another example, the opposite transformation (e.g., from “what light through yonder window breaks” to “what is that light in the window”) could be performed. Such a transformation may be performed by using the William Shakespeare text as input text (with Shakespeare identified as the author of the input text) and by specifying the person requesting the transformation as the requested author. The text would be transformed into the style of the person requesting the transformation based on previously analyzed and modeled text written by the person (e.g., emails, articles, etc.).
Other transformations can also be performed using these techniques. For example, a user may request that the input text be transformed into a style common to a particular group of authors, e.g., based on text produced by employees of a particular company, text by authors writing in a particular field, text by authors published in a particular journal, or other groups.
One example method for transforming input text includes receiving an input text including one or more words and a name of a requested author. A vector stream representing the input text is then generated based on an encoder language model. The vector stream includes one or more multi-dimensional vectors each associated with one or more associated words of the words of the input text, and represents a distribution of contexts in which the associated words occurred in a plurality of training texts processed by the encoder language model. An output text representing a particular transformation of the input text is then produced based at least in part on a decoder language model, the generated vector stream, and the requested author. The decoder language model stores distributions of words used by particular authors in the plurality of training texts that caused the encoder language model to produce particular vectors representing the words.
Using the present techniques, an input text may also be classified using language models trained with text from different authors. For example, an input text by a particular author can be classified as either “satire” or “non-satire.” In another example, an input text can be classified according to the most likely author to have written the input text.
One example method for classifying input texts includes receiving an input text including one or more words and a name of a requested author. A vector stream representing the input text is generated based on an encoder language model. The vector stream includes one or more multi-dimensional vectors each associated with one or more associated words of the words of the input text, and represents a distribution of contexts in which the associated words occurred in a plurality of training texts processed by the encoder language model. A classification of the input text is then produced based on a decoder language model, the generated vector stream, the input text and the author. The decoder language model stores distributions of words used by particular authors in the plurality of training texts that caused the encoder language model to produce particular vectors representing the words.
FIG. 1 shows an example system 100 for transforming and classifying text using language models trained with text from different authors. The system 100 includes the user device 108 connected to a text processing system 114 by a network 112. In operation, the user device 108 sends an input text 130 and a name of a requested author 132 to the text processing system 114 over the network 112. A text processing engine 150 in the text processing system 114 transforms the input text 130 into an output text 134 using the encoder language model 162 and one or more decoder language models 164. In some cases, the text processing engine 150 can classify the input text 130 using the encoder language model 162 and the decoder language models 164. The encoder language model 162 and the decoder language models 164 are generated by a language modeling engine 170 in the text processing system 114. The language modeling engine 170 analyzes text sources 180 to generate the encoder language model 162 and the decoder language models 164. These processes are described in greater detail below.
The system 100 includes the user device 108 that is used by a user 102 to access the text processing system 114. The user device 108 may be a computing device configured to receive text input from the user 102, including, for example, a desktop computer, a laptop computer, a phone, a tablet, or other types of computing device. The user device 108 may include one or more input devices allowing the user to enter input text, including, but not limited to, a keyboard, a touchscreen, a speech recognition system, a mouse, or other input devices. The user device 108 will generally include a memory 104, e.g., a random access memory (RAM), flash, or other storage device, for storing instructions and data and a processor 110 for executing stored instructions.
The user device 108 also includes a processor 110. Although illustrated as a single processor 110 in FIG. 1, two or more processors may be included in particular implementations of system 100. The processor 110 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another suitable component. The processor 110 may also be a single processor core of a larger processor including multiple integrated processor cores.
The user device 108 may also include a text processing application 116 configured to receive input text from the user 102, for example, through a text input device such as a keyboard, or by identification of a text document or other text resource. In some cases, the text processing application 116 may be a software application executed by the processor 110 and stored in the memory 104, for example, a word processor, an email client, a web browser, a presentation application, a graphics application, or other types of applications that allow the user to input or identify text.
The text processing application 116 may allow the user to select a requested linguistic style to transform the input text. For example, the text processing application 116 may present the user with a list of available authors, and allow the user to select the requested author from the list. In some cases, the list can allow the user to select a group of authors including multiple authors.
The user device 108 is connected to the text processing system 114 by a data communication network 112. The network 112 may be a public or private network configured to send information electronically between connected devices. The network 112 can use one or more communications protocols for sending the information, for example, ETHERNET, Internet Protocol (IP), Transmission Control Protocol (TCP), User Datagram Protocol (UDP), SONET, cellular data protocols such as CDMA and LTE, 802.11x wireless protocols, or other protocols. In some cases, the network 112 can include a local area network (LAN) or wide area network (WAN), e.g., the Internet, or a combination of networks, any of which may include wireless links.
The user device 108 sends the input text 130 and the name of the requested author 132 to the text processing system 114 over the network 112. In some cases, the text processing application 116 may also allow the user to specify a particular portion of the entered text as input text 130, for example, by allowing the user to select the input text 130 using an input device. The input text 130 may also be entered directly by the user.
The text processing system 114 may include a server or set of servers connected to the network 112 and operable to perform the operations described below. The text processing system 114 may include one or more processors and one or more memories for performing these operations.
The text processing engine 150 receives the input text 130 and the name of the requested author 132, and transforms the input text 130 into an output text 134 written in the requested linguistic style. In some cases, the text processing engine 150 may classify the input text 130 based on the language models 162 and 164. The text processing engine 150 transforms the input text 130 based on the encoder language model 162 and the decoder language models 164. In some implementations, the text processing engine 150 can be a software program or set of software programs executed by the text processing system 114 to perform these operations. In some cases, the text processing engine 150 may receive an indication from the user of a type of transformation or classification to perform on the input text 130, and may select an appropriate decoder language model 164 to perform the requested transformation or classification. For example, the user may request that the input text 130 be rewritten in the style of the requested author. In some cases, each decoder language model 164 is configured and trained to perform a different type of transformation or classification.
The encoder language model 162 represents distributions of contexts in which words or groups of words, e.g., phrases, occurred in text sources 180 processed by the encoder language model 162. In some cases, the encoder language model 162 includes an artificial neural network model trained using the text sources 180. The artificial neural network model can model the words or phrases occurring in the input text as points in a high-dimensional space. For example, the artificial neural network model can use the word2vec library (available at https://code.google.com/p/word2vec/) to represent the context distributions of words in the text sources 180. The artificial neural network model can also use other techniques, e.g., Bag of Words (BOW), recurrent neural network (RNN) models, long short-term memory (LSTM) models, or other techniques or combinations of techniques. These techniques can also be varied to, for example, include longer time span averages and explicit attention mechanisms. The artificial neural network model can take text as input and produce an output vector mapping each of the words or phrases in the input text to a point in the high-dimensional space. During training, the text from the text sources 180 can be passed as input to the encoder language model 162. At runtime (i.e., after training), the input text 130 can be passed as input to obtain a vector stream representation of the input text 130.
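As a concrete illustration of this encoding step, the sketch below builds word vectors with the gensim library's implementation of word2vec, used here as an assumed stand-in for the word2vec library cited above; the two-sentence corpus and the hyperparameters are illustrative only.

```python
# Illustrative sketch: train a small word2vec model and produce a
# vector stream for an input text. gensim is an assumed stand-in for
# the word2vec library cited in the text; corpus and parameters are toy.
from gensim.models import Word2Vec

training_texts = [
    "four score and seven years ago our fathers brought forth a new nation",
    "we are met on a great battle field of that war",
]
sentences = [t.split() for t in training_texts]

model = Word2Vec(sentences, vector_size=64, window=5, min_count=1, epochs=50)

# The "vector stream" for an input text is one vector per known word.
vector_stream = [model.wv[w] for w in "seven years ago".split()]
print(len(vector_stream), vector_stream[0].shape)  # 3 (64,)
```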
During training, the vector streams produced by the encoder language model 162 are passed as input to the one or more decoder language models 164, along with an author of the corresponding text source 180 and the text itself. The decoder language models 164 are configured to produce particular output text or classifications for a given vector stream, author, and input text combination. For example, a particular decoder language model 164 may be configured and trained to produce output text representing a transformation of an input text to the particular style of the requested author. In this case, the decoder language model 164 may represent distributions of words used by particular authors in the text sources 180 that were mapped to particular word vectors by the encoder language model 162. Given a particular vector, the decoder language model 164 can produce the word or phrase the requested author would most likely use.
At runtime, the input text 130 is processed by the encoder language model 162 to produce a vector stream representing the input text 130. The vector stream and the requested author 132 are passed to one of the decoder language models 164. The decoder language model 164 examines the vector stream and the name of the requested author 132, and produces an appropriate transformation (e.g., an output text) or an appropriate classification (e.g., satire/non-satire) depending on the task it is configured and trained to perform.
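A hedged sketch of this runtime path, composing the hypothetical encode() function and AuthorDecoder module from the sketches above; greedy decoding and the id_to_word lookup table are further illustrative assumptions:

    import numpy as np
    import torch

    def transform(input_text, author_id, decoder, id_to_word):
        # Encode the input text 130 into a vector stream.
        vectors = torch.tensor(np.stack(encode(input_text)))[None]
        # Decode under the requested author 132, picking the most
        # likely word at each position (greedy decoding).
        logits = decoder(vectors, torch.tensor([author_id]))
        return " ".join(id_to_word[int(i)] for i in logits.argmax(-1)[0])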
Although FIG. 1 shows the user device interacting with the text processing system 114 over a network 112, in some implementations, some or all of the components of the text processing system 114 may be integrated into the user device 108 and the network 112 may be omitted. In some cases, the text processing system 114 may be a distributed computing environment including multiple computing devices connected by a network. In such cases, the analysis performed by the encoder language model 162 and the decoder language models 164 may be performed across multiple computing devices at least partially in parallel.
FIG. 2 shows an example system 200 for training the encoder language model 162 and the decoder language models 164. The language modeling engine 170 presents input text 206 from text sources 202 to the encoder language model 162, which produces a vector stream 208 representing the input text 206. The language modeling engine 170 also presents an author 204 associated with the input text 206 to the encoder language model 162.
The encoder language model 162 presents the vector stream 208 representing the input text 206, the author 204, and the input text 206 to each of the decoder language models 164. In some cases, the language modeling engine 170 presents the author 204 to the decoder language models 164 directly, while in other cases the author 204 is passed by the encoder language model 162 to the decoder language models 164. Each decoder language model 164 produces an output 210 representing a transformation or classification of the input text 206 based on the author 204 and the vector stream 208, as described above.
The language modeling engine 170 receives the outputs 210 from each decoder language model 164, and analyzes them for errors. For example, the language modeling engine 170 may compare an output 210 to an expected output given the input text 206. If the output 210 differs from the expected output, the language modeling engine 170 indicates an error 212 to the decoder language model 164 that produced the output 210. The decoder language model 164 updates its representation of the vectors in the vector stream 208 in response to the error 212. The decoder language model 164 also back-propagates the error 212 to the encoder language model 162, which corrects its representations in response.
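A minimal sketch of this comparison-and-correction loop, assuming both models are differentiable PyTorch modules so the error 212 can flow from the decoder back into the encoder; the cross-entropy loss is an illustrative stand-in for the comparison step, and encoder_model and training_batches are hypothetical names:

    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(
        list(encoder_model.parameters()) + list(decoder.parameters()))

    for text_vectors, author_id, expected_ids in training_batches:
        vector_stream = encoder_model(text_vectors)    # cf. vector stream 208
        logits = decoder(vector_stream, author_id)     # cf. output 210
        # Compare the output with the expected output; the loss plays the
        # role of the error 212, and backward() propagates it through the
        # decoder and onward into the encoder.
        error = loss_fn(logits.flatten(0, 1), expected_ids.flatten())
        optimizer.zero_grad()
        error.backward()
        optimizer.step()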
The text sources 202 may be documents selected as representative of a particular author. For example, the text source 202 shown in FIG. 2 includes text from the Gettysburg Address, and may represent the style of the author "Abraham Lincoln." By analyzing a large number of text sources 202, the accuracy of the encoder language model 162 and the decoder language models 164 can be improved. In some implementations, the text sources 202 may include webpages, scanned text from books or periodicals, ASCII or Unicode text files, Portable Document Format (PDF) files, email messages sent or received by a particular person or group of persons (representing an email style of the particular persons), or other types of text. In some cases, text sources 202 without a known author may be analyzed. Analysis of such text sources can be performed by associating all text sources 202 without a known author with an anonymous author. Doing so may assist in forming a representation of general word usage in the English language that is not particular to a single author.
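The anonymous-author grouping might be sketched as a one-line preprocessing rule; the metadata dictionary below is a hypothetical stand-in for whatever source information accompanies each text:

    # Texts with no known author all share a single anonymous label.
    author = metadata.get("author") or "ANONYMOUS"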
FIG. 3 shows an example system 300 for transforming input text 302 into an output text 312 rewritten according to the style of a particular author 304. The depicted interaction generally occurs after the encoder language model 162 and the decoder language models 164 are trained, e.g., as shown in FIG. 2. The system 300 includes an author transformation decoder 310, which is one of the decoder language models 164 configured to perform a transformation of input text 302 to the style of a particular author.
The encoder language model 162 is presented with input text 302. In the illustrated example, the input text 302 is "87 years ago." The encoder language model 162 analyzes the input text 302 and produces a vector stream 308 representing the input text 302. The encoder language model 162 presents the vector stream 308 to the author transformation decoder 310 along with the input text 302. The author transformation decoder 310 also takes the author 304 as input, which in this example is "Abraham Lincoln." In some cases, the encoder language model 162 receives the author 304 as input, and passes it along to the author transformation decoder 310. The author transformation decoder 310 produces the output text 312 representing the input text 302 rewritten in the style of the requested author 304. In the example shown, the input text "87 years ago" has been transformed into the output text "Four score and seven years ago," which represents the input text as it would likely have been written by Abraham Lincoln.
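In terms of the hypothetical transform() sketch above, the FIG. 3 interaction might look like the following, where author_ids is an assumed name-to-index mapping:

    output_text = transform("87 years ago",
                            author_ids["Abraham Lincoln"],
                            decoder, id_to_word)
    # With a sufficiently trained model, the hoped-for output is
    # "four score and seven years ago".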
FIG. 4 shows an example system 400 for producing a classification 406 of an input text 402. The depicted interaction generally occurs after the encoder language model 162 and the decoder language models 164 are trained, e.g., as shown in FIG. 2. The system 400 includes a classification decoder 410, which is one of the decoder language models 164 configured to classify input text as satire or non-satire.
The encoder language model 162 is presented with input text 402. The encoder language model 162 analyzes the input text 402 and produces a vector stream 408 representing the input text 402. The encoder language model 162 presents the vector stream 408 to the classification decoder 410 along with the input text 402. The classification decoder 410 also takes the author 404 as input. In some cases, the encoder language model 162 receives the author 404 as input, and passes it along to the classification decoder 410. The classification decoder 410 produces the classification 406 for the input text 402 based on the vector stream 408, the input text 402, and the author 404. For example, the input text 402 may be text purporting to be a news story, and the classification 406 may be an indication of whether the news story is legitimate news or satire.
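A classification decoder can share the author-conditioning structure of the transformation sketch above while emitting a single label per text. Below is a hedged PyTorch sketch; the two-class satire/non-satire head and the layer sizes are assumptions for illustration:

    class ClassificationDecoder(nn.Module):
        """Hedged sketch of a classification decoder (cf. 410)."""
        def __init__(self, num_authors, vec_dim=100, hidden=256, classes=2):
            super().__init__()
            self.author_emb = nn.Embedding(num_authors, vec_dim)
            self.rnn = nn.LSTM(vec_dim * 2, hidden, batch_first=True)
            self.out = nn.Linear(hidden, classes)

        def forward(self, vector_stream, author_id):
            a = self.author_emb(author_id).unsqueeze(1)
            a = a.expand(-1, vector_stream.size(1), -1)
            # Use the final hidden state to summarize the whole text.
            _, (h_n, _) = self.rnn(torch.cat([vector_stream, a], dim=-1))
            return self.out(h_n[-1])   # e.g., satire / non-satire logits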
FIG. 5 is a flow diagram of an example process 500 for transforming input text into an output text rewritten according to the style of a particular author. An input text including one or more words, and a name of a requested author are received (502). A vector stream is generated representing the input text using an encoder language model (504). In some cases, the vector stream includes one or more multi-dimensional vectors each associated with one or more associated words of the words of the input text and represents a distribution of contexts in which the associated words occurred in a plurality of training texts processed by the encoder language model.
An output text is produced representing a particular transformation of the input text based at least in part on a decoder language model, the generated vector stream, and the name of the requested author (506). In some cases, the decoder language model stores distributions of words used by particular authors in the plurality of training texts that caused the encoder language model to produce particular vectors representing the words. In some implementations, the particular transformation of the input text is a transformation of the input text into text written in the style of the requested author. In some cases, an original author of the input text is also received, and producing the output text is performed based at least in part on the original author. The encoder language model and decoder language model may be neural network models.
The process 500 may also include training the encoder language model using at least the plurality of training texts and training the decoder language model using at least a vector stream generated by the encoder language model representing the plurality of training texts, the plurality of training texts, and a particular author associated with each training text. The particular author may include one or more co-authors of the associated training text. In some cases, the particular author may be an anonymous author associated with training texts for which an author is not known. In some cases, the requested author includes one or more of the particular authors of the plurality of training texts.
FIG. 6 is a flow diagram of an example process 600 for producing a classification of an input text. An input text including one or more words, and data identifying a requested author are received (602). A vector stream is generated representing the input text based on an encoder language model (604). The vector stream includes one or more multi-dimensional vectors each associated with one or more associated words of the words of the input text and representing a distribution of contexts in which the associated words occurred in a plurality of training texts processed by the encoder language model.
A classification of the input text is produced based on a decoder language model, the generated vector stream, the input text, and the author (606). In some cases, the decoder language model stores distributions of words used by particular authors in the plurality of training texts that caused the encoder language model to produce particular vectors representing the words. In some cases, the classification of the input text includes a satire indication, a non-satire indication, a predicted author indication, or a relevance indication. In some cases, an original author of the input text is also received, and producing the classification is performed based at least in part on the original author. The encoder language model and decoder language model may be neural network models.
In some cases, the process 600 includes training the encoder language model using at least the plurality of training texts, and training the decoder language model using at least a vector stream generated by the encoder language model representing the plurality of training texts, the plurality of training texts, and a particular author associated with each training text. The particular author may include one or more co-authors of the associated training text. In some cases, the particular author may be an anonymous author associated with training texts for which an author is not known. The requested author may include one or more of the particular authors of the plurality of training texts.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, subprograms, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for the execution of a computer program can be based, by way of example, on general or special purpose microprocessors or both, or on any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CDROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's user device in response to requests received from the web browser.
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a backend component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a frontend component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such backend, middleware, or frontend components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (13)

What is claimed is:
1. A method performed by a system comprising one or more computers to generate an output text in a style of a requested author from an input text, wherein the output text and the input text are written in a same natural language, the system comprising an encoder language model and a decoder language model, wherein:
the encoder and decoder language models have been trained with text from multiple authors, the text from multiple authors comprising a plurality of training texts;
as a result of training, the encoder language model stores data representing words occurring in the plurality of training texts from the multiple authors as respective vectors, wherein each vector represents a respective distribution of contexts in the plurality of training texts of a respective word from the plurality of training texts;
as a result of training, the decoder language model (i) stores the distributions of contexts of words used by particular respective authors in the plurality of training texts and (ii) is configured to perform a transformation of a stream of vectors from the encoder language model to generate text in the natural language according to distributions of contexts of words used by a decoder author, the decoder author being one of the multiple authors;
the encoder and decoder language models have been trained by performing the following operations for each of multiple training input texts each having a respective author:
presenting each training input text to the encoder language model;
receiving from the encoder language model a training vector stream representing the training input text, wherein the training vector stream includes vectors that are each (i) associated with a word from the input text and (ii) based on the distribution of contexts of the word in the plurality of training texts;
presenting the training vector stream, an author of the training input text, and the training input text to the decoder language model;
receiving a respective decoder output training text from the decoder language model based on the author, the training input text, and the training vector stream;
comparing the decoder output of the decoder language model with an expected output for the author and the training input text, wherein the expected output is the training input text;
if the comparing indicates a difference for a particular author, indicating an error; and
in the case of an error, updating the decoder language model, including updating the decoder language model's representation of vectors in the training vector stream, and back-propagating the error to the encoder language model, which updates a representation of the encoder language model;
the method using the encoder language model and the decoder language model after training, the method comprising:
receiving an input text including one or more words and a name of a requested author, wherein the requested author is one of the multiple authors;
generating a vector stream of vectors by the encoder language model, each vector in the vector stream representing the distribution of contexts in which a respective word of the input text appears in training input texts; and
producing an output text from the vector stream by the decoder language model according to the distributions of contexts of words used by the requested author, whereby the output text is a transformation of the input text to a style of the requested author.
2. The method of claim 1, wherein the requested author includes one or more co-authors.
3. The method of claim 1, wherein the requested author is an anonymous author associated with training texts for which an author is not known.
4. The method of claim 1, further comprising receiving a name of an original author of the input text, wherein producing the output text is performed based at least in part on the original author.
5. The method of claim 1, wherein the encoder language model and decoder language models are artificial neural network models.
6. A method performed by a system comprising one or more computers to produce a classification of an input text, the system comprising an encoder language model and one classification decoder, wherein:
the encoder language model and the classification decoder have been trained with text from multiple authors, the text from multiple authors comprising a plurality of training texts;
as a result of training, the encoder language model stores data representing words occurring in the plurality of training texts from the multiple authors as respective vectors, wherein each vector represents a respective distribution of contexts in the plurality of training texts of a respective word from the plurality of training texts;
as a result of training, the classification decoder (i) stores the distributions of contexts of words used by particular respective authors in the plurality of training texts and (ii) is configured to classify the input text based on distributions of contexts of words used by an author of the input text, the author of the input text being one of the multiple authors;
the encoder language model and the classification decoder have been trained by performing the following operations for each of multiple training input texts each having a respective author:
presenting each training input text to the encoder language model;
receiving from the encoder language model a training vector stream representing the training input text, wherein the training vector stream includes vectors that are each (i) associated with a word from the input text and (ii) based on the distribution of contexts of the word in the plurality of training texts;
presenting the training vector stream, an author of the training input text, and the training input text to the classification decoder;
receiving a classification from the classification decoder based on the author, the training input text, and the training vector stream;
comparing the classification of the classification decoder with an expected classification for the author and the training input text;
if the comparing indicates a difference, indicating an error;
in the case of an error, updating the classification decoder, including updating the classification decoder's representation of vectors in the training vector stream, and back-propagating the error to the encoder language model, which updates a representation of the encoder language model;
the method using the encoder language model and the classification decoder after training, the method comprising:
receiving an input text, wherein the author of the input text is one of the multiple authors;
generating a vector stream of vectors by the encoder language model, each vector in the vector stream representing the distribution of contexts in which a respective word of the input text appears in the training input texts; and
producing a classification of the input text from the vector stream by the classification decoder according to the distributions of contexts of words used by the authors of the training texts.
7. The method of claim 6, wherein the classification of the input text includes a satire indication, a non-satire indication, a predicted author indication, or a relevance indication.
8. The method of claim 6, wherein the author of the input text is one or more co-authors, and wherein the classification of the input text includes an indication of each predicted co-author.
9. The method of claim 6, wherein the author of the input text is an anonymous author.
10. The method of claim 6, further comprising receiving a name of an original author of the input text, wherein producing the classification of the input text is performed based at least in part on the original author.
11. The method of claim 6, wherein the encoder language model and classification decoder are artificial neural network models.
12. A system for generating an output text in a style of a requested author from an input text, wherein the output text and the input text are written in a same natural language, the system comprising:
memory for storing data and one or more processors, the memory and processors configured to run an encoder language model and a decoder language model, wherein:
the encoder and decoder language models have been trained with text from multiple authors, the text from multiple authors comprising a plurality of training texts;
as a result of training, the encoder language model stores data representing words occurring in the plurality of training texts from the multiple authors as respective vectors, wherein each vector represents a respective distribution of contexts in the plurality of training texts of a respective word from the plurality of training texts;
as a result of training, the decoder language model (i) stores the distributions of contexts of words used by particular respective authors in the plurality of training texts and (ii) is configured to perform a transformation of a stream of vectors from the encoder language model to generate text in the natural language according to distributions of contexts of words used by a decoder author, the decoder author being one of the multiple authors;
the encoder and decoder language models have been trained by performing the following operations for each of multiple training input texts each having a respective author:
presenting each training input text to the encoder language model;
receiving from the encoder language model a training vector stream representing the training input text, wherein the training vector stream includes vectors that are each (i) associated with a word from the input text and (ii) based on the distribution of contexts of the word in the plurality of training texts;
presenting the training vector stream, an author of the training input text, and the training input text to the decoder language model;
receiving a respective decoder output training text from the decoder language model based on the author, the training input text, and the training vector stream;
comparing the decoder output of the decoder language model with an expected output for the author and the training input text, wherein the expected output is the training input text;
if the comparing indicates a difference for a particular author, indicating an error;
in the case of an error, updating the decoder language model, including updating the decoder language model's representation of vectors in the training vector stream, and back-propagating the error to the encoder language model, which updates a representation of the encoder language model;
the system operable to perform operations, using the encoder language model and the decoder language model after training, comprising:
receiving an input text including one or more words and a name of a requested author, wherein the requested author is one of the multiple authors;
generating a vector stream of vectors by the encoder language model, each vector in the vector stream representing the distribution of contexts in which a respective word of the input text appears in the training input texts; and
producing an output text from the vector stream by the decoder language model according to the distribution of contexts of words used by the requested author, whereby the output text is a transformation of the input text to a style of the requested author.
13. The system of claim 12, wherein the encoder language model and decoder language model are Word2Vec models.
US15/229,743 2015-08-07 2016-08-05 Text classification and transformation based on author Active US10083157B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/229,743 US10083157B2 (en) 2015-08-07 2016-08-05 Text classification and transformation based on author

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201562202601P 2015-08-07 2015-08-07
US15/229,743 US10083157B2 (en) 2015-08-07 2016-08-05 Text classification and transformation based on author

Publications (2)

Publication Number Publication Date
US20170039174A1 US20170039174A1 (en) 2017-02-09
US10083157B2 true US10083157B2 (en) 2018-09-25

Family

ID=56609751

Family Applications (1)

Application Number Title Priority Date Filing Date
US15/229,743 Active US10083157B2 (en) 2015-08-07 2016-08-05 Text classification and transformation based on author

Country Status (3)

Country Link
US (1) US10083157B2 (en)
EP (1) EP3128439A1 (en)
CN (1) CN106997370B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402495B1 (en) * 2016-09-01 2019-09-03 Facebook, Inc. Abstractive sentence summarization
US20200110797A1 (en) * 2018-10-04 2020-04-09 International Business Machines Corporation Unsupervised text style transfer system for improved online social media experience
US11250219B2 (en) * 2018-04-04 2022-02-15 International Business Machines Corporation Cognitive natural language generation with style model
US11256872B2 (en) 2019-10-29 2022-02-22 International Business Machines Corporation Natural language polishing using vector spaces having relative similarity vectors

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9773166B1 (en) * 2014-11-03 2017-09-26 Google Inc. Identifying longform articles
US11550751B2 (en) * 2016-11-18 2023-01-10 Microsoft Technology Licensing, Llc Sequence expander for data entry/information retrieval
US10417341B2 (en) 2017-02-15 2019-09-17 Specifio, Inc. Systems and methods for using machine learning and rules-based algorithms to create a patent specification based on human-provided patent claims such that the patent specification is created without human intervention
US11023662B2 (en) * 2017-02-15 2021-06-01 Specifio, Inc. Systems and methods for providing adaptive surface texture in auto-drafted patent documents
US11593564B2 (en) 2017-02-15 2023-02-28 Specifio, Inc. Systems and methods for extracting patent document templates from a patent corpus
US10621371B1 (en) 2017-03-30 2020-04-14 Specifio, Inc. Systems and methods for facilitating editing of a confidential document by a non-privileged person by stripping away content and meaning from the document without human intervention such that only structural and/or grammatical information of the document are conveyed to the non-privileged person
CN110402445B (en) * 2017-04-20 2023-07-11 谷歌有限责任公司 Method and system for browsing sequence data using recurrent neural network
KR102592677B1 (en) * 2017-05-23 2023-10-23 구글 엘엘씨 Attention-based sequence transduction neural networks
US20190065486A1 (en) * 2017-08-24 2019-02-28 Microsoft Technology Licensing, Llc Compression of word embeddings for natural language processing systems
US10831927B2 (en) 2017-11-22 2020-11-10 International Business Machines Corporation Noise propagation-based data anonymization
EP3696810B1 (en) * 2017-12-15 2024-06-12 Google LLC Training encoder model and/or using trained encoder model to determine responsive action(s) for natural language input
CN108172209A (en) * 2018-01-09 2018-06-15 上海大学 Build voice idol method
US10776581B2 (en) 2018-02-09 2020-09-15 Salesforce.Com, Inc. Multitask learning as question answering
CN108536679B (en) * 2018-04-13 2022-05-20 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer readable storage medium
US10915712B2 (en) * 2018-07-26 2021-02-09 International Business Machines Corporation Unsupervised tunable stylized text transformations
KR20200023664A (en) * 2018-08-14 2020-03-06 삼성전자주식회사 Response inference method and apparatus
CN109447706B (en) * 2018-10-25 2022-06-21 深圳前海微众银行股份有限公司 Method, device and equipment for generating advertising copy and readable storage medium
CN109635253B (en) * 2018-11-13 2024-05-28 平安科技(深圳)有限公司 Text style conversion method and device, storage medium and computer equipment
CN110516227A (en) * 2019-03-28 2019-11-29 苏州八叉树智能科技有限公司 Title text generation method, device, electronic equipment and computer-readable medium
US10977439B2 (en) 2019-04-01 2021-04-13 International Business Machines Corporation Controllable style-based text transformation
US20200364303A1 (en) * 2019-05-15 2020-11-19 Nvidia Corporation Grammar transfer using one or more neural networks
CN110287461B (en) * 2019-05-24 2023-04-18 北京百度网讯科技有限公司 Text conversion method, device and storage medium
CN110688834B (en) * 2019-08-22 2023-10-31 创新先进技术有限公司 Method and equipment for carrying out intelligent manuscript style rewriting based on deep learning model
US11270684B2 (en) * 2019-09-11 2022-03-08 Artificial Intelligence Foundation, Inc. Generation of speech with a prosodic characteristic
CN110852043B (en) * 2019-11-19 2023-05-23 北京字节跳动网络技术有限公司 Text transcription method, device, equipment and storage medium
CN110909179B (en) * 2019-11-29 2022-07-08 思必驰科技股份有限公司 Method and system for optimizing text generation model
CN111507101B (en) * 2020-03-03 2020-12-15 杭州电子科技大学 Ironic detection method based on multi-level semantic capsule routing
CN111737983B (en) * 2020-06-22 2023-07-25 网易(杭州)网络有限公司 Text writing style processing method, device, equipment and storage medium
CN112699242A (en) * 2021-01-11 2021-04-23 大连东软信息学院 Method for identifying Chinese text author
CN113468857B (en) * 2021-07-13 2024-03-29 北京百度网讯科技有限公司 Training method and device for style conversion model, electronic equipment and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060265209A1 (en) * 2005-04-26 2006-11-23 Content Analyst Company, Llc Machine translation using vector space representations
US20070073532A1 (en) * 2005-09-29 2007-03-29 Microsoft Corporation Writing assistance using machine translation techniques
US20070288458A1 (en) * 2006-06-13 2007-12-13 Microsoft Corporation Obfuscating document stylometry
US20090300046A1 (en) * 2008-05-29 2009-12-03 Rania Abouyounes Method and system for document classification based on document structure and written style
US20110320191A1 (en) * 2009-03-13 2011-12-29 Jean-Pierre Makeyev Text creation system and method
US20120251016A1 (en) 2011-04-01 2012-10-04 Kenton Lyons Techniques for style transformation
US20140222928A1 (en) * 2013-02-06 2014-08-07 Msc Intellectual Properties B.V. System and method for authorship disambiguation and alias resolution in electronic data
US8903719B1 (en) * 2010-11-17 2014-12-02 Sprint Communications Company L.P. Providing context-sensitive writing assistance
US8935154B1 (en) * 2012-04-13 2015-01-13 Symantec Corporation Systems and methods for determining authorship of an unclassified notification message
US20150269125A1 (en) * 2014-03-19 2015-09-24 Microsoft Corporation Normalizing message style while preserving intent
US20160162576A1 (en) * 2014-12-05 2016-06-09 Lightning Source Inc. Automated content classification/filtering
US9720978B1 (en) * 2014-09-30 2017-08-01 Amazon Technologies, Inc. Fingerprint-based literary works recommendation system
US20170228591A1 (en) * 2015-04-29 2017-08-10 Hewlett-Packard Development Company, L.P. Author identification based on functional summarization

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9374451B2 (en) * 2002-02-04 2016-06-21 Nokia Technologies Oy System and method for multimodal short-cuts to digital services
CN103488711B (en) * 2013-09-09 2017-06-27 北京大学 A kind of method and system of quick Fabrication vector font library

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060265209A1 (en) * 2005-04-26 2006-11-23 Content Analyst Company, Llc Machine translation using vector space representations
US20070073532A1 (en) * 2005-09-29 2007-03-29 Microsoft Corporation Writing assistance using machine translation techniques
US20070288458A1 (en) * 2006-06-13 2007-12-13 Microsoft Corporation Obfuscating document stylometry
US20090300046A1 (en) * 2008-05-29 2009-12-03 Rania Abouyounes Method and system for document classification based on document structure and written style
US20110320191A1 (en) * 2009-03-13 2011-12-29 Jean-Pierre Makeyev Text creation system and method
US8903719B1 (en) * 2010-11-17 2014-12-02 Sprint Communications Company L.P. Providing context-sensitive writing assistance
US20120251016A1 (en) 2011-04-01 2012-10-04 Kenton Lyons Techniques for style transformation
US8935154B1 (en) * 2012-04-13 2015-01-13 Symantec Corporation Systems and methods for determining authorship of an unclassified notification message
US20140222928A1 (en) * 2013-02-06 2014-08-07 Msc Intellectual Properties B.V. System and method for authorship disambiguation and alias resolution in electronic data
US20150269125A1 (en) * 2014-03-19 2015-09-24 Microsoft Corporation Normalizing message style while preserving intent
US9720978B1 (en) * 2014-09-30 2017-08-01 Amazon Technologies, Inc. Fingerprint-based literary works recommendation system
US20160162576A1 (en) * 2014-12-05 2016-06-09 Lightning Source Inc. Automated content classification/filtering
US20170228591A1 (en) * 2015-04-29 2017-08-10 Hewlett-Packard Development Company, L.P. Author identification based on functional summarization

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
European Search Report in European Application No. 16139014.6, dated Nov. 29, 2016, 8 pages.
Goldberg et al. "Word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method," arXiv 1402.3722v1, Feb. 15, 2014, 5 pages.
Khosmood et al. "Automatic Natural Language Classification and Transformation," BCS Corpus Profiling Workshop, 2008, London, UK.
Khosmood et al. "Automatic Synonym and Phrase Replacement Show Promise for Style Transformation," 2010 Ninth IEEE International Conference on Machine Learning and Applications, Dec. 2010, Washington DC, USA, pp. 12-14.
Khosmood et al. "Toward Automated Stylistic Transformation of Natural Language Text," Digital Humanities 2009, Washington DC USA.
Khosmood et al., "Automatic Synonym and Phrase Replacement Show Promise for Style Transformation", published Dec. 2010, published by IEEE 2010 Ninth International Conference on Machine Learning and Applications, pp. 958-961. *
Mazur, "A Step by Step Backpropagation Example", published Mar. 17, 2015, published at https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/, pp. 1-9. *
Mikolov et al. "Efficient Estimation of Word Representations in Vector Space," arXiv 1301.3781v3, Sep. 7, 2013, 12 pages.
Perlroth, "Software Helps Identify Anonymous Writers or Helps Them Stay That Way", Jan. 3, 2012, published by The New York Times, located at https://bits.blogs.nytimes.com/2012/01/03/software-helps-identify-anonymous-writers-or-helps-them-stay-that-way/. *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402495B1 (en) * 2016-09-01 2019-09-03 Facebook, Inc. Abstractive sentence summarization
US20190347328A1 (en) * 2016-09-01 2019-11-14 Facebook, Inc. Abstractive sentence summarization
US10643034B2 (en) * 2016-09-01 2020-05-05 Facebook, Inc. Abstractive sentence summarization
US11250219B2 (en) * 2018-04-04 2022-02-15 International Business Machines Corporation Cognitive natural language generation with style model
US20200110797A1 (en) * 2018-10-04 2020-04-09 International Business Machines Corporation Unsupervised text style transfer system for improved online social media experience
US11256872B2 (en) 2019-10-29 2022-02-22 International Business Machines Corporation Natural language polishing using vector spaces having relative similarity vectors

Also Published As

Publication number Publication date
EP3128439A1 (en) 2017-02-08
US20170039174A1 (en) 2017-02-09
CN106997370A (en) 2017-08-01
CN106997370B (en) 2021-06-01

Similar Documents

Publication Publication Date Title
US10083157B2 (en) Text classification and transformation based on author
US20240078386A1 (en) Methods and systems for language-agnostic machine learning in natural language processing using feature extraction
US11657231B2 (en) Capturing rich response relationships with small-data neural networks
US20240070392A1 (en) Computing numeric representations of words in a high-dimensional space
US11651218B1 (en) Adversartail training of neural networks
CN107066449B (en) Information pushing method and device
US20220350965A1 (en) Method for generating pre-trained language model, electronic device and storage medium
US11443170B2 (en) Semi-supervised training of neural networks
US9690772B2 (en) Category and term polarity mutual annotation for aspect-based sentiment analysis
US20200184307A1 (en) Utilizing recurrent neural networks to recognize and extract open intent from text inputs
US20210264203A1 (en) Multimodal Image Classifier using Textual and Visual Embeddings
US20180197105A1 (en) Security classification by machine learning
US11675975B2 (en) Word classification based on phonetic features
US9697819B2 (en) Method for building a speech feature library, and method, apparatus, device, and computer readable storage media for speech synthesis
CN111680159A (en) Data processing method and device and electronic equipment
US10628525B2 (en) Natural language processing of formatted documents
US12073181B2 (en) Systems and methods for natural language processing (NLP) model robustness determination
US10755171B1 (en) Hiding and detecting information using neural networks
US20220358280A1 (en) Context-aware font recommendation from text
US9208142B2 (en) Analyzing documents corresponding to demographics
US11880798B2 (en) Determining section conformity and providing recommendations
WO2022125096A1 (en) Method and system for resume data extraction
CN116796758A (en) Dialogue interaction method, dialogue interaction device, equipment and storage medium
US20240028646A1 (en) Textual similarity model for graph-based metadata
Datta et al. A supervised machine learning approach to fake news identification

Legal Events

Date Code Title Description
AS Assignment

Owner name: GOOGLE INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:STROPE, BRIAN PATRICK;HENDERSON, MATTHEW STEEDMAN;SIGNING DATES FROM 20150921 TO 20150924;REEL/FRAME:039813/0592

AS Assignment

Owner name: GOOGLE LLC, CALIFORNIA

Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:044129/0001

Effective date: 20170929

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4