WO2023217019A1 - Text processing method, device, storage medium, electronic device and system - Google Patents

Text processing method, device, storage medium, electronic device and system

Info

Publication number
WO2023217019A1
Authority
WO
WIPO (PCT)
Prior art keywords
phrase
text
target
rewritten
index
Prior art date
Application number
PCT/CN2023/092453
Other languages
English (en)
French (fr)
Inventor
曹军
孙泽维
王明轩
欧阳宇星
程亦曲
庞赛康
胡凯
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023217019A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • Embodiments of the present disclosure relate to a text processing method, device, storage medium, electronic device and system.
  • the input text obtained may not properly express its intended meaning. Reasonable intervention (such as rewriting) therefore needs to be applied to such text so that it better expresses that meaning.
  • the present disclosure provides a text processing method, including:
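  • obtaining target example text that needs to be rewritten and a target phrase rewriting example pair corresponding to the target example text, where
  • the target phrase rewriting example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase;
  • generating an index relationship according to the target example text and the target phrase rewriting example pair; storing the index relationship in an index database; and rewriting the obtained input text according to the index relationship in the index database.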
  • a text processing device including:
  • the first acquisition module is used to obtain the target example text that needs to be rewritten and the target phrase rewritten example pair corresponding to the target example text.
  • the target phrase rewriting example pair includes the target example rewritten phrase and the target example replacement phrase corresponding to the target example rewritten phrase.
  • a generation module configured to generate an index relationship according to the target example text and the target phrase rewriting example pair
  • a storage module used to store the index relationship into an index database
  • a rewriting module configured to rewrite the obtained input text according to the index relationship in the index database.
  • the present disclosure provides a computer-readable medium having a computer program stored thereon, which implements the steps of the method described in the first aspect when executed by a processing device.
  • an electronic device including:
  • a processing device configured to execute the computer program in the storage device to implement the steps of the method in the first aspect.
  • the present disclosure provides a text processing system, including:
  • an intervention platform, used to obtain the target example text that needs to be rewritten and the target phrase rewriting example pair corresponding to the target example text,
  • where the target phrase rewriting example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase;
  • an index server, configured to obtain the target example text and the target phrase rewriting example pair from the intervention platform, generate an index relationship based on them, and
  • store the index relationship in the index database; the index server is also configured to rewrite the obtained input text according to the index relationship in the index database.
  • since the index relationship can be generated from the obtained target example text and target phrase rewriting example pair and stored directly in the index database, online intervention in the index database can be achieved without taking the index database offline, and the input text is rewritten through the index relationship in the index database. This solves the problem that, when a model is used for text rewriting, the model needs to be updated offline, which affects the real-time nature of online text rewriting.
  • Figure 1 is a schematic diagram of a text processing system according to an exemplary embodiment of the present disclosure
  • Figure 2 is a flow chart of a text processing method according to an exemplary embodiment of the present disclosure
  • Figure 3 is a schematic diagram of generating an index relationship according to an exemplary embodiment of the present disclosure
  • Figure 4 is a schematic structural diagram of a BERT model according to an exemplary embodiment of the present disclosure.
  • Figure 5 is a block diagram of a text processing device according to an exemplary embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure.
  • the term “include” and its variations are open-ended, i.e., “including but not limited to.”
  • the term “based on” means “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; and the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • a prompt message is sent to the user to clearly remind the user that the operation requested will require the acquisition and use of the user's personal information. Therefore, users can autonomously choose whether to provide personal information to software or hardware such as electronic devices, applications, servers or storage media that perform the operations of the technical solution of the present disclosure based on the prompt information.
  • the method of sending prompt information to the user may be, for example, a pop-up window, and the prompt information may be presented in the form of text in the pop-up window.
  • the pop-up window can also contain a selection control for the user to choose "agree” or "disagree” to provide personal information to the electronic device.
  • model retraining involves re-adjusting and re-learning the model parameters, which is usually completed in an offline stage; this affects the real-time performance of text rewriting in actual industrial application scenarios.
  • bringing the model back online may also involve redeploying the environment, which further affects the real-time performance of text rewriting in actual industrial application scenarios.
  • embodiments of the present disclosure provide a text processing method, device, storage medium, electronic device and system, which effectively ensures the real-time nature of text rewriting processing.
  • FIG. 1 is a schematic diagram of a text processing system according to an exemplary embodiment of the present disclosure.
  • the text processing method can be applied to the intervention side of the text processing system.
  • the intervention platform on the intervention side is used to obtain the target example text that needs to be rewritten and the target phrase rewriting example pair corresponding to the target example text
  • the index server (shown as the index service in Figure 1) is used to obtain the target example text and the target phrase rewriting example pair from the intervention platform, generate an index relationship based on them, and store the index relationship in the index database (shown as the vector index in Figure 1);
  • the index server is also used to rewrite the obtained input text according to the index relationship in the index database.
  • the index server can generate the index relationship when receiving an index processing request initiated by the intervention platform (RPC (Remote Procedure Call) between the intervention platform and the index service in Figure 1).
  • the intervention platform on the intervention side is used to receive the sample text that needs to be rewritten and the phrase rewriting sample pair corresponding to the sample text input by the expert.
  • recalled text corresponding to the example rewritten phrase is recalled from the pre-built inverted index of texts and phrases in the corpus database (shown as the corpus inverted index in Figure 1).
  • the intervention platform on the intervention side is also used to store the example text, the recalled text, and the phrase rewriting example pair in the intervention database.
  • when the index server receives an index processing request initiated by the intervention platform, it can perform an initial load from the intervention database to obtain the example text, the recalled text, and the phrase rewriting example pair, and then generate the index relationship.
  • the index server on the intervention side is used to manage the index relationships in the index database (shown as the vector index in Figure 1).
  • this management can be, for example, adding an index relationship to the index database (which can be understood as the generation of the index relationship described above), deleting an index relationship from the index database, or modifying an index relationship in the index database.
  • the index server on the intervention side is also used to process text processing requests (RPC between the application side and the index service in Figure 1) initiated by the application side (illustrated as the application side in Figure 1).
  • the text processing request carries the input text entered by the user and is sent to the index server through the application side, so that the input text reaches the index server.
  • the index server responds to the text processing request and implements rewriting processing of the input text input by the user.
  • the text processing request initiated by the application side corresponds to the generation service shown in Figure 1.
  • the generation service includes +intervention() and +generation().
  • +intervention() can call the corresponding retrieval service provided by the index server (shown as +retrieve() in Figure 1) to determine whether the input text needs to be rewritten, and to rewrite the input text if it does.
  • +generation() can be used to process the rewritten text; for example,
  • the rewritten text may be translated to obtain the translated text.
  • the present disclosure can be applied to scenarios including, but not limited to, text translation, text summarization, and intelligent dialogue.
  • the following uses a text translation scenario as an example to explain the text processing method provided by embodiments of the present disclosure.
  • in this scenario, translating Chinese text into English text is taken as an example.
  • FIG. 2 is a flowchart of a text processing method according to an exemplary embodiment of the present disclosure. Referring to Figure 2, the method includes the following steps:
  • Step S201: Obtain the target example text that needs to be rewritten and the target phrase rewriting example pair corresponding to the target example text.
  • the target phrase rewriting example pair includes the target example rewritten phrase and the target example replacement phrase corresponding to the target example rewritten phrase.
  • Step S202: Generate an index relationship according to the target example text and the target phrase rewriting example pair.
  • Step S203: Store the index relationship in the index database.
  • Step S204: Rewrite the obtained input text according to the index relationship in the index database.
  • the target example text can be "This dish tastes great, the chef really has two brushes" (a literal rendering of a Chinese idiom in which "has two brushes" means having real skill)
  • the target example rewritten phrase can be "two brushes"
  • the target example replacement phrase corresponding to the target example rewritten phrase can be "Something”
  • the target example replacement phrase corresponding to the target example rewrite phrase can also be "something”.
  • the index relationship represents a key-value pair relationship
  • the value corresponding to the key can be determined based on the key in the index relationship.
  • the semantic information of the target example rewritten phrase in the target example text can be used as a key
  • the target example replacement phrase corresponding to the target example rewritten phrase can be used as a value to generate an index relationship.
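  A minimal sketch of the key-value idea above, not the disclosed implementation. The names `PhraseRewritePair`, `index_database` and `add_index_relationship` are hypothetical, and the three-dimensional vector is made up for illustration. An exact-match dictionary is used only to show the key/value roles; a real vector index would look keys up by nearest-neighbor search.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class PhraseRewritePair:
    """A phrase to be rewritten and the phrase that replaces it."""
    rewritten_phrase: str
    replacement_phrase: str


# Hypothetical in-memory stand-in for the index database.
index_database: Dict[Vector, PhraseRewritePair] = {}


def add_index_relationship(key_vector: Sequence[float], pair: PhraseRewritePair) -> Vector:
    """Store one key-value index relationship and return its key."""
    key = tuple(key_vector)  # lists are unhashable, so freeze the vector
    index_database[key] = pair
    return key


key = add_index_relationship(
    [0.01, 0.02, -0.03],  # made-up semantic vector of "two brushes" in its context
    PhraseRewritePair("two brushes", "something"),
)
print(index_database[key].replacement_phrase)  # something
```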
  • the target example rewritten phrase corresponding to the input text refers to a target example rewritten phrase whose contextual semantic information in the target example text
  • is the same as the contextual semantic information, in the input text, of the phrase identical to that target example rewritten phrase.
  • since the index relationship can be generated from the obtained target example text and target phrase rewriting example pair and stored directly in the index database, online intervention in the index database can be achieved without taking the index database offline, and the input text is rewritten through the index relationship in the index database. This solves the problem that, when a model is used for text rewriting, the model needs to be updated offline, which affects the real-time nature of online text rewriting.
  • the target example text can include example text entered by an expert that needs to be rewritten,
  • and recalled text that is recalled from the corpus based on the example text and also needs to be rewritten.
  • step S201 shown in FIG. 2 can be implemented in the following manner: obtaining the input example text that needs to be rewritten and the phrase rewriting example pair corresponding to the example text.
  • the phrase rewriting example pair includes the example rewritten phrase and the example replacement phrase corresponding to the example rewritten phrase.
  • the sample text may be sample text input by experts that needs to be rewritten.
  • the inverted index originates from the need to find records based on attribute values in practical applications.
  • Each item in this index table includes an attribute value and the address of each record with the attribute value.
  • the example rewrite phrase is used as the value of the attribute
  • the recall text is used as the address of each record with the attribute value.
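  The inverted index described above can be sketched as follows. This is a simplified illustration that assumes plain substring matching; `build_inverted_index` and `recall` are hypothetical names, and the three sentences reuse the examples quoted in this description.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def build_inverted_index(corpus: Iterable[str], phrases: Iterable[str]) -> Dict[str, List[str]]:
    """Map each phrase (the attribute value) to the texts (the records) containing it."""
    phrase_list = list(phrases)
    index: Dict[str, List[str]] = defaultdict(list)
    for text in corpus:
        for phrase in phrase_list:
            if phrase in text:
                index[phrase].append(text)
    return dict(index)


def recall(index: Dict[str, List[str]], phrase: str, exclude: Optional[str] = None) -> List[str]:
    """Return the texts indexed under `phrase`, skipping the example text itself."""
    return [text for text in index.get(phrase, []) if text != exclude]


example_text = "This dish tastes great, the chef really has two brushes"
corpus = [
    example_text,
    "He scored three goals in one game, he really has two brushes",
    "This painter used two brushes in total",
]
inverted_index = build_inverted_index(corpus, ["two brushes"])
recalled = recall(inverted_index, "two brushes", exclude=example_text)
print(recalled)
```

  Plain substring matching also recalls the painter sentence, where "two brushes" is meant literally. An index built on the contextual semantics of the phrase, as the description mentions, would be expected to filter that sentence out.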
  • the intervention platform provides an input interface for the example text and the phrase rewriting example pair corresponding to the example text. After these are entered, the intervention platform provides a request interface.
  • the request interface is used by the intervention platform to call the add service of the index service (shown as +New() in Figure 1), which establishes and stores the index relationship between the entered example text and its corresponding phrase rewriting example pair.
  • pre-built inverted indexes of text and phrases may be constructed from data in a web corpus.
  • for example, the pre-built inverted index of texts and phrases includes the inverted index relationships shown in Table 1 for "This dish tastes great, the chef really has two brushes", "He scored three goals in one game, he really has two brushes" and "This painter used two brushes in total".
  • suppose the input phrase rewriting example pair is "two brushes - something" and the example text is "This dish tastes great, the chef really has two brushes". Based on the example rewritten phrase "two brushes" in the pair, "He scored three goals in one game, he really has two brushes" can be recalled from the above inverted index.
  • the above-mentioned inverted index can be established based on the contextual semantics of the phrase in the text.
  • the contextual semantics of the example rewritten phrase in "He scored three goals in one game, he really has two brushes" are the same as
  • the contextual semantics of the example rewritten phrase in the example text
  • ("This dish tastes great, the chef really has two brushes").
  • index relationships may be generated using vector representations.
  • step S202 shown in Figure 2 can be implemented in the following manner: determining a first vector representation of the target example rewritten phrase in the target example text, where the first vector representation characterizes the contextual semantic information of the target example rewritten phrase in the target example text; and generating the index relationship according to the first vector representation and the target phrase rewriting example pair.
  • the first vector representation of the target example rewritten phrase in the target example text may be determined through a pre-trained BERT model.
  • Referring to Figure 3, "This dish tastes great, the chef really has two brushes" in Figure 3 is the target example text, and "two brushes" in Figure 3 is the target example rewritten phrase.
  • the "intervention word” in Figure 3 is the target example replacement phrase.
  • "two brushes" is encoded by the pre-trained model, and the resulting vector (i.e., the first vector representation) [0.01, 0.02, -0.03, ..., 0.05, 0.37] is used as the key in the index relationship; "two brushes" and the "intervention word" form a mapping that is used as the value in the index relationship, and the generated index relationship is stored in the vector index.
  • the first vector representation can be determined from the token vectors, output by the last layer of the pre-trained BERT model, that correspond to each character of the target example rewritten phrase.
  • Figure 4 shows the structure of a BERT model.
  • the BERT model includes 12 layers of encoders. Each layer of encoders is used to encode the input of the layer encoder to obtain a token vector.
  • there are 9 input characters; correspondingly, the last layer outputs the token vector for each character one by one.
  • the average of the token vectors, output by the last layer, that correspond to each character of the target example rewritten phrase (rendered here as "two", "bar", "brush" and "sub") can be used as the corresponding first vector representation.
  • the input of the BERT model in Figure 4 is only a part of the target example text. In practical applications, the entire target example text can also be used as the input of the BERT model to obtain the first vector representation.
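  The averaging step described above can be sketched without a real model. The sketch assumes the last-layer token vectors are already available as plain lists; `mean_pool`, the two-dimensional vectors and the span positions are all made up for illustration and do not come from an actual BERT run.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]], start: int, end: int) -> List[float]:
    """Average the token vectors at positions [start, end) into one phrase vector."""
    span = token_vectors[start:end]
    if not span:
        raise ValueError("phrase span is empty")
    dim = len(span[0])
    return [sum(vec[i] for vec in span) / len(span) for i in range(dim)]


# Made-up 2-dimensional "last layer" outputs for a 4-token text;
# the phrase to be rewritten occupies token positions 1 and 2.
last_layer = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
first_vector = mean_pool(last_layer, 1, 3)
print(first_vector)  # [2.0, 3.0]
```

  Because every token vector is computed from the whole input, the pooled vector reflects the phrase in its context rather than the phrase in isolation.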
  • the text processing method may further include: responding to an update request for the index database, updating the index relationship in the index database, where the update request includes one of a delete request and a modification request.
  • the method shown in FIG. 2 can be understood as adding a new index relationship to the index database, which is one way of updating the index database.
  • the intervention side can also provide other services for the index relationships stored in the index database, so as to update the index relationships themselves that have already been stored there.
  • the update request can be implemented through the index service shown in Figure 1; the corresponding request is generated by calling the corresponding service.
  • the update request may carry the identifier of the index relationship that needs to be updated. According to this identification, the corresponding index relationship can be found in the index database, and then the index relationship can be deleted or changed. For example, the change may be to change the intervention words mentioned in the above embodiment.
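  A minimal sketch of identifier-based update handling, under the assumption that requests arrive as small dictionaries. The class `IndexDatabase`, the request fields `type`, `id` and `replacement_phrase`, and the identifier `"r1"` are hypothetical; the source only states that a request carries the identifier of the relationship to delete or change.

```python
from typing import Dict, List, NamedTuple, Optional


class IndexRelationship(NamedTuple):
    key_vector: List[float]
    rewritten_phrase: str
    replacement_phrase: str  # the "intervention word"


class IndexDatabase:
    """Hypothetical in-memory store of index relationships keyed by identifier."""

    def __init__(self) -> None:
        self._relations: Dict[str, IndexRelationship] = {}

    def add(self, relation_id: str, relation: IndexRelationship) -> None:
        self._relations[relation_id] = relation

    def get(self, relation_id: str) -> Optional[IndexRelationship]:
        return self._relations.get(relation_id)

    def handle_update(self, request: dict) -> bool:
        """Apply a delete or modify request; return False if the id is unknown."""
        relation_id = request["id"]
        if relation_id not in self._relations:
            return False
        if request["type"] == "delete":
            del self._relations[relation_id]
        elif request["type"] == "modify":
            old = self._relations[relation_id]
            self._relations[relation_id] = old._replace(
                replacement_phrase=request["replacement_phrase"]
            )
        else:
            raise ValueError(f"unsupported update type: {request['type']}")
        return True


db = IndexDatabase()
db.add("r1", IndexRelationship([0.01, 0.02], "two brushes", "something"))
db.handle_update({"type": "modify", "id": "r1", "replacement_phrase": "real skill"})
print(db.get("r1").replacement_phrase)  # real skill
```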
  • step S204 shown in Figure 2 can be implemented in the following manner: in response to the obtained input text, when the input text includes a phrase to be rewritten, determining according to the index relationship in the index database whether the input text is text that needs to be rewritten; and when it is determined that the input text is text that needs to be rewritten, rewriting the phrase to be rewritten in the input text according to the index relationship corresponding to the input text.
  • if the input text entered by the user does not include a phrase to be rewritten, nothing in the input text needs to be rewritten.
  • if the input text includes a phrase to be rewritten, it is still necessary to determine that the input text is text that needs to be rewritten before rewriting the phrase according to the index relationship corresponding to the input text, so as to reduce the probability of rewriting text incorrectly.
  • the dictionary tree, also known as a word search tree (trie), is a tree structure with high query efficiency. A dictionary tree works much like a dictionary. To check whether a word is in the dictionary tree, first check whether the first letter of the word is on the first level of the tree. If it is not, the word is not in the dictionary tree. If it is, look for the second letter of the word among the child nodes of that letter. If it is absent, the word is not in the tree; if it is present, continue searching in the same way. Therefore, building the dictionary tree from the target example rewritten phrases improves the efficiency of determining whether the input text includes a phrase to be rewritten, and further improves real-time performance.
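  The lookup procedure above can be sketched as a character-level trie. The names `PhraseTrie` and `phrases_to_rewrite` are hypothetical, and the segmented phrase results are written by hand here; the description leaves the word segmentation method open.

```python
from typing import Dict, Iterable, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_phrase_end = False


class PhraseTrie:
    """Dictionary tree built from the phrases that may need rewriting."""

    def __init__(self, phrases: Iterable[str] = ()) -> None:
        self.root = TrieNode()
        for phrase in phrases:
            self.insert(phrase)

    def insert(self, phrase: str) -> None:
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
        node.is_phrase_end = True

    def contains(self, phrase: str) -> bool:
        # Walk one level per character; fail as soon as a character is missing.
        node = self.root
        for ch in phrase:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_phrase_end


def phrases_to_rewrite(phrase_results: Iterable[str], trie: PhraseTrie) -> List[str]:
    """Keep only the segmented phrases that are present in the trie."""
    return [phrase for phrase in phrase_results if trie.contains(phrase)]


trie = PhraseTrie(["two brushes"])
segments = ["the chef", "really has", "two brushes"]  # hand-written segmentation
print(phrases_to_rewrite(segments, trie))  # ['two brushes']
```

  A lookup costs time proportional to the length of the phrase, regardless of how many phrases are stored.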
  • the index relationship is composed of a first vector representation and a target phrase rewritten example pair.
  • the first vector representation is used to represent the contextual semantic information of the target example rewritten phrase in the target example text. Therefore, whether the input text is text that needs to be rewritten can be determined by whether the first vector representation matches the contextual semantic information of the phrase to be rewritten in the input text.
  • the second vector representation of the phrase to be rewritten in the input text is obtained, where
  • the second vector representation characterizes the contextual semantic information of the phrase to be rewritten in the input text; the index database is then searched, according to the second vector representation, for the target vector representation closest to it.
  • the first vector representation in the index database that is closest to the second vector representation is the target vector representation.
  • the second vector representation is determined in a similar way to the first vector representation.
  • for the detailed determination method, refer to the related embodiments above; it is not repeated in this embodiment.
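  • when the distance between the target vector representation and the second vector representation is less than a preset distance threshold, the input text is determined to be text that needs to be rewritten. The preset distance threshold can be set according to actual conditions and is not limited in this embodiment.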
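  The decision rule can be sketched as: find the closest stored vector, then rewrite only if its distance is below the threshold. This sketch assumes Euclidean distance and an exhaustive scan purely for clarity; the source does not name the distance metric, and `rewrite_if_close`, the vectors and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (first vector representation, rewritten phrase, replacement phrase)
IndexEntry = Tuple[Sequence[float], str, str]


def rewrite_if_close(
    text: str,
    phrase: str,
    second_vector: Sequence[float],
    index: List[IndexEntry],
    threshold: float,
) -> str:
    """Rewrite `phrase` in `text` only if the nearest stored vector is close enough."""
    if not index or phrase not in text:
        return text
    # Exhaustive nearest-neighbor scan (assumption: Euclidean distance).
    target_vector, rewritten_phrase, replacement = min(
        index, key=lambda entry: math.dist(entry[0], second_vector)
    )
    distance = math.dist(target_vector, second_vector)
    if rewritten_phrase == phrase and distance < threshold:
        return text.replace(phrase, replacement)
    return text


index = [([0.0, 0.0], "two brushes", "something")]
print(rewrite_if_close("The chef really has two brushes", "two brushes", [0.1, 0.0], index, 0.5))
print(rewrite_if_close("The painter used two brushes", "two brushes", [3.0, 4.0], index, 0.5))
```

  The first call is within the threshold and yields the rewritten sentence; the second is at distance 5.0 from the stored vector and is returned unchanged.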
  • the data structure of the index database can be a graph structure, and the graph structure can be an HNSW (Hierarchical Navigable Small World) graph structure.
  • for the search algorithms, refer to the related art; they are not described in detail in this embodiment.
  • a naive search algorithm can be used to search the index database for the target vector representation closest to the second vector representation, thus avoiding brute-force retrieval.
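  The source does not spell out the search algorithm. One reading, offered here only as an assumption, is the greedy neighbor-to-neighbor walk commonly used on navigable small-world graphs. The sketch below runs that walk on a single-layer graph; real HNSW adds multiple layers and candidate lists, and `greedy_search`, the graph and the vectors are made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def greedy_search(
    graph: Dict[int, List[int]],
    vectors: Dict[int, Sequence[float]],
    query: Sequence[float],
    entry: int,
) -> Tuple[int, float]:
    """Walk from `entry` to ever-closer neighbors until no neighbor improves."""
    current = entry
    current_dist = math.dist(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = math.dist(vectors[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:  # local minimum: no neighbor is closer to the query
            return current, current_dist
        current, current_dist = best, best_dist


# Four stored vectors on a line, linked as a chain 0 - 1 - 2 - 3.
vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [2.0, 0.0], 3: [3.0, 0.0]}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
node, dist = greedy_search(graph, vectors, [2.9, 0.0], entry=0)
print(node)
```

  Only the neighbors of visited nodes are compared with the query, which is how a graph index avoids scanning every stored vector. The walk can stop at a local minimum, so the result is approximate in general.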
  • Input text 1: This dish tastes great and the chef really has two brushes.
  • Input text 2: He writes code very well and really has two brushes.
  • Input text 3: The painter used a total of two brushes.
  • Input text 4: This game is boring, the players are just scumbags.
  • the second vector representation corresponding to input text 1 is: [0.01, 0.02, -0.03, ..., 0.05, 0.37].
  • the second vector representation corresponding to input text 2 is: [0.09, 0.04, -0.01, ..., 0.17, 0.07].
  • the second vector representation corresponding to input text 3 is: [0.06, 0.12, -0.93, ..., 0.85, 0.17].
  • a naive search algorithm is used to find the target vector representation closest to the second vector representation of each of input text 1, input text 2 and input text 3. The distance between each second vector representation and its corresponding target vector representation is then obtained for input text 1, input text 2 and input text 3:
  • the distance between the second vector representation of input text 3 and its corresponding target vector representation is 200.
  • Figure 5 is a block diagram of a text processing device according to an exemplary embodiment of the present disclosure.
  • the text processing device 500 includes:
  • the first acquisition module 501, used to obtain the target example text that needs to be rewritten and the target phrase rewriting example pair corresponding to the target example text.
  • the target phrase rewriting example pair includes the target example rewritten phrase and the target example replacement phrase corresponding to the target example rewritten phrase;
  • a generating module 502, configured to generate an index relationship according to the target example text and the target phrase rewriting example pair;
  • a rewriting module 504, configured to rewrite the obtained input text according to the index relationship in the index database.
  • the first acquisition module 501 includes:
  • the first acquisition sub-module is used to obtain the input example text that needs to be rewritten and the phrase rewriting example pair corresponding to the example text.
  • the phrase rewriting example pair includes an example rewritten phrase and an example replacement phrase corresponding to the example rewritten phrase;
  • the recall sub-module is used to recall the recalled text corresponding to the example rewritten phrase pair in the pre-built inverted index of text and phrase according to the example rewritten phrase in the phrase rewritten example pair;
  • the first determination sub-module is used to determine the recalled text and the example text as the target example text, and determine the phrase rewritten example pair as the target phrase rewritten example pair.
  • the generation module 502 includes:
  • the second determination sub-module is used to determine the first vector representation of the target example rewritten phrase in the target example text, where the first vector representation characterizes the contextual semantic information of the target example rewritten phrase in the target example text;
  • a generating sub-module is configured to generate an index relationship according to the first vector representation and the target phrase rewritten example pair.
  • the device 500 also includes:
  • a response module configured to respond to an update request for the index database and update the index relationship in the index database, where the update request includes one of a deletion request and a modification request.
  • the rewriting module 504 includes:
  • a response submodule configured to respond to the obtained input text, and determine whether the input text is text that needs to be rewritten based on the index relationship in the index database when the input text includes a phrase to be rewritten.
  • the rewriting submodule is configured to rewrite the phrase to be rewritten in the input text according to the index relationship corresponding to the input text when it is determined that the input text is text that needs to be rewritten.
  • the device 500 also includes:
  • a word segmentation module used to segment the input text to obtain multiple phrase results
  • a matching module configured to search, for each phrase result, a pre-constructed phrase dictionary tree for a phrase that matches the phrase result, where the phrase dictionary tree is constructed from the target example rewritten phrases;
  • a first determination module configured to determine that the input text includes the phrase to be rewritten if the phrase corresponding to the phrase result is successfully matched.
  • the index relationship is composed of a first vector representation and the target phrase rewriting example pair, and the first vector representation characterizes the contextual semantic information of the target example rewritten phrase in the target example text.
  • the rewriting sub-module includes:
  • An acquisition unit configured to acquire a second vector representation of the phrase to be rewritten in the input text, where the second vector representation is used to characterize the contextual semantic information of the phrase to be rewritten in the input text;
  • a search unit configured to search the index database for a target vector representation that is closest to the second vector representation according to the second vector representation
  • a determining unit configured to determine that the input text is text that needs to be rewritten when the distance between the target vector representation and the second vector representation is less than a preset distance threshold.
  • the data structure of the index database is a graph structure, and
  • the search unit is configured to use a naive search algorithm to search the index database, according to the second vector representation, for the target vector representation closest to the second vector representation.
  • Terminal devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (tablets), PMPs (Portable Multimedia Players) and vehicle-mounted terminals (such as car navigation terminals), and fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 6 is only an example and should not impose any limitations on the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603.
  • various programs and data required for the operation of the electronic device 600 are also stored in the RAM 603.
  • the processing device 601, ROM 602 and RAM 603 are connected to each other via a bus 604.
  • An input/output (I/O) interface 605 is also connected to bus 604.
  • Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609.
  • Communication device 609 may allow electronic device 600 to communicate wirelessly or wiredly with other devices to exchange data.
  • Although FIG. 6 illustrates the electronic device 600 with various means, it should be understood that not all of the illustrated means are required to be implemented or provided. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via communication device 609, or from storage device 608, or from ROM 602.
  • When the computer program is executed by the processing device 601, the above functions defined in the method of the embodiment of the present disclosure are performed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
  • In some embodiments, electronic devices can communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (for example, a communication network).
  • Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device.
  • The computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device: obtains target example text that needs to be rewritten and a target phrase rewritten example pair corresponding to the target example text, where the target phrase rewritten example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase; generates an index relationship according to the target example text and the target phrase rewritten example pair; stores the index relationship in an index database; and rewrites the obtained input text according to the index relationship in the index database.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
  • The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • In the case of a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider).
  • Each block in the flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • Each block of the block diagram and/or flowchart illustration, and combinations of blocks in the block diagram and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure can be implemented in software or hardware.
  • the name of the module does not constitute a limitation on the module itself under certain circumstances.
  • For example, the first acquisition module can also be described as "a module that obtains the target example text that needs to be rewritten and the target phrase rewritten example pair corresponding to the target example text".
  • The functions described above may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • Example 1 provides a text processing method, including:
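  • obtaining target example text that needs to be rewritten and a target phrase rewritten example pair corresponding to the target example text, where the target phrase rewritten example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase;
  • generating an index relationship according to the target example text and the target phrase rewritten example pair;
  • storing the index relationship in an index database;
  • rewriting the obtained input text according to the index relationship in the index database.
  • Example 2 provides the method of Example 1, where obtaining the target example text that needs to be rewritten and the target phrase rewritten example pair corresponding to the target example text includes:
  • obtaining input example text that needs to be rewritten and a phrase rewritten example pair corresponding to the example text, where the phrase rewritten example pair includes an example rewritten phrase and an example replacement phrase corresponding to the example rewritten phrase;
  • recalling, according to the example rewritten phrase in the phrase rewritten example pair, recalled text corresponding to the phrase rewritten example pair from a pre-built inverted index of text and phrases;
  • determining the recalled text and the example text as the target example text, and determining the phrase rewritten example pair as the target phrase rewritten example pair.
  • Example 3 provides the method of Example 1, where generating an index relationship according to the target example text and the target phrase rewritten example pair includes:
  • determining a first vector representation of the target example rewritten phrase in the target example text, where the first vector representation characterizes the contextual semantic information of the target example rewritten phrase in the target example text;
  • generating an index relationship based on the first vector representation and the target phrase rewritten example pair.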
  • Example 4 provides the method of Example 1, the method further comprising:
  • the index relationship in the index database is updated, where the update request includes one of a delete request and a modification request.
  • Example 5 provides the method of any one of Examples 1-4, wherein text rewriting of the acquired input text according to the index relationship in the index database includes:
  • in response to the obtained input text, if the input text includes a phrase to be rewritten, determining whether the input text is text that needs to be rewritten based on the index relationship in the index database;
  • if the input text is determined to be text that needs to be rewritten, rewriting the phrase to be rewritten in the input text according to the index relationship corresponding to the input text.
  • Example 6 provides the method of Example 5, the method further comprising:
  • segmenting the input text into words to obtain multiple phrase results;
  • for each phrase result, matching the phrase result against a pre-constructed phrase dictionary tree, where the phrase dictionary tree is constructed from the target example rewritten phrases;
  • if a phrase corresponding to the phrase result is successfully matched, determining that the input text includes the phrase to be rewritten.
  • Example 7 provides the method of Example 5, where the index relationship is composed of a first vector representation and the target phrase rewritten example pair, the first vector representation characterizes the contextual semantic information of the target example rewritten phrase in the target example text, and determining whether the input text is text that needs to be rewritten according to the index relationship in the index database includes:
  • obtaining a second vector representation of the phrase to be rewritten in the input text, where the second vector representation characterizes the contextual semantic information of the phrase to be rewritten in the input text;
  • searching the index database, according to the second vector representation, for the target vector representation closest to the second vector representation;
  • if the distance between the target vector representation and the second vector representation is less than a preset distance threshold, determining that the input text is text that needs to be rewritten.
  • Example 8 provides the method of Example 7, where the data structure of the index database is a graph structure, and searching the index database, according to the second vector representation, for the target vector representation closest to the second vector representation includes:
  • using a naive search algorithm to search the index database, according to the second vector representation, for the target vector representation closest to the second vector representation.
  • Example 9 provides a text processing device, including:
  • a first acquisition module, configured to obtain target example text that needs to be rewritten and a target phrase rewritten example pair corresponding to the target example text, where the target phrase rewritten example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase;
  • a generation module, configured to generate an index relationship according to the target example text and the target phrase rewritten example pair;
  • a storage module used to store the index relationship into an index database
  • the rewriting module rewrites the obtained input text according to the index database.
  • Example 10 provides a computer-readable medium on which a computer program is stored; when executed by a processing device, the program implements the steps of the method described in any one of Examples 1-8.
  • Example 11 provides an electronic device, including:
  • a storage device on which a computer program is stored;
  • a processing device configured to execute the computer program in the storage device to implement the steps of the method in any one of Examples 1-8.
  • Example 12 provides a text processing system, including:
  • an index database;
  • an index server;
  • an intervention platform, configured to obtain target example text that needs to be rewritten and a target phrase rewritten example pair corresponding to the target example text, where the target phrase rewritten example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase;
  • the index server is configured to obtain the target example text and the target phrase rewritten example pair from the intervention platform, generate an index relationship based on the obtained target example text and target phrase rewritten example pair, and store the index relationship in the index database; the index server is also configured to rewrite the obtained input text according to the index relationship in the index database.
  • Example 13 provides the system of Example 12, further comprising:
  • a corpus database, configured to store a pre-built inverted index of text and phrases;
  • the intervention platform is also used to obtain input example text that needs to be rewritten and a phrase rewritten example pair corresponding to the example text, recall the recalled text corresponding to the phrase rewritten example pair from the inverted index of text and phrases pre-built in the corpus database, determine the recalled text and the example text as the target example text, and determine the phrase rewritten example pair as the target phrase rewritten example pair, where the phrase rewritten example pair includes the example rewritten phrase and an example replacement phrase corresponding to the example rewritten phrase.
  • Example 14 provides the system of Example 12, further comprising:
  • an intervention database;
  • the intervention platform is also used to store the obtained target example text that needs to be rewritten and the target phrase rewritten example pair corresponding to the target example text in the intervention database, and to generate an index establishment request and send it to the index server;
  • the index server is further configured to respond to the index processing request and obtain the target example text and the target phrase rewritten example pair corresponding to the target example text from the intervention database.
  • Example 15 provides the system of Example 12, further comprising:
  • the application side is used to send the input text to the index server.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

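The present disclosure relates to a text processing method, device, storage medium, electronic device and system. The method includes: obtaining target example text that needs to be rewritten and a target phrase rewritten example pair corresponding to the target example text, where the target phrase rewritten example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase; generating an index relationship according to the target example text and the target phrase rewritten example pair; storing the index relationship in an index database; and rewriting obtained input text according to the index relationship in the index database. This solves the problem that, when a model is used for text rewriting, the model must be updated offline, which impairs the real-time performance of online text rewriting.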

Description

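Text processing method, device, storage medium, electronic device and system
This application claims priority to Chinese Patent Application No. 202210495448.X, filed on May 7, 2022, the entire disclosure of which is incorporated herein by reference as part of this application.
Technical Field
Embodiments of the present disclosure relate to a text processing method, device, storage medium, electronic device and system.
Background
In the related art, the original input text that is obtained may not express its intended meaning properly. Such text therefore needs reasonable intervention processing (for example, rewriting) so that the original input text better expresses its intended meaning.
However, traditional text rewriting usually uses a model to rewrite text, and using a model involves updating it offline. In practical industrial scenarios, online real-time text processing is particularly important, so updating the model offline seriously impairs the real-time performance of online text processing.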
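Summary
This summary is provided to introduce, in a simplified form, concepts that are described in detail in the Detailed Description below. It is not intended to identify key or essential features of the claimed technical solution, nor to limit the scope of the claimed technical solution.
In a first aspect, the present disclosure provides a text processing method, including:
obtaining target example text that needs to be rewritten and a target phrase rewritten example pair corresponding to the target example text, where the target phrase rewritten example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase;
generating an index relationship according to the target example text and the target phrase rewritten example pair;
storing the index relationship in an index database;
rewriting obtained input text according to the index relationship in the index database.
In a second aspect, the present disclosure provides a text processing device, including:
a first acquisition module, configured to obtain target example text that needs to be rewritten and a target phrase rewritten example pair corresponding to the target example text, where the target phrase rewritten example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase;
a generation module, configured to generate an index relationship according to the target example text and the target phrase rewritten example pair;
a storage module, configured to store the index relationship in an index database;
a rewriting module, configured to rewrite obtained input text according to the index relationship in the index database.
In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored; when executed by a processing device, the program implements the steps of the method of the first aspect.
In a fourth aspect, the present disclosure provides an electronic device, including:
a storage device on which a computer program is stored;
a processing device, configured to execute the computer program in the storage device to implement the steps of the method of the first aspect.
In a fifth aspect, the present disclosure provides a text processing system, including:
an index database;
an index server;
an intervention platform, configured to obtain target example text that needs to be rewritten and a target phrase rewritten example pair corresponding to the target example text, where the target phrase rewritten example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase;
the index server is configured to obtain the target example text and the target phrase rewritten example pair from the intervention platform, generate an index relationship according to the obtained target example text and target phrase rewritten example pair, and store the index relationship in the index database; the index server is further configured to rewrite obtained input text according to the index relationship in the index database.
With the above technical solution, an index relationship can be generated from the obtained target example text and target phrase rewritten example pair and stored directly in the index database. Online intervention on the index database is therefore possible without taking the index database offline, and input text is rewritten through the index relationships in the index database. This solves the problem that, when a model is used for text rewriting, the model must be updated offline, which impairs the real-time performance of online text rewriting.
Other features and advantages of the present disclosure are described in detail in the Detailed Description below.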
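Brief Description of the Drawings
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
FIG. 1 is a schematic diagram of a text processing system according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flowchart of a text processing method according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram of generating an index relationship according to an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a BERT model according to an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram of a text processing device according to an exemplary embodiment of the present disclosure; and
FIG. 6 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure.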
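Detailed Description
Embodiments of the present disclosure are described in more detail below with reference to the drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be executed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this respect.
As used herein, the term "include" and its variations are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms are given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules or units, and are not used to limit the order of, or interdependence between, the functions performed by these devices, modules or units.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive. Those skilled in the art will understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
It can be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, users should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use and usage scenarios of the personal information involved in the present disclosure, and their authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to make clear that the requested operation will require obtaining and using the user's personal information. The user can then decide, based on the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application, server or storage medium, that performs the operations of the technical solution of the present disclosure.
As an optional but non-limiting implementation, the prompt information may be sent to the user, in response to the user's active request, in the form of a pop-up window in which the prompt information is presented as text. The pop-up window may also carry a selection control with which the user chooses to "agree" or "disagree" to provide personal information to the electronic device.
It can be understood that the above process of notifying the user and obtaining authorization is only illustrative and does not limit the implementations of the present disclosure; other methods that satisfy relevant laws and regulations may also be applied.
It can also be understood that the data involved in this technical solution (including but not limited to the data itself and the acquisition or use of the data) should comply with the requirements of applicable laws, regulations and related provisions.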
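As stated in the Background, a neural network model is usually trained on a large amount of text training data so that text can be rewritten with the trained model. In actual use, however, for inputs whose distribution differs significantly from the training data, the model often produces abnormal outputs, which hurts its overall performance. To address abnormal outputs, the abnormal input/output instances are usually corrected (or annotated) manually and fed back into the model for retraining. However, retraining involves readjusting and relearning the model parameters and is usually done offline, which impairs the real-time performance of text rewriting in practical industrial scenarios. In addition, bringing the model back online may also involve redeploying the environment, which further impairs that real-time performance.
In view of this, embodiments of the present disclosure provide a text processing method, device, storage medium, electronic device and system that effectively preserve the real-time performance of text rewriting.
The embodiments of the present disclosure are further explained below with reference to the drawings.
FIG. 1 is a schematic diagram of a text processing system according to an exemplary embodiment of the present disclosure. Referring to FIG. 1, the text processing method can be applied to the intervention side of the text processing system. The intervention platform on the intervention side is used to obtain target example text that needs to be rewritten and the target phrase rewritten example pair corresponding to the target example text. The index server (shown as the index service in FIG. 1) is used to obtain the target example text and the target phrase rewritten example pair from the intervention platform, generate an index relationship from them, and store the index relationship in the index database (shown as the vector index in FIG. 1). The index server is also used to rewrite obtained input text according to the index relationships in the index database. Specifically, the index server may generate the index relationship upon receiving an index processing request initiated by the intervention platform (the RPC (Remote Procedure Call) between the intervention platform and the index service in FIG. 1).
Still referring to FIG. 1, in some embodiments, the intervention platform on the intervention side is used to receive example text that needs to be rewritten, entered by an expert, together with the phrase rewritten example pair corresponding to the example text. The intervention platform is also used to recall, based on the example text, the recalled text corresponding to the phrase rewritten example pair from the pre-built inverted index of text and phrases in the corpus database (shown as the corpus inverted index in FIG. 1). The intervention platform is further used to store the example text, the recalled text and the phrase rewritten example pair in the intervention database. When the index server receives an index processing request initiated by the intervention platform, it can perform an initial load from the intervention database to obtain the example text, the recalled text and the phrase rewritten example pair, and then generate the index relationship.
Still referring to FIG. 1, in some embodiments, the index server on the intervention side is used to manage the index relationships in the index database (shown as the vector index in FIG. 1). This management may include, for example, adding index relationships to the index database (which can be understood as the generation of index relationships described above), deleting index relationships from the index database, and modifying index relationships in the index database.
Still referring to FIG. 1, in some embodiments, the index server on the intervention side is also used to process text processing requests initiated by the application side (the RPC between the application side and the index service in FIG. 1). A text processing request carries the input text entered by the user; by sending the request, the application side delivers the input text to the index server. The index server responds to the request and rewrites the user's input text. The text processing request initiated by the application side corresponds to the generation service shown in FIG. 1, which includes +干预() (intervene) and +生成() (generate). +干预() may use the retrieval service provided by the index server (shown as +检索() in FIG. 1) to determine whether the input text needs to be rewritten and, if so, to rewrite it. +生成() may be used to generate output from the rewritten text; for example, it may translate the rewritten text to obtain translated text.
In addition, the present disclosure can be applied to scenarios including but not limited to text translation, text summarization and intelligent dialogue. The text processing method provided by the embodiments of the present disclosure is explained below using the text translation scenario as an example, specifically the translation of Chinese text into English text.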
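FIG. 2 is a flowchart of a text processing method according to an exemplary embodiment of the present disclosure. Referring to FIG. 2, the method includes the following steps:
Step S201: obtain target example text that needs to be rewritten and a target phrase rewritten example pair corresponding to the target example text, where the target phrase rewritten example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase.
Step S202: generate an index relationship according to the target example text and the target phrase rewritten example pair.
Step S203: store the index relationship in an index database.
Step S204: rewrite obtained input text according to the index relationship in the index database.
It should be understood that, across languages, translating text literally can change the meaning of a sentence, i.e., the resulting translation does not express the original meaning reasonably and accurately. In such cases the text needs to be rewritten.
For example, take the target example text "这个菜味道超棒,厨师真的有两把刷子" ("This dish tastes amazing and the chef really has something"). In a translation scenario, the phrase "两把刷子" in the target example text does not refer to actual brushes and therefore cannot be translated literally as "two brushes", so the sentence needs to be rewritten.
Continuing the example, the target example text may be "这个菜味道超棒,厨师真的有两把刷子", the target example rewritten phrase may be "两把刷子", and the corresponding target example replacement phrase may be the Chinese phrase "点东西" or the English word "something". If the replacement phrase is "点东西", "两把刷子" can be replaced directly in the sentence and the resulting text is then translated into English. If the replacement phrase is "something", the whole sentence can be translated first and the English words corresponding to "两把刷子" then replaced with "something"; alternatively, "两把刷子" can first be replaced with "something" and the resulting text then translated. The specific form of rewriting does not limit the present disclosure.
It should be noted that an index relationship represents a key-value relationship: the value corresponding to a key can be determined from the key. For example, the semantic information of the target example rewritten phrase in the target example text can be used as the key, and the target example replacement phrase corresponding to the target example rewritten phrase as the value, to generate the index relationship.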
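The key-value shape of an index relationship can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `IndexRelationship`, the helper `make_index_relationship` and the three-component vector are hypothetical, and a real key would come from an encoder.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IndexRelationship:
    """One key-value entry of the index database (names are illustrative)."""
    key: Tuple[float, ...]    # semantic vector of the phrase in its example text
    rewritten_phrase: str     # phrase to be replaced, e.g. "两把刷子"
    replacement_phrase: str   # phrase to substitute, e.g. "something"


def make_index_relationship(context_vector, rewritten_phrase, replacement_phrase):
    """Pair a context vector (key) with a phrase rewritten example pair (value)."""
    return IndexRelationship(tuple(context_vector), rewritten_phrase, replacement_phrase)


# Made-up vector used only for illustration.
entry = make_index_relationship([0.01, 0.02, -0.03], "两把刷子", "something")
print(entry.rewritten_phrase, "->", entry.replacement_phrase)
```

Keeping the rewritten phrase next to the replacement phrase in the value lets a later lookup know both what to search for in the input text and what to substitute.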
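Continuing the example of the generated index relationship: the target example rewritten phrase corresponding to the input text is looked up in the index relationships, and the input text is rewritten with the target example replacement phrase corresponding to that target example rewritten phrase. For the specific ways of rewriting, refer to the example above; they are not repeated here.
It should be noted that a target example rewritten phrase "corresponds to" the input text when the contextual semantic information of the target example rewritten phrase in its target example text is the same as the contextual semantic information, in the input text, of the identical phrase that appears in the input text.
In this way, an index relationship can be generated from the obtained target example text and target phrase rewritten example pair and stored directly in the index database. Online intervention on the index database is therefore possible without taking the index database offline, and input text is rewritten through the index relationships in the index database. This solves the problem that, when a model is used for text rewriting, the model must be updated offline, which impairs the real-time performance of online text rewriting.
In practice, users' input text varies widely, but the same phrase can receive the same replacement across different sentences. The following table provides some intervention (rewriting) examples:
Table 1
Table 1 compares the expected effect of intervening, or not intervening, on different input texts. For the input text instances in Table 1, where "两把刷子" expresses different meanings in different input texts, ideal intervention processing should capture the contextual semantics of the phrase from just a few input text instances and reasonably intervene on, or refrain from intervening on, more text instances of the same kind, thereby improving the generalization of text intervention.
To improve the generalization of text intervention, the target example text may include both example text that needs to be rewritten, entered by an expert, and recalled text that likewise needs to be rewritten, recalled from a corpus based on the example text. In this case, step S201 shown in FIG. 2 can be implemented as follows: obtain input example text that needs to be rewritten and a phrase rewritten example pair corresponding to the example text, where the phrase rewritten example pair includes an example rewritten phrase and an example replacement phrase corresponding to the example rewritten phrase; according to the example rewritten phrase in the phrase rewritten example pair, recall the recalled text corresponding to the phrase rewritten example pair from a pre-built inverted index of text and phrases; determine the recalled text and the example text as the target example text, and determine the phrase rewritten example pair as the target phrase rewritten example pair.
Here, the example text may be example text that needs to be rewritten, entered by an expert.
It should be noted that inverted indexes arise from the practical need to find records by the value of an attribute. Each entry in such an index table includes an attribute value and the addresses of the records that have that attribute value. When an inverted index is applied in this embodiment, the example rewritten phrase serves as the attribute value and the recalled texts serve as the addresses of the records that have that attribute value.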
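The phrase-to-text lookup described above can be sketched with a dictionary. This is an assumption-laden illustration: `build_inverted_index` and `recall_candidates` are hypothetical names, substring matching stands in for whatever phrase extraction the corpus pipeline uses, and the sketch covers only the lookup stage. It does not attempt the check, described in the surrounding embodiments, that a recalled text uses the phrase with the same contextual meaning as the example text.

```python
from collections import defaultdict


def build_inverted_index(corpus, phrases):
    """Map each phrase (attribute value) to the corpus texts (records) containing it."""
    index = defaultdict(list)
    for text in corpus:
        for phrase in phrases:
            if phrase in text:
                index[phrase].append(text)
    return index


def recall_candidates(index, example_phrase, example_text):
    """Return texts indexed under the phrase, excluding the example text itself."""
    return [t for t in index.get(example_phrase, []) if t != example_text]


corpus = [
    "这个菜味道超棒,厨师真的有两把刷子",
    "他一场比赛打进三个球,真有两把刷子",
    "这个粉刷匠一共用了两把刷子",
]
index = build_inverted_index(corpus, ["两把刷子"])
candidates = recall_candidates(index, "两把刷子", corpus[0])
print(candidates)
```

The lookup alone returns both remaining sentences, including the literal "painter" sentence, which is why a semantic comparison is still needed before a candidate is accepted as recalled text.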
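In some embodiments, the intervention platform provides an input interface for the example text and the phrase rewritten example pair corresponding to the example text. After they are entered, the intervention platform provides a request interface through which it calls the add service of the index service (shown as +新增() in FIG. 1) to build and store index relationships for the entered example text and phrase rewritten example pair.
In some embodiments, the pre-built inverted index of text and phrases can be built from data in a web corpus. Suppose the pre-built inverted index includes the following entries from Table 1: "这个菜味道超棒,厨师真的有两把刷子-This dish tastes amazing and the chef really has something", "他一场比赛打进三个球,真有两把刷子-He scored three goals in a game, and he really has something", and "这个粉刷匠一共用了两把刷子-This painter used two brushes in total". Suppose further that the entered phrase rewritten example pair is "两把刷子-something" and the example text is "这个菜味道超棒,厨师真的有两把刷子". Then, according to the example rewritten phrase "两把刷子" in the pair, "他一场比赛打进三个球,真有两把刷子" can be recalled from the inverted index.
Specifically, the inverted index can be built according to the contextual semantics of phrases in text. Continuing the example, the contextual semantics of the example rewritten phrase in "他一场比赛打进三个球,真有两把刷子" are the same as its contextual semantics in the example text ("这个菜味道超棒,厨师真的有两把刷子"), so the same rewriting applied to "两把刷子" in the example text also applies to that sentence, and it can be used as recalled text. By contrast, the contextual semantics of the example rewritten phrase in "这个粉刷匠一共用了两把刷子" differ from its contextual semantics in the example text, so the same rewriting does not apply and the sentence cannot be used as recalled text.
In this way, based on the example text entered by an expert and the corresponding phrase rewritten example pair, recalled text in which the example rewritten phrase has the same contextual semantics as in the example text is recalled from the pre-built inverted index of text and phrases, and index relationships are built for the recalled text. This improves the generalization of the index database and, in turn, the generalization of input text rewriting.
In some embodiments, the index relationship can be generated using a vector representation. For example, step S202 shown in FIG. 2 can be implemented as follows: determine a first vector representation of the target example rewritten phrase in the target example text, where the first vector representation characterizes the contextual semantic information of the target example rewritten phrase in the target example text; and generate the index relationship according to the first vector representation and the target phrase rewritten example pair.
In some embodiments, the first vector representation of the target example rewritten phrase in the target example text can be determined with a pre-trained BERT model. Referring to FIG. 3, "这个菜味道超棒,厨师真的有两把刷子" is the target example text, "两把刷子" is the target example rewritten phrase, and "干预词" (intervention word) is the target example replacement phrase. The pre-trained model encodes "两把刷子", and the resulting vector (i.e., the first vector representation) [0.01, 0.02, -0.03, …, 0.05, 0.37] serves as the key of the index relationship; the mapping between "两把刷子" and "干预词" serves as the value of the index relationship. The generated index relationship is stored in the vector index.
In some embodiments, the first vector representation can be determined from the token vectors that the last layer of the pre-trained BERT model outputs for each character of the target example rewritten phrase. For example, FIG. 4 shows the structure of a BERT model with 12 encoder layers, each of which encodes its input into token vectors. In FIG. 4 the input consists of 9 characters, and the last layer outputs one token vector per character. The average of the last-layer token vectors of the characters of the target example rewritten phrase ("两", "把", "刷", "子") can be used as the first vector representation.
It should be noted that the BERT input in FIG. 4 uses only part of the target example text as an example. In practice, the entire target example text can also be fed to the BERT model to obtain the first vector representation.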
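The averaging step can be sketched without loading a model. The sketch assumes one token per character with no special tokens, and it uses made-up two-dimensional vectors in place of real last-layer BERT outputs; with an actual tokenizer, special tokens such as [CLS] would shift the span indices. `mean_pool` and `find_span` are hypothetical helper names.

```python
def mean_pool(token_vectors, start, end):
    """Average the token vectors in [start, end) into one phrase vector."""
    span = token_vectors[start:end]
    if not span:
        raise ValueError("empty phrase span")
    dim = len(span[0])
    return [sum(vec[i] for vec in span) / len(span) for i in range(dim)]


def find_span(text, phrase):
    """Character-level span of the phrase, assuming one token per character."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError("phrase not found in text")
    return start, start + len(phrase)


text = "厨师真的有两把刷子"  # 9 characters
# Toy "last-layer" vectors, one per character (made-up values).
token_vectors = [[float(i), float(i) * 2.0] for i in range(len(text))]
start, end = find_span(text, "两把刷子")
first_vector = mean_pool(token_vectors, start, end)
print(start, end, first_vector)
```

Because every token vector already reflects the whole sentence through the encoder's attention, averaging only the phrase's tokens still yields a context-dependent vector for the phrase.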
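In this way, using a vector as the key of the index relationship makes it easier to rewrite input text later based on the index database.
In some embodiments, the text processing method may further include: in response to an update request for the index database, updating the index relationships in the index database, where the update request includes one of a delete request and a modify request.
The embodiment of FIG. 2 can be understood as adding an index relationship to the index database, which is one way of updating the index database. Besides adding index relationships, the intervention side can also provide other services that update the index relationships already stored in the index database.
For example, update requests can be implemented through the index service shown in FIG. 1: calling the corresponding service generates the corresponding request.
An update request can carry the identifier of the index relationship to be updated. With this identifier, the corresponding index relationship can be found in the index database and then deleted or changed. For example, a change may modify the intervention word mentioned in the embodiments above.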
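Identifier-addressed delete and modify requests can be sketched with an in-memory dictionary. Everything here is hypothetical: the `IndexDatabase` class, the request shape (`type`, `id`, `replacement`) and the integer identifiers are assumptions made for illustration, since the source does not specify a request format.

```python
class IndexDatabase:
    """In-memory stand-in for the index database (illustrative only)."""

    def __init__(self):
        self._entries = {}
        self._next_id = 0

    def add(self, key_vector, rewritten_phrase, replacement_phrase):
        """Store a new index relationship and return its identifier."""
        entry_id = self._next_id
        self._entries[entry_id] = {
            "key": list(key_vector),
            "rewritten": rewritten_phrase,
            "replacement": replacement_phrase,
        }
        self._next_id += 1
        return entry_id

    def handle_update(self, request):
        """Apply a delete or modify request addressed by entry identifier."""
        entry_id = request["id"]
        if entry_id not in self._entries:
            raise KeyError(f"no index relationship with id {entry_id}")
        if request["type"] == "delete":
            del self._entries[entry_id]
        elif request["type"] == "modify":
            self._entries[entry_id]["replacement"] = request["replacement"]
        else:
            raise ValueError(f"unsupported update type: {request['type']}")

    def get(self, entry_id):
        return self._entries.get(entry_id)


db = IndexDatabase()
first = db.add([0.01, 0.02], "两把刷子", "点东西")
second = db.add([0.09, 0.04], "两把刷子", "点东西")
db.handle_update({"type": "modify", "id": first, "replacement": "something"})
db.handle_update({"type": "delete", "id": second})
```

Both operations touch only the addressed entry, which is the property that lets the index be edited while it keeps serving lookups.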
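In this way, by providing update requests that differ from the add-index-relationship service and responding to them, the index relationships already stored in the index database can be updated, which improves the practicality of the overall text processing method.
In some embodiments, step S204 shown in FIG. 2 can be implemented as follows: in response to the obtained input text, if the input text includes a phrase to be rewritten, determine whether the input text is text that needs to be rewritten according to the index relationships in the index database; if the input text is determined to be text that needs to be rewritten, rewrite the phrase to be rewritten in the input text according to the index relationship corresponding to the input text.
It should be noted that if the user's input text does not include a phrase to be rewritten, no rewriting is needed. If the input text does include a phrase to be rewritten, the phrase is rewritten according to the corresponding index relationship only after the input text has also been determined to be text that needs to be rewritten, which reduces the probability of erroneous rewriting.
For example, in the input text "这个粉刷匠一共用了两把刷子" ("This painter used two brushes in total"), the phrase to be rewritten "两把刷子" refers to actual brushes, so the input text does not need to be rewritten.
For example, in the input text "他一场比赛打进三个球,真有两把刷子" ("He scored three goals in a game, and he really has something"), the phrase to be rewritten "两把刷子" does not refer to actual brushes, so the input text needs to be rewritten.
In some embodiments, whether the input text includes a phrase to be rewritten can be determined as follows: segment the input text into words to obtain multiple phrase results; for each phrase result, look for a matching phrase in a pre-built phrase dictionary tree; if a phrase corresponding to the phrase result is successfully matched, determine that the input text includes the phrase to be rewritten.
For example, for the input text "他一场比赛打进三个球,真有两把刷子", the phrase results obtained include "两把刷子".
It should be noted that a dictionary tree, also called a trie, is a tree structure with high query efficiency. A trie works much like a dictionary: to check whether a word is in the trie, first check whether its first letter appears at the first level; if not, the word is not in the trie; if so, look for the second letter among that letter's child nodes; if it is absent, the word is not in the trie; otherwise continue in the same way. Building the trie from the target example rewritten phrases therefore makes it more efficient to determine whether the input text includes a phrase to be rewritten, further improving real-time performance.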
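The trie lookup described above can be sketched with nested dictionaries. `PhraseTrie` and `find_phrases_to_rewrite` are hypothetical names, and the word segmentation is supplied by hand here because the source does not name a segmenter; the two segment lists are made-up segmentations for illustration.

```python
class PhraseTrie:
    """Character-level trie over the phrases that may need rewriting."""

    def __init__(self, phrases=()):
        self._root = {}
        for phrase in phrases:
            self.insert(phrase)

    def insert(self, phrase):
        node = self._root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node["$end"] = True  # multi-character marker, cannot clash with a character key

    def contains(self, phrase):
        """True only if the whole phrase was inserted (prefixes do not count)."""
        node = self._root
        for ch in phrase:
            if ch not in node:
                return False
            node = node[ch]
        return "$end" in node


def find_phrases_to_rewrite(segments, trie):
    """Return the segmented phrases that match an entry in the trie."""
    return [seg for seg in segments if trie.contains(seg)]


trie = PhraseTrie(["两把刷子"])
# Hand-written segmentations standing in for a real word segmenter.
hits_a = find_phrases_to_rewrite(["他", "一场比赛", "打进", "三个球", "真有", "两把刷子"], trie)
hits_b = find_phrases_to_rewrite(["这个", "游戏", "没意思", "玩家", "都是", "刷子"], trie)
print(hits_a, hits_b)
```

A segment such as "刷子" is rejected at its first character because no inserted phrase starts with "刷", so texts without a candidate phrase are filtered out before any vector encoding is done.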
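In the related embodiments above, the index relationship is composed of the first vector representation and the target phrase rewritten example pair, and the first vector representation characterizes the contextual semantic information of the target example rewritten phrase in the target example text. Whether the input text needs to be rewritten can therefore be determined by whether the first vector representation matches the contextual semantic information of the phrase to be rewritten in the input text.
For example: obtain a second vector representation of the phrase to be rewritten in the input text, where the second vector representation characterizes the contextual semantic information of the phrase to be rewritten in the input text; search the index database, according to the second vector representation, for the target vector representation closest to the second vector representation; if the distance between the target vector representation and the second vector representation is less than a preset distance threshold, determine that the input text is text that needs to be rewritten.
It should be noted that the target vector representation is the first vector representation in the index database that is closest to the second vector representation.
The second vector representation is determined in a similar way to the first vector representation; for details, refer to the related embodiments above, which are not repeated here.
The preset distance threshold can be set according to actual conditions and is not limited in this embodiment.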
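The nearest-key lookup followed by the threshold check can be sketched with an exhaustive scan. Euclidean distance is an assumption, since the source does not name a distance metric, and `nearest_entry` and `needs_rewrite` are hypothetical helpers operating on a plain list of (key, value) pairs.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_entry(index_entries, query_vector):
    """Exhaustively find the stored (key, value) whose key is closest to the query."""
    best = min(index_entries, key=lambda entry: euclidean(entry[0], query_vector))
    return best, euclidean(best[0], query_vector)


def needs_rewrite(index_entries, query_vector, threshold):
    """Return the matched value if the nearest key is within the threshold, else None."""
    if not index_entries:
        return None
    (_key, value), distance = nearest_entry(index_entries, query_vector)
    return value if distance < threshold else None


entries = [([0.0, 0.0], ("两把刷子", "something"))]
close = needs_rewrite(entries, [0.3, 0.4], threshold=1.0)  # distance is about 0.5
far = needs_rewrite(entries, [3.0, 4.0], threshold=1.0)    # distance is 5.0
print(close, far)
```

The exhaustive scan costs one distance computation per stored key, which motivates the graph-structured index discussed in the embodiment for larger databases.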
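In this embodiment, the data structure of the index database may be a graph structure, for example an HNSW (Hierarchical Navigable Small World) graph structure; for the specific search algorithm, refer to the related art, which is not repeated here. When the data structure of the index database is an HNSW graph structure, a naive search algorithm can be used to search the index database for the target vector representation closest to the second vector representation, which avoids brute-force retrieval.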
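As a rough illustration of searching a graph-structured index without scanning every vector, the sketch below performs a greedy walk on a single-layer neighbor graph. It is a simplification and not a full HNSW implementation: real HNSW keeps several layers and a candidate list, the source defers the algorithm to the related art, and this routine may or may not match what the source calls the naive search algorithm. The graph, the vectors and the function names are made up.

```python
import math


def distance(a, b):
    """Euclidean distance (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_graph_search(vectors, neighbors, query, entry_point):
    """Walk the graph, always moving to the neighbor closest to the query.

    Stops at a local minimum, so the result is approximate in general.
    """
    current = entry_point
    current_dist = distance(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for node in neighbors[current]:
            d = distance(vectors[node], query)
            if d < best_dist:
                best, best_dist = node, d
        if best == current:
            return current, current_dist
        current, current_dist = best, best_dist


# Toy single-layer graph: nodes on a line, each linked to its adjacent nodes.
vectors = {0: [0.0], 1: [1.0], 2: [2.0], 3: [3.0], 4: [4.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
node, dist = greedy_graph_search(vectors, neighbors, query=[3.2], entry_point=0)
print(node, dist)
```

Each step compares the query only with the current node's neighbors, so the number of distance computations depends on the path length and node degree rather than on the total number of stored vectors.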
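Text rewriting is illustrated below using the following input texts as examples:
Input text 1: 这个菜味道很棒,厨师真的有两把刷子。
Input text 2: 他代码写得很6,真的有两把刷子。
Input text 3: 这个粉刷匠一共用了两把刷子。
Input text 4: 这个游戏没意思,玩家都是刷子。
Because the pre-built phrase dictionary tree contains the phrase "两把刷子" but not the phrase "刷子", input text 4 can be filtered out and is not rewritten, which avoids an erroneous rewrite.
Input text 1, input text 2 and input text 3 are then vector-encoded with the BERT model to obtain their respective second vector representations:
Second vector representation for input text 1: [0.01, 0.02, -0.03, ..., 0.05, 0.37].
Second vector representation for input text 2: [0.09, 0.04, -0.01, ..., 0.17, 0.07].
Second vector representation for input text 3: [0.06, 0.12, -0.93, ..., 0.85, 0.17].
According to the index relationships in the index database, the naive search algorithm is used to find the target vector representation closest to the second vector representation of each of input text 1, input text 2 and input text 3, giving the distance between each second vector representation and its target vector representation:
Distance for input text 1: 0.
Distance for input text 2: 20.
Distance for input text 3: 200.
Because the preset distance threshold is 50, input text 3 does not need to be rewritten, while input text 1 and input text 2 are rewritten, giving:
Rewritten text for input text 1: 这个菜味道很棒,厨师真的有[干预词]。
Rewritten text for input text 2: 他代码写得很6,真的有[干预词]。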
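The decision in this worked example can be reproduced in a few lines. The distances (0, 20, 200) and the threshold of 50 are taken from the example; `rewrite_if_needed` is a hypothetical helper, and the placeholder "[干预词]" stands in for the actual replacement phrase.

```python
def rewrite_if_needed(text, phrase, replacement, distance, threshold):
    """Replace the phrase only when its nearest index key is within the threshold."""
    if phrase in text and distance < threshold:
        return text.replace(phrase, replacement)
    return text


THRESHOLD = 50
cases = [
    ("这个菜味道很棒,厨师真的有两把刷子。", 0),
    ("他代码写得很6,真的有两把刷子。", 20),
    ("这个粉刷匠一共用了两把刷子。", 200),
]
results = [
    rewrite_if_needed(text, "两把刷子", "[干预词]", dist, THRESHOLD)
    for text, dist in cases
]
for line in results:
    print(line)
```

Input text 3 is returned unchanged because its distance of 200 exceeds the threshold, even though it contains the same surface phrase.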
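For the intervention word above, refer to the related embodiments above; it is not limited in this embodiment.
FIG. 5 is a block diagram of a text processing device according to an exemplary embodiment of the present disclosure. Referring to FIG. 5, the text processing device 500 includes: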
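a first acquisition module 501, configured to obtain target example text that needs to be rewritten and a target phrase rewritten example pair corresponding to the target example text, where the target phrase rewritten example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase;
a generation module 502, configured to generate an index relationship according to the target example text and the target phrase rewritten example pair;
a storage module 503, configured to store the index relationship in an index database;
a rewriting module 504, configured to rewrite obtained input text according to the index database.
Optionally, the first acquisition module 501 includes:
a first acquisition submodule, configured to obtain input example text that needs to be rewritten and a phrase rewritten example pair corresponding to the example text, where the phrase rewritten example pair includes an example rewritten phrase and an example replacement phrase corresponding to the example rewritten phrase;
a recall submodule, configured to recall, according to the example rewritten phrase in the phrase rewritten example pair, the recalled text corresponding to the phrase rewritten example pair from a pre-built inverted index of text and phrases;
a first determination submodule, configured to determine the recalled text and the example text as the target example text, and to determine the phrase rewritten example pair as the target phrase rewritten example pair.
Optionally, the generation module 502 includes:
a second determination submodule, configured to determine a first vector representation of the target example rewritten phrase in the target example text, where the first vector representation characterizes the contextual semantic information of the target example rewritten phrase in the target example text;
a generation submodule, configured to generate an index relationship according to the first vector representation and the target phrase rewritten example pair.
Optionally, the device 500 further includes:
a response module, configured to update the index relationships in the index database in response to an update request for the index database, where the update request includes one of a delete request and a modify request.
Optionally, the rewriting module 504 includes:
a response submodule, configured to, in response to the obtained input text and if the input text includes a phrase to be rewritten, determine whether the input text is text that needs to be rewritten according to the index relationships in the index database;
a rewriting submodule, configured to, if the input text is determined to be text that needs to be rewritten, rewrite the phrase to be rewritten in the input text according to the index relationship corresponding to the input text.
Optionally, the device 500 further includes:
a word segmentation module, configured to segment the input text into words to obtain multiple phrase results;
a matching module, configured to, for each phrase result, look for a matching phrase in a pre-built phrase dictionary tree, where the phrase dictionary tree is built from the target example rewritten phrases;
a first determination module, configured to determine that the input text includes the phrase to be rewritten if a phrase corresponding to the phrase result is successfully matched.
Optionally, the index relationship is composed of a first vector representation and the target phrase rewritten example pair, the first vector representation characterizes the contextual semantic information of the target example rewritten phrase in the target example text, and the rewriting submodule includes:
an acquisition unit, configured to obtain a second vector representation of the phrase to be rewritten in the input text, where the second vector representation characterizes the contextual semantic information of the phrase to be rewritten in the input text;
a search unit, configured to search the index database, according to the second vector representation, for the target vector representation closest to the second vector representation;
a determining unit, configured to determine that the input text is text that needs to be rewritten if the distance between the target vector representation and the second vector representation is less than a preset distance threshold.
Optionally, the data structure of the index database is a graph structure, and the search unit specifically uses a naive search algorithm to search the index database, according to the second vector representation, for the target vector representation closest to the second vector representation.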
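Reference is now made to FIG. 6, which shows a schematic structural diagram of an electronic device 600 suitable for implementing embodiments of the present disclosure. Terminal devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players) and vehicle-mounted terminals (such as car navigation terminals), and fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 6, the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602 and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 6 shows the electronic device 600 with various means, it should be understood that not all of the illustrated means are required to be implemented or provided. More or fewer means may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product including a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 609, installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: electric wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
In some implementations, electronic devices can communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The above computer-readable medium may be included in the above electronic device, or it may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device: obtains target example text that needs to be rewritten and a target phrase rewritten example pair corresponding to the target example text, where the target phrase rewritten example pair includes a target example rewritten phrase and a target example replacement phrase corresponding to the target example rewritten phrase; generates an index relationship according to the target example text and the target phrase rewritten example pair; stores the index relationship in an index database; and rewrites obtained input text according to the index relationship in the index database.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as "C" or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure can be implemented in software or in hardware. The name of a module does not, under certain circumstances, limit the module itself. For example, the first acquisition module can also be described as "a module that obtains the target example text that needs to be rewritten and the target phrase rewritten example pair corresponding to the target example text".
The functions described above may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.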
根据本公开的一个或多个实施例,示例1提供了一种文本处理方法,包括:
获取需要改写的目标示例文本与所述目标示例文本对应的目标短语改写示例对,所述目标短语改写示例对包括目标示例改写短语和与所述目标示例改写短语对应的目标示例替换短语;
根据所述目标示例文本和所述目标短语改写示例对,生成索引关系;
将所述索引关系存储至索引数据库;
根据所述索引数据库中的索引关系,对获取的输入文本进行文本改写。
根据本公开的一个或多个实施例,示例2提供了示例1的方法,所述获取需要改写的目标示例文本与所述目标示例文本对应的目标短语改写示例对,包括:
获取输入的需要改写的示例文本和与所述示例文本对应的短语改写示例对,所述短语改写示例对包括示例改写短语和与所述示例改写短语对应的示例替换短语;
根据所述短语改写示例对中的示例改写短语,在预构建的文本与短语的倒排索引中召回与所述示例改写短语对对应的召回文本;
将所述召回文本和所述示例文本确定为所述目标示例文本,并将所述短语改写示例对确定为所述目标短语改写示例对。
根据本公开的一个或多个实施例,示例3提供了示例1的方法,所述根 据所述目标示例文本和所述目标短语改写示例对,生成索引关系,包括:
确定所述目标示例改写短语在所述目标示例文本中的第一向量表示,所述第一向量表示用于表征所述目标示例改写短语在所述目标示例文本中的上下文语义信息;
根据所述第一向量表示和所述目标短语改写示例对,生成索引关系。
根据本公开的一个或多个实施例,示例4提供了示例1的方法,所述方法还包括:
响应针对所述索引数据库的更新请求,对所述索引数据库中的索引关系进行更新,其中,所述更新请求包括删除请求和修改请求中的一种。
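According to one or more embodiments of the present disclosure, Example 5 provides the method of any one of Examples 1-4, where rewriting the acquired input text according to the index relationship in the index database includes:
in response to the acquired input text, and in a case where the input text includes a phrase to be rewritten, determining, according to the index relationship in the index database, whether the input text is a text that needs to be rewritten;
in a case where the input text is determined to be a text that needs to be rewritten, rewriting the phrase to be rewritten in the input text according to the index relationship corresponding to the input text.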
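According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 5, the method further including:
performing word segmentation on the input text to obtain a plurality of phrase results;
for each phrase result, searching a pre-built phrase trie for a phrase that matches the phrase result, where the phrase trie is built from the target example rewrite phrases;
in a case where a phrase corresponding to the phrase result is successfully matched, determining that the input text includes the phrase to be rewritten.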
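The trie lookup in Example 6 can be sketched as follows. The character-level trie and the whitespace-based `segment` function are assumptions: a real deployment would use a proper word segmenter (essential for Chinese text), and with whitespace splitting a multi-word phrase can never match a single segment.

```python
_END = "$end"  # multi-character key, so it cannot collide with a single character


class PhraseTrie:
    """Character-level trie built from the target example rewrite phrases."""

    def __init__(self, phrases=()):
        self.root = {}
        for phrase in phrases:
            self.insert(phrase)

    def insert(self, phrase):
        node = self.root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[_END] = True  # marks the end of a complete phrase

    def contains(self, phrase):
        node = self.root
        for ch in phrase:
            node = node.get(ch)
            if node is None:
                return False
        return _END in node


def segment(text):
    # Assumption: whitespace splitting stands in for a real word segmenter.
    return text.split()


def phrases_to_rewrite(text, trie):
    """Return the segments of the text that match a phrase in the trie."""
    return [piece for piece in segment(text) if trie.contains(piece)]


trie = PhraseTrie(["bank", "loan"])
print(phrases_to_rewrite("go to the bank", trie))
```

A non-empty result means the input text contains a phrase to be rewritten, so the more expensive vector lookup only needs to run for those texts.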
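According to one or more embodiments of the present disclosure, Example 7 provides the method of Example 5, where the index relationship consists of a first vector representation and the target phrase rewriting example pair, the first vector representation characterizes the contextual semantic information of the target example rewrite phrase in the target example text, and determining, according to the index relationship in the index database, whether the input text is a text that needs to be rewritten includes:
acquiring a second vector representation of the phrase to be rewritten in the input text, where the second vector representation characterizes the contextual semantic information of the phrase to be rewritten in the input text;
searching the index database, according to the second vector representation, for a target vector representation that is closest in distance to the second vector representation;
in a case where the distance between the target vector representation and the second vector representation is less than a preset distance threshold, determining that the input text is a text that needs to be rewritten.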
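The decision rule of Example 7 amounts to a nearest-neighbor lookup followed by a threshold test. The sketch below assumes Euclidean distance and a brute-force scan over all stored relations; the source specifies neither the distance metric nor the threshold value, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def decide_rewrite(second_vector, index_relations, max_distance):
    """Return the rewrite pair of the nearest stored vector if it lies
    within `max_distance` of the query, otherwise None (no rewrite)."""
    if not index_relations:
        return None
    target_vector, pair = min(
        index_relations, key=lambda rel: euclidean(rel[0], second_vector)
    )
    if euclidean(target_vector, second_vector) < max_distance:
        return pair
    return None


relations = [
    ([1.0, 0.0], ("bank", "financial institution")),
    ([0.0, 1.0], ("pitch", "sales presentation")),
]
print(decide_rewrite([0.9, 0.1], relations, max_distance=0.5))
print(decide_rewrite([0.5, 0.5], relations, max_distance=0.5))
```

The threshold trades recall against precision: a larger value rewrites more inputs but risks rewriting phrases whose context differs from every stored example.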
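According to one or more embodiments of the present disclosure, Example 8 provides the method of Example 7, where the data structure of the index database is a graph structure, and searching the index database, according to the second vector representation, for the target vector representation that is closest in distance to the second vector representation includes:
using a naive search algorithm to search the index database, according to the second vector representation, for the target vector representation that is closest in distance to the second vector representation.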
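This passage names a naive search algorithm over a graph-structured index without defining it. One plausible reading, offered purely as an assumption, is a greedy neighbor walk of the kind used in navigable-small-world style indexes: start at an entry node and repeatedly move to whichever neighbor is closer to the query until no neighbor improves. The graph layout and function names below are hypothetical.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_graph_search(vectors, neighbors, query, entry=0):
    """Walk the neighbor graph from `entry`, always moving to the closest
    neighbor, and stop when no neighbor is closer than the current node."""
    current = entry
    current_dist = distance(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for node in neighbors[current]:
            d = distance(vectors[node], query)
            if d < best_dist:
                best, best_dist = node, d
        if best == current:  # local minimum reached
            return current, current_dist
        current, current_dist = best, best_dist


# Four stored vectors on a line, linked as a chain 0 - 1 - 2 - 3.
vectors = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
node, dist = greedy_graph_search(vectors, neighbors, [2.9, 0.0])
```

A greedy walk visits only a fraction of the stored vectors, but it can stop at a local minimum, so the result is approximate unless the graph is well connected.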
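According to one or more embodiments of the present disclosure, Example 9 provides a text processing apparatus, including:
a first acquisition module, configured to acquire a target example text that needs to be rewritten and a target phrase rewriting example pair corresponding to the target example text, where the target phrase rewriting example pair includes a target example rewrite phrase and a target example replacement phrase corresponding to the target example rewrite phrase;
a generation module, configured to generate an index relationship according to the target example text and the target phrase rewriting example pair;
a storage module, configured to store the index relationship in an index database;
a rewriting module, configured to rewrite an acquired input text according to the index database.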
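According to one or more embodiments of the present disclosure, Example 10 provides a computer-readable medium storing a computer program which, when executed by a processing apparatus, implements the steps of the method of any one of Examples 1-8.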
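According to one or more embodiments of the present disclosure, Example 11 provides an electronic device, including:
a storage apparatus storing a computer program;
a processing apparatus, configured to execute the computer program in the storage apparatus to implement the steps of the method of any one of Examples 1-8.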
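According to one or more embodiments of the present disclosure, Example 12 provides a text processing system, including:
an index database;
an index server;
an intervention platform, configured to acquire a target example text that needs to be rewritten and a target phrase rewriting example pair corresponding to the target example text, where the target phrase rewriting example pair includes a target example rewrite phrase and a target example replacement phrase corresponding to the target example rewrite phrase;
where the index server is configured to acquire the target example text and the target phrase rewriting example pair from the intervention platform, generate an index relationship according to the acquired target example text and target phrase rewriting example pair, and store the index relationship in the index database; and the index server is further configured to rewrite an acquired input text according to the index relationship in the index database.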
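According to one or more embodiments of the present disclosure, Example 13 provides the system of Example 12, further including:
a corpus database, configured to store a pre-built inverted index of texts and phrases;
where the intervention platform is further configured to acquire an input example text that needs to be rewritten and a phrase rewriting example pair corresponding to the example text, recall a recalled text corresponding to the example rewrite phrase from the pre-built inverted index of texts and phrases in the corpus database, determine the recalled text and the example text as the target example text, and determine the phrase rewriting example pair as the target phrase rewriting example pair, the phrase rewriting example pair including the example rewrite phrase and an example replacement phrase corresponding to the example rewrite phrase.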
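According to one or more embodiments of the present disclosure, Example 14 provides the system of Example 12, further including:
an intervention database;
where the intervention platform is further configured to store, in the intervention database, the acquired target example text that needs to be rewritten and the target phrase rewriting example pair corresponding to the target example text, and to generate an index building request and send it to the index server;
and the index server is further configured to acquire, in response to the index building request, the target example text and the target phrase rewriting example pair corresponding to the target example text from the intervention database.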
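According to one or more embodiments of the present disclosure, Example 15 provides the system of Example 12, further including:
an application end, configured to send the input text to the index server.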
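The interaction between the components in Examples 12 to 15 can be sketched as plain objects passing data to one another. Everything here is an illustrative assumption: the class names, the dictionary-based request, and the in-memory databases are hypothetical, and the rewrite step uses bare phrase matching with no context check in order to keep the focus on the data flow.

```python
class IndexServer:
    def __init__(self, intervention_db, index_db):
        self.intervention_db = intervention_db  # dict: record id -> (text, pair)
        self.index_db = index_db                # list of index relationships

    def handle_build_request(self, request):
        # Fetch the stored example from the intervention database and index it.
        example_text, pair = self.intervention_db[request["record_id"]]
        self.index_db.append({"example_text": example_text, "pair": pair})

    def rewrite(self, input_text):
        # Simplification: phrase match only, no contextual comparison.
        for relation in self.index_db:
            rewrite_phrase, replacement = relation["pair"]
            if rewrite_phrase in input_text:
                return input_text.replace(rewrite_phrase, replacement)
        return input_text


class InterventionPlatform:
    def __init__(self, intervention_db, index_server):
        self.intervention_db = intervention_db
        self.index_server = index_server

    def submit(self, example_text, pair):
        record_id = len(self.intervention_db)
        self.intervention_db[record_id] = (example_text, pair)
        # The index building request tells the server which record to index.
        self.index_server.handle_build_request({"record_id": record_id})


intervention_db, index_db = {}, []
server = IndexServer(intervention_db, index_db)
platform = InterventionPlatform(intervention_db, server)
platform.submit("deposit money at the bank", ("bank", "financial institution"))

# The application end simply sends input text to the index server.
print(server.rewrite("go to the bank"))
```

Routing examples through an intervention database rather than passing them directly keeps a durable record of every intervention, so the index can be rebuilt from it if needed.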
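The above description is merely of preferred embodiments of the present disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.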
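Furthermore, although operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination.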
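Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.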

Claims (15)

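  1. A text processing method, comprising:
    acquiring a target example text that needs to be rewritten and a target phrase rewriting example pair corresponding to the target example text, wherein the target phrase rewriting example pair comprises a target example rewrite phrase and a target example replacement phrase corresponding to the target example rewrite phrase;
    generating an index relationship according to the target example text and the target phrase rewriting example pair;
    storing the index relationship in an index database;
    rewriting an acquired input text according to the index relationship in the index database.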
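  2. The method according to claim 1, wherein acquiring the target example text that needs to be rewritten and the target phrase rewriting example pair corresponding to the target example text comprises:
    acquiring an input example text that needs to be rewritten and a phrase rewriting example pair corresponding to the example text, wherein the phrase rewriting example pair comprises an example rewrite phrase and an example replacement phrase corresponding to the example rewrite phrase;
    recalling, according to the example rewrite phrase in the phrase rewriting example pair, a recalled text corresponding to the example rewrite phrase from a pre-built inverted index of texts and phrases;
    determining the recalled text and the example text as the target example text, and determining the phrase rewriting example pair as the target phrase rewriting example pair.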
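  3. The method according to claim 1 or 2, wherein generating the index relationship according to the target example text and the target phrase rewriting example pair comprises:
    determining a first vector representation of the target example rewrite phrase in the target example text, wherein the first vector representation characterizes the contextual semantic information of the target example rewrite phrase in the target example text;
    generating the index relationship according to the first vector representation and the target phrase rewriting example pair.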
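  4. The method according to any one of claims 1-3, wherein the method further comprises:
    updating the index relationship in the index database in response to an update request for the index database, wherein the update request comprises one of a deletion request and a modification request.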
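  5. The method according to any one of claims 1-4, wherein rewriting the acquired input text according to the index relationship in the index database comprises:
    in response to the acquired input text, and in a case where the input text comprises a phrase to be rewritten, determining, according to the index relationship in the index database, whether the input text is a text that needs to be rewritten;
    in a case where the input text is determined to be a text that needs to be rewritten, rewriting the phrase to be rewritten in the input text according to the index relationship corresponding to the input text.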
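  6. The method according to claim 5, further comprising:
    performing word segmentation on the input text to obtain a plurality of phrase results;
    for each phrase result, searching a pre-built phrase trie for a phrase that matches the phrase result, wherein the phrase trie is built from the target example rewrite phrases;
    in a case where a phrase corresponding to the phrase result is successfully matched, determining that the input text comprises the phrase to be rewritten.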
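  7. The method according to claim 5 or 6, wherein the index relationship consists of a first vector representation and the target phrase rewriting example pair, the first vector representation characterizes the contextual semantic information of the target example rewrite phrase in the target example text, and determining, according to the index relationship in the index database, whether the input text is a text that needs to be rewritten comprises:
    acquiring a second vector representation of the phrase to be rewritten in the input text, wherein the second vector representation characterizes the contextual semantic information of the phrase to be rewritten in the input text;
    searching the index database, according to the second vector representation, for a target vector representation that is closest in distance to the second vector representation;
    in a case where the distance between the target vector representation and the second vector representation is less than a preset distance threshold, determining that the input text is a text that needs to be rewritten.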
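  8. The method according to claim 7, wherein the data structure of the index database is a graph structure, and searching the index database, according to the second vector representation, for the target vector representation that is closest in distance to the second vector representation comprises:
    using a naive search algorithm to search the index database, according to the second vector representation, for the target vector representation that is closest in distance to the second vector representation.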
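  9. A text processing apparatus, comprising:
    a first acquisition module, configured to acquire a target example text that needs to be rewritten and a target phrase rewriting example pair corresponding to the target example text, wherein the target phrase rewriting example pair comprises a target example rewrite phrase and a target example replacement phrase corresponding to the target example rewrite phrase;
    a generation module, configured to generate an index relationship according to the target example text and the target phrase rewriting example pair;
    a storage module, configured to store the index relationship in an index database;
    a rewriting module, configured to rewrite an acquired input text according to the index database.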
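  10. A computer-readable medium storing a computer program, wherein the computer program, when executed by a processing apparatus, implements the method according to any one of claims 1-8.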
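  11. An electronic device, comprising:
    a storage apparatus storing a computer program;
    a processing apparatus, configured to execute the computer program in the storage apparatus to implement the method according to any one of claims 1-8.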
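  12. A text processing system, comprising:
    an index database;
    an index server;
    an intervention platform, configured to acquire a target example text that needs to be rewritten and a target phrase rewriting example pair corresponding to the target example text, wherein the target phrase rewriting example pair comprises a target example rewrite phrase and a target example replacement phrase corresponding to the target example rewrite phrase;
    wherein the index server is configured to acquire the target example text and the target phrase rewriting example pair from the intervention platform, generate an index relationship according to the acquired target example text and target phrase rewriting example pair, and store the index relationship in the index database; and the index server is further configured to rewrite an acquired input text according to the index relationship in the index database.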
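  13. The system according to claim 12, further comprising:
    a corpus database, configured to store a pre-built inverted index of texts and phrases;
    wherein the intervention platform is further configured to acquire an input example text that needs to be rewritten and a phrase rewriting example pair corresponding to the example text, recall a recalled text corresponding to the example rewrite phrase from the pre-built inverted index of texts and phrases in the corpus database, determine the recalled text and the example text as the target example text, and determine the phrase rewriting example pair as the target phrase rewriting example pair, the phrase rewriting example pair comprising the example rewrite phrase and an example replacement phrase corresponding to the example rewrite phrase.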
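  14. The system according to claim 12 or 13, further comprising:
    an intervention database;
    wherein the intervention platform is further configured to store, in the intervention database, the acquired target example text that needs to be rewritten and the target phrase rewriting example pair corresponding to the target example text, and to generate an index building request and send it to the index server;
    and the index server is further configured to acquire, in response to the index building request, the target example text and the target phrase rewriting example pair corresponding to the target example text from the intervention database.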
  15. The system according to any one of claims 12-14, further comprising:
    an application end, configured to send the input text to the index server.
PCT/CN2023/092453 2022-05-07 2023-05-06 Text processing method, apparatus, storage medium, electronic device and system WO2023217019A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210495448.X 2022-05-07
CN202210495448.XA CN114817447A (zh) 2022-05-07 2022-05-07 Text processing method, apparatus, storage medium, electronic device and system

Publications (1)

Publication Number Publication Date
WO2023217019A1 true WO2023217019A1 (zh) 2023-11-16

Family

ID=82512000

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092453 WO2023217019A1 (zh) 2022-05-07 2023-05-06 Text processing method, apparatus, storage medium, electronic device and system

Country Status (2)

Country Link
CN (1) CN114817447A (zh)
WO (1) WO2023217019A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114817447A (zh) * 2022-05-07 2022-07-29 北京有竹居网络技术有限公司 Text processing method, apparatus, storage medium, electronic device and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160350395A1 (en) * 2015-05-29 2016-12-01 BloomReach, Inc. Synonym Generation
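CN111401038A (zh) * 2020-02-26 2020-07-10 支付宝(杭州)信息技术有限公司 Text processing method and apparatus, electronic device and storage medium
CN111475621A (zh) * 2020-04-03 2020-07-31 百度在线网络技术(北京)有限公司 Method and apparatus for mining a synonym replacement table, electronic device, and computer-readable medium
CN114357950A (zh) * 2021-12-31 2022-04-15 科大讯飞股份有限公司 Data rewriting method and apparatus, storage medium and computer device
CN114817447A (zh) * 2022-05-07 2022-07-29 北京有竹居网络技术有限公司 Text processing method, apparatus, storage medium, electronic device and system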

Also Published As

Publication number Publication date
CN114817447A (zh) 2022-07-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23802791

Country of ref document: EP

Kind code of ref document: A1