CN117892700A - Data processing method and device - Google Patents


Info

Publication number
CN117892700A
Authority
CN
China
Prior art keywords
text
separator
word segmentation
processing
processing result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311786847.2A
Other languages
Chinese (zh)
Inventor
刘群
张顾春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202311786847.2A
Publication of CN117892700A

Landscapes

  • Machine Translation (AREA)

Abstract

A data processing method relates to the field of artificial intelligence and comprises the following steps: acquiring a text; inserting a separator into the text, wherein the separator is a non-space character string used to identify separation positions in the text; and performing word segmentation on the text into which the separator has been inserted, to obtain a word segmentation result. The separator in the present application is a non-space character string (or, it may be said, a string that is not composed entirely of spaces). Because such non-space characters are uncommon in natural language, the inserted separators can be accurately deleted during post-processing, thereby realizing lossless word segmentation.

Description

Data processing method and device
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a data processing method and apparatus.
Background
Existing natural language processing systems process input text using byte pair encoding (BPE) or the SentencePiece algorithm, cutting each space-delimited character string in the text according to certain rules; that is, spaces act as the separators in the text. The aim is to greatly reduce the occurrence of out-of-vocabulary words during inference and to improve the generalization and fault tolerance of a language model. If no segmentation were performed, every word of the language would have to appear in the training text; otherwise, words that did not appear (such as misspellings) would be treated as out-of-vocabulary words during inference, greatly harming the fault tolerance and inference capability of the language model. In addition, without word segmentation, words that are morphologically related but of different forms would be regarded by the language model as mutually independent, so their common semantics could not be fully learned, reducing the generalization capability of the language model.
However, these algorithms focus on obtaining a better sub-word distribution by building a mathematical model, and (partially) neglect the inconsistency that characteristics of the language itself introduce into word segmentation, i.e., the same word may have different word segmentation results in different scenarios. A common solution is to introduce a text pre-processing module before segmentation and a post-processing module after it. Although such schemes better resolve the inconsistency, they often introduce noise that makes the word segmentation lossy, i.e., the input text cannot be recovered after text pre-processing, word segmentation, reverse word segmentation, and text post-processing.
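The lossy behavior described here can be illustrated with a small sketch (hypothetical code, not from the patent): if an ordinary space is inserted as the separator during pre-processing, and segmentation normalizes whitespace (as many tokenizers do), post-processing cannot tell inserted spaces apart from spaces already in the input, so the round trip is not lossless.

```python
import re

def preprocess(text: str) -> str:
    """Insert an ordinary space after each punctuation mark (a common heuristic)."""
    out = []
    for ch in text:
        out.append(ch)
        if ch in ",.!?":
            out.append(" ")
    return "".join(out)

def segment_normalize(text: str) -> str:
    """Stand-in for segmentation: many tokenizers collapse runs of whitespace."""
    return " ".join(text.split())

def postprocess(text: str) -> str:
    """Attempt to undo pre-processing by deleting one space after punctuation."""
    return re.sub(r"([,.!?]) ", r"\1", text)

# A real space after the comma is destroyed by the round trip:
original = "Hello, world"
restored = postprocess(segment_normalize(preprocess(original)))
print(restored)  # Hello,world -- the original space after ',' is lost
```

Note that "Hello, world" and "Hello,world" both come out of the pipeline as "Hello,world": two distinct inputs map to the same output, which is exactly the irrecoverability the patent targets.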
Disclosure of Invention
The present application provides a data processing method that enables inserted separators to be accurately deleted during post-processing, thereby realizing lossless word segmentation.
In a first aspect, the present application provides a data processing method, the method comprising: acquiring a text; inserting a separator into the text, wherein the separator is a non-space character string used to identify separation positions in the text; and performing word segmentation on the text into which the separator has been inserted, to obtain a word segmentation result.
In the embodiments of the present application, a non-space character string (or a string that is not composed entirely of spaces) is defined as the separator; for example, a character string that starts with one or more non-space characters and ends with one or more space characters may be used as the separator. Because such non-space characters are uncommon in natural language, the inserted separators can be accurately deleted during post-processing, thereby realizing lossless word segmentation.
In one possible implementation, the separator is a string that starts with a non-space and ends with at least one space.
In one possible implementation, the separator is the character corresponding to the Unicode code point "\u2582" (that is, "▂", LOWER ONE QUARTER BLOCK).
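As a quick check (illustrative Python, not part of the patent), "\u2582" denotes the character ▂, which is not a whitespace character and rarely appears in natural-language text, which is what makes its later removal unambiguous:

```python
import unicodedata

SEP = "\u2582"  # the separator character named in this implementation

print(SEP)                    # ▂
print(unicodedata.name(SEP))  # LOWER ONE QUARTER BLOCK
print(SEP.isspace())          # False -- a non-space character
```

Notably, this is a neighbor of "\u2581" (▁, LOWER ONE EIGHTH BLOCK), the meta symbol SentencePiece uses internally to mark word boundaries.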
In one possible implementation, the inserting a separator into the text includes:
inserting separators into the text based on at least one of:
inserting a separator immediately after a punctuation mark included in the text;
inserting a separator at a position in the text where the text switches between different languages and is not separated by a space;
inserting a separator immediately after the last space of a run of consecutive spaces included in the text.
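The three insertion rules above can be sketched as follows (hypothetical code; the CJK/Latin boundary check is only an illustrative approximation of "positions where the language switches without a space"):

```python
import re

SEP = "\u2582"  # "▂", the separator from the "\u2582" implementation

def insert_separators(text: str) -> str:
    # Rule 1: immediately after punctuation marks.
    text = re.sub(r"([,.!?;:])", r"\1" + SEP, text)
    # Rule 2: where the text switches language without an intervening space
    # (approximated here as a boundary between CJK and Latin characters).
    text = re.sub(r"(?<=[\u4e00-\u9fff])(?=[A-Za-z0-9])", SEP, text)
    text = re.sub(r"(?<=[A-Za-z0-9])(?=[\u4e00-\u9fff])", SEP, text)
    # Rule 3: immediately after the last space of a run of consecutive spaces.
    text = re.sub(r"( {2,})", r"\1" + SEP, text)
    return text

def remove_separators(text: str) -> str:
    # SEP virtually never occurs in natural text, so removal is exact.
    return text.replace(SEP, "")

s = "你好world, ok  done"
t = insert_separators(s)
print(t.replace(SEP, "|"))       # 你好|world,| ok  |done
assert remove_separators(t) == s  # lossless round trip
```

Because the separator is a character absent from ordinary text, deleting it in post-processing recovers the input exactly, regardless of which rules fired.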
In one possible implementation, the method further comprises: and obtaining a first processing result of the text through a language model according to the word segmentation result.
In addition, post-processing (e.g., including reverse word segmentation, separator removal, and protection-string removal) may also be performed directly on the word segmentation result.
In one possible implementation, the first processing result includes a plurality of word segmentation units and the separator; the method further comprises the steps of:
performing reverse word segmentation on the first processing result and removing the separator, to obtain a second processing result of the text.
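The full pipeline of the first aspect — insert, segment, process with a model, reverse-segment, remove separators — can be sketched under the same assumptions; the whitespace split below is only a toy stand-in for a real subword tokenizer:

```python
SEP = "\u2582"  # "▂", assumed separator character

def word_segment(text: str) -> list[str]:
    # Toy stand-in for a subword tokenizer (e.g. BPE over space-delimited units).
    return text.split(" ")

def reverse_segment(units: list[str]) -> str:
    return " ".join(units)

def remove_separators(text: str) -> str:
    return text.replace(SEP, "")

original = "Hi, there"
inserted = "Hi," + SEP + " there"   # separator inserted after the punctuation
units = word_segment(inserted)       # word segmentation units, e.g. ['Hi,▂', 'there']
restored = remove_separators(reverse_segment(units))
assert restored == original          # the round trip is lossless
```

In the patented method the model's output (the "first processing result") likewise consists of word segmentation units plus separators, so the same reverse segmentation and separator removal yield the lossless "second processing result".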
In one possible implementation, the text may also contain character strings made up of the same characters as the separator. Such strings are part of the text itself rather than separators, and should not be deleted during post-processing. Therefore, the character strings in the text that coincide with the separator can be protected; that is, wherever the non-space character portion of the separator appears in the input text, it is protected by constructing a string-embedding structure, forming a protection string. Accordingly, the text after inserting the separator further includes: the protection string, which identifies the portions of the text whose characters are the same as the separator and indicates that those portions are not separators.
In one possible implementation, the first processing result further includes the protection string; the method further comprises: removing the protection string from the first processing result.
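The protection mechanism can be sketched as follows (hypothetical code; the patent does not fix a concrete embedding structure, so the wrapper characters "\u2583"/"\u2584" below are an invented choice):

```python
import re

SEP = "\u2582"                               # separator character ("▂")
PROT_OPEN, PROT_CLOSE = "\u2583", "\u2584"   # invented protection markers

def protect(text: str) -> str:
    """Wrap literal occurrences of the separator character so that later
    separator removal does not delete them."""
    return text.replace(SEP, PROT_OPEN + SEP + PROT_CLOSE)

def postprocess(text: str) -> str:
    # Remove only the inserted (unprotected) separators...
    text = re.sub("(?<!" + PROT_OPEN + ")" + SEP, "", text)
    # ...then strip the protection structure, restoring the literal character.
    return text.replace(PROT_OPEN + SEP + PROT_CLOSE, SEP)

original = "a" + SEP + "b, c"                  # input that genuinely contains "▂"
protected = protect(original)                  # a▃▂▄b, c
with_seps = protected.replace(",", "," + SEP)  # separator inserted after ','
assert postprocess(with_seps) == original      # the literal "▂" survives
```

A complete implementation would also have to protect the marker characters themselves if they could occur in the input; this sketch ignores that case.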
In a second aspect, the present application provides a data processing apparatus, the apparatus comprising:
The acquisition module is used for acquiring the text;
a processing module for inserting a separator into the text, the separator being a non-space character string, the separator being used to identify a location of a separation in the text; and performing word segmentation on the text inserted with the separator to obtain a word segmentation result.
In one possible implementation, the separator is the character corresponding to the Unicode code point "\u2582" (that is, "▂", LOWER ONE QUARTER BLOCK).
In one possible implementation, the processing module is specifically configured to:
inserting separators into the text based on at least one of:
inserting a separator immediately after a punctuation mark included in the text;
inserting a separator at a position in the text where the text switches between different languages and is not separated by a space;
inserting a separator immediately after the last space of a run of consecutive spaces included in the text.
In one possible implementation, the processing module is further configured to:
and obtaining a first processing result of the text through a language model according to the word segmentation result.
In one possible implementation, the first processing result includes a plurality of word segmentation units and the separator; the processing module is further configured to:
performing reverse word segmentation on the first processing result and removing the separator, to obtain a second processing result of the text.
In one possible implementation, the text after inserting the separator further includes: a protection string, which identifies the portions of the text whose characters are the same as the separator and indicates that those portions are not separators.
In one possible implementation, the first processing result further includes the protection string; the processing module is further configured to:
remove the protection string from the first processing result.
In a third aspect, embodiments of the present application provide a data processing apparatus, which may include a memory, a processor, and a bus system, where the memory is configured to store a program, and the processor is configured to execute the program in the memory, so as to perform the method according to the first aspect and any optional method thereof.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, which when run on a computer, causes the computer to perform the above-described first aspect and any of its optional methods.
In a fifth aspect, embodiments of the present application provide a computer program which, when run on a computer, causes the computer to perform the above first aspect and any of its alternative methods.
In a sixth aspect, the present application provides a chip system comprising a processor configured to support a data processing apparatus in performing the functions involved in the above aspects, for example, to transmit or process the data and/or information involved in the above methods. In one possible design, the chip system further includes a memory for holding the program instructions and data necessary for the execution device or the training device. The chip system may be composed of chips, or may include chips and other discrete devices.
Drawings
FIG. 1A is a schematic diagram of a structure of an artificial intelligence main body frame;
FIGS. 1B and 1C are illustrations of an application system framework of the present invention;
FIG. 1D is a schematic diagram of an alternative hardware architecture of a terminal;
FIG. 2 is a schematic diagram of a server;
FIGS. 3-5 are schematic illustrations of a system architecture of the present application;
FIG. 6 is a flow of a cloud service;
FIG. 7 is a flow of a cloud service;
FIG. 8 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 9 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a training device according to an embodiment of the present disclosure;
fig. 13 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention. The terminology used in the description of the embodiments of the invention herein is for the purpose of describing particular embodiments of the invention only and is not intended to be limiting of the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As one of ordinary skill in the art can appreciate, with the development of technology and the appearance of new scenes, the technical solutions provided in the embodiments of the present application are applicable to similar technical problems.
The terms "first," "second," and the like in the description, the claims, and the above-described figures of the present application are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely used to distinguish between objects of the same nature in the embodiments described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The terms "basic," "about," and the like are used herein as terms of approximation, not as terms of degree, and are intended to account for inherent deviations in measured or calculated values that would be known to one of ordinary skill in the art. Furthermore, the use of "may" in describing embodiments of the present invention refers to "one or more possible embodiments". The terms "use," "using," and "used" as used herein may be regarded as synonymous with the terms "utilize," "utilizing," and "utilized," respectively. In addition, the term "exemplary" is intended to refer to an instance or illustration.
Referring to fig. 1A, fig. 1A shows a schematic structural diagram of an artificial intelligence main body framework, which is described below from the two dimensions of the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to processing; for example, it may include the general procedures of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" condensation process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of human intelligence and information (the technology for providing and processing it) to the industrial ecology of the system.
(1) Infrastructure of
The infrastructure provides computing capability support for the artificial intelligence system, communicates with the outside world, and provides support through the base platform. It communicates with the outside through sensors; computing power is provided by smart chips (CPU, NPU, GPU, ASIC, FPGA, and other hardware acceleration chips); the base platform includes a distributed computing framework, a network, and other related platform guarantees and support, and may include cloud storage, computing, and interconnection networks, etc. For example, a sensor communicates with the outside to obtain data, and the data is provided to smart chips in the distributed computing system provided by the base platform for computation.
(2) Data
The data of the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence. The data relate to graphics, images, voice and text, and also relate to the internet of things data of the traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capability
After the data has been processed, some general-purpose capabilities can be formed based on the result of the data processing, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5) Intelligent product and industry application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of overall artificial intelligence solutions, turning intelligent information decision-making into deployed applications. The application fields mainly include: intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, etc.
The method and the device of the present application can be applied to the field of natural language processing within the field of artificial intelligence; taking natural language processing as an example, several application scenarios deployed in products are introduced below.
First, an application scenario of the present application is described, which may be, but not limited to, an application program applied to a natural language processing function (hereinafter may be simply referred to as a natural language processing class application program) or a cloud service provided by a cloud side server, and the application scenario is described below:
1. natural language processing class application
The product form of the embodiment of the application can be a natural language processing application program. The natural language processing class application may run on a terminal device or a server on the cloud side.
In one possible implementation, the natural language processing class application may perform tasks for natural language processing, such as abstract generation, text reply, text generation, and the like.
It should be appreciated that natural language processing may be implemented based on a language model, and more particularly, the present application may be applied to word segmentation processing prior to inputting the language model.
In one possible implementation, a user may open a natural language processing class application installed on a terminal device and input text, where the natural language processing class application may obtain a natural language processing result through a method provided by an embodiment of the present application, and present the natural language processing result to the user (a presentation manner may be, but is not limited to, displaying, saving, uploading to a cloud side, etc.).
In one possible implementation, a user may open a natural language processing class application installed on a terminal device and input a text, where the natural language processing class application may perform natural language processing on the text by using a method provided by an embodiment of the present application, and present a natural language processing result to the user (a presentation manner may be, but is not limited to, displaying, saving, uploading to a cloud side, etc.).
In one possible implementation, a user may open a natural language processing class application installed on the terminal device and input a text, where the natural language processing class application may send the text to a server on the cloud side, and the server on the cloud side processes the text by using a method provided by the embodiment of the present application and returns a natural language processing result to the terminal device, and the terminal device may present the natural language processing result to the user (a presentation manner may be, but not limited to, displaying, saving, uploading to the cloud side, and so on).
In one possible implementation, a user may open a natural language processing class application installed on the terminal device and input a text, where the natural language processing class application may send the text to a server on the cloud side, and the server on the cloud side performs natural language processing on the text by using a method provided by the embodiment of the present application and returns a natural language processing result to the terminal device, and the terminal device may present the natural language processing result to the user (a presentation manner may be, but not limited to, displaying, saving, uploading to the cloud side, and so on).
The natural language processing class application in the embodiments of the present application is next described separately from the functional architecture and the product architecture that implements the functionality.
Referring to fig. 1B, fig. 1B is a schematic functional architecture of a natural language processing application in an embodiment of the present application:
in one possible implementation, as shown in FIG. 1B, a natural language processing class application 102 may receive input parameters 101 (e.g., containing text) and generate predicted text or a text natural language processing result 103. The natural language processing class application 102 may be executed on, for example, at least one computer system, and includes computer instructions that, when executed by one or more computers, cause the computers to run a natural language model trained to perform the methods provided by embodiments of the present application.
Referring to fig. 1C, fig. 1C is a schematic diagram of an entity architecture for running a natural language processing class application in an embodiment of the present application:
referring to fig. 1C, fig. 1C shows a schematic diagram of a system architecture. The system may include a terminal 100 and a server 200. Wherein the server 200 may include one or more servers (illustrated in fig. 1C as including one server as an example), the server 200 may provide natural language processing function services for one or more terminals.
The terminal 100 may install a natural language processing application program thereon, or open a web page related to a natural language processing function, where the application program and the web page may provide an interface, the terminal 100 may receive relevant parameters input by a user on the natural language processing function interface and send the parameters to the server 200, and the server 200 may obtain a processing result based on the received parameters and return the processing result to the terminal 100.
It should be understood that, in some alternative implementations, the terminal 100 may also perform actions of obtaining the processing result based on the received parameters by itself, without requiring a server to cooperate with the implementation, which is not limited by the embodiments of the present application.
Next, the product form of the terminal 100 in fig. 1C will be described;
the terminal 100 in the embodiment of the present application may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (augmented reality, AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (personal digital assistant, PDA), or the like, which is not limited in any way.
Fig. 1D shows an alternative hardware architecture diagram of the terminal 100.
Referring to fig. 1D, the terminal 100 may include a radio frequency unit 110, a memory 120, an input unit 130, a display unit 140, a camera 150 (optional), an audio circuit 160 (optional), a speaker 161 (optional), a microphone 162 (optional), a processor 170, an external interface 180, a power supply 190, and the like. Those skilled in the art will appreciate that fig. 1D is merely an example of a terminal or multifunction device and is not limiting of the terminal or multifunction device and may include more or fewer components than shown, or may combine certain components, or different components.
The input unit 130 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the portable multifunction device. In particular, the input unit 130 may comprise a touch screen 131 (optional) and/or other input devices 132. The touch screen 131 may collect the user's touch operations on or near it (e.g., operations performed on or near the touch screen using any suitable object such as a finger, a joint, or a stylus) and drive the corresponding connection apparatus according to a preset program. The touch screen can detect a user's touch action on it, convert the touch action into a touch signal, send the touch signal to the processor 170, and receive and execute commands sent by the processor 170; the touch signal includes at least touch point coordinate information. The touch screen 131 may provide an input interface and an output interface between the terminal 100 and the user. In addition, the touch screen may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 130 may include other input devices in addition to the touch screen 131. In particular, the other input devices 132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys 132, switch keys 133, etc.), a trackball, a mouse, a joystick, etc.
Where the input device 132 may receive entered text, processing instructions, and the like.
The display unit 140 may be used to display information input by a user or information provided to the user, various menus of the terminal 100, an interactive interface, file display, and/or play of any of the multimedia files. In the embodiment of the present application, the display unit 140 may be used to display an interface of a natural language processing class application program or the like.
The memory 120 may be used to store instructions and data. The memory 120 may mainly include an instruction storage area and a data storage area, where the data storage area may store various data, such as multimedia files and text, and the instruction storage area may store software units such as the operating system, applications, and the instructions required for at least one function, or subsets and extended sets thereof. The memory 120 may also include non-volatile random access memory, and provides the processor 170 with management of the hardware, software, and data resources in the computing processing device, supporting the control software and applications. The memory 120 is also used to store multimedia files, as well as running programs and applications.
The processor 170 is the control center of the terminal 100. It connects the various parts of the entire terminal 100 using various interfaces and lines, and performs the various functions of the terminal 100 and processes data by running or executing the instructions stored in the memory 120 and invoking the data stored in the memory 120, thereby controlling the terminal device as a whole. Optionally, the processor 170 may include one or more processing units; preferably, the processor 170 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, application programs, etc., and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 170. In some embodiments, the processor and the memory may be implemented on a single chip, or they may be implemented separately on independent chips. The processor 170 may be further configured to generate corresponding operation control signals for the corresponding components of the computing processing device, and to read and process data in software, in particular to read and process the data and programs in the memory 120, so that each functional module therein performs its corresponding function, thereby controlling the corresponding components to act as required by the instructions.
The memory 120 may be used to store software code related to the data processing method, and the processor 170 may execute the steps of the data processing method, or may schedule other units (such as the input unit 130 and the display unit 140) to implement corresponding functions.
The radio frequency unit 110 (optional) may be configured to receive and send information, or to receive and send signals during a call; for example, after receiving downlink information from a base station, it delivers the information to the processor 170 for processing, and in addition, it sends uplink data to the base station. Typically, the RF circuitry includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (Low Noise Amplifier, LNA), a duplexer, and the like. In addition, the radio frequency unit 110 may also communicate with network devices and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including, but not limited to, global system for mobile communications (Global System of Mobile communication, GSM), general packet radio service (General Packet Radio Service, GPRS), code division multiple access (Code Division Multiple Access, CDMA), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), long term evolution (Long Term Evolution, LTE), email, short message service (Short Messaging Service, SMS), and the like.
In this embodiment, the radio frequency unit 110 may send the text to the server 200, and receive the predicted text or the natural language processing result sent by the server 200.
It should be appreciated that the radio frequency unit 110 is optional and may be replaced with another communication interface, such as a wired network interface.
The terminal 100 also includes a power supply 190 (e.g., a battery) for powering the various components, which may be logically connected to the processor 170 via a power management system, such as a power management system that performs functions such as charge, discharge, and power consumption management.
The terminal 100 further includes an external interface 180, which may be a standard Micro USB interface, or a multi-pin connector, which may be used to connect the terminal 100 to communicate with other devices, or may be used to connect a charger to charge the terminal 100.
Although not shown, the terminal 100 may further include a flash, a wireless fidelity (wireless fidelity, WiFi) module, a bluetooth module, sensors of different functions, etc., which will not be described herein. Some or all of the methods described below may be applied in the terminal 100 as shown in fig. 1D.
Next, the product form of the server 200 in fig. 1C is described.
Fig. 2 provides a schematic structural diagram of a server 200, and as shown in fig. 2, the server 200 includes a bus 201, a processor 202, a communication interface 203, and a memory 204. Communication between processor 202, memory 204, and communication interface 203 is via bus 201.
The bus 201 may be a peripheral component interconnect (peripheral component interconnect, PCI) bus, an extended industry standard architecture (extended industry standard architecture, EISA) bus, or the like. Buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one thick line is shown in fig. 2, but this does not mean that there is only one bus or only one type of bus.
The processor 202 may be any one or more of a central processing unit (central processing unit, CPU), a graphics processor (graphics processing unit, GPU), a Microprocessor (MP), or a digital signal processor (digital signal processor, DSP).
The memory 204 may include volatile memory (volatile memory), such as random access memory (random access memory, RAM). The memory 204 may also include non-volatile memory (non-volatile memory), such as read-only memory (read-only memory, ROM), flash memory, a mechanical hard disk (hard disk drive, HDD), or a solid state drive (solid state drive, SSD).
The memory 204 may be used to store software code related to the data processing method, and the processor 202 may execute the software code to perform the steps of the data processing method, or may schedule other units to implement corresponding functions.
It should be appreciated that the terminal 100 and the server 200 may be centralized or distributed devices, and the processors (e.g., the processor 170 and the processor 202) in the terminal 100 and the server 200 may be hardware circuits (such as an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA), a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor, or a microcontroller, etc.), or a combination of these hardware circuits, for example, the processor may be a hardware system with an instruction execution function, such as a CPU, DSP, etc., or a hardware system without an instruction execution function, such as an ASIC, FPGA, etc., or a combination of the hardware system without an instruction execution function and a hardware system with an instruction execution function.
It should be understood that the steps related to the model reasoning process in the embodiments of the present application relate to AI-related operations, and the instruction execution architecture of the terminal device and the server is not limited to the architecture of the processor combined with the memory described above when performing AI operations. The system architecture provided in the embodiment of the present application is described in detail below with reference to fig. 5.
Fig. 5 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 5, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data acquisition system 560.
The execution device 510 includes a computing module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514. The calculation module 511 may include a target model/rule 501 therein, with the preprocessing module 513 and preprocessing module 514 being optional.
The executing device 510 may be a terminal device or a server running a natural language processing application.
The data acquisition device 560 is used to acquire training samples. The training samples may be text or the like. After the training samples are collected, the data collection device 560 stores the training samples in the database 530.
The training device 520 may train the neural network to be trained (e.g., the language model in embodiments of the present application) based on the training samples maintained in the database 530, to obtain the target model/rule 501.
It should be noted that, in practical applications, the training samples maintained in the database 530 are not necessarily all acquired by the data acquisition device 560, but may be received from other devices. It should be noted that the training device 520 is not necessarily completely based on the training samples maintained by the database 530 to perform training of the target model/rule 501, and it is also possible to obtain the training samples from the cloud or other places to perform model training, which should not be taken as a limitation of the embodiments of the present application.
The target model/rule 501 obtained by training according to the training device 520 may be applied to different systems or devices, such as the execution device 510 shown in fig. 5, where the execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (augmented reality, AR)/Virtual Reality (VR) device, a vehicle-mounted terminal, or the like, and may also be a server, or the like.
Specifically, the training device 520 may pass the trained model to the execution device 510.
In fig. 5, an execution device 510 configures an input/output (I/O) interface 512 for data interaction with external devices, and a user may input data (e.g., text, etc., in embodiments of the present application) to the I/O interface 512 through a client device 540.
The preprocessing module 513 and the preprocessing module 514 are configured to preprocess the input data received by the I/O interface 512. It should be appreciated that there may be no preprocessing module 513 and preprocessing module 514, or only one preprocessing module. When the preprocessing module 513 and the preprocessing module 514 are not present, the calculation module 511 may be directly employed to process the input data.
When the execution device 510 preprocesses the input data, or when the calculation module 511 of the execution device 510 performs computation or other related processing, the execution device 510 may call data, code, etc. in the data storage system 550 for corresponding processing, and may also store the data, instructions, etc. obtained by the corresponding processing into the data storage system 550.
Finally, the I/O interface 512 provides the processing results to the client device 540, and thus to the user.
In the case shown in FIG. 5, the user may manually give input data, which may be manipulated through an interface provided by I/O interface 512. In another case, the client device 540 may automatically send the input data to the I/O interface 512, and if the client device 540 is required to automatically send the input data requiring authorization from the user, the user may set the corresponding permissions in the client device 540. The user may view the results output by the execution device 510 at the client device 540, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 540 may also be used as a data collection terminal to collect input data from the input I/O interface 512 and output data from the output I/O interface 512 as new sample data, and store the new sample data in the database 530. Of course, instead of being collected by the client device 540, the I/O interface 512 may directly store the input data of the I/O interface 512 and the output result of the I/O interface 512 as new sample data into the database 530.
It should be noted that fig. 5 is only a schematic diagram of a system architecture provided in the embodiments of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawings is not limited in any way, for example, in fig. 5, the data storage system 550 is an external memory with respect to the execution device 510, and in other cases, the data storage system 550 may be disposed in the execution device 510. It should be appreciated that the execution device 510 described above may be deployed in a client device 540.
From the reasoning side of the model:
in this embodiment, the computing module 511 of the execution device 510 may obtain the text stored in the data storage system 550 to implement the steps related to the model reasoning process in this embodiment of the present application.
In this embodiment, the computing module 511 of the execution device 510 may include a hardware circuit (such as an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA), a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor, or a microcontroller, etc.), or a combination of these hardware circuits; for example, the computing module 511 may be a hardware system with an instruction execution function, such as a CPU or DSP, or a hardware system without an instruction execution function, such as an ASIC or FPGA, or a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function.
Specifically, the computing module 511 of the execution device 510 may be a hardware system with an instruction execution function; the steps related to the model reasoning process provided in the embodiments of the present application may be implemented as software code stored in a memory, and the computing module 511 of the execution device 510 may obtain the software code from the memory and execute the obtained software code to implement the steps related to the model reasoning process provided in the embodiments of the present application.
It should be understood that the computing module 511 of the execution device 510 may be a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function, and some of the steps related to the model reasoning process provided in the embodiments of the present application may also be implemented by the hardware system without an instruction execution function in the computing module 511 of the execution device 510, which is not limited herein.
From the training side of the model:
in this embodiment of the present application, the training device 520 may obtain text stored in a memory (not shown in fig. 5, and may be integrated into the training device 520 or disposed separately from the training device 520) to implement the steps related to model training in this embodiment of the present application.
In this embodiment, the training device 520 may include hardware circuits (such as an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA), a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor, or a microcontroller, etc.), or a combination of these hardware circuits, for example, the training device 520 may be a hardware system having an instruction execution function, such as a CPU, a DSP, etc., or a hardware system not having an instruction execution function, such as an ASIC, an FPGA, etc., or a combination of the above hardware systems not having an instruction execution function and a hardware system having an instruction execution function.
It should be understood that, the training device 520 may be a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function, and some steps related to training a model provided in the embodiment of the present application may also be implemented by a hardware system without an instruction execution function in the training device 520, which is not limited herein.
2. The server provides a cloud service of the natural language processing function:
in one possible implementation, the server may provide services of natural language processing functions to the end side through an application programming interface (application programming interface, API).
The terminal device may send relevant parameters (including text, for example) to the server through an API provided by the cloud, and the server may obtain a processing result based on the received parameters, and return the processing result to the terminal.
The description of the terminal and the server may be described in the above embodiments, and will not be repeated here.
Fig. 6 shows a flow of a natural language processing function cloud service provided by using a cloud platform.
1. And opening and purchasing natural language processing service.
2. The user may download a software development kit (software development kit, SDK) corresponding to the natural language processing service; generally, the cloud platform provides SDKs of a plurality of development versions for the user to select according to the requirements of the development environment, for example, a JAVA version of the SDK, a Python version of the SDK, a PHP version of the SDK, an Android version of the SDK, and the like.
3. After downloading the SDK of the corresponding version to the local environment as required, the user imports the SDK project into the local development environment, configures and debugs it there, and develops other functions, thereby forming an application that integrates the natural language processing function capability.
4. When the application is in use and natural language processing is required, the application can trigger an API call of the natural language processing function. When the application triggers the natural language processing function, an API request is initiated to an operation instance of the natural language processing function service in the cloud environment, where the API request carries text, and the operation instance in the cloud environment processes the text to obtain a processing result.
5. The cloud environment returns the processing result to the application, thereby completing one natural language processing function service call.
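The request/response cycle in steps 4 and 5 can be sketched as follows. The action name, the JSON fields, and the trivial "processing" performed by the stub operation instance are all illustrative assumptions for this sketch, not the actual cloud API.

```python
import json

def build_api_request(text: str) -> str:
    """Step 4: the application packs the text to be processed into an API request."""
    # "nlp_process" is a hypothetical action name used only for illustration.
    return json.dumps({"action": "nlp_process", "text": text})

def handle_api_request(request: str) -> str:
    """Stand-in for the cloud-side operation instance that processes the carried text."""
    payload = json.loads(request)
    # A trivial placeholder "processing result"; step 5 returns this to the application.
    result = {"status": "ok", "num_chars": len(payload["text"])}
    return json.dumps(result)

request = build_api_request("hello world")
response = json.loads(handle_api_request(request))
```

The essential point is only the shape of the exchange: the request carries the text, and the cloud environment returns a structured processing result.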
3. The word segmentation cloud service provided by the server comprises the following steps:
in one possible implementation, the server may provide the end-side with a service of word segmentation functionality through an application programming interface (application programming interface, API).
The terminal device may send relevant parameters (including text, for example) to the server through an API provided by the cloud, and the server may obtain a processing result based on the received parameters, and return the processing result to the terminal.
The description of the terminal and the server may be described in the above embodiments, and will not be repeated here.
Fig. 7 shows a flow of a word segmentation function cloud service provided by using a cloud platform.
1. Opening and purchasing word segmentation service.
2. The user may download a software development kit (software development kit, SDK) corresponding to the word segmentation service; generally, the cloud platform provides SDKs of a plurality of development versions for the user to select according to the requirements of the development environment, for example, a JAVA version of the SDK, a Python version of the SDK, a PHP version of the SDK, an Android version of the SDK, and the like.
3. After downloading the SDK of the corresponding version to the local environment as required, the user imports the SDK project into the local development environment, configures and debugs it there, and develops other functions, thereby forming an application that integrates the word segmentation function capability.
4. When the application is in use and word segmentation is required, the application can trigger an API call of the word segmentation function. When the application triggers the word segmentation function, an API request is initiated to an operation instance of the word segmentation function service in the cloud environment, where the API request carries text, and the operation instance in the cloud environment processes the text to obtain a processing result.
5. And the cloud environment returns the processing result to the application, so that the service call of the word segmentation function is completed once.
In order to better understand the schemes of the embodiments of the present application, a possible application scenario of the embodiments of the present application will be briefly described with reference to fig. 3 to 4.
Fig. 3 shows a natural language processing system including a user device and a data processing device. The user device includes an intelligent terminal such as a mobile phone, a personal computer, or an information processing center. The user device is the initiating end of natural language data processing, and serves as the initiator of a request such as a language question-and-answer or query; the user usually initiates the request through the user device.
The data processing device may be a device or server having a data processing function, such as a cloud server, a web server, an application server, or a management server. The data processing device receives a query sentence/voice/text or the like from the intelligent terminal through an interactive interface, performs language data processing by means of machine learning, deep learning, searching, reasoning, decision making, and the like, using a memory for storing data and a processor for data processing, and feeds the processing result back to the user device. The memory in the data processing device may be a generic term including a local store and a database storing historical data; the database may be on the data processing device or on another network server.
In the natural language processing system shown in fig. 3, a user device may receive an instruction of a user, for example, the user device may receive a piece of text input by the user, and then initiate a request to the data processing device, so that the data processing device performs a natural language processing application (e.g., natural language generation, text classification, text reasoning, named entity recognition, translation, etc.) on the piece of text obtained by the user device, thereby obtaining a processing result (e.g., a predicted word result, a classification result, a reasoning result, a named entity recognition result, a translation result, etc.) of a corresponding natural language processing application for the piece of text.
In this embodiment of the present application, the user device may receive an instruction of a user, for example, the user device may receive a piece of text (for example, text) input by the user, and then initiate a request to the data processing device, so that the data processing device executes a natural language processing application for the piece of text obtained by the user device, thereby obtaining a processing result of a corresponding natural language processing application for the piece of text.
Fig. 4 shows another natural language processing system, in fig. 4, a user device is directly used as a data processing device, and the user device can directly receive input from a user and directly process the input by hardware of the user device, and a specific process is similar to that of fig. 3, and reference is made to the above description and will not be repeated here.
Fig. 4 is a schematic diagram of a related device 300 for natural language processing provided in an embodiment of the present application.
The user device in fig. 3 and fig. 4 may be specifically the local device 301 or the local device 302 in fig. 4, and the data processing device in fig. 3 may be specifically the executing device 310 in fig. 4, where the data storage system 350 may store data to be processed of the executing device 310, and the data storage system 350 may be integrated on the executing device 310, or may be disposed on a cloud or other network server.
The processors in fig. 3 and 4 may perform data training/machine learning/deep learning through a neural network model or other models, and perform natural language processing or word segmentation on text data (e.g., text described in the embodiments of the present application) by using a model (e.g., a language model in the embodiments of the present application, etc.) obtained by final training or learning of the data, thereby obtaining corresponding processing results.
Since the embodiments of the present application relate to a large number of applications of neural networks, for ease of understanding, related terms and related concepts of the neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an arithmetic unit that takes xs (i.e., input data) and an intercept of 1 as inputs, and the output of the arithmetic unit may be:

h_{W,b}(x) = f(W^T x) = f(Σ_{s=1}^{n} W_s·x_s + b)
Where s = 1, 2, …, n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function (activation function) of the neural unit, used to introduce a nonlinear characteristic into the neural network to convert an input signal in the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining together a plurality of such single neural units; that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of the local receptive field; the local receptive field may be an area composed of several neural units.
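As a minimal sketch of the neural unit just described, the following computes f(Σ Ws·xs + b) with a sigmoid activation; the input, weight, and bias values are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    """Sigmoid activation f, mapping the weighted sum to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(xs, ws, b):
    """Weighted sum of the inputs plus the bias, passed through the activation f."""
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return sigmoid(z)

# Illustrative values: z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, so f(z) = 0.5.
out = neural_unit(xs=[1.0, 2.0], ws=[0.5, -0.25], b=0.0)
```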
(2) Transformer layer
The neural network includes an embedding layer and at least one transformer layer; the at least one transformer layer may be N transformer layers (N is an integer greater than 0), where each transformer layer includes an attention layer, an add and normalization (add & norm) layer, a feed forward layer, and an add and normalization layer that are sequentially adjacent. At the embedding layer, the current input is embedded to obtain a plurality of embedding vectors; at the attention layer, P input vectors are obtained from the layer above the first transformer layer, and with any first input vector among the P input vectors as a center, an intermediate vector corresponding to the first input vector is obtained based on the degree of association between the first input vector and each input vector within a preset attention window range; in this way, the P intermediate vectors corresponding to the P input vectors are determined. At the pooling layer, the P intermediate vectors are merged into Q output vectors, where the plurality of output vectors obtained by the last transformer layer among the transformer layers are used as the feature representation of the current input.
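The sub-layer ordering described above (attention, add & norm, feed forward, add & norm) can be sketched schematically. The attention and feed-forward sub-layers below are toy stand-ins used only to show the residual-plus-normalization wiring; real layers use learned projections over all positions.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (the 'norm' step)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def add_and_norm(v, sublayer_out):
    """Residual connection followed by layer normalization (the 'add & norm' step)."""
    return layer_norm([a + b for a, b in zip(v, sublayer_out)])

def toy_attention(v):
    # Placeholder sub-layer: real attention mixes information across positions.
    return v[:]

def toy_feed_forward(v):
    # Placeholder position-wise feed-forward sub-layer: a fixed ReLU nonlinearity.
    return [max(0.0, x) for x in v]

def transformer_layer(v):
    """One layer: attention -> add & norm -> feed forward -> add & norm."""
    v = add_and_norm(v, toy_attention(v))
    v = add_and_norm(v, toy_feed_forward(v))
    return v

out = transformer_layer([1.0, 2.0, 3.0])
```

Because the final step is a layer normalization, the output vector always has (approximately) zero mean, whatever the input.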
(3) Attention mechanism (attention mechanism)
The attention mechanism mimics the internal process of biological observation behavior, i.e., a mechanism that aligns internal experience with external sensation to increase the observation precision of a partial region, enabling rapid screening of high-value information from a large amount of information with limited attention resources. The attention mechanism can quickly extract important features of sparse data and is thus widely used for natural language processing tasks, particularly machine translation. The self-attention mechanism (self-attention mechanism) is an improvement of the attention mechanism, which reduces reliance on external information and is better at capturing the internal dependencies of data or features. The essential idea of the attention mechanism can be expressed by the following formula:

Attention(Query, Source) = Σ_{i=1}^{Lx} Similarity(Query, Key_i) · Value_i

Where Lx = |Source| represents the length of Source. The meaning of the formula is that the constituent elements of Source are imagined to be composed of a series of <Key, Value> data pairs; given an element Query in a Target, the weight coefficient of the Value corresponding to each Key is obtained by calculating the similarity or correlation between the Query and that Key, and the Values are then weighted and summed to obtain the final Attention value. The attention mechanism essentially performs a weighted summation over the Values of the elements in Source, with Query and Key used to calculate the weight coefficients of the corresponding Values. Conceptually, attention can be understood as selectively screening out a small amount of important information from a large amount of information and focusing on it, while ignoring most of the unimportant information. The focusing process is embodied in the calculation of the weight coefficients: the larger the weight, the more focus falls on the corresponding Value; i.e., the weight represents the importance of the information, and the Value is the corresponding information. The self-attention mechanism can be understood as internal attention (intra attention); whereas the attention mechanism occurs between the element Query of the Target and all elements in the Source, the self-attention mechanism refers to the attention mechanism occurring between elements within the Source or between elements within the Target, and can also be understood as the attention computation in the special case of Target = Source. The specific computation process is the same; only the computation objects change.
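The weighted-sum computation just described can be sketched directly: the similarity of the Query with each Key gives a weight for the corresponding Value, the weights are normalized with a softmax, and the Values are weighted and summed. Dot product is used here as the similarity measure, and all vectors are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Normalize similarity scores into weight coefficients that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Attention(Query, Source) = sum_i Similarity(Query, Key_i) * Value_i."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

out = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],   # the query matches the first key more strongly
    values=[[10.0], [20.0]],
)
```

Because the query is more similar to the first key, the result lies between the two Values but closer to the first one.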
(4) Natural language processing (natural language processing, NLP)
Natural language (natural language) is human language, and natural language processing (NLP) is the processing of human language. Natural language processing is a process of systematically analyzing, understanding, and extracting information from text data in an intelligent and efficient manner. By using NLP and its components, we can manage very large blocks of text data or perform a large number of automated tasks, and solve a wide variety of problems such as automatic summarization (automatic summarization), machine translation (machine translation, MT), named entity recognition (named entity recognition, NER), relation extraction (relation extraction, RE), information extraction (information extraction, IE), emotion analysis, speech recognition (speech recognition), question answering systems (question answering), and topic segmentation, among others.
(5) Pre-training language model (pre-trained language model)
The pre-training language model is a natural language sequence encoder that encodes each word in a natural language sequence into a vector representation for performing a prediction task. Its training involves two phases. In the pre-training phase, the model performs training on language model tasks over large-scale unsupervised text, thereby learning word representations. In the fine tuning (fine tuning) phase, the model is initialized with the parameters learned in the pre-training phase and trained for relatively few steps on downstream tasks (downstream tasks) such as text classification (text classification) and sequence labeling (sequence labeling), so that the semantic information obtained by pre-training can be successfully migrated to the downstream tasks.
(6) Back propagation algorithm
The convolutional neural network may use a back propagation (back propagation, BP) algorithm during training to correct the values of the parameters in the initial super-resolution model, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is passed forward until the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation motion dominated by the error loss, and aims to obtain the parameters of the optimal super-resolution model, such as a weight matrix.
(7) Loss function
In training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value actually desired, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the actually desired target value (of course, there is usually an initialization process before the first update, i.e., parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so the training of the deep neural network becomes a process of reducing this loss as much as possible.
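The loop just described, of comparing the prediction with the target through a loss function and adjusting the weight in the direction that reduces the loss, can be sketched numerically. The one-weight model, squared-error loss, learning rate, and target value are all illustrative.

```python
def predict(w, x):
    """A one-parameter 'network': prediction is just w * x."""
    return w * x

def loss(pred, target):
    """Squared-error loss: larger output means a larger difference."""
    return (pred - target) ** 2

def train_step(w, x, target, lr=0.1):
    """Adjust w against the gradient of the loss (the back-propagated error)."""
    pred = predict(w, x)
    grad = 2 * (pred - target) * x   # d(loss)/dw
    return w - lr * grad

w = 0.0                              # pre-configured initial parameter
for _ in range(50):
    w = train_step(w, x=1.0, target=3.0)

final_loss = loss(predict(w, 1.0), 3.0)
```

After repeated adjustment the prediction approaches the target value 3.0 and the loss approaches zero, which is exactly the "reduce the loss as much as possible" process described above.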
(8) Text sequence (text sequence)
A sequence of characters, such as a natural language word, or a piece of program code.
(9) Vocabulary (vocabulary)
A vocabulary is the list of words used by the language model; each word is a character string of arbitrary length without spaces, such as a complete natural language word, or a part of a complete word (also called a subword).
(10) Out-of-vocabulary word (out of vocabulary)
A word encountered during language model inference that is not present in the vocabulary.
(11) Text tokenization (text tokenization)
The text sequence is segmented according to certain rules, with the aim of reducing the possibility of out-of-vocabulary words occurring.
Existing natural language processing systems process input text by using the byte pair encoding (Byte Pair Encoding, BPE) or SentencePiece algorithm, segmenting each space-bounded character string in the text according to certain rules. The aim is to greatly reduce the occurrence of out-of-vocabulary words during reasoning and to improve the generalization and fault tolerance of the language model. If no segmentation is performed, all words contained in the language must appear in the training text; otherwise, words that did not appear (such as misspellings) are treated as out-of-vocabulary words during reasoning, which greatly affects the fault tolerance and reasoning capability of the language model. In addition, without word segmentation, certain words that are homologous but of different word forms would be regarded by the language model as mutually independent, so their common semantics could not be fully learned, reducing the generalization capability of the language model.
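One merge step of a BPE-style procedure over space-bounded strings can be sketched as follows: count adjacent symbol pairs across the corpus and merge the most frequent pair into a single new symbol. Real BPE repeats this until a target vocabulary size is reached; the toy corpus here is made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most frequent one."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("lower"), list("lowest"), list("low")]
pair = most_frequent_pair(corpus)   # ('l','o') and ('o','w') both occur 3 times
corpus = merge_pair(corpus, pair)
```

Repeated merges of this kind are what build up the subword vocabulary that lets homologous word forms share pieces.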
However, these algorithms focus on obtaining a better subword distribution by building a mathematical model, and (partly) neglect the inconsistency that properties of the language itself introduce into segmentation: the same word may be segmented differently in different scenarios. A common remedy is to introduce a text pre-processing (and post-processing) module before (and after) segmentation. Although such schemes mitigate the inconsistency, they often introduce noise that makes segmentation lossy, i.e., the input text can no longer be recovered after text pre-processing, word segmentation, reverse word segmentation, and text post-processing.
In order to solve the above problems, embodiments of the present application provide a data processing method. The following describes the data processing method of the embodiment of the present application in detail with reference to the accompanying drawings.
Referring to fig. 8, fig. 8 is a flowchart of a data processing method provided in an embodiment of the present application, and as shown in fig. 8, the data processing method provided in the embodiment of the present application may include steps 801 to 803, which are respectively described in detail below.
801. Text is obtained.
In one possible implementation, the obtained text may be text that requires word segmentation or text that requires subsequent natural language processing. It may also be referred to as a text sequence, and includes a plurality of characters and character strings.
802. And inserting a separator into the text, wherein the separator is a character string without blank space, and the separator is used for identifying the character separation position in the text.
In the prior art, word segmentation consistency across different scenarios is improved by inserting spaces as separators at corresponding positions in the text. Using a space as the separator is simple and convenient, but because it is formally identical to naturally occurring spaces, post-processing algorithms in some scenarios cannot accurately delete the inserted spaces; the original text sequence then cannot be fully recovered, resulting in lossy word segmentation.
In the embodiment of the application, a non-space character string (which may also be called an incomplete space) is defined as the separator, for example a character string that starts with one or more non-space characters and ends with one or more space characters. Because such non-space characters are uncommon in natural language, the inserted separators can be accurately deleted in post-processing, thereby achieving lossless word segmentation.
Illustratively, the separator is the character corresponding to the source code "\u2582". This character may be the lower one quarter block element.
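Assuming the separator really is U+2582 (the lower one quarter block, "▂"), a quick check shows why it is easy to distinguish from a natural space during post-processing:

```python
# U+2582 is a printable block-drawing character, not whitespace, so a
# post-processor can delete inserted separators without touching real spaces.
sep = "\u2582"
print(repr(sep), sep.isspace())  # '▂' False
print(repr(" "), " ".isspace())  # ' ' True
```

Because the character is not whitespace and virtually never occurs in ordinary text, deleting every occurrence of it is safe once text-owned occurrences are protected (see below).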
In one possible implementation, the non-space character string described above may be added to the word segmentation model as a separator. For example, the text may be examined character by character to determine whether the condition for inserting a separator is satisfied.
In word segmentation of text based on a word segmentation model, separators may first be inserted into the text. For example, in one possible implementation, a separator may be inserted at the position immediately following each punctuation mark included in the text; that is, since a word may immediately follow a punctuation mark (including the period), a separator is inserted after the punctuation mark.
For example, in one possible implementation, a separator may be inserted at positions in the text where the text switches between different languages without an intervening space; that is, where two languages meet and are not separated by a space, a separator is inserted between them.
For example, in one possible implementation, a separator may be inserted immediately after the last space in a run of consecutive spaces included in the text; that is, at a position with consecutive spaces, the separator is inserted after the last space.
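The three insertion rules above might be sketched as follows (the regular expressions, the CJK/Latin boundary used for the "different languages" rule, and the rule ordering are all assumptions for illustration, not the patent's implementation):

```python
import re

SEP = "\u2582"  # separator character assumed from the text

def insert_separators(text):
    # Rule 1: insert a separator immediately after punctuation marks.
    text = re.sub(r"([.,!?;:])", r"\1" + SEP, text)
    # Rule 2: insert a separator where the script switches without a space
    # (illustrated here with a CJK/Latin boundary only).
    text = re.sub(r"([\u4e00-\u9fff])([A-Za-z])", r"\1" + SEP + r"\2", text)
    text = re.sub(r"([A-Za-z])([\u4e00-\u9fff])", r"\1" + SEP + r"\2", text)
    # Rule 3: insert a separator after the last space in a run of spaces.
    text = re.sub(r"( {2,})", r"\1" + SEP, text)
    return text

print(insert_separators("hello,world"))  # hello,▂world
```

Each rule marks a boundary that a space-based scheme would either miss or mark ambiguously, which is what enables the consistent segmentation described above.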
In one possible implementation, the text may also contain character strings composed of the same characters as the separator. These strings belong to the text itself and are not separators, and should not be deleted during post-processing. Therefore, such character strings can be protected in the text: the non-space character portion of any separator-like string appearing in the input text is wrapped in a string-embedding structure, forming a protected character string. The text after inserting the separator thus further includes a protection character string, which is used to identify text whose characters are the same as the separator and to indicate that such text is not a separator.
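One way the protection could work is to wrap pre-existing occurrences of the separator character in marker characters of its own (the marker characters below are hypothetical; the patent does not specify the embedding structure):

```python
SEP = "\u2582"
# Hypothetical protection markers; any strings unlikely to occur in text work.
P_OPEN, P_CLOSE = "\u2583", "\u2584"

def protect(text):
    # Wrap separator characters that belong to the input text itself,
    # so post-processing will not delete them.
    return text.replace(SEP, P_OPEN + SEP + P_CLOSE)

def unprotect(text):
    # Post-processing step: strip the protection structure, restoring
    # the original, text-owned separator characters.
    return text.replace(P_OPEN + SEP + P_CLOSE, SEP)

s = "the block char \u2582 appears in the input"
assert unprotect(protect(s)) == s  # round-trips losslessly
```

After `protect`, a post-processor can delete every bare separator while leaving the wrapped, text-owned ones intact, then call `unprotect` to restore them.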
803. And performing word segmentation on the text inserted with the separator to obtain a word segmentation result.
The word segmentation result may include a plurality of word segmentation units, and through the design of the separator, the same word yields the same segmentation in different scenarios.
In one possible implementation, a first processing result of the text may also be obtained through a language model according to the word segmentation result. The first processing result includes a plurality of word segmentation units and the separator. To obtain a processing result corresponding to the text, post-processing may be performed on the first processing result. Post-processing may include, but is not limited to, reverse word segmentation and removal of the separator; that is, reverse word segmentation may be performed on the first processing result and the separator removed, to obtain a second processing result of the text.
In one possible implementation, the first processing result further includes the protection string, and the post-processing may further include: and removing the protection character string in the first processing result.
More specifically, reference may be made to fig. 9, where fig. 9 is a flow chart illustrating a process of inserting separator and protection string, and a process of word segmentation and post-processing (including reverse word segmentation, separator removal, and string protection structure removal) in accordance with an embodiment of the present application.
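The whole pipeline of steps 801 to 803 plus post-processing can be sketched end to end; the toy tokenizer and the single insertion rule below are stand-ins for the real word segmentation model, chosen only to show the lossless round trip:

```python
SEP = "\u2582"

def insert_sep(text):
    # One illustrative insertion rule: a separator after each comma.
    return text.replace(",", "," + SEP)

def tokenize(text):
    return text.split(" ")          # stand-in for the segmentation model

def detokenize(tokens):
    return " ".join(tokens)         # "reverse word segmentation"

def postprocess(text):
    return text.replace(SEP, "")    # remove every inserted separator

original = "hi,there general kenobi"
restored = postprocess(detokenize(tokenize(insert_sep(original))))
assert restored == original  # the input text is fully recovered
```

Because the separator is a non-space character, `postprocess` can delete exactly the inserted characters, which is the lossless property a space-based separator cannot guarantee.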
Referring to fig. 10, fig. 10 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 10, a data processing apparatus 1000 according to an embodiment of the present application includes:
an obtaining module 1001, configured to obtain text;
the specific description of the obtaining module 1001 may refer to the description of step 801 in the above embodiment, which is not repeated here.
A processing module 1002 configured to insert a separator into the text, the separator being a non-space character string, the separator being configured to identify a location of a separation in the text; and performing word segmentation on the text inserted with the separator to obtain a word segmentation result.
For a specific description of the processing module 1002, reference may be made to the description of step 802 and step 803 in the above embodiment, which is not repeated here.
In one possible implementation, the separator is a character corresponding to the source code "\u2582".
In one possible implementation, the processing module 1002 is specifically configured to:
inserting separators into the text based on at least one of:
inserting separators at adjacent positions after punctuation marks included in the text;
inserting separators at positions in the text where the text switches between different languages and is not separated by a space;
A separator is inserted adjacent to a space subsequent to a last of the successive spaces included in the text.
In one possible implementation, the processing module 1002 is further configured to:
and obtaining a first processing result of the text through a language model according to the word segmentation result.
In one possible implementation, the first processing result includes a plurality of word segmentation units and the separator; the processing module 1002 is further configured to:
and performing reverse word segmentation on the first processing result, and removing the separator to obtain a second processing result of the text.
In one possible implementation, the text after the insertion of the separator further includes: a protection character string, which is used to identify text in the text having the same characters as the separator and to indicate that the text having the same characters as the separator is not the separator.
In one possible implementation, the first processing result further includes the protection string; the processing module 1002 is further configured to:
and removing the protection character string in the first processing result.
Next, referring to fig. 11, fig. 11 is a schematic structural diagram of an execution device provided in the embodiment of the present application, where the execution device 1100 may be specifically represented by a virtual reality VR device, a mobile phone, a tablet, a notebook, an intelligent wearable device, a monitoring data processing device, or a server, which is not limited herein. Specifically, the execution apparatus 1100 includes: a receiver 1101, a transmitter 1102, a processor 1103 and a memory 1104 (where the number of processors 1103 in the execution device 1100 may be one or more, one processor is exemplified in fig. 11), wherein the processor 1103 may comprise an application processor 11031 and a communication processor 11032. In some embodiments of the present application, the receiver 1101, transmitter 1102, processor 1103 and memory 1104 may be connected by a bus or other means.
The memory 1104 may include read-only memory and random access memory and provides instructions and data to the processor 1103. A portion of the memory 1104 may also include non-volatile random access memory (non-volatile random access memory, NVRAM). The memory 1104 stores a processor and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, wherein the operating instructions may include various operating instructions for implementing various operations.
The processor 1103 controls the operation of the execution device. In a specific application, the individual components of the execution device are coupled together by a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, etc. For clarity of illustration, however, the various buses are referred to in the figures as bus systems.
The method disclosed in the embodiments of the present application may be applied to the processor 1103 or implemented by the processor 1103. The processor 1103 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the method described above may be performed by integrated logic circuitry in hardware or instructions in software in the processor 1103. The processor 1103 may be a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor or a microcontroller, and may further include an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The processor 1103 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in hardware, in a decoded processor, or in a combination of hardware and software modules in a decoded processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 1104, and the processor 1103 reads the information in the memory 1104 and, in combination with its hardware, performs the steps of the above method that involve the model reasoning process.
The receiver 1101 is operable to receive input numeric or character information and to generate signal inputs related to performing relevant settings and function control of the device. The transmitter 1102 may be used to output numeric or character information through a first interface; the transmitter 1102 may also be configured to send instructions to the disk stack via the first interface to modify data in the disk stack; the transmitter 1102 may also include a display device such as a display screen.
Referring to fig. 12, fig. 12 is a schematic structural diagram of the training device provided in the embodiment of the present application. Specifically, the training device 1200 is implemented by one or more servers and may vary considerably depending on its configuration or performance; it may include one or more central processing units (central processing unit, CPU) 1212 (e.g., one or more processors), a memory 1232, and one or more storage media 1230 (e.g., one or more mass storage devices) storing application programs 1242 or data 1244. The memory 1232 and the storage medium 1230 may be transitory or persistent. The program stored on the storage medium 1230 may include one or more modules (not shown), each of which may include a series of instruction operations for the training device. Still further, the central processor 1212 may be configured to communicate with the storage medium 1230 to execute, on the training device 1200, the series of instruction operations in the storage medium 1230.
Training apparatus 1200 may also include one or more power sources 1226, one or more wired or wireless network interfaces 1250, one or more input/output interfaces 1258, or one or more operating systems 1241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In this embodiment, the central processor 1212 is configured to perform the actions related to model training in the above embodiment.
Embodiments of the present application also provide a computer program product that, when run on a computer, causes the computer to perform the steps performed by the aforementioned performing device, or causes the computer to perform the steps performed by the aforementioned training device.
There is also provided in an embodiment of the present application a computer-readable storage medium having stored therein a program for performing signal processing, which when run on a computer, causes the computer to perform the steps performed by the aforementioned performing device or causes the computer to perform the steps performed by the aforementioned training device.
The execution device, training device or terminal device provided in the embodiment of the present application may specifically be a chip, where the chip includes: a processing unit, which may be, for example, a processor, and a communication unit, which may be, for example, an input/output interface, pins or circuitry, etc. The processing unit may execute the computer-executable instructions stored in the storage unit to cause the chip in the execution device to perform the data processing method described in the above embodiment, or to cause the chip in the training device to perform the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, etc., and the storage unit may also be a storage unit in the wireless access device side located outside the chip, such as a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a random access memory (random access memory, RAM), etc.
Specifically, referring to fig. 13, fig. 13 is a schematic structural diagram of a chip provided in an embodiment of the present application, where the chip may be represented as a neural network processor NPU 1300, and the NPU 1300 is mounted as a coprocessor on a main CPU (Host CPU), and the Host CPU distributes tasks. The core part of the NPU is an arithmetic circuit 1303, and the controller 1304 controls the arithmetic circuit 1303 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 1303 includes a plurality of processing units (PEs) inside. In some implementations, the operation circuit 1303 is a two-dimensional systolic array. The arithmetic circuit 1303 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1303 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to matrix B from the weight memory 1302 and buffers it on each PE in the operation circuit. The operation circuit then takes matrix A data from the input memory 1301, performs a matrix operation with matrix B, and stores the partial or final result of the matrix in the accumulator (accumulator) 1308.
Unified memory 1306 is used to store input data and output data. The weight data is transferred directly to the weight memory 1302 through the direct memory access controller (Direct Memory Access Controller, DMAC) 1305. The input data is also carried into the unified memory 1306 through the DMAC.
The bus interface unit (Bus Interface Unit, BIU) 1310 is used for the interaction of the AXI bus with the DMAC and the instruction fetch buffer (Instruction Fetch Buffer, IFB) 1309. Specifically, the bus interface unit 1310 is used by the instruction fetch memory 1309 to obtain instructions from the external memory, and is further used by the memory unit access controller 1305 to obtain the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1306 or to transfer weight data to the weight memory 1302 or to transfer input data to the input memory 1301.
The vector calculation unit 1307 includes a plurality of operation processing units that, when necessary, perform further processing on the output of the operation circuit 1303, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and size comparison. It is mainly used for non-convolution/fully connected layer computation in the neural network, such as batch normalization (batch normalization), pixel-level summation, and up-sampling of feature planes.
In some implementations, the vector calculation unit 1307 can store the processed output vector to the unified memory 1306. For example, the vector calculation unit 1307 may apply a linear function, or a nonlinear function, to the output of the operation circuit 1303, for example linearly interpolating the feature planes extracted by the convolution layers, or accumulating vectors of values to generate activation values. In some implementations, the vector calculation unit 1307 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 1303, for example for use in subsequent layers of the neural network.
An instruction fetch memory (instruction fetch buffer) 1309 connected to the controller 1304 for storing instructions used by the controller 1304;
the unified memory 1306, the input memory 1301, the weight memory 1302, and the instruction fetch memory 1309 are all on-chip memories. The external memory is a memory external to the NPU hardware architecture.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
It should be further noted that the above-described apparatus embodiments are merely illustrative, and that the units described as separate units may or may not be physically separate, and that units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the device provided by the application, the connection relation between the modules represents that the modules have communication connection therebetween, and can be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, functions performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software implementation is the preferred embodiment in most cases. Based on such understanding, the technical solution of the present application, or the part contributing to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, including several instructions for causing a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the method described in the embodiments of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via a wired (e.g., coaxial cable, optical fiber, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be stored by a computer or a data storage device such as a training device, a data center, or the like that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy Disk, a hard Disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.

Claims (19)

1. A method of data processing, the method comprising:
acquiring a text;
inserting a separator into the text, wherein the separator is a character string without blank spaces and is used for identifying the separation positions in the text;
and performing word segmentation on the text inserted with the separator to obtain a word segmentation result.
2. The method of claim 1, wherein the separator is a string starting with a non-space and ending with at least one space.
3. The method according to claim 1 or 2, wherein the separator is a character corresponding to a source code "\u2582".
4. A method according to any one of claims 1 to 3, wherein said inserting separators into said text comprises:
inserting separators into the text based on at least one of:
inserting separators at adjacent positions after punctuation marks included in the text;
inserting separators at positions in the text where the different languages are converted and are not separated by spaces;
a separator is inserted adjacent to a space subsequent to a last of the successive spaces included in the text.
5. The method according to any one of claims 1 to 4, further comprising:
and obtaining a first processing result of the text through a language model according to the word segmentation result.
6. The method according to any one of claims 1 to 5, wherein,
the first processing result comprises a plurality of word segmentation units and the separator; the method further comprises the steps of: performing reverse word segmentation on the first processing result, and removing the separator to obtain a second processing result of the text; or,
the method further comprises the steps of: and performing reverse word segmentation on the word segmentation result, and removing the separator to obtain a second processing result of the text.
7. The method of any of claims 1 to 6, wherein the text after the inserting of the separator further comprises: a protection character string, wherein the protection character string is used to identify text in the text having the same characters as the separator and to indicate that the text having the same characters as the separator is not the separator.
8. The method of claim 6 or 7, wherein the first processing result further comprises the protection string; the method further comprises the steps of:
And removing the protection character string in the first processing result.
9. A data processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring the text;
a processing module for inserting a separator into the text, the separator being a non-space character string, the separator being used to identify a location of a separation in the text; and performing word segmentation on the text inserted with the separator to obtain a word segmentation result.
10. The apparatus of claim 9, wherein the separator is a string starting with a non-space and ending with at least one space.
11. The apparatus of claim 9 or 10, wherein the separator is a character corresponding to a source code "\u2582".
12. The apparatus according to any one of claims 9 to 11, wherein the processing module is specifically configured to:
inserting separators into the text based on at least one of:
inserting separators at adjacent positions after punctuation marks included in the text;
inserting separators at positions in the text where the different languages are converted and are not separated by spaces;
a separator is inserted adjacent to a space subsequent to a last of the successive spaces included in the text.
13. The apparatus of any one of claims 9 to 12, wherein the processing module is further configured to:
and obtaining a first processing result of the text through a language model according to the word segmentation result.
14. The apparatus of any one of claims 9 to 13, wherein the first processing result includes a plurality of word segmentation units and the separator; the processing module is further configured to: performing reverse word segmentation on the first processing result, and removing the separator to obtain a second processing result of the text; or,
the processing module is further configured to: and performing reverse word segmentation on the word segmentation result, and removing the separator to obtain a second processing result of the text.
15. The apparatus of any of claims 9 to 14, wherein the text after the inserting of the separator further comprises: a protection character string, wherein the protection character string is used to identify text in the text having the same characters as the separator and to indicate that the text having the same characters as the separator is not the separator.
16. The apparatus of claim 14 or 15, wherein the first processing result further comprises the protection string; the processing module is further configured to:
And removing the protection character string in the first processing result.
17. A computer storage medium storing one or more instructions which, when executed by one or more computers, cause the one or more computers to perform the operations of the method of any one of claims 1 to 8.
18. A computer program product comprising computer readable instructions which, when run on a computer device, cause the computer device to perform the method of any of claims 1 to 8.
19. A system comprising at least one processor, at least one memory; the processor and the memory are connected through a communication bus and complete communication with each other;
the at least one memory is used for storing text;
the at least one processor is configured to process the text so as to perform the method of any one of claims 1 to 8.
CN202311786847.2A 2023-12-22 2023-12-22 Data processing method and device Pending CN117892700A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311786847.2A CN117892700A (en) 2023-12-22 2023-12-22 Data processing method and device


Publications (1)

Publication Number Publication Date
CN117892700A 2024-04-16

Family

ID=90649862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311786847.2A Pending CN117892700A (en) 2023-12-22 2023-12-22 Data processing method and device

Country Status (1)

Country Link
CN (1) CN117892700A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination