CN110795913A - Text encoding method and device, storage medium and terminal - Google Patents

Text encoding method and device, storage medium and terminal

Info

Publication number: CN110795913A (application CN201910939618.7A); granted as CN110795913B
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: data sample, text, probability, model, coding model
Inventors: 王鹏 (Wang Peng), 阚华 (Kan Hua)
Assignee (current and original): Beijing Dami Technology Co Ltd
Legal status: Granted; Active

Landscapes: Machine Translation (AREA)
Abstract

The embodiment of the application discloses a text encoding method, a text encoding device, a storage medium and a terminal. The method comprises the following steps: acquiring a target language text to be encoded; inputting the target language text into a pre-trained text coding model, where the text coding model is trained on a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold; and outputting the semantic vector corresponding to the target language text. By training the text coding model on sample data with established correlations, the semantic vectors output after text encoding with the trained model represent the input text more accurately.

Description

Text encoding method and device, storage medium and terminal
Technical Field
The present application relates to the field of computer technologies, and in particular to a text encoding method and apparatus, a storage medium and a terminal.
Background
In recent years, artificial intelligence has developed rapidly. Natural language understanding is one of the most important directions in the field, and machine recognition of natural language has become a research hot spot. With the development of technologies such as deep learning and reinforcement learning, researchers are increasingly eager to enable machines to recognize natural language accurately.
Deep learning mostly adopts deep neural network models; that is, a deep neural network model is used to recognize the natural language text and output the corresponding semantic vector. Data samples must of course first be collected to train the deep neural network model. At present, a number of randomly collected text sentences are usually used as data samples, and because there is no correlation among these text sentences, the semantic accuracy of the output vectors is insufficient.
Disclosure of Invention
The embodiment of the application provides a text encoding method, a text encoding device, a storage medium and a terminal, so that the semantic vectors output after text encoding are more accurate. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a text encoding method, where the method includes:
acquiring a target language text to be coded;
inputting the target language text into a pre-trained text coding model, wherein the text coding model is generated based on training of a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold;
and outputting the semantic vector corresponding to the target language text.
In a second aspect, an embodiment of the present application provides a text encoding apparatus, including:
the text acquisition module is used for acquiring a target language text to be coded;
the text input module is used for inputting the target language text into a pre-trained text coding model, where the text coding model is trained on a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold;
and the vector output module is used for outputting the semantic vector corresponding to the target language text.
In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, an embodiment of the present application provides a terminal, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The beneficial effects brought by the technical scheme provided by some embodiments of the application at least comprise:
in the embodiment of the application, the text encoding device acquires a target language text to be encoded and inputs the target language text into a pre-trained text coding model, where the text coding model is trained on a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold; finally, the semantic vector corresponding to the target language text is output. Because correlations are established among the collected sample data and the text coding model is trained on these correlated samples, the semantic vectors output after text encoding with the trained model represent the input text more accurately.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a scene schematic diagram of an implementation scenario provided in an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating an effect of an intelligent customer service chat interface display according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating an effect of a voice chat interface display according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a text encoding method according to an embodiment of the present application;
fig. 5A is a schematic diagram illustrating an effect displayed after a text input box is clicked on a chat interface according to an embodiment of the present application;
fig. 5B is a schematic diagram illustrating an effect displayed after editing a target text in a chat interface according to an embodiment of the present application;
FIG. 6 is a flow chart of another text encoding method provided in the embodiments of the present application;
FIG. 7 is a diagram of a training text coding model provided by an embodiment of the present application;
fig. 8 is a schematic flowchart of a process for recognizing a target text according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a text encoding apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of another text encoding apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a loss value output module according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a loss value calculation unit according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
In the description of the present application, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application can be understood in a specific case by those of ordinary skill in the art. Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "And/or" describes the association relationship of associated objects, meaning that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
First, some terms referred to in the embodiments of the present application are explained:
Text coding model: a mathematical model for outputting a corresponding semantic vector based on input text data.
Text coding models include, but are not limited to: at least one of a Convolutional Neural Network (CNN) model, a Deep Neural Network (DNN) model, a Recurrent Neural Network (RNN) model, an embedding model, a Long Short-Term Memory (LSTM) model, and a Gradient Boosting Decision Tree (GBDT) model.
The DNN model is a deep learning framework, and includes an input layer, at least one hidden layer (or intermediate layer), and an output layer. The input layer, the at least one hidden layer (or intermediate layer) and the output layer each include at least one neuron for processing received data. The number of neurons between different layers may be the same or different.
The RNN model is a neural network with a feedback structure. In the RNN model, the output of a neuron can be fed directly back to itself at the next time step; that is, the input of an i-th layer neuron at time m includes not only the output of the (i-1)-th layer neuron at time m, but also its own output at time (m-1).
The embedding model is based on distributed vector representations of entities and relations, treating the relation in each triple instance as a translation from the head entity to the tail entity. A triple instance comprises a subject, a relation and an object, and can be expressed as (subject, relation, object); the subject is the head entity and the object is the tail entity. For example: "Xiao Ming's father is Da Ming" can be represented by the triple instance (Xiao Ming, father, Da Ming).
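As a minimal illustration of this translation idea (a sketch, not the patent's claimed method), the following Python snippet scores a hypothetical triple by how closely head + relation approximates tail; the toy vectors are assumptions for illustration.

```python
import numpy as np

def translation_score(head: np.ndarray, relation: np.ndarray, tail: np.ndarray) -> float:
    # Translation-style scoring: a valid triple should satisfy head + relation ≈ tail,
    # so a lower distance means a better-fitting triple.
    return float(np.linalg.norm(head + relation - tail))

# Toy 2-dimensional embeddings for (Xiao Ming, father, Da Ming); values are illustrative only.
head = np.array([1.0, 0.2])
relation = np.array([0.5, 0.3])
tail = np.array([1.5, 0.5])
print(translation_score(head, relation, tail))  # ~0.0 for a well-fit triple
```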
The LSTM model is a special RNN model, proposed to solve the vanishing-gradient problem of RNN models. A traditional RNN is trained with backpropagation through time (BPTT); over long time spans the residual error to be propagated back decreases exponentially, so the network weights update slowly and the long-term memory effect of the RNN cannot be realized. A memory cell is therefore needed to store memory, which is why the LSTM model was proposed.
the GBDT model is an iterative decision tree algorithm that consists of a number of decision trees, with the results of all trees added together as the final result. Each node of the decision tree obtains a predicted value, and taking age as an example, the predicted value is an average value of ages of all people belonging to the node corresponding to the age. The LR model is a model built by applying a logistic function on the basis of linear regression.
At present, a number of randomly collected text sentences are generally used as data samples; since no correlation exists among these text sentences, the semantic accuracy of the output vectors is insufficient, which undoubtedly reduces the accuracy of machine recognition of natural language. The present application therefore provides a text encoding method, apparatus, storage medium and terminal to solve the above problems in the related art. In the technical scheme provided by the application, correlations are established among the collected sample data and the text coding model is trained on these correlated samples, so the semantic vectors output after text encoding with the trained model are more accurate and the accuracy of machine recognition of natural language is improved. Detailed exemplary embodiments follow.
Referring to fig. 1, a scene diagram of an implementation scenario of an embodiment of the present application is shown; the implementation scenario includes a user 110, a user terminal 120 and a server 130. The user terminal 120 is an electronic device with a network communication function, including but not limited to a smart phone, tablet computer, wearable device, smart home device, laptop computer, desktop computer, smart camera and the like. The server 130 includes one or more processors and a memory, and a processor may include one or more processing cores. The processor connects the various parts of the overall text encoding device using various interfaces and lines, and performs the various functions of the text encoding device and processes data by running or executing the instructions, programs, code sets or instruction sets stored in the memory and invoking the data stored in the memory. Optionally, the processor may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA) or Programmable Logic Array (PLA). The processor may integrate one or more of a Central Processing Unit (CPU), a modem and the like.
The server 130 stores therein a text coding model. Optionally, the text coding model is a model obtained by training a neural network model (e.g., a CNN model) using training sample data. The user terminal 120 is connected to the server 130 through a wireless or wired network, and an application having a text encoding function is installed in the user terminal 120.
In a possible implementation manner, the user 110 opens the user terminal 120 and then opens the application program with the intelligent customer service robot installed, entering the chat interface of the intelligent customer service robot shown in fig. 2. The user can input a target language text in the text input box according to the question to be consulted; after the target language text is input, the user 110 can click send in the chat interface to complete the input. When the client detects that a language text has been input, it transmits the target language text to the server 130 over a wireless or wired connection. The server 130 stores a pre-trained text coding model obtained by training on a first data sample, a second data sample and a third data sample collected in advance, where the first data sample and the second data sample are semantically correlated and the first data sample and the third data sample are not. When the server 130 detects that a language text has been received, it calls the text coding model to recognize the target language text, obtaining the corresponding semantic vector. Using this semantic vector, the server 130 finds the text and the text answer corresponding to it in the text database, and then returns the text answer to the user terminal 120 over the wireless or wired connection. In this way, the questions consulted by the user 110 can be answered in time.
In another implementation scenario, when the user performs a question consultation with the user terminal 120 in the scenario shown in fig. 1, the user 110 opens the user terminal 120, opens the application program with the intelligent customer service robot installed, and enters the chat interface of the intelligent customer service robot shown in fig. 3. The user 110 can organize in advance the speech for the question to be consulted, press the microphone icon function key below "press to talk" displayed in the interface of fig. 3, and record the voice data of the target language text while the key is pressed. When the recording is finished, the user 110 releases the microphone icon function key; after receiving the release instruction, the user terminal 120 collects the voice data and sends it to the server 130 over a wired or wireless connection. After receiving the collected voice data, the server 130 calls a speech conversion program to recognize it, converting it into a target language text object. The server 130 then calls the stored text coding model to perform semantic vectorization on the converted target language text object; after the semantic vector is extracted, the corresponding text answer is searched for in the text database according to the semantic vector, and once found, the text answer is sent to the user terminal 120 over the wireless or wired connection. In this scenario, text conversion and recognition are performed on voice collected by the user terminal 120, and the questions consulted by the user 110 can be answered automatically.
In the embodiment of the application, the text encoding device acquires a target language text to be encoded and inputs the target language text into a pre-trained text coding model, where the text coding model is trained on a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold; finally, the semantic vector corresponding to the target language text is output. Because correlations are established among the collected sample data and the text coding model is trained on these correlated samples, the semantic vectors output after text encoding with the trained model represent the input text more accurately.
The text encoding method provided by the embodiment of the present application will be described in detail below with reference to fig. 4 to 8. The method may be implemented in dependence on a computer program, operable on a text encoding device based on the von neumann architecture. The computer program may be integrated into the application or may run as a separate tool-like application. The text encoding device in the embodiment of the present application may be a user terminal.
Please refer to fig. 4, which provides a flowchart of a text encoding method according to an embodiment of the present application. As shown in fig. 4, the method of the embodiment of the present application may include the steps of:
s101, acquiring a target language text to be coded.
Language is a set of instructions used for communication according to common processing rules; the instructions are transmitted visually, audibly or tactilely, and here refer specifically to the natural language used for human communication. Text refers to the written form of language, usually a sentence or a combination of sentences with complete, systematic meaning; a text can be a sentence, a paragraph or a chapter.
Generally, a language text is a word composed of several characters, a sentence composed of several words, or a paragraph composed of several sentences. A user can describe his or her own ideas through language text; describing them in language text turns complicated ideas into statements that are easy for others to understand. Different modes of expression can make complex ideas accessible and communication easier to follow. The one or more natural language units contained in the target language text may be called sentences, or the text may be split into sentences according to its punctuation, that is, content ending with a period, question mark, exclamation mark, comma and the like is taken as one sentence, as in the sketch below.
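A minimal Python sketch of the sentence-splitting convention just described; the exact punctuation set (Western and full-width Chinese marks) is an assumption for illustration.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after terminal punctuation: period, question mark, exclamation mark, comma,
    # including their full-width Chinese forms. The set is an illustrative assumption.
    parts = re.split(r"(?<=[.?!,。？！，])", text)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences("Does cancelling a lesson deduct class hours? How can I avoid that?"))
# ['Does cancelling a lesson deduct class hours?', 'How can I avoid that?']
```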
The target language text is the language text input by the user to the user terminal; it may be generated by the user editing through the terminal's text editing software, or from voice information recorded by the user through the terminal's voice recording software. The target language text can be generated in various ways, which are not limited here.
In some feasible implementation manners, the user enters the chat interface by clicking the software with the intelligent customer service robot installed on the user terminal, and then pops up a text editor by clicking the text input box, as shown in fig. 5A. After the text editor pops up, the user can type the idea to be expressed into the text edit box as a textual description, as shown in fig. 5B, and the user terminal generates the target language text from this operation. It should be noted that there are various ways to obtain the target language text to be understood, which are not limited here.
For convenience of description, the embodiment of the present application takes as an example software with an intelligent customer service robot installed on a user terminal, where the user terminal generates the target language text in response to the user's editing in the software.
For example: a user wants to ask the question "Does cancelling a lesson deduct class hours?" First, the user opens the user terminal, taps into the learning software, and finds the intelligent customer service for consultation; the intelligent customer service consultation page is shown in fig. 5A. After the user clicks the text edit box on the consultation page, the text editing keyboard pops up, as shown in fig. 5B, and the user can input "Does cancelling a lesson deduct class hours?" in the edit box. When the input is completed and the user clicks the send button, the user terminal receives the instruction to generate the target language text, and at this time obtains the target language text object.
S102, inputting the target language text into a pre-trained text coding model, wherein the text coding model is generated based on a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold.
The text coding model is a mathematical model used for semantic recognition of input text data. The mathematical model is created based on at least one of the RNN, CNN, LSTM and similar models, and its core modules comprise a sentence vectorization module and a vector evaluation module. The sentence vectorization module vectorizes the collected first data sample, second data sample and third data sample, and the vector evaluation module evaluates the sentence vectors generated by the sentence vectorization operation.
Generally, the collected data samples comprise a first data sample, a second data sample and a third data sample, where the semantics expressed by the first data sample and the second data sample are highly similar, while the semantics expressed by the first data sample and the third data sample have no similarity. Because relationships are established among the three groups of data samples, the three groups of correlated data samples are used as training data to train the mathematical model described above into a text coding model, and the semantic vectors output after text encoding with the trained text coding model are more accurate.
Specifically, the first data sample, the second data sample and the third data sample are collected from a knowledge base; the data structure of the knowledge base is shown in Table 1.

TABLE 1 (knowledge base structure: each qid is associated with a standard question Qstandard and extension questions Qexpansion)

Based on the knowledge base in Table 1, each qid corresponds to a standard question Qstandard and an extension question Qexpansion; a standard question Qnegative is then randomly extracted from other, different qids to serve as a negative example for the current qid, generating the training sample qid: <Qstandard, Qexpansion, Qnegative>. In this training sample, when the semantic similarity between the standard question and the extension question is greater than the first similarity threshold, the requirement that the standard question and the extension question are semantically correlated is satisfied; when the semantic similarity between the standard question and the negative example is less than the second similarity threshold, the requirement that the standard question and the negative example are semantically uncorrelated is satisfied. The first similarity threshold is greater than or equal to the second similarity threshold, and their magnitudes may be set as needed in a specific implementation, which is not limited here; for example, the first similarity threshold may be 60% and the second similarity threshold 30%, or both may be 60%.
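The following Python sketch shows one plausible way to build such qid: <Qstandard, Qexpansion, Qnegative> triples from a qid-keyed knowledge base with random negative sampling; the knowledge-base structure and helper names are assumptions for illustration, not the patent's implementation.

```python
import random

# Hypothetical knowledge base mirroring the structure described for Table 1:
# each qid maps to a standard question and its extension questions.
knowledge_base = {
    1: {"standard": "Does cancelling a lesson deduct class hours?",
        "extensions": ["How can I avoid having class hours deducted?"]},
    2: {"standard": "How do I cancel a trial lesson?",
        "extensions": ["Can a trial lesson be cancelled?"]},
}

def build_triples(kb: dict) -> list[tuple[str, str, str]]:
    triples = []
    for qid, entry in kb.items():
        other_qids = [q for q in kb if q != qid]
        for expansion in entry["extensions"]:
            # Randomly draw a standard question from a different qid as the negative example.
            negative = kb[random.choice(other_qids)]["standard"]
            triples.append((entry["standard"], expansion, negative))
    return triples

print(build_triples(knowledge_base))
```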
In a possible implementation manner, the user terminal obtains a target language text (for details, please refer to step S101, which are not repeated here). When the user terminal detects the target language text, it transmits the text to the server in a wired or wireless manner. A pre-trained text coding model is stored in the server; after receiving the target language text data, the server inputs the target language text into the pre-trained text coding model for recognition through an internal program.
And S103, outputting the semantic vector corresponding to the target language text.
The semantic vector is obtained by converting the target language text through the text coding model. Converting a target language text into a vector representation in a semantic space is common practice in current text encoding; when a machine analyzes a language text, the sentences in the text are first represented as vectors.
Generally, the core modules of a text coding model comprise a sentence vectorization module and a vector evaluation module. The sentence vectorization module is a neural network model, which can be created using at least one of the RNN, CNN, LSTM and similar models. The collected first sample data, second sample data and third sample data are processed in sequence by the sentence vectorization module to obtain three sentence vectors of the same dimension; the three sentence vectors are then evaluated by the vector evaluation module, and when the evaluated loss value is smaller than a set threshold value, the training of the text coding model is completed.
Further, after receiving the target language text, the server first obtains the pre-trained text coding model and processes the target language text through it to obtain a semantic vector that adequately captures the text's semantics. The obtained semantic vector can be used for various tasks such as classification, clustering or similarity calculation, as in the sketch below.
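A minimal sketch of the similarity-calculation use just named: ranking stored answer vectors against a query vector. Cosine similarity and the toy vectors are assumptions for illustration; the patent does not fix a particular metric.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two semantic vectors; higher means more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.2, 0.7, 0.1])  # toy semantic vector for the user's question
candidate_vecs = {
    "answer_1": np.array([0.25, 0.65, 0.05]),
    "answer_2": np.array([0.9, 0.0, 0.4]),
}
best = max(candidate_vecs, key=lambda k: cosine_similarity(query_vec, candidate_vecs[k]))
print(best)  # answer_1: the semantically closest stored text
```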
In the embodiment of the application, the text encoding device acquires a target language text to be encoded and inputs the target language text into a pre-trained text coding model, where the text coding model is trained on a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold; finally, the semantic vector corresponding to the target language text is output. Because correlations are established among the collected sample data and the text coding model is trained on these correlated samples, the semantic vectors output after text encoding with the trained model represent the input text more accurately.
Please refer to fig. 6, which is a flowchart illustrating a text encoding method according to an embodiment of the present disclosure. The present embodiment is exemplified by applying the text encoding method to the user terminal. The text encoding method may include the steps of:
s201, collecting a first data sample, a second data sample and a third data sample from a language text library.
Data samples are sets of data information formed from characters, words and sentences that can express, for example, product performance, functions, structural principles and size parameters. These data information sets are stored in a dedicated data warehouse, forming a database; in this embodiment the database is specifically a language text library. Such a library is an electronic upgrade of traditional paper sample data: it can be transmitted over a network, presented to the user in a novel and intuitive way with a friendly human-computer interaction interface, and is richer and more varied in expression, so the user can query it faster and find sample data more efficiently.
Generally, collecting data samples is also called data acquisition. In today's rapidly developing internet industry, data collection is widely applied in the internet field, and accurate selection of the data samples to be collected has a profound influence on a product: if the collected data samples are not accurate enough, test results can deviate substantially, causing inestimable losses. It is therefore necessary to collect sample data information accurately.
In the embodiment of the present application, a first data sample, a second data sample and a third data sample are selected from a language text library, where the semantic similarity between the first data sample and the second data sample is high and there is no semantic similarity between the first data sample and the third data sample. The three groups of acquired data form triple training data; for example, the acquired data samples are shown in Table 2:
TABLE 2

Standard question | Extension question | Negative example
"Does cancelling a lesson deduct class hours?" | "How can I avoid having class hours deducted?" | "How do I cancel a trial lesson?"
Here the standard question is "Does cancelling a lesson deduct class hours?", the extension question is "How can I avoid having class hours deducted?", and the negative example is "How do I cancel a trial lesson?". The standard question is the target language text; the extension question is the language text with the highest semantic similarity to the target language text; and the negative example, acquired from other language texts, has no semantic similarity to the target language text. The acquired data samples can be used to train the text coding model.
S202, a text coding model is created, the first data sample, the second data sample and the third data sample are input into the text coding model, and a first semantic vector corresponding to the first data sample, a second semantic vector corresponding to the second data sample and a third semantic vector corresponding to the third data sample are obtained.
Generally, in the text coding model training stage, a text coding model is first created; the text coding model is a deep neural network model created based on at least one of the CNN, RNN and LSTM models, and its core modules comprise a sentence vectorization module and a vector evaluation module. The first data sample specifically refers to the language text corresponding to the target language text, the second data sample to a language text semantically similar to the target language text, and the third data sample to a language text semantically dissimilar to the target language text. The semantic vectors specifically refer to the three sentence vectors of the same dimension obtained by processing the first data sample, the second data sample and the third data sample through the sentence vectorization module of the text coding model.
In the embodiment of the application, a text coding model is created first. After the text coding model is created, a first data sample, a second data sample and a third data sample can be collected from a text library, where among the collected data samples the first data sample and the second data sample are semantically related while the first data sample and the third data sample are semantically unrelated. After the data sample collection is finished, the three collected groups of data samples are processed through the text coding model to obtain sentence vectors.
As shown in fig. 7, Qstandard is the language text in the text library identical to the target language text, Qexpansion is a language text in the text library semantically similar to the target language text, Qnegative is a language text in the text library semantically dissimilar to the target language text, and Sentence2vec is the text coding model, in which Neural Network is the sentence vectorization module and the three sets of data under Embeddings are the corresponding sentence vectors. Specifically, during training, sentence vectorization is performed on the acquired data samples Qstandard, Qexpansion and Qnegative by the Neural Network sentence vectorization module of the text coding model; after processing, the specific vector values corresponding to the three groups of data samples are output. These three groups of vector values can be used for various tasks such as classification, clustering or similarity calculation; the specific usage is not limited here.
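The following PyTorch-style sketch illustrates the sentence vectorization step of fig. 7: one shared encoder maps each of the three samples to a sentence vector of the same dimension. The LSTM architecture and all sizes are assumptions; the description only requires some neural encoder (RNN/CNN/LSTM, etc.).

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Shared sentence vectorization module: token ids -> fixed-size sentence vector."""
    def __init__(self, vocab_size: int = 1000, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1]  # final hidden state serves as the sentence vector

encoder = SentenceEncoder()
q_standard = torch.randint(0, 1000, (1, 10))   # toy token ids for Qstandard
q_expansion = torch.randint(0, 1000, (1, 10))  # toy token ids for Qexpansion
q_negative = torch.randint(0, 1000, (1, 10))   # toy token ids for Qnegative
x1, x2, x3 = encoder(q_standard), encoder(q_expansion), encoder(q_negative)
print(x1.shape, x2.shape, x3.shape)  # three sentence vectors of the same dimension
```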
S203, acquiring a first probability corresponding to the first semantic vector, a second probability corresponding to the second semantic vector and a third probability corresponding to the third semantic vector.
After the sentence vector values corresponding to the collected language texts are obtained in step S202, the vector evaluation module, a core module of the text coding model, performs vector evaluation so that the first semantic vector and the second semantic vector are represented as similar, while the first semantic vector and the third semantic vector are represented as dissimilar.
Specifically, assume the first semantic vector is denoted x1, the second semantic vector x2 and the third semantic vector x3. Each of the three semantic vectors is mapped through a formula to obtain a specific probability value.

Further, the semantic vector x1 is mapped through the formula

Qstandard_prob = softmax(x1), where softmax(x)_i = exp(x_i) / Σ_j exp(x_j),

to obtain the first probability distribution, denoted Qstandard_prob. The semantic vector x2 is mapped through the formula

Qexpansion_prob = softmax(x2)

to obtain the second probability distribution, denoted Qexpansion_prob. Finally, the semantic vector x3 is mapped through the formula

Qnegative_prob = softmax(x3)

to obtain the third probability distribution, denoted Qnegative_prob.
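A minimal numeric sketch of step S203, assuming the softmax mapping above; the toy vectors stand in for the model's semantic vectors.

```python
import torch

x1 = torch.tensor([1.2, 0.3, -0.5])  # toy first semantic vector
x2 = torch.tensor([1.0, 0.4, -0.6])  # toy second semantic vector
x3 = torch.tensor([-0.8, 1.5, 0.2])  # toy third semantic vector

# Each semantic vector is mapped to a probability distribution via softmax.
q_standard_prob = torch.softmax(x1, dim=-1)   # first probability distribution
q_expansion_prob = torch.softmax(x2, dim=-1)  # second probability distribution
q_negative_prob = torch.softmax(x3, dim=-1)   # third probability distribution
print(q_standard_prob.sum())  # each distribution sums to 1
```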
S204, calculating a first cross entropy of the first probability and the second probability, and calculating a second cross entropy of the first probability and the third probability.
Entropy is one of the parameters that characterize the state of a substance in thermodynamics; its physical meaning is a measure of the degree of disorder of a system. In the embodiment of the application, parameter optimization of the text coding model is completed according to the difference between cross entropies, and after the probability distribution values of the semantic vectors are obtained, the first cross entropy and the second cross entropy can be calculated from those distribution values.
Specifically, the first probability distribution value Qstandard_prob, the second probability distribution value Qexpansion_prob and the third probability distribution value Qnegative_prob are obtained in step S203. Substituting Qstandard_prob and Qexpansion_prob into the formula

H(Qstandard_prob, Qexpansion_prob) = −Σ Qstandard_prob · log(Qexpansion_prob)

gives the first cross entropy value, and substituting Qstandard_prob and Qnegative_prob into the formula

H(Qstandard_prob, Qnegative_prob) = −Σ Qstandard_prob · log(Qnegative_prob)

gives the second cross entropy value.
S205, taking the difference value of the first cross entropy and the second cross entropy as the loss value of the model.
The first cross entropy and the second cross entropy can be obtained as in step S204, and the cross entropy loss can be calculated from them. Cross entropy loss is commonly used in classification problems; the cross entropy gradually increases as the predicted probability deviates from the actual label.
In the embodiment of the present application, substituting the first cross entropy and the second cross entropy into the loss function of the text coding model gives the formula

Loss = H(Qstandard_prob, Qexpansion_prob) − H(Qstandard_prob, Qnegative_prob).

The loss value of the text coding model can be calculated according to this formula, and parameter optimization can be performed according to the loss value.
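A minimal numeric sketch of steps S204-S205, following the cross-entropy formulas above; the toy distributions stand in for the model's probability outputs.

```python
import torch

q_standard_prob = torch.tensor([0.5, 0.3, 0.2])   # toy first probability distribution
q_expansion_prob = torch.tensor([0.45, 0.35, 0.2])  # toy second probability distribution
q_negative_prob = torch.tensor([0.1, 0.2, 0.7])   # toy third probability distribution

def cross_entropy(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # H(p, q) = -sum(p * log q), as in the formulas above.
    return -(p * torch.log(q)).sum()

h1 = cross_entropy(q_standard_prob, q_expansion_prob)  # first cross entropy
h2 = cross_entropy(q_standard_prob, q_negative_prob)   # second cross entropy
loss = h1 - h2  # similar pairs pull the loss down; dissimilar pairs push it up
print(loss.item())
```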
S206, when the loss value does not reach the minimum value, adjusting the text coding model based on the loss value.
Based on step S205, the loss value of the text coding model can be calculated. When the loss value is greater than the set threshold, it indicates that the acquired data samples are not accurate enough; in this case, data samples need to be acquired again to train the text coding model, until the calculated loss value is less than the set threshold and the training of the text coding model is completed.
Fig. 8 shows each step of the training phase (Training) of the text coding model and each step of the inference phase (Inference) in which the trained model performs semantic vectorization of the target text.
In the Training phase, the knowledge base is a database storing the training sample data information sets; a number of data samples and information related to them are stored in this database. The triple training data comprises the three data samples to be trained: the first data sample is the language text identical to the target language text and may be called the standard question; the second data sample is a language text semantically similar to the target language text and may be called the extension question; and the third data sample is a language text with no semantic relation to the target language text and may be called the negative example. In the Inference phase, Query input represents the input of the target language text, and the Sentence2vec model represents the trained text coding model.
In a possible implementation manner, to train the text coding model, a user first creates the text coding model based on artificial-intelligence-related techniques. After the server receives a data acquisition instruction, it acquires triple training data from the knowledge base as model training samples, where the acquired first data sample and second data sample are semantically correlated and the acquired first data sample and third data sample are not. After the training data samples are collected, they are processed in sequence by the sentence vectorization module (Neural Network), a core module of the text coding model (Sentence2vec), yielding three sentence vectors (Embeddings) of the same dimension. The vector evaluation module is then called to perform the evaluation calculation: first the probability distribution values corresponding to the vectors are calculated, then the cross entropies are calculated from the probability distribution values, and finally the loss value of the text model is obtained as the difference between the cross entropies. Whether the text coding model has finished training is determined from the calculated cross-entropy loss value after vector evaluation: when the loss value is smaller than the set threshold, training of the text coding model is complete. A condensed sketch of this loop follows.
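The following condensed PyTorch sketch ties the training stages together. The encoder architecture, optimizer, toy data and stopping threshold are all assumptions for illustration; only the triple structure and the cross-entropy-difference loss are taken from the description above.

```python
import torch
import torch.nn as nn

# Assumed toy encoder: 16 "bag-of-features" inputs -> 8-dimensional sentence vector.
encoder = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 8))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
loss_threshold = -1.0  # assumed stopping criterion; the patent only requires "a set threshold"

def triple_loss(x1, x2, x3):
    # Loss = H(p1, p2) - H(p1, p3), as in steps S203-S205.
    p1, p2, p3 = (torch.softmax(x, dim=-1) for x in (x1, x2, x3))
    h = lambda p, q: -(p * torch.log(q + 1e-9)).sum()
    return h(p1, p2) - h(p1, p3)

for step in range(1000):
    # Toy triple: the extension question is a near-duplicate of the standard question,
    # the negative example is unrelated.
    q_std = torch.rand(16)
    q_exp = q_std + 0.05 * torch.randn(16)
    q_neg = torch.rand(16)
    loss = triple_loss(encoder(q_std), encoder(q_exp), encoder(q_neg))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < loss_threshold:  # training ends once the loss is small enough
        break
```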
Further, after the text coding model is trained, it can be applied, for example, to an intelligent customer service robot. When a user inputs a target text to be encoded at the user terminal, the user terminal transmits it to the server in a wired or wireless manner. After receiving the target text data, the server calls the text coding model for recognition and outputs the corresponding semantic vector; the server analyzes this semantic vector to obtain a result, finds the answer corresponding to the target language text in the text database according to the result, and finally returns the answer to the user terminal in a wired or wireless manner.
And S207, when the loss value reaches the minimum value, generating a trained text coding model.
The generation of the text coding model may specifically refer to step S206, which is not described herein again.
And S208, acquiring a target language text to be coded.
Specifically, refer to step S101, which is not described herein again.
S209, inputting the target language text into a pre-trained text coding model, wherein the text coding model is generated based on a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold.
See S102 for details, which are not described herein.
S210, outputting the semantic vector corresponding to the target language text.
Specifically, refer to step S103, which is not described herein again.
In the embodiment of the application, the text encoding device acquires a target language text to be encoded and inputs the target language text into a pre-trained text coding model, where the text coding model is trained on a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold; finally, the semantic vector corresponding to the target language text is output. Because correlations are established among the collected sample data and the text coding model is trained on these correlated samples, the semantic vectors output after text encoding with the trained model represent the input text more accurately.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Please refer to fig. 9, which shows a schematic structural diagram of a text encoding apparatus according to an exemplary embodiment of the present application. The text encoding means may be implemented as all or part of the terminal in software, hardware or a combination of both. The apparatus 1 comprises a text acquisition module 10, a text input module 20 and a vector output module 30.
The text acquisition module 10 is used for acquiring a target language text to be coded;
a text input module 20, configured to input the target language text into a pre-trained text coding model, where the text coding model is trained on a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold;
and the vector output module 30 is configured to output a semantic vector corresponding to the target language text.
Optionally, as shown in fig. 10, the apparatus 1 further includes:
a data collecting module 60, configured to collect a first data sample, a second data sample, and a third data sample from a language text library;
a loss value output module 50, configured to create a text coding model, input the first data sample, the second data sample, and the third data sample into the text coding model, and output a loss value of the model;
and the model generating module 40 is used for generating a trained text coding model when the loss value reaches the minimum value.
Optionally, as shown in fig. 11, the loss value output module 50 includes:
a vector output unit 501, configured to input the first data sample, the second data sample, and the third data sample into the text coding model, so as to obtain a first semantic vector corresponding to the first data sample, a second semantic vector corresponding to the second data sample, and a third semantic vector corresponding to the third data sample;
a probability obtaining unit 502, configured to obtain a first probability corresponding to the first semantic vector, a second probability corresponding to the second semantic vector, and a third probability corresponding to the third semantic vector;
a loss value calculating unit 503, configured to calculate a loss value of the model based on the first probability, the second probability and the third probability.
Optionally, as shown in fig. 12, the loss value calculating unit 503 includes:
a cross entropy calculation subunit 5031, configured to calculate a first cross entropy of the first probability and the second probability, and calculate a second cross entropy of the first probability and the third probability;
a loss value calculation subunit 5032, configured to use the difference between the first cross entropy and the second cross entropy as the loss value of the model.
Optionally, the model generating module 40 is specifically configured to:
and when the loss value does not reach the minimum value, adjusting the text coding model based on the loss value, and executing the step of inputting the first data sample, the second data sample and the third data sample into the text coding model.
It should be noted that, when the text encoding apparatus provided in the foregoing embodiment executes the text encoding method, the division into the above functional modules is used only as an example; in practical applications, the functions may be distributed among different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the text encoding apparatus and the text encoding method provided by the above embodiments belong to the same concept; for details of the implementation process, refer to the method embodiments, which are not described here again.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the embodiment of the application, the text encoding device acquires a target language text to be encoded and inputs the target language text into a pre-trained text coding model, where the text coding model is trained on a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold; finally, the semantic vector corresponding to the target language text is output. Because correlations are established among the collected sample data and the text coding model is trained on these correlated samples, the semantic vectors output after text encoding with the trained model represent the input text more accurately.
The present application also provides a computer readable medium, on which program instructions are stored, which program instructions, when executed by a processor, implement the text encoding method provided by the above-mentioned various method embodiments.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the text encoding method as described in the various method embodiments above.
Please refer to fig. 13, which provides a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 13, the terminal 1000 can include: at least one processor 1001, at least one network interface 1004, a user interface 1003, memory 1005, at least one communication bus 1002.
Wherein a communication bus 1002 is used to enable connective communication between these components.
The user interface 1003 may include a Display screen (Display) and a Camera (Camera), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
Processor 1001 may include one or more processing cores, among other things. The processor 1001 interfaces various components throughout the electronic device 1000 using various interfaces and lines to perform various functions of the electronic device 1000 and to process data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 1005 and invoking data stored in the memory 1005. Alternatively, the processor 1001 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1001 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. Wherein, the CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for rendering and drawing the content required to be displayed by the display screen; the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 1001, but may be implemented by a single chip.
The Memory 1005 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1005 includes a non-transitory computer-readable medium. The memory 1005 may be used to store instructions, programs, code, code sets or instruction sets. The memory 1005 may include a stored-program area and a stored-data area: the stored-program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; the stored-data area may store the data referred to in the above method embodiments. The memory 1005 may optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 13, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a text encoding application program.
In the electronic device 1000 shown in fig. 13, the user interface 1003 is mainly used to provide an input interface for the user and to acquire data input by the user, and the processor 1001 may be configured to invoke the text encoding application stored in the memory 1005 and specifically perform the following operations:
acquiring a target language text to be coded;
inputting the target language text into a pre-trained text coding model, wherein the text coding model is generated based on training of a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold;
and outputting the semantic vector corresponding to the target language text.
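For illustration, the three operations above can be sketched in Python roughly as follows. This is a minimal sketch, not the patent's implementation: the model object, the tokenizer, and its encode method are hypothetical stand-ins, since the text does not fix a concrete architecture or tokenization scheme.

    import torch

    def encode_text(model: torch.nn.Module, tokenizer, target_text: str) -> torch.Tensor:
        # Operation 1: acquire the target language text to be encoded (the argument).
        # Operation 2: input the text into the pre-trained text coding model.
        token_ids = torch.tensor([tokenizer.encode(target_text)])  # shape: (1, seq_len)
        with torch.no_grad():
            semantic_vector = model(token_ids)  # shape: (1, embedding_dim)
        # Operation 3: output the semantic vector corresponding to the text.
        return semantic_vector.squeeze(0)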
In one embodiment, before acquiring the target language text to be encoded, the processor 1001 further performs the following operations:
collecting a first data sample, a second data sample and a third data sample from a language text library;
creating a text coding model, inputting the first data sample, the second data sample and the third data sample into the text coding model, and outputting a loss value of the model;
and when the loss value reaches the minimum value, generating a trained text coding model.
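As a concrete illustration of the collection step above, a triplet of samples can be assembled from a language text library by thresholding a semantic similarity score. The sketch below is an assumption-laden illustration: the similarity function and the threshold values t1 >= t2 are hypothetical, as the text does not prescribe them.

    def collect_triplet(text_library, similarity, first_sample, t1=0.8, t2=0.3):
        # Second sample: semantic similarity with the first sample above the
        # first similarity threshold t1 (assumes such a sample exists).
        second_sample = next(s for s in text_library
                             if s != first_sample and similarity(first_sample, s) > t1)
        # Third sample: semantic similarity with the first sample below the
        # second similarity threshold t2, where t1 >= t2.
        third_sample = next(s for s in text_library
                            if similarity(first_sample, s) < t2)
        return first_sample, second_sample, third_sample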
In one embodiment, when inputting the first data sample, the second data sample and the third data sample into the text coding model and outputting the loss value of the model, the processor 1001 specifically performs the following operations:
inputting the first data sample, the second data sample and the third data sample into the text coding model to obtain a first semantic vector corresponding to the first data sample, a second semantic vector corresponding to the second data sample and a third semantic vector corresponding to the third data sample;
acquiring a first probability corresponding to the first semantic vector, a second probability corresponding to the second semantic vector and a third probability corresponding to the third semantic vector;
calculating a loss value for the model based on the first probability, the second probability, and the third probability.
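The text does not specify how a probability is derived from a semantic vector; one plausible reading, sketched below purely as an assumption, is to softmax-normalize each vector into a probability distribution. The samples are assumed to be already tokenized tensors.

    import torch
    import torch.nn.functional as F

    def encode_and_normalize(model, first, second, third):
        # Encode each data sample into its semantic vector.
        v1, v2, v3 = model(first), model(second), model(third)
        # Assumption: each probability is the softmax normalization of the
        # corresponding semantic vector (the embodiment leaves this open).
        return (F.softmax(v1, dim=-1),
                F.softmax(v2, dim=-1),
                F.softmax(v3, dim=-1))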
In one embodiment, when calculating the loss value of the model based on the first probability, the second probability and the third probability, the processor 1001 specifically performs the following operations:
calculating a first cross entropy of the first probability and the second probability, and calculating a second cross entropy of the first probability and the third probability;
and taking the difference value of the first cross entropy and the second cross entropy as the loss value of the model.
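In other words, the loss is L = H(p1, p2) - H(p1, p3), where H(p, q) = -Σ p_i · log q_i is the cross entropy: minimizing L makes the distributions of the similar pair agree while driving those of the dissimilar pair apart. A minimal sketch, assuming the probabilities are torch tensors:

    import torch

    def cross_entropy(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # H(p, q) = -sum_i p_i * log(q_i); the small epsilon guards against log(0).
        return -(p * torch.log(q + 1e-12)).sum()

    def model_loss(p1, p2, p3):
        # Loss value = first cross entropy minus second cross entropy.
        return cross_entropy(p1, p2) - cross_entropy(p1, p3)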
In one embodiment, when generating the trained text coding model when the loss value reaches the minimum value, the processor 1001 specifically performs the following operations:
and when the loss value does not reach the minimum value, adjusting the text coding model based on the loss value, and executing the step of inputting the first data sample, the second data sample and the third data sample into the text coding model.
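Putting the pieces together, the adjust-and-refeed cycle can be sketched as an ordinary gradient-descent loop; the optimizer, tolerance, and step cap below are illustrative assumptions, and encode_and_normalize and model_loss refer to the sketches above.

    import torch

    def train_until_minimum(model, samples, lr=1e-3, tol=1e-6, max_steps=10_000):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        first, second, third = samples
        previous_loss = float("inf")
        for _ in range(max_steps):
            p1, p2, p3 = encode_and_normalize(model, first, second, third)
            loss = model_loss(p1, p2, p3)
            # Treat a vanishing improvement as the loss having reached its minimum.
            if previous_loss - loss.item() < tol:
                break
            # Adjust the text coding model based on the loss value, then refeed
            # the three data samples on the next iteration.
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            previous_loss = loss.item()
        return model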
In the embodiment of the application, a text coding device acquires a target language text to be coded and inputs it into a pre-trained text coding model, where the text coding model is generated by training on a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold; finally, the semantic vector corresponding to the target language text is output. In this way, correlations are established among the collected sample data, the text coding model is trained using the correlated sample data, and the semantic vector output after text coding with the trained text coding model represents the semantics of the text more accurately.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory, or a random access memory.
The above disclosure is only a preferred embodiment of the present application and is not to be construed as limiting the scope of the present application; all equivalent variations and modifications made according to the present application shall still fall within its scope.

Claims (10)

1. A method of text encoding, the method comprising:
acquiring a target language text to be coded;
inputting the target language text into a pre-trained text coding model, wherein the text coding model is generated based on training of a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold;
and outputting the semantic vector corresponding to the target language text.
2. The method of claim 1, wherein before the acquiring the target language text to be encoded, the method further comprises:
collecting a first data sample, a second data sample and a third data sample from a language text library;
creating a text coding model, inputting the first data sample, the second data sample and the third data sample into the text coding model, and outputting a loss value of the model;
and when the loss value reaches the minimum value, generating a trained text coding model.
3. The method of claim 2, wherein inputting the first data sample, the second data sample, and the third data sample into the text coding model, and outputting a loss value for the model comprises:
inputting the first data sample, the second data sample and the third data sample into the text coding model to obtain a first semantic vector corresponding to the first data sample, a second semantic vector corresponding to the second data sample and a third semantic vector corresponding to the third data sample;
acquiring a first probability corresponding to the first semantic vector, a second probability corresponding to the second semantic vector and a third probability corresponding to the third semantic vector;
calculating a loss value for the model based on the first probability, the second probability, and the third probability.
4. The method of claim 3, wherein calculating the loss value for the model based on the first probability, the second probability, and the third probability comprises:
calculating a first cross entropy of the first probability and the second probability, and calculating a second cross entropy of the first probability and the third probability;
and taking the difference value of the first cross entropy and the second cross entropy as the loss value of the model.
5. The method of claim 2, wherein generating a trained text coding model when the loss value reaches a minimum comprises:
and when the loss value does not reach the minimum value, adjusting the text coding model based on the loss value, and executing the step of inputting the first data sample, the second data sample and the third data sample into the text coding model.
6. A text encoding apparatus, characterized in that the apparatus comprises:
the text acquisition module is used for acquiring a target language text to be coded;
the text input module is used for inputting the target language text into a pre-trained text coding model, wherein the text coding model is generated by training based on a first data sample, a second data sample and a third data sample, the semantic similarity between the first data sample and the second data sample is greater than a first similarity threshold, the semantic similarity between the first data sample and the third data sample is less than a second similarity threshold, and the first similarity threshold is greater than or equal to the second similarity threshold;
and the vector output module is used for outputting the semantic vector corresponding to the target language text.
7. The apparatus of claim 6, further comprising:
the data acquisition module is used for acquiring a first data sample, a second data sample and a third data sample from a language text library;
a loss value output module, configured to create a text coding model, input the first data sample, the second data sample, and the third data sample into the text coding model, and output a loss value of the model;
and the model generation module is used for generating a trained text coding model when the loss value reaches the minimum value.
8. The apparatus of claim 7, wherein the loss value output module comprises:
a vector output unit, configured to input the first data sample, the second data sample, and the third data sample into the text coding model, so as to obtain a first semantic vector corresponding to the first data sample, a second semantic vector corresponding to the second data sample, and a third semantic vector corresponding to the third data sample;
a probability obtaining unit, configured to obtain a first probability corresponding to the first semantic vector, a second probability corresponding to the second semantic vector, and a third probability corresponding to the third semantic vector;
a loss value calculation unit for calculating a loss value of the model based on the first probability, the second probability, and the third probability.
9. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to perform the method steps according to any of claims 1 to 5.
10. A terminal, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 5.
CN201910939618.7A 2019-09-30 2019-09-30 Text encoding method, device, storage medium and terminal Active CN110795913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910939618.7A CN110795913B (en) 2019-09-30 2019-09-30 Text encoding method, device, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910939618.7A CN110795913B (en) 2019-09-30 2019-09-30 Text encoding method, device, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN110795913A true CN110795913A (en) 2020-02-14
CN110795913B CN110795913B (en) 2024-04-12

Family

ID=69438784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910939618.7A Active CN110795913B (en) 2019-09-30 2019-09-30 Text encoding method, device, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN110795913B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190087728A1 (en) * 2017-09-18 2019-03-21 Tata Consultancy Services Limited Techniques for correcting linguistic training bias in training data
CN108509411A (en) * 2017-10-10 2018-09-07 腾讯科技(深圳)有限公司 Semantic analysis and device
CN109840321A (en) * 2017-11-29 2019-06-04 腾讯科技(深圳)有限公司 Text recommended method, device and electronic equipment
CN109815322A (en) * 2018-12-27 2019-05-28 东软集团股份有限公司 Method, apparatus, storage medium and the electronic equipment of response
CN109947919A (en) * 2019-03-12 2019-06-28 北京字节跳动网络技术有限公司 Method and apparatus for generating text matches model
CN110232114A (en) * 2019-05-06 2019-09-13 平安科技(深圳)有限公司 Sentence intension recognizing method, device and computer readable storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340218A (en) * 2020-02-24 2020-06-26 支付宝(杭州)信息技术有限公司 Method and system for training problem recognition model
CN111709247A (en) * 2020-05-20 2020-09-25 北京百度网讯科技有限公司 Data set processing method and device, electronic equipment and storage medium
CN111709247B (en) * 2020-05-20 2023-04-07 北京百度网讯科技有限公司 Data set processing method and device, electronic equipment and storage medium
CN112199479A (en) * 2020-09-15 2021-01-08 北京捷通华声科技股份有限公司 Method, device and equipment for optimizing language semantic understanding model and storage medium
CN112241626A (en) * 2020-10-14 2021-01-19 网易(杭州)网络有限公司 Semantic matching and semantic similarity model training method and device
CN112241626B (en) * 2020-10-14 2023-07-07 网易(杭州)网络有限公司 Semantic matching and semantic similarity model training method and device
CN112861546A (en) * 2021-02-25 2021-05-28 吉林大学 Method and device for acquiring text semantic similarity value, storage medium and electronic equipment
CN113033150A (en) * 2021-03-18 2021-06-25 深圳市元征科技股份有限公司 Method and device for coding program text and storage medium
CN113673201A (en) * 2021-07-15 2021-11-19 北京三快在线科技有限公司 Text representation vector generation method and device, storage medium and electronic equipment
CN113836271A (en) * 2021-09-28 2021-12-24 北京有竹居网络技术有限公司 Method and product for natural language processing
CN113836271B (en) * 2021-09-28 2023-08-15 北京有竹居网络技术有限公司 Method and product for natural language processing

Also Published As

Publication number Publication date
CN110795913B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN110795913B (en) Text encoding method, device, storage medium and terminal
US20200301954A1 (en) Reply information obtaining method and apparatus
US20200251091A1 (en) System and method for defining dialog intents and building zero-shot intent recognition models
CN110083693B (en) Robot dialogue reply method and device
CN113127624B (en) Question-answer model training method and device
CN111914568A (en) Method, device and equipment for generating text modifying sentence and readable storage medium
CN112100354B (en) Man-machine conversation method, device, equipment and storage medium
EP4113357A1 (en) Method and apparatus for recognizing entity, electronic device and storage medium
CN111858898A (en) Text processing method and device based on artificial intelligence and electronic equipment
CN111026840A (en) Text processing method, device, server and storage medium
CN112632258A (en) Text data processing method and device, computer equipment and storage medium
CN112632239A (en) Brain-like question-answering system based on artificial intelligence technology
CN113435182A (en) Method, device and equipment for detecting conflict of classification labels in natural language processing
CN116541493A (en) Interactive response method, device, equipment and storage medium based on intention recognition
CN116401354A (en) Text processing method, device, storage medium and equipment
CN115186080A (en) Intelligent question-answering data processing method, system, computer equipment and medium
CN117591663B (en) Knowledge graph-based large model promt generation method
Zheng et al. BIM-GPT: a prompt-based virtual Assistant framework for BIM information retrieval
CN117932022A (en) Intelligent question-answering method and device, electronic equipment and storage medium
CN112818096A (en) Dialog generating method and device
CN117453885A (en) Question information processing method, device, equipment, storage medium and product
CN115017914A (en) Language processing method, language processing device, electronic equipment and storage medium
CN114840680A (en) Entity relationship joint extraction method, device, storage medium and terminal
CN113076409A (en) Dialogue system and method applied to robot, robot and readable medium
Ellawela et al. A Review about Voice and UI Design Driven Approaches to Identify UI Elements and Generate UI Designs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant