CN112115717B - Data processing method, device and equipment and readable storage medium

Info

Publication number
CN112115717B
Authority
CN
China
Prior art keywords
vector
speech
target
language
language domain
Prior art date
Legal status
Active
Application number
CN202011040445.4A
Other languages
Chinese (zh)
Other versions
CN112115717A
Inventor
罗俊杰
孙继超
陈曦
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011040445.4A
Publication of CN112115717A
Application granted
Publication of CN112115717B
Status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application disclose a data processing method, apparatus, device, and readable storage medium. The method includes: obtaining token grain sequences corresponding to at least two language domains of a target word, and obtaining grain vector matrices corresponding to the at least two language domains; determining, from the token grain sequences and the grain vector matrices, the language domain mapping vector of the target word under each language domain; fusing the language domain mapping vectors to generate a fused language domain mapping vector; and obtaining the word vector representation feature of the label word from the word vector matrix, and adjusting the grain vector matrices and the word vector matrix according to the language domain mapping vectors of the target word, the fused language domain mapping vector, and the word vector representation feature, to obtain target grain vector matrices and a target word vector matrix that can be used for language processing of words. With the method and apparatus, the quality of the semantic representation vectors of words can be improved, and the accuracy of language processing tasks can therefore be improved.

Description

Data processing method, device and equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, apparatus, device, and readable storage medium.
Background
A sentence text can be understood as a sequence of one or more words; each word is a basic unit of the sentence text, and the semantic information of each word is important to the sentence as a whole. Semantic information of words is also widely used in Natural Language Processing (NLP), which belongs to the field of Artificial Intelligence (AI).
In the prior art, word embedding models (e.g., the word2vec and GloVe models) are generally used to return a word vector for each word, and these word vectors can then be used in language processing tasks (e.g., word similarity matching tasks, Chinese medical named entity recognition tasks, etc.). Specifically, a word embedding model treats each word as an indivisible unit: taking one word as the center word, the word vector of the center word in the word embedding matrix is trained by predicting the center word from its surrounding words. However, treating each word as an indivisible unit considers only the semantic information of the word itself, and the learned word vector contains only surface-level semantics; that is, it cannot express the semantics of the word accurately enough, so when such a word vector is used in a language processing task, the language processing result obtained is not accurate enough.
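To make this prior-art training procedure concrete, the following is a minimal sketch of a CBOW-style step in which the center word is predicted from the mean of its surrounding-word vectors, with every word treated as an indivisible unit. All names, sizes, and the plain softmax update are illustrative assumptions, not details from the patent.

```python
import numpy as np

VOCAB, DIM = 10_000, 128
rng = np.random.default_rng(0)
word_emb = rng.normal(scale=0.01, size=(VOCAB, DIM))  # surrounding-word matrix
out_emb = rng.normal(scale=0.01, size=(VOCAB, DIM))   # center-word (output) matrix

def train_step(center_id, context_ids, lr=0.05):
    h = word_emb[context_ids].mean(axis=0)      # context representation
    scores = out_emb @ h                        # score for every vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                        # softmax over the vocabulary
    grad = probs
    grad[center_id] -= 1.0                      # gradient of cross-entropy loss
    dh = out_emb.T @ grad
    out_emb[:] -= lr * np.outer(grad, h)        # update output matrix in place
    word_emb[context_ids] -= lr * dh / len(context_ids)  # update context words
```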
Disclosure of Invention
Embodiments of the present application provide a data processing method, an apparatus, a device, and a readable storage medium, which can improve quality of semantic representation vectors of words, so as to improve accuracy of a language processing task.
An embodiment of the present application provides a data processing method, including:
obtaining token grain sequences corresponding to at least two language domains of a target word, and obtaining grain vector matrices corresponding to the at least two language domains; each grain vector matrix is associated with a sample text;
determining, according to the token grain sequence and the grain vector matrix corresponding to each language domain, the language domain mapping vector of the target word under each language domain;
fusing the language domain mapping vectors of the target word under each language domain to generate a fused language domain mapping vector of the target word;
obtaining a word vector matrix associated with the sample text; the sample text includes a sentence text composed of the target word and a label word;
and obtaining, from the word vector matrix, the word vector representation feature corresponding to the label word, and adjusting the grain vector matrices and the word vector matrix according to the language domain mapping vectors of the target word under each language domain, the fused language domain mapping vector, and the word vector representation feature corresponding to the label word, to obtain target grain vector matrices and a target word vector matrix for performing a language processing task.
One aspect of the present application provides another data processing method, including:
obtaining an input word and at least two words to be sorted;
inputting the input word and the at least two words to be sorted into a language processing model; the language processing model includes a target grain vector matrix and a target word vector matrix, generated by the data processing method provided in the first aspect of the embodiments of the present application;
determining, through the target word vector matrix and the target grain vector matrix in the language processing model, the semantic similarity between each of the at least two words to be sorted and the input word;
and sorting the at least two words to be sorted according to the semantic similarity to obtain a word sequence, and outputting the word sequence.
An embodiment of the present application provides a data processing apparatus, including:
a sequence acquisition module, configured to acquire token grain sequences corresponding to at least two language domains of a target word;
a grain matrix acquisition module, configured to acquire grain vector matrices corresponding to the at least two language domains; each grain vector matrix is associated with a sample text;
a vector determination module, configured to determine, according to the token grain sequence and the grain vector matrix corresponding to each language domain, the language domain mapping vector of the target word under each language domain;
a vector fusion module, configured to fuse the language domain mapping vectors of the target word under each language domain to generate a fused language domain mapping vector of the target word;
a word matrix acquisition module, configured to acquire a word vector matrix associated with the sample text; the sample text includes a sentence text composed of the target word and a label word;
a word feature acquisition module, configured to acquire, from the word vector matrix, the word vector representation feature corresponding to the label word;
and a matrix adjustment module, configured to adjust the grain vector matrices and the word vector matrix according to the language domain mapping vectors of the target word under each language domain, the fused language domain mapping vector, and the word vector representation feature corresponding to the label word, to obtain target grain vector matrices and a target word vector matrix for performing a language processing task.
Wherein the at least two language domains include a language domain K_i; i is a positive integer less than or equal to the number of the at least two language domains;
the sequence acquisition module includes:
an initial sequence acquisition unit, configured to acquire the initial token grains corresponding to the target word under the language domain K_i;
a grain combination unit, configured to combine the initial token grains to obtain extended token grains;
a grain filtering unit, configured to filter the extended token grains to obtain filtered token grains, and determine the token grains composed of the initial token grains and the filtered token grains as target token grains;
a sequence determination unit, configured to determine the sequence composed of the target token grains as the token grain sequence corresponding to the target word under the language domain K_i.
Wherein the at least two language domains include a language domain K_i; the token grain sequences corresponding to the at least two language domains include a token grain sequence M_i corresponding to the language domain K_i; the grain vector matrices corresponding to the at least two language domains include a grain vector matrix T_i corresponding to the language domain K_i; the grain vector matrix T_i includes the grain vector representation features corresponding to the sample token grains in the language domain K_i; the sample token grains are associated with the sample text, and the sample token grains include the target token grains in the token grain sequence M_i; i is a positive integer less than or equal to the number of the at least two language domains;
the vector determination module includes:
a feature acquisition unit, configured to acquire the grain vector representation features in the grain vector matrix T_i;
the feature acquisition unit is further configured to acquire, from the grain vector representation features in the grain vector matrix T_i, the grain vector representation features corresponding to the target token grains in the token grain sequence M_i;
a quantity acquisition unit, configured to acquire the number of target token grains in the token grain sequence M_i;
a vector determination unit, configured to determine, according to the grain vector representation features corresponding to the target token grains in the token grain sequence M_i and the number of target token grains, the language domain mapping vector of the target word under the language domain K_i.
Wherein the target token grains in the token grain sequence M_i include a target token grain S_t and a target token grain S_w; t and w are positive integers less than or equal to the number of token grains in the token grain sequence M_i;
the vector determination unit includes:
an operation processing subunit, configured to add the grain vector representation feature corresponding to the target token grain S_t and the grain vector representation feature corresponding to the target token grain S_w to obtain a first operation vector representation feature;
the operation processing subunit is further configured to perform mean processing on the first operation vector representation feature and the number of token grains to obtain a mean vector representation feature;
a vector determination subunit, configured to determine, according to the mean vector representation feature, the language domain mapping vector of the target word under the language domain K_i.
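Read as a formula (a reconstruction in notation the original does not use, not the patent's own equation), the two subunits above mean-pool the grain vectors of a word C under the language domain K_i:

$$ m_i(C) \;=\; \frac{1}{n_i} \sum_{t=1}^{n_i} e_i(S_t), $$

where S_1, ..., S_{n_i} are the target token grains of C in the token grain sequence M_i, e_i(S_t) is the grain vector representation feature of S_t in the grain vector matrix T_i, and n_i is the number of token grains.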
Wherein the target words include a target word C_a and a target word C_b; a and b are both positive integers less than or equal to the number of words in the sample text;
the vector determination subunit is further configured to acquire the mean vector representation feature corresponding to the target word C_a and the mean vector representation feature corresponding to the target word C_b;
the vector determination subunit is further configured to add the mean vector representation feature corresponding to the target word C_a and the mean vector representation feature corresponding to the target word C_b to obtain a second operation vector representation feature;
the vector determination subunit is further configured to acquire the number of words in the sample text, and perform mean processing on the second operation vector representation feature and the number of words to obtain the language domain mapping vector of the target words C_a and C_b under the language domain K_i.
Wherein the at least two language domains further include a language domain K_j; j is a positive integer less than or equal to the number of the at least two language domains;
the vector determination subunit is further configured to acquire the language domain mapping vector of the target words C_a and C_b under the language domain K_i and the language domain mapping vector of the target words C_a and C_b under the language domain K_j;
the vector determination subunit is further configured to add the language domain mapping vector under the language domain K_i and the language domain mapping vector under the language domain K_j to obtain an operation language domain mapping vector;
and the vector determination subunit is further configured to acquire the number of the at least two language domains, and perform mean processing on the operation language domain mapping vector and the number of language domains to obtain the fused language domain mapping vector.
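Continuing the reconstruction (same caveat: the notation is assumed, not the patent's), the language domain mapping vector under K_i and the fused language domain mapping vector follow the same mean pattern:

$$ h_i \;=\; \frac{1}{N}\bigl(m_i(C_a) + m_i(C_b)\bigr), \qquad \bar{h} \;=\; \frac{1}{L} \sum_{i=1}^{L} h_i, $$

where N is the number of words in the sample text and L is the number of the at least two language domains.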
Wherein the matrix adjustment module includes:
a distance determination unit, configured to determine a first vector distance between the language domain mapping vector corresponding to the target word under each language domain and the word vector representation feature, and a second vector distance between the fused language domain mapping vector and the word vector representation feature;
a distance operation unit, configured to add the first vector distance and the second vector distance to obtain a loss function value;
and a matrix adjustment unit, configured to adjust the grain vector matrices and the word vector matrix respectively according to the loss function value, to obtain the target grain vector matrices and the target word vector matrix.
Wherein the distance determination unit includes:
a first multiplication subunit, configured to acquire the first transposed mapping vector corresponding to the target word under each language domain, multiply the first transposed mapping vector by the word vector representation feature, and determine the first vector distance according to the multiplication result; the first transposed mapping vector is the transpose of the language domain mapping vector corresponding to the target word under each language domain;
and a second multiplication subunit, configured to acquire the second transposed mapping vector corresponding to the fused language domain mapping vector, multiply the second transposed mapping vector by the word vector representation feature, and determine the second vector distance according to the multiplication result.
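As a formula (a plausible reading of the two subunits; the exact mapping from dot product to distance is not fixed by the text), the loss function value is

$$ \mathrm{loss} \;=\; \sum_{i=1}^{L} d\bigl(h_i^{\top} c\bigr) \;+\; d\bigl(\bar{h}^{\top} c\bigr), $$

where c is the word vector representation feature of the label word, h_i^{\top} and \bar{h}^{\top} are the first and second transposed mapping vectors, and d maps the multiplication result to a vector distance; a word2vec-style choice such as d(x) = -log σ(x) would be one assumption, but the patent does not specify d.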
Wherein the matrix adjustment unit includes:
a feature adjustment subunit, configured to match the loss function value against a distance threshold, and if the loss function value is greater than the distance threshold, adjust the grain vector representation features corresponding to the target token grains in the grain vector matrices and the word vector representation feature corresponding to the label word in the word vector matrix respectively;
and a matrix determination subunit, configured to, if the loss function value is less than or equal to the distance threshold, determine the grain vector matrices as the target grain vector matrices and determine the word vector matrix as the target word vector matrix.
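The threshold-controlled loop the two subunits describe can be sketched as follows; the callable names are illustrative assumptions, and the patent specifies only that the representation features are adjusted while the loss function value exceeds the distance threshold.

```python
# Hedged sketch: keep adjusting the grain vector and word vector representation
# features until the loss function value falls to or below the distance
# threshold, then freeze the matrices as the target matrices.
def train_until_threshold(compute_loss, adjust_features, threshold, max_steps=100_000):
    for _ in range(max_steps):
        loss = compute_loss()      # first vector distance + second vector distance
        if loss <= threshold:
            break                  # matrices become the target matrices
        adjust_features()          # adjust grain and word representation features
```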
An aspect of an embodiment of the present application provides a computer device, including: a processor and a memory;
the memory stores a computer program that, when executed by the processor, causes the processor to perform the method in the embodiments of the present application.
An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, where the computer program includes program instructions, and the program instructions, when executed by a processor, perform the method in the embodiments of the present application.
In one aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by one aspect of the embodiments of the present application.
An embodiment of the present application provides a data processing apparatus, including:
a word acquisition module, configured to acquire an input word and at least two words to be sorted;
a word input module, configured to input the input word and the at least two words to be sorted into a language processing model; the language processing model includes a target grain vector matrix and a target word vector matrix, generated by the data processing method of any one of claims 1 to 9;
a similarity determination module, configured to determine, through the target word vector matrix and the target grain vector matrix in the language processing model, the semantic similarity between each of the at least two words to be sorted and the input word;
and a word sorting module, configured to sort the at least two words to be sorted according to the semantic similarity to obtain a word sequence, and output the word sequence.
Wherein the similarity determination module includes:
a similarity determination unit, configured to determine, through the target word vector matrix, a first semantic similarity between each of the at least two words to be sorted and the input word;
the similarity determination unit is further configured to determine, through the target grain vector matrix, a second semantic similarity between each of the at least two words to be sorted and the input word;
and a similarity fusion unit, configured to fuse, for each word to be sorted, the first semantic similarity and the second semantic similarity between that word and the input word, to obtain the semantic similarity between each word to be sorted and the input word.
An aspect of an embodiment of the present application provides a computer device, including: a processor and a memory;
the memory stores a computer program that, when executed by the processor, causes the processor to perform the method in the embodiments of the present application.
An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, where the computer program includes program instructions, and the program instructions, when executed by a processor, perform the method in the embodiments of the present application.
In one aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by one aspect of the embodiments of the present application.
In the embodiments of the present application, the language domain mapping vector of each target word (surrounding word) under each language domain can be determined from the token grain sequence and the grain vector matrix of the target word under each language domain; by fusing these language domain mapping vectors, a fused language domain mapping vector covering multiple language domains can be obtained. Using the language domain mapping vectors of the target word under each language domain, the fused language domain mapping vector, and the word vector representation feature of the label word (center word), the word vector matrix and the grain vector matrices corresponding to the language domains can be trained and adjusted, yielding a high-quality target word vector matrix.

It should be understood that training the word vector matrix in this way introduces semantic information from the language domains of each word, where each language domain represents one kind of semantic information of a word (for example, the component language domain can represent the morphological information of a word, the pinyin language domain can represent its phoneme information, and the part-of-speech language domain can represent its grammatical information). Compared with the semantic information of the word itself, the semantic information represented by the language domains is clearly deeper and more specific. By fusing the semantic information of multiple language domains of each word, the trained word vector matrix can capture not only the semantic relations within a single language domain but also the semantic relations among multiple language domains, so the word vector matrix obtained by training is more accurate and of higher quality. Meanwhile, high-quality grain vector matrices are also obtained by training; each word vector in the word vector matrix and each grain vector in the grain vector matrices can be used to represent the semantic information of words, and when the high-quality word vector matrix and grain vector matrices are applied to a language processing task, the result of the task is more accurate. Therefore, the present application can improve the quality of the semantic representation vectors of words, and thus the accuracy of language processing tasks.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a diagram of a network architecture provided by an embodiment of the present application;
FIG. 2 is a schematic view of a scenario provided by an embodiment of the present application;
fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 4 is a diagram of an exemplary architecture for determining a language domain mapping vector for each participle according to an embodiment of the present application;
FIGS. 5a-5b are system architecture diagrams of matrix training according to embodiments of the present application;
fig. 6 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another data processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technology. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application belongs to Natural Language Processing (NLP) belonging to the field of artificial intelligence.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like.
Referring to fig. 1, fig. 1 is a diagram of a network architecture according to an embodiment of the present disclosure. As shown in fig. 1, the network architecture may include a service server 1000 and a user terminal cluster, which may include one or more user terminals, as shown in fig. 1, where the number of user terminals is not limited. As shown in fig. 1, the plurality of user terminals may include a user terminal 100a, a user terminal 100b, a user terminal 100c, …, a user terminal 100 n; as shown in fig. 1, the user terminal 100a, the user terminal 100b, the user terminals 100c, …, and the user terminal 100n may be respectively in network connection with the service server 1000, so that each user terminal may perform data interaction with the service server 1000 through the network connection.
It is understood that each user terminal shown in fig. 1 may have a target application installed, and when the target application runs in a user terminal, it may exchange data with the service server 1000 shown in fig. 1, so that the service server 1000 can receive service data from each user terminal. The target application may include an application with a function of displaying data information such as text, images, audio, and video. For example, the application may be a language processing application (e.g., a word similarity matching application, a text classification application, a named entity recognition application, etc.) that a user can use to input text data and obtain the corresponding result. For example, in a word similarity matching application, a user may input a given word pair and a given word to be matched (for example, the given word pair input by the user is "king-queen" and the given word to be matched is "male"); through the word similarity matching application, the new word matching "male" can be inferred from the given word pair "king-queen" to be "female", and the user obtains the result that the word matching "male" is "female". It should be understood that the service server 1000 in the present application may obtain service data through such applications; for example, the service data may be text data input by a user (e.g., the given word pair "king-queen" and the word to be matched "male").
Subsequently, for the obtained text data, the service server 1000 may obtain the semantic representation vector corresponding to the word "king", the semantic representation vector of the word "male", and the semantic representation vector of the word "queen". The service server 1000 may fuse the three semantic representation vectors to obtain a fused semantic representation vector (for example, the semantic representation vector of the word "male" may be subtracted from the semantic representation vector of the word "king", and the result added to the semantic representation vector of the word "queen", yielding the fused semantic representation vector of the three words "king", "male", and "queen"). The service server 1000 may then determine the semantic representation vector with the largest semantic similarity to the fused semantic representation vector, and determine the word corresponding to that vector (that is, "female") as the word matching "male". Subsequently, the service server may return to the user terminal the inference result that, from the given word pair "king-queen", the word matching "male" is presumed to be "female".
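As a concrete illustration of this matching step, here is a hedged sketch assuming the semantic representation vectors are rows of a matrix `word_vecs` indexed by a `vocab` dictionary; both names, and the cosine-similarity choice, are illustrative assumptions rather than details from the patent.

```python
import numpy as np

def infer_match(word_vecs, vocab, pair, query):
    a, b = pair                                   # e.g. ("king", "queen")
    # fused semantic representation vector: king - male + queen
    fused = word_vecs[vocab[a]] - word_vecs[vocab[query]] + word_vecs[vocab[b]]
    # cosine similarity between the fused vector and every word vector
    sims = word_vecs @ fused
    sims /= np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(fused) + 1e-12
    for w in (a, b, query):
        sims[vocab[w]] = -np.inf                  # exclude the input words
    inv = {i: w for w, i in vocab.items()}
    return inv[int(np.argmax(sims))]              # e.g. "female"
```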
In the embodiment of the present application, one user terminal may be selected from the plurality of user terminals as a target user terminal. The user terminal may include smart terminals carrying data processing functions (e.g., a text data display function, a video data playback function, and a music data playback function), such as a smart phone, a tablet computer, a notebook computer, a smart television, a smart speaker, a desktop computer, a smart watch, and a vehicle-mounted device, but is not limited thereto. For example, the user terminal 100a shown in fig. 1 may be used as the target user terminal in the embodiment of the present application; the target application may be integrated in the target user terminal, and the target user terminal may then exchange data with the service server 1000 through the target application.
For example, when a user uses a target application (such as a word similarity matching application) in a user terminal and clicks the word matching control in the application, the user terminal can generate and display a word filling interface in response to the trigger action on that control, and the user can fill in word information (for example, a given word pair and a word to be matched) in the word filling interface. The user terminal can then send the word information filled in by the user to the service server, and the service server can determine the presumed word matching the word to be matched from the given word pair and the semantic representation vector of the word to be matched. The service server may return the presumed word as the inference result to the user terminal, through which the user may view it.
Optionally, it may be understood that the network architecture may include a plurality of service servers, one user terminal may be connected to one service server, and each service server may acquire service data (e.g., text data input by a user) from the user terminals connected to it and perform language processing on the text data according to the function of the target application of the user terminal (for example, if the target application is a word similarity matching application whose function is semantic similarity matching of input words, the service server may perform word similarity matching on the acquired text data).
Optionally, it may be understood that the user terminal itself may also acquire the service data (e.g., text data input by the user) and perform language processing on the text data according to the function of the target application (for example, if the target application is a word similarity matching application whose function is semantic similarity matching of input words, the user terminal may perform word similarity matching on the acquired text data).
It is understood that the method provided by the embodiment of the present application can be executed by a computer device, including but not limited to a user terminal or a service server. The service server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, big data and an artificial intelligence platform.
The user terminal and the service server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
It can be understood that, after the service server or the user terminal obtains the text data input by the user, it may perform a language processing task on the text data through the grain vector matrices and the word vector matrix. For example, for the text data consisting of the given word pair "king-queen" and the word to be matched "male", the service server or the user terminal may determine the fused semantic feature vector of the three words using the grain vector representation features contained in the grain vector matrices and the word vector representation features contained in the word vector matrix, determine the word vector representation feature in the word vector matrix with the largest semantic similarity to the fused semantic feature vector, and then determine the word corresponding to that feature as the word matching the word to be matched "male".
It should be understood that the grain vector representation features contained in a grain vector matrix are the vector representation features corresponding to the grains of a number of words, while the word vector representation features contained in the word vector matrix are the vector representation features corresponding to the words themselves. The word vector representation features can describe the semantics of the word body, and the grain vector representation features can describe deeper semantics of the word, such as its form, phonemes, and part of speech. For ease of understanding, grains and related concepts are described below.
Language domain (Linguistic Field): a representation of features describing a sentence or paragraph. A sentence is regarded as a sequence of words; the word language domain can describe the body of the sentence, the component language domain can describe the form of the sentence, the pinyin language domain can describe the phonemes of the sentence, the part-of-speech language domain can describe the grammatical information of the sentence, and so on. Under a given language domain, a word in a sentence can be decomposed into sub-word units.
Word and grain: a word is a basic unit of a sentence (for example, for the Chinese sentence 我们都爱苹果, "we all love apples", the characters 我, 们, 都, 爱, 苹, and 果 are all basic units of the sentence). Each word can be represented under different language domains; given a word and a language domain, the word can be represented by a grain sequence composed of the linguistic grains (Linguistic Grain) of that language domain. For example, the word 智 ("wisdom") can be represented by a grain sequence of the component language domain, i.e., by the component grains 矢, 口, and 日, where 矢, 口, and 日 can each be understood as one token grain of the word 智; the word 智 can also be represented by a grain sequence of the pinyin language domain, i.e., by the pinyin grains "zh" and "i", where "zh" and "i" can each be understood as one token grain of the word 智.
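For illustration only (the structure below is an assumption, not the patent's data format), the token grain sequences of one word under two language domains could be laid out as:

```python
# One word mapped to its token grain sequences under two language domains.
grain_sequences = {
    "智": {                              # the word "wisdom"
        "component": ["矢", "口", "日"],  # morphological grains
        "pinyin":    ["zh", "i"],         # phoneme grains
    }
}
```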
Word Embedding and Grain Embedding: a word embedding matrix (word vector matrix) returns, for each word in a vocabulary, a vector composed of continuous real numbers; after the concepts of language domain and grain are introduced, grain embedding matrices (grain vector matrices) can likewise be generated, and the grain vector matrix of each language domain returns, for each grain in that language domain, a vector composed of continuous real numbers.
It should be appreciated that the word vector matrix and the grain vector matrices may be deployed in a language processing model, through which language processing tasks (e.g., a word similarity matching task, a text classification task, etc.) can be performed. To improve the accuracy of the language processing tasks performed by the model, the word vector matrix and the grain vector matrices can be trained so that, after training, they are of higher quality (that is, they can accurately represent the semantics of words). For a specific implementation of training the word vector matrix and the grain vector matrices, reference may be made to the description of the embodiment corresponding to fig. 3.
For ease of understanding, please refer to fig. 2, which is a schematic view of a scenario provided by an embodiment of the present application. The user terminal M shown in fig. 2 may be any user terminal selected from the user terminal cluster in the embodiment shown in fig. 1; for example, it may be the user terminal 100b. The service server shown in fig. 2 may be the service server 1000 in the embodiment corresponding to fig. 1.
As shown in fig. 2, in the target application (a word similarity matching application) of the user terminal M, the user M can input a given word and words to be sorted. The given word input by the user M is "tiger", and the words to be sorted are "elephant, river, lion, forest". After finishing the input, the user M can click the finish control, and the user terminal M, in response to the trigger operation of the user M on the finish control, acquires the text data input by the user M (including the given word "tiger" and the words to be sorted "elephant, river, lion, forest") and generates a word sorting request; subsequently, the user terminal M may send the word sorting request carrying the text data input by the user M to the service server.
Further, the service server can sort the words to be sorted "elephant, river, lion, forest" through the language processing model. A specific method for sorting the words through the language processing model may be as follows: the language processing model includes a word vector matrix and grain vector matrices, where the word vector matrix includes the word vector representation features corresponding to a number of words, and the grain vector matrices include the grain vector representation features corresponding to the grains of each word. It should be understood that the word vector representation feature of the given word "tiger" can be obtained from the word vector matrix, and the grain vector representation features of the grains of the given word can be obtained from the grain vector matrices; then, according to the word vector representation feature and the grain vector representation features corresponding to the given word "tiger", the fused vector representation feature of the given word "tiger" can be determined. Similarly, the fused vector representation features of the words to be sorted "elephant", "river", "lion", and "forest" can also be determined through the word vector matrix and the grain vector matrices.
Further, the vector distance between the given word "tiger" and each word to be sorted may be determined; that is, the vector distance between the fused vector representation feature of the given word "tiger" and the fused vector representation feature of each of the words "elephant", "river", "lion", and "forest" may be calculated. The words "elephant, river, lion, forest" may then be sorted in ascending order of vector distance. A smaller vector distance indicates a higher semantic similarity between two words: because the vector distance between the given word "tiger" and the word "lion" is the smallest, "lion" is placed first, and because the vector distance between "tiger" and "river" is the largest, "river" is placed last, yielding the sorted word sequence "lion, elephant, forest, river".
Optionally, it is understood that the words to be sorted may also be sorted in descending order of vector distance, in which case the sorted word sequence would be "river, forest, elephant, lion". In the word sequence "river, forest, elephant, lion" obtained by sorting the vector distances from large to small, the nearer a word is to the end of the sequence, the more similar its semantics are to "tiger".
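A hedged sketch of this ranking step follows, assuming a helper `fused_vec` that returns the fused vector representation feature of a word built from the word vector matrix and the grain vector matrices; the helper name and the Euclidean-distance choice are illustrative assumptions, not details from the patent.

```python
import numpy as np

def rank_candidates(fused_vec, given, candidates, ascending=True):
    g = fused_vec(given)
    # vector distance between the given word and each candidate
    dists = {w: float(np.linalg.norm(fused_vec(w) - g)) for w in candidates}
    return sorted(candidates, key=dists.get, reverse=not ascending)

# e.g. rank_candidates(fused_vec, "tiger", ["elephant", "river", "lion", "forest"])
# -> ["lion", "elephant", "forest", "river"] under the distances described above.
```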
Further, the service server may return the word sequence to the user terminal M, the user terminal M may display the word sequence in the display interface, and the user M may view the result of the sorting of the words to be sorted in the display interface, that is, may view the word sequence.
It can be understood that, in order to make the results output by the language processing model (e.g., for the word similarity matching task) more accurate, the word vector matrix and the grain vector matrices in the language processing model may be trained, so that after training they are of higher quality and can accurately represent the semantic information of each word; the language processing results (e.g., word similarity matching results) obtained with the word vector matrix and the grain vector matrices are then more accurate. For a specific implementation of training the word vector matrix and the grain vector matrices, reference may be made to the description of the embodiment corresponding to fig. 3.
For ease of understanding, please refer to fig. 3, which is a schematic flowchart of a data processing method provided by an embodiment of the present application. The method may be executed by a user terminal (e.g., the user terminal shown in fig. 1 and fig. 2), by a service server (e.g., the service server 1000 in the embodiment corresponding to fig. 1), or jointly by the user terminal and the service server. For ease of understanding, this embodiment is described with the method executed by the user terminal. The method includes at least the following steps S101-S105:
step S101, obtaining representation speech grain sequences respectively corresponding to at least two language domains of a target word segmentation, and obtaining speech grain vector matrixes respectively corresponding to the at least two language domains; each speech particle vector matrix is associated with sample text.
In the present application, a language domain may be a representation of features describing a sentence or paragraph. Each word in a sentence is a basic unit of the sentence, and each word can be represented under different language domains; given a word and a language domain, the word can be represented by a grain sequence composed of the linguistic grains (Linguistic Grain) of that language domain. The at least two language domains may include a word language domain, a component language domain, a pinyin language domain, a part-of-speech language domain, and so on. The word language domain can describe the body of a word in a sentence, the component language domain can describe the form of the word, the pinyin language domain can describe the phonemes of the word, and the part-of-speech language domain can describe the grammatical information of the word. For example, the word 智 ("wisdom") can be represented by a grain sequence of the component language domain, i.e., by the component grains 矢, 口, and 日, each of which describes the form of 智; 矢, 口, and 日 can each be understood as one token grain of the word 智, and the sequence {矢, 口, 日} composed of these token grains is the token grain sequence of the word 智 under the component language domain.
In the present application, the token grain sequence of the target word under a language domain can be expanded, and the expanded token grain sequence is used as the final token grain sequence of the target word under that language domain. Taking a language domain K_i included in the at least two language domains as an example, a specific method for expanding the token grain sequence of the target word under each language domain is as follows. First, the initial token grains corresponding to the target word under the language domain K_i can be acquired; the initial token grains can then be combined to obtain extended token grains; next, the extended token grains are filtered to obtain filtered token grains, and the token grains composed of the initial token grains and the filtered token grains are determined as the target token grains; subsequently, the sequence composed of the target token grains can be determined as the token grain sequence of the target word under the language domain K_i.
For example, take the target word 智 ("wisdom") and the language domain K_i as the component language domain. The initial token grains of the word 智 under the component language domain include the initial token grains 矢, 口, and 日. The initial token grain 矢 can be combined with the initial token grain 口 to obtain the extended token grain 矢口; the initial token grain 矢 can be combined with the initial token grain 日 to obtain the extended token grain 矢日; and the initial token grain 口 can be combined with the initial token grain 日 to obtain the extended token grain 口日.
It can be understood that, because the extended token grains 矢日 and 口日 have no actual meaning, they can be deleted (i.e., filtered out). Because the extended token grain 矢口 forms 知 ("to know"), whose semantics of "knowledge" have a strong similarity to the semantics of 智 ("wisdom"), the extended token grain 矢口 can be retained as the final extended token grain. Subsequently, the token grains composed of the initial token grains 矢, 口, and 日 and the extended token grain 矢口 can be determined as the target token grains, yielding the token grain sequence {矢, 口, 日, 矢口} containing the target token grains 矢, 口, 日, and 矢口.
It should be understood that the present application can expand the token grains of a word by means of N-Gram processing, and can screen the extended token grains by means of word dropping. Expanding the token grains generates new grains that enrich the grain sequence of a word, so that the information of each word can be described more fully; filtering the extended token grains removes those extended grains that have no actual meaning. That is, by expanding and filtering the token grains of a word, the token grains become richer and the token grain sequence describes the word more accurately. Taking the word 智 as an example, its initial token grain sequence is {矢, 口, 日}, where the initial token grain 矢 means "arrow", 口 means "mouth", and 日 means "sun"; none of "arrow", "mouth", or "sun" is associated with the meaning of "wisdom". After expansion and filtering, the token grain sequence obtained is {矢, 口, 日, 矢口}, and the newly generated extended token grain 矢口 forms 知, whose semantics of "knowledge" are strongly associated with the semantics of "wisdom". Compared with the initial token grain sequence {矢, 口, 日}, the token grain sequence {矢, 口, 日, 矢口} clearly describes the word 智 more accurately.
Optionally, it may be understood that the speech particle table corresponding to each language domain may also be expanded by the N-Gram processing method: each word obtains its initial token particles from the particle table, and after the initial token particles of each word are expanded by N-Gram processing to obtain extended token particles, the extended token particles may likewise be stored in the particle table.
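As a concrete illustration of the expansion-and-filtering step above, the following Python sketch is a minimal, hypothetical rendering: the composition table mapping component pairs to composed characters and the bigram order are assumptions for illustration, not details fixed by the patent.

```python
def expand_token_particles(initial, composition_table, n=2):
    """Expand initial token particles with adjacent n-grams (N-Gram
    processing), keeping only combinations that compose a meaningful
    particle (the word-dropping / filtering step)."""
    expanded = []
    for i in range(len(initial) - n + 1):
        combo = tuple(initial[i:i + n])
        if combo in composition_table:            # filter: combinations with
            expanded.append(composition_table[combo])  # no meaning are dropped
    return initial + expanded                     # initial + filtered extensions

# Hypothetical composition table: the components 矢 and 口 compose 知.
table = {("矢", "口"): "知"}
print(expand_token_particles(["矢", "口", "日"], table))
# -> ['矢', '口', '日', '知']
```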
The speech particle vector matrix in the application is a matrix formed by speech particle vector representation features; one speech particle vector representation feature is the vector representation feature corresponding to one token particle in the particle table. It should be understood that since each language domain corresponds to one kind of token particle, each language domain also corresponds to one speech particle vector matrix, and each speech particle vector matrix is associated with the sample text; that is, each speech particle vector matrix contains the speech particle vector representation features corresponding to the sample words in the sample text (training text). In order to improve the quality of each speech particle vector matrix, so that the vector representation features it contains can accurately describe each word, each matrix can be adjusted through training; for the specific method of training adjustment, reference may be made to the description of subsequent steps S102 to S105.
Step S102, determining the language domain mapping vectors respectively corresponding to the target participles under each language domain according to the token particle sequence and the speech particle vector matrix respectively corresponding to each language domain.
In this application, the case where the at least two language domains include a language domain K_i, the token particle sequences respectively corresponding to the at least two language domains include the sequence M_i corresponding to K_i, and the speech particle vector matrices respectively corresponding to the at least two language domains include the matrix T_i corresponding to K_i is taken as an example to explain the specific method for determining the language domain mapping vectors of the target participles under each language domain. The matrix T_i includes the vector representation features of the sample token particles in the language domain K_i, where the sample token particles are associated with the sample text (training text) (the sample token particles are the token particles of K_i in the particle table, and include the token particles corresponding to the words in the sample text), and the sample token particles include the target token particles in the sequence M_i; i is a positive integer less than or equal to the number of the at least two language domains.
For determining the language domain mapping vector of the target participle under the language domain K_i, the specific method can be as follows: first, the speech particle vector representation features in the matrix T_i are obtained; then, among those features, the vector representation features corresponding to the target token particles in the sequence M_i are obtained; subsequently, the number of token particles in the sequence M_i is obtained, and from the vector representation features of the target token particles in M_i and the number of token particles, the language domain mapping vector of the target participle under K_i can be determined.
For a specific method for determining the language domain mapping vector corresponding to the target participle under each language domain, the method can be as shown in formula (1):
$$P_f(w) = \frac{1}{|G_f(w)|} \sum_{g \in G_f(w)} E^f_g \qquad (1)$$

where $G_f(w)$ denotes the token particle sequence of a word $w$ (e.g., the target participle) under the language domain $f$; $g$ denotes one token particle in $G_f(w)$; $E^f$ denotes the speech particle vector matrix corresponding to the language domain $f$; $E^f_g$ denotes the vector representation feature (speech particle vector representation feature) of the token particle $g$ in $E^f$; and $P_f(w)$ denotes the language domain mapping vector of the target participle under the language domain $f$.
It should be understood that formula (1) determines the vector representation feature of each token particle in the token particle sequence, adds these features together, and averages the sum over the number of token particles in the sequence, thereby obtaining the language domain mapping vector of the word under one language domain.
The case where the target token particles in the sequence M_i include a target token particle S_t and a target token particle S_w is taken as an example below to describe the method for determining the language domain mapping vector of one target participle under one language domain (e.g., K_i), where t and w are positive integers less than or equal to the number of token particles in M_i. The vector representation feature of S_t and the vector representation feature of S_w can be added to obtain a first operation vector representation feature (i.e., $\sum_{g \in G_f(w)} E^f_g$ in formula (1)); subsequently, the first operation vector representation feature can be averaged over the number of token particles (i.e., $|G_f(w)|$ in formula (1)) to obtain a mean vector representation feature; according to the mean vector representation feature, the language domain mapping vector of the target participle under K_i (i.e., $P_f(w)$ in formula (1)) can be determined.
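A minimal numeric sketch of formula (1), assuming the speech particle vector matrix is represented as a plain dict from particle to NumPy vector (the particle names and the 4-dimensional toy vectors are illustrative only):

```python
import numpy as np

def domain_mapping_vector(particle_seq, particle_matrix):
    """Formula (1): average the particle vectors E^f_g over all
    token particles g in G_f(w)."""
    vecs = [particle_matrix[g] for g in particle_seq]
    return np.sum(vecs, axis=0) / len(particle_seq)

dim = 4
E_f = {"矢": np.ones(dim), "口": np.zeros(dim),
       "日": np.full(dim, 2.0), "知": np.full(dim, 3.0)}
P_w = domain_mapping_vector(["矢", "口", "日", "知"], E_f)  # P_f(w)
```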
The above describes the specific method for determining the language domain mapping vector of one target participle (one word) under one language domain. It should be understood that if there are multiple target participles, an overall language domain mapping vector corresponding to all of them under a language domain may be determined, and this overall language domain mapping vector may be used as the language domain mapping vector of each target participle under that language domain.
The case where the target participles include a target participle C_a and a target participle C_b is taken as an example below to describe the method for determining the overall language domain mapping vector of multiple target participles under one language domain (e.g., K_i), where a and b are both positive integers less than or equal to the number of participles in the sample text (training text).
For determining the overall language domain mapping vector of multiple target participles under the language domain K_i, the specific method can be as follows: the mean vector representation feature corresponding to C_a and the mean vector representation feature corresponding to C_b can be obtained; the two can be added to obtain a second operation vector representation feature; the number of participles in the sample text can then be obtained, and the second operation vector representation feature averaged over that number, thereby obtaining the language domain mapping vector corresponding to C_a and C_b together under K_i.
For a specific method for determining the language domain mapping vector corresponding to the target participles in each language domain, the method can be as shown in formula (2):
$$P_f(S) = \frac{1}{|S|} \sum_{w \in S} P_f(w) \qquad (2)$$

where $S$ denotes a set of target participles (a set of words); $w \in S$ denotes each target participle in $S$; $P_f(w)$ denotes the language domain mapping vector (mean vector representation feature) of each target participle in the set under the language domain $f$; $\sum_{w \in S} P_f(w)$ denotes the addition of the mean vector representation features of all target participles; and $P_f(S)$ denotes the overall language domain mapping vector of the multiple target participles under the language domain $f$.
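Continuing the earlier sketch (and reusing `domain_mapping_vector` from it), formula (2) simply averages the per-word mapping vectors over the word set S; the dict structure of `particle_seqs` is an assumption for illustration:

```python
import numpy as np

def overall_domain_mapping(words, particle_seqs, particle_matrix):
    """Formula (2): P_f(S) = mean over w in S of P_f(w)."""
    per_word = [domain_mapping_vector(particle_seqs[w], particle_matrix)
                for w in words]
    return np.sum(per_word, axis=0) / len(words)
```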
Step S103, fusing the language domain mapping vectors respectively corresponding to the target participles under each language domain to generate a fused language domain mapping vector of the target participles.
In this application, it should be understood that after the overall language domain mapping vectors of the target participles under each language domain are determined, these vectors may be fused, so that the fused language domain mapping vector of the target participles across the multiple language domains can be determined.
The case where the at least two language domains include a language domain K_i and a language domain K_j, and the target participles include a target participle C_a and a target participle C_b, is taken as an example to describe the specific method for determining the fused language domain mapping vector of the target participles under at least two (multiple) language domains, where j is a positive integer less than or equal to the number of the at least two language domains.
The specific method for determining the fused language domain mapping vector of the target participles under at least two (multiple) language domains can be as follows: the language domain mapping vector of C_a and C_b under K_i, and the language domain mapping vector of C_a and C_b under K_j, can be obtained; the two can be added to obtain an operation language domain mapping vector; subsequently, the number of the at least two language domains can be obtained, and the operation language domain mapping vector averaged over that number to obtain the fused language domain mapping vector.
For a specific method for determining a fusion language domain mapping vector corresponding to a target participle under at least two language domains, the method can be as shown in formula (3):
$$P_0(S) = \frac{1}{|F|} \sum_{f \in F} P_f(S) \qquad (3)$$

where $F$ denotes the set of language domains; $P_f(S)$ denotes the overall language domain mapping vector of the target participles under the language domain $f$; $\sum_{f \in F} P_f(S)$ denotes the addition of the overall language domain mapping vectors under each language domain in $F$; and $P_0(S)$ denotes the fused language domain mapping vector of the target participles under all language domains.
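Formula (3) is the same averaging pattern one level up, over the language domains; a sketch (the dict keys are hypothetical domain names):

```python
import numpy as np

def fused_mapping(domain_vectors):
    """Formula (3): P_0(S) = mean over all domains f in F of P_f(S)."""
    vecs = list(domain_vectors.values())
    return np.sum(vecs, axis=0) / len(vecs)

# e.g. P0 = fused_mapping({"component": P1, "pinyin": P2, "pos": P3})
```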
Step S104, acquiring a word segmentation vector matrix associated with the sample text; the sample text includes sentence text composed of target participles and tag participles.
In the present application, the sample text is a training text that may consist of one or more sentence texts. Each word in a sentence text may in turn serve as a tag participle (center word), and the participles in the sentence that have an association relationship with the tag participle may serve as target participles. For example, for The sentence text "The fox runs after cat", the participle "runs" may be taken as the tag participle: since the numbers of participles between "The" and "runs" and between "fox" and "runs" are 1 and 0, both less than or equal to the participle number threshold of 1, "The" and "fox" have an association relationship with the tag participle "runs" and may be taken as target participles; similarly, "after" and "cat" may also be taken as target participles. It should be understood that the other participles in the sentence may likewise serve as tag participles: for example, with "fox" as the tag participle, the number of participles between "The" and "fox" is 0, which is less than the threshold of 1, so "The" may be taken as a target participle; similarly, since the numbers of participles between "runs", "after" and "fox" are 0 and 1, both within the threshold, "runs" and "after" may both be taken as target participles. The participle number threshold may be set manually, for example to 1, 4, 100, 1000, and so on, which are not exhaustively illustrated here.
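The pairing of tag participles (center words) with their target participles can be sketched as follows; `max_gap` plays the role of the participle number threshold (set to 1 here, matching the example above):

```python
def context_pairs(tokens, max_gap=1):
    """For each position t, take tokens[t] as the tag participle and every
    other token with at most max_gap words in between as a target participle."""
    pairs = []
    for t, label in enumerate(tokens):
        targets = [tokens[j] for j in range(len(tokens))
                   if j != t and abs(j - t) - 1 <= max_gap]
        pairs.append((label, targets))
    return pairs

sentence = "The fox runs after cat".split()
# with "runs" as the tag participle, the targets are The, fox, after, cat
print(context_pairs(sentence)[2])
```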
The word segmentation vector matrix in the application is a matrix formed by participle vector representation features; one participle vector representation feature is the vector representation feature corresponding to one word in a word list, and the words in the word list include the words in the sample text (training text).
Step S105, obtaining the participle vector representation feature corresponding to the tag participle in the word segmentation vector matrix, and adjusting the speech particle vector matrices and the word segmentation vector matrix according to the language domain mapping vectors of the target participles under each language domain, the fused language domain mapping vector, and the participle vector representation feature corresponding to the tag participle, to obtain a target speech particle vector matrix and a target word segmentation vector matrix for performing a language processing task.
In the application, the participle vector representation feature corresponding to the tag participle can be obtained from the word segmentation vector matrix; a first vector distance between the language domain mapping vector of the target participles under each language domain and this participle vector representation feature can then be determined, as well as a second vector distance between the fused language domain mapping vector and the participle vector representation feature; the first vector distance and the second vector distance can then be added, the result of the addition determined as the loss function value, and the speech particle vector matrices and the word segmentation vector matrix respectively adjusted according to the loss function value, so that the target speech particle vector matrix and the target word segmentation vector matrix can be obtained.
The specific method for determining the first vector distance and the second vector distance can be as follows: the first transposed mapping vector corresponding to the target participles under each language domain can be obtained, multiplied by the participle vector representation feature, and the first vector distance determined according to the result of the multiplication; the first transposed mapping vector is the transpose of the language domain mapping vector of the target participles under each language domain; subsequently, the second transposed mapping vector corresponding to the fused language domain mapping vector can be obtained, multiplied by the participle vector representation feature, and the second vector distance determined according to the result of the multiplication.
Further, the first vector distance and the second vector distance are added to obtain the loss function value, and the loss function value can be matched against a distance threshold. If the loss function value is greater than the distance threshold, the speech particle vector representation features corresponding to the target token particles in the speech particle vector matrices and the participle vector representation feature corresponding to the tag participle in the word segmentation vector matrix can be respectively adjusted; if the loss function value is less than or equal to the distance threshold, the speech particle vector matrices can be determined as the target speech particle vector matrices, and the word segmentation vector matrix as the target word segmentation vector matrix.
For a specific method of determining the first vector distance or the second vector distance, it can be as shown in equation (4):
$$\phi(w_t \mid P) = -\log \frac{\exp\!\left(P^{T} E_w(w_t)\right)}{\exp\!\left(P^{T} E_w(w_t)\right) + \sum_{w_j \in V \setminus \{w_t\}} \exp\!\left(P^{T} E_w(w_j)\right)} \qquad (4)$$

where $P$ may denote $P_f(S)$ in formula (2) above, i.e., the overall language domain mapping vector of the target participles under the language domain $f$, or $P_0(S)$ in formula (3) above, i.e., the fused language domain mapping vector of the target participles under all language domains; $P^T$ denotes the transpose of $P_f(S)$ or $P_0(S)$ (i.e., the first transposed mapping vector or the second transposed mapping vector); $E_w$ denotes the word segmentation vector matrix; $w_t$ denotes the tag participle, and $E_w(w_t)$ the participle vector representation feature of the tag participle $w_t$ in the word segmentation vector matrix; $w_j$ denotes a participle in the word set (participle set) $V$ other than the tag participle $w_t$, and $E_w(w_j)$ its participle vector representation feature; $\phi(w_t \mid P)$ denotes the vector distance (first vector distance or second vector distance) between the participle vector representation feature of the tag participle and $P_f(S)$ or $P_0(S)$.
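Read as a softmax cross-entropy over the vocabulary, formula (4) can be sketched as below. This is a reconstruction under that reading, with `E_w` assumed to be a NumPy matrix holding one row per word in V:

```python
import numpy as np

def phi(label_idx, P, E_w):
    """Vector 'distance' of formula (4): negative log-probability of the
    tag participle w_t under a softmax over the scores P^T E_w(w_j)."""
    scores = E_w @ P                 # P^T E_w(w_j) for every word w_j
    scores -= scores.max()           # numerical stability
    log_z = np.log(np.exp(scores).sum())
    return -(scores[label_idx] - log_z)
```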
It should be understood that, through the above formula (4), the first vector distance $\phi(w_t \mid P_f(S))$ between the language domain mapping vector of the target participles under each language domain and the participle vector representation feature can be obtained, as well as the second vector distance $\phi(w_t \mid P_0(S))$ between the fused language domain mapping vector and the participle vector representation feature. The first vector distance and the second vector distance may be added to obtain the loss function value, as shown in formula (5):
$$L_{fe}(w_t) = \phi\!\left(w_t \mid P_0(S)\right) + \sum_{f \in F} \phi\!\left(w_t \mid P_f(S)\right) \qquad (5)$$
It can be understood that in the above formula (5), the loss function value $L_{fe}(w_t)$ is obtained by adding $\phi(w_t \mid P_0(S))$ and $\sum_{f \in F} \phi(w_t \mid P_f(S))$. For the first term $P_0$, the gradient during optimization is $grad_0$; this gradient is propagated backward to every speech particle in every language domain, so $grad_0$ can be understood as shared language knowledge common to all language domains. Meanwhile, for the term $P_f$ of the language domain $f$ in the second sum, the gradient during optimization is $grad_f$; in the backward update, $grad_f$ updates only the speech particle vectors of the particles belonging to the language domain $f$, so $grad_f$ can be understood as semantic knowledge unique to the language domain $f$. In view of this, when the speech particle vectors are updated backward, each speech particle vector receives both information unique to its own language domain and information shared across the multiple language domains.
Formula (5) above gives the loss function value of one tag participle; from it, the overall loss function value of the whole sample text (training text) can be determined, as shown in formula (6):
$$L(C) = \sum_{w_t \in C} L_{fe}(w_t) \qquad (6)$$

where $C$ denotes the sample text; $w_t \in C$ indicates that each word in the sample text serves once as the tag participle; and $L(C)$ denotes the overall loss function value of the sample text (training text).
As can be seen from formulas (5) and (6), the loss function value $L_{fe}(w_t)$ of each tag participle in the present application is designed as follows: for a sentence text in the training text, the tag participle (center word) and the target participles (surrounding words) can be determined; through the methods in steps S101 to S103 above, the language domain mapping vector $P_f(S)$ of the target participles under each language domain and the fused language domain mapping vector $P_0(S)$ under the multiple language domains can be determined; subsequently, the participle vector representation feature of the tag participle can be used as the label, the distances between each $P$ obtained from the target participles (including $P_0(S)$ and every $P_f(S)$) and the label can be calculated, and the sum of these distances used as the loss function value. The purpose of adjusting the word segmentation vector matrix and the speech particle vector matrices according to the loss function value is to make the sum of distances smaller and smaller, i.e., to make the $P$ obtained from the target participles approach the participle vector representation feature of the tag participle ever more closely.
It should be noted that the specific process of adjusting the word segmentation vector matrix and the speech particle vector matrices according to the loss function value can be implemented by the SGD optimization method. SGD makes the participle vector representation features of the tag participles in the word segmentation vector matrix describe the semantics of the tag participles more and more accurately, and at the same time raises the quality of the speech particle vector representation features of the target participles in the speech particle vector matrices, so that the distance between the $P$ obtained from the target participles and the participle vector representation feature of the tag participle becomes smaller and smaller until the condition is satisfied.
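A hand-written SGD step for one distance term, under the same softmax reading of formula (4) as in the earlier sketch; the gradient routing mirrors the text (the gradient of the $P_0$ term flows to every domain's particle vectors, that of each $P_f$ term only to domain f's). All names are illustrative, and the deeper chain rule through the mean-of-means is simplified:

```python
import numpy as np

def sgd_step(label_idx, P, particle_vecs, E_w, lr=0.05):
    """One SGD update for a single term phi(w_t | P), where P is the mean
    of the vectors in particle_vecs (rows of a speech particle matrix)."""
    scores = E_w @ P
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    onehot = np.zeros(len(E_w)); onehot[label_idx] = 1.0
    grad_P = (probs - onehot) @ E_w            # d phi / d P
    for v in particle_vecs:                    # P is a mean, so each particle
        v -= lr * grad_P / len(particle_vecs)  # vector receives grad_P / n
    E_w -= lr * np.outer(probs - onehot, P)    # update word vectors in place
```

Calling this once with P = P_0(S) and the union of all domains' particle vectors, and once per domain with P = P_f(S) and only that domain's particle vectors, reproduces the shared-versus-unique gradient split described above.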
It can be understood that, through training, the participle vector representation features in the target word segmentation vector matrix can accurately describe each word, and the speech particle vector representation features in the target speech particle vector matrices related to each word (via its surrounding words) can also accurately describe that word. Therefore, when either the target word segmentation vector matrix or the target speech particle vector matrices are applied in a language processing task, a language processing result with high accuracy can be obtained.
Alternatively, it will be appreciated that the loss function value $L_{fe}(w_t)$ may also be determined from $\phi(w_t \mid P_0(S))$ alone.
In the embodiment of the application, the language domain mapping vector of each target participle (surrounding word) under each language domain can be determined from the token particle sequences and speech particle vector matrices of the target participles under each language domain; by fusing these language domain mapping vectors, the fused language domain mapping vector of the target participles under the multiple language domains can be obtained; and according to the language domain mapping vectors of the target participles under each language domain, the fused language domain mapping vector, and the participle vector representation feature of the tag participle (center word), the word segmentation vector matrix corresponding to the center word and the speech particle vector matrices corresponding to the language domains can be trained and adjusted, so that a high-quality target word segmentation vector matrix can be obtained. It should be understood that the training of the word segmentation vector matrix introduces the semantic information of the language domains of each word, and each language domain can be used to represent one kind of semantic information of a word (for example, the component language domain can represent the morphological information of a word, the pinyin language domain its phoneme information, and the part-of-speech language domain its grammatical information). Compared with the semantic information of the word itself, the semantic information represented by the language domains is obviously deeper and more specific, and by fusing the semantic information of the multiple language domains of each word, the trained word vector matrix can contain not only the semantic relations within a single language domain but also the semantic relations among multiple language domains, so the word segmentation vector matrix obtained by training is more accurate and of higher quality. Meanwhile, high-quality speech particle vector matrices can also be obtained through training: each participle vector in the word segmentation vector matrix and each speech particle vector in the speech particle vector matrices can accurately describe the semantic information of a word, and when these high-quality matrices are applied to a language processing task, the result of the task is more accurate. Therefore, the method and the device can improve the quality of the semantic representation vectors of words, and thereby the accuracy of language processing tasks.
For ease of understanding, please refer to fig. 4, which is a diagram of an architecture for determining the language domain mapping vector of a participle according to an embodiment of the present application. As shown in fig. 4, for the participle 智 ("wisdom"), the initial token particles corresponding to it in the component language domain, namely 矢, 口, and 日, may be obtained from the particle table; subsequently, the initial token particles can be expanded through N-Gram processing to generate the new extended token particles 矢口 (i.e., 知) and 口日; then, the extended particles without actual meaning can be deleted through the Dropping method, so that the final token particle sequence of the participle 智 is obtained as {矢, 口, 日, 知}. Further, each speech particle vector representation feature of the token particle sequence is obtained from the speech particle vector matrix E_G corresponding to the component language domain, and the language domain mapping vector P(智) of the participle 智 under the component language domain is obtained. For the specific implementation of determining P(智), reference may be made to the description in the embodiment corresponding to fig. 3, which will not be repeated here.
Similarly, for the participle "wisdom", the initial token particles corresponding to it in the component language domain may be obtained from the particle table corresponding to English, namely "w", "i", "s", "d", "o", "m"; subsequently, the initial token particles can be expanded through N-Gram processing to generate new extended token particles; then, the particles without actual meaning can be deleted through the Dropping method, so that the final token particle sequence of the participle "wisdom" is obtained as {w, i, s, d, o, m, w-i, i-s, s-d, d-o, o-m, w-i-s, i-s-d, d-o-m}. Further, each speech particle vector representation feature of the token particle sequence is obtained from the speech particle vector matrix E_G corresponding to the component language domain, and the language domain mapping vector P(wisdom) of the participle "wisdom" under the component language domain is obtained. For the specific implementation of determining P(wisdom), reference may be made to the description in the embodiment corresponding to fig. 3, which will not be repeated here.
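For alphabetic words, the expansion reduces to character n-grams; a sketch that reproduces the 2- and 3-gram sequence listed above (any subsequent Dropping step would further prune it):

```python
def char_ngrams(word, orders=(2, 3)):
    """Initial particles are the single characters; extended particles are
    the contiguous character n-grams of the requested orders."""
    grams = list(word)
    for n in orders:
        grams += [word[i:i + n] for i in range(len(word) - n + 1)]
    return grams

print(char_ngrams("wisdom"))
# ['w','i','s','d','o','m','wi','is','sd','do','om','wis','isd','sdo','dom']
```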
Referring to figs. 5a-5b, which are system architecture diagrams of training the matrices according to an embodiment of the present disclosure, the architecture is illustrated with The training sentence "The fox runs after cat". As shown in fig. 5a, in this sentence the participle "runs" can be used as the tag participle (center word), and the participles "The", "fox", "after", and "cat" as the target participles (surrounding words). The speech particle vector matrix E_G^1 corresponding to the first language domain can be obtained; through E_G^1, the speech particle vector representation features of the target participles under the first language domain can be obtained, and further the language domain mapping vector P1 of the target participles under the first language domain. Similarly, through the matrix E_G^2 corresponding to the second language domain, the language domain mapping vector P2 of the target participles under the second language domain can be obtained, and likewise the language domain mapping vector PF under the F-th language domain. Further, the fused language domain mapping vector P0 of the target participles under all language domains can be obtained from the language domain mapping vectors P1, P2, ..., PF. According to P1, P2, ..., PF, and P0, the tag participle "runs" can be predicted: the participle vector representation feature of the tag participle can be obtained from the word segmentation vector matrix Ew, the vector distances between P1, P2, ..., PF, P0 and this participle vector representation feature can be determined, and the matrices adjusted according to these vector distances.
Further, as shown in fig. 5b, fig. 5b illustrates determining the vector distances between the language domain mapping vectors P1, P2, ..., PF, the fused language domain mapping vector P0, and the participle vector representation feature. As shown in fig. 5b, a vector distance 1 between P1 and the participle vector representation feature can be determined, a vector distance 2 between P2 and the feature, ..., and a vector distance F between PF and the feature; then, the vector distances 1, 2, ..., F can be added to obtain the sum of the vector distances of the target participles under each language domain. Meanwhile, a vector distance 0 between the fused language domain mapping vector P0 and the participle vector representation feature can be determined and added to that sum, thereby obtaining the loss function value corresponding to the tag participle "runs". According to the loss function value, the speech particle vector representation features of the target participles in the matrix E_G^1 can be adjusted, as can those in E_G^2, ..., and E_G^F; the participle vector representation feature of the tag participle in the word segmentation vector matrix Ew can also be adjusted. Through the adjustment, the target speech particle vector matrix corresponding to each language domain can be obtained, as well as the target word segmentation vector matrix.
It should be noted that the system architecture provided by the embodiments corresponding to figs. 5a-5b can be understood as a training optimization architecture of the CBOW model; the speech particle vector matrices and the word segmentation vector matrix may also be trained and optimized by the skip-gram method.
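Tying the earlier sketches together, one CBOW-style training step from figs. 5a-5b might look like the following (reusing `domain_mapping_vector`, `overall_domain_mapping`, `fused_mapping`, and `phi` from the snippets above; the structure of the `domains` dict is an assumption):

```python
def training_step(label_idx, target_words, domains, E_w):
    """Compute P_f for every domain, fuse them into P_0, and sum the
    vector distances to the tag participle, as in formula (5)."""
    P_per_domain = {
        f: overall_domain_mapping(target_words, d["particle_seqs"],
                                  d["particle_matrix"])
        for f, d in domains.items()
    }
    P0 = fused_mapping(P_per_domain)
    loss = phi(label_idx, P0, E_w) + sum(
        phi(label_idx, Pf, E_w) for Pf in P_per_domain.values())
    # the matrices would then be adjusted with sgd_step() per term
    return loss
```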
For ease of understanding, please refer to fig. 6, and fig. 6 is a schematic flowchart of a data processing method according to an embodiment of the present application. As shown in fig. 6, the process may include:
Step S201, obtaining an input word and at least two words to be sorted.
Step S202, inputting the input word and the at least two words to be sorted into a language processing model; the language processing model includes a target speech particle vector matrix and a target word segmentation vector matrix, which are generated by the data processing method provided in the embodiment corresponding to fig. 3.
Step S203, determining the semantic similarities between the at least two words to be sorted and the input word through the target word segmentation vector matrix and the target speech particle vector matrix in the language processing model.
In the application, the specific method for determining the semantic similarities between the at least two words to be sorted and the input word through the target word segmentation vector matrix and the target speech particle vector matrix can be as follows: through the target word segmentation vector matrix, first semantic similarities between the at least two words to be sorted and the input word can be determined; through the target speech particle vector matrix, second semantic similarities between them can be determined; and the first semantic similarity and the second semantic similarity of each word to be sorted are fused to obtain the semantic similarity between that word and the input word.
Optionally, it may be understood that the semantic similarity may also be determined through the target word segmentation vector matrix alone, i.e., the first semantic similarity may be directly taken as the semantic similarity between each word to be sorted and the input word.
Optionally, it may be understood that the semantic similarity may also be determined through the target speech particle vector matrix alone, i.e., the second semantic similarity may be directly taken as the semantic similarity between each word to be sorted and the input word.
Step S204, sorting the at least two words to be sorted according to the semantic similarities to obtain a word sequence, and outputting the word sequence.
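Steps S201-S204 can be sketched as a cosine-similarity ranking; averaging the two similarities is one possible fusion (the patent does not fix the fusion operator), and all vector lookups are assumed to have been done beforehand:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def rank_candidates(input_word_vec, input_particle_vec, candidates):
    """candidates: list of (word, word_matrix_vec, particle_matrix_vec).
    Returns the words sorted by fused semantic similarity, highest first."""
    scored = []
    for word, wv, pv in candidates:
        s1 = cosine(input_word_vec, wv)       # first semantic similarity
        s2 = cosine(input_particle_vec, pv)   # second semantic similarity
        scored.append((word, (s1 + s2) / 2))  # fusion by simple averaging
    return sorted(scored, key=lambda item: item[1], reverse=True)
```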
It should be understood that the data processing method provided in the embodiment corresponding to fig. 6 is an application of the target word segmentation vector matrix and the target speech particle vector matrix obtained in the embodiment corresponding to fig. 3; these matrices may also be applied in other language processing scenarios.
The following illustrates some language processing scenarios. The word segmentation vector matrix can be applied to a word similarity matching task: for example, given the word pair "king-queen" and the word to be matched "male", the word corresponding to "male" can be inferred to be "female" through the given word pair and the word segmentation vector matrix. The speech particle vector matrix or the word segmentation vector matrix can also be applied to sentence-level or text-level language processing tasks. For example, in a named entity recognition task, the trained matrix can be used as the embedding layer initializing a language processing model, introducing the language knowledge in the corpus; a hidden representation can then be generated through a long short-term memory (LSTM) layer or a convolutional network (CNN) layer in the model and fed into a conditional random field for classification, so that the entities contained in the input sentence are obtained. In a text classification task, the trained matrix can likewise be used as the embedding layer initializing the model; a hidden representation is generated through an LSTM or CNN layer, a representation vector of the input sentence is then obtained by averaging or a similar method, and classification is performed by Softmax to predict the label of the text. The trained speech particle vector matrix or word segmentation vector matrix can also be applied to a medical consultation robot: through them, the semantics of the text input by the consultant can be recognized better, so that the robot answers questions with higher pertinence and accuracy.
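The word-pair analogy from the first scenario is typically computed with vector arithmetic over the trained word segmentation vector matrix; a sketch (`cosine` as defined in the previous snippet, `word_vecs` a hypothetical dict of trained word vectors):

```python
def analogy(a, b, x, word_vecs):
    """Given the pair a-b (e.g. 'king'-'queen') and the word x to match
    (e.g. 'male'), return the word whose vector is closest to
    vec(b) - vec(a) + vec(x) (expected here: 'female')."""
    query = word_vecs[b] - word_vecs[a] + word_vecs[x]
    return max((w for w in word_vecs if w not in {a, b, x}),
               key=lambda w: cosine(query, word_vecs[w]))
```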
It is understood that the speech particle vector matrices and the word segmentation vector matrix can be trained and optimized using specialized text as the sample text; the specialized text may be sample text from a public field (e.g., the legal field or the medical field) or customized text meeting special requirements. In addition, the language domains used in training the matrices can also be selected as required. For example, texts in the medical field can be used as the sample text; after the word segmentation vector matrix or the speech particle vector matrices are trained with a large amount of medical text, they can be used in Chinese medical named entity recognition tasks or medical text classification tasks (for example, department text classification and disease prediction), and so on.
In the embodiment of the application, the language domain mapping vector of each target participle (surrounding word) under each language domain can be determined from the token particle sequences and speech particle vector matrices of the target participles under each language domain; by fusing these language domain mapping vectors, the fused language domain mapping vector of the target participles under the multiple language domains can be obtained; and according to the language domain mapping vectors of the target participles under each language domain, the fused language domain mapping vector, and the participle vector representation feature of the tag participle (center word), the word segmentation vector matrix corresponding to the center word and the speech particle vector matrices corresponding to the language domains can be trained and adjusted, so that a high-quality target word segmentation vector matrix can be obtained. It should be understood that the training of the word segmentation vector matrix introduces the semantic information of the language domains of each word, and each language domain can be used to represent one kind of semantic information of a word (for example, the component language domain can represent the morphological information of a word, the pinyin language domain its phoneme information, and the part-of-speech language domain its grammatical information). Compared with the semantic information of the word itself, the semantic information represented by the language domains is obviously deeper and more specific, and by fusing the semantic information of the multiple language domains of each word, the trained word vector matrix can contain not only the semantic relations within a single language domain but also the semantic relations among multiple language domains, so the word segmentation vector matrix obtained by training is more accurate and of higher quality. Meanwhile, high-quality speech particle vector matrices can also be obtained through training: each participle vector in the word segmentation vector matrix and each speech particle vector in the speech particle vector matrices can accurately describe the semantic information of a word, and when these high-quality matrices are applied to a language processing task, the result of the task is more accurate. Therefore, the method and the device can improve the quality of the semantic representation vectors of words, and thereby the accuracy of language processing tasks.
Further, please refer to fig. 7, which is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus may be a computer program (comprising program code) running on a computer device, for example application software, and may be adapted to perform the method illustrated in fig. 3. As shown in fig. 7, the data processing apparatus 1 may include: a sequence acquisition module 11, a speech particle matrix acquisition module 12, a vector determination module 13, a vector fusion module 14, a word segmentation matrix acquisition module 15, a word segmentation feature acquisition module 16, and a matrix adjustment module 17.
A sequence obtaining module 11, configured to obtain the token particle sequences respectively corresponding to the target participle under at least two language domains;
a speech particle matrix obtaining module 12, configured to obtain speech particle vector matrices corresponding to at least two speech domains respectively; each speech particle vector matrix is associated with a sample text;
the vector determination module 13 is configured to determine, according to the token speech particle sequence and the speech particle vector matrix respectively corresponding to each language domain, language domain mapping vectors respectively corresponding to the target word segmentation in each language domain;
the vector fusion module 14 is configured to fuse language domain mapping vectors corresponding to the target participles respectively in each language domain to generate a fusion language domain mapping vector of the target participles;
a word segmentation matrix obtaining module 15, configured to obtain a word segmentation vector matrix associated with the sample text; the sample text comprises a sentence text formed by target participles and label participles;
a word segmentation feature obtaining module 16, configured to obtain a word segmentation vector representation feature corresponding to the label word segmentation in the word segmentation vector matrix;
and the matrix adjusting module 17 is configured to adjust the speech particle vector matrices and the word segmentation vector matrix according to the language domain mapping vectors of the target participles under each language domain, the fused language domain mapping vector, and the participle vector representation feature corresponding to the tag participle, so as to obtain a target speech particle vector matrix and a target word segmentation vector matrix for performing a language processing task.
For a specific implementation manner of the sequence obtaining module 11, the speech grain matrix obtaining module 12, the vector determining module 13, the vector fusion module 14, the word segmentation matrix obtaining module 15, the word segmentation feature obtaining module 16, and the matrix adjusting module 17, reference may be made to the description in step S101 in the embodiment corresponding to fig. 3, which will not be described herein again.
Wherein the at least two language domains include a language domain K_i; i is a positive integer less than or equal to the number of the at least two language domains;
referring to fig. 7, the sequence acquiring module 11 may include: an initial sequence acquisition unit 111, a speech grain combination unit 112, a speech grain filtering unit 113, and a sequence determination unit 114.
An initial sequence obtaining unit 111, configured to obtain the initial token particles corresponding to the target participle in the language domain K_i;
a token particle combination unit 112, configured to combine the initial token particles to obtain extended token particles;
a speech particle filtering unit 113, configured to filter the extended token particles to obtain filtered token particles, and determine the token particles formed by the initial token particles and the filtered token particles as the target token particles;
a sequence determination unit 114, configured to determine the sequence composed of the target token particles as the token particle sequence corresponding to the target participle in the language domain K_i.
For specific implementation manners of the initial sequence obtaining unit 111, the speech grain combining unit 112, the speech grain filtering unit 113, and the sequence determining unit 114, reference may be made to the description in step S101 in the embodiment corresponding to fig. 3, and details will not be described here.
Wherein the at least two language domains include a language domain K_i; the token particle sequences respectively corresponding to the at least two language domains include the sequence M_i corresponding to K_i; the speech particle vector matrices respectively corresponding to the at least two language domains include the matrix T_i corresponding to K_i; the matrix T_i includes the speech particle vector representation features corresponding to the sample token particles in the language domain K_i; the sample token particles are associated with the sample text, and include the target token particles in the sequence M_i; i is a positive integer less than or equal to the number of the at least two language domains;
referring to fig. 7, the vector determination module 13 may include: a feature acquisition unit 131, a number acquisition unit 132, and a vector determination unit 133.
A feature obtaining unit 131, configured to obtain the speech particle vector representation features in the matrix T_i;
the feature obtaining unit 131 is further configured to obtain, among the speech particle vector representation features in the matrix T_i, the vector representation features corresponding to the target token particles in the sequence M_i;
a quantity obtaining unit 132, configured to obtain the number of token particles in the sequence M_i;
a vector determination unit 133, configured to determine the language domain mapping vector of the target participle under the language domain K_i according to the vector representation features of the target token particles in the sequence M_i and the number of token particles.
For specific implementation of the feature obtaining unit 131, the number obtaining unit 132, and the vector determining unit 133, reference may be made to the description in step S102 in the embodiment corresponding to fig. 3, and details will not be repeated here.
Wherein the target token particles in the sequence M_i include a target token particle S_t and a target token particle S_w; t and w are positive integers less than or equal to the number of token particles in the sequence M_i;
referring to fig. 7, the vector determination unit 133 may include: an arithmetic processing sub-unit 1331 and a vector determination sub-unit 1332.
An operation processing subunit 1331, configured to add the speech particle vector representation feature corresponding to the target token particle S_t and the speech particle vector representation feature corresponding to the target token particle S_w to obtain a first operation vector representation feature;
the operation processing subunit 1331 is further configured to perform mean processing on the first operation vector representation feature and the number of the characterization grains to obtain a mean vector representation feature;
a vector determination subunit 1332, configured to determine the language domain mapping vector of the target participle under the language domain K_i according to the mean vector representation feature.
For a specific implementation of the operation processing subunit 1331 and the vector determination subunit 1332, reference may be made to the description in step S102 in the embodiment corresponding to fig. 3, which will not be described herein again.
Wherein the target participles include a target participle C_a and a target participle C_b; a and b are both positive integers less than or equal to the number of participles in the sample text;
a vector determining subunit 1333, further configured to obtain the target participle CaCorresponding mean vector representation features, and target participle CbThe corresponding mean vector represents a feature;
vector determination subunit 1333, further for segmenting the target word CaCorresponding mean vector representation feature and target participle CbAdding the corresponding mean vector representation features to obtain second operation vector representation features;
the vector determining subunit 1333 is further configured to obtain the number of segmented words in the sample text, and perform mean processing on the second operation vector representation features and the number of segmented words to obtain the target segmented word CaWord segmentation with target CbIn the language domain KiThe next corresponding language domain mapping vector.
Wherein the at least two language domains further include a language domain K_j; j is a positive integer less than or equal to the number of the at least two language domains;
the vector determination subunit is further configured to obtain the language domain mapping vector of the target participles C_a and C_b under the language domain K_i, and the language domain mapping vector of C_a and C_b under the language domain K_j;

the vector determination subunit is further configured to add the language domain mapping vector of C_a and C_b under K_i and the language domain mapping vector of C_a and C_b under K_j to obtain an operation language domain mapping vector;

the vector determination subunit is further configured to obtain the number of the at least two language domains, and average the operation language domain mapping vector over the number of the at least two language domains to obtain the fused language domain mapping vector.
Referring to fig. 7, the matrix adjustment module 17 may include: a distance determination unit 171, a distance calculation unit 172, and a matrix adjustment unit 173.
A distance determining unit 171, configured to determine the first vector distance between the language domain mapping vectors of the target participles under each language domain and the participle vector representation feature, and the second vector distance between the fused language domain mapping vector and the participle vector representation feature;
a distance operation unit 172, configured to add the first vector distance and the second vector distance to obtain a loss function value;
the matrix adjusting unit 173 is configured to adjust the speech particle vector matrix and the segmentation word vector matrix according to the loss function value, so as to obtain a target speech particle vector matrix and a target segmentation word vector matrix.
For a specific implementation manner of the distance determining unit 171, the distance calculating unit 172, and the matrix adjusting unit 173, reference may be made to the description in step S105 in the embodiment corresponding to fig. 3, and details will not be repeated here.
Referring to fig. 7, the distance determining unit 171 may include: a first multiplication subunit 1711 and a second multiplication subunit 1712.
The first multiplication subunit 1711 is configured to obtain first transposed mapping vectors respectively corresponding to the target participle under each language domain, multiply the first transposed mapping vectors by the participle vector representation feature, and determine the first vector distance according to the result obtained by the multiplication; the first transposed mapping vector is the transposed vector of the language domain mapping vector corresponding to the target participle under each language domain;
the second multiplication subunit 1712 is configured to obtain a second transposed mapping vector corresponding to the fused language domain mapping vector, multiply the second transposed mapping vector by the participle vector representation feature, and determine the second vector distance according to the result obtained by the multiplication.
For a specific implementation manner of the first multiplication subunit 1711 and the second multiplication subunit 1712, reference may be made to the description in step S105 in the embodiment corresponding to fig. 3, which will not be described herein again.
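The transposed-vector multiplications described by the two multiplication subunits amount to inner products, which can be sketched as below. This is a minimal sketch: the embodiment only states that each vector distance is determined according to the result of the multiplication, so the raw inner product is used here as a stand-in for that distance, and all names are illustrative.

import numpy as np

def loss_function_value(domain_vectors, fused_vector, label_vec):
    # First vector distance: for each language domain, multiply the transposed
    # language domain mapping vector by the label participle's vector
    # representation feature (an inner product), and accumulate over domains.
    first_distance = sum(float(v.T @ label_vec) for v in domain_vectors)
    # Second vector distance: inner product of the transposed fused language
    # domain mapping vector with the same participle vector representation
    # feature.
    second_distance = float(fused_vector.T @ label_vec)
    # The loss function value is the sum of the two vector distances.
    return first_distance + second_distance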
Referring to fig. 7, the matrix adjustment unit 173 may include: a feature adjustment subunit 1731 and a matrix determination subunit 1732.
The feature adjustment subunit 1731 is configured to compare the loss function value against a distance threshold and, if the loss function value is greater than the distance threshold, respectively adjust the speech particle vector representation features corresponding to the target characterization speech particles in the speech particle vector matrix and the participle vector representation feature corresponding to the label participle in the participle vector matrix;
the matrix determining subunit 1732 is configured to determine the speech particle vector matrix as the target speech particle vector matrix and the participle vector matrix as the target participle vector matrix if the loss function value is less than or equal to the distance threshold.
The specific implementation manners of the characteristic adjustment subunit 1731 and the matrix determination subunit 1732 may refer to the description in step S105 in the embodiment corresponding to fig. 3, and will not be described herein again.
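The threshold matching performed by these two subunits corresponds to a simple training loop, sketched below. Here loss_fn and adjust_fn are assumed callables standing in for the loss computation above and for the adjustment of the speech particle and participle vector representation features (e.g., a gradient step); max_steps is a safeguard added for the example and is not part of the embodiment.

def adjust_until_threshold(loss_fn, adjust_fn, distance_threshold, max_steps=100000):
    # While the loss function value exceeds the distance threshold, keep
    # adjusting the speech particle vector matrix and the participle vector
    # matrix; once the loss is less than or equal to the threshold, the
    # current matrices are taken as the target speech particle vector matrix
    # and the target participle vector matrix.
    for _ in range(max_steps):
        if loss_fn() <= distance_threshold:
            break
        adjust_fn()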
In the embodiment of the application, the language domain mapping vector of each target participle (surrounding word) under each language domain can be determined from the characterization speech particle sequence and the speech particle vector matrix of that target participle in each language domain, and the fused language domain mapping vector of the target participle across the plurality of language domains can be obtained by fusing those language domain mapping vectors; according to the language domain mapping vector of the target participle under each language domain, the fused language domain mapping vector, and the participle vector representation feature of the label participle (central word), the participle vector matrix corresponding to the central word and the speech particle vector matrix corresponding to each language domain can be trained and adjusted, so that a high-quality target participle vector matrix is obtained. It should be understood that the training of the participle vector matrix introduces the language domain semantic information of each word, and each language domain can represent one kind of semantic information of a word (for example, a component language domain can represent the morphological information of a word, a pinyin language domain can represent the phoneme information of a word, and a part-of-speech language domain can represent the grammatical information of a word). Compared with the semantic information of the word itself, the semantic information represented by the language domains is clearly deeper and more specific; by fusing the semantic information of multiple language domains of each word, the trained participle vector matrix can cover not only the semantic relations within a single language domain but also the semantic relations among multiple language domains, so that the participle vector matrix obtained by training is more accurate and of higher quality. Meanwhile, a high-quality speech particle vector matrix can also be obtained through training; each speech particle vector in the speech particle vector matrix and each participle vector in the participle vector matrix can accurately describe the semantic information of a word, and when the high-quality speech particle vector matrix and participle vector matrix are applied to a language processing task, the result of that task becomes more accurate. Therefore, the method and the device can improve the quality of the semantic representation vectors of words and thus the accuracy of language processing tasks.
Further, please refer to fig. 8, where fig. 8 is a schematic structural diagram of another data processing apparatus according to an embodiment of the present application. The data processing apparatus may be adapted to perform the method illustrated in fig. 6. As shown in fig. 8, the data processing apparatus 2 may include: a word acquisition module 21, a word input module 22, a similarity determination module 23, and a word ranking module 24.
The word acquisition module 21 is configured to acquire an input word and at least two words to be sorted;
a word input module 22, configured to input the input word and the at least two words to be sorted into the language processing model; the language processing model includes a target speech particle vector matrix and a target word segmentation vector matrix; the target speech particle vector matrix and the target word segmentation vector matrix are generated by the data processing method provided in the embodiment corresponding to fig. 3;
the similarity determining module 23 is configured to determine semantic similarities between at least two words to be sorted and the input words respectively according to a target word segmentation vector matrix and a target speech particle vector matrix in the language processing model;
and the word sorting module 24 is configured to sort at least two words to be sorted according to the semantic similarity to obtain a word sequence, and output the word sequence.
For specific implementations of the word acquisition module 21, the word input module 22, the similarity determining module 23 and the word sorting module 24, reference may be made to the description of step S201 to step S204 in the embodiment corresponding to fig. 6, and details will not be repeated here.
Wherein the similarity determining module 23 includes: a similarity determining unit 231, a similarity determining unit 232, and a similarity fusion unit 233.
The similarity determining unit 231 is configured to determine, through the target word segmentation vector matrix, a first semantic similarity between each of the at least two words to be sorted and the input word;
the similarity determining unit 232 is configured to determine, through the target speech particle vector matrix, a second semantic similarity between each of the at least two words to be sorted and the input word;
and the similarity fusion unit 233 is configured to fuse the first semantic similarity and the second semantic similarity between each word to be sorted and the input word to obtain the semantic similarity between each word to be sorted and the input word.
For specific implementation manners of the similarity determining unit 231, the similarity determining unit 232, and the similarity fusing unit 233, reference may be made to the description in step S204 in the embodiment corresponding to fig. 6, and details will not be repeated here.
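To illustrate how the two similarities might be fused and used for sorting, consider the sketch below. Cosine similarity and simple averaging are assumptions made for this example (the embodiment only requires that a first and a second semantic similarity be determined and fused); all names are illustrative.

import numpy as np

def rank_candidates(input_word_vec, input_grain_vec, candidates):
    # candidates maps each word to be sorted to a pair of vectors looked up in
    # the target participle vector matrix and the target speech particle
    # vector matrix respectively.
    def cosine(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    scored = []
    for word, (word_vec, grain_vec) in candidates.items():
        s1 = cosine(input_word_vec, word_vec)    # first semantic similarity
        s2 = cosine(input_grain_vec, grain_vec)  # second semantic similarity
        scored.append((word, (s1 + s2) / 2.0))   # fused semantic similarity
    # Sort the words to be sorted by fused similarity to obtain the word
    # sequence that is output.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored]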
The beneficial effects of this embodiment are the same as those described above for the embodiment corresponding to fig. 7 and will not be repeated here.
Further, please refer to fig. 9, which is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 9, the apparatus 1 in the embodiment corresponding to fig. 7 or the apparatus 2 in the embodiment corresponding to fig. 8 may be applied to a computer device 1000. The computer device 1000 may include a processor 1001, a network interface 1004 and a memory 1005, and further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display (Display) and a keyboard (Keyboard), and optionally may further include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or a non-volatile memory, such as at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the processor 1001. As shown in fig. 9, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 9, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
obtaining the characteristic speech grain sequences respectively corresponding to at least two language domains of the target word segmentation, and obtaining speech grain vector matrixes respectively corresponding to the at least two language domains; each speech particle vector matrix is associated with a sample text;
determining language domain mapping vectors corresponding to the target participles under each language domain according to the characterization speech particle sequence and the speech particle vector matrix corresponding to each language domain;
fusing language domain mapping vectors respectively corresponding to the target participles under each language domain to generate fused language domain mapping vectors of the target participles;
acquiring a word segmentation vector matrix associated with a sample text; the sample text comprises a sentence text formed by target participles and label participles;
and acquiring the word segmentation vector representation feature corresponding to the label participle from the word segmentation vector matrix, and adjusting the speech particle vector matrix and the word segmentation vector matrix according to the language domain mapping vectors corresponding to the target participle under each language domain, the fused language domain mapping vector and the word segmentation vector representation feature corresponding to the label participle, so as to obtain a target speech particle vector matrix and a target word segmentation vector matrix for performing a language processing task.
or to implement:
acquiring input words and at least two words to be sorted;
inputting the input words and the at least two words to be sorted into a language processing model; the language processing model includes a target speech particle vector matrix and a target word segmentation vector matrix; the target speech particle vector matrix and the target word segmentation vector matrix are generated by the data processing method provided in the first aspect of the embodiments of the present application;
determining semantic similarity between at least two words to be sorted and input words respectively through a target word segmentation vector matrix and a target speech particle vector matrix in a language processing model;
and sequencing at least two terms to be sequenced according to the semantic similarity to obtain a term sequence, and outputting the term sequence.
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the description of the data processing method in the embodiment corresponding to fig. 3 or fig. 6, and may also perform the description of the data processing apparatus 1 in the embodiment corresponding to fig. 7 or the description of the data processing apparatus 2 in the embodiment corresponding to fig. 8, which is not repeated herein. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present application further provides a computer-readable storage medium, where a computer program executed by the aforementioned data processing computer device 1000 is stored in the computer-readable storage medium, and the computer program includes program instructions, and when the processor executes the program instructions, the description of the data processing method in the embodiment corresponding to fig. 3 or fig. 6 can be executed, so that details are not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of embodiments of the method of the present application.
The computer-readable storage medium may be the data processing apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
In one aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by one aspect of the embodiments of the present application.
The terms "first," "second," and the like in the description and in the claims and drawings of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or apparatus that comprises a list of steps or elements is not limited to the listed steps or modules, but may alternatively include other steps or modules not listed or inherent to such process, method, apparatus, product, or apparatus.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described generally in terms of functionality in the foregoing description. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure is only a preferred embodiment of the present application and is of course not intended to limit the scope of the claims of the present application; therefore, equivalent variations made in accordance with the claims of the present application still fall within the scope of the present application.

Claims (14)

1. A data processing method, comprising:
obtaining characterization speech particle sequences respectively corresponding to at least two language domains of a target participle, and obtaining speech particle vector matrices respectively corresponding to the at least two language domains; each speech particle vector matrix is associated with a sample text; each language domain of the at least two language domains is used for describing a characteristic of the target participle, and the characterization speech particle sequence corresponding to one language domain consists of the characterization speech particles of the target participle in that language domain; the speech particle vector matrix corresponding to one language domain is a matrix formed by the speech particle vector representation features corresponding to sample words; the speech particle vector representation features corresponding to a sample word refer to the vector representation features corresponding to the characterization speech particles of the sample word in the language domain; the sample text includes the sample word;
determining language domain mapping vectors respectively corresponding to the target participles under each language domain according to the characterization speech particle sequence and the speech particle vector matrix respectively corresponding to each language domain; the language domain mapping vector corresponding to one language domain refers to the mean vector representation characteristic of the target participle in the language domain;
fusing language domain mapping vectors respectively corresponding to the target participles under each language domain to generate fused language domain mapping vectors of the target participles;
acquiring a word segmentation vector matrix associated with the sample text; the sample text comprises sentence text formed by the target participles and the label participles; the label participles refer to central words in the sentence text, and the target participles refer to surrounding words which have an association relation with the label participles;
and acquiring the word segmentation vector representation feature corresponding to the label participle from the word segmentation vector matrix, and adjusting the speech particle vector matrices and the word segmentation vector matrix according to the language domain mapping vectors corresponding to the target participle under each language domain, the fused language domain mapping vector and the word segmentation vector representation feature corresponding to the label participle, to obtain a target speech particle vector matrix and a target word segmentation vector matrix for performing a language processing task.
2. The method of claim 1, wherein the at least two language domains include a language domain K_i; i is a positive integer less than or equal to the number of the at least two language domains;
the obtaining of the characterization speech particle sequences respectively corresponding to the at least two language domains of the target participle includes:
acquiring the initial characterization speech particles corresponding to the target participle in the language domain K_i;
combining the initial characterization speech particles to obtain extended characterization speech particles;
filtering the extended characterization speech particles to obtain filtered characterization speech particles, and determining the speech particles formed by the initial characterization speech particles and the filtered characterization speech particles as target characterization speech particles;
determining the sequence formed by the target characterization speech particles as the characterization speech particle sequence corresponding to the target participle in the language domain K_i.
3. The method of claim 1, wherein the at least two language domains include a language domain K_i; the characterization speech particle sequences respectively corresponding to the at least two language domains include a characterization speech particle sequence M_i corresponding to the language domain K_i; the speech particle vector matrices respectively corresponding to the at least two language domains include a speech particle vector matrix T_i corresponding to the language domain K_i; the speech particle vector matrix T_i includes the speech particle vector representation features corresponding to the sample characterization speech particles in the language domain K_i; the sample characterization speech particles are associated with the sample text, and the sample characterization speech particles include the target characterization speech particles in the characterization speech particle sequence M_i; i is a positive integer less than or equal to the number of the at least two language domains;
the determining, according to the characterization speech particle sequence and the speech particle vector matrix respectively corresponding to each language domain, the language domain mapping vectors respectively corresponding to the target participle under each language domain includes:
obtaining the speech particle vector representation features in the speech particle vector matrix T_i;
among the speech particle vector representation features in the speech particle vector matrix T_i, obtaining the speech particle vector representation features corresponding to the target characterization speech particles in the characterization speech particle sequence M_i;
obtaining the number of target characterization speech particles in the characterization speech particle sequence M_i;
determining, according to the speech particle vector representation features corresponding to the target characterization speech particles in the characterization speech particle sequence M_i and the number of characterization speech particles, the language domain mapping vector corresponding to the target participle under the language domain K_i.
4. The method of claim 3, wherein the target characterization speech particles in the characterization speech particle sequence M_i include a target characterization speech particle S_t and a target characterization speech particle S_w; t and w are positive integers less than or equal to the number of characterization speech particles in the characterization speech particle sequence M_i;
the determining, according to the speech particle vector representation features corresponding to the target characterization speech particles in the characterization speech particle sequence M_i and the number of characterization speech particles, the language domain mapping vector corresponding to the target participle under the language domain K_i includes:
adding the speech particle vector representation feature corresponding to the target characterization speech particle S_t and the speech particle vector representation feature corresponding to the target characterization speech particle S_w to obtain a first operation vector representation feature;
performing mean processing on the first operation vector representation feature with the number of characterization speech particles to obtain a mean vector representation feature;
determining, according to the mean vector representation feature, the language domain mapping vector corresponding to the target participle under the language domain K_i.
5. The method of claim 4, wherein the target participles include a target participle C_a and a target participle C_b; a and b are both positive integers less than or equal to the number of participles in the sample text;
the determining, according to the mean vector representation feature, the language domain mapping vector corresponding to the target participle under the language domain K_i includes:
obtaining the mean vector representation feature corresponding to the target participle C_a and the mean vector representation feature corresponding to the target participle C_b;
adding the mean vector representation feature corresponding to the target participle C_a and the mean vector representation feature corresponding to the target participle C_b to obtain a second operation vector representation feature;
acquiring the number of participles in the sample text, and performing mean processing on the second operation vector representation feature with the participle count to obtain the language domain mapping vector corresponding to the target participles C_a and C_b under the language domain K_i.
6. The method of claim 5, wherein the at least two language domains further include a language domain K_j; j is a positive integer less than or equal to the number of the at least two language domains;
the fusing the language domain mapping vectors respectively corresponding to the target participle under each language domain to generate the fused language domain mapping vector of the target participle includes:
obtaining the language domain mapping vector corresponding to the target participles C_a and C_b under the language domain K_i, and the language domain mapping vector corresponding to the target participles C_a and C_b under the language domain K_j;
adding the language domain mapping vector corresponding to the target participles C_a and C_b under the language domain K_i and the language domain mapping vector corresponding to the target participles C_a and C_b under the language domain K_j to obtain an operation language domain mapping vector;
acquiring the number of the at least two language domains, and performing mean processing on the operation language domain mapping vector with that number to obtain the fused language domain mapping vector.
7. The method according to claim 1, wherein the adjusting the speech particle vector matrix and the word segmentation vector matrix according to the language domain mapping vectors corresponding to the target participle under each language domain, the fused language domain mapping vector and the participle vector representation feature, to obtain a target speech particle vector matrix and a target word segmentation vector matrix for performing a language processing task comprises:
determining a first vector distance between the language domain mapping vector corresponding to the target participle under each language domain and the participle vector representation feature, and a second vector distance between the fused language domain mapping vector and the participle vector representation feature;
adding the first vector distance and the second vector distance to obtain a loss function value;
and respectively adjusting the speech particle vector matrix and the word segmentation vector matrix according to the loss function value to obtain the target speech particle vector matrix and the target word segmentation vector matrix.
8. The method of claim 7, wherein the determining a first vector distance between the language domain mapping vector corresponding to the target participle under each language domain and the participle vector representation feature, and a second vector distance between the fused language domain mapping vector and the participle vector representation feature, comprises:
acquiring first transposed mapping vectors respectively corresponding to the target participle under each language domain, multiplying the first transposed mapping vectors by the participle vector representation feature, and determining the first vector distance according to the result obtained by the multiplication; the first transposed mapping vector is the transposed vector of the language domain mapping vector corresponding to the target participle under each language domain;
and acquiring a second transposed mapping vector corresponding to the fused language domain mapping vector, multiplying the second transposed mapping vector by the participle vector representation feature, and determining the second vector distance according to the result obtained by the multiplication.
9. The method of claim 7, wherein the respectively adjusting the speech particle vector matrix and the word segmentation vector matrix according to the loss function value to obtain the target speech particle vector matrix and the target word segmentation vector matrix comprises:
matching the loss function value with a distance threshold, and if the loss function value is greater than the distance threshold, respectively adjusting speech particle vector representation characteristics corresponding to target representation speech particles in the speech particle vector matrix and segmentation vector representation characteristics corresponding to the label segmentation in the segmentation vector matrix;
and if the loss function value is smaller than or equal to the distance threshold, determining the speech particle vector matrix as the target speech particle vector matrix, and determining the participle vector matrix as the target participle vector matrix.
10. A data processing method, comprising:
acquiring input words and at least two words to be sorted;
inputting the input words and the at least two words to be ordered to a language processing model; the language processing model comprises a target speech particle vector matrix and a target word segmentation vector matrix; the target speech particle vector matrix and the target word segmentation vector matrix are generated by adopting the data processing method of any one of claims 1 to 9;
determining, through the target word segmentation vector matrix and the target speech particle vector matrix in the language processing model, semantic similarities between the at least two words to be sorted and the input word respectively;
and sequencing the at least two terms to be sequenced according to the semantic similarity to obtain a term sequence, and outputting the term sequence.
11. The method according to claim 10, wherein the determining, through the target word segmentation vector matrix and the target speech particle vector matrix in the language processing model, semantic similarities between the at least two words to be sorted and the input word respectively comprises:
determining a first semantic similarity between the at least two words to be sorted and the input words respectively through the target word segmentation vector matrix;
determining second semantic similarity between the at least two words to be sorted and the input words respectively through the target speech particle vector matrix;
and fusing the first semantic similarity and the second semantic similarity between each word to be sorted and the input word in the at least two words to be sorted to obtain the semantic similarity between each word to be sorted and the input word.
12. A data processing apparatus, comprising:
the speech particle matrix acquisition module is used for acquiring characterization speech particle sequences respectively corresponding to at least two language domains of a target participle, and acquiring speech particle vector matrices respectively corresponding to the at least two language domains; each speech particle vector matrix is associated with a sample text; each language domain of the at least two language domains is used for describing a characteristic of the target participle, and the characterization speech particle sequence corresponding to one language domain consists of the characterization speech particles of the target participle in that language domain; the speech particle vector matrix corresponding to one language domain is a matrix formed by the speech particle vector representation features corresponding to sample words; the speech particle vector representation features corresponding to a sample word refer to the vector representation features corresponding to the characterization speech particles of the sample word in the language domain; the sample text includes the sample word;
the vector determination module is used for determining language domain mapping vectors respectively corresponding to the target participles under each language domain according to the characterization speech particle sequence and the speech particle vector matrix respectively corresponding to each language domain; the language domain mapping vector corresponding to one language domain refers to the mean vector representation characteristic of the target participle in the language domain;
the vector fusion module is used for fusing language domain mapping vectors respectively corresponding to the target participles under each language domain to generate fused language domain mapping vectors of the target participles;
the word segmentation matrix acquisition module is used for acquiring a word segmentation vector matrix associated with the sample text; the sample text comprises sentence text formed by the target participles and the label participles; the label participles refer to central words in the sentence text, and the target participles refer to surrounding words which have an association relation with the label participles;
and the matrix adjusting module is used for acquiring word segmentation vector representation characteristics corresponding to the label word segmentation in the word segmentation vector matrix, and adjusting the speech particle vector matrix and the word segmentation vector matrix according to the language domain mapping vector, the fusion language domain mapping vector and the word segmentation vector representation characteristics corresponding to the label word segmentation of the target word under each language domain to obtain a target speech particle vector matrix and a target word segmentation vector matrix for performing a language processing task.
13. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is configured to provide a network communication function, the memory is configured to store program code, and the processor is configured to call the program code to perform the method of any one of claims 1-11.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method of any of claims 1-11.
CN202011040445.4A 2020-09-28 2020-09-28 Data processing method, device and equipment and readable storage medium Active CN112115717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011040445.4A CN112115717B (en) 2020-09-28 2020-09-28 Data processing method, device and equipment and readable storage medium


Publications (2)

Publication Number Publication Date
CN112115717A (en) 2020-12-22
CN112115717B (en) 2022-03-15

Family

ID=73797156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011040445.4A Active CN112115717B (en) 2020-09-28 2020-09-28 Data processing method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112115717B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729312A (en) * 2017-09-05 2018-02-23 苏州大学 More granularity segmenting methods and system based on sequence labelling modeling
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations
CN109471946A (en) * 2018-11-16 2019-03-15 中国科学技术大学 A kind of classification method and system of Chinese text
CN111144142A (en) * 2019-12-30 2020-05-12 昆明理工大学 Hanyue neural machine translation method based on depth separable convolution


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Joint Fine-Grained Components Continuously Enhance Chinese Word Embeddings; Chengyang Zhuang et al.; IEEE Access; 2019-12-16; vol. 7; pp. 174699-174708 *
Research on Granular Computing Based on Non-standard Analysis (基于非标准分析的粒计算研究); Liu Qing et al.; Chinese Journal of Computers (计算机学报); 2015-08-31; vol. 38, no. 8; pp. 1618-1627 *

Also Published As

Publication number Publication date
CN112115717A (en) 2020-12-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40035728)
GR01 Patent grant