CN113515586A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN113515586A
Authority
CN
China
Prior art keywords
text
sequence
recognized
pronunciation
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110413748.4A
Other languages
Chinese (zh)
Inventor
陈谦
王雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Innovation Co
Original Assignee
Alibaba Singapore Holdings Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Singapore Holdings Pte Ltd
Priority to CN202110413748.4A
Publication of CN113515586A
Legal status: Pending

Classifications

    • G06F16/31 — Information retrieval of unstructured textual data; Indexing; Data structures therefor; Storage structures
    • G06F16/3329 — Querying; Query formulation; Natural language query formulation or dialogue systems
    • G06F16/3343 — Querying; Query processing; Query execution using phonetics
    • G06F16/3344 — Querying; Query processing; Query execution using natural language analysis
    • G06F16/35 — Information retrieval of unstructured textual data; Clustering; Classification
    • G06F40/30 — Handling natural language data; Semantic analysis
    • G06N3/045 — Neural networks; Architecture; Combinations of networks
    • G06N3/08 — Neural networks; Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application provide a data processing method and device. The method comprises the following steps: acquiring a text sequence to be recognized; determining the element feature corresponding to each text element according to the text feature and the pronunciation feature corresponding to each text element in the text sequence to be recognized, so as to obtain an element feature sequence corresponding to the text sequence to be recognized; and inputting the element feature sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized. By introducing pronunciation features into the language understanding model, the technical solution provided by the embodiments of the present application can effectively improve the robustness and the semantic recognition accuracy of the language understanding model.

Description

Data processing method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus.
Background
With the continuous development of Automatic Speech Recognition (ASR) technology and Natural Language Understanding (NLU) technology, man-machine conversation functions are embedded in more and more application products to improve user experience.
The task of ASR is to convert user speech into text; the task of NLU is to perform semantic understanding on the text output by ASR.
The existing natural language understanding model has the problems of low robustness and low recognition accuracy.
Disclosure of Invention
In view of the above, the present application is proposed to provide a data processing method and apparatus that solves the above problems, or at least partially solves the above problems.
Thus, in one embodiment of the present application, a data processing method is provided. The method comprises the following steps:
acquiring a text sequence to be recognized;
determining the element characteristic corresponding to each text element according to the text characteristic and the pronunciation characteristic corresponding to each text element in the text sequence to be recognized so as to obtain an element characteristic sequence corresponding to the text sequence to be recognized;
and inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized.
Optionally, the pronunciation unit includes a phoneme with tone.
Optionally, determining at least one pronunciation unit corresponding to the first text element includes:
and obtaining at least one pronunciation unit corresponding to the first text element by a table look-up method.
Optionally, determining at least one pronunciation unit corresponding to the first text element includes:
and inputting the first text element into a grapheme-to-phoneme conversion model to obtain the at least one pronunciation unit.
Optionally, the semantic recognition result includes a classification result and a semantic slot value prediction result;
the classification result comprises an intention prediction result and/or a domain prediction result.
Optionally, the language understanding model includes:
the input network is used for adding the element characteristics corresponding to the designated symbols at the starting positions of the element characteristic sequences to obtain processed element characteristic sequences;
the semantic fusion neural network is used for performing context semantic fusion on the processed element feature sequence to obtain a fusion feature sequence after the semantic fusion;
the classification network is used for determining the intent corresponding to the text sequence to be recognized according to the fusion feature at the position corresponding to the designated symbol in the fusion feature sequence; and is further used for performing sequence labeling on the text sequence to be recognized according to the fusion features other than the fusion feature at the position corresponding to the designated symbol, so as to obtain a semantic slot value corresponding to the text sequence to be recognized.
In yet another embodiment of the present application, a model training method is provided. The method comprises the following steps:
acquiring a sample text sequence and an expected semantic recognition result thereof;
determining sample element characteristics corresponding to each sample text element according to the text characteristics and pronunciation characteristics corresponding to each sample text element in the sample text sequence to obtain a sample element characteristic sequence corresponding to the sample text sequence;
inputting the sample element characteristic sequence into a language understanding model to be trained to obtain a sample semantic recognition result corresponding to the sample text sequence;
and optimizing the language understanding model according to the sample semantic recognition result and the expected semantic recognition result.
Optionally, the sample semantic recognition result includes a sample classification result and a sample semantic slot value prediction result;
wherein the sample classification result comprises a sample intent prediction result and/or a sample domain prediction result.
In yet another embodiment of the present application, a data processing method is provided. The method comprises the following steps:
receiving a voice to be recognized input by a user;
recognizing the voice to be recognized to obtain a text sequence to be recognized;
determining the element characteristic corresponding to each text element according to the text characteristic and the pronunciation characteristic corresponding to each text element in the text sequence to be recognized so as to obtain an element characteristic sequence corresponding to the text sequence to be recognized;
inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized;
and executing corresponding feedback operation according to the semantic recognition result.
In yet another embodiment of the present application, a data processing method is provided. The method comprises the following steps:
receiving voice sent by a user to the robot;
recognizing the voice to obtain a text sequence to be recognized;
determining the element characteristic corresponding to each text element according to the text characteristic and the pronunciation characteristic corresponding to each text element in the text sequence to be recognized so as to obtain an element characteristic sequence corresponding to the text sequence to be recognized;
inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized;
determining feedback voice according to the semantic recognition result;
and controlling the robot to play the feedback voice.
In yet another embodiment of the present application, a data processing apparatus is provided. The device, comprising:
the acquisition module is used for acquiring a text sequence to be recognized;
the determining module is used for determining the element characteristics corresponding to each text element according to the text characteristics and the pronunciation characteristics corresponding to each text element in the text sequence to be recognized so as to obtain an element characteristic sequence corresponding to the text sequence to be recognized;
and the input module is used for inputting the element characteristic sequence into the trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized.
In yet another embodiment of the present application, an electronic device is provided. The electronic device includes: a memory and a processor, wherein,
the memory is used for storing programs;
the processor is coupled to the memory, and configured to execute the program stored in the memory to implement any of the data processing methods described above.
In still another embodiment of the present application, there is provided a computer-readable storage medium storing a computer program capable of implementing the data processing method of any one of the above when executed by a computer.
In yet another embodiment of the present application, a model training apparatus is provided. The device, comprising:
the acquisition module is used for acquiring a sample text sequence and an expected semantic recognition result thereof;
the determining module is used for determining sample element characteristics corresponding to each sample text element according to the text characteristics and pronunciation characteristics corresponding to each sample text element in the sample text sequence to obtain a sample element characteristic sequence corresponding to the sample text sequence;
the input module is used for inputting the sample element characteristic sequence into a language understanding model to be trained to obtain a sample semantic recognition result corresponding to the sample text sequence;
and the optimization module is used for optimizing the language understanding model according to the sample semantic recognition result and the expected semantic recognition result.
In yet another embodiment of the present application, an electronic device is provided. The electronic device includes: a memory and a processor, wherein,
the memory is used for storing programs;
the processor is coupled to the memory and configured to execute the program stored in the memory to implement the model training method.
In yet another embodiment of the present application, a computer-readable storage medium is provided, in which a computer program is stored, which, when executed by a computer, is capable of implementing the model training method described above.
In yet another embodiment of the present application, a data processing apparatus is provided. The device, comprising:
the receiving module is used for receiving the voice to be recognized input by a user;
the recognition module is used for recognizing the speech to be recognized to obtain a text sequence to be recognized;
the determining module is used for determining the element characteristics corresponding to each text element according to the text characteristics and the pronunciation characteristics corresponding to each text element in the text sequence to be recognized so as to obtain an element characteristic sequence corresponding to the text sequence to be recognized;
the input module is used for inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized;
and the execution module is used for executing corresponding feedback operation according to the semantic recognition result.
In yet another embodiment of the present application, an electronic device is provided. The electronic device includes: a memory and a processor, wherein,
the memory is used for storing programs;
the processor is coupled with the memory and is used for executing the program stored in the memory so as to realize the data processing method.
In still another embodiment of the present application, there is provided a computer-readable storage medium storing a computer program capable of implementing the data processing method when executed by a computer.
In yet another embodiment of the present application, a data processing apparatus is provided. The device, comprising:
the receiving module is used for receiving voice sent to the robot by a user;
the recognition module is used for recognizing the voice to obtain a text sequence to be recognized;
the first determining module is used for determining the element feature corresponding to each text element according to the text feature and the pronunciation feature corresponding to each text element in the text sequence to be recognized so as to obtain an element feature sequence corresponding to the text sequence to be recognized;
the input module is used for inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized;
the second determining module is used for determining feedback voice according to the semantic recognition result;
and the control module is used for controlling the robot to play the feedback voice.
In yet another embodiment of the present application, an electronic device is provided. The electronic device includes: a memory and a processor, wherein,
the memory is used for storing programs;
the processor is coupled with the memory and is used for executing the program stored in the memory so as to realize the data processing method.
In still another embodiment of the present application, there is provided a computer-readable storage medium storing a computer program capable of implementing the data processing method when executed by a computer.
According to the technical scheme provided by the embodiment of the application, the element characteristic corresponding to each text element is determined according to the text characteristic and the pronunciation characteristic corresponding to each text element in the text sequence to be recognized, so that the element characteristic sequence corresponding to the text sequence to be recognized is obtained; and inputting the element characteristic sequence into the trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized. Therefore, the technical scheme provided by the embodiment of the application introduces pronunciation characteristics on the basis of the language understanding model, and can effectively improve the robustness and the semantic recognition accuracy of the language understanding model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating a model training method according to an embodiment of the present disclosure;
fig. 3a is a schematic flowchart of a data processing method according to another embodiment of the present application;
FIG. 3b is a first exemplary illustration provided in accordance with yet another embodiment of the present application;
FIG. 3c is a second illustration provided in accordance with yet another embodiment of the present application;
fig. 4 is a block diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 5 is a block diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 6 is a block diagram of a data processing apparatus according to another embodiment of the present application;
fig. 7 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Currently, the inputs of many language understanding models come from users, for example: the user types a passage of text through an input method, or the user inputs a passage of speech by voice. When a user types text through an input method, substitution errors between homophones or near-homophones sometimes occur; when a user inputs a passage of speech by voice, the text recognized by the ASR sometimes also contains substitution errors between homophones or near-homophones because of inaccurate pronunciation. However, existing language understanding models have no way to correct such input errors; that is, once the input is erroneous, the output is bound to be erroneous. Therefore, existing language understanding models are poor in robustness and low in accuracy.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below according to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Further, in some flows described in the specification, claims, and above-described figures of the present application, a number of operations are included that occur in a particular order, which operations may be performed out of order or in parallel as they occur herein. The sequence numbers of the operations, e.g., 101, 102, etc., are used merely to distinguish between the various operations, and do not represent any order of execution per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
Fig. 1 shows a schematic flow chart of a data processing method according to an embodiment of the present application. The execution main body of the method can be a client or a server. The client may be hardware integrated on the terminal and having an embedded program, may also be application software installed in the terminal, and may also be tool software embedded in an operating system of the terminal, which is not limited in this embodiment of the present application. The terminal can be any terminal equipment including a mobile phone, a tablet personal computer, an intelligent sound box and the like. The server may be a common server, a cloud, a virtual server, or the like, which is not specifically limited in this embodiment of the application. As shown in fig. 1, the method includes:
101. and acquiring a text sequence to be recognized.
102. And determining the element characteristic corresponding to each text element according to the text characteristic and the pronunciation characteristic corresponding to each text element in the text sequence to be recognized so as to obtain the element characteristic sequence corresponding to the text sequence to be recognized.
103. And inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized.
In the foregoing 101, the text sequence to be recognized may be a segment of characters manually input by the user through an input method, or a segment of characters obtained by performing voice recognition on a segment of voice input by the user.
In an example, the "acquiring a text sequence to be recognized" in the foregoing 101 specifically includes:
1011. and acquiring the voice to be recognized.
1012. And recognizing the speech to be recognized to obtain the text sequence to be recognized.
In 1012, ASR technology may be used to recognize the speech to be recognized, so as to obtain a text sequence to be recognized.
Typically, the text sequence to be recognized includes a plurality of text elements. In practice, a text element may be a word or a single character.
Taking the text element as a single character as an example, each character in the text sequence to be recognized "导航去天安门" ("navigate to Tiananmen") is a text element.
In 102, the text sequence to be recognized includes a first text element, and the first text element refers to any element in the text sequence to be recognized. And determining the element characteristics corresponding to the first text element according to the text characteristics and the pronunciation characteristics corresponding to the first text element. Specifically, feature fusion may be performed on the text feature and the pronunciation feature corresponding to the first text element to obtain an element feature corresponding to the first text element. For example: the text features and pronunciation features corresponding to the first text element may be summed to obtain the element features corresponding to the first text element.
After the element characteristics corresponding to each text element in the text sequence to be recognized are obtained, the element characteristics corresponding to the text elements are sequenced according to the sequencing of the text elements in the text sequence to be recognized, and the element characteristic sequence corresponding to the text sequence to be recognized is obtained.
For example: in the text sequence to be recognized "导航去天安门", the element feature corresponding to "导" is R1, the element feature corresponding to "航" is R2, the element feature corresponding to "去" is R3, the element feature corresponding to "天" is R4, the element feature corresponding to "安" is R5, and the element feature corresponding to "门" is R6; the element feature sequence corresponding to the text sequence to be recognized, obtained after ordering, is then "R1 R2 R3 R4 R5 R6".
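A minimal Python sketch of this fusion step, under the summation scheme mentioned above, is given below; the embedding dimension, the random feature values and the function name are illustrative assumptions rather than details from the patent.

import numpy as np

def fuse_element_features(text_features, pronunciation_features):
    # Sum each text feature with the pronunciation feature of the same text
    # element, keeping the original ordering of the text elements.
    return [t + p for t, p in zip(text_features, pronunciation_features)]

dim = 8  # illustrative embedding dimension
text_feats = [np.random.rand(dim) for _ in range(6)]  # one per character of "导航去天安门"
pron_feats = [np.random.rand(dim) for _ in range(6)]
element_feature_sequence = fuse_element_features(text_feats, pron_feats)  # R1 ... R6
print(len(element_feature_sequence), element_feature_sequence[0].shape)   # 6 (8,)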
In practical applications, the features may be in the form of vectors. For example: the text feature and the pronunciation feature may be vectors with two dimensions being the same.
The pronunciation feature described above can also be understood as a phonetic feature. In practical application, the pronunciation information of a text element can be determined first, and then the pronunciation feature of the text element can be determined according to the pronunciation information. Following the above example, the pronunciation information of "导" is "dao3", in which the number 3 represents the tone.
In an example, the manner of obtaining the text feature corresponding to the text element may specifically include: acquiring a text element matrix obtained by training in advance, wherein the text element matrix comprises text vectors corresponding to all text elements; and querying the text element matrix in an index mode to obtain a text vector corresponding to each text element in the text sequence to be recognized, wherein the text vector is used as a text feature corresponding to the corresponding text element.
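The following Python sketch illustrates this index-based lookup; the character-to-index vocabulary and the randomly initialized text element matrix are hypothetical stand-ins for the pre-trained ones described above.

import numpy as np

# Hypothetical character-to-index vocabulary and pre-trained text element matrix;
# in practice both are obtained from training, as described above.
vocab = {"导": 0, "航": 1, "去": 2, "天": 3, "安": 4, "门": 5}
text_element_matrix = np.random.rand(len(vocab), 8)  # one text vector per text element

def text_features(text_sequence):
    # Query the text element matrix by index to get the text vector of each element.
    return [text_element_matrix[vocab[ch]] for ch in text_sequence]

feats = text_features("导航去天安门")
print(len(feats), feats[0].shape)  # 6 (8,)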
In step 103, the element feature sequence obtained in step 102 is input into a trained language understanding model, so as to obtain a semantic recognition result corresponding to the text sequence to be recognized by the language understanding model. The language understanding model may be a deep learning model.
The semantic recognition result can be a classification result; it may also be a combination of a classification result and a semantic slot value prediction result, which is not specifically limited in this embodiment of the present application. In one example, the semantic recognition result comprises a classification result and a semantic slot value prediction result; the classification result comprises an intent prediction result and/or a domain prediction result.
Following the above example, the intent is "navigation", and the semantic slot value is "address = Tiananmen".
Some natural language understanding systems require both intent prediction and domain prediction, for example: the domain is "weather" and the intent is "query temperature".
According to the technical scheme provided by the embodiment of the application, the element characteristic corresponding to each text element is determined according to the text characteristic and the pronunciation characteristic corresponding to each text element in the text sequence to be recognized, so that the element characteristic sequence corresponding to the text sequence to be recognized is obtained; and inputting the element characteristic sequence into the trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized. Therefore, pronunciation characteristics are introduced on the basis of the language understanding model in the technical scheme provided by the embodiment of the application, so that the robustness and the semantic recognition accuracy of the language understanding model can be effectively improved.
At present, the errors existing in the input text are mainly substitution errors between homophones or near-homophones. After the pronunciation features are added, when such a substitution error occurs the pronunciation features remain the same as or similar to the original correct pronunciation features, so introducing pronunciation features into the input of the natural language understanding model can improve the robustness and accuracy of the natural language understanding task.
In one implementation, the text sequence to be recognized includes a first text element. The method may further include:
104. and determining at least one pronunciation unit corresponding to the first text element.
105. And acquiring unit vectors corresponding to the at least one pronunciation unit.
106. And determining the pronunciation characteristics corresponding to the first text element according to the unit vector corresponding to the at least one pronunciation unit.
In the above 104, the pronunciation unit may include one or more of a phoneme, a syllable, an initial, a final, and the like. Since the pronunciation information of Chinese also includes tones, the pronunciation unit can be a toned pronunciation unit; for example, the pronunciation unit may include one or more of toned phonemes, syllables, initials and finals, and the like.
The following description takes toned initials and finals as the pronunciation units (generally, the initial itself carries no tone; the tone is carried by the final):
Following the above example, the pronunciation units corresponding to "导" are "d" and "ao3"; the pronunciation units corresponding to "航" are "h" and "ang2"; the pronunciation units corresponding to "去" are "q" and "u4"; the pronunciation units corresponding to "天" are "t", "i" and "an1"; the pronunciation unit corresponding to "安" is "an1"; and the pronunciation units corresponding to "门" are "m" and "en2". Note: here "i" and "an1" are toned finals, and the tone is the first tone. In one example, "ian1" as a whole can also be regarded as a single toned final.
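A minimal Python sketch of the table look-up approach for the example sentence is given below; the hand-written table only covers "导航去天安门" and is an illustrative assumption, not the dictionary used by the patent.

# Hand-written pronunciation table covering only the example sentence; a real
# system would use a full pronunciation dictionary or a grapheme-to-phoneme model.
PRONUNCIATION_TABLE = {
    "导": ["d", "ao3"],
    "航": ["h", "ang2"],
    "去": ["q", "u4"],
    "天": ["t", "i", "an1"],
    "安": ["an1"],
    "门": ["m", "en2"],
}

def pronunciation_units(char):
    # Table look-up: map one text element to its toned pronunciation units.
    return PRONUNCIATION_TABLE[char]

print([pronunciation_units(ch) for ch in "导航去天安门"])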
In one embodiment, the at least one pronunciation unit corresponding to the first text element can be obtained by a table lookup method. In this embodiment, the scheme of obtaining at least one pronunciation unit corresponding to the first text element by a table lookup method has high extensibility.
In another embodiment, the first text element may be input into a grapheme-to-phoneme conversion model to obtain the at least one pronunciation unit. The specific implementation of the grapheme-to-phoneme conversion model can refer to the prior art and is not described in detail here.
In the above 105, specifically, a pronunciation unit matrix obtained by training in advance may be obtained, where the pronunciation unit matrix includes unit vectors corresponding to all pronunciation units; and inquiring the pronunciation unit matrix in an index mode to obtain the unit vector corresponding to the at least one pronunciation unit.
In the above 106, in an example, the unit vectors corresponding to the at least one pronunciation unit may be summed to obtain the pronunciation characteristics corresponding to the first text element. Therefore, the vector dimensions of the pronunciation features corresponding to the first text element and the text features corresponding to the first text element can be ensured to be the same, so that the pronunciation features and the text features corresponding to the first text element can be conveniently summed subsequently, and the element features corresponding to the first text element can be obtained.
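The following Python sketch illustrates this step: each pronunciation unit's vector is looked up by index in a pronunciation unit matrix and the vectors are summed into a single pronunciation feature; the unit vocabulary and the randomly initialized matrix are illustrative assumptions.

import numpy as np

# Hypothetical pronunciation-unit vocabulary and pre-trained pronunciation unit matrix.
unit_vocab = {"d": 0, "ao3": 1, "h": 2, "ang2": 3, "q": 4, "u4": 5,
              "t": 6, "i": 7, "an1": 8, "m": 9, "en2": 10}
pronunciation_unit_matrix = np.random.rand(len(unit_vocab), 8)  # same dimension as the text vectors

def pronunciation_feature(units):
    # Query the unit vector of each pronunciation unit by index and sum the vectors,
    # so the pronunciation feature has the same dimension as the text feature.
    vectors = [pronunciation_unit_matrix[unit_vocab[u]] for u in units]
    return np.sum(vectors, axis=0)

print(pronunciation_feature(["d", "ao3"]).shape)  # pronunciation feature of "导": (8,)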
In one example, the language understanding model may include:
the input network is used for adding the element characteristics corresponding to the designated symbols at the starting positions of the element characteristic sequences to obtain processed element characteristic sequences;
the semantic fusion neural network is used for performing context semantic fusion on the processed element feature sequence to obtain a fusion feature sequence after the semantic fusion;
the classification network is used for determining the intent corresponding to the text sequence to be recognized according to the fusion feature at the position corresponding to the designated symbol in the fusion feature sequence; and is further used for performing sequence labeling on the text sequence to be recognized according to the fusion features other than the fusion feature at the position corresponding to the designated symbol, so as to obtain a semantic slot value corresponding to the text sequence to be recognized.
The above-mentioned designated symbol may be a CLS symbol. The element feature corresponding to the CLS symbol is denoted R0.
Following the above example, the element feature sequence is "R1 R2 R3 R4 R5 R6", and the processed element feature sequence is "R0 R1 R2 R3 R4 R5 R6".
The semantic fusion neural network can be a Transformer network, a convolutional neural network, a recurrent neural network, or the like.
Following the above example, the processed element feature sequence is "R0 R1 R2 R3 R4 R5 R6", and the fusion feature sequence is "H0 H1 H2 H3 H4 H5 H6".
In one example, the classification network may be implemented as an MLP (Multi-Layer Perceptron) classifier.
Following the above example, the fusion feature H0 at the position corresponding to the designated symbol in the fusion feature sequence is input into the MLP classifier, and the intent corresponding to the text sequence to be recognized is obtained as "navigation". H1 H2 H3 H4 H5 H6 are input into the MLP classifier in turn to obtain the semantic label corresponding to each character in "导航去天安门": after the fusion features at the positions of the three characters "天", "安" and "门" are input into the MLP classifier, their labels are predicted as "B-loc", "I-loc" and "I-loc", and the labels of the other characters are predicted as "O". After sequence-labeling format conversion, the predicted semantic slot value is "location = Tiananmen". Specifically, a sequence-labeling format conversion scheme such as the IOB tagging scheme or the IOBES tagging scheme may be used, which is not specifically limited in this embodiment of the present application.
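To make the structure above concrete, the following is a minimal PyTorch sketch of such a language understanding model: an input step that prepends a learned element feature for the [CLS] symbol, a Transformer encoder as the semantic fusion neural network, an intent classifier over the [CLS] fusion feature, and a per-element slot tagger over the remaining fusion features. The layer sizes, label counts and the use of single linear heads (in place of a full MLP classifier) are illustrative assumptions, not details fixed by the patent.

import torch
import torch.nn as nn

class LanguageUnderstandingModel(nn.Module):
    def __init__(self, dim=128, n_intents=10, n_slot_labels=5, n_layers=2, n_heads=4):
        super().__init__()
        # Input network: learned element feature for the designated [CLS] symbol.
        self.cls_feature = nn.Parameter(torch.randn(1, 1, dim))
        # Semantic fusion neural network (a Transformer encoder in this sketch).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Classification network: intent from the [CLS] fusion feature,
        # per-element slot labels from the remaining fusion features.
        self.intent_head = nn.Linear(dim, n_intents)
        self.slot_head = nn.Linear(dim, n_slot_labels)

    def forward(self, element_features):            # (batch, seq_len, dim)
        batch = element_features.size(0)
        cls = self.cls_feature.expand(batch, -1, -1)
        fused = self.encoder(torch.cat([cls, element_features], dim=1))
        intent_logits = self.intent_head(fused[:, 0])   # fusion feature H0
        slot_logits = self.slot_head(fused[:, 1:])      # fusion features H1 ... Hn
        return intent_logits, slot_logits

model = LanguageUnderstandingModel()
intent_logits, slot_logits = model(torch.randn(1, 6, 128))  # six element features
print(intent_logits.shape, slot_logits.shape)               # (1, 10) (1, 6, 5)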
The method for training the language understanding model provided by the embodiment of the present application will be described with reference to fig. 2. The method comprises the following steps:
201. and acquiring a sample text sequence and an expected semantic recognition result thereof.
202. And determining the sample element characteristic corresponding to each sample text element according to the text characteristic and the pronunciation characteristic corresponding to each sample text element in the sample text sequence to obtain a sample element characteristic sequence corresponding to the sample text sequence.
203. And inputting the sample element characteristic sequence into a language understanding model to be trained to obtain a sample semantic recognition result corresponding to the sample text sequence.
204. And optimizing the language understanding model according to the sample semantic recognition result and the expected semantic recognition result.
In the above 201, the expected semantic recognition result may include an expected classification result, wherein the expected classification result comprises an expected intent prediction result and/or an expected domain prediction result.
The expected semantic identification result may also include an expected semantic slot value prediction result.
In the above 202, the sample text sequence includes a first sample text element, and the first sample text element refers to any sample element in the sample text sequence. And determining the element characteristics corresponding to the first sample text element according to the text characteristics and the pronunciation characteristics corresponding to the first sample text element. Specifically, feature fusion may be performed on the text feature and the pronunciation feature corresponding to the first sample text element to obtain a sample element feature corresponding to the first sample text element. For example: the text features and pronunciation features corresponding to the first sample text element may be summed to obtain sample element features corresponding to the first sample text element.
After the element characteristics corresponding to each sample text element in the sample text sequence are obtained, the element characteristics corresponding to the sample text elements are sequenced according to the sequencing of the sample text elements in the sample text sequence, and the sample element characteristic sequence corresponding to the sample text sequence is obtained.
In practical applications, the features may be in the form of vectors. For example: the text feature and the pronunciation feature may be vectors with two dimensions being the same.
The pronunciation feature described above can also be understood as a phonetic feature. In practical application, the sample pronunciation information of a sample text element can be determined first, and then the pronunciation feature of the sample text element can be determined according to the sample pronunciation information. Following the above example, the pronunciation information of "导" is "dao3", in which the number 3 represents the tone.
In an example, the manner of obtaining the text feature corresponding to the sample text element may specifically include: acquiring a text element matrix obtained by training in advance, wherein the text element matrix comprises text vectors corresponding to all text elements; and querying the text element matrix in an index mode to obtain a text vector corresponding to each sample text element in the sample text sequence, wherein the text vector is used as a text feature corresponding to the corresponding sample text element.
In 203, the sample semantic recognition result may be a sample classification result; it may also be a combination of a sample classification result and a sample semantic slot value prediction result, which is not specifically limited in this embodiment of the present application. In one example, the sample semantic recognition result comprises a sample classification result and a sample semantic slot value prediction result; the sample classification result comprises a sample intent prediction result and/or a sample domain prediction result.
In 204, the language understanding model may be optimized according to a difference between the sample semantic recognition result and the expected semantic recognition result.
Taking as an example the case where the sample semantic recognition result includes a sample intent prediction result and a sample semantic slot value prediction result: a first difference between the sample intent prediction result and the expected intent prediction result is calculated; a second difference between the sample semantic slot value prediction result and the expected semantic slot value prediction result is calculated; and the parameters in the language understanding model are optimized in combination with the first difference and the second difference. Specifically, the first difference and the second difference may be calculated by using loss functions, and the choice of loss function may be determined according to actual needs, which is not specifically limited in this embodiment of the present application.
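As a rough sketch of this optimization step, the following function (assuming PyTorch and a model with the interface of the LanguageUnderstandingModel sketch above) computes the first and second differences and combines them. The use of cross-entropy for both differences and their equal weighting are assumptions, since the embodiment leaves the loss function open.

import torch
import torch.nn as nn

intent_loss_fn = nn.CrossEntropyLoss()
slot_loss_fn = nn.CrossEntropyLoss()

def training_step(model, optimizer, sample_element_features, intent_labels, slot_labels):
    intent_logits, slot_logits = model(sample_element_features)
    # First difference: between the sample intent prediction and the expected intent.
    first_difference = intent_loss_fn(intent_logits, intent_labels)
    # Second difference: between the sample slot label predictions and the expected slot labels.
    second_difference = slot_loss_fn(slot_logits.reshape(-1, slot_logits.size(-1)),
                                     slot_labels.reshape(-1))
    loss = first_difference + second_difference  # combine the two differences
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()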
In one implementation, the sample text sequence includes a first sample text element. The method may further include:
205. determining at least one pronunciation unit corresponding to the first sample text element.
206. And acquiring unit vectors corresponding to the at least one pronunciation unit.
207. And determining pronunciation characteristics corresponding to the first sample text element according to the unit vector corresponding to each of the at least one pronunciation unit.
In the above 205, the pronunciation unit may include one or more of a phoneme, a syllable, an initial, a final, and the like. Since the pronunciation information of Chinese also includes tones, the pronunciation units may include one or more of toned phonemes, syllables, initials and finals, and the like.
In one embodiment, the at least one pronunciation unit corresponding to the first sample text element can be obtained by a table look-up method.
In another embodiment, the first sample text element may be input into a grapheme-to-phoneme conversion model to obtain the at least one pronunciation unit.
In 206, specifically, a pronunciation unit matrix obtained by training in advance may be obtained, where the pronunciation unit matrix includes unit vectors corresponding to all pronunciation units; and inquiring the pronunciation unit matrix in an index mode to obtain the unit vector corresponding to the at least one pronunciation unit.
In the above 207, in an example, the unit vectors corresponding to the at least one pronunciation unit may be summed to obtain the pronunciation characteristics corresponding to the first sample text element.
In one example, the language understanding model may include:
the input network is used for adding the element characteristics corresponding to the designated symbols at the starting positions of the sample element characteristic sequences to obtain processed sample element characteristic sequences;
the semantic fusion neural network is used for performing context semantic fusion on the processed sample element feature sequence to obtain a sample fusion feature sequence after the semantic fusion;
the classification network is used for determining the intent corresponding to the sample text sequence according to the sample fusion feature at the position corresponding to the designated symbol in the sample fusion feature sequence; and is further used for performing sequence labeling on the sample text sequence according to the sample fusion features other than the sample fusion feature at the position corresponding to the designated symbol, so as to obtain a sample semantic slot value corresponding to the sample text sequence.
Fig. 3a is a schematic flow chart illustrating a data processing method according to another embodiment of the present application. The execution main body of the method can be a client or a server. The client may be hardware integrated on the terminal and having an embedded program, may also be application software installed in the terminal, and may also be tool software embedded in an operating system of the terminal, which is not limited in this embodiment of the present application. The terminal can be any terminal equipment including a mobile phone, a tablet personal computer, an intelligent sound box and the like. The server may be a common server, a cloud, a virtual server, or the like, which is not specifically limited in this embodiment of the application. As shown in fig. 3a, the method includes:
301. and receiving the voice to be recognized input by the user.
302. And identifying the voice to be identified to obtain a text sequence to be identified.
303. And determining the element characteristic corresponding to each text element according to the text characteristic and the pronunciation characteristic corresponding to each text element in the text sequence to be recognized so as to obtain the element characteristic sequence corresponding to the text sequence to be recognized.
304. And inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized.
305. And executing corresponding feedback operation according to the semantic recognition result.
In 301, the user can press the voice input control provided on the terminal interface for a long time and speak, thereby implementing voice input.
The specific implementation of the steps 302, 303, and 304 can refer to the corresponding content in the above embodiments, and is not described herein again.
In 305 above, in one example, based on the semantic recognition result, a feedback voice, a feedback page, and so on can be determined. And playing the determined feedback voice to the user, and displaying the determined feedback page to the user.
Taking navigation as an example, the feedback operation may include playing to the user the feedback voice "a route to Tiananmen has been planned for you" and presenting to the user a navigation interface in which the navigation route is displayed.
Here, it should be noted that: the content of each step in the method provided by the embodiment of the present application, which is not described in detail in the foregoing embodiment, may refer to the corresponding content in the foregoing embodiment, and is not described herein again. In addition, the method provided in the embodiment of the present application may further include, in addition to the above steps, other parts or all of the steps in the above embodiments, and specific reference may be made to corresponding contents in the above embodiments, which is not described herein again.
The data processing method provided by the embodiment can be applied to scenes such as virtual anchor conversation, voice control, robot conversation, intelligent conference extraction and recording key points and the like. The following will be described by taking a robot conversation as an example, and the specific steps include:
701. and receiving voice sent by a user to the robot.
702. And recognizing the voice to obtain a text sequence to be recognized.
703. And determining the element characteristic corresponding to each text element according to the text characteristic and the pronunciation characteristic corresponding to each text element in the text sequence to be recognized so as to obtain the element characteristic sequence corresponding to the text sequence to be recognized.
704. And inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized.
705. And determining feedback voice according to the semantic recognition result.
706. And controlling the robot to play the feedback voice.
In 701, the robot may be a real robot or a virtual robot. Wherein the virtual robot may be a virtual anchor. The voice is also the voice to be recognized. For example: the user wants to ask the robot about the weather condition of today, and can say "how the weather is today" to the robot.
The specific implementation manners of the foregoing steps 702 to 704 may refer to corresponding contents in the foregoing embodiments, and are not described herein again.
In 705 above, the feedback voice is determined according to the semantic recognition result. Following the above example, if the robot recognizes that the user wants to inquire about today's weather, the robot can query today's weather conditions via the network; then, according to the weather conditions, the feedback voice is generated, for example: "It is sunny today".
In 706, when the robot is a real robot, the player on the robot can be controlled to play the feedback voice. When the robot is a virtual robot, a player on the terminal equipment running the virtual robot can be controlled to play the feedback voice.
Here, it should be noted that: the content of each step in the method provided by the embodiment of the present application, which is not described in detail in the foregoing embodiment, may refer to the corresponding content in the foregoing embodiment, and is not described herein again. In addition, the method provided in the embodiment of the present application may further include, in addition to the above steps, other parts or all of the steps in the above embodiments, and specific reference may be made to corresponding contents in the above embodiments, which is not described herein again.
The following describes, by way of example, the technical solutions provided by the embodiments of the present application with reference to fig. 3b and fig. 3c:
as shown in fig. 3b and fig. 3c, the speech to be recognized is input into the ASR model and speech recognition is performed, so as to obtain the text sequence "导航去天安门" ("navigate to Tiananmen"). The pronunciation sequence "d-ao3 h-ang2 q-v4 t-ian1 an1 m-en2" corresponding to the text sequence to be recognized is determined by a table look-up method. The designated symbol [CLS] is added at the beginning of the text sequence to be recognized, giving the processed text sequence to be recognized "[CLS] 导航去天安门". The word vector (namely the above text feature) corresponding to each element in the processed text sequence to be recognized is determined; the pronunciation corresponding to each element in the pronunciation sequence is converted into a pronunciation vector (namely the above pronunciation feature); and the pronunciation vector of the corresponding element is added into the word vector of each element in the processed text sequence to be recognized, so as to obtain the vector sequence for input (namely the above processed element feature sequence). The vector sequence is input into the language understanding model 32 to obtain the output of the language understanding model: the intent is "navigation" and the semantic slot value is "location (address) = Tiananmen".
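To make the final conversion step of this example concrete, the following Python sketch decodes per-character IOB labels, such as those in the example above, into a semantic slot value; the tag set and function name are illustrative assumptions.

def decode_slots(tokens, tags):
    # Collect each "B-xxx" span (optionally extended by matching "I-xxx" tags)
    # into a slot value keyed by its label.
    slots, current_label, current_chars = {}, None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_label:
                slots[current_label] = "".join(current_chars)
            current_label, current_chars = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_chars.append(token)
        else:
            if current_label:
                slots[current_label] = "".join(current_chars)
            current_label, current_chars = None, []
    if current_label:
        slots[current_label] = "".join(current_chars)
    return slots

tokens = list("导航去天安门")
tags = ["O", "O", "O", "B-loc", "I-loc", "I-loc"]
print(decode_slots(tokens, tags))  # {'loc': '天安门'}, i.e. location = Tiananmen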
Fig. 4 shows a block diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus includes:
an obtaining module 401, configured to obtain a text sequence to be identified;
a determining module 402, configured to determine, according to a text feature and a pronunciation feature corresponding to each text element in the text sequence to be recognized, an element feature corresponding to each text element, so as to obtain an element feature sequence corresponding to the text sequence to be recognized;
an input module 403, configured to input the element feature sequence into a trained language understanding model, so as to obtain a semantic recognition result corresponding to the text sequence to be recognized.
Optionally, the text sequence to be recognized includes a first text element. The determining module 402 is further configured to:
determining at least one pronunciation unit corresponding to the first text element;
acquiring unit vectors corresponding to the at least one pronunciation unit;
and determining the pronunciation characteristics corresponding to the first text element according to the unit vector corresponding to the at least one pronunciation unit.
Here, it should be noted that: the data processing apparatus provided in the foregoing embodiments may implement the technical solutions described in the foregoing method embodiments, and the specific implementation principle of each module may refer to the corresponding content in the foregoing corresponding method embodiments, which is not described herein again.
Fig. 5 shows a block diagram of a model training apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:
an obtaining module 501, configured to obtain a sample text sequence and an expected semantic recognition result thereof;
a determining module 502, configured to determine, according to a text feature and a pronunciation feature corresponding to each sample text element in the sample text sequence, a sample element feature corresponding to each sample text element, so as to obtain a sample element feature sequence corresponding to the sample text sequence;
an input module 503, configured to input the sample element feature sequence to a language understanding model to be trained, so as to obtain a sample semantic recognition result corresponding to the sample text sequence;
an optimizing module 504, configured to optimize the language understanding model according to the sample semantic recognition result and the expected semantic recognition result.
Optionally, the sample text sequence includes a first sample text element. The determining module 502 is further configured to:
determining at least one pronunciation unit corresponding to the first sample text element;
acquiring unit vectors corresponding to the at least one pronunciation unit;
and determining pronunciation characteristics corresponding to the first sample text element according to the unit vector corresponding to each of the at least one pronunciation unit.
Here, it should be noted that: the model training device provided in the above embodiments may implement the technical solutions described in the above method embodiments, and the specific implementation principle of each module may refer to the corresponding content in the above corresponding method embodiments, which is not described herein again.
Fig. 6 shows a block diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 6, the apparatus includes:
a receiving module 601, configured to receive a voice to be recognized input by a user;
the recognition module 602 is configured to recognize the speech to be recognized, so as to obtain a text sequence to be recognized;
a determining module 603, configured to determine, according to a text feature and a pronunciation feature corresponding to each text element in the text sequence to be recognized, an element feature corresponding to each text element, so as to obtain an element feature sequence corresponding to the text sequence to be recognized;
an input module 604, configured to input the element feature sequence to a trained language understanding model, so as to obtain a semantic recognition result corresponding to the text sequence to be recognized;
and the executing module 605 is configured to execute a corresponding feedback operation according to the semantic recognition result.
Optionally, the text sequence to be recognized includes a first text element. The determining module 603 is further configured to:
determining at least one pronunciation unit corresponding to the first text element;
acquiring unit vectors corresponding to the at least one pronunciation unit;
and determining the pronunciation characteristics corresponding to the first text element according to the unit vector corresponding to the at least one pronunciation unit.
Here, it should be noted that: the data processing apparatus provided in the foregoing embodiments may implement the technical solutions described in the foregoing method embodiments, and the specific implementation principle of each module may refer to the corresponding content in the foregoing corresponding method embodiments, which is not described herein again.
In yet another embodiment of the present application, a data processing apparatus is provided. The apparatus includes:
the receiving module is used for receiving voice sent to the robot by a user;
the recognition module is used for recognizing the voice to obtain a text sequence to be recognized;
the first determining module is used for determining the element feature corresponding to each text element according to the text feature and the pronunciation feature corresponding to each text element in the text sequence to be recognized so as to obtain an element feature sequence corresponding to the text sequence to be recognized;
the input module is used for inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized;
the second determining module is used for determining feedback voice according to the semantic recognition result;
and the control module is used for controlling the robot to send the feedback voice.
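For this robot-oriented variant, the second determining module and the control module can be pictured as a reply-generation and playback step. The sketch below is an assumption-laden illustration: generate_reply, synthesize_speech, and play_on_robot stand in for a dialogue policy, a text-to-speech engine, and the robot's audio output, none of which are specified in this disclosure.

    from typing import Callable

    def reply_with_voice(
        semantic_result: dict,
        generate_reply: Callable[[dict], str],       # maps the semantic result to a reply text (assumed)
        synthesize_speech: Callable[[str], bytes],   # text-to-speech engine (assumed)
        play_on_robot: Callable[[bytes], None],      # robot audio output (assumed)
    ) -> None:
        reply_text = generate_reply(semantic_result)   # "determine feedback voice" step
        feedback_audio = synthesize_speech(reply_text)
        play_on_robot(feedback_audio)                  # control the robot to play the feedback voice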
Here, it should be noted that: the data processing apparatus provided in the foregoing embodiment may implement the technical solutions described in the foregoing corresponding method embodiments, and the specific implementation principle of each module may refer to the corresponding content in the foregoing corresponding method embodiments, which is not described herein again.
Fig. 7 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 7, the electronic device includes a memory 1101 and a processor 1102. The memory 1101 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device. The memory 1101 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks.
The memory 1101 is used for storing programs;
the processor 1102 is coupled to the memory 1101, and configured to execute the program stored in the memory 1101, so as to implement the data processing method and the model training method provided by the above method embodiments.
Further, as shown in fig. 7, the electronic device further includes: a communication component 1103, a display 1104, a power component 1105, an audio component 1106, and the like. Only some of the components are schematically shown in fig. 7, which does not mean that the electronic device includes only the components shown in fig. 7.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program which, when executed by a computer, can implement the steps or functions of the data processing method and the model training method provided by the above method embodiments.
The above-described apparatus embodiments are merely illustrative. The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A data processing method, comprising:
acquiring a text sequence to be recognized;
determining the element characteristic corresponding to each text element according to the text characteristic and the pronunciation characteristic corresponding to each text element in the text sequence to be recognized so as to obtain an element characteristic sequence corresponding to the text sequence to be recognized;
and inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized.
2. The method according to claim 1, characterized in that the text sequence to be recognized comprises a first text element;
the method further comprises the following steps:
determining at least one pronunciation unit corresponding to the first text element;
acquiring unit vectors corresponding to the at least one pronunciation unit;
and determining the pronunciation characteristics corresponding to the first text element according to the unit vector corresponding to the at least one pronunciation unit.
3. The method of claim 2, wherein determining the pronunciation characteristics corresponding to the first text element according to the unit vector corresponding to each of the at least one pronunciation unit comprises:
and summing unit vectors corresponding to the at least one pronunciation unit to obtain pronunciation characteristics corresponding to the first text element.
4. The method of claim 2, wherein determining the element feature corresponding to the first text element according to the text feature and the pronunciation feature corresponding to the first text element comprises:
and summing the text features corresponding to the first text element and the pronunciation features to obtain the element features corresponding to the first text element.
5. The method according to any one of claims 1 to 4, wherein obtaining a text sequence to be recognized comprises:
acquiring a voice to be recognized;
and recognizing the speech to be recognized to obtain the text sequence to be recognized.
6. A method of model training, comprising:
acquiring a sample text sequence and an expected semantic recognition result thereof;
determining sample element characteristics corresponding to each sample text element according to the text characteristics and pronunciation characteristics corresponding to each sample text element in the sample text sequence to obtain a sample element characteristic sequence corresponding to the sample text sequence;
inputting the sample element characteristic sequence into a language understanding model to be trained to obtain a sample semantic recognition result corresponding to the sample text sequence;
and optimizing the language understanding model according to the sample semantic recognition result and the expected semantic recognition result.
7. A data processing method, comprising:
receiving a voice to be recognized input by a user;
recognizing the voice to be recognized to obtain a text sequence to be recognized;
determining the element characteristic corresponding to each text element according to the text characteristic and the pronunciation characteristic corresponding to each text element in the text sequence to be recognized so as to obtain an element characteristic sequence corresponding to the text sequence to be recognized;
inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized;
and executing corresponding feedback operation according to the semantic recognition result.
8. A data processing method, comprising:
receiving voice sent by a user to the robot;
recognizing the voice to obtain a text sequence to be recognized;
determining the element characteristic corresponding to each text element according to the text characteristic and the pronunciation characteristic corresponding to each text element in the text sequence to be recognized so as to obtain an element characteristic sequence corresponding to the text sequence to be recognized;
inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized;
determining feedback voice according to the semantic recognition result;
controlling the robot to emit the feedback voice.
9. A data processing apparatus, comprising:
the acquisition module is used for acquiring a text sequence to be recognized;
the determining module is used for determining the element characteristics corresponding to each text element according to the text characteristics and the pronunciation characteristics corresponding to each text element in the text sequence to be recognized so as to obtain an element characteristic sequence corresponding to the text sequence to be recognized;
and the input module is used for inputting the element characteristic sequence into the trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized.
10. A data processing apparatus, comprising:
the receiving module is used for receiving the voice to be recognized input by a user;
the recognition module is used for recognizing the speech to be recognized to obtain a text sequence to be recognized;
the determining module is used for determining the element characteristics corresponding to each text element according to the text characteristics and the pronunciation characteristics corresponding to each text element in the text sequence to be recognized so as to obtain an element characteristic sequence corresponding to the text sequence to be recognized;
the input module is used for inputting the element characteristic sequence into a trained language understanding model to obtain a semantic recognition result corresponding to the text sequence to be recognized;
and the execution module is used for executing corresponding feedback operation according to the semantic recognition result.
CN202110413748.4A 2021-04-16 2021-04-16 Data processing method and device Pending CN113515586A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110413748.4A CN113515586A (en) 2021-04-16 2021-04-16 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110413748.4A CN113515586A (en) 2021-04-16 2021-04-16 Data processing method and device

Publications (1)

Publication Number Publication Date
CN113515586A true CN113515586A (en) 2021-10-19

Family

ID=78062838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110413748.4A Pending CN113515586A (en) 2021-04-16 2021-04-16 Data processing method and device

Country Status (1)

Country Link
CN (1) CN113515586A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117594045A (en) * 2024-01-18 2024-02-23 拓世科技集团有限公司 Virtual character model control method and system based on voice

Similar Documents

Publication Publication Date Title
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
CN1645477B (en) Automatic speech recognition learning using user corrections
CN109389968B (en) Waveform splicing method, device, equipment and storage medium based on double syllable mixing and lapping
CN107622054B (en) Text data error correction method and device
KR102390940B1 (en) Context biasing for speech recognition
KR20210146368A (en) End-to-end automatic speech recognition for digit sequences
CN108847241A (en) It is method, electronic equipment and the storage medium of text by meeting speech recognition
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN111199727A (en) Speech recognition model training method, system, mobile terminal and storage medium
CN110010136B (en) Training and text analysis method, device, medium and equipment for prosody prediction model
KR101819458B1 (en) Voice recognition apparatus and system
CN112633003A (en) Address recognition method and device, computer equipment and storage medium
CN112818680B (en) Corpus processing method and device, electronic equipment and computer readable storage medium
CN111160004B (en) Method and device for establishing sentence-breaking model
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
CN113515586A (en) Data processing method and device
CN113743117B (en) Method and device for entity labeling
Hanzlíček et al. LSTM-based speech segmentation trained on different foreign languages
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN114708848A (en) Method and device for acquiring size of audio and video file
CN110728137B (en) Method and device for word segmentation
CN111489742B (en) Acoustic model training method, voice recognition device and electronic equipment
CN115298736A (en) Speech recognition and training for data input
CN114398876B (en) Text error correction method and device based on finite state converter
CN115188365B (en) Pause prediction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240314

Address after: # 03-06, Lai Zan Da Building 1, 51 Belarusian Road, Singapore

Applicant after: Alibaba Innovation Co.

Country or region after: Singapore

Address before: Room 01, 45th Floor, AXA Building, 8 Shanton Road, Singapore

Applicant before: Alibaba Singapore Holdings Ltd.

Country or region before: Singapore