CN111739517A - Speech recognition method, speech recognition device, computer equipment and medium - Google Patents

Info

Publication number
CN111739517A
CN202010622097.5A (application) · CN111739517A (publication) · CN111739517B (grant)
Authority
CN
China
Prior art keywords
user
voice
sample
voice data
features
Prior art date
Legal status
Granted
Application number
CN202010622097.5A
Other languages
Chinese (zh)
Other versions
CN111739517B (en)
Inventor
田植良
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010622097.5A
Publication of CN111739517A
Application granted
Publication of CN111739517B
Current legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065: Adaptation
    • G10L15/07: Adaptation to the speaker
    • G10L15/063: Training
    • G10L2015/0635: Training updating or merging of old and new templates; Mean values; Weighting

Abstract

An embodiment of this application discloses a speech recognition method, a speech recognition apparatus, computer equipment, and a medium, which belong to the field of computer technology. The method includes: performing feature extraction on voice data of a first user identifier to obtain voice features of the voice data; obtaining user features of the first user identifier; fusing the voice features and the user features to obtain fusion features corresponding to the voice data; and recognizing the fusion features to obtain text data corresponding to the voice data. The method considers both the content of the voice data and the user's speaking style, so the recognized text data better matches the user's speaking style and the voice data, which improves the accuracy of speech recognition.

Description

Speech recognition method, speech recognition device, computer equipment and medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a voice recognition method, a voice recognition device, computer equipment and a medium.
Background
With the development of computer technology, speech recognition has become increasingly common in fields such as social applications and intelligent customer service; it can convert voice data into text data for users to view. However, existing technology has limited performance and may fail to accurately recognize the text data corresponding to voice data, so the accuracy of speech recognition is low.
Disclosure of Invention
The embodiment of the application provides a voice recognition method, a voice recognition device, computer equipment and a medium, and improves the accuracy of voice recognition. The technical scheme is as follows:
in one aspect, a speech recognition method is provided, and the method includes:
performing feature extraction on voice data of a first user identifier to obtain voice features of the voice data;
acquiring user characteristics of the first user identifier, wherein the user characteristics are obtained by performing characteristic extraction on a user relationship network of the first user identifier, and the user relationship network comprises an association relationship between the first user identifier and at least one second user identifier;
performing fusion processing on the voice features and the user features to obtain fusion features corresponding to the voice data;
and identifying the fusion characteristics to obtain text data corresponding to the voice data.
In another aspect, a speech recognition apparatus is provided, the apparatus comprising:
the voice feature acquisition module is used for extracting features of voice data of the first user identifier to obtain voice features of the voice data;
a user characteristic obtaining module, configured to obtain a user characteristic of the first user identifier, where the user characteristic is obtained by performing characteristic extraction on a user relationship network of the first user identifier, and the user relationship network includes an association relationship between the first user identifier and at least one second user identifier;
a fusion feature obtaining module, configured to perform fusion processing on the voice feature and the user feature to obtain a fusion feature corresponding to the voice data;
and the voice recognition module is used for recognizing the fusion characteristics to obtain text data corresponding to the voice data.
Optionally, the user feature obtaining module is configured to invoke a user feature extraction layer of a first speech recognition model, and perform feature extraction on the user relationship network to obtain the user feature of the first user identifier.
Optionally, the step of extracting the feature of the voice data of the first user identifier to obtain the voice feature of the voice data is performed by calling a voice feature extraction layer of the first voice recognition model;
the step of performing fusion processing on the voice features and the user features to obtain fusion features corresponding to the voice data is executed by calling a feature fusion layer of the first voice recognition model;
and the step of identifying the fusion characteristics to obtain text data corresponding to the voice data is executed by calling a voice recognition layer of the first voice recognition model.
Optionally, the training process of the first speech recognition model includes the following steps:
acquiring a sample user relationship network of a sample user identifier, sample voice data of the sample user identifier and sample text data corresponding to the sample voice data;
calling the voice feature extraction layer to extract features of the sample voice data to obtain predicted voice features of the sample voice data;
invoking the user feature extraction layer, and performing feature extraction on the sample user relationship network to obtain predicted user features of the sample user identifier;
calling the feature fusion layer, and carrying out fusion processing on the predicted voice features and the predicted user features to obtain predicted fusion features corresponding to the sample voice data;
calling the voice recognition layer, and carrying out recognition processing on the prediction fusion characteristics to obtain prediction text data corresponding to the sample voice data;
adjusting parameters in the first speech recognition model according to the sample text data and the predicted text data.
Optionally, the apparatus further comprises:
a user identifier obtaining module, configured to obtain at least one second user identifier associated with the first user identifier;
and a graph creation module, configured to create a graph according to the first user identifier and the at least one second user identifier, where the graph includes a first user node corresponding to the first user identifier, second user nodes corresponding to the at least one second user identifier, and connecting lines between the first user node and the at least one second user node.
Optionally, the apparatus further comprises:
the user characteristic obtaining module is further configured to invoke a user characteristic extraction model, perform characteristic extraction on the user relationship network of any user identifier, and obtain a user characteristic of any user identifier;
and the relationship establishing module is used for establishing the corresponding relationship between the user identification and the user characteristics.
Optionally, the training process of the user feature extraction model includes the following steps:
obtaining a sample user relationship network of a sample user identifier and sample user characteristics of the sample user identifier in the sample user relationship network;
and training the user characteristic extraction model according to the sample user relationship network and the sample user characteristics.
Optionally, the step of extracting the feature of the voice data of the first user identifier to obtain the voice feature of the voice data is performed by calling a voice feature extraction layer of a second voice recognition model;
the step of inquiring the user characteristics of the first user identification according to the established corresponding relation is executed by calling a user characteristic acquisition layer of the second voice recognition model;
the step of performing fusion processing on the voice features and the user features to obtain fusion features corresponding to the voice data is executed by calling a feature fusion layer of the second voice recognition model;
and the step of identifying the fusion characteristics to obtain the text data corresponding to the voice data is executed by calling a voice recognition layer of the second voice recognition model.
Optionally, the training process of the second speech recognition model includes the following steps:
acquiring a sample user identifier, sample voice data of the sample user identifier and sample text data corresponding to the sample voice data;
calling the voice feature extraction layer to extract features of the sample voice data to obtain predicted voice features of the sample voice data;
calling the user characteristic acquisition layer, and inquiring the user characteristics of the first user identification according to the established corresponding relation;
calling the feature fusion layer, and carrying out fusion processing on the predicted voice features and the user features to obtain predicted fusion features corresponding to the sample voice data;
calling the voice recognition layer, and carrying out recognition processing on the prediction fusion characteristics to obtain prediction text data corresponding to the sample voice data;
and adjusting parameters in the second speech recognition model according to the sample text data and the predicted text data.
In another aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction is stored, and the at least one instruction is loaded and executed by the processor to implement the operations performed in the speech recognition method according to the above aspect.
In another aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the operations performed in the speech recognition method according to the above aspect.
In another aspect, a computer program product or a computer program is provided. The computer program product or computer program includes computer instructions stored in a computer-readable storage medium; a processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the operations performed in the speech recognition method according to the above aspect.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
According to the method, apparatus, computer equipment, and medium provided in the embodiments of this application, in the process of recognizing voice data, not only are the voice features of the voice data recognized, but user features are also obtained by feature extraction from the user relationship network of the user identifier. The user features can reflect the user's speaking style. The voice features and the user features are fused, and the fused features are recognized, so that the content of the voice data and the user's speaking style are considered together. The recognized text data therefore better matches the user's speaking style and the voice data, which improves the accuracy of speech recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of this application, the drawings required for describing the embodiments are briefly introduced below. Clearly, the drawings in the following description show only some embodiments of this application, and a person skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flow chart of a speech recognition method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a first speech recognition model provided by an embodiment of the present application;
FIG. 4 is a flow chart of another speech recognition method provided by the embodiments of the present application;
FIG. 5 is a flow chart of model training and use provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a model architecture provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of another model architecture provided by embodiments of the present application;
FIG. 8 is a flow chart of another speech recognition method provided by embodiments of the present application;
FIG. 9 is a diagram illustrating a user feature extraction model according to an embodiment of the present disclosure;
FIG. 10 is a flow chart of another speech recognition method provided by embodiments of the present application;
fig. 11 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of another speech recognition apparatus provided in the embodiment of the present application;
fig. 13 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.
It should be understood that the terms "first", "second", and the like used in this application may be used to describe various concepts, but the concepts are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, without departing from the scope of this application, a first user identifier may be referred to as a second user identifier, and a second user identifier may be referred to as a first user identifier.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason, and make decisions.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Key technologies of speech technology include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, and speech is regarded as one of the most promising human-computer interaction modes.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The scheme provided by the embodiment of the application relates to the technologies such as artificial intelligence voice technology and natural language processing, and is explained by the following embodiment.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to fig. 1, the implementation environment includes: at least one terminal 101 and a server 102, at least one terminal 101 and the server 102 being connected via a network.
At least one of the terminals 101 may be a portable, pocket, or handheld terminal, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform.
Each terminal 101 logs in the server 102 based on the user identifier, and the server 102 obtains the association relationship among the plurality of user identifiers according to the user identifiers logged in by the plurality of terminals 101.
Alternatively, each terminal 101 may install a social application, and the server 102 provides a service for the social application, through which the terminal 101 and the server 102 interact. Any terminal 101 may receive data transmitted by other terminals 101, and may also transmit data to other terminals 101, where the data may be any form of data such as text data, voice data, image data, and the like. Moreover, the social application has a voice recognition function, and voice recognition can be performed on voice data corresponding to any user identifier in the social application.
The method provided by the embodiment of the application can be applied to various scenes.
For example, in a scenario where voice data in a social application is identified.
During a chat between two users, one user sends voice data to the other through the social application, and the other user receives the voice data through the social application. If it is inconvenient for the receiving user to play the voice data, or the user wants to view the corresponding text, a voice recognition instruction can be triggered, so that the voice data is recognized by the speech recognition method provided in the embodiments of this application to obtain the corresponding text data, and the text data is displayed in the social application.
Also for example, in the context of voice calls in social applications.
In the process of voice communication, if one user cannot clearly hear the voice data of the other user due to excessive external environment noise, a voice recognition instruction can be triggered.
Fig. 2 is a flowchart of a speech recognition method according to an embodiment of the present application. The execution subject of the embodiment of the application is the server. Referring to fig. 2, the method includes:
201. and the server performs feature extraction on the voice data of the first user identifier to obtain the voice features of the voice data.
The first user identifier is used to represent the identity of the user who sends the voice data, and the first user identifier may be a nickname, an ID, or the like of the user. The voice data of the first user identifier refers to voice data sent by a user corresponding to the first user identifier, and the voice data may be any content, for example, the voice data may be a word, a sentence, or the like; the voice data may be in any language, for example, chinese, english, or the like.
The server extracts the voice features of the voice data of the first user identification, and the voice features are used for representing the content of the voice data, so that the server can recognize the voice data according to the voice features to obtain text data corresponding to the voice data.
In a possible implementation manner, the terminal is provided with a social application, the first user identifier may be any user identifier registered in the social application, the server is a social application server, and the social application server may obtain voice data of the first user identifier, recognize the voice data, and obtain text data corresponding to the voice data.
Optionally, a second user identifier may be registered in the social application, and a friend relationship exists between the second user identifier and the first user identifier, so that the terminal corresponding to the second user identifier may receive the voice data of the first user identifier, and the terminal corresponding to the second user identifier interacts with the server to recognize the voice data of the first user identifier.
Optionally, after the terminal corresponding to the first user identifier sends the voice data of the first user identifier, the terminal may also interact with the server to recognize the voice data of the first user identifier.
202. The server obtains the user characteristics of the first user identification.
Speaking styles such as accent and wording habits differ between users. If only the voice features are recognized to obtain the text data corresponding to the voice data, these differences between users are not considered, which affects the recognition. For example, two users may express different content while the voice data they utter sounds the same because of dialect differences; if only the voice features are recognized, the same text data is obtained for both, and the recognition is inaccurate. Therefore, the user features of the user identifier are obtained, and recognition is performed according to both the voice features and the user features, so that the user's characteristics are considered in the recognition process and the recognized text data is more accurate.
In this embodiment of the application, the user features are obtained by performing feature extraction on a user relationship network of the first user identifier, where the user relationship network includes association relationships between the first user identifier and at least one second user identifier. A second user identifier is a user identifier associated with the first user identifier in the social application. For example, the second user identifier may have a friend relationship with the first user identifier, or may be a user identifier followed by the first user identifier, or the first user identifier and the second user identifier may have another association relationship.
Optionally, before receiving the voice data of the first user identifier, the server already obtains user features of multiple user identifiers, and the server obtains the user features of the first user identifier by querying the user features of the multiple user identifiers; or when the server receives the voice data of the first user identification, the user relationship network of the first user identification is obtained, and the user characteristics of the first user identification are obtained by carrying out characteristic extraction on the user relationship network of the first user identification.
In this embodiment of the application, the user relationship network of the first user identifier can be obtained quickly through the social application. The association relationships between the first user identifier and the at least one second user identifier in the network can reflect the real-life relationships between the user corresponding to the first user identifier and the users corresponding to the second user identifiers, so the user features of the first user identifier can be obtained quickly and accurately from the user relationship network. For example, if two user identifiers have an association relationship, the corresponding users may be friends in real life.
203. And the server performs fusion processing on the voice characteristics and the user characteristics to obtain fusion characteristics corresponding to the voice data.
The server fuses the voice features and the user features, so that the obtained fusion features comprise the voice features and the user features, and the fusion features can reflect the content of the voice data and the characteristics of the user corresponding to the first user identifier.
204. And the server identifies the fusion characteristics to obtain text data corresponding to the voice data.
The correspondence between the voice data and the text data means that the content of the voice data is consistent with the content of the text data. For example, if the voice data is "hello", the text data corresponding to the voice data should also be "hello".
It should be noted that, the embodiment of the present application is only described as an example in which step 201 is performed first and then step 202 is performed, and in another embodiment, step 202 may be performed first and then step 201 is performed, or step 201 and step 202 may be performed simultaneously.
According to the method provided in this embodiment of the application, in the process of recognizing voice data, not only are the voice features of the voice data recognized, but user features are also obtained by feature extraction from the user relationship network of the user identifier. The user features can reflect the user's speaking style. The voice features and the user features are fused, and the fused features are recognized, so that the content of the voice data and the user's speaking style are considered together. The recognized text data therefore better matches the user's speaking style and the voice data, which improves the accuracy of speech recognition.
The embodiment of the present application first provides a first speech recognition mode, where the first speech recognition mode uses a first speech recognition model 300 shown in fig. 3 to perform speech recognition, and the first speech recognition model 300 includes a speech feature extraction layer 301, a user feature extraction layer 302, a feature fusion layer 303, and a speech recognition layer 304. The voice data of the user identifier and the user relationship network of the user identifier are input to the first voice recognition model 300, so that the first voice recognition model is called to output text data corresponding to the voice data.
The speech feature extraction layer 301 is connected to the feature fusion layer 303, the user feature extraction layer 302 is connected to the feature fusion layer 303, and the feature fusion layer 303 is connected to the speech recognition layer 304. The voice feature extraction layer 301 is configured to extract a voice feature of the voice data, the user feature extraction layer 302 is configured to extract a user feature of the user identifier according to a user relationship network of the user identifier, the feature fusion layer 303 is configured to fuse the voice feature and the user feature, and the voice recognition layer 304 is configured to recognize text data corresponding to the voice data according to an input feature.
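The patent describes the first speech recognition model only at the level of these four layers and their connections; it does not fix concrete network architectures or sizes. The following Python (PyTorch) sketch is one illustrative way to wire the layers together, and every module choice, dimension, and name in it is an assumption rather than part of the disclosure.

```python
import torch
import torch.nn as nn

class FirstSpeechRecognitionModel(nn.Module):
    """Illustrative wiring of the four layers described above (all sizes assumed)."""

    def __init__(self, n_mels=80, hidden=256, user_dim=64, vocab_size=5000):
        super().__init__()
        # Speech feature extraction layer: encodes the acoustic frames.
        self.speech_feature_layer = nn.GRU(n_mels, hidden, batch_first=True)
        # User feature extraction layer: maps a per-user graph embedding
        # (e.g. produced by a graph convolution network) to the user feature.
        self.user_feature_layer = nn.Linear(user_dim, hidden)
        # Feature fusion layer: fuses speech and user features (concatenation here).
        self.fusion_layer = nn.Linear(hidden * 2, hidden)
        # Speech recognition layer: predicts a token distribution per frame.
        self.recognition_layer = nn.Linear(hidden, vocab_size)

    def forward(self, speech_frames, user_embedding):
        # speech_frames: (batch, time, n_mels); user_embedding: (batch, user_dim)
        speech_feat, _ = self.speech_feature_layer(speech_frames)       # (B, T, H)
        user_feat = self.user_feature_layer(user_embedding)             # (B, H)
        user_feat = user_feat.unsqueeze(1).expand(-1, speech_feat.size(1), -1)
        fused = self.fusion_layer(torch.cat([speech_feat, user_feat], dim=-1))
        return self.recognition_layer(torch.relu(fused))                # (B, T, vocab)
```

In this sketch the user embedding would come from the user relationship network as described below, and concatenation is used as the fusion operation; the patent also allows other fusion operations.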
On the basis of the first speech recognition model 300, a speech recognition method based on the first speech recognition model is also provided, which is detailed in the following embodiments.
Fig. 4 is a flowchart of a speech recognition method according to an embodiment of the present application. Referring to fig. 4, the execution subject of the method is a server, and the method includes:
401. and the server calls the voice feature extraction layer to extract the features of the voice data of the first user identification to obtain the voice features of the voice data.
Different users have different speaking styles, such as accent and wording habits, so even when the voice data uttered by different users sounds the same, the expressed content may differ. If a general speech recognition model is used, the same text data is recognized for the same voice data, so the accuracy of speech recognition is low. If a separate speech recognition model is trained for each user identifier, each trained model is a personalized model that can only recognize the voice data of that user identifier; training one model per user is costly, and if a user identifier has little sample data, its model cannot be trained accurately.
Therefore, this embodiment of the application provides a first speech recognition model that can extract the user features of each user identifier and take those user features into account when recognizing voice data, thereby achieving personalized recognition of the voice data of each user identifier. The first speech recognition model can recognize the voice data of any user identifier without training a separate model for each user, so the implementation cost is low.
When the server identifies the voice data of the first user identification, the voice data is firstly input to the voice feature extraction layer, and feature extraction is carried out on the voice data to obtain the voice feature of the voice data. The voice features may be in a vector form or other forms.
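The patent does not specify which acoustic features are extracted, only that the voice features may be vectors or take other forms. As a hedged illustration, the sketch below computes MFCC frame vectors with librosa; the sampling rate, feature type, and dimensions are assumptions.

```python
import librosa
import numpy as np

def extract_speech_features(wav_path, sr=16000, n_mfcc=13):
    """One possible acoustic front end (MFCC frames); not specified by the patent."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.T.astype(np.float32)                            # (frames, n_mfcc)
```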
In one possible implementation manner, the terminal is installed with a social application, a plurality of user identifiers may be registered in the social application, and the plurality of user identifiers may send voice data to each other. The plurality of user identifications includes at least a first user identification and at least one second user identification associated with the first user identification.
Optionally, the terminal displays voice data of the first user identifier, sends a recognition instruction to the server in response to the received recognition instruction for the voice data, and after receiving the recognition instruction, the server calls the voice feature extraction layer to perform feature extraction on the voice data. The identification instruction carries a first user identifier and voice data to indicate that the voice data is sent by a user corresponding to the first user identifier.
The terminal may be a terminal corresponding to the first user identifier, that is, after the terminal corresponding to the first user identifier sends the voice data of the first user identifier, the terminal interacts with the server to perform voice recognition on the voice data sent by the terminal; the terminal may also be a terminal corresponding to the second user identifier, that is, after receiving the voice data of the first user identifier, the terminal corresponding to the second user identifier interacts with the server to perform voice recognition on the received voice data.
402. And the server calls a user feature extraction layer to extract features of the user relationship network of the first user identifier to obtain the user features of the first user identifier.
And the user characteristic extraction layer performs characteristic extraction on the input user relationship network to obtain the user characteristics of the first user identifier. The user relationship network includes an association relationship between a first user identifier and at least one second user identifier, and the user relationship network may be in any form, for example, a form such as a table, a graph network, and the like.
In one possible implementation manner, before performing feature extraction on the user relationship network of the first user identifier, the server first needs to obtain the user relationship network. The obtaining process includes: the server obtains at least one second user identifier associated with the first user identifier, and creates a graph according to the first user identifier and the at least one second user identifier, where the graph is the user relationship network of the first user identifier. A second user identifier may be a friend of the first user identifier, or a user identifier followed by the first user.
The graph includes a first user node corresponding to the first user identifier, second user nodes corresponding to the at least one second user identifier, and connecting lines between the first user node and the at least one second user node. A connecting line between the first user node and a second user node indicates that an association relationship exists between the two user identifiers.
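As a small illustration of the graph described above (one node per user identifier, one connecting line per association relationship), the following sketch builds it with networkx; the identifiers are hypothetical and the library choice is an assumption.

```python
import networkx as nx

def build_user_graph(first_user_id, second_user_ids):
    """Build the user relationship graph: one node per user identifier,
    one edge between the first user and each associated second user."""
    graph = nx.Graph()
    graph.add_node(first_user_id)
    for uid in second_user_ids:
        graph.add_node(uid)
        graph.add_edge(first_user_id, uid)  # edge = association relationship
    return graph

# Hypothetical identifiers, for illustration only.
g = build_user_graph("user_a", ["user_b", "user_c", "user_d"])
print(g.number_of_nodes(), g.number_of_edges())  # 4 3
```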
Optionally, if there is an association between two second subscriber identities that have an association with the first subscriber identity, the two second subscriber identities may or may not be connected in the subscriber relationship network of the first subscriber identity.
Optionally, if a social application is installed in the terminal, the first user identifier and the at least one second user identifier are user identifiers registered in the same social application, and an association relationship is established between the first user identifier and any one of the second user identifiers.
In a possible implementation manner, the user feature extraction layer performs feature extraction on the user relationship network to obtain user features of a plurality of user identifiers included in the user relationship network, and selects a user feature of the first user identifier from the user features.
In one possible implementation, since the input of the user feature extraction layer is a user relationship network, the user feature extraction layer may be a graph convolution network.
Because the user characteristics are obtained by performing characteristic extraction on the user relationship network, for a plurality of user identifications, if the plurality of user identifications have an association relationship, the user characteristics of the plurality of user identifications are similar, and if the plurality of user identifications do not have an association relationship, the user characteristics of the plurality of user identifications are not similar.
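To illustrate why a graph convolution network tends to produce similar features for associated identifiers, the following numpy sketch applies the symmetric normalized adjacency propagation used by graph convolution layers to a toy four-user graph; the propagation depth, feature sizes, and the absence of learned weights are simplifications, not details from the patent.

```python
import numpy as np

def gcn_propagate(adj, feats, steps=2):
    """Propagate node features over a graph with the normalized adjacency
    D^-1/2 (A + I) D^-1/2 used by graph convolution networks."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt
    for _ in range(steps):
        feats = a_norm @ feats
    return feats

# Toy graph: users 0-1-2 form a chain of associations, user 3 is isolated.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 0],
                [0, 0, 0, 0]], dtype=float)
feats = np.random.rand(4, 8)
out = gcn_propagate(adj, feats)
# The associated users (0, 1, 2) end up with more similar rows than the isolated user 3.
```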
403. And calling the feature fusion layer by the server, and carrying out fusion processing on the voice features and the user features to obtain fusion features corresponding to the voice data.
And inputting the voice characteristics and the user characteristics into a characteristic fusion layer, and performing fusion processing on the voice characteristics and the user characteristics in the characteristic fusion layer to obtain fusion characteristics, wherein the fusion characteristics are used for describing the content of the voice data and the speaking mode of the user corresponding to the first user identifier.
In one possible implementation, if the speech feature and the user feature are represented in a vector form, performing a fusion process on the speech feature and the user feature, including: and splicing the voice characteristics and the user characteristics, or adding, subtracting or performing other operations on the voice characteristics and the user characteristics, or adopting other processing modes to obtain the fusion characteristics.
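A minimal sketch of the fusion options mentioned above (concatenation, addition, subtraction); it assumes plain vectors and that the arithmetic variants are only used when the two feature vectors have the same dimension.

```python
import numpy as np

def fuse_features(speech_feat, user_feat, mode="concat"):
    """Fuse a speech feature vector with a user feature vector.
    'concat' works for any sizes; 'add'/'sub' assume equal dimensions."""
    if mode == "concat":
        return np.concatenate([speech_feat, user_feat])
    if mode == "add":
        return speech_feat + user_feat
    if mode == "sub":
        return speech_feat - user_feat
    raise ValueError(f"unknown fusion mode: {mode}")

fused = fuse_features(np.random.rand(256), np.random.rand(64), mode="concat")
print(fused.shape)  # (320,)
```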
404. And the server calls the voice recognition layer to recognize the fusion characteristics to obtain text data corresponding to the voice data.
And inputting the fusion characteristics into a voice recognition layer, recognizing the voice data in the voice recognition layer, converting the voice data into text data, and recognizing the voice data of the first user identifier.
For example, suppose the voice data of the first user identifier is a greeting such as "Hello, may I ask what question you have". When this voice data is converted into text data in the related art, it is recognized directly, and because of the user's colloquial speaking habits one form of "you" (for example the polite form "您") may be recognized as another (the plain form "你").
When the user relationship network is used for recognition, the multiple user identifiers in the network have association relationships, and the speaking styles, such as accent and wording habits, of the corresponding users are considered to be similar. When the user features of these user identifiers are obtained from the user relationship network during recognition of their voice data, the obtained user features are also similar, so these shared habits are taken into account in the recognition.
In one possible implementation, the first speech recognition model adopts an Encoder-Decoder framework, in which the speech feature extraction layer is the Encoder and the speech recognition layer is the Decoder. The speech feature extraction layer and the speech recognition layer may each be a convolutional neural network, a recurrent neural network, or the like.
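As a hedged illustration of the Encoder-Decoder framing, the sketch below pairs a GRU encoder over acoustic frames with a GRU decoder over text tokens; the recurrent choice, the sizes, and the token interface are assumptions, since the patent only says these layers may be convolutional or recurrent networks.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Speech feature extraction layer as the Encoder (sizes assumed)."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)

    def forward(self, frames):                  # frames: (B, T, n_mels)
        outputs, state = self.rnn(frames)       # outputs: (B, T, H)
        return outputs, state

class Decoder(nn.Module):
    """Speech recognition layer as the Decoder producing token logits."""
    def __init__(self, vocab_size=5000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, state):           # tokens: (B, L)
        x, state = self.rnn(self.embed(tokens), state)
        return self.out(x), state               # logits: (B, L, vocab)
```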
Before the server calls the first speech recognition model, the server needs to train the first speech recognition model, and the training process of the first speech recognition model comprises the following steps:
1. the server obtains a sample user relationship network of the sample user identification, sample voice data of the sample user identification and sample text data corresponding to the sample voice data.
2. And the server calls the voice feature extraction layer to extract the features of the sample voice data to obtain the predicted voice features of the sample voice data.
The first speech recognition model here is the initial first speech recognition model or a first speech recognition model that has undergone one or more rounds of training.
3. The server invokes the user feature extraction layer to perform feature extraction on the sample user relationship network to obtain the predicted user features of the sample user identifier.
4. And calling the feature fusion layer by the server, and fusing the predicted voice features and the predicted user features to obtain the predicted fusion features corresponding to the sample voice data.
5. And the server calls the voice recognition layer to recognize the prediction fusion characteristics to obtain prediction text data corresponding to the sample voice data.
6. The server adjusts parameters in the first speech recognition model according to the sample text data and the predicted text data, that is, adjusts the parameters according to the difference between the sample text data and the predicted text data, to obtain the trained first speech recognition model (a minimal training-step sketch is given after this list).
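The six steps above amount to an ordinary supervised update driven by the difference between the predicted text data and the sample text data. The sketch below shows one such update step; the per-position cross-entropy loss, the padding index, and the assumption that target tokens are aligned with output frames are illustrative choices (in practice a CTC or sequence-to-sequence loss would be more typical), none of which are specified by the patent.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, speech_frames, user_embedding, target_tokens):
    """One supervised update: predict text from (speech, user) and adjust parameters
    from the difference between predicted and sample text (loss choice assumed)."""
    logits = model(speech_frames, user_embedding)           # (B, T, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_tokens.reshape(-1),
                           ignore_index=0)                   # 0 = padding id (assumed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```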
The sample user relationship network, the sample voice data, and the corresponding sample text data of the sample user identifier are similar to the user relationship network, the voice data, and the text data of the first user identifier in the above embodiments, and are not described herein again.
For example, suppose a first sample user is a customer service person and the multiple associated second sample users are also customer service people; these users are likely to use honorifics frequently when speaking. In the process of training the first speech recognition model, the user features of these sample users are similar. By training the first speech recognition model with the sample user relationship network and the sample voice data, the model learns that text data corresponding to similar user features often contains honorific expressions, so when it recognizes voice data uttered by similar users, the recognized text data also contains honorific expressions.
Training with the sample user relationship network of the sample users also helps with data collection. Because the sample users in the same sample user relationship network have association relationships with each other, their user features are similar. Therefore, during training, for a fixed total amount of required sample voice data, the sample voice data can be recorded by one sample user in the network or shared among several sample users, so each sample user only needs to record a small amount of voice data. This shares the recording burden, reduces the workload of a single sample user, and avoids the problem of insufficient sample voice data when only one sample user records.
The sample user relationship network may be a user relationship network created in advance in social applications, or may be a user relationship network obtained by a technician temporarily creating an association relationship between a plurality of users, and the sample user relationship network may reflect an association relationship between users in real life.
In addition, in this embodiment of the application, the server obtains user features from the user relationship network to recognize the voice data of the first user identifier. Therefore, for any new user identifier, when its voice data needs to be recognized, only the user relationship network of the new user identifier needs to be obtained; the first speech recognition model can then obtain the user features of the new user identifier and recognize its voice data. The trained first speech recognition model thus has the ability to recognize the voice data of any user identifier without having to be trained on the voice data of all user identifiers, which improves the generality of the model.
In addition, referring to the flowchart shown in fig. 5, the process of training and using the first speech recognition model, referring to fig. 5, includes the following steps:
501. and the server acquires the user relationship network of each user identifier according to the plurality of user identifiers in the social application and the association relationship among the plurality of user identifiers.
502. The server labels the voice data of each user identifier with the corresponding text data to obtain sample data, where the sample data includes a sample user relationship network, sample voice data, and corresponding sample text data.
503. And the server trains the first voice recognition model according to the sample data to obtain the trained first voice recognition model.
504. And the server adopts the trained first voice recognition model to realize the recognition of the voice data of any user identification.
The training process of the first speech recognition model may be performed by the server or by another device; this is not limited in this embodiment of the application. If the training is performed by another device, that device sends the trained first speech recognition model to the server, and the server invokes the first speech recognition model to recognize voice data.
It should be noted that the training process is described by taking only one sample user identifier as an example, a plurality of sample user identifiers may be used in training the first speech recognition model, and the training process of each sample user identifier is similar to the training process of the sample user identifier, and is not described herein again.
According to the method provided in this embodiment of the application, the first speech recognition model is used for speech recognition: the speech feature extraction layer is invoked to extract the voice features of the voice data, and the user feature extraction layer is invoked to perform feature extraction on the user relationship network of the user identifier to obtain the user features, which can reflect the user's speaking style. The feature fusion layer is invoked to fuse the voice features and the user features, and the speech recognition layer is invoked to recognize the fusion features, so that the content of the voice data and the user's speaking style are considered together. The recognized text data therefore better matches the user's speaking style and the voice data, which improves the accuracy of speech recognition.
In addition, in the process of training the first speech recognition model, feature extraction is performed on the sample voice data to obtain predicted voice features, and feature extraction is performed on the sample user relationship network of the sample user identifier to obtain predicted user features, which can reflect the speaking style of the sample user. The predicted voice features and the predicted user features are fused, and the predicted fusion features are recognized, so that the content of the sample voice data and the speaking style of the sample user are considered together. Training the first speech recognition model from the predicted text data and the sample text data can therefore improve its recognition accuracy.
Moreover, the sample user identifiers in the same sample user relationship network have association relationships, their user features are similar, and the speaking styles of the corresponding sample users are likely to be similar. Therefore, during training, for a fixed amount of required sample voice data, multiple sample users can record the sample voice data, and each sample user only needs to record a small amount. This shares the recording burden, reduces the workload of a single sample user, and avoids the problem of insufficient sample voice data when only one sample user records.
The embodiment of the present application further provides a second speech recognition mode, where the second speech recognition mode performs speech recognition by using the user feature extraction model 601 and the second speech recognition model 602 shown in fig. 6, where, referring to fig. 7, the second speech recognition model 602 includes a speech feature extraction layer 6021, a user feature acquisition layer 6022, a feature fusion layer 6023, and a speech recognition layer 6024. The user relationship network of the user identifier is input to the user feature extraction model 601, so that the feature extraction model 601 is called to output the user feature of the user identifier, the user identifier and the voice data of the user identifier are input to the second voice recognition model 602, and the second voice recognition model is called to output the text data corresponding to the voice data.
The user feature extraction model 601 is connected with a user feature acquisition layer 6022, the voice feature extraction layer 6021 is connected with a feature fusion layer 6023, the user feature acquisition layer 6022 is connected with a feature fusion layer 6023, and the feature fusion layer 6023 is connected with a voice recognition layer 6024. The user feature extraction model 601 is used for extracting and storing user features of the user identifier, the voice feature extraction layer 6021 is used for extracting voice features of the voice data, the user feature acquisition layer 6022 is used for inquiring the user features of the user identifier, the feature fusion layer 6023 is used for fusing the voice features and the user features, and the voice recognition layer 6024 is used for recognizing text data corresponding to the voice data.
On the basis of the user feature extraction model 601 and the second speech recognition model 602, a speech recognition method based on the user feature extraction model and the second speech recognition model is further provided, and details are shown in the following embodiments.
Fig. 8 is a flowchart of a speech recognition method according to an embodiment of the present application. Referring to fig. 8, the execution subject of the method is a server, and the method includes:
801. and the server calls a user feature extraction model to extract features of the user relationship network of any user identifier to obtain the user features of the user identifier.
The user feature extraction model is used for extracting user features according to a user relationship network of any user identification.
If the user relationship network is a homogeneous graph, then, referring to fig. 9, the user feature extraction model 900 is a graph convolution network model, where the solid dot represents a first user identifier and the hollow dots represent second user identifiers. The homogeneous graph formed by the six user identifiers is input to the user feature extraction model 900, which performs feature extraction on the input graph to obtain the user features of the six user identifiers in the graph, thereby extracting the user features.
Before voice recognition is carried out, the server inputs the user relationship network of any user identification into the user characteristic extraction model, and the user characteristic extraction model is called to obtain the user characteristics of each user identification in the user relationship network. Of course, the server may also input a plurality of user relationship networks of different user identities to the user feature extraction model, thereby obtaining user features of more user identities.
802. The server establishes a corresponding relation between any user identification and the user characteristics of the user identification.
The server establishes a corresponding relation between the user identification and the user characteristics according to the obtained plurality of user identifications and the corresponding user characteristics, so that the corresponding relation is inquired to obtain the user characteristics of the user identification in the subsequent voice recognition. The correspondence may be stored in a list form or other forms.
In addition, if a new user identifier is added, the user features of the new user identifier are obtained, and the new user identifier and its user features are added to the established correspondence, so that the correspondence is updated.
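The correspondence between user identifiers and user features may be stored as a list or in another form; a minimal dictionary-backed sketch, including the update for newly added identifiers described above, is shown below. The class name and identifiers are hypothetical.

```python
class UserFeatureStore:
    """Minimal lookup table from user identifier to user feature (structure assumed)."""

    def __init__(self):
        self._features = {}                      # user_id -> feature vector

    def add(self, user_id, feature):
        """Register (or update) the feature of a user identifier."""
        self._features[user_id] = feature

    def get(self, user_id):
        """Query the feature of a user identifier; None if not yet extracted."""
        return self._features.get(user_id)

store = UserFeatureStore()
store.add("user_a", [0.12, -0.4, 0.7])           # hypothetical identifier and feature
print(store.get("user_a"))
```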
803. The server invokes the voice feature extraction layer to perform feature extraction on the voice data of the first user identifier to obtain the voice features of the voice data.
The implementation of step 803 is similar to that of step 401 in the embodiment shown in fig. 4, and is not described herein again.
804. The server calls the user feature acquisition layer and queries the user features of the first user identifier according to the established correspondence.
The server inputs the first user identifier into the user feature acquisition layer and queries the user features of the first user identifier in the established correspondence. By acquiring the user features through a query, feature extraction does not need to be performed on the user relationship network of the user identifier every time voice recognition is carried out, which reduces the processing amount during recognition and improves the recognition speed.
805. The server calls the feature fusion layer to perform fusion processing on the voice features and the user features to obtain the fusion features corresponding to the voice data.
806. The server calls the voice recognition layer to perform recognition processing on the fusion features to obtain the text data corresponding to the voice data.
The implementation of steps 805 to 806 is similar to the implementation of steps 403 to 404 in the embodiment shown in fig. 4, and is not described herein again.
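The present application does not spell out here how the voice recognition layer turns the fusion features into text. Purely as an illustrative assumption, the sketch below applies CTC-style greedy decoding to per-frame token scores, collapsing repeated tokens and dropping a blank symbol; this is one common choice, not necessarily the one used by the disclosed models.

```python
import numpy as np

def greedy_decode(frame_scores, id_to_char, blank_id=0):
    best = frame_scores.argmax(axis=1)            # most likely token per frame
    text, prev = [], blank_id
    for tok in best:
        if tok != prev and tok != blank_id:       # collapse repeats, skip blanks
            text.append(id_to_char[int(tok)])
        prev = tok
    return "".join(text)

vocab = {0: "", 1: "h", 2: "i"}                   # hypothetical tiny vocabulary
scores = np.array([[0.1, 0.8, 0.1],               # frame 1 -> "h"
                   [0.1, 0.7, 0.2],               # frame 2 -> "h" (repeat, collapsed)
                   [0.6, 0.2, 0.2],               # frame 3 -> blank
                   [0.1, 0.1, 0.8]])              # frame 4 -> "i"
print(greedy_decode(scores, vocab))               # "hi"
```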
Before the server calls the user feature extraction model, the user feature extraction model needs to be trained, and the training process of the user feature extraction model comprises the following steps:
The server obtains a sample user relationship network of a sample user identifier and the sample user features of the sample user identifier in the sample user relationship network, and trains the user feature extraction model according to the sample user relationship network and the sample user features. The user feature extraction model to be trained is the initial user feature extraction model or a user feature extraction model that has already been trained one or more times.
Optionally, during training, the sample user features of a plurality of sample user identifiers belonging to the same sample user relationship network are set to be similar, so that the user feature extraction model learns to produce similar user features for the plurality of user identifiers in the same user relationship network. The model thereby extracts the features these user identifiers have in common, which facilitates subsequent speech recognition based on the association relationships among the user identifiers.
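One plausible way to express "set to be similar" as a training signal is a pairwise similarity loss over the user features extracted from one sample user relationship network, as sketched below. The cosine-similarity formulation is an assumption of this sketch rather than the training objective of the application.

```python
import numpy as np

def same_network_loss(features):
    # features: (n_users, dim) extracted for one sample user relationship network.
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = norm @ norm.T                              # pairwise cosine similarity
    n = len(features)
    off_diag = sim[~np.eye(n, dtype=bool)]           # similarities between distinct users
    return float((1.0 - off_diag).mean())            # approaches 0 as features align

feats = np.random.default_rng(3).standard_normal((6, 32))
print(same_network_loss(feats))                      # decreases as training makes features similar
```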
Before the server calls the second speech recognition model, the server needs to train the second speech recognition model, and the training process of the second speech recognition model comprises the following steps:
1. The server obtains a sample user identifier, sample voice data of the sample user identifier, and sample text data corresponding to the sample voice data.
2. The server calls the voice feature extraction layer to perform feature extraction on the sample voice data to obtain the predicted voice features of the sample voice data.
The second speech recognition model to be trained is the initial second speech recognition model or a second speech recognition model that has already been trained one or more times.
3. The server calls the user feature acquisition layer and queries the user features of the first user identifier according to the established correspondence.
4. The server calls the feature fusion layer to fuse the predicted voice features and the user features to obtain the predicted fusion features corresponding to the sample voice data.
5. The server calls the voice recognition layer to perform recognition processing on the predicted fusion features to obtain the predicted text data corresponding to the sample voice data.
6. The server adjusts parameters in the second speech recognition model according to the sample text data and the predicted text data.
The sample user identifier, the sample voice data, and the corresponding sample text data are similar to the first user identifier, the voice data, and the text data in the above embodiments, and are not described herein again.
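To make steps 1 to 6 concrete, the following is a hedged sketch of one training iteration for a second speech recognition model. The GRU-based voice feature extraction layer, concatenation fusion, linear recognition layer, frame-level cross-entropy loss, and all dimensions are assumptions of this sketch, and the randomly generated user features stand in for the features looked up from the established correspondence.

```python
import torch
import torch.nn as nn

class SecondSpeechRecognitionModel(nn.Module):
    def __init__(self, feat_dim=40, user_dim=32, hidden=64, vocab=100):
        super().__init__()
        self.voice_layer = nn.GRU(feat_dim, hidden, batch_first=True)   # 6021
        self.fusion = nn.Linear(hidden + user_dim, hidden)              # 6023
        self.recognition = nn.Linear(hidden, vocab)                     # 6024

    def forward(self, frames, user_feat):
        voice, _ = self.voice_layer(frames)                  # (B, T, hidden)
        tiled = user_feat.unsqueeze(1).expand(-1, voice.size(1), -1)
        fused = torch.tanh(self.fusion(torch.cat([voice, tiled], dim=-1)))
        return self.recognition(fused)                       # (B, T, vocab)

model = SecondSpeechRecognitionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

frames = torch.randn(2, 50, 40)               # step 1: sample voice data (2 clips)
user_feat = torch.randn(2, 32)                # step 3: looked-up user features
labels = torch.randint(0, 100, (2, 50))       # frame-aligned sample text tokens (assumed)

logits = model(frames, user_feat)             # steps 2, 4, 5
loss = loss_fn(logits.reshape(-1, 100), labels.reshape(-1))
optimizer.zero_grad()
loss.backward()                               # step 6: adjust model parameters
optimizer.step()
```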
In one possible implementation manner, the user feature extraction model and the second speech recognition model may be trained by using the same sample data, or may be trained by using different sample data.
In addition, the training and use of the user feature extraction model and the second speech recognition model are implemented similarly to the training and use of the first speech recognition model, and are not repeated here.
In the method provided by the embodiment of the application, the second voice recognition model is used for voice recognition. The voice feature extraction layer is called to perform feature extraction on the voice data to obtain the voice features, the user feature extraction model is used to establish the correspondence between any user identifier and its user features, and the user feature acquisition layer is called to query the correspondence to obtain the user features, which can reflect the speaking mode of the user. The feature fusion layer is then called to fuse the voice features with the user features, and the voice recognition layer is called to perform recognition processing on the fusion features. In this way, the content of the voice data and the speaking mode of the user are comprehensively considered, and the voice data is converted according to the speaking mode of the user, so that the obtained text data better conforms to the speaking mode of the user, the recognized text data is ensured to match the voice data, and the accuracy of voice recognition is improved.
Moreover, the user feature extraction model is used to extract the user features of any user identifier from the user relationship network of that user identifier, and the correspondence between the user identifier and the user features is established. When the second voice recognition model is used for voice recognition, the user features of the user identifier can be queried directly from the established correspondence, so that feature extraction does not need to be performed on the user relationship network each time, which improves operating efficiency and recognition speed.
Fig. 10 is a flowchart of a speech recognition method according to an embodiment of the present application. Referring to fig. 10, the interaction subject of the method is a terminal and a server, and the method includes:
1001. The terminal of user A displays a chat interface, and the chat interface includes voice data sent to user A by user B. User A and user B may be in a friend relationship.
1002. If it is currently inconvenient for user A to listen to the voice data, user A long-presses the voice data and a "voice to text" button is displayed in the chat interface; if user A clicks the button, the terminal sends a voice recognition instruction to the server, where the voice recognition instruction carries the voice data sent by user B and the user identifier of user B.
1003. The server receives the voice recognition instruction, queries the user features of user B, performs recognition by fusing the user features of user B with the voice data to obtain the text data corresponding to the voice data, and sends the text data to the terminal.
The implementation of step 1003 is similar to the implementation shown in fig. 4 or fig. 7, and is not described herein again.
1004. The terminal receives the text data and displays it through the social application, at which point user A can view the text data and learn the content sent by user B.
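An in-process sketch of this terminal/server exchange is given below. The function names, request fields, and the stand-in recognizer are assumptions of the sketch; a real deployment would carry the request over an RPC or HTTP transport inside the social application's backend.

```python
def server_handle_recognition(request, feature_store, recognize):
    user_feat = feature_store.get(request["sender_id"])       # step 1003: query user B's features
    text = recognize(request["voice_data"], user_feat)        # fuse features and recognize
    return {"text": text}

def terminal_long_press_to_text(voice_data, sender_id, send_to_server):
    request = {"voice_data": voice_data, "sender_id": sender_id}   # step 1002: send instruction
    response = send_to_server(request)
    return response["text"]                                    # step 1004: display to user A

feature_store = {"user_b": [0.1, -0.3]}                        # hypothetical stored user features
dummy_recognize = lambda voice, feat: "ok, see you at eight"   # stands in for the trained model
shown = terminal_long_press_to_text(
    b"...", "user_b",
    lambda req: server_handle_recognition(req, feature_store, dummy_recognize))
print(shown)
```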
Fig. 11 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application. Referring to fig. 11, the apparatus includes:
the voice feature acquisition module 1101 is configured to perform feature extraction on the voice data of the first user identifier to obtain a voice feature of the voice data;
a user characteristic obtaining module 1102, configured to obtain a user characteristic of the first user identifier, where the user characteristic is obtained by performing characteristic extraction on a user relationship network of the first user identifier, and the user relationship network includes an association relationship between the first user identifier and at least one second user identifier;
a fusion feature obtaining module 1103, configured to perform fusion processing on the voice features and the user features to obtain fusion features corresponding to the voice data;
and the voice recognition module 1104 is configured to perform recognition processing on the fusion features to obtain text data corresponding to the voice data.
With the apparatus provided by the embodiment of the application, in the process of recognizing the voice data, not only are the voice features of the voice data extracted, but feature extraction is also performed according to the user relationship network of the user identifier to obtain the user features, which can reflect the speaking mode of the user. The voice features and the user features are fused, and the fused features are recognized, so that the content of the voice data and the speaking mode of the user are comprehensively considered. The text data obtained through the recognition process therefore better conforms to the speaking mode of the user and better matches the voice data, which improves the accuracy of voice recognition.
Optionally, the user characteristic obtaining module 1102 is configured to invoke the user feature extraction layer of the first speech recognition model to perform feature extraction on the user relationship network to obtain the user features of the first user identifier.
Optionally, the step of extracting the feature of the voice data of the first user identifier to obtain the voice feature of the voice data is executed by calling a voice feature extraction layer of the first voice recognition model;
performing fusion processing on the voice characteristics and the user characteristics to obtain fusion characteristics corresponding to the voice data, and executing by calling a characteristic fusion layer of the first voice recognition model;
and identifying the fusion characteristics to obtain text data corresponding to the voice data, and executing by calling a voice identification layer of the first voice identification model.
Optionally, the training process of the first speech recognition model includes the following steps:
acquiring a sample user relationship network of a sample user identifier, sample voice data of the sample user identifier and sample text data corresponding to the sample voice data;
calling a voice feature extraction layer, and performing feature extraction on the sample voice data to obtain predicted voice features of the sample voice data;
calling a sample user characteristic acquisition layer, and performing characteristic extraction on the sample user relationship network to obtain the predicted user characteristics of the sample user identification;
calling a feature fusion layer, and carrying out fusion processing on the predicted voice features and the predicted user features to obtain predicted fusion features corresponding to the sample voice data;
calling a voice recognition layer, and recognizing the prediction fusion characteristics to obtain prediction text data corresponding to the sample voice data;
and adjusting parameters in the first speech recognition model according to the sample text data and the predicted text data.
Optionally, referring to fig. 12, the apparatus further comprises:
a user identifier obtaining module 1105, configured to obtain at least one second user identifier associated with the first user identifier;
a graph creating module 1106, configured to create a graph according to the first user identifier and the at least one second user identifier, where the graph includes a first user node corresponding to the first user identifier, a second user node corresponding to the at least one second user identifier, and a connection line between the first user node and the at least one second user node.
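As a small illustration of what the graph creating module 1106 might produce, the sketch below builds a node/edge representation from a first user identifier and its associated second user identifiers; the dictionary format and helper name are assumptions of the sketch.

```python
def create_user_graph(first_id, second_ids):
    nodes = [first_id] + list(second_ids)
    edges = [(first_id, sid) for sid in second_ids]   # first user node connected to each second user node
    return {"nodes": nodes, "edges": edges}

graph = create_user_graph("user_a", ["user_b", "user_c", "user_d"])
print(graph["edges"])   # [('user_a', 'user_b'), ('user_a', 'user_c'), ('user_a', 'user_d')]
```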
Optionally, the user characteristic obtaining module 1102 is configured to query the user features of the first user identifier according to an established correspondence, where the established correspondence includes at least one user identifier and the corresponding user features, and the user features corresponding to any user identifier in the established correspondence are obtained by performing feature extraction on the user relationship network of that user identifier.
Optionally, referring to fig. 12, the apparatus further comprises:
the user characteristic obtaining module 1102 is further configured to invoke the user feature extraction model to perform feature extraction on the user relationship network of any user identifier to obtain the user features of that user identifier;
a relationship establishing module 1107, configured to establish the correspondence between any user identifier and its user features.
Optionally, the training process of the user feature extraction model includes the following steps:
acquiring a sample user relationship network of a sample user identifier and sample user characteristics of the sample user identifier in the sample user relationship network;
and training a user characteristic extraction model according to the sample user relationship network and the sample user characteristics.
Optionally, the step of extracting the feature of the voice data of the first user identifier to obtain the voice feature of the voice data is executed by calling a voice feature extraction layer of the second voice recognition model;
inquiring the user characteristics of the first user identification according to the established corresponding relation, and executing by calling a user characteristic acquisition layer of a second voice recognition model;
performing fusion processing on the voice characteristics and the user characteristics to obtain fusion characteristics corresponding to the voice data, and executing by calling a characteristic fusion layer of a second voice recognition model;
and identifying the fusion characteristics to obtain text data corresponding to the voice data, and executing by calling a voice identification layer of the second voice identification model.
Optionally, the training process of the second speech recognition model includes the following steps:
acquiring a sample user identifier, sample voice data of the sample user identifier and sample text data corresponding to the sample voice data;
calling a voice feature extraction layer, and performing feature extraction on the sample voice data to obtain predicted voice features of the sample voice data;
calling a user characteristic acquisition layer, and inquiring the user characteristics of the first user identification according to the established corresponding relation;
calling a feature fusion layer, and carrying out fusion processing on the predicted voice features and the user features to obtain predicted fusion features corresponding to the sample voice data;
calling a voice recognition layer, and recognizing the prediction fusion characteristics to obtain prediction text data corresponding to the sample voice data;
and adjusting parameters in the second speech recognition model according to the sample text data and the predicted text data.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
It should be noted that: in the speech recognition apparatus provided in the above embodiment, when recognizing speech, only the division of the above functional modules is used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the server is divided into different functional modules to complete all or part of the above described functions. In addition, the speech recognition apparatus and the speech recognition method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Fig. 13 shows a block diagram of a terminal 1300 according to an exemplary embodiment of the present application. The terminal 1300 may be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal 1300 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and the like.
In general, terminal 1300 includes: a processor 1301 and a memory 1302.
Processor 1301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1301 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1301 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also referred to as a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1301 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing content that the display screen needs to display. In some embodiments, processor 1301 may further include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
Memory 1302 may include one or more computer-readable storage media, which may be non-transitory. The memory 1302 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1302 is used to store at least one instruction for execution by processor 1301 to implement the speech recognition methods provided by method embodiments herein.
In some embodiments, terminal 1300 may further optionally include: a peripheral interface 1303 and at least one peripheral. Processor 1301, memory 1302, and peripheral interface 1303 may be connected by a bus or signal line. Each peripheral device may be connected to the peripheral device interface 1303 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1304, display screen 1305, camera assembly 1306, audio circuitry 1307, positioning assembly 1308, and power supply 1309.
Peripheral interface 1303 may be used to connect at least one peripheral associated with I/O (Input/Output) to processor 1301 and memory 1302. In some embodiments, processor 1301, memory 1302, and peripheral interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1301, the memory 1302, and the peripheral device interface 1303 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 1304 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 1304 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1304 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1304 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 1304 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1304 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 1305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1305 is a touch display screen, the display screen 1305 also has the ability to capture touch signals on or over the surface of the display screen 1305. The touch signal may be input to the processor 1301 as a control signal for processing. At this point, the display 1305 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, display 1305 may be one, disposed on the front panel of terminal 1300; in other embodiments, display 1305 may be at least two, either on different surfaces of terminal 1300 or in a folded design; in other embodiments, display 1305 may be a flexible display disposed on a curved surface or on a folded surface of terminal 1300. Even further, the display 1305 may be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The Display 1305 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-emitting diode), or the like.
The camera assembly 1306 is used to capture images or video. Optionally, camera assembly 1306 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1306 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 1307 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1301 for processing, or inputting the electric signals to the radio frequency circuit 1304 for realizing voice communication. For stereo capture or noise reduction purposes, multiple microphones may be provided, each at a different location of terminal 1300. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1301 or the radio frequency circuitry 1304 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 1307 may also include a headphone jack.
The positioning component 1308 is used for positioning the current geographic position of the terminal 1300 to implement navigation or LBS (Location Based Service). The positioning component 1308 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
Power supply 1309 is used to provide power to various components in terminal 1300. The power source 1309 may be alternating current, direct current, disposable or rechargeable. When the power source 1309 comprises a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1300 also includes one or more sensors 1310. The one or more sensors 1310 include, but are not limited to: acceleration sensor 1311, gyro sensor 1312, pressure sensor 1313, fingerprint sensor 1314, optical sensor 1315, and proximity sensor 1316.
The acceleration sensor 1311 can detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal 1300. For example, the acceleration sensor 1311 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 1301 may control the display screen 1305 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1311. The acceleration sensor 1311 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1312 may detect the body direction and the rotation angle of the terminal 1300, and the gyro sensor 1312 may cooperate with the acceleration sensor 1311 to acquire a 3D motion of the user with respect to the terminal 1300. Processor 1301, based on the data collected by gyroscope sensor 1312, may perform the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensor 1313 may be disposed on a side bezel of terminal 1300 and/or underlying display 1305. When the pressure sensor 1313 is disposed on the side frame of the terminal 1300, a user's holding signal to the terminal 1300 may be detected, and the processor 1301 performs left-right hand recognition or shortcut operation according to the holding signal acquired by the pressure sensor 1313. When the pressure sensor 1313 is disposed at a lower layer of the display screen 1305, the processor 1301 controls an operability control on the UI interface according to a pressure operation of the user on the display screen 1305. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1314 is used for collecting the fingerprint of the user, and the processor 1301 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 1314, or the fingerprint sensor 1314 identifies the identity of the user according to the collected fingerprint. When the identity of the user is identified as a trusted identity, the processor 1301 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 1314 may be disposed on the front, back, or side of the terminal 1300. When a physical button or vendor Logo is provided on the terminal 1300, the fingerprint sensor 1314 may be integrated with the physical button or vendor Logo.
The optical sensor 1315 is used to collect the ambient light intensity. In one embodiment, the processor 1301 may control the display brightness of the display screen 1305 according to the ambient light intensity collected by the optical sensor 1315. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1305 is increased; when the ambient light intensity is low, the display brightness of the display screen 1305 is reduced. In another embodiment, the processor 1301 can also dynamically adjust the shooting parameters of the camera assembly 1306 according to the ambient light intensity collected by the optical sensor 1315.
Proximity sensor 1316, also known as a distance sensor, is typically disposed on a front panel of terminal 1300. Proximity sensor 1316 is used to gather the distance between the user and the front face of terminal 1300. In one embodiment, the processor 1301 controls the display 1305 to switch from the bright screen state to the dark screen state when the proximity sensor 1316 detects that the distance between the user and the front face of the terminal 1300 gradually decreases; the processor 1301 controls the display 1305 to switch from the dark screen state to the bright screen state when the proximity sensor 1316 detects that the distance between the user and the front face of the terminal 1300 gradually increases.
Those skilled in the art will appreciate that the configuration shown in fig. 13 is not intended to be limiting with respect to terminal 1300 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be employed.
Fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1400 may vary greatly depending on configuration or performance, and may include one or more processors (CPUs) 1401 and one or more memories 1402, where the memory 1402 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 1401 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing device functions, which are not described herein again.
The server 1400 may be used to perform the steps performed by the server in the speech recognition method described above.
The embodiment of the present application further provides a computer device, where the computer device includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the operations performed in the voice recognition method of the foregoing embodiment.
The embodiment of the present application further provides a computer-readable storage medium, where at least one instruction is stored in the computer-readable storage medium, and the at least one instruction is loaded and executed by a processor to implement the operations performed in the voice recognition method of the foregoing embodiment.
Embodiments of the present application also provide a computer program product including computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device realizes the operations performed in the voice recognition method of the above-described embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only an alternative embodiment of the present application and is not intended to limit the present application, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method of speech recognition, the method comprising:
performing feature extraction on voice data of a first user identifier to obtain voice features of the voice data;
acquiring user characteristics of the first user identifier, wherein the user characteristics are obtained by performing characteristic extraction on a user relationship network of the first user identifier, and the user relationship network comprises an association relationship between the first user identifier and at least one second user identifier;
performing fusion processing on the voice features and the user features to obtain fusion features corresponding to the voice data;
and identifying the fusion characteristics to obtain text data corresponding to the voice data.
2. The method of claim 1, wherein the obtaining the user characteristic of the first subscriber identity comprises:
and calling a user feature extraction layer of a first voice recognition model, and performing feature extraction on the user relationship network to obtain the user features of the first user identifier.
3. The method of claim 2, wherein the step of performing feature extraction on the voice data of the first user identifier to obtain the voice feature of the voice data is performed by calling a voice feature extraction layer of the first voice recognition model;
the step of performing fusion processing on the voice features and the user features to obtain fusion features corresponding to the voice data is executed by calling a feature fusion layer of the first voice recognition model;
and the step of identifying the fusion characteristics to obtain text data corresponding to the voice data is executed by calling a voice recognition layer of the first voice recognition model.
4. The method of claim 3, wherein the training process of the first speech recognition model comprises the steps of:
acquiring a sample user relationship network of a sample user identifier, sample voice data of the sample user identifier and sample text data corresponding to the sample voice data;
calling the voice feature extraction layer to extract features of the sample voice data to obtain predicted voice features of the sample voice data;
calling the sample user characteristic acquisition layer, and performing characteristic extraction on the sample user relationship network to obtain the predicted user characteristics of the sample user identification;
calling the feature fusion layer, and carrying out fusion processing on the predicted voice features and the predicted user features to obtain predicted fusion features corresponding to the sample voice data;
calling the voice recognition layer, and carrying out recognition processing on the prediction fusion characteristics to obtain prediction text data corresponding to the sample voice data;
adjusting parameters in the first speech recognition model according to the sample text data and the predicted text data.
5. The method according to claim 1, wherein the user relationship network is a homogeneous graph, and before the obtaining the user characteristic of the first user identifier, the method further comprises:
acquiring at least one second user identifier associated with the first user identifier;
and creating a graph according to the first user identifier and the at least one second user identifier, wherein the graph comprises a first user node corresponding to the first user identifier, a second user node corresponding to the at least one second user identifier and a connecting line between the first user node and the at least one second user node.
6. The method of claim 1, wherein the obtaining the user characteristic of the first subscriber identity comprises:
and inquiring the user characteristics of the first user identification according to the established corresponding relation, wherein the established corresponding relation comprises at least one user identification and corresponding user characteristics, and the user characteristics corresponding to any user identification in the established corresponding relation are obtained by performing characteristic extraction on the user relationship network of any user identification.
7. The method according to claim 6, wherein before querying the user characteristics of the first user identifier according to the established correspondence, the method further comprises:
calling a user feature extraction model, and performing feature extraction on the user relationship network of any user identifier to obtain the user feature of any user identifier;
and establishing a corresponding relation between the any user identification and the user characteristics.
8. The method of claim 7, wherein the training process of the user feature extraction model comprises the following steps:
obtaining a sample user relationship network of a sample user identifier and sample user characteristics of the sample user identifier in the sample user relationship network;
and training the user characteristic extraction model according to the sample user relationship network and the sample user characteristics.
9. The method of claim 6, wherein the step of extracting the feature of the voice data of the first user identifier to obtain the voice feature of the voice data is performed by calling a voice feature extraction layer of a second voice recognition model;
the step of inquiring the user characteristics of the first user identification according to the established corresponding relation is executed by calling a user characteristic acquisition layer of the second voice recognition model;
the step of performing fusion processing on the voice features and the user features to obtain fusion features corresponding to the voice data is executed by calling a feature fusion layer of the second voice recognition model;
and the step of identifying the fusion characteristics to obtain the text data corresponding to the voice data is executed by calling a voice recognition layer of the second voice recognition model.
10. The method of claim 9, wherein the training process of the second speech recognition model comprises the steps of:
acquiring a sample user identifier, sample voice data of the sample user identifier and sample text data corresponding to the sample voice data;
calling the voice feature extraction layer to extract features of the sample voice data to obtain predicted voice features of the sample voice data;
calling the user characteristic acquisition layer, and inquiring the user characteristics of the first user identification according to the established corresponding relation;
calling the feature fusion layer, and carrying out fusion processing on the predicted voice features and the user features to obtain predicted fusion features corresponding to the sample voice data;
calling the voice recognition layer, and carrying out recognition processing on the prediction fusion characteristics to obtain prediction text data corresponding to the sample voice data;
and adjusting parameters in the second speech recognition model according to the sample text data and the predicted text data.
11. A speech recognition apparatus, characterized in that the apparatus comprises:
the voice feature acquisition module is used for extracting features of voice data of the first user identifier to obtain voice features of the voice data;
a user characteristic obtaining module, configured to obtain a user characteristic of the first user identifier, where the user characteristic is obtained by performing characteristic extraction on a user relationship network of the first user identifier, and the user relationship network includes an association relationship between the first user identifier and at least one second user identifier;
a fusion feature obtaining module, configured to perform fusion processing on the voice feature and the user feature to obtain a fusion feature corresponding to the voice data;
and the voice recognition module is used for recognizing the fusion characteristics to obtain text data corresponding to the voice data.
12. The apparatus according to claim 11, wherein the user characteristic obtaining module is configured to invoke a user characteristic extraction layer of a first speech recognition model, perform characteristic extraction on the user relationship network, and obtain the user characteristic of the first user identifier.
13. The apparatus according to claim 11, wherein the user characteristic obtaining module is configured to query a user characteristic of the first user identifier according to an established corresponding relationship, where the established corresponding relationship includes at least one user identifier and a corresponding user characteristic, and a user characteristic corresponding to any user identifier in the established corresponding relationship is obtained by performing characteristic extraction on a user relationship network of that user identifier.
14. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, the at least one instruction being loaded and executed by the processor to perform operations performed in the speech recognition method of any of claims 1 to 10.
15. A computer-readable storage medium having stored thereon at least one instruction, which is loaded and executed by a processor, to perform operations performed in the speech recognition method according to any one of claims 1 to 10.
CN202010622097.5A 2020-07-01 2020-07-01 Speech recognition method, device, computer equipment and medium Active CN111739517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010622097.5A CN111739517B (en) 2020-07-01 2020-07-01 Speech recognition method, device, computer equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010622097.5A CN111739517B (en) 2020-07-01 2020-07-01 Speech recognition method, device, computer equipment and medium

Publications (2)

Publication Number Publication Date
CN111739517A true CN111739517A (en) 2020-10-02
CN111739517B CN111739517B (en) 2024-01-30

Family

ID=72652260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010622097.5A Active CN111739517B (en) 2020-07-01 2020-07-01 Speech recognition method, device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN111739517B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6442519B1 (en) * 1999-11-10 2002-08-27 International Business Machines Corp. Speaker model adaptation via network of similar users
US20110295603A1 (en) * 2010-04-28 2011-12-01 Meisel William S Speech recognition accuracy improvement through speaker categories
CN105096935A (en) * 2014-05-06 2015-11-25 阿里巴巴集团控股有限公司 Voice input method, device, and system
CN105895080A (en) * 2016-03-30 2016-08-24 乐视控股(北京)有限公司 Voice recognition model training method, speaker type recognition method and device
CN107346517A (en) * 2016-05-05 2017-11-14 腾讯科技(深圳)有限公司 User-interaction parameter acquisition methods and acquisition device in customer relationship network
US20180374486A1 (en) * 2017-06-23 2018-12-27 Microsoft Technology Licensing, Llc Speaker recognition
CN107609461A (en) * 2017-07-19 2018-01-19 阿里巴巴集团控股有限公司 The training method of model, the determination method, apparatus of data similarity and equipment
CN108877809A (en) * 2018-06-29 2018-11-23 北京中科智加科技有限公司 A kind of speaker's audio recognition method and device
CN110047468A (en) * 2019-05-20 2019-07-23 北京达佳互联信息技术有限公司 Audio recognition method, device and storage medium
CN110364146A (en) * 2019-08-23 2019-10-22 腾讯科技(深圳)有限公司 Audio recognition method, device, speech recognition apparatus and storage medium
CN110781280A (en) * 2019-10-21 2020-02-11 深圳众赢维融科技有限公司 Knowledge graph-based voice assisting method and device
CN111009237A (en) * 2019-12-12 2020-04-14 北京达佳互联信息技术有限公司 Voice recognition method and device, electronic equipment and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948584A (en) * 2021-03-03 2021-06-11 北京百度网讯科技有限公司 Short text classification method, device, equipment and storage medium
CN112948584B (en) * 2021-03-03 2023-06-23 北京百度网讯科技有限公司 Short text classification method, device, equipment and storage medium
CN113327596A (en) * 2021-06-17 2021-08-31 北京百度网讯科技有限公司 Training method of voice recognition model, voice recognition method and device
CN114817757A (en) * 2022-04-02 2022-07-29 广州大学 Cross-social network virtual identity association method based on graph convolution network
CN114495904A (en) * 2022-04-13 2022-05-13 阿里巴巴(中国)有限公司 Speech recognition method and device
CN116543389A (en) * 2023-03-13 2023-08-04 中国人民解放军海军工程大学 Character recognition method, device, equipment and medium based on relational network
CN116543389B (en) * 2023-03-13 2023-09-19 中国人民解放军海军工程大学 Character recognition method, device, equipment and medium based on relational network

Also Published As

Publication number Publication date
CN111739517B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN111739517B (en) Speech recognition method, device, computer equipment and medium
CN110503959B (en) Voice recognition data distribution method and device, computer equipment and storage medium
CN111564152B (en) Voice conversion method and device, electronic equipment and storage medium
CN111696532B (en) Speech recognition method, device, electronic equipment and storage medium
CN110572716B (en) Multimedia data playing method, device and storage medium
WO2022057435A1 (en) Search-based question answering method, and storage medium
CN111105788B (en) Sensitive word score detection method and device, electronic equipment and storage medium
CN111104980B (en) Method, device, equipment and storage medium for determining classification result
CN111581958A (en) Conversation state determining method and device, computer equipment and storage medium
CN112749956A (en) Information processing method, device and equipment
CN111343346B (en) Incoming call pickup method and device based on man-machine conversation, storage medium and equipment
CN114333774A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111554314A (en) Noise detection method, device, terminal and storage medium
CN110837557B (en) Abstract generation method, device, equipment and medium
CN113409770A (en) Pronunciation feature processing method, pronunciation feature processing device, pronunciation feature processing server and pronunciation feature processing medium
CN112764600A (en) Resource processing method, device, storage medium and computer equipment
CN112988984B (en) Feature acquisition method and device, computer equipment and storage medium
CN111414496B (en) Artificial intelligence-based multimedia file detection method and device
CN111341317B (en) Method, device, electronic equipment and medium for evaluating wake-up audio data
CN111028846B (en) Method and device for registration of wake-up-free words
CN114328815A (en) Text mapping model processing method and device, computer equipment and storage medium
CN114153953A (en) Dialog reply generation method, device, equipment and storage medium
CN114547429A (en) Data recommendation method and device, server and storage medium
CN113821658A (en) Method, device and equipment for training encoder and storage medium
CN114764480A (en) Group type identification method and device, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030065

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant