CN113409793B - Speech recognition method, intelligent home system, conference equipment and computing equipment


Info

Publication number
CN113409793B
CN113409793B
Authority
CN
China
Prior art keywords: machine learning, learning model, voice, information, target object
Prior art date
Legal status (assumption, not a legal conclusion)
Active
Application number
CN202010129820.6A
Other languages
Chinese (zh)
Other versions
CN113409793A (en)
Inventor
郑斯奇
雷赟
Current Assignee (may be inaccurate)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (assumption, not a legal conclusion)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010129820.6A
Publication of CN113409793A
Application granted
Publication of CN113409793B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a speech recognition method, an intelligent home system, conference equipment and computing equipment. The method includes: collecting voice information of at least one target object; inputting the voice information into a first machine learning model and a second machine learning model, and jointly feeding the output result of each network layer in the first and second machine learning models into the target machine learning model for analysis, to obtain the identity information of the target object and the voice content corresponding to that identity; and outputting the voice content. The application addresses the technical problem of low accuracy in speech recognition schemes for short-time text-independent tasks.

Description

Speech recognition method, intelligent home system, conference equipment and computing equipment
Technical Field
The application relates to the field of voice recognition, in particular to a voice recognition method, an intelligent home system, conference equipment and computing equipment.
Background
Speaker recognition is a technology for identifying a speaker's identity through voice. Speaker recognition has so far been deployed in industry mainly in two forms: short-time text-dependent scenarios, in which the speaker utters fixed text content, such as the wake-up words of smart home devices; and long-time text-independent scenarios, in which the content spoken is not specified but a longer speaking duration is required. For short-time text-independent tasks, performing recognition with conventional speaker recognition techniques yields low accuracy that falls short of the commercial level.
No effective solution has yet been proposed for the low accuracy of speech recognition schemes for short-time text-independent tasks.
Disclosure of Invention
The embodiments of the application provide a voice recognition method, an intelligent home system, conference equipment and computing equipment, so as to at least solve the technical problem of low accuracy of voice recognition schemes for short-time text-independent tasks.
According to an aspect of an embodiment of the present application, there is provided a voice recognition method including: collecting voice information of at least one target object; inputting the voice information of at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer in the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
According to another aspect of the embodiment of the present application, there is also provided another voice recognition method, including: receiving voice information of a target object; inputting the voice information of the target object to a corresponding network layer of a target machine learning model for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information, wherein the input of each network layer in the target machine learning model is the output result of each network layer in the first machine learning model and the second machine learning model; and verifying the identity information, and executing the operation corresponding to the voice content when the verification passes.
According to another aspect of the embodiment of the application, there is also provided an intelligent home system, including at least one home appliance and a control device, where the at least one home appliance is configured to collect voice information of a target object in a space where the at least one home appliance is located, and receive a control instruction from the control device; the control equipment is used for receiving the voice information, inputting the voice information of at least one target object into the first machine learning model and the second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information; and generating control instructions based on the voice content and sending the control instructions to at least one household appliance.
According to another aspect of the embodiments of the present application, there is also provided a conference apparatus, including: a voice acquisition device, for acquiring voice information of at least one target object in the space where the voice acquisition device is located; and a controller, for acquiring the voice information, inputting it into a first machine learning model, and inputting the output result of each network layer in the first machine learning model into the corresponding network layer in a target machine learning model, wherein the target machine learning model is used to recognize the identity information of the target object and the voice content corresponding to the identity information, and the first machine learning model is a model that recognizes the acoustic features of the at least one target object.
According to another aspect of the embodiment of the present application, there is also provided another conference apparatus including: the voice acquisition equipment is used for acquiring voice information of at least one target object in the space where the voice acquisition equipment is located; the controller is used for acquiring voice information, inputting the voice information into the second machine learning model, and inputting an output result of each network layer in the second machine learning model into a corresponding network layer in the target machine learning model, wherein the target machine learning model is used for identifying the identity information of the target object and the voice content corresponding to the identity information, and the second machine learning model is a model for carrying out content identification on the voice information of at least one target object.
According to another aspect of the embodiment of the present application, there is also provided another voice recognition method, including: collecting voice information of at least one target object; inputting the voice information of at least one target object into a second machine learning model, inputting the output result of a network layer in the second machine learning model into a first machine learning model, and inputting the output result of the network layer in the first machine learning model into a corresponding network layer in the target machine learning model for analysis so as to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
According to still another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program, wherein, when the program runs, a device in which the storage medium is located is controlled to perform the above voice recognition method.
According to yet another aspect of an embodiment of the present application, there is also provided a computing device including: a processor; and a memory, coupled to the processor, for providing instructions to the processor for processing the steps of: collecting voice information of at least one target object; inputting the voice information of at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
In the embodiments of the application, voice information of at least one target object is collected; the voice information is input into a first machine learning model and a second machine learning model, and the output result of each network layer in the first and second machine learning models is jointly input into the corresponding network layer of a target machine learning model for analysis, to obtain the identity information of the target object and the voice content corresponding to the identity information. By combining three mutually linked neural networks to recognize the target object's voice information, the accuracy of voice recognition for short-time text-independent tasks is improved, which solves the technical problem of low accuracy in voice recognition schemes for such tasks.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 shows a block diagram of the hardware architecture of a computer terminal (or mobile device) for implementing a speech recognition method;
FIG. 2 is a flow chart of a method of speech recognition according to an embodiment of the present application;
FIG. 3 is a schematic illustration of a neural network model, according to an embodiment of the present application;
FIG. 4 is a flow chart of another speech recognition method according to an embodiment of the present application;
fig. 5 is a block diagram of an intelligent home system according to an embodiment of the present application;
fig. 6 is a block diagram of a computer terminal according to an embodiment of the present application;
Fig. 7 is a schematic diagram of an application scenario in which a user wakes up an intelligent home device through voice according to an embodiment of the present application;
fig. 8 is a schematic structural view of a conference apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural view of another conference apparatus according to an embodiment of the present application;
Fig. 10 is a flowchart of another voice recognition method according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the terms appearing in the description of the embodiments of the application are explained as follows:
Speaker recognition: confirming the identity of a speaker through his or her voice.
Short-time text-independent task: the content spoken by the speaker is not restricted, and the utterance is short (e.g., less than 5 seconds).
Example 1
In accordance with an embodiment of the present application, an embodiment of a speech recognition method is provided. It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that described herein.
The method according to the first embodiment of the present application may be implemented in a mobile terminal, a computer terminal or a similar computing device. Fig. 1 shows a block diagram of the hardware architecture of a computer terminal (or mobile device) for implementing the speech recognition method. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n), which may include but are not limited to a processing device such as a microcontroller (MCU) or a programmable logic device (FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. It may further include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power supply and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and does not limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as a "data processing circuit". The data processing circuit may be embodied in whole or in part as software, hardware, firmware, or any other combination. Furthermore, the data processing circuit may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a kind of processor control (e.g., selection of the path of the variable resistance termination connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as the program instructions/data storage devices corresponding to the voice recognition method in the embodiment of the present application. The processor 102 executes the software programs and modules stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the above-mentioned voice recognition method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 106 is used to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission module 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
In the above-described operating environment, the present application provides a speech recognition method as shown in fig. 2. Fig. 2 is a flowchart of a voice recognition method according to an embodiment of the present application, as shown in fig. 2, the method comprising the steps of:
step S202, collecting voice information of at least one target object.
According to an alternative embodiment of the present application, the voice information in step S202 is voice information corresponding to a short-time text-independent task. Voice recognition for short-time text-independent tasks mainly applies to wake-up words for home appliances, such as "hello, TV".
Step S204, inputting the voice information of at least one target object into the first machine learning model and the second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis, so as to obtain the identity information of the target object and the voice content corresponding to the identity information.
In an optional embodiment of the present application, the first machine learning model is a model for identifying acoustic features of the at least one target object; the second machine learning model is a model for content recognition of speech information of at least one target object.
In an alternative embodiment of the present application, the first machine learning model, the second machine learning model and the target machine learning model are three mutually independent, mutually linked neural network models. Fig. 3 is a schematic diagram of a neural network model according to an embodiment of the present application. As shown in fig. 3, the model is composed of three independent and interlinked neural networks: left, middle and right. Viewed on its own, the leftmost network resembles a traditional voiceprint recognition network, its input being acoustic features. The rightmost network takes as input hidden features extracted from a speech recognition network, which represent the content of the speech. The middle network is formed by cross-linking the two, as sketched below.
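The patent does not specify concrete layer types or sizes, so the following PyTorch sketch is only an illustrative assumption of the cross-linked structure: each layer of the middle (target) network consumes the per-layer hidden outputs of the acoustic branch and the content branch. All names, dimensions and feature choices are hypothetical.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Stand-in for the acoustic (first) or content (second) network;
    exposes the hidden output of every layer, not just the last."""
    def __init__(self, in_dim, dims=(64, 128)):
        super().__init__()
        self.l1 = nn.Linear(in_dim, dims[0])
        self.l2 = nn.Linear(dims[0], dims[1])

    def forward(self, x):
        h1 = torch.relu(self.l1(x))
        h2 = torch.relu(self.l2(h1))
        return [h1, h2]  # per-layer hidden outputs

class FusionSpeakerNet(nn.Module):
    """Middle (target) network: each layer fuses the two branches'
    hidden outputs at the corresponding depth."""
    def __init__(self, branch_dims=(64, 128), embed_dim=256):
        super().__init__()
        self.fuse1 = nn.Linear(2 * branch_dims[0], 128)
        self.fuse2 = nn.Linear(128 + 2 * branch_dims[1], embed_dim)

    def forward(self, acoustic_hiddens, content_hiddens):
        h = torch.relu(self.fuse1(
            torch.cat([acoustic_hiddens[0], content_hiddens[0]], dim=-1)))
        h = torch.relu(self.fuse2(
            torch.cat([h, acoustic_hiddens[1], content_hiddens[1]], dim=-1)))
        return h  # joint embedding carrying identity and content cues

acoustic_net = Branch(in_dim=40)    # e.g. 40-dim filterbank features (assumed)
content_net = Branch(in_dim=512)    # e.g. ASR hidden features (assumed)
target_net = FusionSpeakerNet()

feats = torch.randn(8, 40)          # a batch of acoustic feature vectors
asr_hidden = torch.randn(8, 512)    # a batch of ASR hidden features
embedding = target_net(acoustic_net(feats), content_net(asr_hidden))
print(embedding.shape)              # torch.Size([8, 256])
```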
Step S206, outputting the voice content.
By combining three mutually linked neural networks to recognize the voice information of the target object, the technical effect of improving the accuracy of voice recognition for short-time text-independent tasks is achieved.
According to an alternative embodiment of the present application, the target machine learning model is trained as follows: obtaining multiple sets of training data for training the target machine learning model, where each set of training data comprises triplet information, the triplet information including two different pieces of voice information from a first sample object and a piece of voice information from a second sample object; and inputting the multiple sets of training data into the target machine learning model for training until the prediction result of the target machine learning model meets a preset condition.
The first sample object and the second sample object refer to different speakers. A sketch of assembling such a triplet follows.
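A minimal sketch of building one training triplet under the description above: two different utterances of one speaker (anchor and positive) plus an utterance of another speaker (negative). The corpus layout and file names are hypothetical.

```python
import random

# Hypothetical corpus: speaker id -> list of that speaker's utterances.
corpus = {
    "spk1": ["spk1_utt_a.wav", "spk1_utt_b.wav", "spk1_utt_c.wav"],
    "spk2": ["spk2_utt_a.wav", "spk2_utt_b.wav"],
}

def sample_triplet(corpus):
    first, second = random.sample(list(corpus), 2)
    anchor, positive = random.sample(corpus[first], 2)  # same speaker, different speech
    negative = random.choice(corpus[second])            # a different speaker
    return anchor, positive, negative

print(sample_triplet(corpus))
```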
In an alternative embodiment of the present application, inputting the multiple sets of training data into the target machine learning model for training includes: when the prediction result does not meet the preset condition, adjusting the weights of the first sample object's different voice information and of the second sample object's voice information, until the prediction result of the target machine learning model meets the preset condition.
According to an alternative embodiment of the present application, adjusting these weights may include: increasing the weights of the first sample object's different voice information; and/or reducing the weight of the second sample object's voice information. End-to-end recognition with a triplet loss allows better normalization across different voice contents. For example, when the same person says different things, information related to voice characteristics can be given more weight so that the two embeddings are drawn as close together as possible; when different people say the same thing, the target model discards content-related information as far as possible and seeks out the voiceprint differences between the speakers.
In another alternative embodiment of the present application, inputting the multiple sets of training data into the target machine learning model for training includes: when the prediction result does not meet the preset condition, adjusting the loss function of the target machine learning model until the distance between the feature vectors of the first sample object's different voice information is smaller than the distance between the second sample object's feature vector and designated voice information, where the designated voice information is either one of the first sample object's different pieces of voice information.
The above target machine learning model differs from mainstream voiceprint recognition frameworks in that the neural network adopts an end-to-end triplet loss. Triplet loss is a deep-learning loss function for training on samples with small differences; each triplet comprises an anchor example, a positive example and a negative example, and the network learns sample similarity by optimizing the anchor-positive distance to be smaller than the anchor-negative distance.
The target machine learning model discards the traditional back-end scoring with Linear Discriminant Analysis (LDA) and PLDA (probabilistic LDA, used to handle speaker and channel variability), and instead computes the loss by directly comparing different triplets.
In this way, end-to-end speaker recognition based on the triplet loss function, combined with speech recognition information, can greatly improve the accuracy of voice recognition for short-time text-independent tasks compared with traditional speaker recognition techniques. A sketch of the loss follows.
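A hedged sketch of the triplet loss, using PyTorch's built-in nn.TripletMarginLoss in place of the patent's unspecified formulation; the margin value and embedding dimension are assumed hyperparameters.

```python
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=0.3)  # margin is an assumed value

# anchor / positive: embeddings of two different utterances of the same speaker;
# negative: an embedding of a different speaker (possibly saying the same text).
anchor = torch.randn(8, 256, requires_grad=True)
positive = torch.randn(8, 256, requires_grad=True)
negative = torch.randn(8, 256, requires_grad=True)

# The loss reaches zero once d(anchor, positive) + margin <= d(anchor, negative),
# pulling same-speaker embeddings together and pushing different speakers apart,
# with no LDA/PLDA back-end scoring.
loss = triplet_loss(anchor, positive, negative)
loss.backward()
print(loss.item())
```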
In some optional embodiments of the present application, the above voice recognition method further includes: verifying the identity information, and executing the operation corresponding to the voice content when the verification passes.
After the voice content corresponding to the identity information of the target object is recognized, the identity of the target object can be further verified, and the operation corresponding to the voice content is executed only when the verification passes. For example, in a specific application in which a user controls a smart home appliance by voice through a wake-up word: after the user issues the voice control instruction "turn on the air conditioner", the controller on the air conditioner, having recognized the instruction with the above voice recognition method, must further judge whether the user has control authority over the air conditioner; only after determining that the user has this authority does the air conditioner execute the operation corresponding to the instruction. In this way, user permissions can be restricted, improving the operational safety of home appliances. A sketch of this verify-then-execute flow follows.
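A minimal sketch of the verify-then-execute flow, assuming a cosine-similarity check of the recognized speaker embedding against enrolled voiceprints; the enrollment table, permission table and threshold are all hypothetical.

```python
import torch
import torch.nn.functional as F

ENROLLED = {"alice": torch.randn(256)}                     # enrolled voiceprints (assumed)
PERMISSIONS = {"alice": {"turn on the air conditioner"}}   # per-user allowed commands
THRESHOLD = 0.7                                            # assumed decision threshold

def execute(command):
    print(f"executing: {command}")                         # hypothetical device hook

def verify_and_execute(embedding, speech_content):
    """Execute the recognized command only if the speaker is verified
    and holds permission for that command."""
    for user, ref in ENROLLED.items():
        score = F.cosine_similarity(embedding, ref, dim=0).item()
        if score >= THRESHOLD and speech_content in PERMISSIONS.get(user, set()):
            execute(speech_content)
            return user
    return None  # verification failed: do nothing

verify_and_execute(torch.randn(256), "turn on the air conditioner")
```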
Fig. 7 is a schematic view of an application scenario in which a user wakes up a smart home device through voice according to an embodiment of the present application. As shown in fig. 7, the processor of the smart home device runs a neural network system comprising the above first machine learning model, second machine learning model and target machine learning model. After the user issues a voice control instruction to wake the smart home appliance, the appliance collects the instruction through the microphone on the device and sends it to the processor, which processes the voice information corresponding to the instruction by performing the following method:
step S702, collecting voice information of at least one target object;
Step S704, inputting the voice information of at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information;
Step S706, outputting the voice content.
Specifically, if the same user speaks different content, the neural network system may give more weight to information related to voice characteristics; if different users speak the same content, the system discards content-related information as far as possible and picks out the differences between their voiceprints.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
From the above description of the embodiments, it will be clear to those skilled in the art that the speech recognition method of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is preferred. Based on such understanding, the part of the technical solution of the present application that is essential or that contributes to the prior art may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk), including instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods of the embodiments of the present application.
Example 2
Fig. 4 is a flowchart of another voice recognition method according to an embodiment of the present application, as shown in fig. 4, including the steps of:
step S402, receiving voice information of a target object.
According to an optional embodiment of the present application, the voice information is voice information corresponding to a short-time text-independent task, and voice recognition of the short-time text-independent task is mainly applied to wake-up words of the home appliance.
Step S404, inputting the voice information of the target object to the corresponding network layer of the target machine learning model for analysis, and obtaining the identity information of the target object and the voice content corresponding to the identity information, wherein the input of each network layer in the target machine learning model is the output result of each network layer in the first machine learning model and the second machine learning model.
According to an alternative embodiment of the application, the first machine learning model is a model identifying acoustic features of the target object; the second machine learning model is a model for performing content recognition on the voice information of the target object.
The first machine learning model, the second machine learning model and the target machine learning model are three independent, mutually linked neural networks. Viewed independently, the leftmost network takes acoustic features as input, similar to a conventional voiceprint recognition network. The rightmost network takes as input hidden features extracted from a speech recognition network, which represent the content of the speech. The middle network is formed by cross-linking the two.
Step S406, the identity information is verified, and when the verification passes, the operation corresponding to the voice content is executed.
After the voice content corresponding to the identity information of the target object is recognized, the identity of the target object can be further verified, and the operation corresponding to the voice content is executed only when the verification passes. For example, in a specific application in which a user controls a smart home appliance by voice through a wake-up word: after the user issues the voice control instruction "turn on the air conditioner", the controller on the air conditioner, having recognized the instruction with the above voice recognition method, must further judge whether the user has control authority over the air conditioner; only after determining that the user has this authority does the air conditioner execute the operation corresponding to the instruction. In this way, user permissions can be restricted, improving the operational safety of home appliances.
It should be noted that, for a preferred implementation of the embodiment shown in fig. 4, reference may be made to the related description of the embodiment shown in fig. 2, which is not repeated here.
Example 3
Fig. 5 is a block diagram of an intelligent home system according to an embodiment of the present application, as shown in fig. 5, which includes at least one home device 50 and a control device 52, wherein,
The at least one home device 50 is configured to collect voice information of a target object in a space where the at least one home device 50 is located, and receive a control instruction from the control device 52;
According to an alternative embodiment of the present application, the voice information is voice information corresponding to a short-time text-independent task; voice recognition for such tasks mainly applies to wake-up words for home appliances, such as "hello, TV". Home devices 50 include, but are not limited to, smart air conditioners, smart televisions and smart speakers.
The control device 52 is configured to receive the voice information, input the voice information of at least one target object to the first machine learning model and the second machine learning model, and input the output result of each network layer in the first machine learning model and the second machine learning model to the corresponding network layer of the target machine learning model together for analysis, so as to obtain identity information of the target object and voice content corresponding to the identity information; and generating control instructions based on the voice content and sending the control instructions to at least one household appliance.
In an optional embodiment of the present application, the first machine learning model is a model for identifying acoustic features of the at least one target object; the second machine learning model is a model for content recognition of speech information of at least one target object.
In an alternative embodiment of the present application, the first machine learning model, the second machine learning model and the target machine learning model are three mutually independent, mutually linked neural networks. Fig. 3 is a schematic diagram of a neural network model according to an embodiment of the present application. As shown in fig. 3, the network is composed of three independent and interlinked neural networks: left, middle and right. Viewed independently, the leftmost network takes acoustic features as input, similar to a conventional voiceprint recognition network. The rightmost network takes as input hidden features extracted from a speech recognition network, which represent the content of the speech. The middle network is formed by cross-linking the two.
Preferably, after recognizing the voice content corresponding to the identity information of the target object, the control device 52 may further verify the identity of the target object, and perform the operation corresponding to the voice content only when the verification passes. For example, in a specific application, the user voice-controls the smart air-conditioning apparatus using the wake-up word: after the user issues the voice control instruction "turn on the air conditioner", the control device 52, having recognized the instruction with the above voice recognition method, further judges whether the user has control authority over the air conditioner, and performs the operation corresponding to the instruction only after determining that the user has that authority. In this way, user permissions can be restricted, improving the operational safety of home appliances. A sketch of turning recognized content into a control instruction follows.
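A small sketch of how the control device 52 might map recognized voice content to a control instruction for an appliance; the command table and message fields are illustrative assumptions, not part of the patent.

```python
# Hypothetical mapping from recognized speech content to (device, action).
COMMAND_TABLE = {
    "turn on the air conditioner": ("air_conditioner", "power_on"),
    "hello, TV":                   ("television", "wake"),
}

def make_control_instruction(speech_content, user_id):
    """Build the instruction the control device sends to the appliance,
    tagged with the verified user's identity."""
    entry = COMMAND_TABLE.get(speech_content)
    if entry is None:
        return None  # unrecognized command: send nothing
    device, action = entry
    return {"device": device, "action": action, "authorized_user": user_id}

print(make_control_instruction("turn on the air conditioner", "alice"))
```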
This system achieves the technical effect of improving the accuracy of voice recognition for short-time text-independent tasks.
It should be noted that, the preferred implementation manner of the embodiment shown in fig. 5 may refer to the related description of the embodiment shown in fig. 2, which is not repeated herein.
Example 4
Embodiments of the present application may provide a computer terminal, which may be any one of a group of computer terminals. Alternatively, in the present embodiment, the above-described computer terminal may be replaced with a terminal device such as a mobile terminal.
Alternatively, in this embodiment, the above-mentioned computer terminal may be located in at least one network device among a plurality of network devices of the computer network.
In this embodiment, the computer terminal may execute the program code of the following steps in the speech recognition method of the application program: collecting voice information of at least one target object; inputting the voice information of at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
Alternatively, fig. 6 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 6, the computer terminal 60 may include: one or more processors 602 (only one is shown), a memory 604, a radio frequency module, an audio module and a display screen.
The memory 604 may be used to store software programs and modules, such as program instructions/modules corresponding to the voice recognition method and apparatus in the embodiment of the present application, and the processor 602 executes the software programs and modules stored in the memory to perform various functional applications and data processing, i.e., implement the voice recognition method described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, which may be connected to the computer terminal 60 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: collecting voice information of at least one target object; inputting the voice information of at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
Optionally, the above processor may further execute program code for: obtaining multiple sets of training data for training a target machine learning model, wherein each set of training data comprises triplet information, and the triplet information comprises: different speech information of the first sample object; speech information of the second sample object; and respectively inputting a plurality of groups of training data into corresponding network layers of the target machine learning model for training until the prediction result of the target machine learning model meets the preset condition.
Optionally, the above processor may further execute program code for: and when the prediction result does not meet the preset condition, adjusting the weights of the different voice information of the first sample object and the voice information of the second sample object until the prediction result of the target machine learning model meets the preset condition.
Optionally, the above processor may further execute program code for: increasing weights of different speech information of the first sample object; and/or reducing the weight of the speech information of the second sample object.
Optionally, the above processor may further execute program code for: and when the prediction result does not meet the preset condition, adjusting a loss function of the target machine learning model until the sample distance between the feature vectors of the different voice information of the first sample object is smaller than the sample distance between the feature vectors of the second sample object and the appointed voice information, wherein the appointed voice information is any one of the feature vectors of the different voice information of the first sample object.
Optionally, the above processor may further execute program code for: and verifying the identity information, and executing the operation corresponding to the voice content when the verification passes.
According to an alternative embodiment of the application, the processor may also call the information stored in the memory and the application program through the transmission means to perform the following steps: receiving voice information of a target object; inputting the voice information of the target object to a corresponding network layer of a target machine learning model for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information, wherein the input of each network layer in the target machine learning model is the output result of each network layer in the first machine learning model and the second machine learning model; and verifying the identity information, and executing the operation corresponding to the voice content when the verification passes.
According to an alternative embodiment of the application, the processor may also call the information stored in the memory and the application program through the transmission means to perform the following steps: collecting voice information of at least one target object; inputting the voice information of at least one target object into a second machine learning model, inputting the output result of a network layer in the second machine learning model into a first machine learning model, and inputting the output result of the network layer in the first machine learning model into a corresponding network layer in the target machine learning model for analysis so as to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
The embodiments of the application thus provide a voice recognition scheme. By combining three mutually linked neural networks to recognize the voice information of the target object, the accuracy of voice recognition for short-time text-independent tasks is improved, solving the technical problem of low accuracy of voice recognition schemes for such tasks.
It will be appreciated by those skilled in the art that the structure shown in fig. 6 is only illustrative; the computer terminal may be a smartphone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), a PAD, or the like. Fig. 6 does not limit the structure of the above electronic device. For example, the computer terminal 60 may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 6, or have a configuration different from that shown in fig. 6.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be completed by a program instructing hardware associated with a terminal device; the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, or the like.
The embodiment of the application also provides a storage medium. Alternatively, in the present embodiment, the above-described storage medium may be used to store the program code executed by the speech recognition method provided in the above-described embodiment 1.
Alternatively, in this embodiment, the storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: collecting voice information of at least one target object; inputting the voice information of at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
Optionally, the storage medium is further configured to store program code for performing the steps of: obtaining multiple sets of training data for training a target machine learning model, wherein each set of training data comprises triplet information, and the triplet information comprises: different speech information of the first sample object; speech information of the second sample object; and respectively inputting a plurality of groups of training data into corresponding network layers of the target machine learning model for training until the prediction result of the target machine learning model meets the preset condition.
Optionally, the storage medium is further configured to store program code for performing the steps of: and when the prediction result does not meet the preset condition, adjusting the weights of the different voice information of the first sample object and the voice information of the second sample object until the prediction result of the target machine learning model meets the preset condition.
Optionally, the storage medium is further configured to store program code for performing the steps of: increasing weights of different speech information of the first sample object; and/or reducing the weight of the speech information of the second sample object.
Optionally, the storage medium is further configured to store program code for performing the steps of: and when the prediction result does not meet the preset condition, adjusting a loss function of the target machine learning model until the sample distance between the feature vectors of the different voice information of the first sample object is smaller than the sample distance between the feature vectors of the second sample object and the appointed voice information, wherein the appointed voice information is any one of the feature vectors of the different voice information of the first sample object.
Optionally, the storage medium is further configured to store program code for performing the steps of: and verifying the identity information, and executing the operation corresponding to the voice content when the verification passes.
Optionally, in the present embodiment, the storage medium is further configured to store program code for performing the steps of: receiving voice information of a target object; inputting the voice information of the target object to a corresponding network layer of a target machine learning model for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information, wherein the input of each network layer in the target machine learning model is the output result of each network layer in the first machine learning model and the second machine learning model; and verifying the identity information, and executing the operation corresponding to the voice content when the verification passes.
Optionally, in the present embodiment, the storage medium is further configured to store program code for performing the steps of: collecting voice information of at least one target object; inputting the voice information of at least one target object into a second machine learning model, inputting the output result of a network layer in the second machine learning model into a first machine learning model, and inputting the output result of the network layer in the first machine learning model into a corresponding network layer in the target machine learning model for analysis so as to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
Example 5
Fig. 8 is a schematic structural view of a conference apparatus according to an embodiment of the present application, as shown in fig. 8, including:
At least one voice acquisition device 80, for acquiring voice information of at least one target object in the space where the at least one voice acquisition device is located. The voice acquisition device 80 may be a microphone array.
The controller 82 is configured to obtain voice information, input the voice information to a first machine learning model, and input an output result of each network layer in the first machine learning model to a corresponding network layer in a target machine learning model, where the target machine learning model is configured to identify identity information of a target object and voice content corresponding to the identity information, and the first machine learning model is a model that identifies acoustic features of at least one target object.
According to an optional embodiment of the present application, the first machine learning model can recognize the voiceprint features of the target object. After the first machine learning model has processed the target object's voice information, the voiceprint recognition result is input into the target machine learning model for further processing. It should be noted that the target machine learning model may process the voiceprint recognition result further in combination with other machine learning models; for example, it may additionally recognize the voice content corresponding to the recognized voiceprint result in combination with a voice content recognition model.
It should be noted that, the preferred implementation manner of the embodiment shown in fig. 8 may refer to the related description of the embodiment shown in fig. 2, which is not repeated herein.
Example 6
Fig. 9 is a schematic structural view of another conference apparatus according to an embodiment of the present application, as shown in fig. 9, including:
At least one voice acquisition device 90 for acquiring voice information of at least one target object in a space where the at least one voice acquisition device is located. The voice capture device 90 may be a microphone array.
The controller 92 is configured to obtain voice information, input the voice information to a second machine learning model, and input an output result of each network layer in the second machine learning model to a corresponding network layer in a target machine learning model, where the target machine learning model is configured to identify identity information of a target object and voice content corresponding to the identity information, and the second machine learning model is a model that performs content identification on the voice information of at least one target object.
According to an optional embodiment of the present application, the second machine learning model can recognize the voice content of the target object's voice information. After the second machine learning model has processed the target object's voice information, the content recognition result is input into the target machine learning model for further processing. It should be noted that the target machine learning model may process the content recognition result further in combination with other machine learning models; for example, it may additionally recognize the voiceprint features corresponding to the voice content in combination with a voiceprint feature recognition model.
It should be noted that, the preferred implementation manner of the embodiment shown in fig. 9 may refer to the related description of the embodiment shown in fig. 2, which is not repeated herein.
Example 7
Fig. 10 is a flowchart of another voice recognition method according to an embodiment of the present application, as shown in fig. 10, including the steps of:
Step S1002, collecting voice information of at least one target object.
According to an alternative embodiment of the present application, the voice information in step S1002 is voice information corresponding to a short-duration, text-independent task. Speech recognition for such tasks is mainly applied to wake-up words for home appliances, such as "hello TV" and the like.
Step S1004, inputting the voice information of the at least one target object into the second machine learning model, inputting the output result of a network layer in the second machine learning model into the first machine learning model, and inputting the output result of a network layer in the first machine learning model into the corresponding network layer in the target machine learning model for analysis, so as to obtain the identity information of the target object and the voice content corresponding to the identity information.
According to an optional embodiment of the application, the first machine learning model is a model identifying acoustic features of at least one target object; the second machine learning model is a model for performing content recognition on voice information of at least one target object.
In an alternative embodiment of the present application, the first machine learning model, the second machine learning model, and the target machine learning model are three mutually independent neural network models that are connected to one another. In this embodiment, the collected voice information is first input to a voice content recognition model (the second machine learning model) for recognition; after content recognition is completed, an acoustic feature recognition model (the first machine learning model) performs its recognition; the output result of the first machine learning model is then input to the corresponding network layer of the target machine learning model for analysis, finally yielding the identity information of the target object and the voice content corresponding to that identity information.
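Under the same illustrative assumptions as the sketches above (and reusing their classes), this chained arrangement reduces to a few lines: the content model runs first, its final layer output passes through the acoustic model, and that model's per-layer outputs drive the target model:

```python
# The chain of this embodiment, reusing the sketch classes: second model ->
# first model -> target model. feat_dim=128 is an assumption, chosen because
# the acoustic model now consumes the content model's 128-dim final output.
content_taps  = ContentModel()(features)                         # content recognition first
acoustic_taps = VoiceprintModel(feat_dim=128)(content_taps[-1])  # then acoustic features
speaker_logits, content_logits = TargetModel()(acoustic_taps)    # joint analysis
```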
Step S1006, outputting the voice content.
It should be noted that, for a preferred implementation of the embodiment shown in fig. 10, reference may be made to the related description of the embodiment shown in fig. 2, which is not repeated here.
The foregoing numbering of the embodiments of the present application is merely for description and does not imply that one embodiment is better or worse than another.
In the foregoing embodiments of the present application, each embodiment is described with its own emphasis; for any part not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division of the units is merely a logical function division, and there may be another division manner in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the related art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present application, and such improvements and modifications shall also fall within the protection scope of the present application.

Claims (16)

1. A method of speech recognition, comprising:
Collecting voice information of at least one target object;
Inputting the voice information of the at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer in the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information;
And outputting the voice content.
2. The method of claim 1, wherein the target machine learning model is trained by:
obtaining multiple sets of training data for training the target machine learning model, wherein each set of training data comprises triplet information, and the triplet information comprises: different voice information of a first sample object and voice information of a second sample object;
and inputting the multiple sets of training data into the target machine learning model respectively for training, until a prediction result of the target machine learning model meets a preset condition.
3. The method of claim 2, wherein inputting the multiple sets of training data into the target machine learning model respectively for training comprises:
when the prediction result does not meet the preset condition, adjusting weights of the different voice information of the first sample object and of the voice information of the second sample object, until the prediction result of the target machine learning model meets the preset condition.
4. The method of claim 3, wherein adjusting the weights of the different voice information of the first sample object and of the voice information of the second sample object comprises:
increasing the weights of the different voice information of the first sample object; and/or
reducing the weight of the voice information of the second sample object.
5. The method of claim 1, wherein the first machine learning model is a model that identifies acoustic features of the at least one target object; the second machine learning model is a model that performs content recognition on the voice information of the at least one target object.
6. The method of claim 2, wherein inputting the multiple sets of training data into the target machine learning model respectively for training comprises:
when the prediction result does not meet the preset condition, adjusting a loss function of the target machine learning model until the sample distance between the feature vectors of the different voice information of the first sample object is smaller than the sample distance between the feature vector of the voice information of the second sample object and the feature vector of specified voice information, wherein the specified voice information is any one of the different voice information of the first sample object.
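One common way to realize the distance condition recited here is a triplet margin loss; the sketch below illustrates that reading with random stand-in embeddings, and the particular loss and dimensions are assumptions rather than anything fixed by the claim:

```python
# A hedged sketch of the distance condition: embeddings of different
# utterances of the same (first) sample object are pulled together, while
# embeddings of the second sample object are pushed away.
import torch
import torch.nn.functional as F

anchor   = torch.randn(8, 128)  # one utterance of the first sample object
positive = torch.randn(8, 128)  # a different utterance of the same object
negative = torch.randn(8, 128)  # an utterance of the second sample object

# Training minimizes this loss; it reaches zero once
# d(anchor, positive) + margin <= d(anchor, negative).
loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
print(loss.item())
```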
7. The method according to claim 1, wherein the method further comprises:
And verifying the identity information, and executing the operation corresponding to the voice content when the verification passes.
8. A method of speech recognition, comprising:
Receiving voice information of a target object;
inputting the voice information of the target object to a corresponding network layer of a target machine learning model for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information, wherein the input of each network layer in the target machine learning model is the output result of each network layer in a first machine learning model and a second machine learning model;
And verifying the identity information, and executing the operation corresponding to the voice content when the verification passes.
9. The method of claim 8, wherein the first machine learning model is a model that identifies acoustic features of the target object; the second machine learning model is a model for performing content recognition on the voice information of the target object.
10. An intelligent home system, characterized by comprising at least one household appliance and a control device, wherein
the at least one household appliance is used for collecting voice information of a target object in a space where the at least one household appliance is located, and for receiving a control instruction from the control device;
the control device is used for receiving the voice information, inputting the voice information of the target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model together into the corresponding network layer of a target machine learning model for analysis, to obtain the identity information of the target object and the voice content corresponding to the identity information; and for generating the control instruction based on the voice content and sending the control instruction to the at least one household appliance.
11. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the speech recognition method of any one of claims 1 to 9.
12. A computing device, comprising:
A processor; and
A memory, coupled to the processor, for providing the processor with instructions for the following processing steps: collecting voice information of at least one target object; inputting the voice information of the at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis, to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
13. A conference device, comprising:
the voice acquisition equipment is used for acquiring voice information of at least one target object in the space where the voice acquisition equipment is located;
The controller is used for acquiring the voice information, inputting the voice information into a first machine learning model, and inputting an output result of each network layer in the first machine learning model into a corresponding network layer in a target machine learning model, wherein the target machine learning model is used for identifying the identity information of the target object and the voice content corresponding to the identity information, and the first machine learning model is used for identifying the acoustic characteristics of the at least one target object.
14. A conference device, comprising:
the voice acquisition equipment is used for acquiring voice information of at least one target object in the space where the voice acquisition equipment is located;
The controller is used for acquiring the voice information, inputting the voice information into a second machine learning model, and inputting an output result of each network layer in the second machine learning model into a corresponding network layer in a target machine learning model, wherein the target machine learning model is used for identifying the identity information of the target object and the voice content corresponding to the identity information, and the second machine learning model is used for performing content recognition on the voice information of the at least one target object.
15. A method of speech recognition, comprising:
Collecting voice information of at least one target object;
Inputting the voice information of the at least one target object into a second machine learning model, inputting the output result of a network layer in the second machine learning model into a first machine learning model, and inputting the output result of the network layer in the first machine learning model into a corresponding network layer in the target machine learning model for analysis so as to obtain the identity information of the target object and the voice content corresponding to the identity information;
And outputting the voice content.
16. The method of claim 15, wherein the first machine learning model is a model that identifies acoustic features of the at least one target object; the second machine learning model is a model that performs content recognition on the voice information of the at least one target object.