CN117334201A - Voice recognition method, device, equipment and medium - Google Patents

Voice recognition method, device, equipment and medium

Info

Publication number
CN117334201A
Authority
CN
China
Prior art keywords
data
voiceprint
audio
content
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311385869.8A
Other languages
Chinese (zh)
Inventor
黄东成
李晓清
林晓波
钟奖
陈红宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhongwei Information Technology Co ltd
Original Assignee
Shenzhen Zhongwei Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhongwei Information Technology Co ltd filed Critical Shenzhen Zhongwei Information Technology Co ltd
Priority to CN202311385869.8A priority Critical patent/CN117334201A/en
Publication of CN117334201A publication Critical patent/CN117334201A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to a voice recognition method, device, equipment and medium in the field of voice recognition. The method comprises the following steps: acquiring audio data and generating an audio identifier of the audio data, wherein the audio data is obtained by a preset audio acquisition device collecting the voice of a user; obtaining voiceprint data of the audio data and obtaining content data of the audio data, wherein the voiceprint data is associated with the audio identifier, the content data is associated with the audio identifier, the voiceprint data is obtained by voiceprint recognition of the audio data, the content data is obtained by content recognition of the audio data, and the voiceprint recognition and the content recognition are executed asynchronously; and associating the voiceprint data with the content data according to the audio identifier. The waiting time and the recognition time are thereby shortened, improving the recognition efficiency of voice.

Description

Voice recognition method, device, equipment and medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a method, apparatus, device, and medium for voice recognition.
Background
Existing speech recognition technology generally recognizes speech with a recognition network composed of a language model and an acoustic model: the language model recognizes the content of the speech, while the acoustic model recognizes the voiceprint characteristics of the user's voice. The content and the identity of the speech are usually recognized with a serial synchronous method, in which voiceprint recognition, identity authentication and content recognition are performed in sequence and the recognition results are output together at the end. However, the serial synchronous method consumes a large amount of time, and if any step fails, the voice must be collected again and the whole procedure re-executed, occupying processor resources for a long time.
Disclosure of Invention
The invention provides a voice recognition method to solve the problem that serial synchronous recognition consumes a large amount of time and occupies processor resources for a long time, resulting in low voice recognition efficiency.
In a first aspect, the present invention provides a method of voice recognition, the method comprising:
acquiring audio data, and generating an audio identifier of the audio data, wherein the audio data is acquired by a preset audio acquisition device through voice acquisition of a user;
obtaining voiceprint data of the audio data, and obtaining content data of the audio data, wherein the voiceprint data is associated with the audio identifier, the content data is associated with the audio identifier, the voiceprint data is obtained by voiceprint recognition of the audio data, the content data is obtained by content recognition of the audio data, and the voiceprint recognition and the content recognition are respectively executed in an asynchronous mode;
and associating the voiceprint data with the content data according to the audio identification.
In a second aspect, the invention provides a voice recognition apparatus comprising means for performing the voice recognition method according to any of the embodiments of the first aspect.
In a third aspect, an electronic device is provided, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the voice recognition method according to any one of the embodiments of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, implements the steps of the voice recognition method according to any one of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the invention has the following advantages:
by asynchronous processing, the voiceprint recognition step and the content recognition step are executed in parallel, so that the waiting time is reduced, the recognition time is shortened, and the recognition efficiency of voice is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a voice recognition method according to an embodiment of the present invention;
fig. 2 is a schematic sub-flowchart of a voice recognition method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Fig. 1 is a schematic flow chart of a voice recognition method according to an embodiment of the present invention. The embodiment of the invention provides a voice recognition method, specifically referring to fig. 1, the voice recognition method comprises the following steps S101-S103.
S101, acquiring audio data, and generating an audio identifier of the audio data, wherein the audio data is acquired by a preset audio acquisition device through voice acquisition of a user.
In specific implementation, the preset audio acquisition device includes a microphone; the audio data is obtained by collecting the user's voice through the audio acquisition device, and each audio identifier uniquely identifies one piece of audio data.
In one embodiment, the step S101 includes the steps of: and generating a unique identifier of the audio data according to a preset abstract algorithm, and taking the unique identifier of the audio data as the audio identifier.
In particular implementations, the digest algorithm, also called a hash algorithm, converts data of any length into a data string of fixed length (usually expressed as a hexadecimal character string) through a function. The data string of the audio data is calculated with the digest algorithm, and a checking algorithm verifies whether the calculated data string is unique; if it is, the unique data string is taken as the unique identifier, and the unique identifier of the audio data is used as the audio identifier, so that each piece of audio data corresponds one to one with its audio identifier.
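As an illustration only, a unique audio identifier could be derived with a standard digest algorithm roughly as follows; the choice of SHA-256, the timestamp salt and the helper name are assumptions of this sketch and are not specified by the embodiment.

```python
import hashlib
import time

def generate_audio_identifier(audio_bytes: bytes) -> str:
    """Return a fixed-length hexadecimal string identifying one audio capture."""
    digest = hashlib.sha256()
    digest.update(audio_bytes)
    # Mixing in a capture timestamp keeps identifiers unique even if two
    # captures happen to contain identical samples (an assumed design choice).
    digest.update(str(time.time_ns()).encode("utf-8"))
    return digest.hexdigest()
```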
S102, voiceprint data of the audio data are acquired, content data of the audio data are acquired, wherein the voiceprint data are associated with the audio identification, the content data are associated with the audio identification, the voiceprint data are obtained through voiceprint recognition of the audio data, the content data are obtained through content recognition of the audio data, and the voiceprint recognition and the content recognition are respectively executed in an asynchronous mode.
In specific implementations, voiceprint recognition refers to finding voiceprint features that describe a particular speaker. These can be divided into auditory features, which can be perceived and described by the human ear (for example, describing a voice as breathy or full-voiced), and acoustic features, which are sets of acoustic description parameters (vectors) extracted from the sound signal by computer algorithms (mathematical methods).
In particular implementations, content recognition, also known as speech recognition, refers to recognizing the content of the sound. For example, if the audio data is a recording of the user reading "5342", content recognition yields the sound content "5342"; this "5342" is the content data, which can then be further processed in subsequent steps.
The user identity of the audio data may be determined by voiceprint recognition of the audio data and the content of the audio data may be determined by content recognition of the audio data.
In one embodiment, the voiceprint recognition step includes: and extracting voiceprint characteristics of the audio data according to a pre-trained voiceprint recognition model to obtain the voiceprint data.
In particular implementations, voiceprint recognition models include template models (non-parametric models) and stochastic models (parametric models). A template model compares the training feature parameters with the tested feature parameters and uses the distortion (distance) between the two as the similarity. A stochastic model describes the speaker with a probability density function: the training process estimates the parameters of the probability density function, and matching is completed by calculating the similarity of the test utterance to the corresponding model. In other words, a parametric model uses a probability density function to describe the distribution of the speaker's speech feature space and takes a set of parameters of that function as the speaker's model; Gaussian mixture models (GMM) and hidden Markov models (HMM) are typical examples.
The trained voiceprint recognition model serves to extract the voiceprint features in the audio so that different audio objects can later be distinguished by those features. The voiceprint features of the audio data extracted by the trained model are taken as the voiceprint data, which can then be further processed in subsequent steps.
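For illustration, a simplified voiceprint-extraction step in the spirit of the parametric (GMM) models mentioned above might look as follows; the MFCC front end, the fixed sample rate and the use of the mixture means as the voiceprint vector are assumptions of this sketch, not requirements of the embodiment.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_voiceprint(audio_path: str, n_components: int = 8) -> np.ndarray:
    """Summarize one recording as a fixed-length voiceprint vector."""
    signal, sample_rate = librosa.load(audio_path, sr=16000)
    # 20-dimensional MFCC features, one vector per frame.
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=20).T
    # Fit a Gaussian mixture over the frames; its means characterize the speaker.
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(mfcc)
    return gmm.means_.flatten()
```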
In one embodiment, the content identification step includes: and extracting text content of the audio data according to the pre-trained semantic recognition model to obtain the content data.
In particular implementations, the semantic recognition model is trained to recognize the text content in the sound, i.e. to determine the content information in the user's speech.
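As a hedged sketch of the content recognition step, a third-party speech-to-text engine could be called as below; the embodiment does not prescribe a specific engine, so the SpeechRecognition package and the recognize_google call are purely illustrative assumptions.

```python
import speech_recognition as sr

def recognize_content(audio_path: str) -> str:
    """Transcribe one recording to text, e.g. "5342" in the example above."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)  # read the whole file
    return recognizer.recognize_google(audio, language="zh-CN")
```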
And S103, associating the voiceprint data with the content data according to the audio identification.
In specific implementation, the voiceprint data and the content data that carry the same audio identifier are associated. Data association means that related data items are linked, so that all linked data can be retrieved from any one item in the association.
In this embodiment, content recognition and voiceprint recognition are processed asynchronously. In general, for the same audio data, content recognition takes longer than voiceprint recognition. Taking sequential processing of several pieces of audio data A, B and C as an example: by the time content recognition of audio data A is completed, voiceprint recognition of audio data A and of audio data B has already finished, and the voiceprint data of audio data A can be associated with the content data of audio data A according to the audio identifier corresponding to audio data A.
By asynchronous processing, voiceprint recognition and content recognition are executed in parallel, waiting time is reduced, and recognition efficiency of sound is improved.
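A minimal sketch of this asynchronous arrangement, reusing the hypothetical helpers from the earlier sketches, is shown below; the thread-pool mechanism is one possible realization and is an assumption, not the only way to execute the two recognition steps asynchronously.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_audio(audio_path: str, results: dict) -> None:
    with open(audio_path, "rb") as f:
        audio_id = generate_audio_identifier(f.read())  # see earlier sketch
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Voiceprint recognition and content recognition are submitted at the
        # same time and run in parallel worker threads.
        voiceprint_future = pool.submit(extract_voiceprint, audio_path)
        content_future = pool.submit(recognize_content, audio_path)
        # Each result is stored under the same audio identifier, which is what
        # later allows the two results to be associated with each other.
        results[audio_id] = {
            "voiceprint": voiceprint_future.result(),
            "content": content_future.result(),
        }
```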
In an embodiment, referring to fig. 2, fig. 2 is a schematic sub-flowchart of a voice recognition method according to an embodiment of the present invention. The above step S103 includes steps S201 to S202:
S201, judging whether a reference voiceprint feature matched with the voiceprint data exists in a preset voiceprint database.
In a specific implementation, the voiceprint features are data for distinguishing the voice of the user, the voiceprint features in the voiceprint database are sequentially acquired and compared with the voiceprint data, and if the similarity between the acquired voiceprint features and the voiceprint data is higher than a preset threshold (for example, 90%), the acquired voiceprint features are used as reference voiceprint features to execute subsequent steps.
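The matching step could be sketched as follows, using cosine similarity against an in-memory voiceprint database; the database layout (a mapping from identity to voiceprint vector) and the 0.9 threshold mirroring the 90% example are assumptions for illustration.

```python
import numpy as np

def find_reference_voiceprint(voiceprint: np.ndarray,
                              voiceprint_db: dict,
                              threshold: float = 0.9):
    """Return the identity of the best match above the threshold, else None."""
    best_identity, best_score = None, threshold
    for identity, reference in voiceprint_db.items():
        score = np.dot(voiceprint, reference) / (
            np.linalg.norm(voiceprint) * np.linalg.norm(reference))
        if score > best_score:
            best_identity, best_score = identity, score
    return best_identity  # None means no reference voiceprint feature exists
```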
And S202, if not, associating the voiceprint data with the content data according to the audio identification.
In specific implementation, the voiceprint features in the voiceprint database are acquired in turn and compared with the voiceprint data. If none of the acquired voiceprint features has a similarity with the voiceprint data higher than the preset threshold, i.e. no reference voiceprint feature exists, the voiceprint database does not store the voiceprint feature of the user corresponding to the audio data, and the voiceprint data obtained by voiceprint recognition of the audio data is associated with the content data according to the audio identifier.
In one embodiment, the step S202 includes the steps of: and storing the voiceprint data in the verification database.
In specific implementation, when step S202 determines that no acquired voiceprint feature has a similarity with the voiceprint data higher than the preset threshold, i.e. no reference voiceprint feature exists and the voiceprint database does not store the voiceprint feature of the user corresponding to the audio data, the voiceprint data obtained by voiceprint recognition of the audio data is stored in the verification database.
Take new-user voice registration as an example: account and password authentication first needs to be bound to a voiceprint feature. The user's voice is collected by an audio acquisition device such as a microphone to obtain audio data, and voiceprint recognition is performed on the audio data to obtain voiceprint data. If a reference voiceprint feature exists in the verification database, then after an account is allocated and an identity obtained, the content data, the reference voiceprint feature, the voiceprint data and the identity are associated; in this embodiment, the reference voiceprint feature refers to a voiceprint feature in the verification database that matches the voiceprint data but is not yet associated with an identity. If no reference voiceprint feature exists in the verification database, then after an account is allocated and an identity obtained, the content data, the voiceprint data and the identity are associated.
In an embodiment, after the above step S201, the method further includes the steps of: if yes, associating the reference voiceprint feature, the identity, the voiceprint data and the content data.
In specific implementation, the voiceprint features in the voiceprint database are acquired in turn and compared with the voiceprint data. If an acquired voiceprint feature has a similarity with the voiceprint data higher than the preset threshold, i.e. a reference voiceprint feature exists, the voiceprint database already stores the voiceprint feature of the user corresponding to the audio data, and the voiceprint data and content data obtained by recognizing the audio data are both associated with that user's reference voiceprint feature stored in the voiceprint database.
The embodiment of the invention can realize the following advantages:
by asynchronous processing, voiceprint recognition and content recognition are executed in parallel, so that the waiting time is reduced, the recognition time is shortened, and the recognition efficiency of voice is improved.
Asynchronous processing also prevents the failure of a single recognition step from invalidating the whole recognition process. For example: audio data D of a user is collected, and content recognition and voiceprint recognition are performed on it separately. If content recognition fails, the voiceprint data D1 successfully obtained by voiceprint recognition is stored in the database; optionally, audio data E of the user is collected again, content recognition and voiceprint recognition are performed on it separately, content recognition succeeds and yields content data E1, voiceprint recognition yields voiceprint data E2, and because E2 matches D1, the voiceprint data D1, the content data E1 and the voiceprint data E2 are associated. As another example: audio data F of a user is collected, content recognition and voiceprint recognition are performed on it separately, and the content data F1 successfully obtained by content recognition is stored in the database. If voiceprint recognition fails, another audio segment is intercepted from the audio data F for voiceprint recognition; the voiceprint data F2 successfully obtained is stored in the database, and the content data F1 and the voiceprint data F2 are associated. In this way the voice data collected each time is fully used, repeated collections enrich the voice feature library of the corresponding user, and the recognition accuracy improves.
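A hedged sketch of this failure-handling idea is given below: a result that was obtained successfully is kept, and only the failed step is retried on newly collected audio. The helper names reuse the earlier hypothetical sketches, and the retry loop is an assumed simplification.

```python
def recognize_with_partial_results(collect_audio) -> dict:
    """collect_audio is a callable that records the user and returns a file path."""
    voiceprint, content = None, None
    while voiceprint is None or content is None:
        audio_path = collect_audio()  # collect (or re-collect) the user's voice
        if voiceprint is None:
            try:
                voiceprint = extract_voiceprint(audio_path)
            except Exception:
                pass  # keep any content result already obtained; retry later
        if content is None:
            try:
                content = recognize_content(audio_path)
            except Exception:
                pass  # keep any voiceprint result already obtained; retry later
    return {"voiceprint": voiceprint, "content": content}
```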
These effects apply to scenarios that require voice recognition, for example: recording voice content without logging in; recording multi-user voice content; permission control of voice commands; a voice assistant shared by several people; a voice access control system; a voice authentication system; a voice reservation system; a voice conference recording system; and so on.
Take the voice conference recording system scenario as an example: users G, H and I each log in to the voice conference system, and the identity corresponding to each user is recorded in the system as G, H and I respectively. The system collects the speech of the three users' conversation in sequence and executes the above voice recognition method; suppose the speaking order is G, H, G. First, a first segment of audio is acquired and a first audio identifier is generated; voiceprint data G1 and content data G2 of the first segment are obtained; according to the first audio identifier it is determined that voiceprint data G1 matches identity G, and the first audio identifier, voiceprint data G1, content data G2 and identity G are associated. Second, a second segment of audio is acquired and a second audio identifier is generated; voiceprint data H1 and content data H2 of the second segment are obtained; according to the second audio identifier it is determined that voiceprint data H1 matches identity H, and the second audio identifier, voiceprint data H1, content data H2 and identity H are associated. Third, a third segment of audio is acquired and a third audio identifier is generated; voiceprint data G3 and content data G4 of the third segment are obtained; according to the third audio identifier it is determined that voiceprint data G3 matches identity G, and the third audio identifier, voiceprint data G3, content data G4 and identity G are associated. Since identity G is also associated with the first audio identifier, voiceprint data G1 and content data G2, the data associated with identity G now comprise the first audio identifier, voiceprint data G1, content data G2, the third audio identifier, voiceprint data G3 and content data G4. The voice conference recording system is thus realized by the voice recognition method: the recognized voiceprint data distinguishes the users, the recognized content data can be converted into text, the content recognition part and the voiceprint recognition and authentication part are processed in parallel to speed up recognition, and finally the content data and voiceprint data are associated one by one, improving the recognition efficiency of voice.
Referring to fig. 3, the embodiment of the present invention further provides a voice recognition apparatus 600, which includes an acquisition unit 601, a recognition unit 602, and an association unit 603.
The acquiring unit 601 is configured to acquire audio data, generate an audio identifier of the audio data, where the audio data is acquired by a preset audio acquisition device through voice acquisition of a user.
In an embodiment, the obtaining the audio data, generating the audio identifier of the audio data, includes:
and generating a unique identifier of the audio data according to a preset abstract algorithm, and taking the unique identifier of the audio data as the audio identifier.
And an identifying unit 602, configured to obtain voiceprint data of the audio data, and obtain content data of the audio data, where the voiceprint data is associated with the audio identifier, the content data is associated with the audio identifier, the voiceprint data is obtained by voiceprint identifying the audio data, the content data is obtained by content identifying the audio data, and the voiceprint identifying and the content identifying are respectively performed in an asynchronous manner.
In an embodiment, the voiceprint recognition of the audio data includes:
and extracting voiceprint characteristics of the audio data according to a pre-trained voiceprint recognition model to obtain the voiceprint data.
In an embodiment, the content recognition of the audio data includes:
and extracting text content of the audio data according to the pre-trained semantic recognition model to obtain the content data.
And the associating unit 603 is configured to associate the voiceprint data with the content data according to the audio identifier.
In an embodiment, said associating said voiceprint data with said content data according to said audio identification comprises:
judging whether a reference voiceprint feature matched with the voiceprint data exists in a preset voiceprint database or not;
if not, associating the voiceprint data with the content data according to the audio identification.
In an embodiment, the reference voiceprint feature is associated with an identity, and after the judging whether the reference voiceprint feature matched with the voiceprint data exists in the preset verification database, the method further includes:
if yes, associating the reference voiceprint feature, the identity, the voiceprint data and the content data.
In an embodiment, said associating said voiceprint data with said content data according to said audio identification comprises:
and storing the voiceprint data in the verification database.
As shown in fig. 4, fig. 4 is a schematic block diagram of a computer device provided in an embodiment of the present application. The computer device 500 may be a terminal or a server, where the terminal may be an electronic device with a communication function, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device. The server may be an independent server or a server cluster formed by a plurality of servers.
The computer device 500 includes a processor 502, a memory, and a network interface 505, connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform a voice recognition method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of a computer program 5032 in the non-volatile storage medium 503, which computer program 5032, when executed by the processor 502, causes the processor 502 to perform a voice recognition method.
The network interface 505 is used for network communication with other devices. It will be appreciated by those skilled in the art that the foregoing structures, which are merely block diagrams of portions of structures related to the present application, are not limiting of the computer device 500 to which the present application may be applied, and that a particular computer device 500 may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
It should be appreciated that in embodiments of the present application, the processor 502 may be a central processing unit (Central Processing Unit, CPU); the processor 502 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or any conventional processor.
Those skilled in the art will appreciate that all or part of the flow in a method embodying the above described embodiments may be accomplished by computer programs instructing the relevant hardware. The computer program may be stored in a storage medium that is a computer readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program.
The storage medium is a physical, non-transitory storage medium, and may be, for example, a U-disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk. The computer readable storage medium may be nonvolatile or may be volatile.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit may be stored in a storage medium if implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A method of voice recognition, the method comprising:
acquiring audio data, and generating an audio identifier of the audio data, wherein the audio data is acquired by a preset audio acquisition device through voice acquisition of a user;
obtaining voiceprint data of the audio data, and obtaining content data of the audio data, wherein the voiceprint data is associated with the audio identifier, the content data is associated with the audio identifier, the voiceprint data is obtained by voiceprint recognition of the audio data, the content data is obtained by content recognition of the audio data, and the voiceprint recognition and the content recognition are respectively executed in an asynchronous mode;
and associating the voiceprint data with the content data according to the audio identification.
2. The method of claim 1, wherein the obtaining audio data, generating an audio identification of the audio data, comprises:
and generating a unique identifier of the audio data according to a preset abstract algorithm, and taking the unique identifier of the audio data as the audio identifier.
3. The method of claim 1, wherein the voiceprint recognition of the audio data comprises:
and extracting voiceprint characteristics of the audio data according to a pre-trained voiceprint recognition model to obtain the voiceprint data.
4. A method according to claim 3, wherein the content recognition of the audio data comprises:
and extracting text content of the audio data according to the pre-trained semantic recognition model to obtain the content data.
5. The method of claim 4, wherein said associating said voiceprint data with said content data in accordance with said audio identification comprises:
judging whether a reference voiceprint feature matched with the voiceprint data exists in a preset voiceprint database or not;
if not, associating the voiceprint data with the content data according to the audio identification.
6. The method according to claim 5, wherein the reference voiceprint feature is associated with an identity, and after the determining whether the reference voiceprint feature matched with the voiceprint data exists in the preset verification database, the method further comprises:
if yes, associating the reference voiceprint feature, the identity, the voiceprint data and the content data.
7. The method of claim 5, wherein said associating said voiceprint data with said content data in accordance with said audio identification comprises:
and storing the voiceprint data in the verification database.
8. A voice recognition device comprising means for performing the method of any one of claims 1-7.
9. The computer equipment is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the method of any one of claims 1-7 when executing a program stored on a memory.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-7.
CN202311385869.8A 2023-10-24 2023-10-24 Voice recognition method, device, equipment and medium Pending CN117334201A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311385869.8A CN117334201A (en) 2023-10-24 2023-10-24 Voice recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311385869.8A CN117334201A (en) 2023-10-24 2023-10-24 Voice recognition method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117334201A true CN117334201A (en) 2024-01-02

Family

ID=89277176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311385869.8A Pending CN117334201A (en) 2023-10-24 2023-10-24 Voice recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117334201A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination